Scale and Performance Suggestions for High Usage Websites

I work at the BBC, where scale and performance are a big deal. After talking to Mark Gledhill (one of the Senior Software Engineers) and Andy Macinnes (head of ops) I put together some notes on scale and performance. It isn’t rocket science; just good common sense.

The general message is to avoid unnecessary repetition. If you have a scalability/performance problem then you could explore …

Tweak Server settings

Major example: increase db connections

Caching

The aim of caching is to reduce hits on the server.

You can cache at different levels:

code. e.g. store result of a calculation
file. e.g. news feed updated every 10 minutes
template. i.e. common part of the page
page.
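
As a rough illustration of the code and file levels, here is a minimal Python sketch of a time-limited cache, in the spirit of a news feed that is refreshed every 10 minutes. The names `TTLCache` and `build_feed` are hypothetical and only exist for this example; it is not how the BBC implements caching.

```python
import time


class TTLCache:
    """Cache one value per key and recompute it once it is older than ttl seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the example can run without sleeping
        self._store = {}        # key -> (expiry_time, value)

    def get(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]     # still fresh: no hit on the backend
        value = compute()       # expired or missing: do the expensive work once
        self._store[key] = (now + self.ttl, value)
        return value


calls = {"count": 0}


def build_feed():
    # Hypothetical stand-in for an expensive operation such as building a news feed.
    calls["count"] += 1
    return f"feed v{calls['count']}"


fake_now = [0.0]
cache = TTLCache(ttl_seconds=600, clock=lambda: fake_now[0])  # 600 s = 10 minutes

first = cache.get("news", build_feed)
second = cache.get("news", build_feed)   # served from the cache
fake_now[0] = 601.0                      # pretend just over 10 minutes have passed
third = cache.get("news", build_feed)    # recomputed
```

However many requests arrive within the 10 minutes, the expensive function only runs once per period.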

You can flatten dynamic pages to enable caching. Until recently almost all BBC pages were flat html served statically through a publishing queue. It is clunky but serves pages fast because the pages can be heavily cached. Only recently have we moved to dynamic publishing; this has only been possible by buying lots of kit.

You must remember to release the cache. This is complex when using several layers of caching. Poor caching can cause a memory leak if the cache is not released.
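
One way to stop a cache growing without limit is to cap its size and release it explicitly. The sketch below uses Python's standard `functools.lru_cache`; the function `render_fragment` and the size limit of 2 are assumptions chosen to keep the example small.

```python
from functools import lru_cache


@lru_cache(maxsize=2)   # keep at most 2 results; the least recently used is evicted
def render_fragment(name):
    # Hypothetical stand-in for an expensive template render.
    return f"<div>{name}</div>"


render_fragment("header")
render_fragment("footer")
render_fragment("header")     # cache hit
render_fragment("sidebar")    # cache is full, so "footer" is evicted

info = render_fragment.cache_info()

render_fragment.cache_clear()   # explicit release, e.g. after content is republished
cleared = render_fragment.cache_info()
```

With several layers of caching, each layer needs its own equivalent of `cache_clear()`, which is where the complexity comes from.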

Database

You need fast access to the database, so …

keep db connections open, e.g. each apache server could have one db connection open at all times
index what you’re interested in
clear out expired info
Possibly replicate DB for redundancy
Cache queries
Optimise queries
- focus on writes (writes take 1000 times longer than reads)
- but reads can also be problematic if spanning multiple tables
- can aggregate data from normalised tables into a new table to query against. Requires
  synchronising the various tables on update
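
To make the last point concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for a real database. The tables `posts` and `forum_stats` and the function `add_post` are hypothetical. Each write updates the aggregate table in the same transaction, so reads can query the small table instead of counting rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # one connection, opened once and reused
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, forum_id INTEGER, body TEXT);
    CREATE INDEX idx_posts_forum ON posts (forum_id);   -- index what you query on
    CREATE TABLE forum_stats (forum_id INTEGER PRIMARY KEY, post_count INTEGER NOT NULL);
""")


def add_post(forum_id, body):
    """Write the post and keep the aggregate table in sync in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO posts (forum_id, body) VALUES (?, ?)", (forum_id, body))
        conn.execute(
            "INSERT OR IGNORE INTO forum_stats (forum_id, post_count) VALUES (?, 0)",
            (forum_id,),
        )
        conn.execute(
            "UPDATE forum_stats SET post_count = post_count + 1 WHERE forum_id = ?",
            (forum_id,),
        )


for forum, text in [(1, "a"), (1, "b"), (2, "c")]:
    add_post(forum, text)

# Cheap read against the aggregate table instead of COUNT(*) over posts.
count_fast = conn.execute(
    "SELECT post_count FROM forum_stats WHERE forum_id = 1"
).fetchone()[0]
count_slow = conn.execute(
    "SELECT COUNT(*) FROM posts WHERE forum_id = 1"
).fetchone()[0]
```

The trade-off is that every write now does more work, which matters if writes are already the bottleneck.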

Load Testing

For load testing to be effective you must simulate live conditions and hit scripts with millions of requests. The testers used Avalanche for this but the developers use ApacheBench (ab).

Language

You must understand what simple programming statements are doing behind the scenes, e.g. db queries in Ruby on Rails.
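
A common case is a loop that looks harmless but issues one query per item. The sketch below is not Rails; it uses Python's built-in `sqlite3` with hypothetical `authors` and `articles` tables and a small `query` helper that counts statements, to show the kind of difference an ORM can hide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE articles (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO articles VALUES (1, 1, 'A1'), (2, 2, 'B1'), (3, 1, 'A2');
""")

query_count = 0


def query(sql, params=()):
    """Run a SELECT and count it, so the cost of each approach is visible."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def titles_with_authors_naive():
    # One query for the articles, then one more per article (the "N+1" pattern).
    rows = query("SELECT author_id, title FROM articles ORDER BY id")
    return [
        (title, query("SELECT name FROM authors WHERE id = ?", (author_id,))[0][0])
        for author_id, title in rows
    ]


def titles_with_authors_joined():
    # Same result in a single query.
    return query(
        "SELECT a.title, u.name FROM articles a "
        "JOIN authors u ON u.id = a.author_id ORDER BY a.id"
    )


query_count = 0
naive = titles_with_authors_naive()
naive_queries = query_count      # 1 + number of articles

query_count = 0
joined = titles_with_authors_joined()
joined_queries = query_count     # 1
```

Both functions return the same rows, but the naive version's query count grows with the number of articles.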

Page assembly

At the BBC page assembly is handled by specialists variously called Client Side Developers (CSDs), Front End Developers or Web Developers.

They have to:

reduce number of files used to assemble page
reduce page weight
take care of SSIs that have
- conditional logic
- variables which can hit a performance threshold
User work flow
- change user experience to reduce hits on server
- An example of this was the 606 forum where an editorial change eliminated a major
  performance problem without technical input.

Load Balancing

Spread load
BBC was using “DNS round robin”, where a user is restricted to a particular server for 5 minutes and then allocated a new server. Because one ISP has one DNS server, the ISP and all its users go to one random server for 5 minutes. A major problem is that you can’t take a machine out.
Now requests go to servers that are up
Track trends and look at the graphs over time
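
The difference between plain round robin and sending requests only to servers that are up can be sketched in a few lines of Python. The class `RoundRobinBalancer` and the host names are hypothetical; a real load balancer would also run health checks to decide which servers are up.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that are marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.up = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.up.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.up.add(server)

    def next_server(self):
        # Try each server at most once per call so we never loop forever.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.up:
                return candidate
        raise RuntimeError("no servers are up")


lb = RoundRobinBalancer(["web1", "web2", "web3"])   # hypothetical host names
before = [lb.next_server() for _ in range(3)]
lb.mark_down("web2")                                # take a machine out
after = [lb.next_server() for _ in range(4)]
```

Once `web2` is marked down, no request is sent to it, which is exactly what DNS round robin could not guarantee.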

Multi-casting

Multi-casting means that, instead of streaming directly to users, we stream to ISPs who stream to users.
This takes the load off our servers.

Get more capacity

For example:

Buy more kit in the hosting environment
Buy more streaming capability from your Content Distribution Network (CDN), e.g. Akamai
Use existing kit differently. For example if you can predict the peaks then temporarily borrow kit from other areas for the peak and then hand it back afterwards.

Standards

The BBC covers scalability issues in one of its development standards. This is so the organisation, and new developers, can benefit from past lessons.

Rough overview of the BBC set up

The BBC has impressive stats:

page traffic doubles every year
Aim to serve every page in < 1.3 seconds
- this isn’t possible on all pages. For example the old homepage took ~ 5 seconds.
serve 13 GB data/sec (peak over 20 GB)
6 billion page impressions / month
performance varies a lot
- Media Selector 120 requests / sec
- SSO 20 requests / sec
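
To put the monthly figure into per-second terms, here is a back-of-the-envelope calculation in Python. It assumes a 30-day month and perfectly even traffic, which real traffic is not, and the projection simply assumes the yearly doubling continues.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60        # assumes a 30-day month
page_impressions_per_month = 6_000_000_000

# Average rate if traffic were spread evenly; real peaks will be much higher.
average_per_second = page_impressions_per_month / SECONDS_PER_MONTH


def projected_traffic(current, years):
    """Traffic after a number of years if it doubles every year."""
    return current * 2 ** years


in_three_years = projected_traffic(page_impressions_per_month, 3)
```

That works out at roughly 2,300 page impressions per second on average, and 48 billion a month after three more doublings.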

Broadly speaking both the old and new platforms are a four tier architecture:

presentation layer
custom applications e.g. message boards, celebdaq
platform, e.g. DNA, SSO, Postcoder, Search
DB

Old Platform

– 40-50 web servers

– 20-30 app servers

– each platform replicated on many servers, e.g. SSO on 7

– Solaris 9/10

– Apache 2 on web servers

– Apache 1.3 + modperl on app servers

– predominantly serve static pages (hence borg)

New Platform

– for dynamic serving

– Linux (HP blades)

– 160 app servers

It's a Delivery Thing

Steven Thomas on the art of leading software development teams, projects and programmes