On Availability, Scalability and Reliability

Listened to Michael Krigsman’s podcast with Rod Boothby of Joyent about reliability/scalability – a fairly hot topic in the blogosphere these days with the whole Twitter up, Twitter down saga and recently with Amazon (about 00:35 in the podcast).

Around 01:20 Rod tells us that reliability is a known computer science problem with known solutions. And then at 01:27 he tells us that developing something that scales takes only a simple architecture. But all he offers in terms of architecture is that different parts of your site should be correctly siloed, i.e. split off from each other. For example, instead of having all your code under http://www.mysite.com, you should put your uploads section under uploads.mysite.com, your messaging piece under messaging.mysite.com, and so on.
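To make the siloing idea concrete, here is a minimal sketch of what it amounts to: each subdomain gets its own backend pool, so load on one service doesn’t starve the others. The subdomain names and pool sizes below are illustrative, not anything from the podcast.

```python
# Each silo (subdomain) has its own independent pool of backend servers.
# Overload on uploads.mysite.com then can't take down messaging.mysite.com.
SILOS = {
    "www.mysite.com": ["web-1", "web-2"],
    "uploads.mysite.com": ["upload-1"],
    "messaging.mysite.com": ["msg-1", "msg-2", "msg-3"],
}

def backend_for(host: str, request_id: int) -> str:
    """Pick a backend for a request's Host header (simple round-robin)."""
    pool = SILOS.get(host, SILOS["www.mysite.com"])
    return pool[request_id % len(pool)]
```

In practice this dispatch lives in DNS and your load balancer config rather than application code, but the principle is the same: independent pools per silo.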

Michael asks a great question at 03:27: If it’s so easy why don’t more people do it properly?

Rod’s answer is that it all comes down to load balancing.

I’ve got a different answer. My answer is that the reason more people don’t do it is that in reality there’s a whole load more to availability, scalability and reliability than just load balancing across different subdomains.

Joyent is only talking about one tiny piece of the big availability/reliability/scalability issue. And Rod is right that what he’s talking about is pretty easy to do. But it also has next to nothing to do with what usually makes complex sites struggle.

Usually the bottleneck is not the network or the bandwidth or the load balancing or the processing power of the web servers (these are all pretty easy to address). Usually the issue is the database itself. Partitioning the database into different silos (sharding) is VERY hard to do. It’s also very hard to predict early on what level of sharding you will need a few years down the line.
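To illustrate why sharding is so hard to get right after the fact, here is a toy sketch (my own example, not from the podcast) of the common hash-modulo scheme. The pain point is that changing the shard count remaps most keys, which means migrating most of your data:

```python
import hashlib

def shard_for(user_id: str, n_shards: int) -> int:
    """Map a key to a shard with a stable hash (toy modulo scheme)."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return h % n_shards

# The hard part: growing from 4 to 5 shards remaps roughly 4 out of
# every 5 keys, so most rows have to move to a different database.
moved = sum(
    shard_for(f"user{i}", 4) != shard_for(f"user{i}", 5)
    for i in range(10_000)
)
print(f"{moved / 10_000:.0%} of keys change shard")
```

Schemes like consistent hashing reduce the churn, but none of them remove the need to physically move data between databases, and none of them help you pick the right silo boundaries years in advance.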

An example: when I first came to TradingPartners I inherited a system with 4 distinct databases. It had been designed this way to silo out different kinds of data. But it happened that 80–90% of the database calls were to one database. Bad design? No, simply that the original architect didn’t have a crystal ball to see where the application was headed. My scalability issue has always been the database. Everything else is comparatively easy.

There is some great material on scaling at http://highscalability.com. And if you don’t believe me that there’s more to scaling than load balancing, then check out this great presentation from the LiveJournal/Danga people. There is a good Venn diagram on page 13 (and on page 14).

