On Availability, Scalability and Reliability

Listened to Michael Krigsman’s podcast with Rod Boothby of Joyent about reliability/scalability – a fairly hot topic in the blogosphere these days with the whole Twitter-up, Twitter-down saga and more recently with Amazon (from about 00:35 in the podcast).

Around 01:20 Rod tells us that reliability is a known computer science problem with known solutions. Then at 01:27 he tells us that building something that scales is simply a matter of architecture. But all he offers in terms of architecture is that the different parts of your site should be correctly siloed, i.e. split off from each other. For example, instead of having all your code under http://www.mysite.com, you should put your uploads section under uploads.mysite.com, your messaging piece under messaging.mysite.com, and so on.
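
To make the siloing idea concrete, here’s a minimal Python sketch – entirely my own illustration, not anything from the podcast, and the hostnames, pool names and addresses are made up. Each subdomain is a silo with its own pool of backends, so one silo can be scaled and load balanced without touching the others:

    import itertools

    # Hypothetical backend pools, one per silo (subdomain).
    POOLS = {
        "www.mysite.com":       itertools.cycle(["10.0.1.1", "10.0.1.2"]),
        "uploads.mysite.com":   itertools.cycle(["10.0.2.1", "10.0.2.2", "10.0.2.3"]),
        "messaging.mysite.com": itertools.cycle(["10.0.3.1"]),
    }

    def pick_backend(host):
        """Round-robin over the backend pool for the requested subdomain."""
        pool = POOLS.get(host, POOLS["www.mysite.com"])
        return next(pool)

    if __name__ == "__main__":
        for host in ("uploads.mysite.com", "uploads.mysite.com", "messaging.mysite.com"):
            print(host, "->", pick_backend(host))

In practice this routing usually lives in DNS or at the load balancer rather than in application code, but the principle is the same – and, as Rod says, this part really is easy.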

Michael asks a great question at 03:27: If it’s so easy, why don’t more people do it properly?

Rod’s answer is that it all comes down to load balancing.

I’ve got a different answer: the reason more people don’t do it is that, in reality, there’s a whole load more to availability, scalability and reliability than just load balancing across different subdomains.

Joyent is only talking about one tiny piece of the big availability/reliability/scalability issue. And Rod is right that what he’s talking about is pretty easy to do. But it also has next to nothing to do with what actually makes complex sites struggle.

Usually the bottleneck is not the network or the bandwidth or the load balancing or the processing power of the web servers (these are all pretty easy to address). Usually the issue is the database itself. Partitioning the database into different silos (sharding) is VERY hard to do. It’s also very hard to predict early on what level of sharding you will need a few years down the line.
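
To see why it’s hard, here’s a rough Python sketch of the most common approach – choosing a shard as a pure function of a key. The names and numbers are mine and purely illustrative, but the catch is general: the shard count gets baked into where every row lives, so changing it later means moving data, not just code:

    NUM_SHARDS = 4  # hypothetical figure, chosen up front

    def shard_for(user_id, num_shards=NUM_SHARDS):
        """Map a user id to one of the user-database shards."""
        return "userdb_%d" % (user_id % num_shards)

    if __name__ == "__main__":
        # The painful part: bump the shard count and most keys map to a different
        # shard, so the data itself has to be migrated.
        moved = sum(1 for uid in range(10000) if shard_for(uid, 4) != shard_for(uid, 8))
        print("%d of 10000 users change shard when going from 4 to 8 shards" % moved)

Schemes like consistent hashing reduce how much data has to move, but they don’t make the operational problem go away – which is exactly why it’s so hard to predict the right level of sharding up front.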

An example: when I first came to TradingPartners I inherited a system with 4 distinct databases. It had been designed this way to silo out different kinds of data. But it turned out that 80–90% of the database calls were to one database. Bad design? No, simply that the original architect didn’t have a crystal ball to see where the application was headed. My scalability issue has always been the database. Everything else is comparatively easy.

There is some great material on scaling at http://highscalability.com. And if you don’t believe me that there’s more to scaling than load balancing, check out this great presentation from the LiveJournal/Danga people – there is a good Venn diagram on page 13 (and another on page 14).

On Twitter

What better way to kick off a tech blog than to talk about Twitter? The site we love and hate and love to hate.

I’m on Twitter now (follow me) – have been for a comparatively short while – and I enjoy it. The initial impetus was so that I could easily keep my wife updated during the working day while I’m abroad, but it has quickly given me way more than that. I’ve already connected with a handful of like-minded people, which is great. I only tweet about once a day or so.

Regarding the scaling – amongst all the hyperbole, people seem to forget that this stuff is quite hard to do in real life. And you always need to be addressing your bottleneck (get one bottleneck out of the way and another will turn up to bite you; it’s a basic law of software). And without being on the inside you have no way of knowing what is really going on. And, dare I say it, how confident someone is about their proposed software architecture bears little or no relationship to how well it will work in real life.