Failure at scale
Published on 2011-02-04 14:01:16 +0000
When you launch a high-profile website, it will sometimes fail spectacularly for reasons of scale. Since this is an area of professional interest, I thought I'd have a look to see whether anything was obviously wrong, and it quickly became apparent that the developers hadn't thought at scale (and still haven't fixed the issues).
When I visit the crime maps website, the very first thing I see is a landing page where I can type in my postcode. The HTML generated for this page must be the same for every user, so it should be cached, right? Unfortunately, not only is this page uncached, it doesn't return any cache headers at all: no Cache-Control, Last-Modified or Expires headers of any form. The static files for the page do return a Last-Modified header, but I suspect that's only because they are hosted on S3. Also of note: the developers didn't follow any of the standard website performance rules; there is no gzip compression, and there are multiple CSS/JS files rather than a single combined file.
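To illustrate what's missing, here's a minimal sketch of the headers a static landing page like this could return so that browsers and shared caches can store it. This is standard HTTP caching, not anything from the site's actual codebase; the function name and defaults are mine (gzip would be handled separately by the web server's compression setting).

```python
import time
from email.utils import formatdate

# Hypothetical helper: headers that would make the landing page
# cacheable. The one-hour max-age is an illustrative choice.
def cacheable_headers(max_age=3600, last_modified_ts=None):
    now = time.time()
    return {
        # Allow any cache (browser, proxy, CDN) to keep it for an hour
        "Cache-Control": "public, max-age=%d" % max_age,
        # For older HTTP/1.0 caches that only understand Expires
        "Expires": formatdate(now + max_age, usegmt=True),
        # Lets clients revalidate cheaply with If-Modified-Since
        "Last-Modified": formatdate(last_modified_ts or now, usegmt=True),
    }
```

With headers like these, four or five reverse proxies in front of the application could absorb almost all of the landing-page traffic.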
In this case a cache wouldn't help with scale across multiple users, but it would help returning users, because the developers specify the area of interest by a long/lat accurate to 7 decimal places, an accuracy of around a centimeter (4 places gives you accuracy of 10 meters or so). Since they've merely geolocated my postcode, accuracy that fine is completely wasted and would defeat any cache that is there. My postcode is probably a hundred meters or so in size, so a long/lat more accurate than 2 or 3 decimal places is wasted precision. Decreasing the accuracy of the center point could give you slightly less accurate data, but since what you are interested in is crimes within a 0.5 mile radius, you could simply increase the radius a bit and drop the outlying results in the client's browser, making for a much more scalable backend server.
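The arithmetic behind those precision claims, plus the rounding idea, can be sketched in a few lines. The cache key format here is hypothetical; the point is that all users whose postcodes geolocate into the same rounded cell share one cached result.

```python
# Rough size, in meters, of one unit in the last decimal place of a
# latitude (1 degree of latitude is about 111.32 km; longitude cells
# shrink with cos(latitude), so this is an upper bound).
def decimal_place_meters(places):
    return 111_320 / (10 ** places)

# Hypothetical cache key: round coordinates so everyone in the same
# ~100 m cell gets the same key, and hence the same cached response.
def cache_key(lat, lng, places=3):
    return "crimes:%s:%s" % (round(lat, places), round(lng, places))

print(decimal_place_meters(7))  # ~0.011 m: centimeter precision
print(decimal_place_meters(3))  # ~111 m: about the size of a postcode
```

Two nearby points such as (51.5074123, -0.1277654) and (51.5070001, -0.1276999) round to the same key, so a cache in front of the backend can serve both requests with one stored result.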
At this point in the analysis I got too depressed to continue; it was obvious that the developers had given very little thought to how this website would scale. I suspect that since it is hosted on Amazon's EC2 they figured they would scale by firing up new servers, but I contend that even at 18m hits an hour (approx 5k a second) about 4 or 5 caches with good hit rates would have been able to serve most of the traffic without too much issue (I'd probably also have suggested warming the cache with data for the major metropolises).
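That cache-warming suggestion is simple to sketch: before launch, request the result pages for the densest areas so the first wave of real visitors hits a warm cache. The city list, key format and `fetch_crime_page` stand-in below are all hypothetical.

```python
# Illustrative coordinates for the largest UK cities, rounded to the
# 3-decimal-place precision the cache keys would use.
MAJOR_CITIES = {
    "London": (51.507, -0.128),
    "Birmingham": (52.486, -1.890),
    "Manchester": (53.481, -2.242),
}

# fetch_crime_page is a stand-in for whatever call renders a result
# page; cache is any dict-like store (memcached, etc. in practice).
def warm_cache(cache, fetch_crime_page):
    for name, (lat, lng) in MAJOR_CITIES.items():
        key = "crimes:%s:%s" % (lat, lng)
        if key not in cache:
            cache[key] = fetch_crime_page(lat, lng)
```

Run once at deploy time, this turns the launch-day stampede on the big cities into cache hits rather than backend queries.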
Remember kids, EC2 allows you to add new machines easily, but you still need to think about scale and performance when writing your application. EC2 doesn't solve your scaling problems, it solves your capacity planning problems; you still need to build a site that scales without needing thousands of servers.