Reddit Went Down: Blame Amazon, the Cloud or Both?
Reddit went down for six hours early Friday morning, and it looked as bad as any service does when millions of visitors suddenly can’t reach their beloved community.
It’s not a good thing. But according to one former Reddit employee, who left Reddit for Hipmunk last week, problems with Amazon Web Services (AWS) have been going on for months. “Ketralnis” writes that in the past year the issues have even been escalated to the office of the CIO. Ketralnis is the user name of David King.
In the comments on the post, people question whether Amazon Web Services is the right provider for the service. And in hindsight, King says in a comment, Reddit should have moved off Amazon last fall.
Worst of all, community members are worried that the outages will continue. That’s a bad place to be. No service wants to have this kind of problem.
The debate, much to King’s lament, is now becoming a conversation about the cloud. Is that far-fetched? It is a bit, but it does surface some issues to consider about broad services such as AWS.
Reddit user knowitistrue writes:
IT guy here. Specifically I am a data storage/data center specialist. It pains me to see the “cloud” illusion come crashing down on a great product like Reddit.
What also strikes me in this whole situation is how squeezed these guys seem to be on budget. I’ve worked for very small companies that own a SAN that could easily handle reddit that they purchased for less than $50,000. True, SANs can be very expensive (in the millions), but a good one with enough storage can be had for a good price in today’s market.
For the uninitiated, a SAN is a Storage Area Network. It’s essentially mirrored RAM in front of a shit load of disks (in every way redundant down to the power supplies). Nice SANs are usually fibre channel connected and optimized to be super reliable and redundant.
How is this different from Amazon? Amazon is a “cloud” service. This means that what Reddit is seeing as disks are actually abstractions sitting on top of a layer of code Amazon has created above a physical SAN to allow for growing/shrinking of resources, general “cloudiness” and ultimately to allow Amazon to charge for every resource, be it storage or compute time.
It’s no secret among most IT folks that the cloud really isn’t cheaper than rolling your own infrastructure for reasons exactly like this.
Reddit, if you ever need consulting, I’m available.
Reddit gives a more diplomatic perspective in its blog post:
Amazon’s Elastic Block Service is an extremely handy technology. It allows us to spin up volumes and attach them to any of our systems very quickly. It allows us to migrate data from one cluster to another very quickly. It is also considerably cheaper than getting a similar level of technology out of a SAN.
Unfortunately, EBS also has reliability issues. Even before the serious outage last night, we suffered random disks degrading multiple times a week. While we do have protections in place to mitigate latency on a small set of disks by using raid-0 stripes, the frequency of degradation has become highly unpalatable. To Amazon’s credit, they are working very closely with us to try and determine the root cause of the problem and implement a fix.
Over the course of the past few weeks, we have been working to completely move Cassandra off of EBS and onto the local storage which is directly attached to the EC2 instances. This move will be executed within the month. While the local storage has much less functionality than EBS, the reliability of local storage outweighs the benefits of EBS. After the outage today, we are going to be investigating doing the same for our Postgres clusters.
Amazon Web Services could not be reached for comment.
What do you think?