Foursquare Explains Yesterday’s 11 Hour Outage: An Overloading Of Database Shards
As you may have noticed yesterday, Foursquare was down. Very down. In fact, the service says total downtime was around 11 hours. That’s not good. And they know it. So they wrote a post on their blog today apologizing, explaining what happened, and saying what they’re going to do differently to prevent it from happening again.
The “what happened” part is fairly technical. But basically it boils down to this: Foursquare data is supposed to be spread evenly over different database “shards” (think of it as database segments). At some point yesterday morning, things got uneven, with one shard getting way more data than the others. They tried to balance it out, but that didn’t work, so they tried putting a new shard in play. Then all hell broke loose.
Foursquare says they’re not exactly sure why introducing a new shard caused a total site failure — but it did. They then spent the next several hours still attempting to even out the data, but couldn’t figure out a good way to do that without keeping everything down. Eventually, they had to basically re-do the entire problematic shard, which took hours.
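For the curious, here is a rough idea of what "uneven" shards look like in practice. This is only a minimal sketch, not Foursquare's actual setup: the modulo-based `shard_for` mapping, the 1.5x threshold, and all function names are assumptions made up for illustration. It counts how many records land on each shard and flags any shard carrying far more than the average.

```python
def shard_for(user_id: int, num_shards: int) -> int:
    """Map a user ID to a shard with simple modulo (hypothetical scheme)."""
    return user_id % num_shards


def shard_loads(record_owner_ids, num_shards):
    """Count how many records land on each shard, one owner ID per record."""
    loads = [0] * num_shards
    for user_id in record_owner_ids:
        loads[shard_for(user_id, num_shards)] += 1
    return loads


def imbalance_ratio(loads):
    """Busiest shard's load divided by the average load (1.0 means even)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


def overloaded_shards(loads, threshold=1.5):
    """Indexes of shards holding more than `threshold` times the average.

    The 1.5 default is an arbitrary illustrative cutoff.
    """
    mean = sum(loads) / len(loads)
    return [i for i, load in enumerate(loads) if mean and load > threshold * mean]


if __name__ == "__main__":
    # One very active user (ID 0) plus three light users, spread over 4 shards.
    records = [0] * 6 + [1, 2, 3]
    loads = shard_loads(records, num_shards=4)
    print("records per shard:", loads)
    print("overloaded shards:", overloaded_shards(loads))
```

The point of the toy example: even when users are spread evenly across shards, a handful of heavy users can leave one shard holding far more data than its neighbors, which is the kind of skew a monitoring check like this is meant to catch early.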
The good news is that despite all of this, Foursquare promises that no data was lost.
In the short term, Foursquare says this shard fix will keep the problem from happening again anytime soon. Longer term, they’re making bigger architecture changes to ensure it never happens again. They’re also looking into better safeguards so that even if there is a problem, Foursquare can stay up during it.
Obviously, all of this sounds fairly reminiscent of the early days of Twitter, when the service was unable to reliably stay up. Foursquare hasn’t had issues on that level yet, but as they continue to grow, they could get there without major changes. So it’s good to see they’re thinking ahead on this stuff.
The team (which is now 32 people) also promises to communicate better in the future when downtime and/or errors occur. They’ve created a new Status blog just for that.