Why Twitter should open up about its infrastructure

Whatever investments Twitter is making to improve the reliability of its system aren’t working, or at least not as well as they should be. The world’s favorite micro-blogging site blamed Thursday morning’s approximately two-hour outage on problems within its data centers — specifically the near-simultaneous failure of its running system and its backup system — and it’s the second time in less than two months that Twitter’s infrastructure has brought the site down. Maybe it’s time for Twitter to talk openly about what it’s doing in there.

Don’t get me wrong, Twitter has been nothing if not generous in talking about the software it builds. The company has open sourced numerous data-management tools and other pieces of code. It has occasionally (at least in 2009) been willing to share how it handles, stores, searches, and analyzes billions of data points relating to users and their tweets. But when it comes to the actual infrastructure on which this software runs?

Let’s just say Twitter is less than forthcoming. Here is its explanation of today’s outage:

The cause of today’s outage came from within our data centers. Data centers are designed to be redundant: when one system fails (as everything does at one time or another), a parallel system takes over. What was noteworthy about today’s outage was the coincidental failure of two parallel systems at nearly the same time.

It was minimally more descriptive in explaining the cascading bug that took the site down temporarily last month.

And after publicly detailing the migration into its new data center last March, Twitter has been mysteriously mum about where it’s actually running. In Salt Lake City? In Sacramento? When asked last June where its data centers are located, a Twitter spokesperson responded: “I can also confirm that we have multiple sites, but I won’t go into further detail.” Apparently, it now has space in Atlanta, too.

If Twitter wants to remain opaque about its practices, that’s fine — but it shouldn’t expect any slack from upset users or investors. By contrast, we have a pretty good idea where Google’s data centers are and what’s going on inside them, and we know nearly everything about Facebook’s operations. When Amazon Web Services has an outage, it might take days, but the company provides a detailed post-mortem report explaining what went wrong.

Even if Twitter’s infrastructure team is filled with very smart engineers, there’s certainly benefit to be derived from public discussion about what it might be doing right and wrong. Clearly, something isn’t right; the site is down too often considering how much smaller it is than the aforementioned services. While a Twitter outage isn’t disruptive enough to anyone’s business to warrant an AWS-style explanation, users deserve something better than blaming two hours of downtime on an “infrastructural double-whammy.”

Image courtesy of Shutterstock user Elnur.


