How Tumblr went from wee to webscale

Tumblr, the popular microblogging service, is hitting 500 million page views a day, handling 40,000 requests per second and sending more than a terabyte of data into its Hadoop cluster each day. But it wasn’t always a superhot startup that needed to serve 15 billion page views a month, and the story of how it morphed from wee to webscale makes a great case study over at High Scalability.
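
For a rough sense of how those figures fit together, here is a back-of-the-envelope sketch. It assumes a 30-day month and, for the last number only, that the 40,000 requests per second were sustained all day, which the post does not claim (it may well be a peak), so treat that ratio as an upper bound rather than a measurement.

```python
# Back-of-the-envelope check of the traffic figures quoted above.
PAGE_VIEWS_PER_DAY = 500_000_000   # from the article
REQUESTS_PER_SECOND = 40_000       # from the article
DAYS_PER_MONTH = 30                # assumption: a 30-day month
SECONDS_PER_DAY = 24 * 60 * 60

# 500 million a day over 30 days matches the 15 billion a month figure.
page_views_per_month = PAGE_VIEWS_PER_DAY * DAYS_PER_MONTH

# Average page views per second if traffic were spread evenly over the day.
avg_page_views_per_second = PAGE_VIEWS_PER_DAY / SECONDS_PER_DAY

# Assumption: 40,000 req/s treated as a sustained rate, not a peak.
requests_per_page_view = REQUESTS_PER_SECOND / avg_page_views_per_second

print(f"{page_views_per_month:,} page views per month")
print(f"{avg_page_views_per_second:,.0f} page views per second on average")
print(f"{requests_per_page_view:.1f} requests per page view (upper bound)")
```

Under those assumptions, the daily and monthly numbers agree, and each page view would account for several backend requests at most.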

The biggest takeaway from the study, which is a must-read for developers and a must-skim for entrepreneurs building consumer-facing startups (seriously, the sections on lessons learned and how to hire coders apply to y’all), is how Tumblr moved from a traditional open-source base of Linux, Apache, MySQL and PHP to a thin veneer of PHP code on top of bleeding-edge technologies such as the Scala language and Finagle, Twitter’s open-source RPC framework. The post notes that, while newer NoSQL data stores like HBase and Redis are used, the bulk of the data is currently stored in a heavily partitioned MySQL setup, and there is no plan to replace MySQL with HBase.

The story is familiar to those who have followed the infrastructure progressions of Facebook and Twitter, and many of the tools Tumblr uses are open-source contributions from those granddaddies of webscale. But the post digs into a problem that is fairly specific to Tumblr: it runs two different types of services, one that looks more like a constantly updating stream of user statuses, such as Twitter, and another that is more like a Facebook page, with a huge social graph for each writer that Tumblr must track. The implications for Tumblr’s architecture are huge. From the post:

Users form a connection with other users so they will go hundreds of pages back into the dashboard to read content. Other social networks are just a stream that you sample. …

Public Tumblelog is what the public deals with in terms of a blog. Easy to cache as it’s not that dynamic.

Dashboard is similar to the Twitter timeline. Users follow real-time updates from all the users they follow.

Very different scaling characteristics than the blogs. Caching isn’t as useful because every request is different, especially with active followers.

Needs to be real-time and consistent. Should not show stale data. And it’s a lot of data to deal with. Posts are only about 50GB a day. Follower-list updates are 2.7TB a day.
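
The contrast in that excerpt can be made concrete with a toy sketch. This is not Tumblr's code: the cache keys, the blog name "staff" and the 1,000-reader traffic pattern are all hypothetical, and the cache is an idealized one that never evicts. It only illustrates why a page that is the same for everyone caches well while a per-user page does not, and how lopsided the two quoted data volumes are (assuming 1 TB = 1,000 GB).

```python
def cache_hit_rate(keys):
    """Fraction of requests served from an idealized cache that never evicts."""
    seen = set()
    hits = 0
    for key in keys:
        if key in seen:
            hits += 1      # same key requested before: served from cache
        else:
            seen.add(key)  # first request for this key: a miss that fills the cache
    return hits / len(keys)

# Hypothetical traffic: 1,000 readers each load one page.
readers = range(1000)

# Public blog: every reader sees the same page, so the key ignores the reader.
blog_keys = [("blog", "staff", 1) for _ in readers]

# Dashboard: the page depends on whom the reader follows, so the key includes the reader.
dashboard_keys = [("dashboard", reader, 1) for reader in readers]

blog_rate = cache_hit_rate(blog_keys)
dashboard_rate = cache_hit_rate(dashboard_keys)

# Daily volumes quoted in the post, in GB (assumption: 1 TB = 1,000 GB).
POSTS_GB_PER_DAY = 50
FOLLOWER_UPDATES_GB_PER_DAY = 2700
volume_ratio = FOLLOWER_UPDATES_GB_PER_DAY / POSTS_GB_PER_DAY

print(f"blog hit rate: {blog_rate:.3f}")
print(f"dashboard hit rate: {dashboard_rate:.3f}")
print(f"follower-list updates are {volume_ratio:.0f}x the post volume")
```

In this toy setup only the first blog request misses, while every dashboard request does, and the follower-list traffic dwarfs the posts themselves by a factor of 54.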

Aside from the technical issues, entrepreneurs can learn from this case study because Tumblr is trying to solve more general problems: hiring talent in NYC (it has high hopes for Mayor Bloomberg’s efforts to bolster the tech community), conducting job interviews that find good coders, and trying out new technologies at small scale before rolling them out to the entire production environment. Really, the post is long, but it’s good. Go read it.
