How Twitter is doing its part to democratize big data

Twitter has been on a tear lately when it comes to open sourcing big data tools. The latest two are Cassie — a Scala client for managing Twitter’s 1,000-plus node Cassandra cluster — and Scalding — a MapReduce framework for simplifying the creation of Hadoop jobs. If you think big data will be black magic forever, think again.

Twitter has been fairly active on the open source front for the past few years, and because it works with so much data, it has released a lot of tools for handling it. Among its various open source contributions are Gizzard, a middleware framework for distributed databases; FlockDB, a graph database of sorts for managing the Twitter social graph; and Storm, a stream-processing engine to handle data in real time.

Of the two latest releases, Scalding is probably the more interesting because of the general fervor over Hadoop across the IT world. In a recent Twitter Engineering blog post, Twitter data scientist Edwin Chen described Scalding this way:

Scalding is an in-house MapReduce framework that Twitter recently open-sourced. Like [Apache] Pig, it provides an abstraction on top of MapReduce that makes it easy to write big data jobs in a syntax that’s simple and concise. Unlike Pig, Scalding is written in pure Scala — which means all the power of Scala and the JVM is already built-in. No more UDFs, folks! …

In 140: Instead of forcing you to write raw map and reduce functions, Scalding allows you to write natural code like:
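To give a flavor of what "natural code" can mean here, the following is a minimal word-count sketch. It is not Chen's snippet and does not use Scalding's API; it uses plain Python collections only to illustrate the shape of the computation (split lines into words, group by word, count), and the function name `word_count` is made up for this illustration.

```python
from collections import Counter


def word_count(lines):
    """Count how often each word appears across an iterable of text lines.

    Illustrative only: a hypothetical helper, not part of Scalding or Hadoop.
    """
    # "Flat map" step: split every line into lowercase words.
    words = (word for line in lines for word in line.lower().split())
    # "Group by word and count" step.
    return Counter(words)


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

A framework in the spirit Chen describes lets the author write roughly this much logic and leaves the work of distributing it across a cluster to the framework.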

Chen also illustrates some simple use cases for Scalding, such as correlating the similarities between people’s movie interests or their Foursquare checkins. In the movie example, Chen shows the code necessary to collect and parse the various data, as well as a simple command to actually run the job in Hadoop.

The moral of this story, of course, isn’t so much what Twitter is doing as it is the democratization of big data technologies. From startups to large software vendors to web companies like Twitter, tools are emerging that should make analytics on large data sets doable by individuals who don’t bear the job title “data scientist.”

When we plan conferences such as Structure: Data, which takes place later this month in New York, we’re always looking toward the future. The big data space is advancing so fast, it’s difficult to tell where the cutting edge will be a few years from now. What’s next when skills such as building recommendation engines and ad-targeting systems become commonplace or, better yet, services, and when managing distributed systems becomes child’s play?

GigaOM