How Infochimps wants to become Heroku for Hadoop

Deploying and managing big data systems such as Hadoop clusters is not easy work, but Infochimps wants to change that with its new Infochimps Platform offering. The Austin, Texas-based startup best known for its data marketplace service is now offering a cloud-based big data service that takes the pain out of managing Hadoop and scale-out database environments. Eventually, it wants to make running big data workloads as simple as Platform-as-a-Service offerings like Heroku make running web applications.

The new Infochimps Platform is essentially a publicly available version of what the company has built internally to process and analyze the data it stores within its marketplace. As Infochimps CEO Joe Kelly puts it, the company is “giving folks … the iPod to our iTunes.”

The platform is hosted in the Amazon Web Services cloud and supports Hadoop, various analytical tools on top of that — including Apache Pig and Infochimps’ own Wukong (a Ruby framework for Hadoop) — and a variety of relational and NoSQL databases. It also leverages the Apache Flume project to augment data in real time as it hits the system. But the real secret sauce is Ironfan, a configuration-and-management tool that Infochimps built atop Opscode’s Chef software.

The open-source Chef software is widely used in cloud computing and other distributed environments because it makes infrastructure configuration and management so much easier, but it’s limited to a single computer at a time, Infochimps CSO Dhruv Bansal told me. Ironfan is an abstraction layer that sits atop Chef and lets users automate the deployment and management of entire Hadoop clusters at once. It’s what lets the company spin up clusters for Infochimps Platform customers in minutes rather than hours or days, which is the only way the company could offer big data infrastructure on demand.

Early customers include SpringSense, Runa and Black Locus, and Bansal told me some are already storing and processing hundreds of terabytes on the Infochimps Platform.

For users who’d rather work on their own internal gear, however, Infochimps is open sourcing Ironfan. Kelly said the company is currently working with early users on some in-house deployments, including atop the OpenStack cloud computing platform. Open source Ironfan doesn’t come with all the monitors, dashboards, and other bells and whistles of the Infochimps Platform, Kelly said, but it’s plenty powerful on its own. Ironfan is what lets Infochimps’ relatively small engineering team “[move] whole cities with their minds,” he said.

Of course, Infochimps isn’t abandoning its flagship data marketplace, which Kelly thinks is actually the ideal complement for the new big data platform. Customers can use the platform to process their own internal data, he explained, but they’ll be able to add a lot more value to those results by further analyzing them against Infochimps’ growing collection of social, location and other data sets.

As the Infochimps Platform evolves, Kelly hopes it will help answer the question of “what does a Heroku for big data look like?” PaaS offerings such as Heroku are great, he said, because they help developers launch web applications without having to worry about managing infrastructure. He hopes the Infochimps Platform can provide a similar experience as big-data-based applications become more prevalent and startup companies look for a way to get the analytics infrastructure they need without investing heavily in people, servers and software.

A few providers already offer some form of Hadoop as a cloud-based service, including IBM and Amazon Web Services, and Microsoft is working with Hortonworks on a Hadoop distribution that can run on the Windows Azure cloud platform.

We’ll be talking a lot more about the future of big data and big data platforms at our Structure: Data conference next month in New York, where topics range from Hadoop at the low level to capturing and acting on consumer sentiment in real time.

GigaOM