5 reasons why the future of Hadoop is real-time (relatively speaking)

In some ways, Hadoop is like a fine wine: It gets better with age as rough edges (or flavor profiles) are smoothed out, and those who wait to consume it will probably have a better experience. The only problem with this is that Hadoop exists in a world that’s more about MD 20/20 than it is about Relentless Napa Valley 2008: Companies often want to drink their big data fast, get drunk on insights, and then have some more — maybe something even stronger. And with data — unlike technology and tannins — it turns out older isn’t always better.

That’s a crude analogy, of course, but it gets at the essence of what’s currently plaguing Hadoop adoption and what will propel it forward in the next couple years. The work being done by companies like Cloudera and Hortonworks at the distribution level is great and important, as is MapReduce as a processing framework for certain types of batch workloads. But not every company can afford to be concerned about managing Hadoop on a day-to-day basis. And not every analytic job pairs well with MapReduce.

In Part I of our four-part series on Hadoop, we looked at how the technology was born and grew into the juggernaut it is today. In Part II, we laid out the map of the current products and projects that make up the Hadoop ecosystem. In this installment, we’ll take a closer look at some of them and how they’re positioning themselves to be important players down the road.

If there’s one big Hadoop theme at our Structure: Data conference March 20-21 in New York, it’s the new realization that people shouldn’t be asking “What’s next after Hadoop?” but rather “What will Hadoop become next?” Based on what’s transpiring today, the answer to that question is that Hadoop will become faster in all regards and more useful as a result.

Interactivity, big-data-style

Source: Shutterstock user hauhu.

As I explained in some detail a couple of weeks ago, SQL is what’s next for Hadoop, and not because of familiarity alone or the types of queries SQL permits on relational data. It’s also because the massively parallel processing engines developed over the years to analyze relational data are very fast. That means analysts can ask questions and get answers at speeds much closer to the speed of their intuition than is possible when querying entire data sets using standard MapReduce.
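To make that interactivity concrete, here is a minimal sketch (my illustration, not drawn from the article) of an analyst-style ad hoc query issued from Java over JDBC. It assumes a HiveServer2-compatible endpoint on localhost:10000 and a hypothetical page_views table; Impala and several other SQL-on-Hadoop engines expose a similar interface.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InteractiveQuery {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (ships with the Hive client libraries).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical endpoint, database and credentials; adjust for your cluster.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "analyst", "")) {
            Statement stmt = conn.createStatement();

            // The kind of question an analyst once waited on a MapReduce job to
            // answer, expressed as a single ad hoc SQL query.
            ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS hits "
                + "FROM page_views WHERE dt = '2013-03-04' "
                + "GROUP BY url ORDER BY hits DESC LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}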

But just as SQL and its processing techniques bring something to Hadoop, Hadoop (the Hadoop Distributed File System, specifically) brings something to the table, too. Namely, it brings scale and flexibility that don’t exist in the traditional data warehouse world, where new hardware and licenses can be expensive, so only the “valuable” data makes its way inside, and only after it has been fitted to a predefined structure. Hadoop, on the other hand, provides virtually unlimited scale and schema-free storage, so companies can store however much information they want in whatever format they want and worry later about what they’ll actually use it for. (In practice, though, most Hadoop jobs do require some sort of structure in order to run, and Hadoop co-creator Mike Cafarella is working on a project called RecordBreaker that aims to automate this step for certain data types.)
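To make the “store now, structure later” point concrete, here is a minimal sketch (again, my illustration) of dropping raw log records into HDFS with Hadoop’s standard Java FileSystem API. The path and record format are made up; the point is that no schema has to be declared before the bytes land.

import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawIngest {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster's settings from core-site.xml / hdfs-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // No DDL and no schema registration: just bytes in a file, to be
        // interpreted later by whatever job (or SQL engine) eventually reads them.
        try (OutputStream out = fs.create(new Path("/raw/clickstream/2013-03-04.log"))) {
            out.write("2013-03-04T12:00:01\tuser42\t/pricing\n".getBytes("UTF-8"));
        }
    }
}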

How hot is the SQL-on-Hadoop space? I profiled the companies and projects working on it on Feb. 21, and since then EMC Greenplum has announced a completely rewritten Hadoop distribution that fuses its analytic database to Hadoop, and an entirely new player called JethroData has emerged with $4.5 million in funding. Even if there’s a major shakeout, a few lucky companies will be left standing to capitalize on a shift to Hadoop as the center of data gravity, a shift that EMC Greenplum’s Scott Yara (albeit a biased source) thinks will be the data equivalent of the mainframe’s demise.

This is your database. This is your database on HDFS

The SQL versus NoSQL debate appears to be dying down as companies and developers begin to realize there’s definitely a place for both in most environments, but a new debate — with Hadoop at the center — might be about to start up. At its core is the concept of data gravity and the large, attractive (in a gravitational sense) entity that is HDFS. Here’s the underlying question that might be posed: If I’m already storing my unstructured data in HDFS and am expected to replace my data warehouse with it, too, why would I also run a handful of other databases that require a separate data store?

This is in part why HBase has attracted such a strong following despite its relative technical and commercial immaturity compared with the comparable NoSQL database Cassandra. For applications that would benefit from a relational database, startups such as Drawn to Scale and Splice Machine have turned HBase into a transactional SQL system. Wibidata, the new startup from Cloudera co-founder Christophe Bisciglia and Aaron Kimball, is pushing an open source framework called Kiji to make it easier to develop applications that use HBase.

“If you talk to anyone from Cloudera or any of the platform vendors, I think they will tell you that a large percentage of their customers use HBase,” Bisciglia said. “It’s something that I only expect to see increasing.”

MapR seems to think so, too: the Hadoop-distribution vendor is getting ahead of the game by selling an enterprise-grade version of HBase called M7. Should hot startups such as TempoDB and Ayasdi decide to take their HBase-reliant cloud services into the data center, they’ll tap into Hadoop clusters, too.

And the National Security Agency built Apache Accumulo, a key-value database similar to HBase but designed for fine-grained security and massive scale. It’s now being sold commercially by a startup called Sqrrl. There’s even a graph-processing project, Giraph, that can read its data from HBase or Accumulo.
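For readers who haven’t touched HBase, here is a minimal sketch of what reading and writing look like through its Java client API of this era (HTable, Put, Get); the “users” table and “profile” column family are invented for the example.

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table with a single column family named "profile".
        HTable table = new HTable(conf, "users");

        // Write one cell: row key "user42", column profile:email.
        Put put = new Put(Bytes.toBytes("user42"));
        put.add(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                Bytes.toBytes("someone@example.com"));
        table.put(put);

        // Read it back by row key -- a low-latency lookup served straight
        // from data stored on HDFS, not a batch job.
        Result result = table.get(new Get(Bytes.toBytes("user42")));
        byte[] email = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"));
        System.out.println(Bytes.toString(email));

        table.close();
    }
}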

Whatever “real-time” means to you

Real-time is one of those terms that means different things to different people and different applications. The interactivity that SQL-on-Hadoop technologies promise is one definition, as is the type of stream processing enabled by technologies like Storm. When it comes to the latter, there’s a lot of excitement around YARN as the innovation that will make it happen.

YARN, aka MapReduce 2.0, is a resource scheduler and distributed application framework that allows Hadoop users to run processing paradigms other than MapReduce. That could mean anything from traditional parallel-processing methods such as MPI, to graph processing, to newly developed stream-processing engines such as Storm and S4. Considering how many years “Hadoop” effectively meant HDFS plus MapReduce, this type of flexibility is certainly a big deal.

Stream processing, of course, is the antithesis of the batch processing for which Hadoop is known, and which is inherently too slow for workloads such as serving real-time ads or monitoring sensor data. And even if Storm and other stream-processing platforms somehow don’t make their way onto Hadoop clusters, a startup called HStreaming has made it its mission to deliver stream processing to Hadoop, and it’s on other companies’ radars as well.
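As a rough illustration of the stream-processing model (a sketch of my own, not HStreaming’s or anyone else’s product), here is a minimal Storm topology using the backtype.storm API of the period: a spout emits events continuously and a bolt reacts to each tuple as it arrives, rather than waiting for a periodic batch scan. The spout is Storm’s built-in demo word spout, and the filtering rule stands in for a real alerting condition.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class AlertTopology {
    // A bolt that passes along only "interesting" events the moment they arrive.
    public static class FilterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            if (word.length() > 4) {            // stand-in for a real alerting rule
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("alert"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new TestWordSpout());                 // demo event source
        builder.setBolt("alerts", new FilterBolt()).shuffleGrouping("events");

        // Run in-process for the sketch; on a real cluster this would go
        // through StormSubmitter instead.
        new LocalCluster().submitTopology("alerting", new Config(), builder.createTopology());
    }
}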

For what it’s worth, though, VertiCloud Founder and CEO and former Yahoo CTO Raymie Stata thinks we should do away with terms such as batch, real-time and interactive altogether. Instead, he prefers the terms synchronous and asynchronous to describe the human experience with the data rather than the speed of processing it. Synchronous computing happens at the speed of human activity, generally speaking, while asynchronous computing is largely decoupled from the idea of someone sitting in front of a computer screen awaiting a result.

The change in terms goes hand in hand with a change in how you manage SLAs for applications. Uploading photos to Flickr: synchronous. Running a MapReduce job: most likely asynchronous. Ironically, according to Stata, processing streaming data with Storm is often asynchronous, too, because there’s probably not someone on the other end waiting for a page to update or a query to return. And unless you’re doing something where guaranteed real-time latency is necessary, the occasional difference between a few milliseconds and a second probably isn’t critical.

Time to insight starts at the planning phase

Even when MapReduce is the answer, though, not everyone is game for a long Hadoop deployment process coupled with a consulting deal to identify uses and build applications or workflows. Sometimes, you just want to buy some software and get going.

Already, companies such as Wibidata and Continuuity are trying to make it easier for companies to build Hadoop applications specific to their own needs, and Wibidata’s Bisciglia said his company is doing less and less customization the more it deals with customers in the same vertical markets. “I think it’s still a couple years out before you can buy a generic application that runs on Hadoop,” he told me, but he does see opportunity for billion-dollar businesses at this level, possibly selling the Hadoop equivalent of an ERP or CRM application.

Cloudera CEO Mike Olson at Structure: Data 2012 (c) 2012 Pinar Ozger

And Cloudera CEO Mike Olson told the audience at our Structure: Data conference last year that he’ll connect startups trying to build Hadoop-based applications with funding opportunities. In fact, Cloudera backer Accel Partners launched a Big Data Fund in 2011 with the sole purpose of funding application-level big data startups.

But maybe Cloudera, like database vendor Oracle before it, will just get into the application space itself. According to Hadoop creator and Cloudera chief architect Doug Cutting:

“I wouldn’t be surprised if you see vendors, like Cloudera, starting to creep up the stack and sell some applications. You’ve seen that before from Red Hat, from Oracle. You could argue that the relational database is a platform for Oracle and they’ve sold a lot of applications on top. So I think that happens as the market matures. When it’s young, we don’t want to stomp on potential collaborators at this point, we want to open that up to other people to really enhance the platform.”

Cloud computing is proving to be a big help in getting Hadoop projects off the ground, too. Even low-level services such as Amazon Elastic MapReduce can ease the burden of managing a physical Hadoop cluster, and there are already a handful of cloud services exposing Hadoop as a SaaS application for business intelligence and analytics. The easier it gets to store, process and analyze data in the cloud, the more appealing Hadoop looks to potential users who can’t be bothered to invest in yet another IT project.

Google (and Microsoft): A guiding light

Lest we forget, Hadoop is based on a set of Google technologies, and it seems likely its future will also be influenced by what Google is doing. Already, improvements to HDFS seem to mirror changes to the Google File System a few years back, and YARN will enable some new types of non-MapReduce processing similar to what Google’s new Percolator framework does. (Google claims Percolator lets it “process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.”) The MapR-led Apache Drill project is a Hadoop-based version of Google’s Dremel tool; Giraph was likely inspired by Google’s Pregel graph-processing technology.

Cutting is particularly excited about Google Spanner, a database system that spans geographically distributed data centers while still maintaining transactional consistency. “It’s a matter of time before somebody implements that in the Hadoop ecosystem,” he said. “That’s a huge change.”

It’s possible Microsoft could be an inspiration to the Hadoop community, too, especially if it begins to surface pieces of its Bing search infrastructure as products, as a couple of company executives have told me it will. Bing runs on a combination of tools called Cosmos, Tiger and Scope, and it’s part of the Online Services division run by former Yahoo VP and Hadoop backer Qi Lu. Lu said that Microsoft (like Google) is looking beyond just search — Hadoop’s original function — and into building an information fabric that changes how data is indexed, searched for and presented.

However it evolves, though, it’s becoming pretty obvious that Hadoop is no longer just a technology for doing cheap storage and some MapReduce processing. “I think there’s still some doubt in people’s minds about whether Hadoop is a flash in the pan … and I think they’re missing the point,” Cutting said. “I think that’s going to be proven to people in the next year.”
