Big data in real time is no fantasy

Big data, as in managing and analyzing large volumes of information, has come a long way in the past couple of years. Among the greatest innovations might be the advent of real-time analytics, which allow information to be processed the moment it arrives, enabling instantaneous decision-making. Even Hadoop, the set of parallel-processing tools that has become the face of big data but has historically been limited to batch processing, is coming along for the ride.

Analytics are nothing new, but Hadoop has made organizations of all types realize they can analyze all of their data, and can do so using commodity servers with local storage. They can extract valuable business insights from sources like social media comments, web pages and server log files.

Because of its parallel nature and ability to scale across thousands of nodes, Hadoop makes short work of even terabytes of information that might have taken days to process using traditional methods. But not short enough for some situations.

Yahoo CTO Raymie Stata explained the current state of affairs in a recent article at The Register:

With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.

However, thanks to various Hadoop optimizations, complementary technologies and advanced algorithms, real-time analytics are becoming a real possibility. The goal for everyone seeking real-time analytics is to have their services act immediately — and intelligently — on information as it streams into the system.

Pick a platform

Yahoo itself is working on a couple of real-time analytics projects, including S4, which we’ve profiled here, and MapReduce Online. Appistry and Accenture teamed up late last year to create a product called Cloud MapReduce. DataStax’s Brisk Hadoop distribution stores and analyzes data within a single Cassandra NoSQL database, so applications can access and serve Hadoop-processed data much faster than they could with separate storage systems.

This week, a startup called HStreaming launched its eponymous product, which actually is based on Hadoop. Whereas Yahoo is focused on web behavior, HStreaming lays out the following examples in its press release:

Typical examples include location information, sensor data, or log files when the traditional model of store-and-process-later is not fast enough for such data volumes. Companies need to react promptly to sensor readings or analyze web logs as they are generated because that type of information becomes quickly obsolete.

Others are using real-time analysis to make targeted advertising both instantaneous and super-efficient. I spoke yesterday with Eric Wheeler, founder and CEO of 33Across, a marketing platform that lets companies target potential customers based on those companies’ social graphs. Essentially, he explained, “We constantly re-score the brand graph to understand who are the best targets for that ad right now. We use social connections of that brand to know whom we should next target.”

In order to do this, 33Across maintains a “massive Hadoop implementation” complemented by machine-learning and predictive-analytics algorithms that it has developed in house. Presumably, the data batch-processed by and stored in the Hadoop cluster adds context to streaming data as it hits the 33Across system. The more data it has about a brand’s social graph, the better the decisions it can make on the fly.

Jeff Jonas, chief scientist of IBM’s Entity Analytics division and all-around big-data genius, analogizes this effect to putting together a puzzle. The more pieces you have in place, the easier it is to figure out where the next piece goes. Within the context of IBM’s big data portfolio, for example, Hadoop helps companies learn their past, which helps real-time products such as InfoSphere Streams or Jonas’s Entity Analytics software analyze streaming data more accurately.
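
To make that pattern concrete, here is a minimal Python sketch of the general idea: context built offline by a batch job, such as a Hadoop run over a brand’s social graph, sits in a fast lookup and is consulted as each new event streams in. The field names, scores and threshold below are illustrative assumptions, not details of 33Across’s or IBM’s actual systems.

    # Illustrative sketch only: scores precomputed by a batch (e.g. Hadoop) job
    # supply context for decisions made on events as they stream in.

    # Stand-in for the output of a nightly batch job over a brand's social graph,
    # e.g. a job that emits an affinity score per user (hypothetical data).
    batch_scores = {"u123": 0.91, "u456": 0.42}

    def decide(event, scores, threshold=0.7):
        """Combine a streaming event with batch-built context to pick an action."""
        affinity = scores.get(event["user_id"], 0.0)  # unknown users get no context
        return "target" if affinity >= threshold else "skip"

    # In a real deployment this loop would read from a message queue or stream processor.
    for event in ({"user_id": "u123", "page": "/sneakers"},
                  {"user_id": "u789", "page": "/home"}):
        print(event["user_id"], decide(event, batch_scores))

The more complete the batch-built lookup, the more puzzle pieces are already in place when the next event arrives, which is exactly the effect Jonas describes.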

Real-time advertising was also the impetus behind Amazon.com’s recent partnership with Triggit. Amazon wants to use its data to make money by helping other web sites better target incoming visitors as they browse from site to site. Thanks to Triggit’s predictive algorithms and cookie-analysis system, “Amazon [can] show the right ads to the right users across nine ad exchanges and more than four million websites.”

If this all sounds like high computer science, it is. But the most interesting thing about it might be that it was hardly even possible a few years ago. According to Wheeler, the tools and best practices — and in some cases, the data — weren’t readily available until recently, so the evolution from batch processing to real-time processing has happened quickly.

But we’re only “in the first inning of a doubleheader,” he said, so real-time processing will only get better as data volumes increase and models get more finely tuned.

Image courtesy of Flickr user RL Fantasy Design Studio.



