LexisNexis open-sources its Hadoop killer

LexisNexis is releasing a set of open source data-processing tools that it says outperforms Hadoop and even handles workloads that Hadoop presently cannot. The technology (and new business line) is called HPCC Systems, and was created 10 years ago within the LexisNexis Risk Solutions division that analyzes huge amounts of data for its customers in intelligence, financial services and other high-profile industries. There have been calls for a legitimate alternative to Hadoop, and this certainly looks like one.

According to Armando Escalante, CTO of LexisNexis Risk Solutions, the company decided to release HPCC now because it wanted to get the technology into the community before Hadoop became the de facto option for big data processing. Escalante told me during a phone call that he thinks of Hadoop as “a guy with a machete in front of a jungle — they made a trail,” but that he thinks HPCC is superior.

But in order to compete for mindshare and developers, he said, the company felt it had to open source the technology. One big thing Hadoop has going for it is its open source model, Escalante explained, which attracts a lot of developers and a lot of innovation. If his company wanted HPCC to “remain relevant” and keep improving through new use cases and ideas from a new community, the time for release was now and open source had to be the model.

Hadoop, of course, is the Apache Software Foundation project created several years ago by then-Yahoo employee Doug Cutting. It has become a critical tool for web companies — including Yahoo and Facebook — to process their ever-growing volumes of unstructured data, and is fast making its way into organizations of all types and sizes. Hadoop has spawned a number of commercial distributions and products, too, including from Cloudera, EMC and IBM.

How HPCC works

Hadoop relies on two core components to store and process huge amounts of data: the Hadoop Distributed File System and Hadoop MapReduce. However, as Cloudant CEO Mike Miller explained in a post over the weekend, MapReduce is still a relatively complex framework for writing parallel-processing workflows. HPCC seeks to remedy this with its Enterprise Control Language (ECL).

Escalante says ECL is a declarative, data-centric language that abstracts much of the work necessary within MapReduce. For certain tasks that take a thousand lines of code in MapReduce, he said, ECL requires only 99 lines. Furthermore, he explained, ECL doesn’t care how many nodes are in the cluster, because the system automatically distributes data across however many nodes are present. Technically, though, HPCC could run on just a single virtual machine. And, says Escalante, HPCC is written in C++ — like the original Google MapReduce on which Hadoop MapReduce is based — which he says makes it inherently faster than the Java-based Hadoop version.
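
For a flavor of that declarative style, here is a minimal ECL sketch; the logical file name and record layout are illustrative assumptions, not anything LexisNexis has published:

    // A minimal ECL sketch (hypothetical file name and layout).
    PersonRec := RECORD
        STRING20  name;
        UNSIGNED1 age;
    END;
    people := DATASET('~demo::people', PersonRec, THOR);
    // Declarative filter: the programmer states the result wanted,
    // and the platform plans how to split the work across however
    // many Thor nodes are present.
    adults := people(age >= 18);
    OUTPUT(COUNT(adults));

Note what is absent: there is no explicit mapper, reducer or job configuration. Planning the parallel execution is the platform’s job, which is the abstraction Escalante is describing.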

HPCC offers two options for processing and serving data — the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster. Escalante said Thor, so named for its hammer-like approach to solving the problem, crunches, analyzes and indexes huge amounts of data a la Hadoop. Roxie, on the other hand, is more like a traditional relational database or data warehouse, and can even serve transactions to a web front end.

We didn’t go into detail on HPCC’s storage component, but Escalante noted that it utilizes a distributed file system, and that it can support a variety of off-node storage architectures as well as local solid-state drives.

He added that in order to ensure LexisNexis wasn’t blinded by “eating its own dogfood,” his team hired a Hadoop expert to kick the tires on HPCC. The consultant was impressed, Escalante said, but did note some shortcomings that the team addressed as it readied the technology for release. It also built a converter for migrating Hadoop applications written in the Pig language to ECL.
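
To get a sense of what such a conversion involves, here is a hedged sketch of my own (the Pig script and its ECL rendering are hypothetical; the real converter’s output may differ):

    // Hypothetical Pig input: count records per state.
    //   people  = LOAD 'people' AS (state:chararray);
    //   grouped = GROUP people BY state;
    //   counts  = FOREACH grouped GENERATE group, COUNT(people);
    // A plausible ECL rendering of the same job:
    PersonRec := RECORD
        STRING2 state;
    END;
    people := DATASET('~demo::people', PersonRec, THOR);
    // TABLE with a grouping field is ECL's cross-tabulation idiom;
    // it collapses Pig's GROUP/FOREACH/COUNT pattern into one expression.
    counts := TABLE(people, {state, UNSIGNED cnt := COUNT(GROUP)}, state);
    OUTPUT(counts);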

Can HPCC Systems actually compete?

The million-dollar question is whether HPCC Systems can actually attract an ecosystem of contributors and users that will help it rise above the status of a big data also-ran. Escalante thinks it can, in large part because HPCC has already been proven in production, handling LexisNexis Risk Solutions’ 35,000 data sources, 5,000 transactions per second and large, paying customers. He added that the company will also provide enterprise licenses and proprietary applications in addition to the open source code. Plus, it already has potential customers lined up.

It’s often said that competition means validation. Hadoop has moved from a niche set of tools to the core of a potentially huge business that’s growing every day, and even Microsoft has a horse in this race with its Dryad set of big data tools. Hadoop has already proven itself, but the companies and organizations relying on it for their big data strategies can’t rest on their laurels.

Image courtesy of Flickr user NileGuide.com.
