Many Fortune 500 and mid-size enterprises are funding Hadoop test/dev projects for Big Data analytics, but question how to integrate Hadoop into their standard enterprise architecture. For example, Joe Cunningham, head of technology strategy and innovation at credit card giant Visa, told the audience at last year’s Hadoop World that he would like to see Hadoop evolve from an alpha/beta environment into mainstream use for transaction analysis, but has concerns about integration and operations management.
What’s been missing for Big Data analytics has been a LAMP (Linux, Apache HTTP Server, MySQL and PHP) equivalent. Fortunately, there’s an emerging LAMP-like stack for Big Data aggregation, processing and analytics that includes:
- Hadoop Distributed File System (HDFS) for storage
- MapReduce for distributed processing of large data sets on compute clusters
- HBase for fast read/write access to tabular data
- Hive for SQL-like queries on large data sets as well as a columnar storage layout using RCFile
- Flume for log file and streaming data collection, along with Sqoop for database imports
- JDBC and ODBC drivers to allow tools written for relational databases to access data stored in Hive
- Hue for user interfaces
- Pig for dataflow and parallel computations
- Oozie for workflow
- Avro for serialization
- Zookeeper for coordinated service for distributed applications
While that’s still a lot of moving parts for an enterprise to install and manage, we’re almost to a point where there’s an end-to-end “hello world” for analytical data management. If you download Cloudera’s CDH3b2, you can import data with Flume, write it into HDFS, and then run queries using Cloudera’s Beeswax Hive user interface.
With the benefit of this emerging analytical platform, data science is becoming more integral to businesses, and less a quirky, separate function. As an industry, we’ve come a long way since, industry visionary Jim Gray was famously thrown out of the IBM Scientific Center in Los Angeles for failure to adhere to IBM’s dress code.
Adobe’s infrastructure services team has scaled HBase implementations to handle several billion records with access times under 50 milliseconds. Their “Hstack” integrates HDFS, HBase and Zookeeper with a Puppet configuration management tool. Adobe can now automatically deploy a complete analytical data stack across a cluster.
Working with Hive, Facebook created a web-based tool, HiPal, that enables non-engineers to run queries on large data sets, view reports, and test hypotheses using familiar web browser interfaces.
For Hadoop to realize its potential for widespread enterprise adoption, it needs to be as easy to install and use as Lotus 1-2-3 or its successor Microsoft Excel. When Lotus introduced 1-2-3 in 1983, they chose the name to represent the tight integration of three capabilities: a spreadsheet, charting/graphing and simple database operations. As a high school student, I used it to manage the reseller database for a storage startup, Maynard Electronics. Even as a 15 year old, I found Lotus 1-2-3 easy to use. More recently, with Microsoft Excel 2010 and SQL Server 2008 R2, I can click on Excel ribbon buttons to load and prepare PowerPivot data, create charts and graphs using out-of-the-box templates, and publish on SharePoint for collaboration with colleagues.
The Fourth Paradigm quotes Jim Gray as saying “We have to do better producing tools to support the whole research cycle – from data capture and data curation to data analysis and data visualization.” As the Hadoop data stack becomes more LAMP-like, we get closer to realizing Jim’s vision and giving enterprises an end-to-end analytics platform to unlock the power of their Big Data with the ease of use of a Lotus 1-2-3 or Microsoft Excel spreadsheet.
Brett Sheppard is an executive director at Zettaforce.
Related Post from GigaOM PRO (Sub Required): The Incredible, Growing, Commercial Hadoop Market