How HBase converted MySpace’s MySQL champion and is driving Hadoop mainstream

How’s this for an understatement: Operational databases are important for many, if not the majority, of web applications. And if you’re doing big business on the web, finding one that can scale with your data volumes and still perform like you need it to is critical. MapReduce for batch data processing and analysis? Not so much, actually.

That’s why as Hadoop keeps thundering toward its destination as the de facto data platform for next-generation applications, companies such as Cloudera and Hortonworks that are making a killing off it might want to stop and thank the guys from Powerset for building HBase. Because the database — a columnar Google BigTable clone that runs on top of the Hadoop Distributed File System — is so fast and scalable, it’s helping Hadoop find a home in companies and with applications that HDFS and MapReduce alone might not have been able to penetrate so easily.

The latest HBase user I’ve come across is Gravity, the interest graph company that powers content recommendations for some of the biggest publishers on the web.

From big MySQL at MySpace to big data with HBase

Gravity’s co-founders were all senior executives at MySpace, including CTO Jim Benedetto, who was SVP of technology for the social networking pioneer. He was actually MySpace’s first architect and helped build the platform’s MySQL database. Although MySpace never reached Facebook’s scale, it did have 150 million users at its peak, all able to store unlimited numbers of wall posts, messages and photos. Benedetto eventually oversaw a 600-instance cluster that required about 30 database administrators to keep it up and running.

Benedetto (center) at Structure: Data 2012. (c) Pinar Ozger

So naturally, when it came time to build out the Gravity architecture, Benedetto opted for the MySQL he knew so well. Until about three years ago, he told me recently, that database held about 95 percent of the company’s data. At some point, though, Benedetto and his team realized they were spending way too much time keeping their MySQL environment up instead of building new things, so it was time for a change.

The company ultimately opted for HBase, but the decision wasn’t easy. “For us,” Benedetto said, “our data and algorithms are our company,” so making the move from a relational database to a column-based database that can serve MapReduce jobs was nerve-racking. After all, he explained, “You never want to migrate your data … and if you have to, you never want to migrate it more than once.” In fact, he added, “you’re not going back.”

But Benedetto says the move to HBase as Gravity’s primary data store has been “life-saving,” and it’s arguably a more important component of the company’s infrastructure than is Hadoop MapReduce. HBase handles the company’s real-time recommendation algorithms, and it does it across the entire Gravity platform rather than on a site-by-site basis. And although it’s not banking-grade when it comes to the consistency of transactions, Benedetto says it’s about 99.95 percent consistent in real time. Later on, batch MapReduce jobs swoop in and pick up whatever HBase dropped earlier, and process it all against the company’s graph algorithms.

An example of an interest graph from Gravity.

Scalable for sure, and getting easier to use

And although it took some serious engineering effort to get HBase operational when Gravity began working with it three years ago, Benedetto thinks HBase is getting to the point (as is rival NoSQL database Cassandra, he acknowledged) where one could safely call it “enterprise-ready.” Right now, he noted, “You’re not gonna see HBase in a company that just buys Oracle because Oracle is the name and Oracle has been around for 20 years,” but for web startups that hope to reach a certain scale and even for existing companies that are running into the MySQL wall, he sees a shift occurring.

“The web farm is the easiest part of your infrastructure to scale because all it does is cost more money,” Benedetto explained. Databases, on the other hand, require a lot of thinking about how to migrate data, shard the database and otherwise make a piece of software likely designed for a handful of servers, max, spread across dozens or hundreds. HBase really eases the scaling process, as well as the subsequent management, he said. Now, Gravity’s 100-node HBase cluster has only two operations engineers dedicated to it.
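To make the sharding headache concrete, here is a minimal sketch of what hand-rolled sharding of a relational database can look like. Everything in it is hypothetical: the `shard_for` helper, the `user:N` key format and the hash-modulo scheme are illustrative assumptions, not a description of how MySpace or Gravity actually partitioned their data. It shows how adding a single shard under a naive modulo scheme forces most keys to move.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash (md5 is used only for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys, old_shards: int, new_shards: int) -> float:
    """Return the fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical key space: 10,000 user IDs spread over 4 shards, then grown to 5.
keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_shards=4, new_shards=5)
print(f"{moved:.0%} of keys would have to migrate")
```

With a uniform hash, a key stays put only when its hash gives the same remainder for both shard counts, so in expectation about four-fifths of the keys move when going from four shards to five. HBase sidesteps this particular chore by splitting and reassigning row-key ranges (regions) itself rather than leaving the partitioning scheme to the application.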

Indeed, there are startups trying to capitalize on HBase by using it to power SQL and even MongoDB-compliant databases that can scale beyond what most relational databases can do.

Aside from scale, HBase might soon start catching on because of the work companies like Gravity have been doing to make it more user-friendly. It might scale easily, but, as Benedetto noted, it’s not always easy to get started with, especially without some deep understanding of the intricacies of the underlying HDFS infrastructure. Last year, eBay VP of Experience, Search and Platforms Hugh Williams told me that although HBase is one of the big data tools the company is most excited about, it’s also the area where he’d like to see the most improvement.

To help alleviate some of the learning curve, Gravity has developed an open-source tool called HPaste that lets developers access data and run jobs on HBase data using Scala rather than the more-bloated Java programming language on which Hadoop and HBase are built. One of the biggest benefits of HPaste, Benedetto said, is that it lets new HBase developers see the data in a way that makes sense to them: HBase stores everything in byte arrays, he explained, and “when a human tries to read a byte array, it looks like ancient hieroglyphics.”
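The byte-array problem Benedetto describes is easy to see in a few lines. The sketch below is not HPaste and does not use any HBase API; it is a standalone illustration with hypothetical column names (`views`, `title`) and an assumed 8-byte big-endian encoding for integers. It shows what a raw cell looks like to a human and how a small typed schema makes the same bytes readable again.

```python
import struct


def encode_long(value: int) -> bytes:
    """Encode an integer as 8 big-endian bytes (an assumed layout for this sketch)."""
    return struct.pack(">q", value)


def decode_long(raw: bytes) -> int:
    """Decode 8 big-endian bytes back into an integer."""
    return struct.unpack(">q", raw)[0]


# Hypothetical typed schema: column name -> (encoder, decoder).
SCHEMA = {
    "views": (encode_long, decode_long),
    "title": (lambda s: s.encode("utf-8"), lambda b: b.decode("utf-8")),
}


def encode_row(row: dict) -> dict:
    """Turn typed values into the raw bytes a byte-oriented store would hold."""
    return {col: SCHEMA[col][0](val) for col, val in row.items()}


def decode_row(raw_row: dict) -> dict:
    """Turn raw bytes back into typed values using the schema."""
    return {col: SCHEMA[col][1](raw) for col, raw in raw_row.items()}


raw = encode_row({"views": 1337, "title": "HBase at Gravity"})
print(raw["views"])     # b'\x00\x00\x00\x00\x00\x00\x059'
print(decode_row(raw))  # {'views': 1337, 'title': 'HBase at Gravity'}
```

The raw `views` cell is opaque until something that knows the column's type decodes it, which is the gap a typed access layer is meant to close.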

The Kiji architecture

Elsewhere, a startup called Wibidata has created an open-source framework called Kiji that aims to provide a collection of high-level APIs that should make it easier to store different data types in and develop applications on HBase. The company envisions Kiji being to HBase what the Spring Framework has become to Java over the course of the past decade.

Hadoop’s weapon for the mainstream?

But user experience aside, a lot of companies already invested in Hadoop — aside from expert users such as Facebook — are starting to see the promise of HBase and are incorporating it into their architectures.

Wibidata co-founder Christophe Bisciglia, who also co-founded Hadoop pioneer Cloudera in 2008, gave me his take on the state of HBase while discussing its role in the future of Hadoop earlier this year. “If you talk to anyone from Cloudera or any of the platform vendors, I think they will tell you that a large percentage of their customers use HBase. It’s something that I only expect to see increasing,” he explained. “… HBase is gonna be what takes Hadoop from an ETL and BI platform into a real-time application platform.”

The Cloudera Hadoop stack (Gravity uses Cloudera’s distro).

Benedetto appears to agree. He considers Hadoop as a whole incredibly important, almost on par with what Amazon Web Services did for computing resources, because it lets startups use commercial-grade open source software to do data storage and processing that previously was only available to massive web companies. “More and more … the shining star in that suite is HBase,” he said. “If I were Oracle, I’d be scared.”