CitusDB: Today, SQL on Hadoop. Tomorrow, the world!

Database startup Citus Data on Tuesday joined those trying to enable fast SQL queries on Hadoop data, but it has much larger goals. It thinks it can be the only analytic database that anyone needs, able to query data wherever it’s stored across a company’s environment — in relational databases, Hadoop, MongoDB, Amazon S3 and elsewhere.

Big data has opened companies’ eyes to the importance of analytics and alternative data stores, but combining the two often means learning new languages, using multiple tools and probably sacrificing the performance they’re used to from analytic platforms.

Citus Data’s flagship product, called CitusDB, is actually built atop PostgreSQL and its first iteration was designed for Google Dremel-like scale and speed on relational data. Thanks to a feature called “foreign data wrappers,” though, it’s able to run SQL on numerous data types (e.g., CSV, log and JSON files) that don’t comport with how Postgres formats data natively. So, while CitusDB now officially supports the Hadoop Distributed File System in addition to Postgres, it is by no means limited to them.

Matt Ocko, managing partner at Data Collective and one of Citus Data’s early investors, says the database can technically support any data source with an ODBC driver, and could even query something like log files straight from a data store. In fact, Citus is working on extending its support to MongoDB, a capability that’s in beta right now. Ocko is also particularly impressed with CitusDB’s ability to act like a fabric connecting all these data sources rather than making users query each independently and then manually join the data. He cited a demonstration in which CitusDB carried out a query that required executing a join across Postgres and Hadoop.

The other big thing about CitusDB is that it’s not just flexible but fast, too. Ocko said CitusDB has outperformed Oracle’s vaunted Exadata machine on a TPC-H benchmark test with data stored primarily on hard disk. That Postgres-Hadoop query he referenced completed in just a few seconds while running on the Amazon EC2 cloud.

CitusDB is so fast, Citus co-founder Umur Cubukcu told me, because of how it’s architected. It moves the computation to where the data is rather than trying to move data across the network, and it has some impressive load-balancing and resource-management abilities baked in. If, for example, it needs data housed on a slow-running node in order to complete a task, the software will look for that data elsewhere rather than just wait for the congested resource to free up.
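
To make that last idea concrete, here is a minimal Python sketch of one way such a fallback could work. It is not CitusDB's actual scheduler, whose internals the company did not describe. The `Node` class, the `pick_node` function, the `pending_tasks` load measure and the `busy_threshold` cutoff are all hypothetical names made up for illustration, and the sketch assumes each piece of data is copied to more than one node.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A worker node (hypothetical model, not a CitusDB type)."""
    name: str
    pending_tasks: int  # assumed stand-in for how congested the node is


def pick_node(shard_id, replicas, nodes, busy_threshold=5):
    """Choose which node should run a task that needs `shard_id`.

    `replicas` maps a shard id to the names of nodes holding a copy.
    `nodes` maps a node name to its Node record.
    The first listed copy is preferred; if its node is congested,
    the task goes to the least-loaded node that also holds the data.
    """
    candidates = [nodes[name] for name in replicas[shard_id]]
    preferred = candidates[0]
    if preferred.pending_tasks < busy_threshold:
        return preferred.name
    # Preferred node is slow: use another copy instead of waiting.
    return min(candidates, key=lambda n: n.pending_tasks).name


# Usage: shard "s1" lives on both node "a" and node "b".
nodes = {"a": Node("a", pending_tasks=9), "b": Node("b", pending_tasks=1)}
replicas = {"s1": ["a", "b"]}
chosen = pick_node("s1", replicas, nodes)
```

In this sketch the task always runs on a node that already holds the data, so nothing is shipped across the network; the only decision is which copy to use.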

In the case of Hadoop, MapReduce brings the computation to the data, too, but every job requires a scan over the entire dataset. This is why early SQL-on-Hadoop tools such as Hive are still relatively slow. Citus software engineer Carl Steinbach, who came to the company from Cloudera, said CitusDB is between 3 and 20 times faster than Hive depending on the query type.

It’s actually much faster for short queries that might be typical in an interactive environment, but he acknowledged those aren’t really what Hive was designed to do.

[Figure: Citus Hadoop architecture]

Rather, CitusDB’s real competition is the spate of SQL-on-Hadoop projects, products and startups of which it’s now a part. We’ll have a whole session dedicated to this topic at Structure: Data next month, and there isn’t enough room for everything on the market right now — Aster Data, Platfora, Cloudera (with Impala), Apache Drill, Drawn to Scale and Hadapt, to name several.

These are impressive technologies (at least in theory where they’re still under development), and Citus would be remiss to ignore them. But, aside from the ability to query multiple data sources, the company has something the others don’t, Cubukcu said: the Postgres community and all the features that community has already built into the database. Those include connectors, authentication, full-text search and PostGIS for geospatial data, all of which go beyond just running fast queries.

“When you’re talking about an enterprise-class database,” Steinbach said, “you’re talking about more than a query execution engine.”