Ask a group of data scientists about the toughest part of their job, and many will probably tell you it’s not the math but the work required to turn raw data into something their software can work with.
A new San Francisco startup called Trifacta wants to solve this problem — not just for data scientists, but for all data analysts — and is banking on some of the best minds in data management and human-computer interaction to make it happen.
The problem Trifacta is trying to solve is one that folks in the data world have taken to calling “data munging.” Wikipedia defines it as “the process of converting or mapping data from one ‘raw’ form into another format that allows for more convenient consumption of the data”, and it takes up a lot of time for anyone trying to work with new and perhaps unique types of data. All the data tools in the world — from Tableau to R to Hadoop — aren’t of any use if they simply can’t make sense of the data they’re being asked to analyze or visualize.
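For readers who have never had to do it by hand, here is a minimal Python sketch of what that kind of munging can look like. It is only an illustration and has nothing to do with Trifacta's product: the sample data, the column names and the helper functions (`parse_date`, `parse_amount`, `munge`) are all hypothetical, made up to show one "raw" export being mapped into a form that analysis tools can consume.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export: stray whitespace, mixed-case names,
# two different date formats, a currency symbol and a missing value.
RAW = """name|signup_date|spend
 Alice |2012-10-04|$1,200.50
bob|10/05/2012|300
CAROL|2012-10-06|N/A
"""

# Assumed date formats for this made-up file; real data may need more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format in turn; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def parse_amount(text):
    """Strip currency formatting; return None for unparseable values."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def munge(raw):
    """Map the raw pipe-delimited text to a list of clean, typed records."""
    reader = csv.DictReader(io.StringIO(raw), delimiter="|")
    rows = []
    for record in reader:
        rows.append({
            "name": record["name"].strip().title(),
            "signup_date": parse_date(record["signup_date"].strip()),
            "spend": parse_amount(record["spend"]),
        })
    return rows


if __name__ == "__main__":
    for row in munge(RAW):
        print(row)
```

Even in this toy case, most of the code handles quirks of one particular file rather than doing any analysis, and every new data source tends to bring its own set of quirks.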
And, like most things, data munging gets harder with scale (just ask NASA). The idea of big data is traditionally premised on the “three Vs” of volume, variety and velocity, which means companies trying to create a big data strategy are going to ask their analysts to do more, faster, and with lots of new data sources such as sensors, social media and mobile phones. Something has to give, and it’s not going to be the technology, which is getting better with every step. It’s going to be the humans trying to keep up.
“The real bottleneck is in people rather than the tools they’re using,” said Trifacta co-founder and CEO Joe Hellerstein, who is also a professor of computer science at the University of California, Berkeley, and a technical adviser to several data-focused companies including EMC Greenplum, Platfora and SurveyMonkey. Although the costs of storage and computing are becoming commoditized, he added, “the cost of human attention is not.”
Making life easier by design
However, Trifacta is betting it can ease the pain by making it easier than ever for analysts to get their data formatted and get down to business analyzing it. According to co-founder and chief experience officer Jeffrey Heer, Trifacta blends advanced concepts such as machine learning with the cutting edge in human-computer interaction in order to make the process highly intuitive but also highly intelligent, learning as it goes what type of data it might be dealing with. It should appeal to data scientists who presently write code to solve all their formatting problems, as well as to everyday users who just like to poke around at data, he said.
Building such a product takes a team with a wide variety of skills. While Hellerstein is the hardcore computer scientist, Heer is a human-computer interaction professor at Stanford University who has helped develop a number of open source data-visualization projects such as Protovis and D3.js and, along with Hellerstein, a data-munging program called Data Wrangler. CTO Sean Kandel is a former financial analyst who studied analyst behavior and productivity at Stanford.
Trifacta’s advisers include New York Times visualization specialist (and former Heer student) Michael Bostock, Cloudera co-founder and chief scientist Jeff Hammerbacher, Greylock Partners data scientist in residence DJ Patil and Tim O’Reilly.
It’s not sexy, but it’s very necessary
Although Trifacta wants its appeal to span the spectrum, its real profit center should be the vast middle: everyday data analysts. These folks and their employers will be overwhelmed by the volume, variety and expectations that come along with big data, and will pay the price in terms of wasted time and money. That’s a big addressable market, so it’s no wonder ecosystem partners and investors have already lined up behind Trifacta.
The company has raised $4.3 million from Accel Partners, as well as additional funding from X/Seed Capital, Data Collective, and angel investors Dave Goldberg, Venky Harinarayan and Anand Rajaraman. Big players within the data ecosystem, including Tableau and Cloudera, are already on board as supporters, citing the improved utility of their products when users face fewer barriers to actually doing analysis.
“Everyone’s building the [big data] freeways with Hadoop and NoSQL databases, and everyone wants access to the freeway,” said Accel partner Ping Li, who heads the firm’s Big Data Fund. He sees Trifacta as an on-ramp that lets companies actually start mining from new data sources, thus clearing up a major bottleneck. “Until that happens,” he said, “the big data wave is going to hit a wall.”
CEO Hellerstein doesn’t mind acting as the on-ramp, even if it means Trifacta likely won’t get all the glory of its somewhat more exciting peers. “People think technology is all about building rocket ships,” he said, “but technology is most useful for building things like washing machines that remove a lot of drudgery from everyday life.” For data analysts, then, Trifacta could mean no more washing their data with rocks down by the river.
Feature image courtesy of Shutterstock user Andrea Danti.