Want to start a big data company? Here are 5 things you need to know

Big data is hot as hell, but it’s also difficult as hell. The acquisition of Infochimps by CSC got me thinking about some other big data companies that have already either closed up shop or sold themselves because that series B round just wasn’t materializing. Drawn to Scale, Ravel Data and Nodeable are just top of mind, although I’m sure there are more that never even made it onto my radar.

Instead of bemoaning their fates, though, I thought I’d distill the lessons I’ve learned watching big data startups succeed and fail and offer them as guidelines for the next batch of entrepreneurs who want to try their hands. There’s a lot of explanation below, but here’s the long story, very short: Choose your battles wisely, choose your audience wisely and build a community around your technology. Big data doesn’t need another cheerleader.

1. Infrastructure is hard.

Not only is building infrastructure tools difficult, but selling them can be, too. That probably goes double when you’re talking about big data infrastructure tools like Hadoop, NoSQL databases and stream-processing systems. Customers will likely need a lot of education, and paying customers will probably expect a lot of support and product development that addresses their concerns in a timely manner.

This requires a lot of money and, usually, people with experience deploying and supporting systems of this scale. Oh, and systems integration. If you can get these things, fantastic!

As a point of reference, Greenplum had raised nearly $100 million by 2010 and it still wasn’t enough to finish the job, so the company sold to EMC. Today’s best-known big data startups have all raised nearly that much or, in the case of Cloudera, much more. Infrastructure startups working on a few million in seed money and series A have a tough road ahead of them.

But then you still have to convince companies to deploy your stuff over other options from vendors with whom they might be a little more familiar and who already have the money and people — companies like Cloudera, Hortonworks, 10gen, Amazon Web Services, IBM, Oracle.

Applications, whether they’re focused on specific workloads or industries or on broadly applicable tasks like data visualization, are just easier. They might be just as difficult to build well, but prospective customers can see right away how they could be useful or how they compare to what they’re already doing. You also can sell directly into lines of business, possibly without involving central IT at all, which means less friction and less fear. Once you start talking about adding or replacing critical systems, or putting sensitive data in a new place, things can get real hairy real fast.

2. Cloud computing is your friend.

Seriously, whether you’re selling infrastructure or applications, the cloud is just a much more efficient way to run a business if you can pull it off. That doesn’t necessarily mean hosting it with a cloud provider, but just delivering it to customers as a cloud service. You end up having more control and a deeper understanding of your product because it’s tuned to run optimally on a specific set of resources.

That means no going into customer accounts and setting it up to run on the types of servers and operating systems they’re running. There might still be some customized work to connect the service with customers’ other data sources or systems, but everyone gets pretty much the same thing. This also means most of the company’s energy can go toward product development.

Cloud computing makes it easier for potential users to play around with the product, as well. We’ve already seen, with companies from New Relic all the way up to AWS itself, how well bottom-up adoption can work. The easier it is to get started, play around and prove the value of a tool with a credit card, the easier it is to justify it as a line-item expense later on and ultimately expand its use.

Obviously, this isn’t possible in all cases, especially when you’re talking about enterprise software and large volumes of data that companies don’t want to or can’t send into the cloud. In fact, many big data startups feel the pressure from larger businesses to start offering their cloud services as traditional software. If the money is there, that can be a wise decision, but it’s definitely not one to take lightly.

3. Developers are your friends.

So cater to them. Or if you’re doing analytics, like ClearStory, Platfora, and any number of CRM and marketing applications, analysts are your friends. Either way, aiming most of the development effort and marketing effort at the target audience seems like a good idea. And CIOs don’t seem like too good a target audience.

The problem with targeting CIOs rather than developers is that it’s possible to get caught up speaking in buzzwords and answering questions about possibly overblown concerns when you could be signing up actual users. Targeting developers (or analysts or systems administrators) was the tactic that worked for numerous cloud startups, and also for pure software plays like Splunk and Tableau.

One thing I think Infochimps can do better, for example, is push its Wukong and Ironfan technologies to a developer audience. The former lets you write MapReduce and streaming jobs as Ruby scripts. The latter is a Chef-based tool for easily configuring, deploying and managing big data resources. Especially if it were available as a classic credit-card cloud service, there would be an audience for that (think Mortar Data but not tied to Amazon’s Elastic MapReduce).

I shouldn’t have to dig for this info.

I think there’s more than a little similarity between what Infochimps is doing and what Continuuity is doing (including being seemingly forced to move from the cloud into customers’ data centers), but Continuuity is all about developers. They’re called out in its tagline and they have easy ways to get started with the product. That means the company can work the larger deals while hopefully accumulating a large, committed user base in the background.

4. Put the data scientists front and center.

This is as much a marketing exercise as it is a sales tool, I think, but it’s important nonetheless. Data scientists are the ones who show people what’s possible with their data and your platform. They’re also the ones people want to hear at conferences.

Almost everyone is sold on Hadoop and NoSQL already. There’s little need to debate their merits anymore, and there’s probably less need to reinforce the volume, variety, velocity cliché. Talking about configuring and integrating systems is important, but interesting to a small audience unless you’re talking about doing it at a massive scale.

There are many reasons Cloudera gets more press and more speaking slots than its Hadoop competitors, but one of them is Jeff Hammerbacher. Don’t just talk about data and the infrastructure for storing or processing it — show me what kinds of products I can build with it and what types of analyses I can run on it. At the very least, prove that you’re thinking about data in a broader context than just the newest way to sell me something.

Jeff Hammerbacher at Structure: Data 2013. (c) Albert Chau itsmebert.com

5. Open source matters, but only if you make it matter.

Almost every big data startup relies on open source software. Some of it they’ve borrowed — stuff like Hadoop, Storm or various databases — and some of it they’ve created. In many cases, it’s a combination of both, where they’ve added functionality onto something like HBase, for example. The reason those projects are so popular is because of community.

I’ve never tried to foment an open source movement; I assume it’s hard work. But I know that placing code on GitHub and leaving it doesn’t accomplish a whole lot other than being able to say you’re giving back. Facebook and Google might release code as a favor, but most startups probably shouldn’t be so arrogant as to think their development teams are the best and there’s nothing they can learn.

After all, the goal of open source is often to create a community of people working on the same code in order to improve it. It seems like you have to get out and promote the technology at every turn and explain why it’s important so more people want to hack on it. This relates back to the point about luring developers, but going the freemium route when possible will get more people experimenting with the product so they can see if it’s worth their energy and, ultimately, their money.

I can’t count the number of startups who have open sourced their code, but the companies that are out there pushing their projects and building communities really stand out. We’re talking vendor startups such as Neo Technology with Neo4j, Concurrent with Cascading (see disclosure) and 10gen with MongoDB, and even end-user companies such as Twitter with its pet projects like Storm and Mesos. They’ve built something, they’ve built an open source community around it, and now they’re reaping the rewards.

Disclosure: Concurrent is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media. Om Malik, the founder of Giga Omni Media, is also a venture partner at True.

Feature image courtesy of Shutterstock user Lisa S.

GigaOM
