People often ask me where the smart money is in big data. I often tell them that’s a foolish question, because I’m not an investor — but if I were, I’d look to software as a service.
There are two primary reasons why, the first of which is obvious: Companies are tired of managing applications and infrastructure, so something that optimizes a common task using techniques they don’t know on servers they don’t have to manage is probably compelling. It’s called cloud computing.
The other reason is that the big part of big data really is important if you want a clear picture of what’s happening in any given space. While no single end-user company can (or likely would) address search-engine optimization, for example, by building a massive store composed of data from hundreds or thousands of companies as well as the entire web, a cloud service dedicated to that specific task can.
From web security to systems management, we’re already seeing how centralized data stores provide SaaS companies a broad view into what’s happening that can then be filtered down to serve each individual customer’s specific situation. BloomReach, a SaaS startup that helps companies optimize web-page content, is another good example of this principle in action.
How do you say, “cotton maxi dress”?
Ideally, BloomReach Head of Marketing Joelle Kaufman told me, the company wants to help customers ensure they get found in web searches by making sure they’re not invisible (buried deep down), irrelevant (not saying anything meaningful on their sites) or incompatible (not speaking their consumers’ language). On Tuesday, the company announced a new feature called Continuous Quality Management, which lets customers continuously monitor their pages to ensure they’re still featuring the right products and the right terminology. It’s the latest addition to a seemingly useful service that’s built atop a big data foundation few — if any — of its customers would ever attempt to build themselves.
BloomReach is able to help companies optimize their sites because it’s constantly crawling the web in order to figure out how everyone else is describing their content, laying out their pages and structuring their links. Running on the Amazon Web Services cloud, BloomReach runs more than 1,000 Hadoop jobs a day that process about 5 terabytes of data and a billion data points about users’ site behavior. With the latter, co-founder and CTO Ashutosh Garg explained, the company is trying to figure out who’s visiting sites, what they’re doing, how long they’re spending there and how they’re related in terms of behavior.
“You need to have the right amount of data and from the right places before we can do anything with it,” he said. “… It’s a massive machine learning problem.”
When you consider all the possible ways something could be described or formatted, the scale of the problem becomes more evident. Simple semantic analysis like associating “desk” and “table” is easy, Garg explained, but what if someone wants a lightweight camera and you only have its exact weight listed without any indication of how it compares to other options? What if people searching for “smartphones” really mean “Android phones,” but you’re top-loading your results with BlackBerry phones and Windows phones?
Another of Garg’s hypotheticals has to do with consumers’ presentation biases. If, for example, they’re looking at a lot of websites that look the same or focus on the same things (e.g., megapixels for digital cameras), they’ll expect to see the same things from every site.
10 nonillion possibilities: Choose 1.
From a sheer numbers perspective, things get even hairier when you’re trying to determine the relationship between any two pages in order to figure out the best path for links to take. Garg said this is what computer scientists call an NP-complete problem, which in practice means the time it takes to compute an exact answer grows much faster than the amount of content you’re analyzing. So, for example, analyzing 40 pages doesn’t take 10 times as long as analyzing 4 pages, but more like 100 times longer.
Actually, BloomReach CEO Raj De Datta gave me another example of this problem when we spoke in early 2012. Here’s how I described it then:
[I]f a company wants to display just 1,000 products across 100 pages, De Datta explained, there are 10-to-the-28th-power (10 octillion) possibilities for how to do that. When it comes time to describe those products, there are 10-to-the-30th-power (10 nonillion) possibilities.
If a website has a million pages, Garg said, “it will take you longer than the life of the universe to solve that problem.”
Where this type of problem arises, BloomReach turns to Monte Carlo simulations, a favorite technique of physicists and Wall Street quants. The method involves running lots of simulations over large data sets in order to determine approximate results in a reasonable time frame. (And if all this isn’t enough computer science and cloud infrastructure for you, I suggest attending our Structure conference in June, which features a who’s who list of speakers, including Google’s Jeff Dean, Facebook’s Jay Parikh and Netflix’s Adrian Cockcroft.)
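For readers who want a feel for the Monte Carlo idea, here is a minimal sketch in Python. It is not BloomReach’s method, which the company didn’t detail. It assumes a made-up `affinity` table scoring how well each product fits each page, and the names `score` and `monte_carlo_search` are hypothetical. Instead of checking every possible layout, it samples random ones and keeps the best it has seen.

```python
import random


def score(assignment, affinity):
    """Sum the (hypothetical) affinity of each product for the page it sits on."""
    return sum(affinity[product][page] for product, page in enumerate(assignment))


def monte_carlo_search(affinity, num_pages, num_samples, seed=0):
    """Sample random product-to-page assignments and keep the best one seen."""
    rng = random.Random(seed)
    num_products = len(affinity)
    best_assignment, best_score = None, float("-inf")
    for _ in range(num_samples):
        # One "simulation": place every product on a randomly chosen page.
        candidate = [rng.randrange(num_pages) for _ in range(num_products)]
        candidate_score = score(candidate, affinity)
        if candidate_score > best_score:
            best_assignment, best_score = candidate, candidate_score
    return best_assignment, best_score


if __name__ == "__main__":
    # Assumed toy data: 20 products, 5 pages, random affinities in [0, 1).
    data_rng = random.Random(42)
    num_products, num_pages = 20, 5
    affinity = [
        [data_rng.random() for _ in range(num_pages)]
        for _ in range(num_products)
    ]
    # More samples cost more time but can only improve the best score found.
    for samples in (10, 1_000, 10_000):
        _, best = monte_carlo_search(affinity, num_pages, samples)
        print(samples, round(best, 3))
```

The toy scoring function is simple enough to optimize exactly, so it only illustrates the trade-off: each extra batch of random samples buys a better approximation without ever enumerating every layout.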
Different queries, different pages
Things get even trickier when you’re trying to change the content of web pages in real time as people are searching for things. This isn’t the best method for organic search, where pages need to stay pretty consistent with the indexed versions, but it can be ideal in situations such as paid search and mobile. There are millions of ways to segment buyers, Garg explained, and how accurately you assess their intent and display your content can make all the difference. Whether someone is a new or repeat visitor often matters, as does whether someone is price-conscious (e.g., the query included “cheap”) or perhaps searching for a particular brand.
Around the holidays, the company actually realized something interesting: The bounce rate on queries for things like “gifts for dad” or “gifts for co-workers” was pretty high, but so was the conversion rate. The time to conversion was relatively fast, as well. It turns out, Garg explained, that people don’t like to overthink certain gifts too much, so if something is presented in a visually appealing manner and is within their price range, they’ll buy.
But creating these types of models involves more than meets the eye. For all the talk about machine learning (and machines do a majority of the work for BloomReach), people also play a critical role. A person might know better than a machine whether something was likely purchased as a gift, Garg explained, or they might spot the offensive content on the T-shirt the machine decided was ideal.
“Humans are really good at creativity, thinking through stuff,” he said.
Smart humans are also good at knowing when they’re overmatched, which is why SaaS is so valuable in the big data era. CMOs could try doing what BloomReach or similar companies such as DataPop are doing, or they could pay someone to do it much better. Guess which route the smart ones will take.
Feature image courtesy of Shutterstock user Andrea Danti.