Google explains how more data means better speech recognition

A new research paper out of Google describes in some detail the data science behind the company’s speech recognition applications, such as voice search and adding captions or tags to YouTube videos. And although the math might be beyond most people’s grasp, the concepts are not. The paper underscores why everyone is so excited about the prospect of “big data” and also how important it is to choose the right data set for the right job.

Google has always been a fan of the idea that more data is better, as exemplified by Research Director Peter Norvig’s stance that, generally speaking, more data trumps better algorithms (see, e.g., the 2009 paper he co-authored, “The Unreasonable Effectiveness of Data”). Although some hair-splitting does occur about the relative value (or lack thereof) of algorithms in Norvig’s assessment, it’s pretty much an accepted truth at this point and drives much of the discussion around big data. The more data your models have from which to learn, the more accurate they become — even if they weren’t cutting-edge to begin with.

No surprise, then, that more data is also better for training speech-recognition systems. The researchers found that larger data sets and larger language models (here’s a Wikipedia explanation of the n-gram type involved in Google’s research) result in fewer errors predicting the next word based on the words that precede it. Discussing the research in a blog post on Wednesday, Google research scientist Ciprian Chelba gives the example that a good model will attribute a higher probability to “pizza” as the next word than to “granola” if the previous two words were “New York.” When it comes to voice search, his team found that “increasing the model size by two orders of magnitude reduces the [word error rate] by 10% relative.”
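To make the idea concrete, here is a minimal sketch of a toy trigram model in Python. It is not Google’s actual system, which trains n-gram models over hundreds of billions of words; the corpus, function names and example queries below are hypothetical, invented purely for illustration of how counting word sequences lets a model prefer “pizza” over “granola” after “new york.”

```python
# A toy trigram language model: estimate P(next word | previous two words)
# by counting how often each word follows each two-word context.
from collections import defaultdict

def train_trigram_model(corpus):
    """Count next-word occurrences for every two-word context in the corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words) - 2):
            context = (words[i], words[i + 1])
            counts[context][words[i + 2]] += 1
    return counts

def next_word_probability(counts, context, word):
    """P(word | context) as a simple relative frequency (no smoothing)."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

# Hypothetical corpus standing in for billions of real queries.
corpus = [
    "best new york pizza places",
    "new york pizza delivery near me",
    "new york granola recipe",
]
model = train_trigram_model(corpus)
ctx = ("new", "york")
print(next_word_probability(model, ctx, "pizza"))    # 0.666...
print(next_word_probability(model, ctx, "granola"))  # 0.333...
```

For a sense of what the quoted “10% relative” means: a relative reduction is measured against the starting error rate, so it would take a hypothetical 15% word error rate down to about 13.5%, not down to 5%.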

The real key, however — as any data scientist will tell you — is knowing what type of data is best to train your models, whatever they are. For the voice search tests, the Google researchers used 230 billion words that came from “a random sample of anonymized queries from google.com that did not trigger spelling correction.” However, because people speak and write prose differently than they type searches, the YouTube models were fed data from transcriptions of news broadcasts and large web crawls.

“As far as language modeling is concerned, the variety of topics and speaking styles makes a language model built from a web crawl a very attractive choice,” they write.

This research isn’t necessarily groundbreaking, but it helps drive home why topics such as big data and data science get so much attention these days. As consumers demand ever smarter applications and more frictionless user experiences, every last piece of data and every decision about how to analyze it matters.

Feature image courtesy of Shutterstock user watcharakun.

