From Amazon’s top data geek: data has got to be big — and reproducible

Much has been done to bring big data closer to the people who need it. The advent of public cloud infrastructure has dramatically reduced the cost of collecting, maintaining and processing vast amounts of data. The next frontier is making that data reproducible, said Matt Wood, principal data scientist for Amazon Web Services, at GigaOM’s Structure:Data 2013 event Wednesday.

In short, it’s great to get a result from your number crunching, but if the result is different the next time out, there’s a problem. No self-respecting scientist would think of submitting the findings of a trial or experiment without being able to show that the results hold up across multiple runs.
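Wood didn’t walk through specific tooling on stage, but the idea is easy to illustrate. One basic reproducibility practice, sketched below purely as an illustration (the scikit-learn setup here is ours, not anything Wood demonstrated), is to pin the random seed so a modeling run returns the same numbers every time:

    # Minimal sketch of a reproducible model run: fixing the random seed
    # so repeated runs produce identical results (assumes scikit-learn).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    SEED = 42  # one fixed seed makes the whole pipeline repeatable

    # Generate a synthetic dataset deterministically
    X, y = make_classification(n_samples=1000, n_features=10, random_state=SEED)

    # Train a model whose internal randomness is also seeded
    model = LogisticRegression(random_state=SEED, max_iter=1000)
    model.fit(X, y)

    # The same score prints on every run; drop the seeds and it may drift
    print(model.score(X, y))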

“Much of today’s statistical modeling and predictive analytics is beautiful but unique. It’s impossible to repeat, it’s snowflake data science,” Wood told attendees in New York. “Reproducibility becomes a key arrow in the quiver of the data scientist.”

Making sure that people can reproduce, reuse and remix their data provides a “tremendous amount of value,” Wood noted.

For more on Wood, check out this Derrick Harris post.

And check out the rest of our Structure:Data 2013 coverage here; a video of the session is embedded below.
