How two scientists are using the New York Times archives to predict the future

Researchers at Microsoft and the Technion-Israel Institute of Technology are creating software that analyzes 22 years of New York Times archives, Wikipedia and about 90 other web resources to predict future disease outbreaks, riots and deaths — and hopefully prevent them.

The new research is the latest in a number of similar initiatives that seek to mine web data to predict all kinds of events. Recorded Future, for instance, analyzes news, blogs and social media to “help identify predictive signals” for a variety of industries, including financial services and defense. Researchers are also using Twitter and Google to track flu outbreaks.

from "Mining the Web to Predict Future Events," Horvitz and Radinsky, http://research.microsoft.com/en-us/um/people/horvitz/future_news_wsdm.pdf

Eric Horvitz of Microsoft Research and Kira Radinsky of the Technion-Israel Institute of Technology describe their work in a newly released paper, “Mining the Web to Predict Future Events” (PDF). For example, they examined the way that news about natural disasters like storms and droughts could be used to predict cholera outbreaks in Angola. Following those weather events, “alerts about a downstream risk of cholera could have been issued nearly a year in advance,” they wrote.

Horvitz and Radinsky acknowledge that epidemiologists look at some of the same relationships, but “such studies are typically few in number, employ heuristic assessments, and are frequently retrospective analyses, rather than aimed at generating predictions for guiding near-term action.” They outline the advantages that software has over humans in this area:

  • Learning: Software “has the ability to learn patterns from large amounts of data, can monitor numerous information sources, can learn new probabilistic associations over time, and can continue to do real-time monitoring, prediction, and alerting on increases in the likelihoods of forthcoming concerning events.”
  • Tireless researching: Software, with its “long tentacles into historical corpora and real-time feeds,” can dig up data that humans might never find because they’re too focused on “knowledge that is easily discovered in studies or available from experts.”
  • Lack of bias: Software can assist “when inferences from data run counter to expert expectations,” or when “there is a significantly lower likelihood of an event than expected by experts based on the large set of observations and feeds being considered in an automated manner.”
  • Greater access to news: “A system monitoring likelihoods of concerning future events typically will have faster and more comprehensive access to news stories that may seem less important on the surface (e.g., a story about a funeral published in a local newspaper that does not reach the main headlines), but that might provide valuable evidence in the evolution of larger, more important stories (e.g., massive riots).”

One problem the researchers faced in developing their software model is that tragic events in poor African countries are often not widely reported, so the software has few historical cases to learn from. They taught it to generalize somewhat: “Instead of considering only ‘Rwanda cholera outbreak,’ an event with a small number of historical cases, we consider more general events of the form: ‘[Country in Africa] cholera outbreak.’ We turn to world knowledge available on the Web…[that] maps Rwanda to the following concepts: Republics, African countries, Landlocked countries, Bantu countries, etc.”
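To make the generalization step concrete, here is a minimal Python sketch of the idea, not the researchers' actual code. The concept list for Rwanda comes from the quote above; the entries for Angola and Bangladesh, the toy event history and the function names are all assumptions made up for illustration.

```python
from collections import Counter

# Illustrative concept map. The Rwanda entry follows the paper's quote;
# the other entries are assumptions added for this example.
CONCEPTS = {
    "Rwanda": ["Republics", "African countries",
               "Landlocked countries", "Bantu countries"],
    "Angola": ["Republics", "African countries"],
    "Bangladesh": ["Republics", "Asian countries"],
}


def generalize(country, event):
    """Return one generalized label per concept the country maps to."""
    return [f"[{concept}] {event}" for concept in CONCEPTS.get(country, [])]


def count_generalized(history):
    """Count how often each generalized event appears in the history."""
    counts = Counter()
    for country, event in history:
        counts.update(generalize(country, event))
    return counts


# Made-up history, only to show how sparse cases pool together.
history = [
    ("Rwanda", "cholera outbreak"),
    ("Angola", "cholera outbreak"),
    ("Angola", "cholera outbreak"),
    ("Bangladesh", "cholera outbreak"),
]
counts = count_generalized(history)
```

In this toy history Rwanda has a single recorded outbreak, but the generalized event “[African countries] cholera outbreak” pools it with the Angolan cases, which gives a model more examples to learn from.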

Horvitz and Radinsky also taught the software what to ignore: It “was able to recognize that the drought experienced in New York City on March 1989, published in the NYT under the title: ‘Emergency is declared over drought’ would not be associated with a disease outbreak…The system estimates that, for droughts to cause cholera with high probability, the drought needs to happen in dense populations (such as the refugee camps in Angola and Bangladesh) located in underdeveloped countries that are proximal to bodies of water.”
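The conditions in that quote can be written out as a simple filter. The Python sketch below is a hand-coded stand-in for what the system is described as estimating from data, so it should not be read as the researchers' method. The class, the field names and the attribute values assigned to each report are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class DroughtReport:
    """A drought news event with hypothetical, hand-assigned attributes."""
    location: str
    dense_population: bool
    underdeveloped: bool
    near_water: bool


def cholera_risk_flag(report):
    """Flag a drought only when all three quoted conditions hold."""
    return (report.dense_population
            and report.underdeveloped
            and report.near_water)


# Attribute values are assumptions chosen to mirror the article's examples.
nyc_1989 = DroughtReport("New York City", dense_population=True,
                         underdeveloped=False, near_water=True)
angola_camp = DroughtReport("Refugee camp, Angola", dense_population=True,
                            underdeveloped=True, near_water=True)

flags = {r.location: cholera_risk_flag(r) for r in (nyc_1989, angola_camp)}
```

Under these assumed attributes the New York drought is not flagged and the Angolan refugee camp is, matching the distinction the researchers describe. A learned model would output a probability rather than a yes-or-no flag.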

“I truly view this as a foreshadowing of what’s to come,” Horvitz told the MIT Technology Review. “Eventually this kind of work will start to have an influence on how things go for people.” He said Microsoft isn’t commercializing the research yet, but that it will continue, and he wants to get more “data further back in time.”

GigaOM