Researchers mine 2.5M news articles to prove what we already know

A group of British researchers has published the results of a data mining experiment that analyzed nearly 2.5 million articles from 498 newspapers on criteria such as topic selection, writing style and sensationalism, and found — no surprise — that tabloids are the easiest to read and reporters don’t often cover women’s sports. If these findings sound predictable, that was exactly what the researchers were aiming for.

The experiment’s techniques point to a future where researchers are spared the grunt work of poring through thousands of pages of news or watching hundreds of hours of programming, and can instead focus their energy on explaining what they find. As the researchers note in their paper, the real ramifications of this research lie more in what it accomplished than in what it found.

Namely, they demonstrated that with new big data techniques such as machine learning and natural-language processing, it’s possible to accurately analyze millions of pieces of content spanning almost a year without requiring humans to read and score it all. Choosing hypotheses with predictable results made it easier to verify that the automated analysis was accurate.

Here’s how they explain the promise of their work and some potential use cases, the latter of which they go into in far more detail in the paper:

“[I]t allows researchers to focus their attention on a scale far beyond the sample sizes of traditional forms of content analysis. Rather than spending precious labour on the coding phase of raw data, analysts could focus on designing experiments and comparisons to test their hypotheses, leaving to computers the task of finding all articles of a given topic, measuring various features of their content such as their readability, use of certain forms of language, sources etc. (just a few of the tasks that can now be automated).

… Our approach — apart from freeing scholars from more mundane tasks — allows researchers to turn their attention to higher level properties of global news content, and to begin to explore the features of what has become a vast, multi-dimensional communications system.”
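To make that concrete, here is a minimal sketch of how a couple of the tasks the authors mention, finding articles on a given topic and measuring their readability, can be automated. This is not the authors’ actual pipeline: the sample articles, the keyword list and the use of the Flesch reading ease formula are illustrative assumptions.

```python
import re

def flesch_reading_ease(text):
    """Approximate Flesch reading ease: higher scores mean easier-to-read text.
    Syllables are estimated by counting vowel groups, a common rough heuristic."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    if not sentences or not words:
        return 0.0
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

def mentions_topic(text, keywords):
    """Crude topic filter: does the article mention any of the topic keywords?"""
    lowered = text.lower()
    return any(k in lowered for k in keywords)

# Hypothetical mini-corpus standing in for millions of scraped articles.
articles = [
    {"outlet": "Tabloid A", "text": "The star scored a stunning goal last night..."},
    {"outlet": "Broadsheet B", "text": "The committee's deliberations on fiscal policy..."},
]

womens_sport_keywords = ["women's football", "women's cricket", "wnba"]

for article in articles:
    score = flesch_reading_ease(article["text"])
    covers = mentions_topic(article["text"], womens_sport_keywords)
    print(article["outlet"], round(score, 1), "covers women's sport:", covers)
```

The real study would rely on trained classifiers and proper NLP rather than keyword matching, but the shape of the work is the same: a computer scores every article on each feature, and the humans only look at the aggregated results.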

Put more simply: This research underscores the common big data maxim that knowing the right questions to ask is now the biggest challenge in gleaning insights from data. It’s increasingly easy to get data, analyze it and visualize it, so humans really just need to hypothesize and be able to explain the results. (This also seems like a good place to plug ScraperWiki as a great source for gathering potential research data from websites.)

Creating the workflows for gathering and analyzing the data as the authors suggest still isn’t child’s play (it might take some assistance from the computer science department), but it’s a lot better than the alternative.
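As a rough illustration of what the gathering half of such a workflow can look like (the URL and HTML handling here are assumptions for the sake of example, not anything from the paper), collecting raw text can be as simple as fetching a page and stripping its markup before handing it to an analysis step like the readability sketch above.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect visible text from a page, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def fetch_article_text(url):
    """Download a page and return its visible text as one string."""
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# Hypothetical usage; any article URL would do, and the text could then be
# scored with a readability function like the one in the earlier sketch.
# text = fetch_article_text("https://example.com/some-news-story")
```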

Feature image courtesy of Shutterstock user Ruggiero Scardigno.

