How the AP got a hold of its big, old data

Holding onto millions of pieces of archived content it still wanted to monetize, the Associated Press turned to a NoSQL database. Specifically, it turned to MarkLogic, a non-relational database designed for storing and accessing large volumes of XML-based content (like the stuff the AP has lying around) that has already earned quite a following among media companies.
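For a feel of what querying a system like this looks like, here is a minimal sketch against MarkLogic's REST search endpoint. The host, port, credentials and response fields are assumptions for illustration, not details of the AP's deployment:

```python
# Minimal sketch: a keyword search against a MarkLogic REST instance.
# The URL and credentials below are placeholders, not real settings.
import requests
from requests.auth import HTTPDigestAuth

MARKLOGIC_URL = "http://localhost:8000/v1/search"  # hypothetical instance

def search_archive(term, page_size=10):
    """Run a word query over the documents in the database."""
    resp = requests.get(
        MARKLOGIC_URL,
        params={"q": term, "pageLength": page_size, "format": "json"},
        auth=HTTPDigestAuth("admin", "admin"),  # placeholder credentials
    )
    resp.raise_for_status()
    return resp.json()["results"]

for hit in search_archive("election"):
    print(hit["uri"])
```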

What the AP wanted to do, VP of information management Amy Sweigert told me, was build an application that would let it search through its mountains of archived content so it could better analyze that information. Internally, the AP wants to better understand how much content it’s publishing on any given topic and in what formats (e.g., stories, photos, videos), but it also wants to deliver custom data sets to business-to-business customers based on whatever their needs might be.
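The topic-and-format rollup Sweigert describes amounts to a grouped count over the archive. Here is a toy illustration; the field names and sample records are hypothetical:

```python
# Illustrative only: tallying archived items by topic and format.
# The records here are made up for the example.
from collections import Counter

archive = [
    {"topic": "elections", "format": "story"},
    {"topic": "elections", "format": "photo"},
    {"topic": "sports", "format": "video"},
]

by_topic_format = Counter((item["topic"], item["format"]) for item in archive)
for (topic, fmt), count in by_topic_format.most_common():
    print(f"{topic}/{fmt}: {count}")
```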

According to Sweigert, the AP had to go with a non-relational database for a variety of reasons, chief among them scale and freedom from schemas. Her team had actually built a relational database, but as content volumes grew (the new system holds about 120 million pieces of content), the old database had to go. The team wanted the flexibility to perform new types of searches without complicated queries and, more importantly, without having to reconfigure the database to support new search methods.
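To see why schema freedom matters here, consider a store of XML documents: supporting a new kind of search means querying a new element, not migrating a table schema. A small sketch with made-up element names:

```python
# Sketch of the schema-free point: with XML documents, a new search
# is just a new element lookup. Element names are hypothetical.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<item><headline>Storm hits coast</headline>"
    "<byline>J. Doe</byline><slug>weather</slug></item>"
)

def find_text(document, element):
    """Look up any element by name; no schema migration required."""
    node = document.find(element)
    return node.text if node is not None else None

print(find_text(doc, "byline"))  # a search the original design anticipated
print(find_text(doc, "slug"))    # a new one, added without reconfiguring anything
```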

Sweigert said many large publishers are moving toward an XML-centric data model, if they’re not already there, because the format makes it so much easier to work with old content that doesn’t necessarily have metadata associated with it. What’s more, she said, the AP is actually using MarkLogic to help add metadata to some of that old content.
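One way to picture that enrichment step (purely a hypothetical sketch, not the AP's actual pipeline) is a pass that derives subject tags from a document's own text and writes them back in as metadata:

```python
# Hypothetical sketch of metadata backfill: wrapping derived tags into
# an XML document that lacked them. The element names and the simple
# keyword rule are illustrative only.
import xml.etree.ElementTree as ET

doc = ET.fromstring("<item><body>The senator announced a new bill.</body></item>")

def add_subject_tags(document, keywords):
    """Attach a <metadata> block with any keywords found in the body."""
    body = document.findtext("body", default="").lower()
    meta = ET.SubElement(document, "metadata")
    for word in keywords:
        if word in body:
            ET.SubElement(meta, "subject").text = word
    return document

add_subject_tags(doc, ["senator", "bill", "budget"])
print(ET.tostring(doc, encoding="unicode"))
```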

In that regard, the value proposition of the AP’s new database sounds similar to that of publishing analytics tools like Parse.ly, which launched earlier this year and already has some big-name clients under its belt. Parse.ly analyzes clients’ web content based on the text rather than the metadata, which means publishers without strict metatagging procedures or crack data analysts can still get deep insights into what topics are driving traffic.
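That text-first approach can start as simply as mining the words themselves. This is not Parse.ly's actual method, just a toy illustration of pulling topic candidates out of raw text when no metadata exists:

```python
# Toy illustration: treat the most frequent content words in a piece
# of text as topic candidates. Real analytics tools go far beyond this.
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "on", "for"}

def topic_candidates(text, top_n=3):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

print(topic_candidates(
    "The election results shifted the election forecast for the senate."
))
```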

However they do it, the rationale is the same: find a way to keep making money off of years’ worth of archived content, either directly or indirectly. The direct route is probably akin to what the AP is doing with its business partners, while the indirect route is the same story as any analytics effort: use older content to help identify trends that can influence future decisions on both content and products.

Image courtesy of Flickr user DBduo Photography.
