LinkedIn University Pages are a case study in building big data apps the right way

Professional network LinkedIn rolled out its new University Pages feature last week to much fanfare, but the pages are as much a matter of smart engineering as they are of smart business strategy. On Monday, LinkedIn Engineering published a blog post explaining the technology behind University Pages, and it underscores the importance of understanding your products, your data and the tools you need to process it.

Of course the new product started with an idea, but after that, blog post author Josh Clemm noted, the company’s data scientists spent years combing through member profiles, gathering and standardizing data about 23,000 colleges and universities. They built graph data models for each school, with the school as the primary node and things like related schools and LinkedIn-member alumni as secondary ones. That’s why you’re now able to visit any school’s LinkedIn profile (at least the ones that have been updated to the new format) and see the same information, such as where alumni work and who attended.
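
To make that data model concrete, here is a minimal sketch of a school-centered graph record. It is an illustration only, not LinkedIn's schema: the class name `SchoolNode`, its fields and the helper `where_alumni_work` are all hypothetical, and the sample data is made up.

```python
from dataclasses import dataclass, field


@dataclass
class SchoolNode:
    """Primary node: one standardized school record (hypothetical shape)."""
    name: str
    # Secondary nodes: other schools linked to this one.
    related_schools: set[str] = field(default_factory=set)
    # Secondary nodes: alumni, stored here as member id -> current employer.
    alumni: dict[str, str] = field(default_factory=dict)


def where_alumni_work(school: SchoolNode) -> dict[str, int]:
    """Count alumni per employer, the kind of summary a school page shows."""
    counts: dict[str, int] = {}
    for employer in school.alumni.values():
        counts[employer] = counts.get(employer, 0) + 1
    return counts


# Made-up sample data for illustration.
school = SchoolNode(name="Example University")
school.related_schools.add("Sample College")
school.alumni["member_a"] = "Acme"
school.alumni["member_b"] = "Acme"
school.alumni["member_c"] = "Globex"

print(where_alumni_work(school))
```

Because every school is stored in the same shape, the same query works for any of them, which is what lets each page show the same kinds of information.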

Notable alumni Ben Horowitz. The name rings a bell ...

Under the covers, University Pages runs atop some serious big data technologies, many of which LinkedIn built itself. Those graphs are all stored in LinkedIn’s new flagship database technology, EspressoDB. Hadoop powered much of the work involved in getting all that data into a standard format. It’s also responsible for generating page information such as “similar schools” and “notable alumni,” which run periodically as batch jobs and then dump results into LinkedIn’s Voldemort NoSQL database for fast access by web users (as well as into EspressoDB to help populate the schools’ graphs).
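
The batch-then-serve pattern described above can be sketched in a few lines. This is a rough illustration under stated assumptions, not LinkedIn's pipeline: a plain Python function stands in for the Hadoop job, a dict named `serving_store` stands in for the key-value store, and scoring "similar schools" by Jaccard overlap of alumni employers is an assumption made for the example (the blog post's actual method is not described here).

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_schools_job(employers_by_school: dict[str, set[str]],
                        top_n: int = 2) -> dict[str, list[str]]:
    """Batch step: precompute a 'similar schools' list for every school."""
    results: dict[str, list[str]] = {}
    for school, employers in employers_by_school.items():
        scored = [
            (jaccard(employers, other_employers), other)
            for other, other_employers in employers_by_school.items()
            if other != school
        ]
        # Highest score first; ties broken alphabetically for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        results[school] = [name for score, name in scored[:top_n] if score > 0]
    return results


# Stand-in for the serving database: precomputed results keyed by school.
serving_store: dict[str, list[str]] = {}

# Made-up input: the employers of each school's alumni.
data = {
    "School A": {"Acme", "Globex"},
    "School B": {"Acme", "Globex", "Initech"},
    "School C": {"Umbrella"},
}

# The periodic batch run dumps its results into the store...
serving_store.update(similar_schools_job(data))

# ...so a page view becomes a single key lookup, with no computation.
print(serving_store["School A"])
```

The design point is that the expensive comparison across all schools happens offline, while the web tier only ever does cheap reads.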

The whole University Pages architecture, which Clemm explains in detail.

Two other open source technologies — one called Bobo and another called Zoie (which LinkedIn created) — power search for the new university profiles. LinkedIn’s Databus system streams updates into the search systems to ensure they always have the most up-to-date data.
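
As a rough sketch of how a change stream can keep a search index current, the toy index below applies ordered update events as they arrive. It does not use the Bobo, Zoie or Databus APIs: the `SearchIndex` class, its methods and the idea of tagging each event with an increasing sequence number are assumptions made for illustration.

```python
from collections import defaultdict


class SearchIndex:
    """Toy inverted index kept fresh by applying ordered change events."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}                  # school id -> indexed text
        self.postings: dict[str, set[str]] = defaultdict(set)  # term -> school ids
        self.last_seq = 0                               # newest event applied so far

    def apply(self, seq: int, school_id: str, text: str) -> None:
        """Apply one change event; replayed or stale events are ignored."""
        if seq <= self.last_seq:
            return
        old_text = self.docs.get(school_id)
        if old_text is not None:
            # Remove the terms of the previous version of this document.
            for term in old_text.lower().split():
                self.postings[term].discard(school_id)
        self.docs[school_id] = text
        for term in text.lower().split():
            self.postings[term].add(school_id)
        self.last_seq = seq

    def search(self, term: str) -> set[str]:
        """Return the ids of schools whose text contains the term."""
        return set(self.postings.get(term.lower(), set()))


index = SearchIndex()
index.apply(1, "s1", "Example University")
index.apply(2, "s2", "Sample College")
index.apply(3, "s1", "Example Institute")   # an update to s1 arrives later
index.apply(2, "s2", "Stale Replay")        # an old event replayed; ignored

print(index.search("example"), index.search("university"))
```

Tracking the last applied sequence number lets the consumer resume after a restart and skip anything it has already seen.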

We actually profiled LinkedIn’s data engineering team, its strategy and several of its key technologies, in a February feature. One of its leaders, Bhaskar Ghosh, will be part of our big data master panel at Structure: Europe next month.

Ghosh's diagram of LinkedIn's architecture, from that February feature.

But the main takeaway here isn’t that LinkedIn is great or that University Pages are great. In fact, as someone officially (I think) finished with university education, I could take them or leave them. The point is that there’s a right way to build “big data” applications, and web companies seem to understand this better than most. You see similar strategies in place at Netflix, Facebook, Google and elsewhere.

They’ve built virtuous circles where infrastructure systems, data scientists, web developers and product managers all enable each other to do their jobs better. If one piece is weak, everyone suffers. And most importantly, the product suffers.

GigaOM