New open source tech Marathon wants to make your data center run like Google’s

There’s a new open source project called Marathon that’s designed to let users intelligently run a wide variety of their applications and services — Hadoop, Storm and even standard web apps — on the same set of servers. Marathon comes out of an early-stage startup called Mesosphere that’s aiming to build a data center operating system on top of the Mesos cluster-management software that’s a key part of Twitter’s infrastructure. The company’s founders are former Airbnb engineers Florian Leibert (who also worked at Twitter) and Tobias Knaup.

Marathon is just one piece of the Mesosphere puzzle, Leibert told me during a recent discussion, but it’s an important one that’s appealing even on its own. Trends such as cloud computing and big data are moving organizations away from consolidation and into situations where they might have multiple distributed systems dedicated to specific tasks.

Before delving into Marathon, though, here’s a little recent history.

Marathon running Chronos, Hadoop and a web app

Whither grid computing?

Back in the days before cloud computing was all the rage, terms like grid computing and cluster computing were more strongly associated with the term “on-demand.”

The general idea was pretty simple: organizations — often large banks or research institutions — had lots of servers and they wanted to use them as efficiently as possible. That usually meant lumping those servers into a pool (or a cloud, if you will) and using a resource scheduler of some sort to make sure that every application or workload got the resources it needed when it needed them, more or less. Instead of one smaller set of servers for each application, one big set of servers hosted them all — this is actually how Platform Computing (now part of IBM) positioned its private cloud software several years ago.

However, the concept never really caught on in mainstream enterprises, which largely opted for server virtualization as the method for consolidating their applications onto smaller server footprints and which tend to think of Amazon EC2-style virtual server provisioning when they think of private clouds. But thanks to the advent of new, popular distributed computing platforms like Hadoop and the general admiration of how webscale companies run their IT environments, the idea of these general-purpose grids or clusters is coming back.

That’s in part because managing all those different environments is becoming awfully difficult. Take a distributed web app running here, a Hadoop cluster running there, and maybe Storm or Spark running across some servers over in the corner, and all of a sudden you have three clusters that each need to be maintained. That type of complexity usually doesn’t fly at places like Google, Facebook and Twitter, where efficiency-and-automation is the name of the game.

The Mesos architecture

So these companies created, or adopted, their own software for dealing with with the growing number of workloads they need to support. At Google, it’s Borg (although the company has published a research paper about a yet-to-be-deployed system called Omega). At Twitter, it’s Mesos. At Facebook, it’s Corona (which was designed primarily for Hadoop workloads, but which the company wants to extend to support multiple frameworks a la Apache YARN).

Cade Metz at Wired Enterprise wrote a good article in March detailing the births of Mesos, Borg and Omega, and the people behind them.

It’s also worth noting that software like Mesos isn’t relegated to bare metal servers, even if that’s where the largest companies prefer to run it. Airbnb is a heavy Mesos user, and they use it to manage workloads running entirely in Amazon Web Services. There’s talk in some corners of the OpenStack community about the need for cluster-management software to sit underneath the cloud platform, ostensibly making OpenStack just another distributed framework running atop a general-purpose cluster.

Why Marathon matters

Mesos, however, is just the cluster manager, which means it’s the software that actually isolates the running workloads from each other (e.g., Hadoop from Storm). It still needs additional tooling to let engineers get their workloads running on the system and to manage when those jobs actually run. Otherwise, some workloads might consume all the resources, or important workloads might get bumped by less-important workloads that happen to require more resources.

Twitter built a tool called Aurora (which it plans to open source) to handle this, while Airbnb built a tool called Chronos. Mesosphere’s Leibert and Knaup built Chronos while at Airbnb, and Marathon is a “meta framework” that actually makes both Mesos and Chronos more useful. It runs alongside Mesos and provides high availability for running workloads and services by letting users add resources, and automatically failing services over to other nodes should one of its current nodes fail.

architecture

“Right now, if you’re the ops person, you, in the middle of the night, get paged and have to start it on another machine,” Leibert explained. Marathon is designed to solve this problem.

Rather than Chronos running on top of Mesos to schedule jobs, Marathon lets Chronos run inside Mesos. This way, Chronos is just another thing that Marathon manages. Chronos might launch and schedule the jobs it’s good at (e.g., Hadoop jobs or other short-lived tasks), while Marathon directly manages Chronos and longer-running web services. Marathon can even run multiple instances of Marathon.

A cluster running three distinct applications.

The same cluster, after one node died.

We’ll likely hear a lot more about the general idea behind Marathon and Mesos, as well as Omega, Corona and other similar efforts at our Stucture: Europe conference later this month in London. We’ll have engineering executives past and present from companies such as Twitter and Facebook, as well as Google’s EMEA cloud boss Barak Regev.

And at a broader level, projects like Marathon tie into the greater move toward software-defined networks, storage and even data centers. Companies are trying to replace expensive gear with commodity gear powered by smart software, and being able to automate cluster management and failover is certainly part of that equation.

Related research and analysis from GigaOM Pro:
Subscriber content. Sign up for a free trial.

Cloud and data first-quarter 2013: analysis and outlook
Infrastructure Q1: Cloud and big data woo enterprises
Cloud computing 2013: how to navigate without a map

GigaOM

Whither grid computing?

Why Marathon matters

Related Posts: