Netflix open sources tool for making cloud services play nice

Netflix, it seems, is to cloud computing what Google and Facebook are to distributed systems, generally. Today, Netflix has open sourced its latest technology for keeping its cloud-hosted applications running — a set of libraries, called Hystrix, that is designed to manage interactions between the myriad services that comprise the company’s distributed architecture. If you’re building service-oriented architectures in the Amazon Web Services cloud, it might be worth a look.

Netflix Engineer Ben Christensen explained Hystrix thusly in a blog post on Monday:

Hystrix is a library designed to control the interactions between these distributed services providing greater tolerance of latency and failure. Hystrix does this by isolating points of access between the services, stopping cascading failures across them, and providing fallback options, all of which improve the system’s overall resiliency.
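To make the three ideas in that description concrete (isolation, stopping cascades, fallbacks), here is a minimal sketch in plain Java. It is not the Hystrix API: the class name `GuardedCall`, its constructor parameters and its thresholds are hypothetical, chosen only to illustrate the pattern. Each call to a dependency runs on a separate thread pool with a timeout, a cap on concurrent calls keeps a slow dependency from consuming every thread, and after a run of failures the guard stops calling the dependency for a while and answers with the fallback instead.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical illustration of the pattern; this is NOT the Hystrix API.
class GuardedCall<T> {
    private final ExecutorService pool;   // calls run here, off the caller's thread
    private final Semaphore permits;      // caps concurrent calls to this dependency
    private final long timeoutMillis;     // how long the caller is willing to wait
    private final int failureThreshold;   // consecutive failures before the circuit opens
    private final long retryAfterNanos;   // how long the circuit stays open
    private int consecutiveFailures = 0;
    private long openedAtNanos = 0L;

    GuardedCall(int maxConcurrent, long timeoutMillis, int failureThreshold, long retryAfterMillis) {
        // Daemon threads so a stuck call never keeps the JVM alive.
        ThreadFactory daemon = r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        };
        this.pool = Executors.newCachedThreadPool(daemon);
        this.permits = new Semaphore(maxConcurrent);
        this.timeoutMillis = timeoutMillis;
        this.failureThreshold = failureThreshold;
        this.retryAfterNanos = TimeUnit.MILLISECONDS.toNanos(retryAfterMillis);
    }

    T execute(Callable<T> primary, Supplier<T> fallback) {
        // Open circuit or saturated dependency: answer immediately with the fallback.
        if (isOpen() || !permits.tryAcquire()) {
            return fallback.get();
        }
        Future<T> future = pool.submit(() -> {
            try {
                return primary.call();
            } finally {
                permits.release(); // released only when the call really finishes
            }
        });
        try {
            T result = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            recordSuccess();
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        } catch (ExecutionException | TimeoutException e) {
            // The slow call is left to finish on its own thread; a fuller
            // version would also interrupt it.
            recordFailure();
            return fallback.get();
        }
    }

    private synchronized boolean isOpen() {
        // After the retry window passes, one trial call is let through again.
        return consecutiveFailures >= failureThreshold
                && System.nanoTime() - openedAtNanos < retryAfterNanos;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAtNanos = System.nanoTime();
        }
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }
}
```

A caller might wrap a recommendations lookup as `guard.execute(() -> fetchRecommendations(user), () -> defaultList)`, where both names are placeholders. The point of the design is that the caller always gets an answer within the timeout, and a struggling dependency is given room to recover rather than being hammered with more requests.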

Hystrix actually stems from earlier work to add resilience to the Netflix API, the means by which many customer-facing applications access the services they need to run. As Christensen explained in a February 2012 blog post, services are distributed across thousands of instances in AWS, and if there are problems with those services — such as high latency or failed connections between them — it can wreak havoc on the Netflix API and seriously affect the performance of all the applications that depend on it.

Source: Netflix (https://speakerdeck.com/benjchristensen/performance-and-fault-tolerance-for-the-netflix-api-august-2012)

And, as he notes in that February post, “Intermittent failure is guaranteed with this many variables, even if every dependency itself has excellent availability and uptime … Thus, it is a requirement of high volume, high availability applications to build fault tolerance into their architecture and not expect infrastructure to solve it for them.”
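A quick back-of-the-envelope calculation shows why that holds. The numbers below are hypothetical and do not come from this article: suppose a request touches 30 dependencies, each available 99.99 percent of the time, and suppose their failures are independent. The chance that all of them are up at once is then the product of the individual availabilities.

```java
// Hypothetical numbers for illustration; assumes dependency failures are independent.
class AvailabilityMath {
    // Probability that every one of `dependencies` services is up at the same time.
    static double combinedAvailability(double perDependency, int dependencies) {
        return Math.pow(perDependency, dependencies);
    }

    // Expected hours of degraded service over a period of the given length.
    static double expectedDowntimeHours(double availability, double periodHours) {
        return (1.0 - availability) * periodHours;
    }

    public static void main(String[] args) {
        double combined = combinedAvailability(0.9999, 30);
        System.out.printf("combined availability: %.4f%%%n", combined * 100);
        System.out.printf("expected downtime per 30-day month: %.2f hours%n",
                expectedDowntimeHours(combined, 30 * 24));
    }
}
```

Under those assumptions the combined figure comes out at roughly 99.7 percent, or a little over two hours of trouble in a 30-day month, even though every individual service looks excellent on paper.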

Distributed systems are hard work to build and manage (ask anyone at Yahoo, Google or Facebook), and building distributed, service-oriented applications on top of those systems is probably no less difficult. Netflix has an even more novel challenge because it opted to host all of its applications and services in the cloud, which provides some great tools for maximizing uptime but also adds new layers of complexity to application architecture. The company’s focus on building resilient apps has been core to its ability to survive most of AWS’s cloud outages with little or no significant downtime.

In fact, Obama for America CTO Harper Reed told me during a post-election interview that Netflix tools and techniques helped the president’s AWS-hosted applications stay up and running even during three outages between late June and Nov. 6. Netflix has also open-sourced its Eureka load-balancing technology, its Edda dynamic querying tool, its Asgard management console and its lauded Chaos Monkey for testing application resilience.

It’s no surprise, then, that Netflix is something of a shining star at the AWS re:Invent user conference in Las Vegas this week (CEO Reed Hastings will take the stage along with numerous engineers), prompting some to refer to it jokingly as a Netflix technology conference. Not that it’s an insult to anybody: Amazon and other infrastructure-as-a-service providers rent virtual servers, networks and management tools, but it takes cutting-edge users to engineer apps that can make the most of them.


GigaOM