Why you should expect more online outages but less downtime

Google’s webmail service Gmail was down for 18 minutes last week after a “routine update” broke the service for a few minutes. The search giant reported that it conducted a routine update of its load balancing software between 8:45 AM PT and 9:13 AM PT and after the problems were detected managed to quickly roll back the buggy code. But this didn’t stop some people from questioning why Google would roll out a software update at what are peak email-checking hours on the West Coast.

The answer is that most of the coders behind today’s popular web sites and services are deploying their code when it’s ready — not at some pre-determined point when downtime may not be noticed. It’s called continuous code deployment or some variation on that theme and everyone from Facebook and Netflix to smaller services do it. So while it may occasionally cause a few blips, those blips should be shorter and less catastrophic.

There’s no good time for downtime in an always-on world

James Urquhart (Cisco), Luke Kanies (Puppet Labs ), Jesse Robbins (Opscode) - Structure 2011The rational for doing these sorts of continuous deployments vary, but most fall into four categories. The first is that there really is no good time for downtime anymore, but if you break it, wouldn’t you rather have happy and awake staff on the clock ready to fix it? Jesse Robbins, the chief community officer of Opscode points out that even good times for downtime can vary across customers.

“One of Opscode’s earliest customers is a popular dating website, and their peak traffic is on Friday night when people are exchanging phone numbers to go on dates… the exact opposite of peak time for a CRM,” says Robbins.

Plus, as Robert Treat, COO of OmniTI, a consulting firm that helps web sites scale out their business points out, sometimes deploying at off hours means little because the site won’t actually break until it experiences peak loads. For many of these sites using continuous code deployments scaling its users is what caused the need for new code in the first place. Until the site experiences that load they don’t know if the fixes worked or not.

Just in time code-deployment

markimbiacoThe second category is economic. When you wait to deploy your code in these massive quarterly installs, you’re deciding to avoid the efficiencies that the new code could bring to the site today. This thinking is more common to companies who view their web operations as a fundamental cost of doing business as opposed to some sort of cost center that keeps email up and running.

“Code that has been written but not yet deployed is very similar to inventory,” says Mark Imbriaco of Github. “You’ve paid the cost to develop the software, but are not yet getting any of the benefit from it. Shipping that code to production sooner means that you and your customers can benefit from it much faster. This is a pretty serious competitive advantage for companies that deliver features faster than their competitors.”

Routine code deployments makes for happy developers

Thinking of code deployment as a Big Fat Hairy Deal adds layers of stress and process to getting it into production, but if it’s a routine part of the job, developers can try things out and deploy code and move on with their lives. This reduces stress around the deployment, but it also frees their minds up for new problems and jobs, notes Johns Allspaw of Etsy. Plus he says. “Fast and frequent feedback is what allows for developers to be productive. Developers hate being bored.”

Punishing your site makes it stronger

Aditya Agarwal Dropbox Adrian Cockcroft Netflix Alexei Rodriguez Evernote Corporation

(L to R) Alexei Rodriguez – VP of Operations, Evernote Corporation; Adrian Cockcroft – Director, Architecture, Netflix ; Aditya Agarwal – VP Engineering, Dropbox
(c)2012 Pinar Ozger pinar@pinarozger.com

The third school of thought is popularized by Netflix and is basically an invitation to break things because a system that is so fragile that one code upgrade brings it down, clearly isn’t resilient enough. In many ways Netflix takes the idea of building out an architecture that’s dependent on a genius IT professional and his version of delicate pieces and crazy glue and flips it on its head. Instead of a fragile model car Netflix is building the Tonka trucks of IT –ready to take a few glitches and keeping on serving up videos.

“Systems that contain and absorb many small failures without breaking and get more resilient over time are “antifragile” as described in [Nassim] Taleb’s latest book,” explains Adrian Cockcroft of Netflix. “We run chaos monkeys and actively try to break our systems regularly so we find the weak spots. Most of the time our end users don’t notice the breakage we induce, and as a result we tend to survive large-scale outages better than more fragile services.”

That’s the rationale behind those software updates the might cause a momentary web service outage or two. As the devops movement spreads, more businesses will likely find reasons to move toward continuous code deployment. Plus, as Allspaw of Etsy points out, the tools to test code and instantly monitor the effects of new deployments are getting better and faster. That means if you accidentally break a site, the dev teams notices it faster and fixes it. So maybe there are more outages, but they shouldn’t last as long.


GigaOM