Devops, complexity and anti-fragility in IT: Risk and anti-fragility

I am taking a few posts to explore the rise of new software development and operations models, and why these models are critical to the enterprise. Today, I want to explore the risk economics of software development and the concept of “anti-fragility.”

Enterprise IT organizations have spent decades trying to create systematic approaches to control and (hopefully) eliminate disruption in computing operations. The standard approach to date has been to strictly control change. Now, concepts like continuous integration and deployment, modularized application systems, and “fail fast” agile processes encourage continuous change.

Embracing anti-fragility

So why would anyone want to promote an approach that encourages constant change, when failure in the form of outages or breaches or large-scale processing errors exacts such a heavy toll on businesses? The short answer is that some application domains require it, but that’s also a bit glib. Instead, let me bring in the concept of “anti-fragility,” as coined by Nassim Nicholas Taleb in his book “Antifragile: Things That Gain From Disorder.”

I explained the gist last week:

“Anti-fragility is the opposite of fragility: as Taleb notes, where a fragile package would be stamped with ‘do not mishandle,’ an anti-fragile package would be stamped ‘please mishandle.’ Anti-fragile things get better with each (non-fatal) failure.”

Anti-fragile systems benefit from variability and can take advantage of deviations from the “normal” to ultimately gain value. They behave in such a way that failures due to change exact a small cost, but successful change drives exponentially higher value, so the system gains overall. Taleb argues this is only achieved by keeping the scope of each activity small enough that the downside risk is manageable (and results in strengthening the system), and that any gains can be sustained over time.
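To make that payoff asymmetry concrete, here is a toy Python simulation. It is my illustration, not Taleb’s; the win rate and payoff figures are invented for the sketch:

```python
import random

def run_experiments(n=1000, cost=1.0, win_rate=0.2, payoff=20.0):
    """Toy model of anti-fragile economics: each small change loses
    `cost` if it fails but returns `payoff` if it succeeds. With a
    bounded downside and an outsized upside, the portfolio gains
    even though most individual changes fail."""
    total = 0.0
    for _ in range(n):
        if random.random() < win_rate:
            total += payoff  # the occasional big win
        else:
            total -= cost    # the frequent, small, survivable loss
    return total

random.seed(1)
print(run_experiments())  # expected value per try: 0.2*20 - 0.8*1 = 3.2
```

The point of keeping each change small is that `cost` stays bounded: no single failure can take down the whole portfolio, which is exactly the condition Taleb says anti-fragility depends on.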

Jez Humble, co-author of the book “Continuous Delivery,” did a very good job of analyzing anti-fragility from the perspective of software delivery:

Taleb shows why the traditional approach of operations – making change hard, since change is risky — is flawed: ‘the problem with artificially suppressed volatility is not just that the system tends to become extremely fragile; it is that, at the same time, it exhibits no visible risks. . . . These artificially constrained systems become prone to Black Swans. Such environments eventually experience massive blowups. . . . catching everyone off guard and undoing years of stability or, in almost all cases, ending up far worse than they were in their initial volatile state’ . . .

This is a great explanation of how many attempts to manage risk actually result in risk management theatre — giving the appearance of effective risk management while actually making the system (and the organization) extremely fragile to unexpected events. It also explains why continuous delivery works. The most important heuristic we describe in the book is ‘if it hurts, do it more often, and bring the pain forward.’ The effect of following this principle is to exert a constant stress on your delivery and deployment process to reduce its fragility so that releasing becomes a boring, low-risk activity.

Today’s IT models don’t demonstrate that behavior, at least at the project level. As Humble noted, most IT projects are highly fragile — a few relatively small errors during development or operations can send the entire project crashing down at an inopportune time. IT projects (and individual project releases, for that matter) tend to:

  • have giant scopes of hundreds or thousands of requirements.
  • be managed through a series of organizational silos with weak feedback loops between the silos.
  • introduce new operations vulnerabilities with each release, due to dependence upon manual process steps, and highly context-specific, fragile “scripting.”

Change, therefore, is artificially suppressed, or at least intensely controlled. This just makes projects more fragile in the long term, especially from the perspective of meeting constantly changing business needs.

Approaching anti-fragility through devops

It doesn’t have to be this way.

One solution to that problem is highlighted today in the form of devops or “noops”-driven software organizations like Netflix and Etsy. The software approach these organizations take is one of releasing small changes as often as possible, with heavy reliance on automation, and — this is very important — measuring the resulting effect on metrics important to the business stakeholders.

Oh, and they can quickly reverse or replace stuff that doesn’t work out as expected. Which happens fairly often. Which leaves them no worse off than they were before they tried the change. See the anti-fragility yet?
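As an illustration of that release-and-reverse pattern, here is a minimal feature-flag sketch in Python. The flag store, function names and metric check are all hypothetical; this is not Netflix or Etsy code:

```python
# A minimal sketch of releasing a small change behind a flag and
# reverting it if measurement says it hurt. All names here
# (FLAGS, new_checkout, get_error_rate) are hypothetical.

FLAGS = {"new_checkout_flow": False}

def old_checkout(cart):
    return f"old:{cart}"   # the known-good path

def new_checkout(cart):
    return f"new:{cart}"   # the small, reversible change

def checkout(cart):
    impl = new_checkout if FLAGS["new_checkout_flow"] else old_checkout
    return impl(cart)

def release_and_watch(get_error_rate, threshold=0.01):
    """Enable the change, measure its business effect, and flip it
    back if the metric degrades. Rollback costs one assignment, so a
    failed experiment leaves the system no worse off than before."""
    FLAGS["new_checkout_flow"] = True
    if get_error_rate() > threshold:
        FLAGS["new_checkout_flow"] = False
        return "reverted"
    return "kept"

print(release_and_watch(lambda: 0.05))  # metric degraded -> "reverted"
print(checkout("cart42"))               # back on the old path: "old:cart42"
```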

However, in order to get to this state of low-risk, constant experimentation, these organizations have had to employ skills, tools, processes and practices that are significantly different than the change-management techniques of the past. The most obvious qualities of their devops systems are:

  • Automation enforces certain practices, such as running various tests with every build or build environment (e.g., running regression tests before moving from dev to staging; a sketch of such a gate follows this list).
  • Culture enforces practices, such as Etsy’s norm of allowing developers to own their mistakes without fear of reprisal, which encourages tribal knowledge of how to avoid such mistakes in the future. Culture also dictates that dev, ops, security, business and other stakeholders all work together over the entirety of the application lifecycle.
  • Prudent measurement of all elements of the processes, tools and applications provides the key feedback necessary to continually strive for improvement.
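Here is what the first and third of those qualities might look like in practice: a hypothetical promotion gate in Python. The stage names, test command and metrics sink are illustrative, not any specific vendor’s API:

```python
# Hypothetical gate between dev and staging: run the regression suite
# automatically and record measurements so the feedback loop has data.
import subprocess
import time

def promote(build_id: str) -> bool:
    """Run the regression tests before a build moves from dev to
    staging, and record the outcome."""
    start = time.time()
    result = subprocess.run(
        ["pytest", "tests/regression", "-q"],  # any test runner works here
        capture_output=True,
    )
    passed = result.returncode == 0
    record_metric("gate.duration_s", time.time() - start)
    record_metric("gate.passed", int(passed))
    if passed:
        deploy_to_staging(build_id)  # automation, not a manual checklist
    return passed

def record_metric(name, value):
    print(f"{name}={value}")  # stand-in for a real metrics pipeline

def deploy_to_staging(build_id):
    print(f"deploying {build_id} to staging")  # stand-in for a real deploy

# usage: promote("build-123")
```

The gate encodes the practice in code rather than in a runbook, which is what makes it enforceable on every single build.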

Devops and anti-fragility are by no means synonymous, however. Devops can be implemented in such a way that it doesn’t exhibit the trait of being anti-fragile — like when high developer turnover results from always putting the “best people” on the “next great thing,” and knowledge or culture is lost as a result.

Anti-fragility can also be achieved without devops, though I’m not aware of another consistent methodology for doing so. Nonetheless, anti-fragility is a trait to be strived for, not a methodology itself, so there are options for achieving that trait.

No silver bullet

And, lest you think we’ve hit upon yet another “drop everything and change the way you do things” approach to enterprise IT, I would caution against applying anti-fragility religion where the investment wouldn’t pay off.

Given the difference between devops and most “construction-method” approaches to IT that we see today, for example, I would argue that enterprises should adopt devops and address anti-fragility first for those IT projects that would benefit from continuous change, such as marketing applications, business process automation and so on. Less pressing are systems like core ERP databases and infrastructure that don’t often need to change.

You can’t crowbar a change-averse technology into a change-driven methodology. However, over time you might be able to adopt a few of these practices to lessen risk when change is necessary in those systems.

Here’s a heuristic I’m experimenting with: the more that differentiation and adaptation matter to the solution at hand, the more anti-fragility should be strived for. Undifferentiated activities (such as running data center facilities or core SAP packages) should strive for resiliency, but perhaps adopt automation, etc., over time as part of a more traditional approach to software project control.

This heuristic is backed up, somewhat, by the work of my friend Simon Wardley, who has one of the most comprehensive theories of the evolution of enterprise activities from innovation through commoditization. Activities at different stages of that spectrum benefit from different practices, and IT is no exception to that rule.

In my next post, I’ll go into more detail about this spectrum of practices, define the “stability-resiliency” tradeoff and explain how enterprise IT can navigate it. In the meantime, this is an opportunity for you to express your thoughts on the subject of new IT models and old, either here in the comments or via Twitter, where my handle is @jamesurquhart.

For more discussion on devops and next-generation systems management, check out this panel discussion (that includes James Urquhart) from Structure 2012:


Watch live streaming video from gigaomstructure at livestream.com

Feature image courtesy of Shutterstock user Claudio Divizia; name tag image courtesy of Shutterstock user alexmillos.

