Devops, complexity and anti-fragility in IT: An introduction

Some time ago, my friend Phil Jaenke and I (and a few others) got into a debate on Twitter. The discussion started as an exploration of the changing nature of software development, operations and change control, and whether those changes are good or bad for the future of software resiliency. It resulted in a well-articulated post from Phil, arguing that you can’t have resiliency without stability, and vice versa.

However, as I started trying to outline a response, I realized that there was a lot of ground to cover. The core of Phil’s argument comes from his background as a hardware and systems administration expert in traditional IT organizations. And with that in mind, what he articulates in the post is a reasonable way to see the world.

However, cloud computing is changing things greatly for software developers, and these new models don’t take kindly to strict control models. From an application-down perspective, Phil’s views are highly suspect, given the immense success that companies like Etsy and Netflix (despite their recent problems) have had with continuous deployment and engineering for resiliency.

Reconciling the two views of the world means exploring three core concepts required to understand why a new IT model is emerging, but not necessarily replacing everything about the old model.

The first of these concepts is devops, which earned its own three-part series from me a few years ago, and has since spawned its own IT subculture. The short, short version of the devops story is simple: modern applications (especially web-scale or so-called “big data” apps) require developers and operators to work together to create software that will work consistently at scale. Operations specifications have to be integrated into the application specifications, and automation has to be delivered as part of the deployment.

In this model, developers and operators co-develop and very often even co-operate, hence the term devops.

The second concept is one that I spoke about often in 2012: complex adaptive systems. I’ve defined that broad concept in earlier posts, but what matters here is the stability-resiliency tradeoff, a concept derived from the study of complex adaptive systems. Understanding that tradeoff is critical to understanding why software development and operations practices are changing.

The third concept is that of anti-fragility, a term introduced by Nassim Nicholas Taleb in his recent book Antifragile: Things That Gain from Disorder. Anti-fragility is the opposite of fragility: as Taleb notes, where a fragile package would be stamped with “do not mishandle,” an anti-fragile package would be stamped “please mishandle.” Anti-fragile things get better with each (non-fatal) failure.

Although there are elements of Taleb’s commentary that I don’t love (the New York Times review linked to above covers the issues pretty well), the core concept of the book is a critical eye-opener for those trying to understand what cloud computing, build automation, configuration management automation and a variety of other technologies are enabling software engineers to do today that was prohibitively expensive even 10 years ago.

So, over the next few weeks, I will try to explore these concepts in greater detail. Along the way, I will endeavor to address Jaenke’s concerns about the ways in which these concepts can be misapplied to some IT activities.

Please join me for this exploration. Use the comments section to push back when you think I am off-base, acknowledge when what I say matches what you have experienced, and, above all, share how you think your organization and career will change one way or another.

As always, I can be found on Twitter as @jamesurquhart.

Feature image courtesy of Shutterstock user Sinisa Botas.


GigaOM