Resiliency and reliability: the devil is in the details

Your cloud-based application is a globe-trotter. You’ve got customers using it in Spain and California at the same time. With that kind of distribution, your application and infrastructure design likely includes some level of self-healing and application-based resiliency.

But that doesn’t mean you can stop worrying about hardware redundancy or high-availability data centers. As cloud applications get more complex, you must add complexity at the hardware and infrastructure level. The key is that this complexity will be under your control, which will help your application stay resilient.

Resiliency isn’t just a buzzword anymore

The very complexity of modern distributed or even vertically integrated solutions (i.e., Amazon vs. VCE vs. legacy IT architectures) means that you can’t accommodate every risk vector. Adrian Cockcroft of Netflix is a brilliant cloud use and application design strategist, and he has many other incredibly skilled people working with him, yet Netflix was still unable to account for every variable that might occur in the complex system (Amazon) that supports its own complex system, resulting in an outage that affected millions of customers on Christmas Eve.

I’ve been designing, building, implementing, and operating critical infrastructure for enterprises for over twenty years now. In those twenty years I’ve uncovered a large number of truths about the IT world, and one of them is that you can’t assume anything. Every time a server is installed, a switch is configured, or an update is applied to an application, a change has occurred. The change might have been tested under every known scenario (impossible), yet still fail because of a heretofore unimagined use case. In my experience, it’s often the combination of activities occurring at the worst possible moment that causes problems.

Go hybrid, young developer. Go hybrid.

" Photo courtesy of Flickr user piermario

<a title="Attribution-NoDerivs License" Photo courtesy of Flickr user piermario

When building for scale, as Amazon and Google do, complexity is also magnified. And since you can’t avoid scale if you want to deliver to a large population of customers, you must plan for resiliency and availability.

There has been considerable discussion about resiliency lately, and in some cases the underlying concerns around the management and performance of complex systems have been well documented. One of the best discussions on the topic of resiliency was by Richard Cook.

Mr. Cook points out, as I’ve written in the past, that your operations practices and people are often the keys to your success. I would also argue that when you consider your design, locations, people and process, you should include a strategy for system reliability beyond mere application resilience.

Many of the companies I work with have global organizations with tens of thousands of internal and/or external customers. In all cases these companies have decided that one cloud isn’t the right approach. A hybrid approach, while more complex in the beginning, offers significant long-term upside beyond higher reliability, including avoiding lock-in and being able to deliver service to any region regardless of the available cloud provider. These same companies have also decided that carrying the risk of their most important IT assets going down through acts of God is unacceptable.

Any large system carries the risk that a single problem can cause a cascading effect on its functionality or availability, as has been the case with Amazon and Google, among others. One of the best ways to avoid this cascade effect is to ensure that some of your applications, or portions of each application, run on different platforms. Different platforms help ensure that problems propagating through any one cloud platform won’t affect applications in the others.
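To make that concrete, here is a minimal sketch in Python of a client or routing layer failing over between two independently hosted copies of the same service. The endpoint URLs, health-check path and timeout are illustrative assumptions, not any particular vendor’s API.

```python
import urllib.request
import urllib.error

# Hypothetical endpoints for the same service, hosted on two
# independent cloud platforms (illustrative URLs only).
ENDPOINTS = [
    "https://app.cloud-a.example.com/health",
    "https://app.cloud-b.example.com/health",
]

def first_healthy_endpoint(endpoints, timeout=2.0):
    """Return the first endpoint that answers its health check,
    so a problem cascading through one platform doesn't take the
    service offline everywhere."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # this platform looks unhealthy; try the next one
    return None  # every platform failed -- time to page a human

if __name__ == "__main__":
    target = first_healthy_endpoint(ENDPOINTS)
    print(f"Routing traffic to: {target}")
```

In practice this logic usually lives in a global load balancer or DNS failover service rather than in application code, but the principle is the same: no single platform is a single point of failure.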

What apps need a hybrid cloud?

As I’ve said before, there is no one-size-fits-all cloud or IT strategy. Each application or service has to be evaluated against the traditional trade-off of outage risk vs. loss of business. The vast majority (70 percent or more) of the applications in an enterprise IT org don’t require anything approaching 100 percent uptime. Of the 30 percent that are critical, many will support the business just fine with a 99.95 percent uptime design. What you have left is the 5-15 percent of applications that the business really considers “mission critical”.
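For context on what those availability figures mean in practice, the conversion to allowed downtime is simple arithmetic. A quick sketch in Python (the targets are chosen purely for illustration) shows that a 99.95 percent design allows roughly 4.4 hours of downtime per year:

```python
# Convert an availability target into allowed downtime per year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

for availability in (0.999, 0.9995, 0.9999):
    downtime_hours = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability:.2%} uptime -> about {downtime_hours:.1f} "
          f"hours of downtime per year")
```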

It’s also true that in some cases just 5 percent of applications could account for 50 percent or more of your company’s IT work output. So the number of applications matters less than their scope and importance. This differentiation of applications is critical to helping IT teams decide which applications need the resiliency of more than one cloud platform.

High reliability doesn’t have to double your infrastructure and people costs, as you might expect from historical efforts in legacy IT. Providing redundant environments, or splitting your applications between two environments, is a real option today when using cloud. That’s not to say there won’t be extra costs, but those costs relate more to startup investment than to ongoing hardware and environment management and replacement.

In other words, if your environment is big enough or critical enough, spending some upfront effort and cash to enable a hybrid cloud environment (i.e., private and public cloud, or two different public environments) can easily be justified. In fact, there’s an online gaming company that is using the “any cloud” approach as part of its business model. The use of multiple clouds allows it to grow quickly where and when it needs to while supporting latency and availability requirements.

Complexity you control versus complexity you can’t

Once you’ve taken the steps necessary for adding another cloud provider and made multi-cloud work for a critical app, it’s much less of a hurdle to make it work for less critical environments. I also want to make clear that you can distribute one application across several clouds or split multiple applications across several clouds. Spreading multiple applications across several different clouds means lower reliability for any one application, but higher reliability for your applications as a whole. In other words, you’ll have avoided the risk of having all your apps down at the same time.
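A back-of-the-envelope calculation illustrates the trade-off. It assumes, purely for the sake of the sketch, that the two platforms fail independently and with the same made-up probability; real outage rates and correlations will differ.

```python
# Back-of-the-envelope reliability math, assuming platform failures
# are independent (a simplification -- real outages can be correlated).
p_down = 0.001  # assumed chance a given platform is down at any moment

# One critical app replicated across two platforms:
p_replicated_app_down = p_down ** 2        # both platforms down at once

# Several apps, each single-homed on its own platform:
p_any_app_down = 1 - (1 - p_down) ** 2     # at least one app is down
p_all_apps_down = p_down ** 2              # every app down simultaneously

print(f"Replicated app fully down:          {p_replicated_app_down:.6%}")
print(f"At least one single-homed app down: {p_any_app_down:.4%}")
print(f"All single-homed apps down at once: {p_all_apps_down:.6%}")
```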

I know it seems counterintuitive to create more complexity to protect yourself from complexity, but it’s not. The complexity you’re creating can be managed with the appropriate up-front effort, and it’s under your control. On the other hand, the complexity you accept by running your critical app on a single cloud is completely out of your control.

So go forth and build sustainable, supportable, highly reliable application environments across providers; in the long run it’s the smarter and safer thing to do.

Mark Thiele is executive VP of Data Center Tech at Switch, the operator of the SuperNAP data center in Las Vegas. Thiele blogs at SwitchScribe and at Data Center Pulse, where he is also president and founder. He can be found on Twitter at @mthiele10.

