There is no such thing as risk due to “the cloud”

Problems with Amazon’s cloud computing service over the weekend underscored how businesses and consumers are increasingly exposed to unforeseen risks as they embrace life in the cloud, Quentin Hardy reports in Monday’s New York Times.

The risk

The risk to consumers isn’t that the cloud exposes them to additional risk that they wouldn’t ordinarily face. The risk is that businesses have disaster recovery policies that are unworkable, take too long to execute or are otherwise flawed.

Boards are specifying that disaster recovery be part of companies’ plans, but without adequately weighing the risk of failure due to a datacenter outage against today’s demands for “dial-tone” quality service. Technical leaders, in turn, take the short view, assuming that they will never have to perform an actual recovery, or that an outage would be such a lightning-strike event that people will understand when it happens.

They probably won’t.

In today’s fully-connected world, you need to be fully-connectable with almost no exceptions to be perceived as professional and worthy of a consumer’s business.

“The Cloud” isn’t what you’re actually worried about

Customers aren’t exposed to some new cloud-based risk that didn’t exist previously. They’re exposed to the same old risks that concentrations of compute power have had since the first computers were built and installed. The difference is that today, customers rely on those services in a very personal way.

Every big ecommerce system has had this risk since its inception. Fires burn. The power goes out. Tornadoes hit buildings. Earthquakes break infrastructure. Server racks get unplugged to plug in the vacuum. Your systems simply need to be ready for this eventuality.

Companies with any kind of expectation of service need to be ready for this to happen and completely focused on continually reducing downtime until their systems are fully resilient to regional datacenter crises.

The cloud actually helps businesses solve these problems by providing similar services in multiple locations. Amazon has several datacenters, and businesses could easily be taking advantage of those services to yield increased reliability.
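To make the idea concrete, here is a minimal sketch of failing over between locations. It is not Amazon’s API: the region names, the `demo_handler` stand-in and the `call_with_failover` helper are all hypothetical, and a real system would also have to deal with data replication, which this ignores.

```python
from typing import Callable, Sequence


class AllRegionsFailed(Exception):
    """Raised when no region could serve the request."""


def call_with_failover(regions: Sequence[str], handler: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response."""
    errors = {}
    for region in regions:
        try:
            return handler(region)
        except ConnectionError as exc:
            # Remember the failure and move on to the next region.
            errors[region] = exc
    raise AllRegionsFailed(f"all regions failed: {sorted(errors)}")


def demo_handler(region: str) -> str:
    """Hypothetical stand-in for a real service call; 'east' is down."""
    if region == "east":
        raise ConnectionError("east datacenter unreachable")
    return f"served from {region}"


print(call_with_failover(["east", "west"], demo_handler))
```

With “east” simulated as down, the request is served from “west”. The customer never needs to know an outage happened.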

The challenge

From the moment you have core customers that you can’t fail, you need to have a plan that executes in minutes, seconds or microseconds. If you’re planning on hours, you’re losing customers, unless your competitors are in the same datacenter you are and have the same kind of recovery plans. You’re not really trading cost against risk. You’re trading cost against customers. You are trading your, hopefully, good reputation.
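A little arithmetic shows why hours versus minutes matters. The sketch below assumes, purely for illustration, one datacenter-level outage per year, and converts the time your recovery plan takes into yearly availability. The `availability` function and the sample recovery times are my assumptions, not measured figures.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(outages_per_year: float, recovery_minutes: float) -> float:
    """Fraction of the year the service is up, given outage count and recovery time."""
    downtime = outages_per_year * recovery_minutes
    return 1 - downtime / MINUTES_PER_YEAR


# Assumed scenario: one outage per year, with three different recovery plans.
for label, minutes in [("4 hours", 240), ("5 minutes", 5), ("5 seconds", 5 / 60)]:
    print(f"{label:>10}: {availability(1, minutes):.5%}")
```

Under that assumption, a four-hour recovery leaves you a little above 99.95% for the year, while a five-minute recovery is better than 99.999%. The gap between those two numbers is the customers you are trading away.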

It isn’t hard if you proceed in a stepwise manner, reducing the downtime and solving problems until you get to the core issues that exist in dealing with transactions across multiple datacenters. These problems are solvable, and the expectation for recovery speed should be constantly rising.

But what about the cloud?

Building on the cloud is likely to have an advantage over doing it the old way with your servers in your datacenters. I would be very surprised if Amazon doesn’t provide a new level of service which allows for nearly seamless recovery in such failure situations in the next year or two. Amazon has the datacenters, bandwidth and infrastructure to make this happen. The cloud is likely on the cusp of reducing this kind of risk.

Even if Amazon doesn’t “fix” this problem with no work on your part, they still offer services across multiple regions and make it easy for you to fix your own problem, as they suggest in their whitepaper.

The real problem

The problem for consumers is that they can’t tell how seriously a company takes its responsibility to build robust systems. I’m not sure what an adequate measure of that seriousness would be, but customers know the impacts when they see them.

As a business, your problem is getting the leaders of your company to take these risks seriously and plan for infrastructure failure as an eventuality. Stop wasting time rehearsing for recovery and build robustness that works so you don’t have to rehearse. Maybe even “hire” a chaos monkey and make sure it knows how to really break your systems.
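For a feel of what a chaos monkey does, here is a toy sketch. It is not Netflix’s Chaos Monkey or any real tool: the `Service` class, the replica names and `chaos_round` are hypothetical, and the “service” is just an in-memory set of replicas. The point is only the shape of the exercise: break something at random, then check that customers still get an answer.

```python
import random


class Service:
    """Toy service backed by named replicas (all names hypothetical)."""

    def __init__(self, replicas):
        self.healthy = set(replicas)

    def kill(self, replica):
        """Simulate losing one replica."""
        self.healthy.discard(replica)

    def handle_request(self):
        """Answer from any healthy replica, or fail if none are left."""
        if not self.healthy:
            raise RuntimeError("no healthy replicas left")
        # sorted() keeps the choice deterministic for the demo.
        return f"ok from {sorted(self.healthy)[0]}"


def chaos_round(service, rng):
    """Kill one random healthy replica and report which one."""
    victim = rng.choice(sorted(service.healthy))
    service.kill(victim)
    return victim


rng = random.Random(42)  # seeded so the demo is repeatable
service = Service(["us-east", "us-west", "eu-west"])
victim = chaos_round(service, rng)
print(f"killed {victim}; service says: {service.handle_request()}")
```

If `handle_request` still succeeds after every round that leaves at least one replica standing, the robustness is real rather than rehearsed.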