BCM and the Cloud: lessons from experience

 
A relatively simple and entirely repeatable human error led to the failure of one of the most respected and reliable Cloud Computing providers, Amazon.
 
Despite Amazon having Business Continuity Plans, the resulting collapse left thousands of customers and millions of users unable to access a wide variety of websites, causing millions to be lost.
 
Some forecasters are already saying that the EC2 failure will slow the growth of Cloud Computing, with companies concentrating rather more on private Cloud options than committing to public Clouds.
 
We have been looking at how Cloud Computing has been developing over the past few years, holding a number of workshops on the risk management aspects and two Cloud debates in the past year alone, and it has become clear that many organisations do not fully realise just how complex the Cloud landscape can be. We even looked at using the Cloud ourselves, but held back following conversations with a few of the providers, who were almost universally unable to answer what we considered to be some pretty basic BCM questions.
 
Putting aside the obvious issue of ensuring that Cloud Providers actually have effective business continuity plans in place that have been proven through testing, a key problem they have to juggle is the sheer complexity that comes from developing an environment that appeals to the largest market.
 
The diversity of customer environments, tools and software is overlaid on the providers' own systems and software environment, creating a landscape that is increasingly difficult to test and maintain, with ever more complexity and dependencies. The trouble is that much of this risk can stay hidden from view until the conditions required combine to generate a failure, and by that time it is too late. In many cases the consequences are relatively trivial, though they still cause disruption and inconvenience. But as Amazon found out, sometimes they aren't minor, and extended, highly embarrassing, expensive failures can very quickly be the result.
 
We would argue that Cloud Providers, their Suppliers and Customers need to find a way to reduce this complexity (or at least be able to cope with it) to a level that positively enables systems resilience. At each step they should be able to answer concerns and difficult questions, and look to extend functionality through a properly planned and tested modular approach. This means a lot more attention to the Quality Assurance aspects of software and systems design and, we feel, a far more inclusive approach to the business continuity and recovery dimensions, which seem to have been added as an afterthought in too many cases.
 
This need not limit (too much) the development of the Cloud, but for companies large and small the question has now been posed: "can we really rely on it?" Like many others, we aren't yet sure, and that lack of confidence will hold back the development of the sector, at least for now.
 
The upsides are potentially very positive for user organisations, especially with regard to establishing better performance, capacity and BCM across the SME sector, but we'd like to see a significantly better framework that ensures the proper foundations are in place within providers, so that the reality meets the promise offered.
 
Designing for Resilience and Continuity requires time and focus. We feel the providers that are able to demonstrate that they really have applied themselves and can manage the complexity issues in the Cloud, delivering real resilience and business continuity, will gain a huge competitive advantage in coming years, and we wouldn't be in the least bit surprised if Amazon was at the front of this pack. After all, there is nothing like direct, public, painful experience to make someone a believer in BC!