A lot has been written about the outage, so I won’t repeat that here (if you want to catch up, I suggest you read TechCrunch for the PlayStation story, GigaOM for an overview of the Amazon issue, eWeek for some interesting analysis and the DotCloud blog for a very readable explanation of day-to-day use of a public cloud service like Amazon).
Instead, let’s focus on the big picture of the cost of reliable cloud, comparing it – again – with the move from mainframe to distributed computing. History tends to repeat itself, especially in IT where generations of technology tend to be heavily siloed, and staffed with different generations of people who often do not even sit at the same table in the canteen.
Somewhere during the 1980s, IT pros started to realize they could get the same processing power for significantly less money by selecting distributed servers – until then used mainly for scientific work – instead of traditional mainframes. Soon after, companies began porting existing applications to the new platforms, focusing initially on applications that were more compute-intensive than I/O- or data-intensive (sound familiar?). And indeed, initially the new departmental platform, which did not require all the resource-intensive water cooling, air-conditioning or a raised floor, did seem a lot more cost-effective. But it didn’t take long after the initial proofs-of-concept to find that for some applications we needed more processors, clusters, uninterruptible power supplies and other redundant features. On the storage front, we started to use the same – not so cheap – storage solutions as we did on the mainframe. And shortly thereafter, the distributed boxes started to look and cost about the same as the systems they were replacing. Let’s not forget that at the same time, smart mainframe developers – pushed by the competition from distributed systems – found ways to abandon water cooling, leverage off-the-shelf standard components like NICs and RISC processors, and even re-examined their licensing costs for specific (Linux) workloads.
What does this all have to do with cloud computing? Here too, we see that running a certain “compute-intensive” workload can be done much faster and cheaper in the new environment. But when replacing these one-off batch jobs with services that have higher availability and reliability needs, the picture changes. We need to have redundant copies and failover machines in the data center, and in many cases a backup data center in another part of the country, preferably located on an alternate power grid and connected to multiple network backbone providers. All of a sudden it sounds a lot like the typical set-up a bank would have – and likely with a similar cost profile. So far cloud seems to offer cost benefits, but will this cost advantage still exist if we need to replicate our whole cloud setup at a second vendor, in the worst case doubling the cost? Now cloud has many other advantages beyond cost (elasticity, scalability, ubiquitous access, pay-per-use, etc.), so many specific use cases are still ideally suited for the cloud, but if a certain application has no need for these, then it may be worth (re-)considering the business case for a move to the cloud.
Regarding redundancy, the cloud business case actually has two opposing vectors. On one hand, the fact that as a user you have less control (you can’t fly there and ‘kick’ the server) leads to requiring a secondary backup installation. On the other hand, the cloud – with its pay-as-you-go model – offers much more efficient ways to arrange a backup configuration for running your applications than finding your own second location and filling it with shiny new kit.
At the same time, you have to take into account the likelihood of the cloud having capacity available at the moment you need it. If your own data center fails, I am sure you could find another cloud provider with some capacity. But in a case like this one, where one of the largest (cloud) data centers in North America had issues and all of its customers could, in principle, be looking to move their workloads elsewhere at the same moment, it is not certain that enough excess capacity would be available at alternate providers.
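To make the two opposing vectors a bit more tangible, here is a minimal back-of-the-envelope sketch in Python. It compares the worst case of fully replicating a setup at a second vendor with a pay-as-you-go standby that is only scaled up during a failover. The function names, the standby fraction and all the amounts are hypothetical and purely illustrative; they are not taken from any vendor's price list, and the sketch ignores the capacity risk described above.

```python
def full_replica_cost(primary_annual: float) -> float:
    """Worst case: the whole setup is duplicated at a second vendor."""
    return 2 * primary_annual


def standby_cost(primary_annual: float,
                 standby_fraction: float,
                 failover_hourly_rate: float,
                 failover_hours: float) -> float:
    """Pay-as-you-go backup: a small always-on footprint plus
    full capacity billed only for the hours of an actual failover.

    standby_fraction is an assumed share of the primary cost needed
    to keep data replicated and a minimal environment ready.
    """
    always_on = standby_fraction * primary_annual
    on_demand = failover_hourly_rate * failover_hours
    return primary_annual + always_on + on_demand


# Illustrative numbers only (all assumptions).
primary = 100_000.0   # yearly cost of the primary setup
replica = full_replica_cost(primary)
standby = standby_cost(primary,
                       standby_fraction=0.10,
                       failover_hourly_rate=50.0,
                       failover_hours=48)

print(f"Full replica at second vendor: {replica:,.0f} per year")
print(f"Pay-as-you-go standby:         {standby:,.0f} per year")
```

The point of the sketch is only the shape of the comparison: the standby option stays cheaper as long as the always-on share and the expected failover hours remain small, and only if the alternate provider actually has the capacity when you call on it.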
In this specific case there seem to have been technical issues inside Amazon’s and Sony’s data centers that caused a series of events impacting the services running within, but what if there had been a major physical problem, such as a large fire or accident? With more mission-critical applications moving to the cloud, companies need to make contingency planning a top priority.
As a result of last week’s incidents, enterprises should take stock and assess which services are truly vital to their customers and/or to their own continuity. Organizations (and the world) seem to be more resilient than you might expect. In the Netherlands we saw telcos ceasing mobile service in a large part of the country for periods of up to half a day, with ISPs unable to offer Internet services for as long as a week in certain regions. And yet, these companies did not go under. Each industry, government and organization will have to assess its own priorities, success criteria and standards. As I mentioned in a past blog post, aiming to continue your on-demand video services after a major flood, hurricane or nuclear disaster may be overshooting what is necessary under those circumstances – survival will be the only concern in a case like that.
Looking at the reactions in the market so far, the responses of the vendors impacted by the Amazon mishap seem to be a lot more benign than the reactions of impacted PlayStation gamers. Maybe because these vendors feel they have a good thing going and the last thing they want to do is kill it with too much honesty. What is refreshing is that not too many vendors crawled out of the woodwork saying “private cloud, private cloud, private cloud” – not even my colleagues who recently published a book on the topic. Good! Nobody loves “I told you so” types and there’s no point in kicking your opponent when he is down. But the case for private cloud did get a bit better, or is it just me thinking that?