For all the optimism surrounding the potential of computing in the cloud - lower costs, better performance, easier scaling - it isn't a perfect system. No matter how distributed and redundant the architecture or how rigorous the backup system, when it comes right down to it, there's a complex series of hoops through which the data has to jump to travel between the user and where it actually resides on a piece of physical hardware. And when a segment of that process fails, the benefits of the cloud suddenly seem far less magical.
Take the recent unfortunate situation at Ylastic, a company that provides a single front end for managing Amazon Web Services, which was an unwilling participant in one of these cloud bursts.
Ylastic noticed something strange occurring with one of its Elastic Block Store (EBS) volumes on Amazon Elastic Compute Cloud (EC2). EBS is a service that is "particularly suited for applications that require a database, file system, or access to raw block level storage."

But something wasn't quite right. And over the course of a few hours the story played out via Twitter as Ylastic noticed issues with its EBS instances.

When the problem was finally identified, Ylastic discovered that the data could not be recovered. They were forced to restore from an earlier snapshot that contained only a subset of the data.

Finally, after recovering what data they could, Ylastic had to go to its customers with the unfortunate message:
"AWS has finally terminated the frozen instances. But the EBS volume is still detaching and has been for hours. It doesn't seem like we will be able to get into it at this point. Some time in the last month or so, our EBS snapshotting of this stuck volume seems to have stopped working correctly.... We have gone back and run through all the snapshots, and the last good snapshot that we have is from October 1."
Who was at fault? Amazon? Ylastic? Truly, no one. It was simply a combination of issues. A perfect storm in the cloud, as it were. And that perfect storm resulted in data loss for Ylastic and its customer base.
Does this mean we should run screaming from the potential the cloud holds? No, absolutely not. But it's an unfortunate reminder that the system is far from perfect and that those who are relying on the cloud to serve critical aspects of their business should be ever diligent to ensure that the data is being backed up.
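One basic form of that diligence is to raise an alarm when the newest good snapshot is older than expected, rather than discovering the gap during a recovery. The following is a minimal Python sketch, not Ylastic's or Amazon's tooling: the record layout (`start_time`, `state`), the function names, and the one-day threshold are all assumptions for illustration, and the dates (including the year) are made up apart from the October 1 snapshot mentioned in the quote above.

```python
from datetime import datetime, timedelta, timezone


def latest_good_snapshot(snapshots):
    """Return the most recent snapshot whose state is 'completed', or None."""
    completed = [s for s in snapshots if s["state"] == "completed"]
    return max(completed, key=lambda s: s["start_time"], default=None)


def snapshot_is_stale(snapshots, now, max_age=timedelta(days=1)):
    """True if there is no completed snapshot newer than max_age."""
    latest = latest_good_snapshot(snapshots)
    return latest is None or now - latest["start_time"] > max_age


# Hypothetical snapshot history: one good snapshot, then silent failures.
history = [
    {"start_time": datetime(2008, 10, 1, tzinfo=timezone.utc), "state": "completed"},
    {"start_time": datetime(2008, 10, 15, tzinfo=timezone.utc), "state": "error"},
    {"start_time": datetime(2008, 10, 30, tzinfo=timezone.utc), "state": "error"},
]
check_time = datetime(2008, 11, 1, tzinfo=timezone.utc)

if snapshot_is_stale(history, check_time):
    good = latest_good_snapshot(history)
    print("ALERT: last good snapshot is from", good["start_time"].date())
```

Run on a schedule, a check like this would flag the problem within a day of the snapshots going bad. It only confirms that a recent snapshot exists; it says nothing about whether that snapshot can actually be restored.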
For all the technical magic of the cloud, it's still the basics of data management that matter most.
Comments
Hi there Rick,
1. Thanks for taking the time to write this.
2. I know next to nothing about cloud computing.
3. Here are a few thoughts:
a:
From your post I was able to understand that Cloud computing isn't perfect.
What is?
b:
You gave a single, albeit catastrophic, example of a glitch in a system, and used it to cast doubt on the system as a whole - that looks like faulty logic to me.
c:
I don't know much, but I do know that cloud computing is still in its infancy.
This is a good thing because it means that we can expect better performance in the future.
With the rate of technological advancement being what it is, and the financial potential that cloud computing holds, I'm willing to bet that this future is nearer rather than farther.
d:
My sincere condolences to the people at Ylastic. I hope they never suffer such setbacks again and pray that their recent trials helped them improve their operation.
All the best and have a great weekend,
Mike Darnell,
@pop_art
http://DigitalArtPrintGallery.
Hi,
I recently presented my ideas of cloud high availability on the following URL:
http://mukulblog.blogspot.com/2008/07/cloud-availability.html
If you have a minute, check it out.
Thanks,
Mukul.
Hey Rick,
Great article and great reminder that no matter what the technology, companies need to have a plan for backing up data and testing the validity of those backups.
As the CTO of two different start-ups over the last 8 years, I can tell you that I've seen almost every type of failure out there. I currently use cloud computing for some of my clients and while (on the whole) I really like it, I still realize that it's not immune to failures.
I believe part of the challenge for those of us responsible for the data comes from the good things cloud computing has brought us. The appeal (and ultimately the source of problems for companies like Ylastic) is in the set-it-and-forget-it mindset that cloud computing offers. Companies sell the idea that I don't have to worry about server maintenance and backups - they've got it covered. I can focus instead on the growth of the business.
On the whole, they're right. But it's a false sense of security when it's taken to the extreme conclusion that I will nev
While there's no sense in tossing blame around, I would hope that Amazon would respond to a problem like this with an explanation and solution. They claim EBS volumes (basically a hard disk in the cloud that you can attach to your servers) have an annual failure rate of 0.1% – 0.5%. That's pretty good as far as hard drives go, but obviously no guarantee. As protection against this, they tell you to make "snapshots", or point-in-time backups of your EBS volumes to their safer, more redundant S3 storage service. The snapshot feature is fairly "behind the curtain" - you can't even access the snapshots yourself from S3. So Amazon is really on the line for making sure they work, and having them quietly fail when all else seems fine seems like a really big glitch and something they should be actively addressing.
Sounds like someone wants all of the advantages of cloud resources, but makes no effort and takes no responsibility for it.
So you are saving a fortune by using S3 and EC2, but then complain that the service was out for an hour? Go back to having everything in house, with all the associated overhead then.
Extending your product or service out to a cloud, creating a dependency, means you have to plan ahead and have an elegant recovery method - like Google Gears.
Aside from the company not being diligent about its backup process, I think there is a bigger lesson to be learned from this.
As consumers of a cloud product, be it a low-level one like S3 or a high-level one like Gmail or Flickr, we should be taking responsibility to back up our personal information that is stored in the cloud.
Having an extra copy of your information in another location would be a near guarantee you never lose your info.
I believe there is a new backup-type business starting here. I read somewhere recently about Mozy.com doing something like this for its customers.
So, lesson learned: never trust any company with "Your" data; always have a copy of it too.
This serves as a great reminder - no matter what the technology, how distributed, or how far we've come along... Good backups are still an absolutely essential component to any network strategy. How do you know they're good backups? By restoring them and testing.
I don't see this as a failure of cloud computing, as you mentioned. This is more a story of a backup strategy gone horribly wrong.
I wince as I read it - having been in technology for 20 years, I've seen this scenario played out on smaller scales over and over again.
Cool, sexy, distributed coolness sometimes has to rely on old, clunky procedures and tests to ensure constant forward motion.
I love the attempts at transparency here. Keeping users in the loop during downtime events is a good start. The key is whether or not your users actually know where to look for that information. Feels like in this case it could easily be missed. Have some advice for them here: http://www.transparentuptime.com/2008/11/transparency-case-study-courtesy-of.html
Love the illustration in this post :)