ReadWriteWeb

More Amazon S3 Downtime: How Much is Too Much?

Written by Richard MacManus / July 20, 2008 2:04 PM / 46 Comments

Today's big news is that Amazon's S3 online storage service has experienced significant downtime. Allen Stern, who hosts his blog's images on S3, reported that the downtime lasted over 6 hours. Startups that use S3 for their storage, such as SmugMug, have also reported problems. Back in February this same thing happened. At the time, RWW feature writer Alex Iskold defended Amazon in a must-read analysis entitled Reaching for the Sky Through The Compute Clouds. But it does make us ask questions such as: why can't we get 99% uptime? Or: isn't this what an SLA is for?

You can see the status as of writing in the screenshot below, taken from the Service Health Dashboard:

Interestingly, SmugMug - an online photo and video provider - doesn't seem too concerned about the outage. It seemed almost blase about it in its blog post today:

"Historically, Amazon has been very stable. We've seen three of these in our entire history with Amazon (>2 years), including this one. I expect, like the last two, that service will be restored shortly. You can keep track of their efforts over on their own Status Dashboard.

Our faith in Amazon, and the care they take of your priceless memories, hasn't been shaken. Your photos and videos are safe - which is our #1 concern. Since problems in this industry are inevitable, and Amazon's performance over the last two years has been so exceptional, we've been afraid of an outage like this. I'm sure there will be more over the next few years, too.

The important thing is that they're few and far between, short, and handled properly. Every component SmugMug has ever used, whether it's networking providers, datacenter providers, software, servers, storage, or even people, has let us down at one point or another."

This almost exactly mirrors what Alex said in February. Cloud computing is a complex business, wrote Alex, and Amazon is simply the best available option:

"The truth is that we cannot do it better than Amazon. They spent a massive amount of money, talent and most importantly time, trying to solve this problem. To think that this can be replicated by a startup in a matter of months, assembled, be cost effective, and work properly is just absurd. Large-scale computing is an enormously complex problem, that takes even the best and brightest engineers years to get right."

Here's a diagram that Alex did to illustrate the concept of cloud computing:

I guess the answer to the question, how much is too much downtime, is: hey, whataya gonna do? (imagine that said in a New York accent and with a shrug).

What do RWW readers think: are these outages getting too much, or do you still cut Amazon some slack?




Comments


  1. Yeah, we use AWS for our social network at ESPN. Downtimes like this are irritating, but we trust Amazon to adjust and grow and avoid things like this. The experience has been very positive overall.

    Posted by: Cody | July 20, 2008 3:12 PM



  2. Five plus hours of downtime... (so far). Pretty unacceptable! I guess it's one of the dangers of offloading content, images and files. I'm an unhappy S3 customer right now!

    Posted by: Doug | July 20, 2008 3:16 PM



  3. 3.5 hours of downtime in a month is actually about 99.5% uptime.

    Posted by: Jeffrey McManus | July 20, 2008 3:17 PM
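
The downtime-to-uptime conversion in this comment is easy to check. Below is a minimal sketch, assuming a 30-day month (720 hours); the function name `uptime_percent` is made up for illustration and is not part of any Amazon tooling.

```python
def uptime_percent(downtime_hours: float, period_hours: float) -> float:
    """Return uptime as a percentage of the measurement period."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (1.0 - downtime_hours / period_hours)


# Assumption: a 30-day month. A 31-day month changes the result only slightly.
HOURS_IN_30_DAY_MONTH = 30 * 24

for hours in (3.5, 6.0):
    pct = uptime_percent(hours, HOURS_IN_30_DAY_MONTH)
    print(f"{hours} h down in a 30-day month -> {pct:.2f}% uptime")
```

Under these assumptions a 3.5-hour outage leaves monthly uptime above 99.5%, while a 6-hour outage pulls it down to roughly 99.2%.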



  4. Failure happens. Year-round 24/7 service is a difficult thing to provide.

    Isolated failures?
    We're living in an imperfect world.

    Consistent, repeated failure modes?
    You're doing something wrong.

    People rankled about 5 hours of downtime should try providing the same level of service. In my experience, it's much easier to write-off your own mistakes (and most organizations do), than it is to understand someone else's -- even when they're doing a better job than you would.

    Posted by: Hubert G. Fillywick | July 20, 2008 3:19 PM



  5. Thanks for the link Richard - I love how they call it "elevated error rates" - on Mosso, my host, they call downtime "degraded" - just call it what it is.

    It seems they are bringing up the European hosts now and then the USA - assuming it works, downtime will be over 6 hrs - not a huge deal for a blog like mine but certainly a huge deal for Web apps that live off images and other static files.

    Posted by: allen stern | July 20, 2008 3:30 PM



  6. I don't think that for *most* applications this is a big deal. We've all got used to websites being down from time to time, and I think 3-4 hrs per year is probably acceptable.

    But it does raise some interesting questions around how the market for cloud computing is going to grow. If, for instance, I had an app that depended on S3 then I could expect 1-2 outages of an hour+ per year. If, however, my app depended on more providers (say S3 and Joyent) then my downtime could be expected to be the sum of *their* downtimes and so on and so forth. Would this then lead to developers tending to settle on just one infrastructure provider?

    Posted by: David Preece | July 20, 2008 3:43 PM
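
The "sum of their downtimes" intuition can be made concrete. The sketch below assumes provider outages are independent (real outages may be correlated) and uses a made-up availability of 99.9% per provider; the function names are illustrative only. It contrasts an app that needs every provider up with one that only needs any one of them up.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24


def serial_availability(availabilities):
    """App works only if every provider is up: availabilities multiply."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """App works if at least one provider is up: unavailabilities multiply."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical figures, not measured values for any real provider.
providers = [0.999, 0.999]

for label, value in (
    ("single provider", providers[0]),
    ("depends on both", serial_availability(providers)),
    ("either one suffices", redundant_availability(providers)),
):
    print(f"{label}: {value:.6f} availability, "
          f"{downtime_hours_per_year(value):.2f} h/year down")
```

When downtimes are small, the serial case comes out almost exactly equal to the sum of the individual downtimes, which matches the comment. The redundant case shows why using two providers as mirrors, rather than as separate dependencies, pulls in the opposite direction.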



  7. Jeffrey, good point -- although according to Allen's comment the latest report is over 6 hours downtime. Post updated with that info.

    Posted by: Richard MacManus | July 20, 2008 4:02 PM



  8. When the e-commerce giant's smallest possible division made a splash with their offering last year we spent a lot of time thinking about it.

    Let's just say that this sort of major outage is not really the worst case scenario.

    How about this: this teeny tiny division loses the middle three guys to a startup. The top guy is now spending 100% of his time doing politics to keep the money losing division on the budget and 50% of his time doing PM and leadership for the sorta senior technical guys.

    Then what happens?

    -OT

    Posted by: OliverTaco | July 20, 2008 4:08 PM



  9. FYI, AWS Service Health Dashboard http://status.aws.amazon.com/

    Posted by: Doug | July 20, 2008 4:09 PM



  10. thanks Doug, added link and screenshot

    Posted by: Richard MacManus | July 20, 2008 5:02 PM



  11. One of the reasons I sold Feed Digest was because of the hordes of people who would go absolutely insane over even, say, 30 minutes of downtime. Thankfully beyond a certain point unscheduled downtime was very rare and wasn't even common (as in, like every month or something) but some people out there *really* seem to hate downtime :) (or like complaining a lot)

    Posted by: Peter Cooper | July 20, 2008 5:30 PM



  12. Outage was about seven hours for us but overall S3 has met its SLA in 8 out of the 10 months we've been on S3, and the other two (assuming no further outages this month) were still over 99%. More than acceptable given the other variables in play, and with some effort S3 can be part of a redundant solution that could add more 9s if needed for most apps.

    Posted by: gz | July 20, 2008 7:27 PM



  13. It’s safe to presume that S3 and related services are going through a phase of rapid expansion - hitting walls as they push out further and introducing defects as their environment changes. All new systems go through a similar adolescent phase so I’m not extrapolating this into a trend, yet.

    SLAs are not useful - they don’t present accurate expectations (they're a sales tool) and the recourse is always insignificant in comparison to the loss incurred. "How good are you" is more useful than "how good do you promise you’ll be". Think about that point if you're deciding to use S3 in the near future: is the contract or the service history more relevant to your decision?

    Anyway, S3 et al better be scrambling to tweak their architecture or I’m switching off either to www.nirvanix.com or bringing it back in house where there are a few throats to choke.

    Posted by: Steve Ireland | July 20, 2008 7:32 PM



  14. While I tend to agree that downtime is inevitable; and I also agree that it is irritating; what I find amusing is that most people are more tolerant of downtime because it's on a service that Amazon is offering.

    If this was a smaller company, then they would have been lynched. Does Amazon's brand name and size let them get away with service downtime?

    Posted by: Vaibhav | July 20, 2008 8:08 PM



  15. SmugMug is hardly a startup, founded in 2002.

    That being said, SmugMug's user community has had lots to say about the outage.

    Annoyingly, even after Amazon got S3 back up at 5:12PM PDT, there were additional issues that delayed full service restoration until just now (~9:00PM PDT).

    I wonder how much of that was Amazon's fault vs. SmugMug's (I see other S3-dependent services were back up shortly after Amazon restored service).

    Posted by: Darryl Lee | July 20, 2008 9:22 PM



  16. Well, the question also is, how does this compare with other offers from the non-cloud-computing side?

    Do you have any numbers or average downtime for dedicated servers, ... ?

    Posted by: Yves Hiernaux | July 20, 2008 9:53 PM



  17. Great points here. Just added reference to this blog from our blog post on this subject -

    http://secobackup.com/blog/2008/07/21/secobackups-resilience-and-todays-amazon-downtime/

    We talk about the nature of Cloud Computing, doing web services over the WAN and software resilience that needs to be built into products to complement the cloud.

    /gp
    SecoBackup
    http://secobackup.com/products-secobackup.html

    Posted by: SecoBackup | July 20, 2008 10:09 PM



  18. Yves - I think you hit the nail on the head.

    I am often surprised how rarely (uh, never) people use the uptime (or downtime) of on-premise software as a benchmark or reference point for uptime of web services and cloud software.

    What's the uptime of typical data storage on-premise? It's certainly not 100%.

    What people should be discussing is the marginal, or incremental, uptime or downtime of web services as compared to on-premise.

    Also, we should be discussing uptime in the context of the total value proposition of web services and cloud software. I certainly don't think the core value prop of cloud software is higher reliability than on-premise software. There are a lot of other components.

    Seems like any time a large service has downtime, the discussion instantly goes to "how much is too much".

    If you're really going to ask that, then you have to include all the other components of the value prop in the equation.

    Uptime is just one criterion. But it seems to drive all the buzz.

    Posted by: kayvaan | July 20, 2008 11:25 PM



  19. This happens too often. Amazon just doesn't have their act together and it's starting to show. People should start looking at Amazon's biggest competitor, Nirvanix. I think they were even first with an SLA before S3 put theirs out. The issue is Amazon S3 has one data center so when it fails, it costs startups and businesses big $$. Nirvanix built out storage nodes across the globe so there is no single point of failure; if one node went down another one could pick up the slack, easily. They really have their stuff together unlike these Amazon dudes.

    Check them out: www.nirvanix.com

    Posted by: Nick | July 21, 2008 12:14 AM



  20. Our service, like many others, was affected too. It's understandable that this might happen even with the big guys.
    The important thing though is to know "how much is too much" for the end users. That tells if we should look for alternatives. And this question here will help us see that.
    Thanks Richard.

    Hopefully with a bit of competition on the market this will get improved, and won't be an issue that often.

    Posted by: The BlogUpp! Team | July 21, 2008 12:18 AM



  21. Here's a thought -- what about a simple API layer that sits between an app and the cloud that allows for redundancy across different service providers?

    I'd love to be able to take posterous.com user data and store it on both Mosso and S3. The redundancy can be worth it in the long run, especially if you can guarantee no downtime.

    It could even be generalized into a layer that allows for automatic memcaching of frequently accessed local data.

    Posted by: Garry Tan | July 21, 2008 1:04 AM
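
The redundancy layer suggested here could look roughly like the sketch below. Everything in it is hypothetical: `InMemoryStore` stands in for a real provider client (it is not the S3 or Mosso API), and `RedundantStore` and `StoreUnavailable` are names invented for illustration. The idea is to write every object to all backends and read from the first one that answers.

```python
class StoreUnavailable(Exception):
    """Raised when a backend cannot serve a request right now."""


class InMemoryStore:
    """Hypothetical stand-in for a real storage provider client."""

    def __init__(self, name):
        self.name = name
        self.available = True  # flip to False to simulate an outage
        self._objects = {}

    def put(self, key, data):
        if not self.available:
            raise StoreUnavailable(self.name)
        self._objects[key] = data

    def get(self, key):
        if not self.available:
            raise StoreUnavailable(self.name)
        return self._objects[key]


class RedundantStore:
    """Writes to every backend, reads from the first one that answers."""

    def __init__(self, backends):
        self.backends = list(backends)

    def put(self, key, data):
        written = 0
        for backend in self.backends:
            try:
                backend.put(key, data)
                written += 1
            except StoreUnavailable:
                continue  # a real layer would queue a retry for this backend
        if written == 0:
            raise StoreUnavailable("no backend accepted the write")
        return written

    def get(self, key):
        for backend in self.backends:
            try:
                return backend.get(key)
            except (StoreUnavailable, KeyError):
                continue  # fall through to the next backend
        raise StoreUnavailable(f"no backend could serve {key!r}")


primary = InMemoryStore("provider-a")
secondary = InMemoryStore("provider-b")
store = RedundantStore([primary, secondary])

store.put("photo.jpg", b"...bytes...")
primary.available = False          # simulate an outage at the first provider
print(store.get("photo.jpg"))      # served by the second provider
```

A layer like this cannot truly guarantee zero downtime, since both providers can fail at once and a backend that missed writes during an outage needs to be resynchronized, but it removes the single-provider dependency the thread is worried about.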



  22. Yeah, it's really challenging to go continuously without any failure for a very long time. Failure happens sometimes but we at our office try to stay as stable and reliable as possible. I got similar information from dreamworldtech.com . please reply me after checking information from this website..... Good Luck

    Posted by: John Wade | July 21, 2008 2:33 AM



  23. Our service (PhotoShare) was seriously affected yesterday too (Amazon S3 outage affected PhotoShare service too).

    Although I am not happy about this incident and I think Amazon must do a much, much better job to keep us as a customer, I still believe cloud computing is the right direction. It really does not make sense for each web service company to have their own machines. It is like every household needing to have its own electric generator.

    Posted by: Satoshi Nakajima | July 21, 2008 4:51 AM



  24. For full disclosure, I am with Nirvanix. This is no place for an ad, but we know you will be pleased with our Storage Delivery Network. To make the process of file and folder migration easier, we have made a software tool you may download here: http://developer.nirvanix.com/files/folders/applications/entry886.aspx

    Posted by: Robin | July 21, 2008 10:23 AM



  25. Sometimes disruption causes some ups and downs - a move to cloud computing is definitely in the class of a disruptive change and these outages are an example of the ups and downs to be expected.

    More here

    Posted by: Ben Kepes (via FriendFeed) | July 21, 2008 12:47 PM



  26. The comments from the readers give Amazon a wide berth. In the world of enterprise applications 6 hours is an eternity. The comments reflect the fact that Amazon is a Web 2.0 darling right now. I can only imagine the vitriol if it were Microsoft, not Amazon, who had 6 hours of downtime.

    Posted by: jholbrook | July 21, 2008 1:41 PM



  27. @jholbrook - I don't disagree that 6 hours is an eternity. But I still think that it doesn't help to razz Amazon about downtime without understanding what comparable downtimes are for on-premise versions of the given software.

    And I agree with you that when (not if) Microsoft has this issue, the vitriol will probably be even worse. Microsoft is coming out with its Business Productivity Online suite later this year (Exchange, Sharepoint, etc. online) and it will be interesting to see uptime and responses to downtime (full disclosure, I work for Microsoft).

    Posted by: kayvaan | July 21, 2008 2:34 PM



  28. Holy Smokes Batman? is he for real? Excellent article indeed it was.

    JT
    www.FireMe.To/udi

    Posted by: Jimmy Crack Corn | July 21, 2008 6:45 PM



  29. There is a sunny side to this: someone else to blame. If your server goes down it is your fault and reflects poorly on you. If Amazon's server goes down it is out of your control and people will be empathetic.

    Posted by: Aaron | July 21, 2008 7:08 PM



  30. The article expects them to have 99% uptime - so over 1 year that allows for about 3.65 days of downtime. If their service was out for 6 hrs recently and 6 hrs in February, I think they are meeting the SLA of 99%.

    And your bio states you did analysis work???

    Posted by: wwwdeveloper | July 21, 2008 7:45 PM



  31. When I first started using S3 I thought it was the most amazing thing ever. Within 24 hours, I had my entire web imaging system (heavy user contribution) using s3. However, I wish amazon's .NET methods had included some setup in web.config that would have specified a local cache folder should their service ever go down.

    I'm programming this myself now, and using a getFileLink() method around any s3 hosted file. It checks s3 status every 5 minutes. If s3 is down, it uses the local cache for the next 5 minutes.

    Essentially, it adds our own local servers to the 'cloud'. I'm surprised s3 didn't provide this feature in the first place.

    Posted by: Seth Caldwell | July 21, 2008 7:51 PM
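
The approach described in this comment can be sketched as follows. This is an illustration in Python rather than the commenter's .NET code; the class name `FallbackLinker`, the `check_remote` callback, and the example URLs are all assumptions. The health check runs at most once per five-minute window, and links point at the local cache whenever the last check failed.

```python
import time


class FallbackLinker:
    """Builds file links, falling back to a local cache when remote is down."""

    def __init__(self, check_remote, remote_base, local_base,
                 ttl_seconds=300, clock=time.monotonic):
        self._check_remote = check_remote    # callable returning True if up
        self._remote_base = remote_base.rstrip("/")
        self._local_base = local_base.rstrip("/")
        self._ttl = ttl_seconds              # how long a status result is reused
        self._clock = clock                  # injectable so it can be tested
        self._last_check = None
        self._remote_up = True

    def _remote_is_up(self):
        now = self._clock()
        if self._last_check is None or now - self._last_check >= self._ttl:
            try:
                self._remote_up = bool(self._check_remote())
            except Exception:
                self._remote_up = False      # a failing check counts as "down"
            self._last_check = now
        return self._remote_up

    def get_file_link(self, key):
        base = self._remote_base if self._remote_is_up() else self._local_base
        return f"{base}/{key}"


# Usage with a fake clock and a fake status flag (both hypothetical).
status = {"up": True}
now = [0.0]
linker = FallbackLinker(lambda: status["up"],
                        "https://remote.example/bucket", "/cache",
                        clock=lambda: now[0])

print(linker.get_file_link("a.jpg"))   # remote link
status["up"] = False
print(linker.get_file_link("a.jpg"))   # still remote: status is cached
now[0] = 301.0
print(linker.get_file_link("a.jpg"))   # local cache link after the re-check
```

The trade-off is visible in the second call: for up to five minutes after an outage begins, links can still point at the unavailable remote store. A shorter window narrows that gap at the cost of more status checks.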



  32. I don't use Amazon S3, but if my web host was to simply go down for 6 hours, I wouldn't be using that web host anymore. (never happened.) It is simply unacceptable, period.

    Posted by: David | July 21, 2008 7:55 PM



  33. uhh...am i just missing something or is the diagram devoid of any meaningful data?

    Posted by: mark e | July 21, 2008 8:28 PM



  34. Quick math for you buddy boy.
    99% uptime = .99 * 365 days = 361.35 active days.

    3 days and about 15.6 hours of downtime.
    87.6 hours of downtime in total.
    That would be about 14.4 minutes of downtime a day if it were spread out.

    but it typically isn't.
    so bearing that in mind, i am gathering s3 isn't down for days at a time, but perhaps hours and perhaps every once in a while. i think you still get your 99% uptime

    Posted by: dave | July 21, 2008 9:06 PM
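
The downtime budget implied by an uptime target can be worked out directly. A minimal sketch, assuming a 365-day year; the function name `downtime_budget` is invented for illustration, and the 99.9% row is included only for comparison.

```python
def downtime_budget(sla_percent: float, period_days: float = 365.0) -> dict:
    """Return the downtime allowed by an uptime target over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    down_fraction = 1.0 - sla_percent / 100.0
    down_days = down_fraction * period_days
    return {
        "up_days": period_days - down_days,
        "down_days": down_days,
        "down_hours": down_days * 24,
        # Average per day if the downtime were spread evenly.
        "down_minutes_per_day": down_fraction * 24 * 60,
    }


for target in (99.0, 99.9):
    b = downtime_budget(target)
    print(f"{target}% uptime: {b['up_days']:.2f} days up, "
          f"{b['down_hours']:.1f} h down per year, "
          f"{b['down_minutes_per_day']:.1f} min/day if spread evenly")
```

At 99% the yearly budget is large enough to absorb two six-hour outages several times over, which is the point being made in this part of the thread.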



  35. The guys who are complaining about this are perpetually angry and just want something to complain about.

    If Amazon isn't meeting their SLA, I'm sure there's some way to be reimbursed for the downtime.

    If this makes you oh-so-angry, move your site somewhere else.

    I used to work for a hosting company and realized there are people who want to hear you're fixing a problem and other people who just want to complain, even if it only cost them $8/mo.

    I'm sure there are options that will provide better uptime than Amazon -- anyone looked in to the pricing of Akamai and distributed data centers? Don't go relying on any of that load balancer stuff, better to pay tons of staff to keep a non-distributed system up, synchronized and running :)

    PS. That's awesome! I just read an article about Satoshi Nakajima today, and here he is commenting on this article.

    Posted by: Andrew | July 21, 2008 9:16 PM



  36. It would be interesting to see Nirvanix's fully public historical record of outages, to compare with http://status.aws.amazon.com/ ...

    Posted by: dan-o | July 22, 2008 4:16 AM



  37. We use Amazon S3 for our storage network and although the downtime is frustrating for us and users, Amazon is defo the way to go, they do have a good solution for startups who want to start small and scale!

    Jamie.

    Posted by: Jamie | July 22, 2008 5:35 AM



  38. While this certainly stinks - overall Amazon s3 has better reliability and uptime than mosso.com does - and that's where we have things hosted right now.

    Posted by: Matt Ellsworth | July 22, 2008 6:33 AM



  39. dan-o: nirvanix has a far worse track record than S3. they have lost some customer files entirely and forced one company out of business. stay away from this company!!!

    is this enough for you:

    http://www.techcrunch.com/2008/07/10/mediamaxthelinkup-closes-its-doors

    Posted by: Hamul | July 22, 2008 9:03 PM



  40. MediaMax went out of sight because 25GB of free storage just doesn't fly. Nirvanix and MediaMax are two separate entities. Amazon even copied Nirvanix by unveiling an SLA months after Nirvanix launched. Backed by a 100% SLA, Nirvanix is leaving S3 in the dust.

    Posted by: Mike | July 23, 2008 9:15 AM



  41. Thought I'd share with you all the "real" cause of Amazon S3 Downtime http://benjaminkim.com/archives/cause-of-amazon-s3-downtime-revealed/

    Posted by: James | July 23, 2008 5:54 PM



  42. Hamul, this is the second site I've seen you on warning people about the Nirvanix threat.

    The fact is, Nirvanix has lost MILLIONS of user files! In fact, TheLinkup.com, which is the company that Nirvanix is spun out from, says that openly, and attributes that as being the reason they're no longer in business.

    I am in the process of contacting current Nirvanix corporate customers and warning them about what happened. Mediamax didn't go under because of their poor business model - they went under because of the gross incompetence of their spin-off/subsidiary company Nirvanix. It was Nirvanix's terrible mistake that began the process, and I believe, will be repeated.

    Nirvanix is being run by the same people who deleted entire terabytes of customer data in an "engineering error" - watch out, it WILL happen again, and it won't be funny when it is your files.


    "MediaMax went out of sight because 25GB of free storage just doesn't fly."

    No, they went under because their SPIN-OFF company, Nirvanix, deleted millions of users' files! Their actions caused TheLinkup/Streamload/Mediamax to go out of business.

    "Nirvanix and MediaMax are two separate entities."

    Nirvanix is being run by the same people who ran Mediamax (into the ground). It uses the same computers and servers, per employees of Mediamax. In fact, when they "spun off" from Mediamax, they didn't go very far, their offices are right across the street.

    Do a Google search for "Mediamax employees" and then do one for "Nirvanix employees". You'll see that some people have already made comparisons showing it is run by the same rascals, and you'll see admissions that most of the former Mediamax employees moved to the new company, Nirvanix. Like rats escaping a sinking ship, if you ask me.

    Don't trust Nirvanix - they did it before, they'll do it again.

    Posted by: Tom Bassett | July 24, 2008 9:36 PM



  43. 7%

    Posted by: XtianDream | July 25, 2008 4:24 AM



  44. My point, somewhat echoed by others above, is what's a better alternative? At least Amazon is a name brand that has proven reliability with other services. Hasn't everyone with a commercial web site run into occasional downtime woes with hosting? On the flip side, 6 hours is a long time to be down during business hours. With distributed storage, how does the entire system go down for that length of time? Don't think Amazon fully explained.

    Posted by: Joe | July 28, 2008 8:38 AM



  45. @ Tom

    You seem to be very misinformed, please check out this link: http://developer.nirvanix.com/blogs/nirvanix/default.aspx

    Posted by: Vick | July 28, 2008 12:22 PM



  46. "@ Tom

    You seem to be very misinformed, please check out this link: http://developer.nirvanix.com/blogs/nirvanix/default.aspx"

    I'm not misinformed. You are a shill. You've left comments on several blogs, with the same message, using a different username each time. You left the message 3 times on MY OWN BLOG using DIFFERENT names, but the SAME EMAIL ADDRESS.

    You work for Nirvanix.

    Steve Iverson, president of Mediamax, and John Hood, former head of customer service for Mediamax, have placed the blame on Nirvanix. Iverson has outright called Nirvanix liars, and said that you threatened to sue him and Hood. You bullied Newsvine.com into deleting the article I wrote on the topic. You lie and cheat and commit fraud, and you think starting a new company with a new name will let you hide from the past? It isn't gonna work.

    Posted by: Tom Bassett | August 6, 2008 2:12 PM


