Amazon S3 / EC2 / AWS outage this morning…

Many of Amazon.com’s Web Services were down this morning with some customers reporting outages lasting over three hours. Sites that depend on services that depend on EC2 or S3 are down as well.

[Image: amazon-ec2-s3-outage.png]

Failures like this happen in every system, and anyone who promises otherwise is foolish or lying (or both). Amazon does not promise that its systems won’t fail; it offers service credits when S3 does fail, in accordance with its Service Level Agreement. (See: earlier Radar post, video of my panel discussion about SLAs and regulation.)

Nick Carr mentions what happened after the Salesforce outage in 2006:

[…] I feel compelled to point out the inevitable glitches that are going to happen along the way. How the supplier responds – in keeping customers apprised of the situation and explaining precisely what went wrong and how the source of the problem is being addressed – is crucial to building the trust of current and would-be users. When Salesforce.com suffered a big outage two years ago, it was justly criticized for an incomplete explanation; the company subsequently became much more forthright about the status of its services and the reasons behind outages. Given that entire businesses run on S3 and related services, Amazon has a particularly heavy responsibility not only to fix the problem quickly but to explain it fully.

Nick is referring to trust.salesforce.com, which is currently the gold standard of availability reporting for Software as a Service providers. I hope this incident provides both pressure and incentive for other services to adopt similar standards soon.

Updated: David Ulevitch of OpenDNS added:

we’ve been providing a similar site to Trust.Salesforce.com since we launched — and we find that the milage it brings us in user trust far outweighs the embarrassment of whatever we have to put up there. Our site’s version is at http://system.opendns.com.

(Disclosure: OpenDNS is a Minor Ventures company along with Swivel where I am an Advisor.)

Phil Gross of Intuit Quickbase points to http://service.quickbase.com, adding:

[…] We have found that being as clear and upfront as possible when there are issues goes a long way towards keeping customers happy, and it’s also just the right thing to do. One thing to remember, if other companies are thinking about developing a similar service, is *not* to host your service page with your main web host or data center. Our service page is at our disaster recovery center, in a completely separate region of the country, so that if there were a network outage, we could still get the word out, and update on when we’d be back up.

My friend Scott Ruthfield points out DoubleClick’s dashboard at http://qos.doubleclick.net/

Updated: Official update posted on the AWS forums:

Here’s some additional detail about the problem we experienced earlier today.

Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.

Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.

As we said earlier today, though we’re proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable. As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements. We are taking immediate action on the following: (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls. Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.
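
The monitoring gap Amazon describes (total request volume looked normal while the mix shifted toward more expensive authenticated calls) can be illustrated with a minimal sketch. This is not Amazon's implementation. The names `WindowCounts`, `authenticated_share`, `weighted_load`, and `check_window`, along with the cost weights, share threshold, and capacity figure, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts seen in one monitoring window (hypothetical shape)."""
    authenticated: int
    anonymous: int


def authenticated_share(counts: WindowCounts) -> float:
    """Fraction of requests in the window that were authenticated."""
    total = counts.authenticated + counts.anonymous
    return counts.authenticated / total if total else 0.0


def weighted_load(counts: WindowCounts,
                  auth_cost: float = 5.0,
                  anon_cost: float = 1.0) -> float:
    """Estimated resource load; the per-request costs are assumed values."""
    return counts.authenticated * auth_cost + counts.anonymous * anon_cost


def check_window(counts: WindowCounts,
                 max_share: float = 0.5,
                 capacity: float = 10_000.0) -> list[str]:
    """Return alert messages for one window; thresholds are assumptions."""
    alerts = []
    share = authenticated_share(counts)
    if share > max_share:
        alerts.append(f"authenticated share {share:.0%} exceeds {max_share:.0%}")
    load = weighted_load(counts)
    if load > capacity:
        alerts.append(f"weighted load {load:.0f} exceeds capacity {capacity:.0f}")
    return alerts


# Two windows with the same total volume (4,000 requests) but a different mix.
baseline = WindowCounts(authenticated=1_000, anonymous=3_000)
spike = WindowCounts(authenticated=3_000, anonymous=1_000)

for name, window in (("baseline", baseline), ("spike", spike)):
    print(name, check_window(window) or "ok")
```

Both windows carry the same number of requests, so a check on total volume alone would treat them identically. Only the share and weighted-load checks flag the second one.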

Are there any other companies that provide similar reporting on their availability and performance?
