Amazon has published a more detailed explanation about the outage that knocked out a number of popular websites on Friday night, including Netflix, Instagram, and Pinterest. The culprit: a 20-minute power outage at a single Northern Virginia data center.
Problems started at 7:24 p.m. PDT, when a "large voltage spike" hit the electrical grid used by two of Amazon's data centers. When technicians tried to switch to backup power, the diesel generators at one of the data centers just didn't work properly. "The generators started successfully," Amazon now says, "but each generator independently failed to provide stable voltage as they were brought into service."
Judging from Amazon's explanation, the generators may have been powering up, but the switching equipment at the data center didn't think they were ready for a switchover.
Then, to complicate matters further, the power came back on for a few minutes and failed again, just three minutes before 8 p.m. Seven minutes later, the data center's battery backups started to fail.
Then the data center went dark.
It turns out that an abrupt power outage like that is pretty bad for the cloud. Though the backup generators finally started restoring power just 10 minutes into this second outage (full power was back 10 minutes after that), Amazon's technicians soon discovered that rebooting the affected servers in the data center would take about three hours, and that the delay would be compounded by several bugs in the company's cloud software that they hadn't known about.
A bug in Amazon's Elastic Load Balancing (ELB) software, which customers use to spread internet traffic across servers in different Amazon data centers, caused this important service to become overwhelmed across Amazon's cloud. It was the worst possible time for the service to falter, because customers whose programs ran in the downed data center needed it to redirect their internet traffic. ELB "fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete," Amazon said in its analysis.
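For readers who haven't used the service: a customer typically points a load balancer at servers in more than one data center so traffic can be shifted away from a failed one. The sketch below, written with today's boto3 Python SDK (which post-dates this outage), shows the general idea; the load balancer name, availability zones, and instance IDs are hypothetical, and this is not the code Amazon or its customers were running.

```python
# A minimal sketch (not Amazon's own tooling): creating a classic Elastic
# Load Balancer that spans two availability zones and registering servers
# with it. All names and IDs below are placeholders.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Create a load balancer that listens on port 80 and spans two zones,
# so traffic can keep flowing if one data center goes dark.
elb.create_load_balancer(
    LoadBalancerName="my-web-elb",
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                "InstanceProtocol": "HTTP", "InstancePort": 80}],
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)

# Register the EC2 instances that should receive traffic.
elb.register_instances_with_load_balancer(
    LoadBalancerName="my-web-elb",
    Instances=[{"InstanceId": "i-0123456789abcdef0"},
               {"InstanceId": "i-0fedcba9876543210"}],
)
```

When one zone goes dark, the load balancer is supposed to stop routing traffic to the unhealthy servers, which is exactly the kind of rerouting that stalled when ELB itself fell behind.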
Another bug in Amazon's Relational Database Service kept a "small number" of databases from recovering properly from the power outage. Amazon technicians were able to get things up and running for these customers only after they manually restarted the failover systems, Amazon said.
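Customers who run RDS in its Multi-AZ configuration keep a standby copy of the database in a second data center, and when automatic failover stalls, the fallback is to force one by hand. The sketch below, again using boto3 with a made-up database identifier, illustrates that last resort; it is not the procedure Amazon's technicians used internally.

```python
# A hypothetical illustration, not Amazon's internal recovery procedure:
# forcing an RDS Multi-AZ failover so the standby replica in another
# data center takes over. "orders-db" is a placeholder identifier.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Reboot the database instance and promote its standby. ForceFailover=True
# is the manual escape hatch when automatic failover doesn't complete.
rds.reboot_db_instance(
    DBInstanceIdentifier="orders-db",
    ForceFailover=True,
)
```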
Conventional storage products are pretty good at recovering from a power failure, but Amazon ran into bottlenecks restoring services such as its Elastic Block Store. This is the kind of stuff you learn when you're building what is essentially a new operating system for the internet and nature hands you a sudden power outage.
"Amazon chose to do things themselves, which does give them the advantage of being able to deliver new services," says Justin Santa Barbara, the founder of Amazon customer (and competitor) FathomDB, a cloud-based database service. "The flip side is that things that everyone else has working don't necessarily work for them."
Amazon is working to convince customers that it can do a better job of keeping servers up and running. "We will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes," the company said in its summary of the outage.
The failed generators had been tested just six weeks earlier, but Amazon now says it's going to repair and retest the equipment, and replace it if it's not up to snuff.
The company did not respond on Tuesday to requests for more information on the outage.