One of the great things about cloud computing is that you don't have to worry about bumpy old software updates, except when they take down the entire cloud.
That's what happened on Tuesday when Microsoft's Hotmail, Outlook.com, and SkyDrive sites went down for some users.
On Thursday, Microsoft explained what happened, and as near as we can tell, the problem was a software glitch -- in an update to the data center's air conditioning system. Microsoft says that things went wrong when it installed new firmware "on a core part of our physical plant," which caused the entire data center to overheat.
Translation: Microsoft was probably updating its heating, ventilation, and air conditioning system, called an HVAC system by operations people, when things went wrong. Without air conditioning, the heat from thousands of servers would make it too hot to operate a computer in the data center. We asked Microsoft to clarify what core part of the physical plant went down, and which data center was hit, but they wouldn't tell us.
Industrial computer system experts we spoke to on Thursday, though, said that this seems like a likely explanation.
Twenty years ago, these control systems mostly ran specialized firmware, but over the past decade, a lot of them have moved onto less expensive commodity platforms based on operating systems such as Windows or Linux. This, in turn, has made them vulnerable to viruses and, apparently, buggy firmware updates.
"I've certainly heard of firmware updates taking out other systems, but this is the first time in a data center," says Eric Byres, chief technology officer with Tofino Industrial Security. He's spent a good part of his career tracking these outages.
Plant operators are typically electrical engineers, not computer science experts, but in the past few years they've been increasingly under pressure to update their control system software. That's because malicious software like the Stuxnet worm has put industrial control system security in the spotlight.
A typical industrial system might get a firmware update once a year, Byres says. "We've managed to get ourselves in a lovely little conflict here, where we want to patch more often and more aggressively, whereas we've had this history of patching on control systems very slowly and very conservatively."
Here's the official explanation for the outage, from a blog post written by Microsoft's Arthur de Haan: