A couple of weeks ago, on September 8th, a number of Windows Live services, including Hotmail and SkyDrive, became unreachable via their respective domain names for anywhere from about an hour to many hours, depending on how quickly the DNS service fix propagated around the world. Tonight, Arthur de Haan, Windows Live Corporate Vice President for Test and Service Engineering, took to the Inside Windows Live blog to explain what happened.
In short, de Haan explained, a “tool that helps balance network traffic was being updated and the update did not work correctly. As a result, configuration settings were corrupted, which caused a service disruption.” We reported on the outage at the time (with slightly different timings for the start of the outage and the return of the services). While it took some time for service to be restored to everyone around the globe, luckily no data was lost and everything was back up and running within a few hours.
The blog post goes into a bit of detail about what actually happened:
We determined the cause to be a corrupted file in Microsoft’s DNS service. The file corruption was a result of two rare conditions occurring at the same time. The first condition is related to how the load balancing devices in the DNS service respond to a malformed input string (i.e., the software was unable to parse an incorrectly constructed line in the configuration file). The second condition was related to how the configuration is synchronized across the DNS service to ensure all client requests return the same response regardless of the connection location of the client. Each of these conditions was tracked to the networking device firmware used in the Microsoft DNS service.
… and what has been done, and is still being done, to keep the problem from happening again:
After restoring service, we have identified two streams of work to drive specific service improvements around monitoring, problem identification, and recovery. Along with these service improvements, Microsoft is focused on further hardening the DNS service to improve its overall redundancy and fail-over capability.
We are also developing an additional recovery process that will allow a specific property the ability to fail over to restore service and then fail back when the DNS service is restored. In addition, we are reviewing the recovery tools to see if we can make more improvements that will decrease the time it takes to resolve outages.
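To make the failure mode de Haan describes a little more concrete, here is a minimal Python sketch of the general idea of validating a configuration file before it is synchronized anywhere. It is not Microsoft's implementation, and nothing in it comes from the blog post: the two-field "hostname address" line format, the function names `parse_line`, `parse_config` and `publish`, and the use of plain dictionaries as stand-ins for DNS nodes are all assumptions made for illustration.

```python
import ipaddress


def parse_line(line):
    """Parse one hypothetical 'hostname address' record; raise ValueError if malformed."""
    parts = line.split()
    if len(parts) != 2:
        raise ValueError(f"expected 2 fields, got {len(parts)}: {line!r}")
    host, addr = parts
    ipaddress.ip_address(addr)  # raises ValueError on an invalid address
    return host, addr


def parse_config(text):
    """Parse a whole config; fail loudly on the first bad line instead of guessing."""
    records = {}
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        try:
            host, addr = parse_line(stripped)
        except ValueError as err:
            raise ValueError(f"line {number}: {err}") from err
        records[host] = addr
    return records


def publish(new_text, replicas, last_good):
    """Validate first; copy to the replicas only if the whole file parses."""
    try:
        records = parse_config(new_text)
    except ValueError:
        return last_good  # reject the update and leave every replica untouched
    for replica in replicas:  # each replica is a dict standing in for a DNS node
        replica.clear()
        replica.update(records)
    return records


replicas = [{}, {}]
current = publish("mail.example.test 192.0.2.10\n", replicas, {})
# A malformed update is rejected, so the replicas keep the last good records.
current = publish("mail.example.test not-an-address\n", replicas, current)
```

The point of the sketch is the ordering: the file is checked in full before anything is copied, so a single badly formed line cannot be spread to every node by the synchronization step.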
Reliability of cloud services has been a hot topic lately, with all of the major service providers experiencing at least some glitches. Can we depend on cloud services to be always available? Do you?