Ideally, the system would also automatically roll back updates that cause problems; this already happens on Bing, but a rollback isn't always possible. "We have some metrics that allow us to tell if a build is doing well or not and to go backwards if we need to. But that's not always something we can do. For example if we've made an update to a database schema, we can't roll it back because you'd lose data. If we've put out a feature that allows a customer to create something, we can't go back because they'd lose what they've built. So we have to do what we call roll forward – fix the bug and get it out."
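The choice between rolling back and rolling forward described in that quote can be sketched in a few lines of Python. This is only an illustration of the reasoning, not Microsoft's deployment system: the `Release` record, its field names and the `remediation` function are all hypothetical, and a real system would derive the `healthy` flag from the build metrics the quote mentions.

```python
from dataclasses import dataclass


@dataclass
class Release:
    """Hypothetical record describing one deployed build."""
    name: str
    healthy: bool                # assumed summary of post-deployment metrics
    changes_schema: bool         # the build altered a database schema
    creates_customer_data: bool  # the build lets customers create something


def remediation(release: Release) -> str:
    """Return the action to take for a build, following the quoted reasoning."""
    if release.healthy:
        return "keep"
    # Undoing a schema change or a creation feature would destroy data,
    # so the only safe option is to fix the bug and ship a new build.
    if release.changes_schema or release.creates_customer_data:
        return "roll forward"
    # Nothing irreversible was introduced, so the previous build can return.
    return "roll back"
```

For example, an unhealthy build that only changed code paths would be rolled back, while an unhealthy build that migrated a schema would be marked for roll forward.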
Many Azure failures happen without anyone ever noticing, he points out. "The system is designed to auto recover from failure as much as possible. For example, with a service like AzureDB or Azure Storage there are machine failures all the time in the clusters because we're running at such large scale. We lose 2-3% of our servers every year but it's completely transparent to anybody running on those services because of the way they're designed. We monitor it to see if there's a problem causing those failures, but a server failing is invisible to customers."
Lessons in Azure
One lesson that Microsoft has learned from the most recent Azure outages is that it needs a system outside Azure where customers can check on the status of the service. "The dashboard is a higher level piece of the platform and it depends on parts of the system above Azure Storage and the core services. It needs to be higher level to be public facing, but it totally makes sense that even if Azure is down that we need to communicate with customers. So we've come up with a system to fail that dashboard over outside Azure if it runs into problems because it's depending on another part of Azure."
But Microsoft isn't going to try that with any other Azure features, he says. "It's just a communication interface – it's not a service. We can't fail over an Azure service out of Azure, because we'd need another Azure to run it on!"
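As a rough illustration of failing a status dashboard over to a host outside the platform, the Python sketch below switches to an external copy after several consecutive failed health probes. The article does not describe how Microsoft's mechanism works, so everything here is an assumption: the class name, the host names, the probe threshold and the decision to switch back on the first successful probe.

```python
class DashboardFailover:
    """Hypothetical selector between an in-platform and an external status page."""

    def __init__(self, primary: str, external: str, threshold: int = 3):
        self.primary = primary      # dashboard hosted inside the platform
        self.external = external    # copy hosted somewhere independent
        self.threshold = threshold  # consecutive failures before switching
        self.failures = 0

    def record_probe(self, ok: bool) -> str:
        """Record one health probe result and return the host to serve from."""
        # A successful probe resets the counter, which also fails back
        # to the primary dashboard as soon as it responds again.
        self.failures = 0 if ok else self.failures + 1
        return self.active()

    def active(self) -> str:
        """Return the external host once the failure threshold is reached."""
        if self.failures >= self.threshold:
            return self.external
        return self.primary
```

Requiring several consecutive failures is one way to avoid flipping hosts on a single dropped probe; the right threshold would depend on how often probes run.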