Google has revealed why one of its data centres shut down last month, citing a power failure as the cause of the outage.

Although Google's data centre was only down for a couple of hours, it did paint a picture of what it would be like if its centres failed more often.

The post-mortem of the power failure, which shut down the company's App Engine service, showed that a lack of human intervention and a failed backup server were to blame.

"The underlying cause of the outage was a power failure in our primary datacenter," explains the Google report.

"While the Google App Engine infrastructure is designed to quickly recover from these sort of failures, this type of rare problem, combined with internal procedural issues extended the time required to restore the service."

Not sufficiently trained

Explaining why there was a delay between the power failure and the system being restored, the report notes: "Recent work to migrate the datastore for better multihoming changed and improved the procedure for handling these failures significantly.

"However, some documentation detailing the procedure to support the datastore during failover incorrectly referred to the old configuration. This led to confusion during the event."

It concluded: "Although we had procedures ready for this sort of outage, the oncall staff was unfamiliar with them and had not trained sufficiently with the specific recovery procedure for this type of failure."

So, Google is human after all. For some reason this is a strangely comforting thought.

Via Data Centre Knowledge