AWS explains embarrassing reason behind last week's major cloud outage


Amazon has come clean regarding the huge AWS outage that took place last week. The technology giant revealed that its attempt to add server capacity caused the AWS US-EAST-1 region to experience a period of unexpected downtime.

The trigger for the disruption was a small addition of capacity to AWS’ Kinesis service, which underpins a significant number of other AWS offerings. Each server in the Kinesis front-end fleet creates a thread for every other server in that fleet so that they can communicate with one another. The extra capacity pushed the servers past the maximum number of threads they were allowed to run.
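To make the failure mode concrete, here is a minimal sketch, assuming each server holds one thread per peer plus a fixed baseline of threads for other work. The fleet sizes, the baseline, and the thread limit are hypothetical figures chosen for illustration; AWS has not published the real values.

```python
def threads_per_server(fleet_size: int, baseline_threads: int = 200) -> int:
    """Threads one server needs: one per peer, plus a baseline for other work.

    baseline_threads is a hypothetical figure, not an AWS number.
    """
    return (fleet_size - 1) + baseline_threads


def exceeds_limit(fleet_size: int, thread_limit: int,
                  baseline_threads: int = 200) -> bool:
    """True when a server in a fleet of this size would need more threads than allowed."""
    return threads_per_server(fleet_size, baseline_threads) > thread_limit


if __name__ == "__main__":
    THREAD_LIMIT = 10_000  # hypothetical per-server thread limit
    for fleet_size in (9_700, 9_900):  # hypothetical fleet before and after adding capacity
        needed = threads_per_server(fleet_size)
        status = "over limit" if exceeds_limit(fleet_size, THREAD_LIMIT) else "ok"
        print(f"{fleet_size} servers: {needed} threads per server ({status})")
```

Because every server needs a thread for every peer, a modest increase in fleet size raises the thread count on all servers at once, which is why the whole fleet can cross the limit together.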

Although AWS discovered the root cause of the issue pretty quickly, bringing everything back online was not so straightforward. Restarting servers too quickly could lead to errors and increased request latencies, or even cause some servers to be removed from the fleet entirely. As a result, Amazon could only bring back a few hundred servers at a time, which delayed the recovery process.
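A staged recovery of this kind can be sketched as a batched restart loop. This is an illustrative outline only, not AWS's actual procedure: the `restart` and `healthy` callbacks, the server names, and the batch size are all hypothetical.

```python
from typing import Callable, Iterator, List


def batches(servers: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size servers."""
    for start in range(0, len(servers), batch_size):
        yield servers[start:start + batch_size]


def staged_restart(servers: List[str], batch_size: int,
                   restart: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> List[str]:
    """Restart servers one batch at a time, stopping at the first unhealthy batch.

    restart and healthy are hypothetical callbacks supplied by the caller.
    Returns the servers in batches that passed the health check.
    """
    restored: List[str] = []
    for batch in batches(servers, batch_size):
        for server in batch:
            restart(server)
        if not all(healthy(server) for server in batch):
            break  # pause the rollout instead of pushing more load onto the fleet
        restored.extend(batch)
    return restored


if __name__ == "__main__":
    fleet = [f"server-{i}" for i in range(1000)]  # hypothetical fleet
    done = staged_restart(fleet, 300, restart=lambda s: None, healthy=lambda s: True)
    print(f"restored {len(done)} of {len(fleet)} servers")
```

Checking health after each batch trades speed for safety: recovery takes longer, but a problem shows up after a few hundred servers rather than across the entire fleet.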

Improvements to be made

Amazon is already working on a series of changes intended to prevent similar incidents in the future.

“In the very short term, we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet,” an AWS post explained.

“This will provide significant headroom in thread count used as the total threads each server must maintain is directly proportional to the number of servers in the fleet. Having fewer servers means that each server maintains fewer threads. We are adding fine-grained alarming for thread consumption in the service.”
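The alarming AWS describes could look something like the following sketch, which grades a server's thread usage against its limit. The function name, the thresholds, and the sample figures are assumptions made for illustration; AWS has not described how its alarms are configured.

```python
def alarm_level(threads_in_use: int, thread_limit: int,
                warn_at: float = 0.75, critical_at: float = 0.90) -> str:
    """Classify thread consumption as 'ok', 'warning', or 'critical'.

    warn_at and critical_at are hypothetical fractions of the thread limit.
    """
    if thread_limit <= 0:
        raise ValueError("thread_limit must be positive")
    utilization = threads_in_use / thread_limit
    if utilization >= critical_at:
        return "critical"
    if utilization >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    THREAD_LIMIT = 10_000  # hypothetical per-server thread limit
    for in_use in (5_000, 8_000, 9_900):  # hypothetical readings
        print(f"{in_use} threads in use: {alarm_level(in_use, THREAD_LIMIT)}")
```

An alarm that fires well before the limit gives operators time to act, and moving to fewer, larger servers lowers the per-server thread count so that the same limit leaves more headroom.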

In addition, AWS has pledged to finish testing an increase in thread count limits and to improve the cold-start time for its front-end fleet of servers. The company also apologized for the downtime, which took a number of high-profile sites offline, including Coinbase, Flickr, and Roku.

Via The Register

Barclay Ballard

Barclay has been writing about technology for a decade, starting out as a freelancer with ITProPortal covering everything from London’s start-up scene to comparisons of the best cloud storage services. After that, he spent some time as the managing editor of an online outlet focusing on cloud computing, furthering his interest in virtualization, Big Data, and the Internet of Things.