The AWS Outage is a wake-up call. Trust me.
Back in the day, we ran websites off personal and corporate servers, usually located within our homes and offices. As the internet grew, we built server racks, co-locations and datacenters. Eventually, though, businesses and services of all sizes offloaded server efforts to third parties—or as they’re known now, cloud services.
The logic is solid. We live in homes, but most of us do not physically build our own houses. For most companies, the act of serving and scaling websites is not core to the service they provide. Well, it sort of is, in that without servers there is no service. But what those servers actually run are the APIs, scripts, and other algorithms and programs the company developed to deliver things like your Netflix stream, the details of your Coinbase wallet account, or the next Tinder prospect.
The ability of cloud services like Amazon Web Services (AWS) and Microsoft’s Azure to rapidly scale up (or down, as needed), if you pay enough, makes them a smart strategic decision for a business of any size. You never know, for instance, when a small business is going to balloon into a big one and need to serve 10,000 simultaneous users instead of 500.
That’s the obvious upside of cloud-based web services. The downside is what happened this week with AWS.
Tuesday afternoon, huge chunks of AWS crumbled. The AWS Health Dashboard provides a nice play-by-play of the nearly seven-hour outage. At the heart of it was not, at least according to Amazon, an attack, hack, or Distributed Denial of Service (DDoS) assault. It was a pair of misbehaving APIs in one sector of the massive service.
We all live in fear of a major DDoS or hack breaching these systems (really any system we rely on) and bringing them to their knees, but that’s rarely what actually takes them down. When Cloudflare went down in 2019, it was initially assumed to be an attack on its system. However, we soon found out that it was just a bad software deployment, essentially human error.
Even with the AWS outage contained to what Amazon calls “US-EAST-1 Region,” the impact was significant and widespread. It was felt across consumer-facing platforms like Disney+ and, naturally, Amazon.com and some Alexa services.
When I posted the ongoing news on Twitter, I noticed how many people virtually slapped their heads and exclaimed, “That’s why [insert service] was out!”
It occurred to me that many of these users had no idea that AWS sits behind their favorite consumer and business systems. No one outside Amazon, by the way, has the exact number, but recent reports claim AWS serves millions of customers. Microsoft’s Azure also reports millions of users and counts the majority of Fortune 500 companies among them. Google Cloud has big names like Verizon, NewsCorp and Facebook.
Does something need to change?
The widespread use of cloud services is not a bad thing, though the lack of insight can lead to confusion and finger-pointing, like the guy who couldn’t amend orders in his system and got multiple error messages blaming his own systems (and not a third-party provider like AWS).
The combination of cloud systems’ wide reach and general lack of information and real-time feedback to affected customers is cause for some concern. The scale of any one outage is probably cause for alarm, especially as we consider the inevitable next one.
Gone are the days when someone’s server rack goes down and one website hiccups. Now we have small failures in big cloud systems like AWS, Azure and Cloudflare that trigger a tsunami of outages.
One person on Twitter asked, “What happened to scaling and load balancing?” It’s a fair question. AWS is built on hundreds of separate cloud server clusters and has tons of redundancies, scaling, and load balancing. And still, sometimes, it isn’t enough. Complex systems can misbehave and are especially vulnerable to software updates that can collide with ageing code. For as powerful and distributed as all these cloud services are, AWS included, they’re still programmed, run, and serviced by fallible humans.
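To make the redundancy idea concrete, here is a minimal sketch of what falling back from one region to another can look like from a customer's side. Everything in it is an assumption for illustration: the function names are hypothetical, the second region is just an example, and the outage is simulated rather than detected through any real AWS API.

```python
from typing import Callable, Sequence


class AllRegionsFailed(Exception):
    """Raised when every configured region fails to serve the request."""


def call_with_failover(regions: Sequence[str],
                       request: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response."""
    failed = []
    for region in regions:
        try:
            return request(region)
        except ConnectionError:
            # Remember the failure and move on to the next region.
            failed.append(region)
    raise AllRegionsFailed(f"no region could serve the request: {failed}")


def simulated_request(region: str) -> str:
    """Hypothetical stand-in for a real service call (assumption, not an AWS API)."""
    if region == "us-east-1":
        # Simulate an outage confined to a single region.
        raise ConnectionError("simulated regional outage")
    return f"served from {region}"


# The primary region is down, so the call falls through to the second one.
result = call_with_failover(["us-east-1", "us-west-2"], simulated_request)
print(result)
```

Even a fallback like this only helps if the second region is truly independent of the first, which is exactly the kind of hidden dependency that complex systems tend to accumulate.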
So how do we better inform the public and, more importantly, protect AWS, Azure, Cloudflare, and others from these kinds of errors, ones that lead not only to downed sites and services but also to the loss of millions of dollars?
It may be time to step back and look at the integrity and security of cloud systems in the same way we watch out for water systems. None of them are too big to fail, it seems, but all are too important to damage, violate, or lose.