The mega Facebook outage won't be the last of its kind - here's why

Facebook has said that a global outage that recently took its services and internal communications tools offline for several hours was due to a "faulty configuration change" to its routers.

Although all affected apps are now back online, the outage has left many wondering what happened, whether it could have been avoided and whether a similar outage could happen again anytime soon.

The company revealed that a misconfiguration within its BGP routing design was allowed to propagate across its routing fabric internally (iBGP) and then externally (eBGP).

BGP did it

BGP (Border Gateway Protocol) is the protocol that routes traffic between the networks that make up the public internet. Interior protocols such as RIP and OSPF, by contrast, route traffic within a single network.

BGP is responsible for selecting the best available routes to communicate data from a source to a specific destination.
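As a rough illustration of what "selecting the best available route" means, the sketch below applies just two of BGP's tie-breakers in order: highest local preference, then shortest AS path. The real decision process has more steps, and the `Route` class, peer names, prefix and AS numbers here are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str       # destination network (documentation range, hypothetical)
    next_hop: str     # neighbor that advertised the route (hypothetical name)
    local_pref: int   # operator-assigned preference; higher wins
    as_path: tuple    # autonomous systems the advertisement has crossed


def best_route(candidates):
    """Pick one route: highest local preference, then shortest AS path."""
    if not candidates:
        return None  # no advertisement at all: the destination is unreachable
    return min(candidates, key=lambda r: (-r.local_pref, len(r.as_path)))


routes = [
    Route("203.0.113.0/24", "peer-a", 100, (64496, 64497, 64499)),
    Route("203.0.113.0/24", "peer-b", 100, (64498, 64499)),
    Route("203.0.113.0/24", "peer-c", 80, (64499,)),
]
chosen = best_route(routes)
print(chosen.next_hop)  # peer-b
```

Note the empty case: if every advertisement for a destination is withdrawn, there is nothing left to choose from, which is what the rest of the internet experienced during the outage.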

“Due to Facebook's continued improvement in reducing their attack surface the issue was further compounded by an inability to access their internal management network (OOB - Out-of-Band), significantly delaying the time to resolve the issue due to not being able to access their own network and fix the configuration; a bit like forgetting your root or admin password and irreversibly losing access to your workstation, though at global internet scale,” added David.

Facebook's authoritative name servers are advertised to the rest of the internet via border gateway protocol (BGP).

David explained that to ensure reliable operation, Facebook's DNS servers disable BGP advertisements if they themselves cannot speak to their data centers.

“In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements,” he explained.

“The end result was that their DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find their servers.”
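The behavior David describes can be sketched as a simple health check that gates route advertisements. This is a toy model under stated assumptions, not Facebook's implementation: the `DnsSite` class, site names and prefixes are hypothetical.

```python
class DnsSite:
    """A DNS location that advertises its prefix only while it is healthy."""

    def __init__(self, name, prefix):
        self.name = name
        self.prefix = prefix
        self.dns_running = True   # the DNS software itself is up
        self.backbone_up = True   # the site can reach the data centers

    def is_healthy(self):
        # A site counts as healthy only if it can also reach the data centers.
        return self.dns_running and self.backbone_up


def advertised_prefixes(sites):
    """Prefixes still announced via BGP: unhealthy sites withdraw theirs."""
    return {site.prefix for site in sites if site.is_healthy()}


sites = [
    DnsSite("site-1", "192.0.2.0/24"),
    DnsSite("site-2", "198.51.100.0/24"),
]
before = advertised_prefixes(sites)

# The backbone goes down everywhere at once.
for site in sites:
    site.backbone_up = False
after = advertised_prefixes(sites)

print(len(before), len(after))              # 2 0
print(all(s.dns_running for s in sites))    # True
```

Every site is still running DNS, yet none of them is announced any longer. A safeguard that makes sense for a single failing site removes all of them when the failure is global.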

DNS (Domain Name System) is the internet's equivalent of the contacts list on a phone: it translates a hostname, such as the one in a URL, into a numeric IP address that a browser can connect to. This translation is known as name resolution.
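To make the contacts-list analogy concrete, here is a minimal sketch of a lookup table standing in for a name server. The `ZONE` table, hostnames and addresses are illustrative assumptions, and real resolution involves a hierarchy of servers rather than one dictionary.

```python
# Hypothetical records; the addresses come from a documentation-only range.
ZONE = {
    "www.example.com": "192.0.2.10",
    "mail.example.com": "192.0.2.25",
}


def resolve(hostname, server_reachable=True):
    """Translate a hostname to an IP address, like looking up a contact."""
    if not server_reachable:
        # The records still exist, but nobody can reach the server to ask.
        raise TimeoutError("name server unreachable")
    try:
        return ZONE[hostname.lower().rstrip(".")]
    except KeyError:
        raise LookupError(f"no record for {hostname}") from None


print(resolve("www.example.com"))  # 192.0.2.10
```

Calling `resolve("www.example.com", server_reachable=False)` raises `TimeoutError` even though the record is still in the table, which mirrors servers that were operational but unreachable.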

Lavell Juan, CEO of vertically integrated social network company Brag House, told TechRadar Pro that the foundation of any good social network is usability and a scalable infrastructure.

“Ensuring the design and frontend development engages the user and makes all the functionality easily accessible is just as important as having an infrastructure that can grow with and support the user base. From there, it's all about finding the right languages and frameworks to best create your end product,” he said.

“The most common misconfigurations involve software and data servers. Software can be outdated or missing a key security patch, while servers could require an upgrade or be incorrectly sized. The best way to avoid these issues is through proper documentation and automating processes to reduce manual work.”

Preventing further outages

Cloud-scale network fabrics lean extensively on automation, both to enable scale and to remove human error, but there is still a human component to the overall process.

David explained that 'guardrails', which ensure critical infrastructure decisions are controlled and validated before being deployed, are absolutely vital to the stability and continuity of services at internet scale.

“Guardrails apply not only to the cloud service providers' management of infrastructure but also to the businesses that build upon these platforms,” he said.

“Owners of websites need to be careful about cloud vendor lock-in and design-in the ability to migrate their business assets and processes to other competing cloud platforms, which in turn puts pressure on these cloud service providers to provide the best possible service or lose their clients.”

Juan blamed human error for the majority of interruptions and suggested that testing should be an integral part of the development process, catching the vast majority of these misconfigurations before they are pushed to production.
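A guardrail of the kind discussed above can be as simple as a pre-deployment check that refuses changes with an unusually large blast radius. The following is a minimal sketch, assuming a change is described by the set of routes advertised before and after it; the function name and the 10% threshold are hypothetical choices, not a description of any provider's tooling.

```python
def check_change(current_routes, proposed_routes, max_withdrawn_fraction=0.1):
    """Return a list of guardrail violations; an empty list means 'deploy'."""
    problems = []
    current, proposed = set(current_routes), set(proposed_routes)

    # Guardrail 1: a change must never leave the network with no routes.
    if current and not proposed:
        problems.append("change withdraws every advertised route")

    # Guardrail 2: cap how much a single change may withdraw.
    withdrawn = current - proposed
    if current and len(withdrawn) / len(current) > max_withdrawn_fraction:
        problems.append(
            f"change withdraws {len(withdrawn)} of {len(current)} routes"
        )
    return problems


live = [f"10.0.{i}.0/24" for i in range(10)]

print(check_change(live, live[:9]))  # []
print(check_change(live, []))
```

The second call reports both violations, so a pipeline that runs this check as a test would block the change before it reached production.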

While gatekeeper may be too strong a term, Amazon, Facebook, Apple, and Google have become custodians of access to some of the largest marketplaces today. As much as the fear of vendor lock-in is acknowledged, there is also the fear of marketplace lock-out.

“All of these businesses apply the tactics and economics of platform strategy - each provides technologies such as identity and authentication for their user base enabling their users to access apps within and across internet ecosystems; without access how will the businesses built upon those platforms reach their customers,” said David.

“Companies have little choice but to ensure they are integrated into these ecosystems and are already, in many cases, entirely dependent on them for their own success. Multi-provider strategies are key though, within the oligopoly of Facebook, Apple, Amazon and Google, technology alone will not be a panacea for mitigation of these risks.”

This does not mean that there are no options available, although they are more likely to be viable for larger businesses. Dropbox, for example, has already migrated much of its storage away from Amazon to run its own infrastructure.

The key takeaway, David says, is that much of the know-how and technology has been commoditized and is available to businesses. However, the highly skilled workforce needed to take advantage of it is still lacking.

“Investing in training and organic growth to ensure companies can compete at the same levels of technological maturity must be prioritized as a competitive strategy for tomorrow's businesses,” he concluded.

So in short, can big outages like that of Facebook and WhatsApp happen again? Yes. But because this outage was caused by underlying technological issues such as a bug or human error, such disruptions can be minimized through regular testing throughout the development process.

Abigail is a B2B Editor who specializes in web hosting and website builder news, features and reviews at TechRadar Pro. She has been a B2B journalist for more than five years covering a wide range of topics in the technology sector from colocation and cloud to data centers and telecommunications. As a B2B web hosting and website builder editor, Abigail also writes how-to guides and deals for the sector, keeping up to date with the latest trends in the hosting industry. Abigail is also extremely keen on commissioning contributed content from experts in the web hosting and website builder field.