It was big trouble for big social: Instagram, WhatsApp, Facebook and Facebook Messenger experienced widespread, prolonged outages on Monday -- meaning billions of users around the globe couldn't use some of the world's most popular sites.
While incidents like this aren't unheard of, the length of the outage, which spanned over six hours, was very rare. Facebook's services finally came back online late in day after the significant outage.
The cause of the outage was difficult to diagnose at first for the social media giant, which reportedly had frantic engineers rushing from server to server in efforts to diagnose the issue. Facebook updated its engineering page at the end of the day with information pointing to the root of the problem: config changes in the company's internal hardware.
"To all the people and businesses around the world who depend on us, we are sorry for the inconvenience caused by today’s outage across our platforms. We’ve been working as hard as we can to restore access, and our systems are now back up and running.
"The underlying cause of this outage also impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem.
"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.
"Our services are now back online and we’re actively working to fully return them to regular operations. We want to make clear at this time we believe the root cause of this outage was a faulty configuration change. We also have no evidence that user data was compromised as a result of this downtime.
"People and businesses around the world rely on us everyday to stay connected. We understand the impact outages like these have on people’s lives, and our responsibility to keep people informed about disruptions to our services. We apologize to all those affected, and we’re working to understand more about what happened today so we can continue to make our infrastructure more resilient."
The outage started with the company's Newsroom and extended to the full site and social network. But Instagram and other services remained down in some areas for much longer, leading us to believe services resumed on a region-by-region basis.
DownDetector – a website that tracks outages of online services – had shown that all services were struggling in many territories. There had also been reports that services with accounts tied to Facebook logins, like Airbnb and Strava, were not working.
There's was no way to avoid the issues, so users just had to wait until they were solved to reconnect to WhatsApp, Instagram or Facebook.
Shortly after the outage began, Facebook's communications executive, Andy Stone, was the first to share an update on Twitter to say the company is aware of the issues and it's currently working on a fix, followed by WhatsApp's Twitter account.
Soon thereafter, Facebook's official account acknowledged that users were having trouble accessing the company's apps and products. Facebook CTO Mike Schroepfer tweeted an apology four hours into the outage, and after six hours, the company sent another tweet apologizing as its services come back online.
The issue affected other Facebook products, too: users reported issues with the company's Oculus virtual reality gaming services.
Noted Facebook and Twitter data miner Jane Manchun Wong warned users via tweet not to restart their Oculus devices during the outage lest they lose their games. VR game and software designer Julien Dorra tweeted a video of what it's like to load up Oculus in the midst of the outage:
Facebook brought Oculus down with them 🙁 pic.twitter.com/rfapj1yaSUOctober 4, 2021
And the outage might have impacted Facebook's real-world infrastructure as well: according to a tweet by New York Times reporter Sheera Frenkel, a Facebook employee reportedly can't even enter company buildings due to malfunctioning badges.
Another NYT report claims employees struggled to make calls from work-issued phones or receive emails from outside the company.
Facebook outages: what happened?
None of the Facebook, Whatsapp, or Instagram accounts have explained what originally caused the outage, leading to speculation and analysis. At this point, most agree that this isn't a hack or directed attack on Facebook's infrastructure, and sources have told the New York Times that it probably wasn't a cyberattack because 'one hack was unlikely to affect so many apps at once.'
Instead, evidence shows the company's network paths to the outside web just disappeared without explanation this morning.
Brian Krebs of cybersecurity firm Krebs on Security tweeted his conclusion that the domain name system (DNS) records routing traffic to Facebook sites and services were simply withdrawn – as in, gone from the web – this morning:
Confirmed: The DNS records that tell systems how to find https://t.co/qHzVq2Mr4E or https://t.co/JoIPxXI9GI got withdrawn this morning from the global routing tables. Can you imagine working at FB right now, when your email no longer works & all your internal FB-based tools fail?October 4, 2021
In a follow-up tweet, Krebs clarified with his belief that the border gateway protocol (BGP) routes serving Facebook's DNS were gone, making every site on a Facebook domain inaccessible. This presumably explains why its services and third-party login access, as well as Instagram/WhatsApp/Facebook Messenger, are completely down.
Other networking companies have noticed and theorized the issue is with BGP routes, including Cloudflare SVP Dane Knecht, who tweeted an observation that Facebook DNS and other services are down and 'their BGP routes have been withdrawn from the internet.'
He noted that Cloudflare also saw its own failures, but a follow-up tweet suggested it was recovering. Separately, Cloudflare CTO John Graham-Cumming tweeted seeing Facebook's BGP changes as they happened and suggested they were mostly BGP route withdrawals.
PJ Norris, principal systems engineer at Tripwire, sent the following analysis to TechRadar regarding the outage:
"Around 15.40 UTC on Monday 4th October, a change was made to the BGP – Border Gateway Protocol. BGP is a technology which ISP’s share information about which providers are responsible for routing Internet traffic to which specific groups of Internet addresses.
"In other words, Facebook inadvertently removed the ability to tell the world where it lives.
"Backing out the change was not easy though, since Facebook uses their own in house communication and email services which were impacted by the outage. With people working remote during the pandemic, this was a big issue.
"Those who were onsite at the data centres and offices who were trying to back out the change, were unable to access the environments as the door access control system was down due to the impact of the outage.
"So the question always comes down to, “could this have been avoided?” It’s evident at this early stage that Facebook had a single point of failure that cascaded in to a significant and costly outage for the technology giant."
BGP is a big (global) problem
While DNS is a website's numerical address on the internet (which is translated from the 'www.___.com' you type in your search bar), BGP routes are the pathways that requests take through servers and computers to get to their destination. When Facebook's BGP routes were reportedly withdrawn from the internet, sites connected to those routes (like Cloudflare above) saw them collapse, and Facebook sites and services become inaccessible.
Internet theorizing on the r/sysadmin subreddit suggested that a configuration change happened this morning that caused the BGP routes to go down, and this cut Facebook off from making remote changes – from here on, only physical access could fix the damage (emphasized in a screenshot by Twitter user Andree Toonk).
An aforementioned New York Times report supports this theory, citing an alleged internal Facebook memo that a small team of employees was dispatched to the company's Santa Clara, CA data center to manually reset the company servers.
Just before Facebook services started coming back online, Krebs cited a source in stating that the outage was caused by a faulty BGP update that blocked remote users from reverting changes while locking out local access:
From trusted source: Person on FB recovery effort said the outage was from a routine BGP update gone wrong. But the update blocked remote users from reverting changes, and people with physical access didn't have network/logical access. So blocked at both ends from reversing it.October 4, 2021
Outages: a continuing problem?
"Outages are increasing in volume and can often point towards a cyber-attack, but this can add to the confusion early on when we are diagnosing the causes," said Jake More, expert at cybersecurity and antivirus company ESET, in an emailed comment to TechRadar. "As we saw with Fastly in the summer, web-blackouts are more often originate from undiscovered software bug or even human error."
March and April 2021 saw a similarly major outage where each of Facebook's services affected today – Facebook, Instagram, WhatsApp, and Facebook Messenger – was down for over half an hour each time. But given how much faster those issues were resolved, the latest outage seems to be a catastrophe of a much higher magnitude.
Those last outages were due to a bug in the Domain Name System (DNS) of these services, but seemingly not as severe as a BGP issue.