Facebook is finally back after hours-long outage, but Instagram and WhatsApp are still down

Social app icons on a phone screen — (Image credit: dole777 / Unsplash)

It was big trouble for big social: Instagram, WhatsApp, Facebook and Facebook Messenger experienced widespread, prolonged outages on Monday -- meaning billions of users around the globe couldn't use some of the world's most popular sites.

While incidents like this aren't unheard of, the length of the outage, which spanned over six hours, was very rare. Facebook's services finally came back online late in the day.

The cause of the outage was difficult to diagnose at first for the social media giant, which reportedly had frantic engineers rushing from server to server in efforts to diagnose the issue. Facebook updated its engineering page at the end of the day with information pointing to the root of the problem: config changes in the company's internal hardware.

Facebook outages: what happened?

None of the Facebook, WhatsApp, or Instagram accounts have explained what originally caused the outage, leading to speculation and analysis. At this point, most agree that this isn't a hack or directed attack on Facebook's infrastructure, and sources have told the New York Times that it probably wasn't a cyberattack because 'one hack was unlikely to affect so many apps at once.'

Instead, evidence shows the company's network paths to the outside web just disappeared without explanation this morning.

Brian Krebs of cybersecurity firm Krebs on Security tweeted his conclusion that the domain name system (DNS) records routing traffic to Facebook sites and services were simply withdrawn – as in, gone from the web – this morning:

Confirmed: The DNS records that tell systems how to find https://t.co/qHzVq2Mr4E or https://t.co/JoIPxXI9GI got withdrawn this morning from the global routing tables. Can you imagine working at FB right now, when your email no longer works & all your internal FB-based tools fail? (October 4, 2021)

In a follow-up tweet, Krebs clarified his belief that the border gateway protocol (BGP) routes serving Facebook's DNS were gone, making every site on a Facebook domain inaccessible. This presumably explains why its services and third-party login access, as well as Instagram, WhatsApp and Facebook Messenger, are completely down.

Other networking companies have noticed and theorized the issue is with BGP routes, including Cloudflare SVP Dane Knecht, who tweeted an observation that Facebook DNS and other services are down and 'their BGP routes have been withdrawn from the internet.'

He noted that Cloudflare also saw its own failures, but a follow-up tweet suggested it was recovering. Separately, Cloudflare CTO John Graham-Cumming tweeted that he had seen Facebook's BGP changes as they happened and suggested they were mostly BGP route withdrawals.

PJ Norris, principal systems engineer at Tripwire, sent the following analysis to TechRadar regarding the outage:

"Around 15.40 UTC on Monday 4th October, a change was made to the BGP – Border Gateway Protocol. BGP is a technology which ISP’s share information about which providers are responsible for routing Internet traffic to which specific groups of Internet addresses.

"In other words, Facebook inadvertently removed the ability to tell the world where it lives.

"Backing out the change was not easy though, since Facebook uses their own in house communication and email services which were impacted by the outage. With people working remote during the pandemic, this was a big issue.

"Those who were onsite at the data centres and offices who were trying to back out the change, were unable to access the environments as the door access control system was down due to the impact of the outage.

"So the question always comes down to, “could this have been avoided?” It’s evident at this early stage that Facebook had a single point of failure that cascaded in to a significant and costly outage for the technology giant."

BGP is a big (global) problem

While DNS is the system that translates the 'www.___.com' you type in your address bar into a website's numerical address on the internet, BGP routes are the pathways that requests take through networks to reach that address. When Facebook's BGP routes were reportedly withdrawn from the internet, networks that relied on those routes (like Cloudflare above) saw failures of their own, and Facebook sites and services became inaccessible.
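To make the relationship between the two systems concrete, here is a minimal toy sketch in Python. It is not how BGP or DNS are actually implemented: the routing table is a plain dictionary, and the prefixes, hostname, AS number and function names (`reachable`, `resolve`, `fetch`) are all hypothetical, drawn from address ranges reserved for documentation. It only illustrates why withdrawing the routes to a company's nameservers makes every name on its domains fail to resolve, even though the DNS records themselves still exist.

```python
import ipaddress

# Hypothetical routing table: prefix -> network announcing it.
# Both prefixes and the AS number come from documentation-only ranges.
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): "AS64500",     # nameservers live here
    ipaddress.ip_network("198.51.100.0/24"): "AS64500",  # web servers live here
}

# Hypothetical DNS zone: hostname -> numerical address.
DNS_ZONE = {"www.example.com": "198.51.100.80"}
NAMESERVER_IP = ipaddress.ip_address("192.0.2.53")


def reachable(ip, routes):
    """An address is reachable only if some announced prefix covers it."""
    return any(ip in prefix for prefix in routes)


def resolve(name, routes):
    """Look up a name, but only if the nameserver itself can be reached."""
    if not reachable(NAMESERVER_IP, routes):
        return None  # the record exists, but nobody can ask for it
    return DNS_ZONE.get(name)


def fetch(name, routes):
    """Simulate a request: resolve the name, then route to the address."""
    address = resolve(name, routes)
    if address is None:
        return "DNS failure"
    if not reachable(ipaddress.ip_address(address), routes):
        return "no route"
    return "ok"


# Normal operation: both prefixes are announced.
print(fetch("www.example.com", ROUTES))

# Withdraw every route: the name can no longer be resolved at all.
print(fetch("www.example.com", {}))
```

In this sketch, withdrawing only the nameserver prefix is already enough to produce a DNS failure, which matches the symptom observers described: from the outside, the sites looked as if they had vanished from the web.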

Internet theorizing on the r/sysadmin subreddit suggested that a configuration change happened this morning that caused the BGP routes to go down, and this cut Facebook off from making remote changes – from here on, only physical access could fix the damage (emphasized in a screenshot by Twitter user Andree Toonk).

The aforementioned New York Times report supports this theory, citing an alleged internal Facebook memo stating that a small team of employees was dispatched to the company's Santa Clara, CA data center to manually reset the company's servers.

Just before Facebook services started coming back online, Krebs cited a source in stating that the outage was caused by a faulty BGP update that blocked remote users from reverting changes while locking out local access:

From trusted source: Person on FB recovery effort said the outage was from a routine BGP update gone wrong. But the update blocked remote users from reverting changes, and people with physical access didn't have network/logical access. So blocked at both ends from reversing it. (October 4, 2021)

Outages: a continuing problem?

"Outages are increasing in volume and can often point towards a cyber-attack, but this can add to the confusion early on when we are diagnosing the causes," said Jake Moore, expert at cybersecurity and antivirus company ESET, in an emailed comment to TechRadar. "As we saw with Fastly in the summer, web-blackouts are more often originate from undiscovered software bug or even human error."

March and April 2021 saw similarly major outages, in which each of the Facebook services affected today (Facebook, Instagram, WhatsApp, and Facebook Messenger) was down for over half an hour each time. But given how much faster those issues were resolved, the latest outage seems to be a catastrophe of a much higher magnitude.

Those earlier outages were due to a bug in the Domain Name System (DNS) of these services, which was seemingly a less severe problem than a BGP issue.


James is the Editor-in-Chief at Android Police. Previously, he was Senior Phones Editor for TechRadar, and he has covered smartphones and the mobile space for the best part of a decade bringing you news on all the big announcements from top manufacturers making mobile phones and other portable gadgets. James is often testing out and reviewing the latest and greatest mobile phones, smartwatches, tablets, virtual reality headsets, fitness trackers and more. He once fell over.
