Facebook Outage: The Bigger Resilience Picture

Charlotte Binstead

What are the lessons about resilience we can learn from the Facebook outage? Cloudsoft’s Charlotte Binstead tells us more.

Facebook, Instagram, and WhatsApp are all back online after a six-hour global outage, starting at around 4pm GMT on the 4th of October.

A single minute of downtime can cost Facebook over $200,000, and this outage looks to have cost them over $70 million in six hours (and that’s before we account for the reputational damage this kind of massive outage can have).
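As a rough check on that figure, the arithmetic can be sketched in a few lines. This assumes the per-minute cost quoted above stays constant for the full six hours, which is a simplification; the function name is illustrative only.

```python
# Back-of-the-envelope downtime cost, using the figures quoted in the article.
# Assumption: the per-minute cost is flat across the whole outage.
COST_PER_MINUTE_USD = 200_000
OUTAGE_HOURS = 6


def downtime_cost(cost_per_minute: int, hours: int) -> int:
    """Return the estimated cost of an outage lasting `hours` hours."""
    return cost_per_minute * hours * 60


estimate = downtime_cost(COST_PER_MINUTE_USD, OUTAGE_HOURS)
print(f"Estimated cost: ${estimate:,}")  # Estimated cost: $72,000,000
```

At $200,000 a minute, 360 minutes of downtime comes to $72 million, which is in line with the "over $70 million" estimate.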

This shows that downtime is a significant cost to businesses, especially when it’s unplanned; IDC estimates that Tier 1 banks waste around $2.5 billion a year on critical application failure.

More than 3.5 billion people across the world use Facebook, Instagram, WhatsApp and Facebook’s own Messenger service, and not just to share memes and watch dog videos.

These apps are hardwired into how we communicate, how businesses operate, how we access other services (using social sign-on) and in some parts of the world – for example India and many parts of Southeast Asia – Facebook is synonymous with the internet.


The domino effect

The outage has thrown into sharp relief the complex network of functions and services reliant on the availability and resilience of a single service provider.

According to the New York Times, users reported being unable to access internet-connected smart devices like smart TVs and thermostats – not provided by Facebook, but accessed via Facebook credentials.

Facebook and Instagram especially are part of the economic fabric too; businesses around the world, reliant on Facebook platforms to drive orders, essentially ceased to trade while the platforms were offline.


Nor is Facebook uniquely complex; most enterprises of any scale are operating on hybrid architectures that have grown organically over time, resulting in this kind of complexity-driven vulnerability.

Enterprises now operate tens of thousands of applications across thousands of workloads, making identifying, resolving, and further preventing an issue incredibly difficult.

Facebook claims the outage was caused by configuration changes which affected traffic between its data centres, and the effects were not limited to users.

Troublingly, the outage also affected Facebook’s internal systems, taking out internal communications platforms, locking staff out of systems and, most alarmingly, hindering engineers from physically accessing servers because their security credentials were blocked.


No one’s too big to fail

Yesterday’s outage shows how easy it is for enterprises like Facebook to fail on a global scale, with wide-reaching ripple effects.

These are the kinds of outages that trouble regulators and are the impetus behind new regulations governing Digital Operational Resilience, initially in the financial sector.

Earlier this year, the Financial Conduct Authority (FCA) published its final guidance on operational resilience in the Financial Services sector, which comes into force in March 2022.

The FCA guidance aligns with the EU’s Digital Operational Resilience Act (DORA), which is currently under consultation and set to be enacted from 2023. In addition to identifying any vulnerabilities in their operational resilience, firms are expected to have:

  • Identified their important business services;
  • Set impact tolerances for the maximum tolerable disruption; and
  • Carried out mapping and testing to a level of sophistication necessary to do so.
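To make the idea of an impact tolerance concrete, here is a minimal sketch of how a firm might compare an outage against the tolerances it has set. The service names and tolerance values are hypothetical and chosen only for illustration; they do not come from the FCA guidance or DORA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessService:
    """An important business service and its impact tolerance (hypothetical values)."""
    name: str
    impact_tolerance_minutes: int  # maximum tolerable disruption


def breached_tolerances(services, outage_minutes):
    """Return the names of services whose tolerance the outage exceeded."""
    return [s.name for s in services if outage_minutes > s.impact_tolerance_minutes]


# Hypothetical example: three services with different tolerances.
services = [
    BusinessService("payments", 120),
    BusinessService("customer login", 240),
    BusinessService("monthly reporting", 1440),
]

# A six-hour (360-minute) outage, like the one described in this article.
print(breached_tolerances(services, outage_minutes=360))
# ['payments', 'customer login']
```

In this sketch a six-hour outage would breach the tolerance for two of the three services, which is the kind of result that mapping and testing exercises are meant to surface before a real incident does.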

The chaos wrought on Facebook’s internal systems, and the hindrance it caused to restoring services, will no doubt be playing on the minds of many involved in writing these regulations and those set to be affected by them.

Charlotte Binstead

Marketing Manager, Cloudsoft
