The dangers of social media downtime and its impact on business

The day a social media giant became inaccessible without warning was a bad one, and it serves as a salient reminder of the importance of network resilience.

Back in October 2021, Facebook, WhatsApp, Messenger and Instagram hit the headlines for all the wrong reasons. On that fateful day, an estimated $100 million in revenue was lost because of a simple mistake that left the services offered by Meta, the parent company of Facebook and its associated platforms, offline for almost six hours.

During routine maintenance, an engineer had run a command that accidentally disconnected all of Facebook’s data centres. Unfortunately, this error had a huge impact on the 10 million brands that use the tech giant’s platforms to advertise their products and services.

The outage frustrated users around the world, but the damage to Meta’s reputation may prove to be even more costly, as businesses can no longer take the platforms’ reliability for granted. It is estimated that businesses that use Meta’s advertising services saw sales drop by between 30% and 70% compared to the same period a week earlier.

What went wrong?

The outage was triggered by the system that manages Meta’s global backbone routers, which coordinate network traffic between its data centres. A faulty configuration change was made, and a command was issued that caused a complete disconnection between Meta’s servers, its data centres and the internet. The data centres couldn’t be accessed remotely because their networks were down and, although engineers were sent onsite to debug the issue, the disruption had also locked them out of the buildings.

Network outages are not uncommon

While Meta received a lot of flak over the incident, it is worth remembering that network outages are not uncommon. In July 2021, people visiting various websites, including Airbnb, HSBC, British Airways, UPS and the PlayStation Network, received Domain Name System (DNS) error messages, which meant they could not reach the sites. The previous month, cloud computing provider Fastly had its services interrupted, which also took down a large number of sites, including many government and newspaper websites around the world. It later emerged that a customer had inadvertently changed a setting that affected the entire infrastructure.

Could the network outage have been prevented?

The Meta outage serves to highlight the importance of solving network outages quickly and efficiently. Remote access through Out-of-Band management (OOB) is the key here. OOB has moved on significantly in recent years from providing purely reactive emergency-only access to delivering much more proactive approaches to network resilience and increasingly interactive network management, with Network Operations (NetOps) workflows, orchestration and automation. This approach provides an alternative pathway to equipment located at various sites even when the primary network is down and facilitates access to edge infrastructure to ensure business continuity.

With a combination of the latest OOB and NetOps tools, businesses can easily configure and set up systems that deliver secure provisioning of different locations through the network operations centre (NOC). This means that there is no need to send engineers out to a site. To facilitate this, tools are available to automate and orchestrate the NetOps workflow.

In the case of Meta’s outage, a network platform could have been used to back up device configuration files before making changes to the network. This would have allowed engineers to restore the known-good configuration files as soon as they discovered that the erroneous change had caused the outage. Pushing the saved configuration files back to the affected equipment would have quickly restored the network. Having this immediate access would have significantly reduced the outage time.
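The back-up-then-restore workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Meta or any particular OOB product works: the Device and ConfigVault classes, the apply_change function and the "advertise-routes" setting are all hypothetical names invented for the example, and the device is simulated in memory rather than reached over a real management network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Device:
    """Hypothetical stand-in for a router reached over the OOB path."""
    name: str
    config: str


@dataclass
class ConfigVault:
    """Holds the last known-good configuration for each device."""
    saved: Dict[str, str] = field(default_factory=dict)

    def backup(self, device: Device) -> None:
        self.saved[device.name] = device.config

    def restore(self, device: Device) -> None:
        device.config = self.saved[device.name]


def apply_change(device: Device, new_config: str, vault: ConfigVault,
                 is_healthy: Callable[[Device], bool]) -> bool:
    """Back up, apply the change, and roll back if the health check fails."""
    vault.backup(device)          # save the known-good config first
    device.config = new_config    # push the change
    if is_healthy(device):
        return True
    vault.restore(device)         # push the saved config back
    return False


def advertises_routes(device: Device) -> bool:
    """Hypothetical health check: the router must still advertise routes."""
    return "advertise-routes yes" in device.config


if __name__ == "__main__":
    router = Device("backbone-1", "advertise-routes yes")
    vault = ConfigVault()
    ok = apply_change(router, "advertise-routes no", vault, advertises_routes)
    print(ok, router.config)  # False advertise-routes yes
```

The point of the design is the ordering: the known-good configuration is captured before anything changes, so the rollback does not depend on the primary network still being reachable, only on the independent management path.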

By having an always-on, independent management plane using OOB, businesses will maintain secure remote access to all of their sites and devices, even when an outage occurs. This means network engineers can very quickly identify a problem remotely and provide a suitable remedy, typically without ever having to go onsite.

If you don’t want to make the headlines for the wrong reasons, ensure you have a resilient network. Your customers will remain connected and you will maintain that all-important business continuity, keeping your reputation intact.

Alan Brown