The Need for Responsible Automation of Production Incident Response
Unless you have been living under a rock, you saw the Facebook outage play out earlier this week. While being unable to share and comment on cat pictures was the extent of the impact for most users, the story was much bigger in the operations community.
Facebook has already revealed a bit about the challenges it faced, such as needing physical access to equipment, which was made significantly harder by the fact that the badging systems were inaccessible due to the same outage!
Facebook has had a mature, automated incident-response process in place for a long time now, and I won't go out of my way to suggest changes there. But the internet never fails to provide a good quote:
“Best type of oncall is the flying blind oncall”
Here are some interesting stories from around the rest of the Internet that have their origin in the Facebook outage.
Social Identity Blackout
A lot of us use the convenience of Facebook identity to access apps and services on the internet, ranging from shared rides to delivery apps. I counted 86 apps connected to my Facebook profile. Signing in to any of these services during the outage was impossible if the service needed re-authentication. Immediately we see that the impact was much more serious, with a blast radius spanning many parts of daily life.
Various news outlets carried reports ranging from people unable to enter their apartments, because they had used their Facebook identity to sign up on the electronic entry system, to people unable to redeem the points on their burger orders.
For the providers of these apps and services, clearly one of the first symptoms was a spike in authentication failures. While there is not much these app developers can do about the failures themselves, being able to correlate an incoming call to the root cause is useful given the spike in customer support call volume.
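One way to make that correlation automatic is to count recent login failures per identity provider and flag a spike, so support staff immediately see "Facebook login is failing" rather than a pile of unrelated tickets. A minimal sketch, with hypothetical window and threshold values chosen purely for illustration:

```python
import time
from collections import deque

# Illustrative sketch: track recent login failures per identity provider
# so a spike (e.g. Facebook OAuth failing) can be surfaced to support staff.
WINDOW_SECONDS = 300    # sliding-window length (assumed value)
SPIKE_THRESHOLD = 100   # failures per window that count as a spike (assumed)

class AuthFailureMonitor:
    def __init__(self, window=WINDOW_SECONDS, threshold=SPIKE_THRESHOLD):
        self.window = window
        self.threshold = threshold
        self.failures = {}  # provider name -> deque of failure timestamps

    def record_failure(self, provider, now=None):
        now = time.time() if now is None else now
        q = self.failures.setdefault(provider, deque())
        q.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while q and q[0] < now - self.window:
            q.popleft()

    def spiking_providers(self, now=None):
        now = time.time() if now is None else now
        spiking = []
        for provider, q in self.failures.items():
            while q and q[0] < now - self.window:
                q.popleft()
            if len(q) >= self.threshold:
                spiking.append(provider)
        return spiking

monitor = AuthFailureMonitor(threshold=3)
for _ in range(3):
    monitor.record_failure("facebook", now=1000.0)
monitor.record_failure("google", now=1000.0)
print(monitor.spiking_providers(now=1000.0))  # ['facebook']
```

In practice the provider label would come from the OAuth flow's error path, and the alert would be wired into the support dashboard so agents can answer "is it us or them?" on the first call.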
The Retry Deluge
To provide the reliability that it does, the Internet is full of retry logic for coping with failure. The Facebook outage itself manifested as Domain Name System (DNS) lookup failures. DNS is the service of the Internet, operated by various Internet Service Providers (ISPs) and Content Delivery Networks (CDNs), that maps a website name to its server's IP address.
Once the incident started, the DNS servers themselves came under what amounted to a DDoS attack from the retries issued by billions of devices failing their Facebook DNS lookups. Since DNS is also used by every other service for its operation, those services started slowing down too, because lookups for the healthy websites slowed down under the same load!
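The standard client-side defense against this self-inflicted pile-on is retry with exponential backoff and jitter: each failed attempt waits longer, and the random jitter keeps billions of clients from retrying in lockstep. A minimal sketch (function names and defaults are my own, not from any particular library):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base=0.5, cap=30.0):
    """Call operation(); on failure, sleep a random ("full jitter") delay
    drawn from [0, min(cap, base * 2**attempt)] before retrying."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
            time.sleep(delay)

# Usage: a transiently failing operation that succeeds on the third try.
calls = {"n": 0}
def flaky_lookup():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient failure")
    return "93.184.216.34"

print(retry_with_backoff(flaky_lookup, base=0.001))
```

The jitter is the important part: plain exponential backoff still synchronizes clients into waves, whereas randomized delays spread the retry load over time instead of concentrating it on already-struggling servers.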
And so, the phones start ringing at the support desks!
Fortunately, ISPs have a workaround: drop or rate-limit the bad traffic at the edge of their network to minimize the impact. However, getting this done needs quick automation, given that these rules must be programmed into a large number of devices in a short time!
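The shape of that automation is a parallel fan-out: push one rule to many devices at once and collect per-device results. The sketch below is entirely hypothetical; `push_rule` is a placeholder for whatever the real device interface is (NETCONF, gNMI, or a vendor API), and the rule text is made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Made-up rule syntax, standing in for a real rate-limit configuration.
RULE = "rate-limit udp dst-port 53 to 1000pps"

def push_rule(device, rule):
    # Placeholder for a real device call (NETCONF/gNMI/vendor CLI).
    # Here we simply pretend the rule was applied successfully.
    return (device, "applied")

def apply_everywhere(devices, rule, workers=32):
    """Push one rule to many devices in parallel; return per-device status."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(push_rule, d, rule): d for d in devices}
        for fut in as_completed(futures):
            device = futures[fut]
            try:
                _, status = fut.result()
                results[device] = status
            except Exception as exc:
                # A device that rejects the rule must not block the rest.
                results[device] = f"failed: {exc}"
    return results

print(apply_everywhere(["edge-1", "edge-2", "edge-3"], RULE))
```

Even in this toy form, the design point stands: per-device failures are recorded rather than raised, because during an incident a partial rollout that covers most of the edge is far better than an all-or-nothing push that aborts on the first unreachable box.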
The Internet / Facebook Duality
When a large site like Facebook goes down, customers confuse the service being down with the Internet being down! A very large number of customers use their devices just for Facebook, so for them Facebook being down really is an Internet-down event. And when the Internet goes down, we call our ISP!
And so, the phones start ringing at the support desks! (Insert joke about déjà vu here.)
“I work at an ISP repair call center and it is a madhouse. We are currently ‘code red’ because so many people only do Facebook/Instagram on the internet (old people mostly) and so they think the internet is down.”
Data Network Blackout
Several ISPs reported a complete data blackout, where customers were unable to access data services in entire countries. This is a follow-on from the Facebook outage, although its root cause is not quite clear. It could be that control-traffic failures arising from DNS were not managed well and caused other control failures, cascading into a full blackout, or simply that there was shared infrastructure at play.
In any case, the phones start ringing at the support desks 🙂
While such a failure is a black-swan event for Facebook, we start to see that the reliability of a service today depends on many things outside the scope of the service itself. And across the broad set of dependencies that services have racked up, such events are happening all the time.
It’s also worth pointing out that one of the causes of the outage stems from automation itself, which serves to teach us that automation is a tool that needs to be wielded with great responsibility. It’s a call for intelligent automation.