Why is incident response such a manual process?

Telltale signs that you need to take a close look at your MTTR

In today’s agile world, service reliability correlates directly with customer satisfaction and successful business outcomes. However, manual incident response drags service reliability down, and it stays manual for the following five reasons. Let’s walk through them one at a time.

Too many alerts, too little actionable intelligence.

Writing alert thresholds is more of an art than a science. Thresholds are contextual and subject to factors like time of day or seasonality. Consequently, alerts have to be investigated in context, and that happens more often than one would like. The on-call engineer is really trying to answer one question: “is this expected?” More often than not, the answer is yes, and that is what makes operations such a toil: the incessant manual work of checking context signals for an alert just to be able to say that it was expected.
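
To make the time-of-day point concrete, here is a minimal sketch of a context-aware threshold check. The metric, windows, and limits are hypothetical and hand-picked for illustration; this is not a recommendation for any particular monitoring tool.

```python
# Minimal sketch: an alert threshold that depends on time of day.
# The metric, windows, and limits below are hypothetical.
from datetime import datetime, timezone

# Error-rate limits per UTC time window; off-peak traffic is noisier in
# relative terms, so it gets a looser threshold.
THRESHOLDS = [
    (range(0, 7), 0.05),    # overnight
    (range(7, 20), 0.01),   # business hours
    (range(20, 24), 0.03),  # evening
]

def threshold_for(now: datetime) -> float:
    """Pick the error-rate threshold that applies to the current hour."""
    for hours, limit in THRESHOLDS:
        if now.hour in hours:
            return limit
    return 0.01  # fallback

def should_alert(error_rate: float, now: datetime | None = None) -> bool:
    """Return True if the observed error rate breaches the contextual threshold."""
    now = now or datetime.now(timezone.utc)
    return error_rate > threshold_for(now)
```

Even a simple scheme like this encodes the “is this expected?” judgment once, instead of asking the on-call engineer to re-derive it at 3 a.m.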

Tribal knowledge hurts everyone!

Being a Subject Matter Expert is a matter of both pride and frustration. After working on a product or feature, engineers pick up a host of knowledge about a few narrow areas. While tribal knowledge is good for job security, it also means that when failures happen in production, this undocumented knowledge leads to panic calls that have to be taken from the beach and spoil the fun for the entire family. A sword hanging over your head all the time is not really worth the perks that come with it. Runbooks are a great way of combating this organizational problem.

Runbooks live on an island, one that the ship called SDLC does not visit.

The best resource for triaging and troubleshooting incidents is a runbook. Runbooks are the investigation and/or mitigation plans that are written when alert rules are configured, or when the service or user journey is designed. That is the best time to capture why the alert exists, what its dependencies are, and what should be done to handle it. However, because runbooks and alerting sit outside the SDLC, they tend to fall out of date very quickly. Our best bet for handling incidents gets a step-child treatment: the process to keep runbooks current is manual and discretionary. Runbooks need to shift left too.
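
One way to shift runbooks left is to treat them as code that lives in the same repository as the alert rules, so they are reviewed and versioned with everything else. The sketch below is a hypothetical illustration of that idea; the class names, alert name, and CI check are assumptions, not an existing tool’s API.

```python
# Hypothetical "runbook as code": runbooks live beside the alert rules
# and are validated in CI rather than in a forgotten wiki page.
from dataclasses import dataclass, field

@dataclass
class RunbookStep:
    description: str            # what to check or do
    command: str | None = None  # optional command the on-call engineer can run

@dataclass
class Runbook:
    alert_name: str             # must match the alert rule it covers
    why: str                    # why the alert was configured
    dependencies: list[str] = field(default_factory=list)
    steps: list[RunbookStep] = field(default_factory=list)

# Example runbook, committed in the same change as the alert rule.
checkout_latency = Runbook(
    alert_name="checkout_p99_latency_high",
    why="p99 latency above the SLO breaks the checkout user journey",
    dependencies=["payments-api", "inventory-db"],
    steps=[
        RunbookStep("Check whether a deploy landed in the last 30 minutes"),
        RunbookStep("Inspect payments-api errors", "kubectl logs deploy/payments-api"),
    ],
)

def alerts_without_runbooks(alert_names: set[str], runbooks: list[Runbook]) -> set[str]:
    """CI gate: return alert rules that have no matching runbook."""
    covered = {r.alert_name for r in runbooks}
    return alert_names - covered
```

Wiring a check like alerts_without_runbooks into the build is one way to make runbook freshness part of the same review loop as the code it protects.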

In the API economy, your reliability is a function of the reliability of all the APIs you use.

Modern applications are built on top of a host of third-party APIs. Diagnosing incidents therefore needs a single integration platform that can communicate with the broad dependency graph present in modern applications. Such a “control tower” must be able to interrogate not just the local resources within the DMZ but also the third-party APIs and services, if one wants the full picture of application health. We saw this play out recently with the widely publicized Facebook (now Meta) outage.
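
As a rough sketch of what such a check might look like, the snippet below polls both an internal service and a third-party status endpoint and reports a combined view. The dependency names and URLs are made up for illustration.

```python
# Minimal sketch of a "control tower" style dependency check that covers
# both internal services and third-party APIs. URLs below are placeholders.
import urllib.request

DEPENDENCIES = {
    # internal service inside the DMZ
    "orders-service": "http://orders.internal.example/healthz",
    # third-party API the application depends on
    "payments-provider": "https://status.payments.example/api/health",
}

def check_dependency(name: str, url: str, timeout: float = 3.0) -> bool:
    """Return True if the dependency's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def dependency_report() -> dict[str, bool]:
    """Collect a health verdict for every internal and external dependency."""
    return {name: check_dependency(name, url) for name, url in DEPENDENCIES.items()}

if __name__ == "__main__":
    for name, healthy in dependency_report().items():
        print(f"{name}: {'ok' if healthy else 'DEGRADED'}")
```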

Tool sprawl makes locating context difficult.

Observability is spread across a breadth of tools, which means alerting is also spread across a breadth of tools. Gathering and correlating context from this set of tools is an onerous task, and one that has to be done during incidents. It also makes building a health-check dashboard complicated, because the signals have to be collected from across the sprawl. The rate of change in software makes the entire operation an ever-moving target. While “single-screen” solutions are available today, we often see that some context remains locked away, inaccessible through the “single screen”; developers go through other screens, back doors, and/or scripts to get it.

But what if incident response could be automated?

When the proverbial pager beeps, the first task of the incident response team is collecting information from a number of sources. Once it is collected, the on-call team can analyze the data and escalate accordingly. However, acquiring data from multiple sources and then analyzing it is contextual and time-consuming. Having a dynamic workflow that allows easy interrogation of the tool sprawl would go a long way toward removing that toil.
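
To make that concrete, here is a minimal sketch of what the first few minutes could look like when automated: a set of collectors fan out across the tools in parallel and attach the correlated context to the alert before a human ever looks at it. The collector functions and their payloads are hypothetical stand-ins, not any vendor’s API.

```python
# Minimal sketch of automated context gathering when an alert fires.
# Each collector stands in for a real integration (CI/CD, metrics, logs).
from concurrent.futures import ThreadPoolExecutor

def recent_deploys(service: str) -> dict:
    """Stand-in for querying the CI/CD system for recent changes."""
    return {"last_deploy": "2024-01-01T12:00:00Z"}

def error_rate(service: str) -> dict:
    """Stand-in for querying the metrics backend."""
    return {"error_rate_5m": 0.02}

def recent_logs(service: str) -> dict:
    """Stand-in for querying the log aggregator."""
    return {"top_error": "connection reset by peer"}

COLLECTORS = [recent_deploys, error_rate, recent_logs]

def enrich_alert(service: str) -> dict:
    """Fan out to every collector in parallel and merge the returned context."""
    with ThreadPoolExecutor(max_workers=len(COLLECTORS)) as pool:
        results = pool.map(lambda collect: collect(service), COLLECTORS)
    context: dict = {"service": service}
    for result in results:
        context.update(result)
    return context

# The merged context would be attached to the page automatically,
# e.g. enrich_alert("checkout-service"), instead of being gathered by hand.
```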

In my next blog post, I will walk through the critical capabilities needed to automate incident response, perhaps bringing in the notion of automated runbooks.

Share your thoughts