Part 1 — What is Incident Response, how does it work, and what do the best do?

This is the first post in a three-part series designed to help you start or improve incident response and runbook automation in your organization.

Our goals are to give you an unbiased framework for evaluating where you are, what leaders in the space are doing, and where things are going.

This post looks specifically at the types of incident response and at how organizations are handling it today.

The most common incidents that require an organized response range from attacks on your business systems or IT networks to theft or destruction of your data and intellectual property. These can be caused by internal or external factors, deliberately or unintentionally.

Runbook automation, increasingly called “operations as code,” reduces the resources, time, risk, and complexity of performing operations tasks and ensures consistent execution. By expressing operations as code, you can automate operations activities with event-based scheduling and triggers, and let that automation handle future occurrences on an ongoing basis.
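As a minimal sketch of what operations as code can look like, the Python example below wires a hypothetical disk-usage event to an automated cleanup action. The event shape, threshold, and cleanup command are illustrative assumptions rather than any particular product’s API.

    # Minimal "operations as code" sketch: an event-triggered remediation.
    # The event payload shape, threshold, and cleanup command are illustrative
    # assumptions, not a specific monitoring product's API.
    import shutil
    import subprocess

    DISK_USAGE_THRESHOLD = 0.90  # remediate at 90% utilization

    def handle_disk_alert(event: dict) -> str:
        """Respond to a 'disk_usage_high' event by pruning old temp files."""
        if event.get("type") != "disk_usage_high":
            return "ignored: not a disk event"

        mount = event.get("mount", "/")
        usage = shutil.disk_usage(mount)
        utilization = usage.used / usage.total
        if utilization < DISK_USAGE_THRESHOLD:
            return f"no action: {utilization:.0%} is below threshold"

        # The remediation step that would otherwise be done by hand.
        subprocess.run(
            ["find", "/var/tmp", "-type", "f", "-mtime", "+7", "-delete"],
            check=True,
        )
        return f"pruned /var/tmp with {mount} at {utilization:.0%} utilization"

    if __name__ == "__main__":
        # Simulated event, e.g. forwarded by a monitoring webhook or scheduler.
        print(handle_disk_alert({"type": "disk_usage_high", "mount": "/"}))

Hooking a function like this to a scheduler or monitoring webhook is what turns a documented manual step into an operation that handles future occurrences on its own.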

Through integration at the infrastructure level, you avoid traditional “swivel chair” processes that require multiple interfaces and systems to complete a single operations activity. This “mindset shift,” i.e. constantly switching between tools and contexts, is key to point out, as it is more time consuming and difficult than many SRE and DevOps practitioners let on, leading to decreased productivity and increased stress and burnout.

Incident Response Processes

A key difference between incident response processes is the type of trigger that initiates a response.

The first type of trigger is usually tangible, with users already noticing performance issues, as with zero-day threats. The second type provides early warnings based on indirect indications of security compromise or system failure and attempts to preempt issues before they impact end users; however, these triggers are more prone to false positives.

These triggering approaches determine the main types of incident response processes:

  • Front-loaded prevention — collect performance intelligence data and indicators of system failure to prevent degradation or downtime early. This approach helps address issues and threats before they cause damage, but its higher rate of false positives can increase costs. Usually the higher number of false-positive responses is an accepted price to pay to protect critical apps and data.
  • Back-loaded recovery — collect data on visible threats and incidents, which may already have occurred. This approach minimizes false positives but does not stop failures or attacks early. It is unsuitable for critical infrastructure or high-stakes applications.
  • Hybrid incident response — combine front-loaded and back-loaded data processing to enable both early detection and immediate/urgent response. This is a more comprehensive response strategy, and as such it is a larger investment in organizational resources. A hybrid approach should emphasize prevention, with the back-loaded response reserved for non-critical components. (A minimal sketch of routing the two trigger types follows this list.)
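To make the distinction concrete, here is a small, hypothetical sketch of a hybrid trigger router: early-warning signals take the preventive path, while user-visible failures go straight to recovery. The signal names and thresholds are made up for illustration.

    # Hedged sketch of a hybrid trigger router. Signal names and thresholds
    # are illustrative, not drawn from any specific monitoring stack.
    from dataclasses import dataclass

    @dataclass
    class Signal:
        name: str
        value: float
        user_visible: bool  # True if end users are already affected

    # Leading indicators and their illustrative early-warning thresholds.
    EARLY_WARNING_THRESHOLDS = {
        "p99_latency_ms": 800,        # latency creeping up before errors appear
        "disk_utilization": 0.85,     # headroom shrinking before it runs out
        "failed_logins_per_min": 50,  # possible credential-stuffing precursor
    }

    def route(signal: Signal) -> str:
        """Decide which response path a signal should trigger."""
        if signal.user_visible:
            # Back-loaded: the incident is already visible; recover now.
            return "page on-call: start recovery runbook"
        threshold = EARLY_WARNING_THRESHOLDS.get(signal.name)
        if threshold is not None and signal.value >= threshold:
            # Front-loaded: act early and accept some false positives.
            return "open preventive task: run diagnostic runbook"
        return "no action"

    if __name__ == "__main__":
        print(route(Signal("p99_latency_ms", 950, user_visible=False)))
        print(route(Signal("http_5xx_rate", 0.25, user_visible=True)))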

Our point of view is that proactive, automated response to security threats, infrastructure failures, or fat-fingered changes is harder and more expensive, but it builds speed, accuracy, and scalability into your team and organization.

How we see incident response being handled today

Playbooks, runbooks, wikis, oh my. At a recent DevOpsDays event we had the privilege of talking to a hundred or so Dev, Eng, Security, and Ops pros about how they handle incident response in their teams and orgs.

The spectrum is eye-opening, ranging from literally no owners or process to well-defined automation and protocols.

The reasons why varied as well, from getting key experts off (rough) on-call schedules, to regulations, to management decisions.

If I had to rank what we heard, I would say wikis or other shared documentation are the number one way incident response and platform reliability are being handled now, with task-based automation second. (We will survey more formally and share results soon.)

Wikis or basic process and documentation

  • This is a great first step, and in our estimation it is where most teams are today.
  • Some playbooks or runbooks are laid out in text, sometimes by specific incident categories, to help reduce MTTR on critical issues where downtime does occur.
  • This also serves to reduce tribal knowledge and dependence on SMEs, and/or to transition as teams change due to promotion/rotation, M&A, or simply turnover.

Piecemeal, Task-specific Automation with Human in the Loop

  • This is the more advanced stage, where more progressive teams have DIY’d open source or IT automation solutions to keep MTTR very low due to SLAs, to keep platforms reliable during code pushes, and/or to keep multiple experts and teams on the same page(s). (A minimal human-in-the-loop sketch follows below.)
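A hedged sketch of what this stage often looks like in practice: the script below gathers diagnostics automatically but waits for an operator to approve the remediation. The service name and commands are illustrative assumptions (a systemd host is assumed).

    # Task-specific automation with a human in the loop: diagnose automatically,
    # remediate only after explicit approval. The service name and systemctl
    # commands are illustrative assumptions (a systemd host is assumed).
    import subprocess

    SERVICE = "my-app"  # hypothetical systemd unit

    def diagnose() -> str:
        """Collect service status for the operator to review."""
        result = subprocess.run(
            ["systemctl", "status", SERVICE], capture_output=True, text=True
        )
        return result.stdout or result.stderr

    def remediate() -> None:
        """Restart the service; only called after explicit human approval."""
        subprocess.run(["systemctl", "restart", SERVICE], check=True)

    if __name__ == "__main__":
        print(diagnose())
        answer = input(f"Restart {SERVICE}? [y/N] ").strip().lower()
        if answer == "y":
            remediate()
            print("Remediation executed.")
        else:
            print("No action taken; investigate or escalate manually.")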

Integrated Automation

  • Leading SRE and DevOps teams have synthesized their processes for automated diagnosis and remediation to maintain very aggressive SLAs and minimize performance degradation, security incidents, and human error.
  • All of this is especially important in organizations with frequent code pushes, larger and more distributed teams, and greater dependence on apps and infrastructure for revenue.
  • Very advanced teams have incorporated this into their software development lifecycle.
  • Running checks and remediating is the standard way these teams troubleshoot (see the sketch below).
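As a rough sketch of that check-and-remediate pattern, the example below pairs each health check with a remediation and re-verifies after fixing. The health endpoint, service unit, and thresholds are hypothetical.

    # Hedged sketch of integrated check-then-remediate automation: each check
    # is paired with a remediation, and failures are fixed and re-verified.
    # The health endpoint, service unit, and thresholds are hypothetical.
    from typing import Callable, Dict, Tuple
    import shutil
    import subprocess
    import urllib.request

    def http_healthy() -> bool:
        """Check a hypothetical local health endpoint."""
        try:
            with urllib.request.urlopen("http://localhost:8080/healthz", timeout=2) as r:
                return r.status == 200
        except OSError:
            return False

    def disk_ok() -> bool:
        usage = shutil.disk_usage("/")
        return usage.used / usage.total < 0.90

    def restart_app() -> None:
        subprocess.run(["systemctl", "restart", "my-app"], check=True)  # hypothetical unit

    def clean_tmp() -> None:
        subprocess.run(
            ["find", "/var/tmp", "-type", "f", "-mtime", "+7", "-delete"], check=True
        )

    # Each check is mapped to the remediation that should fix it.
    CHECKS: Dict[str, Tuple[Callable[[], bool], Callable[[], None]]] = {
        "http_health": (http_healthy, restart_app),
        "disk_space": (disk_ok, clean_tmp),
    }

    def run_checks() -> None:
        for name, (check, fix) in CHECKS.items():
            if check():
                print(f"{name}: ok")
                continue
            print(f"{name}: failed, remediating")
            fix()
            print(f"{name}: {'recovered' if check() else 'still failing, escalate'}")

    if __name__ == "__main__":
        run_checks()

The same loop can be wired into a deployment pipeline so a failed post-push check triggers its remediation automatically, which is one way teams fold this into their software development lifecycle.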

In the next post we will delve deeper into why tribal knowledge, wikis, and basic automation aren’t enough by sharing some of the best practices we are seeing around runbook processes and their automation.

Again, the mindset shift, i.e. moving from coding to firefighting and back, is more time consuming and taxing than many SRE and DevOps practitioners let on, leading to decreased productivity and increased stress and burnout.

Thanks for taking a look!

Share your thoughts