Part 2 — Best Practices for Modern Incident Response: Why Monitoring, Alerting and Remediation are Not Enough

This second post in our three-part series looks at how leading organizations handle incident response today.

In our first post, we looked at how you can start or improve incident response in your organization using RunBook automation.

This post gives you a framework for evaluating processes and technologies across the three phases that leading organizations are building in this space.

With 97% of respondents in a recent survey reporting that their leadership has made cloud infrastructure reliability a priority, we are focusing on platform reliability, with a specific view into improving Kubernetes management.

Modern diagnostics tools enable proactive management: they constantly monitor and probe for signals that issues may be arising, remediate addressable issues automatically in real time or near real time, and escalate unaddressable issues to the appropriate teams.

Issues with Traditional Incident Response

Today, many define incident response as a collection of activities, with monitoring, alerting, and remediation as its three pillars. For traditional organizations these are critical inputs to incident response: teams are called in, and issues are prioritized, investigated, and resolved. But while all three are inputs to incident response, they are not solutions in and of themselves.

Rather than focusing on a specific outcome, this siloed approach to observability (detect), prioritization (inspect), and response (protect) typically centers on technical instrumentation and underlying data formats.

While having all three helps you triage, it is no guarantee that you can respond to issues more quickly.

Ironically, more monitoring and alerting may actually slow down reaction times and resolution, creating issues at scale.

Present-Day Incident Response for Platform Reliability

Containers and microservices, today’s cloud infrastructure of choice, have broken traditional incident response.

Specific to Kubernetes, a highly sophisticated yet complex container orchestration and management platform, we can apply these three stages to troubleshooting common errors such as CreateContainerConfigError, ImagePullBackOff, CrashLoopBackOff, Kubernetes Node Not Ready, and many more…
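As an illustration, here is a minimal sketch of how a diagnostic check might surface pods stuck in these error states. It assumes the official `kubernetes` Python client and a reachable cluster via kubeconfig; the namespace and the set of watched reasons are illustrative.

```python
# Minimal sketch: list pods whose containers are stuck in a common error state.
# Assumes the official `kubernetes` Python client and a reachable cluster.
from kubernetes import client, config

WATCHED_REASONS = {"CreateContainerConfigError", "ImagePullBackOff", "CrashLoopBackOff"}

def find_failing_pods(namespace="default"):
    """Return (pod name, reason) pairs for containers in a watched error state."""
    config.load_kube_config()  # or load_incluster_config() when running inside a pod
    v1 = client.CoreV1Api()
    failing = []
    for pod in v1.list_namespaced_pod(namespace).items:
        for status in (pod.status.container_statuses or []):
            waiting = status.state.waiting
            if waiting and waiting.reason in WATCHED_REASONS:
                failing.append((pod.metadata.name, waiting.reason))
    return failing

if __name__ == "__main__":
    for name, reason in find_failing_pods():
        print(f"{name}: {reason}")
```

Node Not Ready is a node-level condition rather than a container state, so it would be checked against node conditions instead, as in the sketch in the next section.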

Present-Day Alerting — Prepared to Spot Issues

Spotting issues typically involves:

  • Reviewing recent changes to clusters, pods, nodes, or other potentially affected areas to find the root cause.
  • Analyzing YAML configs, GitHub repos, and logs for VMs or bare-metal machines running the malfunctioning components.
  • Looking at Kubernetes events and metrics such as disk pressure, memory pressure, and utilization. In a mature environment, you should have access to dashboards that show important metrics for clusters, nodes, pods, and containers — ideally in time-series formats (see the sketch after this list).
  • Comparing similar components for outliers, analyzing dependencies…
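To make the events-and-metrics bullet concrete, here is a minimal sketch that reads node pressure conditions and recent Warning events, two of the signals mentioned above. It again assumes the official `kubernetes` Python client; the namespace is illustrative.

```python
# Minimal sketch: surface node pressure conditions and recent Warning events.
# Assumes the official `kubernetes` Python client and a reachable cluster.
from kubernetes import client, config

def node_pressure_report():
    """Print nodes reporting DiskPressure, MemoryPressure, or a NotReady state."""
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        for cond in node.status.conditions:
            pressured = cond.type in ("DiskPressure", "MemoryPressure") and cond.status == "True"
            not_ready = cond.type == "Ready" and cond.status != "True"
            if pressured or not_ready:
                print(f"{node.metadata.name}: {cond.type}={cond.status} ({cond.message})")

def recent_warning_events(namespace="default"):
    """Print recent Warning events, a common starting point for triage."""
    v1 = client.CoreV1Api()
    events = v1.list_namespaced_event(namespace, field_selector="type=Warning")
    for e in events.items:
        obj = e.involved_object
        print(f"{e.last_timestamp} {obj.kind}/{obj.name}: {e.reason} - {e.message}")

if __name__ == "__main__":
    config.load_kube_config()
    node_pressure_report()
    recent_warning_events()
```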

To achieve the above, teams typically use the following technologies:

  • Monitoring Tools: Datadog, Dynatrace, Grafana, New Relic
  • Observability Tools: Lightstep, Honeycomb
  • Live Debugging Tools: OzCode, Rookout
  • Logging Tools: Splunk, LogDNA, Logz.io

These tools let you see issues within and between services, but they stop at identifying the issue and notifying you about it.

Present-Day Triage — Automated Management of Issues

In microservices architectures, it is common for each component to be developed and managed by a separate team. Because production incidents often involve multiple components, collaboration is essential to remediate problems fast.

Once issues are understood, there are three main approaches to remediating them:

  • Ad hoc solutions — based on the tribal knowledge of the teams working on the affected components. Very often, the engineer who built the component has unwritten knowledge of how to debug and resolve it.
  • Manual RunBooks — a clear, documented procedure showing how to resolve each type of incident. Having a RunBook means that members of the team can quickly resolve the issue in a similar (hopefully tested) manner.
  • Automated RunBooks — an automated process, which could be implemented as a script, infrastructure as code (IaC) template, or Kubernetes operator, and is triggered automatically when the issue is detected (see the sketch after this list).
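As a minimal sketch of the automated-RunBook approach (not a production policy), the example below deletes pods stuck in CrashLoopBackOff past a restart threshold so that their controller recreates them. It assumes the official `kubernetes` Python client; the namespace and threshold are illustrative, and a real RunBook would add safeguards and notifications.

```python
# Minimal sketch of an automated RunBook: delete pods stuck in CrashLoopBackOff
# so their controller (e.g. a Deployment) recreates them. Assumes the official
# `kubernetes` Python client; the restart threshold is an illustrative policy.
from kubernetes import client, config

def remediate_crashloops(namespace="default", restart_threshold=5):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for pod in v1.list_namespaced_pod(namespace).items:
        for status in (pod.status.container_statuses or []):
            waiting = status.state.waiting
            if (waiting and waiting.reason == "CrashLoopBackOff"
                    and status.restart_count >= restart_threshold):
                print(f"Deleting {pod.metadata.name} (restarts={status.restart_count})")
                v1.delete_namespaced_pod(pod.metadata.name, namespace)
                break  # one delete per pod is enough

if __name__ == "__main__":
    remediate_crashloops()
```

The same logic could be packaged as a script triggered by an alert, an IaC hook, or a Kubernetes operator, per the options above.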

To achieve the above, teams typically use the following technologies:

  • Incident Management: PagerDuty, Kintaba
  • Project Management: Jira, Monday, Trello
  • Infrastructure as Code: Amazon CloudFormation, Terraform

It can be challenging to automate responses to all common incidents, but doing so is highly beneficial, reducing downtime and eliminating human error.

Present-Day Remediation — Prepared to Respond

Successful teams make prevention their top priority. Over time, this will reduce the time invested in identifying and troubleshooting new issues. Preventing production issues in Kubernetes involves:

  • Creating policies, rules, and playbooks after every incident to ensure effective remediation.
  • Investigating if a response to the issue can be automated, and how
  • Defining how to identify the issue quickly next time around and make the relevant data available — for example by instrumenting the relevant components
  • Ensuring the issue is escalated to the appropriate teams and those teams can communicate effectively to resolve it

To achieve the above, teams commonly use the following technologies:

  • Chaos Engineering: Gremlin, Chaos Monkey, ChaosIQ
  • Auto Remediation: Shoreline, OpsGenie

A Modern Approach — Preventing vs Preparing vs Looking

Proactive alerting — Traditional alerting notified you of up/down status at the simplest level; more recent tools notify you of deviations from the norm to surface potential issues before they become actual ones. The latest developments revolve around combining smart alerting signals with preset remediations.

Monitoring each service’s most common issues for deviations from intelligently tracked performance, coupled with a preloaded, automated remediation to address each issue should it arise, is the way to scale cloud-native infrastructure confidently.
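A minimal sketch of that check-and-remediate loop might look like the following; the probe interval and the check, remediate, and escalate callables are hypothetical placeholders for whatever your services need.

```python
# Minimal sketch of proactive "check, remediate, escalate" automation.
# check() returns (healthy, details); remediate(details) returns True if it
# fixed the problem; escalate(details) hands the failure to the owning team.
import time

def run_probe(check, remediate, escalate, interval_seconds=60):
    """Run `check` on a schedule; auto-remediate failures, escalate what remains."""
    while True:
        healthy, details = check()
        if not healthy and not remediate(details):
            escalate(details)  # e.g. page the owning team with the failure readout
        time.sleep(interval_seconds)
```

The detection and remediation helpers sketched earlier could be wired in as the `check` and `remediate` callables for a Kubernetes-specific probe.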

Modern Triage — In traditional triage, alerts signal an issue or issues, and then specific teams investigate from the top down: the website or app, the network, the underlying cloud or the servers and storage…

Once an issue with an availability zone, data center, or Kubernetes cluster is identified, remedies can be applied.

This can take seconds, minutes, or unfortunately sometimes hours, given the cross-team collaboration that microservices require.

Having that collaboration happen as a “peacetime” practice, in preparation, is good. Having it happen in the context of preventing issues from ever occurring is ideal, and modern tools are making this possible.

Modern Remediation — Understanding the problem before there is a problem. In cloud-native, microservices-based systems, you must navigate a twisted set of paths through a web of dependencies just to understand which services you need to investigate and work with.

With a ticking clock and management watching, this can be a stressful endeavor.

Working through stacks of issues and presetting RunBook automation can feel like a daunting job. With modern RunBook automation you can have stacks of RunBooks prebuilt for your most common issues, probing for them before they cause trouble.

Having a number of issues resolved before systems and services go down zeroes out your KPIs for many incidents: no Time to Alert, no Time to Root Cause, and no Time to Resolution, because you caught and addressed them before they became *major* issues.

And better business — Beyond preventing slowdowns or, in the worst case, outages, proactive automation and remediation extends to better business management.

Help CloudOps eliminate under- and over-provisioning by rightsizing and optimizing your EKS, cloud, and other critical services in the same preventative fashion as incident prevention.
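As one illustration of that preventative fashion, the sketch below flags nodes whose pods request only a small fraction of the allocatable CPU, a typical over-provisioning signal. It assumes the official `kubernetes` Python client; the 30% threshold is arbitrary and illustrative.

```python
# Minimal sketch of a rightsizing check: flag nodes where pod CPU requests are a
# small fraction of allocatable CPU. Assumes the official `kubernetes` Python
# client; the 30% threshold is illustrative.
from collections import defaultdict
from kubernetes import client, config

def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity such as '500m' or '2' into cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

def underutilized_nodes(threshold=0.3):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    requested = defaultdict(float)
    for pod in v1.list_pod_for_all_namespaces().items:
        if not pod.spec.node_name:
            continue
        for c in pod.spec.containers:
            requests = (c.resources.requests if c.resources else None) or {}
            if "cpu" in requests:
                requested[pod.spec.node_name] += parse_cpu(requests["cpu"])
    for node in v1.list_node().items:
        allocatable = parse_cpu(node.status.allocatable["cpu"])
        used = requested[node.metadata.name]
        if allocatable and used / allocatable < threshold:
            print(f"{node.metadata.name}: {used:.2f} of {allocatable:.2f} CPU cores requested")

if __name__ == "__main__":
    underutilized_nodes()
```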

Incident Response for Security

Since much of incident response concerns security, it is worth noting the six steps defined by SANS, a security training organization, which occur in a cycle each time an incident occurs. Other public and private organizations define anywhere from four to seven steps.

Their steps are:

  • Preparation
  • Identification
  • Containment
  • Eradication
  • Recovery
  • Lessons learned

Our focus at unSkript is to help you succeed with not only the automation but also the process and documentation to enable a proactive approach to incident response. The learnings from this effort will extend to your platform reliability approach, cloud cost management, and other tools and cost optimization efforts.

Below are other lists you might pursue.

NIST

What Does the NIST Incident Response Cycle Look Like? NIST’s incident response cycle has four overarching and interconnected stages: 1) preparation for a cybersecurity incident, 2) detection and analysis of a security incident, 3) containment, eradication, and recovery, and 4) post-incident analysis.

In the next post we will share some of the vision leading RunBook builders, technologists, and investors see coming in the future of RunBook automation.

Again, the mindset shift of moving from coding to firefighting and back is more time-consuming and taxing than many SRE and DevOps engineers let on, leading to decreased productivity and increased stress and burnout.

Thanks for taking a look!

Schedule a meeting today to see how our new diagnostics check your critical services (Kubernetes, AWS, Kafka, Elastic, and dozens more) and automatically remediate based on failure readouts.

Or try our open source sequential checks and remediations on GitHub.

Share your thoughts