Stethascope and mask

Building a Healthcheck RunBook: Analysis of K8s Logs

There’s nothing worse than an outage. Every outage has a cost – both in revenue, sales, but also reputation and perceived reliability.  No one wants their company to appear in the news as having a DevOps issue that cratered their service.  

Building RunBooks to perform regular health checks on your system is an effective way to catch issues early – before they become catastrophic.  If things have already started breaking the health check alert can point you in the direction of the root cause – reducing the Mean Time to Resolution (MTTR).

In this post, we’ll build a RunBook that checks our Kubernetes log records for alerts with a level of WARNING, and if there is one – sends a Slack message to the team.

The Problem

Like many  RunBooks, this project originated from an issue that originated on an Elasticsearch Pod.  Here is what the logs were telling us:

[2023-01-11T23:45:21,890][WARN ][r.suppressed             ] 
\path: /audit-alias/_doc/1e84dc30-df60-46b6-8212-e1a51a5a0508, 
params: {refresh=true, index=audit-alias, 

[audit-000157][0] primary shard is not active Timeout: [1m],

The issue is that the primary shard was timing out.

The resolution was to restart the unresponsive Pod.  Since the resolution to this issue is a simple restart, this is a prime example of a RunBook that can be automated ensuring service uptime, without impacting the DevOps team. 

But, while we want to cure this very specific issue, it is likely that there are dozens (if not hundreds) of other warnings that might appear in the Kubernetes logs.  Creating a general warning notebook that alerts the team of any warning in the logs seems like it might be a good idea.  in this post, we’re going to focus on the general warning, and have the detailed automated fix in a following blog post.

Creating a General Warning Notebook

Tl;dr: This exact RunBook will soon be available in our Open Source repository, so you can use it right ‘off the shelf.’  If you want to see the results of the RunBook, you can import this RunBook, and follow the configuration steps below.

In our General Warning Notebook, we will go through four steps:

  1. Get a list of all Kubernetes Pods running in an environment,
  2. Grab the logs from all of these pods.
  3. Parse the logs for Warning messages
  4. If messages are found, send a Slack message to the team.

We can automate this Notebook to run on a regular schedule to ensure that any K8s warning messages can be dealt with promptly.

Building the RunBook.  

In unSkript, we’ll begin by creating a new RunBook.  In the unSkript Free trial, this can be done by clicking the Create button on the xRunBooks page.   If you are using our Docker Open Source, here’s how to create a new RunBook

We’ll want two input parameters in your RunBook: 

  • namespace: the K8s resources that we will be examine 
  • hoursToExamine: the number of hours that we will examine in the logs.

Here I am have set the default namespace to “logging”, and to examine the logs for the last 1 hour:

Parameters for the RunBook

Now we can add our Actions.  Three of the Actions in this RunBook are already available in our Action library:

  1. Kubectl list pods. In the right navigation – search for this Action, and then drag it into your RunBook Workspace.  Click the configuration button.
    1. If you have a Kubernetes connection created already, select the correct credential to connect to the namespace you desire.  If you have not yet created a credential to connect to Kubernetes, here’s how.
    2. The required field is the namespace we’ll connect to. Since we are using the RunBook parameter to define the namespace, enter the parameter name (in the screenshot above it would be ‘namespace’)
    3. Click the “Output: tab, name the output “podList” and click Apply changes.  This will assign a variable name to the list of pods that we can use in the next Action.
    4. You can test this configuration by clicking the “Run Action” button.  You should obtain a list of the K8s pods in the namespace.
  2. Kubectl get logs. Close the configuration and search for the second Action.
    1. Add the same Credential as above.
    2. There are two required parameters, “Pod Name” and “Namespace.”
      1. For “Namespace”, use the same parameter name used in Step 1.
      2. For “Pod Name”, we have multiple pods that we would like to iterate over.  In order to do this, click the “Iterator” tab.  We’ll want to iterate over the “podList” we created in Step 1, and then place those values into pod_name.  Click Apply Changes for these to take effect.Action iteration screenshot
      3. In the Output tab, name the output variable. In this example, we used “allTheLogs.”
      4. You can now test this Action by clicking Run Action. Depending on how many Pods there are in your namespace, this might take a while as the iteration is serial.  All of the logs will be saved in a Dict.
  3. Now we will create a custom Action to parse the logs.
    1. In the top menu click “Add” and Choose Action.  You can edit the name – we chose parse dict of logs as our name.
    2. The code looks like this:
    3. import re
      from datetime import datetime, timedelta
      #get all warnings
      #only report warnings found in the x hours
      timeDiff = timedelta(hours=hoursToExamine)
      #if there are warnings that are ok to supress, add them to this list
      stringsToIgnore = ["arerqewreqwr" ]
      #this will hold all the warnings
      warning_text_all = {}
      #Specific issues we can deal with:
      primaryShardIsNotActive = False
      #we've collected a bunch of logs, lets loop through them for Warnings
      for instance in allTheLogs:
         log = allTheLogs[instance]
         #find the position of all instances of '[WARN' in the logs
         warning_start = [m.start() for m in re.finditer(re.escape('[WARN'), log)]
         for i in warning_start:
            warningtime = log[i-24:i-5]
            issue = log[i:i+400]
            warningtimeDT = datetime.strptime(warningtime, '%Y-%m-%dT%H:%M:%S')
            if warningtimeDT > timeDiff:
              if issue not in stringsToIgnore:
                warning_text_all[instance] = { warningtime:issue}
                print(warning_text_all, len(warning_text_all))
                if issue.find("primary shard is not active Timeout")>0:
                   primaryShardIsNotActive = True
    4. We loop through all of the logs, and do a regular expression search for ‘[WARN’, and create a list of the location of this string in the log.
    5. Next we loop through all of the places there is a warning in the log, and extract the date and time of the warning.  If it falls into the range of the ‘hoursToExamine’ that is set by the RunBook, we have found a new warning.
    6. We can suppress warnings that have been examined as not important in the stringsToIgnore list – because we do not want alert fatigue to set in, and cry wolf for issues that will not affect the system.
    7. If our WARNING log entry passes all of these criteria, an entry will be written into warning_text_all with the date, Pod name, and the warning message.
    8. We also have the ability to check for specific issues in this code.  If we see the “primary shard is not active” text, we can call another RunBook to resolve this issue. We’ll cover this in a later post.

4. Post Message to Slack Finally, if there are warnings, we want to alert the team.  That’s really why the RunBook is here, right?

    1. Search the Action list, and drag the “Post Message to Slack” Action into your RunBook.  In the Configurations, add your Slack Credential (or create a new Slack Connection).
    2. There are 2 input parameters:
      1. Slack Channel
      2. Message. Here we just print the “warning_text_all” Dict that was created in Step 3.
    3. Start Conditions.  We only want to send a Slack message when there is a warning – alert fatigue and all of that.  So in the Start Condition tab of the Action, we can tell this step to only run when: len(warning_text_all) >0.  Now, the Slack message will only be sent when there are warnings in the logs

We can now save this RunBook.

Scheduling the RunBook

  1. In the unSkript free trial (and our SAAS service), RunBooks can be scheduled. Under Events, click schedule and Add Schedule.  Choose the RunBook you created, and the frequency you’d like the RunBook to run.

Keeping Kubernetes Healthy

Philosophers have debated for centuries “If a tree falls in the woods, but no one is there to hear it, does it make a sound?”  There is no similar corollary in DevOps.  If your systems are reporting errors in their logs, but no one is there to see them – the system will still break. In this post, we’ve built a healthcheck for K8s pods by scanning the logs on regular intervals and surfacing the WARNING errors for the DevOps and SRE team – making it easier and more likely that the messages will be seen and acted upon – and preventing catastrophic errors from going unnoticed.

Interested in giving health checks with unSkript a try?  Check out hundreds of open source Actions and RunBooks in our GitHub repo (and instructions for our Docker build), and try out our free Sandbox, where you can build and deploy RunBooks for free. We also have a growing Cloud Automation Slack channel where we discuss RunBook automations.  We look forward to working with you as you begin automating your DevOps processes!

Share your thoughts