Most of the tech giants, including companies like Amazon and Netflix, started out building their systems as monoliths because, at the time, it was much faster to set up a monolith and get the business moving. But as the product matured and growth accelerated, the codebase grew more and more complicated. They all faced this problem and turned to microservices as a solution. One of the biggest benefits of microservices is that each service can be developed, scaled, and deployed independently.
With great power comes great responsibility, and that’s what happened when organizations moved from monolithic applications to microservices and API-driven architectures. They gained significant benefits in delivery speed and scalability, but on the flip side they now have to deal with the operational complexity of managing, monitoring, and securing a distributed architecture.
API-driven applications come with their own issues: design complexity, visibility, communication, security, and so on. With cloud-native platforms, Kubernetes, and CI/CD, software delivery trains are also speeding up. In the rush toward faster deliveries, keeping track of everything is hard.
So, how do you keep track of overall system performance at the 1,000-foot level? Service health is a good indicator for this. It enables you to quickly recognize the impact of a problem on your organization’s business services and applications. Different levels of measurement build on and combine with one another based on predefined rules to present a broad picture of system health, rooted in fine-grained data.
How do you measure service health? That’s a separate question with plenty written about it elsewhere, so we won’t dig deeper into it here. What matters here is how you consume these service health measures. Many organizations configure alerts; many others build dashboards that collectively display the health of services across different applications.
The problem with dashboards is that they are not actionable. Alerts are good triggers, but a human still has to intervene to debug and resolve the issue, if there is one; sometimes alerts turn out to be false positives. In many organizations, when service health degrades, developers or DevOps engineers follow a predefined set of steps to gather more information and resolve the issue. For example, a degraded Kubernetes pod might be fixed by simply restarting it. Can we automate this process? With the right trigger and the necessary context, we definitely can.
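As a minimal sketch of what automating that runbook step could look like, here is the decision logic only: given pod health statuses, restart pods that aren’t ready, and escalate ones that keep failing so a human gets involved. The `PodStatus` type and `restart` callback are hypothetical stand-ins, not a real Kubernetes API; in practice the statuses would come from your monitoring or alerting system.

```python
# Hypothetical sketch of an automated "restart the degraded pod" runbook step.
# PodStatus and the restart callback are illustrative, not a real Kubernetes API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PodStatus:
    name: str
    restart_count: int
    ready: bool


def remediate(pods: List[PodStatus],
              restart: Callable[[str], None],
              max_restarts: int = 3) -> List[str]:
    """Restart unready pods; return the names of pods that need a human."""
    escalations = []
    for pod in pods:
        if pod.ready:
            continue  # healthy, nothing to do
        if pod.restart_count >= max_restarts:
            # Restarting hasn't helped before: escalate instead of looping.
            escalations.append(pod.name)
        else:
            restart(pod.name)  # the automated fix for transient failures
    return escalations


restarted: List[str] = []
pods = [
    PodStatus("api-1", restart_count=0, ready=True),
    PodStatus("worker-2", restart_count=1, ready=False),
    PodStatus("cache-3", restart_count=5, ready=False),
]
escalated = remediate(pods, restarted.append)
print(restarted)  # worker-2 gets an automated restart
print(escalated)  # cache-3 is handed off to a human
```

The key design point is the escalation path: automation handles the routine case (a transient failure fixed by a restart) while repeated failures still reach an engineer, so the automation never masks a real outage.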
Have you ever tried doing this? If yes, let us know how; if not, we have something cooking for you.