Most software developers understand how important monitoring is. After all, you want to make sure your (cloud) solution does what it's supposed to do to serve your users. Platforms and tools offer easy ways to create dashboards that visualize telemetry and platform metrics, so you can get started quickly. But does your dashboard give you the insight you need into the state of your solution? There may be room for improvement. In this two-part series, we'll look at why you need a health model (this article) and how to build one with Azure Log Analytics (Part 2).

If you're wondering what a health model is, this video is an excellent introduction:

https://www.youtube.com/watch?v=9C4RUuqZG8w

Your dashboard may look something like the image above. It has enough pretty colors to impress colleagues strolling by the big screen in your operations room, but it may lack real usefulness in troubleshooting ongoing issues.

What's wrong with my dashboard?

We find two major shortcomings with typical metrics-based dashboards:

You don't know if everything is working

It takes a lot of context and prior knowledge of the solution to understand if what you're looking at is good or bad. As an example, take a virtual machine instance or cluster node running at 90% CPU. If it's a frontend node, this may indicate bad performance, but if it's a background task worker, it may be expected and fine. For each metric shown, the operator has to determine whether it is within the acceptable range or trending towards an unhealthy state. This knowledge may be obvious to the developer or architect of the solution, but that is not always the person looking at the screen.

You don't know the impact of an outage

If you look at your dashboard and conclude that something is amiss, the next question to ask is "What does this mean?". In a customer-facing or business-critical solution, it is essential to understand the impact of an ongoing outage. A complete failure of your sign-in or shopping cart service may warrant a different response than the failure of a background task that can soon be picked up again. Failures never happen in isolation, since many components in a distributed system depend on each other. Imagine the relationship between your frontend, authentication service, APIs, database, background workers, etc. A proper health model should not only show that there is an issue, but also show what the impact is across the platform.

Three guiding principles of a Health Model

Considering the points above, we propose three guiding principles for a Health Model:

The Health Model should show health status, not metrics. The logic that determines whether a component is healthy, degraded, or unhealthy should be captured in the model, not only in the head of the developer. For example, the model could be presented as a series of traffic lights, one for each of the solution's services or architectural components. In this case, green could mean 'everything is fine', yellow 'attention required', and red 'users impacted'.
The Health Model should be hierarchical in nature: If one component shows a degraded state, so should the components that depend on it. Showing the relationship between components makes it immediately clear what the cascading (business) impact is of an issue anywhere in the system.
The Health Model should present an exhaustive view of the application health. If the health model shows "all green", then you should be confident that everything is indeed working. If that is not the case, you must improve the model or it will lose its usefulness.
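
As a minimal sketch of the first principle, the three states can be captured as an ordered type with the traffic-light meaning attached to each state. The names `HealthState`, `TRAFFIC_LIGHT`, and `describe` are illustrative assumptions and do not come from any particular platform or tool.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Ordered so that a higher value means a worse state."""
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


# Hypothetical mapping from state to the traffic-light presentation.
TRAFFIC_LIGHT = {
    HealthState.HEALTHY: ("green", "everything is fine"),
    HealthState.DEGRADED: ("yellow", "attention required"),
    HealthState.UNHEALTHY: ("red", "users impacted"),
}


def describe(component: str, state: HealthState) -> str:
    """Render one component as a single traffic-light line."""
    color, meaning = TRAFFIC_LIGHT[state]
    return f"{component}: {color} ({meaning})"


print(describe("Sign-in service", HealthState.DEGRADED))
```

Making the states ordered is a design choice that pays off later: the worst of several states can be found with a plain `max`.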

How do I build a health model?

We will discuss the actual implementation of the Health Model in the next article, but let's start with the conceptual approach. The first step in building a health model is identifying the logical components of your solution. We like to take a user-centric approach to this. People use the application or website to fulfil certain tasks: browsing items, creating new items, logging in, adding items to a shopping basket. Each of these tasks requires a number of consecutive actions that we call the "user flow". Let's take an imaginary game website where users can play a game and view their stats. We can identify the following user flows:

Show player stats. Users want to see their scores on the website.
Add game result. When a player finishes a game, the result of the game should be stored.
Serve static site assets. This is the loading of the website itself. It involves assets like HTML pages, client-side scripts, images, etc.

The user flows should, together, cover all the functionality the website has to offer. They are also independent of each other, although they may share a common infrastructure. The user flows are also what your performance targets would be assigned to. For example, you may have a Service Level Objective (the target you engineer your solution for) of 100ms for the processing of a new game result.
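
To make the idea of attaching a target to a user flow concrete, here is a small sketch. The `UserFlow` type, the `meets_slo` helper, the 95% ratio, and the sample durations are all assumptions made for illustration; only the 100 ms target for processing a game result comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserFlow:
    name: str
    slo_ms: float  # target processing time in milliseconds


def meets_slo(flow: UserFlow, durations_ms: list[float], target_ratio: float = 0.95) -> bool:
    """True if at least `target_ratio` of the observed durations are within the SLO."""
    if not durations_ms:
        return True  # assumption: no traffic does not count as a violation
    within = sum(1 for d in durations_ms if d <= flow.slo_ms)
    return within / len(durations_ms) >= target_ratio


add_game_result = UserFlow("Add game result", slo_ms=100)
samples = [42.0, 55.0, 61.0, 80.0, 250.0]  # made-up measurements
print(meets_slo(add_game_result, samples))  # 4 of 5 are within 100 ms, below the 95% target
```

Whether the objective is expressed as a ratio, a percentile, or an average is a decision for your own scenario; the ratio used here is only one option.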

Now that you have identified your user flows, it's time to understand which technical components need to be in working order to fulfil each user flow. These technical components may sit at different levels. There are microservices or APIs such as, in the case of the gaming website, the GameService API or the GameResultsWorker service. They in turn depend on infrastructure components, such as a Kubernetes cluster or a storage account. A (simplified) health model for the gaming website may look something like this:

Now that the dependency model is defined, we must define the properties and values that codify whether the component is Healthy, Degraded/Attention Required or Unhealthy. This is where the fun starts, because the component health state may be derived from multiple sources, such as:

The metrics from the underlying infrastructure components, such as CPU and memory metrics
Telemetry coming from the microservices, such as the request processing time or the number of HTTP 500 errors
The result of log searches
Analytics sources forecasting trends in metrics or anomaly detection models

All these sources may be weighed together and combined, resulting in a health state. Let's see if we can come up with a simple model for some of the components of our fictitious game service. In this case, ResultWorker and GameService are microservices running on the Kubernetes cluster. The ResultWorker processes messages from a queue and GameService serves a REST API. In this case, we've combined platform metrics (such as the response time from a storage account) with application telemetry (such as the processing time of worker runs):

Health Model - Example threshold values per component

The table above shows example values for each of the components. As soon as one of the "green" conditions is violated, the status for that component turns to "yellow" or "red". Coming up with these values is not a one-off exercise. One has to determine what is important to the user experience in this specific scenario, and picking the values is going to be an iterative process as the maturity of your operation grows.
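
The threshold logic described here can be sketched in a few lines. Everything in this example is an assumption made for illustration: the `Threshold` type, the metric names, the numeric limits (which are not the values from the table), the choice to take the worst signal as the component state, and the choice to treat a missing signal as degraded.

```python
from dataclasses import dataclass
from enum import IntEnum


class HealthState(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


@dataclass(frozen=True)
class Threshold:
    metric: str
    degraded_above: float   # leaving "green" when the value exceeds this
    unhealthy_above: float  # turning "red" when the value exceeds this

    def evaluate(self, value: float) -> HealthState:
        if value > self.unhealthy_above:
            return HealthState.UNHEALTHY
        if value > self.degraded_above:
            return HealthState.DEGRADED
        return HealthState.HEALTHY


def component_state(thresholds, observed):
    """Worst state across all signals; a missing signal counts as DEGRADED (assumption)."""
    states = []
    for t in thresholds:
        if t.metric not in observed:
            states.append(HealthState.DEGRADED)
        else:
            states.append(t.evaluate(observed[t.metric]))
    return max(states, default=HealthState.HEALTHY)


# Made-up limits for a worker component, for illustration only.
result_worker = [
    Threshold("processing_time_ms", degraded_above=500, unhealthy_above=2000),
    Threshold("queue_length", degraded_above=100, unhealthy_above=1000),
]
observed = {"processing_time_ms": 350, "queue_length": 240}
print(component_state(result_worker, observed).name)
```

Keeping the limits in data rather than in code makes the iterative tuning described above a matter of editing values, not logic.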

In addition to the metrics and threshold points defined above, the health state of a component is driven by the state of its dependencies: A component can never have a 'healthier' state than its lowest-scoring dependency. In other words: If something is unhealthy, all components above it should also show unhealthy.
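
The rule that a component is never healthier than its worst dependency can be sketched as a walk over the dependency graph. The edges in `DEPENDS_ON` are an assumed, simplified layout of the gaming website built from the component names used in this article, and `effective_state` is a hypothetical helper, not part of any tool.

```python
from enum import IntEnum


class HealthState(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


# Assumed dependency edges: component -> components it depends on.
DEPENDS_ON = {
    "Show player stats": ["GameService"],
    "Add game result": ["GameService", "ResultWorker"],
    "GameService": ["Kubernetes cluster", "Storage account"],
    "ResultWorker": ["Kubernetes cluster", "Storage account"],
    "Kubernetes cluster": [],
    "Storage account": [],
}


def effective_state(component, own_states, depends_on=DEPENDS_ON):
    """Worst of the component's own state and the effective state of its dependencies.

    Assumes the dependency graph contains no cycles.
    """
    state = own_states.get(component, HealthState.HEALTHY)
    for dep in depends_on.get(component, []):
        state = max(state, effective_state(dep, own_states, depends_on))
    return state


# States derived from each component's own metrics (made up for this example).
own = {"Storage account": HealthState.DEGRADED, "ResultWorker": HealthState.UNHEALTHY}
for name in DEPENDS_ON:
    print(f"{name}: {effective_state(name, own).name}")
```

In this example the degraded storage account pulls both user flows out of the healthy state, and the unhealthy worker makes "Add game result" unhealthy while "Show player stats" stays degraded.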

The image above shows examples of what changes in metrics could do to the health model and what that would look like in a visualization.

Before we move forward with the implementation of the Health Model, it's important that the conceptual phase is complete and well understood. At this point in our journey, you will have 1) identified the user flows, services and technical components, 2) established the dependencies between the components and 3) compiled a list of conditions that, for each component, decide the health state.

Now let's proceed to Part 2 - Building a Health Model with Azure Log Analytics, where we use Azure Log Analytics to build out the Health Model.
