Building a Health Model with Azure Log Analytics

...by turning your metrics and telemetry into a unified score!

This post is part two of a three-part series on Health Modeling for cloud solutions. In the first part, we discussed what a Health Model is and why you need one. I highly recommend reading that article first, since we'll be building on top of the same example and concepts here.

This chapter discusses the implementation of the Health Model using Azure Log Analytics. Log Analytics is an Azure service used for collecting platform metrics and application telemetry. We're using this service because it allows us to write complex queries over both metrics and telemetry data and has integrations with popular dashboarding tools such as Azure Dashboards, PowerBI and Grafana.

The Health Score

As a refresher, let's take a look at the Health Model graph for our fictitious gaming platform again:

Example Health Model graph

We'll talk about creating the actual visualization in the next post. In this blog, I'll focus on determining the Health Score for each of the components in the graph. This score needs to have a numerical value, which we can later visualize as e.g. green, yellow, or red. For now, we'll define HEALTHY as 1.0, DEGRADED as 0.75 and UNHEALTHY as 0.5. Whether you end up using these or calculate a more continuous scale is up to you. Our goal for this exercise is to create an output that looks something like this:

Component  Time                  HealthScore
Website    2021-12-09T11:00:00Z  1.0
Website    2021-12-09T10:59:00Z  0.75
Website    2021-12-09T10:58:00Z  0.75
Website    2021-12-09T10:57:00Z  0.75
Website    2021-12-09T10:56:00Z  1.0

This data shows that the 'Website' component had a three-minute dip in health, but was back to normal at 11:00 UTC.
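To make the link between the numerical score and a color-style state concrete, here is a minimal sketch that recreates the table above as an inline datatable and maps each score to a label. The HealthState column name and the use of TimeGenerated for the time column are assumptions made for this illustration, not part of the model itself:

```kusto
// Inline copy of the example output above (no workspace data is needed to run this)
datatable(Component: string, TimeGenerated: datetime, HealthScore: double) [
    "Website", datetime(2021-12-09T11:00:00Z), 1.0,
    "Website", datetime(2021-12-09T10:59:00Z), 0.75,
    "Website", datetime(2021-12-09T10:58:00Z), 0.75,
    "Website", datetime(2021-12-09T10:57:00Z), 0.75,
    "Website", datetime(2021-12-09T10:56:00Z), 1.0
]
// Map the score to a state: 1.0 is HEALTHY, 0.75 is DEGRADED, anything lower is UNHEALTHY
| extend HealthState = case(HealthScore >= 1.0, "HEALTHY", HealthScore >= 0.75, "DEGRADED", "UNHEALTHY")
```

With these rows, the three 0.75 entries come back as DEGRADED and the two 1.0 entries as HEALTHY.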

So how do we generate an output like that from a Log Analytics query?

Collecting metrics and app telemetry

Before we can write queries, we need to start collecting data. We're going to assume you have an Azure subscription and know how to create resources, so go ahead and create a Log Analytics instance. If your solution uses multiple regional deployments, it is recommended to deploy one Log Analytics instance per region.

All Azure resource types (Virtual Machines, AKS Clusters, Databases, Event Hubs, etc.) natively support exporting their metrics to Log Analytics. Setting up metrics collection can be done per resource using a Diagnostic Setting, or for an entire resource group or subscription using an Azure Policy.

In addition to platform metrics, we also want to have application telemetry. This allows us to include metrics like request performance and exception counters in the Health Model. On Azure, application telemetry is collected using Application Insights, which has SDKs for most programming languages. When creating your Application Insights instance, make sure to select the 'workspace-based' model and point it to your Log Analytics workspace so the data ends up there. As illustrated in the image below, Log Analytics will then hold both the resource metrics and the application telemetry.

Health Model - Data Collection
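Before writing any health queries, it can help to confirm that metrics are actually arriving in the workspace. The following is a minimal sketch, assuming the standard AzureMetrics table and its ResourceProvider column. The one-hour window and the Samples and LastSeen column names are arbitrary choices for this example:

```kusto
// List which metrics arrived in the last hour, per resource provider
AzureMetrics
| where TimeGenerated > ago(1h)
| summarize Samples = count(), LastSeen = max(TimeGenerated) by ResourceProvider, MetricName
| order by ResourceProvider asc, MetricName asc
```

The MetricName values returned here are also the names you would use when defining thresholds. A similar count over the application telemetry tables can confirm that Application Insights data is flowing too.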

Status queries and Score queries

Now that we have all our data in place, we can start with the fun part. As explained in the first blog post, the Health of the top component ('website' in this case) is defined by the health of all components underneath. This means that each component needs its own Health Score, produced by a Log Analytics query.

Log Analytics queries are written in Kusto Query Language (KQL). Those who are experienced in writing KQL know that queries can get quite complex quickly as they grow in size. For this reason, we've chosen to split up each of the queries in two, separating the evaluation of the data from the calculation of the score:

  • The status query evaluates the Log Analytics data and returns a list of metrics, together with an indication of whether they cross the thresholds for yellow or red.

  • The score query calls the status query and, based on the number of yellow or red values, returns a numerical Health Score. In case the component has dependencies, the status queries of those dependencies are also called.

When combined, the query structure is as shown in the image. You may notice that the Website and the User Flow queries do not have Status queries. This is because their score is only based on the score of dependent components, not on any metrics directly. GameServiceHealthScore and ResultWorkerHealthScore call both their own HealthStatus function as well as ClusterHealthStatus, since they depend on their own telemetry as well as the status of the cluster. Note that not all components from the graph are included in this image; it's here only to illustrate the concept.

Health Model - Nested Query Structure

Let's dive into the Status query and the Score query to see what they look like.

The Status query

The Status query evaluates each of the metrics and determines whether it is within the expected range. You can make the query as complex as you need, but all of the queries follow a similar pattern:

  1. Define the table of desired metrics with threshold values

  2. List the source table and inner-join it with the metrics/thresholds table (leaving only the metrics you're interested in)

  3. Extend the result with IsYellow and IsRed fields by comparing each value to its thresholds

In its simplest form, a status query for an AKS cluster could be:

(whitespace added for clarity, remove before running query)

// Azure resource type to filter on
let _ResourceType = "CONTAINERSERVICE/MANAGEDCLUSTERS"; 

// Define the thresholds table with (MetricName, YellowThreshold, RedThreshold)
let Thresholds=datatable(MetricName: string, YellowThreshold: double, RedThreshold: double) [
    "node_cpu_usage_percentage", 60, 90,   // Average node cpu usage %
    "node_disk_usage_percentage", 60, 80,  // Average node disk usage %
    "node_memory_rss_percentage", 60, 80    // Average node memory usage %
    ];

// The data table holding the metrics from all Azure resources:
AzureMetrics

// Proceed only with data from last day. This could be parameterized to match the dashboard view
| where TimeGenerated > ago(1d)

// Proceed only with data for the right resource provider, as defined above
| where extract("(PROVIDERS/MICROSOFT.)([A-Z]*/[A-Z]*)", 2, ResourceId) == _ResourceType

// Select the fields we want to keep
| project TimeGenerated, MetricName, Value=Average

// Inner join the thresholds table. This only keeps the metrics that are in BOTH tables
| lookup kind=inner Thresholds on MetricName

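// Add IsYellow and IsRed fields, based on value relative to the thresholds.
// A value at or above the red threshold counts as red, so a value exactly on the threshold is not reported as healthy
| extend IsYellow = iff(Value > YellowThreshold and Value < RedThreshold, 1, 0)
| extend IsRed = iff(Value >= RedThreshold, 1, 0)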

The result of this query when run against an existing AKS cluster would be:

Status Query - Example Result

Obviously this is not a particularly busy cluster. But if we lower the thresholds to below the measured values, we will see the IsYellow flag change:

Status Query - Example Result

Once we have our query ready, we're going to save it as a function so it can be called by the Scoring query. On the header bar in the Log Analytics query editor, click Save -> Save as Function. Set a name like 'ClusterHealthStatus' and a category such as 'HealthModel', and click Save. Now you can simply call ClusterHealthStatus() and get the result of your query.
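As a quick sanity check of the saved function, something like the following counts how many samples per metric were flagged. This is only a sketch: it assumes the function was saved under the name ClusterHealthStatus as suggested above, and the Samples, YellowSamples and RedSamples column names are made up for this example:

```kusto
// Call the saved function like a table and summarize the flags per metric
ClusterHealthStatus()
| summarize Samples = count(), YellowSamples = sum(IsYellow), RedSamples = sum(IsRed) by MetricName
| order by MetricName asc
```

If a metric you defined in the thresholds table does not show up at all, it was most likely dropped by the inner join because no data with that MetricName exists in the time window.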

The metrics chosen for evaluation as well as the threshold values for yellow and red are very specific to your project. The ones used here are examples that may or may not apply in your case. My recommendation is to start with some basic metrics and values and improve over time: the model will become more reliable as it operates under normal and abnormal circumstances and as you take more and more edge cases into account. This should be an iterative process, part of your regular operations practice.

I'm not going to post all the queries for the example here at this time. They are very similar to the one above and explaining them would add too much noise. However, they will appear on GitHub shortly as part of a larger project and I will add the link when they do.

The Score Query

Next, let's take a look at the Scoring query. This query calls one or multiple Health Status functions. It may be expanded to contain more complex logic, but for this exercise we set the following rule: the HealthScore starts at 1.0 when all metrics are within their thresholds; subtract 0.25 if any metric is yellow and subtract 0.5 if any metric is red. Or, in an expression:

HealthScore = 1 - (max(IsYellow) * 0.25) - (max(IsRed) * 0.5)
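To see how the expression behaves, here is a small worked example that runs it over hand-written status rows instead of real data. The timestamps and flag values are invented purely for illustration:

```kusto
// Hand-written status rows: two metrics reported at three points in time
datatable(TimeGenerated: datetime, MetricName: string, IsYellow: int, IsRed: int) [
    datetime(2021-12-09T10:58:00Z), "node_cpu_usage_percentage", 1, 0,
    datetime(2021-12-09T10:58:00Z), "node_disk_usage_percentage", 0, 0,
    datetime(2021-12-09T10:59:00Z), "node_cpu_usage_percentage", 0, 1,
    datetime(2021-12-09T10:59:00Z), "node_disk_usage_percentage", 1, 0,
    datetime(2021-12-09T11:00:00Z), "node_cpu_usage_percentage", 0, 0,
    datetime(2021-12-09T11:00:00Z), "node_disk_usage_percentage", 0, 0
]
// One row per timestamp: is ANY metric yellow, is ANY metric red
| summarize YellowScore = max(IsYellow), RedScore = max(IsRed) by TimeGenerated
| extend HealthScore = 1 - (YellowScore * 0.25) - (RedScore * 0.5)
| order by TimeGenerated asc
```

This yields 0.75 for 10:58 (one yellow metric), 0.25 for 10:59 and 1.0 for 11:00. The 10:59 row shows that the two deductions stack when one metric is yellow and another is red at the same time, so the score can drop below the 0.5 we defined as UNHEALTHY. Treating any value at or below 0.5 as red in the visualization is one way to handle that.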

The scoring query for the cluster may, in its simplest form, look like this. This query must also be saved as a function in Log Analytics. Call it ClusterHealthScore:

ClusterHealthStatus
// We want to see our scores for each of the time periods, e.g. 1 minute, so summarize first
| summarize YellowScore=max(IsYellow), RedScore=max(IsRed) by TimeGenerated

// Add the HealthScore column
| extend HealthScore = 1 - (YellowScore * 0.25) - (RedScore * 0.5)

// We need to add a column called ComponentName, since that will be returned to other scoring functions calling this one. It is also used in the visualization.
| extend ComponentName = "Cluster"

If a Health Score query calls multiple Health Status functions, the result should be union-ed first before the score is calculated. This example is for the ResultWorker HealthScore:

ResultWorkerHealthStatus()
| union ClusterHealthStatus()
| union KeyvaultHealthStatus()
| union CheckpointStorageHealthStatus()
| summarize YellowScore = max(IsYellow), RedScore = max(IsRed) by TimeGenerated
| extend HealthScore = 1 - (YellowScore * 0.25) - (RedScore * 0.5)

// In addition to the component name, we also add the dependencies. This is used later to build a visualization
| extend ComponentName = "ResultWorker", Dependencies="Cluster,Keyvault,CheckpointStorage"

Lastly, we have a query that gives the overall, combined health of the solution. This calls all the Health Scores and rolls them up into a single health status. Note that this is the only query that calls other score queries. The reason we don't do this elsewhere is to limit the depth of the nested query structure. This greatly improves Log Analytics' performance. This one would be saved as 'WebsiteHealthScore':

// This is the aggregate score for the whole website.
AddGameResultUserFlowHealthScore()
| union GetPlayerStatsUserFlowHealthScore()
| union ShowStaticContentUserFlowHealthScore()
| summarize YellowScore = max(YellowScore), RedScore = max(RedScore) by bin(TimeGenerated, 2m)
| extend HealthScore = 1 - (YellowScore * 0.25) - (RedScore * 0.5)
| extend ComponentName = "Website", Dependencies = "AddGameResultUserFlow,ShowStaticContentUserFlow,GetPlayerStatsUserFlow"

Bring it all together

With these examples in your toolbox, you should be able to build a rudimentary health model. Don't build something very complex before you have the simple version working, but take an iterative approach. As a first step, take a small part of your solution and start building a model:

  • Build the Health Status and Health Score queries for the components of your solution

  • Create a Health Score query for the top component.

The next step is to visualize the Health Model. To do that, we'll need the following data:

  • Health Scores for each component

  • The name of each component

  • For each component, a list of other components it depends on.

We'll get that if we call all the Health Score functions together:

WebsiteHealthScore
| union AddGameResultUserFlowHealthScore
| union GetPlayerStatsUserFlowHealthScore
| union ShowStaticContentUserFlowHealthScore
| union PublicBlobStorageHealthScore
| union KeyvaultHealthScore
| union GameServiceHealthScore
| union CheckpointStorageHealthScore
| union ResultWorkerHealthScore
| union ClusterHealthScore

// Our graph visual only needs the latest values for each component:
| summarize arg_max(TimeGenerated, *) by ComponentName,Dependencies
| project-away TimeGenerated

If we run this query on our Log Analytics instance, we see the following result:

Health Model - Visualization query

Sidenotes and gotchas

There are several things you may want to consider or may run into when building your health model. I left them out of the story above to keep it simple, but I do feel they should be mentioned.

Deployment Automation

Do not edit your health model queries in your production environment. Treat them as code and keep them in a repository. This will help you create a consistent health modeling experience across environments, ensure you can track the improvements you make, and allow you to revert to older versions if you mess up :). You can automatically deploy functions to a Log Analytics instance using ARM Templates, Bicep or Terraform. The resource type you're looking for is called savedSearches.

Missing Data

You'll need to figure out how you deal with missing data. Metrics like memory use are output on a schedule, so you can count on them being present. If they are no longer sent, something is wrong. But if you build the health status around the response time of an API, having no calls means no response time metrics. You decide for your solution if not having API calls is a good or a bad thing. As a remediation approach, you can use the make-series KQL operator over the series to create a default response for each time unit:

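| make-series Value=avg(DurationMs) default=0 on TimeGenerated from timespanStart to timespanEnd step 1m
// make-series returns arrays; expand them back into one typed row per time bin
| mv-expand TimeGenerated to typeof(datetime), Value to typeof(double)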

Alternatively, you can assume all is well when you hear nothing, by appending this line to the scoring function:

| extend HealthScore = iff(isnull(HealthScore), 1.0, HealthScore)
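To illustrate the make-series approach in isolation, here is a self-contained sketch that fills gaps in a handful of made-up request durations. The sample rows are invented, and timespanStart and timespanEnd are defined here only so the snippet can run on its own:

```kusto
// Hypothetical time window and sample data: no calls arrive at 10:57 or 10:58
let timespanStart = datetime(2021-12-09T10:56:00Z);
let timespanEnd = datetime(2021-12-09T11:01:00Z);
datatable(TimeGenerated: datetime, DurationMs: double) [
    datetime(2021-12-09T10:56:10Z), 120,
    datetime(2021-12-09T10:56:40Z), 80,
    datetime(2021-12-09T10:59:05Z), 300
]
// Build one value per minute, using 0 for minutes without any calls
| make-series Value = avg(DurationMs) default = 0 on TimeGenerated from timespanStart to timespanEnd step 1m
// Turn the resulting arrays back into one typed row per minute
| mv-expand TimeGenerated to typeof(datetime), Value to typeof(double)
```

Minutes without calls come back with a Value of 0 instead of being absent, so the status query still produces a row for them. Whether 0 is the right default depends on whether silence is good or bad for the metric in question.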

Complex queries

The simplistic 3-step model for the Health Status queries may not hold as your query gets more complex and you're trying to integrate multiple sources which may supply their data in different formats. This may be the case when you query both metrics and telemetry. In that case, you may want to run multiple queries, normalize the format and union the result:

// Different queries running on different tables. 
// All queries should return TimeGenerated,MetricName and Value
let q1 = Table1 | where xx==yy | ... | project TimeGenerated,MetricName,Value;
let q2 = Table2 | where xx==yy | ... | project TimeGenerated,MetricName,Value;
let q3 = Table3 | where xx==yy | ... | project TimeGenerated,MetricName,Value;

let Thresholds=datatable(MetricName: string, YellowThreshold: double, RedThreshold: double) [
        "avgTimeSinceEnqueued", 200, 1000,
        "failureCount", 3, 10,
        "avgProcessingTime", 100, 200
        ];

q1 
| union q2 
| union q3
| lookup kind = inner Thresholds on MetricName
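| extend IsYellow = iff(todouble(Value) > YellowThreshold and todouble(Value) < RedThreshold, 1, 0)
// Use >= so a value exactly on the red threshold (e.g. a failureCount of 10) is flagged red rather than healthy
| extend IsRed = iff(todouble(Value) >= RedThreshold, 1, 0)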
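As a more concrete version of this pattern, the sketch below normalizes two telemetry-based metrics into the same TimeGenerated, MetricName, Value shape. It assumes a workspace-based Application Insights instance, where requests land in the AppRequests table with DurationMs and Success columns. The metric name avgRequestDuration and all threshold values are made up for this example:

```kusto
// Example thresholds for the normalized metrics (tune these for your solution)
let Thresholds = datatable(MetricName: string, YellowThreshold: double, RedThreshold: double) [
    "avgRequestDuration", 200, 1000,
    "failureCount", 3, 10
];
// Average request duration per minute
let RequestDuration = AppRequests
    | where TimeGenerated > ago(1d)
    | summarize Value = avg(DurationMs) by bin(TimeGenerated, 1m)
    | extend MetricName = "avgRequestDuration"
    | project TimeGenerated, MetricName, Value;
// Number of failed requests per minute, converted to double to match the other query
let FailedRequests = AppRequests
    | where TimeGenerated > ago(1d) and Success == false
    | summarize Value = todouble(count()) by bin(TimeGenerated, 1m)
    | extend MetricName = "failureCount"
    | project TimeGenerated, MetricName, Value;
RequestDuration
| union FailedRequests
| lookup kind=inner Thresholds on MetricName
| extend IsYellow = iff(Value > YellowThreshold and Value < RedThreshold, 1, 0)
| extend IsRed = iff(Value >= RedThreshold, 1, 0)
```

Giving every sub-query the same three columns with the same types is what keeps the final union and lookup simple. Note that minutes without any failed requests produce no failureCount row at all, which ties back to the missing data discussion above.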

Historical Data

If you want to create a "colored graph"-style visualization as shown above, then the aforementioned table gives you exactly the output that you need: It visualizes the latest state that is available in the model. However, you may also want to track the health status of your solution over time. This can be done using the same queries, since they will all include the Health Score summarized per time bin. If you remove the last two lines of the visualization query, that is exactly what you'll get. You will need a different type of visualization, which we'll also discuss in the next post.
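As a rough sketch of what such a historical query could look like, the example below keeps the per-time-bin scores for two of the components and charts them directly in the Log Analytics query editor. The six-hour window and the choice of components are arbitrary, and the render line is just a quick way to eyeball the data rather than the visualization covered in the next post:

```kusto
// Keep every time bin instead of only the latest value per component
WebsiteHealthScore()
| union ClusterHealthScore()
| where TimeGenerated > ago(6h)
| project TimeGenerated, ComponentName, HealthScore
| order by TimeGenerated asc
| render timechart
```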

Log Analytics limitations

Every time you run the health score function, a complex query is executed. This may take a few seconds to compute, which is fine because you're not going to refresh it every second. Log Analytics does have a limit on the number of concurrent queries per calling principal. That means that if you as a user try to run more than 5 queries at a time, some may be queued for execution rather than run right away. This may become more of a problem if you have a Grafana dashboard that is used by multiple people and connects to Log Analytics using a single service principal.

You can counter this to some extent by not setting the refresh rates too high and not putting too many visuals on a page that all load from Log Analytics.

Keep in mind that every time you run the top query, the whole nested query structure is called. Depending on the time window, this may be a substantial amount of data and it may lead to performance issues, especially when being called from Grafana or another visualization tool. You can do some optimization here related to time windowing and binning, but if performance is still not adequate, you will have to separate the calculation and visualization processes. This means that you externally trigger the health score calculation on a schedule, e.g. every minute. The result is then stored somewhere else, such as a different Log Analytics table or even a relational database. That data source is subsequently what the visualization tool reads from.