If you're running an Azure Kubernetes Service (AKS) cluster, you know how important it is to monitor your node pools, especially when they hit their maximum node count and can't scale any further.
Recently, I faced the challenge of building an Azure alert that dynamically checks every node pool in an AKS cluster and fires if any of them hit their maxCount. Along the way, I ran into some tricky issues, from missing data and parsing problems to permission errors. In this guide, I'll show you how to set up an AKS node pool maxCount alert so you know when your cluster reaches its autoscaling limit.
Why Monitor Node Pool Maximums
In AKS, node pools are groups of VMs that run your Kubernetes workloads. With autoscaling enabled, AKS automatically adds or removes nodes based on demand, but only up to a maximum (maxCount). If a node pool hits that limit, it won't scale any further, leading to performance bottlenecks or even failed deployments.
Setting up alerts based on hardcoded thresholds isn't ideal. The better approach is to create a dynamic alert that:
- Checks all node pools in your cluster
- Looks up each pool's actual maxCount from Azure
- Compares it to the current number of Ready nodes
- Alerts you when a pool hits or exceeds that maximum
The Plan: Combine ARG with Log Analytics
The approach was straightforward in theory:
- Use Azure Resource Graph to retrieve each pool's maxCount
- Use Log Analytics (KQL) to count how many nodes are actually running
- Join those results and alert when any pool is full
The idea was to join ARG data (node pool configs) with Log Analytics data (node counts) in a KQL query, then set up an alert to notify me when any node pool hits its limit. Sounds simple, right? Well, Azure had other plans!
The Journey: Hitting Roadblocks
Here are the key issues I ran into, and how I solved them.
Roadblock 1: ARG Permissions
The first issue came up immediately when I tried to run the Resource Graph query in Log Analytics:

arg("").Resources
| where type == 'microsoft.containerservice/managedclusters' and name == 'aks-pixelrobots-dev'
| extend agentPools = properties.agentPoolProfiles
| mv-expand agentPool = agentPools
| where agentPool.enableAutoScaling == true

The query failed with:

Errors occurred while resolving remote entities. Cluster='https://ade.loganalytics.io/…': not authorized to access resource name: AzureResourceGraph…
Seems, the arg("")
operate solely works if:
- You (or the alert identification) have Reader entry on the subscription
- The workspace helps ARG queries
The repair:
- Assign Reader to my person and to the managed identification of the alert
- Additionally assign Log Analytics Reader for workspace entry
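Once the roles were in place, a quick sanity check like this (a minimal sketch, run straight from the Log Analytics query editor) confirmed that the ARG cross-service query was authorized:

// If this returns your clusters, arg("") is working from the workspace
arg("").Resources
| where type == 'microsoft.containerservice/managedclusters'
| project name, resourceGroup, subscriptionId
| take 5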
Roadblock 2: mv-expand Limits
To extract each agent pool from the cluster properties, I used mv-expand. But then I hit this error:

Azure Resource Graph: At least one mvexpand operator has a row limit of 2147483647, which exceeds the max limit of 2000.

Even though my cluster only had a few pools, mv-expand defaulted to a row limit far higher than ARG allows. The fix was simple: add an explicit limit to the expansion.
Here's the pattern I used:

| mv-expand agentPool = agentPools limit 50

This tells the engine to expand at most 50 pools, which is more than enough for a typical AKS cluster.
Roadblock 3: Empty nodePoolName Values
Next, I had trouble joining the ARG data with the live node data. Specifically, I couldn't get a valid node pool name from KubeNodeInventory.
Initially, I tried:

KubeNodeInventory
| where TimeGenerated > ago(5m)
| where ClusterName == 'aks-pixelrobots-dev'
| extend parsedLabels = parse_json(Labels)
| extend nodePoolName = tostring(parsedLabels["agentpool"])

But this returned blank values. The problem? Labels was an array, not a flat object.
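A quick look at the raw column makes that obvious (a minimal sketch, using the same cluster name as above):

// Inspect the raw Labels column: it comes back as an array of label objects
KubeNodeInventory
| where TimeGenerated > ago(5m)
| where ClusterName == 'aks-pixelrobots-dev'
| project Computer, Labels
| take 5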
Here's the fix that worked:

| extend nodePoolName = tolower(tostring(parsedLabels[0].agentpool))

This accesses the first object in the array, where the agentpool label lives. Adding tolower() ensures consistent casing when joining with the ARG data.
Roadblock 4: Wildly Inaccurate Node Counts
Even after everything looked good, the counts were way off. Some node pools showed 200%+ usage, which is clearly wrong.
The mistake? I used count() to tally the nodes:

| summarize currentNodeCount = count() by ClusterName, nodePoolName

count() tallied all rows, including duplicates (e.g., multiple status updates per node). Plus, I wasn't filtering for Ready nodes. I switched to:

| where Status == 'Ready'
| summarize currentNodeCount = dcount(Computer) by ClusterName, nodePoolName

dcount(Computer) counts unique nodes, and Status == 'Ready' ensures only healthy nodes are included. This brought the counts back to reality (e.g., 3 nodes for systempool).
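If you want to see the duplication for yourself, a comparison like this (a minimal sketch, again using the cluster name from this post) shows raw rows versus distinct nodes per pool:

// Compare raw row counts with distinct node counts to see why count() over-reports
KubeNodeInventory
| where TimeGenerated > ago(5m)
| where ClusterName == 'aks-pixelrobots-dev'
| where Status == 'Ready'
| extend nodePoolName = tolower(tostring(parse_json(Labels)[0].agentpool))
| summarize rowCount = count(), distinctNodes = dcount(Computer) by nodePoolName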
The Final Solution
Once all the pieces were working, I ended up with two queries: one for monitoring and one for alerting.
Monitoring Query
This query shows all autoscaling node pools in your cluster and their current usage, so you can monitor everything in a single view.
let clusterName = "aks-pixelrobots-dev";
let maxNodes = arg("").Resources
    | where type == 'microsoft.containerservice/managedclusters' and name == clusterName
    | extend agentPools = properties.agentPoolProfiles
    | mv-expand agentPool = agentPools limit 50
    | where agentPool.enableAutoScaling == true
    | project clusterName = name,
              nodePoolName = tolower(tostring(agentPool["name"])),
              maxCount = toint(agentPool["maxCount"]);
let currentNodes = KubeNodeInventory
    | where TimeGenerated > ago(5m)
    | where ClusterName == clusterName
    | where Status == 'Ready'
    | extend parsedLabels = parse_json(Labels)
    | extend nodePoolName = tolower(tostring(parsedLabels[0].agentpool))
    | where nodePoolName != ''
    | summarize currentNodeCount = dcount(Computer) by ClusterName, nodePoolName;
maxNodes
| join kind=leftouter currentNodes on $left.nodePoolName == $right.nodePoolName
| project clusterName,
          nodePoolName,
          currentNodeCount = coalesce(currentNodeCount, 0),
          maxCount,
          usagePercent = iff(maxCount > 0, 100.0 * coalesce(currentNodeCount, 0) / maxCount, 0.0),
          isAtMax = coalesce(currentNodeCount, 0) >= maxCount
| order by isAtMax desc, usagePercent desc
This gives a full overview of your cluster's node pool usage.
clusterName           nodePoolName  currentNodeCount  maxCount  usagePercent  isAtMax
aks-pixelrobots-dev   systempool    3                 3         100.0         true
aks-pixelrobots-dev   devops        10                12        83.33         false
aks-pixelrobots-dev   pronodes      5                 30        16.67         false
Alert Query
This version filters the result to only include node pools that are at or over their maxCount. Use this one to power your alert rule.
let clusterName = "aks-pixelrobots-dev";
let maxNodes = arg("").Resources
    | where type == 'microsoft.containerservice/managedclusters' and name == clusterName
    | extend agentPools = properties.agentPoolProfiles
    | mv-expand agentPool = agentPools limit 50
    | where agentPool.enableAutoScaling == true
    | project clusterName = name,
              nodePoolName = tolower(tostring(agentPool["name"])),
              maxCount = toint(agentPool["maxCount"]);
let currentNodes = KubeNodeInventory
    | where TimeGenerated > ago(5m)
    | where ClusterName == clusterName
    | where Status == 'Ready'
    | extend parsedLabels = parse_json(Labels)
    | extend nodePoolName = tolower(tostring(parsedLabels[0].agentpool))
    | where nodePoolName != ''
    | summarize currentNodeCount = dcount(Computer) by ClusterName, nodePoolName;
maxNodes
| join kind=leftouter currentNodes on $left.nodePoolName == $right.nodePoolName
| where coalesce(currentNodeCount, 0) >= maxCount
| project clusterName,
          nodePoolName,
          currentNodeCount = coalesce(currentNodeCount, 0),
          maxCount,
          usagePercent = iff(maxCount > 0, 100.0 * coalesce(currentNodeCount, 0) / maxCount, 0.0),
          isAtMax = coalesce(currentNodeCount, 0) >= maxCount
| order by usagePercent desc
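If you'd rather be warned before a pool is completely full, you could swap the final where for a percentage threshold instead. A sketch of that variation, reusing the same maxNodes and currentNodes definitions from the alert query above and assuming a 90% warning level:

// Replace the tail of the alert query with a percentage-based threshold (90% is an example)
maxNodes
| join kind=leftouter currentNodes on $left.nodePoolName == $right.nodePoolName
| extend readyNodes = coalesce(currentNodeCount, 0)
| extend usagePercent = iff(maxCount > 0, 100.0 * readyNodes / maxCount, 0.0)
| where usagePercent >= 90
| project clusterName, nodePoolName, currentNodeCount = readyNodes, maxCount, usagePercent
| order by usagePercent desc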
Setting Up the Alert
Once you have the alert query working, setting up the actual alert in Azure Monitor is straightforward.
- Go to your Log Analytics workspace
- Paste in the alert query and run it
- Click "New alert rule"
- For the condition, use a custom log search
- Set it to trigger when the number of results is greater than 0
- Set the frequency and evaluation period to 5 minutes
- Link an action group with your preferred notifications
- Name your rule and save it
Don't forget to assign the right permissions (Reader + Log Analytics Reader) to the alert's managed identity.
Why This Setup Is Worth It
With this solution in place, you get:
- Real-time alerts based on live config and current usage
- Accurate node counting by filtering for only Ready nodes
- Dynamic monitoring across all node pools, not just hardcoded ones
- A clean, reusable query setup you can tweak anytime
Tips and Ideas
- Want to monitor multiple clusters? Remove the name == clusterName filter (see the sketch after this list)
- Want to track trends? Use the monitoring query in a workbook or dashboard
- Want to route alerts to Teams or Slack? Add a webhook to your action group
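As a starting point for the multi-cluster case, here's a sketch of the alert query with the cluster name filters removed and the join widened to match on both cluster and pool (an assumption here: the cluster names reported by ARG and by KubeNodeInventory line up once lowercased):

// Multi-cluster sketch: no hardcoded cluster name, join on cluster + pool
let maxNodes = arg("").Resources
    | where type == 'microsoft.containerservice/managedclusters'
    | extend agentPools = properties.agentPoolProfiles
    | mv-expand agentPool = agentPools limit 50
    | where agentPool.enableAutoScaling == true
    | project clusterName = tolower(name),
              nodePoolName = tolower(tostring(agentPool["name"])),
              maxCount = toint(agentPool["maxCount"]);
let currentNodes = KubeNodeInventory
    | where TimeGenerated > ago(5m)
    | where Status == 'Ready'
    | extend nodePoolName = tolower(tostring(parse_json(Labels)[0].agentpool))
    | where nodePoolName != ''
    // Assumes ClusterName here matches the ARG resource name once lowercased
    | summarize currentNodeCount = dcount(Computer) by clusterName = tolower(ClusterName), nodePoolName;
maxNodes
| join kind=leftouter currentNodes on clusterName, nodePoolName
| project clusterName, nodePoolName, currentNodeCount = coalesce(currentNodeCount, 0), maxCount
| where currentNodeCount >= maxCount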
Wrapping Up
This was a fun and educational deep dive into the power of combining Azure Resource Graph with KQL. If you're running AKS and care about autoscaling, this alert setup gives you visibility and protection when things hit their limits.
Let me know if you give it a try or if you run into similar roadblocks. I'd love to hear how others are solving this challenge in their own clusters.
Happy scaling!