feat: add new Management Cluster Triage Grafana SRE dashboard by cadenmarchese · Pull Request #5569 · Azure/ARO-HCP

cadenmarchese · 2026-06-09T13:25:20Z

https://redhat.atlassian.net/browse/ARO-26980

What

Adds a new Grafana Dashboard for basic resource utilization and top talker metrics for mgmt clusters.

Why

This will help the team identify noisy neighbors, problematic nodes or pods, and other basic metrics at a glance.

Testing

The Grafana dashboard was served locally first to confirm functionality. Then, it was tested against the arohcp-dev Grafana instance with a personal dev env (containing 5 HCPs) to confirm functionality of the queries.

Special notes for your reviewer

PR Checklist

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new Grafana “Management Cluster Triage” dashboard to speed up incident triage for KAS availability by correlating node health, workload stability, noisy-neighbor HCP resource use, and etcd early-warning signals.

Changes:

Introduces a new Grafana dashboard JSON for management-cluster triage.
Adds templating for separate SVC and HCP Prometheus datasources plus a cluster selector.
Adds panels for node readiness/pressure, workload instability signals, top HCP namespace resource usage, and etcd WAL/leader-change indicators.
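For readers unfamiliar with Grafana templating, the variable setup described above can be sketched roughly as follows. This is a minimal illustration, not the contents of mgmt-cluster-triage.json: the variable names (`svc_datasource`, `hcp_datasource`, `cluster`), the `kube_node_info` metric, and its `cluster` label are all assumptions made for the example.

```python
import json


def datasource_var(name, label):
    """Build a Grafana 'datasource' template variable that lists Prometheus datasources."""
    return {"name": name, "label": label, "type": "datasource", "query": "prometheus"}


def cluster_var(datasource_var_name):
    """Build a query variable that reads cluster names from the chosen datasource.

    The metric and label in the query are hypothetical placeholders.
    """
    return {
        "name": "cluster",
        "label": "Cluster",
        "type": "query",
        "datasource": {"type": "prometheus", "uid": "${" + datasource_var_name + "}"},
        "query": "label_values(kube_node_info, cluster)",
    }


# Skeleton dashboard model: two datasource selectors plus a cluster selector.
dashboard = {
    "title": "Management Cluster Triage",
    "timezone": "utc",
    "templating": {
        "list": [
            datasource_var("svc_datasource", "SVC Prometheus"),
            datasource_var("hcp_datasource", "HCP Prometheus"),
            cluster_var("svc_datasource"),
        ]
    },
    "panels": [],  # panels would reference ${svc_datasource}, ${hcp_datasource}, and $cluster
}

print(json.dumps(dashboard, indent=2))
```

Keeping the two datasources as separate variables lets each panel pick the Prometheus instance that actually holds its metrics, while a single cluster selector filters both.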

swiencki · 2026-06-09T21:33:16Z

/lgtm - great looking dashboard, look forward to it

openshift-ci · 2026-06-09T21:33:23Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cadenmarchese, swiencki
Once this PR has been reviewed and has the lgtm label, please assign hbhushan3 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

observability/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cadenmarchese · 2026-06-11T14:43:17Z

@roivaz @sclarkso PTAL? ty!

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 10 comments.

cadenmarchese · 2026-06-16T17:37:30Z

/retest-required

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

cadenmarchese · 2026-06-17T13:58:59Z

/retest

add new management cluster triage Grafana dashboard

f3b9a91

Copilot AI review requested due to automatic review settings June 9, 2026 13:25

openshift-ci Bot requested review from roivaz and sclarkso June 9, 2026 13:25

Copilot AI reviewed Jun 9, 2026

address copilot review comments

3d42820

swiencki reviewed Jun 11, 2026

Comment thread observability/grafana-dashboards/sre/user-journey/mgmt-cluster-triage.json

use utc timezone in mgmt-cluster-triage.json

49ab69d

Copilot AI review requested due to automatic review settings June 15, 2026 20:24

Copilot AI reviewed Jun 15, 2026

address copilot feedback

7716d9a

swiencki reviewed Jun 17, 2026

Comment thread observability/grafana-dashboards/sre/user-journey/mgmt-cluster-triage.json Outdated

decrease threshold of etcd leader election graph

f3cae31

Makes leader elections more visible on the graph by switching the query from rate(...[5m]) to increase(...[1h]).
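To illustrate why this change helps for rare events, here is a simplified model of the two PromQL functions applied to a counter that increments once. It is a sketch under stated assumptions: the 30-second scrape interval and the sample series are hypothetical, and real Prometheus additionally extrapolates to the window edges and handles counter resets, which this model ignores.

```python
SCRAPE_INTERVAL_S = 30  # assumed scrape interval, not taken from the PR


def increase(samples, end_idx, window_s):
    """Counter growth over the window ending at samples[end_idx] (no extrapolation)."""
    steps = window_s // SCRAPE_INTERVAL_S
    start_idx = max(0, end_idx - steps)
    return samples[end_idx] - samples[start_idx]


def rate(samples, end_idx, window_s):
    """Per-second average growth over the same window."""
    return increase(samples, end_idx, window_s) / window_s


# Two hours of samples from a hypothetical leader-change counter
# that increments exactly once, 30 minutes in.
n = 2 * 3600 // SCRAPE_INTERVAL_S + 1
change_idx = 30 * 60 // SCRAPE_INTERVAL_S
samples = [0 if i < change_idx else 1 for i in range(n)]

rate_5m = [rate(samples, i, 300) for i in range(n)]
increase_1h = [increase(samples, i, 3600) for i in range(n)]

peak_rate = max(rate_5m)
peak_increase = max(increase_1h)
visible_rate_s = sum(1 for v in rate_5m if v > 0) * SCRAPE_INTERVAL_S
visible_increase_s = sum(1 for v in increase_1h if v > 0) * SCRAPE_INTERVAL_S

print(f"rate[5m]:     peak {peak_rate:.4f}/s, non-zero for {visible_rate_s}s")
print(f"increase[1h]: peak {peak_increase}, non-zero for {visible_increase_s}s")
```

In this model a single election shows up under rate(...[5m]) as a value of about 0.0033 per second for five minutes, which is easy to miss on a graph, whereas increase(...[1h]) reports a whole count of 1 for a full hour.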

Copilot AI review requested due to automatic review settings June 17, 2026 13:01

Copilot AI reviewed Jun 17, 2026

address copilot feedback

8d0803a

3 participants
