feat: add new Management Cluster Triage Grafana SRE dashboard #5569
cadenmarchese wants to merge 6 commits
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a new Grafana “Management Cluster Triage” dashboard to speed up incident triage for KAS availability by correlating node health, workload stability, noisy-neighbor HCP resource use, and etcd early-warning signals.
Changes:
- Introduces a new Grafana dashboard JSON for management-cluster triage.
- Adds templating for separate SVC and HCP Prometheus datasources plus a cluster selector.
- Adds panels for node readiness/pressure, workload instability signals, top HCP namespace resource usage, and etcd WAL/leader-change indicators.
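The templating change can be pictured with a minimal sketch of the `templating` section of a Grafana dashboard model, built here in Python. The variable names (`svc_datasource`, `hcp_datasource`, `cluster`), the `label_values(up, cluster)` query, and the choice of the SVC datasource for the cluster selector are assumptions for illustration only and are not taken from the PR's JSON.

```python
import json

# Hypothetical variable names; the dashboard in this PR may use different ones.
templating = {
    "list": [
        # Datasource variables let the viewer pick a Prometheus instance.
        {"name": "svc_datasource", "label": "SVC datasource",
         "type": "datasource", "query": "prometheus"},
        {"name": "hcp_datasource", "label": "HCP datasource",
         "type": "datasource", "query": "prometheus"},
        # Query variable that lists clusters from the selected datasource.
        # Assumes a `cluster` label exists on the `up` metric.
        {"name": "cluster", "label": "Cluster", "type": "query",
         "datasource": {"type": "prometheus", "uid": "${svc_datasource}"},
         "query": "label_values(up, cluster)", "refresh": 1},
    ]
}

dashboard_fragment = {"templating": templating}
variable_names = [v["name"] for v in templating["list"]]
print(json.dumps(dashboard_fragment, indent=2))
```

Panels would then reference `${svc_datasource}` or `${hcp_datasource}` as their datasource and filter on `$cluster`, so a single dashboard can be pointed at either Prometheus instance.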
/lgtm - great looking dashboard, look forward to it
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: cadenmarchese, swiencki
/retest-required
Changed the leader-change query from rate(...[5m]) to increase(...[1h]) so that leader elections are more visible.
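To illustrate why the wider `increase` window makes rare events easier to see, here is a simplified sketch of the two calculations on a made-up counter. It ignores Prometheus's boundary extrapolation and counter-reset handling, and the sample data (30-second scrapes, one leader change at t = 600 s) is hypothetical.

```python
def increase(samples, now, window):
    """Counter growth across [now - window, now], without extrapolation."""
    in_window = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return 0.0
    return in_window[-1][1] - in_window[0][1]


def rate(samples, now, window):
    """Average per-second growth across the same window."""
    return increase(samples, now, window) / window


# Hypothetical counter: scraped every 30 s for one hour,
# with a single leader change recorded at t = 600 s.
samples = [(t, 1.0 if t >= 600 else 0.0) for t in range(0, 3601, 30)]

now = 3600
rate_5m_now = rate(samples, now, 300)        # event left the 5m window long ago
increase_1h_now = increase(samples, now, 3600)  # event still inside the 1h window
rate_5m_at_event = rate(samples, 700, 300)   # tiny per-second value near the event

print(rate_5m_now, increase_1h_now, rate_5m_at_event)
```

With a 5-minute rate, a single leader change shows up as a brief, very small per-second value and then vanishes; with a 1-hour increase, the same event reads as a whole count of 1 for the full hour.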
/retest
https://redhat.atlassian.net/browse/ARO-26980
What
Adds a new Grafana dashboard showing basic resource utilization and top-talker metrics for management clusters.
Why
This will help the team identify noisy neighbors, problematic nodes or pods, and other basic metrics at a glance.
Testing
The Grafana dashboard was first served locally to confirm that it loads and renders. It was then tested against the arohcp-dev Grafana instance with a personal dev environment (containing 5 HCPs) to confirm that the queries return the expected data.