feat: add new Management Cluster Triage Grafana SRE dashboard #5569

Open
cadenmarchese wants to merge 6 commits into Azure:main from cadenmarchese:cadenmarchese/ARO-26980/mgmt-cluster-dashboard

cadenmarchese commented Jun 9, 2026

Member

https://redhat.atlassian.net/browse/ARO-26980

What

Adds a new Grafana dashboard showing basic resource utilization and top-talker metrics for management clusters.

Why

This will help the team identify noisy neighbors, problematic nodes or pods, and other basic metrics at a glance.

Testing

The dashboard was first served locally to confirm that it loads and renders. It was then tested against the arohcp-dev Grafana instance, using a personal dev environment containing 5 HCPs, to confirm that the queries return data.

Special notes for your reviewer

PR Checklist

  • PR is scoped to a single task (no mixed concerns)
  • Title follows Conventional Commits format
  • Summary explains the "Why" behind the change
  • Linked to relevant ticket/issue
  • Screenshots included (if graph/UI/metrics changes) - Please see management_cluster_triage_3.pdf
  • Self-reviewed the diff
  • CI/CD checks are passing (ignore Tide)
  • Commit history is clean (rebased/squashed)
  • Tricky code blocks are commented
  • Specific reviewers tagged
  • All comment threads resolved before merge

Copilot AI review requested due to automatic review settings June 9, 2026 13:25
@openshift-ci openshift-ci Bot requested review from roivaz and sclarkso June 9, 2026 13:25

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new Grafana “Management Cluster Triage” dashboard to speed up incident triage for KAS availability by correlating node health, workload stability, noisy-neighbor HCP resource use, and etcd early-warning signals.

Changes:

  • Introduces a new Grafana dashboard JSON for management-cluster triage.
  • Adds templating for separate SVC and HCP Prometheus datasources plus a cluster selector.
  • Adds panels for node readiness/pressure, workload instability signals, top HCP namespace resource usage, and etcd WAL/leader-change indicators.
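
For readers unfamiliar with Grafana templating, the second bullet could look roughly like the sketch below. It builds a `templating` section with two Prometheus datasource variables and a cluster selector. The variable names (`svc_datasource`, `hcp_datasource`, `cluster`), the labels, and the `label_values(up, cluster)` query are assumptions made for illustration and are not taken from the dashboard JSON in this PR.

```python
import json


def datasource_variable(name: str, label: str) -> dict:
    """Grafana variable that lets the viewer pick a Prometheus datasource."""
    return {
        "name": name,
        "label": label,
        "type": "datasource",
        "query": "prometheus",  # restrict the picker to Prometheus datasources
    }


def cluster_variable(datasource_var: str) -> dict:
    """Query variable listing cluster label values from the chosen datasource."""
    return {
        "name": "cluster",  # hypothetical variable name
        "label": "Cluster",
        "type": "query",
        "datasource": {"type": "prometheus", "uid": "${" + datasource_var + "}"},
        # Assumed query: the metric and label are placeholders.
        "query": "label_values(up, cluster)",
        "refresh": 1,
    }


templating = {
    "list": [
        datasource_variable("svc_datasource", "SVC Prometheus"),
        datasource_variable("hcp_datasource", "HCP Prometheus"),
        cluster_variable("svc_datasource"),
    ]
}

print(json.dumps({"templating": templating}, indent=2))
```

Panels would then reference `${svc_datasource}` or `${hcp_datasource}` as their datasource and filter on `cluster="$cluster"`; the actual names used in the PR may differ.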

4 comment threads on observability/grafana-dashboards/sre/user-journey/mgmt-cluster-triage.json (outdated)

swiencki commented Jun 9, 2026

Collaborator

/lgtm - great looking dashboard, look forward to it

openshift-ci Bot commented Jun 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cadenmarchese, swiencki
Once this PR has been reviewed and has the lgtm label, please assign hbhushan3 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cadenmarchese

Member Author

@roivaz @sclarkso PTAL? ty!

Copilot AI review requested due to automatic review settings June 15, 2026 20:24

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 10 comments.

6 comment threads on observability/grafana-dashboards/sre/user-journey/mgmt-cluster-triage.json (outdated)
@cadenmarchese

Member Author

/retest-required

Comment thread observability/grafana-dashboards/sre/user-journey/mgmt-cluster-triage.json Outdated
so that leader elections are more visible

rate(...[5m]) → increase(...[1h])
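
The reasoning behind this change can be illustrated with a small sketch. It compares a per-second rate over a 5-minute window with the total increase over a 1-hour window for a counter that ticks once. The sample data and the helper names `window_increase` and `window_rate` are hypothetical, and the arithmetic is a simplification: real Prometheus `rate()` and `increase()` also extrapolate to the window boundaries and handle counter resets.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def window_increase(samples: List[Sample], now: float, window_s: float) -> float:
    """Counter growth between the first and last sample in [now - window_s, now].

    Simplified: no boundary extrapolation and no counter-reset handling.
    """
    in_window = [v for t, v in samples if now - window_s <= t <= now]
    if len(in_window) < 2:
        return 0.0
    return in_window[-1] - in_window[0]


def window_rate(samples: List[Sample], now: float, window_s: float) -> float:
    """Average per-second growth over the window (simplified rate)."""
    return window_increase(samples, now, window_s) / window_s


# Hypothetical leader-change counter scraped every 60 s for one hour,
# with a single leader election at t = 1800 s.
samples = [(float(t), 0.0 if t < 1800 else 1.0) for t in range(0, 3601, 60)]

rate_at_event = window_rate(samples, now=1860.0, window_s=300.0)
rate_later = window_rate(samples, now=3600.0, window_s=300.0)
increase_1h = window_increase(samples, now=3600.0, window_s=3600.0)

print(f"5m rate just after the election: {rate_at_event:.5f} per second")
print(f"5m rate 30 minutes later:        {rate_later:.5f} per second")
print(f"1h increase 30 minutes later:    {increase_1h:.0f}")
```

Under these assumptions the 5-minute rate is a tiny fraction (1/300 per second) that drops back to zero once the event leaves the window, while the 1-hour increase still reads as one whole election half an hour later, which is easier to spot on a panel.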
Copilot AI review requested due to automatic review settings June 17, 2026 13:01

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

4 comment threads on observability/grafana-dashboards/sre/user-journey/mgmt-cluster-triage.json (outdated)
@cadenmarchese

Member Author

/retest
