feat: add KMS etcd encryption SLI observability stack by wanghaoran1988 · Pull Request #5690 · Azure/ARO-HCP

wanghaoran1988 · 2026-06-17T01:49:10Z

Summary

Add observability instrumentation for the customer-managed etcd encryption (KMS) user journey (ARO-25913).

- Add ValidAzureKMSConfig and EtcdAvailable HostedCluster conditions to the kube-state-metrics custom resource state config
- Add recording rules for all 5 baseline SLIs (availability, errors, latency, traffic, saturation)
- Add 8 multi-window multi-burn-rate alert rules in the RP lane
- Add Grafana dashboard with 11 panels, HCP cluster filter, and SRE-friendly descriptions
- Add apiserver_envelope_encryption_* metrics to the HyperShift SRE metrics set ConfigMap

SLO targets (per ADR)

| SLI | SLO | Alert |
| --- | --- | --- |
| Availability (`ValidAzureKMSConfig=True`) | 99.95% over 30d | `UJKmsAvailability{1h30m,6h30m}` |
| Errors (KMS gRPC error rate) | < 0.05% over 30d | `UJKmsErrors{1h5m,6h30m,3d6h}` |
| Latency (encrypt/decrypt p99) | < 500ms | `UJKmsLatencyHigh` |
| DEK cache saturation | cache > 0 | `UJKmsDekCacheEmpty` |
| Status check freshness | age < 5min | `UJKmsStatusCheckStale` |
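
The burn-rate tiers behind these alerts can be related to the SLO targets with a little arithmetic. The sketch below is illustrative only: it assumes the 14.4x, 6x, and 1x factors named in the commit message pair with long windows of 1h, 6h, and 3d (read off the alert name suffixes), and the function names are hypothetical rather than taken from the rule files.

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO period, as stated in the table


def burn_rate_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio above which a burn-rate alert with this factor would fire."""
    error_budget = 1.0 - slo  # e.g. 1 - 0.9995 for a 99.95% SLO
    return burn_rate * error_budget


def budget_consumed(burn_rate: float, window_hours: float) -> float:
    """Fraction of the 30-day error budget spent if the burn rate holds for the window."""
    return burn_rate * window_hours / SLO_PERIOD_HOURS


# Assumed (factor, long window in hours) pairs: 14.4x/1h, 6x/6h, 1x/3d.
TIERS = [(14.4, 1), (6.0, 6), (1.0, 72)]

for factor, hours in TIERS:
    threshold = burn_rate_threshold(0.9995, factor)
    spent = budget_consumed(factor, hours)
    print(f"{factor}x over {hours}h: error ratio > {threshold:.4%}, budget spent {spent:.0%}")
```

Under these assumptions, the fast tier corresponds to an error ratio of roughly 0.72% and would consume about 2% of the monthly budget in one hour, which is why it is typically treated as the most urgent tier.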

Known issue: HyperShift KAS ServiceMonitor bug

The sre-metrics-set ConfigMap update in this PR adds apiserver_envelope_encryption_* metrics to the KAS allow-list. However, the HyperShift control-plane-operator has a bug: it does not apply the SRE metrics set to the KAS ServiceMonitor and instead uses the hardcoded Telemetry default (3 metrics only), regardless of METRICS_SET=SRE.

Fix submitted in openshift/hypershift#8715, which adds the missing metrics.KASRelabelConfigs(cpContext.MetricsSet) call to v2/kas/servicemonitor.go, matching the pattern used by every other component (etcd, KCM, CVO, etc.).

Until that fix is merged and a new CPO image ships, the envelope_encryption metrics from HCP KAS pods are not scraped. The KSM-based availability SLI (ValidAzureKMSConfig condition) works independently and does not require this fix.

Test plan

- `cd observability && make recording-rules`: promtool tests pass, Bicep generated
- `cd observability && make alerts`: promtool tests pass, Bicep generated
- `go test ./tooling/helmtest/... -run TestHelmTemplate`: Helm fixtures match
- E2E verified on personal dev environment:
  - KSM metrics flowing to HCP AMW (ValidAzureKMSConfig=True, EtcdAvailable=True)
  - Envelope encryption metrics flowing (9 series, all grpc_status_code=ok)
  - Recording rules producing data (rate5m, dek_cache, status_check_age)
  - Alert rule groups deployed to Azure (6 groups)
  - Dashboard rendering with real data (all 11 panels green for healthy cluster)
  - Alert expressions correctly fire when KMS is degraded

wanghaoran1988 · 2026-06-17T01:49:26Z

Dashboard Screenshots

Copilot

⚠️ Not ready to approve

The KAS SRE metrics allow-list change would drop apiserver_request_sli_duration_seconds_{bucket,count} (breaking existing HCP latency SLIs), and the new dashboard’s histogram p99 queries are currently incorrect without rate(...[5m]).
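
To illustrate why a p99 query over raw bucket counters can mislead, the sketch below applies a simplified version of the bucket interpolation that histogram_quantile performs, first to lifetime cumulative counts and then to the increase over a recent window. The bucket bounds and counts are invented for the example, and the helper is a simplified stand-in, not Prometheus code. Because a quantile is unaffected by scaling all buckets by the same factor, the per-window increase is used here in place of rate(...[5m]).

```python
import math


def histogram_quantile(q, buckets):
    """Simplified quantile estimate from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in +Inf: report last finite bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


INF = float("inf")
# Hypothetical counters: 100000 fast requests since start, then 100 slow ones in the last window.
before = [(0.1, 100000), (0.5, 100000), (1.0, 100000), (INF, 100000)]
now = [(0.1, 100000), (0.5, 100000), (1.0, 100100), (INF, 100100)]

# Per-bucket increase over the window, which is what rate() would be proportional to.
window = [(bound, n - b) for (bound, n), (_, b) in zip(now, before)]

p99_lifetime = histogram_quantile(0.99, now)
p99_window = histogram_quantile(0.99, window)
print(p99_lifetime, p99_window)
```

In this made-up scenario the lifetime estimate stays under 0.1s while the windowed estimate is close to 1s, so a 500ms threshold would only be crossed by the windowed form.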

Pull request overview

Adds an observability stack (metrics, recording rules, alerts, and a Grafana dashboard) to monitor the customer-managed etcd encryption (Azure KMS) user journey across HCPs.

Changes:

Exposes new HostedCluster condition metrics via kube-state-metrics (ValidAzureKMSConfig, EtcdAvailable) and adds new KMS-focused recording rules.
Introduces multi-window/multi-burn-rate alerting for KMS availability and KMS operation error budget burn, plus threshold alerts for latency/saturation/freshness.
Adds an SRE “user journey” Grafana dashboard for KMS/etcd encryption SLIs.

File summaries

| File | Description |
| --- | --- |
| observability/recording-rules-hcps.yaml | Registers new KMS recording rule groups into the HCP recording-rules bundle. |
| observability/prometheus/values-mgmt.yaml | Adds KSM customResourceState metrics for `ValidAzureKMSConfig` and `EtcdAvailable` HostedCluster conditions. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources.yaml | Updates Helm golden fixture for the new KSM condition metrics. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources_unset.yaml | Updates Helm golden fixture for the new KSM condition metrics (unset variant). |
| observability/grafana-dashboards/sre/user-journey/kms-etcd-encryption.json | Adds the new SRE user-journey dashboard (11 panels) for KMS SLIs. |
| observability/alerts/HCPkmsMonitor-prometheusRule.yaml | Adds KMS alert rules (availability burn-rate tiers, errors burn-rate tiers, and threshold alerts). |
| observability/alerts/HCPkmsMonitor-prometheusRule_test.yaml | Adds promtool tests for the KMS alert rules. |
| observability/alerts/HCPkasRecord-prometheusRule-kms.yaml | Adds recording rules for KSM-based KMS availability SLI windows. |
| observability/alerts/HCPkasRecord-prometheusRule-kms-envelope.yaml | Adds recording rules for envelope-encryption KMS errors/latency/saturation SLIs. |
| observability/alerts/HCPkasRecord-prometheusRule-kms-envelope_test.yaml | Adds promtool tests for the envelope-encryption KMS recording rules. |
| observability/alerts/HCPkasRecord-prometheusRule-kms_test.yaml | Adds promtool tests for the KSM-based KMS recording rules. |
| observability/alerts-rp-services.yaml | Registers the new KMS alert rule file into the RP lane bundle. |
| hypershiftoperator/deploy/templates/sre-metrics-set.configmap.yaml | Updates the HyperShift SRE metrics-set allow-list for KAS metrics (now a restrictive regex). |
| hypershiftoperator/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_hypershift.yaml | Updates the HyperShiftOperator Helm fixture to reflect the new metrics-set allow-list. |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_arohcp_monitor.yaml | Updates monitor Helm fixture for the new KSM condition metrics (rendered config). |
| dev-infrastructure/modules/metrics/rules/generatedRPPrometheusAlertingRules.bicep | Regenerates RP alerting rule groups to include the new KMS alerts. |
| dev-infrastructure/modules/metrics/rules/generatedHCPRecordingRules.bicep | Regenerates HCP recording rule groups to include the new KMS recording rules. |
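
The sre-metrics-set row notes that the KAS allow-list is now a restrictive regex. As a rough illustration of how a keep-style allow-list behaves, the sketch below filters metric names with a fully anchored match, which is how Prometheus relabeling treats its regex. Both regexes and the `..._example_total` metric name are hypothetical and are not the values in the ConfigMap.

```python
import re


def kept_by_allow_list(metric_names, keep_regex):
    """Return the names a keep-style allow-list would retain.

    re.fullmatch emulates the implicit ^...$ anchoring of relabel regexes.
    """
    pattern = re.compile(keep_regex)
    return [name for name in metric_names if pattern.fullmatch(name)]


scraped = [
    "apiserver_envelope_encryption_example_total",  # hypothetical name with the new prefix
    "apiserver_request_sli_duration_seconds_bucket",
    "apiserver_request_sli_duration_seconds_count",
]

# Hypothetical narrow allow-list: only the new prefix survives.
narrow = r"apiserver_envelope_encryption_.*"

# Hypothetical widened allow-list that also keeps the existing latency SLI series.
widened = (
    r"apiserver_envelope_encryption_.*"
    r"|apiserver_request_sli_duration_seconds_(bucket|count)"
)

print(kept_by_allow_list(scraped, narrow))
print(kept_by_allow_list(scraped, widened))
```

A check along these lines, run against the list of series that existing rules depend on, is one way to catch an allow-list change that silently drops them.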

Copilot's findings

Files reviewed: 17/17 changed files
Comments generated: 4

openshift-ci · 2026-06-17T02:28:46Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wanghaoran1988
Once this PR has been reviewed and has the lgtm label, please assign avollmer-redhat for approval. For more information see the Code Review Process.

Details

Needs approval from an approver in each of these files:

dev-infrastructure/OWNERS
observability/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

⚠️ Not ready to approve

The generated RP alerting rules Bicep appears to have dropped existing nodepool rule groups and the new dashboard contains PromQL bugs (histogram_quantile without rate and NaN-on-idle error-rate expressions).
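
The NaN-on-idle concern comes from dividing an error rate by a request rate when both are zero. The sketch below mimics that floating-point behaviour and shows one possible guard. It is an illustration, not the fix applied in this PR, and whether an idle window should read as 0 or as no data is a design choice left open here. The function names are hypothetical.

```python
import math


def promql_divide(numerator: float, denominator: float) -> float:
    """Float division with IEEE 754 semantics: 0/0 is NaN, x/0 is +/-Inf."""
    if denominator == 0:
        if numerator == 0 or math.isnan(numerator):
            return math.nan
        return math.copysign(math.inf, numerator)
    return numerator / denominator


def error_ratio(errors_per_s: float, requests_per_s: float) -> float:
    """Unguarded ratio: yields NaN when there was no traffic in the window."""
    return promql_divide(errors_per_s, requests_per_s)


def error_ratio_idle_safe(errors_per_s: float, requests_per_s: float) -> float:
    """Guarded ratio: treats an idle window as zero errors (an assumption)."""
    if requests_per_s == 0:
        return 0.0
    return errors_per_s / requests_per_s


idle = error_ratio(0.0, 0.0)
print(math.isnan(idle))                    # NaN on an idle cluster
print(idle > 0.0005)                       # comparisons against NaN are always False
print(error_ratio_idle_safe(0.0, 0.0))     # guarded form reports 0.0
```

Because every comparison with NaN is false, an unguarded expression tends not to fire a threshold alert on an idle cluster, but it can still render as a gap or an error state on a gauge panel.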

Copilot's findings

Files reviewed: 15/15 changed files
Comments generated: 6

Add observability instrumentation for the customer-managed etcd encryption (KMS) user journey (ARO-25913).

KSM custom resource state config:
- Expose ValidAzureKMSConfig and EtcdAvailable HostedCluster conditions as Prometheus metrics via kube-state-metrics

Recording rules (registered in recording-rules-hcps.yaml):
- KMS availability burn-rate windows (ratio_avg, sum/count_over_time)
- KMS operation error rate and multi-window averages
- KMS encrypt/decrypt p99 latency
- DEK cache size and KMS status check freshness

Alert rules (registered in alerts-rp-services.yaml, RP lane):
- UJKmsAvailability: multi-window burn-rate (14.4x fast, 6x medium)
- UJKmsErrors: multi-window burn-rate (14.4x, 6x, 1x slow)
- UJKmsLatencyHigh: p99 > 500ms threshold
- UJKmsDekCacheEmpty: cache size = 0 threshold
- UJKmsStatusCheckStale: status check age > 5min threshold

Grafana dashboard (SRE User Journey folder):
- 4 gauge panels: availability, error rate, latency p99, DEK cache
- 7 timeseries panels with SLO target lines and per-HCP filtering
- SRE-friendly panel descriptions explaining good/bad values

SLO targets per ADR:
- Availability: 99.95% ValidAzureKMSConfig=True over 30d
- Errors: <0.05% KMS operation error rate over 30d
- Latency: p99 <500ms for encrypt/decrypt
- Saturation: DEK cache >0, status check age <5min

Copilot AI review requested due to automatic review settings June 17, 2026 01:49

openshift-ci Bot requested review from geoberle and hbhushan3 June 17, 2026 01:49

openshift-ci Bot added the needs-rebase label Jun 17, 2026

Copilot started reviewing on behalf of wanghaoran1988 June 17, 2026 01:49

wanghaoran1988 force-pushed the worktree-kms-sli-ksm-config branch from f580183 to c26672b Compare June 17, 2026 01:52

openshift-ci Bot removed the needs-rebase label Jun 17, 2026

Copilot AI reviewed Jun 17, 2026

Copilot AI review requested due to automatic review settings June 17, 2026 02:28

wanghaoran1988 force-pushed the worktree-kms-sli-ksm-config branch from c26672b to e272e85 Compare June 17, 2026 02:28

Copilot started reviewing on behalf of wanghaoran1988 June 17, 2026 02:28

Copilot AI reviewed Jun 17, 2026

wanghaoran1988 force-pushed the worktree-kms-sli-ksm-config branch from e272e85 to fb81950 Compare June 17, 2026 02:53

wanghaoran1988 wants to merge 1 commit into Azure:main from wanghaoran1988:worktree-kms-sli-ksm-config

2 participants
