feat: add KMS etcd encryption SLI observability stack #5690
wanghaoran1988 wants to merge 1 commit
⚠️ Not ready to approve
The KAS SRE metrics allow-list change would drop `apiserver_request_sli_duration_seconds_{bucket,count}`, breaking existing HCP latency SLIs. The new dashboard's histogram p99 queries are also incorrect: they apply `histogram_quantile` to raw bucket counters instead of wrapping them in `rate(...[5m])`.
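To illustrate why the missing `rate()` matters, here is a minimal Python sketch of the linear interpolation that `histogram_quantile` performs, applied once to cumulative counters and once to the increase over a window. The bucket bounds and counts are invented for illustration and do not come from this PR; the sketch also assumes the lowest bucket starts at 0.

```python
import math

def histogram_quantile(q, buckets):
    """Approximate PromQL histogram_quantile.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Observations beyond the last finite bound: report that bound.
                return prev_bound
            # Linear interpolation inside the matching bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative bucket counters since process start (seconds).
at_start = [(0.1, 9900), (0.5, 9990), (1.0, 10000), (math.inf, 10000)]
at_end = [(0.1, 9900), (0.5, 10040), (1.0, 10100), (math.inf, 10100)]

# Increase over the window. rate() divides this by the window length, which
# does not change the quantile because every bucket is scaled equally.
window = [(b, end - start) for (b, end), (_, start) in zip(at_end, at_start)]

raw_p99 = histogram_quantile(0.99, at_end)       # about 0.38 s, dominated by history
windowed_p99 = histogram_quantile(0.99, window)  # about 0.99 s, the recent behaviour
```

With these made-up numbers the raw-counter query stays under a 500ms threshold while the windowed query is well above it, which is the failure mode the review describes.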
Pull request overview
Adds an observability stack (metrics, recording rules, alerts, and a Grafana dashboard) to monitor the customer-managed etcd encryption (Azure KMS) user journey across HCPs.
Changes:
- Exposes new HostedCluster condition metrics via kube-state-metrics (`ValidAzureKMSConfig`, `EtcdAvailable`) and adds new KMS-focused recording rules.
- Introduces multi-window/multi-burn-rate alerting for KMS availability and KMS operation error budget burn, plus threshold alerts for latency/saturation/freshness.
- Adds an SRE “user journey” Grafana dashboard for KMS/etcd encryption SLIs.
File summaries
| File | Description |
|---|---|
| observability/recording-rules-hcps.yaml | Registers new KMS recording rule groups into the HCP recording-rules bundle. |
| observability/prometheus/values-mgmt.yaml | Adds KSM customResourceState metrics for ValidAzureKMSConfig and EtcdAvailable HostedCluster conditions. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources.yaml | Updates Helm golden fixture for the new KSM condition metrics. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources_unset.yaml | Updates Helm golden fixture for the new KSM condition metrics (unset variant). |
| observability/grafana-dashboards/sre/user-journey/kms-etcd-encryption.json | Adds the new SRE user-journey dashboard (11 panels) for KMS SLIs. |
| observability/alerts/HCPkmsMonitor-prometheusRule.yaml | Adds KMS alert rules (availability burn-rate tiers, errors burn-rate tiers, and threshold alerts). |
| observability/alerts/HCPkmsMonitor-prometheusRule_test.yaml | Adds promtool tests for the KMS alert rules. |
| observability/alerts/HCPkasRecord-prometheusRule-kms.yaml | Adds recording rules for KSM-based KMS availability SLI windows. |
| observability/alerts/HCPkasRecord-prometheusRule-kms-envelope.yaml | Adds recording rules for envelope-encryption KMS errors/latency/saturation SLIs. |
| observability/alerts/HCPkasRecord-prometheusRule-kms-envelope_test.yaml | Adds promtool tests for the envelope-encryption KMS recording rules. |
| observability/alerts/HCPkasRecord-prometheusRule-kms_test.yaml | Adds promtool tests for the KSM-based KMS recording rules. |
| observability/alerts-rp-services.yaml | Registers the new KMS alert rule file into the RP lane bundle. |
| hypershiftoperator/deploy/templates/sre-metrics-set.configmap.yaml | Updates the HyperShift SRE metrics-set allow-list for KAS metrics (now a restrictive regex). |
| hypershiftoperator/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_hypershift.yaml | Updates the HyperShiftOperator Helm fixture to reflect the new metrics-set allow-list. |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_arohcp_monitor.yaml | Updates monitor Helm fixture for the new KSM condition metrics (rendered config). |
| dev-infrastructure/modules/metrics/rules/generatedRPPrometheusAlertingRules.bicep | Regenerates RP alerting rule groups to include the new KMS alerts. |
| dev-infrastructure/modules/metrics/rules/generatedHCPRecordingRules.bicep | Regenerates HCP recording rule groups to include the new KMS recording rules. |
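The allow-list risk noted for `sre-metrics-set.configmap.yaml` can be checked offline. The sketch below is a rough illustration only: both regex patterns and the envelope-encryption metric name are hypothetical, not the values in this PR. It relies on the fact that Prometheus anchors relabel regexes at both ends, which `re.fullmatch` mimics.

```python
import re

# Hypothetical allow-list patterns; the real ones live in the ConfigMap.
RESTRICTIVE = r"apiserver_envelope_encryption_.*"
COMBINED = (
    r"apiserver_envelope_encryption_.*"
    r"|apiserver_request_sli_duration_seconds_(bucket|count)"
)

def kept(pattern, metric_names):
    """Return the names a `keep` relabel rule with this regex would retain."""
    rx = re.compile(pattern)
    return [name for name in metric_names if rx.fullmatch(name)]

scraped = [
    "apiserver_envelope_encryption_example_total",  # illustrative name
    "apiserver_request_sli_duration_seconds_bucket",
    "apiserver_request_sli_duration_seconds_count",
]

only_kms = kept(RESTRICTIVE, scraped)  # latency SLI series are dropped
both = kept(COMBINED, scraped)         # all three series survive
```

A check like this could run against the rendered Helm fixture so that a narrowed regex cannot silently drop series that existing SLIs depend on.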
Copilot's findings
- Files reviewed: 17/17 changed files
- Comments generated: 4
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: wanghaoran1988.
⚠️ Not ready to approve
The generated RP alerting rules Bicep appears to have dropped existing nodepool rule groups and the new dashboard contains PromQL bugs (histogram_quantile without rate and NaN-on-idle error-rate expressions).
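The NaN-on-idle problem comes from dividing an error rate by a total rate that is zero when no KMS operations occur. A minimal Python sketch of the behaviour and of one possible guard follows; the function names are hypothetical, and treating an idle window as a 0% error rate is an assumption about the intended dashboard semantics, not something the PR states.

```python
import math

def naive_error_ratio(errors_per_s, total_per_s):
    """Mimic PromQL float division, where 0/0 evaluates to NaN."""
    if total_per_s == 0:
        return math.nan if errors_per_s == 0 else math.inf
    return errors_per_s / total_per_s

def guarded_error_ratio(errors_per_s, total_per_s):
    """Report 0.0 for an idle window instead of NaN (assumed semantics)."""
    if total_per_s == 0:
        return 0.0
    return errors_per_s / total_per_s

idle_naive = naive_error_ratio(0.0, 0.0)      # NaN: the panel shows no value
idle_guarded = guarded_error_ratio(0.0, 0.0)  # 0.0: the panel shows a healthy idle cluster
busy = guarded_error_ratio(0.5, 100.0)        # unchanged when traffic exists
```

Whether an idle cluster should render as 0% or as "no data" is a design choice for the dashboard author; the point is only that the expression should make that choice explicitly.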
Copilot's findings
- Files reviewed: 15/15 changed files
- Comments generated: 6
Add observability instrumentation for the customer-managed etcd encryption (KMS) user journey (ARO-25913).

KSM custom resource state config:
- Expose ValidAzureKMSConfig and EtcdAvailable HostedCluster conditions as Prometheus metrics via kube-state-metrics

Recording rules (registered in recording-rules-hcps.yaml):
- KMS availability burn-rate windows (ratio_avg, sum/count_over_time)
- KMS operation error rate and multi-window averages
- KMS encrypt/decrypt p99 latency
- DEK cache size and KMS status check freshness

Alert rules (registered in alerts-rp-services.yaml, RP lane):
- UJKmsAvailability: multi-window burn-rate (14.4x fast, 6x medium)
- UJKmsErrors: multi-window burn-rate (14.4x, 6x, 1x slow)
- UJKmsLatencyHigh: p99 > 500ms threshold
- UJKmsDekCacheEmpty: cache size = 0 threshold
- UJKmsStatusCheckStale: status check age > 5min threshold

Grafana dashboard (SRE User Journey folder):
- 4 gauge panels: availability, error rate, latency p99, DEK cache
- 7 timeseries panels with SLO target lines and per-HCP filtering
- SRE-friendly panel descriptions explaining good/bad values

SLO targets per ADR:
- Availability: 99.95% ValidAzureKMSConfig=True over 30d
- Errors: <0.05% KMS operation error rate over 30d
- Latency: p99 <500ms for encrypt/decrypt
- Saturation: DEK cache >0, status check age <5min
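The burn-rate multipliers in the commit message translate into concrete error-rate thresholds once the SLO is fixed. The following sketch works through that arithmetic for the 99.95% target over 30 days. The pairing of each multiplier with a 1h, 6h, or 3d long window is an assumption read off the alert names (`1h5m`, `6h30m`, `3d6h`), not something the commit message states.

```python
SLO = 0.9995            # 99.95% target from the commit message
PERIOD_HOURS = 30 * 24  # 30-day SLO period

def error_rate_threshold(burn_rate, slo=SLO):
    """Error ratio at which the budget burns `burn_rate` times too fast."""
    return burn_rate * (1 - slo)

def budget_consumed(burn_rate, window_hours, period_hours=PERIOD_HOURS):
    """Fraction of the period's error budget spent if the burn rate
    is sustained for the whole window."""
    return burn_rate * window_hours / period_hours

# (burn rate, assumed long window in hours)
tiers = {"fast": (14.4, 1), "medium": (6, 6), "slow": (1, 72)}

for name, (burn, hours) in tiers.items():
    print(
        f"{name}: fire above {error_rate_threshold(burn):.4%} errors, "
        f"{budget_consumed(burn, hours):.0%} of budget per {hours}h window"
    )
```

Under these assumptions the fast tier fires at a 0.72% error ratio and corresponds to 2% of the monthly budget burned in one hour, the medium tier to 5% in six hours, and the slow tier to 10% in three days.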
Summary
Add observability instrumentation for the customer-managed etcd encryption (KMS) user journey (ARO-25913).
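- Add `ValidAzureKMSConfig` and `EtcdAvailable` HostedCluster conditions to kube-state-metrics custom resource state config
- Add `apiserver_envelope_encryption_*` metrics to the HyperShift SRE metrics set ConfigMap
SLO targets (per ADR)
| SLI | Target | Alert |
|---|---|---|
| Availability | 99.95% over 30d (`ValidAzureKMSConfig=True`) | `UJKmsAvailability{1h30m,6h30m}` |
| Errors | <0.05% KMS operation error rate over 30d | `UJKmsErrors{1h5m,6h30m,3d6h}` |
| Latency | p99 <500ms for encrypt/decrypt | `UJKmsLatencyHigh` |
| Saturation | DEK cache >0 | `UJKmsDekCacheEmpty` |
| Saturation | Status check age <5min | `UJKmsStatusCheckStale` |
Known issue: HyperShift KAS ServiceMonitor bug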
The `sre-metrics-set` ConfigMap update in this PR adds `apiserver_envelope_encryption_*` metrics to the KAS allow-list, but the HyperShift control-plane-operator has a bug: it does not apply the SRE metrics set to the KAS ServiceMonitor and instead uses the hardcoded Telemetry default (3 metrics only) regardless of `METRICS_SET=SRE`.
Fix submitted: openshift/hypershift#8715, which adds the missing `metrics.KASRelabelConfigs(cpContext.MetricsSet)` call to `v2/kas/servicemonitor.go`, matching the pattern used by every other component (etcd, KCM, CVO, etc.).
Until that fix is merged and a new CPO image ships, the `envelope_encryption` metrics from HCP KAS pods are not scraped. The KSM-based availability SLI (`ValidAzureKMSConfig` condition) works independently and does not require this fix.
Test plan
- `cd observability && make recording-rules`: promtool tests pass, Bicep generated
- `cd observability && make alerts`: promtool tests pass, Bicep generated
- `go test ./tooling/helmtest/... -run TestHelmTemplate`: Helm fixtures match
- … (`ValidAzureKMSConfig=True`, `EtcdAvailable=True`)
- … (`grpc_status_code=ok`)