feat: add KMS etcd encryption SLI observability stack#5690

Open
wanghaoran1988 wants to merge 1 commit into
Azure:mainfrom
wanghaoran1988:worktree-kms-sli-ksm-config

@wanghaoran1988

Summary

Add observability instrumentation for the customer-managed etcd encryption (KMS) user journey (ARO-25913).

  • Add ValidAzureKMSConfig and EtcdAvailable HostedCluster conditions to kube-state-metrics custom resource state config
  • Add recording rules for all 5 baseline SLIs (availability, errors, latency, traffic, saturation)
  • Add 8 multi-window multi-burn-rate alert rules in the RP lane
  • Add Grafana dashboard with 11 panels, HCP cluster filter, and SRE-friendly descriptions
  • Add apiserver_envelope_encryption_* metrics to the HyperShift SRE metrics set ConfigMap

SLO targets (per ADR)

| SLI | SLO | Alert |
|---|---|---|
| Availability (ValidAzureKMSConfig=True) | 99.95% over 30d | UJKmsAvailability{1h30m,6h30m} |
| Errors (KMS gRPC error rate) | < 0.05% over 30d | UJKmsErrors{1h5m,6h30m,3d6h} |
| Latency (encrypt/decrypt p99) | < 500ms | UJKmsLatencyHigh |
| DEK cache saturation | cache > 0 | UJKmsDekCacheEmpty |
| Status check freshness | age < 5min | UJKmsStatusCheckStale |
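
As a rough illustration of how the burn-rate alerts relate to the SLO targets above, the sketch below converts an SLO target and a burn factor into an error-ratio threshold. The burn factors (14.4x, 6x, 1x) are the ones named in this PR's commit message; pairing them with the 1h, 6h, and 3d windows is an assumption based on the alert name suffixes, and the function names are hypothetical rather than taken from the rule files.

```python
# Sketch only: arithmetic behind multi-window burn-rate thresholds.
# The window/factor pairing below is an assumption, not read from the rule files.

SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_threshold(slo_target: float, burn_factor: float) -> float:
    """Error ratio at which the budget burns `burn_factor` times faster than allowed."""
    error_budget = 1.0 - slo_target
    return burn_factor * error_budget


def budget_consumed(burn_factor: float, window_hours: float) -> float:
    """Fraction of the 30-day error budget spent if the burn lasts `window_hours`."""
    return burn_factor * window_hours / SLO_WINDOW_HOURS


if __name__ == "__main__":
    slo = 0.9995  # 99.95% target, i.e. a 0.05% error budget
    for factor, hours in [(14.4, 1), (6.0, 6), (1.0, 72)]:
        print(
            f"{factor}x over {hours}h: "
            f"alert when error ratio > {burn_threshold(slo, factor):.5f}, "
            f"budget spent in window = {budget_consumed(factor, hours):.0%}"
        )
```

Under these assumptions, a 14.4x burn sustained for one hour spends about 2% of the monthly budget, 6x for six hours about 5%, and 1x for three days about 10%.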

Known issue: HyperShift KAS ServiceMonitor bug

The sre-metrics-set ConfigMap update in this PR adds apiserver_envelope_encryption_* metrics to the KAS allow-list. However, the HyperShift control-plane-operator has a bug: it does not apply the SRE metrics set to the KAS ServiceMonitor and instead uses the hardcoded Telemetry default (3 metrics only), regardless of METRICS_SET=SRE.

Fix submitted: openshift/hypershift#8715 — adds the missing metrics.KASRelabelConfigs(cpContext.MetricsSet) call to v2/kas/servicemonitor.go, matching the pattern used by every other component (etcd, KCM, CVO, etc.).

Until that fix is merged and a new CPO image ships, the envelope_encryption metrics from HCP KAS pods are not scraped. The KSM-based availability SLI (ValidAzureKMSConfig condition) works independently and does not require this fix.

Test plan

  • cd observability && make recording-rules — promtool tests pass, Bicep generated
  • cd observability && make alerts — promtool tests pass, Bicep generated
  • go test ./tooling/helmtest/... -run TestHelmTemplate — Helm fixtures match
  • E2E verified on personal dev environment:
    • KSM metrics flowing to HCP AMW (ValidAzureKMSConfig=True, EtcdAvailable=True)
    • Envelope encryption metrics flowing (9 series, all grpc_status_code=ok)
    • Recording rules producing data (rate5m, dek_cache, status_check_age)
    • Alert rule groups deployed to Azure (6 groups)
    • Dashboard rendering with real data (all 11 panels green for healthy cluster)
    • Alert expressions correctly fire when KMS is degraded

Copilot AI review requested due to automatic review settings June 17, 2026 01:49
@openshift-ci openshift-ci Bot requested review from geoberle and hbhushan3 June 17, 2026 01:49
wanghaoran1988 commented Jun 17, 2026

Author

Dashboard Screenshots

[Two dashboard screenshots]

Copilot AI left a comment

Contributor

⚠️ Not ready to approve

The KAS SRE metrics allow-list change would drop apiserver_request_sli_duration_seconds_{bucket,count} (breaking existing HCP latency SLIs), and the new dashboard’s histogram p99 queries are currently incorrect without rate(...[5m]).
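
To illustrate why a p99 computed from raw cumulative buckets is misleading, here is a minimal sketch of a simplified histogram quantile in Python. It is not the dashboard's query or Prometheus's implementation; the bucket bounds and counts are made-up illustration data, and the function only approximates the linear interpolation that histogram_quantile performs.

```python
import math


def histogram_quantile(q, buckets):
    """Simplified quantile over cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) sorted by bound,
    ending with an infinite bound. Expects 0 < q <= 1.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Quantile falls in the overflow bucket: report the last finite bound.
                return prev_le
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Hypothetical data: counters accumulated since process start (mostly fast calls).
lifetime = [(0.1, 99000), (0.5, 99900), (1.0, 100000), (math.inf, 100000)]
# Hypothetical data: increase over the last 5 minutes only (KMS has become slow).
last_5m = [(0.1, 50), (0.5, 60), (1.0, 100), (math.inf, 100)]

if __name__ == "__main__":
    print("p99 from lifetime counters:", histogram_quantile(0.99, lifetime))
    print("p99 from 5m increase:", histogram_quantile(0.99, last_5m))
```

With this sample data the lifetime counters give a p99 of roughly 0.1s, while the 5-minute increase gives roughly 0.99s. The recent slowdown is only visible when the quantile is taken over a windowed rate or increase.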

Pull request overview

Adds an observability stack (metrics, recording rules, alerts, and a Grafana dashboard) to monitor the customer-managed etcd encryption (Azure KMS) user journey across HCPs.

Changes:

  • Exposes new HostedCluster condition metrics via kube-state-metrics (ValidAzureKMSConfig, EtcdAvailable) and adds new KMS-focused recording rules.
  • Introduces multi-window/multi-burn-rate alerting for KMS availability and KMS operation error budget burn, plus threshold alerts for latency/saturation/freshness.
  • Adds an SRE “user journey” Grafana dashboard for KMS/etcd encryption SLIs.
File summaries
| File | Description |
|---|---|
| observability/recording-rules-hcps.yaml | Registers new KMS recording rule groups into the HCP recording-rules bundle. |
| observability/prometheus/values-mgmt.yaml | Adds KSM customResourceState metrics for ValidAzureKMSConfig and EtcdAvailable HostedCluster conditions. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources.yaml | Updates Helm golden fixture for the new KSM condition metrics. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources_unset.yaml | Updates Helm golden fixture for the new KSM condition metrics (unset variant). |
| observability/grafana-dashboards/sre/user-journey/kms-etcd-encryption.json | Adds the new SRE user-journey dashboard (11 panels) for KMS SLIs. |
| observability/alerts/HCPkmsMonitor-prometheusRule.yaml | Adds KMS alert rules (availability burn-rate tiers, errors burn-rate tiers, and threshold alerts). |
| observability/alerts/HCPkmsMonitor-prometheusRule_test.yaml | Adds promtool tests for the KMS alert rules. |
| observability/alerts/HCPkasRecord-prometheusRule-kms.yaml | Adds recording rules for KSM-based KMS availability SLI windows. |
| observability/alerts/HCPkasRecord-prometheusRule-kms-envelope.yaml | Adds recording rules for envelope-encryption KMS errors/latency/saturation SLIs. |
| observability/alerts/HCPkasRecord-prometheusRule-kms-envelope_test.yaml | Adds promtool tests for the envelope-encryption KMS recording rules. |
| observability/alerts/HCPkasRecord-prometheusRule-kms_test.yaml | Adds promtool tests for the KSM-based KMS recording rules. |
| observability/alerts-rp-services.yaml | Registers the new KMS alert rule file into the RP lane bundle. |
| hypershiftoperator/deploy/templates/sre-metrics-set.configmap.yaml | Updates the HyperShift SRE metrics-set allow-list for KAS metrics (now a restrictive regex). |
| hypershiftoperator/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_hypershift.yaml | Updates the HyperShiftOperator Helm fixture to reflect the new metrics-set allow-list. |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_arohcp_monitor.yaml | Updates monitor Helm fixture for the new KSM condition metrics (rendered config). |
| dev-infrastructure/modules/metrics/rules/generatedRPPrometheusAlertingRules.bicep | Regenerates RP alerting rule groups to include the new KMS alerts. |
| dev-infrastructure/modules/metrics/rules/generatedHCPRecordingRules.bicep | Regenerates HCP recording rule groups to include the new KMS recording rules. |

Copilot's findings

  • Files reviewed: 17/17 changed files
  • Comments generated: 4

Comment thread hypershiftoperator/deploy/templates/sre-metrics-set.configmap.yaml
Copilot AI review requested due to automatic review settings June 17, 2026 02:28
@wanghaoran1988 wanghaoran1988 force-pushed the worktree-kms-sli-ksm-config branch from c26672b to e272e85 Compare June 17, 2026 02:28
openshift-ci Bot commented Jun 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wanghaoran1988
Once this PR has been reviewed and has the lgtm label, please assign avollmer-redhat for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Contributor

⚠️ Not ready to approve

The generated RP alerting rules Bicep appears to have dropped existing nodepool rule groups and the new dashboard contains PromQL bugs (histogram_quantile without rate and NaN-on-idle error-rate expressions).
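
The NaN-on-idle problem can be sketched outside PromQL as follows. This is an illustrative Python analogue with hypothetical function names, not the dashboard's expressions: dividing an error rate by a request rate yields NaN when both are zero, so an idle cluster shows "no value" instead of a healthy 0%. Whether an idle window should read as 0 or as missing data is a design choice the PR author would need to make.

```python
import math


def error_ratio_naive(errors_per_s: float, total_per_s: float) -> float:
    """Mirror PromQL-style float division, where 0/0 yields NaN instead of raising."""
    if total_per_s == 0:
        return math.nan if errors_per_s == 0 else math.inf
    return errors_per_s / total_per_s


def error_ratio_guarded(errors_per_s: float, total_per_s: float) -> float:
    """Treat a window with no traffic as zero errors rather than NaN."""
    if total_per_s == 0:
        return 0.0
    return errors_per_s / total_per_s


if __name__ == "__main__":
    print("idle, naive:", error_ratio_naive(0.0, 0.0))
    print("idle, guarded:", error_ratio_guarded(0.0, 0.0))
    print("busy, guarded:", error_ratio_guarded(1.0, 4.0))
```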

Copilot's findings
  • Files reviewed: 15/15 changed files
  • Comments generated: 6

Add observability instrumentation for the customer-managed etcd
encryption (KMS) user journey (ARO-25913).

KSM custom resource state config:
- Expose ValidAzureKMSConfig and EtcdAvailable HostedCluster conditions
  as Prometheus metrics via kube-state-metrics

Recording rules (registered in recording-rules-hcps.yaml):
- KMS availability burn-rate windows (ratio_avg, sum/count_over_time)
- KMS operation error rate and multi-window averages
- KMS encrypt/decrypt p99 latency
- DEK cache size and KMS status check freshness

Alert rules (registered in alerts-rp-services.yaml, RP lane):
- UJKmsAvailability: multi-window burn-rate (14.4x fast, 6x medium)
- UJKmsErrors: multi-window burn-rate (14.4x, 6x, 1x slow)
- UJKmsLatencyHigh: p99 > 500ms threshold
- UJKmsDekCacheEmpty: cache size = 0 threshold
- UJKmsStatusCheckStale: status check age > 5min threshold

Grafana dashboard (SRE User Journey folder):
- 4 gauge panels: availability, error rate, latency p99, DEK cache
- 7 timeseries panels with SLO target lines and per-HCP filtering
- SRE-friendly panel descriptions explaining good/bad values

SLO targets per ADR:
- Availability: 99.95% ValidAzureKMSConfig=True over 30d
- Errors: <0.05% KMS operation error rate over 30d
- Latency: p99 <500ms for encrypt/decrypt
- Saturation: DEK cache >0, status check age <5min
@wanghaoran1988 wanghaoran1988 force-pushed the worktree-kms-sli-ksm-config branch from e272e85 to fb81950 Compare June 17, 2026 02:53