Controller to deploy KSM per HCP #5689
…egion label
- Use Server-Side Apply instead of Get-then-Update to avoid spurious updates from Kubernetes-defaulted fields
- Guard KSM controller creation on --ksm-image being set
- Use single leader election for both SwiftNIC and KSM controllers
- Check KubeAPIServerAvailable condition for readiness instead of reading secrets (no additional RBAC needed)
- Use service-network-admin-kubeconfig for in-cluster HCP API access
- Filter metrics to kube_node_status_condition and kube_node_info only
- Inject azure_region label from HCP spec via ServiceMonitor relabeling
- Add TypeMeta to Deployment and Service for SSA compatibility
- Add liveness/readiness probes and automountServiceAccountToken: false
- Add pod security context (runAsUser/runAsGroup/fsGroup: 65534)
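The condition-based readiness check listed above can be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: the `Condition` type is a hypothetical stand-in for `metav1.Condition`, and `kubeAPIServerAvailable` is an assumed helper name.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for metav1.Condition, reduced to the
// two fields this sketch needs.
type Condition struct {
	Type   string
	Status string
}

// kubeAPIServerAvailable reports whether the KubeAPIServerAvailable condition
// is present and set to "True". A missing condition is treated as not ready.
func kubeAPIServerAvailable(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "KubeAPIServerAvailable" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	ready := []Condition{{Type: "KubeAPIServerAvailable", Status: "True"}}
	fmt.Println(kubeAPIServerAvailable(ready)) // true
	fmt.Println(kubeAPIServerAvailable(nil))   // false
}
```

Reading the condition from the HostedControlPlane status keeps the check within the object the controller already watches, which is consistent with the commit's note that no additional RBAC is needed.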
hypershift v0.1.76 (used by mgmt-agent) requires newer golang.org/x packages, which go work sync propagates across all workspace modules.
- propagate error from toUnstructured in buildServiceMonitor
- return nil instead of error when KubeAPIServer is unavailable
- reduce RBAC verbs to least-privilege (remove update/delete)
- move dynamic client creation inside KSMImage guard
- fix gci import ordering
- Add unit tests for readiness checks, resource builders, label injection, and deletion timestamp behavior
- Add DeletionTimestamp check to prevent recreation loop during HCP deletion (GC and controller racing)
- Add README documenting architecture and design decisions
- Add HCP worker node metrics section to docs/monitoring.md
- Update cmd.go description to document both controllers
…xternal label
The azure_region metric relabel on each KSM HCP ServiceMonitor was redundant since Prometheus external labels apply globally to all remote-written series. This moves the region label to the Prometheus externalLabels config for both mgmt and svc clusters, removing it from buildServiceMonitor.
SSA only needs create and patch verbs. Remove unused get, list, watch on deployments, services, and servicemonitors.
- Fix README typo and update to reflect region is a global Prometheus external label, not injected per-ServiceMonitor
- Remove TestDeletingHCPStillReportsAvailable which checked field values without calling syncHandler
Adds environmentName alongside cluster and region in externalLabels for mgmt, svc, and opstool Prometheus configs so all remote-written metrics carry the deployment environment.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: inbharajmani
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR updates multiple Go module dependencies and extends observability + management-cluster functionality by adding a new mgmt-agent controller that deploys kube-state-metrics per HostedControlPlane, along with additional Prometheus external labels for better metric attribution.
Changes:
- Bump `golang.org/x/*` and several other indirect dependencies across many Go modules.
- Add `region` and `environment` Prometheus `externalLabels` and update Helm snapshots/fixtures accordingly.
- Introduce the mgmt-agent "KSM HCP controller" (flag-driven) that reconciles per-HCP kube-state-metrics Deployment/Service/ServiceMonitor and adds required RBAC + Helm wiring.
Reviewed changes
Copilot reviewed 50 out of 76 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tooling/tenant-quota/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/tenant-quota/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/templatize/go.mod | Update go-openapi/google/uber/golang.org/x dependency versions. |
| tooling/secret-sync/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/secret-sync/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/prometheus-rules/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/prometheus-rules/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/olm-bundle-repkg/go.sum | Dependency checksum updates (go-openapi, golang.org/x/*, zap, tools). |
| tooling/olm-bundle-repkg/go.mod | Bump dependencies used by OLM bundle repack tool. |
| tooling/kustoctl/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/kustoctl/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/image-updater/go.sum | Dependency checksum updates (golang.org/x/*, x/mod). |
| tooling/image-updater/go.mod | Bump x/mod and indirect golang.org/x/* versions. |
| tooling/helmtest/go.sum | Dependency checksum updates (go-openapi, golang.org/x/*, zap, tools). |
| tooling/helmtest/go.mod | Bump dependencies used for helm templating tests. |
| tooling/hcpctl/go.mod | Bump hypershift API and related indirect deps for hcpctl. |
| tooling/grafanactl/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/grafanactl/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/cleanup-sweeper/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/cleanup-sweeper/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/aro-hcp-exporter/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/aro-hcp-exporter/go.mod | Bump indirect golang.org/x/* versions. |
| test/go.mod | Bump hypershift API + various indirect deps for tests. |
| test-integration/go.mod | Bump hypershift API + various indirect deps for integration tests. |
| sessiongate/go.sum | Dependency checksum updates (go-openapi, hypershift, golang.org/x/*). |
| sessiongate/go.mod | Bump hypershift API + golang.org/x/* and go-openapi deps. |
| observability/prometheus/values-svc.yaml | Add region + environment external labels for svc Prometheus. |
| observability/prometheus/values-opstool.yaml | Add region + environment external labels for opstool Prometheus. |
| observability/prometheus/values-mgmt.yaml | Add region + environment external labels for mgmt Prometheus. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_svc_resources_unset.yaml | Update snapshot with new Prometheus external labels. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_svc_resources.yaml | Update snapshot with new Prometheus external labels. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources_unset.yaml | Update snapshot with new Prometheus external labels. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources.yaml | Update snapshot with new Prometheus external labels. |
| observability/prometheus/deploy/templates/prometheus.yaml | Template now renders region + environment external labels. |
| mgmt-agent/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_mgmt_agent.yaml | Update mgmt-agent Helm snapshot (RBAC + new flag). |
| mgmt-agent/values.yaml | Add Helm values for kube-state-metrics image (ksmImage). |
| mgmt-agent/pkg/controller/ksmhcp/testdata/zz_fixture_TestBuildServiceMonitor.yaml | New fixture for ServiceMonitor rendering. |
| mgmt-agent/pkg/controller/ksmhcp/testdata/zz_fixture_TestBuildService.yaml | New fixture for Service rendering. |
| mgmt-agent/pkg/controller/ksmhcp/testdata/zz_fixture_TestBuildDeployment.yaml | New fixture for Deployment rendering. |
| mgmt-agent/pkg/controller/ksmhcp/resources.go | Build KSM Deployment/Service/ServiceMonitor objects. |
| mgmt-agent/pkg/controller/ksmhcp/controller_test.go | Unit tests for condition check + resource builders. |
| mgmt-agent/pkg/controller/ksmhcp/controller.go | New controller reconciling KSM resources per HostedControlPlane. |
| mgmt-agent/pkg/controller/ksmhcp/README.md | Document new controller behavior and why it exists. |
| mgmt-agent/go.mod | Add dependencies for Hypershift + Prometheus Operator and testutil. |
| mgmt-agent/deploy/templates/deployment.yaml | Add --ksm-image arg to controller container. |
| mgmt-agent/deploy/templates/clusterrole.yaml | Add RBAC rules for HostedControlPlanes + KSM resources. |
| mgmt-agent/cmd/options.go | Wire informers/clients and start KSM controller under leader election. |
| mgmt-agent/cmd/cmd.go | Update CLI help text to describe both controllers. |
| kube-applier/go.mod | Bump go-openapi + hypershift API + etcd/zap/golang.org/x deps. |
| internal/go.mod | Bump hypershift API + go-openapi and other indirect deps. |
| frontend/go.sum | Dependency checksum updates (hypershift API, golang.org/x/*). |
| frontend/go.mod | Bump hypershift API + add google/pprof indirect + golang.org/x/* deps. |
| fleet/go.mod | Bump go-openapi + hypershift API + golang.org/x/* deps. |
| docs/monitoring.md | Document HCP worker node metrics via per-HCP kube-state-metrics. |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_svc_1_arohcp_monitor.yaml | Update snapshot with new Prometheus external labels. |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_arohcp_monitor.yaml | Update snapshot with new Prometheus external labels. |
| dev-infrastructure/scripts/recreate-system-pool/go.sum | Dependency checksum updates (go-openapi, golang.org/x/*). |
| dev-infrastructure/scripts/recreate-system-pool/go.mod | Bump go-openapi + golang.org/x/* deps. |
| dev-infrastructure/scripts/recreate-broken-pools/go.sum | Dependency checksum updates (go-openapi, golang.org/x/*). |
| dev-infrastructure/scripts/recreate-broken-pools/go.mod | Bump go-openapi + golang.org/x/* deps. |
| dev-infrastructure/scripts/cleanup-pko-resources/go.sum | Dependency checksum updates (go-openapi, golang.org/x/*). |
| dev-infrastructure/scripts/cleanup-pko-resources/go.mod | Bump go-openapi + golang.org/x/* deps. |
| backend/go.mod | Bump hypershift API + go-openapi + golang.org/x/* deps. |
| admin/server/go.mod | Bump hypershift API + go-openapi + golang.org/x/* deps. |
| admin/client/go.sum | Dependency checksum updates (golang.org/x/*). |
| admin/client/go.mod | Bump golang.org/x/* deps. |
Comments suppressed due to low confidence (1)
observability/prometheus/deploy/templates/prometheus.yaml:1
- The template now assumes `.Values.prometheusSpec.externalLabels.region` and `.Values.prometheusSpec.externalLabels.environment` always exist. If a consumer renders this chart with older values (or omits these keys), Helm will render `<no value>` into the manifest, which is hard to detect and can pollute label sets. Consider either (a) using `required` to fail fast with a clear error if the keys are missing, or (b) guarding/defaulting these fields so the labels are omitted or empty when not provided.
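The failure mode in this comment can be reproduced with Go's standard `text/template` package, which Helm builds on. The sketch below is only illustrative: it uses plain `text/template` rather than Helm itself (whose handling of missing keys may differ by version and flags), and the template string, `render` helper, and values are assumptions made up for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// labelsTmpl is a hypothetical one-line template that reads a nested key.
const labelsTmpl = "region: {{ .externalLabels.region }}"

// render executes labelsTmpl against values. With strict set, a missing map
// key becomes an execution error instead of silently rendering a placeholder.
func render(values map[string]any, strict bool) (string, error) {
	t := template.New("labels")
	if strict {
		t = t.Option("missingkey=error")
	}
	t, err := t.Parse(labelsTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, values); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Older values: externalLabels exists but has no "region" key.
	old := map[string]any{"externalLabels": map[string]any{"cluster": "mgmt-1"}}

	out, err := render(old, false)
	fmt.Printf("lenient: %q err=%v\n", out, err)

	_, err = render(old, true)
	fmt.Printf("strict failed: %v\n", err != nil)
}
```

In lenient mode the placeholder text lands in the output without any error, which is why the comment suggests failing fast or defaulting the value.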
Use label-filtered informer factories for Deployments, Services, and ServiceMonitors so the controller only caches resources it manages (app.kubernetes.io/name=kube-state-metrics-hcp), reducing memory overhead on busy management clusters.
https://redhat.atlassian.net/browse/AROSLSRE-1014
What
Deploy kube-state-metrics per HCP to monitor worker node health.
Why
HCP worker nodes are invisible to the management cluster's KSM because they register with the HCP's own API server. This controller deploys a KSM instance into each HCP namespace to scrape node metrics and forward them to the HCP Azure Managed Prometheus workspace.
How
The mgmt-agent watches HostedControlPlane CRs and creates a KSM Deployment, Service, and ServiceMonitor per HCP. KSM connects to the HCP API server via service-network-admin-kubeconfig and exposes kube_node_status_condition and kube_node_info metrics. The ServiceMonitor injects the namespace label so metrics route to the HCP Azure Monitor Workspace. The region and environment labels are added globally via Prometheus external labels.
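One way the per-HCP KSM container could be restricted to node metrics is through kube-state-metrics command-line flags, sketched below. This is an assumption about the mechanism, not a description of the PR's implementation: the `ksmArgs` helper and the kubeconfig mount path are hypothetical, while `--kubeconfig`, `--resources`, and `--metric-allowlist` are standard kube-state-metrics flags.

```go
package main

import (
	"fmt"
	"strings"
)

// ksmArgs builds container arguments for a kube-state-metrics instance that
// talks to a remote API server and exposes only the listed node metrics.
func ksmArgs(kubeconfigPath string, metrics []string) []string {
	return []string{
		"--kubeconfig=" + kubeconfigPath,
		"--resources=nodes",
		"--metric-allowlist=" + strings.Join(metrics, ","),
	}
}

func main() {
	// Hypothetical mount path for the service-network-admin-kubeconfig secret.
	args := ksmArgs("/etc/kubeconfig/kubeconfig", []string{
		"kube_node_status_condition",
		"kube_node_info",
	})
	for _, a := range args {
		fmt.Println(a)
	}
}
```

Filtering at the source keeps the scrape payload small; the same result could also be achieved with metric relabeling on the ServiceMonitor.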
Testing
Unit tests for readiness, resource builders, label injection, deletion safety
Verified end-to-end on personal dev: metrics visible in Grafana via HCP Azure Monitor Workspace
https://arohcp-dev-c9g7a4fjanb0c4gc.wus3.grafana.azure.com/goto/2KCMG8xvR?o