
Controller to deploy KSM per HCP#5689

Open
inbharajmani wants to merge 14 commits into Azure:main from inbharajmani:ksm-controller


@inbharajmani

Collaborator

https://redhat.atlassian.net/browse/AROSLSRE-1014

What
Deploy kube-state-metrics per HCP to monitor worker node health.

Why
HCP worker nodes are invisible to the management cluster's KSM because they register with the HCP's own API server. This controller deploys a KSM instance into each HCP namespace to expose node metrics, which are then scraped and forwarded to the HCP Azure Managed Prometheus workspace.

How
The mgmt-agent watches HostedControlPlane CRs and creates a KSM Deployment, Service, and ServiceMonitor per HCP. KSM connects to the HCP API server via service-network-admin-kubeconfig and exposes kube_node_status_condition and kube_node_info metrics. The ServiceMonitor injects the namespace label so metrics route to the HCP Azure Monitor Workspace. The region and environment labels are added globally via Prometheus external labels.
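
As a rough illustration of the metric filtering described above, the KSM container arguments could be assembled along these lines. This is a minimal sketch, not the PR's actual `resources.go`: the helper name `ksmArgs` and the kubeconfig mount path are hypothetical, and it assumes the standard kube-state-metrics flags `--kubeconfig`, `--resources`, and `--metric-allowlist`.

```go
package main

import (
	"fmt"
	"strings"
)

// ksmArgs builds kube-state-metrics container arguments for one HCP.
// kubeconfigPath is a hypothetical path where the
// service-network-admin-kubeconfig secret would be mounted.
func ksmArgs(kubeconfigPath string, metrics []string) []string {
	return []string{
		// Point KSM at the HCP API server instead of the management cluster.
		"--kubeconfig=" + kubeconfigPath,
		// Only node objects are needed for worker node health.
		"--resources=nodes",
		// Expose only the listed metric families.
		"--metric-allowlist=" + strings.Join(metrics, ","),
	}
}

func main() {
	args := ksmArgs("/etc/hcp/kubeconfig", []string{
		"kube_node_status_condition",
		"kube_node_info",
	})
	for _, a := range args {
		fmt.Println(a)
	}
}
```

Restricting both the resource type and the metric families keeps the per-HCP series count small, which matters when one KSM instance runs per hosted control plane.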

Testing
- Unit tests for readiness, resource builders, label injection, and deletion safety.
- Verified end-to-end on personal dev: metrics visible in Grafana via the HCP Azure Monitor Workspace.
  https://arohcp-dev-c9g7a4fjanb0c4gc.wus3.grafana.azure.com/goto/2KCMG8xvR?o

venkateshsredhat and others added 13 commits June 16, 2026 17:18
…egion label

- Use Server-Side Apply instead of Get-then-Update to avoid spurious
  updates from Kubernetes-defaulted fields
- Guard KSM controller creation on --ksm-image being set
- Use single leader election for both SwiftNIC and KSM controllers
- Check KubeAPIServerAvailable condition for readiness instead of
  reading secrets (no additional RBAC needed)
- Use service-network-admin-kubeconfig for in-cluster HCP API access
- Filter metrics to kube_node_status_condition and kube_node_info only
- Inject azure_region label from HCP spec via ServiceMonitor relabeling
- Add TypeMeta to Deployment and Service for SSA compatibility
- Add liveness/readiness probes and automountServiceAccountToken: false
- Add pod security context (runAsUser/runAsGroup/fsGroup: 65534)
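
The condition-based readiness check mentioned in this commit can be sketched as follows. This is an assumption-laden illustration rather than the PR's code: `Condition` is a simplified stand-in for `metav1.Condition`, and the function name is hypothetical. Real code would more likely call `meta.IsStatusConditionTrue` from `k8s.io/apimachinery/pkg/api/meta` on the HostedControlPlane status conditions.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// kubeAPIServerAvailable reports whether the KubeAPIServerAvailable
// condition is present and set to "True". A missing condition counts
// as not ready.
func kubeAPIServerAvailable(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "KubeAPIServerAvailable" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	ready := []Condition{{Type: "KubeAPIServerAvailable", Status: "True"}}
	pending := []Condition{{Type: "KubeAPIServerAvailable", Status: "False"}}

	fmt.Println(kubeAPIServerAvailable(ready))
	fmt.Println(kubeAPIServerAvailable(pending))
	fmt.Println(kubeAPIServerAvailable(nil))
}
```

Because the check reads only the HostedControlPlane status, the controller needs no access to secrets to decide whether to deploy KSM.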
hypershift v0.1.76 (used by mgmt-agent) requires newer golang.org/x
packages, which go work sync propagates across all workspace modules.
- propagate error from toUnstructured in buildServiceMonitor
- return nil instead of error when KubeAPIServer is unavailable
- reduce RBAC verbs to least-privilege (remove update/delete)
- move dynamic client creation inside KSMImage guard
- fix gci import ordering
- Add unit tests for readiness checks, resource builders, label
  injection, and deletion timestamp behavior
- Add DeletionTimestamp check to prevent recreation loop during
  HCP deletion (GC and controller racing)
- Add README documenting architecture and design decisions
- Add HCP worker node metrics section to docs/monitoring.md
- Update cmd.go description to document both controllers
…xternal label

The azure_region metric relabel on each KSM HCP ServiceMonitor was redundant
since Prometheus external labels apply globally to all remote-written series.
This moves the region label to the Prometheus externalLabels config for both
mgmt and svc clusters, removing it from buildServiceMonitor.
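
To see why a per-ServiceMonitor region relabel is redundant, consider how external labels are merged into each series on remote write. The sketch below is a simplified model, not Prometheus code: the function name and label values are made up, and it assumes the usual Prometheus behaviour that a label already present on the series takes precedence over an external label of the same name.

```go
package main

import (
	"fmt"
	"sort"
)

// withExternalLabels returns a copy of the series labels with the
// external labels added. Labels already on the series are kept as-is.
func withExternalLabels(series, external map[string]string) map[string]string {
	out := make(map[string]string, len(series)+len(external))
	for k, v := range external {
		out[k] = v
	}
	for k, v := range series {
		out[k] = v // series labels override external labels
	}
	return out
}

func main() {
	// Hypothetical label values for illustration only.
	external := map[string]string{"cluster": "mgmt-1", "region": "westus3"}
	series := map[string]string{
		"__name__":  "kube_node_info",
		"namespace": "hcp-namespace",
	}

	merged := withExternalLabels(series, external)
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, merged[k])
	}
}
```

Every remote-written series picks up `region` this way, so only labels that differ per HCP, such as `namespace`, still need to be set per ServiceMonitor.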
SSA only needs create and patch verbs. Remove unused get, list, watch
on deployments, services, and servicemonitors.
- Fix README typo and update to reflect region is a global Prometheus
  external label, not injected per-ServiceMonitor
- Remove TestDeletingHCPStillReportsAvailable which checked field values
  without calling syncHandler
Adds environmentName alongside cluster and region in externalLabels
for mgmt, svc, and opstool Prometheus configs so all remote-written
metrics carry the deployment environment.
Copilot AI review requested due to automatic review settings June 17, 2026 00:56
@openshift-ci

openshift-ci Bot commented Jun 17, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: inbharajmani
Once this PR has been reviewed and has the lgtm label, please assign roivaz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Contributor


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates multiple Go module dependencies and extends observability + management-cluster functionality by adding a new mgmt-agent controller that deploys kube-state-metrics per HostedControlPlane, along with additional Prometheus external labels for better metric attribution.

Changes:

  • Bump golang.org/x/* and several other indirect dependencies across many Go modules.
  • Add region and environment Prometheus externalLabels and update Helm snapshots/fixtures accordingly.
  • Introduce the mgmt-agent “KSM HCP controller” (flag-driven) that reconciles per-HCP kube-state-metrics Deployment/Service/ServiceMonitor and adds required RBAC + Helm wiring.

Reviewed changes

Copilot reviewed 50 out of 76 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tooling/tenant-quota/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/tenant-quota/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/templatize/go.mod | Update go-openapi/google/uber/golang.org/x dependency versions. |
| tooling/secret-sync/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/secret-sync/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/prometheus-rules/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/prometheus-rules/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/olm-bundle-repkg/go.sum | Dependency checksum updates (go-openapi, golang.org/x/*, zap, tools). |
| tooling/olm-bundle-repkg/go.mod | Bump dependencies used by OLM bundle repack tool. |
| tooling/kustoctl/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/kustoctl/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/image-updater/go.sum | Dependency checksum updates (golang.org/x/*, x/mod). |
| tooling/image-updater/go.mod | Bump x/mod and indirect golang.org/x/* versions. |
| tooling/helmtest/go.sum | Dependency checksum updates (go-openapi, golang.org/x/*, zap, tools). |
| tooling/helmtest/go.mod | Bump dependencies used for helm templating tests. |
| tooling/hcpctl/go.mod | Bump hypershift API and related indirect deps for hcpctl. |
| tooling/grafanactl/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/grafanactl/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/cleanup-sweeper/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/cleanup-sweeper/go.mod | Bump indirect golang.org/x/* versions. |
| tooling/aro-hcp-exporter/go.sum | Dependency checksum updates (golang.org/x/*). |
| tooling/aro-hcp-exporter/go.mod | Bump indirect golang.org/x/* versions. |
| test/go.mod | Bump hypershift API + various indirect deps for tests. |
| test-integration/go.mod | Bump hypershift API + various indirect deps for integration tests. |
| sessiongate/go.sum | Dependency checksum updates (go-openapi, hypershift, golang.org/x/*). |
| sessiongate/go.mod | Bump hypershift API + golang.org/x/* and go-openapi deps. |
| observability/prometheus/values-svc.yaml | Add region + environment external labels for svc Prometheus. |
| observability/prometheus/values-opstool.yaml | Add region + environment external labels for opstool Prometheus. |
| observability/prometheus/values-mgmt.yaml | Add region + environment external labels for mgmt Prometheus. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_svc_resources_unset.yaml | Update snapshot with new Prometheus external labels. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_svc_resources.yaml | Update snapshot with new Prometheus external labels. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources_unset.yaml | Update snapshot with new Prometheus external labels. |
| observability/prometheus/testdata/zz_fixture_TestHelmTemplate_helmtest_mgmt_resources.yaml | Update snapshot with new Prometheus external labels. |
| observability/prometheus/deploy/templates/prometheus.yaml | Template now renders region + environment external labels. |
| mgmt-agent/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_mgmt_agent.yaml | Update mgmt-agent Helm snapshot (RBAC + new flag). |
| mgmt-agent/values.yaml | Add Helm values for kube-state-metrics image (ksmImage). |
| mgmt-agent/pkg/controller/ksmhcp/testdata/zz_fixture_TestBuildServiceMonitor.yaml | New fixture for ServiceMonitor rendering. |
| mgmt-agent/pkg/controller/ksmhcp/testdata/zz_fixture_TestBuildService.yaml | New fixture for Service rendering. |
| mgmt-agent/pkg/controller/ksmhcp/testdata/zz_fixture_TestBuildDeployment.yaml | New fixture for Deployment rendering. |
| mgmt-agent/pkg/controller/ksmhcp/resources.go | Build KSM Deployment/Service/ServiceMonitor objects. |
| mgmt-agent/pkg/controller/ksmhcp/controller_test.go | Unit tests for condition check + resource builders. |
| mgmt-agent/pkg/controller/ksmhcp/controller.go | New controller reconciling KSM resources per HostedControlPlane. |
| mgmt-agent/pkg/controller/ksmhcp/README.md | Document new controller behavior and why it exists. |
| mgmt-agent/go.mod | Add dependencies for Hypershift + Prometheus Operator and testutil. |
| mgmt-agent/deploy/templates/deployment.yaml | Add --ksm-image arg to controller container. |
| mgmt-agent/deploy/templates/clusterrole.yaml | Add RBAC rules for HostedControlPlanes + KSM resources. |
| mgmt-agent/cmd/options.go | Wire informers/clients and start KSM controller under leader election. |
| mgmt-agent/cmd/cmd.go | Update CLI help text to describe both controllers. |
| kube-applier/go.mod | Bump go-openapi + hypershift API + etcd/zap/golang.org/x deps. |
| internal/go.mod | Bump hypershift API + go-openapi and other indirect deps. |
| frontend/go.sum | Dependency checksum updates (hypershift API, golang.org/x/*). |
| frontend/go.mod | Bump hypershift API + add google/pprof indirect + golang.org/x/* deps. |
| fleet/go.mod | Bump go-openapi + hypershift API + golang.org/x/* deps. |
| docs/monitoring.md | Document HCP worker node metrics via per-HCP kube-state-metrics. |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_svc_1_arohcp_monitor.yaml | Update snapshot with new Prometheus external labels. |
| dev-infrastructure/zz_fixture_TestHelmTemplate_dev_westus3_mgmt_1_arohcp_monitor.yaml | Update snapshot with new Prometheus external labels. |
| dev-infrastructure/scripts/recreate-system-pool/go.sum | Dependency checksum updates (go-openapi, golang.org/x/*). |
| dev-infrastructure/scripts/recreate-system-pool/go.mod | Bump go-openapi + golang.org/x/* deps. |
| dev-infrastructure/scripts/recreate-broken-pools/go.sum | Dependency checksum updates (go-openapi, golang.org/x/*). |
| dev-infrastructure/scripts/recreate-broken-pools/go.mod | Bump go-openapi + golang.org/x/* deps. |
| dev-infrastructure/scripts/cleanup-pko-resources/go.sum | Dependency checksum updates (go-openapi, golang.org/x/*). |
| dev-infrastructure/scripts/cleanup-pko-resources/go.mod | Bump go-openapi + golang.org/x/* deps. |
| backend/go.mod | Bump hypershift API + go-openapi + golang.org/x/* deps. |
| admin/server/go.mod | Bump hypershift API + go-openapi + golang.org/x/* deps. |
| admin/client/go.sum | Dependency checksum updates (golang.org/x/*). |
| admin/client/go.mod | Bump golang.org/x/* deps. |
Comments suppressed due to low confidence (1)

observability/prometheus/deploy/templates/prometheus.yaml:1

  • The template now assumes .Values.prometheusSpec.externalLabels.region and .Values.prometheusSpec.externalLabels.environment always exist. If a consumer renders this chart with older values (or omits these keys), Helm will render <no value> into the manifest, which is hard to detect and can pollute label sets. Consider either (a) using required to fail fast with a clear error if the keys are missing, or (b) guarding/defaulting these fields so the labels are omitted or empty when not provided.
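
The missing-key concern can be reproduced with Go's `text/template`, which Helm templates are built on. The sketch below uses plain `text/template` rather than Helm itself (Helm's engine configures its own missing-key handling, so its exact output may differ); the template text and the `render` helper are hypothetical. It contrasts the default behaviour with the fail-fast `missingkey=error` option.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// tmplText references a key that the values below do not provide.
const tmplText = `region: {{ .externalLabels.region }}`

// render executes tmplText with the given text/template option string,
// for example "missingkey=default" or "missingkey=error".
func render(option string, values map[string]any) (string, error) {
	t, err := template.New("labels").Option(option).Parse(tmplText)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, values); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Values with the externalLabels map present but no "region" key.
	values := map[string]any{"externalLabels": map[string]any{}}

	// Default mode silently renders a placeholder for the missing key.
	out, err := render("missingkey=default", values)
	fmt.Printf("default: %q, err=%v\n", out, err)

	// Strict mode stops execution with an error instead.
	out, err = render("missingkey=error", values)
	fmt.Printf("strict:  %q, err=%v\n", out, err)
}
```

In plain `text/template`, the default mode yields `region: <no value>` with no error, while the strict mode returns an error naming the missing key, which is the fail-fast behaviour option (a) in the comment asks for.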

Comment thread mgmt-agent/deploy/templates/deployment.yaml
Comment thread mgmt-agent/cmd/options.go
Use label-filtered informer factories for Deployments, Services, and
ServiceMonitors so the controller only caches resources it manages
(app.kubernetes.io/name=kube-state-metrics-hcp), reducing memory
overhead on busy management clusters.
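
The effect of label-filtered informers on cache size can be modelled in a few lines. This is only a simulation with hypothetical types, not client-go code: with client-go the filter is typically applied server-side by passing `informers.WithTweakListOptions` to `informers.NewSharedInformerFactoryWithOptions` and setting `ListOptions.LabelSelector`, so unmanaged objects never reach the agent at all.

```go
package main

import "fmt"

// object is a hypothetical, minimal view of a Kubernetes resource.
type object struct {
	Kind   string
	Name   string
	Labels map[string]string
}

const (
	managedLabelKey   = "app.kubernetes.io/name"
	managedLabelValue = "kube-state-metrics-hcp"
)

// managed reports whether the object carries the controller's label.
func managed(o object) bool {
	return o.Labels[managedLabelKey] == managedLabelValue
}

// cacheManaged keeps only the objects the controller manages,
// mimicking what a label-filtered informer would hold in memory.
func cacheManaged(all []object) []object {
	var kept []object
	for _, o := range all {
		if managed(o) {
			kept = append(kept, o)
		}
	}
	return kept
}

func main() {
	all := []object{
		{Kind: "Deployment", Name: "kube-state-metrics", Labels: map[string]string{managedLabelKey: managedLabelValue}},
		{Kind: "Deployment", Name: "kube-apiserver", Labels: map[string]string{"app": "kube-apiserver"}},
		{Kind: "Service", Name: "etcd-client"},
	}
	kept := cacheManaged(all)
	fmt.Printf("cached %d of %d objects\n", len(kept), len(all))
}
```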
3 participants