pkg/operator/encryption/kms/health: add KMS plugin health checker #2298

Open
ibihim wants to merge 1 commit into openshift:master from ibihim:CNTRLPLANE-3234-health-reporter-reader

@ibihim

@ibihim ibihim commented Jun 11, 2026

Contributor

What

Adds a checker to the kms-health-reporter that probes each co-located KMSv2 plugin's Status endpoint and turns the responses into per-plugin health reports (healthy / unhealthy / error, one entry per plugin).

Why

We want to track the state of all plugins. Currently we only log it; later we will add it to the status of each apiserver operator's CR.

Note

Compare to #2193

Summary by CodeRabbit

  • New Features

    • Continuous, cancellable KMS plugin health monitoring with periodic, jittered probes.
    • Per-plugin health reports including status, last-checked time, and key identifiers.
  • Bug Fixes

    • Stricter KMS socket ID parsing and clearer initialization error handling.
    • More reliable kubeconfig/in-cluster config resolution.
  • Tests

    • Added tests for concurrent probe execution and socket ID extraction.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR adds a context-aware KMS plugin health prober: it parses socket names for key IDs, builds per-plugin KMSv2 gRPC services, runs concurrent periodic probes via a prober that maps responses to health reports, and wires graceful shutdown via signal-aware contexts.

Changes

KMS Health Prober Feature

| Layer | File(s) | Summary |
| --- | --- | --- |
| Socket key ID extraction and validation | pkg/operator/encryption/kms/health/cmd.go, pkg/operator/encryption/kms/health/prober_test.go | Socket paths are parsed via regex to capture numeric key IDs; keyIDFromSocket extracts the captured ID with error handling; tests verify extraction from valid socket names and error handling for non-matching patterns. |
| Command context propagation and signal handling | pkg/operator/encryption/kms/health/cmd.go | Cobra command context flows to runner (RunE -> o.run(ctx)), runner accepts context.Context, and setupSignalContext wraps context with SIGTERM/SIGINT cancellation; kubeconfig loading uses clientcmd.BuildConfigFromFlags. |
| Plugin initialization and gRPC service creation | pkg/operator/encryption/kms/health/cmd.go | buildPlugins accepts ctx, parses each socket's key ID, and creates per-plugin KMSv2 gRPC services with unique service names; parsing or service-creation errors return early with socket-specific context. |
| Probe loop orchestration and execution | pkg/operator/encryption/kms/health/cmd.go | Integrates plugin setup and prober into a wait.JitterUntilWithContext loop that repeatedly calls prober.probeAll(ctx), logs computed health reports, and respects context cancellation for graceful shutdown. |
| Health prober implementation | pkg/operator/encryption/kms/health/prober.go | Introduces prober and related internal types; probeAll(ctx) fans out concurrent probes to each plugin, timestamps results, and maps service responses/errors into pluginHealthReport entries. |
| Test infrastructure and prober tests | pkg/operator/encryption/kms/health/prober_test.go | Adds fakeService and blockingService test doubles; TestProber_ProbeAll verifies mapping of healthy/error/unhealthy responses and timestamps; TestProber_ProbeAllFansOut asserts probes run concurrently. |
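The socket key ID extraction summarized above could look roughly like the following. This is a sketch under assumptions: the pattern kms-<number>.sock is inferred from the kms-1.sock / kms-42.sock names in the pre-merge checks, the use of filepath.Base is a guess, and the directory in main is a placeholder.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// socketPattern is an assumed naming scheme (kms-<number>.sock); the PR's
// actual regex is not shown in the conversation.
var socketPattern = regexp.MustCompile(`^kms-(\d+)\.sock$`)

// keyIDFromSocket returns the numeric key ID captured from the socket's
// file name, or an error if the name does not match the pattern.
func keyIDFromSocket(socket string) (string, error) {
	m := socketPattern.FindStringSubmatch(filepath.Base(socket))
	if m == nil {
		return "", fmt.Errorf("socket %q does not match %s", socket, socketPattern)
	}
	return m[1], nil
}

func main() {
	// "/example/dir" is a placeholder directory, not a path from the PR.
	for _, s := range []string{"/example/dir/kms-42.sock", "kms-1.sock", "other.sock"} {
		id, err := keyIDFromSocket(s)
		fmt.Printf("%s -> id=%q err=%v\n", s, id, err)
	}
}
```

Anchoring the pattern at both ends is one way to get the stricter parsing the summary describes: a name such as kms-1.sock.bak is rejected instead of being partially matched.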

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (2 errors, 1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Stable And Deterministic Test Names | ❌ Error | FAIL: prober_test.go uses t.Run(tt.socket, ...) so subtest titles include specific socket paths like kms-1.sock/kms-42.sock (identifiers), making titles brittle. | Change subtest titles to static descriptions (e.g., t.Run("extracts keyID from kms socket", ...)) and use tt.socket only inside the test body for assertions. |
| No-Sensitive-Data-In-Logs | ❌ Error | cmd.go logs klog.InfoS("kms-health-reporter starting","config", o); options o includes NodeName (internal hostname) and socket/kubeconfig strings. | Avoid logging full config; omit NodeName/Kubeconfig/KMSSockets or redact them before passing to klog.InfoS. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding a KMS plugin health checker to the health package. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Test Structure And Quality | ✅ Passed | PR adds only standard Go unit tests (checker_test.go) using testing—no Ginkgo/BeforeEach/AfterEach, no cluster resources; assertions include have/want details. |
| Microshift Test Compatibility | ✅ Passed | PR #2298 only adds/updates Go unit tests in pkg/operator/encryption/kms/health (uses testing.T); no new Ginkgo e2e tests referencing MicroShift-incompatible APIs/resources. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR adds KMS health checker/prober Go unit tests; repository contains no Ginkgo (no onsi/ginkgo usage), so no new Ginkgo e2e multi-node/SNO assumptions to flag. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR #2298 changes only pkg/operator/encryption/kms/health Go health-checker code; searches in changed files for node affinity/anti-affinity, topology spread, replicas, and CP nodeSelectors found no... |
| Ote Binary Stdout Contract | ✅ Passed | Checked pkg/operator/encryption/kms/health/cmd.go and prober*.go: no main/init/TestMain/RunSpecs, no fmt.Print/Println/os.Stdout, and no klog/log.SetOutput(os.Stdout). |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR #2298 adds only Go unit tests (no Ginkgo It/Describe/Context/When); no IPv4 or external connectivity e2e test assumptions to flag. |
| No-Weak-Crypto | ✅ Passed | Scanned pkg/operator/encryption/kms/health/*.go for MD5/SHA1/DES/RC4/3DES/Blowfish/ECB and crypto/md5/sha1/des/rc4/blowfish; none found. No weak non-constant-time secret/token compares detected in... |
| Container-Privileges | ✅ Passed | PR #2298 only changes Go files (checker.go, checker_test.go, cmd.go) under pkg/operator/encryption/kms/health; no container/K8s manifests with privileged/host*/SYS_ADMIN/allowPrivilegeEscalation ar... |

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from ardaguclu and dgrisonnet June 11, 2026 15:54
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ibihim
Once this PR has been reviewed and has the lgtm label, please assign dgrisonnet for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ardaguclu ardaguclu left a comment

Member

This PR is nicely written without any AI generated redundant code. Just dropped a comment. Other than that looks good to me. Thank you.


service, err := k8senvelopekmsv2.NewGRPCService(ctx, socket, providerName, timeout)
if err != nil {
	return nil, fmt.Errorf("dial KMS plugin at %q: %w", socket, err)

Member

Under which circumstances does NewGRPCService return an error? I'm asking to understand whether we should return the error here, or log it and continue with the other plugins.
It sounds like returning the error is the right choice.

Contributor Author

It seems to fail only if the UDS path is malformed (a good reason to fail, but a no-op here because of our validation):

ibihim@c0f1eb1

grpc.Dial is deprecated, but it is what is used here, and apart from util.ParseEndpoint it is the only call that could fail. All other error paths are disabled at its call site, mostly by the hardcoded insecure.NewCredentials(). No connection test is involved: the socket is first touched by the Status RPC, where a dead plugin surfaces as a per-plugin error.

Contributor

Contributor Author

I will update the logging.

Contributor Author

@tjungblu, yes, but did you check my test? Those cases never ever happen.

Contributor

are you sure?

  grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),

will definitely check whether the other party responds before returning. It may time out after some time, however.

Comment thread pkg/operator/encryption/kms/health/cmd.go
	return plugins, nil
}

func buildRESTConfig(kubeconfig string) (*rest.Config, error) {

Contributor

do we need buildRESTConfig? I think that BuildConfigFromFlags already falls back to in-cluster config when kubeconfig is empty.

	return nil, err
}

service, err := k8senvelopekmsv2.NewGRPCService(ctx, socket, providerName, timeout)

Contributor

should we make providerName unique per plugin?

	statusError = "error"
)

type PluginHealthReport struct {

Contributor

does it have to be exported?

type PluginHealthReport struct {
	// KeyID is the controller's sequential key id; KEKID is the KMS provider's
	// encryption key id. Distinct identifiers, easy to confuse.
	KeyID string `json:"keyID"`

Contributor

do we need json tags on this struct?

Contributor Author

With the API change, rather not. Without it, yes.

Contributor

maybe we can get Bryce to merge it today before shiftweek, let's see


// checkStatus never returns an error: a failed probe is encoded as a report
// with Status "error" so the caller always gets one entry per plugin.
func (c *checker) checkStatus(ctx context.Context) []PluginHealthReport {

Contributor

nit: feels more like probeStatus or probeStatuses or probeAll

	service kmsservice.Service
}

type checker struct {

Contributor

nit: prober

func (c *checker) checkStatus(ctx context.Context) []PluginHealthReport {
	reports := make([]PluginHealthReport, 0, len(c.plugins))

	// We could fan out if performance is an issue.

Contributor

we might reconsider this decision: once the timeout is 30s or so, the lag would be 60s.

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/operator/encryption/kms/health/prober.go (1)

61-72: ⚠️ Potential issue | 🟠 Major

Add a defensive resp == nil check in probeAll to prevent a possible panic

In pkg/operator/encryption/kms/health/prober.go (lines 61-72), resp.Healthz is accessed when err == nil without checking resp != nil. The kmsservice.Service interface only defines Status(ctx) (*StatusResponse, error) and doesn’t guarantee a non-nil response on err == nil, so a (nil, nil) contract violation would panic.

🛡️ Proposed defensive fix
 		resp, err := plugin.service.Status(ctx)
 		switch {
 		case err != nil:
 			report.Status = statusError
 			report.Detail = err.Error()
+		case resp == nil:
+			report.Status = statusError
+			report.Detail = "nil response from Status call"
 		case resp.Healthz == healthzOK:
 			report.Status = statusHealthy
 			report.KEKID = resp.KeyID
 		default:
 			report.Status = statusUnhealthy
 			report.Detail = resp.Healthz
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/operator/encryption/kms/health/prober.go` around lines 61 - 72, In
probeAll, defend against a nil StatusResponse by checking resp == nil after
calling plugin.service.Status(ctx); if resp is nil set report.Status =
statusError and set report.Detail to a clear message (e.g. "nil StatusResponse")
instead of accessing resp.Healthz or resp.KeyID; keep the existing branches for
err != nil, resp.Healthz == healthzOK (setting statusHealthy and report.KEKID =
resp.KeyID), and the default unhealthy branch, but ensure the nil check occurs
before any dereference of resp to prevent a panic when plugin.service.Status
returns (nil, nil).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 756fa7e4-1ebf-468f-acf4-87a3f63a3ab1

📥 Commits

Reviewing files that changed from the base of the PR and between 81b6a10 and 8a0ca10.

📒 Files selected for processing (3)
  • pkg/operator/encryption/kms/health/cmd.go
  • pkg/operator/encryption/kms/health/prober.go
  • pkg/operator/encryption/kms/health/prober_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/operator/encryption/kms/health/cmd.go


var wg sync.WaitGroup
for i, plugin := range p.plugins {
	wg.Go(func() {

Contributor

that is a clean usage of goroutines. didn't know you can use wg.Go

Contributor Author

I had never used it, but I had heard of it, so here I am giving it a try.
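For context, sync.WaitGroup.Go was added in Go 1.25 and wraps the familiar Add(1) / go / defer Done() sequence. The sketch below shows that longer, equivalent form so it also builds on older toolchains; the probeAll signature and the string-based plugin and report types are simplifications assumed for illustration, not the PR's types.

```go
package main

import (
	"fmt"
	"sync"
)

// probeAll runs probe once per plugin concurrently and returns the results
// in plugin order. Each goroutine writes only to its own slice index, so no
// mutex is needed.
func probeAll(plugins []string, probe func(string) string) []string {
	results := make([]string, len(plugins))

	var wg sync.WaitGroup
	for i, plugin := range plugins {
		// Long form of wg.Go(func() { ... }).
		wg.Add(1)
		go func(i int, plugin string) {
			defer wg.Done()
			results[i] = probe(plugin)
		}(i, plugin)
	}
	wg.Wait()

	return results
}

func main() {
	reports := probeAll([]string{"kms-1", "kms-2"}, func(name string) string {
		return name + ": healthy"
	})
	fmt.Println(reports)
}
```

Writing into a pre-sized slice by index keeps one entry per plugin in a stable order, even though the probes finish in arbitrary order.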

@p0lyn0mial

Contributor

LGTM

let's fix ci/prow/verify-deps before merging.

@p0lyn0mial

Contributor

oh, and let's squash the commits.

	LastChecked: p.now(),
}

resp, err := plugin.service.Status(ctx)

Contributor

just realized that resp is a pointer and can be nil. should we add a check?


// Empty kubeconfig falls back to the in-cluster config (service account
// token + KUBERNETES_SERVICE_HOST), which is the deployed path.
cfg, err := clientcmd.BuildConfigFromFlags("", o.Kubeconfig)

@ardaguclu ardaguclu Jun 12, 2026

Member

nit: Ideally this should live in the validate/complete function, not in run. But this is definitely not a blocker.

func buildPlugins(ctx context.Context, sockets []string, timeout time.Duration) ([]pluginClient, error) {
	plugins := make([]pluginClient, 0, len(sockets))

	for _, socket := range sockets {

Contributor

just realized that we don't check for duplicates: in theory a "user" can pass --kms-sockets kms-1.sock,kms-1.sock. We could add a check in a new PR.
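One possible shape for such a check, as a minimal sketch: firstDuplicate is a hypothetical helper, and where it would be called (validation or plugin setup) is left open.

```go
package main

import "fmt"

// firstDuplicate returns the first socket that appears more than once in the
// list, comparing the strings exactly as given. Hypothetical helper.
func firstDuplicate(sockets []string) (string, bool) {
	seen := make(map[string]struct{}, len(sockets))
	for _, s := range sockets {
		if _, ok := seen[s]; ok {
			return s, true
		}
		seen[s] = struct{}{}
	}
	return "", false
}

func main() {
	if dup, ok := firstDuplicate([]string{"kms-1.sock", "kms-1.sock"}); ok {
		fmt.Printf("duplicate socket %q\n", dup)
	}
}
```

Because the comparison is on raw strings, two different spellings of the same path would not be caught; normalizing paths first would be a separate decision.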

@ibihim ibihim force-pushed the CNTRLPLANE-3234-health-reporter-reader branch 2 times, most recently from 5100a46 to 92a2354 Compare June 12, 2026 19:00
Add Checker, which dials each co-located KMSv2 plugin's UDS Status
endpoint and reports per-plugin health.
@ibihim ibihim force-pushed the CNTRLPLANE-3234-health-reporter-reader branch from 92a2354 to 812dbbf Compare June 12, 2026 19:02
@ardaguclu

Member

LGTM

@openshift-ci

openshift-ci Bot commented Jun 12, 2026

Contributor

@ibihim: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
