pkg/operator/encryption/kms/health: add KMS plugin health checker#2298
pkg/operator/encryption/kms/health: add KMS plugin health checker#2298ibihim wants to merge 1 commit into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR adds a context-aware KMS plugin health prober: it parses socket names for key IDs, builds per-plugin KMSv2 gRPC services, runs concurrent periodic probes via a prober that maps responses to health reports, and wires graceful shutdown via signal-aware contexts. ChangesKMS Health Prober Feature
🎯 3 (Moderate) | ⏱️ ~25 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (2 errors, 1 warning)
✅ Passed checks (12 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ibihim The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
ardaguclu
left a comment
There was a problem hiding this comment.
This PR is nicely written without any AI generated redundant code. Just dropped a comment. Other than that looks good to me. Thank you.
|
|
||
| service, err := k8senvelopekmsv2.NewGRPCService(ctx, socket, providerName, timeout) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("dial KMS plugin at %q: %w", socket, err) |
There was a problem hiding this comment.
In which circumstances NewGPRCService returns errors?. I'm asking to understand we should return error here or, log the error and continue for others?.
It sounds like returning error is the right choice.
There was a problem hiding this comment.
It seems to fail if the UDS path is malformed (which is a good reason, but is noop due to our validation):
grpc.Dial is deprecated, but it is what is used here and the only thing that could fail, besides the util.ParseEndpoint. All other error paths are disabled at its call site, mostly by the hardcoded insecure.NewCredentials(). There is no connection test involved, the socket is first touched at the Status RPC, where a dead plugin surfaces as a per-plugin error.
There was a problem hiding this comment.
There was a problem hiding this comment.
I will update the logging.
There was a problem hiding this comment.
@tjungblu, yes, but did you check my test? Those cases never ever happen.
There was a problem hiding this comment.
are you sure?
grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
will definitely check whether the other party responds before returning. It may timeout after some time however.
| return plugins, nil | ||
| } | ||
|
|
||
| func buildRESTConfig(kubeconfig string) (*rest.Config, error) { |
There was a problem hiding this comment.
do we need buildRESTConfig ? I think that BuildConfigFromFlags already falls back to in-cluster config when kubeconfig is empty.
| return nil, err | ||
| } | ||
|
|
||
| service, err := k8senvelopekmsv2.NewGRPCService(ctx, socket, providerName, timeout) |
There was a problem hiding this comment.
should we make providerName unique per plugin ?
| statusError = "error" | ||
| ) | ||
|
|
||
| type PluginHealthReport struct { |
There was a problem hiding this comment.
does it have to be exported ?
| type PluginHealthReport struct { | ||
| // KeyID is the controller's sequential key id; KEKID is the KMS provider's | ||
| // encryption key id. Distinct identifiers, easy to confuse. | ||
| KeyID string `json:"keyID"` |
There was a problem hiding this comment.
do we need json tags on this struct ?
There was a problem hiding this comment.
With the API change, rather not. Without it, yes.
There was a problem hiding this comment.
maybe we can get Bryce to merge it today before shiftweek, let's see
|
|
||
| // checkStatus never returns an error: a failed probe is encoded as a report | ||
| // with Status "error" so the caller always gets one entry per plugin. | ||
| func (c *checker) checkStatus(ctx context.Context) []PluginHealthReport { |
There was a problem hiding this comment.
nit: feels more like probeStatus or probeStatuses or probeAll
| service kmsservice.Service | ||
| } | ||
|
|
||
| type checker struct { |
| func (c *checker) checkStatus(ctx context.Context) []PluginHealthReport { | ||
| reports := make([]PluginHealthReport, 0, len(c.plugins)) | ||
|
|
||
| // We could fan out if performance is an issue. |
There was a problem hiding this comment.
we might reconsider this decision one the timeout is 30s or so then the lag would be 60s.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/operator/encryption/kms/health/prober.go (1)
61-72:⚠️ Potential issue | 🟠 MajorAdd a defensive
resp == nilcheck inprobeAllto prevent a possible panicIn
pkg/operator/encryption/kms/health/prober.go(lines 61-72),resp.Healthzis accessed whenerr == nilwithout checkingresp != nil. Thekmsservice.Serviceinterface only definesStatus(ctx) (*StatusResponse, error)and doesn’t guarantee a non-nil response onerr == nil, so a(nil, nil)contract violation would panic.🛡️ Proposed defensive fix
resp, err := plugin.service.Status(ctx) switch { case err != nil: report.Status = statusError report.Detail = err.Error() + case resp == nil: + report.Status = statusError + report.Detail = "nil response from Status call" case resp.Healthz == healthzOK: report.Status = statusHealthy report.KEKID = resp.KeyID default: report.Status = statusUnhealthy report.Detail = resp.Healthz }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/operator/encryption/kms/health/prober.go` around lines 61 - 72, In probeAll, defend against a nil StatusResponse by checking resp == nil after calling plugin.service.Status(ctx); if resp is nil set report.Status = statusError and set report.Detail to a clear message (e.g. "nil StatusResponse") instead of accessing resp.Healthz or resp.KeyID; keep the existing branches for err != nil, resp.Healthz == healthzOK (setting statusHealthy and report.KEKID = resp.KeyID), and the default unhealthy branch, but ensure the nil check occurs before any dereference of resp to prevent a panic when plugin.service.Status returns (nil, nil).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/operator/encryption/kms/health/prober.go`:
- Around line 61-72: In probeAll, defend against a nil StatusResponse by
checking resp == nil after calling plugin.service.Status(ctx); if resp is nil
set report.Status = statusError and set report.Detail to a clear message (e.g.
"nil StatusResponse") instead of accessing resp.Healthz or resp.KeyID; keep the
existing branches for err != nil, resp.Healthz == healthzOK (setting
statusHealthy and report.KEKID = resp.KeyID), and the default unhealthy branch,
but ensure the nil check occurs before any dereference of resp to prevent a
panic when plugin.service.Status returns (nil, nil).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 756fa7e4-1ebf-468f-acf4-87a3f63a3ab1
📒 Files selected for processing (3)
pkg/operator/encryption/kms/health/cmd.gopkg/operator/encryption/kms/health/prober.gopkg/operator/encryption/kms/health/prober_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/operator/encryption/kms/health/cmd.go
|
|
||
| var wg sync.WaitGroup | ||
| for i, plugin := range p.plugins { | ||
| wg.Go(func() { |
There was a problem hiding this comment.
that is a clean usage of go routines. didn't know you can use wg.Go
There was a problem hiding this comment.
I never used it, by I heard already of it, so here I am giving it a try.
|
LGTM let's fix |
|
oh, and let's squash the commits. |
| LastChecked: p.now(), | ||
| } | ||
|
|
||
| resp, err := plugin.service.Status(ctx) |
There was a problem hiding this comment.
just realized that resp is a pointer and can be nil. should we add a check ?
|
|
||
| // Empty kubeconfig falls back to the in-cluster config (service account | ||
| // token + KUBERNETES_SERVICE_HOST), which is the deployed path. | ||
| cfg, err := clientcmd.BuildConfigFromFlags("", o.Kubeconfig) |
There was a problem hiding this comment.
nit: Ideally this should be in validatecomplete function not run. But this is definitely not a blocker.
| func buildPlugins(ctx context.Context, sockets []string, timeout time.Duration) ([]pluginClient, error) { | ||
| plugins := make([]pluginClient, 0, len(sockets)) | ||
|
|
||
| for _, socket := range sockets { |
There was a problem hiding this comment.
just realized that we don't check for duplicates, in theory a "user" can pass --kms-sockets kms-1.sock,kms-1.sock - we could add a check in a new pr.
5100a46 to
92a2354
Compare
Add Checker, which dials each co-located KMSv2 plugin's UDS Status endpoint and reports per-plugin health.
92a2354 to
812dbbf
Compare
|
LGTM |
|
@ibihim: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What
Adds a checker to the kms-health-reporter that probes each co-located KMSv2 plugin's Status endpoint and turns the responses into per-plugin health reports (healthy / unhealthy / error, one entry per plugin).
Why
We want to track the state of all plugins. Currently we log it, later we add it to the status of each apiserver operator's CR status.
Note
Compare to #2193
Summary by CodeRabbit
New Features
Bug Fixes
Tests