OCPBUGS-86798: spot MHC not created when autoRepair=true and ignition endpoint not reached#8645
OCPBUGS-86798: spot MHC not created when autoRepair=true and ignition endpoint not reached#8645dpateriya wants to merge 1 commit into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR decouples spot instance health checks from the auto-repair gating logic. The Reconcile method now creates or updates a spot-specific MachineHealthCheck early based solely on whether spot instances are enabled, and deletes it if spot is disabled. The previous conditional block tied to auto-repair and ignition-endpoint status was removed. A new table-driven test verifies spot MHC behavior across combinations of spot enablement, auto-repair setting, and ignition endpoint readiness. Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: dpateriya The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@dpateriya: This pull request references Jira Issue OCPBUGS-86798, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
be91e33 to
a0f1b6d
Compare
|
/jira refresh |
|
@dpateriya: This pull request references Jira Issue OCPBUGS-86798, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/nodepool/capi_test.go`:
- Around line 3301-3334: The test case name strings in the table of cases (the
struct entries that include fields like name, autoRepair, ignitionReached,
spotEnabled, expectSpotMHC, expectRegularMHC, expectAutoRepairStatus) do not
follow the repo policy; update each name to the required format starting with
"When ..." and then "it should ..." (e.g. "When spot enabled and autoRepair true
and ignition NOT reached it should create spot MHC and not create regular MHC"),
preserving the rest of the case fields and semantics; change only the name
values in the test cases inside capi_test.go so they conform to the "When ... it
should ..." pattern.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: f9c051ee-dacb-4f42-a86c-f7ad5906603b
📒 Files selected for processing (2)
hypershift-operator/controllers/nodepool/capi.gohypershift-operator/controllers/nodepool/capi_test.go
|
@dpateriya: This pull request references Jira Issue OCPBUGS-86798, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
a0f1b6d to
1a9eec3
Compare
|
Now I have all the information needed. Let me compile the final report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryFive CI jobs failed on PR #8645. Three are caused by the PR's code (Lint, Gitlint, Verify) and two are pre-existing CI infrastructure issues unrelated to the PR (okd-scos-images, images). The PR-caused failures stem from: (1) the new test file Root CausePR-caused failures (3 jobs — Lint, Gitlint, Verify):
Infrastructure failures (2 jobs — okd-scos-images, images):
RecommendationsTo fix the PR-caused failures:
No action needed for the Prow image jobs — the Evidence
|
jparrill
left a comment
There was a problem hiding this comment.
Dropped some comments. Thanks!
Note (not in the diff): While reviewing the spot MHC lifecycle I noticed that delete() in nodepool_controller.go:527 only cleans up the regular MHC (capi.machineHealthCheck()). The spot MHC (capi.spotMachineHealthCheck()) is not deleted — and there are no owner references on it either, so it gets orphaned in the control plane namespace when a spot-enabled NodePool is deleted.
It's harmless in practice (the namespace is GC'd with the HostedCluster), but could you file a follow-up to add the cleanup there?
… endpoint not reached OCPBUGS-86798 The spot-specific MachineHealthCheck was blocked by the autoRepair/ignition gate because it was reconciled after the gate's early return. Move spot MHC reconciliation before the gate so it always runs when spot is enabled, regardless of autoRepair status or ignition endpoint reachability. Co-authored-by: Cursor <[email protected]>
1a9eec3 to
3b7b333
Compare
|
Addressed all feedback:
Follow-up issues filed:
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8645 +/- ##
=======================================
Coverage 41.66% 41.66%
=======================================
Files 758 758
Lines 93929 93928 -1
=======================================
+ Hits 39135 39138 +3
+ Misses 52046 52043 -3
+ Partials 2748 2747 -1
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
@dpateriya: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
autoRepair: trueandReachedIgnitionEndpointis FalseProblem
The spot MHC block was placed after the autoRepair gate in
CAPI.Reconcile(). WhenautoRepair: trueand the ignition endpoint hasn't been reached yet, the function returns early viareturn nilat line 158, preventing the spot MHC from being created.This creates a permanent deadlock when a spot instance fails to provision:
PendingwithInstanceProvisionFailedReachedIgnitionEndpointstaysFalseforeverautoRepair: true+ ignition not reached →return nil→ spot MHC never createdNodeStartupTimeout→ no remediation → Machine stuck foreverFix
Move the spot MHC reconciliation block before the autoRepair gate. The spot MHC is independent of autoRepair — it serves as a safety net for spot instance failures using
maxUnhealthy: 100%and a 20-minuteNodeStartupTimeout.Reproduction
With spot capacity unavailable, the Machine stays
Pendingindefinitely and no<nodepool>-spotMHC is created.Test plan
TestSpotMHCCreatedIndependentlyOfAutoRepairIgnitionGatewith 4 cases:TestCAPIReconciletests pass (no regression)TestReconcileMachineHealthChecktests passnodepoolpackage tests passgo vetcleanSummary by CodeRabbit
Bug Fixes
Tests