OCPBUGS-84368: add safe-to-evict annotation and remove tolerations to fix autoscaler scale-down and drain loop#8338

Open
sdminonne wants to merge 1 commit into
openshift:mainfrom
sdminonne:fix/kas-connection-checker-pdb-safe-to-evict

Conversation

@sdminonne

@sdminonne sdminonne commented Apr 26, 2026

Contributor

Summary

Problem

The kas-connection-checker deployment runs in the kube-system namespace without a PDB or safe-to-evict annotation. The cluster autoscaler treats pods in kube-system without a PDB as system-critical pods that cannot be evicted during scale-down, blocking node draining on underutilized nodes. See also the autoscaler system drainability rule.
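The rule described above can be modeled roughly as follows. This is a minimal sketch under stated assumptions, not the cluster autoscaler's actual code: the `pod` struct and `blocksScaleDown` function are hypothetical names that only capture the three inputs the description mentions (namespace, PDB coverage, and the safe-to-evict annotation).

```go
package main

import "fmt"

// safeToEvictKey is the annotation key named in the PR description.
const safeToEvictKey = "cluster-autoscaler.kubernetes.io/safe-to-evict"

// pod is a hypothetical stand-in holding only the fields this sketch needs.
type pod struct {
	Namespace   string
	Annotations map[string]string
	HasPDB      bool // true if some PodDisruptionBudget selects this pod
}

// blocksScaleDown is a simplified model of the behaviour described above:
// a kube-system pod without a PDB blocks node scale-down unless it is
// explicitly annotated as safe to evict.
func blocksScaleDown(p pod) bool {
	if p.Annotations[safeToEvictKey] == "true" {
		return false // explicit opt-in bypasses the kube-system rule
	}
	return p.Namespace == "kube-system" && !p.HasPDB
}

func main() {
	checker := pod{Namespace: "kube-system"}
	fmt.Println(blocksScaleDown(checker)) // true

	checker.Annotations = map[string]string{safeToEvictKey: "true"}
	fmt.Println(blocksScaleDown(checker)) // false
}
```

In this model the annotation alone is enough to unblock scale-down, which is the property the fix relies on.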

Additionally, the deployment used a blanket {Effect: NoSchedule, Operator: Exists} toleration that matched the cordon taint (node.kubernetes.io/unschedulable:NoSchedule), causing replacement pods to be scheduled back onto cordoned nodes during drain — creating a destructive evict-reschedule loop.
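To illustrate why the blanket toleration matched the cordon taint, here is a small sketch of toleration matching. The `taint` and `toleration` structs are simplified stand-ins for the Kubernetes API types, and `tolerates` is a hypothetical helper that follows the documented matching rules (an empty key with `Exists` matches every key; an empty effect matches every effect). It is not the scheduler's implementation.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes taint and toleration types.
type taint struct{ Key, Value, Effect string }
type toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether t matches tn under the simplified rules above.
func tolerates(t toleration, tn taint) bool {
	if t.Effect != "" && t.Effect != tn.Effect {
		return false
	}
	if t.Key != "" && t.Key != tn.Key {
		return false
	}
	switch t.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return t.Value == tn.Value
	}
	return false
}

func main() {
	// The taint applied to a cordoned node, as quoted in the description.
	cordon := taint{Key: "node.kubernetes.io/unschedulable", Effect: "NoSchedule"}

	// The blanket toleration the deployment used before this PR.
	blanket := toleration{Operator: "Exists", Effect: "NoSchedule"}
	fmt.Println(tolerates(blanket, cordon)) // true

	// A NoExecute toleration for not-ready does not match the cordon taint.
	notReady := toleration{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute"}
	fmt.Println(tolerates(notReady, cordon)) // false
}
```

Because the blanket toleration carries no key, it matches any `NoSchedule` taint, including the one set by cordoning.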

Solution

  1. Add cluster-autoscaler.kubernetes.io/safe-to-evict: "true" annotation to the pod template. This bypasses the autoscaler's system drainability rule for kube-system pods, allowing nodes to be scaled down. The annotation is preferred over a PDB because it doesn't block scale-to-zero operations.

  2. Remove all custom tolerations. The previous blanket NoSchedule toleration matched the cordon taint, causing the evict-reschedule loop described above. In a HyperShift hosted cluster there are no master nodes in the guest cluster, and the checker doesn't need to run on infra nodes, so no NoSchedule tolerations are needed. Kubernetes provides default NoExecute tolerations (300s grace period) for unreachable and not-ready via the DefaultTolerationSeconds admission controller.
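The two steps above amount to a small mutation of the pod template. The sketch below uses hypothetical local types (`podTemplate`, `toleration`) and a hypothetical helper name (`applyCheckerPolicy`) rather than the real Kubernetes structs, so it illustrates the idea and is not the code in resources.go.

```go
package main

import "fmt"

// safeToEvictKey is the annotation key named in the PR description.
const safeToEvictKey = "cluster-autoscaler.kubernetes.io/safe-to-evict"

// Hypothetical, reduced versions of the pod template and toleration types.
type toleration struct{ Key, Operator, Effect string }

type podTemplate struct {
	Annotations map[string]string
	Tolerations []toleration
}

// applyCheckerPolicy sets the safe-to-evict annotation and drops every
// custom toleration, leaving only whatever defaults admission adds later.
func applyCheckerPolicy(tpl *podTemplate) {
	if tpl.Annotations == nil {
		tpl.Annotations = map[string]string{}
	}
	tpl.Annotations[safeToEvictKey] = "true"
	tpl.Tolerations = nil
}

func main() {
	tpl := podTemplate{
		Tolerations: []toleration{{Operator: "Exists", Effect: "NoSchedule"}},
	}
	applyCheckerPolicy(&tpl)
	fmt.Println(tpl.Annotations[safeToEvictKey], len(tpl.Tolerations)) // true 0
}
```

Setting the slice to nil, rather than filtering it, matches the description's "remove all custom tolerations" and keeps the mutation idempotent across reconcile runs.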

Changes

  • control-plane-operator/.../resources/resources.go: Add safe-to-evict annotation to pod template; remove all custom tolerations (set to nil).
  • control-plane-operator/.../resources/resources_test.go: Update tests to validate the safe-to-evict annotation and assert no custom tolerations are set.

Test plan

  • go test ./control-plane-operator/hostedclusterconfigoperator/controllers/resources/... -count=1 passes
  • Verify on a live cluster that the annotation is present on kas-connection-checker pods
  • Verify cluster autoscaler can scale down nodes running kas-connection-checker pods

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-84368

🤖 Generated with Claude Code

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai

coderabbitai Bot commented Apr 26, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This change adds PodDisruptionBudget (PDB) support to the KAS connection checker component. A new helper function constructs PDB manifests with appropriate metadata. The reconciliation flow for the KAS connection checker is updated to annotate the Deployment pod template as safe to evict and then create or update a PodDisruptionBudget with MinAvailable=1, a selector matching KAS connection checker pods, and unhealthy pod eviction enabled. A new constant defines the cluster autoscaler safe-to-evict annotation key. Test coverage is expanded to validate PDB reconciliation behavior, including edge cases and error scenarios.

Sequence Diagram

sequenceDiagram
    participant Reconciler as Reconciler<br/>(KAS Connection Checker)
    participant K8sAPI as Kubernetes API
    participant Deployment as Deployment<br/>Resource
    participant PDB as PodDisruptionBudget<br/>Resource
    
    Reconciler->>K8sAPI: Fetch existing Deployment
    K8sAPI-->>Reconciler: Return Deployment
    
    Reconciler->>Deployment: Annotate pod template<br/>(safe-to-evict)
    Reconciler->>K8sAPI: Create/Update Deployment
    K8sAPI-->>Reconciler: Deployment reconciled
    
    Reconciler->>K8sAPI: Fetch existing PDB
    K8sAPI-->>Reconciler: Return PDB (or none)
    
    Reconciler->>PDB: Set minAvailable=1<br/>Set selector (app=KASConnectionChecker)<br/>Set unhealthyPodEvictionPolicy=AlwaysAllow
    Reconciler->>K8sAPI: Create/Update PDB
    K8sAPI-->>Reconciler: PDB reconciled
    
    Reconciler-->>Reconciler: Return reconciliation result
🚥 Pre-merge checks | ✅ 9 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.83% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Topology-Aware Scheduling Compatibility | ⚠️ Warning | Fixed 3-replica deployment lacks topology awareness for SNO, Two-Node, and HyperShift topologies, causing scheduling failures. | Implement topology-aware logic checking infrastructure.Status.ControlPlaneTopology to adjust replica count and PDB settings per cluster topology. |
| Test Structure And Quality | ❓ Inconclusive | PR summary indicates resources_test.go uses Go testing package (t.Parallel() removal), not Ginkgo, making the Ginkgo-specific check inapplicable without direct file inspection. | Verify the test framework used in resources_test.go to determine if Ginkgo assessment criteria apply or if alternative testing package standards should be used. |

✅ Passed checks (9 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in modified files are stable, static string constants with no dynamic elements like timestamps, UUIDs, pod names, node names, or namespace suffixes embedded in test names. |
| Microshift Test Compatibility | ✅ Passed | This PR does not add any new Ginkgo e2e tests; it only modifies standard Go unit tests in resources_test.go and includes a minor formatting change in nodepool_test.go. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR does not add new Ginkgo e2e tests. Changes are unit tests using standard Go testing package, not Ginkgo framework. |
| Ote Binary Stdout Contract | ✅ Passed | The pull request does not violate the OTE Binary Stdout Contract. All modifications use only Kubernetes API calls and controller-runtime logging. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The PR does not add any new Ginkgo e2e tests using patterns like It(), Describe(), Context(), or When(). New test code is in a unit test file using standard Go testing.T framework. |
| Title check | ✅ Passed | The title directly addresses the main objective of the PR: adding a safe-to-evict annotation and addressing autoscaler scale-down and drain loop issues for the KAS connection checker. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels Apr 26, 2026
@sdminonne sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from 67a200b to 238cc6a Compare April 26, 2026 10:56
@openshift-ci openshift-ci Bot added the area/testing Indicates the PR includes changes for e2e testing label Apr 26, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go (1)

2841-2855: Add an existing-PDB reconciliation case.

These assertions only cover the create path. Since the controller now reconciles the PDB on every run, please add a case that seeds a kas-connection-checker PDB with maxUnavailable and verifies reconcile rewrites it to minAvailable=1/AlwaysAllow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`
around lines 2841 - 2855, Add a new test case that seeds an existing
PodDisruptionBudget named by manifests.KASConnectionCheckerName in namespace
manifests.KASConnectionCheckerNamespace with Spec.MaxUnavailable set (e.g., 1)
and UnhealthyPodEvictionPolicy set to policyv1.DoNotEvict, then invoke the
controller reconcile path used by other tests (the same reconcile helper used in
resources_test.go) and assert the PDB is mutated: fetch the PDB via c.Get and
verify pdb.Spec.MinAvailable equals intstr.FromInt32(1),
pdb.Spec.UnhealthyPodEvictionPolicy equals policyv1.AlwaysAllow, and
pdb.Spec.Selector still matches app=manifests.KASConnectionCheckerName to
confirm reconciliation rewrote maxUnavailable to minAvailable and enforced
AlwaysAllow.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`:
- Around line 1756-1764: The update closure passed to r.CreateOrUpdate for the
PodDisruptionBudget (pdb) should clear any existing pdb.Spec.MaxUnavailable
before assigning MinAvailable, because PodDisruptionBudgetSpec rejects objects
with both fields set; inside the anonymous func (the CreateOrUpdate callback
that mutates pdb), explicitly set pdb.Spec.MaxUnavailable = nil (or reset it)
prior to setting pdb.Spec.MinAvailable = ptr.To(intstr.FromInt32(1)) and
pdb.Spec.UnhealthyPodEvictionPolicy, referencing the pdb variable and the
CreateOrUpdate call that manages the kas-connection-checker PDB
(manifests.KASConnectionCheckerName).
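The inline comment above describes a mutate callback that must clear one field before setting the other. A minimal sketch of that ordering follows, assuming a hypothetical `pdbSpec` struct with plain `*int` fields in place of the real `intstr.IntOrString` pointers, and a hypothetical `mutatePDB` helper standing in for the CreateOrUpdate callback body.

```go
package main

import "fmt"

// pdbSpec is a hypothetical, reduced stand-in for PodDisruptionBudgetSpec.
// The real type uses *intstr.IntOrString for the two budget fields.
type pdbSpec struct {
	MinAvailable               *int
	MaxUnavailable             *int
	UnhealthyPodEvictionPolicy string
}

// mutatePDB resets any stale MaxUnavailable before setting MinAvailable,
// so the resulting spec never carries both fields at once.
func mutatePDB(spec *pdbSpec) {
	spec.MaxUnavailable = nil
	one := 1
	spec.MinAvailable = &one
	spec.UnhealthyPodEvictionPolicy = "AlwaysAllow"
}

func main() {
	stale := 1
	spec := pdbSpec{MaxUnavailable: &stale, UnhealthyPodEvictionPolicy: "IfHealthyBudget"}
	mutatePDB(&spec)
	fmt.Println(spec.MaxUnavailable == nil, *spec.MinAvailable, spec.UnhealthyPodEvictionPolicy)
	// true 1 AlwaysAllow
}
```

The same seeded-object shape (an existing spec with `MaxUnavailable` set) is what the nitpick's extra test case would exercise.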

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dff10846-33f5-40b0-99c8-a423c96284f5

📥 Commits

Reviewing files that changed from the base of the PR and between 67a200b and 238cc6a.

📒 Files selected for processing (4)
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/manifests/kasconnectionchecker.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
  • test/e2e/nodepool_test.go
✅ Files skipped from review due to trivial changes (1)
  • test/e2e/nodepool_test.go

@openshift-ci openshift-ci Bot added the area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release label Apr 26, 2026
@sdminonne sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from 784e934 to cbad894 Compare April 26, 2026 12:02
@codecov

codecov Bot commented Apr 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.69%. Comparing base (1f7deb5) to head (d32c325).
⚠️ Report is 25 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8338      +/-   ##
==========================================
+ Coverage   41.59%   41.69%   +0.10%     
==========================================
  Files         758      758              
  Lines       93925    93943      +18     
==========================================
+ Hits        39066    39173     +107     
+ Misses      52113    52025      -88     
+ Partials     2746     2745       -1     
Files with missing lines Coverage Δ
...rconfigoperator/controllers/resources/resources.go 56.68% <100.00%> (-0.03%) ⬇️

... and 4 files with indirect coverage changes

Flag Coverage Δ
cmd-support 35.02% <ø> (+0.06%) ⬆️
cpo-hostedcontrolplane 44.00% <ø> (+0.40%) ⬆️
cpo-other 43.44% <100.00%> (-0.01%) ⬇️
hypershift-operator 51.70% <ø> (+0.04%) ⬆️
other 31.56% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

@sdminonne sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from cbad894 to 8979ea8 Compare April 26, 2026 15:53

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go (1)

580-582: Clarify the top-level wrapped error message.

This path now reconciles both Deployment and PDB, but the wrapper still says “deployment” only. A broader message will make failure triage clearer.

Proposed tweak
-            errs = append(errs, fmt.Errorf("failed to reconcile KAS connection checker deployment: %w", err))
+            errs = append(errs, fmt.Errorf("failed to reconcile KAS connection checker resources: %w", err))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`
around lines 580 - 582, The current error wrapper in the call that invokes
reconcileKASConnectionChecker uses "failed to reconcile KAS connection checker
deployment" but that function reconciles both the Deployment and the
PodDisruptionBudget; update the wrapped message to reflect both resources (e.g.,
"failed to reconcile KAS connection checker resources (deployment and PDB)" or
similar) so failures from reconcileKASConnectionChecker are not
misleading—change the fmt.Errorf wrapper string in the block that calls
reconcileKASConnectionChecker(ctx, hcp, cliImage) accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4e5098fd-38b6-4599-929b-7eba3028cb17

📥 Commits

Reviewing files that changed from the base of the PR and between cbad894 and 8979ea8.

📒 Files selected for processing (5)
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/manifests/kasconnectionchecker.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
  • support/config/types.go
  • test/e2e/nodepool_test.go
✅ Files skipped from review due to trivial changes (1)
  • test/e2e/nodepool_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/manifests/kasconnectionchecker.go

@sdminonne

Contributor Author

/test e2e-conformance

@sdminonne sdminonne changed the title from "fix(kas-connection-checker): add PDB and safe-to-evict annotation" to "OCPBUGS-84368: add PDB and safe-to-evict annotation to kas-connection-checker" Apr 27, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 27, 2026
@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-84368, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

Problem

The kas-connection-checker deployment in kube-system uses system-node-critical priority class, which prevents the cluster autoscaler from evicting its pods during scale-down. This blocks node draining and prevents efficient cluster scale-down when kas-connection-checker pods are running on underutilized nodes.

Solution

  • Add cluster-autoscaler.kubernetes.io/safe-to-evict: "true" annotation to the kas-connection-checker pod template so the cluster autoscaler can evict pods despite system-node-critical priority class.
  • Add a PodDisruptionBudget with minAvailable: 1 and unhealthyPodEvictionPolicy: AlwaysAllow to guarantee at least one replica remains available during voluntary disruptions while still allowing scale-down.
  • Extract the cluster-autoscaler.kubernetes.io/safe-to-evict annotation key to a shared constant (PodSafeToEvictKey) in support/config alongside the existing PodSafeToEvictLocalVolumesKey.
  • Rename reconcileKASConnectionCheckerDeployment to reconcileKASConnectionChecker since it now manages the PDB in addition to the deployment.

Changes

  • support/config/types.go: New PodSafeToEvictKey constant.
  • control-plane-operator/.../manifests/kasconnectionchecker.go: New KASConnectionCheckerPodDisruptionBudget() manifest builder.
  • control-plane-operator/.../resources/resources.go: Add safe-to-evict annotation to pod template; create/reconcile PDB with minAvailable: 1 and AlwaysAllow eviction policy; clear any stale maxUnavailable field.
  • control-plane-operator/.../resources/resources_test.go: New test cases for PDB creation, PDB reconciliation from maxUnavailable to minAvailable, PDB creation failure error propagation, and safe-to-evict annotation assertions on both create and update paths.
  • test/e2e/nodepool_test.go: Minor string concatenation formatting fix.

Test plan

  • go test ./control-plane-operator/hostedclusterconfigoperator/controllers/resources/ -run Test_reconciler_reconcileKASConnectionChecker -v passes
  • go test ./control-plane-operator/hostedclusterconfigoperator/controllers/resources/... passes (including TestReconcileErrorHandling and PDB failure injection)
  • Verify on a live cluster that the PDB is created in kube-system with correct spec
  • Verify cluster autoscaler can scale down nodes running kas-connection-checker pods

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@csrwng

csrwng commented Apr 27, 2026

Contributor

/approve

@openshift-ci

openshift-ci Bot commented Apr 27, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng, sdminonne

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 27, 2026
@sdminonne

Contributor Author

According to Claude:

  Summary: This is a KNOWN FLAKY upstream Kubernetes e2e test, unrelated to the PR #8338 changes.
           The test creates a CRD with a custom ResourceQuota and waits for the quota status to
           reflect the custom resource lifecycle. It timed out after ~61 seconds with
           "context deadline exceeded" at resource_quota.go:683. PR #8338 only modifies
           kas-connection-checker (PDB + safe-to-evict annotation) and has no interaction
           with ResourceQuota or CRD lifecycle machinery. The test was not eligible for retries
           per the retry strategy. Overall suite: 1929 passed, 1 blocking fail, 1 informing fail,
           1 flaky, 2127 skipped.

  Evidence:
    - JUnit XML failure message:
      <failure>fail [k8s.io/kubernetes/test/e2e/apimachinery/resource_quota.go:683]: context deadline exceeded</failure>
    - Test output from JUnit system-out:
      "Creating a Custom Resource Definition" at 20:29:39
      "Truncated CRD plural name from 'e2e-test-e2e-resourcequota-975-6663-crds' to 'e2e-test-e2e-resourcequota-975'"
      "context deadline exceeded" at 20:30:40 (~61s later)
      "Found 0 events" in namespace e2e-resourcequota-975
    - Build log: "failed: (1m1s) 2026-04-26T20:30:40 '[sig-api-machinery] ResourceQuota should create
      a ResourceQuota and capture the life of a custom resource.'"
    - Retry log: "Test [...] not eligible for retries (strategy returned 0)"

I'm rerunning the job.

@sdminonne

Contributor Author

/test e2e-conformance

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2026
@enxebre

enxebre commented Apr 28, 2026

Member

Thanks! A few questions:

Add cluster-autoscaler.kubernetes.io/safe-to-evict: "true" annotation to the kas-connection-checker pod template so the cluster autoscaler can evict pods despite system-node-critical priority class.

Why is this annotation needed? Does the autoscaler have any specific behaviour for pods with this priority class during draining?

system-node-critical priority class, which prevents the cluster autoscaler from evicting its pods during scale-down.

Where does the autoscaler implement this?
Wouldn't this mess with scale-down draining just because of its hard tolerations?

@sdminonne sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from 8979ea8 to a1b9129 Compare April 28, 2026 09:27
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 28, 2026
@sdminonne

Contributor Author

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 16, 2026
@sdminonne

Contributor Author

/test e2e-aks-4-22

@sdminonne

Contributor Author

/test e2e-aws-4-22

@openshift-merge-bot

Contributor

/retest-required

Remaining retests: 0 against base HEAD f76be88 and 2 for PR HEAD 83a07f0 in total

@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 18, 2026
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label May 18, 2026
@openshift-ci

openshift-ci Bot commented May 18, 2026

Contributor

New changes are detected. LGTM label has been removed.

@sdminonne

Contributor Author

/verified by me in local-dev

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 18, 2026
@openshift-ci-robot

@sdminonne: This PR has been marked as verified by me in local-dev.

Details

In response to this:

/verified by me in local-dev

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sdminonne

Contributor Author

/pipeline required

@openshift-merge-bot

Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@sdminonne

Contributor Author

/retest-required

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2056326175654940672 | Cost: $2.7445194999999996 | Failed step: hypershift-aws-run-e2e-nested


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne

Contributor Author

/test e2e-aks

@sdminonne

Contributor Author

/test e2e-aws

@sdminonne

Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056605273405001728 | Cost: $2.801915 | Failed step: hypershift-azure-run-e2e


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne

Contributor Author

/test e2e-aks

@sdminonne

Contributor Author

/test e2e-aks

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 11, 2026
@sdminonne sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from 3339fbb to 7ab35c9 Compare June 16, 2026 09:02
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Jun 16, 2026
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2026
…y spread constraint

Add the cluster-autoscaler safe-to-evict annotation to kas-connection-checker
pods so PDBs do not block node scale-down. Add topology spread constraint to
distribute pods across nodes. Fix stale error message and add missing toleration
assertion. Gate the e2e spec test with CPOAtLeast for pre-4.23 HCCO.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@sdminonne sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from 7ab35c9 to d32c325 Compare June 16, 2026 09:19
@hypershift-jira-solve-ci

I now have all the evidence needed. The job state is aborted (not failure), and the description is "Aborted by trigger plugin." Let me compile the complete analysis.

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-images
  • Build ID: 2066808632065921024
  • Type: presubmit
  • PR: #8338, OCPBUGS-84368: add safe-to-evict annotation and remove tolerations to fix autoscaler scale-down and drain loop
  • Author: @sdminonne
  • Cluster: build01
  • Start Time: 2026-06-16T09:02:31Z
  • State: aborted (reported as "failure" on GitHub, but underlying state is abort)

Test Failure Analysis

Error

Entrypoint received interrupt: terminated
Aborted by trigger plugin.

Summary

This job was not a real failure: it was aborted by the Prow trigger plugin before it could finish. All five image builds (src, hypershift-operator, hypershift, hypershift-tests, hypershift-cli) completed successfully. The job was terminated at 09:19:53Z while creating the final release image (release:latest), just seconds before it would have finished. The abort was triggered externally (likely by a new push to the PR or a re-trigger of CI), which caused Prow to kill the in-progress run and start a new one. This is normal CI behavior and not indicative of any code or infrastructure problem in the PR.

Root Cause

The Prow trigger plugin aborted this job run. This happens when:

  1. A new commit is pushed to the PR — Prow cancels in-flight jobs for the old commit SHA and starts new jobs for the new SHA. The commit tested here was 7ab35c9f; if a subsequent commit was pushed, this run would be aborted in favor of the new one.
  2. A /retest or /test command was issued — A maintainer or the PR author re-triggered the job while this run was still in progress, causing the old run to be aborted.

The prowjob.json confirms:

  • status.state: "aborted" (not "failure")
  • status.description: "Aborted by trigger plugin."
  • status.completionTime: null (job never completed normally)
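The distinction between an aborted job and a failed one can be checked mechanically from the three fields listed above. The sketch below assumes only those field names from prowjob.json; the `prowJob` struct is a hypothetical reduced shape, and `isExternalAbort` is an illustrative helper, not part of Prow.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// prowJob is a hypothetical reduced view of prowjob.json that keeps only
// the status fields quoted in the analysis above.
type prowJob struct {
	Status struct {
		State          string  `json:"state"`
		Description    string  `json:"description"`
		CompletionTime *string `json:"completionTime"`
	} `json:"status"`
}

// isExternalAbort reports whether the job ended in the "aborted" state
// rather than as a genuine failure.
func isExternalAbort(raw []byte) (bool, error) {
	var pj prowJob
	if err := json.Unmarshal(raw, &pj); err != nil {
		return false, err
	}
	return pj.Status.State == "aborted", nil
}

func main() {
	raw := []byte(`{"status":{"state":"aborted","description":"Aborted by trigger plugin.","completionTime":null}}`)
	aborted, err := isExternalAbort(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(aborted) // true
}
```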

The build log shows all image builds completed successfully:

  • src-amd64 ✅ (1m58s)
  • hypershift-operator-amd64 ✅ (4m36s)
  • hypershift-amd64 ✅ (4m4s, after 1 transient infra retry for DNS resolution failure)
  • hypershift-tests-amd64 ✅ (11m50s)
  • hypershift-cli-amd64 ✅ (2m58s)

The interrupt signal was received at 09:19:53Z while the final step — creating the release image registry.build01.ci.openshift.org/ci-op-dyxkb5b0/release:latest — was in progress. The job was ~17 minutes in and would have succeeded within seconds had the abort not occurred.

This is not a PR code issue. The PR's code changes compiled and built into images without errors.

Recommendations
  1. No action required on the PR code: all image builds succeeded; the abort was external.
  2. Check for a newer run: this job was likely superseded by a newer run on the same or updated commit. Look at the latest CI status on PR #8338 for the current ci/prow/images result.
  3. If the job shows "failure" on GitHub: Prow sometimes reports aborted jobs as "failure" on the GitHub status check. If the latest run has since passed, this is a non-issue. If no newer run exists, re-trigger with /test images.
  4. Transient infra issue noted: the hypershift-amd64 build had one transient DNS failure (FetchImageContentFailed: no such host for the internal image registry), but ci-operator automatically retried and it succeeded. This is a known transient issue on build clusters and not related to the PR.
Evidence

| Evidence | Detail |
| --- | --- |
| Job state | aborted (not failure) |
| Abort reason | "Aborted by trigger plugin." (external cancellation by Prow) |
| Completion time | null (job never completed normally) |
| Interrupt signal | Entrypoint received interrupt: terminated at 09:19:53Z |
| Image builds | All 5 images built successfully (src, hypershift-operator, hypershift, hypershift-tests, hypershift-cli) |
| Point of abort | During final step: Creating release image registry.build01.ci.openshift.org/ci-op-dyxkb5b0/release:latest |
| Transient infra error | hypershift-amd64 build failed once with DNS error (no such host for internal registry), auto-retried and succeeded |
| PR commit tested | 7ab35c9fa4e30f116f883fe8f2b7c014eccff3e1 |
| Job duration before abort | ~17 minutes (09:02:31 → 09:19:53) |

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

@sdminonne: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • area/testing: Indicates the PR includes changes for e2e testing
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.

5 participants