OCPBUGS-84368: add safe-to-evict annotation and remove tolerations to fix autoscaler scale-down and drain loop by sdminonne · Pull Request #8338 · openshift/hypershift

sdminonne · 2026-04-26T09:27:49Z

Summary

Problem

The kas-connection-checker deployment runs in the kube-system namespace without a PDB or safe-to-evict annotation. The cluster autoscaler treats pods in kube-system without a PDB as system-critical pods that cannot be evicted during scale-down, blocking node draining on underutilized nodes. See also the autoscaler system drainability rule.
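The rule described above can be sketched roughly as follows. This is a simplified model for illustration only, not the autoscaler's actual code: the `Pod` struct is a hypothetical stand-in for `corev1.Pod`, `blocksScaleDown` is an invented helper name, and the real autoscaler applies further drainability rules that are ignored here.

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down stand-in for corev1.Pod.
type Pod struct {
	Namespace   string
	Annotations map[string]string
}

// Annotation key named in the PR description.
const safeToEvictKey = "cluster-autoscaler.kubernetes.io/safe-to-evict"

// blocksScaleDown is a simplified model (an assumption, not the autoscaler's
// real implementation): a kube-system pod with no matching PDB blocks node
// removal unless it carries safe-to-evict: "true".
func blocksScaleDown(p Pod, matchingPDBs int) bool {
	if p.Annotations[safeToEvictKey] == "true" {
		return false
	}
	return p.Namespace == "kube-system" && matchingPDBs == 0
}

func main() {
	checker := Pod{Namespace: "kube-system"}
	fmt.Println(blocksScaleDown(checker, 0)) // true: no PDB, no annotation

	checker.Annotations = map[string]string{safeToEvictKey: "true"}
	fmt.Println(blocksScaleDown(checker, 0)) // false: annotation opts the pod in
}
```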

Additionally, the deployment used a blanket {Effect: NoSchedule, Operator: Exists} toleration that matched the cordon taint (node.kubernetes.io/unschedulable:NoSchedule), causing replacement pods to be scheduled back onto cordoned nodes during drain — creating a destructive evict-reschedule loop.
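To see why the blanket toleration matches the cordon taint, here is a minimal sketch of the Kubernetes toleration matching rules. The `Taint` and `Toleration` structs are simplified stand-ins for the `corev1` types, and `tolerates` and `toleratesAny` are illustrative helper names rather than functions from this PR.

```go
package main

import "fmt"

// Simplified stand-ins for corev1.Taint and corev1.Toleration.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates follows the documented matching rules: an empty effect matches
// any effect, an empty key matches any key, and Exists ignores the value.
func tolerates(t Toleration, taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return t.Value == taint.Value
	}
	return false
}

// toleratesAny reports whether at least one toleration matches the taint.
func toleratesAny(ts []Toleration, taint Taint) bool {
	for _, t := range ts {
		if tolerates(t, taint) {
			return true
		}
	}
	return false
}

func main() {
	cordon := Taint{Key: "node.kubernetes.io/unschedulable", Effect: "NoSchedule"}
	blanket := []Toleration{{Operator: "Exists", Effect: "NoSchedule"}}

	fmt.Println(toleratesAny(blanket, cordon)) // true: pod may land on a cordoned node
	fmt.Println(toleratesAny(nil, cordon))     // false: no tolerations, cordon is respected
}
```

With no key set, the `Exists` toleration matches every `NoSchedule` taint, including the one a cordon adds, which is the loop described above.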

Solution

Add cluster-autoscaler.kubernetes.io/safe-to-evict: "true" annotation to the pod template. This bypasses the autoscaler's system drainability rule for kube-system pods, allowing nodes to be scaled down. The annotation is preferred over a PDB because it doesn't block scale-to-zero operations.
Remove all custom tolerations. The previous blanket NoSchedule toleration matched the cordon taint, causing the evict-reschedule loop described above. In a HyperShift hosted cluster there are no master nodes in the guest cluster, and the checker doesn't need to run on infra nodes, so no NoSchedule tolerations are needed. Kubernetes provides default NoExecute tolerations (300s grace period) for unreachable and not-ready via the DefaultTolerationSeconds admission controller.
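The two steps above amount to a small mutation of the pod template. The following is a hedged sketch using hypothetical stand-in types (`PodTemplate`, `Toleration`) and an invented helper name (`applyFix`); the real change operates on the Deployment's pod template in resources.go, whose exact code is not shown in this description.

```go
package main

import "fmt"

// Hypothetical stand-ins for the relevant parts of a pod template.
type Toleration struct{ Key, Operator, Effect string }

type PodTemplate struct {
	Annotations map[string]string
	Tolerations []Toleration
}

// Annotation key named in the PR description.
const safeToEvictKey = "cluster-autoscaler.kubernetes.io/safe-to-evict"

// applyFix marks the pod as safe to evict and drops all custom tolerations,
// mirroring the two steps of the solution. The name is illustrative only.
func applyFix(tpl *PodTemplate) {
	if tpl.Annotations == nil {
		tpl.Annotations = map[string]string{}
	}
	tpl.Annotations[safeToEvictKey] = "true"
	tpl.Tolerations = nil
}

func main() {
	tpl := &PodTemplate{
		Tolerations: []Toleration{{Operator: "Exists", Effect: "NoSchedule"}},
	}
	applyFix(tpl)
	fmt.Println(tpl.Annotations[safeToEvictKey], len(tpl.Tolerations)) // true 0
}
```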

Changes

control-plane-operator/.../resources/resources.go: Add safe-to-evict annotation to pod template; remove all custom tolerations (set to nil).
control-plane-operator/.../resources/resources_test.go: Update tests to validate the safe-to-evict annotation and assert no custom tolerations are set.

Test plan

go test ./control-plane-operator/hostedclusterconfigoperator/controllers/resources/... -count=1 passes
Verify on a live cluster that the annotation is present on kas-connection-checker pods
Verify cluster autoscaler can scale down nodes running kas-connection-checker pods

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-84368

🤖 Generated with Claude Code

openshift-merge-bot · 2026-04-26T09:27:52Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

coderabbitai · 2026-04-26T09:28:07Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

This change adds PodDisruptionBudget (PDB) support to the KAS connection checker component. A new helper function constructs PDB manifests with appropriate metadata. The reconciliation flow for the KAS connection checker is updated to annotate the Deployment pod template as safe to evict and then create or update a PodDisruptionBudget with MinAvailable=1, a selector matching KAS connection checker pods, and unhealthy pod eviction enabled. A new constant defines the cluster autoscaler safe-to-evict annotation key. Test coverage is expanded to validate PDB reconciliation behavior, including edge cases and error scenarios.

Sequence Diagram

sequenceDiagram
    participant Reconciler as Reconciler<br/>(KAS Connection Checker)
    participant K8sAPI as Kubernetes API
    participant Deployment as Deployment<br/>Resource
    participant PDB as PodDisruptionBudget<br/>Resource
    
    Reconciler->>K8sAPI: Fetch existing Deployment
    K8sAPI-->>Reconciler: Return Deployment
    
    Reconciler->>Deployment: Annotate pod template<br/>(safe-to-evict)
    Reconciler->>K8sAPI: Create/Update Deployment
    K8sAPI-->>Reconciler: Deployment reconciled
    
    Reconciler->>K8sAPI: Fetch existing PDB
    K8sAPI-->>Reconciler: Return PDB (or none)
    
    Reconciler->>PDB: Set minAvailable=1<br/>Set selector (app=KASConnectionChecker)<br/>Set unhealthyPodEvictionPolicy=AlwaysAllow
    Reconciler->>K8sAPI: Create/Update PDB
    K8sAPI-->>Reconciler: PDB reconciled
    
    Reconciler-->>Reconciler: Return reconciliation result

🚥 Pre-merge checks | ✅ 9 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.83% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Topology-Aware Scheduling Compatibility	⚠️ Warning	Fixed 3-replica deployment lacks topology awareness for SNO, Two-Node, and HyperShift topologies, causing scheduling failures.	Implement topology-aware logic checking infrastructure.Status.ControlPlaneTopology to adjust replica count and PDB settings per cluster topology.
Test Structure And Quality	❓ Inconclusive	PR summary indicates resources_test.go uses Go testing package (t.Parallel() removal), not Ginkgo, making the Ginkgo-specific check inapplicable without direct file inspection.	Verify the test framework used in resources_test.go to determine if Ginkgo assessment criteria apply or if alternative testing package standards should be used.

✅ Passed checks (9 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names in modified files are stable, static string constants with no dynamic elements like timestamps, UUIDs, pod names, node names, or namespace suffixes embedded in test names.
Microshift Test Compatibility	✅ Passed	This PR does not add any new Ginkgo e2e tests; it only modifies standard Go unit tests in resources_test.go and includes a minor formatting change in nodepool_test.go.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	This PR does not add new Ginkgo e2e tests. Changes are unit tests using standard Go testing package, not Ginkgo framework.
Ote Binary Stdout Contract	✅ Passed	The pull request does not violate the OTE Binary Stdout Contract. All modifications use only Kubernetes API calls and controller-runtime logging.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	The PR does not add any new Ginkgo e2e tests using patterns like It(), Describe(), Context(), or When(). New test code is in a unit test file using standard Go testing.T framework.
Title check	✅ Passed	The title directly addresses the main objective of the PR: adding a safe-to-evict annotation and addressing autoscaler scale-down and drain loop issues for the KAS connection checker.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go (1)
2841-2855: Add an existing-PDB reconciliation case.

These assertions only cover the create path. Since the controller now reconciles the PDB on every run, please add a case that seeds a kas-connection-checker PDB with maxUnavailable and verifies reconcile rewrites it to minAvailable=1/AlwaysAllow.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`
around lines 2841 - 2855, Add a new test case that seeds an existing
PodDisruptionBudget named by manifests.KASConnectionCheckerName in namespace
manifests.KASConnectionCheckerNamespace with Spec.MaxUnavailable set (e.g., 1)
and UnhealthyPodEvictionPolicy set to policyv1.DoNotEvict, then invoke the
controller reconcile path used by other tests (the same reconcile helper used in
resources_test.go) and assert the PDB is mutated: fetch the PDB via c.Get and
verify pdb.Spec.MinAvailable equals intstr.FromInt32(1),
pdb.Spec.UnhealthyPodEvictionPolicy equals policyv1.AlwaysAllow, and
pdb.Spec.Selector still matches app=manifests.KASConnectionCheckerName to
confirm reconciliation rewrote maxUnavailable to minAvailable and enforced
AlwaysAllow.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`:
- Around line 1756-1764: The update closure passed to r.CreateOrUpdate for the
PodDisruptionBudget (pdb) should clear any existing pdb.Spec.MaxUnavailable
before assigning MinAvailable, because PodDisruptionBudgetSpec rejects objects
with both fields set; inside the anonymous func (the CreateOrUpdate callback
that mutates pdb), explicitly set pdb.Spec.MaxUnavailable = nil (or reset it)
prior to setting pdb.Spec.MinAvailable = ptr.To(intstr.FromInt32(1)) and
pdb.Spec.UnhealthyPodEvictionPolicy, referencing the pdb variable and the
CreateOrUpdate call that manages the kas-connection-checker PDB
(manifests.KASConnectionCheckerName).
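The mutation this inline comment asks for can be sketched as below. This is an assumption-laden illustration: `PDBSpec` is a hypothetical stand-in for `policyv1.PodDisruptionBudgetSpec` (whose real fields use `*intstr.IntOrString` and a typed eviction policy), and `mutatePDB` is an invented name for the body of the CreateOrUpdate callback.

```go
package main

import "fmt"

// PDBSpec is a hypothetical stand-in for policyv1.PodDisruptionBudgetSpec.
// Plain *int and string replace the real field types to keep this runnable
// without Kubernetes dependencies.
type PDBSpec struct {
	MinAvailable               *int
	MaxUnavailable             *int
	UnhealthyPodEvictionPolicy string
}

// mutatePDB clears any stale MaxUnavailable before setting MinAvailable,
// so the resulting spec never has both fields set at once.
func mutatePDB(spec *PDBSpec) {
	spec.MaxUnavailable = nil
	one := 1
	spec.MinAvailable = &one
	spec.UnhealthyPodEvictionPolicy = "AlwaysAllow"
}

func main() {
	stale := 1
	spec := &PDBSpec{MaxUnavailable: &stale, UnhealthyPodEvictionPolicy: "IfHealthyBudget"}
	mutatePDB(spec)
	fmt.Println(spec.MaxUnavailable == nil, *spec.MinAvailable, spec.UnhealthyPodEvictionPolicy)
	// prints: true 1 AlwaysAllow
}
```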

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dff10846-33f5-40b0-99c8-a423c96284f5

📥 Commits

Reviewing files that changed from the base of the PR and between 67a200b and 238cc6a.

📒 Files selected for processing (4)

control-plane-operator/hostedclusterconfigoperator/controllers/resources/manifests/kasconnectionchecker.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
test/e2e/nodepool_test.go

✅ Files skipped from review due to trivial changes (1)

test/e2e/nodepool_test.go

codecov · 2026-04-26T12:17:26Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.69%. Comparing base (1f7deb5) to head (d32c325).
⚠️ Report is 25 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8338      +/-   ##
==========================================
+ Coverage   41.59%   41.69%   +0.10%     
==========================================
  Files         758      758              
  Lines       93925    93943      +18     
==========================================
+ Hits        39066    39173     +107     
+ Misses      52113    52025      -88     
+ Partials     2746     2745       -1

Files with missing lines	Coverage Δ
...rconfigoperator/controllers/resources/resources.go	`56.68% <100.00%> (-0.03%)`	⬇️

... and 4 files with indirect coverage changes

Flag	Coverage Δ
cmd-support	`35.02% <ø> (+0.06%)`	⬆️
cpo-hostedcontrolplane	`44.00% <ø> (+0.40%)`	⬆️
cpo-other	`43.44% <100.00%> (-0.01%)`	⬇️
hypershift-operator	`51.70% <ø> (+0.04%)`	⬆️
other	`31.56% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

coderabbitai

🧹 Nitpick comments (1)

control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go (1)

580-582: Clarify the top-level wrapped error message.

This path now reconciles both Deployment and PDB, but the wrapper still says “deployment” only. A broader message will make failure triage clearer.

Proposed tweak

-            errs = append(errs, fmt.Errorf("failed to reconcile KAS connection checker deployment: %w", err))
+            errs = append(errs, fmt.Errorf("failed to reconcile KAS connection checker resources: %w", err))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`
around lines 580 - 582, The current error wrapper in the call that invokes
reconcileKASConnectionChecker uses "failed to reconcile KAS connection checker
deployment" but that function reconciles both the Deployment and the
PodDisruptionBudget; update the wrapped message to reflect both resources (e.g.,
"failed to reconcile KAS connection checker resources (deployment and PDB)" or
similar) so failures from reconcileKASConnectionChecker are not
misleading—change the fmt.Errorf wrapper string in the block that calls
reconcileKASConnectionChecker(ctx, hcp, cliImage) accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4e5098fd-38b6-4599-929b-7eba3028cb17

📥 Commits

Reviewing files that changed from the base of the PR and between cbad894 and 8979ea8.

📒 Files selected for processing (5)

control-plane-operator/hostedclusterconfigoperator/controllers/resources/manifests/kasconnectionchecker.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
support/config/types.go
test/e2e/nodepool_test.go

✅ Files skipped from review due to trivial changes (1)

test/e2e/nodepool_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

control-plane-operator/hostedclusterconfigoperator/controllers/resources/manifests/kasconnectionchecker.go

sdminonne · 2026-04-26T19:42:23Z

/test e2e-conformance

openshift-ci-robot · 2026-04-27T07:02:15Z

@sdminonne: This pull request references Jira Issue OCPBUGS-84368, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

Problem

The kas-connection-checker deployment in kube-system uses system-node-critical priority class, which prevents the cluster autoscaler from evicting its pods during scale-down. This blocks node draining and prevents efficient cluster scale-down when kas-connection-checker pods are running on underutilized nodes.

Solution

Add cluster-autoscaler.kubernetes.io/safe-to-evict: "true" annotation to the kas-connection-checker pod template so the cluster autoscaler can evict pods despite system-node-critical priority class.

Add a PodDisruptionBudget with minAvailable: 1 and unhealthyPodEvictionPolicy: AlwaysAllow to guarantee at least one replica remains available during voluntary disruptions while still allowing scale-down.

Extract the cluster-autoscaler.kubernetes.io/safe-to-evict annotation key to a shared constant (PodSafeToEvictKey) in support/config alongside the existing PodSafeToEvictLocalVolumesKey.

Rename reconcileKASConnectionCheckerDeployment to reconcileKASConnectionChecker since it now manages the PDB in addition to the deployment.

Changes

support/config/types.go: New PodSafeToEvictKey constant.

control-plane-operator/.../manifests/kasconnectionchecker.go: New KASConnectionCheckerPodDisruptionBudget() manifest builder.

control-plane-operator/.../resources/resources.go: Add safe-to-evict annotation to pod template; create/reconcile PDB with minAvailable: 1 and AlwaysAllow eviction policy; clear any stale maxUnavailable field.

control-plane-operator/.../resources/resources_test.go: New test cases for PDB creation, PDB reconciliation from maxUnavailable to minAvailable, PDB creation failure error propagation, and safe-to-evict annotation assertions on both create and update paths.

test/e2e/nodepool_test.go: Minor string concatenation formatting fix.

Test plan

go test ./control-plane-operator/hostedclusterconfigoperator/controllers/resources/ -run Test_reconciler_reconcileKASConnectionChecker -v passes

go test ./control-plane-operator/hostedclusterconfigoperator/controllers/resources/... passes (including TestReconcileErrorHandling and PDB failure injection)

Verify on a live cluster that the PDB is created in kube-system with correct spec

Verify cluster autoscaler can scale down nodes running kas-connection-checker pods

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

csrwng · 2026-04-27T13:28:34Z

/approve

openshift-ci · 2026-04-27T13:30:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng, sdminonne

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [csrwng]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sdminonne · 2026-04-27T16:47:05Z

According to claude

  Summary: This is a KNOWN FLAKY upstream Kubernetes e2e test, unrelated to the PR #8338 changes.
           The test creates a CRD with a custom ResourceQuota and waits for the quota status to
           reflect the custom resource lifecycle. It timed out after ~61 seconds with
           "context deadline exceeded" at resource_quota.go:683. PR #8338 only modifies
           kas-connection-checker (PDB + safe-to-evict annotation) and has no interaction
           with ResourceQuota or CRD lifecycle machinery. The test was not eligible for retries
           per the retry strategy. Overall suite: 1929 passed, 1 blocking fail, 1 informing fail,
           1 flaky, 2127 skipped.

  Evidence:
    - JUnit XML failure message:
      <failure>fail [k8s.io/kubernetes/test/e2e/apimachinery/resource_quota.go:683]: context deadline exceeded</failure>
    - Test output from JUnit system-out:
      "Creating a Custom Resource Definition" at 20:29:39
      "Truncated CRD plural name from 'e2e-test-e2e-resourcequota-975-6663-crds' to 'e2e-test-e2e-resourcequota-975'"
      "context deadline exceeded" at 20:30:40 (~61s later)
      "Found 0 events" in namespace e2e-resourcequota-975
    - Build log: "failed: (1m1s) 2026-04-26T20:30:40 '[sig-api-machinery] ResourceQuota should create
      a ResourceQuota and capture the life of a custom resource.'"
    - Retry log: "Test [...] not eligible for retries (strategy returned 0)"

I'm rerunning

sdminonne · 2026-04-27T16:47:25Z

/test e2e-conformance

enxebre · 2026-04-28T07:52:47Z

Thanks! A few questions:

Add cluster-autoscaler.kubernetes.io/safe-to-evict: "true" annotation to the kas-connection-checker pod template so the cluster autoscaler can evict pods despite system-node-critical priority class.

Why is this annotation needed? Does the autoscaler have any specific behaviour for pods with this priority class during draining?

system-node-critical priority class, which prevents the cluster autoscaler from evicting its pods during scale-down.

Where does the autoscaler implement this?
Wouldn't this interfere with scale-down draining simply because of its hard tolerations?

sdminonne · 2026-05-16T14:11:49Z

/hold cancel

sdminonne · 2026-05-16T14:13:23Z

/test e2e-aks-4-22

sdminonne · 2026-05-16T14:13:43Z

/test e2e-aws-4-22

openshift-merge-bot · 2026-05-16T16:41:18Z

/retest-required

Remaining retests: 0 against base HEAD f76be88 and 2 for PR HEAD 83a07f0 in total

openshift-ci · 2026-05-18T07:00:45Z

New changes are detected. LGTM label has been removed.

sdminonne · 2026-05-18T07:01:25Z

/verified by me in local-dev

openshift-ci-robot · 2026-05-18T07:01:38Z

@sdminonne: This PR has been marked as verified by me in local-dev.

Details

In response to this:

/verified by me in local-dev

sdminonne · 2026-05-18T07:42:49Z

/pipeline required

openshift-merge-bot · 2026-05-18T07:42:52Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

sdminonne · 2026-05-18T10:48:55Z

/retest-required

hypershift-jira-solve-ci · 2026-05-18T12:51:32Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2056326175654940672 | Cost: $2.7445194999999996 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

sdminonne · 2026-05-18T16:35:08Z

/test e2e-aks

sdminonne · 2026-05-18T17:06:06Z

/test e2e-aws

sdminonne · 2026-05-19T05:10:00Z

/test e2e-aks

hypershift-jira-solve-ci · 2026-05-19T07:07:40Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056605273405001728 | Cost: $2.801915 | Failed step: hypershift-azure-run-e2e

View full analysis report

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

sdminonne · 2026-05-19T10:31:12Z

/test e2e-aks

sdminonne · 2026-05-20T05:30:46Z

/test e2e-aks

…y spread constraint Add the cluster-autoscaler safe-to-evict annotation to kas-connection-checker pods so PDBs do not block node scale-down. Add topology spread constraint to distribute pods across nodes. Fix stale error message and add missing toleration assertion. Gate the e2e spec test with CPOAtLeast for pre-4.23 HCCO. Co-Authored-By: Claude Opus 4.6 <[email protected]>

hypershift-jira-solve-ci · 2026-06-16T09:22:50Z

I now have all the evidence needed. The job state is aborted (not failure), and the description is "Aborted by trigger plugin." Let me compile the complete analysis.

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-images
Build ID: 2066808632065921024
Type: presubmit
PR: #8338 — OCPBUGS-84368: add safe-to-evict annotation and remove tolerations to fix autoscaler scale-down and drain loop
Author: @sdminonne
Cluster: build01
Start Time: 2026-06-16T09:02:31Z
State: aborted (reported as "failure" on GitHub, but the underlying state is aborted)

Test Failure Analysis

Error

Entrypoint received interrupt: terminated
Aborted by trigger plugin.

Summary

This job was not a real failure; it was aborted by the Prow trigger plugin before it could finish. All five image builds (src, hypershift-operator, hypershift, hypershift-tests, hypershift-cli) completed successfully. The job was terminated at 09:19:53Z while creating the final release image (release:latest), just seconds before it would have finished. The abort was triggered externally (likely by a new push to the PR or a re-trigger of CI), which caused Prow to kill the in-progress run and start a new one. This is normal CI behavior and not indicative of any code or infrastructure problem in the PR.

Root Cause

The Prow trigger plugin aborted this job run. This happens when:

A new commit is pushed to the PR — Prow cancels in-flight jobs for the old commit SHA and starts new jobs for the new SHA. The commit tested here was 7ab35c9f; if a subsequent commit was pushed, this run would be aborted in favor of the new one.
A /retest or /test command was issued — A maintainer or the PR author re-triggered the job while this run was still in progress, causing the old run to be aborted.

The prowjob.json confirms:

status.state: "aborted" (not "failure")
status.description: "Aborted by trigger plugin."
status.completionTime: null (job never completed normally)

The build log shows all image builds completed successfully:

src-amd64 ✅ (1m58s)
hypershift-operator-amd64 ✅ (4m36s)
hypershift-amd64 ✅ (4m4s, after 1 transient infra retry for DNS resolution failure)
hypershift-tests-amd64 ✅ (11m50s)
hypershift-cli-amd64 ✅ (2m58s)

The interrupt signal was received at 09:19:53Z while the final step — creating the release image registry.build01.ci.openshift.org/ci-op-dyxkb5b0/release:latest — was in progress. The job was ~17 minutes in and would have succeeded within seconds had the abort not occurred.

This is not a PR code issue. The PR's code changes compiled and built into images without errors.

Recommendations

No action required on the PR code — All image builds succeeded; the abort was external.
Check for a newer run: This job was likely superseded by a newer run on the same or updated commit. Look at the latest CI status on PR #8338 for the current ci/prow/images result.
If the job shows "failure" on GitHub — Prow sometimes reports aborted jobs as "failure" on the GitHub status check. If the latest run has since passed, this is a non-issue. If no newer run exists, re-trigger with /test images.
Transient infra issue noted — The hypershift-amd64 build had one transient DNS failure (FetchImageContentFailed — no such host for the internal image registry) but ci-operator automatically retried and it succeeded. This is a known transient issue on build clusters and not related to the PR.

Evidence

Evidence	Detail
Job state	`aborted` (not `failure`)
Abort reason	`"Aborted by trigger plugin."` — external cancellation by Prow
Completion time	`null` — job never completed normally
Interrupt signal	`Entrypoint received interrupt: terminated` at `09:19:53Z`
Image builds	All 5 images built successfully (`src`, `hypershift-operator`, `hypershift`, `hypershift-tests`, `hypershift-cli`)
Point of abort	During final step: `Creating release image registry.build01.ci.openshift.org/ci-op-dyxkb5b0/release:latest`
Transient infra error	`hypershift-amd64` build failed once with DNS error (`no such host` for internal registry), auto-retried and succeeded
PR commit tested	`7ab35c9fa4e30f116f883fe8f2b7c014eccff3e1`
Job duration before abort	~17 minutes (09:02:31 → 09:19:53)

openshift-ci · 2026-06-16T09:33:16Z

@sdminonne: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot added the do-not-merge/needs-area label Apr 26, 2026

openshift-ci Bot requested review from cblecker and sjenning April 26, 2026 09:28

openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels Apr 26, 2026

sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from 67a200b to 238cc6a Compare April 26, 2026 10:56

openshift-ci Bot added the area/testing Indicates the PR includes changes for e2e testing label Apr 26, 2026

coderabbitai Bot reviewed Apr 26, 2026

openshift-ci Bot added the area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release label Apr 26, 2026

sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from 784e934 to cbad894 Compare April 26, 2026 12:02

sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from cbad894 to 8979ea8 Compare April 26, 2026 15:53

coderabbitai Bot reviewed Apr 26, 2026

sdminonne changed the title ~~fix(kas-connection-checker): add PDB and safe-to-evict annotation~~ OCPBUGS-84368: add PDB and safe-to-evict annotation to kas-connection-checker Apr 27, 2026

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 27, 2026

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2026

sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from 8979ea8 to a1b9129 Compare April 28, 2026 09:27

openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 28, 2026

openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 16, 2026

openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 18, 2026

openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label May 18, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 18, 2026

sdminonne mentioned this pull request May 19, 2026

CNTRLPLANE-3430: Make HA break-glass credentials test infraless #8546

Open

3 tasks

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 11, 2026

sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from 3339fbb to 7ab35c9 Compare June 16, 2026 09:02

openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Jun 16, 2026

openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2026

sdminonne force-pushed the fix/kas-connection-checker-pdb-safe-to-evict branch from 7ab35c9 to d32c325 Compare June 16, 2026 09:19
