
CNTRLPLANE-3023: Add CEL rule to prevent osImageStream removal#8719

Open
sdminonne wants to merge 1 commit into
openshift:mainfrom
sdminonne:CNTRLPLANE-3023

Conversation

@sdminonne

@sdminonne sdminonne commented Jun 11, 2026

Contributor

Summary

  • Add a FeatureGateAwareXValidation CEL rule on NodePoolSpec that prevents removing the osImageStream field once set
  • Add OSImageStreamRHEL9 and OSImageStreamRHEL10 constants for consistent stream name usage
  • Add envtest case covering the removal scenario

Description

Optional immutable fields on feature-gated types are subject to a two-step bypass: a user can (1) remove the field, then (2) re-add it with a different value. The existing field-level CEL rule on osImageStream prevents single-step downgrades (rhel-10 → rhel-9), but a field-level transition rule (oldSelf/self) does not fire when the field is removed entirely — the validator requires both oldSelf and self to be present.

This change adds a parent-level CEL rule on NodePoolSpec using +openshift:validation:FeatureGateAwareXValidation (gated on OSStreams) that rejects any update that removes osImageStream once it has been set:

!has(oldSelf.osImageStream) || has(self.osImageStream)

The FeatureGateAwareXValidation marker is used instead of +kubebuilder:validation:XValidation because the osImageStream field is stripped from the CRD schema in the Default variant (where the feature gate is disabled). A regular XValidation rule referencing osImageStream would fail at CRD installation time in that variant.
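The two-step bypass and the effect of the parent-level rule can be modelled in plain Go. This is only a sketch: the struct is a reduced stand-in for the real NodePoolSpec, the marker's message text is an assumption, and the helper functions (fieldRuleAllows, parentRuleAllows, updateAllowed) are hypothetical names that mimic how the API server evaluates the two CEL rules.

```go
package main

// OSImageStreamReference is a simplified stand-in for the real reference type.
type OSImageStreamReference struct {
	Name string
}

// NodePoolSpec models only the field under discussion. The marker below shows
// roughly how the parent-level rule could be declared; the message text is an
// assumption, not taken from the PR.
//
// +openshift:validation:FeatureGateAwareXValidation:featureGate=OSStreams,rule="!has(oldSelf.osImageStream) || has(self.osImageStream)",message="osImageStream cannot be removed once set"
type NodePoolSpec struct {
	// A nil pointer models an absent field, i.e. has(...) == false in CEL.
	OSImageStream *OSImageStreamReference
}

// fieldRuleAllows models the existing field-level transition rule. It is only
// evaluated when both the old and the new value are present, and then rejects
// a rhel-10 -> rhel-9 downgrade.
func fieldRuleAllows(oldSpec, newSpec NodePoolSpec) bool {
	if oldSpec.OSImageStream == nil || newSpec.OSImageStream == nil {
		return true // the transition rule does not fire
	}
	return !(oldSpec.OSImageStream.Name == "rhel-10" && newSpec.OSImageStream.Name == "rhel-9")
}

// parentRuleAllows models !has(oldSelf.osImageStream) || has(self.osImageStream).
func parentRuleAllows(oldSpec, newSpec NodePoolSpec) bool {
	return oldSpec.OSImageStream == nil || newSpec.OSImageStream != nil
}

// updateAllowed applies the field-level rule and, optionally, the parent rule.
func updateAllowed(oldSpec, newSpec NodePoolSpec, withParentRule bool) bool {
	if !fieldRuleAllows(oldSpec, newSpec) {
		return false
	}
	if withParentRule && !parentRuleAllows(oldSpec, newSpec) {
		return false
	}
	return true
}
```

Under this model, without the parent rule a rhel-10 pool can drop the field and then re-add it as rhel-9, because neither step triggers the field-level rule. With the parent rule enabled, the first step (removal) is already rejected.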

Test plan

  • Envtest: "When removing osImageStream from an existing NodePool it should fail"
  • CEL rule present in TechPreviewNoUpgrade and CustomNoUpgrade CRD variants
  • CEL rule absent from Default CRD variant
  • make update && make verify passes clean

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added support for RHEL 9 and RHEL 10 operating system image streams.
  • Enhancements

    • Implemented validation to prevent removing an operating system image stream configuration once it has been set.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 11, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 11, 2026


@sdminonne: This pull request references CNTRLPLANE-3023 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add ReleaseImage.AvailableOSImageStreams() method that returns OS image streams available in a release payload based on version heuristics (pre-5.0: [rhel-9], 5.0+: [rhel-9, rhel-10])
  • Extend validOSImageStreamCondition to validate that the specified spec.osImageStream.name exists in the NodePool's release payload, in addition to the existing transition guards (removal/downgrade)
  • Add NodePoolOSImageStreamNotInPayloadReason constant for the new validation failure reason
  • Add unit tests for both AvailableOSImageStreams() and the new payload validation path in TestValidOSImageStreamCondition
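A version heuristic of the kind described in the first bullet might look like the sketch below. It is not the code from the PR: the function name availableOSImageStreams, its string parameter, and the parsing approach are assumptions chosen for illustration. Only the mapping (pre-5.0 gives rhel-9, 5.0+ gives both, unparsable versions fall back to the conservative default) comes from the description.

```go
package main

import (
	"strconv"
	"strings"
)

// availableOSImageStreams is a hypothetical stand-in for the described
// ReleaseImage.AvailableOSImageStreams() method. It takes the release version
// as a string (for example "4.18.0") and looks only at the major component.
func availableOSImageStreams(version string) []string {
	majorStr, _, _ := strings.Cut(version, ".")
	major, err := strconv.Atoi(majorStr)
	if err != nil || major < 5 {
		// Pre-5.0 payloads, and versions that cannot be parsed, get the
		// conservative default of rhel-9 only.
		return []string{"rhel-9"}
	}
	return []string{"rhel-9", "rhel-10"}
}
```

Because it never inspects the payload itself, a heuristic like this has to be revisited whenever the set of streams shipped in a release changes.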

Description

Previously, validOSImageStreamCondition only enforced transition guards (preventing removal once set and preventing downgrades from rhel-10 to rhel-9). It did not verify whether the specified stream actually exists in the release payload.

This change restructures the condition into three phases:

  1. Transition validation (unchanged): removal and downgrade guards that don't need release info
  2. Payload validation (new): looks up the release image and checks that the stream is in AvailableOSImageStreams(); rejects with ValidOSImageStream=False / OSImageStreamNotInPayload if not found
  3. Success: updates the status latch and sets condition True

For example, setting rhel-10 on a 4.18 release will now produce:

ValidOSImageStream=False, Reason=OSImageStreamNotInPayload
Message: osImageStream "rhel-10" is not available in release payload; available streams: [rhel-9]
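The three phases above can be sketched as a single pure function. This is an illustration under stated assumptions, not the controller code: the outcome type, the function name validateOSImageStream, the removal message, and the downgrade reason string are all hypothetical (the thread does not spell out the downgrade reason value). The removal and not-in-payload reason strings and the not-in-payload message format are taken from the thread.

```go
package main

import "fmt"

const (
	reasonRemoved      = "OSImageStreamRemoved"      // value quoted in the review
	reasonNotInPayload = "OSImageStreamNotInPayload" // value quoted in the description
	reasonDowngrade    = "DowngradePlaceholder"      // hypothetical placeholder
)

// outcome is a hypothetical result type. Valid maps to the condition status,
// and Latch is the value the status latch should hold afterwards.
type outcome struct {
	Valid   bool
	Reason  string
	Message string
	Latch   string
}

// validateOSImageStream sketches the three phases. specStream is the requested
// stream ("" when omitted), latched is the previously accepted stream, and
// available lists the streams offered by the release payload.
func validateOSImageStream(specStream, latched string, available []string) outcome {
	// Phase 1: transition guards that need no release info.
	if latched != "" && specStream == "" {
		return outcome{Reason: reasonRemoved, Message: "osImageStream cannot be removed once set", Latch: latched}
	}
	if latched == "rhel-10" && specStream == "rhel-9" {
		return outcome{Reason: reasonDowngrade, Message: "osImageStream cannot be downgraded", Latch: latched}
	}

	// Phase 2: payload validation.
	if specStream != "" {
		found := false
		for _, s := range available {
			if s == specStream {
				found = true
				break
			}
		}
		if !found {
			msg := fmt.Sprintf("osImageStream %q is not available in release payload; available streams: %v", specStream, available)
			return outcome{Reason: reasonNotInPayload, Message: msg, Latch: latched}
		}
	}

	// Phase 3: success. The latch only advances after validation has passed.
	if specStream != "" {
		latched = specStream
	}
	return outcome{Valid: true, Latch: latched}
}
```

Returning the new latch value instead of mutating status keeps the sketch easy to test; the real condition function would write the latch to the NodePool status.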

Test plan

  • TestAvailableOSImageStreams — 4 cases covering 4.18 (rhel-9 only), 5.0/5.1 (both streams), invalid version (conservative default)
  • TestValidOSImageStreamCondition — 10 cases covering transition guards (removal, downgrade) and payload validation (rhel-9 on 4.18, rhel-10 on 4.18 rejected, rhel-9/rhel-10 on 5.0, upgrade on 5.0)
  • make pre-commit passes (update, build, verify, test, gitlint)
  • /test e2e-hypershift

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Jun 11, 2026
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 76086dc2-767b-4cc5-b5a2-0b2ba5a6b689

📥 Commits

Reviewing files that changed from the base of the PR and between 4496f29 and a59f54d.

⛔ Files ignored due to path filters (5)
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OSStreams.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/featuregated.nodepools.osimagestream.testsuite.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (1)
  • api/hypershift/v1beta1/nodepool_types.go

📝 Walkthrough

Walkthrough

In api/hypershift/v1beta1/nodepool_types.go, a feature-gated CEL XValidation rule is added to NodePoolSpec via a +openshift:validation:FeatureGateAwareXValidation annotation (gated by OSStreams). The rule enforces that the osImageStream field cannot be removed once it has been set. Additionally, a new const block exports two string constants: OSImageStreamRHEL9 ("rhel-9") and OSImageStreamRHEL10 ("rhel-10"), representing supported OS image stream names.

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Structure And Quality | ⚠️ Warning | The custom check requires reviewing Ginkgo test code for quality requirements. However, the PR description claims tests were added (TestAvailableOSImageStreams, TestValidOSImageStreamCondition) tha... | The PR description describes unit tests (TestAvailableOSImageStreams with 4 cases, TestValidOSImageStreamCondition with 10 cases) and methods (AvailableOSImageStreams, NodePoolOSImageStreamNotInPayloadReason) that are not actually implem... |

✅ Passed checks (10 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a CEL rule to prevent osImageStream removal, which is the primary feature introduced in this pull request. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo tests present in PR. Only Go table-driven tests in nodepool_types_test.go exist, with stable deterministic names containing no dynamic values. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only API type definitions and validation annotations (CEL rules) on NodePoolSpec with no deployment manifests, operator code, or scheduling constraints that would affect topology compat... |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests added. PR only adds standard Go unit tests in nodepool_types_test.go (TestNodePoolAutoScalingSerializationCompatibility), which uses testing.T, not Ginkgo patterns. |
| No-Weak-Crypto | ✅ Passed | The PR adds OSImageStream constants and CEL validation rules to NodePoolSpec. No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-cons... |
| Container-Privileges | ✅ Passed | PR only modifies Go API type definitions (constants and CEL validation rules) in nodepool_types.go; contains no container/K8s manifests or container security configurations to flag. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No sensitive data logging found. Changes only add OS stream constants ("rhel-9", "rhel-10") and validation rules with generic messages. No passwords, tokens, PII, or internal data exposed in logs. |


@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Jun 11, 2026
@codecov

codecov Bot commented Jun 11, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.66%. Comparing base (44f5195) to head (a59f54d).
⚠️ Report is 28 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8719   +/-   ##
=======================================
  Coverage   41.66%   41.66%           
=======================================
  Files         758      758           
  Lines       93929    93929           
=======================================
  Hits        39135    39135           
  Misses      52046    52046           
  Partials     2748     2748           
| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 34.96% <ø> (ø) |
| cpo-hostedcontrolplane | 44.00% <ø> (ø) |
| cpo-other | 43.45% <ø> (ø) |
| hypershift-operator | 51.65% <ø> (ø) |
| other | 31.56% <ø> (ø) |


@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
api/hypershift/v1beta1/nodepool_types.go (1)

242-261: ⚡ Quick win

Clarify default OS stream behavior in documentation.

Line 252 states "the pool uses the release version's default stream (rhel-9 for OCP < 5.0, rhel-10 for OCP >= 5.0)". This is misleading: for OCP >= 5.0, AvailableOSImageStreams() returns both ["rhel-9", "rhel-10"], not a single default.

When osImageStream is omitted, the actual default selection is implementation-defined by downstream platform code, not by this API field. Consider revising the documentation to:

// When omitted, the pool uses platform-specific default OS images.
// For OCP < 5.0, only rhel-9 is available.
// For OCP >= 5.0, both rhel-9 and rhel-10 are available; set this field
// explicitly to select a non-default stream.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/hypershift/v1beta1/nodepool_types.go` around lines 242 - 261, Doc comment
for the OSImageStream field is misleading about defaults; update the comment
above the OSImageStream (osImageStream) field to state that when omitted the
pool uses platform-specific defaults, that AvailableOSImageStreams() returns
only rhel-9 for OCP < 5.0 but both rhel-9 and rhel-10 for OCP >= 5.0, and
recommend that callers explicitly set OSImageStream to pick a non-default stream
(use the suggested replacement wording from the review to replace the existing
paragraph). Ensure references to OSImageStream and AvailableOSImageStreams()
remain accurate and keep the CEL validation note intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 212bbd47-181c-4e0a-8242-848f008b0507

📥 Commits

Reviewing files that changed from the base of the PR and between 35c0190 and bef198c.

⛔ Files ignored due to path filters (20)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests.yaml is excluded by !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OSStreams.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/nodepoolspec.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/nodepoolstatus.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/osimagestreamreference.go is excluded by !client/**
  • client/applyconfiguration/utils.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/payload-manifests/featuregates/featureGate-Hypershift-Default.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/payload-manifests/featuregates/featureGate-Hypershift-TechPreviewNoUpgrade.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/payload-manifests/featuregates/featureGate-SelfManagedHA-Default.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/payload-manifests/featuregates/featureGate-SelfManagedHA-TechPreviewNoUpgrade.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/featuregated.nodepools.osimagestream.testsuite.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_conditions.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests.yaml is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*
📒 Files selected for processing (11)
  • api/hypershift/v1beta1/featuregates/featureGate-Hypershift-Default.yaml
  • api/hypershift/v1beta1/featuregates/featureGate-Hypershift-TechPreviewNoUpgrade.yaml
  • api/hypershift/v1beta1/featuregates/featureGate-SelfManagedHA-Default.yaml
  • api/hypershift/v1beta1/featuregates/featureGate-SelfManagedHA-TechPreviewNoUpgrade.yaml
  • api/hypershift/v1beta1/nodepool_conditions.go
  • api/hypershift/v1beta1/nodepool_types.go
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/conditions_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • support/releaseinfo/releaseinfo.go
  • support/releaseinfo/releaseinfo_test.go

@sdminonne sdminonne marked this pull request as ready for review June 12, 2026 09:39
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and jparrill June 12, 2026 09:40
@sdminonne

Contributor Author

/retest

@jparrill jparrill left a comment

Contributor

Dropped some comments. Thanks!

CIDRConflictReason = "CIDRConflict"
NodePoolKubeVirtLiveMigratableReason = "KubeVirtNodesNotLiveMigratable"
NodePoolUnsupportedSkewReason = "UnsupportedSkew"
NodePoolOSImageStreamRemovalReason = "OSImageStreamRemoved"

Contributor

nit: The alignment on these three is off compared to the rest of the const block. The existing constants column-align the = sign — these push it further right. make fmt won't catch it (gofmt doesn't enforce alignment across different-length names in a const block), so it's a manual fix.

Also, the constant name NodePoolOSImageStreamRemovalReason says "Removal" (noun) but its value is "OSImageStreamRemoved" (past tense). The other two are consistent with themselves (DowngradeReason/"...Downgrade", NotInPayloadReason/"...NotInPayload"). Minor, but worth picking one convention.


// --- Phase 3: Valid. Update status latch and set condition True ---

if specStream != "" {

Contributor

This is the first condition function in this file that mutates a non-condition status field — all the others only call SetStatusCondition. The latch makes sense here because it has to be set atomically with the validation passing, but it's a departure from the pattern. A one-line comment explaining why would help future readers, e.g.:

// Latch must be set here, not in the reconcile body, because transition guards
// depend on it being updated only after validation passes.

Comment thread support/releaseinfo/releaseinfo.go Outdated
return i.ImageStream.Name
}

// AvailableOSImageStreams returns the OS image streams available in this release payload.

Contributor

This is a version heuristic, not actual payload introspection — it doesn't look at StreamMetadata or image tags. That's fine for TechPreview, but worth a TODO so it's not forgotten:

// TODO(CNTRLPLANE-3023): Replace version heuristic with payload metadata
// introspection when release images carry OS stream manifests.

Also, consider defining "rhel-9" and "rhel-10" as constants in nodepool_types.go (like ArchitectureAMD64 = "amd64" and UpgradeTypeReplace). They're used here and in conditions.go — centralizing them reduces the chance of typos and makes the hardcoded downgrade check more maintainable.
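A minimal sketch of the suggested centralization follows. The constant names and values match those listed in the PR summary; the isDowngrade helper is a hypothetical example of a caller and is not taken from conditions.go.

```go
package main

// Stream name constants, as named in the PR summary.
const (
	OSImageStreamRHEL9  = "rhel-9"
	OSImageStreamRHEL10 = "rhel-10"
)

// isDowngrade is a hypothetical helper showing how a hardcoded downgrade check
// could reference the constants instead of repeating string literals.
func isDowngrade(oldStream, newStream string) bool {
	return oldStream == OSImageStreamRHEL10 && newStream == OSImageStreamRHEL9
}
```

With the names defined once, a typo such as "rhel10" becomes a compile error at the call site instead of a silently failing comparison.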


// NodePoolValidOSImageStreamConditionType signals if the osImageStream requested in the
// NodePool spec is valid. This covers two classes of validation:
// 1. Transition guards (controller-level complement to CEL): prevents removing the field

Member

If we want to enforce this "prevents removing the field", this needs to be implemented via CEL

Contributor Author

Done

// 2. Payload validation: verifies that the specified stream is available in the NodePool's
// release payload (e.g., rhel-10 is only available in 5.0+ payloads).
// A failure here requires the user to change the osImageStream field to a valid value.
NodePoolValidOSImageStreamConditionType = "ValidOSImageStream"

Member

This is derailing from the enhancement. If something needs adjustment, please create a PR against the enhancement.

Contributor Author

Right. I'm removing it.

Add a FeatureGateAwareXValidation CEL rule on NodePoolSpec that prevents
removing the osImageStream field once set, closing the two-step bypass
for optional immutable fields on feature-gated types. Add an envtest
case covering the removal scenario.

Also add OSImageStreamRHEL9 and OSImageStreamRHEL10 constants for
consistent use of stream name strings across the codebase.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@sdminonne sdminonne changed the title CNTRLPLANE-3023: Add osImageStream release payload validation CNTRLPLANE-3023: Add CEL rule to prevent osImageStream removal Jun 16, 2026
@sdminonne

Contributor Author

@enxebre removed the erroneous condition definitions. Left only the CEL rule to prevent stream removal once set.
PR description updated.

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

@sdminonne: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@enxebre

enxebre commented Jun 16, 2026

Member

/approve

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, sdminonne

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2026
@sdminonne

Contributor Author

/retest

@sdminonne

Contributor Author

/verified by unit-tests

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 17, 2026
@openshift-ci-robot


@sdminonne: This PR has been marked as verified by unit-tests.

Details

In response to this:

/verified by unit-tests


@jparrill jparrill left a comment

Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2026
@openshift-merge-bot

Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci


AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2067150317815861248 | Cost: $3.9142590000000004 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cwbotbot

cwbotbot commented Jun 17, 2026


Test Results

e2e-aws

e2e-aks

Failed Tests

Total failed tests: 3

  • TestCreateClusterHABreakGlassCredentials
  • TestCreateClusterHABreakGlassCredentials/ValidateHostedCluster
  • TestCreateClusterHABreakGlassCredentials/ValidateHostedCluster/EnsureOAPIMountsTrustBundle

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 17, 2026


Now I have a clear picture. Let me assemble the timeline and present the final analysis. The key findings are:

  1. This is an infrastructure/CI failure in the pre (installation) phase — the cluster bootstrap timed out
  2. The PR changes (adding CEL validation rules to NodePool CRDs) are completely unrelated to this failure
  3. The bootstrap waited 45 minutes and timed out because:
    • The management cluster was installed with only 1 worker node (c5n.metal type, needed for KubeVirt)
    • The ingress controller router deployment couldn't reach minimum replicas (0/1 available)
    • Without the ingress controller ready, several dependent operators (authentication, monitoring, openshift-apiserver, console) couldn't become available
    • The bootstrap process timed out waiting for these operators

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Bootstrap failed to complete: timed out waiting for the condition
Failed to wait for bootstrapping to complete. This error usually happens when there
is a problem with control plane hosts that prevents the control plane operators from
creating the control plane.

Cluster operator ingress Available is False: Waiting for router deployment rollout
to finish: 0 of 1 updated replica(s) are available...

Summary

The e2e-kubevirt-aws-ovn-reduced job failed during the pre phase (cluster installation), before any test code from the PR was executed. The openshift-install bootstrap process timed out after 45 minutes because the ingress controller router deployment could not reach minimum replicas (0/1 available). This blocked multiple dependent operators (authentication, monitoring, openshift-apiserver, console) from becoming available, ultimately preventing bootstrap completion. The PR's changes — adding CEL validation rules to NodePool CRDs — are purely API schema changes and are completely unrelated to this infrastructure-level installation failure.

Root Cause

This is an infrastructure/CI environment failure, not a code regression from PR #8719.

The cluster was configured with:

  • 3 master nodes (m6i.2xlarge)
  • 1 worker node (c5n.metal — bare metal instance required for KubeVirt/nested virtualization)

The bootstrap process timed out because the ingress controller's router deployment could not start (0/1 replicas available). The router pod requires a schedulable worker node, and the single c5n.metal worker node did not become ready in time.

Failure chain:

  1. The c5n.metal worker node did not join the cluster within the 45-minute bootstrap window
  2. The ingress controller router deployment stayed at 0/1 replicas (needs a worker node to schedule)
  3. Without ingress, the Route API was unavailable → monitoring operator failed to create Routes
  4. Without ingress, OAuth endpoints were unreachable → authentication operator became degraded
  5. The openshift-apiserver and console operators remained unavailable (dependency on authentication/ingress)
  6. Bootstrap timed out waiting for all operators to become available

Why c5n.metal was slow/failed to join: Bare metal instances (c5n.metal) in AWS have significantly longer provisioning times compared to standard virtualized instances. Combined with the full RHCOS image download, ignition processing, and MachineConfig application on a bare-metal instance, the single worker exceeded the 45-minute bootstrap timeout.

PR correlation: NONE. The PR adds CEL validation rules to NodePool CRD schemas. These changes:

  • Only affect NodePool API validation (preventing osImageStream removal)
  • Are compiled into the hypershift-operator image but never run during IPI cluster installation
  • Do not affect the openshift-install process, bootstrap, or any cluster operator

Recommendations

  1. Retry the job — This is a transient infrastructure failure. The PR changes have no bearing on cluster installation. A /retest should resolve this.

  2. Monitor for recurring pattern — If e2e-kubevirt-aws-ovn-reduced consistently fails with bootstrap timeouts, the c5n.metal provisioning time may be systematically too close to the 45-minute limit, and the bootstrap timeout or instance type should be revisited by the CI infrastructure team.

  3. No code changes needed — The PR (adding CEL rules to prevent osImageStream removal) is unrelated to this failure.

Evidence

| Evidence | Detail |
| --- | --- |
| Failed Step | ipi-install-install (pre phase), exit code 5 |
| Phase | Pre (installation); test code never executed |
| Bootstrap Wait | 08:13:24 → 08:58:24 UTC (45-minute timeout) |
| Ingress Status | Available=False, Degraded=True; router deployment 0/1 replicas |
| Authentication Status | Degraded=True; OAuth endpoints unreachable (connection refused to 172.30.23.120:443) |
| Monitoring Status | Available=False; Route API unavailable (routes.route.openshift.io not found) |
| OpenShift API Server | Available=False; APIServices: PreconditionNotReady |
| Worker Instance Type | c5n.metal (bare metal, 1 replica); slow to provision |
| API Accessibility | API became available at ~08:13 after DNS/network delays from 08:10 |
| PR Changes | NodePool CRD CEL rules only; zero overlap with IPI installation |
| Installer Exit Code | 5 (bootstrap timeout) |

@jparrill

Contributor

/retest-required


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria


6 participants