
CNTRLPLANE-3552: Multi-stream CoreOS metadata parsing and stream resolution #8669

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from jparrill:CNTRLPLANE-3552 on Jun 17, 2026

Conversation

@jparrill

@jparrill jparrill commented Jun 4, 2026

Contributor

What this PR does / why we need it:

Add multi-stream CoreOS boot image parsing and RHEL stream resolution logic for the dual-stream RHEL 9/10 NodePool feature (Phase 1 of CNTRLPLANE-3018).

  • Update DeserializeImageMetadata to parse both the legacy stream key (OCP < 5.0) and the new streams key (OCP >= 5.0) from the boot image ConfigMap
  • Add OSStreams field to ReleaseImage and StreamForName() method for stream-specific boot image lookup
  • Add GetRHELStream() pure function implementing the stream resolution table from the dual-stream RHEL enhancement
  • Add 5.0 boot image fixture extracted from 5.0.0-ec.2 release payload
  • Fallback: when ConfigMap has only streams without stream, populate StreamMetadata from the first available stream to prevent nil panics in platform consumers

This is preparatory code — not reachable in production until Phase 2 connects GetRHELStream into the NodePool controller (NewToken(), validMachineConfigCondition). The multi-stream parsing only activates when a >= 5.0 payload carries the streams ConfigMap key.

Stream Resolution Table

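| explicit stream | release | runc | result |
| --- | --- | --- | --- |
| unset | 4.x | - | "" (legacy) |
| unset | 5.x | false | rhel-10 |
| unset | 5.x | true | rhel-9 (fallback) |
| rhel-9 | any | - | rhel-9 |
| rhel-10 | <5.0 | - | error |
| rhel-10 | >=5.0 | false | rhel-10 |
| rhel-10 | >=5.0 | true | error |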
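
The resolution table can be read as a small pure function. The following is a minimal sketch of that reading, not the code from this PR: `resolveStream`, its `major int` parameter, the local constants, and the error messages are all hypothetical names chosen for illustration, and the handling of an unrecognized stream name is an assumption the table does not cover.

```go
package main

import (
	"errors"
	"fmt"
)

// Local illustrative copies of the stream names used in the table.
const (
	streamRHEL9  = "rhel-9"
	streamRHEL10 = "rhel-10"
)

// resolveStream is a hypothetical stand-in for the table's logic.
// Taking the release major version as an int is an assumption of this sketch.
func resolveStream(explicit string, major int, usesRunc bool) (string, error) {
	switch explicit {
	case "":
		if major < 5 {
			return "", nil // legacy single-stream behavior
		}
		if usesRunc {
			return streamRHEL9, nil // fallback row of the table
		}
		return streamRHEL10, nil
	case streamRHEL9:
		return streamRHEL9, nil
	case streamRHEL10:
		if major < 5 {
			return "", errors.New("rhel-10 requires a release >= 5.0")
		}
		if usesRunc {
			return "", errors.New("rhel-10 is not allowed together with runc")
		}
		return streamRHEL10, nil
	default:
		// Not covered by the table; rejecting unknown names is an assumption.
		return "", fmt.Errorf("unknown stream %q", explicit)
	}
}

func main() {
	cases := []struct {
		explicit string
		major    int
		usesRunc bool
	}{
		{"", 4, false},
		{"", 5, false},
		{"", 5, true},
		{streamRHEL10, 4, false},
	}
	for _, c := range cases {
		got, err := resolveStream(c.explicit, c.major, c.usesRunc)
		fmt.Printf("explicit=%q major=%d runc=%t -> %q, err=%v\n",
			c.explicit, c.major, c.usesRunc, got, err)
	}
}
```

Keeping the function free of I/O means every row of the table maps to one test case.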

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3552

Special notes for your reviewer:

  • Fixture data comes from a real 5.0.0-ec.2 payload (not synthetic)
  • GetRHELStream is exported for future Phase 2 consumers
  • The OSStreams field lives on ReleaseImage (not on stream.Stream) to keep the upstream struct clean
  • Uses stream.Stream from coreos/stream-metadata-go (introduced by CNTRLPLANE-3020: Adopt coreos/stream-metadata-go upstream library #8673)
  • Stream constants StreamRHEL9/StreamRHEL10 exported for reuse

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Test plan

Unit tests — parsing and stream resolution:

go test ./support/releaseinfo/... -v -count=1
go test -run TestGetRHELStream ./hypershift-operator/controllers/nodepool/ -v -count=1

Verification:

make lint-fix  # 0 issues
make test      # 0 failures
make verify    # passes

🤖 Generated with Claude Code

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 4, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 4, 2026


@jparrill: This pull request references CNTRLPLANE-3552 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Fixes

Summary

  • Add multi-stream boot image ConfigMap parsing to DeserializeImageMetadata — supports both the legacy stream key (OCP < 5.0) and the new streams key (OCP >= 5.0)
  • Add OSStreams field to ReleaseImage and StreamForName() method for stream-specific boot image lookup
  • Add GetRHELStream() pure function implementing the stream resolution table from the dual-stream RHEL enhancement
  • Add 5.0 boot image fixture extracted from 5.0.0-ec.2 release payload
  • Integration test validating end-to-end flow: parse payload → resolve stream → look up boot images

Phase 1 of CNTRLPLANE-3018 (epic). No API changes, no platform wiring — parsing and resolution logic only. This code is not reachable in production until Phase 2 connects GetRHELStream into the NodePool controller (NewToken(), validMachineConfigCondition). The multi-stream parsing only activates when a >= 5.0 payload carries the streams ConfigMap key.

Stream Resolution Table

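| explicit stream | release | runc | result |
| --- | --- | --- | --- |
| unset | 4.x | - | "" (legacy) |
| unset | 5.x | false | rhel-10 |
| unset | 5.x | true | rhel-9 (fallback) |
| rhel-9 | any | - | rhel-9 |
| rhel-10 | <5.0 | - | error |
| rhel-10 | >=5.0 | false | rhel-10 |
| rhel-10 | >=5.0 | true | error |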

Test plan

  • Unit tests — parsing and stream resolution:
go test ./support/releaseinfo/... -v -count=1
go test -run TestGetRHELStream ./hypershift-operator/controllers/nodepool/ -v -count=1
  • Integration tests — end-to-end flow with real 5.0 EC payload:
go test -tags integration ./test/integration/osstreams/... -v -count=1

Note: requires -tags integration build tag. No cluster needed.

  • make lint-fix — 0 issues
  • make verify — passes

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 4, 2026
@openshift-ci

openshift-ci Bot commented Jun 4, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR introduces multi-stream RHEL CoreOS image metadata support to the hypershift operator. It adds constants and resolution logic for rhel-9 and rhel-10 streams (GetRHELStream), extends the ReleaseImage data model with an OSStreams field and StreamForName lookup method, rewrites ConfigMap deserialization to parse both single-stream and multi-stream metadata from separate keys, integrates the parsed data into release info providers, adds defensive nil checks in image resolution code, and validates the entire flow with new OCP 5.0 fixture data and end-to-end integration tests covering happy paths, error cases, and backward compatibility with legacy OCP 4.10 payloads.

Sequence Diagram

sequenceDiagram
  participant NodePool as NodePool Controller
  participant GetRHELStream as GetRHELStream
  participant RegistryClient as RegistryClientProvider
  participant Deserialize as DeserializeImageMetadata
  participant ReleaseImage as ReleaseImage
  participant StreamForName as StreamForName
  participant ImageResolver as Image Resolver (kubevirt/openstack/powervs)

  NodePool->>GetRHELStream: explicitStream, releaseVersion, usesRunc
  GetRHELStream-->>NodePool: resolved stream name (or error)
  
  RegistryClient->>Deserialize: ConfigMap data bytes
  Deserialize-->>RegistryClient: (singleStreamMeta, multiStreamMap, error)
  RegistryClient->>ReleaseImage: populate StreamMetadata + OSStreams
  
  NodePool->>StreamForName: streamName
  StreamForName->>ReleaseImage: lookup in StreamMetadata or OSStreams
  StreamForName-->>NodePool: CoreOSStreamMetadata (or error)
  
  NodePool->>ImageResolver: resolved metadata
  ImageResolver->>ImageResolver: validate StreamMetadata not nil
  ImageResolver-->>NodePool: image artifact (AMI, GCP image, etc.)

Suggested reviewers

  • sdminonne
  • muraee
🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: multi-stream CoreOS metadata parsing and stream resolution logic added across the codebase. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the PR are static string literals with no dynamic content; no fmt.Sprintf, variables, timestamps, UUIDs, or generated identifiers detected in test titles. |
| Test Structure And Quality | ✅ Passed | PR contains only standard Go unit tests (Test* functions with *testing.T), not Ginkgo tests. Check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR introduces only utility functions and metadata parsing; no deployment manifests, scheduling constraints, affinity rules, or topology constraints. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR adds standard Go unit/integration tests only, not Ginkgo e2e tests. No It()/Describe()/Context() patterns found. The custom check does not apply. |
| No-Weak-Crypto | ✅ Passed | No weak crypto algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons found in any modified files. |
| Container-Privileges | ✅ Passed | PR contains only Go source/test files and a ConfigMap fixture with boot image metadata. No container manifests, deployment specs, or privileged security configurations found. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No sensitive data in logs. Error messages log only public metadata (image URLs, AMI IDs, GCP names, release versions) with no passwords, tokens, keys, PII, or secrets. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Jun 4, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@support/releaseinfo/deserialize.go`:
- Around line 41-48: The code currently initializes a zero-value
CoreOSStreamMetadata and returns its address even when the "stream" key is
absent; update the logic in the json unmarshal block and final return so that
when hasStreamData is false the function returns (nil, osStreams, nil) instead
of &coreOSMeta. Concretely, only allocate or populate coreOSMeta when
hasStreamData and json.Unmarshal succeeds (inside the if hasStreamData { ... }
block using json.Unmarshal into a local variable), and change the final return
to return the pointer to that populated struct or nil when hasStreamData was
false (i.e., return nil, osStreams, nil).
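
A minimal sketch of the shape this comment asks for, under simplifying assumptions: `parseBootImages` and `streamMeta` are hypothetical names, the input is the already-decoded ConfigMap `data` map rather than raw ConfigMap bytes, and returning no error when neither key is present is an assumption rather than the behavior of `DeserializeImageMetadata`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// streamMeta is a simplified placeholder for the real stream metadata type.
type streamMeta struct {
	Stream string `json:"stream"`
}

// parseBootImages is a hypothetical helper that reads the legacy "stream" key
// and the multi-stream "streams" key from decoded ConfigMap data.
func parseBootImages(data map[string]string) (*streamMeta, map[string]*streamMeta, error) {
	var osStreams map[string]*streamMeta
	if raw, ok := data["streams"]; ok {
		if err := json.Unmarshal([]byte(raw), &osStreams); err != nil {
			return nil, nil, fmt.Errorf("parsing streams key: %w", err)
		}
	}

	raw, hasStreamData := data["stream"]
	if !hasStreamData {
		// No legacy key: return nil instead of a pointer to a zero-value struct.
		return nil, osStreams, nil
	}

	// Only allocate the legacy metadata once the key is known to exist.
	var meta streamMeta
	if err := json.Unmarshal([]byte(raw), &meta); err != nil {
		return nil, nil, fmt.Errorf("parsing stream key: %w", err)
	}
	return &meta, osStreams, nil
}

func main() {
	legacy, streams, err := parseBootImages(map[string]string{
		"streams": `{"rhel-9":{"stream":"rhel-9"},"rhel-10":{"stream":"rhel-10"}}`,
	})
	fmt.Println(legacy == nil, len(streams), err)
}
```

Returning nil for the absent key lets callers distinguish "no legacy metadata" from "legacy metadata with empty fields", which a zero-value pointer would hide.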

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 39f31f55-c3e5-4714-a09f-53a65264d50e

📥 Commits

Reviewing files that changed from the base of the PR and between e25a87a and e3e1dce.

📒 Files selected for processing (11)
  • hypershift-operator/controllers/nodepool/stream.go
  • hypershift-operator/controllers/nodepool/stream_test.go
  • support/releaseinfo/deserialize.go
  • support/releaseinfo/deserialize_test.go
  • support/releaseinfo/fixtures/5.0-installer-coreos-bootimages.yaml
  • support/releaseinfo/fixtures/fixtures.go
  • support/releaseinfo/registry_mirror_provider.go
  • support/releaseinfo/registryclient_provider.go
  • support/releaseinfo/releaseinfo.go
  • support/releaseinfo/releaseinfo_test.go
  • test/integration/osstreams/osstreams_test.go

Comment thread support/releaseinfo/deserialize.go Outdated
@codecov

codecov Bot commented Jun 4, 2026


Codecov Report

❌ Patch coverage is 96.96970% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.79%. Comparing base (2ce76f6) to head (ba2f870).
⚠️ Report is 22 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| support/releaseinfo/registryclient_provider.go | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8669      +/-   ##
==========================================
+ Coverage   41.67%   41.79%   +0.11%     
==========================================
  Files         758      759       +1     
  Lines       93945    94037      +92     
==========================================
+ Hits        39155    39304     +149     
+ Misses      52043    51983      -60     
- Partials     2747     2750       +3     
| Files with missing lines | Coverage Δ |
| --- | --- |
| hypershift-operator/controllers/nodepool/stream.go | 100.00% <100.00%> (ø) |
| support/releaseinfo/deserialize.go | 89.65% <100.00%> (+49.65%) ⬆️ |
| support/releaseinfo/registry_mirror_provider.go | 45.83% <100.00%> (+2.35%) ⬆️ |
| support/releaseinfo/releaseinfo.go | 51.48% <100.00%> (+5.62%) ⬆️ |
| support/releaseinfo/registryclient_provider.go | 0.00% <0.00%> (ø) |

... and 5 files with indirect coverage changes

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 35.11% <95.23%> (+0.15%) ⬆️ |
| cpo-hostedcontrolplane | 44.10% <ø> (+0.09%) ⬆️ |
| cpo-other | 43.45% <ø> (ø) |
| hypershift-operator | 51.87% <100.00%> (+0.17%) ⬆️ |
| other | 31.56% <ø> (ø) |

Flags with carried forward coverage won't be shown.

@jparrill jparrill force-pushed the CNTRLPLANE-3552 branch 2 times, most recently from 6651f3c to 3121c68 Compare June 4, 2026 15:42

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (3)
support/releaseinfo/deserialize_test.go (2)

20-26: ⚡ Quick win

Mark test helper as a helper for cleaner failure locations.

testConfigMap is a helper but does not mark itself with t.Helper(), which makes failures point to the helper body instead of the calling test.

Suggested change
-func testConfigMap(dataFields map[string]string) []byte {
+func testConfigMap(t *testing.T, dataFields map[string]string) []byte {
+	t.Helper()
 	cm := "apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: test\ndata:\n"
 	for k, v := range dataFields {
 		cm += fmt.Sprintf("  %s: %s\n", k, v)
 	}
 	return []byte(cm)
}
// Update call sites in this file:
data: testConfigMap(t, map[string]string{ ... })

As per coding guidelines, "Use t.Helper() in Go helper functions to improve error tracebacks".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/releaseinfo/deserialize_test.go` around lines 20 - 26, testConfigMap
is a test helper but doesn't call t.Helper() or accept *testing.T, so failures
point into the helper; change the signature of testConfigMap to accept t
*testing.T (e.g., testConfigMap(t *testing.T, dataFields map[string]string)),
call t.Helper() at the top of that function, and update all call sites in this
file (places that call testConfigMap(...)) to pass the test variable (e.g.,
testConfigMap(t, map[string]string{...})).

28-123: ⚡ Quick win

Run these unit tests in parallel.

Both test functions and their subtests can run independently; adding t.Parallel() aligns with repository test guidance and reduces suite runtime.

Suggested change
func TestDeserializeImageMetadata(t *testing.T) {
+	t.Parallel()
 	tests := []struct {
 		...
 	}

 	for _, tt := range tests {
+		tt := tt
 		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()
 			g := NewWithT(t)
 			...
 		})
 	}
}

func TestDeserializeImageMetadataMultiStreamContent(t *testing.T) {
+	t.Parallel()
 	defaultStream, osStreams, err := DeserializeImageMetadata(fixtures.CoreOSBootImagesYAML_5_0)
 	...

 	for _, tt := range tests {
+		tt := tt
 		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()
 			tt.assert(NewWithT(t))
 		})
 	}
}

As per coding guidelines, "Unit tests should use race detection and parallel execution".

Also applies to: 125-229

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/releaseinfo/deserialize_test.go` around lines 28 - 123, Add parallel
execution to the tests: call t.Parallel() at the start of the top-level
TestDeserializeImageMetadata function and also inside each subtest goroutine
(the func passed to t.Run) so subtests run in parallel; do the same for the
other test function referenced (the one around lines 125-229). Locate the test
functions by name (TestDeserializeImageMetadata and the other test function in
the same file) and add t.Parallel() both at the top of each test and at the
start of each t.Run subtest closure, ensuring no shared mutable state is
assumed.
hypershift-operator/controllers/nodepool/stream_test.go (1)

11-171: ⚡ Quick win

Enable parallel execution for this table-driven test.

This suite is a good candidate for t.Parallel() at both parent and subtest level.

Suggested change
func TestGetRHELStream(t *testing.T) {
+	t.Parallel()
 	tests := []struct {
 		...
 	}

 	for _, tt := range tests {
+		tt := tt
 		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()
 			g := NewWithT(t)

 			result, err := GetRHELStream(tt.explicitStream, tt.releaseVersion, tt.usesRunc)
 			...
 		})
 	}
}

As per coding guidelines, "Unit tests should use race detection and parallel execution".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/stream_test.go` around lines 11 -
171, The test TestGetRHELStream should enable parallel execution to follow
guidelines; add t.Parallel() as the first statement in TestGetRHELStream and
also call t.Parallel() at the start of each subtest inside the t.Run closure so
each case runs concurrently; locate TestGetRHELStream and the anonymous func
passed to t.Run around the table loop and insert the t.Parallel() calls while
ensuring no shared mutable state is accessed when invoking GetRHELStream.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c73114df-faea-4ddf-bd9d-f0e03b64c080

📥 Commits

Reviewing files that changed from the base of the PR and between e3e1dce and 6651f3c.

📒 Files selected for processing (11)
  • hypershift-operator/controllers/nodepool/stream.go
  • hypershift-operator/controllers/nodepool/stream_test.go
  • support/releaseinfo/deserialize.go
  • support/releaseinfo/deserialize_test.go
  • support/releaseinfo/fixtures/5.0-installer-coreos-bootimages.yaml
  • support/releaseinfo/fixtures/fixtures.go
  • support/releaseinfo/registry_mirror_provider.go
  • support/releaseinfo/registryclient_provider.go
  • support/releaseinfo/releaseinfo.go
  • support/releaseinfo/releaseinfo_test.go
  • test/integration/osstreams/osstreams_test.go
🚧 Files skipped from review as they are similar to previous changes (9)
  • support/releaseinfo/fixtures/fixtures.go
  • support/releaseinfo/releaseinfo_test.go
  • support/releaseinfo/releaseinfo.go
  • support/releaseinfo/registryclient_provider.go
  • support/releaseinfo/registry_mirror_provider.go
  • hypershift-operator/controllers/nodepool/stream.go
  • support/releaseinfo/deserialize.go
  • support/releaseinfo/fixtures/5.0-installer-coreos-bootimages.yaml
  • test/integration/osstreams/osstreams_test.go

@jparrill

jparrill commented Jun 8, 2026

Contributor Author

/hold

We need to discuss the best approach for the dependent PR #8673.

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 8, 2026
@openshift-ci openshift-ci Bot added area/documentation Indicates the PR includes changes for documentation area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/platform/openstack PR/issue for OpenStack (OpenStackPlatform) platform area/platform/powervs PR/issue for PowerVS (PowerVSPlatform) platform labels Jun 8, 2026
@jparrill jparrill force-pushed the CNTRLPLANE-3552 branch 2 times, most recently from ce055e3 to 6eadd8e Compare June 8, 2026 11:04
@openshift-merge-bot

Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci


AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2066455597015896064 | Cost: $4.283351750000001 | Failed step: hypershift-azure-run-e2e



Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@jparrill

Contributor Author

/retest

@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 16, 2026
@jparrill

Contributor Author

Rebased

}

if !isOCP5Plus {
	return "", nil

Member

I wonder if we should be explicit here, we can always follow up.

Contributor Author

Good point. The function's job is to resolve the stream, and for <5.0 that's always rhel-9. Returning empty string here is intentional though — it signals "legacy single-stream behavior" to callers. The config hash (Phase 2, CNTRLPLANE-3553) only includes the stream when spec.osImageStream.name is explicitly set by the user, so this return value never enters the hash. Happy to make it explicit if you prefer — it won't cause rollouts either way.

	return nil, err
}

if coreOSMeta == nil && len(osStreams) > 0 {

Member

Where is this choice coming from? Can we validate with the owner of this API what the expectation and contract for consumers are?
Then, at minimum, add a // comment with the rationale for that choice.

Contributor Author

Validated with the MCO team (Slack thread). The legacy stream key in the coreos-bootimages ConfigMap is frozen to rhel-9 and will be present until rhel-9 EoL (OCP 5.3). There's no "default stream" field in the ConfigMap itself — the installer injects a separate OSImageStream CR with spec.defaultStream (installer source), but HyperShift doesn't use installer assets, so we generate our own in Phase 2 (CNTRLPLANE-3553).

The fallback now explicitly looks up rhel-9 (instead of iterating the map) and lives in DeserializeImageMetadata — removed the redundant copy from Lookup() since DeserializeImageMetadata already handles it before returning. Added a comment with the rationale and the Slack reference.
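
To illustrate why an explicit key lookup is preferable to iterating the map: Go does not guarantee map iteration order, so "the first available stream" could differ between runs. The sketch below shows the explicit-lookup form described in this reply; `pickFallback`, `fallbackStream`, and the string-valued map are hypothetical simplifications, not the code in this PR.

```go
package main

import "fmt"

// fallbackStream is the stream the legacy key is described as frozen to.
const fallbackStream = "rhel-9"

// pickFallback is a hypothetical helper. It looks up one well-known key
// instead of ranging over the map, so the result never depends on
// Go's unspecified map iteration order.
func pickFallback(streams map[string]string) (string, error) {
	v, ok := streams[fallbackStream]
	if !ok {
		return "", fmt.Errorf("stream %q not present", fallbackStream)
	}
	return v, nil
}

func main() {
	streams := map[string]string{
		"rhel-10": "rhel-10 boot images",
		"rhel-9":  "rhel-9 boot images",
	}
	v, err := pickFallback(streams)
	fmt.Println(v, err)
}
```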

@jparrill jparrill force-pushed the CNTRLPLANE-3552 branch 2 times, most recently from d28408c to ee41ea1 Compare June 16, 2026 12:18
Comment thread support/releaseinfo/releaseinfo.go Outdated
if i.OSStreams == nil {
	return nil, fmt.Errorf("stream %q not found: no multi-stream metadata available", name)
}
meta, ok := i.OSStreams[name]

Member

For < 5.0 with an explicit rhel-9, this will need to fall back to returning i.StreamMetadata, since i.OSStreams[name] won't exist, right?

Contributor Author

You're right — the legacy stream key is guaranteed present until OCP 5.3 (confirmed with MCO team), so the fallback was dead code. And when stream does eventually go away, consumers should explicitly pick their stream via StreamForName() rather than silently getting rhel-9.

Removed the fallback from DeserializeImageMetadata. Updated StreamForName() so that when OSStreams is not nil (>= 5.0 payload), it only looks in OSStreams and errors if the stream isn't found. When OSStreams is nil (< 5.0 payload), it falls back to StreamMetadata as you described.
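
The lookup behavior described in this reply can be sketched roughly as follows. The types here are simplified placeholders (`releaseImage`, `streamMeta`, and `streamForName` are hypothetical lowercase stand-ins), and returning `StreamMetadata` for any requested name on a < 5.0 payload is an assumption of the sketch, not a confirmed detail of `StreamForName()`.

```go
package main

import "fmt"

// streamMeta is a simplified placeholder for per-stream boot image metadata.
type streamMeta struct {
	Stream string
}

// releaseImage is a simplified placeholder holding both metadata forms.
type releaseImage struct {
	StreamMetadata *streamMeta            // parsed from the legacy "stream" key
	OSStreams      map[string]*streamMeta // parsed from "streams"; nil for < 5.0 payloads
}

// streamForName looks only in OSStreams when it is set, and falls back to
// the legacy StreamMetadata when it is nil.
func (i *releaseImage) streamForName(name string) (*streamMeta, error) {
	if i.OSStreams != nil {
		meta, ok := i.OSStreams[name]
		if !ok {
			return nil, fmt.Errorf("stream %q not found in multi-stream metadata", name)
		}
		return meta, nil
	}
	if i.StreamMetadata == nil {
		return nil, fmt.Errorf("stream %q not found: no stream metadata available", name)
	}
	return i.StreamMetadata, nil
}

func main() {
	legacy := &releaseImage{StreamMetadata: &streamMeta{Stream: "rhel-9"}}
	multi := &releaseImage{
		StreamMetadata: &streamMeta{Stream: "rhel-9"},
		OSStreams: map[string]*streamMeta{
			"rhel-9":  {Stream: "rhel-9"},
			"rhel-10": {Stream: "rhel-10"},
		},
	}

	m, err := legacy.streamForName("rhel-9")
	fmt.Println(m.Stream, err)

	_, err = multi.streamForName("rhel-11")
	fmt.Println(err)
}
```

With this split, a >= 5.0 payload never silently hands back rhel-9 for a stream it does not carry.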

…m resolution

Add support for parsing the new multi-stream boot image ConfigMap format
introduced in OCP 5.0 payloads. The ConfigMap now carries a "streams" key
alongside the legacy "stream" key, mapping stream names (rhel-9, rhel-10)
to per-architecture boot image metadata.

- Update DeserializeImageMetadata to parse both "streams" and "stream"
  keys, returning the parsed OSStreams map alongside the default metadata
- Add OSStreams field to ReleaseImage for holding per-stream metadata
- Add StreamForName convenience method on ReleaseImage for stream lookup
- Add GetRHELStream pure function implementing the stream resolution
  table from the dual-stream RHEL enhancement
- Add 5.0 boot image fixture extracted from 5.0.0-ec.2 release payload
- Fallback: when ConfigMap has only "streams" without "stream", populate
  StreamMetadata from the first available stream to prevent nil panics
  in platform consumers

Signed-off-by: Juan Manuel Parrilla Madrid <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Juan Manuel Parrilla Madrid <[email protected]>
@enxebre

enxebre commented Jun 16, 2026

Member

/approve

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdminonne

Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 16, 2026
@openshift-merge-bot

Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@jparrill

Contributor Author

/retest-required

@jparrill

Contributor Author

/retest-required

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 16, 2026


The PR changes CoreOS metadata parsing and stream resolution code — entirely in the HyperShift operator and release info support layer. These changes are not used during the management cluster installation (IPI install), which is what failed. The failure is in the pre phase (cluster installation), long before any HyperShift operator code runs.

I now have enough evidence to produce the final report. This is a management cluster infrastructure installation failure — the bootstrap process timed out because the single worker node (c5n.metal) never became ready, which prevented the ingress router from scheduling, which in turn prevented bootstrap completion. This is unrelated to the PR changes.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Bootstrap failed to complete: timed out waiting for the condition
Failed to wait for bootstrapping to complete. This error usually happens when there is a problem
with control plane hosts that prevents the control plane operators from creating the control plane.

Cluster operator ingress Available is False: Deployment does not have minimum availability.
  → Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...
Cluster operator authentication Available is False: APIServicesAvailable: PreconditionNotReady
Cluster operator monitoring Available is False: creating Route object failed: the server could not
  find the requested resource (post routes.route.openshift.io)
Cluster operator openshift-apiserver Available is False: APIServicesAvailable: PreconditionNotReady

Installer exit with code 5

Summary

The management cluster IPI installation failed during the pre phase with a bootstrap timeout (exit code 5) after 1h1m. The 3 control-plane master nodes provisioned and the Kubernetes API became available at 20:09, but the bootstrap process — which waited until 20:57 (the full 45-minute timeout) — never completed because the ingress controller's router deployment could not achieve minimum availability (0/1 replicas ready). The install-config specified a single c5n.metal bare-metal worker node, and this worker node appears to have never joined the cluster or become schedulable within the timeout window. Without a schedulable worker, the router pod had nowhere to run, which cascaded into failures for authentication, monitoring, openshift-apiserver, and console operators — all of which depend on ingress/routes. This is an infrastructure-level installation failure completely unrelated to PR #8669, which modifies only HyperShift operator CoreOS metadata parsing code that doesn't execute until after the management cluster is fully installed.

Root Cause

The root cause is a management cluster bootstrap timeout due to the worker node not joining the cluster. The failure chain is:

  1. Worker node never became ready: The install-config requested 1 worker node of type c5n.metal (bare-metal instance). While the 3 master nodes provisioned successfully and the API became available at 20:09:32Z, the single worker node did not join the cluster within the 45-minute bootstrap timeout.

  2. Router deployment stuck at 0/1 replicas: The default ingress controller's router deployment requires a worker node to schedule. With no schedulable worker, the deployment stayed at MinimumReplicasUnavailable for the entire bootstrap window.

  3. Cascading operator failures: Multiple operators depend on ingress/routes:

    • ingress: Degraded — router deployment has 0/1 available replicas
    • authentication: Unavailable — oauth-server endpoints not found (depends on ingress)
    • monitoring: Unavailable — cannot create Route objects (routes.route.openshift.io API not available)
    • openshift-apiserver: Unavailable — PreconditionNotReady (depends on authentication)
    • console: Unknown — no data (depends on routes)
  4. Bootstrap timed out at 20:57:04Z: After waiting the full 45 minutes, the installer declared bootstrap failure.

Why this is unrelated to PR #8669: The PR modifies CoreOS metadata parsing in hypershift-operator/controllers/nodepool/stream.go, support/releaseinfo/deserialize.go, and related files. These changes affect the HyperShift operator's nodepool controller, which only runs inside an already-installed management cluster to manage hosted cluster nodepools. The failure occurred during the management cluster's IPI installation — phase pre — before any HyperShift operator code executes. The c5n.metal bare-metal worker provisioning and bootstrap process is entirely within the OpenShift installer domain.

This is likely a CI infrastructure flake: c5n.metal instances are bare-metal instances that can have longer provisioning times and occasional availability issues in AWS.

Recommendations
  1. Retry the job: This is an infrastructure-level installation flake unrelated to the PR changes. Re-triggering the e2e-kubevirt-aws-ovn-reduced job should succeed if the c5n.metal instance provisions normally.

  2. No code changes needed: PR #8669 (CoreOS metadata parsing) does not touch any code involved in IPI cluster installation. The HyperShift operator code only runs after the management cluster is fully installed.

  3. Monitor for recurrence: If the e2e-kubevirt-aws-ovn-reduced job fails repeatedly with bootstrap timeouts, the issue may be with c5n.metal instance availability in the us-east-1 region or with the CI infrastructure's AWS quota slice.

  4. Consider checking Sippy: Cross-reference this job name in Sippy to determine if bootstrap timeout failures are a known recurring pattern for this particular job configuration.

Evidence

| Evidence | Detail |
| --- | --- |
| Failed step | ipi-install-install (pre phase, exit code 5) |
| Failure type | Bootstrap timeout (management cluster install) |
| Install duration | 1h1m12s (19:56:36Z → 20:57:48Z) |
| Control-plane ready | 20:09:32Z (masters provisioned, API up) |
| Bootstrap timeout | 20:57:04Z (45-minute wait expired) |
| Worker config | 1x c5n.metal in us-east-1 (zones b, d, f) |
| Master config | 3x m6i.2xlarge in us-east-1 (zones b, d, f) |
| Ingress status | Degraded: 0/1 router replicas available |
| Authentication status | Unavailable: oauth-server endpoints not found |
| Monitoring status | Unavailable: cannot create Route objects |
| openshift-apiserver | Unavailable: PreconditionNotReady |
| PR #8669 files changed | stream.go, deserialize.go, releaseinfo.go (HyperShift operator / release info) |
| PR relevance | None: HyperShift operator code not executed during IPI install |
| Build log | .work/prow-job-analyze-test-failure/2066966692289843200/logs/build-log.txt |
| Install log | .work/prow-job-analyze-test-failure/2066966692289843200/logs/openshift_install.log |
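
The timeline figures in the evidence can be cross-checked with a few lines of Go. This is only a sketch for reviewers who want to verify the arithmetic: the `elapsed` helper is hypothetical (it is not part of the PR or of any CI tooling), and it assumes all timestamps fall on the same UTC day, which the report implies but does not state.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// elapsed is a hypothetical helper: it parses two same-day UTC clock
// times such as "19:56:36Z" and returns the duration from start to end.
func elapsed(start, end string) (time.Duration, error) {
	const layout = "15:04:05"
	s, err := time.Parse(layout, strings.TrimSuffix(start, "Z"))
	if err != nil {
		return 0, err
	}
	e, err := time.Parse(layout, strings.TrimSuffix(end, "Z"))
	if err != nil {
		return 0, err
	}
	return e.Sub(s), nil
}

func main() {
	// Timestamps copied from the evidence in the failure report.
	spans := []struct{ label, start, end string }{
		{"install start -> installer exit", "19:56:36Z", "20:57:48Z"},
		{"API available -> bootstrap timeout", "20:09:32Z", "20:57:04Z"},
	}
	for _, s := range spans {
		d, err := elapsed(s.start, s.end)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s: %s\n", s.label, d)
	}
}
```

The first span reproduces the reported install duration of 1h1m12s. The second comes out at 47m32s, slightly longer than the 45-minute bootstrap wait, which suggests the installer's wait began a little after the API first became available; the report does not give the exact start of that window.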

@jparrill

Contributor Author

/override ci/prow/e2e-kubevirt-aws-ovn-reduced
This test e2e-kubevirt-aws-ovn-reduced is failing systematically in all the PRs, overriding...

@jparrill

Contributor Author

/verified by unit tests

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

@jparrill: Overrode contexts on behalf of jparrill: ci/prow/e2e-kubevirt-aws-ovn-reduced

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 16, 2026
@openshift-ci-robot


@jparrill: This PR has been marked as verified by unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

@jparrill: all tests passed!

Full PR test history. Your PR dashboard.


@openshift-merge-bot openshift-merge-bot Bot merged commit 03f89c3 into openshift:main Jun 17, 2026
43 checks passed

Labels

• approved: Indicates a PR has been approved by an approver from all required OWNERS files.
• area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
• area/documentation: Indicates the PR includes changes for documentation
• area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
• area/platform/aws: PR/issue for AWS (AWSPlatform) platform
• area/platform/azure: PR/issue for Azure (AzurePlatform) platform
• area/platform/gcp: PR/issue for GCP (GCPPlatform) platform
• area/platform/kubevirt: PR/issue for KubeVirt (KubevirtPlatform) platform
• area/platform/openstack: PR/issue for OpenStack (OpenStackPlatform) platform
• area/platform/powervs: PR/issue for PowerVS (PowerVSPlatform) platform
• area/testing: Indicates the PR includes changes for e2e testing
• jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
• lgtm: Indicates that a PR is ready to be merged.
• verified: Signifies that the PR passed pre-merge verification criteria
