OCPBUGS-86415: Use canonical image for kube-apiserver-proxy static pod #8742

Draft
csrwng wants to merge 1 commit into openshift:main from csrwng:OCPBUGS-86415

Conversation

@csrwng csrwng commented Jun 16, 2026

Summary

  • Fix the kube-apiserver-proxy static pod image in the ignition payload being unnecessarily rewritten by --registry-overrides. Data plane nodes use CRI-O which handles mirroring natively via IDMS/ICSP, so the canonical image reference should be used.
  • Gate the fix behind a hypershift.openshift.io/canonical-data-plane-images annotation to avoid triggering rollouts on existing stable NodePools. The annotation is set automatically on new NodePools and during version upgrades.

Test plan

  • Verify existing HAProxy image resolution tests pass (TestResolveHAProxyImage)
  • Verify new NodePools get canonical (non-overridden) HAProxy image in static pod manifest
  • Verify upgrading NodePools switch to canonical image during version upgrade
  • Verify stable NodePools with no annotation preserve the existing (overridden) image
  • Verify annotation-specified HAProxy images are not affected by override reversal
  • Run e2e-aws-upgrade-hypershift-operator to validate rollout safety

Fixes: https://issues.redhat.com/browse/OCPBUGS-86415

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features
    • Introduced new annotation to control whether canonical data-plane images are used for HAProxy.
    • Enhanced HAProxy image resolution with improved registry override handling and automatic detection of node pool changes.

The RegistryMirrorProviderDecorator rewrites all component image
references using --registry-overrides, including the haproxy-router
image used for the kube-apiserver-proxy static pod. This rewritten
image is embedded in the ignition payload sent to data plane nodes.

This is incorrect because data plane nodes run CRI-O on RHCOS, which
handles mirroring natively via IDMS/ICSP configured in the ignition
payload. The rewritten image may also point to a mirror accessible
only from the management cluster, causing pull failures on data plane
nodes.

Fix: reverse registry overrides on the HAProxy image when it comes
from the release payload, so the static pod manifest uses the
canonical image reference. CRI-O mirroring on the data plane handles
resolution from the correct mirror transparently.

To avoid triggering rollouts on existing stable NodePools, the fix is
gated behind a new annotation (canonical-data-plane-images) that is
set automatically on new NodePools and during version upgrades, when
a rollout is already happening.
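
The override reversal described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names `canonicalImage` and `hasRepoPrefix`, the example registries, and the assumption that overrides are held as a map from canonical source prefix to mirror prefix are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// hasRepoPrefix reports whether image starts with prefix on a repository
// boundary, so "mirror/ocp" does not match "mirror/ocp-other/...".
func hasRepoPrefix(image, prefix string) bool {
	if prefix == "" || !strings.HasPrefix(image, prefix) {
		return false
	}
	rest := image[len(prefix):]
	return rest == "" || strings.ContainsRune("/:@", rune(rest[0]))
}

// canonicalImage is a hypothetical helper. overrides is assumed to map a
// canonical source prefix to the mirror prefix that replaced it. If image
// starts with a mirror prefix, the prefix is swapped back to its source;
// otherwise image is returned unchanged.
func canonicalImage(image string, overrides map[string]string) string {
	bestMirror, bestSource := "", ""
	for source, mirror := range overrides {
		// Prefer the longest matching mirror so the result does not
		// depend on map iteration order.
		if hasRepoPrefix(image, mirror) && len(mirror) > len(bestMirror) {
			bestMirror, bestSource = mirror, source
		}
	}
	if bestMirror == "" {
		return image
	}
	return bestSource + strings.TrimPrefix(image, bestMirror)
}

func main() {
	// Example registries are made up for illustration.
	overrides := map[string]string{
		"quay.io/openshift-release-dev": "mirror.mgmt.example.com/ocp",
	}
	fmt.Println(canonicalImage("mirror.mgmt.example.com/ocp/ocp-v4.0-art-dev@sha256:abc", overrides))
}
```

An image that does not match any mirror prefix passes through untouched in this sketch.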

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@openshift-merge-bot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 16, 2026
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 16, 2026
@openshift-ci-robot

@csrwng: This pull request references Jira Issue OCPBUGS-86415, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/needs-area area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release labels Jun 16, 2026
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 259da9cf-6572-4df9-b938-dc9cbec788dc

📥 Commits

Reviewing files that changed from the base of the PR and between 392fd5a and e7c2be0.

📒 Files selected for processing (3)
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • support/releaseinfo/fake/fake.go

📝 Walkthrough

Walkthrough

A new NodePool annotation constant (nodePoolAnnotationCanonicalDataPlaneImages) is introduced to track whether canonical (non-registry-overridden) data-plane images should be used for HAProxy. resolveHAProxyImage gains two new parameters (useCanonicalImages and releaseProvider); when useCanonicalImages is true and the provider has registry overrides, it reverses those mappings to rewrite the resolved image back to its canonical reference. generateHAProxyRawConfig reads the annotation and, for new or upgrading NodePools (detected by comparing Status.Version to the release image version), sets the annotation to "true" before invoking the updated resolver. FakeReleaseProvider gains a FakeRegistryOverrides field and its GetRegistryOverrides() method is updated accordingly to support new test cases.
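
The gating in this walkthrough can be pictured with a small sketch. It assumes a simplified model in which the NodePool's annotations are a plain map and versions are plain strings; the function `decideCanonical` and its return values are hypothetical and do not mirror the real `generateHAProxyRawConfig` signature. Only the annotation key is taken from the PR description.

```go
package main

import "fmt"

// Annotation key as given in the PR description.
const canonicalDataPlaneImagesAnnotation = "hypershift.openshift.io/canonical-data-plane-images"

// decideCanonical is a hypothetical, simplified stand-in for the gating.
// It returns whether canonical images should be used and whether the caller
// should now set the annotation to "true" on the NodePool.
func decideCanonical(annotations map[string]string, statusVersion, releaseVersion string) (useCanonical, setAnnotation bool) {
	// Already opted in: keep using canonical images, nothing to set.
	if annotations[canonicalDataPlaneImagesAnnotation] == "true" {
		return true, false
	}
	// A new NodePool (assumed here to have an empty status version) or an
	// upgrading one differs from the release version, so a rollout is
	// already happening and opting in causes no extra rollout.
	if statusVersion != releaseVersion {
		return true, true
	}
	// Stable NodePool without the annotation: preserve the existing image.
	return false, false
}

func main() {
	use, set := decideCanonical(map[string]string{}, "4.17.0", "4.17.0")
	fmt.Println("stable pool:", use, set)
	use, set = decideCanonical(map[string]string{}, "4.17.0", "4.18.0")
	fmt.Println("upgrading pool:", use, set)
}
```

Returning the decision instead of mutating the map keeps the sketch safe for a nil annotations map; the actual controller may well update the object in place.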

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
--- | --- | --- | ---
Test Structure And Quality | ⚠️ Warning | The TestResolveHAProxyImage test violates the "single responsibility" principle: it tests 13+ distinct scenarios in one test function, making diagnosis of failures difficult. Many assertions lack m... | Split into separate test functions or add descriptive messages to all Expect() calls. For example: g.Expect(err).ToNot(HaveOccurred(), "failed to unmarshal MachineConfig")
✅ Passed checks (10 passed)
Check name | Status | Explanation
--- | --- | ---
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The PR title accurately describes the main change: using canonical images for the kube-apiserver-proxy (HAProxy) static pod instead of rewritten registry-override images.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | All test case names in TestResolveHAProxyImage are static, descriptive strings with no dynamic information like pod names, timestamps, UUIDs, node names, namespaces, or IP addresses. Test names are...
Topology-Aware Scheduling Compatibility | ✅ Passed | PR does not introduce scheduling constraints; it only modifies image reference resolution logic for HAProxy in nodepool controller, with no pod specs, affinity rules, or topology-dependent constrai...
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests (It/Describe/Context/When) were added. The PR contains only standard Go unit tests using testing.T, not Ginkgo patterns. The custom check applies only to Ginkgo e2e tests.
No-Weak-Crypto | ✅ Passed | No weak crypto (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons found. PR modifies HAProxy image resolution using string manipulati...
Container-Privileges | ✅ Passed | PR modifies only HAProxy image resolution logic and test coverage; no privileged containers, host access, or security escalation settings are introduced.
No-Sensitive-Data-In-Logs | ✅ Passed | No sensitive data exposed in logs. The new resolveHAProxyImage and modified generateHAProxyRawConfig functions contain no logging. All existing logs use generic messages. Test assertions with image...

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2026
@codecov

codecov Bot commented Jun 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.76%. Comparing base (392fd5a) to head (e7c2be0).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8742   +/-   ##
=======================================
  Coverage   41.75%   41.76%           
=======================================
  Files         758      758           
  Lines       93981    93995   +14     
=======================================
+ Hits        39240    39254   +14     
  Misses      51988    51988           
  Partials     2753     2753           
Files with missing lines Coverage Δ
...erator/controllers/nodepool/nodepool_controller.go 44.15% <100.00%> (+0.89%) ⬆️
Flag Coverage Δ
cmd-support 35.02% <ø> (ø)
cpo-hostedcontrolplane 44.10% <ø> (ø)
cpo-other 43.45% <ø> (ø)
hypershift-operator 51.85% <100.00%> (+0.02%) ⬆️
other 31.56% <ø> (ø)

@hypershift-jira-solve-ci

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

hypershift-operator/controllers/nodepool/nodepool_controller.go: needs update
support/releaseinfo/fake/fake.go: needs update
##[error]Process completed with exit code 1.

Summary

The verify job runs make generate update, make staticcheck, make fmt, and make vet, then checks that no files were modified by these steps (i.e., the committed code must already be properly formatted and generated). Two files in the PR have incorrect gofmt alignment: the PR manually adjusted whitespace around new identifiers but did not run gofmt afterward, leaving the alignment inconsistent with what gofmt produces. The fix is to run make fmt (or gofmt -w) on both files and commit the result.

Root Cause

The root cause is inconsistent Go source formatting (gofmt alignment) in two files modified by the PR. When Go has a group of declarations (constants in a const block or fields in a struct), gofmt aligns them to a consistent column based on the longest identifier. The PR introduced new identifiers but did not let gofmt re-normalize the alignment:

  1. hypershift-operator/controllers/nodepool/nodepool_controller.go — The PR added a new constant nodePoolAnnotationCanonicalDataPlaneImages (the longest name in the group) and manually re-aligned two existing constants (nodePoolAnnotationPlatformMachineTemplate and nodePoolAnnotationTaints) to the wider column. However, it did not re-align the adjacent constant nodePoolCoreIgnitionConfigLabel, which still uses the old, narrower alignment. gofmt treats the entire block as one alignment group and normalizes all of them, changing nodePoolCoreIgnitionConfigLabel's spacing.

  2. support/releaseinfo/fake/fake.go — The PR added a new struct field FakeRegistryOverrides (the longest field name) and manually widened the padding on ImageVersion and Components to 8+ spaces. But gofmt aligns struct fields differently: it sets the type column to one tab-stop after the longest field name, producing different padding than what was manually typed. Running gofmt changes the whitespace on all three fields.

In both cases, the committed code is not gofmt-canonical. The verify workflow detects this because make fmt (which invokes gofmt) modifies the files, and the subsequent git update-index --refresh / git diff check fails.
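
The alignment behaviour described above can be reproduced with the standard library's `go/format` package, which applies gofmt-style formatting. The following is a small sketch; the constant names inside the sample source are made up and only their relative lengths matter.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
)

// gofmtSource returns src formatted the way gofmt would format it.
func gofmtSource(src []byte) ([]byte, error) {
	return format.Source(src)
}

// isGofmtCanonical reports whether src is byte-for-byte identical to its
// gofmt-formatted form, which is in effect what a dirty-tree check after
// a formatting step tests.
func isGofmtCanonical(src []byte) (bool, error) {
	formatted, err := gofmtSource(src)
	if err != nil {
		return false, err
	}
	return bytes.Equal(src, formatted), nil
}

// misaligned is a const block whose "=" signs were not re-aligned after a
// longer identifier was added (hypothetical names).
const misaligned = "package p\n\nconst (\n\tshortName = \"a\"\n\tmuchLongerConstantName = \"b\"\n)\n"

func main() {
	src := []byte(misaligned)
	ok, err := isGofmtCanonical(src)
	if err != nil {
		panic(err)
	}
	fmt.Println("canonical before formatting:", ok)

	fixed, err := gofmtSource(src)
	if err != nil {
		panic(err)
	}
	// The printed block has both "=" signs in the same column.
	fmt.Print(string(fixed))
}
```

Because formatting is idempotent, running the check again on the formatted output reports it as canonical, which is why committing the result of the formatter clears this kind of failure.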

Recommendations
  1. Run make fmt and commit the result. This is the only fix needed — it will normalize the alignment in both files to match gofmt output:

    make fmt
    git add hypershift-operator/controllers/nodepool/nodepool_controller.go support/releaseinfo/fake/fake.go
    git commit --amend  # or a new commit
  2. Pre-push workflow: Before pushing, always run make generate update && make fmt && make vet and verify no uncommitted changes remain (git diff --exit-code). This is exactly what the verify CI checks.

  3. Editor configuration: Ensure your editor runs gofmt (or goimports) on save for .go files. This prevents manual alignment from diverging from gofmt output.

Evidence
Evidence | Detail
--- | ---
Failed step | git update-index --refresh: detects files modified by prior make generate update / make fmt steps
Dirty file 1 | hypershift-operator/controllers/nodepool/nodepool_controller.go: constant nodePoolCoreIgnitionConfigLabel not re-aligned to match new wider group containing nodePoolAnnotationCanonicalDataPlaneImages
Dirty file 2 | support/releaseinfo/fake/fake.go: struct fields ImageVersion, Components, FakeRegistryOverrides have manually-typed padding inconsistent with gofmt output
CI log evidence | hypershift-operator/controllers/nodepool/nodepool_controller.go: needs update and support/releaseinfo/fake/fake.go: needs update logged at 18:18:01 UTC
All prior steps passed | make generate update ✅, make staticcheck ✅, make fmt ✅, make vet ✅. The fmt step itself succeeded (it reformatted the files silently), and the dirty-tree check then caught the diff
Root cause type | Code formatting; not a product bug, test flake, or infrastructure issue

nodePoolAnnotationTaints = "hypershift.openshift.io/nodePoolTaints"
nodePoolAnnotationPlatformMachineTemplate = "hypershift.openshift.io/nodePoolPlatformMachineTemplate"
nodePoolAnnotationTaints = "hypershift.openshift.io/nodePoolTaints"
nodePoolAnnotationCanonicalDataPlaneImages = "hypershift.openshift.io/canonical-data-plane-images"

Can we add a // doc comment for this annotation? Your PR description seems pretty explanatory: "Gate the fix behind a hypershift.openshift.io/canonical-data-plane-images annotation to avoid triggering rollouts on existing stable NodePools. The annotation is set automatically on new NodePools and during version upgrades."
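
One possible shape for such a doc comment, reusing the wording of the PR description, is sketched below. The constant name and value come from the diff above; the comment text is only a suggestion, and the `main` function exists solely to keep the snippet runnable on its own.

```go
package main

import "fmt"

const (
	// nodePoolAnnotationCanonicalDataPlaneImages gates the use of canonical
	// (non-registry-overridden) image references in the data-plane ignition
	// payload, such as the kube-apiserver-proxy static pod image.
	// It is set automatically on new NodePools and during version upgrades,
	// so that existing stable NodePools do not get an extra rollout.
	nodePoolAnnotationCanonicalDataPlaneImages = "hypershift.openshift.io/canonical-data-plane-images"
)

func main() {
	fmt.Println(nodePoolAnnotationCanonicalDataPlaneImages)
}
```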
