
[release-4.22] OCPBUGS-88356: fix(cpo): deduplicate VPC endpoint subnets by AZ #8724

Open
reedcort wants to merge 1 commit into
openshift:release-4.22 from
reedcort:backport-OCPBUGS-82443-to-release-4.22

@reedcort

@reedcort reedcort commented Jun 11, 2026

Contributor

What this PR does / why we need it:

Backport of #8651 to release-4.22.

When an HCP cluster has multiple NodePools with subnets in the same AWS availability zone, the CPO's VPC endpoint
reconciliation fails indefinitely with DuplicateSubnetsInSameZone. This PR adds AZ-aware subnet deduplication
in the CPO using an in-memory cache on the reconciler.

Manually cherry-picked due to code structure differences between main and release-4.22 (endpoint logic is
inline in reconcileAWSEndpointService on 4.22 vs extracted into helper functions on main).

Which issue(s) this PR fixes:

Fixes OCPBUGS-88356

Special notes for your reviewer:

  • Same logic as the main PR, adapted to the 4.22 code structure
  • No API/CRD changes — uses in-memory cache for easy backporting
  • ROSA managed policy update tracked in ROSAENG-57993

Checklist:

  • Subject and description added to both the commit and the PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

…SubnetsInSameZone

When an HCP cluster has multiple NodePools with subnets in the same AWS
availability zone, the CPO's VPC endpoint reconciliation fails with
DuplicateSubnetsInSameZone because AWS allows at most one subnet per AZ
per endpoint.

Add a deduplicateSubnetsByAZ method on the reconciler that calls
DescribeSubnets to resolve AZ membership, groups subnets by AZ, and
picks one per AZ (lexicographically first for determinism). The
subnet-to-AZ mapping is cached in an in-memory map on the reconciler
to avoid redundant AWS API calls across reconcile loops.

On DescribeSubnets failure the controller gracefully degrades by
proceeding with the original subnet list, preserving existing behavior.

Also adds ec2:DescribeSubnets to the three CPO IAM policies that lacked
it. The ROSA-managed ROSAControlPlaneOperatorPolicy requires a separate
update with AWS (tracked in ROSAENG-57993).

Signed-off-by: Cortney Reed <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label Jun 11, 2026
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/severity-important and jira/valid-reference labels Jun 11, 2026
@openshift-ci-robot


@reedcort: This pull request references Jira Issue OCPBUGS-82443, which is invalid:

  • expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "5.0.0" instead
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is MODIFIED instead
  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-82443 to depend on a bug targeting a version in 5.0.0 and in one of the following states: MODIFIED, ON_QA, VERIFIED, but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug label Jun 11, 2026
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 572a1a03-93fc-4f34-8dc1-96302b223a20

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the do-not-merge/needs-area and area/cli labels Jun 11, 2026
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: reedcort
Once this PR has been reviewed and has the lgtm label, please assign muraee for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the area/control-plane-operator and area/platform/aws labels and removed the do-not-merge/needs-area label Jun 11, 2026
@reedcort reedcort marked this pull request as ready for review June 11, 2026 19:05
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress label Jun 11, 2026
@openshift-ci openshift-ci Bot requested review from enxebre and sjenning June 11, 2026 19:06
@reedcort

Contributor Author

/jira refresh

@openshift-ci-robot


@reedcort: This pull request references Jira Issue OCPBUGS-82443, which is invalid:

  • expected the bug to target only the "4.22.0" version, but multiple target versions were set
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is MODIFIED instead
  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-82443 to depend on a bug targeting a version in 5.0.0 and in one of the following states: MODIFIED, ON_QA, VERIFIED, but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@reedcort

Contributor Author

/jira refresh

@openshift-ci-robot


@reedcort: This pull request references Jira Issue OCPBUGS-82443, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-82443 to depend on a bug targeting a version in 5.0.0 and in one of the following states: MODIFIED, ON_QA, VERIFIED, but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@codecov

codecov Bot commented Jun 11, 2026


Codecov Report

❌ Patch coverage is 84.41558% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 35.49%. Comparing base (5cb8735) to head (f1455fa).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| cmd/infra/aws/iam.go | 0.00% | 6 Missing ⚠️ |
| ...ollers/awsprivatelink/awsprivatelink_controller.go | 91.54% | 6 Missing ⚠️ |
Additional details and impacted files
@@               Coverage Diff                @@
##           release-4.22    #8724      +/-   ##
================================================
+ Coverage         35.45%   35.49%   +0.04%     
================================================
  Files               767      767              
  Lines             93724    93798      +74     
================================================
+ Hits              33226    33291      +65     
- Misses            57785    57794       +9     
  Partials           2713     2713              
| Files with missing lines | Coverage Δ |
|---|---|
| cmd/infra/aws/iam.go | 28.91% <0.00%> (-0.12%) ⬇️ |
| ...ollers/awsprivatelink/awsprivatelink_controller.go | 23.80% <91.54%> (+5.50%) ⬆️ |

@reedcort reedcort changed the title from "[release-4.22] OCPBUGS-82443: fix(cpo): deduplicate VPC endpoint subnets by AZ" to "[release-4.22] OCPBUGS-88356: fix(cpo): deduplicate VPC endpoint subnets by AZ" Jun 11, 2026
@openshift-ci-robot


@reedcort: This pull request references Jira Issue OCPBUGS-88356, which is invalid:

  • expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "5.0.0" instead
  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-88356 to depend on a bug targeting a version in 5.0.0 and in one of the following states: MODIFIED, ON_QA, VERIFIED, but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@reedcort

Contributor Author

/jira refresh

@openshift-ci-robot


@reedcort: This pull request references Jira Issue OCPBUGS-88356, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@reedcort

Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label Jun 11, 2026
@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug label Jun 11, 2026
@openshift-ci-robot


@reedcort: This pull request references Jira Issue OCPBUGS-88356, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-82443 is in the state MODIFIED, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-82443 targets the "5.0.0" version, which is one of the valid target versions: 5.0.0
  • bug has dependents

@openshift-ci-robot


@reedcort: This pull request references Jira Issue OCPBUGS-88356, which is valid.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-82443 is in the state MODIFIED, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-82443 targets the "5.0.0" version, which is one of the valid target versions: 5.0.0
  • bug has dependents

@openshift-ci-robot


@reedcort: This pull request references Jira Issue OCPBUGS-88356, which is valid.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-82443 is in the state MODIFIED, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-82443 targets the "5.0.0" version, which is one of the valid target versions: 5.0.0
  • bug has dependents

The bug has been updated to refer to the pull request using the external bug tracker.


@Nirshal

Nirshal commented Jun 12, 2026

Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm label Jun 12, 2026
@openshift-merge-bot

Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@reedcort

Contributor Author

/retest

@openshift-merge-robot

Contributor

Fix included in release 5.0.0-0.nightly-2026-06-12-141614

@reedcort

Contributor Author

/retest

1 similar comment
@bryan-cox

Member

/retest

@reedcort

Contributor Author

/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator

@reedcort

Contributor Author

/retest

1 similar comment
@reedcort

Contributor Author

/retest

@reedcort

Contributor Author

/test e2e-aws

@reedcort

Contributor Author

/retest

@reedcort

Contributor Author

/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-aks

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

@reedcort: The following tests failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-aws | f1455fa | link | true | /test e2e-aws |
| ci/prow/e2e-aws-upgrade-hypershift-operator | f1455fa | link | true | /test e2e-aws-upgrade-hypershift-operator |
| ci/prow/e2e-aks | f1455fa | link | true | /test e2e-aks |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

  • area/cli: Indicates the PR includes changes for CLI
  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • area/platform/aws: PR/issue for AWS (AWSPlatform) platform
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.


5 participants