ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions by Ajpantuso · Pull Request #8689 · openshift/hypershift

Ajpantuso · 2026-06-05T19:40:15Z

What this PR does / why we need it:

HCP namespace egress NetworkPolicies (private-router, management-kas) use /32 CIDRs
derived from the management cluster's KAS endpoint IPs to block HCP pods from reaching
the hosting KAS. These IPs rotate during rolling restarts, causing a burst of NetworkPolicy
UPDATEs that can trigger OVN port-group reconciliation races and drop traffic to HCP router
pods for 20-30+ minutes.

Adds a --hcp-egress-block-cidrs repeatable flag to the HyperShift Operator. When
supplied, the egress exception list in both policies is built from the provided stable CIDR
blocks (e.g. the MC machine network CIDR) instead of the live default/kubernetes Endpoints
object. This eliminates NetworkPolicy churn during KAS rollouts and avoids the OVN race.

The flag is wired end-to-end: hypershift install --hcp-egress-block-cidrs=<cidr> sets it
on the operator Deployment. Values are validated as valid CIDRs at startup. May be specified
multiple times.

Default behaviour (flag absent) is unchanged: KAS endpoint IPs continue to be discovered at
reconcile time from the default/kubernetes Endpoints object.

This PR will remain in draft until live testing is completed to ensure no regressions.

Which issue(s) this PR fixes:

Fixes ROSAENG-8224

Special notes for your reviewer:

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

Summary by CodeRabbit

New Features
- Added a repeatable CLI flag --hcp-egress-block-cidrs to supply one or more CIDRs (validated at startup). Provided CIDRs are used for HCP namespace egress policies instead of dynamically discovered endpoints—reducing policy churn during control-plane restarts—and are propagated into the operator deployment and install manifests.

openshift-merge-bot · 2026-06-05T19:40:20Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2026-06-05T19:40:22Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-06-05T19:40:24Z

@Ajpantuso: This pull request references ROSAENG-8224 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Adds a --use-mc-machine-network-for-kas-block flag to the HyperShift Operator.

When enabled, the operator reads the management cluster's machine network
CIDR(s) from kube-system/cluster-config-v1 (the install-config key) and
uses them as the Except entries in the egress IPBlock rules of the
private-router and management-kas NetworkPolicies, instead of the
per-pod /32 CIDRs derived from the default/kubernetes Endpoints object.

If the ConfigMap is absent or unparseable, the operator falls back to the
existing kasEndpointsToCIDRs() behavior and logs a warning.

The flag is wired end-to-end: hypershift install --use-mc-machine-network-for-kas-block
sets it on the operator Deployment.

This PR will remain in draft until live testing is completed to ensure no regressions.

Which issue(s) this PR fixes:

Fixes ROSAENG-8224

Special notes for your reviewer:

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-06-05T19:40:31Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR adds support for statically configuring HCP egress block CIDRs to replace dynamic KAS endpoint derivation in network policies. Configuration flows through the install pipeline and operator startup with CLI flag validation, then into the HostedClusterReconciler. Network policy reconciliation is refactored to compute the kasBlock CIDR list once per reconcile and pass it through the policy chain, eliminating per-endpoint /32 derivation and reducing policy churn during KAS rolling restarts.

Sequence Diagram(s)

sequenceDiagram
  participant UserCLI as User CLI
  participant OperatorMain as Operator main
  participant Reconciler as HostedClusterReconciler
  participant K8sAPI as Kubernetes API
  UserCLI->>OperatorMain: provide --hcp-egress-block-cidrs values
  OperatorMain->>OperatorMain: validate CIDRs
  OperatorMain->>Reconciler: set HCPEgressBlockCIDRs
  Reconciler->>Reconciler: kasBlockCIDRs(kubernetesEndpoint) -> kasBlock
  Reconciler->>K8sAPI: apply NetworkPolicies using kasBlock in IPBlock.Except

Suggested reviewers

sjenning
sdminonne

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Ipv6 And Disconnected Network Test Compatibility	⚠️ Warning	Two new unit tests contain hardcoded IPv4 addresses (10.0.0.1, 10.100.0.0/16, 10.128.0.0/14, 172.30.0.0/16) and CIDRs without IPv6 equivalents, incompatible with IPv6-only disconnected environments.	Update tests to use dynamic IP family detection or add IPv6 equivalents; use correctCIDRFamily() helper to select appropriate CIDRs for the cluster IP family.

✅ Passed checks (10 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the primary change: adding a --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names are stable and deterministic. New tests use static string literals with no dynamic values like pod names, IPs, or timestamps.
Test Structure And Quality	✅ Passed	New tests follow all quality requirements: single responsibility per test, proper setup/cleanup with fake clients, meaningful assertion messages, and consistent with existing patterns.
Topology-Aware Scheduling Compatibility	✅ Passed	PR adds only networking configuration (HCPEgressBlockCIDRs field and CLI flag). No scheduling constraints, affinity rules, replicas, node selectors, tolerations, or topology settings modified.
No-Weak-Crypto	✅ Passed	PR contains no weak crypto usage: no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, no custom crypto, no insecure secret comparisons. Changes add CIDR field strings and net.ParseCIDR() validation only.
Container-Privileges	✅ Passed	PR adds network policy CIDR filtering flag with no container privilege escalation or host access settings.
No-Sensitive-Data-In-Logs	✅ Passed	No sensitive data (passwords, tokens, API keys, PII, session IDs) are logged. CIDRs logged in line 228 of main.go for validation errors are public network ranges, not sensitive.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)
1697-1707: ⚡ Quick win

Assert management pod CIDR is still included in Except.

Line 1701 validates the feature-toggle CIDRs, but it doesn’t confirm the existing management cluster pod CIDR remains present. Add one assertion so this test also catches regressions where new KAS block CIDRs replace (instead of extend) existing exceptions.
Suggested test hardening
 			exceptCIDRs := collectExceptCIDRs(privateRouterPolicy)
+			g.Expect(exceptCIDRs).To(ContainElement("10.128.0.0/14"), "private-router Except list should retain management cluster pod CIDR")
 			for _, cidr := range tc.expectedExceptCIDRs {
 				g.Expect(exceptCIDRs).To(ContainElement(cidr), "private-router Except list should contain %s", cidr)
 			}
Based on learnings: "Unit test any code changes and additions and include e2e tests when changes impact consumer behaviour."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`
around lines 1697 - 1707, The test currently checks feature-toggle CIDRs but
doesn't assert the management cluster pod CIDR is still present; after
collecting exceptCIDRs (from collectExceptCIDRs(privateRouterPolicy)) add an
assertion that the known management pod CIDR (e.g., managementPodCIDR or the
variable used elsewhere in this test file) is contained in exceptCIDRs using
g.Expect(exceptCIDRs).To(ContainElement(managementPodCIDR)) to catch regressions
where exceptions get replaced instead of extended; place this assertion near the
existing checks for tc.expectedExceptCIDRs and tc.expectCIDRNotPresent so it
runs for each test case referencing createdNetworkPolicies["private-router"] and
privateRouterPolicy.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`:
- Around line 1697-1707: The test currently checks feature-toggle CIDRs but
doesn't assert the management cluster pod CIDR is still present; after
collecting exceptCIDRs (from collectExceptCIDRs(privateRouterPolicy)) add an
assertion that the known management pod CIDR (e.g., managementPodCIDR or the
variable used elsewhere in this test file) is contained in exceptCIDRs using
g.Expect(exceptCIDRs).To(ContainElement(managementPodCIDR)) to catch regressions
where exceptions get replaced instead of extended; place this assertion near the
existing checks for tc.expectedExceptCIDRs and tc.expectCIDRNotPresent so it
runs for each test case referencing createdNetworkPolicies["private-router"] and
privateRouterPolicy.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: eadd8666-ca8d-475c-8303-c5739a6cf3c8

📥 Commits

Reviewing files that changed from the base of the PR and between f13c62d and 4e64951.

📒 Files selected for processing (6)

cmd/install/assets/hypershift_operator.go
cmd/install/install.go
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
hypershift-operator/controllers/hostedcluster/network_policies.go
hypershift-operator/controllers/hostedcluster/network_policies_test.go
hypershift-operator/main.go

codecov · 2026-06-05T19:47:53Z

Codecov Report

❌ Patch coverage is 63.63636% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.67%. Comparing base (4755e9c) to head (f5af523).
⚠️ Report is 55 commits behind head on main.

Files with missing lines	Patch %	Lines
hypershift-operator/main.go	0.00%	7 Missing ⚠️
...ator/controllers/hostedcluster/network_policies.go	79.16%	2 Missing and 3 partials ⚠️
cmd/install/install.go	60.00%	3 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8689      +/-   ##
==========================================
+ Coverage   41.54%   41.67%   +0.12%     
==========================================
  Files         758      758              
  Lines       93838    93959     +121     
==========================================
+ Hits        38986    39154     +168     
+ Misses      52107    52056      -51     
- Partials     2745     2749       +4

Files with missing lines	Coverage Δ
cmd/install/assets/hypershift_operator.go	`48.19% <100.00%> (+0.08%)`	⬆️
...trollers/hostedcluster/hostedcluster_controller.go	`45.89% <ø> (ø)`
cmd/install/install.go	`63.04% <60.00%> (-0.03%)`	⬇️
...ator/controllers/hostedcluster/network_policies.go	`77.42% <79.16%> (+0.23%)`	⬆️
hypershift-operator/main.go	`0.00% <0.00%> (ø)`

... and 12 files with indirect coverage changes

Flag	Coverage Δ
cmd-support	`34.97% <69.23%> (+0.01%)`	⬆️
cpo-hostedcontrolplane	`44.00% <ø> (+0.40%)`	⬆️
cpo-other	`43.45% <ø> (+0.27%)`	⬆️
hypershift-operator	`51.66% <61.29%> (+0.03%)`	⬆️
other	`31.56% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

hypershift-operator/controllers/hostedcluster/network_policies.go (1)
345-346: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle IPv4/IPv6 exceptions with separate IPBlock CIDRs.

In both reconcilePrivateRouterNetworkPolicy and reconcileManagementKASNetworkPolicy, the egress rule uses IPBlock{CIDR:"0.0.0.0/0", Except: exceptions} where exceptions := append(kasBlockExceptions, clusterNetworks...). kasBlockExceptions comes from HCPEgressBlockCIDRs directly when set (no IPv4/IPv6 filtering) or from kasEndpointsToCIDRs (hardcoded to /32), and clusterNetworks comes from managementClusterNetwork.Spec.ClusterNetwork (also unfiltered). If any IPv6 CIDRs are present in either list, the resulting NetworkPolicy will be rejected because ipBlock.except must be a strict subset of—and in the same IP family as—ipBlock.cidr.

Also applies to: 851-852, 943-955
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around
lines 345 - 346, The egress IPBlock uses a single CIDR ("0.0.0.0/0") while
combining exceptions that may include IPv6 addresses, which will be rejected; in
reconcilePrivateRouterNetworkPolicy and reconcileManagementKASNetworkPolicy (and
the similar blocks around the other occurrences) split egress rules by IP
family: build two IPBlock rules—one with CIDR "0.0.0.0/0" and Except filtered to
only IPv4 CIDRs, and one with CIDR "::/0" and Except filtered to only IPv6
CIDRs; when assembling exceptions (from kasBlockExceptions, kasEndpointsToCIDRs
results, and clusterNetworks) filter each list by IP family before appending to
the matching IPBlock so ipBlock.except is a strict subset and same family as
ipBlock.cidr.

🧹 Nitpick comments (1)

hypershift-operator/controllers/hostedcluster/network_policies.go (1)

73-76: ⚡ Quick win

Skip default/kubernetes Endpoints lookup when static CIDRs are configured.

Static mode still hard-fails on Endpoints Get. This keeps an unnecessary dependency/path even when HCPEgressBlockCIDRs is already set.

Suggested refactor

- kubernetesEndpoint := &corev1.Endpoints{ObjectMeta: metav1.ObjectMeta{Name: "kubernetes", Namespace: "default"}}
- if err := r.Get(ctx, client.ObjectKeyFromObject(kubernetesEndpoint), kubernetesEndpoint); err != nil {
- 	return fmt.Errorf("failed to get management cluster network config: %w", err)
- }
- kasBlock := r.kasBlockCIDRs(kubernetesEndpoint)
+ var kasBlock []string
+ if len(r.HCPEgressBlockCIDRs) > 0 {
+ 	kasBlock = append([]string(nil), r.HCPEgressBlockCIDRs...)
+ } else {
+ 	kubernetesEndpoint := &corev1.Endpoints{ObjectMeta: metav1.ObjectMeta{Name: "kubernetes", Namespace: "default"}}
+ 	if err := r.Get(ctx, client.ObjectKeyFromObject(kubernetesEndpoint), kubernetesEndpoint); err != nil {
+ 		return fmt.Errorf("failed to get management cluster kubernetes endpoint: %w", err)
+ 	}
+ 	kasBlock = kasEndpointsToCIDRs(kubernetesEndpoint)
+ }

Also applies to: 78-83, 950-955

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around
lines 73 - 76, The code currently always does a Get on the Endpoints object
kubernetesEndpoint (via r.Get(..., kubernetesEndpoint)) which causes a hard
failure even when static CIDRs are configured; update the logic in
network_policies.go to first check the static CIDR configuration
(HCPEgressBlockCIDRs) and skip the endpoints lookup entirely when static CIDRs
are present — i.e., guard the r.Get call with an if that returns or continues
when HCPEgressBlockCIDRs is non-empty so kubernetesEndpoint and the r.Get call
are only executed when no static CIDRs are configured (apply the same guard to
the other occurrences around lines indicated: the block at 78-83 and 950-955).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/install/install.go`:
- Line 443: Add CIDR validation for the --hcp-egress-block-cidrs option by
implementing a helper (e.g.,
validateHCPEgressBlockCIDRs(opts.HCPEgressBlockCIDRs) bool or error) that
iterates over Options.HCPEgressBlockCIDRs and uses net.ParseCIDR to ensure each
entry is a valid CIDR, returning a descriptive error for the first invalid
value; call this helper from Options.Validate() so installation fails fast on
invalid input, and add import "net" at the top of the file; ensure the
Validate() error message references the invalid CIDR and the flag name for
clarity.

In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Line 338: The code appends clusterNetworks onto kasBlockExceptions which may
share a backing array returned directly by kasBlockCIDRs
(r.HCPEgressBlockCIDRs), causing race/mutation issues; to fix, make a defensive
copy of kasBlockExceptions before appending (e.g. start with a new slice created
from kasBlockExceptions or append onto a nil copy) when building exceptions in
the code path that uses the kasBlockExceptions variable (the occurrence with:
exceptions := append(kasBlockExceptions, clusterNetworks...) and the similar
sites around the kasBlockCIDRs usage), and apply the same defensive-copy pattern
at the other occurrences mentioned (around lines 844 and 950-953) so we never
append into the original shared slice.

---

Outside diff comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Around line 345-346: The egress IPBlock uses a single CIDR ("0.0.0.0/0") while
combining exceptions that may include IPv6 addresses, which will be rejected; in
reconcilePrivateRouterNetworkPolicy and reconcileManagementKASNetworkPolicy (and
the similar blocks around the other occurrences) split egress rules by IP
family: build two IPBlock rules—one with CIDR "0.0.0.0/0" and Except filtered to
only IPv4 CIDRs, and one with CIDR "::/0" and Except filtered to only IPv6
CIDRs; when assembling exceptions (from kasBlockExceptions, kasEndpointsToCIDRs
results, and clusterNetworks) filter each list by IP family before appending to
the matching IPBlock so ipBlock.except is a strict subset and same family as
ipBlock.cidr.

---

Nitpick comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Around line 73-76: The code currently always does a Get on the Endpoints
object kubernetesEndpoint (via r.Get(..., kubernetesEndpoint)) which causes a
hard failure even when static CIDRs are configured; update the logic in
network_policies.go to first check the static CIDR configuration
(HCPEgressBlockCIDRs) and skip the endpoints lookup entirely when static CIDRs
are present — i.e., guard the r.Get call with an if that returns or continues
when HCPEgressBlockCIDRs is non-empty so kubernetesEndpoint and the r.Get call
are only executed when no static CIDRs are configured (apply the same guard to
the other occurrences around lines indicated: the block at 78-83 and 950-955).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: f093ff4c-fa3a-486c-9e3f-711a006a912e

📥 Commits

Reviewing files that changed from the base of the PR and between 2827d9d and c1597c3.

📒 Files selected for processing (6)

cmd/install/assets/hypershift_operator.go
cmd/install/install.go
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
hypershift-operator/controllers/hostedcluster/network_policies.go
hypershift-operator/controllers/hostedcluster/network_policies_test.go
hypershift-operator/main.go

🚧 Files skipped from review as they are similar to previous changes (1)

cmd/install/assets/hypershift_operator.go

csrwng · 2026-06-16T20:43:59Z

/verified later @Ajpantuso

openshift-ci-robot · 2026-06-16T20:44:12Z

@csrwng: This PR has been marked to be verified later by @Ajpantuso.

Details

In response to this:

/verified later @Ajpantuso

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

rafael-azevedo · 2026-06-16T20:52:49Z

/lgtm One minor ask is a log line at operator startup indicating if dynamic Endpoints mode is active would help on-call debugging.

rafael-azevedo · 2026-06-16T21:00:03Z

/lgtm

openshift-merge-bot · 2026-06-16T21:00:34Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

bryan-cox

/lgtm

openshift-ci · 2026-06-16T21:09:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ajpantuso, bryan-cox, csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [bryan-cox,csrwng]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bryan-cox · 2026-06-16T21:10:55Z

@Ajpantuso @rafael-azevedo would you only ever use IPv4?

[suggestion] IPv6 dual-stack gap — The new flag accepts any CIDR, but the policies only use 0.0.0.0/0 as the base IPBlock. An IPv6 CIDR in the Except list of an IPv4 block is semantically wrong. Recommend adding validation in main.go/install.go to reject IPv6 CIDRs since this PR's scope is IPv4-only.

Ajpantuso · 2026-06-16T21:16:33Z

@Ajpantuso @rafael-azevedo would you only ever use IPv4?

[suggestion] IPv6 dual-stack gap — The new flag accepts any CIDR, but the policies only use 0.0.0.0/0 as the base IPBlock. An IPv6 CIDR in the Except list of an IPv4 block is semantically wrong. Recommend adding validation in main.go/install.go to reject IPv6 CIDRs since this PR's scope is IPv4-only.

Yes, we would only use IPv4 as we don't run dual stack. I will submit a follow up with additional validation.

Ajpantuso · 2026-06-17T00:11:06Z

/retest-required

openshift-merge-bot · 2026-06-17T03:26:55Z

/retest-required

Remaining retests: 0 against base HEAD 7507291 and 2 for PR HEAD f5af523 in total

enxebre · 2026-06-17T08:21:38Z

/test e2e-kubevirt-aws-ovn-reduced

openshift-merge-bot · 2026-06-17T11:29:08Z

/retest-required

Remaining retests: 0 against base HEAD 30a12e6 and 1 for PR HEAD f5af523 in total

celebdor · 2026-06-17T13:06:09Z

/override "ci/prow/e2e-kubevirt-aws-ovn-reduced"

openshift-ci · 2026-06-17T13:06:24Z

@celebdor: Overrode contexts on behalf of celebdor: ci/prow/e2e-kubevirt-aws-ovn-reduced

Details

In response to this:

/override "ci/prow/e2e-kubevirt-aws-ovn-reduced"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ironcladlou · 2026-06-17T13:07:13Z

+				os.Exit(1)
+			}
+		}
+


Validation should be encapsulated in validateStartOptions()

We might need to follow up on this later. There are incidents waiting for this PR to merge.

I'll have a follow up after this merges with the proposed modifications.

Follow up #8763

hypershift-jira-solve-ci · 2026-06-17T13:40:48Z

Now I have a comprehensive picture. Let me compile the final report. The root cause is an infrastructure-level management cluster installation failure — the bootstrap process timed out because the ingress controller (router) could not schedule its pod. This is a standard bootstrap deadlock pattern where the ingress operator's router deployment requires a worker node, but no workers are provisioned until after bootstrap completes. This is a well-known CI infrastructure flaky failure unrelated to PR #8689's code changes.

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-e2e-kubevirt-aws-ovn-reduced
Build ID: 2067208365750095872
Target: e2e-kubevirt-aws-ovn-reduced
Failed Step: ipi-install-install (pre phase)
PR: ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions #8689

Test Failure Analysis

Error

Bootstrap failed to complete: timed out waiting for the condition
Failed to wait for bootstrapping to complete. This error usually happens when there is a problem 
with control plane hosts that prevents the control plane operators from creating the control plane.

Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller 
reports Available=False: DeploymentAvailable=False (DeploymentUnavailable: Deployment does not have 
minimum availability.) — 0/1 of replicas are available.

Cluster operator monitoring Available is False with PlatformTasksFailed: creating Route object failed: 
the server could not find the requested resource (post routes.route.openshift.io)

Cluster operator openshift-apiserver Available is False with APIServices_PreconditionNotReady
Cluster operator authentication Available is False — oauth service endpoints are not ready

Summary

The job failed during the pre phase (management cluster IPI installation), not during any test execution. The openshift-install process provisioned 3 control-plane nodes (m6i.2xlarge) and a bootstrap node successfully, the Kubernetes API came up at 12:03:51Z, but the 45-minute bootstrap completion wait timed out at 12:48:51Z. The root failure is that the ingress controller's router deployment could not achieve minimum availability (0/1 replicas ready), which cascaded into failures of the monitoring operator (Route CRD not registered), openshift-apiserver (APIServices PreconditionNotReady), authentication operator (OAuth endpoints not ready), and console operator (NoData). This is a CI infrastructure flake — a bootstrap timeout due to the router pod being unable to schedule — and is unrelated to PR #8689's code changes (which add an --hcp-egress-block-cidrs flag to the HyperShift operator). No test code from this PR was ever executed.

Root Cause

The failure is a management cluster IPI installation timeout, not a test failure. The root cause chain is:

Router pod could not schedule (0/1 replicas available): The ingress controller's router Deployment never achieved MinimumReplicasAvailable. The cluster was configured with 3 master nodes (m6i.2xlarge) and 1 worker node (c5n.metal — a bare metal instance type). During IPI installation via Cluster API (CAPI), only bootstrap + master machines are provisioned. Worker nodes are created by machine-api after bootstrap completes. However, the default ingress controller's router pod requires a worker node to schedule (masters have the node-role.kubernetes.io/master taint). This creates a bootstrap deadlock: bootstrap needs ingress → ingress needs a worker → workers need bootstrap to complete.
Cascading operator failures: Because ingress was unavailable:
- The Route CRD (routes.route.openshift.io) was never registered with the API server
- The monitoring operator failed trying to create Route objects for Alertmanager, Thanos Querier, and Prometheus
- The openshift-apiserver stayed in PreconditionNotReady state
- The authentication operator could not create OAuth server endpoints
- The console operator reported NoData
Bootstrap timeout: After waiting the full 45 minutes (12:03:51Z to 12:48:51Z), the installer declared bootstrap failure with exit code 5.

This is a known CI infrastructure flaky pattern for jobs using c5n.metal worker instance types on AWS. Bare metal instances take significantly longer to provision, and the bootstrap process timing can be fragile. The PR's code changes (adding --hcp-egress-block-cidrs flag) were never reached — the management cluster never finished installing.

Recommendations

Retry the job: This is an infrastructure flake unrelated to the PR code changes. A simple re-trigger of the CI job should resolve it.
No code changes needed in PR ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions #8689: The --hcp-egress-block-cidrs feature code was never exercised. The failure occurred in the management cluster IPI installation phase, before any HyperShift-specific test step ran.
If failure recurs consistently: Investigate whether the c5n.metal worker instance type is experiencing AWS availability issues in us-east-2, or consider whether the job's bootstrap timeout (45m) is sufficient for bare-metal instance provisioning.

Evidence

Evidence	Detail
Failed step	`ipi-install-install` (pre phase), exit code 5, ran for 55m23s
Bootstrap wait	Started 12:03:51Z, timed out at 12:48:51Z (45m window exhausted)
Ingress operator	`Available=False`, router deployment 0/1 replicas, `MinimumReplicasUnavailable`
Monitoring operator	`Available=False`, `the server could not find the requested resource (post routes.route.openshift.io)`
openshift-apiserver	`Available=False`, `APIServices_PreconditionNotReady`
Authentication operator	`Available=False`, `oauth service endpoints are not ready`, connection refused to OAuth healthz
Worker node config	1x `c5n.metal` (bare metal) — not provisioned during bootstrap phase
Master nodes	3x `m6i.2xlarge` — all provisioned and running successfully
API availability	Kubernetes API came up normally at 12:03:51Z (`v1.35.5-dirty`)
Post phase failure	`hypershift-k8sgpt` — secondary failure, `the server doesn't have a resource type "hostedcluster"` (expected since install never completed)
PR code relevance	None — PR adds `--hcp-egress-block-cidrs` flag to HyperShift operator; management cluster install failed before any HyperShift code ran

…gress NetworkPolicy Reverts the revert PR openshift#8662 to re-apply the original PR openshift#7872 behavior: conditionally create or delete the openshift-ingress NetworkPolicy based on whether routes are labeled for the HCP router (LabelHCPRoutes). The OVN-Kubernetes port group race that prompted the revert is being addressed via PR openshift#8689 (--hcp-egress-block-cidrs flag), making the blanket always-create workaround unnecessary. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Ajpantuso · 2026-06-17T15:44:40Z

/test e2e-kubevirt-aws-ovn-reduced

bryan-cox · 2026-06-17T15:50:58Z

/override "ci/prow/e2e-kubevirt-aws-ovn-reduced"

@Ajpantuso that test is broken so let's not try running it again. Toni put in an override for it earlier - #8689 (comment)

openshift-ci · 2026-06-17T15:51:22Z

@bryan-cox: Overrode contexts on behalf of bryan-cox: ci/prow/e2e-kubevirt-aws-ovn-reduced

Details

In response to this:

/override "ci/prow/e2e-kubevirt-aws-ovn-reduced"

@Ajpantuso that test is broken so let's not try running it again. Toni put in an override for it earlier.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2026-06-17T16:35:18Z

@Ajpantuso: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 5, 2026

openshift-ci Bot added the do-not-merge/needs-area label Jun 5, 2026

openshift-ci Bot added area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Jun 5, 2026

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

Ajpantuso force-pushed the apantuso/rosaeng-8224 branch 2 times, most recently from 3b45a90 to 8df0806 Compare June 5, 2026 21:10

openshift-ci Bot added area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing labels Jun 5, 2026

Ajpantuso force-pushed the apantuso/rosaeng-8224 branch 4 times, most recently from 65a6dad to 0f14878 Compare June 7, 2026 17:24

Ajpantuso changed the title ~~ROSAENG-8224: feat(operator): use MC machine network CIDR to block HCP egress to management KAS~~ ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions Jun 7, 2026

coderabbitai Bot reviewed Jun 7, 2026

View reviewed changes

Comment thread cmd/install/install.go

Comment thread hypershift-operator/controllers/hostedcluster/network_policies.go Outdated

Ajpantuso force-pushed the apantuso/rosaeng-8224 branch 4 times, most recently from 3e92625 to 6352ee3 Compare June 7, 2026 19:32

This was referenced Jun 8, 2026

CNTRLPLANE-3515: test(e2e): add day-2 label/taint no-rollout verification #8595

Merged

CNTRLPLANE-3526: Add spec.monitoring API for metrics forwarding #8626

Open

Ajpantuso marked this pull request as ready for review June 8, 2026 13:15

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2026

openshift-ci-robot added verified-later verified Signifies that the PR passed pre-merge verification criteria labels Jun 16, 2026

openshift-ci Bot assigned rafael-azevedo Jun 16, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 16, 2026

bryan-cox approved these changes Jun 16, 2026

View reviewed changes

openshift-ci Bot assigned bryan-cox Jun 16, 2026

ironcladlou reviewed Jun 17, 2026

View reviewed changes

csrwng mentioned this pull request Jun 17, 2026

CNTRLPLANE-2916: restore conditional deletion of openshift-ingress NetworkPolicy #8754

Open

2 tasks

openshift-merge-bot Bot merged commit fabde37 into openshift:main Jun 17, 2026
41 checks passed

Ajpantuso mentioned this pull request Jun 17, 2026

ROSAENG-8224: refactor(ho): improve --hcp-egress-block-cidrs validation #8763

Open

4 tasks

Conversation

Ajpantuso commented Jun 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist:

Summary by CodeRabbit

Uh oh!

openshift-merge-bot Bot commented Jun 5, 2026

Uh oh!

openshift-ci Bot commented Jun 5, 2026

Uh oh!

openshift-ci-robot commented Jun 5, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist:

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Sequence Diagram(s)

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

csrwng commented Jun 16, 2026

Uh oh!

openshift-ci-robot commented Jun 16, 2026

Uh oh!

rafael-azevedo commented Jun 16, 2026

Uh oh!

rafael-azevedo commented Jun 16, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 16, 2026

Uh oh!

bryan-cox left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Jun 16, 2026

Uh oh!

bryan-cox commented Jun 16, 2026

Uh oh!

Ajpantuso commented Jun 16, 2026

Uh oh!

Ajpantuso commented Jun 17, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 17, 2026

Uh oh!

enxebre commented Jun 17, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 17, 2026

Uh oh!

celebdor commented Jun 17, 2026

Uh oh!

openshift-ci Bot commented Jun 17, 2026

Uh oh!

ironcladlou Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

bryan-cox Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Ajpantuso Jun 17, 2026

Choose a reason for hiding this comment

Ajpantuso commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

openshift-ci-robot commented Jun 5, 2026 •

edited by openshift-ci Bot

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

codecov Bot commented Jun 5, 2026 •

edited

Loading

bryan-cox Jun 17, 2026 •

edited

Loading

bryan-cox commented Jun 17, 2026 •

edited

Loading