Skip to content

ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions#8689

Merged
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
Ajpantuso:apantuso/rosaeng-8224
Jun 17, 2026
Merged

ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions#8689
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
Ajpantuso:apantuso/rosaeng-8224

Conversation

@Ajpantuso

@Ajpantuso Ajpantuso commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

HCP namespace egress NetworkPolicies (private-router, management-kas) use /32 CIDRs
derived from the management cluster's KAS endpoint IPs to block HCP pods from reaching
the hosting KAS. These IPs rotate during rolling restarts, causing a burst of NetworkPolicy
UPDATEs that can trigger OVN port-group reconciliation races and drop traffic to HCP router
pods for 20-30+ minutes.

Adds a --hcp-egress-block-cidrs repeatable flag to the HyperShift Operator. When
supplied, the egress exception list in both policies is built from the provided stable CIDR
blocks (e.g. the MC machine network CIDR) instead of the live default/kubernetes Endpoints
object. This eliminates NetworkPolicy churn during KAS rollouts and avoids the OVN race.

The flag is wired end-to-end: hypershift install --hcp-egress-block-cidrs=<cidr> sets it
on the operator Deployment. Values are validated as valid CIDRs at startup. May be specified
multiple times.

Default behaviour (flag absent) is unchanged: KAS endpoint IPs continue to be discovered at
reconcile time from the default/kubernetes Endpoints object.

This PR will remain in draft until live testing is completed to ensure no regressions.

Which issue(s) this PR fixes:

Fixes ROSAENG-8224

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features
    • Added a repeatable CLI flag --hcp-egress-block-cidrs to supply one or more CIDRs (validated at startup). Provided CIDRs are used for HCP namespace egress policies instead of dynamically discovered endpoints—reducing policy churn during control-plane restarts—and are propagated into the operator deployment and install manifests.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2026
@openshift-ci

openshift-ci Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 5, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 5, 2026

Copy link
Copy Markdown

@Ajpantuso: This pull request references ROSAENG-8224 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Adds a --use-mc-machine-network-for-kas-block flag to the HyperShift Operator.

When enabled, the operator reads the management cluster's machine network
CIDR(s) from kube-system/cluster-config-v1 (the install-config key) and
uses them as the Except entries in the egress IPBlock rules of the
private-router and management-kas NetworkPolicies, instead of the
per-pod /32 CIDRs derived from the default/kubernetes Endpoints object.

If the ConfigMap is absent or unparseable, the operator falls back to the
existing kasEndpointsToCIDRs() behavior and logs a warning.

The flag is wired end-to-end: hypershift install --use-mc-machine-network-for-kas-block
sets it on the operator Deployment.

This PR will remain in draft until live testing is completed to ensure no regressions.

Which issue(s) this PR fixes:

Fixes ROSAENG-8224

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds support for statically configuring HCP egress block CIDRs to replace dynamic KAS endpoint derivation in network policies. Configuration flows through the install pipeline and operator startup with CLI flag validation, then into the HostedClusterReconciler. Network policy reconciliation is refactored to compute the kasBlock CIDR list once per reconcile and pass it through the policy chain, eliminating per-endpoint /32 derivation and reducing policy churn during KAS rolling restarts.

Sequence Diagram(s)

sequenceDiagram
  participant UserCLI as User CLI
  participant OperatorMain as Operator main
  participant Reconciler as HostedClusterReconciler
  participant K8sAPI as Kubernetes API
  UserCLI->>OperatorMain: provide --hcp-egress-block-cidrs values
  OperatorMain->>OperatorMain: validate CIDRs
  OperatorMain->>Reconciler: set HCPEgressBlockCIDRs
  Reconciler->>Reconciler: kasBlockCIDRs(kubernetesEndpoint) -> kasBlock
  Reconciler->>K8sAPI: apply NetworkPolicies using kasBlock in IPBlock.Except
Loading

Suggested reviewers

  • sjenning
  • sdminonne
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning Two new unit tests contain hardcoded IPv4 addresses (10.0.0.1, 10.100.0.0/16, 10.128.0.0/14, 172.30.0.0/16) and CIDRs without IPv6 equivalents, incompatible with IPv6-only disconnected environments. Update tests to use dynamic IP family detection or add IPv6 equivalents; use correctCIDRFamily() helper to select appropriate CIDRs for the cluster IP family.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the primary change: adding a --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are stable and deterministic. New tests use static string literals with no dynamic values like pod names, IPs, or timestamps.
Test Structure And Quality ✅ Passed New tests follow all quality requirements: single responsibility per test, proper setup/cleanup with fake clients, meaningful assertion messages, and consistent with existing patterns.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds only networking configuration (HCPEgressBlockCIDRs field and CLI flag). No scheduling constraints, affinity rules, replicas, node selectors, tolerations, or topology settings modified.
No-Weak-Crypto ✅ Passed PR contains no weak crypto usage: no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, no custom crypto, no insecure secret comparisons. Changes add CIDR field strings and net.ParseCIDR() validation only.
Container-Privileges ✅ Passed PR adds network policy CIDR filtering flag with no container privilege escalation or host access settings.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data (passwords, tokens, API keys, PII, session IDs) are logged. CIDRs logged in line 228 of main.go for validation errors are public network ranges, not sensitive.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Jun 5, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)

1697-1707: ⚡ Quick win

Assert management pod CIDR is still included in Except.

Line 1701 validates the feature-toggle CIDRs, but it doesn’t confirm the existing management cluster pod CIDR remains present. Add one assertion so this test also catches regressions where new KAS block CIDRs replace (instead of extend) existing exceptions.

Suggested test hardening
 			exceptCIDRs := collectExceptCIDRs(privateRouterPolicy)
+			g.Expect(exceptCIDRs).To(ContainElement("10.128.0.0/14"), "private-router Except list should retain management cluster pod CIDR")
 			for _, cidr := range tc.expectedExceptCIDRs {
 				g.Expect(exceptCIDRs).To(ContainElement(cidr), "private-router Except list should contain %s", cidr)
 			}

Based on learnings: "Unit test any code changes and additions and include e2e tests when changes impact consumer behaviour."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`
around lines 1697 - 1707, The test currently checks feature-toggle CIDRs but
doesn't assert the management cluster pod CIDR is still present; after
collecting exceptCIDRs (from collectExceptCIDRs(privateRouterPolicy)) add an
assertion that the known management pod CIDR (e.g., managementPodCIDR or the
variable used elsewhere in this test file) is contained in exceptCIDRs using
g.Expect(exceptCIDRs).To(ContainElement(managementPodCIDR)) to catch regressions
where exceptions get replaced instead of extended; place this assertion near the
existing checks for tc.expectedExceptCIDRs and tc.expectCIDRNotPresent so it
runs for each test case referencing createdNetworkPolicies["private-router"] and
privateRouterPolicy.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`:
- Around line 1697-1707: The test currently checks feature-toggle CIDRs but
doesn't assert the management cluster pod CIDR is still present; after
collecting exceptCIDRs (from collectExceptCIDRs(privateRouterPolicy)) add an
assertion that the known management pod CIDR (e.g., managementPodCIDR or the
variable used elsewhere in this test file) is contained in exceptCIDRs using
g.Expect(exceptCIDRs).To(ContainElement(managementPodCIDR)) to catch regressions
where exceptions get replaced instead of extended; place this assertion near the
existing checks for tc.expectedExceptCIDRs and tc.expectCIDRNotPresent so it
runs for each test case referencing createdNetworkPolicies["private-router"] and
privateRouterPolicy.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: eadd8666-ca8d-475c-8303-c5739a6cf3c8

📥 Commits

Reviewing files that changed from the base of the PR and between f13c62d and 4e64951.

📒 Files selected for processing (6)
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/install.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/network_policies.go
  • hypershift-operator/controllers/hostedcluster/network_policies_test.go
  • hypershift-operator/main.go

@codecov

codecov Bot commented Jun 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 63.63636% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.67%. Comparing base (4755e9c) to head (f5af523).
⚠️ Report is 55 commits behind head on main.

Files with missing lines Patch % Lines
hypershift-operator/main.go 0.00% 7 Missing ⚠️
...ator/controllers/hostedcluster/network_policies.go 79.16% 2 Missing and 3 partials ⚠️
cmd/install/install.go 60.00% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8689      +/-   ##
==========================================
+ Coverage   41.54%   41.67%   +0.12%     
==========================================
  Files         758      758              
  Lines       93838    93959     +121     
==========================================
+ Hits        38986    39154     +168     
+ Misses      52107    52056      -51     
- Partials     2745     2749       +4     
Files with missing lines Coverage Δ
cmd/install/assets/hypershift_operator.go 48.19% <100.00%> (+0.08%) ⬆️
...trollers/hostedcluster/hostedcluster_controller.go 45.89% <ø> (ø)
cmd/install/install.go 63.04% <60.00%> (-0.03%) ⬇️
...ator/controllers/hostedcluster/network_policies.go 77.42% <79.16%> (+0.23%) ⬆️
hypershift-operator/main.go 0.00% <0.00%> (ø)

... and 12 files with indirect coverage changes

Flag Coverage Δ
cmd-support 34.97% <69.23%> (+0.01%) ⬆️
cpo-hostedcontrolplane 44.00% <ø> (+0.40%) ⬆️
cpo-other 43.45% <ø> (+0.27%) ⬆️
hypershift-operator 51.66% <61.29%> (+0.03%) ⬆️
other 31.56% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Ajpantuso Ajpantuso force-pushed the apantuso/rosaeng-8224 branch 2 times, most recently from 3b45a90 to 8df0806 Compare June 5, 2026 21:10
@openshift-ci openshift-ci Bot added area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing labels Jun 5, 2026
@Ajpantuso Ajpantuso force-pushed the apantuso/rosaeng-8224 branch 4 times, most recently from 65a6dad to 0f14878 Compare June 7, 2026 17:24
@Ajpantuso Ajpantuso changed the title ROSAENG-8224: feat(operator): use MC machine network CIDR to block HCP egress to management KAS ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions Jun 7, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hypershift-operator/controllers/hostedcluster/network_policies.go (1)

345-346: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle IPv4/IPv6 exceptions with separate IPBlock CIDRs.

In both reconcilePrivateRouterNetworkPolicy and reconcileManagementKASNetworkPolicy, the egress rule uses IPBlock{CIDR:"0.0.0.0/0", Except: exceptions} where exceptions := append(kasBlockExceptions, clusterNetworks...). kasBlockExceptions comes from HCPEgressBlockCIDRs directly when set (no IPv4/IPv6 filtering) or from kasEndpointsToCIDRs (hardcoded to /32), and clusterNetworks comes from managementClusterNetwork.Spec.ClusterNetwork (also unfiltered). If any IPv6 CIDRs are present in either list, the resulting NetworkPolicy will be rejected because ipBlock.except must be a strict subset of—and in the same IP family as—ipBlock.cidr.

Also applies to: 851-852, 943-955

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around
lines 345 - 346, The egress IPBlock uses a single CIDR ("0.0.0.0/0") while
combining exceptions that may include IPv6 addresses, which will be rejected; in
reconcilePrivateRouterNetworkPolicy and reconcileManagementKASNetworkPolicy (and
the similar blocks around the other occurrences) split egress rules by IP
family: build two IPBlock rules—one with CIDR "0.0.0.0/0" and Except filtered to
only IPv4 CIDRs, and one with CIDR "::/0" and Except filtered to only IPv6
CIDRs; when assembling exceptions (from kasBlockExceptions, kasEndpointsToCIDRs
results, and clusterNetworks) filter each list by IP family before appending to
the matching IPBlock so ipBlock.except is a strict subset and same family as
ipBlock.cidr.
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/network_policies.go (1)

73-76: ⚡ Quick win

Skip default/kubernetes Endpoints lookup when static CIDRs are configured.

Static mode still hard-fails on Endpoints Get. This keeps an unnecessary dependency/path even when HCPEgressBlockCIDRs is already set.

Suggested refactor
- kubernetesEndpoint := &corev1.Endpoints{ObjectMeta: metav1.ObjectMeta{Name: "kubernetes", Namespace: "default"}}
- if err := r.Get(ctx, client.ObjectKeyFromObject(kubernetesEndpoint), kubernetesEndpoint); err != nil {
- 	return fmt.Errorf("failed to get management cluster network config: %w", err)
- }
- kasBlock := r.kasBlockCIDRs(kubernetesEndpoint)
+ var kasBlock []string
+ if len(r.HCPEgressBlockCIDRs) > 0 {
+ 	kasBlock = append([]string(nil), r.HCPEgressBlockCIDRs...)
+ } else {
+ 	kubernetesEndpoint := &corev1.Endpoints{ObjectMeta: metav1.ObjectMeta{Name: "kubernetes", Namespace: "default"}}
+ 	if err := r.Get(ctx, client.ObjectKeyFromObject(kubernetesEndpoint), kubernetesEndpoint); err != nil {
+ 		return fmt.Errorf("failed to get management cluster kubernetes endpoint: %w", err)
+ 	}
+ 	kasBlock = kasEndpointsToCIDRs(kubernetesEndpoint)
+ }

Also applies to: 78-83, 950-955

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around
lines 73 - 76, The code currently always does a Get on the Endpoints object
kubernetesEndpoint (via r.Get(..., kubernetesEndpoint)) which causes a hard
failure even when static CIDRs are configured; update the logic in
network_policies.go to first check the static CIDR configuration
(HCPEgressBlockCIDRs) and skip the endpoints lookup entirely when static CIDRs
are present — i.e., guard the r.Get call with an if that returns or continues
when HCPEgressBlockCIDRs is non-empty so kubernetesEndpoint and the r.Get call
are only executed when no static CIDRs are configured (apply the same guard to
the other occurrences around lines indicated: the block at 78-83 and 950-955).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/install/install.go`:
- Line 443: Add CIDR validation for the --hcp-egress-block-cidrs option by
implementing a helper (e.g.,
validateHCPEgressBlockCIDRs(opts.HCPEgressBlockCIDRs) bool or error) that
iterates over Options.HCPEgressBlockCIDRs and uses net.ParseCIDR to ensure each
entry is a valid CIDR, returning a descriptive error for the first invalid
value; call this helper from Options.Validate() so installation fails fast on
invalid input, and add import "net" at the top of the file; ensure the
Validate() error message references the invalid CIDR and the flag name for
clarity.

In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Line 338: The code appends clusterNetworks onto kasBlockExceptions which may
share a backing array returned directly by kasBlockCIDRs
(r.HCPEgressBlockCIDRs), causing race/mutation issues; to fix, make a defensive
copy of kasBlockExceptions before appending (e.g. start with a new slice created
from kasBlockExceptions or append onto a nil copy) when building exceptions in
the code path that uses the kasBlockExceptions variable (the occurrence with:
exceptions := append(kasBlockExceptions, clusterNetworks...) and the similar
sites around the kasBlockCIDRs usage), and apply the same defensive-copy pattern
at the other occurrences mentioned (around lines 844 and 950-953) so we never
append into the original shared slice.

---

Outside diff comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Around line 345-346: The egress IPBlock uses a single CIDR ("0.0.0.0/0") while
combining exceptions that may include IPv6 addresses, which will be rejected; in
reconcilePrivateRouterNetworkPolicy and reconcileManagementKASNetworkPolicy (and
the similar blocks around the other occurrences) split egress rules by IP
family: build two IPBlock rules—one with CIDR "0.0.0.0/0" and Except filtered to
only IPv4 CIDRs, and one with CIDR "::/0" and Except filtered to only IPv6
CIDRs; when assembling exceptions (from kasBlockExceptions, kasEndpointsToCIDRs
results, and clusterNetworks) filter each list by IP family before appending to
the matching IPBlock so ipBlock.except is a strict subset and same family as
ipBlock.cidr.

---

Nitpick comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Around line 73-76: The code currently always does a Get on the Endpoints
object kubernetesEndpoint (via r.Get(..., kubernetesEndpoint)) which causes a
hard failure even when static CIDRs are configured; update the logic in
network_policies.go to first check the static CIDR configuration
(HCPEgressBlockCIDRs) and skip the endpoints lookup entirely when static CIDRs
are present — i.e., guard the r.Get call with an if that returns or continues
when HCPEgressBlockCIDRs is non-empty so kubernetesEndpoint and the r.Get call
are only executed when no static CIDRs are configured (apply the same guard to
the other occurrences around lines indicated: the block at 78-83 and 950-955).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: f093ff4c-fa3a-486c-9e3f-711a006a912e

📥 Commits

Reviewing files that changed from the base of the PR and between 2827d9d and c1597c3.

📒 Files selected for processing (6)
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/install.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/network_policies.go
  • hypershift-operator/controllers/hostedcluster/network_policies_test.go
  • hypershift-operator/main.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/install/assets/hypershift_operator.go

Comment thread cmd/install/install.go
Comment thread hypershift-operator/controllers/hostedcluster/network_policies.go Outdated
@Ajpantuso Ajpantuso force-pushed the apantuso/rosaeng-8224 branch 4 times, most recently from 3e92625 to 6352ee3 Compare June 7, 2026 19:32
@Ajpantuso Ajpantuso marked this pull request as ready for review June 8, 2026 13:15
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2026
@csrwng

csrwng commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

/verified later @Ajpantuso

@openshift-ci-robot openshift-ci-robot added verified-later verified Signifies that the PR passed pre-merge verification criteria labels Jun 16, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@csrwng: This PR has been marked to be verified later by @Ajpantuso.

Details

In response to this:

/verified later @Ajpantuso

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@rafael-azevedo

Copy link
Copy Markdown
Contributor

/lgtm One minor ask is a log line at operator startup indicating if dynamic Endpoints mode is active would help on-call debugging.

@rafael-azevedo

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 16, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@bryan-cox bryan-cox left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ajpantuso, bryan-cox, csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bryan-cox

Copy link
Copy Markdown
Member

@Ajpantuso @rafael-azevedo would you only ever use IPv4?

[suggestion] IPv6 dual-stack gap — The new flag accepts any CIDR, but the policies only use 0.0.0.0/0 as the base IPBlock. An IPv6 CIDR in the Except list of an IPv4 block is semantically wrong. Recommend adding validation in main.go/install.go to reject IPv6 CIDRs since this PR's scope is IPv4-only.

@Ajpantuso

Copy link
Copy Markdown
Contributor Author

@Ajpantuso @rafael-azevedo would you only ever use IPv4?

[suggestion] IPv6 dual-stack gap — The new flag accepts any CIDR, but the policies only use 0.0.0.0/0 as the base IPBlock. An IPv6 CIDR in the Except list of an IPv4 block is semantically wrong. Recommend adding validation in main.go/install.go to reject IPv6 CIDRs since this PR's scope is IPv4-only.

Yes, we would only use IPv4 as we don't run dual stack. I will submit a follow up with additional validation.

@Ajpantuso

Copy link
Copy Markdown
Contributor Author

/retest-required

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD 7507291 and 2 for PR HEAD f5af523 in total

@enxebre

enxebre commented Jun 17, 2026

Copy link
Copy Markdown
Member

/test e2e-kubevirt-aws-ovn-reduced

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD 30a12e6 and 1 for PR HEAD f5af523 in total

@celebdor

Copy link
Copy Markdown
Collaborator

/override "ci/prow/e2e-kubevirt-aws-ovn-reduced"

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@celebdor: Overrode contexts on behalf of celebdor: ci/prow/e2e-kubevirt-aws-ovn-reduced

Details

In response to this:

/override "ci/prow/e2e-kubevirt-aws-ovn-reduced"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

os.Exit(1)
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation should be encapsulated in validateStartOptions()

@bryan-cox bryan-cox Jun 17, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to follow up on this later. There are incidents waiting for this PR to merge.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a follow up after this merges with the proposed modifications.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up #8763

@hypershift-jira-solve-ci

Copy link
Copy Markdown

Now I have a comprehensive picture. Let me compile the final report. The root cause is an infrastructure-level management cluster installation failure — the bootstrap process timed out because the ingress controller (router) could not schedule its pod. This is a standard bootstrap deadlock pattern where the ingress operator's router deployment requires a worker node, but no workers are provisioned until after bootstrap completes. This is a well-known CI infrastructure flaky failure unrelated to PR #8689's code changes.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Bootstrap failed to complete: timed out waiting for the condition
Failed to wait for bootstrapping to complete. This error usually happens when there is a problem 
with control plane hosts that prevents the control plane operators from creating the control plane.

Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller 
reports Available=False: DeploymentAvailable=False (DeploymentUnavailable: Deployment does not have 
minimum availability.) — 0/1 of replicas are available.

Cluster operator monitoring Available is False with PlatformTasksFailed: creating Route object failed: 
the server could not find the requested resource (post routes.route.openshift.io)

Cluster operator openshift-apiserver Available is False with APIServices_PreconditionNotReady
Cluster operator authentication Available is False — oauth service endpoints are not ready

Summary

The job failed during the pre phase (management cluster IPI installation), not during any test execution. The openshift-install process provisioned 3 control-plane nodes (m6i.2xlarge) and a bootstrap node successfully, the Kubernetes API came up at 12:03:51Z, but the 45-minute bootstrap completion wait timed out at 12:48:51Z. The root failure is that the ingress controller's router deployment could not achieve minimum availability (0/1 replicas ready), which cascaded into failures of the monitoring operator (Route CRD not registered), openshift-apiserver (APIServices PreconditionNotReady), authentication operator (OAuth endpoints not ready), and console operator (NoData). This is a CI infrastructure flake — a bootstrap timeout due to the router pod being unable to schedule — and is unrelated to PR #8689's code changes (which add an --hcp-egress-block-cidrs flag to the HyperShift operator). No test code from this PR was ever executed.

Root Cause

The failure is a management cluster IPI installation timeout, not a test failure. The root cause chain is:

  1. Router pod could not schedule (0/1 replicas available): The ingress controller's router Deployment never achieved MinimumReplicasAvailable. The cluster was configured with 3 master nodes (m6i.2xlarge) and 1 worker node (c5n.metal — a bare metal instance type). During IPI installation via Cluster API (CAPI), only bootstrap + master machines are provisioned. Worker nodes are created by machine-api after bootstrap completes. However, the default ingress controller's router pod requires a worker node to schedule (masters have the node-role.kubernetes.io/master taint). This creates a bootstrap deadlock: bootstrap needs ingress → ingress needs a worker → workers need bootstrap to complete.

  2. Cascading operator failures: Because ingress was unavailable:

    • The Route CRD (routes.route.openshift.io) was never registered with the API server
    • The monitoring operator failed trying to create Route objects for Alertmanager, Thanos Querier, and Prometheus
    • The openshift-apiserver stayed in PreconditionNotReady state
    • The authentication operator could not create OAuth server endpoints
    • The console operator reported NoData
  3. Bootstrap timeout: After waiting the full 45 minutes (12:03:51Z to 12:48:51Z), the installer declared bootstrap failure with exit code 5.

This is a known CI infrastructure flaky pattern for jobs using c5n.metal worker instance types on AWS. Bare metal instances take significantly longer to provision, and the bootstrap process timing can be fragile. The PR's code changes (adding --hcp-egress-block-cidrs flag) were never reached — the management cluster never finished installing.

Recommendations
  1. Retry the job: This is an infrastructure flake unrelated to the PR code changes. A simple re-trigger of the CI job should resolve it.

  2. No code changes needed in PR ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions #8689: The --hcp-egress-block-cidrs feature code was never exercised. The failure occurred in the management cluster IPI installation phase, before any HyperShift-specific test step ran.

  3. If failure recurs consistently: Investigate whether the c5n.metal worker instance type is experiencing AWS availability issues in us-east-2, or consider whether the job's bootstrap timeout (45m) is sufficient for bare-metal instance provisioning.

Evidence
Evidence Detail
Failed step ipi-install-install (pre phase), exit code 5, ran for 55m23s
Bootstrap wait Started 12:03:51Z, timed out at 12:48:51Z (45m window exhausted)
Ingress operator Available=False, router deployment 0/1 replicas, MinimumReplicasUnavailable
Monitoring operator Available=False, the server could not find the requested resource (post routes.route.openshift.io)
openshift-apiserver Available=False, APIServices_PreconditionNotReady
Authentication operator Available=False, oauth service endpoints are not ready, connection refused to OAuth healthz
Worker node config 1x c5n.metal (bare metal) — not provisioned during bootstrap phase
Master nodes 3x m6i.2xlarge — all provisioned and running successfully
API availability Kubernetes API came up normally at 12:03:51Z (v1.35.5-dirty)
Post phase failure hypershift-k8sgpt — secondary failure, the server doesn't have a resource type "hostedcluster" (expected since install never completed)
PR code relevance None — PR adds --hcp-egress-block-cidrs flag to HyperShift operator; management cluster install failed before any HyperShift code ran

csrwng added a commit to csrwng/hypershift-1 that referenced this pull request Jun 17, 2026
…gress NetworkPolicy

Reverts the revert PR openshift#8662 to re-apply the original PR openshift#7872 behavior:
conditionally create or delete the openshift-ingress NetworkPolicy based
on whether routes are labeled for the HCP router (LabelHCPRoutes).

The OVN-Kubernetes port group race that prompted the revert is being
addressed via PR openshift#8689 (--hcp-egress-block-cidrs flag), making the
blanket always-create workaround unnecessary.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@Ajpantuso

Copy link
Copy Markdown
Contributor Author

/test e2e-kubevirt-aws-ovn-reduced

@bryan-cox

bryan-cox commented Jun 17, 2026

Copy link
Copy Markdown
Member

/override "ci/prow/e2e-kubevirt-aws-ovn-reduced"

@Ajpantuso that test is broken so let's not try running it again. Toni put in an override for it earlier - #8689 (comment)

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@bryan-cox: Overrode contexts on behalf of bryan-cox: ci/prow/e2e-kubevirt-aws-ovn-reduced

Details

In response to this:

/override "ci/prow/e2e-kubevirt-aws-ovn-reduced"

@Ajpantuso that test is broken so let's not try running it again. Toni put in an override for it earlier.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit fabde37 into openshift:main Jun 17, 2026
41 checks passed
@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@Ajpantuso: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria verified-later

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants