CNTRLPLANE-3276: Add Azure endpoint access transition e2e test by bryan-cox · Pull Request #8718 · openshift/hypershift

bryan-cox · 2026-06-11T15:22:30Z

What this PR does / why we need it:

Adds a v2 e2e lifecycle test that validates transitioning a HostedCluster between Private and PublicAndPrivate topology on Azure self-managed clusters. The test verifies:

Private → PublicAndPrivate: Updates topology, waits for PublicEndpointExposed condition to become True with reason SharedIngressConfigured, confirms KAS service becomes LoadBalancer, validates PLS CRs persist, and checks API reachability.
PublicAndPrivate → Private: Restores topology, waits for PublicEndpointExposed condition to become False with reason TopologyPrivate, confirms KAS service returns to ClusterIP, validates PLS CRs persist.
Private connectivity verification: Confirms API server is reachable via the private path after restore.

The test runs on the existing private cluster variant (labeled self-managed-azure-private) after AzurePrivateTopologyTest completes. DeferCleanup restores Private topology if the test fails mid-transition.
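
The expectations listed above can be summarized as a small lookup table. The following is a minimal sketch, not code from the PR: the `Topology`, `Expectation`, and `ExpectedFor` names are hypothetical and use plain strings instead of the real `hyperv1` and `corev1` types.

```go
package main

import "fmt"

// Topology is a hypothetical stand-in for the Azure topology values named in the PR.
type Topology string

const (
	Private          Topology = "Private"
	PublicAndPrivate Topology = "PublicAndPrivate"
)

// Expectation records what the PR description says the test waits for
// once the cluster has settled in a given topology.
type Expectation struct {
	ConditionStatus string // expected status of PublicEndpointExposed
	ConditionReason string // expected reason on that condition
	KASServiceType  string // expected type of the KAS service
}

// expectations encodes the two end states taken from the PR description.
var expectations = map[Topology]Expectation{
	Private: {
		ConditionStatus: "False",
		ConditionReason: "TopologyPrivate",
		KASServiceType:  "ClusterIP",
	},
	PublicAndPrivate: {
		ConditionStatus: "True",
		ConditionReason: "SharedIngressConfigured",
		KASServiceType:  "LoadBalancer",
	},
}

// ExpectedFor returns the expectation for a topology, or false if it is unknown.
func ExpectedFor(t Topology) (Expectation, bool) {
	e, ok := expectations[t]
	return e, ok
}

func main() {
	// Walk the same sequence the test exercises: Private, PublicAndPrivate, Private.
	for _, t := range []Topology{Private, PublicAndPrivate, Private} {
		e, _ := ExpectedFor(t)
		fmt.Printf("%s: PublicEndpointExposed=%s (%s), KAS service=%s\n",
			t, e.ConditionStatus, e.ConditionReason, e.KASServiceType)
	}
}
```

In both states the PLS CRs are expected to persist, so they are not part of the per-topology table.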

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-3276

Special notes for your reviewer:

No changes to lifecycle/azure.go — the private variant's LabelFilter already includes self-managed-azure-private
Uses e2eutil.ConditionPredicate[*hyperv1.HostedCluster] rather than a custom condition helper
DeferCleanup uses context.Background() intentionally — cleanup may run after testCtx.Context is cancelled on timeout
The sync.Once-cached guest client in GetHostedClusterClient() is safe across topology transitions because the private endpoint stays active in PublicAndPrivate mode
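
For readers unfamiliar with the caching pattern mentioned in the last note, here is a minimal sketch of a `sync.Once`-cached client. It is an illustration under assumptions, not the actual `GetHostedClusterClient()` implementation: the `client` and `cachedClient` types and the endpoint string are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical stand-in for a guest-cluster client.
type client struct{ endpoint string }

// cachedClient builds its client at most once and hands out the same
// instance on every later call.
type cachedClient struct {
	once   sync.Once
	build  func() (*client, error)
	c      *client
	err    error
	builds int // number of times build actually ran
}

func newCachedClient(build func() (*client, error)) *cachedClient {
	return &cachedClient{build: build}
}

// Get returns the cached client, building it on first use.
// Note that sync.Once also caches a build error: a failed first
// attempt is never retried.
func (cc *cachedClient) Get() (*client, error) {
	cc.once.Do(func() {
		cc.builds++
		cc.c, cc.err = cc.build()
	})
	return cc.c, cc.err
}

func main() {
	cc := newCachedClient(func() (*client, error) {
		// Assumption for the sketch: the client is built against the private endpoint.
		return &client{endpoint: "private-endpoint.example"}, nil
	})
	first, _ := cc.Get()
	second, _ := cc.Get()
	fmt.Println("same instance:", first == second, "builds:", cc.builds)
}
```

Because the cached instance is never rebuilt, the pattern is only safe when the endpoint it was built against stays valid, which is the argument the note makes for the private endpoint in PublicAndPrivate mode.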

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

Summary by CodeRabbit

Tests
- Added a new Azure-only end-to-end HostedCluster test that runs in order for clusters starting with Azure Private topology, validating a Private → PublicAndPrivate → Private transition.
- Verifies KAS external route behavior at each step (public route host populated; private route deleted or restored as expected).
- Confirms Azure PrivateLink Service resources remain present with a non-empty alias throughout transitions.
- Checks API reachability via namespace listings during the workflow and after restoration, and automatically restores the original Private topology on completion.

openshift-merge-bot · 2026-06-11T15:22:33Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot · 2026-06-11T15:22:34Z

@bryan-cox: This pull request references CNTRLPLANE-3276 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-06-11T15:22:36Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-11T15:22:45Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


📝 Walkthrough

This PR adds a new e2e test, AzureEndpointAccessTransitionTest, that validates HostedCluster topology transitions on Azure. The test transitions a cluster from Private to PublicAndPrivate and back to Private, verifying that KAS external routes appear and are deleted appropriately for each topology state, AzurePrivateLinkService CRs persist with non-empty Status.PrivateLinkServiceAlias, and the hosted cluster API remains reachable throughout. Supporting helper predicates match service types, verify PLS alias presence, and poll API reachability via namespace listing. The test is registered immediately after AzurePrivateTopologyTest.

Suggested reviewers

cblecker
Nirshal
csrwng

🚥 Pre-merge checks | ✅ 11

✅ Passed checks (11 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: adding an Azure endpoint access transition e2e test, which matches the core functionality added in the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names in AzureEndpointAccessTransitionTest are static descriptive strings without dynamic values, pod names, timestamps, UUIDs, node names, or IP addresses.
Test Structure And Quality	✅ Passed	Test code follows all quality requirements: single responsibility per test, proper setup/cleanup via BeforeAll/DeferCleanup, timeouts on all cluster operations, meaningful assertion messages with f...
Topology-Aware Scheduling Compatibility	✅ Passed	This PR adds only e2e test code (test/e2e/v2/tests/hosted_cluster_azure_test.go) with no deployment manifests, operator code, or controllers. The custom check's scope explicitly applies to these, m...
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	Test contains no hardcoded IPv4 addresses, IPv4-specific logic, or external connectivity requirements. All operations are cluster-internal Kubernetes API calls.
No-Weak-Crypto	✅ Passed	The PR adds an e2e test for Azure topology transitions with no weak cryptography usage. No MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB algorithms, custom crypto implementations, or non-constant-time s...
Container-Privileges	✅ Passed	PR adds a Go e2e test file with no container/K8s manifests or privilege configurations. Check is not applicable to test code.
No-Sensitive-Data-In-Logs	✅ Passed	Logging statements in AzureEndpointAccessTransitionTest log only non-sensitive data: Route hostnames (DNS names) and Azure PLS aliases (service metadata). No passwords, tokens, API keys, or PII are...


openshift-ci · 2026-06-11T15:23:07Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [bryan-cox]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bryan-cox · 2026-06-11T15:24:38Z

/test e2e-azure-v2-self-managed

coderabbitai

🧹 Nitpick comments (1)

test/e2e/v2/tests/hosted_cluster_azure_test.go (1)

270-280: 💤 Low value

Add WithInterval for polling consistency.

The condition check at lines 253-268 specifies WithInterval(15*time.Second), but this EventuallyObject call omits it. For consistent polling behavior across similar checks in this test, consider adding the interval.

Suggested fix

 			e2eutil.EventuallyObject(GinkgoTB(), ctx, "KAS service is LoadBalancer",
 				func(ctx context.Context) (*corev1.Service, error) {
 					svc := hcpmanifests.KubeAPIServerService(controlPlaneNamespace)
 					err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(svc), svc)
 					return svc, err
 				},
 				[]e2eutil.Predicate[*corev1.Service]{
 					serviceTypePredicate(corev1.ServiceTypeLoadBalancer),
 				},
 				e2eutil.WithTimeout(10*time.Minute),
+				e2eutil.WithInterval(15*time.Second),
 			)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 270 - 280, Add a
consistent polling interval to the EventuallyObject call by including
e2eutil.WithInterval(15*time.Second) alongside the existing e2eutil.WithTimeout
option; locate the call to e2eutil.EventuallyObject (the block returning
*corev1.Service and using serviceTypePredicate) and append the WithInterval
option to its variadic options so it uses the same 15s polling cadence as the
earlier check.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3b2222e8-a730-4864-a308-b08be14365df

📥 Commits

Reviewing files that changed from the base of the PR and between 35c0190 and 8a9c92f.

📒 Files selected for processing (1)

test/e2e/v2/tests/hosted_cluster_azure_test.go

codecov · 2026-06-11T15:30:12Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.75%. Comparing base (392fd5a) to head (5ba7144).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #8718   +/-   ##
=======================================
  Coverage   41.75%   41.75%           
=======================================
  Files         758      758           
  Lines       93981    93981           
=======================================
  Hits        39240    39240           
  Misses      51988    51988           
  Partials     2753     2753
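
As a quick cross-check of the table, the reported percentage is consistent with hits divided by the sum of hits, misses, and partials. The sketch below assumes that formula; the `coveragePercent` helper is hypothetical and only reproduces the arithmetic.

```go
package main

import "fmt"

// coveragePercent returns hits as a percentage of all tracked lines,
// counting partially covered lines as not covered (an assumption that
// matches the figures in the report above).
func coveragePercent(hits, misses, partials int) float64 {
	total := hits + misses + partials
	if total == 0 {
		return 0
	}
	return 100 * float64(hits) / float64(total)
}

func main() {
	// 39240 + 51988 + 2753 = 93981 lines, as in the table.
	fmt.Printf("%.2f%%\n", coveragePercent(39240, 51988, 2753)) // prints 41.75%
}
```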

Flag	Coverage Δ
cmd-support	`35.02% <ø> (ø)`
cpo-hostedcontrolplane	`44.10% <ø> (ø)`
cpo-other	`43.45% <ø> (ø)`
hypershift-operator	`51.82% <ø> (ø)`
other	`31.56% <ø> (ø)`


bryan-cox · 2026-06-11T23:56:10Z

/test e2e-azure-v2-self-managed

coderabbitai

🧹 Nitpick comments (2)

test/e2e/v2/tests/hosted_cluster_azure_test.go (2)

279-309: ⚡ Quick win

Consider verifying HostedCluster condition PublicEndpointExposed returns to False.

The PR description states the test should "wait for PublicEndpointExposed to become False with reason TopologyPrivate" when transitioning back to Private. Currently, the test only verifies the KAS service type returns to ClusterIP. Adding a condition check would ensure the full reconciliation completed.

📋 Suggested addition to verify condition

After line 297, add:

 		e2eutil.EventuallyObject(GinkgoTB(), ctx, "KAS service is ClusterIP",
 			func(ctx context.Context) (*corev1.Service, error) {
 				svc := hcpmanifests.KubeAPIServerService(controlPlaneNamespace)
 				err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(svc), svc)
 				return svc, err
 			},
 			[]e2eutil.Predicate[*corev1.Service]{
 				serviceTypePredicate(corev1.ServiceTypeClusterIP),
 			},
 			e2eutil.WithTimeout(10*time.Minute),
 		)
+
+		e2eutil.EventuallyObject(GinkgoTB(), ctx, "HostedCluster PublicEndpointExposed condition is False",
+			func(ctx context.Context) (*hyperv1.HostedCluster, error) {
+				latest := &hyperv1.HostedCluster{}
+				err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(hc), latest)
+				return latest, err
+			},
+			[]e2eutil.Predicate[*hyperv1.HostedCluster]{
+				e2eutil.ConditionPredicate[*hyperv1.HostedCluster](
+					hyperv1.PublicEndpointExposed,
+					metav1.ConditionFalse,
+					hyperv1.TopologyPrivateReason,
+				),
+			},
+			e2eutil.WithTimeout(10*time.Minute),
+		)
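
The idea behind a condition predicate can be sketched without the HyperShift types. The example below is an illustration under assumptions, not the real `e2eutil.ConditionPredicate`: `Condition`, `HostedCluster`, `Predicate`, and `conditionIs` are hypothetical local definitions that use plain strings.

```go
package main

import "fmt"

// Condition is a hypothetical, simplified status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// HostedCluster is a hypothetical stand-in carrying only conditions.
type HostedCluster struct {
	Conditions []Condition
}

// Predicate reports whether an object is in the wanted state and,
// if not, a message explaining why.
type Predicate[T any] func(T) (bool, string)

// conditionIs builds a predicate that passes only when the named
// condition exists with the wanted status and reason.
func conditionIs(condType, status, reason string) Predicate[*HostedCluster] {
	return func(hc *HostedCluster) (bool, string) {
		for _, c := range hc.Conditions {
			if c.Type != condType {
				continue
			}
			if c.Status == status && c.Reason == reason {
				return true, ""
			}
			return false, fmt.Sprintf("condition %s is %s (%s), want %s (%s)",
				condType, c.Status, c.Reason, status, reason)
		}
		return false, fmt.Sprintf("condition %s not found", condType)
	}
}

func main() {
	wantPrivate := conditionIs("PublicEndpointExposed", "False", "TopologyPrivate")

	stillPublic := &HostedCluster{Conditions: []Condition{
		{Type: "PublicEndpointExposed", Status: "True", Reason: "SharedIngressConfigured"},
	}}
	ok, msg := wantPrivate(stillPublic)
	fmt.Println(ok, msg)

	restored := &HostedCluster{Conditions: []Condition{
		{Type: "PublicEndpointExposed", Status: "False", Reason: "TopologyPrivate"},
	}}
	ok, msg = wantPrivate(restored)
	fmt.Println(ok, msg)
}
```

Checking the reason as well as the status distinguishes a condition that is False because the topology is Private from one that is False for some other cause.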

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 279 - 309, Add
an assertion that the HostedCluster's PublicEndpointExposed condition flips to
False with reason TopologyPrivate after topology revert: use
e2eutil.EventuallyObject (like the existing KAS service check) to fetch the
HostedCluster (variable hc via testCtx.MgmtClient) and poll until
conditions.Get(hc.Status.Conditions, hyperv1.PublicEndpointExposed).Status ==
metav1.ConditionFalse and .Reason == "TopologyPrivate" (or use the helper that
checks condition values), with an appropriate timeout placed after the KAS
service ClusterIP check to ensure full reconciliation completed.

245-277: ⚡ Quick win

Consider verifying HostedCluster condition PublicEndpointExposed.

The PR description states the test should "wait for HostedCluster condition PublicEndpointExposed to become True with reason SharedIngressConfigured." Currently, the test only verifies the KAS service type changes to LoadBalancer. Adding a condition check would provide stronger validation that the control plane reconciliation loop completed successfully.

📋 Suggested addition to verify condition

After line 263, add a condition check using e2eutil.ConditionPredicate:

 		e2eutil.EventuallyObject(GinkgoTB(), ctx, "KAS service is LoadBalancer",
 			func(ctx context.Context) (*corev1.Service, error) {
 				svc := hcpmanifests.KubeAPIServerService(controlPlaneNamespace)
 				err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(svc), svc)
 				return svc, err
 			},
 			[]e2eutil.Predicate[*corev1.Service]{
 				serviceTypePredicate(corev1.ServiceTypeLoadBalancer),
 			},
 			e2eutil.WithTimeout(10*time.Minute),
 		)
+
+		e2eutil.EventuallyObject(GinkgoTB(), ctx, "HostedCluster PublicEndpointExposed condition is True",
+			func(ctx context.Context) (*hyperv1.HostedCluster, error) {
+				latest := &hyperv1.HostedCluster{}
+				err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(hc), latest)
+				return latest, err
+			},
+			[]e2eutil.Predicate[*hyperv1.HostedCluster]{
+				e2eutil.ConditionPredicate[*hyperv1.HostedCluster](
+					hyperv1.PublicEndpointExposed,
+					metav1.ConditionTrue,
+					hyperv1.SharedIngressConfiguredReason,
+				),
+			},
+			e2eutil.WithTimeout(10*time.Minute),
+		)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 245 - 277, Add a
check that waits for the HostedCluster condition "PublicEndpointExposed" to be
True with reason "SharedIngressConfigured" (using the existing HostedCluster
object hc and testCtx.MgmtClient) to ensure control-plane reconciliation
completed; implement this by calling e2eutil.EventuallyObject (or the existing
helper that waits on conditions) with e2eutil.ConditionPredicate for condition
"PublicEndpointExposed" and reason "SharedIngressConfigured" (use
e2eutil.ConditionPredicate(hc, "PublicEndpointExposed", metav1.ConditionTrue,
"SharedIngressConfigured") or the appropriate signature) placed after the KAS
service LoadBalancer check/PLS verification and before verifyAPIReachable.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9e7d7d98-d5d6-4677-8778-67d0d9a0b9a5

📥 Commits

Reviewing files that changed from the base of the PR and between 8a9c92f and 13e0579.

📒 Files selected for processing (1)

test/e2e/v2/tests/hosted_cluster_azure_test.go

bryan-cox · 2026-06-15T12:26:48Z

/test e2e-azure-v2-self-managed

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

test/e2e/v2/tests/hosted_cluster_azure_test.go (1)

246-314: ⚡ Quick win

Add labels to each new It for filterability consistency.

The new specs are under a labeled Context, but the It blocks themselves are unlabeled.

Proposed fix

-		It("should transition from Private to PublicAndPrivate", func() {
+		It("should transition from Private to PublicAndPrivate", Label("azure-endpoint-transition"), func() {
...
-		It("should transition from PublicAndPrivate back to Private", func() {
+		It("should transition from PublicAndPrivate back to Private", Label("azure-endpoint-transition"), func() {
...
-		It("should reach the API server after restoring Private topology", func() {
+		It("should reach the API server after restoring Private topology", Label("azure-endpoint-transition"), func() {

As per coding guidelines, apply labels to both Describe and It blocks for test filtering.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 246 - 314, Add
labels to the three It blocks in the test to maintain filtering consistency with
the parent Context block. The It blocks "should transition from Private to
PublicAndPrivate", "should transition from PublicAndPrivate back to Private",
and "should reach the API server after restoring Private topology" need to be
labeled using the Label() function according to coding guidelines for test
filtering. Apply appropriate labels that categorize these topology transition
tests consistently.

Source: Coding guidelines


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 82fff169-72ca-4e38-910e-d973910bb5f5

📥 Commits

Reviewing files that changed from the base of the PR and between 13e0579 and 57c8e8e.

📒 Files selected for processing (1)

test/e2e/v2/tests/hosted_cluster_azure_test.go

coderabbitai · 2026-06-15T12:33:23Z

+			DeferCleanup(func() {
+				restoreErr := e2eutil.UpdateObject(GinkgoTB(), context.Background(), testCtx.MgmtClient, hc, func(obj *hyperv1.HostedCluster) {
+					obj.Spec.Platform.Azure.Topology = hyperv1.AzureTopologyPrivate
+				})
+				if restoreErr != nil {
+					GinkgoTB().Logf("WARNING: failed to restore Private topology: %v", restoreErr)
+				}
+			})


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail cleanup if topology restore does not succeed.

The cleanup currently logs and continues on restore failure, which can leave shared cluster state mutated and cause cascading failures in later specs.

Proposed fix

 			DeferCleanup(func() {
-				restoreErr := e2eutil.UpdateObject(GinkgoTB(), context.Background(), testCtx.MgmtClient, hc, func(obj *hyperv1.HostedCluster) {
+				cleanupCtx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
+				defer cancel()
+				restoreErr := e2eutil.UpdateObject(GinkgoTB(), cleanupCtx, testCtx.MgmtClient, hc, func(obj *hyperv1.HostedCluster) {
 					obj.Spec.Platform.Azure.Topology = hyperv1.AzureTopologyPrivate
 				})
-				if restoreErr != nil {
-					GinkgoTB().Logf("WARNING: failed to restore Private topology: %v", restoreErr)
-				}
+				Expect(restoreErr).NotTo(HaveOccurred(), "cleanup: failed to restore Private topology")
 			})

As per coding guidelines, when mutating cluster state, restore it via DeferCleanup in a fail-safe way on all exit paths.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 236 - 243, The DeferCleanup function currently logs a warning when the topology restore fails (when restoreErr is not nil) but continues execution, which can leave the cluster state mutated and cause cascading failures in subsequent tests. Modify the error handling block where restoreErr is checked to fail the test using an appropriate Ginkgo assertion or failure method (such as GinkgoTB().Fail) instead of just logging a warning, ensuring that any failure to restore the Azure topology causes the test cleanup to fail and prevent state corruption from affecting other specs.

Source: Coding guidelines

coderabbitai · 2026-06-15T12:33:23Z

+				g.Expect(svc.Spec.Type).NotTo(Equal(corev1.ServiceTypeLoadBalancer),
+					"KAS Azure LB service should no longer be LoadBalancer after restoring Private topology")
+			}).WithTimeout(10 * time.Minute).WithPolling(10 * time.Second).Should(Succeed(),


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten the restored-service assertion to expected states only.

The current check passes for any non-LoadBalancer type (e.g., NodePort/ExternalName), which can hide regressions. After restore, success should be NotFound or ClusterIP.

Proposed fix

 				g.Expect(err).NotTo(HaveOccurred(), "failed to get KAS Azure LB service")
-				g.Expect(svc.Spec.Type).NotTo(Equal(corev1.ServiceTypeLoadBalancer),
-					"KAS Azure LB service should no longer be LoadBalancer after restoring Private topology")
+				g.Expect(svc.Spec.Type).To(Equal(corev1.ServiceTypeClusterIP),
+					"KAS Azure LB service should be ClusterIP when restoring Private topology")
 			}).WithTimeout(10 * time.Minute).WithPolling(10 * time.Second).Should(Succeed(),
-				"KAS Azure LB service should be deleted or no longer LoadBalancer")
+				"KAS Azure LB service should be deleted or ClusterIP")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 295 - 297, The assertion for the service type check is too loose and only verifies that the type is not LoadBalancer, allowing false positives from other unintended types like NodePort or ExternalName. Replace the NotTo(Equal(corev1.ServiceTypeLoadBalancer)) check with explicit assertions that verify the service is in one of the expected states after restore: either the service should not be found (NotFound error) or it should have type ClusterIP. This ensures the assertion catches actual regressions instead of passing for any non-LoadBalancer type.

bryan-cox · 2026-06-15T15:28:19Z

/test e2e-azure-v2-self-managed

bryan-cox · 2026-06-15T18:33:48Z

/test e2e-azure-v2-self-managed

bryan-cox · 2026-06-16T11:57:10Z

/test pull-ci-openshift-hypershift-main-e2e-azure-v2-self-managed

bryan-cox · 2026-06-16T12:11:06Z

/test e2e-azure-v2-self-managed

bryan-cox · 2026-06-16T15:41:56Z

/test e2e-azure-v2-self-managed

bryan-cox · 2026-06-16T15:45:08Z

/test e2e-azure-v2-self-managed

Add a v2 lifecycle test that validates transitioning a HostedCluster from Private to PublicAndPrivate topology and back on Azure self-managed clusters. The test runs on the existing private cluster variant after AzurePrivateTopologyTest completes.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

bryan-cox · 2026-06-16T19:50:21Z

/test e2e-azure-v2-self-managed

openshift-ci · 2026-06-16T21:35:53Z

@bryan-cox: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-azure-v2-self-managed	`5ba7144`	link	true	`/test e2e-azure-v2-self-managed`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hypershift-jira-solve-ci · 2026-06-16T22:07:31Z

The dump ran successfully about 40 minutes later, confirming that the API server outage was transient. The error timing shows what happened:

The error logs show at 20:39:11, the management cluster's kube-apiserver at 20.221.20.125:7443 became unreachable. The script was trying to start new watches for the remaining non-Available clusters, but got "connection refused". This is a transient management cluster kube-apiserver outage that happened right at the moment the create-guests script was checking cluster availability.

The fact that the management cluster was already stressed (monitoring operator unavailable for 14+ minutes during install) and then had to handle 6 hosted clusters being created simultaneously likely contributed to API server instability.

Now I have all the evidence I need. Here's the complete analysis:

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-e2e-azure-v2-self-managed
Build ID: 2066971702260142080
Target: e2e-azure-v2-self-managed
PR: CNTRLPLANE-3276: Add Azure endpoint access transition e2e test #8718

Test Failure Analysis

Error

ERROR: cluster private-a62f06854a (private) did not become Available: starting watch for clusters/private-a62f06854a: Get "https://20.221.20.125:7443/apis/hypershift.openshift.io/v1beta1/namespaces/clusters/hostedclusters?fieldSelector=metadata.name%3Dprivate-a62f06854a&watch=true": dial tcp 20.221.20.125:7443: connect: connection refused

ERROR: cluster oauth-lb-a62f06854a (oauth-lb) did not become Available: [same connection refused]
ERROR: cluster upgrade-a62f06854a (upgrade) did not become Available: [same connection refused]
ERROR: cluster autoscaling-a62f06854a (autoscaling) did not become Available: [same connection refused]

Error: one or more clusters did not become Available

Summary

The hypershift-azure-create-selfmanaged-guests pre-phase step failed because 4 of 6 hosted clusters did not reach Available status. The script created 6 guest clusters in parallel on the self-managed management cluster and then watched for them to become Available. During this watch period, the management cluster's kube-apiserver at 20.221.20.125:7443 became transiently unreachable (connection refused). The 2 clusters that had already become Available before the outage (public and external-oidc) succeeded, while the remaining 4 clusters' watchers failed fatally when attempting to reconnect to the API server. This is an infrastructure-level transient failure unrelated to the PR's code changes.

Root Cause

The root cause is a transient management cluster kube-apiserver outage during the guest cluster creation phase, combined with the create-guests script's lack of resilience to API server interruptions.

Detailed chain of events:

Management cluster instability from install (19:50–20:25): The self-managed management cluster itself took abnormally long to install. The monitoring cluster operator was unavailable for over 14 minutes during ClusterVersion installation (from 20:11 through 20:25), indicating the management cluster was under significant resource pressure from the start.
Guest cluster creation overload (20:32–20:34): At 20:32:07, the script launched creation of 6 hosted clusters simultaneously (public, private, oauth-lb, upgrade, autoscaling, external-oidc). Each cluster creation required: Azure resource groups, NSGs, VNets, DNS zones, role assignments, DNS zone links, public IPs, and Kubernetes resources — all hitting the management cluster's kube-apiserver concurrently.
Prolonged wait for cluster availability (20:34–20:39): All 6 clusters remained in Available=False, VersionState=Partial for ~5 minutes. The management cluster kube-apiserver was serving watch requests for all 6 clusters simultaneously while also processing the hosted control plane reconciliation workload.
API server outage (~20:38–20:39): The management cluster's kube-apiserver at 20.221.20.125:7443 became transiently unavailable. This likely resulted from resource pressure (CPU/memory exhaustion) on the management cluster nodes due to running 6 hosted control planes simultaneously on a freshly-installed and already-stressed cluster.
Fatal watcher failure (20:39:11): The create-guests script's watch connections to the API server were severed. When the watchers attempted to re-establish their connections, they received connection refused errors. The script treated this as a terminal failure rather than retrying, so the 4 clusters that were still pending were reported as failed. Only public (became Available at 20:38:12) and external-oidc (became Available at 20:37:58) had already succeeded before the outage.
Post-failure API server recovery: The dump-management-cluster step ran successfully ~40 minutes later (21:18–21:22), confirming the API server recovered and the outage was transient.

This failure is NOT caused by PR #8718's changes. It is a transient infrastructure failure on the self-managed management cluster.

Recommendations

Retry the job — This is a transient infrastructure failure unrelated to the PR's code changes. The management cluster's kube-apiserver became temporarily unreachable under load, which is a known risk in self-managed Azure environments running 6 hosted clusters simultaneously.
Consider filing a tracking issue for create-guests script resilience — The create-guests script's watcher does not retry on connection refused errors. It would be more robust to implement exponential backoff retry logic when the management cluster API server is temporarily unavailable, rather than treating a transient connection failure as terminal.
Monitor management cluster sizing — The management cluster took 14+ minutes with the monitoring operator unavailable during install, indicating it was resource-constrained. If this pattern recurs, the management cluster node sizing or resource quotas may need adjustment for the self-managed Azure test profile.

Evidence

Evidence	Detail
Failed Step	`hypershift-azure-create-selfmanaged-guests` (pre phase)
Step Duration	489 seconds (~8 min 9s)
Clusters Created	6 (public, private, oauth-lb, upgrade, autoscaling, external-oidc)
Clusters Available	2 of 6 (public at 20:38:12, external-oidc at 20:37:58)
Clusters Failed	4 (private, oauth-lb, upgrade, autoscaling)
Failure Time	2026-06-16T20:39:11Z
Error Type	`dial tcp 20.221.20.125:7443: connect: connection refused`
API Server IP	20.221.20.125:7443 (management cluster kube-apiserver)
Mgmt Install Issue	Monitoring operator unavailable for 14+ minutes during ClusterVersion install
Post-failure Recovery	dump-management-cluster ran successfully at 21:18–21:22 (API server was back)
k8sgpt Analysis	No problems detected (ran after recovery)
PR Relation	None — transient infrastructure failure, not caused by PR #8718 code changes

bryan-cox · 2026-06-17T09:41:23Z

/test e2e-azure-v2-self-managed

openshift-ci-robot added the jira/valid-reference label Jun 11, 2026

openshift-ci Bot added the do-not-merge/work-in-progress label Jun 11, 2026

openshift-ci Bot added the do-not-merge/needs-area label Jun 11, 2026

openshift-ci Bot added the approved, area/platform/azure, and area/testing labels and removed the do-not-merge/needs-area label Jun 11, 2026

coderabbitai Bot reviewed Jun 11, 2026

hypershift-jira-solve-ci Bot mentioned this pull request Jun 11, 2026

fix: OCPBUGS-81675: hcco: merge additionalTrustBundle into proxy trusted CA bundle #8520

Open

bryan-cox force-pushed the CNTRLPLANE-3276 branch from 8a9c92f to 13e0579 June 11, 2026 23:56

coderabbitai Bot reviewed Jun 12, 2026

bryan-cox force-pushed the CNTRLPLANE-3276 branch from 13e0579 to 57c8e8e June 15, 2026 12:26

coderabbitai Bot reviewed Jun 15, 2026

bryan-cox force-pushed the CNTRLPLANE-3276 branch from 57c8e8e to 4f67e09 June 16, 2026 11:51

bryan-cox force-pushed the CNTRLPLANE-3276 branch from 4f67e09 to c562f67 June 16, 2026 12:02

bryan-cox force-pushed the CNTRLPLANE-3276 branch from c562f67 to 3bf01a1 June 16, 2026 15:41

bryan-cox force-pushed the CNTRLPLANE-3276 branch from 3bf01a1 to 0593853 June 16, 2026 15:44

bryan-cox force-pushed the CNTRLPLANE-3276 branch from 0593853 to 5ba7144 June 16, 2026 19:50
