Skip to content

CNTRLPLANE-3276: Add Azure endpoint access transition e2e test#8718

Draft
bryan-cox wants to merge 1 commit into
openshift:mainfrom
bryan-cox:CNTRLPLANE-3276
Draft

CNTRLPLANE-3276: Add Azure endpoint access transition e2e test#8718
bryan-cox wants to merge 1 commit into
openshift:mainfrom
bryan-cox:CNTRLPLANE-3276

Conversation

@bryan-cox

@bryan-cox bryan-cox commented Jun 11, 2026

Copy link
Copy Markdown
Member

What this PR does / why we need it:

Adds a v2 e2e lifecycle test that validates transitioning a HostedCluster between Private and PublicAndPrivate topology on Azure self-managed clusters. The test verifies:

  1. Private → PublicAndPrivate: Updates topology, waits for PublicEndpointExposed condition to become True with reason SharedIngressConfigured, confirms KAS service becomes LoadBalancer, validates PLS CRs persist, and checks API reachability.
  2. PublicAndPrivate → Private: Restores topology, waits for PublicEndpointExposed condition to become False with reason TopologyPrivate, confirms KAS service returns to ClusterIP, validates PLS CRs persist.
  3. Private connectivity verification: Confirms API server is reachable via the private path after restore.

The test runs on the existing private cluster variant (labeled self-managed-azure-private) after AzurePrivateTopologyTest completes. DeferCleanup restores Private topology if the test fails mid-transition.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-3276

Special notes for your reviewer:

  • No changes to lifecycle/azure.go — the private variant's LabelFilter already includes self-managed-azure-private
  • Uses e2eutil.ConditionPredicate[*hyperv1.HostedCluster] rather than a custom condition helper
  • DeferCleanup uses context.Background() intentionally — cleanup may run after testCtx.Context is cancelled on timeout
  • The sync.Once-cached guest client in GetHostedClusterClient() is safe across topology transitions because the private endpoint stays active in PublicAndPrivate mode

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
    • Added a new Azure-only end-to-end HostedCluster test that runs in order for clusters starting with Azure Private topology, validating a Private → PublicAndPrivate → Private transition.
    • Verifies KAS external route behavior at each step (public route host populated; private route deleted or restored as expected).
    • Confirms Azure PrivateLink Service resources remain present with a non-empty alias throughout transitions.
    • Checks API reachability via namespace listings during the workflow and after restoration, and automatically restores the original Private topology on completion.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 11, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 11, 2026

Copy link
Copy Markdown

@bryan-cox: This pull request references CNTRLPLANE-3276 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Adds a v2 e2e lifecycle test that validates transitioning a HostedCluster between Private and PublicAndPrivate topology on Azure self-managed clusters. The test verifies:

  1. Private → PublicAndPrivate: Updates topology, waits for PublicEndpointExposed condition to become True with reason SharedIngressConfigured, confirms KAS service becomes LoadBalancer, validates PLS CRs persist, and checks API reachability.
  2. PublicAndPrivate → Private: Restores topology, waits for PublicEndpointExposed condition to become False with reason TopologyPrivate, confirms KAS service returns to ClusterIP, validates PLS CRs persist.
  3. Private connectivity verification: Confirms API server is reachable via the private path after restore.

The test runs on the existing private cluster variant (labeled self-managed-azure-private) after AzurePrivateTopologyTest completes. DeferCleanup restores Private topology if the test fails mid-transition.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/CNTRLPLANE-3276

Special notes for your reviewer:

  • No changes to lifecycle/azure.go — the private variant's LabelFilter already includes self-managed-azure-private
  • Uses e2eutil.ConditionPredicate[*hyperv1.HostedCluster] rather than a custom condition helper
  • DeferCleanup uses context.Background() intentionally — cleanup may run after testCtx.Context is cancelled on timeout
  • The sync.Once-cached guest client in GetHostedClusterClient() is safe across topology transitions because the private endpoint stays active in PublicAndPrivate mode

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 11, 2026
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a new e2e test, AzureEndpointAccessTransitionTest, that validates HostedCluster topology transitions on Azure. The test transitions a cluster from Private to PublicAndPrivate and back to Private, verifying that KAS external routes appear and are deleted appropriately for each topology state, AzurePrivateLinkService CRs persist with non-empty Status.PrivateLinkServiceAlias, and the hosted cluster API remains reachable throughout. Supporting helper predicates match service types, verify PLS alias presence, and poll API reachability via namespace listing. The test is registered immediately after AzurePrivateTopologyTest.

Suggested reviewers

  • cblecker
  • Nirshal
  • csrwng
🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding an Azure endpoint access transition e2e test, which matches the core functionality added in the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in AzureEndpointAccessTransitionTest are static descriptive strings without dynamic values, pod names, timestamps, UUIDs, node names, or IP addresses.
Test Structure And Quality ✅ Passed Test code follows all quality requirements: single responsibility per test, proper setup/cleanup via BeforeAll/DeferCleanup, timeouts on all cluster operations, meaningful assertion messages with f...
Topology-Aware Scheduling Compatibility ✅ Passed This PR adds only e2e test code (test/e2e/v2/tests/hosted_cluster_azure_test.go) with no deployment manifests, operator code, or controllers. The custom check's scope explicitly applies to these, m...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Test contains no hardcoded IPv4 addresses, IPv4-specific logic, or external connectivity requirements. All operations are cluster-internal Kubernetes API calls.
No-Weak-Crypto ✅ Passed The PR adds an e2e test for Azure topology transitions with no weak cryptography usage. No MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB algorithms, custom crypto implementations, or non-constant-time s...
Container-Privileges ✅ Passed PR adds a Go e2e test file with no container/K8s manifests or privilege configurations. Check is not applicable to test code.
No-Sensitive-Data-In-Logs ✅ Passed Logging statements in AzureEndpointAccessTransitionTest log only non-sensitive data: Route hostnames (DNS names) and Azure PLS aliases (service metadata). No passwords, tokens, API keys, or PII are...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Jun 11, 2026
@bryan-cox

Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/e2e/v2/tests/hosted_cluster_azure_test.go (1)

270-280: 💤 Low value

Add WithInterval for polling consistency.

The condition check at lines 253-268 specifies WithInterval(15*time.Second), but this EventuallyObject call omits it. For consistent polling behavior across similar checks in this test, consider adding the interval.

Suggested fix
 			e2eutil.EventuallyObject(GinkgoTB(), ctx, "KAS service is LoadBalancer",
 				func(ctx context.Context) (*corev1.Service, error) {
 					svc := hcpmanifests.KubeAPIServerService(controlPlaneNamespace)
 					err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(svc), svc)
 					return svc, err
 				},
 				[]e2eutil.Predicate[*corev1.Service]{
 					serviceTypePredicate(corev1.ServiceTypeLoadBalancer),
 				},
 				e2eutil.WithTimeout(10*time.Minute),
+				e2eutil.WithInterval(15*time.Second),
 			)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 270 - 280, Add a
consistent polling interval to the EventuallyObject call by including
e2eutil.WithInterval(15*time.Second) alongside the existing e2eutil.WithTimeout
option; locate the call to e2eutil.EventuallyObject (the block returning
*corev1.Service and using serviceTypePredicate) and append the WithInterval
option to its variadic options so it uses the same 15s polling cadence as the
earlier check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 270-280: Add a consistent polling interval to the EventuallyObject
call by including e2eutil.WithInterval(15*time.Second) alongside the existing
e2eutil.WithTimeout option; locate the call to e2eutil.EventuallyObject (the
block returning *corev1.Service and using serviceTypePredicate) and append the
WithInterval option to its variadic options so it uses the same 15s polling
cadence as the earlier check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3b2222e8-a730-4864-a308-b08be14365df

📥 Commits

Reviewing files that changed from the base of the PR and between 35c0190 and 8a9c92f.

📒 Files selected for processing (1)
  • test/e2e/v2/tests/hosted_cluster_azure_test.go

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.75%. Comparing base (392fd5a) to head (5ba7144).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8718   +/-   ##
=======================================
  Coverage   41.75%   41.75%           
=======================================
  Files         758      758           
  Lines       93981    93981           
=======================================
  Hits        39240    39240           
  Misses      51988    51988           
  Partials     2753     2753           
Flag Coverage Δ
cmd-support 35.02% <ø> (ø)
cpo-hostedcontrolplane 44.10% <ø> (ø)
cpo-other 43.45% <ø> (ø)
hypershift-operator 51.82% <ø> (ø)
other 31.56% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bryan-cox

Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
test/e2e/v2/tests/hosted_cluster_azure_test.go (2)

279-309: ⚡ Quick win

Consider verifying HostedCluster condition PublicEndpointExposed returns to False.

The PR description states the test should "wait for PublicEndpointExposed to become False with reason TopologyPrivate" when transitioning back to Private. Currently, the test only verifies the KAS service type returns to ClusterIP. Adding a condition check would ensure the full reconciliation completed.

📋 Suggested addition to verify condition

After line 297, add:

 		e2eutil.EventuallyObject(GinkgoTB(), ctx, "KAS service is ClusterIP",
 			func(ctx context.Context) (*corev1.Service, error) {
 				svc := hcpmanifests.KubeAPIServerService(controlPlaneNamespace)
 				err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(svc), svc)
 				return svc, err
 			},
 			[]e2eutil.Predicate[*corev1.Service]{
 				serviceTypePredicate(corev1.ServiceTypeClusterIP),
 			},
 			e2eutil.WithTimeout(10*time.Minute),
 		)
+
+		e2eutil.EventuallyObject(GinkgoTB(), ctx, "HostedCluster PublicEndpointExposed condition is False",
+			func(ctx context.Context) (*hyperv1.HostedCluster, error) {
+				latest := &hyperv1.HostedCluster{}
+				err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(hc), latest)
+				return latest, err
+			},
+			[]e2eutil.Predicate[*hyperv1.HostedCluster]{
+				e2eutil.ConditionPredicate[*hyperv1.HostedCluster](
+					hyperv1.PublicEndpointExposed,
+					metav1.ConditionFalse,
+					hyperv1.TopologyPrivateReason,
+				),
+			},
+			e2eutil.WithTimeout(10*time.Minute),
+		)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 279 - 309, Add
an assertion that the HostedCluster's PublicEndpointExposed condition flips to
False with reason TopologyPrivate after topology revert: use
e2eutil.EventuallyObject (like the existing KAS service check) to fetch the
HostedCluster (variable hc via testCtx.MgmtClient) and poll until
conditions.Get(hc.Status.Conditions, hyperv1.PublicEndpointExposed).Status ==
corev1.ConditionFalse and .Reason == "TopologyPrivate" (or use the helper that
checks condition values), with an appropriate timeout placed after the KAS
service ClusterIP check to ensure full reconciliation completed.

245-277: ⚡ Quick win

Consider verifying HostedCluster condition PublicEndpointExposed.

The PR description states the test should "wait for HostedCluster condition PublicEndpointExposed to become True with reason SharedIngressConfigured." Currently, the test only verifies the KAS service type changes to LoadBalancer. Adding a condition check would provide stronger validation that the control plane reconciliation loop completed successfully.

📋 Suggested addition to verify condition

After line 263, add a condition check using e2eutil.ConditionPredicate:

 		e2eutil.EventuallyObject(GinkgoTB(), ctx, "KAS service is LoadBalancer",
 			func(ctx context.Context) (*corev1.Service, error) {
 				svc := hcpmanifests.KubeAPIServerService(controlPlaneNamespace)
 				err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(svc), svc)
 				return svc, err
 			},
 			[]e2eutil.Predicate[*corev1.Service]{
 				serviceTypePredicate(corev1.ServiceTypeLoadBalancer),
 			},
 			e2eutil.WithTimeout(10*time.Minute),
 		)
+
+		e2eutil.EventuallyObject(GinkgoTB(), ctx, "HostedCluster PublicEndpointExposed condition is True",
+			func(ctx context.Context) (*hyperv1.HostedCluster, error) {
+				latest := &hyperv1.HostedCluster{}
+				err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(hc), latest)
+				return latest, err
+			},
+			[]e2eutil.Predicate[*hyperv1.HostedCluster]{
+				e2eutil.ConditionPredicate[*hyperv1.HostedCluster](
+					hyperv1.PublicEndpointExposed,
+					metav1.ConditionTrue,
+					hyperv1.SharedIngressConfiguredReason,
+				),
+			},
+			e2eutil.WithTimeout(10*time.Minute),
+		)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 245 - 277, Add a
check that waits for the HostedCluster condition "PublicEndpointExposed" to be
True with reason "SharedIngressConfigured" (using the existing HostedCluster
object hc and testCtx.MgmtClient) to ensure control-plane reconciliation
completed; implement this by calling e2eutil.EventuallyObject (or the existing
helper that waits on conditions) with e2eutil.ConditionPredicate for condition
"PublicEndpointExposed" and reason "SharedIngressConfigured" (use
e2eutil.ConditionPredicate(hc, "PublicEndpointExposed", corev1.ConditionTrue,
"SharedIngressConfigured") or the appropriate signature) placed after the KAS
service LoadBalancer check/PLS verification and before verifyAPIReachable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 279-309: Add an assertion that the HostedCluster's
PublicEndpointExposed condition flips to False with reason TopologyPrivate after
topology revert: use e2eutil.EventuallyObject (like the existing KAS service
check) to fetch the HostedCluster (variable hc via testCtx.MgmtClient) and poll
until conditions.Get(hc.Status.Conditions, hyperv1.PublicEndpointExposed).Status
== corev1.ConditionFalse and .Reason == "TopologyPrivate" (or use the helper
that checks condition values), with an appropriate timeout placed after the KAS
service ClusterIP check to ensure full reconciliation completed.
- Around line 245-277: Add a check that waits for the HostedCluster condition
"PublicEndpointExposed" to be True with reason "SharedIngressConfigured" (using
the existing HostedCluster object hc and testCtx.MgmtClient) to ensure
control-plane reconciliation completed; implement this by calling
e2eutil.EventuallyObject (or the existing helper that waits on conditions) with
e2eutil.ConditionPredicate for condition "PublicEndpointExposed" and reason
"SharedIngressConfigured" (use e2eutil.ConditionPredicate(hc,
"PublicEndpointExposed", corev1.ConditionTrue, "SharedIngressConfigured") or the
appropriate signature) placed after the KAS service LoadBalancer check/PLS
verification and before verifyAPIReachable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9e7d7d98-d5d6-4677-8778-67d0d9a0b9a5

📥 Commits

Reviewing files that changed from the base of the PR and between 8a9c92f and 13e0579.

📒 Files selected for processing (1)
  • test/e2e/v2/tests/hosted_cluster_azure_test.go

@bryan-cox

Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/e2e/v2/tests/hosted_cluster_azure_test.go (1)

246-314: ⚡ Quick win

Add labels to each new It for filterability consistency.

The new specs are under a labeled Context, but the It blocks themselves are unlabeled.

Proposed fix
-		It("should transition from Private to PublicAndPrivate", func() {
+		It("should transition from Private to PublicAndPrivate", Label("azure-endpoint-transition"), func() {
...
-		It("should transition from PublicAndPrivate back to Private", func() {
+		It("should transition from PublicAndPrivate back to Private", Label("azure-endpoint-transition"), func() {
...
-		It("should reach the API server after restoring Private topology", func() {
+		It("should reach the API server after restoring Private topology", Label("azure-endpoint-transition"), func() {

As per coding guidelines, apply labels to both Describe and It blocks for test filtering.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 246 - 314, Add
labels to the three It blocks in the test to maintain filtering consistency with
the parent Context block. The It blocks "should transition from Private to
PublicAndPrivate", "should transition from PublicAndPrivate back to Private",
and "should reach the API server after restoring Private topology" need to be
labeled using the Label() function according to coding guidelines for test
filtering. Apply appropriate labels that categorize these topology transition
tests consistently.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 295-297: The assertion for the service type check is too loose and
only verifies that the type is not LoadBalancer, allowing false positives from
other unintended types like NodePort or ExternalName. Replace the
NotTo(Equal(corev1.ServiceTypeLoadBalancer)) check with explicit assertions that
verify the service is in one of the expected states after restore: either the
service should not be found (NotFound error) or it should have type ClusterIP.
This ensures the assertion catches actual regressions instead of passing for any
non-LoadBalancer type.
- Around line 236-243: The DeferCleanup function currently logs a warning when
the topology restore fails (when restoreErr is not nil) but continues execution,
which can leave the cluster state mutated and cause cascading failures in
subsequent tests. Modify the error handling block where restoreErr is checked to
fail the test using an appropriate Ginkgo assertion or failure method (such as
GinkgoTB().Fail) instead of just logging a warning, ensuring that any failure to
restore the Azure topology causes the test cleanup to fail and prevent state
corruption from affecting other specs.

---

Nitpick comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 246-314: Add labels to the three It blocks in the test to maintain
filtering consistency with the parent Context block. The It blocks "should
transition from Private to PublicAndPrivate", "should transition from
PublicAndPrivate back to Private", and "should reach the API server after
restoring Private topology" need to be labeled using the Label() function
according to coding guidelines for test filtering. Apply appropriate labels that
categorize these topology transition tests consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 82fff169-72ca-4e38-910e-d973910bb5f5

📥 Commits

Reviewing files that changed from the base of the PR and between 13e0579 and 57c8e8e.

📒 Files selected for processing (1)
  • test/e2e/v2/tests/hosted_cluster_azure_test.go

Comment on lines +236 to +243
DeferCleanup(func() {
restoreErr := e2eutil.UpdateObject(GinkgoTB(), context.Background(), testCtx.MgmtClient, hc, func(obj *hyperv1.HostedCluster) {
obj.Spec.Platform.Azure.Topology = hyperv1.AzureTopologyPrivate
})
if restoreErr != nil {
GinkgoTB().Logf("WARNING: failed to restore Private topology: %v", restoreErr)
}
})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail cleanup if topology restore does not succeed.

The cleanup currently logs and continues on restore failure, which can leave shared cluster state mutated and cause cascading failures in later specs.

Proposed fix
 			DeferCleanup(func() {
-				restoreErr := e2eutil.UpdateObject(GinkgoTB(), context.Background(), testCtx.MgmtClient, hc, func(obj *hyperv1.HostedCluster) {
+				cleanupCtx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
+				defer cancel()
+				restoreErr := e2eutil.UpdateObject(GinkgoTB(), cleanupCtx, testCtx.MgmtClient, hc, func(obj *hyperv1.HostedCluster) {
 					obj.Spec.Platform.Azure.Topology = hyperv1.AzureTopologyPrivate
 				})
-				if restoreErr != nil {
-					GinkgoTB().Logf("WARNING: failed to restore Private topology: %v", restoreErr)
-				}
+				Expect(restoreErr).NotTo(HaveOccurred(), "cleanup: failed to restore Private topology")
 			})

As per coding guidelines, when mutating cluster state, restore it via DeferCleanup in a fail-safe way on all exit paths.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 236 - 243, The
DeferCleanup function currently logs a warning when the topology restore fails
(when restoreErr is not nil) but continues execution, which can leave the
cluster state mutated and cause cascading failures in subsequent tests. Modify
the error handling block where restoreErr is checked to fail the test using an
appropriate Ginkgo assertion or failure method (such as GinkgoTB().Fail) instead
of just logging a warning, ensuring that any failure to restore the Azure
topology causes the test cleanup to fail and prevent state corruption from
affecting other specs.

Source: Coding guidelines

Comment on lines +295 to +297
g.Expect(svc.Spec.Type).NotTo(Equal(corev1.ServiceTypeLoadBalancer),
"KAS Azure LB service should no longer be LoadBalancer after restoring Private topology")
}).WithTimeout(10 * time.Minute).WithPolling(10 * time.Second).Should(Succeed(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten the restored-service assertion to expected states only.

The current check passes for any non-LoadBalancer type (e.g., NodePort/ExternalName), which can hide regressions. After restore, success should be NotFound or ClusterIP.

Proposed fix
 				g.Expect(err).NotTo(HaveOccurred(), "failed to get KAS Azure LB service")
-				g.Expect(svc.Spec.Type).NotTo(Equal(corev1.ServiceTypeLoadBalancer),
-					"KAS Azure LB service should no longer be LoadBalancer after restoring Private topology")
+				g.Expect(svc.Spec.Type).To(Equal(corev1.ServiceTypeClusterIP),
+					"KAS Azure LB service should be ClusterIP when restoring Private topology")
 			}).WithTimeout(10 * time.Minute).WithPolling(10 * time.Second).Should(Succeed(),
-				"KAS Azure LB service should be deleted or no longer LoadBalancer")
+				"KAS Azure LB service should be deleted or ClusterIP")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 295 - 297, The
assertion for the service type check is too loose and only verifies that the
type is not LoadBalancer, allowing false positives from other unintended types
like NodePort or ExternalName. Replace the
NotTo(Equal(corev1.ServiceTypeLoadBalancer)) check with explicit assertions that
verify the service is in one of the expected states after restore: either the
service should not be found (NotFound error) or it should have type ClusterIP.
This ensures the assertion catches actual regressions instead of passing for any
non-LoadBalancer type.

@bryan-cox

Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

1 similar comment
@bryan-cox

Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

@bryan-cox

Copy link
Copy Markdown
Member Author

/test pull-ci-openshift-hypershift-main-e2e-azure-v2-self-managed

@bryan-cox

Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

@bryan-cox

Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

@bryan-cox

Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

Add a v2 lifecycle test that validates transitioning a HostedCluster
from Private to PublicAndPrivate topology and back on Azure
self-managed clusters. The test runs on the existing private cluster
variant after AzurePrivateTopologyTest completes.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@bryan-cox

Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@bryan-cox: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure-v2-self-managed 5ba7144 link true /test e2e-azure-v2-self-managed

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown

The dump ran ~40 minutes later successfully, confirming the API server outage was transient. Now I have the full picture. Let me verify exactly what happened by looking at the exact error timing:

The error logs show at 20:39:11, the management cluster's kube-apiserver at 20.221.20.125:7443 became unreachable. The script was trying to start new watches for the remaining non-Available clusters, but got "connection refused". This is a transient management cluster kube-apiserver outage that happened right at the moment the create-guests script was checking cluster availability.

The fact that the management cluster was already stressed (monitoring operator unavailable for 14+ minutes during install) and then had to handle 6 hosted clusters being created simultaneously likely contributed to API server instability.

Now I have all the evidence I need. Here's the complete analysis:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

ERROR: cluster private-a62f06854a (private) did not become Available: starting watch for clusters/private-a62f06854a: Get "https://20.221.20.125:7443/apis/hypershift.openshift.io/v1beta1/namespaces/clusters/hostedclusters?fieldSelector=metadata.name%3Dprivate-a62f06854a&watch=true": dial tcp 20.221.20.125:7443: connect: connection refused

ERROR: cluster oauth-lb-a62f06854a (oauth-lb) did not become Available: [same connection refused]
ERROR: cluster upgrade-a62f06854a (upgrade) did not become Available: [same connection refused]
ERROR: cluster autoscaling-a62f06854a (autoscaling) did not become Available: [same connection refused]

Error: one or more clusters did not become Available

Summary

The hypershift-azure-create-selfmanaged-guests pre-phase step failed because 4 of 6 hosted clusters did not reach Available status. The script created 6 guest clusters in parallel on the self-managed management cluster and then watched for them to become Available. During this watch period, the management cluster's kube-apiserver at 20.221.20.125:7443 became transiently unreachable (connection refused). The 2 clusters that had already become Available before the outage (public and external-oidc) succeeded, while the remaining 4 clusters' watchers failed fatally when attempting to reconnect to the API server. This is an infrastructure-level transient failure unrelated to the PR's code changes.

Root Cause

The root cause is a transient management cluster kube-apiserver outage during the guest cluster creation phase, combined with the create-guests script's lack of resilience to API server interruptions.

Detailed chain of events:

  1. Management cluster instability from install (19:50–20:25): The self-managed management cluster itself took abnormally long to install. The monitoring cluster operator was unavailable for over 14 minutes during ClusterVersion installation (from 20:11 through 20:25), indicating the management cluster was under significant resource pressure from the start.

  2. Guest cluster creation overload (20:32–20:34): At 20:32:07, the script launched creation of 6 hosted clusters simultaneously (public, private, oauth-lb, upgrade, autoscaling, external-oidc). Each cluster creation required: Azure resource groups, NSGs, VNets, DNS zones, role assignments, DNS zone links, public IPs, and Kubernetes resources — all hitting the management cluster's kube-apiserver concurrently.

  3. Prolonged wait for cluster availability (20:34–20:39): All 6 clusters remained in Available=False, VersionState=Partial for ~5 minutes. The management cluster kube-apiserver was serving watch requests for all 6 clusters simultaneously while also processing the hosted control plane reconciliation workload.

  4. API server outage (~20:38–20:39): The management cluster's kube-apiserver at 20.221.20.125:7443 became transiently unavailable. This likely resulted from resource pressure (CPU/memory exhaustion) on the management cluster nodes due to running 6 hosted control planes simultaneously on a freshly-installed and already-stressed cluster.

  5. Fatal watcher failure (20:39:11): The create-guests script's watch connections to the API server were severed. When the watchers attempted to re-establish their connections, they received connection refused errors. The script treated this as a terminal failure rather than retrying, resulting in 4 of 6 clusters being reported as failed even though public (became Available at 20:38:12) and external-oidc (became Available at 20:37:58) had already succeeded.

  6. Post-failure API server recovery: The dump-management-cluster step ran successfully ~40 minutes later (21:18–21:22), confirming the API server recovered and the outage was transient.

This failure is NOT caused by PR #8718's changes. It is a transient infrastructure failure on the self-managed management cluster.

Recommendations
  1. Retry the job — This is a transient infrastructure failure unrelated to the PR's code changes. The management cluster's kube-apiserver became temporarily unreachable under load, which is a known risk in self-managed Azure environments running 6 hosted clusters simultaneously.

  2. Consider filing a tracking issue for create-guests script resilience — The create-guests script's watcher does not retry on connection refused errors. It would be more robust to implement exponential backoff retry logic when the management cluster API server is temporarily unavailable, rather than treating a transient connection failure as terminal.

  3. Monitor management cluster sizing — The management cluster took 14+ minutes with the monitoring operator unavailable during install, indicating it was resource-constrained. If this pattern recurs, the management cluster node sizing or resource quotas may need adjustment for the self-managed Azure test profile.

Evidence
Evidence Detail
Failed Step hypershift-azure-create-selfmanaged-guests (pre phase)
Step Duration 489 seconds (~8 min 9s)
Clusters Created 6 (public, private, oauth-lb, upgrade, autoscaling, external-oidc)
Clusters Available 2 of 6 (public at 20:38:12, external-oidc at 20:37:58)
Clusters Failed 4 (private, oauth-lb, upgrade, autoscaling)
Failure Time 2026-06-16T20:39:11Z
Error Type dial tcp 20.221.20.125:7443: connect: connection refused
API Server IP 20.221.20.125:7443 (management cluster kube-apiserver)
Mgmt Install Issue Monitoring operator unavailable for 14+ minutes during ClusterVersion install
Post-failure Recovery dump-management-cluster ran successfully at 21:18–21:22 (API server was back)
k8sgpt Analysis No problems detected (ran after recovery)
PR Relation None — transient infrastructure failure, not caused by PR #8718 code changes

@bryan-cox

Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants