CNTRLPLANE-3276: Add Azure endpoint access transition e2e test#8718
CNTRLPLANE-3276: Add Azure endpoint access transition e2e test#8718bryan-cox wants to merge 1 commit into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@bryan-cox: This pull request references CNTRLPLANE-3276 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds a new e2e test, Suggested reviewers
🚥 Pre-merge checks | ✅ 11✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bryan-cox The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/test e2e-azure-v2-self-managed |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/e2e/v2/tests/hosted_cluster_azure_test.go (1)
270-280: 💤 Low valueAdd
WithIntervalfor polling consistency.The condition check at lines 253-268 specifies
WithInterval(15*time.Second), but this EventuallyObject call omits it. For consistent polling behavior across similar checks in this test, consider adding the interval.Suggested fix
e2eutil.EventuallyObject(GinkgoTB(), ctx, "KAS service is LoadBalancer", func(ctx context.Context) (*corev1.Service, error) { svc := hcpmanifests.KubeAPIServerService(controlPlaneNamespace) err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(svc), svc) return svc, err }, []e2eutil.Predicate[*corev1.Service]{ serviceTypePredicate(corev1.ServiceTypeLoadBalancer), }, e2eutil.WithTimeout(10*time.Minute), + e2eutil.WithInterval(15*time.Second), )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 270 - 280, Add a consistent polling interval to the EventuallyObject call by including e2eutil.WithInterval(15*time.Second) alongside the existing e2eutil.WithTimeout option; locate the call to e2eutil.EventuallyObject (the block returning *corev1.Service and using serviceTypePredicate) and append the WithInterval option to its variadic options so it uses the same 15s polling cadence as the earlier check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 270-280: Add a consistent polling interval to the EventuallyObject
call by including e2eutil.WithInterval(15*time.Second) alongside the existing
e2eutil.WithTimeout option; locate the call to e2eutil.EventuallyObject (the
block returning *corev1.Service and using serviceTypePredicate) and append the
WithInterval option to its variadic options so it uses the same 15s polling
cadence as the earlier check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 3b2222e8-a730-4864-a308-b08be14365df
📒 Files selected for processing (1)
test/e2e/v2/tests/hosted_cluster_azure_test.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8718 +/- ##
=======================================
Coverage 41.75% 41.75%
=======================================
Files 758 758
Lines 93981 93981
=======================================
Hits 39240 39240
Misses 51988 51988
Partials 2753 2753
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
8a9c92f to
13e0579
Compare
|
/test e2e-azure-v2-self-managed |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
test/e2e/v2/tests/hosted_cluster_azure_test.go (2)
279-309: ⚡ Quick winConsider verifying HostedCluster condition PublicEndpointExposed returns to False.
The PR description states the test should "wait for PublicEndpointExposed to become False with reason TopologyPrivate" when transitioning back to Private. Currently, the test only verifies the KAS service type returns to ClusterIP. Adding a condition check would ensure the full reconciliation completed.
📋 Suggested addition to verify condition
After line 297, add:
e2eutil.EventuallyObject(GinkgoTB(), ctx, "KAS service is ClusterIP", func(ctx context.Context) (*corev1.Service, error) { svc := hcpmanifests.KubeAPIServerService(controlPlaneNamespace) err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(svc), svc) return svc, err }, []e2eutil.Predicate[*corev1.Service]{ serviceTypePredicate(corev1.ServiceTypeClusterIP), }, e2eutil.WithTimeout(10*time.Minute), ) + + e2eutil.EventuallyObject(GinkgoTB(), ctx, "HostedCluster PublicEndpointExposed condition is False", + func(ctx context.Context) (*hyperv1.HostedCluster, error) { + latest := &hyperv1.HostedCluster{} + err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(hc), latest) + return latest, err + }, + []e2eutil.Predicate[*hyperv1.HostedCluster]{ + e2eutil.ConditionPredicate[*hyperv1.HostedCluster]( + hyperv1.PublicEndpointExposed, + metav1.ConditionFalse, + hyperv1.TopologyPrivateReason, + ), + }, + e2eutil.WithTimeout(10*time.Minute), + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 279 - 309, Add an assertion that the HostedCluster's PublicEndpointExposed condition flips to False with reason TopologyPrivate after topology revert: use e2eutil.EventuallyObject (like the existing KAS service check) to fetch the HostedCluster (variable hc via testCtx.MgmtClient) and poll until conditions.Get(hc.Status.Conditions, hyperv1.PublicEndpointExposed).Status == corev1.ConditionFalse and .Reason == "TopologyPrivate" (or use the helper that checks condition values), with an appropriate timeout placed after the KAS service ClusterIP check to ensure full reconciliation completed.
245-277: ⚡ Quick winConsider verifying HostedCluster condition PublicEndpointExposed.
The PR description states the test should "wait for HostedCluster condition PublicEndpointExposed to become True with reason SharedIngressConfigured." Currently, the test only verifies the KAS service type changes to LoadBalancer. Adding a condition check would provide stronger validation that the control plane reconciliation loop completed successfully.
📋 Suggested addition to verify condition
After line 263, add a condition check using
e2eutil.ConditionPredicate:e2eutil.EventuallyObject(GinkgoTB(), ctx, "KAS service is LoadBalancer", func(ctx context.Context) (*corev1.Service, error) { svc := hcpmanifests.KubeAPIServerService(controlPlaneNamespace) err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(svc), svc) return svc, err }, []e2eutil.Predicate[*corev1.Service]{ serviceTypePredicate(corev1.ServiceTypeLoadBalancer), }, e2eutil.WithTimeout(10*time.Minute), ) + + e2eutil.EventuallyObject(GinkgoTB(), ctx, "HostedCluster PublicEndpointExposed condition is True", + func(ctx context.Context) (*hyperv1.HostedCluster, error) { + latest := &hyperv1.HostedCluster{} + err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(hc), latest) + return latest, err + }, + []e2eutil.Predicate[*hyperv1.HostedCluster]{ + e2eutil.ConditionPredicate[*hyperv1.HostedCluster]( + hyperv1.PublicEndpointExposed, + metav1.ConditionTrue, + hyperv1.SharedIngressConfiguredReason, + ), + }, + e2eutil.WithTimeout(10*time.Minute), + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 245 - 277, Add a check that waits for the HostedCluster condition "PublicEndpointExposed" to be True with reason "SharedIngressConfigured" (using the existing HostedCluster object hc and testCtx.MgmtClient) to ensure control-plane reconciliation completed; implement this by calling e2eutil.EventuallyObject (or the existing helper that waits on conditions) with e2eutil.ConditionPredicate for condition "PublicEndpointExposed" and reason "SharedIngressConfigured" (use e2eutil.ConditionPredicate(hc, "PublicEndpointExposed", corev1.ConditionTrue, "SharedIngressConfigured") or the appropriate signature) placed after the KAS service LoadBalancer check/PLS verification and before verifyAPIReachable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 279-309: Add an assertion that the HostedCluster's
PublicEndpointExposed condition flips to False with reason TopologyPrivate after
topology revert: use e2eutil.EventuallyObject (like the existing KAS service
check) to fetch the HostedCluster (variable hc via testCtx.MgmtClient) and poll
until conditions.Get(hc.Status.Conditions, hyperv1.PublicEndpointExposed).Status
== corev1.ConditionFalse and .Reason == "TopologyPrivate" (or use the helper
that checks condition values), with an appropriate timeout placed after the KAS
service ClusterIP check to ensure full reconciliation completed.
- Around line 245-277: Add a check that waits for the HostedCluster condition
"PublicEndpointExposed" to be True with reason "SharedIngressConfigured" (using
the existing HostedCluster object hc and testCtx.MgmtClient) to ensure
control-plane reconciliation completed; implement this by calling
e2eutil.EventuallyObject (or the existing helper that waits on conditions) with
e2eutil.ConditionPredicate for condition "PublicEndpointExposed" and reason
"SharedIngressConfigured" (use e2eutil.ConditionPredicate(hc,
"PublicEndpointExposed", corev1.ConditionTrue, "SharedIngressConfigured") or the
appropriate signature) placed after the KAS service LoadBalancer check/PLS
verification and before verifyAPIReachable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9e7d7d98-d5d6-4677-8778-67d0d9a0b9a5
📒 Files selected for processing (1)
test/e2e/v2/tests/hosted_cluster_azure_test.go
13e0579 to
57c8e8e
Compare
|
/test e2e-azure-v2-self-managed |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/e2e/v2/tests/hosted_cluster_azure_test.go (1)
246-314: ⚡ Quick winAdd labels to each new
Itfor filterability consistency.The new specs are under a labeled
Context, but theItblocks themselves are unlabeled.Proposed fix
- It("should transition from Private to PublicAndPrivate", func() { + It("should transition from Private to PublicAndPrivate", Label("azure-endpoint-transition"), func() { ... - It("should transition from PublicAndPrivate back to Private", func() { + It("should transition from PublicAndPrivate back to Private", Label("azure-endpoint-transition"), func() { ... - It("should reach the API server after restoring Private topology", func() { + It("should reach the API server after restoring Private topology", Label("azure-endpoint-transition"), func() {As per coding guidelines, apply labels to both
DescribeandItblocks for test filtering.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 246 - 314, Add labels to the three It blocks in the test to maintain filtering consistency with the parent Context block. The It blocks "should transition from Private to PublicAndPrivate", "should transition from PublicAndPrivate back to Private", and "should reach the API server after restoring Private topology" need to be labeled using the Label() function according to coding guidelines for test filtering. Apply appropriate labels that categorize these topology transition tests consistently.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 295-297: The assertion for the service type check is too loose and
only verifies that the type is not LoadBalancer, allowing false positives from
other unintended types like NodePort or ExternalName. Replace the
NotTo(Equal(corev1.ServiceTypeLoadBalancer)) check with explicit assertions that
verify the service is in one of the expected states after restore: either the
service should not be found (NotFound error) or it should have type ClusterIP.
This ensures the assertion catches actual regressions instead of passing for any
non-LoadBalancer type.
- Around line 236-243: The DeferCleanup function currently logs a warning when
the topology restore fails (when restoreErr is not nil) but continues execution,
which can leave the cluster state mutated and cause cascading failures in
subsequent tests. Modify the error handling block where restoreErr is checked to
fail the test using an appropriate Ginkgo assertion or failure method (such as
GinkgoTB().Fail) instead of just logging a warning, ensuring that any failure to
restore the Azure topology causes the test cleanup to fail and prevent state
corruption from affecting other specs.
---
Nitpick comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 246-314: Add labels to the three It blocks in the test to maintain
filtering consistency with the parent Context block. The It blocks "should
transition from Private to PublicAndPrivate", "should transition from
PublicAndPrivate back to Private", and "should reach the API server after
restoring Private topology" need to be labeled using the Label() function
according to coding guidelines for test filtering. Apply appropriate labels that
categorize these topology transition tests consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 82fff169-72ca-4e38-910e-d973910bb5f5
📒 Files selected for processing (1)
test/e2e/v2/tests/hosted_cluster_azure_test.go
| DeferCleanup(func() { | ||
| restoreErr := e2eutil.UpdateObject(GinkgoTB(), context.Background(), testCtx.MgmtClient, hc, func(obj *hyperv1.HostedCluster) { | ||
| obj.Spec.Platform.Azure.Topology = hyperv1.AzureTopologyPrivate | ||
| }) | ||
| if restoreErr != nil { | ||
| GinkgoTB().Logf("WARNING: failed to restore Private topology: %v", restoreErr) | ||
| } | ||
| }) |
There was a problem hiding this comment.
Fail cleanup if topology restore does not succeed.
The cleanup currently logs and continues on restore failure, which can leave shared cluster state mutated and cause cascading failures in later specs.
Proposed fix
DeferCleanup(func() {
- restoreErr := e2eutil.UpdateObject(GinkgoTB(), context.Background(), testCtx.MgmtClient, hc, func(obj *hyperv1.HostedCluster) {
+ cleanupCtx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
+ defer cancel()
+ restoreErr := e2eutil.UpdateObject(GinkgoTB(), cleanupCtx, testCtx.MgmtClient, hc, func(obj *hyperv1.HostedCluster) {
obj.Spec.Platform.Azure.Topology = hyperv1.AzureTopologyPrivate
})
- if restoreErr != nil {
- GinkgoTB().Logf("WARNING: failed to restore Private topology: %v", restoreErr)
- }
+ Expect(restoreErr).NotTo(HaveOccurred(), "cleanup: failed to restore Private topology")
})As per coding guidelines, when mutating cluster state, restore it via DeferCleanup in a fail-safe way on all exit paths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 236 - 243, The
DeferCleanup function currently logs a warning when the topology restore fails
(when restoreErr is not nil) but continues execution, which can leave the
cluster state mutated and cause cascading failures in subsequent tests. Modify
the error handling block where restoreErr is checked to fail the test using an
appropriate Ginkgo assertion or failure method (such as GinkgoTB().Fail) instead
of just logging a warning, ensuring that any failure to restore the Azure
topology causes the test cleanup to fail and prevent state corruption from
affecting other specs.
Source: Coding guidelines
| g.Expect(svc.Spec.Type).NotTo(Equal(corev1.ServiceTypeLoadBalancer), | ||
| "KAS Azure LB service should no longer be LoadBalancer after restoring Private topology") | ||
| }).WithTimeout(10 * time.Minute).WithPolling(10 * time.Second).Should(Succeed(), |
There was a problem hiding this comment.
Tighten the restored-service assertion to expected states only.
The current check passes for any non-LoadBalancer type (e.g., NodePort/ExternalName), which can hide regressions. After restore, success should be NotFound or ClusterIP.
Proposed fix
g.Expect(err).NotTo(HaveOccurred(), "failed to get KAS Azure LB service")
- g.Expect(svc.Spec.Type).NotTo(Equal(corev1.ServiceTypeLoadBalancer),
- "KAS Azure LB service should no longer be LoadBalancer after restoring Private topology")
+ g.Expect(svc.Spec.Type).To(Equal(corev1.ServiceTypeClusterIP),
+ "KAS Azure LB service should be ClusterIP when restoring Private topology")
}).WithTimeout(10 * time.Minute).WithPolling(10 * time.Second).Should(Succeed(),
- "KAS Azure LB service should be deleted or no longer LoadBalancer")
+ "KAS Azure LB service should be deleted or ClusterIP")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 295 - 297, The
assertion for the service type check is too loose and only verifies that the
type is not LoadBalancer, allowing false positives from other unintended types
like NodePort or ExternalName. Replace the
NotTo(Equal(corev1.ServiceTypeLoadBalancer)) check with explicit assertions that
verify the service is in one of the expected states after restore: either the
service should not be found (NotFound error) or it should have type ClusterIP.
This ensures the assertion catches actual regressions instead of passing for any
non-LoadBalancer type.
|
/test e2e-azure-v2-self-managed |
1 similar comment
|
/test e2e-azure-v2-self-managed |
57c8e8e to
4f67e09
Compare
|
/test pull-ci-openshift-hypershift-main-e2e-azure-v2-self-managed |
4f67e09 to
c562f67
Compare
|
/test e2e-azure-v2-self-managed |
c562f67 to
3bf01a1
Compare
|
/test e2e-azure-v2-self-managed |
3bf01a1 to
0593853
Compare
|
/test e2e-azure-v2-self-managed |
Add a v2 lifecycle test that validates transitioning a HostedCluster from Private to PublicAndPrivate topology and back on Azure self-managed clusters. The test runs on the existing private cluster variant after AzurePrivateTopologyTest completes. Co-Authored-By: Claude Opus 4.6 <[email protected]>
0593853 to
5ba7144
Compare
|
/test e2e-azure-v2-self-managed |
|
@bryan-cox: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
The dump ran ~40 minutes later successfully, confirming the API server outage was transient. Now I have the full picture. Let me verify exactly what happened by looking at the exact error timing: The error logs show at The fact that the management cluster was already stressed (monitoring operator unavailable for 14+ minutes during install) and then had to handle 6 hosted clusters being created simultaneously likely contributed to API server instability. Now I have all the evidence I need. Here's the complete analysis: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe Root CauseThe root cause is a transient management cluster kube-apiserver outage during the guest cluster creation phase, combined with the create-guests script's lack of resilience to API server interruptions. Detailed chain of events:
This failure is NOT caused by PR #8718's changes. It is a transient infrastructure failure on the self-managed management cluster. Recommendations
Evidence
|
|
/test e2e-azure-v2-self-managed |
What this PR does / why we need it:
Adds a v2 e2e lifecycle test that validates transitioning a HostedCluster between Private and PublicAndPrivate topology on Azure self-managed clusters. The test verifies:
PublicEndpointExposedcondition to become True with reasonSharedIngressConfigured, confirms KAS service becomes LoadBalancer, validates PLS CRs persist, and checks API reachability.PublicEndpointExposedcondition to become False with reasonTopologyPrivate, confirms KAS service returns to ClusterIP, validates PLS CRs persist.The test runs on the existing private cluster variant (labeled
self-managed-azure-private) afterAzurePrivateTopologyTestcompletes.DeferCleanuprestores Private topology if the test fails mid-transition.Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/CNTRLPLANE-3276
Special notes for your reviewer:
lifecycle/azure.go— the private variant'sLabelFilteralready includesself-managed-azure-privatee2eutil.ConditionPredicate[*hyperv1.HostedCluster]rather than a custom condition helperDeferCleanupusescontext.Background()intentionally — cleanup may run after testCtx.Context is cancelled on timeoutsync.Once-cached guest client inGetHostedClusterClient()is safe across topology transitions because the private endpoint stays active in PublicAndPrivate modeChecklist:
Summary by CodeRabbit