Skip to content

CNTRLPLANE-3277: Add Azure OAuth LoadBalancer private topology e2e test#8584

Open
Nirshal wants to merge 3 commits into
openshift:mainfrom
Nirshal:CNTRLPLANE-3277
Open

CNTRLPLANE-3277: Add Azure OAuth LoadBalancer private topology e2e test#8584
Nirshal wants to merge 3 commits into
openshift:mainfrom
Nirshal:CNTRLPLANE-3277

Conversation

@Nirshal

@Nirshal Nirshal commented May 25, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

Adds the AzureOAuthLoadBalancerPrivateTest to the v2 Ginkgo e2e suite, validating
that the OAuth server works correctly when published via LoadBalancer on an Azure
Private topology cluster.

The test verifies:

  • The oauth-openshift Service is created as type LoadBalancer with an allocated endpoint
  • The Service carries the Azure internal LoadBalancer annotation (confirming it's an ILB, not public)
  • The OAuth token flow (kubeadmin login + htpasswd IDP setup + token request) works through the private endpoint

Changes to lifecycle/azure.go:

  • Adds --oauth-publishing-strategy=LoadBalancer to the existing private cluster variant, reusing it instead of spinning up a separate cluster (per reviewer feedback to reduce Azure CI cluster count)
  • Adds self-managed-azure-oauth-lb-private to the private TestGroup's LabelFilter so the new test runs on the same cluster

Note: The OAuth token flow test requires the test runner to have network connectivity
to the Azure VNet (e.g., running in the management cluster or via VPN/peering).

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-3277

Special notes for your reviewer:

  • This builds on top of CNTRLPLANE-3222: Port v1 lifecycle tests to v2 Ginkgo framework #8527 (CNTRLPLANE-3222) which ported the v1 lifecycle tests to v2 and established the Azure self-managed test infrastructure.
  • The private cluster variant now always creates with OAuth via LoadBalancer. This means private topology with default OAuth (Route) is no longer tested in CI. This tradeoff was agreed upon to avoid an additional Azure cluster.
  • AzureOAuthLoadBalancerPrivateTest mutates the HostedCluster (creates htpasswd IDP, triggers kubeadmin secret deletion). It is registered last for the private cluster in RegisterHostedClusterAzureTests to avoid affecting other private topology tests.
  • No companion PR in openshift/release is needed since the private cluster already exists in CI.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests. No unit tests needed — changes are e2e test code and CI cluster configuration only, with no new business logic.

Summary by CodeRabbit

  • Tests
    • Added end-to-end test coverage for Azure private-topology clusters running OAuth behind a LoadBalancer, verifying service type, internal LB annotation, and token flow through the LoadBalancer endpoint.
    • Introduced an ordered validation suite for private Azure deployments to ensure OAuth LoadBalancer behavior.
    • Broadened the Azure test selection to include an additional private OAuth LoadBalancer variant.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 25, 2026
@openshift-ci-robot

openshift-ci-robot commented May 25, 2026

Copy link
Copy Markdown

@Nirshal: This pull request references CNTRLPLANE-3277 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Adds the AzureOAuthLoadBalancerPrivateTest to the v2 Ginkgo e2e suite, validating
that the OAuth server works correctly when published via LoadBalancer on an Azure
Private topology cluster.

The test verifies:

  • The oauth-openshift Service is created as type LoadBalancer with an allocated endpoint
  • The Service carries the Azure internal LoadBalancer annotation (confirming it's an ILB, not public)
  • The OAuth token flow (kubeadmin login + htpasswd IDP setup + token request) works through the private endpoint

Also adds:

  • oauth-lb-private cluster spec in lifecycle/azure.go combining --endpoint-access=Private with --oauth-publishing-strategy=LoadBalancer
  • Corresponding test matrix entry with label self-managed-azure-oauth-lb-private

Note: The OAuth token flow test requires the test runner to have network connectivity
to the Azure VNet (e.g., running in the management cluster or via VPN/peering).

Companion PR needed: openshift/release changes to create/destroy/dump the
oauth-lb-private cluster and wire the label filter in the CI job.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-3277

Special notes for your reviewer:

  • This builds on top of CNTRLPLANE-3222: Port v1 lifecycle tests to v2 Ginkgo framework #8527 (CNTRLPLANE-3222) which ported the v1 lifecycle tests to v2 and established the Azure self-managed test infrastructure.
  • The naming oauth-lb-private follows the existing oauth-lb convention rather than private-oauth-lb to stay consistent with Bryan's naming. Open to renaming both to public-oauth-lb / private-oauth-lb for clarity if preferred.
  • The ValidateOAuthWithIdentityProviderViaLoadBalancer helper in test/e2e/util/oauth.go does not clean up the htpasswd Secret it creates — this is pre-existing and safe since each test variant uses a separate cluster, but worth a follow-up.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests. No unit tests needed — changes are e2e test code and CI cluster configuration only, with no new business logic.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 25, 2026
@openshift-ci

openshift-ci Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Adds --oauth-publishing-strategy=LoadBalancer and the private NAT subnet arg to the Azure private cluster ExtraArgs, expands the Azure private test label selector to include self-managed-azure-oauth-lb-private, and adds an ordered E2E test (AzureOAuthLoadBalancerPrivateTest) that waits for oauth-openshift to be a LoadBalancer, asserts the Azure internal LoadBalancer annotation, and validates the OAuth token flow via the LoadBalancer endpoint.

Sequence Diagram(s)

sequenceDiagram
  participant TestRunner
  participant HostedCluster
  participant AzureLoadBalancer
  participant ManagementClient
  TestRunner->>HostedCluster: wait for oauth-openshift Service type=LoadBalancer
  HostedCluster->>AzureLoadBalancer: allocate LB and ingress endpoint
  AzureLoadBalancer->>HostedCluster: annotate Service with InternalLoadBalancerAnnotation
  TestRunner->>ManagementClient: perform OAuth token flow via LB endpoint
  ManagementClient->>HostedCluster: validate token exchange through LB ingress
Loading

Suggested reviewers

  • enxebre
  • cblecker
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning ValidateOAuthWithIdentityProviderViaLoadBalancer creates htpasswd Secret and mutates HostedCluster OAuth config without cleanup, causing test order-dependency on shared clusters. Add AfterAll block to restore original OAuth config and delete htpasswd Secret with apierrors.IsNotFound checks during cleanup.
Microshift Test Compatibility ⚠️ Warning Test uses config.openshift.io/v1, user.openshift.io/v1, and osin.openshift.io/v1 APIs unavailable on MicroShift without protection mechanisms like [apigroup:config.openshift.io] label. Add [apigroup:config.openshift.io] tag to test name, or add [Skipped:MicroShift] label, or guard with exutil.IsMicroShiftCluster() check.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: adding an Azure OAuth LoadBalancer private topology e2e test. It is concise, specific, and directly matches the pull request's primary objective.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are stable and deterministic with no dynamic content; dynamic values correctly appear only in test bodies, not titles.
Single Node Openshift (Sno) Test Compatibility ✅ Passed The new AzureOAuthLoadBalancerPrivateTest does not make multi-node assumptions; it tests Service creation, annotations, and OAuth token flow which all work on single-node clusters.
Topology-Aware Scheduling Compatibility ✅ Passed This PR only modifies e2e test files. It does not introduce deployment manifests, operator code, or controllers with scheduling constraints.
Ote Binary Stdout Contract ✅ Passed PR introduces no OTE stdout contract violations: adds only configuration strings and Ginkgo test functions (inside Context/It blocks), with no process-level stdout writes.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Test uses DNS hostnames validated by regex for URL construction, not raw IP addresses. No hardcoded IPv4/IPv6 addresses or public internet connectivity detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 25, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/e2e/v2/tests/hosted_cluster_azure_test.go (1)

304-320: ⚡ Quick win

Move platform/topology guard skips to BeforeEach with standardized skip text.

In Line 304-320, the new suite uses BeforeAll and custom skip messages for platform checks. Please switch to BeforeEach and the repository’s required message format for platform-specific skips.

♻️ Suggested adjustment
-		BeforeAll(func() {
+		BeforeEach(func() {
 			testCtx = getTestCtx()
 			hc = testCtx.GetHostedCluster()
 			if hc == nil || hc.Spec.Platform.Type != hyperv1.AzurePlatform {
-				Skip("Azure OAuth LB Private tests are only for Azure platform")
+				Skip("[Azure OAuth LB Private] test is only for Azure platform")
 			}
 			if hc.Spec.Platform.Azure == nil || hc.Spec.Platform.Azure.Topology != hyperv1.AzureTopologyPrivate {
 				Skip("Azure OAuth LB Private tests require Private topology")
 			}

As per coding guidelines, "test/e2e/v2/tests/**/*_test.go: Use BeforeEach with Skip() for platform-specific tests with message format: [Feature] test is only for [Platform] platform".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 304 - 320,
Replace the platform/topology guard block currently inside BeforeAll with a
BeforeEach that performs the same checks (use
getTestCtx()/testCtx.GetHostedCluster(), validate hc.Spec.Platform.Type,
hc.Spec.Platform.Azure.Topology, and
netutil.ServicePublishingStrategyByTypeByHC(hc, hyperv1.OAuthServer)), move
controlPlaneNamespace assignment into this BeforeEach, and call Skip() with the
repository-standard messages: for platform use "Azure OAuth LB Private tests is
only for Azure platform" and for topology use "Azure OAuth LB Private tests
requires Private topology" (also keep the existing Skip for publishing strategy
if present); update references to BeforeAll -> BeforeEach and keep the same
symbol names (getTestCtx, testCtx.GetHostedCluster, controlPlaneNamespace,
netutil.ServicePublishingStrategyByTypeByHC, hyperv1.OAuthServer).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 304-320: Replace the platform/topology guard block currently
inside BeforeAll with a BeforeEach that performs the same checks (use
getTestCtx()/testCtx.GetHostedCluster(), validate hc.Spec.Platform.Type,
hc.Spec.Platform.Azure.Topology, and
netutil.ServicePublishingStrategyByTypeByHC(hc, hyperv1.OAuthServer)), move
controlPlaneNamespace assignment into this BeforeEach, and call Skip() with the
repository-standard messages: for platform use "Azure OAuth LB Private tests is
only for Azure platform" and for topology use "Azure OAuth LB Private tests
requires Private topology" (also keep the existing Skip for publishing strategy
if present); update references to BeforeAll -> BeforeEach and keep the same
symbol names (getTestCtx, testCtx.GetHostedCluster, controlPlaneNamespace,
netutil.ServicePublishingStrategyByTypeByHC, hyperv1.OAuthServer).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4942b228-db2d-42ed-83bf-2dd3738e5e94

📥 Commits

Reviewing files that changed from the base of the PR and between 7625684 and b8a5e30.

📒 Files selected for processing (2)
  • test/e2e/v2/lifecycle/azure.go
  • test/e2e/v2/tests/hosted_cluster_azure_test.go

@codecov

codecov Bot commented May 25, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.79%. Comparing base (7507291) to head (c274044).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8584   +/-   ##
=======================================
  Coverage   41.79%   41.79%           
=======================================
  Files         759      759           
  Lines       94037    94037           
=======================================
  Hits        39304    39304           
  Misses      51983    51983           
  Partials     2750     2750           
Flag Coverage Δ
cmd-support 35.11% <ø> (ø)
cpo-hostedcontrolplane 44.10% <ø> (ø)
cpo-other 43.45% <ø> (ø)
hypershift-operator 51.87% <ø> (ø)
other 31.56% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Nirshal

Nirshal commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

Re: CodeRabbit suggestion to change BeforeAll to BeforeEach — intentionally keeping BeforeAll here.

This Context uses the Ordered decorator with shared state (testCtx, hc, controlPlaneNamespace) across three sequential It blocks. BeforeAll runs once and initializes the shared variables; BeforeEach would redundantly re-fetch the test context and re-run the guards before every It.

This matches the existing pattern in AzurePrivateTopologyTest (same file, line 91) which also uses Ordered + BeforeAll for the same reason.

@Nirshal Nirshal marked this pull request as ready for review May 26, 2026 13:56
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 26, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and sjenning May 26, 2026 13:58

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/v2/tests/hosted_cluster_azure_test.go`:
- Around line 379-382: The test calls
ValidateOAuthWithIdentityProviderViaLoadBalancer which mutates the HostedCluster
OAuth config and creates an htpasswd Secret without restoring them; capture the
original HostedCluster.Spec.OAuth (and any other mutated fields) and the
existence/data of the htpasswd Secret before calling
ValidateOAuthWithIdentityProviderViaLoadBalancer, then defer a restoration
function that re-applies the original OAuth config and deletes or restores the
htpasswd Secret, and in that deferred cleanup handle apierrors.IsNotFound(error)
when attempting to restore/delete resources so cleanup tolerates resources
already removed; use the HostedCluster object and the Secret name returned/used
by ValidateOAuthWithIdentityProviderViaLoadBalancer to identify what to restore.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bcc13421-364c-476a-94f1-73cf7c97a172

📥 Commits

Reviewing files that changed from the base of the PR and between b8a5e30 and 5d52eb0.

📒 Files selected for processing (2)
  • test/e2e/v2/lifecycle/azure.go
  • test/e2e/v2/tests/hosted_cluster_azure_test.go

Comment on lines +379 to +382
It("should complete OAuth token flow through LoadBalancer endpoint", func() {
ctx := testCtx.Context
e2eutil.ValidateOAuthWithIdentityProviderViaLoadBalancer(GinkgoTB(), ctx, testCtx.MgmtClient, hc)
})

@coderabbitai coderabbitai Bot May 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore cluster auth state after OAuth flow mutation.

Line 381 calls ValidateOAuthWithIdentityProviderViaLoadBalancer, which mutates HostedCluster OAuth config and creates an htpasswd Secret without restoring/deleting them. This makes outcomes order-dependent on shared clusters. Please capture original state and defer restoration, with apierrors.IsNotFound() handling during cleanup.

As per coding guidelines: "When a test mutates cluster state, capture original state before mutation and defer restoration; check apierrors.IsNotFound() in cleanup to handle already-deleted resources".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/hosted_cluster_azure_test.go` around lines 379 - 382, The
test calls ValidateOAuthWithIdentityProviderViaLoadBalancer which mutates the
HostedCluster OAuth config and creates an htpasswd Secret without restoring
them; capture the original HostedCluster.Spec.OAuth (and any other mutated
fields) and the existence/data of the htpasswd Secret before calling
ValidateOAuthWithIdentityProviderViaLoadBalancer, then defer a restoration
function that re-applies the original OAuth config and deletes or restores the
htpasswd Secret, and in that deferred cleanup handle apierrors.IsNotFound(error)
when attempting to restore/delete resources so cleanup tolerates resources
already removed; use the HostedCluster object and the Secret name returned/used
by ValidateOAuthWithIdentityProviderViaLoadBalancer to identify what to restore.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mutation happens inside the shared helper ValidateOAuthWithIdentityProviderViaLoadBalancer (in test/e2e/util/oauth.go), which is also used by the existing AzureOAuthLoadBalancerTest (line 284, same file) without cleanup. Adding cleanup in the caller would mean duplicating restore logic that belongs in the helper itself.

This is a pre-existing concern, not introduced by this PR. Noted in the PR description as a known tradeoff. If we address it, it should be done in the shared helper as a follow-up.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nirshal, that's a fair point — the concern is pre-existing and your reasoning is correct. Duplicating restore logic in each caller would be the wrong approach; the cleanup belongs inside ValidateOAuthWithIdentityProviderViaLoadBalancer itself so all callers benefit automatically.

Would you like me to open a follow-up GitHub issue to track adding htpasswd Secret cleanup (and OAuth config restoration) to the shared helper?


🧠 Learnings used
Learnt from: cblecker
Repo: openshift/hypershift PR: 8511
File: test/e2e/v2/tests/hosted_cluster_metrics_test.go:203-245
Timestamp: 2026-05-15T03:34:00.446Z
Learning: When reviewing HyperShift v2 e2e tests under test/e2e/v2/, do not require MicroShift “guard” logic or markers. HyperShift v2 e2e tests in this directory run only against hosted clusters (no MicroShift targets), so missing exutil.IsMicroShiftCluster() checks and/or [Skipped:MicroShift] labels should not be flagged.

Learnt from: bryan-cox
Repo: openshift/hypershift PR: 8527
File: test/e2e/v2/tests/nodepool_autoscaling_test.go:236-236
Timestamp: 2026-05-15T14:34:25.062Z
Learning: In openshift/hypershift v2 e2e lifecycle CI tests under test/e2e/v2/tests/, it is OK to reference external image registries because the CI environment is not air-gapped. Image references like registry.access.redhat.com/ubi9/ubi-minimal:latest in lifecycle DaemonSet definitions should not be flagged for requiring internal registries. For the autoscaling workload created by newAutoscalingWorkload (nodepool_autoscaling_test.go), ensure it uses registry.k8s.io/pause:3.9 (not ubi-minimal).

Learnt from: bryan-cox
Repo: openshift/hypershift PR: 8469
File: test/e2e/util/util_test.go:17-131
Timestamp: 2026-05-17T22:37:02.057Z
Learning: In Go e2e tests under test/e2e, do not call t.Parallel() (neither at the top-level test nor in any subtest) when the test (or any helper it calls) uses t.Setenv—for example, when calling azureutil.SetAsAroHCPTest(t). The Go test runner panics with "testing: test using t.Setenv can not use t.Parallel" because environment variables are process-global. Keep such tests serial to ensure the environment setup is safe and deterministic.

@Nirshal

Nirshal commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

/test ?

1 similar comment
@Nirshal

Nirshal commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

/test ?

@Nirshal

Nirshal commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

/test e2e-azure-v2-self-managed

@cblecker cblecker left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test structure itself is solid — good use of Ordered + BeforeAll, thorough skip guards, and the internal LB annotation check is a meaningful addition for private topology coverage.

Two things to address before this can merge:

1. Healthcheck unreachable from CI (blocking): WaitForOAuthLoadBalancerReady dials the OAuth hostname directly via TLS, which resolves to the internal LB IP for private topology — unreachable from the build cluster. The CI run confirmed this (job 2059278516129632256, oauth.go:375, 300s timeout). The token flow step worked; it's only the raw healthcheck that fails. Fix needs to happen in the shared helper.

2. Test isolation on the shared cluster: The public AzureOAuthLoadBalancerTest works because it has a dedicated cluster. This private variant shares a cluster with read-only tests and relies on declaration order for isolation. Either a dedicated cluster, or a sequential matrix step, would give the same guarantee the public variant has.

Minor nit: the Service type + endpoint predicates in the first It block are nearly identical to the public variant (~30 lines). Not blocking, but worth a shared helper if more LB variants get added.

It("should complete OAuth token flow through LoadBalancer endpoint", func() {
ctx := testCtx.Context
e2eutil.ValidateOAuthWithIdentityProviderViaLoadBalancer(GinkgoTB(), ctx, testCtx.MgmtClient, hc)
})

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI run showed this test can't pass as-is — WaitForOAuthLoadBalancerReady (in test/e2e/util/oauth.go:350-375) does a direct TLS dial to https://<oauth-hostname>/healthz. For private topology that hostname resolves to the Azure internal LB IP (e.g., 10.0.0.22), which is unreachable from the CI build cluster.

Interestingly the OAuth token flow itself succeeds because it routes through the guest kubeconfig transport — it's specifically the raw healthcheck that fails. So the underlying validation logic works; it's just the connectivity check that's incompatible with private topology.

The fix needs to happen in the shared helper. A few approaches:

  1. Skip WaitForOAuthLoadBalancerReady entirely for private topology and rely on the token flow validation alone — it already exercises the full OAuth stack end-to-end.
  2. Use the guest kubeconfig transport for the healthcheck too, the same way the token flow step does.
  3. Route it through a pod on the management cluster that has VNet connectivity.

This needs to be addressed in the helper before this can land.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in c274044. Extracted WaitForOAuthLoadBalancerEndpoint from WaitForOAuthLoadBalancerReady — it waits for the LB endpoint without the direct /healthz TLS dial. The private test now calls WaitForOAuthLoadBalancerEndpoint + ValidateOAuthIdentityProviderFlow (also extracted), skipping the health check entirely while still validating the full OAuth IDP token flow end-to-end.

WaitForOAuthLoadBalancerReady is unchanged in behavior — it now composes WaitForOAuthLoadBalancerEndpoint + the health check, so the public test path is unaffected.

AzureOAuthLoadBalancerPrivateTest(getTestCtx)
}

var _ = Describe("Hosted Cluster Azure", Label("hosted-cluster-azure"), func() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The public variant (AzureOAuthLoadBalancerTest) works cleanly because it runs on a dedicated cluster (cluster-name-oauth-lb) created specifically for it — mutation is safe there since nothing else runs on that cluster. This private variant is trying to replicate that pattern but without the dedicated cluster, sharing cluster-name-private with the read-only AzurePrivateTopologyTest instead.

Registering this last is a reasonable mitigation, but declaration order between Context blocks isn't enforced by Ginkgo's Ordered decorator — that only applies within a single Context. If the registration order changes, or a new test gets inserted between the two, the read-only private topology tests could run after the HC has been mutated (htpasswd IDP created, kubeadmin secret deleted).

The cleanest fix that matches how the public variant is structured would be a dedicated cluster-name-oauth-lb-private cluster. I understand the reviewer feedback to reduce cluster count, so if a dedicated cluster is off the table, the other option is to make this a sequential step in the test matrix (after the read-only private tests), the way etcd-chaos runs sequentially after upgrade on the upgrade cluster. That would give the same isolation guarantee without an extra cluster.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in c274044. Moved the private test group from Parallel to Sequential in the test matrix, following the same pattern as upgrade-and-chaos. The new private-and-oauth sequential group runs read-only private topology tests first, then the mutating OAuth LB test second — guaranteeing isolation regardless of registration order.

Nirshal added 2 commits June 17, 2026 14:14
Add e2e validation for OAuth LoadBalancer publishing on Azure Private
topology. Reuse the existing private cluster variant by adding
--oauth-publishing-strategy=LoadBalancer to its creation args, avoiding
an additional Azure cluster in CI.

Signed-off-by: Alessandro Rossi <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
Extract WaitForOAuthLoadBalancerEndpoint and ValidateOAuthIdentityProviderFlow
helpers from the monolithic WaitForOAuthLoadBalancerReady and
ValidateOAuthWithIdentityProviderViaLoadBalancer functions. The private
topology test now skips the direct /healthz TLS dial (unreachable on
internal LB) while still validating the full OAuth IDP token flow.

Move the private test group from Parallel to Sequential in the test
matrix, running read-only tests before the mutating OAuth flow to
guarantee isolation.

Signed-off-by: Alessandro Rossi <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Nirshal
Once this PR has been reviewed and has the lgtm label, please ask for approval from cblecker. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Nirshal

Nirshal commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

/test e2e-azure-v2-self-managed

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@Nirshal: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure-v2-self-managed c274044 link true /test e2e-azure-v2-self-managed

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

Copy link
Copy Markdown

Now I have the full root cause. Here's the analysis:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

codecov/project: "CI failed." (state: error, 2026-06-17T13:52:43Z)
codecov/patch:   "CI failed." (state: error, 2026-06-17T13:52:44Z)

Summary

The Codecov checks initially passed successfully at 13:05 UTC (CheckRun conclusion: SUCCESS, coverage: "41.79% (+0.00%) compared to 7507291", patch: "Coverage not affected"). Approximately 47 minutes later, at 13:52 UTC, Codecov posted a second round of commit status updates with state error and description "CI failed." — this was triggered exactly ~84 seconds after the ci/prow/e2e-azure-v2-self-managed Prow job posted its failure status at 13:51:19 UTC. This is a Codecov reactive behavior, not a coverage failure. The coverage itself passed; Codecov simply mirrors CI failure state back as an error on its own status checks.

Root Cause

The root cause is not a Codecov or coverage issue — it is a cascading status effect from the ci/prow/e2e-azure-v2-self-managed Prow job failure.

Here is the exact sequence of events:

  1. 13:05:07–13:05:08 UTC — Codecov posted CheckRun results: codecov/project = SUCCESS (41.79% coverage, +0.00%), codecov/patch = SUCCESS ("Coverage not affected"). These are the real coverage results.

  2. 13:07:52–13:07:53 UTC — Codecov also posted commit StatusContext results confirming success.

  3. 13:51:19 UTC — The ci/prow/e2e-azure-v2-self-managed Prow job reported FAILURE.

  4. 13:52:43–13:52:44 UTC — Codecov reacted by posting new commit StatusContext entries with state: error and description: "CI failed.". This overwrote the earlier successful StatusContext entries.

The Codecov configuration in .codecov.yml has wait_for_ci: false, which means Codecov reports coverage results before CI completes. However, Codecov still monitors CI status afterward and updates its own commit status to reflect CI failures — this is standard Codecov behavior where it signals "CI failed" as an error on its status checks even though coverage analysis itself succeeded.

GitHub's status API uses a "last status wins" model per context — the later error status at 13:52 overwrites the earlier success status at 13:07 for the same codecov/project and codecov/patch context names. The CheckRun API entries (which show SUCCESS) remain unaffected, creating the apparent contradiction where CheckRuns show success but StatusContexts show error.

These Codecov errors are purely informational and do not indicate any problem with the PR's code or test coverage. The actual failure to investigate is ci/prow/e2e-azure-v2-self-managed.

Recommendations
  1. Ignore the Codecov errors — they are a mirror of the ci/prow/e2e-azure-v2-self-managed failure and carry no independent signal. Coverage is fine at 41.79% with no regression.

  2. Investigate the real failure — the ci/prow/e2e-azure-v2-self-managed job that failed at 13:51:19 UTC. That job's artifacts are at: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/8584/pull-ci-openshift-hypershift-main-e2e-azure-v2-self-managed/2067230554738135040

  3. No action needed on Codecov configuration — the wait_for_ci: false setting is correct for fast feedback. The "CI failed" error status is expected Codecov behavior and does not block merge (Prow/tide does not gate on codecov status checks).

Evidence
Evidence Detail
codecov/project CheckRun SUCCESS at 13:05:07 UTC, conclusion: success
codecov/patch CheckRun SUCCESS at 13:05:08 UTC, conclusion: success
codecov/project StatusContext (initial) success at 13:07:52 UTC — "41.79% (+0.00%) compared to 7507291"
codecov/patch StatusContext (initial) success at 13:07:53 UTC — "Coverage not affected when comparing 7507291...c274044"
ci/prow/e2e-azure-v2-self-managed failure at 13:51:19 UTC — "Job failed."
codecov/project StatusContext (overwrite) error at 13:52:43 UTC — "CI failed." (84s after Prow failure)
codecov/patch StatusContext (overwrite) error at 13:52:44 UTC — "CI failed." (85s after Prow failure)
.codecov.yml config wait_for_ci: false — reports coverage before CI finishes, but still reacts to CI failures
Coverage impact Zero — 41.79% project coverage, no change on patch

@Nirshal

Nirshal commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

CI failure on e2e-azure-v2-self-managed was an infrastructure issue — the hypershift-install step timed out after ~5m due to API server rate limiting (client rate limiter Wait returned an error: context deadline exceeded). The HyperShift operator never deployed, so no test clusters were created and none of our test code ran. Retriggering.

/test e2e-azure-v2-self-managed

@cblecker cblecker left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both previous blocking items are addressed — thank you for the updates! One minor annotation fix needed for CI tooling.

@cblecker cblecker left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing [Feature:AzureOAuth] annotation for Sippy mapping

// The OAuth token flow test requires the test runner to have network connectivity
// to the Azure VNet (e.g., running in the management cluster or via VPN/peering).
func AzureOAuthLoadBalancerPrivateTest(getTestCtx internal.TestContextGetter) {
Context("Azure OAuth LoadBalancer in Private Topology", Label("Azure", "self-managed-azure-oauth-lb-private"), Ordered, func() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Context block is missing the [Feature:AzureOAuth] annotation that the other tests in this file have (e.g., line 230). Per the v2 AGENTS.md section 19, this is needed for Sippy Component Readiness mapping. Should be:

Context("[Feature:AzureOAuth] Azure OAuth LoadBalancer in Private Topology", Label("Azure", "self-managed-azure-oauth-lb-private"), Ordered, func() {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally missed that, thanks a lot for finding it!

…uth test

The Context block for AzureOAuthLoadBalancerPrivateTest was missing the
[Feature:AzureOAuth] Sippy/CR annotation required by v2 AGENTS.md
section 19 for test-to-Jira component mapping.

Signed-off-by: Alessandro Rossi <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants