Fix flaky Test_Gateway* probe timing (#12298) #12299

Open
brooke-hamilton wants to merge 2 commits into main from brooke-hamilton-fix-flaky-gateway-test-12298

@brooke-hamilton

Member

Description

Fixes the intermittent Test_Gateway* failures in the corerp-noncloud functional job, where the gateway probe fails with a transient HTTP 503 Service Unavailable from Envoy on /healthz via port-forward.

Root cause: a freshly-deployed gateway briefly returns 503 while Contour programs the Envoy xDS route/cluster for the new route. The probe testGatewayAvailability only retried 2× with a 5s backoff (~10s total), which is too short to cover that programming window.

This is a test-only change — no product code is modified.

Changes

  • test/functional-portable/corerp/noncloud/resources/gateway_test.go: widen the retry budget in testGatewayAvailability from 2×5s (~10s) to polling for the expected status over ~90s with a 5s backoff. Each attempt now logs a concise one-line message, and the full request/response dump is emitted only once, on final failure, to reduce log noise.
  • test/validation/k8s.go: remove the misleading per-non-match "Resource: … Expected labels … got …" log line in matchesActualLabels, which fired for every scanned pod during the linear label match. This diagnostic noise misled triage, because label validation actually passes. Genuine match failures are still reported by the existing remaining-resources loop.

Gating the probe on the HTTPProxy/route reaching Valid/programmed status (the issue's "optional" item) was intentionally skipped — the widened poll budget directly addresses the root cause and keeps the change minimal and self-contained.

Type of change

  • This pull request fixes a bug in Radius and has an approved issue (issue link required).

Fixes: #12298

Contributor checklist

  • Tests are changed as part of this PR (test-only change)
  • No product code is modified

Widen the retry/poll budget in testGatewayAvailability from 2x5s (~10s) to poll over ~90s with a 5s backoff, so the probe tolerates the transient window while Contour programs the Envoy xDS route/cluster for a freshly-deployed gateway (which surfaces as a transient HTTP 503 on /healthz). The full request/response dump is now emitted only once on final failure to reduce log noise, with a concise per-attempt line.

Also remove the misleading per-non-match log line in matchesActualLabels (test/validation/k8s.go) that fired for every scanned pod during its linear label match, which added diagnostic noise that misled triage. Genuine match failures are still reported by the existing remaining-resources loop.

Test-only change; no product code modified.

Co-authored-by: Copilot App <[email protected]>
Signed-off-by: Brooke Hamilton <[email protected]>
@brooke-hamilton brooke-hamilton requested a review from a team as a code owner July 1, 2026 19:09
Copilot AI review requested due to automatic review settings July 1, 2026 19:09
@github-actions

github-actions Bot commented Jul 1, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses intermittent Test_Gateway* failures in the corerp-noncloud functional suite by increasing the probe retry budget for the gateway /healthz check and reducing misleading log noise during Kubernetes label validation. The changes are confined to test code and aim to make CI runs more reliable under transient Envoy/Contour programming windows.

Changes:

  • Expand testGatewayAvailability from a short fixed retry loop to a ~90s polling window with per-attempt concise logging and full dumps only on final failure.
  • Remove per-non-match label log spam from matchesActualLabels, keeping only genuine validation failure reporting.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| test/functional-portable/corerp/noncloud/resources/gateway_test.go | Increase gateway probe polling budget and reduce log noise during retries. |
| test/validation/k8s.go | Remove misleading log output emitted for every non-matching resource during label scans. |

Comment thread test/functional-portable/corerp/noncloud/resources/gateway_test.go
Close the HTTP response body on the success path, close the previously retained response before overwriting it on retry, and close the final retained response after dumping it, so the widened poll budget does not leak connections/bodies.

Co-authored-by: Copilot App <[email protected]>
Signed-off-by: Brooke Hamilton <[email protected]>
@radius-functional-tests

radius-functional-tests Bot commented Jul 1, 2026

Radius functional test overview

| Name | Value |
| --- | --- |
| Repository | radius-project/radius |
| Commit ref | 5c407d8 |
| Unique ID | funcdd4d66aee3 |
| Image tag | pr-funcdd4d66aee3 |

  • Dapr: 1.14.4
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-funcdd4d66aee3
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-funcdd4d66aee3
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-funcdd4d66aee3
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-funcdd4d66aee3
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-funcdd4d66aee3
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting corerp-cloud functional tests...
⌛ Starting ucp-cloud functional tests...
✅ ucp-cloud functional tests succeeded
✅ corerp-cloud functional tests succeeded

@github-actions

github-actions Bot commented Jul 1, 2026

Unit Tests

    2 files  ±0    452 suites  ±0   7m 32s ⏱️ +3s
5 656 tests ±0  5 654 ✅ ±0  2 💤 ±0  0 ❌ ±0 
6 853 runs  ±0  6 851 ✅ ±0  2 💤 ±0  0 ❌ ±0 

Results for commit 5c407d8. ± Comparison against base commit bf1015c.

@codecov

codecov Bot commented Jul 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.97%. Comparing base (bf1015c) to head (5c407d8).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #12299   +/-   ##
=======================================
  Coverage   52.97%   52.97%           
=======================================
  Files         754      754           
  Lines       48686    48686           
=======================================
+ Hits        25791    25793    +2     
+ Misses      20469    20468    -1     
+ Partials     2426     2425    -1     


Development

Successfully merging this pull request may close these issues.

Flaky test: Test_Gateway* (503 Service Unavailable from Envoy on /healthz via port-forward)

2 participants