Fix flaky Test_Gateway* probe timing (#12298)#12299
Conversation
Widen the retry/poll budget in testGatewayAvailability from 2x5s (~10s) to poll over ~90s with a 5s backoff, so the probe tolerates the transient window while Contour programs the Envoy xDS route/cluster for a freshly-deployed gateway (which surfaces as a transient HTTP 503 on /healthz). The full request/response dump is now emitted only once on final failure to reduce log noise, with a concise per-attempt line. Also remove the misleading per-non-match log line in matchesActualLabels (test/validation/k8s.go) that fired for every scanned pod during its linear label match, which added diagnostic noise that misled triage. Genuine match failures are still reported by the existing remaining-resources loop. Test-only change; no product code modified. Co-authored-by: Copilot App <[email protected]> Signed-off-by: Brooke Hamilton <[email protected]>
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
There was a problem hiding this comment.
Pull request overview
This PR addresses intermittent Test_Gateway* failures in the corerp-noncloud functional suite by increasing the probe retry budget for the gateway /healthz check and reducing misleading log noise during Kubernetes label validation. The changes are confined to test code and aim to make CI runs more reliable under transient Envoy/Contour programming windows.
Changes:
- Expand
testGatewayAvailabilityfrom a short fixed retry loop to a ~90s polling window with per-attempt concise logging and full dumps only on final failure. - Remove per-non-match label log spam from
matchesActualLabels, keeping only genuine validation failure reporting.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/functional-portable/corerp/noncloud/resources/gateway_test.go | Increase gateway probe polling budget and reduce log noise during retries. |
| test/validation/k8s.go | Remove misleading log output emitted for every non-matching resource during label scans. |
Close the HTTP response body on the success path, close the previously retained response before overwriting it on retry, and close the final retained response after dumping it, so the widened poll budget does not leak connections/bodies. Co-authored-by: Copilot App <[email protected]> Signed-off-by: Brooke Hamilton <[email protected]>
Radius functional test overviewClick here to see the test run details
Test Status⌛ Building Radius and pushing container images for functional tests... |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #12299 +/- ##
=======================================
Coverage 52.97% 52.97%
=======================================
Files 754 754
Lines 48686 48686
=======================================
+ Hits 25791 25793 +2
+ Misses 20469 20468 -1
+ Partials 2426 2425 -1 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Description
Fixes the intermittent
Test_Gateway*failures in thecorerp-noncloudfunctional job, where the gateway probe fails with a transient HTTP503 Service Unavailablefrom Envoy on/healthzvia port-forward.Root cause: a freshly-deployed gateway briefly returns
503while Contour programs the Envoy xDS route/cluster for the new route. The probetestGatewayAvailabilityonly retried 2× with a 5s backoff (~10s total), which is too short to cover that programming window.This is a test-only change — no product code is modified.
Changes
test/functional-portable/corerp/noncloud/resources/gateway_test.go— widen the retry/poll budget intestGatewayAvailabilityfrom 2×5s to poll the expected status over ~90s with a 5s backoff. The full request/response dump is now emitted only once on final failure (with a concise one-line message per attempt) to reduce log noise.test/validation/k8s.go— remove the misleading per-non-match"Resource: … Expected labels … got …"log line inmatchesActualLabelsthat fired for every scanned pod during its linear label match. This was diagnostic noise that misled triage (label validation actually passes). Genuine match failures are still reported by the existingremaining-resources loop.Gating the probe on the
HTTPProxy/route reachingValid/programmed status (the issue's "optional" item) was intentionally skipped — the widened poll budget directly addresses the root cause and keeps the change minimal and self-contained.Type of change
Fixes: #12298
Contributor checklist