test: retry transient UCP connection reset/EOF in functional deploys #12300
brooke-hamilton wants to merge 2 commits into
corerp-cloud functional deploys intermittently fail with 'connection reset by peer' / EOF when the kind control-plane restarts mid-run and drops the kubectl port-forward tunnel that rad uses to reach the UCP API server. UCP and the Radius pods do not crash; this is environmental runner pressure, not a product bug.

Extend the existing deploy-retry harness to also retry these transient transport failures:

- Add IsTransientConnectionError and combined IsTransientDeployError in test/step/deployexecutor.go; make the combined predicate the default ShouldRetry in NewDeployExecutor.
- Fold the rad CLI output into the error returned by deployInternal for non-structured failures so the retry predicate can inspect the transport-level cause (previously only 'exit status 1' surfaced).
- Add unit tests for the new predicates and default retry behavior.

Fixes #12297

Co-authored-by: Copilot App <[email protected]>
Signed-off-by: Brooke Hamilton <[email protected]>
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: None
Pull request overview
This PR hardens the functional test deploy harness against an intermittent CI failure mode where the kind control-plane restarts mid-run and drops the kubectl port-forward tunnel used by rad, causing parallel deploys to fail with transport errors (connection reset by peer / EOF). It expands the existing retry mechanism (previously focused on transient image pull failures) to also retry these transient UCP transport disruptions, and ensures the retry predicate can “see” the underlying transport cause by folding CLI output into the returned error.
Changes:
- Add transient-connection error detection (`connection reset by peer`, `: EOF`, `unexpected EOF`, `broken pipe`) and make it part of the default deploy retry predicate.
- Update the `rad` CLI test wrapper to include CLI output in non-structured deploy errors so retry logic can match transport markers.
- Add unit tests covering the new retry predicate and default retry-on-connection-reset behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/step/deployexecutor.go | Introduces transient connection error markers and makes the default retry predicate cover both image-pull and UCP transport failures. |
| test/step/deployexecutor_test.go | Adds unit tests for transient connection/deploy error detection and default retry behavior. |
| test/radcli/cli.go | Includes raw CLI output in non-structured deploy errors so transport failures are visible to retry predicates. |
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage diff (main vs. #12300):

| Metric | main | #12300 |
|---|---|---|
| Coverage | 52.97% | 52.97% |
| Files | 754 | 754 |
| Lines | 48686 | 48686 |
| Hits | 25791 | 25791 |
| Misses | 20469 | 20469 |
| Partials | 2426 | 2426 |
…olded output

Address PR review feedback:

- IsTransientConnectionError now returns false for *radcli.CLIError so a genuine structured deployment failure whose flattened ARM message happens to contain a transport marker (e.g. 'unexpected EOF') is never misclassified as a retryable connection reset.
- deployInternal folds only the tail (last 20 lines) of the rad CLI output into non-structured errors instead of the full output, avoiding duplicate large logs across retries while preserving the trailing transport markers used for retry classification.
- Add unit tests for the CLIError guard and the tailLines helper.

Co-authored-by: Copilot App <[email protected]>
Signed-off-by: Brooke Hamilton <[email protected]>
Radius functional test overview
Test Status: ⌛ Building Radius and pushing container images for functional tests...
Functional Tests - corerp-cloud: 26 tests ±0, 17 ✅ -8, 35m 31s ⏱️ +24m 25s. For more details on these failures, see this check. Results for commit 6d8114a. ± Comparison against base commit bf1015c.
Description
`corerp-cloud` functional deploys intermittently fail with `connection reset by peer` / `EOF` when the kind control-plane (kube-apiserver) restarts mid-run and drops the `kubectl port-forward` tunnel that `rad` uses to reach the UCP API server. Every in-flight parallel `rad deploy` connection is reset at the same instant. UCP and the Radius pods do not crash (`RestartCount=0`); this is environmental GitHub-hosted-runner resource pressure bouncing the kind static control-plane pods, not a Radius product bug.

This PR extends the existing deploy-retry harness (which already retries transient container image-pull failures) to also retry these transient transport failures. Retries re-run `rad deploy` from scratch, with a fresh port-forward on each attempt, so once the control-plane recovers the retry succeeds. Retries do not mask real failures: a persistent deployment error still fails after retries are exhausted, and matching is limited to specific connection-reset/EOF network markers.

Changes

- `test/step/deployexecutor.go`
  - Add `IsTransientConnectionError` (markers: `connection reset by peer`, `: EOF`, `unexpected EOF`, `broken pipe`).
  - Add `IsTransientDeployError` (image-pull or connection error) and make it the default `ShouldRetry` in `NewDeployExecutor`.
- `test/radcli/cli.go`
  - Fold the `rad` CLI output into the error returned by `deployInternal` for non-structured failures. Transport-level causes are printed by `rad` to its output while the wrapped exec error only carried `exit status 1`, so the retry predicate previously could not see them. The structured `Error: {` ARM-error path and the `nil` path are unchanged.
- `test/step/deployexecutor_test.go`
  - Add unit tests for `IsTransientConnectionError`, `IsTransientDeployError`, and default-executor retry-on-connection-reset behavior.

Retry defaults are unchanged (`MaxRetries=2`, `RetryDelay=30s`); each retry also re-runs the full deploy, so total elapsed time across attempts spans several minutes.

Out of scope

Reducing the kind control-plane restarts themselves (runner memory/CPU pressure and the kind cluster resource footprint) is the environmental root cause and a separate follow-up; issue #12297 remains the tracking item for the "UCP connection reset / EOF" class.
Type of change
Fixes: #12297