Skip to content

test: retry transient UCP connection reset/EOF in functional deploys#12300

Open
brooke-hamilton wants to merge 2 commits into
mainfrom
brooke-hamilton-fix-12297-corerp-cloud-connection-reset
Open

test: retry transient UCP connection reset/EOF in functional deploys#12300
brooke-hamilton wants to merge 2 commits into
mainfrom
brooke-hamilton-fix-12297-corerp-cloud-connection-reset

Conversation

@brooke-hamilton

Copy link
Copy Markdown
Member

Description

corerp-cloud functional deploys intermittently fail with connection reset by peer / EOF when the kind control-plane (kube-apiserver) restarts mid-run and drops the kubectl port-forward tunnel that rad uses to reach the UCP API server. Every in-flight parallel rad deploy connection is reset at the same instant. UCP and the Radius pods do not crash (RestartCount=0) — this is environmental GitHub-hosted-runner resource pressure bouncing the kind static control-plane pods, not a Radius product bug.

This PR extends the existing deploy-retry harness (which already retries transient container image-pull failures) to also retry these transient transport failures. Retries re-run rad deploy from scratch — a fresh port-forward each attempt — so once the control-plane recovers the retry succeeds. Retries do not mask real failures: a persistent deployment error still fails after retries are exhausted, and matching is limited to specific connection-reset/EOF network markers.

Changes

  • test/step/deployexecutor.go
    • Add IsTransientConnectionError (markers: connection reset by peer, : EOF, unexpected EOF, broken pipe).
    • Add combined IsTransientDeployError (image-pull or connection error) and make it the default ShouldRetry in NewDeployExecutor.
  • test/radcli/cli.go
    • Fold the rad CLI output into the error returned by deployInternal for non-structured failures. Transport-level causes are printed by rad to its output while the wrapped exec error only carried exit status 1, so the retry predicate previously could not see them. The structured Error: { ARM-error path and the nil path are unchanged.
  • test/step/deployexecutor_test.go
    • Unit tests for IsTransientConnectionError, IsTransientDeployError, and default-executor retry-on-connection-reset behavior.

Retry defaults are unchanged (MaxRetries=2, RetryDelay=30s); each retry also re-runs the full deploy, so total elapsed time across attempts spans several minutes.

Out of scope

Reducing the kind control-plane restarts themselves (runner memory/CPU pressure and the kind cluster resource footprint) is the environmental root cause and a separate follow-up; issue #12297 remains the tracking item for the "UCP connection reset / EOF" class.

Type of change

  • This pull request is a minor refactor, code cleanup, test improvement, or other maintenance task and doesn't change the functionality of Radius (issue link optional).

Fixes: #12297

Contributor checklist

  • An overview of proposed schema changes is included in a linked GitHub issue.
    • Not applicable
  • A design document is added or updated under eng/design-notes/ in this repository, if new APIs are being introduced.
    • Not applicable
  • The design document has been reviewed and approved by Radius maintainers/approvers.
    • Not applicable
  • A PR for resource-types-contrib is created, if resource types or recipes are affected by the changes in this PR.
    • Not applicable
  • A PR for dashboard is created, if the Radius Dashboard is affected by the changes in this PR.
    • Not applicable
  • A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
    • Not applicable

corerp-cloud functional deploys intermittently fail with 'connection reset by peer' / EOF when the kind control-plane restarts mid-run and drops the kubectl port-forward tunnel that rad uses to reach the UCP API server. UCP and the Radius pods do not crash; this is environmental runner pressure, not a product bug.

Extend the existing deploy-retry harness to also retry these transient transport failures:

- Add IsTransientConnectionError and combined IsTransientDeployError in test/step/deployexecutor.go; make the combined predicate the default ShouldRetry in NewDeployExecutor.

- Fold the rad CLI output into the error returned by deployInternal for non-structured failures so the retry predicate can inspect the transport-level cause (previously only 'exit status 1' surfaced).

- Add unit tests for the new predicates and default retry behavior.

Fixes #12297

Co-authored-by: Copilot App <[email protected]>
Signed-off-by: Brooke Hamilton <[email protected]>
Copilot AI review requested due to automatic review settings July 1, 2026 19:41
@brooke-hamilton brooke-hamilton requested review from a team as code owners July 1, 2026 19:41
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the functional test deploy harness against an intermittent CI failure mode where the kind control-plane restarts mid-run and drops the kubectl port-forward tunnel used by rad, causing parallel deploys to fail with transport errors (connection reset by peer / EOF). It expands the existing retry mechanism (previously focused on transient image pull failures) to also retry these transient UCP transport disruptions, and ensures the retry predicate can “see” the underlying transport cause by folding CLI output into the returned error.

Changes:

  • Add transient-connection error detection (connection reset by peer, : EOF, unexpected EOF, broken pipe) and make it part of the default deploy retry predicate.
  • Update the rad CLI test wrapper to include CLI output in non-structured deploy errors so retry logic can match transport markers.
  • Add unit tests covering the new retry predicate and default retry-on-connection-reset behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
test/step/deployexecutor.go Introduces transient connection error markers and makes the default retry predicate cover both image-pull and UCP transport failures.
test/step/deployexecutor_test.go Adds unit tests for transient connection/deploy error detection and default retry behavior.
test/radcli/cli.go Includes raw CLI output in non-structured deploy errors so transport failures are visible to retry predicates.

Comment thread test/step/deployexecutor.go
Comment thread test/radcli/cli.go
@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.97%. Comparing base (bf1015c) to head (6d8114a).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #12300   +/-   ##
=======================================
  Coverage   52.97%   52.97%           
=======================================
  Files         754      754           
  Lines       48686    48686           
=======================================
  Hits        25791    25791           
  Misses      20469    20469           
  Partials     2426     2426           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

Unit Tests

    2 files  ±0    452 suites  ±0   7m 27s ⏱️ -2s
5 656 tests ±0  5 654 ✅ ±0  2 💤 ±0  0 ❌ ±0 
6 853 runs  ±0  6 851 ✅ ±0  2 💤 ±0  0 ❌ ±0 

Results for commit 6d8114a. ± Comparison against base commit bf1015c.

♻️ This comment has been updated with latest results.

…olded output

Address PR review feedback:

- IsTransientConnectionError now returns false for *radcli.CLIError so a genuine structured deployment failure whose flattened ARM message happens to contain a transport marker (e.g. 'unexpected EOF') is never misclassified as a retryable connection reset.

- deployInternal folds only the tail (last 20 lines) of the rad CLI output into non-structured errors instead of the full output, avoiding duplicate large logs across retries while preserving the trailing transport markers used for retry classification.

- Add unit tests for the CLIError guard and the tailLines helper.

Co-authored-by: Copilot App <[email protected]>
Signed-off-by: Brooke Hamilton <[email protected]>
@radius-functional-tests

radius-functional-tests Bot commented Jul 1, 2026

Copy link
Copy Markdown

Radius functional test overview

🔍 Go to test action run

Click here to see the test run details
Name Value
Repository radius-project/radius
Commit ref 6d8114a
Unique ID funcc0bc668696
Image tag pr-funcc0bc668696
  • Dapr: 1.14.4
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-funcc0bc668696
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-funcc0bc668696
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-funcc0bc668696
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-funcc0bc668696
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-funcc0bc668696
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting ucp-cloud functional tests...
⌛ Starting corerp-cloud functional tests...
✅ ucp-cloud functional tests succeeded
❌ corerp-cloud functional test failed. Please check the logs for more details

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

Functional Tests - corerp-cloud

26 tests  ±0   17 ✅  - 8   35m 31s ⏱️ + 24m 25s
 2 suites ±0    1 💤 ±0 
 1 files   ±0    8 ❌ +8 

For more details on these failures, see this check.

Results for commit 6d8114a. ± Comparison against base commit bf1015c.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: corerp-cloud deploys fail with "connection reset by peer" / EOF when kind control-plane restarts mid-run

2 participants