ci(fleet): retry heavy lane on suite failure #407

Merged
joshua-temple merged 1 commit into main from fix/fleet-heavy-lane-retry
Jun 28, 2026

@joshua-temple

Collaborator

What

Adds an opt-in retry to the fleet's dispatch-suite action and enables it (2 attempts) on the 4env heavy lane only.

Why

4env is the heaviest fleet lane (4 environments + hotfix + rollback + merge_queue). Its scenarios occasionally race on shared state under load (a separate root-cause fix in the 4env suite normalizes env branches before hotfix dispatch). A bounded retry on this one lane absorbs residual intermittent flakes so that a single transient failure does not turn the whole fleet gate red, mirroring the e2e matrix's per-leg retry.

Changes

  • .github/actions/dispatch-suite/action.yaml: new retry-attempts input (default 1 = unchanged behavior) and retry-backoff. The dispatch+recover+watch logic is refactored into an attempt_suite() function wrapped in a bounded outer loop that re-dispatches a fresh run on non-success/recovery-miss/timeout, failing only after the final attempt. Watch-loop exit-code capture hardened.
  • .github/workflows/fleet-e2e.yaml: retry-attempts: '2' on the 4env lane only; all other lanes untouched.
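
The bounded outer loop described above can be sketched roughly as follows. This is a minimal illustration, not the action's actual script: `run_with_retry`, `calls`, and `pass_on` are hypothetical names, and the `attempt_suite` body here is a stand-in that fails until a chosen call count instead of performing a real dispatch+recover+watch cycle.

```bash
#!/usr/bin/env bash
# Sketch only: names other than attempt_suite are assumptions for illustration.

calls=0     # how many times the stand-in has been invoked
pass_on=2   # the stand-in succeeds from this call number onward

# Stand-in for one dispatch+recover+watch cycle.
# Returns 0 on success, 1 otherwise (failure, recovery miss, or timeout).
attempt_suite() {
  calls=$(( calls + 1 ))
  (( calls >= pass_on ))
}

# Run attempt_suite up to $1 times, sleeping $2 seconds between attempts.
# Only the final attempt's failure is returned to the caller.
run_with_retry() {
  local max_attempts="$1" backoff="$2"
  local attempt rc=1
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    echo "attempt ${attempt}/${max_attempts}"
    rc=0
    attempt_suite || rc=$?   # capture the exit code explicitly
    if (( rc == 0 )); then
      return 0
    fi
    if (( attempt < max_attempts )); then
      echo "attempt ${attempt} failed (rc=${rc}); retrying in ${backoff}s"
      sleep "$backoff"
    fi
  done
  return "$rc"
}

# Two total attempts with no backoff: the first call fails, the second passes.
run_with_retry 2 0
```

With a total of 1 the loop body runs once and its result is returned directly, which is how a default of 1 would keep the single-dispatch behavior.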

Safety

Retrying a suite is safe: every suite run begins with an idempotent reset+seed (Step 1). The default of 1 preserves current behavior for every other lane.

Verification

go build, golangci-lint (0 issues), actionlint, and shellcheck -S warning on the embedded script are all clean; a unit harness confirmed the loop semantics (default-1 no-retry, fail-then-pass, all-fail, garbage-input). This change hardens the heavy lane and complements the 4env suite env-branch fix.
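
For the garbage-input case, one plausible way to handle a malformed `retry-attempts` value is to coerce it to a positive integer before the loop runs. The PR does not say what the action does with bad input, so the fallback to 1 below is an assumption, and `normalize_attempts` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: the fallback-to-1 policy is assumed, not taken from the action.

# Echo the input if it is a positive integer with no leading zero;
# otherwise echo 1, meaning a single dispatch with no retry.
normalize_attempts() {
  local raw="${1:-}"
  if [[ "$raw" =~ ^[1-9][0-9]*$ ]]; then
    echo "$raw"
  else
    echo 1
  fi
}

normalize_attempts "2"
normalize_attempts "two"
```

Falling back to a single attempt, rather than failing the step, keeps a typo in a workflow file from changing behavior for lanes that never opted in.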

Wrap the dispatch-suite dispatch+recover+watch cycle in a bounded outer
retry loop gated by a new retry-attempts input (default 1, no retry). On a
non-success suite run, a recovery miss, or a watch timeout, it re-dispatches
a fresh run up to the configured total, logging each attempt; only the final
attempt's failure reds the step. This is safe because every scenario-suite
run begins by resetting and seeding dev, so a re-dispatch starts clean.

Set retry-attempts: 2 on the 4env heavy lane only, the most race-prone
surface, leaving every other lane at the historical single-dispatch default.

Signed-off-by: Joshua Temple <[email protected]>
@joshua-temple joshua-temple merged commit 7db22e9 into main Jun 28, 2026
14 checks passed