ci(fleet): retry heavy lane on suite failure #407
Merged
Wrap the dispatch-suite dispatch+recover+watch cycle in a bounded outer retry loop gated by a new retry-attempts input (default 1, no retry). On a non-success suite run, a recovery miss, or a watch timeout, it re-dispatches a fresh run up to the configured total, logging each attempt; only the final attempt's failure reds the step. This is safe because every scenario-suite run begins by resetting and seeding dev, so a re-dispatch starts clean. Set retry-attempts: 2 on the 4env heavy lane only, the most race-prone surface, leaving every other lane at the historical single-dispatch default.

Signed-off-by: Joshua Temple <[email protected]>
What
Adds an opt-in retry to the fleet's dispatch-suite action and enables it (2 attempts) on the 4env heavy lane only.
Why
4env is the heaviest fleet lane (4 environments + hotfix + rollback + merge_queue). Its scenarios occasionally race shared state under load (a separate root-cause fix in the 4env suite normalizes env branches before hotfix dispatch). A bounded retry on this one lane absorbs residual intermittent flakes so a single transient does not red the whole fleet gate, mirroring the e2e matrix's per-leg retry.
Changes
.github/actions/dispatch-suite/action.yaml: new retry-attempts input (default 1 = unchanged behavior) and retry-backoff. The dispatch+recover+watch logic is refactored into an attempt_suite() function wrapped in a bounded outer loop that re-dispatches a fresh run on non-success/recovery-miss/timeout, failing only after the final attempt. Watch-loop exit-code capture hardened.
.github/workflows/fleet-e2e.yaml: retry-attempts: '2' on the 4env lane only; all other lanes untouched.
Safety
Retrying a suite is safe: every suite run begins with an idempotent reset+seed (Step 1). Default 1 preserves current behavior for every other lane.
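For reviewers who want the shape of the loop without opening the diff, here is a minimal sketch of a bounded outer retry in bash. It is not the code from action.yaml: retry_suite and its argument order are hypothetical, and attempt_suite here is only a stand-in that runs whatever command it is given, where the real function performs the dispatch+recover+watch cycle.

```shell
#!/usr/bin/env bash
# Sketch only: illustrative names, not the action's actual script.

# Stand-in for the real dispatch+recover+watch cycle (assumption).
# Returns 0 on a successful suite run and non-zero on a non-success run,
# a recovery miss, or a watch timeout.
attempt_suite() {
  "$@"
}

# retry_suite MAX_ATTEMPTS BACKOFF_SECONDS COMMAND [ARGS...]
# Runs attempt_suite up to MAX_ATTEMPTS times in total and fails only
# when the final attempt fails.
retry_suite() {
  local max_attempts="$1" backoff="$2"
  shift 2
  local attempt=1
  while :; do
    echo "attempt ${attempt}/${max_attempts}"
    if attempt_suite "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "suite failed after ${attempt} attempt(s)" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; re-dispatching in ${backoff}s" >&2
    sleep "$backoff"
    attempt=$((attempt + 1))
  done
}
```

With MAX_ATTEMPTS set to 1 the loop body runs exactly once, which is how a default of 1 can leave the other lanes on single-dispatch behavior.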
Verification
go build, golangci-lint (0 issues), actionlint, shellcheck -S warning on the embedded script all clean; a unit harness confirmed the loop semantics (default-1 no-retry, fail-then-pass, all-fail, garbage-input). Hardens the heavy lane; complements the 4env suite env-branch fix.
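The four harness cases named above can be illustrated with a small scripted simulation. This is a sketch, not the PR's harness: normalize_attempts and run_case are hypothetical helpers, and the fallback of garbage input to a single attempt is an assumption about the intended behavior rather than something the description states.

```shell
#!/usr/bin/env bash
# Illustrative harness; not the PR's actual test code.

# ASSUMPTION: empty, non-numeric, or zero input falls back to 1
# (a single dispatch, no retry).
normalize_attempts() {
  case "$1" in
    ''|*[!0-9]*) echo 1 ;;
    *) if [ "$1" -ge 1 ]; then echo "$1"; else echo 1; fi ;;
  esac
}

# run_case RAW_ATTEMPTS OUTCOME...
# Each OUTCOME is "pass" or "fail" and scripts the result of one dispatch.
# Prints "<final status>:<dispatch count>" and returns 0 only on a pass.
run_case() {
  local max
  max="$(normalize_attempts "$1")"
  shift
  local dispatched=0 outcome
  for outcome in "$@"; do
    dispatched=$((dispatched + 1))
    if [ "$outcome" = pass ]; then
      echo "pass:${dispatched}"
      return 0
    fi
    # Stop once the configured total has been used up.
    if [ "$dispatched" -ge "$max" ]; then
      break
    fi
  done
  echo "fail:${dispatched}"
  return 1
}

run_case 1 fail pass || true       # default-1 no-retry: prints fail:1
run_case 2 fail pass               # fail-then-pass:     prints pass:2
run_case 2 fail fail || true       # all-fail:           prints fail:2
run_case banana fail pass || true  # garbage-input:      prints fail:1
```

Scripting the outcomes keeps the loop semantics checkable without dispatching a real suite run.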