fix(prow-job-executor): retry SubmitJob on transient 429/5xx#251
fix(prow-job-executor): retry SubmitJob on transient 429/5xx#251raelga wants to merge 2 commits into
Conversation
The prowjobexecutor execute command used in EV2 gating had no retry logic on the SubmitJob() call to the Prow Gangway API. Gangway enforces a ~9 req/min rate limit and rejects excess requests immediately with HTTP 429, which failed the entire EV2 gating step and forced the full deployment job to restart. Refactor SubmitJob into a single-shot submitJobOnce() plus a retry wrapper that reuses the existing httpStatusError / isRetryableStatusCode plumbing already used by GetJobStatus(). Transient failures (429, 5xx) are retried with exponential backoff (1, 2, 4, 8, 16, 30m capped, ~61m worst case); 401/403 remain non-retryable. Retries are logged for visibility in EV2 job output. Add unit tests covering 429/5xx retry, 401/403 no-retry, and success paths. Jira: AROSLSRE-1168
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds retry logic with exponential backoff to Prow Gangway job submission, and introduces tests to validate retry vs non-retry behavior.
Changes:
- Add configurable exponential backoff to
Client.SubmitJoband retry on transient failures. - Introduce
submitJobOnceand wrap non-200 responses in a typed status error for retry classification. - Add unit tests covering retries, non-retryable statuses, and retryable status-code classification.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tools/prow-job-executor/prowjob/client.go | Implements exponential-backoff retries for job submission and adds backoff configuration on the client. |
| tools/prow-job-executor/prowjob/client_test.go | Adds tests for retry behavior and retryable status-code logic. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| condition := func(ctx context.Context) (bool, error) { | ||
| id, err := c.submitJobOnce(ctx, request) | ||
| if err != nil { | ||
| lastErr = err | ||
| // Check if this is a non-retryable error (e.g., 401/403). | ||
| if isNonRetryableHTTPError(err) { | ||
| return false, err // Stop retrying and propagate the error | ||
| } | ||
|
|
||
| // For retryable errors, log and continue. | ||
| logger.Info("Job submission failed with a transient error, will retry", "error", err.Error()) | ||
| return false, nil | ||
| } |
There was a problem hiding this comment.
Fixed in 7dc3f03. Inverted the logic: SubmitJob now retries only known-transient errors via a new `isRetryableError` (HTTP 429, 5xx, and `net.Error`). Deterministic failures (marshal, request build, decode, other 4xx) now fail fast instead of being retried.
| // Exponential backoff with jitter for transient job-submission failures. | ||
| // Delays grow 1, 2, 4, 8, 16, 30 minutes (capped), for a worst-case | ||
| // cumulative wait of ~61 minutes before giving up. The parent context | ||
| // still bounds the total runtime. | ||
| submitBackoff: wait.Backoff{ | ||
| Duration: time.Minute, // Initial delay | ||
| Factor: 2.0, // Exponential factor | ||
| Jitter: 0.1, // 10% jitter to de-sync concurrent submitters | ||
| Steps: 6, // Maximum retries | ||
| Cap: 30 * time.Minute, // Maximum delay cap | ||
| }, |
There was a problem hiding this comment.
Fixed in 7dc3f03. The comment was wrong about the schedule. apimachinery Cap not only clamps a delay but also stops retries once reached, so I dropped Cap and bound by Steps: 7 → delays 1/2/4/8/16/32m, ~63m worst case. Comment updated to document this.
| wantAttempts: -1, // retried multiple times; exact count is backoff/k8s-version dependent | ||
| }, |
There was a problem hiding this comment.
Fixed in 7dc3f03. fastBackoff now uses Jitter: 0 and no Cap, so persistent-failure cases assert an exact attempt count (== fastBackoffSteps) instead of > 1.
| if got := atomic.LoadInt32(&attempts); tc.wantAttempts < 0 { | ||
| if got <= 1 { | ||
| t.Fatalf("expected multiple retries, got %d attempts", got) | ||
| } | ||
| } else if got != tc.wantAttempts { | ||
| t.Fatalf("got %d attempts, want %d", got, tc.wantAttempts) | ||
| } |
| func TestIsRetryableStatusCode(t *testing.T) { | ||
| tests := []struct { | ||
| status int | ||
| retryable bool | ||
| }{ | ||
| {http.StatusTooManyRequests, true}, | ||
| {http.StatusServiceUnavailable, true}, | ||
| {http.StatusInternalServerError, true}, | ||
| {http.StatusBadGateway, true}, | ||
| {http.StatusUnauthorized, false}, | ||
| {http.StatusForbidden, false}, | ||
| } |
There was a problem hiding this comment.
Fixed in 7dc3f03. Added explicit non-retryable cases (400/404/409/422 → 1 attempt) and retryable 5xx cases (500/502/503), plus a network-error retry test and a dedicated TestIsRetryableError unit test.
- Retry only known-transient errors (HTTP 429, 5xx, network errors); fail fast on deterministic errors (marshaling, request build, decode, other 4xx) instead of retrying them. - Correct backoff schedule: bound by Steps (7 attempts, 1/2/4/8/16/32m, ~63m worst case) rather than Cap, since apimachinery's Cap also halts retries early once reached. - Make retry tests deterministic (exact attempt counts) and add explicit non-retryable (400/404/409/422) and retryable 5xx cases, plus a network-error and isRetryableError unit test.
Summary
prowjobexecutor execute(used in EV2 gating) had no retry logic on theSubmitJob()call to the Prow Gangway API. Gangway enforces a ~9 requests/minute rate limit and rejects excess requests immediately with HTTP 429 (nodelay, no queueing). A single 429 failed the entire EV2 gating step, forcing the full deployment job to restart from scratch.Ironically, the sibling
GetJobStatus()call in the same file already had exponential-backoff retry that treats 429 as retryable —SubmitJob()was just never given the same treatment.Changes
SubmitJob()into a single-shotsubmitJobOnce()plus a retry wrapper, reusing the existinghttpStatusError/isRetryableStatusCode()plumbing already used byGetJobStatus().Duration 1m, Factor 2, Jitter 0.1, Cap 30m, Steps 6→ delays1, 2, 4, 8, 16, 30min, ~61 min worst-case cumulative wait. The parentcontextstill bounds total runtime (ExponentialBackoffWithContextaborts mid-sleep on cancellation).Clientfield (defaulted inNewClient) so tests can inject a fast backoff.isRetryableStatusCodecoverage.Testing
go test -race ./prowjob/— passgo tool golangci-lint run ./prowjob/— 0 issuesNotes for reviewers
isRetryableStatusCode()(everything except 401/403), keeping submit and status behavior identical. Happy to tighten to an explicit{429, 500, 502, 503, 504}allowlist if preferred.Jitter: 0.1is positive-only skew (~+10%), so the true ceiling is ~67 min; it helps de-sync concurrent EV2 submitters against the shared 9 req/min limit.Observed failure
branch-ci-Azure-ARO-HCP-main-e2e-prod-e2e-parallel(PROD eastus2euap, EV2 gating)job submission failed with status 429: ... 429 Too Many Requests ... nginx/1.20.1Jira: AROSLSRE-1168