Skip to content

fix(prow-job-executor): retry SubmitJob on transient 429/5xx#251

Open
raelga wants to merge 2 commits into
Azure:mainfrom
raelga:raelga/aroslsre-1168-submitjob-retry
Open

fix(prow-job-executor): retry SubmitJob on transient 429/5xx#251
raelga wants to merge 2 commits into
Azure:mainfrom
raelga:raelga/aroslsre-1168-submitjob-retry

Conversation

@raelga

@raelga raelga commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

prowjobexecutor execute (used in EV2 gating) had no retry logic on the SubmitJob() call to the Prow Gangway API. Gangway enforces a ~9 requests/minute rate limit and rejects excess requests immediately with HTTP 429 (nodelay, no queueing). A single 429 failed the entire EV2 gating step, forcing the full deployment job to restart from scratch.

Ironically, the sibling GetJobStatus() call in the same file already had exponential-backoff retry that treats 429 as retryable — SubmitJob() was just never given the same treatment.

Changes

  • Refactor SubmitJob() into a single-shot submitJobOnce() plus a retry wrapper, reusing the existing httpStatusError / isRetryableStatusCode() plumbing already used by GetJobStatus().
  • Transient failures (429, 5xx) are retried with exponential backoff; 401/403 remain non-retryable.
  • Backoff: Duration 1m, Factor 2, Jitter 0.1, Cap 30m, Steps 6 → delays 1, 2, 4, 8, 16, 30 min, ~61 min worst-case cumulative wait. The parent context still bounds total runtime (ExponentialBackoffWithContext aborts mid-sleep on cancellation).
  • The backoff is a Client field (defaulted in NewClient) so tests can inject a fast backoff.
  • Retries are logged so they are visible in EV2 job output.
  • Add unit tests: 429→200, 503→200, persistent-429 exhausts retries, 401/403 no-retry, first-try success, plus isRetryableStatusCode coverage.

Testing

  • go test -race ./prowjob/ — pass
  • go tool golangci-lint run ./prowjob/ — 0 issues

Notes for reviewers

  • Retryability intentionally reuses the existing isRetryableStatusCode() (everything except 401/403), keeping submit and status behavior identical. Happy to tighten to an explicit {429, 500, 502, 503, 504} allowlist if preferred.
  • Jitter: 0.1 is positive-only skew (~+10%), so the true ceiling is ~67 min; it helps de-sync concurrent EV2 submitters against the shared 9 req/min limit.

Observed failure

  • Job: branch-ci-Azure-ARO-HCP-main-e2e-prod-e2e-parallel (PROD eastus2euap, EV2 gating)
  • Error: job submission failed with status 429: ... 429 Too Many Requests ... nginx/1.20.1

Jira: AROSLSRE-1168

The prowjobexecutor execute command used in EV2 gating had no retry logic on
the SubmitJob() call to the Prow Gangway API. Gangway enforces a ~9 req/min
rate limit and rejects excess requests immediately with HTTP 429, which failed
the entire EV2 gating step and forced the full deployment job to restart.

Refactor SubmitJob into a single-shot submitJobOnce() plus a retry wrapper that
reuses the existing httpStatusError / isRetryableStatusCode plumbing already
used by GetJobStatus(). Transient failures (429, 5xx) are retried with
exponential backoff (1, 2, 4, 8, 16, 30m capped, ~61m worst case); 401/403
remain non-retryable. Retries are logged for visibility in EV2 job output.

Add unit tests covering 429/5xx retry, 401/403 no-retry, and success paths.

Jira: AROSLSRE-1168
Copilot AI review requested due to automatic review settings June 17, 2026 12:11

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds retry logic with exponential backoff to Prow Gangway job submission, and introduces tests to validate retry vs non-retry behavior.

Changes:

  • Add configurable exponential backoff to Client.SubmitJob and retry on transient failures.
  • Introduce submitJobOnce and wrap non-200 responses in a typed status error for retry classification.
  • Add unit tests covering retries, non-retryable statuses, and retryable status-code classification.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
tools/prow-job-executor/prowjob/client.go Implements exponential-backoff retries for job submission and adds backoff configuration on the client.
tools/prow-job-executor/prowjob/client_test.go Adds tests for retry behavior and retryable status-code logic.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +90 to +102
condition := func(ctx context.Context) (bool, error) {
id, err := c.submitJobOnce(ctx, request)
if err != nil {
lastErr = err
// Check if this is a non-retryable error (e.g., 401/403).
if isNonRetryableHTTPError(err) {
return false, err // Stop retrying and propagate the error
}

// For retryable errors, log and continue.
logger.Info("Job submission failed with a transient error, will retry", "error", err.Error())
return false, nil
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7dc3f03. Inverted the logic: SubmitJob now retries only known-transient errors via a new `isRetryableError` (HTTP 429, 5xx, and `net.Error`). Deterministic failures (marshal, request build, decode, other 4xx) now fail fast instead of being retried.

Comment on lines +61 to +71
// Exponential backoff with jitter for transient job-submission failures.
// Delays grow 1, 2, 4, 8, 16, 30 minutes (capped), for a worst-case
// cumulative wait of ~61 minutes before giving up. The parent context
// still bounds the total runtime.
submitBackoff: wait.Backoff{
Duration: time.Minute, // Initial delay
Factor: 2.0, // Exponential factor
Jitter: 0.1, // 10% jitter to de-sync concurrent submitters
Steps: 6, // Maximum retries
Cap: 30 * time.Minute, // Maximum delay cap
},

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7dc3f03. The comment was wrong about the schedule. apimachinery Cap not only clamps a delay but also stops retries once reached, so I dropped Cap and bound by Steps: 7 → delays 1/2/4/8/16/32m, ~63m worst case. Comment updated to document this.

Comment on lines +85 to +86
wantAttempts: -1, // retried multiple times; exact count is backoff/k8s-version dependent
},

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7dc3f03. fastBackoff now uses Jitter: 0 and no Cap, so persistent-failure cases assert an exact attempt count (== fastBackoffSteps) instead of > 1.

Comment on lines +143 to +149
if got := atomic.LoadInt32(&attempts); tc.wantAttempts < 0 {
if got <= 1 {
t.Fatalf("expected multiple retries, got %d attempts", got)
}
} else if got != tc.wantAttempts {
t.Fatalf("got %d attempts, want %d", got, tc.wantAttempts)
}
Comment on lines +154 to +165
func TestIsRetryableStatusCode(t *testing.T) {
tests := []struct {
status int
retryable bool
}{
{http.StatusTooManyRequests, true},
{http.StatusServiceUnavailable, true},
{http.StatusInternalServerError, true},
{http.StatusBadGateway, true},
{http.StatusUnauthorized, false},
{http.StatusForbidden, false},
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7dc3f03. Added explicit non-retryable cases (400/404/409/422 → 1 attempt) and retryable 5xx cases (500/502/503), plus a network-error retry test and a dedicated TestIsRetryableError unit test.

- Retry only known-transient errors (HTTP 429, 5xx, network errors);
  fail fast on deterministic errors (marshaling, request build, decode,
  other 4xx) instead of retrying them.
- Correct backoff schedule: bound by Steps (7 attempts, 1/2/4/8/16/32m,
  ~63m worst case) rather than Cap, since apimachinery's Cap also halts
  retries early once reached.
- Make retry tests deterministic (exact attempt counts) and add explicit
  non-retryable (400/404/409/422) and retryable 5xx cases, plus a
  network-error and isRetryableError unit test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants