fix(prow-job-executor): retry SubmitJob on transient 429/5xx by raelga · Pull Request #251 · Azure/ARO-Tools

raelga · 2026-06-17T12:11:39Z

Summary

prowjobexecutor execute (used in EV2 gating) had no retry logic on the SubmitJob() call to the Prow Gangway API. Gangway enforces a ~9 requests/minute rate limit and rejects excess requests immediately with HTTP 429 (nodelay, no queueing). A single 429 failed the entire EV2 gating step, forcing the full deployment job to restart from scratch.

Ironically, the sibling GetJobStatus() call in the same file already had exponential-backoff retry that treats 429 as retryable — SubmitJob() was just never given the same treatment.

Changes

Refactor SubmitJob() into a single-shot submitJobOnce() plus a retry wrapper, reusing the existing httpStatusError / isRetryableStatusCode() plumbing already used by GetJobStatus().
Transient failures (429, 5xx) are retried with exponential backoff; 401/403 remain non-retryable.
Backoff: Duration 1m, Factor 2, Jitter 0.1, Cap 30m, Steps 6 → delays 1, 2, 4, 8, 16, 30 min, ~61 min worst-case cumulative wait. The parent context still bounds total runtime (ExponentialBackoffWithContext aborts mid-sleep on cancellation).
The backoff is a Client field (defaulted in NewClient) so tests can inject a fast backoff.
Retries are logged so they are visible in EV2 job output.
Add unit tests: 429→200, 503→200, persistent-429 exhausts retries, 401/403 no-retry, first-try success, plus isRetryableStatusCode coverage.

Testing

go test -race ./prowjob/ — pass
go tool golangci-lint run ./prowjob/ — 0 issues

Notes for reviewers

Retryability intentionally reuses the existing isRetryableStatusCode() (everything except 401/403), keeping submit and status behavior identical. Happy to tighten to an explicit {429, 500, 502, 503, 504} allowlist if preferred.
Jitter: 0.1 is positive-only skew (~+10%), so the true ceiling is ~67 min; it helps de-sync concurrent EV2 submitters against the shared 9 req/min limit.

Observed failure

Job: branch-ci-Azure-ARO-HCP-main-e2e-prod-e2e-parallel (PROD eastus2euap, EV2 gating)
Error: job submission failed with status 429: ... 429 Too Many Requests ... nginx/1.20.1

Jira: AROSLSRE-1168

The prowjobexecutor execute command used in EV2 gating had no retry logic on the SubmitJob() call to the Prow Gangway API. Gangway enforces a ~9 req/min rate limit and rejects excess requests immediately with HTTP 429, which failed the entire EV2 gating step and forced the full deployment job to restart. Refactor SubmitJob into a single-shot submitJobOnce() plus a retry wrapper that reuses the existing httpStatusError / isRetryableStatusCode plumbing already used by GetJobStatus(). Transient failures (429, 5xx) are retried with exponential backoff (1, 2, 4, 8, 16, 30m capped, ~61m worst case); 401/403 remain non-retryable. Retries are logged for visibility in EV2 job output. Add unit tests covering 429/5xx retry, 401/403 no-retry, and success paths. Jira: AROSLSRE-1168

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds retry logic with exponential backoff to Prow Gangway job submission, and introduces tests to validate retry vs non-retry behavior.

Changes:

Add configurable exponential backoff to Client.SubmitJob and retry on transient failures.
Introduce submitJobOnce and wrap non-200 responses in a typed status error for retry classification.
Add unit tests covering retries, non-retryable statuses, and retryable status-code classification.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
tools/prow-job-executor/prowjob/client.go	Implements exponential-backoff retries for job submission and adds backoff configuration on the client.
tools/prow-job-executor/prowjob/client_test.go	Adds tests for retry behavior and retryable status-code logic.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

raelga · 2026-06-17T14:47:00Z

+	condition := func(ctx context.Context) (bool, error) {
+		id, err := c.submitJobOnce(ctx, request)
+		if err != nil {
+			lastErr = err
+			// Check if this is a non-retryable error (e.g., 401/403).
+			if isNonRetryableHTTPError(err) {
+				return false, err // Stop retrying and propagate the error
+			}
+
+			// For retryable errors, log and continue.
+			logger.Info("Job submission failed with a transient error, will retry", "error", err.Error())
+			return false, nil
+		}


Fixed in 7dc3f03. Inverted the logic: SubmitJob now retries only known-transient errors via a new `isRetryableError` (HTTP 429, 5xx, and `net.Error`). Deterministic failures (marshal, request build, decode, other 4xx) now fail fast instead of being retried.

raelga · 2026-06-17T14:47:02Z

+		// Exponential backoff with jitter for transient job-submission failures.
+		// Delays grow 1, 2, 4, 8, 16, 30 minutes (capped), for a worst-case
+		// cumulative wait of ~61 minutes before giving up. The parent context
+		// still bounds the total runtime.
+		submitBackoff: wait.Backoff{
+			Duration: time.Minute,      // Initial delay
+			Factor:   2.0,              // Exponential factor
+			Jitter:   0.1,              // 10% jitter to de-sync concurrent submitters
+			Steps:    6,                // Maximum retries
+			Cap:      30 * time.Minute, // Maximum delay cap
+		},


Fixed in 7dc3f03. The comment was wrong about the schedule. apimachinery Cap not only clamps a delay but also stops retries once reached, so I dropped Cap and bound by Steps: 7 → delays 1/2/4/8/16/32m, ~63m worst case. Comment updated to document this.

raelga · 2026-06-17T14:47:05Z

+			wantAttempts:    -1, // retried multiple times; exact count is backoff/k8s-version dependent
+		},


Fixed in 7dc3f03. fastBackoff now uses Jitter: 0 and no Cap, so persistent-failure cases assert an exact attempt count (== fastBackoffSteps) instead of > 1.

+			if got := atomic.LoadInt32(&attempts); tc.wantAttempts < 0 {
+				if got <= 1 {
+					t.Fatalf("expected multiple retries, got %d attempts", got)
+				}
+			} else if got != tc.wantAttempts {
+				t.Fatalf("got %d attempts, want %d", got, tc.wantAttempts)
+			}


raelga · 2026-06-17T14:47:07Z

+func TestIsRetryableStatusCode(t *testing.T) {
+	tests := []struct {
+		status    int
+		retryable bool
+	}{
+		{http.StatusTooManyRequests, true},
+		{http.StatusServiceUnavailable, true},
+		{http.StatusInternalServerError, true},
+		{http.StatusBadGateway, true},
+		{http.StatusUnauthorized, false},
+		{http.StatusForbidden, false},
+	}


Fixed in 7dc3f03. Added explicit non-retryable cases (400/404/409/422 → 1 attempt) and retryable 5xx cases (500/502/503), plus a network-error retry test and a dedicated TestIsRetryableError unit test.

- Retry only known-transient errors (HTTP 429, 5xx, network errors); fail fast on deterministic errors (marshaling, request build, decode, other 4xx) instead of retrying them. - Correct backoff schedule: bound by Steps (7 attempts, 1/2/4/8/16/32m, ~63m worst case) rather than Cap, since apimachinery's Cap also halts retries early once reached. - Make retry tests deterministic (exact attempt counts) and add explicit non-retryable (400/404/409/422) and retryable 5xx cases, plus a network-error and isRetryableError unit test.

Copilot AI review requested due to automatic review settings June 17, 2026 12:11

Copilot AI reviewed Jun 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(prow-job-executor): retry SubmitJob on transient 429/5xx#251

fix(prow-job-executor): retry SubmitJob on transient 429/5xx#251
raelga wants to merge 2 commits into
Azure:mainfrom
raelga:raelga/aroslsre-1168-submitjob-retry

raelga commented Jun 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

raelga Jun 17, 2026

Uh oh!

raelga Jun 17, 2026

Uh oh!

raelga Jun 17, 2026

Uh oh!

raelga Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		wantAttempts: -1, // retried multiple times; exact count is backoff/k8s-version dependent
		},

Conversation

raelga commented Jun 17, 2026

Summary

Changes

Testing

Notes for reviewers

Observed failure

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

raelga Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

raelga Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

raelga Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

raelga Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants