Summary
- Context: The
RateLimitService coordinates provider availability using persistent state in RateLimitState and in-memory state in ProviderCircuitState to handle API rate limits (HTTP 429).
- Bug:
RateLimitState applies an extremely aggressive exponential backoff using Duration.ofHours starting from the second consecutive failure, while RateLimitHeaderParser incorrectly picks the minimum reset time when multiple headers (requests vs. tokens) are provided.
- Actual vs. expected: A second consecutive rate limit hit causes a 2-hour lockout (plus reset time), whereas it should apply a much smaller backoff (e.g., seconds or minutes); additionally, the parser picks the shortest wait time among multiple headers instead of the longest.
- Impact: The system suffers from significant availability issues, where a provider can be disabled for hours after just two normal rate-limit events, and premature retries caused by the parser bug increase the likelihood of triggering this severe lockout.
Code with bug
1. Excessive Backoff in RateLimitState.java
// src/main/java/com/williamcallahan/javachat/service/RateLimitState.java
if (failures > 1) {
Duration additionalBackoff = Duration.ofHours((long) Math.pow(2, failures - 1)); // <-- BUG 🔴 [Uses hours instead of seconds/minutes; 2nd failure = 2hr lockout]
Duration maxBackoff = Duration.ofDays(7); // Never back off more than a week
if (additionalBackoff.compareTo(maxBackoff) > 0) {
additionalBackoff = maxBackoff;
}
state.rateLimitedUntil = state.rateLimitedUntil.plus(additionalBackoff);
2. Incorrect Minimum Reset Selection in RateLimitHeaderParser.java
// src/main/java/com/williamcallahan/javachat/service/RateLimitHeaderParser.java
long candidateSeconds = minPositive( // <-- BUG 🔴 [Should use max() to ensure all limits have reset before retrying]
firstHeaderValue(headers, "x-ratelimit-reset-requests")
.map(this::parseDurationSeconds)
.orElse(0L),
firstHeaderValue(headers, "x-ratelimit-reset-tokens")
.map(this::parseDurationSeconds)
.orElse(0L),
// ...
Evidence
Reproduction: Excessive Backoff
I created a test ExcessiveBackoffTest.java that records two consecutive failures and checks the remaining wait time. The test confirms that after the second failure, the lockout duration is over 2 hours.
@Test
void recordRateLimit_appliesExcessiveBackoffOnSecondFailure() {
Instant now = Instant.now();
// First failure sets state.consecutiveFailures to 1
rateLimitState.recordRateLimit(PROVIDER_NAME, now.plusSeconds(1), "1m");
// Second failure sets state.consecutiveFailures to 2, triggering backoff
rateLimitState.recordRateLimit(PROVIDER_NAME, now.plusSeconds(1), "1m");
Duration waitTime2 = rateLimitState.getRemainingWaitTime(PROVIDER_NAME);
// Actual: ~7201s (2 hours + 1s)
assertTrue(waitTime2.toHours() >= 2);
}
Reproduction: Header Parser Bug
I created RateLimitHeaderParserBugTest.java to verify the reset time selection. When provided with a 1s request reset and a 60s token reset, the parser incorrectly returned a 1s reset.
@Test
void parseResetInstant_shouldPickMaxResetTime() {
Headers headers = Headers.builder()
.put("x-ratelimit-reset-requests", "1s")
.put("x-ratelimit-reset-tokens", "60s")
.build();
Optional<Instant> resetInstant = parser.parseResetInstant(headers);
long secondsUntilReset = Duration.between(Instant.now(), resetInstant.get()).getSeconds();
// Actual: 1s (due to minPositive)
// Expected: 60s
assertTrue(secondsUntilReset < 10);
}
Why has this bug gone undetected?
These bugs have likely gone undetected because 429 errors are infrequent under low load or with higher-tier API keys. Furthermore, many users might assume a long lockout is a strict policy from the API provider (OpenAI/GitHub) rather than a bug in the local throttling logic. The interplay between the two bugs (one causing premature retries and the other penalizing them heavily) creates a "death spiral" that only becomes apparent when a provider's limits are actually reached.
Recommended fix
- In
RateLimitState.java, change Duration.ofHours to a more appropriate unit such as Duration.ofSeconds or Duration.ofMinutes, matching the in-memory backoff logic in ProviderCircuitState.
- In
RateLimitHeaderParser.java, replace the minPositive helper with a maxPositive logic to ensure the system waits until ALL relevant rate limits have reset before attempting a new request.
Summary
RateLimitServicecoordinates provider availability using persistent state inRateLimitStateand in-memory state inProviderCircuitStateto handle API rate limits (HTTP 429).RateLimitStateapplies an extremely aggressive exponential backoff usingDuration.ofHoursstarting from the second consecutive failure, whileRateLimitHeaderParserincorrectly picks the minimum reset time when multiple headers (requests vs. tokens) are provided.Code with bug
1. Excessive Backoff in
RateLimitState.java2. Incorrect Minimum Reset Selection in
RateLimitHeaderParser.javaEvidence
Reproduction: Excessive Backoff
I created a test
ExcessiveBackoffTest.javathat records two consecutive failures and checks the remaining wait time. The test confirms that after the second failure, the lockout duration is over 2 hours.Reproduction: Header Parser Bug
I created
RateLimitHeaderParserBugTest.javato verify the reset time selection. When provided with a 1s request reset and a 60s token reset, the parser incorrectly returned a 1s reset.Why has this bug gone undetected?
These bugs have likely gone undetected because 429 errors are infrequent under low load or with higher-tier API keys. Furthermore, many users might assume a long lockout is a strict policy from the API provider (OpenAI/GitHub) rather than a bug in the local throttling logic. The interplay between the two bugs (one causing premature retries and the other penalizing them heavily) creates a "death spiral" that only becomes apparent when a provider's limits are actually reached.
Recommended fix
RateLimitState.java, changeDuration.ofHoursto a more appropriate unit such asDuration.ofSecondsorDuration.ofMinutes, matching the in-memory backoff logic inProviderCircuitState.RateLimitHeaderParser.java, replace theminPositivehelper with amaxPositivelogic to ensure the system waits until ALL relevant rate limits have reset before attempting a new request.