WIP feat(DRIVERS-3239): add exponential backoff in operation retry loop #4806
Closed
Commits (7)
8e1da84 add token bucket (baileympearson)
d960f8a token bucket comments (baileympearson)
5ac5dbd hack (baileympearson)
4327b3d sync yml tests (baileympearson)
410fb45 squash (baileympearson)
0363694 WIP (baileympearson)
7ba7426 refactor + retry writes error logic (baileympearson)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,5 @@ | ||
| import { setTimeout } from 'node:timers/promises'; | ||
|
|
||
| import { MIN_SUPPORTED_SNAPSHOT_READS_WIRE_VERSION } from '../cmap/wire_protocol/constants'; | ||
| import { | ||
| isRetryableReadError, | ||
|
|
@@ -10,6 +12,7 @@ import { | |
| MongoInvalidArgumentError, | ||
| MongoNetworkError, | ||
| MongoNotConnectedError, | ||
| MongoOperationTimeoutError, | ||
| MongoRuntimeError, | ||
| MongoServerError, | ||
| MongoTransactionError, | ||
|
|
@@ -26,9 +29,16 @@ import { | |
| import type { Topology } from '../sdam/topology'; | ||
| import type { ClientSession } from '../sessions'; | ||
| import { TimeoutContext } from '../timeout'; | ||
| import { abortable, maxWireVersion, supportsRetryableWrites } from '../utils'; | ||
| import { RETRY_COST, TOKEN_REFRESH_RATE } from '../token_bucket'; | ||
| import { | ||
| abortable, | ||
| ExponentialBackoffProvider, | ||
| maxWireVersion, | ||
| supportsRetryableWrites | ||
| } from '../utils'; | ||
| import { AggregateOperation } from './aggregate'; | ||
| import { AbstractOperation, Aspect } from './operation'; | ||
| import { RunCommandOperation } from './run_command'; | ||
|
|
||
| const MMAPv1_RETRY_WRITES_ERROR_CODE = MONGODB_ERROR_CODES.IllegalOperation; | ||
| const MMAPv1_RETRY_WRITES_ERROR_MESSAGE = | ||
|
|
@@ -50,7 +60,7 @@ type ResultTypeFromOperation<TOperation extends AbstractOperation> = ReturnType< | |
| * The expectation is that this function: | ||
| * - Connects the MongoClient if it has not already been connected, see {@link autoConnect} | ||
| * - Creates a session if none is provided and cleans up the session it creates | ||
| * - Tries an operation and retries under certain conditions, see {@link tryOperation} | ||
| * - Tries an operation and retries under certain conditions, see {@link executeOperationWithRetries} | ||
| * | ||
| * @typeParam T - The operation's type | ||
| * @typeParam TResult - The type of the operation's result, calculated from T | ||
|
|
@@ -120,7 +130,7 @@ export async function executeOperation< | |
| }); | ||
|
|
||
| try { | ||
| return await tryOperation(operation, { | ||
| return await executeOperationWithRetries(operation, { | ||
| topology, | ||
| timeoutContext, | ||
| session, | ||
|
|
@@ -184,7 +194,10 @@ type RetryOptions = { | |
| * | ||
| * @param operation - The operation to execute | ||
| * */ | ||
| async function tryOperation<T extends AbstractOperation, TResult = ResultTypeFromOperation<T>>( | ||
| async function executeOperationWithRetries< | ||
|
Review comment (PR author): Better name, IMO. |
||
| T extends AbstractOperation, | ||
| TResult = ResultTypeFromOperation<T> | ||
| >( | ||
| operation: T, | ||
| { topology, timeoutContext, session, readPreference }: RetryOptions | ||
| ): Promise<TResult> { | ||
|
|
@@ -233,33 +246,117 @@ async function tryOperation<T extends AbstractOperation, TResult = ResultTypeFro | |
| session.incrementTransactionNumber(); | ||
| } | ||
|
|
||
| const maxTries = willRetry ? (timeoutContext.csotEnabled() ? Infinity : 2) : 1; | ||
| let previousOperationError: MongoError | undefined; | ||
| const deprioritizedServers = new DeprioritizedServers(); | ||
| const backoffDelayProvider = new ExponentialBackoffProvider( | ||
| 10_000, // MAX_BACKOFF | ||
| 100, // base backoff | ||
| 2 // backoff rate | ||
| ); | ||
|
|
||
| let maxAttempts = | ||
| operation.maxAttempts ?? (willRetry ? (timeoutContext.csotEnabled() ? Infinity : 2) : 1); | ||
|
|
||
| const shouldRetry = | ||
| (operation.hasAspect(Aspect.READ_OPERATION) && topology.s.options.retryReads) || | ||
| ((operation.hasAspect(Aspect.WRITE_OPERATION) || operation instanceof RunCommandOperation) && | ||
| topology.s.options.retryWrites); | ||
|
|
||
| let error: MongoError | null = null; | ||
|
|
||
| for ( | ||
| let attempt = 0; | ||
| attempt < maxAttempts; | ||
| attempt++ | ||
| ) { | ||
|
|
||
| operation.server = server; | ||
|
|
||
| try { | ||
| const isRetry = attempt > 0; | ||
|
|
||
| try { | ||
| const result = await server.command(operation, timeoutContext); | ||
| topology.tokenBucket.deposit( | ||
| isRetry | ||
| ? // on successful retry, deposit the retry cost + the refresh rate. | ||
| TOKEN_REFRESH_RATE + RETRY_COST | ||
| : // otherwise, just deposit the refresh rate. | ||
| TOKEN_REFRESH_RATE | ||
| ); | ||
| return operation.handleOk(result); | ||
| } catch (error) { | ||
| return operation.handleError(error); | ||
| } | ||
| } catch (operationError) { | ||
| // Should never happen, but if it does, propagate the error. | ||
| if (!(operationError instanceof MongoError)) throw operationError; | ||
|
|
||
| if (!operationError.hasErrorLabel(MongoErrorLabel.SystemOverloadedError)) { | ||
| // if an operation fails with an error that does not carry the SystemOverloadedError label, deposit RETRY_COST tokens. | ||
| topology.tokenBucket.deposit(RETRY_COST); | ||
| } | ||
|
|
||
| if (error == null) { | ||
| error = operationError; | ||
| } else { | ||
| if (!operationError.hasErrorLabel(MongoErrorLabel.NoWritesPerformed)) { | ||
| error = operationError; | ||
| } | ||
| } | ||
|
|
||
| for (let tries = 0; tries < maxTries; tries++) { | ||
| if (previousOperationError) { | ||
| if (hasWriteAspect && previousOperationError.code === MMAPv1_RETRY_WRITES_ERROR_CODE) { | ||
| if (hasWriteAspect && operationError.code === MMAPv1_RETRY_WRITES_ERROR_CODE) { | ||
| throw new MongoServerError({ | ||
| message: MMAPv1_RETRY_WRITES_ERROR_MESSAGE, | ||
| errmsg: MMAPv1_RETRY_WRITES_ERROR_MESSAGE, | ||
| originalError: previousOperationError | ||
| originalError: operationError | ||
| }); | ||
| } | ||
|
|
||
| if (operation.hasAspect(Aspect.COMMAND_BATCHING) && !operation.canRetryWrite) { | ||
| throw previousOperationError; | ||
| // prepare for retry | ||
| const isRetryable = | ||
| // bulk write commands are retryable if all operations in the batch are retryable | ||
| (operation.hasAspect(Aspect.COMMAND_BATCHING) && operation.canRetryWrite) || | ||
| // if we have a retryable read or write operation, we can retry | ||
| (!operation.hasAspect(Aspect.COMMAND_BATCHING) && hasWriteAspect && willRetryWrite && isRetryableWriteError(operationError)) || | ||
| (hasReadAspect && willRetryRead && isRetryableReadError(operationError)) || | ||
| // if we have a retryable, system overloaded error, we can retry | ||
| (operationError.hasErrorLabel(MongoErrorLabel.SystemOverloadedError) && | ||
| operationError.hasErrorLabel(MongoErrorLabel.RetryableError)); | ||
|
|
||
| if (!isRetryable) throw error; | ||
|
|
||
| maxAttempts = shouldRetry && operationError.hasErrorLabel(MongoErrorLabel.SystemOverloadedError) | ||
| ? 6 | ||
| : maxAttempts; | ||
| if (attempt >= maxAttempts) { | ||
| throw error; | ||
| } | ||
|
|
||
| if (hasWriteAspect && !isRetryableWriteError(previousOperationError)) | ||
| throw previousOperationError; | ||
| // safe to retry - reset timeout context, apply backoff if necessary and re-run server selection | ||
|
|
||
| // Reset timeouts | ||
| timeoutContext.clear(); | ||
|
|
||
| if (hasReadAspect && !isRetryableReadError(previousOperationError)) { | ||
| throw previousOperationError; | ||
| if (operationError.hasErrorLabel(MongoErrorLabel.SystemOverloadedError)) { | ||
| const delayMS = backoffDelayProvider.getNextBackoffDuration(); | ||
|
|
||
| // if the delay would exhaust the CSOT timeout, short-circuit. | ||
| if (timeoutContext.csotEnabled() && delayMS > timeoutContext.remainingTimeMS) { | ||
| // TODO: is this the right error to throw? | ||
| throw new MongoOperationTimeoutError( | ||
| `MongoDB SystemOverload exponential backoff would exceed timeoutMS deadline: remaining CSOT deadline=${timeoutContext.remainingTimeMS}, backoff delayMS=${delayMS}`, | ||
| { | ||
| cause: error | ||
| } | ||
| ); | ||
| } | ||
|
|
||
| if (!topology.tokenBucket.consume(RETRY_COST)) { | ||
| throw error; | ||
| } | ||
|
|
||
| await setTimeout(delayMS); | ||
| } | ||
|
|
||
| if ( | ||
| previousOperationError instanceof MongoNetworkError && | ||
| operationError instanceof MongoNetworkError && | ||
| operation.hasAspect(Aspect.CURSOR_CREATING) && | ||
| session != null && | ||
| session.isPinned && | ||
|
|
@@ -268,6 +365,8 @@ async function tryOperation<T extends AbstractOperation, TResult = ResultTypeFro | |
| session.unpin({ force: true, forceClear: true }); | ||
| } | ||
|
|
||
| deprioritizedServers.add(server.description); | ||
|
|
||
| server = await topology.selectServer(selector, { | ||
| session, | ||
| operationName: operation.commandName, | ||
|
|
@@ -280,40 +379,13 @@ async function tryOperation<T extends AbstractOperation, TResult = ResultTypeFro | |
| 'Selected server does not support retryable writes' | ||
| ); | ||
| } | ||
| } | ||
|
|
||
| operation.server = server; | ||
|
|
||
| try { | ||
| // If tries > 0 and we are command batching we need to reset the batch. | ||
| if (tries > 0 && operation.hasAspect(Aspect.COMMAND_BATCHING)) { | ||
| // We are preparing a retry here; if we are command batching we need to reset the batch. | ||
| if (operation.hasAspect(Aspect.COMMAND_BATCHING)) { | ||
| operation.resetBatch(); | ||
| } | ||
|
|
||
| try { | ||
| const result = await server.command(operation, timeoutContext); | ||
| return operation.handleOk(result); | ||
| } catch (error) { | ||
| return operation.handleError(error); | ||
| } | ||
| } catch (operationError) { | ||
| if (!(operationError instanceof MongoError)) throw operationError; | ||
| if ( | ||
| previousOperationError != null && | ||
| operationError.hasErrorLabel(MongoErrorLabel.NoWritesPerformed) | ||
| ) { | ||
| throw previousOperationError; | ||
| } | ||
| deprioritizedServers.add(server.description); | ||
| previousOperationError = operationError; | ||
|
|
||
| // Reset timeouts | ||
| timeoutContext.clear(); | ||
| } | ||
| } | ||
|
|
||
| throw ( | ||
| previousOperationError ?? | ||
| new MongoRuntimeError('Tried to propagate retryability error, but no error was found.') | ||
| ); | ||
| throw ( | ||
| error ?? | ||
| new MongoRuntimeError('Tried to propagate retryability error, but no error was found.') | ||
| ); | ||
| } | ||
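The diff above uses `ExponentialBackoffProvider` and `topology.tokenBucket`, but neither implementation appears in this file. To make the retry gating easier to follow, here is a minimal sketch of what such helpers might look like. The class names `BackoffSketch` and `TokenBucketSketch`, the absence of jitter, and the capacity cap are all assumptions made for illustration; the constructor argument order (max, base, rate) and the `getNextBackoffDuration`, `consume`, and `deposit` method names are taken from the call sites in the diff.

```typescript
// Hypothetical stand-ins. The PR's real helpers live in files not shown here,
// so everything below is an assumption about their behavior.

/** Yields base * rate^n, capped at max, for n = 0, 1, 2, ... (no jitter). */
class BackoffSketch {
  private attempt = 0;
  private readonly maxMS: number;
  private readonly baseMS: number;
  private readonly rate: number;

  constructor(maxMS: number, baseMS: number, rate: number) {
    this.maxMS = maxMS;
    this.baseMS = baseMS;
    this.rate = rate;
  }

  getNextBackoffDuration(): number {
    const delay = Math.min(this.maxMS, this.baseMS * Math.pow(this.rate, this.attempt));
    this.attempt += 1;
    return delay;
  }
}

/** A counter of retry tokens that starts full and never exceeds its capacity. */
class TokenBucketSketch {
  private tokens: number;
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
    this.tokens = capacity;
  }

  /** Takes `cost` tokens and returns true, or takes nothing and returns false. */
  consume(cost: number): boolean {
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }

  /** Returns tokens to the bucket, clamped to the capacity. */
  deposit(amount: number): void {
    this.tokens = Math.min(this.capacity, this.tokens + amount);
  }

  get available(): number {
    return this.tokens;
  }
}

// Same arguments as the diff: 10_000 ms max, 100 ms base, rate of 2.
const backoff = new BackoffSketch(10_000, 100, 2);
const delays = Array.from({ length: 9 }, () => backoff.getNextBackoffDuration());
console.log(delays); // [100, 200, 400, 800, 1600, 3200, 6400, 10000, 10000]
```

Under these assumptions, a retry of an overloaded operation proceeds only if `consume(RETRY_COST)` succeeds, and it then waits for the next backoff duration. Successful operations refill the bucket through `deposit`, so sustained overload drains the tokens and stops further retries.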
Edge case: if we encounter a network error (such as a failCommand with closeConnection=true), we never get a server response to update the session with, but we still need to update the session's transaction state if the session is in a transaction.