fix: suppress APM error events for receive-loop cancellations during shutdown (5.7.5)#116
Open
ecofrankie wants to merge 4 commits into
Open
fix: suppress APM error events for receive-loop cancellations during shutdown (5.7.5)#116ecofrankie wants to merge 4 commits into
ecofrankie wants to merge 4 commits into
Conversation
8f47415 to
8ffdb1a
Compare
…shutdown (5.7.5) Setting Outcome=Success (5.7.4) was insufficient: Elastic APM captures error events at the DiagnosticSource level before ReceiverWrapper runs, so the error document was already queued regardless of the outcome override. Registers a one-time Agent.AddFilter(IError) that drops error events whose TransactionId matches a cancelled-receive transaction, preventing them from reaching the APM server. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
8cc09ec to
49ad9bc
Compare
…M errors during shutdown (5.7.5) Root cause: Elastic APM creates "AzureServiceBus RECEIVE" transactions via DiagnosticSource auto-instrumentation. The Azure SDK ends its Activity before firing ProcessErrorAsync, so Agent.Tracer.CurrentTransaction is null in OnReceiveCancelled() — the TX ID is never tracked and the filter passes the error through. Fix: add culprit-based suppression for TaskCanceledException on the AmqpReceiver path (post-WebSockets fix, only fires on shutdown). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
49ad9bc to
fba02a4
Compare
…cap comment, filter tests
- Replace ContainsKey with TryRemove in the filter so matched transaction IDs are consumed
on first use, keeping the dictionary lean and avoiding unnecessary cap pressure
- Extract 'AmqpReceiver' magic string to private const AmqpReceiverCulprit with comment
documenting the Azure SDK source and why the culprit path is safe post-WebSockets fix
- Extract ShouldSuppressError as an internal static method for unit testability
- Add inline comment at the CancelledTransactionIdCap guard explaining the culprit fallback
- Add InternalsVisibleTo("Ev.ServiceBus.UnitTests") to Ev.ServiceBus.Apm
- Add ApmTransactionManagerTests: 8 tests covering both filter paths, the TryRemove
consume-once behaviour, and non-suppressed exception types
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…eeded test - Add explicit parentheses around the is-pattern in ShouldSuppressError to document operator precedence intent: (exceptionType is "TCE" or "OCE") && culprit.Contains(...) - Document the non-atomic Count+TryAdd in OnReceiveCancelled as an intentional soft cap - Add ShouldSuppressError_WhenCapExceeded_CulpritPathStillSuppresses test: seeds 1000 entries to simulate cap exhaustion, then verifies the culprit path still suppresses the error for an untracked transaction ID Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Eliminates
TaskCanceledException/OperationCanceledExceptionAPM error documents generated during pod graceful shutdown.Root cause chain
5.7.3 fixed the
LogErrorsuppression inReceiverWrapper(the guard checkedIsCancellationRequestedwhich is alwaysfalsewhen Azure SDK firesProcessErrorAsyncwithCancellationToken.None).5.7.4 added
ICancellationAwareTransactionManager—ApmTransactionManagersetsOutcome = Success. This fixed the transaction error-rate metric but not error event documents.5.7.5-preview1 added
Agent.AddFilter(IError)registered lazily inOnReceiveCancelled(). Still lost the race during pod shutdown. Confirmed in production: 34 errors at 10:19 UTC on a pod already running preview1.5.7.5-preview2 moved filter registration to the
ApmTransactionManagerconstructor (eager, at app startup). Still failed in production at 11:56 UTC.Root cause identified: Elastic APM auto-instruments Azure SDK activities as
"AzureServiceBus RECEIVE"transactions via DiagnosticSource. The Azure SDK ends itsActivitybefore callingProcessErrorAsync, soAgent.Tracer.CurrentTransactionis alreadynullwhenOnReceiveCancelled()runs — the transaction ID is never added to_cancelledTransactionIdsand the filter passes the error through.5.7.5 (this PR) adds a second filter condition: suppress
TaskCanceledException/OperationCanceledExceptionerrors whoseCulpritoriginates inAmqpReceiver. After switching to WebSockets transport, this code path only fires during pod graceful shutdown.Changes
Commit 1 — constructor eager registration
Ev.ServiceBus.Apm/ApmTransactionManager.cs— constructor callsRegisterShutdownErrorFilter()eagerly;OnReceiveCancelled()keeps a fallback call;_cancelledTransactionIdssoft-capped at 1000 entries; null guard onCurrentTransactionCommit 2 — culprit-based suppression for auto-instrumented transactions
Ev.ServiceBus.Apm/ApmTransactionManager.cs— second filter condition: suppressTaskCanceledException/OperationCanceledExceptionwhereCulpritcontains"AmqpReceiver", targeting the auto-instrumented"AzureServiceBus RECEIVE"transactions whose Activity ends beforeProcessErrorAsyncfiresdocs/CHANGELOG.md— 5.7.5 entry updated with both fixesNote:
ReceiverWrapperTests.cstests were added in PR #115 and are already on master.