
fix: suppress APM error events for receive-loop cancellations during shutdown (5.7.5) #116

Open
ecofrankie wants to merge 4 commits into master from teams/core/209/task/446605-apm-filter


ecofrankie commented Jun 15, 2026

Collaborator

Summary

Eliminates TaskCanceledException / OperationCanceledException APM error documents generated during pod graceful shutdown.

Root cause chain

5.7.3 fixed the LogError suppression in ReceiverWrapper: the guard checked IsCancellationRequested, which is always false when the Azure SDK fires ProcessErrorAsync with CancellationToken.None.

5.7.4 added ICancellationAwareTransactionManager; ApmTransactionManager sets Outcome = Success. This fixed the transaction error-rate metric but not the error event documents.

5.7.5-preview1 added an Agent.AddFilter(IError) filter, registered lazily in OnReceiveCancelled(). It still lost the race during pod shutdown. Confirmed in production: 34 errors at 10:19 UTC on a pod already running preview1.

5.7.5-preview2 moved filter registration to the ApmTransactionManager constructor (eager, at app startup). It still failed in production at 11:56 UTC.

Root cause identified: Elastic APM auto-instruments Azure SDK activities as "AzureServiceBus RECEIVE" transactions via DiagnosticSource. The Azure SDK ends its Activity before calling ProcessErrorAsync, so Agent.Tracer.CurrentTransaction is already null when OnReceiveCancelled() runs — the transaction ID is never added to _cancelledTransactionIds and the filter passes the error through.

5.7.5 (this PR) adds a second filter condition: suppress TaskCanceledException/OperationCanceledException errors whose Culprit originates in AmqpReceiver. After switching to WebSockets transport, this code path only fires during pod graceful shutdown.
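
For reviewers, the two filter conditions described above can be sketched as a pure predicate. This is an illustrative sketch, not the code in this PR: the class name, the method signature, the full exception type names, and the sample culprit string are assumptions, and the wiring into Agent.AddFilter is left out so the snippet has no dependency on the Elastic APM package.

```csharp
using System;
using System.Collections.Concurrent;

// IDs that OnReceiveCancelled() managed to record (hypothetical sample data).
var cancelled = new ConcurrentDictionary<string, byte>();
cancelled.TryAdd("tx-1", 0);

const string Tce = "System.Threading.Tasks.TaskCanceledException";

// Condition 1: tracked transaction ID, any exception type.
Console.WriteLine(ShutdownErrorFilterSketch.ShouldSuppress(
    "tx-1", "System.InvalidOperationException", "MyHandler", cancelled));      // True
// Condition 2: untracked ID, cancellation raised on the AmqpReceiver path.
Console.WriteLine(ShutdownErrorFilterSketch.ShouldSuppress(
    "tx-auto", Tce, "Azure.Messaging.ServiceBus.Amqp.AmqpReceiver", cancelled)); // True
// Cancellation from application code is still reported.
Console.WriteLine(ShutdownErrorFilterSketch.ShouldSuppress(
    "tx-auto", Tce, "MyHandler", cancelled));                                  // False

// Hypothetical stand-in for the filter predicate; the real signature may differ.
static class ShutdownErrorFilterSketch
{
    // Assumed values: the PR abbreviates the type names as "TCE" / "OCE".
    const string Tce = "System.Threading.Tasks.TaskCanceledException";
    const string Oce = "System.OperationCanceledException";
    const string AmqpReceiverCulprit = "AmqpReceiver";

    public static bool ShouldSuppress(
        string? transactionId, string? exceptionType, string? culprit,
        ConcurrentDictionary<string, byte> cancelledTransactionIds)
    {
        // Condition 1: the error belongs to a transaction that was tracked as cancelled.
        if (transactionId != null && cancelledTransactionIds.ContainsKey(transactionId))
            return true;

        // Condition 2: a cancellation exception whose culprit points at AmqpReceiver.
        // This covers auto-instrumented transactions whose ID was never tracked.
        return (exceptionType is Tce or Oce)
            && culprit != null
            && culprit.Contains(AmqpReceiverCulprit, StringComparison.Ordinal);
    }
}
```

In a real filter, a true result would correspond to returning null from the delegate passed to Agent.AddFilter so the error event is dropped.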

Changes

Commit 1 — constructor eager registration

  • Ev.ServiceBus.Apm/ApmTransactionManager.cs — constructor calls RegisterShutdownErrorFilter() eagerly; OnReceiveCancelled() keeps a fallback call; _cancelledTransactionIds soft-capped at 1000 entries; null guard on CurrentTransaction

Commit 2 — culprit-based suppression for auto-instrumented transactions

  • Ev.ServiceBus.Apm/ApmTransactionManager.cs — second filter condition: suppress TaskCanceledException/OperationCanceledException where Culprit contains "AmqpReceiver", targeting the auto-instrumented "AzureServiceBus RECEIVE" transactions whose Activity ends before ProcessErrorAsync fires
  • docs/CHANGELOG.md — 5.7.5 entry updated with both fixes

Note: ReceiverWrapperTests.cs tests were added in PR #115 and are already on master.

ecofrankie requested a review from benspeth June 15, 2026 08:22
ecofrankie force-pushed the teams/core/209/task/446605-apm-filter branch 3 times, most recently from 8f47415 to 8ffdb1a June 16, 2026 13:27
…shutdown (5.7.5)

Setting Outcome=Success (5.7.4) was insufficient: Elastic APM captures error
events at the DiagnosticSource level before ReceiverWrapper runs, so the error
document was already queued regardless of the outcome override.

Registers a one-time Agent.AddFilter(IError) that drops error events whose
TransactionId matches a cancelled-receive transaction, preventing them from
reaching the APM server.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
ecofrankie force-pushed the teams/core/209/task/446605-apm-filter branch 2 times, most recently from 8cc09ec to 49ad9bc June 17, 2026 13:26
…M errors during shutdown (5.7.5)

Root cause: Elastic APM creates "AzureServiceBus RECEIVE" transactions via DiagnosticSource
auto-instrumentation. The Azure SDK ends its Activity before firing ProcessErrorAsync, so
Agent.Tracer.CurrentTransaction is null in OnReceiveCancelled() — the TX ID is never tracked
and the filter passes the error through. Fix: add culprit-based suppression for
TaskCanceledException on the AmqpReceiver path (post-WebSockets fix, only fires on shutdown).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
ecofrankie force-pushed the teams/core/209/task/446605-apm-filter branch from 49ad9bc to fba02a4 June 17, 2026 13:39
Robert Karp and others added 2 commits June 17, 2026 16:18
…cap comment, filter tests

- Replace ContainsKey with TryRemove in the filter so matched transaction IDs are consumed
  on first use, keeping the dictionary lean and avoiding unnecessary cap pressure
- Extract 'AmqpReceiver' magic string to private const AmqpReceiverCulprit with comment
  documenting the Azure SDK source and why the culprit path is safe post-WebSockets fix
- Extract ShouldSuppressError as an internal static method for unit testability
- Add inline comment at the CancelledTransactionIdCap guard explaining the culprit fallback
- Add InternalsVisibleTo("Ev.ServiceBus.UnitTests") to Ev.ServiceBus.Apm
- Add ApmTransactionManagerTests: 8 tests covering both filter paths, the TryRemove
  consume-once behaviour, and non-suppressed exception types

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
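
The consume-once behaviour from the ContainsKey to TryRemove change can be illustrated in isolation. A minimal sketch, assuming the tracked IDs live in a ConcurrentDictionary<string, byte> (the value type and the sample ID are assumptions, not taken from the diff):

```csharp
using System;
using System.Collections.Concurrent;

var cancelled = new ConcurrentDictionary<string, byte>();
cancelled.TryAdd("tx-1", 0);

// ContainsKey would report a match and leave the entry behind.
// TryRemove reports the match and removes the entry in one atomic step,
// so each tracked ID suppresses one error and then stops counting towards the cap.
bool firstMatch = cancelled.TryRemove("tx-1", out _);
bool secondMatch = cancelled.TryRemove("tx-1", out _);

Console.WriteLine($"{firstMatch} {secondMatch} {cancelled.Count}"); // True False 0
```

One consequence worth keeping in mind: a second error event carrying the same transaction ID is no longer matched by the ID path and would have to be caught by the culprit condition instead.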
…eeded test

- Add explicit parentheses around the is-pattern in ShouldSuppressError to document
  operator precedence intent: (exceptionType is "TCE" or "OCE") && culprit.Contains(...)
- Document the non-atomic Count+TryAdd in OnReceiveCancelled as an intentional soft cap
- Add ShouldSuppressError_WhenCapExceeded_CulpritPathStillSuppresses test: seeds 1000
  entries to simulate cap exhaustion, then verifies the culprit path still suppresses
  the error for an untracked transaction ID

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
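
The soft cap described in this commit can be sketched as follows. The class and method names are hypothetical, and the cap value of 1000 is the one quoted in the PR description; the real OnReceiveCancelled() may be structured differently.

```csharp
using System;
using System.Collections.Concurrent;

// Single-threaded demo: attempts beyond the cap are rejected.
for (var i = 0; i < 1005; i++)
    CancelledTransactionTracker.Track($"tx-{i}");

Console.WriteLine(CancelledTransactionTracker.Count);        // 1000
Console.WriteLine(CancelledTransactionTracker.Track(null));  // False

// Hypothetical stand-in for the tracking done in OnReceiveCancelled().
static class CancelledTransactionTracker
{
    public const int CancelledTransactionIdCap = 1000;
    static readonly ConcurrentDictionary<string, byte> Ids = new();

    public static int Count => Ids.Count;

    public static bool Track(string? transactionId)
    {
        // Null guard: CurrentTransaction can already be null when the callback runs.
        if (transactionId is null) return false;

        // Soft cap: Count and TryAdd are two separate operations, so concurrent
        // callers may overshoot the cap slightly. IDs rejected here are left to
        // the culprit-based condition in the filter.
        if (Ids.Count >= CancelledTransactionIdCap) return false;

        return Ids.TryAdd(transactionId, 0);
    }
}
```

Leaving the check non-atomic avoids taking a lock on the cancellation path; the trade-off is that the dictionary size is bounded only approximately under concurrency.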