
Prevent message orphaning with checkpoint retry and recovery#7521

Draft

damonbarry wants to merge 1 commit into Azure:main from damonbarry:checkpoint-retry-recovery

Conversation

@damonbarry

Member

Problem

Messages can become orphaned in the store-and-forward queue when network failures occur during checkpoint commit. The message is successfully sent upstream, but the checkpoint offset fails to update in persistent storage due to a timeout. The cleanup processor cannot remove the message because it only deletes messages where offset > checkpointData.Offset. This results in the message staying in the queue indefinitely (or until TTL expiration), causing message loss in nested IoT Edge deployments.

Evidence: Edge Hub logs show messages stuck for 4+ minutes with repeated "Getting next batch...batch size 0" entries after a network failure during checkpoint.

Solution

Implemented a three-part architectural fix to prevent orphaning, enable recovery, and provide visibility:

1. Checkpoint Commit Retry (Prevention)

  • Added exponential backoff retry logic (3 attempts: 100ms, 200ms, 400ms delays)
  • Transient failures are retried automatically during checkpoint commit
  • If network recovers on retry attempt, checkpoint succeeds and cleanup proceeds normally

2. Connection Recovery Handler (Recovery)

  • Subscribed to CloudConnectionEstablished event in ConnectionManager
  • When connection is restored, immediately triggers message store cleanup instead of waiting for 30-minute timer
  • Allows retry of checkpoint commits when network recovers

3. Orphan Detection & Logging (Observability)

  • During each cleanup pass, identifies expired messages beyond the checkpoint offset
  • Logs diagnostic information: orphan count, checkpoint offset, message age
  • Enables troubleshooting and monitoring of stuck messages

Impact

  • ✅ Prevents message orphaning during transient network failures
  • ✅ Enables automatic recovery when connection is restored
  • ✅ Provides diagnostic visibility for troubleshooting
  • ✅ Comprehensive unit test coverage for all three components

Azure IoT Edge PR checklist:

This checklist is used to make sure that common guidelines for a pull request are followed.

General Guidelines and Best Practices

  • I have read the contribution guidelines.
  • Title of the pull request is clear and informative.
  • Description of the pull request includes a concise summary of the enhancement or bug fix.

Testing Guidelines

  • Pull request includes test coverage for the included changes.
  • Description of the pull request includes
    • concise summary of tests added/modified
    • local testing done.
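The three mechanisms described above (checkpoint commit retry, cleanup on reconnect, orphan detection) modelled together in one runnable program, added here as an example; it is not code from this PR or from Edge Hub, and every type in it (ScriptedCheckpointStore, MessageStore, Checkpointer, ConnectionManager, CleanupProcessor) is a hypothetical stand-in. It assumes that cleanup deletes only messages at or below the committed checkpoint offset, and it reads "3 attempts" as three retries after the first try, which is what yields the 100 ms, 200 ms and 400 ms delays. The 4-minute orphan threshold is taken from the log evidence in the description; the backlog timestamps and the commit failure script are constructed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Constructed stand-in for the persisted checkpoint: each commit consumes one scripted outcome
// (false = timeout); once the script is exhausted, commits succeed.
sealed class ScriptedCheckpointStore
{
    readonly Queue<bool> outcomes;
    public long Offset { get; private set; } = -1;   // -1: nothing checkpointed yet
    public int Attempts { get; private set; }
    public ScriptedCheckpointStore(params bool[] script) => outcomes = new Queue<bool>(script);

    public Task CommitAsync(long offset)
    {
        Attempts++;
        if (outcomes.Count > 0 && !outcomes.Dequeue()) throw new TimeoutException("checkpoint commit timed out");
        Offset = offset;
        return Task.CompletedTask;
    }
}

sealed record StoredMessage(long Offset, DateTime EnqueuedUtc);

// Hypothetical store-and-forward queue. Assumed semantics: cleanup removes only messages at or below
// the committed checkpoint, so a commit that never lands strands messages that were already sent.
sealed class MessageStore
{
    readonly List<StoredMessage> messages = new();
    public int Count => messages.Count;
    public void Add(long offset, DateTime enqueuedUtc) => messages.Add(new StoredMessage(offset, enqueuedUtc));
    public int RemoveCheckpointed(long checkpoint) => messages.RemoveAll(m => m.Offset <= checkpoint);

    // Part 3: still beyond the checkpoint and older than maxAge => orphan candidate.
    public List<StoredMessage> FindOrphans(long checkpoint, TimeSpan maxAge, DateTime nowUtc) =>
        messages.Where(m => m.Offset > checkpoint && nowUtc - m.EnqueuedUtc > maxAge).ToList();
}

static class Checkpointer
{
    // Part 1: initial try plus up to `retries` retries, delays 100, 200, 400 ms.
    public static async Task<bool> CommitWithRetryAsync(ScriptedCheckpointStore store, long offset, int retries = 3, int baseDelayMs = 100)
    {
        for (int attempt = 0; ; attempt++)
        {
            try { await store.CommitAsync(offset); return true; }
            catch (TimeoutException) when (attempt < retries) { await Task.Delay(baseDelayMs << attempt); }
            catch (TimeoutException) { return false; }   // budget exhausted; the cleanup pass reports what is stranded
        }
    }
}

sealed class ConnectionManager
{
    public event Func<Task>? CloudConnectionEstablished;   // a single subscriber in this example
    public Task RaiseConnectionEstablishedAsync() => CloudConnectionEstablished?.Invoke() ?? Task.CompletedTask;
}

sealed class CleanupProcessor
{
    readonly MessageStore store;
    readonly ScriptedCheckpointStore checkpoints;
    public long PendingOffset { get; set; } = -1;   // highest offset confirmed sent upstream

    public CleanupProcessor(MessageStore store, ScriptedCheckpointStore checkpoints, ConnectionManager connections)
    {
        (this.store, this.checkpoints) = (store, checkpoints);
        connections.CloudConnectionEstablished += RunAsync;   // Part 2: run on reconnect instead of waiting for the 30-minute timer
    }

    public async Task<(int Removed, int Orphans)> RunAsync()
    {
        if (PendingOffset > checkpoints.Offset) await Checkpointer.CommitWithRetryAsync(checkpoints, PendingOffset);
        int removed = store.RemoveCheckpointed(checkpoints.Offset);
        var orphans = store.FindOrphans(checkpoints.Offset, TimeSpan.FromMinutes(4), DateTime.UtcNow);
        foreach (var o in orphans)   // diagnostic line: orphan offset, checkpoint offset, message age
            Console.WriteLine($"orphan: offset={o.Offset} checkpoint={checkpoints.Offset} ageMin={(DateTime.UtcNow - o.EnqueuedUtc).TotalMinutes:F1}");
        return (removed, orphans.Count);
    }
}

static class Program
{
    static async Task Main()
    {
        var old = DateTime.UtcNow - TimeSpan.FromMinutes(10);   // constructed: backlog enqueued 10 minutes ago
        var store = new MessageStore();
        for (long i = 0; i <= 2; i++) store.Add(i, old);
        // Script: two timeouts then success (run A); four timeouts (run B); success once reconnected.
        var checkpoints = new ScriptedCheckpointStore(false, false, true, false, false, false, false);
        var connections = new ConnectionManager();
        var cleanup = new CleanupProcessor(store, checkpoints, connections) { PendingOffset = 2 };
        var a = await cleanup.RunAsync();   // run A: the retry rescues the commit and offsets 0..2 are removed
        Console.WriteLine($"A: attempts={checkpoints.Attempts} checkpoint={checkpoints.Offset} removed={a.Removed} orphans={a.Orphans}");
        for (long i = 3; i <= 4; i++) store.Add(i, old);
        cleanup.PendingOffset = 4;
        var b = await cleanup.RunAsync();   // run B: every attempt times out; offsets 3..4 are reported, not deleted
        Console.WriteLine($"B: attempts={checkpoints.Attempts} checkpoint={checkpoints.Offset} removed={b.Removed} orphans={b.Orphans}");
        await connections.RaiseConnectionEstablishedAsync();   // the recovery handler re-runs cleanup at once
        Console.WriteLine($"reconnect: checkpoint={checkpoints.Offset} queued={store.Count}");
    }
}
```

Run A shows the retry alone recovering from transient timeouts; run B exhausts the budget, and instead of leaving the stranded messages invisible the pass logs each one against the current checkpoint offset; the reconnect call then commits and drains them without waiting for the periodic cleanup. One caveat for a real implementation: the cleanup pass blocks on the retry loop, so the retry budget should also be bounded in wall-clock time, and the orphan heuristic (beyond checkpoint and older than a threshold) will also flag a genuinely slow backlog, so the log line is a signal to investigate rather than proof of the checkpoint failure.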
