Prevent message orphaning with checkpoint retry and recovery by damonbarry · Pull Request #7521 · Azure/iotedge

damonbarry · 2026-07-01T00:31:35Z

Problem

Messages can become orphaned in the store-and-forward queue when network failures occur during checkpoint commit. The message is successfully sent upstream, but the checkpoint offset fails to update in persistent storage due to a timeout. The cleanup processor cannot remove the message because it only deletes messages where offset > checkpointData.Offset. This results in the message staying in the queue indefinitely (or until TTL expiration), causing message loss in nested IoT Edge deployments.

Evidence: Edge Hub logs show messages stuck for 4+ minutes with repeated "Getting next batch...batch size 0" entries after a network failure during checkpoint.

Solution

Implemented a three-part architectural fix to prevent orphaning, enable recovery, and provide visibility:

1. Checkpoint Commit Retry (Prevention)

Added exponential backoff retry logic (3 attempts: 100ms, 200ms, 400ms delays)
Transient failures are retried automatically during checkpoint commit
If network recovers on retry attempt, checkpoint succeeds and cleanup proceeds normally

2. Connection Recovery Handler (Recovery)

Subscribed to CloudConnectionEstablished event in ConnectionManager
When connection is restored, immediately triggers message store cleanup instead of waiting for 30-minute timer
Allows retry of checkpoint commits when network recovers

3. Orphan Detection & Logging (Observability)

During each cleanup pass, identifies expired messages beyond the checkpoint offset
Logs diagnostic information: orphan count, checkpoint offset, message age
Enables troubleshooting and monitoring of stuck messages

Impact

✅ Prevents message orphaning during transient network failures
✅ Enables automatic recovery when connection is restored
✅ Provides diagnostic visibility for troubleshooting
✅ Comprehensive unit test coverage for all three components

Azure IoT Edge PR checklist:

This checklist is used to make sure that common guidelines for a pull request are followed.

General Guidelines and Best Practices

I have read the contribution guidelines.
Title of the pull request is clear and informative.
Description of the pull request includes a concise summary of the enhancement or bug fix.

Testing Guidelines

Pull request includes test coverage for the included changes.
Description of the pull request includes
- concise summary of tests added/modified
- local testing done.

Prevent message orphaning with checkpoint retry and recovery

fa33fbe

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Prevent message orphaning with checkpoint retry and recovery#7521

Prevent message orphaning with checkpoint retry and recovery#7521
damonbarry wants to merge 1 commit into
Azure:mainfrom
damonbarry:checkpoint-retry-recovery

damonbarry commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

damonbarry commented Jul 1, 2026

Problem

Solution

Impact

Azure IoT Edge PR checklist:

General Guidelines and Best Practices

Testing Guidelines

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant