fix(cursor): judge cursor staleness by write-time, not message-time#96
Open
brandwe wants to merge 1 commit into
Open
fix(cursor): judge cursor staleness by write-time, not message-time#96brandwe wants to merge 1 commit into
brandwe wants to merge 1 commit into
Conversation
The background Teams poll re-delivered weeks-old messages on every MCP
restart. Root cause: chat_cursors.is_stale() measured age from last_ts
(the newest-MESSAGE watermark) instead of last_written_at (when the
cursor was persisted).
Any chat idle longer than the 24h cap therefore had a "stale" cursor
even when it had just been written, so _register_watched_chat discarded
it and re-ran _bootstrap_chat on every restart. _bootstrap_chat
deliberately leaves the newest message unseen so it fires once -- so each
re-bootstrap re-pushed that chat's weeks-old newest message as if it were
live. With ~50 idle chats and frequent restarts (amplified by the open
MCP-disconnect issue) this produced a flood of stale replays.
Fix: is_stale() now takes last_written_at. Both call sites pass the write
timestamp:
- mcp_server._register_watched_chat (rehydrate-vs-bootstrap decision)
- body_bootstrap._cursor_freshness (cursors_stale telemetry, which was
itself miscounting for the same reason)
The 24h cap still does its real job: if the server was actually down
longer than the cap, messages may have been missed and the seen-set is
untrustworthy, so we re-baseline via a fresh bootstrap.
TDD: added test_idle_chat_recent_write_rehydrates_despite_old_last_ts
(red before the fix, green after). Corrected two existing tests that
encoded the buggy behavior by crafting cursors with an old
last_written_at directly. Full suite: 1527 passed, ruff clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The background Teams poll re-delivered weeks-old messages as if they were new — a flood on every MCP restart. Diagnosis showed ~50 watched-chat cursors "frozen" at late-May timestamps while a freshly-sent message still arrived correctly.
Root cause
chat_cursors.is_stale()measured cursor age fromlast_ts(the newest-message watermark) instead oflast_written_at(when the cursor was actually persisted).Any chat idle longer than the 24h cap therefore had a "stale" cursor even though it had just been written.
_register_watched_chatdiscarded it and re-ran_bootstrap_chaton every restart — and_bootstrap_chatdeliberately leaves the newest message unseen so it fires once. So each re-bootstrap re-pushed that chat's weeks-old newest message as if it were live. With ~50 idle chats and frequent restarts (amplified by the open MCP-disconnect issue) this produced a continuous flood of stale replays.This is a follow-up to the original issue #17 cursor-persistence work — persistence was working fine; the staleness check keyed off the wrong field.
Fix
is_stale()now takeslast_written_at. Both call sites pass the write timestamp:mcp_server._register_watched_chat— the rehydrate-vs-bootstrap decision.body_bootstrap._cursor_freshness— thecursors_staletelemetry, which was miscounting for the same reason (this caller was not in the original investigation; found by grep).The 24h cap still does its real job: if the server was down longer than the cap, messages may have been missed and the seen-set is untrustworthy, so we re-baseline via a fresh bootstrap.
Tests (TDD)
test_idle_chat_recent_write_rehydrates_despite_old_last_ts— red before the fix, green after.last_tsbutsave_cursorstampslast_written_at=now, so under the correct logic they were actually fresh. Rewritten to craft genuinely-stale cursors by writing an oldlast_written_atdirectly.ruff check .clean.Deployment note
Takes effect on the next entrabot MCP restart. Idle chats will then rehydrate (judged by write-time) instead of re-bootstrapping, so the perpetual every-restart flood stops. Worst case, a chat whose cursor was saved mid-bootstrap fires its single newest message one last time, then goes quiet — correct-by-design, not the bug.
🤖 Generated with Claude Code