
fix(cursor): judge cursor staleness by write-time, not message-time #96

Open
brandwe wants to merge 1 commit into main from fix/cursor-staleness-uses-write-time

brandwe (Member) commented Jun 27, 2026

Problem

The background Teams poll re-delivered weeks-old messages as if they were new — a flood on every MCP restart. Diagnosis showed ~50 watched-chat cursors "frozen" at late-May timestamps while a freshly-sent message still arrived correctly.

Root cause

chat_cursors.is_stale() measured cursor age from last_ts (the newest-message watermark) instead of last_written_at (when the cursor was actually persisted).

Any chat idle longer than the 24h cap therefore had a "stale" cursor even though it had just been written. _register_watched_chat discarded it and re-ran _bootstrap_chat on every restart — and _bootstrap_chat deliberately leaves the newest message unseen so it fires once. So each re-bootstrap re-pushed that chat's weeks-old newest message as if it were live. With ~50 idle chats and frequent restarts (amplified by the open MCP-disconnect issue) this produced a continuous flood of stale replays.

This is a follow-up to the original issue #17 cursor-persistence work — persistence was working fine; the staleness check keyed off the wrong field.
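
To make the failure mode concrete, here is a minimal sketch of the two ways of measuring cursor age. The `Cursor` dataclass, the `STALE_CAP` constant, and both function names are hypothetical stand-ins rather than the actual `chat_cursors` code; only the field names `last_ts` and `last_written_at` and the 24h cap come from this PR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STALE_CAP = timedelta(hours=24)  # the 24h cap described above (name is assumed)


@dataclass
class Cursor:  # hypothetical shape, for illustration only
    last_ts: datetime          # timestamp of the newest message seen
    last_written_at: datetime  # when the cursor was persisted


def is_stale_by_message_time(cursor: Cursor, now: datetime) -> bool:
    """Buggy variant: age is measured from the newest-message watermark."""
    return now - cursor.last_ts > STALE_CAP


def is_stale_by_write_time(cursor: Cursor, now: datetime) -> bool:
    """Fixed variant: age is measured from when the cursor was persisted."""
    return now - cursor.last_written_at > STALE_CAP


now = datetime(2026, 6, 27, 12, 0, tzinfo=timezone.utc)
idle_chat = Cursor(
    last_ts=now - timedelta(weeks=4),            # chat has been idle for weeks
    last_written_at=now - timedelta(minutes=5),  # cursor saved just before restart
)

print(is_stale_by_message_time(idle_chat, now))  # True: cursor wrongly discarded
print(is_stale_by_write_time(idle_chat, now))    # False: cursor is rehydrated
```

The idle chat's cursor is only five minutes old on disk, yet the message-time check calls it stale purely because nobody has posted in the chat for four weeks.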

Fix

is_stale() now takes last_written_at. Both call sites pass the write timestamp:

  • mcp_server._register_watched_chat — the rehydrate-vs-bootstrap decision.
  • body_bootstrap._cursor_freshness — the cursors_stale telemetry, which was miscounting for the same reason (this caller was not in the original investigation; found by grep).

The 24h cap still does its real job: if the server was down longer than the cap, messages may have been missed and the seen-set is untrustworthy, so we re-baseline via a fresh bootstrap.
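
The rehydrate-vs-bootstrap decision can be sketched as follows. This is an illustration under assumptions, not the code in `mcp_server`: the `startup_action` helper, the `STALE_CAP` constant, and the exact `is_stale` signature (including the explicit `now` parameter) are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_CAP = timedelta(hours=24)  # assumed constant name for the 24h cap


def is_stale(last_written_at: datetime, now: datetime,
             cap: timedelta = STALE_CAP) -> bool:
    """A cursor is stale only if it was persisted longer ago than the cap."""
    return now - last_written_at > cap


def startup_action(last_written_at: Optional[datetime], now: datetime) -> str:
    """Hypothetical helper: decide what to do with a watched chat on restart."""
    if last_written_at is None:
        return "bootstrap"   # no persisted cursor at all
    if is_stale(last_written_at, now):
        return "bootstrap"   # server was down past the cap; seen-set untrusted
    return "rehydrate"       # cursor was written recently; trust it


now = datetime(2026, 6, 27, 12, 0, tzinfo=timezone.utc)
print(startup_action(now - timedelta(minutes=10), now))  # rehydrate
print(startup_action(now - timedelta(hours=30), now))    # bootstrap
print(startup_action(None, now))                         # bootstrap
```

How long the chat itself has been idle does not enter the decision; only the time since the last cursor write does.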

Tests (TDD)

  • New regression test test_idle_chat_recent_write_rehydrates_despite_old_last_ts: red before the fix, green after.
  • Corrected two existing tests that encoded the buggy behavior. They saved cursors with an old last_ts, but save_cursor stamps last_written_at=now, so under the correct logic those cursors were actually fresh. The tests now craft genuinely stale cursors by writing an old last_written_at directly.
  • Full suite: 1527 passed, 8 skipped. ruff check . clean.

Deployment note

Takes effect on the next entrabot MCP restart. Idle chats will then rehydrate (judged by write-time) instead of re-bootstrapping, so the perpetual every-restart flood stops. Worst case, a chat whose cursor was saved mid-bootstrap fires its single newest message one last time, then goes quiet — correct-by-design, not the bug.

🤖 Generated with Claude Code

The background Teams poll re-delivered weeks-old messages on every MCP
restart. Root cause: chat_cursors.is_stale() measured age from last_ts
(the newest-MESSAGE watermark) instead of last_written_at (when the
cursor was persisted).

Any chat idle longer than the 24h cap therefore had a "stale" cursor
even when it had just been written, so _register_watched_chat discarded
it and re-ran _bootstrap_chat on every restart. _bootstrap_chat
deliberately leaves the newest message unseen so it fires once -- so each
re-bootstrap re-pushed that chat's weeks-old newest message as if it were
live. With ~50 idle chats and frequent restarts (amplified by the open
MCP-disconnect issue) this produced a flood of stale replays.

Fix: is_stale() now takes last_written_at. Both call sites pass the write
timestamp:
  - mcp_server._register_watched_chat (rehydrate-vs-bootstrap decision)
  - body_bootstrap._cursor_freshness (cursors_stale telemetry, which was
    itself miscounting for the same reason)

The 24h cap still does its real job: if the server was actually down
longer than the cap, messages may have been missed and the seen-set is
untrustworthy, so we re-baseline via a fresh bootstrap.

TDD: added test_idle_chat_recent_write_rehydrates_despite_old_last_ts
(red before the fix, green after). Corrected two existing tests that
encoded the buggy behavior by crafting cursors with an old
last_written_at directly. Full suite: 1527 passed, ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>