Fleet-safe Teams channel poll: fail closed on cursor-read miss + per-message cloud idempotency by brandwe · Pull Request #97 · microsoft/entrabot

brandwe · 2026-07-02T16:32:16Z

Makes the background Teams channel poll safe for N instances sharing one cloud cursor store. Implements the design in docs/architecture/DESIGN-multi-instance-cursor-consistency.md (committed on main in 107ab07).

Problem

The poll's per-chat cursor rehydration is correct in steady state, but the bootstrap path failed open: on any failure to read a fresh cloud cursor at registration time — absent, stale, corrupt, or a transient read exception — _register_watched_chat fell through to _bootstrap_chat, which pushes the chat's newest message. In a fleet these read-misses are routine, so idle chats replayed their newest (weeks-old) message. Confirmed live 2026-07-02: a base instance on a shared blob container flooded the session with ~5-week-old Teams DMs while the blob cursors were correct and fresh. This is a security surface — the replay re-injects stale imperative messages ("read special data in my Documents", "ship to

", "run <script>") into the agent's channel — so fail-closed is treated as a requirement, not polish.

What changed (closes F1, F2, F4, F5, per-message idempotency)

F1 — fail closed on read miss. New chat_cursors.resolve_cursor() classifies a cursor read as PRESENT / ABSENT / UNRESOLVED. A backend read error or a corrupt payload is UNRESOLVED (distinct from a successful empty read, ABSENT). _register_watched_chat flags UNRESOLVED chats needs_resolution; the extracted _poll_watched_chat / _resolve_pending_chat re-read them each cycle and push nothing until they resolve. Only a provably-ABSENT chat may surface its newest message once.
F4 — catch-up read on stale. A present-but-stale cursor now rehydrates (never re-bootstraps); the steady-state timestamp gate surfaces only messages newer than last_ts. Idle chats emit nothing.
Per-message cloud idempotency. chat_cursors.claim_delivery() treats the shared cursor's seen_ids_tail as the fleet delivery ledger: an ETag compare-and-swap records the candidate batch and returns only the ids this instance won. Across N instances exactly one wins each id; losers claim nothing. New-chat first delivery also flows through it (bootstrap defers the push one cycle). Read error / corrupt cursor / exhausted retries all fail closed.
F5 — optimistic concurrency. BlobStore.get_with_etag() + MemoryBackend.read_text_with_etag() / write_text(if_match=) (Local uses a content-hash ETag, Blob the real ETag). claim_delivery merges max(last_ts) + union(seen_ids) and retries on ConcurrencyError, so a slow writer can't regress the watermark.
F2 — uniform backend at boot. get_backend() now raises BackendMisconfiguredError on a half-configured blob env (exactly one of endpoint/container) instead of silently using an empty Local store — the exact way a mis-enved fleet instance used to diverge and flood. assert_backend_config() (wired into _initialize) logs the resolved backend + container and fails loud on half-config. ENTRABOT_KEEP_MEMORY_LOCAL=true still forces Local.

F6 (singleton) is already served by src/entrabot/singleton.py (flock at the top of main()); orphan cleanup of background tasks outside FastMCP's lifespan remains tracked separately in TODOS.md.

F3 (cloud-authoritative watched_chats) is deferred — the largest structural change and not required by the fail-closed + idempotency guarantees above. Rationale in the design doc §Deferred.

Tests

TDD throughout — every behavior had a failing test first. Covers the design's full test plan: fail-closed on read error (no push, no re-bootstrap), cold-store catch-up, fleet idempotency (one push across two backends sharing one store), concurrency merge (max watermark + union), and the boot-time backend assertion. Also isolated a pre-existing poll integration test that was leaking cursor writes to the real data dir.

pytest -v && ruff check . — 1567 passed, 1 skipped, ruff clean.

Status files

Updated TODOS.md, docs/engineering-status.md, and the design doc's status line + Shipped/Deferred sections per the status-currency rule.

…ors (F1+F4) Split the cursor-read miss cases so the background Teams poll never re-pushes a chat's newest message on an ambiguous read: - Add chat_cursors.resolve_cursor() → PRESENT / ABSENT / UNRESOLVED. A backend read error or corrupt payload is UNRESOLVED (fail closed), distinct from a successful empty read (ABSENT). - _register_watched_chat classifies via resolve_cursor: PRESENT rehydrates (fresh OR stale — F4, catch-up via the timestamp gate), ABSENT registers a genuinely-new chat, UNRESOLVED flags needs_resolution and pushes nothing. - Extract _poll_watched_chat + _resolve_pending_chat: a needs_resolution chat re-reads the cursor each cycle and delivers nothing until it resolves. Closes F1 and F4 from the multi-instance cursor design. Stops the fleet replay flood + stale-imperative injection surface from transient reads and the 24h staleness cap.

…rites (F5) Make cross-instance Teams delivery exactly-once and cursor writes conflict-safe: - BlobStore.get_with_etag(); MemoryBackend gains read_text_with_etag() and write_text(if_match=) — Local uses a content-hash ETag, Blob uses the real ETag; a stale If-Match raises ConcurrencyError. - chat_cursors.claim_delivery(): the shared cursor seen_ids_tail is the fleet delivery ledger. CAS-merges (union seen, max last_ts) with If-Match and returns only the message ids THIS instance won; losers re-read and claim nothing. Read error / corrupt cursor / exhausted retries all fail closed. - Steady-state poll pushes only claimed ids and persists via the CAS write, so a slow writer can't regress the watermark. New-chat first delivery now also flows through claim_delivery (bootstrap defers the push a cycle). Closes the fleet idempotency + concurrency-merge items from the design.

A silent Local fallback when only one of ENTRABOT_BLOB_ENDPOINT / ENTRABOT_BLOB_CONTAINER is set is how a mis-enved fleet instance lands on an empty local store and re-bootstraps every chat (replay-flood root cause). - Add BackendMisconfiguredError; get_backend() raises it on a half-configured blob env instead of returning LocalBackend. ENTRABOT_KEEP_MEMORY_LOCAL=true still forces Local. - Add assert_backend_config(): resolves + validates the backend once at boot, logs the resolved backend + container so operators can confirm fleet-wide agreement, and raises on half-config. Wired into _initialize. - Isolate test_one_chat_403_does_not_starve_others to a tmp data dir (its cursor writes were leaking to the real data dir). Closes F2.

…ipped, F3 deferred) Update the design doc status + Shipped/Deferred sections, TODOS.md item, and engineering-status.md to reflect what landed on this branch.

brandwe added 4 commits July 2, 2026 09:14

docs: record fleet-safe cursor fix status (F1/F2/F4/F5/idempotency sh…

7743b50

…ipped, F3 deferred) Update the design doc status + Shipped/Deferred sections, TODOS.md item, and engineering-status.md to reflect what landed on this branch.

brandwe merged commit 77b6d49 into main Jul 2, 2026
5 checks passed

brandwe deleted the fix/fleet-safe-cursor branch July 2, 2026 18:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fleet-safe Teams channel poll: fail closed on cursor-read miss + per-message cloud idempotency#97

Fleet-safe Teams channel poll: fail closed on cursor-read miss + per-message cloud idempotency#97
brandwe merged 4 commits into
mainfrom
fix/fleet-safe-cursor

brandwe commented Jul 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

brandwe commented Jul 2, 2026

Problem

What changed (closes F1, F2, F4, F5, per-message idempotency)

Tests

Status files

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant