Fleet-safe Teams channel poll: fail closed on cursor-read miss + per-message cloud idempotency#97
Merged
Merged
Conversation
…ors (F1+F4) Split the cursor-read miss cases so the background Teams poll never re-pushes a chat's newest message on an ambiguous read: - Add chat_cursors.resolve_cursor() → PRESENT / ABSENT / UNRESOLVED. A backend read error or corrupt payload is UNRESOLVED (fail closed), distinct from a successful empty read (ABSENT). - _register_watched_chat classifies via resolve_cursor: PRESENT rehydrates (fresh OR stale — F4, catch-up via the timestamp gate), ABSENT registers a genuinely-new chat, UNRESOLVED flags needs_resolution and pushes nothing. - Extract _poll_watched_chat + _resolve_pending_chat: a needs_resolution chat re-reads the cursor each cycle and delivers nothing until it resolves. Closes F1 and F4 from the multi-instance cursor design. Stops the fleet replay flood + stale-imperative injection surface from transient reads and the 24h staleness cap.
…rites (F5) Make cross-instance Teams delivery exactly-once and cursor writes conflict-safe: - BlobStore.get_with_etag(); MemoryBackend gains read_text_with_etag() and write_text(if_match=) — Local uses a content-hash ETag, Blob uses the real ETag; a stale If-Match raises ConcurrencyError. - chat_cursors.claim_delivery(): the shared cursor seen_ids_tail is the fleet delivery ledger. CAS-merges (union seen, max last_ts) with If-Match and returns only the message ids THIS instance won; losers re-read and claim nothing. Read error / corrupt cursor / exhausted retries all fail closed. - Steady-state poll pushes only claimed ids and persists via the CAS write, so a slow writer can't regress the watermark. New-chat first delivery now also flows through claim_delivery (bootstrap defers the push a cycle). Closes the fleet idempotency + concurrency-merge items from the design.
A silent Local fallback when only one of ENTRABOT_BLOB_ENDPOINT / ENTRABOT_BLOB_CONTAINER is set is how a mis-enved fleet instance lands on an empty local store and re-bootstraps every chat (replay-flood root cause). - Add BackendMisconfiguredError; get_backend() raises it on a half-configured blob env instead of returning LocalBackend. ENTRABOT_KEEP_MEMORY_LOCAL=true still forces Local. - Add assert_backend_config(): resolves + validates the backend once at boot, logs the resolved backend + container so operators can confirm fleet-wide agreement, and raises on half-config. Wired into _initialize. - Isolate test_one_chat_403_does_not_starve_others to a tmp data dir (its cursor writes were leaking to the real data dir). Closes F2.
…ipped, F3 deferred) Update the design doc status + Shipped/Deferred sections, TODOS.md item, and engineering-status.md to reflect what landed on this branch.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Makes the background Teams channel poll safe for N instances sharing one cloud cursor store. Implements the design in
docs/architecture/DESIGN-multi-instance-cursor-consistency.md(committed onmainin107ab07).Problem
The poll's per-chat cursor rehydration is correct in steady state, but the bootstrap path failed open: on any failure to read a fresh cloud cursor at registration time — absent, stale, corrupt, or a transient read exception —
", "run <script>") into the agent's channel — so fail-closed is treated as a requirement, not polish._register_watched_chatfell through to_bootstrap_chat, which pushes the chat's newest message. In a fleet these read-misses are routine, so idle chats replayed their newest (weeks-old) message. Confirmed live 2026-07-02: a base instance on a shared blob container flooded the session with ~5-week-old Teams DMs while the blob cursors were correct and fresh. This is a security surface — the replay re-injects stale imperative messages ("read special data in my Documents", "ship toWhat changed (closes F1, F2, F4, F5, per-message idempotency)
chat_cursors.resolve_cursor()classifies a cursor read asPRESENT/ABSENT/UNRESOLVED. A backend read error or a corrupt payload isUNRESOLVED(distinct from a successful empty read,ABSENT)._register_watched_chatflagsUNRESOLVEDchatsneeds_resolution; the extracted_poll_watched_chat/_resolve_pending_chatre-read them each cycle and push nothing until they resolve. Only a provably-ABSENTchat may surface its newest message once.last_ts. Idle chats emit nothing.chat_cursors.claim_delivery()treats the shared cursor'sseen_ids_tailas the fleet delivery ledger: an ETag compare-and-swap records the candidate batch and returns only the ids this instance won. Across N instances exactly one wins each id; losers claim nothing. New-chat first delivery also flows through it (bootstrap defers the push one cycle). Read error / corrupt cursor / exhausted retries all fail closed.BlobStore.get_with_etag()+MemoryBackend.read_text_with_etag()/write_text(if_match=)(Local uses a content-hash ETag, Blob the real ETag).claim_deliverymergesmax(last_ts)+union(seen_ids)and retries onConcurrencyError, so a slow writer can't regress the watermark.get_backend()now raisesBackendMisconfiguredErroron a half-configured blob env (exactly one of endpoint/container) instead of silently using an empty Local store — the exact way a mis-enved fleet instance used to diverge and flood.assert_backend_config()(wired into_initialize) logs the resolved backend + container and fails loud on half-config.ENTRABOT_KEEP_MEMORY_LOCAL=truestill forces Local.F6 (singleton) is already served by
src/entrabot/singleton.py(flock at the top ofmain()); orphan cleanup of background tasks outside FastMCP's lifespan remains tracked separately inTODOS.md.F3 (cloud-authoritative
watched_chats) is deferred — the largest structural change and not required by the fail-closed + idempotency guarantees above. Rationale in the design doc §Deferred.Tests
TDD throughout — every behavior had a failing test first. Covers the design's full test plan: fail-closed on read error (no push, no re-bootstrap), cold-store catch-up, fleet idempotency (one push across two backends sharing one store), concurrency merge (max watermark + union), and the boot-time backend assertion. Also isolated a pre-existing poll integration test that was leaking cursor writes to the real data dir.
pytest -v && ruff check .— 1567 passed, 1 skipped, ruff clean.Status files
Updated
TODOS.md,docs/engineering-status.md, and the design doc's status line + Shipped/Deferred sections per the status-currency rule.