Skip to content

Fleet-safe Teams channel poll: fail closed on cursor-read miss + per-message cloud idempotency#97

Merged
brandwe merged 4 commits into
mainfrom
fix/fleet-safe-cursor
Jul 2, 2026
Merged

Fleet-safe Teams channel poll: fail closed on cursor-read miss + per-message cloud idempotency#97
brandwe merged 4 commits into
mainfrom
fix/fleet-safe-cursor

Conversation

@brandwe

@brandwe brandwe commented Jul 2, 2026

Copy link
Copy Markdown
Member

Makes the background Teams channel poll safe for N instances sharing one cloud cursor store. Implements the design in docs/architecture/DESIGN-multi-instance-cursor-consistency.md (committed on main in 107ab07).

Problem

The poll's per-chat cursor rehydration is correct in steady state, but the bootstrap path failed open: on any failure to read a fresh cloud cursor at registration time — absent, stale, corrupt, or a transient read exception_register_watched_chat fell through to _bootstrap_chat, which pushes the chat's newest message. In a fleet these read-misses are routine, so idle chats replayed their newest (weeks-old) message. Confirmed live 2026-07-02: a base instance on a shared blob container flooded the session with ~5-week-old Teams DMs while the blob cursors were correct and fresh. This is a security surface — the replay re-injects stale imperative messages ("read special data in my Documents", "ship to

", "run <script>") into the agent's channel — so fail-closed is treated as a requirement, not polish.

What changed (closes F1, F2, F4, F5, per-message idempotency)

  • F1 — fail closed on read miss. New chat_cursors.resolve_cursor() classifies a cursor read as PRESENT / ABSENT / UNRESOLVED. A backend read error or a corrupt payload is UNRESOLVED (distinct from a successful empty read, ABSENT). _register_watched_chat flags UNRESOLVED chats needs_resolution; the extracted _poll_watched_chat / _resolve_pending_chat re-read them each cycle and push nothing until they resolve. Only a provably-ABSENT chat may surface its newest message once.
  • F4 — catch-up read on stale. A present-but-stale cursor now rehydrates (never re-bootstraps); the steady-state timestamp gate surfaces only messages newer than last_ts. Idle chats emit nothing.
  • Per-message cloud idempotency. chat_cursors.claim_delivery() treats the shared cursor's seen_ids_tail as the fleet delivery ledger: an ETag compare-and-swap records the candidate batch and returns only the ids this instance won. Across N instances exactly one wins each id; losers claim nothing. New-chat first delivery also flows through it (bootstrap defers the push one cycle). Read error / corrupt cursor / exhausted retries all fail closed.
  • F5 — optimistic concurrency. BlobStore.get_with_etag() + MemoryBackend.read_text_with_etag() / write_text(if_match=) (Local uses a content-hash ETag, Blob the real ETag). claim_delivery merges max(last_ts) + union(seen_ids) and retries on ConcurrencyError, so a slow writer can't regress the watermark.
  • F2 — uniform backend at boot. get_backend() now raises BackendMisconfiguredError on a half-configured blob env (exactly one of endpoint/container) instead of silently using an empty Local store — the exact way a mis-enved fleet instance used to diverge and flood. assert_backend_config() (wired into _initialize) logs the resolved backend + container and fails loud on half-config. ENTRABOT_KEEP_MEMORY_LOCAL=true still forces Local.

F6 (singleton) is already served by src/entrabot/singleton.py (flock at the top of main()); orphan cleanup of background tasks outside FastMCP's lifespan remains tracked separately in TODOS.md.

F3 (cloud-authoritative watched_chats) is deferred — the largest structural change and not required by the fail-closed + idempotency guarantees above. Rationale in the design doc §Deferred.

Tests

TDD throughout — every behavior had a failing test first. Covers the design's full test plan: fail-closed on read error (no push, no re-bootstrap), cold-store catch-up, fleet idempotency (one push across two backends sharing one store), concurrency merge (max watermark + union), and the boot-time backend assertion. Also isolated a pre-existing poll integration test that was leaking cursor writes to the real data dir.

pytest -v && ruff check .1567 passed, 1 skipped, ruff clean.

Status files

Updated TODOS.md, docs/engineering-status.md, and the design doc's status line + Shipped/Deferred sections per the status-currency rule.

brandwe added 4 commits July 2, 2026 09:14
…ors (F1+F4)

Split the cursor-read miss cases so the background Teams poll never
re-pushes a chat's newest message on an ambiguous read:

- Add chat_cursors.resolve_cursor() → PRESENT / ABSENT / UNRESOLVED.
  A backend read error or corrupt payload is UNRESOLVED (fail closed),
  distinct from a successful empty read (ABSENT).
- _register_watched_chat classifies via resolve_cursor: PRESENT rehydrates
  (fresh OR stale — F4, catch-up via the timestamp gate), ABSENT registers a
  genuinely-new chat, UNRESOLVED flags needs_resolution and pushes nothing.
- Extract _poll_watched_chat + _resolve_pending_chat: a needs_resolution chat
  re-reads the cursor each cycle and delivers nothing until it resolves.

Closes F1 and F4 from the multi-instance cursor design. Stops the fleet
replay flood + stale-imperative injection surface from transient reads and
the 24h staleness cap.
…rites (F5)

Make cross-instance Teams delivery exactly-once and cursor writes
conflict-safe:

- BlobStore.get_with_etag(); MemoryBackend gains read_text_with_etag() and
  write_text(if_match=) — Local uses a content-hash ETag, Blob uses the real
  ETag; a stale If-Match raises ConcurrencyError.
- chat_cursors.claim_delivery(): the shared cursor seen_ids_tail is the fleet
  delivery ledger. CAS-merges (union seen, max last_ts) with If-Match and
  returns only the message ids THIS instance won; losers re-read and claim
  nothing. Read error / corrupt cursor / exhausted retries all fail closed.
- Steady-state poll pushes only claimed ids and persists via the CAS write,
  so a slow writer can't regress the watermark. New-chat first delivery now
  also flows through claim_delivery (bootstrap defers the push a cycle).

Closes the fleet idempotency + concurrency-merge items from the design.
A silent Local fallback when only one of ENTRABOT_BLOB_ENDPOINT /
ENTRABOT_BLOB_CONTAINER is set is how a mis-enved fleet instance lands on
an empty local store and re-bootstraps every chat (replay-flood root cause).

- Add BackendMisconfiguredError; get_backend() raises it on a half-configured
  blob env instead of returning LocalBackend. ENTRABOT_KEEP_MEMORY_LOCAL=true
  still forces Local.
- Add assert_backend_config(): resolves + validates the backend once at boot,
  logs the resolved backend + container so operators can confirm fleet-wide
  agreement, and raises on half-config. Wired into _initialize.
- Isolate test_one_chat_403_does_not_starve_others to a tmp data dir (its
  cursor writes were leaking to the real data dir).

Closes F2.
…ipped, F3 deferred)

Update the design doc status + Shipped/Deferred sections, TODOS.md item, and
engineering-status.md to reflect what landed on this branch.
@brandwe brandwe merged commit 77b6d49 into main Jul 2, 2026
5 checks passed
@brandwe brandwe deleted the fix/fleet-safe-cursor branch July 2, 2026 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant