microsoft · brandwe · Jul 2, 2026 · Jul 2, 2026 · Jul 2, 2026 · Jul 2, 2026
diff --git a/TODOS.md b/TODOS.md
@@ -84,10 +84,16 @@ Observed twice: when the parent Claude process exits, the `entrabot-mcp` child k
 - **Source:** Live observation 2026-04-17 (second occurrence in one day)
 
 ### Multi-instance cursor consistency: bootstrap fails open → fleet replay flood
-The background Teams poll re-pushes a chat's newest message on *any* failure to read a fresh cloud cursor — absent, stale, corrupt, and read-exception all fall through to `_bootstrap_chat`, which pushes. In a fleet (N instances → one blob container) these misses are routine: silent `LocalBackend` fallback when blob env is half-configured, transient blob read failures, the 24h `is_stale` cap re-bootstrapping a cold store, and last-writer-wins cursor writes with no ETag. Result: idle chats replay their newest (weeks-old) message. Also a security surface — the replay re-injects stale imperative messages ("read special data in my Documents", "ship to <address>", "run <script>") into the agent's channel. The steady-state gate (`_filter_new_messages`) already supports N-instances-one-store; only the bootstrap decision and write coordination break it. Confirmed 2026-07-02 (a base instance pointed at a shared blob container replayed ~5-week-old DMs while the blob cursors were correct + fresh). Fix: fail closed on read-miss + per-message cloud idempotency ledger + co-locate `watched_chats` in cloud + catch-up-read instead of bootstrap-on-stale + `If-Match` ETag concurrency on cursor writes + assert uniform backend at boot.
-- **Effort:** M
-- **Depends on:** "MCP server orphans when Claude Code exits" (enforce the singleton so duplicate pollers can't start); ADR-005 blob backend.
-- **See:** `docs/architecture/DESIGN-multi-instance-cursor-consistency.md`
+**Status: mostly shipped on `fix/fleet-safe-cursor`** (F1, F2, F4, F5, per-message cloud idempotency). F3 (cloud-authoritative `watched_chats`) deferred; F6 served by the existing process singleton.
+
+The background Teams poll re-pushed a chat's newest message on *any* failure to read a fresh cloud cursor — absent, stale, corrupt, and read-exception all fell through to `_bootstrap_chat`, which pushes. In a fleet (N instances → one blob container) these misses are routine: silent `LocalBackend` fallback when blob env is half-configured, transient blob read failures, the 24h `is_stale` cap re-bootstrapping a cold store, and last-writer-wins cursor writes with no ETag. Result: idle chats replay their newest (weeks-old) message. Also a security surface — the replay re-injects stale imperative messages ("read special data in my Documents", "ship to <address>", "run <script>") into the agent's channel. Confirmed 2026-07-02 (a base instance pointed at a shared blob container replayed ~5-week-old DMs while the blob cursors were correct + fresh).
+
+**Shipped:** fail-closed cursor classification (`resolve_cursor` → PRESENT/ABSENT/UNRESOLVED); rehydrate-and-catch-up on stale instead of re-bootstrap; per-message cloud idempotency via `claim_delivery` (CAS on the shared `seen_ids_tail` ledger); `If-Match` ETag concurrency on cursor writes (`get_with_etag` + `read_text_with_etag` + `write_text(if_match=)`); boot-time backend assertion (`assert_backend_config`, half-config → `BackendMisconfiguredError`).
+
+**Remaining:** F3 — co-locate the `watched_chats` list in the cloud store (local file becomes a cache). Deferred as the largest structural change and not required by the fail-closed + idempotency guarantees.
+- **Effort:** S (F3 only)
+- **Depends on:** "MCP server orphans when Claude Code exits" (enforce the singleton so duplicate pollers can't start — the process singleton exists; orphan cleanup is the remaining gap); ADR-005 blob backend.
+- **See:** `docs/architecture/DESIGN-multi-instance-cursor-consistency.md` (§Shipped / §Deferred)
 - **Source:** Live diagnosis 2026-07-02.
 
 ### Daily summary scheduler: wrong day + double-fire

diff --git a/docs/architecture/DESIGN-multi-instance-cursor-consistency.md b/docs/architecture/DESIGN-multi-instance-cursor-consistency.md
@@ -1,6 +1,10 @@
 # DESIGN — Multi-Instance Cursor Consistency (fleet-safe Teams channel poll)
 
-> **Status:** Proposed. Problem confirmed in production 2026-07-02.
+> **Status:** Partially shipped (branch `fix/fleet-safe-cursor`). F1, F2, F4,
+> F5, and per-message cloud idempotency are implemented + tested. F6 is served
+> by the existing process singleton (`src/entrabot/singleton.py`); orphan
+> cleanup is tracked separately. F3 (cloud-authoritative `watched_chats`) is
+> deferred — see "Deferred" below. Problem confirmed in production 2026-07-02.
 > **Owner:** unassigned (handed off for implementation).
 > **Related:** ADR-005 (cloud-hosted memory), `docs/decisions/005-cloud-hosted-memory.md`;
 > TODOS.md → "MCP server orphans when Claude Code exits" (P1) and
@@ -227,3 +231,48 @@ surface entirely. Treat F1 as security-relevant, not cosmetic.
   `seen_ids_tail`). A separate `delivered/` object is optional and additive.
 - Land F1 + F6 first (stops the flood and the injection surface), then F2/F3,
   then F4/F5 (correctness under long downtime and high write concurrency).
+
+## Shipped (branch `fix/fleet-safe-cursor`)
+
+Landed with full TDD (`pytest -v && ruff check .` green):
+
+- **F1 — fail closed on read miss.** `chat_cursors.resolve_cursor()` classifies
+  a cursor read as `PRESENT` / `ABSENT` / `UNRESOLVED`. A backend read error or
+  a corrupt payload is `UNRESOLVED` (distinct from a successful empty read,
+  `ABSENT`). `_register_watched_chat` flags `UNRESOLVED` chats
+  `needs_resolution`; `_poll_watched_chat` re-reads them each cycle and pushes
+  nothing until they resolve. Only a provably-`ABSENT` chat may surface its
+  newest message once.
+- **F4 — catch-up read on stale.** A present-but-stale cursor now rehydrates
+  (never re-bootstraps); the steady-state timestamp gate surfaces only messages
+  newer than `last_ts`. Idle chats emit nothing.
+- **Per-message cloud idempotency.** `chat_cursors.claim_delivery()` treats the
+  shared cursor's `seen_ids_tail` as the fleet delivery ledger: an ETag
+  compare-and-swap records the candidate batch and returns only the ids THIS
+  instance won. Across N instances exactly one wins each id; losers claim
+  nothing. New-chat first delivery also flows through it (bootstrap defers the
+  push one cycle). All ambiguity (read error / corrupt / exhausted retries)
+  fails closed.
+- **F5 — optimistic concurrency.** `BlobStore.get_with_etag()` +
+  `MemoryBackend.read_text_with_etag()` / `write_text(if_match=)` (Local uses a
+  content-hash ETag, Blob the real ETag). `claim_delivery` merges `max(last_ts)`
+  + `union(seen_ids)` and retries on `ConcurrencyError`, so a slow writer can't
+  regress the watermark.
+- **F2 — uniform backend at boot.** `get_backend()` raises
+  `BackendMisconfiguredError` on a half-configured blob env instead of silently
+  using Local. `assert_backend_config()` (called from `_initialize`) logs the
+  resolved backend + container and fails loud on half-config.
+- **F6 — singleton.** Served by the existing `src/entrabot/singleton.py` flock
+  acquired at the top of `main()`. Orphan cleanup (background tasks outside
+  FastMCP's lifespan cancel scope) remains tracked in `TODOS.md` ("MCP server
+  orphans when Claude Code exits").
+
+### Deferred
+
+- **F3 — cloud-authoritative `watched_chats`.** Not shipped in this branch. It
+  is the largest structural change (moving the watched-chats list into the
+  cloud store with the local file as a cache) and is not required by the
+  fail-closed + idempotency guarantees above: with F1/F2 + the idempotency
+  ledger, an instance that registers a chat late or on a divergent local list
+  still cannot replay — it either fails closed (unresolved) or loses the
+  delivery claim. Tracked as a follow-up.
diff --git a/docs/engineering-status.md b/docs/engineering-status.md
@@ -14,7 +14,7 @@ Source of truth for detail: `TODOS.md` in the repository root. One line each bel
 - **Script-toolkit docs closeout** — `./status.sh` is the canonical entry; finish the remaining script-reference polish and smoke verification. See `TODOS.md` P1.
 - **Test isolation: blob env leakage** — `tmp_data_dir` fixture in `tests/tools/test_interaction_log.py` doesn't clear `ENTRABOT_BLOB_ENDPOINT`; 10 tests fail on any machine with blob env configured. Partially addressed: `test_interaction_log.py`, `test_daily_summary.py`, and `test_email_poll.py` fixtures now unset blob env; session-scoped autouse fixture still open.
 - **MCP server orphans on Claude Code exit** — background poll tasks sit outside FastMCP's lifespan cancel scope; new sessions spawn a second server, both poll Graph independently.
-- **Multi-instance cursor consistency (fleet-safe channel poll)** — issue #17 fixed single-instance restart replay, but the poll's bootstrap path still fails *open* on any cloud-cursor read miss, so a mis-enved or transiently-failing instance in an N-instance fleet replays every idle chat's newest (weeks-old) message — and re-injects stale imperative messages into the channel. Fix plan: fail-closed miss path + per-message cloud idempotency + cloud-authoritative `watched_chats` + ETag cursor writes + boot-time backend assertion. See `docs/architecture/DESIGN-multi-instance-cursor-consistency.md`. Confirmed 2026-07-02.
+- **Multi-instance cursor consistency (fleet-safe channel poll)** — mostly shipped on `fix/fleet-safe-cursor`: the poll's bootstrap path now fails *closed* on any cloud-cursor read miss (`resolve_cursor` → PRESENT/ABSENT/UNRESOLVED), stale cursors catch up instead of re-bootstrapping, delivery is idempotent across the fleet via an ETag compare-and-swap on the shared cursor ledger (`claim_delivery`), cursor writes use `If-Match` concurrency, and a half-configured blob backend fails loud at boot. Remaining: F3 (cloud-authoritative `watched_chats`) deferred. See `docs/architecture/DESIGN-multi-instance-cursor-consistency.md` §Shipped. Confirmed 2026-07-02.
 - **Daily summary scheduler — wrong day + double-fire** — UTC-based `target_day` summarizes the brand-new UTC day at 5pm PDT; scheduler fired twice at the same second on 2026-04-17.
 
 ## Recently Shipped

diff --git a/src/entrabot/errors.py b/src/entrabot/errors.py
@@ -33,6 +33,31 @@ class ConfigError(EntraBotError):
     """Configuration errors (invalid or removed settings)."""
 
 
+class BackendMisconfiguredError(ConfigError):
+    """Storage backend env is half-configured (design F2).
+
+    Exactly one of ``ENTRABOT_BLOB_ENDPOINT`` / ``ENTRABOT_BLOB_CONTAINER``
+    is set. Silently falling back to Local here is how a mis-enved fleet
+    instance ends up on an empty local store and re-bootstraps every chat
+    (the replay-flood root cause). Fail loud so the operator fixes the env
+    instead of the fleet diverging. Set ``ENTRABOT_KEEP_MEMORY_LOCAL=true``
+    to deliberately force Local.
+    """
+
+    def __init__(self, *, endpoint: str | None, container: str | None) -> None:
+        self.endpoint = endpoint
+        self.container = container
+        missing = "container" if endpoint else "endpoint"
+        super().__init__(
+            "storage backend is half-configured: "
+            f"blob {missing} is unset "
+            f"(endpoint={'set' if endpoint else 'unset'}, "
+            f"container={'set' if container else 'unset'}). "
+            "Set BOTH ENTRABOT_BLOB_ENDPOINT and ENTRABOT_BLOB_CONTAINER, or set "
+            "ENTRABOT_KEEP_MEMORY_LOCAL=true to force local storage."
+        )
+
+
 class InsecureKeyringBackendError(EntraBotError):
     """The selected keyring backend is not an approved OS keystore."""