Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 10 additions & 4 deletions TODOS.md
Original file line number Diff line number Diff line change
Expand Up @@ -84,10 +84,16 @@ Observed twice: when the parent Claude process exits, the `entrabot-mcp` child k
- **Source:** Live observation 2026-04-17 (second occurrence in one day)

### Multi-instance cursor consistency: bootstrap fails open → fleet replay flood
The background Teams poll re-pushes a chat's newest message on *any* failure to read a fresh cloud cursor — absent, stale, corrupt, and read-exception all fall through to `_bootstrap_chat`, which pushes. In a fleet (N instances → one blob container) these misses are routine: silent `LocalBackend` fallback when blob env is half-configured, transient blob read failures, the 24h `is_stale` cap re-bootstrapping a cold store, and last-writer-wins cursor writes with no ETag. Result: idle chats replay their newest (weeks-old) message. Also a security surface — the replay re-injects stale imperative messages ("read special data in my Documents", "ship to <address>", "run <script>") into the agent's channel. The steady-state gate (`_filter_new_messages`) already supports N-instances-one-store; only the bootstrap decision and write coordination break it. Confirmed 2026-07-02 (a base instance pointed at a shared blob container replayed ~5-week-old DMs while the blob cursors were correct + fresh). Fix: fail closed on read-miss + per-message cloud idempotency ledger + co-locate `watched_chats` in cloud + catch-up-read instead of bootstrap-on-stale + `If-Match` ETag concurrency on cursor writes + assert uniform backend at boot.
- **Effort:** M
- **Depends on:** "MCP server orphans when Claude Code exits" (enforce the singleton so duplicate pollers can't start); ADR-005 blob backend.
- **See:** `docs/architecture/DESIGN-multi-instance-cursor-consistency.md`
**Status: mostly shipped on `fix/fleet-safe-cursor`** (F1, F2, F4, F5, per-message cloud idempotency). F3 (cloud-authoritative `watched_chats`) deferred; F6 served by the existing process singleton.

The background Teams poll re-pushed a chat's newest message on *any* failure to read a fresh cloud cursor — absent, stale, corrupt, and read-exception all fell through to `_bootstrap_chat`, which pushes. In a fleet (N instances → one blob container) these misses are routine: silent `LocalBackend` fallback when blob env is half-configured, transient blob read failures, the 24h `is_stale` cap re-bootstrapping a cold store, and last-writer-wins cursor writes with no ETag. Result: idle chats replay their newest (weeks-old) message. Also a security surface — the replay re-injects stale imperative messages ("read special data in my Documents", "ship to <address>", "run <script>") into the agent's channel. Confirmed 2026-07-02 (a base instance pointed at a shared blob container replayed ~5-week-old DMs while the blob cursors were correct + fresh).

**Shipped:** fail-closed cursor classification (`resolve_cursor` → PRESENT/ABSENT/UNRESOLVED); rehydrate-and-catch-up on stale instead of re-bootstrap; per-message cloud idempotency via `claim_delivery` (CAS on the shared `seen_ids_tail` ledger); `If-Match` ETag concurrency on cursor writes (`get_with_etag` + `read_text_with_etag` + `write_text(if_match=)`); boot-time backend assertion (`assert_backend_config`, half-config → `BackendMisconfiguredError`).

**Remaining:** F3 — co-locate the `watched_chats` list in the cloud store (local file becomes a cache). Deferred as the largest structural change and not required by the fail-closed + idempotency guarantees.
- **Effort:** S (F3 only)
- **Depends on:** "MCP server orphans when Claude Code exits" (enforce the singleton so duplicate pollers can't start — the process singleton exists; orphan cleanup is the remaining gap); ADR-005 blob backend.
- **See:** `docs/architecture/DESIGN-multi-instance-cursor-consistency.md` (§Shipped / §Deferred)
- **Source:** Live diagnosis 2026-07-02.

### Daily summary scheduler: wrong day + double-fire
Expand Down
51 changes: 50 additions & 1 deletion docs/architecture/DESIGN-multi-instance-cursor-consistency.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,10 @@
# DESIGN — Multi-Instance Cursor Consistency (fleet-safe Teams channel poll)

> **Status:** Proposed. Problem confirmed in production 2026-07-02.
> **Status:** Partially shipped (branch `fix/fleet-safe-cursor`). F1, F2, F4,
> F5, and per-message cloud idempotency are implemented + tested. F6 is served
> by the existing process singleton (`src/entrabot/singleton.py`); orphan
> cleanup is tracked separately. F3 (cloud-authoritative `watched_chats`) is
> deferred — see "Deferred" below. Problem confirmed in production 2026-07-02.
> **Owner:** unassigned (handed off for implementation).
> **Related:** ADR-005 (cloud-hosted memory), `docs/decisions/005-cloud-hosted-memory.md`;
> TODOS.md → "MCP server orphans when Claude Code exits" (P1) and
Expand Down Expand Up @@ -227,3 +231,48 @@ surface entirely. Treat F1 as security-relevant, not cosmetic.
`seen_ids_tail`). A separate `delivered/` object is optional and additive.
- Land F1 + F6 first (stops the flood and the injection surface), then F2/F3,
then F4/F5 (correctness under long downtime and high write concurrency).

## Shipped (branch `fix/fleet-safe-cursor`)

Landed with full TDD (`pytest -v && ruff check .` green):

- **F1 — fail closed on read miss.** `chat_cursors.resolve_cursor()` classifies
a cursor read as `PRESENT` / `ABSENT` / `UNRESOLVED`. A backend read error or
a corrupt payload is `UNRESOLVED` (distinct from a successful empty read,
`ABSENT`). `_register_watched_chat` flags `UNRESOLVED` chats
`needs_resolution`; `_poll_watched_chat` re-reads them each cycle and pushes
nothing until they resolve. Only a provably-`ABSENT` chat may surface its
newest message once.
- **F4 — catch-up read on stale.** A present-but-stale cursor now rehydrates
(never re-bootstraps); the steady-state timestamp gate surfaces only messages
newer than `last_ts`. Idle chats emit nothing.
- **Per-message cloud idempotency.** `chat_cursors.claim_delivery()` treats the
shared cursor's `seen_ids_tail` as the fleet delivery ledger: an ETag
compare-and-swap records the candidate batch and returns only the ids THIS
instance won. Across N instances exactly one wins each id; losers claim
nothing. New-chat first delivery also flows through it (bootstrap defers the
push one cycle). All ambiguity (read error / corrupt / exhausted retries)
fails closed.
- **F5 — optimistic concurrency.** `BlobStore.get_with_etag()` +
`MemoryBackend.read_text_with_etag()` / `write_text(if_match=)` (Local uses a
content-hash ETag, Blob the real ETag). `claim_delivery` merges `max(last_ts)`
+ `union(seen_ids)` and retries on `ConcurrencyError`, so a slow writer can't
regress the watermark.
- **F2 — uniform backend at boot.** `get_backend()` raises
`BackendMisconfiguredError` on a half-configured blob env instead of silently
using Local. `assert_backend_config()` (called from `_initialize`) logs the
resolved backend + container and fails loud on half-config.
- **F6 — singleton.** Served by the existing `src/entrabot/singleton.py` flock
acquired at the top of `main()`. Orphan cleanup (background tasks outside
FastMCP's lifespan cancel scope) remains tracked in `TODOS.md` ("MCP server
orphans when Claude Code exits").

### Deferred

- **F3 — cloud-authoritative `watched_chats`.** Not shipped in this branch. It
is the largest structural change (moving the watched-chats list into the
cloud store with the local file as a cache) and is not required by the
fail-closed + idempotency guarantees above: with F1/F2 + the idempotency
ledger, an instance that registers a chat late or on a divergent local list
still cannot replay — it either fails closed (unresolved) or loses the
delivery claim. Tracked as a follow-up.
2 changes: 1 addition & 1 deletion docs/engineering-status.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ Source of truth for detail: `TODOS.md` in the repository root. One line each bel
- **Script-toolkit docs closeout** — `./status.sh` is the canonical entry; finish the remaining script-reference polish and smoke verification. See `TODOS.md` P1.
- **Test isolation: blob env leakage** — `tmp_data_dir` fixture in `tests/tools/test_interaction_log.py` doesn't clear `ENTRABOT_BLOB_ENDPOINT`; 10 tests fail on any machine with blob env configured. Partially addressed: `test_interaction_log.py`, `test_daily_summary.py`, and `test_email_poll.py` fixtures now unset blob env; session-scoped autouse fixture still open.
- **MCP server orphans on Claude Code exit** — background poll tasks sit outside FastMCP's lifespan cancel scope; new sessions spawn a second server, both poll Graph independently.
- **Multi-instance cursor consistency (fleet-safe channel poll)** — issue #17 fixed single-instance restart replay, but the poll's bootstrap path still fails *open* on any cloud-cursor read miss, so a mis-enved or transiently-failing instance in an N-instance fleet replays every idle chat's newest (weeks-old) message — and re-injects stale imperative messages into the channel. Fix plan: fail-closed miss path + per-message cloud idempotency + cloud-authoritative `watched_chats` + ETag cursor writes + boot-time backend assertion. See `docs/architecture/DESIGN-multi-instance-cursor-consistency.md`. Confirmed 2026-07-02.
- **Multi-instance cursor consistency (fleet-safe channel poll)** — mostly shipped on `fix/fleet-safe-cursor`: the poll's bootstrap path now fails *closed* on any cloud-cursor read miss (`resolve_cursor` → PRESENT/ABSENT/UNRESOLVED), stale cursors catch up instead of re-bootstrapping, delivery is idempotent across the fleet via an ETag compare-and-swap on the shared cursor ledger (`claim_delivery`), cursor writes use `If-Match` concurrency, and a half-configured blob backend fails loud at boot. Remaining: F3 (cloud-authoritative `watched_chats`) deferred. See `docs/architecture/DESIGN-multi-instance-cursor-consistency.md` §Shipped. Confirmed 2026-07-02.
- **Daily summary scheduler — wrong day + double-fire** — UTC-based `target_day` summarizes the brand-new UTC day at 5pm PDT; scheduler fired twice at the same second on 2026-04-17.

## Recently Shipped
Expand Down
25 changes: 25 additions & 0 deletions src/entrabot/errors.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,31 @@ class ConfigError(EntraBotError):
"""Configuration errors (invalid or removed settings)."""


class BackendMisconfiguredError(ConfigError):
"""Storage backend env is half-configured (design F2).

Exactly one of ``ENTRABOT_BLOB_ENDPOINT`` / ``ENTRABOT_BLOB_CONTAINER``
is set. Silently falling back to Local here is how a mis-enved fleet
instance ends up on an empty local store and re-bootstraps every chat
(the replay-flood root cause). Fail loud so the operator fixes the env
instead of the fleet diverging. Set ``ENTRABOT_KEEP_MEMORY_LOCAL=true``
to deliberately force Local.
"""

def __init__(self, *, endpoint: str | None, container: str | None) -> None:
self.endpoint = endpoint
self.container = container
missing = "container" if endpoint else "endpoint"
super().__init__(
"storage backend is half-configured: "
f"blob {missing} is unset "
f"(endpoint={'set' if endpoint else 'unset'}, "
f"container={'set' if container else 'unset'}). "
"Set BOTH ENTRABOT_BLOB_ENDPOINT and ENTRABOT_BLOB_CONTAINER, or set "
"ENTRABOT_KEEP_MEMORY_LOCAL=true to force local storage."
)


class InsecureKeyringBackendError(EntraBotError):
"""The selected keyring backend is not an approved OS keystore."""

Expand Down
Loading
Loading