fix(watch): exit on originating-session death so a quiet watcher can't hang (#67, #388) by fujibee · Pull Request #215 · fujibee/agmsg

fujibee · 2026-06-24T05:29:42Z

Problem

The watch: SIGPIPE tests flake on Linux and fail deterministically on macOS, and — more importantly — a real watcher whose owning session has died can spin forever when its message stream is quiet. That silent spin is the source of the ~33-minute windows/macos runner stall behind #388 (the job timeout only caps the symptom).

A watcher only notices a gone consumer when its next write fails. But a plain pipe offers no portable way to detect a downstream reader that closed silently: printf '' raises no EPIPE, and macOS buffers a final write into an already-dead reader. So if no further message arrives, the watcher never writes again, never sees the broken pipe, and loops indefinitely.

Fix

watch.sh — add a liveness guard at the top of the poll loop. When the agent process embedded in the composite instance id is gone (portable kill -0 via _agmsg_pid_alive, which falls back to tasklist under Git Bash), the watcher exits within ~1 poll interval, even with a silent stream. Gated on a composite instance id only. Closes watch.sh: self-exit when owning CC session is no longer alive #67.
Tests — the old "closed consumer does not advance watermark" assertion is an unachievable contract on a plain pipe (see above), so it is rewritten to the observable one: the watcher exits on session death without advancing the watermark past an unconsumed row, driven by a controllable stand-in session pid. A watermark-vs-pidfile setup race in it is also fixed. The #93/#66 relaunch/independence tests are moved off fabricated dead pids (which the new guard now exits) onto live stand-in processes, and the relaunch's kill-then-check is made deterministic.

Scope / limitation

Delivery and watermark semantics are unchanged — that is the storage delivery-cursor work, not this PR. A bare instance id (the degraded fallback, used when the agent pid can't be resolved) is not liveness-gated; for that case the CI/job timeout-minutes remains the backstop.

Verification

tests/test_watch.bats: 13/13 across repeated runs.
The previously 100%-failing case (macOS): 20/20 fail → 20/20 pass.
Full bats suite: 387/0.

Closes #67. Fixes #388.

…t hang (#67, #388) A watcher only notices a gone consumer when its next write fails, but a plain pipe gives no portable way to detect a downstream reader that closed silently: printf '' raises no EPIPE, and macOS buffers a final write into an already-dead reader. So a watcher whose owning session died with a quiet message stream would spin forever — the source of the ~33-minute macOS-runner stall behind #388; the job timeout only caps the symptom. Add a liveness guard at the top of the poll loop: when the agent process embedded in the composite instance id is gone (portable kill -0 via _agmsg_pid_alive, which falls back to tasklist under Git Bash), the watcher exits within ~1 poll interval, even with a silent stream. Gated on a composite instance id only; a bare id (degraded, no resolved agent pid) keeps the prior behavior and relies on the job timeout as a backstop. Rewrite the flaky 'closed consumer does not advance watermark' test: a closed downstream consumer is an unachievable contract on a plain pipe, so assert the observable one instead — the watcher exits on session death without advancing the watermark past an unconsumed row — driven by a controllable stand-in session pid. Also fix a watermark-vs-pidfile setup race in it, move the #93/#66 tests off fabricated dead pids (which the guard now exits) onto live stand-in processes, and make the relaunch kill-then-check deterministic. Closes #67. Fixes #388.

…de (#206 phase 1) #206 migrates the message call-sites off direct SQL onto the storage facade. This first phase moves only the READ/display paths; it is purely additive and keeps the suite green, because the facade's storage_list_unread / storage_history read the event log UNION the legacy messages table, so messages still written to the legacy table by send.sh stay visible. Deliberately deferred to the phase-3 atomic flip: mark-read and send. mark-read stays a legacy read_at UPDATE here — the read-state writers (inbox/check-inbox) and the legacy-read_at readers that have not migrated yet (watch-once, whose codex-bridge consumes an integer max_id cursor for stale-wake detection) are all coupled to read_at and must flip together with send and watch, or read-state splits (a half-migrated mark would be invisible to watch-once). Drawing the boundary at read-only keeps every commit consistent. - inbox.sh / check-inbox.sh: unread now comes from storage_list_unread (JSONL), parsed in one pass with sqlite's JSON funcs (no jq; cf. lib/hooks-json.sh). - history.sh: storage_history (events ∪ legacy). The ●/○ marker is derived by cross-referencing storage_list_unread per recipient (read-state is recipient-scoped, §2.3, so it is not carried on a history record). Team-wide history (no agent) now works via the widened contract below. - driver storage_history: <agent> is optional (omitted = whole team), and --limit returns the most RECENT N in chronological order (was: oldest N). spec §2.1 updated — both are additive (existing agent-scoped calls unchanged). - contract tests +2: team-wide history, and --limit recency/ordering. Contract 24/24, full suite 410 ok / 1 (pre-existing watch flake #124, fixed by PR #215). Refs #206, ADR 0003.

… truly optional (co1 #206 review) co1 request-changes on #206 phase 1 — two blocking gaps in the storage_history contract changes, both addressed: - [blocking 1] The newest-N --limit semantics lived only in the sqlite impl and a contract test, not the spec — a place a future jsonl/redis driver could diverge. spec §2.1 now states it: '--limit N selects the most recent N of the matching set; output is always time order, oldest→newest (the tail of the history, still chronological — never newest-first)', and a driver must honour both selection and ordering. - [blocking 2] The optional <agent> wasn't actually optional at the driver level: the old 'agent=$2; shift 2' misread 'storage_history <team> --limit N' (--limit taken as the agent) and was unsafe for 'storage_history <team>' alone. history.sh hid this by always passing an empty $2, but the contract was wrong. The parser now consumes a leading NON-flag arg as the agent (empty allowed = team-wide) and leaves --flags to the flag loop. Added a contract test: 'storage_history <team> --limit 1' returns exactly one team-wide record as pure JSONL (the flag is not swallowed as the agent). Non-blocking, tracked for the phase-3 / driver-switch checklist (co1): inbox/ check-inbox/history still gate on the agmsg_db_path file-existence check; a jsonl/redis active driver would early-exit when messages.db is absent, so that guard should move to driver-level empty-store semantics before a non-sqlite driver is enabled. Contract 25/25, full suite 411 ok / 1 (pre-existing watch flake #124, resolved by the upcoming rebase onto main / #215). Refs #206, ADR 0003.

…de (#206 phase 1) #206 migrates the message call-sites off direct SQL onto the storage facade. This first phase moves only the READ/display paths; it is purely additive and keeps the suite green, because the facade's storage_list_unread / storage_history read the event log UNION the legacy messages table, so messages still written to the legacy table by send.sh stay visible. Deliberately deferred to the phase-3 atomic flip: mark-read and send. mark-read stays a legacy read_at UPDATE here — the read-state writers (inbox/check-inbox) and the legacy-read_at readers that have not migrated yet (watch-once, whose codex-bridge consumes an integer max_id cursor for stale-wake detection) are all coupled to read_at and must flip together with send and watch, or read-state splits (a half-migrated mark would be invisible to watch-once). Drawing the boundary at read-only keeps every commit consistent. - inbox.sh / check-inbox.sh: unread now comes from storage_list_unread (JSONL), parsed in one pass with sqlite's JSON funcs (no jq; cf. lib/hooks-json.sh). - history.sh: storage_history (events ∪ legacy). The ●/○ marker is derived by cross-referencing storage_list_unread per recipient (read-state is recipient-scoped, §2.3, so it is not carried on a history record). Team-wide history (no agent) now works via the widened contract below. - driver storage_history: <agent> is optional (omitted = whole team), and --limit returns the most RECENT N in chronological order (was: oldest N). spec §2.1 updated — both are additive (existing agent-scoped calls unchanged). - contract tests +2: team-wide history, and --limit recency/ordering. Contract 24/24, full suite 410 ok / 1 (pre-existing watch flake #124, fixed by PR #215). Refs #206, ADR 0003.

… truly optional (co1 #206 review) co1 request-changes on #206 phase 1 — two blocking gaps in the storage_history contract changes, both addressed: - [blocking 1] The newest-N --limit semantics lived only in the sqlite impl and a contract test, not the spec — a place a future jsonl/redis driver could diverge. spec §2.1 now states it: '--limit N selects the most recent N of the matching set; output is always time order, oldest→newest (the tail of the history, still chronological — never newest-first)', and a driver must honour both selection and ordering. - [blocking 2] The optional <agent> wasn't actually optional at the driver level: the old 'agent=$2; shift 2' misread 'storage_history <team> --limit N' (--limit taken as the agent) and was unsafe for 'storage_history <team>' alone. history.sh hid this by always passing an empty $2, but the contract was wrong. The parser now consumes a leading NON-flag arg as the agent (empty allowed = team-wide) and leaves --flags to the flag loop. Added a contract test: 'storage_history <team> --limit 1' returns exactly one team-wide record as pure JSONL (the flag is not swallowed as the agent). Non-blocking, tracked for the phase-3 / driver-switch checklist (co1): inbox/ check-inbox/history still gate on the agmsg_db_path file-existence check; a jsonl/redis active driver would early-exit when messages.db is absent, so that guard should move to driver-level empty-store semantics before a non-sqlite driver is enabled. Contract 25/25, full suite 411 ok / 1 (pre-existing watch flake #124, resolved by the upcoming rebase onto main / #215). Refs #206, ADR 0003.

The atomic cutover that makes the storage facade authoritative. After phase 1 (reads through the facade) this flips the writers and the live-delivery cursor so ALL message I/O is the event log; the legacy messages table is now read-only back-compat (§2.4). Done as one change because send (writer), mark-read (read-state writer), watch / watch-once (readers), and the codex bridge are all coupled through what was a single read_at + integer-id contract — splitting them would leave read-state or the delivery cursor half-migrated. Writers: - send.sh -> storage_send (a message_sent event). - inbox.sh / check-inbox.sh mark-read -> storage_mark_read_batch (a recipient-scoped message_read event; for a legacy id it records the read without mutating the row). Live delivery (opaque cursor, §2.2): - watch.sh watermark is now the driver's opaque cursor, not an integer id: storage_watch_tip seeds a fresh watcher; storage_watch_after streams new messages + a trailing cursor each poll. The #215 liveness guard is preserved untouched (session-liveness and delivery-position are independent). WATERMARK_FILE holds the token; the dead SQL WHERE_PAIRS builder is gone. - watch-once.sh counts unread via storage_list_unread; its max_id is now an OPAQUE token (greatest unread id), and codex-bridge.js compares it for equality only (stale-wake), never magnitude — parseMaxId/lastWakeMaxId/isStaleWake destringed. Correctness fix (driver): - storage_watch_after wraps its message scan + high-water read in one deferred WAL read transaction. Non-atomic, a row inserted between the two statements could advance the cursor past a message the scan never returned — a silent skip. The snapshot makes the emitted cursor never run ahead of the scan (§2.2 never-skip). - storage_send tries the INSERT first and only inits on failure (#114 pattern), instead of running PRAGMA WAL + CREATEs on every send — fixes lost rows under a concurrent fan-out. - rename-team.sh now rewrites the team in the events table too (best-effort), not just legacy messages, or a rename would orphan post-flip messages. Tests updated for the flip (behaviour is correct; the old assertions encoded the legacy path): delivery 120/124 send via send.sh (watch streams the event log, not direct messages inserts); watch #4 asserts the watermark against the storage tip (_storage_tip, the event high-water) not a messages id; storage fan-out counts events. Full suite 412 ok / 0. Tracked for the driver-switch checklist (co1, non-blocking): inbox/check-inbox/ history still gate on the messages.db file-existence check; move to driver-level empty-store semantics before a non-sqlite driver is active. Refs #206, ADR 0003.

fujibee merged commit 0653e5d into main Jun 24, 2026
5 of 6 checks passed

fujibee deleted the fix/test-watch-sigpipe branch June 24, 2026 05:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(watch): exit on originating-session death so a quiet watcher can't hang (#67, #388)#215

fix(watch): exit on originating-session death so a quiet watcher can't hang (#67, #388)#215
fujibee merged 1 commit into
mainfrom
fix/test-watch-sigpipe

fujibee commented Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fujibee commented Jun 24, 2026

Problem

Fix

Scope / limitation

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant