fix(watch): exit on originating-session death so a quiet watcher can't hang (#67, #388)#215
Merged
Conversation
…t hang (#67, #388) A watcher only notices a gone consumer when its next write fails, but a plain pipe gives no portable way to detect a downstream reader that closed silently: printf '' raises no EPIPE, and macOS buffers a final write into an already-dead reader. So a watcher whose owning session died with a quiet message stream would spin forever — the source of the ~33-minute macOS-runner stall behind #388; the job timeout only caps the symptom. Add a liveness guard at the top of the poll loop: when the agent process embedded in the composite instance id is gone (portable kill -0 via _agmsg_pid_alive, which falls back to tasklist under Git Bash), the watcher exits within ~1 poll interval, even with a silent stream. Gated on a composite instance id only; a bare id (degraded, no resolved agent pid) keeps the prior behavior and relies on the job timeout as a backstop. Rewrite the flaky 'closed consumer does not advance watermark' test: a closed downstream consumer is an unachievable contract on a plain pipe, so assert the observable one instead — the watcher exits on session death without advancing the watermark past an unconsumed row — driven by a controllable stand-in session pid. Also fix a watermark-vs-pidfile setup race in it, move the #93/#66 tests off fabricated dead pids (which the guard now exits) onto live stand-in processes, and make the relaunch kill-then-check deterministic. Closes #67. Fixes #388.
fujibee
added a commit
that referenced
this pull request
Jun 24, 2026
…de (#206 phase 1) #206 migrates the message call-sites off direct SQL onto the storage facade. This first phase moves only the READ/display paths; it is purely additive and keeps the suite green, because the facade's storage_list_unread / storage_history read the event log UNION the legacy messages table, so messages still written to the legacy table by send.sh stay visible. Deliberately deferred to the phase-3 atomic flip: mark-read and send. mark-read stays a legacy read_at UPDATE here — the read-state writers (inbox/check-inbox) and the legacy-read_at readers that have not migrated yet (watch-once, whose codex-bridge consumes an integer max_id cursor for stale-wake detection) are all coupled to read_at and must flip together with send and watch, or read-state splits (a half-migrated mark would be invisible to watch-once). Drawing the boundary at read-only keeps every commit consistent. - inbox.sh / check-inbox.sh: unread now comes from storage_list_unread (JSONL), parsed in one pass with sqlite's JSON funcs (no jq; cf. lib/hooks-json.sh). - history.sh: storage_history (events ∪ legacy). The ●/○ marker is derived by cross-referencing storage_list_unread per recipient (read-state is recipient-scoped, §2.3, so it is not carried on a history record). Team-wide history (no agent) now works via the widened contract below. - driver storage_history: <agent> is optional (omitted = whole team), and --limit returns the most RECENT N in chronological order (was: oldest N). spec §2.1 updated — both are additive (existing agent-scoped calls unchanged). - contract tests +2: team-wide history, and --limit recency/ordering. Contract 24/24, full suite 410 ok / 1 (pre-existing watch flake #124, fixed by PR #215). Refs #206, ADR 0003.
fujibee
added a commit
that referenced
this pull request
Jun 24, 2026
… truly optional (co1 #206 review) co1 request-changes on #206 phase 1 — two blocking gaps in the storage_history contract changes, both addressed: - [blocking 1] The newest-N --limit semantics lived only in the sqlite impl and a contract test, not the spec — a place a future jsonl/redis driver could diverge. spec §2.1 now states it: '--limit N selects the most recent N of the matching set; output is always time order, oldest→newest (the tail of the history, still chronological — never newest-first)', and a driver must honour both selection and ordering. - [blocking 2] The optional <agent> wasn't actually optional at the driver level: the old 'agent=$2; shift 2' misread 'storage_history <team> --limit N' (--limit taken as the agent) and was unsafe for 'storage_history <team>' alone. history.sh hid this by always passing an empty $2, but the contract was wrong. The parser now consumes a leading NON-flag arg as the agent (empty allowed = team-wide) and leaves --flags to the flag loop. Added a contract test: 'storage_history <team> --limit 1' returns exactly one team-wide record as pure JSONL (the flag is not swallowed as the agent). Non-blocking, tracked for the phase-3 / driver-switch checklist (co1): inbox/ check-inbox/history still gate on the agmsg_db_path file-existence check; a jsonl/redis active driver would early-exit when messages.db is absent, so that guard should move to driver-level empty-store semantics before a non-sqlite driver is enabled. Contract 25/25, full suite 411 ok / 1 (pre-existing watch flake #124, resolved by the upcoming rebase onto main / #215). Refs #206, ADR 0003.
fujibee
added a commit
that referenced
this pull request
Jun 24, 2026
…de (#206 phase 1) #206 migrates the message call-sites off direct SQL onto the storage facade. This first phase moves only the READ/display paths; it is purely additive and keeps the suite green, because the facade's storage_list_unread / storage_history read the event log UNION the legacy messages table, so messages still written to the legacy table by send.sh stay visible. Deliberately deferred to the phase-3 atomic flip: mark-read and send. mark-read stays a legacy read_at UPDATE here — the read-state writers (inbox/check-inbox) and the legacy-read_at readers that have not migrated yet (watch-once, whose codex-bridge consumes an integer max_id cursor for stale-wake detection) are all coupled to read_at and must flip together with send and watch, or read-state splits (a half-migrated mark would be invisible to watch-once). Drawing the boundary at read-only keeps every commit consistent. - inbox.sh / check-inbox.sh: unread now comes from storage_list_unread (JSONL), parsed in one pass with sqlite's JSON funcs (no jq; cf. lib/hooks-json.sh). - history.sh: storage_history (events ∪ legacy). The ●/○ marker is derived by cross-referencing storage_list_unread per recipient (read-state is recipient-scoped, §2.3, so it is not carried on a history record). Team-wide history (no agent) now works via the widened contract below. - driver storage_history: <agent> is optional (omitted = whole team), and --limit returns the most RECENT N in chronological order (was: oldest N). spec §2.1 updated — both are additive (existing agent-scoped calls unchanged). - contract tests +2: team-wide history, and --limit recency/ordering. Contract 24/24, full suite 410 ok / 1 (pre-existing watch flake #124, fixed by PR #215). Refs #206, ADR 0003.
fujibee
added a commit
that referenced
this pull request
Jun 24, 2026
… truly optional (co1 #206 review) co1 request-changes on #206 phase 1 — two blocking gaps in the storage_history contract changes, both addressed: - [blocking 1] The newest-N --limit semantics lived only in the sqlite impl and a contract test, not the spec — a place a future jsonl/redis driver could diverge. spec §2.1 now states it: '--limit N selects the most recent N of the matching set; output is always time order, oldest→newest (the tail of the history, still chronological — never newest-first)', and a driver must honour both selection and ordering. - [blocking 2] The optional <agent> wasn't actually optional at the driver level: the old 'agent=$2; shift 2' misread 'storage_history <team> --limit N' (--limit taken as the agent) and was unsafe for 'storage_history <team>' alone. history.sh hid this by always passing an empty $2, but the contract was wrong. The parser now consumes a leading NON-flag arg as the agent (empty allowed = team-wide) and leaves --flags to the flag loop. Added a contract test: 'storage_history <team> --limit 1' returns exactly one team-wide record as pure JSONL (the flag is not swallowed as the agent). Non-blocking, tracked for the phase-3 / driver-switch checklist (co1): inbox/ check-inbox/history still gate on the agmsg_db_path file-existence check; a jsonl/redis active driver would early-exit when messages.db is absent, so that guard should move to driver-level empty-store semantics before a non-sqlite driver is enabled. Contract 25/25, full suite 411 ok / 1 (pre-existing watch flake #124, resolved by the upcoming rebase onto main / #215). Refs #206, ADR 0003.
fujibee
added a commit
that referenced
this pull request
Jun 24, 2026
The atomic cutover that makes the storage facade authoritative. After phase 1 (reads through the facade) this flips the writers and the live-delivery cursor so ALL message I/O is the event log; the legacy messages table is now read-only back-compat (§2.4). Done as one change because send (writer), mark-read (read-state writer), watch / watch-once (readers), and the codex bridge are all coupled through what was a single read_at + integer-id contract — splitting them would leave read-state or the delivery cursor half-migrated. Writers: - send.sh -> storage_send (a message_sent event). - inbox.sh / check-inbox.sh mark-read -> storage_mark_read_batch (a recipient-scoped message_read event; for a legacy id it records the read without mutating the row). Live delivery (opaque cursor, §2.2): - watch.sh watermark is now the driver's opaque cursor, not an integer id: storage_watch_tip seeds a fresh watcher; storage_watch_after streams new messages + a trailing cursor each poll. The #215 liveness guard is preserved untouched (session-liveness and delivery-position are independent). WATERMARK_FILE holds the token; the dead SQL WHERE_PAIRS builder is gone. - watch-once.sh counts unread via storage_list_unread; its max_id is now an OPAQUE token (greatest unread id), and codex-bridge.js compares it for equality only (stale-wake), never magnitude — parseMaxId/lastWakeMaxId/isStaleWake destringed. Correctness fix (driver): - storage_watch_after wraps its message scan + high-water read in one deferred WAL read transaction. Non-atomic, a row inserted between the two statements could advance the cursor past a message the scan never returned — a silent skip. The snapshot makes the emitted cursor never run ahead of the scan (§2.2 never-skip). - storage_send tries the INSERT first and only inits on failure (#114 pattern), instead of running PRAGMA WAL + CREATEs on every send — fixes lost rows under a concurrent fan-out. - rename-team.sh now rewrites the team in the events table too (best-effort), not just legacy messages, or a rename would orphan post-flip messages. Tests updated for the flip (behaviour is correct; the old assertions encoded the legacy path): delivery 120/124 send via send.sh (watch streams the event log, not direct messages inserts); watch #4 asserts the watermark against the storage tip (_storage_tip, the event high-water) not a messages id; storage fan-out counts events. Full suite 412 ok / 0. Tracked for the driver-switch checklist (co1, non-blocking): inbox/check-inbox/ history still gate on the messages.db file-existence check; move to driver-level empty-store semantics before a non-sqlite driver is active. Refs #206, ADR 0003.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The
watch:SIGPIPE tests flake on Linux and fail deterministically on macOS, and — more importantly — a real watcher whose owning session has died can spin forever when its message stream is quiet. That silent spin is the source of the ~33-minutewindows/macosrunner stall behind #388 (the job timeout only caps the symptom).A watcher only notices a gone consumer when its next write fails. But a plain pipe offers no portable way to detect a downstream reader that closed silently:
printf ''raises noEPIPE, and macOS buffers a final write into an already-dead reader. So if no further message arrives, the watcher never writes again, never sees the broken pipe, and loops indefinitely.Fix
watch.sh— add a liveness guard at the top of the poll loop. When the agent process embedded in the composite instance id is gone (portablekill -0via_agmsg_pid_alive, which falls back totasklistunder Git Bash), the watcher exits within ~1 poll interval, even with a silent stream. Gated on a composite instance id only. Closes watch.sh: self-exit when owning CC session is no longer alive #67.#93/#66relaunch/independence tests are moved off fabricated dead pids (which the new guard now exits) onto live stand-in processes, and the relaunch's kill-then-check is made deterministic.Scope / limitation
Delivery and watermark semantics are unchanged — that is the storage delivery-cursor work, not this PR. A bare instance id (the degraded fallback, used when the agent pid can't be resolved) is not liveness-gated; for that case the CI/job
timeout-minutesremains the backstop.Verification
tests/test_watch.bats: 13/13 across repeated runs.Closes #67. Fixes #388.