Skip to content

fix(watch): exit on originating-session death so a quiet watcher can't hang (#67, #388)#215

Merged
fujibee merged 1 commit into
mainfrom
fix/test-watch-sigpipe
Jun 24, 2026
Merged

fix(watch): exit on originating-session death so a quiet watcher can't hang (#67, #388)#215
fujibee merged 1 commit into
mainfrom
fix/test-watch-sigpipe

Conversation

@fujibee

@fujibee fujibee commented Jun 24, 2026

Copy link
Copy Markdown
Owner

Problem

The watch: SIGPIPE tests flake on Linux and fail deterministically on macOS, and — more importantly — a real watcher whose owning session has died can spin forever when its message stream is quiet. That silent spin is the source of the ~33-minute windows/macos runner stall behind #388 (the job timeout only caps the symptom).

A watcher only notices a gone consumer when its next write fails. But a plain pipe offers no portable way to detect a downstream reader that closed silently: printf '' raises no EPIPE, and macOS buffers a final write into an already-dead reader. So if no further message arrives, the watcher never writes again, never sees the broken pipe, and loops indefinitely.

Fix

  • watch.sh — add a liveness guard at the top of the poll loop. When the agent process embedded in the composite instance id is gone (portable kill -0 via _agmsg_pid_alive, which falls back to tasklist under Git Bash), the watcher exits within ~1 poll interval, even with a silent stream. Gated on a composite instance id only. Closes watch.sh: self-exit when owning CC session is no longer alive #67.
  • Tests — the old "closed consumer does not advance watermark" assertion is an unachievable contract on a plain pipe (see above), so it is rewritten to the observable one: the watcher exits on session death without advancing the watermark past an unconsumed row, driven by a controllable stand-in session pid. A watermark-vs-pidfile setup race in it is also fixed. The #93/#66 relaunch/independence tests are moved off fabricated dead pids (which the new guard now exits) onto live stand-in processes, and the relaunch's kill-then-check is made deterministic.

Scope / limitation

Delivery and watermark semantics are unchanged — that is the storage delivery-cursor work, not this PR. A bare instance id (the degraded fallback, used when the agent pid can't be resolved) is not liveness-gated; for that case the CI/job timeout-minutes remains the backstop.

Verification

  • tests/test_watch.bats: 13/13 across repeated runs.
  • The previously 100%-failing case (macOS): 20/20 fail → 20/20 pass.
  • Full bats suite: 387/0.

Closes #67. Fixes #388.

…t hang (#67, #388)

A watcher only notices a gone consumer when its next write fails, but a plain
pipe gives no portable way to detect a downstream reader that closed silently:
printf '' raises no EPIPE, and macOS buffers a final write into an already-dead
reader. So a watcher whose owning session died with a quiet message stream would
spin forever — the source of the ~33-minute macOS-runner stall behind #388; the
job timeout only caps the symptom.

Add a liveness guard at the top of the poll loop: when the agent process embedded
in the composite instance id is gone (portable kill -0 via _agmsg_pid_alive,
which falls back to tasklist under Git Bash), the watcher exits within ~1 poll
interval, even with a silent stream. Gated on a composite instance id only; a
bare id (degraded, no resolved agent pid) keeps the prior behavior and relies on
the job timeout as a backstop.

Rewrite the flaky 'closed consumer does not advance watermark' test: a closed
downstream consumer is an unachievable contract on a plain pipe, so assert the
observable one instead — the watcher exits on session death without advancing the
watermark past an unconsumed row — driven by a controllable stand-in session pid.
Also fix a watermark-vs-pidfile setup race in it, move the #93/#66 tests off
fabricated dead pids (which the guard now exits) onto live stand-in processes,
and make the relaunch kill-then-check deterministic.

Closes #67. Fixes #388.
@fujibee fujibee merged commit 0653e5d into main Jun 24, 2026
5 of 6 checks passed
@fujibee fujibee deleted the fix/test-watch-sigpipe branch June 24, 2026 05:42
fujibee added a commit that referenced this pull request Jun 24, 2026
…de (#206 phase 1)

#206 migrates the message call-sites off direct SQL onto the storage facade. This
first phase moves only the READ/display paths; it is purely additive and keeps the
suite green, because the facade's storage_list_unread / storage_history read the
event log UNION the legacy messages table, so messages still written to the legacy
table by send.sh stay visible.

Deliberately deferred to the phase-3 atomic flip: mark-read and send. mark-read
stays a legacy read_at UPDATE here — the read-state writers (inbox/check-inbox) and
the legacy-read_at readers that have not migrated yet (watch-once, whose
codex-bridge consumes an integer max_id cursor for stale-wake detection) are all
coupled to read_at and must flip together with send and watch, or read-state
splits (a half-migrated mark would be invisible to watch-once). Drawing the
boundary at read-only keeps every commit consistent.

- inbox.sh / check-inbox.sh: unread now comes from storage_list_unread (JSONL),
  parsed in one pass with sqlite's JSON funcs (no jq; cf. lib/hooks-json.sh).
- history.sh: storage_history (events ∪ legacy). The ●/○ marker is derived by
  cross-referencing storage_list_unread per recipient (read-state is
  recipient-scoped, §2.3, so it is not carried on a history record). Team-wide
  history (no agent) now works via the widened contract below.
- driver storage_history: <agent> is optional (omitted = whole team), and --limit
  returns the most RECENT N in chronological order (was: oldest N). spec §2.1
  updated — both are additive (existing agent-scoped calls unchanged).
- contract tests +2: team-wide history, and --limit recency/ordering.

Contract 24/24, full suite 410 ok / 1 (pre-existing watch flake #124, fixed by
PR #215). Refs #206, ADR 0003.
fujibee added a commit that referenced this pull request Jun 24, 2026
… truly optional (co1 #206 review)

co1 request-changes on #206 phase 1 — two blocking gaps in the storage_history
contract changes, both addressed:

- [blocking 1] The newest-N --limit semantics lived only in the sqlite impl and a
  contract test, not the spec — a place a future jsonl/redis driver could diverge.
  spec §2.1 now states it: '--limit N selects the most recent N of the matching
  set; output is always time order, oldest→newest (the tail of the history, still
  chronological — never newest-first)', and a driver must honour both selection
  and ordering.

- [blocking 2] The optional <agent> wasn't actually optional at the driver level:
  the old 'agent=$2; shift 2' misread 'storage_history <team> --limit N' (--limit
  taken as the agent) and was unsafe for 'storage_history <team>' alone. history.sh
  hid this by always passing an empty $2, but the contract was wrong. The parser
  now consumes a leading NON-flag arg as the agent (empty allowed = team-wide) and
  leaves --flags to the flag loop. Added a contract test: 'storage_history <team>
  --limit 1' returns exactly one team-wide record as pure JSONL (the flag is not
  swallowed as the agent).

Non-blocking, tracked for the phase-3 / driver-switch checklist (co1): inbox/
check-inbox/history still gate on the agmsg_db_path file-existence check; a
jsonl/redis active driver would early-exit when messages.db is absent, so that
guard should move to driver-level empty-store semantics before a non-sqlite driver
is enabled.

Contract 25/25, full suite 411 ok / 1 (pre-existing watch flake #124, resolved by
the upcoming rebase onto main / #215). Refs #206, ADR 0003.
fujibee added a commit that referenced this pull request Jun 24, 2026
…de (#206 phase 1)

#206 migrates the message call-sites off direct SQL onto the storage facade. This
first phase moves only the READ/display paths; it is purely additive and keeps the
suite green, because the facade's storage_list_unread / storage_history read the
event log UNION the legacy messages table, so messages still written to the legacy
table by send.sh stay visible.

Deliberately deferred to the phase-3 atomic flip: mark-read and send. mark-read
stays a legacy read_at UPDATE here — the read-state writers (inbox/check-inbox) and
the legacy-read_at readers that have not migrated yet (watch-once, whose
codex-bridge consumes an integer max_id cursor for stale-wake detection) are all
coupled to read_at and must flip together with send and watch, or read-state
splits (a half-migrated mark would be invisible to watch-once). Drawing the
boundary at read-only keeps every commit consistent.

- inbox.sh / check-inbox.sh: unread now comes from storage_list_unread (JSONL),
  parsed in one pass with sqlite's JSON funcs (no jq; cf. lib/hooks-json.sh).
- history.sh: storage_history (events ∪ legacy). The ●/○ marker is derived by
  cross-referencing storage_list_unread per recipient (read-state is
  recipient-scoped, §2.3, so it is not carried on a history record). Team-wide
  history (no agent) now works via the widened contract below.
- driver storage_history: <agent> is optional (omitted = whole team), and --limit
  returns the most RECENT N in chronological order (was: oldest N). spec §2.1
  updated — both are additive (existing agent-scoped calls unchanged).
- contract tests +2: team-wide history, and --limit recency/ordering.

Contract 24/24, full suite 410 ok / 1 (pre-existing watch flake #124, fixed by
PR #215). Refs #206, ADR 0003.
fujibee added a commit that referenced this pull request Jun 24, 2026
… truly optional (co1 #206 review)

co1 request-changes on #206 phase 1 — two blocking gaps in the storage_history
contract changes, both addressed:

- [blocking 1] The newest-N --limit semantics lived only in the sqlite impl and a
  contract test, not the spec — a place a future jsonl/redis driver could diverge.
  spec §2.1 now states it: '--limit N selects the most recent N of the matching
  set; output is always time order, oldest→newest (the tail of the history, still
  chronological — never newest-first)', and a driver must honour both selection
  and ordering.

- [blocking 2] The optional <agent> wasn't actually optional at the driver level:
  the old 'agent=$2; shift 2' misread 'storage_history <team> --limit N' (--limit
  taken as the agent) and was unsafe for 'storage_history <team>' alone. history.sh
  hid this by always passing an empty $2, but the contract was wrong. The parser
  now consumes a leading NON-flag arg as the agent (empty allowed = team-wide) and
  leaves --flags to the flag loop. Added a contract test: 'storage_history <team>
  --limit 1' returns exactly one team-wide record as pure JSONL (the flag is not
  swallowed as the agent).

Non-blocking, tracked for the phase-3 / driver-switch checklist (co1): inbox/
check-inbox/history still gate on the agmsg_db_path file-existence check; a
jsonl/redis active driver would early-exit when messages.db is absent, so that
guard should move to driver-level empty-store semantics before a non-sqlite driver
is enabled.

Contract 25/25, full suite 411 ok / 1 (pre-existing watch flake #124, resolved by
the upcoming rebase onto main / #215). Refs #206, ADR 0003.
fujibee added a commit that referenced this pull request Jun 24, 2026
The atomic cutover that makes the storage facade authoritative. After phase 1
(reads through the facade) this flips the writers and the live-delivery cursor so
ALL message I/O is the event log; the legacy messages table is now read-only
back-compat (§2.4). Done as one change because send (writer), mark-read (read-state
writer), watch / watch-once (readers), and the codex bridge are all coupled
through what was a single read_at + integer-id contract — splitting them would
leave read-state or the delivery cursor half-migrated.

Writers:
- send.sh -> storage_send (a message_sent event).
- inbox.sh / check-inbox.sh mark-read -> storage_mark_read_batch (a recipient-scoped
  message_read event; for a legacy id it records the read without mutating the row).

Live delivery (opaque cursor, §2.2):
- watch.sh watermark is now the driver's opaque cursor, not an integer id:
  storage_watch_tip seeds a fresh watcher; storage_watch_after streams new messages
  + a trailing cursor each poll. The #215 liveness guard is preserved untouched
  (session-liveness and delivery-position are independent). WATERMARK_FILE holds the
  token; the dead SQL WHERE_PAIRS builder is gone.
- watch-once.sh counts unread via storage_list_unread; its max_id is now an OPAQUE
  token (greatest unread id), and codex-bridge.js compares it for equality only
  (stale-wake), never magnitude — parseMaxId/lastWakeMaxId/isStaleWake destringed.

Correctness fix (driver):
- storage_watch_after wraps its message scan + high-water read in one deferred WAL
  read transaction. Non-atomic, a row inserted between the two statements could
  advance the cursor past a message the scan never returned — a silent skip. The
  snapshot makes the emitted cursor never run ahead of the scan (§2.2 never-skip).
- storage_send tries the INSERT first and only inits on failure (#114 pattern),
  instead of running PRAGMA WAL + CREATEs on every send — fixes lost rows under a
  concurrent fan-out.
- rename-team.sh now rewrites the team in the events table too (best-effort), not
  just legacy messages, or a rename would orphan post-flip messages.

Tests updated for the flip (behaviour is correct; the old assertions encoded the
legacy path): delivery 120/124 send via send.sh (watch streams the event log, not
direct messages inserts); watch #4 asserts the watermark against the storage tip
(_storage_tip, the event high-water) not a messages id; storage fan-out counts
events. Full suite 412 ok / 0.

Tracked for the driver-switch checklist (co1, non-blocking): inbox/check-inbox/
history still gate on the messages.db file-existence check; move to driver-level
empty-store semantics before a non-sqlite driver is active. Refs #206, ADR 0003.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

watch.sh: self-exit when owning CC session is no longer alive

1 participant