feat: re-queue remaining chain steps on global-deadline expiry #70

Open
JeanBaptisteRenard wants to merge 89 commits into doctly:main from JeanBaptisteRenard:feat/chain-requeue-on-timeout

@JeanBaptisteRenard

Summary

  • When a chain trigger's global deadline fires with unsent steps remaining, write a new trigger file for the remaining steps with wait: idle so they are processed when the session next becomes free
  • MAX_REQUEUE = 2 cap prevents unbounded retry loops; requeue_count field tracks depth and is validated (non-negative integer, absent → 0)
  • All four chain-timeout sites participate: initial idle-wait timeout, before-step deadline, step-verify timedOut, step busy-fall timedOut
  • Timed-out result shape preserved (ok: false, error: 'chain timeout', partial: true) with requeued: true + requeue_trigger: <filename> extras, or requeue_exhausted: true at cap
  • Steps already sent are never re-queued (slice starts at first unsent step)
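
The re-queue decision summarised above can be sketched roughly as follows. This is an illustration, not the code in this PR: `planRequeue`, its parameters, and the returned `{ result, requeue }` shape are hypothetical. Only `MAX_REQUEUE`, `requeue_count`, `wait: idle`, and the result fields are taken from the summary.

```javascript
// Sketch only: planRequeue and its return shape are hypothetical.
const MAX_REQUEUE = 2; // cap named in the summary

// `trigger` is the parsed trigger object, `firstUnsent` the index of the
// first chain step that was never written to the PTY, `newFilename` the
// name chosen for the follow-up trigger file.
function planRequeue(trigger, firstUnsent, newFilename) {
  const base = { ok: false, error: 'chain timeout', partial: true };
  const remaining = trigger.chain.slice(firstUnsent); // sent steps are never re-queued
  if (remaining.length === 0) return { result: base, requeue: null };

  const count = trigger.requeue_count ?? 0; // absent is treated as 0
  if (count >= MAX_REQUEUE) {
    return { result: { ...base, requeue_exhausted: true }, requeue: null };
  }
  const requeue = {
    ...trigger,            // carries over whatever other fields the trigger had
    chain: remaining,
    wait: 'idle',
    requeue_count: count + 1,
  };
  return {
    result: { ...base, requeued: true, requeue_trigger: newFilename },
    requeue,
  };
}
```

Returning the follow-up trigger as data, rather than writing the file inside the function, keeps the decision testable without touching the filesystem.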

Test plan

  • W9-1: chain timeout with unsent steps → requeued:true + new trigger file with correct shape
  • W9-2: initial idle-wait timeout (all steps unsent) → entire chain re-queued
  • W9-3: requeue_count >= MAX_REQUEUE → requeue_exhausted:true, no new file
  • W9-4: negative requeue_count → validation error
  • W9-5: float requeue_count → validation error
  • W9-6: string requeue_count → validation error
  • W9-7: absent requeue_count → treated as 0, no error
  • W9-8: completed chain → no re-queue file written
  • Full suite: 49/49 pass (0 failures)
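
Cases W9-4 through W9-7 amount to a small validation rule. A minimal sketch, assuming a hypothetical `parseRequeueCount` helper and an error message of my own wording:

```javascript
// Hypothetical validator mirroring W9-4..W9-7; not the PR's actual code.
function parseRequeueCount(raw) {
  if (raw === undefined) return { ok: true, value: 0 }; // W9-7: absent is 0
  if (typeof raw !== 'number' || !Number.isInteger(raw) || raw < 0) {
    // W9-4 (negative), W9-5 (float), W9-6 (string)
    return { ok: false, error: 'requeue_count must be a non-negative integer' };
  }
  return { ok: true, value: raw };
}
```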

Fixes witnessed starvation: trigger e252d69c — step 0 (/compact) sent, step 1 never sent due to parallel */4 cron re-busying the session, result {ok:false, error:"chain timeout", partial:true, steps_completed:0}.

JeanBaptisteRenard and others added 30 commits May 21, 2026 23:02
Subagent transcripts written by Claude CLI live at
<folder>/<parentSessionId>/subagents/agent-<agentId>.jsonl alongside a
.meta.json sidecar holding { agentType, description }. Surface them as
first-class rows in session_cache, keyed by sessionId 'sub:<parent>:<agentId>'.

- adds parentSessionId, agentId, subagentType, description columns
- migration v4 clears the cache so a re-index repopulates everything
- adds idx_session_cache_parent for hierarchy lookups
- new query getCachedByParent + widened cacheGetByFolder

Adds helpers for the on-disk subagent layout:
- enumerateSessionFiles(folderPath) walks top-level + <uuid>/subagents/*.jsonl
  + legacy <uuid>/*.jsonl fallback
- subagentSessionId(parent, agentId) gives the synthetic 'sub:<p>:<a>' id
- resolveJsonlPath(projectsDir, row) reconstructs the absolute path
- readSubagentMeta reads the .meta.json sidecar

readSessionFile now accepts opts.parentSessionId. When set, it requires
isSidechain on at least one entry (defensive guard), reads the sidecar for
agentType/description, falls back to filename for agentId if absent in entries.

Adds 10 unit tests covering both branches, the sidecar guard, the legacy
fallback path, and the synthetic-id format.
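
For orientation, the synthetic id and the on-disk path described above could look like this. The id format and directory layout come from the commit text; `parseSubagentSessionId` and `subagentJsonlPath` are hypothetical helpers added for illustration, and the real `subagentSessionId` may differ in detail.

```javascript
const path = require('node:path');

// Synthetic cache key for a subagent row: 'sub:<parent>:<agentId>'.
function subagentSessionId(parentSessionId, agentId) {
  return `sub:${parentSessionId}:${agentId}`;
}

// Hypothetical inverse; assumes the parent id itself contains no ':'.
function parseSubagentSessionId(sessionId) {
  const m = /^sub:([^:]+):(.+)$/.exec(sessionId);
  return m ? { parentSessionId: m[1], agentId: m[2] } : null;
}

// Layout from the commit: <folder>/<parentSessionId>/subagents/agent-<agentId>.jsonl
function subagentJsonlPath(folderPath, parentSessionId, agentId) {
  return path.join(folderPath, parentSessionId, 'subagents', `agent-${agentId}.jsonl`);
}
```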
Both scanners (synchronous refreshFolder + worker scan-projects) now use
enumerateSessionFiles instead of a flat top-level readdir, so subagent
transcripts get cached as first-class rows with parentSessionId set.

FTS indexing inherits this for free: subagent summaries land in search_fts
the same way top-level sessions do, with type='session' and folder=<parent>.
Full-text search now finds anything an Agent call discussed.

Clicking an Agent tool_use block in the JSONL viewer now loads the matching
subagent transcript and renders it nested below the block. Re-click collapses
without refetching.

- main.js: new IPC read-subagent-jsonl(parentSessionId, agentId) and
  list-subagents(parentSessionId)
- preload.js: window.api.readSubagentJsonl / listSubagents
- jsonl-viewer.js: clickable Agent block, caret indicator, recursive nested
  render. Identical (description, subagentType) pairs are disambiguated by
  occurrence ordinal so parallel fanout calls each open their own transcript.
- style.css: subtle hover + left-border indent for nested transcripts

Runs `npm ci && npm test` on ubuntu-latest against Node 20 and 22.
Triggers on pull requests targeting main and on direct pushes to main.

Migrations that clear session_cache (v2, v3, v4) leave the cache empty on
first launch. The previous get-projects handler returned [] immediately and
fire-and-forgot the worker scan, relying on a later 'projects-changed' IPC
to trigger a reload — but app.js only reloads on that event when the user
happens to be on the Sessions tab. If they were on another tab (or
hadn't navigated to Sessions yet), the list stayed empty until they
manually switched tabs.

Make populateCacheViaWorker return a Promise that resolves when the scan
finishes. Concurrent callers share the same Promise. get-projects awaits
that Promise when the cache is empty, so the response carries the freshly
populated cache and the renderer paints immediately.
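
The "concurrent callers share the same Promise" behaviour is a common in-flight memoisation pattern. A minimal sketch, where `makeSharedScan` and `runScan` are hypothetical stand-ins for the real worker launch:

```javascript
// Sketch: `runScan` stands in for starting the worker scan (hypothetical).
function makeSharedScan(runScan) {
  let inFlight = null;
  return function populate() {
    if (!inFlight) {
      inFlight = Promise.resolve()
        .then(runScan)
        .finally(() => { inFlight = null; }); // a later call may start a new scan
    }
    return inFlight; // every caller during the scan gets the same Promise
  };
}
```

A handler can then `await populate()` when the cache is empty and respond with the populated data.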
buildProjectsFromCache was projecting only the pre-PR#1 columns onto the
session object sent to the renderer. The renderer's hierarchical sidebar
(introduced in PR#2) reads parentSessionId/agentId/subagentType/description
from each session and silently flattens to top-level when they're missing —
which is what happened: all subagents appeared as siblings of their parent
instead of nested under it.

Add the four subagent columns to the projection. No DB or wire change.

Subagent support: index, search, and inline-expand transcripts (fork CI)

readSessionFile JSON.parse'd every line of a session file in a single
outer try/catch. If a Claude CLI session was actively writing the file
while the worker scan read it, one mid-write line could throw and
invalidate the ENTIRE file's session row. With many parallel live sessions
this manifested as 'most projects show no sessions after a fresh index'.

- Per-line try/catch inside readSessionFile: skip malformed lines, keep
  parsing the rest.
- Per-file try/catch in workers/scan-projects.js: defensive belt and
  braces so one unparseable file can't abort an entire folder scan.
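
The per-line tolerance can be illustrated with a small standalone parser. This is a sketch of the idea, not the project's `readSessionFile`; `parseJsonlLines` and the `skipped` counter are hypothetical.

```javascript
// Parse JSONL text, skipping lines that fail to parse instead of
// rejecting the whole file.
function parseJsonlLines(text) {
  const entries = [];
  let skipped = 0;
  for (const line of text.split('\n')) {
    if (line.trim() === '') continue;
    try {
      entries.push(JSON.parse(line));
    } catch {
      skipped += 1; // e.g. a line caught mid-write by a live session
    }
  }
  return { entries, skipped };
}
```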
resolveWorktreePath was defined and exported in the file but never
actually called from deriveProjectPath. As a result every worktree under
<project>/.claude/worktrees/<name>/ appeared as a separate project group
in the sidebar instead of being grouped under its parent project.

Wire the call in both branches of deriveProjectPath (direct .jsonl path
and subdirectory path) and export resolveWorktreePath for other callers.
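
As an illustration of the collapsing behaviour, the following re-implementation maps a worktree path onto its parent project. It is an assumption-laden sketch: the real `resolveWorktreePath` may use a different regex, this one handles POSIX separators only, and it assumes all three layouts place the worktree one level below the marker directory.

```javascript
// Hypothetical re-implementation; POSIX-style paths only.
const WORKTREE_RE =
  /^(.*?)\/(?:\.claude\/worktrees|\.claude-worktrees|\.worktrees)\/[^/]+(?:\/.*)?$/;

function resolveWorktreePath(p) {
  const m = WORKTREE_RE.exec(p);
  return m ? m[1] : p; // non-worktree paths pass through unchanged
}
```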
session-transitions.js now tracks per-active-session knownSubagents and emits
two new IPC events:
- subagent-spawned (parentSessionId, agentId, subagentType, description) when
  a new agent-*.jsonl file appears under <sid>/subagents/
- subagent-completed (parentSessionId, agentId) when an existing file's mtime
  has been stable for >30s

Adds two IPCs for read-only live tailing:
- start-subagent-watch (parent, agentId) -> watchId
- stop-subagent-watch (watchId)

The watcher streams new JSONL entries via subagent-watch-event so the renderer
can append them to an open inline-expanded subagent transcript without polling.

- sidebar.js: subagents are no longer flat siblings of their parent. Each
  top-level row now shows a 'N subagents' affordance with a disclosure caret;
  expanded children render indented with a subagentType pill, description as
  label, and persist expansion state per parent in localStorage. Orphan
  subagents (parent absent from cache) hoist to an 'Orphan subagents' group.
- grid-view.js: each active session card now shows a stack of small pills
  for currently-running subagents, color-coded by subagentType, capped at
  5 with a '+N more' overflow. Listens on onSubagentSpawned/onSubagentCompleted.
- jsonl-viewer.js: when an inline-expanded subagent block represents a still-
  running agent, start an fs.watch via the new IPC and append streamed
  entries; show a small '● live' indicator until completion. Stops the
  watch on collapse or subagent-completed.
- style.css: minimal styles for new sidebar/grid/live elements, matching
  existing palette and font weights.

When Switchboard attaches to a session that already has many subagent
files on disk (e.g. a long-running session with 100+ Agent calls), the
first walk of detectSubagentTransitions treated every existing file as a
'first sighting' and emitted subagent-spawned for each. 30s later it
emitted subagent-completed for each. Hundreds of IPC events back to back
froze the renderer UI.

Distinguish bootstrap (first walk for this session) from steady-state.
On bootstrap, record every existing file in knownSubagents silently:
- files modified in the last 60s stay in the active lifecycle (could be
  mid-run, will eventually fire subagent-completed)
- older files are marked completed immediately with no IPC

Only files that appear AFTER the bootstrap walk fire subagent-spawned.
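
The bootstrap versus steady-state distinction might be sketched like this. The `state` object, the `files` shape, and the event object are hypothetical; completion detection (mtime stable for more than 30s) is left out to keep the example short.

```javascript
const BOOTSTRAP_ACTIVE_WINDOW_MS = 60_000; // "modified in the last 60s"

// `files` is [{ agentId, mtimeMs }]; `state` is { bootstrapped, known: Map }.
function detectSubagentTransitions(state, files, now) {
  const events = [];
  const bootstrap = !state.bootstrapped;
  for (const f of files) {
    if (state.known.has(f.agentId)) continue;
    if (bootstrap) {
      // First walk: record silently, never emit an IPC event.
      const completed = now - f.mtimeMs > BOOTSTRAP_ACTIVE_WINDOW_MS;
      state.known.set(f.agentId, { completed });
    } else {
      state.known.set(f.agentId, { completed: false });
      events.push({ type: 'subagent-spawned', agentId: f.agentId });
    }
  }
  state.bootstrapped = true;
  return events;
}
```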
…lity

Subagent observability: hierarchy, transitions, badges, live tail (fork CI)
…om disk

The existing 'Hide worktree' affordance only added the path to hiddenProjects;
on-disk worktrees accumulated. This adds a real delete action.

- main.js: new IPC delete-worktree(path). Validates the path matches the
  worktree-layout regex (.claude/worktrees, .claude-worktrees, .worktrees),
  runs 'git worktree remove -f' via execFile (no shell), retries with '-f -f'
  on locked worktrees, then purges any matching session_cache rows and FTS
  entries. Returns { ok, removed } / { ok: false, error }.
- preload.js: window.api.deleteWorktree binding.
- sidebar.js: 'Delete worktree' button on worktree sub-groups, with confirm
  dialog. Keeps 'Hide worktree' for the cosmetic case.

Hide is unchanged; Delete is opt-in and confirmed before running.

Makes it possible to tell the dev build apart from a released installed
binary in the OS task switcher, dock, About menu, and window title.
No-op in packaged builds.

Adds 14 unit tests for two load-bearing behaviours shipped via PR#1/#2
that previously had no coverage:

- derive-project-path.js — resolveWorktreePath collapses worktree paths
  (.worktrees, .claude-worktrees, .claude/worktrees) onto the parent
  project; deriveProjectPath end-to-end via stubbed jsonl
- session-transitions.js — bootstrap silent-init (no spawn/complete
  events on first walk), post-bootstrap spawn detection, completion
  timing
…ons-worktree

test: coverage for resolveWorktreePath + subagent cold-start

refreshFolder used a linear scan of cachedMap for every on-disk file
(O(N²) on a folder with N cached sessions). For projects with thousands
of subagent transcripts this froze the main process on every fs.watch
flush — fs.watch fires often while live Claude sessions append JSONL,
so the freeze recurred every 500ms (the debounce interval).

Build an inverted filePath → dbId index once and look up O(1) per file.

Confirmed: 24/24 tests still pass.
Even with the O(1) lookup, refreshFolder still ran enumerateSessionFiles
(many readdirSyncs) and fs.statSync on every file in the folder on
every flush. For projects with thousands of cached subagents and many
concurrent active agents writing JSONL, the main process kept blocking
on syscalls every 500ms.

The watcher already knows which file changed. Plumb that information
through to refreshFolder via opts.files; in targeted mode, skip
enumerateSessionFiles entirely and only stat the dirty paths.

- main.js: pendingFolders (Set) → pendingChanges (Map<folder, Set<relPath>|true>)
- session-cache.js: refreshFolder(folder, { files }) — when files is a
  non-empty Set, scan only those entries; otherwise full walk
- Targeted-mode deletion: rely on per-file ENOENT in statSync, skip the
  whole-folder GC sweep (no longer accurate when we only saw a subset)
- Falls back to full walk for folder-level events / bootstrap

24/24 tests still pass.
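
The `pendingChanges` shape, `Map<folder, Set<relPath> | true>`, could be maintained along these lines. `recordChange` and `drainChanges` are hypothetical names; only the map shape and the `{ files }` option are taken from the commit.

```javascript
// `true` means "folder-level event, do a full walk".
const pendingChanges = new Map();

function recordChange(folder, relPath) {
  if (pendingChanges.get(folder) === true) return; // already escalated
  if (!relPath) { pendingChanges.set(folder, true); return; }
  let set = pendingChanges.get(folder);
  if (!set) pendingChanges.set(folder, (set = new Set()));
  set.add(relPath);
}

// On debounce flush: turn the map into refreshFolder-style calls.
function drainChanges() {
  const batch = [...pendingChanges].map(([folder, v]) =>
    v === true ? { folder, opts: {} } : { folder, opts: { files: v } });
  pendingChanges.clear();
  return batch;
}
```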
Live Claude session JSONLs grow without bound — observed a 218 MB
host-session file in a real workload. refreshFolder ran
fs.readFileSync + JSON.parse on the full file every time the
watcher fired (every ~500 ms while the session was being written),
making the main process freeze for 1-2 seconds at a time.

When a file is already cached and now exceeds HUGE_FILE_BYTES
(5 MB), skip the re-read entirely: just bump the modified timestamp
in the DB so the sidebar shows activity. Summary/slug/title were
captured when the file was smaller and rarely change after the first
turn; the next cold start or a shrink below threshold refreshes them.

Adds db.touchCachedModified(sessionId, modified) — a one-row UPDATE
that avoids the 14-column upsert when only mtime matters.

24/24 tests still pass.
…w ones

User feedback: "for display we shouldn't need to go that deep into the
file — full read should only happen for search". Implements that split.

readSessionDisplayHeader: stream-reads the first ~256 KB / 500 lines and
extracts only what the sidebar needs (summary, slug, customTitle,
aiTitle, agentId, isSidechain marker, subagent sidecar). No textContent,
no messageCount. ~1 ms even for a 200 MB host-session JSONL.

refreshFolder flow:
- NEW file (no cache row) → full readSessionFile, seeds FTS body
- EXISTING file → header-only refresh, merges fresh display fields with
  cached body. NO FTS write — search index for live sessions lags until
  cold-start
- Header fails → mtime-only touch as last resort

cacheGetByFolder widened to SELECT * so refresh can merge unchanged
fields (created, messageCount, textContent) without re-reading.

Drops the HUGE_FILE_BYTES hack from the previous commit — the header
approach handles size uniformly so no special-casing.

24/24 tests still pass.
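
To show why a header-only read is cheap regardless of file size, here is a bounded read of the first 256 KB / 500 lines. The commit describes a streaming read; this sketch swaps in a single synchronous read for brevity, and `readHeaderEntries` is a hypothetical name.

```javascript
const fs = require('node:fs');

const HEADER_MAX_BYTES = 256 * 1024;
const HEADER_MAX_LINES = 500;

// Reads at most HEADER_MAX_BYTES from the start of the file, so the cost
// does not depend on how large the JSONL has grown.
function readHeaderEntries(filePath) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const buf = Buffer.alloc(HEADER_MAX_BYTES);
    const n = fs.readSync(fd, buf, 0, HEADER_MAX_BYTES, 0);
    const lines = buf.toString('utf8', 0, n).split('\n');
    if (n === HEADER_MAX_BYTES) lines.pop(); // last line may be cut mid-record
    const entries = [];
    for (const line of lines.slice(0, HEADER_MAX_LINES)) {
      if (!line.trim()) continue;
      try { entries.push(JSON.parse(line)); } catch { /* skip malformed line */ }
    }
    return entries;
  } finally {
    fs.closeSync(fd);
  }
}
```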
The header-only refresh leaves search_fts stale for active sessions —
content typed after the last cold-start isn't indexed. Adds a user
trigger that runs the full worker re-scan (which rewrites FTS from
the live JSONL tails) and then re-fires the current query.

UI
  - New refresh button inside the search bar (circular-arrow icon)
  - Enter in the search input triggers the same path
  - Spinner on the button while reindexing
  - Search debounce bumped from 200ms to 350ms (gentler under heavy
    concurrent workloads)

Wiring
  - main.js: ipcMain.handle('rebuild-cache') → populateCacheViaWorker
  - preload.js: window.api.rebuildCache()
  - app.js: runSearchQuery() extracted; triggerRebuildAndSearch()
    serialises rebuild + refire and guards against double-clicks

24/24 tests still pass.

Renderer threw ReferenceError on every project iteration because the
result destructure at sidebar.js:438 was missing 'subagentIndex' even
though processProjectSessions returns it and buildSessionsList expects
it at line 477. The error aborted the project loop, leaving the
sidebar blank while the backend correctly returned 13 projects / 1500+
sessions. Pre-existing bug from PR#2's hierarchical sidebar — became
visible now because every project in this workload has subagents.

Also nudges #search-refresh-btn from right:42px to right:60px so it no
longer overlaps with #search-clear (the × button at right:40px).
…efault

On long-lived projects the orphan-subagent list can grow huge (>1000 in
this session's host project) and pushes the rest of the project out of
view. Default the section to collapsed and let the user toggle it with
a click on the label. Adds a right-pointing caret that rotates 90°
when expanded and a per-project state in localStorage so the choice
sticks across reloads.

- localStorage key: 'orphanExpanded:<projectPath>' = '0' | '1'
- Default: collapsed (no key set)
- Label format: '▸ Orphan subagents <count>'

The previous commit referenced 'project.projectPath' inside
buildSessionsList, which never had 'project' in scope — runtime
ReferenceError on every render, sidebar blank again. Pass projectPath
as an argument from both call sites (regular projects and worktrees).

Also leaves the renderer console→main bridge in place under
mainWindow.webContents 'console-message' — paid for itself twice now.
Live JSONL writes fire the watcher every ~500ms, triggering
notifyRendererProjectsChanged on each flush. Even with the header-only
refresh, the renderer was re-fetching projects and running morphdom
diff over 100+ session items at that cadence, producing visible
sidebar flicker. User flagged it as 'UI glitch on left side during
refresh'.

- session-cache.js: leading-edge throttle on notifyRendererProjectsChanged,
  1.5s cooldown with trailing flush so the first change is instant but
  bursts coalesce.
- public/app.js: bump renderer debounce 300ms → 900ms. Combined with the
  main-side throttle the sidebar redraws at most ~1×/sec under heavy load.
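
A leading-edge throttle with a trailing flush can be written in a few lines. This is a generic sketch of the technique, not the code in session-cache.js; `throttleLeadingTrailing` is a hypothetical name.

```javascript
// First call fires immediately; calls during the cooldown are coalesced
// into one trailing call when the cooldown ends.
function throttleLeadingTrailing(fn, cooldownMs) {
  let timer = null;
  let pending = false;
  return function notify() {
    if (timer) { pending = true; return; } // inside cooldown: coalesce
    fn();                                  // leading edge
    timer = setTimeout(() => {
      timer = null;
      if (pending) { pending = false; notify(); } // trailing flush, restarts cooldown
    }, cooldownMs);
  };
}
```

With a 1500 ms cooldown this matches the behaviour described: the first change is instant and a burst produces at most one extra notification per cooldown window.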
…ookup

perf(refresh): O(1) cached lookup in refreshFolder

Wraps npm scripts under task(1) with named tasks (install, dev, build,
test, lint, check, ci, clean, db:reset, install:lint). Updates README
with a Tooling section documenting the preferred workflow.

JeanBaptisteRenard and others added 30 commits May 30, 2026 03:47
* fix(app): showSession() on openTerminal error path

* feat(sidebar): detect missing project paths

* feat(main): remap-project IPC with atomic JSONL rewrite

* fix(remap-project): address CRITICAL review findings

- CRITICAL-1: replace flat readdirSync with enumerateSessionFiles so
  subagent transcripts under <uuid>/subagents/*.jsonl and the legacy
  <uuid>/*.jsonl layout are also rewritten
- CRITICAL-2: add active-sessions guard — refuse remap if any non-exited
  PTY session has projectPath matching the folder (avoids concurrent-writer
  data loss); also re-check fs.existsSync(oldPath) at handler entry so a
  path that came back does not get clobbered
- Extract rewriteJsonlAtomic helper that cleans up orphan .tmp files on
  error (try/catch around writeFileSync + renameSync)
- Switch statSync → lstatSync on newPath to make symlink intent explicit
- Tests: add subagent layout (preferred + legacy), orphan .tmp cleanup,
  and active-sessions guard tests (4 new tests, total 51 passing / 64)

* docs: AI agent guidelines + AppImage replacement safety

CLAUDE.md follows the workspace convention (single-line @ pointer) with
content in .ai/shared-guidelines.md. Covers:
- Invariants for AI agents working in this fork (no double Electron,
  SWITCHBOARD_DATA_DIR isolation, worktree isolation for parallel
  agents, no Co-Authored-By).
- Fork-specific features not in upstream (subagent support, work-files
  tab, heatmap from DB, single-instance-lock, etc.).
- Helpers worth reusing (enumerateSessionFiles, encodeProjectPath,
  ViewerPanel, escapeHtml).
- Upstreaming workflow.

README.md gets a "Working alongside a running AppImage" subsection that
explains why rebuilds + cp don't kill the running instance (the live
process extracted /tmp/.mount_* at launch and doesn't need the on-disk
file anymore).

* chore(ai): complete init-project scaffold

- .ai/shared-guidelines.md: import @~/.skaleet-ai/conventions/rules.md
  for universal Skaleet AI rules (HANDOFF, model gate, worktree isolation,
  no Co-Authored-By, shell-command pitfalls, memory hygiene). Caveat block
  notes which rules don't apply here (DDD/CQRS, docker compose exec,
  monitor-ci, glab, /pre-commit — Switchboard is Electron, not a Skaleet
  backend service).
- .ai/project.json: scopes (main / renderer / tests / build) with verify
  commands so /dispatch knows what to run. Encodes the no-double-electron
  and worktree-required constraints.
- Memory dir initialised at
  ~/.claude/projects/-home-jean-baptiste-workspace-switchboard/memory/.

Add 5 focused context docs at .ai/contexts/*.md covering the natural
boundaries inside Switchboard:

- session-cache: SQLite + FTS5 + watcher + targeted refresh
- schedule-runner: in-process cron + .md storage + claude --resume -p
- subagent-observability: parent->child grouping + transcript view
- viewer-panel: reusable CodeMirror panel for Plans / Memory / .work-files
- ipc-bridge: trust boundary inventory (every IPC, every event)

Each doc: ~70-130 lines, structured around purpose, key files, public
surface, invariants, non-obvious behaviors, and "if you change this,
also check..." cross-references. Audience is AI agents who need to make
a focused change without re-reading 1800 LOC of main.js.

Also adds:
- .ai/contexts/README.md   index + recommended reading order
- .ai/contexts/_issues.md  observations captured while writing (not
                           blockers, opportunistic follow-ups)
- .ai/project.json contexts map
- shared-guidelines.md quick-orientation table now points at the
  context docs instead of raw files

Follow-up to the context-engineering skill refactor:
- .ai/project.json gains "context_engineering_profile": "electron-flat"
  so the new router skill detects the profile without falling back to
  heuristics.
- AGENTS.md is a thin pointer at the repo root referencing the universal
  rules and .ai/shared-guidelines.md. Conventional location agents look
  at first; mirrors the CLAUDE.md pointer pattern.
…scripts (#24)

* feat(main): trigger-watcher — file-based input injection for harness scripts

Drop a JSON trigger file into ~/.switchboard/triggers/<uuid>.json to send
a command (e.g. /compact) into any open PTY session.  Optional idle-wait
polls session._cliBusy (the OSC-title spinner flag) before writing, so
the harness can safely call /compact without racing mid-response output.

- trigger-watcher.js: watches triggers dir, processes JSON files, writes
  result to processed/<uuid>.result.json, deletes the trigger on done.
- main.js: wires start() after startScheduler with getPtyForSession and
  isSessionBusy closures over activeSessions.
- test/trigger-watcher.test.js: 6 node:test cases (happy path, unknown
  session, malformed JSON, missing field, idle-wait flip, idle timeout).
- .ai/contexts/trigger-watcher.md: context doc.
- .ai/contexts/README.md: table entry.

* fix(trigger-watcher): address code review findings

- C1: lstatSync size guard — reject trigger files > 64 KB before readFileSync
- C2: lstatSync isFile() — reject symlinks and non-regular files
- W1: SyntaxError retry — wait 50 ms and re-read once before failing on partial writes
- W2: command length cap — reject commands > 4 KB
- W3: control char sanitization — reject \r \n \0 \x1b in command
- W4: concurrency cap — queue triggers when > 8 in-flight; simple semaphore via waitQueue
- W5: session-exit during wait:idle — check getPtyForSession on each poll tick
- W6: tests for C1, C2, W1 retry-then-fail, W2, W3, W4 (12 concurrent), W5, PTY throw, dedup, I4
- I1: atomic result write — write to .tmp then renameSync to final path
- I3: wrap trigger-watcher start() in try/catch in main.js boot path
- I4: NaN guard on parseInt(SWITCHBOARD_TRIGGER_IDLE_TIMEOUT_MS) — fall back to default
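
The command checks from W2 and W3 might look like the following. It is a sketch under assumptions: the commit says "> 4 KB" without stating whether that is bytes or characters (bytes are used here), and `validateCommand` with its error strings is hypothetical.

```javascript
const MAX_COMMAND_BYTES = 4 * 1024; // W2: command length cap (assumed to be bytes)

// W3: reject CR, LF, NUL and ESC anywhere in the command.
const FORBIDDEN_CHARS = /[\r\n\0\x1b]/;

// Returns null when the command is acceptable, otherwise an error string.
function validateCommand(command) {
  if (typeof command !== 'string' || command.length === 0) {
    return 'command must be a non-empty string';
  }
  if (Buffer.byteLength(command, 'utf8') > MAX_COMMAND_BYTES) {
    return 'command too long';
  }
  if (FORBIDDEN_CHARS.test(command)) {
    return 'command contains control characters';
  }
  return null;
}
```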
…#25)

The "Building does NOT kill the running instance" rule (added in the
docs/ai-guidelines PR) was wrong. Witnessed 2026-05-31 ~13:49 — running
AppImage went silent in main.log during a background `npm run build:linux`,
no SIGTERM logged, 9-minute gap before user relaunched. Suspected cause:
electron-builder's @electron/rebuild replaces native `.node` files
(better-sqlite3, node-pty) via non-atomic truncate+write, breaking the
dlopen handle in the running process → SIGKILL.

The `cp dist/*.AppImage ~/Applications/...` step itself is safe — verified
at 17:36 (process kept running through the swap).

Updates:
- .ai/shared-guidelines.md §2 — rewritten as "build is risky, cp is safe"
- README.md "Working alongside a running AppImage" — same correction
- .ai/contexts/_issues.md — full incident writeup + mitigation candidates
…rigger timeout_ms

Raises DEFAULT_IDLE_TIMEOUT 30s→5min, adds MAX_TRIGGER_TIMEOUT=600_000ms cap, adds optional timeout_ms per trigger field. Precedence: per-trigger > env var > default. Validated end-to-end by two production self-compacts (2026-05-31). 86 tests pass. Code review APPROVE (3 non-blocking NITs deferred).
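
The stated precedence (per-trigger > env var > default) with the 600_000 ms cap can be sketched as below. `resolveTimeoutMs` is a hypothetical name, and applying the cap to the env value as well as the per-trigger value is an assumption.

```javascript
const DEFAULT_IDLE_TIMEOUT_MS = 5 * 60 * 1000; // 5 min default
const MAX_TRIGGER_TIMEOUT_MS = 600_000;        // hard cap

function resolveTimeoutMs(trigger, env) {
  let ms = DEFAULT_IDLE_TIMEOUT_MS;
  const fromEnv = Number.parseInt(env.SWITCHBOARD_TRIGGER_IDLE_TIMEOUT_MS, 10);
  if (Number.isFinite(fromEnv) && fromEnv > 0) ms = fromEnv; // NaN falls back to default
  if (Number.isInteger(trigger.timeout_ms) && trigger.timeout_ms > 0) {
    ms = trigger.timeout_ms; // per-trigger value wins
  }
  return Math.min(ms, MAX_TRIGGER_TIMEOUT_MS);
}
```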
node-pty's ptyProcess.write() is silent on a dead child. Adds defaultIsPtyAlive() using process.kill(pid, 0) signal-0 probe, with checks at both lookup time and right before write. 3 new W7 tests (25 total pass). Critical for safe deployment of auto-compact harness in routine.
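
A signal-0 probe is a standard way to test whether a pid exists without delivering a signal. A minimal sketch (`isPidAlive` is a hypothetical name; the project's `defaultIsPtyAlive` may differ, including in how it treats EPERM):

```javascript
// process.kill(pid, 0) throws if the process does not exist (ESRCH).
// EPERM means the process exists but belongs to another user.
function isPidAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return err.code === 'EPERM';
  }
}
```

Calling this both at lookup time and immediately before the write narrows the window in which a child can die unnoticed, although it cannot close it entirely.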
Adds a chain field to the trigger JSON for sequential multi-step PTY injection. Mutually exclusive with command. Each step waits for the previous turn to complete (busy-rise + 2s TOCTOU window, then busy-fall) before injecting the next. Global and per-step timeout_ms supported, per-step capped at remaining global. W7 liveness checks applied at chain entry AND before each step's write (defense-in-depth). 37 tests pass (22 existing + 11 chain CHAIN-1..11 + 3 W7 + 1 CHAIN-12 covering the instant-reply path). Final review APPROVE.
The stats screen renderer was already wired to display tokens, tool calls,
and per-model usage, but the backend only fed messageCount — so those cards
showed zeros and Total Sessions was inflated by subagents.

- read-session-file.js: extractDailyMetrics() accumulates per-(date,model)
  tokens (input/output/cache), tool_use counts, and message counts bucketed
  by each line's timestamp. Synthetic/model-less assistant turns bucket under
  model '' with no tokens; tool_result-only user turns are not counted as
  messages. dailyMetrics attached to parent + subagent return objects.
- db.js: migration v5 adds session_metrics(sessionId,date,model,...) +
  date index; replaceSessionMetrics() (delete-by-session then insert, in a
  txn); deletion wired into deleteCachedSession/deleteCachedFolder (folder
  delete sub-selects on session_cache, so it runs first); aggregates
  getDailyMetrics/getDailyModelTokens/getModelUsage/getTotalCounts
  (totalSessions = parents only).
- session-cache.js: writes metrics on the full-read paths only (NEW-file
  branch of refreshFolder + the worker handler), not the header-only refresh.
- main.js: get-stats-from-db / refresh-stats now build the full stats object
  (dailyActivity, dailyModelTokens, modelUsage, totals) from session_metrics.
- public/stats-view.js: two new summary cards (Total Tokens, Tool Calls).

Tests: test/read-session-file-metrics.test.js (extractDailyMetrics unit +
readSessionFile integration), test/db-session-metrics.test.js (pure-JS mirror
of the SQL aggregation). Metrics populate automatically on next cold-start
rebuild — no separate backfill.
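
The per-(date, model) bucketing can be illustrated with a simplified accumulator. The entry shape used here (`type`, `timestamp`, `message.model`, `message.usage.input_tokens`, `message.content[].type`) is an assumption about the JSONL format, `bucketDailyMetrics` is a hypothetical name, and unlike the real `extractDailyMetrics` this sketch counts assistant turns only and ignores cache tokens.

```javascript
// Simplified sketch: assistant turns only, input/output tokens only.
function bucketDailyMetrics(entries) {
  const buckets = new Map(); // key: '<YYYY-MM-DD>|<model>'
  for (const e of entries) {
    if (!e.timestamp || e.type !== 'assistant') continue;
    const date = String(e.timestamp).slice(0, 10);
    const model = e.message?.model ?? ''; // model-less turns bucket under ''
    const key = `${date}|${model}`;
    const b = buckets.get(key) ??
      { date, model, inputTokens: 0, outputTokens: 0, toolCalls: 0, messages: 0 };
    const usage = e.message?.usage ?? {};
    b.inputTokens += usage.input_tokens ?? 0;
    b.outputTokens += usage.output_tokens ?? 0;
    const content = Array.isArray(e.message?.content) ? e.message.content : [];
    b.toolCalls += content.filter((c) => c.type === 'tool_use').length;
    b.messages += 1;
    buckets.set(key, b);
  }
  return [...buckets.values()];
}
```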
Two coupled changes:

1. chain field — a trigger may carry `chain: [{command, timeout_ms?}, ...]`
   instead of a single `command`. Steps are injected sequentially, each
   waiting for the prior turn to complete (busy then idle) before the next,
   under a shared global deadline. Lets one trigger drive a /compact then a
   resume prompt without racing on a shared idle tick.

2. discrete-Enter submit — write the command text and the Enter keypress as
   SEPARATE PTY writes (SUBMIT_ENTER_DELAY_MS apart). A CR concatenated onto
   the text in a single write is absorbed by Claude Code (kitty keyboard
   protocol) as a literal newline and the command never submits; only a
   discrete CR submits, mirroring how xterm sends each keypress. Fixes
   free-text trigger commands landing in the composer unsubmitted while the
   short menu-driven /compact path appeared to work.

Tests: 101 passing (node:test). Write-shape assertions updated for the
two-write submit; makeChainCtx starts a turn only on the discrete Enter.
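
The discrete-Enter submit boils down to two writes separated by a short delay. In this sketch the 50 ms value is an assumption (the commit names the constant but not its value), `submitDiscrete` is a hypothetical name, and `pty` is any object with a `write(string)` method.

```javascript
const SUBMIT_ENTER_DELAY_MS = 50; // assumed value, for illustration only

async function submitDiscrete(pty, text, delayMs = SUBMIT_ENTER_DELAY_MS) {
  pty.write(text);                                  // 1st write: command text only
  await new Promise((r) => setTimeout(r, delayMs)); // gap between the two writes
  pty.write('\r');                                  // 2nd write: a bare CR submits
}
```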
Call populateCacheViaWorker() unconditionally in app.whenReady so
deleted transcripts (sub-agent / workflow runs cleaned up between
sessions) are evicted from session_cache on the next launch. The
worker does a deleteCachedFolder + upsert per folder — a full prune —
and runs in a Worker thread so startup is non-blocking. Concurrent
callers (FTS-recreated path, first get-projects call) share the
in-flight Promise via the existing guard, so no double scan occurs.

Previously the prune only ran when isCachePopulated() was false (cold
DB after migration) or when searchFtsRecreated was set, leaving ~457
stale rows visible in the sidebar across normal restarts.
…screte-submit

fix(trigger-watcher): discrete-Enter submit + chain triggers; prune stale sidebar cache
…ection

feat(stats): collect token/tool/message metrics for the stats screen
…sionCache ctx.db

main.js hand-builds the db: {} allow-list passed to sessionCache.init(). The
stats work added replaceSessionMetrics to db.js + session-cache.js but never
added it to that literal, so ctx.db.replaceSessionMetrics was undefined and the
call inside the cold-start worker handler threw on stderr (not electron-log),
silently aborting the indexing write loop → session_metrics stayed empty → the
stats screen still showed 0 tokens / 0 tool calls after the build.

touchCachedModified was missing from the same literal (pre-existing latent bug,
header-refresh fallback path) — added too.

Adds test/main-ctx-db-wiring.test.js: static guard asserting main.js's ctx.db
allow-list covers every ctx.db.* session-cache.js dereferences. Verified it
flags both names against the pre-fix main.js. No unit test booted this wiring
before, which is why task check stayed green through the bug.
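
A static guard of this kind can be approximated by scanning source text for `ctx.db.<name>` dereferences and comparing against the allow-list. This is a sketch of the approach, not the contents of test/main-ctx-db-wiring.test.js; both function names are hypothetical.

```javascript
// Collect every `ctx.db.<name>` dereference from a source string.
function collectCtxDbNames(source) {
  const names = new Set();
  for (const m of source.matchAll(/\bctx\.db\.([A-Za-z_$][\w$]*)/g)) names.add(m[1]);
  return names;
}

// Names the consumer dereferences that the allow-list does not provide.
function missingFromAllowList(consumerSource, allowList) {
  return [...collectCtxDbNames(consumerSource)]
    .filter((name) => !allowList.has(name))
    .sort();
}
```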
fix(stats): wire replaceSessionMetrics + touchCachedModified into sessionCache ctx.db

The main process pegged ~53% CPU continuously while idle. Three causes:

- main.js: each subagent live-tail watcher used fs.watchFile, which
  stat-polls the file once per second per watcher, forever. Teardown
  via fs.unwatchFile(path) was fragile and watchers could accumulate
  across a long app session. Switch to event-driven fs.watch + 300ms
  debounce, and store a per-watcher teardown() closure called from both
  stop-subagent-watch and the window-closed handler so nothing leaks.
  Falls back to a 10s poll only if fs.watch fails to attach.

- public/app.js: pollActiveSessions ran every 3s unconditionally (IPC +
  full-sidebar querySelectorAll) even with zero running sessions. Poll
  adaptively: 3s while sessions run, 30s when idle. In-renderer session
  starts re-arm the fast cadence immediately; the 30s floor still catches
  externally-started sessions. The 30s timeago interval now no-ops when
  nothing is active.

- schedule-runner.js: scanSchedules re-read 4KB of every project JSONL
  every 60s just to extract projectPath. Resolve it from the cache_meta
  SQLite cache (getAllFolderMeta) instead, falling back to the JSONL read
  only when a folder is genuinely uncached.

Lint clean; existing 119-test suite passes unchanged.
…-polling

perf: cut idle CPU from leaked watchers and unconditional polling

--worktree creates a fresh isolated git worktree, which only makes sense when
STARTING a session. On resume (isNew === false) it makes claude try to spin up
a new worktree and fail to attach — so a plain sidebar click, and equally the
"Create scheduled task" flow (launchScheduleCreator also resumes via
openTerminal with isNew=false), silently broke.

Rather than stripping the option at each renderer call site, gate it in the
open-terminal handler: only append --worktree when isNew. That closes every
current and future resume path in one place, and avoids mutating the caller's
options object.
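
Gating the flag in one place, without mutating the caller's options, could look like this. `buildClaudeArgs` and the option names are hypothetical, and treating `--worktree` as a bare flag with no value is an assumption.

```javascript
// Builds a fresh argument array; the caller's options object is left untouched.
function buildClaudeArgs({ isNew, sessionId, worktree }) {
  const args = [];
  if (!isNew) args.push('--resume', sessionId);
  if (isNew && worktree) args.push('--worktree'); // only meaningful when starting
  return args;
}
```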
…g missing

The session cache is only rebuilt from the filesystem on a cold start
(populateCacheViaWorker runs only when the cache is completely empty). Once it
has any rows, a project folder that changed while the app was closed — or that
predates the build which first indexed it — is never re-scanned, so its
sessions (and whole worktrees) silently disappear from the sidebar. The live
fs.watch only covers changes while the app is running.

When the cache is already populated, get-projects now reconciles first: the
dead populateCacheFromFilesystem() (defined + exported, never called) becomes a
stat-gated reconcileCacheFromFilesystem() that re-indexes only folders that are
new or whose newest transcript is newer than cache_meta.indexMtimeMs — cheap
when nothing changed.

Also extend getFolderIndexMtimeMs to stat the full session-file set
(enumerateSessionFiles), not just top-level *.jsonl, so a folder whose only
offline change was a subagent transcript under <id>/subagents/ is still flagged
for reconcile. Adds regression tests for the gate and the subagent-only case.
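
The stat gate can be illustrated with the sketch below. It assumes each folder's session files have already been enumerated and stat'ed into objects with an `mtimeMs` field (as `fs.statSync` returns); `newestMtimeMs` and `needsReindex` are hypothetical helper names, while `indexMtimeMs` is the cache_meta field named above.

```javascript
// Hypothetical sketch of the stat gate used to decide whether to re-index.

// Newest modification time across all session files in a folder,
// including nested subagent transcripts; 0 for an empty folder.
function newestMtimeMs(fileStats) {
  return fileStats.reduce((max, s) => Math.max(max, s.mtimeMs), 0);
}

// Re-index when the folder has no cache row yet, or when any transcript
// is newer than the mtime recorded at the last index.
function needsReindex(cachedMeta, fileStats) {
  if (!cachedMeta) return true;
  return newestMtimeMs(fileStats) > cachedMeta.indexMtimeMs;
}
```

When nothing changed, the cost per folder is limited to the stat calls; no transcript is opened.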

Rebased onto current main, which wires touchCachedModified/replaceSessionMetrics
into the ctx.db literal.

The 2026-06-04 incident: a chain's final step wrote its text + discrete
Enter into a session's PTY but the Enter never submitted — the text sat in
the composer. The watcher had no post-write verification: the instant-reply
heuristic masked the failure, and the final chain step plus the
single-command path had no observation at all.

Add submitWithVerify(): submit, poll for the busy rising edge within
SWITCHBOARD_SUBMIT_VERIFY_MS (default BUSY_RISE_TIMEOUT_MS), and if no rise
arrives write a single bare '\r' retry (a no-op on an empty composer, so
harmless if the first submit worked) before polling again. The combined
waitForTurnComplete is split into submitWithVerify (Phase 1, busy-rise +
retry) and waitForBusyFall (Phase 2). Wired into every chain step (the
observed rise replaces the old Phase-1 wait — no double wait), the final
chain step (verify only, no busy-fall), and the single-command path. The
per-step timeout_ms still bounds the whole step, preserving the existing
mid-chain timeout semantics. Instant-reply behaviour is preserved: when no
rise is confirmed the watcher still proceeds; submit_retries traces it.

Result JSON gains additive submit_retries (top-level for single, per-step
for chain). Existing assertions updated for the retry '\r' (silent-ctx
tests count submissions by command text, CHAIN-12 timing) and 4 new tests
cover the verify/retry/no-retry paths.
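
The verify-and-retry flow could look roughly like the sketch below. It is not the watcher's implementation: the `write` and `isBusy` callbacks, `waitForBusyRise`, the return shape and the default timings are assumptions made for illustration. Only the overall sequence (submit, poll for the busy rising edge, one bare '\r' retry, poll again) and the `submit_retries` field come from the commit message.

```javascript
// Hypothetical sketch of submit-with-verify: one bare '\r' retry at most.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll isBusy() until it reports true or the timeout elapses.
async function waitForBusyRise(isBusy, timeoutMs, pollMs) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (isBusy()) return true;
    await sleep(pollMs);
  }
  return isBusy();
}

// verifyMs and pollMs defaults are placeholders, not the project's values.
async function submitWithVerify({ write, isBusy, text, verifyMs = 2000, pollMs = 50 }) {
  write(text);
  write('\r'); // discrete Enter, written separately from the text
  if (await waitForBusyRise(isBusy, verifyMs, pollMs)) {
    return { rose: true, submit_retries: 0 };
  }
  // No rise observed: retry the Enter once. On an empty composer this
  // is a no-op, so it is harmless if the first submit actually worked.
  write('\r');
  const rose = await waitForBusyRise(isBusy, verifyMs, pollMs);
  return { rose, submit_retries: 1 };
}
```

A caller that wants the instant-reply behaviour would continue even when `rose` is false and simply record `submit_retries` in the result.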
…bmit-verify

fix(trigger-watcher): verify submission and retry the Enter once
…erd)

Corrects the 2026-05-31 claim. Witnessed 2026-06-04: cp at 17:17 triggered
appimagelauncherd desktop re-integration at 17:17:33; the running instance was
cleanly terminated at 17:17:48 (systemd scope end, no segfault, no kernel
trace; user confirmed they did not quit). The same swap had survived twice on
2026-06-02 — non-deterministic. Also promotes --config.npmRebuild=false from
investigation candidate to field-proven safe-build path (3 successful builds
with the instance live).
…er-not-safe

docs: cp over a running AppImage is NOT reliably safe (appimagelauncherd)
Don't apply the worktree default when resuming a session
Reconcile cache with filesystem on get-projects so sessions stop going missing
…ll site

Review follow-ups: loadProjects() fires get-projects twice per sidebar
paint (showArchived false/true), running the reconcile sweep back-to-back;
a 1s throttle skips the redundant second pass while the live watcher
covers anything landing inside the window. Also note at the call site
that the reconcile is synchronous, so the missing await next to the
cold-start branch's 'await populateCacheViaWorker()' is intentional.
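
A leading-edge throttle of this kind can be sketched as follows. `createThrottle` and the injectable `now` clock are hypothetical names; the 1s window is the only detail taken from the commit message.

```javascript
// Hypothetical sketch: run fn at most once per window, skip calls in between.
function createThrottle(fn, windowMs, now = Date.now) {
  let last = -Infinity;
  return function throttled(...args) {
    const t = now();
    if (t - last < windowMs) return false; // inside the window: skip
    last = t;
    fn(...args); // synchronous, like the reconcile sweep it would wrap
    return true;
  };
}
```

Wrapping the sweep as `createThrottle(reconcile, 1000)` would let the first of the two back-to-back get-projects calls run it and the second skip it.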
polish(reconcile): throttle back-to-back sweeps, document the sync call site
…ine expiry

When a chain trigger's global deadline fires with unsent steps remaining
(starvation via parallel cron re-busying the session), write a new
trigger file for the remaining steps with wait:idle so they execute when
the session next becomes available.

- MAX_REQUEUE = 2 cap prevents infinite re-queue loops
- requeue_count field on triggers (validated: non-negative integer)
- All four chain-timeout sites re-queue: initial-idle-wait, before-step
  deadline, step-verify timedOut, step busy-fall timedOut
- Timed-out result gains requeued:true + requeue_trigger filename (or
  requeue_exhausted:true at cap); ok:false/partial:true shape preserved
- 8 new tests covering: timeout-with-remaining, initial-wait requeue,
  cap exhaustion, validation (negative/float/string/absent), success
  no-requeue

Fixes witnessed starvation: for trigger e252d69c, step 1 was never sent
and the result was chain timeout with steps_completed:0.
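
The re-queue decision can be sketched as a pure function, shown below under stated assumptions. `parseRequeueCount`, `planRequeue`, the `steps` array field and the `{ result, next }` return shape are hypothetical; writing the new trigger file and setting `requeue_trigger` to its filename are left to the caller. `MAX_REQUEUE`, `requeue_count`, `wait: 'idle'` and the result fields are taken from the description above.

```javascript
// Hypothetical sketch of the re-queue decision on chain timeout.
const MAX_REQUEUE = 2;

// Absent requeue_count is treated as 0; anything else must be a
// non-negative integer (rejects negatives, floats and strings).
function parseRequeueCount(trigger) {
  const value = trigger.requeue_count;
  if (value === undefined) return 0;
  if (!Number.isInteger(value) || value < 0) {
    throw new Error('requeue_count must be a non-negative integer');
  }
  return value;
}

// firstUnsentIndex: index of the first step that was never sent.
function planRequeue(trigger, firstUnsentIndex) {
  const count = parseRequeueCount(trigger);
  const remaining = trigger.steps.slice(firstUnsentIndex);
  const base = { ok: false, error: 'chain timeout', partial: true };

  if (remaining.length === 0) return { result: base, next: null };
  if (count >= MAX_REQUEUE) {
    return { result: { ...base, requeue_exhausted: true }, next: null };
  }
  // Steps already sent are dropped; the rest wait for an idle session.
  const next = { ...trigger, steps: remaining, wait: 'idle', requeue_count: count + 1 };
  return { result: { ...base, requeued: true }, next };
}
```

Incrementing `requeue_count` in the new trigger is what bounds the loop: a chain that keeps timing out is re-queued at most `MAX_REQUEUE` times before the result reports `requeue_exhausted`.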

2 participants