coston · coston · Jun 9, 2026 · Jun 8, 2026 · Jun 9, 2026 · Jun 9, 2026
diff --git a/.gitignore b/.gitignore
@@ -3,6 +3,9 @@
 *.local.*
 .claude/projects/
 
+# Local environment
+.env
+
 # Node.js
 node_modules/
 
@@ -17,6 +20,12 @@ mock_calls.log
 claude_call_count
 claude_prompts.log
 
+# Workflow eval intermediate files (context handoff between steps)
+.eval/
+
+# Eval run results (generated by npm run eval:compare — not committed)
+results/
+
 # OS files
 .DS_Store
 Thumbs.db

diff --git a/AGENTS.md b/AGENTS.md
@@ -1,4 +1,4 @@
-# CLAUDE.md
+# Development Guide
 
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
@@ -18,6 +18,7 @@ Executant is a TypeScript CLI tool (`src/`) that executes YAML-defined workflows
 8. Keep Readme.md, ARCHITECTURE.md, and BACKLOG.md, PRODUCT-SPEC.md up-to-date as things evolve.
 9. Always strive for extensive test coverage.
 10. Always consider how changes will affect the goals and data integrity of the application. Defend the users.
+11. This cli must work on MacOS and Linux
 
 ## Core Architecture
 
@@ -33,7 +34,7 @@ Executant is a TypeScript CLI tool (`src/`) that executes YAML-defined workflows
    - `continue_on_error: true` - Optional, allows script steps to fail without stopping
    - `self_healing: true` - Optional (defaults to `false`; opt-in per step), automatically passes script failures to Claude for fixing
    - `llm_as_judge: true` - Optional, evaluates step quality and retries up to 5 times if needed
-   - `allowed_tools` - Optional list restricting which Claude tools are available for a prompt step
+   - `allowed_tools` - Optional list restricting which tools are available for a prompt step. Applies to both Claude and OpenCode providers. Omit entirely for no restrictions (default — all tools available). `[]` = text-only mode (no tools). `[bash, read]` = only those tools. Tool names are case-insensitive (`Bash` and `bash` both work).
    - `context` - Optional list of var names whose values are file paths; file contents are prepended to the prompt at runtime
    - `forEach` - Optional inline array or shell command (newline-split stdout); runs the inner step once per item with `{{item}}` substituted
    - `repeat: N` - Runs the step N times sequentially (compiles to a ForEachTask at load time); mutually exclusive with `forEach`; `{{item}}` is the 1-based iteration number

diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
@@ -35,7 +35,11 @@ In CI mode (`--ci`), the event stream is serialized as NDJSON to stdout instead
 
 **`src/load-workflow.ts`** — Parses YAML into a typed `Workflow`. Validates the schema, resolves `vars`, infers step types, and wires up `context:`, `output:`, and `timeout_seconds:` fields. Accepts an optional `cliVars` parameter that is merged over YAML vars (CLI overrides YAML) before placeholder substitution.
 
-**`src/tasks/claude.ts`** — Spawns the Claude CLI as a child process and streams its NDJSON output as `Event`s. Handles tool call parsing, cost events, and structured output (`output:structured`). `runClaude(task: ClaudeTask, _channel?: InterjectChannel)` is the low-level generator; the `channel` parameter is accepted for API compatibility but is not used for stdin injection — the Claude CLI requires stdin EOF before processing a piped prompt, making mid-execution injection impossible. Interjections are instead queued by `InterjectChannel` and prepended to the next Claude step's prompt in `runner.ts`. `runClaudeStructured<T>(task, schema)` is a typed wrapper that passes a Zod schema as `--json-schema` and validates the result. Exports `METHODOLOGY` (the development loop loaded from `src/prompts/development-methodology.txt`) and `buildClaudeArgs(task, interactive?)` (pure function constructing the CLI args array, exported for testing; `interactive=true` omits `--print` from the returned args but is not used by the production path). `ClaudeTask` carries four internal runtime fields not present in YAML: `permissionMode` (defaults to `'bypassPermissions'`), `jsonSchema` (JSON Schema object for `--json-schema`), `appendSystemPrompt` (text appended via `--append-system-prompt`), and `model` (model override via `--model`).
+**`src/tasks/agent.ts`** — Provider dispatch layer. `resolveAgentProvider(task)` resolves the provider in this order: (1) `task.provider` field, (2) `EXECUTANT_PROVIDER` env var, (3) `"claude"` default. `runAgent(task)` and `runAgentStructured(task, schema)` route to the appropriate backend and are the only entry points used by `runner.ts`, `plan.ts`, and `refine.ts`. Adding a new provider requires only a new case in each switch and a new `src/tasks/<provider>.ts` file.
+
+**`src/tasks/claude.ts`** — Spawns the Claude CLI as a child process and streams its NDJSON output as `Event`s. Handles tool call parsing, cost events, and structured output (`output:structured`). `runClaude(task: ClaudeTask)` is the low-level generator. `runClaudeStructured<T>(task, schema)` is a typed wrapper that passes a Zod schema as `--json-schema` and validates the result. Exports `METHODOLOGY` (the development loop loaded from `src/prompts/development-methodology.txt`) and `buildClaudeArgs(task, interactive?)` (pure function constructing the CLI args array, exported for testing). `ClaudeTask` carries runtime fields not present in YAML: `provider` (optional — routes through `agent.ts` dispatch), `permissionMode`, `jsonSchema`, `appendSystemPrompt`, `model`, and `agent` (OpenCode `--agent` flag).
+
+**`src/tasks/opencode.ts`** — Spawns the OpenCode CLI (`opencode run --format json`) and streams its JSON events as `Event`s. `buildOpenCodeArgs(task)` constructs the args array (model from `task.model` then `EXECUTANT_MODEL` env; agent from `task.agent` then `EXECUTANT_AGENT` env; `--dangerously-skip-permissions` for `bypassPermissions` mode). `buildOpenCodePermissionEnv(allowedTools)` translates the `allowed_tools` step field into the `OPENCODE_PERMISSION` env var: `undefined` → no env set (all tools allowed); `[]` → deny all tools (text-only mode); `["bash","read"]` → deny every tool not in the list. Tool names are matched case-insensitively so Claude-style names (`Bash`, `Read`) and opencode-style names (`bash`, `read`) both work. `parseOpenCodeMessage(msg)` normalises OpenCode's event types (`text`, `tool_use`, `error`) to Executant's `output:text` and `output:tool` events. `runOpenCodeStructured` appends a JSON-only instruction to the prompt and parses the response via `extractJsonObject`.
 
 **`src/tasks/command.ts`** — Spawns a bash subprocess and streams stdout/stderr as `output:text` events. Exports `CommandError`, a typed error class that carries `exitCode` and `command` fields. Supports per-step `timeoutSeconds` via the shared `startTimeout` helper from `stream.ts`.
 
@@ -117,21 +121,49 @@ Large text passed to Claude lives in `src/prompts/*.txt`. They use `{{PLACEHOLDE
 
 The eval system tests and iteratively refines the prompt templates in `src/prompts/`. It is not user-facing — run via `npm run eval` during development.
 
-**`src/eval/index.ts`** — CLI entry point. Parses `--refine` and `--max-iter` flags, orchestrates the score → collect-failures → refine → re-score loop, and delegates rendering to `report.ts`.
+**`src/eval/index.ts`** — CLI entry point. Parses `--refine`, `--max-iter`, `--models`, `--cases`, `--output-json`, and `--output-csv` flags. Accepts one or more eval file paths as positional arguments. `--cases` accepts comma-separated case IDs or 1-based index ranges (e.g. `simple,1-3`) to run a subset without editing YAML. Single-model mode: loads existing CSV results for resume (skips already-scored cases), runs remaining cases, optional refine loop. Multi-model mode (2+ models via `--models`): runs each model independently, builds an `EvalComparison`, prints a side-by-side table. When multiple files are passed, output paths are auto-suffixed per eval name.
 
 **`src/eval/load.ts`** — Parses `evals/*.eval.yaml` via Zod. Resolves fixture paths (values in `vars` that end in `.md` / `.txt` are read and substituted with file contents).
 
-**`src/eval/runner.ts`** — `runPrompt()`: substitutes `{{PLACEHOLDER}}` vars into a prompt template, calls Claude with no tools, and returns the raw text output.
+**`src/eval/runner.ts`** — `runPrompt(templatePath, vars, model?)`: substitutes `{{PLACEHOLDER}}` vars, runs the prompt through the specified model via `runAgent`, and returns the raw text output. Claude receives `METHODOLOGY` as `appendSystemPrompt`; OpenCode does not (flag not supported).
+
+**`src/eval/judge.ts`** — `judgeOutput()`: takes a single output string and a criterion string, always uses Claude for judgment (the authoritative judge), and returns `{ pass: boolean, reason: string }`.
 
-**`src/eval/judge.ts`** — `judgeOutput()`: takes a single output string and a criterion string, calls Claude with the criterion-judge prompt, and returns `{ pass: boolean, reason: string }`.
+**`src/eval/refine.ts`** — `refinePrompt()`: given the current template and a list of failures, calls Claude with the prompt-refiner prompt and returns a rewritten template.
 
-**`src/eval/refine.ts`** — `refinePrompt()`: given the current template and a list of failures (case id + criterion + reason), calls Claude with the prompt-refiner prompt and returns a rewritten template.
+**`src/eval/report.ts`** — Terminal output: `printRun()` for single-model pass/fail table; `printComparison()` for multi-model side-by-side comparison table.
 
-**`src/eval/report.ts`** — Terminal output: renders a per-case pass/fail table with criterion reasons.
+**`src/eval/export.ts`** — `toJson(comparison)` and `toCsv(comparison)`: serialize `EvalComparison` for benchmark analysis. CSV is denormalized (one row per criterion judgment per model) with columns `eval_name, template_path, case_id, criterion, model_label, provider, model, pass, reason, duration_ms`.
 
 **`src/eval/prompts/`** — Eval-specific prompts (`criterion-judge.txt`, `prompt-refiner.txt`). Same `{{PLACEHOLDER}}` convention as `src/prompts/`.
 
-**`evals/`** — Eval YAML definitions and `fixtures/` subdirectory with reusable input documents. Covers `plan-decompose.txt`, `judge-evaluation.txt`, `self-healing-fix.txt`, and `plan-judge.txt`.
+**`evals/`** — Eval YAML definitions and `fixtures/` subdirectory with reusable input documents. Covers prompt-quality evals (`plan-decompose`, `judge-evaluation`, `self-healing-fix`, `plan-judge`, `development-methodology`) and benchmark evals (`code-generation-quality`, `code-review-depth`, `instruction-following-precision`, `structured-output-reliability`, `methodology-context-sensitivity`).
+
+## Workflow Eval System
+
+Tests end-to-end model capability on real coding tasks, not just prompt quality. Each task runs the full development lifecycle in an isolated git worktree.
+
+**Two-phase design:**
+
+```
+Phase 1 — Model execution (in git worktree):
+  explore → writes research.md to .eval/
+  plan    → reads research.md via context:, writes plan.md
+  implement → reads both via context:, edits src/
+  test    → npm test (self_healing: true)
+  commit  → git commit
+
+Phase 2 — Eval harness (always Claude as judge, never the model):
+  git diff HEAD -- src/ tests/
+  judgeAllCriteria(diff, eval_criteria)
+  → WorkflowComparison table
+```
+
+**`src/eval/workflow.ts`** — `runWorkflowEval(taskPath, models)`: creates an isolated git worktree per model (with a `node_modules` symlink), spawns executant `--ci` in the worktree with the model's env vars, then uses Claude to judge the resulting diff against `eval_criteria`.
+
+**`src/eval/workflow-report.ts`** — `printWorkflowComparison()`: per-model table showing tests pass/fail, judge score, diff stats, and duration. `toWorkflowCsv()` for export.
+
+**`src/eval/workflow-index.ts`** — CLI: `npm run eval:workflow -- --models claude/sonnet evals/workflow/add-workflow-description.yaml`
 
 ### Refinement loop
 
@@ -162,3 +194,13 @@ The interjection feature lets users send a correction to a running workflow by p
 
 - **LLM-as-judge** (`llm_as_judge: true`) — after a step completes, a separate Claude call evaluates output quality. On `FAIL`, the step retries with feedback appended, up to 5 times.
 - **Self-healing** (`self_healing: true`) — on script failure, error output is passed to Claude for diagnosis. Claude applies a fix and the command re-runs, up to 5 times.
+
+## Local Model Inference (Dev Tooling)
+
+These scripts are internal dev tooling for running multi-model eval comparisons. They are not part of the published package.
+
+**`src/lib/model-config.ts`** — Shared model registry: `MODELS_DIR` (`~/.executant/models/`), `PIDS_DIR` (`~/.executant/pids/`), and the `MODELS` array defining each model's name, key, file, port, download URL, and size. Imported by `native-models.ts`, `model-server.ts`, `setup.ts`, and the dependency tests.
+
+**`src/native-models.ts`** — Downloads GGUF model files to `~/.executant/models/` using native `curl`. Idempotent: present files are skipped. Run via `npm run models:download`.
+
+**`src/model-server.ts`** — Manages native `llama-server` processes (Apple Silicon Metal GPU). `start` spawns detached processes with `-ngl 999`, writes PIDs to `~/.executant/pids/`. `stop` kills by PID. `status` cross-references live PID with HTTP health check. Exports `isServerHealthy(port)`. The CLI entry point is guarded by an `isMain` check so the file is safe to import. Run via `npm run models:start|stop|status`.
diff --git a/BACKLOG.md b/BACKLOG.md
@@ -14,6 +14,8 @@ Known improvements deferred from code reviews and audits.
 
 - **True mid-step interjection (kill + resume)** — The current `i` key queues a correction for the *next* Claude step. To truly stop a running Claude step and redirect it mid-execution, the approach is: kill the subprocess, then re-invoke with `--resume <session_id>` (captured from the result event) and the user's correction prepended. This preserves conversation context while immediately stopping the bad action. The `session_id` is available in Claude CLI's `result` event. The TUI would show a "restarting with correction…" log line. Blocked on: deciding UX (separate keybinding like `I` vs. a mode toggle), and verifying `--resume` behavior with `--output-format stream-json`.
 
+- **OpenCode server-mode integration** — The current OpenCode runner uses `opencode run --format json` (CLI subprocess). A more robust integration would use OpenCode's HTTP server API (sessions, SSE event stream, messages endpoint). This enables better session management, lower startup overhead, and potentially mid-session context carry-over. Blocked on: OpenCode server API stabilizing.
+
 ## Implemented (code review fixes, 2026-06)
 
 - ✅ **`workDir` in `RunOptions`** — `.executant-cancel` is now checked next to the workflow YAML (`dirname(resolve(filePath))`) rather than fixed to `process.cwd()` at module load time; predictable regardless of invocation directory.