perf(experiments): bench harness + P1 correctness/fidelity judge + edit flow by ivanmkc · Pull Request #168 · ivanmkc/termchart

ivanmkc · 2026-06-10T04:30:43Z

Summary

Standalone (off master) bench harness for the latency/token/fidelity climb, including the P0 foundation from #143 plus the P1 work that #143 deferred. (This branch carries #143's commits rebased onto current master, so it can land independently of that PR.)

Measures the end-to-end cost of an agent generating a termchart diagram across runners (Claude Code / OpenCode; AGY deferred), all driven headless through one LiteLLM proxy → Vertex Gemini, with tokens/latency from the proxy spend log.

What's new vs #143 (P1)

- Real correctness/fidelity gate (judge.py): replaces the coarse "exit 0 + ≥1 call" signal. A deterministic rules gate (artifact produced? right type? within MAX_*?) runs first, then a rubric LLM-judge graded by a separate, stronger judge-model alias, so the model under test never grades itself. Judge tokens are sliced out of diagram cost. Cost is still compared over successful diagrams only.
- Edit (two-turn) flow: Task.edit follow-ups; the cell runs generate→edit in the same board, slicing the spend log into gen/edit/judge windows. A termchart PATH shim (tc-shim.sh) logs every CLI call → termchart_calls + the patch-vs-re-push signal (T4 lever).
- Plumbing: capture.py, proxy_client.py, pull_board.py, loopback viewer + shim in entrypoint_cell.sh, judge-model alias in the proxy config, report.md fidelity column + Edits section, dry-run synthesizes both.

Tests

70 pytest cases green (was 20) — all offline, no network: rules gate on fixtures, judge parsing via injected caller, capture/shim parser, bounded spend slices, edit-cell record shape, dry-run pipeline.

Not yet

The live podman+Vertex path (viewer/pull/two-turn/judge call) is unit-green but not yet validated in pilot; a smoke run is in progress.
c1 carries T1 (minify) + T5 (exit-code); further T*/L* fixes land next.
AGY headless runner still deferred.

Spec: docs/superpowers/specs/2026-06-09-bench-judge-edit-flow-design.md
Plan: docs/plans/2026-06-07-latency-token-experimentation-plan.md

Experiment to reduce diagram-generation latency and tokens across three runners (Claude Code, AGY, OpenCode), measured on VMs against baseline vs a combined-fixes build. Decisions baked in: one shared model across runners and a pilot-first first pass.

- corpus_run.py: --metrics-out emits per-render JSONL (Tier A spine) + test
- scripts/experiments: agent_run (Tier B orchestrator, --dry-run safe), metrics (RunRecord, median/p95/bootstrap CI, proxy-log parse), config (runners/conditions/tasks), aggregate (pool per-VM streams)
- pilot task suite, LiteLLM proxy config + spend logger callback
- GCE vm/startup.sh + vm/provision.sh (env-parameterized)
- pytest suite (20 tests, dry-run, no network)
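The median/p95/bootstrap CI summary named for the metrics module can be sketched as follows. This is a minimal illustration, not the module's code: the function names, the nearest-rank percentile definition, and the percentile-bootstrap defaults (2000 resamples, 95% interval) are all assumptions.

```python
import random
import statistics


def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence (assumed definition)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def bootstrap_ci(values, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for `stat` over `values`."""
    rng = random.Random(seed)  # seeded so a report is reproducible
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the results.
    resampled = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return resampled[lo_idx], resampled[hi_idx]
```

With only a handful of runs per cell in a pilot, the interval will be wide; the bootstrap makes that uncertainty visible instead of reporting a bare median.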

…al slice)

- podman/: Containerfile (node+python+claude-code+opencode+litellm), run_local.sh (proxy container + per-cell containers + aggregate), entrypoint_cell.sh (build termchart per condition, run runner headless as non-root node, emit RunRecord)
- proxy: route shared-model -> Vertex Gemini 2.5 Flash via ADC; clean spend-log usage; Claude is a one-line EXPERIMENT_MODEL flip once Model Garden is enabled
- metrics: spend-log slice correlation (parse_proxy_log_slice, count_log_lines)
- cell_record.py: per-cell RunRecord from runner output + proxy spend slice
- proven: Claude Code in-container draws an ER diagram via termchart end-to-end on Vertex Gemini (57.7k in / 3.0k out tokens captured)

Refs #142
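The spend-log slice correlation can be illustrated with a rough sketch: count the log's lines before a cell starts, count again when it ends, and sum only the records in between. The function names mirror the commit message, but the signatures, the one-JSON-object-per-line layout, and the `prompt_tokens`/`completion_tokens` field names are assumptions.

```python
import json
from pathlib import Path


def count_log_lines(path):
    """Number of lines currently in the spend log (0 if it does not exist yet)."""
    p = Path(path)
    if not p.exists():
        return 0
    with p.open() as fh:
        return sum(1 for _ in fh)


def parse_proxy_log_slice(path, start, end=None):
    """Sum usage over log lines in [start, end); line indices are 0-based."""
    totals = {"calls": 0, "input_tokens": 0, "output_tokens": 0}
    with Path(path).open() as fh:
        for idx, line in enumerate(fh):
            if idx < start or (end is not None and idx >= end):
                continue  # record belongs to another cell or window
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            totals["calls"] += 1
            totals["input_tokens"] += rec.get("prompt_tokens", 0)  # assumed field
            totals["output_tokens"] += rec.get("completion_tokens", 0)  # assumed field
    return totals
```

Bounding the slice on both ends is what would let one shared proxy log serve several windows without one window's tokens leaking into another.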

When TERMCHART_VIEWER_URL/TOKEN are unset, push and status return EXIT_NO_VIEWER=4 (packages/cli/src/viewer-detect.ts:15), and the message is '…are not set: no termchart viewer configured.' AGENTS.md claimed exit 3 with a non-matching hint, which can mislead an agent into a wrong retry path.
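To show why the documented code matters, here is a small sketch of a caller branching on the exit status. Only `EXIT_NO_VIEWER = 4` comes from the commit message; the function name and the returned labels are hypothetical, and no other termchart exit codes are assumed.

```python
EXIT_OK = 0
EXIT_NO_VIEWER = 4  # value cited from packages/cli/src/viewer-detect.ts in the commit message


def next_step(exit_code):
    """Map a termchart push/status exit code to a coarse follow-up action (hypothetical labels)."""
    if exit_code == EXIT_OK:
        return "continue"
    if exit_code == EXIT_NO_VIEWER:
        # Retrying cannot help: the viewer URL/token environment variables are unset.
        return "configure-viewer"
    return "inspect-stderr"
```

A caller written against the old documentation would test for 3 and never reach the `configure-viewer` branch.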

The diagram-recipes examples are loaded verbatim into agent context when an example is adapted. Pretty-printed, they were ~298 KB (the two *-matrix trees alone ~89 KB across ~2,800 lines). Minifying to compact JSON keeps the data identical but uses ~45% fewer bytes (305,190 -> 167,886), cutting the tokens an agent spends to load an example. The files are still valid JSON; flow-geometry.test.ts JSON.parses them, so it is unaffected. Fix T1 from the latency/token experiment plan. Refs #142.
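The minification step amounts to a parse and re-serialize with no whitespace. The sketch below shows the idea with Python's standard `json` module; the helper name is hypothetical, and the commit does not say which tool was actually used.

```python
import json


def minify_json(text):
    """Re-serialize JSON with no insignificant whitespace; the parsed data is unchanged."""
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)
```

Because only whitespace between tokens is dropped, any consumer that parses the file sees the same value as before.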

… gate

- entrypoint_cell.sh: OpenCode provider config (openai baseURL->proxy, --model openai/shared-model); capture runner exit code
- cell_record.py: success = clean exit + >=1 model call (runner-agnostic; OpenCode emits text not Claude JSON)
- README: runner status table (Claude Code + OpenCode working; AGY deferred - no custom base-URL to share the proxy/model) + matrix command

Refs #142

Replaces the coarse 'exit 0 + >=1 call' gate with a real correctness gate and adds the edit workload, the two axes P1 deferred.

- judge.py: rules gate (offline, per-type + MAX_* limits) + rubric LLM-judge via an injected call_fn; evaluate() => success = rules_pass and rubric_pass.
- capture.py + tc-shim.sh: artifact capture (terminal files / pulled viewer spec / status from the call log) + a termchart PATH shim logging every CLI call (termchart_calls, edit_via_patch).
- proxy_client.py / pull_board.py: stdlib judge client (judge-model alias) and a tolerant viewer board pull.
- Edit flow: Task.edit; two-turn cell in entrypoint_cell.sh; spend log sliced into gen/edit/judge windows so judge tokens never count toward diagram cost.
- RunRecord + report.md gain fidelity + an Edits section; dry-run synthesizes both.
- litellm judge-model alias (EXPERIMENT_JUDGE_MODEL, stronger model => no self-grading).
- 3 pilot tasks gain edit follow-ups.

70 pytest cases green (was 20); dry-run report renders fidelity + edits end-to-end.
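The two-stage gate described for judge.py can be sketched as below: a cheap offline rules check first, then a rubric score obtained through an injected `call_fn`, with `success = rules_pass and rubric_pass`. Everything beyond that formula is an assumption made for illustration: the artifact shape (`type`, `nodes`), the `max_nodes` limit standing in for the MAX_* limits, the JSON reply format, and the pass mark of 70.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    rules_pass: bool
    rubric_pass: bool
    fidelity: int

    @property
    def success(self):
        return self.rules_pass and self.rubric_pass


def rules_gate(artifact, expected_type, max_nodes):
    """Offline checks: artifact exists, has the right type, stays within the limit."""
    if not artifact:
        return False
    if artifact.get("type") != expected_type:
        return False
    return len(artifact.get("nodes", [])) <= max_nodes


def evaluate(artifact, expected_type, call_fn, max_nodes=50, pass_mark=70):
    """Run the rules gate, then ask the judge (via call_fn) for a 0-100 score."""
    if not rules_gate(artifact, expected_type, max_nodes):
        return Verdict(False, False, 0)  # no judge call is spent on a rules failure
    prompt = (
        "Score 0-100 how well this " + expected_type + " diagram matches the task. "
        'Reply as JSON like {"score": 80}.\n' + json.dumps(artifact)
    )
    reply = call_fn(prompt)
    try:
        score = int(json.loads(reply)["score"])
    except (ValueError, KeyError, TypeError):
        score = 0  # an unparseable judge reply counts as a rubric failure
    return Verdict(True, score >= pass_mark, score)
```

Injecting `call_fn` is what keeps the tests offline: a test passes a plain function returning a canned reply, while a live run passes a client bound to the judge-model alias.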

…); record pilot findings

- proxy_client.chat_usage/parse_usage: judge cost from the HTTP response, fixing judge_tokens=0 (the spend log races the proxy's async flush).
- cell_record: accumulate judge cost via the call_fn meter.
- README: first live-smoke findings: harness validated end-to-end; Gemini Flash hand-draws instead of invoking termchart (blocks measurement until fixed).

72 pytests green.
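Reading judge cost from the response instead of the spend log might look like the sketch below. It assumes the proxy returns an OpenAI-style chat completion body with a `usage` object (`prompt_tokens`, `completion_tokens`); the `MeteredCaller` class and its attribute names are hypothetical, not the harness's actual meter.

```python
def parse_usage(response):
    """Return (input_tokens, output_tokens) from a chat completion response dict."""
    usage = response.get("usage") or {}
    return int(usage.get("prompt_tokens") or 0), int(usage.get("completion_tokens") or 0)


class MeteredCaller:
    """Wrap a chat function and accumulate token usage across judge calls."""

    def __init__(self, chat_fn):
        self._chat_fn = chat_fn  # takes a prompt, returns the parsed response dict
        self.input_tokens = 0
        self.output_tokens = 0

    def __call__(self, prompt):
        response = self._chat_fn(prompt)
        tokens_in, tokens_out = parse_usage(response)
        self.input_tokens += tokens_in
        self.output_tokens += tokens_out
        return response["choices"][0]["message"]["content"]
```

Since the usage arrives in the same response as the verdict, the count does not depend on when the proxy flushes its log.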

…ills)

Real users install the plugin; the staged repo only ships AGENTS.md + plugin/skills (not a Claude Code-scanned location), so the runner never discovered the workflow. Copy diagram-recipes/termchart/inbox-watch into ~/.claude/skills so it's loadable.

Pilot result: skill loads (input tokens 38k->60k) but Gemini Flash still hand-draws instead of invoking termchart -> discoverability isn't the blocker; testing a stronger runner model next. judge_tokens now captured (1608) via the response-usage fix.

…o Gemini 3.x

Pilot finding: capable models hand-draw instead of invoking termchart. Gemini 2.5 Flash/Pro and even gemini-3-flash-preview produced hand-drawn ASCII ER diagrams (fid 30/70/95) with termchart_calls=0; the 3.x one PASSED the rubric. Installing the skills into ~/.claude/skills made the workflow discoverable (tokens 38k->60k) but did NOT change the behavior, so discoverability isn't the blocker.

- Gate now requires capture.termchart_used (terminal->render, viewer->push, status->status via the shim log); a hand-drawn win is scored 'no-termchart' so cost is attributed only to real termchart runs.
- Models default to Gemini 3.x (gemini-3-flash-preview runner / gemini-3.1-pro-preview judge) served from the 'global' location; Claude flip -> Sonnet/Opus.
- README: per-model findings table + the 'when is termchart worth it?' open question.

76 pytests green.
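The termchart_used requirement can be illustrated with a short sketch that checks the shim log for the subcommand each target needs. The target-to-subcommand mapping comes from the commit message; the log format (one line per CLI call, subcommand first) and the function names and the 'pass'/'fail' labels are assumptions.

```python
# Subcommand each task target must have invoked (mapping from the commit message).
REQUIRED_SUBCOMMAND = {"terminal": "render", "viewer": "push", "status": "status"}


def termchart_used(target, shim_log_lines):
    """True if the shim logged at least one call to the subcommand the target needs."""
    needed = REQUIRED_SUBCOMMAND[target]
    # Assumed log format: each line is the CLI arguments, subcommand first.
    return any(line.split()[:1] == [needed] for line in shim_log_lines)


def gate_label(target, shim_log_lines, rubric_pass):
    """Score a run; a rubric pass without a termchart call is not a termchart success."""
    if not termchart_used(target, shim_log_lines):
        return "no-termchart"
    return "pass" if rubric_pass else "fail"
```

Under this gate the hand-drawn ER diagram that passed the rubric would be labeled 'no-termchart', so its tokens stay out of the termchart cost comparison.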

ivanmkc-google added 11 commits June 11, 2026 12:47

docs(bench): spec for P1 correctness/fidelity judge + edit flow

5a5fda2

ivanmkc force-pushed the perf/bench-p1 branch from 6a3ed5c to eb19457 Compare June 11, 2026 12:48

ivanmkc wants to merge 11 commits into master from perf/bench-p1
