
perf(experiments): bench harness + P1 correctness/fidelity judge + edit flow #168

Open
ivanmkc wants to merge 11 commits into master from perf/bench-p1


@ivanmkc ivanmkc commented Jun 10, 2026


Summary

Standalone (off master) bench harness for the latency/token/fidelity climb, including the P0 foundation from #143 plus the P1 work that #143 deferred. (This branch carries #143's commits rebased onto current master, so it can land independently of that PR.)

Measures the end-to-end cost of an agent generating a termchart diagram across runners (Claude Code / OpenCode; AGY deferred), all driven headless through one LiteLLM proxy → Vertex Gemini, with tokens/latency from the proxy spend log.

What's new vs #143 (P1)

  • Real correctness/fidelity gate (judge.py) — replaces the coarse "exit 0 + ≥1 call" signal. Deterministic rules gate (artifact produced? right type? within MAX_*?) then a rubric LLM-judge graded by a separate, stronger judge-model alias so the model under test never grades itself. Judge tokens are sliced out of diagram cost. Cost still compared over successful diagrams only.
  • Edit (two-turn) flow: Task.edit follow-ups; the cell runs generate→edit in the same board, slicing the spend log into gen/edit/judge windows. A termchart PATH shim (tc-shim.sh) logs every CLI call → termchart_calls + the patch-vs-re-push signal (T4 lever).
  • Plumbing: capture.py, proxy_client.py, pull_board.py, loopback viewer + shim in entrypoint_cell.sh, judge-model alias in the proxy config, report.md fidelity column + Edits section, dry-run synthesizes both.
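
The two-stage gate described above (deterministic rules first, rubric judge second, with `success = rules_pass and rubric_pass`) could be sketched roughly as follows. This is an illustration, not the code in judge.py: the `Verdict` type, the artifact shape, `max_nodes`, and `pass_mark` are hypothetical, and the judge is an injected callable so the logic can be exercised offline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    rules_pass: bool
    rubric_pass: bool
    fidelity: int  # rubric score, 0-100 (hypothetical scale)

    @property
    def success(self) -> bool:
        return self.rules_pass and self.rubric_pass


def rules_gate(artifact: Optional[dict], expected_type: str, max_nodes: int) -> bool:
    """Deterministic checks: artifact exists, has the right type, stays within a limit."""
    if not artifact:
        return False
    if artifact.get("type") != expected_type:
        return False
    return len(artifact.get("nodes", [])) <= max_nodes


def evaluate(
    artifact: Optional[dict],
    expected_type: str,
    call_fn: Callable[[str], int],  # injected judge: prompt -> score
    *,
    max_nodes: int = 50,   # hypothetical stand-in for a MAX_* limit
    pass_mark: int = 80,   # hypothetical rubric threshold
) -> Verdict:
    # Fail fast on the rules gate so the judge model is never billed for junk.
    if not rules_gate(artifact, expected_type, max_nodes):
        return Verdict(False, False, 0)
    score = call_fn(f"Grade this {expected_type} diagram from 0 to 100: {artifact}")
    return Verdict(True, score >= pass_mark, score)
```

In tests, `call_fn` can be a plain function that returns a canned score, which is how the judge path stays network-free.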

Tests

  • 70 pytest cases green (was 20) — all offline, no network: rules gate on fixtures, judge parsing via injected caller, capture/shim parser, bounded spend slices, edit-cell record shape, dry-run pipeline.

Not yet

  • The live podman+Vertex path (viewer/pull/two-turn/judge call) is unit-green but still being validated in the pilot; a smoke run is in progress.
  • c1 carries T1 (minify) + T5 (exit-code); further T*/L* fixes land next.
  • AGY headless runner still deferred.

Spec: docs/superpowers/specs/2026-06-09-bench-judge-edit-flow-design.md
Plan: docs/plans/2026-06-07-latency-token-experimentation-plan.md

Experiment to reduce diagram-generation latency and tokens across three
runners (Claude Code, AGY, OpenCode), measured on VMs against baseline vs a
combined-fixes build. Decisions baked in: one shared model across runners and
a pilot-first first pass.
- corpus_run.py: --metrics-out emits per-render JSONL (Tier A spine) + test
- scripts/experiments: agent_run (Tier B orchestrator, --dry-run safe),
  metrics (RunRecord, median/p95/bootstrap CI, proxy-log parse), config
  (runners/conditions/tasks), aggregate (pool per-VM streams)
- pilot task suite, LiteLLM proxy config + spend logger callback
- GCE vm/startup.sh + vm/provision.sh (env-parameterized)
- pytest suite (20 tests, dry-run, no network)
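
For reference, the summary statistics named above (median, p95, bootstrap CI) can be computed with the standard library alone. The sketch below is an assumption about the approach, not the contents of the metrics module: the function names, the nearest-rank p95 definition, and the resample count are all illustrative.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (integer arithmetic avoids float rounding)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[max(rank, 1) - 1]


def bootstrap_ci(
    values: Sequence[float],
    stat: Callable[[Sequence[float]], float] = statistics.median,
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap: resample with replacement, take the stat, read off quantiles."""
    rng = random.Random(seed)  # seeded so reports are reproducible
    n = len(values)
    estimates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


latencies = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(statistics.median(latencies), p95(latencies), bootstrap_ci(latencies))
```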
…al slice)

- podman/: Containerfile (node+python+claude-code+opencode+litellm),
  run_local.sh (proxy container + per-cell containers + aggregate),
  entrypoint_cell.sh (build termchart per condition, run runner headless as
  non-root node, emit RunRecord)
- proxy: route shared-model -> Vertex Gemini 2.5 Flash via ADC; clean spend-log
  usage; Claude is a one-line EXPERIMENT_MODEL flip once Model Garden is enabled
- metrics: spend-log slice correlation (parse_proxy_log_slice, count_log_lines)
- cell_record.py: per-cell RunRecord from runner output + proxy spend slice
- proven: Claude Code in-container draws an ER diagram via termchart end-to-end
  on Vertex Gemini (57.7k in / 3.0k out tokens captured)
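
The spend-log slice correlation can be pictured as: record the log's line count before a cell runs, record it again afterwards, and attribute only the lines in between to that cell. The function names below come from the commit message, but their signatures and the log field names (`prompt_tokens`, `completion_tokens`) are assumptions made for this sketch.

```python
import json
from pathlib import Path


def count_log_lines(path: Path) -> int:
    """Number of lines currently in the spend log (0 if it does not exist yet)."""
    if not path.exists():
        return 0
    with path.open() as fh:
        return sum(1 for _ in fh)


def parse_proxy_log_slice(path: Path, start: int, end: int) -> dict:
    """Sum usage over JSONL lines in the half-open window [start, end)."""
    totals = {"calls": 0, "tokens_in": 0, "tokens_out": 0}
    with path.open() as fh:
        for i, line in enumerate(fh):
            if i < start:
                continue
            if i >= end:
                break
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            totals["calls"] += 1
            totals["tokens_in"] += rec.get("prompt_tokens", 0)    # assumed field name
            totals["tokens_out"] += rec.get("completion_tokens", 0)  # assumed field name
    return totals
```

A caller would take `start = count_log_lines(log)` before launching the runner and `end = count_log_lines(log)` after it exits, then pass both to `parse_proxy_log_slice`.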

Refs #142

When TERMCHART_VIEWER_URL/TOKEN are unset, push and status return
EXIT_NO_VIEWER=4 (packages/cli/src/viewer-detect.ts:15), and the message is
'…are not set: no termchart viewer configured.' AGENTS.md claimed exit 3 with a
non-matching hint, which can mislead an agent into a wrong retry path.

The diagram-recipes examples are loaded verbatim into agent context when an
example is adapted. Pretty-printed, they were ~298 KB (the two *-matrix trees
alone ~89 KB across ~2,800 lines). Minifying to compact JSON keeps the parsed data
identical while using ~45% fewer bytes (305,190 -> 167,886), cutting the tokens an
agent spends to load an example. The files are still valid JSON; flow-geometry.test.ts
only JSON.parses them, so it is unaffected.

Fix T1 from the latency/token experiment plan. Refs #142.
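
A minimal sketch of the minification step, assuming it amounts to a parse and re-serialize with compact separators (the helper name `minify_json` is hypothetical; the commit may have used a different tool). The round-trip assertion is the property that matters: the parsed data must not change.

```python
import json


def minify_json(text: str) -> str:
    """Re-serialize JSON with no insignificant whitespace; key order is preserved."""
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)


pretty = json.dumps(
    {"nodes": [{"id": "a"}, {"id": "b"}], "edges": [["a", "b"]]},
    indent=2,
)
compact = minify_json(pretty)

# Same data after parsing, fewer bytes on disk and in agent context.
assert json.loads(compact) == json.loads(pretty)
print(len(pretty), "->", len(compact))
```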
… gate

- entrypoint_cell.sh: OpenCode provider config (openai baseURL->proxy,
  --model openai/shared-model); capture runner exit code
- cell_record.py: success = clean exit + >=1 model call (runner-agnostic; OpenCode
  emits text not Claude JSON)
- README: runner status table (Claude Code + OpenCode working; AGY deferred -
  no custom base-URL to share the proxy/model) + matrix command

Refs #142

Replaces the coarse 'exit 0 + >=1 call' gate with a real correctness gate and
adds the edit workload, the two axes P1 deferred.

- judge.py: rules gate (offline, per-type + MAX_* limits) + rubric LLM-judge via
  an injected call_fn; evaluate() => success = rules_pass and rubric_pass.
- capture.py + tc-shim.sh: artifact capture (terminal files / pulled viewer spec /
  status from the call log) + a termchart PATH shim logging every CLI call
  (termchart_calls, edit_via_patch).
- proxy_client.py / pull_board.py: stdlib judge client (judge-model alias) and a
  tolerant viewer board pull.
- Edit flow: Task.edit; two-turn cell in entrypoint_cell.sh; spend log sliced into
  gen/edit/judge windows so judge tokens never count toward diagram cost.
- RunRecord + report.md gain fidelity + an Edits section; dry-run synthesizes both.
- litellm judge-model alias (EXPERIMENT_JUDGE_MODEL, stronger model => no self-grading).
- 3 pilot tasks gain edit follow-ups.

70 pytest cases green (was 20); dry-run report renders fidelity + edits end-to-end.
…); record pilot findings

- proxy_client.chat_usage/parse_usage: judge cost from the HTTP response, fixing
  judge_tokens=0 (the spend log races the proxy's async flush).
- cell_record: accumulate judge cost via the call_fn meter.
- README: first live-smoke findings — harness validated end-to-end; Gemini Flash
  hand-draws instead of invoking termchart (blocks measurement until fixed).

72 pytests green.
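
Reading judge cost from the HTTP response rather than the spend log sidesteps the async-flush race, because an OpenAI-compatible chat response carries a `usage` object with `prompt_tokens`, `completion_tokens`, and `total_tokens`. The sketch below is an assumption about how such a meter might look; the real `parse_usage` signature and the `UsageMeter` class are not taken from the PR.

```python
def parse_usage(response: dict) -> dict:
    """Extract token counts from an OpenAI-style chat completion response body."""
    usage = response.get("usage") or {}
    tokens_in = int(usage.get("prompt_tokens") or 0)
    tokens_out = int(usage.get("completion_tokens") or 0)
    total = int(usage.get("total_tokens") or (tokens_in + tokens_out))
    return {"tokens_in": tokens_in, "tokens_out": tokens_out, "total": total}


class UsageMeter:
    """Accumulates judge usage across calls (hypothetical helper)."""

    def __init__(self) -> None:
        self.tokens_in = 0
        self.tokens_out = 0

    def add(self, response: dict) -> None:
        parsed = parse_usage(response)
        self.tokens_in += parsed["tokens_in"]
        self.tokens_out += parsed["tokens_out"]

    @property
    def total(self) -> int:
        return self.tokens_in + self.tokens_out
```

Wrapping the judge's `call_fn` so that every response passes through `UsageMeter.add` keeps the judge total available as soon as the call returns.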
…ills)

Real users install the plugin; the staged repo only ships AGENTS.md + plugin/skills
(not a Claude Code-scanned location), so the runner never discovered the workflow.
Copy diagram-recipes/termchart/inbox-watch into ~/.claude/skills so it's loadable.

Pilot result: skill loads (input tokens 38k->60k) but Gemini Flash still hand-draws
instead of invoking termchart -> discoverability isn't the blocker; testing a
stronger runner model next. judge_tokens now captured (1608) via the response-usage fix.
…o Gemini 3.x

Pilot finding: capable models hand-draw instead of invoking termchart. Gemini
2.5 Flash/Pro and even gemini-3-flash-preview produced hand-drawn ASCII ER diagrams
(fid 30/70/95) with termchart_calls=0 — the 3.x one PASSED the rubric. Installing the
skills into ~/.claude/skills made the workflow discoverable (tokens 38k->60k) but did
NOT change the behavior, so discoverability isn't the blocker.

- Gate now requires capture.termchart_used (terminal->render, viewer->push,
  status->status via the shim log); a hand-drawn win is scored 'no-termchart' so cost
  is attributed only to real termchart runs.
- Models default to Gemini 3.x (gemini-3-flash-preview runner / gemini-3.1-pro-preview
  judge) served from the 'global' location; Claude flip -> Sonnet/Opus.
- README: per-model findings table + the 'when is termchart worth it?' open question.

76 pytests green.
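
The termchart-used gate maps each task target to the CLI subcommand that must appear in the shim log (terminal->render, viewer->push, status->status). A rough sketch follows; the log format (one call per line, subcommand as the first token) and the function signature are assumptions, not the shim's actual output.

```python
from typing import Iterable

# Target -> termchart subcommand that proves the tool was really invoked.
REQUIRED_SUBCOMMAND = {
    "terminal": "render",
    "viewer": "push",
    "status": "status",
}


def termchart_used(target: str, shim_log_lines: Iterable[str]) -> bool:
    """True if the shim log shows the subcommand required for this target."""
    required = REQUIRED_SUBCOMMAND[target]
    for line in shim_log_lines:
        argv = line.split()
        if argv and argv[0] == required:  # assumed: subcommand is the first token
            return True
    return False


def score_label(rubric_pass: bool, used: bool) -> str:
    """A hand-drawn diagram that passes the rubric is still not a termchart run."""
    if not used:
        return "no-termchart"
    return "pass" if rubric_pass else "fail"
```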