perf(experiments): bench harness + P1 correctness/fidelity judge + edit flow #168
Open
ivanmkc wants to merge 11 commits into
Experiment to reduce diagram-generation latency and tokens across three runners (Claude Code, AGY, OpenCode), measured on VMs against baseline vs a combined-fixes build. Decisions baked in: one shared model across runners and a pilot-first first pass.
- corpus_run.py: --metrics-out emits per-render JSONL (Tier A spine) + test
- scripts/experiments: agent_run (Tier B orchestrator, --dry-run safe), metrics (RunRecord, median/p95/bootstrap CI, proxy-log parse), config (runners/conditions/tasks), aggregate (pool per-VM streams)
- pilot task suite, LiteLLM proxy config + spend logger callback
- GCE vm/startup.sh + vm/provision.sh (env-parameterized)
- pytest suite (20 tests, dry-run, no network)
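The summary statistics named above (median, p95, bootstrap CI) can be sketched as follows. This is a minimal illustration, not the harness's `metrics` module: the function names, the nearest-rank percentile definition, and the percentile-bootstrap method are all assumptions.

```python
import math
import random
import statistics


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def bootstrap_ci(values, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for `stat` (assumed method)."""
    rng = random.Random(seed)  # seeded so repeated reports are reproducible
    n = len(values)
    estimates = sorted(
        stat([rng.choice(values) for _ in range(n)]) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return estimates[lo_idx], estimates[hi_idx]
```

With only a handful of runs per cell, the interval will be wide; reporting it next to the median makes that visible instead of hiding it behind a point estimate.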
…al slice)

- podman/: Containerfile (node+python+claude-code+opencode+litellm), run_local.sh (proxy container + per-cell containers + aggregate), entrypoint_cell.sh (build termchart per condition, run runner headless as non-root node, emit RunRecord)
- proxy: route shared-model -> Vertex Gemini 2.5 Flash via ADC; clean spend-log usage; Claude is a one-line EXPERIMENT_MODEL flip once Model Garden is enabled
- metrics: spend-log slice correlation (parse_proxy_log_slice, count_log_lines)
- cell_record.py: per-cell RunRecord from runner output + proxy spend slice
- proven: Claude Code in-container draws an ER diagram via termchart end-to-end on Vertex Gemini (57.7k in / 3.0k out tokens captured)

Refs #142
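The spend-log slice correlation can be pictured as recording the log's line count before and after a cell runs, then summing usage over just the lines in between. The sketch below reuses the names `count_log_lines` and `parse_proxy_log_slice` from the commit message, but their signatures, the JSONL layout, and the `prompt_tokens`/`completion_tokens` field names are assumptions made for illustration.

```python
import json


def count_log_lines(path):
    """Number of physical lines currently in the spend log."""
    with open(path, encoding="utf-8") as fh:
        return sum(1 for _ in fh)


def parse_proxy_log_slice(path, start, end):
    """Sum usage over log lines in the half-open window [start, end)."""
    totals = {"calls": 0, "input_tokens": 0, "output_tokens": 0}
    with open(path, encoding="utf-8") as fh:
        for index, line in enumerate(fh):
            if index < start or index >= end or not line.strip():
                continue
            entry = json.loads(line)
            totals["calls"] += 1
            # Field names are hypothetical; adapt to the real logger output.
            totals["input_tokens"] += int(entry.get("prompt_tokens", 0))
            totals["output_tokens"] += int(entry.get("completion_tokens", 0))
    return totals
```

A cell would call `count_log_lines` just before starting the runner and again after it exits, then pass both offsets to `parse_proxy_log_slice`. Taking one offset per phase boundary extends the same idea to separate windows.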
When TERMCHART_VIEWER_URL/TOKEN are unset, push and status return EXIT_NO_VIEWER=4 (packages/cli/src/viewer-detect.ts:15), and the message is '…are not set: no termchart viewer configured.' AGENTS.md claimed exit 3 with a non-matching hint, which can mislead an agent into a wrong retry path.
The diagram-recipes examples are loaded verbatim into agent context when an example is adapted. Pretty-printed, they were ~298 KB (the two *-matrix trees alone ~89 KB across ~2,800 lines). Minifying to compact JSON is byte-for-byte the same data but ~45% fewer bytes (305,190 -> 167,886), cutting tokens an agent spends to load an example. Still valid JSON; flow-geometry.test.ts JSON.parses them so it is unaffected. Fix T1 from the latency/token experiment plan. Refs #142.
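A minimal sketch of the minification step, assuming the examples are plain JSON files; the helper name `minify_json` is hypothetical and this is not necessarily the script used for the commit. It re-serializes with compact separators and checks that the parsed data is unchanged.

```python
import json


def minify_json(text):
    """Return compact JSON carrying the same data as `text`."""
    data = json.loads(text)
    compact = json.dumps(data, separators=(",", ":"), ensure_ascii=False)
    # Guard: the compact form must parse back to exactly the same data.
    if json.loads(compact) != data:
        raise ValueError("minified JSON does not round-trip")
    return compact
```

One caveat of round-tripping through a parser: number formatting can change (for example `1.0e2` becomes `100.0`) and duplicate keys collapse, so the equality check above compares parsed data, not the original text.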
… gate

- entrypoint_cell.sh: OpenCode provider config (openai baseURL->proxy, --model openai/shared-model); capture runner exit code
- cell_record.py: success = clean exit + >=1 model call (runner-agnostic; OpenCode emits text not Claude JSON)
- README: runner status table (Claude Code + OpenCode working; AGY deferred - no custom base-URL to share the proxy/model) + matrix command

Refs #142
Replaces the coarse 'exit 0 + >=1 call' gate with a real correctness gate and adds the edit workload, the two axes P1 deferred.

- judge.py: rules gate (offline, per-type + MAX_* limits) + rubric LLM-judge via an injected call_fn; evaluate() => success = rules_pass and rubric_pass.
- capture.py + tc-shim.sh: artifact capture (terminal files / pulled viewer spec / status from the call log) + a termchart PATH shim logging every CLI call (termchart_calls, edit_via_patch).
- proxy_client.py / pull_board.py: stdlib judge client (judge-model alias) and a tolerant viewer board pull.
- Edit flow: Task.edit; two-turn cell in entrypoint_cell.sh; spend log sliced into gen/edit/judge windows so judge tokens never count toward diagram cost.
- RunRecord + report.md gain fidelity + an Edits section; dry-run synthesizes both.
- litellm judge-model alias (EXPERIMENT_JUDGE_MODEL, stronger model => no self-grading).
- 3 pilot tasks gain edit follow-ups.

70 pytest cases green (was 20); dry-run report renders fidelity + edits end-to-end.
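The two-stage gate can be sketched roughly as below. Only `evaluate`, `call_fn`, `rules_pass`, and `rubric_pass` come from the commit message; the artifact shape, the `MAX_NODES` limit, the pass threshold of 70, and the choice to skip the judge when the rules gate fails are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_NODES = 50  # hypothetical stand-in for one of the MAX_* limits


@dataclass
class Verdict:
    rules_pass: bool
    rubric_pass: bool
    fidelity: Optional[int]  # rubric score, None if the judge never ran

    @property
    def success(self) -> bool:
        return self.rules_pass and self.rubric_pass


def rules_gate(artifact: dict, expected_type: str) -> bool:
    """Offline checks: artifact produced, right type, within limits."""
    if not artifact:
        return False
    if artifact.get("type") != expected_type:
        return False
    return len(artifact.get("nodes", [])) <= MAX_NODES


def evaluate(artifact: dict, expected_type: str,
             call_fn: Callable[[dict], int], pass_score: int = 70) -> Verdict:
    """Run the rules gate, then the injected rubric judge."""
    if not rules_gate(artifact, expected_type):
        return Verdict(False, False, None)  # assumed: no judge call on a rules failure
    score = call_fn(artifact)
    return Verdict(True, score >= pass_score, score)
```

Injecting `call_fn` keeps the rules path testable offline: a test passes a stub that returns a fixed score, while a live cell passes a client bound to the judge-model alias.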
…); record pilot findings

- proxy_client.chat_usage/parse_usage: judge cost from the HTTP response, fixing judge_tokens=0 (the spend log races the proxy's async flush).
- cell_record: accumulate judge cost via the call_fn meter.
- README: first live-smoke findings: harness validated end-to-end; Gemini Flash hand-draws instead of invoking termchart (blocks measurement until fixed).

72 pytests green.
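Reading judge cost from the response body avoids the race with the spend log's flush. A sketch of that idea follows, assuming the proxy returns an OpenAI-style chat completion whose `usage` object carries `prompt_tokens` and `completion_tokens`. `parse_usage` is named in the commit, but its signature here and the `UsageMeter` wrapper are hypothetical.

```python
def parse_usage(response: dict) -> dict:
    """Extract token counts from an OpenAI-style chat completion body."""
    usage = response.get("usage") or {}
    prompt = int(usage.get("prompt_tokens") or 0)
    completion = int(usage.get("completion_tokens") or 0)
    return {"input_tokens": prompt, "output_tokens": completion}


class UsageMeter:
    """Wrap a send function and accumulate usage across calls."""

    def __init__(self, send):
        self.send = send  # callable: payload dict -> response dict
        self.input_tokens = 0
        self.output_tokens = 0

    def __call__(self, payload: dict) -> dict:
        response = self.send(payload)
        usage = parse_usage(response)
        self.input_tokens += usage["input_tokens"]
        self.output_tokens += usage["output_tokens"]
        return response

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens
```

Because the meter sees each response synchronously, its totals are complete as soon as the last judge call returns, regardless of when the proxy writes its log.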
…ills)

Real users install the plugin; the staged repo only ships AGENTS.md + plugin/skills (not a Claude Code-scanned location), so the runner never discovered the workflow. Copy diagram-recipes/termchart/inbox-watch into ~/.claude/skills so it's loadable.

Pilot result: skill loads (input tokens 38k->60k) but Gemini Flash still hand-draws instead of invoking termchart -> discoverability isn't the blocker; testing a stronger runner model next. judge_tokens now captured (1608) via the response-usage fix.
…o Gemini 3.x

Pilot finding: capable models hand-draw instead of invoking termchart. Gemini 2.5 Flash/Pro and even gemini-3-flash-preview produced hand-drawn ASCII ER diagrams (fid 30/70/95) with termchart_calls=0; the 3.x one PASSED the rubric. Installing the skills into ~/.claude/skills made the workflow discoverable (tokens 38k->60k) but did NOT change the behavior, so discoverability isn't the blocker.

- Gate now requires capture.termchart_used (terminal->render, viewer->push, status->status via the shim log); a hand-drawn win is scored 'no-termchart' so cost is attributed only to real termchart runs.
- Models default to Gemini 3.x (gemini-3-flash-preview runner / gemini-3.1-pro-preview judge) served from the 'global' location; Claude flip -> Sonnet/Opus.
- README: per-model findings table + the 'when is termchart worth it?' open question.

76 pytests green.
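The `termchart_used` gate can be illustrated as a lookup from output mode to the subcommand that must appear in the shim log. The mode-to-subcommand mapping comes from the commit message; the log format (one line per CLI call, subcommand as the first token) and the function signatures are assumptions.

```python
# Mapping from the commit message: terminal->render, viewer->push, status->status.
REQUIRED_SUBCOMMAND = {"terminal": "render", "viewer": "push", "status": "status"}


def termchart_calls(call_log_lines):
    """Count non-empty shim log lines (assumed: one line per CLI call)."""
    return sum(1 for line in call_log_lines if line.strip())


def termchart_used(mode, call_log_lines):
    """True if the subcommand required for `mode` appears in the shim log."""
    required = REQUIRED_SUBCOMMAND[mode]
    return any(line.split()[:1] == [required] for line in call_log_lines)
```

Under this gate a run with an empty call log fails `termchart_used` however good its hand-drawn output is, which is what lets it be scored 'no-termchart' instead of counting toward termchart cost.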
Summary
Standalone (off `master`) bench harness for the latency/token/fidelity climb, including the P0 foundation from #143 plus the P1 work that #143 deferred. (This branch carries #143's commits rebased onto current `master`, so it can land independently of that PR.)

Measures the end-to-end cost of an agent generating a termchart diagram across runners (Claude Code / OpenCode; AGY deferred), all driven headless through one LiteLLM proxy → Vertex Gemini, with tokens/latency from the proxy spend log.
What's new vs #143 (P1)
- `judge.py`: replaces the coarse "exit 0 + ≥1 call" signal. Deterministic rules gate (artifact produced? right type? within `MAX_*`?) then a rubric LLM-judge graded by a separate, stronger `judge-model` alias so the model under test never grades itself. Judge tokens are sliced out of diagram cost. Cost still compared over successful diagrams only.
- `Task.edit` follow-ups; the cell runs generate→edit in the same board, slicing the spend log into gen/edit/judge windows. A `termchart` PATH shim (`tc-shim.sh`) logs every CLI call → `termchart_calls` + the `patch`-vs-re-`push` signal (T4 lever).
- `capture.py`, `proxy_client.py`, `pull_board.py`, loopback viewer + shim in `entrypoint_cell.sh`, `judge-model` alias in the proxy config, `report.md` fidelity column + Edits section, dry-run synthesizes both.

Tests
Not yet
`c1` carries T1 (minify) + T5 (exit-code); further T*/L* fixes land next.

Spec: `docs/superpowers/specs/2026-06-09-bench-judge-edit-flow-design.md`

Plan: `docs/plans/2026-06-07-latency-token-experimentation-plan.md`