Spike — measure-first foundation. Build a commit-replay retrieval eval harness. rag-rat has NO task-shaped retrieval eval today — only the SCIP oracle (edge-resolution precision) and a hand-authored static fixture (eval.rs). Replay gives free, auto-growing ground truth: the diff IS the gold.
Mechanism: for a past PR/commit, check out the parent, index the pre-state, use the issue/commit text as the query, and score whether retrieval / edit_brief surfaces the files/symbols/tests the merge actually touched.
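A minimal sketch of the scoring step in the mechanism above, assuming the gold set comes from the output of `git diff --name-only <parent> <commit>` and that retrieval returns a ranked list of file paths. `gold_from_name_only` and `recall_at_k` are hypothetical names for illustration, not existing rag-rat functions.

```rust
use std::collections::HashSet;

/// Parse the output of `git diff --name-only <parent> <commit>` into a gold set.
/// Hypothetical helper: blank lines are skipped, paths are kept verbatim.
pub fn gold_from_name_only(output: &str) -> HashSet<String> {
    output
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Recall@k: the fraction of gold files that appear in the top-k retrieved paths.
/// Returns None when the gold set is empty, since there is nothing to score against.
pub fn recall_at_k(retrieved: &[String], gold: &HashSet<String>, k: usize) -> Option<f64> {
    if gold.is_empty() {
        return None;
    }
    // Only the first k ranked results count as "surfaced".
    let top: HashSet<&str> = retrieved.iter().take(k).map(String::as_str).collect();
    let hits = gold.iter().filter(|g| top.contains(g.as_str())).count();
    Some(hits as f64 / gold.len() as f64)
}
```

The same shape would apply to symbols or tests in place of file paths; only the gold extraction changes.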
Caveats to design in from the start:
- Label noise — a PR diff bundles refactors/formatting; weight by changed-symbol, exclude bulk/mechanical commits, treat file-recall as a loose upper bound.
- Leakage / time-split — PR/issue bodies are often written post-fix and name the solution → prefer the issue body at open time as the query; if you ever TRAIN on replay, split by time (old→train, recent→eval) or you measure memorization.
- Volume — this repo is ~222 commits: enough for an eval harness + few-shot calibration + hard-negative mining, NOT to fine-tune a retriever from scratch.
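The first two caveats could be designed in roughly as follows. This is a sketch under stated assumptions: the `Replay` struct, its fields, the per-symbol weights, and the file-count threshold for "bulk" commits are all hypothetical and would need tuning against the real history.

```rust
/// One replayed commit with the metadata the caveats need (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Replay {
    pub sha: String,
    pub committed_at: i64,                   // commit time, unix seconds
    pub files_changed: usize,                // size of the diff in files
    pub changed_symbols: Vec<(String, f64)>, // (symbol, weight), weight assumed >= 0
}

/// Label noise: treat commits touching more than `max_files` as bulk/mechanical
/// and exclude them. The threshold is an assumption, not a measured value.
pub fn is_bulk(replay: &Replay, max_files: usize) -> bool {
    replay.files_changed > max_files
}

/// Label noise: recall weighted by changed symbol instead of raw file recall.
/// Returns None when the commit has no weighted symbols to score.
pub fn weighted_symbol_recall(replay: &Replay, retrieved: &[&str]) -> Option<f64> {
    let total: f64 = replay.changed_symbols.iter().map(|(_, w)| w).sum();
    if total <= 0.0 {
        return None;
    }
    let hit: f64 = replay
        .changed_symbols
        .iter()
        .filter(|(symbol, _)| retrieved.contains(&symbol.as_str()))
        .map(|(_, w)| w)
        .sum();
    Some(hit / total)
}

/// Leakage: split by time so older commits train and recent commits evaluate.
/// Commits strictly before `cutoff` go to train, the rest go to eval.
pub fn time_split(mut replays: Vec<Replay>, cutoff: i64) -> (Vec<Replay>, Vec<Replay>) {
    replays.sort_by_key(|r| r.committed_at);
    replays.into_iter().partition(|r| r.committed_at < cutoff)
}
```

Filtering with `is_bulk` before scoring keeps mechanical commits from dominating the average, and a single cutoff timestamp keeps the train and eval sets from interleaving.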
Why first: you can't improve (or safely turn on fact generation / dream mode) what you can't measure. Everything downstream rides on this.
Ref: docs/plans/2026-06-14-agent-value-strategy.md §4d.