Spike: commit-replay retrieval eval harness (measure-first foundation)

**Spike — measure-first foundation.** Build a commit-replay retrieval eval harness. rag-rat has NO task-shaped retrieval eval today — only the SCIP oracle (edge-resolution precision) and a hand-authored static fixture (`eval.rs`). Replay gives free, auto-growing ground truth: the diff IS the gold.

**Mechanism:** for a past PR/commit, check out the parent, index the pre-state, use the issue/commit text as the query, and score whether retrieval / `edit_brief` surfaces the files/symbols/tests the merge actually touched.
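A minimal sketch of the scoring step, assuming file-level gold taken from the diff. `ReplayCase`, `recall_at_k`, and `mean_recall_at_k` are hypothetical names for illustration, not existing rag-rat APIs; the `retrieve` closure stands in for whatever retrieval / `edit_brief` call the harness ends up wrapping.

```rust
use std::collections::HashSet;

/// One replayed commit (hypothetical type): the query text taken from the
/// issue/commit message, and the files the merge actually touched.
struct ReplayCase {
    query: String,
    gold_files: HashSet<String>,
}

/// Fraction of gold files that appear in the top-k retrieved results.
/// Returns None when the commit touched no files, so there is nothing to score.
fn recall_at_k(retrieved: &[String], gold: &HashSet<String>, k: usize) -> Option<f64> {
    if gold.is_empty() {
        return None;
    }
    let top_k: HashSet<&String> = retrieved.iter().take(k).collect();
    let hits = gold.iter().filter(|f| top_k.contains(f)).count();
    Some(hits as f64 / gold.len() as f64)
}

/// Mean recall@k over all scorable cases. `retrieve` is a stand-in for the
/// system under test, run against the index built from the parent commit.
fn mean_recall_at_k<F>(cases: &[ReplayCase], k: usize, retrieve: F) -> Option<f64>
where
    F: Fn(&str) -> Vec<String>,
{
    let scores: Vec<f64> = cases
        .iter()
        .filter_map(|c| recall_at_k(&retrieve(&c.query), &c.gold_files, k))
        .collect();
    if scores.is_empty() {
        None
    } else {
        Some(scores.iter().sum::<f64>() / scores.len() as f64)
    }
}
```

Checking out the parent and indexing the pre-state are left out on purpose; this only shows how the diff becomes the gold set once retrieval results are in hand.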

**Caveats to design in from the start:**
- **Label noise** — a PR diff bundles refactors/formatting; weight by changed-*symbol*, exclude bulk/mechanical commits, treat file-recall as a loose upper bound.
- **Leakage / time-split** — PR/issue bodies are often written post-fix and name the solution → prefer the issue body *at open time* as the query; if you ever TRAIN on replay, split by time (old→train, recent→eval) or you measure memorization.
- **Volume** — this repo is ~222 commits: enough for an eval harness + few-shot calibration + hard-negative mining, NOT to fine-tune a retriever from scratch.
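Two of these caveats can be sketched as a single pre-processing pass: drop bulk/mechanical commits, then split what remains by time. The `Candidate` type, its fields, and the `max_files` threshold are all assumptions for illustration; the real heuristics for "mechanical" would need tuning against this repo's history.

```rust
/// Hypothetical record for one replay candidate.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    sha: String,
    committed_at: i64,      // Unix seconds
    files_changed: usize,   // files touched by the diff
    symbols_changed: usize, // symbols touched by the diff (0 = formatting-only)
}

/// Heuristic (an assumption, not a rag-rat rule): treat a commit as
/// bulk/mechanical if it touches too many files or changes no symbols.
fn is_mechanical(c: &Candidate, max_files: usize) -> bool {
    c.files_changed > max_files || c.symbols_changed == 0
}

/// Drop mechanical commits, then split by time: commits before `cutoff`
/// go to the first (train/calibration) set, the rest to the eval set.
fn time_split(
    candidates: Vec<Candidate>,
    cutoff: i64,
    max_files: usize,
) -> (Vec<Candidate>, Vec<Candidate>) {
    candidates
        .into_iter()
        .filter(|c| !is_mechanical(c, max_files))
        .partition(|c| c.committed_at < cutoff)
}
```

Splitting on commit time rather than at random keeps recent commits out of anything used for calibration, which is the point of the leakage caveat above.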

**Why first:** you can't improve what you can't measure, and you can't safely turn on fact generation / dream mode without measurement either. Everything downstream rides on this.

Ref: `docs/plans/2026-06-14-agent-value-strategy.md` §4d.


Spike: commit-replay retrieval eval harness (measure-first foundation) #120
