Summary
Add proof instrumentation so the durable Auto Review concept can be evaluated with data, not vibes.
Scope
Emit structured counters/events for run lifecycle, duplicate reuse/skips, supersede/cancel reasons, findings surfaced/inspected/applied/dismissed, ledger tokens, detail tokens, and token-spend estimates.
Add a diagnostic surface such as /review-stats if useful.
Build deterministic scanners or fixtures for stale/superseded/duplicate/fix-train signatures from logs/rollouts where appropriate.
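The counters described above could be sketched roughly as follows. This is a minimal illustration, not the code merged in #381: the `ReviewEvent` variants, the `ReviewCounters` type, and the definition of the duplicate rate (duplicates recognized divided by all review requests) are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical lifecycle and disposition events; variant names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ReviewEvent {
    RunStarted,
    RunCompleted,
    DuplicateReused,
    DuplicateSkipped,
    Superseded,
    Cancelled,
    FindingSurfaced,
    FindingInspected,
    FindingApplied,
    FindingDismissed,
}

/// Counters keyed by event kind. A BTreeMap keeps iteration order
/// deterministic, which helps when counters are compared against fixtures.
#[derive(Debug, Default)]
pub struct ReviewCounters {
    counts: BTreeMap<ReviewEvent, u64>,
}

impl ReviewCounters {
    /// Increment the counter for one event.
    pub fn record(&mut self, event: ReviewEvent) {
        *self.counts.entry(event).or_insert(0) += 1;
    }

    /// Read a counter; events never recorded count as zero.
    pub fn get(&self, event: ReviewEvent) -> u64 {
        self.counts.get(&event).copied().unwrap_or(0)
    }

    /// Assumed definition: duplicates (reused + skipped) divided by all
    /// review requests (runs started + duplicates). None when nothing was requested.
    pub fn duplicate_rate(&self) -> Option<f64> {
        let duplicates =
            self.get(ReviewEvent::DuplicateReused) + self.get(ReviewEvent::DuplicateSkipped);
        let requests = self.get(ReviewEvent::RunStarted) + duplicates;
        if requests == 0 {
            None
        } else {
            Some(duplicates as f64 / requests as f64)
        }
    }
}
```

Because the counters are plain in-memory state, they can be driven from recorded logs or fixtures without a live TUI.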
Acceptance Criteria
Metrics include duplicate review rate, skipped/adopted/superseded/cancelled counts, unsurfaced terminal findings, ledger overhead, avoided token estimate, time to surface findings, and finding usefulness/disposition.
Each Auto Review run records enough proof data to explain latency: model, reasoning effort, resolve model/effort, phase timing, follow-up count, token count when available, prompt token estimate, and terminal reason.
Restart recovery and duplicate avoidance are testable without a live TUI where possible.
Dogfood diagnostics can compare before/after behavior across real sessions and identify whether slowness came from first review pass, follow-up loops, worktree/lock contention, retries, or prompt bloat.
Metrics do not inject bulky telemetry into normal assistant context; ordinary turns receive only bounded actionable review state.
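One way to picture the per-run proof data and the latency attribution in the criteria above is the sketch below. The `RunProof` struct, the `Phase` variants, and the field names are hypothetical and chosen only to mirror the fields listed in the criteria; the real record in the compact ledger may look different.

```rust
use std::time::Duration;

/// Hypothetical phases a run can spend time in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    FirstPass,
    FollowUp,
    LockWait,
    Retry,
}

/// Illustrative per-run record carrying the fields named in the criteria.
#[derive(Debug, Clone)]
pub struct RunProof {
    pub model: String,
    pub reasoning_effort: String,
    pub resolve_model: Option<String>,
    pub resolve_effort: Option<String>,
    pub follow_up_count: u32,
    /// Present only when the provider reports a token count.
    pub token_count: Option<u64>,
    pub prompt_token_estimate: u64,
    pub terminal_reason: String,
    /// One entry per timed span; a phase may appear more than once.
    pub phase_timings: Vec<(Phase, Duration)>,
}

impl RunProof {
    /// Sum of all recorded spans.
    pub fn total_duration(&self) -> Duration {
        self.phase_timings.iter().map(|(_, d)| *d).sum()
    }

    /// The phase with the largest accumulated time, if any spans were recorded.
    pub fn dominant_phase(&self) -> Option<Phase> {
        let mut totals: Vec<(Phase, Duration)> = Vec::new();
        for (phase, d) in &self.phase_timings {
            match totals.iter_mut().find(|entry| entry.0 == *phase) {
                Some((_, total)) => *total += *d,
                None => totals.push((*phase, *d)),
            }
        }
        totals.into_iter().max_by_key(|(_, d)| *d).map(|(p, _)| p)
    }
}
```

A diagnostic could then report `dominant_phase()` per run to show whether slowness came from the first pass, follow-up loops, lock contention, or retries. The record stays out of normal assistant context and is read only by diagnostics.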
Finish Line
Every Code emits enough Auto Review metrics and diagnostics to prove duplicate review reduction, avoided token spend, surfaced findings, ledger overhead, restart recovery, and finding usefulness during dogfooding.
Current Status
State: Scoped proof-metrics implementation merged.
Merged PR: #381 feat(auto-review): add proof metrics to compact ledger
Merge commit: 5bb9fbf704aa2968bac1e44f328e8f4b3d0c458c
Branch: fix/auto-review-proof-metrics (remote branch deleted after merge).
Next action: #331 Auto Review lifecycle docs can proceed against the merged durable Auto Review behavior. Broader prompt/context/token-budget/request-shape accounting remains gated by #92 and should not be started while the token-count refactor is active.
Blocked by: None for #331 docs. Broader prompt/context/token metrics remain blocked by #92.
Last verified: 2026-06-05 after focused tests, required build, PR CI, Claude review, and merge.
Future polish noted by review: add symmetric tests for clean Failed/Lost omission and Cancelled/Lost with error detail if those paths are touched again.
Relationships
Parent: #324
Depends on: #325, #327, #329
Related: #43, #50
Completed in #381:
Review and validation:
cargo test -p code-core compact_ledger --lib
./build-fast.sh
Residual scope: