Add Auto Review proof metrics and dogfood diagnostics #330

@shiny-code-bot

Summary

Add proof instrumentation so the durable Auto Review concept can be evaluated with data, not vibes.

Scope

  • Emit structured counters/events for run lifecycle, duplicate reuse/skips, supersede/cancel reasons, findings surfaced/inspected/applied/dismissed, ledger tokens, detail tokens, and token-spend estimates.
  • Add a diagnostic surface such as /review-stats if useful.
  • Build deterministic scanners or fixtures for stale/superseded/duplicate/fix-train signatures from logs/rollouts where appropriate.
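The structured counters described in the scope could look roughly like the sketch below, assuming a simple in-memory collector. The class name `ReviewMetrics`, the event names, and the `event:reason` key format are hypothetical and are not taken from the Every Code codebase.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReviewMetrics:
    """Hypothetical in-memory collector for Auto Review counters."""

    counters: Counter = field(default_factory=Counter)
    ledger_tokens: int = 0
    detail_tokens: int = 0

    def record(self, event: str, reason: Optional[str] = None) -> None:
        # Fold the optional reason into the key so supersede/cancel
        # reasons are counted separately, e.g. "run_cancelled:superseded".
        key = event if reason is None else f"{event}:{reason}"
        self.counters[key] += 1

    def add_tokens(self, ledger: int = 0, detail: int = 0) -> None:
        # Ledger and detail tokens are tracked apart so ledger overhead
        # can be reported on its own.
        self.ledger_tokens += ledger
        self.detail_tokens += detail

    def snapshot(self) -> dict:
        # Sorted keys keep the output deterministic for fixtures and tests.
        return {
            "counters": dict(sorted(self.counters.items())),
            "ledger_tokens": self.ledger_tokens,
            "detail_tokens": self.detail_tokens,
        }
```

A diagnostic surface such as /review-stats could then render `snapshot()` directly, and a deterministic scanner could compare snapshots taken from recorded logs.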

Acceptance Criteria

  • Metrics include duplicate review rate, skipped/adopted/superseded/cancelled counts, unsurfaced terminal findings, ledger overhead, avoided token estimate, time to surface findings, and finding usefulness/disposition.
  • Each Auto Review run records enough proof data to explain latency: model, reasoning effort, resolve model/effort, phase timing, follow-up count, token count when available, prompt token estimate, and terminal reason.
  • Restart recovery and duplicate avoidance are testable without a live TUI where possible.
  • Dogfood diagnostics can compare before/after behavior across real sessions and identify whether slowness came from first review pass, follow-up loops, worktree/lock contention, retries, or prompt bloat.
  • Metrics do not inject bulky telemetry into normal assistant context; ordinary turns receive only bounded actionable review state.
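As an illustration of the per-run proof data listed above, here is a sketch of a run record with two derived metrics. The `RunProof` fields mirror the acceptance criteria, but the field names, the `"duplicate_skipped"` terminal reason, the phase names, and the definition of duplicate review rate (share of runs that ended as a duplicate skip) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RunProof:
    """Hypothetical per-run proof record; field names are illustrative."""

    model: str
    reasoning_effort: str
    terminal_reason: str  # e.g. "completed", "duplicate_skipped", "cancelled"
    phase_ms: Dict[str, int] = field(default_factory=dict)
    follow_up_count: int = 0
    token_count: Optional[int] = None  # None when the count is unavailable
    prompt_token_estimate: int = 0


def duplicate_review_rate(runs: List[RunProof]) -> float:
    # Assumed definition: fraction of runs that ended as a duplicate skip.
    if not runs:
        return 0.0
    duplicates = sum(1 for r in runs if r.terminal_reason == "duplicate_skipped")
    return duplicates / len(runs)


def slowest_phase(run: RunProof) -> Optional[str]:
    # Names the phase with the largest recorded duration, which is the
    # first question when explaining a slow run.
    if not run.phase_ms:
        return None
    return max(run.phase_ms, key=run.phase_ms.get)
```

Because both functions operate on plain records, they can be exercised from fixtures without a live TUI.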

Relationships

Parent: #324
Depends on: #325, #327, #329
Related: #43, #50

Finish Line

Every Code emits enough Auto Review metrics and diagnostics to prove duplicate review reduction, avoided token spend, surfaced findings, ledger overhead, restart recovery, and finding usefulness during dogfooding.

Current Status

State: Scoped proof-metrics implementation merged.
Merged PR: #381 feat(auto-review): add proof metrics to compact ledger
Merge commit: 5bb9fbf704aa2968bac1e44f328e8f4b3d0c458c
Branch: fix/auto-review-proof-metrics (remote branch deleted after merge).
Next action: #331 Auto Review lifecycle docs can proceed against the merged durable Auto Review behavior. Broader prompt/context/token-budget/request-shape accounting remains gated by #92 and should not be started while the token-count refactor is active.
Blocked by: None for #331 docs. Broader prompt/context/token metrics remain blocked by #92.
Last verified: 2026-06-05 after focused tests, required build, PR CI, Claude review, and merge.

Completed in #381:

  • Compact Auto Review diagnostics now count duplicate-skipped runs, skipped runs without saved tokens, superseded clean duplicates, failed/cancelled/lost terminal proof outcomes, saved token estimates, and existing prompt/token/timing signals.
  • Duplicate/superseded/cancelled/lost proof stays in compact diagnostics without surfacing bulky run details.
  • Focused review-store tests cover duplicate proof, superseded proof, terminal outcome counts, old proof-run omission, and active-run plus dedupe proof combinations.
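The idea of keeping proof in compact diagnostics without surfacing bulky run details can be sketched as a bounded summary line. This is not the code merged in #381; the function name, the dictionary keys (`outcome`, `saved_tokens`, `detail`), the output format, and the `max_len` cap are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def compact_diagnostics(runs: List[Dict], max_len: int = 200) -> str:
    """Summarize terminal outcomes as one bounded line (illustrative only)."""
    outcomes = Counter(r["outcome"] for r in runs)
    skipped = [r for r in runs if r["outcome"] == "duplicate_skipped"]
    # Saved-token estimates only make sense for runs that were skipped.
    saved = sum(r.get("saved_tokens") or 0 for r in skipped)
    skipped_no_saved = sum(1 for r in skipped if not r.get("saved_tokens"))

    parts = [f"{name}={outcomes[name]}" for name in sorted(outcomes)]
    parts.append(f"skipped_no_saved_tokens={skipped_no_saved}")
    parts.append(f"saved_tokens_est={saved}")
    # Only counts are emitted; per-run detail text is never copied in,
    # and the line is truncated as a final guard on size.
    return ("auto-review: " + " ".join(parts))[:max_len]
```

Counting outcomes rather than copying run details keeps the summary size independent of how much detail each run carries.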

Review and validation:

Residual scope:

  • PR #381 does not implement broader prompt/context budget enforcement or request-shape accounting; those remain behind #92 (Add context source ledger and prompt observability).
  • Future polish noted by review: add symmetric tests for clean Failed/Lost omission and Cancelled/Lost with error detail if those paths are touched again.

Labels

  • plan (Durable planning issue)
  • plan:done (Plan completed or superseded)
