demo: enforce top-slot contract; clarify rho and int4 scope by lukefwalton · Pull Request #17 · lukefwalton/answer-engine

lukefwalton · 2026-06-16T06:37:58Z

What

Gate-hardening and wording fixes for the int8 scaling demo, from a deep review of quantize.ts and the gold-suite top-slot logic. No change to the quantizer math, the committed/certified vectors, or the verdicts — int8 still certifies 7/7, int4+spire still rejected 7/9.

Why

Two reviewers converged on three small action items (and two explicit "do nots"). The one that matters is a latent correctness risk that would otherwise be frozen into the artifact's permanent DOI.

Changes

Enforce the top-slot contract (the real fix). evaluateQuery (demo/harness.ts) now throws if a non-refusal gold case does not name exactly one expectSources entry. The gate guards expectSources[0] as the required top-slot winner, so two entries (which must rank Set up Surmado Code Review #1?) or none would let a route/disambiguation flip past silently. Harmless today (every case lists one), but it depended on that by luck. Added a unit test for the throw and the refusal exemption.
Soften rho wording — no floor added, by design. Rank correlation is a reported diagnostic; the gold suite is the adjudicator. A rho floor could reject an encoding whose authored verdicts all hold — exactly the benchmark-style failure mode the demo avoids. Prose now matches the code (demo/harness.ts, demo/README.md).
Clarify int4 scope. int4 is modeled as precision loss (codes stay in an Int8Array, not nibble-packed); the byte-size win is a production property, not what this gate measures. Noted in demo/quantize.ts and demo/README.md.

Deliberately not changed: Math.round (half-up asymmetry is negligible at 3072 dims and changing it would invalidate calibrated vectors for no verdict change), and no rho gate.

Verification

npm test → 43/43 pass (the +1 is the new contract test)
npm run typecheck → clean
npm run demo:run → int8 certified 7/7, rho mean/min 1.0000
npm run demo:run -- --natural+synthetic --bits 4 → int4 rejected 7/9, both route flips caught

Since nothing touched the committed vectors, no re-certification is required.

https://claude.ai/code/session_0164SP7NFZeHcmH1VNPem7S2

Generated by Claude Code

Gate-hardening and wording, from review of the int8 demo. No change to the quantizer math, the committed vectors, or the certified verdicts (int8 still 7/7, int4+spire still rejected 7/9). - harness.ts: make the top-slot contract loud. evaluateQuery now throws if a non-refusal gold case does not name exactly one expectSources entry — the gate guards expectSources[0] as the required top-slot winner, so two entries (ambiguous winner) or none would let a route/disambiguation flip past silently. Add a unit test covering the throw and the refusal exemption. - Soften rho language to match the code: rank correlation is a reported diagnostic, not a gate (the gold suite is the adjudicator). No rho floor added, by design — a floor could reject an encoding whose authored verdicts all hold. - Note that int4 is modeled as precision loss only (codes stay in an Int8Array, not nibble-packed); the byte-size win is a production property, not what this gate measures. Added in quantize.ts and demo/README.md. Math.round left as-is intentionally: its half-up asymmetry is negligible at 3072 dims and changing it would invalidate calibrated vectors for no verdict change.

surmado-code-review · 2026-06-16T06:38:39Z

Automated Checks (advisory, non-blocking)

✅ All checks passed.

Standards Compliance

demo/harness.ts now turns an implicit gold-data assumption into an explicit invariant:
```
if (gold.expectSources?.length !== 1) {
  throw new Error(
    `demo gold '${gold.id}': a non-refusal case must list exactly one expectSources ` +
      `entry (the required top-slot winner); got ${gold.expectSources?.length ?? 0}.`,
  );
}
```
This is aligned with the repo standards on Eval and Tests and Loud Failures: malformed evaluation metadata now fails clearly instead of silently relying on expectSources[0]. That makes the demo gate stricter in the right direction.
I don’t see any visible changes here that weaken the repo’s non-negotiables around the no-leak boundary, citation grounding, or not-found behavior. The not-found exemption in demo/harness.ts also stays consistent with the standard that refusal-style cases should not be forced into partial grounding semantics.

Summary

This PR hardens the demo/eval harness by enforcing that non-refusal gold cases specify exactly one expected top source, and it updates comments/docs to clarify that rho is diagnostic-only and int4 here models precision loss rather than storage savings. The only meaningful runtime change in the visible diff is the new throw in demo/harness.ts; the rest looks like scope/wording clarification.

Reviewer: most of the risk is in demo/harness.ts contract enforcement and its interaction with existing gold fixtures — the quantization/README wording changes are low risk.

What to pay attention to

demo/harness.ts top-slot guard: this is the only behavior change. It’s worth verifying that every existing non-refusal demo gold case really does satisfy the “exactly one expectSources” contract, and that failing fast here is the intended authoring-time signal rather than something downstream callers need to recover from.
Refusal carve-out correctness: the guard exempts expectAnswerMode === 'not-found'. Reviewer should sanity-check that this is the only mode that should bypass top-slot enforcement, so the demo’s route/refusal semantics remain consistent.
Docs/code alignment: the updated comments claim rho “gates nothing” and int4 is precision-only in this demo. Worth confirming the CLI/reporting behavior matches that wording so the README doesn’t overstate or understate what is actually certified.

Things I noticed

No obvious red or yellow flags in the visible diff. The new throw is a deliberate fail-fast on malformed gold cases and fits the repo’s “clear throw, not silent fallback” standard.

Good patterns

Good hardening move: replacing a silent dependency on expectSources[0] with an explicit invariant and a targeted error message makes the demo artifact safer to trust.
The refusal exemption is thoughtfully scoped; it avoids forcing dummy source expectations into not-found gold cases just to satisfy the harness.

Suggested improvements

If there is already a central validator/schema for demo gold entries elsewhere in the path, consider encoding this “exactly one expectSources for non-refusal cases” rule there too as a follow-up. Catching it at load/validation time would fail even earlier than evaluateQuery.

Surmado Code Review (v1.2-mt) is an automated review, designed to work alongside human judgment.

Want to change your STANDARDS.md or YML? Edit it directly, or tune it with our AI agent Scout.

Comment /rerun-review on this PR to refresh the review — costs 1 additional PR credit.

lukefwalton merged commit 06de726 into main Jun 16, 2026
3 checks passed

lukefwalton deleted the claude/session-zen4qi branch June 16, 2026 06:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

demo: enforce top-slot contract; clarify rho and int4 scope#17

demo: enforce top-slot contract; clarify rho and int4 scope#17
lukefwalton merged 1 commit into
mainfrom
claude/session-zen4qi

lukefwalton commented Jun 16, 2026

Uh oh!

surmado-code-review Bot commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lukefwalton commented Jun 16, 2026

What

Why

Changes

Verification

Uh oh!

surmado-code-review Bot commented Jun 16, 2026

Automated Checks (advisory, non-blocking)

Standards Compliance

Summary

What to pay attention to

Things I noticed

Good patterns

Suggested improvements

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants