demo: enforce top-slot contract; clarify rho and int4 scope #17
Gate-hardening and wording, from review of the int8 demo. No change to the quantizer math, the committed vectors, or the certified verdicts (int8 still 7/7, int4+spire still rejected 7/9).
- harness.ts: make the top-slot contract loud. evaluateQuery now throws if a non-refusal gold case does not name exactly one expectSources entry. The gate guards expectSources[0] as the required top-slot winner, so two entries (ambiguous winner) or none would let a route/disambiguation flip slip past silently. Add a unit test covering the throw and the refusal exemption.
- Soften rho language to match the code: rank correlation is a reported diagnostic, not a gate (the gold suite is the adjudicator). No rho floor added, by design: a floor could reject an encoding whose authored verdicts all hold.
- Note that int4 is modeled as precision loss only (codes stay in an Int8Array, not nibble-packed); the byte-size win is a production property, not what this gate measures. Added in quantize.ts and demo/README.md.
Math.round left as-is intentionally: its half-up asymmetry is negligible at 3072 dims and changing it would invalidate calibrated vectors for no verdict change.
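The top-slot contract described above can be sketched roughly as follows. This is a minimal illustration, not the code in harness.ts: the `GoldCase` shape, the `expectRefusal` flag, and the helper names `assertTopSlotContract` and `requiredTopSource` are assumptions made for the example. Only `expectSources` and the "exactly one entry unless refusal" rule come from the description.

```typescript
// Hypothetical gold-case shape; every field name except expectSources is an assumption.
interface GoldCase {
  id: string;
  expectRefusal: boolean;
  expectSources: string[];
}

// Fail fast on a malformed gold case instead of silently guarding the wrong slot.
function assertTopSlotContract(gold: GoldCase): void {
  if (gold.expectRefusal) return; // refusal cases name no winner, so they are exempt
  const n = gold.expectSources.length;
  if (n !== 1) {
    throw new Error(
      `gold case "${gold.id}": expected exactly one expectSources entry, got ${n}`
    );
  }
}

// Returns the source that must win the top slot, or null for a refusal case.
function requiredTopSource(gold: GoldCase): string | null {
  assertTopSlotContract(gold);
  return gold.expectRefusal ? null : gold.expectSources[0];
}
```

With a check of this kind, a case listing two sources fails at evaluation time rather than having only its first entry guarded.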
Automated Checks (advisory, non-blocking)
✅ All checks passed.
Standards Compliance
Summary
This PR hardens the demo/eval harness by enforcing that non-refusal gold cases specify exactly one expected top source, and it updates comments/docs to clarify that rho is diagnostic-only and int4 here models precision loss rather than storage savings. The only meaningful runtime change in the visible diff is the new throw in
Reviewer: most of the risk is in
What to pay attention to
Things I noticed
No obvious red or yellow flags in the visible diff. The new throw is a deliberate fail-fast on malformed gold cases and fits the repo’s “clear throw, not silent fallback” standard.
Good patterns
Suggested improvements
Surmado Code Review (v1.2-mt) is an automated review, designed to work alongside human judgment.
What
Gate-hardening and wording fixes for the int8 scaling demo, from a deep review of quantize.ts and the gold-suite top-slot logic. No change to the quantizer math, the committed/certified vectors, or the verdicts: int8 still certifies 7/7, int4+spire still rejected 7/9.
Why
Two reviewers converged on three small action items (and two explicit "do nots"). The one that matters is a latent correctness risk that would otherwise be frozen into the artifact's permanent DOI.
Changes
- evaluateQuery (demo/harness.ts) now throws if a non-refusal gold case does not name exactly one expectSources entry. The gate guards expectSources[0] as the required top-slot winner, so two entries (which must rank #1?) or none would let a route/disambiguation flip slip past silently. Harmless today (every case lists one), but it depended on that by luck. Added a unit test for the throw and the refusal exemption.
- Rho language softened to match the code: rank correlation is a reported diagnostic, not a gate; the gold suite is the adjudicator (demo/harness.ts, demo/README.md).
- int4 is modeled as precision loss only (codes stay in an Int8Array, not nibble-packed); the byte-size win is a production property, not what this gate measures. Noted in demo/quantize.ts and demo/README.md.
Deliberately not changed: Math.round (half-up asymmetry is negligible at 3072 dims and changing it would invalidate calibrated vectors for no verdict change), and no rho gate.
Verification
- npm test → 43/43 pass (the +1 is the new contract test)
- npm run typecheck → clean
- npm run demo:run → int8 certified 7/7, rho mean/min 1.0000
- npm run demo:run -- --natural+synthetic --bits 4 → int4 rejected 7/9, both route flips caught
Since nothing touched the committed vectors, no re-certification is required.
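To illustrate what "int4 modeled as precision loss only" means, here is a rough sketch under stated assumptions. It uses a symmetric max-abs quantizer, which is an assumed scheme and may not match demo/quantize.ts; the names `quantize` and `maxRoundTripError` are hypothetical. What it shows is that narrowing `bits` shrinks the code range while the codes still occupy one byte each in an Int8Array.

```typescript
// Symmetric max-abs quantizer (an assumed scheme, not necessarily the one in demo/quantize.ts).
function quantize(v: ArrayLike<number>, bits: 4 | 8): { codes: Int8Array; scale: number } {
  const qmax = (1 << (bits - 1)) - 1; // 127 for int8, 7 for int4
  let maxAbs = 0;
  for (let i = 0; i < v.length; i++) maxAbs = Math.max(maxAbs, Math.abs(v[i]));
  const scale = maxAbs === 0 ? 1 : maxAbs / qmax;
  // One byte per dimension regardless of `bits`: int4 codes are not nibble-packed.
  const codes = new Int8Array(v.length);
  for (let i = 0; i < v.length; i++) {
    const q = Math.round(v[i] / scale); // Math.round sends exact halves toward +Infinity
    codes[i] = Math.max(-qmax, Math.min(qmax, q));
  }
  return { codes, scale };
}

// Worst-case reconstruction error: a simple view of the precision that was lost.
function maxRoundTripError(v: ArrayLike<number>, bits: 4 | 8): number {
  const { codes, scale } = quantize(v, bits);
  let worst = 0;
  for (let i = 0; i < v.length; i++) {
    worst = Math.max(worst, Math.abs(v[i] - codes[i] * scale));
  }
  return worst;
}
```

In this sketch `quantize(v, 4).codes.byteLength` equals `quantize(v, 8).codes.byteLength`, so only the reconstruction error differs between the two settings. A nibble-packed layout would store two int4 codes per byte, which is the production storage win the gate does not measure.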
https://claude.ai/code/session_0164SP7NFZeHcmH1VNPem7S2
Generated by Claude Code