Skip to content

demo: enforce top-slot contract; clarify rho and int4 scope#17

Merged
lukefwalton merged 1 commit into
mainfrom
claude/session-zen4qi
Jun 16, 2026
Merged

demo: enforce top-slot contract; clarify rho and int4 scope#17
lukefwalton merged 1 commit into
mainfrom
claude/session-zen4qi

Conversation

@lukefwalton

Copy link
Copy Markdown
Owner

What

Gate-hardening and wording fixes for the int8 scaling demo, from a deep review of quantize.ts and the gold-suite top-slot logic. No change to the quantizer math, the committed/certified vectors, or the verdicts — int8 still certifies 7/7, int4+spire still rejected 7/9.

Why

Two reviewers converged on three small action items (and two explicit "do nots"). The one that matters is a latent correctness risk that would otherwise be frozen into the artifact's permanent DOI.

Changes

  1. Enforce the top-slot contract (the real fix). evaluateQuery (demo/harness.ts) now throws if a non-refusal gold case does not name exactly one expectSources entry. The gate guards expectSources[0] as the required top-slot winner, so two entries (which must rank Set up Surmado Code Review #1?) or none would let a route/disambiguation flip past silently. Harmless today (every case lists one), but it depended on that by luck. Added a unit test for the throw and the refusal exemption.
  2. Soften rho wording — no floor added, by design. Rank correlation is a reported diagnostic; the gold suite is the adjudicator. A rho floor could reject an encoding whose authored verdicts all hold — exactly the benchmark-style failure mode the demo avoids. Prose now matches the code (demo/harness.ts, demo/README.md).
  3. Clarify int4 scope. int4 is modeled as precision loss (codes stay in an Int8Array, not nibble-packed); the byte-size win is a production property, not what this gate measures. Noted in demo/quantize.ts and demo/README.md.

Deliberately not changed: Math.round (half-up asymmetry is negligible at 3072 dims and changing it would invalidate calibrated vectors for no verdict change), and no rho gate.

Verification

  • npm test → 43/43 pass (the +1 is the new contract test)
  • npm run typecheck → clean
  • npm run demo:run → int8 certified 7/7, rho mean/min 1.0000
  • npm run demo:run -- --natural+synthetic --bits 4 → int4 rejected 7/9, both route flips caught

Since nothing touched the committed vectors, no re-certification is required.

https://claude.ai/code/session_0164SP7NFZeHcmH1VNPem7S2


Generated by Claude Code

Gate-hardening and wording, from review of the int8 demo. No change to the
quantizer math, the committed vectors, or the certified verdicts (int8 still
7/7, int4+spire still rejected 7/9).

- harness.ts: make the top-slot contract loud. evaluateQuery now throws if a
  non-refusal gold case does not name exactly one expectSources entry — the
  gate guards expectSources[0] as the required top-slot winner, so two entries
  (ambiguous winner) or none would let a route/disambiguation flip past
  silently. Add a unit test covering the throw and the refusal exemption.
- Soften rho language to match the code: rank correlation is a reported
  diagnostic, not a gate (the gold suite is the adjudicator). No rho floor
  added, by design — a floor could reject an encoding whose authored verdicts
  all hold.
- Note that int4 is modeled as precision loss only (codes stay in an Int8Array,
  not nibble-packed); the byte-size win is a production property, not what this
  gate measures. Added in quantize.ts and demo/README.md.

Math.round left as-is intentionally: its half-up asymmetry is negligible at
3072 dims and changing it would invalidate calibrated vectors for no verdict change.
@surmado-code-review

Copy link
Copy Markdown
Contributor

Automated Checks (advisory, non-blocking)

✅ All checks passed.


Standards Compliance

  • demo/harness.ts now turns an implicit gold-data assumption into an explicit invariant:

    if (gold.expectSources?.length !== 1) {
      throw new Error(
        `demo gold '${gold.id}': a non-refusal case must list exactly one expectSources ` +
          `entry (the required top-slot winner); got ${gold.expectSources?.length ?? 0}.`,
      );
    }

    This is aligned with the repo standards on Eval and Tests and Loud Failures: malformed evaluation metadata now fails clearly instead of silently relying on expectSources[0]. That makes the demo gate stricter in the right direction.

  • I don’t see any visible changes here that weaken the repo’s non-negotiables around the no-leak boundary, citation grounding, or not-found behavior. The not-found exemption in demo/harness.ts also stays consistent with the standard that refusal-style cases should not be forced into partial grounding semantics.

Summary

This PR hardens the demo/eval harness by enforcing that non-refusal gold cases specify exactly one expected top source, and it updates comments/docs to clarify that rho is diagnostic-only and int4 here models precision loss rather than storage savings. The only meaningful runtime change in the visible diff is the new throw in demo/harness.ts; the rest looks like scope/wording clarification.

Reviewer: most of the risk is in demo/harness.ts contract enforcement and its interaction with existing gold fixtures — the quantization/README wording changes are low risk.

What to pay attention to

  • demo/harness.ts top-slot guard: this is the only behavior change. It’s worth verifying that every existing non-refusal demo gold case really does satisfy the “exactly one expectSources” contract, and that failing fast here is the intended authoring-time signal rather than something downstream callers need to recover from.
  • Refusal carve-out correctness: the guard exempts expectAnswerMode === 'not-found'. Reviewer should sanity-check that this is the only mode that should bypass top-slot enforcement, so the demo’s route/refusal semantics remain consistent.
  • Docs/code alignment: the updated comments claim rho “gates nothing” and int4 is precision-only in this demo. Worth confirming the CLI/reporting behavior matches that wording so the README doesn’t overstate or understate what is actually certified.

Things I noticed

No obvious red or yellow flags in the visible diff. The new throw is a deliberate fail-fast on malformed gold cases and fits the repo’s “clear throw, not silent fallback” standard.

Good patterns

  • Good hardening move: replacing a silent dependency on expectSources[0] with an explicit invariant and a targeted error message makes the demo artifact safer to trust.
  • The refusal exemption is thoughtfully scoped; it avoids forcing dummy source expectations into not-found gold cases just to satisfy the harness.

Suggested improvements

  • If there is already a central validator/schema for demo gold entries elsewhere in the path, consider encoding this “exactly one expectSources for non-refusal cases” rule there too as a follow-up. Catching it at load/validation time would fail even earlier than evaluateQuery.

Surmado Code Review (v1.2-mt) is an automated review, designed to work alongside human judgment.

Want to change your STANDARDS.md or YML? Edit it directly, or tune it with our AI agent Scout.

Comment /rerun-review on this PR to refresh the review — costs 1 additional PR credit.

@lukefwalton lukefwalton merged commit 06de726 into main Jun 16, 2026
3 checks passed
@lukefwalton lukefwalton deleted the claude/session-zen4qi branch June 16, 2026 06:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants