Runnable int8 demo (demo/): mechanism + corpus build handoff #15

Merged
lukefwalton merged 10 commits into main from claude/lucid-wozniak-xeztl6
Jun 16, 2026

@lukefwalton

Owner

What this is

A thin scaling/ module that makes the paper's §6 claim runnable: the same gold suite that owns grounding and refusal also adjudicates the int8 cost lever. It quantizes the embedding index to int8, re-ranks, and runs the full gold suite (including must-refuse and must-route) against the quantized index.

The result it is built to produce is a caught failure: at --bits 4 a route case flips (the private note loses the top slot to a public record) and the gold suite catches it. "int8 held" alone proves little; the gate saying no when pushed is the point.

What is committed vs pending a build run

This session's egress allowed only GitHub (api.openai.com, Gutenberg, archive.org all returned host_not_allowed) and no OPENAI_API_KEY was set. So:

  • Committed and green now: the quantizer, the harness, the keyless runner, the keyed build script, the gold set, the corpus provenance manifest, and deterministic fixture tests — including the int4 route-flip payload proven on synthetic geometry.
  • Pending a build run (local agent with network + key): the real text bodies and the committed vectors (index.json, query-vectors.json). Exact steps in docs/scaling-demo/build-handoff.md. Nothing is faked; the demo halts honestly at the network boundary.

Budget rule: held, no halt

The int8 path is a quantize wrapper plus a re-rank. It reuses src/retrieve.ts (retrieve/cosine), the no-leak boundary, the gold judges, src/store.ts, and the corpus loaders untouched — no second pipeline, no core type changes. quantize.ts is the public twin of the production vector-quant.ts.
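
For readers who want the shape of the quantize step, here is a minimal sketch of per-vector symmetric quantization (the scheme the commit notes attribute to quantize.ts). The names `QuantizedVector`, `quantize`, `dequantize`, and `cosine` are hypothetical and chosen for illustration; the real quantize.ts and src/retrieve.ts signatures are not shown in this PR and may differ.

```typescript
// Hypothetical shapes for illustration only; not the actual quantize.ts API.
interface QuantizedVector {
  bits: number; // 8 for int8, 4 for int4
  scale: number; // per-vector scale: maxAbs / qmax
  codes: Int8Array; // integer codes in [-qmax, qmax]
}

// Per-vector symmetric quantization: one scale per vector, zero maps to zero.
// int8 and int4 share this path; only qmax changes.
function quantize(vector: readonly number[], bits: number): QuantizedVector {
  if (!Number.isInteger(bits) || bits < 2 || bits > 8) {
    throw new Error(`bits must be an integer in [2, 8], got ${bits}`);
  }
  const qmax = 2 ** (bits - 1) - 1; // 127 for int8, 7 for int4
  let maxAbs = 0;
  for (let i = 0; i < vector.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(vector[i] ?? 0));
  }
  const scale = maxAbs === 0 ? 1 : maxAbs / qmax;
  const codes = new Int8Array(vector.length);
  for (let i = 0; i < vector.length; i++) {
    const code = Math.round((vector[i] ?? 0) / scale);
    codes[i] = Math.max(-qmax, Math.min(qmax, code));
  }
  return { bits, scale, codes };
}

// Map the integer codes back to floats so an unchanged cosine can re-rank them.
function dequantize(q: QuantizedVector): number[] {
  const out: number[] = [];
  for (let i = 0; i < q.codes.length; i++) {
    out.push((q.codes[i] ?? 0) * q.scale);
  }
  return out;
}

// Plain cosine similarity, standing in for the reused retrieve/cosine.
function cosine(a: readonly number[], b: readonly number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    na += x * x;
    nb += y * y;
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}
```

In this sketch, `quantize(v, 4)` keeps only 15 levels per component, so the dequantized vector drifts further from `v` than the int8 version does. That larger drift is what lets near-tied sources swap rank.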

Two design points worth a look (both in the delta log):

  • The keyless headline needs committed gold-query vectors, not just FP vectors, because the core eval CLI embeds queries at run time (so it always needs a key). query-vectors.json + a thin runner solve this; the eval CLI is not reused verbatim.
  • judgeRetrieval checks presence only, so a top-slot route flip where the note stays retrieved would slip through; the harness adds a keyless route-selection check to catch it.

What to review

  • scaling/ — the module (README, config, quantize.ts, harness.ts, run.ts, build.ts, gold, corpus manifest, the quarantined spire).
  • scaling/quantize.test.ts — the mechanism + the deliberate failure, offline.
  • docs/scaling-demo/scaling-demo-delta-log.md — what's settled vs pending, the divergences, and the prepared NEXT-STEPS C-intro/C1 reconciliation (held until scaling:run confirms the headline, so "runnable" is verified not asserted).
  • docs/scaling-demo/build-handoff.md — the brief for the local agent.

Test plan

  • npm test → 36 pass (25 core + 11 scaling); npm run typecheck clean.
  • npm run scaling:run degrades cleanly with a pointer to the handoff when vectors aren't built.
  • Local agent runs build-handoff.md to generate vectors, confirm the int8 headline, fire the int4 break, then apply the NEXT-STEPS reconciliation.

Base / scope note

main is at v1.2.0, so this branch is 58 commits ahead (the prior v1.3–v1.5 work plus these 5 scaling commits). The scaling contribution is the scaling/ directory and the two docs/scaling-demo/ files above (plus tsconfig.json/package.json wiring); the rest of the diff is the accumulated work main has not yet received. Retarget the base if you want a scaling-only diff.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1


Generated by Claude Code

claude added 5 commits June 16, 2026 04:21
…ndoff

Thin module beside the core (the budget rule): scaling.config.ts points the
reused engine at scaling/corpus/ with two name-colliding public-domain authors
(Adam Smith the economist, George Adam Smith the theologian). Adds the corpus
provenance + authored-choices manifest (scaling/corpus/README.md) and a build
handoff for an environment with network + an OpenAI key, since this session's
egress allowed only GitHub and api.openai.com was blocked. Brings scaling/ under
typecheck. No core behavior changed.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
scaling/gold.yaml, real-only (--natural). Same three-mode shape as the core
set: disambiguation both ways (economist vs theologian over shared "justice"
themes and the name-boost mis-fire), the partial-name boost edge ("Adam Smith"
phrase-matching a "George Adam Smith" title), a route case the private sermon
must win without restating, and a refuse case. Cases sit near the floor and
near each other, where int8 rounding can reorder them. Source ids match the
corpus slugs the build handoff defines.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
The int8 path as a thin wrapper plus a re-rank, reusing src/retrieve.ts cosine,
the gold judge, and the no-leak boundary untouched. quantize.ts is the public
twin of the production vector-quant.ts (per-vector symmetric, int8 and int4 from
one path). harness.ts re-ranks the quantized index, reports rank correlation
(necessary) and the gold verdicts including the route top-slot check
(sufficient). run.ts is keyless: it reads committed FP + gold-query vectors and
quantizes in process; --full adds the keyed answer pass. build.ts (keyed, for
the local agent) embeds the corpus and gold queries.
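
To illustrate why rank correlation is treated as necessary but not sufficient, here is a small sketch. It assumes Spearman's rho as the correlation measure, which is an assumption: the commit does not say which statistic harness.ts reports. The function names `ranks` and `spearman` are hypothetical.

```typescript
// Rank positions (1 = highest score); tied scores share the average position.
function ranks(scores: readonly number[]): number[] {
  const order = scores.map((s, i) => ({ s, i })).sort((a, b) => b.s - a.s);
  const out = new Array<number>(scores.length).fill(0);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1]!.s === order[start]!.s) {
      end++;
    }
    const avg = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) {
      out[order[k]!.i] = avg;
    }
    start = end + 1;
  }
  return out;
}

// Spearman's rho: Pearson correlation computed on the rank positions.
function spearman(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length || a.length < 2) {
    throw new Error("spearman needs two score lists of equal length >= 2");
  }
  const ra = ranks(a);
  const rb = ranks(b);
  const n = ra.length;
  const mean = (n + 1) / 2; // the mean rank is (n + 1) / 2 even with ties
  let cov = 0;
  let va = 0;
  let vb = 0;
  for (let i = 0; i < n; i++) {
    const da = ra[i]! - mean;
    const db = rb[i]! - mean;
    cov += da * db;
    va += da * da;
    vb += db * db;
  }
  return va === 0 || vb === 0 ? 0 : cov / Math.sqrt(va * vb);
}
```

With made-up scores `[0.90, 0.89, 0.5, 0.3, 0.1]` at full precision and `[0.88, 0.89, 0.5, 0.3, 0.1]` after quantization, rho is 0.9 even though the top two sources have swapped. A high correlation alone would not reveal that flip, which is why the gold verdicts carry the decision.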

quantize.test.ts proves the mechanism offline on fixture geometry, including the
payload: the gate certifies int8 and rejects an int4 route flip in which the note
stays retrieved but loses the top slot. npm test now covers scaling/ too:
36 pass (25 core + 11 scaling), typecheck clean.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
…p gold

The payload, the part the demo rests on. A fabricated George-private note,
quarantined in scaling/corpus/synthetic/ (the location is the flag) and marked
synthetic:true with the gold case, margin, and mode it targets, plus the
expanded gold (gold.synthetic.yaml) loaded only under --natural+synthetic. The
note is built to sit at the floor just above the public Amos exposition, so int8
holds the route while int4 flips the top slot to the public record and the gold
suite catches it. The mechanism is already proven offline in quantize.test.ts;
this is its real-corpus instance, calibrated against real vectors by the build
(handoff §4). Headline numbers stay on --natural; the spire is broken out.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
scaling/README.md in the papers' sparse register (no em-dashes), leading with
the deliberate failure: the same gold suite that owns grounding and refusal
rejecting a cheaper encoding. States the three non-negotiable disclosures, the
exact-vs-measured admissibility split, rank correlation as necessary-not-
sufficient, the commit-vectors-only-because-public-domain caveat with the
inversion warning, and the reuse boundary (retrieval and no-leak untouched).
Cross-links production-scaling.md §2 as the prose companion.

Fills the delta log with what the build settled vs what is pending the keyed
build run, the divergences found (keyless headline needs committed gold-query
vectors; demo-canonical record URLs; the added route-selection gate; the
GitHub-only egress that deferred the corpus + vectors), and the prepared
NEXT-STEPS C-intro/C1 reconciliation to apply once scaling:run confirms the
headline.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
@surmado-code-review

surmado-code-review Bot commented Jun 16, 2026

Contributor

Automated Checks (advisory, non-blocking)

  • Hallucinated package: node:test used in demo/quantize.test.ts does not exist on npm. This package appears to be invented by AI.
    ✅ No other issues detected.

Standards Compliance

One repo-standards issue still looks unresolved after the scaling/ -> demo/ rename, and it’s the main thing I’d want the reviewer to settle before merge:

  • demo/config.ts + demo/build.ts still produce commit-intended artifacts from the private layer, which conflicts with the repo’s artifact boundary rule.
    Relevant code:

    privateNotesDir: './demo/corpus/private',
    const naturalNotes = buildPrivateNotes(config);
    const naturalEntries = [...recordEntries(config, vectors), ...noteEntries(naturalNotes, vectors)]
    writeIndexFile(naturalEntries, NATURAL_INDEX);
    console.log('Done. Commit the *.json artifacts, then `npm run demo:run`.');

    and the note entries themselves are serialized as:

    sourceType: 'note',
    note,

    The loaded standards are explicit in §5: “Don't leak private embeddings/text into committed artifacts. (The index is gitignored for a reason.)”
    This PR still builds demo/corpus/index.json / index.synthetic.json from privateNotesDir and explicitly instructs committing those JSON files. Even if runtime answering still respects assembleEvidence and never sends note prose to the model, this is a standards conflict at artifact creation time.

    I don’t see a standards change in this diff that authorizes that exception, so as written this is still a repo-policy violation, not just a docs choice.

Summary

This PR adds a standalone demo/ flow for quantizing the corpus index, re-ranking against the quantized vectors, and adjudicating a gold suite keylessly via committed query vectors, with an optional keyed --full answer-mode pass. The implementation mostly stays thin and reuses core retrieval/eval/store logic; the main review decision is whether demo/build.ts is allowed to create committed note-derived artifacts from demo/corpus/private.

Reviewer: most of the risk is in demo/build.ts / demo/config.ts and the standards exception they imply; the quantization harness itself looks comparatively low-risk.

What to pay attention to

  • demo/build.ts + demo/config.ts: this is where the standards boundary is actually enforced or broken. The concern is not “does runtime leak private prose?”; it’s “are we committing note-derived artifacts at all?”
  • demo/run.ts --full: worth a quick sanity check in a good way — it now passes requantizeIndex(index, args.bits) into the answer pass, which avoids masking a lossy-index flip with full-precision retrieval.
  • demo/harness.ts top-slot logic: this is the substantive new gate behavior. judgeRetrieval only checks presence, so this extra top-slot check is what makes route/disambiguation flips visible.

Things I noticed

🔴 Red flags — fix before merge:

  • Committed private-layer artifacts still violate the repo standards.
    In demo/build.ts, the natural index is built from both public records and private notes:
    const naturalEntries = [...recordEntries(config, vectors), ...noteEntries(naturalNotes, vectors)]
    writeIndexFile(naturalEntries, NATURAL_INDEX);
    combined with:
    privateNotesDir: './demo/corpus/private',
    and:
    console.log('Done. Commit the *.json artifacts, then `npm run demo:run`.');
    That will produce commit-intended JSON derived from the repo’s private layer, which conflicts directly with §5.

🟡 Yellow flags — consider for this PR or a follow-up:

  • There’s still no hard guard around that exception path. Even if the team decides this demo deserves a narrow carve-out, demo:build writes note-derived commit artifacts as the default path. An explicit opt-in or hard fail would make accidental boundary drift much less likely.

Good patterns

  • demo/run.ts keeps the keyed answer-mode pass aligned with the same quantized retrieval surface by calling:
    await runAnswerPass(gold, requantizeIndex(index, args.bits), qv.byId, config);
    That avoids a subtle false-pass where retrieval is judged on lossy vectors but answers are generated from full-precision evidence.
  • demo/query-vectors.ts handles malformed or mismatched artifacts as loud failures with rebuild guidance, which matches the repo’s error-handling standard well.

Suggested improvements

  1. Make demo:build refuse to write commit-intended artifacts containing note-derived entries unless the repo standards are explicitly updated to allow that case.
  2. If no standards exception is intended, keep committed demo artifacts public-only and make note-derived vectors local/uncommitted.
  3. Once policy is settled, add a regression guard around the artifact boundary so future demo changes can’t silently reintroduce committed private-layer outputs.
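
One possible shape for the guard in suggestions 1 and 3 is sketched below. It assumes index entries carry a `sourceType` field with the value `'note'` for private-layer entries, as in the serialized entries quoted above. The `'record'` value, the `IndexEntry` and `GuardOptions` shapes, and the `assertCommitSafe` name are hypothetical.

```typescript
// Hypothetical entry shape; only sourceType: 'note' is taken from the quoted code.
interface IndexEntry {
  id: string;
  sourceType: "record" | "note";
  vector: number[];
}

interface GuardOptions {
  // Explicit opt-in for a documented standards exception.
  allowNoteDerived?: boolean;
}

// Throws before any commit-intended file is written if note-derived entries
// are present and the caller has not opted in.
function assertCommitSafe(
  entries: readonly IndexEntry[],
  options: GuardOptions = {},
): void {
  if (options.allowNoteDerived === true) {
    return;
  }
  const noteIds = entries
    .filter((entry) => entry.sourceType === "note")
    .map((entry) => entry.id);
  if (noteIds.length > 0) {
    throw new Error(
      `Refusing to write a commit-intended index with note-derived entries: ${noteIds.join(", ")}. ` +
        "Pass allowNoteDerived only if the repo standards permit it.",
    );
  }
}
```

Called just before the index is written, a check like this would make the carve-out a visible decision at the call site instead of the default path.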

Questions for the author

  • Is the intent here to establish a repo-level exception for demo/corpus/*.json containing private-layer embeddings, or should the demo be reshaped so committed artifacts stay public-only?

Surmado Code Review (v1.2-mt) is an automated review, designed to work alongside human judgment.

claude added 5 commits June 16, 2026 04:45
…ures

--full now retrieves the answer-mode evidence from the SAME quantized index the
retrieval gate judged, so a route flip on the lossy index can no longer be
masked by full-precision retrieval (the keyed pass exercises the surface it
claims to). build.ts now throws with the id when any source or gold-query
embedding is missing, instead of silently writing a partial artifact, matching
the repo's loud-failure standard. Aligns the test import to the core's named
node:test form.
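
A minimal sketch of the loud-failure behavior described above, assuming embeddings arrive as a map keyed by source or gold-query id. The `requireEmbeddings` name and the map shape are hypothetical; build.ts itself is not quoted in this message.

```typescript
// Returns a complete id -> vector map, or throws naming every missing id.
// Nothing is written when any embedding is absent.
function requireEmbeddings(
  ids: readonly string[],
  embedded: ReadonlyMap<string, readonly number[]>,
): Map<string, readonly number[]> {
  const missing = ids.filter((id) => !embedded.has(id));
  if (missing.length > 0) {
    throw new Error(
      `Missing embeddings for: ${missing.join(", ")}. Refusing to write a partial artifact.`,
    );
  }
  const complete = new Map<string, readonly number[]>();
  for (const id of ids) {
    const vector = embedded.get(id);
    if (vector !== undefined) {
      complete.set(id, vector);
    }
  }
  return complete;
}
```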

Not changed: the committed scaling/corpus/index.json. Committing those vectors
is the spec's intentional divergence (§5) — the "private" layer is public-domain
George text, the index is deliberately not gitignored, and README §2 + the
corpus manifest §2 explain it with the embedding-inversion warning. The
STANDARDS.md reconciliation for that is a separate call, raised with the author.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
…RDS carve-out

readQueryVectors now checks each committed vector is numeric and matches the
file's dimensions, so a corrupt query-vectors.json fails loudly at read with the
rebuild hint instead of surfacing later as bad cosine (the repo's loud-failure
standard). Adds query-vectors.test.ts: round-trip, missing-file-is-null, and the
malformed cases (wrong dims, non-numeric, bad version). 39 tests pass.
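
The validation described above can be sketched roughly as follows. The file layout (`version`, `dimensions`, `vectors`), the expected version number, and the `parseQueryVectors` name are assumptions for illustration; the actual query-vectors.json schema is not shown in this message.

```typescript
// Assumed file layout, for illustration only.
interface QueryVectorFile {
  version: number;
  dimensions: number;
  vectors: Record<string, number[]>;
}

const EXPECTED_VERSION = 1; // hypothetical version number
const REBUILD_HINT = "Rebuild the vectors with `npm run demo:build`.";

function fail(why: string): never {
  throw new Error(`query-vectors.json is invalid (${why}). ${REBUILD_HINT}`);
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Validates parsed JSON at read time so corruption fails here, with a rebuild
// hint, and does not surface later as a bad cosine score.
function parseQueryVectors(raw: unknown): QueryVectorFile {
  if (!isRecord(raw)) return fail("not an object");
  if (raw["version"] !== EXPECTED_VERSION) {
    return fail(`unsupported version ${String(raw["version"])}`);
  }
  const dimensions = raw["dimensions"];
  if (typeof dimensions !== "number" || !Number.isInteger(dimensions) || dimensions <= 0) {
    return fail("dimensions must be a positive integer");
  }
  const rawVectors = raw["vectors"];
  if (!isRecord(rawVectors)) return fail("vectors must be an object");
  const vectors: Record<string, number[]> = {};
  for (const [id, vec] of Object.entries(rawVectors)) {
    if (!Array.isArray(vec)) return fail(`vector "${id}" is not an array`);
    if (vec.length !== dimensions) {
      return fail(`vector "${id}" has ${vec.length} dims, expected ${dimensions}`);
    }
    if (!vec.every((x) => typeof x === "number" && Number.isFinite(x))) {
      return fail(`vector "${id}" has a non-numeric component`);
    }
    vectors[id] = vec as number[];
  }
  return { version: EXPECTED_VERSION, dimensions, vectors };
}
```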

Records in the delta log the one standing standards point the review keeps
raising: the demo commits scaling/corpus/index.json with the public-domain
George "private"-layer note objects on purpose (spec §5, manifest §2), so a
STANDARDS.md line-51 carve-out is prepared as a deferred reconciliation rather
than redesigning the spec-mandated keyless artifact story.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
…n output

Generalizes the top-slot check from route cases to every non-refusal case with
an expected source: judgeRetrieval only checks top-K membership, so a
quantization flip that keeps both Smiths retrieved but swaps which ranks first
would pass keyless and leave the corpus's marquee verdict (disambiguation)
protected only by the keyed --full pass. Now the expected source must WIN the
top slot, not merely appear — the right Smith must outrank the wrong one, the
private note must outrank the records. New test proves a partial-mode Smith-vs-
Smith int4 flip is caught keyless (40 tests).
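
The difference between the presence check and the top-slot check can be sketched as below. The `RankedHit` and `GoldCase` shapes, the `mustRefuse` and `expectedSource` fields, and the function names are hypothetical; judgeRetrieval and the harness's actual check are not quoted here.

```typescript
// Hypothetical shapes for illustration only.
interface RankedHit {
  id: string;
  score: number;
}

interface GoldCase {
  id: string;
  mustRefuse: boolean;
  expectedSource?: string;
}

type Verdict = "pass" | "fail" | "skipped";

function sortByScore(hits: readonly RankedHit[]): RankedHit[] {
  return [...hits].sort((a, b) => b.score - a.score);
}

// Presence only: the expected source appears somewhere in the top K.
function judgePresence(hits: readonly RankedHit[], expected: string, k: number): boolean {
  return sortByScore(hits)
    .slice(0, k)
    .some((hit) => hit.id === expected);
}

// Top slot: the expected source must rank first, not merely appear.
// Applies to every non-refusal case that names an expected source.
function judgeTopSlot(goldCase: GoldCase, hits: readonly RankedHit[]): Verdict {
  if (goldCase.mustRefuse || goldCase.expectedSource === undefined) {
    return "skipped";
  }
  const winner = sortByScore(hits)[0];
  return winner !== undefined && winner.id === goldCase.expectedSource ? "pass" : "fail";
}
```

On a made-up int4 ranking where a public record scores 0.71 and the expected private note scores 0.70, `judgePresence` still returns true for K = 2 while `judgeTopSlot` returns "fail". That gap is what this change closes.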

scaling:run now states plainly what it is running (encoding: int8 shipped vs
int4 tightened; corpus: natural vs +spire; keyless) and prints a verdict line
(CERTIFIED / REJECTED) so a reader knows what the result means.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
Stops the README opening from asserting the real-corpus caught failure as
established fact. It now leads with the mechanism proven offline in
quantize.test.ts (fixture vectors searched to exhibit the near-tie) and marks
the real-Smith-corpus demonstration as pending the build run — the same
"don't claim runnable before it runs" rule applied to NEXT-STEPS, now applied to
the README itself. Fixes the int4 command to --natural+synthetic --bits 4 (the
spire only loads under +synthetic) and notes scaling:run errors until built.
Notes the keyless top-slot check now protects disambiguation too.

Review nits: synthetic note's traveling title is now unambiguously synthetic
(the label is the A1 leak surface); "no synthetic type field" reworded to match
the synthetic:true marker the file carries; "the real Adam Smith" -> "either
real Smith"; the spec and delta log are noted in the README as kept-in-the-open
on purpose; delta row 12 updated for the disambiguation extension.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
The module read like a package/subsystem; it is a demo script. Renames the
top-level dir scaling/ -> demo/, the npm scripts to demo:build / demo:run /
demo:test, scaling.config.ts -> demo/config.ts, and every path/script reference
in the module and the operational docs (build-handoff, delta log). Prose like
"the int8 scaling demo" is left as description. The historical spec and corpus
draft keep the original scaling/ name as the proposal; delta-log row 16 bridges
it. tsconfig + the npm test glob updated; 40 tests pass, typecheck clean,
demo:run degrades cleanly with the new paths.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
lukefwalton changed the title from "Runnable int8 scaling demo (scaling/): mechanism + corpus build handoff" to "Runnable int8 demo (demo/): mechanism + corpus build handoff" Jun 16, 2026
lukefwalton merged commit 58f82c6 into main Jun 16, 2026
2 of 3 checks passed
lukefwalton deleted the claude/lucid-wozniak-xeztl6 branch Jun 16, 2026 05:22
