Runnable int8 demo (demo/): mechanism + corpus build handoff by lukefwalton · Pull Request #15 · lukefwalton/answer-engine

lukefwalton · 2026-06-16T04:40:54Z

What this is

A thin scaling/ module that makes the paper's §6 claim runnable: the same gold suite that owns grounding and refusal also adjudicates the int8 cost lever. It quantizes the embedding index to int8, re-ranks, and runs the full gold suite (including must-refuse and must-route) against the quantized index.

The result it is built to produce is a caught failure: at --bits 4 a route case flips (the private note loses the top slot to a public record) and the gold suite catches it. "int8 held" alone proves little; the gate saying no when pushed is the point.

What is committed vs pending a build run

This session's egress allowed only GitHub (api.openai.com, Gutenberg, archive.org all returned host_not_allowed) and no OPENAI_API_KEY was set. So:

Committed and green now: the quantizer, the harness, the keyless runner, the keyed build script, the gold set, the corpus provenance manifest, and deterministic fixture tests — including the int4 route-flip payload proven on synthetic geometry.
Pending a build run (local agent with network + key): the real text bodies and the committed vectors (index.json, query-vectors.json). Exact steps in docs/scaling-demo/build-handoff.md. Nothing is faked; the demo halts honestly at the network boundary.

Budget rule: held, no halt

The int8 path is a quantize wrapper plus a re-rank. It reuses src/retrieve.ts (retrieve/cosine), the no-leak boundary, the gold judges, src/store.ts, and the corpus loaders untouched — no second pipeline, no core type changes. quantize.ts is the public twin of the production vector-quant.ts.

Two design points worth a look (both in the delta log):

The keyless headline needs committed gold-query vectors, not just FP vectors, because the core eval CLI embeds queries at run time (so it always needs a key). query-vectors.json + a thin runner solve this; the eval CLI is not reused verbatim.
judgeRetrieval checks presence only, so a top-slot route flip where the note stays retrieved would slip through; the harness adds a keyless route-selection check to catch it.
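
The gap between a presence check and a top-slot check can be sketched as below. This is a minimal illustration, not the repo's code: `Hit`, `isPresent`, `winsTopSlot`, the source ids, and the scores are all hypothetical, and the ranked list is assumed to be sorted best-first.

```typescript
// Hypothetical shape for illustration; the repo's real types may differ.
type Hit = { sourceId: string; score: number };

// Presence check: passes if the expected source appears anywhere in the top K.
function isPresent(ranked: Hit[], expectedId: string, k: number): boolean {
  return ranked.slice(0, k).some((h) => h.sourceId === expectedId);
}

// Top-slot check: passes only if the expected source ranks first.
function winsTopSlot(ranked: Hit[], expectedId: string): boolean {
  return ranked[0]?.sourceId === expectedId;
}

// A route flip: the private note is still retrieved but no longer ranks first.
const flippedRanking: Hit[] = [
  { sourceId: "public-record", score: 0.96 },
  { sourceId: "private-note", score: 0.93 },
];

const presenceVerdict = isPresent(flippedRanking, "private-note", 2); // true: the flip slips through
const topSlotVerdict = winsTopSlot(flippedRanking, "private-note"); // false: the flip is caught
```

Both checks need only the ranked ids, so neither requires a key or a model call.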

What to review

scaling/ — the module (README, config, quantize.ts, harness.ts, run.ts, build.ts, gold, corpus manifest, the quarantined spire).
scaling/quantize.test.ts — the mechanism + the deliberate failure, offline.
docs/scaling-demo/scaling-demo-delta-log.md — what's settled vs pending, the divergences, and the prepared NEXT-STEPS C-intro/C1 reconciliation (held until scaling:run confirms the headline, so "runnable" is verified not asserted).
docs/scaling-demo/build-handoff.md — the brief for the local agent.

Test plan

npm test → 36 pass (25 core + 11 scaling); npm run typecheck clean.
npm run scaling:run degrades cleanly with a pointer to the handoff when vectors aren't built.
Local agent runs build-handoff.md to generate vectors, confirm the int8 headline, fire the int4 break, then apply the NEXT-STEPS reconciliation.

Base / scope note

main is at v1.2.0, so this branch is 58 commits ahead (the prior v1.3–v1.5 work plus these 5 scaling commits). The scaling contribution is the scaling/ directory and the two docs/scaling-demo/ files above (plus tsconfig.json/package.json wiring); the rest of the diff is the accumulated work main has not yet received. Retarget the base if you want a scaling-only diff.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1

Generated by Claude Code

…ndoff Thin module beside the core (the budget rule): scaling.config.ts points the reused engine at scaling/corpus/ with two name-colliding public-domain authors (Adam Smith the economist, George Adam Smith the theologian). Adds the corpus provenance + authored-choices manifest (scaling/corpus/README.md) and a build handoff for an environment with network + an OpenAI key, since this session's egress allowed only GitHub and api.openai.com was blocked. Brings scaling/ under typecheck. No core behavior changed. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1

scaling/gold.yaml, real-only (--natural). Same three-mode shape as the core set: disambiguation both ways (economist vs theologian over shared "justice" themes and the name-boost mis-fire), the partial-name boost edge ("Adam Smith" phrase-matching a "George Adam Smith" title), a route case the private sermon must win without restating, and a refuse case. Cases sit near the floor and near each other, where int8 rounding can reorder them. Source ids match the corpus slugs the build handoff defines. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1

The int8 path as a thin wrapper plus a re-rank, reusing src/retrieve.ts cosine, the gold judge, and the no-leak boundary untouched. quantize.ts is the public twin of the production vector-quant.ts (per-vector symmetric, int8 and int4 from one path). harness.ts re-ranks the quantized index, reports rank correlation (necessary) and the gold verdicts including the route top-slot check (sufficient). run.ts is keyless: it reads committed FP + gold-query vectors and quantizes in process; --full adds the keyed answer pass. build.ts (keyed, for the local agent) embeds the corpus and gold queries. quantize.test.ts proves the mechanism offline on fixture geometry, including the payload: the gate certifies int8 and rejects an int4 route flip in which the note stays retrieved but loses the top slot. npm test now covers scaling/ too: 36 pass (25 core + 11 scaling), typecheck clean. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
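
A minimal sketch of per-vector symmetric quantization with int8 and int4 sharing one path, assuming the usual scheme of one scale per vector and signed codes in [-qmax, qmax]. The function names are illustrative and are not the API of quantize.ts. The 3-d vectors are toy geometry chosen by hand to sit in a near-tie, not data from the repo.

```typescript
// Illustrative only: these names are not the API of quantize.ts.
type Quantized = { codes: number[]; scale: number };

// Per-vector symmetric quantization: one scale per vector, zero maps to zero.
function quantizeVector(vec: number[], bits: number): Quantized {
  const qmax = 2 ** (bits - 1) - 1; // 127 for int8, 7 for int4
  const maxAbs = vec.reduce((m, x) => Math.max(m, Math.abs(x)), 0);
  const scale = maxAbs === 0 ? 1 : maxAbs / qmax;
  const codes = vec.map((x) => Math.max(-qmax, Math.min(qmax, Math.round(x / scale))));
  return { codes, scale };
}

function dequantizeVector(q: Quantized): number[] {
  return q.codes.map((c) => c * q.scale);
}

function cosineSim(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * (b[i] ?? 0), 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Toy near-tie: the note wins narrowly at full precision.
const toyQuery = [1, 0, 0];
const toyIndex: Record<string, number[]> = {
  "private-note": [1, 0.22, 0.22],
  "public-record": [1, 0.32, 0],
};

// Re-rank against the index at a given bit width (undefined = full precision).
function topSource(bits?: number): string {
  let best = "";
  let bestScore = -Infinity;
  for (const [id, vec] of Object.entries(toyIndex)) {
    const v = bits === undefined ? vec : dequantizeVector(quantizeVector(vec, bits));
    const score = cosineSim(toyQuery, v);
    if (score > bestScore) {
      best = id;
      bestScore = score;
    }
  }
  return best;
}

const fpTop = topSource(); // "private-note"
const int8Top = topSource(8); // "private-note": the route holds
const int4Top = topSource(4); // "public-record": the top slot flips
```

At 4 bits the note's two small components each round up (0.22 becomes 2/7) while the record's single component rounds down (0.32 becomes 2/7), which is enough to reverse a margin this thin. At 8 bits the rounding error is too small to do so.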

…p gold The payload, the part the demo rests on. A fabricated George-private note, quarantined in scaling/corpus/synthetic/ (the location is the flag) and marked synthetic:true with the gold case, margin, and mode it targets, plus the expanded gold (gold.synthetic.yaml) loaded only under --natural+synthetic. The note is built to sit at the floor just above the public Amos exposition, so int8 holds the route while int4 flips the top slot to the public record and the gold suite catches it. The mechanism is already proven offline in quantize.test.ts; this is its real-corpus instance, calibrated against real vectors by the build (handoff §4). Headline numbers stay on --natural; the spire is broken out. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1

scaling/README.md in the papers' sparse register (no em-dashes), leading with the deliberate failure: the same gold suite that owns grounding and refusal rejecting a cheaper encoding. States the three non-negotiable disclosures, the exact-vs-measured admissibility split, rank correlation as necessary-not-sufficient, the commit-vectors-only-because-public-domain caveat with the inversion warning, and the reuse boundary (retrieval and no-leak untouched). Cross-links production-scaling.md §2 as the prose companion. Fills the delta log with what the build settled vs what is pending the keyed build run, the divergences found (keyless headline needs committed gold-query vectors; demo-canonical record URLs; the added route-selection gate; the GitHub-only egress that deferred the corpus + vectors), and the prepared NEXT-STEPS C-intro/C1 reconciliation to apply once scaling:run confirms the headline. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
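
Why rank correlation is necessary but not sufficient can be shown with a small worked case. The sketch below assumes Spearman's rho over tie-free rankings; the source does not say which correlation the harness reports, so treat the choice as an assumption. `spearmanRho` and the ids are hypothetical.

```typescript
// Spearman's rho for two tie-free orderings of the same ids (best-first).
// Assumption: the harness may use a different correlation measure.
function spearmanRho(fpOrder: string[], quantOrder: string[]): number {
  const n = fpOrder.length;
  if (n < 2) return 1;
  const quantRank = new Map<string, number>(
    quantOrder.map((id, i): [string, number] => [id, i]),
  );
  let sumSquaredDiff = 0;
  fpOrder.forEach((id, i) => {
    const j = quantRank.get(id);
    if (j === undefined) throw new Error(`id missing from quantized ranking: ${id}`);
    sumSquaredDiff += (i - j) ** 2;
  });
  return 1 - (6 * sumSquaredDiff) / (n * (n * n - 1));
}

// Only the top two swap places; everything else keeps its rank.
const fpRanking = ["private-note", "public-record", "c", "d", "e"];
const quantRanking = ["public-record", "private-note", "c", "d", "e"];

const rho = spearmanRho(fpRanking, quantRanking); // 1 - 12/120 = 0.9
const topSlotHeld = fpRanking[0] === quantRanking[0]; // false
```

A correlation of 0.9 looks healthy, yet the one swap it hides is exactly the route flip the gold suite exists to catch, which is why the verdicts rather than the correlation decide the gate.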

surmado-code-review · 2026-06-16T04:42:09Z

Automated Checks (advisory, non-blocking)

❌ Hallucinated package — node:test used in demo/quantize.test.ts does not exist on npm. This package appears to be invented by AI.
✅ No other issues detected.

Standards Compliance

One repo-standards issue still looks unresolved after the scaling/ → demo/ rename, and it’s the main thing I’d want the reviewer to settle before merge:

demo/config.ts + demo/build.ts still produce commit-intended artifacts from the private layer, which conflicts with the repo’s artifact boundary rule.
Relevant code:
```
privateNotesDir: './demo/corpus/private',
```
```
const naturalNotes = buildPrivateNotes(config);
```
```
const naturalEntries = [...recordEntries(config, vectors), ...noteEntries(naturalNotes, vectors)]
writeIndexFile(naturalEntries, NATURAL_INDEX);
```
```
console.log('Done. Commit the *.json artifacts, then `npm run demo:run`.');
```
and the note entries themselves are serialized as:
```
sourceType: 'note',
note,
```
The loaded standards are explicit in §5: “Don't leak private embeddings/text into committed artifacts. (The index is gitignored for a reason.)”
This PR still builds demo/corpus/index.json / index.synthetic.json from privateNotesDir and explicitly instructs committing those JSON files. Even if runtime answering still respects assembleEvidence and never sends note prose to the model, this is a standards conflict at artifact creation time.

I don’t see a standards change in this diff that authorizes that exception, so as written this is still a repo-policy violation, not just a docs choice.

Summary

This PR adds a standalone demo/ flow for quantizing the corpus index, re-ranking against the quantized vectors, and adjudicating a gold suite keylessly via committed query vectors, with an optional keyed --full answer-mode pass. The implementation mostly stays thin and reuses core retrieval/eval/store logic; the main review decision is whether demo/build.ts is allowed to create committed note-derived artifacts from demo/corpus/private.

Reviewer: most of the risk is in demo/build.ts / demo/config.ts and the standards exception they imply; the quantization harness itself looks comparatively low-risk.

What to pay attention to

demo/build.ts + demo/config.ts: this is where the standards boundary is actually enforced or broken. The concern is not “does runtime leak private prose?”; it’s “are we committing note-derived artifacts at all?”
demo/run.ts --full: worth a quick sanity check, and the change is a good one. It now passes requantizeIndex(index, args.bits) into the answer pass, which avoids masking a lossy-index flip with full-precision retrieval.
demo/harness.ts top-slot logic: this is the substantive new gate behavior. judgeRetrieval only checks presence, so this extra top-slot check is what makes route/disambiguation flips visible.

Things I noticed

🔴 Red flags — fix before merge:

Committed private-layer artifacts still violate the repo standards.
In demo/build.ts, the natural index is built from both public records and private notes:
```
const naturalEntries = [...recordEntries(config, vectors), ...noteEntries(naturalNotes, vectors)]
writeIndexFile(naturalEntries, NATURAL_INDEX);
```
combined with:
```
privateNotesDir: './demo/corpus/private',
```
and:
```
console.log('Done. Commit the *.json artifacts, then `npm run demo:run`.');
```
That will produce commit-intended JSON derived from the repo’s private layer, which conflicts directly with §5.

🟡 Yellow flags — consider for this PR or a follow-up:

There’s still no hard guard around that exception path. Even if the team decides this demo deserves a narrow carve-out, demo:build writes note-derived commit artifacts as the default path. An explicit opt-in or hard fail would make accidental boundary drift much less likely.

Good patterns

demo/run.ts keeps the keyed answer-mode pass aligned with the same quantized retrieval surface by calling:
```
await runAnswerPass(gold, requantizeIndex(index, args.bits), qv.byId, config);
```
That avoids a subtle false-pass where retrieval is judged on lossy vectors but answers are generated from full-precision evidence.
demo/query-vectors.ts handles malformed or mismatched artifacts as loud failures with rebuild guidance, which matches the repo’s error-handling standard well.

Suggested improvements

Make demo:build refuse to write commit-intended artifacts containing note-derived entries unless the repo standards are explicitly updated to allow that case.
If no standards exception is intended, keep committed demo artifacts public-only and make note-derived vectors local/uncommitted.
Once policy is settled, add a regression guard around the artifact boundary so future demo changes can’t silently reintroduce committed private-layer outputs.
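
One possible shape for such a guard, as a sketch only. `IndexEntry`, `assertCommitSafe`, and the `'record'` source type are hypothetical (only `sourceType: 'note'` appears in the quoted diff), and the explicit opt-in flag stands in for whatever mechanism the team prefers.

```typescript
// Hypothetical entry shape; only sourceType 'note' is confirmed by the diff.
type IndexEntry = { id: string; sourceType: "record" | "note" };

// Refuse to write a commit-intended index that contains note-derived entries
// unless the caller opts in explicitly.
function assertCommitSafe(entries: IndexEntry[], allowNoteEntries = false): void {
  const noteIds = entries.filter((e) => e.sourceType === "note").map((e) => e.id);
  if (noteIds.length > 0 && !allowNoteEntries) {
    throw new Error(
      `Refusing to write commit-intended index: ${noteIds.length} note-derived ` +
        `entries (${noteIds.join(", ")}). Pass an explicit opt-in or keep them local.`,
    );
  }
}
```

Calling the guard immediately before the index write would make the default path fail closed, so a carve-out has to be a deliberate choice rather than a side effect of running the build.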

Questions for the author

Is the intent here to establish a repo-level exception for demo/corpus/*.json containing private-layer embeddings, or should the demo be reshaped so committed artifacts stay public-only?

Surmado Code Review (v1.2-mt) is an automated review, designed to work alongside human judgment.


…ures --full now retrieves the answer-mode evidence from the SAME quantized index the retrieval gate judged, so a route flip on the lossy index can no longer be masked by full-precision retrieval (the keyed pass exercises the surface it claims to). build.ts now throws with the id when any source or gold-query embedding is missing, instead of silently writing a partial artifact, matching the repo's loud-failure standard. Aligns the test import to the core's named node:test form. Not changed: the committed scaling/corpus/index.json. Committing those vectors is the spec's intentional divergence (§5) — the "private" layer is public-domain George text, the index is deliberately not gitignored, and README §2 + the corpus manifest §2 explain it with the embedding-inversion warning. The STANDARDS.md reconciliation for that is a separate call, raised with the author. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
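
The fail-with-the-id behavior described for build.ts might look roughly like this. It is a sketch under assumptions: `collectEmbeddings` is a hypothetical helper, and the real script may structure its inputs differently.

```typescript
// Hypothetical helper; build.ts in the repo may be shaped differently.
function collectEmbeddings(
  ids: string[],
  vectors: Map<string, number[]>,
): Record<string, number[]> {
  const out: Record<string, number[]> = {};
  for (const id of ids) {
    const vec = vectors.get(id);
    if (vec === undefined || vec.length === 0) {
      // Fail loudly and name the id, rather than writing a partial artifact.
      throw new Error(`Missing embedding for "${id}"; refusing to write a partial artifact.`);
    }
    out[id] = vec;
  }
  return out;
}
```

Because the check runs before anything is written, a failed build leaves no half-populated index behind for a later keyless run to read.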

…RDS carve-out readQueryVectors now checks each committed vector is numeric and matches the file's dimensions, so a corrupt query-vectors.json fails loudly at read with the rebuild hint instead of surfacing later as bad cosine (the repo's loud-failure standard). Adds query-vectors.test.ts: round-trip, missing-file-is-null, and the malformed cases (wrong dims, non-numeric, bad version). 39 tests pass. Records in the delta log the one standing standards point the review keeps raising: the demo commits scaling/corpus/index.json with the public-domain George "private"-layer note objects on purpose (spec §5, manifest §2), so a STANDARDS.md line-51 carve-out is prepared as a deferred reconciliation rather than redesigning the spec-mandated keyless artifact story. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
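
A sketch of the read-time validation described above. The file shape (`version`, `dimensions`, `byId`) and the supported version number are assumptions made for illustration; only the `byId` field name and the three malformed cases (wrong dims, non-numeric, bad version) come from the PR text. `parseQueryVectors` is a hypothetical name.

```typescript
// Assumed file shape; the real query-vectors.json may differ.
type QueryVectors = { version: number; dimensions: number; byId: Record<string, number[]> };

const SUPPORTED_VERSION = 1; // assumption for illustration
const REBUILD_HINT = "Rebuild the query vectors before running the demo.";

function parseQueryVectors(raw: unknown): QueryVectors {
  if (typeof raw !== "object" || raw === null) {
    throw new Error(`query vectors: not an object. ${REBUILD_HINT}`);
  }
  const { version, dimensions, byId } = raw as {
    version?: unknown;
    dimensions?: unknown;
    byId?: unknown;
  };
  if (version !== SUPPORTED_VERSION) {
    throw new Error(`query vectors: bad version ${String(version)}. ${REBUILD_HINT}`);
  }
  if (typeof dimensions !== "number" || !Number.isInteger(dimensions) || dimensions <= 0) {
    throw new Error(`query vectors: bad dimensions. ${REBUILD_HINT}`);
  }
  if (typeof byId !== "object" || byId === null) {
    throw new Error(`query vectors: missing byId. ${REBUILD_HINT}`);
  }
  const checked: Record<string, number[]> = {};
  for (const [id, vec] of Object.entries(byId)) {
    if (!Array.isArray(vec) || vec.length !== dimensions) {
      throw new Error(`query vectors: "${id}" has wrong dimensions. ${REBUILD_HINT}`);
    }
    if (!vec.every((x) => typeof x === "number" && Number.isFinite(x))) {
      throw new Error(`query vectors: "${id}" has non-numeric values. ${REBUILD_HINT}`);
    }
    checked[id] = vec as number[];
  }
  return { version: SUPPORTED_VERSION, dimensions, byId: checked };
}
```

Validating at read time keeps the failure next to its cause: a corrupt vector is reported with its id instead of surfacing later as a NaN or a misleading cosine score.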

…n output Generalizes the top-slot check from route cases to every non-refusal case with an expected source: judgeRetrieval only checks top-K membership, so a quantization flip that keeps both Smiths retrieved but swaps which ranks first would pass keyless and leave the corpus's marquee verdict (disambiguation) protected only by the keyed --full pass. Now the expected source must WIN the top slot, not merely appear: the right Smith must outrank the wrong one, the private note must outrank the records. New test proves a partial-mode Smith-vs-Smith int4 flip is caught keyless (40 tests). scaling:run now states plainly what it is running (encoding: int8 shipped vs int4 tightened; corpus: natural vs +spire; keyless) and prints a verdict line (CERTIFIED / REJECTED) so a reader knows what the result means. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
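
The generalized gate and its verdict line could be sketched as follows. Everything here is hypothetical: the `GoldCase` shape, the mode names, the source ids, and the wording of the line that `verdictLine` returns are assumptions, since the PR text gives only the CERTIFIED / REJECTED labels.

```typescript
// Hypothetical case shape and mode names, for illustration only.
type GoldCase = { id: string; mode: "answer" | "route" | "refuse"; expectedSource?: string };
type Verdict = "CERTIFIED" | "REJECTED";

// Every non-refusal case with an expected source must win the top slot.
function adjudicate(
  cases: GoldCase[],
  topSourceByCase: Map<string, string>,
): { verdict: Verdict; failures: string[] } {
  const failures = cases
    .filter((c) => c.mode !== "refuse" && c.expectedSource !== undefined)
    .filter((c) => topSourceByCase.get(c.id) !== c.expectedSource)
    .map((c) => c.id);
  return { verdict: failures.length === 0 ? "CERTIFIED" : "REJECTED", failures };
}

// Assumed output format; the real runner's wording is not shown in the PR.
function verdictLine(bits: number, corpus: string, verdict: Verdict): string {
  return `${verdict}: int${bits} on ${corpus} (keyless)`;
}

const goldCases: GoldCase[] = [
  { id: "disambiguate-economist", mode: "answer", expectedSource: "smith-economist" },
  { id: "route-sermon", mode: "route", expectedSource: "private-note" },
  { id: "refuse-unknown", mode: "refuse" },
];

// Both Smiths may still be retrieved; only the top-ranked source matters here.
const int4Tops = new Map<string, string>([
  ["disambiguate-economist", "smith-theologian"],
  ["route-sermon", "private-note"],
]);

const int4Result = adjudicate(goldCases, int4Tops);
const int4Line = verdictLine(4, "natural", int4Result.verdict);
```

Refusal cases are skipped by the top-slot gate because they have no expected source to win; they remain the job of the existing refusal judge.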

Stops the README opening from asserting the real-corpus caught failure as established fact. It now leads with the mechanism proven offline in quantize.test.ts (fixture vectors searched to exhibit the near-tie) and marks the real-Smith-corpus demonstration as pending the build run — the same "don't claim runnable before it runs" rule applied to NEXT-STEPS, now applied to the README itself. Fixes the int4 command to --natural+synthetic --bits 4 (the spire only loads under +synthetic) and notes scaling:run errors until built. Notes the keyless top-slot check now protects disambiguation too. Review nits: synthetic note's traveling title is now unambiguously synthetic (the label is the A1 leak surface); "no synthetic type field" reworded to match the synthetic:true marker the file carries; "the real Adam Smith" -> "either real Smith"; the spec and delta log are noted in the README as kept-in-the-open on purpose; delta row 12 updated for the disambiguation extension. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1

The module read like a package/subsystem; it is a demo script. Renames the top-level dir scaling/ -> demo/, the npm scripts to demo:build / demo:run / demo:test, scaling.config.ts -> demo/config.ts, and every path/script reference in the module and the operational docs (build-handoff, delta log). Prose like "the int8 scaling demo" is left as description. The historical spec and corpus draft keep the original scaling/ name as the proposal; delta-log row 16 bridges it. tsconfig + the npm test glob updated; 40 tests pass, typecheck clean, demo:run degrades cleanly with the new paths. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1

claude added 5 commits June 16, 2026 04:21

claude added 5 commits June 16, 2026 04:45

lukefwalton changed the title ~~Runnable int8 scaling demo (scaling/): mechanism + corpus build handoff~~ Runnable int8 demo (demo/): mechanism + corpus build handoff Jun 16, 2026

lukefwalton merged commit 58f82c6 into main Jun 16, 2026
2 of 3 checks passed

lukefwalton deleted the claude/lucid-wozniak-xeztl6 branch June 16, 2026 05:22

lukefwalton merged 10 commits into main from claude/lucid-wozniak-xeztl6