Runnable int8 demo (demo/): mechanism + corpus build handoff #15
…ndoff Thin module beside the core (the budget rule): scaling.config.ts points the reused engine at scaling/corpus/ with two name-colliding public-domain authors (Adam Smith the economist, George Adam Smith the theologian). Adds the corpus provenance + authored-choices manifest (scaling/corpus/README.md) and a build handoff for an environment with network + an OpenAI key, since this session's egress allowed only GitHub and api.openai.com was blocked. Brings scaling/ under typecheck. No core behavior changed. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
scaling/gold.yaml, real-only (--natural). Same three-mode shape as the core
set: disambiguation both ways (economist vs theologian over shared "justice"
themes and the name-boost mis-fire), the partial-name boost edge ("Adam Smith"
phrase-matching a "George Adam Smith" title), a route case the private sermon
must win without restating, and a refuse case. Cases sit near the floor and
near each other, where int8 rounding can reorder them. Source ids match the
corpus slugs the build handoff defines.
https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
The int8 path as a thin wrapper plus a re-rank, reusing src/retrieve.ts cosine, the gold judge, and the no-leak boundary untouched. quantize.ts is the public twin of the production vector-quant.ts (per-vector symmetric, int8 and int4 from one path). harness.ts re-ranks the quantized index and reports rank correlation (necessary) and the gold verdicts including the route top-slot check (sufficient). run.ts is keyless: it reads committed FP + gold-query vectors and quantizes in process; --full adds the keyed answer pass. build.ts (keyed, for the local agent) embeds the corpus and gold queries. quantize.test.ts proves the mechanism offline on fixture geometry, including the payload: the gate certifies int8 and rejects an int4 route flip in which the note stays retrieved but loses the top slot. npm test now covers scaling/ too: 36 pass (25 core + 11 scaling), typecheck clean. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
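As a rough illustration of what "per-vector symmetric, int8 and int4 from one path" can look like, here is a minimal sketch. It is not the repo's quantize.ts: the names `QuantizedVector`, `quantize`, `dequantize`, and this `cosine` helper are hypothetical, and storing int4 codes unpacked in an `Int8Array` is a simplifying assumption.

```typescript
// Hypothetical sketch of per-vector symmetric quantization.
// None of these names are taken from the repo's quantize.ts.
interface QuantizedVector {
  codes: Int8Array; // signed codes in [-qmax, qmax]; int4 is stored unpacked here
  scale: number;    // per-vector scale, so value ≈ code * scale
  bits: 4 | 8;
}

function quantize(vec: readonly number[], bits: 4 | 8): QuantizedVector {
  const qmax = 2 ** (bits - 1) - 1; // 127 for int8, 7 for int4
  let maxAbs = 0;
  for (const v of vec) maxAbs = Math.max(maxAbs, Math.abs(v));
  // Symmetric: the largest magnitude maps to ±qmax, zero maps to zero.
  const scale = maxAbs === 0 ? 1 : maxAbs / qmax;
  const codes = new Int8Array(vec.length);
  vec.forEach((v, i) => {
    codes[i] = Math.round(v / scale);
  });
  return { codes, scale, bits };
}

function dequantize(q: QuantizedVector): number[] {
  return Array.from(q.codes, (c) => c * q.scale);
}

// Plain cosine similarity, used to compare a vector with its lossy copy.
function cosine(a: readonly number[], b: readonly number[]): number {
  const dot = a.reduce((s, x, i) => s + x * (b[i] ?? 0), 0);
  const na = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const nb = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return na === 0 || nb === 0 ? 0 : dot / (na * nb);
}

// Usage: the same function serves both encodings; only `bits` changes.
const sample = [0.5, -1, 0.25];
const asInt8 = quantize(sample, 8);
const asInt4 = quantize(sample, 4);
console.log(cosine(sample, dequantize(asInt8)), cosine(sample, dequantize(asInt4)));
```

The only difference between the two encodings in this sketch is the clamp range, which is why a single code path can serve both the shipped int8 setting and the int4 stress setting.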
…p gold The payload, the part the demo rests on. A fabricated George-private note, quarantined in scaling/corpus/synthetic/ (the location is the flag) and marked synthetic:true with the gold case, margin, and mode it targets, plus the expanded gold (gold.synthetic.yaml) loaded only under --natural+synthetic. The note is built to sit at the floor just above the public Amos exposition, so int8 holds the route while int4 flips the top slot to the public record and the gold suite catches it. The mechanism is already proven offline in quantize.test.ts; this is its real-corpus instance, calibrated against real vectors by the build (handoff §4). Headline numbers stay on --natural; the spire is broken out. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
scaling/README.md in the papers' sparse register (no em-dashes), leading with the deliberate failure: the same gold suite that owns grounding and refusal rejecting a cheaper encoding. States the three non-negotiable disclosures, the exact-vs-measured admissibility split, rank correlation as necessary-not-sufficient, the commit-vectors-only-because-public-domain caveat with the inversion warning, and the reuse boundary (retrieval and no-leak untouched). Cross-links production-scaling.md §2 as the prose companion. Fills the delta log with what the build settled vs what is pending the keyed build run, the divergences found (keyless headline needs committed gold-query vectors; demo-canonical record URLs; the added route-selection gate; the GitHub-only egress that deferred the corpus + vectors), and the prepared NEXT-STEPS C-intro/C1 reconciliation to apply once scaling:run confirms the headline. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
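One way to see why rank correlation is necessary but not sufficient is a small worked case. The sketch below computes Spearman's rho between two rankings with no ties; the function name `spearman` and the source ids are hypothetical and do not come from the repo or its corpus slugs.

```typescript
// Hypothetical sketch: Spearman rank correlation between two orderings of
// the same ids (no ties), using rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
function spearman(fpOrder: readonly string[], quantOrder: readonly string[]): number {
  const n = fpOrder.length;
  if (n < 2 || quantOrder.length !== n) {
    throw new Error("rankings must have the same length, at least 2");
  }
  const pos = new Map(quantOrder.map((id, i) => [id, i] as const));
  let sumSq = 0;
  fpOrder.forEach((id, i) => {
    const j = pos.get(id);
    if (j === undefined) throw new Error(`id ${id} missing from quantized ranking`);
    sumSq += (i - j) ** 2;
  });
  return 1 - (6 * sumSq) / (n * (n * n - 1));
}

// Illustrative ids only. The lossy ranking swaps just the top two entries.
const fpRanking = ["private-note", "public-record", "c", "d", "e"];
const lossyRanking = ["public-record", "private-note", "c", "d", "e"];

const rho = spearman(fpRanking, lossyRanking);
const topSlotHeld = fpRanking[0] === lossyRanking[0];
console.log({ rho, topSlotHeld }); // rho is about 0.9 while topSlotHeld is false
```

Here the correlation stays high (about 0.9) even though the one position a route case depends on has flipped, which is the gap a verdict-level check is meant to close.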
Automated Checks (advisory, non-blocking)
Standards Compliance
One repo-standards issue still looks unresolved after the
Summary
This PR adds a standalone
Reviewer: most of the risk is in
What to pay attention to
Things I noticed
🔴 Red flags (fix before merge):
🟡 Yellow flags (consider for this PR or a follow-up):
Good patterns
Suggested improvements
Questions for the author
…ures --full now retrieves the answer-mode evidence from the SAME quantized index the retrieval gate judged, so a route flip on the lossy index can no longer be masked by full-precision retrieval (the keyed pass exercises the surface it claims to). build.ts now throws with the id when any source or gold-query embedding is missing, instead of silently writing a partial artifact, matching the repo's loud-failure standard. Aligns the test import to the core's named node:test form. Not changed: the committed scaling/corpus/index.json. Committing those vectors is the spec's intentional divergence (§5) — the "private" layer is public-domain George text, the index is deliberately not gitignored, and README §2 + the corpus manifest §2 explain it with the embedding-inversion warning. The STANDARDS.md reconciliation for that is a separate call, raised with the author. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
…RDS carve-out readQueryVectors now checks each committed vector is numeric and matches the file's dimensions, so a corrupt query-vectors.json fails loudly at read with the rebuild hint instead of surfacing later as bad cosine (the repo's loud-failure standard). Adds query-vectors.test.ts: round-trip, missing-file-is-null, and the malformed cases (wrong dims, non-numeric, bad version). 39 tests pass. Records in the delta log the one standing standards point the review keeps raising: the demo commits scaling/corpus/index.json with the public-domain George "private"-layer note objects on purpose (spec §5, manifest §2), so a STANDARDS.md line-51 carve-out is prepared as a deferred reconciliation rather than redesigning the spec-mandated keyless artifact story. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
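The read-time validation described above might be sketched as follows. This is not the repo's readQueryVectors: the file shape (`version`, `dimensions`, `vectors`), the expected version value, the function name `parseQueryVectors`, and the error wording are all assumptions made for illustration.

```typescript
// Hypothetical sketch of loud validation for a committed query-vectors file.
// The shape below is assumed, not taken from the repo.
interface QueryVectorFile {
  version: number;
  dimensions: number;
  vectors: Record<string, number[]>;
}

const EXPECTED_VERSION = 1; // assumed value
const REBUILD_HINT = "rebuild the vectors (see the build handoff)";

function parseQueryVectors(raw: string): QueryVectorFile {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error(`query vectors: not an object; ${REBUILD_HINT}`);
  }
  const { version, dimensions, vectors } = data as Record<string, unknown>;
  if (version !== EXPECTED_VERSION) {
    throw new Error(`query vectors: unsupported version; ${REBUILD_HINT}`);
  }
  if (typeof dimensions !== "number" || !Number.isInteger(dimensions) || dimensions <= 0) {
    throw new Error(`query vectors: bad dimensions; ${REBUILD_HINT}`);
  }
  if (typeof vectors !== "object" || vectors === null) {
    throw new Error(`query vectors: missing vectors; ${REBUILD_HINT}`);
  }
  const out: Record<string, number[]> = {};
  for (const [id, vec] of Object.entries(vectors as Record<string, unknown>)) {
    // Fail at read time, naming the offending id, rather than letting a
    // malformed vector surface later as a bad cosine score.
    if (!Array.isArray(vec) || vec.length !== dimensions) {
      throw new Error(`query vectors: ${id} does not have ${dimensions} dims; ${REBUILD_HINT}`);
    }
    if (!vec.every((x) => typeof x === "number" && Number.isFinite(x))) {
      throw new Error(`query vectors: ${id} has a non-numeric entry; ${REBUILD_HINT}`);
    }
    out[id] = vec as number[];
  }
  return { version: EXPECTED_VERSION, dimensions, vectors: out };
}
```

Checking every vector up front costs one pass over the file and keeps the failure next to its cause, which is the point of the loud-failure standard the commit cites.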
…n output Generalizes the top-slot check from route cases to every non-refusal case with an expected source: judgeRetrieval only checks top-K membership, so a quantization flip that keeps both Smiths retrieved but swaps which ranks first would pass keyless and leave the corpus's marquee verdict (disambiguation) protected only by the keyed --full pass. Now the expected source must WIN the top slot, not merely appear: the right Smith must outrank the wrong one, the private note must outrank the records. New test proves a partial-mode Smith-vs-Smith int4 flip is caught keyless (40 tests). scaling:run now states plainly what it is running (encoding: int8 shipped vs int4 tightened; corpus: natural vs +spire; keyless) and prints a verdict line (CERTIFIED / REJECTED) so a reader knows what the result means. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
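The difference between top-K membership and winning the top slot can be shown in a few lines. This is a sketch under stated assumptions, not the harness code: `Hit`, `inTopK`, `winsTopSlot`, the source ids, and the scores are all invented for illustration.

```typescript
// Hypothetical sketch: membership in the top K versus winning the top slot.
interface Hit {
  sourceId: string;
  score: number;
}

function rankHits(hits: readonly Hit[]): Hit[] {
  return [...hits].sort((a, b) => b.score - a.score); // highest score first
}

// Membership: the expected source appears somewhere in the top K.
function inTopK(hits: readonly Hit[], expected: string, k: number): boolean {
  return rankHits(hits).slice(0, k).some((h) => h.sourceId === expected);
}

// Top slot: the expected source must outrank everything else.
function winsTopSlot(hits: readonly Hit[], expected: string): boolean {
  return rankHits(hits)[0]?.sourceId === expected;
}

// Made-up scores for a near-tie between the two Smiths.
const int8Hits: Hit[] = [
  { sourceId: "george-adam-smith", score: 0.62 },
  { sourceId: "adam-smith", score: 0.61 },
];
const int4Hits: Hit[] = [
  { sourceId: "george-adam-smith", score: 0.6 },
  { sourceId: "adam-smith", score: 0.63 },
];

const expected = "george-adam-smith";
console.log(inTopK(int4Hits, expected, 2));  // true: membership still passes
console.log(winsTopSlot(int8Hits, expected)); // true: int8 keeps the order
console.log(winsTopSlot(int4Hits, expected)); // false: the flip is caught
```

With these invented scores, a membership-only judge would pass both encodings, while the top-slot check separates them.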
Stops the README opening from asserting the real-corpus caught failure as established fact. It now leads with the mechanism proven offline in quantize.test.ts (fixture vectors searched to exhibit the near-tie) and marks the real-Smith-corpus demonstration as pending the build run — the same "don't claim runnable before it runs" rule applied to NEXT-STEPS, now applied to the README itself. Fixes the int4 command to --natural+synthetic --bits 4 (the spire only loads under +synthetic) and notes scaling:run errors until built. Notes the keyless top-slot check now protects disambiguation too. Review nits: synthetic note's traveling title is now unambiguously synthetic (the label is the A1 leak surface); "no synthetic type field" reworded to match the synthetic:true marker the file carries; "the real Adam Smith" -> "either real Smith"; the spec and delta log are noted in the README as kept-in-the-open on purpose; delta row 12 updated for the disambiguation extension. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
The module read like a package/subsystem; it is a demo script. Renames the top-level dir scaling/ -> demo/, the npm scripts to demo:build / demo:run / demo:test, scaling.config.ts -> demo/config.ts, and every path/script reference in the module and the operational docs (build-handoff, delta log). Prose like "the int8 scaling demo" is left as description. The historical spec and corpus draft keep the original scaling/ name as the proposal; delta-log row 16 bridges it. tsconfig + the npm test glob updated; 40 tests pass, typecheck clean, demo:run degrades cleanly with the new paths. https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
What this is
A thin scaling/ module that makes the paper's §6 claim runnable: the same gold suite that owns grounding and refusal also adjudicates the int8 cost lever. It quantizes the embedding index to int8, re-ranks, and runs the full gold suite (including must-refuse and must-route) against the quantized index.

The result it is built to produce is a caught failure: at --bits 4 a route case flips (the private note loses the top slot to a public record) and the gold suite catches it. "int8 held" alone proves little; the gate saying no when pushed is the point.

What is committed vs pending a build run
This session's egress allowed only GitHub (api.openai.com, Gutenberg, archive.org all returned host_not_allowed) and no OPENAI_API_KEY was set. So:

index.json, query-vectors.json). Exact steps in docs/scaling-demo/build-handoff.md. Nothing is faked; the demo halts honestly at the network boundary.

Budget rule: held, no halt
The int8 path is a quantize wrapper plus a re-rank. It reuses src/retrieve.ts (retrieve/cosine), the no-leak boundary, the gold judges, src/store.ts, and the corpus loaders untouched: no second pipeline, no core type changes. quantize.ts is the public twin of the production vector-quant.ts.

Two design points worth a look (both in the delta log):
query-vectors.json + a thin runner solve this; the eval CLI is not reused verbatim.
judgeRetrieval checks presence only, so a top-slot route flip where the note stays retrieved would slip through; the harness adds a keyless route-selection check to catch it.

What to review
scaling/: the module (README, config, quantize.ts, harness.ts, run.ts, build.ts, gold, corpus manifest, the quarantined spire).
scaling/quantize.test.ts: the mechanism + the deliberate failure, offline.
docs/scaling-demo/scaling-demo-delta-log.md: what's settled vs pending, the divergences, and the prepared NEXT-STEPS C-intro/C1 reconciliation (held until scaling:run confirms the headline, so "runnable" is verified not asserted).
docs/scaling-demo/build-handoff.md: the brief for the local agent.

Test plan
npm test → 36 pass (25 core + 11 scaling); npm run typecheck clean.
npm run scaling:run degrades cleanly with a pointer to the handoff when vectors aren't built.
build-handoff.md to generate vectors, confirm the int8 headline, fire the int4 break, then apply the NEXT-STEPS reconciliation.

Base / scope note
main is at v1.2.0, so this branch is 58 commits ahead (the prior v1.3–v1.5 work plus these 5 scaling commits). The scaling contribution is the scaling/ directory and the two docs/scaling-demo/ files above (plus tsconfig.json / package.json wiring); the rest of the diff is the accumulated work main has not yet received. Retarget the base if you want a scaling-only diff.

https://claude.ai/code/session_01EhtDe3ZQnv6vx2qSLy8qQ1
Generated by Claude Code