textbook grounding for course generation#13
Conversation
toc.py + ingest_md.py + 38 tests + labeled fixture; markdown-it-py dep.
PyMuPDF-backed ingester that parses textbook PDFs into the shared textbook IR (Chapter / Section / Paragraph). Handles both whole-book PDFs and one-chapter-per-file directories. - font-size-gated heading detection with pattern + size tiers - split-number/title and wrapped-title heading merge passes - back-matter suppression and size-aware header/footer filtering - parser_quality scoring - shared chapter-builder reused from the markdown path (ingest_md now propagates real page numbers instead of hardcoding 0) Tests: labeled mini-PDF fixture plus skip-if-absent smoke tests against real eval PDFs.
When the new --use-textbook PATH flag is set, the system loads a PDF or markdown textbook, retrieves relevant passages per chapter, injects them into the writing prompts, and emits inline citation tokens like [han_data_mining_3e:ch6.s3:p15] in slides, scripts, and assessments. A new verifier inside evaluate.py scores each citation against the source on a 1-5 faithfulness scale. Three interfaces, all opt-in: CLI flag, API field on POST /api/course/generate, and a file-picker on the Web UI. When the flag is absent every code path falls back to the existing vanilla behavior -- byte-comparable output, zero citation tokens. Measured on two real textbooks (Han Data Mining 3rd ed., Agentic Design Patterns): - Agentic: faithfulness 4.33/5, citation precision 86.7 %, attribution rubric +65 % vs vanilla on slide content - Han: faithfulness 3.87/5, precision 71 %, attribution +111 % - Overall content quality unchanged within LLM-judge noise on both - Vanilla preservation invariant holds end-to-end New code: src/grounding/ (knowledge_base, retriever, contract, reranker), src/textbook/ (schema, ingest_pdf, ingest_md), GroundingAgent in evaluate.py, --use-textbook plumbed through run.py + ADDIE + SlidesDeliberation, textbook_path field + /api/textbooks/upload + /api/textbooks/list in api_server.py, file picker in frontend. 280 tests passing.
05a6deb to
de531da
Compare
Before this change, the syllabus stage of the course generator was
textbook-blind: it guessed at course structure from the topic title
alone, and the per-chapter retrieval downstream had to paper over the
mismatch. This shows up dramatically on narrow textbooks where the
topic-title prior doesn't match the source -- the generated syllabus
references invented "Course Reader" articles instead of actual textbook
chapters.
The fix is small and orthogonal. When --use-textbook is set, the
textbook's table of contents (chapter + section list, ~400 word budget
with graceful degradation) is rendered once and prepended to every
foundation deliberation prompt (instructional_goals, learner_analysis,
prereq_analysis, content_sequencing, assessment_design, syllabus). The
agents now see the actual source before deciding what the course is
about. The syllabus's "Required Readings" block lists real textbook
chapter numbers instead of inventing article titles.
Vanilla path is byte-identical -- the TOC injection is gated on
self.knowledge_base is not None, and Deliberation.run() defaults
textbook_context=None. When no textbook is in play the prompt assembly
is unchanged from before. The copilot retry path threads the same TOC
through _check_for_retry so first-call and retry don't drift.
Measured impact (sequential matrix, seed=42, gpt-4o-mini judge):
Agentic AI
faithfulness 4.33 -> 4.41 (+0.08)
citation precision 86.7% -> 91.13% (+4.4 pp)
retrieval_bad 4.9% -> 2.0% (cut 60%)
hallucination 4.5% -> 2.0% (cut 55%)
attribution lift on slide_content vs vanilla: +125% (1.33 -> 3.00)
Data Mining (Han)
faithfulness 3.87 -> 3.86 (flat)
citation precision 71.0% -> 70.26% (flat)
attribution lift on slide_content vs vanilla: +75% (1.33 -> 2.33)
Han is essentially flat because Han's bottleneck isn't syllabus
alignment -- it's PDF extraction quality on math-heavy chapters. The
residual 18.5% retrieval_bad slice is the cross-encoder reranker's
target in the follow-up.
Both grounded cells stay within +/-0.3 LLM-judge noise on overall_score
vs their vanilla baselines, confirming the architectural fix does not
measurably degrade overall content quality.
Files:
src/textbook/schema.py - Textbook.toc(word_budget=400) method
src/grounding/knowledge_base.py - thin TextbookKnowledgeBase.toc() pass-through
src/agents.py - Deliberation.run() accepts textbook_context kwarg
src/ADDIE.py - ADDIERunner._textbook_toc_context() helper,
threaded through run_foundation_deliberations()
and _check_for_retry() retry path
Tests (16 new, 296 total passing):
tests/test_textbook_toc.py
tests/test_foundation_deliberation_toc_injection.py
The opt-in invariant is locked by TestDeliberationOptInInvariant
(passing textbook_context=None must produce a byte-identical prompt
to the pre-change behavior) and TestAddieRunnerTocHelper (the helper
returns None when no knowledge base is attached).
|
The results convincingly demonstrate increased attribution coverage and textbook grounding. However, the evaluation does not yet demonstrate strong citation reliability: citation precision remains around 72–80% on larger benchmarks, suggesting that a substantial fraction of generated citations are still unsupported or incorrectly retrieved. The PR appears to solve the coverage problem more than the accuracy problem. |
Bumps EMBED_MODEL from text-embedding-3-small (1536-dim) to text-embedding-3-large (3072-dim) for stronger disambiguation between semantically similar chunks. Disk cache is keyed on the model name so existing small caches don't collide with large re-embeds. Vanilla path untouched.
Adds a generic pollution filter to Textbook.format_toc_outline that drops front-matter (preface, foreword, acknowledgments), back-matter (appendices, glossary, bibliography, index), and short boilerplate chapters (< 5 paragraphs across all sections) before the TOC is injected into foundation deliberations. Title-pattern matching is case-insensitive and anchored at start so "Preface" matches but "Chapter 1: Introduction to Preprocessing" does not. No per-textbook rules. Falls back to the unfiltered chapter list if filtering would leave the TOC empty.
Two grounded-path additions to ADDIERunner. Both no-op when textbook grounding is off. 1. Cross-encoder reranker: when knowledge_base is set, construct a CrossEncoderReranker on the default ms-marco-MiniLM-L-6-v2 model and attach it to the HybridRetriever for second-stage scoring on top-K RRF candidates. Falls back to a non-reranking retriever if the optional sentence-transformers dep or model load fails. 2. Admin scaffolding: after foundation deliberations finish but before chapter extraction, append a "Course Policies" section (instructor contact, grading, attendance, accessibility, academic integrity) to the syllabus output file via a separate LLM call that reads the produced syllabus and appends. Idempotent across --resume via a .pre_admin_scaffolding.bak sentinel sibling file. Recovers rubric metrics (transparency_of_policies, accessibility) that competed for prompt budget with the grounding directive in the syllabus deliberation. New test file tests/test_addie_grounding_runtime.py covers both paths and the vanilla no-op guarantee.
GroundingAgent gains an n_samples parameter (default 1, preserves existing behavior). When n_samples > 1, each (claim, chunk) pair is scored multiple times and the aggregate is taken: median of the numeric scores, majority-vote of the failure_mode bucket, and the rationale from the sample whose score is closest to the median. New --verifier-samples N flag on evaluate.py threads the value through. Trades N× verifier API spend for a tighter judge-noise floor (the ±0.16 per-call variance is the dominant source of small per-cell metric drift across re-evals). 8 new tests under TestSelfConsistencyVoting cover the n_samples == 1 passthrough invariant, majority-vote tie-breaking, median-score selection, and fallback-sample exclusion from the vote.
Two changes to make the reranker robust against environment drift: 1. Defensive construction in ADDIERunner: wrap the CrossEncoderReranker() call in try/except. On any failure (missing optional dep, model-load error, ABI mismatch), log a warning and fall back to first-stage retrieval (BM25 + dense + RRF) with reranker=None. The rest of the grounding pipeline is unaffected. 2. Floor + major-version upper-bound pins in requirements.txt for sentence-transformers, torch, and transformers. Locks out the next major (e.g. torch 3.x) which may ship breaking ABI changes while keeping the floor at versions verified to work with Python 3.13 and the default cross-encoder model. Together: even if a future pip install pulls in an incompatible combination, the reranker degrades quietly and grounded generation keeps running.
The chapter-extraction LLM was using textbook chapter numbers from "Readings: Chapter X.Y" references as the COURSE chapter numbers, producing duplicate labels like: Chapter 1: Introduction to Data Mining (was week 1) Chapter 1: Understanding Your Data (was week 2) Chapter 2: Data Preprocessing Overview (was week 3) Chapter 2: Data Cleaning Techniques (was week 4) ... when the syllabus's actual schedule used "Week 1: ...", "Week 2: ..." with textbook readings as a separate field. Latent for a while — the prompt's example showed "title": "Chapter 1: Introduction to Machine Learning" with no instruction to preserve the syllabus's own numbering. Surfaced on grounded runs where the syllabus contains many textbook chapter references; the LLM mimics the example and inherits the chapter numbers. Fix: rewrite the prompt to (1) use a "Week 1: ..." example, (2) explicitly require the LLM to preserve the syllabus's own week/chapter numbering, (3) explicitly warn against using textbook chapter numbers from readings as course chapter numbers. 3 new regression tests in tests/test_syllabus_processor_prompt.py assert each property of the updated prompt.
I do agree this PR solves coverage more than accuracy problem, as a next step I am working on that ! |
Pure-Python module that classifies each page of a PDF as 'prose' or 'complex' based on PyMuPDF object metadata (image count + vector drawing count). Pages flagged 'complex' are candidates for VLM-based extraction; pages flagged 'prose' use the standard text-extraction path. Routing is cheap — inspects PDF object metadata, not text content — so it can be applied to every page before any expensive extraction runs. Default thresholds (any image OR drawings > 40) are empirically generic across textbooks: produce 21.4 % complex on Han Data Mining, 13.3 % on Agentic Design Patterns. No per-source tuning. 10 unit tests cover threshold boundaries, image-triggered complex classification, custom thresholds, and aggregation helpers. First module of the hybrid extraction pipeline. Subsequent modules (VLM adapter, hybrid ingester, cross-page stitching) build on this routing layer.
New module ingest_pdf_paged.py with ingest_pdf_file_paged() and ingest_pdf_directory_paged() that use pymupdf4llm.to_markdown(..., page_chunks=True) to get one markdown chunk per source page, then build the Textbook IR with REAL per-paragraph page numbers (the synthetic word-count-based pagination used elsewhere is bypassed). PyMuPDF4LLM as the markdown workhorse cleanly handles headings, tables, and code blocks that plain-text PyMuPDF flattens. Per-page extraction prevents the chunk-coarseness regression that the prior page_chunks=False attempt produced. Falls back to the plain-text ingester when pymupdf4llm is absent or produces an unparseable result. 9 unit tests cover per-page page- number tagging, the seen_chapter cross-page state, page-span aggregation, and the fallback path. First half of the hybrid extraction stack. Subsequent module (vlm_adapter) will augment complex pages with structured figure / equation descriptions before the chapter builder runs.
Renders a PDF page to PNG, sends to GPT-4o-mini vision with OpenAI Structured Outputs schema, returns parsed components (figures with descriptions, equations as LaTeX, tables with headers/rows, algorithms with steps). Pydantic models with discriminated unions (FigureComponent, EquationComponent, TableComponent, AlgorithmComponent) map cleanly to the API's response_format requirement and validate at the API boundary. Cropped page PNGs are saved to disk when figures_dir is configured — the downstream slide generator can reference them via includegraphics so the actual figures from the source PDF appear in the final materials, not just textual descriptions. Lazy OpenAI client construction so the extractor can be built without an API key (useful for tests). Every failure mode is defensive: render errors, network errors, schema rejection — all return an empty ExtractedPage with a logged warning rather than raising. The hybrid ingester (next phase) treats empty extractions as "use PyMuPDF4LLM output only" so a VLM outage cannot break a run. Smoke-tested on Han page 476 (OPTICS terminology): VLM correctly identified Figure 10.16 with caption, generated a concrete visual description, and extracted the Euclidean distance equation as LaTeX. 14 unit tests cover schema validation, lazy client construction, PNG-save behavior, and the defensive error paths.
New module ingest_pdf_hybrid.py combines three pieces: - spatial_router classifies each page as prose or complex - ingest_pdf_paged extracts clean markdown for every page (workhorse) - vlm_adapter augments complex pages with structured figure/equation/ table/algorithm content from GPT-4o-mini vision VLM-derived components flow into the Textbook IR as additional Paragraphs (kind=figure_cap, equation, or example) with the structured information encoded inline via parseable markers ([IMAGE_PATH: ...], [LATEX: ...], [TABLE: ...], [ALGORITHM_STEPS: ...]) for the downstream slide generator to consume in the next phase. Wired into knowledge_base.from_path via an opt-in vlm_extractor kwarg, into ADDIE.__init__ via a vlm_extraction bool kwarg, and exposed on the CLI as --vlm-extraction. All paths gated on vlm_extraction=True AND textbook_path being set — vanilla behavior is byte-identical without the flag. ADDIE is defensive: if VLM construction fails the run falls back to text-only extraction with a logged warning rather than crashing. Cropped page PNGs saved to .grounding_cache/figures/<textbook_id>/ so the slide generator can include them via \includegraphics in final materials. 13 unit tests cover the block-formatting helpers for each component type, the inline-marker output format, the vanilla preservation invariant (no extractor → delegates to paged ingester), and end-to-end IR construction with a mocked VLM. Full suite: 393 passing.
When the evidence excerpts contain inline markers from the hybrid
ingester's VLM augmentation ([IMAGE_PATH:], [LATEX:], [TABLE:],
[ALGORITHM_STEPS:], [DESCRIPTION:], [INSIGHT:]), _build_evidence_block
now appends an extra VISUAL CONTENT RULES block instructing the LLM
how to consume each marker for the artifact at hand.
Artifact-conditioned rules:
- slide / assessment: include figures via includegraphics, render
equations via display math, render tables as LaTeX tabular, render
algorithms as enumerated lists.
- script: narrate the figure / table verbally using the adjacent
description and insight markers; do NOT read raw LaTeX aloud.
Critically, when no markers are present (vanilla and v2 evidence
text), _build_visual_content_rules returns an empty string and the
evidence block is byte-identical to the prior behavior. Vanilla
preservation invariant holds.
This closes the OTHER half of the v3 contribution: without these
prompt-side rules, the LLM would happily ignore the VLM-extracted
figures and equations and write text-only slides as before. With
them, the LLM is steered to reproduce the source figures
(\includegraphics), equations (\\[ ... \\]), tables (\\begin{tabular}),
and algorithms (\\begin{enumerate}) in the final materials.
12 unit tests cover: empty rule block on plain evidence, each marker
triggering its corresponding rule line for slides, script artifacts
getting narration-flavored rules instead of LaTeX-emission ones,
multi-marker evidence surfacing all relevant rules, and end-to-end
integration via _build_evidence_block with a mocked retriever. Full
suite: 405 passing.
When the hybrid extraction path routes some pages through a VLM, the parsed Textbook IR depends on what the VLM returns. The VLM is not strictly deterministic across runs (OpenAI seed is best-effort, and even at temperature=0 small variations occur). Without caching, the chunks built at generation time would NOT match the chunks built at verification time — citation tokens emitted during generation would fail to resolve during eval. Two changes that together produce deterministic ingestion: 1. New src/grounding/ir_cache.py — saves the parsed Textbook IR to disk as JSON after first ingestion, loads on subsequent calls. Cache lives under .grounding_cache/ir/<textbook_id>.json. Cache invalidation is manual (delete the file to force re-ingestion). Graceful fall-through on missing / corrupt / schema-invalid cache. 2. VlmExtractor pins temperature=0 and seed=42 on every API call so the VLM output itself is as deterministic as the API allows. TextbookKnowledgeBase.from_path gains two kwargs (ir_cache_dir, use_ir_cache=True) that default to cache-enabled at <project>/.grounding_cache/. The verifier and the generator share the same cache by default → tokens emitted at gen time resolve cleanly at eval time. 10 new unit tests cover save/load round-trip, parent-dir creation, cache miss returning None, corrupt-JSON returning None, schema- invalid-JSON returning None, end-to-end from_path cache hit short-circuiting the ingester, and use_ir_cache=False bypassing the cache entirely. Full suite: 415 passing.
The hybrid ingester puts each VLM-extracted figure / equation / table / algorithm into its own Paragraph carrying inline markers ([IMAGE_PATH:, [LATEX:, [TABLE:, [ALGORITHM_STEPS:). Until now the chunker bundled those paragraphs with up to ~500 tokens of surrounding prose into a single chunk — queries about a figure or equation would rank the chunk based mostly on the prose content, leaving the visual element effectively invisible to retrieval. This commit makes the chunker recognize paragraphs carrying any of the four visual markers and emit each as its own standalone chunk (typically 30-150 tokens). Prose paragraphs continue to be packed greedily up to TARGET_TOKENS as before; the only structural change is that visual paragraphs interrupt the prose stream — a prose chunk ends at the boundary, the visual chunk fires, then a new prose chunk starts after the visual paragraph. Overlap between adjacent prose chunks no longer crosses a visual paragraph, preventing figure / equation / table content from bleeding into the next prose chunk's overlap window. For visual queries this should rank the right chunk directly instead of burying it in a 500-token prose chunk — the chief lever for closing the silent-skip gap on figure-heavy pages. 6 new unit tests cover: figure paragraph becoming its own chunk, equation paragraph becoming its own chunk, table paragraph becoming its own chunk, three consecutive visual paragraphs each getting their own chunk, pure-prose sections behaving exactly as before, and prose-chunk overlap correctly stopping at visual paragraph boundaries. Full suite: 421 passing.
The chunker emits OVERLAP_TOKENS of overlap between adjacent prose chunks so adjacent chunks share context. As a side effect, the retriever can rank two neighboring overlapping chunks both in the top-K — the LLM then sees the SAME content twice in the evidence block and occasionally cites the wrong instance, which the verifier flags as wrong_chunk_cited / loose_paraphrase. New _dedupe_results helper in src/slides.py drops later occurrences whose chunk is byte-identical to a kept earlier chunk OR whose first 40 words match a kept chunk's first 40 words (catches the common overlap case where chunk N+1 starts with the last ~64 tokens of chunk N). Applied before evidence-block formatting; preserves the retriever's rank order — first occurrence of each cluster wins. 8 new unit tests in tests/test_evidence_dedupe.py cover empty input, unique chunks all kept, byte-identical dedup, overlap-prefix dedup, mid-content overlap NOT triggering dedup (different starts must be kept), rank-order preservation, empty-text chunks handled, and the short-chunk fall-through to full-text equality. Full suite: 429 passing.
…nces, empty bullets, image sizing
Visual review of the generated PPTX surfaced six writer-side or
converter-side rendering bugs that look ugly on the rendered slide
but were invisible to the LaTeX-only path. Each is small but
visible to the reader; together they were ~3-4 cosmetic defects on
most content slides.
Converter (src/latex_to_pptx.py + src/build_pptx.js):
* Backtick / apostrophe LaTeX quotes ``…'' and `…' now convert to
straight ASCII quotes "…" and '…' in unescape_latex(). Beamer
writers emit these as a paired quote convention; the PPTX path
was displaying the raw delimiters.
* Markdown bold / italic that the writer sometimes emits even
inside .tex output is now stripped in strip_latex_formatting():
**bold** / __bold__ / *italic* → plain text. LaTeX itself would
ignore the asterisks, so they leaked unchanged.
* Bare $…$ math fences without a real math environment now collapse
to just the inner content. The writer used $\geq 30$ as shorthand
for "≥ 30"; without a math renderer the PPTX path showed
"$\geq 30$" verbatim.
* Empty \item entries that produce a lone bullet on the slide are
now dropped. Filter checks for whitespace-only and
punctuation-only items and skips them so the slide doesn't render
a hanging dot.
* The catch-all unknown-command stripper now also consumes optional
[opt] arguments before any {arg} group. Previously
\includegraphics[width=...]{path} that fell through to the
catch-all left "[width=...]{path}" as visible text on the slide.
* \includegraphics image rendering box now respects the slide's
available height. Caps at 3.2" or the remaining vertical space
minus a small buffer, whichever is smaller, so figures never
bleed past the slide edge.
Generation-side cleanup (src/slides.py):
* Markdown **bold** that survives into the writer's final .tex
output now becomes \textbf{...} during the cleanup pass. Prevents
the same bug from reappearing on future runs even if downstream
consumers aren't running the latest converter.
* A defensive fallback regex catches UNCLOSED [DESCRIPTION:/
[INSIGHT:/[IMAGE_PATH:/etc. markers. The strict regex requires
the closing ']', but writers occasionally drop it, leaving the
marker (and its prose) visible. Fallback strips from the opening
marker up to the next "\ or newline.
Test coverage: 8 new tests for backtick quote conversion / markdown
stripping / bare math fences / empty item filtering, 3 new tests
for the upstream markdown-bold-in-tex fix. 65 polish-related tests
total, full suite stays green. Vanilla preservation invariant is
unaffected — none of these touch the writer's prompt or the
grounded retrieval path.
- parser: replace non-greedy itemize/enumerate match with a depth-aware
helper so the outer environment doesn't get cut at the inner
\end{itemize}. Previously nested bullets leaked into the parent
item as raw text and rendered as phantom blank bullets.
- pptx renderer: pre-fit \includegraphics by reading the PNG header
and computing aspect-preserved box dimensions, instead of relying
on pptxgenjs sizing:"contain" (which LibreOffice's renderer does
not always honour). Wide figures now centre cleanly inside the
content column without bleeding off the slide.
- tests: three new regression cases covering nested itemize,
enumerate-inside-itemize, and sibling itemize blocks.
Stacking text and bullets above a figure was squeezing every \includegraphics into <3" of vertical space on a 13" slide. The cropped figures from the PyMuPDF crop pass deserved more room. - new _stackElements helper centralises image-first ordering and a trailingH estimate for each child. Every standard layout (renderStandard, renderListOnly, renderBlocks, renderCodeSlide) now goes through it. - addPicture takes a trailingH so it can reserve vertical space for bullets/captions rendered after the image; the figure's box height is L.maxY - y - buffer - trailingH, capped at 4.5".
|
Thanks for adding me as a reviewer — I pulled the branch down, read through the grounding/ingestion code, and ran the suite locally (267 passing on my machine; the rest skip without the real textbooks). Nice work overall — the ingestion/retrieval split is clean and the opt-in gating is solid. A few substantive notes: 1. PDF extraction is the main quality limiter — worth supporting EPUBThe writeup already traces Han's elevated The groundwork is mostly there: 2. Citation tokens can collide in the verifier index (may skew the headline faithfulness numbers)In self._chunk_by_token = {c.citation_token(): c for c in knowledge_base.chunks}But Since the faithfulness/precision figures are the main result, it'd be worth confirming this isn't shifting them. Disambiguating the token (include the chunk index, or map token → list of chunks and pick the best match) would close the gap. 3.
|
Thanks for the careful read. Addressing your points :
|
The grounded path (textbook ingestion, retrieval, semantic gates, optional reranker) only fires when --use-textbook is passed to run.py or evaluate.py. Its dependencies were previously declared as base deps, so every install pulled in pymupdf, sentence-transformers, torch, transformers, rank-bm25, and markdown-it-py (~400 MB of torch + transitives) even when the user never planned to use them. - pyproject.toml: new [grounding] extras group covering all six deps. Install with: pip install -e ".[grounding]" - requirements.txt: same deps grouped under a clearly marked "Optional: grounding" section, preserved at the bottom so `pip install -r requirements.txt` keeps every supported feature working (backward compatibility). - README: install section documents the three paths (vanilla / grounding extras / kitchen-sink requirements.txt). Verified: - A simulated vanilla install (blocking all six grounding imports at module load) successfully imports run, api_server, evaluate, and src.ADDIE.ADDIE — vanilla code path doesn't touch the grounding deps. - pip installable: importlib.metadata exposes both 'vector-db' and 'grounding' as extras. - Full pytest suite still 693 passed.
The semantic gates (bi-encoder cosine filter) and the cross-encoder
reranker were the only consumers of sentence-transformers + torch in
the runtime path. Both now load their respective MiniLM models through
fastembed, which runs the same ONNX-exported weights via onnxruntime
— no torch dependency.
Numerical scores are unchanged (verified on a fixture):
- Semantic gate cosine on "K-means clustering partitions data" vs
"K-means is a centroid-based clustering algorithm":
sentence-transformers: 0.692239
fastembed: 0.692239
- Cross-encoder reranker scores on the K-means probe set:
sentence-transformers: +8.0968, -11.3451, -1.3409
fastembed: +8.0968, -11.3451, -1.3409 (delta ≤ 1e-6)
Why both backends agree to 6 decimals: fastembed loads from the same
HuggingFace repo (Xenova hosts the ONNX export of the original
cross-encoder repo), so the weights are bit-identical. The two
runtimes round their fused-op outputs slightly differently but the
delta stays well below the gate thresholds (0.32 / 0.30) and the
ranking-only contract of the reranker.
Install footprint dropped:
- vanilla: unchanged at ~50 MB (grounding deps are gated behind
the [grounding] extras group; no run-time imports)
- grounded: ~100 MB (was ~550 MB with torch + sentence-transformers
+ transformers)
Edits:
- src/grounding/semantic_gate.py: _ensure_encoder() now imports
fastembed.TextEmbedding; _embed() materialises the iterator and
L2-normalises (fastembed returns raw vectors, unlike
sentence-transformers' normalize_embeddings=True).
- src/grounding/reranker.py: CrossEncoderReranker uses
fastembed.rerank.cross_encoder.TextCrossEncoder. The model name
switched from cross-encoder/ms-marco-MiniLM-L-6-v2 to
Xenova/ms-marco-MiniLM-L-6-v2 (same weights, ONNX-exported).
Score path went from .predict([(q,p),…]) to .rerank(q, [p,…]).
- pyproject.toml + requirements.txt: [grounding] extras now lists
fastembed in place of sentence-transformers + torch + transformers.
- README install section updated to reflect the lighter footprint.
- tests/test_semantic_gate.py: stub encoder now mirrors fastembed's
.embed(iterable) generator-of-arrays contract instead of the old
sentence-transformers .encode(text, normalize_embeddings=True).
- tests/test_grounding_reranker.py: lazy-load assertion now checks
that fastembed/onnxruntime stay out of sys.modules until .score()
is called (was: torch / sentence_transformers).
Verified: 693 passed, same count as before.
Code shipped upstream should not reference internal iteration labels
("v6 Lever A", "v7 Step 9", "v3 hybrid extraction") that aren't
defined anywhere in the codebase or its docs. Each affected comment
has been rewritten to describe what the code does without the
versioning prefix; nothing about runtime behaviour changes.
Affected docstrings and comments:
- src/grounding/semantic_gate.py: module docstring + Gate A / Gate B
inline labels.
- src/grounding/reranker.py: header retained (it was already cleaned in
the prior commit).
- src/grounding/write_time_verifier.py: module docstring.
- src/grounding/usage_tracker.py: module docstring.
- src/grounding/contract.py: constant comments (SECTIONS_PER_TOPIC,
SUBTOPICS_PER_CHAPTER, smart-intro detection, meta-chapter abstain),
helper docstrings, build-loop comments.
- src/slides.py: constructor diversity-cap / gate / verifier comments,
evidence-block coverage diversification + Gate A + diversity cap +
visual-chunk-inclusion comments, visual-content rules block,
evidence-strip block (cleanup + Gate B + write-time verifier),
cross-chapter assessment context, per-slide grounding hooks, draft
picker docstring.
- src/ADDIE.py: lazy-construction comments for tracker, gates, verifier.
- src/textbook/ingest_pdf_paged.py: module docstring + page-flag
comment.
- src/textbook/vlm_adapter.py: VLM retry docstring.
- evaluate.py: ambiguous-token-rescue comments + docstring.
Verified: 693 passed (same count as before), no behaviour change.
Three small, backwards-compatible changes targeting the residual
failure modes after the v7 baseline (retrieval_bad 3.9-6.6%,
loose_paraphrase 2.8-6.4%, ambiguous-chunk picks). None add a single
LLM API call to the runtime path.
1. Sentence-bounded claim window (semantic gate + write-time verifier).
Both consumers previously walked backward with ``rfind()`` for
``". "`` / ``"! "`` / ``"? "`` / ``"\n"`` and took the trailing
25-30 words. That heuristic split on common abbreviations
(``e.g.``, ``i.e.``, ``Fig.``, ``Eq.``, ``Mr.``, etc.) and
produced truncated or mid-sentence claim windows. Both call sites
then scored / verified the wrong text.
New ``src/grounding/claim_window.py`` factors out a regex sentence
tokeniser with an abbreviation-suppression list. Both
``SemanticGate._extract_claim_window`` and
``WriteTimeVerifier._extract_claim_window`` now delegate to it.
16 new tests pin down the new behaviour.
Expected impact: tighter Gate B strip decisions and verifier
prompts, which should mostly reduce ``loose_paraphrase`` (the
second-largest residual failure mode on Agentic at 6.4%).
2. Cross-encoder reranker warmup at ADDIE init (src/ADDIE.py).
``CrossEncoderReranker()``'s constructor is intentionally lazy —
the ONNX model is downloaded and loaded on first ``.score()``
call, not at __init__. Before this commit, that meant init-time
try/except caught import errors only; an actual model-load
failure would silently fall back per-query, printing the same
warning hundreds of times in one run.
The init now does a tiny ``score("warmup", ["warmup"])`` call
inside the try block so the ONNX model is materialised once,
upfront. Failures get a single clear message and the reranker is
set to None for the rest of the run.
3. Stopword-filtered Jaccard in ambiguous-token rescue (evaluate.py).
``_resolve_best_chunk`` picks the best of look-alike chunks by
word-overlap with the claim. The earlier filter just dropped
tokens shorter than 3 characters. Common filler ("the", "of",
"in") still dominated the overlap, making the score brittle for
topically similar siblings.
We now reuse the lightweight stopword set the retriever already
maintains at ``src.grounding.retriever._STOP`` (lazy import,
defensive fallback to an empty frozenset if the import path
shifts). The dominant content words now drive the rescue
decision.
This change is eval-side only — it can't move which citations
the writer produces, only which sibling chunk the judge scores
them against. The largest expected effect is a small drop in
``wrong_chunk_cited`` reports on multi-chunk-spanning tokens
(the prior baseline showed ~75% of tokens were ambiguous).
Tests: 708 passed (was 693; 15 new tests across claim_window).
The full Han 3rd-edition PDF (740 pages) exposed a latent bug: a small
number of chunks emitted by the IR build exceeded OpenAI's 8192-token
per-input limit on the embedding-3 models, almost certainly long visual
captions or bibliography-style runs that the paragraph-aware chunker
left whole because they were already "complete paragraphs." Every
retrieval call on those chunks failed with a 400, the whole batch was
rejected, and the writer silently fell back to vanilla prompts. The
run then kept retrying for hours (~$8-10 of failed embedding API
spend) while emitting zero citations.
Three defensive layers, none of which trade off precision, recall, or
information completeness:
* Layer 1 — Sentence-aware chunk splitting at ingest. New
`MAX_CHUNK_CHARS = 24000` (≈6000 tokens; ~25% headroom under the
8192 ceiling). `_split_chunk_if_oversized` in knowledge_base.py
walks each emitted chunk; if the text exceeds the ceiling, it
splits on real sentence boundaries (reusing the regex +
abbreviation suppression list from `claim_window`) into sub-chunks
that inherit the parent's section, chapter, page span, and `kinds`.
Sub-chunks therefore EMIT THE SAME CITATION TOKEN as their siblings,
so the existing ambiguous-token rescue in evaluate.py picks the
best at score time — no downstream changes needed.
* Layer 2 — Embedder-level batch guard. `OpenAIEmbedder.embed()` now
checks each input's length before calling the API; oversized
inputs are sentence-split, embedded as multiple pieces, then
mean-pooled into one vector with L2 re-normalisation. The output
shape (one vector per input) is preserved so the retriever's
contract is unchanged. This catches anything Layer 1 missed
(manual chunk injection in tests, future ingester paths, etc.).
* Layer 3 — Fail-fast on consecutive retrieval errors. The
silent-fallback path in `_build_evidence_block` now tracks
consecutive failures of the same error class on the class object;
after 10 same-class failures in a row, raise a RuntimeError with
the last error attached. The counter resets on any successful
retrieval so transient blips (brief rate limits, network hiccups)
don't accumulate spuriously. Cost protection: writer + verifier
calls otherwise keep running for the whole course generation even
though no grounded evidence reaches the prompts.
Plus a one-line operator-facing diagnostic at IR-build time:
``[grounding] N sub-chunks emitted from oversized parent chunks
(max chunk size after split: M chars, ceiling: K).`` Surfaces the
edge case in the run log without forcing future operators to dig.
Public helper `split_into_sentences()` factored out of
`claim_window.py` so both the chunker and the embedder can reuse the
same boundary detector (regex + ~30-entry abbreviation suppression
list).
Tests: 14 new in tests/test_chunk_size_cap.py covering (a) Layer 1's
metadata inheritance, citation token stability, sub-chunk id
uniqueness, information preservation across split, and last-resort
hard-slice fallback when a single sentence exceeds the ceiling; (b)
Layer 2's output-shape invariance under mixed-size batches and
oversized-input mean-pooling via a mocked OpenAI client; (c) Layer 3's
threshold behaviour, error-class-aware counter, and counter reset on
a successful retrieval. Full suite: 722 passed (was 708; +14 new).
The broken IR cache for the full Han PDF at
.grounding_cache/ir/the_morgan_kaufmann_...2011.json was deleted so
the next ingest rebuilds with the corrected chunker.
The sentence-bounded claim window helped Agentic (+2.92 pp precision) but regressed the Han 6-chapter subset (-3.84 pp), with only ~7 % citation overlap between runs suggesting the divergence reaches far upstream. Keep the rfind heuristic in semantic_gate + write_time_verifier until the cross-textbook effect is isolated. The sentence-end regex stays in claim_window.py — the chunker and embedder size guard still use it.
Two independent bugs were blocking VLM-extracted figures from reaching the final slides.tex. After both fixes, a single-chapter clustering run on Han 3e (page 481-534) produced 2 real \includegraphics tokens pointing to verified PNGs (Figure 10.1, DBSCAN density-reachability) — the first iteration where any figure landed in the deck despite 129 PNGs sitting on disk. (1) Visual chunk hoisted to FRONT of injected results. The block-builder loop walks results in rank order with a fixed word budget; large prose chunks (~450 words each in cluster chapters) exhausted the 1800-word budget in four iterations, so the appended-at-tail visual chunk never reached evidence_text. Hoisting it to position 0 guarantees its IMAGE_PATH / LATEX / TABLE / ALGORITHM_STEPS marker survives into the prompt, which is what triggers the visual-content rule block. (2) Teaching Assistant prompt now lists \includegraphics and tabular as available LaTeX features alongside itemize / enumerate / block / lstlisting / equation. Adds an explicit PRESERVE-FIGURES instruction so \includegraphics commands from the Faculty's slide draft don't get silently dropped when the TA reformats. Without this, the TA had no permission to emit figures even when the Faculty drafted them. test_force_visual_chunk assertions updated from out[-1] to out[0] to match the new front-insert position.
The default PDF ingestion path used to call ingest_pdf_file (plain
PyMuPDF text extraction with no image handling) and image cropping was
only available via the opt-in --vlm-extraction flag, which rendered
whole pages as PNGs through gpt-4o vision.
Move the default path through ingest_pdf_file_paged with
pymupdf4llm.to_markdown(write_images=True). pymupdf4llm writes embedded
image XObjects from the PDF as tight cropped PNGs — the actual figure
region, not a full-page screenshot. The ingester walks the figures
directory after extraction, renames each saved file from pymupdf4llm's
{stem}.pdf-{page:04d}-{idx:02d}.png convention to
{textbook_id}_p{page:04d}_{idx:02d}.png, and emits one figure_cap
paragraph with an [IMAGE_PATH: ...] marker per image so the downstream
visual-content rules path still triggers.
Measured on Han 3rd ed (740 pages): 464 cropped figures landed in
80 s with no API spend. Figure 10.3 (K-Means iteration) is now a 32 KB
tight crop showing just the 3-panel diagram — the prior VLM pass
produced a 230 KB full-page render with running headers and body text
bleed.
The hybrid VLM path stays available via --vlm-extraction for runs that
want the gpt-4o DESCRIPTION / INSIGHT markers alongside images. The
test_ir_cache plumbing patch swap targets the new dispatch hook
(_ingest in knowledge_base) so it stays robust to future routing
changes.
Both code paths were tried, measured, and abandoned but their source plus tests stayed in the tree as documentation. The CLAUDE.md note covers the same ground without leaving the modules in place, so drop: * src/grounding/reranker.py — LLMReranker (137-line class plus the json / os imports that only it used). Production retrieval uses CrossEncoderReranker; the LLM variant measured no improvement (89.3 % vs 90.2 % precision) on its trial run. * src/slides.py — _generate_best_of_n_draft and the _decrement_tracker_for_text helper that rolls back loser citations. The multi-draft slide-best-pick path scored drafts by citation count which rewarded volume over quality and was disabled in favour of the semantic-gate stack. The DISABLED comment in _generate_slide_draft referenced a --enable-best-of-n flag that did not exist. * src/grounding/__init__.py — drop the LLMReranker re-export. * src/ADDIE.py — drop the comment paragraph about the removed reranker; CrossEncoderReranker construction is unchanged. * tests/test_multi_draft_best_pick.py — whole file (~146 lines). * tests/test_grounding_reranker.py — TestLLMReranker class plus the _mock_openai_client helper used only by it (~118 lines).
…topic-coherence guard
Bug: Incorrect
|
Good catch : I fixed it in 1304e80. _maybe_augment_syllabus_with_admin now passes [{"role": "user", "content": prompt}] instead of the bare string. You were right about the test, too: the existing grounded test asserted "…" in call_prompt against the string, so it silently encoded the bug. I've updated it to check the message list and added a dedicated assertion on the arg format. Thanks! |
Adds an optional
--use-textbook <pdf>mode. Instead of writing slides from the model's memory, it pulls relevant passages from a real textbook and writes each slide from that evidence. Without the flag nothing changes — output is identical to the original pipeline.What it does
What you get
evaluate.py --rigorous(supported claims ÷ total) that's easy to compare from run to run.Notes