Skip to content

textbook grounding for course generation#13

Draft
shrey-Bish wants to merge 57 commits into
DaRL-GenAI:mainfrom
shrey-Bish:feature/textbook-grounding
Draft

textbook grounding for course generation#13
shrey-Bish wants to merge 57 commits into
DaRL-GenAI:mainfrom
shrey-Bish:feature/textbook-grounding

Conversation

@shrey-Bish

@shrey-Bish shrey-Bish commented May 19, 2026

Copy link
Copy Markdown

Adds an optional --use-textbook <pdf> mode. Instead of writing slides from the model's memory, it pulls relevant passages from a real textbook and writes each slide from that evidence. Without the flag nothing changes — output is identical to the original pipeline.

What it does

  • Reads the PDF into chapters and sections, turns equation images into real LaTeX, and indexes everything for keyword + embedding search.
  • For each course chapter, works out which textbook sections it should draw from — and if nothing fits well, skips grounding for that chapter instead of making things up.
  • Gives the slide writer the matched passages and tells it to teach in its own words from them, keep worked examples and math, and not invent claims.
  • After each chapter, a checker reads the slides back against the evidence and logs how many claims were actually supported.

What you get

  • Decks that trace to the book, using the textbook's own figures and captions.
  • A simple Grounding Fidelity % in evaluate.py --rigorous (supported claims ÷ total) that's easy to compare from run to run.

Notes

  • Early drafts inserted inline citation tokens in the slides; we dropped those — we don't grade attribution, so the deck stays clean and grounding is checked by the fidelity score plus a page-by-page render review instead.
  • All grounding code is gated behind the flag, so the default (no-textbook) pipeline is untouched. ~750 tests cover it.

@shrey-Bish shrey-Bish marked this pull request as ready for review May 19, 2026 23:27
@shrey-Bish shrey-Bish marked this pull request as draft May 22, 2026 21:53
PyMuPDF-backed ingester that parses textbook PDFs into the shared
textbook IR (Chapter / Section / Paragraph). Handles both whole-book
PDFs and one-chapter-per-file directories.

- font-size-gated heading detection with pattern + size tiers
- split-number/title and wrapped-title heading merge passes
- back-matter suppression and size-aware header/footer filtering
- parser_quality scoring
- shared chapter-builder reused from the markdown path (ingest_md
  now propagates real page numbers instead of hardcoding 0)

Tests: labeled mini-PDF fixture plus skip-if-absent smoke tests
against real eval PDFs.
@shrey-Bish shrey-Bish changed the title Markdown textbook ingester textbook grounding for course generation May 28, 2026
When the new --use-textbook PATH flag is set, the system loads a PDF or
markdown textbook, retrieves relevant passages per chapter, injects them
into the writing prompts, and emits inline citation tokens like
[han_data_mining_3e:ch6.s3:p15] in slides, scripts, and assessments.
A new verifier inside evaluate.py scores each citation against the
source on a 1-5 faithfulness scale.

Three interfaces, all opt-in: CLI flag, API field on
POST /api/course/generate, and a file-picker on the Web UI. When the
flag is absent every code path falls back to the existing vanilla
behavior -- byte-comparable output, zero citation tokens.

Measured on two real textbooks (Han Data Mining 3rd ed., Agentic Design
Patterns):
- Agentic: faithfulness 4.33/5, citation precision 86.7 %,
  attribution rubric +65 % vs vanilla on slide content
- Han: faithfulness 3.87/5, precision 71 %, attribution +111 %
- Overall content quality unchanged within LLM-judge noise on both
- Vanilla preservation invariant holds end-to-end

New code: src/grounding/ (knowledge_base, retriever, contract, reranker),
src/textbook/ (schema, ingest_pdf, ingest_md), GroundingAgent in
evaluate.py, --use-textbook plumbed through run.py + ADDIE +
SlidesDeliberation, textbook_path field + /api/textbooks/upload +
/api/textbooks/list in api_server.py, file picker in frontend.

280 tests passing.
@shrey-Bish shrey-Bish force-pushed the feature/textbook-grounding branch from 05a6deb to de531da Compare May 28, 2026 22:12
Before this change, the syllabus stage of the course generator was
textbook-blind: it guessed at course structure from the topic title
alone, and the per-chapter retrieval downstream had to paper over the
mismatch. This shows up dramatically on narrow textbooks where the
topic-title prior doesn't match the source -- the generated syllabus
references invented "Course Reader" articles instead of actual textbook
chapters.

The fix is small and orthogonal. When --use-textbook is set, the
textbook's table of contents (chapter + section list, ~400 word budget
with graceful degradation) is rendered once and prepended to every
foundation deliberation prompt (instructional_goals, learner_analysis,
prereq_analysis, content_sequencing, assessment_design, syllabus). The
agents now see the actual source before deciding what the course is
about. The syllabus's "Required Readings" block lists real textbook
chapter numbers instead of inventing article titles.

Vanilla path is byte-identical -- the TOC injection is gated on
self.knowledge_base is not None, and Deliberation.run() defaults
textbook_context=None. When no textbook is in play the prompt assembly
is unchanged from before. The copilot retry path threads the same TOC
through _check_for_retry so first-call and retry don't drift.

Measured impact (sequential matrix, seed=42, gpt-4o-mini judge):

  Agentic AI
    faithfulness        4.33 -> 4.41  (+0.08)
    citation precision  86.7%  -> 91.13%  (+4.4 pp)
    retrieval_bad       4.9%   -> 2.0%    (cut 60%)
    hallucination       4.5%   -> 2.0%    (cut 55%)
    attribution lift on slide_content vs vanilla: +125% (1.33 -> 3.00)

  Data Mining (Han)
    faithfulness        3.87 -> 3.86  (flat)
    citation precision  71.0%  -> 70.26%  (flat)
    attribution lift on slide_content vs vanilla: +75% (1.33 -> 2.33)

Han is essentially flat because Han's bottleneck isn't syllabus
alignment -- it's PDF extraction quality on math-heavy chapters. The
residual 18.5% retrieval_bad slice is the cross-encoder reranker's
target in the follow-up.

Both grounded cells stay within +/-0.3 LLM-judge noise on overall_score
vs their vanilla baselines, confirming the architectural fix does not
measurably degrade overall content quality.

Files:
  src/textbook/schema.py        - Textbook.toc(word_budget=400) method
  src/grounding/knowledge_base.py - thin TextbookKnowledgeBase.toc() pass-through
  src/agents.py                 - Deliberation.run() accepts textbook_context kwarg
  src/ADDIE.py                  - ADDIERunner._textbook_toc_context() helper,
                                  threaded through run_foundation_deliberations()
                                  and _check_for_retry() retry path

Tests (16 new, 296 total passing):
  tests/test_textbook_toc.py
  tests/test_foundation_deliberation_toc_injection.py

The opt-in invariant is locked by TestDeliberationOptInInvariant
(passing textbook_context=None must produce a byte-identical prompt
to the pre-change behavior) and TestAddieRunnerTocHelper (the helper
returns None when no knowledge base is attached).
@shrey-Bish shrey-Bish marked this pull request as ready for review May 29, 2026 23:44
@wingsweihua wingsweihua requested a review from EroNinja June 3, 2026 17:58
@xander-Huang-ASU

Copy link
Copy Markdown
Collaborator

The results convincingly demonstrate increased attribution coverage and textbook grounding. However, the evaluation does not yet demonstrate strong citation reliability: citation precision remains around 72–80% on larger benchmarks, suggesting that a substantial fraction of generated citations are still unsupported or incorrectly retrieved. The PR appears to solve the coverage problem more than the accuracy problem.

Bumps EMBED_MODEL from text-embedding-3-small (1536-dim) to
text-embedding-3-large (3072-dim) for stronger disambiguation
between semantically similar chunks. Disk cache is keyed on the
model name so existing small caches don't collide with large
re-embeds. Vanilla path untouched.
Adds a generic pollution filter to Textbook.format_toc_outline that
drops front-matter (preface, foreword, acknowledgments), back-matter
(appendices, glossary, bibliography, index), and short boilerplate
chapters (< 5 paragraphs across all sections) before the TOC is
injected into foundation deliberations. Title-pattern matching is
case-insensitive and anchored at start so "Preface" matches but
"Chapter 1: Introduction to Preprocessing" does not. No per-textbook
rules. Falls back to the unfiltered chapter list if filtering would
leave the TOC empty.
Two grounded-path additions to ADDIERunner. Both no-op when textbook
grounding is off.

1. Cross-encoder reranker: when knowledge_base is set, construct a
   CrossEncoderReranker on the default ms-marco-MiniLM-L-6-v2 model
   and attach it to the HybridRetriever for second-stage scoring on
   top-K RRF candidates. Falls back to a non-reranking retriever if
   the optional sentence-transformers dep or model load fails.

2. Admin scaffolding: after foundation deliberations finish but
   before chapter extraction, append a "Course Policies" section
   (instructor contact, grading, attendance, accessibility, academic
   integrity) to the syllabus output file via a separate LLM call
   that reads the produced syllabus and appends. Idempotent across
   --resume via a .pre_admin_scaffolding.bak sentinel sibling file.
   Recovers rubric metrics (transparency_of_policies, accessibility)
   that competed for prompt budget with the grounding directive in
   the syllabus deliberation.

New test file tests/test_addie_grounding_runtime.py covers both
paths and the vanilla no-op guarantee.
GroundingAgent gains an n_samples parameter (default 1, preserves
existing behavior). When n_samples > 1, each (claim, chunk) pair is
scored multiple times and the aggregate is taken: median of the
numeric scores, majority-vote of the failure_mode bucket, and the
rationale from the sample whose score is closest to the median. New
--verifier-samples N flag on evaluate.py threads the value through.

Trades N× verifier API spend for a tighter judge-noise floor (the
±0.16 per-call variance is the dominant source of small per-cell
metric drift across re-evals).

8 new tests under TestSelfConsistencyVoting cover the n_samples == 1
passthrough invariant, majority-vote tie-breaking, median-score
selection, and fallback-sample exclusion from the vote.
Two changes to make the reranker robust against environment drift:

1. Defensive construction in ADDIERunner: wrap the
   CrossEncoderReranker() call in try/except. On any failure
   (missing optional dep, model-load error, ABI mismatch), log a
   warning and fall back to first-stage retrieval (BM25 + dense +
   RRF) with reranker=None. The rest of the grounding pipeline is
   unaffected.

2. Floor + major-version upper-bound pins in requirements.txt for
   sentence-transformers, torch, and transformers. Locks out the
   next major (e.g. torch 3.x) which may ship breaking ABI changes
   while keeping the floor at versions verified to work with
   Python 3.13 and the default cross-encoder model.

Together: even if a future pip install pulls in an incompatible
combination, the reranker degrades quietly and grounded generation
keeps running.
The chapter-extraction LLM was using textbook chapter numbers from
"Readings: Chapter X.Y" references as the COURSE chapter numbers,
producing duplicate labels like:
  Chapter 1: Introduction to Data Mining     (was week 1)
  Chapter 1: Understanding Your Data         (was week 2)
  Chapter 2: Data Preprocessing Overview     (was week 3)
  Chapter 2: Data Cleaning Techniques        (was week 4)
  ...
when the syllabus's actual schedule used "Week 1: ...", "Week 2: ..."
with textbook readings as a separate field.

Latent for a while — the prompt's example showed
  "title": "Chapter 1: Introduction to Machine Learning"
with no instruction to preserve the syllabus's own numbering. Surfaced
on grounded runs where the syllabus contains many textbook chapter
references; the LLM mimics the example and inherits the chapter numbers.

Fix: rewrite the prompt to (1) use a "Week 1: ..." example, (2)
explicitly require the LLM to preserve the syllabus's own week/chapter
numbering, (3) explicitly warn against using textbook chapter numbers
from readings as course chapter numbers.

3 new regression tests in tests/test_syllabus_processor_prompt.py
assert each property of the updated prompt.
@shrey-Bish

Copy link
Copy Markdown
Author

The results convincingly demonstrate increased attribution coverage and textbook grounding. However, the evaluation does not yet demonstrate strong citation reliability: citation precision remains around 72–80% on larger benchmarks, suggesting that a substantial fraction of generated citations are still unsupported or incorrectly retrieved. The PR appears to solve the coverage problem more than the accuracy problem.

I do agree this PR solves coverage more than accuracy problem, as a next step I am working on that !

Pure-Python module that classifies each page of a PDF as 'prose' or
'complex' based on PyMuPDF object metadata (image count + vector
drawing count). Pages flagged 'complex' are candidates for VLM-based
extraction; pages flagged 'prose' use the standard text-extraction
path.

Routing is cheap — inspects PDF object metadata, not text content —
so it can be applied to every page before any expensive extraction
runs. Default thresholds (any image OR drawings > 40) are empirically
generic across textbooks: produce 21.4 % complex on Han Data Mining,
13.3 % on Agentic Design Patterns. No per-source tuning.

10 unit tests cover threshold boundaries, image-triggered complex
classification, custom thresholds, and aggregation helpers.

First module of the hybrid extraction pipeline. Subsequent modules
(VLM adapter, hybrid ingester, cross-page stitching) build on this
routing layer.
New module ingest_pdf_paged.py with ingest_pdf_file_paged() and
ingest_pdf_directory_paged() that use pymupdf4llm.to_markdown(...,
page_chunks=True) to get one markdown chunk per source page, then
build the Textbook IR with REAL per-paragraph page numbers (the
synthetic word-count-based pagination used elsewhere is bypassed).

PyMuPDF4LLM as the markdown workhorse cleanly handles headings,
tables, and code blocks that plain-text PyMuPDF flattens. Per-page
extraction prevents the chunk-coarseness regression that the prior
page_chunks=False attempt produced.

Falls back to the plain-text ingester when pymupdf4llm is absent or
produces an unparseable result. 9 unit tests cover per-page page-
number tagging, the seen_chapter cross-page state, page-span
aggregation, and the fallback path.

First half of the hybrid extraction stack. Subsequent module
(vlm_adapter) will augment complex pages with structured figure /
equation descriptions before the chapter builder runs.
Renders a PDF page to PNG, sends to GPT-4o-mini vision with OpenAI
Structured Outputs schema, returns parsed components (figures with
descriptions, equations as LaTeX, tables with headers/rows,
algorithms with steps).

Pydantic models with discriminated unions (FigureComponent,
EquationComponent, TableComponent, AlgorithmComponent) map cleanly to
the API's response_format requirement and validate at the API
boundary.

Cropped page PNGs are saved to disk when figures_dir is configured —
the downstream slide generator can reference them via includegraphics
so the actual figures from the source PDF appear in the final
materials, not just textual descriptions.

Lazy OpenAI client construction so the extractor can be built
without an API key (useful for tests). Every failure mode is
defensive: render errors, network errors, schema rejection — all
return an empty ExtractedPage with a logged warning rather than
raising. The hybrid ingester (next phase) treats empty extractions
as "use PyMuPDF4LLM output only" so a VLM outage cannot break a run.

Smoke-tested on Han page 476 (OPTICS terminology): VLM correctly
identified Figure 10.16 with caption, generated a concrete visual
description, and extracted the Euclidean distance equation as LaTeX.

14 unit tests cover schema validation, lazy client construction,
PNG-save behavior, and the defensive error paths.
New module ingest_pdf_hybrid.py combines three pieces:
- spatial_router classifies each page as prose or complex
- ingest_pdf_paged extracts clean markdown for every page (workhorse)
- vlm_adapter augments complex pages with structured figure/equation/
  table/algorithm content from GPT-4o-mini vision

VLM-derived components flow into the Textbook IR as additional
Paragraphs (kind=figure_cap, equation, or example) with the structured
information encoded inline via parseable markers ([IMAGE_PATH: ...],
[LATEX: ...], [TABLE: ...], [ALGORITHM_STEPS: ...]) for the downstream
slide generator to consume in the next phase.

Wired into knowledge_base.from_path via an opt-in vlm_extractor kwarg,
into ADDIE.__init__ via a vlm_extraction bool kwarg, and exposed on
the CLI as --vlm-extraction. All paths gated on vlm_extraction=True
AND textbook_path being set — vanilla behavior is byte-identical
without the flag. ADDIE is defensive: if VLM construction fails the
run falls back to text-only extraction with a logged warning rather
than crashing.

Cropped page PNGs saved to .grounding_cache/figures/<textbook_id>/
so the slide generator can include them via \includegraphics in
final materials.

13 unit tests cover the block-formatting helpers for each component
type, the inline-marker output format, the vanilla preservation
invariant (no extractor → delegates to paged ingester), and
end-to-end IR construction with a mocked VLM. Full suite: 393 passing.
When the evidence excerpts contain inline markers from the hybrid
ingester's VLM augmentation ([IMAGE_PATH:], [LATEX:], [TABLE:],
[ALGORITHM_STEPS:], [DESCRIPTION:], [INSIGHT:]), _build_evidence_block
now appends an extra VISUAL CONTENT RULES block instructing the LLM
how to consume each marker for the artifact at hand.

Artifact-conditioned rules:
- slide / assessment: include figures via includegraphics, render
  equations via display math, render tables as LaTeX tabular, render
  algorithms as enumerated lists.
- script: narrate the figure / table verbally using the adjacent
  description and insight markers; do NOT read raw LaTeX aloud.

Critically, when no markers are present (vanilla and v2 evidence
text), _build_visual_content_rules returns an empty string and the
evidence block is byte-identical to the prior behavior. Vanilla
preservation invariant holds.

This closes the OTHER half of the v3 contribution: without these
prompt-side rules, the LLM would happily ignore the VLM-extracted
figures and equations and write text-only slides as before. With
them, the LLM is steered to reproduce the source figures
(\includegraphics), equations (\\[ ... \\]), tables (\\begin{tabular}),
and algorithms (\\begin{enumerate}) in the final materials.

12 unit tests cover: empty rule block on plain evidence, each marker
triggering its corresponding rule line for slides, script artifacts
getting narration-flavored rules instead of LaTeX-emission ones,
multi-marker evidence surfacing all relevant rules, and end-to-end
integration via _build_evidence_block with a mocked retriever. Full
suite: 405 passing.
@shrey-Bish shrey-Bish marked this pull request as draft June 5, 2026 20:34
When the hybrid extraction path routes some pages through a VLM, the
parsed Textbook IR depends on what the VLM returns. The VLM is not
strictly deterministic across runs (OpenAI seed is best-effort, and
even at temperature=0 small variations occur). Without caching, the
chunks built at generation time would NOT match the chunks built at
verification time — citation tokens emitted during generation would
fail to resolve during eval.

Two changes that together produce deterministic ingestion:

1. New src/grounding/ir_cache.py — saves the parsed Textbook IR to
   disk as JSON after first ingestion, loads on subsequent calls.
   Cache lives under .grounding_cache/ir/<textbook_id>.json. Cache
   invalidation is manual (delete the file to force re-ingestion).
   Graceful fall-through on missing / corrupt / schema-invalid cache.

2. VlmExtractor pins temperature=0 and seed=42 on every API call so
   the VLM output itself is as deterministic as the API allows.

TextbookKnowledgeBase.from_path gains two kwargs (ir_cache_dir,
use_ir_cache=True) that default to cache-enabled at
<project>/.grounding_cache/. The verifier and the generator share
the same cache by default → tokens emitted at gen time resolve
cleanly at eval time.

10 new unit tests cover save/load round-trip, parent-dir creation,
cache miss returning None, corrupt-JSON returning None, schema-
invalid-JSON returning None, end-to-end from_path cache hit
short-circuiting the ingester, and use_ir_cache=False bypassing the
cache entirely. Full suite: 415 passing.
The hybrid ingester puts each VLM-extracted figure / equation / table /
algorithm into its own Paragraph carrying inline markers
([IMAGE_PATH:, [LATEX:, [TABLE:, [ALGORITHM_STEPS:). Until now the
chunker bundled those paragraphs with up to ~500 tokens of
surrounding prose into a single chunk — queries about a figure or
equation would rank the chunk based mostly on the prose content,
leaving the visual element effectively invisible to retrieval.

This commit makes the chunker recognize paragraphs carrying any of
the four visual markers and emit each as its own standalone chunk
(typically 30-150 tokens). Prose paragraphs continue to be packed
greedily up to TARGET_TOKENS as before; the only structural change
is that visual paragraphs interrupt the prose stream — a prose chunk
ends at the boundary, the visual chunk fires, then a new prose chunk
starts after the visual paragraph.

Overlap between adjacent prose chunks no longer crosses a visual
paragraph, preventing figure / equation / table content from bleeding
into the next prose chunk's overlap window.

For visual queries this should rank the right chunk directly instead
of burying it in a 500-token prose chunk — the chief lever for
closing the silent-skip gap on figure-heavy pages.

6 new unit tests cover: figure paragraph becoming its own chunk,
equation paragraph becoming its own chunk, table paragraph becoming
its own chunk, three consecutive visual paragraphs each getting
their own chunk, pure-prose sections behaving exactly as before, and
prose-chunk overlap correctly stopping at visual paragraph
boundaries. Full suite: 421 passing.
The chunker emits OVERLAP_TOKENS of overlap between adjacent prose
chunks so adjacent chunks share context. As a side effect, the
retriever can rank two neighboring overlapping chunks both in the
top-K — the LLM then sees the SAME content twice in the evidence
block and occasionally cites the wrong instance, which the verifier
flags as wrong_chunk_cited / loose_paraphrase.

New _dedupe_results helper in src/slides.py drops later occurrences
whose chunk is byte-identical to a kept earlier chunk OR whose first
40 words match a kept chunk's first 40 words (catches the common
overlap case where chunk N+1 starts with the last ~64 tokens of
chunk N). Applied before evidence-block formatting; preserves the
retriever's rank order — first occurrence of each cluster wins.

8 new unit tests in tests/test_evidence_dedupe.py cover empty input,
unique chunks all kept, byte-identical dedup, overlap-prefix dedup,
mid-content overlap NOT triggering dedup (different starts must be
kept), rank-order preservation, empty-text chunks handled, and the
short-chunk fall-through to full-text equality. Full suite: 429
passing.
…nces, empty bullets, image sizing

Visual review of the generated PPTX surfaced six writer-side or
converter-side rendering bugs that look ugly on the rendered slide
but were invisible to the LaTeX-only path. Each is small but
visible to the reader; together they were ~3-4 cosmetic defects on
most content slides.

Converter (src/latex_to_pptx.py + src/build_pptx.js):

* Backtick / apostrophe LaTeX quotes ``…'' and `…' now convert to
  straight ASCII quotes "…" and '…' in unescape_latex(). Beamer
  writers emit these as a paired quote convention; the PPTX path
  was displaying the raw delimiters.

* Markdown bold / italic that the writer sometimes emits even
  inside .tex output is now stripped in strip_latex_formatting():
  **bold** / __bold__ / *italic* → plain text. LaTeX itself would
  ignore the asterisks, so they leaked unchanged.

* Bare $…$ math fences without a real math environment now collapse
  to just the inner content. The writer used $\geq 30$ as shorthand
  for "≥ 30"; without a math renderer the PPTX path showed
  "$\geq 30$" verbatim.

* Empty \item entries that produce a lone bullet on the slide are
  now dropped. Filter checks for whitespace-only and
  punctuation-only items and skips them so the slide doesn't render
  a hanging dot.

* The catch-all unknown-command stripper now also consumes optional
  [opt] arguments before any {arg} group. Previously
  \includegraphics[width=...]{path} that fell through to the
  catch-all left "[width=...]{path}" as visible text on the slide.

* \includegraphics image rendering box now respects the slide's
  available height. Caps at 3.2" or the remaining vertical space
  minus a small buffer, whichever is smaller, so figures never
  bleed past the slide edge.

Generation-side cleanup (src/slides.py):

* Markdown **bold** that survives into the writer's final .tex
  output now becomes \textbf{...} during the cleanup pass. Prevents
  the same bug from reappearing on future runs even if downstream
  consumers aren't running the latest converter.

* A defensive fallback regex catches UNCLOSED [DESCRIPTION:/
  [INSIGHT:/[IMAGE_PATH:/etc. markers. The strict regex requires
  the closing ']', but writers occasionally drop it, leaving the
  marker (and its prose) visible. Fallback strips from the opening
  marker up to the next "\ or newline.

Test coverage: 8 new tests for backtick quote conversion / markdown
stripping / bare math fences / empty item filtering, 3 new tests
for the upstream markdown-bold-in-tex fix. 65 polish-related tests
total, full suite stays green. Vanilla preservation invariant is
unaffected — none of these touch the writer's prompt or the
grounded retrieval path.
- parser: replace non-greedy itemize/enumerate match with a depth-aware
  helper so the outer environment doesn't get cut at the inner
  \end{itemize}. Previously nested bullets leaked into the parent
  item as raw text and rendered as phantom blank bullets.
- pptx renderer: pre-fit \includegraphics by reading the PNG header
  and computing aspect-preserved box dimensions, instead of relying
  on pptxgenjs sizing:"contain" (which LibreOffice's renderer does
  not always honour). Wide figures now centre cleanly inside the
  content column without bleeding off the slide.
- tests: three new regression cases covering nested itemize,
  enumerate-inside-itemize, and sibling itemize blocks.
Stacking text and bullets above a figure was squeezing every
\includegraphics into <3" of vertical space on a 13" slide. The
cropped figures from the PyMuPDF crop pass deserved more room.

- new _stackElements helper centralises image-first ordering and a
  trailingH estimate for each child. Every standard layout
  (renderStandard, renderListOnly, renderBlocks, renderCodeSlide)
  now goes through it.
- addPicture takes a trailingH so it can reserve vertical space for
  bullets/captions rendered after the image; the figure's box height
  is L.maxY - y - buffer - trailingH, capped at 4.5".
@EroNinja

EroNinja commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Thanks for adding me as a reviewer — I pulled the branch down, read through the grounding/ingestion code, and ran the suite locally (267 passing on my machine; the rest skip without the real textbooks). Nice work overall — the ingestion/retrieval split is clean and the opt-in gating is solid. A few substantive notes:

1. PDF extraction is the main quality limiter — worth supporting EPUB

The writeup already traces Han's elevated retrieval_bad to PDF extraction on equations/tables, and the plain-text path in src/textbook/ingest_pdf.py does a lot of heuristic work (font-size tiers, margin-band header/footer stripping, split/wrapped-heading merges) precisely because PDF carries no structure. EPUB sidesteps this — it's structured XHTML with real headings and a native nav/TOC, which is exactly what the new TOC-injection step depends on.

The groundwork is mostly there: schema.py already lists epub in source_format, and an ingest_epub.py could flatten the spine XHTML into the block format and reuse ingest_md._extract_blocks + _blocks_to_chapters — the same pattern ingest_pdf_file_via_markdown already follows. Looks like a modest lift for a real quality gain on any book available as EPUB. Happy to take this on if useful.

2. Citation tokens can collide in the verifier index (may skew the headline faithfulness numbers)

In evaluate.py, GroundingAgent builds:

self._chunk_by_token = {c.citation_token(): c for c in knowledge_base.chunks}

But Chunk.citation_token() is [textbook_id:section_id:p{page_start:02d}] — section + page only, no chunk index. Two chunks in the same section that share a page_start (easy to hit with markdown's ~250-word synthetic pages, and with short sections) produce the same token, so the dict silently keeps only the last one. A citation can then be scored against a different chunk than the writer actually drew from.

Since the faithfulness/precision figures are the main result, it'd be worth confirming this isn't shifting them. Disambiguating the token (include the chunk index, or map token → list of chunks and pick the best match) would close the gap.

3. sentence-transformers (→ torch) is a hard dependency for a path that's off by default

requirements.txt requires sentence-transformers>=2.7.0, which pulls in torch on every install — but per the comment in ADDIE.py the reranker isn't wired into the default runtime path. Moving it (and arguably pymupdf / rank-bm25 / markdown-it-py, only needed when grounding is on) into an optional extra would keep vanilla installs light and match the opt-in framing.

None of these are blockers — solid first version. Thanks!

@shrey-Bish

Copy link
Copy Markdown
Author

Thanks for adding me as a reviewer — I pulled the branch down, read through the grounding/ingestion code, and ran the suite locally (267 passing on my machine; the rest skip without the real textbooks). Nice work overall — the ingestion/retrieval split is clean and the opt-in gating is solid. A few substantive notes:

1. PDF extraction is the main quality limiter — worth supporting EPUB

The writeup already traces Han's elevated retrieval_bad to PDF extraction on equations/tables, and the plain-text path in src/textbook/ingest_pdf.py does a lot of heuristic work (font-size tiers, margin-band header/footer stripping, split/wrapped-heading merges) precisely because PDF carries no structure. EPUB sidesteps this — it's structured XHTML with real headings and a native nav/TOC, which is exactly what the new TOC-injection step depends on.

The groundwork is mostly there: schema.py already lists epub in source_format, and an ingest_epub.py could flatten the spine XHTML into the block format and reuse ingest_md._extract_blocks + _blocks_to_chapters — the same pattern ingest_pdf_file_via_markdown already follows. Looks like a modest lift for a real quality gain on any book available as EPUB. Happy to take this on if useful.

2. Citation tokens can collide in the verifier index (may skew the headline faithfulness numbers)

In evaluate.py, GroundingAgent builds:

self._chunk_by_token = {c.citation_token(): c for c in knowledge_base.chunks}

But Chunk.citation_token() is [textbook_id:section_id:p{page_start:02d}] — section + page only, no chunk index. Two chunks in the same section that share a page_start (easy to hit with markdown's ~250-word synthetic pages, and with short sections) produce the same token, so the dict silently keeps only the last one. A citation can then be scored against a different chunk than the writer actually drew from.

Since the faithfulness/precision figures are the main result, it'd be worth confirming this isn't shifting them. Disambiguating the token (include the chunk index, or map token → list of chunks and pick the best match) would close the gap.

3. sentence-transformers (→ torch) is a hard dependency for a path that's off by default

requirements.txt requires sentence-transformers>=2.7.0, which pulls in torch on every install — but per the comment in ADDIE.py the reranker isn't wired into the default runtime path. Moving it (and arguably pymupdf / rank-bm25 / markdown-it-py, only needed when grounding is on) into an optional extra would keep vanilla installs light and match the opt-in framing.

None of these are blockers — solid first version. Thanks!

Thanks for the careful read. Addressing your points :

  1. While I agree that it is a clean lift , but we want to keep PDF ingestion efficiency as the priority.
  2. This has been addressed in the followup work.
  3. The dep is no longer purely off-by-default in the followup work.

The grounded path (textbook ingestion, retrieval, semantic gates,
optional reranker) only fires when --use-textbook is passed to run.py
or evaluate.py. Its dependencies were previously declared as base
deps, so every install pulled in pymupdf, sentence-transformers,
torch, transformers, rank-bm25, and markdown-it-py (~400 MB of
torch + transitives) even when the user never planned to use them.

- pyproject.toml: new [grounding] extras group covering all six deps.
  Install with: pip install -e ".[grounding]"
- requirements.txt: same deps grouped under a clearly marked
  "Optional: grounding" section, preserved at the bottom so
  `pip install -r requirements.txt` keeps every supported feature
  working (backward compatibility).
- README: install section documents the three paths
  (vanilla / grounding extras / kitchen-sink requirements.txt).

Verified:
- A simulated vanilla install (blocking all six grounding imports at
  module load) successfully imports run, api_server, evaluate, and
  src.ADDIE.ADDIE — vanilla code path doesn't touch the grounding deps.
- pip installable: importlib.metadata exposes both 'vector-db' and
  'grounding' as extras.
- Full pytest suite still 693 passed.
The semantic gates (bi-encoder cosine filter) and the cross-encoder
reranker were the only consumers of sentence-transformers + torch in
the runtime path. Both now load their respective MiniLM models through
fastembed, which runs the same ONNX-exported weights via onnxruntime
— no torch dependency.

Numerical scores are unchanged (verified on a fixture):
- Semantic gate cosine on "K-means clustering partitions data" vs
  "K-means is a centroid-based clustering algorithm":
    sentence-transformers: 0.692239
    fastembed:             0.692239
- Cross-encoder reranker scores on the K-means probe set:
    sentence-transformers: +8.0968, -11.3451, -1.3409
    fastembed:             +8.0968, -11.3451, -1.3409 (delta ≤ 1e-6)

Why both backends agree to 6 decimals: fastembed loads from the same
HuggingFace repo (Xenova hosts the ONNX export of the original
cross-encoder repo), so the weights are bit-identical. The two
runtimes round their fused-op outputs slightly differently but the
delta stays well below the gate thresholds (0.32 / 0.30) and the
ranking-only contract of the reranker.

Install footprint dropped:
- vanilla:    unchanged at ~50 MB (grounding deps are gated behind
              the [grounding] extras group; no run-time imports)
- grounded:   ~100 MB (was ~550 MB with torch + sentence-transformers
              + transformers)

Edits:
- src/grounding/semantic_gate.py: _ensure_encoder() now imports
  fastembed.TextEmbedding; _embed() materialises the iterator and
  L2-normalises (fastembed returns raw vectors, unlike
  sentence-transformers' normalize_embeddings=True).
- src/grounding/reranker.py: CrossEncoderReranker uses
  fastembed.rerank.cross_encoder.TextCrossEncoder. The model name
  switched from cross-encoder/ms-marco-MiniLM-L-6-v2 to
  Xenova/ms-marco-MiniLM-L-6-v2 (same weights, ONNX-exported).
  Score path went from .predict([(q,p),…]) to .rerank(q, [p,…]).
- pyproject.toml + requirements.txt: [grounding] extras now lists
  fastembed in place of sentence-transformers + torch + transformers.
- README install section updated to reflect the lighter footprint.
- tests/test_semantic_gate.py: stub encoder now mirrors fastembed's
  .embed(iterable) generator-of-arrays contract instead of the old
  sentence-transformers .encode(text, normalize_embeddings=True).
- tests/test_grounding_reranker.py: lazy-load assertion now checks
  that fastembed/onnxruntime stay out of sys.modules until .score()
  is called (was: torch / sentence_transformers).

Verified: 693 passed, same count as before.
Code shipped upstream should not reference internal iteration labels
("v6 Lever A", "v7 Step 9", "v3 hybrid extraction") that aren't
defined anywhere in the codebase or its docs. Each affected comment
has been rewritten to describe what the code does without the
versioning prefix; nothing about runtime behaviour changes.

Affected docstrings and comments:
- src/grounding/semantic_gate.py: module docstring + Gate A / Gate B
  inline labels.
- src/grounding/reranker.py: header retained (it was already cleaned in
  the prior commit).
- src/grounding/write_time_verifier.py: module docstring.
- src/grounding/usage_tracker.py: module docstring.
- src/grounding/contract.py: constant comments (SECTIONS_PER_TOPIC,
  SUBTOPICS_PER_CHAPTER, smart-intro detection, meta-chapter abstain),
  helper docstrings, build-loop comments.
- src/slides.py: constructor diversity-cap / gate / verifier comments,
  evidence-block coverage diversification + Gate A + diversity cap +
  visual-chunk-inclusion comments, visual-content rules block,
  evidence-strip block (cleanup + Gate B + write-time verifier),
  cross-chapter assessment context, per-slide grounding hooks, draft
  picker docstring.
- src/ADDIE.py: lazy-construction comments for tracker, gates, verifier.
- src/textbook/ingest_pdf_paged.py: module docstring + page-flag
  comment.
- src/textbook/vlm_adapter.py: VLM retry docstring.
- evaluate.py: ambiguous-token-rescue comments + docstring.

Verified: 693 passed (same count as before), no behaviour change.
@shrey-Bish shrey-Bish marked this pull request as ready for review June 8, 2026 22:47
Three small, backwards-compatible changes targeting the residual
failure modes after the v7 baseline (retrieval_bad 3.9-6.6%,
loose_paraphrase 2.8-6.4%, ambiguous-chunk picks). None add a single
LLM API call to the runtime path.

1. Sentence-bounded claim window (semantic gate + write-time verifier).

   Both consumers previously walked backward with ``rfind()`` for
   ``". "`` / ``"! "`` / ``"? "`` / ``"\n"`` and took the trailing
   25-30 words. That heuristic split on common abbreviations
   (``e.g.``, ``i.e.``, ``Fig.``, ``Eq.``, ``Mr.``, etc.) and
   produced truncated or mid-sentence claim windows. Both call sites
   then scored / verified the wrong text.

   New ``src/grounding/claim_window.py`` factors out a regex sentence
   tokeniser with an abbreviation-suppression list. Both
   ``SemanticGate._extract_claim_window`` and
   ``WriteTimeVerifier._extract_claim_window`` now delegate to it.
   16 new tests pin down the new behaviour.

   Expected impact: tighter Gate B strip decisions and verifier
   prompts, which should mostly reduce ``loose_paraphrase`` (the
   second-largest residual failure mode on Agentic at 6.4%).

2. Cross-encoder reranker warmup at ADDIE init (src/ADDIE.py).

   ``CrossEncoderReranker()``'s constructor is intentionally lazy —
   the ONNX model is downloaded and loaded on first ``.score()``
   call, not at __init__. Before this commit, that meant init-time
   try/except caught import errors only; an actual model-load
   failure would silently fall back per-query, printing the same
   warning hundreds of times in one run.

   The init now does a tiny ``score("warmup", ["warmup"])`` call
   inside the try block so the ONNX model is materialised once,
   upfront. Failures get a single clear message and the reranker is
   set to None for the rest of the run.

3. Stopword-filtered Jaccard in ambiguous-token rescue (evaluate.py).

   ``_resolve_best_chunk`` picks the best of look-alike chunks by
   word-overlap with the claim. The earlier filter just dropped
   tokens shorter than 3 characters. Common filler ("the", "of",
   "in") still dominated the overlap, making the score brittle for
   topically similar siblings.

   We now reuse the lightweight stopword set the retriever already
   maintains at ``src.grounding.retriever._STOP`` (lazy import,
   defensive fallback to an empty frozenset if the import path
   shifts). The dominant content words now drive the rescue
   decision.

   This change is eval-side only — it can't move which citations
   the writer produces, only which sibling chunk the judge scores
   them against. The largest expected effect is a small drop in
   ``wrong_chunk_cited`` reports on multi-chunk-spanning tokens
   (the prior baseline showed ~75% of tokens were ambiguous).

Tests: 708 passed (was 693; 15 new tests across claim_window).
The full Han 3rd-edition PDF (740 pages) exposed a latent bug: a small
number of chunks emitted by the IR build exceeded OpenAI's 8192-token
per-input limit on the embedding-3 models, almost certainly long visual
captions or bibliography-style runs that the paragraph-aware chunker
left whole because they were already "complete paragraphs." Every
retrieval call on those chunks failed with a 400, the whole batch was
rejected, and the writer silently fell back to vanilla prompts. The
run then kept retrying for hours (~$8-10 of failed embedding API
spend) while emitting zero citations.

Three defensive layers, none of which trade off precision, recall, or
information completeness:

  * Layer 1 — Sentence-aware chunk splitting at ingest. New
    `MAX_CHUNK_CHARS = 24000` (≈6000 tokens; ~25% headroom under the
    8192 ceiling). `_split_chunk_if_oversized` in knowledge_base.py
    walks each emitted chunk; if the text exceeds the ceiling, it
    splits on real sentence boundaries (reusing the regex +
    abbreviation suppression list from `claim_window`) into sub-chunks
    that inherit the parent's section, chapter, page span, and `kinds`.
    Sub-chunks therefore EMIT THE SAME CITATION TOKEN as their siblings,
    so the existing ambiguous-token rescue in evaluate.py picks the
    best at score time — no downstream changes needed.

  * Layer 2 — Embedder-level batch guard. `OpenAIEmbedder.embed()` now
    checks each input's length before calling the API; oversized
    inputs are sentence-split, embedded as multiple pieces, then
    mean-pooled into one vector with L2 re-normalisation. The output
    shape (one vector per input) is preserved so the retriever's
    contract is unchanged. This catches anything Layer 1 missed
    (manual chunk injection in tests, future ingester paths, etc.).

  * Layer 3 — Fail-fast on consecutive retrieval errors. The
    silent-fallback path in `_build_evidence_block` now tracks
    consecutive failures of the same error class on the class object;
    after 10 same-class failures in a row, raise a RuntimeError with
    the last error attached. The counter resets on any successful
    retrieval so transient blips (brief rate limits, network hiccups)
    don't accumulate spuriously. Cost protection: writer + verifier
    calls otherwise keep running for the whole course generation even
    though no grounded evidence reaches the prompts.

Plus a one-line operator-facing diagnostic at IR-build time:
``[grounding] N sub-chunks emitted from oversized parent chunks
(max chunk size after split: M chars, ceiling: K).`` Surfaces the
edge case in the run log without forcing future operators to dig.

Public helper `split_into_sentences()` factored out of
`claim_window.py` so both the chunker and the embedder can reuse the
same boundary detector (regex + ~30-entry abbreviation suppression
list).

Tests: 14 new in tests/test_chunk_size_cap.py covering (a) Layer 1's
metadata inheritance, citation token stability, sub-chunk id
uniqueness, information preservation across split, and last-resort
hard-slice fallback when a single sentence exceeds the ceiling; (b)
Layer 2's output-shape invariance under mixed-size batches and
oversized-input mean-pooling via a mocked OpenAI client; (c) Layer 3's
threshold behaviour, error-class-aware counter, and counter reset on
a successful retrieval. Full suite: 722 passed (was 708; +14 new).

The broken IR cache for the full Han PDF at
.grounding_cache/ir/the_morgan_kaufmann_...2011.json was deleted so
the next ingest rebuilds with the corrected chunker.
The sentence-bounded claim window helped Agentic (+2.92 pp precision)
but regressed the Han 6-chapter subset (-3.84 pp), with only ~7 %
citation overlap between runs suggesting the divergence reaches far
upstream. Keep the rfind heuristic in semantic_gate +
write_time_verifier until the cross-textbook effect is isolated. The
sentence-end regex stays in claim_window.py — the chunker and embedder
size guard still use it.
Two independent bugs were blocking VLM-extracted figures from reaching
the final slides.tex. After both fixes, a single-chapter clustering run
on Han 3e (page 481-534) produced 2 real \includegraphics tokens
pointing to verified PNGs (Figure 10.1, DBSCAN density-reachability) —
the first iteration where any figure landed in the deck despite
129 PNGs sitting on disk.

(1) Visual chunk hoisted to FRONT of injected results. The block-builder
loop walks results in rank order with a fixed word budget; large prose
chunks (~450 words each in cluster chapters) exhausted the 1800-word
budget in four iterations, so the appended-at-tail visual chunk never
reached evidence_text. Hoisting it to position 0 guarantees its
IMAGE_PATH / LATEX / TABLE / ALGORITHM_STEPS marker survives into the
prompt, which is what triggers the visual-content rule block.

(2) Teaching Assistant prompt now lists \includegraphics and tabular
as available LaTeX features alongside itemize / enumerate / block /
lstlisting / equation. Adds an explicit PRESERVE-FIGURES instruction so
\includegraphics commands from the Faculty's slide draft don't get
silently dropped when the TA reformats. Without this, the TA had no
permission to emit figures even when the Faculty drafted them.

test_force_visual_chunk assertions updated from out[-1] to out[0] to
match the new front-insert position.
@shrey-Bish shrey-Bish marked this pull request as draft June 14, 2026 01:40
The default PDF ingestion path used to call ingest_pdf_file (plain
PyMuPDF text extraction with no image handling) and image cropping was
only available via the opt-in --vlm-extraction flag, which rendered
whole pages as PNGs through gpt-4o vision.

Move the default path through ingest_pdf_file_paged with
pymupdf4llm.to_markdown(write_images=True). pymupdf4llm writes embedded
image XObjects from the PDF as tight cropped PNGs — the actual figure
region, not a full-page screenshot. The ingester walks the figures
directory after extraction, renames each saved file from pymupdf4llm's
{stem}.pdf-{page:04d}-{idx:02d}.png convention to
{textbook_id}_p{page:04d}_{idx:02d}.png, and emits one figure_cap
paragraph with an [IMAGE_PATH: ...] marker per image so the downstream
visual-content rules path still triggers.

Measured on Han 3rd ed (740 pages): 464 cropped figures landed in
80 s with no API spend. Figure 10.3 (K-Means iteration) is now a 32 KB
tight crop showing just the 3-panel diagram — the prior VLM pass
produced a 230 KB full-page render with running headers and body text
bleed.

The hybrid VLM path stays available via --vlm-extraction for runs that
want the gpt-4o DESCRIPTION / INSIGHT markers alongside images. The
test_ir_cache plumbing patch swap targets the new dispatch hook
(_ingest in knowledge_base) so it stays robust to future routing
changes.
Both code paths were tried, measured, and abandoned but their source
plus tests stayed in the tree as documentation. The CLAUDE.md note
covers the same ground without leaving the modules in place, so drop:

* src/grounding/reranker.py — LLMReranker (137-line class plus the
  json / os imports that only it used). Production retrieval uses
  CrossEncoderReranker; the LLM variant measured no improvement
  (89.3 % vs 90.2 % precision) on its trial run.
* src/slides.py — _generate_best_of_n_draft and the
  _decrement_tracker_for_text helper that rolls back loser citations.
  The multi-draft slide-best-pick path scored drafts by citation count
  which rewarded volume over quality and was disabled in favour of the
  semantic-gate stack. The DISABLED comment in _generate_slide_draft
  referenced a --enable-best-of-n flag that did not exist.
* src/grounding/__init__.py — drop the LLMReranker re-export.
* src/ADDIE.py — drop the comment paragraph about the removed reranker;
  CrossEncoderReranker construction is unchanged.
* tests/test_multi_draft_best_pick.py — whole file (~146 lines).
* tests/test_grounding_reranker.py — TestLLMReranker class plus the
  _mock_openai_client helper used only by it (~118 lines).
@xander-Huang-ASU

Copy link
Copy Markdown
Collaborator

Bug: Incorrect generate_response() input type

_maybe_augment_syllabus_with_admin (ADDIE.py:349) passes a plain string to LLM.generate_response(), which expects a List[Dict[str, str]].

Current:

response = self.addie.llm.generate_response(prompt)

Suggested fix:

response = self.addie.llm.generate_response(
    [{"role": "user", "content": prompt}]
)

Impact

The SDK rejects the call, the exception is silently caught, and the function returns early. Consequently:

  • Admin scaffolding (e.g., Course Policies) is never appended to the syllabus during grounded runs.
  • result_syllabus_design.md.pre_admin_scaffolding.bak is never created, so --resume retries the same failing call every time.

Reproduced: Ran with --use-textbook and confirmed the .bak file is never created.

Why tests missed it

test_addie_grounding_runtime.py mocks addie.llm with MagicMock, which accepts any argument type. Consider asserting the message format:

call_args = runner.addie.llm.generate_response.call_args[0][0]
assert isinstance(call_args, list)
assert call_args[0]["role"] == "user"

@shrey-Bish

Copy link
Copy Markdown
Author

Bug: Incorrect generate_response() input type

_maybe_augment_syllabus_with_admin (ADDIE.py:349) passes a plain string to LLM.generate_response(), which expects a List[Dict[str, str]].

Current:

response = self.addie.llm.generate_response(prompt)

Suggested fix:

response = self.addie.llm.generate_response(
    [{"role": "user", "content": prompt}]
)

Impact

The SDK rejects the call, the exception is silently caught, and the function returns early. Consequently:

  • Admin scaffolding (e.g., Course Policies) is never appended to the syllabus during grounded runs.
  • result_syllabus_design.md.pre_admin_scaffolding.bak is never created, so --resume retries the same failing call every time.

Reproduced: Ran with --use-textbook and confirmed the .bak file is never created.

Why tests missed it

test_addie_grounding_runtime.py mocks addie.llm with MagicMock, which accepts any argument type. Consider asserting the message format:

call_args = runner.addie.llm.generate_response.call_args[0][0]
assert isinstance(call_args, list)
assert call_args[0]["role"] == "user"

Good catch : I fixed it in 1304e80. _maybe_augment_syllabus_with_admin now passes [{"role": "user", "content": prompt}] instead of the bare string. You were right about the test, too: the existing grounded test asserted "…" in call_prompt against the string, so it silently encoded the bug. I've updated it to check the message list and added a dedicated assertion on the arg format. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants