Refiner pipeline #12
- add standalone refinement runner with route filtering and report output
- implement assessment refinement using evaluator feedback and validation retries
- add slide refinement that targets specific Beamer frames instead of rewriting decks
- add deterministic slide checks for frame structure, LaTeX balance, citations, and malformed body output
- compile refined slide decks with existing LaTeX compiler and report compile results
- add tests for slide parsing, patching, and validation behavior
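For readers unfamiliar with what "deterministic slide checks" means in practice, here is a minimal sketch of two such checks (frame structure and brace balance). The function names `frames_balanced` and `braces_balanced` are hypothetical and are not taken from this PR. The sketch also ignores LaTeX comments and verbatim environments, which a real checker would likely need to handle.

```python
import re

# Matches \begin{frame} and \end{frame}; group(1) tells us which one.
FRAME_TOKEN = re.compile(r"\\(begin|end)\{frame\}")


def frames_balanced(tex: str) -> bool:
    """Return True when frame environments open and close in order, unnested."""
    depth = 0
    for match in FRAME_TOKEN.finditer(tex):
        depth += 1 if match.group(1) == "begin" else -1
        if depth < 0 or depth > 1:  # stray \end{frame}, or a nested frame
            return False
    return depth == 0


def braces_balanced(tex: str) -> bool:
    """Return True when unescaped { and } are balanced."""
    depth = 0
    i = 0
    while i < len(tex):
        ch = tex[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace with nothing to close
                return False
        i += 1
    return depth == 0
```

Checks like these need no model call, so they can run on every candidate patch before it is accepted or retried.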
- add ScriptRefiner for section-level script repair
- map script sections to related slide frames for grounding
- locate target script sections from evaluator feedback
- refine only selected script section bodies instead of rewriting full scripts
- load matching slides for script refinement and prefer refined slides when available
- exclude attribution feedback to avoid fake citation generation
Bottom line up front: the core idea (localized rewrite) is great, but the "evaluation" that drives it needs to be replaced. Below are the suggested changes.
One-line summary
Split the pipeline into two independent concerns: "decide what to fix" and "how to fix it."
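To make the proposed split concrete, here is a rough sketch of what the two concerns could look like as separate interfaces. Everything in it (`FixRequest`, `Decider`, `Fixer`, `run_refinement`) is a hypothetical name chosen for illustration, not code from this PR.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class FixRequest:
    """One thing to fix: where it is, and what is wrong with it."""
    target: str          # e.g. a frame title or section heading
    feedback_text: str   # free-text description of the problem


# "Decide what to fix": any source (human notes, an evaluator, a manual list).
Decider = Callable[[str], List[FixRequest]]
# "How to fix it": the rewrite engine, which knows nothing about the source.
Fixer = Callable[[str, FixRequest], str]


def run_refinement(content: str, decide: Decider, fix: Fixer) -> str:
    """Apply each requested fix in order and return the refined content."""
    for request in decide(content):
        content = fix(content, request)
    return content
```

With this shape, swapping the evaluator for human feedback only changes the `decide` argument; the rewrite engine is untouched.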
What's done well
The localized-rewrite design in
This "trust the LLM narrowly + validate deterministically" approach is the right instinct: far more controllable than the original optimize path, which regenerated whole chapters and trusted the output as-is.
Main concern: don't use LLM-as-judge as the basis for refinement
The pipeline's refinement entry point is currently evaluation-score-driven:
The problem: an LLM scoring its own generated content is inherently unreliable. The scores drift, are uncalibrated, vary across runs on identical content, and frequently come with plausible-but-wrong rationales. Using those scores to decide "what to fix and how" stakes the entire refinement quality on one unstable signal. We should instead make human feedback the primary evaluation signal.
The good news:
    refine_slides(content, feedback_text, max_retries)
    #                      ↑ just a string
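Since the feedback argument is plain text, reviewer notes can be turned into it with a very thin adapter. The sketch below is one possible shape. `ReviewerNote` and `notes_to_feedback_text` are hypothetical names, and the bullet format is an assumption, not something the refiner is known to require.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReviewerNote:
    """A single human comment attached to one slide frame."""
    frame_title: str
    comment: str


def notes_to_feedback_text(notes: Iterable[ReviewerNote]) -> str:
    """Join non-empty reviewer comments into one feedback string."""
    lines = [
        f"- Frame '{note.frame_title}': {note.comment.strip()}"
        for note in notes
        if note.comment.strip()  # drop blank comments
    ]
    return "\n".join(lines)
```

The returned string could then be passed as the `feedback_text` argument shown above, with no evaluator involved.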
Suggested changes
1. Decouple the rewrite engine from evaluation
Extract
2. Make human feedback a first-class input
The source of
3. Make "localized rewrite" a selectable strategy, not something hard-wired into the evaluation pipeline
The rewrite should be a reusable capability (manual trigger, human-feedback trigger, optional evaluation trigger) rather than something that can only run on the "evaluate → queue → refine" chain.
4. Minor issues
Refinement Pipeline
Adds the first operational refinement pipeline, including:
Currently supports assessment refinement. Slides/scripts are future work.
Running The Pipeline
1. Run Evaluation First
This generates:
under:
2. Preview The Refinement Queue
Before running refinement, you can preview which files will be refined:
This shows:
without modifying any files.
3. Run Refinement
uv run python -m src.refinement_runner \
  --exp default \
  --refine \
  --retries 1
This:
4. Main Output Report
The report includes: