
Refiner pipeline #12

Open
Aarsh-collab wants to merge 9 commits into DaRL-GenAI:main from Aarsh-collab:refiner-pipeline

Aarsh-collab commented May 12, 2026

Refinement Pipeline

Adds the first operational refinement pipeline, including:

  • repair queue generation
  • refinement runner
  • packetization
  • iterative refinement loop
  • validation/retry flow
  • refinement reporting

Currently supports assessment refinement. Slides/scripts are future work.


Running The Pipeline

1. Run Evaluation First

uv run python src/evaluate.py --exp default

This generates:

  • evaluation scores
  • validation reports

under:

eval/{model}-Evaluation_{exp}/

2. Preview The Refinement Queue

Before running refinement, you can preview which files will be refined:

uv run python -m src.refinement_runner --exp default

This shows:

  • queued assessment files
  • scores/metrics
  • validation report usage
  • output paths

without modifying any files.


3. Run Refinement

uv run python -m src.refinement_runner \
    --exp default \
    --refine \
    --retries 1

This:

  • builds a refinement queue from low-scoring assessment files
  • loads evaluation metrics + validation reports
  • generates repair constraints
  • runs iterative refinement + validation
  • saves refined outputs under:
exp/{exp}/refined/

4. Main Output Report

exp/{exp}/refined/refinement_report.json

The report includes:

  • constraints
  • validation status
  • retry metadata
  • validation history
  • validation report usage
  • refined file paths

- add standalone refinement runner with route filtering and report output
- implement assessment refinement using evaluator feedback and validation retries
- add slide refinement that targets specific Beamer frames instead of rewriting decks
- add deterministic slide checks for frame structure, LaTeX balance, citations, and malformed body output
- compile refined slide decks with existing LaTeX compiler and report compile results
- add tests for slide parsing, patching, and validation behavior
- add ScriptRefiner for section-level script repair
- map script sections to related slide frames for grounding
- locate target script sections from evaluator feedback
- refine only selected script section bodies instead of rewriting full scripts
- load matching slides for script refinement and prefer refined slides when available
- exclude attribution feedback to avoid fake citation generation
wingsweihua requested a review from Hyan-Yao May 28, 2026 04:05
@Hyan-Yao

Collaborator

Bottom line up front: the core idea (localized rewrite) is great, but the "evaluation" that drives it needs to be replaced. Below are the suggested changes.

One-line summary

Split the pipeline into two independent concerns: "decide what to fix" and "how to fix it."

  • "How to fix it" = SlideRefiner's localized rewrite. This part is excellent; keep and reuse it.
  • "Decide what to fix" = the current LLM-as-judge scoring + threshold selection. This part is unreliable and should be driven by human feedback instead.

What's done well

The localized-rewrite design in SlideRefiner (src/refinement.py, lines 7–649) is the most valuable part of this branch:

  • The LLM is only trusted with two narrow jobs: locate the frames to edit (locate_frames) and rewrite a single frame body (refine_frame_body).
  • Everything else is backstopped deterministically: regex frame parsing, original_content-anchored replacement (untouched frames stay byte-identical), two-level validation (body-level + document-level: frame count unchanged, unedited frame titles still present, environment/brace/ampersand balance, parseable by LaTeXParser), and two-tier retries.
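
To make the "anchored replacement plus deterministic checks" idea concrete, here is a minimal sketch. It is not the code in src/refinement.py: the regex, the function names (split_frames, replace_frame, validate_patch), and the naive brace count are all assumptions chosen for illustration, and only a subset of the checks listed above is shown.

```python
import re

# Assumed pattern: one Beamer frame, matched non-greedily across lines.
FRAME_RE = re.compile(r"\\begin\{frame\}.*?\\end\{frame\}", re.DOTALL)


def split_frames(latex: str) -> list[str]:
    """Return every frame block exactly as it appears in the source."""
    return FRAME_RE.findall(latex)


def replace_frame(latex: str, index: int, new_frame: str) -> str:
    """Swap frame `index` for `new_frame`; all other bytes are copied as-is."""
    matches = list(FRAME_RE.finditer(latex))
    target = matches[index]
    return latex[: target.start()] + new_frame + latex[target.end():]


def validate_patch(original: str, patched: str, edited: int) -> list[str]:
    """Deterministic document-level checks; an empty list means the patch passes."""
    before, after = split_frames(original), split_frames(patched)
    if len(before) != len(after):
        return ["frame count changed"]
    errors = []
    for i, (old, new) in enumerate(zip(before, after)):
        if i != edited and old != new:
            errors.append(f"frame {i} was modified but not targeted")
    # Naive balance check: counts raw braces and ignores escaped \{ and \}.
    if patched.count("{") != patched.count("}"):
        errors.append("unbalanced braces")
    return errors
```

Because the replacement slices the original string instead of regenerating it, untouched frames cannot change, and the validator only has to confirm that property rather than trust the model's output.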

This "trust the LLM narrowly + validate deterministically" approach is the right instinct — far more controllable than the original optimize path, which regenerated whole chapters and trusted the output as-is.

Main concern: don't use LLM-as-judge as the basis for refinement

The pipeline's refinement entry point is currently evaluation-score-driven:

  • src/evaluate.py uses an LLM to score each file/metric 1–5 (with a thought);
  • RefinementRunner.build_repair_queue selects files whose average score < threshold (default 3.0);
  • RefinementEngine.format_metrics_feedback assembles those scores/rationales into the feedback_text fed to the refiner.
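
For reference, the selection step described above reduces to roughly the following. This is a sketch under assumptions, not the actual RefinementRunner.build_repair_queue: the function name select_low_scoring and the shape of the scores mapping (file path to per-metric score) are hypothetical.

```python
from statistics import mean


def select_low_scoring(
    scores: dict[str, dict[str, float]], threshold: float = 3.0
) -> list[tuple[str, float]]:
    """Return (path, average) pairs whose average score is below the threshold.

    `scores` maps a file path to its per-metric scores on the 1-5 scale.
    Files with no metrics are skipped. Lowest averages come first.
    """
    queue = []
    for path, metrics in scores.items():
        if not metrics:
            continue
        average = mean(metrics.values())
        if average < threshold:
            queue.append((path, average))
    return sorted(queue, key=lambda item: item[1])
```

The hard cutoff is what makes score noise matter: a file that averages 2.9 in one evaluation run and 3.1 in the next moves in and out of the queue with no change to its content.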

The problem: an LLM scoring its own generated content is inherently unreliable — scores drift, are uncalibrated, vary across runs on identical content, and frequently produce plausible-but-wrong rationales. Using those scores to decide "what to fix and how" stakes the entire refinement quality on one unstable signal.

We should instead make human feedback the primary evaluation signal. The good news: SlideRefiner is already fully decoupled from evaluation — its entry point is simply

refine_slides(content, feedback_text, max_retries)
#                       ↑ just a string

SlideRefiner does not care whether feedback_text came from LLM scoring or not. The LLM-as-judge coupling lives entirely upstream (evaluate.py + build_repair_queue + format_metrics_feedback); swap that layer for human feedback and the rewrite core needs zero changes.

Suggested changes

1. Decouple the rewrite engine from evaluation

Extract SlideRefiner (and ScriptRefiner if needed) out of refinement.py into its own module so it depends only on (latex, feedback_text) and imports nothing evaluation-related. Right now refinement.py mixes Refiner / SlideRefiner / ScriptRefiner / RefinementEngine into one 2000+ line file — too tightly coupled.

2. Make human feedback a first-class input

The source of feedback_text should support manual input (a plain free-text string) as the priority path, not only generation by format_metrics_feedback from evaluation_scores.json. LLM evaluation can remain an optional assist, but should not be the default/only driver.
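
One possible shape for this, as a sketch only: a small resolver that prefers reviewer-supplied text and falls back to LLM-generated feedback when none is given. The name resolve_feedback, its parameters, and the returned source label are hypothetical and do not exist in the branch.

```python
from typing import Optional


def resolve_feedback(
    human_feedback: Optional[str], llm_feedback: Optional[str] = None
) -> tuple[str, str]:
    """Return (feedback_text, source), preferring human input over LLM output."""
    if human_feedback and human_feedback.strip():
        return human_feedback.strip(), "human"
    if llm_feedback and llm_feedback.strip():
        return llm_feedback.strip(), "llm"
    raise ValueError("no feedback available; nothing to refine")
```

The returned text could then be passed to refine_slides(content, feedback_text, max_retries) unchanged, and recording the source label in the report would show which signal drove each edit.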

3. Make "localized rewrite" a selectable strategy, not something hard-wired into the evaluation pipeline

The rewrite should be a reusable capability (manual trigger, human-feedback trigger, optional evaluation trigger) rather than something that can only run on the "evaluate → queue → refine" chain.

4. Minor issues

  • RefinementEngine.get_content_type is a chain of if route == x: return x branches; it can return route directly once the value has been checked against the known routes.
  • The dicts returned per route in refine_packet are inconsistent (slides/script/general each differ); consider normalizing them so downstream code can handle every route the same way.
  • remove_external_resource_feedback filters citation-like feedback sentence-by-sentence with a long regex list; it's brittle — add a comment explaining the intent and boundaries.
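
A rough sketch of the first two points, under stated assumptions: the route set, the fallback to "general", and the RefineResult fields are all hypothetical and would need to match what refine_packet actually returns.

```python
from dataclasses import asdict, dataclass, field

# Assumed route names; the real set lives in RefinementEngine.
KNOWN_ROUTES = frozenset({"assessment", "slides", "script", "general"})


def get_content_type(route: str) -> str:
    """Return the route itself when it is known, else fall back to "general"."""
    return route if route in KNOWN_ROUTES else "general"


@dataclass
class RefineResult:
    """One result shape shared by every route (field names are illustrative)."""

    route: str
    content: str
    valid: bool
    attempts: int = 1
    errors: list[str] = field(default_factory=list)


# Every route returns the same keys, so report code needs no per-route branches.
result = asdict(RefineResult(route="slides", content="...", valid=True))
```

Route-specific extras (for example, slide compile results) could live in an optional field rather than changing the top-level keys.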
