Refiner pipeline by Aarsh-collab · Pull Request #12 · DaRL-GenAI/instructional_agents

Aarsh-collab · 2026-05-12T00:34:07Z

Refinement Pipeline

Adds the first operational refinement pipeline, including:

repair queue generation
refinement runner
packetization
iterative refinement loop
validation/retry flow
refinement reporting

Currently supports assessment refinement. Slides/scripts are future work.

Running The Pipeline

1. Run Evaluation First

uv run python src/evaluate.py --exp default

This generates:

evaluation scores
validation reports

under:

eval/{model}-Evaluation_{exp}/

2. Preview The Refinement Queue

Before running refinement, you can preview which files will be refined:

uv run python -m src.refinement_runner --exp default

This shows:

queued assessment files
scores/metrics
validation report usage
output paths

without modifying any files.

3. Run Refinement

uv run python -m src.refinement_runner \
    --exp default \
    --refine \
    --retries 1

This:

builds a refinement queue from low-scoring assessment files
loads evaluation metrics + validation reports
generates repair constraints
runs iterative refinement + validation
saves refined outputs under:

exp/{exp}/refined/

4. Main Output Report

exp/{exp}/refined/refinement_report.json

The report includes:

constraints
validation status
retry metadata
validation history
validation report usage
refined file paths
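As a minimal sketch of inspecting the report after a run: only the path template exp/{exp}/refined/refinement_report.json comes from the description above. The helper names, and the assumption that the report's top level is a JSON object, are hypothetical.

```python
import json
from pathlib import Path


def load_refinement_report(exp: str, root: Path = Path(".")) -> dict:
    """Load exp/{exp}/refined/refinement_report.json relative to root.

    Hypothetical helper; assumes the report's top level is a JSON object.
    """
    report_path = root / "exp" / exp / "refined" / "refinement_report.json"
    with report_path.open(encoding="utf-8") as fh:
        return json.load(fh)


def top_level_keys(report: dict) -> list:
    """Return the sorted top-level keys without assuming any schema."""
    return sorted(report)
```

Calling top_level_keys(load_refinement_report("default")) lists whatever sections the report actually contains, which avoids hard-coding field names that may change.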

- add standalone refinement runner with route filtering and report output
- implement assessment refinement using evaluator feedback and validation retries
- add slide refinement that targets specific Beamer frames instead of rewriting decks
- add deterministic slide checks for frame structure, LaTeX balance, citations, and malformed body output
- compile refined slide decks with existing LaTeX compiler and report compile results
- add tests for slide parsing, patching, and validation behavior

- add ScriptRefiner for section-level script repair
- map script sections to related slide frames for grounding
- locate target script sections from evaluator feedback
- refine only selected script section bodies instead of rewriting full scripts
- load matching slides for script refinement and prefer refined slides when available
- exclude attribution feedback to avoid fake citation generation

Hyan-Yao · 2026-06-18T01:23:40Z

Bottom line up front: the core idea (localized rewrite) is great, but the "evaluation" that drives it needs to be replaced. Below are the suggested changes.

One-line summary

Split the pipeline into two independent concerns: "decide what to fix" and "how to fix it."

"How to fix it" = SlideRefiner's localized rewrite. This part is excellent; keep and reuse it.
"Decide what to fix" = the current LLM-as-judge scoring + threshold selection. This part is unreliable and should be replaced by a human-feedback-driven signal.

What's done well

The localized-rewrite design in SlideRefiner (src/refinement.py, lines 7–649) is the most valuable part of this branch:

The LLM is only trusted with two narrow jobs: locate the frames to edit (locate_frames) and rewrite a single frame body (refine_frame_body).
Everything else is backstopped deterministically: regex frame parsing, original_content-anchored replacement (untouched frames stay byte-identical), two-level validation (body-level + document-level: frame count unchanged, unedited frame titles still present, environment/brace/ampersand balance, parseable by LaTeXParser), and two-tier retries.

This "trust the LLM narrowly + validate deterministically" approach is the right instinct — far more controllable than the original optimize path, which regenerated whole chapters and trusted the output as-is.
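To illustrate the pattern for readers who have not opened src/refinement.py, here is a simplified sketch of an anchored single-frame replacement with deterministic checks. It is not the code in this branch: the function names, the regex, and the naive brace count (which ignores escaped braces) are all assumptions made for illustration.

```python
import re

# Assumption: frames are not nested and each starts with \begin{frame}.
FRAME_RE = re.compile(r"\\begin\{frame\}.*?\\end\{frame\}", re.DOTALL)


def split_frames(latex: str) -> list:
    """Return every frame environment, in document order."""
    return FRAME_RE.findall(latex)


def replace_frame(latex: str, index: int, new_frame: str) -> str:
    """Swap one frame, anchored on its original text, then validate."""
    frames = split_frames(latex)
    original = frames[index]
    if latex.count(original) != 1:
        raise ValueError("frame text is not unique; cannot anchor replacement")
    patched = latex.replace(original, new_frame, 1)

    # Deterministic document-level checks (a small subset of those listed above).
    if len(split_frames(patched)) != len(frames):
        raise ValueError("frame count changed")
    if patched.count("{") != patched.count("}"):
        raise ValueError("unbalanced braces")  # naive: ignores \{ and \}
    return patched
```

Because the replacement is anchored on the original frame text, everything outside that frame is copied through unchanged, and a rewrite that drops \end{frame} is rejected before it reaches disk.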

Main concern: don't use LLM-as-judge as the basis for refinement

The pipeline's refinement entry point is currently evaluation-score-driven:

src/evaluate.py uses an LLM to score each file/metric 1–5 (with a thought);
RefinementRunner.build_repair_queue selects files whose average score < threshold (default 3.0);
RefinementEngine.format_metrics_feedback assembles those scores/rationales into the feedback_text fed to the refiner.

The problem: an LLM scoring its own generated content is inherently unreliable — scores drift, are uncalibrated, vary across runs on identical content, and frequently produce plausible-but-wrong rationales. Using those scores to decide "what to fix and how" stakes the entire refinement quality on one unstable signal.
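To make the instability concrete, here is a sketch of threshold-based selection as described above (average score < 3.0). It is a stand-in, not the actual build_repair_queue: the function name and the shape of the scores dictionary are assumptions.

```python
def select_low_scoring(scores: dict, threshold: float = 3.0) -> list:
    """Return files whose average metric score is strictly below threshold.

    scores maps file path -> {metric name: score on a 1-5 scale} (assumed shape).
    """
    queue = []
    for path, metrics in scores.items():
        if not metrics:
            continue  # nothing to average; skip rather than guess
        average = sum(metrics.values()) / len(metrics)
        if average < threshold:
            queue.append(path)
    return sorted(queue)


# Two hypothetical scoring runs over the same, unchanged file.
run_a = {"quiz_1.md": {"clarity": 3, "accuracy": 3}}
run_b = {"quiz_1.md": {"clarity": 3, "accuracy": 2}}
```

With these inputs the file is skipped in run_a (average 3.0) and queued in run_b (average 2.5), so a one-point swing on a single metric decides whether the file is rewritten at all.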

We should instead make human feedback the primary evaluation signal. The good news: SlideRefiner is already fully decoupled from evaluation — its entry point is simply

refine_slides(content, feedback_text, max_retries)
#                       ↑ just a string

SlideRefiner does not care whether feedback_text came from LLM scoring or not. The LLM-as-judge coupling lives entirely upstream (evaluate.py + build_repair_queue + format_metrics_feedback); swap that layer for human feedback and the rewrite core needs zero changes.

Suggested changes

1. Decouple the rewrite engine from evaluation

Extract SlideRefiner (and ScriptRefiner if needed) out of refinement.py into its own module so it depends only on (latex, feedback_text) and imports nothing evaluation-related. Right now refinement.py mixes Refiner / SlideRefiner / ScriptRefiner / RefinementEngine into one 2000+ line file — too tightly coupled.

2. Make human feedback a first-class input

The source of feedback_text should support manual input (a plain free-text string) as the priority path, not only generation by format_metrics_feedback from evaluation_scores.json. LLM evaluation can remain an optional assist, but should not be the default/only driver.
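One possible shape for this priority order, as a sketch: resolve_feedback and its parameters are hypothetical names, and the LLM path is passed in as a callable so this code imports nothing evaluation-related.

```python
from typing import Callable, Optional


def resolve_feedback(
    human_feedback: Optional[str],
    llm_feedback_fn: Optional[Callable[[], str]] = None,
) -> str:
    """Pick the feedback_text source: human input first, LLM assist second."""
    if human_feedback and human_feedback.strip():
        return human_feedback.strip()
    if llm_feedback_fn is not None:
        # Only evaluated when no human feedback was supplied.
        return llm_feedback_fn()
    raise ValueError("no feedback available: pass human feedback or an LLM fallback")
```

The returned string can be handed to refine_slides(content, feedback_text, max_retries) unchanged, since the refiner only sees a string either way.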

3. Make "localized rewrite" a selectable strategy, not something hard-wired into the evaluation pipeline

The rewrite should be a reusable capability (manual trigger, human-feedback trigger, optional evaluation trigger) rather than something that can only run on the "evaluate → queue → refine" chain.
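A sketch of what "selectable" could mean in practice: every trigger ends in the same call, and the strategy is looked up by name. The names run_rewrite and RewriteFn, and the idea of a plain dictionary of strategies, are assumptions rather than anything in this branch.

```python
from typing import Callable, Dict

# A rewrite strategy takes (content, feedback_text) and returns new content.
RewriteFn = Callable[[str, str], str]


def run_rewrite(
    strategy: str,
    content: str,
    feedback_text: str,
    strategies: Dict[str, RewriteFn],
) -> str:
    """Dispatch to the named strategy; the caller decides what triggered it."""
    try:
        rewrite = strategies[strategy]
    except KeyError:
        raise ValueError(
            f"unknown strategy {strategy!r}; choose from {sorted(strategies)}"
        ) from None
    return rewrite(content, feedback_text)
```

A manual trigger, a human-feedback trigger, and an optional evaluation trigger would differ only in where feedback_text comes from, not in how the rewrite is invoked.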

4. Minor issues

RefinementEngine.get_content_type's chain of if route == x: return x can be simplified directly.
The dicts returned per route in refine_packet are inconsistent (slides/script/general each differ) — consider normalizing so downstream is easier.
remove_external_resource_feedback filters citation-like feedback sentence-by-sentence with a long regex list; it's brittle — add a comment explaining the intent and boundaries.
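For the first two items, a sketch of one possible simplification. The route names, the "general" fallback, and every field of RefineResult are assumptions, since the actual routes and return dicts are not shown in this thread.

```python
from dataclasses import asdict, dataclass, field

# Assumed route names; replace with the set the runner actually uses.
KNOWN_ROUTES = frozenset({"assessment", "slides", "script"})


def get_content_type(route: str) -> str:
    """Collapse the if-chain: known routes map to themselves."""
    return route if route in KNOWN_ROUTES else "general"


@dataclass
class RefineResult:
    """One result shape for every route (hypothetical fields)."""

    route: str
    refined_content: str
    valid: bool
    attempts: int = 1
    details: dict = field(default_factory=dict)  # route-specific extras

    def to_dict(self) -> dict:
        return asdict(self)
```

Route-specific data (for example, which frames were edited) would go under details, so downstream code can rely on the same top-level keys for slides, scripts, and general content.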

Aarsh-collab added 9 commits May 6, 2026 13:56

Build refinement runner infrastructure pipeline

ca845fc

Add initial refinement orchestration pipeline

10e38f9

Restore unrelated env + config.json

1c8c897

Fixed overall_summary being processed like a normal file group

ec12235

Added validation report to refinement pipeline

5a70133

Added full-page refinement for syllabus and objectives

c4ee163

Integrate evaluation and refinement into main runner

d42752a

wingsweihua requested a review from Hyan-Yao May 28, 2026 04:05

Refiner pipeline #12: Aarsh-collab wants to merge 9 commits into DaRL-GenAI:main from Aarsh-collab:refiner-pipeline
