Checkpoint config-driven env GRPO eval work#92

Draft
ProfSynapse wants to merge 791 commits into main from codex/eval-config-driven-assertions


Conversation

@ProfSynapse

Owner

Summary:

  • Adds the current workspace multistep GRPO projection and refreshed lean SFT datasets/config.
  • Adds workspace multi-turn eval scenario coverage plus focused tests for agentic loops, environment search, response scoping, and stage gates.
  • Adds generic environment execution/scoring support for configured tool aliases so path scoring can match schema-facing commands without hardcoding a toolset.
  • Adds env generation diagnostics, SFT prompt alignment migration, local GRPO image, and PEFT merge helper.

Validation:

  • python -m pytest tests/shared/test_agentic_loop.py tests/shared/test_local_environment_search.py tests/shared/test_workspace_multiturn_scenarios.py tests/synthchat/test_agentic_episode_messages.py tests/synthchat/test_response_scope_message_selection.py tests/synthchat/test_stage_gates.py
  • python .skills/scripts/sync_skill_trees.py --check
  • git diff --check

ProfSynapse and others added 30 commits January 2, 2026 12:24
…u, and animation scenes

- Implemented `LiveEvaluationDashboard` for real-time evaluation metrics display.
- Created `generate_round_flask` function to visually represent a flask shape in terminal.
- Developed interactive menu using `asciimatics` with animated branding and options.
- Added scene creation functions for logo display, training start splash, and celebration animations.
…ation monitoring

- Implemented SynthChatMetrics to track generation progress, including total examples, completed, valid, and invalid counts.
- Created ResultEntry class for logging individual results with status, category, and reason.
- Developed LiveSynthChatDashboard class for displaying metrics and recent results in a user-friendly format.
- Integrated rich console output for enhanced visual representation of progress and results.
- Added methods for updating metrics, building display, and handling live updates.
…m overview

- Implemented ListHandler to manage 'list' subcommands for datasets, models, runs, rubrics, and scenarios.
- Added JSON output support for list commands.
- Created StatusHandler to provide an overview of system state, including environment info, CUDA availability, dependencies, and service connectivity.
- Enhanced output formatting with rich display options for both handlers.
- Delete dead Evaluator parsers (not imported anywhere)
  - Evaluator/response_parser.py (614 lines)
  - Evaluator/tool_call_parser.py (354 lines)

- Remove duplicate SynthChat validators (already using shared/validation/)
  - SynthChat/services/validators/base.py
  - SynthChat/services/validators/structure_validator.py
  - SynthChat/services/validators/cross_scope_validator.py
  - SynthChat/services/validators/content/ (6 files)

- Update CLAUDE.md: replace improvement_engine references with SynthChat
  (improvement_engine/ directory doesn't exist, functionality is in SynthChat)

Total: ~1,200 lines of duplicate/dead code removed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Was looking for Trainers/shared/ui/ but shared UI is at shared/ui/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
* Add Claude Code skills for fine-tuning, evaluation, and upload-deployment; rewrite README

Create 3 new skills with progressive disclosure architecture (lean SKILL.md + focused
reference docs) covering the full pipeline: fine-tuning (SFT/KTO/GRPO, 7 reference docs),
evaluation (CLI, scenarios, backends, 5 reference docs), and upload-deployment (GGUF, merging,
model cards, 4 reference docs). Update synthetic-data-generation skill with reference docs
and helper scripts. Rewrite README as an agentic-first entrypoint with problem/solution
framing and progressive disclosure pattern.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* Add cross-platform AI coding tool compatibility section to README

Document how to use the agent skills with Cursor, Windsurf, Cline, Roo Code,
Amazon Q, JetBrains AI, Augment, Kilo Code, Tabnine, Zed, GitHub Copilot, and
Aider. Include copy commands and platform-specific notes. Broaden framing from
Claude Code-only to any AI coding agent.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* Simplify cross-platform section: add AGENTS.md convention and .skills/ universal folder

Fix platform guidance to mention AGENTS.md entrypoint convention, add universal
.skills/ folder at project root for Zed/Aider/Copilot/others, and streamline table.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* feat: Add parallel worker support for docs-based generation

Enable --workers flag to parallelize document processing in SynthChat.
Previously, parallel workers only applied to non-docs scenarios.

Changes:
- Add ThreadPoolExecutor for docs when workers > 1
- Preserve sequential behavior when workers == 1
- Reuse existing worker pattern for consistency
- Progress reporting works for both parallel and sequential modes

Fixes #1

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* fix: Guard division-by-zero in parallel progress tracking

Fixes bug identified in PR review where `completed/total*100` would crash
when total==0 (e.g., empty docs list or all scenarios not found).

Changes:
- Guard division: pct = (completed/total*100) if total > 0 else 0
- Skip ThreadPoolExecutor creation when no work items
- User feedback: "No work items to process (check scenario names)"

Applied to both parallel paths (docs and non-docs).
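
A minimal sketch of the guard described above. The function name `progress_percent` is hypothetical and used only for illustration; only the conditional expression comes from the commit message.

```python
def progress_percent(completed: int, total: int) -> float:
    """Return the completion percentage, or 0 when there is no work to do."""
    # Same shape as the guard in the commit message: divide only when total > 0.
    return (completed / total * 100) if total > 0 else 0.0
```

With an empty work list, `progress_percent(0, 0)` returns `0.0` instead of raising `ZeroDivisionError`.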

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* refactor: Apply 6 PR review improvements

1. DRY violation: Extract shared _run_parallel_generation() helper
   - Eliminates ~60 lines of duplication between docs/non-docs paths
   - Encapsulates progress tracking, executor, error handling

2. Input validation: Clamp workers to max(1, args.workers)
   - Prevents ValueError from --workers 0 or negative values

3. Output ordering: Sort results by task_id to preserve document order
   - Parallel mode now returns results in input order, not completion order

4. Variable naming: Rename worker_id to task_id throughout
   - More accurate (it's a task counter, not thread identifier)

5. Private method access: Rename _generate_single to generate_single
   - Makes method officially public (called from outside class)

6. BaseException handling: Add try/except with executor cleanup
   - Graceful handling of KeyboardInterrupt, SystemExit
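
Points 2 and 3 above (clamping the worker count and restoring input order) could look roughly like the following. This is a sketch under stated assumptions, not the project's `_run_parallel_generation()`: the name `run_parallel_sketch` and its signature are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Sequence, Tuple


def run_parallel_sketch(
    items: Sequence[str],
    generate: Callable[[str], str],
    workers: int,
) -> List[str]:
    """Run `generate` over `items` in a thread pool and return results in input order."""
    workers = max(1, workers)  # clamp: 0 or negative values fall back to 1
    if not items:
        return []  # skip executor creation when there is nothing to do

    results: List[Tuple[int, str]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # task_id is a task counter (the item's position), not a thread identifier.
        futures = {pool.submit(generate, item): task_id
                   for task_id, item in enumerate(items)}
        for future in as_completed(futures):
            results.append((futures[future], future.result()))

    # Futures finish in completion order; sorting by task_id restores input order.
    results.sort(key=lambda pair: pair[0])
    return [value for _, value in results]
```

Passing `workers=0` behaves like `workers=1` here, which mirrors the `max(1, args.workers)` clamp named in the commit message.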

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

---------

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
* Add Claude Code skills for fine-tuning, evaluation, and upload-deployment; rewrite README

Create 3 new skills with progressive disclosure architecture (lean SKILL.md + focused
reference docs) covering the full pipeline: fine-tuning (SFT/KTO/GRPO, 7 reference docs),
evaluation (CLI, scenarios, backends, 5 reference docs), and upload-deployment (GGUF, merging,
model cards, 4 reference docs). Update synthetic-data-generation skill with reference docs
and helper scripts. Rewrite README as an agentic-first entrypoint with problem/solution
framing and progressive disclosure pattern.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* Add cross-platform AI coding tool compatibility section to README

Document how to use the agent skills with Cursor, Windsurf, Cline, Roo Code,
Amazon Q, JetBrains AI, Augment, Kilo Code, Tabnine, Zed, GitHub Copilot, and
Aider. Include copy commands and platform-specific notes. Broaden framing from
Claude Code-only to any AI coding agent.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* Simplify cross-platform section: add AGENTS.md convention and .skills/ universal folder

Fix platform guidance to mention AGENTS.md entrypoint convention, add universal
.skills/ folder at project root for Zed/Aider/Copilot/others, and streamline table.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* docs: Pin parallel docs workers feature to Working Memory

Added context about PR #55:
- --workers N now supports docs-based generation
- Architecture: _run_parallel_generation() helper, instance isolation
- API change: generate_single() now public
- Input validation: clamps workers to max(1, value)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* gitignore

* chore: Add reference skills and clean up old PACT files

Add comprehensive skill documentation:
- evaluation: Model testing and validation
- fine-tuning: SFT, KTO, GRPO training guides
- synthetic-data-generation: Dataset generation and improvement
- upload-deployment: Model upload and GGUF conversion

Clean up old PACT files now managed by plugin:
- Remove .claude/agents/*.md (now in plugin)
- Remove .claude/commands/PACT/*.md (now in plugin)
- Remove .claude/hooks/*.py (now in plugin)
- Remove .claude/protocols/*.md (now in plugin)
- Remove .claude/skills/pact-* (now in plugin)
- Remove .claude/skills/n8n-* (now in plugin)
- Update .claude/settings.json to remove hook references

Other changes:
- Update CLAUDE.md test output location
- Update SynthChat README and docs_loader
- Add new documentation files

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

---------

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
* Add Claude Code skills for fine-tuning, evaluation, and upload-deployment; rewrite README

Create 3 new skills with progressive disclosure architecture (lean SKILL.md + focused
reference docs) covering the full pipeline: fine-tuning (SFT/KTO/GRPO, 7 reference docs),
evaluation (CLI, scenarios, backends, 5 reference docs), and upload-deployment (GGUF, merging,
model cards, 4 reference docs). Update synthetic-data-generation skill with reference docs
and helper scripts. Rewrite README as an agentic-first entrypoint with problem/solution
framing and progressive disclosure pattern.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* Add cross-platform AI coding tool compatibility section to README

Document how to use the agent skills with Cursor, Windsurf, Cline, Roo Code,
Amazon Q, JetBrains AI, Augment, Kilo Code, Tabnine, Zed, GitHub Copilot, and
Aider. Include copy commands and platform-specific notes. Broaden framing from
Claude Code-only to any AI coding agent.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* Simplify cross-platform section: add AGENTS.md convention and .skills/ universal folder

Fix platform guidance to mention AGENTS.md entrypoint convention, add universal
.skills/ folder at project root for Zed/Aider/Copilot/others, and streamline table.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* docs: Pin parallel docs workers feature to Working Memory

Added context about PR #55:
- --workers N now supports docs-based generation
- Architecture: _run_parallel_generation() helper, instance isolation
- API change: generate_single() now public
- Input validation: clamps workers to max(1, value)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* feat: Add live streaming to SynthChat result writing

Write generated examples to output file immediately as they complete
instead of batching in memory. Prevents data loss on process crashes.

Changes:
- Add StreamingResultWriter class (context manager)
- Thread-safe writes via threading.Lock for parallel mode
- Metadata header written at generation start
- All code paths (docs, parallel, sequential) stream results
- Keep _save_results() for potential future use
- Fix datetime.utcnow() deprecation in new code

Files: SynthChat/run.py (+147, -69 lines)
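
To illustrate the pattern described above, here is a minimal sketch of a lock-protected streaming JSONL writer used as a context manager. It is not the actual `StreamingResultWriter`; the class name `StreamingWriterSketch`, the header fields, and the JSONL layout are assumptions made for the example.

```python
import json
import threading
from datetime import datetime, timezone


class StreamingWriterSketch:
    """Append each result to disk as soon as it is produced (hypothetical sketch)."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._lock = threading.Lock()  # serializes writes from parallel workers
        self._fh = None

    def __enter__(self) -> "StreamingWriterSketch":
        self._fh = open(self._path, "w", encoding="utf-8")
        # Metadata header written once at the start; datetime.now(timezone.utc)
        # is the non-deprecated replacement for datetime.utcnow().
        header = {"type": "metadata",
                  "started_at": datetime.now(timezone.utc).isoformat()}
        self._fh.write(json.dumps(header) + "\n")
        self._fh.flush()
        return self

    def write(self, example: dict) -> None:
        with self._lock:
            self._fh.write(json.dumps(example) + "\n")
            self._fh.flush()  # flush per record so a crash loses at most the current one

    def __exit__(self, exc_type, exc, tb) -> bool:
        self._fh.close()
        return False  # do not swallow exceptions from the generation loop
```

Flushing after every record trades some throughput for durability, which matches the stated goal of preventing data loss on process crashes.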

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

---------

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
commit 8d64dd5
Merge: a80e883 31da4da
Author: ProfSynapse <[email protected]>
Date:   Sat Feb 14 10:36:50 2026 -0500

    Merge main into feat/live-streaming-results

    Resolves merge conflict in SynthChat/run.py by combining:
    - Live streaming results (StreamingResultWriter) from this branch
    - Parallel docs-based generation and _run_parallel_generation() DRY
      helper from main (PR #55)

    Key integration decisions:
    - _run_parallel_generation() accepts optional `writer` param for streaming
    - All 4 code paths stream to disk: parallel docs, sequential docs,
      parallel non-docs, sequential non-docs
    - Uses generate_single() public API and 3-tuple returns from main
    - Preserves graceful shutdown on interrupts from main

    Co-Authored-By: Claude Opus 4.6 <[email protected]>

commit a80e883
Author: ProfSynapse <[email protected]>
Date:   Sat Feb 14 10:25:04 2026 -0500

    feat: Add live streaming to SynthChat result writing

    Write generated examples to output file immediately as they complete
    instead of batching in memory. Prevents data loss on process crashes.

    Changes:
    - Add StreamingResultWriter class (context manager)
    - Thread-safe writes via threading.Lock for parallel mode
    - Metadata header written at generation start
    - All code paths (docs, parallel, sequential) stream results
    - Keep _save_results() for potential future use
    - Fix datetime.utcnow() deprecation in new code

    Files: SynthChat/run.py (+147, -69 lines)

    Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

commit 91c5b22
Author: ProfSynapse <[email protected]>
Date:   Sat Feb 14 10:09:56 2026 -0500

    docs: Pin parallel docs workers feature to Working Memory

    Added context about PR #55:
    - --workers N now supports docs-based generation
    - Architecture: _run_parallel_generation() helper, instance isolation
    - API change: generate_single() now public
    - Input validation: clamps workers to max(1, value)

    Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

commit 553fd38
Author: ProfSynapse <[email protected]>
Date:   Sat Feb 14 09:25:27 2026 -0500

    Simplify cross-platform section: add AGENTS.md convention and .skills/ universal folder

    Fix platform guidance to mention AGENTS.md entrypoint convention, add universal
    .skills/ folder at project root for Zed/Aider/Copilot/others, and streamline table.

    Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

commit ef394f4
Author: ProfSynapse <[email protected]>
Date:   Sat Feb 14 09:21:43 2026 -0500

    Add cross-platform AI coding tool compatibility section to README

    Document how to use the agent skills with Cursor, Windsurf, Cline, Roo Code,
    Amazon Q, JetBrains AI, Augment, Kilo Code, Tabnine, Zed, GitHub Copilot, and
    Aider. Include copy commands and platform-specific notes. Broaden framing from
    Claude Code-only to any AI coding agent.

    Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

commit 92c4863
Author: ProfSynapse <[email protected]>
Date:   Sat Feb 14 09:11:42 2026 -0500

    Add Claude Code skills for fine-tuning, evaluation, and upload-deployment; rewrite README

    Create 3 new skills with progressive disclosure architecture (lean SKILL.md + focused
    reference docs) covering the full pipeline: fine-tuning (SFT/KTO/GRPO, 7 reference docs),
    evaluation (CLI, scenarios, backends, 5 reference docs), and upload-deployment (GGUF, merging,
    model cards, 4 reference docs). Update synthetic-data-generation skill with reference docs
    and helper scripts. Rewrite README as an agentic-first entrypoint with problem/solution
    framing and progressive disclosure pattern.

    Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Increase MAX_RETRIES from 3 to 6 for better handling of transient OpenRouter rate limits
- Add automatic fallback provider chain: OpenRouter → LMStudio → Ollama
- Rewrite _call_with_retry() to iterate through providers with exponential backoff per provider
- Add _switch_to_fallback_provider() helper that swaps llm_client in engine and services
- Fallback client creation uses environment variables (same as primary via LLMConfig.from_env())
- Skip unavailable fallback providers with warning (e.g., missing API key, connection refused)

Context: Recent 96-essay generation hit OpenRouter rate limits with 14 failures.
With 50 workers: reduced to 5 failures. Increased retries + fallback ensures
generation doesn't fail even if OpenRouter is completely down.
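
The retry-then-fall-back behavior described above can be sketched as follows. This is an illustration under assumptions, not the project's `_call_with_retry()`: the function name, the provider tuple shape, the base delay, and the injectable `sleep` parameter are all hypothetical.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


def call_with_fallback_sketch(
    providers: Sequence[Tuple[str, Callable[[], str]]],
    max_retries: int = 6,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, retrying with exponential backoff per provider."""
    last_error: Optional[Exception] = None
    for _name, call in providers:
        for attempt in range(max_retries):
            try:
                return call()
            except Exception as exc:  # broad on purpose: this is only a sketch
                last_error = exc
                if attempt < max_retries - 1:
                    # Delays of base_delay * 1, 2, 4, ... between attempts.
                    sleep(base_delay * 2 ** attempt)
        # Retries exhausted for this provider: move on to the next one in the chain.
    raise RuntimeError("all providers failed") from last_error
```

A chain such as OpenRouter, then LMStudio, then Ollama would be passed as three `(name, callable)` pairs; the backoff counter restarts for each provider.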

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Added `fixture_parser.py` to parse workspace fixture information from system prompts, defining the `EnvironmentFixture` class and related functions.
- Introduced `local_runtime.py` for a filesystem-backed runtime using a temporary local directory, implementing methods for directory and file operations.
- Created `tool_executor.py` to execute parsed tool calls against the environment runtime, supporting various actions and schema-driven execution.
- Defined data types in `types.py` for environment validation, including `EnvironmentIssue`, `ExecutedToolCall`, and `EnvironmentValidationResult`.
- Developed `validator.py` for high-level environment validation, executing tool calls and validating state assertions against runtime.
- Integrated YAML configuration loading for tool schemas and execution settings.
Creates comprehensive essay writing training data from Meditations on Alignment essays.

Dataset Structure:
- 192 total training examples (96 essays × 2 conversation pairs)
- Pair 1: User brainstorm → Assistant structured outline
- Pair 2: User feedback → Assistant full essay with frontmatter

Generation Pipeline:
1. Extracted outlines from original essays
2. Paired outlines with cleaned essays (removed dataview blocks, Obsidian syntax)
3. Generated synthetic user feedback using SynthChat scenario
4. Split into 2-turn conversation pairs for SFT training

Token Distribution:
- 93% under 4K tokens
- Avg pair 1: 905 tokens (brainstorm → outline)
- Avg pair 2: 2,926 tokens (feedback → essay)

Files:
- Datasets/essay_datasets/essay_2turn_pairs.jsonl (final dataset)
- SynthChat/scenarios/essay_feedback.yaml (feedback generation scenario)
- scratch/essay_dataset/*.py (processing scripts)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Fix config.yaml for LiquidAI/LFM2.5-1.2B-Instruct: set load_in_4bit:
  false (LIV conv blocks incompatible with bnb-4bit), correct LoRA
  target_modules (out_proj/in_proj/w1/w2/w3), r=16/alpha=16/dropout=0,
  linear scheduler, warmup_ratio=0.02, batch_size=2, max_seq=4096

- Add validate_model_compatibility() to train_sft.py: extensible
  MODEL_COMPATIBILITY_RULES registry detects LFM2-family models at
  startup and warns if load_in_4bit=true or wrong LoRA target_modules
  are configured — runs before model load so crash is prevented

- Update fine-tuning skill docs (model-presets.md, training-config.md,
  troubleshooting.md) with LFM2.5 architecture-specific overrides and
  SIGABRT/exit-code-6 troubleshooting entry
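
A rough sketch of how a rule registry like the one described for `validate_model_compatibility()` might be shaped. The registry layout, the substring-based family match, and the names `RULES_SKETCH` and `compatibility_warnings` are assumptions for illustration; only the LFM2 constraints (no 4-bit loading, the listed LoRA target modules) come from the commit message.

```python
from typing import Dict, List

# Hypothetical registry: one entry per model family, keyed by a name fragment.
RULES_SKETCH: Dict[str, dict] = {
    "lfm2": {
        "allow_4bit": False,
        "target_modules": {"out_proj", "in_proj", "w1", "w2", "w3"},
    },
}


def compatibility_warnings(model_name: str, config: dict) -> List[str]:
    """Return warnings for settings known to conflict with the model family."""
    warnings: List[str] = []
    for family, rule in RULES_SKETCH.items():
        if family not in model_name.lower():
            continue
        if config.get("load_in_4bit") and not rule["allow_4bit"]:
            warnings.append(f"{family}: load_in_4bit should be false")
        configured = set(config.get("target_modules", []))
        if configured != rule["target_modules"]:
            warnings.append(f"{family}: unexpected LoRA target_modules {sorted(configured)}")
    return warnings
```

Because the check only inspects the model name and the config dictionary, it can run before any weights are loaded.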

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…unPod

- Add CloudTrainHandler with provider selection menu, cost estimates,
  and graceful degradation for uninstalled SDKs
- Add HFJobsBackend with UV/PEP 723 script wrapping (uses existing HF_TOKEN)
- Add ModalBackend with OAuth flow + dual volume caching for model weights
- Add RunPodBackend with pod lifecycle management and always-terminate safety
- Add base_cloud.py: shared helpers (load_cloud_config, poll_until_done, GPU pricing)
- Add cloud_config.yaml: budget/standard/performance GPU tiers across all providers
- Add Trainers/cloud/: modal_train.py standalone app and runpod_sync.py utilities
- Add requirements-cloud.txt for optional cloud dependencies
- Wire cloud command into tuner CLI and main menu

All three backends conditionally register — local-only users unaffected.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Security:
- Replace GH_TOKEN value embedding with $GH_TOKEN shell var in RunPod
  startup command and runpod_sync.py clone URL (credential leak fix)
- Scrub credential URLs from Modal git clone stderr before logging
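
As an illustration of the stderr scrubbing mentioned above, the sketch below redacts the userinfo part of any URL before the text is logged. The function name and the regular expression are assumptions; the actual pattern used in the repository is not shown in this PR description.

```python
import re

# Matches "scheme://<userinfo>@" and captures the scheme so it can be kept.
_CREDENTIAL_URL = re.compile(r"(https?://)[^/@\s]+@")


def scrub_credentials(text: str) -> str:
    """Replace embedded URL credentials with a fixed placeholder."""
    return _CREDENTIAL_URL.sub(r"\1***@", text)
```

For example, `scrub_credentials("fatal: https://ghp_abc123@github.com/org/repo.git not found")` returns `"fatal: https://***@github.com/org/repo.git not found"`, while URLs without credentials pass through unchanged.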

API correctness:
- Rewrite HF Jobs execute() to use correct run_job(image, command, flavor)
  signature; replace _build_uv_script() with _build_training_command()
- Fix job.id and status_obj.stage attribute access for HF Jobs API

Billing safety:
- Add ERROR state detection in RunPod _poll_training (prevents 6hr timeout)
- Add 3-attempt retry with backoff in _terminate_pod (prevents billing leak)
- poll_until_done now raises immediately on persistent errors (auth/not-found)
- Add timeout to Modal subprocess.Popen.wait() with graceful kill

Type correctness:
- modal_backend.load_config() now returns CloudTrainingConfig
- runpod_backend.load_config() now returns CloudTrainingConfig

Other:
- train_modal.py: .options() → .with_options() for Modal >= 0.73.0
- modal_backend: use shared load_cloud_config() instead of inline YAML parse
- cloud_config.yaml: default cloud_type COMMUNITY → SECURE (preemption safety)
- requirements-cloud.txt: add python-dotenv

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…al subprocess timeout

- base_cloud.py: Add consecutive-error counter (max 3) to poll_until_done.
  Persistent errors (unauthorized, not found, forbidden, invalid) raise
  immediately. Too many consecutive transient errors also raise instead
  of polling silently for hours.
- modal_backend.py: Add timeout to subprocess.Popen.wait() based on
  timeout_hours from cloud config. Kills process on TimeoutExpired.
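
The circuit-breaker behavior described for `poll_until_done` might be sketched like this. It is a simplified illustration, not the code in `base_cloud.py`: the signature, the substring matching on error messages, and the choice to raise on the third consecutive transient error are assumptions.

```python
import time
from typing import Callable

PERSISTENT_MARKERS = ("unauthorized", "not found", "forbidden", "invalid")


def poll_until_done_sketch(
    get_status: Callable[[], str],
    is_done: Callable[[str], bool],
    interval: float = 5.0,
    max_consecutive_errors: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until `is_done(status)` is true, failing fast on hopeless errors."""
    consecutive_errors = 0
    while True:
        try:
            status = get_status()
        except Exception as exc:
            if any(marker in str(exc).lower() for marker in PERSISTENT_MARKERS):
                raise  # persistent error: retrying cannot help
            consecutive_errors += 1
            if consecutive_errors >= max_consecutive_errors:
                raise RuntimeError("too many consecutive polling errors") from exc
        else:
            consecutive_errors = 0  # any successful poll resets the counter
            if is_done(status):
                return status
        sleep(interval)
```

Resetting the counter on success means isolated network blips are tolerated, while a sustained outage stops the loop instead of polling silently for hours.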

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- runpod_backend: properly add ERROR+FAILED to _poll_training terminal
  states (prevents 6hr timeout on failed pods)
- runpod_backend: replace single-attempt terminate with 3-attempt retry
  loop with exponential backoff (1s, 2s, 4s) and CRITICAL log on failure
- train_modal.py: confirm stderr URL scrubbing present (re.sub redaction)

poll_until_done circuit breaker and Modal subprocess timeout were
confirmed present in prior commit b198c9f.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…oud backends

122 tests, 89% overall coverage (96-100% on critical billing safety paths).
Tests cover poll_until_done circuit breaker, RunPod pod lifecycle + terminate
retry, ERROR/FAILED state detection, Modal subprocess timeout, HF Jobs API,
and GH_TOKEN credential isolation.

Run: python -m pytest tests/cloud/ -v

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Consolidate GPU pricing into cloud_config.yaml (single source of truth)
- modal_backend: use shared load_cloud_config() instead of inline YAML parse
- train_modal: .options() → .with_options() for Modal >= 0.73.0 API
- runpod_sync: remove duplication with base_cloud, import shared helpers
- CloudTrainHandler: wire gpu_tiers from cloud_config.yaml (dynamic, not hardcoded)
- build_training_startup_command: validate method param (sft/kto only)
- runpod validate_environment: stronger RUNPOD_API_KEY format check
- runpod/sync: conditional $GH_TOKEN@ injection (only when GH_TOKEN is set)
- train_modal: Secret.from_dict() with explicit keys instead of from_dotenv()

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…g, type fixes, validation

1. Move GPU pricing from hardcoded Python dict to cloud_config.yaml (DRY)
2. Modal Secret.from_dict for env vars instead of individual from_name calls
3. Type annotation fixes (Optional return types, str hints)
4. runpod_sync imports resolve_repo_url from base_cloud (eliminates duplicate)
5. CloudTrainHandler reads method labels and gpu_tiers from YAML
6. Method validation guard in build_training_startup_command
7. RunPod API key validation: 32+ chars with alpha check
8. GH_TOKEN clone guard: only inject when token is actually set
9. Modal stderr capture via subprocess.PIPE

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
* feat(evaluator): add optional LLM-as-judge to evaluation pipeline

Introduces a reusable shared/judge/ module and integrates LLM-as-judge
scoring into the Evaluator alongside existing pattern matching. The judge
can be composed with pattern matching using AND, OR, or judge_only modes.

New: shared/judge/ module (generic, reusable)
- models.py: RubricDef, JudgeScore, JudgeResult, JudgeConfig dataclasses
- rubric_loader.py: Load YAML rubric files -> RubricDef instances
- schema_builder.py: Merge rubric output_schemas into combined JSON schema
- judge_service.py: Execute LLM judge via BaseLLMClient.structured_output()
- interaction_logger.py: Thread-safe JSONL logging for KTO training

New: Evaluator/judge_validator.py
- JudgeValidator: renders evaluator-specific template vars, calls JudgeService
- JudgeValidationResult: result dataclass with judge_mode

New: Evaluator/config/rubrics/
- tool_call_quality.yaml: judges tool selection and argument correctness
- response_appropriateness.yaml: judges response clarity and helpfulness

Modified: Evaluator integration
- runner.py: EvaluationRecord.judge field, AND/OR/judge_only status logic,
  AND optimization (skip judge call if pattern match already fails)
- config.py: EvalJudgeConfig dataclass attached to EvaluatorConfig
- reporting.py: judge stats in console/markdown/JSON output
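
The AND/OR/judge_only status logic, including the AND optimization, can be sketched as below. The function name, the lowercase mode strings, and the callable-based judge are assumptions for illustration; the real logic lives in `runner.py`. Whether the OR mode also skips the judge when the pattern match passes is not stated, so this sketch always calls it in that mode.

```python
from typing import Callable


def combined_status(pattern_passed: bool, run_judge: Callable[[], bool], mode: str) -> bool:
    """Combine the pattern-match result with an LLM-judge verdict."""
    if mode == "and":
        # AND optimization: `and` short-circuits, so the judge call is skipped
        # entirely when the pattern match has already failed.
        return pattern_passed and run_judge()
    if mode == "or":
        judge_passed = run_judge()
        return pattern_passed or judge_passed
    if mode == "judge_only":
        return run_judge()
    raise ValueError(f"unknown judge mode: {mode!r}")
```

Passing the judge as a callable rather than a precomputed boolean is what makes the skipped call possible.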

Note: Evaluator/cli.py (--judge* flags) committed separately due to
pre-existing false positive in hook security scan (line 625).

72/72 unit tests pass. Architecture doc: docs/architecture/llm-judge-integration.md

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* feat(evaluator): implement LLM-as-judge evaluation with configurable options

* fix(cli): clarify HuggingFace authentication message for better user guidance

---------

Co-authored-by: Claude Sonnet 4.6 <[email protected]>
ProfSynapse and others added 29 commits April 21, 2026 07:13
- Implement `filter_lora_adapter.py` to filter LoRA adapter directories based on tensor key substrings.
- Create `manage_space.py` for rendering, deploying, and managing Hugging Face Spaces with support for various configurations.
- Add Dockerfile and README template for the `vllm_warm` space, including entrypoint and sync script for adapter management.
- Introduce `manual_lora_merge.py` for merging LoRA deltas into models locally.
- Add tests for filtering LoRA adapters and managing spaces, ensuring functionality and correctness.
- Introduced `06_migrate_cli_schema_datasets.py` to migrate non-thinking datasets to the CLI-oriented schema, including argument normalization and command rendering.
- Added `08_inventory_synthchat_cli_schema_refs.py` to inventory SynthChat configuration references needing CLI-schema alignment, with special pattern detection.
- Created `cli_schema_rules.py` for classification rules used in the migration pipeline, defining in-scope agents and tool classifications.
- Implemented `cli_schema_utils.py` with shared helper functions for dataset migration, including loading catalogs, validating call shapes, and rendering CLI commands.
Align SynthChat and Evaluator with config-driven CLI formats
…tool-datasets

Regenerate non-thinking tool datasets and merge latest outputs
- Introduced a new YAML configuration file for SFT training of the Qwen 3.5 9B model.
- Set up dataset source and training parameters including batch size, learning rate, and LoRA settings.
- Disabled evaluation and loss tracking for initial training phase.
- Enabled feature formats for output data.
…ontainer mode

Adds a new `local-run` command that runs SFT/KTO inside Docker on a local GPU without
the usual UID/GID permission headaches, with the asciimatics dashboard visible inside
the container, and an optional persistent-container mode that caches pip installs,
HuggingFace model downloads, and triton compile output across repeat runs.

Three workstreams, all config-driven via `Trainers/local/jobs/*.yaml`:

1. **UID-agnostic** — defaults to `-u 0:0` inside the container with a bash trap that
   chowns artifacts back to the host uid/gid on EXIT. Handles bind and copy transfer
   modes; detects WSL drvfs and warns when POSIX metadata isn't enabled.

2. **TTY-aware** — allocates `-i -t` when stdout is a tty so the asciimatics dashboard
   renders inside the container. `job.tty: auto|always|never`.

3. **Persistent container** — `job.persist: true` keeps a long-lived container per job
   config so repeat runs skip pip install + model download. Uses `--init` (tini) for
   clean ctrl-C signal propagation, plus HF and pip cache mounts. Managed via
   `--stop`, `--rm-persistent`, `--container-status`.
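
The `job.tty: auto|always|never` setting from workstream 2 could resolve to Docker flags roughly as follows. This is a sketch only: the function name `tty_flags` and the injectable `stdout_is_tty` parameter are hypothetical, and the real resolver may handle more cases.

```python
import sys
from typing import List, Optional


def tty_flags(mode: str, stdout_is_tty: Optional[bool] = None) -> List[str]:
    """Map a tty mode to the `docker run` flags that allocate an interactive TTY."""
    if mode not in ("auto", "always", "never"):
        raise ValueError(f"unknown tty mode: {mode!r}")
    if stdout_is_tty is None:
        stdout_is_tty = sys.stdout.isatty()
    if mode == "always" or (mode == "auto" and stdout_is_tty):
        return ["-i", "-t"]
    return []  # "never", or "auto" when stdout is redirected
```

In `auto` mode, piping output to a file or running under CI yields no TTY flags, so `docker run` does not fail for lack of a terminal.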

Also rolls up a small cleanup of stale `Trainers/rtx3090_sft` / `rtx3090_kto` references
in docstrings and docs to match the actual `Trainers/sft` / `Trainers/kto` layout, and
removes the fully orphaned `docs/prep/local-training/rtx3090-kto-finetuning.md`.

Reference: `Trainers/local/jobs/qwen35_2b_sft_smoke.yaml` (fast smoke config) and
`qwen35_2b_sft_2epoch.yaml` (full 2-epoch SFT). Config keys documented in
`.skills/fine-tuning/reference/training-config.md`. Troubleshooting in
`docs/troubleshooting.md`.

Tests: 148 unit tests covering uid-agnostic helpers, TTY resolution, persistent
container lifecycle, cache mounts, and pip marker-hash guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
feat(local-run): uid-agnostic local Docker training + persistent-container mode
…ndings

Skill content updates (all three synced trees: .skills, .agents/skills, .claude/skills):
- `fine-tuning/SKILL.md` gains a "Local Docker config run" row and a preference note
  for `python tuner.py local-run --job-config <yaml>` over ad-hoc `docker run` for
  repeatable local GPU training; adds `Trainers/local/jobs/` + `Trainers/archive/legacy_rtx3090/`
  to the directory map; records the 2026-04-22 `unsloth/unsloth:latest` digest +
  transformers pin guidance for Qwen3.5.
- `fine-tuning/reference/sft-training.md` gains a config-driven local Docker SFT
  smoke-run example pointing at `Trainers/local/jobs/qwen35_2b_sft_smoke.yaml`.
- `fine-tuning/reference/grpo-training.md`, `upload-deployment/*` — small path
  updates (`rtx3090_sft` / `kto_output_rtx3090` -> `sft` / `kto_output`) to match
  the actual `Trainers/` layout.

Docs path cleanup:
- QUICKSTART, INSTALLATION_GUIDE, PROJECT_OVERVIEW, EVOLUTIONARY_FINETUNING,
  SYNTH_CHAT_*, NEBIUS_*, ml-pipeline-*, etc. — stale `rtx3090_sft` / `rtx3090_kto`
  / `kto_output_rtx3090` references swapped for the canonical `sft` / `kto` /
  `kto_output` paths.

Line-ending hygiene:
- Many docs + skill mirrors were stored in HEAD with CRLF. Working tree has been
  LF for a while; this commit normalizes them in the index so the working tree and
  repo agree. `.gitattributes` already specified `* text=auto` — this just cleans up
  files that were committed pre-normalization.

Verified: `python3 .skills/scripts/sync_skill_trees.py --check` reports "Skill trees are in sync."

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
chore(skills+docs): document local-run in skills, normalize LF line endings
- Added PrivacyPreprocessor for orchestrating privacy-related text processing.
- Introduced Pseudonymizer for replacing sensitive information with synthetic values.
- Created a smoke runbook for testing privacy features with synthetic data.
- Developed unit tests for privacy preprocessing and pseudonymization functionalities.
- Included sample privacy fixture documents and JSONL datasets for testing.
- Enhanced existing services to integrate privacy preprocessing and pseudonymization.
…#85)

* fix(trainers): epoch counter in TUI dashboard + DRY callback refactor

Two related changes bundled:

1. Fix: HuggingFace Trainer emits `logs['epoch']` as a float; three callback
   classes were casting it to int before `dashboard.update(epoch=...)`,
   truncating sub-epoch progress to 0 until each full epoch completed. Cast
   is now `float(...)`, matching the dashboard's internal type and the JSONL
   log-reader path.

2. Refactor: the duplication surfaced by the fix (identical bug in three
   places) motivated a DRY pass. Extracted a shared `Trainers/shared/
   callbacks/` package — BaseMetricsCallback + BaseLiveDashboardCallback,
   HealthChecker strategy with SFT/KTO/NoOp concrete subclasses, hoisted
   TwoStageLRCallback and CheckpointMonitorCallback. Per-trainer
   `training_callbacks.py` files reduced to thin subclassing shims that
   re-export the same public symbols at the same paths — zero caller edits.
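
The truncation described in item 1 can be reproduced in isolation. A minimal sketch, assuming a hypothetical `FakeDashboard` stand-in for the real dashboard; only the `logs['epoch']` key and the `update(epoch=...)` call come from the description above:

```python
class FakeDashboard:
    """Hypothetical stand-in for the TUI dashboard; records the last epoch."""

    def __init__(self):
        self.epoch = 0.0

    def update(self, epoch):
        self.epoch = epoch


def on_log_buggy(dashboard, logs):
    # Pre-fix behaviour: int() truncates 0.37 to 0, hiding sub-epoch progress.
    dashboard.update(epoch=int(logs.get("epoch", 0)))


def on_log_fixed(dashboard, logs):
    # Post-fix behaviour: keep the float value the Trainer reported.
    dashboard.update(epoch=float(logs.get("epoch", 0)))


logs = {"epoch": 0.37, "loss": 1.92}  # illustrative values
buggy, fixed = FakeDashboard(), FakeDashboard()
on_log_buggy(buggy, logs)
on_log_fixed(fixed, logs)
print(buggy.epoch, fixed.epoch)
```

With the `int` cast the dashboard stays at epoch 0 until the first full epoch completes; with `float` it tracks fractional progress.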

Net LOC: -383 overall (callback files -1066, shared package +683).
21 new unit tests (tests/trainers/test_callback_refactor.py) cover the
four HIGH/MEDIUM uncertainties from the architect + coder HANDOFFs.

Intentional additive behavior change: KTO and GRPO now gain env-fallback
cloud-provider resolution (CLOUD_PROVIDER env var takes precedence over
getattr(args, "cloud_provider")). SFT behavior unchanged. See design
doc §6 risk matrix at docs/architecture/training-callbacks-refactor.md.
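
For illustration, the env-over-args precedence might look roughly like the sketch below. The function name `resolve_cloud_provider`, the `None` fallback, and treating an empty env value as unset are assumptions; only `CLOUD_PROVIDER` and `getattr(args, "cloud_provider")` are taken from the description above.

```python
import os
from types import SimpleNamespace


def resolve_cloud_provider(args, environ=None):
    """Return CLOUD_PROVIDER from the environment if set, else args.cloud_provider.

    Hypothetical helper: `environ` is injectable so the lookup can be tested
    without touching the real process environment.
    """
    environ = os.environ if environ is None else environ
    from_env = environ.get("CLOUD_PROVIDER")
    if from_env:  # assumption: an empty string counts as "not set"
        return from_env
    return getattr(args, "cloud_provider", None)


args = SimpleNamespace(cloud_provider="provider-from-args")
print(resolve_cloud_provider(args, {"CLOUD_PROVIDER": "provider-from-env"}))
print(resolve_cloud_provider(args, {}))
```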

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(trainers): remediate B1+B2 from PR #85 review — SFT cadence + GRPO dict-merge

Addresses two blocking behavior regressions surfaced by the architect +
backend-coder review of the callback refactor:

B1 — SFT cadence drift: health_checker.check() and last_log_time update
were firing on every on_log call for SFT instead of only at interval-gated steps.
Pre-refactor SFT returned early on the modulo check before reaching either. Fixed via two
new class attrs on BaseMetricsCallback:

  health_check_every_on_log: bool = True          # KTO/GRPO default
  interval_time_updates_every_on_log: bool = True # KTO/GRPO default

SFT's MetricsTableCallback overrides both to False — the two lines now
only run inside the should_write_jsonl branch for SFT. KTO/GRPO
unchanged (both originals called health-check and updated last_log_time
every on_log).
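
A simplified sketch of how the two flags could gate the cadence. The counters, the `log_interval` value, and the shape of `on_log` are assumptions made for illustration; the real callbacks receive Trainer arguments and do more work.

```python
class BaseMetricsCallback:
    """Simplified sketch; counters stand in for the real side effects."""

    health_check_every_on_log = True           # KTO/GRPO default
    interval_time_updates_every_on_log = True  # KTO/GRPO default
    log_interval = 5                           # hypothetical step interval

    def __init__(self):
        self.health_checks = 0
        self.time_updates = 0
        self.rows_written = 0

    def on_log(self, step):
        should_write_jsonl = step % self.log_interval == 0
        # Each side effect runs on every call, or only on gated steps.
        if self.health_check_every_on_log or should_write_jsonl:
            self.health_checks += 1
        if self.interval_time_updates_every_on_log or should_write_jsonl:
            self.time_updates += 1
        if should_write_jsonl:
            self.rows_written += 1


class MetricsTableCallback(BaseMetricsCallback):
    """SFT: both side effects only happen on interval-gated steps."""

    health_check_every_on_log = False
    interval_time_updates_every_on_log = False


kto_like, sft = BaseMetricsCallback(), MetricsTableCallback()
for step in range(1, 11):
    kto_like.on_log(step)
    sft.on_log(step)
print(kto_like.health_checks, sft.health_checks)
```

Keeping the switch as class attributes means each trainer states its cadence declaratively and the base `on_log` needs no per-trainer branching.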

B2 — GRPO dict-merge precedence: original GRPO built the JSONL row with
our-fields-win semantics (`entry = dict(logs); entry[k]=v; entry.update(cap)`);
SFT and KTO originals used logs-win (`{**our_fields, **capacity, **logs}`).
New base was uniformly logs-win, silently flipping GRPO. Fixed via
per-trainer class attr:

  fields_win_on_collision: bool = False  # SFT/KTO default

GRPO's LiveDashboardCallback overrides to True. _write_log_row branches
on the attr to emit the correct spread order per trainer. All three
trainers now preserve pre-refactor behavior byte-exact for key
collisions (no current HF log key collides with our fields in default
training, so practical impact is zero, but the refactor must not
silently change semantics).
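
The two spread orders differ only when a key collides. A minimal sketch, assuming a hypothetical free function `build_row` in place of the real `_write_log_row` method and made-up field values:

```python
def build_row(logs, our_fields, capacity, fields_win_on_collision=False):
    """Build one JSONL row; the flag picks who wins on a key collision."""
    if fields_win_on_collision:
        # GRPO-style: start from logs, then overwrite with our fields and capacity.
        entry = dict(logs)
        for k, v in our_fields.items():
            entry[k] = v
        entry.update(capacity)
        return entry
    # SFT/KTO-style: logs are spread last, so they win.
    return {**our_fields, **capacity, **logs}


logs = {"step": 10, "loss": 0.5}             # illustrative HF-style log dict
our_fields = {"step": 99, "timestamp": "t0"}  # "step" collides on purpose
capacity = {"gpu_mem_gb": 12.0}

logs_win = build_row(logs, our_fields, capacity)
fields_win = build_row(logs, our_fields, capacity, fields_win_on_collision=True)
print(logs_win["step"], fields_win["step"])
```

Both rows contain the same set of keys; only the value stored under the colliding `step` key changes.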

Design doc §6a + §6b updated with correction notes acknowledging the
earlier "unify on SFT/KTO style" HANDOFF was based on a misread of the
pre-refactor GRPO build order.

Test updates: TestDictMergeOrder split into per-trainer assertions
(GRPO fields-win, KTO logs-win). All 22 tests green.

Review reports included under docs/review/pr-85-*.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* refactor(trainers): PR #85 minor/future remediation (code + tests)

Code (backend-coder-reviewer):
- M-A: drop duplicate banner line in BaseMetricsCallback.on_train_end
- M-D: add log_write_swallow_errors to BaseLiveDashboardCallback JSONL write
- M-E: rename per-trainer module docstrings ('shims' -> concrete subclasses)
- M-F: consolidate sys.path.insert into Trainers/shared/callbacks/__init__.py
- M-G: document _annotate_cloud dual-call-site contract
- M-H: document total_epochs=1 sentinel
- F-E: remove no-op CheckpointMonitorCallback.on_save
- F-F: cosmetic cleanup (redundant NoOpHealthChecker assignment, return/pass)

Tests (test-engineer):
- M-J: _dashboard_metrics fallback coverage (KTO kl, GRPO reward chain)
- F-A: HealthChecker output-format snapshot tests
- F-B: capture_runtime_capacity_snapshot GPU-branch tests
- F-C: suppress_training_logs context-manager tests
- F-D: SFT JSONL shape-parity baseline (strip-list + type-stability)

All 52 tests in test_callback_refactor.py pass. All post-review minor/future items
were addressed per user decisions from the peer-review workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
…-option

feat: add privacy preprocess and OPF integration
[codex] Make evaluator assertions config driven
Consolidate training configs from Trainers/local/jobs/ and
Trainers/cloud/jobs/ into a single Trainers/recipes/ directory.

- Add tuner/discovery/recipes.py with RecipeMeta, list_recipes(),
  and load_recipe() (supports target:both deep-merge)
- Migrate 16 recipe YAMLs via git mv, adding target: and method: fields
- Update local_run_handler and cloud_run_handler to use discovery module
- Add local-run and cloud-run entries to TUI main menu
- Update 27 references across skills, docs, READMEs, and tests
- Fix stale hf_jobs_hardware.py reference in cloud-training.md
- Sync skill mirrors via sync_skill_trees.py
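
As an illustration of the deep-merge behaviour (nested dicts merged key by key, lists and scalars replaced wholesale), here is a minimal sketch. The `deep_merge` helper and the recipe keys are hypothetical; the actual layout of a `target: both` recipe and the internals of `load_recipe()` are not shown here.

```python
def deep_merge(base, override):
    """Return a new dict with `override` merged into `base`.

    Nested dicts are merged recursively; any other value (scalars, lists)
    in `override` replaces the value in `base`. Inputs are not modified.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # lists are replaced, not concatenated
    return merged


# Hypothetical recipe content, for illustration only.
shared = {
    "method": "sft",
    "training": {"lr": 2e-4, "batch_size": 4, "modules": ["q_proj", "v_proj"]},
}
local_override = {"training": {"batch_size": 2, "modules": ["q_proj"]}}

resolved = deep_merge(shared, local_override)
print(resolved["training"])
```

Replacing lists outright keeps an override predictable: a target-specific block states the full list it wants and never inherits stray entries.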

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add tests/discovery/test_recipes.py with 34 tests covering:
  list_recipes filtering (target, method, combined), robustness
  (malformed YAML, missing fields, non-dict files), load_recipe
  deep-merge (sub-block precedence, list replacement, nesting),
  handler integration (path verification), reference completeness
- Fix stale CLI help text in parser.py:119-120 referencing old
  Trainers/local/jobs/ and Trainers/cloud/jobs/ paths

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Move `from huggingface_hub import sync_bucket` from top-level to
inside `pull_artifacts()` where it's actually used. The top-level
import killed the entire CLI (TUI, local-run, cloud-run) in conda
environments where huggingface_hub doesn't export sync_bucket
(e.g., unsloth_latest with huggingface_hub 0.36.0).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
fix: lazy-import sync_bucket to unblock CLI
Snapshot of in-progress work prior to merging origin/main (recipe system).
To be reorganized into proper commits later.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ven-assertions

# Conflicts:
#	.agents/skills/fine-tuning/SKILL.md
#	.claude/skills/fine-tuning/SKILL.md
#	.skills/fine-tuning/SKILL.md
#	.skills/fine-tuning/reference/cloud-training.md

2 participants