
Add Docker-first local runtime bootstrap and eval flow#80

Open
ProfSynapse wants to merge 763 commits into main from codex/docker-local-runtime-bootstrap

Conversation

@ProfSynapse

Owner

Summary

  • add a first-class local Docker runtime with docker status, docker bootstrap, docker pull, docker smoke, and bucket-helper build support
  • route local train/eval through Docker images, add bucket-helper isolation for modern HF Buckets support, and auto-load root .env
  • make pulled bucket adapters discoverable in local eval flows, document the Docker-first bootstrap path, and sync the canonical fine-tuning skill
  • fix Docker vLLM readiness polling during startup and force UTF-8 CLI output on Windows to avoid interactive Rich encoding crashes
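
The readiness-polling fix in the last bullet could look roughly like the sketch below. This is an illustration under stated assumptions, not the PR's code: `wait_until_ready`, the `probe` callable, and the timeout values are hypothetical names chosen here. In a real setup the probe would make an HTTP request against the containerized server.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses.

    Exceptions from the probe (for example a refused connection while the
    server is still starting) are treated as "not ready yet".
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # server is not accepting connections yet
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical probe that fails twice before succeeding.
attempts = {"n": 0}


def fake_probe() -> bool:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("not up yet")
    return True


ready = wait_until_ready(fake_probe, timeout_s=5.0, interval_s=0.0)
```

Swallowing probe errors during startup is the design point: a connection failure before the deadline means "keep waiting", not "the runtime is broken".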

Verification

  • python .skills/scripts/sync_skill_trees.py
  • python .skills/scripts/sync_skill_trees.py --check
  • AST parse for updated Python files
  • python tuner.py eval --json --runtime docker
  • python tuner.py docker bootstrap --json --docker-target all
  • real end-to-end python tuner.py eval --runtime docker against pulled run toolset-training-artifacts/runs/hf_jobs/sft/20260321_191536-07065b91/final_model

Notes

  • the end-to-end Docker eval completed successfully; the CLI exited nonzero only because the model failed 1 of 27 eval cases, not because the runtime failed
  • left unrelated untracked file docs/plans/docker-first-local-runtime-plan.md out of this PR

claude and others added 30 commits December 24, 2025 13:03
Core evolutionary module (shared/evolutionary/):
- CandidateGenerator: Generates gradient update candidates
- EvolutionaryTrainerWrapper: Wraps HuggingFace trainers with ES
- Three strategies: gradient_noise, scale_variation, combined
- GradientCandidate dataclass for candidate representation

Fitness evaluation (shared/validation/fitness.py):
- FitnessEvaluator: Config-driven fitness scoring (0.0-1.0)
- Wraps parsing + validation layers from Phase 1
- Scoring methods: binary, error_count, error_penalty

Training integration:
- Added evolutionary config to config.yaml
- Updated config_loader.py with EvolutionaryConfig dataclass
- Integrated EvolutionaryTrainerWrapper into train_sft.py
- Created example fitness config (configs/fitness/tool_calling.yaml)

Documentation:
- Updated EVOLUTIONARY_FINETUNING.md to mark Phase 2 as complete
- Added usage guide with quick start instructions

The evolutionary training is opt-in via `evolutionary.enabled: true`
in the YAML config. When enabled, training generates N candidate
gradient modifications, evaluates each using fitness scoring, and
applies only the best candidate.
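
The generate, score, select loop described above can be sketched in plain Python. This is a toy under stated assumptions, not the repository's `CandidateGenerator`: gradients are flat lists of floats, only a noise-based strategy is shown, and `Candidate`, `make_candidates`, `select_best`, and `toy_fitness` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """A modified gradient plus the fitness it achieved (hypothetical shape)."""
    gradient: List[float]
    fitness: float = 0.0


def make_candidates(grad: Sequence[float], n: int, noise_std: float,
                    rng: random.Random) -> List[Candidate]:
    """Perturb the base gradient with Gaussian noise; candidate 0 is unmodified."""
    candidates = [Candidate(list(grad))]
    for _ in range(n - 1):
        candidates.append(
            Candidate([g + rng.gauss(0.0, noise_std) for g in grad])
        )
    return candidates


def select_best(candidates: List[Candidate],
                fitness_fn: Callable[[List[float]], float]) -> Candidate:
    """Score every candidate and return the one with the highest fitness."""
    for c in candidates:
        c.fitness = fitness_fn(c.gradient)
    return max(candidates, key=lambda c: c.fitness)


# Toy fitness in (0, 1]: higher when the gradient is closer to a target.
target = [1.0, -2.0]


def toy_fitness(g: List[float]) -> float:
    dist = sum((a - b) ** 2 for a, b in zip(g, target)) ** 0.5
    return 1.0 / (1.0 + dist)


rng = random.Random(0)
cands = make_candidates([0.8, -1.5], n=4, noise_std=0.1, rng=rng)
best = select_best(cands, toy_fitness)
```

Keeping the unmodified gradient as candidate 0 means the selection step can never do worse, by the fitness measure, than the standard update.
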
Consolidated all agent tool schemas from SynthChat rubrics:
- vaultManager: 9 tools (listDirectory, moveNote, deleteFolder, etc.)
- contentManager: 8 tools (readContent, createContent, replaceByLine, etc.)
- vaultLibrarian: 3 tools (searchContent, searchDirectory, searchMemory)
- memoryManager: 12 tools (sessions, states, workspaces)
- agentManager: 8 tools (createAgent, generateImage, executePrompts, etc.)
- commandManager: 2 tools (listCommands, executeCommand)

Validates same schema structure as training data:
- function.name = "useTools"
- arguments.context: workspaceId, sessionId, memory, goal
- arguments.calls: [{agent, tool, params}]
- Per-agent _required params enforcement
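
A minimal validator for the structure listed above might look like this. It assumes `arguments` is nested under `function` and already parsed into a dict, and the `REQUIRED_PARAMS` table (including the `filePath` parameter) is hypothetical; the real per-agent requirements live in the consolidated schemas.

```python
from typing import Any, Dict, List

CONTEXT_FIELDS = ("workspaceId", "sessionId", "memory", "goal")
CALL_FIELDS = ("agent", "tool", "params")

# Hypothetical required-parameter table keyed by (agent, tool).
REQUIRED_PARAMS = {("contentManager", "readContent"): ("filePath",)}


def validate_use_tools(call: Dict[str, Any]) -> List[str]:
    """Return human-readable errors; an empty list means the call is valid."""
    errors: List[str] = []
    function = call.get("function") or {}
    if function.get("name") != "useTools":
        errors.append("function.name must be 'useTools'")

    arguments = function.get("arguments") or {}
    context = arguments.get("context") or {}
    for field in CONTEXT_FIELDS:
        if field not in context:
            errors.append(f"context.{field} is missing")

    calls = arguments.get("calls")
    if not isinstance(calls, list) or not calls:
        errors.append("calls must be a non-empty list")
        return errors

    for i, item in enumerate(calls):
        for field in CALL_FIELDS:
            if field not in item:
                errors.append(f"calls[{i}].{field} is missing")
        key = (item.get("agent"), item.get("tool"))
        params = item.get("params") or {}
        for name in REQUIRED_PARAMS.get(key, ()):
            if name not in params:
                errors.append(f"calls[{i}].params.{name} is required")
    return errors


good = {"function": {"name": "useTools", "arguments": {
    "context": {"workspaceId": "w", "sessionId": "s", "memory": "", "goal": "read"},
    "calls": [{"agent": "contentManager", "tool": "readContent",
               "params": {"filePath": "a.md"}}],
}}}
bad = {"function": {"name": "useTools", "arguments": {
    "context": {"workspaceId": "w", "sessionId": "s", "memory": ""},
    "calls": [{"agent": "contentManager", "tool": "readContent", "params": {}}],
}}}
```
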
During warmup, training proceeds normally (standard gradient descent).
After warmup_steps, evolutionary gradient selection kicks in.

This solves the cold-start problem: the model needs to learn basic
tool call patterns before evolutionary optimization can provide
a meaningful fitness signal.

Default: 200 warmup steps (configurable in config.yaml)
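
The warmup gate reduces to a single predicate. In this sketch the function name is hypothetical, and whether step `warmup_steps` itself already counts as evolutionary is an assumption; the commit message does not say.

```python
def use_evolutionary_step(global_step: int, enabled: bool,
                          warmup_steps: int = 200) -> bool:
    """True once warmup is over and evolutionary selection should run.

    Assumes steps are counted from 0, so steps 0..warmup_steps-1 use
    standard gradient descent.
    """
    return enabled and global_step >= warmup_steps
```
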
- SynthChat: 75 files (refactored services, validators, rubrics)
- Datasets: 42 files (non_thinking tools datasets updates)
- shared: 39 files (validation, utilities, upload converters)
- Evaluator: 23 files (config, clients, validators)
- tuner: 12 files (handlers, backends, CLI)
- Trainers: config updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Resolution strategy:
- Accept main: MLC backend support (Evaluator, tuner, webgpu converter)
- Accept main: Updated datasets (12.24.25)
- Accept main: Config yaml updates (prompts, tool_schema)
- Keep ours: Evolutionary training config section in Trainers/config.yaml
- Delete: Old vaultManager dataset files (v1.5-v1.19)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add split_for_gspo.py tool for dataset splitting
- Implement weighted scoring for context, tool, and params matching
- Update reward rubrics for YAML-driven configuration
- Configure 4x4 batch size for faster training

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- eval_handler: Discover GRPO runs alongside SFT/KTO
- eval_handler: Display reward metrics for GRPO checkpoints
- unsloth_backend: Search grpo_output_rtx3090 for LoRA adapters
- llamacpp_backend: Search grpo_output_rtx3090 for GGUF models
- mlc_backend: Search grpo_output_rtx3090 for WebGPU models
- All backends: Detect 'grpo' trainer type from path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The lora_path field was defined in config.yaml but missing from the
ModelConfig dataclass, causing it to be silently ignored. This meant
GRPO training was using the base model instead of the SFT-trained
checkpoint, resulting in 0.0 rewards (the base model had never learned the tool-call format).
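
One way to keep this class of bug from recurring is to reject unknown keys when building a config dataclass, instead of dropping them. The sketch below is a generic illustration, not the repository's `config_loader.py`; `load_strict` and the `model_name` field are hypothetical, and only `lora_path` comes from the commit.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Type, TypeVar

T = TypeVar("T")


@dataclass
class ModelConfig:
    """Hypothetical stand-in; only `lora_path` is named in the commit."""
    model_name: str
    lora_path: Optional[str] = None


def load_strict(cls: Type[T], raw: Dict[str, Any]) -> T:
    """Build `cls` from `raw`, raising on keys the dataclass does not define."""
    known = {f.name for f in fields(cls)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise KeyError(f"unknown config keys for {cls.__name__}: {unknown}")
    return cls(**raw)


cfg = load_strict(ModelConfig, {"model_name": "base", "lora_path": "out/sft"})
```

With a loader like this, a YAML key that has no matching field fails loudly at startup rather than surfacing later as zero rewards.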

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…TrainingProgressDisplay for unified output across backends
…perations, and enhance descriptions for clarity
…u, and animation scenes

- Implemented `LiveEvaluationDashboard` for real-time evaluation metrics display.
- Created `generate_round_flask` function to visually represent a flask shape in terminal.
- Developed interactive menu using `asciimatics` with animated branding and options.
- Added scene creation functions for logo display, training start splash, and celebration animations.
…ation monitoring

- Implemented SynthChatMetrics to track generation progress, including total examples, completed, valid, and invalid counts.
- Created ResultEntry class for logging individual results with status, category, and reason.
- Developed LiveSynthChatDashboard class for displaying metrics and recent results in a user-friendly format.
- Integrated rich console output for enhanced visual representation of progress and results.
- Added methods for updating metrics, building display, and handling live updates.
claude and others added 29 commits March 27, 2026 00:59
Config-driven integration adding pivot filtering (variance-gated data
selection) and functional equivalence rewards as optional GRPO modes.
~335 lines new Python, 3 new files, 4 edited files. Reuses existing
reward system, dataset loader, and per_example_loss batch patterns.

https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ
Phase 1: pivot_profiler.py — extracts (state, action) candidates from
SFT trajectories, generates N rollouts per turn, scores with existing
reward system, filters by variance threshold. Includes SHA1-based caching.
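
The variance gate and the cache key from Phase 1 could be sketched as follows. The commit does not say what the SHA1 hash covers or how variance is computed, so hashing the (state, action) pair, using population variance, and the names `cache_key` and `filter_pivots` are all assumptions.

```python
import hashlib
import json
from statistics import pvariance
from typing import Dict, List


def cache_key(state: str, action: str) -> str:
    """SHA1 of a canonical JSON encoding of the (state, action) pair."""
    payload = json.dumps({"state": state, "action": action}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def filter_pivots(rewards_by_turn: Dict[str, List[float]],
                  min_variance: float) -> List[str]:
    """Keep turns whose rollout rewards vary by at least `min_variance`.

    A turn where every rollout scores the same carries no learning signal
    for a relative-reward method, so it is dropped.
    """
    return [
        key for key, rewards in rewards_by_turn.items()
        if len(rewards) > 1 and pvariance(rewards) >= min_variance
    ]


rollout_rewards = {
    "turn_a": [1.0, 1.0, 1.0, 1.0],  # always solved: zero variance
    "turn_b": [0.0, 1.0, 0.0, 1.0],  # sometimes solved: variance 0.25
}
kept = filter_pivots(rollout_rewards, min_variance=0.05)
```
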

Phase 2: functional_verifier.py — normalized tool-call comparison reward
that accepts functionally equivalent actions (arg reorder, type coercion,
path normalization). Plugs into existing custom reward mechanism.
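
A functional-equivalence reward along these lines normalizes both tool calls before comparing them. The sketch is not `functional_verifier.py`; the coercion rules, the `tool`/`args` dict shape, and the `filePath` parameter are assumptions chosen for illustration.

```python
import posixpath
from typing import Any, Dict


def normalize_value(value: Any) -> Any:
    """Coerce bool-like and numeric strings, and tidy path-like strings."""
    if isinstance(value, dict):
        return {k: normalize_value(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_value(v) for v in value]
    if isinstance(value, bool):  # must precede the int check below
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        text = value.strip()
        if text.lower() in ("true", "false"):
            return text.lower() == "true"
        try:
            return float(text)
        except ValueError:
            pass
        if "/" in text or "\\" in text:
            return posixpath.normpath(text.replace("\\", "/"))
        return text
    return value


def equivalence_reward(predicted: Dict[str, Any],
                       expected: Dict[str, Any]) -> float:
    """1.0 if tool names match and normalized arguments are equal, else 0.0."""
    if predicted.get("tool") != expected.get("tool"):
        return 0.0
    same = normalize_value(predicted.get("args", {})) == \
        normalize_value(expected.get("args", {}))
    return 1.0 if same else 0.0


predicted = {"tool": "readContent",
             "args": {"filePath": "notes\\./a.md", "limit": "10", "recursive": "true"}}
expected = {"tool": "readContent",
            "args": {"recursive": True, "limit": 10, "filePath": "notes/a.md"}}
```

Argument reordering needs no special handling here because Python dict equality ignores key order.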

https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ
Phase 3: pivot_config.yaml preset (pivot enabled, functional_equivalence
reward at weight 0.5). train_grpo.py gains --pivot-profile-only flag and
conditional pivot branch that profiles SFT turns before dataset loading.
Reward function build moved earlier so profiling can use it.

Phase 4: SKILL.md quick reference entries and grpo-training.md section
covering PivotRL usage, config reference, and key metrics.

https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ
36 tests covering:
- Candidate extraction (single/multi-turn, system messages)
- Variance filtering (threshold, mean range, max cap, min warning)
- Dataset output format validation
- Value normalization (bool/numeric coercion, paths, whitespace)
- Tool call extraction (Qwen/Mistral/plain formats)
- Argument comparison (exact/partial/no match)
- Functional equivalence reward (matching, wrong tool, fallbacks)
- Config backward compatibility and pivot preset defaults

https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ
New preset for training models to search a corpus, select relevant
documents, and produce grounded answers. Inspired by Chroma Context-1's
explore-verify-extend pattern. All templates are tool-agnostic with
placeholder tool names — users must confirm their actual search/read
tools before proceeding.

Deliverables:
- Scenario template (seed corpus + find-and-answer + multi-hop)
- Three rubrics (search_term_quality, doc_selection, groundedness)
- Eval preset with 10 self-contained test cases
- End-to-end case study doc (third alongside tool-calling and essay-style)
- Updated SKILL.md index + synced mirror trees

https://claude.ai/code/session_01TY65D9dzmbD3DaXuRJR8Bo
Eval template now supports two modes:
- Static (AS_*): corpus in system prompt, quick behavioral check
- Runtime (ASR_*): real files via fixture, tool calls actually executed

5 runtime tests mirror the most important static tests but require
the model to actually search and read files to discover content.
Two presets: agentic_search (static) and agentic_search_runtime.

https://claude.ai/code/session_01TY65D9dzmbD3DaXuRJR8Bo
Runtime eval fixtures must not overlap with training data — eval
should measure capability, not memorization. Added prominent warnings
in both the eval YAML and the pipeline case study doc. Also documented
both static and runtime eval modes in the case study.

https://claude.ai/code/session_01TY65D9dzmbD3DaXuRJR8Bo
Comprehensive research document mapping the open-source mech interp
landscape to our fine-tuning/eval/improvement pipeline. Identifies
6 concrete integration points: post-training diagnostics via SAEs,
interp-aware evaluation validators, flywheel feature drift detection,
mechanistically-guided LoRA placement, activation steering for
inference-time fixes, and SynthChat representation-level validation.

https://claude.ai/code/session_0116APH1YFGUBjuC8RznS3Wx
Remove tool-calling-specific framing from all integration points.
The pipeline now generalizes to any fine-tuning objective: essay
writing, code generation, domain adaptation, agentic behavior, etc.
Added task agnosticism section and multi-task examples throughout.

https://claude.ai/code/session_0116APH1YFGUBjuC8RznS3Wx
…aluator DRY (#76)

* Improve experiment workflows and harden SFT preprocessing

* Refactor tuner: extract stage runners and decompose hf_jobs_backend

Split experiment_handler.py (946→207 lines) by extracting
HFTrainingStageRunner, HFEvalStageRunner, and HFLossStageRunner
into tuner/handlers/stages/ subpackage. ExperimentHandler remains
as the orchestrator.

Decompose HFJobsBackend (990→465 lines) into 4 focused mixins:
CommandBuilder, JobWatcher, BucketOps, PostTraining. External API
(ITrainingBackend) unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Refactor evaluator: DRY display tables and declarative check registry

Replace 7 near-identical _display_*_table methods in eval_handler.py
with TableSpec dataclass + generic _display_table renderer. Conditional
columns handled via dynamic spec construction.

Replace 18 _check_* methods in config_validator.py with declarative
CheckDescriptor registry + stateless _run_* functions. Checks are
now decoupled, independently testable, and trivially extensible.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Add comprehensive tests for SOLID/DRY refactoring batch 1

122 new tests covering all 4 refactored areas:
- Stage runner extraction: import isolation, Protocol contracts, re-exports
- HFJobsBackend mixins: MRO composition, cross-mixin calls, ITrainingBackend
- EvalHandler TableSpec: Rich/plain-text rendering, conditional columns
- CheckDescriptor registry: all 18 entries, tool_sequence composite, adversarial

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Address review findings: cleanup imports, normalize style, add behavioral tests

Review remediation (F1-F6):
- Remove unused imports: StageResult, List (F1, F2)
- Remove redundant inline shutil import (F3)
- Normalize Optional[X] to X | None across stage runners (F4)
- Document mixin dependency order in HFJobsBackend docstring (F5)
- Add 56 behavioral tests for stage runner .run() methods and
  recovery state machines (F6)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
…rategy pattern (#77)

* Extract shared trainer utilities and unify lineage flow

DRY extraction phases 1-5:
- Create shared/env_bootstrap.py: init_trainer_env() consolidates
  env vars, Windows patches, dotenv, logging across all trainers
- Create shared/training_utils.py: setup_wandb, save_training_lineage,
  extract_previous_log_entries (KTO canonical), build_base_lineage
  (sub-dict API), apply_tier_preset
- GRPO now generates training_lineage.json (unified lineage flow)
- SFT: 1351→1174 lines, KTO: 1329→1101 lines

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Refactor LoRA surgery with Strategy pattern

Extract 8 surgical operations from LoRASurgeon (1,054 lines) into
shared/evolutionary/surgery/ package with Protocol-based strategies:
- 8 operation classes in surgery/operations/
- Decorator-based static registry
- LoRASurgeon reduced to thin orchestrator (232 lines)
- Original lora_surgery.py becomes 53-line backward-compat shim
- All 60 existing tests pass (40 surgery + 20 karpathy integration)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Add comprehensive tests for trainer DRY extraction and LoRA surgery

50 new tests covering both batch 2 refactoring areas:
- 11 tests: init_trainer_env() flag combinations and logging
- 25 tests: setup_wandb, extract_previous_log_entries, save/build
  lineage, apply_tier_preset
- 14 tests: surgery registry API, Protocol conformance, async typing,
  backward-compat shim, context manager, proxy methods

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Address review findings: async typing, imports, config ISP, test robustness

Review remediation (R1-R8):
- Fix evaluate_fn type: Callable[[str], Awaitable[float]] (R1)
- Remove unused json imports from train_sft/kto (R2)
- Move stray import in train_grpo to top (R3)
- Use monkeypatch in env_bootstrap tests (R4)
- Add deprecation docstring to lora_surgery shim (R6)
- Refactor SurgeryConfig with per-operation config groups (R7)
- Add 4 negative-path tests for sft_preprocessing (R8)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
…ules (#78)

* Decompose SynthChat generator God class into focused modules

Break SynthChatGenerator (3,384→1,635 lines) and run.py (1,042→220 lines)
into 14 focused modules following the 9-step decomposition roadmap:

Extracted modules:
- template_utils.py: template rendering utilities
- targets.py: target spec normalization (DRY fix)
- workspace/: environment rendering (renderer, sections, fixture_helpers)
- schemas/: JSON schemas (environment, tool response)
- labeling.py: metadata label classification
- parsing.py: response parsing and normalization
- llm/: LLM client pool management
- review.py: stage review and judge templates
- agentic/: episode generation and turn management
- modes/: CLI mode handlers (generate, improve, validate)
- parallel/: worker pool and parallel execution
- result_writer.py: streaming output management

All 43 existing tests pass. Backward-compat re-exports preserved.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Add comprehensive tests for SynthChat decomposition

263 new tests across 13 files covering all 14 extracted modules:
- Import isolation and backward-compat re-exports
- template_utils, targets, parsing, labeling, schemas
- workspace rendering, LLM client pool, review, agentic
- result_writer, modes, parallel workers

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
* Make SynthChat config-driven: tool schemas, workspace, labels via YAML

Replace hardcoded scenario-specific behavior with 3 config registries:
- tool_call_formats.yaml: tool-call response schema, prompt instructions,
  wrapper name, context fields — fully configurable per format
- workspace_formats.yaml: system prompt section order, tag names,
  default values, selected workspace fields
- label_mappings.yaml: issue-to-behavior classification rules
  and label rollup groups

Resolution priority: scenario inline > string reference > tool_format
config > "default" from registry. Backward compatible — existing
scenarios work unchanged.
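
The resolution order can be read as a short fallback chain. The sketch below is one interpretation, not the actual SynthChat resolver: the scenario key `tool_call_format`, the registry contents, and the function name are assumptions, and only `tool_format` and `"default"` come from the commit message.

```python
from typing import Any, Dict

# Hypothetical registry, standing in for the contents of tool_call_formats.yaml.
REGISTRY: Dict[str, Dict[str, Any]] = {
    "default": {"wrapper": "useTools"},
    "plain": {"wrapper": "call"},
}


def resolve_format(scenario: Dict[str, Any],
                   registry: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Resolve a tool-call format using the documented priority order."""
    spec = scenario.get("tool_call_format")
    if isinstance(spec, dict):      # 1. inline definition in the scenario
        return spec
    if isinstance(spec, str):       # 2. string reference into the registry
        return registry[spec]
    name = scenario.get("tool_format")
    if name is not None:            # 3. tool_format config
        return registry[name]
    return registry["default"]      # 4. registry default
```

A scenario that sets none of these keys falls through to the default entry, which is what keeps existing scenarios working unchanged.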

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Strip backward-compat shims and clean up hardcoded defaults

Remove all backward-compatibility layers (no external consumers):
- SynthChat/generator.py: remove re-export block for extracted functions
- SynthChat/schemas/tool_response_schema.py: remove legacy signatures,
  keep only config-driven versions
- SynthChat/workspace/renderer.py: remove _render_legacy path
- SynthChat/labeling.py: make config params required
- shared/evolutionary/lora_surgery.py: convert shim to thin re-export
  without underscore-prefixed aliases
- shared/evolutionary/surgery/utils.py: remove _prefixed compat aliases
- tests/synthchat/test_backward_compat.py: removed entirely
- Update all tests to import from new module paths directly

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Document config-driven architecture and no-hardcoding principle

- CLAUDE.md: Add NO HARDCODING rule + no backward-compat shims rule
- README.md: Add Config-Driven Architecture section, clarify useTools
  is a toy example not ground truth
- SKILL.md: Add config-driven architecture section with key config files

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
- Fine-tuning skill: experiment configs (gemma4, qwen3, qwen35 A100
  templates) and HF Spaces warm iteration reference
- Refactoring plans from SOLID/DRY analysis session (6 plan docs)
- Qwen3.5 4B A100 cloud experiment spec
- SFT model loader source test

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Standalone Python script downloads a HF model, converts to GGUF via
llama.cpp's pure-Python converter, and uploads the result. Job config
uses cpu-upgrade flavor (no GPU needed for conversion).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The unsloth cloud image ships an older transformers that lacks Gemma 4
tokenizer support, causing AttributeError in special_tokens handling.
Pinning transformers>=4.52.0 resolves the incompatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The Gemma 4 model's tokenizer_config.json has extra_special_tokens as a
list, but transformers expects a dict. Patch the config before running
the converter to avoid AttributeError in tokenizer initialization.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…options, update documentation for GGUF conversion, and improve tool call parsing for Gemma format
Local model artifacts (pulled from HF bucket) were slowing git operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Completes the checkpoint control flags added in 1095859 — these are the
trainer-side argument parsing and config override that were blocked by
the pre-commit hook false positive on tokenizer-related print statements.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Full precision (no 4-bit), 3 epochs, save every 200 steps (keep 10),
pip upgrades transformers>=5.0 for Gemma 4 architecture support.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Allows experiment specs to control checkpoint frequency and retention.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…sing

Multimodal models (Gemma 4, Qwen-VL) return a Processor from
AutoTokenizer.from_pretrained(). Processors have apply_chat_template()
but lack encode(). Unwrap to inner .tokenizer for encode() calls.
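
The unwrapping step can be expressed as a small duck-typed helper. This is a sketch with stand-in classes rather than real `transformers` objects; `unwrap_tokenizer`, `FakeTokenizer`, and `FakeProcessor` are hypothetical names.

```python
from typing import Any, List


def unwrap_tokenizer(tokenizer_or_processor: Any) -> Any:
    """Return an object with encode(): the input itself or its inner .tokenizer."""
    if hasattr(tokenizer_or_processor, "encode"):
        return tokenizer_or_processor
    inner = getattr(tokenizer_or_processor, "tokenizer", None)
    if inner is not None and hasattr(inner, "encode"):
        return inner
    raise TypeError("object has neither encode() nor an inner tokenizer")


class FakeTokenizer:
    """Stand-in for a text tokenizer."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]


class FakeProcessor:
    """Stand-in for a multimodal processor: wraps a tokenizer, no encode()."""

    def __init__(self) -> None:
        self.tokenizer = FakeTokenizer()


plain = FakeTokenizer()
token_ids = unwrap_tokenizer(FakeProcessor()).encode("ab")
```
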

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

2 participants