Add Docker-first local runtime bootstrap and eval flow #80
Open
ProfSynapse wants to merge 763 commits into
Conversation
Core evolutionary module (shared/evolutionary/):
- CandidateGenerator: Generates gradient update candidates
- EvolutionaryTrainerWrapper: Wraps HuggingFace trainers with ES
- Three strategies: gradient_noise, scale_variation, combined
- GradientCandidate dataclass for candidate representation
Fitness evaluation (shared/validation/fitness.py):
- FitnessEvaluator: Config-driven fitness scoring (0.0-1.0)
- Wraps parsing + validation layers from Phase 1
- Scoring methods: binary, error_count, error_penalty
Training integration:
- Added evolutionary config to config.yaml
- Updated config_loader.py with EvolutionaryConfig dataclass
- Integrated EvolutionaryTrainerWrapper into train_sft.py
- Created example fitness config (configs/fitness/tool_calling.yaml)
Documentation:
- Updated EVOLUTIONARY_FINETUNING.md with Phase 2 complete
- Added usage guide with quick start instructions
The evolutionary training is opt-in via `evolutionary.enabled: true` in the YAML config. When enabled, training generates N candidate gradient modifications, evaluates each using fitness scoring, and applies only the best candidate.
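The generate, evaluate, and apply-best loop described in this commit can be sketched roughly as follows. This is a toy illustration, not the code in shared/evolutionary/: `Candidate`, `generate_candidates`, and `select_best` are hypothetical names, and the fitness function is a stand-in for the config-driven FitnessEvaluator.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """Hypothetical stand-in for GradientCandidate: one way to modify the gradient."""
    strategy: str
    scale: float


def generate_candidates(n: int, rng: random.Random) -> List[Candidate]:
    """Produce n scale-variation candidates around the unmodified gradient (scale 1.0)."""
    return [Candidate("scale_variation", rng.uniform(0.5, 1.5)) for _ in range(n)]


def select_best(candidates: List[Candidate], fitness: Callable[[Candidate], float]) -> Candidate:
    """Score every candidate (fitness in 0.0-1.0) and keep only the highest-scoring one."""
    return max(candidates, key=fitness)


rng = random.Random(0)
candidates = generate_candidates(4, rng)
# Toy fitness (assumption): prefer scales close to 1.0.
# A real evaluator would parse and validate model output instead.
best = select_best(candidates, lambda c: max(0.0, 1.0 - abs(c.scale - 1.0)))
```

Only `best` would then be applied as the update for that step; the other candidates are discarded.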
Consolidated all agent tool schemas from SynthChat rubrics:
- vaultManager: 9 tools (listDirectory, moveNote, deleteFolder, etc.)
- contentManager: 8 tools (readContent, createContent, replaceByLine, etc.)
- vaultLibrarian: 3 tools (searchContent, searchDirectory, searchMemory)
- memoryManager: 12 tools (sessions, states, workspaces)
- agentManager: 8 tools (createAgent, generateImage, executePrompts, etc.)
- commandManager: 2 tools (listCommands, executeCommand)
Validates same schema structure as training data:
- function.name = "useTools"
- arguments.context: workspaceId, sessionId, memory, goal
- arguments.calls: [{agent, tool, params}]
- Per-agent _required params enforcement
…prefix generation
During warmup, training proceeds normally (standard gradient descent). After warmup_steps, evolutionary gradient selection kicks in. This solves the cold-start problem: the model needs to learn basic tool call patterns before evolutionary optimization can provide a meaningful fitness signal. Default: 200 warmup steps (configurable in config.yaml)
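A minimal sketch of the warmup gate, assuming 0-indexed step counting so that steps 0 through `warmup_steps - 1` use standard gradient descent. The function name is hypothetical and the boundary condition is an assumption; the trainer wrapper may count steps differently.

```python
def use_evolutionary_step(global_step: int, enabled: bool, warmup_steps: int = 200) -> bool:
    """Return True when evolutionary gradient selection should run for this step.

    Assumption: steps are 0-indexed, so the first `warmup_steps` steps
    are plain gradient descent and selection starts at step `warmup_steps`.
    """
    return enabled and global_step >= warmup_steps


# With the default of 200 warmup steps, selection starts at step 200.
schedule = [use_evolutionary_step(step, enabled=True) for step in (0, 199, 200)]
```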
- SynthChat: 75 files (refactored services, validators, rubrics)
- Datasets: 42 files (non_thinking tools datasets updates)
- shared: 39 files (validation, utilities, upload converters)
- Evaluator: 23 files (config, clients, validators)
- tuner: 12 files (handlers, backends, CLI)
- Trainers: config updates
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Resolution strategy:
- Accept main: MLC backend support (Evaluator, tuner, webgpu converter)
- Accept main: Updated datasets (12.24.25)
- Accept main: Config yaml updates (prompts, tool_schema)
- Keep ours: Evolutionary training config section in Trainers/config.yaml
- Delete: Old vaultManager dataset files (v1.5-v1.19)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
… evolutionary fine-tuning
- Add split_for_gspo.py tool for dataset splitting
- Implement weighted scoring for context, tool, and params matching
- Update reward rubrics for YAML-driven configuration
- Configure 4x4 batch size for faster training
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
- eval_handler: Discover GRPO runs alongside SFT/KTO
- eval_handler: Display reward metrics for GRPO checkpoints
- unsloth_backend: Search grpo_output_rtx3090 for LoRA adapters
- llamacpp_backend: Search grpo_output_rtx3090 for GGUF models
- mlc_backend: Search grpo_output_rtx3090 for WebGPU models
- All backends: Detect 'grpo' trainer type from path
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
The lora_path field was defined in config.yaml but missing from the ModelConfig dataclass, causing it to be silently ignored. This meant GRPO training was using the base model instead of the SFT-trained checkpoint, resulting in 0.0 rewards (model didn't know tool format). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
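One way to guard against this class of bug is a loader that rejects YAML keys the dataclass does not declare, instead of dropping them. The sketch below is illustrative only: `load_strict`, the `base_model` field, and the example path are assumptions, not the project's actual config_loader.py.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Type, TypeVar

T = TypeVar("T")


@dataclass
class ModelConfig:
    base_model: str                  # hypothetical field, for illustration
    lora_path: Optional[str] = None  # the field that was missing before the fix


def load_strict(cls: Type[T], raw: Dict[str, Any]) -> T:
    """Build a dataclass from a parsed YAML mapping, failing on undeclared keys."""
    known = {f.name for f in fields(cls)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"Unknown config keys for {cls.__name__}: {sorted(unknown)}")
    return cls(**raw)


# Hypothetical values; with the field declared, the SFT checkpoint path survives loading.
cfg = load_strict(ModelConfig, {"base_model": "base", "lora_path": "sft_output/final_model"})
```

With a check like this, a key that is present in config.yaml but absent from the dataclass raises at load time rather than leading to training from the base model.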
…TrainingProgressDisplay for unified output across backends
…perations, and enhance descriptions for clarity
…u, and animation scenes
- Implemented `LiveEvaluationDashboard` for real-time evaluation metrics display.
- Created `generate_round_flask` function to visually represent a flask shape in terminal.
- Developed interactive menu using `asciimatics` with animated branding and options.
- Added scene creation functions for logo display, training start splash, and celebration animations.
…ation monitoring
- Implemented SynthChatMetrics to track generation progress, including total examples, completed, valid, and invalid counts.
- Created ResultEntry class for logging individual results with status, category, and reason.
- Developed LiveSynthChatDashboard class for displaying metrics and recent results in a user-friendly format.
- Integrated rich console output for enhanced visual representation of progress and results.
- Added methods for updating metrics, building display, and handling live updates.
Config-driven integration adding pivot filtering (variance-gated data selection) and functional equivalence rewards as optional GRPO modes. ~335 lines new Python, 3 new files, 4 edited files. Reuses existing reward system, dataset loader, and per_example_loss batch patterns. https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ
Phase 1: pivot_profiler.py — extracts (state, action) candidates from SFT trajectories, generates N rollouts per turn, scores with existing reward system, filters by variance threshold. Includes SHA1-based caching. Phase 2: functional_verifier.py — normalized tool-call comparison reward that accepts functionally equivalent actions (arg reorder, type coercion, path normalization). Plugs into existing custom reward mechanism. https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ
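The normalized comparison behind the functional equivalence reward can be sketched as below. This is a simplified illustration of the three normalizations named above (argument reorder, type coercion, path normalization), not the contents of functional_verifier.py; the function names and the parameter names in the example are hypothetical.

```python
import posixpath
from typing import Any, Dict


def normalize_value(value: Any) -> Any:
    """Coerce a tool-call argument into a canonical form for comparison."""
    if isinstance(value, dict):
        return {key: normalize_value(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize_value(item) for item in value]
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        text = value.strip()
        if text.lower() in ("true", "false"):
            return text.lower() == "true"       # "true" -> True
        try:
            return float(text)                  # "10" -> 10.0
        except ValueError:
            pass
        if "/" in text or "\\" in text:
            # Treat as a path: unify separators and collapse "//", "." and "..".
            return posixpath.normpath(text.replace("\\", "/"))
        return text
    return value


def functionally_equal(predicted: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Same tool and equivalent params; dict comparison already ignores key order."""
    return (predicted["tool"] == expected["tool"]
            and normalize_value(predicted["params"]) == normalize_value(expected["params"]))


predicted = {"tool": "readContent",
             "params": {"filePath": "notes//a/../b.md", "limit": "10", "recursive": "true"}}
expected = {"tool": "readContent",
            "params": {"recursive": True, "limit": 10, "filePath": "notes/b.md"}}
match = functionally_equal(predicted, expected)
```

A reward function could map `match` to 1.0 or 0.0, or award partial credit per matching argument; which scheme the project uses is not stated here.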
Phase 3: pivot_config.yaml preset (pivot enabled, functional_equivalence reward at weight 0.5). train_grpo.py gains --pivot-profile-only flag and conditional pivot branch that profiles SFT turns before dataset loading. Reward function build moved earlier so profiling can use it. Phase 4: SKILL.md quick reference entries and grpo-training.md section covering PivotRL usage, config reference, and key metrics. https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ
36 tests covering:
- Candidate extraction (single/multi-turn, system messages)
- Variance filtering (threshold, mean range, max cap, min warning)
- Dataset output format validation
- Value normalization (bool/numeric coercion, paths, whitespace)
- Tool call extraction (Qwen/Mistral/plain formats)
- Argument comparison (exact/partial/no match)
- Functional equivalence reward (matching, wrong tool, fallbacks)
- Config backward compatibility and pivot preset defaults
https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ
New preset for training models to search a corpus, select relevant documents, and produce grounded answers. Inspired by Chroma Context-1's explore-verify-extend pattern. All templates are tool-agnostic with placeholder tool names; users must confirm their actual search/read tools before proceeding.
Deliverables:
- Scenario template (seed corpus + find-and-answer + multi-hop)
- Three rubrics (search_term_quality, doc_selection, groundedness)
- Eval preset with 10 self-contained test cases
- End-to-end case study doc (third alongside tool-calling and essay-style)
- Updated SKILL.md index + synced mirror trees
https://claude.ai/code/session_01TY65D9dzmbD3DaXuRJR8Bo
Eval template now supports two modes:
- Static (AS_*): corpus in system prompt, quick behavioral check
- Runtime (ASR_*): real files via fixture, tool calls actually executed
5 runtime tests mirror the most important static tests but require the model to actually search and read files to discover content. Two presets: agentic_search (static) and agentic_search_runtime.
https://claude.ai/code/session_01TY65D9dzmbD3DaXuRJR8Bo
Runtime eval fixtures must not overlap with training data — eval should measure capability, not memorization. Added prominent warnings in both the eval YAML and the pipeline case study doc. Also documented both static and runtime eval modes in the case study. https://claude.ai/code/session_01TY65D9dzmbD3DaXuRJR8Bo
Comprehensive research document mapping the open-source mech interp landscape to our fine-tuning/eval/improvement pipeline. Identifies 6 concrete integration points: post-training diagnostics via SAEs, interp-aware evaluation validators, flywheel feature drift detection, mechanistically-guided LoRA placement, activation steering for inference-time fixes, and SynthChat representation-level validation. https://claude.ai/code/session_0116APH1YFGUBjuC8RznS3Wx
Remove tool-calling-specific framing from all integration points. The pipeline now generalizes to any fine-tuning objective: essay writing, code generation, domain adaptation, agentic behavior, etc. Added task agnosticism section and multi-task examples throughout. https://claude.ai/code/session_0116APH1YFGUBjuC8RznS3Wx
…bility-research-GJYpb
…aluator DRY (#76)
* Improve experiment workflows and harden SFT preprocessing
* Refactor tuner: extract stage runners and decompose hf_jobs_backend
Split experiment_handler.py (946→207 lines) by extracting HFTrainingStageRunner, HFEvalStageRunner, and HFLossStageRunner into tuner/handlers/stages/ subpackage. ExperimentHandler remains as the orchestrator.
Decompose HFJobsBackend (990→465 lines) into 4 focused mixins: CommandBuilder, JobWatcher, BucketOps, PostTraining. External API (ITrainingBackend) unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* Refactor evaluator: DRY display tables and declarative check registry
Replace 7 near-identical _display_*_table methods in eval_handler.py with TableSpec dataclass + generic _display_table renderer. Conditional columns handled via dynamic spec construction.
Replace 18 _check_* methods in config_validator.py with declarative CheckDescriptor registry + stateless _run_* functions. Checks are now decoupled, independently testable, and trivially extensible.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* Add comprehensive tests for SOLID/DRY refactoring batch 1
122 new tests covering all 4 refactored areas:
- Stage runner extraction: import isolation, Protocol contracts, re-exports
- HFJobsBackend mixins: MRO composition, cross-mixin calls, ITrainingBackend
- EvalHandler TableSpec: Rich/plain-text rendering, conditional columns
- CheckDescriptor registry: all 18 entries, tool_sequence composite, adversarial
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* Address review findings: cleanup imports, normalize style, add behavioral tests
Review remediation (F1-F6):
- Remove unused imports: StageResult, List (F1, F2)
- Remove redundant inline shutil import (F3)
- Normalize Optional[X] to X | None across stage runners (F4)
- Document mixin dependency order in HFJobsBackend docstring (F5)
- Add 56 behavioral tests for stage runner .run() methods and recovery state machines (F6)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
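The TableSpec idea (one generic renderer driven by a declarative spec, with conditional columns added while the spec is built) might look roughly like this. It is a plain-text sketch under assumptions: `ColumnSpec`, `render_table`, `build_run_spec`, and the row keys are hypothetical, and the real eval_handler.py also supports Rich rendering.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class ColumnSpec:
    header: str
    getter: Callable[[Dict[str, Any]], str]  # extracts the cell text from one row


@dataclass(frozen=True)
class TableSpec:
    title: str
    columns: List[ColumnSpec]


def render_table(spec: TableSpec, rows: List[Dict[str, Any]]) -> str:
    """Single generic renderer that replaces one display method per table."""
    lines = [spec.title, " | ".join(col.header for col in spec.columns)]
    for row in rows:
        lines.append(" | ".join(col.getter(row) for col in spec.columns))
    return "\n".join(lines)


def build_run_spec(show_reward: bool) -> TableSpec:
    """Conditional columns are decided here, when the spec is constructed."""
    columns = [ColumnSpec("Run", lambda row: row["name"])]
    if show_reward:
        columns.append(ColumnSpec("Reward", lambda row: f"{row['reward']:.2f}"))
    return TableSpec("Runs", columns)


rows = [{"name": "grpo-1", "reward": 0.5}]
with_reward = render_table(build_run_spec(show_reward=True), rows)
without_reward = render_table(build_run_spec(show_reward=False), rows)
```

Adding a new table then means writing a new spec rather than a new display method.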
…rategy pattern (#77)
* Extract shared trainer utilities and unify lineage flow
DRY extraction phases 1-5:
- Create shared/env_bootstrap.py: init_trainer_env() consolidates env vars, Windows patches, dotenv, logging across all trainers
- Create shared/training_utils.py: setup_wandb, save_training_lineage, extract_previous_log_entries (KTO canonical), build_base_lineage (sub-dict API), apply_tier_preset
- GRPO now generates training_lineage.json (unified lineage flow)
- SFT: 1351→1174 lines, KTO: 1329→1101 lines
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* Refactor LoRA surgery with Strategy pattern
Extract 8 surgical operations from LoRASurgeon (1,054 lines) into shared/evolutionary/surgery/ package with Protocol-based strategies:
- 8 operation classes in surgery/operations/
- Decorator-based static registry
- LoRASurgeon reduced to thin orchestrator (232 lines)
- Original lora_surgery.py becomes 53-line backward-compat shim
- All 60 existing tests pass (40 surgery + 20 karpathy integration)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* Add comprehensive tests for trainer DRY extraction and LoRA surgery
50 new tests covering both batch 2 refactoring areas:
- 11 tests: init_trainer_env() flag combinations and logging
- 25 tests: setup_wandb, extract_previous_log_entries, save/build lineage, apply_tier_preset
- 14 tests: surgery registry API, Protocol conformance, async typing, backward-compat shim, context manager, proxy methods
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* Address review findings: async typing, imports, config ISP, test robustness
Review remediation (R1-R8):
- Fix evaluate_fn type: Callable[[str], Awaitable[float]] (R1)
- Remove unused json imports from train_sft/kto (R2)
- Move stray import in train_grpo to top (R3)
- Use monkeypatch in env_bootstrap tests (R4)
- Add deprecation docstring to lora_surgery shim (R6)
- Refactor SurgeryConfig with per-operation config groups (R7)
- Add 4 negative-path tests for sft_preprocessing (R8)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
…ules (#78)
* Decompose SynthChat generator God class into focused modules
Break SynthChatGenerator (3,384→1,635 lines) and run.py (1,042→220 lines) into 14 focused modules following the 9-step decomposition roadmap:
Extracted modules:
- template_utils.py: template rendering utilities
- targets.py: target spec normalization (DRY fix)
- workspace/: environment rendering (renderer, sections, fixture_helpers)
- schemas/: JSON schemas (environment, tool response)
- labeling.py: metadata label classification
- parsing.py: response parsing and normalization
- llm/: LLM client pool management
- review.py: stage review and judge templates
- agentic/: episode generation and turn management
- modes/: CLI mode handlers (generate, improve, validate)
- parallel/: worker pool and parallel execution
- result_writer.py: streaming output management
All 43 existing tests pass. Backward-compat re-exports preserved.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* Add comprehensive tests for SynthChat decomposition
263 new tests across 13 files covering all 14 extracted modules:
- Import isolation and backward-compat re-exports
- template_utils, targets, parsing, labeling, schemas
- workspace rendering, LLM client pool, review, agentic
- result_writer, modes, parallel workers
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
* Make SynthChat config-driven: tool schemas, workspace, labels via YAML
Replace hardcoded scenario-specific behavior with 3 config registries:
- tool_call_formats.yaml: tool-call response schema, prompt instructions, wrapper name, context fields (fully configurable per format)
- workspace_formats.yaml: system prompt section order, tag names, default values, selected workspace fields
- label_mappings.yaml: issue-to-behavior classification rules and label rollup groups
Resolution priority: scenario inline > string reference > tool_format config > "default" from registry. Backward compatible: existing scenarios work unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* Strip backward-compat shims and clean up hardcoded defaults
Remove all backward-compatibility layers (no external consumers):
- SynthChat/generator.py: remove re-export block for extracted functions
- SynthChat/schemas/tool_response_schema.py: remove legacy signatures, keep only config-driven versions
- SynthChat/workspace/renderer.py: remove _render_legacy path
- SynthChat/labeling.py: make config params required
- shared/evolutionary/lora_surgery.py: convert shim to thin re-export without underscore-prefixed aliases
- shared/evolutionary/surgery/utils.py: remove _prefixed compat aliases
- tests/synthchat/test_backward_compat.py: removed entirely
- Update all tests to import from new module paths directly
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* Document config-driven architecture and no-hardcoding principle
- CLAUDE.md: Add NO HARDCODING rule + no backward-compat shims rule
- README.md: Add Config-Driven Architecture section, clarify useTools is a toy example not ground truth
- SKILL.md: Add config-driven architecture section with key config files
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
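The resolution priority stated in the first commit above (scenario inline > string reference > tool_format config > "default") can be read as a four-step fallback. The sketch below is one interpretation under assumptions: the `tool_call_format` and `tool_format` key names and the registry contents are hypothetical, and the real resolver may differ.

```python
from typing import Any, Dict


def resolve_tool_call_format(scenario: Dict[str, Any],
                             config: Dict[str, Any],
                             registry: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Pick a tool-call format definition using the stated priority order."""
    value = scenario.get("tool_call_format")      # hypothetical scenario key
    if isinstance(value, dict):
        return value                              # 1. inline definition in the scenario
    if isinstance(value, str):
        return registry[value]                    # 2. string reference into the registry
    name = config.get("tool_format")              # hypothetical config key
    if name is not None:
        return registry[name]                     # 3. format named by the config
    return registry["default"]                    # 4. registry default


# Hypothetical registry entries for illustration.
registry = {"default": {"wrapper": "useTools"}, "plain": {"wrapper": "call"}}
inline = resolve_tool_call_format({"tool_call_format": {"wrapper": "custom"}}, {}, registry)
by_reference = resolve_tool_call_format({"tool_call_format": "plain"}, {"tool_format": "default"}, registry)
from_config = resolve_tool_call_format({}, {"tool_format": "plain"}, registry)
fallback = resolve_tool_call_format({}, {}, registry)
```

A scenario that sets nothing falls through to the registry default, which is consistent with existing scenarios continuing to work unchanged.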
- Fine-tuning skill: experiment configs (gemma4, qwen3, qwen35 A100 templates) and HF Spaces warm iteration reference
- Refactoring plans from SOLID/DRY analysis session (6 plan docs)
- Qwen3.5 4B A100 cloud experiment spec
- SFT model loader source test
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Standalone Python script downloads a HF model, converts to GGUF via llama.cpp's pure-Python converter, and uploads the result. Job config uses cpu-upgrade flavor (no GPU needed for conversion). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The unsloth cloud image ships an older transformers that lacks Gemma 4 tokenizer support, causing AttributeError in special_tokens handling. Pinning transformers>=4.52.0 resolves the incompatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The Gemma 4 model's tokenizer_config.json has extra_special_tokens as a list, but transformers expects a dict. Patch the config before running the converter to avoid AttributeError in tokenizer initialization. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…options, update documentation for GGUF conversion, and improve tool call parsing for Gemma format
Local model artifacts (pulled from HF bucket) were slowing git operations. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Completes the checkpoint control flags added in 1095859 — these are the trainer-side argument parsing and config override that were blocked by the pre-commit hook false positive on tokenizer-related print statements. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Full precision (no 4-bit), 3 epochs, save every 200 steps (keep 10), pip upgrades transformers>=5.0 for Gemma 4 architecture support. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Allows experiment specs to control checkpoint frequency and retention. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…sing Multimodal models (Gemma 4, Qwen-VL) return a Processor from AutoTokenizer.from_pretrained(). Processors have apply_chat_template() but lack encode(). Unwrap to inner .tokenizer for encode() calls. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
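The unwrap described above can be done with a single `getattr` fallback. The sketch uses two small fake classes so it runs without downloading a model; `unwrap_tokenizer` is a hypothetical helper name, and the only assumption about real processors is the `.tokenizer` attribute named in the commit message.

```python
from typing import Any, List


def unwrap_tokenizer(tokenizer_or_processor: Any) -> Any:
    """Return the inner tokenizer of a processor, or the object itself if it has none."""
    return getattr(tokenizer_or_processor, "tokenizer", tokenizer_or_processor)


class FakeTokenizer:
    """Stand-in for a text tokenizer: has encode()."""
    def encode(self, text: str) -> List[int]:
        return [len(word) for word in text.split()]


class FakeProcessor:
    """Stand-in for a multimodal processor: wraps a tokenizer but has no encode()."""
    def __init__(self) -> None:
        self.tokenizer = FakeTokenizer()


# Both code paths end up with an object that supports encode().
ids_from_processor = unwrap_tokenizer(FakeProcessor()).encode("list the notes")
ids_from_tokenizer = unwrap_tokenizer(FakeTokenizer()).encode("list the notes")
```

Calls to `apply_chat_template()` can keep using the original object, since processors provide it; only `encode()` needs the unwrapped tokenizer.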
Summary
- `docker status`, `docker bootstrap`, `docker pull`, `docker smoke`, and bucket-helper build support
- `.env`
Verification
- `python .skills/scripts/sync_skill_trees.py`
- `python .skills/scripts/sync_skill_trees.py --check`
- `python tuner.py eval --json --runtime docker`
- `python tuner.py docker bootstrap --json --docker-target all`
- `python tuner.py eval --runtime docker` against pulled run `toolset-training-artifacts/runs/hf_jobs/sft/20260321_191536-07065b91/final_model`
Notes
- `docs/plans/docker-first-local-runtime-plan.md` out of this PR