Add Docker-first local runtime bootstrap and eval flow by ProfSynapse · Pull Request #80 · ProfSynapse/Synaptic-Tuner

ProfSynapse · 2026-04-10T15:59:18Z

Summary

add a first-class local Docker runtime with docker status, docker bootstrap, docker pull, docker smoke, and bucket-helper build support
route local train/eval through Docker images, add bucket-helper isolation for modern HF Buckets support, and auto-load root .env
make pulled bucket adapters discoverable in local eval flows, document the Docker-first bootstrap path, and sync the canonical fine-tuning skill
fix Docker vLLM readiness polling during startup and force UTF-8 CLI output on Windows to avoid interactive Rich encoding crashes
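
The readiness fix in the last item is about waiting for the containerized vLLM server to come up before sending eval requests. A minimal sketch of such a polling loop follows. The names `wait_until_ready` and `http_probe` are hypothetical and do not come from this PR, and the health URL in the usage note is an assumption about how the container is exposed.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200 (assumed health route)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and non-2xx answers all mean "not ready yet".
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `probe` until it succeeds or `timeout_s` elapses; return the outcome."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller might use `wait_until_ready(lambda: http_probe("http://localhost:8000/health"))`, where the host, port, and route are placeholders. Injecting `sleep` and `clock` keeps the loop testable without real waiting.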

Verification

python .skills/scripts/sync_skill_trees.py
python .skills/scripts/sync_skill_trees.py --check
AST parse for updated Python files
python tuner.py eval --json --runtime docker
python tuner.py docker bootstrap --json --docker-target all
real end-to-end python tuner.py eval --runtime docker against pulled run toolset-training-artifacts/runs/hf_jobs/sft/20260321_191536-07065b91/final_model

Notes

the end-to-end Docker eval completed successfully; the CLI exited nonzero only because the model failed 1 of 27 eval cases, not because the runtime failed
left unrelated untracked file docs/plans/docker-first-local-runtime-plan.md out of this PR

Core evolutionary module (shared/evolutionary/):
- CandidateGenerator: Generates gradient update candidates
- EvolutionaryTrainerWrapper: Wraps HuggingFace trainers with ES
- Three strategies: gradient_noise, scale_variation, combined
- GradientCandidate dataclass for candidate representation

Fitness evaluation (shared/validation/fitness.py):
- FitnessEvaluator: Config-driven fitness scoring (0.0-1.0)
- Wraps parsing + validation layers from Phase 1
- Scoring methods: binary, error_count, error_penalty

Training integration:
- Added evolutionary config to config.yaml
- Updated config_loader.py with EvolutionaryConfig dataclass
- Integrated EvolutionaryTrainerWrapper into train_sft.py
- Created example fitness config (configs/fitness/tool_calling.yaml)

Documentation:
- Updated EVOLUTIONARY_FINETUNING.md with Phase 2 complete
- Added usage guide with quick start instructions

The evolutionary training is opt-in via `evolutionary.enabled: true` in the YAML config. When enabled, training generates N candidate gradient modifications, evaluates each using fitness scoring, and applies only the best candidate.
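
The "generate N candidates, score each, apply the best" loop described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `Candidate`, `make_candidates`, and `select_best` are hypothetical names, gradients are plain Python lists, and only a Gaussian-noise strategy is shown as a stand-in for gradient_noise.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """Hypothetical stand-in for the GradientCandidate dataclass."""
    gradient: List[float]
    fitness: float = 0.0


def make_candidates(
    gradient: Sequence[float], n: int, noise_std: float, rng: random.Random
) -> List[Candidate]:
    """Return the unmodified gradient plus n-1 noisy variants of it."""
    candidates = [Candidate(list(gradient))]  # baseline: plain gradient step
    for _ in range(n - 1):
        noisy = [g + rng.gauss(0.0, noise_std) for g in gradient]
        candidates.append(Candidate(noisy))
    return candidates


def select_best(
    candidates: List[Candidate], fitness_fn: Callable[[List[float]], float]
) -> Candidate:
    """Score every candidate and return the one with the highest fitness."""
    for candidate in candidates:
        candidate.fitness = fitness_fn(candidate.gradient)
    return max(candidates, key=lambda c: c.fitness)
```

Keeping the unmodified gradient in the pool means the selected candidate never scores below the ordinary update under the same fitness function.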

Consolidated all agent tool schemas from SynthChat rubrics:
- vaultManager: 9 tools (listDirectory, moveNote, deleteFolder, etc.)
- contentManager: 8 tools (readContent, createContent, replaceByLine, etc.)
- vaultLibrarian: 3 tools (searchContent, searchDirectory, searchMemory)
- memoryManager: 12 tools (sessions, states, workspaces)
- agentManager: 8 tools (createAgent, generateImage, executePrompts, etc.)
- commandManager: 2 tools (listCommands, executeCommand)

Validates same schema structure as training data:
- function.name = "useTools"
- arguments.context: workspaceId, sessionId, memory, goal
- arguments.calls: [{agent, tool, params}]
- Per-agent _required params enforcement
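
The structure listed above can be checked with a small validator. The sketch below assumes `arguments` is nested under `function` and already parsed into a dict. The name `validate_use_tools`, the shape of the `required_params` table, and the `filePath` parameter in the usage note are all hypothetical.

```python
from typing import Any, Dict, List

CONTEXT_FIELDS = ("workspaceId", "sessionId", "memory", "goal")
CALL_FIELDS = ("agent", "tool", "params")


def validate_use_tools(
    call: Dict[str, Any], required_params: Dict[str, Dict[str, List[str]]]
) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []
    function = call.get("function") or {}
    if function.get("name") != "useTools":
        errors.append("function.name must be 'useTools'")
    arguments = function.get("arguments") or {}
    context = arguments.get("context") or {}
    for field in CONTEXT_FIELDS:
        if field not in context:
            errors.append(f"context.{field} is missing")
    calls = arguments.get("calls")
    if not isinstance(calls, list) or not calls:
        errors.append("arguments.calls must be a non-empty list")
        return errors
    for i, item in enumerate(calls):
        missing = [key for key in CALL_FIELDS if key not in item]
        if missing:
            errors.append(f"calls[{i}] is missing {missing}")
            continue
        # Per-agent required params: agent -> tool -> list of param names (assumed shape).
        needed = required_params.get(item["agent"], {}).get(item["tool"], [])
        for param in needed:
            if param not in item["params"]:
                errors.append(f"calls[{i}].params.{param} is required")
    return errors
```

For example, with `required_params = {"contentManager": {"readContent": ["filePath"]}}`, a `readContent` call without `filePath` would yield one error and a complete call would yield none.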

…prefix generation

During warmup, training proceeds normally (standard gradient descent). After warmup_steps, evolutionary gradient selection kicks in. This solves the cold-start problem: the model needs to learn basic tool call patterns before evolutionary optimization can provide a meaningful fitness signal. Default: 200 warmup steps (configurable in config.yaml)

- SynthChat: 75 files (refactored services, validators, rubrics)
- Datasets: 42 files (non_thinking tools datasets updates)
- shared: 39 files (validation, utilities, upload converters)
- Evaluator: 23 files (config, clients, validators)
- tuner: 12 files (handlers, backends, CLI)
- Trainers: config updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Resolution strategy:
- Accept main: MLC backend support (Evaluator, tuner, webgpu converter)
- Accept main: Updated datasets (12.24.25)
- Accept main: Config yaml updates (prompts, tool_schema)
- Keep ours: Evolutionary training config section in Trainers/config.yaml
- Delete: Old vaultManager dataset files (v1.5-v1.19)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

… evolutionary fine-tuning

- Add split_for_gspo.py tool for dataset splitting
- Implement weighted scoring for context, tool, and params matching
- Update reward rubrics for YAML-driven configuration
- Configure 4x4 batch size for faster training

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- eval_handler: Discover GRPO runs alongside SFT/KTO
- eval_handler: Display reward metrics for GRPO checkpoints
- unsloth_backend: Search grpo_output_rtx3090 for LoRA adapters
- llamacpp_backend: Search grpo_output_rtx3090 for GGUF models
- mlc_backend: Search grpo_output_rtx3090 for WebGPU models
- All backends: Detect 'grpo' trainer type from path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

The lora_path field was defined in config.yaml but missing from the ModelConfig dataclass, causing it to be silently ignored. This meant GRPO training was using the base model instead of the SFT-trained checkpoint, resulting in 0.0 rewards (model didn't know tool format). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
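
The failure mode above (a YAML key with no matching dataclass field being dropped without warning) can be guarded against by loading configs strictly. A minimal sketch, assuming a simplified `ModelConfig` with illustrative fields only and a hypothetical `load_strict` helper; the real dataclass has more fields than shown.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Type, TypeVar

T = TypeVar("T")


@dataclass
class ModelConfig:
    """Illustrative subset of fields; not the repository's full definition."""
    model_name: str
    lora_path: Optional[str] = None


def load_strict(cls: Type[T], raw: Dict[str, Any]) -> T:
    """Build `cls` from `raw`, raising instead of silently dropping unknown keys."""
    known = {f.name for f in fields(cls)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"Unknown config keys for {cls.__name__}: {unknown}")
    return cls(**raw)
```

With this kind of loader, a key that is present in the YAML but missing from the dataclass fails fast at startup instead of surfacing later as 0.0 rewards.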

…TrainingProgressDisplay for unified output across backends

…perations, and enhance descriptions for clarity

… for clarity

…d metrics

…u, and animation scenes

- Implemented `LiveEvaluationDashboard` for real-time evaluation metrics display.
- Created `generate_round_flask` function to visually represent a flask shape in terminal.
- Developed interactive menu using `asciimatics` with animated branding and options.
- Added scene creation functions for logo display, training start splash, and celebration animations.

…ation monitoring

- Implemented SynthChatMetrics to track generation progress, including total examples, completed, valid, and invalid counts.
- Created ResultEntry class for logging individual results with status, category, and reason.
- Developed LiveSynthChatDashboard class for displaying metrics and recent results in a user-friendly format.
- Integrated rich console output for enhanced visual representation of progress and results.
- Added methods for updating metrics, building display, and handling live updates.

Config-driven integration adding pivot filtering (variance-gated data selection) and functional equivalence rewards as optional GRPO modes. ~335 lines new Python, 3 new files, 4 edited files. Reuses existing reward system, dataset loader, and per_example_loss batch patterns. https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ

Phase 1: pivot_profiler.py extracts (state, action) candidates from SFT trajectories, generates N rollouts per turn, scores with existing reward system, filters by variance threshold. Includes SHA1-based caching.

Phase 2: functional_verifier.py is a normalized tool-call comparison reward that accepts functionally equivalent actions (arg reorder, type coercion, path normalization). Plugs into existing custom reward mechanism.

https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ
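
The Phase 2 idea, treating two tool calls as equal when they differ only in argument order, value types, or path spelling, can be sketched like this. The functions `normalize_value` and `functionally_equivalent` and the `{"tool", "params"}` call shape are assumptions for illustration, not the contents of functional_verifier.py.

```python
import posixpath
from typing import Any, Dict


def normalize_value(value: Any) -> Any:
    """Coerce a value to a canonical form so equivalent spellings compare equal."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        text = value.strip()
        if text.lower() in ("true", "false"):
            return text.lower() == "true"
        try:
            return float(text)  # "10" and 10 become the same number
        except ValueError:
            pass
        if "/" in text or "\\" in text:
            # Treat as a path: unify separators and collapse "//" and "./".
            return posixpath.normpath(text.replace("\\", "/"))
        return " ".join(text.split())  # collapse internal whitespace
    if isinstance(value, dict):
        return {key: normalize_value(val) for key, val in value.items()}
    if isinstance(value, list):
        return [normalize_value(item) for item in value]
    return value


def functionally_equivalent(a: Dict[str, Any], b: Dict[str, Any]) -> bool:
    """Same tool name and same normalized params (dict comparison ignores key order)."""
    if a.get("tool") != b.get("tool"):
        return False
    return normalize_value(a.get("params", {})) == normalize_value(b.get("params", {}))
```

A reward built on this would presumably return full credit for equivalent calls and fall back to partial matching otherwise; that scoring policy is not shown here.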

Phase 3: pivot_config.yaml preset (pivot enabled, functional_equivalence reward at weight 0.5). train_grpo.py gains --pivot-profile-only flag and conditional pivot branch that profiles SFT turns before dataset loading. Reward function build moved earlier so profiling can use it.

Phase 4: SKILL.md quick reference entries and grpo-training.md section covering PivotRL usage, config reference, and key metrics.

https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ

https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ

36 tests covering:
- Candidate extraction (single/multi-turn, system messages)
- Variance filtering (threshold, mean range, max cap, min warning)
- Dataset output format validation
- Value normalization (bool/numeric coercion, paths, whitespace)
- Tool call extraction (Qwen/Mistral/plain formats)
- Argument comparison (exact/partial/no match)
- Functional equivalence reward (matching, wrong tool, fallbacks)
- Config backward compatibility and pivot preset defaults

https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ

…search-RB3e9

New preset for training models to search a corpus, select relevant documents, and produce grounded answers. Inspired by Chroma Context-1's explore-verify-extend pattern. All templates are tool-agnostic with placeholder tool names; users must confirm their actual search/read tools before proceeding.

Deliverables:
- Scenario template (seed corpus + find-and-answer + multi-hop)
- Three rubrics (search_term_quality, doc_selection, groundedness)
- Eval preset with 10 self-contained test cases
- End-to-end case study doc (third alongside tool-calling and essay-style)
- Updated SKILL.md index + synced mirror trees

https://claude.ai/code/session_01TY65D9dzmbD3DaXuRJR8Bo

Eval template now supports two modes:
- Static (AS_*): corpus in system prompt, quick behavioral check
- Runtime (ASR_*): real files via fixture, tool calls actually executed

5 runtime tests mirror the most important static tests but require the model to actually search and read files to discover content. Two presets: agentic_search (static) and agentic_search_runtime.

https://claude.ai/code/session_01TY65D9dzmbD3DaXuRJR8Bo

Runtime eval fixtures must not overlap with training data — eval should measure capability, not memorization. Added prominent warnings in both the eval YAML and the pipeline case study doc. Also documented both static and runtime eval modes in the case study. https://claude.ai/code/session_01TY65D9dzmbD3DaXuRJR8Bo

…er-e50N5

Comprehensive research document mapping the open-source mech interp landscape to our fine-tuning/eval/improvement pipeline. Identifies 6 concrete integration points: post-training diagnostics via SAEs, interp-aware evaluation validators, flywheel feature drift detection, mechanistically-guided LoRA placement, activation steering for inference-time fixes, and SynthChat representation-level validation. https://claude.ai/code/session_0116APH1YFGUBjuC8RznS3Wx

Remove tool-calling-specific framing from all integration points. The pipeline now generalizes to any fine-tuning objective: essay writing, code generation, domain adaptation, agentic behavior, etc. Added task agnosticism section and multi-task examples throughout. https://claude.ai/code/session_0116APH1YFGUBjuC8RznS3Wx

…bility-research-GJYpb

…aluator DRY (#76)

* Improve experiment workflows and harden SFT preprocessing

* Refactor tuner: extract stage runners and decompose hf_jobs_backend

Split experiment_handler.py (946→207 lines) by extracting HFTrainingStageRunner, HFEvalStageRunner, and HFLossStageRunner into tuner/handlers/stages/ subpackage. ExperimentHandler remains as the orchestrator.

Decompose HFJobsBackend (990→465 lines) into 4 focused mixins: CommandBuilder, JobWatcher, BucketOps, PostTraining. External API (ITrainingBackend) unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Refactor evaluator: DRY display tables and declarative check registry

Replace 7 near-identical _display_*_table methods in eval_handler.py with TableSpec dataclass + generic _display_table renderer. Conditional columns handled via dynamic spec construction.

Replace 18 _check_* methods in config_validator.py with declarative CheckDescriptor registry + stateless _run_* functions. Checks are now decoupled, independently testable, and trivially extensible.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Add comprehensive tests for SOLID/DRY refactoring batch 1

122 new tests covering all 4 refactored areas:
- Stage runner extraction: import isolation, Protocol contracts, re-exports
- HFJobsBackend mixins: MRO composition, cross-mixin calls, ITrainingBackend
- EvalHandler TableSpec: Rich/plain-text rendering, conditional columns
- CheckDescriptor registry: all 18 entries, tool_sequence composite, adversarial

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Address review findings: cleanup imports, normalize style, add behavioral tests

Review remediation (F1-F6):
- Remove unused imports: StageResult, List (F1, F2)
- Remove redundant inline shutil import (F3)
- Normalize Optional[X] to X | None across stage runners (F4)
- Document mixin dependency order in HFJobsBackend docstring (F5)
- Add 56 behavioral tests for stage runner .run() methods and recovery state machines (F6)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
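
The "declarative CheckDescriptor registry + stateless _run_* functions" pattern from the evaluator refactor above might look roughly like the sketch below. Only the names `CheckDescriptor` and the `_run_` prefix come from the commit message; the descriptor fields, the two example checks, and `run_checks` are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# A check takes a config dict and returns an error message, or None if it passes.
CheckFn = Callable[[Dict[str, Any]], Optional[str]]


@dataclass(frozen=True)
class CheckDescriptor:
    """Declarative entry: a name plus the stateless function that runs the check."""
    name: str
    run: CheckFn


def _run_has_model(config: Dict[str, Any]) -> Optional[str]:
    return None if config.get("model") else "missing 'model'"


def _run_positive_epochs(config: Dict[str, Any]) -> Optional[str]:
    epochs = config.get("epochs", 0)
    return None if isinstance(epochs, int) and epochs > 0 else "'epochs' must be > 0"


REGISTRY: List[CheckDescriptor] = [
    CheckDescriptor("has_model", _run_has_model),
    CheckDescriptor("positive_epochs", _run_positive_epochs),
]


def run_checks(
    config: Dict[str, Any], registry: List[CheckDescriptor] = REGISTRY
) -> Dict[str, str]:
    """Run every registered check and collect failures keyed by check name."""
    failures: Dict[str, str] = {}
    for descriptor in registry:
        message = descriptor.run(config)
        if message is not None:
            failures[descriptor.name] = message
    return failures
```

Adding a check then means writing one function and appending one registry entry, which is what makes each check independently testable.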

…rategy pattern (#77)

* Extract shared trainer utilities and unify lineage flow

DRY extraction phases 1-5:
- Create shared/env_bootstrap.py: init_trainer_env() consolidates env vars, Windows patches, dotenv, logging across all trainers
- Create shared/training_utils.py: setup_wandb, save_training_lineage, extract_previous_log_entries (KTO canonical), build_base_lineage (sub-dict API), apply_tier_preset
- GRPO now generates training_lineage.json (unified lineage flow)
- SFT: 1351→1174 lines, KTO: 1329→1101 lines

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Refactor LoRA surgery with Strategy pattern

Extract 8 surgical operations from LoRASurgeon (1,054 lines) into shared/evolutionary/surgery/ package with Protocol-based strategies:
- 8 operation classes in surgery/operations/
- Decorator-based static registry
- LoRASurgeon reduced to thin orchestrator (232 lines)
- Original lora_surgery.py becomes 53-line backward-compat shim
- All 60 existing tests pass (40 surgery + 20 karpathy integration)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Add comprehensive tests for trainer DRY extraction and LoRA surgery

50 new tests covering both batch 2 refactoring areas:
- 11 tests: init_trainer_env() flag combinations and logging
- 25 tests: setup_wandb, extract_previous_log_entries, save/build lineage, apply_tier_preset
- 14 tests: surgery registry API, Protocol conformance, async typing, backward-compat shim, context manager, proxy methods

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Address review findings: async typing, imports, config ISP, test robustness

Review remediation (R1-R8):
- Fix evaluate_fn type: Callable[[str], Awaitable[float]] (R1)
- Remove unused json imports from train_sft/kto (R2)
- Move stray import in train_grpo to top (R3)
- Use monkeypatch in env_bootstrap tests (R4)
- Add deprecation docstring to lora_surgery shim (R6)
- Refactor SurgeryConfig with per-operation config groups (R7)
- Add 4 negative-path tests for sft_preprocessing (R8)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

…ules (#78)

* Decompose SynthChat generator God class into focused modules

Break SynthChatGenerator (3,384→1,635 lines) and run.py (1,042→220 lines) into 14 focused modules following the 9-step decomposition roadmap:

Extracted modules:
- template_utils.py: template rendering utilities
- targets.py: target spec normalization (DRY fix)
- workspace/: environment rendering (renderer, sections, fixture_helpers)
- schemas/: JSON schemas (environment, tool response)
- labeling.py: metadata label classification
- parsing.py: response parsing and normalization
- llm/: LLM client pool management
- review.py: stage review and judge templates
- agentic/: episode generation and turn management
- modes/: CLI mode handlers (generate, improve, validate)
- parallel/: worker pool and parallel execution
- result_writer.py: streaming output management

All 43 existing tests pass. Backward-compat re-exports preserved.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Add comprehensive tests for SynthChat decomposition

263 new tests across 13 files covering all 14 extracted modules:
- Import isolation and backward-compat re-exports
- template_utils, targets, parsing, labeling, schemas
- workspace rendering, LLM client pool, review, agentic
- result_writer, modes, parallel workers

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* Make SynthChat config-driven: tool schemas, workspace, labels via YAML

Replace hardcoded scenario-specific behavior with 3 config registries:
- tool_call_formats.yaml: tool-call response schema, prompt instructions, wrapper name, context fields (fully configurable per format)
- workspace_formats.yaml: system prompt section order, tag names, default values, selected workspace fields
- label_mappings.yaml: issue-to-behavior classification rules and label rollup groups

Resolution priority: scenario inline > string reference > tool_format config > "default" from registry. Backward compatible: existing scenarios work unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Strip backward-compat shims and clean up hardcoded defaults

Remove all backward-compatibility layers (no external consumers):
- SynthChat/generator.py: remove re-export block for extracted functions
- SynthChat/schemas/tool_response_schema.py: remove legacy signatures, keep only config-driven versions
- SynthChat/workspace/renderer.py: remove _render_legacy path
- SynthChat/labeling.py: make config params required
- shared/evolutionary/lora_surgery.py: convert shim to thin re-export without underscore-prefixed aliases
- shared/evolutionary/surgery/utils.py: remove _prefixed compat aliases
- tests/synthchat/test_backward_compat.py: removed entirely
- Update all tests to import from new module paths directly

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Document config-driven architecture and no-hardcoding principle

- CLAUDE.md: Add NO HARDCODING rule + no backward-compat shims rule
- README.md: Add Config-Driven Architecture section, clarify useTools is a toy example not ground truth
- SKILL.md: Add config-driven architecture section with key config files

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
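
The resolution priority stated in the first commit above (scenario inline > string reference > tool_format config > "default" from registry) can be expressed as a short lookup chain. This is a sketch under assumptions: the function name `resolve_tool_call_format`, the scenario key `tool_call_format`, and the registry being a plain dict are all hypothetical.

```python
from typing import Any, Dict, Optional


def resolve_tool_call_format(
    scenario: Dict[str, Any],
    registry: Dict[str, Dict[str, Any]],
    tool_format: Optional[str] = None,
) -> Dict[str, Any]:
    """Pick a format spec following the documented priority order."""
    spec = scenario.get("tool_call_format")  # hypothetical scenario key
    if isinstance(spec, dict):
        return spec                      # 1. inline definition in the scenario
    if isinstance(spec, str):
        return registry[spec]            # 2. string reference into the registry
    if tool_format is not None:
        return registry[tool_format]     # 3. tool_format from config
    return registry["default"]           # 4. registry default
```

Because a scenario without the key falls through to the registry default, scenarios written before the registries existed would keep working, which matches the backward-compatibility note in that commit.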

- Fine-tuning skill: experiment configs (gemma4, qwen3, qwen35 A100 templates) and HF Spaces warm iteration reference - Refactoring plans from SOLID/DRY analysis session (6 plan docs) - Qwen3.5 4B A100 cloud experiment spec - SFT model loader source test Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Standalone Python script downloads a HF model, converts to GGUF via llama.cpp's pure-Python converter, and uploads the result. Job config uses cpu-upgrade flavor (no GPU needed for conversion). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

The unsloth cloud image ships an older transformers that lacks Gemma 4 tokenizer support, causing AttributeError in special_tokens handling. Pinning transformers>=4.52.0 resolves the incompatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

The Gemma 4 model's tokenizer_config.json has extra_special_tokens as a list, but transformers expects a dict. Patch the config before running the converter to avoid AttributeError in tokenizer initialization. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…options, update documentation for GGUF conversion, and improve tool call parsing for Gemma format

Local model artifacts (pulled from HF bucket) were slowing git operations. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Completes the checkpoint control flags added in 1095859 — these are the trainer-side argument parsing and config override that were blocked by the pre-commit hook false positive on tokenizer-related print statements. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Full precision (no 4-bit), 3 epochs, save every 200 steps (keep 10), pip upgrades transformers>=5.0 for Gemma 4 architecture support. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Allows experiment specs to control checkpoint frequency and retention. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…sing

Multimodal models (Gemma 4, Qwen-VL) return a Processor from AutoTokenizer.from_pretrained(). Processors have apply_chat_template() but lack encode(). Unwrap to inner .tokenizer for encode() calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
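
The unwrapping described above can be done with duck typing rather than class checks. A minimal sketch: `unwrap_tokenizer` is a hypothetical helper name, and the two stub classes stand in for real tokenizer and processor objects so the example runs without any model download.

```python
from typing import Any, List


def unwrap_tokenizer(obj: Any) -> Any:
    """Return an object that supports encode(): `obj` itself or its inner .tokenizer."""
    if hasattr(obj, "encode"):
        return obj
    inner = getattr(obj, "tokenizer", None)
    if inner is not None and hasattr(inner, "encode"):
        return inner
    raise TypeError(f"{type(obj).__name__} has no encode() and no inner tokenizer")


class FakeTokenizer:
    """Stub with the one method the caller needs."""
    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]


class FakeProcessor:
    """Stub mimicking a processor: wraps a tokenizer but has no encode() itself."""
    def __init__(self) -> None:
        self.tokenizer = FakeTokenizer()

    def apply_chat_template(self, messages: list) -> str:
        return "\n".join(m["content"] for m in messages)
```

Calling `unwrap_tokenizer(loaded).encode(text)` then works whether `loaded` is a plain tokenizer or a processor-style wrapper.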

claude and others added 30 commits December 24, 2025 13:03

Remove commandManager from fitness config

bdfc586

feat: Add model name parameter to compile_webgpu for safe system lib …

527c794

…prefix generation

cleaning up datasets

5da6af4

Add webllm_output directory to .gitignore

edc277b

Update validation requirements for tool calls in tool_calling.yaml

7b2eb16

Add future work section on exploration vs exploitation strategies for…

07dbe83

… evolutionary fine-tuning

updated with gspo

0b819d5

getting GRPO/GSPO running

cdbc75d

updating tool schemas and data

192d91f

updated datasets, synthchat tested

853eadf

updated datasets, added mac functionality

c842096

Enhance MacBackend to support live training progress display and add …

87c9d8a

…TrainingProgressDisplay for unified output across backends

remove sft output folder

678fd64

removing output folder

c22b6cf

working on grpo

98098e4

Merge branch 'main' of https://github.com/ProfSynapse/Toolset-Training

994048d

Update tool schemas: modify generated date, remove redundant folder o…

345b2bc

…perations, and enhance descriptions for clarity

Update tool schemas: modify generated date and adjust required fields…

24dc229

… for clarity

major update to cli ui and updated some of the tool schemas

16dd40c

Add LiveDashboard support for GRPO and KTO training; enhance dashboar…

90aa59b

…d metrics

claude and others added 29 commits March 27, 2026 00:59

Sync skill mirror trees after PivotRL docs update

a279bd6

https://claude.ai/code/session_016LzqdWVDkBqELE5b33yFAJ

Merge pull request #73 from ProfSynapse/claude/pivotrl-integration-re…

e6fb966

…search-RB3e9

Merge pull request #74 from ProfSynapse/claude/integrate-synaptic-tun…

04fe177

…er-e50N5

Merge pull request #75 from ProfSynapse/claude/mechanistic-interpreta…

f24d894

…bility-research-GJYpb

Enhance cloud training configuration: add save steps and total limit …

1095859

…options, update documentation for GGUF conversion, and improve tool call parsing for Gemma format

Add toolset-training-artifacts/ to gitignore

5091e84

Add Gemma 4 E4B SFT v2 experiment spec

0542cb2

Add save_steps/save_total_limit to TrainingStageSpec

a1dd4de

Add Docker-first local runtime bootstrap and eval flow

ca4a18c

Document Docker-first local workflow

a84246a

ProfSynapse force-pushed the main branch from 5824c4f to df8de53 on June 22, 2026 20:04

ProfSynapse wants to merge 763 commits into main from codex/docker-local-runtime-bootstrap

2 participants
