
Commit bcf4370: update optimize (parent ce57d5c)

1 file changed: 26 additions & 62 deletions
After preparing quality training data, the next challenge involves configuring fine-tuning parameters and interpreting metrics to optimize model performance. This unit covers understanding key training configurations, monitoring and interpreting training metrics, and making optimization decisions based on evidence.

## Understand key training configurations

Fine-tuning jobs expose several hyperparameters that significantly impact model performance.

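These settings are typically submitted together as a single hyperparameter object when a fine-tuning job is created. A minimal sketch using OpenAI-style field names (`batch_size`, `learning_rate_multiplier`, `n_epochs`, `seed`); these names are an assumption, so verify them against the SDK version you actually use:

```python
# Hypothetical hyperparameter block for a fine-tuning job submission.
# Field names follow the OpenAI-style fine-tuning API and may differ
# in your Foundry SDK; treat them as illustrative only.
hyperparameters = {
    "batch_size": "auto",             # let the service pick its model-specific default
    "learning_rate_multiplier": 0.1,  # baseline within the recommended 0.05-0.1 range
    "n_epochs": 4,                    # start with 3-4 epochs and adjust from metrics
    "seed": 42,                       # fix the seed when comparing configurations
}
```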
### Batch size

**What it controls**: Number of training examples processed together in one forward-backward pass during training. The model updates parameters after processing each batch.

**How it affects training**:

| **Larger batches** (8-32) | **Smaller batches** (1-4) |
| --- | --- |
| More stable parameter updates with lower variance, faster training on powerful hardware, but requires more memory and may struggle with small datasets | Higher variance in updates can help escape local optima, works with limited memory, but training takes longer |

**Optimization strategy**: Microsoft Foundry sets model-specific defaults that work for most cases. Start with defaults. If training is unstable (loss fluctuates wildly), increase batch size. If overfitting occurs (training loss decreases but validation loss increases), decrease batch size to add regularization through noise.

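Batch size also determines how many parameter updates one pass over the data performs, which is why small batches produce noisier training. A framework-agnostic sketch (the 300-example dataset size is just an illustration):

```python
import math

def updates_per_epoch(num_examples: int, batch_size: int) -> int:
    """Parameter updates in one full pass; a final partial batch still updates."""
    return math.ceil(num_examples / batch_size)

# For a 300-example dataset: 1 -> 300, 4 -> 75, 8 -> 38, 32 -> 10 updates.
for bs in (1, 4, 8, 32):
    print(bs, updates_per_epoch(300, bs))
```

Fewer, larger batches average the gradient over more examples per update, which is where the stability-versus-noise trade-off in the table comes from.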
### Learning rate multiplier

**What it controls**: Scaling factor applied to the base learning rate. The effective learning rate = base model's pretrained learning rate × multiplier. Controls how aggressively the model adjusts parameters during fine-tuning.

**How it affects training**:

| **Higher multipliers** (0.1-0.2) | **Lower multipliers** (0.02-0.05) |
| --- | --- |
| Faster adaptation to training data, risks overshooting optimal parameters or catastrophic forgetting of pretrained knowledge | Gentler adaptation preserving more pretrained knowledge, slower convergence, may underfit if too conservative |

**Optimization strategy**: Recommended range is 0.02-0.2. Start with 0.05-0.1 as baseline. If training loss remains high after multiple epochs, try 0.15-0.2. If validation loss increases while training loss decreases (overfitting), reduce to 0.02-0.05. Larger batch sizes generally pair well with higher learning rates.

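The multiplier is relative: it scales a model-specific base learning rate rather than setting an absolute value. A sketch with a hypothetical base rate of 2e-5 (the real base is internal to the service and varies by model):

```python
def effective_learning_rate(base_lr: float, multiplier: float) -> float:
    """Effective LR = pretrained base learning rate x multiplier."""
    return base_lr * multiplier

BASE_LR = 2e-5  # hypothetical; the service does not expose the real value
print(effective_learning_rate(BASE_LR, 0.05))  # conservative, ~1e-6
print(effective_learning_rate(BASE_LR, 0.2))   # aggressive, ~4e-6
```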
### Number of epochs

**What it controls**: How many times the model sees the entire training dataset. One epoch = one complete pass through all training examples.

**How it affects training**:

| **More epochs** (5-10) | **Fewer epochs** (1-3) |
| --- | --- |
| Better convergence, higher risk of overfitting, increased cost | Faster and cheaper, may underfit, leaves performance on the table |

**Optimization strategy**: Start with 3-4 epochs. Monitor validation metrics—if validation loss stops improving or starts increasing while training loss keeps decreasing, you're overfitting and should reduce epochs. If both training and validation loss are still decreasing at your final epoch, add 1-2 more epochs.

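The epoch guidance above is effectively an early-stopping rule. A hypothetical helper that applies it to a per-epoch validation-loss history (the function name, history values, and patience threshold are all illustrative):

```python
def should_stop(val_losses: list, patience: int = 2) -> bool:
    """True once validation loss has not improved for `patience` epochs."""
    best_epoch = val_losses.index(min(val_losses))
    return (len(val_losses) - 1 - best_epoch) >= patience

history = [1.30, 1.12, 1.05, 1.07, 1.11]  # best at epoch 3, worse twice since
print(should_stop(history))  # True: stop adding epochs rather than overfit
```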
### Seed

**What it controls**: Random seed for reproducibility. The same seed with the same data and the same parameters should produce nearly identical results (small differences can occur due to hardware variations).

**Optimization strategy**: Set a fixed seed when comparing configuration changes—this isolates the effect of your hyperparameter adjustment from random variation. Use different seeds when training production models to avoid inadvertent memorization patterns.
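To see why a fixed seed makes configuration comparisons fair, note that seeded randomness reproduces the same shuffles and samples on every run. A small standard-library illustration (not the service's internal mechanism):

```python
import random

def sample_examples(seed: int, data: list, k: int = 3) -> list:
    """Seeded sampling: the same seed always yields the same draw."""
    return random.Random(seed).sample(data, k)

examples = list(range(100))
# Same seed, same data, same parameters -> identical selection every run.
assert sample_examples(42, examples) == sample_examples(42, examples)
print("same seed reproduces the same sample")
```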

## Monitor and interpret training metrics

Microsoft Foundry provides training and validation metrics during and after fine-tuning.

### Training loss

**What it measures**: Average prediction error on training data during each epoch. Lower values indicate the model better predicts training examples.

**What to watch for**:

- **Steadily decreasing**: Normal healthy training progression
- **Plateaus early**: Model may need higher learning rate, more epochs, or better data quality
- **Fluctuates wildly**: Batch size too small or learning rate too high—training is unstable

Adventure Works saw training loss decrease from 2.1 → 1.8 → 1.5 → 1.3 across four epochs.
### Validation loss

**What it measures**: Average prediction error on held-out validation data (never seen during training). Measures generalization to new examples.

**What to watch for**:

- **Decreases with training loss**: Ideal scenario—model learns genuine patterns that generalize
- **Plateaus while training loss decreases**: Early overfitting signal—model memorizing training data
- **Increases while training loss decreases**: Clear overfitting—stop training or use fewer epochs

Adventure Works tracked their safety DPO validation loss carefully. It decreased …
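The overfitting signals above can be checked mechanically from the two loss curves. A rough diagnostic sketch that compares only the first and last per-epoch values (real monitoring should inspect the full curves):

```python
def diagnose(train_losses: list, val_losses: list) -> str:
    """Crude first-vs-last trend check over per-epoch losses (illustrative)."""
    train_improving = train_losses[-1] < train_losses[0]
    val_improving = val_losses[-1] < val_losses[0]
    if train_improving and not val_improving:
        return "overfitting"   # training improves while validation worsens
    if train_improving and val_improving:
        return "healthy"       # both curves still improving
    return "underfitting or stalled"

print(diagnose([2.1, 1.8, 1.5, 1.3], [2.0, 1.8, 1.7, 1.6]))  # healthy
print(diagnose([2.1, 1.5, 1.0, 0.6], [2.0, 1.8, 1.9, 2.2]))  # overfitting
```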
### Token accuracy

**What it measures**: Percentage of predicted tokens matching expected tokens in the validation set. Higher percentages indicate better prediction accuracy.

**What to watch for**:

- **Improving steadily**: Good learning signal
- **Very high (>95%)**: May indicate data leakage, overfitting, or an excessively simple task
- **Very low (<40%)**: May indicate task too difficult, data quality issues, or configuration problems

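Token accuracy reduces to a position-by-position comparison. An illustrative sketch with made-up token sequences, not the metric's exact service-side implementation:

```python
def token_accuracy(predicted: list, expected: list) -> float:
    """Fraction of positions where predicted and expected tokens match."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)

pred = ["The", "tent", "weighs", "2.5", "kg"]
gold = ["The", "tent", "weighs", "2.1", "kg"]
print(token_accuracy(pred, gold))  # 0.8: four of five tokens match
```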
## Make optimization decisions based on evidence

Effective optimization requires systematic experimentation informed by metrics rather than random hyperparameter adjustments. Follow this workflow to move from baseline to production-ready models while balancing quality and cost:

1. **Establish baseline performance**: Train one model with default parameters (default batch size, learning rate multiplier 0.05-0.1, 3-4 epochs). Reserve 15-20% of data for validation. Deploy to the Developer tier for 24-hour testing. Measure validation loss, token accuracy, and real-world performance on 20-30 test queries.

1. **Diagnose performance problems**: Use metrics and testing to identify root causes. If both training and validation loss are high (underfitting), increase learning rate or add epochs. If training loss is low but validation loss is high (overfitting), decrease epochs or lower learning rate. If training is unstable (loss fluctuates), increase batch size or halve learning rate. If metrics look good but real-world performance is poor, revise training data to match actual usage patterns.

1. **Run controlled experiments**: Change one hyperparameter at a time while keeping others constant to isolate effects. Train the new model and compare validation metrics to baseline. Deploy the improved model and test on the same queries used for baseline. Measure whether metric improvements translate to better user experience.

1. **Iterate or deploy**: If improvement is substantial (>5%), consider one more refinement cycle. Otherwise, deploy the current model. Remember that the optimal model delivers acceptable performance at sustainable cost—not necessarily the lowest validation loss.

1. **Manage deployment costs**: Use Developer tier (free hosting, auto-deletes after 24 hours) for all experimentation. Deploy validated models to Standard tier for production (incurs hourly hosting fees). Factor in training costs—token-based for SFT/DPO, time-based for RFT—as more epochs and iterations increase expenses.

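For token-billed methods (SFT/DPO), training cost scales linearly with epochs because each epoch re-processes the full dataset. A sketch with a hypothetical dataset size and price (actual Foundry rates vary by model and are not quoted in this unit):

```python
def sft_training_cost(dataset_tokens: int, epochs: int, usd_per_million_tokens: float) -> float:
    """Token-based billing estimate: billed tokens = dataset tokens x epochs."""
    return dataset_tokens * epochs / 1_000_000 * usd_per_million_tokens

# Hypothetical 500k-token training set at an assumed $3 per million tokens:
print(sft_training_cost(500_000, 4, 3.0))  # 6.0
print(sft_training_cost(500_000, 8, 3.0))  # 12.0: doubling epochs doubles cost
```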
Adventure Works followed this workflow for their trip planning RFT. Their baseline (Config A: learning_rate=0.08, epochs=6) achieved validation loss 0.9 and 72% reasoning quality. Config B (learning_rate=0.12, epochs=6) improved to validation loss 0.85 and 78% reasoning quality (+6%). Config C (learning_rate=0.12, epochs=8) reached validation loss 0.82 and 79% reasoning quality (+1% vs B). They deployed Config B because the marginal improvement from B to C didn't justify additional training cost and overfitting risk.
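The decision between Config B and Config C can be framed as the workflow's improvement threshold. A sketch using the reasoning-quality numbers from the example (the 5-point cutoff mirrors the ">5%" rule of thumb above):

```python
def worth_iterating(prev_quality: float, new_quality: float, threshold: float = 0.05) -> bool:
    """Continue refining only when the quality gain beats the threshold."""
    return (new_quality - prev_quality) > threshold

print(worth_iterating(0.72, 0.78))  # True: +6 points justified another cycle
print(worth_iterating(0.78, 0.79))  # False: +1 point, so deploy Config B
```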

Stop optimization when validation metrics meet your quality threshold, real-world testing shows acceptable performance, marginal improvements fall under 3-5%, or additional training cost exceeds the value of improvement. Perfect isn't necessary—sufficient is optimal.

By understanding how configurations affect training, monitoring metrics to diagnose problems, and making evidence-based decisions through systematic experimentation, you optimize fine-tuning investments for maximum practical impact while controlling costs.
