[Examples]: GPT-oss, Qwen3Moe streaming specdec example by h-guo18 · Pull Request #1692 · NVIDIA/Model-Optimizer

h-guo18 · 2026-06-11T21:55:10Z

What does this PR do?

Type of change: new example

Adds streaming speculative-decoding examples (EAGLE3 + DFlash) for gpt-oss-20b and Qwen3-30B-A3B to the ModelOpt launcher, mirroring the existing Qwen3-8B/Kimi examples.

New yamls: tools/launcher/examples/{openai/gpt-oss-20b,Qwen/Qwen3-30B-A3B}/hf_streaming_{eagle3,dflash}_multi_node.yaml, plus gpt-oss chat_template_train.jinja (generation-tagged, for answer_only_loss).
eagle_utils.py: the streaming path now installs a custom data.chat_template on the tokenizer, which the online path already honored. This is needed for the tagged template.
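
A minimal sketch of what installing the template amounts to, assuming a Hugging Face style tokenizer whose `chat_template` attribute is used by `apply_chat_template` when no `chat_template=` argument is passed. The helper name `install_chat_template` and the `SimpleNamespace` stand-in tokenizer are hypothetical and only illustrate the idea; the actual change in eagle_utils.py may look different.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def install_chat_template(tokenizer, template_path):
    """Overwrite tokenizer.chat_template from a file; no-op if no path is given.

    Hypothetical helper, not the function used in eagle_utils.py.
    """
    if not template_path:
        return tokenizer
    tokenizer.chat_template = Path(template_path).read_text(encoding="utf-8")
    return tokenizer


# Stand-in for a real tokenizer object (assumption: only the attribute matters here).
tokenizer = SimpleNamespace(chat_template="{{ messages }}")

with tempfile.TemporaryDirectory() as tmp:
    template_file = Path(tmp) / "chat_template_train.jinja"
    # A toy generation-tagged template body, not the real gpt-oss template.
    template_file.write_text(
        "{% generation %}{{ message.content }}{% endgeneration %}",
        encoding="utf-8",
    )
    install_chat_template(tokenizer, str(template_file))

print(tokenizer.chat_template)
```

Setting the attribute once before dataset creation means every later templating call in the streaming path picks up the tagged template without needing an extra argument.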

Usage

cd tools/launcher
export SLURM_HOST=... SLURM_ACCOUNT=... SLURM_HF_LOCAL=... SLURM_JOB_DIR=...
uv run launch.py --yaml examples/openai/gpt-oss-20b/hf_streaming_eagle3_multi_node.yaml --yes

Testing

Pipeline sanity test on unsynthesized data (daring-anteater), 1× H100-80GB, 12k steps. All four train and pass the vLLM acceptance-length eval:

Model	Method	Train speed	vLLM AL
Qwen3-30B-A3B	EAGLE3	7.12 it/s	1.74
Qwen3-30B-A3B	DFlash	2.31 it/s	1.29
gpt-oss-20b	EAGLE3	5.07 it/s	1.19
gpt-oss-20b	DFlash	2.01 it/s	1.14

This is a sanity test only, not a quality run. The gpt-oss AL is low because gpt-oss is a reasoning model that emits chain-of-thought at inference, while daring-anteater has no reasoning traces and answer_only_loss masks everything except the final content. Quality runs need synthesized data that includes reasoning traces.
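
To illustrate the masking effect, here is a hedged sketch of answer-only label construction. It assumes a per-token assistant mask (1 for tokens inside the generation-tagged span, 0 elsewhere) is already available and that masked positions use the label -100, the default ignore index of PyTorch's cross-entropy loss. The function name and token IDs are made up for illustration and are not taken from the PR.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def answer_only_labels(input_ids, assistant_mask):
    """Keep labels only where assistant_mask is 1; mask everything else.

    Hypothetical helper for illustration, not ModelOpt's implementation.
    """
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [
        token if keep else IGNORE_INDEX
        for token, keep in zip(input_ids, assistant_mask)
    ]


# Toy sequence: four prompt tokens, two answer tokens, one trailing token.
input_ids = [101, 7, 8, 9, 42, 43, 102]
assistant_mask = [0, 0, 0, 0, 1, 1, 0]

labels = answer_only_labels(input_ids, assistant_mask)
supervised = sum(label != IGNORE_INDEX for label in labels)
print(labels, f"{supervised}/{len(labels)} tokens supervised")
```

With data that contains no reasoning traces, the draft model never receives a training signal on chain-of-thought tokens, which is consistent with the low acceptance length reported for gpt-oss above.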

Before your PR is "Ready for review"

Is this change backward compatible?: ✅
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: N/A
Did you update Changelog?: N/A
Did you get Claude approval on this PR?: ❌

Summary by CodeRabbit

Release Notes

New Features
- Added multi-node speculative decoding pipeline configurations for Qwen3-30B-A3B and gpt-oss-20b with DFlash and EAGLE3 support.
- Introduced chat template training support for improved model instruction formatting.
Enhancements
- Increased benchmark concurrency from 1 to 32 across Qwen3-8B configurations for more realistic performance evaluation.
- Extended training runs from 500 to 2000 steps for Kimi-K2.5 models.
- Improved chat template handling in speculative decoding workflows.

Signed-off-by: h-guo18 <[email protected]>

copy-pr-bot · 2026-06-11T21:55:13Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-11T21:55:18Z

📝 Walkthrough

Walkthrough

This PR introduces streaming mode chat template support for speculative decoding, adds a comprehensive training chat template for gpt-oss-20b with tool calling capabilities, creates new multi-node DFlash and EAGLE3 pipeline configurations for both gpt-oss-20b and Qwen3-30B-A3B, and optimizes benchmark performance parameters across multiple existing Qwen3-8B and Kimi-K2.5 configurations.

Changes

Streaming Chat Template Support

Layer / File(s)	Summary
Streaming tokenizer chat template installation `examples/speculative_decoding/eagle_utils.py`	When creating streaming datasets, custom `chat_template` from data arguments is installed onto the tokenizer before dataset creation, enabling template-time features during offline tokenization unlike the online collator path.
gpt-oss-20b training template with tool support `tools/launcher/examples/openai/gpt-oss-20b/chat_template_train.jinja`	Comprehensive Jinja template for training with tool calling and reasoning channels, including TypeScript schema rendering, system/developer message construction, message loop validation with channel tag checking, assistant tool-call formatting, and optional generation prompt support.

gpt-oss-20b Speculative Decoding Pipelines

Layer / File(s)	Summary
DFlash multi-node pipeline `tools/launcher/examples/openai/gpt-oss-20b/hf_streaming_dflash_multi_node.yaml`	Three-task DFlash pipeline: dataset generation, distributed serve/trainer streaming training across 4 nodes with DFlash-specific parameters and capture-layer IDs, and vLLM smoke test validating the exported checkpoint.
EAGLE3 multi-node pipeline `tools/launcher/examples/openai/gpt-oss-20b/hf_streaming_eagle3_multi_node.yaml`	Three-task EAGLE3 pipeline: dataset generation, streaming training across 4 nodes with serve/trainer split and capture-layer configuration, and MT-Bench benchmarking with VLLM and gptoss postprocessing.

Qwen3-30B-A3B Multi-Node Pipelines

Layer / File(s)	Summary
DFlash pipeline configuration `tools/launcher/examples/Qwen/Qwen3-30B-A3B/hf_streaming_dflash_multi_node.yaml`	Complete pipeline orchestrating dataset generation, streaming DFlash training across 2 serve nodes (TP=2) and 2 trainer nodes with DFlash parameters and environment variables, and vLLM smoke test with TP=2 speculative decoding.
EAGLE3 pipeline configuration `tools/launcher/examples/Qwen/Qwen3-30B-A3B/hf_streaming_eagle3_multi_node.yaml`	Complete pipeline orchestrating dataset generation, streaming EAGLE3 training across 4 nodes with serve/trainer split via SERVE_NODES/SERVE_TP and capture-layer IDs, and MT-Bench speculative decoding benchmark.

Benchmark Optimization Updates

Layer / File(s)	Summary
Qwen3-8B concurrency increases `tools/launcher/examples/Qwen/Qwen3-8B/eagle3_quick_check.yaml`, `hf_offline_eagle3.yaml`, `hf_offline_eagle3_ptq.yaml`, `hf_online_eagle3.yaml`, `hf_streaming_eagle3.yaml`, `hf_streaming_eagle3_multi_node.yaml`	Increases VLLM benchmark `--concurrency` from 1 to 32 across six pipeline configurations, enabling higher concurrent request execution during performance evaluation.
Kimi-K2.5 training step extensions `tools/launcher/examples/moonshotai/Kimi-K2.5/hf_streaming_dflash_multi_node.yaml`, `hf_streaming_eagle3_multi_node.yaml`	Increases `training.max_steps` from 500 to 2000 in both DFlash and EAGLE3 pipelines, extending the training budget for improved model convergence.

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

yeyu-nvidia
ChenhanYu

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title '[Examples]: GPT-oss, Qwen3Moe streaming specdec example' directly corresponds to the PR's main objective of adding streaming speculative-decoding examples for gpt-oss-20b and Qwen3-30B-A3B, which is confirmed by the raw summary showing new YAML configurations and template files for these models.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	Checked PR-mentioned Python files (eagle_utils.py, export_hf_checkpoint.py, ar_validate.py) for SECURITY.md anti-patterns (torch.load weights_only=False, np.load allow_pickle=True, trust_remote_cod...


Signed-off-by: h-guo18 <[email protected]>

codecov · 2026-06-11T22:08:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.54%. Comparing base (dd49a46) to head (15fad77).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1692      +/-   ##
==========================================
- Coverage   67.72%   67.54%   -0.18%     
==========================================
  Files         511      511              
  Lines       56168    56498     +330     
==========================================
+ Hits        38037    38164     +127     
- Misses      18131    18334     +203

Flag	Coverage Δ
examples	`40.40% <ø> (-0.91%)`	⬇️
regression	`14.66% <ø> (+0.02%)`	⬆️
unit	`54.34% <ø> (+0.01%)`	⬆️


Signed-off-by: h-guo18 <[email protected]>

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.


Actionable comments posted: 2

🧹 Nitpick comments (1)

tools/launcher/examples/openai/gpt-oss-20b/chat_template_train.jinja (1)

136-140: 📐 Maintainability & Code Quality | 💤 Low value

Redundant conditional: both branches output the same value.

Both the if and else branches emit ",\n". Simplify to a single statement:

♻️ Suggested simplification

-            {%- if not loop.last %}
-                {{- ",\n" }}
-            {%- else %}
-                {{- ",\n" }}
-            {%- endif -%}
+            {{- ",\n" }}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/openai/gpt-oss-20b/chat_template_train.jinja` around
lines 136 - 140, The conditional block checking "if not loop.last" is redundant
because both branches emit the same string; remove the entire if/else and
replace it with a single output of the comma+newline token so the template
simply emits ",\n" (preserving the existing whitespace control like the
surrounding {{- / -}} markers) where the conditional currently appears.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/examples/openai/gpt-oss-20b/chat_template_train.jinja`:
- Around line 47-52: The variable has_object_variants is being set inside a
Jinja2 for-loop which creates a new local scope, so the outer flag never
changes; switch to a namespace() to persist the flag: create a namespace like ns
= namespace(has_object_variants=False), inside the for-loop assign
ns.has_object_variants = True when variant.type == "object", and later check
ns.has_object_variants instead of has_object_variants (references:
has_object_variants, param_spec.oneOf, variant.type).

In
`@tools/launcher/examples/openai/gpt-oss-20b/hf_streaming_dflash_multi_node.yaml`:
- Around line 97-98: The YAML currently hardcodes EXPORT_EXTRA_ARGS:
"--trust_remote_code", which propagates through train_eagle_streaming.sh into
export_hf_checkpoint.py and ultimately into load_vlm_or_llm()
(AutoConfig/AutoModel.from_pretrained trust_remote_code), creating an RCE risk;
remove the hardcoded value and make it opt-in by defaulting EXPORT_EXTRA_ARGS to
empty (or adding a dedicated flag like ENABLE_TRUST_REMOTE_CODE) in the YAML and
update train_eagle_streaming.sh to pass that flag only when explicitly set, and
ensure export_hf_checkpoint.py only forwards trust_remote_code when the flag is
present (or document a clear, audited exception) so load_vlm_or_llm() receives
trust_remote_code=true only when explicitly requested.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 34056ab8-ec72-4e9d-b618-ab8b466cd8a1

📥 Commits

Reviewing files that changed from the base of the PR and between dd49a46 and 15fad77.

📒 Files selected for processing (14)

examples/speculative_decoding/eagle_utils.py
tools/launcher/examples/Qwen/Qwen3-30B-A3B/hf_streaming_dflash_multi_node.yaml
tools/launcher/examples/Qwen/Qwen3-30B-A3B/hf_streaming_eagle3_multi_node.yaml
tools/launcher/examples/Qwen/Qwen3-8B/eagle3_quick_check.yaml
tools/launcher/examples/Qwen/Qwen3-8B/hf_offline_eagle3.yaml
tools/launcher/examples/Qwen/Qwen3-8B/hf_offline_eagle3_ptq.yaml
tools/launcher/examples/Qwen/Qwen3-8B/hf_online_eagle3.yaml
tools/launcher/examples/Qwen/Qwen3-8B/hf_streaming_eagle3.yaml
tools/launcher/examples/Qwen/Qwen3-8B/hf_streaming_eagle3_multi_node.yaml
tools/launcher/examples/moonshotai/Kimi-K2.5/hf_streaming_dflash_multi_node.yaml
tools/launcher/examples/moonshotai/Kimi-K2.5/hf_streaming_eagle3_multi_node.yaml
tools/launcher/examples/openai/gpt-oss-20b/chat_template_train.jinja
tools/launcher/examples/openai/gpt-oss-20b/hf_streaming_dflash_multi_node.yaml
tools/launcher/examples/openai/gpt-oss-20b/hf_streaming_eagle3_multi_node.yaml

coderabbitai · 2026-06-12T23:50:38Z

+        {%- set has_object_variants = false -%}
+        {%- for variant in param_spec.oneOf -%}
+            {%- if variant.type == "object" -%}
+                {%- set has_object_variants = true -%}
+            {%- endif -%}
+        {%- endfor -%}


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Jinja2 scoping bug: has_object_variants never persists outside the loop.

In Jinja2, set inside a for loop creates a new local scope—the assignment won't affect the outer variable. This means has_object_variants will always be false after the loop, regardless of whether object variants exist. The condition at line 53 will never be true.

You correctly use the namespace() pattern elsewhere (lines 209, 277). Apply the same fix here:

🐛 Proposed fix

-        {%- set has_object_variants = false -%}
+        {%- set ns = namespace(has_object_variants=false) -%}
         {%- for variant in param_spec.oneOf -%}
             {%- if variant.type == "object" -%}
-                {%- set has_object_variants = true -%}
+                {%- set ns.has_object_variants = true -%}
             {%- endif -%}
         {%- endfor -%}
-        {%- if has_object_variants and param_spec.oneOf|length > 1 -%}
+        {%- if ns.has_object_variants and param_spec.oneOf|length > 1 -%}
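
The scoping behavior described in this comment can be reproduced in isolation. The sketch below assumes the `jinja2` package is installed and uses two cut-down templates written for this illustration (they are not excerpts of chat_template_train.jinja): one sets a plain variable inside the loop, the other uses `namespace()`.

```python
from jinja2 import Template

# Plain `set` inside the loop body: the assignment stays local to the loop.
BUGGY = """
{%- set has_object_variants = false -%}
{%- for variant in variants -%}
    {%- if variant.type == "object" -%}
        {%- set has_object_variants = true -%}
    {%- endif -%}
{%- endfor -%}
{{- has_object_variants -}}
"""

# namespace(): attribute assignment is visible after the loop ends.
FIXED = """
{%- set ns = namespace(has_object_variants=false) -%}
{%- for variant in variants -%}
    {%- if variant.type == "object" -%}
        {%- set ns.has_object_variants = true -%}
    {%- endif -%}
{%- endfor -%}
{{- ns.has_object_variants -}}
"""

# Toy stand-in for param_spec.oneOf with one object variant.
variants = [{"type": "string"}, {"type": "object"}]

buggy_out = Template(BUGGY).render(variants=variants).strip()
fixed_out = Template(FIXED).render(variants=variants).strip()
print("plain set:", buggy_out)
print("namespace:", fixed_out)
```

The plain-variable template renders `False` even though an object variant is present, while the namespace version renders `True`.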


h-guo18 · 2026-06-12T23:51:46Z

/claude review

claude

Claude review passed — no blocking issues found. LGTM

This is an examples-only PR (launcher YAMLs + a Jinja chat template + a small Python edit). Reviewed for correctness:

Scope:

4 new pipeline YAMLs (gpt-oss-20b/Qwen3-30B-A3B × DFlash/EAGLE3)
New chat_template_train.jinja for gpt-oss-20b (copy of upstream OpenAI Harmony template + {% generation %} tags around assistant content for answer_only_loss=true)
eagle_utils.py: install data.chat_template on the tokenizer for streaming mode (correctly mirrors that the streaming dataset calls apply_chat_template without a chat_template= kwarg, while the online collator threads it through)
Concurrency 1→32 in 6 Qwen3-8B benchmark configs; max_steps 500→2000 for Kimi-K2.5

Verified:

Layer-id math matches helper functions: default_eagle_aux_layer_ids(48)=[1,23,44] and default_eagle_aux_layer_ids(24)=[1,11,20]; build_target_layer_ids(48,5)=[1,12,23,34,45] and build_target_layer_ids(24,5)=[1,6,11,16,21]. Capture IDs in YAML are those +1 plus final layer, as the comments document.
SERVE_MAX_MODEL_LEN is consumed in train_eagle_streaming.sh:154 ✓
EXPORT_EXTRA_ARGS="--trust_remote_code" for DFlash is intentional and gated (used identically in existing Kimi DFlash examples; the export script's --trust_remote_code is opt-in via argparse).
data.chat_template path resolution matches existing pattern (examples/<Org>/<Model>/chat_template_train.jinja is relative to the script's working dir, same as existing Qwen3-0.6B example).
Tokenizer mutation (tokenizer.chat_template = chat_template) is fine — one-shot setup, called once per training run.
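
The layer-id arithmetic in the first item above can be checked with a small sketch. The formulas below are assumptions chosen because they reproduce the values quoted in this review; the real `default_eagle_aux_layer_ids` and `build_target_layer_ids` helpers are not shown in this PR and may be implemented differently. The capture-ID step also assumes that "final layer" means the layer index equal to the number of layers.

```python
def sketch_eagle_aux_layer_ids(num_layers):
    """Assumed rule: an early, a middle, and a late layer (0-based ids)."""
    return [1, num_layers // 2 - 1, num_layers - 4]


def sketch_target_layer_ids(num_layers, num_draft_layers):
    """Assumed rule: ids evenly spaced between layer 1 and layer num_layers - 3."""
    if num_draft_layers < 2:
        raise ValueError("this sketch needs at least two draft layers")
    start, end = 1, num_layers - 3
    step = (end - start) / (num_draft_layers - 1)
    return [round(start + i * step) for i in range(num_draft_layers)]


def sketch_capture_ids(layer_ids, num_layers):
    """Shift each id by one and append the final layer (assumed to be num_layers)."""
    return [i + 1 for i in layer_ids] + [num_layers]


for n in (48, 24):
    aux = sketch_eagle_aux_layer_ids(n)
    target = sketch_target_layer_ids(n, 5)
    print(n, "aux", aux, "target", target)
    print(n, "capture (aux)", sketch_capture_ids(aux, n))
```

For 48 and 24 layers the two sketch functions return the same lists that the review quotes, which makes the "+1 plus final layer" relationship easy to recompute when adapting the YAML to a model with a different depth.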

Notes (not blocking):

CodeRabbit flagged a Jinja2 scoping bug in render_typescript_type's {%- set has_object_variants -%} inside a for-loop — that's a real bug, but it's inherited from the upstream OpenAI gpt-oss-20b chat template (this template is copied from there). It only triggers on tool schemas with oneOf containing object variants; daring-anteater (the validated training data) has no tool calls, so the bug is dead code in this training context. Worth fixing if/when upstream does, but doesn't block.
PR description's vLLM AL results table confirms all four configs train and pass acceptance-length eval end-to-end.

specdec add model: gptoss

ee53ca6

Signed-off-by: h-guo18 <[email protected]>

add dflash

6193a09

Signed-off-by: h-guo18 <[email protected]>

h-guo18 added 3 commits June 12, 2026 06:37

fix yaml

c1ba97f

Signed-off-by: h-guo18 <[email protected]>

fix yamls

6c2e898

Signed-off-by: h-guo18 <[email protected]>

add qwen3 moe example

7186182

Signed-off-by: h-guo18 <[email protected]>

h-guo18 changed the title ~~specdec add model: gptoss~~ [Examples]: GPT-oss, Qwen3Moe streaming specdec example Jun 12, 2026

h-guo18 added 4 commits June 12, 2026 07:24

use 32 for specdec bench concuir

a6deb0f

Signed-off-by: h-guo18 <[email protected]>

polish

8ac0f41

Signed-off-by: h-guo18 <[email protected]>

add chat template and answer only=true for gptoss

eb7495a

Signed-off-by: h-guo18 <[email protected]>

fix: increat block size for gptoss

15fad77

Signed-off-by: h-guo18 <[email protected]>

h-guo18 marked this pull request as ready for review June 12, 2026 23:40

h-guo18 requested a review from a team as a code owner June 12, 2026 23:40

h-guo18 requested review from ChenhanYu and yeyu-nvidia June 12, 2026 23:40

h-guo18 self-assigned this Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026


claude Bot approved these changes Jun 13, 2026

[Examples]: GPT-oss, Qwen3Moe streaming specdec example #1692
h-guo18 wants to merge 9 commits into main from haoguo/gptoss-specexample
