[Examples]: GPT-oss, Qwen3Moe streaming specdec example#1692

Open
h-guo18 wants to merge 9 commits into main from haoguo/gptoss-specexample

h-guo18 commented Jun 11, 2026

Contributor

What does this PR do?

Type of change: new example

Adds streaming speculative-decoding examples (EAGLE3 + DFlash) for gpt-oss-20b and Qwen3-30B-A3B to the ModelOpt launcher, mirroring the existing Qwen3-8B/Kimi examples.

  • New yamls: tools/launcher/examples/{openai/gpt-oss-20b,Qwen/Qwen3-30B-A3B}/hf_streaming_{eagle3,dflash}_multi_node.yaml, plus gpt-oss chat_template_train.jinja (generation-tagged, for answer_only_loss).
  • eagle_utils.py: the streaming path now installs a custom data.chat_template on the tokenizer (the online path already did) — needed for the tagged template.
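
A minimal sketch of the tokenizer change described above, assuming a Hugging Face-style tokenizer whose `chat_template` attribute is used whenever `apply_chat_template` is called without an explicit `chat_template=` argument. `FakeTokenizer` and `install_chat_template` are hypothetical names for illustration only; this is not the code in eagle_utils.py.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FakeTokenizer:
    """Stand-in for a Hugging Face tokenizer; only the attribute we need."""

    chat_template: Optional[str] = None


def install_chat_template(tokenizer, template_path: Optional[str]) -> bool:
    """Overwrite tokenizer.chat_template with the file contents, if a path is given.

    Returns True when a template was installed, False when the default is kept.
    """
    if not template_path:
        return False
    tokenizer.chat_template = Path(template_path).read_text(encoding="utf-8")
    return True


# Demo with a throwaway template file (the template body is illustrative).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "chat_template_train.jinja"
    path.write_text(
        "{% generation %}{{ messages[-1]['content'] }}{% endgeneration %}",
        encoding="utf-8",
    )
    tok = FakeTokenizer()
    installed = install_chat_template(tok, str(path))
```

Installing the template once, before the dataset is built, means every later call that relies on the tokenizer default picks up the tagged template without threading an extra argument through the streaming path.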

Usage

cd tools/launcher
export SLURM_HOST=... SLURM_ACCOUNT=... SLURM_HF_LOCAL=... SLURM_JOB_DIR=...
uv run launch.py --yaml examples/openai/gpt-oss-20b/hf_streaming_eagle3_multi_node.yaml --yes

Testing

Pipeline sanity test on unsynthesized data (daring-anteater), 1× H100-80GB, 12k steps. All four train and pass the vLLM acceptance-length eval:

| Model | Method | Train speed | vLLM AL |
|---|---|---|---|
| Qwen3-30B-A3B | EAGLE3 | 7.12 it/s | 1.74 |
| Qwen3-30B-A3B | DFlash | 2.31 it/s | 1.29 |
| gpt-oss-20b | EAGLE3 | 5.07 it/s | 1.19 |
| gpt-oss-20b | DFlash | 2.01 it/s | 1.14 |

Sanity test only, not a quality run. gpt-oss AL is low because it is a reasoning model (CoT at inference) while daring-anteater has no reasoning traces and answer_only_loss masks all but the final content — quality runs need synthesized/reasoning data.
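
To illustrate why answer-only masking leaves little supervision on this data, here is a rough sketch of the idea, assuming a per-token assistant mask (for example, one derived from the `{% generation %}` tags) and the common convention of using -100 as the ignored label. `mask_labels` and `supervised_fraction` are hypothetical helpers, not ModelOpt functions.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_labels(input_ids, assistant_mask):
    """Keep labels only where assistant_mask is truthy; ignore everything else."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]


def supervised_fraction(labels):
    """Fraction of positions that still contribute to the loss."""
    if not labels:
        return 0.0
    return sum(label != IGNORE_INDEX for label in labels) / len(labels)


# Toy sequence: four prompt tokens followed by two final-answer tokens.
input_ids = [11, 12, 13, 14, 15, 16]
assistant_mask = [0, 0, 0, 0, 1, 1]
labels = mask_labels(input_ids, assistant_mask)
fraction = supervised_fraction(labels)
```

With no reasoning traces in the data, only the short final answers survive the mask, so the draft model sees no supervision on the chain-of-thought tokens it has to predict at inference time.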

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A
  • Did you get Claude approval on this PR?: ❌

Summary by CodeRabbit

Release Notes

  • New Features

    • Added multi-node speculative decoding pipeline configurations for Qwen3-30B-A3B and gpt-oss-20b with DFlash and EAGLE3 support.
    • Introduced chat template training support for improved model instruction formatting.
  • Enhancements

    • Increased benchmark concurrency from 1 to 32 across Qwen3-8B configurations for more realistic performance evaluation.
    • Extended training runs from 500 to 2000 steps for Kimi-K2.5 models.
    • Improved chat template handling in speculative decoding workflows.

copy-pr-bot Bot commented Jun 11, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 11, 2026

Contributor

📝 Walkthrough

This PR introduces streaming mode chat template support for speculative decoding, adds a comprehensive training chat template for gpt-oss-20b with tool calling capabilities, creates new multi-node DFlash and EAGLE3 pipeline configurations for both gpt-oss-20b and Qwen3-30B-A3B, and optimizes benchmark performance parameters across multiple existing Qwen3-8B and Kimi-K2.5 configurations.

Changes

Streaming Chat Template Support

| Layer / File(s) | Summary |
|---|---|
| Streaming tokenizer chat template installation (examples/speculative_decoding/eagle_utils.py) | When creating streaming datasets, custom chat_template from data arguments is installed onto the tokenizer before dataset creation, enabling template-time features during offline tokenization unlike the online collator path. |
| gpt-oss-20b training template with tool support (tools/launcher/examples/openai/gpt-oss-20b/chat_template_train.jinja) | Comprehensive Jinja template for training with tool calling and reasoning channels, including TypeScript schema rendering, system/developer message construction, message loop validation with channel tag checking, assistant tool-call formatting, and optional generation prompt support. |

gpt-oss-20b Speculative Decoding Pipelines

| Layer / File(s) | Summary |
|---|---|
| DFlash multi-node pipeline (tools/launcher/examples/openai/gpt-oss-20b/hf_streaming_dflash_multi_node.yaml) | Three-task DFlash pipeline: dataset generation, distributed serve/trainer streaming training across 4 nodes with DFlash-specific parameters and capture-layer IDs, and vLLM smoke test validating the exported checkpoint. |
| EAGLE3 multi-node pipeline (tools/launcher/examples/openai/gpt-oss-20b/hf_streaming_eagle3_multi_node.yaml) | Three-task EAGLE3 pipeline: dataset generation, streaming training across 4 nodes with serve/trainer split and capture-layer configuration, and MT-Bench benchmarking with vLLM and gptoss postprocessing. |

Qwen3-30B-A3B Multi-Node Pipelines

| Layer / File(s) | Summary |
|---|---|
| DFlash pipeline configuration (tools/launcher/examples/Qwen/Qwen3-30B-A3B/hf_streaming_dflash_multi_node.yaml) | Complete pipeline orchestrating dataset generation, streaming DFlash training across 2 serve nodes (TP=2) and 2 trainer nodes with DFlash parameters and environment variables, and vLLM smoke test with TP=2 speculative decoding. |
| EAGLE3 pipeline configuration (tools/launcher/examples/Qwen/Qwen3-30B-A3B/hf_streaming_eagle3_multi_node.yaml) | Complete pipeline orchestrating dataset generation, streaming EAGLE3 training across 4 nodes with serve/trainer split via SERVE_NODES/SERVE_TP and capture-layer IDs, and MT-Bench speculative decoding benchmark. |

Benchmark Optimization Updates

| Layer / File(s) | Summary |
|---|---|
| Qwen3-8B concurrency increases (tools/launcher/examples/Qwen/Qwen3-8B/eagle3_quick_check.yaml, hf_offline_eagle3.yaml, hf_offline_eagle3_ptq.yaml, hf_online_eagle3.yaml, hf_streaming_eagle3.yaml, hf_streaming_eagle3_multi_node.yaml) | Increases vLLM benchmark --concurrency from 1 to 32 across six pipeline configurations, enabling higher concurrent request execution during performance evaluation. |
| Kimi-K2.5 training step extensions (tools/launcher/examples/moonshotai/Kimi-K2.5/hf_streaming_dflash_multi_node.yaml, hf_streaming_eagle3_multi_node.yaml) | Increases training.max_steps from 500 to 2000 in both DFlash and EAGLE3 pipelines, extending the training budget for improved model convergence. |

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • yeyu-nvidia
  • ChenhanYu
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title '[Examples]: GPT-oss, Qwen3Moe streaming specdec example' directly corresponds to the PR's main objective of adding streaming speculative-decoding examples for gpt-oss-20b and Qwen3-30B-A3B, which is confirmed by the raw summary showing new YAML configurations and template files for these models. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | Checked PR-mentioned Python files (eagle_utils.py, export_hf_checkpoint.py, ar_validate.py) for SECURITY.md anti-patterns (torch.load weights_only=False, np.load allow_pickle=True, trust_remote_cod... |


Signed-off-by: h-guo18 <[email protected]>
codecov Bot commented Jun 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.54%. Comparing base (dd49a46) to head (15fad77).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1692      +/-   ##
==========================================
- Coverage   67.72%   67.54%   -0.18%     
==========================================
  Files         511      511              
  Lines       56168    56498     +330     
==========================================
+ Hits        38037    38164     +127     
- Misses      18131    18334     +203     
| Flag | Coverage Δ |
|---|---|
| examples | 40.40% <ø> (-0.91%) ⬇️ |
| regression | 14.66% <ø> (+0.02%) ⬆️ |
| unit | 54.34% <ø> (+0.01%) ⬆️ |


h-guo18 added 3 commits June 12, 2026 06:37
Signed-off-by: h-guo18 <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
h-guo18 changed the title from "specdec add model: gptoss" to "[Examples]: GPT-oss, Qwen3Moe streaming specdec example" Jun 12, 2026
h-guo18 marked this pull request as ready for review June 12, 2026 23:40
h-guo18 requested a review from a team as a code owner June 12, 2026 23:40
h-guo18 requested review from ChenhanYu and yeyu-nvidia June 12, 2026 23:40
h-guo18 self-assigned this Jun 12, 2026

coderabbitai Bot left a comment

Contributor

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🧹 Nitpick comments (1)
tools/launcher/examples/openai/gpt-oss-20b/chat_template_train.jinja (1)

136-140: 📐 Maintainability & Code Quality | 💤 Low value

Redundant conditional: both branches output the same value.

Both the if and else branches emit ",\n". Simplify to a single statement:

♻️ Suggested simplification
-            {%- if not loop.last %}
-                {{- ",\n" }}
-            {%- else %}
-                {{- ",\n" }}
-            {%- endif -%}
+            {{- ",\n" }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/openai/gpt-oss-20b/chat_template_train.jinja` around
lines 136 - 140, The conditional block checking "if not loop.last" is redundant
because both branches emit the same string; remove the entire if/else and
replace it with a single output of the comma+newline token so the template
simply emits ",\n" (preserving the existing whitespace control like the
surrounding {{- / -}} markers) where the conditional currently appears.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/examples/openai/gpt-oss-20b/chat_template_train.jinja`:
- Around line 47-52: The variable has_object_variants is being set inside a
Jinja2 for-loop which creates a new local scope, so the outer flag never
changes; switch to a namespace() to persist the flag: create a namespace like ns
= namespace(has_object_variants=False), inside the for-loop assign
ns.has_object_variants = True when variant.type == "object", and later check
ns.has_object_variants instead of has_object_variants (references:
has_object_variants, param_spec.oneOf, variant.type).

In
`@tools/launcher/examples/openai/gpt-oss-20b/hf_streaming_dflash_multi_node.yaml`:
- Around line 97-98: The YAML currently hardcodes EXPORT_EXTRA_ARGS:
"--trust_remote_code", which propagates through train_eagle_streaming.sh into
export_hf_checkpoint.py and ultimately into load_vlm_or_llm()
(AutoConfig/AutoModel.from_pretrained trust_remote_code), creating an RCE risk;
remove the hardcoded value and make it opt-in by defaulting EXPORT_EXTRA_ARGS to
empty (or adding a dedicated flag like ENABLE_TRUST_REMOTE_CODE) in the YAML and
update train_eagle_streaming.sh to pass that flag only when explicitly set, and
ensure export_hf_checkpoint.py only forwards trust_remote_code when the flag is
present (or document a clear, audited exception) so load_vlm_or_llm() receives
trust_remote_code=true only when explicitly requested.
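
One way to picture the opt-in behaviour discussed here is the sketch below, assuming the extra arguments arrive as a single shell-style string such as the value of EXPORT_EXTRA_ARGS. Only the `--trust_remote_code` flag name comes from this thread; `build_parser` and `kwargs_from_extra_args` are hypothetical and do not reproduce export_hf_checkpoint.py.

```python
import argparse
import shlex


def build_parser() -> argparse.ArgumentParser:
    """Parser with an opt-in flag: absent means False, present means True."""
    parser = argparse.ArgumentParser(description="Hypothetical export CLI sketch")
    parser.add_argument(
        "--trust_remote_code",
        action="store_true",
        help="Allow execution of custom model code shipped with the checkpoint.",
    )
    return parser


def kwargs_from_extra_args(extra_args: str) -> dict:
    """Turn an EXPORT_EXTRA_ARGS-style string into loader keyword arguments."""
    args = build_parser().parse_args(shlex.split(extra_args))
    return {"trust_remote_code": args.trust_remote_code}


default_kwargs = kwargs_from_extra_args("")                   # flag not requested
opt_in_kwargs = kwargs_from_extra_args("--trust_remote_code")  # explicit opt-in
```

With `action="store_true"` the loader only ever receives `trust_remote_code=True` when the caller wrote the flag, so an empty default keeps remote code disabled.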


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 34056ab8-ec72-4e9d-b618-ab8b466cd8a1

📥 Commits

Reviewing files that changed from the base of the PR and between dd49a46 and 15fad77.

📒 Files selected for processing (14)
  • examples/speculative_decoding/eagle_utils.py
  • tools/launcher/examples/Qwen/Qwen3-30B-A3B/hf_streaming_dflash_multi_node.yaml
  • tools/launcher/examples/Qwen/Qwen3-30B-A3B/hf_streaming_eagle3_multi_node.yaml
  • tools/launcher/examples/Qwen/Qwen3-8B/eagle3_quick_check.yaml
  • tools/launcher/examples/Qwen/Qwen3-8B/hf_offline_eagle3.yaml
  • tools/launcher/examples/Qwen/Qwen3-8B/hf_offline_eagle3_ptq.yaml
  • tools/launcher/examples/Qwen/Qwen3-8B/hf_online_eagle3.yaml
  • tools/launcher/examples/Qwen/Qwen3-8B/hf_streaming_eagle3.yaml
  • tools/launcher/examples/Qwen/Qwen3-8B/hf_streaming_eagle3_multi_node.yaml
  • tools/launcher/examples/moonshotai/Kimi-K2.5/hf_streaming_dflash_multi_node.yaml
  • tools/launcher/examples/moonshotai/Kimi-K2.5/hf_streaming_eagle3_multi_node.yaml
  • tools/launcher/examples/openai/gpt-oss-20b/chat_template_train.jinja
  • tools/launcher/examples/openai/gpt-oss-20b/hf_streaming_dflash_multi_node.yaml
  • tools/launcher/examples/openai/gpt-oss-20b/hf_streaming_eagle3_multi_node.yaml

Comment on lines +47 to +52
{%- set has_object_variants = false -%}
{%- for variant in param_spec.oneOf -%}
{%- if variant.type == "object" -%}
{%- set has_object_variants = true -%}
{%- endif -%}
{%- endfor -%}


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Jinja2 scoping bug: has_object_variants never persists outside the loop.

In Jinja2, set inside a for loop creates a new local scope—the assignment won't affect the outer variable. This means has_object_variants will always be false after the loop, regardless of whether object variants exist. The condition at line 53 will never be true.

You correctly use the namespace() pattern elsewhere (lines 209, 277). Apply the same fix here:

🐛 Proposed fix
-    {%- set has_object_variants = false -%}
+    {%- set ns = namespace(has_object_variants=false) -%}
     {%- for variant in param_spec.oneOf -%}
         {%- if variant.type == "object" -%}
-            {%- set has_object_variants = true -%}
+            {%- set ns.has_object_variants = true -%}
         {%- endif -%}
     {%- endfor -%}
-    {%- if has_object_variants and param_spec.oneOf|length > 1 -%}
+    {%- if ns.has_object_variants and param_spec.oneOf|length > 1 -%}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/openai/gpt-oss-20b/chat_template_train.jinja` around
lines 47 - 52, The variable has_object_variants is being set inside a Jinja2
for-loop which creates a new local scope, so the outer flag never changes;
switch to a namespace() to persist the flag: create a namespace like ns =
namespace(has_object_variants=False), inside the for-loop assign
ns.has_object_variants = True when variant.type == "object", and later check
ns.has_object_variants instead of has_object_variants (references:
has_object_variants, param_spec.oneOf, variant.type).
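
The scoping behaviour can be reproduced in isolation with the sketch below, which assumes the third-party `jinja2` package is installed. The two template strings are reduced stand-ins for the flagged block, not excerpts from chat_template_train.jinja, and `render` is a hypothetical helper.

```python
# Reduced version of the flagged pattern: a plain `set` inside a for-loop.
BROKEN = (
    "{%- set found = false -%}"
    "{%- for v in variants -%}"
    '{%- if v.type == "object" -%}{%- set found = true -%}{%- endif -%}'
    "{%- endfor -%}"
    "{{ found }}"
)

# Same logic, but the flag lives on a namespace object that survives the loop.
FIXED = (
    "{%- set ns = namespace(found=false) -%}"
    "{%- for v in variants -%}"
    '{%- if v.type == "object" -%}{%- set ns.found = true -%}{%- endif -%}'
    "{%- endfor -%}"
    "{{ ns.found }}"
)


def render(source: str, variants: list) -> str:
    """Render a template string against a list of schema variants."""
    from jinja2 import Environment  # third-party; imported lazily on first use

    return Environment().from_string(source).render(variants=variants)


if __name__ == "__main__":
    variants = [{"type": "string"}, {"type": "object"}]
    print(render(BROKEN, variants))  # prints "False": the loop-local set is lost
    print(render(FIXED, variants))   # prints "True": the namespace attribute persists
```

The only difference between the two templates is where the flag is stored, which is why the proposed fix touches three lines and nothing else.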

h-guo18 commented Jun 12, 2026

Contributor Author

/claude review

claude Bot left a comment

Claude review passed — no blocking issues found. LGTM

This is an examples-only PR (launcher YAMLs + a Jinja chat template + a small Python edit). Reviewed for correctness:

Scope:

  • 4 new pipeline YAMLs (gpt-oss-20b/Qwen3-30B-A3B × DFlash/EAGLE3)
  • New chat_template_train.jinja for gpt-oss-20b (copy of upstream OpenAI Harmony template + {% generation %} tags around assistant content for answer_only_loss=true)
  • eagle_utils.py: install data.chat_template on the tokenizer for streaming mode (correctly mirrors that the streaming dataset calls apply_chat_template without a chat_template= kwarg, while the online collator threads it through)
  • Concurrency 1→32 in 6 Qwen3-8B benchmark configs; max_steps 500→2000 for Kimi-K2.5

Verified:

  • Layer-id math matches helper functions: default_eagle_aux_layer_ids(48)=[1,23,44] and default_eagle_aux_layer_ids(24)=[1,11,20]; build_target_layer_ids(48,5)=[1,12,23,34,45] and build_target_layer_ids(24,5)=[1,6,11,16,21]. Capture IDs in YAML are those +1 plus final layer, as the comments document.
  • SERVE_MAX_MODEL_LEN is consumed in train_eagle_streaming.sh:154
  • EXPORT_EXTRA_ARGS="--trust_remote_code" for DFlash is intentional and gated (used identically in existing Kimi DFlash examples; the export script's --trust_remote_code is opt-in via argparse).
  • data.chat_template path resolution matches existing pattern (examples/<Org>/<Model>/chat_template_train.jinja is relative to the script's working dir, same as existing Qwen3-0.6B example).
  • Tokenizer mutation (tokenizer.chat_template = chat_template) is fine — one-shot setup, called once per training run.
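
The layer-id arithmetic in the first bullet can be checked with a short sketch. The formulas below are fitted to the four outputs quoted above and may not match the real helpers for other depths; the `sketch_*` names are hypothetical, and treating "final layer" as the layer count is an assumption.

```python
def sketch_eagle_aux_layer_ids(num_layers: int) -> list:
    """Early, middle, and late layer: fitted to the quoted EAGLE3 values."""
    return [1, num_layers // 2 - 1, num_layers - 4]


def sketch_target_layer_ids(num_layers: int, num_points: int) -> list:
    """Evenly spaced ids from 1 to num_layers - 3: fitted to the quoted DFlash values.

    Requires num_points >= 2.
    """
    start, end = 1, num_layers - 3
    step = (end - start) / (num_points - 1)
    return [round(start + i * step) for i in range(num_points)]


def sketch_capture_ids(layer_ids: list, num_layers: int) -> list:
    """Shift each id by +1 and append the final layer (assumed to be num_layers)."""
    return [i + 1 for i in layer_ids] + [num_layers]


eagle_48 = sketch_eagle_aux_layer_ids(48)   # [1, 23, 44]
dflash_24 = sketch_target_layer_ids(24, 5)  # [1, 6, 11, 16, 21]
capture_48 = sketch_capture_ids(eagle_48, 48)
```

Running the same functions against a model's layer count is a quick way to cross-check hand-written capture IDs in a new YAML before launching a multi-node job.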

Notes (not blocking):

  • CodeRabbit flagged a Jinja2 scoping bug in render_typescript_type's {%- set has_object_variants -%} inside a for-loop — that's a real bug, but it's inherited from the upstream OpenAI gpt-oss-20b chat template (this template is copied from there). It only triggers on tool schemas with oneOf containing object variants; daring-anteater (the validated training data) has no tool calls, so the bug is dead code in this training context. Worth fixing if/when upstream does, but doesn't block.
  • PR description's vLLM AL results table confirms all four configs train and pass acceptance-length eval end-to-end.
