Releases: ai-dynamo/dynamo
Dynamo v1.1.0
Release Notes
Dynamo v1.1.0 is the 14th feature release of the open-source distributed inference platform. It makes the standalone KV indexer recoverable across node failures, brings the Anthropic Messages API to production for Claude Code, lands SGLang multimodal disaggregated serving, and turns the Mocker into a unified performance-modeling and offline-replay engine.
Summary
Resilient KV Routing at Scale
The standalone KV indexer is now recoverable. New replicas bootstrap their radix tree from a healthy peer's /dump endpoint before serving. Inline ZMQ gap detection replays dropped messages from the engine ring buffer. Multi-model and multi-tenant isolation lets shared clusters route to the correct cache. Ships as a maturin-built dynamo-kv-indexer on PATH with Prometheus metrics.
Anthropic Messages API for Claude Code
/v1/messages is production-grade for Claude Code-style harnesses. cache_control is honored at top-level, per-block, and system-block-array forms. Thinking-block pass-through, system-prompt preamble stripping, accurate streaming input_tokens, and Anthropic image → OpenAI image_url conversion cover end-to-end vision and reasoning. /v1/models exposes context_window; streaming double-parsing and reasoning_content round-tripping are fixed.
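As a concrete illustration of the supported `cache_control` placements, a minimal `/v1/messages` payload might look like the sketch below. The model name and endpoint are placeholders, not values from this release; the block shapes follow the Anthropic Messages convention.

```python
# Hypothetical /v1/messages payload showing the system-block-array and
# per-block cache_control forms. "my-model" is a placeholder.
payload = {
    "model": "my-model",
    "max_tokens": 256,
    "system": [
        # system-block-array form: cache_control on a system content block
        {"type": "text", "text": "You are a coding assistant.",
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        {"role": "user", "content": [
            # per-block form: cache_control on an individual content block
            {"type": "text", "text": "Summarize the repository layout.",
             "cache_control": {"type": "ephemeral"}},
        ]},
    ],
}
print(payload["system"][0]["cache_control"]["type"])  # -> ephemeral
```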
Performance Modeling & Offline Replay
The Mocker becomes a performance-modeling and offline-replay engine. SGLang engine simulation, AIConfigurator-backed latency prediction with MoE parallelism, --decode-speedup-ratio for speculative decoding, and offline agg/disagg replay over Mooncake-style traces all land. Forward-pass metrics flow onto the event plane, enabling planner-in-the-loop replay. The Planner consolidates onto a single FPM-based regression model that serves both throughput and load scaling.
Multimodal Embedding Cache & Diffusion
SGLang multimodal disaggregated serving lands with NixlEmbeddingSender/NixlEmbeddingReceiver, an embedding cache, and device-type-and-load EPD routing. vLLM E/P/D worker init moves to a factory; embedding loading is abstracted into ImageLoader. Diffusion adds image-to-video on vLLM Omni, image-to-image on SGLang, audio/TTS, video input for SGLang aggregated, audio-in-video, and Flux benchmarking. The v1.0.0 PersistentConnector monkey-patch is retired for a proper nixl_connector integration.
Open-Source Contributions
Between v1.0.2 and v1.1.0, the project merged 896 PRs from 113 contributors. New first-time external contributors in this release include:
- @stevemurr (Baseten) added a dynamic default `max_tokens` for the TensorRT-LLM backend (#5152)
- @YconquestY added FlexKV integration for cross-tier KV cache management (#5858)
- @Spycsh (Intel) added GPUDirect support for Intel XPU (#5852)
- @sywangyi (Intel) added the SGLang multi-image request path, NIXL EmbeddingSender/Receiver in SGLang, NVTX markers for SGLang EPD, and device-type EPD routing (#6068, #7079, #7153, #7215)
- @kornelcsernai-harmonic (Harmonic AI) added a least-loaded router mode (#6314)
- @danehans (Tetrate) clarified GAIE fallback behavior and source-install flow (#7077)
- @jellysnack added SGLang guided-decoding support (#6620)
- @blarson-b10 (Baseten) improved active-sequence request expiration (#7340)
- @simone-chen updated the AIC disaggregated serving guide (#6553)
- @yifjiang added TRT-LLM dynamo-trtllm metrics and fixed guided-decoding arg placement (#6617, #6668)
- @joshuayao (Intel) added vLLM aggregated serving examples and unit tests for XPU (#7146, #7078)
- @ZhengHongming888 (Intel) enabled the Intel XPU Dockerfile for dev targets (#7134)
Returning external contributors include @michaelfeil (Baseten), @vladnosiv (Yandex.Cloud), @dsocek (Intel), @AmeenP (PrimeIntellect), @huitianbai, @InfraWhisperer (F5), @devivasudevan (Microsoft), @Jont828 (Microsoft), @ashnamehrotra (Microsoft), @Ryan-Amirthan (Fern), and several others.
If you would like to get involved, please see our Contribution Guide.
Key Dependencies
| Dynamo | SGLang | TensorRT-LLM | vLLM | NIXL | UCX |
|---|---|---|---|---|---|
| v1.1.0 | v0.5.10.post1 | v1.3.0rc11 | v0.19.0 | v1.0.1 (SGLang) / v0.10.1 (TRT-LLM, vLLM) | 1.20 |
CUDA Variants
| Backend | CUDA 12 | CUDA 13 |
|---|---|---|
| vLLM | 12.9 | 13.0 |
| SGLang | 12.9 | 13.0 |
| TensorRT-LLM | — | 13.1 |
The vLLM XPU/CPU image targets Intel deep-learning-essentials 2025.3.2 and stays on vLLM v0.16.0 for v1.1.0.
Dynamo Ecosystem
| AIConfigurator | AIPerf | ModelExpress | Grove |
|---|---|---|---|
| v0.8.0 | v0.7.0 | v0.3.0 | v0.1.0-alpha.6 |
For container images, wheels, Helm charts, Rust crates, and the full pinned matrix, see Release Artifacts and the Support Matrix.
Breaking Changes
ACTION REQUIRED: The following changes require updates to your code, configuration, or deployment manifests before upgrading.
Notable Behavioral Changes
- `enable_nats` and `use_kv_events` Removed from `DistributedRuntime` (#7265): Both parameters are removed from `DistributedRuntime`, `create_runtime()`, and the `dynamo_worker()` decorator. NATS is now auto-detected from the event plane: enabled when the request plane is NATS or `NATS_SERVER` is configured. **Migrate:** Drop both arguments from your Python entry points and configure NATS via the `DYN_EVENT_PLANE` and `NATS_SERVER` environment variables instead.
- Experimental `nvext.cache_control` Cache Pinning Removed (#7790): The experimental cache-pinning feature is removed: the `nvext.cache_control` request field, the `--enable-cache-control` flag, and the `DYN_ENABLE_CACHE_CONTROL` env var are all gone. SGLang upstream chose a different direction, so the v1.0.0 plumbing is being unwound. The Anthropic Messages parser still accepts `cache_control` blocks for protocol compatibility but no longer derives router-pin TTLs from them. **Migrate:** If you depended on cache pinning, track the v1.2.0 sticky-session / session-controller work. There is no drop-in replacement in v1.1.0.
- Cargo-Built `dynamo-kv-indexer` Binary Removed (#7338): The Cargo-built `dynamo-kv-indexer` binary in `lib/kv-router/target/release/` is removed; the maturin-built binary shipped via the Python wheel is now the single source. **Migrate:** Update launchers and Dockerfiles to point at the wheel-installed `dynamo-kv-indexer` (on `PATH` after `pip install ai-dynamo`).
- LLaVA-Specific EPD Path Removed; EPD Now Single-GPU (#6674): The LLaVA-specific multimodal EPD code path is removed, EPD is now constrained to single-GPU configurations, and the default multimodal example moved from `Llava-Mistral` to `Qwen/Qwen3-VL-2B-Instruct`. **Migrate:** Switch LLaVA workloads to the aggregated path or to a Qwen3-VL recipe.
- Compressed Concurrent Tree Default (#7874): The KV router now defaults to the compressed concurrent radix tree, improving resource utilization for multi-threaded indexing; node-allocation semantics differ from the previous tree. **Migrate:** Update any custom instrumentation that targeted the old radix-tree internals. No action required for default deployments.
- MDC Checksum Scoped Per-WorkerSet (#7368): Model Discovery Card checksum validation moved from per-Model to per-WorkerSet. Different WorkerSets under the same Model can now carry different configuration without forcing workers to drain first. **Migrate:** If you relied on the v1.0.0 strict per-Model behavior, audit your WorkerSet configs before upgrading.
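The NATS part of the first migration above can be sketched as follows. The environment-variable values are illustrative assumptions, not values documented in this release; only the variable names come from the migration note.

```python
import os

# After #7265, NATS is configured via environment variables instead of the
# removed enable_nats/use_kv_events constructor arguments. Both values below
# are placeholder assumptions for illustration.
os.environ["DYN_EVENT_PLANE"] = "nats"                 # assumed value format
os.environ["NATS_SERVER"] = "nats://localhost:4222"    # assumed server URL

# Old entry points passed enable_nats/use_kv_events explicitly; those keyword
# arguments are simply deleted from create_runtime() / dynamo_worker() calls.
print(os.environ["NATS_SERVER"])
```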
Deprecated and Removed
- vLLM Auto-Enable KV Events Removed (#7591): The deprecated automatic KV-events config in vLLM is removed; the `DYN_VLLM_KV_EVENT_PORT` env var is also no longer supported. **Migrate:** Set `--kv-events-config` explicitly per the v1.0.0 migration note.
- Unused `genai-perf` Pin Dropped (#8763): The unused `genai-perf==0.0.15` pin was removed from `container/deps/requirements.benchmark.txt`; it was not invoked anywhere in the repo. **Migrate:** No action required. `aiperf` is the supported in-container benchmarking tool.
v1.0.0 Future-Deprecation Reminders
The following warnings from v1.0.0 still apply. Migrate before they are removed:
- `v1alpha1` DGDR API: migrate to `v1beta1`
- `enableGpuDiscovery` CRD field has no effect
- `ComponentName` field on `ServiceReplicaStatus`: migrate to `ComponentNames`
- Router CLI flags without the `--router-prefix`
- vLLM `--is-prefill-worker` / `--is-decode-worker`: migrate to `--disaggregation-mode`
- `--router-durable-kv-events`: migrate to the event-plane subscriber
Features & Improvements
Multimodal & Diffusion
Embedding Cache & E/P/D
Dynamo v1.2.0-deepseek-v4-dev.2
Dynamo v1.2.0-deepseek-v4-dev.2 - Release Notes
Summary
Dynamo v1.2.0-deepseek-v4-dev.2 is the second experimental dev cut for DeepSeek-V4-Flash and DeepSeek-V4-Pro on Blackwell. The headline change is the vLLM 0.20.0 upgrade, which lands native DeepSeek-V4 support upstream and lets the published vLLM container drop the dev.1 custom Dockerfile overlay. The release also publishes both vLLM and SGLang containers for V4-Flash and V4-Pro, tightens DeepSeek-V4 frontend correctness (thinking-mode toggle keys, DSML tool-call parsing, skip_special_tokens default), fixes a vLLM/NIXL cancellation race that crashed EngineCore mid-KV-transfer, and consolidates the recipe layout into a single recipes/deepseek-v4/ subtree. This is a snapshot build for early access to V4 model support and is not a QA-gated release.
Base Branch: release/1.2.0-deepseek-v4-dev.2
Container Images (RC4, staging)
| Backend | Arch | Image |
|---|---|---|
| vLLM (CUDA 13) | multi-arch (amd64 + arm64) | `nvcr.io/nvstaging/ai-dynamo/vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.2-rc4` |
| SGLang (CUDA 12) | amd64 only | `nvcr.io/nvstaging/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda12-dev.2-rc4` |
| SGLang (CUDA 13) | arm64 only | `nvcr.io/nvstaging/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda13-dev.2-rc4` |
Backend Versions
| Backend | Base Image | CUDA | Python | Notes |
|---|---|---|---|---|
| vLLM | `vllm/vllm-openai` (0.20.0) | 13.0 | 3.12 | vLLM 0.20.0 ships native DeepSeek-V4 support; the custom dsv4 Dockerfile was dropped in #8786 |
| SGLang | `lmsysorg/sglang:deepseek-v4-blackwell` | 12.9 / 13.0 | 3.12 | Upstream DSv4 Blackwell preview branch; Dynamo overlays V4 parsers, hardening fixes, and `routed_experts` opt-in gating |
NIXL 0.10.1 and UCX v1.20.0 are bundled in both published images. TensorRT-LLM is not part of this dev release. Backend versions are pinned to upstream DSv4 preview images, not the standard Dynamo backend pins.
Models
| Model | HuggingFace ID | Hardware | Notes |
|---|---|---|---|
| DeepSeek-V4-Flash | `deepseek-ai/DeepSeek-V4-Flash` | 4× B200 (TP=4); GB200 also supported | MXFP4 MoE via FlashInfer; EAGLE MTP 3/4 |
| DeepSeek-V4-Pro | `deepseek-ai/DeepSeek-V4-Pro` | 8× B200 (TP=8) | MXFP4 MoE via FlashInfer; EAGLE MTP 3/4. Thinking mode is broken upstream — see Known Issues |
Both models run on either the published vLLM or SGLang container. See recipes/deepseek-v4/deepseek-v4-{flash,pro}/README.md for per-model deployment.
What Changed Since dev.1
dev.2 carries forward everything that shipped in v1.2.0-sglang-deepseek-v4-dev.1:
- DeepSeek V4 frontend parser, prompt formatter, and DSML tool-call parser (#8665).
- KV router ZMQ wire-parser fix that filters non-main KV event groups (#8669).
- Initial DeepSeek-V4 SGLang recipe with `Dockerfile.dsv4-sglang` and per-model manifests (#8704).
- Initial DeepSeek-V4-Flash and V4-Pro vLLM aggregated recipes (#8668).
dev.2 adds the changes below on top.
Full Changelog
vLLM Backend Upgrade
- vLLM 0.20.0 with Native DeepSeek-V4 Support: Bumped the published vLLM container to vLLM 0.20.0 (#8762), which lands native DeepSeek-V4 architecture support upstream and eliminates the need for the dev.1 custom DSv4 Dockerfile overlay (dropped in #8786). DSv4-relevant upstream additions in vLLM 0.20.0 (full notes: `v0.20.0`):
  - Initial DeepSeek-V4 architecture support (vllm-project/vllm#40860).
  - DSML token-leakage fix for DSv4 / V3.2 (vllm-project/vllm#40806).
  - DeepSeek Sparse Attention + MTP illegal-memory-access fix (vllm-project/vllm#40772).
  - `silu` clamp limit on the DSv4 shared expert (vllm-project/vllm#40950).
Recipes
- DeepSeek-V4 Family Layout Consolidation: Restructured the DSv4 recipes into a single self-contained `recipes/deepseek-v4/` subtree following the repo-wide `<recipe>/<framework>/<mode>/deploy.yaml` convention, deduped the shared dsv4 Dockerfiles into one set, simplified the SGLang Dockerfile to consume the Dynamo donor image directly (no in-Dockerfile source build), pinned the SGLang base image to a specific digest, and documented both vLLM and SGLang deployment paths side-by-side in each per-recipe README (#8735). The dev.1 paths `recipes/deepseek-v4-flash/` and `recipes/deepseek-v4-pro/` are now `recipes/deepseek-v4/deepseek-v4-{flash,pro}/`.
Bug Fixes
Frontend
- DeepSeek V4 Frontend Hardening: Hardened the DeepSeek V4 frontend after post-merge review of the dev.1 parser (#8670). The V4 prompt formatter now honors `thinking: false`, `enable_thinking: false`, and `thinking_mode: "chat"` as toggle keys end-to-end (previously only some keys routed through, leaving reasoning extraction silently on); `reasoning_effort="max"` and per-request `drop_thinking` overrides now actually take effect; bad `chat_template_args` values emit a structured warning instead of falling back silently. The DSML tool-call parser now keeps parameters that omit the optional `string="true|false"` attribute, preserves text between or after multiple DSML blocks, no longer leaks raw `<|DSML|…>` markup into `normal_content` when invoke parsing fails, and emits OpenAI-style `call_<24hex>` IDs. The `DsmlParserConfig.function_calls_*` fields were renamed to `block_*` (with serde aliases for back-compat), and V4 model-name matching now rejects composites like `deepseek-v3.2-v4-merge`. SGLang `chat_processor_frontend` `parallel_tool_calls` is wired through, fixing 9 previously broken runtime tests.
- Special-Token Leakage in Streaming Detokenizer: Fixed the Rust backend's streaming detokenizer to default `skip_special_tokens=true` when the OpenAI request omits the field (#8780), matching vLLM, SGLang, and TensorRT-LLM defaults. Without this, DeepSeek-V4 occasionally emitted token id `0` (`<|begin▁of▁sentence|>`) mid-output, and the literal `<|begin▁of▁sentence|>` text leaked into `content` and `reasoning_content` for any client that did not opt in.
vLLM
- vLLM/NIXL Cancellation Race During KV Transfer: Fixed a vLLM/NIXL race in disaggregated serving where aborting a request immediately after prefill could crash EngineCore while KV transfer was still in flight (#8624). The vLLM handler now defers `engine_client.abort(request_id)` until the engine produces the first token in decode mode, then resumes normal cancellation. Behavior is gated to the vLLM disaggregated decode flow only; aggregated and other backends are unchanged.
- vLLM Pod Startup Crash on Renamed Flag: Fixed DSv4 and Qwen3 vLLM recipe `deploy.yaml` files that still passed the removed `--disable-log-requests` flag (#8693), causing pods to crash at startup with `unrecognized arguments: --disable-log-requests` on any image built against vLLM 0.19.1+. Renamed to `--no-enable-log-requests` (the `argparse.BooleanOptionalAction`-generated form of the new `--enable-log-requests`), preserving explicit "logging off" intent.
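The flag rename above follows stock argparse behavior: `argparse.BooleanOptionalAction` auto-generates a paired `--no-` form for every flag it manages. A minimal sketch (the default value here is an illustrative assumption, not taken from the vLLM source):

```python
import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction registers both --enable-log-requests and its
# auto-generated negative, --no-enable-log-requests.
parser.add_argument("--enable-log-requests",
                    action=argparse.BooleanOptionalAction,
                    default=True)  # default is an assumption for illustration

args = parser.parse_args(["--no-enable-log-requests"])
print(args.enable_log_requests)  # -> False
```

This is why `--disable-log-requests` stopped parsing once upstream switched to the `BooleanOptionalAction` style: only the generated `--no-` spelling exists.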
SGLang
- SGLang `return_routed_experts` Compat Regression: Gated SGLang's `return_routed_experts` behavior behind an opt-in flag for DeepSeek-V4 compatibility, plus faster model downloading in pytests (#8828; cherry-picks #8821 and #8798). This keeps the SGLang DSv4 path stable on the published image without breaking other SGLang model paths that do not expect routed-expert metadata.
- Silent 128-Token Cap on Omitted `max_tokens`: Fixed the SGLang worker's `_build_sampling_params` to preserve explicit `max_new_tokens=None` instead of stripping it (#8743). When a client omitted `max_tokens` from a chat completion request (valid per OpenAI spec), the dict comprehension previously filtered out the `None` and SGLang silently fell back to its 128-token internal default, returning `finish_reason: length` after just 128 tokens. Requests now run until EOS or context-length as expected.
- Disagg Prefill Canary False Positive: Fixed the SGLang prefill handler so it honors the `_HEALTH_CHECK` marker and the canary's first observed response comes from the scheduler instead of a synthetic `bootstrap_info` yield (#8611). Without this, a hung SGLang prefill scheduler rank left `/health` reporting 200 indefinitely. Also moved `HEALTH_CHECK_KEY = "_HEALTH_CHECK"` to the shared `dynamo.health_check` module and applied the marker to all three vLLM payloads for cross-backend wire-format consistency. SGLang and TensorRT-LLM keep re-exports for backwards compatibility.
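The omitted-`max_tokens` bug above is a common pitfall worth isolating: a dict comprehension that strips `None` values also strips keys whose `None` is meaningful. A simplified sketch of the failure mode (function and parameter names are illustrative, not the actual `_build_sampling_params` code):

```python
def build_sampling_params_buggy(max_new_tokens=None, temperature=0.7):
    params = {"max_new_tokens": max_new_tokens, "temperature": temperature}
    # Bug: dropping every None also drops an intentional max_new_tokens=None,
    # so the engine falls back to its internal default (128 tokens in SGLang).
    return {k: v for k, v in params.items() if v is not None}

def build_sampling_params_fixed(max_new_tokens=None, temperature=0.7):
    # Fix: keep max_new_tokens even when it is None, so "no cap" survives
    # all the way to the engine.
    return {"temperature": temperature, "max_new_tokens": max_new_tokens}

print("max_new_tokens" in build_sampling_params_buggy())  # -> False
print("max_new_tokens" in build_sampling_params_fixed())  # -> True
```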
Known Issues
DSv4-Pro Thinking-Mode Output Corruption
DeepSeek-V4-Pro produces corrupted output when thinking mode is enabled. The bug is engine-side (sparse-attention state lifecycle), reproduces on both vLLM and SGLang DSv4-Pro, and does not reproduce on DSv4-Flash. Tool calling, structured output, and non-thinking responses are unaffected.
Workaround: Disable thinking mode in `chat_template_kwargs`:

```shell
curl http://<frontend>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "..."}],
    "chat_template_kwargs": {"thinking": false}
  }'
```

The `recipes/deepseek-v4-pro/README.md` quickstart and the `recipes/README.md` index were updated with this caveat in #8720, and the curl example sets `thinking: false` explicitly so the included quickstart does not surprise users with corrupted output. Tracked for an upstream fix.
Full Changelog: v1.2.0-sglang-deepseek-v4-dev.1...release/1.2.0-deepseek-v4-dev.2
v1.2.0-sglang-deepseek-v4-dev.1
Dynamo v1.2.0-dev.1-deepseekv4 - Release Notes
Summary
Dynamo v1.2.0-dev.1-deepseekv4 is an experimental dev release that adds DeepSeek-V4 model support, an SGLang recipe for V4-Flash and V4-Pro, and a KV router correctness fix required by V4's grouped KV cache layout. The release branches from main at commit 7d4572f9 (2026-04-23) and adds seven cherry-picks on top to enable end-to-end serving of deepseek-ai/DeepSeek-V4-Flash and deepseek-ai/DeepSeek-V4-Pro on Blackwell hardware. Only the SGLang container is published; vLLM recipes are included as source for users to build locally. This is a snapshot build for early access to V4 model support and is not a QA-gated release.
Base Branch: release/1.2.0-dev.1-deepseekv4
Container Image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-sglang-deepseek-v4-b200-dev.1
Backend Versions (shipped):
| Backend | Base Image | CUDA | Python | Notes |
|---|---|---|---|---|
| SGLang | `lmsysorg/sglang:deepseek-v4-blackwell` | 12.9 | 3.12 | Upstream DSv4 Blackwell preview branch; Dynamo overlays V4 parsers + `routed_experts` fix |
NIXL 0.10.1 and UCX v1.20.0 are bundled in the published SGLang image.
Source-only (not published):
- vLLM recipes for V4-Flash and V4-Pro target `vllm/vllm-openai:deepseekv4-cu130` (built from vLLM PR #40760, the `zyongye/vllm:dsv4` fork; CUDA 13.0; Python 3.12). Users build locally per `recipes/deepseek-v4-{flash,pro}/container/README.md`.
- TensorRT-LLM is not part of this dev release.
Note: Backend versions in this dev release are pinned to the upstream DSv4 preview images, not the standard Dynamo backend pins (vLLM 0.19.1, SGLang 0.5.x).
Full Changelog
Frontend & Agents
- DeepSeek V4 Parser Support: Added DeepSeek V4 frontend parser support including a new prompt formatter, the DSML tool-call parser, reasoning parser registration with `deepseek_v4` / `deepseek-v4` / `deepseekv4` aliases routing to the Qwen reasoning parser, and 11 new test fixtures covering streaming and non-streaming tool-call scenarios (#8709). This is the cherry-pick of upstream #8665 and is the core enablement work that lets Dynamo's frontend tokenize, prompt, and parse tool calls for DeepSeek-V4-Flash and DeepSeek-V4-Pro models.
Recipes
SGLang (shipped)
- DeepSeek-V4 SGLang Recipe: Added an SGLang recipe for DeepSeek-V4-Flash and DeepSeek-V4-Pro with a dedicated `Dockerfile.dsv4-sglang` and per-model `sglang-dgd.yaml` manifests (#8712, #8713, #8718). This supersedes the closed #8703 and includes follow-up fixes for the container `PATH` and the `etcd` binary path in the SGLang Dockerfile so the recipe runs end-to-end without manual edits. Users deploy with the published `sglang-runtime:1.2.0-sglang-deepseek-v4-b200-dev.1` image.
vLLM (source-only, no published container)
- DeepSeek-V4-Flash and V4-Pro vLLM Aggregated Recipes: Added experimental aggregated serving recipes for `deepseek-ai/DeepSeek-V4-Flash` and `deepseek-ai/DeepSeek-V4-Pro` on the vLLM backend (#8719), including dedicated Dockerfiles, model-cache and model-download manifests, and `DynamoGraphDeployment` YAMLs. Users build the image locally from the recipe Dockerfile against the upstream DSv4 vLLM base.
KV Router
- ZMQ KV Event Group Filtering: Fixed the KV router's ZMQ wire parser to filter out non-main KV event groups so only the main full-compressed KV group is published into the radix index, instead of flattening sliding-window-attention (SWA) and other auxiliary groups together (#8705). This is the cherry-pick of #8669 and is required for correct prefix matching on DeepSeek V4, which emits a `group_idx` separating the main KV group from SWA-only groups.
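The group-filtering idea reduces to keeping only main-group events before they reach the radix index. The sketch below is illustrative only: the event shape, field names, and the choice of `0` as the main `group_idx` are assumptions, not the actual ZMQ wire format.

```python
# Hedged sketch: drop SWA/auxiliary-group KV events before indexing.
# MAIN_GROUP_IDX and the dict-based event shape are assumptions.
MAIN_GROUP_IDX = 0

def filter_main_group(events):
    """Keep only events from the main full-compressed KV group."""
    return [e for e in events if e.get("group_idx", MAIN_GROUP_IDX) == MAIN_GROUP_IDX]

events = [
    {"block_hash": "a1", "group_idx": 0},  # main KV group: kept
    {"block_hash": "b2", "group_idx": 1},  # SWA-only group: dropped
]
print([e["block_hash"] for e in filter_main_group(events)])  # -> ['a1']
```

Without a filter of this kind, auxiliary-group blocks would be flattened into the same index as main-group blocks, corrupting prefix matches.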
Dynamo v1.0.2
Dynamo v1.0.2 - Release Notes
Summary
Dynamo v1.0.2 is a patch release focusing on Frontend correctness fixes, DGDR-driven Kubernetes deployment robustness, rolling-update flexibility, and guided-decoding input hardening.
Key fixes restore real stream metadata in non-streaming responses with tool calls, correct Kimi K2.5 tokenizer special-token handling that caused TensorRT-LLM to reject requests, and add byte-length and nesting-depth caps to the OpenAI guided-decoding path.
On the deployment side, DGDR-created DynamoGraphDeployments now derive their name from the parent DGDR, DGDR-managed ConfigMaps cascade-delete with their parent, the Operator no longer thrashes on foreground cascading deletion, and per-WorkerSet MDC checksum validation enables rolling updates with divergent worker configuration under the same Model.
Base Branch: release/1.0.1
Key Dependencies
| Dynamo | SGLang | TensorRT-LLM | vLLM | NIXL |
|---|---|---|---|---|
| v1.0.2 | 0.5.9 | 1.3.0rc5.post1 | 0.16.0 | 0.10.1 |
For container images, wheels, Helm charts, and Rust crates, see Dynamo Release Artifacts.
For full version compatibility information, see Dynamo Support Matrix.
Full Changelog
Kubernetes Deployment
- DGDR-Driven DGD Naming: Fixed Profiler-generated DynamoGraphDeployment naming so that DGDs derive their name from the parent DynamoGraphDeploymentRequest (`<DGDR>-dgd`) instead of from topology alone (`<backend>-<agg/disagg>`) (#7835), eliminating namespace-level name collisions when multiple DGDRs share the same backend/topology and respecting user-provided names from `spec.overrides` when present.
- DGDR ConfigMap Owner References: Added Kubernetes owner references to ConfigMaps created by DGDR (#7881) so that DGDR-managed ConfigMaps are cascade-deleted with their parent.
Runtime
- Per-WorkerSet MDC Checksum Validation: Scoped Model Discovery Card checksum validation from per-Model to per-WorkerSet (#8278), enabling rolling updates where different WorkerSets under the same Model can carry different configuration (e.g. tool-call parser) without draining existing workers first. Mismatches are still rejected when a new worker joins an existing WorkerSet, but cross-WorkerSet checksum drift is no longer a hard error.
Bug Fixes
- DGD Cascading Deletion Thrashing: Fixed Operator behavior under foreground cascading deletion of DynamoGraphDeployments (#8212) so the Operator no longer thrashes the resource during teardown, ensuring clean DGD deletion in Kubernetes garbage-collection scenarios.
- Stream Metadata Preservation: Fixed OpenAI Frontend stream finalization that overwrote real `id`, `model`, and `created` fields with hardcoded placeholders (`stream-end`, `unknown`, `0`) when a tool-call parser combined streamed chunks into a non-streaming response (#8281), restoring correct response metadata for non-streaming tool-call requests.
- Per-Node GPU Topology in DGD Builder: Fixed thorough-mode MoE config enumeration in the Planner/Profiler that ignored `numGpusPerNode` and produced unschedulable candidate DGDs on multi-node clusters (#8281). Worker GPU resource limits are now clamped per node and `multinode.nodeCount` is set for workers that span multiple nodes.
numGpusPerNodeand produced unschedulable candidate DGDs on multi-node clusters (#8281). Worker GPU resource limits are now clamped per node andmultinode.nodeCountis set for workers that span multiple nodes. - Kimi Tokenizer Special Tokens: Fixed Rust tiktoken tokenizer handling of reserved-token fallback names for Kimi K2.5 (#7898), resolving prompt-token inflation that caused TensorRT-LLM to reject requests with negative
default_max_tokensand enabling correct serving ofnvidia/Kimi-K2.5-NVFP4and other Kimi K2.5 models. - Guided-Decoding Input Bounds: Added byte-length and nesting-depth caps to OpenAI guided-decoding input validation (#8349) —
guided_grammar64 KiB,guided_regex32 KiB,guided_whitespace_pattern1 KiB,guided_json256 KiB serialized with a nesting-depth cap of 64 — bounding pathological inputs before they reach the downstream guided-decoding backend.
Full Changelog: v1.0.1...v1.0.2
Dynamo v1.1.0-dev.1
Release Notes
Dynamo v1.1.0-dev.1 is a pre-release build that gives an early look at the features in Dynamo v1.1.0.
This build is not recommended for production use — features may be incomplete and APIs, behaviors, and defaults may change before the stable release.
Use it for evaluation, testing, and early feedback only.
Branch: release/1.1.0-dev.1
main Commit: c758eb5b58872fa9b2880b9ea29d058b141bb8ec
Full Changelog: v1.0.1...v1.1.0-dev.1
Major Features
- Move standalone KV indexer into its own crate so it can run as an independent process (#6569)
- Multi-model and multi-tenant isolation for KV indexer (#6830)
- P2P recovery for standalone KV indexer (#6934)
- ZMQ gap detection + replay for standalone KV indexer (#7209)
- Prometheus metrics for standalone KV indexer (#7339)
- Package dynamo-kv-indexer binary via maturin (#7194, #7395)
- Standalone KV indexer runtime integration (#7295)
- Pluggable scheduling policy for router queue (#7260)
- Concurrent router perf improvements (#6536)
- Trait-based event system (Velo) for async coordination (#6315)
- Velo transports (#6547)
- Unix domain socket for Velo (#7197)
- kvbm-physical for direct GPU memory management (#6490)
- FlexKV integration in Dynamo (#5858)
- Full Anthropic Messages API cache_control support (top-level, per-block, system block arrays) (#6629)
- Anthropic thinking block support and preamble stripping to /v1/messages (#7137)
- Support multimodal (vision) inputs in Anthropic Messages API (#7256)
- Streaming tool call and reasoning dispatch SSE events (#7114)
- Dynamic batching of client-side events (#6733, #6741)
- SGLang chat processor for frontend pre/post processing (#6834)
- Move CRD apply from Helm hook Job to init container on operator Deployment (#6780)
- Move webhook certificate management and CA injection from Helm hooks into operator (#6839)
- Move MPI SSH key generation from Helm hook Job into operator reconciliation (#6940)
- Replace kube-rbac-proxy sidecar with controller-runtime WithAuthenticationAndAuthorization (#7045)
- GPU discovery extension using DCGM exporter for advanced metrics (#6705)
- GlobalPlanner --max-total-gpus for cluster-wide GPU budget (#7103)
- Loki log aggregation, unified OTLP ingestion for traces and logs (#6974)
- Propagate OTEL trace context across E/P/D multimodal workers (#7239)
- Kimi-K2.5 model recipe with Baseten's model (#6602)
- Kimi-K2.5 (nvidia/Kimi-K2.5-NVFP4) recipe for agg and KVBM with TensorRT-LLM patch (#6842)
- DeepSeek V3.2 TensorRT-LLM recipe (#6688)
- Qwen3-VL-30B recipe for agg and encoder cache with vLLM patch (#6919)
- Integrate fastokens BPE tokenizer backend (#7387)
- Upgrade Rust to v1.93.0 (#6802)
- vLLM 0.16.0 → 0.17.1 (#7170)
- Enable Intel XPU Dockerfile (#6109, #7134)
- Add support for CPU builds in Dockerfiles (#7139)
- Multi-image in request support for SGLang backend (#6068)
- vLLM omni image-to-video support (#6530)
- Auto GPU VRAM estimator for disagg-same-GPU (#6868)
- Forward pass metrics via ZMQ in vLLM (#7200) and Dynamo event plane integration (#7250)
Minor Features & Improvements
- Add profiler job overrides (#6607)
- Improve mooncake_bench sweep logic and throughput accounting (#6631)
- Allow deepseek_v3 architecture to use Kimi's BPE pattern (#6653)
- Wire nvext.cache_control TTL-based pinning through Dynamo router (#6213)
- Default router_event_threads to 4 (#6672)
- Add ActiveSequences benchmark and extract common bench utils for KV router (#6633)
- Add GPU info to tests when killing a process (#6552)
- Initial Claude skills (#6703)
- BPF for frontend perf tracing (#6737)
- Linear scan improvements for KV router (#6363)
- Router queue depth Prometheus metric and nvext field (#6786)
- Testing utilities for kvbm-logical and kvbm-physical (#6691)
- Use model_fields_set to distinguish TTFT/ITL default usage (#6814)
- Optimize dev/local-dev Dockerfiles for source-based development (#6743)
- Enable KVBM metrics on K8s for Kimi-K2.5 recipe (#6963)
- Add NVTX markers for vLLM EPD (#6627) and SGLang EPD (#7079)
- Multimodal benchmark sweep (#6795)
- Additional metrics for dynamo-trtllm (#6668)
- Main branch cross-reference to pr-monitor skill (#6889)
- Add missing overrides (#7017)
- Remove benchmark shim, use AIPerf directly (#7074)
- Add extra CodeRabbit review guidelines for Python and pytest (#7144)
- Split monolithic requirements.txt and remove test deps from runtime image (#6656)
- Frontend pipeline and tokio runtime perf metric definitions (#6731)
- Hide optimizationType (#7160)
- Refactor launch scripts with shared launch_utils.sh for consistent failure handling (#7008)
- Apply DGD overrides before running interpolation (#7226)
- Add generate health check support for PD SGLang (#6004)
- Replace PersistentConnector monkey-patch with proper nixl_conn (#6913)
- --decode-speedup-ratio for speculative decoding simulation in mocker (#7349)
- Harden ManagedProcess teardown, add xdist-safe tests (#6670)
- Request-plane, transport, and work-handler metric definitions (#6735)
Bug Fixes
- Guided decoding arg placement, None guard, test cleanup (#6617)
- Fix GPU discovery preflight job (#6628)
- Proper DGD prefix for naive fallback in DGDR (#6667)
- Remove costly logs in EPD (#6696)
- Fix docs links (#6702)
- Fix chat processor for vLLM video/audio examples (#6689)
- Broken link in profiler guide (#6709)
- Shell-quote Ray leader args (#6693)
- AutoApply bool → AutoApply *bool (#6683)
- Store nodesWithGPUs (#6690)
- Phase out llava and make EPD single GPU (#6674)
- Propagate vllm-distributed-executor-backend annotation from DGD metadata to backends (#6692)
- Poll for snapshot before starting Router 2 in indexers_sync (#6707)
- Fix TRT-LLM worker SSH crash in non-root containers (#6694)
- Replace broken archive docs URLs in release-artifacts (#6722)
- Always emit PVC block in operator ConfigMap when checkpoint storage type is PVC (#6752)
- Mypy type fixes (#6730)
- Handle missing out_hidden_size for LLaVA models in EPD encode worker (#6759)
- Decouple Helm chart from runtime cluster state for helm template and GitOps compatibility (#6754)
- Add embedding transfer implementation with NIXL WRITE initiation (#6651)
- disagg_planner.yaml using new planner CLI (#6760)
- Autogenerate Helm chart README (#6774)
- Fix vLLM disaggregated cancellation tests (#6758)
- Add weight: 1 to EPP config plugins (#6756)
- Fix vLLM KVBM integration test (#6768)
- Update container image to standard vllm-runtime tag (#6781)
- Fix E + PD multimodal flow in TensorRT-LLM (#6726)
- Auto-scale request count in benchmarks (#6777)
- Pass through extra args in TensorRT-LLM agg.sh launch script (#6787)
- Update vLLM processor for vLLM 0.16 (#6799)
- Allow v1alpha1 DGDR creation by fixing webhook version matching and backend enum (#6803)
- Added retries for docker pull, removed unused docker-tag-push GH action (#6804)
- Plumb expected_osl through scheduler queue path (#6812)
- Pass in device_id to KVBM PinnedAllocator instead of hardcoding to 0 (#6809)
- Update NIXL version to 0.10.0 (#6789)
- Properly setup and register vLLM worker for external/hybrid load balancing (#6695)
- Lychee cache — don't cache failures (#6824)
- Fix docs README skill links (#6856)
- vLLM processor works with stream_interval > 1 (#6816)
- Refactor profiler's DGD generation workflow to correctly generate mocker config (#6848)
- Exclude NIXL .so files from ai-dynamo-runtime wheel (#6430)
- Remove deprecated beam_width from TensorRT-LLM health check payload (#6879)
- Strip None args when profiler generating configs (#6882)
- Remove empty multimodal input to avoid invalid UUID check in vLLM (#6853)
- Add missing --kv-transfer-config to disagg_router.yaml (#6897)
- Support TRT-LLM 1.3 apply_mm_hashes API (#6810)
- TRT-LLM multimodal preprocessor — revert to old default_multimodal_input_loader for embeddings (#6840)
- Resolve socket UUIDs via CUDA driver API (#6891)
- Revert changes in autogen files (#6908)
- Skip encoder LLM creation for unsupported models in TensorRT-LLM (#6866)
- Preserve reasoning content when tool-call starts mid-chunk (#6902)
- Revert accidental change of trtllm/multimodal_processor (#6921)
- Fix operator race condition (#6929)
- Make KVBM respect CUDA_VISIBLE_DEVICES for NUMA binding (#6931)
- Enforce min_endpoint flag in Planner (#6637)
- Extend LoRA download S3 timeout and stream large LoRA downloads to disk (#6544)
- Fix call to normalize_finish_reason on OmniHandler for main (#6910)
- Add missing UCX/NIXL native libraries to SGLang runtime (#6939)
- Validate initial uptime metric parsing in integration test (#6943)
- Reject prompts exceeding max_seq_len with HTTP 400 (#6635)
- Efficiently fill dummy data to bypass vLLM preprocessor in E/P/D (#6968)
- Propagate tolerations and cap auto-discovered GPUs (#6947)
- TRT-LLM multimodal preprocessor — remove default_multimodal_input_loader from embedding paths (#6924)
- Remove aws-ofi-nccl plugin from linker cache in regular TensorRT-LLM runtime image (#6944)
- Restore --enforce-disagg to reject requests before prefill router activates (#6957)
- Move some router logs to DYN_LOG=debug (#6994)
- Remove llm → mocker crate dependency, move config (#6998)
- Accept assistant output_text messages without id/status in /v1/responses input (#6599)
- Guard SGLang/vLLM memory occupation control endpoints (#6967)
- Kimi K2.5 — tiktoken incomplete multi-byte sequence handling with regression tests (#6996)
- Pass checkpointPath to restore (#6941)
- Hide inactive models from /v1/models (#6966)
- Populate GIT_COMMIT_SHA envvar in containers built in CI (#7001)
- Use checked arithmetic in TwoPartCodec to prevent integer overflow (#6959)
- Remove outdated deploy/discovery ex...
Dynamo v1.0.1
Release Notes
Summary
Dynamo v1.0.1 is a patch release to Dynamo v1.0.0 with critical bug fixes and expanded model support. Key fixes resolve a TensorRT-LLM startup crash on CUDA 13.1 caused by a cutlass-dsl packaging mismatch, and restore OpenAI API logprobs compliance where bytes and token fields were not populated when routing through Dynamo Frontend. This release also enables experimental Kimi K2.5 model support by adding DeepSeek V3 architecture tokenizer handling and fixing tiktoken multi-byte streaming panics.
Base Branch: release/1.0.0
Bug Fixes
- TensorRT-LLM CUDA 13.1 Startup Crash: Fixed a startup crash where a `cutlass-dsl` stub broke the TensorRT-LLM import chain on CUDA 13.1. This was a known issue in v1.0.0 that blocked MoE models on Blackwell GPUs, including the Qwen3-235B-A22B-FP8 TensorRT-LLM recipe. (#7393)
- Kimi K2.5 Tokenizer and Streaming: Added DeepSeek V3 architecture support for Kimi's BPE tokenizer pattern and fixed a worker thread panic on incomplete multi-byte sequences during streaming inference. Enables serving nvidia/Kimi-K2.5-NVFP4 and other Kimi K2.5 models that use the DeepSeek V3 `model_type` with a tiktoken tokenizer. (#7424)
- OpenAI Logprobs Fields: Fixed `bytes` and `token` fields in logprobs responses always returning `None`/empty when routing through Dynamo Frontend. This affected the vLLM backend; direct backend queries returned correct values, but requests routed through the Frontend did not. (#7404)
Dynamo Release v1.0.0
Release Notes
Dynamo v1.0.0 is the first major release of the open-source distributed inference platform. This release delivers production-grade disaggregated serving with comprehensive multimodal and omni-model support, KV cache optimizations, improved handling of agentic workloads, Kubernetes-native deployment at scale, and a stabilized public API.
Summary
Multimodal & Diffusion
Dynamo now serves a range of generative modalities—text, image, and video—across all three major inference frameworks. Text-to-image generation is available through both vLLM Omni and SGLang image diffusion pipelines, and text-to-video through SGLang, vLLM Omni, and TensorRT-LLM Wan T2V, with experimental MJPEG streaming for real-time video output. Encoder disaggregation matured with a new EncoderCacheManager and content-addressed hashing, enabling multimodal encoder outputs to be cached and reused across workers. Embedding transfer between workers uses NIXL to minimize latency, and multimodal-aware KV cache routing places requests based on media content for better cache hit rates.
Agents
Dynamo added building blocks for agentic workloads: agent hints at the API, priority scheduling, and (experimental) KV cache retention and lifecycle awareness for long agent sessions. Dynamo expanded its agentic capabilities with reasoning content management for DeepSeek v3.2, GLM-4.7, and Kimi-2.5—including interleaved thinking support where reasoning and tool calls alternate within a single response. New tool call parsers for GLM-4.7, MiniMax-M2, and Kimi K2/K2.5 broaden the set of models that can drive tool-use workflows. Agentic frameworks that target OpenAI or Anthropic can now connect to Dynamo directly via new /v1/responses and /v1/messages endpoints, removing the need for adapter layers. Guided decoding now enforces JSON schema constraints on model output across vLLM and TensorRT-LLM, ensuring tool calls and function arguments are always valid structured data.
Unified Configuration & Public API Stabilization
All backends (SGLang, TensorRT-LLM, vLLM) and core components (Frontend, Router, Planner) migrated from fragmented argparse flags to a typed, modular configuration system with validated base classes. The public Python API was streamlined—deprecated types like Component, Namespace, and CancellationToken were removed, and endpoint methods were consolidated. These changes make the SDK smaller, more consistent, and easier to maintain.
See Breaking Changes for migration details.
Kubernetes Production Readiness
Dynamo Operator matured with a v1beta1 DynamoGraphDeploymentRequest API (Preview in Dynamo v1.0.0), config versioning via ConfigMap injection, GPU auto-discovery migrated from Profiler to Operator, rolling updates for DGD worker deployments, and simplified CRD management. The EPP component introduced a decomposed pipeline for supporting Inference Gateway-based routing with pod-level traffic management. LoRA support expanded with routing-aware adapter placement, memory-aware allocation, and multimodal LoRA with Kubernetes deployment examples. Multiple new Kubernetes deployment recipes were added including Kimi-K2.5, Qwen3-VL-30B-A3B-FP8, & Nemotron-3-Super-FP8.
Performance & Reliability
Dynamo Snapshot (Preview in Dynamo v1.0.0) enables fast GPU worker recovery via a portable DaemonSet using CRIU and cuda-checkpoint, now extended to SGLang. The Dynamo Planner now adds a load-based scaling approach, and a new GlobalPlanner mode (Preview in Dynamo v1.0.0) that provides cross-deployment autoscaling for multiple models or deployments backing an endpoint. Observability was overhauled with standardized dynamo_router_* metrics, engine-level Prometheus metrics, OTel tracing for routing, and more robust Grafana dashboards.
Under the Hood
Two posts on the Dynamo Dev Blog give a closer look at some of the problems we've worked on:
- Flash Indexer: Inter-Galactic KV Routing traces six iterations of data structure design—from a Python dictionary to a concurrent positional index with jump search. The result: the Dynamo Router sustains 170M ops/s—42x faster than what we shipped in Dynamo v0.1.0 and enough to handle planetary-scale inference workloads (we think).
- Full-Stack Optimizations for Agentic Inference tackles the visibility gap between agent harnesses and inference stacks. Claude Code and Codex know what's urgent—but the inference engines handling the workloads didn't, until now. The new `nvext.agent_hints` API lets harnesses pass scheduling priority, cache retention, and speculative prefill hints directly to the engine.
Open-Source Contributions
Between v0.9.0 and v1.0.0, we merged over 700 commits from over 90 contributors — 34 first-time contributors and 19 external contributors from 12 organizations.
First-Time External Contributors
- @devivasudevan (Microsoft) contributed a PR that adds Azure AKS storage guidance for Dynamo caches (#5581).
- @maljazaery (Microsoft) contributed a PR that clarifies DGDSA creation for services is disabled by default (#6389).
- @dsocek (Intel) contributed a PR that improves multimodal disaggregation reliability (#5895).
- @muskansh-google (Google) contributed a PR that updates build commands for the Dynamo + SGLang container (#5908).
- @InfraWhisperer (F5) contributed a PR that fixes a frontend crash when using the TRT-LLM runtime image (#6481).
- @Kaonael (Gcore) contributed a PR that adds a status state enum to DynamoGraphDeployment for improved lifecycle tracking (#6324).
- @Ryan-Amirthan (Fern) contributed a PR that adds standard NVIDIA Fern styling assets to the documentation site (#6148).
- @bledden (Facilitair) contributed a PR that forwards `stream_options` through the multimodal request pipeline (#6474).
- @advpropsys (WhiteCircle.ai) contributed a PR that reduces the NATS consumer inactive threshold from 1 hour to 2 minutes to prevent stale connections (#5861).
- @luc-hiverge (Hiverge) contributed a PR that fixes first token creation signal timing by emitting the signal after sleeping (#5681).
- @orangeng contributed a PR that fixes the service name in port-forward documentation (#5527).
- @huitianbai contributed a PR that limits bootstrap room ID range to 0–2^63-1 to prevent overflow (#6277).
First-Time NVIDIA Contributors:
- @knowicki-nvidia contributed a PR that adds image diffusion and text-to-image support for the SGLang backend (#5609).
- @akshatha-k contributed a PR that restructures KVBM documentation into a three-tier format (#5905).
- @alexanderbilk contributed a PR that adds a Prometheus port for NIXL telemetry metrics (#5567).
- @rwipfelnv contributed a PR that adds Grafana dashboard and monitoring setup for observability (#4639).
- @mikwieczorek contributed a PR that fixes TRT-LLM recipe component type from "main" to "worker" (#5788).
- @jpohl-nv contributed a PR that adds experimental MJPEG video streaming via `/v1/videos/stream` (#6487).
- @rafiw contributed a PR that adds Triton path environment variables to the vLLM runtime Dockerfile (#6401).
Returning External Contributors: @michaelfeil (Baseten), @vladnosiv (Yandex.Cloud), @Jont828 (Microsoft), @ashnamehrotra (Microsoft), @ls-2018, @AmeenP (PrimeIntellect), @kerthcet (InftyAI/Hiverge).
If you would like to get involved, please see our Contribution Guide
Breaking Changes
ACTION REQUIRED: The following changes require updates to your code, configuration, or deployment manifests before upgrading.
CLI Flags and Environment Variables
- KV Router Flags Renamed (#6361): All KV router CLI flags and env vars now use the `--router-*` / `DYN_ROUTER_*` prefix.

| Old Flag / Env Var | New Flag / Env Var |
|---|---|
| `--kv-events` / `DYN_KV_EVENTS` | `--router-kv-events` / `DYN_ROUTER_USE_KV_EVENTS` |
| `--kv-overlap-score-weight` / `DYN_KV_OVERLAP_SCORE_WEIGHT` | `--router-kv-overlap-score-weight` / `DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT` |
| `--assume-kv-reuse` / `DYN_ASSUME_KV_REUSE` | `--router-assume-kv-reuse` / `DYN_ROUTER_ASSUME_KV_REUSE` |
| `--durable-kv-events` / `DYN_DURABLE_KV_EVENTS` | `--router-durable-kv-events` / `DYN_ROUTER_DURABLE_KV_EVENTS` |
| `--track-active-blocks` / `DYN_TRACK_ACTIVE_BLOCKS` | `--router-track-active-blocks` / `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` |
| `--track-output-blocks` | `--router-track-output-blocks` |
| `--router-ttl` / `DYN_ROUTER_TTL` | `--router-ttl-secs` / `DYN_ROUTER_TTL_SECS` |
Migrate: Update all CLI invocations, env vars, and deployment YAMLs to use the new names.
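For deployments driven by environment variables, the rename in the table above amounts to a simple key remapping. The sketch below is illustrative only — the `RENAMES` map reflects the table, but the `migrate_env` helper is hypothetical and not part of Dynamo:

```python
# Hypothetical helper: rewrite legacy DYN_KV_* environment variables to the
# new DYN_ROUTER_* names introduced in #6361. Keys not in the map pass through.
RENAMES = {
    "DYN_KV_EVENTS": "DYN_ROUTER_USE_KV_EVENTS",
    "DYN_KV_OVERLAP_SCORE_WEIGHT": "DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT",
    "DYN_ASSUME_KV_REUSE": "DYN_ROUTER_ASSUME_KV_REUSE",
    "DYN_DURABLE_KV_EVENTS": "DYN_ROUTER_DURABLE_KV_EVENTS",
    "DYN_TRACK_ACTIVE_BLOCKS": "DYN_ROUTER_TRACK_ACTIVE_BLOCKS",
    "DYN_ROUTER_TTL": "DYN_ROUTER_TTL_SECS",
}

def migrate_env(env: dict) -> dict:
    """Return a copy of `env` with legacy router variables renamed."""
    return {RENAMES.get(key, key): value for key, value in env.items()}
```

The same mapping applies one-for-one to the CLI flags (`--kv-events` becomes `--router-kv-events`, and so on).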
- Disagg Flag Inverted (#6515): `--enforce-disagg` replaced by `--decode-fallback` with inverted semantics — disaggregated mode is now enforced by default. Migrate: Replace `--enforce-disagg` with `--decode-fallback`. If you need fallback to aggregated mode, explicitly pass `--decode-fallback` or `DYN_DECODE_FALLBACK=true`. In the EPP plugin, update from `DYN_ENFORCE_DISAGG` to `DYN_DECODE_FALLBACK` with an inverted boolean.
- Migration Limit Moved to Frontend (#5918): The `--migration-limit` CLI flag has been removed from all backend workers (vLLM, SGLang, TRT-LLM) and is now set on the Frontend only. Migrate: Remove `--migration-limit` from backend launch commands; pass it to the Frontend instead.
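Because the new flag is the exact negation of the old one, the inverted-boolean migration can be captured in one line. This is an illustrative sketch — the helper name is hypothetical, not a Dynamo API:

```python
# Hypothetical mapping from the removed --enforce-disagg / DYN_ENFORCE_DISAGG
# value to the new --decode-fallback / DYN_DECODE_FALLBACK value (#6515).
# Enforcing disaggregation is now the default, so fallback to aggregated
# mode is precisely the negation of the old setting.
def decode_fallback_from_legacy(enforce_disagg: bool) -> bool:
    return not enforce_disagg
```

For example, a deployment that previously ran without `--enforce-disagg` (fallback allowed) must now pass `--decode-fallback` explicitly to keep the same behavior.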
...
Dynamo Release v0.9.1
Dynamo v0.9.1
Release Notes
Summary
Dynamo 0.9.1 is a patch release that upgrades TensorRT-LLM from v1.3.0rc1 to v1.3.0rc3 and removes a KVBM workaround that is no longer needed with the upgraded TRT-LLM version.
Base Branch: release/0.9.0
Version Upgrades
- TensorRT-LLM v1.3.0rc3: Upgraded TensorRT-LLM from v1.3.0rc1 to v1.3.0rc3 across version pins. This update includes upstream bug fixes and performance improvements (#6402).
Bug Fixes
- pydantic-settings Compatibility: Pinned `pydantic-settings<2.13.0` to fix a `TypeError: DynamicYamlWithDeepMergeSettingsSource._read_files() got an unexpected keyword argument 'deep_merge'` error that occurred with pydantic-settings v2.13.0+ in TRT-LLM autodeploy tests. This affects the legacy `Dockerfile.trtllm` build path used in release/0.9.1 (#6402).
- KVBM Workaround Removal: Reverted the KVBM disaggregated serving workaround since TRT-LLM v1.3.0rc3 includes the upstream fix (TRT-LLM #11247). This re-enables TRT-LLM+KVBM tests and removes legacy workaround code (#6495).
Known Issues
For known issues in this release, refer to the Known Issues section in the Dynamo v0.9.0 Release Notes.
Dynamo v0.9.0
Dynamo v0.9.0 Release Notes
Summary
Dynamo v0.9.0 completes the infrastructure decoupling started in v0.8.0, expands multimodal and diffusion model support across all three backends, and introduces smarter scheduling with predictive load estimation and routing hints.
Infrastructure Modernization
The new Event Plane—built on high-performance ZMQ transport with MessagePack serialization—joins the Discovery Plane and Request Plane to form a fully decoupled communication architecture. Dynamo deployments no longer require NATS or etcd: Kubernetes-native service discovery replaces etcd, KV router queries run over the native Dynamo endpoint instead of NATS, and the Event Plane provides a transport-agnostic pub/sub layer for system events. These changes simplify deployment topology and reduce operational dependencies.
Multimodal & Diffusion
Dynamo expanded multimodal support across all three backends in this release. Encoder disaggregation is now available for both vLLM (via the Embedding Cache connector) and TRT-LLM (via a standalone encoder), allowing encoding to run on a separate GPU from prefill/decode. Dynamo can now serve multimodal SGLang workloads on a single GPU instead of requiring a full E/PD split. We also added first-class support for diffusion-based language models — LLaDA2.0 can now be served alongside autoregressive models in the same Dynamo deployment.
Scheduling Intelligence
Router gained output block tracking with fractional decay for predictive load estimation, expected output token awareness, and support for routing hints from external orchestrators like Kubernetes Gateway API Inference Extension (GAIE). The Planner added Kalman filter and mooncake-style warmup for more accurate load prediction, along with SLA-driven autoscaling for MoE DEP/TEP configurations. The Profiler was enhanced with PVC model cache support and model name validation.
Kubernetes & Observability
Operator added rollout restart for DynamoGraphDeployments, observability metrics, tolerations/affinity for GPU-specific scheduling, and improved restart reliability. Distributed tracing now spans the full request path including TCP transport, and the Prometheus metrics stack was simplified with multi-registry scrape support.
First-Time Contributors
We welcome 14 new contributors to the Dynamo project:
- @siclait contributed a PR that truncates HttpError messages to 8192 characters to prevent ValueError on long messages (#5020).
- @smatta-star contributed a PR that adds auto-generated OpenAPI spec and helper binary for the frontend (#4802).
- @shpgy-shpgy contributed a PR that fixes multimodal processing error when handling pure text conversations (#5088).
- @chay1045 contributed a PR that fixes hidden stop tokens appearing in output by returning `None` instead (#5238).
- @wenqiglantz contributed a PR that adds prompt embeds support for pre-computed inference inputs in vLLM (#4739).
- @yurekami contributed a PR that preserves original model path for frontend config downloads (#5102).
- @erezzarum contributed a PR that fixes NIXL CUDA12 + CUDA13 build compatibility (#5000).
- @soodoshll contributed a PR that fixes `usage` returning `None` when using text mode with vLLM (#5336).
- @ls-2018 contributed a PR that fixes tag error handling (#5236).
- @debermudez contributed a PR that updates aiperf to v0.4.0 (#5331).
- @wangshangsam contributed a PR that updates vLLM import paths to align with upstream main (#5447).
- @AbhiOnGithub contributed a PR that adds `__all__` exports and `__repr__` methods for improved debugging (#5606).
- @davilu-nvidia contributed a PR that resolves SGLang E/P/D multimodal routing issues (#5500).
- @adityapuranik99 contributed a PR that adds cupy-cuda12x to SGLang extras for CUDA compatibility (#5627).
Major Features & Improvements
Infrastructure Modernization
Discovery Plane
- K8s-Native Service Discovery: Enabled Kubernetes-based discovery in GAIE and updated Helm charts/RBAC to support etcd-less deployments, allowing Kubernetes users to deploy without running a separate etcd cluster (#5303, #5432, #5364).
- etcd Reliability: Resolved potential deadlocks in legacy etcd usage and updated examples to run without etcd, ensuring stable startup for users still on etcd-based discovery (#5091, #5422).
- List-and-Watch Diffing: Resolved diffing logic issue where worker metadata updates (e.g., LoRA adapter additions) were not picked up, causing stale routing decisions (#5318).
Request Plane
- NATS Dependency Removal: Migrated KV router worker queries to the native Dynamo endpoint to reduce NATS traffic (#5451), made NATS optional for KV-aware routing in approximate mode so local development works without a NATS server (#5237), fixed a NATS container startup failure caused by an invalid `--max_payload` CLI flag by moving it to the config file (#5384), and cleaned up asymmetric request plane configuration in launch scripts (#5245).
Event Plane
- Event Plane Architecture: Introduced a transport-agnostic Event Plane with MessagePack serialization and auto-discovery, decoupling system events (KV cache transfers, notifications) from direct NATS dependency. Added high-performance ZMQ transport as a scalable alternative for latency-sensitive event channels while preserving NATS for backward compatibility (#5674, #5614, #5624).
- Event Plane NATS Init: Corrected NATS initialization logic based on the `--event-plane` argument across all backends, preventing silent failures when NATS is not configured (#5750).
- ZMQ Transport Timeout: Added a receive timeout for ZMQ transport to prevent indefinite hangs when a publisher is unavailable (#5804).
Networking
- IPv6 Support: Added IPv6 support for SGLang disaggregation with proper address formatting, enabling deployments on IPv6-only networks (#5521).
Multimodal & Diffusion
SGLang
- Aggregated Multimodal: Enabled Dynamo to serve multimodal SGLang workloads on a single GPU, removing the previous requirement for a 2-GPU E/PD split (#5450).
- Diffusion LM Support: Enabled Dynamo to serve diffusion-based language models (LLaDA2.0) through the SGLang backend, using existing Dynamo infrastructure for pre/post processing with a new diffusion handler (#5533).
- Multi-Image Qwen EC: Resolved multi-image bug in the Dynamo EC connector that dropped images beyond the first in multimodal requests (#5514).
TensorRT-LLM
- Standalone Encoder: Added encoder disaggregation support to Dynamo's TRT-LLM integration, enabling encoding to run on a separate GPU from prefill/decode (#4668).
- Multimodal Tokenizer Reuse: Optimized Dynamo's multimodal request pipeline for TRT-LLM by reusing the tokenizer across requests instead of reinitializing per request, reducing per-request latency (#5217).
vLLM
- Embedding Cache Connector: Added the Embedding Cache (EC) connector to Dynamo's vLLM integration for encoder disaggregation, where the encoder stores embeddings by hash and PD workers consume them from cache—eliminating redundant encoding and reducing TTFT. Also enabled multiple image inputs per request and parallelized image loading (#5162, #5463, #5444).
- Prompt Embeds Support: Added pre-computed embeddings as a secure input method to Dynamo, allowing applications to transform sensitive data into embeddings before submission for improved privacy and flexible prompt engineering (#4739).
- EPD Refactor: Refactored Dynamo's EPD handler to orchestrate the full encode-to-PD flow (processor → encoder → processor → PD), supporting multiple multimodal data items per request instead of just one (#4994).
- Decode Worker Qwen-VL: Resolved a disaggregated decode crash for Qwen2.5-VL models caused by missing `image_grid_thw` data needed for mRoPE position encoding (#5281).
- EPD Sampling Params: Corrected sampling params parsing in Dynamo's vLLM EPD flow that could silently produce incorrect generation parameters (#5833).
Performance & Hardware
- SGLang Stream Output: Enforced `stream_output=True` in SGLang ServerArgs, switching from cumulative-to-delta token conversion to direct disjoint segment passthrough—reducing per-token processing overhead in streaming responses (#5510).
- Multimodal Payload Optimization: Removed serialization/deserialization in `gather_multi_model_data`, significantly reducing latency for requests with large base64-encoded payloads (#5485).
- Zero Copy TCP Decoder: Implemented a zero-copy decoder with a bounded worker pool for TCP ingress, eliminating memory leaks under high concurrency and reducing per-message allocations (#5376).
- MoE Data Parallel Tuning: Reduced `VLLM_MOE_DP_CHUNK_SIZE` to 384, lowering the HBM footprint enough to enable inference on 16xH200 MoE configurations that previously hit OOM (#5307).
- TRT-LLM GB200 Support: Resolved a memory allocation failure on GB200 hardware (#5328) and updated the Wide-EP disaggregated GB200 recipe for compatibility with the latest TRT-LLM version (#5383).
Router
- Router Scheduling Intelligence: Added output block tracking with fractional decay for predictive load estimation (#5452), plumbed expected output tokens so the router can account for generation length when distributing requests (#5181), and added a flag to disable decode KV reuse assumption so the router computes actual block hashes for more accurate cache-hit predictions (#5350).
- Routing Hints from Headers: Added support for reading routing hints from request headers, allowing external orchestrators (e.g., GAIE) to influence routing decisions without modifying the request body (#5502).
- PrefillComplete Hook: Implemented PrefillComplete handling in Dynamo EPP Scor...
Dynamo v0.8.1
Dynamo v0.8.1 Release Notes
Summary
Dynamo 0.8.1 is a patch release that adds profiler enhancements for Kubernetes deployments and addresses bug fixes for SGLang and worker identification. This release adds support for mounting model cache PVCs to profiler pods, fixes YAML configuration parsing for boolean flags in SGLang, resolves container build issues for CUDA 13 SGLang environments, and corrects a pod hash calculation issue that could affect worker identification in Kubernetes.
Base Branch: release/0.8.0
Major Features & Improvements
Kubernetes Deployment
- Profiler Model Cache PVC Support: Added the ability to mount model cache PVCs to profiler pods when specified in `DynamoGraphDeploymentRequest`, enabling profilers to access pre-downloaded model weights without re-downloading (#5212).
Bug Fixes
- SGLang YAML Config Parsing: Fixed YAML config parsing for `store_true` arguments (e.g., `trust-remote-code`, `enable-metrics`) that were incorrectly converted to `--flag true` instead of just `--flag`, breaking boolean configuration options (#5513).
- SGLang CUDA 13 Container Build: Fixed NVIDIA package installation in the SGLang CUDA 13 container to install CuDNN 9.16+ based on CUDA version, resolving PyTorch 2.9.1 compatibility issues with `nn.Conv3d` that caused performance degradation and excessive memory usage in multimodal workloads (#5461).
- Worker ID Precision Loss: Fixed routing failures caused by `f64` precision loss when worker/instance IDs exceeded 2^53, which caused approximately half of the workers in large deployments to be unreachable for KV cache routing decisions (#5471).
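The root cause of the worker-ID fix above is easy to demonstrate. This is plain IEEE 754 arithmetic, not Dynamo code: a 64-bit float has a 53-bit significand, so distinct integer IDs above 2^53 collide once they pass through an `f64`:

```python
# Illustrative only: integer worker IDs above 2**53 lose precision when
# round-tripped through a 64-bit float, so two distinct IDs compare equal.
worker_id_a = 2**53
worker_id_b = 2**53 + 1          # not exactly representable as a float

assert float(worker_id_a) == float(worker_id_b)   # distinct IDs collide
assert 2**53 + 2 == int(float(2**53 + 2))         # even IDs survive — hence
                                                  # "approximately half" unreachable
```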
Documentation
- DGDR SLA Profiler Compatibility: Documented that `DynamoGraphDeploymentRequest` profiling configurations using camelCase field names and model cache PVC options require Dynamo 0.8.1 or later (#5492).
Known Issues
For known issues in this release, refer to the Known Issues section in the Dynamo v0.8.0 Release Notes.