
Releases: ai-dynamo/dynamo

Dynamo v1.1.0

04 May 20:04
0dd5374


Release Notes

Dynamo v1.1.0 is the 14th feature release of the open-source distributed inference platform. It makes the standalone KV indexer recoverable across node failures, brings the Anthropic Messages API to production for Claude Code, lands SGLang multimodal disaggregated serving, and turns the Mocker into a unified performance-modeling and offline-replay engine.

Summary

Resilient KV Routing at Scale

The standalone KV indexer is now recoverable. New replicas bootstrap their radix tree from a healthy peer's /dump endpoint before serving. Inline ZMQ gap detection replays dropped messages from the engine ring buffer. Multi-model and multi-tenant isolation lets shared clusters route to the correct cache. Ships as a maturin-built dynamo-kv-indexer on PATH with Prometheus metrics.
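The gap-detection-and-replay idea can be sketched minimally in Python, assuming KV events carry a monotonically increasing `seq` field (the field name and event shape here are hypothetical illustrations, not Dynamo's actual wire format):

```python
def detect_gaps(last_seq: int, msgs: list[dict]) -> list[int]:
    """Return sequence numbers missing between last_seq and the newest
    received message; these would be replayed from the engine ring buffer.

    'seq' is a hypothetical field name for illustration only.
    """
    seen = {m["seq"] for m in msgs}
    expected = range(last_seq + 1, max(seen, default=last_seq) + 1)
    return sorted(set(expected) - seen)
```

For example, if the indexer last applied sequence 0 and then receives events 1 and 3, the gap is `[2]`, and only that event needs replaying before the radix tree is consistent again.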

Anthropic Messages API for Claude Code

/v1/messages is production-grade for Claude Code-style harnesses. cache_control is honored at top-level, per-block, and system-block-array forms. Thinking-block pass-through, system-prompt preamble stripping, accurate streaming input_tokens, and Anthropic image → OpenAI image_url conversion cover end-to-end vision and reasoning. /v1/models exposes context_window; streaming double-parsing and reasoning_content round-tripping are fixed.
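The Anthropic image → OpenAI `image_url` conversion can be sketched as follows, using the publicly documented content-block shapes of both APIs (this mirrors the behavior described above, not Dynamo's actual implementation):

```python
def anthropic_image_to_openai(block: dict) -> dict:
    """Convert an Anthropic image content block to the OpenAI image_url form.

    Anthropic base64 blocks look like:
      {"type": "image",
       "source": {"type": "base64", "media_type": "image/png", "data": "..."}}
    and URL blocks like:
      {"type": "image", "source": {"type": "url", "url": "https://..."}}
    """
    src = block["source"]
    if src["type"] == "base64":
        url = f"data:{src['media_type']};base64,{src['data']}"
    else:
        url = src["url"]
    return {"type": "image_url", "image_url": {"url": url}}
```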

Performance Modeling & Offline Replay

The Mocker becomes a performance-modeling and offline-replay engine. SGLang engine simulation, AIConfigurator-backed latency prediction with MoE parallelism, --decode-speedup-ratio for speculative decoding, and offline agg/disagg replay over Mooncake-style traces all land. Forward-pass metrics flow onto the event plane, enabling planner-in-the-loop replay. The Planner consolidates onto a single FPM-based regression model that serves both throughput and load scaling.

Multimodal Embedding Cache & Diffusion

SGLang multimodal disaggregated serving lands with NixlEmbeddingSender/NixlEmbeddingReceiver, an embedding cache, and device-type-and-load EPD routing. vLLM E/P/D worker init moves to a factory; embedding loading is abstracted into ImageLoader. Diffusion adds image-to-video on vLLM Omni, image-to-image on SGLang, audio/TTS, video input for SGLang aggregated, audio-in-video, and Flux benchmarking. The v1.0.0 PersistentConnector monkey-patch is retired for a proper nixl_connector integration.

Open-Source Contributions

Between v1.0.2 and v1.1.0, the project merged 896 PRs from 113 contributors. New first-time external contributors in this release include:

Returning external contributors include @michaelfeil (Baseten), @vladnosiv (Yandex.Cloud), @dsocek (Intel), @AmeenP (PrimeIntellect), @huitianbai, @InfraWhisperer (F5), @devivasudevan (Microsoft), @Jont828 (Microsoft), @ashnamehrotra (Microsoft), @Ryan-Amirthan (Fern), and several others.

If you would like to get involved, please see our Contribution Guide.


Key Dependencies

| Dynamo | SGLang | TensorRT-LLM | vLLM | NIXL | UCX |
| ------ | ------ | ------------ | ---- | ---- | --- |
| v1.1.0 | v0.5.10.post1 | v1.3.0rc11 | v0.19.0 | v1.0.1 (SGLang) / v0.10.1 (TRT-LLM, vLLM) | 1.20 |

CUDA Variants

| Backend | CUDA 12 | CUDA 13 |
| ------- | ------- | ------- |
| vLLM | 12.9 | 13.0 |
| SGLang | 12.9 | 13.0 |
| TensorRT-LLM | | 13.1 |

The vLLM XPU/CPU image targets Intel deep-learning-essentials 2025.3.2 and stays on vLLM v0.16.0 for v1.1.0.

Dynamo Ecosystem

| AIConfigurator | AIPerf | ModelExpress | Grove |
| -------------- | ------ | ------------ | ----- |
| v0.8.0 | v0.7.0 | v0.3.0 | v0.1.0-alpha.6 |

For container images, wheels, Helm charts, Rust crates, and the full pinned matrix, see Release Artifacts and the Support Matrix.


Breaking Changes

ACTION REQUIRED: The following changes require updates to your code, configuration, or deployment manifests before upgrading.

Notable Behavioral Changes

  • enable_nats and use_kv_events Removed from DistributedRuntime (#7265): Both parameters are removed from DistributedRuntime, create_runtime(), and the dynamo_worker() decorator. NATS is now auto-detected from the event plane: enabled when the request plane is NATS or NATS_SERVER is configured.

    Migrate: Drop both arguments from your Python entry points and configure NATS via the DYN_EVENT_PLANE and NATS_SERVER environment variables instead.
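A minimal sketch of the migration for a Python entry point (the `create_runtime()` call is shown only in comments, since its full signature is not reproduced in these notes; the endpoint values are illustrative):

```python
import os

# Before v1.1.0 (parameters now removed, per the note above):
#   runtime = create_runtime(enable_nats=True, use_kv_events=True)

# After v1.1.0: NATS is auto-detected from the event plane, so configure
# it via the environment before creating the runtime.
os.environ["DYN_EVENT_PLANE"] = "nats"
os.environ["NATS_SERVER"] = "nats://nats.default.svc:4222"

#   runtime = create_runtime()  # no NATS arguments
```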

  • Experimental nvext.cache_control Cache Pinning Removed (#7790): The experimental cache-pinning feature is removed: the nvext.cache_control request field, the --enable-cache-control flag, and the DYN_ENABLE_CACHE_CONTROL env var are all gone. SGLang upstream chose a different direction, so the v1.0.0 plumbing is being unwound. The Anthropic Messages parser still accepts cache_control blocks for protocol compatibility but no longer derives router-pin TTLs from them.

    Migrate: If you depended on cache pinning, track the v1.2.0 sticky-session / session-controller work. There is no drop-in replacement in v1.1.0.

  • Cargo-Built dynamo-kv-indexer Binary Removed (#7338): The Cargo-built dynamo-kv-indexer binary in lib/kv-router/target/release/ is removed; the maturin-built binary shipped via the Python wheel is now the single source.

    Migrate: Update launchers and Dockerfiles to point at the wheel-installed dynamo-kv-indexer (on PATH after pip install ai-dynamo).

  • LLaVA-Specific EPD Path Removed; EPD Now Single-GPU (#6674): The LLaVA-specific multimodal EPD code path is removed, EPD is now constrained to single-GPU configurations, and the default multimodal example moved from Llava-Mistral to Qwen/Qwen3-VL-2B-Instruct.

    Migrate: Switch LLaVA workloads to the aggregated path or to a Qwen3-VL recipe.

  • Compressed Concurrent Tree Default (#7874): The KV router now defaults to the compressed concurrent radix tree. This improves resource utilization for multi-threaded indexing, but node-allocation semantics differ from the previous tree.

    Migrate: Update any custom instrumentation that targeted the old radix-tree internals. No action required for default deployments.

  • MDC Checksum Scoped Per-WorkerSet (#7368): Model Discovery Card checksum validation moved from per-Model to per-WorkerSet. Different WorkerSets under the same Model can now carry different configuration without forcing workers to drain first.

    Migrate: If you relied on the v1.0.0 strict per-Model behavior, audit your WorkerSet configs before upgrading.

Deprecated and Removed

  • vLLM Auto-Enable KV Events Removed (#7591): The deprecated automatic KV-events config in vLLM is removed; the DYN_VLLM_KV_EVENT_PORT env var is also no longer supported.

    Migrate: Set --kv-events-config explicitly per the v1.0.0 migration note.

  • Unused genai-perf Pin Dropped (#8763): The unused genai-perf==0.0.15 pin was removed from container/deps/requirements.benchmark.txt. It was not invoked anywhere in the repo.

    Migrate: No action required. aiperf is the supported in-container benchmarking tool.

v1.0.0 Future-Deprecation Reminders

The following warnings from v1.0.0 still apply. Migrate before they are removed:

  • v1alpha1 DGDR API: migrate to v1beta1
  • enableGpuDiscovery CRD field has no effect
  • ComponentName field on ServiceReplicaStatus: migrate to ComponentNames
  • Router CLI flags without the --router- prefix
  • vLLM --is-prefill-worker/--is-decode-worker: migrate to --disaggregation-mode
  • --router-durable-kv-events: migrate to the event-plane subscriber

Features & Improvements

Multimodal & Diffusion

Embedding Cache & E/P/D

  • SGLang Embedding Cache: Added an SGLang embedding cache for cross-request reuse (#7674).
  • NIXL WRITE for Embedding Transfer: Added a NIXL WRITE initiation path for cross-node embedding transfer (#6651).
  • **SGLang Emb...

Dynamo v1.2.0-deepseek-v4-dev.2

01 May 15:18
faca1e2


Pre-release

Dynamo v1.2.0-deepseek-v4-dev.2 - Release Notes

Summary

Dynamo v1.2.0-deepseek-v4-dev.2 is the second experimental dev cut for DeepSeek-V4-Flash and DeepSeek-V4-Pro on Blackwell. The headline change is the vLLM 0.20.0 upgrade, which lands native DeepSeek-V4 support upstream and lets the published vLLM container drop the dev.1 custom Dockerfile overlay. The release also publishes both vLLM and SGLang containers for V4-Flash and V4-Pro, tightens DeepSeek-V4 frontend correctness (thinking-mode toggle keys, DSML tool-call parsing, skip_special_tokens default), fixes a vLLM/NIXL cancellation race that crashed EngineCore mid-KV-transfer, and consolidates the recipe layout into a single recipes/deepseek-v4/ subtree. This is a snapshot build for early access to V4 model support and is not a QA-gated release.

Base Branch: release/1.2.0-deepseek-v4-dev.2

Container Images (RC4, staging)

| Backend | Arch | Image |
| ------- | ---- | ----- |
| vLLM (CUDA 13) | multi-arch (amd64 + arm64) | nvcr.io/nvstaging/ai-dynamo/vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.2-rc4 |
| SGLang (CUDA 12) | amd64 only | nvcr.io/nvstaging/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda12-dev.2-rc4 |
| SGLang (CUDA 13) | arm64 only | nvcr.io/nvstaging/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda13-dev.2-rc4 |

Backend Versions

| Backend | Base Image | CUDA | Python | Notes |
| ------- | ---------- | ---- | ------ | ----- |
| vLLM | vllm/vllm-openai (0.20.0) | 13.0 | 3.12 | vLLM 0.20.0 ships native DeepSeek-V4 support; the custom dsv4 Dockerfile was dropped in #8786 |
| SGLang | lmsysorg/sglang:deepseek-v4-blackwell | 12.9 / 13.0 | 3.12 | Upstream DSv4 Blackwell preview branch; Dynamo overlays V4 parsers, hardening fixes, and routed_experts opt-in gating |

NIXL 0.10.1 and UCX v1.20.0 are bundled in both published images. TensorRT-LLM is not part of this dev release. Backend versions are pinned to upstream DSv4 preview images, not the standard Dynamo backend pins.

Models

| Model | HuggingFace ID | Hardware | Notes |
| ----- | -------------- | -------- | ----- |
| DeepSeek-V4-Flash | deepseek-ai/DeepSeek-V4-Flash | 4× B200 (TP=4); GB200 also supported | MXFP4 MoE via FlashInfer; EAGLE MTP 3/4 |
| DeepSeek-V4-Pro | deepseek-ai/DeepSeek-V4-Pro | 8× B200 (TP=8) | MXFP4 MoE via FlashInfer; EAGLE MTP 3/4. Thinking mode is broken upstream — see Known Issues |

Both models run on either the published vLLM or SGLang container. See recipes/deepseek-v4/deepseek-v4-{flash,pro}/README.md for per-model deployment.

What Changed Since dev.1

dev.2 carries forward everything that shipped in v1.2.0-sglang-deepseek-v4-dev.1:

  • DeepSeek V4 frontend parser, prompt formatter, and DSML tool-call parser (#8665).
  • KV router ZMQ wire-parser fix that filters non-main KV event groups (#8669).
  • Initial DeepSeek-V4 SGLang recipe with Dockerfile.dsv4-sglang and per-model manifests (#8704).
  • Initial DeepSeek-V4-Flash and V4-Pro vLLM aggregated recipes (#8668).

dev.2 adds the changes below on top.

Full Changelog

vLLM Backend Upgrade

  • vLLM 0.20.0 with Native DeepSeek-V4 Support: Bumped the published vLLM container to vLLM 0.20.0 (#8762), which lands native DeepSeek-V4 architecture support upstream and eliminates the need for the dev.1 custom DSv4 Dockerfile overlay (dropped in #8786). DSv4-relevant upstream additions in vLLM 0.20.0 (full notes: v0.20.0):

Recipes

  • DeepSeek-V4 Family Layout Consolidation: Restructured the DSv4 recipes into a single self-contained recipes/deepseek-v4/ subtree following the repo-wide <recipe>/<framework>/<mode>/deploy.yaml convention, deduped the shared dsv4 Dockerfiles into one set, simplified the SGLang Dockerfile to consume the Dynamo donor image directly (no in-Dockerfile source build), pinned the SGLang base image to a specific digest, and documented both vLLM and SGLang deployment paths side-by-side in each per-recipe README (#8735). The dev.1 paths recipes/deepseek-v4-flash/ and recipes/deepseek-v4-pro/ are now recipes/deepseek-v4/deepseek-v4-{flash,pro}/.

Bug Fixes

Frontend

  • DeepSeek V4 Frontend Hardening: Hardened the DeepSeek V4 frontend after post-merge review of the dev.1 parser (#8670). The V4 prompt formatter now honors thinking: false, enable_thinking: false, and thinking_mode: "chat" as toggle keys end-to-end (previously only some keys routed through, leaving reasoning extraction silently on); reasoning_effort="max" and per-request drop_thinking overrides now actually take effect; bad chat_template_args values emit a structured warning instead of falling back silently. The DSML tool-call parser now keeps parameters that omit the optional string="true|false" attribute, preserves text between or after multiple DSML blocks, no longer leaks raw <|DSML|…> markup into normal_content when invoke parsing fails, and emits OpenAI-style call_<24hex> IDs. The DsmlParserConfig.function_calls_* fields were renamed to block_* (with serde aliases for back-compat), and V4 model-name matching now rejects composites like deepseek-v3.2-v4-merge. SGLang chat_processor_frontend parallel_tool_calls is wired through, fixing 9 previously broken runtime tests.
  • Special-Token Leakage in Streaming Detokenizer: Fixed the Rust backend's streaming detokenizer to default skip_special_tokens=true when the OpenAI request omits the field (#8780), matching vLLM, SGLang, and TensorRT-LLM defaults. Without this, DeepSeek-V4 occasionally emits token id 0 (<|begin▁of▁sentence|>) mid-output and the literal <|begin▁of▁sentence|> text leaked into content and reasoning_content for any client that did not opt in.

vLLM

  • vLLM/NIXL Cancellation Race During KV Transfer: Fixed a vLLM/NIXL race in disaggregated serving where aborting a request immediately after prefill could crash EngineCore while KV transfer was still in flight (#8624). The vLLM handler now defers engine_client.abort(request_id) until the engine produces the first token in decode mode, then resumes normal cancellation. Behavior is gated to the vLLM disaggregated decode flow only; aggregated and other backends are unchanged.
  • vLLM Pod Startup Crash on Renamed Flag: Fixed DSv4 and Qwen3 vLLM recipe deploy.yaml files that still passed the removed --disable-log-requests flag (#8693), causing pods to crash at startup with unrecognized arguments: --disable-log-requests on any image built against vLLM 0.19.1+. Renamed to --no-enable-log-requests (the argparse.BooleanOptionalAction-generated form of the new --enable-log-requests), preserving explicit "logging off" intent.
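The flag rename follows directly from argparse semantics: `BooleanOptionalAction` auto-generates a `--no-` prefixed negative form, so the replacement for the removed `--disable-log-requests` is `--no-enable-log-requests`. A minimal reproduction (the default value shown is illustrative, not vLLM's actual default):

```python
import argparse

# BooleanOptionalAction registers both --enable-log-requests and the
# auto-generated --no-enable-log-requests; --disable-log-requests no
# longer exists, so passing it fails with "unrecognized arguments".
parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable-log-requests",
    action=argparse.BooleanOptionalAction,
    default=False,  # illustrative default
)
args = parser.parse_args(["--no-enable-log-requests"])
print(args.enable_log_requests)  # False
```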

SGLang

  • SGLang return_routed_experts Compat Regression: Gated SGLang's return_routed_experts behavior behind an opt-in flag for DeepSeek-V4 compatibility, plus faster model downloading in pytests (#8828; cherry-picks #8821 and #8798). This keeps the SGLang DSv4 path stable on the published image without breaking other SGLang model paths that do not expect routed-expert metadata.
  • Silent 128-Token Cap on Omitted max_tokens: Fixed the SGLang worker's _build_sampling_params to preserve explicit max_new_tokens=None instead of stripping it (#8743). When a client omitted max_tokens from a chat completion request (valid per OpenAI spec), the dict comprehension previously filtered out the None and SGLang silently fell back to its 128-token internal default, returning finish_reason: length after just 128 tokens. Requests now run until EOS or context-length as expected.
  • Disagg Prefill Canary False Positive: Fixed the SGLang prefill handler so it honors the _HEALTH_CHECK marker and the canary's first observed response comes from the scheduler instead of a synthetic bootstrap_info yield (#8611). Without this, a hung SGLang prefill scheduler rank left /health reporting 200 indefinitely. Also moved HEALTH_CHECK_KEY = "_HEALTH_CHECK" to the shared dynamo.health_check module and applied the marker to all three vLLM payloads for cross-backend wire-format consistency. SGLang and TensorRT-LLM keep re-exports for backwards compatibility.
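The `max_tokens` bug above is a classic None-stripping pitfall, sketched here with illustrative field names rather than SGLang's actual `_build_sampling_params`:

```python
def build_sampling_params(request: dict) -> dict:
    """Sketch of the fix: preserve an explicit max_new_tokens=None so the
    engine runs to EOS/context length instead of falling back to its
    128-token internal default. The filter logic is illustrative."""
    params = {
        "temperature": request.get("temperature"),
        "max_new_tokens": request.get("max_tokens"),  # None when omitted
    }
    # Buggy version stripped every None, including max_new_tokens:
    #   params = {k: v for k, v in params.items() if v is not None}
    # Fixed: only strip Nones for keys where None means "use the default".
    return {
        k: v for k, v in params.items()
        if v is not None or k == "max_new_tokens"
    }
```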

Known Issues

DSv4-Pro Thinking-Mode Output Corruption

DeepSeek-V4-Pro produces corrupted output when thinking mode is enabled. The bug is engine-side (sparse-attention state lifecycle), reproduces on both vLLM and SGLang DSv4-Pro, and does not reproduce on DSv4-Flash. Tool calling, structured output, and non-thinking responses are unaffected.

Workaround: Disable thinking mode in chat_template_kwargs:

curl http://<frontend>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "..."}],
    "chat_template_kwargs": {"thinking": false}
  }'

The recipes/deepseek-v4-pro/README.md quickstart and the recipes/README.md index were updated with this caveat in #8720, and the curl example sets thinking: false explicitly so the included quickstart does not surprise users with corrupted output. Tracked for an upstream fix.

Full Changelog: v1.2.0-sglang-deepseek-v4-dev.1...release/1.2.0-deepseek-v4-dev.2


v1.2.0-sglang-deepseek-v4-dev.1

25 Apr 21:26
21f135f


Pre-release

Dynamo v1.2.0-dev.1-deepseekv4 - Release Notes

Summary

Dynamo v1.2.0-dev.1-deepseekv4 is an experimental dev release that adds DeepSeek-V4 model support, an SGLang recipe for V4-Flash and V4-Pro, and a KV router correctness fix required by V4's grouped KV cache layout. The release branches from main at commit 7d4572f9 (2026-04-23) and adds seven cherry-picks on top to enable end-to-end serving of deepseek-ai/DeepSeek-V4-Flash and deepseek-ai/DeepSeek-V4-Pro on Blackwell hardware. Only the SGLang container is published; vLLM recipes are included as source for users to build locally. This is a snapshot build for early access to V4 model support and is not a QA-gated release.

Base Branch: release/1.2.0-dev.1-deepseekv4

Container Image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-sglang-deepseek-v4-b200-dev.1

Backend Versions (shipped):

| Backend | Base Image | CUDA | Python | Notes |
| ------- | ---------- | ---- | ------ | ----- |
| SGLang | lmsysorg/sglang:deepseek-v4-blackwell | 12.9 | 3.12 | Upstream DSv4 Blackwell preview branch; Dynamo overlays V4 parsers + routed_experts fix |

NIXL 0.10.1 and UCX v1.20.0 are bundled in the published SGLang image.

Source-only (not published):

  • vLLM recipes for V4-Flash and V4-Pro target vllm/vllm-openai:deepseekv4-cu130 (built from vLLM PR #40760, the zyongye/vllm:dsv4 fork; CUDA 13.0; Python 3.12). Users build locally per recipes/deepseek-v4-{flash,pro}/container/README.md.
  • TensorRT-LLM is not part of this dev release.

Note: Backend versions in this dev release are pinned to the upstream DSv4 preview images, not the standard Dynamo backend pins (vLLM 0.19.1, SGLang 0.5.x).

Full Changelog

Frontend & Agents

  • DeepSeek V4 Parser Support: Added DeepSeek V4 frontend parser support including a new prompt formatter, the DSML tool-call parser, reasoning parser registration with deepseek_v4 / deepseek-v4 / deepseekv4 aliases routing to the Qwen reasoning parser, and 11 new test fixtures covering streaming and non-streaming tool-call scenarios (#8709). This is the cherry-pick of upstream #8665 and is the core enablement work that lets Dynamo's frontend tokenize, prompt, and parse tool calls for DeepSeek-V4-Flash and DeepSeek-V4-Pro models.

Recipes

SGLang (shipped)

  • DeepSeek-V4 SGLang Recipe: Added an SGLang recipe for DeepSeek-V4-Flash and DeepSeek-V4-Pro with a dedicated Dockerfile.dsv4-sglang and per-model sglang-dgd.yaml manifests (#8712, #8713, #8718). This supersedes the closed #8703 and includes follow-up fixes for the container PATH and the etcd binary path in the SGLang Dockerfile so the recipe runs end-to-end without manual edits. Users deploy with the published sglang-runtime:1.2.0-sglang-deepseek-v4-b200-dev.1 image.

vLLM (source-only, no published container)

  • DeepSeek-V4-Flash and V4-Pro vLLM Aggregated Recipes: Added experimental aggregated serving recipes for deepseek-ai/DeepSeek-V4-Flash and deepseek-ai/DeepSeek-V4-Pro on the vLLM backend (#8719) including dedicated Dockerfiles, model-cache and model-download manifests, and DynamoGraphDeployment YAMLs. Users build the image locally from the recipe Dockerfile against the upstream DSv4 vLLM base.

KV Router

  • ZMQ KV Event Group Filtering: Fixed the KV router's ZMQ wire parser to filter out non-main KV event groups so only the main full-compressed KV group is published into the radix index, instead of flattening sliding-window-attention (SWA) and other auxiliary groups together (#8705). This is the cherry-pick of #8669 and is required for correct prefix matching on DeepSeek V4, which emits a group_idx separating the main KV group from SWA-only groups.
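The group filtering can be sketched as below, assuming events carry a `group_idx` field as described and that the main full-compressed KV group uses index 0 (the index value and event shape are assumptions for illustration):

```python
MAIN_KV_GROUP = 0  # hypothetical: SWA/auxiliary groups carry other indices

def main_group_only(events: list[dict]) -> list[dict]:
    """Keep only main-group KV events for the radix index, dropping
    sliding-window-attention and other auxiliary groups so they are not
    flattened together during prefix matching."""
    return [
        e for e in events
        if e.get("group_idx", MAIN_KV_GROUP) == MAIN_KV_GROUP
    ]
```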

Dynamo v1.0.2

23 Apr 03:21
e3cbfde


Dynamo v1.0.2 - Release Notes

Summary

Dynamo v1.0.2 is a patch release focusing on Frontend correctness fixes, DGDR-driven Kubernetes deployment robustness, rolling-update flexibility, and guided-decoding input hardening.

Key fixes restore real stream metadata in non-streaming responses with tool calls, correct Kimi K2.5 tokenizer special-token handling that caused TensorRT-LLM to reject requests, and add byte-length and nesting-depth caps to the OpenAI guided-decoding path.

On the deployment side, DGDR-created DynamoGraphDeployments now derive their name from the parent DGDR, DGDR-managed ConfigMaps cascade-delete with their parent, the Operator no longer thrashes on foreground cascading deletion, and per-WorkerSet MDC checksum validation enables rolling updates with divergent worker configuration under the same Model.

Base Branch: release/1.0.1

Key Dependencies

| Dynamo | SGLang | TensorRT-LLM | vLLM | NIXL |
| ------ | ------ | ------------ | ---- | ---- |
| v1.0.2 | 0.5.9 | 1.3.0rc5.post1 | 0.16.0 | 0.10.1 |

For container images, wheels, Helm charts, and Rust crates, see Dynamo Release Artifacts.
For full version compatibility information, see Dynamo Support Matrix.

Full Changelog

Kubernetes Deployment

  • DGDR-Driven DGD Naming: Fixed Profiler-generated DynamoGraphDeployment naming so that DGDs derive their name from the parent DynamoGraphDeploymentRequest (<DGDR>-dgd) instead of from topology alone (<backend>-<agg/disagg>) (#7835), eliminating namespace-level name collisions when multiple DGDRs share the same backend/topology and respecting user-provided names from spec.overrides when present.
  • DGDR ConfigMap Owner References: Added Kubernetes owner references to ConfigMaps created by DGDR (#7881) so that DGDR-managed ConfigMaps are cascade-deleted with their parent.

Runtime

  • Per-WorkerSet MDC Checksum Validation: Scoped Model Discovery Card checksum validation from per-Model to per-WorkerSet (#8278), enabling rolling updates where different WorkerSets under the same Model can carry different configuration (e.g. tool-call parser) without draining existing workers first. Mismatches are still rejected when a new worker joins an existing WorkerSet, but cross-WorkerSet checksum drift is no longer a hard error.

Bug Fixes

  • DGD Cascading Deletion Thrashing: Fixed Operator behavior under foreground cascading deletion of DynamoGraphDeployments (#8212) so the Operator no longer thrashes the resource during teardown, ensuring clean DGD deletion in Kubernetes garbage-collection scenarios.
  • Stream Metadata Preservation: Fixed OpenAI Frontend stream finalization that overwrote real id, model, and created fields with hardcoded placeholders (stream-end, unknown, 0) when a tool-call parser combined streamed chunks into a non-streaming response (#8281), restoring correct response metadata for non-streaming tool-call requests.
  • Per-Node GPU Topology in DGD Builder: Fixed thorough-mode MoE config enumeration in the Planner/Profiler that ignored numGpusPerNode and produced unschedulable candidate DGDs on multi-node clusters (#8281). Worker GPU resource limits are now clamped per node and multinode.nodeCount is set for workers that span multiple nodes.
  • Kimi Tokenizer Special Tokens: Fixed Rust tiktoken tokenizer handling of reserved-token fallback names for Kimi K2.5 (#7898), resolving prompt-token inflation that caused TensorRT-LLM to reject requests with negative default_max_tokens and enabling correct serving of nvidia/Kimi-K2.5-NVFP4 and other Kimi K2.5 models.
  • Guided-Decoding Input Bounds: Added byte-length and nesting-depth caps to OpenAI guided-decoding input validation (#8349) — guided_grammar 64 KiB, guided_regex 32 KiB, guided_whitespace_pattern 1 KiB, guided_json 256 KiB serialized with a nesting-depth cap of 64 — bounding pathological inputs before they reach the downstream guided-decoding backend.
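The caps above can be sketched as a validation pass; the byte limits come from the note, while the depth-counting helper and function shape are illustrative, not the actual implementation:

```python
import json

# Byte caps per guided-decoding field, as listed in the release note.
CAPS = {
    "guided_grammar": 64 * 1024,
    "guided_regex": 32 * 1024,
    "guided_whitespace_pattern": 1024,
    "guided_json": 256 * 1024,
}
MAX_JSON_DEPTH = 64

def json_depth(value, depth=1) -> int:
    """Depth of a parsed JSON value; a scalar is depth 1."""
    if isinstance(value, dict):
        return max((json_depth(v, depth + 1) for v in value.values()),
                   default=depth)
    if isinstance(value, list):
        return max((json_depth(v, depth + 1) for v in value), default=depth)
    return depth

def validate_guided(field: str, value) -> None:
    """Reject pathological guided-decoding inputs before they reach the
    downstream backend."""
    raw = json.dumps(value) if field == "guided_json" else value
    if len(raw.encode("utf-8")) > CAPS[field]:
        raise ValueError(f"{field} exceeds {CAPS[field]} bytes")
    if field == "guided_json" and json_depth(value) > MAX_JSON_DEPTH:
        raise ValueError(f"guided_json nesting exceeds {MAX_JSON_DEPTH}")
```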

Full Changelog: v1.0.1...v1.0.2


Dynamo v1.1.0-dev.1

17 Mar 14:43
c758eb5


Dynamo v1.1.0-dev.1

Pre-release

Release Notes

Dynamo v1.1.0-dev.1 is a pre-release build that gives an early look at the features in Dynamo v1.1.0.

This build is not recommended for production use — features may be incomplete and APIs, behaviors, and defaults may change before the stable release.

Use it for evaluation, testing, and early feedback only.

Branch: release/1.1.0-dev.1
main Commit: c758eb5b58872fa9b2880b9ea29d058b141bb8ec
Full Changelog: v1.0.1...v1.1.0-dev.1


Major Features

  • Move standalone KV indexer into its own crate so it can run as an independent process (#6569)
  • Multi-model and multi-tenant isolation for KV indexer (#6830)
  • P2P recovery for standalone KV indexer (#6934)
  • ZMQ gap detection + replay for standalone KV indexer (#7209)
  • Prometheus metrics for standalone KV indexer (#7339)
  • Package dynamo-kv-indexer binary via maturin (#7194, #7395)
  • Standalone KV indexer runtime integration (#7295)
  • Pluggable scheduling policy for router queue (#7260)
  • Concurrent router perf improvements (#6536)
  • Trait-based event system (Velo) for async coordination (#6315)
  • Velo transports (#6547)
  • Unix domain socket for Velo (#7197)
  • kvbm-physical for direct GPU memory management (#6490)
  • FlexKV integration in Dynamo (#5858)
  • Full Anthropic Messages API cache_control support (top-level, per-block, system block arrays) (#6629)
  • Anthropic thinking block support and preamble stripping to /v1/messages (#7137)
  • Support multimodal (vision) inputs in Anthropic Messages API (#7256)
  • Streaming tool call and reasoning dispatch SSE events (#7114)
  • Dynamic batching of client-side events (#6733, #6741)
  • SGLang chat processor for frontend pre/post processing (#6834)
  • Move CRD apply from Helm hook Job to init container on operator Deployment (#6780)
  • Move webhook certificate management and CA injection from Helm hooks into operator (#6839)
  • Move MPI SSH key generation from Helm hook Job into operator reconciliation (#6940)
  • Replace kube-rbac-proxy sidecar with controller-runtime WithAuthenticationAndAuthorization (#7045)
  • GPU discovery extension using DCGM exporter for advanced metrics (#6705)
  • GlobalPlanner --max-total-gpus for cluster-wide GPU budget (#7103)
  • Loki log aggregation, unified OTLP ingestion for traces and logs (#6974)
  • Propagate OTEL trace context across E/P/D multimodal workers (#7239)
  • Kimi-K2.5 model recipe with Baseten's model (#6602)
  • Kimi-K2.5 (nvidia/Kimi-K2.5-NVFP4) recipe for agg and KVBM with TensorRT-LLM patch (#6842)
  • DeepSeek V3.2 TensorRT-LLM recipe (#6688)
  • Qwen3-VL-30B recipe for agg and encoder cache with vLLM patch (#6919)
  • Integrate fastokens BPE tokenizer backend (#7387)
  • Upgrade Rust to v1.93.0 (#6802)
  • vLLM 0.16.0 → 0.17.1 (#7170)
  • Enable Intel XPU Dockerfile (#6109, #7134)
  • Add support for CPU builds in Dockerfiles (#7139)
  • Multi-image in request support for SGLang backend (#6068)
  • vLLM omni image-to-video support (#6530)
  • Auto GPU VRAM estimator for disagg-same-GPU (#6868)
  • Forward pass metrics via ZMQ in vLLM (#7200) and Dynamo event plane integration (#7250)

Minor Features & Improvements

  • Add profiler job overrides (#6607)
  • Improve mooncake_bench sweep logic and throughput accounting (#6631)
  • Allow deepseek_v3 architecture to use Kimi's BPE pattern (#6653)
  • Wire nvext.cache_control TTL-based pinning through Dynamo router (#6213)
  • Default router_event_threads to 4 (#6672)
  • Add ActiveSequences benchmark and extract common bench utils for KV router (#6633)
  • Add GPU info to tests when killing a process (#6552)
  • Initial Claude skills (#6703)
  • BPF for frontend perf tracing (#6737)
  • Linear scan improvements for KV router (#6363)
  • Router queue depth Prometheus metric and nvext field (#6786)
  • Testing utilities for kvbm-logical and kvbm-physical (#6691)
  • Use model_fields_set to distinguish TTFT/ITL default usage (#6814)
  • Optimize dev/local-dev Dockerfiles for source-based development (#6743)
  • Enable KVBM metrics on K8s for Kimi-K2.5 recipe (#6963)
  • Add NVTX markers for vLLM EPD (#6627) and SGLang EPD (#7079)
  • Multimodal benchmark sweep (#6795)
  • Additional metrics for dynamo-trtllm (#6668)
  • Main branch cross-reference to pr-monitor skill (#6889)
  • Add missing overrides (#7017)
  • Remove benchmark shim, use AIPerf directly (#7074)
  • Add extra CodeRabbit review guidelines for Python and pytest (#7144)
  • Split monolithic requirements.txt and remove test deps from runtime image (#6656)
  • Frontend pipeline and tokio runtime perf metric definitions (#6731)
  • Hide optimizationType (#7160)
  • Refactor launch scripts with shared launch_utils.sh for consistent failure handling (#7008)
  • Apply DGD overrides before running interpolation (#7226)
  • Add generate health check support for PD SGLang (#6004)
  • Replace PersistentConnector monkey-patch with proper nixl_conn (#6913)
  • --decode-speedup-ratio for speculative decoding simulation in mocker (#7349)
  • Harden ManagedProcess teardown, add xdist-safe tests (#6670)
  • Request-plane, transport, and work-handler metric definitions (#6735)

Bug Fixes

  • Guided decoding arg placement, None guard, test cleanup (#6617)
  • Fix GPU discovery preflight job (#6628)
  • Proper DGD prefix for naive fallback in DGDR (#6667)
  • Remove costly logs in EPD (#6696)
  • Fix docs links (#6702)
  • Fix chat processor for vLLM video/audio examples (#6689)
  • Broken link in profiler guide (#6709)
  • Shell-quote Ray leader args (#6693)
  • AutoApply bool → AutoApply *bool (#6683)
  • Store nodesWithGPUs (#6690)
  • Phase out llava and make EPD single GPU (#6674)
  • Propagate vllm-distributed-executor-backend annotation from DGD metadata to backends (#6692)
  • Poll for snapshot before starting Router 2 in indexers_sync (#6707)
  • Fix TRT-LLM worker SSH crash in non-root containers (#6694)
  • Replace broken archive docs URLs in release-artifacts (#6722)
  • Always emit PVC block in operator ConfigMap when checkpoint storage type is PVC (#6752)
  • Mypy type fixes (#6730)
  • Handle missing out_hidden_size for LLaVA models in EPD encode worker (#6759)
  • Decouple Helm chart from runtime cluster state for helm template and GitOps compatibility (#6754)
  • Add embedding transfer implementation with NIXL WRITE initiation (#6651)
  • disagg_planner.yaml using new planner CLI (#6760)
  • Autogenerate Helm chart README (#6774)
  • Fix vLLM disaggregated cancellation tests (#6758)
  • Add weight: 1 to EPP config plugins (#6756)
  • Fix vLLM KVBM integration test (#6768)
  • Update container image to standard vllm-runtime tag (#6781)
  • Fix E + PD multimodal flow in TensorRT-LLM (#6726)
  • Auto-scale request count in benchmarks (#6777)
  • Pass through extra args in TensorRT-LLM agg.sh launch script (#6787)
  • Update vLLM processor for vLLM 0.16 (#6799)
  • Allow v1alpha1 DGDR creation by fixing webhook version matching and backend enum (#6803)
  • Added retries for docker pull, removed unused docker-tag-push GH action (#6804)
  • Plumb expected_osl through scheduler queue path (#6812)
  • Pass in device_id to KVBM PinnedAllocator instead of hardcoding to 0 (#6809)
  • Update NIXL version to 0.10.0 (#6789)
  • Properly setup and register vLLM worker for external/hybrid load balancing (#6695)
  • Lychee cache — don't cache failures (#6824)
  • Fix docs README skill links (#6856)
  • vLLM processor works with stream_interval > 1 (#6816)
  • Refactor profiler's DGD generation workflow to correctly generate mocker config (#6848)
  • Exclude NIXL .so files from ai-dynamo-runtime wheel (#6430)
  • Remove deprecated beam_width from TensorRT-LLM health check payload (#6879)
  • Strip None args when profiler generating configs (#6882)
  • Remove empty multimodal input to avoid invalid UUID check in vLLM (#6853)
  • Add missing --kv-transfer-config to disagg_router.yaml (#6897)
  • Support TRT-LLM 1.3 apply_mm_hashes API (#6810)
  • TRT-LLM multimodal preprocessor — revert to old default_multimodal_input_loader for embeddings (#6840)
  • Resolve socket UUIDs via CUDA driver API (#6891)
  • Revert changes in autogen files (#6908)
  • Skip encoder LLM creation for unsupported models in TensorRT-LLM (#6866)
  • Preserve reasoning content when tool-call starts mid-chunk (#6902)
  • Revert accidental change of trtllm/multimodal_processor (#6921)
  • Fix operator race condition (#6929)
  • Make KVBM respect CUDA_VISIBLE_DEVICES for NUMA binding (#6931)
  • Enforce min_endpoint flag in Planner (#6637)
  • Extend LoRA download S3 timeout and stream large LoRA downloads to disk (#6544)
  • Fix call to normalize_finish_reason on OmniHandler for main (#6910)
  • Add missing UCX/NIXL native libraries to SGLang runtime (#6939)
  • Validate initial uptime metric parsing in integration test (#6943)
  • Reject prompts exceeding max_seq_len with HTTP 400 (#6635)
  • Efficiently fill dummy data to bypass vLLM preprocessor in E/P/D (#6968)
  • Propagate tolerations and cap auto-discovered GPUs (#6947)
  • TRT-LLM multimodal preprocessor — remove default_multimodal_input_loader from embedding paths (#6924)
  • Remove aws-ofi-nccl plugin from linker cache in regular TensorRT-LLM runtime image (#6944)
  • Restore --enforce-disagg to reject requests before prefill router activates (#6957)
  • Move some router logs to DYN_LOG=debug (#6994)
  • Remove llm → mocker crate dependency, move config (#6998)
  • Accept assistant output_text messages without id/status in /v1/responses input (#6599)
  • Guard SGLang/vLLM memory occupation control endpoints (#6967)
  • Kimi K2.5 — tiktoken incomplete multi-byte sequence handling with regression tests (#6996)
  • Pass checkpointPath to restore (#6941)
  • Hide inactive models from /v1/models (#6966)
  • Populate GIT_COMMIT_SHA envvar in containers built in CI (#7001)
  • Use checked arithmetic in TwoPartCodec to prevent integer overflow (#6959)
  • Remove outdated deploy/discovery ex...

Dynamo v1.0.1

16 Mar 20:58
5534a9d


Release Notes

Summary

Dynamo v1.0.1 is a patch release to Dynamo v1.0.0 with critical bug fixes and expanded model support. Key fixes resolve a TensorRT-LLM startup crash on CUDA 13.1 caused by a cutlass-dsl packaging mismatch, and restore OpenAI API logprobs compliance where bytes and token fields were not populated when routing through Dynamo Frontend. This release also enables experimental Kimi K2.5 model support by adding DeepSeek V3 architecture tokenizer handling and fixing tiktoken multi-byte streaming panics.

Base Branch: release/1.0.0

Bug Fixes

  • TensorRT-LLM CUDA 13.1 Startup Crash: Fixed startup crash where a cutlass-dsl stub crashed the TensorRT-LLM import chain on CUDA 13.1. This was a known issue in v1.0.0 that blocked MoE models on Blackwell GPUs, including the Qwen3-235B-A22B-FP8 TensorRT-LLM recipe. (#7393)
  • Kimi K2.5 Tokenizer and Streaming: Added DeepSeek V3 architecture support for Kimi's BPE tokenizer pattern and fixed worker thread panic on incomplete multi-byte sequences during streaming inference. Enables serving nvidia/Kimi-K2.5-NVFP4 and other Kimi K2.5 models that use the DeepSeek V3 model_type with a tiktoken tokenizer. (#7424)
  • OpenAI Logprobs Fields: Fixed bytes and token fields in logprobs responses always returning None/empty when routing through Dynamo Frontend. Affected vLLM backend; direct backend queries returned correct values but requests routed through Frontend did not. (#7404)
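The logprobs fix is easy to verify client-side. A minimal sketch, assuming a parsed OpenAI-style chat-completion response (the `choice` shape follows the OpenAI API; the helper name is ours, not part of Dynamo):

```python
def check_logprobs(choice: dict) -> None:
    """Verify the logprobs fields restored by #7404 are populated.

    `choice` is assumed to be one parsed element of the "choices" array
    of an OpenAI-style chat-completion response."""
    for entry in choice["logprobs"]["content"]:
        if not entry.get("token"):
            raise ValueError("token field is empty")
        if entry.get("bytes") is None:
            raise ValueError("bytes field is None")
```

Running this against a response routed through the Frontend should now pass where it previously raised on every entry.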

Dynamo Release v1.0.0

13 Mar 20:57
b1818dc


Release Notes

Dynamo v1.0.0 is the first major release of the open-source distributed inference platform. This release delivers production-grade disaggregated serving with comprehensive multimodal and omni-model support, KV cache optimizations, improved handling of agentic workloads, Kubernetes-native deployment at scale, and a stabilized public API.

Summary

Multimodal & Diffusion

Dynamo now serves a range of generative modalities—text, image, and video—across all three major inference frameworks. Text-to-image generation is available through both vLLM Omni and SGLang image diffusion pipelines, and text-to-video through SGLang, vLLM Omni, and TensorRT-LLM Wan T2V, with experimental MJPEG streaming for real-time video output. Encoder disaggregation matured with a new EncoderCacheManager and content-addressed hashing, enabling multimodal encoder outputs to be cached and reused across workers. Embedding transfer between workers uses NIXL to minimize latency, and multimodal-aware KV cache routing places requests based on media content for better cache hit rates.

Agents

Dynamo added building blocks for agentic workloads: agent hints in the API, priority scheduling, and (experimental) KV cache retention and lifecycle awareness for long agent sessions. Dynamo expanded its agentic capabilities with reasoning content management for DeepSeek v3.2, GLM-4.7, and Kimi-2.5—including interleaved thinking support where reasoning and tool calls alternate within a single response. New tool call parsers for GLM-4.7, MiniMax-M2, and Kimi K2/K2.5 broaden the set of models that can drive tool-use workflows. Agentic frameworks that target OpenAI or Anthropic can now connect to Dynamo directly via new /v1/responses and /v1/messages endpoints, removing the need for adapter layers. Guided decoding now enforces JSON schema constraints on model output across vLLM and TensorRT-LLM, ensuring tool calls and function arguments are always valid structured data.

Unified Configuration & Public API Stabilization

All backends (SGLang, TensorRT-LLM, vLLM) and core components (Frontend, Router, Planner) migrated from fragmented argparse flags to a typed, modular configuration system with validated base classes. The public Python API was streamlined—deprecated types like Component, Namespace, and CancellationToken were removed, and endpoint methods were consolidated. These changes make the SDK smaller, more consistent, and easier to maintain.
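The typed-config pattern can be illustrated with a toy sketch (the class and field names below are hypothetical, not Dynamo's actual config classes): values are validated once at construction, instead of via scattered argparse checks.

```python
from dataclasses import dataclass

@dataclass
class RouterConfig:  # hypothetical name, illustrating the pattern only
    kv_overlap_score_weight: float = 1.0
    ttl_secs: int = 120

    def __post_init__(self):
        # Validation lives with the config, not in each CLI entry point.
        if self.kv_overlap_score_weight < 0:
            raise ValueError("kv_overlap_score_weight must be >= 0")
        if self.ttl_secs <= 0:
            raise ValueError("ttl_secs must be > 0")
```

A bad value fails fast at startup (`RouterConfig(ttl_secs=0)` raises) rather than surfacing later as a runtime misbehavior.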

See Breaking Changes for migration details.

Kubernetes Production Readiness

Dynamo Operator matured with a v1beta1 DynamoGraphDeploymentRequest API (Preview in Dynamo v1.0.0), config versioning via ConfigMap injection, GPU auto-discovery migrated from Profiler to Operator, rolling updates for DGD worker deployments, and simplified CRD management. The EPP component introduced a decomposed pipeline for supporting Inference Gateway-based routing with pod-level traffic management. LoRA support expanded with routing-aware adapter placement, memory-aware allocation, and multimodal LoRA with Kubernetes deployment examples. Multiple new Kubernetes deployment recipes were added, including Kimi-K2.5, Qwen3-VL-30B-A3B-FP8, and Nemotron-3-Super-FP8.

Performance & Reliability

Dynamo Snapshot (Preview in Dynamo v1.0.0) enables fast GPU worker recovery via a portable DaemonSet using CRIU and cuda-checkpoint, now extended to SGLang. The Dynamo Planner now adds a load-based scaling approach, and a new GlobalPlanner mode (Preview in Dynamo v1.0.0) that provides cross-deployment autoscaling for multiple models or deployments backing an endpoint. Observability was overhauled with standardized dynamo_router_* metrics, engine-level Prometheus metrics, OTel tracing for routing, and more robust Grafana dashboards.

Under the Hood

Two posts on the Dynamo Dev Blog give a closer look at some of the problems we've worked on:

  1. Flash Indexer: Inter-Galactic KV Routing traces six iterations of data structure design—from a Python dictionary to a concurrent positional index with jump search. The result: the Dynamo Router sustains 170M ops/s—42x faster than what we shipped in Dynamo v0.1.0 and enough to handle planetary-scale inference workloads (we think).
  2. Full-Stack Optimizations for Agentic Inference tackles the visibility gap between agent harnesses and inference stacks. Claude Code and Codex know what's urgent—but the inference engines handling the workloads didn't, until now. The new nvext.agent_hints API lets harnesses pass scheduling priority, cache retention, and speculative prefill hints directly to the engine.
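A request carrying agent hints might be constructed as follows. This is a hedged sketch: the `nvext.agent_hints` field is from the release notes, but the individual key names (`priority`, `kv_retention_secs`, `speculative_prefill`) are illustrative assumptions—consult the agent_hints documentation for the real schema.

```python
import json

# Illustrative OpenAI-style chat request with agent hints attached via
# the nvext extension field. Key names under "agent_hints" are assumed.
payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Run the next tool step."}],
    "nvext": {
        "agent_hints": {
            "priority": "high",           # scheduling priority
            "kv_retention_secs": 600,     # keep KV cache for the session
            "speculative_prefill": True,  # prefetch the likely next turn
        }
    },
}
body = json.dumps(payload)  # send as the POST body to /v1/chat/completions
```

Because the hints ride in an extension field, harnesses that don't set them remain fully compatible.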

Open-Source Contributions

Between v0.9.0 and v1.0.0, we merged over 700 commits from over 90 contributors — 34 first-time contributors and 19 external contributors from 12 organizations.

First-Time External Contributors

  • @devivasudevan (Microsoft) contributed a PR that adds Azure AKS storage guidance for Dynamo caches (#5581).
  • @maljazaery (Microsoft) contributed a PR that clarifies DGDSA creation for services is disabled by default (#6389).
  • @dsocek (Intel) contributed a PR that improves multimodal disaggregation reliability (#5895).
  • @muskansh-google (Google) contributed a PR that updates build commands for the Dynamo + SGLang container (#5908).
  • @InfraWhisperer (F5) contributed a PR that fixes a frontend crash when using the TRT-LLM runtime image (#6481).
  • @Kaonael (Gcore) contributed a PR that adds a status state enum to DynamoGraphDeployment for improved lifecycle tracking (#6324).
  • @Ryan-Amirthan (Fern) contributed a PR that adds standard NVIDIA Fern styling assets to the documentation site (#6148).
  • @bledden (Facilitair) contributed a PR that forwards stream_options through the multimodal request pipeline (#6474).
  • @advpropsys (WhiteCircle.ai) contributed a PR that reduces NATS consumer inactive threshold from 1 hour to 2 minutes to prevent stale connections (#5861).
  • @luc-hiverge (Hiverge) contributed a PR that fixes first token creation signal timing by emitting the signal after sleeping (#5681).
  • @orangeng contributed a PR that fixes the service name in port-forward documentation (#5527).
  • @huitianbai contributed a PR that limits bootstrap room ID range to 0–2^63-1 to prevent overflow (#6277).

First-Time NVIDIA Contributors

  • @knowicki-nvidia contributed a PR that adds image diffusion and text-to-image support for the SGLang backend (#5609).
  • @akshatha-k contributed a PR that restructures KVBM documentation into a three-tier format (#5905).
  • @alexanderbilk contributed a PR that adds a Prometheus port for NIXL telemetry metrics (#5567).
  • @rwipfelnv contributed a PR that adds Grafana dashboard and monitoring setup for observability (#4639).
  • @mikwieczorek contributed a PR that fixes TRT-LLM recipe component type from "main" to "worker" (#5788).
  • @jpohl-nv contributed a PR that adds experimental MJPEG video streaming via /v1/videos/stream (#6487).
  • @rafiw contributed a PR that adds Triton path environment variables to the vLLM runtime Dockerfile (#6401).

Returning External Contributors: @michaelfeil (Baseten), @vladnosiv (Yandex.Cloud), @Jont828 (Microsoft), @ashnamehrotra (Microsoft), @ls-2018, @AmeenP (PrimeIntellect), @kerthcet (InftyAI/Hiverge).

If you would like to get involved, please see our Contribution Guide.


Breaking Changes

ACTION REQUIRED: The following changes require updates to your code, configuration, or deployment manifests before upgrading.

CLI Flags and Environment Variables

  • KV Router Flags Renamed (#6361): All KV router CLI flags and env vars now use the --router-* / DYN_ROUTER_* prefix.
Old Flag / Env Var → New Flag / Env Var:
  • --kv-events / DYN_KV_EVENTS → --router-kv-events / DYN_ROUTER_USE_KV_EVENTS
  • --kv-overlap-score-weight / DYN_KV_OVERLAP_SCORE_WEIGHT → --router-kv-overlap-score-weight / DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT
  • --assume-kv-reuse / DYN_ASSUME_KV_REUSE → --router-assume-kv-reuse / DYN_ROUTER_ASSUME_KV_REUSE
  • --durable-kv-events / DYN_DURABLE_KV_EVENTS → --router-durable-kv-events / DYN_ROUTER_DURABLE_KV_EVENTS
  • --track-active-blocks / DYN_TRACK_ACTIVE_BLOCKS → --router-track-active-blocks / DYN_ROUTER_TRACK_ACTIVE_BLOCKS
  • --track-output-blocks → --router-track-output-blocks
  • --router-ttl / DYN_ROUTER_TTL → --router-ttl-secs / DYN_ROUTER_TTL_SECS

Migrate: Update all CLI invocations, env vars, and deployment YAMLs to use the new names.
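The flag portion of the migration can be scripted. A hedged helper, assuming the flags appear in plain-text launch scripts or YAML (env vars need the analogous `DYN_*` → `DYN_ROUTER_*` rename, which this sketch does not cover):

```shell
# Rewrite old router flag names to the new --router-* prefix.
# Flag list taken from the release notes; adjust for the flags you use.
migrate_router_flags() {
  sed -e 's/--kv-events/--router-kv-events/g' \
      -e 's/--kv-overlap-score-weight/--router-kv-overlap-score-weight/g' \
      -e 's/--assume-kv-reuse/--router-assume-kv-reuse/g' \
      -e 's/--durable-kv-events/--router-durable-kv-events/g' \
      -e 's/--track-active-blocks/--router-track-active-blocks/g' \
      -e 's/--track-output-blocks/--router-track-output-blocks/g' \
      -e 's/--router-ttl /--router-ttl-secs /g'
}

# Example: pipe a launch command through the helper.
echo '--kv-events --router-ttl 120' | migrate_router_flags
```

Review the output before committing it—substring renames like this can touch comments or unrelated strings.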

  • Disagg Flag Inverted (#6515): --enforce-disagg replaced by --decode-fallback with inverted semantics — disaggregated mode is now enforced by default.

    Migrate: Replace --enforce-disagg with --decode-fallback. If you need fallback to aggregated mode, explicitly pass --decode-fallback or DYN_DECODE_FALLBACK=true. In the EPP plugin, update from DYN_ENFORCE_DISAGG to DYN_DECODE_FALLBACK with inverted boolean.

  • Migration Limit Moved to Frontend (#5918): The --migration-limit CLI flag has been removed from all backend workers (vLLM, SGLang, TRT-LLM) and is now set on the Frontend only.

    Migrate: Remove --migration-limit from backend launch commands; pass it to the Frontend instead.
    ...


Dynamo Release v0.9.1

04 Mar 19:45
ebcbd61


Dynamo v0.9.1

Release Notes

Summary

Dynamo 0.9.1 is a patch release that upgrades TensorRT-LLM from v1.3.0rc1 to v1.3.0rc3 and removes a KVBM workaround that is no longer needed with the upgraded TRT-LLM version.

Base Branch: release/0.9.0

Version Upgrades

  • TensorRT-LLM v1.3.0rc3: Upgraded TensorRT-LLM from v1.3.0rc1 to v1.3.0rc3 across version pins. This update includes upstream bug fixes and performance improvements (#6402).

Bug Fixes

  • pydantic-settings Compatibility: Pinned pydantic-settings<2.13.0 to fix TypeError: DynamicYamlWithDeepMergeSettingsSource._read_files() got an unexpected keyword argument 'deep_merge' error that occurred with pydantic-settings v2.13.0+ in TRT-LLM autodeploy tests. This affects the legacy Dockerfile.trtllm build path used in release/0.9.1 (#6402).

  • KVBM Workaround Removal: Reverted the KVBM disaggregated serving workaround since TRT-LLM v1.3.0rc3 includes the upstream fix (TRT-LLM #11247). This re-enables TRT-LLM+KVBM tests and removes legacy workaround code (#6495).

Known Issues

For known issues in this release, refer to the Known Issues section in the Dynamo v0.9.0 Release Notes.

Dynamo v0.9.0

12 Feb 03:46
76c1889


Dynamo v0.9.0 Release Notes

Summary

Dynamo v0.9.0 completes the infrastructure decoupling started in v0.8.0, expands multimodal and diffusion model support across all three backends, and introduces smarter scheduling with predictive load estimation and routing hints.

Infrastructure Modernization

The new Event Plane—built on high-performance ZMQ transport with MessagePack serialization—joins the Discovery Plane and Request Plane to form a fully decoupled communication architecture. Dynamo deployments no longer require NATS or etcd: Kubernetes-native service discovery replaces etcd, KV router queries run over the native Dynamo endpoint instead of NATS, and the Event Plane provides a transport-agnostic pub/sub layer for system events. These changes simplify deployment topology and reduce operational dependencies.

Multimodal & Diffusion

Dynamo expanded multimodal support across all three backends in this release. Encoder disaggregation is now available for both vLLM (via the Embedding Cache connector) and TRT-LLM (via a standalone encoder), allowing encoding to run on a separate GPU from prefill/decode. Dynamo can now serve multimodal SGLang workloads on a single GPU instead of requiring a full E/PD split. We also added first-class support for diffusion-based language models — LLaDA2.0 can now be served alongside autoregressive models in the same Dynamo deployment.

Scheduling Intelligence

Router gained output block tracking with fractional decay for predictive load estimation, expected output token awareness, and support for routing hints from external orchestrators like Kubernetes Gateway API Inference Extension (GAIE). The Planner added Kalman filter and mooncake-style warmup for more accurate load prediction, along with SLA-driven autoscaling for MoE DEP/TEP configurations. The Profiler was enhanced with PVC model cache support and model name validation.
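Kalman-filter load smoothing can be sketched with a generic 1-D filter. This is not Dynamo's actual Planner implementation—the class name and the `q`/`r` tuning constants are assumptions—but it shows the predict/correct cycle such a predictor runs on each load sample:

```python
class ScalarKalman:
    """Minimal 1-D Kalman filter for smoothing a noisy load signal."""

    def __init__(self, q: float = 1e-3, r: float = 0.5):
        self.q, self.r = q, r        # process / measurement noise (assumed)
        self.x, self.p = 0.0, 1.0    # state estimate and its variance

    def update(self, measurement: float) -> float:
        self.p += self.q                     # predict: uncertainty grows
        gain = self.p / (self.p + self.r)    # Kalman gain
        self.x += gain * (measurement - self.x)  # correct toward measurement
        self.p *= 1.0 - gain                 # correct: uncertainty shrinks
        return self.x

kf = ScalarKalman()
for load in [9.8, 10.2, 10.0, 9.9, 10.1] * 10:
    smoothed = kf.update(load)  # converges near the true load of ~10
```

The filter's advantage over a plain moving average is that the gain adapts: it trusts measurements heavily at startup (warmup) and increasingly trusts its own estimate as variance shrinks.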

Kubernetes & Observability

Operator added rollout restart for DynamoGraphDeployments, observability metrics, tolerations/affinity for GPU-specific scheduling, and improved restart reliability. Distributed tracing now spans the full request path including TCP transport, and the Prometheus metrics stack was simplified with multi-registry scrape support.


First-Time Contributors

We welcome 14 new contributors to the Dynamo project:

  • @siclait contributed a PR that truncates HttpError messages to 8192 characters to prevent ValueError on long messages (#5020).
  • @smatta-star contributed a PR that adds auto-generated OpenAPI spec and helper binary for the frontend (#4802).
  • @shpgy-shpgy contributed a PR that fixes multimodal processing error when handling pure text conversations (#5088).
  • @chay1045 contributed a PR that fixes hidden stop tokens appearing in output by returning None instead (#5238).
  • @wenqiglantz contributed a PR that adds prompt embeds support for pre-computed inference inputs in vLLM (#4739).
  • @yurekami contributed a PR that preserves original model path for frontend config downloads (#5102).
  • @erezzarum contributed a PR that fixes NIXL CUDA12 + CUDA13 build compatibility (#5000).
  • @soodoshll contributed a PR that fixes usage returning None when using text mode with vLLM (#5336).
  • @ls-2018 contributed a PR that fixes tag error handling (#5236).
  • @debermudez contributed a PR that updates aiperf to v0.4.0 (#5331).
  • @wangshangsam contributed a PR that updates vLLM import paths to align with upstream main (#5447).
  • @AbhiOnGithub contributed a PR that adds __all__ exports and __repr__ methods for improved debugging (#5606).
  • @davilu-nvidia contributed a PR that resolves SGLang E/P/D multimodal routing issues (#5500).
  • @adityapuranik99 contributed a PR that adds cupy-cuda12x to SGLang extras for CUDA compatibility (#5627).

Major Features & Improvements

Infrastructure Modernization

Discovery Plane

  • K8s-Native Service Discovery: Enabled Kubernetes-based discovery in GAIE and updated Helm charts/RBAC to support etcd-less deployments, allowing Kubernetes users to deploy without running a separate etcd cluster (#5303, #5432, #5364).
  • etcd Reliability: Resolved potential deadlocks in legacy etcd usage and updated examples to run without etcd, ensuring stable startup for users still on etcd-based discovery (#5091, #5422).
  • List-and-Watch Diffing: Resolved diffing logic issue where worker metadata updates (e.g., LoRA adapter additions) were not picked up, causing stale routing decisions (#5318).

Request Plane

  • NATS Dependency Removal: Migrated KV router worker queries to the native Dynamo endpoint to reduce NATS traffic (#5451), made NATS optional for KV-aware routing in approximate mode so local development works without a NATS server (#5237), fixed NATS container startup failure caused by invalid --max_payload CLI flag by moving it to config file (#5384), and cleaned up asymmetric request plane configuration in launch scripts (#5245).

Event Plane

  • Event Plane Architecture: Introduced a transport-agnostic Event Plane with MessagePack serialization and auto-discovery, decoupling system events (KV cache transfers, notifications) from direct NATS dependency. Added high-performance ZMQ transport as a scalable alternative for latency-sensitive event channels while preserving NATS for backward compatibility (#5674, #5614, #5624).
  • Event Plane NATS Init: Corrected NATS initialization logic based on --event-plane argument across all backends, preventing silent failures when NATS is not configured (#5750).
  • ZMQ Transport Timeout: Added receive timeout for ZMQ transport to prevent indefinite hangs when a publisher is unavailable (#5804).
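The transport-agnostic design above can be sketched as a small pub/sub abstraction. All names here are illustrative, and JSON stands in for MessagePack so the sketch stays dependency-free; a real ZMQ or NATS transport would implement the same two-method interface:

```python
import json
from collections import defaultdict
from typing import Callable

class InProcTransport:
    """Stand-in transport; ZMQ or NATS transports would share this API."""

    def __init__(self):
        self._subs = defaultdict(list)

    def publish(self, topic: str, payload: bytes) -> None:
        for handler in self._subs[topic]:
            handler(payload)

    def subscribe(self, topic: str, handler: Callable[[bytes], None]) -> None:
        self._subs[topic].append(handler)

class EventPlane:
    """Serializes events and hands bytes to whichever transport is plugged in."""

    def __init__(self, transport):
        self.transport = transport

    def publish(self, topic: str, event: dict) -> None:
        self.transport.publish(topic, json.dumps(event).encode())

    def subscribe(self, topic: str, callback) -> None:
        self.transport.subscribe(
            topic, lambda raw: callback(json.loads(raw.decode())))
```

Because publishers and subscribers only see topics and dicts, swapping the transport (or the serializer) requires no changes to event producers or consumers.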

Networking

  • IPv6 Support: Added IPv6 support for SGLang disaggregation with proper address formatting, enabling deployments on IPv6-only networks (#5521).

Multimodal & Diffusion

SGLang

  • Aggregated Multimodal: Enabled Dynamo to serve multimodal SGLang workloads on a single GPU, removing the previous requirement for a 2-GPU E/PD split (#5450).
  • Diffusion LM Support: Enabled Dynamo to serve diffusion-based language models (LLaDA2.0) through the SGLang backend, using existing Dynamo infrastructure for pre/post processing with a new diffusion handler (#5533).
  • Multi-Image Qwen EC: Resolved multi-image bug in the Dynamo EC connector that dropped images beyond the first in multimodal requests (#5514).

TensorRT-LLM

  • Standalone Encoder: Added encoder disaggregation support to Dynamo's TRT-LLM integration, enabling encoding to run on a separate GPU from prefill/decode (#4668).
  • Multimodal Tokenizer Reuse: Optimized Dynamo's multimodal request pipeline for TRT-LLM by reusing the tokenizer across requests instead of reinitializing per request, reducing per-request latency (#5217).

vLLM

  • Embedding Cache Connector: Added the Embedding Cache (EC) connector to Dynamo's vLLM integration for encoder disaggregation, where the encoder stores embeddings by hash and PD workers consume them from cache—eliminating redundant encoding and reducing TTFT. Also enabled multiple image inputs per request and parallelized image loading (#5162, #5463, #5444).
  • Prompt Embeds Support: Added pre-computed embeddings as a secure input method to Dynamo, allowing applications to transform sensitive data into embeddings before submission for improved privacy and flexible prompt engineering (#4739).
  • EPD Refactor: Refactored Dynamo's EPD handler to orchestrate the full encode-to-PD flow (processor → encoder → processor → PD), supporting multiple multimodal data items per request instead of just one (#4994).
  • Decode Worker Qwen-VL: Resolved disaggregated decode crash for Qwen2.5-VL models caused by missing image_grid_thw data needed for mRoPE position encoding (#5281).
  • EPD Sampling Params: Corrected sampling params parsing in Dynamo's vLLM EPD flow that could silently produce incorrect generation parameters (#5833).
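The store-by-hash idea behind the EC connector can be sketched as a content-addressed cache (all names below are illustrative, not Dynamo's API): the encoder stores embeddings keyed by a hash of the raw media bytes, and prefill/decode workers look them up instead of re-encoding.

```python
import hashlib
from typing import List, Optional

class EmbeddingCache:
    """Content-addressed store: identical media bytes hit the same entry."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(media: bytes) -> str:
        return hashlib.sha256(media).hexdigest()

    def put(self, media: bytes, embedding: List[float]) -> str:
        k = self.key(media)
        self._store[k] = embedding   # written by the encoder worker
        return k

    def get(self, media: bytes) -> Optional[List[float]]:
        return self._store.get(self.key(media))  # read by PD workers
```

Hashing the content (rather than, say, a URL) means the same image reused across requests or workers is encoded exactly once.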

Performance & Hardware

  • SGLang Stream Output: Enforced stream_output=True in SGLang ServerArgs, switching from cumulative-to-delta token conversion to direct disjoint segment passthrough—reducing per-token processing overhead in streaming responses (#5510).
  • Multimodal Payload Optimization: Removed serialization/deserialization in gather_multi_model_data, significantly reducing latency for requests with large base64-encoded payloads (#5485).
  • Zero Copy TCP Decoder: Implemented zero copy decoder with bounded worker pool for TCP ingress, eliminating memory leaks under high concurrency and reducing per-message allocations (#5376).
  • MoE Data Parallel Tuning: Reduced VLLM_MOE_DP_CHUNK_SIZE to 384, lowering HBM footprint enough to enable inference on 16xH200 MoE configurations that previously hit OOM (#5307).
  • TRT-LLM GB200 Support: Resolved memory allocation failure on GB200 hardware (#5328) and updated the Wide-EP disaggregated GB200 recipe for compatibility with latest TRT-LLM version (#5383).
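The cumulative-to-delta conversion that the stream-output change removes can be sketched as follows (illustrative, stdlib only): each streamed chunk carries all tokens generated so far, and the consumer must slice off only the new suffix—per-token work that direct disjoint-segment passthrough avoids.

```python
def cumulative_to_delta(chunks):
    """Yield only the newly generated tokens from cumulative chunks."""
    emitted = 0
    for tokens in chunks:
        yield tokens[emitted:]   # new suffix since the last chunk
        emitted = len(tokens)
```

With passthrough, the engine already emits the disjoint segments, so this extra slicing (and the bookkeeping around it) disappears from the hot path.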

Router

  • Router Scheduling Intelligence: Added output block tracking with fractional decay for predictive load estimation (#5452), plumbed expected output tokens so the router can account for generation length when distributing requests (#5181), and added a flag to disable decode KV reuse assumption so the router computes actual block hashes for more accurate cache-hit predictions (#5350).
  • Routing Hints from Headers: Added support for reading routing hints from request headers, allowing external orchestrators (e.g., GAIE) to influence routing decisions without modifying the request body (#5502).
  • PrefillComplete Hook: Implemented PrefillComplete handling in Dynamo EPP Scor...

Dynamo v0.8.1

23 Jan 07:37
5ea7ff0


Dynamo v0.8.1 Release Notes

Summary

Dynamo 0.8.1 is a patch release that adds profiler enhancements for Kubernetes deployments and fixes bugs affecting SGLang and worker identification. This release adds support for mounting model cache PVCs to profiler pods, fixes YAML configuration parsing for boolean flags in SGLang, resolves container build issues for CUDA 13 SGLang environments, and corrects a pod hash calculation issue that could affect worker identification in Kubernetes.

Base Branch: release/0.8.0

Major Features & Improvements

Kubernetes Deployment

  • Profiler Model Cache PVC Support: Added ability to mount model cache PVCs to profiler pods when specified in DynamoGraphDeploymentRequest, enabling profilers to access pre-downloaded model weights without re-downloading (#5212).

Bug Fixes

  • SGLang YAML Config Parsing: Fixed YAML config parsing for store_true arguments (e.g., trust-remote-code, enable-metrics) that were incorrectly converted to --flag true instead of just --flag, breaking boolean configuration options (#5513).
  • SGLang CUDA 13 Container Build: Fixed NVIDIA package installation in the SGLang CUDA 13 container to install CuDNN 9.16+ based on CUDA version, resolving PyTorch 2.9.1 compatibility issues with nn.Conv3d that caused performance degradation and excessive memory usage in multimodal workloads (#5461).
  • Worker ID Precision Loss: Fixed routing failures caused by f64 precision loss when worker/instance IDs exceeded 2^53, which caused approximately half of workers in large deployments to be unreachable for KV cache routing decisions (#5471).
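The precision-loss bug can be demonstrated in two lines: an IEEE 754 double has a 53-bit mantissa, so integer IDs above 2^53 collide once they pass through a float (as happens when IDs transit JSON or any f64 field).

```python
# 2**53 + 1 is not representable as a double: it rounds back to 2**53.
big_id = 2**53 + 1
assert float(big_id) == float(2**53)   # two distinct IDs, one float value
assert int(float(big_id)) != big_id    # the round trip loses the ID
```

Keeping IDs as integers (or strings) end to end, as the fix does, avoids the collision entirely.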

Documentation

  • DGDR SLA Profiler Compatibility: Documented that DynamoGraphDeploymentRequest profiling configurations using camelCase field names and model cache PVC options require Dynamo 0.8.1 or later (#5492).

Known Issues

For known issues in this release, refer to the Known Issues section in the Dynamo v0.8.0 Release Notes.