Authors
@tmonty12 @nvrohanv @athreesh
Area
router
Summary
This document proposes a topology-aware routing mechanism for Dynamo that restricts KV-cache transfers between prefill and decode workers based on cluster topology domains defined in Grove. When the KV router selects a decode worker for a disaggregated request, it filters candidates to those in the same topology domain (e.g., a user-defined zone or rack) as the prefill worker, preventing cross-domain KV transfers that degrade TTFT.
The design spans three systems: Grove injects resolved topology labels onto worker pods after scheduling, Dynamo workers publish those labels as part of their runtime metadata, and the KV router uses the metadata to enforce a topology filter during decode worker selection. In the DGD v1beta1 API, the constraint is declared on the frontend component under kvCacheTransferTopology.
Motivation
Background
Grove introduces a ClusterTopology CRD that lets cluster administrators define topology levels by mapping logical domains (e.g., zone, rack) to Kubernetes node labels:
apiVersion: grove.io/v1alpha1
kind: ClusterTopology
metadata:
  name: my-cluster-topology
spec:
  levels: # Ordered broadest → narrowest
    - domain: zone
      labelKey: topology.kubernetes.io/zone # Maps to node label
    - domain: rack
      labelKey: nvidia.com/rack
    - domain: host
      labelKey: kubernetes.io/hostname
The Dynamo operator already uses these for pod scheduling placement via SpecTopologyConstraint (deployment-level) and TopologyConstraint (service-level).
These constraints control where pods are placed during scheduling. However, pods might still be scheduled across different racks or even zones due to capacity limitations, and the router has no knowledge of where each worker pod ends up being scheduled.
The Problem
In disaggregated serving, a prefill worker produces KV-cache blocks that must be transferred to a decode worker. This transfer is latency-sensitive — crossing an AZ boundary, for example, adds significant network latency.
The router currently receives a list of prefill and decode workers but has no awareness of where these workers are located in relation to each other. When the router selects a decode worker for the prefill worker to transfer KV cache to, it has no way to know that the decode worker might be in a different topology domain (rack, AZ, etc.).
Without topology awareness at the router level, users must deploy separate DGDs per topology domain to prevent cross-domain KV transfers — each with its own frontend, prefill pool, and decode pool.
Costs of the Multi-DGD Workaround
- Config and update duplication — every field (model args, env vars, affinity rules, probes) is duplicated across DGDs, so every update must be propagated to all of them, which is cumbersome
- Cross-DGD health and observability — gaining visibility into the overall system requires building an additional aggregation layer on top of all the DGDs
- Fixed capacity partitioning — Capacity is statically split across domains; one domain cannot borrow capacity from another
- No cross-domain failover — If a domain loses workers, its frontend has no fallback; it cannot route to another domain's workers
- Scaling is per-DGD — Autoscaling must be configured and tuned independently for each DGD
- External gateway required — Users must configure a higher-level gateway (e.g., load balancer, ingress) to distribute traffic across the per-domain DGDs. This gateway has no KV-cache awareness — it cannot factor in prefix cache hit rates or worker load when routing, losing the optimization benefits of the KV router
Goals
- Restrict KV-cache transfers to within a specified topology domain
- Integrate with Grove ClusterTopology for topology label injection
- Allow a single DGD to span multiple topology domains
- Maintain full backward compatibility — existing deployments without topology configuration must behave identically
- Support a configurable mismatch policy (fail vs. fallback) when no same-domain decode workers are available
Non-Goals
- General-purpose application topology interface APIs/library in Grove (future work)
- Topology distance-based routing or cost-weighted soft preferences (future work)
- Constraining prefill worker selection by topology (only decode is constrained)
Proposal
Four components need changes. Each is described below with the relevant code locations.
1. Pod Topology Injection (Grove + Operator)
Two systems cooperate to get topology information into worker containers:
1a. Grove controller patches pod labels post-scheduling
After a pod is scheduled to a node, the Grove controller observes the pod, reads the scheduled node's labels, resolves topology domains from the referenced ClusterTopology CR, and patches the pod with labels for each topology level:
grove.io/topology.zone=us-east-1a
grove.io/topology.rack=rack1
There is a potential race condition: a container may start and attempt to read the labels from the Downward API volume (step 2) before Grove has patched them onto the pod. To prevent this race we propose:
- Since Grove has already introduced an init-container for startup ordering, it should be extended to wait for the pod to receive its topology labels from the node.
With this guarantee from Grove, step 2 can assume that the topology labels exist in the volume when the runtime starts.
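As a rough sketch, and assuming the Downward API label files update in place once Grove patches the pod, the extended init container could poll the projected files (the name, image, and wait loop below are illustrative, not Grove's actual init container):

initContainers:
  - name: wait-for-topology-labels # illustrative name
    image: busybox:1.36
    command: ["sh", "-c"]
    args:
      - |
        # Poll until the projected label file is non-empty; Downward API
        # volume files refresh after the pod's labels are patched.
        until [ -s /etc/podinfo/topology/zone ]; do
          echo "waiting for grove.io/topology.* labels"; sleep 2
        done
    volumeMounts:
      - name: topology
        mountPath: /etc/podinfo/topology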
1b. Operator projects labels into containers via Downward API volume files
The Dynamo operator adds a Downward API volume to worker pod specs that projects the Grove topology labels as files for the runtime to consume.
For example, inside the container:
/etc/podinfo/topology/zone → "us-east-1a"
/etc/podinfo/topology/rack → "rack-01"
/etc/podinfo/topology/block → "block-3"
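A minimal sketch of the volume the operator could generate (the volume and container names are illustrative; Downward API volume items can project pod labels as files):

volumes:
  - name: topology
    downwardAPI:
      items:
        - path: zone
          fieldRef:
            fieldPath: metadata.labels['grove.io/topology.zone']
        - path: rack
          fieldRef:
            fieldPath: metadata.labels['grove.io/topology.rack']
containers:
  - name: worker # illustrative
    volumeMounts:
      - name: topology
        mountPath: /etc/podinfo/topology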
2. Worker Metadata Publication (Dynamo)
When the worker starts up:
- The operator injects DYN_TOPOLOGY_ENABLED=true, which tells the worker to read the topology labels from the Downward API volume at startup.
- If the worker does not find a file value for the level named by DYN_ROUTER_KV_TRANSFER_TOPOLOGY_LEVEL after a retry period, it hard exits with an error.
- The worker publishes the topology labels as a HashMap on the ModelRuntimeConfig for the router to consume.
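A minimal Rust sketch of the metadata extension (the field and serde attributes follow the Backward Compatibility section below; the reading helper is illustrative — the proposal places the shared utility in Python at components/src/dynamo/common/topology.py):

use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Sketch of the proposed addition to ModelRuntimeConfig.
#[derive(serde::Serialize, serde::Deserialize, Default)]
pub struct ModelRuntimeConfig {
    // ... existing fields elided ...
    /// Topology domain -> value, e.g. {"zone": "us-east-1a", "rack": "rack-01"}.
    #[serde(default, skip_serializing_if = "HashMap::is_empty")]
    pub topology_domains: HashMap<String, String>,
}

/// Read one file per topology level from the Downward API mount
/// (illustrative Rust equivalent of the proposed read_topology_domains()).
pub fn read_topology_domains(dir: &Path) -> io::Result<HashMap<String, String>> {
    let mut domains = HashMap::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let level = entry.file_name().to_string_lossy().into_owned();
        let value = fs::read_to_string(entry.path())?.trim().to_string();
        if !value.is_empty() {
            domains.insert(level, value);
        }
    }
    Ok(domains)
}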
3. KV Router Filtering
The KvRouterConfig would receive the following fields:
- kv_transfer_topology_level (read from env var DYN_ROUTER_KV_TRANSFER_TOPOLOGY_LEVEL, injected by the operator): the topology domain to constrain KV-cache transfers within
- kv_transfer_mismatch_policy (read from env var DYN_ROUTER_KV_TRANSFER_MISMATCH_POLICY): an enum with values:
  a. fail: return an error when no decode workers match the prefill worker’s topology domain
  b. fallback: allow a cross-domain transfer if no decode workers exist in the prefill worker’s topology domain
- When the KV router is selecting a decode worker for a given prefill worker, it filters the decode workers to only those in the same topology domain. If none exist, the configured mismatch policy (a or b above) applies.
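A hedged sketch of the filtering step, under the names in this proposal (the Worker and RouteError types here are illustrative stand-ins for the router's real worker handles):

/// Illustrative types — the real router uses its own worker handles.
pub struct Worker {
    pub id: u64,
    /// Value of the configured topology level, e.g. "us-east-1a".
    pub topology_domain: Option<String>,
}

pub enum MismatchPolicy {
    Fail,
    Fallback,
}

#[derive(Debug)]
pub enum RouteError {
    NoSameDomainDecodeWorkers,
}

/// Filter decode candidates to the prefill worker's topology domain.
/// topology_affinity carries the prefill worker's domain value on the
/// decode SchedulingRequest (see Phase 3); None means no filtering.
pub fn filter_by_topology<'a>(
    candidates: &'a [Worker],
    topology_affinity: Option<&str>,
    policy: MismatchPolicy,
) -> Result<Vec<&'a Worker>, RouteError> {
    let Some(domain) = topology_affinity else {
        // kv_transfer_topology_level unset: identical to current behavior.
        return Ok(candidates.iter().collect());
    };
    let same_domain: Vec<&'a Worker> = candidates
        .iter()
        .filter(|w| w.topology_domain.as_deref() == Some(domain))
        .collect();
    if !same_domain.is_empty() {
        // Workers missing the configured label are excluded here.
        return Ok(same_domain);
    }
    match policy {
        MismatchPolicy::Fail => Err(RouteError::NoSameDomainDecodeWorkers),
        // In practice: log a warning before routing cross-domain.
        MismatchPolicy::Fallback => Ok(candidates.iter().collect()),
    }
}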
4. DGD Spec Extension (v1beta1)
File: deploy/operator/api/v1beta1/common.go
The KV-cache transfer topology constraint is placed as a field on the frontend component in v1beta1.
The ClusterTopology reference is inherited from spec.topologyConstraint.clusterTopologyName (already required for scheduling placement), so only the routing-specific fields (level and mismatchPolicy) are needed on the component:
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: my-llm-disagg
spec:
  topologyConstraint:
    clusterTopologyName: my-cluster-topology
  components:
    - name: frontend
      type: frontend
      replicas: 4
      kvCacheTransferTopology:
        level: zone # Which topology level to enforce
        mismatchPolicy: fail # "fail" or "fallback"
    - name: prefill-worker
      type: prefill
      replicas: 8
    - name: decode-worker
      type: decode
      replicas: 8
The operator translates this into:
- Frontend pods — sets the DYN_ROUTER_KV_TRANSFER_TOPOLOGY_LEVEL and DYN_ROUTER_KV_TRANSFER_MISMATCH_POLICY env vars (router config)
- Worker pods — adds the Downward API volume that projects the Grove topology labels as files
- Validation — ensures the referenced ClusterTopology CR exists and the specified level is a valid domain in the hierarchy
- Component type validation — kvCacheTransferTopology is only valid on type: frontend components; the webhook rejects it on other component types
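For example, on the frontend pod spec the injected router config would look like this (values mirror the example DGD above):

env:
  - name: DYN_ROUTER_KV_TRANSFER_TOPOLOGY_LEVEL
    value: zone
  - name: DYN_ROUTER_KV_TRANSFER_MISMATCH_POLICY
    value: fail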
Failure Handling
Admission-time errors (webhook):
If we end up supporting an explicit label field (see the Considerations section) and a label is defined, these admission-time validations would not apply; the Grove and ClusterTopology validation applies only when a label is not defined.
| Scenario | Behavior |
| --- | --- |
| Grove not installed and kvCacheTransferTopology specified | Webhook rejects the DGD — the operator cannot read the ClusterTopology CR referenced by clusterTopologyName |
| ClusterTopology CR not found | Webhook rejects the DGD (same validation path) |
| level references a domain not in ClusterTopology | Webhook rejects the DGD — the domain must exist in the CR's spec.levels |
Worker startup:
| Scenario | Behavior |
| --- | --- |
| Topology volume files not present | DYN_TOPOLOGY_ENABLED=true is set, but no file value is present for the level named by DYN_ROUTER_KV_TRANSFER_TOPOLOGY_LEVEL after a retry period — the worker hard exits |
Runtime routing — mismatchPolicy semantics:
The default for mismatchPolicy will be fail — constraining KV transfers to the same topology domain is a hard constraint, so it makes sense to fail a request when it cannot be satisfied.
| Scenario | fail (default) | fallback |
| --- | --- | --- |
| No same-domain decode workers, other-domain workers exist | Return error to client | Log warning, route to any available decode worker |
| Worker missing the topology label for the configured domain | Excluded from selection | Excluded when same-domain workers exist; eligible as fallback when none do |
| kv_transfer_topology_level unset | No topology filtering, identical to current behavior | No topology filtering, identical to current behavior |
| At prefill worker selection — a prefill worker has no decode workers available in its domain | Prefer prefill workers that have decode workers in the same domain, then fall back to all other prefill workers (covers the window in which a decode worker can join) | Same as fail |
Backward Compatibility
- kv_transfer_topology_level defaults to None -> no filtering
- WorkerConfigLike::topology_domain() has a default None implementation
- topology_domains on ModelRuntimeConfig defaults to empty and is marked skip_serializing_if = "HashMap::is_empty" -> zero wire overhead for existing workers
- DGDs without kvCacheTransferTopology on the frontend component behave identically to today
Implementation Phases
Phase 1 — Worker Metadata (Dynamo)
- Add topology_domains: HashMap<String, String> to ModelRuntimeConfig
- Add topology_domain() default method to WorkerConfigLike trait
- Implement topology_domain() on ModelRuntimeConfig
- Add shared read_topology_domains() utility (components/src/dynamo/common/topology.py)
- Wire into vllm/sglang/trtllm backends: read files at startup, set on runtime_config
- Verify topology metadata flows through RuntimeConfigWatch to the router
Phase 2 — Router Filtering (Dynamo)
- Add kv_transfer_topology_level and kv_transfer_mismatch_policy to KvRouterConfig
- Add topology_affinity to SchedulingRequest
- Add topology filter to DefaultWorkerSelector::select_worker()
- Implement fallback logic
Phase 3 — Prefill->Decode Propagation (Dynamo)
- After prefill selection, look up worker's topology domain
- Inject topology_affinity into decode SchedulingRequest
Phase 4 — Grove + Operator Integration
- Grove controller patches grove.io/topology.* labels onto pods post-scheduling
- Operator adds EnsureTopologyVolume(): generates Downward API volume projecting one file per topology level under /etc/podinfo/topology/
- DGD v1beta1 spec extended with kvCacheTransferTopology on frontend components
- Operator sets router config env vars on frontend pods
- Operator validates topology level exists in ClusterTopology CR at DGD creation
Considerations
Placing kvCacheTransferTopology under experimental
The v1beta1 ExperimentalSpec block was considered as an initial home for kvCacheTransferTopology. Features under experimental carry an explicit warning that they "may change in breaking ways between v1beta1 releases, including disappearing without a name-preserving graduation path." Given the API's extensibility (the kvTransferWeight addition for soft topology routing) and confidence in its maturity, it was decided not to place the field in the experimental block.
Explicit label field — operator-managed node-to-pod label copy
An optional label field on kvCacheTransferTopology would let users specify a raw pod label key directly, bypassing the Grove ClusterTopology entirely:
kvCacheTransferTopology:
  level: zone
  label: topology.kubernetes.io/zone # Pod label key to match on
  mismatchPolicy: fail
When label is set, the operator copies the specified node label onto worker pods after scheduling. The rest of the pipeline (worker reads file → topology_domains → router filter) is unchanged.
When label is omitted, the Grove-based path is used (as described in the main proposal).
Pros:
- Would not require a Grove dependency for hard topology constraints, and would be complementary to the Grove-based path.
- Zone labels are almost always present. topology.kubernetes.io/zone is a well-known Kubernetes label populated automatically by all major cloud providers.
- Same downstream plumbing.
Cons:
- Operator gains a new reconciler responsibility. The operator would have to watch pod scheduling and propagate node labels to pods, duplicating logic that would otherwise be added to Grove.
- No hierarchy or validation. The label field is a raw string — no ClusterTopology CR to validate against, no hierarchy ordering for future soft scoring, no normalization of provider-specific label keys.
- Single level only. The label field supports one topology level per configuration. Multi-level soft scoring (future work) requires the full ClusterTopology hierarchy.
Worker self-discovery without Grove
An alternative to the Grove-based approach is having workers query the Kubernetes node API directly at startup to read their own node's labels and populate topology_domains without any Grove dependency.
Difficulties with this approach:
- No domain-to-label mapping. Without the ClusterTopology CR, the worker has no way to know that "zone" maps to topology.kubernetes.io/zone or that "rack" maps to nvidia.com/rack.
- No hierarchy definition. The ClusterTopology defines an ordered hierarchy (zone > rack > host). Without it, the router has no concept of which domains are broader or narrower — needed for the soft scoring extension.
- RBAC expansion. Every worker service account needs get access to the nodes API.
- No admission-time validation. The ClusterTopology enables the webhook to reject a DGD at creation time if the referenced topology level doesn't exist.
- No single source of truth. The topology hierarchy is defined once in the ClusterTopology CR and referenced by all DGDs. Without it, each DGD must independently carry the label-to-domain mapping, creating duplication and drift risk.
Extending Grove’s ClusterTopology to support mapping topology labels from node to pod is a natural continuation of how it is already leveraged for topology-aware scheduling in DGDs today, and it provides a clear separation of concerns.
Future Work
KV Transfer Cost Scoring (Soft Topology Preferences)
The hard filter is binary — candidates are either in the same domain or rejected. A natural extension is to add a topology penalty to the decode cost function that models the actual KV transfer cost, allowing the router to prefer closer workers without strictly excluding distant ones.
The decode logit would gain a transfer cost term:
decode_logit = decode_blocks + kv_transfer_weight * (latency + kv_bytes / bandwidth)
Where latency and bandwidth come from a per-domain QoS table and kv_bytes is derived from the request's token count (isl_tokens * kv_bytes_per_token).
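A sketch of that cost term in Rust (parameter names and units are illustrative; the QoS-table values would be supplied per domain pair):

/// Sketch of the decode logit with the soft transfer-cost term.
/// latency_s and bandwidth_bps would come from the per-domain QoS table.
pub fn decode_logit(
    decode_blocks: f64,
    kv_transfer_weight: f64,
    latency_s: f64,
    bandwidth_bps: f64,
    isl_tokens: u64,
    kv_bytes_per_token: u64,
) -> f64 {
    let kv_bytes = (isl_tokens * kv_bytes_per_token) as f64;
    decode_blocks + kv_transfer_weight * (latency_s + kv_bytes / bandwidth_bps)
}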
This composes with the hard filter — hard filter removes cross-zone candidates, soft scoring ranks the survivors within the zone. The QoS table could be sourced from the ClusterTopology CR (extended with per-level networkQoS fields), a ConfigMap, or DGD-level configuration. The DGD API extends naturally — kvTransferWeight is an additive field on the existing kvCacheTransferTopology struct:
components:
  - name: frontend
    type: frontend
    kvCacheTransferTopology:
      level: zone # Hard: never cross this boundary
      mismatchPolicy: fail
      kvTransferWeight: 50.0 # Soft: penalty multiplier for distance within boundary
Other Future Directions
- General application topology library — Reusable Grove library for topology queries beyond Dynamo
- Prefill topology constraint — Optional constraint on prefill selection to prefer same-AZ as the frontend pod (minimize client->prefill latency)
- Multi-pool routing — deployment topologies with multiple prefill/decode pools (for different context lengths, for example). The transfer topology constraint should be applicable across pools, since the same underlying concern remains.
- Multi-cluster routing — a single cluster can only go so far under constrained capacity. While cross-cluster transfers are generally not recommended due to latency overhead, recent work by Moonshot AI on Prefill-as-a-Service demonstrates that transfers across data centers can be feasible. Dynamo’s multi-cluster support should also be able to encode transfer constraints at the cross-cluster level.
Alternate Solutions
No response
Requirements
No response
References
No response