Authors
@tmonty12 @nvrohanv @athreesh
Area
router
Summary
This document proposes a topology-aware routing mechanism for Dynamo that restricts KV-cache transfers between prefill and decode workers based on cluster topology domains defined in Grove. When the KV router selects a decode worker for a disaggregated request, it filters candidates to those in the same topology domain (e.g., a user-defined zone or rack) as the prefill worker, preventing cross-domain KV transfers that degrade TTFT.
The design spans three systems: Grove injects resolved topology labels onto worker pods after scheduling, Dynamo workers publish those labels as part of their runtime metadata, and the KV router uses the metadata to enforce a topology filter during decode worker selection. In the DGD v1beta1 API, the constraint is declared on the frontend component under kvCacheTransferTopology.
Motivation
Background
Grove introduces a ClusterTopology CRD that lets cluster administrators define topology levels by mapping logical domains (e.g., zone, rack) to Kubernetes node labels:
apiVersion: grove.io/v1alpha1
kind: ClusterTopology
metadata:
  name: my-cluster-topology
spec:
  levels: # Ordered broadest → narrowest
    - domain: zone
      labelKey: topology.kubernetes.io/zone # Maps to node label
    - domain: rack
      labelKey: nvidia.com/rack
    - domain: host
      labelKey: kubernetes.io/hostname
The Dynamo operator already uses these for pod scheduling placement via SpecTopologyConstraint (deployment-level) and TopologyConstraint (service-level).
These constraints control where pods are placed during scheduling. However, pods might still be scheduled across different racks or even zones due to capacity limitations, and the router has no knowledge of where each worker pod ends up being scheduled.
The Problem
In disaggregated serving, a prefill worker produces KV-cache blocks that must be transferred to a decode worker. This transfer is latency-sensitive — crossing an AZ boundary, for example, adds significant network latency.
The router currently receives a list of prefill and decode workers but has no awareness of where these workers are located in relation to each other. When the router selects a decode worker for the prefill worker to transfer KV cache to, it has no way to know that the decode worker might be in a different topology domain (rack, AZ, etc.).
Without topology awareness at the router level, users must deploy separate DGDs per topology domain to prevent cross-domain KV transfers — each with its own frontend, prefill pool, and decode pool.
Costs of the Multi-DGD Workaround
- Config and update duplication — every field (model args, env vars, affinity rules, probes) is duplicated across DGDs, so every update must be propagated to all of them, which is cumbersome
- Cross-DGD health and observability — gaining visibility into the overall system requires building an additional aggregation layer on top of all the DGDs
- Fixed capacity partitioning — Capacity is statically split across domains; one domain cannot borrow capacity from another
- No cross-domain failover — If a domain loses workers, its frontend has no fallback; it cannot route to another domain's workers
- Scaling is per-DGD — Autoscaling must be configured and tuned independently for each DGD
- External gateway required — Users must configure a higher-level gateway (e.g., load balancer, ingress) to distribute traffic across the per-domain DGDs. This gateway has no KV-cache awareness — it cannot factor in prefix cache hit rates or worker load when routing, losing the optimization benefits of the KV router
Goals
- Restrict KV-cache transfers to within a specified topology domain
- Integrate with Grove ClusterTopology for topology label injection
- Allow a single DGD to span multiple topology domains
- Maintain full backward compatibility — existing deployments without topology configuration must behave identically
- Support a configurable mismatch policy (fail vs. fallback) when no same-domain decode workers are available
Non-Goals
- General-purpose application topology interface APIs/library in Grove (future work)
- Topology distance-based routing or cost-weighted soft preferences (future work)
- Constraining prefill worker selection by topology (only decode is constrained)
Proposal
Four components need changes. Each is described below with the relevant code locations.
1. Pod Topology Injection (Grove + Operator)
Two systems cooperate to get topology information into worker containers:
1a. Grove controller patches pod labels post-scheduling
After a pod is scheduled to a node, the Grove controller observes the pod, reads the scheduled node's labels, resolves topology domains from the referenced ClusterTopology CR, and patches the pod with labels for each topology level:
grove.io/topology.zone=us-east-1a
grove.io/topology.rack=rack1
There is a potential race condition: a container may start and attempt to read the labels from the Downward API volume (step 2) before Grove has patched them onto the pod. To prevent this race we propose:
- Since Grove has already introduced an init-container for startup ordering, it should be extended to wait for the pod to receive its topology labels from the node.
With this guarantee from Grove, step 2 can assume that the topology labels exist in the volume when the runtime starts.
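As a rough sketch, and assuming the Downward API label files update in place once Grove patches the pod, the extended init container could poll the projected files (the name, image, and wait loop below are illustrative, not Grove's actual init container):

initContainers:
  - name: wait-for-topology-labels # illustrative name
    image: busybox:1.36
    command: ["sh", "-c"]
    args:
      - |
        # Poll until the projected label file is non-empty; Downward API
        # volume files refresh after the pod's labels are patched.
        until [ -s /etc/podinfo/topology/zone ]; do
          echo "waiting for grove.io/topology.* labels"; sleep 2
        done
    volumeMounts:
      - name: topology
        mountPath: /etc/podinfo/topology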
1b. Operator projects labels into containers via Downward API volume files
The Dynamo operator adds a Downward API volume to worker pod specs that projects the Grove topology labels as files for the runtime to consume.
For example, inside the container:
/etc/podinfo/topology/zone → "us-east-1a"
/etc/podinfo/topology/rack → "rack-01"
/etc/podinfo/topology/block → "block-3"
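A minimal sketch of the volume the operator could generate (the volume and container names are illustrative; Downward API volume items can project pod labels as files):

volumes:
  - name: topology
    downwardAPI:
      items:
        - path: zone
          fieldRef:
            fieldPath: metadata.labels['grove.io/topology.zone']
        - path: rack
          fieldRef:
            fieldPath: metadata.labels['grove.io/topology.rack']
containers:
  - name: worker # illustrative
    volumeMounts:
      - name: topology
        mountPath: /etc/podinfo/topology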
2. Worker Metadata Publication (Dynamo)
When the worker starts up:
- The operator injects DYN_TOPOLOGY_ENABLED=true, which tells the worker to read the topology labels from the Downward API volume at startup.
- If the worker does not find a file value for the level named by DYN_ROUTER_KV_TRANSFER_TOPOLOGY_LEVEL after a retry period, it hard exits with an error.
- The worker publishes the topology labels as a HashMap on the ModelRuntimeConfig for the router to consume.
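A minimal Rust sketch of the metadata extension (the field and serde attributes follow the Backward Compatibility section below; the reading helper is illustrative — the proposal places the shared utility in Python at components/src/dynamo/common/topology.py):

use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Sketch of the proposed addition to ModelRuntimeConfig.
#[derive(serde::Serialize, serde::Deserialize, Default)]
pub struct ModelRuntimeConfig {
    // ... existing fields elided ...
    /// Topology domain -> value, e.g. {"zone": "us-east-1a", "rack": "rack-01"}.
    #[serde(default, skip_serializing_if = "HashMap::is_empty")]
    pub topology_domains: HashMap<String, String>,
}

/// Read one file per topology level from the Downward API mount
/// (illustrative Rust equivalent of the proposed read_topology_domains()).
pub fn read_topology_domains(dir: &Path) -> io::Result<HashMap<String, String>> {
    let mut domains = HashMap::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let level = entry.file_name().to_string_lossy().into_owned();
        let value = fs::read_to_string(entry.path())?.trim().to_string();
        if !value.is_empty() {
            domains.insert(level, value);
        }
    }
    Ok(domains)
}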
3. KV Router Filtering
The KvRouterConfig would receive the following fields:
- kv_transfer_topology_level (read from env var DYN_ROUTER_KV_TRANSFER_TOPOLOGY_LEVEL, injected by the operator): the topology domain to constrain KV-cache transfers within
- kv_transfer_mismatch_policy (read from env var DYN_ROUTER_KV_TRANSFER_MISMATCH_POLICY): an enum with values:
  a. fail: return an error when no decode workers match the prefill worker’s topology domain
  b. fallback: allow a cross-domain transfer if no decode workers exist in the prefill worker’s topology domain
- When the KV router is selecting a decode worker for a given prefill worker, it filters the decode workers to only those in the same topology domain. If none exist, the configured mismatch policy (a or b above) applies.
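A hedged sketch of the filtering step, under the names in this proposal (the Worker and RouteError types here are illustrative stand-ins for the router's real worker handles):

/// Illustrative types — the real router uses its own worker handles.
pub struct Worker {
    pub id: u64,
    /// Value of the configured topology level, e.g. "us-east-1a".
    pub topology_domain: Option<String>,
}

pub enum MismatchPolicy {
    Fail,
    Fallback,
}

#[derive(Debug)]
pub enum RouteError {
    NoSameDomainDecodeWorkers,
}

/// Filter decode candidates to the prefill worker's topology domain.
/// topology_affinity carries the prefill worker's domain value on the
/// decode SchedulingRequest (see Phase 3); None means no filtering.
pub fn filter_by_topology<'a>(
    candidates: &'a [Worker],
    topology_affinity: Option<&str>,
    policy: MismatchPolicy,
) -> Result<Vec<&'a Worker>, RouteError> {
    let Some(domain) = topology_affinity else {
        // kv_transfer_topology_level unset: identical to current behavior.
        return Ok(candidates.iter().collect());
    };
    let same_domain: Vec<&'a Worker> = candidates
        .iter()
        .filter(|w| w.topology_domain.as_deref() == Some(domain))
        .collect();
    if !same_domain.is_empty() {
        // Workers missing the configured label are excluded here.
        return Ok(same_domain);
    }
    match policy {
        MismatchPolicy::Fail => Err(RouteError::NoSameDomainDecodeWorkers),
        // In practice: log a warning before routing cross-domain.
        MismatchPolicy::Fallback => Ok(candidates.iter().collect()),
    }
}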
4. DGD Spec Extension (v1beta1)
File: deploy/operator/api/v1beta1/common.go
The KV-cache transfer topology constraint is placed as a field on the frontend component in v1beta1.
The ClusterTopology reference is inherited from spec.topologyConstraint.clusterTopologyName (already required for scheduling placement), so only the routing-specific fields (level and mismatchPolicy) are needed on the component:
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: my-llm-disagg
spec:
  topologyConstraint:
    clusterTopologyName: my-cluster-topology
  components:
    - name: frontend
      type: frontend
      replicas: 4
      kvCacheTransferTopology:
        level: zone # Which topology level to enforce
        mismatchPolicy: fail # "fail" or "fallback"
    - name: prefill-worker
      type: prefill
      replicas: 8
    - name: decode-worker
      type: decode
      replicas: 8
The operator translates this into:
- Frontend pods — sets the DYN_ROUTER_KV_TRANSFER_TOPOLOGY_LEVEL and DYN_ROUTER_KV_TRANSFER_MISMATCH_POLICY env vars (router config)
- Worker pods — adds the Downward API volume that projects the Grove topology labels as files
- Validation — ensures the referenced ClusterTopology CR exists and the specified level is a valid domain in the hierarchy
- Component type validation — kvCacheTransferTopology is only valid on type: frontend components; the webhook rejects it on other component types
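For example, on the frontend pod spec the injected router config would look like this (values mirror the example DGD above):

env:
  - name: DYN_ROUTER_KV_TRANSFER_TOPOLOGY_LEVEL
    value: zone
  - name: DYN_ROUTER_KV_TRANSFER_MISMATCH_POLICY
    value: fail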
Failure Handling
Admission-time errors (webhook):
If we end up supporting an explicit label field (see the Considerations section) and a label is defined, these admission-time validations would not apply; the Grove and ClusterTopology validation applies only when a label is not defined.
| Scenario | Behavior |
| --- | --- |
| Grove not installed and kvCacheTransferTopology specified | Webhook rejects the DGD — the operator cannot read the ClusterTopology CR referenced by clusterTopologyName |
| ClusterTopology CR not found | Webhook rejects the DGD (same validation path) |
| level references a domain not in ClusterTopology | Webhook rejects the DGD — the domain must exist in the CR's spec.levels |
Worker startup:
| Scenario | Behavior |
| --- | --- |
| Topology volume files not present | DYN_TOPOLOGY_ENABLED=true is set, but no file value is present for the level named by DYN_ROUTER_KV_TRANSFER_TOPOLOGY_LEVEL after a retry period — the worker hard exits |
Runtime routing — mismatchPolicy semantics:
The default for mismatchPolicy will be fail — constraining KV transfers to the same topology domain is a hard constraint, so it makes sense to fail a request when it cannot be satisfied.
| Scenario | fail (default) | fallback |
| --- | --- | --- |
| No same-domain decode workers, other-domain workers exist | Return error to client | Log warning, route to any available decode worker |
| Worker missing the topology label for the configured domain | Excluded from selection | Excluded when same-domain workers exist; eligible as fallback when none do |
| kv_transfer_topology_level unset | No topology filtering, identical to current behavior | No topology filtering, identical to current behavior |
| At prefill worker selection — a prefill worker has no decode workers available in its domain | Prefer prefill workers that have decode workers in the same domain, then fall back to all other prefill workers (covers the window in which a decode worker can join) | Same as fail |
Backward Compatibility
- kv_transfer_topology_level defaults to None -> no filtering
- WorkerConfigLike::topology_domain() has a default None implementation
- topology_domains on ModelRuntimeConfig defaults to empty and is marked skip_serializing_if = "HashMap::is_empty" -> zero wire overhead for existing workers
- DGDs without kvCacheTransferTopology on the frontend component behave identically to today
Implementation Phases
Phase 1 — Worker Metadata (Dynamo)
- Add topology_domains: HashMap<String, String> to ModelRuntimeConfig
- Add topology_domain() default method to WorkerConfigLike trait
- Implement topology_domain() on ModelRuntimeConfig
- Add shared read_topology_domains() utility (components/src/dynamo/common/topology.py)
- Wire into vllm/sglang/trtllm backends: read files at startup, set on runtime_config
- Verify topology metadata flows through RuntimeConfigWatch to the router
Phase 2 — Router Filtering (Dynamo)
- Add kv_transfer_topology_level and kv_transfer_mismatch_policy to KvRouterConfig
- Add topology_affinity to SchedulingRequest
- Add topology filter to DefaultWorkerSelector::select_worker()
- Implement fallback logic
Phase 3 — Prefill->Decode Propagation (Dynamo)
- After prefill selection, look up worker's topology domain
- Inject topology_affinity into decode SchedulingRequest
Phase 4 — Grove + Operator Integration
- Grove controller patches grove.io/topology.* labels onto pods post-scheduling
- Operator adds EnsureTopologyVolume(): generates Downward API volume projecting one file per topology level under /etc/podinfo/topology/
- DGD v1beta1 spec extended with kvCacheTransferTopology on frontend components
- Operator sets router config env vars on frontend pods
- Operator validates topology level exists in ClusterTopology CR at DGD creation
Considerations
Placing kvCacheTransferTopology under experimental
The v1beta1 ExperimentalSpec block was considered as an initial home for kvCacheTransferTopology. Features under experimental carry an explicit warning that they "may change in breaking ways between v1beta1 releases, including disappearing without a name-preserving graduation path." Given the API's extensibility (the kvTransferWeight addition for soft topology routing) and confidence in its maturity, it was decided not to place the field in the experimental block.
Explicit label field — operator-managed node-to-pod label copy
An optional label field on kvCacheTransferTopology would let users specify a raw pod label key directly, bypassing the Grove ClusterTopology entirely:
kvCacheTransferTopology:
  level: zone
  label: topology.kubernetes.io/zone # Pod label key to match on
  mismatchPolicy: fail
When label is set, the operator copies the specified node label onto worker pods after scheduling. The rest of the pipeline (worker reads file → topology_domains → router filter) is unchanged.
When label is omitted, the Grove-based path is used (as described in the main proposal).
Pros:
- Would not require a Grove dependency for hard topology constraints, and would be complementary to the Grove-based path.
- Zone labels are almost always present. topology.kubernetes.io/zone is a well-known Kubernetes label populated automatically by all major cloud providers.
- Same downstream plumbing.
Cons:
- Operator gains a new reconciler responsibility. The operator would have to watch pod scheduling and propagate node labels to pods, duplicating logic that would otherwise be added to Grove.
- No hierarchy or validation. The label field is a raw string — no ClusterTopology CR to validate against, no hierarchy ordering for future soft scoring, no normalization of provider-specific label keys.
- Single level only. The label field supports one topology level per configuration. Multi-level soft scoring (future work) requires the full ClusterTopology hierarchy.
Worker self-discovery without Grove
An alternative to the Grove-based approach is having workers query the Kubernetes node API directly at startup to read their own node's labels and populate topology_domains without any Grove dependency.
Difficulties with this approach:
- No domain-to-label mapping. Without the ClusterTopology CR, the worker has no way to know that "zone" maps to topology.kubernetes.io/zone or that "rack" maps to nvidia.com/rack.
- No hierarchy definition. The ClusterTopology defines an ordered hierarchy (zone > rack > host). Without it, the router has no concept of which domains are broader or narrower — needed for the soft scoring extension.
- RBAC expansion. Every worker service account needs get access to the nodes API.
- No admission-time validation. The ClusterTopology enables the webhook to reject a DGD at creation time if the referenced topology level doesn't exist.
- No single source of truth. The topology hierarchy is defined once in the ClusterTopology CR and referenced by all DGDs. Without it, each DGD must independently carry the label-to-domain mapping, creating duplication and drift risk.
Extending Grove’s ClusterTopology to support mapping topology labels from node to pod is a natural continuation of how it is already leveraged for topology-aware scheduling in DGDs today, and it provides a clear separation of concerns.
Future Work
KV Transfer Cost Scoring (Soft Topology Preferences)
The hard filter is binary — candidates are either in the same domain or rejected. A natural extension is to add a topology penalty to the decode cost function that models the actual KV transfer cost, allowing the router to prefer closer workers without strictly excluding distant ones.
The decode logit would gain a transfer cost term:
decode_logit = decode_blocks + kv_transfer_weight * (latency + kv_bytes / bandwidth)
Where latency and bandwidth come from a per-domain QoS table and kv_bytes is derived from the request's token count (isl_tokens * kv_bytes_per_token).
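A sketch of that cost term in Rust (parameter names and units are illustrative; the QoS-table values would be supplied per domain pair):

/// Sketch of the decode logit with the soft transfer-cost term.
/// latency_s and bandwidth_bps would come from the per-domain QoS table.
pub fn decode_logit(
    decode_blocks: f64,
    kv_transfer_weight: f64,
    latency_s: f64,
    bandwidth_bps: f64,
    isl_tokens: u64,
    kv_bytes_per_token: u64,
) -> f64 {
    let kv_bytes = (isl_tokens * kv_bytes_per_token) as f64;
    decode_blocks + kv_transfer_weight * (latency_s + kv_bytes / bandwidth_bps)
}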
This composes with the hard filter — hard filter removes cross-zone candidates, soft scoring ranks the survivors within the zone. The QoS table could be sourced from the ClusterTopology CR (extended with per-level networkQoS fields), a ConfigMap, or DGD-level configuration. The DGD API extends naturally — kvTransferWeight is an additive field on the existing kvCacheTransferTopology struct:
components:
  - name: frontend
    type: frontend
    kvCacheTransferTopology:
      level: zone # Hard: never cross this boundary
      mismatchPolicy: fail
      kvTransferWeight: 50.0 # Soft: penalty multiplier for distance within boundary
Other Future Directions
- General application topology library — Reusable Grove library for topology queries beyond Dynamo
- Prefill topology constraint — Optional constraint on prefill selection to prefer same-AZ as the frontend pod (minimize client->prefill latency)
- Multi-pool routing — deployment topologies with multiple prefill/decode pools (for different context lengths, for example). The transfer topology constraint should be applicable across pools, since the same underlying concern remains.
- Multi-cluster routing — a single cluster can only go so far under constrained capacity. While cross-cluster transfers are generally not recommended due to latency overhead, recent work by Moonshot AI on Prefill-as-a-Service demonstrates that transfers across data centers can be feasible. Dynamo’s multi-cluster support should also be able to encode transfer constraints at the cross-cluster level.
Alternate Solutions
No response
Requirements
No response
References
No response