Speculative decoding draft model fails during load due to memory allocation constraints

I am trying to get kimi-k2.6 nvfp4 to work on a single GB200 node. I've been migrating over to instanttensor with the following settings:
```
"INSTANTTENSOR_BACKEND=URING",
"INSTANTTENSOR_DEBUG=1",
"INSTANTTENSOR_CONCURRENCY=16",
"INSTANTTENSOR_MAX_FREE_MEM_USAGE=0.5",
"INSTANTTENSOR_CHUNK_SIZE=4194304"
```

The initial model loads fine, but loading the draft model fails. I've tried setting `INSTANTTENSOR_MAX_FREE_MEM_USAGE` as low as `0.3`, but that didn't help. Any ideas what I could do? Since draft models tend to be quite small, I don't necessarily need the full throughput of instanttensor for them, but for the full model it would still be nice not to have to adjust things just so the draft model loads correctly.
I feel like this might be something that could be handled on vLLM's side, by allowing both a model-loader and a draft-model-loader, but I thought I'd report it here in case there are any ideas.
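
To make the suggestion a bit more concrete, here is a minimal sketch of what per-role loader selection could look like. Everything in it is hypothetical: `LoaderChoice`, `pick_load_format`, the `draft_load_format` field, and the format strings `"instanttensor"` and `"safetensors"` are illustrative names I made up for this example, not existing vLLM options.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoaderChoice:
    """Hypothetical config: one load format for the main model,
    plus an optional override used only for the draft model."""
    load_format: str
    draft_load_format: Optional[str] = None


def pick_load_format(choice: LoaderChoice, is_draft: bool) -> str:
    """Return the load format for the given model role.

    The draft model falls back to the main format when no override is set.
    """
    if is_draft and choice.draft_load_format is not None:
        return choice.draft_load_format
    return choice.load_format


# Hypothetical usage: fast loader for the big model, plain loader for the draft.
choice = LoaderChoice(load_format="instanttensor", draft_load_format="safetensors")
print(pick_load_format(choice, is_draft=False))
print(pick_load_format(choice, is_draft=True))
```

With something like this, the large model would keep the fast path while the small draft model could use a loader that does not need a separate device staging buffer.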


Stacktrace here:

```
2026-06-10T19:21:11.635592530Z
(Worker_TP1 pid=1037) [InstantTensor][DEBUG] Using backend URING
2026-06-10T19:21:11.636084757Z
(Worker_TP3 pid=1039) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py:1193: RuntimeWarning: Shrink io_depth from 128 to 3 due to memory limit.
2026-06-10T19:21:11.636084757Z
(Worker_TP3 pid=1039)   with instanttensor.safe_open(
2026-06-10T19:21:11.636091477Z
(Worker_TP1 pid=1037) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py:1193: RuntimeWarning: Shrink io_depth from 128 to 3 due to memory limit.
2026-06-10T19:21:11.636097653Z
(Worker_TP1 pid=1037)   with instanttensor.safe_open(
2026-06-10T19:21:11.636101493Z
(Worker_TP0 pid=1036) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py:1193: RuntimeWarning: Shrink io_depth from 128 to 3 due to memory limit.
2026-06-10T19:21:11.636101493Z
(Worker_TP3 pid=1039) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py:1193: RuntimeWarning: Shrink concurrency from 16 to 3 due to memory limit.
2026-06-10T19:21:11.636101493Z
(Worker_TP3 pid=1039)   with instanttensor.safe_open(
2026-06-10T19:21:11.636106261Z
(Worker_TP0 pid=1036)   with instanttensor.safe_open(
2026-06-10T19:21:11.636116533Z
(Worker_TP1 pid=1037) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py:1193: RuntimeWarning: Shrink concurrency from 16 to 3 due to memory limit.
2026-06-10T19:21:11.636116533Z
(Worker_TP1 pid=1037)   with instanttensor.safe_open(
2026-06-10T19:21:11.636120309Z
(Worker_TP0 pid=1036) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py:1193: RuntimeWarning: Shrink concurrency from 16 to 3 due to memory limit.
2026-06-10T19:21:11.636123797Z
(Worker_TP0 pid=1036)   with instanttensor.safe_open(
2026-06-10T19:21:11.636186646Z
(Worker_TP2 pid=1038) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py:1193: RuntimeWarning: Shrink io_depth from 128 to 3 due to memory limit.
2026-06-10T19:21:11.636192182Z
(Worker_TP2 pid=1038)   with instanttensor.safe_open(
2026-06-10T19:21:11.636225942Z
(Worker_TP2 pid=1038) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py:1193: RuntimeWarning: Shrink concurrency from 16 to 3 due to memory limit.
2026-06-10T19:21:11.636225942Z
(Worker_TP2 pid=1038)   with instanttensor.safe_open(
2026-06-10T19:21:11.896240962Z
csrc/loader_common.cpp:79 'cudaMalloc(&this->device_buffer, this->buffer_size)' -> 2:out of memory
2026-06-10T19:21:11.896271490Z
Loader thread exception: std::exception
2026-06-10T19:21:11.896705669Z
terminate called after throwing an instance of 'std::exception'
2026-06-10T19:21:11.896716325Z
  what():  std::exception
2026-06-10T19:21:11.907786158Z
csrc/loader_common.cpp:79 'cudaMalloc(&this->device_buffer, this->buffer_size)' -> 2:out of memory
2026-06-10T19:21:11.907813518Z
Loader thread exception: std::exception
2026-06-10T19:21:11.907825326Z
terminate called after throwing an instance of 'std::exception'
2026-06-10T19:21:11.907829326Z
  what():  std::exception
2026-06-10T19:21:11.907874350Z
csrc/loader_common.cpp:79 'cudaMalloc(&this->device_buffer, this->buffer_size)' -> 2:out of memory
2026-06-10T19:21:11.907927215Z
Loader thread exception: std::exception
2026-06-10T19:21:11.907927215Z
terminate called after throwing an instance of 'std::exception'
2026-06-10T19:21:11.907927215Z
  what():  std::exception
2026-06-10T19:21:11.909171927Z
csrc/loader_common.cpp:79 'cudaMalloc(&this->device_buffer, this->buffer_size)' -> 2:out of memory
2026-06-10T19:21:11.909234199Z
Loader thread exception: std::exception
2026-06-10T19:21:11.909272888Z
terminate called after throwing an instance of 'std::exception'
2026-06-10T19:21:11.909272888Z
  what():  std::exception
2026-06-10T19:21:13.205642972Z
(EngineCore pid=816) ERROR 06-10 19:21:13 [core.py:1165] EngineCore failed to start.
2026-06-10T19:21:13.205642972Z
(EngineCore pid=816) ERROR 06-10 19:21:13 [core.py:1165] Traceback (most recent call last):
2026-06-10T19:21:13.205642972Z
(EngineCore pid=816) ERROR 06-10 19:21:13 [core.py:1165]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1139, in run_engine_core
```
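
In case it helps with triage, here is a small sketch that pulls the `Shrink ... from X to Y` warnings out of a log like the one above and relates them to the configured chunk size. The buffer estimate assumes the staging buffer scales as `io_depth * chunk_size`; that is a guess from the option names, not something I have confirmed in the instanttensor source. The helper names are my own.

```python
import re

# Matches warnings such as:
#   "RuntimeWarning: Shrink io_depth from 128 to 3 due to memory limit."
SHRINK_RE = re.compile(r"Shrink (\w+) from (\d+) to (\d+) due to memory limit")


def summarize_shrinks(log_text: str) -> dict[str, tuple[int, int]]:
    """Map each shrunk setting to (requested, effective) as reported in the log."""
    result: dict[str, tuple[int, int]] = {}
    for name, requested, effective in SHRINK_RE.findall(log_text):
        result[name] = (int(requested), int(effective))
    return result


def estimated_buffer_mib(io_depth: int, chunk_size_bytes: int) -> float:
    """ASSUMPTION: buffer size is io_depth * chunk_size (not confirmed)."""
    return io_depth * chunk_size_bytes / (1024 * 1024)


sample = (
    "RuntimeWarning: Shrink io_depth from 128 to 3 due to memory limit.\n"
    "RuntimeWarning: Shrink concurrency from 16 to 3 due to memory limit.\n"
)
shrinks = summarize_shrinks(sample)
chunk_size = 4194304  # INSTANTTENSOR_CHUNK_SIZE from the settings above

print(shrinks)
print(estimated_buffer_mib(shrinks["io_depth"][0], chunk_size))  # requested depth
print(estimated_buffer_mib(shrinks["io_depth"][1], chunk_size))  # effective depth
```

If that assumption holds, the loader had already backed off from roughly 512 MiB to roughly 12 MiB per worker and `cudaMalloc` still failed, which might mean that almost no device memory was free by the time the draft model started loading.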
