Mosaic is a source-run vLLM fork for diffusion language model serving. It keeps
the Python package name and CLI entry point as vllm, and adds inference paths
for LLaDA, Dream, and LLaDA-MoE.
Paper status: Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming has been accepted to ICML 2026.
This repository is intended to contain code only. Model weights, compiled shared objects, benchmark CSV/JSON reports, server logs, traces, and other runtime artifacts are generated locally and should not be committed.
- vLLM V1 API server with a /generate endpoint for diffusion LLM requests.
- Model adapters for LLaDA, Dream, and LLaDA-MoE.
- CUDA sampling kernels, LLaDA fused ops, optional CUDA VMM activation allocator, and Triton gather/GEMM kernels.
- Unified single-request smoke client.
- Context-length benchmark runner that sweeps linearly to a target length and binary searches after the first failure.
| Family | Expected local directory | Server script | Default port | Mask id |
|---|---|---|---|---|
| LLaDA | llada-8b-instruct | scripts/run_llada_server.sh | 8901 | 126336 |
| Dream | dream-v0-instruct-7b | scripts/run_dream_server.sh | 8701 | 151666 |
| LLaDA-MoE | llada-moe-7b-a1b | scripts/run_llada_moe_server.sh | 10001 | 156895 |
Model weights are not included. Place them under models/ in the repository, or
set environment variables to point to external model directories.
The release path was validated in a Debian 12 container with NVIDIA CUDA GPUs:
| Component | Validated version |
|---|---|
| Python | 3.12.12 |
| PyTorch | 2.7.1+cu126 |
| Transformers | 4.57.3 |
| Triton | 3.3.1 |
| CUDA toolkit | 12.x with nvcc available |
Other recent NVIDIA CUDA environments may work, but the custom CUDA extensions must be rebuilt in the target Python environment.
Clone the repository and enter the release tree:

    git clone https://github.com/flashserve/Mosaic Mosaic-release
    cd Mosaic-release

Create or activate a Python environment. The validated environment uses Python 3.12 and PyTorch 2.7.1 with CUDA 12.x:

    conda create -n mosaic python=3.12 -y
    conda activate mosaic

Install build tools and Python dependencies:

    python -m pip install --upgrade pip
    python -m pip install -r requirements/build.txt
    python -m pip install -r requirements/cuda.txt

If your PyTorch/CUDA stack is already managed by the cluster image or container,
use the versions supplied by that environment and install only the missing Python
packages. The key requirement is that python, torch, and nvcc agree on a
compatible CUDA toolchain.
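One way to check that agreement before building the extensions is to compare the CUDA version PyTorch was built with against the nvcc on PATH. The sketch below is not part of the repository; the helper names are hypothetical, and "same major version" is an assumed compatibility rule rather than a documented requirement.

```python
import re
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'release X.Y' version from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def same_cuda_major(torch_cuda: Optional[str], nvcc_release: Optional[str]) -> bool:
    """Assumed rule: both versions are known and share a major version."""
    if not torch_cuda or not nvcc_release:
        return False
    return torch_cuda.split(".")[0] == nvcc_release.split(".")[0]


def check_toolchain() -> bool:
    """Compare PyTorch's build-time CUDA version with the nvcc found on PATH."""
    import torch  # imported lazily so the helpers above work without PyTorch

    output = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    nvcc_release = parse_nvcc_release(output)
    print(f"torch {torch.__version__} (CUDA {torch.version.cuda}), nvcc release {nvcc_release}")
    return same_cuda_major(torch.version.cuda, nvcc_release)
```

Calling check_toolchain() inside the activated environment prints both versions and returns False when PyTorch is a CPU-only build or the major versions differ.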
Mosaic is validated as a source-run tree. Export PYTHONPATH from the repository
root before starting servers or benchmarks:

    export MOSAIC_HOME="$PWD"
    export PYTHONPATH="$MOSAIC_HOME:$MOSAIC_HOME/flash_sample:$MOSAIC_HOME/vmm_allocator:$MOSAIC_HOME/vllm_add_llada/cuda_kernels:${PYTHONPATH:-}"

The server scripts run python -m vllm.entrypoints.api_server, so make sure the
intended environment is active or put its bin directory first in PATH.

Build the diffusion sampling kernels:

    cd "$MOSAIC_HOME/flash_sample"
    python setup.py build_ext --inplace

Build the LLaDA fused CUDA ops:

    cd "$MOSAIC_HOME/vllm_add_llada/cuda_kernels"
    python setup.py build_ext --inplace

Build the optional CUDA VMM allocator. This allocator is recommended for long LLaDA context benchmarks and is enabled by default in the provided server and benchmark scripts:

    cd "$MOSAIC_HOME/vmm_allocator"
    python setup.py build_ext --inplace

Return to the repository root and verify imports:

    cd "$MOSAIC_HOME"
    python - <<'PY'
    import flash_sample
    from flash_sample import entropy, entropy_witht, low_confidence
    from vllm_add_llada.cuda_kernels import fused_ops
    import vmm_allocator
    print("flash_sample ok", bool(low_confidence), bool(entropy), bool(entropy_witht))
    print("fused_ops ok", fused_ops.__name__)
    print("vmm_allocator ok", vmm_allocator.__name__)
    PY

gather_gemm uses Triton JIT compilation and does not need a separate build
step.
The scripts first look under MOSAIC_MODEL_ROOT, then allow each model family to
be overridden independently.
Recommended layout:

    Mosaic-release/
      models/
        llada-8b-instruct/
        dream-v0-instruct-7b/
        llada-moe-7b-a1b/

Use this layout with:

    export MOSAIC_MODEL_ROOT="$MOSAIC_HOME/models"

Or point to external weight directories:

    export LLADA_MODEL_DIR=/path/to/llada-8b-instruct
    export DREAM_MODEL_DIR=/path/to/dream-v0-instruct-7b
    export LLADA_MOE_MODEL_DIR=/path/to/llada-moe-7b-a1b

MODEL_DIR can be used as a one-off override for any server script:

    MODEL_DIR=/path/to/llada-8b-instruct scripts/run_llada_server.sh

Choose the GPU with CUDA_VISIBLE_DEVICES. The examples below bind to localhost;
set HOST=0.0.0.0 when serving to other machines.
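The model path lookup described above can be sketched in Python to make the precedence explicit. The order used here (MODEL_DIR, then the family-specific variable, then MOSAIC_MODEL_ROOT, then a local models/ directory) is an assumption inferred from the text, and resolve_model_dir is a hypothetical helper; the server scripts remain the authority on the real behavior.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Family-specific override variable and expected directory name, from the table above.
FAMILY_DIRS = {
    "llada": ("LLADA_MODEL_DIR", "llada-8b-instruct"),
    "dream": ("DREAM_MODEL_DIR", "dream-v0-instruct-7b"),
    "llada-moe": ("LLADA_MOE_MODEL_DIR", "llada-moe-7b-a1b"),
}


def resolve_model_dir(family: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the model directory for a family under the assumed precedence."""
    env = os.environ if env is None else env
    family_var, default_name = FAMILY_DIRS[family]
    # Assumed: explicit overrides win, the one-off MODEL_DIR first.
    for var in ("MODEL_DIR", family_var):
        value = env.get(var)
        if value:
            return Path(value)
    # Otherwise fall back to the shared root (assumed default: ./models).
    root = env.get("MOSAIC_MODEL_ROOT") or "models"
    return Path(root) / default_name
```

For example, resolve_model_dir("dream", {"MOSAIC_MODEL_ROOT": "/data/models"}) yields /data/models/dream-v0-instruct-7b under these assumptions.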
Start LLaDA:

    cd "$MOSAIC_HOME"
    CUDA_VISIBLE_DEVICES=0 \
    HOST=127.0.0.1 \
    PORT=8901 \
    MAX_MODEL_LEN=204800 \
    GPU_MEMORY_UTILIZATION=0.92 \
    scripts/run_llada_server.sh

Start Dream:

    cd "$MOSAIC_HOME"
    CUDA_VISIBLE_DEVICES=0 \
    HOST=127.0.0.1 \
    PORT=8701 \
    MAX_MODEL_LEN=4096 \
    scripts/run_dream_server.sh

Start LLaDA-MoE:

    cd "$MOSAIC_HOME"
    CUDA_VISIBLE_DEVICES=0 \
    HOST=127.0.0.1 \
    PORT=10001 \
    MAX_MODEL_LEN=8192 \
    scripts/run_llada_moe_server.sh

Useful runtime overrides:

    export VLLM_USE_VMM=1
    export VLLM_VMM_CHUNK_SIZE=2097152
    export VLLM_USE_CHUNKWISE_GRAPH=1
    export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
    export EXTRA_VLLM_ARGS="--enforce-eager"

With a server running, send a single /generate request:

    python scripts/test_api_request.py \
      --model-family llada \
      --host http://127.0.0.1:8901 \
      --steps 8 \
      --gen-length 32 \
      --block-length 32

Dream:

    python scripts/test_api_request.py \
      --model-family dream \
      --host http://127.0.0.1:8701 \
      --steps 8 \
      --gen-length 32 \
      --block-length 32

LLaDA-MoE:

    python scripts/test_api_request.py \
      --model-family llada-moe \
      --host http://127.0.0.1:10001 \
      --steps 8 \
      --gen-length 32 \
      --block-length 32

The client prints the raw JSON response. It defaults to cfg_scale=0,
temperature=0, and the correct mask id for each model family.
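For readers who want to call the endpoint without the smoke client, the following is a minimal sketch of an equivalent request using only the standard library. It is not the code of scripts/test_api_request.py: build_payload and post_generate are hypothetical names, the field names follow the request payload documented in this README, and setting max_tokens, gen_length, and output_length to the same value is an assumption taken from that sample.

```python
import json
import urllib.request

# Mask ids per model family, from the supported-models table.
MASK_IDS = {"llada": 126336, "dream": 151666, "llada-moe": 156895}


def build_payload(model_family: str, prompt: str, steps: int = 8,
                  gen_length: int = 32, block_length: int = 32) -> dict:
    """Build a /generate payload with the documented client defaults."""
    return {
        "prompt": prompt,
        "max_tokens": gen_length,      # assumed equal to gen_length
        "gen_length": gen_length,
        "output_length": gen_length,   # assumed equal to gen_length
        "steps": steps,
        "block_length": block_length,
        "temperature": 0.0,
        "cfg_scale": 0.0,
        "remasking": "low_confidence",
        "mask_id": MASK_IDS[model_family],
    }


def post_generate(host: str, payload: dict, timeout: float = 600.0) -> dict:
    """POST the payload to <host>/generate and decode the JSON response."""
    request = urllib.request.Request(
        f"{host}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a LLaDA server listening on the default port, post_generate("http://127.0.0.1:8901", build_payload("llada", "Hello")) would send one request; the timeout value is illustrative only.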
The benchmark scripts start a server, wait for /health, send a long /generate
request, parse execute_diffusion_once time = ... ms from the server log, write
CSV/JSON results, stop the server, and continue to the next length.
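The log-parsing step can be illustrated with a small regular expression. This is a sketch, not the benchmark runner's own parser: the README elides the value between "=" and "ms", so the numeric format matched here (an integer or decimal) is an assumption, and parse_diffusion_times_ms is a hypothetical helper.

```python
import re
from typing import List

# Assumed format: "execute_diffusion_once time = <number> ms"
_TIME_RE = re.compile(r"execute_diffusion_once time = ([0-9]+(?:\.[0-9]+)?) ms")


def parse_diffusion_times_ms(log_text: str) -> List[float]:
    """Return every execute_diffusion_once timing found in the log, in order."""
    return [float(match.group(1)) for match in _TIME_RE.finditer(log_text)]


sample_log = (
    "INFO server started\n"
    "INFO execute_diffusion_once time = 1234.5 ms\n"
    "INFO execute_diffusion_once time = 20 ms\n"
)
print(parse_diffusion_times_ms(sample_log))
```

Returning all matches rather than only the last one leaves it to the caller to decide which request a timing belongs to.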
Mosaic benchmarks use online chunk planning. The benchmark runner starts the
server with VLLM_USE_CHUNKWISE_GRAPH=1; the model then calls the online planner
for the current prompt/output shape and the active activation pool budget. The
release no longer ships or consumes precomputed hardware-specific chunk config
JSON files.
The activation pool budget can be fixed with --activation-pool-size-bytes,
which is forwarded to the server as VLLM_ACTIVATION_POOL_SIZE_BYTES. If this is
omitted, the server sizes the activation pool from currently available memory at
startup. --gpu-memory-utilization remains the vLLM startup option for the
standard vLLM memory budget; it is separate from the Mosaic activation pool
budget used by online chunk planning.
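As a rough illustration of how a runner could forward these settings, the sketch below builds the environment for the server process. build_server_env is a hypothetical helper, not the benchmark runner's code; only the two variable names come from this README.

```python
from typing import Dict, Mapping, Optional


def build_server_env(base_env: Mapping[str, str],
                     activation_pool_size_bytes: Optional[int] = None) -> Dict[str, str]:
    """Return a copy of base_env with the online-planning variables applied."""
    env = dict(base_env)
    # The benchmark runner always enables chunk-aware execution.
    env["VLLM_USE_CHUNKWISE_GRAPH"] = "1"
    if activation_pool_size_bytes is not None:
        if activation_pool_size_bytes <= 0:
            raise ValueError("activation pool size must be a positive byte count")
        # Fixed budget: forwarded to the server as an environment variable.
        env["VLLM_ACTIVATION_POOL_SIZE_BYTES"] = str(activation_pool_size_bytes)
    # When no budget is given the variable is left untouched, so the server
    # sizes the pool from available memory at startup.
    return env
```

Passing the result as the env argument of subprocess.Popen would keep the activation pool budget separate from the --gpu-memory-utilization flag on the command line.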
The length sweep has two phases:
- Linear scan from --start to --max by --step.
- Binary search between the last success and first failure, using --precision, only if a failure occurs before --max.
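The two phases can be sketched as follows. This is an illustration of the search strategy, not the code in run_benchmark.py: probe is a hypothetical callable that returns True when a run at the given total length succeeds, and returning None when the first point already fails is an assumed convention.

```python
from typing import Callable, Optional


def sweep_max_length(probe: Callable[[int], bool], start: int, step: int,
                     max_len: int, precision: int) -> Optional[int]:
    """Return the largest length that passed, or None if `start` itself failed."""
    last_ok: Optional[int] = None
    first_fail: Optional[int] = None

    # Phase 1: linear scan from start to max_len in increments of step.
    length = start
    while length <= max_len:
        if probe(length):
            last_ok = length
        else:
            first_fail = length
            break
        length += step

    # No failure: the last linear point (max_len when it lies on the grid) wins.
    if first_fail is None:
        return last_ok
    # The first point failed: there is no passing lower bound to search from.
    if last_ok is None:
        return None

    # Phase 2: binary search between the last success and the first failure
    # until the bracket is no wider than precision.
    lo, hi = last_ok, first_fail
    while hi - lo > precision:
        mid = (lo + hi) // 2
        if probe(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

A large step keeps the number of server restarts low when the goal is only to confirm an upper limit, while precision bounds how far the reported length can fall short of the true maximum.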
Important parameters:
| Parameter | Meaning |
|---|---|
| --start | First total context length to test. |
| --step | Linear-scan increment. Use a large value for fast upper-limit checks. |
| --max | Target upper bound. If all linear points pass, this is the best length. |
| --precision | Binary-search granularity after the first failure. |
| --steps | Number of diffusion refinement steps. Use 1 for a fast reachability test. |
| --alpha | Prompt length / total length. 0.5 means half prompt, half generated tokens. |
| --gpu | GPU ids exposed to the server process. |
| --output-dir | Where CSV, JSON, and server logs are written. Use /tmp/... for scratch runs. |
| --activation-pool-size-bytes | Optional fixed activation pool budget for online chunk planning. |
The validated LLaDA upper-limit run used alpha=0.5, steps=1, VMM enabled, and
a large 51200-token step so the sweep reached 204800 quickly:

    cd "$MOSAIC_HOME"
    export VLLM_USE_VMM=1
    export VLLM_VMM_CHUNK_SIZE=2097152
    export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
    python benchmarks/llada/run_benchmark.py \
      --project-path "$MOSAIC_HOME" \
      --model-path "$LLADA_MODEL_DIR" \
      --output-dir /tmp/mosaic_benchmark_llada_204800 \
      --start 51200 \
      --step 51200 \
      --max 204800 \
      --precision 8192 \
      --steps 1 \
      --alpha 0.5 \
      --gpu 0 \
      --port 8916 \
      --startup-timeout 1800 \
      --request-timeout 64800 \
      --shutdown-timeout 300 \
      --gpu-memory-utilization 0.92

The benchmark writes CSV and JSON summaries to --output-dir.

Dream benchmark:

    python benchmarks/dream/run_benchmark.py \
      --project-path "$MOSAIC_HOME" \
      --model-path "$DREAM_MODEL_DIR" \
      --output-dir /tmp/mosaic_benchmark_dream \
      --start 1024 \
      --step 1024 \
      --max 4096 \
      --precision 128 \
      --steps 10 \
      --gpu 0 \
      --port 8701

LLaDA-MoE benchmark:

    python benchmarks/llada_moe/run_benchmark.py \
      --project-path "$MOSAIC_HOME" \
      --model-path "$LLADA_MOE_MODEL_DIR" \
      --output-dir /tmp/mosaic_benchmark_llada_moe \
      --start 1024 \
      --step 1024 \
      --max 8192 \
      --precision 128 \
      --steps 10 \
      --gpu 0 \
      --port 10001

The server endpoint is /generate. The smoke client and benchmark runner send a
payload with these fields:

    {
      "prompt": "Explain diffusion language models in one short paragraph.",
      "max_tokens": 32,
      "gen_length": 32,
      "output_length": 32,
      "steps": 8,
      "block_length": 32,
      "temperature": 0.0,
      "cfg_scale": 0.0,
      "remasking": "low_confidence",
      "mask_id": 126336
    }

For benchmark runs, gen_length and output_length are derived from the total
length and --alpha. With --alpha 0.5 and --max 204800, the request at the
maximum length uses a 102400-token prompt and a 102400-token diffusion output.
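The split between prompt and generated tokens is simple arithmetic, sketched below. split_lengths is a hypothetical helper; rounding the prompt length down when the product is not an integer is an assumption, since the README does not state the rounding rule.

```python
from typing import Tuple


def split_lengths(total_length: int, alpha: float) -> Tuple[int, int]:
    """Split a total context length into (prompt_length, gen_length)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    # alpha is prompt length / total length; round down (assumed).
    prompt_length = int(total_length * alpha)
    gen_length = total_length - prompt_length
    return prompt_length, gen_length


prompt_length, gen_length = split_lengths(204800, 0.5)
print(prompt_length, gen_length)  # 102400 102400
```

Computing gen_length as the remainder keeps the two parts summing to the requested total even when alpha does not divide it evenly.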
| Variable | Purpose |
|---|---|
| MOSAIC_MODEL_ROOT | Base directory for all local model weights. |
| MODEL_DIR | One-off model path override for a server script. |
| LLADA_MODEL_DIR | LLaDA model path. |
| DREAM_MODEL_DIR | Dream model path. |
| LLADA_MOE_MODEL_DIR | LLaDA-MoE model path. |
| VLLM_USE_VMM | Enable the optional CUDA VMM activation allocator. |
| VLLM_VMM_CHUNK_SIZE | VMM physical mapping chunk size in bytes. |
| VLLM_USE_CHUNKWISE_GRAPH | Enable chunk-aware graph execution helpers. |
| VLLM_ACTIVATION_POOL_SIZE_BYTES | Optional fixed activation pool budget used by online chunk planning. |
| VLLM_ALLOW_LONG_MAX_MODEL_LEN | Allow long --max-model-len values. |
| GPU_MEMORY_UTILIZATION | Forwarded to vLLM server startup scripts. |
| EXTRA_VLLM_ARGS | Extra command-line flags appended by server scripts. |
If imports fail for flash_sample, fused_ops, or vmm_allocator, rebuild the
custom operators in the active Python environment and confirm that PYTHONPATH
contains the operator directories.
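Before rebuilding, it can help to see which of the modules Python can locate at all and where they resolve from. The following sketch uses importlib.util.find_spec; locate_modules is a hypothetical helper, and the module names passed to it should match the ones used in the import check earlier in this README.

```python
import importlib.util
from typing import Dict, Iterable, Optional


def locate_modules(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each module name to the file it resolves to, or None if not found."""
    report: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # A missing parent package raises instead of returning None.
            spec = None
        if spec is None:
            report[name] = None
        else:
            # Namespace packages have no single origin file.
            report[name] = spec.origin or "namespace package"
    return report


for module, origin in locate_modules(
    ["flash_sample", "vmm_allocator", "vllm_add_llada.cuda_kernels.fused_ops"]
).items():
    print(f"{module}: {origin or 'NOT FOUND (check PYTHONPATH and rebuild)'}")
```

A module that resolves to a path outside the repository usually points to a stale install shadowing the source tree rather than a failed build.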
If a LLaDA run fails at long context lengths, confirm that the server was
started with a large enough MAX_MODEL_LEN (or the benchmark's --max-model-len),
that VMM is enabled, and that the current tree includes the LLaDA
max-sequence-length patch in the model wrapper.
If /health succeeds but the client gets connection errors, check that the client
uses the same HOST and PORT as the server and that no old server process is
occupying the port.
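A quick way to tell whether anything is listening on a given address is a plain TCP connection attempt. The sketch below uses only the standard library; port_is_open is a hypothetical helper, and it cannot tell a current server from a leftover one, so pair it with the process listing when in doubt.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Default ports from the supported-models table.
for name, port in (("llada", 8901), ("dream", 8701), ("llada-moe", 10001)):
    print(name, port, "listening" if port_is_open("127.0.0.1", port) else "free")
```

A port that reports as listening before any server has been started is the usual sign of a leftover process from an earlier run.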
If GPU memory appears to remain allocated after a failed run, stop the server process group and confirm with:

    nvidia-smi
    ps -eo pid,pgid,stat,comm,args | grep -E 'vllm.entrypoints.api_server|benchmarks/.*/run_benchmark' | grep -v grep

Generated files are ignored by .gitignore. To return the source tree to a
code-only state after local validation, remove build products and Python caches:

    find . -type d \( -name build -o -name __pycache__ -o -name '*.egg-info' -o -name .pytest_cache -o -name .mypy_cache \) -prune -exec rm -rf {} +
    find . -type f \( -name '*.so' -o -name '*.pyc' -o -name '*.pyo' -o -name '*.o' -o -name '*.log' -o -name '*.csv' \) -delete
    rm -rf benchmarks/results logs benchmark_results

Prefer /tmp/mosaic-benchmark-run for benchmark --output-dir when validating release
candidates so benchmark reports and logs never enter the repository tree.
If you use Mosaic in research, please cite:
@article{zheng2026mosaic,
title={Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming},
author={Zheng, Liang and Shi, Bowen and Hu, Yitao and Zhang, Jiawei and Li, Ruofan and Chen, Sheng and Li, Wenxin and Li, Keqiu},
journal={arXiv preprint arXiv:2601.06562},
year={2026}
}