devinnicholson/llm-inference-benchmark

llm-inference-benchmark-lab

Research-grade learning repo for ML inference systems experiments.

The project is now centered on KV-cache behavior in LLM serving. The goal is to build a reproducible ML inference systems artifact with clear workload definitions, request lifecycle traces, benchmark methodology, scheduler experiments, KV-cache pressure studies, and eventually backend comparisons against real inference engines.

Current report: docs/prefix-cache-study.md summarizes the no-repeat vLLM prefix-cache result, the longer-context follow-ups, and the L4 context and model-size follow-ups. The primary no-repeat claim is that measured-window KV-cache reuse is stable and that p95 first-event/TTFT latency improves; the longer-context Qwen follow-ups show more broadly favorable timing when prefill work is larger.

Generated result artifacts: results/prefix-cache-study/key-results.md and results/prefix-cache-study/intervals.md

Latest extra-long follow-up: results/prefix-cache-study-extra-long-merged-r8/key-results.md

Latest model-shape follow-up: results/prefix-cache-study-extra-long-qwen05b-merged-r8/key-results.md

Latest longer-context follow-up: results/prefix-cache-study-ultra-long-qwen05b-merged-r8/key-results.md

Latest L4-scale context follow-up: results/prefix-cache-study-mega-long-qwen05b-l4-merged-r8/key-results.md

Latest L4 model-size follow-up: results/prefix-cache-study-mega-long-qwen15b-l4-merged-r8/key-results.md

Latest L4 batch-pressure follow-up: results/prefix-cache-study-mega-long-qwen15b-l4-n32-merged-r8/key-results.md

Latest L4 batch-pressure comparison: results/prefix-cache-study-mega-long-qwen15b-l4-n16-vs-n32-r8/batch-pressure-comparison.md

Latest L4 capacity diagnostic: results/modal-vllm-capacity-diagnostic-qwen15b-l4-n32-batched-tokens-r1/capacity-diagnostic.md

Latest L4 scheduler-control follow-up: results/prefix-cache-study-mega-long-qwen15b-l4-n32-batched-tokens60640-merged-r8/key-results.md

Latest L4 serving-path scheduler-control smoke: results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-phase-order-smoke-r1/phase-order-compare.csv

Latest L4 serving-path log summary: results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-log-summary-r1/server-async-log-summary.csv

Latest L4 serving-path cache-control smoke: results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-server-cache-control-log-summary-r1/server-async-log-summary.csv

Latest L4 serving-path cache-control phase-order matrix: results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-server-cache-control-phase-order-r2/cache-control-phase-order-compare.json

Latest L4 serving-path cache-control multitrial aggregate: results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-server-cache-control-multitrial-r2/cache-control-multitrial.json

Latest L4 serving-path server-only cache-control comparison: results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-server-cache-control-server-absolute-r2/server-cache-control-absolute.json

Latest prompt-overlap vs server cache-control check: results/modal-vllm-prefix-cache-prompt-overlap-server-cache-control-qwen15b-n32-r2/prompt-overlap-server-compare.md

Latest single-profile server cache-control isolation: results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-single-profile-server-cache-control-r1/server-cache-control-absolute.json

Latest server /metrics no-warmup cache-control: results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-nowarmup-server-cache-control-r1-r2/server-cache-control-absolute.json

Latest server cache-pressure curve: results/modal-vllm-server-async-qwen15b-l4-kvbudget-gpu045-gpu040-gpu035-gpu0325-n32-batched-tokens60640-seed3805-seed3906-pressure-curve-r1/server-cache-pressure-curve.md

Latest server KV-budget report: results/modal-vllm-server-async-qwen15b-l4-kvbudget-report-r1/kv-budget-report.md

Latest server KV-budget request-count comparison: results/modal-vllm-server-async-qwen15b-l4-kvbudget-gpu0325-n16-seed4107-seed4208-vs-n32-request-count-r1/kv-budget-request-count-comparison.md

Latest server KV-budget failure-bound probe: results/modal-vllm-server-async-qwen15b-l4-kvbudget-gpu030-n32-batched-tokens60640-feasibility-failure-r1/failure.md

Latest vLLM server CLI evidence: results/modal-vllm-server-cli-help-v021/vllm-server-cli-help.json

Future TODO

  • Build a static benchmark explorer UI for GitHub Pages that loads generated JSON/CSV artifacts and compares runs by model, GPU, prompt profile, request count, cache mode, phase order, throughput, latency, TPOT/TTFT, and observed prefix-cache hit rate.

Project Track

This repo follows a systems-focused path:

  1. Request lifecycle, KV-cache accounting, and measurement vocabulary
  2. Profiling and trace discipline
  3. Triton primitives for inference hot paths
  4. KV cache, batching, and scheduling policies
  5. Quantization and decoding tradeoffs
  6. Serving APIs, streaming, cancellation, and metrics
  7. Distributed inference and placement policies
  8. Trace-driven workload realism
  9. Artifact report and reproducibility package

Week 1 Artifact

Week 1 defines the basic serving lifecycle before any real model backend exists. The initial code provides:

  • A workload schema for inference requests
  • Validation for prompt/output token counts and arrival times
  • A deterministic FIFO request lifecycle simulator
  • Per-stage traces for queueing, tokenization, prefill, decode, and streaming
  • Per-request KV-cache footprint estimates
  • Active KV-cache timeline metrics
  • Summary metrics for end-to-end latency, queue wait, throughput, and memory pressure
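The lifecycle pieces listed above can be sketched in a few lines. This is a minimal illustration rather than the repo's simulator: the `Request` fields, the validation rules (positive token counts, sorted non-negative arrivals), and the per-token costs are all assumptions chosen for the example, and tokenization and streaming stages are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical schema; the repo's workload files may use other field names.
    request_id: str
    arrival_s: float
    prompt_tokens: int
    output_tokens: int


def validate(requests):
    """Reject non-positive token counts and negative or unsorted arrival times."""
    previous_arrival = 0.0
    for req in requests:
        if req.prompt_tokens <= 0 or req.output_tokens <= 0:
            raise ValueError(f"{req.request_id}: token counts must be positive")
        if req.arrival_s < previous_arrival:
            raise ValueError(f"{req.request_id}: arrivals must be non-negative and sorted")
        previous_arrival = req.arrival_s


def simulate_fifo(requests, prefill_s_per_token=0.001, decode_s_per_token=0.02):
    """Serve one request at a time in arrival order and record stage timings."""
    validate(requests)
    traces = []
    server_free_at = 0.0
    for req in requests:
        start = max(req.arrival_s, server_free_at)  # wait until the slot is free
        prefill_end = start + req.prompt_tokens * prefill_s_per_token
        decode_end = prefill_end + req.output_tokens * decode_s_per_token
        traces.append({
            "request_id": req.request_id,
            "queue_wait_s": start - req.arrival_s,
            "prefill_s": prefill_end - start,
            "decode_s": decode_end - prefill_end,
            "e2e_latency_s": decode_end - req.arrival_s,
        })
        server_free_at = decode_end
    return traces
```

Because every duration is a deterministic function of the token counts, replaying the same workload yields the same traces, which is what makes the simulator useful as a measurement baseline.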

Run the starter workload:

python3 scripts/replay_workload.py workloads/week01_mixed_requests.json \
  --model-config configs/models/llama-7b-gqa-fp16.json
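For orientation, a per-request KV-cache footprint is commonly estimated as two tensors (keys and values) per layer, per KV head, per head dimension, per token. The sketch below uses that standard formula; the shape values in the usage note are illustrative assumptions and are not read from configs/models/llama-7b-gqa-fp16.json.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache one token occupies: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def request_kv_bytes(prompt_tokens, output_tokens, **shape):
    """Peak footprint of one request once all output tokens are decoded."""
    return (prompt_tokens + output_tokens) * kv_bytes_per_token(**shape)
```

With a hypothetical shape of 32 layers, 8 KV heads, head dimension 128, and 2-byte fp16 elements, each token costs 131072 bytes (128 KiB), so a request with 1000 prompt tokens and 24 output tokens peaks at 128 MiB.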

Generate a deterministic bursty workload:

python3 scripts/generate_workload.py mixed_bursty \
  --requests 32 \
  --seed 1337 \
  --output workloads/generated/mixed_bursty_32_seed1337.json

Replay it with overlapping FIFO request slots:

python3 scripts/replay_workload.py workloads/generated/mixed_bursty_32_seed1337.json \
  --model-config configs/models/llama-7b-gqa-fp16.json \
  --max-concurrent-requests 4 \
  --scheduler-policy fifo

Compare a scheduler policy:

python3 scripts/replay_workload.py workloads/generated/mixed_bursty_32_seed1337.json \
  --model-config configs/models/llama-7b-gqa-fp16.json \
  --max-concurrent-requests 4 \
  --scheduler-policy shortest-cache
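To make the difference between the two policies concrete, here is a small sketch of how a scheduler might pick the next waiting request. It assumes that `shortest-cache` means "smallest estimated KV-cache footprint first"; that reading, the tuple layout, and the tie-breaking rule are assumptions for illustration, and docs/scheduler-policies.md remains the authoritative description.

```python
def pick_next(waiting, policy):
    """Choose the next request from (arrival_s, request_id, kv_bytes) tuples."""
    if not waiting:
        raise ValueError("no waiting requests")
    if policy == "fifo":
        # Earliest arrival first; request_id breaks ties deterministically.
        return min(waiting, key=lambda r: (r[0], r[1]))
    if policy == "shortest-cache":
        # Smallest estimated KV footprint first, then arrival order.
        return min(waiting, key=lambda r: (r[2], r[0], r[1]))
    raise ValueError(f"unknown policy: {policy}")
```

Under a bursty workload the two rules diverge as soon as a large request arrives ahead of several small ones: FIFO serves the large request first, while the footprint-ordered rule lets the small ones through and delays the large one.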

Replay with a constrained KV-cache budget:

python3 scripts/replay_workload.py workloads/generated/mixed_bursty_32_seed1337.json \
  --model-config configs/models/llama-7b-gqa-fp16.json \
  --capacity-config configs/capacity/tight-1gb-kv.json \
  --max-concurrent-requests 4 \
  --scheduler-policy memory-aware-deadline
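A KV-cache budget turns admission into a two-constraint check: a free request slot and enough remaining cache bytes. The sketch below shows a plain in-order admission loop under both limits. It is not the `memory-aware-deadline` policy, whose rules are documented in docs/capacity-aware-scheduling.md; the function name and argument layout are hypothetical.

```python
def admit(waiting, active_bytes, active_count, budget_bytes, max_concurrent):
    """Admit requests in order until a slot limit or the KV budget is reached.

    waiting is a list of (request_id, kv_bytes) pairs in queue order.
    """
    admitted = []
    for request_id, kv_bytes in waiting:
        if active_count + len(admitted) >= max_concurrent:
            break  # no free request slot
        if active_bytes + kv_bytes > budget_bytes:
            break  # head-of-line blocking: keep queue order, do not skip ahead
        active_bytes += kv_bytes
        admitted.append(request_id)
    return admitted
```

Stopping at the first request that does not fit preserves queue order at the cost of idle capacity; a policy that skips ahead would trade that fairness for utilization.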

Run the first reproducible capacity sweep:

python3 scripts/run_sweep.py

The sweep writes JSON and CSV results to results/experiment-001-capacity-sweep and is documented in docs/experiment-001-capacity-sweep.md.

Run the first Modal remote sweep:

modal run modal_app.py

Run the first Modal GPU probe:

modal run modal_app.py --mode gpu-probe

Run the first Modal tiny inference timing:

modal run modal_app.py --mode tiny-inference

Run the first Modal vLLM inference baseline:

modal run modal_app.py --mode vllm-inference

Inspect vLLM cache-metrics support in the Modal image:

modal run modal_app.py --mode vllm-cache-metrics-probe

Run a tiny GPU-backed vLLM cache-metrics smoke:

modal run modal_app.py --mode vllm-cache-metrics-smoke \
  --prompt-profile shared_prefix_long \
  --prompt-count 4 \
  --max-new-tokens 8

Run the first Modal vLLM streaming timing:

modal run modal_app.py --mode vllm-streaming
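Streaming timings are usually reduced to time to first token (TTFT) and time per output token (TPOT). The sketch below computes both from a list of event timestamps, assuming one timestamp per streamed token; if a stream event can carry several tokens, the first value is better described as first-event latency. The function and key names are hypothetical.

```python
def stream_timing(request_start_s, token_times_s):
    """Summarize one streamed response from its per-token arrival timestamps."""
    if not token_times_s:
        raise ValueError("at least one token timestamp is required")
    count = len(token_times_s)
    first, last = token_times_s[0], token_times_s[-1]
    return {
        "ttft_s": first - request_start_s,
        # TPOT averages the gaps after the first token; undefined for one token.
        "tpot_s": (last - first) / (count - 1) if count > 1 else None,
        "e2e_s": last - request_start_s,
    }
```

Excluding the first token from TPOT keeps prefill cost in TTFT and decode cost in TPOT, so the two metrics can move independently when prefix caching shortens prefill.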

Run the first OpenAI-compatible vLLM server streaming smoke:

modal run modal_app.py --mode vllm-server-streaming

Run the first OpenAI-compatible vLLM server concurrent streaming workload:

modal run modal_app.py --mode vllm-server-concurrent

Run the repeated OpenAI-compatible vLLM server sweep:

modal run modal_app.py --mode vllm-server-sweep \
  --prompt-profiles short \
  --output-tokens 32 \
  --repeats 3 \
  --warmup-runs 1

Compare the repeated server sweep against the in-process AsyncLLM sweep:

modal run modal_app.py --mode vllm-server-sweep-compare

Run a paired same-worker server-vs-AsyncLLM benchmark:

modal run modal_app.py --mode vllm-server-async-paired \
  --prompt-profiles short \
  --output-tokens 32 \
  --server-async-prefix-caching off \
  --repeats 3 \
  --warmup-runs 1

Run the server-first phase-order control:

modal run modal_app.py --mode vllm-server-async-paired \
  --phase-order server_first \
  --prompt-profiles short \
  --output-tokens 32 \
  --repeats 3 \
  --warmup-runs 1

Compare paired phase orders:

modal run modal_app.py --mode vllm-server-async-phase-order-compare
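The reason for running both phase orders is that whichever backend runs second on a worker may benefit from warm state. A minimal sketch of the arithmetic, assuming each paired run reports one mean latency per backend under the hypothetical keys `server_s` and `async_s`:

```python
def phase_order_summary(async_first, server_first):
    """Separate the backend delta from the run-order effect across two paired runs."""
    delta_async_first = async_first["server_s"] - async_first["async_s"]
    delta_server_first = server_first["server_s"] - server_first["async_s"]
    return {
        # Averaging the two deltas cancels a symmetric second-phase advantage.
        "backend_delta_s": (delta_async_first + delta_server_first) / 2,
        # Half the disagreement estimates how much running second helps.
        "order_effect_s": (delta_server_first - delta_async_first) / 2,
    }
```

For example, if the server path truly costs 0.2 s more and running second saves 0.1 s, the async-first run shows a 0.1 s delta and the server-first run shows 0.3 s; the summary recovers 0.2 s and 0.1 s respectively. This only holds if the second-phase advantage is the same for both backends, which is itself an assumption.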

Aggregate multiple phase-order comparisons:

modal run modal_app.py --mode vllm-server-async-multitrial-aggregate

Run the long-prompt phase-order workload:

modal run modal_app.py --mode vllm-server-async-paired \
  --phase-order async_first \
  --prompt-profiles long \
  --output-tokens 32 \
  --repeats 3 \
  --warmup-runs 1 \
  --output-dir results/modal-vllm-server-async-paired-long-async-first

modal run modal_app.py --mode vllm-server-async-paired \
  --phase-order server_first \
  --prompt-profiles long \
  --output-tokens 32 \
  --repeats 3 \
  --warmup-runs 1 \
  --output-dir results/modal-vllm-server-async-paired-long-server-first

Compare the long-prompt phase orders and then summarize short-vs-long workload sensitivity:

modal run modal_app.py --mode vllm-server-async-phase-order-compare \
  --async-first-paired-dir results/modal-vllm-server-async-paired-long-async-first \
  --server-first-paired-dir results/modal-vllm-server-async-paired-long-server-first \
  --output-dir results/modal-vllm-server-async-phase-order-compare-long

modal run modal_app.py --mode vllm-server-async-workload-compare

Run the first Modal vLLM concurrent workload:

modal run modal_app.py --mode vllm-concurrent

Run the first Modal vLLM concurrency/context sweep:

modal run modal_app.py --mode vllm-sweep

Run the repeated sweep with an explicit shuffle seed:

modal run modal_app.py --mode vllm-sweep --repeats 3 --scenario-seed 1337

Run the paired prefix-cache sweep and comparison:

modal run modal_app.py --mode vllm-sweep --prefix-caching on
modal run modal_app.py --mode vllm-prefix-cache-compare

Run the shared-prefix KV-cache pilot:

modal run modal_app.py --mode vllm-sweep \
  --prompt-profiles shared_prefix \
  --output-tokens 32 \
  --request-counts 1,2,4,8 \
  --repeats 3 \
  --scenario-seed 1337 \
  --prefix-caching off \
  --output-dir results/modal-vllm-shared-prefix-cold

modal run modal_app.py --mode vllm-sweep \
  --prompt-profiles shared_prefix \
  --output-tokens 32 \
  --request-counts 1,2,4,8 \
  --repeats 3 \
  --scenario-seed 1337 \
  --prefix-caching on \
  --output-dir results/modal-vllm-shared-prefix-cache

modal run modal_app.py --mode vllm-prefix-cache-compare \
  --cold-sweep-dir results/modal-vllm-shared-prefix-cold \
  --prefix-sweep-dir results/modal-vllm-shared-prefix-cache \
  --output-dir results/modal-vllm-shared-prefix-cache-compare

Run the paired same-worker prefix-cache control:

modal run modal_app.py --mode vllm-prefix-cache-paired \
  --prompt-profiles shared_prefix \
  --output-tokens 32 \
  --request-counts 1,2,4,8 \
  --repeats 3 \
  --warmup-runs 1 \
  --scenario-seed 1337 \
  --phase-order cold_first

modal run modal_app.py --mode vllm-prefix-cache-paired \
  --prompt-profiles shared_prefix \
  --output-tokens 32 \
  --request-counts 1,2,4,8 \
  --repeats 3 \
  --warmup-runs 1 \
  --scenario-seed 1337 \
  --phase-order cache_first

modal run modal_app.py --mode vllm-prefix-cache-phase-order-compare

Run the long shared-prefix versus matched unique-prefix control:

modal run modal_app.py --mode vllm-prefix-cache-paired \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 32 \
  --request-counts 1,2,4,8 \
  --repeats 3 \
  --warmup-runs 1 \
  --scenario-seed 1337 \
  --phase-order cold_first \
  --output-dir results/modal-vllm-prefix-cache-long-control-paired

modal run modal_app.py --mode vllm-prefix-cache-paired \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 32 \
  --request-counts 1,2,4,8 \
  --repeats 3 \
  --warmup-runs 1 \
  --scenario-seed 1337 \
  --phase-order cache_first \
  --output-dir results/modal-vllm-prefix-cache-long-control-paired-cache-first

modal run modal_app.py --mode vllm-prefix-cache-phase-order-compare \
  --prefix-cache-cold-first-paired-dir results/modal-vllm-prefix-cache-long-control-paired \
  --prefix-cache-cache-first-paired-dir results/modal-vllm-prefix-cache-long-control-paired-cache-first \
  --output-dir results/modal-vllm-prefix-cache-long-control-phase-order

modal run modal_app.py --mode vllm-prefix-cache-profile-control \
  --prefix-cache-phase-order-compare-dir results/modal-vllm-prefix-cache-long-control-phase-order \
  --output-dir results/modal-vllm-prefix-cache-long-control-profile-control

# Repeat the paired run with --scenario-seed 569 and trial2 output dirs, then:
modal run modal_app.py --mode vllm-prefix-cache-profile-multitrial \
  --prefix-cache-profile-control-dirs results/modal-vllm-prefix-cache-long-control-profile-control,results/modal-vllm-prefix-cache-long-control-profile-control-trial2 \
  --output-dir results/modal-vllm-prefix-cache-long-control-multitrial

Run a small metrics-enabled paired prefix-cache smoke:

modal run modal_app.py --mode vllm-prefix-cache-paired \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 8 \
  --request-counts 4 \
  --repeats 1 \
  --warmup-runs 1 \
  --scenario-seed 570 \
  --phase-order cold_first \
  --cache-metrics on \
  --kv-cache-metrics-sample 1.0 \
  --output-dir results/modal-vllm-prefix-cache-metrics-paired-smoke

modal run modal_app.py --mode vllm-prefix-cache-paired \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 8 \
  --request-counts 4 \
  --repeats 1 \
  --warmup-runs 1 \
  --scenario-seed 570 \
  --phase-order cache_first \
  --cache-metrics on \
  --kv-cache-metrics-sample 1.0 \
  --output-dir results/modal-vllm-prefix-cache-metrics-paired-smoke-cache-first

modal run modal_app.py --mode vllm-prefix-cache-phase-order-compare \
  --prefix-cache-cold-first-paired-dir results/modal-vllm-prefix-cache-metrics-paired-smoke \
  --prefix-cache-cache-first-paired-dir results/modal-vllm-prefix-cache-metrics-paired-smoke-cache-first \
  --output-dir results/modal-vllm-prefix-cache-metrics-phase-order-smoke

modal run modal_app.py --mode vllm-prefix-cache-profile-control \
  --prefix-cache-phase-order-compare-dir results/modal-vllm-prefix-cache-metrics-phase-order-smoke \
  --output-dir results/modal-vllm-prefix-cache-metrics-profile-control-smoke

Run a small repeated metrics-enabled prefix-cache trial:

modal run modal_app.py --mode vllm-prefix-cache-paired \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 8 \
  --request-counts 2,4,8 \
  --repeats 2 \
  --warmup-runs 1 \
  --scenario-seed 571 \
  --phase-order cold_first \
  --cache-metrics on \
  --kv-cache-metrics-sample 1.0 \
  --output-dir results/modal-vllm-prefix-cache-metrics-repeated

modal run modal_app.py --mode vllm-prefix-cache-paired \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 8 \
  --request-counts 2,4,8 \
  --repeats 2 \
  --warmup-runs 1 \
  --scenario-seed 571 \
  --phase-order cache_first \
  --cache-metrics on \
  --kv-cache-metrics-sample 1.0 \
  --output-dir results/modal-vllm-prefix-cache-metrics-repeated-cache-first

modal run modal_app.py --mode vllm-prefix-cache-phase-order-compare \
  --prefix-cache-cold-first-paired-dir results/modal-vllm-prefix-cache-metrics-repeated \
  --prefix-cache-cache-first-paired-dir results/modal-vllm-prefix-cache-metrics-repeated-cache-first \
  --output-dir results/modal-vllm-prefix-cache-metrics-repeated-phase-order

modal run modal_app.py --mode vllm-prefix-cache-profile-control \
  --prefix-cache-phase-order-compare-dir results/modal-vllm-prefix-cache-metrics-repeated-phase-order \
  --output-dir results/modal-vllm-prefix-cache-metrics-repeated-profile-control

Run a fresh-engine isolated cache-metrics trial:

modal run modal_app.py --mode vllm-prefix-cache-isolated-metrics \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 8 \
  --request-counts 2,4 \
  --repeats 1 \
  --scenario-seed 572 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --output-dir results/modal-vllm-prefix-cache-isolated-metrics

Run the repeated isolated cache-metrics stability trial:

modal run modal_app.py --mode vllm-prefix-cache-isolated-metrics \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 8 \
  --request-counts 2,4,8 \
  --repeats 2 \
  --scenario-seed 573 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --output-dir results/modal-vllm-prefix-cache-isolated-metrics-repeated

Generate the report-ready isolated cache stability summary:

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-metrics-repeated \
  --output-dir results/modal-vllm-prefix-cache-isolated-stability-summary

Run a warmed-window isolated cache trial. This adds one throwaway warmup scenario run inside each fresh cold/cache engine before the measured scenario:

modal run modal_app.py --mode vllm-prefix-cache-isolated-warm-window \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 8 \
  --request-counts 2,4,8 \
  --repeats 1 \
  --scenario-seed 574 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --output-dir results/modal-vllm-prefix-cache-isolated-warm-window

Generate a compact summary for that warmed-window trial:

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-warm-window \
  --output-dir results/modal-vllm-prefix-cache-isolated-warm-window-summary

Run the neutral-warmup isolated cache trial. This warms each fresh engine with neutral_long prompts instead of the measured shared/control prompt bodies:

modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 8 \
  --request-counts 2,4,8 \
  --repeats 1 \
  --scenario-seed 575 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --output-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup

Generate the neutral-warmup summary:

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup \
  --output-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-summary

Generate a measured-window estimate from a neutral-warmup run that includes warmup-before cache metrics:

modal run modal_app.py --mode vllm-prefix-cache-isolated-window-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-window-source \
  --output-dir results/modal-vllm-prefix-cache-isolated-window-summary

Run the direct-counter neutral-warmup trial. This records vLLM CachingMetrics query/hit deltas around each measured scenario:

modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 8 \
  --request-counts 2,4,8 \
  --repeats 1 \
  --scenario-seed 577 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --output-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-counters
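Taking deltas around the measured scenario matters because cumulative counters also include warmup traffic. A minimal sketch of the window calculation, assuming snapshots are plain dicts with hypothetical `queries` and `hits` keys (these are not claimed to be vLLM attribute names):

```python
def window_hit_rate(before, after):
    """Hit rate for the measured window from two cumulative counter snapshots."""
    queries = after["queries"] - before["queries"]
    hits = after["hits"] - before["hits"]
    if queries < 0 or hits < 0:
        raise ValueError("counters went backwards; the engine may have restarted")
    if queries == 0:
        return None  # nothing was looked up in this window
    return hits / queries
```

With snapshots of 1000 queries and 200 hits before, and 1800 queries and 800 hits after, the window hit rate is 600 / 800 = 0.75, whereas the cumulative rate of 800 / 1800 would understate reuse in the measured scenario.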

Generate the counter-aware summary:

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-counters \
  --output-dir results/modal-vllm-prefix-cache-isolated-counter-summary

Run the repeated direct-counter stability trial and summary:

modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
  --prompt-profiles shared_prefix_long,matched_unique_prefix \
  --output-tokens 8 \
  --request-counts 2,4,8 \
  --repeats 3 \
  --scenario-seed 577 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --output-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-counter-stability

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-counter-stability \
  --output-dir results/modal-vllm-prefix-cache-isolated-counter-stability-summary

Run a variant prompt-family smoke so repeats can vary the workload family:

modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
  --prompt-profiles shared_prefix_long_variant,matched_unique_prefix_variant \
  --output-tokens 8 \
  --request-counts 4 \
  --repeats 2 \
  --scenario-seed 577 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --prefix-cache-shared-profile shared_prefix_long_variant \
  --prefix-cache-control-profile matched_unique_prefix_variant \
  --output-dir results/modal-vllm-prefix-cache-variant-smoke

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-variant-smoke \
  --prefix-cache-shared-profile shared_prefix_long_variant \
  --prefix-cache-control-profile matched_unique_prefix_variant \
  --output-dir results/modal-vllm-prefix-cache-variant-smoke-summary

Run the full variant stability grid and summary:

modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
  --prompt-profiles shared_prefix_long_variant,matched_unique_prefix_variant \
  --output-tokens 8 \
  --request-counts 2,4,8 \
  --repeats 3 \
  --scenario-seed 577 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --prefix-cache-shared-profile shared_prefix_long_variant \
  --prefix-cache-control-profile matched_unique_prefix_variant \
  --output-dir results/modal-vllm-prefix-cache-variant-stability

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-variant-stability \
  --prefix-cache-shared-profile shared_prefix_long_variant \
  --prefix-cache-control-profile matched_unique_prefix_variant \
  --output-dir results/modal-vllm-prefix-cache-variant-stability-summary

Audit prompt token and cache-block alignment for the variant grid:

modal run modal_app.py --mode vllm-prefix-cache-prompt-audit \
  --prompt-profiles shared_prefix_long_variant,matched_unique_prefix_variant \
  --output-tokens 8 \
  --request-counts 2,4,8 \
  --repeats 3 \
  --scenario-seed 577 \
  --kv-cache-block-size 16 \
  --output-dir results/modal-vllm-prefix-cache-prompt-audit-variant-stability
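Block alignment matters because a shared prefix is only reusable in whole cache blocks. The sketch below estimates how many full blocks a set of tokenized prompts can share, assuming reuse happens only for complete blocks of `block_size` tokens; the function names are hypothetical and the prompts are plain lists of token ids.

```python
def common_prefix_len(prompts):
    """Length of the longest token prefix shared by every prompt."""
    if not prompts:
        return 0
    shortest = min(len(p) for p in prompts)
    for i in range(shortest):
        token = prompts[0][i]
        if any(p[i] != token for p in prompts):
            return i
    return shortest


def shared_full_blocks(prompts, block_size=16):
    """Number of whole cache blocks covered by the common prefix."""
    return common_prefix_len(prompts) // block_size
```

Two prompts that agree on their first 40 tokens share only 2 full blocks at block size 16; the remaining 8 shared tokens fall in a partial block and would be recomputed under this assumption.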

Run the n=16 variant timing probe and summary:

modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
  --prompt-profiles shared_prefix_long_variant,matched_unique_prefix_variant \
  --output-tokens 8 \
  --request-counts 16 \
  --repeats 3 \
  --scenario-seed 577 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --prefix-cache-shared-profile shared_prefix_long_variant \
  --prefix-cache-control-profile matched_unique_prefix_variant \
  --output-dir results/modal-vllm-prefix-cache-variant-n16

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-variant-n16 \
  --prefix-cache-shared-profile shared_prefix_long_variant \
  --prefix-cache-control-profile matched_unique_prefix_variant \
  --output-dir results/modal-vllm-prefix-cache-variant-n16-summary

Audit prompt token, cache-block, and exact duplicate alignment for the n=16 variant probe:

modal run modal_app.py --mode vllm-prefix-cache-prompt-audit \
  --prompt-profiles shared_prefix_long_variant,matched_unique_prefix_variant \
  --output-tokens 8 \
  --request-counts 16 \
  --repeats 3 \
  --scenario-seed 577 \
  --kv-cache-block-size 16 \
  --output-dir results/modal-vllm-prefix-cache-prompt-audit-variant-n16
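Exact duplicates matter for the matched control because an identical prompt can hit the prefix cache for its whole length, which would make a "unique" profile look like it has reuse. A minimal duplicate check over tokenized prompts, with a hypothetical function name:

```python
from collections import Counter


def duplicate_prompts(prompts):
    """Return each token sequence that appears more than once, with its count."""
    counts = Counter(tuple(p) for p in prompts)
    return {tokens: count for tokens, count in counts.items() if count > 1}
```

An empty result means every prompt in the set is distinct, which is the property the no-repeat profiles are meant to guarantee.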

Audit, run, and summarize the no-repeat n=16 variant probe, which removes exact duplicate prompts from the matched control:

modal run modal_app.py --mode vllm-prefix-cache-prompt-audit \
  --prompt-profiles shared_prefix_long_no_repeat_variant,matched_unique_prefix_no_repeat_variant \
  --output-tokens 8 \
  --request-counts 16 \
  --repeats 3 \
  --scenario-seed 577 \
  --kv-cache-block-size 16 \
  --output-dir results/modal-vllm-prefix-cache-prompt-audit-no-repeat-n16

modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
  --prompt-profiles shared_prefix_long_no_repeat_variant,matched_unique_prefix_no_repeat_variant \
  --output-tokens 8 \
  --request-counts 16 \
  --repeats 3 \
  --scenario-seed 577 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
  --prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
  --output-dir results/modal-vllm-prefix-cache-no-repeat-n16

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-no-repeat-n16 \
  --prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
  --prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
  --output-dir results/modal-vllm-prefix-cache-no-repeat-n16-summary

Run the no-repeat request-count scaling smoke:

modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
  --prompt-profiles shared_prefix_long_no_repeat_variant,matched_unique_prefix_no_repeat_variant \
  --output-tokens 8 \
  --request-counts 8,12,16,20 \
  --repeats 2 \
  --scenario-seed 577 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
  --prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
  --output-dir results/modal-vllm-prefix-cache-no-repeat-scaling-smoke

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-no-repeat-scaling-smoke \
  --prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
  --prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
  --output-dir results/modal-vllm-prefix-cache-no-repeat-scaling-smoke-summary

Run the no-repeat n=16 fixed-shape stability pass:

modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
  --prompt-profiles shared_prefix_long_no_repeat_variant,matched_unique_prefix_no_repeat_variant \
  --output-tokens 8 \
  --request-counts 16 \
  --repeats 8 \
  --scenario-seed 577 \
  --phase-order cold_first \
  --kv-cache-metrics-sample 1.0 \
  --prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
  --prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
  --output-dir results/modal-vllm-prefix-cache-no-repeat-n16-stability-r8

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-no-repeat-n16-stability-r8 \
  --prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
  --prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
  --output-dir results/modal-vllm-prefix-cache-no-repeat-n16-stability-r8-summary

Regenerate the no-repeat n=16 stability summary with TTFT/first-event and stream TPOT profile-control intervals:

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-no-repeat-n16-stability-r8 \
  --prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
  --prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
  --output-dir results/modal-vllm-prefix-cache-no-repeat-n16-stability-r8-summary-ttft
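For readers unfamiliar with tail-latency intervals, the sketch below shows one generic way to put an interval around a p95 difference: a nearest-rank percentile plus a percentile bootstrap. It is an illustration under stated assumptions, not the method the summary mode uses; the function names, resample count, and seed are hypothetical, and both inputs must be non-empty.

```python
import random


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct * n / 100)
    return ordered[rank - 1]


def bootstrap_p95_delta(cold, cached, resamples=2000, seed=0):
    """95% percentile-bootstrap interval for p95(cached) minus p95(cold)."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    deltas = []
    for _ in range(resamples):
        cold_sample = rng.choices(cold, k=len(cold))
        cached_sample = rng.choices(cached, k=len(cached))
        deltas.append(
            percentile_nearest_rank(cached_sample, 95)
            - percentile_nearest_rank(cold_sample, 95)
        )
    deltas.sort()
    lower = deltas[resamples * 25 // 1000]
    upper = deltas[resamples * 975 // 1000 - 1]
    return lower, upper
```

A negative interval that excludes zero would suggest the cached phase has a lower p95; with only 8 repeats per cell, the p95 is effectively the maximum, so such intervals should be read cautiously.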

Merge smaller isolated metrics chunks into a single stability source:

modal run modal_app.py --mode vllm-prefix-cache-isolated-merge \
  --prefix-cache-isolated-merge-dirs results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-smoke-r3,results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-chunk-r3-seed680,results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-chunk-r2-seed790 \
  --prefix-cache-shared-profile shared_prefix_extra_long_no_repeat_variant \
  --prefix-cache-control-profile matched_unique_prefix_extra_long_no_repeat_variant \
  --output-dir results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-merged-r8

modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
  --prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-merged-r8 \
  --prefix-cache-shared-profile shared_prefix_extra_long_no_repeat_variant \
  --prefix-cache-control-profile matched_unique_prefix_extra_long_no_repeat_variant \
  --output-dir results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-merged-r8-summary

Modal training is documented in docs/modal-training.md.

Run tests:

python3 -m unittest discover -s tests

Current Status

The repo is at Week 1. The simulator is intentionally simple. Its purpose is to make the measurement model explicit before we attach PyTorch, vLLM, SGLang, TensorRT-LLM, Triton kernels, or real GPUs.

The research focus is documented in docs/research-focus-kv-cache.md.

The first KV-cache pressure baseline is documented in docs/baseline-kv-pressure.md.

Workload generation is documented in docs/workload-generation.md.

Scheduler policies are documented in docs/scheduler-policies.md.

Capacity-aware scheduling is documented in docs/capacity-aware-scheduling.md.

The first capacity sweep is documented in docs/experiment-001-capacity-sweep.md.

The first Modal remote execution path is documented in docs/modal-training.md.
