Research-grade learning repo for ML inference systems experiments.
The project is now centered on KV-cache behavior in LLM serving. The goal is to build a reproducible ML inference systems artifact with clear workload definitions, request lifecycle traces, benchmark methodology, scheduler experiments, KV-cache pressure studies, and eventually backend comparisons against real inference engines.
The current report, docs/prefix-cache-study.md,
summarizes the no-repeat vLLM prefix-cache result, the longer-context follow-ups,
and the L4 context/model-size follow-ups. The primary no-repeat claim is stable
measured-window KV-cache reuse plus a p95 first-event/TTFT improvement; the
longer Qwen follow-ups now show broader favorable timing when prefill work is
larger.
Generated result artifacts:
results/prefix-cache-study/key-results.md
and results/prefix-cache-study/intervals.md
Latest extra-long follow-up:
results/prefix-cache-study-extra-long-merged-r8/key-results.md
Latest model-shape follow-up:
results/prefix-cache-study-extra-long-qwen05b-merged-r8/key-results.md
Latest longer-context follow-up:
results/prefix-cache-study-ultra-long-qwen05b-merged-r8/key-results.md
Latest L4-scale context follow-up:
results/prefix-cache-study-mega-long-qwen05b-l4-merged-r8/key-results.md
Latest L4 model-size follow-up:
results/prefix-cache-study-mega-long-qwen15b-l4-merged-r8/key-results.md
Latest L4 batch-pressure follow-up:
results/prefix-cache-study-mega-long-qwen15b-l4-n32-merged-r8/key-results.md
Latest L4 batch-pressure comparison:
results/prefix-cache-study-mega-long-qwen15b-l4-n16-vs-n32-r8/batch-pressure-comparison.md
Latest L4 capacity diagnostic:
results/modal-vllm-capacity-diagnostic-qwen15b-l4-n32-batched-tokens-r1/capacity-diagnostic.md
Latest L4 scheduler-control follow-up:
results/prefix-cache-study-mega-long-qwen15b-l4-n32-batched-tokens60640-merged-r8/key-results.md
Latest L4 serving-path scheduler-control smoke:
results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-phase-order-smoke-r1/phase-order-compare.csv
Latest L4 serving-path log summary:
results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-log-summary-r1/server-async-log-summary.csv
Latest L4 serving-path cache-control smoke:
results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-server-cache-control-log-summary-r1/server-async-log-summary.csv
Latest L4 serving-path cache-control phase-order matrix:
results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-server-cache-control-phase-order-r2/cache-control-phase-order-compare.json
Latest L4 serving-path cache-control multitrial aggregate:
results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-server-cache-control-multitrial-r2/cache-control-multitrial.json
Latest L4 serving-path server-only cache-control comparison:
results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-server-cache-control-server-absolute-r2/server-cache-control-absolute.json
Latest prompt-overlap vs server cache-control check:
results/modal-vllm-prefix-cache-prompt-overlap-server-cache-control-qwen15b-n32-r2/prompt-overlap-server-compare.md
Latest single-profile server cache-control isolation:
results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-single-profile-server-cache-control-r1/server-cache-control-absolute.json
Latest server /metrics no-warmup cache-control:
results/modal-vllm-server-async-qwen15b-l4-n32-batched-tokens60640-nowarmup-server-cache-control-r1-r2/server-cache-control-absolute.json
Latest server cache-pressure curve:
results/modal-vllm-server-async-qwen15b-l4-kvbudget-gpu045-gpu040-gpu035-gpu0325-n32-batched-tokens60640-seed3805-seed3906-pressure-curve-r1/server-cache-pressure-curve.md
Latest server KV-budget report:
results/modal-vllm-server-async-qwen15b-l4-kvbudget-report-r1/kv-budget-report.md
Latest server KV-budget request-count comparison:
results/modal-vllm-server-async-qwen15b-l4-kvbudget-gpu0325-n16-seed4107-seed4208-vs-n32-request-count-r1/kv-budget-request-count-comparison.md
Latest server KV-budget failure-bound probe:
results/modal-vllm-server-async-qwen15b-l4-kvbudget-gpu030-n32-batched-tokens60640-feasibility-failure-r1/failure.md
Latest vLLM server CLI evidence:
results/modal-vllm-server-cli-help-v021/vllm-server-cli-help.json
- Build a static benchmark explorer UI for GitHub Pages that loads generated JSON/CSV artifacts and compares runs by model, GPU, prompt profile, request count, cache mode, phase order, throughput, latency, TPOT/TTFT, and observed prefix-cache hit rate.
This repo follows a systems-focused path:
- Request lifecycle, KV-cache accounting, and measurement vocabulary
- Profiling and trace discipline
- Triton primitives for inference hot paths
- KV cache, batching, and scheduling policies
- Quantization and decoding tradeoffs
- Serving APIs, streaming, cancellation, and metrics
- Distributed inference and placement policies
- Trace-driven workload realism
- Artifact report and reproducibility package
Week 1 defines the basic serving lifecycle before any real model backend exists. The initial code provides:
- A workload schema for inference requests
- Validation for prompt/output token counts and arrival times
- A deterministic FIFO request lifecycle simulator
- Per-stage traces for queueing, tokenization, prefill, decode, and streaming
- Per-request KV-cache footprint estimates
- Active KV-cache timeline metrics
- Summary metrics for end-to-end latency, queue wait, throughput, and memory pressure
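The per-request KV-cache footprint estimate listed above can be sketched as follows. This is a minimal illustration using the standard key-plus-value accounting, not the repo's actual code: the `ModelShape` class, its field names, and the example values are assumptions and may not match the schema in `configs/models/`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model shape; field names are illustrative only."""
    num_layers: int
    num_kv_heads: int       # fewer than attention heads under GQA
    head_dim: int
    bytes_per_element: int  # 2 for fp16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Each layer stores one key and one value vector per KV head per token.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.bytes_per_element


def request_kv_bytes(shape: ModelShape, prompt_tokens: int, output_tokens: int) -> int:
    # Upper-bound estimate: every prompt and generated token holds a cache entry.
    return kv_bytes_per_token(shape) * (prompt_tokens + output_tokens)


# Assumed example values: 32 layers, 8 KV heads, head_dim 128, fp16.
shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)
per_token = kv_bytes_per_token(shape)  # 131072 bytes (128 KiB)
peak = request_kv_bytes(shape, prompt_tokens=512, output_tokens=128)  # 80 MiB
```

With these assumed values, a single 640-token request holds about 80 MiB of KV cache at its peak, which is why concurrency and context length dominate memory pressure.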
Run the starter workload:
python3 scripts/replay_workload.py workloads/week01_mixed_requests.json \
--model-config configs/models/llama-7b-gqa-fp16.json

Generate a deterministic bursty workload:
python3 scripts/generate_workload.py mixed_bursty \
--requests 32 \
--seed 1337 \
--output workloads/generated/mixed_bursty_32_seed1337.json

Replay it with overlapping FIFO request slots:
python3 scripts/replay_workload.py workloads/generated/mixed_bursty_32_seed1337.json \
--model-config configs/models/llama-7b-gqa-fp16.json \
--max-concurrent-requests 4 \
--scheduler-policy fifo

Compare a scheduler policy:
python3 scripts/replay_workload.py workloads/generated/mixed_bursty_32_seed1337.json \
--model-config configs/models/llama-7b-gqa-fp16.json \
--max-concurrent-requests 4 \
--scheduler-policy shortest-cache

Replay with a constrained KV-cache budget:
python3 scripts/replay_workload.py workloads/generated/mixed_bursty_32_seed1337.json \
--model-config configs/models/llama-7b-gqa-fp16.json \
--capacity-config configs/capacity/tight-1gb-kv.json \
--max-concurrent-requests 4 \
--scheduler-policy memory-aware-deadline

Run the first reproducible capacity sweep:
python3 scripts/run_sweep.py

The sweep writes JSON and CSV results to
results/experiment-001-capacity-sweep
and is documented in
docs/experiment-001-capacity-sweep.md.
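To make the interaction between `--max-concurrent-requests` and a KV-cache budget concrete, here is a minimal sketch of strict FIFO admission under both limits. It is an illustration under stated assumptions, not the simulator in `scripts/replay_workload.py`: the `Request` fields and the `replay_fifo` helper are hypothetical, and each request is assumed to hold its peak KV footprint for its whole service time.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    request_id: str
    arrival_s: float
    service_s: float  # assumed total prefill + decode time
    kv_bytes: int     # assumed peak KV-cache footprint


def replay_fifo(requests, max_concurrent, kv_budget_bytes):
    """Return {request_id: (start_s, finish_s)} under strict FIFO admission."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    pending = sorted(requests, key=lambda r: (r.arrival_s, r.request_id))
    running = []  # min-heap of (finish_s, request_id, kv_bytes)
    kv_in_use = 0
    now = 0.0
    schedule = {}
    for req in pending:
        if req.kv_bytes > kv_budget_bytes:
            raise ValueError(f"{req.request_id} can never fit in the KV budget")
        now = max(now, req.arrival_s)
        # Release every request that has already finished by `now`.
        while running and running[0][0] <= now:
            _, _, freed = heapq.heappop(running)
            kv_in_use -= freed
        # Head-of-line blocking: wait until a slot and enough KV budget are free.
        while len(running) >= max_concurrent or kv_in_use + req.kv_bytes > kv_budget_bytes:
            finish_s, _, freed = heapq.heappop(running)
            kv_in_use -= freed
            now = max(now, finish_s)
        finish = now + req.service_s
        heapq.heappush(running, (finish, req.request_id, req.kv_bytes))
        kv_in_use += req.kv_bytes
        schedule[req.request_id] = (now, finish)
    return schedule


requests = [
    Request("a", arrival_s=0.0, service_s=4.0, kv_bytes=600),
    Request("b", arrival_s=0.0, service_s=2.0, kv_bytes=600),
    Request("c", arrival_s=1.0, service_s=1.0, kv_bytes=300),
]
schedule = replay_fifo(requests, max_concurrent=2, kv_budget_bytes=1000)
```

In this toy workload two slots are available, but `a` and `b` cannot overlap because together they exceed the budget, so `b` waits until `a` finishes and `c` queues behind `b`. A capacity sweep varies exactly these two limits.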
Run the first Modal remote sweep:
modal run modal_app.py

Run the first Modal GPU probe:
modal run modal_app.py --mode gpu-probe

Run the first Modal tiny inference timing:
modal run modal_app.py --mode tiny-inference

Run the first Modal vLLM inference baseline:
modal run modal_app.py --mode vllm-inference

Inspect vLLM cache-metrics support in the Modal image:
modal run modal_app.py --mode vllm-cache-metrics-probe

Run a tiny GPU-backed vLLM cache-metrics smoke:
modal run modal_app.py --mode vllm-cache-metrics-smoke \
--prompt-profile shared_prefix_long \
--prompt-count 4 \
--max-new-tokens 8

Run the first Modal vLLM streaming timing:
modal run modal_app.py --mode vllm-streaming

Run the first OpenAI-compatible vLLM server streaming smoke:
modal run modal_app.py --mode vllm-server-streaming

Run the first OpenAI-compatible vLLM server concurrent streaming workload:
modal run modal_app.py --mode vllm-server-concurrent

Run the repeated OpenAI-compatible vLLM server sweep:
modal run modal_app.py --mode vllm-server-sweep \
--prompt-profiles short \
--output-tokens 32 \
--repeats 3 \
--warmup-runs 1

Compare the repeated server sweep against the in-process AsyncLLM sweep:
modal run modal_app.py --mode vllm-server-sweep-compare

Run a paired same-worker server-vs-AsyncLLM benchmark:
modal run modal_app.py --mode vllm-server-async-paired \
--prompt-profiles short \
--output-tokens 32 \
--server-async-prefix-caching off \
--repeats 3 \
--warmup-runs 1

Run the server-first phase-order control:
modal run modal_app.py --mode vllm-server-async-paired \
--phase-order server_first \
--prompt-profiles short \
--output-tokens 32 \
--repeats 3 \
--warmup-runs 1

Compare paired phase orders:
modal run modal_app.py --mode vllm-server-async-phase-order-compare

Aggregate multiple phase-order comparisons:
modal run modal_app.py --mode vllm-server-async-multitrial-aggregate

Run the long-prompt phase-order workload:
modal run modal_app.py --mode vllm-server-async-paired \
--phase-order async_first \
--prompt-profiles long \
--output-tokens 32 \
--repeats 3 \
--warmup-runs 1 \
--output-dir results/modal-vllm-server-async-paired-long-async-first
modal run modal_app.py --mode vllm-server-async-paired \
--phase-order server_first \
--prompt-profiles long \
--output-tokens 32 \
--repeats 3 \
--warmup-runs 1 \
--output-dir results/modal-vllm-server-async-paired-long-server-first

Compare the long-prompt phase orders and then summarize short-vs-long workload sensitivity:
modal run modal_app.py --mode vllm-server-async-phase-order-compare \
--async-first-paired-dir results/modal-vllm-server-async-paired-long-async-first \
--server-first-paired-dir results/modal-vllm-server-async-paired-long-server-first \
--output-dir results/modal-vllm-server-async-phase-order-compare-long
modal run modal_app.py --mode vllm-server-async-workload-compare

Run the first Modal vLLM concurrent workload:
modal run modal_app.py --mode vllm-concurrent

Run the first Modal vLLM concurrency/context sweep:
modal run modal_app.py --mode vllm-sweep

Run the repeated sweep with an explicit shuffle seed:
modal run modal_app.py --mode vllm-sweep --repeats 3 --scenario-seed 1337

Run the paired prefix-cache sweep and comparison:
modal run modal_app.py --mode vllm-sweep --prefix-caching on
modal run modal_app.py --mode vllm-prefix-cache-compare

Run the shared-prefix KV-cache pilot:
modal run modal_app.py --mode vllm-sweep \
--prompt-profiles shared_prefix \
--output-tokens 32 \
--request-counts 1,2,4,8 \
--repeats 3 \
--scenario-seed 1337 \
--prefix-caching off \
--output-dir results/modal-vllm-shared-prefix-cold
modal run modal_app.py --mode vllm-sweep \
--prompt-profiles shared_prefix \
--output-tokens 32 \
--request-counts 1,2,4,8 \
--repeats 3 \
--scenario-seed 1337 \
--prefix-caching on \
--output-dir results/modal-vllm-shared-prefix-cache
modal run modal_app.py --mode vllm-prefix-cache-compare \
--cold-sweep-dir results/modal-vllm-shared-prefix-cold \
--prefix-sweep-dir results/modal-vllm-shared-prefix-cache \
--output-dir results/modal-vllm-shared-prefix-cache-compare

Run the paired same-worker prefix-cache control:
modal run modal_app.py --mode vllm-prefix-cache-paired \
--prompt-profiles shared_prefix \
--output-tokens 32 \
--request-counts 1,2,4,8 \
--repeats 3 \
--warmup-runs 1 \
--scenario-seed 1337 \
--phase-order cold_first
modal run modal_app.py --mode vllm-prefix-cache-paired \
--prompt-profiles shared_prefix \
--output-tokens 32 \
--request-counts 1,2,4,8 \
--repeats 3 \
--warmup-runs 1 \
--scenario-seed 1337 \
--phase-order cache_first
modal run modal_app.py --mode vllm-prefix-cache-phase-order-compare

Run the long shared-prefix versus matched unique-prefix control:
modal run modal_app.py --mode vllm-prefix-cache-paired \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 32 \
--request-counts 1,2,4,8 \
--repeats 3 \
--warmup-runs 1 \
--scenario-seed 1337 \
--phase-order cold_first \
--output-dir results/modal-vllm-prefix-cache-long-control-paired
modal run modal_app.py --mode vllm-prefix-cache-paired \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 32 \
--request-counts 1,2,4,8 \
--repeats 3 \
--warmup-runs 1 \
--scenario-seed 1337 \
--phase-order cache_first \
--output-dir results/modal-vllm-prefix-cache-long-control-paired-cache-first
modal run modal_app.py --mode vllm-prefix-cache-phase-order-compare \
--prefix-cache-cold-first-paired-dir results/modal-vllm-prefix-cache-long-control-paired \
--prefix-cache-cache-first-paired-dir results/modal-vllm-prefix-cache-long-control-paired-cache-first \
--output-dir results/modal-vllm-prefix-cache-long-control-phase-order
modal run modal_app.py --mode vllm-prefix-cache-profile-control \
--prefix-cache-phase-order-compare-dir results/modal-vllm-prefix-cache-long-control-phase-order \
--output-dir results/modal-vllm-prefix-cache-long-control-profile-control
# Repeat the paired run with --scenario-seed 569 and trial2 output dirs, then:
modal run modal_app.py --mode vllm-prefix-cache-profile-multitrial \
--prefix-cache-profile-control-dirs results/modal-vllm-prefix-cache-long-control-profile-control,results/modal-vllm-prefix-cache-long-control-profile-control-trial2 \
--output-dir results/modal-vllm-prefix-cache-long-control-multitrial

Run a small metrics-enabled paired prefix-cache smoke:
modal run modal_app.py --mode vllm-prefix-cache-paired \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 8 \
--request-counts 4 \
--repeats 1 \
--warmup-runs 1 \
--scenario-seed 570 \
--phase-order cold_first \
--cache-metrics on \
--kv-cache-metrics-sample 1.0 \
--output-dir results/modal-vllm-prefix-cache-metrics-paired-smoke
modal run modal_app.py --mode vllm-prefix-cache-paired \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 8 \
--request-counts 4 \
--repeats 1 \
--warmup-runs 1 \
--scenario-seed 570 \
--phase-order cache_first \
--cache-metrics on \
--kv-cache-metrics-sample 1.0 \
--output-dir results/modal-vllm-prefix-cache-metrics-paired-smoke-cache-first
modal run modal_app.py --mode vllm-prefix-cache-phase-order-compare \
--prefix-cache-cold-first-paired-dir results/modal-vllm-prefix-cache-metrics-paired-smoke \
--prefix-cache-cache-first-paired-dir results/modal-vllm-prefix-cache-metrics-paired-smoke-cache-first \
--output-dir results/modal-vllm-prefix-cache-metrics-phase-order-smoke
modal run modal_app.py --mode vllm-prefix-cache-profile-control \
--prefix-cache-phase-order-compare-dir results/modal-vllm-prefix-cache-metrics-phase-order-smoke \
--output-dir results/modal-vllm-prefix-cache-metrics-profile-control-smoke

Run a small repeated metrics-enabled prefix-cache trial:
modal run modal_app.py --mode vllm-prefix-cache-paired \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 8 \
--request-counts 2,4,8 \
--repeats 2 \
--warmup-runs 1 \
--scenario-seed 571 \
--phase-order cold_first \
--cache-metrics on \
--kv-cache-metrics-sample 1.0 \
--output-dir results/modal-vllm-prefix-cache-metrics-repeated
modal run modal_app.py --mode vllm-prefix-cache-paired \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 8 \
--request-counts 2,4,8 \
--repeats 2 \
--warmup-runs 1 \
--scenario-seed 571 \
--phase-order cache_first \
--cache-metrics on \
--kv-cache-metrics-sample 1.0 \
--output-dir results/modal-vllm-prefix-cache-metrics-repeated-cache-first
modal run modal_app.py --mode vllm-prefix-cache-phase-order-compare \
--prefix-cache-cold-first-paired-dir results/modal-vllm-prefix-cache-metrics-repeated \
--prefix-cache-cache-first-paired-dir results/modal-vllm-prefix-cache-metrics-repeated-cache-first \
--output-dir results/modal-vllm-prefix-cache-metrics-repeated-phase-order
modal run modal_app.py --mode vllm-prefix-cache-profile-control \
--prefix-cache-phase-order-compare-dir results/modal-vllm-prefix-cache-metrics-repeated-phase-order \
--output-dir results/modal-vllm-prefix-cache-metrics-repeated-profile-control

Run a fresh-engine isolated cache-metrics trial:
modal run modal_app.py --mode vllm-prefix-cache-isolated-metrics \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 8 \
--request-counts 2,4 \
--repeats 1 \
--scenario-seed 572 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--output-dir results/modal-vllm-prefix-cache-isolated-metrics

Run the repeated isolated cache-metrics stability trial:
modal run modal_app.py --mode vllm-prefix-cache-isolated-metrics \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 8 \
--request-counts 2,4,8 \
--repeats 2 \
--scenario-seed 573 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--output-dir results/modal-vllm-prefix-cache-isolated-metrics-repeated

Generate the report-ready isolated cache stability summary:
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-metrics-repeated \
--output-dir results/modal-vllm-prefix-cache-isolated-stability-summary

Run a warmed-window isolated cache trial. This adds one throwaway warmup scenario run inside each fresh cold/cache engine before the measured scenario:
modal run modal_app.py --mode vllm-prefix-cache-isolated-warm-window \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 8 \
--request-counts 2,4,8 \
--repeats 1 \
--scenario-seed 574 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--output-dir results/modal-vllm-prefix-cache-isolated-warm-window

Generate a compact summary for that warmed-window trial:
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-warm-window \
--output-dir results/modal-vllm-prefix-cache-isolated-warm-window-summary

Run the neutral-warmup isolated cache trial. This warms each fresh engine with
neutral_long prompts instead of the measured shared/control prompt bodies:
modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 8 \
--request-counts 2,4,8 \
--repeats 1 \
--scenario-seed 575 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--output-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup

Generate the neutral-warmup summary:
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup \
--output-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-summary

Generate a measured-window estimate from a neutral-warmup run that includes warmup-before cache metrics:
modal run modal_app.py --mode vllm-prefix-cache-isolated-window-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-window-source \
--output-dir results/modal-vllm-prefix-cache-isolated-window-summary

Run the direct-counter neutral-warmup trial. This records vLLM
CachingMetrics query/hit deltas around each measured scenario:
modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 8 \
--request-counts 2,4,8 \
--repeats 1 \
--scenario-seed 577 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--output-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-counters

Generate the counter-aware summary:
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-counters \
--output-dir results/modal-vllm-prefix-cache-isolated-counter-summary

Run the repeated direct-counter stability trial and summary:
modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
--prompt-profiles shared_prefix_long,matched_unique_prefix \
--output-tokens 8 \
--request-counts 2,4,8 \
--repeats 3 \
--scenario-seed 577 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--output-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-counter-stability
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-isolated-neutral-warmup-counter-stability \
--output-dir results/modal-vllm-prefix-cache-isolated-counter-stability-summary

Run a variant prompt-family smoke so repeats can vary the workload family:
modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
--prompt-profiles shared_prefix_long_variant,matched_unique_prefix_variant \
--output-tokens 8 \
--request-counts 4 \
--repeats 2 \
--scenario-seed 577 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--prefix-cache-shared-profile shared_prefix_long_variant \
--prefix-cache-control-profile matched_unique_prefix_variant \
--output-dir results/modal-vllm-prefix-cache-variant-smoke
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-variant-smoke \
--prefix-cache-shared-profile shared_prefix_long_variant \
--prefix-cache-control-profile matched_unique_prefix_variant \
--output-dir results/modal-vllm-prefix-cache-variant-smoke-summary

Run the full variant stability grid and summary:
modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
--prompt-profiles shared_prefix_long_variant,matched_unique_prefix_variant \
--output-tokens 8 \
--request-counts 2,4,8 \
--repeats 3 \
--scenario-seed 577 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--prefix-cache-shared-profile shared_prefix_long_variant \
--prefix-cache-control-profile matched_unique_prefix_variant \
--output-dir results/modal-vllm-prefix-cache-variant-stability
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-variant-stability \
--prefix-cache-shared-profile shared_prefix_long_variant \
--prefix-cache-control-profile matched_unique_prefix_variant \
--output-dir results/modal-vllm-prefix-cache-variant-stability-summary

Audit prompt token and cache-block alignment for the variant grid:
modal run modal_app.py --mode vllm-prefix-cache-prompt-audit \
--prompt-profiles shared_prefix_long_variant,matched_unique_prefix_variant \
--output-tokens 8 \
--request-counts 2,4,8 \
--repeats 3 \
--scenario-seed 577 \
--kv-cache-block-size 16 \
--output-dir results/modal-vllm-prefix-cache-prompt-audit-variant-stability

Run the n=16 variant timing probe and summary:
modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
--prompt-profiles shared_prefix_long_variant,matched_unique_prefix_variant \
--output-tokens 8 \
--request-counts 16 \
--repeats 3 \
--scenario-seed 577 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--prefix-cache-shared-profile shared_prefix_long_variant \
--prefix-cache-control-profile matched_unique_prefix_variant \
--output-dir results/modal-vllm-prefix-cache-variant-n16
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-variant-n16 \
--prefix-cache-shared-profile shared_prefix_long_variant \
--prefix-cache-control-profile matched_unique_prefix_variant \
--output-dir results/modal-vllm-prefix-cache-variant-n16-summary

Audit prompt token, cache-block, and exact duplicate alignment for the n=16 variant probe:
modal run modal_app.py --mode vllm-prefix-cache-prompt-audit \
--prompt-profiles shared_prefix_long_variant,matched_unique_prefix_variant \
--output-tokens 8 \
--request-counts 16 \
--repeats 3 \
--scenario-seed 577 \
--kv-cache-block-size 16 \
--output-dir results/modal-vllm-prefix-cache-prompt-audit-variant-n16

Run the no-repeat n=16 variant probe, which removes exact duplicate prompts from the matched control:
modal run modal_app.py --mode vllm-prefix-cache-prompt-audit \
--prompt-profiles shared_prefix_long_no_repeat_variant,matched_unique_prefix_no_repeat_variant \
--output-tokens 8 \
--request-counts 16 \
--repeats 3 \
--scenario-seed 577 \
--kv-cache-block-size 16 \
--output-dir results/modal-vllm-prefix-cache-prompt-audit-no-repeat-n16
modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
--prompt-profiles shared_prefix_long_no_repeat_variant,matched_unique_prefix_no_repeat_variant \
--output-tokens 8 \
--request-counts 16 \
--repeats 3 \
--scenario-seed 577 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
--prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
--output-dir results/modal-vllm-prefix-cache-no-repeat-n16
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-no-repeat-n16 \
--prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
--prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
--output-dir results/modal-vllm-prefix-cache-no-repeat-n16-summary

Run the no-repeat request-count scaling smoke:
modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
--prompt-profiles shared_prefix_long_no_repeat_variant,matched_unique_prefix_no_repeat_variant \
--output-tokens 8 \
--request-counts 8,12,16,20 \
--repeats 2 \
--scenario-seed 577 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
--prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
--output-dir results/modal-vllm-prefix-cache-no-repeat-scaling-smoke
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-no-repeat-scaling-smoke \
--prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
--prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
--output-dir results/modal-vllm-prefix-cache-no-repeat-scaling-smoke-summary

Run the no-repeat n=16 fixed-shape stability pass:
modal run modal_app.py --mode vllm-prefix-cache-isolated-neutral-warmup \
--prompt-profiles shared_prefix_long_no_repeat_variant,matched_unique_prefix_no_repeat_variant \
--output-tokens 8 \
--request-counts 16 \
--repeats 8 \
--scenario-seed 577 \
--phase-order cold_first \
--kv-cache-metrics-sample 1.0 \
--prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
--prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
--output-dir results/modal-vllm-prefix-cache-no-repeat-n16-stability-r8
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-no-repeat-n16-stability-r8 \
--prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
--prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
--output-dir results/modal-vllm-prefix-cache-no-repeat-n16-stability-r8-summary

Regenerate the no-repeat n=16 stability summary with TTFT/first-event and stream TPOT profile-control intervals:
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-no-repeat-n16-stability-r8 \
--prefix-cache-shared-profile shared_prefix_long_no_repeat_variant \
--prefix-cache-control-profile matched_unique_prefix_no_repeat_variant \
--output-dir results/modal-vllm-prefix-cache-no-repeat-n16-stability-r8-summary-ttft

Merge smaller isolated metrics chunks into a single stability source:
modal run modal_app.py --mode vllm-prefix-cache-isolated-merge \
--prefix-cache-isolated-merge-dirs results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-smoke-r3,results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-chunk-r3-seed680,results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-chunk-r2-seed790 \
--prefix-cache-shared-profile shared_prefix_extra_long_no_repeat_variant \
--prefix-cache-control-profile matched_unique_prefix_extra_long_no_repeat_variant \
--output-dir results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-merged-r8
modal run modal_app.py --mode vllm-prefix-cache-isolated-stability-summary \
--prefix-cache-isolated-metrics-dir results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-merged-r8 \
--prefix-cache-shared-profile shared_prefix_extra_long_no_repeat_variant \
--prefix-cache-control-profile matched_unique_prefix_extra_long_no_repeat_variant \
--output-dir results/modal-vllm-prefix-cache-extra-long-no-repeat-n16-merged-r8-summary

Modal training is documented in
docs/modal-training.md.
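The direct-counter trials above record query/hit deltas around each measured scenario so that warmup traffic does not dilute the measured-window hit rate. A minimal sketch of that delta calculation follows. The `CounterSnapshot` class and `window_hit_rate` helper are hypothetical and stand in for whatever cumulative counters the engine exposes; they are not vLLM's metrics API, and the counter units are left unspecified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CounterSnapshot:
    """Hypothetical cumulative prefix-cache counters read at one instant."""
    queries: int
    hits: int


def window_hit_rate(before: CounterSnapshot, after: CounterSnapshot) -> Optional[float]:
    """Hit rate over the measured window only, or None if nothing was queried."""
    delta_queries = after.queries - before.queries
    delta_hits = after.hits - before.hits
    if delta_queries < 0 or delta_hits < 0:
        raise ValueError("counters went backwards; was the engine restarted?")
    if delta_queries == 0:
        return None
    return delta_hits / delta_queries


# Illustrative numbers only: warmup issued 4096 queries with no hits.
after_warmup = CounterSnapshot(queries=4096, hits=0)
after_measured = CounterSnapshot(queries=12288, hits=6144)

measured = window_hit_rate(after_warmup, after_measured)      # 0.75
cumulative = after_measured.hits / after_measured.queries     # 0.5
```

In this made-up example the cumulative rate understates reuse in the measured window (0.5 versus 0.75), which is the distortion the before/after snapshots are meant to remove.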
Run tests:
python3 -m unittest discover -s tests

The repo is at Week 1. The simulator is intentionally simple. Its purpose is to make the measurement model explicit before we attach PyTorch, vLLM, SGLang, TensorRT-LLM, Triton kernels, or real GPUs.
The research focus is documented in
docs/research-focus-kv-cache.md.
The first KV-cache pressure baseline is documented in
docs/baseline-kv-pressure.md.
Workload generation is documented in
docs/workload-generation.md.
Scheduler policies are documented in
docs/scheduler-policies.md.
Capacity-aware scheduling is documented in
docs/capacity-aware-scheduling.md.
The first capacity sweep is documented in
docs/experiment-001-capacity-sweep.md.
The first Modal remote execution path is documented in
docs/modal-training.md.