Semantic donor discovery for engine-native KV caches.
SemBlend extends exact-prefix KV caching in inference engines with semantic donor discovery. When a prompt is semantically similar to a cached one but lexically different - different instruction phrasing, sentence order, or template fields - SemBlend finds a safe donor and plans reuse of the engine-owned KV state, replacing a multi-second prefill with sub-second KV retrieval on the right workloads.
Exact prefix cache alone: semantically similar prompt -> 0% hit -> full prefill
Engine cache + SemBlend: 83-100% hit -> reuse donor KV
SemBlend's current vLLM compatibility path can run through LMCache. Future vLLM integration work is being explored through engine-native interfaces.
Measured on A10G GPU (0.85 utilization), Qwen2.5-7B-AWQ, vLLM 0.14.1 + the current LMCache compatibility path. All results from live benchmarks on real HuggingFace datasets with fresh pod isolation (n=15 per cell).
| Context | Cold TTFT | SemBlend TTFT | Speedup |
|---|---|---|---|
| 4K | 2,102 ms | 433 ms | 4.9x |
| 8K | 3,816 ms | 539 ms | 7.1x |
| 12K | 5,655 ms | 648 ms | 8.7x |
| 16K | 7,635 ms | 760 ms | 10.0x |
| 24K | 11,977 ms | 972 ms | 12.3x |
SemBlend TTFT stays under 1 second across the measured 4K-24K range. Speedup grows with context length because cold prefill grows by roughly 490 ms per 1K tokens, while SemBlend TTFT grows by only about 27 ms per 1K tokens as it loads cached KV instead of recomputing it.
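As a check on the table above, the speedup column and the per-1K-token growth of each TTFT series can be recomputed from the reported numbers. The snippet uses only the measurements from the table; the variable names are illustrative.

```python
# TTFT measurements copied from the table above:
# context length (K tokens) -> (cold TTFT ms, SemBlend TTFT ms)
ttft_ms = {
    4: (2102, 433),
    8: (3816, 539),
    12: (5655, 648),
    16: (7635, 760),
    24: (11977, 972),
}

# Speedup is cold TTFT divided by SemBlend TTFT.
speedups = {ctx: round(cold / warm, 1) for ctx, (cold, warm) in ttft_ms.items()}

# Average growth in ms per additional 1K tokens between the 4K and 24K rows.
span_k = 24 - 4
cold_slope = (ttft_ms[24][0] - ttft_ms[4][0]) / span_k
semblend_slope = (ttft_ms[24][1] - ttft_ms[4][1]) / span_k

for ctx, speedup in speedups.items():
    print(f"{ctx}K: {speedup}x")
print(f"cold: {cold_slope:.0f} ms per 1K tokens")
print(f"SemBlend: {semblend_slope:.0f} ms per 1K tokens")
```

Both series grow with context length; the speedup widens because the cold slope is much steeper than the SemBlend slope.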
Speedups are similar across content types, which suggests SemBlend is largely content-agnostic:
| Dataset | Content Type | 8K Speedup | 16K Speedup | 24K Speedup |
|---|---|---|---|---|
| XSum | News summaries | 7.1x | 10.0x | 12.3x |
| CNN/DailyMail | Long-form journalism | 7.1x | 9.4x | 12.2x |
| MultiNews | Multi-document news | -- | 9.3x | -- |
Quality validated across 5 datasets, 4-5 context lengths each, with PPL ratio + LLM-as-judge faithfulness scoring (360 total runs):
| Dataset | PPL Range | Status | Judge (Cold) | Judge (SemBlend) | Faithful |
|---|---|---|---|---|---|
| XSum | 1.018-1.054 | PASS | 0.84 | 0.84 | 100% |
| CNN/DailyMail | 1.011-1.049 | PASS | 0.87 | 0.86 | 97% |
| WikiHow | 0.987-1.037 | PASS | 0.82 | 0.84 | 97% |
| MultiNews | 0.958-1.064 | PASS | 0.79 | 0.78 | 100% |
| SAMSum | 1.140-1.198 | ELEVATED | 0.78 | 0.86 | 87% |
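PPL ratio stays below 1.065 for 4 of 5 datasets at all context lengths. SAMSum shows an elevated PPL ratio, attributed to its short dialogue turns, although the LLM judge scores SemBlend output higher than cold output there (0.86 vs 0.78); its 87% faithfulness rate is the lowest of the five datasets. 24 dataset-length cells, 360 total runs.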
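For readers unfamiliar with the metric, the sketch below shows one common way a perplexity ratio can be computed from per-token log-probabilities. It is an illustration under stated assumptions, not SemBlend's evaluation code: the ratio is assumed to be SemBlend perplexity divided by cold-prefill perplexity, the 1.065 cut-off is inferred from the results table rather than taken from a documented setting, and the function names and sample values are made up.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(mean negative log-likelihood)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_ratio(semblend_logprobs, cold_logprobs):
    """Assumed definition: SemBlend perplexity divided by cold-prefill perplexity."""
    return perplexity(semblend_logprobs) / perplexity(cold_logprobs)


def status(ratio, threshold=1.065):
    # The threshold is inferred from the table above, not a documented setting.
    return "PASS" if ratio < threshold else "ELEVATED"


# Made-up log-probabilities, for illustration only.
cold = [-0.50, -0.40, -0.60]
semblend = [-0.52, -0.41, -0.61]

ratio = ppl_ratio(semblend, cold)
print(f"PPL ratio {ratio:.3f}: {status(ratio)}")
```

A ratio near 1.0 means the output generated on reused KV is about as likely under the model as the output from a full prefill.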
```bash
pip install semblend            # CPU-only core (numpy + rapidfuzz)
pip install semblend[vllm]      # + vLLM compatibility integration
pip install semblend[sglang]    # + SGLang integration
pip install semblend[embedder]  # + sentence-transformers (MiniLM GPU)
```

The current vLLM compatibility path integrates through vLLM's dynamic connector loading and may use LMCache for KV transfer. This path remains supported for users who want to reproduce the existing benchmark results.
```bash
pip install semblend[vllm] vllm lmcache
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
  --kv-transfer-config '{
    "kv_connector": "SemBlendConnectorV1",
    "kv_connector_module_path": "semblend.integration.vllm.connector_v1",
    "kv_role": "kv_both"
  }'
```

CacheBlend support: For selective layer recomputation (CacheBlend), vLLM must expose the loaded model to KV connectors via `initialize_worker_connector()`. This is available in vLLM builds that include PR #37339. Without it, SemBlend's semantic matching and KV injection still work; only CacheBlend's per-layer recomputation is unavailable.
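When the server is launched from a script rather than a shell, the `--kv-transfer-config` value shown above can be generated instead of hand-quoted. The sketch below builds the same JSON from a dictionary; it only assembles and prints the command and does not start vLLM. The variable names are illustrative.

```python
import json
import shlex

# Connector settings exactly as they appear in the command above.
kv_transfer_config = {
    "kv_connector": "SemBlendConnectorV1",
    "kv_connector_module_path": "semblend.integration.vllm.connector_v1",
    "kv_role": "kv_both",
}

argv = [
    "vllm", "serve", "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "--kv-transfer-config", json.dumps(kv_transfer_config),
]

# shlex.join quotes the JSON so the printed line can be pasted into a shell.
print(shlex.join(argv))
```

The `argv` list can also be passed directly to `subprocess.run`, which avoids shell quoting altogether.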
```bash
pip install semblend[sglang] sglang
# CLI launcher: applies the RadixCache patch automatically
semblend-sglang --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000
```

Or programmatically, called before SGLang initializes:

```python
from semblend.integration.sglang.radix_patcher import patch_radix_cache
patch_radix_cache()
# ... start SGLang server ...
```

A first-class SemanticPrefixProvider interface (no patching) is in progress upstream.
| Variable | Default | Description |
|---|---|---|
| `SEMBLEND_ENABLED` | `1` | Enable semantic donor search |
| `SEMBLEND_MIN_SIMILARITY` | `0.60` | Cosine similarity threshold |
| `SEMBLEND_EMBEDDER` | `minilm` | `minilm` (auto GPU) · `onnx_gpu` |
| `SEMBLEND_FUZZY_CHUNKS` | `0` | Fuzzy chunk matching for shifted prefixes |
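A deployment script may want to validate these variables before launching an engine. The sketch below reads them with the defaults from the table. It is not SemBlend's own loader: `SemBlendSettings` and `load_settings` are hypothetical names, and treating `"1"` as "on" and requiring the threshold to lie in [0, 1] are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SemBlendSettings:  # hypothetical container, not part of the semblend package
    enabled: bool
    min_similarity: float
    embedder: str
    fuzzy_chunks: bool


def load_settings(env=None):
    """Read the documented variables, falling back to the defaults in the table."""
    env = os.environ if env is None else env
    settings = SemBlendSettings(
        enabled=env.get("SEMBLEND_ENABLED", "1") == "1",          # assumption: "1" means on
        min_similarity=float(env.get("SEMBLEND_MIN_SIMILARITY", "0.60")),
        embedder=env.get("SEMBLEND_EMBEDDER", "minilm"),
        fuzzy_chunks=env.get("SEMBLEND_FUZZY_CHUNKS", "0") == "1",
    )
    # Assumption: a useful similarity threshold lies in [0, 1].
    if not 0.0 <= settings.min_similarity <= 1.0:
        raise ValueError("SEMBLEND_MIN_SIMILARITY is expected to lie in [0, 1]")
    if settings.embedder not in {"minilm", "onnx_gpu"}:
        raise ValueError(f"unknown embedder: {settings.embedder}")
    return settings


print(load_settings({}))
print(load_settings({"SEMBLEND_MIN_SIMILARITY": "0.75", "SEMBLEND_FUZZY_CHUNKS": "1"}))
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test.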
```
Request → Embed (2–15ms) → Search (1ms) → Align (1ms) → Inject KV
                ↓               ↓              ↓
          MiniLM-L6-v2     cosine search   MD5 chunk hash
          GPU (ONNX RT)    donor store     256-token boundary
          segmented pool
```
- Embed — full-document segmented embedding on GPU via ONNX-runtime. Long prompts are split into overlapping 256-token windows, embedded in parallel, and mean-pooled into a single vector. 100% content coverage at any prompt length (~2ms short, ~10ms at 8K, ~15ms at 32K).
- Search — brute-force cosine similarity against the local donor store (<1ms at 1K donors; optional ANN backends can be used for larger pools)
- Align — MD5 chunk hashing finds reusable 256-token KV chunks; optional fuzzy matching handles shifted boundaries
- Materialize — donor token IDs or engine-native block refs are handed back to the inference engine; the engine retrieves cached KV and SemBlend applies quality-gated correction or recomputation where supported
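The first three stages can be sketched with the standard library alone. This is a simplified illustration, not SemBlend's implementation: the embedding model call is omitted (only windowing and mean pooling are shown), the 50% window overlap and the way token IDs are serialized before hashing are assumptions, and all function names are hypothetical.

```python
import hashlib
import math

CHUNK = 256  # window and chunk size in tokens, from the description above


def windows(token_ids, size=CHUNK, stride=CHUNK // 2):
    """Split a prompt into overlapping windows (the 50% overlap is an assumption)."""
    if len(token_ids) <= size:
        return [token_ids]
    last = len(token_ids) - size
    starts = list(range(0, last, stride)) + [last]  # final window reaches the end
    return [token_ids[s:s + size] for s in starts]


def mean_pool(vectors):
    """Average per-window embeddings into a single document vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, donors, min_similarity=0.60):
    """Brute-force scan: best (donor_id, score) at or above the threshold, else None."""
    scored = ((donor_id, cosine(query_vec, vec)) for donor_id, vec in donors.items())
    best = max(scored, key=lambda pair: pair[1], default=None)
    return best if best and best[1] >= min_similarity else None


def chunk_hashes(token_ids, size=CHUNK):
    """MD5 digest of each full, boundary-aligned chunk of token IDs."""
    return [
        hashlib.md5(",".join(map(str, token_ids[i:i + size])).encode()).hexdigest()
        for i in range(0, len(token_ids) - size + 1, size)
    ]


def align(request_ids, donor_ids):
    """Map request chunk index -> donor chunk index for chunks with identical tokens."""
    donor_index = {h: i for i, h in enumerate(chunk_hashes(donor_ids))}
    return {i: donor_index[h]
            for i, h in enumerate(chunk_hashes(request_ids)) if h in donor_index}


# Toy data: the donor has four chunks; the request reuses two of them, reordered.
donor = list(range(1024))
request = list(range(256, 512)) + [9] * 256 + list(range(768, 1024))
match = search([0.5, 0.9], {"a": [1.0, 0.0], "b": [0.6, 0.8]})
print(match[0], align(request, donor))
```

In this toy case the first and third request chunks map to donor chunks 1 and 3, while the middle chunk has no match and would need a normal prefill. Exact hashing only matches chunks that start on a 256-token boundary, which is the gap the optional fuzzy matching is described as covering.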
Most effective when prompts share a large common context:
- Document Q&A / RAG — same retrieved passages, different questions
- Summarization — same article, different instruction phrasing
- Multi-turn dialogue — conversation history prefix reused across turns
- Code completion — shared repository context across requests
Dissimilar workloads (code generation from scratch, fully novel queries) see ~4% overhead with 0% hit — negligible in practice.
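To see how the miss overhead trades off against hits, a blended TTFT can be estimated for a given hit rate. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes the ~4% overhead is paid on top of cold prefill for every miss, and it borrows the 8K row of the TTFT table for the hit and cold costs. The function names are hypothetical.

```python
def expected_ttft_ms(hit_rate, cold_ms, hit_ms, miss_overhead=0.04):
    """Blended TTFT: hits pay the SemBlend TTFT, misses pay cold prefill plus overhead."""
    miss_ms = cold_ms * (1.0 + miss_overhead)
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


def break_even_hit_rate(cold_ms, hit_ms, miss_overhead=0.04):
    """Hit rate at which the blended TTFT equals plain cold prefill."""
    miss_ms = cold_ms * (1.0 + miss_overhead)
    return (miss_ms - cold_ms) / (miss_ms - hit_ms)


# 8K row of the TTFT table: 3,816 ms cold, 539 ms with SemBlend.
cold_8k, semblend_8k = 3816, 539

for rate in (0.0, 0.5, 0.83, 1.0):
    print(f"hit rate {rate:.0%}: {expected_ttft_ms(rate, cold_8k, semblend_8k):.0f} ms")
print(f"break-even hit rate: {break_even_hit_rate(cold_8k, semblend_8k):.1%}")
```

Under these assumptions the break-even hit rate at 8K is about 4.5%, so a workload needs only a small fraction of semantic hits before the overhead on misses is recovered.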
See CONTRIBUTING.md.
Built at WorldFlow AI. For enterprise support contact [email protected].