An Efficient and Versatile Inference Engine for Distributed LLM Serving
Updated Jul 5, 2026 - Python
A lightweight, educational LLM inference engine for studying continuous batching, paged KV cache, chunked prefill, and online serving.
A minimal, native Metal inference engine for Qwen3-30B-A3B on Apple Silicon Macs.