📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Updated Jun 23, 2026 - Python
Pure Rust + CUDA LLM inference engine — no PyTorch, OpenAI-compatible, serves Qwen3 to Kimi-K2
From teacher to tiles — a from-scratch LLM distillation & serving engine: custom Triton/CUDA kernels, FSDP distillation, paged-KV continuous batching, speculative decoding, a Rust gateway, a JAX oracle, and interpretability tooling.
A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching
Implementation of PagedAttention from the vLLM paper: an attention algorithm that treats the KV cache like virtual memory. Eliminates memory fragmentation, increases batch sizes, and improves LLM serving throughput.
High-performance On-Device MoA (Mixture of Agents) Engine in C++. Optimized for CPU inference with RadixCache & PagedAttention. (Tiny-MoA Native)
An AI inference engine from scratch. Like nanoGPT, but for serving.
🌱 A tiny, readable LLM serving engine with vLLM/SGLang-style features.
A production-grade, native Rust speculative inference engine for Apple Silicon with Metal GPU acceleration and paged attention.
Local-first LLM stack on a single RTX 5090: QLoRA fine-tuning, exact speculative decoding, paged KV-cache, and continuous batching — served via FastAPI with a live React dashboard.
High-Performance LLM Inference Engine with PagedAttention & Continuous Batching in Rust
vLLM - High-throughput, memory-efficient LLM inference engine with PagedAttention, continuous batching, CUDA/HIP optimization, quantization (GPTQ/AWQ/INT4/INT8/FP8), tensor/pipeline parallelism, OpenAI-compatible API, multi-GPU/TPU/Neuron support, prefix caching, and multi-LoRA capabilities
Pure-Zig LLM serving — paged attention, BF16 kernels, persistent thread pool, safetensors integration. 6.17× decode speedup. TinyLlama-1.1B end-to-end. 77 tests.
Rust/Python KV-cache block manager for MLA and hybrid-attention LLM serving, with content-addressed sharing, per-layer layouts, FP4/FP8 accounting, and distributed cache semantics.
A from-scratch LLM inference engine built in PyTorch with a custom GPT-2 transformer, KV cache, paged KV cache, continuous batching, and A100 benchmarks
Pure-Julia CPU PagedAttention with GPT-2 transformer integration
Intent-aware KV execution prototype for agentic long-context inference: semantic block selection, dynamic scoring, KV quantization modeling, speculative prefetch simulation, CPU references, and future Triton/CUDA kernels.
LangChain integration for Parallel Context-of-Experts Decoding (PCED)
Discrete-tick simulator for KV-cache memory compaction policies in LLM inference servers. Compares NoCompaction, GreedyCompaction and ThresholdCompaction via 2D parameter sweep, Pareto frontier analysis and latency impact (P95/P99). C++20 + Python.
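Several descriptions above say that paged attention treats the KV cache like virtual memory. As a rough illustration of that idea only, the sketch below keeps a per-sequence block table that maps logical token positions to fixed-size physical blocks drawn from a shared pool. The class name `PagedKVAllocator`, its methods, and the block sizes are hypothetical and are not taken from vLLM or any repository listed here; a real engine would also store the key/value tensors and handle sharing and eviction.

```python
class PagedKVAllocator:
    """Toy block allocator: maps each sequence's logical blocks to physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                    # tokens per block (illustrative)
        self.free = list(range(num_blocks - 1, -1, -1))  # pool of free physical block ids
        self.tables = {}                                # seq_id -> list of physical block ids
        self.lengths = {}                               # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:
            # The last block is full (or the sequence has no block yet):
            # take any free block, it does not need to be contiguous.
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
slots_a = [alloc.append_token("a") for _ in range(3)]
slot_b = alloc.append_token("b")
alloc.release("a")
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, and blocks freed by a finished sequence can be reused by any other sequence immediately.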