paged-attention

Here are 31 public repositories matching this topic...

xlite-dev / Awesome-LLM-Inference

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

mla vllm llm-inference awesome-llm flash-attention tensorrt-llm paged-attention deepseek flash-attention-3 deepseek-v3 minimax-01 deepseek-r1 flash-mla qwen3

Updated Jun 23, 2026
Python

openinfer-project / openinfer

Pure Rust + CUDA LLM inference engine — no PyTorch, OpenAI-compatible, serves Qwen3 to Kimi-K2

Updated Jul 5, 2026
Rust

zengxiao-he / tessera

From teacher to tiles — a from-scratch LLM distillation & serving engine: custom Triton/CUDA kernels, FSDP distillation, paged-KV continuous batching, speculative decoding, a Rust gateway, a JAX oracle, and interpretability tooling.

rust cuda pytorch triton quantization knowledge-distillation inference-engine jax kv-cache ml-systems llm mechanistic-interpretability fsdp flash-attention speculative-decoding paged-attention

Updated Jun 5, 2026
Python

lumia431 / photon_infer

A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching

modern-cpp inference-engine ai-infra vllm llm-inference paged-attention continuous-batching

Updated Jan 2, 2026
C++

VARUN3WARE / Paged-Attention

Implementation of PagedAttention from the vLLM paper - a breakthrough attention algorithm that treats the KV cache like virtual memory. Eliminates memory fragmentation, increases batch sizes, and dramatically improves LLM serving throughput.

memory-optimization kv-cache llm-inference paged-attention transformer-optimization

Updated Dec 3, 2025
Python

gyunggyung / Agent.cpp

High-performance On-Device MoA (Mixture of Agents) Engine in C++. Optimized for CPU inference with RadixCache & PagedAttention. (Tiny-MoA Native)

c cpp moa on-device-ai llm llamacpp llama-cpp ggml paged-attention cpu-optimization mixture-of-agents radix-attention

Updated Jan 25, 2026
C++

pjdurden / nanoserve

An AI inference engine from scratch. Like nanoGPT, but for serving.

machine-learning cuda inference pytorch triton llama from-scratch kv-cache llm vllm llm-inference paged-attention continuous-batching

Updated Jul 5, 2026
Python

iFurySt / nanoLLMServe

🌱 A tiny, readable LLM serving engine with vLLM/SGLang-style features.

Updated May 28, 2026
Python

developertogo / velo-core

A production-grade, native Rust speculative inference engine for Apple Silicon with Metal GPU acceleration and paged attention.

metal gpu-acceleration systems-programming apple-silicon openai-api tensor-parallelism llm-inference speculative-decoding paged-attention continuous-batching prefix-caching disaggregated-serving

Updated Jun 13, 2026
Rust

likhith-v1 / inferd

Local-first LLM stack on a single RTX 5090: QLoRA fine-tuning, exact speculative decoding, paged KV-cache, and continuous batching — served via FastAPI with a live React dashboard.

react cuda pytorch triton lora inference-engine fine-tuning fastapi kv-cache llm local-llm llm-inference qlora qwen speculative-decoding paged-attention continuous-batching rtx-5090

Updated Jul 3, 2026
Python

AICL-Lab / hetero-paged-infer

High-Performance LLM Inference Engine with PagedAttention & Continuous Batching in Rust

rust machine-learning high-performance inference transformer gpu-computing production-ready systems-programming inference-engine serving kv-cache llm vllm llm-inference paged-attention continuous-batching

Updated Jun 29, 2026
Rust

nshkrdotcom / vllm

vLLM - High-throughput, memory-efficient LLM inference engine with PagedAttention, continuous batching, CUDA/HIP optimization, quantization (GPTQ/AWQ/INT4/INT8/FP8), tensor/pipeline parallelism, OpenAI-compatible API, multi-GPU/TPU/Neuron support, prefix caching, and multi-LoRA capabilities

Updated Apr 23, 2026
Elixir

SMC17 / inference

Pure-Zig LLM serving — paged attention, BF16 kernels, persistent thread pool, safetensors integration. 6.17× decode speedup. TinyLlama-1.1B end-to-end. 77 tests.

machine-learning zig transformers inference llm paged-attention

Updated Jul 3, 2026
Zig

angelnicolasc / Tessera

Rust/Python KV-cache block manager for MLA and hybrid-attention LLM serving, with content-addressed sharing, per-layer layouts, FP4/FP8 accounting, and distributed cache semantics.

python rust ai memory-management mla pyo3 kv-cache llm llm-serving vllm paged-attention deepseek hybrid-attention

Updated Jun 18, 2026
Rust

achi9629 / llm-inference-engine

A from-scratch LLM inference engine built in PyTorch with custom GPT-2 transformers, KV cache, paged KV cache, continuous batching, and A100 benchmarks

nlp deep-learning transformers autoregressive inference-engine model-serving fastapi gpt2 kv-cache llm llm-serving llm-inference paged-attention mistral-7b continuous-batching paged-kv-cache

Updated May 8, 2026
Python

omgbox / JuliaPagedAttn.jl

Pure-Julia CPU PagedAttention with GPT-2 transformer integration

open-source machine-learning cpu ai deep-learning julia inference simd transformer high-performance-computing attention-mechanism julialang gpt-2 kv-cache llm paged-attention

Updated Jun 21, 2026
Julia

manishklach / intent-attention-kernel

Intent-aware KV execution prototype for agentic long-context inference: semantic block selection, dynamic scoring, KV quantization modeling, speculative prefetch simulation, CPU references, and future Triton/CUDA kernels.

Updated May 29, 2026
Python

MrAMS / llaisys

AI Infra: hand-written operators implementing Qwen2 inference, with Paged Attention support

ai-infra paged-attention qwen2

Updated Sep 28, 2025
C++

giulio98 / langchain-pced

LangChain integration for Parallel Context-of-Experts Decoding (PCED)

transformers rag langchain paged-attention constrastive-decoding pced

Updated Feb 13, 2026
Python

JohnScheuer / kv-cache-compaction-lab

Discrete-tick simulator for KV-cache memory compaction policies in LLM inference servers. Compares NoCompaction, GreedyCompaction and ThresholdCompaction via 2D parameter sweep, Pareto frontier analysis and latency impact (P95/P99). C++20 + Python.

python performance simulator research latency inference sweep memory-management pareto cpp20 fragmentation compaction kv-cache llm paged-attention

Updated Jul 4, 2026
Python
