paged-attention

Here are 31 public repositories matching this topic...

From teacher to tiles — a from-scratch LLM distillation & serving engine: custom Triton/CUDA kernels, FSDP distillation, paged-KV continuous batching, speculative decoding, a Rust gateway, a JAX oracle, and interpretability tooling.

  • Updated Jun 5, 2026
  • Python

vLLM - High-throughput, memory-efficient LLM inference engine with PagedAttention, continuous batching, CUDA/HIP optimization, quantization (GPTQ/AWQ/INT4/INT8/FP8), tensor/pipeline parallelism, OpenAI-compatible API, multi-GPU/TPU/Neuron support, prefix caching, and multi-LoRA capabilities

  • Updated Apr 23, 2026
  • Elixir

Intent-aware KV execution prototype for agentic long-context inference: semantic block selection, dynamic scoring, KV quantization modeling, speculative prefetch simulation, CPU references, and future Triton/CUDA kernels.

  • Updated May 29, 2026
  • Python
