Pure-Zig LLM serving substrate — paged attention, BF16 kernels, persistent thread-pool dispatch.
- OpenAI-compatible HTTP server: `POST /v1/completions` and `/v1/chat/completions` via `zig build serve-openai`; single-binary, no Python, no venv
- BLIS-style packed-B GEMM: B panels repacked from stride N to stride BN (256), eliminating L2/L3 cache thrashing on lm_head (N=32000) and MLP shapes. Beats OpenBLAS (system CBLAS) by 2.2–15.6× on decode shapes (M=1) and 1.2–2.2× on prefill batches (M=32–128). `bench-gemm-gflops` is the harness.
- Real end-to-end TinyLlama-1.1B inference through pure-Zig kernels
- Persistent Linux-futex-backed worker pool + tiled-GEMM dispatch routing
- BF16 `lm_head` matmul (halves B-matrix bandwidth pulled through L2/L3 on the largest decode-time matmul: M=1, K=2048, N=32000)
- PagedAttention KV substrate (`page_manager.zig` + `paged_attention.zig`): K and V planes split, 64-byte-aligned, per-block per-head views
- Streaming `generate()`: each decoded token emitted via caller callback
- Parallel BF16→F32 weight load (safetensors-zig as a path dependency)
- 115/115 unit tests pass; packed-B vs SIMD cross-validation on 6 LLM shapes
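
The stride-N to stride-BN repacking described above can be sketched as follows. This is a minimal illustration, not the project's `matmulPackedB`: the function name `packB`, its signature, and the zero-filled tail are assumptions, and only the panel width BN = 256 comes from the text. It assumes Zig 0.11 or later for the two-argument `@memcpy`/`@memset` builtins.

```zig
const std = @import("std");

/// Panel width quoted in this README (BN = 256).
const BN: usize = 256;

/// Hypothetical helper: repack a row-major K x N matrix `b` (row stride N)
/// into ceil(N / BN) column panels. Each panel is stored contiguously with
/// row stride BN, so a micro-kernel walking down K touches memory that is
/// BN floats apart instead of N floats apart.
/// `packed_b` must hold ceil(N / BN) * K * BN elements; columns past N in the
/// last panel are zero-filled.
fn packB(b: []const f32, k: usize, n: usize, packed_b: []f32) void {
    const n_panels = (n + BN - 1) / BN;
    std.debug.assert(b.len == k * n);
    std.debug.assert(packed_b.len == n_panels * k * BN);

    var p: usize = 0;
    while (p < n_panels) : (p += 1) {
        const col0 = p * BN;
        const width: usize = @min(BN, n - col0);
        const panel = packed_b[p * k * BN ..][0 .. k * BN];

        var row: usize = 0;
        while (row < k) : (row += 1) {
            const dst = panel[row * BN ..][0..BN];
            const src = b[row * n + col0 ..][0..width];
            @memcpy(dst[0..width], src); // copy the live columns
            @memset(dst[width..], 0); // pad the tail of the last panel
        }
    }
}
```

With this layout, element `B[row][col]` lives at `packed_b[(col / BN) * k * BN + row * BN + (col % BN)]`. For the lm_head shape (K=2048, N=32000), the stride between consecutive rows of a panel drops from 32000 floats to 256.
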
- Continuous batching → Phase 2 (currently single-request-at-a-time)
- BEAM/OTP scheduler (LiveDashboard / `:observer`) → Phase 2/3
- CUDA / Triton GPU kernels → Phase 1.5+, hardware-gated on Ampere+
- Speculative decoding
- Multi-GPU
zig build test # 115 unit tests
zig build forward-skeleton # Wiring smoke test
zig build bench-matmul # Matmul scalar vs SIMD speedup table
zig build bench-pool # Persistent pool vs ad-hoc spawn
zig build bench-tiled # Cache-tiled vs non-tiled matmul
zig build bench-gemm-gflops # Packed-B vs tiled vs CBLAS GFLOPS on LLM shapes
zig build infer-tinyllama # End-to-end TinyLlama inference (CLI)
zig build serve-openai # OpenAI-compat HTTP server on :8080
zig build bench-tinyllama-pool # TinyLlama pool vs no-pool head-to-head
zig build install -Doptimize=ReleaseFast

flowchart LR
A[Prompt] --> B[tokenizers-zig]
B --> C[Embedding]
C --> D[Transformer block × N]
D --> E[LM Head BF16]
E --> F[Sampler]
F --> G[Token]
G -.streaming callback.-> H[Caller]
G -.next step.-> D
subgraph Memory
PM[PageManager]
PA[PagedAttention KV cache]
D <--> PA
PA <--> PM
end
subgraph Dispatch
TP[Persistent thread pool]
TG[Tiled GEMM router]
D <--> TP
TP <--> TG
end
subgraph Weights
ST[safetensors-zig]
BF[BF16 → F32 parallel load]
ST --> BF --> C
ST --> BF --> D
ST --> E
end
- `tensor.zig`: `Matrix` (f32, row-major, owned slice). Approximate-equality helper for SIMD/scalar cross-validation.
- `kernels/matmul.zig`: scalar reference, SIMD outer-product, multi-thread row-split (M-axis), multi-thread col-split (N-axis), `matmulSIMDAuto` dispatcher choosing row-vs-col split by M vs n_threads, and `matmulPackedB`, a BLIS GEBP-style B-panel packing (stride BN vs N) with an AVX-512 ZMM micro-kernel. Routes decode (M=1) to N-col-split; prefill (M≥2) to packed M-split. Beats CBLAS 2–16× on LLM decode shapes.
- `kernels/matmul_bf16.zig`: BF16 B-matrix matmul (scalar + SIMD), routed by `forward.zig` for the decode-time `lm_head`.
- `kernels/rmsnorm.zig`: Llama-style RMSNorm with SIMD inner loop.
- `kernels/softmax.zig`: numerically stable softmax (subtract max).
- `kernels/silu.zig`: SiLU + `siluMul` (Llama-MLP `silu(gate) * up`).
- `kernels/rope.zig`: RoPE rotation matching the Triton reference.
- `kernels/attention.zig`: GQA causal attention with KV cache.
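
As an illustration of the "subtract max" trick named for `kernels/softmax.zig`, here is a minimal scalar sketch. The function name `softmaxInPlace` and its in-place signature are assumptions made for this example; the real kernel's API and any SIMD path may differ.

```zig
const std = @import("std");

/// Hypothetical scalar softmax over one row, computed in place.
/// Subtracting the row maximum first means @exp only ever sees arguments
/// <= 0, so it cannot overflow to infinity for large logits.
fn softmaxInPlace(x: []f32) void {
    if (x.len == 0) return;

    // Pass 1: find the maximum logit.
    var max_val: f32 = x[0];
    for (x[1..]) |v| max_val = @max(max_val, v);

    // Pass 2: exponentiate the shifted logits and accumulate their sum.
    // The maximum element contributes exp(0) = 1, so the sum is >= 1.
    var sum: f32 = 0;
    for (x) |*v| {
        v.* = @exp(v.* - max_val);
        sum += v.*;
    }

    // Pass 3: normalise so the row sums to 1.
    const inv = 1.0 / sum;
    for (x) |*v| v.* *= inv;
}
```

Without the shift, a logit of 1000 would make `@exp` return infinity in f32 and the normalisation would produce NaN; with it, two equal logits of 1000 each map to 0.5.
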
- `src/bf16.zig`: BF16 ↔ F32 (exact BF16→F32, round-to-nearest-even F32→BF16).
- `src/weights.zig`: `loadMatrixBF16` / `loadVectorBF16` / `loadMatrixBF16Transposed`. Parallel BF16→F32 load.
- `src/kv_cache.zig`: per-layer fixed-size KV cache (~92 MB for TinyLlama at `max_seq=2048`).
- `src/page_manager.zig`: physical block backing store for PagedAttention (K/V planes, 64-byte-aligned).
- `src/paged_attention.zig`: block-indexed attention over the page manager.
- `src/thread_pool.zig`: persistent Linux-futex worker pool; replaces ad-hoc `std.Thread.spawn` cost at M=1 decode.
- `src/sampler.zig`: greedy / temperature / top-K sampling.
- `src/model.zig`: full Llama-family Model (Config from `config.json`, LayerWeights, `loadFromSafeTensors`, `forward` = embed → N × block → final norm → lm_head BF16 → logits).
- `src/forward.zig`: block-level forward; routes the BF16 lm_head matmul.
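
The two conversions listed for `src/bf16.zig` can be sketched with plain bit manipulation. The function names `bf16ToF32` and `f32ToBf16` are hypothetical, and the NaN handling shown is one common choice rather than a statement of what the module does. The sketch assumes Zig 0.11 or later for the single-argument `@bitCast` and `@truncate` builtins.

```zig
const std = @import("std");

/// BF16 is the upper 16 bits of an IEEE-754 binary32 value, so widening
/// is a shift and loses nothing.
fn bf16ToF32(h: u16) f32 {
    return @bitCast(@as(u32, h) << 16);
}

/// Narrow an f32 to BF16, rounding the 16 discarded mantissa bits to
/// nearest, with ties going to the even result.
fn f32ToBf16(x: f32) u16 {
    const bits: u32 = @bitCast(x);

    // NaN: truncate and force a mantissa bit so the result is still NaN
    // (an assumption for this sketch; the rounding add below could
    // otherwise carry a NaN payload into the exponent).
    if ((bits & 0x7FFF_FFFF) > 0x7F80_0000) {
        return @as(u16, @truncate(bits >> 16)) | 0x0040;
    }

    // Adding 0x7FFF plus the lowest kept bit rounds up when the discarded
    // half is above 0x8000, or exactly 0x8000 with an odd kept value.
    const lsb: u32 = (bits >> 16) & 1;
    const rounded: u32 = bits + 0x7FFF + lsb;
    return @truncate(rounded >> 16);
}
```

For example, the bit pattern `0x3F80_8000` sits exactly halfway between BF16 `0x3F80` and `0x3F81` and rounds to the even `0x3F80`, while `0x3F81_8000` rounds up to `0x3F82`.
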
prompt: "The capital of France is" (max_new=8, ReleaseFast, T=8)
(A) no pool — matmulSIMDAuto (v0.0.4)
first-token (prefill+1): 8716 ms
decode ms/tok: 3369
(B) persistent pool (v0.0.5, clean machine)
first-token (prefill+1): 7076 ms
decode ms/tok: 581
(C) persistent pool + BF16 lm_head (v0.0.6, clean machine)
first-token (prefill+1): pending clean re-run
decode ms/tok: pending clean re-run
The v0.0.5 numbers are from a clean machine (load average near zero). The v0.0.6 cells are deliberately left unfilled pending a re-bench on a clean machine: the prior posting (1330 ms/tok at load avg ~44 on an 8-core CPU) was not a fair comparison and has been withdrawn per audit hygiene. The v0.0.6 BF16 lm_head architecture stands; the corresponding number lands when a clean re-run does.
Real model behavior, real kernels, real tokens streamed out. The 25-second
model load uses the parallel BF16→F32 path (2.3× faster than v0.0.3). The
1.6× prefill speedup comes from the multi-thread matmul row-split. The v0.0.4
decode regression (`std.Thread.spawn` cost dominating at M=1) was closed by
the persistent worker pool at v0.0.5.
- Upstream: `SMC17/safetensors-zig` (weight loader), `SMC17/tokenizers-zig` (BPE)
- Companion: BEAM/OTP multi-agent on top; the scheduler-layer swap is the load-bearing architectural choice; see `ARCHITECTURE.md`.
- vLLM (Kwon et al. 2023) is the reference architecture this echoes; inference replaces the Python scheduler with a future BEAM/OTP layer and the kernels with Zig.
- This is NOT vLLM. No continuous batching, no speculative decoding, no multi-GPU. vLLM (the upstream) is years of optimisation + production GPU/CUDA scheduling. Fair comparison requires Phase 2.
- CPU only. No GPU code at v0.0.6. Per `ARCHITECTURE.md`, GPU paths are hardware-gated on Ampere+.
- Decode benchmarks are CPU-only on TinyLlama-1.1B and are not comparable to GPU production serving.
- Pre-1.0 substrate per the Zig 0.16 ecosystem convention. The BF16 + tiled-pool work is substrate, not a release-readiness claim. Vocabulary stays below "production-grade" until the evidence changes.
115/115 unit tests pass at v0.0.6 (`zig build test`). SIMD ↔ scalar
matmul agreement across 7 (M, K, N) shape variants; packed-B vs SIMD
cross-validated on 6 LLM shapes (M=1 decode + M=32/128 prefill + lm_head).
forward-skeleton, infer-tinyllama, and bench-tinyllama-pool exercise
the full pipeline end-to-end against real weights.
- `ARCHITECTURE.md`: full design doc (16 sections, ~400 lines)
- `STATUS.md`: phase-by-phase status (incl. Type-II watch on the v0.0.6 bench number)
- `CHANGELOG.md`: version-by-version delta
AGPL-3.0-or-later. See LICENSE. AGPL on server software means SaaS
deployment requires releasing modifications or negotiating a commercial
license. Standard sovereign-stack posture.