inference

Pure-Zig LLM serving substrate — paged attention, BF16 kernels, persistent thread-pool dispatch.

What ships (v0.0.16)

OpenAI-compatible HTTP server — POST /v1/completions + /v1/chat/completions via zig build serve-openai; single-binary, no Python, no venv
BLIS-style packed-B GEMM — B panels repacked from stride N to stride BN (256), eliminating L2/L3 cache thrashing on lm_head (N=32000) and MLP shapes. Beats OpenBLAS (system CBLAS) by 2.2–15.6× on decode shapes (M=1) and 1.2–2.2× on prefill batches (M=32–128). bench-gemm-gflops is the harness.
Real end-to-end TinyLlama-1.1B inference through pure-Zig kernels
Persistent Linux-futex-backed worker pool + tiled-GEMM dispatch routing
BF16 lm_head matmul (halves B-matrix bandwidth pulled through L2/L3 on the largest decode-time matmul: M=1, K=2048, N=32000)
PagedAttention KV substrate (page_manager.zig + paged_attention.zig): K and V planes split, 64-byte-aligned, per-block per-head views
Streaming generate() — each decoded token emitted via caller callback
Parallel BF16→F32 weight load (safetensors-zig as a path dependency)
115/115 unit tests pass; packed-B vs SIMD cross-validation on 6 LLM shapes
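
The packed-B item above describes repacking B panels from stride N to stride BN. The following is a minimal sketch of that layout change, assuming a row-major K×N matrix. The names packB and gemvPacked, the tiny panel width BN = 4, and the scalar inner loop are illustrative assumptions; the project's matmulPackedB uses BN = 256 and an AVX-512 micro-kernel and may differ in detail.

```zig
const std = @import("std");

// Panel width. The README's kernel uses 256; 4 keeps the example readable.
const BN: usize = 4;

/// Repack row-major B (K x N, row stride N) into column panels of width BN.
/// Inside a panel, consecutive k rows sit BN floats apart instead of N apart.
/// The last panel is zero-padded when N is not a multiple of BN.
/// Hypothetical helper, not the project's API.
fn packB(allocator: std.mem.Allocator, b: []const f32, k: usize, n: usize) ![]f32 {
    std.debug.assert(b.len == k * n);
    const n_panels = (n + BN - 1) / BN;
    const panels = try allocator.alloc(f32, n_panels * k * BN);
    @memset(panels, 0);
    for (0..n_panels) |p| {
        const j0 = p * BN;
        const width: usize = @min(BN, n - j0);
        for (0..k) |kk| {
            const dst = panels[(p * k + kk) * BN ..][0..width];
            const src = b[kk * n + j0 ..][0..width];
            @memcpy(dst, src);
        }
    }
    return panels;
}

/// Decode-shaped product (M = 1): out[j] = sum over k of a[k] * B[k, j],
/// reading B through the packed panels. Scalar for clarity.
fn gemvPacked(a: []const f32, panels: []const f32, n: usize, out: []f32) void {
    const k = a.len;
    std.debug.assert(out.len == n);
    const n_panels = (n + BN - 1) / BN;
    for (0..n_panels) |p| {
        const j0 = p * BN;
        const width: usize = @min(BN, n - j0);
        var acc = [_]f32{0} ** BN;
        for (0..k) |kk| {
            // One contiguous BN-wide row of the current panel.
            const row = panels[(p * k + kk) * BN ..][0..BN];
            for (0..BN) |j| acc[j] += a[kk] * row[j];
        }
        @memcpy(out[j0..][0..width], acc[0..width]);
    }
}
```

With N = 32000, walking one column block of unpacked B jumps 32000 floats between consecutive k rows; after packing, the same walk is a single contiguous stream of BN-float rows, which is the access pattern the packed-B item credits for avoiding cache thrashing.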

What does NOT ship yet

Continuous batching → Phase 2 (currently single-request-at-a-time)
BEAM/OTP scheduler (LiveDashboard / :observer) → Phase 2/3
CUDA / Triton GPU kernels → Phase 1.5+, hardware-gated on Ampere+
Speculative decoding
Multi-GPU

Quickstart

zig build test                  # 115 unit tests
zig build forward-skeleton      # Wiring smoke test
zig build bench-matmul          # Matmul scalar vs SIMD speedup table
zig build bench-pool            # Persistent pool vs ad-hoc spawn
zig build bench-tiled           # Cache-tiled vs non-tiled matmul
zig build bench-gemm-gflops     # Packed-B vs tiled vs CBLAS GFLOPS on LLM shapes
zig build infer-tinyllama       # End-to-end TinyLlama inference (CLI)
zig build serve-openai          # OpenAI-compat HTTP server on :8080
zig build bench-tinyllama-pool  # TinyLlama pool vs no-pool head-to-head
zig build install -Doptimize=ReleaseFast

Architecture

flowchart LR
    A[Prompt] --> B[tokenizers-zig]
    B --> C[Embedding]
    C --> D[Transformer block × N]
    D --> E[LM Head BF16]
    E --> F[Sampler]
    F --> G[Token]
    G -.streaming callback.-> H[Caller]
    G -.next step.-> D

    subgraph Memory
        PM[PageManager]
        PA[PagedAttention KV cache]
        D <--> PA
        PA <--> PM
    end

    subgraph Dispatch
        TP[Persistent thread pool]
        TG[Tiled GEMM router]
        D <--> TP
        TP <--> TG
    end

    subgraph Weights
        ST[safetensors-zig]
        BF[BF16 → F32 parallel load]
        ST --> BF --> C
        ST --> BF --> D
        ST --> E
    end

Kernels

tensor.zig — Matrix (f32, row-major, owned slice). Approximate-equality helper for SIMD/scalar cross-validation.
kernels/matmul.zig — scalar reference, SIMD outer-product, multi-thread row-split (M-axis), multi-thread col-split (N-axis), matmulSIMDAuto dispatcher choosing row-vs-col split by M vs n_threads, and matmulPackedB — BLIS GEBP-style B-panel packing (stride BN vs N) with AVX-512 ZMM micro-kernel. Routes decode (M=1) to N-col-split; prefill (M≥2) to packed M-split. Beats CBLAS 2–16× on LLM decode shapes.
kernels/matmul_bf16.zig — BF16 B-matrix matmul (scalar + SIMD), routed by forward.zig for the decode-time lm_head.
kernels/rmsnorm.zig — Llama-style RMSNorm with SIMD inner loop.
kernels/softmax.zig — numerically stable softmax (subtract max).
kernels/silu.zig — SiLU + siluMul (Llama-MLP silu(gate) * up).
kernels/rope.zig — RoPE rotation matching the Triton reference.
kernels/attention.zig — GQA causal attention with KV cache.
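
For orientation, the element-wise kernels listed above can be written as scalar references in a few lines. This is a sketch under stated assumptions: the function names and signatures (softmaxInPlace, rmsNorm, siluMul) are hypothetical, eps is passed in by the caller, and the project's versions add SIMD inner loops that are not shown here.

```zig
const std = @import("std");

/// Numerically stable softmax, in place. Subtracting the maximum before
/// exponentiating keeps exp() from overflowing and leaves the result unchanged.
fn softmaxInPlace(x: []f32) void {
    if (x.len == 0) return;
    var max_val = x[0];
    for (x[1..]) |v| max_val = @max(max_val, v);
    var sum: f32 = 0;
    for (x) |*v| {
        v.* = @exp(v.* - max_val);
        sum += v.*;
    }
    const inv = 1.0 / sum;
    for (x) |*v| v.* *= inv;
}

/// Llama-style RMSNorm: out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
fn rmsNorm(x: []const f32, weight: []const f32, eps: f32, out: []f32) void {
    std.debug.assert(x.len == weight.len and x.len == out.len);
    var sum_sq: f32 = 0;
    for (x) |v| sum_sq += v * v;
    const mean_sq = sum_sq / @as(f32, @floatFromInt(x.len));
    const scale = 1.0 / @sqrt(mean_sq + eps);
    for (x, weight, out) |v, w, *o| o.* = v * scale * w;
}

/// Llama-MLP gate: out[i] = silu(gate[i]) * up[i], with silu(g) = g * sigmoid(g).
fn siluMul(gate: []const f32, up: []const f32, out: []f32) void {
    std.debug.assert(gate.len == up.len and gate.len == out.len);
    for (gate, up, out) |g, u, *o| o.* = g / (1.0 + @exp(-g)) * u;
}
```

A scalar form like this is also what a SIMD variant would be checked against in a cross-validation test.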

Forward pass + weights

src/bf16.zig — BF16 ↔ F32 (exact BF16→F32, round-to-nearest-even F32→BF16).
src/weights.zig — loadMatrixBF16 / loadVectorBF16 / loadMatrixBF16Transposed. Parallel BF16→F32 load.
src/kv_cache.zig — per-layer fixed-size KV cache (~92 MB for TinyLlama at max_seq=2048).
src/page_manager.zig — physical block backing store for PagedAttention (K/V planes, 64-byte-aligned).
src/paged_attention.zig — block-indexed attention over the page manager.
src/thread_pool.zig — persistent Linux-futex worker pool; replaces ad-hoc std.Thread.spawn cost at M=1 decode.
src/sampler.zig — greedy / temperature / top-K sampling.
src/model.zig — full Llama-family Model (Config from config.json, LayerWeights, loadFromSafeTensors, forward = embed → N × block → final norm → lm_head BF16 → logits).
src/forward.zig — block-level forward; routes the BF16 lm_head matmul.
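
The BF16 ↔ F32 conversion named for src/bf16.zig is small enough to sketch. BF16 is the top 16 bits of an IEEE-754 binary32, so widening is an exact shift and narrowing needs rounding. The function names below are hypothetical, and the NaN handling (forcing a quiet-NaN bit so the payload cannot round into infinity) is an assumption the README does not specify.

```zig
const std = @import("std");

/// BF16 -> F32 is exact: the 16 stored bits become the high half of the f32.
pub fn bf16ToF32(bits: u16) f32 {
    return @bitCast(@as(u32, bits) << 16);
}

/// F32 -> BF16 with round-to-nearest-even on the 16 discarded bits.
pub fn f32ToBf16(x: f32) u16 {
    const bits: u32 = @bitCast(x);
    // NaN: truncate and set a mantissa bit so the result stays a NaN.
    // (Assumed policy; not taken from the project source.)
    if ((bits & 0x7fff_ffff) > 0x7f80_0000) {
        return @truncate((bits >> 16) | 0x0040);
    }
    // Add 0x7fff plus the lowest kept bit: exact ties round toward the even
    // BF16 value, everything else rounds to nearest.
    const lsb: u32 = (bits >> 16) & 1;
    const rounded = bits + 0x7fff + lsb;
    return @truncate(rounded >> 16);
}
```

Because BF16 keeps the full 8-bit exponent of F32, the widening direction never loses range, which is why weights can be stored as BF16 and expanded to F32 at load time without any scaling step.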

End-to-end on TinyLlama-1.1B-Chat-v1.0 (consumer Linux, 8 cores)

prompt: "The capital of France is"  (max_new=8, ReleaseFast, T=8)

(A) no pool — matmulSIMDAuto (v0.0.4)
    first-token (prefill+1):  8716 ms
    decode ms/tok:            3369

(B) persistent pool (v0.0.5, clean machine)
    first-token (prefill+1):  7076 ms
    decode ms/tok:            581

(C) persistent pool + BF16 lm_head (v0.0.6, clean machine)
    first-token (prefill+1):  pending clean re-run
    decode ms/tok:            pending clean re-run

The v0.0.5 numbers are from a clean machine (load average near zero). The v0.0.6 cells are deliberately left unfilled pending a re-bench on a clean machine: the previously posted figure (1330 ms/tok at load average ~44 on an 8-core CPU) was not a fair comparison and has been withdrawn. The v0.0.6 BF16 lm_head architecture stands; its number will be posted once a clean re-run is done.

Real model behavior, real kernels, real tokens streamed out. The 25-second model load uses the parallel BF16→F32 path (2.3× faster than v0.0.3). The 1.6× prefill speedup comes from the multi-thread matmul row-split. The v0.0.4 decode regression (std.Thread.spawn cost dominating at M=1) was closed by the persistent worker pool at v0.0.5, which cut decode from 3369 to 581 ms/tok (about 5.8×).

How this fits the portfolio

Upstream: SMC17/safetensors-zig (weight loader), SMC17/tokenizers-zig (BPE)
Companion: BEAM/OTP multi-agent on top — the scheduler-layer swap is the load-bearing architectural choice; see ARCHITECTURE.md.
vLLM (Kwon et al. 2023) is the reference architecture this project echoes; it replaces vLLM's Python scheduler with a future BEAM/OTP layer and its kernels with Zig.

Non-claims

This is NOT vLLM. No continuous batching, no speculative decoding, no multi-GPU. vLLM (the reference architecture) reflects years of optimisation and production GPU/CUDA scheduling work. A fair comparison requires Phase 2.
CPU only. No GPU code at v0.0.16. Per ARCHITECTURE.md, GPU paths are hardware-gated on Ampere+.
Decode benchmarks are CPU-only on TinyLlama-1.1B — not comparable to GPU production serving.
Pre-1.0 substrate per the Zig 0.16 ecosystem convention. The BF16 + tiled-pool work is substrate, not a release-readiness claim. Vocabulary stays below "production-grade" until the evidence changes.

Tests

115/115 unit tests pass at v0.0.16 (zig build test). SIMD ↔ scalar matmul agreement across 7 (M, K, N) shape variants; packed-B vs SIMD cross-validated on 6 LLM shapes (M=1 decode + M=32/128 prefill + lm_head). forward-skeleton, infer-tinyllama, and bench-tinyllama-pool exercise the full pipeline end-to-end against real weights.
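
The SIMD ↔ scalar agreement check can be sketched as a scalar reference, a vectorized variant, and a tolerance comparison. Everything here is illustrative: the function names, the 8-lane vector width, and the tolerances are assumptions, and the project's matmul kernels and approximate-equality helper may be structured differently.

```zig
const std = @import("std");

const Lanes = 8;
const Vec = @Vector(Lanes, f32);

/// Scalar reference: C (M x N) = A (M x K) * B (K x N), all row-major.
fn matmulScalar(a: []const f32, b: []const f32, c: []f32, m: usize, k: usize, n: usize) void {
    for (0..m) |i| {
        for (0..n) |j| {
            var acc: f32 = 0;
            for (0..k) |kk| acc += a[i * k + kk] * b[kk * n + j];
            c[i * n + j] = acc;
        }
    }
}

/// SIMD outer-product form: broadcast A[i, kk] and accumulate it against
/// Lanes consecutive elements of row kk of B. A scalar loop handles the tail.
fn matmulSimd(a: []const f32, b: []const f32, c: []f32, m: usize, k: usize, n: usize) void {
    for (0..m) |i| {
        var j: usize = 0;
        while (j + Lanes <= n) : (j += Lanes) {
            var acc: Vec = @splat(0.0);
            for (0..k) |kk| {
                const a_bcast: Vec = @splat(a[i * k + kk]);
                const b_row: Vec = b[kk * n + j ..][0..Lanes].*;
                acc += a_bcast * b_row;
            }
            c[i * n + j ..][0..Lanes].* = acc;
        }
        while (j < n) : (j += 1) {
            var acc: f32 = 0;
            for (0..k) |kk| acc += a[i * k + kk] * b[kk * n + j];
            c[i * n + j] = acc;
        }
    }
}

/// True when every pair differs by at most abs_tol + rel_tol * max(|x|, |y|).
/// NaNs compare as unequal.
fn approxEqual(x: []const f32, y: []const f32, rel_tol: f32, abs_tol: f32) bool {
    if (x.len != y.len) return false;
    for (x, y) |xv, yv| {
        const limit = abs_tol + rel_tol * @max(@abs(xv), @abs(yv));
        if (!(@abs(xv - yv) <= limit)) return false;
    }
    return true;
}
```

Exact equality is the wrong test here because the two variants sum in different orders, so a combined relative and absolute tolerance is the usual choice.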

License

AGPL-3.0-or-later. See LICENSE. Under the AGPL, running a modified version as a network service requires making the modified source available to that service's users, or negotiating a commercial license. Standard sovereign-stack posture.
