inference

Pure-Zig LLM serving substrate — paged attention, BF16 kernels, persistent thread-pool dispatch.

License: AGPL-3.0 · Zig: 0.16 · Tests: 115/115

What ships (v0.0.16)

  • OpenAI-compatible HTTP server: POST /v1/completions and /v1/chat/completions via zig build serve-openai; single binary, no Python, no venv
  • BLIS-style packed-B GEMM — B panels repacked from stride N to stride BN (256), eliminating L2/L3 cache thrashing on lm_head (N=32000) and MLP shapes. Beats OpenBLAS (system CBLAS) by 2.2–15.6× on decode shapes (M=1) and 1.2–2.2× on prefill batches (M=32–128). bench-gemm-gflops is the harness.
  • Real end-to-end TinyLlama-1.1B inference through pure-Zig kernels
  • Persistent Linux-futex-backed worker pool + tiled-GEMM dispatch routing
  • BF16 lm_head matmul (halves B-matrix bandwidth pulled through L2/L3 on the largest decode-time matmul: M=1, K=2048, N=32000)
  • PagedAttention KV substrate (page_manager.zig + paged_attention.zig): K and V planes split, 64-byte-aligned, per-block per-head views
  • Streaming generate() — each decoded token emitted via caller callback
  • Parallel BF16→F32 weight load (safetensors-zig as a path dependency)
  • 115/115 unit tests pass; packed-B vs SIMD cross-validation on 6 LLM shapes
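
The BF16 items above rest on a simple bit-level relationship: a BF16 value is the upper 16 bits of an F32. The sketch below illustrates that in Python (chosen for brevity; it is not the repository's Zig code, and every function name here is hypothetical). It shows the exact widening, a round-to-nearest-even narrowing, and the byte count of the lm_head B matrix at K=2048, N=32000 in each format.

```python
import struct


def f32_to_bits(x: float) -> int:
    """Reinterpret a float as its IEEE-754 single-precision bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def bf16_to_f32(h: int) -> float:
    """Exact widening: place the 16 BF16 bits in the top half of an F32."""
    return bits_to_f32((h & 0xFFFF) << 16)


def f32_to_bf16(x: float) -> int:
    """Narrow to BF16, rounding the dropped 16 bits to nearest, ties to even."""
    bits = f32_to_bits(x)
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Force a mantissa bit so the result stays a NaN after truncation.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Adding 0x7FFF plus the current low kept bit implements ties-to-even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def matrix_bytes(k: int, n: int, bytes_per_elem: int) -> int:
    """Size of a K x N matrix in bytes."""
    return k * n * bytes_per_elem


# lm_head B matrix at the shape quoted in the list (K=2048, N=32000).
LM_HEAD_F32_BYTES = matrix_bytes(2048, 32000, 4)
LM_HEAD_BF16_BYTES = matrix_bytes(2048, 32000, 2)
```

With these shapes the F32 matrix is 262,144,000 bytes and the BF16 one is 131,072,000 bytes, which is the halving the list refers to.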

What does NOT ship yet

  • Continuous batching → Phase 2 (currently single-request-at-a-time)
  • BEAM/OTP scheduler (LiveDashboard / :observer) → Phase 2/3
  • CUDA / Triton GPU kernels → Phase 1.5+, hardware-gated on Ampere+
  • Speculative decoding
  • Multi-GPU

Quickstart

zig build test                  # 115 unit tests
zig build forward-skeleton      # Wiring smoke test
zig build bench-matmul          # Matmul scalar vs SIMD speedup table
zig build bench-pool            # Persistent pool vs ad-hoc spawn
zig build bench-tiled           # Cache-tiled vs non-tiled matmul
zig build bench-gemm-gflops     # Packed-B vs tiled vs CBLAS GFLOPS on LLM shapes
zig build infer-tinyllama       # End-to-end TinyLlama inference (CLI)
zig build serve-openai          # OpenAI-compat HTTP server on :8080
zig build bench-tinyllama-pool  # TinyLlama pool vs no-pool head-to-head
zig build install -Doptimize=ReleaseFast
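
Once serve-openai is running, a client can talk to it like any OpenAI-style endpoint. The following is a minimal Python sketch of such a client using only the standard library. The host, the model name "tinyllama", and the assumption that the server honors the standard "model", "prompt", and "max_tokens" fields are illustrative guesses, not confirmed behavior of this server.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    max_tokens: int = 8,
    base_url: str = "http://localhost:8080",  # port taken from the quickstart
    model: str = "tinyllama",                 # hypothetical model name
) -> urllib.request.Request:
    """Build a POST /v1/completions request without sending it."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 120.0) -> dict:
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling send(build_completion_request("The capital of France is")) against a running server would return the decoded JSON response. The generous timeout reflects the CPU-only decode speeds reported further down.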

Architecture

flowchart LR
    A[Prompt] --> B[tokenizers-zig]
    B --> C[Embedding]
    C --> D[Transformer block × N]
    D --> E[LM Head BF16]
    E --> F[Sampler]
    F --> G[Token]
    G -.streaming callback.-> H[Caller]
    G -.next step.-> D

    subgraph Memory
        PM[PageManager]
        PA[PagedAttention KV cache]
        D <--> PA
        PA <--> PM
    end

    subgraph Dispatch
        TP[Persistent thread pool]
        TG[Tiled GEMM router]
        D <--> TP
        TP <--> TG
    end

    subgraph Weights
        ST[safetensors-zig]
        BF[BF16 → F32 parallel load]
        ST --> BF --> C
        ST --> BF --> D
        ST --> E
    end
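
The token loop in the flowchart (forward pass, sampler, streaming callback, next step) can be sketched as follows. This is an illustrative Python outline, not the repository's generate(): the names are hypothetical, sampling is greedy only, and the toy forward function re-reads the whole token list instead of using a KV cache.

```python
from typing import Callable, List, Optional, Sequence


def argmax(logits: Sequence[float]) -> int:
    """Greedy sampling: index of the largest logit (first one on ties)."""
    best = 0
    for i, value in enumerate(logits):
        if value > logits[best]:
            best = i
    return best


def generate(
    forward: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new: int,
    on_token: Callable[[int], None],
    eos_id: Optional[int] = None,
) -> List[int]:
    """Decode up to max_new tokens, handing each one to on_token as it appears."""
    tokens = list(prompt_ids)
    produced: List[int] = []
    for _ in range(max_new):
        logits = forward(tokens)   # prefill on the first pass, decode afterwards
        next_id = argmax(logits)
        on_token(next_id)          # streaming callback to the caller
        produced.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
        tokens.append(next_id)     # feed the token back for the next step
    return produced


def toy_forward(tokens: Sequence[int], vocab: int = 8) -> List[float]:
    """Stand-in model: always predicts (last token + 1) modulo the vocab size."""
    logits = [0.0] * vocab
    logits[(tokens[-1] + 1) % vocab] = 1.0
    return logits
```

A caller that wants tokens as they are produced passes its own callback, for example a list's append method or a function that writes to a socket.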

Kernels

  • tensor.zig: Matrix (f32, row-major, owned slice). Approximate-equality helper for SIMD/scalar cross-validation.
  • kernels/matmul.zig — scalar reference, SIMD outer-product, multi-thread row-split (M-axis), multi-thread col-split (N-axis), matmulSIMDAuto dispatcher choosing row-vs-col split by M vs n_threads, and matmulPackedB — BLIS GEBP-style B-panel packing (stride BN vs N) with AVX-512 ZMM micro-kernel. Routes decode (M=1) to N-col-split; prefill (M≥2) to packed M-split. Beats CBLAS 2–16× on LLM decode shapes.
  • kernels/matmul_bf16.zig — BF16 B-matrix matmul (scalar + SIMD), routed by forward.zig for the decode-time lm_head.
  • kernels/rmsnorm.zig — Llama-style RMSNorm with SIMD inner loop.
  • kernels/softmax.zig — numerically stable softmax (subtract max).
  • kernels/silu.zig — SiLU + siluMul (Llama-MLP silu(gate) * up).
  • kernels/rope.zig — RoPE rotation matching the Triton reference.
  • kernels/attention.zig — GQA causal attention with KV cache.
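
For readers unfamiliar with the element-wise kernels listed above, here are scalar reference versions of RMSNorm, max-subtracted softmax, and SiLU-multiply. They are written in Python for illustration and are not the repository's Zig code. The function names and the eps default are assumptions.

```python
import math
from typing import List, Sequence


def rmsnorm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Llama-style RMSNorm: scale x by 1/sqrt(mean(x^2) + eps), then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def softmax(x: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def silu(v: float) -> float:
    """SiLU(v) = v * sigmoid(v), branching on sign to avoid exp overflow."""
    if v >= 0.0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def silu_mul(gate: Sequence[float], up: Sequence[float]) -> List[float]:
    """Llama-MLP gating: silu(gate) * up, element by element."""
    return [silu(g) * u for g, u in zip(gate, up)]
```

Subtracting the maximum keeps every exponent at or below zero, so softmax([1000.0, 1000.0]) returns [0.5, 0.5] instead of overflowing.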

Forward pass + weights

  • src/bf16.zig — BF16 ↔ F32 (exact BF16→F32, round-to-nearest-even F32→BF16).
  • src/weights.zig: loadMatrixBF16 / loadVectorBF16 / loadMatrixBF16Transposed. Parallel BF16→F32 load.
  • src/kv_cache.zig — per-layer fixed-size KV cache (~92 MB for TinyLlama at max_seq=2048).
  • src/page_manager.zig — physical block backing store for PagedAttention (K/V planes, 64-byte-aligned).
  • src/paged_attention.zig — block-indexed attention over the page manager.
  • src/thread_pool.zig — persistent Linux-futex worker pool; replaces ad-hoc std.Thread.spawn cost at M=1 decode.
  • src/sampler.zig — greedy / temperature / top-K sampling.
  • src/model.zig — full Llama-family Model (Config from config.json, LayerWeights, loadFromSafeTensors, forward = embed → N × block → final norm → lm_head BF16 → logits).
  • src/forward.zig — block-level forward; routes the BF16 lm_head matmul.
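
The two KV storage schemes in this list differ in how they reserve memory. The Python sketch below is illustrative only. It reproduces the ~92 MB fixed-cache figure and shows how a paged scheme could map a token position to a physical block. The TinyLlama dimensions used (22 layers, 4 KV heads, head dimension 64) come from that model's published config rather than from this README, F32 storage is assumed, and the BlockTable class, its method names, and the block size are hypothetical.

```python
from typing import List, Tuple


def fixed_kv_cache_bytes(
    n_layers: int, n_kv_heads: int, head_dim: int, max_seq: int, bytes_per_elem: int = 4
) -> int:
    """Bytes for a preallocated cache: K and V planes for every layer and position."""
    return 2 * n_layers * max_seq * n_kv_heads * head_dim * bytes_per_elem


class BlockTable:
    """Maps the logical token positions of one sequence to physical KV blocks."""

    def __init__(self, block_size: int, free_blocks: List[int]) -> None:
        self.block_size = block_size
        self.free = list(free_blocks)   # ids of unused physical blocks
        self.blocks: List[int] = []     # physical block id per logical block
        self.length = 0                 # tokens stored so far

    def append_token(self) -> Tuple[int, int]:
        """Reserve a slot for one more token, taking a new block only when needed."""
        if self.length % self.block_size == 0:
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.blocks.append(self.free.pop())
        position = self.length
        self.length += 1
        return self.locate(position)

    def locate(self, position: int) -> Tuple[int, int]:
        """Return (physical block id, offset inside the block) for a position."""
        if not 0 <= position < self.length:
            raise IndexError("position out of range")
        return self.blocks[position // self.block_size], position % self.block_size


# 2 planes x 22 layers x 2048 positions x 4 KV heads x 64 dims x 4 bytes.
TINYLLAMA_FIXED_BYTES = fixed_kv_cache_bytes(22, 4, 64, 2048)
```

Under these assumptions the fixed cache comes to 92,274,688 bytes regardless of how many tokens are actually stored, whereas the block table only claims a new block every block_size tokens.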

End-to-end on TinyLlama-1.1B-Chat-v1.0 (consumer Linux, 8 cores)

prompt: "The capital of France is"  (max_new=8, ReleaseFast, T=8)

(A) no pool — matmulSIMDAuto (v0.0.4)
    first-token (prefill+1):  8716 ms
    decode ms/tok:            3369

(B) persistent pool (v0.0.5, clean machine)
    first-token (prefill+1):  7076 ms
    decode ms/tok:            581

(C) persistent pool + BF16 lm_head (v0.0.6, clean machine)
    first-token (prefill+1):  pending clean re-run
    decode ms/tok:            pending clean re-run

The v0.0.5 numbers are from a clean machine (load average near zero). The v0.0.6 cells are deliberately left as pending until a re-bench on a clean machine: the earlier posted figure (1330 ms/tok at load average ~44 on an 8-core CPU) was not a fair comparison and has been withdrawn. The v0.0.6 BF16 lm_head architecture stands; its number will be added once a clean re-run exists.
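
The ratios between rows (A) and (B) follow directly from the figures above. The short Python calculation below makes them explicit; the helper names are hypothetical and the inputs are copied from the results block.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the 'after' measurement is than the 'before' one."""
    return before_ms / after_ms


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


# (A) no pool vs (B) persistent pool, values from the results block.
decode_speedup = speedup(3369, 581)        # about 5.80
first_token_speedup = speedup(8716, 7076)  # about 1.23
decode_rate_no_pool = tokens_per_second(3369)  # about 0.30 tok/s
decode_rate_pool = tokens_per_second(581)      # about 1.72 tok/s
```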

Real model behavior, real kernels, real tokens streamed out. The 25-second model load uses the parallel BF16→F32 path (2.3× faster than v0.0.3). The 1.6× prefill speedup comes from the multi-thread matmul row-split. The v0.0.4 decode regression (std.Thread.spawn cost dominating at M=1) was closed by the persistent worker pool in v0.0.5.

How this fits the portfolio

  • Upstream: SMC17/safetensors-zig (weight loader), SMC17/tokenizers-zig (BPE)
  • Companion: BEAM/OTP multi-agent on top — the scheduler-layer swap is the load-bearing architectural choice; see ARCHITECTURE.md.
  • vLLM (Kwon et al. 2023) is the reference architecture this echoes; inference replaces the Python scheduler with a future BEAM/OTP layer and the kernels with Zig.

Non-claims

  • This is NOT vLLM. No continuous batching, no speculative decoding, no multi-GPU. vLLM (the upstream) is years of optimisation + production GPU/CUDA scheduling. Fair comparison requires Phase 2.
  • CPU only. No GPU code at v0.0.6. Per ARCHITECTURE.md, GPU paths are hardware-gated on Ampere+.
  • Decode benchmarks are CPU-only on TinyLlama-1.1B — not comparable to GPU production serving.
  • Pre-1.0 substrate per the Zig 0.16 ecosystem convention. The BF16 + tiled-pool work is substrate, not a release-readiness claim. Vocabulary stays below "production-grade" until the evidence changes.

Tests

115/115 unit tests pass at v0.0.16 (zig build test). SIMD ↔ scalar matmul agreement across 7 (M, K, N) shape variants; packed-B vs SIMD cross-validated on 6 LLM shapes (M=1 decode + M=32/128 prefill + lm_head). forward-skeleton, infer-tinyllama, and bench-tinyllama-pool exercise the full pipeline end-to-end against real weights.
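
The cross-validation idea (run an optimized layout and a reference on the same inputs, then compare with a tolerance) can be illustrated with a small Python sketch. It is not the repository's test code: the function names are hypothetical, the panel width is tiny for readability (the README quotes BN=256), and plain loops stand in for the SIMD micro-kernel.

```python
import math
from typing import List, Sequence


def matmul_naive(a: Sequence[float], b: Sequence[float], m: int, k: int, n: int) -> List[float]:
    """Reference C = A @ B with row-major A (m x k) and B (k x n)."""
    c = [0.0] * (m * n)
    for i in range(m):
        for p in range(k):
            a_ip = a[i * k + p]
            for j in range(n):
                c[i * n + j] += a_ip * b[p * n + j]
    return c


def pack_b(b: Sequence[float], k: int, n: int, bn: int) -> List[List[float]]:
    """Copy B into column panels of width bn so each panel row has stride bn, not n."""
    panels = []
    for j0 in range(0, n, bn):
        width = min(bn, n - j0)
        panel = [0.0] * (k * bn)  # the last panel is zero-padded to full width
        for p in range(k):
            for j in range(width):
                panel[p * bn + j] = b[p * n + j0 + j]
        panels.append(panel)
    return panels


def matmul_packed_b(
    a: Sequence[float], panels: Sequence[Sequence[float]], m: int, k: int, n: int, bn: int
) -> List[float]:
    """Same product as matmul_naive, but reading B one packed panel at a time."""
    c = [0.0] * (m * n)
    for panel_index, panel in enumerate(panels):
        j0 = panel_index * bn
        width = min(bn, n - j0)
        for i in range(m):
            for p in range(k):
                a_ip = a[i * k + p]
                row = p * bn
                for j in range(width):
                    c[i * n + j0 + j] += a_ip * panel[row + j]
    return c


def approx_equal(x: Sequence[float], y: Sequence[float], rel_tol: float = 1e-5, abs_tol: float = 1e-6) -> bool:
    """Element-wise comparison with a tolerance, as needed when summation order differs."""
    return len(x) == len(y) and all(
        math.isclose(u, v, rel_tol=rel_tol, abs_tol=abs_tol) for u, v in zip(x, y)
    )
```

In this sketch both versions accumulate in the same order, so the results match exactly. A vectorized kernel typically reorders the additions, which is why the comparison uses a tolerance instead of strict equality.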

See also

  • ARCHITECTURE.md — full design doc (16 sections, ~400 lines)
  • STATUS.md — phase-by-phase status (incl. Type-II watch on the v0.0.6 bench number)
  • CHANGELOG.md — version-by-version delta

License

AGPL-3.0-or-later. See LICENSE. AGPL on server software means SaaS deployment requires releasing modifications or negotiating a commercial license. Standard sovereign-stack posture.
