A deterministic PyTorch autograd verification trap for catching silent KV-cache routing and block-alignment failures in vLLM and SGLang serving infrastructure.
Updated Jun 7, 2026 - Python
Pure-Zig safetensors reader. @Vector(32, u8) structural scan, BF16/F32/I8. 241 µs parse on 201-tensor TinyLlama fixture. 21 tests.
A typed, dependency-light LLM router: scores models across a cost/quality/latency Pareto frontier with cost_optimal / quality_first / balanced strategies, an offline eval harness, and a live A/B arena. Mock-mode default, zero API keys. No LangChain.
Systematic VLA training optimization on 2× RTX 3090. WebDataset + FlashAttention-2 + FSDP → 3.3× throughput, 26% VRAM reduction. Profiler traces and W&B report linked. Reproducible in one command.
Regression-safe evaluation framework for RAG systems with faithfulness and coverage-based deployment gating.