You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Deep learning framework for LLMs (Llama/Gemma/Qwen) on CPU, Apple MLX, Metal, CUDA. Load PyTorch/ONNX/TF/GGUF with zero conversion. PyTorch alternative with native Apple Silicon, ONNX Runtime alternative with autograd, llama.cpp alternative with backprop. FlashAttention, tensor parallel, INT4 quant.
Optimized vLLM setup for Gemma 4 31B NVFP4 with MTP on dual RTX PRO 6000 Blackwell using vllm and docker: native FP4 Tensor Cores, Multi-Token Prediction (96.5% acceptance rate), and prefix caching. Includes benchmark results and replication scripts.
Reproducible recipe: serve DeepSeek-V4-Flash with up to 1M token context on 2x NVIDIA DGX Spark (GB10) via vLLM (TP=2). Build (sm_121), launch templates, hardware bring-up, known issues, benchmarks. No binaries/weights.
Patches + recipe to deploy festr2/MiMo-V2.5-Pro-NVFP4-MXFP8-attn-TP8 on 8-node DGX Spark sm_121 (Ray + vLLM, TP=8). Fixes the fused-qkv loader bug that mis-slotted Q values as K/V on 7 of 8 ranks.