GLM-5.2-NVFP4-REAP-469B serving on SM120 (4× RTX PRO 6000 Blackwell) — one-command vLLM launch recipe, 250K context, DeepSeek Sparse Attention + MTP speculative decode
Updated
Jun 19, 2026 - Shell
From-scratch C++/CUDA inference engine for the NVIDIA RTX 5090 (sm_120a) — the best single-GPU backend for agentic AI: tool calling, long-context loops, reasoning and concurrent sub-agents on top of the fastest single-stream decode on the 5090 (beats llama.cpp, at-or-ahead of vLLM on NVFP4). 100% written by Claude Code.
Optimized vLLM setup for Qwen3.6-27B-FP8 on dual RTX PRO 6000 Blackwell (192 GB GDDR7, no NVLink): config, benchmark sweep results, and a custom chat template with thinking mode off by default.
Systematic 24-hour benchmark study of Qwen3.6-27B inference on dual NVIDIA RTX PRO 6000 Blackwell SM120 (TP=2). 8 experiments comparing repne/vllm fork vs upstream vLLM across FP8/BF16/NVFP4/Q8_0 quants and MTP/DFlash speculative decoding. Peak: 2,083 tok/s at c=32. Quality: KLD vs BF16 = 0.0018 (noise floor).
Production-grade FlashAttention FP8 e4m3 forward kernel for NVIDIA Blackwell consumer GPUs (sm_120a, e.g. RTX PRO 6000). 647–652 TFLOPS at hd=128, sl=8192. Multi-kernel dispatcher and a C library with Go and Python bindings.
Hub for ongoing Qwen inference benchmarks on NVIDIA Blackwell. Indexes all studies, hosts the rolling SOTA leaderboard, points to the toolchain.
Fish Audio OpenAudio S2-Pro on vLLM-Omni. Low latency (~100 ms TTFA), OpenAI-compatible, runs on NVIDIA Blackwell (RTX 5090 / RTX PRO 6000). Self-hosted streaming TTS and voice cloning.
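QuantLoom (量梭) has never aimed merely to pop a few signals onto your phone. The ultimate thing this loom is meant to weave for you is the right to summon, at will, the RTX Pro 6000, the "Obsidian Machine". It is the black obelisk lying in your chassis, its tens of thousands of cores like a sea of stars at night. It is the material foundation for training and running large models locally, for weaving a real-time panorama of volume across the whole market, and for tracing ten years of capital-flow fingerprints. It used to land only in supercomputing centers, top quant funds, and secretive mining farms. Every bolt of profitable brocade QuantLoom weaves adds one more golden thread to this black altar. When the threads gather into a cable, the Obsidian Machine will tear a rift in the void's shelves and descend into your formation. From then on, you own a personal temple of compute.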
Stress-validation of Qwen3.6-27B inference configurations on dual RTX PRO 6000 Blackwell. 5 configs × 4 phases (gates, throughput matrix, HumanEval, MBPP) = 2,105 hard coding problems, zero crashes. Headline: FP8+MTP=3 wins HumanEval (79.3%), BF16+DFlash wins MBPP (89.5%). MTP=5 is dominated on correctness despite faster raw tok/s.
Forge a reproducible code/LLM eval leaderboard on idle TU Delft DAIC RTX Pro 6000 (Blackwell, sm_120) GPUs — sibling of preCal. Strict GPU-generate / CPU-score decouple, requeue-safe SLURM backfill, published to the HF Hub.
Deploy the GLM-5.2-469B model on four RTX PRO 6000 Blackwell GPUs using a turnkey vLLM Docker configuration to enable high-speed sparse attention and inference.