tensor-parallel

Here are 8 public repositories matching this topic...

luojieLLMaaS / haxiv

Deep learning framework for LLMs (Llama/Gemma/Qwen) on CPU, Apple MLX, Metal, CUDA. Load PyTorch/ONNX/TF/GGUF with zero conversion. PyTorch alternative with native Apple Silicon, ONNX Runtime alternative with autograd, llama.cpp alternative with backprop. FlashAttention, tensor parallel, INT4 quant.

Updated Jul 5, 2026
Rust

lna-lab / gemma4-12b-vllm-sm120

Reproducible recipe: serve abliterated Gemma-4-12B (gemma4_unified) at 50-118 tok/s on no-NVLink Blackwell (SM120) via vLLM nightly + ModelOpt FP8/NVFP4 + MTP spec-decode.

quantization gemma blackwell fp8 vllm speculative-decoding sm120 nvfp4 abliterated gemma-4 modelopt tensor-parallel

Updated Jun 7, 2026
Python

theogravity / dual-rtx-6000-blackwell-Gemma-4-31B-IT-NVFP4

Optimized vLLM setup, run in Docker, for Gemma 4 31B NVFP4 with MTP on dual RTX PRO 6000 Blackwell GPUs: native FP4 Tensor Cores, Multi-Token Prediction (96.5% acceptance rate), and prefix caching. Includes benchmark results and replication scripts.

docker amd cuda gemma blackwell vllm llm-inference am5 speculative-decoding fp4 prefix-caching multi-token-prediction nvfp4 rtx-6000 gemma4 tensor-parallel

Updated May 10, 2026
Shell

GanyX19 / deepseek-v4-1m-on-dgx-spark

Reproducible recipe: serve DeepSeek-V4-Flash with up to 1M token context on 2x NVIDIA DGX Spark (GB10) via vLLM (TP=2). Build (sm_121), launch templates, hardware bring-up, known issues, benchmarks. No binaries/weights.

nvidia long-context vllm llm-inference speculative-decoding deepseek gb10 dgx-spark deepseek-v4 tensor-parallel

Updated Jul 3, 2026
Shell

idonati / spark-vllm-docker-festr2

Patches + recipe to deploy festr2/MiMo-V2.5-Pro-NVFP4-MXFP8-attn-TP8 on 8-node DGX Spark sm_121 (Ray + vLLM, TP=8). Fixes the fused-qkv loader bug that mis-slotted Q values as K/V on 7 of 8 ranks.

moe ray quantization mimo huggingface vllm gb10 nvfp4 dgx-spark mxfp8 sm121 tensor-parallel

Updated May 19, 2026
Python

chriswagner-ai / intel-arc-b70-vllm-multi-gpu

Field-tested guide: multi-GPU vLLM tensor-parallel (TP=2/TP=4) on Intel Arc Pro B70 (Battlemage BMG-G31, Xe2) on Linux. Driver setup (xe force_probe=e223), bare-metal vLLM + oneAPI 2025.3, the compute-runtime multi-root USM + triton-xpu init_devices fixes, FP8/int4-AutoRound quant, root-cause error reports. AI-agent readable (AGENTS.md).

Updated Jun 13, 2026
Shell

Red-Weasel / machx-inference-engine

From-scratch C++/SYCL LLM inference engine for Intel Arc (B-series) — 8+ architectures, tensor-parallel, beats llama.cpp on Arc B70

gpu quantization sycl inference-engine oneapi llm intel-arc gguf battlemage tensor-parallel

Updated Jul 3, 2026
C++

bakunyoav-a11y / gemma4-12b-vllm-sm120

Serve an abliterated Gemma-4-12B at high speeds on Blackwell GPUs without NVLink using vLLM, FP8 quantization, and MTP speculative decoding.

quantization gemma blackwell fp8 vllm speculative-decoding sm120 nvfp4 abliterated modelopt tensor-parallel

Updated Jul 5, 2026
Python
