# tensor-parallel

Here are 8 public repositories matching this topic...

Deep learning framework for LLMs (Llama/Gemma/Qwen) on CPU, Apple MLX, Metal, CUDA. Load PyTorch/ONNX/TF/GGUF with zero conversion. PyTorch alternative with native Apple Silicon, ONNX Runtime alternative with autograd, llama.cpp alternative with backprop. FlashAttention, tensor parallel, INT4 quant.

  • Updated Jul 5, 2026
  • Rust

Optimized vLLM setup for Gemma 4 31B NVFP4 with MTP on dual RTX PRO 6000 Blackwell, using vLLM and Docker: native FP4 Tensor Cores, Multi-Token Prediction (96.5% acceptance rate), and prefix caching. Includes benchmark results and replication scripts.

  • Updated May 10, 2026
  • Shell

Field-tested guide: multi-GPU vLLM tensor-parallel (TP=2/TP=4) on Intel Arc Pro B70 (Battlemage BMG-G31, Xe2) on Linux. Driver setup (xe force_probe=e223), bare-metal vLLM + oneAPI 2025.3, the compute-runtime multi-root USM + triton-xpu init_devices fixes, FP8/int4-AutoRound quant, root-cause error reports. AI-agent readable (AGENTS.md).

  • Updated Jun 13, 2026
  • Shell
