Multi-node PyTorch DDP fine-tuning of a causal LM on Nebius GPU Kubernetes, orchestrated with SkyPilot: a 2-node torchrun job with verified NCCL collectives.
kubernetes gpu k8s distributed-training nccl mlops pytorch-ddp skypilot llm-ops llm-finetuning nebius gpu-orchestration workload-orchestration ml-orchestration torchrun
Updated Jun 26, 2026 - Python