Add AMD GPU (ROCm/HIP) support to the Caspar backend #465
Open
jeffdaily wants to merge 1 commit into
Adds AMD GPU support to Caspar so its generated kernels and runtime also
build and run on AMD GPUs via HIP/ROCm, while leaving the default NVIDIA
CUDA build unchanged (enabled with use_hip=True / -DUSE_HIP=ON; off by
default).
Review in this order:
1. symforce/caspar/source/runtime/cuda_to_hip.h (new): a compatibility
header that maps the CUDA spellings Caspar emits and uses (cudaMalloc,
__syncthreads, cooperative-group reductions, CUB primitives, the
runtime API) onto their HIP equivalents, and supplies device-side
fallbacks where HIP lacks a cg:: primitive (reduce, labeled_partition).
2. code_generation/library.py and source/templates/*.jinja: the codegen
gains a use_hip/hip_arch path; the generated build file and kernel
templates emit the compat include and HIP-correct spellings, so the
symbolic kernel definitions are unchanged.
3. source/runtime/*.cu, memops.cuh, pybind_array_tools.{cc,h}: the
runtime HIP compatibility changes (shared-memory atomics, the
cooperative-group reduction fallback, and a host-side
pointer-attribute lookup).
4. A small Windows build fix (C++17, git path separator) for the
all-clang ROCm toolchain.
5. README: how to build a generated library for AMD GPUs.
Authored with assistance from Claude.
Test Plan:
The full code-generation pipeline (generate -> HIP-compile -> execute on
GPU -> verify numerical output) was exercised on real AMD hardware, on
Linux with CDNA2 (gfx90a) and RDNA3 (gfx1100) and on Windows with RDNA4
(gfx1201):
compile_caspar_library(caslib, out_dir, use_hip=True, hip_arch="gfx90a")
# generated kernel executes on GPU; output matches the CUDA path.
The default NVIDIA CUDA build (use_hip=False) is unchanged.
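The "verify numerical output" step of the test plan can be illustrated with a small element-wise comparison. This is only a sketch: the PR does not say how the outputs were compared, so the helper names (outputs_match, max_abs_error) and the tolerances are assumptions, and the inputs are taken to be flat sequences of floats already copied back to the host.

```python
import math


def outputs_match(hip_values, reference_values, rel_tol=1e-5, abs_tol=1e-8):
    """Return True if two flat float sequences agree element-wise.

    Hypothetical helper: the tolerances are illustrative defaults,
    not values taken from the PR.
    """
    if len(hip_values) != len(reference_values):
        return False
    return all(
        math.isclose(h, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for h, r in zip(hip_values, reference_values)
    )


def max_abs_error(hip_values, reference_values):
    """Largest absolute element-wise difference (0.0 for empty inputs)."""
    return max(
        (abs(h - r) for h, r in zip(hip_values, reference_values)),
        default=0.0,
    )
```

Reporting max_abs_error alongside the pass/fail result makes it easier to tell a genuine mismatch from ordinary floating-point reordering in the reduction fallbacks.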
This PR adds AMD GPU support to Caspar so its generated kernels and runtime build and run on AMD GPUs through ROCm/HIP, while keeping the default NVIDIA CUDA build unchanged. It is enabled with compile_caspar_library(..., use_hip=True, hip_arch=...) (or -DUSE_HIP=ON in the generated build); when off, the build is exactly as before.
The CUDA spellings Caspar emits and uses -- cudaMalloc, __syncthreads, cooperative-group reductions, CUB primitives, and the runtime API -- are mapped to their HIP equivalents through a small compatibility header (source/runtime/cuda_to_hip.h). On an NVIDIA build the header is a transparent passthrough; on a ROCm build it aliases the cuda* symbols to hip* and supplies device-side fallbacks where HIP lacks a cooperative-groups primitive (cg::reduce, cg::labeled_partition). Because the mapping lives in one header and the codegen templates emit the include, the symbolic kernel definitions are unchanged.
The code generation gains a HIP path: code_generation/library.py takes use_hip/hip_arch, and the Jinja build-file and kernel templates emit the HIP-correct includes and the USE_HIP CMake option. The runtime sources get the corresponding HIP compatibility changes (shared-memory atomics, the reduction fallback, and a host-side pointer-attribute lookup in the pybind layer). A small Windows build fix (C++17, git path separator) is included for the all-clang ROCm toolchain.
How to build for AMD GPUs
Pass use_hip=True and the target architecture when compiling a generated library. Set hip_arch to the target AMD GPU (for example gfx90a for CDNA2 or gfx1100 for RDNA3). The ROCm build needs a HIP-enabled compiler (hipcc/amdclang++) and the hip and hipcub packages.
Validation
The full code-generation pipeline (generate, HIP-compile, execute on GPU, verify numerical output) was exercised on real AMD hardware: Linux CDNA2 (gfx90a) and RDNA3 (gfx1100), and Windows RDNA4 (gfx1201). The generated kernels (cooperative-group reductions, shared-memory atomics, scatter/gather) produce results matching the CUDA path. The default NVIDIA CUDA build is unchanged.
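For the build step described under "How to build for AMD GPUs", the following sketch shows one way a caller might choose the use_hip/hip_arch arguments. The helper hip_build_kwargs and the VALIDATED_ARCHS table are hypothetical and not part of Caspar; the table lists only the three architectures named in this PR, and the gfx name check is a loose assumption about how such target names look.

```python
import re

# Architectures named in this PR's validation notes; other gfx targets exist.
VALIDATED_ARCHS = {
    "CDNA2": "gfx90a",
    "RDNA3": "gfx1100",
    "RDNA4": "gfx1201",
}

# Loose shape check (assumption): "gfx" followed by 3-5 lowercase hex digits.
_GFX_PATTERN = re.compile(r"^gfx[0-9a-f]{3,5}$")


def hip_build_kwargs(target):
    """Return keyword arguments selecting the HIP build (hypothetical helper).

    `target` is either a family name from VALIDATED_ARCHS or a gfx
    architecture string passed through unchanged.
    """
    arch = VALIDATED_ARCHS.get(target, target)
    if not _GFX_PATTERN.match(arch):
        raise ValueError(f"unrecognized AMD GPU target: {target!r}")
    return {"use_hip": True, "hip_arch": arch}


# Intended use, following the call shown in the test plan:
# compile_caspar_library(caslib, out_dir, **hip_build_kwargs("CDNA2"))
```

Passing no HIP arguments at all keeps the default NVIDIA CUDA build, which matches the PR's statement that use_hip is off by default.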