Add AMD GPU (ROCm/HIP) support to the Caspar backend #465

Open
jeffdaily wants to merge 1 commit into symforce-org:main from jeffdaily:moat-port

This PR adds AMD GPU support to Caspar so that its generated kernels and runtime build and run on AMD GPUs through ROCm/HIP, while keeping the default NVIDIA CUDA build unchanged. HIP support is off by default and is enabled with compile_caspar_library(..., use_hip=True, hip_arch=...) (or -DUSE_HIP=ON in the generated build); when it is off, the build is exactly as before.

The CUDA spellings Caspar emits and uses -- cudaMalloc, __syncthreads, cooperative-group reductions, CUB primitives, and the runtime API -- are mapped to their HIP equivalents through a small compatibility header (source/runtime/cuda_to_hip.h). On an NVIDIA build the header is a transparent passthrough; on a ROCm build it aliases the cuda* symbols to hip* and supplies device-side fallbacks where HIP lacks a cooperative-groups primitive (cg::reduce, cg::labeled_partition). Because the mapping lives in one header and the codegen templates emit the include for it, the symbolic kernel definitions are unchanged.
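To make the naming correspondence concrete, here is a small Python illustration of the prefix pattern the header relies on. This is not how cuda_to_hip.h is implemented (the header does its mapping in C++), and `hip_spelling` is a hypothetical helper introduced only for this sketch. The prefix rule holds for common runtime calls such as cudaMalloc, but it should not be assumed for every symbol.

```python
def hip_spelling(cuda_name: str) -> str:
    """Return the HIP spelling for a CUDA runtime name (illustrative only).

    Names with the "cuda" prefix get the "hip" prefix instead; anything
    else (for example __syncthreads, which HIP spells the same way) is
    returned unchanged.
    """
    prefix = "cuda"
    if cuda_name.startswith(prefix):
        return "hip" + cuda_name[len(prefix):]
    return cuda_name


if __name__ == "__main__":
    for name in ("cudaMalloc", "cudaStream_t", "__syncthreads"):
        print(name, "->", hip_spelling(name))
```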

Code generation gains a HIP path: code_generation/library.py takes use_hip/hip_arch, and the Jinja build-file and kernel templates emit the HIP-correct includes and the USE_HIP CMake option. The runtime sources get the corresponding HIP compatibility changes (shared-memory atomics, the reduction fallback, and a host-side pointer-attribute lookup in the pybind layer). A small Windows build fix (C++17, git path separator) is included for the all-clang ROCm toolchain.
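As a rough sketch of how the two Python arguments could turn into configure flags for the generated build: `hip_cmake_args` is a hypothetical helper, not a function in library.py. -DUSE_HIP=ON is the option named in this PR; passing the architecture through CMake's CMAKE_HIP_ARCHITECTURES variable is an assumption made for illustration.

```python
from typing import List, Optional


def hip_cmake_args(use_hip: bool = False, hip_arch: Optional[str] = None) -> List[str]:
    """Build extra CMake configure flags for a generated library (sketch).

    With use_hip=False nothing is added, mirroring the claim that the
    default CUDA build is unchanged.
    """
    if not use_hip:
        return []
    args = ["-DUSE_HIP=ON"]  # option named in this PR
    if hip_arch:
        # Assumption: the target architecture is forwarded this way.
        args.append(f"-DCMAKE_HIP_ARCHITECTURES={hip_arch}")
    return args


if __name__ == "__main__":
    print(hip_cmake_args(use_hip=True, hip_arch="gfx90a"))
```

Keeping the off case an empty list means a caller can always append the result to its existing configure command without branching.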

How to build for AMD GPUs

Pass use_hip=True and the target architecture when compiling a generated library:

compile_caspar_library(caslib, output_dir, use_hip=True, hip_arch="gfx90a")

Set hip_arch to the target AMD GPU (for example gfx90a for CDNA2 or gfx1100 for RDNA3). The ROCm build needs a HIP-enabled compiler (hipcc/amdclang++) and the hip and hipcub packages.
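Before attempting a ROCm build, it can help to check that one of the compilers mentioned above is on PATH. The following is a minimal sketch using only the standard library; `find_hip_compiler` is a hypothetical helper and not part of Caspar, and it does not check for the hip or hipcub packages.

```python
import shutil
from typing import Callable, Optional, Sequence


def find_hip_compiler(
    candidates: Sequence[str] = ("hipcc", "amdclang++"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first candidate compiler found, else None.

    The lookup function is injectable so the check can be exercised
    without a ROCm installation.
    """
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    compiler = find_hip_compiler()
    if compiler is None:
        print("No HIP-enabled compiler found on PATH")
    else:
        print("Using HIP compiler:", compiler)
```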

Validation

The full code-generation pipeline (generate, HIP-compile, execute on GPU, verify numerical output) was exercised on real AMD hardware: Linux CDNA2 (gfx90a) and RDNA3 (gfx1100), and Windows RDNA4 (gfx1201). The generated kernels (cooperative-group reductions, shared-memory atomics, scatter/gather) produce results matching the CUDA path. The default NVIDIA CUDA build is unchanged.
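The "verify numerical output" step amounts to an element-wise comparison against reference values. A minimal sketch of such a check follows; `outputs_match` and its tolerances are assumptions chosen for illustration, not the comparison or thresholds actually used in this validation.

```python
import math
from typing import Sequence


def outputs_match(
    reference: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 1e-5,  # assumed tolerance, not taken from the PR
    abs_tol: float = 1e-8,  # assumed tolerance, not taken from the PR
) -> bool:
    """Return True if both sequences have equal length and close values."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


if __name__ == "__main__":
    cuda_out = [1.0, 2.0, 3.0]         # placeholder values
    hip_out = [1.0, 2.0 + 1e-7, 3.0]   # placeholder values
    print(outputs_match(cuda_out, hip_out))
```

A tolerance-based check is used here instead of exact equality because floating-point reductions can be evaluated in a different order on different hardware.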

Adds AMD GPU support to Caspar so its generated kernels and runtime also
build and run on AMD GPUs via HIP/ROCm, while leaving the default NVIDIA
CUDA build unchanged (enabled with use_hip=True / -DUSE_HIP=ON; off by
default).

Review in this order:

1. symforce/caspar/source/runtime/cuda_to_hip.h (new): a compatibility
   header that maps the CUDA spellings Caspar emits and uses (cudaMalloc,
   __syncthreads, cooperative-group reductions, CUB primitives, the
   runtime API) onto their HIP equivalents, and supplies device-side
   fallbacks where HIP lacks a cg:: primitive (reduce, labeled_partition).

2. code_generation/library.py and source/templates/*.jinja: the codegen
   gains a use_hip/hip_arch path; the generated build file and kernel
   templates emit the compat include and HIP-correct spellings, so the
   symbolic kernel definitions are unchanged.

3. source/runtime/*.cu, memops.cuh, pybind_array_tools.{cc,h}: the
   runtime HIP compatibility changes (shared-memory atomics, the
   cooperative-group reduction fallback, and a host-side
   pointer-attribute lookup).

4. A small Windows build fix (C++17, git path separator) for the
   all-clang ROCm toolchain.

5. README: how to build a generated library for AMD GPUs.

Authored with assistance from Claude.

Test Plan:

The full code-generation pipeline (generate -> HIP-compile -> execute on
GPU -> verify numerical output) was exercised on real AMD hardware:
Linux CDNA2 (gfx90a) and RDNA3 (gfx1100), and Windows RDNA4 (gfx1201).
For example:

  compile_caspar_library(caslib, out_dir, use_hip=True, hip_arch="gfx90a")
  # generated kernel executes on GPU; output matches the CUDA path.

The default NVIDIA CUDA build (use_hip=False) is unchanged.