[ROCm] HIP port for AMD GPUs#186
Conversation
Ports PopSift (GPU SIFT) to ROCm/HIP. A single compatibility header src/popsift/cuda_to_hip.h aliases the CUDA spellings to HIP under USE_HIP and is a no-op include of the CUDA headers otherwise; the .cu sources build as HIP, and the CUDA build path is byte-for-byte unchanged (every change is wrapped in USE_HIP / if(NOT USE_HIP)). README.md documents the USE_HIP build option, the build recipe, and the supported AMD targets. Getting correct, deterministic SIFT output required addressing five distinct HIP/ROCm fault classes, each behind a USE_HIP guard. 1. Hardware linear-filter textures. HIP rejects a cudaFilterModeLinear + element-read float texture at creation, so the linear fetches do manual bilinear interpolation in software (point-sample the 4 neighbors and lerp, CUDA's -0.5 texel-center convention). 2. Layered cudaArray dimension collapse. A multi-layer cudaArrayLayered array written one layer per kernel launch collapses to a single, last-written layer on any later-launch read on this ROCm (tex2DLayered, surf2DLayeredread, and even host hipMemcpy3D all return the last layer for every index; see ROCm/clr#275, which PopSift motivated, and the partial fix ROCm/rocm-systems#6683 covering only surf2DLayered). The Gaussian pyramid and DoG arrays are converted to non-layered 3D arrays (surf3Dwrite/surf3Dread/tex3D with the level as a real z coordinate, done once in the compat header's write wrapper plus the texture-fetch helper), which is fully coherent across launches. 3. Wavefront-width-generic warp collectives. The bitonic sort, exclusive prefix sums, descriptor reductions and orientation ballots operate per warpSize-lane group rather than assuming a 32-lane warp. The extrema counter in s_extrema.cu is the critical case: the in-wavefront lane is the block's linear thread index modulo the actual warpSize, the ballot/popc/shuffle widths follow warpSize, and the leader is lane 0 of the real wavefront. On wave64 this reduces exactly to a 64-bit ballot with an exclusive-prefix slot assignment; on wave32 it is correct where a fixed 64-lane geometry would have lost atomicAdds and read phantom lanes. 4. RootSift and L2 NaN. The default RootSift normalization sqrt(bin/L1) and the L2 path are guarded against sqrt of a slightly-negative bin and the 0*inf of an all-zero descriptor. 5. Example linkage. The PopSift library links roc::rocthrust PRIVATE and exposes hip::host PUBLIC, so the plain-C++ examples (main.cpp/match.cpp/pgmread.cpp, which have no device code) compile as CXX and just link the HIP runtime, rather than being forced through the HIP compiler by a leaked hip::device interface. The compat header's forward-compat <hip/hip_bf16.h> include is gated on __clang__ so a gcc host consumer never parses that clang-only header; it is pulled in before the __shfl_*_sync compat macros so newer ROCm's real __shfl_*_sync<...> functions are defined before the macros (guarded with __has_include so older ROCm, where the header has no such functions, is a harmless no-op). Tested on three AMD targets. PopSift ships no CPU reference, so the gates are run-to-run determinism, value sanity (finite, correctly normalized descriptors; in-bounds keypoints), and cross-architecture agreement between gfx90a and gfx1100 on the same images. - gfx90a (AMD Instinct MI250X, Linux, ROCm 7.2.1): popsift-demo gives 895 features / 1494 descriptors, identical across 5/5 runs; 0 NaN/Inf in all 191232 descriptor values; per-descriptor L2 in [0.9993, 1.0009], mean 1.0000; a rotated image gives 896/1484, also deterministic. popsift-match produces real finite distances and a sane ratio test. - gfx1100 (AMD Radeon Pro W7800, Linux, RDNA3 wave32, ROCm 7.2.1): the Oxford feature-detection images reproduce the gfx90a reference feature counts across all six images, deterministic across runs, 0 NaN/Inf across all descriptor modes. - gfx1151 (AMD Radeon 8060S APU, Windows, TheRock ROCm): SIFT on a real image gives non-zero, deterministic features (187/221 across runs), 0 NaN/Inf, RootSift L2 ~1.0, keypoints in bounds. Authored with the assistance of Claude (Anthropic).
There was a problem hiding this comment.
Code Review
This pull request adds support for AMD GPUs via ROCm/HIP to the PopSift library, enabling compilation and execution on AMD hardware while preserving the existing CUDA path. It introduces a CUDA-to-HIP compatibility shim, addresses warp-size differences (32-lane vs. 64-lane) in reduction and shuffle operations, and works around ROCm-specific limitations regarding layered image coherency and hardware bilinear filtering. The review feedback highlights two key areas for improvement: addressing a potential NaN corruption in s_desc_norm_rs.h when sum is extremely small, and initializing write_index in s_extrema.cu to prevent compiler or static analysis warnings regarding uninitialized variables.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Addresses automated review on the AMD support PR. s_desc_norm_rs.h: gate the RootSift divisor at a small threshold so a degenerate near-zero descriptor sum is treated as all-zero. The 0*inf=NaN case was already mapped to 0 by the existing fmaxf(.,0), but a tiny subnormal sum makes 1/sum overflow to +inf and a positive bin would then normalize to +inf; the threshold removes that. No effect on any descriptor with real gradient content. s_extrema.cu: initialize write_index. __shfl reads only lane 0, so the prior uninitialized value on other lanes was never used, but the initializer silences -Wmaybe-uninitialized. Both changes are HIP-guarded; the CUDA path is unchanged. Authored with the assistance of Claude (Anthropic).
|
On On |
Authored with the assistance of Claude (Anthropic).
Ports PopSift (GPU SIFT) to ROCm/HIP. A single compatibility header src/popsift/cuda_to_hip.h aliases the CUDA spellings to HIP under USE_HIP and is a no-op include of the CUDA headers otherwise; the .cu sources build as HIP, and the CUDA build path is byte-for-byte unchanged (every change is wrapped in USE_HIP / if(NOT USE_HIP)). README.md documents the USE_HIP build option, the build recipe, and the supported AMD targets.
Getting correct, deterministic SIFT output required addressing five distinct HIP/ROCm fault classes, each behind a USE_HIP guard.
Hardware linear-filter textures. HIP rejects a cudaFilterModeLinear + element-read float texture at creation, so the linear fetches do manual bilinear interpolation in software (point-sample the 4 neighbors and lerp, CUDA's -0.5 texel-center convention).
Layered cudaArray dimension collapse. A multi-layer cudaArrayLayered array written one layer per kernel launch collapses to a single, last-written layer on any later-launch read on this ROCm (tex2DLayered, surf2DLayeredread, and even host hipMemcpy3D all return the last layer for every index; see Layered cudaArray collapses to one layer: surf2DLayeredread/tex2DLayered return the last-written layer for every layer index (gfx90a) ROCm/clr#275, which PopSift motivated, and the partial fix [HIP] Fix incorrect ockl binding for layered surface functions ROCm/rocm-systems#6683 covering only surf2DLayered). The Gaussian pyramid and DoG arrays are converted to non-layered 3D arrays (surf3Dwrite/surf3Dread/tex3D with the level as a real z coordinate, done once in the compat header's write wrapper plus the texture-fetch helper), which is fully coherent across launches.
Wavefront-width-generic warp collectives. The bitonic sort, exclusive prefix sums, descriptor reductions and orientation ballots operate per warpSize-lane group rather than assuming a 32-lane warp. The extrema counter in s_extrema.cu is the critical case: the in-wavefront lane is the block's linear thread index modulo the actual warpSize, the ballot/popc/shuffle widths follow warpSize, and the leader is lane 0 of the real wavefront. On wave64 this reduces exactly to a 64-bit ballot with an exclusive-prefix slot assignment; on wave32 it is correct where a fixed 64-lane geometry would have lost atomicAdds and read phantom lanes.
RootSift and L2 NaN. The default RootSift normalization sqrt(bin/L1) and the L2 path are guarded against sqrt of a slightly-negative bin and the 0*inf of an all-zero descriptor.
Example linkage. The PopSift library links roc::rocthrust PRIVATE and exposes hip::host PUBLIC, so the plain-C++ examples (main.cpp/match.cpp/pgmread.cpp, which have no device code) compile as CXX and just link the HIP runtime, rather than being forced through the HIP compiler by a leaked hip::device interface. The compat header's forward-compat <hip/hip_bf16.h> include is gated on clang so a gcc host consumer never parses that clang-only header; it is pulled in before the _shfl_sync compat macros so newer ROCm's real _shfl_sync<...> functions are defined before the macros (guarded with __has_include so older ROCm, where the header has no such functions, is a harmless no-op).
Tested on three AMD targets. PopSift ships no CPU reference, so the gates are run-to-run determinism, value sanity (finite, correctly normalized descriptors; in-bounds keypoints), and cross-architecture agreement between gfx90a and gfx1100 on the same images.
Authored with the assistance of Claude (Anthropic).