[ROCm] HIP port for AMD GPUs by jeffdaily · Pull Request #186 · alicevision/popsift

jeffdaily · 2026-06-03T22:32:30Z

Ports PopSift (GPU SIFT) to ROCm/HIP. A single compatibility header src/popsift/cuda_to_hip.h aliases the CUDA spellings to HIP under USE_HIP and is a no-op include of the CUDA headers otherwise; the .cu sources build as HIP, and the CUDA build path is byte-for-byte unchanged (every change is wrapped in USE_HIP / if(NOT USE_HIP)). README.md documents the USE_HIP build option, the build recipe, and the supported AMD targets.

Getting correct, deterministic SIFT output required addressing five distinct HIP/ROCm fault classes, each behind a USE_HIP guard.

Hardware linear-filter textures. HIP rejects a cudaFilterModeLinear + element-read float texture at creation, so the linear fetches do manual bilinear interpolation in software (point-sample the 4 neighbors and lerp, CUDA's -0.5 texel-center convention).
Layered cudaArray dimension collapse. A multi-layer cudaArrayLayered array written one layer per kernel launch collapses to a single, last-written layer on any later-launch read on this ROCm (tex2DLayered, surf2DLayeredread, and even host hipMemcpy3D all return the last layer for every index; see Layered cudaArray collapses to one layer: surf2DLayeredread/tex2DLayered return the last-written layer for every layer index (gfx90a) ROCm/clr#275, which PopSift motivated, and the partial fix [HIP] Fix incorrect ockl binding for layered surface functions ROCm/rocm-systems#6683 covering only surf2DLayered). The Gaussian pyramid and DoG arrays are converted to non-layered 3D arrays (surf3Dwrite/surf3Dread/tex3D with the level as a real z coordinate, done once in the compat header's write wrapper plus the texture-fetch helper), which is fully coherent across launches.
Wavefront-width-generic warp collectives. The bitonic sort, exclusive prefix sums, descriptor reductions and orientation ballots operate per warpSize-lane group rather than assuming a 32-lane warp. The extrema counter in s_extrema.cu is the critical case: the in-wavefront lane is the block's linear thread index modulo the actual warpSize, the ballot/popc/shuffle widths follow warpSize, and the leader is lane 0 of the real wavefront. On wave64 this reduces exactly to a 64-bit ballot with an exclusive-prefix slot assignment; on wave32 it is correct where a fixed 64-lane geometry would have lost atomicAdds and read phantom lanes.
RootSift and L2 NaN. The default RootSift normalization sqrt(bin/L1) and the L2 path are guarded against sqrt of a slightly-negative bin and the 0*inf of an all-zero descriptor.
Example linkage. The PopSift library links roc::rocthrust PRIVATE and exposes hip::host PUBLIC, so the plain-C++ examples (main.cpp/match.cpp/pgmread.cpp, which have no device code) compile as CXX and just link the HIP runtime, rather than being forced through the HIP compiler by a leaked hip::device interface. The compat header's forward-compat <hip/hip_bf16.h> include is gated on clang so a gcc host consumer never parses that clang-only header; it is pulled in before the _shfl_sync compat macros so newer ROCm's real _shfl_sync<...> functions are defined before the macros (guarded with __has_include so older ROCm, where the header has no such functions, is a harmless no-op).

Tested on three AMD targets. PopSift ships no CPU reference, so the gates are run-to-run determinism, value sanity (finite, correctly normalized descriptors; in-bounds keypoints), and cross-architecture agreement between gfx90a and gfx1100 on the same images.

gfx90a (AMD Instinct MI250X, Linux, ROCm 7.2.1): popsift-demo gives 895 features / 1494 descriptors, identical across 5/5 runs; 0 NaN/Inf in all 191232 descriptor values; per-descriptor L2 in [0.9993, 1.0009], mean 1.0000; a rotated image gives 896/1484, also deterministic. popsift-match produces real finite distances and a sane ratio test.
gfx1100 (AMD Radeon Pro W7800, Linux, RDNA3 wave32, ROCm 7.2.1): the Oxford feature-detection images reproduce the gfx90a reference feature counts across all six images, deterministic across runs, 0 NaN/Inf across all descriptor modes.
gfx1151 (AMD Radeon 8060S APU, Windows, TheRock ROCm): SIFT on a real image gives non-zero, deterministic features (187/221 across runs), 0 NaN/Inf, RootSift L2 ~1.0, keypoints in bounds.

Authored with the assistance of Claude (Anthropic).

Ports PopSift (GPU SIFT) to ROCm/HIP. A single compatibility header src/popsift/cuda_to_hip.h aliases the CUDA spellings to HIP under USE_HIP and is a no-op include of the CUDA headers otherwise; the .cu sources build as HIP, and the CUDA build path is byte-for-byte unchanged (every change is wrapped in USE_HIP / if(NOT USE_HIP)). README.md documents the USE_HIP build option, the build recipe, and the supported AMD targets. Getting correct, deterministic SIFT output required addressing five distinct HIP/ROCm fault classes, each behind a USE_HIP guard. 1. Hardware linear-filter textures. HIP rejects a cudaFilterModeLinear + element-read float texture at creation, so the linear fetches do manual bilinear interpolation in software (point-sample the 4 neighbors and lerp, CUDA's -0.5 texel-center convention). 2. Layered cudaArray dimension collapse. A multi-layer cudaArrayLayered array written one layer per kernel launch collapses to a single, last-written layer on any later-launch read on this ROCm (tex2DLayered, surf2DLayeredread, and even host hipMemcpy3D all return the last layer for every index; see ROCm/clr#275, which PopSift motivated, and the partial fix ROCm/rocm-systems#6683 covering only surf2DLayered). The Gaussian pyramid and DoG arrays are converted to non-layered 3D arrays (surf3Dwrite/surf3Dread/tex3D with the level as a real z coordinate, done once in the compat header's write wrapper plus the texture-fetch helper), which is fully coherent across launches. 3. Wavefront-width-generic warp collectives. The bitonic sort, exclusive prefix sums, descriptor reductions and orientation ballots operate per warpSize-lane group rather than assuming a 32-lane warp. The extrema counter in s_extrema.cu is the critical case: the in-wavefront lane is the block's linear thread index modulo the actual warpSize, the ballot/popc/shuffle widths follow warpSize, and the leader is lane 0 of the real wavefront. On wave64 this reduces exactly to a 64-bit ballot with an exclusive-prefix slot assignment; on wave32 it is correct where a fixed 64-lane geometry would have lost atomicAdds and read phantom lanes. 4. RootSift and L2 NaN. The default RootSift normalization sqrt(bin/L1) and the L2 path are guarded against sqrt of a slightly-negative bin and the 0*inf of an all-zero descriptor. 5. Example linkage. The PopSift library links roc::rocthrust PRIVATE and exposes hip::host PUBLIC, so the plain-C++ examples (main.cpp/match.cpp/pgmread.cpp, which have no device code) compile as CXX and just link the HIP runtime, rather than being forced through the HIP compiler by a leaked hip::device interface. The compat header's forward-compat <hip/hip_bf16.h> include is gated on __clang__ so a gcc host consumer never parses that clang-only header; it is pulled in before the __shfl_*_sync compat macros so newer ROCm's real __shfl_*_sync<...> functions are defined before the macros (guarded with __has_include so older ROCm, where the header has no such functions, is a harmless no-op). Tested on three AMD targets. PopSift ships no CPU reference, so the gates are run-to-run determinism, value sanity (finite, correctly normalized descriptors; in-bounds keypoints), and cross-architecture agreement between gfx90a and gfx1100 on the same images. - gfx90a (AMD Instinct MI250X, Linux, ROCm 7.2.1): popsift-demo gives 895 features / 1494 descriptors, identical across 5/5 runs; 0 NaN/Inf in all 191232 descriptor values; per-descriptor L2 in [0.9993, 1.0009], mean 1.0000; a rotated image gives 896/1484, also deterministic. popsift-match produces real finite distances and a sane ratio test. - gfx1100 (AMD Radeon Pro W7800, Linux, RDNA3 wave32, ROCm 7.2.1): the Oxford feature-detection images reproduce the gfx90a reference feature counts across all six images, deterministic across runs, 0 NaN/Inf across all descriptor modes. - gfx1151 (AMD Radeon 8060S APU, Windows, TheRock ROCm): SIFT on a real image gives non-zero, deterministic features (187/221 across runs), 0 NaN/Inf, RootSift L2 ~1.0, keypoints in bounds. Authored with the assistance of Claude (Anthropic).

… GPUs)

gemini-code-assist

Code Review

This pull request adds support for AMD GPUs via ROCm/HIP to the PopSift library, enabling compilation and execution on AMD hardware while preserving the existing CUDA path. It introduces a CUDA-to-HIP compatibility shim, addresses warp-size differences (32-lane vs. 64-lane) in reduction and shuffle operations, and works around ROCm-specific limitations regarding layered image coherency and hardware bilinear filtering. The review feedback highlights two key areas for improvement: addressing a potential NaN corruption in s_desc_norm_rs.h when sum is extremely small, and initializing write_index in s_extrema.cu to prevent compiler or static analysis warnings regarding uninitialized variables.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Addresses automated review on the AMD support PR. s_desc_norm_rs.h: gate the RootSift divisor at a small threshold so a degenerate near-zero descriptor sum is treated as all-zero. The 0*inf=NaN case was already mapped to 0 by the existing fmaxf(.,0), but a tiny subnormal sum makes 1/sum overflow to +inf and a positive bin would then normalize to +inf; the threshold removes that. No effect on any descriptor with real gradient content. s_extrema.cu: initialize write_index. __shfl reads only lane 0, so the prior uninitialized value on other lanes was never used, but the initializer silences -Wmaybe-uninitialized. Both changes are HIP-guarded; the CUDA path is unchanged. Authored with the assistance of Claude (Anthropic).

jeffdaily · 2026-06-03T23:05:07Z

On s_extrema.cu: agreed, write_index will be initialized; __shfl reads only lane 0 so it was harmless, but the initializer silences the warning.

On s_desc_norm_rs.h: the 0 * inf = NaN case is already handled -- the result feeds fmaxf(descr.x * inv, 0.0f), and fmaxf returns 0.0f when one argument is NaN, so a zero bin normalizes to 0 rather than NaN (validation over 191232 descriptor values showed no NaN or Inf). The real residual is that a subnormal sum makes 1.0f/sum overflow to +inf, and a positive bin then yields +inf, which fmaxf(., 0) does not clamp. Guarding the divisor (treating a near-zero sum as an all-zero descriptor) closes that, so I've applied the threshold.

Authored with the assistance of Claude (Anthropic).

jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 3, 2026

popsift: record upstream PR alicevision/popsift#186 (HIP port for AMD…

691a2a6

… GPUs)

gemini-code-assist Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/popsift/s_desc_norm_rs.h Outdated

Comment thread src/popsift/s_extrema.cu Outdated

[ROCm] Document the USE_HIP (AMD GPU) build in the install guide

87597ed

Authored with the assistance of Claude (Anthropic).

simogasp assigned griwodz Jun 16, 2026

simogasp added type:feature type:enhancement refactoring labels Jun 16, 2026

simogasp added this to the v0.11.0 milestone Jun 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ROCm] HIP port for AMD GPUs#186

[ROCm] HIP port for AMD GPUs#186
jeffdaily wants to merge 3 commits into
alicevision:developfrom
jeffdaily:moat-port

jeffdaily commented Jun 3, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

jeffdaily commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jeffdaily commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

jeffdaily commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jeffdaily commented Jun 3, 2026 •

edited

Loading