Skip to content

[ROCm] HIP port for AMD GPUs#186

Open
jeffdaily wants to merge 3 commits into
alicevision:developfrom
jeffdaily:moat-port
Open

[ROCm] HIP port for AMD GPUs#186
jeffdaily wants to merge 3 commits into
alicevision:developfrom
jeffdaily:moat-port

Conversation

@jeffdaily

@jeffdaily jeffdaily commented Jun 3, 2026

Copy link
Copy Markdown

Ports PopSift (GPU SIFT) to ROCm/HIP. A single compatibility header src/popsift/cuda_to_hip.h aliases the CUDA spellings to HIP under USE_HIP and is a no-op include of the CUDA headers otherwise; the .cu sources build as HIP, and the CUDA build path is byte-for-byte unchanged (every change is wrapped in USE_HIP / if(NOT USE_HIP)). README.md documents the USE_HIP build option, the build recipe, and the supported AMD targets.

Getting correct, deterministic SIFT output required addressing five distinct HIP/ROCm fault classes, each behind a USE_HIP guard.

  1. Hardware linear-filter textures. HIP rejects a cudaFilterModeLinear + element-read float texture at creation, so the linear fetches do manual bilinear interpolation in software (point-sample the 4 neighbors and lerp, CUDA's -0.5 texel-center convention).

  2. Layered cudaArray dimension collapse. A multi-layer cudaArrayLayered array written one layer per kernel launch collapses to a single, last-written layer on any later-launch read on this ROCm (tex2DLayered, surf2DLayeredread, and even host hipMemcpy3D all return the last layer for every index; see Layered cudaArray collapses to one layer: surf2DLayeredread/tex2DLayered return the last-written layer for every layer index (gfx90a) ROCm/clr#275, which PopSift motivated, and the partial fix [HIP] Fix incorrect ockl binding for layered surface functions ROCm/rocm-systems#6683 covering only surf2DLayered). The Gaussian pyramid and DoG arrays are converted to non-layered 3D arrays (surf3Dwrite/surf3Dread/tex3D with the level as a real z coordinate, done once in the compat header's write wrapper plus the texture-fetch helper), which is fully coherent across launches.

  3. Wavefront-width-generic warp collectives. The bitonic sort, exclusive prefix sums, descriptor reductions and orientation ballots operate per warpSize-lane group rather than assuming a 32-lane warp. The extrema counter in s_extrema.cu is the critical case: the in-wavefront lane is the block's linear thread index modulo the actual warpSize, the ballot/popc/shuffle widths follow warpSize, and the leader is lane 0 of the real wavefront. On wave64 this reduces exactly to a 64-bit ballot with an exclusive-prefix slot assignment; on wave32 it is correct where a fixed 64-lane geometry would have lost atomicAdds and read phantom lanes.

  4. RootSift and L2 NaN. The default RootSift normalization sqrt(bin/L1) and the L2 path are guarded against sqrt of a slightly-negative bin and the 0*inf of an all-zero descriptor.

  5. Example linkage. The PopSift library links roc::rocthrust PRIVATE and exposes hip::host PUBLIC, so the plain-C++ examples (main.cpp/match.cpp/pgmread.cpp, which have no device code) compile as CXX and just link the HIP runtime, rather than being forced through the HIP compiler by a leaked hip::device interface. The compat header's forward-compat <hip/hip_bf16.h> include is gated on clang so a gcc host consumer never parses that clang-only header; it is pulled in before the _shfl_sync compat macros so newer ROCm's real _shfl_sync<...> functions are defined before the macros (guarded with __has_include so older ROCm, where the header has no such functions, is a harmless no-op).

Tested on three AMD targets. PopSift ships no CPU reference, so the gates are run-to-run determinism, value sanity (finite, correctly normalized descriptors; in-bounds keypoints), and cross-architecture agreement between gfx90a and gfx1100 on the same images.

  • gfx90a (AMD Instinct MI250X, Linux, ROCm 7.2.1): popsift-demo gives 895 features / 1494 descriptors, identical across 5/5 runs; 0 NaN/Inf in all 191232 descriptor values; per-descriptor L2 in [0.9993, 1.0009], mean 1.0000; a rotated image gives 896/1484, also deterministic. popsift-match produces real finite distances and a sane ratio test.
  • gfx1100 (AMD Radeon Pro W7800, Linux, RDNA3 wave32, ROCm 7.2.1): the Oxford feature-detection images reproduce the gfx90a reference feature counts across all six images, deterministic across runs, 0 NaN/Inf across all descriptor modes.
  • gfx1151 (AMD Radeon 8060S APU, Windows, TheRock ROCm): SIFT on a real image gives non-zero, deterministic features (187/221 across runs), 0 NaN/Inf, RootSift L2 ~1.0, keypoints in bounds.

Authored with the assistance of Claude (Anthropic).

Ports PopSift (GPU SIFT) to ROCm/HIP. A single compatibility header
src/popsift/cuda_to_hip.h aliases the CUDA spellings to HIP under USE_HIP and
is a no-op include of the CUDA headers otherwise; the .cu sources build as HIP,
and the CUDA build path is byte-for-byte unchanged (every change is wrapped in
USE_HIP / if(NOT USE_HIP)). README.md documents the USE_HIP build option, the
build recipe, and the supported AMD targets.

Getting correct, deterministic SIFT output required addressing five distinct
HIP/ROCm fault classes, each behind a USE_HIP guard.

1. Hardware linear-filter textures. HIP rejects a cudaFilterModeLinear +
element-read float texture at creation, so the linear fetches do manual
bilinear interpolation in software (point-sample the 4 neighbors and lerp,
CUDA's -0.5 texel-center convention).

2. Layered cudaArray dimension collapse. A multi-layer cudaArrayLayered array
written one layer per kernel launch collapses to a single, last-written layer
on any later-launch read on this ROCm (tex2DLayered, surf2DLayeredread, and
even host hipMemcpy3D all return the last layer for every index; see
ROCm/clr#275, which PopSift motivated, and the partial fix
ROCm/rocm-systems#6683 covering only surf2DLayered). The Gaussian pyramid and
DoG arrays are converted to non-layered 3D arrays (surf3Dwrite/surf3Dread/tex3D
with the level as a real z coordinate, done once in the compat header's write
wrapper plus the texture-fetch helper), which is fully coherent across launches.

3. Wavefront-width-generic warp collectives. The bitonic sort, exclusive prefix
sums, descriptor reductions and orientation ballots operate per warpSize-lane
group rather than assuming a 32-lane warp. The extrema counter in s_extrema.cu
is the critical case: the in-wavefront lane is the block's linear thread index
modulo the actual warpSize, the ballot/popc/shuffle widths follow warpSize, and
the leader is lane 0 of the real wavefront. On wave64 this reduces exactly to a
64-bit ballot with an exclusive-prefix slot assignment; on wave32 it is correct
where a fixed 64-lane geometry would have lost atomicAdds and read phantom lanes.

4. RootSift and L2 NaN. The default RootSift normalization sqrt(bin/L1) and the
L2 path are guarded against sqrt of a slightly-negative bin and the 0*inf of an
all-zero descriptor.

5. Example linkage. The PopSift library links roc::rocthrust PRIVATE and exposes
hip::host PUBLIC, so the plain-C++ examples (main.cpp/match.cpp/pgmread.cpp,
which have no device code) compile as CXX and just link the HIP runtime, rather
than being forced through the HIP compiler by a leaked hip::device interface.
The compat header's forward-compat <hip/hip_bf16.h> include is gated on __clang__
so a gcc host consumer never parses that clang-only header; it is pulled in
before the __shfl_*_sync compat macros so newer ROCm's real __shfl_*_sync<...>
functions are defined before the macros (guarded with __has_include so older
ROCm, where the header has no such functions, is a harmless no-op).

Tested on three AMD targets. PopSift ships no CPU
reference, so the gates are run-to-run determinism, value sanity (finite,
correctly normalized descriptors; in-bounds keypoints), and cross-architecture
agreement between gfx90a and gfx1100 on the same images.

- gfx90a (AMD Instinct MI250X, Linux, ROCm 7.2.1): popsift-demo gives 895
  features / 1494 descriptors, identical across 5/5 runs; 0 NaN/Inf in all
  191232 descriptor values; per-descriptor L2 in [0.9993, 1.0009], mean 1.0000;
  a rotated image gives 896/1484, also deterministic. popsift-match produces
  real finite distances and a sane ratio test.
- gfx1100 (AMD Radeon Pro W7800, Linux, RDNA3 wave32, ROCm 7.2.1): the Oxford
  feature-detection images reproduce the gfx90a reference feature counts across
  all six images, deterministic across runs, 0 NaN/Inf across all descriptor
  modes.
- gfx1151 (AMD Radeon 8060S APU, Windows, TheRock ROCm): SIFT on a real image
  gives non-zero, deterministic features (187/221 across runs), 0 NaN/Inf,
  RootSift L2 ~1.0, keypoints in bounds.

Authored with the assistance of Claude (Anthropic).
jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 3, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for AMD GPUs via ROCm/HIP to the PopSift library, enabling compilation and execution on AMD hardware while preserving the existing CUDA path. It introduces a CUDA-to-HIP compatibility shim, addresses warp-size differences (32-lane vs. 64-lane) in reduction and shuffle operations, and works around ROCm-specific limitations regarding layered image coherency and hardware bilinear filtering. The review feedback highlights two key areas for improvement: addressing a potential NaN corruption in s_desc_norm_rs.h when sum is extremely small, and initializing write_index in s_extrema.cu to prevent compiler or static analysis warnings regarding uninitialized variables.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/popsift/s_desc_norm_rs.h Outdated
Comment thread src/popsift/s_extrema.cu Outdated
Addresses automated review on the AMD support PR.

s_desc_norm_rs.h: gate the RootSift divisor at a small threshold so a
degenerate near-zero descriptor sum is treated as all-zero. The 0*inf=NaN
case was already mapped to 0 by the existing fmaxf(.,0), but a tiny subnormal
sum makes 1/sum overflow to +inf and a positive bin would then normalize to
+inf; the threshold removes that. No effect on any descriptor with real
gradient content.

s_extrema.cu: initialize write_index. __shfl reads only lane 0, so the prior
uninitialized value on other lanes was never used, but the initializer
silences -Wmaybe-uninitialized.

Both changes are HIP-guarded; the CUDA path is unchanged.

Authored with the assistance of Claude (Anthropic).
@jeffdaily

Copy link
Copy Markdown
Author

On s_extrema.cu: agreed, write_index will be initialized; __shfl reads only lane 0 so it was harmless, but the initializer silences the warning.

On s_desc_norm_rs.h: the 0 * inf = NaN case is already handled -- the result feeds fmaxf(descr.x * inv, 0.0f), and fmaxf returns 0.0f when one argument is NaN, so a zero bin normalizes to 0 rather than NaN (validation over 191232 descriptor values showed no NaN or Inf). The real residual is that a subnormal sum makes 1.0f/sum overflow to +inf, and a positive bin then yields +inf, which fmaxf(., 0) does not clamp. Guarding the divisor (treating a near-zero sum as an all-zero descriptor) closes that, so I've applied the threshold.

Authored with the assistance of Claude (Anthropic).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants