Add AMD GPU support via HIP/ROCm #264

Merged
fangq merged 2 commits into fangq:master from jeffdaily:moat-port
Jun 17, 2026
@jeffdaily

Contributor

This PR adds AMD GPU support to MCX by allowing the existing CUDA source to be compiled for AMD GPUs through HIP/ROCm, while keeping the default NVIDIA CUDA build completely unchanged.

The CUDA runtime API calls used by MCX are mapped to their HIP equivalents through a thin compatibility header (src/cuda_to_hip.h). On an NVIDIA build this header is a transparent passthrough to the CUDA runtime, so the NVIDIA code path compiles and behaves exactly as before; on a ROCm build (USE_HIP defined) it aliases the cuda* runtime symbols to the corresponding hip* calls. A second header (src/mcx_vector_types.h) supplies the float3/float4/uint3/uint4/int3/int4 vector types so the C, CUDA, and HIP translation units agree on layout and alignment regardless of which compiler (gcc/clang, nvcc, or hipcc) builds a given file.
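
As a rough illustration of the aliasing idea described above, the sketch below shows how CUDA-spelled calls can resolve to HIP symbols through preprocessor macros. It is not the contents of src/cuda_to_hip.h: the `hipMalloc`/`hipFree` bodies are host-only stand-ins (assumptions, backed by `malloc`/`free`) so the example builds without ROCm, the enum values are placeholders, and `alloc_roundtrip` is a hypothetical caller.

```cpp
#include <cstddef>
#include <cstdlib>

// --- Host-only stand-ins so this sketch builds without ROCm. ---
// In a real USE_HIP build these names come from <hip/hip_runtime.h>;
// the enum values here are placeholders, not the real HIP error codes.
enum hipError_t { hipSuccess = 0, hipErrorOutOfMemory = 1 };

inline hipError_t hipMalloc(void** ptr, std::size_t nbytes) {
    *ptr = std::malloc(nbytes);
    return *ptr ? hipSuccess : hipErrorOutOfMemory;
}

inline hipError_t hipFree(void* ptr) {
    std::free(ptr);
    return hipSuccess;
}

// --- The aliasing layer: cuda* spellings resolve to hip* symbols. ---
#define cudaError_t hipError_t
#define cudaSuccess hipSuccess
#define cudaMalloc  hipMalloc
#define cudaFree    hipFree

// --- Hypothetical caller written against the CUDA spelling, unmodified. ---
inline bool alloc_roundtrip(std::size_t nbytes) {
    void* buf = nullptr;
    cudaError_t err = cudaMalloc(&buf, nbytes);
    if (err != cudaSuccess) {
        return false;
    }
    return cudaFree(buf) == cudaSuccess;
}
```

On an NVIDIA build the macro block would simply be absent and the same caller would bind to the CUDA runtime directly, which is what makes the header a passthrough there.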

The CMake build gains a USE_HIP option. When enabled it turns on the HIP language, compiles the .cu GPU sources with hipcc, links against hip::device/hip::host, and selects the target GPU architecture via CMAKE_HIP_ARCHITECTURES. When USE_HIP is off (the default) the build is the unmodified CUDA build.

Two sets of small fixes are included so the simulation builds and produces correct results on AMD GPUs: a boundary-reflection expression in the device code that was being miscompiled on AMD targets is rewritten to an equivalent form, and the Windows build is fixed for the ROCm toolchain (-fPIC, the WIN32 macro, an import-library conflict, and a binary fread). These do not change behavior on the NVIDIA path.
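
To make the binary fread point concrete: on Windows, a stream opened in text mode translates CRLF to LF, so fread can return fewer bytes than the size reported by ftell. The sketch below shows the general pattern of opening in binary mode and checking the two counts against each other. `read_whole_file` is a hypothetical helper for illustration, not the actual mcx_loadseedjdat code.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: read an entire file and verify that the number of
// bytes read matches the size reported by ftell.
inline bool read_whole_file(const char* path, std::vector<unsigned char>& out) {
    // "rb" disables newline translation, so byte counts match ftell.
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) {
        return false;
    }
    if (std::fseek(fp, 0, SEEK_END) != 0) {
        std::fclose(fp);
        return false;
    }
    long size = std::ftell(fp);
    if (size < 0) {
        std::fclose(fp);
        return false;
    }
    std::rewind(fp);

    out.resize(static_cast<std::size_t>(size));
    std::size_t got = 0;
    if (size > 0) {
        got = std::fread(out.data(), 1, out.size(), fp);
    }
    std::fclose(fp);

    // With a text-mode stream on Windows this check could fail for files
    // containing CRLF pairs; in binary mode the counts agree.
    return got == out.size();
}
```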

How to build the ROCm version

Configure with HIP enabled and pick the target architecture:

  cmake -S src -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a
  cmake --build build

CMAKE_HIP_ARCHITECTURES accepts any ROCm GPU target (for example gfx90a or gfx1100) and defaults to gfx90a if omitted. Building without -DUSE_HIP=ON produces the usual CUDA binary.

Validation

The ROCm build was exercised on real AMD hardware end to end on Linux (CDNA2 gfx90a and RDNA3 gfx1100) and Windows (RDNA4 gfx1201), running MCX simulations and confirming the fluence and detected-photon outputs match the expected results. The default NVIDIA CUDA build is unchanged.

Adds AMD GPU support to MCX by compiling the existing CUDA sources for
AMD GPUs through HIP/ROCm, while keeping the default NVIDIA CUDA build
unchanged. Enabled with -DUSE_HIP=ON; without it the build is exactly
as before.

Review in this order:

1. src/cuda_to_hip.h and src/mcx_vector_types.h (new): the
   compatibility layer. cuda_to_hip.h maps the CUDA runtime API used by
   MCX to its HIP equivalents and is a transparent passthrough on the
   NVIDIA build; mcx_vector_types.h supplies float3/float4/uint3/... so
   the C, CUDA, and HIP translation units agree on layout and alignment
   regardless of which compiler builds a given file.

2. CMake (src/CMakeLists.txt, src/zmat/CMakeLists.txt): the USE_HIP
   option. When on, it enables the HIP language, compiles the .cu
   sources with hipcc, links hip::device/hip::host, and selects the
   target GPU via CMAKE_HIP_ARCHITECTURES. Off by default, so the CUDA
   build is untouched.

3. A boundary-reflection expression rewritten to an equivalent form
   that is computed correctly on AMD GPUs. No behavior change on the
   NVIDIA path.

4. Four Windows fixes for the all-clang ROCm toolchain: guard -fPIC
   with if(NOT WIN32) (clang targeting MSVC rejects it) and define
   WIN32 so the existing source guards apply; redirect the executable's
   import library so it does not overwrite the static mcx library at
   link time; mkdir -> _mkdir; and a binary-mode read in
   mcx_loadseedjdat so the byte count stays consistent with ftell.

5. README: a HIP/ROCm build note alongside the existing CUDA
   requirements, and AMD copyright/author attribution on the two new
   headers.
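
The layout agreement mentioned in item 1 can be pinned down with compile-time checks. The sketch below assumes the header mirrors CUDA's documented layout, where the 4-component types are 16-byte aligned and the 3-component types are packed with the alignment of their element type. The `sketch_*` names are hypothetical and are not the definitions in src/mcx_vector_types.h.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the vector types; the real definitions live in
// src/mcx_vector_types.h and are not reproduced here.
struct sketch_float3 { float x, y, z; };
struct alignas(16) sketch_float4 { float x, y, z, w; };
struct sketch_uint3 { unsigned int x, y, z; };
struct alignas(16) sketch_uint4 { unsigned int x, y, z, w; };

// A record shared between host and device code: if one compiler treated
// sketch_float4 as 4-byte aligned, 'pos' would sit at offset 12 instead of 16
// and the two sides would disagree on every field after it.
struct sketch_record {
    sketch_float3 dir;
    sketch_float4 pos;
};

static_assert(sizeof(sketch_float3) == 3 * sizeof(float), "float3 is packed");
static_assert(alignof(sketch_float3) == alignof(float), "float3 uses element alignment");
static_assert(sizeof(sketch_float4) == 16, "float4 is 16 bytes");
static_assert(alignof(sketch_float4) == 16, "float4 is 16-byte aligned");
static_assert(sizeof(sketch_uint3) == 3 * sizeof(unsigned int), "uint3 is packed");
static_assert(alignof(sketch_uint4) == 16, "uint4 is 16-byte aligned");
static_assert(offsetof(sketch_record, pos) == 16, "pos starts on a 16-byte boundary");
static_assert(sizeof(sketch_record) == 32, "record size is identical everywhere");
```

Placing checks like these in a header included by every translation unit turns a silent layout mismatch into a build error under whichever compiler disagrees.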

Authored with assistance from Claude.

Test Plan:

Linux gfx90a (AMD CDNA2, ROCm), the AMD path:

  cmake -S src -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a
  cmake --build build -j
  # MCX simulations run end to end; fluence and detected-photon
  # outputs match the expected results.

Windows RX 9070 XT (gfx1201), TheRock ROCm, all-clang toolchain, Ninja:

  cmake ../src -G Ninja -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1201 \
      -DBUILD_MEX=OFF -DBUILD_PYTHON=OFF -DCMAKE_BUILD_TYPE=Release
  cmake --build . -j

The default NVIDIA CUDA build (no -DUSE_HIP) is unchanged.
@fangq

fangq commented Jun 16, 2026

Owner

@jeffdaily, thanks for the PR.

My student, @matinraayai, had already implemented a HIP-based MCX about 3-4 years ago in an internal branch. Our impression at the time was that the HIP-based MCX ran at a speed comparable to the CUDA-based MCX on NVIDIA GPUs, but was 2x slower than the OpenCL-based mcxcl.

With the additional overhead of installing and linking against the ROCm toolchain, and the lower speed, we did not feel there was a benefit in releasing the HIP-based MCX.

I am sure ROCm has changed a lot since our last visit. What is your experience in terms of speed? Did you compare it with mcxcl? One built-in benchmark that lets you test the speed is the mcxcontest script, which can be run in both mcx and mcxcl.

The HIP build accumulates the fluence grid with float atomicAdd. Without
-munsafe-fp-atomics, clang lowers this to a compare-and-swap retry loop on
gfx90a instead of the native global_atomic_add_f32 instruction, which dominates
runtime for this atomic-heavy Monte Carlo accumulator. Adding the flag, plus
-ffast-math (the HIP counterpart of the CUDA build's -use_fast_math), speeds up
the built-in benchmarks by ~4.3x (speedsum) on an MI250X (gfx90a) with the
physics results unchanged: absorbed fraction matches the non-fast-math build
across cube60, cube60b, cube60planar, cubesph60b, skinvessel, sphshells, and
spherebox to within Monte Carlo noise.
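
For readers unfamiliar with the fallback being described, the sketch below shows what a compare-and-swap retry loop for a float add looks like. It is a host-side analogy using `std::atomic<float>`, not the device code clang generates, and `atomic_add_cas` is a hypothetical name. A native FP atomic performs the add in a single instruction, whereas this loop must re-read and retry every time another thread updates the same cell first, which is costly when many threads accumulate into the same grid.

```cpp
#include <atomic>

// Hypothetical host-side analogy of a CAS-based float atomic add.
// Returns the value held before the add, like CUDA/HIP atomicAdd.
inline float atomic_add_cas(std::atomic<float>& cell, float value) {
    float expected = cell.load(std::memory_order_relaxed);
    // compare_exchange_weak stores expected + value only if the cell still
    // holds 'expected'. On failure it reloads 'expected' with the current
    // contents, and the loop recomputes the sum and tries again.
    while (!cell.compare_exchange_weak(expected, expected + value,
                                       std::memory_order_relaxed)) {
        // Retry: another writer got in first (or the weak CAS failed
        // spuriously).
    }
    return expected;
}
```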

Authored with Claude.
@jeffdaily

Contributor Author

Thanks @fangq. I ran the comparison you suggested on an AMD Instinct MI250X (gfx90a, ROCm 7.2.1), this PR's HIP build vs mcxcl, using the three mcxcontest benchmarks at -n 1e8:

| benchmark | MCX (HIP, this PR), photon/ms | mcxcl, photon/ms |
|---|---|---|
| cube60 | 10,800 | 50,400 |
| cubesph60b | 4,060 | 29,300 |
| cube60planar | 3,190 | 34,800 |
| speedsum | ~18,100 | ~114,500 |

So on this GPU mcxcl is about 6x faster than the HIP MCX by speedsum, consistent with your earlier impression. Since the two share the same algorithm on the same GPU, I dug into why:

  1. The main HIP-side issue was a real codegen bug, now fixed in this PR. The float atomicAdd into the fluence grid was lowering to a compare-and-swap retry loop instead of the native global_atomic_add_f32, because clang needs -munsafe-fp-atomics to emit the hardware FP-atomic on CDNA/RDNA. Adding that (plus -ffast-math, the counterpart of the CUDA build's -use_fast_math) sped up the built-in benchmarks several-fold (about 8x on cube60). So this PR is already much faster than a straight HIP port.

  2. The remaining gap looks like the same architectural difference that makes mcxcl faster than MCX on NVIDIA too. mcxcl JIT-compiles its kernel per run with the simulation parameters baked in as compile-time constants (your optlevel-3 USE_MACRO_CONST), which the ahead-of-time CUDA/HIP MCX cannot do. Beyond that, comparing the generated gfx90a ISA, the OpenCL-C frontend produces a leaner hot loop than the C++/HIP frontend for the equivalent kernel on the same LLVM backend (the HIP build executes ~3.4x more instructions per photon even after I baked the constants in). That part is a compiler-frontend difference rather than anything ROCm-runtime-specific.
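
To illustrate what "baked in as compile-time constants" means in point 2, the sketch below contrasts a function that receives its parameters at run time with a template whose parameters are fixed per instantiation. This is a plain C++ analogy under stated assumptions: mcxcl achieves the effect through JIT compilation of OpenCL C rather than templates, and the function names and arithmetic here are hypothetical placeholders, not MCX physics.

```cpp
// Runtime-parameter form: 'steps' and 'reflect' are only known when the
// function is called, so the loop bound and the branch stay in the
// generated code.
inline float attenuate_runtime(float weight, float keep, int steps, bool reflect) {
    for (int i = 0; i < steps; ++i) {
        weight *= keep;
    }
    if (reflect) {
        weight *= 0.5f;
    }
    return weight;
}

// Baked-in form: the same parameters are template arguments, so each
// instantiation sees them as constants and the compiler is free to unroll
// the loop and drop the untaken branch.
template <int Steps, bool Reflect>
inline float attenuate_baked(float weight, float keep) {
    for (int i = 0; i < Steps; ++i) {
        weight *= keep;
    }
    if (Reflect) {
        weight *= 0.5f;
    }
    return weight;
}
```

Both forms compute the same result; the difference is only in how much the compiler knows when it generates the hot loop. An ahead-of-time build would need one instantiation per parameter combination, which is why a per-run JIT has an advantage here.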

Honest summary: this PR makes MCX build and run correctly on AMD GPUs with the CUDA path unchanged, at performance in the same class as the CUDA MCX; mcxcl stays the faster option on AMD just as on NVIDIA. The value is a working, validated AMD backend for the MCX codebase and the pmcx/mcxlab ecosystem, plus the atomics fix. I also have a few AMD-specific optimizations validated (denormal flush, __restrict__, native intrinsics; ~13% more) that I kept out to keep this a clean CUDA-to-HIP mapping, but I can add them or gate them behind an optional flag if you'd like.

All benchmarks were correctness-checked: absorbed fraction matches the reference to within Monte Carlo noise across cube60, cube60b, cube60planar, cubesph60b, skinvessel, sphshells, spherebox. Happy to share the full ISA/instruction-count analysis if useful.

@fangq fangq merged commit 7826eb9 into fangq:master Jun 17, 2026
31 checks passed
@fangq

fangq commented Jun 17, 2026

Owner

@jeffdaily, thank you for porting mcx to HIP/ROCm. I also appreciate the insights on the compiler instruction counts, and passing the findings on to the compiler team. I have merged the patch.

I will also try it on some of my AMD GPUs. If there are any additional patches that can help HIP performance, feel free to create a new PR; I'd be happy to make ROCm a viable platform for mcx.
