Add AMD GPU support via HIP/ROCm #264
Adds AMD GPU support to MCX by compiling the existing CUDA sources for
AMD GPUs through HIP/ROCm, while keeping the default NVIDIA CUDA build
unchanged. Enabled with -DUSE_HIP=ON; without it the build is exactly
as before.
Review in this order:
1. src/cuda_to_hip.h and src/mcx_vector_types.h (new): the
compatibility layer. cuda_to_hip.h maps the CUDA runtime API used by
MCX to its HIP equivalents and is a transparent passthrough on the
NVIDIA build; mcx_vector_types.h supplies float3/float4/uint3/... so
the C, CUDA, and HIP translation units agree on layout and alignment
regardless of which compiler builds a given file.
2. CMake (src/CMakeLists.txt, src/zmat/CMakeLists.txt): the USE_HIP
option. When on, it enables the HIP language, compiles the .cu
sources with hipcc, links hip::device/hip::host, and selects the
target GPU via CMAKE_HIP_ARCHITECTURES. Off by default, so the CUDA
build is untouched.
3. A boundary-reflection expression rewritten to an equivalent form
that is computed correctly on AMD GPUs. No behavior change on the
NVIDIA path.
4. Four Windows fixes for the all-clang ROCm toolchain: guard -fPIC
with if(NOT WIN32) (clang targeting MSVC rejects it) and define
WIN32 so the existing source guards apply; redirect the executable's
import library so it does not overwrite the static mcx library at
link time; mkdir -> _mkdir; and a binary-mode read in
mcx_loadseedjdat so the byte count stays consistent with ftell.
5. README: a HIP/ROCm build note alongside the existing CUDA
requirements, and AMD copyright/author attribution on the two new
headers.
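The layout agreement described in item 1 can be pictured with a small host-side sketch. It is an illustration only: the `demo_*` type names are hypothetical stand-ins and the real definitions in src/mcx_vector_types.h may look different. The sizes and alignments asserted here are the ones CUDA uses for its built-in `float3` (12 bytes, 4-byte aligned) and `float4` (16 bytes, 16-byte aligned).

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas, alignof (C11) */
#include <stddef.h>    /* offsetof, size_t */

/* Hypothetical stand-ins; NOT the contents of src/mcx_vector_types.h. */
typedef struct { float x, y, z; } demo_float3;                      /* packed, 4-byte aligned */
typedef struct { alignas(16) float x; float y, z, w; } demo_float4; /* 16-byte aligned */

/* Compile-time checks: every compiler that builds this file must agree,
 * otherwise a struct shared between host and device code would be
 * read with the wrong offsets. */
static_assert(sizeof(demo_float3) == 12, "float3 must be 12 bytes");
static_assert(alignof(demo_float3) == 4, "float3 must be 4-byte aligned");
static_assert(sizeof(demo_float4) == 16, "float4 must be 16 bytes");
static_assert(alignof(demo_float4) == 16, "float4 must be 16-byte aligned");

/* Runtime view of the same facts: byte offset of the last component. */
size_t demo_float4_offset_w(void)
{
    return offsetof(demo_float4, w);
}
```

Putting the checks in a shared header means a mismatch fails at compile time in whichever translation unit disagrees, rather than showing up as corrupted parameters at run time.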
Authored with assistance from Claude.
Test Plan:
Linux gfx90a (AMD CDNA2, ROCm), the AMD path:
cmake -S src -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a
cmake --build build -j
# MCX simulations run end to end; fluence and detected-photon
# outputs match the expected results.
Windows RX 9070 XT (gfx1201), TheRock ROCm, all-clang toolchain, Ninja:
cmake ../src -G Ninja -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1201 \
-DBUILD_MEX=OFF -DBUILD_PYTHON=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
The default NVIDIA CUDA build (no -DUSE_HIP) is unchanged.
@jeffdaily, thanks for the PR. My student, @matinraayai, implemented a HIP-based MCX about 3-4 years ago in an internal branch. Our impression at the time was that the HIP-based MCX showed speed comparable to the CUDA-based MCX on NVIDIA GPUs, but was 2x slower than the OpenCL-based mcxcl. Given the additional overhead of installing and linking against the ROCm toolchain, plus the lower speed, we did not see a benefit in releasing the HIP-based MCX. I am sure ROCm has changed a lot since we last looked at it. What is your experience in terms of speed? Did you compare it with mcxcl? One built-in benchmark that lets you test the speed is the mcxcontest script, which can be run in both mcx and mcxcl.
The HIP build accumulates the fluence grid with float atomicAdd. Without -munsafe-fp-atomics, clang lowers this to a compare-and-swap retry loop on gfx90a instead of the native global_atomic_add_f32 instruction, which dominates runtime for this atomic-heavy Monte Carlo accumulator. Adding the flag, plus -ffast-math (the HIP counterpart of the CUDA build's -use_fast_math), speeds up the built-in benchmarks by ~4.3x (speedsum) on an MI250X (gfx90a) with the physics results unchanged: absorbed fraction matches the non-fast-math build across cube60, cube60b, cube60planar, cubesph60b, skinvessel, sphshells, and spherebox to within Monte Carlo noise. Authored with Claude.
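To make the cost difference concrete, here is a host-side C11 sketch of what a compare-and-swap retry loop for a float add looks like. It is an analogy under stated assumptions, not the code clang emits for gfx90a and not MCX source: the function names are hypothetical, and the float is stored as raw 32-bit bits so that standard `stdatomic.h` operations can be used.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Bit-cast helpers (memcpy avoids strict-aliasing problems). */
uint32_t demo_float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
float demo_bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Hypothetical CAS-based float add: adds `value` to the float whose bits
 * are stored in *cell and returns the previous value. Every failed
 * compare-exchange means another thread updated the cell first, so the
 * add is recomputed and retried; under heavy contention the loop spins
 * many times per update. */
float demo_cas_atomic_add_f32(_Atomic uint32_t *cell, float value)
{
    uint32_t old_bits = atomic_load(cell);
    uint32_t new_bits;
    do {
        new_bits = demo_float_bits(demo_bits_float(old_bits) + value);
        /* On failure, old_bits is refreshed with the current contents. */
    } while (!atomic_compare_exchange_weak(cell, &old_bits, new_bits));
    return demo_bits_float(old_bits);
}
```

A native float atomic add performs the same update as a single memory operation with no retry, which is why the choice of lowering matters so much for an accumulator that many threads hit at once.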
Thanks @fangq. I ran the comparison you suggested on an AMD Instinct MI250X (gfx90a, ROCm 7.2.1), this PR's HIP build vs mcxcl, using the three
So on this GPU mcxcl is about 6x faster than the HIP MCX by speedsum, consistent with your earlier impression. Since the two share the same algorithm on the same GPU, I dug into why:
Honest summary: this PR makes MCX build and run correctly on AMD GPUs with the CUDA path unchanged, at performance in the same class as the CUDA MCX; mcxcl stays the faster option on AMD just as on NVIDIA. The value is a working, validated AMD backend for the MCX codebase and the pmcx/mcxlab ecosystem, plus the atomics fix. I also have a few AMD-specific optimizations validated (denormal flush, All benchmarks were correctness-checked: absorbed fraction matches the reference to within Monte Carlo noise across cube60, cube60b, cube60planar, cubesph60b, skinvessel, sphshells, spherebox. Happy to share the full ISA/instruction-count analysis if useful.
@jeffdaily, thank you for porting mcx to HIP/ROCm. I also appreciate the insights on the compiler instruction counts, and for passing the findings on to the compiler team. I have merged the patch. I will also try it on some of my AMD GPUs. If there are any additional patches that can help HIP performance, feel free to create a new PR; I'd be happy to make ROCm a viable platform for mcx.
This PR adds AMD GPU support to MCX by allowing the existing CUDA source to be compiled for AMD GPUs through HIP/ROCm, while keeping the default NVIDIA CUDA build completely unchanged.
The CUDA runtime API calls used by MCX are mapped to their HIP equivalents through a thin compatibility header (src/cuda_to_hip.h). On an NVIDIA build this header is a transparent passthrough to the CUDA runtime, so the NVIDIA code path compiles and behaves exactly as before; on a ROCm build (USE_HIP defined) it aliases the cuda* runtime symbols to the corresponding hip* calls. A second header (src/mcx_vector_types.h) supplies the float3/float4/uint3/uint4/int3/int4 vector types so the C, CUDA, and HIP translation units agree on layout and alignment regardless of which compiler (gcc/clang, nvcc, or hipcc) builds a given file.
The CMake build gains a USE_HIP option. When enabled it turns on the HIP language, compiles the .cu GPU sources with hipcc, links against hip::device/hip::host, and selects the target GPU architecture via CMAKE_HIP_ARCHITECTURES. When USE_HIP is off (the default) the build is the unmodified CUDA build.
Two small device-code corrections are included so the simulation produces correct results on AMD GPUs: a boundary-reflection expression that was being miscompiled on AMD targets is rewritten to an equivalent form, and the Windows build is fixed for the ROCm toolchain (-fPIC, the WIN32 macro, an import-library conflict, and a binary fread). These do not change behavior on the NVIDIA path.
How to build the ROCm version
Configure with HIP enabled and pick the target architecture:
CMAKE_HIP_ARCHITECTURES accepts any ROCm GPU target (for example gfx90a or gfx1100) and defaults to gfx90a if omitted. Building without -DUSE_HIP=ON produces the usual CUDA binary.
Validation
The ROCm build was exercised on real AMD hardware end to end on Linux (CDNA2 gfx90a and RDNA3 gfx1100) and Windows (RDNA4 gfx1201), running MCX simulations and confirming the fluence and detected-photon outputs match the expected results. The default NVIDIA CUDA build is unchanged.
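The binary fread fix mentioned in this PR comes down to a general C pitfall on Windows: a file opened in text mode has CRLF pairs translated to LF and treats byte 0x1A as end of file, so fread can return fewer bytes than the size computed from ftell. The sketch below shows the safe pattern; `demo_read_all_binary` is a hypothetical helper written for illustration and is not the code in mcx_loadseedjdat.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads a whole file in binary mode.
 * Returns a malloc'd buffer and stores its length in *len,
 * or returns NULL on any failure. The caller frees the buffer. */
unsigned char *demo_read_all_binary(const char *path, long *len)
{
    FILE *fp = fopen(path, "rb");   /* "rb": no newline translation on Windows */
    if (!fp) return NULL;

    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);          /* byte count as stored on disk */
    if (size < 0) { fclose(fp); return NULL; }
    rewind(fp);

    unsigned char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (!buf) { fclose(fp); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, fp);
    fclose(fp);

    /* In binary mode the bytes read match the ftell size exactly. */
    if (got != (size_t)size) { free(buf); return NULL; }

    *len = size;
    return buf;
}
```

On Linux the "b" flag has no effect, which is why this class of bug tends to surface only when a Windows toolchain is added.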