feat: support Reduce with OpenMPI backend implementation#28
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 794fcf36ff
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| INFINI_CHECK_MPI(MPI_Reduce(host_sendbuf, host_recvbuf, mpi_count, mpi_type, | ||
| mpi_op, root, inst->handle)); |
There was a problem hiding this comment.
Reduce half tensors as values, not bytes
When datatype is infiniFloat16 or infiniBFloat16, DataTypeToOmpiType() maps it to MPI_BYTE, but this call still passes mpi_count == count elements. In that scenario MPI reduces only count bytes of the 2 * count staged buffer and applies the reduction byte-wise rather than as half/bfloat values, so the root receives corrupted/incomplete results for common FP16/BF16 Reduce calls.
Useful? React with 👍 / 👎.
| for (size_t i = 0; i < count; ++i) { | ||
| // TODO(lzm): should later use the unified `Cast` function instead | ||
| // of `static_cast` to support CPU custom types. | ||
| typed_buf[i] *= static_cast<T>(scale); |
There was a problem hiding this comment.
Preserve integer values when averaging
For infiniAvg with any integer datatype and more than one rank, scale is less than 1 and static_cast<T>(scale) becomes zero before the multiplication, so every averaged integer element on the root is written as 0 regardless of the summed value. This should multiply in floating point and cast the final quotient, or reject integer averages if they are unsupported.
Useful? React with 👍 / 👎.
794fcf3 to
21b0d44
Compare
Summary
This PR introduces an OpenMPI-based implementation of
Reduce, along with a complete example program for functionality verification and basic performance evaluation.The public API follows the NCCL parameter order through
infiniReduce(). The current implementation uses host-staging for device buffers and blockingMPI_Reduceinternally; the reduced result is materialized only onroot, consistent with NCCL/MPI semantics.Changes
ReduceImplementationinfiniReduce(), including:include/comm.h;src/base/reduce.h;src/ompi/impl/reduce.h.rootbefore dispatching to backend implementations;infiniInvalidArgumentfor invalid user inputs.recvbufonly onroot, mirroringMPI_Reducesemantics that materialize the reduced result solely onroot.kAvgby piggy-backing onMPI_SUMand then applying a per-dtype CPU scaling onrootviaDispatchFunc<kDev, AllTypes>, since OpenMPI does not provide a native average reduction.examples/reduce.ccsimilar toexamples/all_reduce.ccfor correctness verification and simple performance testing; result validation and metrics printing are restricted toroot.Known Issues & Future Work
Reduceimplementation uses blockingMPI_Reduce, which prevents overlap between communication and computation. Future work may introduce non-blocking collectives (MPI_Ireduce) and stream-aware asynchronous execution to improve concurrency and performance.malloc/freeon every invocation. This may introduce noticeable overhead in high-frequency workloads. Future work may add reusable buffer pools, allocator caching, and pinned host memory support to improve transfer efficiency and reduce allocation overhead.kAvg) is performed via a CPU-side loop onrootafter the MPI call. Future work may move the scaling onto the accelerator once a unifiedCastkernel is available, to avoid the extra host-side traversal and an additional H2D for large messages.countis cast tointfor MPI (with a safety check). Extremely large messages exceedingINT_MAXelements are rejected. Future work may introduce chunked transfers orMPI_Countsupport in MPI-4+ to handle very large tensors safely.Logs & Screenshots
all_reduce test (MetaX-NVIDIA heterogeneous): [all_reduce.log]
all_reduce.log
reduce test (MetaX-NVIDIA heterogeneous): [reduce.log]
reduce.log
broadcast test (MetaX-NVIDIA heterogeneous): [broadcast.log]
broadcast.log
all_gather test (MetaX-NVIDIA heterogeneous): [all_gather.log]
all_gather.log
reduce_scatter test (MetaX-NVIDIA heterogeneous): [reduce_scatter.log]
reduce_scatter.log
all_to_all test (MetaX-NVIDIA heterogeneous): [all_to_all.log]
all_to_all.log
send_recv test (MetaX-NVIDIA heterogeneous): [send_recv.log]
send_recv.log