feat: support `Reduce` with OpenMPI backend implementation by GordonYang1 · Pull Request #28 · InfiniTensor/InfiniCCL

GordonYang1 · 2026-05-27T10:09:24Z

Summary

This PR introduces an OpenMPI-based implementation of Reduce, along with a complete example program for functionality verification and basic performance evaluation.

The public API, infiniReduce(), follows the NCCL parameter order. The current implementation stages device buffers through host memory and calls blocking MPI_Reduce internally; the reduced result is materialized only on root, consistent with NCCL/MPI semantics.

Changes

OpenMPI-based Reduce Implementation
- add the basic OpenMPI implementation for infiniReduce(), including:
  - the public C API declaration in include/comm.h;
  - the core interface src/base/reduce.h;
  - the OpenMPI backend implementation in src/ompi/impl/reduce.h.
- validate communicator handles, datatype values, reduction op values, root rank range, send buffer presence on all ranks, and receive buffer presence on root before dispatching to backend implementations;
- return infiniInvalidArgument for invalid user inputs;
- allocate the host recvbuf only on root, mirroring MPI_Reduce semantics that materialize the reduced result solely on root;
- handle kAvg by piggy-backing on MPI_SUM and then applying a per-dtype CPU scaling on root via DispatchFunc<kDev, AllTypes>, since OpenMPI does not provide a native average reduction;
- add an example program examples/reduce.cc, similar to examples/all_reduce.cc, for correctness verification and simple performance testing; result validation and metrics printing are restricted to root.
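
As a rough illustration of the buffer and root checks listed above, here is a minimal sketch. It is not the code in this PR: `SketchResult`, `SketchComm`, and `ValidateReduceArgs` are hypothetical stand-ins for the library's own types, and the datatype and reduction op checks are left out.

```cpp
#include <cstddef>

// Hypothetical stand-ins (assumptions); the real result codes and
// communicator type are declared by the library and may differ.
enum SketchResult { kSketchSuccess = 0, kSketchInvalidArgument = 1 };

struct SketchComm {
  int rank;  // rank of the calling process
  int size;  // number of ranks in the communicator
};

// Returns kSketchInvalidArgument when a precondition fails.
inline SketchResult ValidateReduceArgs(const void* sendbuf, const void* recvbuf,
                                       int root, const SketchComm* comm) {
  if (comm == nullptr) return kSketchInvalidArgument;
  if (root < 0 || root >= comm->size) return kSketchInvalidArgument;
  // Every rank contributes data, so sendbuf is required everywhere.
  if (sendbuf == nullptr) return kSketchInvalidArgument;
  // Only root receives the result; other ranks may pass nullptr.
  if (comm->rank == root && recvbuf == nullptr) return kSketchInvalidArgument;
  return kSketchSuccess;
}
```

With this shape, a non-root rank passing a null recvbuf is accepted, while the same call on root is rejected.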

Known Issues & Future Work

- The current OpenMPI Reduce implementation uses blocking MPI_Reduce, which prevents overlap between communication and computation. Future work may introduce non-blocking collectives (MPI_Ireduce) and stream-aware asynchronous execution to improve concurrency and performance.
- The current implementation allocates temporary host staging buffers with malloc/free on every invocation, which may add noticeable overhead in high-frequency workloads. Future work may add reusable buffer pools, allocator caching, and pinned host memory support to improve transfer efficiency and reduce allocation overhead.
- Averaging (kAvg) is performed by a CPU-side loop on root after the MPI call. Future work may move the scaling onto the accelerator once a unified Cast kernel is available, to avoid the extra host-side traversal and an additional host-to-device (H2D) copy for large messages.
- The current implementation performs GPU-to-host and host-to-GPU copies. While functionally correct, this is not optimal for GPU-intensive workloads. Future work may implement zero-copy GPU-GPU transfers (GPUDirect RDMA) to reduce memory traffic and improve throughput.
- count is cast to int for MPI (with a safety check), so messages exceeding INT_MAX elements are rejected. Future work may introduce chunked transfers or the MPI_Count-based large-count interfaces of MPI-4 and later to handle very large tensors safely.
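
The chunked-transfer idea for counts above INT_MAX could look roughly like the sketch below. `SplitForMpi` is a hypothetical helper, not part of this PR; it only computes (offset, length) pieces, on the assumption that an element-wise reduction can be issued once per piece because each element is reduced independently.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

// Splits `count` elements into (offset, length) pieces whose length fits in
// an int. `max_chunk` is exposed only so small cases are easy to test.
inline std::vector<std::pair<std::size_t, int>> SplitForMpi(
    std::size_t count,
    std::size_t max_chunk = static_cast<std::size_t>(INT_MAX)) {
  std::vector<std::pair<std::size_t, int>> chunks;
  // Clamp so the narrowing cast below is always safe.
  max_chunk = std::min(max_chunk, static_cast<std::size_t>(INT_MAX));
  if (max_chunk == 0) return chunks;  // nothing sensible to do
  std::size_t offset = 0;
  while (offset < count) {
    const std::size_t len = std::min(max_chunk, count - offset);
    chunks.emplace_back(offset, static_cast<int>(len));
    offset += len;  // never exceeds count, so it cannot overflow
  }
  return chunks;
}
```

Each piece would then map to one reduction call on `buffer + offset * element_size` with `length` as the int count.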

Logs & Screenshots

all_reduce test (MetaX-NVIDIA heterogeneous): [all_reduce.log]
reduce test (MetaX-NVIDIA heterogeneous): [reduce.log]
broadcast test (MetaX-NVIDIA heterogeneous): [broadcast.log]
all_gather test (MetaX-NVIDIA heterogeneous): [all_gather.log]
reduce_scatter test (MetaX-NVIDIA heterogeneous): [reduce_scatter.log]
all_to_all test (MetaX-NVIDIA heterogeneous): [all_to_all.log]
send_recv test (MetaX-NVIDIA heterogeneous): [send_recv.log]

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 794fcf36ff


chatgpt-codex-connector · 2026-05-27T10:13:10Z

+    INFINI_CHECK_MPI(MPI_Reduce(host_sendbuf, host_recvbuf, mpi_count, mpi_type,
+                                mpi_op, root, inst->handle));


Reduce half tensors as values, not bytes

When datatype is infiniFloat16 or infiniBFloat16, DataTypeToOmpiType() maps it to MPI_BYTE, but this call still passes mpi_count == count elements. In that scenario MPI reduces only count bytes of the 2 * count staged buffer and applies the reduction byte-wise rather than as half/bfloat values, so the root receives corrupted/incomplete results for common FP16/BF16 Reduce calls.
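
One possible way to reduce 16-bit floats as values is to widen them to float in the host staging buffer, reduce with MPI_FLOAT using the same element count, and narrow the result on root. The sketch below shows only the widening and narrowing for bfloat16; the helper names are hypothetical and this is not the fix adopted in the PR. A custom MPI_Op over a 2-byte datatype would be another option.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// bfloat16 holds the upper 16 bits of an IEEE-754 binary32 value.
inline float Bf16ToFloat(std::uint16_t bits) {
  const std::uint32_t wide = static_cast<std::uint32_t>(bits) << 16;
  float value;
  std::memcpy(&value, &wide, sizeof(value));
  return value;
}

// Narrows with round-to-nearest-even. NaN inputs are not special-cased here.
inline std::uint16_t FloatToBf16(float value) {
  std::uint32_t wide;
  std::memcpy(&wide, &value, sizeof(wide));
  const std::uint32_t rounding = 0x7FFFu + ((wide >> 16) & 1u);
  return static_cast<std::uint16_t>((wide + rounding) >> 16);
}

// Builds a float staging buffer from `count` bfloat16 elements. The result
// holds `count` floats, so the element count passed on stays unchanged.
inline std::vector<float> WidenBf16(const std::uint16_t* src,
                                    std::size_t count) {
  std::vector<float> out(count);
  for (std::size_t i = 0; i < count; ++i) out[i] = Bf16ToFloat(src[i]);
  return out;
}
```

The trade-off is a staging buffer twice the size of the input, in exchange for using a predefined MPI datatype and operation.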


chatgpt-codex-connector · 2026-05-27T10:13:10Z

+          for (size_t i = 0; i < count; ++i) {
+            // TODO(lzm): should later use the unified `Cast` function instead
+            // of `static_cast` to support CPU custom types.
+            typed_buf[i] *= static_cast<T>(scale);


Preserve integer values when averaging

For infiniAvg with any integer datatype and more than one rank, scale is less than 1 and static_cast<T>(scale) becomes zero before the multiplication, so every averaged integer element on the root is written as 0 regardless of the summed value. This should multiply in floating point and cast the final quotient, or reject integer averages if they are unsupported.
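
A minimal sketch of the first suggestion, assuming the buffer already holds the MPI_SUM result on root: divide in floating point and cast only the final quotient. `AverageSummedBuffer` is a hypothetical name, and the double intermediate is an assumption that loses precision for 64-bit integers beyond 2^53.

```cpp
#include <cstddef>

// Turns per-element sums into averages in place. `num_ranks` must be > 0.
template <typename T>
void AverageSummedBuffer(T* buf, std::size_t count, int num_ranks) {
  for (std::size_t i = 0; i < count; ++i) {
    // Divide first, then cast: static_cast<T>(1.0 / num_ranks) would be 0
    // for integer T, which is the bug described in the comment above.
    const double quotient = static_cast<double>(buf[i]) / num_ranks;
    buf[i] = static_cast<T>(quotient);  // truncates toward zero for integers
  }
}
```

For example, integer sums {8, 9, 10} over 4 ranks become {2, 2, 2}, and float sums {3.0, 1.0} over 2 ranks become {1.5, 0.5}.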



feat: support Reduce with OpenMPI backend implementation

21b0d44

GordonYang1 force-pushed the feat/support-reduce branch from 794fcf3 to 21b0d44 · May 27, 2026 10:14

GordonYang1 wants to merge 1 commit into InfiniTensor:master from GordonYang1:feat/support-reduce
