
feat: support Reduce with OpenMPI backend implementation #28

Open
GordonYang1 wants to merge 1 commit into InfiniTensor:master from GordonYang1:feat/support-reduce


@GordonYang1 GordonYang1 commented May 27, 2026

Summary

This PR introduces an OpenMPI-based implementation of Reduce, along with a complete example program for functionality verification and basic performance evaluation.

The public API follows the NCCL parameter order through infiniReduce(). The current implementation uses host-staging for device buffers and blocking MPI_Reduce internally; the reduced result is materialized only on root, consistent with NCCL/MPI semantics.

Changes

  • OpenMPI-based Reduce Implementation
    • add the basic OpenMPI implementation for infiniReduce(), including:
      • the public C API declaration in include/comm.h;
      • the core interface src/base/reduce.h;
      • the OpenMPI backend implementation in src/ompi/impl/reduce.h.
    • validate communicator handles, datatype values, reduction op values, root rank range, send buffer presence on all ranks, and receive buffer presence on root before dispatching to backend implementations, and return infiniInvalidArgument for invalid user inputs;
    • allocate the host recvbuf only on root, mirroring MPI_Reduce semantics, which materialize the reduced result solely on root;
    • handle kAvg by reducing with MPI_SUM and then applying a per-dtype CPU scaling on root via DispatchFunc<kDev, AllTypes>, since OpenMPI does not provide a native average reduction;
    • add an example program examples/reduce.cc, similar to examples/all_reduce.cc, for correctness verification and simple performance testing; result validation and metrics printing are restricted to root.
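
The argument checks listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Status`, `DataType`, `ReduceOp`, `Comm`, and `ValidateReduceArgs` are hypothetical stand-ins for the real declarations in include/comm.h, and the real check order may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the library's real types (assumption).
enum class Status { kSuccess, kInvalidArgument };
enum class DataType { kFloat32 = 0, kFloat16, kInt32, kCount };
enum class ReduceOp { kSum = 0, kAvg, kCount };

struct Comm {
  int rank;  // rank of the calling process
  int size;  // number of ranks in the communicator
};

// Rejects invalid user input before any backend is called.
Status ValidateReduceArgs(const void* sendbuf, const void* recvbuf,
                          DataType dtype, ReduceOp op, int root,
                          const Comm* comm) {
  if (comm == nullptr) return Status::kInvalidArgument;

  const int dt = static_cast<int>(dtype);
  if (dt < 0 || dt >= static_cast<int>(DataType::kCount)) {
    return Status::kInvalidArgument;
  }

  const int rop = static_cast<int>(op);
  if (rop < 0 || rop >= static_cast<int>(ReduceOp::kCount)) {
    return Status::kInvalidArgument;
  }

  if (root < 0 || root >= comm->size) return Status::kInvalidArgument;

  // Every rank contributes data, so sendbuf is always required.
  if (sendbuf == nullptr) return Status::kInvalidArgument;

  // Only root receives the result, so recvbuf is required on root only.
  if (comm->rank == root && recvbuf == nullptr) {
    return Status::kInvalidArgument;
  }
  return Status::kSuccess;
}
```

In this sketch a non-root rank may pass a null recvbuf, while the same call on root is rejected.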

Known Issues & Future Work

  • The current OpenMPI Reduce implementation uses blocking MPI_Reduce, which prevents overlap between communication and computation. Future work may introduce non-blocking collectives (MPI_Ireduce) and stream-aware asynchronous execution to improve concurrency and performance.
  • The current implementation allocates temporary host staging buffers using malloc/free on every invocation. This may introduce noticeable overhead in high-frequency workloads. Future work may add reusable buffer pools, allocator caching, and pinned host memory support to improve transfer efficiency and reduce allocation overhead.
  • Averaging (kAvg) is performed via a CPU-side loop on root after the MPI call. Future work may move the scaling onto the accelerator once a unified Cast kernel is available, avoiding the extra host-side traversal and the additional host-to-device (H2D) copy for large messages.
  • The current implementation performs GPU-to-Host and Host-to-GPU copies. While functionally correct, this is not optimal for GPU-intensive workloads. Future work may implement zero-copy GPU-GPU transfers (GPUDirect RDMA) to reduce memory traffic and improve throughput.
  • count is cast to int for MPI (with a safety check), so messages exceeding INT_MAX elements are rejected. Future work may introduce chunked transfers or the MPI_Count-based large-count interfaces (such as MPI_Reduce_c) available from MPI-4 onward to handle very large tensors safely.
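
The chunked-transfer idea from the last item could look roughly like the sketch below. It only plans the chunks and performs no MPI calls; `FitsInMpiCount` and `PlanChunks` are hypothetical helper names, not part of this PR. Splitting is safe for Reduce because the predefined reductions are element-wise, so each chunk can be reduced independently.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

// True when `count` can be passed as the `int count` argument of MPI_Reduce.
bool FitsInMpiCount(std::size_t count) {
  return count <= static_cast<std::size_t>(INT_MAX);
}

// Splits `count` elements into (offset, length) chunks whose length fits in
// an int. `max_chunk` is clamped to the range [1, INT_MAX].
std::vector<std::pair<std::size_t, int>> PlanChunks(
    std::size_t count, std::size_t max_chunk = INT_MAX) {
  if (max_chunk == 0 || !FitsInMpiCount(max_chunk)) {
    max_chunk = static_cast<std::size_t>(INT_MAX);
  }
  std::vector<std::pair<std::size_t, int>> chunks;
  std::size_t offset = 0;
  while (offset < count) {
    const std::size_t remaining = count - offset;
    const std::size_t len = remaining < max_chunk ? remaining : max_chunk;
    chunks.emplace_back(offset, static_cast<int>(len));
    offset += len;
  }
  return chunks;
}
```

A caller would issue one reduce per chunk, offsetting both host buffers by `offset` elements. For example, `PlanChunks(10, 4)` yields the chunks (0, 4), (4, 4), and (8, 2).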

Logs & Screenshots

  • all_reduce test (MetaX-NVIDIA heterogeneous): [all_reduce.log]
  • reduce test (MetaX-NVIDIA heterogeneous): [reduce.log]
  • broadcast test (MetaX-NVIDIA heterogeneous): [broadcast.log]
  • all_gather test (MetaX-NVIDIA heterogeneous): [all_gather.log]
  • reduce_scatter test (MetaX-NVIDIA heterogeneous): [reduce_scatter.log]
  • all_to_all test (MetaX-NVIDIA heterogeneous): [all_to_all.log]
  • send_recv test (MetaX-NVIDIA heterogeneous): [send_recv.log]


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 794fcf36ff


Comment thread src/ompi/impl/reduce.h
Comment on lines +64 to +65
INFINI_CHECK_MPI(MPI_Reduce(host_sendbuf, host_recvbuf, mpi_count, mpi_type,
mpi_op, root, inst->handle));

P1: Reduce half tensors as values, not bytes

When datatype is infiniFloat16 or infiniBFloat16, DataTypeToOmpiType() maps it to MPI_BYTE, but this call still passes mpi_count == count elements. In that scenario MPI reduces only count bytes of the 2 * count staged buffer and applies the reduction byte-wise rather than as half/bfloat values, so the root receives corrupted/incomplete results for common FP16/BF16 Reduce calls.
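
To illustrate why a byte-wise sum cannot stand in for a value-wise sum, here is a small standalone sketch using bfloat16 stored as the upper 16 bits of a float32. It is an assumption-laden model of the concern, not of what OpenMPI does internally: the helper names are hypothetical, the float-to-bfloat16 conversion simply truncates, and "byte-wise" is modeled as adding each byte modulo 256.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// bfloat16 modeled as the upper 16 bits of an IEEE-754 float32 (truncating).
std::uint16_t FloatToBf16(float v) {
  std::uint32_t bits;
  std::memcpy(&bits, &v, sizeof(bits));
  return static_cast<std::uint16_t>(bits >> 16);
}

float Bf16ToFloat(std::uint16_t h) {
  const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
  float v;
  std::memcpy(&v, &bits, sizeof(v));
  return v;
}

// What the caller expects: decode, add as numbers, encode again.
std::uint16_t SumAsValues(std::uint16_t a, std::uint16_t b) {
  return FloatToBf16(Bf16ToFloat(a) + Bf16ToFloat(b));
}

// Model of a byte-wise reduction: each byte is added independently, mod 256.
std::uint16_t SumAsBytes(std::uint16_t a, std::uint16_t b) {
  const std::uint8_t lo = static_cast<std::uint8_t>((a & 0xFF) + (b & 0xFF));
  const std::uint8_t hi = static_cast<std::uint8_t>((a >> 8) + (b >> 8));
  return static_cast<std::uint16_t>((hi << 8) | lo);
}
```

With 1.0 encoded as 0x3F80, `SumAsValues` gives 0x4000 (2.0), while `SumAsBytes` gives 0x7E00, which decodes to an enormous unrelated value. Separately, a buffer of `count` two-byte elements spans `2 * count` bytes, so a byte datatype would also need the count doubled to cover the whole buffer.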


Comment thread src/ompi/impl/reduce.h
for (size_t i = 0; i < count; ++i) {
// TODO(lzm): should later use the unified `Cast` function instead
// of `static_cast` to support CPU custom types.
typed_buf[i] *= static_cast<T>(scale);

P2: Preserve integer values when averaging

For infiniAvg with any integer datatype and more than one rank, scale is less than 1 and static_cast<T>(scale) becomes zero before the multiplication, so every averaged integer element on the root is written as 0 regardless of the summed value. This should multiply in floating point and cast the final quotient, or reject integer averages if they are unsupported.
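
A minimal sketch of the difference, using hypothetical helper names rather than the PR's actual functions: the first function mirrors the reviewed loop (cast the scale, then multiply), and the second multiplies in double precision and casts only the final quotient.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors the reviewed loop: for integer T and scale < 1 the cast yields 0.
template <typename T>
void ScaleCastFirst(std::vector<T>& buf, double scale) {
  for (std::size_t i = 0; i < buf.size(); ++i) {
    buf[i] *= static_cast<T>(scale);
  }
}

// One possible fix: multiply in double, then cast the result back to T.
// For integer T the quotient is truncated toward zero.
template <typename T>
void ScaleCastLast(std::vector<T>& buf, double scale) {
  for (std::size_t i = 0; i < buf.size(); ++i) {
    buf[i] = static_cast<T>(static_cast<double>(buf[i]) * scale);
  }
}
```

With four ranks (scale 0.25) and summed values {8, 12}, `ScaleCastFirst<int>` writes {0, 0} while `ScaleCastLast<int>` writes {2, 3}. One caveat of this sketch is that double carries only 53 bits of mantissa, so very large 64-bit integers could lose precision; dividing by the rank count in integer arithmetic is an alternative for those types.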


@GordonYang1 GordonYang1 force-pushed the feat/support-reduce branch from 794fcf3 to 21b0d44 Compare May 27, 2026 10:14