feat(s3): add optional S3-over-RDMA (cuObject) data plane #87

Draft
harshavardhana wants to merge 1 commit into NVIDIA:main from harshavardhana:fea-s3-rdma-cuobject

@harshavardhana harshavardhana commented Jun 26, 2026

Description

Add an optional S3-over-RDMA data plane to the s3 storage provider, backed by NVIDIA cuObject (libcuobjclient). A new rdma provider option routes object transfers directly into/out of a registered host buffer over RDMA instead of through the HTTP body: the buffer is registered with cuObject and its RDMA descriptor is carried to the endpoint as the signed x-amz-rdma-token header, leaving the HTTP body empty. This offloads bulk transfer from the CPU and the HTTP/TLS path for RDMA-capable endpoints (e.g. MinIO AIStor), benefiting checkpoint and data-loading workloads.
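The wire contract described above can be sketched in a few lines. This is only an illustration of the shape of the request, not the code in providers/s3.py: `RdmaRequest` and `build_rdma_request` are hypothetical names, and the descriptor is treated as an opaque string because its real format is defined by cuObject and not stated in this PR. Only the `x-amz-rdma-token` header name comes from the description.

```python
from dataclasses import dataclass, field

# Header name taken from the PR description.
RDMA_TOKEN_HEADER = "x-amz-rdma-token"


@dataclass
class RdmaRequest:
    """Hypothetical container for the pieces of one RDMA-backed S3 request."""

    method: str
    key: str
    headers: dict = field(default_factory=dict)
    body: bytes = b""  # stays empty: the payload moves over RDMA, not HTTP


def build_rdma_request(method: str, key: str, descriptor: str) -> RdmaRequest:
    """Attach an opaque RDMA descriptor to a request that carries no body.

    `descriptor` stands in for whatever cuObject returns when the host
    buffer is registered; its real format is an assumption here.
    """
    if not descriptor:
        raise ValueError("an RDMA descriptor is required")
    return RdmaRequest(method=method, key=key, headers={RDMA_TOKEN_HEADER: descriptor})


req = build_rdma_request("PUT", "checkpoints/step-100.pt", "opaque-descriptor")
print(req.headers, len(req.body))
```

In the real provider the header would be set before request signing so that it is covered by the signature; the sketch leaves signing out.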

Design notes:

  • Mirrors the existing rust_client sub-option — it swaps only the data plane (inheriting all metadata/list/credentials/error handling) and is mutually exclusive with rust_client. No new provider type; type: s3 is unchanged.
  • Enabling RDMA forces the empty-body wire contract (payload_signing_enabled=False, checksums only when_required) and single-shot put/get, since multipart does not apply to a single registered-buffer transfer. Addressing style stays user-controlled via the s3 option.
  • cuObject is a C++ library whose client needs an ops table at construction, so a thin extern "C" shim (providers/cuobj_shim.cpp) wraps a process-wide cuObjClient and is loaded over ctypes (providers/_cuobj.py). The module is import-safe without the native library present; the provider only instantiates the engine when rdma is configured. This follows the torch.cuda.cuobj / BotoCuObjClient split used in the PyTorch cuObject checkpoint backend.
  • This is host-staged RDMA. A GPUDirect device-memory path would require a new zero-copy buffer API on the client and is out of scope here.
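A minimal sketch of how the first two design notes could translate into option handling. `resolve_s3_data_plane` and the shape of the returned dictionary are hypothetical and do not reflect the actual wiring in providers/s3.py. Mapping "checksums only when_required" onto `request_checksum_calculation` / `response_checksum_validation` is an assumption about which client settings are meant.

```python
def resolve_s3_data_plane(options: dict) -> dict:
    """Return the client settings implied by the provider options (illustrative)."""
    rdma = options.get("rdma")
    rust_client = options.get("rust_client")

    # `rdma: {}` is a valid way to enable the option, so test for presence
    # rather than truthiness.
    if rdma is not None and rust_client is not None:
        raise ValueError("'rdma' and 'rust_client' are mutually exclusive")

    settings = {"data_plane": "http", "single_shot": False}
    if rust_client is not None:
        settings["data_plane"] = "rust_client"
    if rdma is not None:
        settings.update(
            data_plane="rdma",
            single_shot=True,  # no multipart for a single registered buffer
            payload_signing_enabled=False,  # the HTTP body is empty
            request_checksum_calculation="when_required",  # assumed setting name
            response_checksum_validation="when_required",  # assumed setting name
        )
    return settings


print(resolve_s3_data_plane({"rdma": {}}))
```

Checking for `None` instead of truthiness matters because an empty mapping is how the option is enabled in the round-trip example.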

Files: providers/_cuobj.py (primitives + CuObjEngine control plane), providers/cuobj_shim.cpp (build instructions in the header), providers/s3.py (option wiring + _rdma_put/_rdma_get), schema.py (rdma option), examples/rdma_roundtrip.py (E2E), unit tests.

Opening as a draft for design discussion per CONTRIBUTING (new feature). No tracking issue yet — happy to file one and align on the approach before this is finalized.

Validated

  • Hardware E2E (proven): NVIDIA H200 client → RDMA-capable MinIO AIStor endpoint over 400G RoCE (toolkit cuObject 1.2.0 / cuFile 1.18.0). examples/rdma_roundtrip.py with rdma: {} round-trips 1 / 64 / 256 MiB byte-identical over RDMA, and the x-amz-rdma-reply check confirms the payload moved over RDMA rather than the standard path. (Host-memory data plane.)
  • Unit tests (native engine mocked, CI-runnable):
    uv run --extra boto3 pytest tests/test_multistorageclient/unit/providers/test_s3_rdma.py
    
    7 passed; existing S3 + schema unit tests still pass (52 passed, no regression).

Checklist

  • Development PR
    • .release_notes/.unreleased.md
      • Notable changes to the client from this PR have been added.

Add an `rdma` option to the `s3` storage provider that routes object
transfers through NVIDIA cuObject (libcuobjclient) instead of the HTTP
body. When enabled, a contiguous host buffer is registered with cuObject
and its RDMA descriptor is carried to the endpoint as the signed
`x-amz-rdma-token` header; the RDMA-capable endpoint transfers the
payload directly into or out of the buffer, leaving the HTTP body empty.
This offloads the bulk transfer from the CPU and the HTTP/TLS path.

The option mirrors the existing `rust_client` sub-option: it swaps only
the data plane and is mutually exclusive with it. Enabling RDMA forces
the empty-body wire contract (unsigned payload, checksums only
when_required) and single-shot put/get, since multipart does not apply
to a single registered-buffer transfer. cuObject is a C++ library, so a
thin extern "C" shim (providers/cuobj_shim.cpp) wraps a process-wide
cuObjClient and is loaded over ctypes (providers/_cuobj.py); the module
is import-safe without the native library present, mirroring the
torch.cuda.cuobj / BotoCuObjClient split in the PyTorch checkpoint
backend.
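The "import-safe without the native library" property can be illustrated with a lazy ctypes load. This is a sketch under assumptions, not the contents of providers/_cuobj.py: the shared-object name `libcuobj_shim.so`, the `load_shim` helper, and the `Engine` class are all hypothetical.

```python
import ctypes
from typing import Optional

# Hypothetical file name for the compiled extern "C" shim.
_SHIM_NAME = "libcuobj_shim.so"


def load_shim(name: str = _SHIM_NAME) -> Optional[ctypes.CDLL]:
    """Return the loaded shim, or None when the native library is unavailable."""
    try:
        return ctypes.CDLL(name)
    except OSError:
        # Importing this module must never fail just because cuObject is absent.
        return None


class Engine:
    """Hypothetical stand-in for the control plane; fails only when constructed."""

    def __init__(self, name: str = _SHIM_NAME) -> None:
        self._lib = load_shim(name)
        if self._lib is None:
            raise RuntimeError(f"RDMA requested but {name!r} could not be loaded")


# Nothing is loaded at import time; the failure surfaces only on construction.
print(load_shim("definitely-not-installed-shim.so"))
```

Deferring the failure to construction keeps hosts without the cuObject runtime able to import the package and use the regular S3 path.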

Test Plan:
- Unit (native engine mocked), run with the boto3 extra:
  `uv run --extra boto3 pytest tests/test_multistorageclient/unit/providers/test_s3_rdma.py`
- Regression on existing S3/schema unit tests: passing.
- End-to-end against a live RDMA endpoint:
  `python examples/rdma_roundtrip.py` (see the script header for the
  required cuObject runtime and environment).
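For readers without RDMA hardware, the byte-identical round-trip check can be sketched against an in-memory stand-in for the native engine. `FakeEngine`, its `put`/`get` methods, and `roundtrip_ok` are hypothetical and are not the interface of `CuObjEngine` or the contents of test_s3_rdma.py; the sketch only shows the kind of assertion involved.

```python
import hashlib
import os


class FakeEngine:
    """In-memory stand-in for the native engine (hypothetical interface)."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, buffer: bytes) -> None:
        # Copy the buffer, as a real transfer would leave the source untouched.
        self._objects[key] = bytes(buffer)

    def get(self, key: str) -> bytes:
        return self._objects[key]


def roundtrip_ok(engine, key: str, size: int) -> bool:
    """Write `size` random bytes, read them back, and compare SHA-256 digests."""
    payload = os.urandom(size)
    engine.put(key, payload)
    returned = engine.get(key)
    return hashlib.sha256(returned).digest() == hashlib.sha256(payload).digest()


# Small sizes keep the sketch fast; they are not the sizes used on hardware.
for size in (1, 4096, 1 << 20):
    assert roundtrip_ok(FakeEngine(), f"obj-{size}", size)
```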

copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

coderabbitai Bot commented Jun 26, 2026

Review skipped

Draft detected.

@shunjiad

Contributor

/ok to test 72f8703
