Skip to content

Commit 391bfc2

Browse files
ClawLabbyClawLabby
authored andcommitted
docs: multi-GPU Newton training guide
1 parent 5202601 commit 391bfc2

1 file changed

Lines changed: 41 additions & 0 deletions

File tree

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
# Multi-GPU Training with Newton Physics
2+
3+
## Overview
4+
5+
This fix enables multi-GPU distributed training with Newton physics and camera
6+
rendering in Isaac Lab. Three interrelated bugs were fixed:
7+
8+
1. **Device assignment**: Non-rank-0 GPUs were assigned to wrong CUDA devices
9+
2. **Camera renderer**: Newton warp renderer preset wasn't applied correctly
10+
3. **Fabric mode**: Non-cuda:0 GPUs need fabric disabled to avoid race conditions
11+
12+
## Usage
13+
14+
```bash
15+
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 \
16+
torchrun --nproc_per_node=4 \
17+
scripts/reinforcement_learning/rsl_rl/train.py \
18+
--task Isaac-Dexsuite-Kuka-Allegro-Lift-v0 \
19+
--num_envs 1024 \
20+
--headless --distributed \
21+
presets=newton,cube
22+
```
23+
24+
## Environment Variables
25+
26+
- `NCCL_P2P_DISABLE=1`: Required for containers with IOMMU (P2P transport fails)
27+
- `NCCL_IB_DISABLE=1`: Required when InfiniBand is not available
28+
29+
## Known Issues
30+
31+
- CCD solver warnings (`opt.ccd_iterations needs to be increased`) are expected
32+
under high contact scenarios. The NaN watchdog handles resulting physics failures.
33+
- `abnormal_robot` termination rate is higher with camera observations (~50%)
34+
vs state-only (~22%). This is a physics stability issue, not a training bug.
35+
36+
## Testing
37+
38+
```bash
39+
# Run multi-GPU Newton physics tests
40+
python -m pytest source/isaaclab/test/test_multigpu_newton.py -v
41+
```

0 commit comments

Comments
 (0)