1+ # Multi-GPU Training with Newton Physics
2+
3+ ## Overview
4+
5+ This fix enables multi-GPU distributed training with Newton physics and camera
6+ rendering in Isaac Lab. Three interrelated bugs were fixed:
7+
8+ 1. **Device assignment**: Non-rank-0 processes were assigned to the wrong CUDA devices
9+ 2. **Camera renderer**: The Newton Warp renderer preset was not applied correctly
10+ 3. **Fabric mode**: GPUs other than `cuda:0` need Fabric disabled to avoid race conditions
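
The device-assignment and Fabric points above can be illustrated with a minimal sketch. It assumes the launcher is `torchrun`, which exports `LOCAL_RANK` to each worker process. The helper names `device_for_local_rank` and `use_fabric` are hypothetical and are not taken from the Isaac Lab source; they only show the intended mapping, not the actual fix.

```python
import os


def device_for_local_rank(local_rank: int) -> str:
    """Map a worker's local rank to its own CUDA device string."""
    if local_rank < 0:
        raise ValueError(f"local rank must be non-negative, got {local_rank}")
    return f"cuda:{local_rank}"


def use_fabric(device: str) -> bool:
    """Keep Fabric enabled only on cuda:0 (assumption based on item 3 above)."""
    return device == "cuda:0"


# torchrun sets LOCAL_RANK per worker; default to 0 for single-process runs.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device = device_for_local_rank(local_rank)
print(f"local_rank={local_rank} device={device} fabric={use_fabric(device)}")
```

With four workers per node, each process would resolve to a distinct device (`cuda:0` through `cuda:3`) instead of all defaulting to the same one.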
11+
12+ ## Usage
13+
14+ ```bash
15+ NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 \
16+ torchrun --nproc_per_node=4 \
17+ scripts/reinforcement_learning/rsl_rl/train.py \
18+ --task Isaac-Dexsuite-Kuka-Allegro-Lift-v0 \
19+ --num_envs 1024 \
20+ --headless --distributed \
21+ presets=newton,cube
22+ ```
23+
24+ ## Environment Variables
25+
26+ - `NCCL_P2P_DISABLE=1`: Required in containers with IOMMU enabled, where the NCCL P2P transport fails
27+ - `NCCL_IB_DISABLE=1`: Required when InfiniBand is not available
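
When the command line cannot be prefixed with these variables, the same settings could in principle be applied from Python before the process group is initialized, because NCCL reads them from the process environment. The following is a minimal sketch under that assumption; `nccl_env_overrides` and its two flags are hypothetical names used only for illustration.

```python
import os
from typing import Dict


def nccl_env_overrides(iommu_container: bool, has_infiniband: bool) -> Dict[str, str]:
    """Return the NCCL variables described above for a given host setup."""
    overrides: Dict[str, str] = {}
    if iommu_container:
        # P2P transport fails in containers with IOMMU enabled.
        overrides["NCCL_P2P_DISABLE"] = "1"
    if not has_infiniband:
        # Skip the InfiniBand transport when no IB hardware is present.
        overrides["NCCL_IB_DISABLE"] = "1"
    return overrides


# Apply the overrides without clobbering values the user already exported.
for key, value in nccl_env_overrides(iommu_container=True, has_infiniband=False).items():
    os.environ.setdefault(key, value)
```

Using `setdefault` keeps any value already exported in the shell, so the explicit command-line form shown under Usage still takes precedence.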
28+
29+ ## Known Issues
30+
31+ - CCD solver warnings (`opt.ccd_iterations needs to be increased`) are expected
32+ in high-contact scenarios. The NaN watchdog handles the resulting physics failures.
33+ - The `abnormal_robot` termination rate is higher with camera observations (~50%)
34+ than with state-only observations (~22%). This is a physics stability issue, not a training bug.
35+
36+ ## Testing
37+
38+ ```bash
39+ # Run multi-GPU Newton physics tests
40+ python -m pytest source/isaaclab/test/test_multigpu_newton.py -v
41+ ```