Add several fixes to distributed training by QuantuMope · Pull Request #1835 · HorizonRobotics/alf

QuantuMope · 2026-05-27T20:56:18Z

This PR adds several fixes for distributed training. The first five are in the first commit (d281324); the last one is in the second commit (4688708).

Remove optimizers from the distributed wrappers and instead handle state dicts through the wrapped core alg. This makes checkpoints agnostic to whether they were produced with the distributed trainer.
Add missing summarize_metrics(self) call so that summaries from the algorithm are recorded.
Add weights_only=False to torch.load for torch.__version__ >= 2.6.
Increase the unroller acknowledge timeout from 3 to 60 seconds. Three seconds is generally too short for transmitting VLA weights.
Fix a bug in DistributedUnroller triggered by i % 0.
Add checkpoint loading support for DistributedTrainer.
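
For reference, a minimal sketch of the version gate behind the weights_only item. PyTorch 2.6 switched the default of weights_only in torch.load to True, so checkpoints that contain arbitrary pickled Python objects need the flag set explicitly. The helper names parse_major_minor and torch_load_kwargs are hypothetical and are not taken from this PR; the sketch only builds the keyword arguments and does not import torch.

```python
import re
from typing import Any, Dict, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.6.0+cu124'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def torch_load_kwargs(torch_version: str) -> Dict[str, Any]:
    """Hypothetical helper: extra keyword arguments for torch.load.

    Comparing integer tuples avoids the string-comparison trap where
    '2.10' would sort before '2.6'.
    """
    if parse_major_minor(torch_version) >= (2, 6):
        return {"weights_only": False}
    return {}
```

A call site would then look like torch.load(path, **torch_load_kwargs(torch.__version__)). Note that weights_only=False runs the full unpickler, so it is only appropriate for checkpoints from a trusted source.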

le-horizon

Thanks for the fixes, Andrew.

A couple of points from Codex, in case they make sense:

alf/algorithms/distributed_off_policy_algorithm.py:155
DistributedOffPolicyAlgorithm.state_dict() now returns only self._core_alg.state_dict(). The trainer replay buffer lives on the distributed wrapper, not the core alg, so distributed checkpoints saved after this change will not include _replay_buffer.* keys. That makes the new checkpoint-loading setup at lines 519-569 ineffective for replay data: resume creates an empty multiprocessing replay buffer, then loads only core model/optimizer state. It also cannot load older distributed checkpoints that do contain wrapper _replay_buffer.* keys because load_state_dict() delegates straight to the core alg.
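
One possible way to keep older checkpoints loadable is to split wrapper-level keys from core keys before delegating to the core alg. The sketch below is only an illustration under assumptions: the function name split_wrapper_keys and the example keys are hypothetical, and core keys are assumed to be unprefixed, which may not match the real legacy checkpoint layout. Only the _replay_buffer. prefix comes from the comment above.

```python
from typing import Any, Dict, Tuple

# Key prefix of the wrapper-owned replay buffer, as named in the review.
REPLAY_PREFIX = "_replay_buffer."


def split_wrapper_keys(
    state_dict: Dict[str, Any], prefix: str = REPLAY_PREFIX
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: separate replay-buffer keys from core keys.

    Returns (core_state, replay_state). The prefix is stripped from the
    replay keys so they can be handed to the buffer's own loader.
    """
    core_state: Dict[str, Any] = {}
    replay_state: Dict[str, Any] = {}
    for key, value in state_dict.items():
        if key.startswith(prefix):
            replay_state[key[len(prefix):]] = value
        else:
            core_state[key] = value
    return core_state, replay_state


# Toy stand-in for an older checkpoint; the keys are made up.
legacy_ckpt = {
    "actor.weight": [0.1, 0.2],
    "_replay_buffer._current_size": 3,
}
core_state, replay_state = split_wrapper_keys(legacy_ckpt)
print(core_state)
print(replay_state)
```

With a split like this, load_state_dict() on the wrapper could pass core_state to the core alg and handle replay_state separately, instead of handing unexpected keys straight to the core alg.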

Need to throw an error when resuming from a ckpt results in an empty replay buffer.
Need to add unittest coverage for saving a correct ckpt and reloading it.
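
A rough sketch of what the guard and its test could look like, using only the standard library. Everything here is hypothetical (EmptyReplayBufferError, ensure_replay_buffer_restored, and the test class are not names from alf), and a real test would exercise the actual save and reload path rather than a bare size check.

```python
import unittest


class EmptyReplayBufferError(RuntimeError):
    """Hypothetical error for a resume that restored no replay data."""


def ensure_replay_buffer_restored(buffer_size: int, resuming: bool) -> None:
    """Raise if training resumes from a ckpt but the replay buffer is empty.

    A fresh run legitimately starts with an empty buffer, so the check
    only applies when resuming.
    """
    if resuming and buffer_size == 0:
        raise EmptyReplayBufferError(
            "Resumed from a checkpoint but the replay buffer is empty.")


class ResumeCheckTest(unittest.TestCase):
    def test_resume_with_empty_buffer_raises(self):
        with self.assertRaises(EmptyReplayBufferError):
            ensure_replay_buffer_restored(buffer_size=0, resuming=True)

    def test_resume_with_data_passes(self):
        ensure_replay_buffer_restored(buffer_size=128, resuming=True)

    def test_fresh_start_with_empty_buffer_passes(self):
        ensure_replay_buffer_restored(buffer_size=0, resuming=False)
```

Saved as a module, this can be run with python -m unittest.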

alf/algorithms/distributed_off_policy_algorithm.py:211
summarize_metrics() now only calls self._core_alg.summarize_metrics(). The wrapper owns the env metrics created by RLAlgorithm.__init__, and the unroller rollout updates those wrapper metrics via the inherited observe_for_metrics(). This drops the normal unroller env metric summaries such as episode count, env steps, return, and episode length. This should likely call super().summarize_metrics() as well.
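
The suggested shape, sketched with toy classes. ToyRLAlgorithm, ToyCoreAlg, and ToyDistributedWrapper are hypothetical stand-ins, not the alf classes, and the shared log list only exists to make the call order visible.

```python
from typing import List


class ToyRLAlgorithm:
    """Stand-in for the base class that owns the env metrics."""

    def __init__(self, log: List[str]):
        self._log = log

    def summarize_metrics(self) -> None:
        self._log.append("wrapper env metrics")


class ToyCoreAlg:
    """Stand-in for the wrapped core alg with its own summaries."""

    def __init__(self, log: List[str]):
        self._log = log

    def summarize_metrics(self) -> None:
        self._log.append("core alg metrics")


class ToyDistributedWrapper(ToyRLAlgorithm):
    def __init__(self, core_alg: ToyCoreAlg, log: List[str]):
        super().__init__(log)
        self._core_alg = core_alg

    def summarize_metrics(self) -> None:
        # Summarize the wrapper-owned env metrics first, then delegate.
        super().summarize_metrics()
        self._core_alg.summarize_metrics()


log: List[str] = []
wrapper = ToyDistributedWrapper(ToyCoreAlg(log), log)
wrapper.summarize_metrics()
print(log)
```

Calling only the core alg, as the current diff does, would leave the first entry out of the log.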

QuantuMope added 2 commits May 27, 2026 13:50

Add distributed training-related training fixes

d281324

add ckpt loading for dist trainer

4688708

QuantuMope requested review from emailweixu and le-horizon May 27, 2026 20:56

le-horizon reviewed May 29, 2026

QuantuMope wants to merge 2 commits into pytorch from PR/andrew/dist-rl-fixes
