[BugFix] ParallelEnv over MPS envs: default to use_buffers=False, stage pipe data on CPU#3867

Open
discobot wants to merge 3 commits into pytorch:main from discobot:fix/3066-mps-parallelenv-buffers

Conversation

@discobot

Contributor

Description

ParallelEnv over MPS sub-envs (and hence collectors that use them) crashes with
RuntimeError: _share_filename_: only available on CPU. This PR implements the design
from the issue: check the spec-leaf devices (via EnvMetaData.device_map) and, when MPS
leaves are found, warn and default to use_buffers=False if the argument is unspecified,
or raise if use_buffers=True was passed explicitly. configure_parallel runs the same
check, and SerialEnv is untouched (in-process MPS buffers are fine).
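
As an illustration only, the default/raise rule described above could look like the sketch below. The helper name `resolve_use_buffers`, the use of plain device strings, the exception type, and the messages are all assumptions made for this example; they are not taken from the PR's diff.

```python
import warnings
from typing import Mapping, Optional


def resolve_use_buffers(
    device_map: Mapping[str, str], use_buffers: Optional[bool]
) -> Optional[bool]:
    """Hypothetical helper: pick use_buffers from the spec-leaf devices.

    device_map maps a spec-leaf key to a device string such as "cpu" or "mps:0".
    """
    has_mps = any(str(dev).startswith("mps") for dev in device_map.values())
    if not has_mps:
        # No MPS leaves: leave the caller's choice (or the usual default) alone.
        return use_buffers
    if use_buffers:
        # An explicit True cannot work: MPS storages cannot live in shared memory.
        # The exception type here is an assumption.
        raise RuntimeError("use_buffers=True is not supported with MPS sub-envs")
    if use_buffers is None:
        # Unspecified: fall back to the pipe-based path and tell the user.
        warnings.warn("MPS spec leaves detected; defaulting to use_buffers=False")
    return False
```

With this sketch, `resolve_use_buffers({"observation": "mps:0"}, None)` warns and returns `False`, while a CPU-only map returns the argument unchanged.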

Two findings beyond the thread:

  • use_buffers=False alone does not avoid the crash: every pipe send of an MPS
    tensordict (worker results in _run_worker_pipe_direct, parent actions in
    _step_no_buffers/_reset_no_buffers) goes through the same mp reduction
    (reduce_storage -> _share_filename_cpu_()) and fails identically. The no-buffers
    pipe traffic is therefore staged on CPU on both sides and cast back to the env device
    on reception. Staging has to happen before consolidate(): consolidation pickles
    the pre-consolidation device metadata, so consolidate(device="cpu") on its own
    makes the receiver believe the data is already on MPS and skip the cast back.
  • The multi-task branch of _get_metadata unconditionally overwrote an
    already-resolved use_buffers=False; it now honors it, matching the single-task
    branch (without this the MPS default would be reverted; it also affects a
    user-passed False).
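
The stage-then-cast-back protocol from the first finding can be shown with a toy model. This sketch uses only the standard library: a leaf is a `(device, values)` tuple standing in for a tensor, and `to_device`, `send_staged`, and `recv_and_cast` are hypothetical names. It does not model `consolidate()` or the real reduction path; it only shows the ordering (stage on CPU first, send, cast back on reception).

```python
from multiprocessing import Pipe
from typing import Dict, List, Tuple

# A "leaf" is modeled as (device, values); real code would hold tensors.
Leaf = Tuple[str, List[float]]


def to_device(leaf: Leaf, device: str) -> Leaf:
    """Toy stand-in for tensor.to(device): only the device tag changes."""
    _, values = leaf
    return (device, values)


def send_staged(conn, payload: Dict[str, Leaf]) -> None:
    # Stage every leaf on CPU before anything is packed or pickled,
    # so no device metadata captured later still says "mps".
    staged = {key: to_device(leaf, "cpu") for key, leaf in payload.items()}
    conn.send(staged)


def recv_and_cast(conn, env_device: str) -> Dict[str, Leaf]:
    # What arrives is tagged as CPU data, so the cast back is never skipped.
    received = conn.recv()
    assert all(device == "cpu" for device, _ in received.values())
    return {key: to_device(leaf, env_device) for key, leaf in received.items()}


if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    send_staged(child_conn, {"action": ("mps", [0.1, 0.2])})
    print(recv_and_cast(parent_conn, env_device="mps"))
```

The same pair of helpers would be used in both directions (worker results and parent actions), which is why the description says the traffic is staged "on both sides".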

Tests: TestParallelEnvMPSBuffers in test/envs/test_special.py covers the
default/raise logic on CPU-only runners by faking MPS device-map entries;
TestMPSSubEnvs (gated on MPS availability) runs reset/rollout/collection over MPS
sub-envs end to end, including the collector setup from the issue, with a
parent-worker round-trip check. Verified on an M-series Mac (macOS 26, torch 2.12).

Generated with Claude Code

Motivation and Context

Fixes #3066.

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have read the CONTRIBUTION guide (required)
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.

…ge pipe data on CPU

MPS storages can be neither placed in shared memory nor pickled through
multiprocessing pipes, so ParallelEnv crashed with "_share_filename_:
only available on CPU" whenever the sub-envs lived on MPS (pytorch#3066).
Following the design discussed in the issue, _get_metadata now inspects
the spec-leaf devices recorded in EnvMetaData.device_map: with MPS
leaves, use_buffers defaults to False with a warning and an explicit
use_buffers=True raises. The no-buffers path stages the pipe data on
CPU before consolidation on both ends and casts it back to the env
device on reception. SerialEnv behavior is unchanged.
@pytorch-bot

pytorch-bot Bot commented Jun 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3867

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3c0a18e with merge base e797c19:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Jun 13, 2026
@github-actions

github-actions Bot commented Jun 15, 2026

Contributor

Benchmark Results: PR 3c0a18e6 vs main 6364a19b

Benchmark run: https://github.com/pytorch/rl/actions/runs/27761295720

Higher ops/sec is better. Tables are sorted by largest absolute change.
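
The bot does not state how the Change column is computed. A plausible reading, which agrees with the listed rows up to rounding of the displayed values, is the relative difference of the PR against main. The sketch below is written under that assumption; the function names and the strict 5% cut-off are illustrative, not the benchmark workflow's actual code.

```python
def percent_change(main_ops: float, pr_ops: float) -> float:
    """Relative change of the PR against main, in percent (assumed formula)."""
    return (pr_ops - main_ops) / main_ops * 100.0


def classify(main_ops: float, pr_ops: float, threshold: float = 5.0) -> str:
    """Label a benchmark row; higher ops/sec is better."""
    change = percent_change(main_ops, pr_ops)
    if change < -threshold:
        return "regression"
    if change > threshold:
        return "improvement"
    return "neutral"
```

Several of the largest swings are in replay-buffer benchmarks and run in both directions, so they may reflect runner noise rather than an effect of this change; the tables alone cannot settle that.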

CPU

Compared 192 benchmarks. Regressions over 5%: 8. Improvements over 5%: 22.

Benchmark main ops PR ops Change
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 50.44 196.79 +290.18%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 1,026 3,459 +237.08%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 196.69 39.50 -79.92%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 54.09 32.22 -40.43%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 2,870 3,688 +28.53%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 2,854 3,649 +27.84%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 2,939 3,699 +25.85%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 2,818 3,439 +22.04%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 1,963 2,356 +19.99%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 527.57 427.62 -18.95%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[True-backward] 346.23 411.16 +18.75%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 3,274 2,730 -16.64%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1,923 2,167 +12.68%
benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[untyped_storage] 8.5460 7.5917 -11.17%
benchmarks/test_objectives_benchmarks.py::test_a2c_speed[True-backward] 110.54 121.48 +9.89%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 3,050 3,335 +9.35%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-backward] 899.59 979.89 +8.93%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 2,062 2,233 +8.30%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-True] 20,542 22,211 +8.13%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 2,189 2,352 +7.45%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape1-atari] 4,926 5,273 +7.04%
benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[numpy] 341,966 365,772 +6.96%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 510.56 545.28 +6.80%
benchmarks/test_objectives_benchmarks.py::test_values[td1_return_estimate-False-False] 36.91 34.47 -6.62%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-True-True] 18,315 19,488 +6.40%
benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[torch.save] 7,186 6,766 -5.84%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-True] 35,413 37,356 +5.49%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[generalized_advantage_estimate-False-1-512] 110.77 104.72 -5.46%
benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-None] 81.97 86.17 +5.13%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-False-True] 35,171 36,965 +5.10%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-True] 22,425 23,494 +4.76%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape2-large_img] 410.78 430.11 +4.70%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-True] 19,883 19,002 -4.43%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-True] 32,621 34,041 +4.35%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 678.56 649.26 -4.32%
benchmarks/test_objectives_benchmarks.py::test_values[td_lambda_return_estimate-True-False] 24.88 23.83 -4.23%
benchmarks/test_objectives_benchmarks.py::test_values[generalized_advantage_estimate-True-True] 95.81 91.81 -4.18%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-True] 39,031 40,658 +4.17%
benchmarks/test_objectives_benchmarks.py::test_td3_speed[True-backward] 269.07 279.65 +3.93%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 3,008 2,890 -3.92%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 33.14 34.39 +3.79%
benchmarks/test_objectives_benchmarks.py::test_a2c_speed[True-None] 280.25 290.84 +3.78%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-lstm] 2.0056 2.0794 +3.68%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-False-0-gru] 1.3616 1.3121 -3.63%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-gru] 3.0132 3.1219 +3.61%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-True-False] 28,750 29,783 +3.59%
benchmarks/test_objectives_benchmarks.py::test_redq_speed[reduce-overhead-None] 222.53 230.22 +3.46%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-True-False] 26,587 27,496 +3.42%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-True-True] 19,556 20,213 +3.36%
benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-same] 19.79 19.14 -3.27%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-gru] 4.1901 4.3267 +3.26%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 564.54 546.21 -3.25%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 23.81 24.58 +3.24%
benchmarks/test_envs_benchmark.py::test_cat_frames_functional[4-constant] 4,319 4,180 -3.22%
benchmarks/test_envs_benchmark.py::test_parallel 0.9670 0.9368 -3.12%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 1,066 1,098 +3.08%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] 0.5880 0.6060 +3.06%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-True-False] 34,475 33,468 -2.92%
benchmarks/test_objectives_benchmarks.py::test_cql_speed[False-None] 39.17 38.05 -2.87%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-False-0-lstm] 0.8565 0.8326 -2.79%
benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-None] 276.24 283.82 +2.74%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 159.25 163.58 +2.72%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-False-True] 31,207 32,037 +2.66%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-True-True] 17,758 18,229 +2.65%
benchmarks/test_objectives_benchmarks.py::test_values[td0_return_estimate-False-False] 7,965 7,760 -2.58%
benchmarks/test_objectives_benchmarks.py::test_sac_speed[reduce-overhead-None] 491.67 478.99 -2.58%
benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-backward] 57.78 59.24 +2.53%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape2-large_img] 176.73 172.28 -2.52%
benchmarks/test_objectives_benchmarks.py::test_redq_speed[False-None] 97.38 94.94 -2.51%
benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-constant] 2,657 2,592 -2.47%
benchmarks/test_objectives_benchmarks.py::test_ppo_speed[True-None] 258.85 265.14 +2.43%
benchmarks/test_envs_benchmark.py::test_serial 0.5798 0.5658 -2.41%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[200-img_shape1-large_batch] 13.38 13.06 -2.41%
benchmarks/test_envs_benchmark.py::test_cat_frames_functional[4-same] 28.09 27.42 -2.38%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[False-None] 710.06 693.40 -2.35%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-False] 54,389 53,155 -2.27%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-False] 37,522 36,692 -2.21%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-False-False] 48,866 49,917 +2.15%
benchmarks/test_objectives_benchmarks.py::test_iql_speed[False-None] 50.95 49.88 -2.09%
benchmarks/test_objectives_benchmarks.py::test_redq_speed[True-None] 225.20 229.86 +2.07%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape2-large_img] 390.07 398.12 +2.06%
benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[reduce-overhead-None] 283.32 277.63 -2.01%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-False] 31,580 30,953 -1.99%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-backward] 248.62 243.73 -1.97%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape1-atari] 269.91 275.17 +1.95%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape2-large_img] 570.18 559.14 -1.94%
benchmarks/test_objectives_benchmarks.py::test_iql_speed[False-backward] 33.77 33.15 -1.86%
benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-backward] 134.72 137.21 +1.85%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 164.91 167.91 +1.82%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[100-img_shape0-atari] 30.14 29.59 -1.81%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-False-True] 30,190 29,648 -1.80%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 188.81 192.14 +1.76%
benchmarks/test_objectives_benchmarks.py::test_iql_speed[True-backward] 61.25 60.21 -1.70%
benchmarks/test_objectives_benchmarks.py::test_sac_speed[True-backward] 254.16 249.90 -1.68%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 52.45 53.33 +1.67%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[200-img_shape3-large_batch] 139.05 141.34 +1.64%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 1,069 1,086 +1.63%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape1-atari] 694.61 705.69 +1.59%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[200-img_shape3-large_batch] 331.68 336.93 +1.58%
benchmarks/test_objectives_benchmarks.py::test_cql_speed[False-backward] 28.73 28.28 -1.58%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[200-img_shape1-large_batch] 15.27 15.03 -1.57%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[50-img_shape0-small] 3,496 3,443 -1.53%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-False] 76,059 74,908 -1.51%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[100-img_shape0-atari] 26.14 25.74 -1.51%
benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[safetensors] 23,722 24,080 +1.51%
benchmarks/test_collectors_benchmark.py::test_single 8.9557 8.8228 -1.48%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-False] 62,930 62,014 -1.46%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[200-img_shape3-large_batch] 764.64 753.53 -1.45%
benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-None] 214.98 211.89 -1.44%
benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-backward] 134.22 132.32 -1.42%
benchmarks/test_envs_benchmark.py::test_simple 1.8168 1.7911 -1.42%
benchmarks/test_objectives_benchmarks.py::test_sac_speed[False-backward] 89.58 88.31 -1.42%
benchmarks/test_collectors_benchmark.py::test_sync 16.96 16.72 -1.40%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-None] 351.29 346.45 -1.38%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-True-0-gru] 1.4189 1.3995 -1.37%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-True-True] 20,026 20,299 +1.36%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 165.48 167.68 +1.33%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[reduce-overhead-None] 714.12 704.76 -1.31%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[False-backward] 523.40 516.55 -1.31%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-False] 41,944 41,403 -1.29%
... ... ... Showing 120 of 192 comparisons, sorted by absolute change.

GPU

Compared 202 benchmarks. Regressions over 5%: 16. Improvements over 5%: 9.

Benchmark main ops PR ops Change
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 2,664 970.65 -63.56%
benchmarks/test_objectives_benchmarks.py::test_iql_speed[reduce-overhead-None] 79.26 102.79 +29.68%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 2,666 3,219 +20.73%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 2,279 1,932 -15.23%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 43.81 49.35 +12.66%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 803.29 710.14 -11.60%
benchmarks/test_collectors_benchmark.py::test_single_with_rb_pixels 5.3581 4.7666 -11.04%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 523.95 469.83 -10.33%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 3,107 2,787 -10.31%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 2,672 2,946 +10.24%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 3,283 2,978 -9.29%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-backward] 964.09 880.64 -8.66%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 2,781 2,991 +7.57%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 2,050 1,915 -6.57%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 487.04 455.05 -6.57%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 439.92 466.63 +6.07%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 3,538 3,328 -5.94%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1,970 1,859 -5.64%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[200-img_shape3-large_batch] 132.87 140.29 +5.58%
benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[pickle] 12,520 11,831 -5.51%
benchmarks/test_objectives_benchmarks.py::test_td3_speed[True-backward] 387.41 366.33 -5.44%
benchmarks/test_envs_benchmark.py::test_simple 1.2455 1.1792 -5.33%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape2-large_img] 384.13 403.99 +5.17%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[50-img_shape0-small] 3,407 3,581 +5.11%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[generalized_advantage_estimate-False-1-512] 46.30 43.95 -5.08%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[50-img_shape0-small] 4,281 4,488 +4.82%
benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[reduce-overhead-None] 127.15 121.07 -4.78%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-single-True] 1.3329 1.3922 +4.45%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-backward] 236.44 226.01 -4.41%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape2-large_img] 167.11 174.04 +4.14%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[True-backward] 464.05 445.47 -4.00%
benchmarks/test_objectives_benchmarks.py::test_a2c_speed[True-backward] 357.31 343.21 -3.95%
benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[safetensors] 24,871 23,906 -3.88%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[True-None] 788.32 818.28 +3.80%
benchmarks/test_objectives_benchmarks.py::test_ppo_speed[False-backward] 129.14 124.27 -3.77%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 772.17 743.63 -3.70%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-False-False] 51,247 49,408 -3.59%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 814.99 789.15 -3.17%
benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[True-None] 751.20 774.77 +3.14%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb_cuda[200-img_shape1-large_batch] 8.1275 8.3762 +3.06%
benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sample_mixed_devices[1000000-cuda_storage_cpu_sampler] 88.86 86.19 -3.01%
benchmarks/test_objectives_benchmarks.py::test_ppo_speed[reduce-overhead-None] 791.90 815.45 +2.97%
benchmarks/test_objectives_benchmarks.py::test_a2c_speed[False-backward] 145.93 141.60 -2.97%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb_cuda[100-img_shape0-atari] 17.11 17.62 +2.96%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-False-False] 51,109 49,600 -2.95%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-True-False] 30,058 29,177 -2.93%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb_cuda[200-img_shape1-large_batch] 8.4877 8.7358 +2.92%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 23.29 23.96 +2.92%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-False-False] 58,341 56,712 -2.79%
benchmarks/test_objectives_benchmarks.py::test_iql_speed[True-None] 497.02 510.84 +2.78%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-True-False] 35,280 34,303 -2.77%
benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[torch.save] 7,223 7,025 -2.74%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-gru] 47.00 48.28 +2.73%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 52.66 54.09 +2.72%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb_cuda[100-img_shape0-atari] 16.38 16.80 +2.56%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape2-large_img] 399.77 409.97 +2.55%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-True] 22,528 21,956 -2.54%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-True] 42,721 41,666 -2.47%
benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[False-None] 98.76 96.34 -2.45%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-True-True] 19,015 18,550 -2.45%
benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-None] 422.38 412.12 -2.43%
benchmarks/test_objectives_benchmarks.py::test_td3_speed[False-backward] 83.25 81.26 -2.39%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-False] 55,656 54,347 -2.35%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 951.51 973.57 +2.32%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 160.08 163.72 +2.27%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 1,264 1,235 -2.25%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-lstm] 73.85 75.47 +2.20%
benchmarks/test_envs_benchmark.py::test_transformed 0.7198 0.7042 -2.18%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-False] 43,588 42,666 -2.12%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-False] 32,707 32,016 -2.11%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] 0.6883 0.7026 +2.08%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-False-True] 33,133 32,448 -2.07%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 3,029 3,091 +2.06%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-serial-buffers-False] 0.5944 0.6064 +2.02%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 1,295 1,269 -2.02%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[100-img_shape0-atari] 29.90 29.30 -2.00%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-False-True] 28,705 28,135 -1.98%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] 0.5974 0.6092 +1.98%
benchmarks/test_objectives_benchmarks.py::test_cql_speed[False-backward] 40.30 39.51 -1.96%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-True] 20,138 19,743 -1.96%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[reduce-overhead-None] 806.84 822.59 +1.95%
benchmarks/test_objectives_benchmarks.py::test_values[td_lambda_return_estimate-True-False] 11.82 11.60 -1.91%
benchmarks/test_objectives_benchmarks.py::test_values[td1_return_estimate-False-False] 19.29 18.93 -1.90%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-False-True] 30,575 30,024 -1.80%
benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-None] 382.44 389.25 +1.78%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape1-atari] 641.66 652.76 +1.73%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[200-img_shape3-large_batch] 303.74 308.90 +1.70%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-serial-buffers-True] 0.5181 0.5268 +1.70%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-lstm] 21.57 21.20 -1.69%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-True-False] 35,236 34,646 -1.68%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-False] 39,027 38,377 -1.67%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape1-atari] 4,335 4,407 +1.66%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-single-False] 1.6052 1.6317 +1.65%
benchmarks/test_collectors_benchmark.py::test_sync_preempt 10.48 10.31 -1.65%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 164.14 166.78 +1.61%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[50-img_shape0-small] 5,947 6,042 +1.60%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] 22.19 21.84 -1.59%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 2,034 2,066 +1.56%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 961.97 976.86 +1.55%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 164.71 167.25 +1.54%
benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-same] 5.5210 5.4363 -1.53%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 22.81 23.16 +1.53%
benchmarks/test_objectives_benchmarks.py::test_sac_speed[False-backward] 78.34 77.17 -1.50%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-False-True] 30,836 30,381 -1.48%
benchmarks/test_objectives_benchmarks.py::test_sac_speed[reduce-overhead-None] 104.04 102.57 -1.41%
benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[False-backward] 70.58 69.59 -1.39%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[reduce-overhead-None] 1,875 1,900 +1.36%
benchmarks/test_envs_benchmark.py::test_serial 0.4234 0.4176 -1.36%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-False] 78,410 77,351 -1.35%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-True-False] 27,808 27,437 -1.33%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-None] 337.65 342.09 +1.31%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 651.86 643.33 -1.31%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] 0.2277 0.2248 -1.29%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-True] 37,676 37,194 -1.28%
benchmarks/test_collectors_benchmark.py::test_async_pixels 10.59 10.73 +1.28%
benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-backward] 258.44 255.23 -1.24%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 1,248 1,233 -1.24%
benchmarks/test_objectives_benchmarks.py::test_ppo_speed[True-None] 682.95 674.64 -1.22%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 166.23 168.25 +1.22%
benchmarks/test_collectors_benchmark.py::test_sync 10.41 10.54 +1.20%
... ... ... Showing 120 of 202 comparisons, sorted by absolute change.

event.wait(self._timeout)
event.clear()

def _step_and_maybe_reset_no_buffers(

Collaborator

This never casts back.
The worker now stages td/root_next_td on CPU for the step_and_maybe_reset message, but the parent-side reception in _step_and_maybe_reset_no_buffers never casts the result back to self.device.
Claude seems to be saying that this is a dead method anyway, so maybe we don't really care.

@vmoens vmoens left a comment

Collaborator

LGTM thanks

@vmoens vmoens force-pushed the fix/3066-mps-parallelenv-buffers branch from 3241614 to a730733 June 17, 2026 16:11
@vmoens

vmoens commented Jun 18, 2026

Collaborator

Looks more involved than I initially thought - iterating on this

@vmoens vmoens force-pushed the fix/3066-mps-parallelenv-buffers branch from a730733 to 3c0a18e June 18, 2026 13:01

Labels

BugFix, CLA Signed, Transforms

Development

Successfully merging this pull request may close these issues.

[BUG] Data collection error on M1 Macs

2 participants