[BugFix] ParallelEnv over MPS envs: default to use_buffers=False, stage pipe data on CPU #3867
discobot wants to merge 3 commits into
…ge pipe data on CPU

MPS storages can neither be placed in shared memory nor pickled through multiprocessing pipes, so ParallelEnv crashed with "_share_filename_: only available on CPU" whenever the sub-envs lived on MPS (pytorch#3066). Following the design discussed in the issue, _get_metadata now inspects the spec-leaf devices recorded in EnvMetaData.device_map: with MPS leaves, use_buffers defaults to False with a warning, and an explicit use_buffers=True raises. The no-buffers path stages the pipe data on CPU before consolidation on both ends and casts it back to the env device on reception. SerialEnv behavior is unchanged.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3867
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 3c0a18e with merge base e797c19.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Benchmark Results: PR
| Benchmark | main ops | PR ops | Change |
|---|---|---|---|
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 50.44 | 196.79 | +290.18% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 1,026 | 3,459 | +237.08% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 196.69 | 39.50 | -79.92% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] | 54.09 | 32.22 | -40.43% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 2,870 | 3,688 | +28.53% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 2,854 | 3,649 | +27.84% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 2,939 | 3,699 | +25.85% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 2,818 | 3,439 | +22.04% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 1,963 | 2,356 | +19.99% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 527.57 | 427.62 | -18.95% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[True-backward] | 346.23 | 411.16 | +18.75% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 3,274 | 2,730 | -16.64% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 1,923 | 2,167 | +12.68% |
| benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[untyped_storage] | 8.5460 | 7.5917 | -11.17% |
| benchmarks/test_objectives_benchmarks.py::test_a2c_speed[True-backward] | 110.54 | 121.48 | +9.89% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 3,050 | 3,335 | +9.35% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-backward] | 899.59 | 979.89 | +8.93% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 2,062 | 2,233 | +8.30% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-True] | 20,542 | 22,211 | +8.13% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 2,189 | 2,352 | +7.45% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape1-atari] | 4,926 | 5,273 | +7.04% |
| benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[numpy] | 341,966 | 365,772 | +6.96% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 510.56 | 545.28 | +6.80% |
| benchmarks/test_objectives_benchmarks.py::test_values[td1_return_estimate-False-False] | 36.91 | 34.47 | -6.62% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-True-True] | 18,315 | 19,488 | +6.40% |
| benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[torch.save] | 7,186 | 6,766 | -5.84% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-True] | 35,413 | 37,356 | +5.49% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[generalized_advantage_estimate-False-1-512] | 110.77 | 104.72 | -5.46% |
| benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-None] | 81.97 | 86.17 | +5.13% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-False-True] | 35,171 | 36,965 | +5.10% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-True] | 22,425 | 23,494 | +4.76% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape2-large_img] | 410.78 | 430.11 | +4.70% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-True] | 19,883 | 19,002 | -4.43% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-True] | 32,621 | 34,041 | +4.35% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 678.56 | 649.26 | -4.32% |
| benchmarks/test_objectives_benchmarks.py::test_values[td_lambda_return_estimate-True-False] | 24.88 | 23.83 | -4.23% |
| benchmarks/test_objectives_benchmarks.py::test_values[generalized_advantage_estimate-True-True] | 95.81 | 91.81 | -4.18% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-True] | 39,031 | 40,658 | +4.17% |
| benchmarks/test_objectives_benchmarks.py::test_td3_speed[True-backward] | 269.07 | 279.65 | +3.93% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 3,008 | 2,890 | -3.92% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 33.14 | 34.39 | +3.79% |
| benchmarks/test_objectives_benchmarks.py::test_a2c_speed[True-None] | 280.25 | 290.84 | +3.78% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-lstm] | 2.0056 | 2.0794 | +3.68% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-False-0-gru] | 1.3616 | 1.3121 | -3.63% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-gru] | 3.0132 | 3.1219 | +3.61% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-True-False] | 28,750 | 29,783 | +3.59% |
| benchmarks/test_objectives_benchmarks.py::test_redq_speed[reduce-overhead-None] | 222.53 | 230.22 | +3.46% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-True-False] | 26,587 | 27,496 | +3.42% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-True-True] | 19,556 | 20,213 | +3.36% |
| benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-same] | 19.79 | 19.14 | -3.27% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-gru] | 4.1901 | 4.3267 | +3.26% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 564.54 | 546.21 | -3.25% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] | 23.81 | 24.58 | +3.24% |
| benchmarks/test_envs_benchmark.py::test_cat_frames_functional[4-constant] | 4,319 | 4,180 | -3.22% |
| benchmarks/test_envs_benchmark.py::test_parallel | 0.9670 | 0.9368 | -3.12% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 1,066 | 1,098 | +3.08% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] | 0.5880 | 0.6060 | +3.06% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-True-False] | 34,475 | 33,468 | -2.92% |
| benchmarks/test_objectives_benchmarks.py::test_cql_speed[False-None] | 39.17 | 38.05 | -2.87% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-False-0-lstm] | 0.8565 | 0.8326 | -2.79% |
| benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-None] | 276.24 | 283.82 | +2.74% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 159.25 | 163.58 | +2.72% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-False-True] | 31,207 | 32,037 | +2.66% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-True-True] | 17,758 | 18,229 | +2.65% |
| benchmarks/test_objectives_benchmarks.py::test_values[td0_return_estimate-False-False] | 7,965 | 7,760 | -2.58% |
| benchmarks/test_objectives_benchmarks.py::test_sac_speed[reduce-overhead-None] | 491.67 | 478.99 | -2.58% |
| benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-backward] | 57.78 | 59.24 | +2.53% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape2-large_img] | 176.73 | 172.28 | -2.52% |
| benchmarks/test_objectives_benchmarks.py::test_redq_speed[False-None] | 97.38 | 94.94 | -2.51% |
| benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-constant] | 2,657 | 2,592 | -2.47% |
| benchmarks/test_objectives_benchmarks.py::test_ppo_speed[True-None] | 258.85 | 265.14 | +2.43% |
| benchmarks/test_envs_benchmark.py::test_serial | 0.5798 | 0.5658 | -2.41% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[200-img_shape1-large_batch] | 13.38 | 13.06 | -2.41% |
| benchmarks/test_envs_benchmark.py::test_cat_frames_functional[4-same] | 28.09 | 27.42 | -2.38% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[False-None] | 710.06 | 693.40 | -2.35% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-False] | 54,389 | 53,155 | -2.27% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-False] | 37,522 | 36,692 | -2.21% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-False-False] | 48,866 | 49,917 | +2.15% |
| benchmarks/test_objectives_benchmarks.py::test_iql_speed[False-None] | 50.95 | 49.88 | -2.09% |
| benchmarks/test_objectives_benchmarks.py::test_redq_speed[True-None] | 225.20 | 229.86 | +2.07% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape2-large_img] | 390.07 | 398.12 | +2.06% |
| benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[reduce-overhead-None] | 283.32 | 277.63 | -2.01% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-False] | 31,580 | 30,953 | -1.99% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-backward] | 248.62 | 243.73 | -1.97% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape1-atari] | 269.91 | 275.17 | +1.95% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape2-large_img] | 570.18 | 559.14 | -1.94% |
| benchmarks/test_objectives_benchmarks.py::test_iql_speed[False-backward] | 33.77 | 33.15 | -1.86% |
| benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-backward] | 134.72 | 137.21 | +1.85% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 164.91 | 167.91 | +1.82% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[100-img_shape0-atari] | 30.14 | 29.59 | -1.81% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-False-True] | 30,190 | 29,648 | -1.80% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 188.81 | 192.14 | +1.76% |
| benchmarks/test_objectives_benchmarks.py::test_iql_speed[True-backward] | 61.25 | 60.21 | -1.70% |
| benchmarks/test_objectives_benchmarks.py::test_sac_speed[True-backward] | 254.16 | 249.90 | -1.68% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] | 52.45 | 53.33 | +1.67% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[200-img_shape3-large_batch] | 139.05 | 141.34 | +1.64% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 1,069 | 1,086 | +1.63% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape1-atari] | 694.61 | 705.69 | +1.59% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[200-img_shape3-large_batch] | 331.68 | 336.93 | +1.58% |
| benchmarks/test_objectives_benchmarks.py::test_cql_speed[False-backward] | 28.73 | 28.28 | -1.58% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[200-img_shape1-large_batch] | 15.27 | 15.03 | -1.57% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[50-img_shape0-small] | 3,496 | 3,443 | -1.53% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-False] | 76,059 | 74,908 | -1.51% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[100-img_shape0-atari] | 26.14 | 25.74 | -1.51% |
| benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[safetensors] | 23,722 | 24,080 | +1.51% |
| benchmarks/test_collectors_benchmark.py::test_single | 8.9557 | 8.8228 | -1.48% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-False] | 62,930 | 62,014 | -1.46% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[200-img_shape3-large_batch] | 764.64 | 753.53 | -1.45% |
| benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-None] | 214.98 | 211.89 | -1.44% |
| benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-backward] | 134.22 | 132.32 | -1.42% |
| benchmarks/test_envs_benchmark.py::test_simple | 1.8168 | 1.7911 | -1.42% |
| benchmarks/test_objectives_benchmarks.py::test_sac_speed[False-backward] | 89.58 | 88.31 | -1.42% |
| benchmarks/test_collectors_benchmark.py::test_sync | 16.96 | 16.72 | -1.40% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-None] | 351.29 | 346.45 | -1.38% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-True-0-gru] | 1.4189 | 1.3995 | -1.37% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-True-True] | 20,026 | 20,299 | +1.36% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 165.48 | 167.68 | +1.33% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[reduce-overhead-None] | 714.12 | 704.76 | -1.31% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[False-backward] | 523.40 | 516.55 | -1.31% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-False] | 41,944 | 41,403 | -1.29% |
| ... | ... | ... | Showing 120 of 192 comparisons, sorted by absolute change. |
GPU
Compared 202 benchmarks. Regressions over 5%: 16. Improvements over 5%: 9.
| Benchmark | main ops | PR ops | Change |
|---|---|---|---|
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 2,664 | 970.65 | -63.56% |
| benchmarks/test_objectives_benchmarks.py::test_iql_speed[reduce-overhead-None] | 79.26 | 102.79 | +29.68% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 2,666 | 3,219 | +20.73% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 2,279 | 1,932 | -15.23% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 43.81 | 49.35 | +12.66% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 803.29 | 710.14 | -11.60% |
| benchmarks/test_collectors_benchmark.py::test_single_with_rb_pixels | 5.3581 | 4.7666 | -11.04% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 523.95 | 469.83 | -10.33% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 3,107 | 2,787 | -10.31% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 2,672 | 2,946 | +10.24% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 3,283 | 2,978 | -9.29% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-backward] | 964.09 | 880.64 | -8.66% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 2,781 | 2,991 | +7.57% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 2,050 | 1,915 | -6.57% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 487.04 | 455.05 | -6.57% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 439.92 | 466.63 | +6.07% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 3,538 | 3,328 | -5.94% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 1,970 | 1,859 | -5.64% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[200-img_shape3-large_batch] | 132.87 | 140.29 | +5.58% |
| benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[pickle] | 12,520 | 11,831 | -5.51% |
| benchmarks/test_objectives_benchmarks.py::test_td3_speed[True-backward] | 387.41 | 366.33 | -5.44% |
| benchmarks/test_envs_benchmark.py::test_simple | 1.2455 | 1.1792 | -5.33% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape2-large_img] | 384.13 | 403.99 | +5.17% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[50-img_shape0-small] | 3,407 | 3,581 | +5.11% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[generalized_advantage_estimate-False-1-512] | 46.30 | 43.95 | -5.08% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[50-img_shape0-small] | 4,281 | 4,488 | +4.82% |
| benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[reduce-overhead-None] | 127.15 | 121.07 | -4.78% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-single-True] | 1.3329 | 1.3922 | +4.45% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-backward] | 236.44 | 226.01 | -4.41% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape2-large_img] | 167.11 | 174.04 | +4.14% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[True-backward] | 464.05 | 445.47 | -4.00% |
| benchmarks/test_objectives_benchmarks.py::test_a2c_speed[True-backward] | 357.31 | 343.21 | -3.95% |
| benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[safetensors] | 24,871 | 23,906 | -3.88% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[True-None] | 788.32 | 818.28 | +3.80% |
| benchmarks/test_objectives_benchmarks.py::test_ppo_speed[False-backward] | 129.14 | 124.27 | -3.77% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] | 772.17 | 743.63 | -3.70% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-False-False] | 51,247 | 49,408 | -3.59% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] | 814.99 | 789.15 | -3.17% |
| benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[True-None] | 751.20 | 774.77 | +3.14% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb_cuda[200-img_shape1-large_batch] | 8.1275 | 8.3762 | +3.06% |
| benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sample_mixed_devices[1000000-cuda_storage_cpu_sampler] | 88.86 | 86.19 | -3.01% |
| benchmarks/test_objectives_benchmarks.py::test_ppo_speed[reduce-overhead-None] | 791.90 | 815.45 | +2.97% |
| benchmarks/test_objectives_benchmarks.py::test_a2c_speed[False-backward] | 145.93 | 141.60 | -2.97% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb_cuda[100-img_shape0-atari] | 17.11 | 17.62 | +2.96% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-False-False] | 51,109 | 49,600 | -2.95% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-True-False] | 30,058 | 29,177 | -2.93% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb_cuda[200-img_shape1-large_batch] | 8.4877 | 8.7358 | +2.92% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] | 23.29 | 23.96 | +2.92% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-False-False] | 58,341 | 56,712 | -2.79% |
| benchmarks/test_objectives_benchmarks.py::test_iql_speed[True-None] | 497.02 | 510.84 | +2.78% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-True-False] | 35,280 | 34,303 | -2.77% |
| benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[torch.save] | 7,223 | 7,025 | -2.74% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-gru] | 47.00 | 48.28 | +2.73% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] | 52.66 | 54.09 | +2.72% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb_cuda[100-img_shape0-atari] | 16.38 | 16.80 | +2.56% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape2-large_img] | 399.77 | 409.97 | +2.55% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-True] | 22,528 | 21,956 | -2.54% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-True] | 42,721 | 41,666 | -2.47% |
| benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[False-None] | 98.76 | 96.34 | -2.45% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-True-True] | 19,015 | 18,550 | -2.45% |
| benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-None] | 422.38 | 412.12 | -2.43% |
| benchmarks/test_objectives_benchmarks.py::test_td3_speed[False-backward] | 83.25 | 81.26 | -2.39% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-False] | 55,656 | 54,347 | -2.35% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 951.51 | 973.57 | +2.32% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 160.08 | 163.72 | +2.27% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 1,264 | 1,235 | -2.25% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-lstm] | 73.85 | 75.47 | +2.20% |
| benchmarks/test_envs_benchmark.py::test_transformed | 0.7198 | 0.7042 | -2.18% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-False] | 43,588 | 42,666 | -2.12% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-False] | 32,707 | 32,016 | -2.11% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] | 0.6883 | 0.7026 | +2.08% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-False-True] | 33,133 | 32,448 | -2.07% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 3,029 | 3,091 | +2.06% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-serial-buffers-False] | 0.5944 | 0.6064 | +2.02% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 1,295 | 1,269 | -2.02% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[100-img_shape0-atari] | 29.90 | 29.30 | -2.00% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-False-True] | 28,705 | 28,135 | -1.98% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] | 0.5974 | 0.6092 | +1.98% |
| benchmarks/test_objectives_benchmarks.py::test_cql_speed[False-backward] | 40.30 | 39.51 | -1.96% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-True] | 20,138 | 19,743 | -1.96% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[reduce-overhead-None] | 806.84 | 822.59 | +1.95% |
| benchmarks/test_objectives_benchmarks.py::test_values[td_lambda_return_estimate-True-False] | 11.82 | 11.60 | -1.91% |
| benchmarks/test_objectives_benchmarks.py::test_values[td1_return_estimate-False-False] | 19.29 | 18.93 | -1.90% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-False-True] | 30,575 | 30,024 | -1.80% |
| benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-None] | 382.44 | 389.25 | +1.78% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape1-atari] | 641.66 | 652.76 | +1.73% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[200-img_shape3-large_batch] | 303.74 | 308.90 | +1.70% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-serial-buffers-True] | 0.5181 | 0.5268 | +1.70% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-lstm] | 21.57 | 21.20 | -1.69% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-True-False] | 35,236 | 34,646 | -1.68% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-False] | 39,027 | 38,377 | -1.67% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape1-atari] | 4,335 | 4,407 | +1.66% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-single-False] | 1.6052 | 1.6317 | +1.65% |
| benchmarks/test_collectors_benchmark.py::test_sync_preempt | 10.48 | 10.31 | -1.65% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 164.14 | 166.78 | +1.61% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[50-img_shape0-small] | 5,947 | 6,042 | +1.60% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] | 22.19 | 21.84 | -1.59% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 2,034 | 2,066 | +1.56% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 961.97 | 976.86 | +1.55% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 164.71 | 167.25 | +1.54% |
| benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-same] | 5.5210 | 5.4363 | -1.53% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] | 22.81 | 23.16 | +1.53% |
| benchmarks/test_objectives_benchmarks.py::test_sac_speed[False-backward] | 78.34 | 77.17 | -1.50% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-False-True] | 30,836 | 30,381 | -1.48% |
| benchmarks/test_objectives_benchmarks.py::test_sac_speed[reduce-overhead-None] | 104.04 | 102.57 | -1.41% |
| benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[False-backward] | 70.58 | 69.59 | -1.39% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[reduce-overhead-None] | 1,875 | 1,900 | +1.36% |
| benchmarks/test_envs_benchmark.py::test_serial | 0.4234 | 0.4176 | -1.36% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-False] | 78,410 | 77,351 | -1.35% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-True-False] | 27,808 | 27,437 | -1.33% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-None] | 337.65 | 342.09 | +1.31% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 651.86 | 643.33 | -1.31% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] | 0.2277 | 0.2248 | -1.29% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-True] | 37,676 | 37,194 | -1.28% |
| benchmarks/test_collectors_benchmark.py::test_async_pixels | 10.59 | 10.73 | +1.28% |
| benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-backward] | 258.44 | 255.23 | -1.24% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 1,248 | 1,233 | -1.24% |
| benchmarks/test_objectives_benchmarks.py::test_ppo_speed[True-None] | 682.95 | 674.64 | -1.22% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 166.23 | 168.25 | +1.22% |
| benchmarks/test_collectors_benchmark.py::test_sync | 10.41 | 10.54 | +1.20% |
| ... | ... | ... | Showing 120 of 202 comparisons, sorted by absolute change. |
            event.wait(self._timeout)
            event.clear()

        def _step_and_maybe_reset_no_buffers(
This never casts back
The worker now stages td/root_next_td to CPU for the step_and_maybe_reset message, but the parent reception in _step_and_maybe_reset_no_buffers never casts back to self.device.
Claude seems to be saying that this is a dead method anyway so maybe we don't really care
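The concern in this comment can be illustrated with a small toy model. This is only a sketch in plain Python, not torchrl code: `FakeTensor`, `stage_on_cpu`, and `receive` are hypothetical stand-ins, and the device is tracked as a string rather than as a real tensor placement. It shows that a message staged on CPU by the sender stays on CPU unless the receiver explicitly casts it back.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class FakeTensor:
    """Stand-in for a tensor: it only tracks which device the data lives on."""

    values: tuple
    device: str

    def to(self, device: str) -> "FakeTensor":
        return replace(self, device=device)


def stage_on_cpu(msg: Dict[str, FakeTensor]) -> Dict[str, FakeTensor]:
    """Sender side: move every entry to CPU before it goes through the pipe."""
    return {key: value.to("cpu") for key, value in msg.items()}


def receive(msg: Dict[str, FakeTensor], env_device: str) -> Dict[str, FakeTensor]:
    """Receiver side: cast every entry back to the env device."""
    return {key: value.to(env_device) for key, value in msg.items()}


# Hypothetical worker output living on MPS.
worker_out = {
    "td": FakeTensor((1.0, 2.0), "mps"),
    "root_next_td": FakeTensor((3.0,), "mps"),
}
staged = stage_on_cpu(worker_out)
without_cast = staged  # a reception path that forgets the cast back
with_cast = receive(staged, env_device="mps")

print({k: v.device for k, v in without_cast.items()})  # {'td': 'cpu', 'root_next_td': 'cpu'}
print({k: v.device for k, v in with_cast.items()})     # {'td': 'mps', 'root_next_td': 'mps'}
```

If the method is indeed unused, the missing cast has no visible effect, but any live reception path would need the equivalent of `receive`.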
3241614 to a730733

Looks more involved than I initially thought - iterating on this

a730733 to 3c0a18e
Description
ParallelEnv over mps sub-envs (and hence collectors) crashes with RuntimeError: _share_filename_: only available on CPU. This implements the design from the issue: check the spec-leaf devices (via EnvMetaData.device_map), warn and default to use_buffers=False when unspecified, and raise on an explicit use_buffers=True, with configure_parallel running the same check and SerialEnv untouched (in-process MPS buffers are fine).
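The default/raise rule above can be sketched in plain Python. This is a minimal illustration, not the torchrl implementation: the helper name `resolve_use_buffers`, the `RuntimeError` type, the message texts, and the `True` fallback for envs without MPS leaves are all assumptions made for the example. Only the MPS behavior (warn and default to False when unspecified, raise on an explicit True, keep an explicit False) comes from the description.

```python
import warnings
from typing import Iterable, Optional


def resolve_use_buffers(use_buffers: Optional[bool], leaf_devices: Iterable[str]) -> bool:
    """Hypothetical helper mirroring the rule described above."""
    # A device string such as "mps:0" counts as MPS.
    has_mps = any(str(device).split(":")[0] == "mps" for device in leaf_devices)
    if not has_mps:
        # Assumption of this sketch: without MPS leaves, buffers are the default.
        return True if use_buffers is None else use_buffers
    if use_buffers is True:
        # Exception type and message are assumptions.
        raise RuntimeError(
            "use_buffers=True is not supported with MPS sub-envs: "
            "MPS storages cannot be placed in shared memory."
        )
    if use_buffers is None:
        warnings.warn("MPS spec leaves detected; defaulting to use_buffers=False.")
    # An explicit False is kept as-is, without a warning.
    return False
```

Keeping the rule in one function is a choice of the sketch; it is one way to let the constructor path and configure_parallel run the same check.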
Two findings beyond the thread:

1. use_buffers=False alone does not avoid the crash: every pipe send of an MPS tensordict (worker results in _run_worker_pipe_direct, parent actions in _step_no_buffers / _reset_no_buffers) goes through the same mp reduction (reduce_storage -> _share_filename_cpu_()) and fails identically. The no-buffers pipe traffic is therefore staged on CPU on both sides and cast back to the env device on reception. Staging has to happen before consolidate(): consolidation pickles the pre-consolidation device metadata, so consolidate(device="cpu") on its own makes the receiver believe the data is already on MPS and skip the cast back.
2. _get_metadata unconditionally overwrote an already-resolved use_buffers=False; it now honors it, matching the single-task branch (without this the MPS default would be reverted; it also affects a user-passed False).

Tests: TestParallelEnvMPSBuffers in test/envs/test_special.py covers the default/raise logic on CPU-only runners by faking MPS device-map entries; TestMPSSubEnvs (gated on MPS availability) runs reset/rollout/collection over MPS sub-envs end to end, including the collector setup from the issue, with a parent-worker round-trip check. Verified on an M-series Mac (macOS 26, torch 2.12).
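The ordering constraint in the first finding (stage on CPU before consolidating) can be illustrated with a toy model. Everything here is hypothetical and written in plain Python: `ToyTensorDict`, `PipeMessage`, `toy_consolidate`, and `receiver_casts_back` do not exist in tensordict or torchrl, and the model only encodes the behavior stated above, namely that the device metadata is captured from the input before any device move.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyTensorDict:
    device: str  # where the data currently lives


@dataclass(frozen=True)
class PipeMessage:
    storage_device: str   # where the consolidated storage actually lives
    recorded_device: str  # device metadata serialized with the message


def toy_consolidate(td: ToyTensorDict, device: Optional[str] = None) -> PipeMessage:
    """Toy model: metadata is captured from the input before any device move."""
    return PipeMessage(storage_device=device or td.device, recorded_device=td.device)


def receiver_casts_back(msg: PipeMessage, env_device: str) -> bool:
    """The receiver trusts the recorded metadata to decide whether to cast."""
    return msg.recorded_device != env_device


td = ToyTensorDict(device="mps")

# Shortcut: consolidate straight to CPU. The storage is on CPU, but the
# metadata still says "mps", so the receiver skips the cast back.
shortcut = toy_consolidate(td, device="cpu")

# Order described above: stage on CPU first, then consolidate.
staged = toy_consolidate(ToyTensorDict(device="cpu"))

print(receiver_casts_back(shortcut, env_device="mps"))  # False
print(receiver_casts_back(staged, env_device="mps"))    # True
```

In the toy model both messages carry CPU storage, and only the staged one tells the receiver that a cast back to the env device is needed.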
Generated with Claude Code
Motivation and Context
Fixes #3066.
Types of changes
Checklist