Question
Hi,
I noticed that nvshmem_uint64_wait_until calls nvshmemi_transfer_syncapi_update_mem() after the wait, which triggers a flush (via cuFlushGPUDirectRDMAWrites or RDMA Read loopback) to ensure preceding RDMA Write data is visible to GPU SMs.
However, nvshmem_signal_wait_until skips this entirely — it only does a volatile poll and returns. The proxy's process_channel_put_signal also does not call enforce_cst.
How does nvshmem_signal_wait_until guarantee that the put data (sent before the signal) is visible to GPU SMs once the signal is observed? Should users call nvshmem_uint64_wait_until instead?
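For concreteness, here is a minimal sketch of the pattern I am asking about. The kernel names, buffer names, and payload size are made up for illustration. It assumes exactly two PEs on a remote (proxy-based) transport and a build with relocatable device code. It is not taken from the NVSHMEM sources.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

constexpr size_t kCount = 1024;  // illustrative payload size (hypothetical)

// PE 0: write the payload to the peer, then set the peer's signal word to 1.
__global__ void producer(uint64_t *data, const uint64_t *src, uint64_t *sig, int peer) {
    nvshmem_uint64_put_signal(data, src, kCount, sig, 1, NVSHMEM_SIGNAL_SET, peer);
}

// PE 1: wait for the signal, then read the payload.
__global__ void consumer(const uint64_t *data, uint64_t *sig, uint64_t *last) {
    nvshmem_signal_wait_until(sig, NVSHMEM_CMP_EQ, 1);
    // The question: is this load guaranteed to observe the put data?
    *last = data[kCount - 1];
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    // Symmetric allocations: payload destination, payload source, signal word.
    uint64_t *data = (uint64_t *)nvshmem_malloc(kCount * sizeof(uint64_t));
    uint64_t *src  = (uint64_t *)nvshmem_malloc(kCount * sizeof(uint64_t));
    uint64_t *sig  = (uint64_t *)nvshmem_malloc(sizeof(uint64_t));
    uint64_t *last = nullptr;  // local result slot, not symmetric
    cudaMalloc(&last, sizeof(uint64_t));

    cudaMemset(data, 0x00, kCount * sizeof(uint64_t));
    cudaMemset(src,  0xff, kCount * sizeof(uint64_t));
    cudaMemset(sig,  0x00, sizeof(uint64_t));
    cudaMemset(last, 0x00, sizeof(uint64_t));
    cudaDeviceSynchronize();
    nvshmem_barrier_all();  // make sure both PEs finished initialization

    if (mype == 0) {
        producer<<<1, 1>>>(data, src, sig, 1);
    } else if (mype == 1) {
        consumer<<<1, 1>>>(data, sig, last);
    }
    cudaDeviceSynchronize();

    if (mype == 1) {
        uint64_t host_last = 0;
        cudaMemcpy(&host_last, last, sizeof(uint64_t), cudaMemcpyDeviceToHost);
        printf("last element seen after signal: 0x%llx\n", (unsigned long long)host_last);
    }

    nvshmem_barrier_all();
    cudaFree(last);
    nvshmem_free(sig);
    nvshmem_free(src);
    nvshmem_free(data);
    nvshmem_finalize();
    return 0;
}
```

If the data load in consumer can race with the RDMA write on some transports, replacing nvshmem_signal_wait_until with nvshmem_uint64_wait_until on the same signal word would be the workaround I have in mind.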
Thanks!