feat: add runtime batch_bool mask overloads for load_masked/store_masked#1332
feat: add runtime batch_bool mask overloads for load_masked/store_masked#1332DiamonDinoia wants to merge 3 commits into
Conversation
7484c4b to
d5f21c7
Compare
|
Coild you split the head / tail part in another PR? This one is already quite dense... |
| // (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single | ||
| // intrinsic that suppresses inactive-lane reads in hardware. | ||
| constexpr std::size_t size = batch<T, A>::size; | ||
| alignas(A::alignment()) std::array<T, size> buffer {}; |
There was a problem hiding this comment.
to make it worse, building a mask is not always a single operation depending on the target...
There was a problem hiding this comment.
Addressed in the latest push — switched the common-arch fallback to use mask.get(i) directly instead of materialising mask.mask() once and shifting per lane. The .mask() call is now gone from the hot path on architectures that fall back to common, so the per-target cost of building the bit mask no longer matters here.
| // (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single | ||
| // intrinsic that suppresses inactive-lane reads in hardware. | ||
| constexpr std::size_t size = batch<T, A>::size; | ||
| alignas(A::alignment()) std::array<T, size> buffer {}; |
There was a problem hiding this comment.
this array assignment forces everything to zero, while some stores are not needed, and the compiler is notable to optimize this away in the generic case
There was a problem hiding this comment.
Addressed — dropped the value-init {} on the buffer and instead write every lane unconditionally in one pass: buffer[i] = mask.get(i) ? mem[i] : T(0);. No double-write on inactive lanes anymore.
| XSIMD_INLINE std::enable_if_t<std::is_integral<T>::value && (sizeof(T) == 4 || sizeof(T) == 8), batch<T, A>> | ||
| load_masked(T const* mem, batch_bool<T, A> mask, convert<T>, Mode, requires_arch<avx2>) noexcept | ||
| { | ||
| using int_t = std::conditional_t<sizeof(T) == 4, int32_t, long long>; |
There was a problem hiding this comment.
why long long and not int64_t ? Tehre's no garantee that sizeof(long long) == 8
There was a problem hiding this comment.
long long is what the Intel intrinsic _mm256_maskload_epi64 takes (long long const*). Using int64_t* would force a reinterpret_cast at every call site. Added static_assert(sizeof(long long) == 8, ...) next to the helpers so the assumption is pinned at compile time — the dispatcher uses the pointer-width to pick the right intrinsic.
| } | ||
| else | ||
| { | ||
| _mm256_maskstore_epi64(reinterpret_cast<long long*>(mem), __m256i(mask), __m256i(src)); |
There was a problem hiding this comment.
ok, I guess that's a constraint of the Intel intrinsic, at least static_assert that sizeof(long long) ==8 and sizeof(int) == 4 if you're using this to disntinguish between the two?
| // constructs a 128-bit chunk predicate (svdupq_b{8,16,32,64}), which | ||
| // is replication-based and does not correctly express a per-lane | ||
| // mask on SVE wider than 128 bits — going through ``as_batch_bool`` | ||
| // gives the right predicate for every vector width. ``int32``/ |
There was a problem hiding this comment.
Do you know if the pmask approach would be faster? If so we could still if constexpr its usage when the sve size allows it.
There was a problem hiding this comment.
Good question — pmask (svdupq_b{8,16,32,64}) is replication-based: it builds a 128-bit chunk and replicates it across the SVE register, so it only expresses a per-lane mask correctly when the SVE vector length is exactly 128 bits. For 256/512/1024/2048-bit SVE it would silently produce the wrong predicate. The current path through as_batch_bool is correct for every VL and lowers to a single cmpne against the integer-domain mask. We could in principle gate pmask on __ARM_FEATURE_SVE_BITS == 128, but on my qemu build that case lowers to the same ptrue + cmpne pair, so it would just be conditional code without a measured win. Happy to add the gate if you have a benchmark that shows a delta on a 128-bit-VL system.
| * so partial loads across a page boundary are safe. \c stream_mode is not | ||
| * supported. | ||
| * | ||
| * \warning Runtime-mask loads carry a significant performance penalty on |
There was a problem hiding this comment.
I don't think we should go into details here:
- it's difficult to maintain this kind of documentation (what about newly added architectures)
- we already have the case for other operations and we don't specify it.
I think it's important to communicate that info, but until we have an automated way to do so, better not just throw documentation at it.
There was a problem hiding this comment.
Agreed — collapsed the four runtime-mask doxygen blocks down to one short paragraph each (one for load, one for store), with \overload on the unaligned variant. Removed the per-architecture rundown of which targets do/do not have native maskload, since that information rots quickly as the project picks up new arches.
| static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value, | ||
| "supported load mode"); | ||
| constexpr uint64_t full_mask = details::full_mask(size); | ||
| const auto bits = mask.mask(); |
There was a problem hiding this comment.
I'm unsure we want that extra call to mask which may be costly, plus the extra tests... if masking is supported, is it beneifical? If it's not, we're already slow...
There was a problem hiding this comment.
Agreed — dropped the bits == 0 / bits == full_mask early-out in both batch::load(ptr, batch_bool, mode) and batch::store(ptr, batch_bool, mode). The runtime-mask member now just forwards straight to kernel::load_masked / kernel::store_masked. Targets with native predicated instructions (AVX2/AVX-512/SVE/RVV) absorb the all-zero / all-one mask via the hardware predicate, and on the common scalar fallback the per-lane loop handles those cases for free. The extra mask.mask() call and the two compares are gone.
| static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value, | ||
| "supported store mode"); | ||
| constexpr uint64_t full_mask = details::full_mask(size); | ||
| const auto bits = mask.mask(); |
There was a problem hiding this comment.
Same fix — early-out and static_assert are gone from batch::store(T*, batch_bool, Mode); it now forwards directly to kernel::store_masked.
d5f21c7 to
665925b
Compare
…1332 review - common: drop zero-init buffer + mask.mask() pack; use mask.get(i) directly - batch::load/store(batch_bool, Mode): drop bits==0/full early-out, forward to kernel (native arches absorb the all-zero/all-one mask in hardware) - avx2: pin sizeof(int)==4 / sizeof(long long)==8 next to detail::maskload helpers; runtime store routes through detail::maskstore symmetrically - avx2_128: introduce detail::maskload_128 / maskstore_128; constant- and runtime-mask paths share them; fix stale convert<double>/avx_128 dispatch tags on the int64/uint64 constant-mask overloads - xsimd_api.hpp + data_transfer.rst: shorten runtime-mask docs (drop the per-architecture rundown that would rot as new arches land) - test_load_store: collapse run_*_mask_pattern / run_*_runtime_mask_pattern into one helper parameterized on a MaskKind policy; drop first_N/last_N patterns (covered by load_head/load_tail in the follow-up branch) Codegen on AVX2/AVX2-128 verified: each runtime-mask load/store reduces to a single vpmaskmov{d,q}; the early-out removal eliminates an extra vmovmskps + test + cmp + branch tail.
7d7bbc3 to
d240cef
Compare
Adds runtime batch_bool mask overloads of xsimd::load_masked and xsimd::store_masked across AVX, AVX2, AVX-512, SSE, SVE, RVV, and NEON; generic common-path fallback collapsed to a whole-vector select. SVE compile-time masked load/store forwarded through the runtime path so the per-lane predicate is correct on SVE wider than 128 bits. Adds arch-specific runtime-mask overloads of load_masked / store_masked for the avx_128 and avx2_128 arches so they inherit the hardware predicated load/store path on x86. Squashed from: b57a766 feat: add runtime batch_bool mask overloads for load_masked/store_masked d5f21c7 feat: add runtime batch_bool mask overloads for avx_128 / avx2_128
aa676a9 to
7e5cfbe
Compare
…lpers Shorten verbose comments around masked load/store paths, drop the sizeof(int)/sizeof(long long) static_asserts (intrinsic boundaries now reinterpret_cast at the call site), and collapse the four maskload_128/maskstore_128 detail overloads into two XSIMD_IF_CONSTEXPR- dispatched templates. Public surface unchanged.
08bd3f6 to
860bb55
Compare
8/16-bit int masked load/store on AVX512BW previously fell through to the branchy common scalar fallback because xsimd_avx512bw.hpp had no load_masked/store_masked overloads. Add four requires_arch<avx512bw> overloads (runtime batch_bool + compile-time batch_bool_constant, load + store) constrained to sizeof(T)==1||2, emitting the native vmovdqu8 / vmovdqu16 predicated moves (2 instructions, no branch). The size branch lives only in the runtime overloads; the constant overloads delegate via mask.as_batch_bool(), which also avoids batch_bool_constant::mask() (return type int) truncating a 64-lane int8 compile-time mask. 32/64-bit stays on the avx512f path; SSE/AVX2 8/16-bit scalar fallback is hardware-forced and unchanged.
860bb55 to
e592d54
Compare
|
@serge-sans-paille I am now happy with how this looks. I tried to simplify things when possible :) @claude did a couple of rounds of review |
serge-sans-paille
left a comment
There was a problem hiding this comment.
A few minor nits but looks good to me otherwise.
No masked load / store on Neon, I haven't checked on VSX & VXE, @Andreas-Krebbel ?
| .. [#m] Masked ``load`` / ``store`` come in two flavours. The | ||
| :cpp:class:`batch_bool_constant` overload encodes the mask in the type and | ||
| is resolved at compile time. The runtime :cpp:class:`batch_bool` overload | ||
| accepts a mask computed at runtime. Prefer the compile-time mask whenever |
There was a problem hiding this comment.
rewording: For performance reason, prefer the compile-time mask whenever possible.
| load_masked(T const* mem, batch_bool<T, A> mask, convert<T>, Mode, requires_arch<common>) noexcept | ||
| { | ||
| // Scalar fallback: only active lanes are touched. Arches with | ||
| // hardware predicated loads override this. |
There was a problem hiding this comment.
"should override" ?
| store_masked(T* mem, batch<T, A> const& src, batch_bool<T, A> mask, Mode, requires_arch<common>) noexcept | ||
| { | ||
| // Scalar fallback: only active lanes are touched. Arches with | ||
| // hardware predicated stores override this. |
| { | ||
| return _mm256_maskload_epi32(mem, mask); | ||
| XSIMD_IF_CONSTEXPR(sizeof(T) == 4) | ||
| { |
There was a problem hiding this comment.
Could you static_assert(sizeof(int) == 4) here? It's likely that because the condition is not always constexpr (until we reach C++17) you might need to just assert.
| } | ||
| else | ||
| { | ||
| return _mm256_maskload_epi64(reinterpret_cast<long long const*>(mem), mask); |
There was a problem hiding this comment.
same here for long long
| return _mm256_maskload_epi64(reinterpret_cast<long long const*>(mem), mask); | ||
| XSIMD_IF_CONSTEXPR(sizeof(T) == 4) | ||
| { | ||
| _mm256_maskstore_epi32(reinterpret_cast<int*>(mem), mask, src); |
| { | ||
| XSIMD_IF_CONSTEXPR(sizeof(T) == 1) | ||
| { | ||
| return _mm512_maskz_loadu_epi8((__mmask64)mask.mask(), mem); |
There was a problem hiding this comment.
interestingly there's no _mm512_maskz_load_epi8 :-)
| #endif | ||
|
|
||
| #if XSIMD_WITH_AVX512VL | ||
| #include "./xsimd_avx512vl.hpp" |
There was a problem hiding this comment.
any reason for not moving the remaining include above?
Add runtime-mask overloads of xsimd::load_masked and xsimd::store_masked across AVX2, AVX-512, SSE, SVE, RVV, and NEON. The generic common-path fallback is collapsed to a whole-vector select, and the unaligned page-cross fast path is dropped since the underlying intrinsics suppress faults on masked-off lanes regardless of alignment.