feat: add runtime batch_bool mask overloads for load_masked/store_masked by DiamonDinoia · Pull Request #1332 · xtensor-stack/xsimd

DiamonDinoia · 2026-04-28T21:00:16Z

Add runtime-mask overloads of xsimd::load_masked and xsimd::store_masked across AVX2, AVX-512, SSE, SVE, RVV, and NEON. The generic common-path fallback is collapsed to a whole-vector select, and the unaligned page-cross fast path is dropped since the underlying intrinsics suppress faults on masked-off lanes regardless of alignment.

serge-sans-paille · 2026-05-02T19:58:03Z

Could you split the head / tail part into another PR? This one is already quite dense...

serge-sans-paille · 2026-05-02T20:00:35Z

+            // (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
+            // intrinsic that suppresses inactive-lane reads in hardware.
+            constexpr std::size_t size = batch<T, A>::size;
+            alignas(A::alignment()) std::array<T, size> buffer {};


to make it worse, building a mask is not always a single operation depending on the target...

Addressed in the latest push — switched the common-arch fallback to use mask.get(i) directly instead of materialising mask.mask() once and shifting per lane. The .mask() call is now gone from the hot path on architectures that fall back to common, so the per-target cost of building the bit mask no longer matters here.

serge-sans-paille · 2026-05-02T20:01:38Z

+            // (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
+            // intrinsic that suppresses inactive-lane reads in hardware.
+            constexpr std::size_t size = batch<T, A>::size;
+            alignas(A::alignment()) std::array<T, size> buffer {};


This array initialization forces everything to zero, while some of those stores are not needed, and the compiler is not able to optimize this away in the generic case.

Addressed — dropped the value-init {} on the buffer and instead write every lane unconditionally in one pass: buffer[i] = mask.get(i) ? mem[i] : T(0);. No double-write on inactive lanes anymore.
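The single-pass fallback described above can be sketched in portable C++. This is a minimal illustration, not the xsimd kernel: `lane_mask` and `load_masked_scalar` are hypothetical names, and the mask is modelled as a plain array of bools with a per-lane `get(i)` accessor.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a runtime lane mask; only a per-lane get(i)
// accessor is modelled, not the real batch_bool type.
template <std::size_t N>
struct lane_mask
{
    std::array<bool, N> bits;
    bool get(std::size_t i) const noexcept { return bits[i]; }
};

// Single pass: every lane of the buffer is written exactly once, and mem[i]
// is read only when lane i is active.
template <class T, std::size_t N>
std::array<T, N> load_masked_scalar(T const* mem, lane_mask<N> const& mask) noexcept
{
    std::array<T, N> buffer; // deliberately not value-initialised
    for (std::size_t i = 0; i < N; ++i)
    {
        buffer[i] = mask.get(i) ? mem[i] : T(0);
    }
    return buffer;
}

// Usage: only lanes 0 and 2 are read from memory.
inline void load_masked_scalar_example()
{
    int const data[4] = { 10, 20, 30, 40 };
    lane_mask<4> const m { { { true, false, true, false } } };
    std::array<int, 4> const r = load_masked_scalar(data, m);
    assert(r[0] == 10 && r[1] == 0 && r[2] == 30 && r[3] == 0);
}
```

Because the ternary only evaluates `mem[i]` for active lanes, inactive lanes never touch memory, and no lane is written twice.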

serge-sans-paille · 2026-05-02T20:03:33Z

+        XSIMD_INLINE std::enable_if_t<std::is_integral<T>::value && (sizeof(T) == 4 || sizeof(T) == 8), batch<T, A>>
+        load_masked(T const* mem, batch_bool<T, A> mask, convert<T>, Mode, requires_arch<avx2>) noexcept
+        {
+            using int_t = std::conditional_t<sizeof(T) == 4, int32_t, long long>;


Why long long and not int64_t? There's no guarantee that sizeof(long long) == 8.

long long is what the Intel intrinsic _mm256_maskload_epi64 takes (long long const*). Using int64_t* would force a reinterpret_cast at every call site. Added static_assert(sizeof(long long) == 8, ...) next to the helpers so the assumption is pinned at compile time — the dispatcher uses the pointer-width to pick the right intrinsic.
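As a rough sketch of pinning that assumption at compile time (the alias `maskload_ptr_t` is a hypothetical name, not an xsimd helper, and no intrinsics are called here):

```cpp
#include <cstdint>
#include <type_traits>

// The thread notes that the 64-bit mask-load intrinsic takes `long long const*`;
// the 32-bit variant is used with `int*` in the diff. Pin both widths so the
// size-based dispatch below cannot silently pick the wrong pointee type.
static_assert(sizeof(int) == 4, "32-bit masked path assumes a 4-byte int");
static_assert(sizeof(long long) == 8, "64-bit masked path assumes an 8-byte long long");

// Hypothetical alias: selects the pointee type from the element width.
// Only meaningful for 4-byte and 8-byte element types.
template <class T>
using maskload_ptr_t = std::conditional_t<sizeof(T) == 4, int, long long>;

static_assert(std::is_same<maskload_ptr_t<std::uint32_t>, int>::value,
              "4-byte elements map to int");
static_assert(std::is_same<maskload_ptr_t<std::int64_t>, long long>::value,
              "8-byte elements map to long long");
```

With the assertions in place, a platform where either width differs fails to compile instead of dispatching to an intrinsic of the wrong lane size.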

serge-sans-paille · 2026-05-02T20:05:31Z

+            }
+            else
+            {
+                _mm256_maskstore_epi64(reinterpret_cast<long long*>(mem), __m256i(mask), __m256i(src));


OK, I guess that's a constraint of the Intel intrinsic. Could you at least static_assert that sizeof(long long) == 8 and sizeof(int) == 4, if you're using this to distinguish between the two?

serge-sans-paille · 2026-05-02T20:10:02Z

+        // constructs a 128-bit chunk predicate (svdupq_b{8,16,32,64}), which
+        // is replication-based and does not correctly express a per-lane
+        // mask on SVE wider than 128 bits — going through ``as_batch_bool``
+        // gives the right predicate for every vector width. ``int32``/


Do you know if the pmask approach would be faster? If so we could still if constexpr its usage when the sve size allows it.

Good question — pmask (svdupq_b{8,16,32,64}) is replication-based: it builds a 128-bit chunk and replicates it across the SVE register, so it only expresses a per-lane mask correctly when the SVE vector length is exactly 128 bits. For 256/512/1024/2048-bit SVE it would silently produce the wrong predicate. The current path through as_batch_bool is correct for every VL and lowers to a single cmpne against the integer-domain mask. We could in principle gate pmask on __ARM_FEATURE_SVE_BITS == 128, but on my qemu build that case lowers to the same ptrue + cmpne pair, so it would just be conditional code without a measured win. Happy to add the gate if you have a benchmark that shows a delta on a 128-bit-VL system.
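The replication issue can be illustrated with a small simulation. This sketch uses no SVE intrinsics: a predicate is modelled as an array of bools over 32-bit lanes (so one 128-bit chunk holds 4 lanes), and `replicate_chunk` / `replication_matches` are hypothetical names that only mimic the behaviour described above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Simulation only: with 32-bit lanes, one 128-bit chunk holds 4 lanes.
constexpr std::size_t chunk_lanes = 4;

// Replication-based construction: build one chunk and repeat it across the
// whole vector, as the thread describes for the svdupq_b* family.
template <std::size_t Lanes>
std::array<bool, Lanes> replicate_chunk(std::array<bool, chunk_lanes> const& chunk)
{
    std::array<bool, Lanes> pred {};
    for (std::size_t i = 0; i < Lanes; ++i)
    {
        pred[i] = chunk[i % chunk_lanes];
    }
    return pred;
}

// True when replicating the first chunk reproduces the wanted per-lane mask.
// Assumes Lanes >= chunk_lanes.
template <std::size_t Lanes>
bool replication_matches(std::array<bool, Lanes> const& wanted)
{
    std::array<bool, chunk_lanes> chunk {};
    for (std::size_t i = 0; i < chunk_lanes; ++i)
    {
        chunk[i] = wanted[i];
    }
    return replicate_chunk<Lanes>(chunk) == wanted;
}

inline void replication_example()
{
    // 128-bit vector (4 lanes): the chunk is the whole predicate.
    std::array<bool, 4> const vl128 { { true, false, false, false } };
    assert(replication_matches(vl128));

    // 256-bit vector (8 lanes): "first lane only" cannot be expressed,
    // because replication also activates lane 4.
    std::array<bool, 8> const vl256 { { true, false, false, false, false, false, false, false } };
    assert(!replication_matches(vl256));
}
```

Only masks that happen to be periodic with the chunk length survive replication on wider vectors, which is why a per-lane path is needed in general.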

serge-sans-paille · 2026-05-02T20:12:20Z

+     * so partial loads across a page boundary are safe. \c stream_mode is not
+     * supported.
+     *
+     * \warning Runtime-mask loads carry a significant performance penalty on


I don't think we should go into details here:

it's difficult to maintain this kind of documentation (what about newly added architectures)

we already have the case for other operations and we don't specify it.

I think it's important to communicate that info, but until we have an automated way to do so, better not just throw documentation at it.

Agreed — collapsed the four runtime-mask doxygen blocks down to one short paragraph each (one for load, one for store), with \overload on the unaligned variant. Removed the per-architecture rundown of which targets do/do not have native maskload, since that information rots quickly as the project picks up new arches.

serge-sans-paille · 2026-05-02T20:15:03Z

+        static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
+                      "supported load mode");
+        constexpr uint64_t full_mask = details::full_mask(size);
+        const auto bits = mask.mask();


I'm unsure we want that extra call to mask(), which may be costly, plus the extra tests... If masking is supported, is it beneficial? If it's not, we're already slow...

Agreed — dropped the bits == 0 / bits == full_mask early-out in both batch::load(ptr, batch_bool, mode) and batch::store(ptr, batch_bool, mode). The runtime-mask member now just forwards straight to kernel::load_masked / kernel::store_masked. Targets with native predicated instructions (AVX2/AVX-512/SVE/RVV) absorb the all-zero / all-one mask via the hardware predicate, and on the common scalar fallback the per-lane loop handles those cases for free. The extra mask.mask() call and the two compares are gone.

serge-sans-paille · 2026-05-02T20:15:12Z

+        static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
+                      "supported store mode");
+        constexpr uint64_t full_mask = details::full_mask(size);
+        const auto bits = mask.mask();


Same fix — early-out and static_assert are gone from batch::store(T*, batch_bool, Mode); it now forwards directly to kernel::store_masked.

…1332 review

- common: drop zero-init buffer + mask.mask() pack; use mask.get(i) directly
- batch::load/store(batch_bool, Mode): drop bits==0/full early-out, forward to kernel (native arches absorb the all-zero/all-one mask in hardware)
- avx2: pin sizeof(int)==4 / sizeof(long long)==8 next to detail::maskload helpers; runtime store routes through detail::maskstore symmetrically
- avx2_128: introduce detail::maskload_128 / maskstore_128; constant- and runtime-mask paths share them; fix stale convert<double>/avx_128 dispatch tags on the int64/uint64 constant-mask overloads
- xsimd_api.hpp + data_transfer.rst: shorten runtime-mask docs (drop the per-architecture rundown that would rot as new arches land)
- test_load_store: collapse run_*_mask_pattern / run_*_runtime_mask_pattern into one helper parameterized on a MaskKind policy; drop first_N/last_N patterns (covered by load_head/load_tail in the follow-up branch)

Codegen on AVX2/AVX2-128 verified: each runtime-mask load/store reduces to a single vpmaskmov{d,q}; the early-out removal eliminates an extra vmovmskps + test + cmp + branch tail.

Adds runtime batch_bool mask overloads of xsimd::load_masked and xsimd::store_masked across AVX, AVX2, AVX-512, SSE, SVE, RVV, and NEON; generic common-path fallback collapsed to a whole-vector select.

SVE compile-time masked load/store forwarded through the runtime path so the per-lane predicate is correct on SVE wider than 128 bits.

Adds arch-specific runtime-mask overloads of load_masked / store_masked for the avx_128 and avx2_128 arches so they inherit the hardware predicated load/store path on x86.

Squashed from:
b57a766 feat: add runtime batch_bool mask overloads for load_masked/store_masked
d5f21c7 feat: add runtime batch_bool mask overloads for avx_128 / avx2_128

…lpers

Shorten verbose comments around masked load/store paths, drop the sizeof(int)/sizeof(long long) static_asserts (intrinsic boundaries now reinterpret_cast at the call site), and collapse the four maskload_128/maskstore_128 detail overloads into two XSIMD_IF_CONSTEXPR-dispatched templates. Public surface unchanged.

8/16-bit int masked load/store on AVX512BW previously fell through to the branchy common scalar fallback because xsimd_avx512bw.hpp had no load_masked/store_masked overloads. Add four requires_arch<avx512bw> overloads (runtime batch_bool + compile-time batch_bool_constant, load + store) constrained to sizeof(T)==1||2, emitting the native vmovdqu8 / vmovdqu16 predicated moves (2 instructions, no branch). The size branch lives only in the runtime overloads; the constant overloads delegate via mask.as_batch_bool(), which also avoids batch_bool_constant::mask() (return type int) truncating a 64-lane int8 compile-time mask. 32/64-bit stays on the avx512f path; SSE/AVX2 8/16-bit scalar fallback is hardware-forced and unchanged.
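The truncation hazard mentioned for 64-lane masks can be shown without any intrinsics. In this sketch `pack_mask` and `count_bits` are hypothetical helpers, and `std::uint32_t` stands in for a 32-bit `int` return type so the narrowing is well defined.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Packs one bit per lane into a 64-bit integer (lane i -> bit i).
inline std::uint64_t pack_mask(std::array<bool, 64> const& lanes) noexcept
{
    std::uint64_t bits = 0;
    for (std::size_t i = 0; i < lanes.size(); ++i)
    {
        bits |= static_cast<std::uint64_t>(lanes[i]) << i;
    }
    return bits;
}

// Portable population count (clears the lowest set bit each iteration).
inline int count_bits(std::uint64_t v) noexcept
{
    int n = 0;
    for (; v != 0; v &= v - 1)
    {
        ++n;
    }
    return n;
}

inline void truncation_example()
{
    std::array<bool, 64> lanes {};
    lanes.fill(true);
    std::uint64_t const wide = pack_mask(lanes);
    // Assumption: uint32_t models a 32-bit-wide mask return type.
    std::uint32_t const narrow = static_cast<std::uint32_t>(wide);
    assert(count_bits(wide) == 64);
    assert(count_bits(narrow) == 32); // lanes 32..63 are lost
}
```

A 64-lane int8 mask needs all 64 bits, so any route through a 32-bit integer drops the upper half of the lanes.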

DiamonDinoia · 2026-06-10T20:50:40Z

@serge-sans-paille I am now happy with how this looks. I tried to simplify things when possible :)

@claude did a couple of rounds of review

serge-sans-paille

A few minor nits but looks good to me otherwise.
No masked load / store on Neon, I haven't checked on VSX & VXE, @Andreas-Krebbel ?

serge-sans-paille · 2026-06-10T21:24:07Z

+.. [#m] Masked ``load`` / ``store`` come in two flavours. The
+   :cpp:class:`batch_bool_constant` overload encodes the mask in the type and
+   is resolved at compile time. The runtime :cpp:class:`batch_bool` overload
+   accepts a mask computed at runtime. Prefer the compile-time mask whenever


Rewording: For performance reasons, prefer the compile-time mask whenever possible.
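To make the two flavours concrete, here is a minimal sketch in plain C++. `mask_constant` and `load_masked_sketch` are hypothetical names chosen for illustration; they are not the xsimd types or signatures, and the runtime mask is modelled as an array of bools.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical compile-time mask: the lane pattern is part of the type.
template <bool... Bits>
struct mask_constant
{
};

// Compile-time flavour: the pattern is a constant expression, so the compiler
// can in principle unroll the loop and drop inactive lanes entirely.
template <class T, bool... Bits>
std::array<T, sizeof...(Bits)> load_masked_sketch(T const* mem, mask_constant<Bits...>) noexcept
{
    constexpr bool bits[] = { Bits... };
    std::array<T, sizeof...(Bits)> out;
    for (std::size_t i = 0; i < sizeof...(Bits); ++i)
    {
        out[i] = bits[i] ? mem[i] : T(0);
    }
    return out;
}

// Runtime flavour: the pattern is only known when the call executes.
template <class T, std::size_t N>
std::array<T, N> load_masked_sketch(T const* mem, std::array<bool, N> const& mask) noexcept
{
    std::array<T, N> out;
    for (std::size_t i = 0; i < N; ++i)
    {
        out[i] = mask[i] ? mem[i] : T(0);
    }
    return out;
}

inline void mask_flavours_example()
{
    float const data[4] = { 1.f, 2.f, 3.f, 4.f };
    std::array<float, 4> const a = load_masked_sketch(data, mask_constant<true, true, false, false> {});
    std::array<bool, 4> const rt { { true, true, false, false } };
    std::array<float, 4> const b = load_masked_sketch(data, rt);
    assert(a == b); // same result, different amount of compile-time knowledge
}
```

Both calls produce the same values; the difference is only in how much the compiler knows about the pattern ahead of time.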

serge-sans-paille · 2026-06-10T21:24:43Z

+        load_masked(T const* mem, batch_bool<T, A> mask, convert<T>, Mode, requires_arch<common>) noexcept
+        {
+            // Scalar fallback: only active lanes are touched. Arches with
+            // hardware predicated loads override this.


"should override" ?

serge-sans-paille · 2026-06-10T21:25:34Z

+        store_masked(T* mem, batch<T, A> const& src, batch_bool<T, A> mask, Mode, requires_arch<common>) noexcept
+        {
+            // Scalar fallback: only active lanes are touched. Arches with
+            // hardware predicated stores override this.


serge-sans-paille · 2026-06-10T21:29:06Z

            {
-                return _mm256_maskload_epi32(mem, mask);
+                XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
+                {


Could you static_assert(sizeof(int) == 4) here? It's likely that because the condition is not always constexpr (until we reach C++17) you might need to just assert.

serge-sans-paille · 2026-06-10T21:29:21Z

+                }
+                else
+                {
+                    return _mm256_maskload_epi64(reinterpret_cast<long long const*>(mem), mask);


same here for long long

serge-sans-paille · 2026-06-10T21:29:30Z

-                return _mm256_maskload_epi64(reinterpret_cast<long long const*>(mem), mask);
+                XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
+                {
+                    _mm256_maskstore_epi32(reinterpret_cast<int*>(mem), mask, src);


serge-sans-paille · 2026-06-10T21:31:18Z

+        {
+            XSIMD_IF_CONSTEXPR(sizeof(T) == 1)
+            {
+                return _mm512_maskz_loadu_epi8((__mmask64)mask.mask(), mem);


interestingly there's no _mm512_maskz_load_epi8 :-)

serge-sans-paille · 2026-06-10T21:34:42Z

 #endif

 #if XSIMD_WITH_AVX512VL
 #include "./xsimd_avx512vl.hpp"


any reason for not moving the remaining include above?

DiamonDinoia force-pushed the feat/dynamic-masks branch 5 times, most recently from 7484c4b to d5f21c7 Compare May 1, 2026 19:51

DiamonDinoia requested a review from serge-sans-paille May 1, 2026 20:49

serge-sans-paille reviewed May 2, 2026

DiamonDinoia force-pushed the feat/dynamic-masks branch from d5f21c7 to 665925b Compare May 5, 2026 14:16

DiamonDinoia force-pushed the feat/dynamic-masks branch from 7d7bbc3 to d240cef Compare May 5, 2026 18:07

DiamonDinoia requested a review from serge-sans-paille May 6, 2026 17:19

DiamonDinoia force-pushed the feat/dynamic-masks branch 3 times, most recently from aa676a9 to 7e5cfbe Compare June 10, 2026 17:28

DiamonDinoia force-pushed the feat/dynamic-masks branch 2 times, most recently from 08bd3f6 to 860bb55 Compare June 10, 2026 18:57

DiamonDinoia force-pushed the feat/dynamic-masks branch from 860bb55 to e592d54 Compare June 10, 2026 20:14

serge-sans-paille reviewed Jun 10, 2026
