Skip to content

feat: add runtime batch_bool mask overloads for load_masked/store_masked#1332

Open
DiamonDinoia wants to merge 3 commits into
xtensor-stack:masterfrom
DiamonDinoia:feat/dynamic-masks
Open

feat: add runtime batch_bool mask overloads for load_masked/store_masked#1332
DiamonDinoia wants to merge 3 commits into
xtensor-stack:masterfrom
DiamonDinoia:feat/dynamic-masks

Conversation

@DiamonDinoia

Copy link
Copy Markdown
Contributor

Add runtime-mask overloads of xsimd::load_masked and xsimd::store_masked across AVX2, AVX-512, SSE, SVE, RVV, and NEON. The generic common-path fallback is collapsed to a whole-vector select, and the unaligned page-cross fast path is dropped since the underlying intrinsics suppress faults on masked-off lanes regardless of alignment.

@DiamonDinoia DiamonDinoia force-pushed the feat/dynamic-masks branch 5 times, most recently from 7484c4b to d5f21c7 Compare May 1, 2026 19:51
@serge-sans-paille

Copy link
Copy Markdown
Contributor

Coild you split the head / tail part in another PR? This one is already quite dense...

// (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
// intrinsic that suppresses inactive-lane reads in hardware.
constexpr std::size_t size = batch<T, A>::size;
alignas(A::alignment()) std::array<T, size> buffer {};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make it worse, building a mask is not always a single operation depending on the target...

@DiamonDinoia DiamonDinoia May 5, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest push — switched the common-arch fallback to use mask.get(i) directly instead of materialising mask.mask() once and shifting per lane. The .mask() call is now gone from the hot path on architectures that fall back to common, so the per-target cost of building the bit mask no longer matters here.

// (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
// intrinsic that suppresses inactive-lane reads in hardware.
constexpr std::size_t size = batch<T, A>::size;
alignas(A::alignment()) std::array<T, size> buffer {};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this array assignment forces everything to zero, while some stores are not needed, and the compiler is notable to optimize this away in the generic case

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — dropped the value-init {} on the buffer and instead write every lane unconditionally in one pass: buffer[i] = mask.get(i) ? mem[i] : T(0);. No double-write on inactive lanes anymore.

Comment thread include/xsimd/arch/xsimd_avx2.hpp Outdated
XSIMD_INLINE std::enable_if_t<std::is_integral<T>::value && (sizeof(T) == 4 || sizeof(T) == 8), batch<T, A>>
load_masked(T const* mem, batch_bool<T, A> mask, convert<T>, Mode, requires_arch<avx2>) noexcept
{
using int_t = std::conditional_t<sizeof(T) == 4, int32_t, long long>;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why long long and not int64_t ? Tehre's no garantee that sizeof(long long) == 8

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

long long is what the Intel intrinsic _mm256_maskload_epi64 takes (long long const*). Using int64_t* would force a reinterpret_cast at every call site. Added static_assert(sizeof(long long) == 8, ...) next to the helpers so the assumption is pinned at compile time — the dispatcher uses the pointer-width to pick the right intrinsic.

Comment thread include/xsimd/arch/xsimd_avx2.hpp Outdated
}
else
{
_mm256_maskstore_epi64(reinterpret_cast<long long*>(mem), __m256i(mask), __m256i(src));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I guess that's a constraint of the Intel intrinsic, at least static_assert that sizeof(long long) ==8 and sizeof(int) == 4 if you're using this to disntinguish between the two?

// constructs a 128-bit chunk predicate (svdupq_b{8,16,32,64}), which
// is replication-based and does not correctly express a per-lane
// mask on SVE wider than 128 bits — going through ``as_batch_bool``
// gives the right predicate for every vector width. ``int32``/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if the pmask approach would be faster? If so we could still if constexpr its usage when the sve size allows it.

@DiamonDinoia DiamonDinoia May 5, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question — pmask (svdupq_b{8,16,32,64}) is replication-based: it builds a 128-bit chunk and replicates it across the SVE register, so it only expresses a per-lane mask correctly when the SVE vector length is exactly 128 bits. For 256/512/1024/2048-bit SVE it would silently produce the wrong predicate. The current path through as_batch_bool is correct for every VL and lowers to a single cmpne against the integer-domain mask. We could in principle gate pmask on __ARM_FEATURE_SVE_BITS == 128, but on my qemu build that case lowers to the same ptrue + cmpne pair, so it would just be conditional code without a measured win. Happy to add the gate if you have a benchmark that shows a delta on a 128-bit-VL system.

Comment thread include/xsimd/types/xsimd_api.hpp Outdated
* so partial loads across a page boundary are safe. \c stream_mode is not
* supported.
*
* \warning Runtime-mask loads carry a significant performance penalty on

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should go into details here:

  1. it's difficult to maintain this kind of documentation (what about newly added architectures)
  2. we already have the case for other operations and we don't specify it.

I think it's important to communicate that info, but until we have an automated way to do so, better not just throw documentation at it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — collapsed the four runtime-mask doxygen blocks down to one short paragraph each (one for load, one for store), with \overload on the unaligned variant. Removed the per-architecture rundown of which targets do/do not have native maskload, since that information rots quickly as the project picks up new arches.

Comment thread include/xsimd/types/xsimd_batch.hpp Outdated
static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
"supported load mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure we want that extra call to mask which may be costly, plus the extra tests... if masking is supported, is it beneifical? If it's not, we're already slow...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — dropped the bits == 0 / bits == full_mask early-out in both batch::load(ptr, batch_bool, mode) and batch::store(ptr, batch_bool, mode). The runtime-mask member now just forwards straight to kernel::load_masked / kernel::store_masked. Targets with native predicated instructions (AVX2/AVX-512/SVE/RVV) absorb the all-zero / all-one mask via the hardware predicate, and on the common scalar fallback the per-lane loop handles those cases for free. The extra mask.mask() call and the two compares are gone.

Comment thread include/xsimd/types/xsimd_batch.hpp Outdated
static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
"supported store mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same fix — early-out and static_assert are gone from batch::store(T*, batch_bool, Mode); it now forwards directly to kernel::store_masked.

@DiamonDinoia DiamonDinoia force-pushed the feat/dynamic-masks branch from d5f21c7 to 665925b Compare May 5, 2026 14:16
DiamonDinoia added a commit to DiamonDinoia/xsimd that referenced this pull request May 5, 2026
…1332 review

- common: drop zero-init buffer + mask.mask() pack; use mask.get(i) directly
- batch::load/store(batch_bool, Mode): drop bits==0/full early-out, forward
  to kernel (native arches absorb the all-zero/all-one mask in hardware)
- avx2: pin sizeof(int)==4 / sizeof(long long)==8 next to detail::maskload
  helpers; runtime store routes through detail::maskstore symmetrically
- avx2_128: introduce detail::maskload_128 / maskstore_128; constant- and
  runtime-mask paths share them; fix stale convert<double>/avx_128 dispatch
  tags on the int64/uint64 constant-mask overloads
- xsimd_api.hpp + data_transfer.rst: shorten runtime-mask docs (drop the
  per-architecture rundown that would rot as new arches land)
- test_load_store: collapse run_*_mask_pattern / run_*_runtime_mask_pattern
  into one helper parameterized on a MaskKind policy; drop first_N/last_N
  patterns (covered by load_head/load_tail in the follow-up branch)

Codegen on AVX2/AVX2-128 verified: each runtime-mask load/store reduces
to a single vpmaskmov{d,q}; the early-out removal eliminates an extra
vmovmskps + test + cmp + branch tail.
@DiamonDinoia DiamonDinoia force-pushed the feat/dynamic-masks branch from 7d7bbc3 to d240cef Compare May 5, 2026 18:07
Adds runtime batch_bool mask overloads of xsimd::load_masked and
xsimd::store_masked across AVX, AVX2, AVX-512, SSE, SVE, RVV, and NEON;
generic common-path fallback collapsed to a whole-vector select. SVE
compile-time masked load/store forwarded through the runtime path so the
per-lane predicate is correct on SVE wider than 128 bits.

Adds arch-specific runtime-mask overloads of load_masked / store_masked
for the avx_128 and avx2_128 arches so they inherit the hardware
predicated load/store path on x86.

Squashed from:
  b57a766  feat: add runtime batch_bool mask overloads for load_masked/store_masked
  d5f21c7  feat: add runtime batch_bool mask overloads for avx_128 / avx2_128
@DiamonDinoia DiamonDinoia force-pushed the feat/dynamic-masks branch 3 times, most recently from aa676a9 to 7e5cfbe Compare June 10, 2026 17:28
…lpers

Shorten verbose comments around masked load/store paths, drop the
sizeof(int)/sizeof(long long) static_asserts (intrinsic boundaries now
reinterpret_cast at the call site), and collapse the four
maskload_128/maskstore_128 detail overloads into two XSIMD_IF_CONSTEXPR-
dispatched templates. Public surface unchanged.
@DiamonDinoia DiamonDinoia force-pushed the feat/dynamic-masks branch 2 times, most recently from 08bd3f6 to 860bb55 Compare June 10, 2026 18:57
8/16-bit int masked load/store on AVX512BW previously fell through to the
branchy common scalar fallback because xsimd_avx512bw.hpp had no
load_masked/store_masked overloads. Add four requires_arch<avx512bw>
overloads (runtime batch_bool + compile-time batch_bool_constant, load +
store) constrained to sizeof(T)==1||2, emitting the native vmovdqu8 /
vmovdqu16 predicated moves (2 instructions, no branch).

The size branch lives only in the runtime overloads; the constant
overloads delegate via mask.as_batch_bool(), which also avoids
batch_bool_constant::mask() (return type int) truncating a 64-lane int8
compile-time mask.

32/64-bit stays on the avx512f path; SSE/AVX2 8/16-bit scalar fallback is
hardware-forced and unchanged.
@DiamonDinoia

Copy link
Copy Markdown
Contributor Author

@serge-sans-paille I am now happy with how this looks. I tried to simplify things when possible :)

@claude did a couple of rounds of review

@serge-sans-paille serge-sans-paille left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor nits but looks good to me otherwise.
No masked load / store on Neon, I haven't checked on VSX & VXE, @Andreas-Krebbel ?

.. [#m] Masked ``load`` / ``store`` come in two flavours. The
:cpp:class:`batch_bool_constant` overload encodes the mask in the type and
is resolved at compile time. The runtime :cpp:class:`batch_bool` overload
accepts a mask computed at runtime. Prefer the compile-time mask whenever

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rewording: For performance reason, prefer the compile-time mask whenever possible.

load_masked(T const* mem, batch_bool<T, A> mask, convert<T>, Mode, requires_arch<common>) noexcept
{
// Scalar fallback: only active lanes are touched. Arches with
// hardware predicated loads override this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"should override" ?

store_masked(T* mem, batch<T, A> const& src, batch_bool<T, A> mask, Mode, requires_arch<common>) noexcept
{
// Scalar fallback: only active lanes are touched. Arches with
// hardware predicated stores override this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

{
return _mm256_maskload_epi32(mem, mask);
XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you static_assert(sizeof(int) == 4) here? It's likely that because the condition is not always constexpr (until we reach C++17) you might need to just assert.

}
else
{
return _mm256_maskload_epi64(reinterpret_cast<long long const*>(mem), mask);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here for long long

return _mm256_maskload_epi64(reinterpret_cast<long long const*>(mem), mask);
XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
{
_mm256_maskstore_epi32(reinterpret_cast<int*>(mem), mask, src);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and there

{
XSIMD_IF_CONSTEXPR(sizeof(T) == 1)
{
return _mm512_maskz_loadu_epi8((__mmask64)mask.mask(), mem);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interestingly there's no _mm512_maskz_load_epi8 :-)

#endif

#if XSIMD_WITH_AVX512VL
#include "./xsimd_avx512vl.hpp"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason for not moving the remaining include above?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants