fix: avx512vl masked load/store by DiamonDinoia · Pull Request #1353 · xtensor-stack/xsimd

DiamonDinoia · 2026-05-21T07:24:01Z

xsimd::batch<uint32_t, avx512vl_256>::store(ptr, constexpr_mask, mode) used
to compile to a 6-instruction per-lane scalar extract loop instead of a single
EVEX-encoded masked store, because the call site in xsimd_batch.hpp:766
over-specified template arguments and SFINAE'd away every viable masked-store
overload except the scalar common fallback.

The load side was unaffected — its call site (:743) was already correct.

#include <xsimd/xsimd.hpp>

using A    = xsimd::avx512vl_256;
using Bu32 = xsimd::batch<uint32_t, A>;

// Constant alternating mask: lanes 0,2,4,6 active.
struct alt {
    static constexpr bool get(std::size_t i, std::size_t) { return (i & 1) == 0; }
};

static constexpr auto mask = xsimd::make_batch_bool_constant<uint32_t, alt, A>();

Bu32 load_u32(uint32_t const* p) {
    return Bu32::load(p, mask, xsimd::unaligned_mode{});
}

void store_u32(uint32_t* p, Bu32 v) {
    v.store(p, mask, xsimd::unaligned_mode{});
}

g++ -O3 -S -masm=intel -std=c++14 -march=skylake-avx512 -DXSIMD_DEFAULT_ARCH=xsimd::avx512vl_256

Codegen — before (master, commit `7d30b9cc`)

_Z8load_u32PKj:                  # load_u32
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rdi]
    ret

_Z9store_u32PjN5xsimd5batchIjNS0_12avx512vl_256EEE:   # store_u32
    valignd       ymm1, ymm0, ymm0, 6
    vmovd         DWORD PTR  [rdi], xmm0
    vpextrd       DWORD PTR 8[rdi], xmm0, 2
    vextracti32x4 xmm2, ymm0, 1
    vmovd         DWORD PTR 24[rdi], xmm1
    vmovd         DWORD PTR 16[rdi], xmm2
    ret

Load is fine. Store is the scalar fallback: GCC unrolled the 8-lane loop in
xsimd_common_memory.hpp:377 into four per-lane 32-bit stores, materialising
each active lane no mask instruction involved.

Codegen — after

_Z8load_u32PKj:                  # load_u32 — unchanged
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rdi]
    ret

_Z9store_u32PjN5xsimd5batchIjNS0_12avx512vl_256EEE:   # store_u32
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   YMMWORD PTR [rdi]{k1}, ymm0
    ret

One masked instruction.

Bug fixes

AVX-512VL masked store collapsed to a 6-insn scalar fallback. An over-specified template arg list at the public batch::store(mask) call site pushed a type into a non-type pack, SFINAE'ing every per-arch overload away. The store now reaches the EVEX vmov*{k} intrinsic.
CI linux.yml typo silently dropped CXXFLAGS in the VL_128 matrix row ($CXX_FLAGS vs $CXXFLAGS), so the VL_128-default test job was building with stock flags instead of the requested override.
avx512vl register-traits comment misattributed to AVX512DQ.
Half-confined masked op fell back to scalar on AVX/AVX2/AVX-512F. The half-fold hardcoded sse4_2/avx2 as the half-width target, but two-phase lookup made the better-arch overload invisible at template-definition time — so the recursive call dispatched to the wrong arch. Fixed by include reorder + make_sized_batch_t<T, half>.

Features added

Per-type load_masked / store_masked for avx512vl_128 and avx512vl_256 covering i32/u32/i64/u64/f32/f64 in both aligned and unaligned modes; partial ordering picks them over the avx2 bridges these archs
inherit.
Compile-time guarantee that default_arch matches XSIMD_DEFAULT_ARCH when the macro is set — a static_assert in test_arch.cpp (plus the CMake plumbing to forward the macro) catches default-arch wiring regressions at compile time instead of at runtime.

Cleanups

AVX detail::maskstore helpers now take batch<> types, symmetric with detail::maskload.
Half-store sites and the common select use a typed variable (const batch<T, A> lo = …) instead of const auto lo = batch<T, A>{ … }.
Aligned/unaligned dispatch in the VL overloads uses XSIMD_IF_CONSTEXPR, so the inactive intrinsic isn't instantiated.
Half-fold in xsimd_avx.hpp / xsimd_avx2.hpp / xsimd_avx512f.hpp uses make_sized_batch_t<T, half> instead of a hardcoded arch — picks avx_128 / avx2_128 / avx512vl_256 when available, so half-confined stores land on the EVEX or VEX masked intrinsic.
xsimd_isa.hpp include order: _128 siblings before their wider arch, VL before avx512f.hpp. Required so the recursive store_masked<half_arch> call sees the better-arch overload at template-definition time.

(Changelog by @claude)

serge-sans-paille · 2026-05-23T06:34:21Z

        if [[ '${{ matrix.sys.flags }}' == 'avx512vl_128' ]]; then
          CMAKE_EXTRA_ARGS="$CMAKE_EXTRA_ARGS -DTARGET_ARCH=skylake-avx512"
-          CXXFLAGS="$CXX_FLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128"
+          CXXFLAGS="$CXXFLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128"


oopsie. Thanks for fixing this one.

serge-sans-paille

I wish we could keep each architecture file unaware from other architectures. I do understand this will disappear once we move to C++17-based architecture, but I'd be happier if you could find another way to apply the constraints.

serge-sans-paille · 2026-05-23T06:35:42Z

+        // and the VL native as equally specialized for A=avx512vl_*. (bridge_not_vl in fwd.hpp)
        template <class A, bool... Values, class Mode>
-        XSIMD_INLINE batch<int32_t, A> load_masked(int32_t const* mem, batch_bool_constant<int32_t, A, Values...>, convert<int32_t>, Mode, requires_arch<A>) noexcept
+        XSIMD_INLINE std::enable_if_t<bridge_not_vl<A>::value, batch<int32_t, A>>


there should not be any arch-specific code in xsimd_common_memory.hpp. Could you find another approach?

serge-sans-paille · 2026-05-23T06:38:09Z


            template <class A>
-            XSIMD_INLINE void maskstore(double* mem, batch_bool<double, A> const& mask, batch<double, A> const& src) noexcept
+            XSIMD_INLINE void maskstore(double* mem, batch<as_integer_t<double>, A> const& mask, batch<double, A> const& src) noexcept


wgy that change? In my mental model, the masked store take a bool mask, not an integer mask

We _mm256_maskstore_ps takes a __m256i for mask. The bool mask here available in the function calling this utility is backed by a floating point type.

serge-sans-paille · 2026-05-23T06:39:00Z

        // single templated implementation for integer masked loads (32/64-bit)
-        template <class A, class T, bool... Values, class Mode>
+        template <class A, class T, bool... Values, class Mode,
+                  class = std::enable_if_t<std::is_base_of<avx2, A>::value && !std::is_base_of<avx512vl_256, A>::value>>


I quite dislike the fact that xsimd_avx2.hpp needs to know stuff about avx512vl

I tried, I'll have another look. Without this constraint all compilers work fine except gcc-10 :(

serge-sans-paille · 2026-05-23T06:40:23Z

        }
-        template <class A, bool... Values, class Mode>
+        template <class A, bool... Values, class Mode,
+                  class = std::enable_if_t<std::is_base_of<avx_128, A>::value && !std::is_base_of<avx512vl_128, A>::value>>


same architecture mix reference issue here.

serge-sans-paille · 2026-05-23T06:40:46Z

 #if XSIMD_WITH_AVX
-#include "./xsimd_avx.hpp"
+// clang-format off
+// _128 first: avx half-fold recursive call needs avx_128 visible at parse time.


nice catch.

serge-sans-paille · 2026-05-23T06:41:02Z

     * @ingroup architectures
     *
-     * AVX512DQ instructions
+     * AVX512VL instructions


serge-sans-paille · 2026-05-23T06:41:22Z

+    template <typename T, std::size_t N>
+    struct make_sized_batch;
+    template <typename T, std::size_t N>
+    using make_sized_batch_t = typename make_sized_batch<T, N>::type;


serge-sans-paille · 2026-05-23T06:43:04Z

    message(STATUS "Using emulated target: ${TARGET_EMULATED}")
    set(EMULATED_COMPILE_FLAGS -DXSIMD_DEFAULT_ARCH=${TARGET_ARCH};-DXSIMD_WITH_EMULATED=1)
    unset(TARGET_ARCH CACHE)
+elseif (DEFINED XSIMD_DEFAULT_ARCH AND NOT "${XSIMD_DEFAULT_ARCH}" STREQUAL "")


Per https://cmake.org/cmake/help/latest/command/if.html#constant I think

if(XSIMD_DEFAULT_ARCH)

is enough

(confirmed by https://godbolt.org/z/b64GcKaxM)

serge-sans-paille · 2026-05-23T06:46:04Z

 static_assert(xsimd::all_architectures::contains<xsimd::default_arch>(), "default arch is a valid arch");
+#else
+namespace xsimd
+{


AntoinePrv · 2026-05-25T16:21:35Z

@DiamonDinoia I run into similar issues of load_masked in #1348 where I also fixed and improved the linux.yml CI.

I haven't look at assembly (just making it build with the new updated CI settings). The C++ diff should be fairly small, I wonder if you'd be able to get anything useful from it. I was able to simplify much more the functions from common_memory while keeping them lower priority (I believe). However I think the case common memory casting to float and calling again into the original arch is a big footgun. Would be better IMHO to explicitly make that call everywhere (with possibly a second / different utility function).

That being said, my solution currently stalls on avx512vl_128/avx512vl_256...

Let CMake force a specific default arch via -DXSIMD_DEFAULT_ARCH (idiomatic if(XSIMD_DEFAULT_ARCH) guard), add a test_arch.cpp check that the forced arch is the default, and fix the linux.yml CXXFLAGS typo.

Split the avx_128 variable swizzle into explicit float/double overloads with a width static_assert, and fix an AVX512DQ -> AVX512VL doc comment.

Add the missing int64/uint64/float/double load_masked overloads and correct the store_masked batch_bool_constant typing on avx512vl_128 and avx512vl_256, branching aligned vs unaligned to the right EVEX intrinsic (vmovdqu{32,64}{k}{z} / vmov{a,u}p{s,d}{k}{z}); unsigned overloads delegate via bitwise_cast. Resolve the avx/avx2/avx512f half-fold target through make_sized_batch_t<T, half>::arch_type so a 512-bit masked op picks the VL arch and emits EVEX instead of VEX vpmaskmov*/vmaskmov*.

Drop the cross-arch SFINAE/tag mechanism: a concrete requires_arch<avx512vl_128|256> overload now beats the inherited avx2/avx2_128 one by overload conversion ranking, so no arch file knows about another. xsimd_common_memory.hpp keeps only requires_arch<common> and dispatches on the arch-agnostic trait masked_memory_uses_fp_bitcast (integral with a same-width float register -> reuse that float vmaskmov* path, else a scalar buffer). avx/avx2/avx2_128 drop every is_base_of<avx512vl_*, A> guard; avx2_128 routes native 128-bit integer masked memory through vpmaskmov* (long long* cast for 64-bit) and tags int64/uint64 on avx2_128 (those intrinsics need AVX2). detail::maskstore takes a bool mask and casts internally; xsimd_batch.hpp keeps a make_sized_batch fwd-decl and simplifies the store_masked call; xsimd_isa.hpp documents the _128-first include order; sse2.hpp adapts to the new store_masked(common) signature.

DiamonDinoia · 2026-06-01T18:25:51Z

@serge-sans-paille ready for a second round of review. Probably over commented because I chatted with @claude how to best do this in c++14 (No if constexpr...) my solutions where so SFINAE heavy so I asked multiple times how to simplify this and the final outcome is this one. I left the comments in to kind of explain a bit more, happy to trim/clean them after a second round of review.

serge-sans-paille

This looks good to me now, thanks for the extra steps.

We have decent test coverage for this, it would have failed if something were wrong, good to go!

serge-sans-paille · 2026-06-02T14:39:21Z

@@ -49,6 +49,9 @@ if (TARGET_EMULATED)
    message(STATUS "Using emulated target: ${TARGET_EMULATED}")
    set(EMULATED_COMPILE_FLAGS -DXSIMD_DEFAULT_ARCH=${TARGET_ARCH};-DXSIMD_WITH_EMULATED=1)


for the future: this probably means the two options are incompatible

DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch 5 times, most recently from fe2938e to ea882e6 Compare May 21, 2026 12:04

DiamonDinoia marked this pull request as ready for review May 21, 2026 12:51

DiamonDinoia requested a review from serge-sans-paille May 21, 2026 12:52

serge-sans-paille reviewed May 23, 2026

View reviewed changes

serge-sans-paille requested changes May 23, 2026

View reviewed changes

DiamonDinoia added 3 commits June 1, 2026 11:51

ci: support XSIMD_DEFAULT_ARCH override and verify default_arch

108a99b

Let CMake force a specific default arch via -DXSIMD_DEFAULT_ARCH (idiomatic if(XSIMD_DEFAULT_ARCH) guard), add a test_arch.cpp check that the forced arch is the default, and fix the linux.yml CXXFLAGS typo.

chore: small drive-by fixes (avx_128 swizzle, doc typo)

9ebcf0f

Split the avx_128 variable swizzle into explicit float/double overloads with a width static_assert, and fix an AVX512DQ -> AVX512VL doc comment.

DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch from ea882e6 to fa06792 Compare June 1, 2026 15:56

DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch from fa06792 to 5a40538 Compare June 1, 2026 16:53

serge-sans-paille approved these changes Jun 2, 2026

View reviewed changes

DiamonDinoia merged commit 24053b5 into xtensor-stack:master Jun 2, 2026
89 checks passed

		@@ -49,6 +49,9 @@ if (TARGET_EMULATED)
		message(STATUS "Using emulated target: ${TARGET_EMULATED}")
		set(EMULATED_COMPILE_FLAGS -DXSIMD_DEFAULT_ARCH=${TARGET_ARCH};-DXSIMD_WITH_EMULATED=1)

Conversation

DiamonDinoia commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codegen — before (master, commit 7d30b9cc)

Codegen — after

Bug fixes

Features added

Cleanups

Uh oh!

Choose a reason for hiding this comment

Uh oh!

serge-sans-paille left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AntoinePrv commented May 25, 2026

Uh oh!

DiamonDinoia commented Jun 1, 2026

Uh oh!

serge-sans-paille left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

DiamonDinoia commented May 21, 2026 •

edited

Loading

Codegen — before (master, commit `7d30b9cc`)