fix: avx512vl masked load/store#1353
Conversation
fe2938e to
ea882e6
Compare
| if [[ '${{ matrix.sys.flags }}' == 'avx512vl_128' ]]; then | ||
| CMAKE_EXTRA_ARGS="$CMAKE_EXTRA_ARGS -DTARGET_ARCH=skylake-avx512" | ||
| CXXFLAGS="$CXX_FLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128" | ||
| CXXFLAGS="$CXXFLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128" |
There was a problem hiding this comment.
oopsie. Thanks for fixing this one.
serge-sans-paille
left a comment
There was a problem hiding this comment.
I wish we could keep each architecture file unaware from other architectures. I do understand this will disappear once we move to C++17-based architecture, but I'd be happier if you could find another way to apply the constraints.
| // and the VL native as equally specialized for A=avx512vl_*. (bridge_not_vl in fwd.hpp) | ||
| template <class A, bool... Values, class Mode> | ||
| XSIMD_INLINE batch<int32_t, A> load_masked(int32_t const* mem, batch_bool_constant<int32_t, A, Values...>, convert<int32_t>, Mode, requires_arch<A>) noexcept | ||
| XSIMD_INLINE std::enable_if_t<bridge_not_vl<A>::value, batch<int32_t, A>> |
There was a problem hiding this comment.
there should not be any arch-specific code in xsimd_common_memory.hpp. Could you find another approach?
|
|
||
| template <class A> | ||
| XSIMD_INLINE void maskstore(double* mem, batch_bool<double, A> const& mask, batch<double, A> const& src) noexcept | ||
| XSIMD_INLINE void maskstore(double* mem, batch<as_integer_t<double>, A> const& mask, batch<double, A> const& src) noexcept |
There was a problem hiding this comment.
wgy that change? In my mental model, the masked store take a bool mask, not an integer mask
There was a problem hiding this comment.
We _mm256_maskstore_ps takes a __m256i for mask. The bool mask here available in the function calling this utility is backed by a floating point type.
| // single templated implementation for integer masked loads (32/64-bit) | ||
| template <class A, class T, bool... Values, class Mode> | ||
| template <class A, class T, bool... Values, class Mode, | ||
| class = std::enable_if_t<std::is_base_of<avx2, A>::value && !std::is_base_of<avx512vl_256, A>::value>> |
There was a problem hiding this comment.
I quite dislike the fact that xsimd_avx2.hpp needs to know stuff about avx512vl
There was a problem hiding this comment.
I tried, I'll have another look. Without this constraint all compilers work fine except gcc-10 :(
| } | ||
| template <class A, bool... Values, class Mode> | ||
| template <class A, bool... Values, class Mode, | ||
| class = std::enable_if_t<std::is_base_of<avx_128, A>::value && !std::is_base_of<avx512vl_128, A>::value>> |
There was a problem hiding this comment.
same architecture mix reference issue here.
| #if XSIMD_WITH_AVX | ||
| #include "./xsimd_avx.hpp" | ||
| // clang-format off | ||
| // _128 first: avx half-fold recursive call needs avx_128 visible at parse time. |
| * @ingroup architectures | ||
| * | ||
| * AVX512DQ instructions | ||
| * AVX512VL instructions |
| template <typename T, std::size_t N> | ||
| struct make_sized_batch; | ||
| template <typename T, std::size_t N> | ||
| using make_sized_batch_t = typename make_sized_batch<T, N>::type; |
| message(STATUS "Using emulated target: ${TARGET_EMULATED}") | ||
| set(EMULATED_COMPILE_FLAGS -DXSIMD_DEFAULT_ARCH=${TARGET_ARCH};-DXSIMD_WITH_EMULATED=1) | ||
| unset(TARGET_ARCH CACHE) | ||
| elseif (DEFINED XSIMD_DEFAULT_ARCH AND NOT "${XSIMD_DEFAULT_ARCH}" STREQUAL "") |
There was a problem hiding this comment.
Per https://cmake.org/cmake/help/latest/command/if.html#constant I think
if(XSIMD_DEFAULT_ARCH)
is enough
| static_assert(xsimd::all_architectures::contains<xsimd::default_arch>(), "default arch is a valid arch"); | ||
| #else | ||
| namespace xsimd | ||
| { |
|
@DiamonDinoia I run into similar issues of I haven't look at assembly (just making it build with the new updated CI settings). The C++ diff should be fairly small, I wonder if you'd be able to get anything useful from it. I was able to simplify much more the functions from That being said, my solution currently stalls on |
Let CMake force a specific default arch via -DXSIMD_DEFAULT_ARCH (idiomatic if(XSIMD_DEFAULT_ARCH) guard), add a test_arch.cpp check that the forced arch is the default, and fix the linux.yml CXXFLAGS typo.
Split the avx_128 variable swizzle into explicit float/double overloads with a width static_assert, and fix an AVX512DQ -> AVX512VL doc comment.
Add the missing int64/uint64/float/double load_masked overloads and
correct the store_masked batch_bool_constant typing on avx512vl_128 and
avx512vl_256, branching aligned vs unaligned to the right EVEX intrinsic
(vmovdqu{32,64}{k}{z} / vmov{a,u}p{s,d}{k}{z}); unsigned overloads
delegate via bitwise_cast. Resolve the avx/avx2/avx512f half-fold target
through make_sized_batch_t<T, half>::arch_type so a 512-bit masked op
picks the VL arch and emits EVEX instead of VEX vpmaskmov*/vmaskmov*.
ea882e6 to
fa06792
Compare
Drop the cross-arch SFINAE/tag mechanism: a concrete requires_arch<avx512vl_128|256> overload now beats the inherited avx2/avx2_128 one by overload conversion ranking, so no arch file knows about another. xsimd_common_memory.hpp keeps only requires_arch<common> and dispatches on the arch-agnostic trait masked_memory_uses_fp_bitcast (integral with a same-width float register -> reuse that float vmaskmov* path, else a scalar buffer). avx/avx2/avx2_128 drop every is_base_of<avx512vl_*, A> guard; avx2_128 routes native 128-bit integer masked memory through vpmaskmov* (long long* cast for 64-bit) and tags int64/uint64 on avx2_128 (those intrinsics need AVX2). detail::maskstore takes a bool mask and casts internally; xsimd_batch.hpp keeps a make_sized_batch fwd-decl and simplifies the store_masked call; xsimd_isa.hpp documents the _128-first include order; sse2.hpp adapts to the new store_masked(common) signature.
fa06792 to
5a40538
Compare
|
@serge-sans-paille ready for a second round of review. Probably over commented because I chatted with @claude how to best do this in c++14 (No if constexpr...) my solutions where so SFINAE heavy so I asked multiple times how to simplify this and the final outcome is this one. I left the comments in to kind of explain a bit more, happy to trim/clean them after a second round of review. |
serge-sans-paille
left a comment
There was a problem hiding this comment.
This looks good to me now, thanks for the extra steps.
We have decent test coverage for this, it would have failed if something were wrong, good to go!
| @@ -49,6 +49,9 @@ if (TARGET_EMULATED) | |||
| message(STATUS "Using emulated target: ${TARGET_EMULATED}") | |||
| set(EMULATED_COMPILE_FLAGS -DXSIMD_DEFAULT_ARCH=${TARGET_ARCH};-DXSIMD_WITH_EMULATED=1) | |||
There was a problem hiding this comment.
for the future: this probably means the two options are incompatible
xsimd::batch<uint32_t, avx512vl_256>::store(ptr, constexpr_mask, mode)usedto compile to a 6-instruction per-lane scalar extract loop instead of a single
EVEX-encoded masked store, because the call site in
xsimd_batch.hpp:766over-specified template arguments and SFINAE'd away every viable masked-store
overload except the scalar
commonfallback.The load side was unaffected — its call site (
:743) was already correct.Codegen — before (master, commit
7d30b9cc)Load is fine. Store is the scalar fallback: GCC unrolled the 8-lane loop in
xsimd_common_memory.hpp:377into four per-lane 32-bit stores, materialisingeach active lane no mask instruction involved.
Codegen — after
One masked instruction.
Bug fixes
AVX-512VL masked store collapsed to a 6-insn scalar fallback. An over-specified template arg list at the public
batch::store(mask)call site pushed a type into a non-type pack, SFINAE'ing every per-arch overload away. The store now reaches the EVEXvmov*{k}intrinsic.CI
linux.ymltypo silently droppedCXXFLAGSin the VL_128 matrix row ($CXX_FLAGSvs$CXXFLAGS), so the VL_128-default test job was building with stock flags instead of the requested override.avx512vlregister-traits comment misattributed to AVX512DQ.Half-confined masked op fell back to scalar on AVX/AVX2/AVX-512F. The half-fold hardcoded
sse4_2/avx2as the half-width target, but two-phase lookup made the better-arch overload invisible at template-definition time — so the recursive call dispatched to the wrong arch. Fixed by include reorder +make_sized_batch_t<T, half>.Features added
load_masked/store_maskedforavx512vl_128andavx512vl_256covering i32/u32/i64/u64/f32/f64 in both aligned and unaligned modes; partial ordering picks them over the avx2 bridges these archsinherit.
default_archmatchesXSIMD_DEFAULT_ARCHwhen the macro is set — astatic_assertintest_arch.cpp(plus the CMake plumbing to forward the macro) catches default-arch wiring regressions at compile time instead of at runtime.Cleanups
detail::maskstorehelpers now takebatch<>types, symmetric withdetail::maskload.selectuse a typed variable (const batch<T, A> lo = …) instead ofconst auto lo = batch<T, A>{ … }.XSIMD_IF_CONSTEXPR, so the inactive intrinsic isn't instantiated.xsimd_avx.hpp/xsimd_avx2.hpp/xsimd_avx512f.hppusesmake_sized_batch_t<T, half>instead of a hardcoded arch — picksavx_128/avx2_128/avx512vl_256when available, so half-confined stores land on the EVEX or VEX masked intrinsic.xsimd_isa.hppinclude order:_128siblings before their wider arch, VL beforeavx512f.hpp. Required so the recursivestore_masked<half_arch>call sees the better-arch overload at template-definition time.(Changelog by @claude)