Proposal load/store masked #1162
base: master
Conversation
Force-pushed from 1124f01 to 53da643
Interesting. Before going into the details, what are the memory effects of a masked load, in terms of memory read and value stored? Are the masked elements set to 0? To an undefined value? AVX etc. seem to set the value to zero. I'm not sold on the common implementation, which looks quite heavy on scalar operations. I can see that we can't do a plain load followed by an and, because it could access unallocated memory. If the mask were constant, we could statically optimize some common patterns, but with a dynamic mask as you propose...
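To make the question about semantics concrete, here is a scalar reference sketch of the behaviour AVX's vmaskmov-style loads exhibit: inactive lanes are never read from memory and come back as zero. The function name load_masked_ref is purely illustrative, not part of xsimd.

```cpp
#include <array>
#include <cstddef>

// Hypothetical scalar reference semantics for a masked load: lanes whose
// mask entry is false are not read from memory at all and are set to zero,
// matching the AVX maskload behaviour discussed above.
template <class T, std::size_t N>
std::array<T, N> load_masked_ref(const T* src, const std::array<bool, N>& mask)
{
    std::array<T, N> out{};  // zero-initialize every lane
    for (std::size_t i = 0; i < N; ++i)
        if (mask[i])
            out[i] = src[i];  // only touch memory for active lanes
    return out;
}
```

This is the property that makes a plain load plus and insufficient: the reference version never dereferences src[i] for an inactive lane, so it is safe at the end of an allocation.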
Some thoughts:
In general, I use these operations when I want to vectorize an inner loop whose trip count is not a multiple of the SIMD width: a small inner loop nested in a loop executed many times. Depending on the operations, padding is sometimes slower than masking.
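The tail-masking pattern described above boils down to building a mask whose first n mod W lanes are active for the final partial iteration. A minimal sketch, where W and tail_mask are illustrative names rather than xsimd API:

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t W = 4;  // assumed SIMD width, for illustration only

// Build the mask for the final partial iteration of a loop of length n:
// the first `remaining` lanes (n % W) are active, the rest inactive.
// A masked load/store with this mask replaces a scalar epilogue or padding.
std::array<bool, W> tail_mask(std::size_t remaining)
{
    std::array<bool, W> m{};  // all lanes inactive by default
    for (std::size_t i = 0; i < W && i < remaining; ++i)
        m[i] = true;
    return m;
}
```

The main loop then runs full-width batches, and a single masked load/store with tail_mask(n % W) handles the remainder in one step.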
Force-pushed from 2ae15bf to 510c335
This is starting to take shape now. I still need to clean up, but this is a first working implementation. I extended the mask operations in batch_bool to mimic std::bit, so that I could reuse the code for all masked operations. I feel that having those live outside the class in detail:: is not ideal; I think they are also useful to the user.
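As an illustration of the std::bit-style helpers mentioned above, here is a sketch of a countr_one analogue over a boolean mask, and how it can classify the prefix masks that the static path cares about. Names and placement are assumptions for illustration, not xsimd's actual batch_bool API.

```cpp
#include <array>
#include <cstddef>

// std::bit-style helper over a boolean mask, in the spirit of
// std::countr_one: count the contiguous run of true lanes from index 0.
template <std::size_t N>
std::size_t countr_one(const std::array<bool, N>& m)
{
    std::size_t c = 0;
    while (c < N && m[c])
        ++c;
    return c;
}

// A mask is a "prefix" mask (the common case a static path can optimize)
// when all active lanes form one contiguous run starting at lane 0.
template <std::size_t N>
bool is_prefix_mask(const std::array<bool, N>& m)
{
    std::size_t c = countr_one(m);
    for (std::size_t i = c; i < N; ++i)
        if (m[i])
            return false;  // an active lane after the run: not a prefix
    return true;
}
```

Writing the masked operations in terms of such helpers is what lets one body of code serve every masked load/store variant.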
Force-pushed from fd0e777 to ecf4291
@serge-sans-paille, do you have any clue why mingw32 fails? I need to do a final pass to make the code coherent (I am mixing reinterpret_casts and C-style casts). After that I think it will be ready for review.
…oad_masked) 2. General use case optimization 3. New tests 4. x86 kernels
Force-pushed from ecf4291 to e88369f
}
else
{
    // Fallback to runtime path for non prefix/suffix masks
Unless the compiler optimizes them out, the dynamic load is going to perform several checks that will always be false (none / all).
I'm still not convinced of the relevance of dynamic mask support. Maybe we could start small with "just" the static mask support, and eventually add a dynamic version later if it appears to be needed? But maybe you have a specific case in mind?
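The point about the always-false checks can be illustrated with a compile-time mask: when the lanes are template parameters, the none/all tests are resolved by if constexpr and never reach the generated code. pick_path is a hypothetical name, not xsimd API; it just returns which path would be taken.

```cpp
// Sketch: with the mask lanes encoded as template parameters, the
// none / all checks that a dynamic mask must perform at runtime are
// folded away at compile time.
template <bool... Lanes>
constexpr int pick_path()
{
    if constexpr ((!Lanes && ...))
        return 0;  // no lane active: return a zero batch, touch no memory
    else if constexpr ((Lanes && ...))
        return 1;  // all lanes active: plain unmasked load
    else
        return 2;  // genuinely partial mask: masked kernel or fallback
}
```

Only the third case pays for any masking logic; the first two compile down to a zero batch or an ordinary load.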
It is just that, in my mind, I can offer a public runtime mask API; the compile-time mask then optimizes as much as it can and, if there is no alternative, calls the runtime mask. Same as the swizzle. So I was not planning to optimize the runtime path at all: we just get a masked load/store where possible, or a common implementation, since SSE does not support masking.
PS:
This can be optimized in the future by using AVX on SSE, or AVX512VL, but I will open a discussion once this is merged.
I am cleaning this up to offer only the compile-time version for now; the runtime version can be a separate PR.
Dear @serge-sans-paille,
I made a rough implementation of masked load/store. Before I hash it out, can I have some early feedback?
Thanks,
Marco