
Reduce_mul implementation #1132


Open · emrys53 wants to merge 2 commits into master
Conversation

@emrys53 (Contributor) commented Jul 1, 2025

While using XSIMD in our library, we realized that a dedicated reduce_mul function was missing. This PR introduces a reduce_mul implementation for all architectures and supported types, with architecture-specific optimizations for SSE, AVX, and AVX512 instruction sets. Relevant unit tests are included.

I was unable to test the AVX512-specific implementation on my machine, which only supports AVX2. For AVX512, I used the _mm512_reduce_mul_* intrinsics. According to Intel's documentation, these expand to a sequence of instructions rather than a single parallel one. Since I couldn't benchmark or compare alternative implementations, I left it as-is. I'm open to suggestions for improvement.
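As an aside, here is a minimal sketch of what deferring to those intrinsics could look like; this is not the PR's actual code, and the free-function names are only illustrative:

```cpp
// Hedged sketch: AVX512 reduce_mul paths deferring to the compiler-provided
// _mm512_reduce_mul_* sequence intrinsics mentioned above. Requires AVX512F.
#include <immintrin.h>

static inline float  reduce_mul_f32(__m512  v) { return _mm512_reduce_mul_ps(v); }
static inline double reduce_mul_f64(__m512d v) { return _mm512_reduce_mul_pd(v); }
static inline int    reduce_mul_i32(__m512i v) { return _mm512_reduce_mul_epi32(v); }
```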
Motivation

Although XSIMD provides a general reduce function, using it with std::complex<float> or std::complex<double> fails, because as_unsigned_integer_t is not specialized for complex types. I'm not sure whether this is intended behavior, but I wanted to mention it for context. Providing reduce_mul directly avoids this issue.
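A hedged usage sketch of what this enables, assuming reduce_mul mirrors reduce_add and takes a single batch argument (the exact signature is an assumption, not taken from the diff):

```cpp
// Hypothetical usage of the reduce_mul added by this PR.
#include <complex>
#include <xsimd/xsimd.hpp>

std::complex<float> product_of_lanes(xsimd::batch<std::complex<float>> const& b)
{
    // The generic reduce path fails to instantiate for complex batches
    // because as_unsigned_integer_t is not specialized for them.
    return xsimd::reduce_mul(b); // multiplies all lanes together
}
```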

This is my first contribution to XSIMD, and I’ve only been using the library for about a week. Feedback is very welcome, and I’m happy to adjust or improve the code based on your review.

@@ -141,6 +141,20 @@ namespace xsimd
return res;
}

// hmul
template <class A, class T, class /*=typename std::enable_if<std::is_integral<T>::value, void>::type*/>
XSIMD_INLINE T hmul(batch<T, A> const& self, requires_arch<common>) noexcept
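The hunk above is truncated to the declaration. A minimal sketch of what a buffer-based generic body could look like, reusing only the names visible in the hunk (XSIMD_INLINE, batch, requires_arch<common>) and not showing the PR's actual code:

```cpp
// Hedged sketch of a generic hmul body: spill the batch to an aligned scalar
// buffer and fold it with multiplication. Assumed to live in the same kernel
// namespace as the hunk above, so no extra includes are shown.
template <class A, class T,
          class = typename std::enable_if<std::is_integral<T>::value, void>::type>
XSIMD_INLINE T hmul(batch<T, A> const& self, requires_arch<common>) noexcept
{
    alignas(A::alignment()) T buffer[batch<T, A>::size];
    self.store_aligned(buffer);
    T res = buffer[0];
    for (std::size_t i = 1; i < batch<T, A>::size; ++i)
        res *= buffer[i];
    return res;
}
```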
Contributor
Shouldn't this just be the generic implementation of reduce_mul? And if so, why not use the current generic reduction implementation?

Contributor Author (@emrys53)

This function is only used for integral types with sizeof(T) < 4. The reason I implemented it this way is best seen with reduce_mul for uint8_t on an AVX machine: the batch has 32 elements, so the general reduce function multiplies and swizzles the batch 5 times (32->16->8->4->2->1) and finally calls self.get(0), which is itself too inefficient anyway (issue #1133). So I thought this approach would be more efficient. For uint8_t, my implementation first splits the AVX batch into two equal sse4_2 batches, multiplies them together, and then calls hmul.
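To make the pattern concrete, here is a self-contained hedged sketch using raw AVX2 intrinsics on int32_t lanes; the PR applies the analogous split-then-hmul idea to uint8_t through xsimd's split helpers, and none of this is the PR's actual code:

```cpp
// Hedged illustration of "split the 256-bit batch into two 128-bit halves,
// multiply the halves, then reduce the remaining 128-bit batch".
#include <immintrin.h>
#include <cstdint>

static inline int32_t reduce_mul_avx2_i32(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);        // lower 4 lanes
    __m128i hi = _mm256_extracti128_si256(v, 1);   // upper 4 lanes
    __m128i p  = _mm_mullo_epi32(lo, hi);          // pairwise products, 4 lanes left
    // Horizontal multiply of the remaining 4 lanes in two shuffle+mul steps.
    __m128i sh = _mm_shuffle_epi32(p, _MM_SHUFFLE(1, 0, 3, 2));
    p = _mm_mullo_epi32(p, sh);
    sh = _mm_shuffle_epi32(p, _MM_SHUFFLE(2, 3, 0, 1));
    p = _mm_mullo_epi32(p, sh);
    return _mm_cvtsi128_si32(p);                   // lane 0 holds the product
}
```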

Contributor Author (@emrys53) commented Jul 1, 2025

Also, the swizzle function for uint8_t just stores the batch to a buffer and loads the result back. Using that 5 times (on AVX) is far less efficient than the split_avx function, which splits the batch with SIMD intrinsics.

Contributor Author (@emrys53)

@serge-sans-paille you haven't responded yet. If you are open to the idea of a reduce_mul implementation, I can fix the code so that it passes all the tests. If you have feedback or recommendations on the code I have written, you can post it here and I will address it after we discuss.
