Improve [SD]SYEVD performance by parallelizing [SD]LAED3 #5355
This pull request introduces a parallelized version of the [SD]LAED3 routine, a key component of the [SD]SYEVD eigensolver for symmetric matrices. OpenBLAS replaces certain LAPACK routines with custom-parallelized versions, and this PR aligns with that strategy.
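The [SD]SYEVD routine consists of three main steps:
1. [SD]SYTRD: reduction of the symmetric matrix to tridiagonal form
2. [SD]STEDC: computation of the eigenvalues and eigenvectors of the tridiagonal matrix by divide and conquer
3. [SD]ORMTR: back-transformation of the eigenvectors to those of the original matrix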
While PR #5221 improved [SD]SYTRD performance on arm64 by adding tuned [SD]SYMV kernels, this PR focuses on the second step, [SD]STEDC, by parallelizing the internal [SD]LAED3 routine.
Note that [SD]STEDC scales worse with thread count than [SD]SYTRD and [SD]ORMTR. As a result, its share of the total [SD]SYEVD execution time grows as threads are added, as shown in the following graph.
The parallel [SD]LAED3 reduces the execution time of [SD]STEDC by approximately half in multi-threaded environments. This leads to an overall [SD]SYEVD performance improvement of 1.3x to 1.8x.
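As a rough sanity check on how halving [SD]STEDC time translates into an overall [SD]SYEVD speedup, the relationship can be sketched with Amdahl's law. This assumes that only the [SD]STEDC portion changes and everything else stays the same. The function name `overall_speedup` and the fractions used below are illustrative assumptions, not measurements from this PR.

```c
#include <assert.h>

/* Overall speedup when a fraction `f` of the runtime is accelerated by a
 * factor `s` and the remaining (1 - f) is unchanged (Amdahl's law).
 * Here `f` would be the share of [SD]SYEVD time spent in [SD]STEDC and
 * `s` the speedup of [SD]STEDC itself (about 2 if its time is halved). */
static double overall_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

Under this model, an [SD]STEDC share of 50% with a 2x speedup gives about 1.33x overall, and a share of 90% gives about 1.82x. That would be consistent with the reported 1.3x to 1.8x range if [SD]STEDC dominates more strongly at higher thread counts, but the actual shares should be read from the measurements.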
I understand that improvements at the LAPACK level are relatively rare in OpenBLAS. Therefore, the parallel [SD]LAED3 implementation has been carefully designed to minimize impact on the library’s structure and to adhere to OpenBLAS’s existing thread management. The parallelization is achieved by setting the necessary parameters in the 'blas_queue_t' structure and calling 'exec_blas(num_cpu, queue)'.
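The queue-based pattern described above can be illustrated with a minimal, self-contained sketch. It does not use OpenBLAS's internal headers: `mock_queue_t`, `fill_queue`, `mock_exec`, and `mark_columns` are hypothetical stand-ins, and the split by contiguous column ranges is an assumption of this sketch, not a description of the partitioning actually used in the PR. The real 'blas_queue_t' has more fields, and the real 'exec_blas' dispatches entries to OpenBLAS's worker threads instead of running them serially.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 64

/* Hypothetical stand-in for OpenBLAS's blas_queue_t (subset of ideas only). */
typedef struct mock_queue {
    void (*routine)(void *args, long begin, long end); /* per-chunk kernel */
    void *args;                 /* shared arguments for all chunks */
    long range[2];              /* half-open column range [begin, end) */
    struct mock_queue *next;    /* next entry; NULL terminates the list */
} mock_queue_t;

/* Split columns [0, k) into at most num_cpu contiguous chunks and fill one
 * queue entry per chunk. Returns the number of entries actually used. */
static int fill_queue(mock_queue_t *queue, int num_cpu, long k,
                      void (*routine)(void *, long, long), void *args)
{
    assert(num_cpu > 0 && num_cpu <= MAX_WORKERS && k >= 0);
    int used = 0;
    long begin = 0;
    for (int i = 0; i < num_cpu && begin < k; i++) {
        long workers_left = num_cpu - i;
        /* Ceiling division keeps chunk sizes within one column of each other. */
        long width = (k - begin + workers_left - 1) / workers_left;
        queue[used].routine = routine;
        queue[used].args = args;
        queue[used].range[0] = begin;
        queue[used].range[1] = begin + width;
        queue[used].next = NULL;
        if (used > 0)
            queue[used - 1].next = &queue[used];
        begin += width;
        used++;
    }
    return used;
}

/* Serial stand-in for exec_blas(num_cpu, queue): runs each entry in turn. */
static void mock_exec(int used, mock_queue_t *queue)
{
    if (used == 0)
        return;
    for (mock_queue_t *q = queue; q != NULL; q = q->next)
        q->routine(q->args, q->range[0], q->range[1]);
}

/* Example kernel: counts how often each column in its range is visited. */
static void mark_columns(void *args, long begin, long end)
{
    int *seen = (int *)args;
    for (long j = begin; j < end; j++)
        seen[j] += 1;
}
```

Calling `fill_queue(queue, 4, 10, mark_columns, seen)` followed by `mock_exec(used, queue)` visits each of the 10 columns exactly once, in chunks of 3, 3, 2, and 2. Keeping the partitioning separate from the execution step mirrors the idea of preparing the queue first and then handing it to the library's existing thread management.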