@lhycms lhycms commented Aug 23, 2025

Background

In elementwise_add_f16x8_pack_kernel, each thread processes 8 half elements at once. However, when the input length N is not divisible by 8, the thread that handles the final, partial group of elements may access memory out of bounds.

Changes

Added a tail-case handling branch to safely compute the remaining elements one by one:

} else {
    for (int i = 0; nx + i < N; ++i) {
        d_c[nx + i] = __hadd(d_a[nx + i], d_b[nx + i]);
    }
}
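
For context, the tail branch might sit inside the kernel roughly as follows. This is a minimal sketch, not the repository's code: the kernel name `add_f16x8_with_tail`, the host helper `add_f16_host`, the `nx + 7 < N` guard on the vectorized path, and the launch configuration are all assumptions made for illustration; only the `else` loop comes from this PR. It also assumes the buffers are 16-byte aligned (as `cudaMalloc` allocations are) and a GPU with native half arithmetic (compute capability 5.3 or later). Error checking is omitted for brevity.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread owns 8 consecutive half elements.
__global__ void add_f16x8_with_tail(const half *d_a, const half *d_b,
                                    half *d_c, int N) {
  // Index of the first element owned by this thread.
  int nx = 8 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (nx + 7 < N) {
    // Full group: one 128-bit load per input, four half2 adds,
    // one 128-bit store. Assumes d_a, d_b, d_c are 16-byte aligned.
    float4 va = reinterpret_cast<const float4 *>(d_a + nx)[0];
    float4 vb = reinterpret_cast<const float4 *>(d_b + nx)[0];
    float4 vc;
    const half2 *ha = reinterpret_cast<const half2 *>(&va);
    const half2 *hb = reinterpret_cast<const half2 *>(&vb);
    half2 *hc = reinterpret_cast<half2 *>(&vc);
#pragma unroll
    for (int i = 0; i < 4; ++i) {
      hc[i] = __hadd2(ha[i], hb[i]);
    }
    reinterpret_cast<float4 *>(d_c + nx)[0] = vc;
  } else {
    // Partial group (tail branch from this PR): scalar adds, stopping at N.
    // Threads with nx >= N fall through without touching memory.
    for (int i = 0; nx + i < N; ++i) {
      d_c[nx + i] = __hadd(d_a[nx + i], d_b[nx + i]);
    }
  }
}

// Hypothetical host helper: adds two float vectors in half precision on the GPU.
std::vector<float> add_f16_host(const std::vector<float> &a,
                                const std::vector<float> &b) {
  int N = static_cast<int>(a.size());
  std::vector<float> out(N);
  if (N == 0) return out;  // a zero-block launch would be invalid

  std::vector<half> h_a(N), h_b(N), h_c(N);
  for (int i = 0; i < N; ++i) {
    h_a[i] = __float2half(a[i]);
    h_b[i] = __float2half(b[i]);
  }

  size_t bytes = static_cast<size_t>(N) * sizeof(half);
  half *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMalloc(&d_c, bytes);
  cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

  // Round the group count up so the partial group also gets a thread;
  // otherwise the tail branch would never run.
  int threads = 256;
  int groups = (N + 7) / 8;
  int blocks = (groups + threads - 1) / threads;
  add_f16x8_with_tail<<<blocks, threads>>>(d_a, d_b, d_c, N);

  cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);

  for (int i = 0; i < N; ++i) {
    out[i] = __half2float(h_c[i]);
  }
  return out;
}
```

With N = 13, for instance, thread 0 would take the vectorized path for elements 0 to 7 and thread 1 would take the tail loop for elements 8 to 12. Only the one thread that straddles N runs the scalar loop, so full groups are unaffected.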

Impact

  1. Prevents potential out-of-bounds access when N % 8 != 0.
  2. Keeps vectorized performance intact when N is a multiple of 8.

@lhycms lhycms requested a review from DefTruth August 24, 2025 12:41
@DefTruth DefTruth left a comment

LGTM

@DefTruth DefTruth merged commit 6d88448 into xlite-dev:main Aug 25, 2025