[PHI] Optimize Gather kernel with vectorization #72238

Closed


lshpku
Contributor

@lshpku lshpku commented Apr 14, 2025

PR Category

Performance Optimization

PR Types

Performance

Description

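Optimize the performance of GatherGPUKernel with vectorization, and merge the two existing Gather implementations into one.

Note: the two original implementations handled gathering along a high (outer) dimension and along a low (inner) dimension separately. I found the split unnecessary and merged them into one, but both calling interfaces are kept, because quite a few other kernels still depend on the deprecated one.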
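Why a single implementation can serve every axis can be illustrated with a CPU reference. The sketch below is not the PR's kernel; `GatherReference` and its parameters are hypothetical names, and it assumes a contiguous row-major input. It views the input as `[outer, axis_size, inner]`, so gathering along an outer axis simply means `outer == 1` and gathering along the innermost axis means `inner == 1`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference, not the PR's GatherGPUKernel.
// Input is viewed as [outer, axis_size, inner]; output as
// [outer, index.size(), inner]. One routine covers any axis.
template <typename T>
std::vector<T> GatherReference(const std::vector<T>& input,
                               const std::vector<int64_t>& shape,
                               const std::vector<int64_t>& index,
                               int axis) {
  assert(axis >= 0 && axis < static_cast<int>(shape.size()));
  int64_t outer = 1;
  int64_t inner = 1;
  for (int i = 0; i < axis; ++i) outer *= shape[i];
  for (std::size_t i = static_cast<std::size_t>(axis) + 1; i < shape.size(); ++i) {
    inner *= shape[i];
  }
  const int64_t axis_size = shape[axis];
  const int64_t index_size = static_cast<int64_t>(index.size());
  assert(static_cast<int64_t>(input.size()) == outer * axis_size * inner);

  std::vector<T> output(static_cast<std::size_t>(outer * index_size * inner));
  for (int64_t o = 0; o < outer; ++o) {
    for (int64_t k = 0; k < index_size; ++k) {
      assert(index[k] >= 0 && index[k] < axis_size);
      // Each gathered slice is a contiguous run of `inner` elements.
      const T* src = input.data() + (o * axis_size + index[k]) * inner;
      T* dst = output.data() + (o * index_size + k) * inner;
      for (int64_t j = 0; j < inner; ++j) dst[j] = src[j];
    }
  }
  return output;
}
```

In this view the innermost copy loop runs over contiguous memory, which is the part a GPU kernel could plausibly vectorize when `inner` is large enough.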

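Performance test

A100, float16, assuming the length of index equals shape[axis]; times are in µs.

| shape | axis | Time before (µs) | Time after (µs) | Improvement | Notes |
|---|---|---|---|---|---|
| `[128, 1024*1024]` | 0 | 1,466 | 459 | 219.6% | 2D, high dimension, 4x vectorizable |
| `[128, 1024*1024+2]` | 0 | 1,472 | 834 | 76.5% | 2D, high dimension, 2x vectorizable |
| `[128, 1024*1024+1]` | 0 | 1,471 | 1,394 | 5.5% | 2D, high dimension, not vectorizable |
| `[262144, 256]` | 1 | 793 | 694 | 14.2% | 2D, low dimension, not vectorizable |
| `[16384, 4096]` | 1 | 884 | 720 | 22.7% | 2D, low dimension, not vectorizable |
| `[4096, 16384]` | 1 | 1,151 | 963 | 19.6% | 2D, low dimension, not vectorizable |
| `[128, 1024, 1024]` | 1 | 1,700 | 478 | 255.5% | 3D, middle dimension, 4x vectorizable |
| `[128, 1024, 1024+2]` | 1 | 1,747 | 871 | 100.6% | 3D, middle dimension, 2x vectorizable |
| `[128, 1024, 1024+1]` | 1 | 1,695 | 1,409 | 20.3% | 3D, middle dimension, not vectorizable |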

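The results show that this PR mainly brings large gains in the vectorizable cases. The non-vectorizable cases also improve slightly, because the index computation was optimized and the loop count was increased.

In addition, coverage tests over on the order of a thousand shapes were run, and float32 performance was checked for some of those shapes; no problems were found.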
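Which cases count as 4x, 2x, or not vectorizable can be sketched as follows. This is an illustration under a stated assumption, not the PR's code: the hypothetical helper `ChooseVecSize` picks the widest width that evenly divides the contiguous inner size (the product of the dimensions after `axis`). A real kernel would also have to account for pointer alignment and the element type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not taken from the PR.
// Returns the widest vector width (4, 2 or 1 elements) that evenly divides
// the contiguous inner size, i.e. the product of the dims after `axis`.
// Assumption: divisibility is the only criterion considered here.
inline int ChooseVecSize(const std::vector<int64_t>& shape, int axis) {
  assert(axis >= 0 && axis < static_cast<int>(shape.size()));
  int64_t inner = 1;
  for (std::size_t i = static_cast<std::size_t>(axis) + 1; i < shape.size(); ++i) {
    inner *= shape[i];
  }
  if (inner % 4 == 0) return 4;
  if (inner % 2 == 0) return 2;
  return 1;  // odd inner size, or inner == 1 when axis is the last dim
}
```

Under this assumption, `[128, 1024*1024]` with axis 0 yields 4, `[128, 1024*1024+2]` yields 2, and `[128, 1024*1024+1]` yields 1. Any 2D shape gathered along axis 1 has an inner size of 1 and so also yields 1, which is consistent with the benchmark rows marked as not vectorizable.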


Pcard-85711


paddle-bot bot commented Apr 14, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.


CLAassistant commented Apr 14, 2025

CLA assistant check
All committers have signed the CLA.

@lshpku lshpku force-pushed the vectorize-gather-kernel-test branch from 05559d4 to 0498fa7 April 14, 2025 05:36
@PaddlePaddle PaddlePaddle deleted a comment from CLAassistant Apr 14, 2025
@lshpku lshpku closed this Apr 14, 2025