
[GPU] modify GroupQueryAttention decompose logic to fit gpu unsqueeze_broadcast_reshape_sdpa_fusion pattern #31507



Open: wants to merge 2 commits into master from the gqa_multi_query_atten_fix_gpu branch


bopeng1234
Contributor

For multi-query (grouped-query) attention models such as Qwen2.5-1.5B, there are 2 KV heads and 12 query heads.

K and V therefore have to be broadcast from 2 to 12 heads before the SDPA computation.
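A minimal sketch of the head mapping this broadcast implements, assuming the usual grouped-query convention in which consecutive query heads share one KV head (query head h uses KV head h // (num_q_heads / num_kv_heads)). The function names are hypothetical, and each list entry stands in for one head's tensor.

```python
from typing import List, TypeVar

T = TypeVar("T")


def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    # Assumed convention: consecutive groups of `group_size` query heads share one KV head.
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def broadcast_kv_heads(kv: List[T], num_q_heads: int) -> List[T]:
    # `kv` is indexed by KV head; the returned list is indexed by query head.
    num_kv_heads = len(kv)
    return [kv[kv_head_for_query_head(h, num_q_heads, num_kv_heads)]
            for h in range(num_q_heads)]


# Head counts from the PR description: 12 query heads, 2 KV heads.
print([kv_head_for_query_head(h, 12, 2) for h in range(12)])
```

For the Qwen2.5-1.5B head counts this prints [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]: query heads 0 to 5 read KV head 0 and query heads 6 to 11 read KV head 1.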

The GPU plugin has a pattern in unsqueeze_broadcast_reshape_sdpa_fusion.cpp that fuses the KV broadcast into the SDPA kernel to improve performance.

This PR adjusts the GroupQueryAttention decomposition logic so that its output matches this pattern.
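A shape-level sketch of the Unsqueeze -> Broadcast -> Reshape chain that the fusion's name refers to. It assumes a [batch, heads, seq_len, head_size] layout and an unsqueeze at axis 2; the exact layout and axes matched by the GPU pattern are not stated in this PR, so treat them as assumptions. The helpers are hypothetical stand-ins for the operations, not OpenVINO APIs, and seq_len/head_size are illustrative values.

```python
import math
from typing import List, Tuple

Shape = Tuple[int, ...]


def unsqueeze(shape: Shape, axis: int) -> Shape:
    # Insert a size-1 dimension at `axis`.
    return shape[:axis] + (1,) + shape[axis:]


def broadcast(shape: Shape, target: Shape) -> Shape:
    # Equal-rank broadcast: every dim must match the target dim or be 1.
    if len(shape) != len(target):
        raise ValueError("rank mismatch")
    for src, dst in zip(shape, target):
        if src != dst and src != 1:
            raise ValueError(f"cannot broadcast {shape} to {target}")
    return target


def reshape(shape: Shape, target: Shape) -> Shape:
    # A reshape must preserve the total element count.
    if math.prod(shape) != math.prod(target):
        raise ValueError(f"cannot reshape {shape} to {target}")
    return target


def kv_broadcast_shapes(batch: int, num_kv_heads: int, num_q_heads: int,
                        seq_len: int, head_size: int) -> List[Shape]:
    # Assumed layout [batch, heads, seq_len, head_size]; not taken from the PR.
    group = num_q_heads // num_kv_heads
    kv = (batch, num_kv_heads, seq_len, head_size)
    unsq = unsqueeze(kv, 2)                                                    # [B, H_kv, 1, S, D]
    bcast = broadcast(unsq, (batch, num_kv_heads, group, seq_len, head_size))  # [B, H_kv, G, S, D]
    out = reshape(bcast, (batch, num_q_heads, seq_len, head_size))             # [B, H_q, S, D]
    return [kv, unsq, bcast, out]


for shape in kv_broadcast_shapes(batch=1, num_kv_heads=2, num_q_heads=12,
                                 seq_len=16, head_size=128):
    print(shape)
```

With these inputs it prints (1, 2, 16, 128), (1, 2, 1, 16, 128), (1, 2, 6, 16, 128) and (1, 12, 16, 128), one per line. Because the broadcast axis only repeats existing data, a fused kernel can in principle index the 2 original KV heads directly instead of materializing 12 copies, which is presumably where the performance gain comes from.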

@bopeng1234 bopeng1234 requested a review from a team as a code owner July 29, 2025 01:44
@bopeng1234 bopeng1234 requested review from itikhono and removed request for a team July 29, 2025 01:44
@github-actions github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Jul 29, 2025
@sys-openvino-ci sys-openvino-ci added the ExternalIntelPR External contributor from Intel label Jul 29, 2025
@bopeng1234 bopeng1234 force-pushed the gqa_multi_query_atten_fix_gpu branch from ecbe728 to be4e135 Compare July 30, 2025 01:46
@mryzhov mryzhov requested a review from CuriousPanCake July 30, 2025 07:44
Member

@rkazants rkazants left a comment


Please share the CVS JIRA ticket number.
Also, please add tests for the new pattern; I think this change could affect other models that match the existing patterns.

@bopeng1234
Contributor Author

Added the JIRA ticket: https://jira.devtools.intel.com/browse/CVS-171437
I will add tests soon.

@bopeng1234
Contributor Author

Hi @rkazants, I think there is an existing GQA test with 2 query heads and 1 KV head.

This PR only changes how K and V are broadcast to the number of query heads, and the existing test case covers and verifies this change, so I don't think we need to add a duplicate test case.
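A minimal, hypothetical sketch of the equivalence such a test relies on: the Unsqueeze -> Broadcast -> Reshape chain must give query head h the data of KV head h // (num_q_heads / num_kv_heads). It uses plain nested lists in an assumed [batch, heads, seq_len, head_size] layout; the helper names are illustrative and are neither OpenVINO APIs nor the existing test. The (12, 2) configuration is checked alongside the (2, 1) test configuration because with a single KV head every query head reads the same data, so head ordering is only exercised when there are at least two KV heads.

```python
def make_kv(batch, num_kv_heads, seq_len, head_size):
    # Distinct value per element so that any head mix-up changes the result.
    return [[[[(b, h, s, d) for d in range(head_size)] for s in range(seq_len)]
             for h in range(num_kv_heads)] for b in range(batch)]


def unsqueeze_broadcast_reshape(kv, num_q_heads):
    num_kv_heads = len(kv[0])
    group = num_q_heads // num_kv_heads
    # Unsqueeze at axis 2: [B][H_kv][S][D] -> [B][H_kv][1][S][D]
    unsq = [[[head] for head in per_batch] for per_batch in kv]
    # Broadcast axis 2 from 1 to group: -> [B][H_kv][G][S][D]
    bcast = [[[head[0]] * group for head in per_batch] for per_batch in unsq]
    # Reshape: merge axes 1 and 2 in row-major order -> [B][H_kv * G][S][D]
    return [[copy for head in per_batch for copy in head] for per_batch in bcast]


def reference(kv, num_q_heads):
    # Grouped-query reference: query head h reads KV head h // group.
    group = num_q_heads // len(kv[0])
    return [[per_batch[h // group] for h in range(num_q_heads)] for per_batch in kv]


for num_q_heads, num_kv_heads in [(2, 1), (12, 2)]:
    kv = make_kv(batch=2, num_kv_heads=num_kv_heads, seq_len=3, head_size=4)
    assert unsqueeze_broadcast_reshape(kv, num_q_heads) == reference(kv, num_q_heads)
print("ok")
```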

@sgbihu
Contributor

sgbihu commented Aug 7, 2025

build_jenkins

5 participants