-
Notifications
You must be signed in to change notification settings - Fork 74
[BACKEND] Enhance the remove layout implementation to reduce the duplicated values with different layout in scf.for. #4527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
486ed4a
to
f42bd66
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the remove layout implementation to better handle layout propagation across scf.for
operations, addressing limitations that create performance bottlenecks on Intel GPU. The changes focus on reducing duplicated layout conversion operations by improving support for multi-result operations and nested basic blocks.
- Adds support for propagating layouts through
scf.for
operations with a newincludeForOp
parameter - Refactors
mappedValues
to handle multiple attribute mappings per value instead of single mappings - Includes debug output and unreachable code handling for
scf.for
operations
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
Utility.h | Adds includeForOp parameter to getConvertBackwardSlice function signature |
Utility.cpp | Implements scf.for layout propagation logic with early return check and debug output |
RemoveLayoutConversions.cpp | Updates data structures to support multiple encodings per value and enables scf.for processing |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
Outdated
Show resolved
Hide resolved
f42bd66
to
7a8f858
Compare
The flex attn backward ttgir has been simplified by these changes. There are only two root tiling layout of the dpas and the transpose of dot of dpas. Another major in-efficient issue on Xe-Xe3 is that the regular pointer under different layout like:
The simplified ttgir
|
7a8f858
to
5828b55
Compare
I checkout this branch and run the benchmark code in flex attn bwd
But the TTGI I get is not the same as the one you mentioned, the IR around the Snippet from TTGIR
The IR still contains the convert_layout operations and there are other ops (e.g. ttg.local_alloc) which do not appear for you. How did you compile the benchmark? Does this PR contains all of your code ? |
5828b55
to
d85a2ba
Compare
@etiotto Added the missed changes for debug the backward. |
…d values with different layout in scf.for. Signed-off-by: Lu,Chengjun <chengjun.lu@intel.com>
Signed-off-by: Lu,Chengjun <chengjun.lu@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
Outdated
Show resolved
Hide resolved
third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
Outdated
Show resolved
Hide resolved
…ntel/intel-xpu-backend-for-triton into chengjun/enhance_remove_layout
…ntel/intel-xpu-backend-for-triton into chengjun/enhance_remove_layout
Signed-off-by: Ettore Tiotto <ettore.tiotto@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
Outdated
Show resolved
Hide resolved
LogicalResult result = getRematerializableSlice(convertOp.getSrcMutable(), | ||
targetType.getEncoding(), | ||
slice, layout, nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Adding a nullptr parameter without documentation or context makes the API unclear. Consider adding a comment explaining what this parameter represents or using a named constant instead of nullptr.
LogicalResult result = getRematerializableSlice(convertOp.getSrcMutable(), | |
targetType.getEncoding(), | |
slice, layout, nullptr); | |
// No filter function is provided, so we pass kNoRematerializationFilter (nullptr). | |
static constexpr auto kNoRematerializationFilter = nullptr; | |
LogicalResult result = getRematerializableSlice(convertOp.getSrcMutable(), | |
targetType.getEncoding(), | |
slice, layout, kNoRematerializationFilter); |
Copilot uses AI. Check for mistakes.
The layout propagation across the
scf.for
op in RemoveLayout is not implemented well for these aspects:scf.for
ops.With the limitations, the
scf.for
operation is the bottle neck of the efficient after the remove layout pass.This is not issue on NV GPU because the NV GPU convert the layout convert operations to async.cp in software pipeline.
But it is an issue for Intel GPU. We rely on the remove layout to get a simple program with less convert layout operations.
Plan to enhance the remove layout to enhance the limitations of the remove layout.
scf.for
ops on demand.This is an PR for CI.