Description
The layout propagation across the scf.for op in RemoveLayout is not well implemented in the following respects:
- There is no cost-model analysis of using different layouts for the operations (that is, choosing different tiling patterns for Triton ops). It relies only on the anchors, in an ad hoc way.
- It does not handle ops with multiple results well.
- It does not handle ops with nested basic blocks well.
- The remove layout pass does not support propagating the layout through scf.for ops.
Because of these limitations, the scf.for operation is the efficiency bottleneck after the remove layout pass.
This is not an issue on NV GPU, because the NV GPU backend converts the layout conversion operations to async.cp in the software pipeline.
But it is an issue for Intel GPU. We rely on the remove layout pass to get a simpler program with fewer convert layout operations.
The plan is to enhance the remove layout pass to address these limitations:
- Refactor the implementation of remove layout so that it handles ops with multiple results and nested basic blocks well.
- Support propagating the layout through scf.for ops on demand.
- Add a cost model analysis pass to estimate the costs of the different tiling patterns across the kernel program.
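To make the cost model idea concrete, here is a toy sketch in Python. It is not the pass itself and does not use any Triton or MLIR API: the `Candidate` fields, the cost numbers, and the function names are all hypothetical, chosen only to show why a layout conversion inside an scf.for body costs more than one hoisted outside it once the trip count is taken into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothetical layout assignment for a kernel that contains a loop."""
    name: str
    converts_outside_loop: int    # layout conversions executed once
    converts_inside_loop: int     # layout conversions executed every iteration
    compute_cost_per_iter: float  # relative cost of the loop body under this layout


def estimated_cost(c, trip_count, convert_cost=1.0):
    """Estimate total cost: one-time conversions plus per-iteration work."""
    once = c.converts_outside_loop * convert_cost
    per_iter = c.converts_inside_loop * convert_cost + c.compute_cost_per_iter
    return once + trip_count * per_iter


def pick_layout(candidates, trip_count, convert_cost=1.0):
    """Return the candidate with the lowest estimated cost."""
    return min(candidates, key=lambda c: estimated_cost(c, trip_count, convert_cost))


# Illustrative numbers only (assumptions, not measurements).
candidates = [
    # Keep the anchor layouts and convert inside the loop body.
    Candidate("anchor_only", converts_outside_loop=0,
              converts_inside_loop=2, compute_cost_per_iter=4.0),
    # Propagate the layout through the loop-carried values and convert once outside.
    Candidate("propagate_through_loop", converts_outside_loop=2,
              converts_inside_loop=0, compute_cost_per_iter=4.5),
]

best = pick_layout(candidates, trip_count=64)
print(best.name)
```

With these made-up numbers, propagating the layout through the loop wins for a long-running loop, while a loop with a single iteration would favor the anchor-only choice. That trip-count dependence is why the decision is better made by an analysis than by fixed anchors.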
This is a PR for CI.