[BACKEND] Enhance the remove layout for Intel GPU

Layout propagation in the RemoveLayout pass, in particular across the scf.for op, falls short in these respects:

- There is no cost-model analysis of using different layouts for the operations (that is, of choosing different tiling patterns for Triton ops). The pass relies only on the anchors, in an ad hoc way.
- Ops with multiple results are not handled well.
- Ops with nested basic blocks are not handled well.
- The pass does not support propagating the layout through scf.for ops.

With these limitations, the scf.for operation is the efficiency bottleneck after the remove layout pass.
This is not an issue on NV GPU, because there the convert layout operations are converted to async.cp in the software pipeline.

But it is an issue for Intel GPU. We rely on the remove layout pass to get a simpler program with fewer convert layout operations.

The plan is to enhance the remove layout pass to address these limitations:
- Refactor the implementation of remove layout so that it handles ops with multiple results and nested basic blocks well.
- Support propagating the layout through scf.for ops on demand.
- Add a cost-model analysis pass to estimate the costs of the different tiling patterns across the kernel program.

This is a PR for CI.
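
To make the cost-model idea concrete, here is a toy sketch in Python, not the actual pass. It assumes a loop body of three chained ops, two illustrative layout names, a fixed cost per convert layout operation, and made-up per-op costs. Every name and number in it is hypothetical. It only shows why a convert inside an scf.for body is paid once per iteration, while a convert hoisted in front of the loop is paid once.

```python
from itertools import product

# All names and numbers below are hypothetical, for illustration only.
LAYOUTS = ("blocked", "dpas")    # illustrative layout labels
CONVERT_COST = 4                 # assumed cost of one convert layout op

# Assumed cost of each of three chained loop-body ops in each layout.
BODY = [
    {"blocked": 1, "dpas": 3},
    {"blocked": 10, "dpas": 2},
    {"blocked": 1, "dpas": 1},   # its result is yielded to the next iteration
]


def body_cost(carried, assignment):
    """Cost of one iteration: op costs plus one convert per layout change."""
    cost = sum(op[layout] for op, layout in zip(BODY, assignment))
    # The yielded value must come back in the loop-carried layout.
    chain = (carried, *assignment, carried)
    cost += CONVERT_COST * sum(a != b for a, b in zip(chain, chain[1:]))
    return cost


def best_plan(init_layout, trip_count):
    """Brute-force the cheapest loop-carried layout and per-op assignment."""
    best = None
    for carried in LAYOUTS:
        # A convert in front of the loop is paid once, not per iteration.
        hoisted = CONVERT_COST * (carried != init_layout)
        for assignment in product(LAYOUTS, repeat=len(BODY)):
            total = hoisted + trip_count * body_cost(carried, assignment)
            if best is None or total < best[0]:
                best = (total, carried, assignment)
    return best


print(best_plan("blocked", trip_count=8))
```

With these assumed numbers, keeping the loop-carried value in the initial layout costs 12 per iteration, while converting once before the loop and running the whole body in the other layout costs 4 up front plus 6 per iteration. A real analysis would need measured or modeled costs per tiling pattern and would not enumerate assignments by brute force.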

[BACKEND] Enhance the remove layout for Intel GPU #4528

Sub-issues

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BACKEND] Enhance the remove layout for Intel GPU #4528

Description

Sub-issues

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions