Merge OpenAI Triton commit bea27e3
#5261
Merged
Conversation
Given that we already encode the kDim in `instrShape`, we don't need to select it again when lowering mfma; we can directly get the instruction by matching all parameters.
The target of this PR is to add Gluon and Triton support for scaled WMMA mxfp4 on gfx1250.
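For a sense of what the Triton-side surface looks like, here is a rough sketch of a kernel driving a scaled mxfp4 dot through `tl.dot_scaled`. The exact signature, the e2m1 packing (two 4-bit values per byte), and the per-32-element e8m0 scale layout are assumptions for illustration, not details taken from this PR.

```python
import triton
import triton.language as tl

# Illustrative sketch only; tile sizes and format strings are assumptions.
@triton.jit
def mxfp4_dot_kernel(a_ptr, a_scale_ptr, b_ptr, c_ptr,
                     M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    # A is mxfp4 (e2m1): two 4-bit values packed per uint8, so K // 2 bytes per row.
    a = tl.load(a_ptr + offs_m[:, None] * (K // 2) + tl.arange(0, K // 2)[None, :])
    # One e8m0 scale (stored as uint8) per group of 32 elements along K.
    a_scale = tl.load(a_scale_ptr + offs_m[:, None] * (K // 32) + tl.arange(0, K // 32)[None, :])
    # Keep B in bf16 for simplicity; no scale on the rhs operand.
    b = tl.load(b_ptr + tl.arange(0, K)[:, None] * N + offs_n[None, :])
    c = tl.dot_scaled(a, a_scale, "e2m1", b, None, "bf16")
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)
```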
Membar only tracks accesses to shared memory, so all it needs is a fence on the shared memory address space and a thread barrier. Before this change a `gpu.barrier` op was emitted, which implies much stronger guarantees on synchronization of all address spaces, not just shared memory. The new LocalBarrierOp targets only accesses between shared memory and registers, allowing for better code generation. Added lowering for this new op:
- Common code lowers it to the current GPU BarrierOp
- AMD lowers it to lgkmcnt(0) + s_barrier
`wave_per_eu` indicates how many waves are supposed to execute per EU (SIMD) simultaneously. Value 0 means the compiler decides how many waves are assigned to each SIMD; value 1 indicates only 1 wave per SIMD. Value 1 will hurt small kernels like `vector_add`, since there can be far more than 1 wave executing simultaneously per SIMD. So, by default, it makes more sense for the compiler to decide the number of waves per EU unless the user specifies it. The default value of 1 actually affects the perf numbers of the tutorial example `01-vector-add.py`: memory bandwidth regressed to 3.3 TB/s with `wave_per_eu=1`, and it increases to 5.3 TB/s with `wave_per_eu=0`. This PR changes the default value from 1 to 0.
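To make the knob concrete, below is a sketch of how a vector-add launch could set it explicitly. The launch-time keyword is assumed here to be `waves_per_eu` (the plural spelling used on the AMD backend), and the surrounding code is illustrative rather than taken from the `01-vector-add.py` tutorial.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(2**25, device="cuda")  # "cuda" also targets ROCm devices in PyTorch
y = torch.rand_like(x)
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
# waves_per_eu=0 lets the compiler pick the occupancy target (the new default
# described above); passing 1 would pin it to one wave per SIMD.
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024, waves_per_eu=0)
```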
Fixes #8229. Background: we run the Gluon inliner prior to auto layout propagation to enable returning auto layout from a function and having different calls of the function resolve to different layouts. However, the inliner calls Gluon canonicalize, and the `GreedyPatternRewriter` defaults to CSEing constants. This means that two distinct constants which could otherwise resolve to different layouts may be CSEd into a single constant and create a new conflict. I fix this by changing the inliner to do even less canonicalization, only simplifying control flow operations, and then add a canonicalization pass after auto layout resolution to make up for it.
…rs (#8302) This change, combined with the previous PR #7939, converts memory accesses of small tensors into buffer ops *WITHOUT* analyzing their offsets' value range.
* Some context
  - We informally call a tensor a "small tensor" if it is no more than 2 GB in size.
  - A jit function is specialized to differentiate small tensors from larger ones; a function formal argument is tagged with tt.pointer_range=32 if it binds to a small tensor.
* This change
  - The contribution of PR #7939 is to reveal the base pointer; this PR unconditionally performs the conversion.
  - It side-steps the defect/limitation of offset-range analysis, and it is safe.
* TODO
  - If the offset of a mem-op on a small tensor is a 64-bit quantity, we can cast the offset to 32-bit and then convert it.
* Option
  - Pass option: `analyzeSmallTensorOfst={false|true}`, false by default, meaning that when coming across a mem-op on a small tensor, there is no need to analyze its offset's value range.

Co-authored-by: Shuxin Yang <Shuxin.Yang@gmail.com>
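As a concrete illustration of the small-tensor case (a sketch under the description above, not code from this change): both tensors below are far smaller than 2 GB, so per that description their pointer arguments would be tagged with tt.pointer_range=32 and the load/store become candidates for buffer-op conversion without any offset-range analysis.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)

# ~4 MB tensors: well under the 2 GB "small tensor" threshold, so their
# pointer arguments are specialization candidates for tt.pointer_range=32.
x = torch.rand(1 << 20, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
scale_kernel[grid](x, out, x.numel(), BLOCK_SIZE=1024)
```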
Force-pushed from ecbe70b to e5c68c4
Force-pushed from e5c68c4 to fbf33e4
Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
Force-pushed from fbf33e4 to 98431a9
etiotto approved these changes on Oct 7, 2025
This PR changes the Triton base from 51021fb to bea27e3 (Oct 1).
Pass rate: 92.75% -> 94.2%
Please do not squash and merge this PR.