
Conversation

whitneywhtsang
Contributor

@whitneywhtsang commented Oct 6, 2025

This PR changes the Triton base from 51021fb to bea27e3 (Oct 1).
Pass rate: 92.75% -> 94.2%

Please do not squash and merge this PR.

borontion and others added 7 commits September 30, 2025 09:28
Since the kDim is already encoded in `instrShape`, we don't need to
select it again when lowering mfma; we can get the instruction directly
by matching all parameters.
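A minimal sketch of the idea, assuming a table keyed on the full instruction shape; the table contents and helper names below are illustrative, not Triton's actual MFMA selection code:

```python
# Illustrative only: with kDim folded into instrShape, selection becomes a
# single lookup keyed on all parameters rather than a second kDim-based pass.
# MFMA_TABLE and select_mfma are hypothetical names.
MFMA_TABLE = {
    # (M, N, K, element type) -> MFMA instruction
    (32, 32, 8, "f16"): "mfma_f32_32x32x8f16",
    (16, 16, 16, "f16"): "mfma_f32_16x16x16f16",
}

def select_mfma(instr_shape, elem_type):
    m, n, k = instr_shape  # kDim already lives in instrShape
    return MFMA_TABLE[(m, n, k, elem_type)]
```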
The target of this PR is to add gluon and Triton support for scaled WMMA
mxfp4 on gfx1250.
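As a rough sketch of what such support enables at the Triton level (this is not the PR's own test; the shapes, the uint8 pointer arguments, and the packed e2m1 convention are assumptions), a scaled-dot kernel might look like:

```python
import triton
import triton.language as tl

@triton.jit
def mxfp4_dot(a_ptr, a_scale_ptr, b_ptr, b_scale_ptr, c_ptr,
              M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    # e2m1 (fp4) values are packed two per byte, so A is M x K//2 uint8.
    a = tl.load(a_ptr + tl.arange(0, M)[:, None] * (K // 2)
                      + tl.arange(0, K // 2)[None, :])
    b = tl.load(b_ptr + tl.arange(0, K // 2)[:, None] * N
                      + tl.arange(0, N)[None, :])
    # mxfp4 carries one e8m0 scale per 32 elements along K.
    a_scale = tl.load(a_scale_ptr + tl.arange(0, M)[:, None] * (K // 32)
                                  + tl.arange(0, K // 32)[None, :])
    b_scale = tl.load(b_scale_ptr + tl.arange(0, N)[:, None] * (K // 32)
                                  + tl.arange(0, K // 32)[None, :])
    c = tl.dot_scaled(a, a_scale, "e2m1", b, b_scale, "e2m1")
    tl.store(c_ptr + tl.arange(0, M)[:, None] * N
                   + tl.arange(0, N)[None, :], c)
```

On gfx1250 a kernel like this could then lower to the scaled WMMA instructions rather than being emulated.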
Membar only tracks accesses to shared memory, so all it needs is a fence
on the shared memory address space plus a thread barrier.
Before this change, a `gpu.barrier` op was emitted, which implies much
stronger guarantees: synchronization of all address spaces, not just
shared memory.

The new LocalBarrierOp targets only accesses between shared memory
and registers, allowing for better code generation.

Added lowering for this new op:
- Common code lowers it to the current GPU BarrierOp
- AMD lowers it to `s_waitcnt lgkmcnt(0)` + `s_barrier`
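A hypothetical sketch of the dispatch, in Python pseudocode rather than the actual MLIR rewrite patterns, just to make the two lowering paths concrete:

```python
# Hypothetical pseudocode; the real lowerings are MLIR conversion patterns.
def lower_local_barrier(target: str) -> list[str]:
    if target == "amdgcn":
        # Wait only for outstanding LDS (shared memory) operations,
        # then rendezvous the workgroup.
        return ["s_waitcnt lgkmcnt(0)", "s_barrier"]
    # Common path: fall back to the stronger gpu.barrier semantics.
    return ["gpu.barrier"]
```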
`wave_per_eu` indicates how many waves are supposed to execute per EU
(SIMD) simultaneously. Value 0 means the compiler decides how many waves
are assigned to each SIMD; value 1 means only 1 wave per SIMD. Value 1
hurts small kernels like `vector_add`, where far more than 1 wave can
execute simultaneously per SIMD. So, by default, it makes more sense to
let the compiler decide the number of waves per EU unless the user
specifies otherwise.

The default value of 1 visibly affects the perf numbers of the tutorial
example `01-vector-add.py`: memory bandwidth regresses to 3.3 TB/s with
`wave_per_eu=1` and recovers to 5.3 TB/s with `wave_per_eu=0`. This
PR changes the default value from 1 to 0.
Fixes #8229
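For illustration, here is a sketch of the tutorial's vector add with the occupancy hint passed at launch; upstream Triton's AMD backend spells this launch kwarg `waves_per_eu`, which is assumed below:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    offsets = tl.program_id(0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

n = 2**24
x = torch.rand(n, device="cuda")  # ROCm builds of PyTorch also use "cuda"
y = torch.rand(n, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(n, 1024),)
# 0 lets the compiler choose occupancy (the new default); 1 (the old
# default) capped how many waves could hide memory latency per SIMD.
add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024, waves_per_eu=0)
```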

The background is that we run the gluon inliner prior to auto layout
propagation, to enable returning auto layout from a function and having
different calls of the function resolve to different layouts.

However, the inliner calls gluon canonicalize and the
`GreedyPatternRewriter` defaults to CSEing constants. This means that
two distinct constants which could otherwise resolve to different
layouts may be CSEd into a single constant and create a new conflict.

I fix this by changing the inliner to do even less canonicalization,
only simplifying control flow operations, and then add a canonicalization
pass after auto layout resolution to make up for it.
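A self-contained toy illustration of the failure mode (none of this is gluon's actual API): two textually identical constants can each resolve to their own layout, but once CSE merges them, a single value must satisfy both uses:

```python
def resolve(layout_demands):
    """Layouts demanded by each use of ONE constant SSA value."""
    layouts = set(layout_demands)
    if len(layouts) > 1:
        raise ValueError(f"layout conflict on a single constant: {layouts}")
    return layouts.pop()

# Before CSE: two distinct constants resolve independently.
resolve(["blocked<[64,1]>"])
resolve(["blocked<[1,64]>"])

# After CSE merges the "identical" constants, one value serves both uses.
try:
    resolve(["blocked<[64,1]>", "blocked<[1,64]>"])
except ValueError as e:
    print(e)
```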
…rs (#8302)

This change, combined with the previous PR#7939, converts memory
accesses of small-tensors into buffer-ops *WITHOUT* analyzing their
offsets' value range.

* Some context
- We informally call a tensor a "small-tensor" if it is no more than 2 GB
in size.
- A jit function is specialized to differentiate small-tensors from
larger ones; a function formal argument is tagged with
tt.pointer_range=32 if it binds to a small-tensor.

* This change
- PR#7939 revealed the base pointer; this PR unconditionally performs
the conversion to buffer-ops.
- This side-steps the defects/limitations of offset-range analysis, and
it is safe!

* TODO
- If the offset of a small-tensor mem-op is a 64-bit quantity, we can
cast the offset to 32 bits and then convert the access.

* Option
- Pass option `analyzeSmallTensorOfst={false|true}`, false by default:
when false, mem-ops on small-tensors are converted without analyzing
the offset's value range.
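A sketch of the specialization rule described above, with hypothetical helper names: a pointer argument binding a tensor of at most 2 GB would be tagged `tt.pointer_range=32`, so every in-bounds offset fits in 32 bits and buffer-ops are safe without range analysis.

```python
import torch

def is_small_tensor(t: torch.Tensor) -> bool:
    # "Small-tensor" per the definition above: no more than 2 GB in size.
    return t.numel() * t.element_size() <= 2**31

def pointer_range_bits(t: torch.Tensor) -> int:
    # An argument binding a small-tensor gets tt.pointer_range=32;
    # anything larger keeps 64-bit offsets.
    return 32 if is_small_tensor(t) else 64

# "meta" tensors carry shape/dtype only, so nothing is allocated here.
print(pointer_range_bits(torch.empty(1024, dtype=torch.float16, device="meta")))   # 32
print(pointer_range_bits(torch.empty(2**30, dtype=torch.float32, device="meta")))  # 64
```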

---------

Co-authored-by: Shuxin Yang <Shuxin.Yang@gmail.com>
@whitneywhtsang self-assigned this Oct 6, 2025
Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
@whitneywhtsang marked this pull request as ready for review October 7, 2025 18:34
@whitneywhtsang merged commit efbbb06 into main Oct 7, 2025
23 checks passed
@whitneywhtsang deleted the whineywhtsang/merge2 branch October 7, 2025 18:34
