Merge OpenAI Triton commit bea27e3
#5261
Merged
Conversation
Given that we already encode the kDim in `instrShape`, we don't need to select it again when lowering mfma; we can directly get the instruction by matching all parameters.
The target of this PR is to add Gluon and Triton support for scaled WMMA mxfp4 on gfx1250.
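For a sense of what the Triton-side surface looks like, here is a rough sketch of a kernel driving a scaled mxfp4 dot through `tl.dot_scaled`. The exact signature, the e2m1 packing (two 4-bit values per byte), and the per-32-element e8m0 scale layout are assumptions for illustration, not details taken from this PR.

```python
import triton
import triton.language as tl

# Illustrative sketch only; tile sizes and format strings are assumptions.
@triton.jit
def mxfp4_dot_kernel(a_ptr, a_scale_ptr, b_ptr, c_ptr,
                     M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    # A is mxfp4 (e2m1): two 4-bit values packed per uint8, so K // 2 bytes per row.
    a = tl.load(a_ptr + offs_m[:, None] * (K // 2) + tl.arange(0, K // 2)[None, :])
    # One e8m0 scale (stored as uint8) per group of 32 elements along K.
    a_scale = tl.load(a_scale_ptr + offs_m[:, None] * (K // 32) + tl.arange(0, K // 32)[None, :])
    # Keep B in bf16 for simplicity; no scale on the rhs operand.
    b = tl.load(b_ptr + tl.arange(0, K)[:, None] * N + offs_n[None, :])
    c = tl.dot_scaled(a, a_scale, "e2m1", b, None, "bf16")
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)
```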
Membar only tracks accesses to shared memory, so all it needs is a fence on the shared memory address space and a thread barrier. Before this change a `gpu.barrier` op was emitted, which implies much stronger guarantees on synchronization of all address spaces, not just shared memory. The new LocalBarrierOp targets only accesses between shared memory and registers, allowing for better code generation. Added lowering for this new op:
- Common code lowers it to the current GPU BarrierOp
- AMD lowers it to lgkmcnt(0) + s_barrier
`wave_per_eu` indicates how many waves are supposed to execute per EU (SIMD) simultaneously. Value 0 means the compiler decides how many waves are assigned to each SIMD; value 1 indicates only 1 wave per SIMD. Value 1 will hurt small kernels like `vector_add`, since there can be far more than 1 wave executing simultaneously per SIMD. So, by default, it makes more sense for the compiler to decide the number of waves per EU unless the user specifies it. The default value of 1 actually affects the perf numbers of the tutorial example `01-vector-add.py`: memory bandwidth regressed to 3.3 TB/s with `wave_per_eu=1`, and it increases to 5.3 TB/s with `wave_per_eu=0`. This PR changes the default value from 1 to 0.
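To make the knob concrete, below is a sketch of how a vector-add launch could set it explicitly. The launch-time keyword is assumed here to be `waves_per_eu` (the plural spelling used on the AMD backend), and the surrounding code is illustrative rather than taken from the `01-vector-add.py` tutorial.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(2**25, device="cuda")  # "cuda" also targets ROCm devices in PyTorch
y = torch.rand_like(x)
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
# waves_per_eu=0 lets the compiler pick the occupancy target (the new default
# described above); passing 1 would pin it to one wave per SIMD.
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024, waves_per_eu=0)
```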
Fixes #8229. Background: we run the Gluon inliner prior to auto layout propagation to enable returning auto layout from a function and having different calls of the function resolve to different layouts. However, the inliner calls Gluon canonicalize, and the `GreedyPatternRewriter` defaults to CSEing constants. This means that two distinct constants which could otherwise resolve to different layouts may be CSEd into a single constant and create a new conflict. I fix this by changing the inliner to do even less canonicalization, only simplifying control flow operations, and then add a canonicalization pass after auto layout resolution to make up for it.
…rs (#8302) This change, combined with the previous PR #7939, converts memory accesses of small tensors into buffer ops *WITHOUT* analyzing their offsets' value range.
* Some context
  - We informally call a tensor a "small tensor" if it is no more than 2 GB in size.
  - A jit function is specialized to differentiate small tensors from larger ones; a function formal argument is tagged with tt.pointer_range=32 if it binds to a small tensor.
* This change
  - The contribution of PR #7939 is to reveal the base pointer; this PR unconditionally performs the conversion.
  - It side-steps the defect/limitation of offset-range analysis, and it is safe.
* TODO
  - If the offset of a mem-op on a small tensor is a 64-bit quantity, we can cast the offset to 32-bit and then convert it.
* Option
  - Pass option: `analyzeSmallTensorOfst={false|true}`, false by default, meaning that when coming across a mem-op on a small tensor, there is no need to analyze its offset's value range.

Co-authored-by: Shuxin Yang <Shuxin.Yang@gmail.com>
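As a concrete illustration of the small-tensor case (a sketch under the description above, not code from this change): both tensors below are far smaller than 2 GB, so per that description their pointer arguments would be tagged with tt.pointer_range=32 and the load/store become candidates for buffer-op conversion without any offset-range analysis.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)

# ~4 MB tensors: well under the 2 GB "small tensor" threshold, so their
# pointer arguments are specialization candidates for tt.pointer_range=32.
x = torch.rand(1 << 20, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
scale_kernel[grid](x, out, x.numel(), BLOCK_SIZE=1024)
```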
Force-pushed from ecbe70b to e5c68c4
Force-pushed from e5c68c4 to fbf33e4
Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
Force-pushed from fbf33e4 to 98431a9
etiotto approved these changes on Oct 7, 2025
This PR changes the Triton base from 51021fb to bea27e3 (Oct 1).
Pass rate: 92.75% -> 94.2%
Please do not squash and merge this PR.