
Conversation

Contributor

@whitneywhtsang whitneywhtsang commented Oct 6, 2025

This PR changes the Triton base from bea27e3 to 0173f75 (Oct 6).
Pass rate: 94.2% -> 94.21%

Please do not squash and merge this PR.

saagarjha and others added 30 commits October 1, 2025 09:15
This matches the logic in Triton's implementation and avoids
constructing a 0d block_type.
Refactor `buffer_atomic_rmw` API based on the
[comment](triton-lang/triton#8112 (comment))
The crt headers have moved out of nvcc into their own individual package.
We now support arbitrary linear encodings on shared memory and move away
from the previously hardcoded logic.

The new lowering uses fewer registers than the previous one, which results
in tighter SASS codegen.

Missing, will do in follow-ups:
- Reuse this logic in `tcgen05.cp`
- Enable generic mmav3SS lowering (and perhaps also for mmav5?)
This commit enhances the gluon scaled wmma runtime unit tests.
Stop requiring local_load to have a DotOperand layout attached when
trying to generate ds_read_tr instructions.
This simplifies the code and allows loading of more complex cases such
as local_load + tt.trans.
Currently we're getting a deprecation warning when building triton:

> Python 3.14 will, by default, filter extracted tar archives and reject
files or modify their metadata. Use the filter argument to control this
behavior.

The fix is to pass an explicit filter mode to `extractall`.
* assign-stage-phase follows uses of scalar outputs outside
`ttg.warp_specialize` and assigns the default partition (`@0`) to all
dependent computations
* we'd need this change for nested-loop support
* add a bunch of lit tests

cc @acollins3
This PR adds initial support for TDM on gfx1250, exposing `tdm.async_load`
and `tdm.async_wait` through Gluon. This PR is limited to 2D tensors and
only supports creating the descriptor in the kernel. More cases will be
supported in following PRs.
Make TRITON_ENABLE_EXPERIMENTAL_CONSAN invalidate the cache, and add
tests verifying that there is no collision between ConSan-enabled and
ConSan-disabled compilations.
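
One way to picture the cache invalidation — a hypothetical sketch, not Triton's actual cache-key code — is folding the environment flag into the kernel cache key, so toggling ConSan forces a miss:

```python
import hashlib
import os

def cache_key(src: str) -> str:
    # Hypothetical sketch: hash the kernel source together with the ConSan
    # flag, so changing the flag changes the key and misses the cache.
    consan = os.environ.get("TRITON_ENABLE_EXPERIMENTAL_CONSAN", "0")
    h = hashlib.sha256()
    h.update(src.encode())
    h.update(b"\0" + consan.encode())
    return h.hexdigest()
```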
Add a detailed intra-kernel profiling tutorial, examples, and a README.
…orMemoryToLLVM.cpp` (#8346)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Updating Actions should be safe as all runners you use are at least
version 2.328.0.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
This PR
- moved functions from `ScheduleLoops` to `LowerLoops` to match the common
path structure, including allocating LDS, creating local/async load ops,
and creating schedules for these ops
- created symbolic clusters in `ScheduleLoops` and remapped them to real
clusters in `LowerLoops`
- stopped deducing the shared layout in `ScheduleLoops`, doing that in
`LowerLoops` instead
- replaced `preprocessLoop` with `getIndirectLevel` and a loop to
generate `loadToInfo`
- created an attribute to hand over the bypassLDS attribute from
`ScheduleLoops` to `LowerLoops` through the IR
- cleaned up pass arguments: now `ScheduleLoops` only needs `numStages`,
and `Pipeline` only needs `useAsyncCopy` and `usePingPong`
- rephrased comments accordingly

It's part of a series of PRs to refactor StreamPipeliner.

Previous PRs:
- triton-lang/triton#8307
- triton-lang/triton#8295
Updated `matmul_ogs` to optionally fuse all-gather / scatter.

A Python reference implementation is added. Test cases will be kept in a
separate repo due to some dependencies.
…ng (#8338)

This also fixes a bug found by @masahi, but this uncovered a PTX bug, so
we stay as we were on that front lol
…342)

Include ConSan state in the compilation options and kwargs of the kernel
to force a JIT cache miss. Also, prepare the knobs to unify this
behavior with Proton by introducing an `instrumentation_mode` compilation
knob that can be set to "consan" or "proton" by the Proton runtime.
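
A minimal sketch of the idea (class and field names are illustrative, not Triton's real compilation-options API): putting `instrumentation_mode` into the hashed options record means switching it changes the options hash and thus misses the JIT cache.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Hypothetical options record: any field, including instrumentation_mode,
# participates in the hash that keys the JIT cache.
@dataclass(frozen=True)
class CompilationOptions:
    num_warps: int = 4
    instrumentation_mode: str = ""  # "", "consan", or "proton"

    def hash(self) -> str:
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()
```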
…(#8353)

The forward-propagation part of the remove-layout-conversions pass makes
this

```
%acc_316 = arith.truncf %p_308 : tensor<128x64xf32, #blocked> to tensor<128x64xbf16, #blocked> loc(#loc247)
      ttng.tmem_store %acc_316, %acc_35, %true : tensor<128x64xbf16, #blocked> -> !ttg.memdesc<128x64xbf16, #tmem2, #ttng.tensor_memory, mutable> loc(#loc228)
```

become this

```
%acc_315 = arith.truncf %p_307 : tensor<128x64xf32, #linear> to tensor<128x64xbf16, #linear> loc(#loc247)
      %acc_316 = ttg.convert_layout %acc_315 : tensor<128x64xbf16, #linear> -> tensor<128x64xbf16, #blocked3> loc(#loc247)
      ttng.tmem_store %acc_316, %acc_35, %true : tensor<128x64xbf16, #blocked3> -> !ttg.memdesc<128x64xbf16, #tmem2, #ttng.tensor_memory, mutable> loc(#loc228)
```

The `linear` layout comes from `TMemLoadReducePattern` for `tmem_load`,
so we'd like to respect that if possible.

This commit folds the convert op into TMEMStoreOp if it's compatible,
getting rid of an unnecessary convert_layout op in the main loop for the
flex attn kernel in #8328.
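
The shape of the rewrite can be sketched in plain Python (the class names are illustrative stand-ins, not the actual MLIR pattern): if a store's value comes from a layout convert and the store can accept the source layout, bypass the convert.

```python
# Hypothetical stand-ins for the IR: a layout-convert op and a tmem store.
class Convert:
    def __init__(self, src, src_layout, dst_layout):
        self.src, self.src_layout, self.dst_layout = src, src_layout, dst_layout

class Store:
    def __init__(self, value, layout):
        self.value, self.layout = value, layout

def fold_convert_into_store(store, compatible):
    # If the stored value is a convert and the store is compatible with the
    # convert's *source* layout, store the source directly and drop the convert.
    v = store.value
    if isinstance(v, Convert) and compatible(v.src_layout):
        return Store(v.src, v.src_layout)
    return store
```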

This gets the flex attention kernel in #8328 to 499 TFLOPS on a 750W
B200. (The regression was from 435 down to 359 TFLOPS.)

TTGIR files:
- before:
https://gist.github.com/pchen7e2/f47639a166fbf4ea8b01a723ec3899dd
- after:
https://gist.github.com/pchen7e2/fd4eb279d8927efb48711511582443ca
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Closes #7156


`TERMINFO_LIBRARY` is unused since
triton-lang/triton@98ed7db

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
#8333 uses the existing `create_make_tensor_descriptor` from Triton for
`tdm.make_tensor_descriptor`. It produces a tensor descriptor type
without the layout. This PR exposes a new
`create_make_tensor_descriptor` from Gluon that allows the return type
to be specified explicitly.

# New contributor declaration
- [ ] I am not making a trivial change, such as fixing a typo in a
comment.

- [ ] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because `it fixes a random build
failure`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code and using the instructions it generates is not minimal.)
Transition the last use of emitTransferBetweenRegistersAndShared to the
new lowering path.
Some general cleanup to lowerInst(), including aliasing information for
packed loads.
This also removes emitTransferBetweenRegistersAndShared as it's now
unused.
(Calls to storage() caused issues when using FakeTensor.)
This commit disables pointer canonicalization for pointers into large
tensors. "Large tensors" refers to the JIT specialization for tensor
arguments over 2GB.

It is disabled on the grounds that it has some tricky bugs. We are trying
to come up with a better approach that addresses several conflicting
performance aspects.
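
A hypothetical sketch of the guard (the function name and threshold handling are illustrative, not the actual pass logic): tensors whose byte size can reach the 2GB mark need 64-bit offsets and are excluded from canonicalization.

```python
# Illustrative guard: skip pointer canonicalization for any tensor argument
# whose total byte size reaches the 2GB (32-bit offset) limit.
TWO_GB = 2**31

def should_canonicalize_pointer(numel: int, elem_bytes: int) -> bool:
    # Small tensors keep 32-bit offset arithmetic and get canonicalized;
    # large ones are left alone.
    return numel * elem_bytes < TWO_GB
```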

---------

Co-authored-by: Shuxin Yang <Shuxin.Yang@gmail.com>
ravil-mobile and others added 2 commits October 6, 2025 16:39
In the current implementation we reset the mode register every time we
perform an FP conversion to an FP8 data type. We modify the F16_OVFL flag,
which also affects clamping during conversions of the FP16 data type. In
fact, the flag should be set only once (e.g., at the beginning of a
kernel). This PR addresses this issue: it moves the manipulation of the
mode register to a dedicated function which gets initialized with an
`AMD::ISAFamily` instance. Note that the layout of bits in the mode
register may vary from architecture to architecture.
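
The "set it once" fix can be pictured with a small sketch (class names and the clamp range are illustrative, not the AMD backend's actual code): the register write is made idempotent so repeated conversions touch it only the first time.

```python
# Hypothetical model of the mode register: the F16_OVFL write happens once,
# not on every FP8 conversion.
class ModeRegister:
    def __init__(self):
        self.f16_ovfl_set = False
        self.writes = 0  # count of actual hardware-register writes

    def ensure_f16_ovfl(self):
        # Idempotent: only the first call performs a write.
        if not self.f16_ovfl_set:
            self.writes += 1
            self.f16_ovfl_set = True

def convert_to_fp8(values, mode: ModeRegister):
    mode.ensure_f16_ovfl()  # no-op after the first conversion
    # Crude e4m3-style saturation, purely for illustration.
    return [max(-448.0, min(448.0, v)) for v in values]
```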
This PR adds tests for the MXFP GEMM Gluon kernel for GFX1250.
@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/merge branch 10 times, most recently from f4a04e2 to a575aa2 Compare October 7, 2025 20:51
@whitneywhtsang whitneywhtsang marked this pull request as ready for review October 7, 2025 23:13
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit 210c7b5 Merge OpenAI Triton commit 43dbdd1 Oct 7, 2025
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit 43dbdd1 Merge OpenAI Triton commit 5d84a91 Oct 8, 2025
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit 5d84a91 Merge OpenAI Triton commit 0173f75 Oct 8, 2025
@whitneywhtsang whitneywhtsang merged commit c0e155d into main Oct 8, 2025
23 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/merge branch October 8, 2025 03:28