
Conversation

Contributor

@whitneywhtsang whitneywhtsang commented Oct 6, 2025

This PR changes the Triton base from bea27e3 to 0173f75 (Oct 6).
Pass rate: 94.2% -> 94.21%

Please do not squash and merge this PR.

saagarjha and others added 30 commits October 1, 2025 09:15
This matches the logic in Triton's implementation and avoids
constructing a 0d block_type.
Refactor `buffer_atomic_rmw` API based on the
[comment](triton-lang/triton#8112 (comment))
The crt headers have moved out of nvcc into their own individual package.
We now support arbitrary linear encodings on shared memory and move away
from the previously hardcoded logic.

The new lowering uses fewer registers than the previous one, which results
in tighter SASS codegen.

Missing, will do in follow-ups:
- Reuse this logic in `tcgen05.cp`
- Enable generic mmav3SS lowering (and perhaps also for mmav5?)
This commit enhances the gluon scaled wmma runtime unit tests.
Stop requiring local_load to have a DotOperand layout attached when
trying to generate ds_read_tr instructions.
This simplifies the code and allows loading of more complex cases such
as local_load + tt.trans.
Currently we're getting a deprecation warning when building triton:

> Python 3.14 will, by default, filter extracted tar archives and reject
files or modify their metadata. Use the filter argument to control this
behavior.

The fix is to pass an explicit filter mode to `extractall`.
* assign-stage-phase follows uses of scalar outputs outside
`ttg.warp_specialize` and assigns the default partition (`@0`) to all
dependent computations
* we'd need this change for nested-loop support
* add a bunch of lit tests

cc @acollins3
This PR adds initial support for TDM on gfx1250, exposing `tdm.async_load`
and `tdm.async_wait` through Gluon. This PR is limited to 2D tensors and
only supports creating the descriptor in the kernel. More cases will be
supported in following PRs.
Make TRITON_ENABLE_EXPERIMENTAL_CONSAN invalidate the cache, and add
tests verifying that there is no collision between ConSan-enabled and
ConSan-disabled compilations.
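
One way to picture the cache invalidation — a hypothetical sketch, not Triton's actual cache-key code — is folding the environment flag into the kernel cache key, so toggling ConSan forces a miss:

```python
import hashlib
import os

def cache_key(src: str) -> str:
    # Hypothetical sketch: hash the kernel source together with the ConSan
    # flag, so changing the flag changes the key and misses the cache.
    consan = os.environ.get("TRITON_ENABLE_EXPERIMENTAL_CONSAN", "0")
    h = hashlib.sha256()
    h.update(src.encode())
    h.update(b"\0" + consan.encode())
    return h.hexdigest()
```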
Add a detailed intra-kernel profiling tutorial, examples, and a README.
…orMemoryToLLVM.cpp` (#8346)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Updating Actions should be safe as all runners you use are at least
version 2.328.0.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
This PR
- moved functions from `ScheduleLoops` to `LowerLoops` to match the common
path structure, including allocating LDS, creating local/async load ops,
and creating schedules for these ops
- created symbolic clusters in `ScheduleLoops` and remapped them to real
clusters in `LowerLoops`
- stopped deducing the shared layout in `ScheduleLoops`, doing that in
`LowerLoops` instead
- replaced `preprocessLoop` with `getIndirectLevel` and a loop to
generate `loadToInfo`
- created an attribute to hand over the bypassLDS attribute from
`ScheduleLoops` to `LowerLoops` through the IR
- cleaned up pass arguments: now `ScheduleLoops` only needs `numStages`,
and `Pipeline` only needs `useAsyncCopy` and `usePingPong`
- rephrased comments accordingly

It's part of a series of PRs to refactor StreamPipeliner.

Previous PRs:
- triton-lang/triton#8307
- triton-lang/triton#8295
Updated `matmul_ogs` to optionally fuse all-gather / scatter.

A Python reference implementation is added. Test cases will be kept in a
separate repo due to some dependencies.
…ng (#8338)

This also fixes a bug found by @masahi, but this uncovered a PTX bug, so
we stay as we were on that front lol
…342)

Include ConSan state in the compilation options and kwargs of the kernel
to force a JIT cache miss. Also, prepare the knobs to unify this
behavior with Proton by introducing an `instrumentation_mode` compilation
knob that can be set to "consan" or "proton" by the Proton runtime.
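
A minimal sketch of the idea (class and field names are illustrative, not Triton's real compilation-options API): putting `instrumentation_mode` into the hashed options record means switching it changes the options hash and thus misses the JIT cache.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Hypothetical options record: any field, including instrumentation_mode,
# participates in the hash that keys the JIT cache.
@dataclass(frozen=True)
class CompilationOptions:
    num_warps: int = 4
    instrumentation_mode: str = ""  # "", "consan", or "proton"

    def hash(self) -> str:
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()
```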
…(#8353)

The forward-propagation part of the remove-layout-conversions pass makes
this

```
%acc_316 = arith.truncf %p_308 : tensor<128x64xf32, #blocked> to tensor<128x64xbf16, #blocked> loc(#loc247)
      ttng.tmem_store %acc_316, %acc_35, %true : tensor<128x64xbf16, #blocked> -> !ttg.memdesc<128x64xbf16, #tmem2, #ttng.tensor_memory, mutable> loc(#loc228)
```

become this

```
%acc_315 = arith.truncf %p_307 : tensor<128x64xf32, #linear> to tensor<128x64xbf16, #linear> loc(#loc247)
      %acc_316 = ttg.convert_layout %acc_315 : tensor<128x64xbf16, #linear> -> tensor<128x64xbf16, #blocked3> loc(#loc247)
      ttng.tmem_store %acc_316, %acc_35, %true : tensor<128x64xbf16, #blocked3> -> !ttg.memdesc<128x64xbf16, #tmem2, #ttng.tensor_memory, mutable> loc(#loc228)
```

The `linear` layout comes from `TMemLoadReducePattern` for `tmem_load`,
so we'd like to respect that if possible.

This commit folds the convert op into TMEMStoreOp if it's compatible,
getting rid of an unnecessary convert_layout op in the main loop for the
flex attn kernel in #8328.
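
The shape of the rewrite can be sketched in plain Python (the class names are illustrative stand-ins, not the actual MLIR pattern): if a store's value comes from a layout convert and the store can accept the source layout, bypass the convert.

```python
# Hypothetical stand-ins for the IR: a layout-convert op and a tmem store.
class Convert:
    def __init__(self, src, src_layout, dst_layout):
        self.src, self.src_layout, self.dst_layout = src, src_layout, dst_layout

class Store:
    def __init__(self, value, layout):
        self.value, self.layout = value, layout

def fold_convert_into_store(store, compatible):
    # If the stored value is a convert and the store is compatible with the
    # convert's *source* layout, store the source directly and drop the convert.
    v = store.value
    if isinstance(v, Convert) and compatible(v.src_layout):
        return Store(v.src, v.src_layout)
    return store
```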

This gets the flex attention kernel in #8328 to 499 TFLOPS on a 750W
B200. (The regression was from 435 down to 359 TFLOPS.)

TTGIR files:
- before:
https://gist.github.com/pchen7e2/f47639a166fbf4ea8b01a723ec3899dd
- after:
https://gist.github.com/pchen7e2/fd4eb279d8927efb48711511582443ca
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Closes #7156


`TERMINFO_LIBRARY` is unused since
triton-lang/triton@98ed7db

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
#8333 uses the existing `create_make_tensor_descriptor` from Triton for
`tdm.make_tensor_descriptor`. It produces a tensor descriptor type
without the layout. This PR exposes a new
`create_make_tensor_descriptor` from Gluon that allows the return type
to be specified explicitly.

# New contributor declaration
- [ ] I am not making a trivial change, such as fixing a typo in a
comment.

- [ ] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because `it fixes a random build
failure`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code and using the instructions it generates is not minimal.)
Transition the last use of emitTransferBetweenRegistersAndShared to the
new lowering path.
Some general cleanup to lowerInst(), including aliasing information for
packed loads.
This also removes emitTransferBetweenRegistersAndShared as it's now
unused.
(Calls to storage() caused issues when using FakeTensor.)
This commit disables pointer canonicalization for pointers into large
tensors. "Large tensors" refers to the JIT specialization for tensor
arguments over 2GB.

It is disabled on the grounds that it has some tricky bugs. We are trying
to come up with a better approach that addresses several conflicting
performance aspects.
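
A hypothetical sketch of the guard (the function name and threshold handling are illustrative, not the actual pass logic): tensors whose byte size can reach the 2GB mark need 64-bit offsets and are excluded from canonicalization.

```python
# Illustrative guard: skip pointer canonicalization for any tensor argument
# whose total byte size reaches the 2GB (32-bit offset) limit.
TWO_GB = 2**31

def should_canonicalize_pointer(numel: int, elem_bytes: int) -> bool:
    # Small tensors keep 32-bit offset arithmetic and get canonicalized;
    # large ones are left alone.
    return numel * elem_bytes < TWO_GB
```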

---------

Co-authored-by: Shuxin Yang <Shuxin.Yang@gmail.com>
ravil-mobile and others added 2 commits October 6, 2025 16:39
In the current implementation we reset the mode register every time we
perform an FP conversion to an FP8 data type. We modify the F16_OVFL flag,
which also affects clamping during conversions of the FP16 data type. In
fact, the flag should be set only once (e.g., at the beginning of a
kernel). This PR addresses this issue: it moves the manipulation of the
mode register to a dedicated function which gets initialized with an
`AMD::ISAFamily` instance. Note that the layout of bits in the mode
register may vary from architecture to architecture.
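
The "set it once" fix can be pictured with a small sketch (class names and the clamp range are illustrative, not the AMD backend's actual code): the register write is made idempotent so repeated conversions touch it only the first time.

```python
# Hypothetical model of the mode register: the F16_OVFL write happens once,
# not on every FP8 conversion.
class ModeRegister:
    def __init__(self):
        self.f16_ovfl_set = False
        self.writes = 0  # count of actual hardware-register writes

    def ensure_f16_ovfl(self):
        # Idempotent: only the first call performs a write.
        if not self.f16_ovfl_set:
            self.writes += 1
            self.f16_ovfl_set = True

def convert_to_fp8(values, mode: ModeRegister):
    mode.ensure_f16_ovfl()  # no-op after the first conversion
    # Crude e4m3-style saturation, purely for illustration.
    return [max(-448.0, min(448.0, v)) for v in values]
```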
This PR adds tests for the MXFP GEMM Gluon kernel for GFX1250.
@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/merge branch 10 times, most recently from f4a04e2 to a575aa2 Compare October 7, 2025 20:51
@whitneywhtsang whitneywhtsang marked this pull request as ready for review October 7, 2025 23:13
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit 210c7b5 Merge OpenAI Triton commit 43dbdd1 Oct 7, 2025
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit 43dbdd1 Merge OpenAI Triton commit 5d84a91 Oct 8, 2025
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit 5d84a91 Merge OpenAI Triton commit 0173f75 Oct 8, 2025
@whitneywhtsang whitneywhtsang merged commit c0e155d into main Oct 8, 2025
23 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/merge branch October 8, 2025 03:28