Merge OpenAI Triton commit 0173f75
#5260
Merged
Conversation
This matches the logic in Triton's implementation and avoids constructing a 0d block_type.
Refactor `buffer_atomic_rmw` API based on the [comment](triton-lang/triton#8112 (comment))
The CRT headers have moved out of nvcc into their own separate component.
We now support arbitrary linear encodings on shared memory and move away from the previously hardcoded logic. The new lowering uses fewer registers than the previous one, which results in tighter SASS codegen. Missing, to be done in follow-ups:
- Reuse this logic in `tcgen05.cp`
- Enable generic mmav3SS lowering (and perhaps also for mmav5?)
This commit enhances the gluon scaled wmma runtime unit tests.
Stop requiring local_load to have a DotOperand layout attached when trying to generate ds_read_tr instructions. This simplifies the code and allows loading of more complex cases such as local_load + tt.trans.
Currently we're getting a deprecation warning when building triton: > Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. The fix is to pass an explicit filter mode to `extractall`.
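For illustration, a minimal sketch of the fix, assuming a placeholder archive path and destination rather than the actual files Triton downloads:

```python
import tarfile

# Passing an explicit filter (available since Python 3.12) silences the
# Python 3.14 deprecation warning; "data" rejects unsafe members such as
# absolute paths or files outside the destination directory.
with tarfile.open("downloaded-toolchain.tar.gz") as archive:
    archive.extractall(path="build/deps", filter="data")
```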
* assign-stage-phase follows uses of scalar outputs outside `ttg.warp_specialize` and assigns the default partition (`@0`) to all dependent computations
* we'd need this change for nested-loop support
* add a bunch of lit tests

cc @acollins3
This PR adds initial support for TDM on gfx1250, exposing `tdm.async_load` and `tdm.async_wait` through Gluon. It is limited to 2D tensors and only supports creating the descriptor in the kernel. More cases will be supported in follow-up PRs.
Make TRITON_ENABLE_EXPERIMENTAL_CONSAN invalidate the cache, and add tests verifying that there is no collision between ConSan enabled and disabled.
Add a detailed intra-kernel profiling tutorial, examples, and a README.
…lt_scheme` (#8345)
…orMemoryToLLVM.cpp` (#8346) Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Updating Actions should be safe, as all runners in use are at least version 2.328.0. Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
This PR
- moved functions from `ScheduleLoops` to `LowerLoops` to match the common path structure, including allocating LDS, creating local/async load ops, and creating the schedule for these ops
- created symbolic clusters in `ScheduleLoops` and remapped them to real clusters in `LowerLoops`
- stopped deducing the shared layout in `ScheduleLoops` and did that in `LowerLoops` instead
- replaced `preprocessLoop` with `getIndirectLevel` and a loop to generate `loadToInfo`
- created an attribute to hand over the bypassLDS attribute from `ScheduleLoops` to `LowerLoops` through the IR
- cleaned up pass arguments: `ScheduleLoops` now only needs `numStages`, and `Pipeline` only needs `useAsyncCopy` and `usePingPong`
- rephrased comments accordingly

It's part of a series of PRs to refactor StreamPipeliner. Previous PRs:
- triton-lang/triton#8307
- triton-lang/triton#8295
Updated `matmul_ogs` to optionally fuse all gather / scatter. A Python reference implementation is added. Test cases will be kept in a separate repo due to some dependencies.
…ng (#8338) This also fixes a bug found by @masahi, but that uncovered a PTX bug, so we stay as we were on that front.
…342) Include the ConSan state in the compilation options and kwargs of the kernel to force a JIT cache miss. Also prepare the knobs to unify this behavior with Proton by introducing an `instrumentation_mode` compilation knob that can be set to "consan" or "proton" by the Proton runtime.
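As a rough sketch of why including the instrumentation state forces a cache miss, consider a simplified cache-key computation; the function and field names below are illustrative, not Triton's actual internals:

```python
import hashlib
import json
import os

def cache_key(src: str, options: dict) -> str:
    # Hypothetical sketch: once the instrumentation mode is folded into the
    # options, kernels compiled with ConSan enabled and disabled hash to
    # different keys, so a stale non-instrumented binary is never reused.
    options = dict(options)
    options["instrumentation_mode"] = (
        "consan"
        if os.environ.get("TRITON_ENABLE_EXPERIMENTAL_CONSAN") == "1"
        else ""
    )
    payload = src + json.dumps(options, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```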
… without commenting out code (#8293)
…(#8353) The forward-propagation part of the remove-layout-conversion pass makes this
```
%acc_316 = arith.truncf %p_308 : tensor<128x64xf32, #blocked> to tensor<128x64xbf16, #blocked> loc(#loc247)
ttng.tmem_store %acc_316, %acc_35, %true : tensor<128x64xbf16, #blocked> -> !ttg.memdesc<128x64xbf16, #tmem2, #ttng.tensor_memory, mutable> loc(#loc228)
```
become this
```
%acc_315 = arith.truncf %p_307 : tensor<128x64xf32, #linear> to tensor<128x64xbf16, #linear> loc(#loc247)
%acc_316 = ttg.convert_layout %acc_315 : tensor<128x64xbf16, #linear> -> tensor<128x64xbf16, #blocked3> loc(#loc247)
ttng.tmem_store %acc_316, %acc_35, %true : tensor<128x64xbf16, #blocked3> -> !ttg.memdesc<128x64xbf16, #tmem2, #ttng.tensor_memory, mutable> loc(#loc228)
```
The `linear` layout comes from `TMemLoadReducePattern` for `tmem_load`, so we'd like to respect that if possible. This commit folds the convert op into TMEMStoreOp if it's compatible, getting rid of an unnecessary convert_layout op in the main loop of the flex attention kernel in #8328. This gets the flex attention kernel in #8328 to 499 tflops on a 750W B200. (The regression was from 435 to 359 tflops.) TTGIR files:
- before: https://gist.github.com/pchen7e2/f47639a166fbf4ea8b01a723ec3899dd
- after: https://gist.github.com/pchen7e2/fd4eb279d8927efb48711511582443ca
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Closes #7156 `TERMINFO_LIBRARY` is unused since triton-lang/triton@98ed7db Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
#8333 uses the existing `create_make_tensor_descriptor` from Triton for `tdm.make_tensor_descriptor`. It produces a tensor descriptor type without the layout. This PR exposes a new `create_make_tensor_descriptor` from Gluon, which allows explicitly specifying the return type.
# New contributor declaration
- [ ] I am not making a trivial change, such as fixing a typo in a comment.
- [ ] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because it fixes a random build failure.
- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)
Transition the last use of emitTransferBetweenRegistersAndShared to the new lowering path. Some general cleanup to lowerInst(), including aliasing information for packed loads. This also removes emitTransferBetweenRegistersAndShared as it's now unused.
(Calls to storage() caused issues when using FakeTensor.)
This commit disables pointer canonicalization for pointers that point to large tensors. "Large tensors" refers to the JIT specialization for tensor arguments over 2GB. It is disabled on the grounds that it has some tricky bugs. We are trying to come up with a better approach that addresses several conflicting performance aspects. --------- Co-authored-by: Shuxin Yang <Shuxin.Yang@gmail.com>
In the current implementation, we reset the mode register every time we perform an FP conversion to an FP8 data type. We modify the F16_OVFL flag, which also affects clamping during conversions of the FP16 data type. In fact, the flag should be set only once (e.g., at the beginning of a kernel). This PR addresses that issue: it moves the manipulation of the mode register into a dedicated function which gets initialized with an `AMD::ISAFamily` instance. Note that the layout of bits in the mode register may vary from architecture to architecture.
This PR adds tests for the MXFP GEMM Gluon kernel on GFX1250.
etiotto approved these changes on Oct 7, 2025
…st '/std:c++20' Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
This reverts commit 60605d8.
This PR changes the Triton base from bea27e3 to 0173f75 (Oct 6).
Pass rate: 94.2% -> 94.21%.
Please do not squash and merge this PR.