[AMDGPU][gfx12] Clean-up implementation of waits before SCOPE_SYS stores #150587
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

We can do it all in finalizeStore if we ensure it always sees the stores. This also removes the waits before volatile LDS stores, which never needed them; emitting those waits was a bug until now.

Full diff: https://github.com/llvm/llvm-project/pull/150587.diff

2 Files Affected:
- llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
- llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll
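The iterator issue is easiest to see in a toy model. Below is a minimal, self-contained analogue in which std::list stands in for LLVM's MachineBasicBlock instruction list; every name here is illustrative, not actual LLVM API. A helper that repositions the caller's iterator while inserting a wait leaves the iterator pointing at the wait, whereas a reference to the instruction itself stays valid, which is the pattern the patch adopts.

#include <cassert>
#include <iterator>
#include <list>
#include <string>

using Block = std::list<std::string>;

// Like some legalizer helpers, this repositions the caller's iterator as a
// side effect of inserting code around the instruction.
static void insertWaitAfter(Block &B, Block::iterator &It) {
  ++It;                                      // step past the store
  It = B.insert(It, "s_wait_storecnt 0x0");  // It now names the wait
}

int main() {
  Block B = {"v_mov_b32", "ds_store_b32", "s_endpgm"};
  Block::iterator MI = std::next(B.begin()); // points at the store

  // Pin the instruction itself; std::list (like LLVM's intrusive list)
  // never relocates elements, so this reference stays valid.
  std::string &StoreMI = *MI;

  insertWaitAfter(B, MI);

  assert(*MI == "s_wait_storecnt 0x0"); // the iterator drifted off the store
  assert(StoreMI == "ds_store_b32");    // the pinned reference did not
}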
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 4456d9b241425..d6337a85a7361 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -321,7 +321,7 @@ class SICacheControl {
bool IsNonTemporal,
bool IsLastUse = false) const = 0;
- virtual bool finalizeStore(MachineBasicBlock::iterator &MI, bool Atomic) const {
+ virtual bool finalizeStore(MachineInstr &MI, bool Atomic) const {
return false;
};
@@ -602,7 +602,7 @@ class SIGfx12CacheControl : public SIGfx11CacheControl {
bool IsVolatile, bool IsNonTemporal,
bool IsLastUse) const override;
- bool finalizeStore(MachineBasicBlock::iterator &MI, bool Atomic) const override;
+ bool finalizeStore(MachineInstr &MI, bool Atomic) const override;
bool insertRelease(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
SIAtomicAddrSpace AddrSpace, bool IsCrossAddrSpaceOrdering,
@@ -2526,9 +2526,6 @@ bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal(
if (IsVolatile) {
Changed |= setScope(MI, AMDGPU::CPol::SCOPE_SYS);
- if (Op == SIMemOp::STORE)
- Changed |= insertWaitsBeforeSystemScopeStore(MI);
-
// Ensure operation has completed at system scope to cause all volatile
// operations to be visible outside the program in a global order. Do not
// request cross address space as only the global address space can be
@@ -2541,9 +2538,8 @@ bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal(
return Changed;
}
-bool SIGfx12CacheControl::finalizeStore(MachineBasicBlock::iterator &MI,
- bool Atomic) const {
- MachineOperand *CPol = TII->getNamedOperand(*MI, OpName::cpol);
+bool SIGfx12CacheControl::finalizeStore(MachineInstr &MI, bool Atomic) const {
+ MachineOperand *CPol = TII->getNamedOperand(MI, OpName::cpol);
if (!CPol)
return false;
@@ -2558,7 +2554,7 @@ bool SIGfx12CacheControl::finalizeStore(MachineBasicBlock::iterator &MI,
// GFX12.5 only: Require SCOPE_SE on stores that may hit the scratch address
// space.
- if (TII->mayAccessScratchThroughFlat(*MI) && Scope == CPol::SCOPE_CU)
+ if (TII->mayAccessScratchThroughFlat(MI) && Scope == CPol::SCOPE_CU)
return setScope(MI, CPol::SCOPE_SE);
return false;
@@ -2662,6 +2658,8 @@ bool SIMemoryLegalizer::expandStore(const SIMemOpInfo &MOI,
assert(!MI->mayLoad() && MI->mayStore());
bool Changed = false;
+ // FIXME: Necessary hack because iterator can lose track of the store.
+ MachineInstr &StoreMI = *MI;
if (MOI.isAtomic()) {
if (MOI.getOrdering() == AtomicOrdering::Monotonic ||
@@ -2678,7 +2676,7 @@ bool SIMemoryLegalizer::expandStore(const SIMemOpInfo &MOI,
MOI.getIsCrossAddressSpaceOrdering(),
Position::BEFORE);
- Changed |= CC->finalizeStore(MI, /*Atomic=*/true);
+ Changed |= CC->finalizeStore(StoreMI, /*Atomic=*/true);
return Changed;
}
@@ -2691,7 +2689,7 @@ bool SIMemoryLegalizer::expandStore(const SIMemOpInfo &MOI,
// GFX12 specific, scope(desired coherence domain in cache hierarchy) is
// instruction field, do not confuse it with atomic scope.
- Changed |= CC->finalizeStore(MI, /*Atomic=*/false);
+ Changed |= CC->finalizeStore(StoreMI, /*Atomic=*/false);
return Changed;
}
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll
index bc2508411ed6b..5e5e3bf83d610 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll
@@ -415,11 +415,6 @@ define amdgpu_kernel void @local_volatile_store_0(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s1
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s0
-; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
-; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: ds_store_b32 v0, v1
; GFX12-WGP-NEXT: s_endpgm
;
@@ -432,11 +427,6 @@ define amdgpu_kernel void @local_volatile_store_0(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s1
; GFX12-CU-NEXT: s_wait_kmcnt 0x0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s0
-; GFX12-CU-NEXT: s_wait_loadcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_kmcnt 0x0
-; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: ds_store_b32 v0, v1
; GFX12-CU-NEXT: s_endpgm
ptr addrspace(1) %in, ptr addrspace(3) %out) {
@@ -562,11 +552,6 @@ define amdgpu_kernel void @local_volatile_store_1(
; GFX12-WGP-NEXT: v_lshl_add_u32 v0, v0, s1, s2
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s0
-; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
-; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: ds_store_b32 v0, v1
; GFX12-WGP-NEXT: s_endpgm
;
@@ -583,11 +568,6 @@ define amdgpu_kernel void @local_volatile_store_1(
; GFX12-CU-NEXT: v_lshl_add_u32 v0, v0, s1, s2
; GFX12-CU-NEXT: s_wait_kmcnt 0x0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s0
-; GFX12-CU-NEXT: s_wait_loadcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_kmcnt 0x0
-; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: ds_store_b32 v0, v1
; GFX12-CU-NEXT: s_endpgm
ptr addrspace(1) %in, ptr addrspace(3) %out) {
We can do it all in finalizeStore if we ensure it always sees the stores. For that, I needed to fix a hidden bug where finalizeStore wouldn't see all stores: sometimes the iterator got out of sync and no longer pointed to the store. This also removes the waits before volatile LDS stores, which never needed them; emitting those waits was a bug until now.
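In code, the fix has two parts, both visible in the diff above: finalizeStore now takes the instruction rather than an iterator, and expandStore pins the store before any wait insertion can move the iterator. A simplified sketch of that shape follows, using toy Instr/Block types; the real signatures use MachineInstr and MachineBasicBlock::iterator.

#include <list>
#include <string>

struct Instr { std::string Opcode; };
using Block = std::list<Instr>;

// New-style hook: operates on the instruction itself, so it is immune to any
// earlier helper having reseated the caller's iterator.
static bool finalizeStore(Instr &MI, bool Atomic) {
  (void)MI; (void)Atomic; // e.g. set or promote the cpol scope here
  return false;
}

// Caller pattern from expandStore: capture the store up front, then hand the
// pinned instruction to finalizeStore regardless of where MI ends up.
static bool expandStore(Block &B, Block::iterator MI) {
  (void)B;              // wait/fence insertion into B elided in this sketch
  Instr &StoreMI = *MI; // survives any repositioning of MI
  return finalizeStore(StoreMI, /*Atomic=*/false);
}

int main() {
  Block B = {{"ds_store_b32"}};
  expandStore(B, B.begin());
}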