[AMDGPU][gfx1250] Add cu-store subtarget feature #150588

Merged
merged 2 commits into from
Jul 29, 2025
9 changes: 8 additions & 1 deletion llvm/docs/AMDGPUUsage.rst
@@ -768,6 +768,9 @@ For example:
performant than code generated for XNACK replay
disabled.

cu-stores TODO On GFX12.5, controls whether ``scope:SCOPE_CU`` stores may be used.
If disabled, all stores will be done at ``scope:SCOPE_SE`` or greater.

=============== ============================ ==================================================

.. _amdgpu-target-id:
@@ -5107,7 +5110,9 @@ The fields used by CP for code objects before V3 also match those specified in
and must be 0,
>454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT
_SIZE
457:455 3 bits Reserved, must be 0.
456:455 2 bits Reserved, must be 0.
457 1 bit USES_CU_STORES GFX12.5: Whether the ``cu-stores`` target attribute is enabled.
If 0, then all stores are ``SCOPE_SE`` or higher.
458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9
Reserved, must be 0.
GFX10-GFX11
@@ -18188,6 +18193,8 @@ terminated by an ``.end_amdhsa_kernel`` directive.
GFX942)
``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX12 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
``.amdhsa_uses_cu_stores`` 0 GFX12.5 Controls USES_CU_STORES in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
``.amdhsa_wavefront_size32`` Target GFX10-GFX12 Controls ENABLE_WAVEFRONT_SIZE32 in
Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
Specific
3 changes: 2 additions & 1 deletion llvm/include/llvm/Support/AMDHSAKernelDescriptor.h
@@ -223,7 +223,8 @@ enum : int32_t {
KERNEL_CODE_PROPERTY(ENABLE_SGPR_DISPATCH_ID, 4, 1),
KERNEL_CODE_PROPERTY(ENABLE_SGPR_FLAT_SCRATCH_INIT, 5, 1),
KERNEL_CODE_PROPERTY(ENABLE_SGPR_PRIVATE_SEGMENT_SIZE, 6, 1),
KERNEL_CODE_PROPERTY(RESERVED0, 7, 3),
KERNEL_CODE_PROPERTY(RESERVED0, 7, 2),
KERNEL_CODE_PROPERTY(USES_CU_STORES, 9, 1), // GFX12.5 +cu-stores
KERNEL_CODE_PROPERTY(ENABLE_WAVEFRONT_SIZE32, 10, 1), // GFX10+
KERNEL_CODE_PROPERTY(USES_DYNAMIC_STACK, 11, 1),
KERNEL_CODE_PROPERTY(RESERVED1, 12, 4),
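The field layout in this hunk can be sanity-checked with a small standalone sketch. `fieldMask` and `descriptorBit` are hypothetical helpers, not the real `KERNEL_CODE_PROPERTY` macro expansion; the shift and width values are copied from the hunk, and the base bit of 448 is an assumption inferred from the AMDGPUUsage.rst table, which lists ENABLE_SGPR_PRIVATE_SEGMENT_SIZE (shift 6) at bit 454.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not the LLVM macro): mask for a `width`-bit field
// whose least significant bit sits at `shift`.
constexpr uint16_t fieldMask(unsigned shift, unsigned width) {
  return static_cast<uint16_t>(((1u << width) - 1u) << shift);
}

// Shift/width pairs copied from the enum in the hunk above.
constexpr uint16_t Reserved0Mask = fieldMask(7, 2);
constexpr uint16_t UsesCUStoresMask = fieldMask(9, 1);
constexpr uint16_t EnableWavefrontSize32Mask = fieldMask(10, 1);

// Assumption: kernel_code_properties starts at descriptor bit 448
// (454 in the docs table minus shift 6 for ENABLE_SGPR_PRIVATE_SEGMENT_SIZE).
constexpr unsigned KernelCodePropertiesBaseBit = 448;

// Absolute kernel descriptor bit for a field starting at `shift`.
constexpr unsigned descriptorBit(unsigned shift) {
  return KernelCodePropertiesBaseBit + shift;
}

static_assert(Reserved0Mask == 0x0180, "RESERVED0 covers bits 7 and 8");
static_assert(UsesCUStoresMask == 0x0200, "USES_CU_STORES is bit 9");
static_assert((Reserved0Mask & UsesCUStoresMask) == 0, "fields do not overlap");
static_assert((UsesCUStoresMask & EnableWavefrontSize32Mask) == 0,
              "fields do not overlap");

// Runtime usage example: where the new field lands in the full descriptor.
inline void checkKernelCodePropertyLayout() {
  assert(descriptorBit(7) == 455);  // low bit of RESERVED0
  assert(descriptorBit(9) == 457);  // USES_CU_STORES
  assert(descriptorBit(10) == 458); // ENABLE_WAVEFRONT_SIZE32
}
```

Under these assumptions, shrinking RESERVED0 from three bits to two frees bit 9 of the 16-bit field, which is the slot the new property takes.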
7 changes: 7 additions & 0 deletions llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -286,6 +286,12 @@ def FeatureSafeCUPrefetch : SubtargetFeature<"safe-cu-prefetch",
"VMEM CU scope prefetches do not fail on illegal address"
>;

def FeatureCUStores : SubtargetFeature<"cu-stores",
"HasCUStores",
"true",
"Whether SCOPE_CU stores can be used on GFX12.5"
>;

def FeatureVcmpxExecWARHazard : SubtargetFeature<"vcmpx-exec-war-hazard",
"HasVcmpxExecWARHazard",
"true",
@@ -1988,6 +1994,7 @@ def FeatureISAVersion12 : FeatureSet<
def FeatureISAVersion12_50 : FeatureSet<
[FeatureGFX12,
FeatureGFX1250Insts,
FeatureCUStores,
FeatureCuMode,
Feature64BitLiterals,
FeatureLDSBankCount32,
6 changes: 5 additions & 1 deletion llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
@@ -552,6 +552,7 @@ const MCExpr *AMDGPUAsmPrinter::getAmdhsaKernelCodeProperties(
MCContext &Ctx = MF.getContext();
uint16_t KernelCodeProperties = 0;
const GCNUserSGPRUsageInfo &UserSGPRInfo = MFI.getUserSGPRInfo();
const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();

if (UserSGPRInfo.hasPrivateSegmentBuffer()) {
KernelCodeProperties |=
@@ -581,10 +582,13 @@ const MCExpr *AMDGPUAsmPrinter::getAmdhsaKernelCodeProperties(
KernelCodeProperties |=
amdhsa::KERNEL_CODE_PROPERTY_ENABLE_SGPR_PRIVATE_SEGMENT_SIZE;
}
if (MF.getSubtarget<GCNSubtarget>().isWave32()) {
if (ST.isWave32()) {
KernelCodeProperties |=
amdhsa::KERNEL_CODE_PROPERTY_ENABLE_WAVEFRONT_SIZE32;
}
if (isGFX1250(ST) && ST.hasCUStores()) {
KernelCodeProperties |= amdhsa::KERNEL_CODE_PROPERTY_USES_CU_STORES;
}

// CurrentProgramInfo.DynamicCallStack is a MCExpr and could be
// un-evaluatable at this point so it cannot be conditionally checked here.
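The effect of the two conditions in this hunk can be illustrated with a reduced model. `SubtargetSketch` and `kernelCodeProperties` are hypothetical stand-ins for `GCNSubtarget` and the AsmPrinter logic, and the two bit values are assumptions derived from the shifts (9 and 10) in AMDHSAKernelDescriptor.h; the other property bits are left out.

```cpp
#include <cassert>
#include <cstdint>

// Bit values assumed from the header hunk: USES_CU_STORES at shift 9,
// ENABLE_WAVEFRONT_SIZE32 at shift 10.
constexpr uint16_t UsesCUStores = 1u << 9;
constexpr uint16_t EnableWavefrontSize32 = 1u << 10;

// Hypothetical, heavily simplified stand-in for GCNSubtarget.
struct SubtargetSketch {
  bool IsGFX1250;
  bool HasCUStores;
  bool IsWave32;
};

// Mirrors only the two conditions shown in the hunk above.
inline uint16_t kernelCodeProperties(const SubtargetSketch &ST) {
  uint16_t Props = 0;
  if (ST.IsWave32)
    Props |= EnableWavefrontSize32;
  // The bit is only set when the target is gfx1250 AND the feature is on.
  if (ST.IsGFX1250 && ST.HasCUStores)
    Props |= UsesCUStores;
  return Props;
}

// Usage example covering the +cu-stores and -cu-stores cases.
inline void checkKernelCodeProperties() {
  const SubtargetSketch WithCU{true, true, true};
  const SubtargetSketch WithoutCU{true, false, true};
  assert(kernelCodeProperties(WithCU) == 0x0600);
  assert(kernelCodeProperties(WithoutCU) == 0x0400);
}
```

In this model the bit stays clear on any target other than gfx1250 even if the feature flag were set, which is what the `isGFX1250(ST)` guard expresses.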
6 changes: 6 additions & 0 deletions llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
@@ -6066,6 +6066,12 @@ bool AMDGPUAsmParser::ParseDirectiveAMDHSAKernel() {
ExprVal, ValRange);
if (Val)
ImpliedUserSGPRCount += 1;
} else if (ID == ".amdhsa_uses_cu_stores") {
if (!isGFX1250())
Contributor

If this is supposed to be a software-controlled setting, it should probably be a separate attribute.

Contributor Author

I don't understand. Only .amdhsa_uses_cu_stores needs to be in the metadata. The intention is that the runtime can check whether the code was built with +cu-stores or -cu-stores.

return Error(IDRange.Start, "directive requires gfx12.5", IDRange);

PARSE_BITS_ENTRY(KD.kernel_code_properties,
KERNEL_CODE_PROPERTY_USES_CU_STORES, ExprVal, ValRange);
} else if (ID == ".amdhsa_wavefront_size32") {
EXPR_RESOLVE_OR_ERROR(EvaluatableExpr);
if (IVersion.Major < 10)
3 changes: 3 additions & 0 deletions llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp
@@ -2556,6 +2556,9 @@ Expected<bool> AMDGPUDisassembler::decodeKernelDescriptorDirective(
KERNEL_CODE_PROPERTY_ENABLE_SGPR_FLAT_SCRATCH_INIT);
PRINT_DIRECTIVE(".amdhsa_user_sgpr_private_segment_size",
KERNEL_CODE_PROPERTY_ENABLE_SGPR_PRIVATE_SEGMENT_SIZE);
if (isGFX1250())
PRINT_DIRECTIVE(".amdhsa_uses_cu_stores",
KERNEL_CODE_PROPERTY_USES_CU_STORES);

if (TwoByteBuffer & KERNEL_CODE_PROPERTY_RESERVED0)
return createReservedKDBitsError(KERNEL_CODE_PROPERTY_RESERVED0,
3 changes: 3 additions & 0 deletions llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -248,6 +248,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
bool HasVmemPrefInsts = false;
bool HasSafeSmemPrefetch = false;
bool HasSafeCUPrefetch = false;
bool HasCUStores = false;
bool HasVcmpxExecWARHazard = false;
bool HasLdsBranchVmemWARHazard = false;
bool HasNSAtoVMEMBug = false;
@@ -998,6 +999,8 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,

bool hasSafeCUPrefetch() const { return HasSafeCUPrefetch; }

bool hasCUStores() const { return HasCUStores; }

// Has s_cmpk_* instructions.
bool hasSCmpK() const { return getGeneration() < GFX12; }

5 changes: 5 additions & 0 deletions llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp
@@ -440,6 +440,11 @@ void AMDGPUTargetAsmStreamer::EmitAmdhsaKernelDescriptor(
amdhsa::KERNEL_CODE_PROPERTY_ENABLE_SGPR_PRIVATE_SEGMENT_SIZE_SHIFT,
amdhsa::KERNEL_CODE_PROPERTY_ENABLE_SGPR_PRIVATE_SEGMENT_SIZE,
".amdhsa_user_sgpr_private_segment_size");
if (isGFX1250(STI))
PrintField(KD.kernel_code_properties,
amdhsa::KERNEL_CODE_PROPERTY_USES_CU_STORES_SHIFT,
amdhsa::KERNEL_CODE_PROPERTY_USES_CU_STORES,
".amdhsa_uses_cu_stores");
if (IVersion.Major >= 10)
PrintField(KD.kernel_code_properties,
amdhsa::KERNEL_CODE_PROPERTY_ENABLE_WAVEFRONT_SIZE32_SHIFT,
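The shift-and-mask extraction behind each `PrintField` call can be sketched as follows. `formatField` is a hypothetical helper, not the streamer's actual implementation; the single-space separator is an assumption chosen to match the `.amdhsa_uses_cu_stores 1` form checked in the new codegen test, and the real streamer's whitespace may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: isolate a field with `mask`, move it down by `shift`,
// and render it as "<directive> <value>".
inline std::string formatField(uint16_t bits, unsigned shift, uint16_t mask,
                               const std::string &directive) {
  const unsigned value = static_cast<unsigned>(bits & mask) >> shift;
  return directive + " " + std::to_string(value);
}

// Usage example with the USES_CU_STORES field (shift 9, mask 0x0200, both
// assumed from the AMDHSAKernelDescriptor.h hunk).
inline void checkFormatField() {
  assert(formatField(0x0200, 9, 0x0200, ".amdhsa_uses_cu_stores") ==
         ".amdhsa_uses_cu_stores 1");
  assert(formatField(0x0400, 9, 0x0200, ".amdhsa_uses_cu_stores") ==
         ".amdhsa_uses_cu_stores 0");
}
```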
4 changes: 3 additions & 1 deletion llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -2564,7 +2564,9 @@ bool SIGfx12CacheControl::finalizeStore(MachineInstr &MI, bool Atomic) const {

// GFX12.5 only: Require SCOPE_SE on stores that may hit the scratch address
// space.
if (TII->mayAccessScratchThroughFlat(MI) && Scope == CPol::SCOPE_CU)
// We also require SCOPE_SE minimum if we do not have the "cu-stores" feature.
if (Scope == CPol::SCOPE_CU &&
(!ST.hasCUStores() || TII->mayAccessScratchThroughFlat(MI)))
return setScope(MI, CPol::SCOPE_SE);

return false;
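The new condition is easier to review as a pure function. The sketch below is a hypothetical model: `Scope` and `finalizeStoreScope` are illustrative names rather than LLVM API, and the enum only mirrors the narrowest-to-widest ordering of the `CPol` scopes.

```cpp
#include <cassert>

// Hypothetical mirror of the cache-policy scopes, narrowest to widest.
enum class Scope { CU, SE, DEV, SYS };

// Models the updated check: a CU-scope store is widened to SE when the
// subtarget lacks "cu-stores" or when the store may reach scratch through a
// flat address. Any other scope is returned unchanged.
constexpr Scope finalizeStoreScope(Scope requested, bool hasCUStores,
                                   bool mayAccessScratchThroughFlat) {
  return (requested == Scope::CU &&
          (!hasCUStores || mayAccessScratchThroughFlat))
             ? Scope::SE
             : requested;
}

// Usage example following the global and flat store cases in the new test.
inline void checkStoreScopes() {
  // Global store with +cu-stores: CU scope is kept.
  assert(finalizeStoreScope(Scope::CU, true, false) == Scope::CU);
  // Global store with -cu-stores: widened to SE.
  assert(finalizeStoreScope(Scope::CU, false, false) == Scope::SE);
  // Flat store that may hit scratch: SE regardless of the feature.
  assert(finalizeStoreScope(Scope::CU, true, true) == Scope::SE);
  // Wider scopes are never narrowed.
  assert(finalizeStoreScope(Scope::DEV, false, true) == Scope::DEV);
}
```

This lines up with the FileCheck expectations in gfx1250-no-scope-cu-stores.ll, where `global_store_b32` carries no scope suffix under the CU prefix and `scope:SCOPE_SE` under NOCU, while `flat_store_b32` carries `scope:SCOPE_SE` under both.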
100 changes: 100 additions & 0 deletions llvm/test/CodeGen/AMDGPU/gfx1250-no-scope-cu-stores.ll
@@ -0,0 +1,100 @@
; RUN: llc -mtriple=amdgcn-amd-amdhsa -O3 -mcpu=gfx1250 < %s | FileCheck --check-prefixes=GCN,CU %s
; RUN: llc -mtriple=amdgcn-amd-amdhsa -O3 -mcpu=gfx1250 -mattr=-cu-stores < %s | FileCheck --check-prefixes=GCN,NOCU %s

; Check that if -cu-stores is used, we use SCOPE_SE minimum on all stores.

; GCN: flat_store:
; CU: flat_store_b32 v{{.*}}, v{{.*}}, s{{.*}} scope:SCOPE_SE
; NOCU: flat_store_b32 v{{.*}}, v{{.*}}, s{{.*}} scope:SCOPE_SE
; GCN: .amdhsa_kernel flat_store
; CU: .amdhsa_uses_cu_stores 1
; NOCU: .amdhsa_uses_cu_stores 0
define amdgpu_kernel void @flat_store(ptr %dst, i32 %val) {
entry:
store i32 %val, ptr %dst
ret void
}

; GCN: global_store:
; CU: global_store_b32 v{{.*}}, v{{.*}}, s{{.*}}{{$}}
; NOCU: global_store_b32 v{{.*}}, v{{.*}}, s{{.*}} scope:SCOPE_SE
; GCN: .amdhsa_kernel global_store
; CU: .amdhsa_uses_cu_stores 1
; NOCU: .amdhsa_uses_cu_stores 0
define amdgpu_kernel void @global_store(ptr addrspace(1) %dst, i32 %val) {
entry:
store i32 %val, ptr addrspace(1) %dst
ret void
}

; GCN: local_store:
; CU: ds_store_b32 v{{.*}}, v{{.*}}{{$}}
; NOCU: ds_store_b32 v{{.*}}, v{{.*}}{{$}}
; GCN: .amdhsa_kernel local_store
; CU: .amdhsa_uses_cu_stores 1
; NOCU: .amdhsa_uses_cu_stores 0
define amdgpu_kernel void @local_store(ptr addrspace(3) %dst, i32 %val) {
entry:
store i32 %val, ptr addrspace(3) %dst
ret void
}

; GCN: scratch_store:
; CU: scratch_store_b32 off, v{{.*}}, s{{.*}} scope:SCOPE_SE
; NOCU: scratch_store_b32 off, v{{.*}}, s{{.*}} scope:SCOPE_SE
; GCN: .amdhsa_kernel scratch_store
; CU: .amdhsa_uses_cu_stores 1
; NOCU: .amdhsa_uses_cu_stores 0
define amdgpu_kernel void @scratch_store(ptr addrspace(5) %dst, i32 %val) {
entry:
store i32 %val, ptr addrspace(5) %dst
ret void
}

; GCN: flat_atomic_store:
; CU: flat_store_b32 v{{.*}}, v{{.*}}, s{{.*}} scope:SCOPE_SE
; NOCU: flat_store_b32 v{{.*}}, v{{.*}}, s{{.*}} scope:SCOPE_SE
; GCN: .amdhsa_kernel flat_atomic_store
; CU: .amdhsa_uses_cu_stores 1
; NOCU: .amdhsa_uses_cu_stores 0
define amdgpu_kernel void @flat_atomic_store(ptr %dst, i32 %val) {
entry:
store atomic i32 %val, ptr %dst syncscope("wavefront") unordered, align 4
ret void
}

; GCN: global_atomic_store:
; CU: global_store_b32 v{{.*}}, v{{.*}}, s{{.*}}{{$}}
; NOCU: global_store_b32 v{{.*}}, v{{.*}}, s{{.*}} scope:SCOPE_SE
; GCN: .amdhsa_kernel global_atomic_store
; CU: .amdhsa_uses_cu_stores 1
; NOCU: .amdhsa_uses_cu_stores 0
define amdgpu_kernel void @global_atomic_store(ptr addrspace(1) %dst, i32 %val) {
entry:
store atomic i32 %val, ptr addrspace(1) %dst syncscope("wavefront") unordered, align 4
ret void
}

; GCN: local_atomic_store:
; CU: ds_store_b32 v{{.*}}, v{{.*}}{{$}}
; NOCU: ds_store_b32 v{{.*}}, v{{.*}}{{$}}
; GCN: .amdhsa_kernel local_atomic_store
; CU: .amdhsa_uses_cu_stores 1
; NOCU: .amdhsa_uses_cu_stores 0
define amdgpu_kernel void @local_atomic_store(ptr addrspace(3) %dst, i32 %val) {
entry:
store atomic i32 %val, ptr addrspace(3) %dst syncscope("wavefront") unordered, align 4
ret void
}

; GCN: scratch_atomic_store:
; CU: scratch_store_b32 off, v{{.*}}, s{{.*}} scope:SCOPE_SE
; NOCU: scratch_store_b32 off, v{{.*}}, s{{.*}} scope:SCOPE_SE
; GCN: .amdhsa_kernel scratch_atomic_store
; CU: .amdhsa_uses_cu_stores 1
; NOCU: .amdhsa_uses_cu_stores 0
define amdgpu_kernel void @scratch_atomic_store(ptr addrspace(5) %dst, i32 %val) {
entry:
store atomic i32 %val, ptr addrspace(5) %dst syncscope("wavefront") unordered, align 4
ret void
}
@@ -13,10 +13,10 @@
# RES_4_2: ; error decoding test.kd: kernel descriptor reserved bits in range (511:480) set
# RES_4_2-NEXT: ; decoding failed region as bytes

# RUN: yaml2obj %s -DGPU=GFX90A -DKD=00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006000000000000 \
# RUN: | llvm-objdump --disassemble-symbols=test.kd - | FileCheck %s --check-prefix=RES_457
# RES_457: ; error decoding test.kd: kernel descriptor reserved bits in range (457:455) set
# RES_457-NEXT: ; decoding failed region as bytes
# RUN: yaml2obj %s -DGPU=GFX90A -DKD=00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003000000000000 \
# RUN: | llvm-objdump --disassemble-symbols=test.kd - | FileCheck %s --check-prefix=RES_456
# RES_456: ; error decoding test.kd: kernel descriptor reserved bits in range (456:455) set
# RES_456-NEXT: ; decoding failed region as bytes

# RUN: yaml2obj %s -DGPU=GFX90A -DKD=0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000c000000000000 \
# RUN: | llvm-objdump --disassemble-symbols=test.kd - | FileCheck %s --check-prefix=WF32
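The change in the expected range, from `(457:455)` to `(456:455)`, follows from the narrower RESERVED0 mask. A minimal sketch of that arithmetic is shown below. `reservedBitRange` is a hypothetical helper and not the disassembler's own code, the masks are derived from the shift/width pairs in AMDHSAKernelDescriptor.h, and the base bit of 448 for `kernel_code_properties` is an assumption taken from the docs table.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: given a non-zero field mask inside a 16-bit word that
// starts at descriptor bit `baseBit`, return {highest bit, lowest bit} of the
// mask as absolute descriptor bit numbers.
inline std::pair<unsigned, unsigned> reservedBitRange(uint16_t mask,
                                                      unsigned baseBit) {
  assert(mask != 0 && "mask must have at least one bit set");
  unsigned lo = 0;
  while (((mask >> lo) & 1u) == 0)
    ++lo; // index of the lowest set bit
  unsigned hi = 15;
  while (((mask >> hi) & 1u) == 0)
    --hi; // index of the highest set bit
  return {baseBit + hi, baseBit + lo};
}

// Usage example: old RESERVED0 (shift 7, width 3) versus new (shift 7, width 2).
inline void checkReservedRanges() {
  const std::pair<unsigned, unsigned> OldRange = reservedBitRange(0x0380, 448);
  const std::pair<unsigned, unsigned> NewRange = reservedBitRange(0x0180, 448);
  assert(OldRange.first == 457 && OldRange.second == 455);
  assert(NewRange.first == 456 && NewRange.second == 455);
}
```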