Skip to content

[AMDGPU] Add builtins for wave reduction intrinsics #150170

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: users/easyonaadit/amdgpu/wave-reduce-intrinsics-bitwise
Choose a base branch
from

Conversation

easyonaadit
Copy link
Contributor

No description provided.

Copy link
Contributor Author

easyonaadit commented Jul 23, 2025

@easyonaadit easyonaadit changed the title Add builtins for wave reduction intrinsics [AMDGPU] Add builtins for wave reduction intrinsics Jul 23, 2025
Copy link

github-actions bot commented Jul 23, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@easyonaadit easyonaadit force-pushed the users/easyonaadit/amdgpu/wave-reduce-builtins branch from 59435cf to a5486f1 Compare July 23, 2025 06:13
@easyonaadit easyonaadit marked this pull request as ready for review July 23, 2025 06:18
@llvmbot llvmbot added clang Clang issues not falling into any other category backend:AMDGPU clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:codegen IR generation bugs: mangling, exceptions, etc. labels Jul 23, 2025
@llvmbot
Copy link
Member

llvmbot commented Jul 23, 2025

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-amdgpu

Author: Aaditya (easyonaadit)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/150170.diff

3 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+25)
  • (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+58)
  • (modified) clang/test/CodeGenOpenCL/builtins-amdgcn.cl (+378)
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 878543566f0e3..c8b324193e9fb 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -351,6 +351,31 @@ BUILTIN(__builtin_amdgcn_endpgm, "v", "nr")
 BUILTIN(__builtin_amdgcn_get_fpenv, "WUi", "n")
 BUILTIN(__builtin_amdgcn_set_fpenv, "vWUi", "n")
 
+//===----------------------------------------------------------------------===//
+
+// Wave Reduction builtins.
+
+//===----------------------------------------------------------------------===//
+
+BUILTIN(__builtin_amdgcn_wave_reduce_add_i32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_sub_i32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_min_i32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_min_u32, "UiUii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_max_i32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_max_u32, "UiUii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_and_b32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_or_b32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_xor_b32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_add_i64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_sub_i64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_min_i64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_min_u64, "WUiWUii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_max_i64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_max_u64, "WUiWUii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_and_b64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_or_b64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_xor_b64, "WiWii", "nc")
+
 //===----------------------------------------------------------------------===//
 // R600-NI only builtins.
 //===----------------------------------------------------------------------===//
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index 7dccf82b1a7a3..28ea918b97cc5 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -295,11 +295,69 @@ void CodeGenFunction::AddAMDGPUFenceAddressSpaceMMRA(llvm::Instruction *Inst,
   Inst->setMetadata(LLVMContext::MD_mmra, MMRAMetadata::getMD(Ctx, MMRAs));
 }
 
+static Intrinsic::ID getIntrinsicIDforWaveReduction(unsigned BuiltinID) {
+  switch (BuiltinID) {
+  default:
+    llvm_unreachable("Unknown BuiltinID for wave reduction");
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_add_i32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_add_i64:
+    return Intrinsic::amdgcn_wave_reduce_add;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_sub_i32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_sub_i64:
+    return Intrinsic::amdgcn_wave_reduce_sub;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_min_i32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_min_i64:
+    return Intrinsic::amdgcn_wave_reduce_min;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_min_u32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_min_u64:
+    return Intrinsic::amdgcn_wave_reduce_umin;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_max_i32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_max_i64:
+    return Intrinsic::amdgcn_wave_reduce_max;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_max_u32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_max_u64:
+    return Intrinsic::amdgcn_wave_reduce_umax;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_and_b32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_and_b64:
+    return Intrinsic::amdgcn_wave_reduce_and;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_or_b32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_or_b64:
+    return Intrinsic::amdgcn_wave_reduce_or;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_xor_b32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_xor_b64:
+    return Intrinsic::amdgcn_wave_reduce_xor;
+  }
+}
+
 Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                                               const CallExpr *E) {
   llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
   llvm::SyncScope::ID SSID;
   switch (BuiltinID) {
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_add_i32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_sub_i32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_min_i32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_min_u32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_max_i32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_max_u32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_and_b32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_or_b32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_xor_b32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_add_i64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_sub_i64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_min_i64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_min_u64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_max_i64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_max_u64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_and_b64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_or_b64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_xor_b64: {
+    Intrinsic::ID IID = getIntrinsicIDforWaveReduction(BuiltinID);
+    llvm::Value *Value = EmitScalarExpr(E->getArg(0));
+    llvm::Value *Strategy = EmitScalarExpr(E->getArg(1));
+    llvm::Function *F = CGM.getIntrinsic(IID, {Value->getType()});
+    return Builder.CreateCall(F, {Value, Strategy});
+  }
   case AMDGPU::BI__builtin_amdgcn_div_scale:
   case AMDGPU::BI__builtin_amdgcn_div_scalef: {
     // Translate from the intrinsics's struct return to the builtin's out
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
index bf022bc6eb446..16f5a524f3094 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
@@ -398,6 +398,384 @@ void test_s_sendmsghalt_var(int in)
   __builtin_amdgcn_s_sendmsghalt(1, in);
 }
 
+// CHECK-LABEL: @test_wave_reduce_add_i32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.add.i32(
+void test_wave_reduce_add_i32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_add_i64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.add.i64(
+void test_wave_reduce_add_i64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_add_i32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.add.i32(
+void test_wave_reduce_add_i32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_add_i64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.add.i64(
+void test_wave_reduce_add_i64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_add_i32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.add.i32(
+void test_wave_reduce_add_i32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_add_i64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.add.i64(
+void test_wave_reduce_add_i64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.sub.i32(
+void test_wave_reduce_sub_i32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.sub.i64(
+void test_wave_reduce_sub_i64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.sub.i32(
+void test_wave_reduce_sub_i32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.sub.i64(
+void test_wave_reduce_sub_i64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.sub.i32(
+void test_wave_reduce_sub_i32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.sub.i64(
+void test_wave_reduce_sub_i64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.and.i32(
+void test_wave_reduce_and_b32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.and.i64(
+void test_wave_reduce_and_b64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.and.i32(
+void test_wave_reduce_and_b32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.and.i64(
+void test_wave_reduce_and_b64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.and.i32(
+void test_wave_reduce_and_b32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.and.i64(
+void test_wave_reduce_and_b64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.or.i32(
+void test_wave_reduce_or_b32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.or.i64(
+void test_wave_reduce_or_b64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.or.i32(
+void test_wave_reduce_or_b32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.or.i64(
+void test_wave_reduce_or_b64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.or.i32(
+void test_wave_reduce_or_b32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.or.i64(
+void test_wave_reduce_or_b64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.xor.i32(
+void test_wave_reduce_xor_b32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.xor.i64(
+void test_wave_reduce_xor_b64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.xor.i32(
+void test_wave_reduce_xor_b32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.xor.i64(
+void test_wave_reduce_xor_b64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.xor.i32(
+void test_wave_reduce_xor_b32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.xor.i64(
+void test_wave_reduce_xor_b64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.min.i32(
+void test_wave_reduce_min_i32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.min.i64(
+void test_wave_reduce_min_i64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.min.i32(
+void test_wave_reduce_min_i32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.min.i64(
+void test_wave_reduce_min_i64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.min.i32(
+void test_wave_reduce_min_i32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.min.i64(
+void test_wave_reduce_min_i64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umin.i32(
+void test_wave_reduce_min_u32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umin.i64(
+void test_wave_reduce_min_u64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umin.i32(
+void test_wave_reduce_min_u32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umin.i64(
+void test_wave_reduce_min_u64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umin.i32(
+void test_wave_reduce_min_u32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umin.i64(
+void test_wave_reduce_min_u64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.max.i32(
+void test_wave_reduce_max_i32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.max.i64(
+void test_wave_reduce_max_i64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.max.i32(
+void test_wave_reduce_max_i32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.max.i64(
+void test_wave_reduce_max_i64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.max.i32(
+void test_wave_reduce_max_i32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.max.i64(
+void test_wave_reduce_max_i64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umax.i32(
+void test_wave_reduce_max_u32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umax.i64(
+void test_wave_reduce_max_u64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umax.i32(
+void test_wave_reduce_max_u32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umax.i64(
+void test_wave_reduce_max_u64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umax.i32(
+void test_wave_reduce_max_u32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umax.i64(
+void test_wave_reduce_max_u64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u64(in, 2);
+}
+
 // CHECK-LABEL: @test_s_barrier
 // CHECK: {{.*}}call{{.*}} void @llvm.amdgcn.s.barrier(
 void test_s_barrier()

@llvmbot
Copy link
Member

llvmbot commented Jul 23, 2025

@llvm/pr-subscribers-clang-codegen

Author: Aaditya (easyonaadit)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/150170.diff

3 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+25)
  • (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+58)
  • (modified) clang/test/CodeGenOpenCL/builtins-amdgcn.cl (+378)
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 878543566f0e3..c8b324193e9fb 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -351,6 +351,31 @@ BUILTIN(__builtin_amdgcn_endpgm, "v", "nr")
 BUILTIN(__builtin_amdgcn_get_fpenv, "WUi", "n")
 BUILTIN(__builtin_amdgcn_set_fpenv, "vWUi", "n")
 
+//===----------------------------------------------------------------------===//
+
+// Wave Reduction builtins.
+
+//===----------------------------------------------------------------------===//
+
+BUILTIN(__builtin_amdgcn_wave_reduce_add_i32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_sub_i32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_min_i32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_min_u32, "UiUii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_max_i32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_max_u32, "UiUii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_and_b32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_or_b32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_xor_b32, "iii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_add_i64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_sub_i64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_min_i64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_min_u64, "WUiWUii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_max_i64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_max_u64, "WUiWUii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_and_b64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_or_b64, "WiWii", "nc")
+BUILTIN(__builtin_amdgcn_wave_reduce_xor_b64, "WiWii", "nc")
+
 //===----------------------------------------------------------------------===//
 // R600-NI only builtins.
 //===----------------------------------------------------------------------===//
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index 7dccf82b1a7a3..28ea918b97cc5 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -295,11 +295,69 @@ void CodeGenFunction::AddAMDGPUFenceAddressSpaceMMRA(llvm::Instruction *Inst,
   Inst->setMetadata(LLVMContext::MD_mmra, MMRAMetadata::getMD(Ctx, MMRAs));
 }
 
+static Intrinsic::ID getIntrinsicIDforWaveReduction(unsigned BuiltinID) {
+  switch (BuiltinID) {
+  default:
+    llvm_unreachable("Unknown BuiltinID for wave reduction");
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_add_i32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_add_i64:
+    return Intrinsic::amdgcn_wave_reduce_add;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_sub_i32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_sub_i64:
+    return Intrinsic::amdgcn_wave_reduce_sub;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_min_i32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_min_i64:
+    return Intrinsic::amdgcn_wave_reduce_min;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_min_u32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_min_u64:
+    return Intrinsic::amdgcn_wave_reduce_umin;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_max_i32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_max_i64:
+    return Intrinsic::amdgcn_wave_reduce_max;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_max_u32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_max_u64:
+    return Intrinsic::amdgcn_wave_reduce_umax;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_and_b32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_and_b64:
+    return Intrinsic::amdgcn_wave_reduce_and;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_or_b32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_or_b64:
+    return Intrinsic::amdgcn_wave_reduce_or;
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_xor_b32:
+  case clang::AMDGPU::BI__builtin_amdgcn_wave_reduce_xor_b64:
+    return Intrinsic::amdgcn_wave_reduce_xor;
+  }
+}
+
 Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                                               const CallExpr *E) {
   llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
   llvm::SyncScope::ID SSID;
   switch (BuiltinID) {
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_add_i32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_sub_i32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_min_i32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_min_u32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_max_i32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_max_u32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_and_b32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_or_b32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_xor_b32:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_add_i64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_sub_i64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_min_i64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_min_u64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_max_i64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_max_u64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_and_b64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_or_b64:
+  case AMDGPU::BI__builtin_amdgcn_wave_reduce_xor_b64: {
+    Intrinsic::ID IID = getIntrinsicIDforWaveReduction(BuiltinID);
+    llvm::Value *Value = EmitScalarExpr(E->getArg(0));
+    llvm::Value *Strategy = EmitScalarExpr(E->getArg(1));
+    llvm::Function *F = CGM.getIntrinsic(IID, {Value->getType()});
+    return Builder.CreateCall(F, {Value, Strategy});
+  }
   case AMDGPU::BI__builtin_amdgcn_div_scale:
   case AMDGPU::BI__builtin_amdgcn_div_scalef: {
     // Translate from the intrinsics's struct return to the builtin's out
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
index bf022bc6eb446..16f5a524f3094 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
@@ -398,6 +398,384 @@ void test_s_sendmsghalt_var(int in)
   __builtin_amdgcn_s_sendmsghalt(1, in);
 }
 
+// CHECK-LABEL: @test_wave_reduce_add_i32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.add.i32(
+void test_wave_reduce_add_i32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_add_i64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.add.i64(
+void test_wave_reduce_add_i64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_add_i32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.add.i32(
+void test_wave_reduce_add_i32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_add_i64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.add.i64(
+void test_wave_reduce_add_i64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_add_i32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.add.i32(
+void test_wave_reduce_add_i32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_add_i64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.add.i64(
+void test_wave_reduce_add_i64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_add_i64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.sub.i32(
+void test_wave_reduce_sub_i32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.sub.i64(
+void test_wave_reduce_sub_i64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.sub.i32(
+void test_wave_reduce_sub_i32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.sub.i64(
+void test_wave_reduce_sub_i64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.sub.i32(
+void test_wave_reduce_sub_i32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_sub_i64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.sub.i64(
+void test_wave_reduce_sub_i64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_sub_i64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.and.i32(
+void test_wave_reduce_and_b32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.and.i64(
+void test_wave_reduce_and_b64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.and.i32(
+void test_wave_reduce_and_b32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.and.i64(
+void test_wave_reduce_and_b64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.and.i32(
+void test_wave_reduce_and_b32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_and_b64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.and.i64(
+void test_wave_reduce_and_b64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_and_b64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.or.i32(
+void test_wave_reduce_or_b32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.or.i64(
+void test_wave_reduce_or_b64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.or.i32(
+void test_wave_reduce_or_b32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.or.i64(
+void test_wave_reduce_or_b64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.or.i32(
+void test_wave_reduce_or_b32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_or_b64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.or.i64(
+void test_wave_reduce_or_b64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_or_b64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.xor.i32(
+void test_wave_reduce_xor_b32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.xor.i64(
+void test_wave_reduce_xor_b64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.xor.i32(
+void test_wave_reduce_xor_b32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.xor.i64(
+void test_wave_reduce_xor_b64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.xor.i32(
+void test_wave_reduce_xor_b32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_xor_b64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.xor.i64(
+void test_wave_reduce_xor_b64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_xor_b64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.min.i32(
+void test_wave_reduce_min_i32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.min.i64(
+void test_wave_reduce_min_i64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.min.i32(
+void test_wave_reduce_min_i32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.min.i64(
+void test_wave_reduce_min_i64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.min.i32(
+void test_wave_reduce_min_i32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_i64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.min.i64(
+void test_wave_reduce_min_i64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_i64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umin.i32(
+void test_wave_reduce_min_u32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umin.i64(
+void test_wave_reduce_min_u64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umin.i32(
+void test_wave_reduce_min_u32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umin.i64(
+void test_wave_reduce_min_u64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umin.i32(
+void test_wave_reduce_min_u32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_min_u64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umin.i64(
+void test_wave_reduce_min_u64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_min_u64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.max.i32(
+void test_wave_reduce_max_i32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.max.i64(
+void test_wave_reduce_max_i64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.max.i32(
+void test_wave_reduce_max_i32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.max.i64(
+void test_wave_reduce_max_i64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.max.i32(
+void test_wave_reduce_max_i32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_i64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.max.i64(
+void test_wave_reduce_max_i64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_i64(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u32_default
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umax.i32(
+void test_wave_reduce_max_u32_default(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u32(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u64_default
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umax.i64(
+void test_wave_reduce_max_u64_default(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u64(in, 0);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u32_iterative
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umax.i32(
+void test_wave_reduce_max_u32_iterative(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u32(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u64_iterative
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umax.i64(
+void test_wave_reduce_max_u64_iterative(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u64(in, 1);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u32_dpp
+// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.umax.i32(
+void test_wave_reduce_max_u32_dpp(global int* out, int in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u32(in, 2);
+}
+
+// CHECK-LABEL: @test_wave_reduce_max_u64_dpp
+// CHECK: {{.*}}call{{.*}} i64 @llvm.amdgcn.wave.reduce.umax.i64(
+void test_wave_reduce_max_u64_dpp(global int* out, long in)
+{
+  *out = __builtin_amdgcn_wave_reduce_max_u64(in, 2);
+}
+
 // CHECK-LABEL: @test_s_barrier
 // CHECK: {{.*}}call{{.*}} void @llvm.amdgcn.s.barrier(
 void test_s_barrier()

@easyonaadit easyonaadit force-pushed the users/easyonaadit/amdgpu/wave-reduce-builtins branch 3 times, most recently from b45b107 to 01432a0 Compare July 25, 2025 10:31

//===----------------------------------------------------------------------===//

BUILTIN(__builtin_amdgcn_wave_reduce_add_i32, "iii", "nc")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be "u32" instead of "i32"? Also we should probably start using whatever magic character sequence gives int32_t/uint32_t instead of i/u

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could add separate builtins for i and u to avoid ambiguity at the HIP level, though the codegen is the same for both.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate builtins not necessary, just switch to the unsigned ones

@easyonaadit easyonaadit force-pushed the users/easyonaadit/amdgpu/wave-reduce-builtins branch from dcaea72 to 45ff803 Compare July 28, 2025 09:15
@@ -357,17 +357,17 @@ BUILTIN(__builtin_amdgcn_set_fpenv, "vWUi", "n")

//===----------------------------------------------------------------------===//

BUILTIN(__builtin_amdgcn_wave_reduce_add_i32, "ZiZiZi", "nc")
BUILTIN(__builtin_amdgcn_wave_reduce_sub_i32, "ZiZiZi", "nc")
BUILTIN(__builtin_amdgcn_wave_reduce_add_u32, "ZiZiZi", "nc")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to use ZUi to match

@@ -398,6 +398,384 @@ void test_s_sendmsghalt_var(int in)
__builtin_amdgcn_s_sendmsghalt(1, in);
}

// CHECK-LABEL: @test_wave_reduce_add_u32_default
// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wave.reduce.add.i32(
void test_wave_reduce_add_u32_default(global int* out, int in)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arsenm should these test cases be changed to unsigned instead of int as well?

@easyonaadit easyonaadit force-pushed the users/easyonaadit/amdgpu/wave-reduce-builtins branch from 508d60c to 8d9130b Compare July 28, 2025 11:36
@easyonaadit easyonaadit changed the base branch from users/easyonaadit/amdgpu/wave-reduce-intrinsics-i64 to users/easyonaadit/amdgpu/wave-reduce-intrinsics-bitwise July 30, 2025 10:45
@easyonaadit easyonaadit force-pushed the users/easyonaadit/amdgpu/wave-reduce-builtins branch from 8d9130b to e9752fc Compare July 30, 2025 10:46
@easyonaadit easyonaadit force-pushed the users/easyonaadit/amdgpu/wave-reduce-intrinsics-bitwise branch from e63d5b7 to 6bb5257 Compare July 30, 2025 11:27
@easyonaadit easyonaadit force-pushed the users/easyonaadit/amdgpu/wave-reduce-builtins branch from e9752fc to bc06c18 Compare July 30, 2025 11:27
@easyonaadit easyonaadit force-pushed the users/easyonaadit/amdgpu/wave-reduce-intrinsics-bitwise branch from 6bb5257 to 6409f93 Compare July 30, 2025 11:33
@easyonaadit easyonaadit force-pushed the users/easyonaadit/amdgpu/wave-reduce-builtins branch from bc06c18 to 2d22d22 Compare July 30, 2025 11:33
@easyonaadit easyonaadit force-pushed the users/easyonaadit/amdgpu/wave-reduce-builtins branch from 2d22d22 to b3a7a98 Compare July 31, 2025 08:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:AMDGPU clang:codegen IR generation bugs: mangling, exceptions, etc. clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants