
A question about the fused_gemm_epilogue operator #71943


Closed
houj04 opened this issue Mar 27, 2025 · 1 comment
Labels: status/close (closed), type/question (user question)



houj04 commented Mar 27, 2025

Please ask your question

Background: I am developing the fused_gemm_epilogue operator for a non-GPU device.

Part 1: the forward operator.

I used the following code to try out the operator:

import paddle
import numpy as np

from paddle import _C_ops

def gelu(x):
    y_ref = (
        0.5
        * x
        * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3))))
    )
    return y_ref.astype(x.dtype)


def relu(x):
    mask = x > 0
    return x * mask


def get_output(X, Y, bias, act):
    out = np.dot(X, Y) + bias
    if act == 'relu':
        return relu(out)
    elif act == 'gelu':
        return gelu(out)
    else:
        return out


x_np = np.random.random((8, 4)).astype(np.float32) - 0.5
y_np = np.random.random((4, 128)).astype(np.float32) - 0.5
bias_np = np.random.random((128,)).astype(np.float32) - 0.5
x = paddle.to_tensor(x_np)
y = paddle.to_tensor(y_np)
bias = paddle.to_tensor(bias_np)
x.stop_gradient = False
y.stop_gradient = False

out1, _ = _C_ops.fused_gemm_epilogue(x, y, bias, False, False, 'none')
out2, _ = _C_ops.fused_gemm_epilogue(x, y, bias, False, False, 'relu')
out3, _ = _C_ops.fused_gemm_epilogue(x, y, bias, False, False, 'gelu')

out_np1 = get_output(x_np, y_np, bias_np, 'none')
out_np2 = get_output(x_np, y_np, bias_np, 'relu')
out_np3 = get_output(x_np, y_np, bias_np, 'gelu')

np.testing.assert_allclose(out1, out_np1, atol=2e-04)
np.testing.assert_allclose(out2, out_np2, atol=2e-04)
np.testing.assert_allclose(out3, out_np3, atol=1e-03)

As the code shows, fused_gemm_epilogue is called three times, and the only argument that differs between the calls is the activation.

Running with export GLOG_v=6 enabled, the three calls print log lines like the following, which differ only in the trailing reserve_space value:

I0327 15:04:08.895666 153931 fused_gemm_epilogue_kernel.cu:97] x.shape={8, 4}, y.shape={4, 128}, out.shape={8, 128}, M=8, N=128, K=4, trans_x=0, trans_y=0, activation=none, fused_type=3, reserve_space=0
I0327 15:04:08.981643 153931 fused_gemm_epilogue_kernel.cu:97] x.shape={8, 4}, y.shape={4, 128}, out.shape={8, 128}, M=8, N=128, K=4, trans_x=0, trans_y=0, activation=relu, fused_type=7, reserve_space=0x4f38a00
I0327 15:04:08.986799 153931 fused_gemm_epilogue_kernel.cu:97] x.shape={8, 4}, y.shape={4, 128}, out.shape={8, 128}, M=8, N=128, K=4, trans_x=0, trans_y=0, activation=gelu, fused_type=8, reserve_space=0x9705d60
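
The same observation can be reproduced from Python, since the operator returns reserve_space as its second output (the test above discards it with _). A minimal sketch, reusing the tensors defined above; the assumption here is that the second return value stays None or unallocated when activation is 'none', which may vary across devices and Paddle versions:

# Inspect the second return value (reserve_space) of the three calls above.
# Assumption: it stays unallocated for activation='none' and becomes a real
# tensor for 'relu' / 'gelu', matching the GLOG lines shown above.
for act in ('none', 'relu', 'gelu'):
    _, rs = _C_ops.fused_gemm_epilogue(x, y, bias, False, False, act)
    print(act, '->', 'None' if rs is None else rs.shape)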

This corresponds to the following location in paddle/phi/kernels/fusion/gpu/fused_gemm_epilogue_kernel.cu:

template <typename T, typename Context>
void FusedGemmEpilogueKernel(const Context& dev_ctx,
                             const DenseTensor& x,
                             const DenseTensor& y,
                             const DenseTensor& bias,
                             const bool trans_x,
                             const bool trans_y,
                             const std::string& activation,
                             DenseTensor* out,
                             DenseTensor* reserve_space) {

Part 2: the backward unit test.

I tried to run the unit test for the backward operator, i.e. test/legacy_test/test_fused_gemm_epilogue_grad_op.py.

It turned out that the test does not actually execute; it is skipped. Looking into it, the skip logic appears to be wrong: the condition below is equivalent to running the test only when core.is_compiled_with_cuda() and is_rocm_gfx928() both hold, and skipping it in every other configuration (a quick truth-table check of this claim follows the decorator below).

@unittest.skipIf(
    not core.is_compiled_with_cuda() or not is_rocm_gfx928(),
    "core is not compiled with CUDA",
)
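
To make the equivalence explicit, here is a small self-contained check in plain Python (no Paddle needed; A and B merely stand in for core.is_compiled_with_cuda() and is_rocm_gfx928(), which are not called here):

# Verify that "skip if (not A or not B)" is the same as "run only when A and B".
# A stands for core.is_compiled_with_cuda(), B for is_rocm_gfx928().
for A in (True, False):
    for B in (True, False):
        skip = (not A) or (not B)
        assert (not skip) == (A and B)
        print(f"A={A}, B={B} -> skipped={skip}")

In other words, a plain CUDA build that is not ROCm gfx928 is always skipped.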

Moreover, every case in the file uses the activation 'none', so the other activations are clearly not tested. I therefore made the following change:

diff --git a/test/legacy_test/test_fused_gemm_epilogue_grad_op.py b/test/legacy_test/test_fused_gemm_epilogue_grad_op.py
index a8b7c760fb..aa42cd32f7 100644
--- a/test/legacy_test/test_fused_gemm_epilogue_grad_op.py
+++ b/test/legacy_test/test_fused_gemm_epilogue_grad_op.py
@@ -42,10 +42,10 @@ def get_outputs(DOut, X, Y):


 @skip_check_grad_ci(reason="no grap op")
-@unittest.skipIf(
-    not core.is_compiled_with_cuda() or not is_rocm_gfx928(),
-    "core is not compiled with CUDA",
-)
+#@unittest.skipIf(
+#    not core.is_compiled_with_cuda() or not is_rocm_gfx928(),
+#    "core is not compiled with CUDA",
+#)
 class TestFuseGemmEpilogueGradOpDXYBiasFP16(OpTest):
     def setUp(self):
         self.op_type = "fused_gemm_epilogue_grad"
@@ -58,7 +58,7 @@ class TestFuseGemmEpilogueGradOpDXYBiasFP16(OpTest):
             'Y': np.random.random((4, 128)).astype(self.dtype) - 0.5,
         }

-        self.attrs = {"activation_grad": 'none'}
+        self.attrs = {"activation_grad": 'gelu'}

         DX, DY, DBias = get_outputs(
             self.inputs['DOut'], self.inputs['X'], self.inputs['Y']

After this change, the test fails with the following error:

2069: ValueError: (InvalidArgument) The ReserveSpace should not be empty. when activation == {relu_grad, gelu_grad}. (at /host_all/workspace3/houjue/paddle_develop/Paddle/paddle/phi/infermeta/fusion.cc:2017)
2069:   [operator < fused_gemm_epilogue_grad > error]

Putting the two parts above together, my questions are:
1. Functionally, the operator multiplies x by y, adds bias, applies the activation, and writes the result to out. What is reserve_space used for?
2. The logs show that reserve_space is null when the activation is 'none' and non-null otherwise. Where is this behavior controlled, where is reserve_space assigned, and where is the actual device memory for it allocated?
3. Is the skip logic of the backward unit test wrong? It seems to skip many cases by mistake.
4. How can the backward unit test allocate this reserve space correctly, so that it exercises the non-'none' activation paths?


houj04 commented Apr 7, 2025

(This was discussed through another channel.)
reserve_space is used to save the forward output. XPU does not use this space for now, and it only supports the act == 'none' case.
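
For anyone hitting the same InvalidArgument error: the grad kernel needs the saved forward result to evaluate the activation gradient, which is why ReserveSpace must not be empty for relu_grad / gelu_grad. Below is a minimal numpy sketch of that idea only, assuming reserve_space holds the pre-activation GEMM output; the actual contents and layout on GPU (for example a relu bit mask produced by cuBLASLt) may differ.

import numpy as np

def relu_grad(dout, pre_act):
    # d(relu)/dx is 1 where the pre-activation value was positive, else 0,
    # so the saved forward result is needed to rebuild the mask.
    return dout * (pre_act > 0)

def gelu_grad(dout, pre_act):
    # Derivative of the tanh-approximated gelu used in the forward test above.
    c = np.sqrt(2.0 / np.pi)
    u = c * (pre_act + 0.044715 * pre_act**3)
    t = np.tanh(u)
    du = c * (1.0 + 3.0 * 0.044715 * pre_act**2)
    return dout * (0.5 * (1.0 + t) + 0.5 * pre_act * (1.0 - t * t) * du)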

houj04 closed this as completed on Apr 7, 2025.