
A question about the fused_gemm_epilogue operator #71943


Closed
houj04 opened this issue Mar 27, 2025 · 1 comment
Labels: status/close (closed), type/question (user question)



houj04 commented Mar 27, 2025

Please ask your question

Background: I am developing the fused_gemm_epilogue operator for a non-GPU device.

Part 1: the forward operator.

I used the following code to try out the operator:

import paddle
import numpy as np

from paddle import _C_ops

def gelu(x):
    y_ref = (
        0.5
        * x
        * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3))))
    )
    return y_ref.astype(x.dtype)


def relu(x):
    mask = x > 0
    return x * mask


def get_output(X, Y, bias, act):
    out = np.dot(X, Y) + bias
    if act == 'relu':
        return relu(out)
    elif act == 'gelu':
        return gelu(out)
    else:
        return out


x_np = np.random.random((8, 4)).astype(np.float32) - 0.5
y_np = np.random.random((4, 128)).astype(np.float32) - 0.5
bias_np = np.random.random((128,)).astype(np.float32) - 0.5
x = paddle.to_tensor(x_np)
y = paddle.to_tensor(y_np)
bias = paddle.to_tensor(bias_np)
x.stop_gradient = False
y.stop_gradient = False

out1, _ = _C_ops.fused_gemm_epilogue(x, y, bias, False, False, 'none')
out2, _ = _C_ops.fused_gemm_epilogue(x, y, bias, False, False, 'relu')
out3, _ = _C_ops.fused_gemm_epilogue(x, y, bias, False, False, 'gelu')

out_np1 = get_output(x_np, y_np, bias_np, 'none')
out_np2 = get_output(x_np, y_np, bias_np, 'relu')
out_np3 = get_output(x_np, y_np, bias_np, 'gelu')

np.testing.assert_allclose(out1, out_np1, atol=2e-04)
np.testing.assert_allclose(out2, out_np2, atol=2e-04)
np.testing.assert_allclose(out3, out_np3, atol=1e-03)

As the code shows, fused_gemm_epilogue is called three times, and the only argument that differs between the calls is the activation.

Running with export GLOG_v=6 enabled, the three calls print log lines like the following, which differ only in the trailing reserve_space value:

I0327 15:04:08.895666 153931 fused_gemm_epilogue_kernel.cu:97] x.shape={8, 4}, y.shape={4, 128}, out.shape={8, 128}, M=8, N=128, K=4, trans_x=0, trans_y=0, activation=none, fused_type=3, reserve_space=0
I0327 15:04:08.981643 153931 fused_gemm_epilogue_kernel.cu:97] x.shape={8, 4}, y.shape={4, 128}, out.shape={8, 128}, M=8, N=128, K=4, trans_x=0, trans_y=0, activation=relu, fused_type=7, reserve_space=0x4f38a00
I0327 15:04:08.986799 153931 fused_gemm_epilogue_kernel.cu:97] x.shape={8, 4}, y.shape={4, 128}, out.shape={8, 128}, M=8, N=128, K=4, trans_x=0, trans_y=0, activation=gelu, fused_type=8, reserve_space=0x9705d60
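
The same observation can be reproduced from Python, since the operator returns reserve_space as its second output (the test above discards it with _). A minimal sketch, reusing the tensors defined above; the assumption here is that the second return value stays None or unallocated when activation is 'none', which may vary across devices and Paddle versions:

# Inspect the second return value (reserve_space) of the three calls above.
# Assumption: it stays unallocated for activation='none' and becomes a real
# tensor for 'relu' / 'gelu', matching the GLOG lines shown above.
for act in ('none', 'relu', 'gelu'):
    _, rs = _C_ops.fused_gemm_epilogue(x, y, bias, False, False, act)
    print(act, '->', 'None' if rs is None else rs.shape)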

This corresponds to the following location in paddle/phi/kernels/fusion/gpu/fused_gemm_epilogue_kernel.cu:

template <typename T, typename Context>
void FusedGemmEpilogueKernel(const Context& dev_ctx,
                             const DenseTensor& x,
                             const DenseTensor& y,
                             const DenseTensor& bias,
                             const bool trans_x,
                             const bool trans_y,
                             const std::string& activation,
                             DenseTensor* out,
                             DenseTensor* reserve_space) {

Part 2: the backward unit test.

I tried to run the unit test for the backward operator, i.e. test/legacy_test/test_fused_gemm_epilogue_grad_op.py.

It turned out that the test does not actually execute; it is skipped. Looking into it, the skip logic appears to be wrong: the condition below is equivalent to running the test only when core.is_compiled_with_cuda() and is_rocm_gfx928() both hold, and skipping it in every other configuration (a quick truth-table check of this claim follows the decorator below).

@unittest.skipIf(
    not core.is_compiled_with_cuda() or not is_rocm_gfx928(),
    "core is not compiled with CUDA",
)
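
To make the equivalence explicit, here is a small self-contained check in plain Python (no Paddle needed; A and B merely stand in for core.is_compiled_with_cuda() and is_rocm_gfx928(), which are not called here):

# Verify that "skip if (not A or not B)" is the same as "run only when A and B".
# A stands for core.is_compiled_with_cuda(), B for is_rocm_gfx928().
for A in (True, False):
    for B in (True, False):
        skip = (not A) or (not B)
        assert (not skip) == (A and B)
        print(f"A={A}, B={B} -> skipped={skip}")

In other words, a plain CUDA build that is not ROCm gfx928 is always skipped.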

Moreover, every case in the file uses the activation 'none', so the other activations are clearly not tested. I therefore made the following change:

diff --git a/test/legacy_test/test_fused_gemm_epilogue_grad_op.py b/test/legacy_test/test_fused_gemm_epilogue_grad_op.py
index a8b7c760fb..aa42cd32f7 100644
--- a/test/legacy_test/test_fused_gemm_epilogue_grad_op.py
+++ b/test/legacy_test/test_fused_gemm_epilogue_grad_op.py
@@ -42,10 +42,10 @@ def get_outputs(DOut, X, Y):


 @skip_check_grad_ci(reason="no grap op")
-@unittest.skipIf(
-    not core.is_compiled_with_cuda() or not is_rocm_gfx928(),
-    "core is not compiled with CUDA",
-)
+#@unittest.skipIf(
+#    not core.is_compiled_with_cuda() or not is_rocm_gfx928(),
+#    "core is not compiled with CUDA",
+#)
 class TestFuseGemmEpilogueGradOpDXYBiasFP16(OpTest):
     def setUp(self):
         self.op_type = "fused_gemm_epilogue_grad"
@@ -58,7 +58,7 @@ class TestFuseGemmEpilogueGradOpDXYBiasFP16(OpTest):
             'Y': np.random.random((4, 128)).astype(self.dtype) - 0.5,
         }

-        self.attrs = {"activation_grad": 'none'}
+        self.attrs = {"activation_grad": 'gelu'}

         DX, DY, DBias = get_outputs(
             self.inputs['DOut'], self.inputs['X'], self.inputs['Y']

After this change, the test fails with the following error:

2069: ValueError: (InvalidArgument) The ReserveSpace should not be empty. when activation == {relu_grad, gelu_grad}. (at /host_all/workspace3/houjue/paddle_develop/Paddle/paddle/phi/infermeta/fusion.cc:2017)
2069:   [operator < fused_gemm_epilogue_grad > error]

Putting the two parts above together, my questions are:
1. Functionally, the operator multiplies x by y, adds bias, applies the activation, and writes the result to out. What is reserve_space used for?
2. The logs show that reserve_space is null when the activation is 'none' and non-null otherwise. Where is this behavior controlled, where is reserve_space assigned, and where is the actual device memory for it allocated?
3. Is the skip logic of the backward unit test wrong? It seems to skip many cases by mistake.
4. How can the backward unit test allocate this reserve space correctly, so that it exercises the non-'none' activation paths?


houj04 commented Apr 7, 2025

(This was discussed through another channel.)
reserve_space is used to save the forward output. XPU does not use this space for now, and it only supports the act == 'none' case.
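
For anyone hitting the same InvalidArgument error: the grad kernel needs the saved forward result to evaluate the activation gradient, which is why ReserveSpace must not be empty for relu_grad / gelu_grad. Below is a minimal numpy sketch of that idea only, assuming reserve_space holds the pre-activation GEMM output; the actual contents and layout on GPU (for example a relu bit mask produced by cuBLASLt) may differ.

import numpy as np

def relu_grad(dout, pre_act):
    # d(relu)/dx is 1 where the pre-activation value was positive, else 0,
    # so the saved forward result is needed to rebuild the mask.
    return dout * (pre_act > 0)

def gelu_grad(dout, pre_act):
    # Derivative of the tanh-approximated gelu used in the forward test above.
    c = np.sqrt(2.0 / np.pi)
    u = c * (pre_act + 0.044715 * pre_act**3)
    t = np.tanh(u)
    du = c * (1.0 + 3.0 * 0.044715 * pre_act**2)
    return dout * (0.5 * (1.0 + t) + 0.5 * pre_act * (1.0 - t * t) * du)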

houj04 closed this as completed on Apr 7, 2025.