
[AP] Implement pcc compile engine frontend of trivial fusion. #72640


Merged: 53 commits merged into PaddlePaddle:develop on May 13, 2025

Conversation

lixinqi
Contributor

@lixinqi lixinqi commented May 9, 2025

PR Category

CINN

PR Types

New features

Description

pcard-76996

Motivation

Currently, pcc is constrained by composite-operator decomposition and the CINN fusion frontend, so users cannot directly control the scope of operator fusion.
This PR switches pcc.compile to a new compile engine. The new engine fuses code exactly as the user intends (trusting the programmer), without relying on composite-operator decomposition or the CINN fusion frontend, and then directly applies apass processing.

Example usage

pcc.fuse.by_register

import paddle
import paddle.incubate.cc as pcc
import paddle.incubate.cc.typing as pct

N = pct.DimVar(1024)
K = pct.DimVar(256)
M = pct.DimVar(256)
DType = pct.DTypeVar("T", "float32")

def foo(
    x: pct.Tensor([N, K], DType),
    w: pct.Tensor([K, M], DType),
):
    y = paddle.matmul(x, w)
    with pcc.fuse.by_register():
        tmp = paddle.sin(y)
        return tmp, paddle.cos(y)[256:]

fused_foo = pcc.compile(
    foo,
    ap_path=dir_to_your_axpr_code,
)
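The `pcc.fuse.by_register()` with-block marks a fusion region: as the log below shows, the ops traced inside it end up bracketed by `pd_op.ap_trivial_fusion_begin` and `pd_op.ap_trivial_fusion_end`. A hypothetical sketch of that mechanism (not the real pcc internals; `record` and the `trace` list are stand-ins for the tracer) could look like this:

```python
# Hypothetical sketch: a context manager that brackets the ops traced inside
# the with-block with begin/end marker ops, mirroring the
# pd_op.ap_trivial_fusion_begin / pd_op.ap_trivial_fusion_end ops in the log.
from contextlib import contextmanager

trace = []  # stand-in for the program being traced

def record(op_name):
    trace.append(op_name)

@contextmanager
def by_register():
    record("ap_trivial_fusion_begin")  # emitted on entering the with-block
    try:
        yield
    finally:
        record("ap_trivial_fusion_end")  # emitted on leaving the with-block

record("matmul")  # outside the region, stays unfused
with by_register():
    record("sin")
    record("cos")
    record("slice")
# trace is now: ["matmul", "ap_trivial_fusion_begin",
#                "sin", "cos", "slice", "ap_trivial_fusion_end"]
```

Using `try`/`finally` around the `yield` ensures the end marker is emitted even if tracing inside the block raises.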

Example log

E0429 08:50:43.123240 111481 add_pcc_pass.cc:138] 0) after ApplyApFacadePass():
{
    (%0) = "pd_op.data" () {dtype:float32,name:"x",place:Place(undefined:0),shape:[1024,256],stop_gradient:[false]} : () -> tensor<1024x256xf32>
    (%1) = "pd_op.data" () {dtype:float32,name:"w",place:Place(undefined:0),shape:[256,256],stop_gradient:[false]} : () -> tensor<256x256xf32>
    (%2) = "pd_op.matmul" (%0, %1) {stop_gradient:[false],transpose_x:false,transpose_y:false} : (tensor<1024x256xf32>, tensor<256x256xf32>) -> tensor<1024x256xf32>
    (%3) = "pd_op.ap_trivial_fusion_begin" (<<NULL VALUE>>) {stop_gradient:[true]} : (<<NULL TYPE>>) -> tensor<b>
    (%4) = "pd_op.sin" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
    (%5) = "pd_op.cos" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
    (%6) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[256]} : () -> tensor<1xi64>
    (%7) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[2147483647]} : () -> tensor<1xi64>
    (%8) = "pd_op.slice" (%5, %6, %7) {axes:[0],decrease_axis:[],infer_flags:[1],stop_gradient:[false]} : (tensor<1024x256xf32>, tensor<1xi64>, tensor<1xi64>) -> tensor<768x256xf32>
    (%9) = "pd_op.ap_trivial_fusion_end" (<<NULL VALUE>>) {stop_gradient:[true]} : (<<NULL TYPE>>) -> tensor<b>
    () = "builtin.shadow_output" (%4) {output_name:"output_0"} : (tensor<1024x256xf32>) -> 
    () = "builtin.shadow_output" (%8) {output_name:"output_1"} : (tensor<768x256xf32>) -> 
}
E0429 08:50:43.123449 111481 add_pcc_pass.cc:138] 1) after ApplyFuseApTrivialPass():
{
    (%0) = "pd_op.data" () {dtype:float32,name:"x",place:Place(undefined:0),shape:[1024,256],stop_gradient:[false]} : () -> tensor<1024x256xf32>
    (%1) = "pd_op.data" () {dtype:float32,name:"w",place:Place(undefined:0),shape:[256,256],stop_gradient:[false]} : () -> tensor<256x256xf32>
    (%2) = "pd_op.matmul" (%0, %1) {stop_gradient:[false],transpose_x:false,transpose_y:false} : (tensor<1024x256xf32>, tensor<256x256xf32>) -> tensor<1024x256xf32>
    (%3, %4) = "cinn_op.fusion" () -> tensor<1024x256xf32>, tensor<768x256xf32> {
        (%5) = "pd_op.sin" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%6) = "pd_op.cos" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%7) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[256]} : () -> tensor<1xi64>
        (%8) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[2147483647]} : () -> tensor<1xi64>
        (%9) = "pd_op.slice" (%6, %7, %8) {axes:[0],decrease_axis:[],infer_flags:[1],stop_gradient:[false]} : (tensor<1024x256xf32>, tensor<1xi64>, tensor<1xi64>) -> tensor<768x256xf32>
        () = "cf.yield" (%5, %9) {} : (tensor<1024x256xf32>, tensor<768x256xf32>) -> 
    }
    () = "builtin.shadow_output" (%3) {output_name:"output_0"} : (tensor<1024x256xf32>) -> 
    () = "builtin.shadow_output" (%4) {output_name:"output_1"} : (tensor<768x256xf32>) -> 
}
E0429 08:50:43.123849 111481 add_pcc_pass.cc:138] 2) after ApplyGenerateShapePass():
{
    (%0) = "pd_op.data" () {dtype:float32,name:"x",place:Place(undefined:0),shape:[1024,256],stop_gradient:[false]} : () -> tensor<1024x256xf32>
    (%1) = "pd_op.data" () {dtype:float32,name:"w",place:Place(undefined:0),shape:[256,256],stop_gradient:[false]} : () -> tensor<256x256xf32>
    (%2) = "pd_op.matmul" (%0, %1) {stop_gradient:[false],transpose_x:false,transpose_y:false} : (tensor<1024x256xf32>, tensor<256x256xf32>) -> tensor<1024x256xf32>
    (%3, %4) = "cinn_op.fusion" () -> tensor<1024x256xf32>, tensor<768x256xf32> {
        (%5) = "pd_op.sin" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%6) = "pd_op.cos" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%7) = "cinn_op.generate_shape" () {output_dim_exprs:[256],stop_gradient:[true],symbol_bindings:[]} : () -> tensor<1xi64>
        (%8) = "cinn_op.generate_shape" () {output_dim_exprs:[2147483647],stop_gradient:[true],symbol_bindings:[]} : () -> tensor<1xi64>
        (%9) = "pd_op.slice" (%6, %7, %8) {axes:[0],decrease_axis:[],infer_flags:[1],stop_gradient:[false]} : (tensor<1024x256xf32>, tensor<1xi64>, tensor<1xi64>) -> tensor<768x256xf32>
        () = "cf.yield" (%5, %9) {} : (tensor<1024x256xf32>, tensor<768x256xf32>) -> 
    }
    () = "builtin.shadow_output" (%3) {output_name:"output_0"} : (tensor<1024x256xf32>) -> 
    () = "builtin.shadow_output" (%4) {output_name:"output_1"} : (tensor<768x256xf32>) -> 
}
E0429 08:50:43.123997 111481 ap_generic_drr_pass.cc:3313] 
Traceback (most recent call last):
  File "/root/workspace/Paddle/paddle/ap/src/paddle/pass/ap_generic_drr_pass.cc", line 3304, in operator()
    ApRegistryHelper{}.SingletonRegistry()
  File "/root/workspace/Paddle/paddle/ap/src/paddle/pass/ap_registry_helper.cc", line 30, in operator()
    RegistrySingleton::Singleton()
  File "/root/workspace/Paddle/paddle/ap/include/registry/registry_singleton.h", line 26, in operator()
    MutOptSingleton()->has_value()

NotImplementedError: Registry singleton not initialized. 
E0429 08:50:43.124019 111481 ap_generic_drr_pass.cc:3313] 
Traceback (most recent call last):
  File "/root/workspace/Paddle/paddle/ap/src/paddle/pass/ap_generic_drr_pass.cc", line 3304, in operator()
    ApRegistryHelper{}.SingletonRegistry()
  File "/root/workspace/Paddle/paddle/ap/src/paddle/pass/ap_registry_helper.cc", line 30, in operator()
    RegistrySingleton::Singleton()
  File "/root/workspace/Paddle/paddle/ap/include/registry/registry_singleton.h", line 26, in operator()
    MutOptSingleton()->has_value()

NotImplementedError: Registry singleton not initialized. 
E0429 08:50:43.124032 111481 add_pcc_pass.cc:138] 3) after ApplyApGenericDrrPass():
{
    (%0) = "pd_op.data" () {dtype:float32,name:"x",place:Place(undefined:0),shape:[1024,256],stop_gradient:[false]} : () -> tensor<1024x256xf32>
    (%1) = "pd_op.data" () {dtype:float32,name:"w",place:Place(undefined:0),shape:[256,256],stop_gradient:[false]} : () -> tensor<256x256xf32>
    (%2) = "pd_op.matmul" (%0, %1) {stop_gradient:[false],transpose_x:false,transpose_y:false} : (tensor<1024x256xf32>, tensor<256x256xf32>) -> tensor<1024x256xf32>
    (%3, %4) = "cinn_op.fusion" () -> tensor<1024x256xf32>, tensor<768x256xf32> {
        (%5) = "pd_op.sin" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%6) = "pd_op.cos" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%7) = "cinn_op.generate_shape" () {output_dim_exprs:[256],stop_gradient:[true],symbol_bindings:[]} : () -> tensor<1xi64>
        (%8) = "cinn_op.generate_shape" () {output_dim_exprs:[2147483647],stop_gradient:[true],symbol_bindings:[]} : () -> tensor<1xi64>
        (%9) = "pd_op.slice" (%6, %7, %8) {axes:[0],decrease_axis:[],infer_flags:[1],stop_gradient:[false]} : (tensor<1024x256xf32>, tensor<1xi64>, tensor<1xi64>) -> tensor<768x256xf32>
        () = "cf.yield" (%5, %9) {} : (tensor<1024x256xf32>, tensor<768x256xf32>) -> 
    }
    () = "builtin.shadow_output" (%3) {output_name:"output_0"} : (tensor<1024x256xf32>) -> 
    () = "builtin.shadow_output" (%4) {output_name:"output_1"} : (tensor<768x256xf32>) -> 
}
E0429 08:50:43.124235 111481 add_pcc_pass.cc:138] 4) after ApplyFallbackToPhiPass():
{
    (%0) = "pd_op.data" () {dtype:float32,name:"x",place:Place(undefined:0),shape:[1024,256],stop_gradient:[false]} : () -> tensor<1024x256xf32>
    (%1) = "pd_op.data" () {dtype:float32,name:"w",place:Place(undefined:0),shape:[256,256],stop_gradient:[false]} : () -> tensor<256x256xf32>
    (%2) = "pd_op.matmul" (%0, %1) {stop_gradient:[false],transpose_x:false,transpose_y:false} : (tensor<1024x256xf32>, tensor<256x256xf32>) -> tensor<1024x256xf32>
    (%3) = "pd_op.sin" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
    (%4) = "pd_op.cos" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
    (%5) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[256]} : () -> tensor<1xi64>
    (%6) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[2147483647]} : () -> tensor<1xi64>
    (%7) = "pd_op.slice" (%4, %5, %6) {axes:[0],decrease_axis:[],infer_flags:[1],stop_gradient:[false]} : (tensor<1024x256xf32>, tensor<1xi64>, tensor<1xi64>) -> tensor<768x256xf32>
    () = "builtin.shadow_output" (%3) {output_name:"output_0"} : (tensor<1024x256xf32>) -> 
    () = "builtin.shadow_output" (%7) {output_name:"output_1"} : (tensor<768x256xf32>) -> 
}

The PCC compile engine runs the following steps:

  1. ApplyApFacadePass: replace pd_op.ap_facade with ap_op.facade.
  2. ApplyFuseApTrivialPass: fuse the ops between pd_op.ap_trivial_fusion_begin and pd_op.ap_trivial_fusion_end into a cinn_op.fusion.
  3. ApplyGenerateShapePass: aggregate the ops that handle dynamic shapes into cinn_op.generate_shape ops.
  4. ApplyApGenericDrrPass: apply the apass specified by the ap_path argument. In the log above this step reports an error because the example program has no apass configured.
  5. ApplyFallbackToPhiPass: fall back any cinn_op.fusion subgraph that failed to compile to phi.
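Step 2 can be illustrated with a toy model, assuming (purely for illustration; the real passes operate on PIR programs in C++) that a program is just a list of op names:

```python
# Toy model of ApplyFuseApTrivialPass: collapse the ops between the
# begin/end markers into a single ("fusion", ...) op, leaving ops
# outside the markers untouched.
def fuse_ap_trivial(ops):
    out, inside, region = [], False, []
    for op in ops:
        if op == "ap_trivial_fusion_begin":
            inside = True          # start collecting the fusion region
        elif op == "ap_trivial_fusion_end":
            out.append(("fusion", tuple(region)))  # emit the fused op
            inside, region = False, []
        elif inside:
            region.append(op)      # op belongs to the fusion region
        else:
            out.append(op)         # op outside any region passes through
    return out

program = ["matmul", "ap_trivial_fusion_begin",
           "sin", "cos", "slice", "ap_trivial_fusion_end"]
fused = fuse_ap_trivial(program)
# fused == ["matmul", ("fusion", ("sin", "cos", "slice"))]
```

This matches the shape of the log above: `pd_op.matmul` stays at the top level while `sin`, `cos`, and `slice` move into the `cinn_op.fusion` region.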

lixinqi and others added 30 commits February 13, 2025 07:24

paddle-bot bot commented May 9, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label May 9, 2025
@Xreki Xreki force-pushed the pcc_compile_engine branch from 59ce985 to 3b0c1d6 Compare May 9, 2025 12:49
@Xreki Xreki changed the title Pcc compile engine [AP] Implement pcc compile engine frontend of trivial fusion. May 9, 2025
Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM

Member

@SigureMo SigureMo left a comment


LGTMeow 🐾

paddle.base.libpaddle.pir.bind_symbolic_constraints(
    forward_program, self._constraints
)
paddle.base.libpaddle.pir.apply_cinn_pass(forward_program)

elif self._backend.is_pcc():
Member


Is the reason the pcc backend doesn't call apply_general_passes that CSE has a problem? If so, I can look into it later.

Contributor


Thanks! The pcc backend still needs experiments to determine which passes to add, so the feature is not yet complete. We haven't found any problem with CSE so far.

@Xreki Xreki merged commit 54c8bb3 into PaddlePaddle:develop May 13, 2025
45 of 49 checks passed
GITD245 pushed a commit to GITD245/Paddle that referenced this pull request May 14, 2025
…Paddle#72640)

* 1) add PCC compile engine; 2) add pcc.fuse.by_register() with-block api

* Fix compiling error when cinn is not enabled.

* Polish copyright and error messages.

---------

Co-authored-by: Liu Yiqun <liuyiqun01@baidu.com>