
[AP] Implement pcc compile engine frontend of trivial fusion. #72640


Merged: 53 commits merged into PaddlePaddle:develop on May 13, 2025

Conversation

lixinqi
Contributor

@lixinqi lixinqi commented May 9, 2025

PR Category

CINN

PR Types

New features

Description

pcard-76996

Motivation

Currently, pcc is constrained by composite-operator decomposition and the CINN fusion frontend, so users cannot directly control the scope of operator fusion.
This PR switches pcc.compile to a new compile engine. The new engine fuses code exactly as the user intends (trusting the programmer), without relying on composite-operator decomposition or the CINN fusion frontend, and then directly applies apass processing.

Example usage

pcc.fuse.by_register

import paddle
import paddle.incubate.cc as pcc
import paddle.incubate.cc.typing as pct

N = pct.DimVar(1024)
K = pct.DimVar(256)
M = pct.DimVar(256)
DType = pct.DTypeVar("T", "float32")

def foo(
    x: pct.Tensor([N, K], DType),
    w: pct.Tensor([K, M], DType),
):
    y = paddle.matmul(x, w)
    with pcc.fuse.by_register():
        tmp = paddle.sin(y)
        return tmp, paddle.cos(y)[256:]

fused_foo = pcc.compile(
    foo,
    ap_path=dir_to_your_axpr_code,
)
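The `pcc.fuse.by_register()` with-block marks a fusion region: as the log below shows, the ops traced inside it end up bracketed by `pd_op.ap_trivial_fusion_begin` and `pd_op.ap_trivial_fusion_end`. A hypothetical sketch of that mechanism (not the real pcc internals; `record` and the `trace` list are stand-ins for the tracer) could look like this:

```python
# Hypothetical sketch: a context manager that brackets the ops traced inside
# the with-block with begin/end marker ops, mirroring the
# pd_op.ap_trivial_fusion_begin / pd_op.ap_trivial_fusion_end ops in the log.
from contextlib import contextmanager

trace = []  # stand-in for the program being traced

def record(op_name):
    trace.append(op_name)

@contextmanager
def by_register():
    record("ap_trivial_fusion_begin")  # emitted on entering the with-block
    try:
        yield
    finally:
        record("ap_trivial_fusion_end")  # emitted on leaving the with-block

record("matmul")  # outside the region, stays unfused
with by_register():
    record("sin")
    record("cos")
    record("slice")
# trace is now: ["matmul", "ap_trivial_fusion_begin",
#                "sin", "cos", "slice", "ap_trivial_fusion_end"]
```

Using `try`/`finally` around the `yield` ensures the end marker is emitted even if tracing inside the block raises.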

Example log

E0429 08:50:43.123240 111481 add_pcc_pass.cc:138] 0) after ApplyApFacadePass():
{
    (%0) = "pd_op.data" () {dtype:float32,name:"x",place:Place(undefined:0),shape:[1024,256],stop_gradient:[false]} : () -> tensor<1024x256xf32>
    (%1) = "pd_op.data" () {dtype:float32,name:"w",place:Place(undefined:0),shape:[256,256],stop_gradient:[false]} : () -> tensor<256x256xf32>
    (%2) = "pd_op.matmul" (%0, %1) {stop_gradient:[false],transpose_x:false,transpose_y:false} : (tensor<1024x256xf32>, tensor<256x256xf32>) -> tensor<1024x256xf32>
    (%3) = "pd_op.ap_trivial_fusion_begin" (<<NULL VALUE>>) {stop_gradient:[true]} : (<<NULL TYPE>>) -> tensor<b>
    (%4) = "pd_op.sin" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
    (%5) = "pd_op.cos" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
    (%6) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[256]} : () -> tensor<1xi64>
    (%7) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[2147483647]} : () -> tensor<1xi64>
    (%8) = "pd_op.slice" (%5, %6, %7) {axes:[0],decrease_axis:[],infer_flags:[1],stop_gradient:[false]} : (tensor<1024x256xf32>, tensor<1xi64>, tensor<1xi64>) -> tensor<768x256xf32>
    (%9) = "pd_op.ap_trivial_fusion_end" (<<NULL VALUE>>) {stop_gradient:[true]} : (<<NULL TYPE>>) -> tensor<b>
    () = "builtin.shadow_output" (%4) {output_name:"output_0"} : (tensor<1024x256xf32>) -> 
    () = "builtin.shadow_output" (%8) {output_name:"output_1"} : (tensor<768x256xf32>) -> 
}
E0429 08:50:43.123449 111481 add_pcc_pass.cc:138] 1) after ApplyFuseApTrivialPass():
{
    (%0) = "pd_op.data" () {dtype:float32,name:"x",place:Place(undefined:0),shape:[1024,256],stop_gradient:[false]} : () -> tensor<1024x256xf32>
    (%1) = "pd_op.data" () {dtype:float32,name:"w",place:Place(undefined:0),shape:[256,256],stop_gradient:[false]} : () -> tensor<256x256xf32>
    (%2) = "pd_op.matmul" (%0, %1) {stop_gradient:[false],transpose_x:false,transpose_y:false} : (tensor<1024x256xf32>, tensor<256x256xf32>) -> tensor<1024x256xf32>
    (%3, %4) = "cinn_op.fusion" () -> tensor<1024x256xf32>, tensor<768x256xf32> {
        (%5) = "pd_op.sin" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%6) = "pd_op.cos" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%7) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[256]} : () -> tensor<1xi64>
        (%8) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[2147483647]} : () -> tensor<1xi64>
        (%9) = "pd_op.slice" (%6, %7, %8) {axes:[0],decrease_axis:[],infer_flags:[1],stop_gradient:[false]} : (tensor<1024x256xf32>, tensor<1xi64>, tensor<1xi64>) -> tensor<768x256xf32>
        () = "cf.yield" (%5, %9) {} : (tensor<1024x256xf32>, tensor<768x256xf32>) -> 
    }
    () = "builtin.shadow_output" (%3) {output_name:"output_0"} : (tensor<1024x256xf32>) -> 
    () = "builtin.shadow_output" (%4) {output_name:"output_1"} : (tensor<768x256xf32>) -> 
}
E0429 08:50:43.123849 111481 add_pcc_pass.cc:138] 2) after ApplyGenerateShapePass():
{
    (%0) = "pd_op.data" () {dtype:float32,name:"x",place:Place(undefined:0),shape:[1024,256],stop_gradient:[false]} : () -> tensor<1024x256xf32>
    (%1) = "pd_op.data" () {dtype:float32,name:"w",place:Place(undefined:0),shape:[256,256],stop_gradient:[false]} : () -> tensor<256x256xf32>
    (%2) = "pd_op.matmul" (%0, %1) {stop_gradient:[false],transpose_x:false,transpose_y:false} : (tensor<1024x256xf32>, tensor<256x256xf32>) -> tensor<1024x256xf32>
    (%3, %4) = "cinn_op.fusion" () -> tensor<1024x256xf32>, tensor<768x256xf32> {
        (%5) = "pd_op.sin" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%6) = "pd_op.cos" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%7) = "cinn_op.generate_shape" () {output_dim_exprs:[256],stop_gradient:[true],symbol_bindings:[]} : () -> tensor<1xi64>
        (%8) = "cinn_op.generate_shape" () {output_dim_exprs:[2147483647],stop_gradient:[true],symbol_bindings:[]} : () -> tensor<1xi64>
        (%9) = "pd_op.slice" (%6, %7, %8) {axes:[0],decrease_axis:[],infer_flags:[1],stop_gradient:[false]} : (tensor<1024x256xf32>, tensor<1xi64>, tensor<1xi64>) -> tensor<768x256xf32>
        () = "cf.yield" (%5, %9) {} : (tensor<1024x256xf32>, tensor<768x256xf32>) -> 
    }
    () = "builtin.shadow_output" (%3) {output_name:"output_0"} : (tensor<1024x256xf32>) -> 
    () = "builtin.shadow_output" (%4) {output_name:"output_1"} : (tensor<768x256xf32>) -> 
}
E0429 08:50:43.123997 111481 ap_generic_drr_pass.cc:3313] 
Traceback (most recent call last):
  File "/root/workspace/Paddle/paddle/ap/src/paddle/pass/ap_generic_drr_pass.cc", line 3304, in operator()
    ApRegistryHelper{}.SingletonRegistry()
  File "/root/workspace/Paddle/paddle/ap/src/paddle/pass/ap_registry_helper.cc", line 30, in operator()
    RegistrySingleton::Singleton()
  File "/root/workspace/Paddle/paddle/ap/include/registry/registry_singleton.h", line 26, in operator()
    MutOptSingleton()->has_value()

NotImplementedError: Registry singleton not initialized. 
E0429 08:50:43.124019 111481 ap_generic_drr_pass.cc:3313] 
Traceback (most recent call last):
  File "/root/workspace/Paddle/paddle/ap/src/paddle/pass/ap_generic_drr_pass.cc", line 3304, in operator()
    ApRegistryHelper{}.SingletonRegistry()
  File "/root/workspace/Paddle/paddle/ap/src/paddle/pass/ap_registry_helper.cc", line 30, in operator()
    RegistrySingleton::Singleton()
  File "/root/workspace/Paddle/paddle/ap/include/registry/registry_singleton.h", line 26, in operator()
    MutOptSingleton()->has_value()

NotImplementedError: Registry singleton not initialized. 
E0429 08:50:43.124032 111481 add_pcc_pass.cc:138] 3) after ApplyApGenericDrrPass():
{
    (%0) = "pd_op.data" () {dtype:float32,name:"x",place:Place(undefined:0),shape:[1024,256],stop_gradient:[false]} : () -> tensor<1024x256xf32>
    (%1) = "pd_op.data" () {dtype:float32,name:"w",place:Place(undefined:0),shape:[256,256],stop_gradient:[false]} : () -> tensor<256x256xf32>
    (%2) = "pd_op.matmul" (%0, %1) {stop_gradient:[false],transpose_x:false,transpose_y:false} : (tensor<1024x256xf32>, tensor<256x256xf32>) -> tensor<1024x256xf32>
    (%3, %4) = "cinn_op.fusion" () -> tensor<1024x256xf32>, tensor<768x256xf32> {
        (%5) = "pd_op.sin" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%6) = "pd_op.cos" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
        (%7) = "cinn_op.generate_shape" () {output_dim_exprs:[256],stop_gradient:[true],symbol_bindings:[]} : () -> tensor<1xi64>
        (%8) = "cinn_op.generate_shape" () {output_dim_exprs:[2147483647],stop_gradient:[true],symbol_bindings:[]} : () -> tensor<1xi64>
        (%9) = "pd_op.slice" (%6, %7, %8) {axes:[0],decrease_axis:[],infer_flags:[1],stop_gradient:[false]} : (tensor<1024x256xf32>, tensor<1xi64>, tensor<1xi64>) -> tensor<768x256xf32>
        () = "cf.yield" (%5, %9) {} : (tensor<1024x256xf32>, tensor<768x256xf32>) -> 
    }
    () = "builtin.shadow_output" (%3) {output_name:"output_0"} : (tensor<1024x256xf32>) -> 
    () = "builtin.shadow_output" (%4) {output_name:"output_1"} : (tensor<768x256xf32>) -> 
}
E0429 08:50:43.124235 111481 add_pcc_pass.cc:138] 4) after ApplyFallbackToPhiPass():
{
    (%0) = "pd_op.data" () {dtype:float32,name:"x",place:Place(undefined:0),shape:[1024,256],stop_gradient:[false]} : () -> tensor<1024x256xf32>
    (%1) = "pd_op.data" () {dtype:float32,name:"w",place:Place(undefined:0),shape:[256,256],stop_gradient:[false]} : () -> tensor<256x256xf32>
    (%2) = "pd_op.matmul" (%0, %1) {stop_gradient:[false],transpose_x:false,transpose_y:false} : (tensor<1024x256xf32>, tensor<256x256xf32>) -> tensor<1024x256xf32>
    (%3) = "pd_op.sin" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
    (%4) = "pd_op.cos" (%2) {stop_gradient:[false]} : (tensor<1024x256xf32>) -> tensor<1024x256xf32>
    (%5) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[256]} : () -> tensor<1xi64>
    (%6) = "pd_op.full_int_array" () {dtype:int64,place:Place(cpu),stop_gradient:[true],value:[2147483647]} : () -> tensor<1xi64>
    (%7) = "pd_op.slice" (%4, %5, %6) {axes:[0],decrease_axis:[],infer_flags:[1],stop_gradient:[false]} : (tensor<1024x256xf32>, tensor<1xi64>, tensor<1xi64>) -> tensor<768x256xf32>
    () = "builtin.shadow_output" (%3) {output_name:"output_0"} : (tensor<1024x256xf32>) -> 
    () = "builtin.shadow_output" (%7) {output_name:"output_1"} : (tensor<768x256xf32>) -> 
}

The PCC compile engine runs the following steps:

  1. ApplyApFacadePass: replace pd_op.ap_facade with ap_op.facade.
  2. ApplyFuseApTrivialPass: fuse the ops between pd_op.ap_trivial_fusion_begin and pd_op.ap_trivial_fusion_end into a cinn_op.fusion.
  3. ApplyGenerateShapePass: aggregate the ops that handle dynamic shapes into cinn_op.generate_shape ops.
  4. ApplyApGenericDrrPass: apply the apass specified by the ap_path argument. In the log above this step reports an error because the example program has no apass configured.
  5. ApplyFallbackToPhiPass: fall back any cinn_op.fusion subgraph that failed to compile to phi.
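Step 2 can be illustrated with a toy model, assuming (purely for illustration; the real passes operate on PIR programs in C++) that a program is just a list of op names:

```python
# Toy model of ApplyFuseApTrivialPass: collapse the ops between the
# begin/end markers into a single ("fusion", ...) op, leaving ops
# outside the markers untouched.
def fuse_ap_trivial(ops):
    out, inside, region = [], False, []
    for op in ops:
        if op == "ap_trivial_fusion_begin":
            inside = True          # start collecting the fusion region
        elif op == "ap_trivial_fusion_end":
            out.append(("fusion", tuple(region)))  # emit the fused op
            inside, region = False, []
        elif inside:
            region.append(op)      # op belongs to the fusion region
        else:
            out.append(op)         # op outside any region passes through
    return out

program = ["matmul", "ap_trivial_fusion_begin",
           "sin", "cos", "slice", "ap_trivial_fusion_end"]
fused = fuse_ap_trivial(program)
# fused == ["matmul", ("fusion", ("sin", "cos", "slice"))]
```

This matches the shape of the log above: `pd_op.matmul` stays at the top level while `sin`, `cos`, and `slice` move into the `cinn_op.fusion` region.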

lixinqi and others added 30 commits February 13, 2025 07:24

paddle-bot bot commented May 9, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label May 9, 2025
@Xreki Xreki force-pushed the pcc_compile_engine branch from 59ce985 to 3b0c1d6 Compare May 9, 2025 12:49
@Xreki Xreki changed the title Pcc compile engine [AP] Implement pcc compile engine frontend of trivial fusion. May 9, 2025
Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM

Member

@SigureMo SigureMo left a comment


LGTMeow 🐾

paddle.base.libpaddle.pir.bind_symbolic_constraints(
    forward_program, self._constraints
)
paddle.base.libpaddle.pir.apply_cinn_pass(forward_program)

elif self._backend.is_pcc():
Member


Is the reason the pcc backend doesn't call apply_general_passes that CSE has a problem? If so, I can look into it later.

Contributor


Thanks! The pcc backend still needs experiments to determine which passes to add, so the feature is not yet complete. We haven't found any problem with CSE so far.

@Xreki Xreki merged commit 54c8bb3 into PaddlePaddle:develop May 13, 2025
45 of 49 checks passed
GITD245 pushed a commit to GITD245/Paddle that referenced this pull request May 14, 2025
…Paddle#72640)

* 1) add PCC compile engine; 2) add pcc.fuse.by_register() with-block api

* Fix compiling error when cinn is not enabled.

* Polish copyright and error messages.

---------

Co-authored-by: Liu Yiqun <liuyiqun01@baidu.com>