Matmul performance optimization with cuBlasLt #46431

Merged: 76 commits into PaddlePaddle:develop on Feb 26, 2023

Conversation

@JamesLim-sy (Contributor) commented Sep 23, 2022

PR types

Performance optimization

PR changes

OPs

Describe

  • Feature: Matmul with cuBlasLt and autotune.

  • P.S.: After Jan 13, all work was committed by @JamesLim-sy and @Xreki. However, @JamesLim-sy's Linux environment was modified by someone else, and the GitHub username was changed along with it. What a terrible mistake.

paddle-bot (bot) commented Sep 23, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@JamesLim-sy force-pushed the add_autotune_kernel_tool branch from de26777 to fbda72c on February 3, 2023 03:25
ReturnType (*func)(Args...)) {
static std::once_flag transpose_init_flag_;
static std::unique_ptr<
AutoTuneBase<T, KernelCallback<T, ReturnType, Args...>>>
static std::unique_ptr<TransposeAutoTuner<T, ReturnType, Args...>>
Contributor: As a local variable inside a function, the name should not carry a trailing _ suffix.

Contributor Author: Fixed.
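
For reference, a minimal sketch of the call_once-guarded lazy singleton this snippet implements, with the trailing-underscore suffix dropped from the locals as requested (TunerT stands in for TransposeAutoTuner<T, ReturnType, Args...>; names are illustrative, not the exact PR code):

    #include <memory>
    #include <mutex>

    template <typename TunerT, typename ReturnType, typename... Args>
    TunerT* Instance(ReturnType (*func)(Args...)) {
      static std::once_flag init_flag;          // local variable: no '_' suffix
      static std::unique_ptr<TunerT> instance;  // local variable: no '_' suffix
      // Construct the tuner exactly once, even under concurrent first calls.
      std::call_once(init_flag,
                     [&] { instance = std::make_unique<TunerT>(func); });
      return instance.get();
    }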

});
return instance_.get();
}

template <typename Context>
void RunMatmul(const Context& ctx, const size_t key, Args... args) {
Contributor: Better to wrap a RunImpl function in the base class and have the base class's Run call RunImpl directly; then override Run here. Don't change the external interface.

Contributor Author: Changed as suggested; it is enough to override Run only in the outer MatmulAutoTuner class.
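
A hedged sketch of the shape being agreed on here (illustrative names, not the PR code): the base class keeps the public Run and delegates to a RunImpl, and the matmul tuner shadows Run to add its own key handling, so external call sites are untouched.

    #include <cstddef>
    #include <iostream>
    #include <utility>

    struct AutoTuneBaseSketch {
      // Public interface stays stable; shared logic lives in RunImpl.
      template <typename... Args>
      void Run(std::size_t key, Args&&... args) {
        RunImpl(key, std::forward<Args>(args)...);
      }
      template <typename... Args>
      void RunImpl(std::size_t key, Args&&...) {
        std::cout << "shared tuning path, key=" << key << "\n";
      }
    };

    struct MatmulTunerSketch : AutoTuneBaseSketch {
      // Shadowing Run adds matmul-specific handling without changing callers.
      template <typename... Args>
      void Run(std::size_t key, Args&&... args) {
        // matmul-specific cache preparation would go here, then:
        RunImpl(key, std::forward<Args>(args)...);
      }
    };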

this->is_init_ = true;
this->CheckKernelSize();
auto& cache = AutoTuneCache::Instance().GetMatmul();
if (cache.Find(key)) {
Contributor: It seems that when AutoTune is not enabled, this adds one extra cache-lookup overhead.

Contributor Author: This is hard to avoid: the AutoTune-disabled state exists both before tuning is enabled and after it finishes, and the logic here is consistent with conv_cudnn_v7.h.
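
To make the trade-off concrete, the control flow under discussion looks roughly like this (a hedged sketch; UseAutoTune, PickBestKernel, RunBestKernel, and RunDefaultKernel are hypothetical names, not Paddle's exact API):

    if (cache.Find(key)) {
      // Tuning already ran for this shape: reuse the recorded winner.
      RunBestKernel(cache.Get(key), args...);
    } else if (UseAutoTune()) {
      // Tuning window: measure the candidates once and cache the winner.
      auto best = PickBestKernel(ctx, args...);
      cache.Set(key, best);
      RunBestKernel(best, args...);
    } else {
      // Autotune disabled (before or after the tuning window): the lookup
      // above is the one extra cost being discussed.
      RunDefaultKernel(args...);
    }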

static MatmulAutoTuner<T, ReturnType, Args...>* MakeMatmulTuner(
ReturnType (*func)(Args...)) {
return MatmulAutoTuner<T, ReturnType, Args...>::Instance(func);
}
Contributor: Define a macro for this, say DEFINE_AUTOTUNER_FN.

Contributor Author: Changed as suggested.
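
A minimal sketch of the kind of macro being suggested (the macro body below is an assumption based on the surrounding snippet, not necessarily the merged code):

    // Stamps out one Make<Name>Tuner factory per tuner kind instead of
    // repeating the boilerplate by hand.
    #define DEFINE_AUTOTUNER_FN(name)                                       \
      template <typename T, typename ReturnType, typename... Args>          \
      static name##AutoTuner<T, ReturnType, Args...>* Make##name##Tuner(    \
          ReturnType (*func)(Args...)) {                                    \
        return name##AutoTuner<T, ReturnType, Args...>::Instance(func);     \
      }

    DEFINE_AUTOTUNER_FN(Matmul)
    DEFINE_AUTOTUNER_FN(Transpose)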

const size_t GetSubKey(int64_t idx) { return GetKey(key_, idx); }

private:
int size_;
Contributor: struct members don't need the trailing _. Also, the size_ member is never used.

Contributor Author: Changed as suggested.


struct MatmulDescCreator {
public:
static void Create(cublasLtMatmulDesc_t* op_desc,
Contributor: Why define these as all-static functions? Why not make op_desc, x_desc, etc. direct class members?

Contributor Author: This follows the Conv implementation, to reduce host-side object construction and destruction.

Contributor: That's actually unnecessary: types like cublasLtMatmulDesc_t are really just opaque handles (pointers); the real cost is in the Create and Destroy calls.
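
The point about the handles being cheap to hold but expensive to create can be shown with a small RAII wrapper over the public cuBLASLt API (a sketch, not the PR's MatmulDescCreator):

    #include <cublasLt.h>

    // cublasLtMatmulDesc_t is an opaque handle (a pointer), so storing or
    // copying it is cheap; the real cost sits in the Create/Destroy calls,
    // which is why descriptors are worth caching rather than rebuilt per call.
    class MatmulDescGuard {
     public:
      MatmulDescGuard(cublasComputeType_t compute, cudaDataType_t scale) {
        cublasLtMatmulDescCreate(&desc_, compute, scale);
      }
      ~MatmulDescGuard() { cublasLtMatmulDescDestroy(desc_); }
      cublasLtMatmulDesc_t get() const { return desc_; }

     private:
      cublasLtMatmulDesc_t desc_{nullptr};
    };

    // Usage: MatmulDescGuard desc(CUBLAS_COMPUTE_32F, CUDA_R_32F);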

};

template <typename T>
struct MatmulWithCublasLt<phi::GPUContext, T> {
Contributor: cublasLt only supports GPU, so Context does not need to be a template parameter; this layer of specialization looks avoidable.

Contributor Author: This part was indeed clumsily written, pretty poor code. Fixed.


double alpha64 = 1.0, beta64 = 0.0;
float alpha32 = 1.0f, beta32 = 0.0f;
void *alpha = nullptr, *beta = nullptr;
Contributor: MPType could be used here.

Contributor Author: Fixed.
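
The applied fix presumably looks something like the following (a sketch assuming Paddle's phi::dtype::MPTypeTrait; the exact merged code may differ):

    // MPType resolves to float for float16/bfloat16 and to T itself for
    // float/double, so one pair of scalars replaces the separate
    // alpha64/alpha32 variables and the void* indirection.
    using MPType = typename phi::dtype::MPTypeTrait<T>::Type;
    MPType alpha = static_cast<MPType>(1.0);
    MPType beta = static_cast<MPType>(0.0);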

}

template <typename Context, typename T>
Contributor: What is this layer of wrapping for?

Contributor Author: Because blas supports both CPU and GPU, while blaslt currently only supports GPU; once this is made compatible with blas, the Context passed in may be either CPUContext or GPUContext.
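
A hedged sketch of the dispatch this wrapper enables (the function name is illustrative): a compile-time branch keeps the cuBLASLt path GPU-only while the generic blas path serves both backends.

    #include <type_traits>

    template <typename Context, typename T>
    void MatmulDispatch(const Context& ctx, const T* x, const T* y, T* out) {
      if constexpr (std::is_same_v<Context, phi::GPUContext>) {
        // GPU context: eligible for the cuBLASLt / autotuned path.
      } else {
        // CPU context: only the generic blas implementation applies.
      }
    }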

@@ -64,10 +64,10 @@ void SliceCompute(const Context& ctx,
}
}

funcs::CheckAndUpdateSliceAttrs<int64_t>(in_dims, axes, &starts, &ends);
// funcs::CheckAndUpdateSliceAttrs<int64_t>(in_dims, axes, &starts, &ends);
Contributor: Is this your change, or Zhang Bo's? Please use a clean branch.

Contributor Author: This change was meant for Zhang Bo to make, but I wanted to try its effect first, so I also applied it locally for testing.

@JamesLim-sy (Contributor Author) left a comment:

For now, I have only replied to the review comments on the auto_tune part.

static_cast<int64_t>(dtype_));
}

const size_t QueryKey() const { return key_; }
Contributor Author: GenKey would actually be a better name at line 57.
I'll change that in a separate PR (slipping it into PR 50516).


const size_t QueryKey() const { return key_; }
const size_t GetSize() { return x_dims_.size(); }
const size_t GetSubKey(int64_t idx) { return GetKey(key_, idx); }
Contributor Author: Yes, it has exactly the meaning of GenSubKey.
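
For illustration, a sub-key of this kind is typically a hash-combine of the cached base key with an index; a hypothetical sketch (Paddle's actual GetKey may differ):

    #include <cstddef>
    #include <cstdint>
    #include <functional>

    // Boost-style hash combine: mixes idx into the precomputed key so each
    // kernel variant gets a distinct cache slot derived from the same base key.
    inline std::size_t GetSubKeySketch(std::size_t key, std::int64_t idx) {
      return key ^ (std::hash<std::int64_t>{}(idx) + 0x9e3779b9u +
                    (key << 6) + (key >> 2));
    }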

std::vector<int64_t> y_dims_;
bool trans_x_;
bool trans_y_;
int best_algo_;
Contributor Author: Deleted, and additionally removed the trans_x_ and trans_y_ members as well.

@Xreki force-pushed the add_autotune_kernel_tool branch from 8e24aa4 to c1a7448 on February 21, 2023 06:58
@Xreki force-pushed the add_autotune_kernel_tool branch from 762dcfd to 9044737 on February 21, 2023 08:22
@Xreki force-pushed the add_autotune_kernel_tool branch from 73efccb to c35bdea on February 21, 2023 13:24
@JamesLim-sy force-pushed the add_autotune_kernel_tool branch 2 times, most recently from ca1a089 to febeb01 on February 23, 2023 02:49
@Xreki (Contributor) left a comment:

LGTM. Model accuracy verification shows no problems, so the PR is merged first; the MatMulFunctionImplWithCublasLt implementation has considerable redundancy and needs follow-up optimization.

@Xreki merged commit d4217fc into PaddlePaddle:develop Feb 26, 2023
@JamesLim-sy (Contributor Author) commented:

> LGTM. Model accuracy verification shows no problems, so the PR is merged first; the MatMulFunctionImplWithCublasLt implementation has considerable redundancy and needs follow-up optimization.

Considering the importance of the Matmul OP, the full verification work will be completed as a follow-up; any issues found will be continuously tracked and fixed.
