[SOT][Faster Guard] support `TensorVariable` Dist check #72327

gouzil · 2025-04-17T01:58:51Z

PR Category

Performance Optimization

PR Types

Performance

Description

Add `TensorDistMetaMatchGuard` to check Tensor Dist.

paddle-bot · 2025-04-17T01:58:56Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

Copilot

Pull Request Overview

This PR introduces performance optimizations by adding new guards to check tensor properties, specifically for the stop_gradient flag and the distributed status of a tensor. Key changes include:

- New guard classes (StopGradientMatchGuard and TensorIsDistGuard) in the C++ SOT module.
- Updates to stringified guard construction in the opcode translator.
- New corresponding tests added in the test suite.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| test/sot/test_faster_guard.py | Added tests for checking stop_gradient and is_dist behavior on tensors. |
| python/paddle/jit/sot/opcode_translator/executor/variables/basic.py | Updated guard construction to use new guard classes and modernized shape iteration. |
| paddle/fluid/pybind/sot/guards.h, guards.cc | Introduced new classes and their implementations for stop_gradient and tensor distribution checks. |
| paddle/fluid/pybind/jit.cc | Added bindings for the new guard classes. |

Comments suppressed due to low confidence (2)

test/sot/test_faster_guard.py:214

Consider extending this test case to also verify the false scenario for TensorIsDistGuard (similar to test_stop_gradient_guard) to ensure both positive and negative behaviors are covered.

def test_tensor_is_dist_guard(self):

paddle/fluid/pybind/jit.cc:132

[nitpick] The argument name 'tensor' in the binding for StopGradientMatchGuard (and similarly for TensorIsDistGuard) might be misleading since it represents an expected boolean flag. Consider renaming it (e.g. 'expected_flag') to improve clarity.

.def(py::init<const py::bool_ &>(), py::arg("tensor"));

SigureMo

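I don't quite follow. Does the guard really need to hold the dist info?

If you are going to write a sub-guard, test it at the FasterGuard level first; don't go straight to a guard node.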

SigureMo · 2025-04-19T06:55:24Z

paddle/fluid/pybind/sot/guards.cc

+phi::distributed::DistTensor* get_dist_tensor_from_py_object(PyObject* obj) {
+  if (paddle::pybind::PyCheckTensor(obj)) {
+    auto tensor = reinterpret_cast<paddle::pybind::TensorObject*>(obj)->tensor;
+    if (tensor.is_dist_tensor()) {


Why doesn't this logic reuse `get_dist_tensor_from_tensor`?

SigureMo · 2025-04-19T06:57:18Z

paddle/fluid/pybind/sot/guards.cc

@@ -33,6 +34,12 @@ static inline PyObject* PyObject_CallOneArg(PyObject* func, PyObject* arg) {
 #define Py_IsNone(x) ((x) == Py_None)
 #endif

+#define CheckTensorFromPyObject(value)        \


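Macros should be written in upper case. Otherwise you can't tell this is a macro: it reads like a plain function call, which hides the fact that it introduces control flow here.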

SigureMo · 2025-04-19T07:00:25Z

paddle/fluid/pybind/jit.cc

+             GuardBase,
+             std::shared_ptr<TensorDistMatchGuard>>(
+      *m, "TensorDistMatchGuard", R"DOC(TensorDistMatchGuard Class.)DOC")
+      .def(py::init<const py::object &>(), py::arg("tensor"));


Isn't `init` passed the dist info? Why is the argument called `tensor`?

SigureMo · 2025-04-19T07:01:56Z

paddle/fluid/pybind/jit.cc

+  py::class_<TensorDistMatchGuard,
+             GuardBase,
+             std::shared_ptr<TensorDistMatchGuard>>(
+      *m, "TensorDistMatchGuard", R"DOC(TensorDistMatchGuard Class.)DOC")


This should be named `TensorDistMetaMatchGuard`.

SigureMo · 2025-04-19T07:02:46Z

paddle/fluid/pybind/sot/guards.cc

+    return false;
+  }
+
+  auto dist_tensor = get_dist_tensor_from_tensor(*tensor);


All expected values should be prepared at construction time, not prepared again at check time.
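To illustrate the construction-time point, here is a minimal Python sketch. It is not the PR's C++ implementation: `DistMeta` and `DistMetaGuard` are hypothetical names, and the fields simply mirror the attributes mentioned elsewhere in this thread (`mesh_shape`, `process_ids`, `dims_mapping`, `local_shape`).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DistMeta:
    """Hypothetical stand-in for a tensor's distributed metadata."""

    mesh_shape: Tuple[int, ...]
    process_ids: Tuple[int, ...]
    dims_mapping: Tuple[int, ...]
    local_shape: Tuple[int, ...]


class DistMetaGuard:
    """Hypothetical guard: the expected side is frozen once, in __init__."""

    def __init__(self, expected: Optional[DistMeta]) -> None:
        # All "expected" state is captured here, at construction time.
        self._expected = expected

    def check(self, actual: Optional[DistMeta]) -> bool:
        # check() only compares; it never rebuilds the expected side.
        return actual == self._expected


expected = DistMeta((2, 2), (0, 1, 2, 3), (0, -1), (4, 8))
guard = DistMetaGuard(expected)
same = guard.check(DistMeta((2, 2), (0, 1, 2, 3), (0, -1), (4, 8)))
different = guard.check(DistMeta((4,), (0, 1, 2, 3), (0, -1), (2, 8)))
```

With this split, the per-call cost of `check` is limited to extracting the actual metadata and comparing it, which is presumably the reason for the request.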

codecov-commenter · 2025-04-19T07:35:35Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (develop@a27ff14). Learn more about missing BASE report.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop   #72327   +/-   ##
==========================================
  Coverage           ?   84.94%           
==========================================
  Files              ?        1           
  Lines              ?      897           
  Branches           ?        0           
==========================================
  Hits               ?      762           
  Misses             ?      135           
  Partials           ?        0

Copilot

Pull Request Overview

This PR introduces performance improvements by adding a new guard for checking tensor distribution metadata in SOT operations.

- Introduces TensorDistMetaMatchGuard to support distributed tensor metadata checks.
- Updates the guard chain to include the new TensorDistMetaMatchGuard.
- Binds the new guard class in the pybind interface.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
| --- | --- |
| test/sot/test_faster_guard.py | Adds a commented-out test for the new distributed tensor guard. |
| python/paddle/jit/sot/opcode_translator/executor/variables/basic.py | Inserts a new faster guard node using TensorDistMetaMatchGuard. |
| paddle/fluid/pybind/sot/guards.h | Adds the new TensorDistMetaMatchGuard class declaration and necessary includes. |
| paddle/fluid/pybind/sot/guards.cc | Implements check logic for TensorDistMetaMatchGuard and updates lookup for proper reference management. |
| paddle/fluid/pybind/jit.cc | Binds TensorDistMetaMatchGuard to the Python interface. |

Comments suppressed due to low confidence (1)

test/sot/test_faster_guard.py:205

The test case for the tensor distribution guard is commented out. Consider re-enabling or removing it to ensure that the new guard behavior is covered by tests.

# def test_tensor_is_dist_guard(self):

Copilot · 2025-04-23T03:22:11Z

paddle/fluid/pybind/sot/guards.cc

+      PyObject_CallOneArg(dist_info_from_tensor_func, expr_node);
+  HANDLE_NULL_VALUE(dist_info);
+
+  PyObject* mesh = PyObject_GetAttrString(dist_info, "mesh");


The new guard implementation obtains new references (e.g., for 'mesh', 'mesh_shape', 'process_ids', 'dims_mapping', 'local_shape') without releasing them, which may lead to memory leaks. Consider adding appropriate Py_DECREF calls for these objects after their use.

SigureMo · 2025-04-23T03:32:11Z

test/sot/test_faster_guard.py

+    #     guard_tensor_is_dist = paddle.framework.core.TensorDistMatchGuard(
+    #         tensor
+    #     )
+    #     self.assertTrue(guard_tensor_is_dist.check(tensor))


Why isn't this block uncommented? See the test conditions in test/sot/test_sot_distribution.py.

SigureMo · 2025-04-23T04:30:00Z

paddle/fluid/pybind/sot/guards.cc

+#define HANDLE_NULL_TENSOR(tensor) \
+  {                                \
+    if (!tensor) {                 \
+      return false;                \
+    }                              \
+  }
+
+#define HANDLE_NULL_VALUE_DECREF(value) \
+  {                                     \
+    if ((value) == NULL) {              \
+      Py_DECREF(value);                 \
+      PyErr_Clear();                    \
+      return false;                     \
+    }                                   \
+  }
+
 #define HANDLE_NULL_VALUE(value) \
-  if ((value) == NULL) {         \
-    PyErr_Clear();               \
-    return false;                \
+  {                              \
+    if ((value) == NULL) {       \
+      PyErr_Clear();             \
+      return false;              \
+    }                            \
  }


How is one supposed to decide which of these to use?

SigureMo · 2025-04-23T04:30:58Z

paddle/fluid/pybind/sot/guards.cc

-  // TODO(zrr1999): support multiple exprs
-  auto expr = exprs.back();
-  auto value = expr->eval(frame);
+  PyObject* value = [this, frame]() {


The TODO still needs to stay; this is only a temporary workaround for now.

python/paddle/jit/sot/opcode_translator/executor/variables/basic.py

SigureMo · 2025-04-23T04:34:42Z

paddle/fluid/pybind/sot/guards.cc

+      PyObject* v = exprs.back()->eval(frame);
+      if (v) {
+        Py_INCREF(v);
+      }


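This `v` is most likely a New Reference, so it already carries one reference count. With +1 here and -1 later, doesn't the count end up at 1 again? In other words, it still isn't released?

Is this only here to stay consistent with the tuple case? It doesn't actually release the object yet, right? You could leave a TODO.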
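The arithmetic in this comment can be traced with a toy model. This is a plain-Python sketch, not the CPython API: `ToyObject`, `incref`, and `decref` are hypothetical stand-ins, and it assumes (as the comment suspects) that `eval` hands back a new reference.

```python
class ToyObject:
    """Toy model of a reference-counted object (not the CPython API)."""

    def __init__(self) -> None:
        # A "new reference" starts with a count of 1 owned by the caller.
        self.refcount = 1
        self.freed = False

    def incref(self) -> None:
        self.refcount += 1

    def decref(self) -> None:
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True


v = ToyObject()  # models eval() returning a new reference: count == 1
v.incref()       # models the extra Py_INCREF in the lambda: count == 2
v.decref()       # models the later Py_DECREF: count == 1
leaked = not v.freed
```

Under that assumption the object is still alive after the balanced +1/-1 pair, so one more release would be needed somewhere to actually free it.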

SigureMo · 2025-04-23T06:27:10Z

paddle/fluid/pybind/sot/guards.cc

+bool TensorDistMetaMatchGuard::check(PyObject* value) {
+  HANDLE_NULL_VALUE(value);
+
+  PyObject* expr_node = PyTuple_GetItem(value, 0);


Why is this called `expr_node`? Is it still an expr?

Copilot

Pull Request Overview

This PR enhances performance by adding support for distributed tensor metadata checking via a new guard, TensorDistMetaMatchGuard. Key changes include:

- Adding a new test for the TensorDistMetaMatchGuard in test/sot/test_faster_guard.py.
- Updating the opcode translator to use TensorDistMetaMatchGuard for checking distributed tensor information.
- Extending the binding in pybind11 and jit.cc to expose the new guard for users.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| test/sot/test_faster_guard.py | New test for TensorDistMetaMatchGuard verifying proper behavior for distributed tensors. |
| python/paddle/jit/sot/opcode_translator/executor/variables/basic.py | Updated faster guard creation to include distributed information checks. |
| paddle/fluid/pybind/sot/guards.h | Added necessary includes to support new guard functionality. |
| paddle/fluid/pybind/sot/guards.cc | Implemented TensorDistMetaMatchGuard::check with proper error handling and reference counting. |
| paddle/fluid/pybind/jit.cc | Bound TensorDistMetaMatchGuard to Python via pybind11. |

Comments suppressed due to low confidence (2)

test/sot/test_faster_guard.py:228

Clarify in a test comment that passing DistInfo.from_tensor as a callable in the tuple is intentional and representative of the expected usage for TensorDistMetaMatchGuard. Consider adding tests for additional edge cases, such as handling non-distributed tensors.

self.assertTrue(guard_tensor_is_dist.check((dist_x1, DistInfo.from_tensor)))

paddle/fluid/pybind/sot/guards.cc:253

[nitpick] Ensure that comparing mesh_shape_expected_ is done consistently, for example by explicitly using .value() if mesh_shape_expected_ is an optional, to prevent any ambiguity in type conversion.

if (py::handle(mesh_shape).cast<std::vector<int>>() != mesh_shape_expected_ ||

SigureMo

LGTMeow

TODOs:

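- Reference-count management needs to be addressed; otherwise OOM problems are bound to show up later.
- TensorShapeMatch needs to take constraints on dynamic dims into account; otherwise a guard may be hit incorrectly.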
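A hedged sketch of the second TODO, in plain Python with hypothetical names (`shape_matches`, `DimConstraint`). It assumes dynamic dims are represented as `None` wildcards, which may not be how the real guard encodes them. The point is only that a wildcard without a constraint can accept a shape the compiled code was never specialized for.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical: a constraint is a predicate over the concrete dim value.
DimConstraint = Callable[[int], bool]


def shape_matches(
    expected: Sequence[Optional[int]],
    actual: Sequence[int],
    constraints: Optional[Dict[int, DimConstraint]] = None,
) -> bool:
    """Compare shapes; `None` in `expected` marks a dynamic dim (assumption)."""
    if len(expected) != len(actual):
        return False
    constraints = constraints or {}
    for axis, (want, got) in enumerate(zip(expected, actual)):
        if want is None:
            # Dynamic dim: accept only if its constraint (if any) holds.
            predicate = constraints.get(axis)
            if predicate is not None and not predicate(got):
                return False
        elif want != got:
            return False
    return True


# Without a constraint, any batch size hits the guard, including 1.
unconstrained = shape_matches([None, 8], [1, 8])
# With a constraint such as "batch > 1", the same input misses.
constrained = shape_matches([None, 8], [1, 8], {0: lambda d: d > 1})
```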

SigureMo · 2025-04-26T07:44:54Z

The unittest skip was added for a legitimate reason, so it is exempted.

[SOT][Faster Guard] support TensorVariable part 1

cfc6506

gouzil requested review from SigureMo and zrr1999 as code owners April 17, 2025 01:58

paddle-bot bot added the contributor External developers label Apr 17, 2025

gouzil requested review from Copilot and removed request for SigureMo and zrr1999 April 17, 2025 01:59

Copilot AI reviewed Apr 17, 2025

gouzil added 3 commits April 18, 2025 14:25

Merge branch 'develop' of https://github.com/gouzil/paddle into sot/support_TensorVariable_1

3e65c6c

rollback `python/paddle/jit/sot/opcode_translator/executor/variables/basic.py`

303702f

add TensorDistMatchGuard

fd714ac

gouzil changed the title ~~[SOT][Faster Guard] support TensorVariable part 1~~ [SOT][Faster Guard] support TensorVariable Apr 18, 2025

gouzil added 2 commits April 19, 2025 01:38

fix arg error and add local_shape check

5df2a20

rm TensorDistMatchGuard test

d7ba019

SigureMo reviewed Apr 19, 2025

gouzil added 7 commits April 19, 2025 23:31

Merge branch 'develop' of github.com:gouzil/Paddle into sot/support_TensorVariable_1

7ded081

fix some reviews

ac72e1e

support multiple exprs

29e4f91

fix reference

c075166

fix value is null

5540a86

fix reference

453ec87

fix reference

134da0a

gouzil requested review from SigureMo and Copilot April 23, 2025 03:21

Copilot AI reviewed Apr 23, 2025

SigureMo reviewed Apr 23, 2025

gouzil added 2 commits April 23, 2025 11:52

fix DECREF

9e32db3

fix review

d09beb9

SigureMo reviewed Apr 23, 2025

gouzil added 2 commits April 23, 2025 14:15

fix review

d27736c

Delete comments

0486513

SigureMo reviewed Apr 23, 2025

gouzil added 3 commits April 23, 2025 15:31

fix review

f4ca2bc

fix test

1c88504

fix test error

811cec3

zrr1999 mentioned this pull request Apr 24, 2025

SOT Guard mechanism performance optimization #69264

Open

gouzil added 2 commits April 24, 2025 19:50

Merge branch 'develop' of github.com:gouzil/Paddle into sot/support_TensorVariable_1

0859dec

rename test

1af2954

zrr1999 mentioned this pull request Apr 25, 2025

[SOT][Faster Guard] Implement more guard_tree_expr_node and TensorDtypeVariable.make_faster_guard #72463

Merged

fix process_ids test error

3219385

gouzil requested review from SigureMo and Copilot April 26, 2025 03:13

Copilot AI reviewed Apr 26, 2025

gouzil changed the title ~~[SOT][Faster Guard] support TensorVariable~~ [SOT][Faster Guard] support TensorVariable Dist check Apr 26, 2025

SigureMo approved these changes Apr 26, 2025

SigureMo added the skip-ci: approval label Apr 26, 2025

zrr1999 approved these changes Apr 26, 2025

SigureMo merged commit 026a8df into PaddlePaddle:develop Apr 26, 2025
41 of 43 checks passed

SigureMo deleted the sot/support_TensorVariable_1 branch April 26, 2025 07:48

gouzil mentioned this pull request Apr 27, 2025

[SOT][FasterGuard] fix TensorDistMetaMatchGuard dynamic check #72511

Closed

YqGe585 pushed a commit to YqGe585/Paddle that referenced this pull request May 7, 2025

[SOT][Faster Guard] support TensorVariable Dist check (PaddlePaddle#72327)

8f6a271
