Skip to content

[SOT][Faster Guard] support TensorVariable Dist check #72327

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Apr 26, 2025

Conversation

gouzil
Copy link
Member

@gouzil gouzil commented Apr 17, 2025

PR Category

Performance Optimization

PR Types

Performance

Description

  • 添加TensorDistMetaMatchGuard 用于检查 Tensor Dist

@gouzil gouzil requested review from SigureMo and zrr1999 as code owners April 17, 2025 01:58
Copy link

paddle-bot bot commented Apr 17, 2025

你的PR提交成功,感谢你对开源项目的贡献!
请关注后续CI自动化测试结果,详情请参考Paddle-CI手册
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Apr 17, 2025
@gouzil gouzil requested review from Copilot and removed request for SigureMo and zrr1999 April 17, 2025 01:59
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces performance optimizations by adding new guards to check tensor properties, specifically for the stop_gradient flag and the distributed status of a tensor. Key changes include:

  • New guard classes (StopGradientMatchGuard and TensorIsDistGuard) in the C++ SOT module.
  • Updates to stringified guard construction in the opcode translator.
  • New corresponding tests added in the test suite.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

File Description
test/sot/test_faster_guard.py Added tests for checking stop_gradient and is_dist behavior on tensors.
python/paddle/jit/sot/opcode_translator/executor/variables/basic.py Updated guard construction to use new guard classes and modernized shape iteration.
paddle/fluid/pybind/sot/guards.h, guards.cc Introduced new classes and their implementations for stop_gradient and tensor distribution checks.
paddle/fluid/pybind/jit.cc Added bindings for the new guard classes.
Comments suppressed due to low confidence (2)

test/sot/test_faster_guard.py:214

  • Consider extending this test case to also verify the false scenario for TensorIsDistGuard (similar to test_stop_gradient_guard) to ensure both positive and negative behaviors are covered.
def test_tensor_is_dist_guard(self):

paddle/fluid/pybind/jit.cc:132

  • [nitpick] The argument name 'tensor' in the binding for StopGradientMatchGuard (and similarly for TensorIsDistGuard) might be misleading since it represents an expected boolean flag. Consider renaming it (e.g. 'expected_flag') to improve clarity.
.def(py::init<const py::bool_ &>(), py::arg("tensor"));

@gouzil gouzil changed the title [SOT][Faster Guard] support TensorVariable part 1 [SOT][Faster Guard] support TensorVariable Apr 18, 2025
Copy link
Member

@SigureMo SigureMo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有点没看懂啊,guard 有必要持有 dist info 吗?

如果要写子 guard,那就应该先在 FasterGuard 测试,不要直接上 guard node

phi::distributed::DistTensor* get_dist_tensor_from_py_object(PyObject* obj) {
if (paddle::pybind::PyCheckTensor(obj)) {
auto tensor = reinterpret_cast<paddle::pybind::TensorObject*>(obj)->tensor;
if (tensor.is_dist_tensor()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这部分逻辑为啥不复用 get_dist_tensor_from_tensor

@@ -33,6 +34,12 @@ static inline PyObject* PyObject_CallOneArg(PyObject* func, PyObject* arg) {
#define Py_IsNone(x) ((x) == Py_None)
#endif

#define CheckTensorFromPyObject(value) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

宏的话用大写,不然这样看不出来是宏,仅仅是函数调用看不出来这里会产生控制流

GuardBase,
std::shared_ptr<TensorDistMatchGuard>>(
*m, "TensorDistMatchGuard", R"DOC(TensorDistMatchGuard Class.)DOC")
.def(py::init<const py::object &>(), py::arg("tensor"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init 传的不是 dist info 吗?为什么叫 tensor?

py::class_<TensorDistMatchGuard,
GuardBase,
std::shared_ptr<TensorDistMatchGuard>>(
*m, "TensorDistMatchGuard", R"DOC(TensorDistMatchGuard Class.)DOC")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里应该叫 TensorDistMetaMatchGuard

return false;
}

auto dist_tensor = get_dist_tensor_from_tensor(*tensor);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

所有 expected 应该在构建阶段准备好,而不是在 check 时候再准备

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (develop@a27ff14). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #72327   +/-   ##
==========================================
  Coverage           ?   84.94%           
==========================================
  Files              ?        1           
  Lines              ?      897           
  Branches           ?        0           
==========================================
  Hits               ?      762           
  Misses             ?      135           
  Partials           ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gouzil gouzil requested review from SigureMo and Copilot April 23, 2025 03:21
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces performance improvements by adding a new guard for checking tensor distribution metadata in SOT operations.

  • Introduces TensorDistMetaMatchGuard to support distributed tensor metadata checks.
  • Updates the guard chain to include the new TensorDistMetaMatchGuard.
  • Binds the new guard class in the pybind interface.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
test/sot/test_faster_guard.py Adds a commented-out test for the new distributed tensor guard.
python/paddle/jit/sot/opcode_translator/executor/variables/basic.py Inserts a new faster guard node using TensorDistMetaMatchGuard.
paddle/fluid/pybind/sot/guards.h Adds the new TensorDistMetaMatchGuard class declaration and necessary includes.
paddle/fluid/pybind/sot/guards.cc Implements check logic for TensorDistMetaMatchGuard and updates lookup for proper reference management.
paddle/fluid/pybind/jit.cc Binds TensorDistMetaMatchGuard to the Python interface.
Comments suppressed due to low confidence (1)

test/sot/test_faster_guard.py:205

  • The test case for the tensor distribution guard is commented out. Consider re-enabling or removing it to ensure that the new guard behavior is covered by tests.
# def test_tensor_is_dist_guard(self):

PyObject_CallOneArg(dist_info_from_tensor_func, expr_node);
HANDLE_NULL_VALUE(dist_info);

PyObject* mesh = PyObject_GetAttrString(dist_info, "mesh");
Copy link
Preview

Copilot AI Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new guard implementation obtains new references (e.g., for 'mesh', 'mesh_shape', 'process_ids', 'dims_mapping', 'local_shape') without releasing them, which may lead to memory leaks. Consider adding appropriate Py_DECREF calls for these objects after their use.

Copilot uses AI. Check for mistakes.

# guard_tensor_is_dist = paddle.framework.core.TensorDistMatchGuard(
# tensor
# )
# self.assertTrue(guard_tensor_is_dist.check(tensor))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这块怎么不解开?参考 test/sot/test_sot_distribution.py 的测试条件

Comment on lines 36 to 58
#define HANDLE_NULL_TENSOR(tensor) \
{ \
if (!tensor) { \
return false; \
} \
}

#define HANDLE_NULL_VALUE_DECREF(value) \
{ \
if ((value) == NULL) { \
Py_DECREF(value); \
PyErr_Clear(); \
return false; \
} \
}

#define HANDLE_NULL_VALUE(value) \
if ((value) == NULL) { \
PyErr_Clear(); \
return false; \
{ \
if ((value) == NULL) { \
PyErr_Clear(); \
return false; \
} \
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这几个要如何用哪个?

// TODO(zrr1999): support multiple exprs
auto expr = exprs.back();
auto value = expr->eval(frame);
PyObject* value = [this, frame]() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO 还得有,现在只是临时解决方案

PyObject* v = exprs.back()->eval(frame);
if (v) {
Py_INCREF(v);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个 v 大概率是 New Reference,本身有一个引用计数,这里 +1,后面 -1,最终这里是不是还是 1?也就是说应该还是没释放?

这里应该只是为了和 tuple 保持一致?应该还没做到释放是吧,可以记一个 TODO

bool TensorDistMetaMatchGuard::check(PyObject* value) {
HANDLE_NULL_VALUE(value);

PyObject* expr_node = PyTuple_GetItem(value, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里为啥会叫 expr_node 呢?这还是 expr 么

@gouzil gouzil requested review from SigureMo and Copilot April 26, 2025 03:13
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances performance by adding support for distributed tensor metadata checking via a new guard, TensorDistMetaMatchGuard. Key changes include:

  • Adding a new test for the TensorDistMetaMatchGuard in test/sot/test_faster_guard.py.
  • Updating the opcode translator to use TensorDistMetaMatchGuard for checking distributed tensor information.
  • Extending the binding in pybind11 and jit.cc to expose the new guard for users.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
test/sot/test_faster_guard.py New test for TensorDistMetaMatchGuard verifying proper behavior for distributed tensors.
python/paddle/jit/sot/opcode_translator/executor/variables/basic.py Updated faster guard creation to include distributed information checks.
paddle/fluid/pybind/sot/guards.h Added necessary includes to support new guard functionality.
paddle/fluid/pybind/sot/guards.cc Implemented TensorDistMetaMatchGuard::check with proper error handling and reference counting.
paddle/fluid/pybind/jit.cc Bound TensorDistMetaMatchGuard to Python via pybind11.
Comments suppressed due to low confidence (2)

test/sot/test_faster_guard.py:228

  • Clarify in a test comment that passing DistInfo.from_tensor as a callable in the tuple is intentional and representative of the expected usage for TensorDistMetaMatchGuard. Consider adding tests for additional edge cases, such as handling non-distributed tensors.
self.assertTrue(guard_tensor_is_dist.check((dist_x1, DistInfo.from_tensor)))

paddle/fluid/pybind/sot/guards.cc:253

  • [nitpick] Ensure that comparing mesh_shape_expected_ is done consistently, for example by explicitly using .value() if mesh_shape_expected_ is an optional, to prevent any ambiguity in type conversion.
if (py::handle(mesh_shape).cast<std::vector<int>>() != mesh_shape_expected_ ||

@gouzil gouzil changed the title [SOT][Faster Guard] support TensorVariable [SOT][Faster Guard] support TensorVariable Dist check Apr 26, 2025
Copy link
Member

@SigureMo SigureMo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTMeow 🐾

TODOs:

  • 引用计数管理需要考虑,不然后面必然有 OOM 问题
  • TensorShapeMatch dynamic dim 需要考虑约束,否则可能会错误命中 guard

@SigureMo
Copy link
Member

unittest skip 为合理添加,故豁免

@SigureMo SigureMo merged commit 026a8df into PaddlePaddle:develop Apr 26, 2025
41 of 43 checks passed
@SigureMo SigureMo deleted the sot/support_TensorVariable_1 branch April 26, 2025 07:48
YqGe585 pushed a commit to YqGe585/Paddle that referenced this pull request May 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants