-
Notifications
You must be signed in to change notification settings - Fork 5.7k
[SOT][Faster Guard] support TensorVariable
Dist check
#72327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SOT][Faster Guard] support TensorVariable
Dist check
#72327
Conversation
你的PR提交成功,感谢你对开源项目的贡献! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces performance optimizations by adding new guards to check tensor properties, specifically for the stop_gradient flag and the distributed status of a tensor. Key changes include:
- New guard classes (StopGradientMatchGuard and TensorIsDistGuard) in the C++ SOT module.
- Updates to stringified guard construction in the opcode translator.
- New corresponding tests added in the test suite.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
File | Description |
---|---|
test/sot/test_faster_guard.py | Added tests for checking stop_gradient and is_dist behavior on tensors. |
python/paddle/jit/sot/opcode_translator/executor/variables/basic.py | Updated guard construction to use new guard classes and modernized shape iteration. |
paddle/fluid/pybind/sot/guards.h, guards.cc | Introduced new classes and their implementations for stop_gradient and tensor distribution checks. |
paddle/fluid/pybind/jit.cc | Added bindings for the new guard classes. |
Comments suppressed due to low confidence (2)
test/sot/test_faster_guard.py:214
- Consider extending this test case to also verify the false scenario for TensorIsDistGuard (similar to test_stop_gradient_guard) to ensure both positive and negative behaviors are covered.
def test_tensor_is_dist_guard(self):
paddle/fluid/pybind/jit.cc:132
- [nitpick] The argument name 'tensor' in the binding for StopGradientMatchGuard (and similarly for TensorIsDistGuard) might be misleading since it represents an expected boolean flag. Consider renaming it (e.g. 'expected_flag') to improve clarity.
.def(py::init<const py::bool_ &>(), py::arg("tensor"));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
有点没看懂啊,guard 有必要持有 dist info 吗?
如果要写子 guard,那就应该先在 FasterGuard 测试,不要直接上 guard node
paddle/fluid/pybind/sot/guards.cc
Outdated
phi::distributed::DistTensor* get_dist_tensor_from_py_object(PyObject* obj) { | ||
if (paddle::pybind::PyCheckTensor(obj)) { | ||
auto tensor = reinterpret_cast<paddle::pybind::TensorObject*>(obj)->tensor; | ||
if (tensor.is_dist_tensor()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这部分逻辑为啥不复用 get_dist_tensor_from_tensor
paddle/fluid/pybind/sot/guards.cc
Outdated
@@ -33,6 +34,12 @@ static inline PyObject* PyObject_CallOneArg(PyObject* func, PyObject* arg) { | |||
#define Py_IsNone(x) ((x) == Py_None) | |||
#endif | |||
|
|||
#define CheckTensorFromPyObject(value) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
宏的话用大写,不然这样看不出来是宏,仅仅是函数调用看不出来这里会产生控制流
paddle/fluid/pybind/jit.cc
Outdated
GuardBase, | ||
std::shared_ptr<TensorDistMatchGuard>>( | ||
*m, "TensorDistMatchGuard", R"DOC(TensorDistMatchGuard Class.)DOC") | ||
.def(py::init<const py::object &>(), py::arg("tensor")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
init 传的不是 dist info 吗?为什么叫 tensor?
paddle/fluid/pybind/jit.cc
Outdated
py::class_<TensorDistMatchGuard, | ||
GuardBase, | ||
std::shared_ptr<TensorDistMatchGuard>>( | ||
*m, "TensorDistMatchGuard", R"DOC(TensorDistMatchGuard Class.)DOC") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里应该叫 TensorDistMetaMatchGuard
paddle/fluid/pybind/sot/guards.cc
Outdated
return false; | ||
} | ||
|
||
auto dist_tensor = get_dist_tensor_from_tensor(*tensor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
所有 expected 应该在构建阶段准备好,而不是在 check 时候再准备
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #72327 +/- ##
==========================================
Coverage ? 84.94%
==========================================
Files ? 1
Lines ? 897
Branches ? 0
==========================================
Hits ? 762
Misses ? 135
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces performance improvements by adding a new guard for checking tensor distribution metadata in SOT operations.
- Introduces TensorDistMetaMatchGuard to support distributed tensor metadata checks.
- Updates the guard chain to include the new TensorDistMetaMatchGuard.
- Binds the new guard class in the pybind interface.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
test/sot/test_faster_guard.py | Adds a commented-out test for the new distributed tensor guard. |
python/paddle/jit/sot/opcode_translator/executor/variables/basic.py | Inserts a new faster guard node using TensorDistMetaMatchGuard. |
paddle/fluid/pybind/sot/guards.h | Adds the new TensorDistMetaMatchGuard class declaration and necessary includes. |
paddle/fluid/pybind/sot/guards.cc | Implements check logic for TensorDistMetaMatchGuard and updates lookup for proper reference management. |
paddle/fluid/pybind/jit.cc | Binds TensorDistMetaMatchGuard to the Python interface. |
Comments suppressed due to low confidence (1)
test/sot/test_faster_guard.py:205
- The test case for the tensor distribution guard is commented out. Consider re-enabling or removing it to ensure that the new guard behavior is covered by tests.
# def test_tensor_is_dist_guard(self):
PyObject_CallOneArg(dist_info_from_tensor_func, expr_node); | ||
HANDLE_NULL_VALUE(dist_info); | ||
|
||
PyObject* mesh = PyObject_GetAttrString(dist_info, "mesh"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new guard implementation obtains new references (e.g., for 'mesh', 'mesh_shape', 'process_ids', 'dims_mapping', 'local_shape') without releasing them, which may lead to memory leaks. Consider adding appropriate Py_DECREF calls for these objects after their use.
Copilot uses AI. Check for mistakes.
test/sot/test_faster_guard.py
Outdated
# guard_tensor_is_dist = paddle.framework.core.TensorDistMatchGuard( | ||
# tensor | ||
# ) | ||
# self.assertTrue(guard_tensor_is_dist.check(tensor)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这块怎么不解开?参考 test/sot/test_sot_distribution.py
的测试条件
paddle/fluid/pybind/sot/guards.cc
Outdated
#define HANDLE_NULL_TENSOR(tensor) \ | ||
{ \ | ||
if (!tensor) { \ | ||
return false; \ | ||
} \ | ||
} | ||
|
||
#define HANDLE_NULL_VALUE_DECREF(value) \ | ||
{ \ | ||
if ((value) == NULL) { \ | ||
Py_DECREF(value); \ | ||
PyErr_Clear(); \ | ||
return false; \ | ||
} \ | ||
} | ||
|
||
#define HANDLE_NULL_VALUE(value) \ | ||
if ((value) == NULL) { \ | ||
PyErr_Clear(); \ | ||
return false; \ | ||
{ \ | ||
if ((value) == NULL) { \ | ||
PyErr_Clear(); \ | ||
return false; \ | ||
} \ | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这几个要如何用哪个?
// TODO(zrr1999): support multiple exprs | ||
auto expr = exprs.back(); | ||
auto value = expr->eval(frame); | ||
PyObject* value = [this, frame]() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO 还得有,现在只是临时解决方案
PyObject* v = exprs.back()->eval(frame); | ||
if (v) { | ||
Py_INCREF(v); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个 v 大概率是 New Reference,本身有一个引用计数,这里 +1,后面 -1,最终这里是不是还是 1?也就是说应该还是没释放?
这里应该只是为了和 tuple 保持一致?应该还没做到释放是吧,可以记一个 TODO
paddle/fluid/pybind/sot/guards.cc
Outdated
bool TensorDistMetaMatchGuard::check(PyObject* value) { | ||
HANDLE_NULL_VALUE(value); | ||
|
||
PyObject* expr_node = PyTuple_GetItem(value, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里为啥会叫 expr_node
呢?这还是 expr 么
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances performance by adding support for distributed tensor metadata checking via a new guard, TensorDistMetaMatchGuard. Key changes include:
- Adding a new test for the TensorDistMetaMatchGuard in test/sot/test_faster_guard.py.
- Updating the opcode translator to use TensorDistMetaMatchGuard for checking distributed tensor information.
- Extending the binding in pybind11 and jit.cc to expose the new guard for users.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
test/sot/test_faster_guard.py | New test for TensorDistMetaMatchGuard verifying proper behavior for distributed tensors. |
python/paddle/jit/sot/opcode_translator/executor/variables/basic.py | Updated faster guard creation to include distributed information checks. |
paddle/fluid/pybind/sot/guards.h | Added necessary includes to support new guard functionality. |
paddle/fluid/pybind/sot/guards.cc | Implemented TensorDistMetaMatchGuard::check with proper error handling and reference counting. |
paddle/fluid/pybind/jit.cc | Bound TensorDistMetaMatchGuard to Python via pybind11. |
Comments suppressed due to low confidence (2)
test/sot/test_faster_guard.py:228
- Clarify in a test comment that passing DistInfo.from_tensor as a callable in the tuple is intentional and representative of the expected usage for TensorDistMetaMatchGuard. Consider adding tests for additional edge cases, such as handling non-distributed tensors.
self.assertTrue(guard_tensor_is_dist.check((dist_x1, DistInfo.from_tensor)))
paddle/fluid/pybind/sot/guards.cc:253
- [nitpick] Ensure that comparing mesh_shape_expected_ is done consistently, for example by explicitly using .value() if mesh_shape_expected_ is an optional, to prevent any ambiguity in type conversion.
if (py::handle(mesh_shape).cast<std::vector<int>>() != mesh_shape_expected_ ||
TensorVariable
Dist check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unittest skip 为合理添加,故豁免 |
PR Category
Performance Optimization
PR Types
Performance
Description
TensorDistMetaMatchGuard
用于检查 Tensor Dist