FIX-#7675: Allow backend switching to backends other than provided arguments #7679
base: main
Conversation
…n provided arguments Signed-off-by: Jonathan Shi <jonathan.shi@snowflake.com>
modin/core/storage_formats/base/query_compiler_calculator.py
I have some minor comments. Also, I have some questions:
- Have you run any benchmarks with this change? Does it solve the pathological merge case that motivated it?
- Is it possible or likely that we switch to an unexpected and/or suboptimal backend during multi-dataset operations? E.g., could we switch to Ray for a Snowflake-pandas merge? Is there a good way to test whether this happens in practice?
I ran the pathological merge as a sanity check and it went from 140s -> 6s (it takes around 2s with hybrid disabled, but there's some thrashing because an unnecessary switch occurs after
Right now we don't allow automatic switching to Ray, as it's omitted from
@disable_logging
def max_cost(self) -> int:
@classmethod
def max_cost(cls) -> int:
So max_cost may be implemented using some innate knowledge of the object, so /technically/ it cannot be a class method. In a practical sense I don't think we ever set this to anything other than COST_IMPOSSIBLE, so I'm wondering if the function should be removed in favor of just using COST_IMPOSSIBLE as the static return value.
max_cost is the maximum cost allowed by this query compiler across all data movements. This method sets a normalized upper bound for situations where multiple data frames from different engines all need to move to the same engine. The value returned by this method can exceed QCCoercionCost.COST_IMPOSSIBLE.
We set it to COST_IMPOSSIBLE * 1e10 in the Snowflake QC. I turned it into a class method so that backends w/o arguments present in the operation (for example, calculating cost for the Cloud backend in pd.concat([df_pico1, df_pico2])) can report a max cost.
Should I leave the original max_cost method as-is, and introduce an alternate static/classmethod max_cost for these cases?
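The trade-off being debated above can be sketched with stand-in classes rather than Modin's real query compiler hierarchy. The class names, the COST_IMPOSSIBLE value, and the 1e10 multiplier below are taken from the thread's discussion but are otherwise illustrative assumptions, not Modin's actual implementation:

```python
COST_IMPOSSIBLE = 1_000_000  # stand-in sentinel; Modin's real constant may differ


class BaseQueryCompiler:
    @classmethod
    def max_cost(cls) -> int:
        # As a classmethod, a backend can report its cap even when no
        # argument in the operation belongs to it (the pd.concat case above).
        return COST_IMPOSSIBLE


class SnowflakeQueryCompiler(BaseQueryCompiler):
    @classmethod
    def max_cost(cls) -> int:
        # The thread says the Snowflake QC raises the cap well above
        # COST_IMPOSSIBLE so that moving many frames to it stays viable.
        return int(COST_IMPOSSIBLE * 1e10)


# A caster can now query a backend's cap without any live instance of it:
caps = {
    qc.__name__: qc.max_cost()
    for qc in (BaseQueryCompiler, SnowflakeQueryCompiler)
}
```

This is exactly what an instance-based max_cost cannot do: with `def max_cost(self)`, the caster would need a constructed query compiler for a backend that contributes no arguments to the operation.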
from modin.logging.metrics import emit_metric

def all_switchable_backends() -> list[str]:
I don't understand why this cannot be part of envvars
We could make this configurable, but I just refactored this out from a function in the QC caster:
modin/modin/core/storage_formats/pandas/query_compiler_caster.py
Lines 800 to 807 in b002708
for backend in Backend.get_active_backends():
    if backend in ("Ray", "Unidist", "Dask"):
        # Disable automatically switching to these engines for now, because
        # 1) _get_prepared_factory_for_backend() currently calls
        #    _initialize_engine(), which starts up the ray/dask/unidist
        #    processes
        # 2) we can't decide to switch to unidist in the middle of execution.
        continue
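The refactor described in this comment can be sketched as a standalone helper. The exclusion list and the reasons in the comments come from the excerpt above; `get_active_backends` is simulated here rather than imported from Modin's config, so the names are assumptions:

```python
# Engines Modin cannot automatically switch to, per the excerpt above.
_NON_SWITCHABLE = ("Ray", "Unidist", "Dask")


def get_active_backends() -> list[str]:
    # Stand-in for Backend.get_active_backends(); real values come from
    # Modin's configuration.
    return ["Pandas", "Ray", "Dask", "Python_Test"]


def all_switchable_backends() -> list[str]:
    # Skip engines we cannot automatically switch to: preparing their
    # factories would start the ray/dask/unidist processes, and unidist
    # cannot be chosen in the middle of execution.
    return [b for b in get_active_backends() if b not in _NON_SWITCHABLE]
```

Pulling the loop out of the caster this way also gives a single place to hang a future environment variable, if the exclusion list ever needs to be configurable.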
""" | ||
Calculate which query compiler we should cast to. | ||
Switching calculation is performed as follows: |
+1 to the documentation here.
).io_cls.query_compiler_cls,
)
if preop_switch:
    # Initialize backend data for any backends not found among query compiler arguments.
Can we create a new environment variable for this behavior, maybe defaulting to on, so we can perf test with it on and off?
preop_switch is set for individual functions registered by register_function_for_pre_op_switch when BackendCalculator is initialized. What would you want the environment variable to do?
@sfc-gh-mvashishtha After doing some more testing, I realized it made more sense to only switch to other backends if we explicitly registered a function as a switch point, as is the case for 0/1-argument functions. I've updated the code to reflect this.
What do these changes do?
After this PR, AutoSwitchBackend has 2 separate behaviors for functions with multiple query compiler arguments. For example, after calling pd.concat([A1, A2]), we previously would only consider switching to the backends of the query compilers of arguments A1 and A2. Now, after calling register_function_for_pre_op_switch(class_name=None, backend="Backend_A", method="concat"), Modin may move arguments to some third backend Backend_B.
- flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- git commit -s
- docs/development/architecture.rst is up-to-date