-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Add asynchronous load method #10327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add asynchronous load method #10327
Conversation
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
These test failures are actually non-deterministic. The test tries to do vectorized indexing, and expects an error about vectorized indexing not being supported to be raised. But the test sometimes fails because an error about orthogonal indexing not being supported is raised instead. What seems to be happening is that for the exact same indexer, sometimes the indexing call goes through the vectorized indexing codepath first and sometimes it goes through the orthogonal indexing codepath first. I think in both cases it gives the same result, but the order of execution can differ. This script replicates the behaviour on this branch. If you run it repeatedly you will find that the behaviour changes between runs, as the error raised is inconsistent. #!/usr/bin/env python3
"""
Standalone reproducer for the flaky async test behavior.
"""
import asyncio
import xarray as xr
import zarr
from xarray.tests.test_dataset import create_test_data
async def test_flaky_behavior():
"""Reproduce the exact test scenario that shows flaky behavior."""
# Create zarr store with format 3
memorystore = zarr.storage.MemoryStore({})
ds = create_test_data()
ds.to_zarr(memorystore, zarr_format=3, consolidated=False)
# Open the dataset
ds = xr.open_zarr(memorystore, consolidated=False, chunks=None)
# Create the exact same indexer as the failing test
indexer = {
"dim1": xr.Variable(data=[2, 3], dims="points"),
"dim2": xr.Variable(data=[1, 3], dims="points"),
}
# Apply isel and try load_async
try:
await ds.isel(**indexer).load_async()
print("ERROR: Should have raised NotImplementedError!")
except NotImplementedError as e:
error_msg = str(e)
if "vectorized async indexing" in error_msg:
print("VECTORIZED")
elif "orthogonal async indexing" in error_msg:
print("ORTHOGONAL")
else:
print(f"OTHER: {error_msg}")
if __name__ == "__main__":
asyncio.run(test_flaky_behavior()) This other script replicates similar behaviour on #!/usr/bin/env python3
"""
Test sync .load() to see which indexing codepath is taken.
"""
import xarray as xr
import zarr
from xarray.tests.test_dataset import create_test_data
def test_sync_load():
"""Test with sync .load() instead of .load_async()"""
# Create zarr store with format 3
memorystore = zarr.storage.MemoryStore({})
ds = create_test_data()
ds.to_zarr(memorystore, zarr_format=3, consolidated=False)
# Open the dataset
ds = xr.open_zarr(memorystore, consolidated=False, chunks=None)
# Create the exact same indexer as the failing test
indexer = {
"dim1": xr.Variable(data=[2, 3], dims="points"),
"dim2": xr.Variable(data=[1, 3], dims="points"),
}
# Apply isel and load (sync)
result = ds.isel(**indexer).load()
print("SYNC_LOAD_COMPLETED")
if __name__ == "__main__":
test_sync_load() # need to add these debugging print statments
class ZarrArrayWrapper:
def __getitem__(self, key):
array = self._array
if isinstance(key, indexing.BasicIndexer):
print(f"DEBUG: SYNC BasicIndexer: {key}")
method = self._getitem
elif isinstance(key, indexing.VectorizedIndexer):
print(f"DEBUG: SYNC VectorizedIndexer: {key}")
method = self._vindex
elif isinstance(key, indexing.OuterIndexer):
print(f"DEBUG: SYNC OuterIndexer: {key}")
method = self._oindex I think this is somehow to do with variable or indexer ordering not being deterministic - which could be due to use of dicts internally perhaps? I can hide this weirdness by simply changing my test to be happy with either error. But I don't know if this is indicative of a bug that needs to be fixed. |
Either way, the decision on whether or not to use basic, orthogonal, or vectorized indexing depends on the types of indexers you pass to. According to xarray/xarray/core/variable.py Lines 661 to 672 in 54ac2fe
_broadcast_indexes_vectorized , which should not return outer indexers.
|
I'm passing exactly the same indexers every time.
It should, but apparently it doesn't always! If you run either of those scripts, you will see OuterIndexers are being created. |
As I thought, with this change applied (in a7918e4) now everything seems to be passing. (I don't think the warnings causing readthedocs or the upstream mypy failures are anything to do with this PR) |
I think I figured out why: |
Riiiiiight, thank you. So actually there's another way for me to dodge this problem in my test: just index into a single |
you can also use a single-variable dataset, but yeah, that would eliminate the issue |
…etween different variables
Adds an
.async_load()
method toVariable
, which works by plumbing asyncget_duck_array
all the way down until it finally gets to the async methods zarr v3 exposes.Needs a lot of refactoring before it could be merged, but it works.
whats-new.rst
api.rst
API:
Variable.load_async
DataArray.load_async
Dataset.load_async
DataTree.load_async
load_dataset
?load_dataarray
?