Skip to content

Add asynchronous load method #10327

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 120 commits into
base: main
Choose a base branch
from
Open

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented May 16, 2025

Adds an .async_load() method to Variable, which works by plumbing async get_duck_array all the way down until it finally gets to the async methods zarr v3 exposes.

Needs a lot of refactoring before it could be merged, but it works.

API:

  • Variable.load_async
  • DataArray.load_async
  • Dataset.load_async
  • DataTree.load_async
  • load_dataset?
  • load_dataarray?

@TomNicholas
Copy link
Member Author

TomNicholas commented Aug 11, 2025

These test failures are actually non-deterministic. The test tries to do vectorized indexing, and expects an error about vectorized indexing not being supported to be raised. But the test sometimes fails because an error about orthogonal indexing not being supported is raised instead.

What seems to be happening is that for the exact same indexer, sometimes the indexing call goes through the vectorized indexing codepath first and sometimes it goes through the orthogonal indexing codepath first. I think in both cases it gives the same result, but the order of execution can differ.

This script replicates the behaviour on this branch. If you run it repeatedly you will find that the behaviour changes between runs, as the error raised is inconsistent.

#!/usr/bin/env python3
"""
Standalone reproducer for the flaky async test behavior.
"""

import asyncio
import xarray as xr
import zarr
from xarray.tests.test_dataset import create_test_data


async def test_flaky_behavior():
    """Reproduce the exact test scenario that shows flaky behavior."""
    
    # Create zarr store with format 3
    memorystore = zarr.storage.MemoryStore({})
    ds = create_test_data()
    ds.to_zarr(memorystore, zarr_format=3, consolidated=False)
    
    # Open the dataset
    ds = xr.open_zarr(memorystore, consolidated=False, chunks=None)
    
    # Create the exact same indexer as the failing test
    indexer = {
        "dim1": xr.Variable(data=[2, 3], dims="points"),
        "dim2": xr.Variable(data=[1, 3], dims="points"),
    }
    
    # Apply isel and try load_async
    try:
        await ds.isel(**indexer).load_async()
        print("ERROR: Should have raised NotImplementedError!")
    except NotImplementedError as e:
        error_msg = str(e)
        if "vectorized async indexing" in error_msg:
            print("VECTORIZED")
        elif "orthogonal async indexing" in error_msg:
            print("ORTHOGONAL")  
        else:
            print(f"OTHER: {error_msg}")


if __name__ == "__main__":
    asyncio.run(test_flaky_behavior())

This other script replicates similar behaviour on main. To reveal the use of different codepaths this second script requires inserting debugging print statements. If you run this repeatedly you will see the order of the print statements changes between runs.

#!/usr/bin/env python3
"""
Test sync .load() to see which indexing codepath is taken.
"""

import xarray as xr
import zarr
from xarray.tests.test_dataset import create_test_data


def test_sync_load():
    """Test with sync .load() instead of .load_async()"""
    
    # Create zarr store with format 3
    memorystore = zarr.storage.MemoryStore({})
    ds = create_test_data()
    ds.to_zarr(memorystore, zarr_format=3, consolidated=False)
    
    # Open the dataset
    ds = xr.open_zarr(memorystore, consolidated=False, chunks=None)
    
    # Create the exact same indexer as the failing test
    indexer = {
        "dim1": xr.Variable(data=[2, 3], dims="points"),
        "dim2": xr.Variable(data=[1, 3], dims="points"),
    }
    
    # Apply isel and load (sync)
    result = ds.isel(**indexer).load()
    print("SYNC_LOAD_COMPLETED")


if __name__ == "__main__":
    test_sync_load()
# need to add these debugging print statments
class ZarrArrayWrapper:
    def __getitem__(self, key):
        array = self._array
        if isinstance(key, indexing.BasicIndexer):
            print(f"DEBUG: SYNC BasicIndexer: {key}")
            method = self._getitem
        elif isinstance(key, indexing.VectorizedIndexer):
            print(f"DEBUG: SYNC VectorizedIndexer: {key}")
            method = self._vindex
        elif isinstance(key, indexing.OuterIndexer):
            print(f"DEBUG: SYNC OuterIndexer: {key}")
            method = self._oindex

I think this is somehow to do with variable or indexer ordering not being deterministic - which could be due to use of dicts internally perhaps?

I can hide this weirdness by simply changing my test to be happy with either error. But I don't know if this is indicative of a bug that needs to be fixed.

@keewis
Copy link
Collaborator

keewis commented Aug 11, 2025

which could be due to use of dicts internally perhaps?

dict is deterministic since python 3.7, what you're looking for is set.

Either way, the decision on whether or not to use basic, orthogonal, or vectorized indexing depends on the types of indexers you pass to. According to

dims = []
for k, d in zip(key, self.dims, strict=True):
if isinstance(k, Variable):
if len(k.dims) > 1:
return self._broadcast_indexes_vectorized(key)
dims.append(k.dims[0])
elif not isinstance(k, integer_types):
dims.append(d)
if len(set(dims)) == len(dims):
return self._broadcast_indexes_outer(key)
return self._broadcast_indexes_vectorized(key)
the presence of two variable indexers with a single, common dimension should go into _broadcast_indexes_vectorized, which should not return outer indexers.

@TomNicholas
Copy link
Member Author

the decision on whether or not to use basic, orthogonal, or vectorized indexing depends on the types of indexers you pass to.

I'm passing exactly the same indexers every time.

the presence of two variable indexers with a single, common dimension should go into _broadcast_indexes_vectorized, which should not return outer indexers.

It should, but apparently it doesn't always! If you run either of those scripts, you will see OuterIndexers are being created.

@TomNicholas
Copy link
Member Author

TomNicholas commented Aug 11, 2025

changing my test to be happy with either error

As I thought, with this change applied (in a7918e4) now everything seems to be passing. (I don't think the warnings causing readthedocs or the upstream mypy failures are anything to do with this PR)

@keewis
Copy link
Collaborator

keewis commented Aug 11, 2025

I think I figured out why: create_test_data creates a dataset that has three data variables, two of which do not have both indexed dims. Thus, if these variables are indexed first you get the orthogonal index error (indexing along one dim is always basic or orthogonal indexing), while if the other variable is indexed first you get the vectorized index error.

@TomNicholas
Copy link
Member Author

TomNicholas commented Aug 11, 2025

I think I figured out why: create_test_data creates a dataset that has three data variables, two of which do not have both indexed dims. Thus, if these variables are indexed first you get the orthogonal index error (indexing along one dim is always basic or orthogonal indexing), while if the other variable is indexed first you get the vectorized index error.

Riiiiiight, thank you.

So actually there's another way for me to dodge this problem in my test: just index into a single Variable instead of into a Dataset. Then there can't be a race condition between variables.

@keewis
Copy link
Collaborator

keewis commented Aug 11, 2025

you can also use a single-variable dataset, but yeah, that would eliminate the issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration tools dependencies Pull requests that update a dependency file enhancement io run-upstream Run upstream CI topic-backends topic-indexing topic-NamedArray Lightweight version of Variable topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add an asynchronous load method?
5 participants