Skip to content

Idea: Ray Plasma backed zarr store for intermediate results #783

@vladidobro

Description

@vladidobro

Hi, cubed community.

I have been thinking about the option to port my zarr+dask-array+xarray+dask-kubernetes workloads to cubed+ray, and currently I feel like the necessity to persist intermediate results to an external storage is a big blocker and performance killer for me, compared to dask's p2p model.

I am no expert in neither ray nor cubed, but I think there might be a simple way to achieve somewhat similar functionality to dask, if we used as the zarr backend for intermediate arrays a (not yet implemented) zarr store backed by the Ray Plasma object store.
That way, we could offload many of the difficult things to ray:

  • Garbage collect the intermediate zarr stores as soon as they are not needed anymore
  • Task locality: consecutive tasks on the same chunks could be scheduled on the same worker (though I understand this is not a big issue for cubed, thanks to its fusing optimization)
  • Sharing data among workers without the need for external storage

Idea for implementation:

  • For each zarr store, there would be one ray actor
  • The actor would keep all zarr metadata
  • All the chunks would be stored in the Plasma store
  • The actor would keep a mapping between Plasma object id <-> zarr chunk key

What do you think about it?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions