Idea: Ray Plasma backed zarr store for intermediate results

Hi, cubed community.

I have been thinking about the option to port my zarr+dask-array+xarray+dask-kubernetes workloads to cubed+ray, and currently I feel like the necessity to persist intermediate results to an external storage is a big blocker and performance killer for me, compared to dask's p2p model.

I am no expert in neither ray nor cubed, but I think there might be a simple way to achieve somewhat similar functionality to dask, if we used as the zarr backend for intermediate arrays a (not yet implemented) zarr store backed by the Ray Plasma object store.
That way, we could offload many of the difficult things to ray:
- Garbage collect the intermediate zarr stores as soon as they are not needed anymore
- Task locality: consecutive tasks on the same chunks could be scheduled on the same worker (though I understand this is not a big issue for cubed, thanks to its fusing optimization)
- Sharing data among workers without the need for external storage


Idea for implementation:
- For each zarr store, there would be one ray actor
- The actor would keep all zarr metadata
- All the chunks would be stored in the Plasma store
- The actor would keep a mapping between Plasma object id <-> zarr chunk key

What do you think about it?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Idea: Ray Plasma backed zarr store for intermediate results #783

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Idea: Ray Plasma backed zarr store for intermediate results #783

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions