Hi, cubed community.
I have been thinking about porting my zarr + dask-array + xarray + dask-kubernetes workloads to cubed + ray. Right now, the need to persist intermediate results to external storage feels like a big blocker and performance killer for me, compared to dask's p2p model.
I am no expert in either ray or cubed, but I think there might be a simple way to get functionality somewhat similar to dask's: for intermediate arrays, use a (not yet implemented) zarr store backed by the Ray Plasma object store as the zarr backend.
That way, we could offload many of the difficult things to ray:
- Garbage collect the intermediate zarr stores as soon as they are not needed anymore
- Task locality: consecutive tasks on the same chunks could be scheduled on the same worker (though I understand this is not a big issue for cubed, thanks to its fusing optimization)
- Sharing data among workers without the need for external storage
Idea for implementation:
- For each zarr store, there would be one ray actor
- The actor would keep all zarr metadata
- All the chunks would be stored in the Plasma store
- The actor would keep a mapping between Plasma object id <-> zarr chunk key
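A minimal sketch of the mapping described above, assuming a zarr v2-style store (a `MutableMapping` from key to bytes). To keep it runnable without a cluster, the object store is a hypothetical in-memory stand-in, `LocalObjectStore`; with ray, `put`/`get` would map to `ray.put`/`ray.get`, and `free` would likely just mean dropping the object reference so ray's reference counting can reclaim the chunk. The class names and the metadata-key check are illustrative, not an existing cubed or zarr API, and the store class would still need to be wrapped in a ray actor.

```python
from collections.abc import MutableMapping
import itertools


class LocalObjectStore:
    """Hypothetical stand-in for the ray object store (put/get/free)."""

    def __init__(self):
        self._objects = {}
        self._ids = itertools.count()

    def put(self, value):
        ref = next(self._ids)  # opaque reference, like an object id
        self._objects[ref] = value
        return ref

    def get(self, ref):
        return self._objects[ref]

    def free(self, ref):
        self._objects.pop(ref, None)


class ObjectStoreBackedStore(MutableMapping):
    """Key -> bytes store: metadata kept inline, chunks kept in the object store."""

    # Assumption: zarr v2 metadata file names
    _METADATA_NAMES = {".zarray", ".zattrs", ".zgroup", ".zmetadata"}

    def __init__(self, object_store):
        self._object_store = object_store
        self._metadata = {}    # metadata key -> bytes
        self._chunk_refs = {}  # zarr chunk key -> object reference

    def _is_metadata(self, key):
        return key.rsplit("/", 1)[-1] in self._METADATA_NAMES

    def __getitem__(self, key):
        if key in self._metadata:
            return self._metadata[key]
        # Raises KeyError for unknown keys, as a mapping should
        return self._object_store.get(self._chunk_refs[key])

    def __setitem__(self, key, value):
        value = bytes(value)
        if self._is_metadata(key):
            self._metadata[key] = value
            return
        old_ref = self._chunk_refs.pop(key, None)
        if old_ref is not None:
            self._object_store.free(old_ref)  # release the overwritten chunk
        self._chunk_refs[key] = self._object_store.put(value)

    def __delitem__(self, key):
        if key in self._metadata:
            del self._metadata[key]
        else:
            self._object_store.free(self._chunk_refs.pop(key))

    def __iter__(self):
        yield from self._metadata
        yield from self._chunk_refs

    def __len__(self):
        return len(self._metadata) + len(self._chunk_refs)
```

Deleting a key (or the whole store) releases the matching chunk references, which is where the garbage collection of intermediate arrays would come from.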
What do you think about it?