-
Notifications
You must be signed in to change notification settings - Fork 7
Add json data type #9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
100%. I have raised this as well:
That would be my preferred alternative. But the FYI in
|
As the person who implemented this, I agree! 🙃 I was just copying what Zarr V2 did and trying to get to some level of feature parity quickly. But it's not too late to change this. This issue does highlight the challenge of coupling between dtypes and codecs. It also touches on logical types vs. physical types.
I would favor this route. But I'd appreciate folks thoughts on how to specify these relationships between dtypes and codecs more formally. Like, is there someway for a dtype extension to say "this dtype must be used in conjunction with the |
This coupling is an interesting problem. I think it is currently the other way around: codecs define what dtypes they can (de)serialize. |
The coupling is inherent but not necessarily a problem --- but it would be better to be able to add bidirectional references in both the codec and data type, rather than unidirectional references. If the core codecs were also defined in this repository that would simplify things, though. As far as The @LDeakin One potential issue with what you have implemented in zarrs is that there is an advantage in placing the index at the end rather than the beginning, which is that it is then possible to stream out the data. If the index is at the beginning, then you have to first calculate the size of every element, and e.g. for json if the elements are not already stored encoded it is difficult to avoid buffering or redundant work. On the other hand I suppose it is easier to do a streaming decode if the index is at the beginning. We could add an |
Sounds good |
This defines a
json
data type and adds it as a supported data type to thevlen-utf8
codec.In my opinion it is a bit unfortunate that
vlen-utf8
andvlen-bytes
were added as separate codecs rather than just usingvlen
for both. In zarr v2 a separate identifier was needed because the data type for both was listed as "O" and therefore the real data type had to be determined based on the codec identifier. In zarr v3 that problem does not exist.In this PR I decided to just allow the
json
data type in thevlen-utf8
codec, because encoded JSON is itself valid UTF-8.Possible alternatives:
vlen-json
codec instead.vlen
codec that will supportbytes
,string
, andjson
and makevlen-bytes
andvlen-utf8
deprecated.