Add json data type #9

jbms · 2025-04-30T06:20:55Z

This defines a json data type and adds it as a supported data type to the vlen-utf8 codec.

In my opinion it is a bit unfortunate that vlen-utf8 and vlen-bytes were added as separate codecs rather than just using vlen for both. In zarr v2 a separate identifier was needed because the data type for both was listed as "O" and therefore the real data type had to be determined based on the codec identifier. In zarr v3 that problem does not exist.

In this PR I decided to just allow the json data type in the vlen-utf8 codec, because encoded JSON is itself valid UTF-8.
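To illustrate why encoded JSON fits in a UTF-8 string codec, here is a minimal sketch. It assumes the interleaved layout used by the zarr v2 vlen codecs (a little-endian uint32 item count, then each item as a uint32 byte length followed by its bytes). The function names are hypothetical and are not part of this PR or of any library API.

```python
import json
import struct


def encode_vlen_utf8(items):
    """Pack strings in an interleaved, length-prefixed layout (assumed layout)."""
    out = bytearray(struct.pack("<I", len(items)))  # header: item count
    for text in items:
        data = text.encode("utf-8")
        out += struct.pack("<I", len(data))  # per-item byte length
        out += data
    return bytes(out)


def decode_vlen_utf8(buf):
    """Inverse of encode_vlen_utf8."""
    (count,) = struct.unpack_from("<I", buf, 0)
    pos = 4
    items = []
    for _ in range(count):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        items.append(buf[pos:pos + n].decode("utf-8"))
        pos += n
    return items


def encode_json_chunk(values):
    # Each JSON document serializes to a str, which then encodes as UTF-8,
    # so the string codec can carry it unchanged.
    return encode_vlen_utf8([json.dumps(v, ensure_ascii=False) for v in values])


def decode_json_chunk(buf):
    return [json.loads(s) for s in decode_vlen_utf8(buf)]
```

The only JSON-specific steps are `json.dumps` before encoding and `json.loads` after decoding; the byte layout is the same one a string array would use.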

Possible alternatives:

- Add a new vlen-json codec instead.
- Add a new vlen codec that will support bytes, string, and json and make vlen-bytes and vlen-utf8 deprecated.

LDeakin · 2025-04-30T07:06:11Z

> In my opinion it is a bit unfortunate that vlen-utf8 and vlen-bytes were added as separate codecs rather than just using vlen for both

100%. I have raised this as well:

> Add a new vlen codec that will support bytes, string, and json and make vlen-bytes and vlen-utf8 deprecated.

That would be my preferred alternative. But the vlen codec that you proposed in ZEP0007 seems like the way forward for variable length data. So maybe a different name?

FYI in zarrs I support:

- vlen-bytes / vlen-utf8
- vlen_v2: a data type agnostic codec matching the behaviour of vlen-bytes / vlen-utf8
- vlen: as you proposed here

rabernat · 2025-04-30T13:58:31Z

> In my opinion it is a bit unfortunate that vlen-utf8 and vlen-bytes were added as separate codecs rather than just using vlen for both.

As the person who implemented this, I agree! 🙃 I was just copying what Zarr V2 did and trying to get to some level of feature parity quickly. But it's not too late to change this.

This issue does highlight the challenge of coupling between dtypes and codecs. It also touches on logical types vs. physical types.

> Add a new vlen codec that will support bytes, string, and json and make vlen-bytes and vlen-utf8 deprecated.

I would favor this route. But I'd appreciate folks' thoughts on how to specify these relationships between dtypes and codecs more formally. Like, is there some way for a dtype extension to say "this dtype must be used in conjunction with the vlen ArraytoBytes codec"?

normanrz · 2025-04-30T14:03:28Z

> I would favor this route. But I'd appreciate folks' thoughts on how to specify these relationships between dtypes and codecs more formally. Like, is there some way for a dtype extension to say "this dtype must be used in conjunction with the vlen ArraytoBytes codec"?

This coupling is an interesting problem. I think it is currently the other way around: codecs define what dtypes they can (de)serialize.
That also has issues, because popular codecs such as the "bytes" codec might need to be constantly updated to define the serialization behavior of new dtypes.

jbms · 2025-04-30T17:56:48Z

The coupling is inherent but not necessarily a problem. It would be better, though, to have bidirectional references, with the codec and the data type each referring to the other, rather than unidirectional ones.
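As a rough illustration of what bidirectional references could look like, the sketch below keeps one table per direction and reports any pairing declared on only one side. Both tables and the function name are hypothetical; nothing like this exists in the Zarr specification or in this repository.

```python
# Hypothetical registries: neither structure is defined by any Zarr spec.
DTYPE_TO_CODECS = {
    "string": {"vlen-utf8"},
    "bytes": {"vlen-bytes"},
    "json": {"vlen-utf8"},
}
CODEC_TO_DTYPES = {
    "vlen-utf8": {"string", "json"},
    "vlen-bytes": {"bytes"},
}


def one_sided_references(dtype_to_codecs, codec_to_dtypes):
    """Return sorted (dtype, codec) pairs that only one side declares."""
    forward = {(d, c) for d, cs in dtype_to_codecs.items() for c in cs}
    backward = {(d, c) for c, ds in codec_to_dtypes.items() for d in ds}
    # Symmetric difference: pairs present in exactly one direction.
    return sorted(forward ^ backward)
```

With unidirectional references only one of the two tables would exist, so a check like this would have nothing to compare against.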

If the core codecs were also defined in this repository that would simplify things, though.

As for vlen-{bytes,utf8}: when you say it is not too late to change it, are you saying that the existing usage of vlen-{bytes,utf8} with zarr v3 is so low that we can break it, or are you just saying that we could add yet another alias, like the vlen_v2 that @LDeakin added to zarrs? Adding yet another alias would mean implementations need to support 3 names for the codec rather than just 2.

The vlen-{bytes,utf8} codecs are useful for compatibility with existing (especially zarr v2) data but the more flexible non-interleaved encoding as in zarr-developers/zeps#47 (comment) is likely to be better in most or all cases, so perhaps we should just not worry about vlen-{bytes,utf8} and I'll see about writing up a proposal for the non-interleaved vlen.

@LDeakin One potential issue with what you have implemented in zarrs is that there is an advantage in placing the index at the end rather than the beginning, which is that it is then possible to stream out the data. If the index is at the beginning, then you have to first calculate the size of every element, and e.g. for json if the elements are not already stored encoded it is difficult to avoid buffering or redundant work. On the other hand I suppose it is easier to do a streaming decode if the index is at the beginning. We could add an index_location parameter as we have for sharding_indexed, but do you have any thoughts about this?
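The streaming trade-off can be sketched as follows. The byte layout here (uint64 offsets plus an element count) is a made-up illustration, not the ZEP0007 proposal or the zarrs format, and the function names are hypothetical.

```python
import io
import json
import struct


def write_index_at_end(values, out):
    """Write each element as soon as it is encoded; append the index last."""
    offsets = [0]
    for v in values:  # values may be a generator: nothing is buffered
        data = json.dumps(v).encode("utf-8")
        out.write(data)
        offsets.append(offsets[-1] + len(data))
    # Index: n + 1 uint64 offsets, then the element count in the final 8 bytes.
    out.write(struct.pack(f"<{len(offsets)}Q", *offsets))
    out.write(struct.pack("<Q", len(offsets) - 1))


def write_index_at_start(values, out):
    """The index comes first, so every element is encoded and held in memory."""
    encoded = [json.dumps(v).encode("utf-8") for v in values]
    offsets = [0]
    for data in encoded:
        offsets.append(offsets[-1] + len(data))
    out.write(struct.pack("<Q", len(encoded)))
    out.write(struct.pack(f"<{len(offsets)}Q", *offsets))
    for data in encoded:
        out.write(data)


def read_index_at_end(buf):
    """Decode a buffer produced by write_index_at_end."""
    (n,) = struct.unpack_from("<Q", buf, len(buf) - 8)
    index_start = len(buf) - 8 - 8 * (n + 1)
    offsets = struct.unpack_from(f"<{n + 1}Q", buf, index_start)
    return [json.loads(buf[offsets[i]:offsets[i + 1]].decode("utf-8"))
            for i in range(n)]
```

In this sketch the index-at-end writer never holds more than one encoded element, while the reader must see the end of the buffer before it can locate anything. The index-at-start writer has the opposite profile, which is the case an index_location parameter would let users choose between.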

LDeakin · 2025-04-30T21:17:17Z

> We could add an index_location parameter as we have for sharding_indexed, but do you have any thoughts about this?

Sounds good

Add json data type

a98e5ae

jbms marked this pull request as draft April 30, 2025 19:10
