Skip to content

Add json data type #9

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Conversation

jbms
Copy link
Contributor

@jbms jbms commented Apr 30, 2025

This defines a json data type and adds it as a supported data type to the vlen-utf8 codec.

In my opinion it is a bit unfortunate that vlen-utf8 and vlen-bytes were added as separate codecs rather than just using vlen for both. In zarr v2 a separate identifier was needed because the data type for both was listed as "O" and therefore the real data type had to be determined based on the codec identifier. In zarr v3 that problem does not exist.

In this PR I decided to just allow the json data type in the vlen-utf8 codec, because encoded JSON is itself valid UTF-8.

Possible alternatives:

  • Add a new vlen-json codec instead.
  • Add a new vlen codec that will support bytes, string, and json and make vlen-bytes and vlen-utf8 deprecated.

@LDeakin
Copy link
Member

LDeakin commented Apr 30, 2025

In my opinion it is a bit unfortunate that vlen-utf8 and vlen-bytes were added as separate codecs rather than just using vlen for both

100%. I have raised this as well:

  • Add a new vlen codec that will support bytes, string, and json and make vlen-bytes and vlen-utf8 deprecated.

That would be my preferred alternative. But the vlen codec that you proposed in ZEP0007 seems like the way forward for variable length data. So maybe a different name?

FYI in zarrs I support:

  • vlen-bytes / vlen-utf8
  • vlen_v2: a data type agnostic codec matching the behaviour of vlen-bytes / vlen-utf8
  • vlen: as you proposed here

@rabernat
Copy link

In my opinion it is a bit unfortunate that vlen-utf8 and vlen-bytes were added as separate codecs rather than just using vlen for both.

As the person who implemented this, I agree! 🙃 I was just copying what Zarr V2 did and trying to get to some level of feature parity quickly. But it's not too late to change this.

This issue does highlight the challenge of coupling between dtypes and codecs. It also touches on logical types vs. physical types.

  • Add a new vlen codec that will support bytes, string, and json and make vlen-bytes and vlen-utf8 deprecated.

I would favor this route. But I'd appreciate folks thoughts on how to specify these relationships between dtypes and codecs more formally. Like, is there someway for a dtype extension to say "this dtype must be used in conjunction with the vlen ArraytoBytes codec"?

@normanrz
Copy link
Member

I would favor this route. But I'd appreciate folks thoughts on how to specify these relationships between dtypes and codecs more formally. Like, is there someway for a dtype extension to say "this dtype must be used in conjunction with the vlen ArraytoBytes codec"?

This coupling is an interesting problem. I think it is currently the other way around: codecs define what dtypes they can (de)serialize.
That also has issues, because popular codecs such as the "bytes" codec might need to be constantly updated to define the serialization behavior of new dtypes.

@jbms
Copy link
Contributor Author

jbms commented Apr 30, 2025

The coupling is inherent but not necessarily a problem --- but it would be better to be able to add bidirectional references in both the codec and data type, rather than unidirectional references.

If the core codecs were also defined in this repository that would simplify things, though.

As far as vlen-{bytes,utf8} --- when you say it is not too late to change it, are you saying that the existing usage of vlen-{bytes,utf8} with zarr v3 is so low that we can break it, or are you just saying that we could add yet another alias, like vlen_v2 that @LDeakin added to zarrs? Adding yet another alias would mean implementations need to support 3 names for the codec rather than just 2.

The vlen-{bytes,utf8} codecs are useful for compatibility with existing (especially zarr v2) data but the more flexible non-interleaved encoding as in zarr-developers/zeps#47 (comment) is likely to be better in most or all cases, so perhaps we should just not worry about vlen-{bytes,utf8} and I'll see about writing up a proposal for the non-interleaved vlen.

@LDeakin One potential issue with what you have implemented in zarrs is that there is an advantage in placing the index at the end rather than the beginning, which is that it is then possible to stream out the data. If the index is at the beginning, then you have to first calculate the size of every element, and e.g. for json if the elements are not already stored encoded it is difficult to avoid buffering or redundant work. On the other hand I suppose it is easier to do a streaming decode if the index is at the beginning. We could add an index_location parameter as we have for sharding_indexed, but do you have any thoughts about this?

@jbms jbms marked this pull request as draft April 30, 2025 19:10
@LDeakin
Copy link
Member

LDeakin commented Apr 30, 2025

We could add an index_location parameter as we have for sharding_indexed, but do you have any thoughts about this?

Sounds good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants