Skip to content

feature(data-types): add data-types from https://github.com/zarr-developers/zarr-python/pull/2874 #5

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jhamman
Copy link
Member

@jhamman jhamman commented Apr 15, 2025

This PR adds the definition of the 4 new data types from zarr-developers/zarr-python#2874. They include:

  • datetime64
  • fixed-length-ascii
  • fixed-length-bytes
  • fixed-length-ucs4

Closes #4

xref: zarr-developers/zarr-python#2874

…-bytes, and fixed-length-ucs4 data-types from zarr-python

## Permitted fill values

The value of the `fill_value` metadata key must be a signed 64-bit integer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to stick with the integer representation, we should clarify that this is a JSON representation of an integer between -2^63 and 2^63 - 1, inclusive. But as this data type is totally new for zarr v3 and we don't have to worry about backwards compatibility in the json format w.r.t. v2, we could consider a more date-like human-readable alternative to an int. I'd love to get feedback from heavy datetime users on what they would expect.

}
```

## Notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If numpy compatibility is still the design goal for this dtype, I think it would make sense to say something about compatibility with the numpy datetime dtype, e.g. with a link to the numpy docs .

@@ -0,0 +1,33 @@
# Fixed-length ASCII data type

Defines a data type for fixed-length ASCII strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my initial name for this dtype (fixed-length-ascii) was misleading -- it models the numpy S dtype, which represents any fixed-length byte string, which includes non-ascii characters. So we should probably rename this to fixed-length-bytes or equivalent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the key distinction between this dtype (modelling numpy S*) and the fixed-length bytes dtype defined elsewhere in this PR (modelling numpy V*) is that scalars with this dtype are intended to be interpreted as strings, not opaque sequences of bytes. We should probably decide if we really want two dtypes for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numpy also assumes NUL-padding, i.e. that any NUL characters at the end can be ignored.


## Permitted fill values

The value of the `fill_value` metadata key must be a string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will need more constraints here, such as as length constraint. for reference, in zarr v2 the fill value was a base64-encoded ascii string but it didn't constrain the length.


## Notes

Valid values for the `unit` configuration option include: `["Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs", "ns", "ps", "fs", "as"]`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should say something about the epoch (presumably unix epoch) and time standard (presumably UTC) it uses.

@d-v-b
Copy link
Contributor

d-v-b commented Apr 15, 2025

instead of two fixed-length string data types (ascii and ucs4), we could consider a single fixed-length string dtype that's parametrized by the string encoding.


## Notes

Valid values for the `unit` configuration option include: `["Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs", "ns", "ps", "fs", "as"]`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the units listed after nanoseconds cannot represent dates beyond 1970 within 64-bits and therefore are presumably not very useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first part of this sentence is correct.

The second is not. There is a lot of use cases in the (paleo-)climate space for dates that go much further back in time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the range of 64-bit picoseconds is just a few months forward or backward from the epoch.

Presumably picoseconds would only be useful either with more than 64 bits or for a time duration rather than time point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it helpful if we define the goal of this datetime type as "give numpy datetime64 users a zarr data type they can use without thinking"? In this case, we optimize for compatibility and accept whatever decisions the numpy folks have made. We can always define a separate datetime data type that avoids perceived defects in the numpy datetime dtype, and we can point users who don't need maximal numpy compatibility to the better data type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment regarding datetime: numpy supports a special NaT (not a time) value. Is that intended to be supported here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm asking because, whether it's sensible or not, ps is a unit numpy datetime64 supports. Instead of litigating that decision, or any other numpy decision, we could optimize for compatibility here, which would simplify the task of this PR considerably.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment regarding datetime: numpy supports a special NaT (not a time) value. Is that intended to be supported here?

yes i believe that should be supported, and so it should be defined in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those units don't hurt anything they just are very unlikely to be used intentionally. Numpy supports the same units for both datetime and timedelta --- are you intending to also add timedelta as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those units don't hurt anything they just are very unlikely to be used intentionally. Numpy supports the same units for both datetime and timedelta --- are you intending to also add timedelta as well?

yes, timedelta is planned.

"type": "object",
"properties": {
"unit": {
"type": "string"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be listed as enum rather than string

@jbms
Copy link
Contributor

jbms commented Apr 15, 2025

Regarding the datetime type generally, I think I'd be happier if that were handled by some sort of unit attribute that was independent of the data type.

For how this is handled by the udunits2 package, see: https://docs.unidata.ucar.edu/udunits/current/udunits2lib.html#Time

In udunits the unit would be specified as "seconds since 1970-01-01".

The complication with making it a data type is that you have to specify how it is handled by every single codec.

@jbms
Copy link
Contributor

jbms commented Apr 15, 2025

All of these need to specify which array -> bytes codecs support them (presumably just bytes), and how they are encoded.


## Permitted fill values

The value of the `fill_value` metadata key must be a string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the length specified in bits? What does it mean if you specify a length that is not a multiple of 8?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think specifying the length in bits is a zarr-python implementation detail that we shouldn't have to deal with in the spec. i think a length in bytes is better here.

@@ -0,0 +1,33 @@
# Fixed-length Unicode string data type

Defines a data type for fixed-length Unicode strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should also include a recommendation NOT to use this type except for compatibility with existing datasets, because UTF-32 encoding is basically never a good choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should include a note that it is intended to only contain valid unicode strings, and that implementations may return an error if given invalid data (e.g. utf-16 surrogate pairs, out-of-range characters) if that is what is intended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should include a note that NUL characters at the end are to be ignored, if that is what is intended.

"data_type": "fixed-length-ucs4",
"fill_value": "",
"configuration": {
"length_bits": 24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using 24, which is not a multiple of 32, is confusing.

"configuration": {
"type": "object",
"properties": {
"length_bits": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the length in bits, and what happens if it is not a multiple of 32?

@@ -0,0 +1,33 @@
# Fixed-length Unicode string data type

Defines a data type for fixed-length Unicode strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think utf-32 is now more commonly used and may be a better choice than ucs-4.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I mean the name "utf-32".

@jhamman
Copy link
Member Author

jhamman commented Apr 16, 2025

Thanks @d-v-b and @jbms for the initial reviews. I'll take a second pass tomorrow.

Comment on lines +15 to +19
"data_type": "datetime64",
"fill_value": -9223372036854775808,
"configuration": {
"unit": "s"
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"data_type": "datetime64",
"fill_value": -9223372036854775808,
"configuration": {
"unit": "s"
},
"data_type": { "name": "datetime64", "configuration": {"unit":"s"} },
"fill_value": -9223372036854775808,

"required": ["name", "configuration"],
"additionalProperties": false
},
{ "const": "datetime64" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the configuration (with its unit field) is mandatory, the short-hand should not be permissible.

@normanrz
Copy link
Member

normanrz commented May 6, 2025

Since you mentioned that you are still working on this, I turned this PR into a draft PR, in accordance with #13. Feel free to switch to a regular PR as soon as you think it is ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Naming the dtypes from zarr-python#2874
4 participants