-
Notifications
You must be signed in to change notification settings - Fork 6
feature(data-types): add data-types from https://github.com/zarr-developers/zarr-python/pull/2874 #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…-bytes, and fixed-length-ucs4 data-types from zarr-python
|
||
## Permitted fill values | ||
|
||
The value of the `fill_value` metadata key must be a signed 64-bit integer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to stick with the integer representation, we should clarify that this is a JSON representation of an integer between -2^63 and 2^63 - 1, inclusive. But as this data type is totally new for zarr v3 and we don't have to worry about backwards compatibility in the json format w.r.t. v2, we could consider a more date-like human-readable alternative to an int. I'd love to get feedback from heavy datetime users on what they would expect.
} | ||
``` | ||
|
||
## Notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If numpy compatibility is still the design goal for this dtype, I think it would make sense to say something about compatibility with the numpy datetime dtype, e.g. with a link to the numpy docs .
@@ -0,0 +1,33 @@ | |||
# Fixed-length ASCII data type | |||
|
|||
Defines a data type for fixed-length ASCII strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my initial name for this dtype (fixed-length-ascii) was misleading -- it models the numpy S
dtype, which represents any fixed-length byte string, which includes non-ascii characters. So we should probably rename this to fixed-length-bytes
or equivalent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the key distinction between this dtype (modelling numpy S*
) and the fixed-length bytes dtype defined elsewhere in this PR (modelling numpy V*
) is that scalars with this dtype are intended to be interpreted as strings, not opaque sequences of bytes. We should probably decide if we really want two dtypes for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numpy also assumes NUL-padding, i.e. that any NUL characters at the end can be ignored.
|
||
## Permitted fill values | ||
|
||
The value of the `fill_value` metadata key must be a string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will need more constraints here, such as as length constraint. for reference, in zarr v2 the fill value was a base64-encoded ascii string but it didn't constrain the length.
|
||
## Notes | ||
|
||
Valid values for the `unit` configuration option include: `["Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs", "ns", "ps", "fs", "as"]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should say something about the epoch (presumably unix epoch) and time standard (presumably UTC) it uses.
instead of two fixed-length string data types (ascii and ucs4), we could consider a single fixed-length string dtype that's parametrized by the string encoding. |
|
||
## Notes | ||
|
||
Valid values for the `unit` configuration option include: `["Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs", "ns", "ps", "fs", "as"]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the units listed after nanoseconds cannot represent dates beyond 1970 within 64-bits and therefore are presumably not very useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first part of this sentence is correct.
The second is not. There is a lot of use cases in the (paleo-)climate space for dates that go much further back in time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, the range of 64-bit picoseconds is just a few months forward or backward from the epoch.
Presumably picoseconds would only be useful either with more than 64 bits or for a time duration rather than time point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it helpful if we define the goal of this datetime type as "give numpy datetime64
users a zarr data type they can use without thinking"? In this case, we optimize for compatibility and accept whatever decisions the numpy folks have made. We can always define a separate datetime data type that avoids perceived defects in the numpy datetime dtype, and we can point users who don't need maximal numpy compatibility to the better data type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another comment regarding datetime: numpy supports a special NaT (not a time) value. Is that intended to be supported here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm asking because, whether it's sensible or not, ps
is a unit numpy datetime64
supports. Instead of litigating that decision, or any other numpy decision, we could optimize for compatibility here, which would simplify the task of this PR considerably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another comment regarding datetime: numpy supports a special NaT (not a time) value. Is that intended to be supported here?
yes i believe that should be supported, and so it should be defined in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those units don't hurt anything they just are very unlikely to be used intentionally. Numpy supports the same units for both datetime and timedelta --- are you intending to also add timedelta as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those units don't hurt anything they just are very unlikely to be used intentionally. Numpy supports the same units for both datetime and timedelta --- are you intending to also add timedelta as well?
yes, timedelta is planned.
"type": "object", | ||
"properties": { | ||
"unit": { | ||
"type": "string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be listed as enum rather than string
Regarding the datetime type generally, I think I'd be happier if that were handled by some sort of For how this is handled by the udunits2 package, see: https://docs.unidata.ucar.edu/udunits/current/udunits2lib.html#Time In udunits the unit would be specified as "seconds since 1970-01-01". The complication with making it a data type is that you have to specify how it is handled by every single codec. |
All of these need to specify which |
|
||
## Permitted fill values | ||
|
||
The value of the `fill_value` metadata key must be a string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the length specified in bits? What does it mean if you specify a length that is not a multiple of 8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think specifying the length in bits is a zarr-python implementation detail that we shouldn't have to deal with in the spec. i think a length in bytes is better here.
@@ -0,0 +1,33 @@ | |||
# Fixed-length Unicode string data type | |||
|
|||
Defines a data type for fixed-length Unicode strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should also include a recommendation NOT to use this type except for compatibility with existing datasets, because UTF-32 encoding is basically never a good choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should include a note that it is intended to only contain valid unicode strings, and that implementations may return an error if given invalid data (e.g. utf-16 surrogate pairs, out-of-range characters) if that is what is intended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should include a note that NUL characters at the end are to be ignored, if that is what is intended.
"data_type": "fixed-length-ucs4", | ||
"fill_value": "", | ||
"configuration": { | ||
"length_bits": 24 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using 24, which is not a multiple of 32, is confusing.
"configuration": { | ||
"type": "object", | ||
"properties": { | ||
"length_bits": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the length in bits, and what happens if it is not a multiple of 32?
@@ -0,0 +1,33 @@ | |||
# Fixed-length Unicode string data type | |||
|
|||
Defines a data type for fixed-length Unicode strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think utf-32 is now more commonly used and may be a better choice than ucs-4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I mean the name "utf-32".
"data_type": "datetime64", | ||
"fill_value": -9223372036854775808, | ||
"configuration": { | ||
"unit": "s" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"data_type": "datetime64", | |
"fill_value": -9223372036854775808, | |
"configuration": { | |
"unit": "s" | |
}, | |
"data_type": { "name": "datetime64", "configuration": {"unit":"s"} }, | |
"fill_value": -9223372036854775808, |
"required": ["name", "configuration"], | ||
"additionalProperties": false | ||
}, | ||
{ "const": "datetime64" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the configuration (with its unit field) is mandatory, the short-hand should not be permissible.
Since you mentioned that you are still working on this, I turned this PR into a draft PR, in accordance with #13. Feel free to switch to a regular PR as soon as you think it is ready. |
This PR adds the definition of the 4 new data types from zarr-developers/zarr-python#2874. They include:
Closes #4
xref: zarr-developers/zarr-python#2874