-
Notifications
You must be signed in to change notification settings - Fork 7
feature(data-types): add data-types from https://github.com/zarr-developers/zarr-python/pull/2874 #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feature(data-types): add data-types from https://github.com/zarr-developers/zarr-python/pull/2874 #5
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,33 @@ | ||||||||||||||||
# Datetime64 data type | ||||||||||||||||
|
||||||||||||||||
Defines a data type for a datetime object based on a 64-bit integer. | ||||||||||||||||
|
||||||||||||||||
## Permitted fill values | ||||||||||||||||
|
||||||||||||||||
The value of the `fill_value` metadata key must be a signed 64-bit integer. | ||||||||||||||||
|
||||||||||||||||
## Example | ||||||||||||||||
|
||||||||||||||||
For example, the array metadata below specifies that the array uses the datetime64 data type: | ||||||||||||||||
|
||||||||||||||||
```json | ||||||||||||||||
{ | ||||||||||||||||
"data_type": "datetime64", | ||||||||||||||||
"fill_value": -9223372036854775808, | ||||||||||||||||
"configuration": { | ||||||||||||||||
"unit": "s" | ||||||||||||||||
}, | ||||||||||||||||
Comment on lines
+15
to
+19
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||||
} | ||||||||||||||||
``` | ||||||||||||||||
|
||||||||||||||||
## Notes | ||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If numpy compatibility is still the design goal for this dtype, I think it would make sense to say something about compatibility with the numpy datetime dtype, e.g. with a link to the numpy docs . |
||||||||||||||||
|
||||||||||||||||
Valid values for the `unit` configuration option include: `["Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs", "ns", "ps", "fs", "as"]` | ||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should say something about the epoch (presumably unix epoch) and time standard (presumably UTC) it uses. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note that the units listed after nanoseconds cannot represent dates beyond 1970 within 64-bits and therefore are presumably not very useful. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The first part of this sentence is correct. The second is not. There is a lot of use cases in the (paleo-)climate space for dates that go much further back in time. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be clear, the range of 64-bit picoseconds is just a few months forward or backward from the epoch. Presumably picoseconds would only be useful either with more than 64 bits or for a time duration rather than time point. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is it helpful if we define the goal of this datetime type as "give numpy There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another comment regarding datetime: numpy supports a special NaT (not a time) value. Is that intended to be supported here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i'm asking because, whether it's sensible or not, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
yes i believe that should be supported, and so it should be defined in this PR There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Those units don't hurt anything they just are very unlikely to be used intentionally. Numpy supports the same units for both datetime and timedelta --- are you intending to also add timedelta as well? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
yes, timedelta is planned. |
||||||||||||||||
|
||||||||||||||||
## Change log | ||||||||||||||||
|
||||||||||||||||
No changes yet. | ||||||||||||||||
|
||||||||||||||||
## Current maintainers | ||||||||||||||||
|
||||||||||||||||
* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
{ | ||
"$schema": "https://json-schema.org/draft/2020-12/schema", | ||
"oneOf": [ | ||
{ | ||
"type": "object", | ||
"properties": { | ||
"name": { | ||
"const": "datetime64" | ||
}, | ||
"configuration": { | ||
"type": "object", | ||
"properties": { | ||
"unit": { | ||
"type": "string" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. could be listed as enum rather than string |
||
} | ||
}, | ||
"required": ["unit"], | ||
"additionalProperties": false | ||
} | ||
}, | ||
"required": ["name", "configuration"], | ||
"additionalProperties": false | ||
}, | ||
{ "const": "datetime64" } | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If the configuration (with its unit field) is mandatory, the short-hand should not be permissible. |
||
] | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
# Fixed-length ASCII data type | ||
|
||
Defines a data type for fixed-length ASCII strings. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think my initial name for this dtype (fixed-length-ascii) was misleading -- it models the numpy There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think the key distinction between this dtype (modelling numpy There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Numpy also assumes NUL-padding, i.e. that any NUL characters at the end can be ignored. |
||
|
||
## Permitted fill values | ||
|
||
The value of the `fill_value` metadata key must be a string. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we will need more constraints here, such as as length constraint. for reference, in zarr v2 the fill value was a base64-encoded ascii string but it didn't constrain the length. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is the length specified in bits? What does it mean if you specify a length that is not a multiple of 8? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think specifying the length in bits is a zarr-python implementation detail that we shouldn't have to deal with in the spec. i think a length in bytes is better here. |
||
|
||
## Example | ||
|
||
For example, the array metadata below specifies that the array contains fixed-length ASCII strings: | ||
|
||
```json | ||
{ | ||
"data_type": "fixed-length-ascii", | ||
"fill_value": "", | ||
"configuration": { | ||
"length_bits": 24 | ||
}, | ||
} | ||
``` | ||
|
||
## Notes | ||
|
||
TBD | ||
|
||
## Change log | ||
|
||
No changes yet. | ||
|
||
## Current maintainers | ||
|
||
* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
{ | ||
"$schema": "https://json-schema.org/draft/2020-12/schema", | ||
"oneOf": [ | ||
{ | ||
"type": "object", | ||
"properties": { | ||
"name": { | ||
"const": "fixed-length-ascii" | ||
}, | ||
"configuration": { | ||
"type": "object", | ||
"properties": { | ||
"length_bits": { | ||
"type": "integer" | ||
} | ||
}, | ||
"required": ["length_bits"], | ||
"additionalProperties": false | ||
} | ||
}, | ||
"required": ["name", "configuration"], | ||
"additionalProperties": false | ||
}, | ||
{ "const": "fixed-length-ascii" } | ||
] | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
# Fixed-length bytes data type | ||
|
||
Defines a data type for fixed-length byte strings. | ||
|
||
## Permitted fill values | ||
|
||
The value of the `fill_value` metadata key must be a string. | ||
|
||
## Example | ||
|
||
For example, the array metadata below specifies that the array contains fixed-length byte strings: | ||
|
||
```json | ||
{ | ||
"data_type": "fixed-length-bytes", | ||
"fill_value": "", | ||
"configuration": { | ||
"length_bits": 24 | ||
}, | ||
} | ||
``` | ||
|
||
## Notes | ||
|
||
TBD | ||
|
||
## Change log | ||
|
||
No changes yet. | ||
|
||
## Current maintainers | ||
|
||
* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
{ | ||
"$schema": "https://json-schema.org/draft/2020-12/schema", | ||
"oneOf": [ | ||
{ | ||
"type": "object", | ||
"properties": { | ||
"name": { | ||
"const": "fixed-length-bytes" | ||
}, | ||
"configuration": { | ||
"type": "object", | ||
"properties": { | ||
"length_bits": { | ||
"type": "integer" | ||
} | ||
}, | ||
"required": ["length_bits"], | ||
"additionalProperties": false | ||
} | ||
}, | ||
"required": ["name", "configuration"], | ||
"additionalProperties": false | ||
}, | ||
{ "const": "fixed-length-bytes" } | ||
] | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
# Fixed-length Unicode string data type | ||
|
||
Defines a data type for fixed-length Unicode strings. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably should also include a recommendation NOT to use this type except for compatibility with existing datasets, because UTF-32 encoding is basically never a good choice. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably should include a note that it is intended to only contain valid unicode strings, and that implementations may return an error if given invalid data (e.g. utf-16 surrogate pairs, out-of-range characters) if that is what is intended. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should include a note that NUL characters at the end are to be ignored, if that is what is intended. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think utf-32 is now more commonly used and may be a better choice than ucs-4. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be clear, I mean the name "utf-32". |
||
|
||
## Permitted fill values | ||
|
||
The value of the `fill_value` metadata key must be a string. | ||
|
||
## Example | ||
|
||
For example, the array metadata below specifies that the array contains fixed-length unicode strings: | ||
|
||
```json | ||
{ | ||
"data_type": "fixed-length-ucs4", | ||
"fill_value": "", | ||
"configuration": { | ||
"length_bits": 24 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Using 24, which is not a multiple of 32, is confusing. |
||
}, | ||
} | ||
``` | ||
|
||
## Notes | ||
|
||
TBD | ||
|
||
## Change log | ||
|
||
No changes yet. | ||
|
||
## Current maintainers | ||
|
||
* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
{ | ||
"$schema": "https://json-schema.org/draft/2020-12/schema", | ||
"oneOf": [ | ||
{ | ||
"type": "object", | ||
"properties": { | ||
"name": { | ||
"const": "fixed-length-ucs4" | ||
}, | ||
"configuration": { | ||
"type": "object", | ||
"properties": { | ||
"length_bits": { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is the length in bits, and what happens if it is not a multiple of 32? |
||
"type": "integer" | ||
} | ||
}, | ||
"required": ["length_bits"], | ||
"additionalProperties": false | ||
} | ||
}, | ||
"required": ["name", "configuration"], | ||
"additionalProperties": false | ||
}, | ||
{ "const": "fixed-length-ucs4" } | ||
] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to stick with the integer representation, we should clarify that this is a JSON representation of an integer between -2^63 and 2^63 - 1, inclusive. But as this data type is totally new for zarr v3 and we don't have to worry about backwards compatibility in the json format w.r.t. v2, we could consider a more date-like human-readable alternative to an int. I'd love to get feedback from heavy datetime users on what they would expect.