Skip to content

feature(data-types): add data-types from https://github.com/zarr-developers/zarr-python/pull/2874 #5

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 33 additions & 0 deletions data-types/datetime64/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# Datetime64 data type

Defines a data type for a datetime object based on a 64-bit integer.

## Permitted fill values

The value of the `fill_value` metadata key must be a signed 64-bit integer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to stick with the integer representation, we should clarify that this is a JSON representation of an integer between -2^63 and 2^63 - 1, inclusive. But as this data type is totally new for zarr v3 and we don't have to worry about backwards compatibility in the json format w.r.t. v2, we could consider a more date-like human-readable alternative to an int. I'd love to get feedback from heavy datetime users on what they would expect.


## Example

For example, the array metadata below specifies that the array uses the datetime64 data type:

```json
{
"data_type": "datetime64",
"fill_value": -9223372036854775808,
"configuration": {
"unit": "s"
},
Comment on lines +15 to +19
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"data_type": "datetime64",
"fill_value": -9223372036854775808,
"configuration": {
"unit": "s"
},
"data_type": { "name": "datetime64", "configuration": {"unit":"s"} },
"fill_value": -9223372036854775808,

}
```

## Notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If numpy compatibility is still the design goal for this dtype, I think it would make sense to say something about compatibility with the numpy datetime dtype, e.g. with a link to the numpy docs .


Valid values for the `unit` configuration option include: `["Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs", "ns", "ps", "fs", "as"]`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should say something about the epoch (presumably unix epoch) and time standard (presumably UTC) it uses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the units listed after nanoseconds cannot represent dates beyond 1970 within 64-bits and therefore are presumably not very useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first part of this sentence is correct.

The second is not. There is a lot of use cases in the (paleo-)climate space for dates that go much further back in time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the range of 64-bit picoseconds is just a few months forward or backward from the epoch.

Presumably picoseconds would only be useful either with more than 64 bits or for a time duration rather than time point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it helpful if we define the goal of this datetime type as "give numpy datetime64 users a zarr data type they can use without thinking"? In this case, we optimize for compatibility and accept whatever decisions the numpy folks have made. We can always define a separate datetime data type that avoids perceived defects in the numpy datetime dtype, and we can point users who don't need maximal numpy compatibility to the better data type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment regarding datetime: numpy supports a special NaT (not a time) value. Is that intended to be supported here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm asking because, whether it's sensible or not, ps is a unit numpy datetime64 supports. Instead of litigating that decision, or any other numpy decision, we could optimize for compatibility here, which would simplify the task of this PR considerably.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment regarding datetime: numpy supports a special NaT (not a time) value. Is that intended to be supported here?

yes i believe that should be supported, and so it should be defined in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those units don't hurt anything they just are very unlikely to be used intentionally. Numpy supports the same units for both datetime and timedelta --- are you intending to also add timedelta as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those units don't hurt anything they just are very unlikely to be used intentionally. Numpy supports the same units for both datetime and timedelta --- are you intending to also add timedelta as well?

yes, timedelta is planned.


## Change log

No changes yet.

## Current maintainers

* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)
26 changes: 26 additions & 0 deletions data-types/datetime64/schema.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"oneOf": [
{
"type": "object",
"properties": {
"name": {
"const": "datetime64"
},
"configuration": {
"type": "object",
"properties": {
"unit": {
"type": "string"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be listed as enum rather than string

}
},
"required": ["unit"],
"additionalProperties": false
}
},
"required": ["name", "configuration"],
"additionalProperties": false
},
{ "const": "datetime64" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the configuration (with its unit field) is mandatory, the short-hand should not be permissible.

]
}
33 changes: 33 additions & 0 deletions data-types/fixed-length-ascii/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# Fixed-length ASCII data type

Defines a data type for fixed-length ASCII strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my initial name for this dtype (fixed-length-ascii) was misleading -- it models the numpy S dtype, which represents any fixed-length byte string, which includes non-ascii characters. So we should probably rename this to fixed-length-bytes or equivalent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the key distinction between this dtype (modelling numpy S*) and the fixed-length bytes dtype defined elsewhere in this PR (modelling numpy V*) is that scalars with this dtype are intended to be interpreted as strings, not opaque sequences of bytes. We should probably decide if we really want two dtypes for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numpy also assumes NUL-padding, i.e. that any NUL characters at the end can be ignored.


## Permitted fill values

The value of the `fill_value` metadata key must be a string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will need more constraints here, such as as length constraint. for reference, in zarr v2 the fill value was a base64-encoded ascii string but it didn't constrain the length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the length specified in bits? What does it mean if you specify a length that is not a multiple of 8?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think specifying the length in bits is a zarr-python implementation detail that we shouldn't have to deal with in the spec. i think a length in bytes is better here.


## Example

For example, the array metadata below specifies that the array contains fixed-length ASCII strings:

```json
{
"data_type": "fixed-length-ascii",
"fill_value": "",
"configuration": {
"length_bits": 24
},
}
```

## Notes

TBD

## Change log

No changes yet.

## Current maintainers

* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)
26 changes: 26 additions & 0 deletions data-types/fixed-length-ascii/schema.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"oneOf": [
{
"type": "object",
"properties": {
"name": {
"const": "fixed-length-ascii"
},
"configuration": {
"type": "object",
"properties": {
"length_bits": {
"type": "integer"
}
},
"required": ["length_bits"],
"additionalProperties": false
}
},
"required": ["name", "configuration"],
"additionalProperties": false
},
{ "const": "fixed-length-ascii" }
]
}
33 changes: 33 additions & 0 deletions data-types/fixed-length-bytes/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# Fixed-length bytes data type

Defines a data type for fixed-length byte strings.

## Permitted fill values

The value of the `fill_value` metadata key must be a string.

## Example

For example, the array metadata below specifies that the array contains fixed-length byte strings:

```json
{
"data_type": "fixed-length-bytes",
"fill_value": "",
"configuration": {
"length_bits": 24
},
}
```

## Notes

TBD

## Change log

No changes yet.

## Current maintainers

* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)
26 changes: 26 additions & 0 deletions data-types/fixed-length-bytes/schema.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"oneOf": [
{
"type": "object",
"properties": {
"name": {
"const": "fixed-length-bytes"
},
"configuration": {
"type": "object",
"properties": {
"length_bits": {
"type": "integer"
}
},
"required": ["length_bits"],
"additionalProperties": false
}
},
"required": ["name", "configuration"],
"additionalProperties": false
},
{ "const": "fixed-length-bytes" }
]
}
33 changes: 33 additions & 0 deletions data-types/fixed-length-ucs4/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# Fixed-length Unicode string data type

Defines a data type for fixed-length Unicode strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should also include a recommendation NOT to use this type except for compatibility with existing datasets, because UTF-32 encoding is basically never a good choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should include a note that it is intended to only contain valid unicode strings, and that implementations may return an error if given invalid data (e.g. utf-16 surrogate pairs, out-of-range characters) if that is what is intended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should include a note that NUL characters at the end are to be ignored, if that is what is intended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think utf-32 is now more commonly used and may be a better choice than ucs-4.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I mean the name "utf-32".


## Permitted fill values

The value of the `fill_value` metadata key must be a string.

## Example

For example, the array metadata below specifies that the array contains fixed-length unicode strings:

```json
{
"data_type": "fixed-length-ucs4",
"fill_value": "",
"configuration": {
"length_bits": 24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using 24, which is not a multiple of 32, is confusing.

},
}
```

## Notes

TBD

## Change log

No changes yet.

## Current maintainers

* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)
26 changes: 26 additions & 0 deletions data-types/fixed-length-ucs4/schema.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"oneOf": [
{
"type": "object",
"properties": {
"name": {
"const": "fixed-length-ucs4"
},
"configuration": {
"type": "object",
"properties": {
"length_bits": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the length in bits, and what happens if it is not a multiple of 32?

"type": "integer"
}
},
"required": ["length_bits"],
"additionalProperties": false
}
},
"required": ["name", "configuration"],
"additionalProperties": false
},
{ "const": "fixed-length-ucs4" }
]
}