Commit 336512d

update docs on N-dim arrays (#6956)
* update docs on N-dim arrays
* Update use_with_tensorflow.mdx
* Update use_with_jax.mdx
* Update use_with_jax.mdx
* Update use_with_tensorflow.mdx
1 parent f717006 commit 336512d

File tree

3 files changed: +62 −16 lines


docs/source/use_with_jax.mdx

Lines changed: 32 additions & 6 deletions
@@ -79,18 +79,44 @@ device which is `jax.devices()[0]`.
 
 ## N-dimensional arrays
 
-If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
-In particular, a JAX formatted dataset outputs a `DeviceArray` object, which is a numpy-like array, so it does not
-need the [`Array`] feature type to be specified as opposed to PyTorch or TensorFlow formatters.
+If your dataset consists of N-dimensional arrays, you will see that by default they are considered as a single tensor if their shape is fixed:
 
 ```py
 >>> from datasets import Dataset
->>> data = [[[1, 2],[3, 4]], [[5, 6],[7, 8]]]
+>>> data = [[[1, 2],[3, 4]], [[5, 6],[7, 8]]]  # fixed shape
 >>> ds = Dataset.from_dict({"data": data})
 >>> ds = ds.with_format("jax")
 >>> ds[0]
-{'data': DeviceArray([[1, 2],
-                      [3, 4]], dtype=int32)}
+{'data': Array([[1, 2],
+                [3, 4]], dtype=int32)}
+```
+
+```py
+>>> from datasets import Dataset
+>>> data = [[[1, 2],[3]], [[4, 5, 6],[7, 8]]]  # varying shape
+>>> ds = Dataset.from_dict({"data": data})
+>>> ds = ds.with_format("jax")
+>>> ds[0]
+{'data': [Array([1, 2], dtype=int32), Array([3], dtype=int32)]}
+```
+
+However, this logic often requires slow shape comparisons and data copies. To avoid this, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:
+
+```py
+>>> from datasets import Dataset, Features, Array2D
+>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
+>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')})
+>>> ds = Dataset.from_dict({"data": data}, features=features)
+>>> ds = ds.with_format("jax")
+>>> ds[0]
+{'data': Array([[1, 2],
+                [3, 4]], dtype=int32)}
+>>> ds[:2]
+{'data': Array([[[1, 2],
+                 [3, 4]],
+
+                [[5, 6],
+                 [7, 8]]], dtype=int32)}
 ```
 
 ## Other feature types
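The fixed-vs-varying shape behavior in this hunk can be sketched without `datasets` at all. The following is an illustrative approximation of the fallback described above, not the library's actual implementation; the helper name `stack_if_uniform` is made up for this sketch:

```python
import numpy as np

def stack_if_uniform(rows):
    """Stack rows into one array when every row has the same shape;
    otherwise fall back to a list of per-row arrays (the ragged case)."""
    arrays = [np.asarray(r, dtype=np.int32) for r in rows]
    if len({a.shape for a in arrays}) == 1:
        return np.stack(arrays)  # one (num_rows, ...) tensor
    return arrays                # varying shape: keep a list of arrays

fixed = stack_if_uniform([[1, 2], [3, 4]])
ragged = stack_if_uniform([[1, 2], [3]])
print(fixed.shape)   # → (2, 2)
print(type(ragged))  # → <class 'list'>
```

Note that deciding which branch to take requires materializing and comparing every row's shape, which is the cost the new paragraph warns about.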

docs/source/use_with_pytorch.mdx

Lines changed: 14 additions & 5 deletions
@@ -40,19 +40,28 @@ To load the data as tensors on a GPU, specify the `device` argument:
 
 ## N-dimensional arrays
 
-If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
-In particular, a PyTorch formatted dataset outputs nested lists instead of a single tensor:
+If your dataset consists of N-dimensional arrays, you will see that by default they are considered as a single tensor if their shape is fixed:
 
 ```py
 >>> from datasets import Dataset
->>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
+>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]  # fixed shape
+>>> ds = Dataset.from_dict({"data": data})
+>>> ds = ds.with_format("torch")
+>>> ds[0]
+{'data': tensor([[1, 2],
+                 [3, 4]])}
+```
+
+```py
+>>> from datasets import Dataset
+>>> data = [[[1, 2],[3]],[[4, 5, 6],[7, 8]]]  # varying shape
 >>> ds = Dataset.from_dict({"data": data})
 >>> ds = ds.with_format("torch")
 >>> ds[0]
-{'data': [tensor([1, 2]), tensor([3, 4])]}
+{'data': [tensor([1, 2]), tensor([3])]}
 ```
 
-To get a single tensor, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:
+However, this logic often requires slow shape comparisons and data copies. To avoid this, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:
 
 ```py
 >>> from datasets import Dataset, Features, Array2D
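A rough analogy in plain NumPy for why a declared fixed shape is cheaper (this assumes nothing about the library's internals; it only illustrates the general principle): when the shape is known up front, a whole batch can be viewed out of one contiguous buffer, with no per-row shape comparison and no copy:

```python
import numpy as np

# Two 2x2 rows stored back to back in one contiguous buffer,
# the way a fixed-shape column could be laid out.
flat = np.arange(8, dtype=np.int32)

# With a known (2, 2) row shape, reinterpret the buffer as
# (num_rows, 2, 2) without touching the data.
rows = flat.reshape(-1, 2, 2)

print(rows.base is flat)  # → True: a view, not a copy
print(rows.shape)         # → (2, 3... )
```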

docs/source/use_with_tensorflow.mdx

Lines changed: 16 additions & 5 deletions
@@ -41,19 +41,30 @@ array([[1, 2],
 
 ## N-dimensional arrays
 
-If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
-In particular, a TensorFlow formatted dataset outputs a `RaggedTensor` instead of a single tensor:
+If your dataset consists of N-dimensional arrays, you will see that by default they are considered as a single tensor if their shape is fixed.
+Otherwise, a TensorFlow formatted dataset outputs a `RaggedTensor` instead of a single tensor:
 
 ```py
 >>> from datasets import Dataset
->>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
+>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]  # fixed shape
 >>> ds = Dataset.from_dict({"data": data})
 >>> ds = ds.with_format("tf")
 >>> ds[0]
-{'data': <tf.RaggedTensor [[1, 2], [3, 4]]>}
+{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
+array([[1, 2],
+       [3, 4]])>}
+```
+
+```py
+>>> from datasets import Dataset
+>>> data = [[[1, 2],[3]],[[4, 5, 6],[7, 8]]]  # varying shape
+>>> ds = Dataset.from_dict({"data": data})
+>>> ds = ds.with_format("tf")
+>>> ds[0]
+{'data': <tf.RaggedTensor [[1, 2], [3]]>}
+```
 
-To get a single tensor, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:
+However, this logic often requires slow shape comparisons and data copies. To avoid this, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:
 
 ```py
 >>> from datasets import Dataset, Features, Array2D
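If you do need one dense tensor from varying-length rows, padding is a common workaround. A minimal sketch in plain NumPy (the `pad_ragged` helper and its fill value are made up for illustration; they are not part of `datasets` or TensorFlow):

```python
import numpy as np

def pad_ragged(rows, fill=0):
    """Pad variable-length rows to the longest row's length,
    returning one dense (num_rows, max_len) array."""
    max_len = max(len(r) for r in rows)
    out = np.full((len(rows), max_len), fill, dtype=np.int32)
    for i, r in enumerate(rows):
        out[i, : len(r)] = r
    return out

print(pad_ragged([[1, 2], [3]]))  # → [[1 2]
                                  #    [3 0]]
```

The trade-off is that padding copies the data and wastes space on the fill values, whereas declaring a fixed shape with [`Array`] avoids both.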
