docs/source/document_dataset.mdx (2 additions & 2 deletions)
@@ -1,6 +1,6 @@
 # Create a document dataset

-This guide will show you how to create a document with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document with several thousand pdfs.
+This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand pdfs.

 <Tip>

@@ -10,7 +10,7 @@ You can control access to your dataset by requiring users to share their contact

 ## PdfFolder

-The `PdfFolder` is a dataset builder designed to quickly load a document with several thousand pdfs without requiring you to write any code.
+The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand pdfs without requiring you to write any code.
docs/source/document_load.mdx (4 additions & 4 deletions)
@@ -14,7 +14,7 @@ To work with pdf datasets, you need to have the `pdfplumber` package installed.

 </Tip>

-When you load an pdf dataset and call the pdf column, the pdfs are decoded as `pdfplumber` Pdfs:
+When you load a pdf dataset and call the pdf column, the pdfs are decoded as `pdfplumber` Pdfs:

 ```py
 >>> from datasets import load_dataset, Pdf
@@ -26,15 +26,15 @@ When you load an pdf dataset and call the pdf column, the pdfs are decoded as `p

 <Tip warning={true}>

-Index into an pdf dataset using the row index first and then the `pdf` column - `dataset[0]["pdf"]` - to avoid creating all the pdf objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
+Index into a pdf dataset using the row index first and then the `pdf` column - `dataset[0]["pdf"]` - to avoid creating all the pdf objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.

 </Tip>

 For a guide on how to load any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./loading">general loading guide</a>.

 ## Read pages

-Access pages directly from a pdf using the `PDF` using `.pages`.
+Access pages directly from a pdf using the `.pages` attribute.

 Then you can use the `pdfplumber` functions to read texts, tables and images, e.g.:

@@ -168,7 +168,7 @@ To ignore the information in the metadata file, set `drop_metadata=True` in [`lo

 If you don't have a metadata file, `PdfFolder` automatically infers the label name from the directory name.
 If you want to drop automatically created labels, set `drop_labels=True`.
-In this case, your dataset will only contain an pdf column:
+In this case, your dataset will only contain a pdf column:
docs/source/video_load.mdx (3 additions & 3 deletions)
@@ -14,7 +14,7 @@ To work with video datasets, you need to have the `torchvision` and `av` package

 </Tip>

-When you load an video dataset and call the video column, the videos are decoded as `torchvision` Videos:
+When you load a video dataset and call the video column, the videos are decoded as `torchvision` Videos:

 ```py
 >>> from datasets import load_dataset, Video
@@ -26,7 +26,7 @@ When you load an video dataset and call the video column, the videos are decoded

 <Tip warning={true}>

-Index into an video dataset using the row index first and then the `video` column - `dataset[0]["video"]` - to avoid creating all the video objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
+Index into a video dataset using the row index first and then the `video` column - `dataset[0]["video"]` - to avoid creating all the video objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.

 </Tip>

@@ -136,7 +136,7 @@ To ignore the information in the metadata file, set `drop_metadata=True` in [`lo

 If you don't have a metadata file, `VideoFolder` automatically infers the label name from the directory name.
 If you want to drop automatically created labels, set `drop_labels=True`.
-In this case, your dataset will only contain an video column:
+In this case, your dataset will only contain a video column:
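
Note on the Tip text touched in both load guides: the advice to index by row first (`dataset[0]["pdf"]`) rather than by column first can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions: `ToyPdfDataset` and its `decode_calls` counter are hypothetical stand-ins written for this note, not the `datasets` implementation, and the "decode" step is a placeholder string operation rather than a real PDF or video decode.

```python
class ToyPdfDataset:
    """Hypothetical stand-in for a dataset with a lazily decoded `pdf` column."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # how many files have been "decoded" so far

    def _decode(self, path):
        # Placeholder for an expensive decode step (e.g. opening a PDF file).
        self.decode_calls += 1
        return f"decoded:{path}"

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested row.
            return {"pdf": self._decode(self.paths[key])}
        if key == "pdf":
            # Column access: decode every row in the dataset.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


paths = [f"doc_{i}.pdf" for i in range(1000)]

# Row first, then column: a single decode.
row_first = ToyPdfDataset(paths)
first_pdf = row_first[0]["pdf"]
print(row_first.decode_calls)  # 1

# Column first, then row: every file is decoded before one is picked.
column_first = ToyPdfDataset(paths)
same_pdf = column_first["pdf"][0]
print(column_first.decode_calls)  # 1000
```

Both access orders return the same object in this toy model; only the amount of decoding work differs, which is the cost the Tip warns about for large datasets.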