Commit 0f30039

ShreeShree authored and committed
Add info about multiple box file formats
1 parent 7078800 commit 0f30039

1 file changed

+74
-2
lines changed

Making-Box-Files---4.0.md

Lines changed: 74 additions & 2 deletions
@@ -1,2 +1,74 @@
1-
See
2-
https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00#creating-training-data for more details.
1+
## Box File formats for Tesseract 4.0.x and later
2+
3+
Tesseract 4 accepts several box file formats for LSTM training. All of them differ from the format used by Tesseract 3.
4+
5+
### text2image
6+
7+
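Character-based box files are generated by `text2image` from Unicode font files and training text. Each line holds one character followed by its own bounding box (left, bottom, right, top) and the page number.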
8+
9+
```
10+
I 114 4655 120 4691 0
11+
n 127 4655 150 4682 0
12+
f 152 4655 169 4692 0
13+
o 168 4654 193 4682 0
14+
r 197 4654 213 4681 0
15+
m 214 4654 250 4681 0
16+
a 255 4654 280 4681 0
17+
t 282 4654 295 4689 0
18+
i 298 4654 304 4690 0
19+
o 308 4654 333 4681 0
20+
n 337 4654 360 4681 0
21+
360 4653 378 4691 0
22+
G 378 4653 413 4691 0
23+
r 418 4653 434 4680 0
24+
o 434 4653 459 4680 0
25+
u 463 4653 486 4679 0
26+
p 491 4643 515 4680 0
27+
s 517 4653 540 4680 0
28+
540 4653 555 4690 0
29+
```
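
As a rough illustration of the layout above, the following Python sketch parses one box line into named fields. The names `Box` and `parse_box_line` are hypothetical and not part of Tesseract. The sketch assumes the usual field order of symbol, left, bottom, right, top, page, and it assumes a space glyph is stored as a literal leading space, which the listing above may not show.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One entry of a box file (hypothetical helper type)."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so that a symbol which is itself a space
    # (or contains spaces) is kept intact.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    # Raises ValueError if the line does not have five integer fields.
    left, bottom, right, top, page = map(int, numbers)
    return Box(symbol, left, bottom, right, top, page)


if __name__ == "__main__":
    print(parse_box_line("I 114 4655 120 4691 0"))
```

Splitting from the right is a design choice: the five numeric fields are always last, so whatever remains on the left is the symbol.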
30+
31+
### lstmbox
32+
33+
Generated by `tesseract` with the `lstmbox` config from image files. Every character on a text line carries the bounding box of the whole line rather than its own. Box files in this format can also be generated from line images and their transcriptions with the `Makefile` in the `tesstrain` repo.
34+
35+
```
36+
I 114 4640 1912 4692 0
37+
n 114 4640 1912 4692 0
38+
f 114 4640 1912 4692 0
39+
o 114 4640 1912 4692 0
40+
r 114 4640 1912 4692 0
41+
m 114 4640 1912 4692 0
42+
a 114 4640 1912 4692 0
43+
t 114 4640 1912 4692 0
44+
i 114 4640 1912 4692 0
45+
o 114 4640 1912 4692 0
46+
n 114 4640 1912 4692 0
47+
114 4640 1912 4692 0
48+
G 114 4640 1912 4692 0
49+
r 114 4640 1912 4692 0
50+
o 114 4640 1912 4692 0
51+
u 114 4640 1912 4692 0
52+
p 114 4640 1912 4692 0
53+
s 114 4640 1912 4692 0
54+
114 4640 1912 4692 0
55+
```
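
To make the relationship between the two layouts concrete, here is a minimal Python sketch that takes the per-character boxes of one text line and gives every character the bounding box of the whole line. The function name `to_line_boxes` is hypothetical. This only illustrates the idea of a shared line box; it is not how `tesseract` computes its `lstmbox` output, and the resulting numbers will generally differ from the listing above.

```python
def to_line_boxes(boxes):
    """Replace each character box with the union box of the line.

    `boxes` is a non-empty list of
    (symbol, left, bottom, right, top, page) tuples
    that all belong to the same text line.
    """
    left = min(b[1] for b in boxes)
    bottom = min(b[2] for b in boxes)
    right = max(b[3] for b in boxes)
    top = max(b[4] for b in boxes)
    # Keep each symbol and page, swap in the shared line coordinates.
    return [(b[0], left, bottom, right, top, b[5]) for b in boxes]


if __name__ == "__main__":
    chars = [("I", 114, 4655, 120, 4691, 0), ("n", 127, 4655, 150, 4682, 0)]
    for symbol, *coords in to_line_boxes(chars):
        print(symbol, *coords)
```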
56+
57+
### wordstrbox
58+
59+
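Generated by `tesseract` with the `wordstrbox` config from image files. Each text line becomes one `WordStr` entry that holds the bounding box of the whole line, with the text of the line after the `#`. A second, narrow box follows at the right edge of the line.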
60+
61+
```
62+
WordStr 114 4640 1907 4692 0 #Information Groups for public OPTIONAL, jaundice Proterozoic Have LOCATION
63+
1908 4640 1912 4692 0
64+
WordStr 112 4544 2015 4592 0 #mixed, Male By TEXT Cove... ¥ INSTABILITY About WERE Crimson THAT HOPKINS
65+
2016 4544 2020 4592 0
66+
```
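
A `WordStr` entry can be split into its line box and its text at the first ` #`. The Python sketch below shows one way to do that. The name `parse_wordstr_line` is hypothetical, and the sketch assumes the layout seen in the sample above: the keyword `WordStr`, five integers, then ` #` and the transcription.

```python
def parse_wordstr_line(line):
    """Return ((left, bottom, right, top, page), text) for a WordStr entry."""
    # Everything after the first " #" is the transcription, so a '#'
    # that appears later inside the text is preserved.
    head, sep, text = line.rstrip("\n").partition(" #")
    fields = head.split()
    if not sep or len(fields) != 6 or fields[0] != "WordStr":
        raise ValueError(f"not a WordStr entry: {line!r}")
    left, bottom, right, top, page = map(int, fields[1:])
    return (left, bottom, right, top, page), text


if __name__ == "__main__":
    box, text = parse_wordstr_line(
        "WordStr 114 4640 1907 4692 0 #Information Groups for public"
    )
    print(box, text)
```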
67+
68+
### `makebox` doesn't work for Tesseract 4.0.x and later
69+
70+
Note that box files generated by `tesseract` with the `makebox` config file work for Tesseract 3 but cannot be used for Tesseract 4 LSTM training.
71+
72+
### More details
73+
74+
See the section on [creating training data](https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00#creating-training-data) for more details.