## Box file formats for Tesseract 4.0.x and later

Tesseract 4 accepts several box file formats for LSTM training. All of them differ from the format used by Tesseract 3.

### text2image

Character-based box files generated by `text2image` from Unicode font files and training text. Each row gives a character followed by the left, bottom, right, and top coordinates of its bounding box and the page number.

```
I 114 4655 120 4691 0
n 127 4655 150 4682 0
f 152 4655 169 4692 0
o 168 4654 193 4682 0
r 197 4654 213 4681 0
m 214 4654 250 4681 0
a 255 4654 280 4681 0
t 282 4654 295 4689 0
i 298 4654 304 4690 0
o 308 4654 333 4681 0
n 337 4654 360 4681 0
  360 4653 378 4691 0
G 378 4653 413 4691 0
r 418 4653 434 4680 0
o 434 4653 459 4680 0
u 463 4653 486 4679 0
p 491 4643 515 4680 0
s 517 4653 540 4680 0
	 540 4653 555 4690 0
```
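
To make the row layout concrete, here is a minimal Python sketch that parses such rows. The names `Box`, `parse_box_line`, and `boxes_to_text` are hypothetical and not part of Tesseract. The sketch assumes that a row whose first field is whitespace encodes either a word space (a space character) or an end-of-line marker (a tab character).

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    glyph: str   # the character; assumed " " for a word space, "\t" for end of line
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one 'glyph left bottom right top page' row."""
    # Split from the right so that a whitespace glyph in the first field survives.
    # A row with fewer than six fields raises ValueError.
    glyph, *numbers = line.rstrip("\r\n").rsplit(" ", 5)
    left, bottom, right, top, page = map(int, numbers)
    return Box(glyph, left, bottom, right, top, page)


def boxes_to_text(rows: List[str]) -> List[str]:
    """Rebuild the text lines from box rows, splitting at tab glyphs."""
    lines, current = [], []
    for row in rows:
        glyph = parse_box_line(row).glyph
        if glyph == "\t":
            lines.append("".join(current))
            current = []
        else:
            current.append(glyph)
    if current:  # rows after the last end-of-line marker, if any
        lines.append("".join(current))
    return lines


rows = ["I 114 4655 120 4691 0", "n 127 4655 150 4682 0", "\t 540 4653 555 4690 0"]
print(boxes_to_text(rows))  # ['In']
```

Splitting from the right keeps the parser indifferent to what the glyph field contains, which matters if the leading whitespace of space and tab rows is preserved in the file.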

### lstmbox

Generated by `tesseract` from image files using the `lstmbox` config. Each character carries the coordinates of its entire line rather than its own bounding box. These files can also be generated from line images and their transcriptions with the `Makefile` in the `tesstrain` repo.

```
I 114 4640 1912 4692 0
n 114 4640 1912 4692 0
f 114 4640 1912 4692 0
o 114 4640 1912 4692 0
r 114 4640 1912 4692 0
m 114 4640 1912 4692 0
a 114 4640 1912 4692 0
t 114 4640 1912 4692 0
i 114 4640 1912 4692 0
o 114 4640 1912 4692 0
n 114 4640 1912 4692 0
  114 4640 1912 4692 0
G 114 4640 1912 4692 0
r 114 4640 1912 4692 0
o 114 4640 1912 4692 0
u 114 4640 1912 4692 0
p 114 4640 1912 4692 0
s 114 4640 1912 4692 0
	 114 4640 1912 4692 0
```
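
As an illustration of this layout, the following Python sketch builds such rows from a line transcription and the line's bounding box. It is not the script used by `tesstrain`. The function name `lstmbox_rows` is hypothetical, and the closing row is assumed to hold a tab character with the same line coordinates, as in the listing above.

```python
from typing import List


def lstmbox_rows(text: str, left: int, bottom: int, right: int, top: int,
                 page: int = 0) -> List[str]:
    """Return one row per character, each carrying the whole line's box."""
    coords = f"{left} {bottom} {right} {top} {page}"
    # Simplification: one row per Python code point. Scripts with combining
    # marks would likely need those grouped with their base character.
    rows = [f"{ch} {coords}" for ch in text]
    # Assumed end-of-line marker: a tab glyph with the line coordinates.
    rows.append(f"\t {coords}")
    return rows


for row in lstmbox_rows("Information Groups", 114, 4640, 1912, 4692):
    print(row)
```

Because every row shares the line coordinates, only the transcription and the line box are needed. No per-character segmentation is involved.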

### wordstrbox

Generated by `tesseract` from image files using the `wordstrbox` config. Each `WordStr` row holds the bounding box of a whole line, followed by `#` and the text of that line.

```
WordStr 114 4640 1907 4692 0 #Information Groups for public OPTIONAL, jaundice Proterozoic Have LOCATION
	 1908 4640 1912 4692 0
WordStr 112 4544 2015 4592 0 #mixed, Male By TEXT Cove... ¥ INSTABILITY About WERE Crimson THAT HOPKINS
	 2016 4544 2020 4592 0
```
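
A short Python sketch of how such rows could be read back into line boxes and text. The name `read_wordstr` is hypothetical. The sketch assumes that the text starts after the first ` #` and that rows not beginning with `WordStr` are end-of-line markers that can be skipped.

```python
from typing import Iterable, List, Tuple

LineEntry = Tuple[Tuple[int, int, int, int], int, str]


def read_wordstr(lines: Iterable[str]) -> List[LineEntry]:
    """Return ((left, bottom, right, top), page, text) for each WordStr row."""
    entries = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("WordStr "):
            continue  # assumed end-of-line marker row, or anything else
        # Everything after the first " #" is the transcription of the line.
        head, _, text = line.partition(" #")
        _, *numbers = head.split()
        left, bottom, right, top, page = map(int, numbers)
        entries.append(((left, bottom, right, top), page, text))
    return entries


sample = [
    "WordStr 114 4640 1907 4692 0 #Information Groups for public OPTIONAL, jaundice Proterozoic Have LOCATION",
    "\t 1908 4640 1912 4692 0",
]
for box, page, text in read_wordstr(sample):
    print(box, page, text)
```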

### `makebox` doesn't work for Tesseract 4.0.x and later

Box files generated by `tesseract` with the `makebox` config file work for Tesseract 3 training but not for Tesseract 4 LSTM training.

### More details

See the section on [creating training data](https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00#creating-training-data) for more details.