Mathematical proofs in ArXiv papers recognized by OCR have appeared with irrelevant Chinese. #142

bwnjnOEI · 2024-08-31T14:01:50Z

Error screenshot

Code snippet

from pix2text import Pix2Text

img_fp = 'paper/2402.11867v3.pdf'
p2t = Pix2Text.from_config()
doc = p2t.recognize_pdf(img_fp)
doc.to_markdown(f'./{img_fp}.md')

Questions

Are Chinese and English models separate or a single model?
What parameters or models can be adjusted to achieve better results?
Will paid models definitely avoid this situation?

breezedeus · 2024-09-03T14:34:04Z

Initialize it with these configs:

import os
from pix2text import Pix2Text

text_formula_config = dict(
    languages=('en', ),  # set the recognition language(s)
    text=dict(
        rec_model_name='en_PP-OCRv3',
        rec_model_backend='onnx',
    ),
)
total_config = {
    'layout': {'scores_thresh': 0.45},
    'text_formula': text_formula_config,
}
p2t = Pix2Text.from_config(total_configs=total_config)
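For reference, the configuration above can be combined with the `recognize_pdf` call from the original snippet. This is only a sketch: `build_english_only_config` and `convert_pdf` are hypothetical helper names, and passing an output directory to `to_markdown` is an assumption based on how the Pix2Text README uses it.

```python
def build_english_only_config(layout_threshold=0.45):
    """Return the total_configs dict from the reply above (hypothetical wrapper)."""
    text_formula_config = dict(
        languages=('en',),  # restrict text recognition to English
        text=dict(
            rec_model_name='en_PP-OCRv3',
            rec_model_backend='onnx',
        ),
    )
    return {
        'layout': {'scores_thresh': layout_threshold},
        'text_formula': text_formula_config,
    }


def convert_pdf(pdf_path, out_dir):
    """Recognize a PDF with the English-only config and export Markdown."""
    # Imported lazily so the config helper can be inspected without loading models.
    from pix2text import Pix2Text

    p2t = Pix2Text.from_config(total_configs=build_english_only_config())
    doc = p2t.recognize_pdf(pdf_path)
    # Assumption: to_markdown treats its argument as an output directory.
    doc.to_markdown(out_dir)
    return doc
```

Calling `convert_pdf('paper/2402.11867v3.pdf', 'output-md')` would then run the whole pipeline on the paper from the original report.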

bwnjnOEI · 2024-09-04T00:29:07Z

Thank you for the feedback. I changed the settings as you suggested, but the problem still occurs.

breezedeus · 2024-09-08T16:05:03Z

Check the logs printed during model initialization to confirm that the 'en_PP-OCRv3' model is actually being used. Judging from your result, the configuration does not seem to have taken effect. If that is the case, try the following initialization code:

import os
from pix2text import Pix2Text

text_formula_config = dict(
    languages=('en', ),  
    text=dict(
        rec_model_name='en_PP-OCRv3',
        rec_model_backend='onnx',
        cand_alphabet=None,  # NOTE: must add this line
    ),
)
total_config = {
    'layout': {'scores_thresh': 0.45},
    'text_formula': text_formula_config,
}
p2t = Pix2Text.from_config(total_configs=total_config)
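One way to confirm whether a configuration change actually took effect is to scan the exported Markdown for stray Chinese characters. The helper below is a minimal sketch using only the standard library. `find_cjk_lines` and `check_markdown_file` are hypothetical names, not part of Pix2Text, and the Unicode ranges cover only the most common CJK ideograph blocks.

```python
import re
from pathlib import Path

# Hypothetical helper, not part of Pix2Text. Matches CJK Unified Ideographs
# (U+4E00 to U+9FFF) and Extension A (U+3400 to U+4DBF).
CJK_PATTERN = re.compile(r'[\u3400-\u4dbf\u4e00-\u9fff]')


def find_cjk_lines(text):
    """Return (line_number, line) pairs for every line containing a CJK ideograph."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if CJK_PATTERN.search(line)
    ]


def check_markdown_file(md_path):
    """Scan one exported Markdown file and print each suspicious line."""
    hits = find_cjk_lines(Path(md_path).read_text(encoding='utf-8'))
    for number, line in hits:
        print(f'{md_path}:{number}: {line.strip()}')
    return hits
```

An empty result from `check_markdown_file` on the exported file suggests the English-only model was used. Any hits point to the lines worth comparing against the source PDF.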

bwnjnOEI changed the title ~~OCR arXiv papers with mathematical proofs that appear unrelated in Chinese.~~ OCR arXiv paper with mathematical proofs that appear unrelated in Chinese. Aug 31, 2024

bwnjnOEI changed the title ~~OCR arXiv paper with mathematical proofs that appear unrelated in Chinese.~~ Mathematical proofs in ArXiv papers recognized by OCR have appeared with irrelevant Chinese. Aug 31, 2024
