
OCR of mathematical proofs in arXiv papers produces irrelevant Chinese characters #142

Open
bwnjnOEI opened this issue Aug 31, 2024 · 3 comments

bwnjnOEI commented Aug 31, 2024

Error screenshot

[Two screenshots: iShot_2024-08-31_21 49 47, iShot_2024-08-31_21 48 53]

Code snippet

from pix2text import Pix2Text

img_fp = 'paper/2402.11867v3.pdf'
p2t = Pix2Text.from_config()
doc = p2t.recognize_pdf(img_fp)
doc.to_markdown(f'./{img_fp}.md')

Questions

Are Chinese and English models separate or a single model?
What parameters or models can be adjusted to achieve better results?
Will paid models definitely avoid this situation?

breezedeus (Owner) commented

Initialize it with this configuration:

import os
from pix2text import Pix2Text

text_formula_config = dict(
    languages=('en', ),  # set the recognition language(s)
    text=dict(
        rec_model_name='en_PP-OCRv3',
        rec_model_backend='onnx',
    ),
)
total_config = {
    'layout': {'scores_thresh': 0.45},
    'text_formula': text_formula_config,
}
p2t = Pix2Text.from_config(total_configs=total_config)

bwnjnOEI commented Sep 4, 2024

Thank you for the feedback. I changed the settings as you suggested, but the problem still occurs:

[Screenshot, image not preserved]

breezedeus (Owner) commented

Check the logs printed during model initialization to see whether the model 'en_PP-OCRv3' is really being used. Judging from your result, the configuration does not appear to have taken effect. If that is the case, try the following initialization code:

import os
from pix2text import Pix2Text

text_formula_config = dict(
    languages=('en', ),  
    text=dict(
        rec_model_name='en_PP-OCRv3',
        rec_model_backend='onnx',
        cand_alphabet=None,  # NOTE: must add this line
    ),
)
total_config = {
    'layout': {'scores_thresh': 0.45},
    'text_formula': text_formula_config,
}
p2t = Pix2Text.from_config(total_configs=total_config)
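
Besides reading the initialization logs, a quick way to confirm whether the English-only configuration took effect is to scan the generated Markdown for Chinese characters. The following is a minimal sketch and not part of Pix2Text: `find_cjk_lines` and `CJK_PATTERN` are hypothetical names, and the pattern only covers the main CJK Unified Ideographs blocks, which should be enough to catch stray Chinese in an English paper.

```python
import re

# Assumption: matching the basic CJK Unified Ideographs block and
# Extension A is sufficient to detect stray Chinese characters.
CJK_PATTERN = re.compile(r'[\u3400-\u4dbf\u4e00-\u9fff]')


def find_cjk_lines(markdown_text):
    """Return (line_number, line) pairs for lines containing CJK ideographs."""
    hits = []
    for line_no, line in enumerate(markdown_text.splitlines(), start=1):
        if CJK_PATTERN.search(line):
            hits.append((line_no, line))
    return hits


# Small illustrative input; in practice, pass the text of the Markdown
# file written by doc.to_markdown(...).
sample = "Proof. Let x > 0.\nThen 的 f(x) = 1.\nThis completes the proof."
hits = find_cjk_lines(sample)
for line_no, line in hits:
    print(f"line {line_no}: {line}")
```

If the list comes back empty for the recognized paper, the English recognition model is most likely in use; any remaining hits point to the lines worth inspecting.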
