We introduce LogicOCR, a benchmark comprising 1,100 multiple-choice questions designed to evaluate the logical reasoning abilities of Large Multimodal Models (LMMs) on text-rich images, while minimizing reliance on domain-specific knowledge (e.g., mathematics). We construct LogicOCR by developing a scalable, automated pipeline to convert raw text corpora into multimodal samples. First, we design prompt templates to steer GPT-Image-1 to generate images with diverse backgrounds, interleaved text-illustration layouts, and varied fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified, with low-quality examples discarded. We evaluate a range of representative open-source and proprietary LMMs under both Chain-of-Thought (CoT) and direct-answer settings.
- CoT does not consistently improve accuracy on LogicOCR—most models fail to reason better step-by-step, suggesting flaws in their reasoning paths.
- Test-time scaling significantly improves performance on LogicOCR, though the efficiency of open-source LMMs still leaves room for improvement.
- State-of-the-art LMMs still fall short of fully integrating visual reading and reasoning. While vision-language alignment suffices for perception tasks like OCR, it remains inadequate for more complex reasoning, especially as model size grows.
- The perception robustness of LMMs across different visual-text orientations needs further improvement. Perturbations like image rotation can reduce accuracy to near-random levels.
For main results and detailed analysis, please refer to the paper.
- [05/16/2025]: Released the dataset on Hugging Face and released the code.
- Setup
Clone this repo and download the images and JSON file:
```bash
git clone https://github.com/MiliLab/LogicOCR
cd LogicOCR
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/images.zip
unzip images.zip && rm images.zip
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR.json
```
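After downloading, you can sanity-check the annotations with a few lines of Python. The sketch below only assumes that LogicOCR.json is a JSON list of per-question records; the actual field names are whatever the file defines:

```python
import json

# Load the benchmark annotations (downloaded above) and peek at the schema.
with open("LogicOCR.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(f"{len(data)} samples")             # expected: 1,100 questions
print("fields:", sorted(data[0].keys()))  # inspect the record schema
```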
- Recommended Environment
python>=3.10, torch 2.5.1, torchvision 0.20.1, transformers 4.49.0, and flash-attn 2.7.4.post1; see requirement.txt for the full list.
- Evaluate LMMs
Evaluation scripts are provided in infer_models. Run:
```bash
bash eval.sh
```
You can also find the existing evaluation results in the Hugging Face repo.
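For orientation, evaluation boils down to prompting under one of the two settings (CoT or direct-answer) and matching the option letter in the model's reply. The suffixes and parsing rule below are a hypothetical sketch for illustration, not the exact templates used by eval.sh (those live in infer_models):

```python
import re

# Hypothetical prompt suffixes for the two settings evaluated in the paper.
COT_SUFFIX = "Let's think step by step, then finish with 'The answer is (X)'."
DIRECT_SUFFIX = "Answer directly with the option letter."

def extract_choice(reply: str) -> str | None:
    """Naive parser: take the last standalone A-D letter in the reply."""
    matches = re.findall(r"\b([A-D])\b", reply)
    return matches[-1] if matches else None

# Scoring a run is then plain accuracy over the extracted letters.
preds = ["...so the answer is (B).", "C"]
golds = ["B", "C"]
acc = sum(extract_choice(p) == g for p, g in zip(preds, golds)) / len(golds)
print(f"accuracy: {acc:.2f}")  # -> 1.00
```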
- (Optional) Evaluate OCR and Two-Step Performance
```bash
bash eval_ocr.sh
```
You can also find the existing OCR evaluation results in the Hugging Face repo.
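The OCR scores themselves are computed by the tool modified from Fox. As a rough illustration of the kind of metric involved, here is a minimal normalized edit-distance score; this is an assumption for illustration, not Fox's exact implementation:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def ocr_score(pred: str, gt: str) -> float:
    """1 - normalized edit distance; 1.0 means a perfect transcription."""
    if not pred and not gt:
        return 1.0
    return 1.0 - edit_distance(pred, gt) / max(len(pred), len(gt))

print(ocr_score("Logical reasoning", "Logical reasoning"))  # -> 1.0
```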
- (Optional) Generate Images
If you want to generate images yourself, a JSON file with 3 samples and a simple script are provided for reference. Run the following commands; the generated images will be saved in gen_images/saved_folder:
```bash
cd gen_images
python gpt_generate.py samples.json $YOUR_API_KEY $YOUR_BASE_URL $NUM_WORKERS
```
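Under the hood, each sample corresponds to one image-generation request. Below is a minimal sketch of such a call using the OpenAI Python SDK; the prompt, size, and output filename are placeholders, and the real per-sample logic (prompt templates, retries, parallel workers) lives in gpt_generate.py:

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI(api_key="YOUR_API_KEY", base_url="YOUR_BASE_URL")

# Placeholder prompt; the actual templates steer backgrounds, layouts, and fonts.
prompt = "A realistic photo of a notebook page with this question written on it: ..."

result = client.images.generate(model="gpt-image-1", prompt=prompt, size="1024x1024")

# GPT-Image-1 returns base64-encoded image data.
image = base64.b64decode(result.data[0].b64_json)
with open("saved_folder/example.png", "wb") as f:
    f.write(image)
```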
- License
LogicOCR is licensed under CC BY-NC-SA 4.0.
- Acknowledgement
The raw text corpora are collected from LogiQA and LogiQA2.0. The inference script is modified from OCRBench, and the OCR evaluation tool is modified from Fox.
- Citation
If you find LogicOCR helpful, please consider giving this repo a ⭐ and citing:
```bibtex
@article{ye2025logicocr,
  title={LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?},
  author={Maoyuan Ye and Jing Zhang and Juhua Liu and Bo Du and Dacheng Tao},
  journal={arXiv preprint arXiv:2505.12307},
  year={2025}
}
```