We introduce LogicOCR, a benchmark comprising 1,100 multiple-choice questions designed to evaluate the logical reasoning abilities of Large Multimodal Models (LMMs) on text-rich images, while minimizing reliance on domain-specific knowledge (e.g., mathematics). We construct LogicOCR by developing a scalable, automated pipeline to convert raw text corpora into multimodal samples. First, we design prompt templates to steer GPT-Image-1 to generate images with diverse backgrounds, interleaved text-illustration layouts, and varied fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified, with low-quality examples discarded. We evaluate a range of representative open-source and proprietary LMMs under both Chain-of-Thought (CoT) and direct-answer settings.
- CoT does not consistently improve accuracy on LogicOCR—most models fail to reason better step-by-step, suggesting flaws in their reasoning paths.
- Test-time scaling significantly improves performance on LogicOCR, though the efficiency of open-source LMMs still leaves room for improvement.
- State-of-the-art LMMs still fall short of fully integrating visual reading and reasoning. While vision-language alignment suffices for perception tasks like OCR, it remains inadequate for more complex reasoning, especially as model size grows.
- The perception robustness of LMMs across different visual-text orientations needs further improvement. Perturbations like image rotation can reduce accuracy to near-random levels.
For main results and detailed analysis, please refer to the paper.
- [05/16/2025]: Released the dataset on Hugging Face and released the code.
- Setup
Clone this repo and download the images and JSON file:
```bash
git clone https://github.com/MiliLab/LogicOCR
cd LogicOCR
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/images.zip
unzip images.zip && rm images.zip
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR.json
```
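After downloading, you can sanity-check the annotations with a few lines of Python. The sketch below only assumes that LogicOCR.json is a JSON list of per-question records; the actual field names are whatever the file defines:

```python
import json

# Load the benchmark annotations (downloaded above) and peek at the schema.
with open("LogicOCR.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(f"{len(data)} samples")             # expected: 1,100 questions
print("fields:", sorted(data[0].keys()))  # inspect the record schema
```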
- Recommended Environment
python>=3.10, torch 2.5.1, torchvision 0.20.1, transformers 4.49.0, and flash-attn 2.7.4.post1; see requirement.txt for the full list.
- Evaluate LMMs
Evaluation scripts are provided in infer_models. Run:
```bash
bash eval.sh
```
You can also find the existing evaluation results in the Hugging Face repo.
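For orientation, evaluation boils down to prompting under one of the two settings (CoT or direct-answer) and matching the option letter in the model's reply. The suffixes and parsing rule below are a hypothetical sketch for illustration, not the exact templates used by eval.sh (those live in infer_models):

```python
import re

# Hypothetical prompt suffixes for the two settings evaluated in the paper.
COT_SUFFIX = "Let's think step by step, then finish with 'The answer is (X)'."
DIRECT_SUFFIX = "Answer directly with the option letter."

def extract_choice(reply: str) -> str | None:
    """Naive parser: take the last standalone A-D letter in the reply."""
    matches = re.findall(r"\b([A-D])\b", reply)
    return matches[-1] if matches else None

# Scoring a run is then plain accuracy over the extracted letters.
preds = ["...so the answer is (B).", "C"]
golds = ["B", "C"]
acc = sum(extract_choice(p) == g for p, g in zip(preds, golds)) / len(golds)
print(f"accuracy: {acc:.2f}")  # -> 1.00
```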
- (Optional) Evaluate OCR and Two-Step Performance
```bash
bash eval_ocr.sh
```
You can also find the existing OCR evaluation results in the Hugging Face repo.
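The OCR scores themselves are computed by the tool modified from Fox. As a rough illustration of the kind of metric involved, here is a minimal normalized edit-distance score; this is an assumption for illustration, not Fox's exact implementation:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def ocr_score(pred: str, gt: str) -> float:
    """1 - normalized edit distance; 1.0 means a perfect transcription."""
    if not pred and not gt:
        return 1.0
    return 1.0 - edit_distance(pred, gt) / max(len(pred), len(gt))

print(ocr_score("Logical reasoning", "Logical reasoning"))  # -> 1.0
```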
- (Optional) Generate Images
If you want to generate images yourself, a JSON file with 3 samples and a simple script are provided for reference. Run the following commands; the generated images will be saved in gen_images/saved_folder:
```bash
cd gen_images
python gpt_generate.py samples.json $YOUR_API_KEY $YOUR_BASE_URL $NUM_WORKERS
```
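Under the hood, each sample corresponds to one image-generation request. Below is a minimal sketch of such a call using the OpenAI Python SDK; the prompt, size, and output filename are placeholders, and the real per-sample logic (prompt templates, retries, parallel workers) lives in gpt_generate.py:

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI(api_key="YOUR_API_KEY", base_url="YOUR_BASE_URL")

# Placeholder prompt; the actual templates steer backgrounds, layouts, and fonts.
prompt = "A realistic photo of a notebook page with this question written on it: ..."

result = client.images.generate(model="gpt-image-1", prompt=prompt, size="1024x1024")

# GPT-Image-1 returns base64-encoded image data.
image = base64.b64decode(result.data[0].b64_json)
with open("saved_folder/example.png", "wb") as f:
    f.write(image)
```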
- License
LogicOCR is licensed under CC BY-NC-SA 4.0.
- Acknowledgement
The raw text corpora are collected from LogiQA and LogiQA2.0. The inference script is modified from OCRBench, and the OCR evaluation tool is modified from Fox.
- Citation
If you find LogicOCR helpful, please consider giving this repo a ⭐ and citing:
```bibtex
@article{ye2025logicocr,
  title={LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?},
  author={Maoyuan Ye and Jing Zhang and Juhua Liu and Bo Du and Dacheng Tao},
  journal={arXiv preprint arXiv:2505.12307},
  year={2025}
}
```