Qiang Ma2, Long Lan1 , Wenjing Yang1* , Jing Zhang3,6, Zhiyuan Liu2, Maosong Sun2
3School of Computer Science, Wuhan University, 4Beijing University of Posts and Telecommunications,
5Zhongguancun Academy, 6School of Artificial Intelligence, Wuhan University
- 2025.04.04: Selected as Highlight by CVPR 2025!
- 2025.04.01: The dataset is currently under final review and will be released in one month.
- 2025.02.27: XLRS-Bench has been accepted by CVPR 2025!
- 🔥News
- 📚Contents
- 🔍Dataset Overview
- 📸Dataset Examples
- 📜Dataset License
- 🚀Evaluation Pipeline
- 📊Experiment Results
- 📖Citation
- 🙏Acknowledgement
- 📚Contact
Remote sensing (RS) images have become essential for monitoring and understanding human environments, driving advances in applications such as precision agriculture, urban planning, and disaster assessment. While recent studies have proposed benchmarks and metrics to assess the performance of multimodal large language models (MLLMs) in RS, these efforts remain limited in image size, annotation method, and evaluation dimensions. We present XLRS-Bench, a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios, featuring the largest average image size to date (8,500 × 8,500 pixels). The dataset comprises 45,942 annotations across 16 tasks, all curated by a team of 45 experts. The main advantages of XLRS-Bench over existing MLLM benchmarks are as follows:
- Ultra-high Resolution. XLRS-Bench features the largest image sizes available, 10∼20× larger than those of existing datasets, with 840 images at a resolution of 10,000 × 10,000 pixels.
- High-quality Annotation. All annotations are produced with human involvement and manually verified over multiple iterations, resulting in a high-quality benchmark for evaluating MLLMs in real ultra-high-resolution RS scenarios.
- Comprehensive Evaluation Dimensions. XLRS-Bench covers 10 perception indicators and 6 reasoning dimensions to assess MLLMs’ capabilities, encompassing 16 sub-tasks with a total of 45,942 questions. In particular, XLRS-Bench includes complex reasoning tasks that explore MLLMs’ potential for planning and change detection in long spatial-temporal RS scenarios.
Annotations of this dataset are released under a Creative Commons Attribution-NonCommercial 4.0 International License. For images from:
- DOTA: RGB images from Google Earth and CycloMedia (for academic use only; commercial use is prohibited, and Google Earth terms of use apply).
- ITCVD: Licensed under CC-BY-NC-SA-4.0.
- MiniFrance, HRSCD: Released under IGN’s "licence ouverte".
- Toronto, Potsdam: The Toronto test data images are derived from the Downtown Toronto dataset provided by Optech Inc., First Base Solutions Inc., GeoICT Lab at York University, and ISPRS WG III/4, and are subject to the following conditions:
  - The data must not be used for other than research purposes. Any other use is prohibited.
  - The data must not be used outside the context of this test project, in particular while the project is still on-going (i.e. until September 2012). Whether the data will be available for other research purposes after the end of this project is still under discussion.
  - The data must not be distributed to third parties. Any person interested in the data may obtain them via ISPRS WG III/4.
  - The data users should include the following acknowledgement in any publication resulting from the datasets: “The authors would like to acknowledge the provision of the Downtown Toronto data set by Optech Inc., First Base Solutions Inc., GeoICT Lab at York University, and ISPRS WG III/4.”
Disclaimer:
If any party believes their rights are infringed, please contact us immediately at wfx23@nudt.edu.cn. We will promptly remove any infringing content.
The common prompt used in VQA follows this format:
[Image] [Question] The choices are listed below:
(A) [Choice A]
(B) [Choice B]
(C) [Choice C]
(D) [Choice D]
Select the best answer for the multiple-choice question based on the image. Only respond with the letter corresponding to the correct answer (A, B, C, D).
The answer is:
For the VQA task Overall Land Use Classification, please use the following instruction instead:
Select the best answer(s) for the multiple-choice question based on the image. There may be more than one correct option. Only respond with the letter(s) corresponding to the correct answer(s) (A, B, C, D), with multiple choices separated by spaces.
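A minimal sketch of assembling these prompts programmatically is shown below. The question record fields (`question`, `choices`, `multi_select`) are assumptions for illustration only; the released annotation files define the actual schema, and the image itself is passed to the model separately.

```python
# Sketch of the VQA prompt templates above. The [Image] placeholder is omitted
# because the image is supplied to the MLLM separately from the text prompt.

SINGLE_ANSWER_INSTRUCTION = (
    "Select the best answer for the multiple-choice question based on the image. "
    "Only respond with the letter corresponding to the correct answer (A, B, C, D).\n"
    "The answer is:"
)

MULTI_ANSWER_INSTRUCTION = (
    "Select the best answer(s) for the multiple-choice question based on the image. "
    "There may be more than one correct option. Only respond with the letter(s) "
    "corresponding to the correct answer(s) (A, B, C, D), with multiple choices "
    "separated by spaces.\n"
    "The answer is:"
)


def build_vqa_prompt(question: str, choices: list[str], multi_select: bool = False) -> str:
    """Format a question and its choices into the common VQA prompt text."""
    lettered = "\n".join(f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    instruction = MULTI_ANSWER_INSTRUCTION if multi_select else SINGLE_ANSWER_INSTRUCTION
    return f"{question} The choices are listed below:\n{lettered}\n{instruction}"


if __name__ == "__main__":
    # multi_select=True corresponds to Overall Land Use Classification
    print(build_vqa_prompt(
        "What is the dominant land-use type in this image?",
        ["Residential", "Agricultural", "Industrial", "Water body"],
        multi_select=True,
    ))
```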
For Visual Grounding and Detailed Image Captioning, please refer to our source files in the evaluation folder.
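For rough orientation only, visual grounding predictions are typically compared against ground-truth boxes via Intersection-over-Union; the sketch below shows that generic computation. The authoritative grounding and captioning metrics are the ones implemented in the evaluation folder.

```python
from typing import Sequence


def box_iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection-over-Union of two boxes given as [x1, y1, x2, y2].

    Generic illustration only; the official grounding metric is defined by
    the scripts in the evaluation folder.
    """
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Example: a prediction counted as a hit at the common IoU >= 0.5 threshold
print(box_iou([100, 100, 300, 300], [150, 120, 320, 310]) >= 0.5)
```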
As soon as our dataset is released, XLRS-Bench will be integrated with lmms-eval, allowing you to evaluate models easily.
To add your model to our leaderboard, please send your model responses to wfx23@nudt.edu.cn. Refer to the evaluation/samples folder for the required format.
Models are ranked based on their average performance. Proprietary models are highlighted in gray. Task domains are indicated by abbreviations (e.g., “OC” for Overall Counting, “RC” for Regional Counting, etc.).
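For reference, the sketch below shows one way the per-task accuracies and the average used for ranking could be computed from a response file: extract the predicted letter(s) from each model response, compare them with the ground truth, and average per task. The record fields (`task`, `answer`, `response`) are assumptions for illustration; follow the evaluation/samples folder for the exact submission schema.

```python
import re
from collections import defaultdict


def extract_letters(response: str) -> set[str]:
    """Pull the option letters (A-D) out of a free-form model response."""
    return set(re.findall(r"\b([A-D])\b", response.upper()))


def score_submission(records: list[dict]) -> dict[str, float]:
    """Per-task accuracy plus the average used for ranking.

    Each record is assumed to carry "task", "answer" (ground-truth letters,
    e.g. "A" or "A C"), and "response" (the raw model output); see
    evaluation/samples for the required format.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        gold = set(rec["answer"].split())
        pred = extract_letters(rec["response"])
        correct[rec["task"]] += int(pred == gold)
        total[rec["task"]] += 1
    scores = {task: correct[task] / total[task] for task in total}
    scores["Average"] = sum(scores.values()) / len(scores)
    return scores


if __name__ == "__main__":
    demo = [
        {"task": "OC", "answer": "B", "response": "The answer is: B"},
        {"task": "RC", "answer": "A C", "response": "A C"},
    ]
    print(score_submission(demo))  # {'OC': 1.0, 'RC': 1.0, 'Average': 1.0}
```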
- L-2 performance on VQA tasks.
- L-3 reasoning dimension on VQA tasks.
- Visual grounding performance.
- Detailed image captioning performance.
If you find our work helpful, please consider citing:
@article{wang2025xlrsbench,
title={XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?},
author={Wang, Fengxiang and Wang, Hongzhen and Chen, Mingshuo and Wang, Di and Wang, Yulin and Guo, Zonghao and Ma, Qiang and Lan, Long and Yang, Wenjing and Zhang, Jing and others},
journal={arXiv preprint arXiv:2503.23771},
year={2025}
}
- [MME-RealWorld] MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
For any other questions please contact:
- Fengxiang Wang at wfx23@nudt.edu.cn
- Mingshuo Chen at chen.mingshuo@bupt.edu.cn