Qiang Ma2, Long Lan1 , Wenjing Yang1* , Jing Zhang3,6, Zhiyuan Liu2, Maosong Sun2
3School of Computer Science, Wuhan University, 4Beijing University of Posts and Telecommunications,
5Zhongguancun Academy, 6School of Artificial Intelligence, Wuhan University
- 2025.04.04: Selected as Highlight by CVPR 2025!
- 2025.04.01: The dataset is currently under final review and will be released in one month.
- 2025.02.27: XLRS-Bench has been accepted by CVPR 2025!
- 🔥News
- 📚Contents
- 🔍Dataset Overview
- 📸Dataset Examples
- 📜Dataset License
- 🚀Evaluation Pipeline
- 📊Experiment Results
- 📖Citation
- 🙏Acknowledgement
- 📚Contact
Remote sensing (RS) images have become essential for monitoring and understanding human environments, driving advances in applications such as precision agriculture, urban planning, and disaster assessment. While recent studies have proposed benchmarks and metrics to assess the performance of multimodal large language models (MLLMs) in RS, these efforts remain limited in image size, annotation method, and evaluation dimensions. We present XLRS-Bench, a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios, featuring the largest average image size to date (8,500 × 8,500 pixels). The dataset comprises 45,942 annotations across 16 tasks, all curated by a team of 45 experts. The main advantages of XLRS-Bench over existing MLLM benchmarks are as follows:
- Ultra-high Resolution. XLRS-Bench features the largest image sizes available, 10∼20× larger than those of existing datasets, with 840 images at a resolution of 10,000 × 10,000 pixels.
- High-quality Annotation. All annotations are produced with human involvement and manually verified over multiple iterations, resulting in a high-quality benchmark for evaluating MLLMs in real ultra-high-resolution RS scenarios.
- Comprehensive Evaluation Dimensions. XLRS-Bench covers 10 perception indicators and 6 reasoning dimensions to assess MLLMs’ capabilities, encompassing 16 sub-tasks with a total of 45,942 questions. In particular, XLRS-Bench includes complex reasoning tasks that explore MLLMs’ potential for planning and change detection in long spatial-temporal RS scenarios.
Annotations of this dataset are released under a Creative Commons Attribution-NonCommercial 4.0 International License. For images from:
- DOTA: RGB images from Google Earth and CycloMedia (for academic use only; commercial use is prohibited, and Google Earth terms of use apply).
- ITCVD: Licensed under CC-BY-NC-SA-4.0.
- MiniFrance, HRSCD: Released under IGN’s "licence ouverte".
- Toronto, Potsdam: The Toronto test data images are derived from the Downtown Toronto dataset provided by Optech Inc., First Base Solutions Inc., GeoICT Lab at York University, and ISPRS WG III/4, and are subject to the following conditions:
  - The data must not be used for other than research purposes. Any other use is prohibited.
  - The data must not be used outside the context of this test project, in particular while the project is still on-going (i.e. until September 2012). Whether the data will be available for other research purposes after the end of this project is still under discussion.
  - The data must not be distributed to third parties. Any person interested in the data may obtain them via ISPRS WG III/4.
  - The data users should include the following acknowledgement in any publication resulting from the datasets: “The authors would like to acknowledge the provision of the Downtown Toronto data set by Optech Inc., First Base Solutions Inc., GeoICT Lab at York University, and ISPRS WG III/4.”
Disclaimer:
If any party believes their rights are infringed, please contact us immediately at wfx23@nudt.edu.cn. We will promptly remove any infringing content.
The common prompt used in VQA follows this format:
[Image] [Question] The choices are listed below:
(A) [Choice A]
(B) [Choice B]
(C) [Choice C]
(D) [Choice D]
Select the best answer for the multiple-choice question based on the image. Only respond with the letter corresponding to the correct answer (A, B, C, D).
The answer is:
For the VQA task Overall Land Use Classification, please use the following instruction instead:
Select the best answer(s) for the multiple-choice question based on the image. There may be more than one correct option. Only respond with the letter(s) corresponding to the correct answer(s) (A, B, C, D), with multiple choices separated by spaces.
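A minimal sketch of assembling these prompts programmatically is shown below. The question record fields (`question`, `choices`, `multi_select`) are assumptions for illustration only; the released annotation files define the actual schema, and the image itself is passed to the model separately.

```python
# Sketch of the VQA prompt templates above. The [Image] placeholder is omitted
# because the image is supplied to the MLLM separately from the text prompt.

SINGLE_ANSWER_INSTRUCTION = (
    "Select the best answer for the multiple-choice question based on the image. "
    "Only respond with the letter corresponding to the correct answer (A, B, C, D).\n"
    "The answer is:"
)

MULTI_ANSWER_INSTRUCTION = (
    "Select the best answer(s) for the multiple-choice question based on the image. "
    "There may be more than one correct option. Only respond with the letter(s) "
    "corresponding to the correct answer(s) (A, B, C, D), with multiple choices "
    "separated by spaces.\n"
    "The answer is:"
)


def build_vqa_prompt(question: str, choices: list[str], multi_select: bool = False) -> str:
    """Format a question and its choices into the common VQA prompt text."""
    lettered = "\n".join(f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    instruction = MULTI_ANSWER_INSTRUCTION if multi_select else SINGLE_ANSWER_INSTRUCTION
    return f"{question} The choices are listed below:\n{lettered}\n{instruction}"


if __name__ == "__main__":
    # multi_select=True corresponds to Overall Land Use Classification
    print(build_vqa_prompt(
        "What is the dominant land-use type in this image?",
        ["Residential", "Agricultural", "Industrial", "Water body"],
        multi_select=True,
    ))
```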
For Visual Grounding and Detailed Image Captioning, please refer to our source files in the evaluation folder.
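For rough orientation only, visual grounding predictions are typically compared against ground-truth boxes via Intersection-over-Union; the sketch below shows that generic computation. The authoritative grounding and captioning metrics are the ones implemented in the evaluation folder.

```python
from typing import Sequence


def box_iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection-over-Union of two boxes given as [x1, y1, x2, y2].

    Generic illustration only; the official grounding metric is defined by
    the scripts in the evaluation folder.
    """
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Example: a prediction counted as a hit at the common IoU >= 0.5 threshold
print(box_iou([100, 100, 300, 300], [150, 120, 320, 310]) >= 0.5)
```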
As soon as our dataset is released, XLRS-Bench will be integrated with lmms-eval, allowing you to evaluate models easily.
To add your model to our leaderboard, please send your model responses to wfx23@nudt.edu.cn. Refer to the evaluation/samples folder for the required format.
Models are ranked based on their average performance. Proprietary models are highlighted in gray. Task domains are indicated by abbreviations (e.g., “OC” for Overall Counting, “RC” for Regional Counting, etc.).
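For reference, the sketch below shows one way the per-task accuracies and the average used for ranking could be computed from a response file: extract the predicted letter(s) from each model response, compare them with the ground truth, and average per task. The record fields (`task`, `answer`, `response`) are assumptions for illustration; follow the evaluation/samples folder for the exact submission schema.

```python
import re
from collections import defaultdict


def extract_letters(response: str) -> set[str]:
    """Pull the option letters (A-D) out of a free-form model response."""
    return set(re.findall(r"\b([A-D])\b", response.upper()))


def score_submission(records: list[dict]) -> dict[str, float]:
    """Per-task accuracy plus the average used for ranking.

    Each record is assumed to carry "task", "answer" (ground-truth letters,
    e.g. "A" or "A C"), and "response" (the raw model output); see
    evaluation/samples for the required format.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        gold = set(rec["answer"].split())
        pred = extract_letters(rec["response"])
        correct[rec["task"]] += int(pred == gold)
        total[rec["task"]] += 1
    scores = {task: correct[task] / total[task] for task in total}
    scores["Average"] = sum(scores.values()) / len(scores)
    return scores


if __name__ == "__main__":
    demo = [
        {"task": "OC", "answer": "B", "response": "The answer is: B"},
        {"task": "RC", "answer": "A C", "response": "A C"},
    ]
    print(score_submission(demo))  # {'OC': 1.0, 'RC': 1.0, 'Average': 1.0}
```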
- L-2 performance on VQA tasks.
- L-3 reasoning dimension on VQA tasks.
- Visual grounding performance.
- Detailed image captioning performance.
If you find our work helpful, please consider citing:
@article{wang2025xlrsbench,
title={XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?},
author={Wang, Fengxiang and Wang, Hongzhen and Chen, Mingshuo and Wang, Di and Wang, Yulin and Guo, Zonghao and Ma, Qiang and Lan, Long and Yang, Wenjing and Zhang, Jing and others},
journal={arXiv preprint arXiv:2503.23771},
year={2025}
}
- [MME-RealWorld] MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
For any other questions please contact:
- Fengxiang Wang at wfx23@nudt.edu.cn
- Mingshuo Chen at chen.mingshuo@bupt.edu.cn