Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

Fengxiang Wang1, Hongzhen Wang2, Zonghao Guo2, Di Wang3,5, Yulin Wang2, Mingshuo Chen4,
Qiang Ma2, Long Lan1, Wenjing Yang1*, Jing Zhang3,6, Zhiyuan Liu2, Maosong Sun2
1College of Computer Science and Technology, National University of Defense Technology, 2Tsinghua University,
3School of Computer Science, Wuhan University, 4Beijing University of Posts and Telecommunications,
5Zhongguancun Academy, 6School of Artificial Intelligence, Wuhan University
CVPR 2025 Highlight

🔥News

  • 2025.04.04: Selected as Highlight by CVPR 2025!
  • 2025.04.01: The dataset is currently under final review and will be released in one month.
  • 2025.02.27: XLRS-Bench has been accepted by CVPR 2025!

🔍Dataset Overview

Remote sensing (RS) images have become essential for monitoring and understanding human environments, driving advances in applications such as precision agriculture, urban planning, and disaster assessment. Although recent studies have proposed benchmarks and metrics to assess the performance of multimodal large language models (MLLMs) in RS, these efforts remain limited in image size, annotation method, and evaluation dimensions. We present XLRS-Bench, a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios, featuring the largest average image size to date (8,500 × 8,500 pixels). The dataset comprises 45,942 annotations across 16 tasks, all curated and verified by a team of 45 experts. Compared with existing MLLM benchmarks, the main advantages of XLRS-Bench are as follows:

  1. Ultra-high Resolution. XLRS-Bench features the largest image sizes available, 10–20× larger than those of existing datasets, with 840 of its images at a resolution of 10,000 × 10,000 pixels.
  2. High-quality Annotation. All annotations are produced with human involvement and manually verified through multiple iterations, yielding a high-quality benchmark for evaluating MLLMs in real ultra-high-resolution RS scenarios.
  3. Comprehensive Evaluation Dimensions. XLRS-Bench covers 10 perception indicators and 6 reasoning dimensions, spanning 16 sub-tasks with a total of 45,942 questions. In particular, it includes complex reasoning tasks that probe MLLMs' potential for planning and change detection in long spatial-temporal RS scenarios.

📸Dataset Examples

[Figure: dataset examples]

📜Dataset License

Annotations of this dataset are released under a Creative Commons Attribution-NonCommercial 4.0 International License. For images from:

  • DOTA
    RGB images from Google Earth and CycloMedia (for academic use only; commercial use is prohibited, and Google Earth terms of use apply).

  • ITCVD
    Licensed under CC-BY-NC-SA-4.0.

  • MiniFrance, HRSCD
    Released under IGN’s "licence ouverte".

  • Toronto, Potsdam:
    The Toronto test data images are derived from the Downtown Toronto dataset provided by Optech Inc., First Base Solutions Inc., GeoICT Lab at York University, and ISPRS WG III/4, and are subject to the following conditions:

    1. The data must not be used for other than research purposes. Any other use is prohibited.
    2. The data must not be used outside the context of this test project, in particular while the project is still on-going (i.e. until September 2012). Whether the data will be available for other research purposes after the end of this project is still under discussion.
    3. The data must not be distributed to third parties. Any person interested in the data may obtain them via ISPRS WG III/4.
    4. The data users should include the following acknowledgement in any publication resulting from the datasets: “The authors would like to acknowledge the provision of the Downtown Toronto data set by Optech Inc., First Base Solutions Inc., GeoICT Lab at York University, and ISPRS WG III/4.”

Disclaimer:
If any party believes their rights are infringed, please contact us immediately at wfx23@nudt.edu.cn. We will promptly remove any infringing content.

🚀Evaluation Pipeline

Prompt

The common prompt used in VQA follows this format:

[Image] [Question] The choices are listed below:
(A) [Choice A]
(B) [Choice B]
(C) [Choice C]
(D) [Choice D]
Select the best answer for the multiple-choice question based on the image. Only respond with the letter corresponding to the correct answer (A, B, C, D).
The answer is:

For the VQA task Overall Land Use Classification, use the following instruction instead:

Select the best answer(s) for the multiple-choice question based on the image. There may be more than one correct option. Only respond with the letter(s) corresponding to the correct answer(s) (A, B, C, D), with multiple choices separated by spaces.

For Visual Grounding and Detailed Image Captioning, please refer to our source files in the evaluation folder.
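For reference, the sketch below shows how these two prompt variants could be assembled programmatically. It is a minimal illustration rather than the official evaluation code; the helper names and the example question are hypothetical.

```python
# Hypothetical helpers that assemble the two multiple-choice prompt variants
# described above; they are not part of the official XLRS-Bench code.
# The [Image] placeholder is supplied separately by the MLLM interface.

def build_vqa_prompt(question: str, choices: list[str]) -> str:
    """Single-answer VQA prompt."""
    letters = "ABCD"
    lines = [f"{question} The choices are listed below:"]
    lines += [f"({letters[i]}) {choice}" for i, choice in enumerate(choices)]
    lines.append(
        "Select the best answer for the multiple-choice question based on the image. "
        "Only respond with the letter corresponding to the correct answer (A, B, C, D)."
    )
    lines.append("The answer is:")
    return "\n".join(lines)


def build_land_use_prompt(question: str, choices: list[str]) -> str:
    """Multi-answer variant used for Overall Land Use Classification."""
    letters = "ABCD"
    lines = [f"{question} The choices are listed below:"]
    lines += [f"({letters[i]}) {choice}" for i, choice in enumerate(choices)]
    lines.append(
        "Select the best answer(s) for the multiple-choice question based on the image. "
        "There may be more than one correct option. Only respond with the letter(s) "
        "corresponding to the correct answer(s) (A, B, C, D), with multiple choices "
        "separated by spaces."
    )
    # The trailing cue is assumed to match the single-answer format.
    lines.append("The answer is:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative example question and choices (not drawn from the benchmark).
    print(build_vqa_prompt(
        "What is the dominant land cover in the highlighted region?",
        ["Farmland", "Residential area", "Water body", "Forest"],
    ))
```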

Evaluation

As soon as our dataset is released, XLRS-Bench will be integrated with lmms-eval, allowing you to evaluate models easily.
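As a rough preview of what that might look like, the sketch below launches lmms-eval with its standard CLI flags. The task name xlrs_bench, the model, and the checkpoint are placeholders, since the integration has not been released yet.

```python
# Hypothetical invocation of lmms-eval once XLRS-Bench is integrated.
# The task name "xlrs_bench" and the model/checkpoint are placeholders;
# the flags follow lmms-eval's standard command-line interface.
import subprocess

subprocess.run(
    [
        "python", "-m", "lmms_eval",
        "--model", "llava",
        "--model_args", "pretrained=liuhaotian/llava-v1.5-7b",
        "--tasks", "xlrs_bench",          # placeholder task name
        "--batch_size", "1",
        "--log_samples",
        "--output_path", "./logs/",
    ],
    check=True,
)
```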

Leaderboard

To add your model to our leaderboard, please send your model responses to wfx23@nudt.edu.cn. Refer to the evaluation/samples folder for the required format.
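Because the prompts ask models to reply with answer letters only, one simple sanity check before submitting is to extract the predicted letter(s) from each response and compare them against your own labels. The sketch below is purely illustrative and is not the official XLRS-Bench scorer.

```python
# Illustrative scoring helper for letter-only VQA responses; this is not the
# official XLRS-Bench scorer, just a sanity check for your own outputs.
import re


def extract_letters(response: str) -> set[str]:
    """Pull the standalone answer letter(s) A-D out of a model response."""
    return set(re.findall(r"\b([A-D])\b", response.upper()))


def accuracy(predictions: list[str], ground_truth: list[set[str]]) -> float:
    """Exact-match accuracy: a prediction counts only if the letter sets match."""
    correct = sum(
        extract_letters(pred) == truth
        for pred, truth in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)


if __name__ == "__main__":
    preds = ["The answer is: B", "A C"]   # toy example responses
    gts = [{"B"}, {"A", "C"}]             # toy ground-truth letter sets
    print(f"Accuracy: {accuracy(preds, gts):.2f}")  # Accuracy: 1.00
```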

📊Experiment Results

Models are ranked by their average performance. Proprietary models are highlighted in gray. Task domains are indicated by abbreviations (e.g., “OC” for Overall Counting, “RC” for Regional Counting).

  • L-2 performance on VQA tasks.

  • L-3 perception dimension on VQA tasks.

  • L-3 reasoning dimension on VQA tasks.

  • Visual grounding performance.

  • Detailed image captioning performance.

📖Citation

If you find our work helpful, please consider citing:

@article{wang2025xlrsbench,
    title={XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?},
    author={Wang, Fengxiang and Wang, Hongzhen and Chen, Mingshuo and Wang, Di and Wang, Yulin and Guo, Zonghao and Ma, Qiang and Lan, Long and Yang, Wenjing and Zhang, Jing and others},
    journal={arXiv preprint arXiv:2503.23771},
    year={2025}
}

🙏Acknowledgement

📬Contact

For any other questions, please contact wfx23@nudt.edu.cn.
