Benchmarks and Evaluations, RL Alignment, Applications, and Challenges of Large Vision Language Models
A cutting-edge collection and survey of vision-language model papers, models, and GitHub repositories.
Below we compile papers, models, and GitHub repositories covering:
- State-of-the-Art VLMs: a collection of VLMs from newest to oldest (we keep adding new models and benchmarks).
- VLM Benchmarks: evaluation benchmarks with links to the corresponding works.
- Post-training/Alignment: the latest work on VLM alignment, including RL and SFT.
- Applications: applications of VLMs in embodied AI, robotics, and beyond.
- Contributed surveys, perspectives, and datasets on the above topics.
Contributions and discussion are welcome!
🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.
- 3. Post-training / Alignment
  - 3.1. RL Alignment for VLM
  - 3.2. Regular Fine-tuning (SFT)
  - 3.3. VLM Alignment GitHub Repositories
  - 3.4. Prompt Engineering
- 4. Applications
  - 4.1. Embodied VLM Agents
  - 4.2. Generative Visual Media Applications
  - 4.3. Robotics and Embodied AI
    - 4.3.1. Manipulation
    - 4.3.2. Navigation
    - 4.3.3. Human-robot Interaction
    - 4.3.4. Autonomous Driving
  - 4.4. Human-Centered AI
    - 4.4.1. Web Agent
    - 4.4.2. Accessibility
    - 4.4.3. Healthcare
    - 4.4.4. Social Goodness
- 5. Challenges
  - 5.1. Hallucination
  - 5.2. Safety
  - 5.3. Fairness
  - 5.4. Alignment
    - 5.4.1. Multi-modality Alignment
  - 5.5. Efficient Training and Fine-Tuning
  - 5.6. Scarcity of High-quality Datasets
@InProceedings{Li_2025_CVPR,
author = {Li, Zongxia and Wu, Xiyang and Du, Hongyang and Liu, Fuxiao and Nghiem, Huy and Shi, Guangyao},
title = {A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2025},
pages = {1587-1606}
}
| Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
|---|---|---|---|---|---|---|
| Emu3.5 | 10/30/2025 | Decoder-only | Unified Modality Dataset | - | SigLIP | Qwen3 |
| DeepSeek-OCR | 10/20/2025 | Encoder-Decoder | 70% OCR, 20% general vision, 10% text-only | 3B | DeepEncoder | DeepSeek-3B |
| Qwen3-VL | 10/11/2025 | Decoder-only | - | 8B/4B | ViT | Qwen3 |
| Qwen3-VL-MoE | 09/25/2025 | Decoder-only | - | 235B-A22B | ViT | Qwen3 |
| Qwen3-Omni (Visual/Audio/Text) | 09/21/2025 | - | Video/Audio/Image | 30B | ViT | Qwen3-Omni-MoE-Thinker |
| LLaVA-Onevision-1.5 | 09/15/2025 | - | Mid-Training-85M & SFT | 8B | Qwen2VLImageProcessor | Qwen3 |
| InternVL3.5 | 08/25/2025 | Decoder-only | Multimodal & text-only | 30B/38B/241B | InternViT-300M/6B | Qwen3 / GPT-OSS |
| SkyWork-Unipic-1.5B | 07/29/2025 | - | Image/Video | - | - | - |
| Grok 4 | 07/09/2025 | - | Image/Video | 1-2 Trillion | - | - |
| Kwai Keye-VL (Kuaishou) | 07/02/2025 | Decoder-only | Image/Video | 8B | ViT | Qwen3-8B |
| OmniGen2 | 06/23/2025 | Decoder-only & VAE | LLaVA-OneVision / SAM-LLaVA | - | ViT | Qwen2.5-VL |
| Gemini-2.5-Pro | 06/17/2025 | - | - | - | - | - |
| GPT-o3/o4-mini | 06/10/2025 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| Mimo-VL (Xiaomi) | 06/04/2025 | Decoder-only | 24 Trillion MLLM tokens | 7B | Qwen2.5-ViT | Mimo-7B-base |
| BLIP3-o | 05/14/2025 | Decoder-only | GPT-4o-generated image generation data (BLIP3-o 60K) | 4B/8B | ViT | Qwen2.5-VL |
| InternVL-3 | 04/14/2025 | Decoder-only | 200 Billion Tokens | 1/2/8/9/14/38/78B | ViT-300M/6B | InternLM2.5/Qwen2.5 |
| LLaMA4-Scout/Maverick | 04/04/2025 | Decoder-only | 40/20 Trillion Tokens | 17B | MetaCLIP | LLaMA4 |
| Qwen2.5-Omni | 03/26/2025 | Decoder-only | Video/Audio/Image/Text | 7B | Qwen2-Audio/Qwen2.5-VL ViT | End-to-End Mini-Omni |
| Qwen2.5-VL | 01/28/2025 | Decoder-only | Image caption, VQA, grounding agent, long video | 3B/7B/72B | Redesigned ViT | Qwen2.5 |
| Ola | 2025 | Decoder-only | Image/Video/Audio/Text | 7B | OryxViT | Qwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2) |
| Ocean-OCR | 2025 | Decoder-only | Pure Text, Caption, Interleaved, OCR | 3B | NaViT | Pretrained from scratch |
| SmolVLM | 2025 | Decoder-only | SmolVLM-Instruct | 250M & 500M | SigLIP | SmolLM |
| DeepSeek-Janus-Pro | 2025 | Decoder-only | Undisclosed | 7B | SigLIP | DeepSeek-Janus-Pro |
| Inst-IT | 2024 | Decoder-only | Inst-IT Dataset, LLaVA-NeXT-Data | 7B | CLIP/Vicuna, SigLIP/Qwen2 | LLaVA-NeXT |
| DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAMB | DeepSeekMoE |
| xGen-MM (BLIP-3) | 2024 | Decoder-only | MINT-1T, OBELICS, Caption | 4B | ViT + Perceiver Resampler | Phi-3-mini |
| TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
| Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
| LLaMA 3.2-vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
| Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
| Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
| NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
| Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
| Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | EVA-CLIP ViT-g | QLLaMA |
| InstructBLIP | 2023 | Encoder-decoder | COCO, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
| CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
| PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
| LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
| Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
| Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
| BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
| CLIP | 2021 | Dual-encoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |
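
Most of the recent entries above follow the same decoder-only recipe: a pretrained vision encoder (ViT/SigLIP/CLIP-style) turns the image into patch embeddings, a lightweight projector maps them into the LLM's embedding space, and the pretrained language backbone decodes over the concatenated image and text tokens. The sketch below is purely illustrative (toy dimensions, stand-in modules, not any specific model from the table):

```python
# Illustrative decoder-only VLM wiring: vision encoder -> projector -> LLM decoder.
# All sizes and modules here are made-up placeholders.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=1024, vocab_size=32000, n_layers=2):
        super().__init__()
        # Stand-in for a ViT/SigLIP-style encoder: one embedding per image patch.
        self.vision_encoder = nn.Sequential(nn.Linear(3 * 16 * 16, vis_dim), nn.GELU())
        # Projector (often a 1-2 layer MLP in LLaVA-style models).
        self.projector = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                       nn.Linear(llm_dim, llm_dim))
        # Stand-in for the pretrained decoder-only LLM backbone.
        self.embed = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patches, input_ids):
        # patches: (B, num_patches, 3*16*16) flattened image patches
        img_tokens = self.projector(self.vision_encoder(patches))   # (B, P, llm_dim)
        txt_tokens = self.embed(input_ids)                          # (B, T, llm_dim)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)            # prepend image tokens
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.decoder(seq, mask=mask)                       # causal decoding
        return self.lm_head(hidden[:, img_tokens.size(1):])         # logits for text positions

model = ToyVLM()
logits = model(torch.randn(1, 196, 3 * 16 * 16), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # (1, 8, 32000)
```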
| Dataset | Task | Size |
|---|---|---|
| FineVision | Mixed Domain | 24.3M / 4.48TB |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MathVision | Visual Math | MC / Answer Match | Human | 3.04 | Repo |
| MathVista | Visual Math | MC / Answer Match | Human | 6 | Repo |
| MathVerse | Visual Math | MC | Human | 4.6 | Repo |
| VisNumBench | Visual Number Reasoning | MC | Python-program generated / Web collection / Real-life photos | 1.91 | Repo |
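
Most visual-math benchmarks above are scored with multiple-choice or answer-match accuracy. Each benchmark ships its own official grader with benchmark-specific rules (LaTeX cleanup, fraction handling, choice-letter extraction); the snippet below is only a hedged sketch of the general answer-match shape:

```python
# Illustrative only -- not the official grader of any benchmark above.
import re

def normalize(ans: str) -> str:
    ans = ans.strip().lower()
    ans = re.sub(r"[\s\$\(\)]", "", ans)  # drop whitespace, $ signs, parentheses
    return ans.rstrip(".")

def answer_match(pred: str, gold: str, rel_tol: float = 1e-3) -> bool:
    p, g = normalize(pred), normalize(gold)
    if p == g:
        return True
    try:  # numeric answers: allow a small relative tolerance
        pf, gf = float(p), float(g)
        return abs(pf - gf) <= rel_tol * max(1.0, abs(gf))
    except ValueError:
        return False

preds = {"q1": "0.5", "q2": "(B)"}
golds = {"q1": "0.50", "q2": "B"}
acc = sum(answer_match(preds[q], golds[q]) for q in golds) / len(golds)
print(f"accuracy: {acc:.2f}")  # 1.00 -- real graders also handle fractions, units, LaTeX, etc.
```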
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| VideoHallu | Video Understanding | LLM Eval | Human | 3.2 | Repo |
| Video SimpleQA | Video Understanding | LLM Eval | Human | 2.03 | Repo |
| MovieChat | Video Understanding | LLM Eval | Human | 1 | Repo |
| Perception‑Test | Video Understanding | MC | Crowd | 11.6 | Repo |
| VideoMME | Video Understanding | MC | Experts | 2.7 | Site |
| EgoSchema | Video Understanding | MC | Synth / Human | 5 | Site |
| Inst‑IT‑Bench | Fine‑grained Image & Video | MC & LLM | Human / Synth | 2 | Repo |
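
Several of the video benchmarks above use an "LLM Eval" protocol, where an external LLM judges free-form answers against a reference. A minimal sketch is shown below; the judge model and prompt are placeholders, not the official grader of any listed benchmark, and an OpenAI-compatible endpoint is assumed to be configured:

```python
# Minimal LLM-as-judge sketch for free-form video QA answers (placeholder prompt/model).
from openai import OpenAI  # assumes OPENAI_API_KEY (or compatible endpoint) is set

client = OpenAI()
JUDGE_MODEL = "gpt-4o-mini"  # placeholder judge model

def llm_judge(question: str, reference: str, prediction: str) -> bool:
    prompt = (
        "You are grading a video-QA answer.\n"
        f"Question: {question}\nReference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("correct")
```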
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| VisionArena | Multimodal Conversation | Pairwise Pref | Human | 23 | Repo |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MMLU | General MM | MC | Human | 15.9 | Repo |
| MMStar | General MM | MC | Human | 1.5 | Site |
| NaturalBench | General MM | Yes/No, MC | Human | 10 | HF |
| PhysBench | Physical World Understanding | MC | Grad STEM | 0.10 | Repo |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| EMMA | Visual Reasoning | MC | Human + Synth | 2.8 | Repo |
| MMT-Bench | Visual Reasoning & QA | MC | AI Experts | 30.1 | Repo |
| MM‑Vet | OCR / Visual Reasoning | LLM Eval | Human | 0.2 | Repo |
| MMBench-EN/CN | Multilingual MM Understanding | MC | Human | 3.2 | Repo |
| GQA | Visual Reasoning & QA | Answer Match | Seed + Synth | 22 | Site |
| VCR | Visual Reasoning & QA | MC | MTurks | 290 | Site |
| VQAv2 | Visual Reasoning & QA | Yes/No, Ans Match | MTurks | 1100 | Repo |
| MMMU | Visual Reasoning & QA | Ans Match, MC | College | 11.5 | Site |
| MMMU-Pro | Visual Reasoning & QA | Ans Match, MC | College | 5.19 | Site |
| R1‑Onevision | Visual Reasoning & QA | MC | Human | 155 | Repo |
| VLM²‑Bench | Visual Reasoning & QA | Ans Match, MC | Human | 3 | Site |
| VisualWebInstruct | Visual Reasoning & QA | LLM Eval | Web | 0.9 | Site |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| TextVQA | Visual Text Understanding | Ans Match | Expert | 28.6 | Repo |
| DocVQA | Document VQA | Ans Match | Crowd | 50 | Site |
| ChartQA | Chart Graphic Understanding | Ans Match | Crowd / Synth | 32.7 | Repo |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MSCOCO‑30K | Text‑to‑Image | BLEU, ROUGE, Sim | MTurks | 30 | Site |
| GenAI‑Bench | Text‑to‑Image | Human Rating | Human | 80 | HF |
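
The MSCOCO-30K protocol above pairs n-gram metrics with an image-text similarity ("Sim") score, which is typically computed with a CLIP-style model. Below is a CLIPScore-like sketch using the Hugging Face transformers CLIP implementation; the model choice and the omission of the usual rescaling are simplifications, not either benchmark's official metric code:

```python
# CLIPScore-like image-text similarity sketch (simplified, illustrative).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())  # cosine similarity in [-1, 1]

# dummy example: a blank image scored against a generation prompt
print(clip_similarity(Image.new("RGB", (224, 224)), "a photo of a red bicycle"))
```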
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| HallusionBench | Hallucination | Yes/No | Human | 1.13 | Repo |
| POPE | Hallucination | Yes/No | Human | 9 | Repo |
| CHAIR | Hallucination | Yes/No | Human | 124 | Repo |
| MHalDetect | Hallucination | Ans Match | Human | 4 | Repo |
| Hallu‑Pi | Hallucination | Ans Match | Human | 1.26 | Repo |
| HallE‑Control | Hallucination | Yes/No | Human | 108 | Repo |
| AutoHallusion | Hallucination | Ans Match | Synth | 3.129 | Repo |
| BEAF | Hallucination | Yes/No | Human | 26 | Site |
| GAIVE | Hallucination | Ans Match | Synth | 320 | Repo |
| HalEval | Hallucination | Yes/No | Crowd / Synth | 2 | Repo |
| AMBER | Hallucination | Ans Match | Human | 15.22 | Repo |
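
For reference, the CHAIR metric listed above (from "Object Hallucination in Image Captioning") measures how often generated captions mention objects not present in the image. A simplified sketch, assuming object mentions have already been extracted and mapped to the ground-truth object vocabulary (the official code does this mapping via MSCOCO synonyms):

```python
# Simplified CHAIR-style computation (instance- and sentence-level hallucination rates).
def chair(mentioned_objects, ground_truth_objects):
    """Both args: dict image_id -> set of object names."""
    total_mentions = hallucinated_mentions = hallucinated_captions = 0
    for img_id, mentioned in mentioned_objects.items():
        gt = ground_truth_objects[img_id]
        hallucinated = mentioned - gt            # mentioned but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        hallucinated_captions += bool(hallucinated)
    chair_i = hallucinated_mentions / max(total_mentions, 1)          # instance-level
    chair_s = hallucinated_captions / max(len(mentioned_objects), 1)  # sentence-level
    return chair_i, chair_s

mentions = {"img1": {"dog", "frisbee"}, "img2": {"cat", "sofa", "laptop"}}
gt       = {"img1": {"dog", "frisbee"}, "img2": {"cat", "sofa"}}
print(chair(mentions, gt))  # (0.2, 0.5)
```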
| Benchmark | Domain | Type | Project |
|---|---|---|---|
| Drive-Bench | Embodied AI | Autonomous Driving | Website |
| Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
| Gibson | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| iGibson1.0, iGibson2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
| Isaac Gym | Robotics (Navigation) | Simulator | Website, Github Repo |
| Isaac Lab | Robotics (Navigation) | Simulator | Website, Github Repo |
| AI2THOR | Robotics (Navigation) | Simulator | Website, Github Repo |
| ProcTHOR | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| VirtualHome | Robotics (Navigation) | Simulator | Website, Github Repo |
| ThreeDWorld | Robotics (Navigation) | Simulator | Website, Github Repo |
| VIMA-Bench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| VLMbench | Robotics (Manipulation) | Simulator | Github Repo |
| CALVIN | Robotics (Manipulation) | Simulator | Website, Github Repo |
| GemBench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| WebArena | Web Agent | Simulator | Website, Github Repo |
| UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
| GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
| LWM | Embodied AI | Generative Model, World Model | Website, Github Repo |
| Genesis | Embodied AI | Generative Model, World Model | Github Repo |
| EMMOE | Embodied AI | Generative Model, World Model | Paper |
| RoboGen | Embodied AI | Generative Model, World Model | Website |
| UnrealZoo | Embodied AI (Tracking, Navigation, Multi Agent) | Simulator | Website |
| Title | Year | Paper | RL | Code |
|---|---|---|---|---|
| Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning | 10/12/2025 | Paper | GRPO | - |
| Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play | 09/29/2025 | Paper | GRPO | - |
| Vision-SR1: Self-rewarding vision-language model via reasoning decomposition | 08/26/2025 | Paper | GRPO | - |
| Group Sequence Policy Optimization | 06/24/2025 | Paper | GSPO | - |
| Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | 05/20/2025 | Paper | GRPO | - |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | 04/10/2025 | Paper | GRPO | Code |
| OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement | 03/21/2025 | Paper | GRPO | Code |
| Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning | 03/10/2025 | Paper | GRPO | Code |
| OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | 2025 | Paper | DPO | Code |
| Multimodal Open R1/R1-Multimodal-Journey | 2025 | - | GRPO | Code |
| R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization | 2025 | Paper | GRPO | Code |
| Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning | 2025 | - | PPO/REINFORCE++/GRPO | Code |
| MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | 2025 | Paper | REINFORCE Leave-One-Out (RLOO) | Code |
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | 2025 | Paper | DPO | Code |
| LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | 2025 | Paper | PPO | Code |
| Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 2025 | Paper | GRPO | Code |
| Unified Reward Model for Multimodal Understanding and Generation | 2025 | Paper | DPO | Code |
| Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | 2025 | Paper | DPO | Code |
| All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning | 2025 | Paper | Online RL | - |
| Video-R1: Reinforcing Video Reasoning in MLLMs | 2025 | Paper | GRPO | Code |
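
Most entries above rely on GRPO-style RL, whose key ingredient is a group-relative advantage: sample several responses per prompt, score each with a (typically rule-based, verifiable) reward, and normalize rewards within the group so no learned critic is needed. Below is a minimal sketch of that advantage computation only; the PPO-style clipped policy update it feeds into is omitted:

```python
# Group-relative advantage as used in GRPO-style training (illustrative sketch).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled responses."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each response's advantage is its reward standardized within its group;
    # it is shared across all tokens of that response.
    return (rewards - mean) / (std + eps)

# e.g. 2 prompts, 4 sampled responses each, 0/1 correctness rewards
rewards = torch.tensor([[1., 0., 0., 1.],
                        [0., 0., 0., 1.]])
print(grpo_advantages(rewards))
```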
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | 04/21/2025 | Paper | Website | Code |
| OMNICAPTIONER: One Captioner to Rule Them All | 04/09/2025 | Paper | Website | Code |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | Paper | Website | Code |
| LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression | 2024 | Paper | Website | Code |
| ViTamin: Designing Scalable Vision Models in the Vision-Language Era | 2024 | Paper | Website | Code |
| Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | 2024 | Paper | - | - |
| Should VLMs be Pre-trained with Image Data? | 2025 | Paper | - | - |
| VisionArena: 230K Real World User-VLM Conversations with Preference Labels | 2024 | Paper | - | Code |
| Project | Repository Link |
|---|---|
| Verl | 🔗 GitHub |
| EasyR1 | 🔗 GitHub |
| OpenR1 | 🔗 GitHub |
| LLaMAFactory | 🔗 GitHub |
| MM-Eureka-Zero | 🔗 GitHub |
| MM-RLHF | 🔗 GitHub |
| LMM-R1 | 🔗 GitHub |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer | 04/30/2025 | Paper | Website | Code |
| Title | Year | Paper Link |
|---|---|---|
| Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI | 2024 | Paper |
| ScreenAI: A Vision-Language Model for UI and Infographics Understanding | 2024 | Paper |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | 2023 | Paper |
| SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | 2024 | 📄 Paper |
| Training a Vision Language Model as Smartphone Assistant | 2024 | Paper |
| ScreenAgent: A Vision-Language Model-Driven Computer Control Agent | 2024 | Paper |
| Embodied Vision-Language Programmer from Environmental Feedback | 2024 | Paper |
| VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method | 2025 | 📄 Paper |
| MP-GUI: Modality Perception with MLLMs for GUI Understanding | 2025 | 📄 Paper |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| Spurious Correlation in Multimodal LLMs | 2025 | 📄 Paper | - | - |
| WeGen: A Unified Model for Interactive Multimodal Generation as We Chat | 2025 | 📄 Paper | - | 💾 Code |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation | 2024 | 📄 Paper | 🌍 Website | - |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌍 Website | - |
| Vision-language model-driven scene understanding and robotic object manipulation | 2024 | 📄 Paper | - | - |
| Guiding Long-Horizon Task and Motion Planning with Vision Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers | 2023 | 📄 Paper | 🌍 Website | - |
| VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model | 2024 | 📄 Paper | - | - |
| Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? | 2023 | 📄 Paper | 🌍 Website | - |
| DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| MotionGPT: Human Motion as a Foreign Language | 2023 | 📄 Paper | - | 💾 Code |
| Learning Reward for Robot Skills Using Large Language Models via Self-Alignment | 2024 | 📄 Paper | - | - |
| Language to Rewards for Robotic Skill Synthesis | 2023 | 📄 Paper | 🌍 Website | - |
| Eureka: Human-Level Reward Design via Coding Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| Integrated Task and Motion Planning | 2020 | 📄 Paper | - | - |
| Jailbreaking LLM-Controlled Robots | 2024 | 📄 Paper | 🌍 Website | - |
| Robots Enact Malignant Stereotypes | 2022 | 📄 Paper | 🌍 Website | - |
| LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions | 2024 | 📄 Paper | - | - |
| Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics | 2024 | 📄 Paper | 🌍 Website | - |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | 2025 | 📄 Paper | 🌍 Website | 💾 Code & Dataset |
| Gemini Robotics: Bringing AI into the Physical World | 2025 | 📄 Technical Report | 🌍 Website | - |
| GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | 2024 | 📄 Paper | 🌍 Website | - |
| Magma: A Foundation Model for Multimodal AI Agents | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| DayDreamer: World Models for Physical Robot Learning | 2022 | 📄 Paper | 🌍 Website | 💾 Code |
| Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | 2025 | 📄 Paper | - | - |
| RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Unified Video Action Model | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VIMA: General Robot Manipulation with Multimodal Prompts | 2022 | 📄 Paper | 🌍 Website | - |
| Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model | 2023 | 📄 Paper | - | - |
| Creative Robot Tool Use with Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | 2024 | 📄 Paper | - | - |
| RT-1: Robotics Transformer for Real-World Control at Scale | 2022 | 📄 Paper | 🌍 Website | - |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023 | 📄 Paper | 🌍 Website | - |
| Open X-Embodiment: Robotic Learning Datasets and RT-X Models | 2023 | 📄 Paper | 🌍 Website | - |
| ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Masked World Models for Visual Control | 2022 | 📄 Paper | 🌍 Website | 💾 Code |
| Multi-View Masked World Models for Visual Robotic Manipulation | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings | 2022 | 📄 Paper | - | - |
| LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation | 2024 | 📄 Paper | - | - |
| LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action | 2022 | 📄 Paper | 🌍 Website | - |
| NaVILA: Legged Robot Vision-Language-Action Model for Navigation | 2024 | 📄 Paper | 🌍 Website | - |
| VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation | 2024 | 📄 Paper | - | - |
| Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning | 2023 | 📄 Paper | 🌍 Website | - |
| Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments | 2025 | 📄 Paper | - | - |
| Navigation World Models | 2024 | 📄 Paper | 🌍 Website | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| MUTEX: Learning Unified Policies from Multimodal Task Specifications | 2023 | 📄 Paper | 🌍 Website | - |
| LaMI: Large Language Models for Multi-Modal Human-Robot Interaction | 2024 | 📄 Paper | 🌍 Website | - |
| VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models | 2024 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | 01/07/2025 | 📄 Paper | 🌍 Website | - |
| DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| GPT-Driver: Learning to Drive with GPT | 2023 | 📄 Paper | - | - |
| LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | 2023 | 📄 Paper | 🌍 Website | - |
| Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | 2023 | 📄 Paper | - | - |
| Referring Multi-Object Tracking | 2023 | 📄 Paper | - | 💾 Code |
| VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision | 2023 | 📄 Paper | - | 💾 Code |
| MotionLM: Multi-Agent Motion Forecasting as Language Modeling | 2023 | 📄 Paper | - | - |
| DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| VLP: Vision Language Planning for Autonomous Driving | 2024 | 📄 Paper | - | - |
| DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | 2023 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis | 2024 | 📄 Paper | - | 💾 Code |
| LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration – A Robot Sous-Chef Application | 2024 | 📄 Paper | - | - |
| Pretrained Language Models as Visual Planners for Human Assistance | 2023 | 📄 Paper | - | - |
| Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research | 2024 | 📄 Paper | - | - |
| Image and Data Mining in Reticular Chemistry Using GPT-4V | 2023 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 2023 | 📄 Paper | - | - |
| CogAgent: A Visual Language Model for GUI Agents | 2023 | 📄 Paper | - | 💾 Code |
| WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | 2024 | 📄 Paper | - | 💾 Code |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | 2024 | 📄 Paper | - | 💾 Code |
| ScreenAgent: A Vision Language Model-driven Computer Control Agent | 2024 | 📄 Paper | - | 💾 Code |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| X-World: Accessibility, Vision, and Autonomy Meet | 2021 | 📄 Paper | - | - |
| Context-Aware Image Descriptions for Web Accessibility | 2024 | 📄 Paper | - | - |
| Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge | 2024 | 📄 Paper | - | 💾 Code |
| Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology | 2024 | 📄 Paper | - | - |
| M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization | 2023 | 📄 Paper | - | - |
| MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | 2022 | 📄 Paper | - | 💾 Code |
| Med-Flamingo: A Multimodal Medical Few-Shot Learner | 2023 | 📄 Paper | - | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy | 2024 | 📄 Paper | - | - |
| Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence | 2024 | 📄 Paper | - | - |
| Harnessing Large Vision and Language Models in Agriculture: A Review | 2024 | 📄 Paper | - | - |
| A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping | 2024 | 📄 Paper | - | - |
| Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models | 2024 | 📄 Paper | - | 💾 Code |
| DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images | 2024 | 📄 Paper | - | - |
| MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | 2024 | 📄 Paper | - | 💾 Code |
| Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps | 2024 | 📄 Paper | - | 💾 Code |
| He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation | 2021 | 📄 Paper | - | - |
| UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling | 2024 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Object Hallucination in Image Captioning | 2018 | 📄 Paper | - | - |
| Evaluating Object Hallucination in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code |
| Detecting and Preventing Hallucinations in Large Vision Language Models | 2023 | 📄 Paper | - | - |
| HallE-Control: Controlling Object Hallucination in Large Multimodal Models | 2023 | 📄 Paper | - | 💾 Code |
| Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs | 2024 | 📄 Paper | - | 💾 Code |
| BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code |
| AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | 2023 | 📄 Paper | - | 💾 Code |
| Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models | 2024 | 📄 Paper | - | 💾 Code |
| AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | 2023 | 📄 Paper | - | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments | 2023 | 📄 Paper | - | - |
| SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
| JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | 2024 | 📄 Paper | - | - |
| SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | 2024 | 📄 Paper | - | 💾 Code |
| Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
| Jailbreaking Attack against Multimodal Large Language Model | 2024 | 📄 Paper | - | - |
| Embodied Red Teaming for Auditing Robotic Foundation Models | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Safety Guardrails for LLM-Enabled Robots | 2025 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Hallucination of Multimodal Large Language Models: A Survey | 2024 | 📄 Paper | - | - |
| Bias and Fairness in Large Language Models: A Survey | 2023 | 📄 Paper | - | - |
| Fairness and Bias in Multimodal AI: A Survey | 2024 | 📄 Paper | - | - |
| Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models | 2023 | 📄 Paper | - | - |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | 2024 | 📄 Paper | - | - |
| FairCLIP: Harnessing Fairness in Vision-Language Learning | 2024 | 📄 Paper | - | - |
| FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | 2024 | 📄 Paper | - | - |
| Benchmarking Vision Language Models for Cultural Understanding | 2024 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | 2024 | 📄 Paper | - | - |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | 2024 | 📄 Paper | - | - |
| Assessing and Learning Alignment of Unimodal Vision and Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Extending Multi-modal Contrastive Representations | 2023 | 📄 Paper | - | 💾 Code |
| OneLLM: One Framework to Align All Modalities with Language | 2023 | 📄 Paper | - | 💾 Code |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VBench: Comprehensive Benchmark Suite for Video Generative Models | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| VideoPhy: Evaluating Physical Commonsense for Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| WorldSimBench: Towards Video Generation Models as World Simulators | 2024 | 📄 Paper | 🌍 Website | - |
| WorldModelBench: Judging Video Generation Models As World Models | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | 2025 | 📄 Paper | - | 💾 Code |
| Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency | 2025 | 📄 Paper | - | 💾 Code |
| Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding | 2025 | 📄 Paper | - | - |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Do generative video models understand physical principles? | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| How Far is Video Generation from World Model: A Physical Law Perspective | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | 2025 | 📄 Paper | - | - |
| VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VILA: On Pre-training for Visual Language Models | 2023 | 📄 Paper | - | - |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | 2021 | 📄 Paper | - | - |
| LoRA: Low-Rank Adaptation of Large Language Models | 2021 | 📄 Paper | - | 💾 Code |
| QLoRA: Efficient Finetuning of Quantized LLMs | 2023 | 📄 Paper | - | - |
| Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | 2022 | 📄 Paper | - | 💾 Code |
| RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | 2023 | 📄 Paper | - | - |
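
LoRA and QLoRA, listed above, make fine-tuning affordable by freezing the pretrained weights and learning a low-rank update ΔW = BA scaled by α/r. Below is a from-scratch toy illustration of the mechanism; in practice one would typically use a library such as Hugging Face PEFT rather than this hand-rolled module:

```python
# Toy LoRA adapter: frozen base linear layer plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # freeze pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # base(x) + (B A) x * alpha/r
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank factors A and B are trainable
```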
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| A Survey on Bridging VLMs and Synthetic Data | 2025 | 📄 Paper | - | 💾 Code |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | 📄 Paper | Website | 💾 Code |
| SLIP: Self-supervision meets Language-Image Pre-training | 2021 | 📄 Paper | - | 💾 Code |
| Synthetic Vision: Training Vision-Language Models to Understand Physics | 2024 | 📄 Paper | - | - |
| Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings | 2024 | 📄 Paper | - | - |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | - | - |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | - |