Here are 33 public repositories matching this topic.
Open-source, industrial-grade ASR models supporting Mandarin, Chinese dialects, and English. They achieve a new SOTA on public Mandarin ASR benchmarks while also offering outstanding singing-lyrics recognition.
Updated Sep 22, 2025 · Python
Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"
Updated May 8, 2025 · Python
Research code for the Multimodal-Cognition team at Ant Group.
Updated Oct 14, 2025 · Python
[ICCV25 Oral] Token Activation Map to Visually Explain Multimodal LLMs
Updated Aug 8, 2025 · Python
Official repository for InfiGUI-G1. We introduce Adaptive Exploration Policy Optimization (AEPO) to overcome semantic alignment bottlenecks in GUI agents through efficient, guided exploration.
Updated Sep 4, 2025 · Python
[IROS'25 Oral & NeurIPSw'24] Official implementation of "MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control"
Updated Jun 16, 2025 · Python
[ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"
Updated Nov 28, 2023 · Python
The code repository for "Wings: Learning Multimodal LLMs without Text-only Forgetting" [NeurIPS 2024]
Updated Dec 28, 2024 · Python
Official repository of the paper: Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
[ACL 2024] Dataset and Code of "ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction"
Updated Jun 10, 2024 · Jupyter Notebook
[NAACL 2025 Findings] Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding
Updated Jun 20, 2025 · Python
Official implementation of the paper "Efficient Test-Time Scaling for Small Vision-Language Models": test-time scaling via test-time augmentation.
Updated Oct 7, 2025 · Python
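The core idea of test-time scaling via test-time augmentation can be sketched generically: query a VLM on several augmented views of the same image and aggregate the answers by majority vote. The sketch below illustrates that general pattern only, not this paper's implementation; `ask_vlm` is a hypothetical stand-in for whatever VLM call you use.

```python
# Generic sketch of test-time augmentation for a VLM: answer the same
# question over several augmented views, then majority-vote the answers.
# ask_vlm is a hypothetical placeholder, NOT this paper's API.
from collections import Counter
from PIL import Image, ImageOps

def ask_vlm(image: Image.Image, question: str) -> str:
    # Stand-in for a real VLM call (HTTP request, local model, etc.).
    raise NotImplementedError("plug in your VLM client here")

def augmented_views(image: Image.Image) -> list[Image.Image]:
    w, h = image.size
    return [
        image,                                        # original
        ImageOps.mirror(image),                       # horizontal flip
        image.crop((w // 10, h // 10, w, h)),         # corner crop
        image.resize((int(w * 1.2), int(h * 1.2))),   # upscale
    ]

def answer_with_tta(image: Image.Image, question: str) -> str:
    answers = [ask_vlm(view, question) for view in augmented_views(image)]
    # Majority vote over the per-view answers.
    return Counter(answers).most_common(1)[0][0]
```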
Streamlit app to chat with images using Multi-modal LLMs.
Updated Mar 17, 2024 · Python
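A minimal image-chat loop in Streamlit might look like the sketch below, assuming an OpenAI-compatible vision endpoint; the model name and message layout are illustrative assumptions, not this repo's code.

```python
# Minimal sketch of a Streamlit image-chat app against an OpenAI-compatible
# vision API. Model name and prompt handling are assumptions, not this repo.
import base64
import streamlit as st
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

st.title("Chat with an image")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
question = st.chat_input("Ask something about the image")

if uploaded and question:
    st.image(uploaded)
    b64 = base64.b64encode(uploaded.getvalue()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any vision-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    with st.chat_message("assistant"):
        st.write(response.choices[0].message.content)
```

Run it with `streamlit run app.py`; Streamlit reruns the script on each interaction, so the upload-then-ask flow needs no explicit event loop.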
Official implementation of ICML 2025 paper "Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach"
Updated May 27, 2025 · Python
Q-HEART: ECG Question Answering via Knowledge-Informed Multimodal LLMs (ECAI 2025)
Updated Aug 22, 2025 · Python
SpatialFusion-LM is a real-time spatial reasoning framework that combines neural depth, 3D reconstruction, and language-driven scene understanding.
Updated May 2, 2025 · Python
Medical report generation and VQA (adapting XrayGPT to any modality).
Updated Jun 28, 2025 · Python
LLaVA base model for use with Autodistill.
Updated Jan 24, 2024 · Python
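Autodistill base models share a common auto-labeling interface (an ontology mapping prompts to class names, plus a `label()` call over an image folder). The sketch below assumes the LLaVA plugin follows that interface; the `autodistill_llava` package and `LLaVA` class names are assumptions, not verified against this repo.

```python
# Sketch of the standard Autodistill auto-labeling flow with LLaVA as the
# base model. The autodistill_llava import is an ASSUMED plugin name; the
# CaptionOntology / label() pattern is the common Autodistill interface.
from autodistill.detection import CaptionOntology
from autodistill_llava import LLaVA  # assumed plugin module name

# Map natural-language prompts to the class names you want in the dataset.
base_model = LLaVA(ontology=CaptionOntology({
    "forklift": "forklift",
    "person": "person",
}))

# Auto-label a folder of images; Autodistill writes out an annotated
# dataset that a smaller target model (e.g. YOLOv8) can be trained on.
base_model.label(input_folder="./images", extension=".jpeg")
```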
Kani extension for supporting vision-language models (VLMs). Comes with model-agnostic support for GPT-Vision and LLaVA.
Updated Jul 2, 2025 · Python
A minimal, hackable Vision-Language Model built on Karpathy’s nanochat — add image understanding and multimodal chat for under $200 in compute.
Updated Nov 4, 2025 · Python