We effectively integrate unseen low-resource modalities into large language models with as few as 32 samples by leveraging high-resource modalities.
Multimodal foundation models can process several modalities. However, since the space of possible modalities is large and evolving over time, training a model from scratch to encompass all modalities is infeasible. Moreover, integrating a modality into a pre-existing foundation model currently requires a significant amount of paired data (e.g., images with text), which is often not available for low-resource modalities. In this paper, we study how to integrate unseen modalities into Large Language Models (LLMs) in a sample-efficient way. To this end, we train a hypernetwork to generate modality-specific, parameter-efficient adapters on top of a shared projector placed between each modality-specific encoder and the LLM. The hypernetwork, trained on high-resource modalities (i.e., text, speech, audio, video), can be conditioned on a subset of samples from any arbitrary modality at inference time to generate an appropriate adapter. To increase the diversity of seen modalities, we artificially multiply the number of training encoders through isometric transformations. We demonstrate that our method achieves a significant increase in sample efficiency during few-shot integration of new modalities (i.e., satellite images, astronomical images, and molecules) with encoders of arbitrary embedding dimensionality. Specifically, our method's 32-shot performance requires up to 64$\times$ less data than learning a projector from scratch and 77$\times$ less data than fine-tuning a projector pre-trained on seen modalities to achieve comparable results, substantially extending the modality coverage of foundation models.
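For intuition, below is a minimal, illustrative PyTorch sketch of the idea described above, not the repository's actual implementation: a shared projector maps encoder embeddings into the LLM embedding space, and a hypernetwork, conditioned on a few support samples from a modality, emits a low-rank adapter that specializes the projector for that modality. All module names, dimensions, and the pooling scheme are assumptions made for illustration only.

```python
# Illustrative sketch only (assumed names and shapes, not the repo's code).
import torch
import torch.nn as nn


class SharedProjector(nn.Module):
    """Maps encoder embeddings (enc_dim) to the LLM hidden size (llm_dim)."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, x, lora_A=None, lora_B=None):
        out = self.proj(x)
        if lora_A is not None and lora_B is not None:
            # Low-rank, modality-specific correction generated by the hypernetwork.
            out = out + (x @ lora_A) @ lora_B
        return out


class AdapterHypernetwork(nn.Module):
    """Generates low-rank adapter weights from a handful of modality samples."""
    def __init__(self, enc_dim: int, llm_dim: int, rank: int = 8, hidden: int = 256):
        super().__init__()
        self.summarize = nn.Sequential(nn.Linear(enc_dim, hidden), nn.ReLU())
        self.to_A = nn.Linear(hidden, enc_dim * rank)
        self.to_B = nn.Linear(hidden, rank * llm_dim)
        self.enc_dim, self.llm_dim, self.rank = enc_dim, llm_dim, rank

    def forward(self, support_embeddings):
        # Pool the few-shot support set into a single modality descriptor.
        z = self.summarize(support_embeddings).mean(dim=0)
        lora_A = self.to_A(z).view(self.enc_dim, self.rank)
        lora_B = self.to_B(z).view(self.rank, self.llm_dim)
        return lora_A, lora_B


# Usage: condition on e.g. 32 support samples from an unseen modality, then
# project new embeddings of that modality into the LLM embedding space.
enc_dim, llm_dim = 512, 2048
projector = SharedProjector(enc_dim, llm_dim)
hypernet = AdapterHypernetwork(enc_dim, llm_dim)
support = torch.randn(32, enc_dim)           # few-shot embeddings of the new modality
lora_A, lora_B = hypernet(support)
tokens = projector(torch.randn(4, enc_dim), lora_A, lora_B)  # -> LLM input embeddings
```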
conda env create -f environment.yml
conda activate dynamic_mm
Then, install either `requirements-cuda.txt` or `requirements.txt` based on your environment. In our environment, we used `requirements-cuda.txt`.
pip install -r requirements-cuda.txt
pip install -e .
You will need to log in to Huggingface and Weights & Biases through the CLI. You will also need access to the Llama 3.2 series LLMs through Huggingface. For further info, please refer to the following links:
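The CLI commands for these logins are `huggingface-cli login` and `wandb login`. As an optional alternative, here is a minimal sketch of the programmatic logins, assuming the `huggingface_hub` and `wandb` packages are available from the installed requirements:

```python
# Optional programmatic alternative to `huggingface-cli login` and `wandb login`.
# Both calls prompt for a token/API key if one is not passed explicitly.
from huggingface_hub import login
import wandb

login()        # Hugging Face token; the account needs access to the Llama 3.2 models
wandb.login()  # Weights & Biases API key
```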
- Execute the `python pkl.py` command under the `dmi/data` directory to decompress the extracted embeddings and other dataset files.
- Clone the `ospanbatyr/cococap` repository under the `dmi` folder.
- Change directory to `dmi/cococap`.
- Execute the `get_stanford_models.sh` script.
- Download the JDK 8u202 specific to your distribution from Java SE 8 Archive Downloads or Java 8 Downloads and install it under the `dmi/cococap` folder. The folder should be named `jdk1.8.0_202`. Other Java 8 JDK versions might work as well, but we tested on the Linux x64 architecture with the 8u202 release. One can also change the values of the `JAVAPATH` variables within the Python files under the `dmi/cococap` folder.
- To skip the projector pre-training and hypernetwork training stages, download the pre-trained projector and hypernetwork weights from Huggingface and place them under the `dmi/checkpoints` directory (see the optional path check sketched below).
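The snippet below is a hypothetical sanity check, not part of the repository, that verifies the directories referenced in the steps above exist before launching training; adjust the paths if your layout differs.

```python
# Hypothetical sanity check (not part of the repository): verify that the
# directories referenced in the setup steps above exist before training.
from pathlib import Path

expected = [
    Path("dmi/data"),                  # decompressed embeddings and dataset files
    Path("dmi/cococap"),               # cloned ospanbatyr/cococap repository
    Path("dmi/cococap/jdk1.8.0_202"),  # Java 8 JDK (8u202) from the step above
    Path("dmi/checkpoints"),           # optional pre-trained projector/hypernetwork weights
]
for path in expected:
    print(("ok     " if path.exists() else "MISSING"), path)
```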
python -u train_projector.py configs/projector/v1:llama1b_inst_all_extracted.json
python -u train_hypernet.py configs/projector/v4:llama1b_inst_all.json
# Example run; please see the only_fewshot configs under dmi/configs/hypernet for the remaining experiments
python -u train_hypernet.py configs/hypernet/v6:llama1b_inst_all_only_fewshot_candels_base.json
# Example runs; please see the dmi/configs/lora and dmi/configs/projector folders for the remaining experiments
python -u train_projector.py configs/projector/v2:llama1b_sydney_rn50_mlp2.json # Projector baseline
python -u train_projector.py configs/projector/v3:llama1b_sydney_rn50_mlp2_ft.json # FT Projector baseline
python -u train_lora.py configs/lora/v3:llama1b_inst_mlp2_sydney_rn50.json # LoRA baseline
The evaluation of our experiments is automatic. Please refer to the `{dataset-name}-results.json` files under the `outputs/` folder.
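For convenience, here is a hypothetical snippet (not part of the repository) that prints all results files, assuming they are plain JSON:

```python
# Hypothetical helper: dump every automatically generated results file.
import json
from glob import glob

for path in sorted(glob("outputs/*-results.json")):
    with open(path) as f:
        print(f"== {path} ==")
        print(json.dumps(json.load(f), indent=2))
```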
We thank Benjamin Minixhofer, Csordás Róbert, and Giorgio Roffo for their valuable codebases. If you liked our work, please pay their wonderful repositories a visit: