We effectively integrate unseen low-resource modalities into large language models with as few as 32 samples by leveraging high-resource modalities.
Multimodal foundation models can process several modalities. However, since the space of possible modalities is large and evolving over time, training a model from scratch to encompass all modalities is infeasible. Moreover, integrating a modality into a pre-existing foundation model currently requires a significant amount of paired data (e.g., images with text), which is often not available for low-resource modalities. In this paper, we study how to integrate unseen modalities into Large Language Models (LLMs) in a sample-efficient way. To this end, we train a hypernetwork to generate modality-specific, parameter-efficient adapters on top of a shared projector placed between each modality-specific encoder and the LLM. The hypernetwork, trained on high-resource modalities (i.e., text, speech, audio, video), can be conditioned on a subset of samples from any arbitrary modality at inference time to generate an appropriate adapter. To increase the diversity of seen modalities, we artificially multiply the number of training encoders through isometric transformations. We demonstrate that our method achieves a significant increase in sample efficiency during few-shot integration of new modalities (i.e., satellite images, astronomical images, and molecules) with encoders of arbitrary embedding dimensionality. Specifically, our method's 32-shot performance requires up to 64$\times$ less data than learning a projector from scratch and 77$\times$ less data than fine-tuning a projector pre-trained on seen modalities to achieve comparable results, substantially extending the modality coverage of foundation models.
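For intuition, below is a minimal, illustrative PyTorch sketch of the idea described above, not the repository's actual implementation: a shared projector maps encoder embeddings into the LLM embedding space, and a hypernetwork, conditioned on a few support samples from a modality, emits a low-rank adapter that specializes the projector for that modality. All module names, dimensions, and the pooling scheme are assumptions made for illustration only.

```python
# Illustrative sketch only (assumed names and shapes, not the repo's code).
import torch
import torch.nn as nn


class SharedProjector(nn.Module):
    """Maps encoder embeddings (enc_dim) to the LLM hidden size (llm_dim)."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, x, lora_A=None, lora_B=None):
        out = self.proj(x)
        if lora_A is not None and lora_B is not None:
            # Low-rank, modality-specific correction generated by the hypernetwork.
            out = out + (x @ lora_A) @ lora_B
        return out


class AdapterHypernetwork(nn.Module):
    """Generates low-rank adapter weights from a handful of modality samples."""
    def __init__(self, enc_dim: int, llm_dim: int, rank: int = 8, hidden: int = 256):
        super().__init__()
        self.summarize = nn.Sequential(nn.Linear(enc_dim, hidden), nn.ReLU())
        self.to_A = nn.Linear(hidden, enc_dim * rank)
        self.to_B = nn.Linear(hidden, rank * llm_dim)
        self.enc_dim, self.llm_dim, self.rank = enc_dim, llm_dim, rank

    def forward(self, support_embeddings):
        # Pool the few-shot support set into a single modality descriptor.
        z = self.summarize(support_embeddings).mean(dim=0)
        lora_A = self.to_A(z).view(self.enc_dim, self.rank)
        lora_B = self.to_B(z).view(self.rank, self.llm_dim)
        return lora_A, lora_B


# Usage: condition on e.g. 32 support samples from an unseen modality, then
# project new embeddings of that modality into the LLM embedding space.
enc_dim, llm_dim = 512, 2048
projector = SharedProjector(enc_dim, llm_dim)
hypernet = AdapterHypernetwork(enc_dim, llm_dim)
support = torch.randn(32, enc_dim)           # few-shot embeddings of the new modality
lora_A, lora_B = hypernet(support)
tokens = projector(torch.randn(4, enc_dim), lora_A, lora_B)  # -> LLM input embeddings
```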
conda env create -f environment.yml
conda activate dynamic_mm
Then, install either `requirements-cuda.txt` or `requirements.txt` based on your environment. In our environment, we used `requirements-cuda.txt`.
pip install -r requirements-cuda.txt
pip install -e .
You will need to log in to Huggingface and Weights & Biases through the CLI. You will also need access to the Llama 3.2 series LLMs through Huggingface. For further info, please refer to the following links:
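The CLI commands for these logins are `huggingface-cli login` and `wandb login`. As an optional alternative, here is a minimal sketch of the programmatic logins, assuming the `huggingface_hub` and `wandb` packages are available from the installed requirements:

```python
# Optional programmatic alternative to `huggingface-cli login` and `wandb login`.
# Both calls prompt for a token/API key if one is not passed explicitly.
from huggingface_hub import login
import wandb

login()        # Hugging Face token; the account needs access to the Llama 3.2 models
wandb.login()  # Weights & Biases API key
```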
- Execute the `python pkl.py` command under the `dmi/data` directory to decompress the extracted embeddings and other dataset files.
- Clone the `ospanbatyr/cococap` repository under the `dmi` folder.
- Change directory to `dmi/cococap`.
- Execute the `get_stanford_models.sh` script.
- Download the JDK 8u202 specific to your distribution from Java SE 8 Archive Downloads or Java 8 Downloads and install it under the `dmi/cococap` folder. The folder should be named `jdk1.8.0_202`. Other Java 8 JDK versions might work as well, but we tested on the Linux x64 architecture with the 8u202 release. One can also change the values of the `JAVAPATH` variables within the Python files under the `dmi/cococap` folder.
- To skip the projector pre-training and hypernetwork training stages, download the pre-trained projector and hypernetwork weights from Huggingface and place them under the `dmi/checkpoints` directory (see the optional path check sketched below).
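The snippet below is a hypothetical sanity check, not part of the repository, that verifies the directories referenced in the steps above exist before launching training; adjust the paths if your layout differs.

```python
# Hypothetical sanity check (not part of the repository): verify that the
# directories referenced in the setup steps above exist before training.
from pathlib import Path

expected = [
    Path("dmi/data"),                  # decompressed embeddings and dataset files
    Path("dmi/cococap"),               # cloned ospanbatyr/cococap repository
    Path("dmi/cococap/jdk1.8.0_202"),  # Java 8 JDK (8u202) from the step above
    Path("dmi/checkpoints"),           # optional pre-trained projector/hypernetwork weights
]
for path in expected:
    print(("ok     " if path.exists() else "MISSING"), path)
```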
python -u train_projector.py configs/projector/v1:llama1b_inst_all_extracted.json
python -u train_hypernet.py configs/projector/v4:llama1b_inst_all.json
# Example run; please see the only_fewshot configs under dmi/configs/hypernet for the remaining experiments
python -u train_hypernet.py configs/hypernet/v6:llama1b_inst_all_only_fewshot_candels_base.json
# Example runs; please see the dmi/configs/lora and dmi/configs/projector folders for the remaining experiments
python -u train_projector.py configs/projector/v2:llama1b_sydney_rn50_mlp2.json # Projector baseline
python -u train_projector.py configs/projector/v3:llama1b_sydney_rn50_mlp2_ft.json # FT Projector baseline
python -u train_lora.py configs/lora/v3:llama1b_inst_mlp2_sydney_rn50.json # LoRA baseline
The evaluation of our experiments is automatic. Please refer to the `{dataset-name}-results.json` files under the `outputs/` folder.
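For convenience, here is a hypothetical snippet (not part of the repository) that prints all results files, assuming they are plain JSON:

```python
# Hypothetical helper: dump every automatically generated results file.
import json
from glob import glob

for path in sorted(glob("outputs/*-results.json")):
    with open(path) as f:
        print(f"== {path} ==")
        print(json.dumps(json.load(f), indent=2))
```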
We thank Benjamin Minixhofer, Csordás Róbert, and Giorgio Roffo for their valuable codebases. If you liked our work, please pay their wonderful repositories a visit: