
Sample-efficient Integration of New Modalities into Large Language Models

¹University of Edinburgh
²Instituto de Telecomunicações
³Instituto Superior Técnico, Universidade de Lisboa
⁴Unbabel

Dataset: CAPDELS

Demo GIF

TL;DR

We effectively integrate unseen low-resource modalities into large language models with as few as 32 samples by leveraging high-resource modalities.

Abstract

Multimodal foundation models can process several modalities. However, since the space of possible modalities is large and evolving over time, training a model from scratch to encompass all modalities is infeasible. Moreover, integrating a modality into a pre-existing foundation model currently requires a significant amount of paired data (e.g., images with text), which is often not available for low-resource modalities. In this paper, we study how to integrate unseen modalities into Large Language Models (LLMs) in a sample-efficient way. To this end, we train a hypernetwork to generate modality-specific, parameter-efficient adapters on top of a shared projector placed between each modality-specific encoder and the LLM. The hypernetwork, trained on high-resource modalities (i.e., text, speech, audio, video), can be conditioned on a subset of samples from any arbitrary modality at inference time to generate an appropriate adapter. To increase the diversity of seen modalities, we artificially multiply the number of training encoders through isometric transformations. We demonstrate that our method achieves a significant increase in sample efficiency during few-shot integration of new modalities (i.e., satellite images, astronomical images, and molecules) with encoders of arbitrary embedding dimensionality. Specifically, our method's 32-shot performance requires up to 64× less data than learning a projector from scratch and 77× less data than fine-tuning a projector pre-trained on seen modalities to achieve comparable results, substantially extending the modality coverage of foundation models.
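
For intuition, below is a minimal, illustrative sketch of the pipeline the abstract describes: a shared projector maps encoder embeddings into the LLM embedding space, and a hypernetwork, conditioned on a small support set from a new modality, generates a low-rank adapter applied on top of that projector. All class and variable names are hypothetical and chosen for clarity; the sketch also glosses over how arbitrary encoder dimensionalities are handled and does not mirror the actual implementation in this repository.

# Illustrative sketch only; names and architecture details are assumptions.
import torch
import torch.nn as nn

class SharedProjector(nn.Module):
    # Shared across modalities: maps encoder embeddings into the LLM embedding space.
    def __init__(self, enc_dim, llm_dim, hidden_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, llm_dim)
        )

    def forward(self, x):
        return self.net(x)

class AdapterHypernetwork(nn.Module):
    # Generates a low-rank adapter from a small support set of a (possibly unseen) modality.
    def __init__(self, enc_dim, llm_dim, rank=8, cond_dim=512):
        super().__init__()
        self.encode_support = nn.Sequential(nn.Linear(enc_dim, cond_dim), nn.GELU())
        self.to_a = nn.Linear(cond_dim, llm_dim * rank)
        self.to_b = nn.Linear(cond_dim, rank * llm_dim)
        self.llm_dim, self.rank = llm_dim, rank

    def forward(self, support_embeddings):
        # Condition on the support set (e.g., 32 samples) via mean pooling.
        cond = self.encode_support(support_embeddings).mean(dim=0)
        a = self.to_a(cond).view(self.llm_dim, self.rank)
        b = self.to_b(cond).view(self.rank, self.llm_dim)
        return a, b

def project_new_modality(projector, hypernet, support_embeddings, query_embeddings):
    # Projector output plus a residual low-rank adaptation generated by the hypernetwork.
    a, b = hypernet(support_embeddings)
    z = projector(query_embeddings)
    return z + z @ a @ b

# Toy usage with random embeddings from a hypothetical 768-dimensional encoder.
enc_dim, llm_dim = 768, 2048
projector = SharedProjector(enc_dim, llm_dim)
hypernet = AdapterHypernetwork(enc_dim, llm_dim)
tokens = project_new_modality(projector, hypernet,
                              torch.randn(32, enc_dim),   # 32-shot support set
                              torch.randn(4, enc_dim))    # queries to feed the LLM
print(tokens.shape)  # torch.Size([4, 2048])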

Installation

Coding Environment

conda env create -f environment.yml
conda activate dynamic_mm

Then, install either requirements-cuda.txt or requirements.txt, depending on your environment. We used requirements-cuda.txt.

pip install -r requirements-cuda.txt
pip install -e .

Used Frameworks

You will need to log in to Hugging Face and Weights & Biases through their CLIs. You will also need access to the Llama 3.2 series of LLMs on Hugging Face. For further information, please refer to the Hugging Face, Weights & Biases, and Llama 3.2 documentation.
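
If you prefer to authenticate programmatically rather than through the CLIs, the following minimal sketch achieves the same effect (it assumes the huggingface_hub and wandb packages from the requirements are installed; the token and key placeholders must be replaced with your own credentials):

# Programmatic alternative to `huggingface-cli login` and `wandb login`.
from huggingface_hub import login
import wandb

login(token="hf_...")   # Hugging Face token with access to the gated Llama 3.2 models
wandb.login(key="...")  # Weights & Biases API key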

Materials

Dataset
  1. Run python pkl.py in the dmi/data directory to decompress the extracted embeddings and other dataset files.
Evaluation
  1. Clone the ospanbatyr/cococap repository under the dmi folder.
  2. Change directory to dmi/cococap.
  3. Execute the get_stanford_models.sh script.
  4. Download JDK 8u202 for your platform from Java SE 8 Archive Downloads or Java 8 Downloads and install it under the dmi/cococap folder; the folder should be named jdk1.8.0_202. Other Java 8 JDK releases might work as well, but we only tested the 8u202 release on Linux x64. Alternatively, you can change the JAVAPATH variables in the Python files under the dmi/cococap folder.
Checkpoints
  1. To skip the projector pre-training and hypernetwork training stages, download the pre-trained projector and hypernetwork weights from Hugging Face and place them under the dmi/checkpoints directory (a download sketch follows below).
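
A minimal sketch of that download step using huggingface_hub (the repository id below is a placeholder; substitute the actual checkpoint repository):

# Download the released projector/hypernetwork weights into dmi/checkpoints.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<huggingface-checkpoint-repo-id>",  # placeholder, replace with the actual repo id
    local_dir="dmi/checkpoints",
)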

Experiments

Projector Pre-training
python -u train_projector.py configs/projector/v1:llama1b_inst_all_extracted.json
Hypernetwork Training
python -u train_hypernet.py configs/projector/v4:llama1b_inst_all.json
Few-shot Runs
# Example run; see the only_fewshot configs under dmi/configs/hypernet for the remaining experiments
python -u train_hypernet.py configs/hypernet/v6:llama1b_inst_all_only_fewshot_candels_base.json
Baselines
# Example runs; see the dmi/configs/lora and dmi/configs/projector folders for the remaining experiments
python -u train_projector.py configs/projector/v2:llama1b_sydney_rn50_mlp2.json        # Projector baseline
python -u train_projector.py configs/projector/v3:llama1b_sydney_rn50_mlp2_ft.json     # FT Projector baseline
python -u train_lora.py configs/lora/v3:llama1b_inst_mlp2_sydney_rn50.json             # LoRA baseline
Evaluation

The evaluation of our experiments is automatic. Please refer to the {dataset-name}-results.json files under the outputs/ folder.
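
To inspect these files programmatically, here is a small sketch (the exact keys in each JSON depend on the dataset and metrics):

# Print every automatically generated results file under outputs/.
import json
from pathlib import Path

for path in sorted(Path("outputs").glob("*-results.json")):
    with path.open() as f:
        results = json.load(f)
    print(path.name, results)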

Acknowledgment

We thank Benjamin Minixhofer, Róbert Csordás, and Giorgio Roffo for their valuable codebases. If you liked our work, please pay their wonderful codebases a visit.

About

Code for the "Sample-efficient Integration of New Modalities into Large Language Models" paper
