This is the official code repository for the paper "Sparse Autoencoders for Low-$N$ Protein Function Prediction and Design".
- Sparse Autoencoder Training (sae_training/): Trains SAEs on ESM model activations
- Linear Probe Analysis (linear_probe/): Evaluates SAE representations for fitness prediction
- Protein Engineering Pipeline (protein_engineering/): Designs sequences with SAE and ESM2
- Visualization Tools (visualize_sequences/): Analysis and visualization of results
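To give a feel for what a low-N linear probe on learned features involves, here is a minimal sketch. It is not the code in linear_probe/: the closed-form ridge regression, the `alpha` value, the Spearman metric, and the synthetic random features standing in for SAE activations are all assumptions made for illustration.

```python
import numpy as np

def ridge_probe_fit(X_ND, y_N, alpha=1.0):
    """Closed-form ridge regression; the intercept is handled by centering.

    N = number of variants, D = number of features (illustrative names).
    """
    x_mean_D = X_ND.mean(axis=0)
    y_mean = y_N.mean()
    Xc_ND = X_ND - x_mean_D
    D = X_ND.shape[1]
    w_D = np.linalg.solve(Xc_ND.T @ Xc_ND + alpha * np.eye(D),
                          Xc_ND.T @ (y_N - y_mean))
    b = y_mean - x_mean_D @ w_D
    return w_D, b

def spearman(a_N, b_N):
    """Spearman rank correlation (assumes no tied values)."""
    rank_a = np.argsort(np.argsort(a_N))
    rank_b = np.argsort(np.argsort(b_N))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Synthetic stand-in data: random "features" and a noisy linear "fitness".
rng = np.random.default_rng(0)
N_train, N_test, D = 24, 200, 16
w_true_D = rng.normal(size=D)
X_train_ND = rng.normal(size=(N_train, D))
X_test_ND = rng.normal(size=(N_test, D))
y_train_N = X_train_ND @ w_true_D + 0.1 * rng.normal(size=N_train)
y_test_N = X_test_ND @ w_true_D + 0.1 * rng.normal(size=N_test)

# Fit on a small labeled set, then rank held-out variants.
w_D, b = ridge_probe_fit(X_train_ND, y_train_N, alpha=1.0)
rho = spearman(X_test_ND @ w_D + b, y_test_N)
print(f"Spearman rho on held-out variants: {rho:.3f}")
```

In a real run the feature matrix would come from a trained SAE rather than a random generator, and the labels from a DMS file.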
- Create conda environment:

      conda env create -f sae.yml
      conda activate sae

- Train SAE models:

      cd sae_training
      python main.py

- Run linear probe experiments:

      cd linear_probe
      sh run_all_expts.sh

- Execute protein engineering pipeline:

      cd protein_engineering
      python main.py

- Visualize results:

      cd visualize_sequences
      python main.py
For the functions involving tensors, we use shape suffixes to describe their shape.
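As an illustration of the shape-suffix convention, the sketch below annotates a generic ReLU sparse autoencoder forward pass. The dimension letters (`B`, `L`, `D`, `H`), the function name `sae_forward`, and the architecture itself are assumptions chosen for the example, and NumPy is used in place of the repository's tensor library to keep it self-contained.

```python
import numpy as np

# Hypothetical dimension letters (not taken from the repository):
# B = batch size, L = sequence length, D = ESM embedding width, H = SAE hidden width
B, L, D, H = 2, 5, 8, 32

def sae_forward(acts_BLD, W_enc_DH, b_enc_H, W_dec_HD, b_dec_D):
    """Generic ReLU SAE forward pass; each suffix spells out the tensor's shape."""
    pre_BLH = (acts_BLD - b_dec_D) @ W_enc_DH + b_enc_H
    feats_BLH = np.maximum(pre_BLH, 0.0)          # sparse, non-negative features
    recon_BLD = feats_BLH @ W_dec_HD + b_dec_D    # reconstruction of the input
    return feats_BLH, recon_BLD

# Random placeholder inputs and weights, only to exercise the shapes.
rng = np.random.default_rng(0)
acts_BLD = rng.normal(size=(B, L, D))
W_enc_DH = rng.normal(size=(D, H)) * 0.1
b_enc_H = np.zeros(H)
W_dec_HD = rng.normal(size=(H, D)) * 0.1
b_dec_D = np.zeros(D)

feats_BLH, recon_BLD = sae_forward(acts_BLD, W_enc_DH, b_enc_H, W_dec_HD, b_dec_D)
print(feats_BLH.shape, recon_BLD.shape)  # (2, 5, 32) (2, 5, 8)
```

Reading a name such as `W_enc_DH` tells you the matrix maps `D`-wide activations to `H`-wide features without having to trace where it was created.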
The project evaluates performance across multiple Deep Mutational Scanning (DMS) datasets from ProteinGym:
- GFP_AEQVI_Sarkisyan_2016
- SPG1_STRSG_Olson_2014
- SPG1_STRSG_Wu_2016
- DLG4_HUMAN_Faure_2021
- GRB2_HUMAN_Faure_2021
- F7YBW8_MESOW_Ding_2023
Each of these DMS files can be downloaded here.
To run the experiments from the paper, create the models folder and train the ESM and SAE models. To run them on your own protein, create the DMS folder, add a CSV of the protein's DMS data to it, and place the trained SAE/ESM model in models. The models used in the paper can be cloned from HuggingFace, along with a tutorial on how to use them or set up your own.
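Before running on your own protein, it can help to confirm that the CSV is readable. The sketch below is one way to do that. It assumes the file follows the ProteinGym substitution layout with `mutant`, `mutated_sequence`, and `DMS_score` columns. The helper `load_dms_csv`, the file name `my_protein.csv`, and the toy rows are hypothetical, and the folders are created in a temporary directory because the text does not say where in the repository they belong.

```python
import csv
import tempfile
from pathlib import Path

# Column names assumed from ProteinGym's substitution DMS files.
REQUIRED_COLUMNS = {"mutant", "mutated_sequence", "DMS_score"}

def load_dms_csv(csv_path):
    """Return (sequence, score) pairs from a DMS CSV after checking its columns."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"{csv_path} is missing columns: {sorted(missing)}")
        return [(row["mutated_sequence"], float(row["DMS_score"])) for row in reader]

# Toy layout in a temporary directory: a models folder and a DMS folder.
root = Path(tempfile.mkdtemp())
(root / "models").mkdir()
(root / "DMS").mkdir()

# Two made-up variants of the made-up wild type "AKV".
toy_csv = root / "DMS" / "my_protein.csv"
toy_csv.write_text(
    "mutant,mutated_sequence,DMS_score\n"
    "A1G,GKV,0.5\n"
    "K2R,ARV,-1.2\n"
)

pairs = load_dms_csv(toy_csv)
print(f"Loaded {len(pairs)} variants from {toy_csv.name}")
```

A missing column raises a `ValueError` naming the absent fields, which is usually easier to act on than a failure deeper in the pipeline.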
If you use our repository and enjoy it, please consider citing our paper, which was accepted into the NeurIPS AI4Science workshop!
@inproceedings{tsui2025lownsae,
title={Sparse Autoencoders for Low-$N$ Protein Function Prediction and Design},
author={Tsui, Darin and Talreja, Kunal and Aghazadeh, Amirali},
booktitle={NeurIPS AI for Science Workshop},
year={2025}
}