Sparse Autoencoders for Low-$N$ Protein Function Prediction and Design

This is the official code repository for the paper "Sparse Autoencoders for Low-$N$ Protein Function Prediction and Design" by Darin Tsui, Kunal Talreja, and Amirali Aghazadeh. A link to the paper can be found here.

Key Components:

Sparse Autoencoder Training (sae_training/): Trains SAEs on ESM model activations
Linear Probe Analysis (linear_probe/): Evaluates SAE representations for fitness prediction
Protein Engineering Pipeline (protein_engineering/): Designs sequences with SAE and ESM2
Visualization Tools (visualize_sequences/): Analysis and visualization of results

Environment Setup

Create conda environment:

conda env create -f sae.yml
conda activate sae

Running Experiments

Train SAE models:
```
cd sae_training
python main.py
```
Run linear probe experiments:
```
cd linear_probe
sh run_all_expts.sh
```
Execute protein engineering pipeline:
```
cd protein_engineering
python main.py
```
Visualize results:
```
cd visualize_sequences
python main.py
```
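The four steps above can also be driven from a single script. The following is a minimal sketch, not part of the repository: the `run_pipeline` helper and its `dry_run` flag are hypothetical, and it assumes the steps are meant to run in the order listed, each from inside its own subdirectory.

```python
import subprocess
from pathlib import Path

# (subdirectory, command) pairs, in the order this README lists them.
STEPS = [
    ("sae_training", ["python", "main.py"]),
    ("linear_probe", ["sh", "run_all_expts.sh"]),
    ("protein_engineering", ["python", "main.py"]),
    ("visualize_sequences", ["python", "main.py"]),
]


def run_pipeline(repo_root=".", dry_run=True):
    """Return the planned (cwd, command) pairs; execute them unless dry_run is set."""
    root = Path(repo_root)
    planned = []
    for subdir, cmd in STEPS:
        cwd = root / subdir
        planned.append((str(cwd), cmd))
        if not dry_run:
            # check=True raises CalledProcessError at the first failing step.
            subprocess.run(cmd, cwd=cwd, check=True)
    return planned
```

Calling `run_pipeline(dry_run=True)` only lists what would be run, which is a cheap way to confirm the paths before starting a long training job; `run_pipeline(dry_run=False)` executes the steps.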

In functions that operate on tensors, we use shape suffixes on variable names to describe each tensor's shape.
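As an illustration of the shape-suffix convention, the sketch below appends one letter per dimension to a variable name and checks a shape against that suffix. The dimension letters (`B`, `L`, `D`, `F`), the variable name `acts_BLD`, and the `check_shape` helper are all assumptions made for this example; the letters actually used in the code may differ.

```python
def check_shape(name, shape, dims):
    """Check that `shape` matches the suffix after the last underscore in `name`."""
    suffix = name.rsplit("_", 1)[-1]
    if len(suffix) != len(shape):
        raise ValueError(f"{name}: suffix has {len(suffix)} dims, shape has {len(shape)}")
    for letter, size in zip(suffix, shape):
        if dims[letter] != size:
            raise ValueError(f"{name}: expected {letter}={dims[letter]}, got {size}")
    return True


# Hypothetical dimension letters: B = batch, L = sequence length,
# D = model width, F = number of SAE features.
dims = {"B": 2, "L": 5, "D": 8, "F": 32}

# A nested list stands in for a tensor so the example has no dependencies.
acts_BLD = [[[0.0] * dims["D"] for _ in range(dims["L"])] for _ in range(dims["B"])]
shape_of_acts = (len(acts_BLD), len(acts_BLD[0]), len(acts_BLD[0][0]))
check_shape("acts_BLD", shape_of_acts, dims)
```

With this convention a reader can tell from the name alone that `acts_BLD` is indexed as batch, position, then hidden dimension, without tracing where the tensor was created.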

Experiments

The project evaluates performance on six deep mutational scanning (DMS) datasets from ProteinGym:

GFP_AEQVI_Sarkisyan_2016
SPG1_STRSG_Olson_2014
SPG1_STRSG_Wu_2016
DLG4_HUMAN_Faure_2021
GRB2_HUMAN_Faure_2021
F7YBW8_MESOW_Ding_2023

Each of these DMS files can be downloaded here.

To run the experiments from the paper, create the models folder and train the ESM and SAE models. To run these experiments on your own protein, create the DMS folder, add a CSV of the protein's DMS data to it, and add the trained SAE/ESM model to models. The models used for this paper, along with a tutorial on how to use them or set up your own, can be cloned from HuggingFace.
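Before adding a CSV for your own protein, it can help to check that it parses cleanly. The sketch below is a hypothetical helper, not part of this repository. It uses the column names from ProteinGym's substitution CSVs (`mutant`, `mutated_sequence`, `DMS_score`); whether the loaders in this repository require exactly these columns is an assumption, and the two rows are made-up toy data.

```python
import csv
import io

# Column names follow the ProteinGym substitution CSV layout.
# Assumption: this repository's loaders expect the same columns.
REQUIRED_COLUMNS = {"mutant", "mutated_sequence", "DMS_score"}


def load_dms_rows(handle):
    """Parse a DMS CSV from an open text handle and return one dict per variant."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        rows.append(
            {
                "mutant": row["mutant"],
                "mutated_sequence": row["mutated_sequence"],
                "DMS_score": float(row["DMS_score"]),  # fails loudly on non-numeric scores
            }
        )
    return rows


# Toy data for a made-up three-residue wild type "MAK"; not real measurements.
example_csv = io.StringIO(
    "mutant,mutated_sequence,DMS_score\n"
    "A2C,MCK,0.5\n"
    "K3R,MAR,-1.25\n"
)
rows = load_dms_rows(example_csv)
```

For a real file, pass an open handle instead of the in-memory example, for instance `load_dms_rows(open(path, newline=""))`, where `path` points at your CSV.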

Citation

If you use our repository and enjoy it, please consider citing our paper, which was accepted into the NeurIPS AI4Science workshop!

@inproceedings{tsui2025lownsae,
  title={Sparse Autoencoders for Low-$N$ Protein Function Prediction and Design},
  author={Tsui, Darin and Talreja, Kunal and Aghazadeh, Amirali},
  booktitle={NeurIPS AI for Science Workshop},
  year={2025}
}
