Converting protein tertiary structure into discrete tokens via vector-quantized variational autoencoders (VQ-VAEs) creates a language of 3D geometry and provides a natural interface between sequence and structure models. While pose invariance is commonly enforced, retaining chirality and directional cues without sacrificing reconstruction accuracy remains challenging. In this paper, we introduce GCP-VQVAE, a geometry-complete tokenizer built around a strictly SE(3)-equivariant GCPNet encoder that preserves orientation and chirality of protein backbones. We vector-quantize rotation/translation-invariant readouts that retain chirality into a 4096-token vocabulary, and a transformer decoder maps tokens back to backbone coordinates via a 6D rotation head trained with SE(3)-invariant objectives.
Building on these properties, we train GCP-VQVAE on a corpus of 24 million monomer protein backbone structures gathered from the AlphaFold Protein Structure Database. On the CAMEO2024, CASP15, and CASP16 evaluation datasets, the model achieves backbone RMSDs of 0.4377 Å, 0.5293 Å, and 0.7567 Å, respectively, reaches 100% codebook utilization on a held-out validation set, and substantially outperforms prior VQ-VAE–based tokenizers, setting a new state of the art. Beyond these benchmarks, on a zero-shot set of 1938 completely new experimental structures, GCP-VQVAE attains a backbone RMSD of 0.8193 Å and a TM-score of 0.9673, demonstrating robust generalization to unseen proteins. Lastly, we discuss applications of this foundation-like model, such as protein structure compression and integration with generative protein language models. We make the GCP-VQVAE source code, zero-shot dataset, and pretrained weights fully open to the research community.
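As background on the decoder's rotation head, the snippet below sketches the widely used 6D-to-rotation-matrix mapping via Gram–Schmidt orthogonalization (Zhou et al., 2019). It illustrates the representation only and is not the exact head implemented in this repository.

```python
# Minimal sketch of the standard 6D -> rotation-matrix mapping (Gram-Schmidt),
# as popularized by Zhou et al. (2019). Illustrative only; the decoder head in
# this repository may differ in details (normalization, ordering, etc.).
import torch

def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """Map (..., 6) tensors to (..., 3, 3) rotation matrices."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = torch.nn.functional.normalize(a1, dim=-1)
    # Remove the component of a2 along b1, then normalize.
    b2 = torch.nn.functional.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)  # right-handed frame => det(R) = +1
    return torch.stack((b1, b2, b3), dim=-2)

R = rotation_6d_to_matrix(torch.randn(4, 6))
print(torch.allclose(R @ R.transpose(-1, -2), torch.eye(3).expand(4, 3, 3), atol=1e-5))
```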
- 🗓️ 25 Sept 2025 — 🎉 Our paper was accepted to the NeurIPS 2025 AI4Science workshop!
- 🗓️ 3 Oct 2025 — Our preprint has been published in bioRxiv.
- 🗓️ 10 Oct 2025 — 🚀 Pretrained checkpoints and evaluation datasets are now available for download!
- 🗓️ 17 Oct 2025 — Released a lite version of GCP-VQVAE with half the parameters.
- Python 3.10+
- PyTorch 2.5+
- CUDA-compatible GPU
- 16GB+ GPU memory recommended for training
For AMD64 systems:

```bash
docker pull mahdip72/vqvae3d:amd_v8
docker run --gpus all -it mahdip72/vqvae3d:amd_v8
```

For ARM64 systems:

```bash
docker pull mahdip72/vqvae3d:arm_v3
docker run --gpus all -it mahdip72/vqvae3d:arm_v3
```

```bash
# Clone the repository
git clone https://github.com/mahdip72/vq_encoder_decoder.git
cd vq_encoder_decoder

# Build the Docker image
docker build -t vqvae3d .

# Run the container
docker run --gpus all -it vqvae3d
```

If you are on H100, H800, GH200, or H200 (SM90) GPUs, you can enable FlashAttention-3 for faster, lower‑memory attention.

Build with FA3 baked in:

```bash
docker build --build-arg FA3=1 -t vqvae3d-fa3 .
```

| Dataset | Description | Download Link |
|---|---|---|
| CAMEO2024 | CAMEO 2024 evaluation dataset | Download |
| CASP14 | CASP 14 evaluation dataset | Download |
| CASP15 | CASP 15 evaluation dataset | Download |
| CASP16 | CASP 16 evaluation dataset | Download |
| Zero-Shot | Zero-shot evaluation dataset | Download |
We provide a helper script to fetch a Foldcomp-formatted database and extract structures to uncompressed .pdb files. See the official docs for more details: Foldcomp README and the Foldcomp download server.
Quick start (preferred):
```bash
# 1) Open the script and set parameters at the top:
#    - DATABASE_NAME (e.g. afdb_swissprot_v4, afdb_uniprot_v4, afdb_rep_v4, afdb_rep_dark_v4,
#      esmatlas, esmatlas_v2023_02, highquality_clust30, or organism sets like h_sapiens)
#    - DOWNLOAD_DIR (where DB files live)
#    - OUTPUT_DIR (where .pdb files will be written)
nano data/download_foldcomp_db_to_pdb.sh

# 2) Run the script
bash data/download_foldcomp_db_to_pdb.sh

# The script will (a) fetch the DB via the optional Python helper if available,
# or instruct you to download DB files from the Foldcomp server, then (b) call
# `foldcomp decompress` to write uncompressed .pdb files to OUTPUT_DIR.
```

Notes:
- You need the `foldcomp` CLI in your PATH. Install guidance is available in the Foldcomp README.
- The script optionally uses the Python package `foldcomp` to auto-download DB files. If not present, it prints the exact files to fetch from the official server.
- After PDBs are downloaded, continue with the converters below to produce the `.h5` dataset used by this repo.
- `seq`: length‑L amino‑acid string. Standard 20‑letter alphabet; `X` marks unknowns and numbering gaps.
- `N_CA_C_O_coord`: float array of shape (L, 4, 3). Backbone atom coordinates in Å for [N, CA, C, O] per residue. Missing atoms/residues are NaN‑filled.
- `plddt_scores`: float array of shape (L,). Per‑residue pLDDT pulled from B‑factors when present; NaN if unavailable.
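To make the layout concrete, here is a minimal sketch that inspects one chain file with h5py. It assumes the three fields above are stored as top-level datasets with exactly those names; verify against files produced by `data/pdb_to_h5.py`.

```python
# Minimal sketch: inspect one per-chain .h5 file with h5py.
# Assumes `seq`, `N_CA_C_O_coord`, and `plddt_scores` are top-level datasets;
# check against files written by data/pdb_to_h5.py before relying on this.
import h5py
import numpy as np

with h5py.File("example_chain.h5", "r") as f:
    seq = f["seq"][()]
    coords = f["N_CA_C_O_coord"][()]   # (L, 4, 3): [N, CA, C, O] per residue, NaN-filled
    plddt = f["plddt_scores"][()]      # (L,): per-residue pLDDT, NaN if unavailable

seq = seq.decode() if isinstance(seq, bytes) else str(seq)
resolved_ca = ~np.isnan(coords[:, 1, :]).any(axis=-1)   # residues with a resolved CA atom
print(f"L={len(seq)}, resolved CA={resolved_ca.sum()}, mean pLDDT={np.nanmean(plddt):.1f}")
```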
This script scans a directory recursively and writes one .h5 per processed chain.
- Input format: By default it searches for `.pdb` files. Use `--use_cif` to read `.cif` files (no `.cif.gz`).
- Chain filtering: drops chains whose final length (after gap handling) is < `--min_len` or > `--max_len`.
- Duplicate sequences: among highly similar chains (identity > 0.95), keeps the one with the most resolved CA atoms.
- Numbering gaps & insertions: handles insertion codes natively. For numeric residue‑number gaps (both PDB and CIF), inserts `X` residues with NaN coords. If a gap exceeds `--gap_threshold` (default 5), reduces the number of inserted residues using the straight‑line CA–CA distance (assumes ~3.8 Å per residue); if CA coords are missing, caps at the threshold. This prevents runaway padding for CIF files with non‑contiguous author numbering. A minimal sketch of this heuristic is shown after this list.
- Outputs: by default filenames are `<index>_<basename>.h5` or `<index>_<basename>_chain_id_<ID>.h5` for multi‑chain structures. Add `--no_file_index` to omit the `<index>_` prefix.
Examples:
```bash
# Default: PDB input
python data/pdb_to_h5.py \
--data /abs/path/to/pdb_root \
--save_path /abs/path/to/output_h5 \
--max_len 2048 \
--min_len 25 \
--max_workers 16

# CIF input (no .gz)
python data/pdb_to_h5.py \
--use_cif \
--data /abs/path/to/cif_root \
--save_path /abs/path/to/output_h5

# Control large numeric gaps with CA–CA estimate (applies to PDB and CIF)
python data/pdb_to_h5.py \
--data /abs/path/to/structures \
--save_path /abs/path/to/output_h5 \
--gap_threshold 5

# Omit index from output filenames
python data/pdb_to_h5.py \
--no_file_index \
--data /abs/path/to/pdb_or_cif_root \
--save_path /abs/path/to/output_h5
```

Converts `.h5` backbones to PDB, writing only N/CA/C atoms and skipping residues with any NaN coordinates.
Example:
```bash
python data/h5_to_pdb.py \
--h5_dir /abs/path/to/input_h5 \
--pdb_dir /abs/path/to/output_pdb
```

Scans a directory recursively and writes one PDB per selected chain, deduplicating highly similar chains.
- Input format: By default it searches for `.pdb` files. Use `--use_cif` to read `.cif` files (no `.cif.gz`).
- Chain filtering: drops chains whose final length (after gap checks) is < `--min_len` or > `--max_len`.
- Duplicate sequences: among highly similar chains (identity > 0.90), keeps the one with the most resolved CA atoms.
- Numbering gaps: for large numeric residue‑numbering gaps, uses the straight‑line CA–CA distance to cap the number of inserted missing residues (quality control; outputs retain the original coordinates).
- Outputs: default filenames are `<basename>_chain_id_<ID>.pdb`. Add `--with_file_index` to prefix with `<index>_`. The output chain ID is set to "A".
Examples:
```bash
# Default: PDB input
python data/break_complex_to_monumers.py \
--data /abs/path/to/structures \
--save_path /abs/path/to/output_pdb \
--max_len 2048 \
--min_len 25 \
--max_workers 16

# CIF input (no .gz)
python data/break_complex_to_monumers.py \
--use_cif \
--data /abs/path/to/cif_root \
--save_path /abs/path/to/output_pdb
```

- Inference: `inference_encode.py` and `inference_embed.py` read datasets from `.h5` in the format above. `inference_decode.py` decodes VQ indices (from CSV) to backbone coordinates; you can convert decoded `.h5`/coords to PDB with `data/h5_to_pdb.py`.
- Evaluation: `evaluation.py` consumes an `.h5` file via `data_path` in `configs/evaluation_config.yaml` and reports TM‑score/RMSD; it can also write aligned PDBs.
Before you begin:
- Prepare your dataset in `.h5` format as described in Data. Use the PDB → HDF5 converter in `data/pdb_to_h5.py`.
Configure your training parameters in `configs/config_vqvae.yaml` and run:
Note:
- Training expects datasets in the HDF5 layout defined in HDF5 format used by this repo.
```bash
# Set up accelerator configuration for multi-GPU training
accelerate config

# Start training with accelerate for multi-GPU support
accelerate launch train.py --config_path configs/config_vqvae.yaml
```

See the Accelerate documentation for more options and configurations.
| Model | Description | Download Link |
|---|---|---|
| Large | Full GCP-VQVAE model with best performance | Download |
| Lite | Lightweight version for faster inference | Download |
Setup Instructions:
- Download the zip file of the checkpoint
- Extract the checkpoint folder
- Set the `trained_model_dir` path in your config files (used in the following sections) to point to the extracted checkpoint.
Multi‑GPU with Hugging Face Accelerate:
- The following scripts support multi‑GPU via Accelerate: `inference_encode.py`, `inference_embed.py`, `inference_decode.py`, and `evaluation.py`.
Example (2 GPUs, bfloat16):
```bash
accelerate launch --multi_gpu --mixed_precision=bf16 --num_processes=2 evaluation.py
```

Or, as in Training, configure Accelerate first:

```bash
accelerate config
accelerate launch evaluation.py
```

See the Accelerate documentation for more options and configurations.
All inference scripts consume .h5 inputs in the format defined in Data.
To extract the VQ codebook embeddings:
```bash
python codebook_extraction.py
```

Edit `configs/inference_codebook_extraction_config.yaml` to change paths and output filename.
To encode proteins into discrete VQ indices:
```bash
python inference_encode.py
```

Edit `configs/inference_encode_config.yaml` to change dataset paths, model, and output. Input datasets should be `.h5` as in HDF5 format used by this repo.
To extract per‑residue embeddings from the VQ layer:
```bash
python inference_embed.py
```

Edit `configs/inference_embed_config.yaml` to change dataset paths, model, and output HDF5. Input `.h5` files must follow HDF5 format used by this repo.
To decode VQ indices back to 3D backbone structures:
```bash
python inference_decode.py
```

Edit `configs/inference_decode_config.yaml` to point to the indices CSV and adjust runtime settings. To write PDBs from decoded outputs, see Convert HDF5 → PDB.
To evaluate predictions and write TM‑score/RMSD along with aligned PDBs:
```bash
python evaluation.py
```

Notes:
- Set `data_path` to an `.h5` dataset that follows HDF5 format used by this repo.
- To visualize results as PDB, convert `.h5` outputs with `data/h5_to_pdb.py`.
Example config template (configs/evaluation_config.yaml):
```yaml
trained_model_dir: "/abs/path/to/trained_model"  # Folder containing checkpoint and saved YAMLs
checkpoint_path: "checkpoints/best_valid.pth"    # Relative to trained_model_dir
config_vqvae: "config_vqvae.yaml"                # Names of saved training YAMLs
config_encoder: "config_gcpnet_encoder.yaml"
config_decoder: "config_geometric_decoder.yaml"
data_path: "/abs/path/to/evaluation/data.h5"     # HDF5 used for evaluation
output_base_dir: "evaluation_results"            # A timestamped subdir is created inside
batch_size: 8
shuffle: true
num_workers: 0
max_task_samples: 5000000                        # Optional cap
vq_indices_csv_filename: "vq_indices.csv"        # Also writes observed VQ indices
alignment_strategy: "kabsch"                     # "kabsch" or "no"
mixed_precision: "bf16"                          # "no", "fp16", "bf16", "fp8"
tqdm_progress_bar: true
```

We evaluated additional VQ-VAE backbones alongside GCP-VQVAE:
- ESM3 VQVAE (forked repo: mahdip72/esm) – the community can reuse `pdb_to_tokens.py` and `tokens_to_pdb.py`, which we authored because the upstream project lacks ready-to-use scripts.
- FoldToken-4 (forked repo: mahdip72/FoldToken_open) – we rewrote `foldtoken/pdb_to_token.py` and `foldtoken/token_to_pdb.py` for better performance and efficiency, with a negligible increase in error.
- Structure Tokenizer (instadeepai/protein-structure-tokenizer) – results reproduced with the official implementation.
We welcome independent validation of our ESM3 and FoldToken-4 conversion scripts to further confirm their correctness.
The table below reproduces Table 2 from the manuscript: reconstruction accuracy on community benchmarks and a zero-shot setting. Metrics are backbone TM-score (↑) and RMSD in Å (↓).
| Dataset | Metric | GCP-VQVAE (Ours) | GCP-VQVAE Lite (Ours) | FoldToken 4 (Gao et al., 2024c) | ESM-3 VQVAE (Hayes et al., 2025) | Structure Tokenizer (Gaujac et al., 2024) |
|---|---|---|---|---|---|---|
| CASP14 | TM-score | 0.9890 | 0.9751 | 0.5410 | 0.5042 | 0.3624 |
| | RMSD | 0.5431 | 0.8435 | 8.9838 | 10.4611 | 10.5344 |
| CASP15 | TM-score | 0.9884 | 0.9665 | 0.3289 | 0.3206 | 0.2329 |
| | RMSD | 0.5293 | 0.9219 | 14.6702 | 13.1877 | 14.8956 |
| CASP16 | TM-score | 0.9857 | 0.9757 | 0.8055 | 0.7685 | 0.6058 |
| | RMSD | 0.7567 | 1.0614 | 5.5094 | 8.2640 | 8.7106 |
| CAMEO2024 | TM-score | 0.9918 | 0.9794 | 0.4784 | 0.4633 | 0.3575 |
| | RMSD | 0.4377 | 0.7401 | 12.1089 | 12.1138 | 13.5360 |
| Zero-Shot | TM-score | 0.9673 | 0.9466 | 0.3324 | 0.3131 | - |
| | RMSD | 0.8193 | 1.1152 | 17.4449 | 18.9335 | - |
Notes:
- FoldToken 4 uses a 256-size vocabulary; others use 4096.
- The Structure Tokenizer of Gaujac et al. (2024) only supports sequences of length 50–512; out-of-range samples are excluded for that column only.
- Zero-shot results for Gaujac et al. (2024) are omitted due to limited coverage.
- Evaluation scripts for baselines were reproduced where public tooling was incomplete; see repository docs for details.
- Added an experimental option to compress token sequences using latent codebooks inspired by ByteDance’s 1D tokenizer; this enables configurable compression factors within our VQ pipeline.
- Introduced TikTok residual quantization (multi-depth VQ) using a shared codebook when `tik_tok.residual_depth > 1`. Residual latents are packed depth-by-depth, flattened into a single stream for NTP and decoding, and their masks/embeddings remain aligned with the flattened indices. This improves reconstruction capacity without expanding the base codebook.
- Included an optional next-token prediction head, drawing on the autoregressive regularization ideas from “When Worse is Better: Navigating the Compression-Generation Tradeoff in Visual Tokenization”, to encourage codebooks that are friendlier to autoregressive modeling.
- Enabled adaptive loss coefficients driven by gradient norms: each active loss (MSE, distance/direction, VQ, NTP) tracks its synchronized gradient magnitude and scales its weight toward the 0.2–5.0 norm “comfort zone.” Coefficients shrink when a loss overpowers the rest and grow when its gradients fade, keeping the multi-objective training balanced without constant manual re-tuning. A minimal sketch of this balancing rule follows below.
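The sketch below illustrates the balancing rule in isolation: each loss term's gradient norm is measured, and its coefficient is nudged down when the norm exceeds 5.0 and up when it falls below 0.2. Names such as `GradNormBalancer` are hypothetical, and the repository's actual update rule (smoothing, cross-GPU synchronization, clamping) may differ.

```python
# Hypothetical sketch of gradient-norm-driven loss balancing (names are made up;
# the repository's implementation may smooth, synchronize, and clamp differently).
import torch

class GradNormBalancer:
    def __init__(self, names, low=0.2, high=5.0, step=0.05):
        self.coeff = {n: 1.0 for n in names}
        self.low, self.high, self.step = low, high, step

    def grad_norm(self, loss, params):
        grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        return torch.sqrt(sum((g ** 2).sum() for g in grads if g is not None))

    def update(self, losses, params):
        """losses: dict name -> scalar loss tensor. Adjusts coefficients, returns total loss."""
        for name, loss in losses.items():
            norm = self.grad_norm(self.coeff[name] * loss, params).item()
            if norm > self.high:          # this loss overpowers the rest: shrink its weight
                self.coeff[name] *= (1.0 - self.step)
            elif norm < self.low:         # this loss is fading: grow its weight
                self.coeff[name] *= (1.0 + self.step)
        return sum(self.coeff[n] * losses[n] for n in losses)

# Toy usage with two losses on a small linear model.
model = torch.nn.Linear(8, 8)
x, y = torch.randn(16, 8), torch.randn(16, 8)
pred = model(x)
balancer = GradNormBalancer(["mse", "l1"])
losses = {"mse": torch.nn.functional.mse_loss(pred, y), "l1": (pred - y).abs().mean()}
total = balancer.update(losses, list(model.parameters()))
total.backward()
```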
This repository builds upon several excellent open-source projects:
- vector-quantize-pytorch – Vector quantization implementations used in our VQ-VAE architecture.
- x-transformers – Transformer components integrated into our encoder and decoder modules of VQ-VAE.
- ProteinWorkshop – We slimmed the original workshop down to the GCPNet core and extended it with:
  - opt-in `torch.compile` support that now powers our training, inference, and evaluation paths (configurable in every config file).
  - faster aggregation backends: a CSR/`torch.segment_reduce` implementation for PyTorch-only setups and an automatic switch to cuGraph-ops fused aggregators when `pylibcugraphops` is installed (a minimal sketch of the segment-reduce idea follows this list).
  - multi-GPU friendly evaluation utilities that reuse the same compilation flags as training, keeping the code path consistent across scripts.
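As a rough illustration of the segment-reduce aggregation path, the sketch below mean-pools edge messages per destination node with `torch.segment_reduce`, assuming the messages are already sorted by destination (CSR-style ordering). The repository's backend selection and kernels may differ.

```python
# Illustrative sketch of segment-based neighbor aggregation in the spirit of the
# CSR/torch.segment_reduce backend; assumes edge messages are already sorted by
# destination node. The repository's actual implementation may differ.
import torch

def mean_aggregate(messages: torch.Tensor, dst_sorted: torch.Tensor, num_nodes: int) -> torch.Tensor:
    """messages: (E, F) edge features; dst_sorted: (E,) destination ids in ascending order."""
    counts = torch.bincount(dst_sorted, minlength=num_nodes)   # edges per destination node
    pooled = torch.segment_reduce(messages, "mean", lengths=counts)
    return torch.nan_to_num(pooled)                            # guard for isolated nodes

msgs = torch.randn(6, 4)
dst = torch.tensor([0, 0, 1, 2, 2, 3])                         # sorted by destination
print(mean_aggregate(msgs, dst, num_nodes=4).shape)            # torch.Size([4, 4])
```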
We gratefully acknowledge NVIDIA for providing computational resources via their dedicated server that made this project feasible, enabling us to train and evaluate our models at scale.
If you use this code or the pretrained models, please cite the following paper:
@article{Pourmirzaei2025gcpvqvae,
author = {Pourmirzaei, Mahdi and Morehead, Alex and Esmaili, Farzaneh and Ren, Jarett and Pourmirzaei, Mohammadreza and Xu, Dong},
title = {GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure},
journal = {bioRxiv},
year = {2025},
doi = {10.1101/2025.10.01.679833},
url = {https://www.biorxiv.org/content/10.1101/2025.10.01.679833v1}
}
