SAAINT

SAAINT stands for Structural Antibody and Antibody-Antigen INTeraction.

This package provides:

- An implementation of the SAAINT-parser workflow, designed for fast and accurate extraction and annotation of antibody (Ab) structures and antibody-antigen interactions (AAIs) from the Protein Data Bank (PDB). The result is a comprehensive and up-to-date structural antibody and antibody-antigen interaction database, SAAINT-DB.
- Source code for building, analyzing, and updating SAAINT-DB.

Note: The large SAAINT-DB structure datasets cannot be hosted here because of GitHub's file size limits. Users can download the unprocessed PDB structures (in PDBx/mmCIF format) and the SAAINT-parser-processed structure models (in PDB format) from Zenodo.

Installation and running SAAINT-parser

Configure the environment

conda create -n saaint python=3.10 biopython=1.84 numpy=1.26.3 matplotlib=3.10.0 seaborn=0.13.2 urllib3=2.2.1 pandas=2.2.3
conda activate saaint

Clone this repository

git clone https://github.com/tommyhuangthu/SAAINT.git
cd SAAINT/
mkdir -p record database/mmCIF_divided database/fasta_divided database/saaint_divided

Clone the UniDesign, FASPR, and Pulchra repositories (used by SAAINT-parser)
```
git clone https://github.com/tommyhuangthu/UniDesign.git
git clone https://github.com/tommyhuangthu/FASPR.git
git clone https://github.com/euplotes/pulchra.git
```
Then cd into each repository and build the executables of UniDesign, FASPR, and Pulchra, following their instructions.
For clarity, we assume these repositories are saved in the user's home directory; any location works as long as the paths to the executables are set correctly (see below).
Modify related directories and paths within the source code before running

Some directories and paths are hard-coded in the source code, so running the code unmodified may cause errors. Assuming the user's home directory is /home/username and the SAAINT package is saved as /home/username/SAAINT, the following paths need to be changed:
- In scripts/sbatch_rsync_mmcifs.sh: Change /home/xiaoqiah/turbo/work/SAAINT to /home/username/SAAINT.
- In scripts/sbatch_update_fastas.sh: Change /home/xiaoqiah/turbo/work/SAAINT to /home/username/SAAINT.
- In scripts/utils.py: Change the paths for UniDesign, FASPR, and Pulchra.
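The substitutions in the two SLURM scripts can also be applied programmatically. The following is a minimal sketch, not part of the SAAINT package: the helper name `rewrite_hardcoded_path` is hypothetical, and the script assumes it is run from the top of the cloned repository. The paths in scripts/utils.py are not covered, because they depend on where UniDesign, FASPR, and Pulchra were built.

```python
from pathlib import Path

# Hard-coded location quoted in the instructions above.
OLD_ROOT = "/home/xiaoqiah/turbo/work/SAAINT"


def rewrite_hardcoded_path(script: Path, new_root: str, old_root: str = OLD_ROOT) -> int:
    """Replace old_root with new_root in one file and return the number of replacements."""
    text = script.read_text()
    count = text.count(old_root)
    if count:
        script.write_text(text.replace(old_root, new_root))
    return count


if __name__ == "__main__":
    # Assumption: the repository was cloned into the user's home directory.
    new_root = str(Path.home() / "SAAINT")
    for name in ("scripts/sbatch_rsync_mmcifs.sh", "scripts/sbatch_update_fastas.sh"):
        path = Path(name)
        if path.is_file():
            print(name, rewrite_hardcoded_path(path, new_root))
```

Running the script a second time reports zero replacements, which is a quick way to confirm that no old path remains.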
Rsync mmCIF files from the Protein Data Bank (PDB) and download mmCIF-associated FASTA files
- Download mmCIF files from the PDB
```
sbatch scripts/sbatch_rsync_mmcifs.sh
```
The SLURM script scripts/sbatch_rsync_mmcifs.sh executes scripts/run_rsync_mmcifs.py to download mmCIF files from PDBe. The downloaded files are stored in the database/mmCIF_divided directory, organized into subfolders named after the two middle letters of the corresponding PDB entries. For example, for PDB entry 5zxv, the script automatically retrieves its mmCIF file and saves it as database/mmCIF_divided/zx/5zxv.cif.gz. Users can modify scripts/run_rsync_mmcifs.py to download mmCIF files more efficiently from RCSB PDB, PDBe, or PDBj, depending on their location. For more details, refer to the instructions at wwPDB.
- Download the associated FASTA files
```
sbatch scripts/sbatch_update_fastas.sh
```
The SLURM script scripts/sbatch_update_fastas.sh executes scripts/run_update_fastas.py to download or update FASTA files from the RCSB PDB website (see below for updating). The script scripts/run_update_fastas.py analyzes the output of the SLURM job scripts/sbatch_rsync_mmcifs.sh to identify the PDB entries that need updating or have become obsolete, and generates two mmCIF list files: database/list_update_cifs.txt and database/list_obsolete_cifs.txt. The downloaded FASTA files are stored in the database/fasta_divided directory, organized into subfolders named after the two middle letters of the corresponding PDB entries. For example, for PDB entry 5zxv, the script automatically retrieves its FASTA content from RCSB PDB and saves it as database/fasta_divided/zx/5zxv.fasta.
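The "two middle letters" layout used for both mmCIF and FASTA files can be expressed as a small helper. This is an illustrative sketch only; the function name `divided_path` is hypothetical and is not taken from the SAAINT scripts, and lowercasing the ID is an assumption based on the lowercase examples above.

```python
from pathlib import Path


def divided_path(pdb_id: str, root: str, suffix: str) -> Path:
    """Return root/<two middle letters>/<pdb_id><suffix> for a 4-character PDB ID."""
    pdb_id = pdb_id.lower()  # assumption: file names use lowercase IDs
    if len(pdb_id) != 4:
        raise ValueError(f"expected a 4-character PDB ID, got {pdb_id!r}")
    # Characters 2 and 3 of the ID name the subfolder (e.g. "zx" for "5zxv").
    return Path(root) / pdb_id[1:3] / f"{pdb_id}{suffix}"


print(divided_path("5zxv", "database/mmCIF_divided", ".cif.gz"))
print(divided_path("5zxv", "database/fasta_divided", ".fasta"))
```

For 5zxv, the two calls reproduce the example locations given above for the mmCIF and FASTA files.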
Run and test SAAINT-parser

Users can run SAAINT-parser with a PDB entry as the only input. For example:
```
python3 ./scripts/run_saaint_parser.py 5zxv
```
To enable verbose output for debugging, use the --verbose (or -v) option:
```
python3 ./scripts/run_saaint_parser.py -v 5zxv
```
Successfully identified antibodies and antibody-antigen interactions are saved in a folder named after the two middle letters of the PDB entry. For example, for 5zxv, the results are stored in the zx folder:
zx/5zxv_aai_all.tsv: contains all identified antibodies or AAIs.
zx/5zxv_paired_ab_ag_ids.tsv: lists paired antibody-antigen chain IDs.
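To inspect these result files from Python, something like the following sketch may be enough. It assumes that each TSV file starts with a header row, which is not stated above, and it makes no assumption about the column names. The function names are hypothetical.

```python
import csv
from pathlib import Path


def load_tsv(path: Path) -> list[dict[str, str]]:
    """Read a tab-separated file with a header row into a list of row dicts."""
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def load_entry_results(pdb_id: str, work_dir: str = ".") -> dict[str, list[dict[str, str]]]:
    """Load the result tables of one PDB entry, skipping files that do not exist."""
    folder = Path(work_dir) / pdb_id[1:3]  # e.g. "zx" for "5zxv"
    results = {}
    for kind in ("aai_all", "paired_ab_ag_ids"):
        path = folder / f"{pdb_id}_{kind}.tsv"
        if path.is_file():
            results[kind] = load_tsv(path)
    return results


if __name__ == "__main__":
    for kind, rows in load_entry_results("5zxv").items():
        print(kind, len(rows), "rows")
```

An entry with no identified antibody yields an empty dictionary rather than an error, since missing files are skipped.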

Building SAAINT-DB

Run SAAINT-parser to process all mmCIF files

To process all mmCIF files using SAAINT-parser, run:
```
python3 ./scripts/run_submit_saaint_parser_jobs.py -path <mmCIF_path> <work_dir> <n_cpus>
```
<mmCIF_path>: Path to the directory containing the downloaded mmCIF files (e.g., database/mmCIF_divided).
<work_dir>: Directory for saving intermediate outputs and final results generated by SAAINT-parser, e.g., database/saaint_divided.
<n_cpus>: Number of CPU cores allocated for processing, e.g., 300 (A large number of CPUs is recommended for initial construction).
We recommend using this command for the initial construction of the SAAINT-DB database from scratch.

Alternatively, to process a specific set of mmCIF files, run:
```
python3 ./scripts/run_submit_saaint_parser_jobs.py -list <mmCIF_list> <work_dir> <n_cpus>
```
<mmCIF_list>: List of mmCIF files to be processed, e.g., database/list_update_cifs.txt.
<work_dir> and <n_cpus>: Same as described above.
We recommend using the command with the -list option to update SAAINT-DB efficiently.
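When the submission step is driven from another Python script, the two documented invocations can be assembled as an argument list. This is a sketch under the assumption that the script accepts exactly the arguments shown above; the helper name `build_parser_jobs_command` is hypothetical.

```python
def build_parser_jobs_command(mode: str, source: str, work_dir: str, n_cpus: int) -> list[str]:
    """Assemble the argument list for run_submit_saaint_parser_jobs.py.

    mode is "path" (source is a directory of mmCIF files) or
    "list" (source is a file listing the mmCIF files to process).
    """
    if mode not in ("path", "list"):
        raise ValueError("mode must be 'path' or 'list'")
    if n_cpus < 1:
        raise ValueError("n_cpus must be a positive integer")
    return [
        "python3",
        "./scripts/run_submit_saaint_parser_jobs.py",
        f"-{mode}",
        source,
        work_dir,
        str(n_cpus),
    ]


# Initial construction from scratch versus an incremental update.
print(build_parser_jobs_command("path", "database/mmCIF_divided", "database/saaint_divided", 300))
print(build_parser_jobs_command("list", "database/list_update_cifs.txt", "database/saaint_divided", 5))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.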
Collect SAAINT-parser results to build the SAAINT-DB datasets

To build the SAAINT-DB databases, run:
```
python3 ./scripts/run_saaintdb_builder.py
```

This command merges all *_aai_all.tsv files to generate the SAAINT-DB summary file.
The SAAINT-DB full dataset is saved as saaintdb_{timestamp}_all.xlsx and saaintdb_{timestamp}_all.tsv. Here, {timestamp} (e.g., 20250501) represents the date of the database build.
For more details, refer to saaintdb/saaintdb_20250501_all.xlsx and saaintdb/saaintdb_20250501_all.tsv.
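The merging step can be pictured with the sketch below. It is not the code of run_saaintdb_builder.py, which also writes the .xlsx file and may do additional processing. The sketch assumes that every *_aai_all.tsv file sits in a two-letter subfolder of the work directory and shares the same header row; the function name `merge_aai_tables` is hypothetical.

```python
import csv
from datetime import date
from pathlib import Path
from typing import Optional


def merge_aai_tables(work_dir: str, out_dir: str, timestamp: Optional[str] = None) -> Path:
    """Concatenate every */*_aai_all.tsv under work_dir into one TSV file."""
    timestamp = timestamp or date.today().strftime("%Y%m%d")
    out_path = Path(out_dir) / f"saaintdb_{timestamp}_all.tsv"
    header = None
    with out_path.open("w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for path in sorted(Path(work_dir).glob("*/*_aai_all.tsv")):
            with path.open(newline="") as handle:
                rows = list(csv.reader(handle, delimiter="\t"))
            if not rows:
                continue  # skip empty files
            if header is None:
                header = rows[0]
                writer.writerow(header)  # write the shared header once
            elif rows[0] != header:
                raise ValueError(f"unexpected header in {path}")
            writer.writerows(rows[1:])
    return out_path
```

A call such as `merge_aai_tables("database/saaint_divided", "saaintdb", "20250501")` would write saaintdb/saaintdb_20250501_all.tsv, provided the output directory exists.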

Analyzing SAAINT-DB and antibody-antigen binding affinity data

Analyze SAAINT-DB

To compute statistics on the SAAINT-DB data entries, run:
```
python3 scripts/run_saaintdb_analyzer.py saaintdb/saaintdb_2025012908_all.xlsx -j <jobname>
```
Here, <jobname> can take the following options:
num_entries: Count the number of PDB and data entries.
num_entries_with_paired_vhvl: Count the PDB and data entries with paired VH/VL chains.
num_entries_with_ag: Count the PDB and data entries with antigen.
date: Analyze the deposition and release dates of PDB entries.
classification: Examine the classification of PDB entries.
method: Analyze the experimental methods used to determine PDB structures.
resolution: Analyze the resolution of X-ray and EM structures.
publication: Analyze PDB-associated publication details, including PMID and DOI.
asym_id: Analyze the types of asym_id used to parse PDB entries.
plot_pdb_num: Plot the number of PDB entries over released years.
ab_spe: Analyze the top species for antibody heavy and light chains.
ab_type: Analyze antibody types annotated by the SAAINT parser.
HL_inf_res_num: Analyze and plot the number of heavy chain-light chain interface residues per antibody type.
HL_chain_len: Analyze and plot the length distribution of heavy and light chains.
radius: Analyze and plot the distribution of mean radius for scFv and VHVL types.
ag_spe: Analyze the top antigen species.
ag_type: Analyze antigen types.
ab_ag_inf_res_num: Plot a histogram of antibody-antigen interface residue counts.
cdr_inf_res_num: Plot a histogram of interface CDR residue counts.
cdr_inf_res_ratio: Plot a histogram of the ratio of interface CDR residues to total interface antibody residues.
ag_chain_num: Analyze the number of antigens with varying chain counts.
Analyze antibody-antigen binding affinity data

To compute statistics on the antibody-antigen binding affinity data, run:
```
python3 scripts/run_saaintdb_analyzer.py saaintdb/saaintdb_affinity.tsv -j <jobname>
```
Here, <jobname> takes the following options:
affinity: Plot a histogram of pKd values.
num_pdbs_with_affinity: Count the number of PDB entries containing antibody-antigen binding affinity data.
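For readers less familiar with pKd: under the standard definition, pKd is the negative base-10 logarithm of the dissociation constant Kd expressed in mol/L, so stronger binding gives a larger pKd. The sketch below illustrates the conversion. The unit table and the function name `kd_to_pkd` are illustrative assumptions; how saaintdb_affinity.tsv stores its values is not described here.

```python
import math

# Conversion factors to molar (M); "uM" stands for micromolar.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def kd_to_pkd(kd: float, unit: str = "M") -> float:
    """Return pKd = -log10(Kd in mol/L) for a positive dissociation constant."""
    if kd <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd * UNIT_TO_MOLAR[unit])


for kd, unit in [(1.0, "nM"), (50.0, "pM"), (2.5, "uM")]:
    print(f"Kd = {kd} {unit} -> pKd = {kd_to_pkd(kd, unit):.2f}")
```

A Kd of 1 nM corresponds to a pKd of 9, and each tenfold decrease in Kd raises pKd by one unit.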

Updating SAAINT-DB

Update mmCIF and FASTA files
- Update mmCIF files
```
sbatch scripts/sbatch_rsync_mmcifs.sh
```
The SLURM script scripts/sbatch_rsync_mmcifs.sh runs scripts/run_rsync_mmcifs.py to update mmCIF files from PDBe. The output file rsync_mmCIF.out logs newly added, updated, and deleted PDB entries.
- Update FASTA files
```
sbatch scripts/sbatch_update_fastas.sh
```
The SLURM script scripts/sbatch_update_fastas.sh runs scripts/run_update_fastas.py to update FASTA files. Specifically, it downloads FASTA files for newly added and updated PDB entries and removes FASTA files for deleted entries. The script generates two output files: database/list_obsolete_cifs.txt, which records deleted PDB entries, and database/list_update_cifs.txt, which records newly added and updated PDB entries. It also deletes the SAAINT-parser results for deleted PDB entries.
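The idea of turning the rsync log into "updated" and "obsolete" lists can be sketched as follows. This is not the logic of scripts/run_update_fastas.py. It assumes that rsync_mmCIF.out contains plain verbose rsync output, in which transferred files appear as relative paths and removed files appear on lines starting with "deleting "; the actual log format and the contents of the two list files may differ. The function name `classify_rsync_log` is hypothetical.

```python
import re

# Matches e.g. "zx/5zxv.cif.gz" (transferred) or "deleting ab/1abc.cif.gz" (removed).
ENTRY = re.compile(r"^(?P<deleted>deleting )?(?:.*/)?(?P<pdb_id>[0-9a-z]{4})\.cif\.gz$")


def classify_rsync_log(lines):
    """Split rsync log lines into (updated_ids, obsolete_ids)."""
    updated, obsolete = [], []
    for line in lines:
        match = ENTRY.match(line.strip())
        if not match:
            continue  # header, summary, directory, or blank line
        target = obsolete if match["deleted"] else updated
        target.append(match["pdb_id"])
    return updated, obsolete


sample = [
    "receiving incremental file list",
    "deleting ab/1abc.cif.gz",
    "zx/",
    "zx/5zxv.cif.gz",
    "",
    "total size is 1,000  speedup is 3.33",
]
updated, obsolete = classify_rsync_log(sample)
print("updated:", updated)
print("obsolete:", obsolete)
```

In this sample, 5zxv ends up in the updated list and 1abc in the obsolete list; all other lines are ignored.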
Run SAAINT-parser to process updated PDB entries

To process updated mmCIF files, run:
```
python3 ./scripts/run_submit_saaint_parser_jobs.py -list <mmCIF_list> <work_dir> <n_cpus>
```
<mmCIF_list>: List of mmCIF files to be processed, e.g., database/list_update_cifs.txt.
<work_dir>: Directory for saving intermediate outputs and final results generated by SAAINT-parser, e.g., database/saaint_divided.
<n_cpus>: Number of CPU cores allocated for processing, e.g., 5 (A small number of CPUs is recommended for updating).
Collect SAAINT-parser results to update SAAINT-DB

To update the SAAINT-DB databases, run:
```
python3 ./scripts/run_saaintdb_builder.py
```

SAAINT-DB vs. SAbDab

Below is a statistical comparison between the initial published version of SAAINT-DB (v.20250501; updated on 1-May-2025) and SAbDab (v.20250502; updated on 2-May-2025).

| Item | SAAINT-DB | SAbDab |
|---|---|---|
| #{PDB entries} | 9,757 | 9,521 |
| #{data entries} | 19,128 | 18,744 |
| #{PDB entries with ≥1 paired VH/VL} | 7,822 | 7,622 |
| #{data entries with ≥1 paired VH/VL} | 14,747 | 14,420 |
| #{PDB entries with Ag} | 7,489 | 7,788 |
| #{data entries with Ag} | 14,316 | 14,869 |
| #{PDB entries with Ab-Ag binding affinity data} | 1,331 | 736 |
| #{nonredundant Ab-Ag binding affinity data} | 1,444 | 736 |

Statistics on the latest version of SAAINT-DB

SAAINT-DB is currently updated every other Thursday. The most recent statistics are as follows:

| Item | v.20251024 |
|---|---|
| #{PDB entries} | 10,398 |
| #{data entries} | 20,385 |
| #{PDB entries with ≥1 paired VH/VL} | 8,303 |
| #{data entries with ≥1 paired VH/VL} | 15,673 |
| #{PDB entries with Ag} | 8,066 |
| #{data entries with Ag} | 15,430 |
| #{PDB entries with Ab-Ag binding affinity data} | 1,331 |
| #{nonredundant Ab-Ag binding affinity data} | 1,444 |

Reference

Huang X, Zhou J, Chen S, Xia X, Chen YE, Xu J. SAAINT-DB: a comprehensive structural antibody database for antibody modeling and design. Acta Pharmacologica Sinica (2025). https://doi.org/10.1038/s41401-025-01608-5.

Contact

Please report any bugs or suggestions to Xiaoqiang Huang
