Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models

Overview

Genome-Factory is a Python-based integrated library for tuning, deploying, and interpreting genomic foundation models (GFMs). The framework consists of six components:

  • Genome Collector acquires genomic sequences from public repositories and performs preprocessing (e.g., GC normalization, ambiguous base correction).
  • Model Loader supports major genomic models (e.g., GenomeOcean, EVO, DNABERT-2, HyenaDNA, Caduceus, Nucleotide Transformer) and their tokenizers.
  • Model Trainer configures workflows, adapts models to classification or regression tasks, and executes training with full fine-tuning or parameter-efficient methods (LoRA, adapters).
  • Inference Engine enables embedding extraction and sequence generation.
  • Benchmarker provides standard benchmarks and allows integration of custom evaluation tasks.
  • Biological Interpreter enhances interpretability through sparse autoencoders.

Supported Models

The "Variant Type" column specifies how model variants differ: by parameter size or by maximum input sequence length.

Model Name               Variant Type      Variants
GenomeOcean              Parameter Size    100M / 500M / 4B
EVO                      Sequence Length   8K / 131K
DNABERT-2                Parameter Size    117M
HyenaDNA                 Sequence Length   1K / 16K / 32K / 160K / 450K / 1M
Caduceus                 Sequence Length   1K / 131K
Nucleotide Transformer   Parameter Size    50M / 100M / 250M / 500M / 1B / 2.5B

Installation

  1. Clone the repository:

    git clone https://github.com/MAGICS-LAB/Genome_Factory.git
    cd Genome_Factory
  2. Install dependencies:

    # Install primary Python dependencies from requirements file
    pip install -r requirements.txt
    
    # Install CUDA Toolkit and Compiler 
    # Ensure you have a compatible NVIDIA driver installed and are in a Conda environment.
    conda install cudatoolkit==11.8 -c nvidia
    conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc
    
    # Install additional dependencies for specific features (e.g., Mamba support, Flash Attention)
    
    pip install mamba-ssm==2.2.2 flash-attn==2.7.2.post1
    
    # Install NCBI Datasets CLI (required for NCBI data download feature)
    conda install conda-forge::ncbi-datasets-cli
    
    # Install EVO from source
    git clone https://github.com/evo-design/evo.git
    cd evo
    pip install .
    cd .. 
    # IMPORTANT: Return to the Genome-Factory root directory before the next step
    
    # Install Genome-Factory in editable mode
    pip install -e .
  3. Environment Notes:

    • For GenomeOcean, use transformers==4.44.2.
    • For other models, use transformers==4.29.2.
    • For DNABERT-2, ensure triton is uninstalled: pip uninstall triton.

Usage via CLI (genomefactory-cli)

Genome-Factory uses YAML configuration files to define tasks. Example files are provided in genomeFactory/Examples/. You can customize the parameters within these files, ensuring you maintain the required YAML structure.
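
A typical workflow is to copy one of the provided examples, edit it, and pass the edited file to the CLI. The snippet below uses the train_full.yaml example shipped with the repository; the copy destination is arbitrary:

    # Copy an example training config and adjust it to your model, data, and task
    cp genomeFactory/Examples/train_full.yaml my_train.yaml
    # After editing my_train.yaml, run the corresponding CLI command:
    genomefactory-cli train my_train.yaml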

Data Download

Download genomic data from NCBI:

  1. Using a config file: Specify download parameters in a YAML file. Download by species:
    genomefactory-cli download genomeFactory/Examples/download_by_species.yaml
    Download by link:
    genomefactory-cli download genomeFactory/Examples/download_by_link.yaml
  2. Interactively: Run the command without a config file and follow the prompts in the terminal to specify your download criteria (supports both species-based and link-based downloads).
    genomefactory-cli download

Note: The list of species and their taxon IDs used for downloads is stored in genomeFactory/Data/Download/Datasets_species_taxonid_dict.json. This file is not exhaustive; you can extend it by adding new species-to-taxonID pairs to download data for other species as needed.
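
For illustration, each entry in that file pairs a species name with its NCBI taxonomy ID. The exact key format in the shipped file may differ, so treat the snippet below as a sketch (the taxon IDs shown are the actual NCBI IDs for these species):

    {
      "Homo sapiens": 9606,
      "Mus musculus": 10090
    }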

Data Processing

Genome-Factory provides tools to prepare data for model fine-tuning. This includes processing data downloaded from NCBI or formatting your own custom datasets.

1. Processing NCBI Data:

  • Gather data downloaded using the download command into a single folder.
  • Run the processing command with a config file:
    genomefactory-cli process genomeFactory/Examples/process_normal.yaml
  • The processed data will be ready for input into the model for fine-tuning.

2. Preparing Custom Datasets:

If you have your own dataset, format it as follows:

  • Separate your data into three CSV files: train.csv, dev.csv, and test.csv.
  • Each CSV file must have two columns:
    • The first column should contain the DNA sequences (e.g., sequence).
    • The second column should contain the corresponding labels (e.g., label).
      • For classification tasks, labels should be integers (e.g., 0, 1, 2...).
      • For regression tasks, labels should be continuous numbers.
  • Place these three CSV files (train.csv, dev.csv, test.csv) together in a single folder.
  • This folder can then be specified as the input data directory in your training configuration YAML file.
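
For example, for a binary classification task, each of the three CSV files might look like the sketch below (the column names sequence and label follow the examples above; the sequences and labels are made up):

    sequence,label
    ACGTTAGCCGGATACGTTAGCA,0
    TTGACGGCTAGCTAACCGGTTA,1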

3. Advanced Processing Features:

Genome-Factory provides specialized dataset generation tools for common genomic machine learning tasks:

  • Promoter region dataset: Generate promoter vs. non-promoter classification data from the EPDnew database (hg38, mm10, danRer11)

    genomefactory-cli process genomeFactory/Examples/process_promoter.yaml
  • Epigenetic mark dataset: Create gene body sequences with H3K36me3 signal classification from ENCODE/Roadmap data (hg38, mm10)

    genomefactory-cli process genomeFactory/Examples/process_emp.yaml
  • Enhancer region dataset: Build enhancer vs. non-enhancer classification data from FANTOM5 annotations (hg38, mm10)

    genomefactory-cli process genomeFactory/Examples/process_enhancer.yaml
  • All datasets feature quality control, configurable train/val/test splits, and output CSV files with sequence,label format.

Training

For fine-tuning GFMs, Genome-Factory supports two primary task types: classification and regression. You specify the desired task_type in the training YAML configuration file.

Fine-tune GFMs using different methods:

  • Full Fine-tuning:

    genomefactory-cli train genomeFactory/Examples/train_full.yaml
  • LoRA (Low-Rank Adaptation):

    genomefactory-cli train genomeFactory/Examples/train_lora.yaml
    • Specify target modules in the YAML file:
      • all: Targets all linear layers.
      • all_in_and_out_proj: Targets input/output projection layers and the final classification layer.
      • Custom: Specify module names directly.
    • For Evo:
      genomefactory-cli train genomeFactory/Examples/train_evo_lora.yaml
  • Adapter:

    genomefactory-cli train genomeFactory/Examples/train_adapter.yaml
    • Customize the adapter architecture in genomeFactory/Train/workflow/adapter/adapter_model/Adapter.py for potentially better performance on specific downstream tasks.

    Note: Training settings like batch size, learning rate, and epochs can be customized in the respective YAML files for all methods.

    Note on Flash Attention: To enable Flash Attention, set the flash_attention argument to true in your YAML configuration file. You must also enable mixed-precision training by setting either bf16: true or fp16: true. If flash_attention is set to false, or if a specific GFM does not support this argument, the model's default attention mechanism will be used. The relevant keys are shown in the sketch below.

    Benchmarking: After fine-tuning, performance metrics are saved to a JSON file. You can use these metrics for benchmarking (e.g., comparing the performance of different models or tuning methods on specific tasks).
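
    For reference, these settings correspond to YAML keys like the following. Only task_type, flash_attention, bf16, and fp16 are key names taken from this README; the values shown are illustrative placeholders, not a complete training config:

      task_type: classification   # or regression, matching your dataset's labels
      flash_attention: true       # requires mixed precision (bf16 or fp16)
      bf16: true                  # alternatively fp16: true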

Inference

Use trained models for prediction, generation, or embedding extraction:

  1. Prediction: (Predict properties of DNA sequences). Ensure the task_type specified in your inference YAML file (classification or regression) matches the task the model was originally fine-tuned for.

    • Full:
      genomefactory-cli inference genomeFactory/Examples/inference_full.yaml
    • LoRA:
      genomefactory-cli inference genomeFactory/Examples/inference_lora.yaml
    • Adapter:
      genomefactory-cli inference genomeFactory/Examples/inference_adapter.yaml
      • Note: For Adapter-based classification, specify the number of labels (num_label) in the YAML. For regression, set num_label: 1. Full/LoRA methods infer this automatically. See the sketch after this list.
  2. Generation: (Generate new DNA sequences based on existing ones). Applicable to compatible GFMs.

    • For GenomeOcean:
      genomefactory-cli inference genomeFactory/Examples/inference_generation_genomeocean.yaml
    • For Evo:
      genomefactory-cli inference genomeFactory/Examples/inference_generation_evo.yaml
  3. Embedding Extraction: (Extract the last hidden state embeddings from sequences).

    • General Case:
      genomefactory-cli inference genomeFactory/Examples/inference_extract.yaml
    • For Evo specifically:
      genomefactory-cli inference genomeFactory/Examples/inference_extract_evo.yaml
  4. Protein Generation: (Generate biologically realistic protein sequences with structural constraints via FoldMason integration).

  • Structure-aware generation: Apply structural constraints during sequence generation
  • Multi-model support: Evo and GenomeOcean
  • Length control: Flexible sequence lengths
  • Genomic context: Condition on specified genomic coordinates
  • Batch processing: Generate multiple variants

Run:

genomefactory-cli protein genomeFactory/Examples/protein_generation.yaml
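
For the adapter-specific setting mentioned in step 1, an adapter inference config would contain entries like the following. Here, task_type and num_label are the keys named above; the values are placeholders for a hypothetical two-class task:

    task_type: classification
    num_label: 2    # number of classes; use num_label: 1 for regression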

Interpretation

Genome-Factory provides tools for interpreting genomic foundation models through sparse autoencoder (SAE) analysis, offering insight into model behavior and its biological significance.

Sparse Autoencoder (SAE) Analysis

  • Latent Feature Discovery: Identify interpretable features learned by genomic foundation models
  • Ridge Regression Evaluation: Quantitative assessment of feature importance for downstream tasks
  • First-token vs. Mean-pooled Analysis: Compare different pooling strategies for sequence representation
  • Feature Weight Analysis: Understand which SAE features contribute most to biological predictions

Quick Start Guide

SAE-Based Feature Analysis

Complete workflow for SAE training and interpretation:

Step 1: Train SAE Model
genomefactory-cli sae_train genomeFactory/Examples/sae_train.yaml

Configure the following parameters in the YAML file:

data_file: "<YOUR_SEQUENCE_FILE>"
d_model: <MODEL_DIMENSION>
d_hidden: <HIDDEN_DIMENSION>
batch_size: <BATCH_SIZE>
lr: <LEARNING_RATE>
k: <K_VALUE>
auxk: <AUXK_VALUE>
dead_steps_threshold: <THRESHOLD_STEPS>
max_epochs: <MAX_EPOCHS>
num_devices: <NUM_DEVICES>
model_suffix: "<MODEL_SUFFIX>"
wandb_project: "<PROJECT_NAME>"
num_workers: <NUM_WORKERS>
model_name: "<MODEL_NAME>"
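
A minimal filled-in sketch, purely for illustration: the numbers below are arbitrary placeholders rather than recommended values, the path is hypothetical, and d_model must match the hidden dimension of the GFM whose embeddings you are analyzing:

data_file: "data/embedding_sequences.csv"   # hypothetical path to your sequence file
d_model: 768                                # placeholder; set to the GFM's hidden size
d_hidden: 8192                              # placeholder SAE latent width
batch_size: 64
lr: 0.0001
k: 64
auxk: 256
dead_steps_threshold: 2000
max_epochs: 1
num_devices: 1
model_suffix: "demo"
wandb_project: "genome-factory-sae"
num_workers: 4
model_name: "<MODEL_NAME>"                  # keep the model identifier expected by Genome-Factory
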
Step 2: Downstream Evaluations with Ridge Regression

A. First-token latent embedding analysis:

genomefactory-cli sae_regression genomeFactory/Examples/sae_regression.yaml

Configure the following parameters in the YAML file:

csv_path: "<FEATURE_CSV_PATH>"
sae_checkpoint_path: "<SAE_CHECKPOINT_PATH>"
output_path: "<OUTPUT_CSV_PATH>"
type: "first_token"

B. Mean-pooled latent embedding analysis:

genomefactory-cli sae_regression genomeFactory/Examples/sae_regression.yaml

Configure the following parameters in the YAML file:

csv_path: "<FEATURE_CSV_PATH>"
sae_checkpoint_path: "<SAE_CHECKPOINT_PATH>"
output_path: "<OUTPUT_CSV_PATH>"
type: "mean"

Usage via Web UI

Access all Genome-Factory functionalities through a graphical interface:

genomefactory-cli webui

This command launches a web server. Open the provided URL in your browser to use the WebUI.

Citation

If you find Genome-Factory useful, please consider citing our work:

@misc{genomefactory2025,
  title     = {Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models},
  author    = {Weimin Wu and Xuefeng Song and Yibo Wen and Qinjie Lin and Zhihan Zhou and Jerry Yao-Chieh Hu and Zhong Wang and Han Liu},
  year      = {2025},
  archivePrefix = {arXiv},
  url       = {https://github.com/MAGICS-LAB/Genome_Factory}
}

Reference

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models.
Zheng, Yaowei, Richong Zhang, Junhao Zhang, YeYanhan YeYanhan, and Zheyan Luo.
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp. 400-410. 2024.
