Genome-Factory is a Python-based integrated library for tuning and deploying genomic foundation models (GFMs). The framework consists of six components. Genome Collector acquires genomic sequences from public repositories and performs preprocessing (e.g., GC normalization, ambiguous base correction). Model Loader supports major genomic models (e.g., GenomeOcean, EVO, DNABERT-2, HyenaDNA, Caduceus, Nucleotide Transformer) and their tokenizers. Model Trainer configures workflows, adapts models to classification or regression tasks, and executes training with full fine-tuning or parameter-efficient methods (LoRA, adapters). Inference Engine enables embedding extraction and sequence generation. Benchmarker provides standard benchmarks and allows integration of custom evaluation tasks. Biological Interpreter enhances interpretability through sparse autoencoders.
The "Variant Type" column specifies how model variants differ: by parameter size or by maximum input sequence length.

| Model Name | Variant Type | Variants |
|---|---|---|
| GenomeOcean | Parameter Size | 100M / 500M / 4B |
| EVO | Sequence Length | 8K / 131K |
| DNABERT-2 | Parameter Size | 117M |
| HyenaDNA | Sequence Length | 1K / 16K / 32K / 160K / 450K / 1M |
| Caduceus | Sequence Length | 1K / 131K |
| Nucleotide Transformer | Parameter Size | 50M / 100M / 250M / 500M / 1B / 2.5B |
- Clone the repository:

      git clone https://github.com/xxx/Genome_Factory.git
      cd Genome_Factory

- Install dependencies:

      # Install primary Python dependencies from requirements file
      pip install -r requirements.txt

      # Install CUDA Toolkit and Compiler
      # Ensure you have a compatible NVIDIA driver installed and are in a Conda environment.
      conda install cudatoolkit==11.8 -c nvidia
      conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc

      # Install additional dependencies for specific features (e.g., Mamba support, Flash Attention)
      pip install mamba-ssm==2.2.2 flash-attn==2.7.2.post1

      # Install NCBI Datasets CLI (required for NCBI data download feature)
      conda install conda-forge::ncbi-datasets-cli

      # Install EVO from source
      git clone https://github.com/evo-design/evo.git
      cd evo
      pip install .
      cd ..  # IMPORTANT: Return to the Genome-Factory root directory before the next step

      # Install Genome-Factory in editable mode
      pip install -e .

- Environment Notes:
  - For GenomeOcean, use `transformers==4.44.2`.
  - For other models, use `transformers==4.29.2`.
  - For DNABERT-2, ensure `triton` is uninstalled: `pip uninstall triton`.

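The version pins in the environment notes can be encoded in a small lookup so that a setup script fails early on a mismatched install. This is a minimal sketch and not part of Genome-Factory: the helper names `required_transformers_version` and `installed_version_matches` are hypothetical, and the name normalization is an assumption. Only the two pinned versions come from the notes above.

```python
from importlib.metadata import PackageNotFoundError, version


def required_transformers_version(model_name: str) -> str:
    """Return the transformers pin from the environment notes for a model."""
    # Normalize spelling variants such as "Genome Ocean" and "GenomeOcean".
    key = model_name.replace(" ", "").replace("-", "").replace("_", "").lower()
    return "4.44.2" if key == "genomeocean" else "4.29.2"


def installed_version_matches(model_name: str) -> bool:
    """Check the installed transformers version against the required pin."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        # transformers is not installed at all.
        return False
    return installed == required_transformers_version(model_name)
```

Calling `installed_version_matches("DNABERT-2")` before training would report whether the active environment carries the pin listed for non-GenomeOcean models.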
Genome-Factory uses YAML configuration files to define tasks. Example files are provided in `genomeFactory/Examples/`. You can customize the parameters in these files, as long as you keep the required YAML structure.
Download genomic data from NCBI:

- Using a config file: Specify download parameters in a YAML file.
  - Download by species:

        genomefactory-cli download genomeFactory/Examples/download_by_species.yaml

  - Download by link:

        genomefactory-cli download genomeFactory/Examples/download_by_link.yaml

- Interactively: Run the command without a config file and follow the prompts in the terminal to specify your download criteria (supports both species-based and link-based downloads).

      genomefactory-cli download

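When downloads are driven from a script rather than typed by hand, the two invocation modes can be wrapped as follows. This is a sketch only: `download_command` and `run_download` are hypothetical helper names, and the only CLI syntax used is the `genomefactory-cli download [config]` form shown above.

```python
import subprocess
from typing import List, Optional


def download_command(config_path: Optional[str] = None) -> List[str]:
    """Build the argument list for a config-driven or interactive download."""
    cmd = ["genomefactory-cli", "download"]
    if config_path is not None:
        # With a YAML config the CLI runs non-interactively.
        cmd.append(config_path)
    return cmd


def run_download(config_path: Optional[str] = None) -> int:
    """Run the download and return the process exit code."""
    # check=False: report failures through the return code instead of raising.
    return subprocess.run(download_command(config_path), check=False).returncode
```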
Note: The list of species and their taxon IDs used for downloads is stored in `genomeFactory/Data/Download/Datasets_species_taxonid_dict.json`. This file is not exhaustive; you can extend it by adding new species-to-taxon-ID pairs to download data for other species as needed.
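Extending the species dictionary can be done by hand or with a few lines of Python. The sketch below assumes the JSON file is a flat object mapping species names to taxon IDs; the source does not show the file's layout, so check it first and pass the ID in the same type (string or integer) that the existing entries use. The function name `add_species` is hypothetical.

```python
import json
from pathlib import Path


def add_species(dict_path, species, taxon_id):
    """Add one species-to-taxon-ID pair to a flat JSON mapping (assumed layout)."""
    path = Path(dict_path)
    mapping = json.loads(path.read_text(encoding="utf-8"))
    existing = mapping.get(species)
    # Refuse to silently overwrite an entry that points at a different taxon.
    if existing is not None and str(existing) != str(taxon_id):
        raise ValueError(f"{species!r} already maps to {existing!r}")
    mapping[species] = taxon_id
    path.write_text(
        json.dumps(mapping, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return mapping
```

For example, zebrafish could be added with `add_species(path, "Danio rerio", "7955")`, where 7955 is the NCBI taxonomy ID for that species.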
Genome-Factory provides tools to prepare data for model fine-tuning. This includes processing data downloaded from NCBI or formatting your own custom datasets.
1. Processing NCBI Data:

- Gather data downloaded using the `download` command into a single folder.
- Run the processing command with a config file:

      genomefactory-cli process genomeFactory/Examples/process_normal.yaml

- The processed data can then be used as model input for fine-tuning.

2. Preparing Custom Datasets:

If you have your own dataset, format it as follows:

- Separate your data into three CSV files: `train.csv`, `dev.csv`, and `test.csv`.
- Each CSV file must have two columns:
  - The first column should contain the DNA sequences (e.g., `sequence`).
  - The second column should contain the corresponding labels (e.g., `label`).
    - For classification tasks, labels should be integers (e.g., 0, 1, 2...).
    - For regression tasks, labels should be continuous numbers.
- Place these three CSV files (`train.csv`, `dev.csv`, `test.csv`) together in a single folder.
- This folder can then be specified as the input data directory in your training configuration YAML file.

3. Advanced Processing Features:

Genome-Factory provides specialized dataset generation tools for common genomic machine learning tasks:

- Promoter region dataset: Generate promoter vs. non-promoter classification data from the EPDnew database (hg38, mm10, danRer11).

      genomefactory-cli process genomeFactory/Examples/process_promoter.yaml

- Epigenetic mark dataset: Create gene body sequences with H3K36me3 signal classification from ENCODE/Roadmap data (hg38, mm10).

      genomefactory-cli process genomeFactory/Examples/process_emp.yaml

- Enhancer region dataset: Build enhancer vs. non-enhancer classification data from FANTOM5 annotations (hg38, mm10).

      genomefactory-cli process genomeFactory/Examples/process_enhancer.yaml

All datasets feature quality control, configurable train/val/test splits, and output CSV files in `sequence,label` format.
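For the custom-dataset layout described in item 2 above, the three CSV files can be produced from a list of labeled sequences with the standard library alone. This is a minimal sketch, not Genome-Factory code: the function name `write_splits`, the split fractions, and the choice to write a `sequence,label` header row are assumptions, so compare the output with the bundled example data before training.

```python
import csv
import random
from pathlib import Path


def write_splits(records, out_dir, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (sequence, label) pairs and write train.csv, dev.csv, test.csv."""
    rows = list(records)
    random.Random(seed).shuffle(rows)  # seeded for a reproducible split
    n_dev = int(len(rows) * dev_frac)
    n_test = int(len(rows) * test_frac)
    splits = {
        "dev": rows[:n_dev],
        "test": rows[n_dev:n_dev + n_test],
        "train": rows[n_dev + n_test:],
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, split_rows in splits.items():
        with open(out / f"{name}.csv", "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            # Column order: sequence first, label second (header is an assumption).
            writer.writerow(["sequence", "label"])
            writer.writerows(split_rows)
    return {name: len(split_rows) for name, split_rows in splits.items()}
```

The returned dictionary gives the row count per split, and the output folder can then be named as the input data directory in the training YAML.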
For fine-tuning GFMs, Genome-Factory supports two primary task types: classification and regression. You specify the desired `task_type` in the training YAML configuration file.
Fine-tune GFMs using different methods:

- Full Fine-tuning:

      genomefactory-cli train genomeFactory/Examples/train_full.yaml

- LoRA (Low-Rank Adaptation):

      genomefactory-cli train genomeFactory/Examples/train_lora.yaml

  - Specify target modules in the YAML file:
    - `all`: Targets all linear layers.
    - `all_in_and_out_proj`: Targets input/output projection layers and the final classification layer.
    - Custom: Specify module names directly.
  - For Evo:

        genomefactory-cli train genomeFactory/Examples/train_evo_lora.yaml

- Adapter:

      genomefactory-cli train genomeFactory/Examples/train_adapter.yaml

  - Customize the adapter architecture in `genomeFactory/Train/workflow/adapter/adapter_model/Adapter.py` for potentially better performance on specific downstream tasks.

Note: Training settings like batch size, learning rate, and epochs can be customized in the respective YAML files for all methods.

Note on Flash Attention: To enable Flash Attention, set the `flash_attention` argument to `true` in your YAML configuration file. You must also enable mixed-precision training by setting either `bf16: true` or `fp16: true`. If `flash_attention` is set to `false`, or if a specific GFM does not support this argument, the model's default attention mechanism will be used.

Benchmarking: After fine-tuning, performance metrics are saved to a JSON file. You can use these metrics for benchmarking (e.g., comparing the performance of different models or tuning methods on specific tasks).

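The Flash Attention requirement noted above (mixed precision must be enabled alongside `flash_attention: true`) can be checked before a run is launched. The sketch below works on a configuration that has already been parsed into a dictionary, for example with PyYAML's `yaml.safe_load`. It assumes that `flash_attention`, `bf16`, and `fp16` are top-level keys, which the source does not confirm, and the function name `flash_attention_problems` is hypothetical.

```python
def flash_attention_problems(config: dict) -> list:
    """Return a list of problems with the Flash Attention settings, if any."""
    problems = []
    if config.get("flash_attention") is True:
        mixed_precision = config.get("bf16") is True or config.get("fp16") is True
        if not mixed_precision:
            problems.append(
                "flash_attention: true requires bf16: true or fp16: true"
            )
    # flash_attention false or absent: the model's default attention is used.
    return problems
```

An empty list means the three settings are consistent with the note; it does not tell you whether the chosen model supports Flash Attention.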
Use trained models for prediction, generation, or embedding extraction:

- Prediction: Predict properties of DNA sequences. Ensure the `task_type` specified in your inference YAML file (`classification` or `regression`) matches the task the model was originally fine-tuned for.
  - Full:

        genomefactory-cli inference genomeFactory/Examples/inference_full.yaml

  - LoRA:

        genomefactory-cli inference genomeFactory/Examples/inference_lora.yaml

  - Adapter:

        genomefactory-cli inference genomeFactory/Examples/inference_adapter.yaml

    - Note: For Adapter-based classification, specify the number of labels (`num_label`) in the YAML. For regression, set `num_label: 1`. Full/LoRA methods infer this automatically.

- Generation: Generate new DNA sequences based on existing ones. Applicable to compatible GFMs.
  - For GenomeOcean:

        genomefactory-cli inference genomeFactory/Examples/inference_generation_genomeocean.yaml

  - For Evo:

        genomefactory-cli inference genomeFactory/Examples/inference_generation_evo.yaml

- Embedding Extraction: Extract the last hidden state embeddings from sequences.
  - General case:

        genomefactory-cli inference genomeFactory/Examples/inference_extract.yaml

  - For Evo specifically:

        genomefactory-cli inference genomeFactory/Examples/inference_extract_evo.yaml

- Protein Generation: Generate biologically realistic protein sequences with structural constraints via FoldMason integration.
  - Structure-aware generation: Apply structural constraints during sequence generation.
  - Multi-model support: Evo and GenomeOcean.
  - Length control: Flexible sequence lengths.
  - Genomic context: Condition on specified genomic coordinates.
  - Batch processing: Generate multiple variants.

  Run:

      genomefactory-cli protein genomeFactory/Examples/protein_generation.yaml

Genome-Factory provides tools for interpreting genomic foundation models with sparse autoencoders (SAEs), giving insight into model behavior and biological significance.
- **Latent Feature Discovery**: Identify interpretable features learned by genomic foundation models
- **Ridge Regression Evaluation**: Quantitative assessment of feature importance for downstream tasks
- **First-token vs Mean-pooled Analysis**: Compare different pooling strategies for sequence representation
- **Feature Weight Analysis**: Understand which SAE features contribute most to biological predictions
Complete workflow for SAE training and interpretation:

    genomefactory-cli sae_train genomeFactory/Examples/sae_train.yaml

Configure the following parameters in the YAML file:

    data_file: "<YOUR_SEQUENCE_FILE>"
    d_model: <MODEL_DIMENSION>
    d_hidden: <HIDDEN_DIMENSION>
    batch_size: <BATCH_SIZE>
    lr: <LEARNING_RATE>
    k: <K_VALUE>
    auxk: <AUXK_VALUE>
    dead_steps_threshold: <THRESHOLD_STEPS>
    max_epochs: <MAX_EPOCHS>
    num_devices: <NUM_DEVICES>
    model_suffix: "<MODEL_SUFFIX>"
    wandb_project: "<PROJECT_NAME>"
    num_workers: <NUM_WORKERS>
    model_name: "<MODEL_NAME>"

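The parameter names `k`, `auxk`, and `dead_steps_threshold` resemble those used by top-k sparse autoencoders, but the source does not state which SAE variant Genome-Factory trains. Assuming a top-k design, the roles of `d_model`, `d_hidden`, and `k` in the encoder can be illustrated in plain Python. The function names and the list-based weight layout below are purely illustrative and are not the library's implementation.

```python
def topk_activation(pre_activations, k):
    """Keep the k largest pre-activations and set all others to zero."""
    if k <= 0:
        return [0.0] * len(pre_activations)
    # Indices sorted by value, largest first; the first k are kept.
    order = sorted(
        range(len(pre_activations)),
        key=lambda i: pre_activations[i],
        reverse=True,
    )
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]


def encode(x, w_enc, b_enc, k):
    """Map a d_model input to a d_hidden latent with at most k non-zeros.

    x:     list of d_model floats (one model embedding)
    w_enc: d_hidden rows, each holding d_model weights
    b_enc: list of d_hidden biases
    """
    pre = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(w_enc, b_enc)
    ]
    return topk_activation(pre, k)
```

Under this reading, `d_hidden` sets how many candidate features exist and `k` caps how many are active for any single input.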
A. First-token latent embedding analysis:

    genomefactory-cli sae_regression genomeFactory/Examples/sae_regression.yaml

Configure the following parameters in the YAML file:

    csv_path: "<FEATURE_CSV_PATH>"
    sae_checkpoint_path: "<SAE_CHECKPOINT_PATH>"
    output_path: "<OUTPUT_CSV_PATH>"
    type: "first_token"

B. Mean-pooled latent embedding analysis:

    genomefactory-cli sae_regression genomeFactory/Examples/sae_regression.yaml

Configure the following parameters in the YAML file:

    csv_path: "<FEATURE_CSV_PATH>"
    sae_checkpoint_path: "<SAE_CHECKPOINT_PATH>"
    output_path: "<OUTPUT_CSV_PATH>"
    type: "mean"

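The two `type` settings differ only in how per-token latents are reduced to one vector per sequence. The sketch below shows the usual meaning of the two terms, assuming `first_token` keeps the first token's latent vector and `mean` averages over all tokens. How Genome-Factory computes them internally is not shown in the source, and the function names are illustrative.

```python
def first_token(latents):
    """Return the latent vector of the first token in the sequence."""
    return list(latents[0])


def mean_pool(latents):
    """Average the latent vectors over all tokens, feature by feature."""
    n_tokens = len(latents)
    # zip(*latents) iterates over features; each column holds one value per token.
    return [sum(column) / n_tokens for column in zip(*latents)]
```

For a sequence with two tokens and latents `[[1.0, 0.0], [3.0, 2.0]]`, first-token pooling gives `[1.0, 0.0]` and mean pooling gives `[2.0, 1.0]`.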
Access all Genome-Factory functionalities through a graphical interface:

    genomefactory-cli webui

This command launches a web server. Open the provided URL in your browser to use the WebUI.

If you find Genome-Factory useful, please consider citing our work:

    @misc{genomefactory2025,
      title = {Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models},
      author = {Weimin Wu and Xuefeng Song and Yibo Wen and Qinjie Lin and Zhihan Zhou and Jerry Yao-Chieh Hu and Zhong Wang and Han Liu},
      year = {2025},
      archivePrefix = {arXiv},
      url = {https://github.com/MAGICS-LAB/Genome_Factory}
    }

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models.
Zheng, Yaowei, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo.
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp. 400-410. 2024.