Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models

Overview

Genome-Factory is a Python-based integrated library for tuning, deploying, and interpreting genomic foundation models (GFMs). The framework consists of six components:

  • Genome Collector acquires genomic sequences from public repositories and performs preprocessing (e.g., GC normalization, ambiguous base correction).
  • Model Loader supports major genomic models (e.g., GenomeOcean, EVO, DNABERT-2, HyenaDNA, Caduceus, Nucleotide Transformer) and their tokenizers.
  • Model Trainer configures workflows, adapts models to classification or regression tasks, and executes training with full fine-tuning or parameter-efficient methods (LoRA, adapters).
  • Inference Engine enables embedding extraction and sequence generation.
  • Benchmarker provides standard benchmarks and allows integration of custom evaluation tasks.
  • Biological Interpreter enhances interpretability through sparse autoencoders.

Supported Models

The "Variant Type" column specifies how model variants differ: by parameter size or by maximum input sequence length.

Model Name               Variant Type      Variants
GenomeOcean              Parameter Size    100M / 500M / 4B
EVO                      Sequence Length   8K / 131K
DNABERT-2                Parameter Size    117M
HyenaDNA                 Sequence Length   1K / 16K / 32K / 160K / 450K / 1M
Caduceus                 Sequence Length   1K / 131K
Nucleotide Transformer   Parameter Size    50M / 100M / 250M / 500M / 1B / 2.5B

Installation

  1. Clone the repository:

    git clone https://github.com/MAGICS-LAB/Genome_Factory.git
    cd Genome_Factory
  2. Install dependencies:

    # Install primary Python dependencies from requirements file
    pip install -r requirements.txt
    
    # Install CUDA Toolkit and Compiler 
    # Ensure you have a compatible NVIDIA driver installed and are in a Conda environment.
    conda install cudatoolkit==11.8 -c nvidia
    conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc
    
    # Install additional dependencies for specific features (e.g., Mamba support, Flash Attention)
    
    pip install mamba-ssm==2.2.2 flash-attn==2.7.2.post1
    
    # Install NCBI Datasets CLI (required for NCBI data download feature)
    conda install conda-forge::ncbi-datasets-cli
    
    # Install EVO from source
    git clone https://github.com/evo-design/evo.git
    cd evo
    pip install .
    cd .. 
    # IMPORTANT: Return to the Genome-Factory root directory before the next step
    
    # Install Genome-Factory in editable mode
    pip install -e .
  3. Environment Notes:

    • For GenomeOcean, use transformers==4.44.2.
    • For other models, use transformers==4.29.2.
    • For DNABERT-2, ensure triton is uninstalled: pip uninstall triton.

Usage via CLI (genomefactory-cli)

Genome-Factory uses YAML configuration files to define tasks. Example files are provided in genomeFactory/Examples/. You can customize the parameters within these files, ensuring you maintain the required YAML structure.
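
A typical workflow is to copy one of the provided examples, edit it, and pass the edited file to the CLI. The snippet below uses the train_full.yaml example shipped with the repository; the copy destination is arbitrary:

    # Copy an example training config and adjust it to your model, data, and task
    cp genomeFactory/Examples/train_full.yaml my_train.yaml
    # After editing my_train.yaml, run the corresponding CLI command:
    genomefactory-cli train my_train.yaml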

Data Download

Download genomic data from NCBI:

  1. Using a config file: Specify download parameters in a YAML file. Download by species:
    genomefactory-cli download genomeFactory/Examples/download_by_species.yaml
    Download by link:
    genomefactory-cli download genomeFactory/Examples/download_by_link.yaml
  2. Interactively: Run the command without a config file and follow the prompts in the terminal to specify your download criteria (supports both species-based and link-based downloads).
    genomefactory-cli download

Note: The list of species and their taxon IDs used for downloads is stored in genomeFactory/Data/Download/Datasets_species_taxonid_dict.json. This file is not exhaustive; you can extend it by adding new species-to-taxonID pairs to download data for other species as needed.
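
For illustration, each entry in that file pairs a species name with its NCBI taxonomy ID. The exact key format in the shipped file may differ, so treat the snippet below as a sketch (the taxon IDs shown are the actual NCBI IDs for these species):

    {
      "Homo sapiens": 9606,
      "Mus musculus": 10090
    }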

Data Processing

Genome-Factory provides tools to prepare data for model fine-tuning. This includes processing data downloaded from NCBI or formatting your own custom datasets.

1. Processing NCBI Data:

  • Gather data downloaded using the download command into a single folder.
  • Run the processing command with a config file:
    genomefactory-cli process genomeFactory/Examples/process_normal.yaml
  • The processed data will be ready for input into the model for fine-tuning.

2. Preparing Custom Datasets:

If you have your own dataset, format it as follows:

  • Separate your data into three CSV files: train.csv, dev.csv, and test.csv.
  • Each CSV file must have two columns:
    • The first column should contain the DNA sequences (e.g., sequence).
    • The second column should contain the corresponding labels (e.g., label).
      • For classification tasks, labels should be integers (e.g., 0, 1, 2...).
      • For regression tasks, labels should be continuous numbers.
  • Place these three CSV files (train.csv, dev.csv, test.csv) together in a single folder.
  • This folder can then be specified as the input data directory in your training configuration YAML file.
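
For example, for a binary classification task, each of the three CSV files might look like the sketch below (the column names sequence and label follow the examples above; the sequences and labels are made up):

    sequence,label
    ACGTTAGCCGGATACGTTAGCA,0
    TTGACGGCTAGCTAACCGGTTA,1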

3. Advanced Processing Features:

Genome-Factory provides specialized dataset generation tools for common genomic machine learning tasks:

  • Promoter region dataset: Generate promoter vs. non-promoter classification data from the EPDnew database (hg38, mm10, danRer11)

    genomefactory-cli process genomeFactory/Examples/process_promoter.yaml
  • Epigenetic mark dataset: Create gene body sequences with H3K36me3 signal classification from ENCODE/Roadmap data (hg38, mm10)

    genomefactory-cli process genomeFactory/Examples/process_emp.yaml
  • Enhancer region dataset: Build enhancer vs. non-enhancer classification data from FANTOM5 annotations (hg38, mm10)

    genomefactory-cli process genomeFactory/Examples/process_enhancer.yaml
  • All datasets feature quality control, configurable train/val/test splits, and output CSV files with sequence,label format.

Training

For fine-tuning GFMs, Genome-Factory supports two primary task types: classification and regression. You specify the desired task_type in the training YAML configuration file.

Fine-tune GFMs using different methods:

  • Full Fine-tuning:

    genomefactory-cli train genomeFactory/Examples/train_full.yaml
  • LoRA (Low-Rank Adaptation):

    genomefactory-cli train genomeFactory/Examples/train_lora.yaml
    • Specify target modules in the YAML file:
      • all: Targets all linear layers.
      • all_in_and_out_proj: Targets input/output projection layers and the final classification layer.
      • Custom: Specify module names directly.
    • For Evo:
      genomefactory-cli train genomeFactory/Examples/train_evo_lora.yaml
  • Adapter:

    genomefactory-cli train genomeFactory/Examples/train_adapter.yaml
    • Customize the adapter architecture in genomeFactory/Train/workflow/adapter/adapter_model/Adapter.py for potentially better performance on specific downstream tasks.

    Note: Training settings like batch size, learning rate, and epochs can be customized in the respective YAML files for all methods.

    Note on Flash Attention: To enable Flash Attention, set the flash_attention argument to true in your YAML configuration file. You must also enable mixed-precision training by setting either bf16: true or fp16: true. If flash_attention is set to false, or if a specific GFM does not support this argument, the model's default attention mechanism will be used. The relevant keys are shown in the sketch below.

    Benchmarking: After fine-tuning, performance metrics are saved to a JSON file. You can use these metrics for benchmarking (e.g., comparing the performance of different models or tuning methods on specific tasks).
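
    For reference, these settings correspond to YAML keys like the following. Only task_type, flash_attention, bf16, and fp16 are key names taken from this README; the values shown are illustrative placeholders, not a complete training config:

      task_type: classification   # or regression, matching your dataset's labels
      flash_attention: true       # requires mixed precision (bf16 or fp16)
      bf16: true                  # alternatively fp16: true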

Inference

Use trained models for prediction, generation, or embedding extraction:

  1. Prediction: (Predict properties of DNA sequences). Ensure the task_type specified in your inference YAML file (classification or regression) matches the task the model was originally fine-tuned for.

    • Full:
      genomefactory-cli inference genomeFactory/Examples/inference_full.yaml
    • LoRA:
      genomefactory-cli inference genomeFactory/Examples/inference_lora.yaml
    • Adapter:
      genomefactory-cli inference genomeFactory/Examples/inference_adapter.yaml
      • Note: For Adapter-based classification, specify the number of labels (num_label) in the YAML. For regression, set num_label: 1. Full/LoRA methods infer this automatically. See the sketch after this list.
  2. Generation: (Generate new DNA sequences based on existing ones). Applicable to compatible GFMs.

    • For GenomeOcean:
      genomefactory-cli inference genomeFactory/Examples/inference_generation_genomeocean.yaml
    • For Evo:
      genomefactory-cli inference genomeFactory/Examples/inference_generation_evo.yaml
  3. Embedding Extraction: (Extract the last hidden state embeddings from sequences).

    • General Case:
      genomefactory-cli inference genomeFactory/Examples/inference_extract.yaml
    • For Evo specifically:
      genomefactory-cli inference genomeFactory/Examples/inference_extract_evo.yaml
  4. Protein Generation: (Generate biologically realistic protein sequences with structural constraints via FoldMason integration).

  • Structure-aware generation: Apply structural constraints during sequence generation
  • Multi-model support: Evo and GenomeOcean
  • Length control: Flexible sequence lengths
  • Genomic context: Condition on specified genomic coordinates
  • Batch processing: Generate multiple variants

Run:

genomefactory-cli protein genomeFactory/Examples/protein_generation.yaml
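
For the adapter-specific setting mentioned in step 1, an adapter inference config would contain entries like the following. Here, task_type and num_label are the keys named above; the values are placeholders for a hypothetical two-class task:

    task_type: classification
    num_label: 2    # number of classes; use num_label: 1 for regression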

Interpretation

Genome-Factory provides tools for interpreting genomic foundation models through sparse autoencoder (SAE) analysis, offering insight into model behavior and its biological significance.

Sparse Autoencoder (SAE) Analysis

  • Latent Feature Discovery: Identify interpretable features learned by genomic foundation models
  • Ridge Regression Evaluation: Quantitative assessment of feature importance for downstream tasks
  • First-token vs. Mean-pooled Analysis: Compare different pooling strategies for sequence representation
  • Feature Weight Analysis: Understand which SAE features contribute most to biological predictions

Quick Start Guide

SAE-Based Feature Analysis

Complete workflow for SAE training and interpretation:

Step 1: Train SAE Model
genomefactory-cli sae_train genomeFactory/Examples/sae_train.yaml

Configure the following parameters in the YAML file:

data_file: "<YOUR_SEQUENCE_FILE>"
d_model: <MODEL_DIMENSION>
d_hidden: <HIDDEN_DIMENSION>
batch_size: <BATCH_SIZE>
lr: <LEARNING_RATE>
k: <K_VALUE>
auxk: <AUXK_VALUE>
dead_steps_threshold: <THRESHOLD_STEPS>
max_epochs: <MAX_EPOCHS>
num_devices: <NUM_DEVICES>
model_suffix: "<MODEL_SUFFIX>"
wandb_project: "<PROJECT_NAME>"
num_workers: <NUM_WORKERS>
model_name: "<MODEL_NAME>"
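
A minimal filled-in sketch, purely for illustration: the numbers below are arbitrary placeholders rather than recommended values, the path is hypothetical, and d_model must match the hidden dimension of the GFM whose embeddings you are analyzing:

data_file: "data/embedding_sequences.csv"   # hypothetical path to your sequence file
d_model: 768                                # placeholder; set to the GFM's hidden size
d_hidden: 8192                              # placeholder SAE latent width
batch_size: 64
lr: 0.0001
k: 64
auxk: 256
dead_steps_threshold: 2000
max_epochs: 1
num_devices: 1
model_suffix: "demo"
wandb_project: "genome-factory-sae"
num_workers: 4
model_name: "<MODEL_NAME>"                  # keep the model identifier expected by Genome-Factory
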
Step 2: Downstream Evaluations with Ridge Regression

A. First-token latent embedding analysis:

genomefactory-cli sae_regression genomeFactory/Examples/sae_regression.yaml

Configure the following parameters in the YAML file:

csv_path: "<FEATURE_CSV_PATH>"
sae_checkpoint_path: "<SAE_CHECKPOINT_PATH>"
output_path: "<OUTPUT_CSV_PATH>"
type: "first_token"

B. Mean-pooled latent embedding analysis:

genomefactory-cli sae_regression genomeFactory/Examples/sae_regression.yaml

Configure the following parameters in the YAML file:

csv_path: "<FEATURE_CSV_PATH>"
sae_checkpoint_path: "<SAE_CHECKPOINT_PATH>"
output_path: "<OUTPUT_CSV_PATH>"
type: "mean"

Usage via Web UI

Access all Genome-Factory functionalities through a graphical interface:

genomefactory-cli webui

This command launches a web server. Open the provided URL in your browser to use the WebUI.

Citation

If you find Genome-Factory useful, please consider citing our work:

@misc{genomefactory2025,
  title     = {Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models},
  author    = {Weimin Wu and Xuefeng Song and Yibo Wen and Qinjie Lin and Zhihan Zhou and Jerry Yao-Chieh Hu and Zhong Wang and Han Liu},
  year      = {2025},
  archivePrefix = {arXiv},
  url       = {https://github.com/MAGICS-LAB/Genome_Factory}
}

Reference

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models.
Zheng, Yaowei, Richong Zhang, Junhao Zhang, YeYanhan YeYanhan, and Zheyan Luo.
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp. 400-410. 2024.
