
ADAR-GPT

Overview

RNA editing is a crucial post-transcriptional mechanism that alters RNA sequences, impacting gene regulation and disease. This repository contains code for predicting adenosine-to-inosine (A-to-I) RNA editing sites using GPT-4o-mini with a continual fine-tuning (CFT) strategy.

We introduce a liver-specific dataset, where ADAR1 is the predominant enzyme, and train models progressively on editing thresholds (1%, 5%, 10%, 15%), improving classification accuracy compared to traditional fine-tuning methods.

Key Contributions:

  • Liver-Specific RNA Editing Analysis: Avoiding confounding multi-tissue variability.
  • Continual Fine-Tuning (CFT): Training the model step-by-step from low (1%) to high (15%) editing levels.
  • Non-Overlapping Thresholds: Each site assigned a single editing category to improve classification accuracy.
  • Improved Performance: Outperforms GPT-3.5 and static fine-tuning (SFT) models.

🧬 Methodology

Our approach focuses on improving RNA editing site prediction using transformer-based models, specifically GPT-4o-mini, in a continual fine-tuning (CFT) paradigm. This methodology allows the model to progressively learn from lower to higher editing thresholds (1%, 5%, 10%, 15%), refining its understanding of RNA editing patterns.

We trained the model using a liver-specific dataset derived from GTEx, ensuring minimal interference from non-relevant ADAR isoforms. The training procedure included:

A) Data Collection & Preprocessing

  • Extracting double-stranded RNA (dsRNA) structures from Alu elements.
  • Annotating editing levels based on GTEx liver data.
  • Predicting RNA secondary structures with ViennaRNA (RNAfold) and converting them to dot-bracket (Vienna) notation (see the sketch after this list).
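
As a minimal illustration of the structure-prediction step, the ViennaRNA Python bindings can fold a sequence into dot-bracket (Vienna) notation. This is only a sketch; the repository's scripts may instead call the RNAfold command line, and the example sequence is not from the dataset.

# Sketch: fold one RNA sequence into dot-bracket (Vienna) notation
# using the ViennaRNA Python bindings. Illustrative only.
import RNA

sequence = "GGGAAAUCCACUGUGAAAGGAUUUCCC"  # example sequence, not from the dataset
structure, mfe = RNA.fold(sequence)       # dot-bracket string and minimum free energy (kcal/mol)
print(structure)
print(f"Minimum free energy: {mfe:.2f} kcal/mol")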

Data Partitioning

  • Overlapping Sites: Multiple thresholds assigned per site (e.g., 1-5%, 5-10%).
  • Non-Overlapping Sites: Each site belongs to one distinct threshold to ensure clearer distinctions.

B) RNA Editing as a Classification Problem

  • Framing RNA editing site prediction as a binary classification task.
  • The model determines whether a given adenosine is edited (Yes/No) based on its sequence and structure.
  • Training labels are derived from GTEx data, assigning a binary label to each adenosine (see the sketch after this list).
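
A minimal sketch of the labeling rule is shown below; the exact threshold handling (e.g., whether the comparison is >= or >) is an assumption, as the real labels come from the data-preparation scripts.

# Sketch of the binary labeling rule: an adenosine is labeled "Yes" (edited)
# if its measured editing level meets the threshold, otherwise "No".
# The >= comparison is an illustrative assumption.
def label_site(editing_level: float, threshold: float) -> str:
    return "Yes" if editing_level >= threshold else "No"

print(label_site(0.12, 0.10))  # "Yes": 12% editing at a 10% threshold
print(label_site(0.02, 0.10))  # "No"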

C) Comparing Fine-Tuning Strategies (SFT vs. CFT)

  • Static Fine-Tuning (SFT): Training on a single threshold (e.g., only 15% editing).
  • Continual Fine-Tuning (CFT): Gradual training from low (1%) to high (15%) editing levels.
  • CFT enables better adaptation across editing ranges, leading to more robust classification performance (see the sketch after this list).
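
Conceptually, the CFT chain can be driven as a loop in which each stage's fine-tuned model becomes the base model for the next threshold. The sketch below uses the OpenAI Python client; the file names, polling logic, and starting model ID are assumptions, not the repository's actual training code.

# Sketch of continual fine-tuning (CFT): the model fine-tuned at one editing
# threshold becomes the base model for the next. File names and the starting
# model ID are illustrative assumptions.
import time
from openai import OpenAI

client = OpenAI()
base_model = "gpt-4o-mini-2024-07-18"  # assumed starting checkpoint

for threshold in ["1", "5", "10", "15"]:
    # Upload the training JSONL prepared for this editing threshold.
    train_file = client.files.create(
        file=open(f"train_threshold_{threshold}.jsonl", "rb"),  # hypothetical file name
        purpose="fine-tune",
    )
    job = client.fine_tuning.jobs.create(training_file=train_file.id, model=base_model)
    # Poll until the job finishes, then chain the resulting model.
    while True:
        job = client.fine_tuning.jobs.retrieve(job.id)
        if job.status in ("succeeded", "failed", "cancelled"):
            break
        time.sleep(60)
    if job.status != "succeeded":
        raise RuntimeError(f"Fine-tuning failed at the {threshold}% threshold")
    base_model = job.fine_tuned_model  # the next stage starts from this model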

(Methodology overview figure)

Repository Structure

Getting Started

Requirements

First, clone this repository.

You may use the environment.yml file to create an Anaconda environment (Python 3.8) with the required packages.

Steps to Use the environment.yml File:

  1. Create the environment: save the environment.yml file in your project directory, then run:
conda env create -f environment.yml
  2. Activate the environment:
conda activate A2IRnaEditing

Data Preparation

1. Classification Task

For the classification task, data preparation involves extracting RNA sequences, computing secondary structures, and assigning editing labels for liver tissue. The classification data creation script generates the dsRNA structures and processes the RNA sequences so that editing sites can be classified based on their structural and sequence context.

To run this script, navigate to the Script/data_preparation folder and use the following command:

python Classification_Data_Creation_Liver.py [-h] --pair_region PAIR_REGION --output_dir OUTPUT_DIR
                                        --editing_site_plus EDITING_SITE_PLUS
                                        --editing_site_minus EDITING_SITE_MINUS --genome GENOME

Outputs:

  • data_for_prepare_classification.csv – Processed classification data
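
For intuition, extracting a fixed-width sequence window around a candidate editing site from the --genome FASTA can be sketched with pyfaidx as below. The paths, coordinates, and window size are placeholders; the actual extraction logic lives in Classification_Data_Creation_Liver.py.

# Sketch: pull a fixed-width sequence window centered on an editing site
# from a genome FASTA using pyfaidx. Paths, coordinates, and window size
# are placeholders.
from pyfaidx import Fasta

genome = Fasta("hg38.fa")          # hypothetical genome FASTA
chrom, site = "chr1", 1_000_000    # hypothetical editing site (0-based)
flank = 100                        # 100 nt on each side -> 201-nt window

window = genome[chrom][site - flank : site + flank + 1].seq.upper()
rna_window = window.replace("T", "U")   # DNA -> RNA alphabet
print(len(rna_window), rna_window[:30])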

2. Data Balancing by Editing Thresholds

Data balancing ensures equal representation of edited and non-edited sites across different editing levels, preventing bias in model training.
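
As a rough illustration of the balancing idea (the actual implementation is in the R scripts below), the larger class can be downsampled so that edited and non-edited sites are equally represented. The column names here are assumptions.

# Sketch of class balancing: downsample the larger class so edited and
# non-edited sites are equally represented. The "label" column name is an
# assumption; the repository performs this step in the R scripts.
import pandas as pd

df = pd.read_csv("data_for_prepare_classification.csv")
edited = df[df["label"] == "Yes"]
non_edited = df[df["label"] == "No"]

n = min(len(edited), len(non_edited))
balanced = pd.concat([
    edited.sample(n=n, random_state=42),
    non_edited.sample(n=n, random_state=42),
]).sample(frac=1, random_state=42)       # shuffle the rows

balanced.to_csv("balanced_classification.csv", index=False)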

Overlapping Sites

To generate balanced classification datasets for different editing thresholds, navigate to the Script/data_preparation directory and run the following command:

Rscript Division_thresholds_overlapping.R -i <input_file (data_for_prepare_classification.csv)> -o <output_dir>

This script divides the dataset into overlapping editing levels (1%, 5%, 10%, 15%) and ensures balanced distributions of edited and non-edited sites. The output consists of four files, each corresponding to a different threshold.

Non-Overlapping Sites

For non-overlapping classification thresholds, use the following command:

Rscript Division_thresholds_non_overlapping.R -i <input_file (data_for_prepare_classification.csv)> -o <output_dir>

This script assigns each site to exactly one editing level category, producing four output files analogous to the overlapping approach.

3. Preparing Data for GPT Fine-Tuning

To prepare the data for GPT fine-tuning, navigate to the Script/data_preparation directory and run the following command:

python Model_Input_Preparation_Classification.py <input_csv>

This script processes the classification dataset into a structured JSONL format for model training and evaluation. Outputs:

  • train_.jsonl – Training dataset
  • valid_.jsonl – Validation dataset
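
For reference, a record in these JSONL files follows the chat format expected for OpenAI fine-tuning. The prompt wording below is purely illustrative; the actual template is defined by Model_Input_Preparation_Classification.py.

# Illustrative shape of one JSONL training record in the OpenAI chat
# fine-tuning format. The prompt text and file name are assumptions.
import json

record = {
    "messages": [
        {"role": "system", "content": "You classify A-to-I RNA editing sites."},
        {"role": "user", "content": "Sequence: ...GCAAUGC...\nStructure: ...(((.)))...\nIs the central adenosine edited?"},
        {"role": "assistant", "content": "Yes"},
    ]
}
with open("train_example.jsonl", "a") as fh:   # hypothetical file name
    fh.write(json.dumps(record) + "\n")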

Inference

The inference process depends on the training strategy. In CFT (Continual Fine-Tuning), inference is performed iteratively: each fine-tuned model serves as the base for fine-tuning the next, so each inference step is applied to a different model trained on progressively higher editing thresholds. In SFT (Static Fine-Tuning), a model is trained directly on a single editing level, and inference is a one-step process in which that model is applied to new data without iterative refinement.

To perform inference, navigate to the Script/inferencing directory and run the following command:

 python inferencing.py <input_file> <output_file> <temperature> 

Input: The <input_file> is the file created in the Model_Input_Preparation_Classification.py step.
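
Conceptually, each inference step sends one validation record to the (fine-tuned) model at the requested temperature and reads back the Yes/No answer. The sketch below uses the OpenAI chat completions API with a placeholder model ID; it is not the exact logic of inferencing.py.

# Conceptual sketch of a single inference call; inferencing.py implements the
# full loop over <input_file>. The model ID and prompt are placeholders.
from openai import OpenAI

client = OpenAI()
model_id = "ft:gpt-4o-mini:...:adar-gpt"   # placeholder fine-tuned model ID

response = client.chat.completions.create(
    model=model_id,
    temperature=0.0,   # corresponds to the <temperature> argument
    messages=[
        {"role": "system", "content": "You classify A-to-I RNA editing sites."},
        {"role": "user", "content": "Sequence: ...\nStructure: ...\nIs the central adenosine edited?"},
    ],
)
print(response.choices[0].message.content)   # expected to be "Yes" or "No"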

Baseline Comparisons

We provide scripts to compare ADAR-GPT against two state-of-the-art baselines: EditPredict (CNN) and RNA-FM (foundation model).

Installing Baseline Models

EditPredict

Installation:

git clone https://github.com/wjd198605/EditPredict.git
cd EditPredict
pip install tensorflow==2.11.0 keras scikit-learn pandas numpy

Required files (included in repository):

  • editPredict_weight_alu.json – Model architecture
  • editPredict_construction_alu.h5 – Pre-trained weights

RNA-FM

Installation:

git clone https://github.com/ml4bio/RNA-FM.git
cd RNA-FM
pip install torch transformers multimolecule
pip install -e .

Pre-trained vs. Retrained Models

  • Pre-trained: Model trained on the original authors' dataset. Quick to evaluate, but may not generalize to your data.
  • Retrained: Model trained from scratch on your exact dataset, providing a fair comparison. In our experiments: pre-trained EditPredict (F1 = 0.67) vs. retrained (F1 = 0.78).

Both-Strand Scoring

The --both_strands flag evaluates both forward sequence and reverse-complement, taking the maximum probability. This can improve recall but may increase false positives.
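
The scoring rule described here can be sketched as follows: score the forward window and its reverse complement, then keep the higher probability. The predict_prob callable stands in for whichever baseline model is being evaluated; this is an illustration, not the baseline scripts' exact code.

# Sketch of both-strand scoring: score the forward sequence and its reverse
# complement, then keep the maximum probability as described above.
# `predict_prob` is a stand-in for the baseline model's scoring function.
def reverse_complement(seq: str) -> str:
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))

def both_strand_score(seq: str, predict_prob) -> float:
    forward = predict_prob(seq)
    reverse = predict_prob(reverse_complement(seq))
    return max(forward, reverse)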

1. EditPredict Baseline

Option A: Quick Evaluation (Pre-trained Model)

Script: adar_gpt_vs_editpredict.py

Input:

  • --train_jsonl: Training JSONL file
  • --valid_jsonl: Validation JSONL file
  • --editpredict_dir: Path to EditPredict repository
  • --threshold: Decision threshold (default: 0.5)
  • --outdir: Output directory

Command:

python Script/baselines/EditPredict/adar_gpt_vs_editpredict.py \
  --train_jsonl path/to/train.jsonl \
  --valid_jsonl path/to/valid.jsonl \
  --editpredict_dir path/to/EditPredict \
  --threshold 0.5 \
  --outdir editpredict_baseline_out

Output:

  • editpredict_probs_valid.csv – Probabilities for each site
  • editpredict_metrics_valid.json – Performance metrics

Option B: Enhanced Evaluation

Script: adar_gpt_vs_editpredict_plus.py

Input: Same as Option A, plus:

  • --both_strands: (Optional) Evaluate both strand orientations

Command:

python Script/baselines/EditPredict/adar_gpt_vs_editpredict_plus.py \
  --train_jsonl path/to/train.jsonl \
  --valid_jsonl path/to/valid.jsonl \
  --editpredict_dir path/to/EditPredict \
  --both_strands \
  --outdir editpredict_plus_out

Output:

  • probs.csv – Raw probabilities
  • metrics@0.5.json – Metrics at threshold 0.5
  • metrics@bestF1.json – Best F1 score and the corresponding optimal threshold
  • threshold_sweep.csv – Metrics across 101 thresholds (see the sketch after this list)
  • roc_curve.csv – ROC curve data
  • pr_curve.csv – Precision-Recall curve data
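
The best-F1 threshold reported in metrics@bestF1.json is found by sweeping candidate thresholds over the predicted probabilities and keeping the one that maximizes F1. A minimal version of that sweep is sketched below; the column names in probs.csv are assumptions.

# Sketch of the threshold sweep behind metrics@bestF1.json: evaluate F1 at
# 101 equally spaced thresholds and keep the best. Column names ("label" as
# 0/1 and "prob") are assumptions.
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

probs = pd.read_csv("probs.csv")
y_true = probs["label"].to_numpy()
y_prob = probs["prob"].to_numpy()

best_threshold, best_f1 = 0.5, 0.0
for t in np.linspace(0.0, 1.0, 101):
    f1 = f1_score(y_true, (y_prob >= t).astype(int))
    if f1 > best_f1:
        best_threshold, best_f1 = t, f1

print(f"Best F1 = {best_f1:.4f} at threshold {best_threshold:.2f}")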

Option C: Retrain on Your Data

Script: retrain_editpredict_pipeline.py

Input:

  • --train_jsonl: Training JSONL file
  • --valid_jsonl: Validation JSONL file
  • --editpredict_dir: Path to EditPredict repository
  • --outdir: Output directory
  • --input_len: Window length (default: 201)
  • --epochs: Training epochs (default: 10)
  • --batch_size: Batch size (default: 128)
  • --evaluate_with_plus: (Optional) Run enhanced evaluation after training

Command:

python Script/baselines/EditPredict/retrain_editpredict_pipeline.py \
  --train_jsonl path/to/train.jsonl \
  --valid_jsonl path/to/valid.jsonl \
  --editpredict_dir path/to/EditPredict \
  --outdir editpredict_retrained \
  --epochs 10 \
  --evaluate_with_plus

Output:

  • data/ – Converted training data
  • model_ep_retrained/ – Retrained model (JSON + H5 files)
  • evaluation/ – (If --evaluate_with_plus used) Comprehensive evaluation results

2. RNA-FM Baseline

Script: rnafm_finetune_adar.py

Input:

  • --train_jsonl: Training JSONL file
  • --valid_jsonl: Validation JSONL file
  • --outdir: Output directory
  • --model_id: HuggingFace model ID (default: multimolecule/rnafm)
  • --window_len: Sequence window length (default: 201)
  • --epochs: Training epochs (default: 5)
  • --batch_size: Batch size (default: 32)
  • --lr: Learning rate (default: 3e-5)
  • --both_strands: (Optional) Evaluate both strand orientations

Command:

python Script/baselines/RNA-FM/rnafm_finetune_adar.py \
  --train_jsonl path/to/train.jsonl \
  --valid_jsonl path/to/valid.jsonl \
  --outdir rnafm_finetuned \
  --epochs 5 \
  --batch_size 32 \
  --both_strands

Output:

  • rnafm_finetuned_model/ – Fine-tuned model checkpoint
  • probs.csv – Prediction probabilities
  • metrics@0.5.json – Fixed threshold metrics
  • metrics@bestF1.json – Best F1 performance
  • threshold_sweep.csv – Full threshold analysis
  • roc_curve.csv, pr_curve.csv – Performance curves

Performance

Baseline Comparison (All Thresholds)

Note: All metrics are reported at a fixed decision threshold of 0.5. Single-strand uses the forward sequence only; both-strand combines forward and reverse-complement predictions.

Threshold  Model                             F1      Accuracy  Recall  Specificity
1%         EditPredict Pre-trained (single)  0.8187  0.6946    0.9986  0.0162
1%         EditPredict Retrained (single)    0.8427  0.7789    0.8579  0.6026
1%         EditPredict Retrained (both)      0.8313  0.7328    0.9533  0.2407
1%         RNA-FM Fine-tuned (both)          0.8200  0.6970    0.9997  0.0216
1%         ADAR-GPT CFT (1%)                 0.7956  0.7428    0.7248  0.7832
5%         EditPredict Pre-trained (single)  0.7438  0.5948    1.0000  0.0162
5%         EditPredict Retrained (single)    0.7969  0.7524    0.8262  0.6470
5%         EditPredict Retrained (both)      0.7668  0.6705    0.9212  0.3125
5%         RNA-FM Fine-tuned (both)          0.7575  0.6258    0.9935  0.1009
5%         ADAR-GPT CFT (5%)                 0.7994  0.7655    0.7946  0.7241
10%        EditPredict Pre-trained (single)  0.6871  0.5260    1.0000  0.0119
10%        EditPredict Retrained (single)    0.7403  0.7259    0.7508  0.6989
10%        EditPredict Retrained (both)      0.7273  0.6628    0.8642  0.4445
10%        RNA-FM Fine-tuned (both)          0.7169  0.5984    0.9775  0.1872
10%        ADAR-GPT CFT (10%)                0.7751  0.7753    0.7444  0.8089
15%        EditPredict Pre-trained (single)  0.6636  0.4995    1.0000  0.0118
15%        EditPredict Retrained (single)    0.7255  0.7120    0.7712  0.6544
15%        EditPredict Retrained (both)      0.6969  0.6313    0.8587  0.4097
15%        RNA-FM Fine-tuned (both)          0.6936  0.5712    0.9836  0.1693
15%        ADAR-GPT CFT (15%)                0.7735  0.7872    0.7358  0.8373
15%        ADAR-GPT SFT (15%)                0.6871  0.7092    0.6468  0.7699
