This pipeline is designed to process Next-Generation Sequencing (NGS) data, starting from raw FASTQ files and performing steps such as read trimming, alignment, variant calling, and annotation.
## Table of Contents

- Dependencies
- Installation
- Pipeline Overview
- Directory Structure
- Usage
- Pipeline Steps
- Input Data
- Output Data
- Error Handling
- Disk Space Management
- Cleanup
- Notes
- Prerequisites
- Version Control
- Author
## Dependencies

The pipeline relies on the following software:

- dos2unix: Used to convert file line endings (if necessary).
  - Installation: `sudo apt-get install dos2unix`
- wget: Used to download data files.
  - Installation: `sudo apt-get install wget`
- zcat: Used to decompress the .qz files; usually part of the gzip package.
  - Installation: usually pre-installed, or `sudo apt-get install gzip`
- Trimmomatic: Used to trim low-quality reads and adapter sequences.
  - Installation: Download from https://github.com/usadellab/Trimmomatic and follow the instructions. The pipeline assumes Trimmomatic is installed and that the adapters file is located at /home/ubuntu/anaconda3/pkgs/trimmomatic-0.39-hdfd78af_2/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa; you may need to adjust this path in the script.
- BWA: Used to align reads to the reference genome.
  - Installation: `sudo apt-get install bwa`
- samtools: Used to process SAM/BAM files (convert, sort, index, filter).
  - Installation: `sudo apt-get install samtools`
- picard: Used to mark duplicate reads.
  - Installation: Download the latest .jar file from https://github.com/broadinstitute/picard. The pipeline assumes picard is on the user's PATH; otherwise, specify the full path to the .jar file in the script.
- freebayes: Used for variant calling.
  - Installation: `sudo apt-get install freebayes`
- bcftools: Used as a fallback variant caller.
  - Installation: `sudo apt-get install bcftools`
- vcffilter: Part of the vcflib package, used to filter variants.
  - Installation: `sudo apt-get install libvcflib-tools`
- ANNOVAR: Used to annotate variants.
  - Installation: Download from https://annovar.openbioinformatics.org/ and follow the instructions. You will also need to download the required databases. The pipeline assumes ANNOVAR is installed at ~/annovar.
- snpEff: Used to annotate variants.
  - Installation: Download from http://snpeff.sourceforge.net/ and follow the instructions. You will also need to download the hg19 database.
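
After installing, you can quickly verify that the command-line tools are on your PATH (a convenience check, not part of the pipeline):

```bash
# Report any command-line dependency that is not on PATH
for tool in dos2unix wget zcat bwa samtools freebayes bcftools vcffilter; do
    command -v "$tool" >/dev/null 2>&1 || echo "Missing: $tool"
done
```
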
## Installation

- Clone the repository:

  ```bash
  git clone <repository_url>
  cd <repository_name>
  ```

- Create the necessary directories. The pipeline script will create these, but you can create them manually if needed:

  ```bash
  mkdir -p ~/ngs_course/assessment_pipeline/data/untrimmed_fastq
  mkdir -p ~/ngs_course/assessment_pipeline/data/trimmed_fastq
  mkdir -p ~/ngs_course/assessment_pipeline/data/aligned_data
  mkdir -p ~/ngs_course/assessment_pipeline/analysis
  mkdir -p ~/ngs_course/assessment_pipeline/results
  mkdir -p ~/ngs_course/assessment_pipeline/docs
  mkdir -p ~/ngs_course/assessment_pipeline/data/reference
  mkdir -p ~/annovar  # if you are using the default path
  ```

- Download the reference genome: the pipeline downloads hg19, but you can place your own reference genome in the data/reference/ directory.

- Download the ANNOVAR databases: the pipeline attempts to download the required databases, but you may need to do this manually depending on your network and ANNOVAR setup.

- Set permissions: ensure the script has execute permissions:

  ```bash
  chmod +x pipeline.sh
  ```
## Pipeline Overview

The pipeline performs the following steps:
- Downloads raw FASTQ data.
- Decompresses the FASTQ files.
- Trims the reads using Trimmomatic.
- Prepares the reference genome (downloads if necessary).
- Indexes the reference genome using BWA and samtools.
- Aligns the trimmed reads to the reference genome using BWA.
- Processes the alignment files (converts SAM to BAM, sorts, indexes).
- Marks duplicate reads using Picard.
- Filters the BAM file.
- Calls variants using FreeBayes (with bcftools fallback).
- Filters variants.
- Annotates variants using ANNOVAR.
- Annotates variants using snpEff.
- Performs basic variant prioritization.
## Directory Structure

```
ngs_course/assessment_pipeline/
├── pipeline.sh
├── data/
│   ├── untrimmed_fastq/
│   ├── trimmed_fastq/
│   ├── aligned_data/
│   ├── reference/
│   └── annotation.bed
├── analysis/
├── results/
└── docs/
```
### data/

This is the main data directory. It contains the following subdirectories and files:

- Description: Stores all input and intermediate data files.
- Contents:
  - untrimmed_fastq/: Stores the raw, untrimmed FASTQ files.
  - trimmed_fastq/: Stores the trimmed FASTQ files.
  - aligned_data/: Stores the aligned reads in SAM and BAM formats.
  - reference/: Stores the reference genome files.
  - annotation.bed: Stores the annotation BED file.
- Relevance: Essential for organizing the input and output of each processing step.
### data/untrimmed_fastq/

- Description: Stores the raw, untrimmed FASTQ files.
- Contents:
  - NGS0001.R1.fastq.qz: Raw reads 1 in compressed format.
  - NGS0001.R2.fastq.qz: Raw reads 2 in compressed format.
  - NGS0001.R1.fastq: Decompressed reads 1.
  - NGS0001.R2.fastq: Decompressed reads 2.
- Relevance: Input for the trimming step.

Snippet:

```bash
zcat $FILE_R1 > $UNTRIMMED/R1.fastq
zcat $FILE_R2 > $UNTRIMMED/R2.fastq
```
### data/trimmed_fastq/

- Description: Stores the trimmed FASTQ files after Trimmomatic processing.
- Contents:
  - trimmed_1P.fastq.gz: Trimmed paired-end reads 1 (compressed).
  - trimmed_1U.fastq.gz: Trimmed unpaired reads 1 (compressed).
  - trimmed_2P.fastq.gz: Trimmed paired-end reads 2 (compressed).
  - trimmed_2U.fastq.gz: Trimmed unpaired reads 2 (compressed).
- Relevance: Output from the trimming step and input for the alignment step.

Snippet:

```bash
trimmomatic PE -threads 4 -phred33 $R1 $R2 \
    $TRIMMED/trimmed_1P.fastq.gz $TRIMMED/trimmed_1U.fastq.gz \
    $TRIMMED/trimmed_2P.fastq.gz $TRIMMED/trimmed_2U.fastq.gz ...
```
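
The trailing `...` elides the trimming options. A fuller invocation might look like the following; the ILLUMINACLIP thresholds and the quality/length settings are illustrative values rather than the script's exact parameters (the adapter path is the one noted under Dependencies):

```bash
trimmomatic PE -threads 4 -phred33 $R1 $R2 \
    $TRIMMED/trimmed_1P.fastq.gz $TRIMMED/trimmed_1U.fastq.gz \
    $TRIMMED/trimmed_2P.fastq.gz $TRIMMED/trimmed_2U.fastq.gz \
    ILLUMINACLIP:/home/ubuntu/anaconda3/pkgs/trimmomatic-0.39-hdfd78af_2/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa:2:30:10 \
    TRAILING:25 MINLEN:50  # example thresholds; adjust to your data
```
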
### data/aligned_data/

- Description: Stores the aligned reads in SAM and BAM formats.
- Contents:
  - NGS01.sam: Aligned reads in SAM format (intermediate).
  - NGS01.bam: Aligned reads in BAM format (intermediate).
  - NGS01_sorted.bam: Sorted BAM file.
  - NGS01_marked.bam: BAM file with marked duplicates.
  - NGS01_filtered.bam: Filtered BAM file.
- Relevance: Output from the alignment and duplicate marking steps; input for variant calling.

Snippet:

```bash
bwa mem "$REF_FA" $TRIMMED/trimmed_1P.fastq.gz $TRIMMED/trimmed_2P.fastq.gz -o $SAM
samtools view -b $SAM > $BAM
samtools sort $BAM > $SORT_BAM
picard MarkDuplicates I=$SORT_BAM O=$MARK_BAM M=$ALIGNED/marked_dup_metrics.txt
samtools view -F 0x4 -q 20 -b $MARK_BAM > $FILTER_BAM
```
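
As an optional sanity check after filtering (a suggestion, not a pipeline step), samtools can summarize the alignment:

```bash
samtools index $FILTER_BAM     # index the filtered BAM
samtools flagstat $FILTER_BAM  # totals, mapped %, properly paired, duplicates
```
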
### data/reference/

- Description: Stores the reference genome files.
- Contents:
  - hg19.fa.gz: Compressed reference genome (downloaded).
  - hg19.fa: Decompressed reference genome.
  - hg19.fa.fai: samtools index file.
  - hg19.fa.amb, hg19.fa.ann, hg19.fa.bwt, etc.: BWA index files.
- Relevance: Required for alignment and variant calling.

Snippet:

```bash
bwa index "$REF_FA"      # BWA indexing
samtools faidx "$REF_FA" # samtools indexing
```
### analysis/

- Description: Stores the primary variant calling and annotation results.
- Contents:
  - NGS01.vcf: Raw variant calls.
  - NGS01_filtered.vcf: Filtered variant calls.
  - NGS01_annovar.*: ANNOVAR annotation files.
  - NGS01_prioritized_variants.txt: Prioritized variants.
- Relevance: Contains the main processed data for analysis.

Snippet (variant calling):

```bash
freebayes -f "$REF_FA" -v "$VCF" "$FILTER_BAM"
```
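
The pipeline falls back to bcftools when FreeBayes fails. A minimal sketch of that fallback logic; the bcftools flags shown are standard usage, not necessarily the script's exact command:

```bash
# Try FreeBayes first; on failure, call variants with bcftools instead
if ! freebayes -f "$REF_FA" -v "$VCF" "$FILTER_BAM"; then
    echo "FreeBayes failed; falling back to bcftools" >&2
    bcftools mpileup -f "$REF_FA" "$FILTER_BAM" \
        | bcftools call -mv -Ov -o "$VCF"
fi
```
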
Snippet (ANNOVAR annotation):

```bash
$ANNOVAR_DIR/table_annovar.pl $FILTER_VCF $HUMANDB_DIR ...
```
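
The `...` elides the annotation options. Following the ANNOVAR documentation, a fuller call might look like this; the protocol list, output prefix, and `$ANALYSIS_DIR` placeholder are illustrative assumptions, not the script's exact values:

```bash
# $ANALYSIS_DIR stands in for the analysis/ directory (placeholder)
$ANNOVAR_DIR/table_annovar.pl "$FILTER_VCF" "$HUMANDB_DIR" \
    -buildver hg19 -out "$ANALYSIS_DIR/NGS01_annovar" -remove \
    -protocol refGene,exac03,avsnp150 -operation g,f,f \
    -nastring . -vcfinput
```
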
### results/

- Description: Stores additional results, specifically snpEff annotations.
- Contents:
  - NGS01_snpEff.vcf: snpEff annotation file.
- Relevance: Contains results from snpEff variant annotation.

Snippet (snpEff annotation):

```bash
java -Xmx4g -jar "$snpeff_dir/snpEff.jar" "$genome_version" "$filter_vcf" > "$results_dir/NGS01_snpEff.vcf"
```
### docs/

- Description: Stores log files and other documentation produced by the pipeline. It can also be used to track download progress.
- Contents:
  - /home/ubuntu/bioinformatics_course/scripts/docs/available_dbs.txt: A list of available annotation databases written by the pipeline.
  - Other log files.
- Relevance: Important for tracking pipeline execution, debugging, and recording download status.
## Usage

To use the pipeline, follow these steps:

- Clone the repository to your local machine:

  ```bash
  git clone https://github.com/LDolanLDolan/bioinformatics_course
  ```

- Navigate to the pipeline directory:

  ```bash
  cd bioinformatics_course
  cd scripts
  ```

- Ensure the pipeline script has execute permissions:

  ```bash
  chmod +x pipeline_one.sh
  ```

- Run the pipeline script:

  ```bash
  ./pipeline_one.sh
  ```
## Pipeline Steps

The pipeline consists of the following steps:
- Downloading raw data files: Downloads the raw FASTQ files and the annotation BED file using wget.
- Decompressing FASTQ files: Decompresses the downloaded .qz files using zcat.
- Trimming reads: Trims the reads using Trimmomatic to remove low-quality bases and adapter sequences.
- Preparing reference genome: Downloads the hg19 reference genome if it doesn't exist and decompresses it.
- Indexing reference genome: Indexes the reference genome using BWA and samtools faidx.
- Aligning reads: Aligns the trimmed reads to the reference genome using BWA.
- Processing alignment files: Converts the SAM file to BAM format, sorts the BAM file by coordinate, and indexes the sorted BAM file using samtools.
- Marking duplicates: Marks duplicate reads using Picard MarkDuplicates to prepare for variant calling.
- Filtering BAM file: Filters the BAM file to remove unmapped reads and keep reads with a mapping quality of 20 or higher.
- Variant calling: Calls variants using FreeBayes, with a fallback to bcftools if FreeBayes fails.
- Filtering variants: Filters the variants using vcffilter.
- Variant annotation (ANNOVAR): Annotates the variants using ANNOVAR.
- Variant annotation (snpEff): Annotates the variants using snpEff.
- Variant prioritization: Performs a basic variant prioritization step (see the sketch after this list).
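
The script's own prioritization logic is not detailed here. One minimal illustration, assuming the tab-separated ANNOVAR multianno output, is to keep exonic non-synonymous calls; the column positions depend on the protocols used, so check your table's header first:

```bash
# Keep exonic nonsynonymous variants from the ANNOVAR table
# ($6 = Func.refGene, $9 = ExonicFunc.refGene for a refGene-first protocol)
awk -F'\t' '$6 == "exonic" && $9 == "nonsynonymous SNV"' \
    NGS01_annovar.hg19_multianno.txt \
    > NGS01_prioritized_variants.txt
```
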
## Input Data

The pipeline expects the following input data:

- Raw FASTQ files (NGS0001.R1.fastq.qz, NGS0001.R2.fastq.qz)
- Annotation BED file (annotation.bed)
- Reference genome (hg19.fa)
## Output Data

The pipeline generates the following output data:
- Trimmed FASTQ files
- Aligned BAM files
- Variant calls in VCF format
- Annotated variants (ANNOVAR and snpEff)
- Prioritized variants
## Error Handling

The pipeline includes error handling at each step. It checks for:
- Successful download of files.
- Successful execution of tools.
- Existence and size of output files.
- Sufficient disk space.
If an error occurs, the pipeline will print an error message and exit.
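
The checks above can be expressed as a small helper; a sketch in that spirit (the function name and messages are illustrative, not the script's own code):

```bash
# Abort with a message if a step's output file is missing or empty
check_output() {
    local file="$1" step="$2"
    if [ ! -s "$file" ]; then   # -s: exists and has size > 0
        echo "ERROR: $step produced no output ($file)" >&2
        exit 1
    fi
}
check_output "$SORT_BAM" "samtools sort"
```
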
## Disk Space Management

The pipeline includes checks for available disk space before performing disk-intensive operations. It also attempts to clean up intermediate files to free up space.
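
A disk-space check along these lines (the 50 GB threshold is an example value; GNU `df` is assumed):

```bash
required_kb=$((50 * 1024 * 1024))                   # 50 GB, in KB
available_kb=$(df --output=avail -k "$HOME" | tail -n 1)
if [ "$available_kb" -lt "$required_kb" ]; then
    echo "ERROR: insufficient disk space in $HOME" >&2
    exit 1
fi
```
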
## Cleanup

The pipeline attempts to remove unnecessary intermediate files (sketched after this list), such as:
- Uncompressed raw FASTQ files.
- SAM files after BAM conversion.
- Unsorted BAM files after sorting.
- Compressed reference genome after decompression.
- Temporary files in /tmp.
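
A sketch of that cleanup, reusing variable names from the snippets above; `$REF_DIR` and the /tmp glob are placeholders, not the script's exact paths:

```bash
rm -f "$UNTRIMMED"/R1.fastq "$UNTRIMMED"/R2.fastq  # uncompressed FASTQs
rm -f "$SAM"                                       # SAM after BAM conversion
rm -f "$BAM"                                       # unsorted BAM after sorting
rm -f "$REF_DIR"/hg19.fa.gz                        # compressed reference copy
rm -f /tmp/pipeline_*                              # temporary files (example glob)
```
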
## Notes

- The pipeline assumes that all required software is installed and configured correctly. You may need to adjust paths to executables and adapter files.
- The pipeline downloads the hg19 reference genome. You can replace this with your own reference genome if needed.
- The ANNOVAR databases are downloaded dynamically by the script. Ensure you have a working internet connection and that ANNOVAR is configured correctly.
- The variant prioritization step is basic. You may need to modify it for your specific needs.
- Consider adding a logging mechanism to record the pipeline's progress and any errors (see the sketch after this list).
- This script is designed to be run on a Linux system. You may need to modify it for other operating systems.
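
For the logging note above, a minimal approach is to mirror all script output to a timestamped file in docs/ (the log path is an example; process substitution requires Bash):

```bash
# Mirror stdout and stderr to a timestamped log file
LOG_FILE=~/ngs_course/assessment_pipeline/docs/pipeline_$(date +%Y%m%d_%H%M%S).log
exec > >(tee -a "$LOG_FILE") 2>&1
echo "Pipeline started: $(date)"
```
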
## Prerequisites

- Bash: Ensure you have Bash installed (version 4.0 or higher).
- BWA: For sequence alignment.
- SAMtools: For processing SAM/BAM files.
- Picard: For marking duplicates.
- FreeBayes: For variant calling.
- ANNOVAR: For variant annotation.
## Version Control

It is recommended to use version control (e.g., Git) to track changes to the pipeline script.
## Author

Lita Doolan (litadoolan.net)