This pipeline is designed to process Next-Generation Sequencing (NGS) data, starting from raw FASTQ files and performing steps such as read trimming, alignment, variant calling, and annotation.
## Table of Contents

- Dependencies
- Installation
- Pipeline Overview
- Directory Structure
- Usage
- Pipeline Steps
- Input Data
- Output Data
- Error Handling
- Disk Space Management
- Cleanup
- Notes
- Prerequisites
- Version Control
- Author
## Dependencies

The pipeline relies on the following software:

- dos2unix: Used to convert file line endings (if necessary).
  - Installation: `sudo apt-get install dos2unix`
- wget: Used to download data files.
  - Installation: `sudo apt-get install wget`
- zcat: Used to decompress the .qz files; usually part of the gzip package.
  - Installation: usually pre-installed, or `sudo apt-get install gzip`
- Trimmomatic: Used to trim low-quality reads and adapter sequences.
  - Installation: Download from https://github.com/usadellab/Trimmomatic and follow the instructions. The pipeline assumes Trimmomatic is installed and that the adapters file is located at /home/ubuntu/anaconda3/pkgs/trimmomatic-0.39-hdfd78af_2/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa; you may need to adjust this path in the script.
- BWA: Used to align reads to the reference genome.
  - Installation: `sudo apt-get install bwa`
- samtools: Used to process SAM/BAM files (convert, sort, index, filter).
  - Installation: `sudo apt-get install samtools`
- picard: Used to mark duplicate reads.
  - Installation: Download the latest .jar file from https://github.com/broadinstitute/picard. The pipeline assumes picard is on the user's PATH; otherwise, specify the full path to the .jar file in the script.
- freebayes: Used for variant calling.
  - Installation: `sudo apt-get install freebayes`
- bcftools: Used as a fallback variant caller.
  - Installation: `sudo apt-get install bcftools`
- vcffilter: Part of the vcflib package, used to filter variants.
  - Installation: `sudo apt-get install libvcflib-tools`
- ANNOVAR: Used to annotate variants.
  - Installation: Download from https://annovar.openbioinformatics.org/ and follow the instructions. You will also need to download the required databases. The pipeline assumes ANNOVAR is installed at ~/annovar.
- snpEff: Used to annotate variants.
  - Installation: Download from http://snpeff.sourceforge.net/ and follow the instructions. You will also need to download the hg19 database.
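
After installing, you can quickly verify that the command-line tools are on your PATH (a convenience check, not part of the pipeline):

```bash
# Report any command-line dependency that is not on PATH
for tool in dos2unix wget zcat bwa samtools freebayes bcftools vcffilter; do
    command -v "$tool" >/dev/null 2>&1 || echo "Missing: $tool"
done
```
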
## Installation

- Clone the repository:

  ```bash
  git clone <repository_url>
  cd <repository_name>
  ```

- Create the necessary directories. The pipeline script will create these, but you can create them manually if needed:

  ```bash
  mkdir -p ~/ngs_course/assessment_pipeline/data/untrimmed_fastq
  mkdir -p ~/ngs_course/assessment_pipeline/data/trimmed_fastq
  mkdir -p ~/ngs_course/assessment_pipeline/data/aligned_data
  mkdir -p ~/ngs_course/assessment_pipeline/analysis
  mkdir -p ~/ngs_course/assessment_pipeline/results
  mkdir -p ~/ngs_course/assessment_pipeline/docs
  mkdir -p ~/ngs_course/assessment_pipeline/data/reference
  mkdir -p ~/annovar  # if you are using the default path
  ```

- Download the reference genome: the pipeline downloads hg19, but you can place your own reference genome in the data/reference/ directory.

- Download the ANNOVAR databases: the pipeline attempts to download the required databases, but you may need to do this manually depending on your network and ANNOVAR setup.

- Set permissions: ensure the script has execute permissions:

  ```bash
  chmod +x pipeline.sh
  ```
## Pipeline Overview

The pipeline performs the following steps:
- Downloads raw FASTQ data.
- Decompresses the FASTQ files.
- Trims the reads using Trimmomatic.
- Prepares the reference genome (downloads if necessary).
- Indexes the reference genome using BWA and samtools.
- Aligns the trimmed reads to the reference genome using BWA.
- Processes the alignment files (converts SAM to BAM, sorts, indexes).
- Marks duplicate reads using Picard.
- Filters the BAM file.
- Calls variants using FreeBayes (with bcftools fallback).
- Filters variants.
- Annotates variants using ANNOVAR.
- Annotates variants using snpEff.
- Performs basic variant prioritization.
## Directory Structure

```
ngs_course/assessment_pipeline/
├── pipeline.sh
├── data/
│   ├── untrimmed_fastq/
│   ├── trimmed_fastq/
│   ├── aligned_data/
│   ├── reference/
│   └── annotation.bed
├── analysis/
├── results/
└── docs/
```
### data/

This is the main data directory. It contains the following subdirectories and files:

- Description: Stores all input and intermediate data files.
- Contents:
  - untrimmed_fastq/: Stores the raw, untrimmed FASTQ files.
  - trimmed_fastq/: Stores the trimmed FASTQ files.
  - aligned_data/: Stores the aligned reads in SAM and BAM formats.
  - reference/: Stores the reference genome files.
  - annotation.bed: Stores the annotation BED file.
- Relevance: Essential for organizing the input and output of each processing step.
### data/untrimmed_fastq/

- Description: Stores the raw, untrimmed FASTQ files.
- Contents:
  - NGS0001.R1.fastq.qz: Raw reads 1 in compressed format.
  - NGS0001.R2.fastq.qz: Raw reads 2 in compressed format.
  - NGS0001.R1.fastq: Decompressed reads 1.
  - NGS0001.R2.fastq: Decompressed reads 2.
- Relevance: Input for the trimming step.

Snippet:

```bash
zcat $FILE_R1 > $UNTRIMMED/R1.fastq
zcat $FILE_R2 > $UNTRIMMED/R2.fastq
```
### data/trimmed_fastq/

- Description: Stores the trimmed FASTQ files after Trimmomatic processing.
- Contents:
  - trimmed_1P.fastq.gz: Trimmed paired-end reads 1 (compressed).
  - trimmed_1U.fastq.gz: Trimmed unpaired reads 1 (compressed).
  - trimmed_2P.fastq.gz: Trimmed paired-end reads 2 (compressed).
  - trimmed_2U.fastq.gz: Trimmed unpaired reads 2 (compressed).
- Relevance: Output from the trimming step and input for the alignment step.

Snippet:

```bash
trimmomatic PE -threads 4 -phred33 $R1 $R2 \
    $TRIMMED/trimmed_1P.fastq.gz $TRIMMED/trimmed_1U.fastq.gz \
    $TRIMMED/trimmed_2P.fastq.gz $TRIMMED/trimmed_2U.fastq.gz ...
```
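
The trailing `...` elides the trimming options. A fuller invocation might look like the following; the ILLUMINACLIP thresholds and the quality/length settings are illustrative values rather than the script's exact parameters (the adapter path is the one noted under Dependencies):

```bash
trimmomatic PE -threads 4 -phred33 $R1 $R2 \
    $TRIMMED/trimmed_1P.fastq.gz $TRIMMED/trimmed_1U.fastq.gz \
    $TRIMMED/trimmed_2P.fastq.gz $TRIMMED/trimmed_2U.fastq.gz \
    ILLUMINACLIP:/home/ubuntu/anaconda3/pkgs/trimmomatic-0.39-hdfd78af_2/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa:2:30:10 \
    TRAILING:25 MINLEN:50  # example thresholds; adjust to your data
```
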
### data/aligned_data/

- Description: Stores the aligned reads in SAM and BAM formats.
- Contents:
  - NGS01.sam: Aligned reads in SAM format (intermediate).
  - NGS01.bam: Aligned reads in BAM format (intermediate).
  - NGS01_sorted.bam: Sorted BAM file.
  - NGS01_marked.bam: BAM file with marked duplicates.
  - NGS01_filtered.bam: Filtered BAM file.
- Relevance: Output from the alignment and duplicate marking steps; input for variant calling.

Snippet:

```bash
bwa mem "$REF_FA" $TRIMMED/trimmed_1P.fastq.gz $TRIMMED/trimmed_2P.fastq.gz -o $SAM
samtools view -b $SAM > $BAM
samtools sort $BAM > $SORT_BAM
picard MarkDuplicates I=$SORT_BAM O=$MARK_BAM M=$ALIGNED/marked_dup_metrics.txt
samtools view -F 0x4 -q 20 -b $MARK_BAM > $FILTER_BAM
```
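
As an optional sanity check after filtering (a suggestion, not a pipeline step), samtools can summarize the alignment:

```bash
samtools index $FILTER_BAM     # index the filtered BAM
samtools flagstat $FILTER_BAM  # totals, mapped %, properly paired, duplicates
```
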
### data/reference/

- Description: Stores the reference genome files.
- Contents:
  - hg19.fa.gz: Compressed reference genome (downloaded).
  - hg19.fa: Decompressed reference genome.
  - hg19.fa.fai: samtools index file.
  - hg19.fa.amb, hg19.fa.ann, hg19.fa.bwt, etc.: BWA index files.
- Relevance: Required for alignment and variant calling.

Snippet:

```bash
bwa index "$REF_FA"      # BWA indexing
samtools faidx "$REF_FA" # samtools indexing
```
### analysis/

- Description: Stores the primary variant calling and annotation results.
- Contents:
  - NGS01.vcf: Raw variant calls.
  - NGS01_filtered.vcf: Filtered variant calls.
  - NGS01_annovar.*: ANNOVAR annotation files.
  - NGS01_prioritized_variants.txt: Prioritized variants.
- Relevance: Contains the main processed data for analysis.

Snippet (variant calling):

```bash
freebayes -f "$REF_FA" -v "$VCF" "$FILTER_BAM"
```
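
The pipeline falls back to bcftools when FreeBayes fails. A minimal sketch of that fallback logic; the bcftools flags shown are standard usage, not necessarily the script's exact command:

```bash
# Try FreeBayes first; on failure, call variants with bcftools instead
if ! freebayes -f "$REF_FA" -v "$VCF" "$FILTER_BAM"; then
    echo "FreeBayes failed; falling back to bcftools" >&2
    bcftools mpileup -f "$REF_FA" "$FILTER_BAM" \
        | bcftools call -mv -Ov -o "$VCF"
fi
```
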
Snippet (ANNOVAR annotation):

```bash
$ANNOVAR_DIR/table_annovar.pl $FILTER_VCF $HUMANDB_DIR ...
```
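
The `...` elides the annotation options. Following the ANNOVAR documentation, a fuller call might look like this; the protocol list, output prefix, and `$ANALYSIS_DIR` placeholder are illustrative assumptions, not the script's exact values:

```bash
# $ANALYSIS_DIR stands in for the analysis/ directory (placeholder)
$ANNOVAR_DIR/table_annovar.pl "$FILTER_VCF" "$HUMANDB_DIR" \
    -buildver hg19 -out "$ANALYSIS_DIR/NGS01_annovar" -remove \
    -protocol refGene,exac03,avsnp150 -operation g,f,f \
    -nastring . -vcfinput
```
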
### results/

- Description: Stores additional results, specifically snpEff annotations.
- Contents:
  - NGS01_snpEff.vcf: snpEff annotation file.
- Relevance: Contains results from snpEff variant annotation.

Snippet (snpEff annotation):

```bash
java -Xmx4g -jar "$snpeff_dir/snpEff.jar" "$genome_version" "$filter_vcf" > "$results_dir/NGS01_snpEff.vcf"
```
### docs/

- Description: Stores log files and other documentation produced by the pipeline. It can also be used to track download progress.
- Contents:
  - /home/ubuntu/bioinformatics_course/scripts/docs/available_dbs.txt: A list of available annotation databases written by the pipeline.
  - Other log files.
- Relevance: Important for tracking pipeline execution, debugging, and recording download status.
## Usage

To use the pipeline, follow these steps:

- Clone the repository to your local machine:

  ```bash
  git clone https://github.com/LDolanLDolan/bioinformatics_course
  ```

- Navigate to the pipeline directory:

  ```bash
  cd bioinformatics_course
  cd scripts
  ```

- Ensure the pipeline script has execute permissions:

  ```bash
  chmod +x pipeline_one.sh
  ```

- Run the pipeline script:

  ```bash
  ./pipeline_one.sh
  ```
## Pipeline Steps

The pipeline consists of the following steps:
- Downloading raw data files: Downloads the raw FASTQ files and the annotation BED file using wget.
- Decompressing FASTQ files: Decompresses the downloaded .qz files using zcat.
- Trimming reads: Trims the reads using Trimmomatic to remove low-quality bases and adapter sequences.
- Preparing reference genome: Downloads the hg19 reference genome if it doesn't exist and decompresses it.
- Indexing reference genome: Indexes the reference genome using BWA and samtools faidx.
- Aligning reads: Aligns the trimmed reads to the reference genome using BWA.
- Processing alignment files: Converts the SAM file to BAM format, sorts the BAM file by coordinate, and indexes the sorted BAM file using samtools.
- Marking duplicates: Marks duplicate reads using Picard MarkDuplicates to prepare for variant calling.
- Filtering BAM file: Filters the BAM file to remove unmapped reads and keep reads with a mapping quality of 20 or higher.
- Variant calling: Calls variants using FreeBayes, with a fallback to bcftools if FreeBayes fails.
- Filtering variants: Filters the variants using vcffilter.
- Variant annotation (ANNOVAR): Annotates the variants using ANNOVAR.
- Variant annotation (snpEff): Annotates the variants using snpEff.
- Variant prioritization: Performs a basic variant prioritization step (see the sketch after this list).
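
The script's own prioritization logic is not detailed here. One minimal illustration, assuming the tab-separated ANNOVAR multianno output, is to keep exonic non-synonymous calls; the column positions depend on the protocols used, so check your table's header first:

```bash
# Keep exonic nonsynonymous variants from the ANNOVAR table
# ($6 = Func.refGene, $9 = ExonicFunc.refGene for a refGene-first protocol)
awk -F'\t' '$6 == "exonic" && $9 == "nonsynonymous SNV"' \
    NGS01_annovar.hg19_multianno.txt \
    > NGS01_prioritized_variants.txt
```
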
## Input Data

The pipeline expects the following input data:

- Raw FASTQ files (NGS0001.R1.fastq.qz, NGS0001.R2.fastq.qz)
- Annotation BED file (annotation.bed)
- Reference genome (hg19.fa)
## Output Data

The pipeline generates the following output data:
- Trimmed FASTQ files
- Aligned BAM files
- Variant calls in VCF format
- Annotated variants (ANNOVAR and snpEff)
- Prioritized variants
## Error Handling

The pipeline includes error handling at each step. It checks for:
- Successful download of files.
- Successful execution of tools.
- Existence and size of output files.
- Sufficient disk space.
If an error occurs, the pipeline will print an error message and exit.
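
The checks above can be expressed as a small helper; a sketch in that spirit (the function name and messages are illustrative, not the script's own code):

```bash
# Abort with a message if a step's output file is missing or empty
check_output() {
    local file="$1" step="$2"
    if [ ! -s "$file" ]; then   # -s: exists and has size > 0
        echo "ERROR: $step produced no output ($file)" >&2
        exit 1
    fi
}
check_output "$SORT_BAM" "samtools sort"
```
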
## Disk Space Management

The pipeline includes checks for available disk space before performing disk-intensive operations. It also attempts to clean up intermediate files to free up space.
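
A disk-space check along these lines (the 50 GB threshold is an example value; GNU `df` is assumed):

```bash
required_kb=$((50 * 1024 * 1024))                   # 50 GB, in KB
available_kb=$(df --output=avail -k "$HOME" | tail -n 1)
if [ "$available_kb" -lt "$required_kb" ]; then
    echo "ERROR: insufficient disk space in $HOME" >&2
    exit 1
fi
```
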
## Cleanup

The pipeline attempts to remove unnecessary intermediate files (sketched after this list), such as:
- Uncompressed raw FASTQ files.
- SAM files after BAM conversion.
- Unsorted BAM files after sorting.
- Compressed reference genome after decompression.
- Temporary files in /tmp.
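
A sketch of that cleanup, reusing variable names from the snippets above; `$REF_DIR` and the /tmp glob are placeholders, not the script's exact paths:

```bash
rm -f "$UNTRIMMED"/R1.fastq "$UNTRIMMED"/R2.fastq  # uncompressed FASTQs
rm -f "$SAM"                                       # SAM after BAM conversion
rm -f "$BAM"                                       # unsorted BAM after sorting
rm -f "$REF_DIR"/hg19.fa.gz                        # compressed reference copy
rm -f /tmp/pipeline_*                              # temporary files (example glob)
```
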
## Notes

- The pipeline assumes that all required software is installed and configured correctly. You may need to adjust paths to executables and adapter files.
- The pipeline downloads the hg19 reference genome. You can replace this with your own reference genome if needed.
- The ANNOVAR databases are downloaded dynamically by the script. Ensure you have a working internet connection and that ANNOVAR is configured correctly.
- The variant prioritization step is basic. You may need to modify it for your specific needs.
- Consider adding a logging mechanism to record the pipeline's progress and any errors (see the sketch after this list).
- This script is designed to be run on a Linux system. You may need to modify it for other operating systems.
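
For the logging note above, a minimal approach is to mirror all script output to a timestamped file in docs/ (the log path is an example; process substitution requires Bash):

```bash
# Mirror stdout and stderr to a timestamped log file
LOG_FILE=~/ngs_course/assessment_pipeline/docs/pipeline_$(date +%Y%m%d_%H%M%S).log
exec > >(tee -a "$LOG_FILE") 2>&1
echo "Pipeline started: $(date)"
```
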
## Prerequisites

- Bash: Ensure you have Bash installed (version 4.0 or higher).
- BWA: For sequence alignment.
- SAMtools: For processing SAM/BAM files.
- Picard: For marking duplicates.
- FreeBayes: For variant calling.
- ANNOVAR: For variant annotation.
## Version Control

It is recommended to use version control (e.g., Git) to track changes to the pipeline script.
## Author

Lita Doolan (litadoolan.net)