chipseq-smk-pipeline

Snakemake based pipeline for ChIP-seq and ATAC-seq datasets processing from raw data QC and alignment to visualization and peak calling.

During peak calling steps chipseq-smk-pipeline automatically matches signal with control file by names proximity.

Input

Input FASTQ files

Pipeline aligned FASTQ or gzipped FASTQ reads, defined in config.yaml.
Reads folder is a relative path in pipeline working directory and defined by fastq_dir property.
FASTQ reads extension is defined by fastq_ext property, e.g. could be fq, fq.gz, fastq, fastq.gz.

Input BAM files

Use start_with_bams=True config option to start with existing bam files.
Pipeline starts with BAM files in work_dir/bams folder.

Files

Path	Description
`config.yaml`	Default pipeline options
`trimmed`	Trimmed FASTQ file, if `trim_reads` option is True.
`bams`	BAMs with aligned reads, `MAPQ >= 30`
`bw`	BAM coverage visualization using DeepTools
`<peak_caller_name>`	Peaks provided by peak caller tool `<peak_caller_name>`
`qc`	QC Reports
`multiqc`	MultiQC reports for different steps
`logs`	Shell commands logs

Requirements

The pipeline requires conda.

If conda is not installed, follow the instructions at Conda website.
Navigate to repository directory.

Create a Conda environment for snakemake:

$ conda env create --file environment.yaml --name snakemake

Activate the newly created environment:

$ source activate snakemake

On Ubuntu please ensure that gawk is installed:

$ sudo apt-get install gawk

Launch the pipeline

Run the pipeline to start with fastq reads:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config genome=<genome> fastq_dir=<fastq_dir>  --rerun-incomplete

Run the pipeline to start with bams:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config genome=<genome> start_with_bams=True bams_dir=<bams_dir> --rerun-incomplete

See config.yaml for a complete list of parameters. Use--config to override default options from config.yaml file.

Launch peak calling

The Default pipeline doesn't perform coverage visualization and launch peak callers.
Please add bw=True, <peak_caller_name>=True to create coverage bw files and call peaks with <peak_caller_name>.

Run the pipeline with all peak callers in default mode:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config genome=<genome> \
    macs2=True sicer=True homer=True hotspot=True peakseq=True lanceotron=True omnipeak=True \
    --rerun-incomplete --rerun-trigger mtime;

Run the pipeline with peak callers in alternative mode (MACS2 broad, HOMER histone):

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config genome=<genome> \
    macs2=True macs2_mode=broad macs2_params="--broad --broad-cutoff 0.1" macs2_suffix=broad0.1 \
    homer=True homer_style=histone homer_suffix=regions.bed \
    --rerun-incomplete --rerun-trigger mtime;

Please note, that BayesPeak, Hotspot, and PeakSeq cannot be installed automatically, so passing <peak_caller_name>_executable=<executable> is required.
Please see Peak callers installation section for details.

Peak callers

Supported peak caller tools and parameters:

Peak caller	Snakemake	Command line
MACS2	macs2.smk	`macs2 callpeak -f BAM -t <signal.bam> -c <control.bam> -q 0.05`
`MACS2 --broad`	macs2.smk	`macs2 callpeak -f BAM -t <signal.bam> -c <control.bam> --broad --broad-cutoff 0.1`
MACS3	macs3.smk	`macs3 callpeak -f BAM -t <signal.bam> -c <control.bam> -q 0.05`
`MACS3 --broad`	macs3.smk	`macs3 callpeak -f BAM -t <signal.bam> -c <control.bam> --broad --broad-cutoff 0.1`
SICER	sicer.smk	`SICER.sh pileups <signal.pileup.bed> <control.pileup.bed> <outputdir> <species> 1 200 150 <effective genome fraction> 600 0.01`
SICER2	sicer2.smk	`sicer -t <signal.pileup.bed> -c <control.pileup.bed> -s hg38 -w 200 -rt 1 -f 150 -egf 0.74 -fdr 0.01 -g 600 -e 1000`
HOMER	homer.smk	`findPeaks <signal tags dir> -style factor -o <output> -i <control tags dir>` *
`HOMER histone`	homer.smk	`findPeaks <signal tags dir> -style histone -o <output> -i <control tags dir>` *
FSeq2	fseq2.smk	`fseq2 callpeak -v -q_thr 0.05 -control_file <control.pileup.bed> -chrom_size_file <chrom.sizes> -o fseq2 -name <name> -standard_narrowpeak <signal.pileup.bed>`
HotSpot	hotspot.smk	`<hotspot_executable> -i <signal.pileup.bed> -o <signal>.hotspot`
PeakSeq	peakseq.smk	`<peakseq_executable> -peak_select <config.dat>` **
GPS	gps.smk	`java -cp <gps.jar> edu.mit.csail.cgs.deepseq.discovery.GPS --d <input.signal_reads_distribution> --g <chrom.sizes> --expt <signal.bam> --ctlr <control.bam> --f SAM --out gps/<signal> --q 0.05` ***
BayesPeak	bayespeak.smk	`Rscript <run_bayespeak.R> <signal.pileup.bed> <control.pileup.bed> <signal>.csv` ****
LanceOtron	lanceotron.smk	`lanceotron callPeaksInput <signal.bw> -i <control.bw> -f lanceotron`
Omnipeak	omnipeak.smk	`java --add-modules=jdk.incubator.vector -Xmx8G -jar <omnipeak.jar> analyze -t <signal.bam> --chrom.sizes <chrom.sizes> -c <control.bam --peaks <output.peak> --iterations 10 --threshold 1e-4 --bin 100 --fragment auto --fdr 0.05`
SPAN	span.smk	`java -Xmx8G -jar <span.jar> analyze -t <signal.bam> --chrom.sizes <chrom.sizes> -c <control.bam --peaks <output.peak> --iterations 10 --threshold 1e-4 --bin 100 --fragment auto --fdr 0.05`

* HOMER tags preparation

makeTagDirectory <tags dir> <bam>

** PeakSeq config.dat

Experiment_id <signal>
Mappability_map_file <mappability_file>
ChIP_Seq_reads_data_dirs <signal>
Input_reads_data_dirs <control>
chromosome_list_file {chrom.sizes}
narrowPeak_output_file_path peakseq/<signal>.raw
Enrichment_mapped_fragment_length 200
target_FDR 0.05
N_Simulations 50
Minimum_interpeak_distance 200
Background_model Simulated
max_Qvalue 0.05

*** GPS Read Distribution

java -Xmx1G -cp <gps.jar> edu.mit.csail.cgs.deepseq.analysis.GPS_ReadDistribution --g <chrom.sizes> --coords <input.coords> --chipseq <signal.bam> --f SAM --name gps/<signal> --range 250 --smooth 5 --mrc 4

**** BayesPeak run_bayespeak.R

Peak callers installation

This section contains instructions for manual peak callers installation.

BayesPeak

Install R

mamba install  -c conda-forge r-base=3.6.3

In R console

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.10")  # Explicitly set correct Bioconductor version
BiocManager::install(c("IRanges", "GenomicRanges"))

Install BayesPeak

wget https://www.bioconductor.org/packages//2.10/bioc/src/contrib/BayesPeak_1.8.0.tar.gz
R CMD INSTALL BayesPeak_1.8.0.tar.gz

Hotspot

Install required dependencies

sudo apt-get install build-essential libgsl-dev

Download and make

wget https://github.com/StamLab/hotspot/archive/refs/tags/v4.1.1.zip
gunzip v4.1.1.zip
cd hotspot-4.1.1/hotspot-distr/hotspot-deploy
make

PeakSeq
Download and make

git clone https://github.com/gersteinlab/PeakSeq.git
cd PeakSeq
make

Rules

Rules DAG produced with additional command line arguments --forceall --rulegraph | dot -Tpdf > rules.pdf

Computational cluster QSUB/LFS/QSUB

Configure profile for required cluster system with name cluster.

$ mkdir -p ~/.config/snakemake
$ cd ~/.config/snakemake
$ cookiecutter https://github.com/iromeo/generic.git

Example of ATAC-Seq processing on qsub

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all --use-conda --directory <work_dir> \
    --profile cluster --cluster-config cluster_config.yaml --jobs 150 \
    --config fastq_dir=<fastq_dir> genome=<genome> \
    bowtie2_params="-X 2000 --dovetail" \
    macs2=True macs2_params="-q 0.05 -f BAMPE --nomodel --nolambda -B --call-summits" \
    omnipeak=True omnipeak_fragment=0 --rerun-incomplete

P.S: Use --config to override default options from config.yaml file

Try with test data

Please download example fastq.gz files from CD14_chr15_fastq folder.
These files are filtered on human hg19 chr15 to reduce size and make computations faster.

Launch chipseq-smk-pipeline:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all --use-conda --cores all --directory <work_dir> \
    --config fastq_ext=fastq.gz fastq_dir=<work_dir> bw=True genome=hg19 macs2=True sicer=True omnipeak=True \
    --rerun-incomplete

Useful links

Learn more about Snakemake workflow management system
Developed with SnakeCharm plugin for PyCharm IDE by JetBrains Research BioLabs

Name		Name	Last commit message	Last commit date
Latest commit History 190 Commits
envs		envs
rules		rules
scripts		scripts
.gitignore		.gitignore
README.md		README.md
Snakefile		Snakefile
cluster_config.yaml		cluster_config.yaml
config.yaml		config.yaml
environment.yaml		environment.yaml
pipeline.png		pipeline.png
pipeline_util.py		pipeline_util.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

chipseq-smk-pipeline

Input

Files

Requirements

Launch the pipeline

Launch peak calling

Peak callers

Peak callers installation

Rules

Computational cluster QSUB/LFS/QSUB

Try with test data

Useful links

About

Uh oh!

Releases

Packages

Contributors 3

Languages

JetBrains-Research/chipseq-smk-pipeline

Folders and files

Latest commit

History

Repository files navigation

chipseq-smk-pipeline

Input

Files

Requirements

Launch the pipeline

Launch peak calling

Peak callers

Peak callers installation

Rules

Computational cluster QSUB/LFS/QSUB

Try with test data

Useful links

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages