Snakemake based pipeline for ChIP-seq and ATAC-seq datasets processing from raw data QC and alignment to visualization and peak calling.
During peak calling steps chipseq-smk-pipeline automatically matches signal with control file by names proximity.
Input FASTQ files
Pipeline aligned FASTQ or gzipped FASTQ reads, defined in config.yaml.
Reads folder is a relative path in pipeline working directory and defined by fastq_dir property.
FASTQ reads extension is defined by fastq_ext property, e.g. could be fq, fq.gz, fastq, fastq.gz.
Input BAM files
Use start_with_bams=True config option to start with existing bam files.
Pipeline starts with BAM files in work_dir/bams folder.
| Path | Description |
|---|---|
config.yaml |
Default pipeline options |
trimmed |
Trimmed FASTQ file, if trim_reads option is True. |
bams |
BAMs with aligned reads, MAPQ >= 30 |
bw |
BAM coverage visualization using DeepTools |
<peak_caller_name> |
Peaks provided by peak caller tool <peak_caller_name> |
qc |
QC Reports |
multiqc |
MultiQC reports for different steps |
logs |
Shell commands logs |
The pipeline requires conda.
- If
condais not installed, follow the instructions at Conda website. - Navigate to repository directory.
Create a Conda environment for snakemake:
$ conda env create --file environment.yaml --name snakemakeActivate the newly created environment:
$ source activate snakemakeOn Ubuntu please ensure that gawk is installed:
$ sudo apt-get install gawkRun the pipeline to start with fastq reads:
$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
all [--cores <cores>] --use-conda --directory <work_dir> \
--config genome=<genome> fastq_dir=<fastq_dir> --rerun-incompleteRun the pipeline to start with bams:
$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
all [--cores <cores>] --use-conda --directory <work_dir> \
--config genome=<genome> start_with_bams=True bams_dir=<bams_dir> --rerun-incompleteSee config.yaml for a complete list of parameters. Use--config to override default options from config.yaml file.
The Default pipeline doesn't perform coverage visualization and launch peak callers.
Please add bw=True, <peak_caller_name>=True to create coverage bw files and call peaks with <peak_caller_name>.
Run the pipeline with all peak callers in default mode:
$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
all [--cores <cores>] --use-conda --directory <work_dir> \
--config genome=<genome> \
macs2=True sicer=True homer=True hotspot=True peakseq=True lanceotron=True omnipeak=True \
--rerun-incomplete --rerun-trigger mtime;Run the pipeline with peak callers in alternative mode (MACS2 broad, HOMER histone):
$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
all [--cores <cores>] --use-conda --directory <work_dir> \
--config genome=<genome> \
macs2=True macs2_mode=broad macs2_params="--broad --broad-cutoff 0.1" macs2_suffix=broad0.1 \
homer=True homer_style=histone homer_suffix=regions.bed \
--rerun-incomplete --rerun-trigger mtime;Please note, that BayesPeak, Hotspot, and PeakSeq cannot be installed automatically,
so passing <peak_caller_name>_executable=<executable> is required.
Please see Peak callers installation section for details.
Supported peak caller tools and parameters:
| Peak caller | Snakemake | Command line |
|---|---|---|
| MACS2 | macs2.smk | macs2 callpeak -f BAM -t <signal.bam> -c <control.bam> -q 0.05 |
MACS2 --broad |
macs2.smk | macs2 callpeak -f BAM -t <signal.bam> -c <control.bam> --broad --broad-cutoff 0.1 |
| MACS3 | macs3.smk | macs3 callpeak -f BAM -t <signal.bam> -c <control.bam> -q 0.05 |
MACS3 --broad |
macs3.smk | macs3 callpeak -f BAM -t <signal.bam> -c <control.bam> --broad --broad-cutoff 0.1 |
| SICER | sicer.smk | SICER.sh pileups <signal.pileup.bed> <control.pileup.bed> <outputdir> <species> 1 200 150 <effective genome fraction> 600 0.01 |
| SICER2 | sicer2.smk | sicer -t <signal.pileup.bed> -c <control.pileup.bed> -s hg38 -w 200 -rt 1 -f 150 -egf 0.74 -fdr 0.01 -g 600 -e 1000 |
| HOMER | homer.smk | findPeaks <signal tags dir> -style factor -o <output> -i <control tags dir> * |
HOMER histone |
homer.smk | findPeaks <signal tags dir> -style histone -o <output> -i <control tags dir> * |
| FSeq2 | fseq2.smk | fseq2 callpeak -v -q_thr 0.05 -control_file <control.pileup.bed> -chrom_size_file <chrom.sizes> -o fseq2 -name <name> -standard_narrowpeak <signal.pileup.bed> |
| HotSpot | hotspot.smk | <hotspot_executable> -i <signal.pileup.bed> -o <signal>.hotspot |
| PeakSeq | peakseq.smk | <peakseq_executable> -peak_select <config.dat> ** |
| GPS | gps.smk | java -cp <gps.jar> edu.mit.csail.cgs.deepseq.discovery.GPS --d <input.signal_reads_distribution> --g <chrom.sizes> --expt <signal.bam> --ctlr <control.bam> --f SAM --out gps/<signal> --q 0.05 *** |
| BayesPeak | bayespeak.smk | Rscript <run_bayespeak.R> <signal.pileup.bed> <control.pileup.bed> <signal>.csv **** |
| LanceOtron | lanceotron.smk | lanceotron callPeaksInput <signal.bw> -i <control.bw> -f lanceotron |
| Omnipeak | omnipeak.smk | java --add-modules=jdk.incubator.vector -Xmx8G -jar <omnipeak.jar> analyze -t <signal.bam> --chrom.sizes <chrom.sizes> -c <control.bam --peaks <output.peak> --iterations 10 --threshold 1e-4 --bin 100 --fragment auto --fdr 0.05 |
| SPAN | span.smk | java -Xmx8G -jar <span.jar> analyze -t <signal.bam> --chrom.sizes <chrom.sizes> -c <control.bam --peaks <output.peak> --iterations 10 --threshold 1e-4 --bin 100 --fragment auto --fdr 0.05 |
* HOMER tags preparation
makeTagDirectory <tags dir> <bam>
** PeakSeq config.dat
Experiment_id <signal>
Mappability_map_file <mappability_file>
ChIP_Seq_reads_data_dirs <signal>
Input_reads_data_dirs <control>
chromosome_list_file {chrom.sizes}
narrowPeak_output_file_path peakseq/<signal>.raw
Enrichment_mapped_fragment_length 200
target_FDR 0.05
N_Simulations 50
Minimum_interpeak_distance 200
Background_model Simulated
max_Qvalue 0.05
*** GPS Read Distribution
java -Xmx1G -cp <gps.jar> edu.mit.csail.cgs.deepseq.analysis.GPS_ReadDistribution --g <chrom.sizes> --coords <input.coords> --chipseq <signal.bam> --f SAM --name gps/<signal> --range 250 --smooth 5 --mrc 4
**** BayesPeak run_bayespeak.R
This section contains instructions for manual peak callers installation.
-
BayesPeak
- Install R
mamba install -c conda-forge r-base=3.6.3- In R console
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(version = "3.10") # Explicitly set correct Bioconductor version BiocManager::install(c("IRanges", "GenomicRanges"))- Install BayesPeak
wget https://www.bioconductor.org/packages//2.10/bioc/src/contrib/BayesPeak_1.8.0.tar.gz R CMD INSTALL BayesPeak_1.8.0.tar.gz -
Hotspot
- Install required dependencies
sudo apt-get install build-essential libgsl-dev- Download and make
wget https://github.com/StamLab/hotspot/archive/refs/tags/v4.1.1.zip gunzip v4.1.1.zip cd hotspot-4.1.1/hotspot-distr/hotspot-deploy make -
PeakSeq
Download and makegit clone https://github.com/gersteinlab/PeakSeq.git cd PeakSeq make
Rules DAG produced with additional command line arguments --forceall --rulegraph | dot -Tpdf > rules.pdf
Configure profile for required cluster system with name cluster.
$ mkdir -p ~/.config/snakemake
$ cd ~/.config/snakemake
$ cookiecutter https://github.com/iromeo/generic.gitExample of ATAC-Seq processing on qsub
$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
all --use-conda --directory <work_dir> \
--profile cluster --cluster-config cluster_config.yaml --jobs 150 \
--config fastq_dir=<fastq_dir> genome=<genome> \
bowtie2_params="-X 2000 --dovetail" \
macs2=True macs2_params="-q 0.05 -f BAMPE --nomodel --nolambda -B --call-summits" \
omnipeak=True omnipeak_fragment=0 --rerun-incompleteP.S: Use --config to override default options from config.yaml file
Please download example fastq.gz files
from CD14_chr15_fastq folder.
These files are filtered on human hg19 chr15 to reduce size and make computations faster.
Launch chipseq-smk-pipeline:
$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
all --use-conda --cores all --directory <work_dir> \
--config fastq_ext=fastq.gz fastq_dir=<work_dir> bw=True genome=hg19 macs2=True sicer=True omnipeak=True \
--rerun-incomplete- Learn more about Snakemake workflow management system
- Developed with SnakeCharm plugin for PyCharm IDE by JetBrains Research BioLabs
