This repository contains the data and code used to generate the JASPAR UCSC Genome Browser track data hub.
For more information visit the JASPAR website.
01/07/2018: To speed up TFBS predictions, we switched from MEME and the Perl TFBS package to PWMScan.
- The genomes folder contains scripts to download and process different genome assemblies
- The profiles folder contains the output from the script get-profiles.py, which downloads the JASPAR CORE profiles for different taxa
- The file environment.yml, within the conda folder, contains the conda environment used to generate the genomic tracks for JASPAR 2022 (see installation)
- The script install-pwmscan.sh downloads and installs PWMScan and places its binaries in the bin folder
- The script scan-sequence.py takes as its input the profiles folder and a nucleotide sequence in FASTA format (e.g. a genome), and outputs TFBS predictions
- The script scans2bigBed creates a bigBed track file from TFBS predictions
The original scripts used for the publication of JASPAR 2018 have been placed in the folder version-1.0.
- Python 3.7 with the following libraries: Biopython (<1.74), NumPy, pyfaidx and tqdm
- PWMScan
- UCSC binaries for standalone command-line use
Note that running scan-sequence.py requires only the Python dependencies and PWMScan.
To install PWMScan, execute the script install-pwmscan.sh.
The remaining dependencies can be installed through the conda package manager:
conda env create -f ./conda/environment.yml
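As a quick sanity check after creating the environment, the sketch below reports which of the Python libraries listed above cannot be imported. It is not part of this repository; the helper name `missing_python_modules` is hypothetical, and the import names (`Bio` for Biopython, `numpy` for NumPy) are the usual ones for these packages. It does not check library versions, PWMScan, or the UCSC binaries.

```python
import importlib.util

# Display name -> import name for the Python libraries listed in the requirements.
PYTHON_MODULES = {
    "Biopython": "Bio",
    "NumPy": "numpy",
    "pyfaidx": "pyfaidx",
    "tqdm": "tqdm",
}


def missing_python_modules(modules=PYTHON_MODULES):
    """Return the display names of the modules that cannot be found."""
    return [
        name
        for name, import_name in modules.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_python_modules()
    if missing:
        print("Missing Python libraries:", ", ".join(missing))
    else:
        print("All listed Python libraries were found.")
```

Run it with the conda environment activated; an empty list means the libraries are importable from that environment.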
Genomic tracks and TFBS predictions for human and seven other model organisms, covering 11 genome assemblies, are available online:
To illustrate how the genomic tracks are generated, we provide an example for the baker's yeast genome:
- Download the genome sequence and chromosome sizes (automated in this script)
- Scan the genome sequence using all fungi profiles from the JASPAR CORE
./scan-sequence.py --fasta-file ./genomes/sacCer3/sacCer3.fa --profiles-dir ./profiles/ \
--output-dir ./tracks/sacCer3/ --threads 4 --latest --taxon fungi
For this example, the scanning step should take no longer than a minute. For human and other genomes of similar size, this step usually finishes within a few hours, depending on the number of --threads specified.
- Create the genomic track
./scans2bigBed -c ./genomes/sacCer3/sacCer3.fa.sizes -i ./tracks/sacCer3/ -o ./tracks/sacCer3.bb -t 4
TFBS predictions from the previous step are merged into a bigBed track file. In column five, we use as scores the p-values from PWMScan (scaled to the range 0-1000, where 0 corresponds to p-value = 1 and 1000 to p-value ≤ 10⁻¹⁰). This allows for comparison of prediction confidence across TFBSs. Again, for this example, this step should be completed within a few minutes, while for larger genomes it can take a few hours.
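The score scaling described above can be sketched as follows. This is a minimal illustration, not the code used by scans2bigBed: it assumes the mapping is linear in -log10(p-value) between the two stated endpoints, and the function name `pvalue_to_score` and its parameters are hypothetical.

```python
import math


def pvalue_to_score(p_value, max_score=1000, min_p=1e-10):
    """Map a p-value to an integer score in [0, max_score].

    Assumption: the score is linear in -log10(p), with p = 1 mapped to 0
    and any p <= min_p mapped to max_score.
    """
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in the interval (0, 1]")
    # Cap very small p-values so they all receive the maximum score.
    clipped = max(p_value, min_p)
    # Fraction of the way from p = 1 (0.0) to p = min_p (1.0) on a log scale.
    fraction = math.log10(clipped) / math.log10(min_p)
    return int(round(fraction * max_score))


if __name__ == "__main__":
    for p in (1.0, 1e-5, 1e-10, 1e-20):
        print(p, pvalue_to_score(p))
```

Under this assumption, each tenfold drop in p-value adds 100 to the score, which is what makes scores comparable across profiles of different lengths and information content.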
Important note: disk space requirements for large genomes (i.e. danRer11, hg19, hg38, mm10, and mm39) are substantial. In these cases, we highly recommend allocating at least 1 TB of disk space.