Skip to content

Commit 8fd37aa

Browse files
authored
Add script to run analysis per chromosome on HPC. (#6)
* Add script to run analysis per chromosome on HPC. * Describe HPC analysis in pipeline.R and refactor code.
1 parent e95ea19 commit 8fd37aa

File tree

3 files changed

+63
-2
lines changed

3 files changed

+63
-2
lines changed

genoscores/analysis.hpc.cli.sh

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
#!/usr/bin/env bash
2+
3+
#' This script is used to run GENOSCORES on a single chromosome in a HPC
4+
#' environment with qsub
5+
## ============================================================================
6+
7+
## Read the path for the directory where the results are to be stored. Typically
8+
## it is on scratch space on HPC
9+
outputdir=$1
10+
11+
## Read the base name (with full path) of the input file (excluding ID).
12+
## Example: if files are called /home/user/cohort_chr<1-22>.bim|bed|fam,
13+
## set basename="/home/user/cohort_chr"
14+
basename=$2
15+
16+
## Read path to the GENOSCORES analysis R script. Here you can use
17+
## example.analysis.R
18+
## Example: gscript="/home/user/example.analysis.R"
19+
gscript=$3
20+
21+
echo "GENOSCORES analysis script: $gscript will be executed."
22+
23+
## Read chromosome number from command line
24+
chr=$4
25+
26+
## Function to process one chromosome at a time.
27+
## If you target file has suffixes after chromosome ID, for example:
28+
## 'cohort_chr21_filtered', they need to be manually added to inputfile:
29+
## inputfile="${basename}/cohort_chr${chr}_filtered"
30+
processchrom() {
31+
inputfile="${basename}/${chr}"
32+
chromdir="${outputdir}/${chr}"
33+
34+
if [[ ! -d "${chromdir}" ]]; then
35+
mkdir -p "${chromdir}"
36+
fi
37+
38+
logfile="${chromdir}"/log.Rout
39+
40+
echo "Processing chromosome: ${chr}"
41+
echo "Input file: ${inputfile}"
42+
echo "Saving results to: ${chromdir} ..."
43+
44+
## Run main GENOSCORES script with CLI options:
45+
## chromdir: subdirectory in the output directory to store results from
46+
## the analysis for the current chromosome.
47+
## inputfile: PLINK file with the current chromosome.
48+
## logfile: file to store the logs from GENOSCORES.
49+
Rscript --no-save --no-restore --verbose "$(realpath "${gscript}")" \
50+
"${chromdir}" "${inputfile}" 2> "${logfile}" > /dev/null
51+
echo "Done."
52+
}
53+
54+
processchrom
55+
56+
echo "All analyses have been executed."

genoscores/readme.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,8 @@ We compute the regional scores in the specified target cohort using weights from
1212
- `gscript` -- path to the GENOSCORES script.
1313
- `N` -- how many chromosomes do you want to run in parallel? This depends on the size of the target cohort and available computational resources.
1414
- `inputfile` -- modify accordingly to the full name of your target cohort files. If the names have any suffixes after chromosome ID, these need to be added.
15-
15+
3. [analysis.hpc.cli.sh](analysis.hpc.cli.sh) is a BASH script which executes [example.analysis.R](example.analysis.R) for a single specified chromosome. This script takes one additional CLI argument `chr` which denotes the chromosome number for which analysis is to be run. This script is useful for high performance computing (HPC) type of analysis where each chromsome can be submitted to a queue within a job array created by the job script.
16+
1617
## General notes
1718

1819
1. If multiple analyses are to be run (eQTL, pQTL, etc), then it may be helpful to either:

pipeline.R

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -98,7 +98,11 @@ basename <- file.path(...)
9898
gscript <- file.path("genoscores", "example.analysis.R")
9999

100100
cmd <- sprintf("bash genoscores/analysis.cli.sh %s %s %s",
101-
output.dir, basename, gscript)
101+
output.dir, basename, gscript)
102+
## Alternatively, we could run analysis.hpc.cli.sh specifying chromosome number
103+
## as additional parmeter. This is useful when submitting job arrays on HPC
104+
## cluster, where each job ID corresponds to a single chromosome analysis.
105+
102106
system(cmd)
103107

104108
##------------------------------------------------------------------------------

0 commit comments

Comments
 (0)