Add script to run analysis per chromosome on HPC. (#6)

whimsial · web-flow · commit 8fd37aa17c54 · 2023-03-03T17:53:45.000Z
* Add script to run analysis per chromosome on HPC.

* Describe HPC analysis in pipeline.R and refactor code.
diff --git a/genoscores/analysis.hpc.cli.sh b/genoscores/analysis.hpc.cli.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+
+#' This script is used to run GENOSCORES on a single chromosome in a HPC
+#' environment with qsub
+## ============================================================================
+
+## Read the path for the directory where the results are to be stored. Typically
+## it is on scratch space on HPC
+outputdir=$1
+
+## Read the base name (with full path) of the input file (excluding ID).
+## Example: if files are called /home/user/cohort_chr<1-22>.bim|bed|fam,
+## set basename="/home/user/cohort_chr"
+basename=$2
+
+## Read path to the GENOSCORES analysis R script. Here you can use
+## example.analysis.R
+## Example: gscript="/home/user/example.analysis.R"
+gscript=$3
+
+echo "GENOSCORES analysis script: $gscript will be executed."
+
+## Read chromosome number from command line
+chr=$4
+
+## Function to process one chromosome at a time.
+## If you target file has suffixes after chromosome ID, for example:
+## 'cohort_chr21_filtered', they need to be manually added to inputfile:
+## inputfile="${basename}/cohort_chr${chr}_filtered"
+processchrom() {
+    inputfile="${basename}/${chr}"
+    chromdir="${outputdir}/${chr}"
+
+    if [[ ! -d "${chromdir}" ]]; then
+        mkdir -p "${chromdir}"
+    fi
+
+    logfile="${chromdir}"/log.Rout
+
+    echo "Processing chromosome: ${chr}"
+    echo "Input file: ${inputfile}"
+    echo "Saving results to: ${chromdir} ..."
+
+    ## Run main GENOSCORES script with CLI options:
+    ## chromdir: subdirectory in the output directory to store results from
+    ##           the analysis for the current chromosome.
+    ## inputfile: PLINK file with the current chromosome.
+    ## logfile: file to store the logs from GENOSCORES.
+    Rscript --no-save --no-restore --verbose "$(realpath "${gscript}")" \
+         "${chromdir}" "${inputfile}" 2> "${logfile}" > /dev/null
+    echo "Done."
+}
+
+processchrom
+
+echo "All analyses have been executed."
diff --git a/genoscores/readme.md b/genoscores/readme.md
@@ -12,7 +12,8 @@ We compute the regional scores in the specified target cohort using weights from
   - `gscript` -- path to the GENOSCORES script.
   - `N` -- how many chromosomes do you want to run in parallel? This depends on the size of the target cohort and available computational resources.
   - `inputfile` -- modify accordingly to the full name of your target cohort files. If the names have any suffixes after chromosome ID, these need to be added.
-  
+3. [analysis.hpc.cli.sh](analysis.hpc.cli.sh) is a BASH script which executes [example.analysis.R](example.analysis.R) for a single specified chromosome. This script takes one additional CLI argument `chr` which denotes the chromosome number for which analysis is to be run. This script is useful for high performance computing (HPC) type of analysis where each chromsome can be submitted to a queue within a job array created by the job script.
+
 ## General notes
 
 1. If multiple analyses are to be run (eQTL, pQTL, etc), then it may be helpful to either:
diff --git a/pipeline.R b/pipeline.R
@@ -98,7 +98,11 @@ basename <- file.path(...)
 gscript <- file.path("genoscores", "example.analysis.R")
 
 cmd <- sprintf("bash genoscores/analysis.cli.sh %s %s %s",
-              output.dir, basename, gscript)
+               output.dir, basename, gscript)
+## Alternatively, we could run analysis.hpc.cli.sh specifying chromosome number
+## as additional parmeter. This is useful when submitting job arrays on HPC
+## cluster, where each job ID corresponds to a single chromosome analysis.
+
 system(cmd)
 
 ##------------------------------------------------------------------------------