Commit 0fc918a

converted submodule4 to AWS
1 parent 7472d8b commit 0fc918a

7 files changed, +536 -239 lines changed

AWS/01-RNA-Seq/rnaseq-aws.config

+1-1
@@ -7,7 +7,7 @@ profiles {
77
process {
88
executor = 'awsbatch' // name of your Compute environments
99
queue = 'nextflow-batch-job-queue' // name of your Job queue
10-
container = 'nf-core/rnaseq'
10+
container = 'quay.io/nextflow/rnaseq-nf:v1.1'
1111

1212
}
1313
workDir = 's3://your_bucket_name/rna-tmp/' // path of your working directory

AWS/02-RRBS/rrbs-aws.config

+1-1
@@ -7,7 +7,7 @@ profiles {
77
process {
88
executor = 'awsbatch' // name of your Compute environments
99
queue = 'nextflow-batch-job-queue' // name of your Job queue
10-
container = 'nf-core/methylseq'
10+
container = 'quay.io/nextflow/rnaseq-nf:v1.1'
1111

1212
}
1313
workDir = 's3://nextflow-bucket-test/meth-tmp/' // path of your working directory

AWS/New-Data.ipynb

+235
@@ -0,0 +1,235 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "1d0516a4-c45d-4ecf-b04c-e21e19f933f3",
6+
"metadata": {},
7+
"source": [
8+
"# <span> Module 4: Running the module with new data </span>"
9+
]
10+
},
11+
{
12+
"cell_type": "markdown",
13+
"id": "8af9faa7-692e-4d8e-b9b9-a2070e21d905",
14+
"metadata": {},
15+
"source": [
16+
"In this notebook, we explore how to run this module with a new dataset. The submodules provide a framework for rigorous, scalable analysis, but running them on your own data takes a few adjustments. We walk through that process here so that you can take these notebooks to your research group and use them for your own analysis. The code blocks do not give you all the answers; if you get stuck, use the dropdowns for help. This module relies on Nextflow for the RNA-seq and Methyl-seq analysis, so running the same analysis on a new dataset is largely a matter of updating the config files."
17+
]
18+
},
19+
{
20+
"cell_type": "markdown",
21+
"id": "8209f0e3-a631-49f2-91cf-aa7ce85fea13",
22+
"metadata": {},
23+
"source": [
24+
"## **Importing the example dataset**"
25+
]
26+
},
27+
{
28+
"cell_type": "markdown",
29+
"id": "f923f5d4-597a-4e39-803e-5252f2fe4cd6",
30+
"metadata": {},
31+
"source": [
32+
"Our new dataset comes from a paper by [Hadad et al. Epigenetics Chromatin. 2019](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6781367/) that compares methylation changes in mice as they age and correlates those changes with changes in gene expression. The data are available in SRA under the BioProject accession [PRJNA523985](https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA523985). The impact of methylation on aging, particularly in brain tissue, is of great research interest. There are many samples in this dataset, but we will limit our analysis to young vs. old female mice."
33+
]
34+
},
35+
{
36+
"cell_type": "markdown",
37+
"id": "364ce93b-f726-45f5-81da-19ddd214b244",
38+
"metadata": {},
39+
"source": [
40+
"To download the dataset, follow the instructions in the [tutorial](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/notebooks/SRADownload/SRA-Download.ipynb) on downloading datasets from SRA using the `prefetch` + `fasterq-dump` method. The accession numbers for this analysis are: \n",
41+
"\n",
42+
"- SRR8616802\n",
43+
"- SRR8616795\n",
44+
"- SRR8616796\n",
45+
"- SRR8616777\n",
46+
"- SRR8616778\n",
47+
"- SRR8616772\n",
48+
"- SRR8616799\n",
49+
"- SRR8616800\n",
50+
"- SRR8616801\n",
51+
"- SRR8616787\n",
52+
"- SRR8616788\n",
53+
"- SRR8616789\n",
54+
"\n",
55+
"You can save these accession numbers in a file and use that in the `sra-tools` commands. Once you pull these files from SRA, store them in a storage bucket so that Nextflow can see them in the next steps."
56+
]
57+
},
58+
{
59+
"cell_type": "markdown",
60+
"id": "4218d79f-8431-490a-9bc0-76ba89561c83",
61+
"metadata": {},
62+
"source": [
63+
"## **RNA-Seq analysis**"
64+
]
65+
},
66+
{
67+
"cell_type": "markdown",
68+
"id": "0e33ebbe-60c4-4f73-9cf9-70a2245d6fb9",
69+
"metadata": {},
70+
"source": [
71+
"To run the RNA-Seq portion of this tutorial, you need to update the config file to point to your RNA-Seq reads. Let's look at the `rnaseq-aws.config` file and make the necessary changes. We will need to specify `params.outdir`, `workDir`, `params.input`, and `params.genome`."
72+
]
73+
},
74+
{
75+
"cell_type": "markdown",
76+
"id": "1a96a33c-c580-4aea-889a-02c53378586a",
77+
"metadata": {},
78+
"source": [
79+
"```\n",
80+
"profiles {\n",
81+
" aws {\n",
82+
" // AWS batch parameters\n",
83+
" process.executor = 'awsbatch' // name of your Compute environments\n",
84+
" process.queue = 'nextflow-batch-job-queue' // name of your Job queue\n",
85+
" process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'\n",
86+
" aws.region = 'us-east-1'\n",
87+
"\n",
88+
" // Workflow parameters\n",
89+
" params.outdir = 'FILL-IN-HERE'\n",
90+
" workDir = 'FILL-IN-HERE'\n",
91+
" params.input = 'FILL-IN-HERE'\n",
92+
" params.genome = 'FILL-IN-HERE'\n",
93+
" }\n",
94+
"}\n",
95+
"```"
96+
]
97+
},
98+
{
99+
"cell_type": "markdown",
100+
"id": "be4d3374-abf6-4d0a-9fc2-b9bbf356dc0f",
101+
"metadata": {},
102+
"source": [
103+
"Check the [nf-core RNA-seq](https://nf-co.re/rnaseq/3.12.0) documentation to see how most of these parameters are structured. As with the primary dataset, `workDir` and `outdir` are locations in your Amazon S3 bucket. The sequences are from mice, so we need to specify a mouse genome build. Most of the effort goes into the input parameter: the nf-core documentation specifies that the reads must be listed in a sample sheet. Using the bucket paths to your reads, create a .csv file like the example below, put it in your bucket, and then provide its bucket path in the config file. If you need help, expand the dropdown below to see our suggested values."
104+
]
105+
},
106+
{
107+
"cell_type": "markdown",
108+
"id": "f51d42b4-7b6e-4ab0-8030-b05f8704bfd7",
109+
"metadata": {},
110+
"source": [
111+
"<details>\n",
112+
" <summary>Click for help</summary>\n",
113+
"\n",
114+
"```\n",
115+
"params.outdir = 's3://YOUR-BUCKET/rnaseq/results'\n",
116+
"workDir = 's3://YOUR-BUCKET/rnaseq/work'\n",
117+
"params.input = 's3://YOUR-BUCKET/samplesheet.csv'\n",
118+
"params.genome = 'GRCm38'\n",
119+
"```\n",
120+
"\n",
121+
"</details>\n"
122+
]
123+
},
124+
{
125+
"cell_type": "markdown",
126+
"id": "98f11060-3d5b-43e0-af7a-beda8f4a46c0",
127+
"metadata": {},
128+
"source": [
129+
"Here's how the first row of your sample sheet might look. Make sure to include only the RNA-seq samples in this sample sheet. The methyl-seq samples go in a separate sample sheet when we run that pipeline in the next step."
130+
]
131+
},
132+
{
133+
"cell_type": "markdown",
134+
"id": "351a761b-a1db-4ef8-a671-b5d64b83056d",
135+
"metadata": {},
136+
"source": [
137+
"``` \n",
138+
"sample,fastq_1,fastq_2,strandedness \n",
139+
"Young_1,s3://BUCKETPATH/SRR8616802_1.fastq.gz,s3://BUCKETPATH/SRR8616802_2.fastq.gz,auto\n",
140+
"``` "
141+
]
142+
},
143+
{
144+
"cell_type": "markdown",
145+
"id": "736c254a-a6b9-4901-a666-4bff478dcea0",
146+
"metadata": {},
147+
"source": [
148+
"Once you have that together, run the Nextflow command again to run the nf-core pipeline on this dataset. "
149+
]
150+
},
151+
{
152+
"cell_type": "markdown",
153+
"id": "491e14b9-e0c6-4310-8660-8172db317c80",
154+
"metadata": {},
155+
"source": [
156+
"After the Nextflow quantification completes, you can follow much of the RNA-seq notebook as written using the `salmon.merged.gene_counts.tsv` file that is produced by this step. You will need to update the `sample-info.txt` file to match the new information from these samples, but that is a simple task. As you go through the analysis, think about what comparisons would be interesting to make. Feel free to consult the manuscript for analysis ideas and try to replicate some of their results. "
157+
]
158+
},
159+
{
160+
"cell_type": "markdown",
161+
"id": "89f2cf3a-51ed-4856-a8b1-3b6554e89bbd",
162+
"metadata": {},
163+
"source": [
164+
"## **DNA Methylation Analysis**"
165+
]
166+
},
167+
{
168+
"cell_type": "markdown",
169+
"id": "bfbb1122-2196-45cb-8d93-fbc12bb29913",
170+
"metadata": {},
171+
"source": [
172+
"We will use the same framework for the methylation analysis as for the RNA-seq, which is to adjust the config file and let Nextflow run the pipeline. Let's look at the methylation config file and determine what we need to change. As before, we need to specify the sample sheet input, the genome, `workDir`, and `outdir`. Fill those in below and try running the Nextflow command to run the core methyl-seq analysis."
173+
]
174+
},
175+
{
176+
"cell_type": "markdown",
177+
"id": "c2197a3a-23f3-4b3c-85e0-9acd582b60f8",
178+
"metadata": {},
179+
"source": [
180+
"```\n",
181+
"profiles {\n",
182+
" aws {\n",
183+
" // AWS batch parameters\n",
184+
" process.executor = 'awsbatch' // name of your Compute environments\n",
185+
" process.queue = 'nextflow-batch-job-queue' // name of your Job queue\n",
186+
" process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'\n",
187+
"\n",
188+
" aws.region = 'us-east-1'\n",
189+
" // Workflow parameters\n",
190+
" dag.overwrite = true\n",
191+
" params.outdir = 'FILL-IN-HERE'\n",
192+
" workDir = 'FILL-IN-HERE'\n",
193+
" params.genome = 'FILL-IN-HERE'\n",
194+
" params.input = 'FILL-IN-HERE'\n",
195+
" }\n",
196+
"}\n",
197+
"```"
198+
]
199+
},
200+
{
201+
"cell_type": "markdown",
202+
"id": "88e6f8a7-b349-4144-85fd-859704b04209",
203+
"metadata": {},
204+
"source": [
205+
"Be sure to consult the [nf-core methylseq documentation](https://nf-co.re/methylseq/2.5.0) to see how the input sample sheet is structured. There are some differences from the RNA-seq input. Here's an example of our suggested first line."
206+
]
207+
},
208+
{
209+
"cell_type": "markdown",
210+
"id": "8ae0255f-f710-43dd-857b-9713f8bb97f9",
211+
"metadata": {},
212+
"source": [
213+
"```\n",
214+
"sample,fastq_1,fastq_2 \n",
215+
"Young_1,s3://BUCKETPATH/SRR8616795_1.fastq.gz,s3://BUCKETPATH/SRR8616795_2.fastq.gz\n",
216+
"```"
217+
]
218+
},
219+
{
220+
"cell_type": "markdown",
221+
"id": "7e85dd37-9c5f-4abb-a9d2-107fb8590a37",
222+
"metadata": {},
223+
"source": [
224+
"As with the RNA-seq portion of this notebook, much of the remaining analysis can be completed as written in the RRBS-downstream.ipynb notebook. You will need to update the cells that hard-code sample and file information to match the samples from the new dataset, but this is a straightforward task. After doing this, try working through the Integration.ipynb notebook with your new "
225+
]
226+
}
227+
],
228+
"metadata": {
229+
"language_info": {
230+
"name": "python"
231+
}
232+
},
233+
"nbformat": 4,
234+
"nbformat_minor": 5
235+
}
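
The two sample sheet layouts described in the New-Data.ipynb cells above (RNA-seq with a `strandedness` column, methyl-seq without) can be generated rather than typed by hand. The following is a minimal Python sketch, not part of the commit: the helper name `build_samplesheet` is hypothetical, `s3://BUCKETPATH` is the placeholder used in the notebook, and it assumes paired-end reads named `<accession>_1.fastq.gz` and `<accession>_2.fastq.gz`. Only the two sample-to-accession pairings shown in the notebook are used; the assignment of the remaining accessions to assays and age groups has to come from the SRA metadata.

```python
import csv
import io


def build_samplesheet(samples, bucket_path, strandedness=None):
    """Return nf-core style sample sheet text for paired-end reads.

    samples: list of (sample_name, sra_accession) pairs.
    bucket_path: S3 prefix holding <accession>_1.fastq.gz and _2.fastq.gz
                 (assumed naming, matching the notebook's example rows).
    strandedness: value for the RNA-seq-only column; None omits the column,
                  which gives the methyl-seq layout shown in the notebook.
    """
    header = ["sample", "fastq_1", "fastq_2"]
    if strandedness is not None:
        header.append("strandedness")

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)

    prefix = bucket_path.rstrip("/")
    for name, accession in samples:
        row = [
            name,
            f"{prefix}/{accession}_1.fastq.gz",
            f"{prefix}/{accession}_2.fastq.gz",
        ]
        if strandedness is not None:
            row.append(strandedness)
        writer.writerow(row)
    return buffer.getvalue()


# Pairings taken from the notebook's example rows; extend the lists with
# your own sample names once you know which accession belongs to which assay.
rnaseq_sheet = build_samplesheet(
    [("Young_1", "SRR8616802")], "s3://BUCKETPATH", strandedness="auto"
)
methyl_sheet = build_samplesheet([("Young_1", "SRR8616795")], "s3://BUCKETPATH")

print(rnaseq_sheet, end="")
print(methyl_sheet, end="")
```

Writing each returned string to a `.csv` file and uploading it to your bucket gives the path to supply as `params.input` in the corresponding config file.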

GoogleCloud/01-RNA-Seq/install_rna_seq_packages.sh

+2-4
@@ -1,15 +1,13 @@
11
#!/bin/bash
22

3-
conda update -n base -c defaults conda -y
4-
53
# Create a conda environment
6-
conda create -n r-package-1 -y
4+
conda create -n r-rna-seq r-base=4.3.3 -y
75

86
# Consider addressing your conda initialization instead.
97
source ~/.bashrc
108

119
# Activate the Conda Environment
12-
conda activate r-package-1
10+
conda activate r-rna-seq
1311

1412
# Install packages
1513
conda install -c conda-forge -c bioconda bioconductor-deseq2 -y

GoogleCloud/02-RRBS/install_rrbs_packages.sh

+3-2
@@ -1,15 +1,16 @@
11
#!/bin/bash
22

33
# Create a conda environment
4-
conda create -n r-package-2 r-base=4.2.2 -y
4+
conda create -n r-rrbs r-base=4.3.3 -y
55

66
# Consider addressing your conda initialization instead.
77
source ~/.bashrc
88

99
# Activate the Conda Environment
10-
conda activate r-package-2
10+
conda activate r-rrbs
1111

1212
# Install packages
13+
conda install -c conda-forge r-data.table=1.16.4 -y
1314
conda install bioconda::bioconductor-methylkit -y
1415

1516
conda install bioconda::bioconductor-genomicranges -y
@@ -0,0 +1,62 @@
1+
#!/bin/bash
2+
3+
# Create a conda environment
4+
conda create -n r-integration r-base=4.3.3 -y
5+
6+
# Consider addressing your conda initialization instead
7+
source ~/.bashrc
8+
9+
# Activate the Conda Environment
10+
conda activate r-integration
11+
12+
# Install packages
13+
14+
conda install conda-forge::r-pvclust -y
15+
16+
conda install conda-forge::r-ggnewscale -y
17+
18+
conda install conda-forge::r-ggridges -y
19+
20+
conda install conda-forge::r-europepmc -y
21+
22+
conda install conda-forge::r-ggseqlogo -y
23+
24+
conda install bioconda::bioconductor-methreg -y
25+
26+
conda install bioconda::bioconductor-bsgenome -y
27+
28+
conda install bioconda::bioconductor-clusterprofiler -y
29+
30+
conda install bioconda::bioconductor-pathview -y
31+
32+
conda install bioconda::bioconductor-enrichplot -y
33+
34+
conda install bioconda::bioconductor-repitools -y
35+
36+
conda install bioconda::bioconductor-rnaagecalc -y
37+
38+
conda install bioconda::bioconductor-motifmatchr -y
39+
40+
conda install bioconda::bioconductor-tfbstools -y
41+
42+
conda install conda-forge::r-doparallel -y
43+
44+
conda install bioconda::bioconductor-bsgenome.hsapiens.ucsc.hg38 -y
45+
46+
conda install -c conda-forge r-data.table=1.16.4 -y
47+
conda install bioconda::bioconductor-methylkit -y
48+
49+
conda install bioconda::bioconductor-genomation -y
50+
51+
conda install bioconda::bioconductor-deseq2 -y
52+
53+
conda install conda-forge::r-tidyverse -y
54+
55+
conda install bioconda::bioconductor-motifstack -y
56+
57+
58+
59+
R -e 'install.packages(c("IRkernel"), repos="http://cran.rstudio.com/")'
60+
61+
# Install the kernel specification for Jupyter
62+
R -e 'IRkernel::installspec(name = "R-Intergration", displayname = "R-Intergration")'
