|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "1d0516a4-c45d-4ecf-b04c-e21e19f933f3", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# <span> Module 4: Running the module with new data </span>" |
| 9 | + ] |
| 10 | + }, |
| 11 | + { |
| 12 | + "cell_type": "markdown", |
| 13 | + "id": "8af9faa7-692e-4d8e-b9b9-a2070e21d905", |
| 14 | + "metadata": {}, |
| 15 | + "source": [ |
| 16 | + "In this notebook, we explore how to run this module with a new dataset. The submodules provide a framework for a rigorous and scalable analysis, but running them on your own data takes some adjustments. We walk through that process here so that you can take these notebooks to your research group and use them for your own analysis. The code blocks do not give you all the answers, but if you get stuck, use the dropdowns for help. This module relies largely on Nextflow for the RNA-seq and Methyl-seq analysis, so running the same analysis on a new dataset mostly means updating the config files." |
| 17 | + ] |
| 18 | + }, |
| 19 | + { |
| 20 | + "cell_type": "markdown", |
| 21 | + "id": "8209f0e3-a631-49f2-91cf-aa7ce85fea13", |
| 22 | + "metadata": {}, |
| 23 | + "source": [ |
| 24 | + "## **Importing the example dataset**" |
| 25 | + ] |
| 26 | + }, |
| 27 | + { |
| 28 | + "cell_type": "markdown", |
| 29 | + "id": "f923f5d4-597a-4e39-803e-5252f2fe4cd6", |
| 30 | + "metadata": {}, |
| 31 | + "source": [ |
| 32 | + "Our new dataset comes from a paper by [Hadad et al. Epigenetics Chromatin. 2019](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6781367/) that compares methylation changes in mice as they age and correlates those changes with changes in gene expression. The data are available in SRA under the BioProject accession [PRJNA523985](https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA523985). The impact of methylation on aging, particularly in brain tissue, is of great research interest. There are many samples in this dataset, but we will limit our analysis to young vs old female mice." |
| 33 | + ] |
| 34 | + }, |
| 35 | + { |
| 36 | + "cell_type": "markdown", |
| 37 | + "id": "364ce93b-f726-45f5-81da-19ddd214b244", |
| 38 | + "metadata": {}, |
| 39 | + "source": [ |
| 40 | + "To download the dataset, follow the instructions in the [tutorial](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/notebooks/SRADownload/SRA-Download.ipynb) on downloading datasets from SRA using the `prefetch` + `fasterq-dump` method. The accession numbers for this analysis are: \n", |
| 41 | + "\n", |
| 42 | + "- SRR8616802\n", |
| 43 | + "- SRR8616795\n", |
| 44 | + "- SRR8616796\n", |
| 45 | + "- SRR8616777\n", |
| 46 | + "- SRR8616778\n", |
| 47 | + "- SRR8616772\n", |
| 48 | + "- SRR8616799\n", |
| 49 | + "- SRR8616800\n", |
| 50 | + "- SRR8616801\n", |
| 51 | + "- SRR8616787\n", |
| 52 | + "- SRR8616788\n", |
| 53 | + "- SRR8616789\n", |
| 54 | + "\n", |
| 55 | + "You can save these accession numbers in a file and use that in the `sra-tools` commands. Once you pull these files from SRA, store them in a storage bucket so that Nextflow can see them in the next steps." |
| 56 | + ] |
| 57 | + }, |
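| | + { |
| | + "cell_type": "markdown", |
| | + "id": "sketch-accession-list", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As a rough sketch of the step above (not the tutorial's own code), the accession numbers can be written to a text file from Python and the matching `sra-tools` commands printed for review. The file name `accessions.txt` is a hypothetical choice, and the `prefetch --option-file` and `fasterq-dump --split-files` flags are assumed from the SRA Toolkit documentation, so check them against your installed version and the linked tutorial.\n", |
| | + "\n", |
| | + "```python\n", |
| | + "# Accession numbers listed above\n", |
| | + "accessions = [\n", |
| | + "    'SRR8616802', 'SRR8616795', 'SRR8616796', 'SRR8616777',\n", |
| | + "    'SRR8616778', 'SRR8616772', 'SRR8616799', 'SRR8616800',\n", |
| | + "    'SRR8616801', 'SRR8616787', 'SRR8616788', 'SRR8616789',\n", |
| | + "]\n", |
| | + "\n", |
| | + "# Write one accession per line ('accessions.txt' is a hypothetical name)\n", |
| | + "with open('accessions.txt', 'w') as fh:\n", |
| | + "    for acc in accessions:\n", |
| | + "        print(acc, file=fh)\n", |
| | + "\n", |
| | + "# Print the commands instead of running them, so they can be reviewed first\n", |
| | + "print('prefetch --option-file accessions.txt')\n", |
| | + "for acc in accessions:\n", |
| | + "    print('fasterq-dump --split-files', acc)\n", |
| | + "```" |
| | + ] |
| | + }, |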
| 58 | + { |
| 59 | + "cell_type": "markdown", |
| 60 | + "id": "4218d79f-8431-490a-9bc0-76ba89561c83", |
| 61 | + "metadata": {}, |
| 62 | + "source": [ |
| 63 | + "## **RNA-Seq analysis**" |
| 64 | + ] |
| 65 | + }, |
| 66 | + { |
| 67 | + "cell_type": "markdown", |
| 68 | + "id": "0e33ebbe-60c4-4f73-9cf9-70a2245d6fb9", |
| 69 | + "metadata": {}, |
| 70 | + "source": [ |
| 71 | + "To run the RNA-Seq portion of this tutorial, you need to update the config file to point to your RNA-Seq reads. Let's look at the `rnaseq-aws.config` file and make the necessary changes. We will need to specify `params.outdir`, `workDir`, `params.input`, and `params.genome`." |
| 72 | + ] |
| 73 | + }, |
| 74 | + { |
| 75 | + "cell_type": "markdown", |
| 76 | + "id": "1a96a33c-c580-4aea-889a-02c53378586a", |
| 77 | + "metadata": {}, |
| 78 | + "source": [ |
| 79 | + "```\n", |
| 80 | + "profiles {\n", |
| 81 | + " aws {\n", |
| 82 | + " // AWS batch parameters\n", |
| 83 | + " process.executor = 'awsbatch' // run jobs on AWS Batch\n", |
| 84 | + " process.queue = 'nextflow-batch-job-queue' // name of your Job queue\n", |
| 85 | + " process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'\n", |
| 86 | + " aws.region = 'us-east-1'\n", |
| 87 | + "\n", |
| 88 | + " // Workflow parameters\n", |
| 89 | + " params.outdir = 'FILL-IN-HERE'\n", |
| 90 | + " workDir = 'FILL-IN-HERE'\n", |
| 91 | + " params.input = 'FILL-IN-HERE'\n", |
| 92 | + " params.genome = 'FILL-IN-HERE'\n", |
| 93 | + " }\n", |
| 94 | + "}\n", |
| 95 | + "```" |
| 96 | + ] |
| 97 | + }, |
| 98 | + { |
| 99 | + "cell_type": "markdown", |
| 100 | + "id": "be4d3374-abf6-4d0a-9fc2-b9bbf356dc0f", |
| 101 | + "metadata": {}, |
| 102 | + "source": [ |
| 103 | + "Check the [nf-core RNA-seq](https://nf-co.re/rnaseq/3.12.0) documentation to see how these parameters are structured. As with the primary dataset, `workDir` and `params.outdir` are locations in your Amazon S3 bucket. The sequences come from mice, so we need to specify a mouse genome build. Most of the effort goes into the input parameter: the nf-core documentation specifies that the reads must be listed in a sample sheet. Using the bucket paths to your reads, create a .csv sample sheet like the example further down, upload it to your bucket, and provide its bucket path in the config file. If you need help, click the dropdown below to see our suggested values." |
| 104 | + ] |
| 105 | + }, |
| 106 | + { |
| 107 | + "cell_type": "markdown", |
| 108 | + "id": "f51d42b4-7b6e-4ab0-8030-b05f8704bfd7", |
| 109 | + "metadata": {}, |
| 110 | + "source": [ |
| 111 | + "<details>\n", |
| 112 | + " <summary>Click for help</summary>\n", |
| 113 | + "\n", |
| 114 | + "```\n", |
| 115 | + "params.outdir = 's3://YOUR-BUCKET/rnaseq/results'\n", |
| 116 | + "workDir = 's3://YOUR-BUCKET/rnaseq/work'\n", |
| 117 | + "params.input = 's3://YOUR-BUCKET/samplesheet.csv'\n", |
| 118 | + "params.genome = 'GRCm38'\n", |
| 119 | + "```\n", |
| 120 | + "\n", |
| 121 | + "</details>\n" |
| 122 | + ] |
| 123 | + }, |
| 124 | + { |
| 125 | + "cell_type": "markdown", |
| 126 | + "id": "98f11060-3d5b-43e0-af7a-beda8f4a46c0", |
| 127 | + "metadata": {}, |
| 128 | + "source": [ |
| 129 | + "Here's how the first row of your sample sheet might look. Include only the RNA-seq samples in this sample sheet; the methyl-seq samples go in a separate sample sheet when we run that pipeline in the next step." |
| 130 | + ] |
| 131 | + }, |
| 132 | + { |
| 133 | + "cell_type": "markdown", |
| 134 | + "id": "351a761b-a1db-4ef8-a671-b5d64b83056d", |
| 135 | + "metadata": {}, |
| 136 | + "source": [ |
| 137 | + "``` \n", |
| 138 | + "sample,fastq_1,fastq_2,strandedness\n", |
| 139 | + "Young_1,s3://BUCKETPATH/SRR8616802_1.fastq.gz,s3://BUCKETPATH/SRR8616802_2.fastq.gz,auto\n", |
| 140 | + "``` " |
| 141 | + ] |
| 142 | + }, |
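| | + { |
| | + "cell_type": "markdown", |
| | + "id": "sketch-rnaseq-samplesheet", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "One way to build the sample sheet without typing every row by hand is a few lines of Python. This is a minimal sketch under assumptions: `BUCKETPATH` is a placeholder, `samplesheet.csv` is the file name used in the help dropdown, and only the `Young_1` pairing comes from the example row above. The remaining sample names and accessions must be filled in from the SRA metadata.\n", |
| | + "\n", |
| | + "```python\n", |
| | + "bucket = 's3://BUCKETPATH'  # placeholder bucket path, replace with your own\n", |
| | + "\n", |
| | + "# Sample name -> RNA-seq accession. Only Young_1 comes from the example\n", |
| | + "# row; add the remaining RNA-seq samples from the SRA metadata.\n", |
| | + "rna_samples = {'Young_1': 'SRR8616802'}\n", |
| | + "\n", |
| | + "with open('samplesheet.csv', 'w') as fh:\n", |
| | + "    # Header columns as in the example row\n", |
| | + "    print('sample,fastq_1,fastq_2,strandedness', file=fh)\n", |
| | + "    for sample, acc in rna_samples.items():\n", |
| | + "        fastq_1 = f'{bucket}/{acc}_1.fastq.gz'\n", |
| | + "        fastq_2 = f'{bucket}/{acc}_2.fastq.gz'\n", |
| | + "        print(','.join([sample, fastq_1, fastq_2, 'auto']), file=fh)\n", |
| | + "```\n", |
| | + "\n", |
| | + "Upload the resulting file to your bucket and point `params.input` at its bucket path." |
| | + ] |
| | + }, |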
| 143 | + { |
| 144 | + "cell_type": "markdown", |
| 145 | + "id": "736c254a-a6b9-4901-a666-4bff478dcea0", |
| 146 | + "metadata": {}, |
| 147 | + "source": [ |
| 148 | + "Once you have that together, run the Nextflow command again to run the nf-core pipeline on this dataset. " |
| 149 | + ] |
| 150 | + }, |
| 151 | + { |
| 152 | + "cell_type": "markdown", |
| 153 | + "id": "491e14b9-e0c6-4310-8660-8172db317c80", |
| 154 | + "metadata": {}, |
| 155 | + "source": [ |
| 156 | + "After the Nextflow quantification completes, you can follow much of the RNA-seq notebook as written using the `salmon.merged.gene_counts.tsv` file that is produced by this step. You will need to update the `sample-info.txt` file to match the new information from these samples, but that is a simple task. As you go through the analysis, think about what comparisons would be interesting to make. Feel free to consult the manuscript for analysis ideas and try to replicate some of their results. " |
| 157 | + ] |
| 158 | + }, |
| 159 | + { |
| 160 | + "cell_type": "markdown", |
| 161 | + "id": "89f2cf3a-51ed-4856-a8b1-3b6554e89bbd", |
| 162 | + "metadata": {}, |
| 163 | + "source": [ |
| 164 | + "## **DNA Methylation Analysis**" |
| 165 | + ] |
| 166 | + }, |
| 167 | + { |
| 168 | + "cell_type": "markdown", |
| 169 | + "id": "bfbb1122-2196-45cb-8d93-fbc12bb29913", |
| 170 | + "metadata": {}, |
| 171 | + "source": [ |
| 172 | + "We will use the same framework for the methylation analysis as for the RNA-seq analysis: adjust the config file and let Nextflow run the pipeline. Let's look at the methylation config file and determine what we need to change. As before, we need to specify the sample sheet input (`params.input`), the genome (`params.genome`), `workDir`, and `params.outdir`. Fill those in below and try running the Nextflow command to run the core methyl-seq analysis." |
| 173 | + ] |
| 174 | + }, |
| 175 | + { |
| 176 | + "cell_type": "markdown", |
| 177 | + "id": "c2197a3a-23f3-4b3c-85e0-9acd582b60f8", |
| 178 | + "metadata": {}, |
| 179 | + "source": [ |
| 180 | + "```\n", |
| 181 | + "profiles {\n", |
| 182 | + " aws {\n", |
| 183 | + " // AWS batch parameters\n", |
| 184 | + " process.executor = 'awsbatch' // run jobs on AWS Batch\n", |
| 185 | + " process.queue = 'nextflow-batch-job-queue' // name of your Job queue\n", |
| 186 | + " process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'\n", |
| 187 | + "\n", |
| 188 | + " aws.region = 'us-east-1'\n", |
| 189 | + " // Workflow parameters\n", |
| 190 | + " dag.overwrite = true\n", |
| 191 | + " params.outdir = 'FILL-IN-HERE'\n", |
| 192 | + " workDir = 'FILL-IN-HERE'\n", |
| 193 | + " params.genome = 'FILL-IN-HERE'\n", |
| 194 | + " params.input = 'FILL-IN-HERE'\n", |
| 195 | + " }\n", |
| 196 | + "}\n", |
| 197 | + "```" |
| 198 | + ] |
| 199 | + }, |
| 200 | + { |
| 201 | + "cell_type": "markdown", |
| 202 | + "id": "88e6f8a7-b349-4144-85fd-859704b04209", |
| 203 | + "metadata": {}, |
| 204 | + "source": [ |
| 205 | + "Be sure to consult the [nf-core methylseq documentation](https://nf-co.re/methylseq/2.5.0) to see how the input sample sheet is structured. There are some differences from the RNA-seq input. Here's an example of our suggested first line." |
| 206 | + ] |
| 207 | + }, |
| 208 | + { |
| 209 | + "cell_type": "markdown", |
| 210 | + "id": "8ae0255f-f710-43dd-857b-9713f8bb97f9", |
| 211 | + "metadata": {}, |
| 212 | + "source": [ |
| 213 | + "```\n", |
| 214 | + "sample,fastq_1,fastq_2\n", |
| 215 | + "Young_1,s3://BUCKETPATH/SRR8616795_1.fastq.gz,s3://BUCKETPATH/SRR8616795_2.fastq.gz\n", |
| 216 | + "```" |
| 217 | + ] |
| 218 | + }, |
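| | + { |
| | + "cell_type": "markdown", |
| | + "id": "sketch-methylseq-header-check", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Because the two pipelines expect different columns, a quick header check before launching can catch a sample sheet built for the wrong pipeline. The sketch below assumes a hypothetical local file named `methylseq_samplesheet.csv` and the three columns shown in the example above; consult the methylseq documentation for the authoritative column list.\n", |
| | + "\n", |
| | + "```python\n", |
| | + "from pathlib import Path\n", |
| | + "\n", |
| | + "# Columns taken from the example first line above\n", |
| | + "expected = ['sample', 'fastq_1', 'fastq_2']\n", |
| | + "sheet = Path('methylseq_samplesheet.csv')  # hypothetical file name\n", |
| | + "\n", |
| | + "if sheet.exists():\n", |
| | + "    with sheet.open() as fh:\n", |
| | + "        # Strip stray whitespace around each column name before comparing\n", |
| | + "        header = [col.strip() for col in fh.readline().split(',')]\n", |
| | + "    if header == expected:\n", |
| | + "        print('header ok')\n", |
| | + "    else:\n", |
| | + "        print(f'unexpected header: {header}')\n", |
| | + "else:\n", |
| | + "    print('sample sheet not found:', sheet)\n", |
| | + "```" |
| | + ] |
| | + }, |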
| 219 | + { |
| 220 | + "cell_type": "markdown", |
| 221 | + "id": "7e85dd37-9c5f-4abb-a9d2-107fb8590a37", |
| 222 | + "metadata": {}, |
| 223 | + "source": [ |
| 224 | + "As with the RNA-seq portion of this notebook, much of the remaining analysis can be completed as written in the `RRBS-downstream.ipynb` notebook. You will need to update the cells that hard-code sample and file information to match the samples from the new dataset, but this is a straightforward task. After doing this, try working through the `Integration.ipynb` notebook with your new dataset." |
| 225 | + ] |
| 226 | + } |
| 227 | + ], |
| 228 | + "metadata": { |
| 229 | + "language_info": { |
| 230 | + "name": "python" |
| 231 | + } |
| 232 | + }, |
| 233 | + "nbformat": 4, |
| 234 | + "nbformat_minor": 5 |
| 235 | +} |