- Quickstart Guide
- Download Data from Manifest File Using the GDC Client
- Run Processing Pipeline
- Sample Subtype Classification using Gene Expression Data
- Sample Subtype Classification using DNA Methylation Data
Install requirements - detailed instructions are found on the Requirements page:
- Install Python 3+
- Install GDC Data Transfer Tool Client
Ensure that all steps on the Requirements page are completed (this includes creating a working environment, signing in, and manually downloading the required data).
Download Gene Expression Data
bash scripts/gdc_download.sh PAAD
This will create subfolders in `data-raw/GEXP` and place the GDC molecular matrices there.
Options for the cancer cohort include:
ALL,BLCA,BRCA,COADREAD,ESO,HNSC,KID,LGGGBM,LIHCCHOL,LUNG,OV,PAAD,SARC,SKCM,UCEC
For more details on each cancer cohort option, see the Cohort Options Page.
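Because the scripts take the cohort as a positional argument, it can help to check the name before starting a long download. The following is a minimal sketch and not part of the repository: the function name `is_valid_cohort` is hypothetical, and the list is copied from the options above.

```shell
# Hypothetical helper (not part of the repository): check a cohort name
# against the options listed above before calling the download script.
VALID_COHORTS="ALL BLCA BRCA COADREAD ESO HNSC KID LGGGBM LIHCCHOL LUNG OV PAAD SARC SKCM UCEC"

is_valid_cohort() {
  local cohort="$1"
  local c
  for c in $VALID_COHORTS; do
    if [ "$c" = "$cohort" ]; then
      return 0
    fi
  done
  return 1
}

# Example: only proceed when the cohort name is recognized.
cohort="PAAD"
if is_valid_cohort "$cohort"; then
  echo "OK: $cohort"   # here one would run: bash scripts/gdc_download.sh "$cohort"
else
  echo "Unknown cohort: $cohort" >&2
fi
```

The match is case-sensitive, so `paad` would be rejected. Whether the repository scripts accept lowercase names is not stated here.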
Example shown for running the PAAD cohort:
bash scripts/process.sh PAAD data/prep
This creates the file data/prep/<CANCER>_GEXP/<CANCER>_GEXP_prep2_<TYPE>.tsv, which is prepped for distance calculations.
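To confirm that processing produced the expected files, the path pattern above can be expanded for a given cohort. This is a sketch under assumptions: `prep_path` is a hypothetical helper, and the `<TYPE>` values `Tumor` and `Model` are taken from the file names used in the classifier example in this guide.

```shell
# Hypothetical helper: build the prepped-matrix path from the pattern above.
# The third argument (output directory) defaults to data/prep.
prep_path() {
  local cancer="$1" kind="$2" prep_dir="${3:-data/prep}"
  echo "${prep_dir}/${cancer}_GEXP/${cancer}_GEXP_prep2_${kind}.tsv"
}

# Report whether each expected file exists and is non-empty.
for kind in Tumor Model; do
  f="$(prep_path PAAD "$kind")"
  if [ -s "$f" ]; then
    echo "found: $f"
  else
    echo "missing or empty: $f"
  fi
done
```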
Options for the cancer cohort include:
ALL,BLCA,BRCA,COADREAD,ESO,HNSC,KID,LGGGBM,LIHCCHOL,LUNG,OV,PAAD,SARC,SKCM,UCEC
For more details on each cancer cohort option, see the Cohort Options Page.
The goal of this analysis is to get cancer subtype predictions for HCMI samples (organoids, cell cultures, xenografts, etc.). To accomplish this, we use the top-performing pre-trained machine learning models (dockerized TMP models trained on pre-processed TCGA data). Specifically, we are interested in using gene expression from the HCMI samples and eventually comparing primary tumors to their corresponding models (organoids, cell cultures, xenografts, etc.).
The TMP models (pre-trained models) are specific to TCGA cancer cohorts (TCGA abbreviations); therefore, we split the HCMI data into TCGA cancer cohorts (based on sample metadata).
Run the gene expression classifier pipeline:
# Arguments, in order: cancer, tumor-file, model-file, transformed-dir
bash scripts/run_classify_GEXP.sh \
PAAD \
data/prep/PAAD_GEXP/PAAD_GEXP_prep2_Tumor.tsv \
data/prep/PAAD_GEXP/PAAD_GEXP_prep2_Model.tsv \
data/classifier_gexp/ml_ready_qrank
Results can be found in data/classifier_gexp/ml_predictions_qrank/combo/HCMI_TMPsubtype_qRank_<CANCER>.tsv
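Since the tumor and model file names follow the cohort name, the four arguments can be derived from it. The wrapper below is a hypothetical convenience, not a script shipped with the repository. It assumes the `data/prep` layout shown in the PAAD example and uses a `DRY_RUN` variable (also an assumption) to print the command instead of running it.

```shell
# Hypothetical wrapper: derive the classifier arguments from the cohort name.
# Paths follow the PAAD example above; with DRY_RUN=1 the command is only printed.
classify_gexp() {
  local cancer="$1"
  local prep="data/prep/${cancer}_GEXP/${cancer}_GEXP_prep2"
  # Reuse the positional parameters to hold the full command line.
  set -- bash scripts/run_classify_GEXP.sh \
    "$cancer" "${prep}_Tumor.tsv" "${prep}_Model.tsv" \
    data/classifier_gexp/ml_ready_qrank
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

DRY_RUN=1   # set to 0 to actually run the pipeline
classify_gexp PAAD
```

Printing the command first makes it easy to check the derived paths before committing to a full run.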
Note: The combination cohorts LUNG (includes LUAD and LUSC) and ESO (includes GEA and ESCC) are processed as their component cohorts during transformation and classification, and the results are then merged in the post-classification summary.
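The relationship described in the note can be written as a small lookup. This is illustrative only: `expand_cohort` is a hypothetical name, it covers just the two combination cohorts named in the note, and it does not reproduce how the pipeline scripts perform the split internally.

```shell
# Illustrative only: map a combination cohort to the component cohorts
# named in the note above. Any other cohort maps to itself.
expand_cohort() {
  case "$1" in
    LUNG) echo "LUAD LUSC" ;;
    ESO)  echo "GEA ESCC" ;;
    *)    echo "$1" ;;
  esac
}

for cohort in PAAD LUNG ESO; do
  echo "$cohort -> $(expand_cohort "$cohort")"
done
```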
The goal of this analysis is to get cancer subtype predictions for HCMI samples (organoids, cell cultures, xenografts, etc.). To accomplish this, we use the top-performing pre-trained machine learning models (dockerized TMP models trained on pre-processed TCGA data). Specifically, we are interested in using DNA methylation data from the HCMI samples and eventually comparing primary tumors to their corresponding models (organoids, cell cultures, xenografts, etc.).
The TMP models (pre-trained models) are specific to TCGA cancer cohorts (TCGA abbreviations); therefore, we split the HCMI data into TCGA cancer cohorts (based on sample metadata).
Run the DNA methylation classifier pipeline:
# Arguments, in order: cancer, feature-matrix-file
bash scripts/run_classify_METHYL.sh \
SKCM \
data/classifier_methyl/processed/20231211_HCMI_TMP_subtype_prediction_feature_matrix_SKCM.tsv
Results can be found in data/classifier_methyl/ml_predictions/combo/HCMI_METH_TMPsubtypes.<CANCER>.tsv
Note: The combination cohorts LUNG (includes LUAD and LUSC) and ESO (includes GEA and ESCC) are processed as their component cohorts during transformation and classification, and the results are then merged in the post-classification summary.
Second example, for a combination cohort:
bash scripts/run_classify_METHYL.sh \
LUNG \
data/classifier_methyl/processed/20231211_HCMI_TMP_subtype_prediction_feature_matrix_LUNG.tsv
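The two methylation examples differ only in the cohort name, so several cohorts can be run in a loop. The sketch below assumes that the feature-matrix files for other cohorts follow the same naming pattern and date prefix as the SKCM and LUNG examples, which this guide does not confirm. `methyl_matrix` is a hypothetical helper.

```shell
# Hypothetical helper: feature-matrix path for a cohort, following the two
# examples above. The date prefix is copied from those examples and may
# differ for other data releases.
methyl_matrix() {
  local cancer="$1" prefix="${2:-20231211}"
  echo "data/classifier_methyl/processed/${prefix}_HCMI_TMP_subtype_prediction_feature_matrix_${cancer}.tsv"
}

# Run the classifier only for cohorts whose feature matrix is present.
for cancer in SKCM LUNG; do
  matrix="$(methyl_matrix "$cancer")"
  if [ -f "$matrix" ]; then
    bash scripts/run_classify_METHYL.sh "$cancer" "$matrix"
  else
    echo "skipping $cancer: $matrix not found" >&2
  fi
done
```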