|
| 1 | +# DeepCDR |
| 2 | + |
| 3 | +[English Version](./README.md) |
| 4 | + |
| 5 | +* [Background](#background) |
| 6 | +* [Datasets](#datasets) |
| 7 | + * [CCLE](#ccle) |
| 8 | + * [GDSC](#gdsc) |
| 9 | +* [Instructions](#instructions) |
| 10 | + * [Data Preparation](#data-preparation) |
| 11 | + * [Training and Evaluation](#training-and-evaluation) |
| 12 | +* [Reference](#reference) |
| 13 | + |
| 14 | +## Background |
| 15 | + |
| 16 | +Accurate prediction of cancer drug response (CDR) is challenging due to the uncertainty of drug efficacy and the heterogeneity of cancer patients. Precise identification of CDR is crucial both for guiding anti-cancer drug design and for understanding cancer biology. DeepCDR is a model that integrates multi-omics profiles of cancer cells and explores the intrinsic chemical structures of drugs to predict CDR. |
| 17 | + |
| 18 | +## Datasets |
| 19 | + |
| 20 | +First, create a dataset root folder `data` under this application folder. |
| 21 | + |
| 22 | +```sh |
| 23 | +mkdir -p data && cd data |
| 24 | +``` |
| 25 | + |
| 26 | +### CCLE |
| 27 | +Download and extract the CCLE dataset using the following command: |
| 28 | +```sh |
| 29 | +wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/drug_response/CCLE.tar && tar -xvf CCLE.tar |
| 30 | +``` |
| 31 | +The following three CCLE files will be located in the `./data/CCLE` folder. |
| 32 | + |
| 33 | +`genomic_mutation_34673_demap_features.csv` -- genomic mutation matrix where each column denotes a mutation locus and each row denotes a cell line |
| 34 | + |
| 35 | +`genomic_expression_561celllines_697genes_demap_features.csv` -- gene expression matrix where each column denotes a coding gene and each row denotes a cell line |
| 36 | + |
| 37 | +`genomic_methylation_561celllines_808genes_demap_features.csv` -- DNA methylation matrix where each column denotes a methylation locus and each row denotes a cell line |
| 38 | + |
| 39 | + |
| 40 | + |
| 41 | +### GDSC |
| 42 | + |
| 43 | +Download and extract the GDSC dataset using the following command: |
| 44 | + |
| 45 | +```sh |
| 46 | +wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/drug_response/GDSC.tar && tar -xvf GDSC.tar |
| 47 | +``` |
| 48 | +The following two GDSC files will be located in the `./data/GDSC` folder. |
| 49 | + |
| 50 | +`1.Drug_listMon Jun 24 09_00_55 2019.csv` -- drug list |
| 51 | + |
| 52 | +`223drugs_pubchem_smiles.txt` -- drug information with PubChem ID and SMILES |
| 53 | + |
| 54 | + |
| 55 | + |
| 56 | +Then return to this application folder (for example with `cd ..`) and follow the instructions below to complete the next steps. |
| 57 | + |
| 58 | +After downloading these datasets, the `data` folder looks like: |
| 59 | + |
| 60 | +```txt |
| 61 | +data |
| 62 | +├── CCLE |
| 63 | +│ ├── genomic_mutation_34673_demap_features.csv |
| 64 | +│ ├── genomic_expression_561celllines_697genes_demap_features.csv |
| 65 | +│ ├── genomic_methylation_561celllines_808genes_demap_features.csv |
| 66 | +│ ├── Cell_lines_annotations_20181226.txt |
| 67 | +│ └── GDSC_IC50.csv |
| 68 | +│ |
| 69 | +├── CCLE.tar |
| 70 | +├── GDSC |
| 71 | +│ ├── 1.Drug_listMon Jun 24 09_00_55 2019.csv |
| 72 | +│ └── 223drugs_pubchem_smiles.txt |
| 73 | +└── GDSC.tar |
| 74 | +``` |
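Before moving on, it can help to confirm that the download produced the layout shown above. The following is a minimal sketch, not part of the DeepCDR scripts: the file names are copied from the directory listing, while the function name `find_missing_files` and the `EXPECTED_FILES` table are hypothetical helpers introduced only for this illustration.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_FILES = {
    "CCLE": [
        "genomic_mutation_34673_demap_features.csv",
        "genomic_expression_561celllines_697genes_demap_features.csv",
        "genomic_methylation_561celllines_808genes_demap_features.csv",
        "Cell_lines_annotations_20181226.txt",
        "GDSC_IC50.csv",
    ],
    "GDSC": [
        "1.Drug_listMon Jun 24 09_00_55 2019.csv",
        "223drugs_pubchem_smiles.txt",
    ],
}


def find_missing_files(data_root="data"):
    """Return the expected files (relative to data_root) that are absent."""
    root = Path(data_root)
    return [
        (Path(folder) / name).as_posix()
        for folder, names in EXPECTED_FILES.items()
        for name in names
        if not (root / folder / name).is_file()
    ]


if __name__ == "__main__":
    missing = find_missing_files()
    if missing:
        print("Missing files:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected dataset files are present.")
```

Run it from the application folder; an empty result means every file in the listing was found under `data`.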
| 75 | + |
| 76 | +## Instructions |
| 77 | + |
| 78 | +### Data Preparation |
| 79 | + |
| 80 | +The script `process_data.py` is the entry point for data preprocessing. It creates `train_data_split_ratio.npz` and `test_data_ratio.npz` under `./data/processed/`, which can be used directly for training. |
| 81 | + |
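The names of the arrays stored inside the processed files are not documented here. Since an `.npz` file is a zip archive holding one `.npy` member per saved array, they can be listed with the standard library alone. The sketch below assumes the output folder `./data/processed/` named above; the helper `list_npz_arrays` is hypothetical and not part of `process_data.py`.

```python
import zipfile
from pathlib import Path


def list_npz_arrays(processed_dir="./data/processed"):
    """Map each .npz file name to the names of the arrays stored in it."""
    contents = {}
    for path in sorted(Path(processed_dir).glob("*.npz")):
        with zipfile.ZipFile(path) as archive:
            # Each array is stored as a member called "<array name>.npy".
            contents[path.name] = [
                member[: -len(".npy")]
                for member in archive.namelist()
                if member.endswith(".npy")
            ]
    return contents


if __name__ == "__main__":
    for file_name, array_names in list_npz_arrays().items():
        print(file_name, array_names)
```

Once the array names are known, `numpy.load(path)` gives access to the arrays themselves by name.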
| 82 | + |
| 83 | +### Training and Evaluation |
| 84 | + |
| 85 | +The script `train.py` is the entry point for training and evaluating the CDR model. It creates `CDRModel` from `model.py`, and the best model parameters are saved in `./best_model/`. |
| 86 | + |
| 87 | + |
| 88 | +Below are detailed explanations of the model parameters, followed by an example of how to use `train.py`: |
| 89 | + |
| 90 | +* ```data_path```: Path the data is loaded from. First download the datasets as described above. It is recommended to extract them into the `data` folder under the root directory; create that folder if it does not exist. |
| 91 | +* ```output_path```: Path where the model results are saved. |
| 92 | +* ```batch_size```: Batch size used during training; the default value is 64. |
| 93 | +* ```use_cuda```: Use the GPU when this flag is set. |
| 94 | +* ```device```: GPU device number; used together with ```use_cuda```. |
| 95 | +* ```layer_num```: Number of graph convolution layers. |
| 96 | +* ```units_list```: List of the hidden sizes of each layer. |
| 97 | +* ```gnn_type```: Type of graph convolution; three choices are available: GCN, GIN and GraphSage. |
| 98 | +* ```pool_type```: Graph pooling type. |
| 99 | +* ```epoch_num```: Number of epochs to train the model. |
| 100 | +* ```lr```: Learning rate used to train the model. |
| 101 | + |
| 102 | +```sh |
| 103 | +CUDA_VISIBLE_DEVICES=0 python train.py \ |
| 104 | + --data_path='./data/processed/' \ |
| 105 | + --output_path='./best_model/' \ |
| 106 | + --batch_size=64 \ |
| 107 | + --use_cuda \ |
| 108 | + --device=0 \ |
| 109 | + --layer_num=4 \ |
| 110 | + --units_list='[256,256,256,100]' \ |
| 111 | + --gnn_type='gcn' \ |
| 112 | + --pool_type='max' \ |
| 113 | + --epoch_num=500 \ |
| 114 | + --lr=1e-4 |
| 115 | +``` |
| 116 | + |
| 117 | +Or, for convenience, use |
| 118 | +```sh |
| 119 | +(test -d ./data/processed || python process_data.py) && python train.py |
| 120 | +``` |
| 121 | +to cover data preparation and simply start training. |
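How `train.py` actually parses its flags is not shown in this README. Purely as an illustration, the sketch below shows one way a command line with the same flag names could be parsed with `argparse`. Everything in it is an assumption: the defaults other than `batch_size` are simply the values from the example command, the lowercase spellings `gin` and `graphsage` are guesses by analogy with `gcn`, and the check that `units_list` has `layer_num` entries is inferred from the example rather than documented.

```python
import argparse
import json


def parse_units_list(text):
    """Turn a string such as '[256,256,256,100]' into a list of ints."""
    values = json.loads(text)
    if not isinstance(values, list) or not all(type(v) is int for v in values):
        raise argparse.ArgumentTypeError("expected a list of integers, e.g. [256,256,100]")
    return values


def build_parser():
    # Hypothetical parser: flag names come from the README, defaults from its example.
    parser = argparse.ArgumentParser(description="Illustrative CLI for the flags above")
    parser.add_argument("--data_path", default="./data/processed/")
    parser.add_argument("--output_path", default="./best_model/")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--use_cuda", action="store_true")
    parser.add_argument("--device", type=int, default=0)
    parser.add_argument("--layer_num", type=int, default=4)
    parser.add_argument("--units_list", type=parse_units_list, default=[256, 256, 256, 100])
    # Assumed spellings; only 'gcn' appears in the example command.
    parser.add_argument("--gnn_type", choices=["gcn", "gin", "graphsage"], default="gcn")
    parser.add_argument("--pool_type", default="max")
    parser.add_argument("--epoch_num", type=int, default=500)
    parser.add_argument("--lr", type=float, default=1e-4)
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed consistency rule: one hidden size per graph convolution layer.
    if len(args.units_list) != args.layer_num:
        parser.error("--units_list must contain exactly --layer_num entries")
    return args


# Example with an explicit argument list instead of sys.argv.
example = parse_args(["--gnn_type=gin", "--layer_num=2", "--units_list=[128,64]"])
print(example.gnn_type, example.units_list)
```

Parsing the list with `json.loads` keeps the bracketed form used in the example command; quoting the value in the shell prevents the brackets from being treated as a glob pattern.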
| 122 | + |
| 123 | + |
| 124 | + |
| 125 | +## Reference |
| 126 | + |
| 127 | +**DeepCDR** |
| 128 | +>@article{liu2020deepcdr,title={DeepCDR: a hybrid graph convolutional network for predicting cancer drug response},author={Qiao Liu and Zhiqiang Hu and Rui Jiang and Mu Zhou},journal={Bioinformatics},year={2020},url={https://doi.org/10.1093/bioinformatics/btaa822}} |
| 129 | +
|
| 130 | + |
| 131 | + |