
Commit e5578f7

Merge pull request #140 from PaddlePaddle/cdr
add cancer_drug_response
2 parents 2bb20a3 + 4a51908 commit e5578f7

5 files changed: +860 −0 lines changed


apps/cancer_drug_response/README.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# DeepCDR

[English Version](./README.md)

* [Background](#background)
* [Datasets](#datasets)
  * [CCLE](#ccle)
  * [GDSC](#gdsc)
* [Instructions](#instructions)
  * [Data Preparation](#data-preparation)
  * [Training and Evaluation](#training-and-evaluation)
* [Reference](#reference)
## Background

Accurate prediction of cancer drug response (CDR) is challenging due to the uncertainty of drug efficacy and the heterogeneity of cancer patients. Precise identification of CDR is crucial both for guiding anti-cancer drug design and for understanding cancer biology. DeepCDR is a model that integrates multi-omics profiles of cancer cells and explores the intrinsic chemical structures of drugs to predict CDR.
## Datasets

First, create a dataset root folder `data` under this application folder.

```sh
mkdir -p data && cd data
```
### CCLE

Download and uncompress the CCLE dataset using the following command:

```sh
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/drug_response/CCLE.tar && tar -xvf CCLE.tar
```

The following three CCLE files will be located in the `./data/CCLE` folder.

`genomic_mutation_34673_demap_features.csv` -- genomic mutation matrix, where each column denotes a mutation locus and each row denotes a cell line

`genomic_expression_561celllines_697genes_demap_features.csv` -- gene expression matrix, where each column denotes a coding gene and each row denotes a cell line

`genomic_methylation_561celllines_808genes_demap_features.csv` -- DNA methylation matrix, where each column denotes a methylation locus and each row denotes a cell line
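If you want to sanity-check the downloads, the matrices can be loaded with pandas. This is only a minimal sketch, assuming the first column of each CSV holds the cell-line identifiers (it is not part of the application code):

```python
import pandas as pd

# Sanity check of the CCLE matrices; assumes the first column of each CSV
# holds the cell-line identifiers (rows = cell lines, columns = features).
mutation = pd.read_csv(
    "./data/CCLE/genomic_mutation_34673_demap_features.csv", index_col=0)
expression = pd.read_csv(
    "./data/CCLE/genomic_expression_561celllines_697genes_demap_features.csv", index_col=0)
methylation = pd.read_csv(
    "./data/CCLE/genomic_methylation_561celllines_808genes_demap_features.csv", index_col=0)

for name, df in [("mutation", mutation), ("expression", expression),
                 ("methylation", methylation)]:
    print(name, df.shape)
```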
### GDSC

Download and uncompress the GDSC dataset using the following command:

```sh
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/drug_response/GDSC.tar && tar -xvf GDSC.tar
```
The following two GDSC files will be located in the `./data/GDSC` folder.

`1.Drug_listMon Jun 24 09_00_55 2019.csv` -- drug list

`223drugs_pubchem_smiles.txt` -- drug information with PubChem ID and SMILES
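To take a quick look at the drug files, a sketch along these lines should work. The delimiter and header layout of `223drugs_pubchem_smiles.txt` are not documented here, so they are left for pandas to infer:

```python
import pandas as pd

# Peek at the GDSC drug list.
drug_list = pd.read_csv("./data/GDSC/1.Drug_listMon Jun 24 09_00_55 2019.csv")
print(drug_list.head())

# Peek at the PubChem ID / SMILES file; the delimiter is inferred rather than
# assumed, since the exact layout of this text file is not documented here.
smiles = pd.read_csv("./data/GDSC/223drugs_pubchem_smiles.txt",
                     sep=None, engine="python")
print(smiles.head())
```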
Then, return to this application folder and follow the instructions below to finish the next steps.

After downloading these datasets, the `data` folder looks like:
```txt
data
├── CCLE
│   ├── genomic_mutation_34673_demap_features.csv
│   ├── genomic_expression_561celllines_697genes_demap_features.csv
│   ├── genomic_methylation_561celllines_808genes_demap_features.csv
│   ├── Cell_lines_annotations_20181226.txt
│   └── GDSC_IC50.csv
├── CCLE.tar
├── GDSC
│   ├── 1.Drug_listMon Jun 24 09_00_55 2019.csv
│   └── 223drugs_pubchem_smiles.txt
└── GDSC.tar
```
## Instructions

### Data Preparation

The script `process_data.py` is the entry point for data preprocessing. It creates `train_data_split_ratio.npz` and `test_data_ratio.npz` under `./data/processed/`, which can be used for training directly.
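Once preprocessing has run, a quick way to confirm what was produced is to list the arrays inside the `.npz` archives. This is only a sketch, assuming the file names match those mentioned above; the key names are simply whatever the archives report:

```python
import numpy as np

# List whatever arrays process_data.py stored in the processed archives.
# allow_pickle=True is used defensively in case graph objects were pickled.
for path in ("./data/processed/train_data_split_ratio.npz",
             "./data/processed/test_data_ratio.npz"):
    archive = np.load(path, allow_pickle=True)
    print(path)
    for key in archive.files:
        print("  ", key, getattr(archive[key], "shape", None))
```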
### Training and Evaluation

The script `train.py` is the entry point for training and evaluating the CDR model. It creates `CDRModel` from `model.py`, and the best model parameters are saved in `./best_model/`.

Below are detailed explanations of the model parameters and an example of how to use `train.py`:

```data_path```: The path the data is loaded from. First download the datasets per the guidance above; it is recommended to untar them into the `data` folder under this application folder, creating the folder if it does not exist.
```output_path```: The path the model results are saved to.
```batch_size```: Batch size of the model; at the training phase, the default value is 64.
```use_cuda```: Use GPU if set.
```device```: GPU device number; used together with ```use_cuda```.
```layer_num```: Number of graph convolution layers.
```units_list```: List of hidden sizes of each layer.
```gnn_type```: Type of graph convolution; the choices are GCN, GIN, and GraphSAGE.
```pool_type```: Graph pooling type.
```epoch_num```: Number of epochs to train the model.
```lr```: Learning rate for training the model.
```sh
CUDA_VISIBLE_DEVICES=0 python train.py \
    --data_path='./data/processed/' \
    --output_path='./best_model/' \
    --batch_size=64 \
    --use_cuda \
    --device=0 \
    --layer_num=4 \
    --units_list=[256,256,256,100] \
    --gnn_type='gcn' \
    --pool_type='max' \
    --epoch_num=500 \
    --lr=1e-4
```
Or, for convenience, use

```sh
test -d ./data/processed && python train.py || (python process_data.py && python train.py)
```

to cover data preparation and simply start training.
## Reference

**DeepCDR**
>@article{liu2020deepcdr, title={DeepCDR: a hybrid graph convolutional network for predicting cancer drug response}, author={Qiao Liu and Zhiqiang Hu and Rui Jiang and Mu Zhou}, journal={Bioinformatics}, year={2020}, url={https://doi.org/10.1093/bioinformatics/btaa822}}

apps/cancer_drug_response/data_gen.py

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys
import numpy as np

import paddle
from pgl.utils.data import Dataset as BaseDataset
from pgl.utils.data import Dataloader

import pgl
from pgl.utils.logger import log


class Dataset(BaseDataset):
    """
    Dataset for CDR (cancer drug response)
    """

    def __init__(self, processed_data):
        self.data = processed_data
        self.keys = list(processed_data.keys())
        self.num_samples = len(processed_data[self.keys[0]])

    def __getitem__(self, idx):
        return self.data[self.keys[0]][idx], self.data[self.keys[1]][idx], self.data[self.keys[2]][idx], \
               self.data[self.keys[3]][idx], self.data[self.keys[4]][idx]

    def get_data_loader(self, batch_size, num_workers=1,
                        shuffle=False, collate_fn=None):
        """Get dataloader.

        Args:
            batch_size (int): number of data items in a batch.
            num_workers (int): number of parallel workers.
            shuffle (bool): whether to shuffle the yielded data.
            collate_fn: callable function that processes batch data to a list of paddle tensors.
        """
        return Dataloader(
            self,
            batch_size=batch_size,
            num_workers=num_workers,
            shuffle=shuffle,
            collate_fn=collate_fn)

    def __len__(self):
        return self.num_samples


def collate_fn(batch_data):
    """
    Collation function to distribute data to samples.

    :param batch_data: batch data
    """
    graphs = []
    mut, gexpr, met, Y = [], [], [], []
    for g, mu, gex, me, y in batch_data:
        graphs.append(g)
        mut.append(mu)
        gexpr.append(gex)
        met.append(me)
        Y.append(y)
    return graphs, mut, gexpr, met, Y
```
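As a rough illustration of how this module is meant to be used (not taken from the commit itself): the processed data dictionary is wrapped in `Dataset`, and `collate_fn` regroups each batch into per-field lists. A hedged sketch with toy data, assuming the dictionary's key order is drug graph, mutation, gene expression, methylation, and label:

```python
# Hypothetical usage sketch (not part of this commit): wrap preprocessed data
# in the Dataset above and iterate over batches with the provided collate_fn.
import numpy as np

from data_gen import Dataset, collate_fn

if __name__ == "__main__":
    num_samples = 8
    # Toy stand-in for the real processed dictionary; plain dicts stand in for
    # the drug graphs, since collate_fn only gathers them into a list.
    processed = {
        "graph": [{"drug_id": i} for i in range(num_samples)],
        "mutation": np.random.rand(num_samples, 16).astype("float32"),
        "expression": np.random.rand(num_samples, 16).astype("float32"),
        "methylation": np.random.rand(num_samples, 16).astype("float32"),
        "label": np.random.rand(num_samples, 1).astype("float32"),
    }

    dataset = Dataset(processed)
    loader = dataset.get_data_loader(
        batch_size=4, shuffle=True, collate_fn=collate_fn)

    # Each batch comes back as five parallel lists, one per field.
    for graphs, mut, gexpr, met, labels in loader:
        print(len(graphs), len(mut), len(gexpr), len(met), len(labels))
```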

0 commit comments
