Commit ce4a22b

update doc and add missed scripts
1 parent 53b2594 commit ce4a22b

7 files changed: +378 -56 lines changed


apps/drug_target_interaction/graph_dta/README.md

+28 -28
@@ -27,7 +27,7 @@ mkdir -p data && cd data
 Davis contains the binding affinities for all pairs of 72 drugs and 442 targets, measured as Kd constant (equilibrium dissociation constant). The smaller the Kd value, the greater the binding affinity of the drug for its target. You can download and uncompress this dataset using following command:

 ```sh
-wget "https://baidu-nlp.bj.bcebos.com/PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz" -O davis.tgz
+wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/dti_datasets/davis_v1.tgz -O davis.tgz
 tar -zxvf davis.tgz
 ```
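
Reviewer note: the context line above says that a smaller Kd means stronger binding, and the preprocessing script added in this commit applies `-np.log10(y / 1e9)` to the Davis values, which flips the scale so that larger means stronger. A minimal sketch of that transform follows. It assumes the raw values are in nanomolar (the diff does not state the unit), and the helper name `kd_nm_to_pkd` is hypothetical.

```python
import numpy as np


def kd_nm_to_pkd(kd_nm):
    """Convert Kd values (assumed to be in nM) to pKd = -log10(Kd in molar)."""
    kd_nm = np.asarray(kd_nm, dtype=float)
    # Dividing by 1e9 converts nanomolar to molar before taking the log.
    return -np.log10(kd_nm / 1e9)


# Made-up example values, not taken from the Davis dataset.
raw_kd = [10000.0, 100.0, 1.0]
pkd = kd_nm_to_pkd(raw_kd)
print(pkd)  # roughly 5, 7 and 9: weaker binders get smaller pKd
```

On this scale a tenfold drop in Kd adds one unit, which is easier for a regression model to fit than raw values spanning several orders of magnitude.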

@@ -36,7 +36,7 @@ tar -zxvf davis.tgz
 Kiba contains the binding affinity for 2,116 drugs and 229 targets. Comparing to Davis, some drug-target pairs do not have affinity labels. Moreover, the affinity in Kiba is measured as KIBA scores, which were constructed to optimize the consistency between Ki, Kd, and IC50 by utilizing the statistical information they contained. You can download and uncompress this dataset using following command:

 ```sh
-wget "https://baidu-nlp.bj.bcebos.com/PaddleHelix%2Fdatasets%2Fdti_datasets%2Fkiba.tgz" -O kiba.tgz
+wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/dti_datasets/kiba_v1.tgz -O kiba.tgz
 tar -zxvf kiba.tgz
 ```
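
Reviewer note: the context line above mentions that some Kiba drug-target pairs have no affinity label. The preprocessing script in this commit treats those entries as NaN and keeps only the rest. A small sketch of that masking step on a made-up matrix (the values and shape are illustrative only, not real KIBA scores):

```python
import numpy as np

# Toy affinity matrix: rows are drugs, columns are targets, NaN means "no label".
affinity = np.array([
    [11.2, np.nan, 12.1],
    [np.nan, np.nan, 10.5],
])

# np.where on the boolean mask returns the (row, col) indices of labeled
# pairs in row-major order.
rows, cols = np.where(~np.isnan(affinity))
labeled_pairs = list(zip(rows.tolist(), cols.tolist()))

print(labeled_pairs)
print(len(labeled_pairs), "of", affinity.size, "pairs are labeled")
```

Only the labeled pairs end up in the processed files, so the number of samples can be much smaller than drugs times targets.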

@@ -46,32 +46,32 @@ After downloaed these datasets, the `data` folder looks like:

 ```txt
 data
-|-- davis
-|   |-- folds
-|   |   |-- test_fold_setting1.txt
-|   |   `-- train_fold_setting1.txt
-|   |-- ligands_can.txt
-|   |-- processed
-|   |   |-- test
-|   |   |   `-- davis_test_0.npz
-|   |   `-- train
-|   |       `-- davis_train_0.npz
-|   |-- proteins.txt
-|   `-- Y
-|-- davis.tgz
-|-- kiba
-|   |-- folds
-|   |   |-- test_fold_setting1.txt
-|   |   `-- train_fold_setting1.txt
-|   |-- ligands_can.txt
-|   |-- processed
-|   |   |-- test
-|   |   |   `-- kiba_test_0.npz
-|   |   `-- train
-|   |       `-- kiba_train_0.npz
-|   |-- proteins.txt
-|   `-- Y
-`-- kiba.tgz
+├── davis
+│   ├── folds
+│   │   ├── test_fold_setting1.txt
+│   │   └── train_fold_setting1.txt
+│   ├── ligands_can.txt
+│   ├── processed
+│   │   ├── test
+│   │   │   └── davis_test.npz
+│   │   └── train
+│   │       └── davis_train.npz
+│   ├── proteins.txt
+│   └── Y
+├── davis.tgz
+├── kiba
+│   ├── folds
+│   │   ├── test_fold_setting1.txt
+│   │   └── train_fold_setting1.txt
+│   ├── ligands_can.txt
+│   ├── processed
+│   │   ├── test
+│   │   │   └── kiba_test.npz
+│   │   └── train
+│   │       └── kiba_train.npz
+│   ├── proteins.txt
+│   └── Y
+└── kiba.tgz
 ```

 ## Instructions
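
Reviewer note: since this hunk renames the processed files (`davis_test_0.npz` becomes `davis_test.npz`, and so on), a quick layout check can catch a stale download. The sketch below is one possible check, not part of the commit; the `EXPECTED` list is transcribed from the new tree above, and the helper name `missing_files` is hypothetical.

```python
import os
import tempfile

# Relative paths expected under data/<name>, taken from the updated tree.
EXPECTED = [
    "folds/train_fold_setting1.txt",
    "folds/test_fold_setting1.txt",
    "ligands_can.txt",
    "proteins.txt",
    "Y",
    "processed/train/{name}_train.npz",
    "processed/test/{name}_test.npz",
]


def missing_files(root, name):
    """Return the expected paths under root/name that do not exist."""
    missing = []
    for rel in EXPECTED:
        path = os.path.join(root, name, *rel.format(name=name).split("/"))
        if not os.path.exists(path):
            missing.append(path)
    return missing


# Demo on a throwaway directory that contains only the `Y` file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "davis"))
    open(os.path.join(root, "davis", "Y"), "wb").close()
    demo_missing = missing_files(root, "davis")

print(len(demo_missing), "expected files are missing in the demo directory")
```

In a real checkout one would call `missing_files("data", "davis")` and `missing_files("data", "kiba")` after extracting the archives.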

apps/drug_target_interaction/graph_dta/README_cn.md

+28 -28
@@ -30,7 +30,7 @@ The Davis dataset contains the Kd values between every pair of 72 drugs and 442 target proteins (
 Run the following command to download and uncompress the Davis dataset:

 ```sh
-wget "https://baidu-nlp.bj.bcebos.com/PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz" -O davis.tgz
+wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/dti_datasets/davis_v1.tgz -O davis.tgz
 tar -zxvf davis.tgz
 ```

@@ -41,40 +41,40 @@ The Kiba dataset contains 2,116 drugs and 229 target proteins; unlike the Davis data
 Run the following command to download and uncompress the Kiba dataset:

 ```sh
-wget "https://baidu-nlp.bj.bcebos.com/PaddleHelix%2Fdatasets%2Fdti_datasets%2Fkiba.tgz" -O kiba.tgz
+wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/dti_datasets/kiba_v1.tgz -O kiba.tgz
 tar -zxvf kiba.tgz
 ```

 After the download completes, the `data` directory looks like this:

 ```txt
 data
-|-- davis
-|   |-- folds
-|   |   |-- test_fold_setting1.txt
-|   |   `-- train_fold_setting1.txt
-|   |-- ligands_can.txt
-|   |-- processed
-|   |   |-- test
-|   |   |   `-- davis_test_0.npz
-|   |   `-- train
-|   |       `-- davis_train_0.npz
-|   |-- proteins.txt
-|   `-- Y
-|-- davis.tgz
-|-- kiba
-|   |-- folds
-|   |   |-- test_fold_setting1.txt
-|   |   `-- train_fold_setting1.txt
-|   |-- ligands_can.txt
-|   |-- processed
-|   |   |-- test
-|   |   |   `-- kiba_test_0.npz
-|   |   `-- train
-|   |       `-- kiba_train_0.npz
-|   |-- proteins.txt
-|   `-- Y
-`-- kiba.tgz
+├── davis
+│   ├── folds
+│   │   ├── test_fold_setting1.txt
+│   │   └── train_fold_setting1.txt
+│   ├── ligands_can.txt
+│   ├── processed
+│   │   ├── test
+│   │   │   └── davis_test.npz
+│   │   └── train
+│   │       └── davis_train.npz
+│   ├── proteins.txt
+│   └── Y
+├── davis.tgz
+├── kiba
+│   ├── folds
+│   │   ├── test_fold_setting1.txt
+│   │   └── train_fold_setting1.txt
+│   ├── ligands_can.txt
+│   ├── processed
+│   │   ├── test
+│   │   │   └── kiba_test.npz
+│   │   └── train
+│   │       └── kiba_train.npz
+│   ├── proteins.txt
+│   └── Y
+└── kiba.tgz
 ```

 ## Instructions

@@ -0,0 +1,126 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Convert Kiba and Davis datasets into npz file which can be trained directly.
+
+Note that the dataset split is inherited from GraphDTA and DeepDTA
+"""
+
+import os
+import sys
+import json
+import random
+import pickle
+import argparse
+import numpy as np
+from rdkit import Chem
+from rdkit.Chem import AllChem
+from collections import OrderedDict
+
+from pahelix.utils.compound_tools import mol_to_graph_data
+from pahelix.utils.protein_tools import ProteinTokenizer
+from pahelix.utils.data_utils import save_data_list_to_npz
+
+
+def main():
+    """Entry for data preprocessing."""
+    tokenizer = ProteinTokenizer()
+    for dataset in ['davis', 'kiba']:
+        data_dir = os.path.join(args.dataset_root, dataset)
+        if not os.path.exists(data_dir):
+            print('Cannot find {}'.format(data_dir))
+            continue
+
+        train_fold = json.load(
+            open(os.path.join(data_dir, 'folds', 'train_fold_setting1.txt')))
+        train_fold = [ee for e in train_fold for ee in e]  # flatten
+        test_fold = json.load(
+            open(os.path.join(data_dir, 'folds', 'test_fold_setting1.txt')))
+        ligands = json.load(
+            open(os.path.join(data_dir, 'ligands_can.txt')),
+            object_pairs_hook=OrderedDict)
+        proteins = json.load(
+            open(os.path.join(data_dir, 'proteins.txt')),
+            object_pairs_hook=OrderedDict)
+        # Use encoding 'latin1' to load py2 pkl from py3
+        # pylint: disable=E1123
+        affinity = pickle.load(
+            open(os.path.join(data_dir, 'Y'), 'rb'), encoding='latin1')
+
+        smiles_lst, protein_lst = [], []
+        for k in ligands.keys():
+            smiles = Chem.MolToSmiles(Chem.MolFromSmiles(ligands[k]),
+                                      isomericSmiles=True)
+            smiles_lst.append(smiles)
+
+        for k in proteins.keys():
+            protein_lst.append(proteins[k])
+
+        if dataset == 'davis':
+            # Kd data
+            affinity = [-np.log10(y / 1e9) for y in affinity]
+
+        affinity = np.asarray(affinity)
+
+        # pylint: disable=E1123
+        os.makedirs(os.path.join(data_dir, 'processed'), exist_ok=True)
+        for split in ['train', 'test']:
+            print('processing {} set of {}'.format(split, dataset))
+
+            split_dir = os.path.join(data_dir, 'processed', split)
+            # pylint: disable=E1123
+            os.makedirs(split_dir, exist_ok=True)
+
+            fold = train_fold if split == 'train' else test_fold
+            rows, cols = np.where(np.isnan(affinity) == False)
+            rows, cols = rows[fold], cols[fold]
+
+            data_lst = []
+            for idx in range(len(rows)):
+                mol = AllChem.MolFromSmiles(smiles_lst[rows[idx]])
+                mol_graph = mol_to_graph_data(mol)
+                data = {k: v for k, v in mol_graph.items()}
+
+                seqs = []
+                for seq in protein_lst[cols[idx]].split('\x01'):
+                    seqs.extend(tokenizer.gen_token_ids(seq))
+                data['protein_token_ids'] = np.array(seqs)
+
+                af = affinity[rows[idx], cols[idx]]
+                if dataset == 'davis':
+                    data['Log10_Kd'] = np.array([af])
+                elif dataset == 'kiba':
+                    data['KIBA'] = np.array([af])
+
+                data_lst.append(data)
+
+            random.shuffle(data_lst)
+            npz = os.path.join(split_dir, '{}_{}.npz'.format(dataset, split))
+            save_data_list_to_npz(data_lst, npz)
+
+        print('==============================')
+        print('dataset:', dataset)
+        print('train_fold:', len(train_fold))
+        print('test_fold:', len(test_fold))
+        print('unique drugs:', len(set(smiles_lst)))
+        print('unique proteins:', len(set(protein_lst)))
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--dataset_root', type=str, default='data')
+    parser.add_argument('--npz_files', type=int, default=1) # set it > 1 for multi trainers
+    args = parser.parse_args()
+    main()
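
Reviewer note: the split logic in the script above is compact and easy to misread. The train fold file holds a list of index lists that is flattened, and each fold index then addresses the k-th non-NaN entry of the affinity matrix in row-major order. The toy sketch below reproduces only that indexing step. The matrix, the fold contents and the helper name `select_pairs` are made up for illustration and are not read from the real dataset files.

```python
import numpy as np

# Toy affinity matrix (rows = drugs, cols = proteins); NaN marks unlabeled pairs.
affinity = np.array([
    [5.0, np.nan, 7.2],
    [6.1, 8.3, np.nan],
])

# Illustrative fold files: train is a nested list of folds, test is flat.
train_fold_nested = [[0, 3], [1]]
test_fold = [2]

# Flatten the nested train folds, as the script does.
train_fold = [i for sub in train_fold_nested for i in sub]

# Enumerate labeled entries in row-major order; fold indices point into this list.
rows, cols = np.where(~np.isnan(affinity))


def select_pairs(fold):
    """Return (drug_idx, protein_idx, affinity) for each index in the fold."""
    return [(int(r), int(c), float(affinity[r, c]))
            for r, c in zip(rows[fold], cols[fold])]


train_pairs = select_pairs(train_fold)
test_pairs = select_pairs(test_fold)
print("train:", train_pairs)
print("test:", test_pairs)
```

Because the fold indices refer to positions among the labeled entries rather than to flat matrix positions, the same fold files only make sense together with the exact affinity matrix they were generated for.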

@@ -0,0 +1,42 @@
+#!/bin/bash
+cd $(dirname $0)
+cd ..
+
+############
+# config
+############
+root="data"
+
+train() {
+    local dataset=$1
+    local model_dir=$2
+    local extra_args=${@:3}
+
+    python train.py --device gpu \
+        --train_data "$root/$dataset/processed/train/" \
+        --test_data "$root/$dataset/processed/test/" \
+        --model_config $config \
+        --model_dir $model_dir \
+        $extra_args
+}
+
+dataset=$1
+config=$2
+
+if [[ ! -e $config ]]; then
+    echo "Cannot find "$config
+    exit 1
+fi
+
+config_filename=$(basename "$config")
+config_name="${config_filename%.*}"
+model_dir="model_dir/"$dataset"_"$config_name
+train $dataset $model_dir ${@:3}
+
+# for dataset in "davis"
+# do
+#     config_filename=$(basename "$config")
+#     config_name="${config_filename%.*}"
+#     model_dir="model_dir/"$dataset"_"$config_name
+#     train $dataset $model_dir
+# done
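
Reviewer note: the script above derives the output directory from the dataset name and the config file name with its last extension stripped (`${config_filename%.*}`). A rough Python equivalent of that naming rule is sketched below for readers who want to locate a trained model programmatically. The function name `model_dir_for` and the config path are hypothetical, and `os.path.splitext` differs from the shell expansion for dotfiles such as `.json`.

```python
import os


def model_dir_for(dataset, config_path):
    """Mirror the script's rule: model_dir/<dataset>_<config name without extension>."""
    config_name = os.path.splitext(os.path.basename(config_path))[0]
    return "model_dir/{}_{}".format(dataset, config_name)


# Hypothetical config path, used only to illustrate the naming rule.
example = model_dir_for("davis", "model_configs/example_config.json")
print(example)  # model_dir/davis_example_config
```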
