Commit cd72864

author luohongyu01 committed readme
1 parent 4b20327 commit cd72864

3 files changed, +251 -3 lines changed

apps/molecular_generation/SD_VAE/README.md

+20 -3
@@ -70,15 +70,28 @@ To run the training scripts:
 CUDA_VISIBLE_DEVICES=0 python train_zinc.py \
 -mode='gpu' \
 
-#### Run sampling
+#### Download the trained model
+You can download the trained model from (https://baidu-nlp.bj.bcebos.com/PaddleHelix/models/molecular_generation/SD_VAE_model.tgz).
+
+Unzip the file and put the model into the './model' folder:
+
+|__ model
+
+|__ |__ train_model_epoch499
 
 
-python sample_prior.py -info_fold ../data/data_SD_VAE/context_free_grammars \
+#### Run sampling
+Sample from the normal prior distribution:
+
+python sample_prior.py \
+-info_fold ../data/data_SD_VAE/context_free_grammars \
 -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
 -model_config ../model_config.json \
 -saved_model ../model/train_model_epoch499
 
 
+Reconstruct from the reference sequence:
+
 python reconstruct_zinc.py \
 -info_fold ../data/data_SD_VAE/context_free_grammars \
 -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
@@ -87,13 +100,17 @@ To run the training scripts:
 -smiles_file ../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi
 
 
-
 ##### Sampling results from prior
 valid: 0.49
+
 unique@100: 1.0
+
 unique@1000: 1.0
+
 IntDiv: 0.92
+
 IntDiv2: 0.82
+
 Filters: 0.30
 
 ##### Reconstruction result
@@ -0,0 +1,130 @@
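+# SD VAE
+
+## Background
+Deep generative models are currently a popular approach to generating new molecules and optimizing their chemical properties. In this work, we introduce SD VAE, a VAE generative model built on the syntax and semantics of molecular sequences.
+
+## Instructions
+This repository shows how to train the SD VAE model.
+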
+## Data
+
+You can download the data from [datalink](https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/molecular_generation/data_SD_VAE.tgz).
+
+1. Create a folder './data'
+2. Extract the downloaded file into './data'
+
+Folder structure:
+
+|__ data (project root)
+
+|__ |__ data_SD_VAE
+
+|__ |__ |__ context_free_grammars
+
+|__ |__ |__ zinc
+
+
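The two data steps above can be sketched in Python. This is a minimal sketch, assuming the downloaded archive is a gzip-compressed tarball whose top-level folder is `data_SD_VAE`, as the folder structure suggests. The function names `extract_dataset` and `missing_dataset_dirs` are hypothetical helpers, not part of the repository.

```python
import tarfile
from pathlib import Path

# Sub-folders the README lists under data/data_SD_VAE.
EXPECTED_SUBDIRS = ("context_free_grammars", "zinc")


def extract_dataset(archive_path, data_root="./data"):
    """Extract a downloaded .tgz archive into data_root (created if missing)."""
    root = Path(data_root)
    root.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(root)
    return root


def missing_dataset_dirs(data_root="./data"):
    """Return the expected sub-folders that are absent under data_root/data_SD_VAE."""
    base = Path(data_root) / "data_SD_VAE"
    return [name for name in EXPECTED_SUBDIRS if not (base / name).is_dir()]
```

Calling `missing_dataset_dirs()` after extraction and checking for an empty list is a cheap way to catch a wrong working directory before preprocessing starts.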
+## Data preprocessing
+Before training and evaluating the model, we need to preprocess the data to extract the syntactic and semantic information:
+
+cd data_preprocessing
+
+python make_dataset_parallel.py \
+-info_fold ../data/data_SD_VAE/context_free_grammars \
+-grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
+-smiles_file ../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi
+
+
+python dump_cfg_trees.py \
+-info_fold ../data/data_SD_VAE/context_free_grammars \
+-grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
+-smiles_file ../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi
+
+
+The two scripts above convert the text data into binary files and CFG dump files, respectively.
+
+## Model training
+
+#### Model settings
+The model settings define the parameters used to build the model. They are stored in model_config.json:
+
+"latent_dim": the hidden size of the latent space
+"max_decode_steps": the maximum number of steps for making decoding decisions
+"eps_std": the standard deviation used in the reparameterization trick
+"encoder_type": the type of encoder
+"rnn_type": the RNN type
+
+
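To make the role of `eps_std` concrete, here is a minimal sketch of the reparameterization trick in plain Python. It assumes the encoder outputs a mean and a log-variance per latent dimension and that `eps_std` scales the sampled noise. The README does not spell this out, so treat both as assumptions. `reparameterize` is a hypothetical name.

```python
import math
import random


def reparameterize(mean, log_var, eps_std=1.0, rng=random):
    """Draw z = mean + sigma * eps elementwise, with eps ~ N(0, eps_std**2)."""
    z = []
    for mu, lv in zip(mean, log_var):
        sigma = math.exp(0.5 * lv)     # assumed: the encoder predicts log-variance
        eps = rng.gauss(0.0, eps_std)  # noise whose spread is set by eps_std
        z.append(mu + sigma * eps)
    return z
```

With `eps_std=0.0` the noise vanishes and the function returns the mean unchanged, which is a convenient way to get deterministic encodings.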
+#### Training settings
+To train the model, we first need to set its parameters. The default values are stored in ./mol_common/cmd_args.py:
+
+-loss_type : the type of loss
+-num_epochs : number of epochs
+-batch_size : minibatch size
+-learning_rate : learning rate
+-kl_coeff : coefficient for the KL divergence used in the VAE
+-clip_grad : clip gradients to this value
+
+
+Train the model:
+
+CUDA_VISIBLE_DEVICES=0 python train_zinc.py \
+-mode='gpu' \
+
+#### Download the trained model
+You can download the pretrained model from (https://baidu-nlp.bj.bcebos.com/PaddleHelix/models/molecular_generation/SD_VAE_model.tgz).
+
+Extract the file and place the model in the './model' folder, laid out as follows:
+
+|__ model
+
+|__ |__ train_model_epoch499
+
+
+#### Run sampling
+Sample from the normal prior:
+
+python sample_prior.py \
+-info_fold ../data/data_SD_VAE/context_free_grammars \
+-grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
+-model_config ../model_config.json \
+-saved_model ../model/train_model_epoch499
+
+
+Reconstruct from the reference sequences:
+
+python reconstruct_zinc.py \
+-info_fold ../data/data_SD_VAE/context_free_grammars \
+-grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
+-model_config ../model_config.json \
+-saved_model ../model/train_model_epoch499 \
+-smiles_file ../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi
+
+
+##### Sampling results from the normal prior
+valid: 0.49
+
+unique@100: 1.0
+
+unique@1000: 1.0
+
+IntDiv: 0.92
+
+IntDiv2: 0.82
+
+Filters: 0.30
+
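The README reports IntDiv and IntDiv2 without defining them. Assuming they follow the MOSES benchmark convention (one minus the p-th power mean of pairwise Tanimoto similarity, with p = 1 and p = 2), a minimal sketch looks like the following. Fingerprints are modeled here as Python sets of on-bit indices. In practice they would come from a cheminformatics toolkit, which this sketch deliberately leaves out. Both function names are hypothetical.

```python
from itertools import product


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def internal_diversity(fingerprints, p=1):
    """Return 1 - (mean over all ordered pairs of similarity**p) ** (1/p)."""
    if not fingerprints:
        raise ValueError("need at least one fingerprint")
    total = sum(tanimoto(a, b) ** p for a, b in product(fingerprints, repeat=2))
    mean = total / len(fingerprints) ** 2
    return 1.0 - mean ** (1.0 / p)
```

Under this definition a set of identical molecules scores 0, and the score grows as the generated molecules become less similar to one another.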
+##### Reconstruction result
+accuracy: 0.92
+
+
+
+## References
+[1] @misc{dai2018syntaxdirected,
+title={Syntax-Directed Variational Autoencoder for Structured Data},
+author={Hanjun Dai and Yingtao Tian and Bo Dai and Steven Skiena and Le Song},
+year={2018},
+eprint={1802.08786},
+archivePrefix={arXiv},
+primaryClass={cs.LG}
+}
@@ -0,0 +1,101 @@
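+# Sequence VAE
+
+## Background
+Deep generative models are currently a popular approach to generating new molecules and optimizing their chemical properties. In this work, we introduce Sequence VAE, a VAE generative model for molecular sequences.
+
+
+## Instructions
+This repository shows how to train a sequence VAE model.
+
+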
+### Data link
+Download the training data, ZINC Clean Leads [1]: (https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/molecular_generation/zinc_moses.tgz).
+
+1. Create a folder './data'
+2. Extract the file into './data'
+
+data
+|-- zinc_moses
+|   |-- train.csv
+|   |-- test.csv
+
+
+### Train the model
+
+#### Model settings
+The model settings define the parameters used to build the model. They are stored in model_config.json:
+
+# Encoder
+"max_length": the maximum length of the input sequence
+"q_cell": the RNN cell type of the encoding network
+"q_bidir": whether the encoding RNN is bidirectional
+"q_d_h": the hidden size of the encoding RNN
+"q_n_layers": the number of layers in the encoding RNN
+"q_dropout": the dropout rate of the encoding RNN
+
+# Decoder
+"d_cell": the RNN cell type of the decoding network
+"d_n_layers": the number of layers in the decoding RNN
+"d_dropout": the dropout rate of the decoding RNN
+"d_z": the hidden size of the latent space
+"d_d_h": the hidden size of the decoding RNN
+"freeze_embeddings": whether to freeze the embedding layer
+
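A config with a missing key usually fails late, deep inside model construction. The sketch below checks the documented keys up front. It assumes `model_config.json` is a flat JSON object, which the README does not confirm, and the values in `EXAMPLE_CONFIG` are placeholders for illustration, not the repository's settings. `load_model_config` is a hypothetical helper.

```python
import json

# Keys the README documents for model_config.json.
ENCODER_KEYS = ("max_length", "q_cell", "q_bidir", "q_d_h", "q_n_layers", "q_dropout")
DECODER_KEYS = ("d_cell", "d_n_layers", "d_dropout", "d_z", "d_d_h", "freeze_embeddings")

# Placeholder values for illustration only; not the repository's defaults.
EXAMPLE_CONFIG = """
{
  "max_length": 80,
  "q_cell": "gru", "q_bidir": true, "q_d_h": 256, "q_n_layers": 1, "q_dropout": 0.5,
  "d_cell": "gru", "d_n_layers": 3, "d_dropout": 0.0, "d_z": 128, "d_d_h": 512,
  "freeze_embeddings": false
}
"""


def load_model_config(text):
    """Parse a JSON config string and fail early if a documented key is missing."""
    config = json.loads(text)
    missing = [key for key in ENCODER_KEYS + DECODER_KEYS if key not in config]
    if missing:
        raise KeyError(f"model config is missing keys: {missing}")
    return config
```

To check a file on disk, read it first and pass the text, for example `load_model_config(open("model_config.json").read())`.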
+#### Training settings
+To train the model, we first need to set its parameters. The default values are stored in args.py:
+
+
+# Train
+'--n_epoch': number of training epochs, default=1000
+'--n_batch': batch size, default=1000
+'--lr_start': initial learning rate, default=3 * 1e-4
+
+# KL annealing
+'--kl_start': epoch at which the KL weight starts to change, default=0
+'--kl_w_start': initial KL weight, default=0
+'--kl_w_end': maximum KL weight, default=0.05
+
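The three KL annealing flags only fix a start epoch and the two end points of the weight. The shape of the schedule is not stated. The sketch below assumes a linear ramp that reaches `kl_w_end` at the final epoch, which is one common choice. The function name `kl_weight` is hypothetical and the defaults simply mirror the values listed above.

```python
def kl_weight(epoch, n_epoch=1000, kl_start=0, kl_w_start=0.0, kl_w_end=0.05):
    """Linear KL weight: kl_w_start until kl_start, reaching kl_w_end at n_epoch."""
    if n_epoch <= kl_start:
        raise ValueError("n_epoch must be greater than kl_start")
    step = (kl_w_end - kl_w_start) / (n_epoch - kl_start)
    elapsed = max(0, epoch - kl_start)  # no change before the start epoch
    return kl_w_start + elapsed * step
```

In a training loop the returned value would multiply the KL term of the loss, so early epochs focus on reconstruction and the regularization pressure grows over time.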
+Train the model:
+
+```
+
+CUDA_VISIBLE_DEVICES=0 python trainer.py \
+
+--device='gpu' \
+
+--dataset_dir='./data/zinc_moses/train.csv' \
+
+--model_config='model_config.json' \
+
+--model_save='./results/train_models/' \
+
+--config_save='./results/config/' \
+```
+
+
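As committed, the command has blank lines between its continuation lines and a trailing backslash, so pasting it into a shell verbatim would end the command after the first line. One way to sidestep shell continuation entirely is to pass the arguments as a list. The sketch below only assembles the invocation from the flags shown above. `build_train_command` and `build_train_env` are hypothetical helpers, and nothing here is verified against `trainer.py` itself.

```python
import os
import sys


def build_train_command(
    dataset_dir="./data/zinc_moses/train.csv",
    model_config="model_config.json",
    model_save="./results/train_models/",
    config_save="./results/config/",
    device="gpu",
):
    """Return the trainer.py invocation as an argv list (no shell continuations)."""
    return [
        sys.executable,
        "trainer.py",
        f"--device={device}",
        f"--dataset_dir={dataset_dir}",
        f"--model_config={model_config}",
        f"--model_save={model_save}",
        f"--config_save={config_save}",
    ]


def build_train_env(gpu_id=0):
    """Copy the current environment and pin the visible GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env
```

From the directory that contains `trainer.py`, the pair can be handed to `subprocess.run(build_train_command(), env=build_train_env(), check=True)`.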
+#### Download the trained model
+You can download the pretrained model from (https://baidu-nlp.bj.bcebos.com/PaddleHelix/models/molecular_generation/seq_VAE_model.tgz).
+
+
+#### Sampling results from the normal prior
+
+Valid: 0.9765
+
+Novelty: 0.731
+
+Unique@1k: 0.993
+
+Filters: 0.853
+
+IntDiv: 0.846
+
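For readers unfamiliar with these metric names, here is a minimal sketch of how Valid, Unique@k and Novelty are commonly computed, assuming MOSES-style definitions (the README cites MOSES for the data but does not define the metrics). Validity checking is delegated to a caller-supplied predicate `is_valid`, because a real check needs a cheminformatics toolkit, and molecules are compared as raw strings, whereas a real pipeline would canonicalize them first. `sampling_metrics` is a hypothetical name.

```python
def sampling_metrics(samples, is_valid, train_set, k=1000):
    """Compute validity, unique@k and novelty for a list of generated strings."""
    valid = [s for s in samples if is_valid(s)]
    if not valid:
        raise ValueError("no valid samples to score")
    top_k = valid[:k]            # unique@k looks at the first k valid samples
    unique_valid = set(valid)
    return {
        "valid": len(valid) / len(samples),
        "unique@k": len(set(top_k)) / len(top_k),
        # novelty: share of distinct valid samples not seen in training
        "novelty": len(unique_valid - set(train_set)) / len(unique_valid),
    }
```

A toy call such as `sampling_metrics(["CCO", "CCO", "CCN"], lambda s: True, {"CCO"})` shows the three numbers moving independently: every sample is valid, one is a duplicate, and only one of the two distinct molecules is new.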
+## References
+
+[1] @article{polykovskiy2020molecular,
+title={Molecular sets (MOSES): a benchmarking platform for molecular generation models},
+author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and others},
+journal={Frontiers in pharmacology},
+volume={11},
+year={2020},
+publisher={Frontiers Media SA}
+}
