readme

luohongyu01 · luohongyu01 · commit cd7286405bd8 · 2021-05-17T14:44:51.000+08:00
diff --git a/apps/molecular_generation/SD_VAE/README.md b/apps/molecular_generation/SD_VAE/README.md
@@ -70,15 +70,28 @@ To run the trianing scripts:
     CUDA_VISIBLE_DEVICES=0 python train_zinc.py \
     -mode='gpu' \
 
-#### Run sampling
+#### Download the trained model
+You can download the trained model from (https://baidu-nlp.bj.bcebos.com/PaddleHelix/models/molecular_generation/SD_VAE_model.tgz).
+
+Unzip the file and put the model into the './model' folder:
+
+|__  model
+
+|__ |__ train_model_epoch499
 
 
-    python sample_prior.py -info_fold ../data/data_SD_VAE/context_free_grammars  \
+#### Run sampling
+Sample from the normal prior distribution:
+
+    python sample_prior.py \
+      -info_fold ../data/data_SD_VAE/context_free_grammars  \
       -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
       -model_config ../model_config.json \
       -saved_model ../model/train_model_epoch499
 
 
+Reconstruct from the reference sequence:
+
     python reconstruct_zinc.py  \
       -info_fold ../data/data_SD_VAE/context_free_grammars \
       -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
@@ -87,13 +100,17 @@ To run the trianing scripts:
       -smiles_file ../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi 
 
 
-
 ##### Sampling results from prior
 valid: 0.49
+
 unique@100: 1.0
+
 unique@1000: 1.0
+
 IntDiv: 0.92
+
 IntDiv2: 0.82
+
 Filters: 0.30
 
 ##### Reconstruction result
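
A note on the download step added to README.md in this commit: fetching and unpacking the model can also be scripted. The following is a minimal Python sketch and is not part of the commit. The URL is the one given in the README; the helper names `download_model` and `extract_model` and the local archive name are hypothetical, and the sketch assumes the archive unpacks directly to `train_model_epoch499`, which the README implies but does not state.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the README; everything else here is illustrative.
MODEL_URL = (
    "https://baidu-nlp.bj.bcebos.com/PaddleHelix/models/"
    "molecular_generation/SD_VAE_model.tgz"
)


def download_model(url=MODEL_URL, archive_path="SD_VAE_model.tgz"):
    """Download the archive to archive_path and return that path."""
    urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract_model(archive_path, target_dir="./model"):
    """Unpack a .tgz archive into target_dir and list what ended up there."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target)
    # Sorted names of the entries directly under target_dir.
    return sorted(p.name for p in target.iterdir())
```

Calling `extract_model(download_model())` should leave the checkpoint under `./model`. If the archive turns out to wrap its contents in an extra top-level folder, move `train_model_epoch499` up so that the `-saved_model ../model/train_model_epoch499` path used by the sampling commands resolves.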
diff --git a/apps/molecular_generation/SD_VAE/README_cn.md b/apps/molecular_generation/SD_VAE/README_cn.md
@@ -0,0 +1,130 @@
+# SD VAE
+
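+## Background
+Deep generative models are currently popular for generating new molecules and optimizing their chemical properties. In this work, we introduce SD VAE, a VAE-based generative model that uses the syntax and semantics of molecular sequences.
+
+## Instructions
+This repository shows you how to train an SD VAE model.
+
+## Data
+
+You can download the data from [datalink](https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/molecular_generation/data_SD_VAE.tgz).
+
+1. Create a folder './data'
+2. Unzip the downloaded file into './data'
+
+Folder structure: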
+
+|__ data (project root)
+
+|__ |__  data_SD_VAE
+
+|__ |__  |__ context_free_grammars
+
+|__ |__  |__ zinc
+
+
+## Data preprocessing
+Before training or evaluating the model, we need to preprocess the data to extract the syntactic and semantic information:
+
+    cd data_preprocessing
+    
+    python make_dataset_parallel.py \
+    -info_fold ../data/data_SD_VAE/context_free_grammars \
+        -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
+        -smiles_file ../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi 
+        
+        
+    python dump_cfg_trees.py \
+    -info_fold ../data/data_SD_VAE/context_free_grammars \
+        -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
+        -smiles_file ../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi 
+        
+
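+The two scripts above convert the txt data into binary files and CFG dump files, respectively.
+
+## Training
+    
+#### Model config
+The model config sets the parameters used to build the model. They are stored in model_config.json: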
+
+    "latent_dim": the hidden size of the latent space
+    "max_decode_steps": maximum number of steps for making decoding decisions
+    "eps_std": the standard deviation used in the reparameterization trick
+    "encoder_type": the type of encoder
+    "rnn_type": the type of RNN
+
+
+#### Training setting
+To train the model, we first need to set the training parameters. The default values are stored in ./mol_common/cmd_args.py:
+
+    -loss_type : the type of loss
+    -num_epochs : number of epochs
+    -batch_size : minibatch size
+    -learning_rate : learning rate
+    -kl_coeff : coefficient for the KL divergence used in the VAE
+    -clip_grad : clip gradients to this value
+
+
+To run the training script:
+
+    CUDA_VISIBLE_DEVICES=0 python train_zinc.py \
+    -mode='gpu' \
+
+#### Download the trained model
+You can download the pre-trained model from (https://baidu-nlp.bj.bcebos.com/PaddleHelix/models/molecular_generation/SD_VAE_model.tgz).
+
+Unzip the file and put the model into the './model' folder, laid out as follows:
+
+|__  model
+
+|__ |__ train_model_epoch499
+
+
+#### Run sampling
+Sample from the normal prior distribution:
+
+    python sample_prior.py \
+      -info_fold ../data/data_SD_VAE/context_free_grammars  \
+      -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
+      -model_config ../model_config.json \
+      -saved_model ../model/train_model_epoch499
+
+
+Reconstruct from the reference sequence:
+
+    python reconstruct_zinc.py  \
+      -info_fold ../data/data_SD_VAE/context_free_grammars \
+      -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
+      -model_config ../model_config.json \
+      -saved_model ../model/train_model_epoch499 \
+      -smiles_file ../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi 
+
+
+##### Sampling results from prior
+valid: 0.49
+
+unique@100: 1.0
+
+unique@1000: 1.0
+
+IntDiv: 0.92
+
+IntDiv2: 0.82
+
+Filters: 0.30
+
+##### Reconstruction result
+accuracy: 0.92
+
+
+
+## Reference
+[1] @misc{dai2018syntaxdirected,
+      title={Syntax-Directed Variational Autoencoder for Structured Data}, 
+      author={Hanjun Dai and Yingtao Tian and Bo Dai and Steven Skiena and Le Song},
+      year={2018},
+      eprint={1802.08786},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG}
+}
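
A note on the SD VAE figures above: the README reports `unique@100`, `unique@1000` and a reconstruction `accuracy` without saying how they are computed. The sketch below shows one plausible reading and is not part of the commit. It assumes `unique@k` is the fraction of distinct molecules among the first k valid samples and that accuracy is the exact-match rate between reference and reconstructed strings. Both function names are hypothetical, and the comparison is on raw strings, whereas a real pipeline would probably canonicalize the SMILES first.

```python
def unique_at_k(smiles, k):
    """Fraction of distinct strings among the first k entries of smiles.

    Assumes smiles already contains only valid molecules.
    """
    head = smiles[:k]
    if not head:
        return 0.0
    return len(set(head)) / len(head)


def reconstruction_accuracy(references, reconstructions):
    """Share of positions where the reconstruction equals the reference."""
    if len(references) != len(reconstructions):
        raise ValueError("references and reconstructions differ in length")
    if not references:
        return 0.0
    matches = sum(ref == rec for ref, rec in zip(references, reconstructions))
    return matches / len(references)
```

Under this reading, `unique@1000: 1.0` would mean that none of the first 1000 valid samples was a duplicate.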
diff --git a/apps/molecular_generation/seq_VAE/Readme_cn.md b/apps/molecular_generation/seq_VAE/Readme_cn.md
@@ -0,0 +1,101 @@
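+# Sequence VAE
+
+## Background
+Deep generative models are currently popular for generating new molecules and optimizing their chemical properties. In this work, we introduce Sequence VAE, a VAE generative model for molecular sequences.
+
+
+## Instructions
+This repository shows you how to train a Sequence VAE model.
+
+
+### Data link
+You can download the ZINC Clean Leads training data [1] from (https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/molecular_generation/zinc_moses.tgz).
+
+1. Create a folder './data'
+2. Unzip the file into './data'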
+
+data
+|-- zinc_moses
+|   |-- train.csv
+|   |-- test.csv
+
+
+### Training
+
+#### Model config
+The model config sets the parameters used to build the model. They are stored in model_config.json:
+
+    # Encoder
+    "max_length": the maximum length of the input sequence
+    "q_cell": the RNN cell type of the encoding network
+    "q_bidir": whether the encoding RNN is bidirectional
+    "q_d_h": the hidden size of the encoding RNN
+    "q_n_layers": the number of layers in the encoding RNN
+    "q_dropout": the dropout rate of the encoding RNN
+
+    # Decoder
+    "d_cell": the RNN cell type of the decoding network
+    "d_n_layers": the number of layers in the decoding RNN
+    "d_dropout": the dropout rate of the decoding RNN
+    "d_z": the hidden size of the latent space
+    "d_d_h": the hidden size of the decoding RNN
+    "freeze_embeddings": whether to freeze the embedding layer
+    
+#### Training setting
+To train the model, we first need to set the training parameters. The default values are stored in args.py:
+
+  
+    # Train
+    '--n_epoch': number of training epochs, default=1000
+    '--n_batch': batch size, default=1000
+    '--lr_start': initial learning rate, default=3 * 1e-4
+
+    # KL annealing
+    '--kl_start': epoch at which the KL weight starts to change, default=0
+    '--kl_w_start': initial KL weight, default=0
+    '--kl_w_end': maximum KL weight, default=0.05
+
+To run the training script:
+
+```
+CUDA_VISIBLE_DEVICES=0 python trainer.py \
+    --device='gpu' \
+    --dataset_dir='./data/zinc_moses/train.csv' \
+    --model_config='model_config.json' \
+    --model_save='./results/train_models/' \
+    --config_save='./results/config/'
+```
+
+
+#### Download the trained model
+You can download the pre-trained model from (https://baidu-nlp.bj.bcebos.com/PaddleHelix/models/molecular_generation/seq_VAE_model.tgz).
+
+
+#### Sampling results from the normal prior
+
+Valid: 0.9765
+
+Novelty: 0.731
+
+Unique@1k: 0.993
+
+Filters: 0.853
+
+IntDiv: 0.846
+
+## Reference
+
+[1] @article{polykovskiy2020molecular,
+  title={Molecular sets (MOSES): a benchmarking platform for molecular generation models},
+  author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and others},
+  journal={Frontiers in pharmacology},
+  volume={11},
+  year={2020},
+  publisher={Frontiers Media SA}
+}
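
A note on the seq_VAE training settings above: `--kl_start`, `--kl_w_start` and `--kl_w_end` describe a KL weight that changes over the course of training, but the README does not say which schedule `trainer.py` applies. The sketch below is not part of the commit and assumes a linear ramp that begins at `kl_start` and reaches `kl_w_end` at the final epoch. The function name `kl_weight` is hypothetical; the defaults are the ones listed in the README.

```python
def kl_weight(epoch, n_epoch=1000, kl_start=0, kl_w_start=0.0, kl_w_end=0.05):
    """KL weight for a given epoch under an assumed linear annealing schedule."""
    if epoch < kl_start:
        # Before annealing begins, hold the initial weight.
        return kl_w_start
    # Number of epochs over which the weight ramps up (at least 1).
    span = max(n_epoch - kl_start, 1)
    # Fraction of the ramp completed, capped at 1.0.
    progress = min((epoch - kl_start) / span, 1.0)
    return kl_w_start + progress * (kl_w_end - kl_w_start)


def vae_loss(reconstruction_loss, kl_divergence, epoch):
    """Total loss: reconstruction term plus the annealed KL term."""
    return reconstruction_loss + kl_weight(epoch) * kl_divergence
```

With the default values the weight starts at 0, so early epochs optimize reconstruction almost exclusively, and the KL term is phased in gradually up to a weight of 0.05.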