Commit 64af89f

hedonglong authored and committed

fix zero-tensor issue

2 parents f74a1e0 + 1bb0387 · commit 64af89f

60 files changed: +6,028 −1,303 lines
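The change repeated across most of these files replaces `tensor.numpy()[0]` with `float(tensor)` or `int(tensor)` when reading a scalar out of a Paddle tensor. A minimal sketch of the failure mode presumably being fixed, assuming Paddle's newer zero-dimensional (0-D) tensor semantics:

```python
import paddle

# Older Paddle returned scalars (e.g. a reduced loss) as shape-[1] tensors,
# so `loss.numpy()[0]` extracted the value. With 0-D tensors, `loss.numpy()`
# is a 0-D ndarray, and indexing it with [0] raises an IndexError.
loss = paddle.mean(paddle.to_tensor([0.25, 0.75]))  # scalar (0-D) tensor

# value = loss.numpy()[0]  # breaks once the tensor is 0-D

# float()/int() work for both shape-[1] and 0-D tensors, which is the form
# this commit converts every such read to.
value = float(loss)
print(value)  # 0.5
```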

.github/LIT-PCBA_result.png (316 KB)

.github/optimus_framework3.png (722 KB)

.github/pcqm4mv2_result.png (251 KB)

README.md (+2)

@@ -12,6 +12,8 @@ English | [简体中文](README_cn.md)
 
 
 ## Latest News
+`2022.08.11` PaddleHelix released the code of HelixGEM-2, a novel Molecular Property Prediction Network that models full-range many-body interactions, which ranked 1st on the OGB [PCQM4Mv2](https://ogb.stanford.edu/docs/lsc/leaderboards/) leaderboard. Please refer to the [paper](https://arxiv.org/abs/2208.05863) and [code](./apps/pretrained_compound/ChemRL/GEM-2) for more details.
+
 `2022.07.29` PaddleHelix released the code of HelixFold-Single, an **MSA-free** protein structure prediction pipeline relying on only the primary sequences, which can **predict the protein structures within seconds**. Please refer to the [paper](https://arxiv.org/abs/2207.13921) and [code](./apps/protein_folding/helixfold-single) for more details. Welcome to the [PaddleHelix website](https://paddlehelix.baidu.com/app/drug/protein-single/forecast
 ) to try out the structure prediction online service.

README_cn.md (+2)

@@ -10,6 +10,8 @@
 ![support os](https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-yellow.svg)
 
 ## Latest News
+`2022.08.11` The PaddleHelix team open-sourced the code of HelixGEM-2, a new small-molecule property prediction framework based on long-range many-body modeling, which took 1st place on the OGB [PCQM4Mv2](https://ogb.stanford.edu/docs/lsc/leaderboards/) leaderboard. See the [paper](https://arxiv.org/abs/2208.05863) and [code](./apps/pretrained_compound/ChemRL/GEM-2) for details.
+
 `2022.07.29` The PaddleHelix team open-sourced the code of HelixFold-Single, an **MSA-free** protein structure prediction pipeline that needs only the primary sequence as input and delivers **protein structure prediction within seconds**. See the [paper](https://arxiv.org/abs/2207.13921) and [code](./apps/protein_folding/helixfold-single). Welcome to the [PaddleHelix website](https://paddlehelix.baidu.com/app/drug/protein-single/forecast
 ) to try the online structure prediction service.

apps/drug_drug_synergy/RGCN/train.py (+1 −1)

@@ -87,7 +87,7 @@ def train(num_subgraph, graph, label_idx, epochs, sub_neighbours=[10, 10], init=
         fpr, tpr, _ = roc_curve(y_true=ground_truth, y_score=pred_prob)
         auc_v = auc(fpr, tpr)
         print("sub_graph index : {} | epoch: {} | training loss: {:.4f} | AUC: {:.3f}".format(
-            sub_g, epoch, train_loss.numpy()[0], auc_v))
+            sub_g, epoch, float(train_loss), auc_v))
 
     return model

apps/drug_target_interaction/batchdta/pairwise/DeepDTA/utils.py (+1 −1)

@@ -312,7 +312,7 @@ def model_eval(model,val_dataloader):
 
     for i_target_score in range(batch_smiles.shape[0]):
 
-        i_target_len = int(batch_len[i_target_score].numpy()[0])
+        i_target_len = int(batch_len[i_target_score])
         smiles = batch_smiles[i_target_score][0:i_target_len]
         target = batch_protein[i_target_score][0:i_target_len]
         y_label = batch_y[i_target_score][0:i_target_len].numpy()

apps/drug_target_interaction/batchdta/pairwise/GraphDTA/run_pairwise_GraphDTA_CV.py (+2 −2)

@@ -195,9 +195,9 @@ def model_eval(model,val_dataloader,device):
         i_data = i_data.to(device)
         pred_scores = model.forward_single(i_data)
         # get the predicted labels
-        i_target_pred_scores.append(pred_scores.cpu().numpy()[0])
+        i_target_pred_scores.append(float(pred_scores))
         # get the true labels
-        i_target_y_label.append(i_data.y.cpu().numpy()[0])
+        i_target_y_label.append(float(i_data.y.cpu()))
 
     i_target_pred_scores = np.array(i_target_pred_scores)
     i_target_y_label = np.array(i_target_y_label)

apps/drug_target_interaction/batchdta/pairwise/Moltrans/helper/utils/paddle_tensor.py (+1 −1)

@@ -32,7 +32,7 @@ def item(self):
     """
     Item function
     """
-    return self.numpy()[0]
+    return float(self)
 
 
 @add_tensor_function
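`add_tensor_function` itself is not shown in this hunk; judging from its name and how `item` is defined here, it presumably registers the decorated function as a method on `paddle.Tensor`. A hedged sketch of that monkey-patching pattern (not the file's actual helper, which may differ):

```python
import paddle

def add_tensor_function(func):
    """Assumed behavior: attach `func` to paddle.Tensor under the same name."""
    setattr(paddle.Tensor, func.__name__, func)
    return func

@add_tensor_function
def item(self):
    """Item function, robust to both shape-[1] and 0-D tensors."""
    return float(self)

x = paddle.to_tensor(3.0)
print(x.item())  # 3.0
```

The same one-line patch recurs below in the pointwise Moltrans and moltrans_dti copies of this file.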

apps/drug_target_interaction/batchdta/pairwise/Moltrans/run_pairwise_Moltrans_CV.py (+1 −1)

@@ -297,7 +297,7 @@ def model_eval(model,val_dataloader,len_SMILES,len_target):
 
     for i_target_score in range(batch_x.shape[0]):
 
-        i_target_len = int(batch_len[i_target_score].numpy()[0])
+        i_target_len = int(batch_len[i_target_score])
         smiles = batch_x_smiles[i_target_score][0:i_target_len]
         target = batch_x_protein[i_target_score][0:i_target_len]
         smiles_mask = batch_x_smiles_mask[i_target_score][0:i_target_len]

apps/drug_target_interaction/batchdta/pairwise/Moltrans/run_pairwise_Moltrans_bindingDB.py (+1 −1)

@@ -282,7 +282,7 @@ def model_eval(model,val_dataloader,len_SMILES,len_target):
 
     for i_target_score in range(batch_x.shape[0]):
 
-        i_target_len = int(batch_len[i_target_score].numpy()[0])
+        i_target_len = int(batch_len[i_target_score])
         smiles = batch_x_smiles[i_target_score][0:i_target_len]
         target = batch_x_protein[i_target_score][0:i_target_len]
         smiles_mask = batch_x_smiles_mask[i_target_score][0:i_target_len]

apps/drug_target_interaction/batchdta/pointwise/DeepDTA/train_bindingdb.py (+1 −1)

@@ -60,7 +60,7 @@ def training(model, training_loader, optim):
     optim.clear_grad()
     loss.backward()
     optim.step()
-    res_loss = loss.numpy()[0]
+    res_loss = float(loss)
     return res_loss

apps/drug_target_interaction/batchdta/pointwise/DeepDTA/train_davis.py (+1 −1)

@@ -60,7 +60,7 @@ def training(model, training_loader, optim):
     optim.clear_grad()
     loss.backward()
     optim.step()
-    res_loss = loss.numpy()[0]
+    res_loss = float(loss)
     return res_loss

apps/drug_target_interaction/batchdta/pointwise/DeepDTA/train_kiba.py (+1 −1)

@@ -63,7 +63,7 @@ def training(model, training_loader, optim):
     optim.clear_grad()
     loss.backward()
     optim.step()
-    res_loss = loss.numpy()[0]
+    res_loss = float(loss.numpy())
     return res_loss

apps/drug_target_interaction/batchdta/pointwise/Moltrans/helper/utils/paddle_tensor.py (+1 −1)

@@ -32,7 +32,7 @@ def item(self):
     """
     Item function
     """
-    return self.numpy()[0]
+    return float(self)
 
 
 @add_tensor_function

apps/drug_target_interaction/moltrans_dti/helper/utils/paddle_tensor.py (+1 −1)

@@ -32,7 +32,7 @@ def item(self):
     """
     Item function
     """
-    return self.numpy()[0]
+    return float(self.numpy())
 
 
 @add_tensor_function

apps/fewshot_molecular_property/chem_lib/models/trainer.py (+1 −1)

@@ -294,7 +294,7 @@ def train_step(self):
         losses_eval.backward()
         self.optimizer.step()
 
-        print('Train Epoch:', self.train_epoch, ', train update step:', k, ', loss_eval:', losses_eval.numpy()[0])
+        print('Train Epoch:', self.train_epoch, ', train update step:', k, ', loss_eval:', float(losses_eval))
 
     return self.model.layers

apps/molecular_generation/SD_VAE/train_zinc.py (+3 −3)

@@ -122,9 +122,9 @@ def _train_epoch(model, data_loader, epoch, kl_weight, optimizer=None):
         optimizer.clear_grad()
 
     # Log
-    kl_loss_values.append(kl_loss.numpy()[0])
-    perplexity_loss_values.append(perplexity.numpy()[0])
-    loss_values.append(loss.numpy()[0])
+    kl_loss_values.append(float(kl_loss))
+    perplexity_loss_values.append(float(perplexity))
+    loss_values.append(float(loss))
     lr = (optimizer.get_lr()
           if optimizer is not None
           else 0)
apps/pretrained_compound/ChemRL/GEM-2/README.md

@@ -1,8 +1,27 @@
-# GEM-2: Next Generation Molecular Property Prediction Network with Many-body and Full-range Interaction Modeling
-Molecular property prediction is a fundamental task in the drug and material industries. Physically, the properties of a molecule are determined by its own electronic structure, which can be exactly described by the Schrödinger equation. However, solving the Schrödinger equation for most molecules is extremely challenging due to long-range interactions in the behavior of a quantum many-body system. While deep learning methods have proven to be effective in molecular property prediction, we design a novel method, namely GEM-2, which comprehensively considers both the long-range and many-body interactions in molecules. GEM-2 consists of two interacted tracks: an atom-level track modeling both the local and global correlation between any two atoms, and a pair-level track modeling the correlation between all atom pairs, which embed information between any 3 or 4 atoms. Extensive experiments demonstrated the superiority of GEM-2 over multiple baseline methods in quantum chemistry and drug discovery tasks.
-
+# GEM-2: Next Generation Molecular Property Prediction Network by Modeling Full-range Many-body Interactions
+GEM-2 is a molecular modeling framework that comprehensively considers full-range many-body interactions in molecules. Multiple tracks are used to model the full-range interactions between the many-bodies of different orders, and a novel axial attention mechanism is designed to approximate full-range interaction modeling at much lower computational cost.
 A preprint version of our work can be found [here](https://arxiv.org/abs/2208.05863).
 
+## Framework
+<p align="center">
+<img src="../../../../.github/optimus_framework3.png" align="middle" height="70%" width="70%" />
+</p>
+
+The overall framework of GEM-2. First, a molecule is described by the representations of many-bodies of multiple orders. Then, Optimus blocks are designed to update the representations. Each Optimus block contains $M$ tracks, and the $m$-th track contains a stack of many-body axial attentions to model the full-range interactions between the $m$-bodies. The many-body axial attentions and the Low2High module also play the role of exchanging messages across the tracks. Finally, the molecular property prediction is made by pooling over the $1$-body representations.
+## Results
+### PCQM4Mv2
+PCQM4Mv2 is a large-scale quantum chemistry dataset containing DFT-calculated HOMO-LUMO
+energy gaps. The OGB leaderboard for PCQM4Mv2 can be found [here](https://ogb.stanford.edu/docs/lsc/leaderboards/#pcqm4mv2).
+<p align="center">
+<img src="../../../../.github/pcqm4mv2_result.png" align="middle" height="70%" width="70%" />
+</p>
+
+### LIT-PCBA
+LIT-PCBA is a virtual screening dataset containing protein targets with their corresponding active and inactive compounds selected from high-confidence PubChem BioAssay data.
+<p align="center">
+<img src="../../../../.github/LIT-PCBA_result.png" align="middle" height="70%" width="70%" />
+</p>
+
 # Installation guide
 ## Prerequisites
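The framework text added above leans on "many-body axial attention". As a rough illustration only (the real implementation lives in `src/optimus.py` below), axial attention over a 2-body (pair) representation attends along one axis at a time, which is what keeps full-range modeling affordable:

```python
import paddle
import paddle.nn as nn

class PairAxialAttention(nn.Layer):
    """Schematic sketch, not GEM-2 code: axial attention on a pair tensor
    x of shape (B, N, N, C). Full attention over all N*N pairs would cost
    O(N^4); attending along rows, then along columns, costs O(N^3)."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.row_attn = nn.MultiHeadAttention(channels, num_heads)
        self.col_attn = nn.MultiHeadAttention(channels, num_heads)

    def forward(self, x):                        # x: (B, N, N, C)
        b, n, _, c = x.shape
        rows = x.reshape([b * n, n, c])          # each row is one sequence
        x = self.row_attn(rows, rows, rows).reshape([b, n, n, c])
        cols = x.transpose([0, 2, 1, 3]).reshape([b * n, n, c])
        x = self.col_attn(cols, cols, cols).reshape([b, n, n, c])
        return x.transpose([0, 2, 1, 3])

x = paddle.randn([2, 16, 16, 32])
print(PairAxialAttention(32)(x).shape)           # [2, 16, 16, 32]
```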

@@ -27,50 +46,53 @@ Firstly, download or clone the latest github repository:
     git checkout dev
     cd apps/pretrained_compound/ChemRL/GEM-2
 
-# Data
+## Data
 You can download the PCQM4Mv2 dataset from the ogb website:
 
-    https://dgl-data.s3-accelerate.amazonaws.com/dataset/OGB-LSC/pcqm4m-v2.zip
+    wget https://dgl-data.s3-accelerate.amazonaws.com/dataset/OGB-LSC/pcqm4m-v2.zip
 
-# Processed Data
-You can download the processed PCQM4Mv2 dataset with rdkit generated 3d information from:
-https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/compound_datasets/pcqm4mv2_gem2.tgz
+You can also download the processed PCQM4Mv2 dataset with rdkit-generated 3d information from [here](https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/compound_datasets/pcqm4mv2_gem2.tgz).
 And then use tar to unzip the data.
 ```bash
-mkdir -p ../data
-tar xzf pcqm4mv2_gem2.tgz -C ../data
+wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/compound_datasets/pcqm4mv2_gem2.tgz
+mkdir -p ../data
+tar xzf pcqm4mv2_gem2.tgz -C ../data
+```
+
+## Pretrained checkpoints
+We release the checkpoint for reproducing the results on PCQM4Mv2, which can also serve as a pretrained model for downstream tasks.
+```bash
+wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/models/molecular_modeling/gem2_l12_c256.pdparams
+mkdir -p model
+mv gem2_l12_c256.pdparams model
 ```
 
-# How to run
-## Introduction to related configs
+## How to run
+### Introduction to related configs
 You can adjust the json files in the config folder to change the training settings.
-### dataset_config
+#### dataset_config
 - `data_dir`: where the data is located
 - `task_names`: the name of the label column in the datafile
 
-### model_config
+#### model_config
 - `model`: model-related information, like the channel size and dropout
 - `data`: data transform settings
 
 
-
-### train_config
+#### train_config
 - `lr`: learning rate
 - `warmup_step`: the number of steps to warm the learning rate up to `lr`
 - `mid_step`: the number of steps before learning rate decay
 
-## Start training
+### Train on PCQM4Mv2
 
     sh scripts/train.sh
 
-The models will be saved under `./model`.
+The models will be saved under `./model`. It will take around 60 minutes to finish one epoch on 16 A100 cards with a total batch size of 512.
 
-It will take around 60 mintues to finish one epoch on 16 A100 cards with total batch size of 512.
+### Inference on the valid set of PCQM4Mv2
+To reproduce the result from the ogb leaderboard, just run the inference command:
 
-## Run inference
-To reproduce the result from the ogb leaderboard, you can download the checkponit from:
-https://baidu-nlp.bj.bcebos.com/PaddleHelix/models/molecular_modeling/gem2_l12_c256.pdparams
-Then put it under the local `./model` folder and run the inference command:
     sh scripts/inference.sh
 
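For orientation, a hypothetical illustration of the config keys the section above lists; the key names come from this README, the values are invented:

```python
import json

# dataset_config (hypothetical values)
dataset_config = {
    "data_dir": "../data/pcqm4mv2_gem2",  # where the data is located
    "task_names": ["homolumogap"],        # label column(s) in the datafile
}

# train_config (hypothetical values)
train_config = {
    "lr": 1e-4,           # peak learning rate
    "warmup_step": 4000,  # steps to warm the learning rate up to lr
    "mid_step": 100000,   # steps before the learning rate starts to decay
}

print(json.dumps({"dataset": dataset_config, "train": train_config}, indent=2))
```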

@@ -80,10 +102,11 @@ If you use the code or data in this repo, please cite:
 
 ```bibtex
 @article{liu2022gem-2,
-  title={GEM-2: Next Generation Molecular Property Prediction Network with Many-body and Full-range Interaction Modeling
-  },
-  author={Liu, Lihang and He, Donglong and Fang, Xiaomin and Zhang, Shanzhuo and Wang, Fan and He, Jingzhou and Wu, Hua},
-  journal={arXiv preprint arXiv:2208.05863},
-  year={2022}
+  doi = {10.48550/ARXIV.2208.05863},
+  url = {https://arxiv.org/abs/2208.05863},
+  author = {Liu, Lihang and He, Donglong and Fang, Xiaomin and Zhang, Shanzhuo and Wang, Fan and He, Jingzhou and Wu, Hua},
+  title = {GEM-2: Next Generation Molecular Property Prediction Network by Modeling Full-range Many-body Interactions},
+  publisher = {arXiv},
+  year = {2022}
 }
 ```

apps/pretrained_compound/ChemRL/GEM-2/src/optimus.py (+9)

@@ -534,6 +534,15 @@ def init_weights(self, layer):
         elif isinstance(layer, nn.LayerNorm):
             layer._epsilon = 1e-12
 
+    def reduce_dropout(self):
+        """
+        Set the model's dropout rate to 0.
+        """
+        def reduce_p(layer):
+            if isinstance(layer, nn.Dropout):
+                layer.p = 0
+        self.apply(reduce_p)
+
     def _create_mask(self, batch):
         node_mask = batch["node_mask"]  # (B, N)
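The new `reduce_dropout` relies on `paddle.nn.Layer.apply`, which calls the given function on the layer and every sublayer recursively (standard Paddle behavior). A self-contained check of the pattern, separate from the GEM-2 code:

```python
import paddle.nn as nn

# Layer.apply(fn) visits the layer and all sublayers, so a single
# isinstance check zeroes out every Dropout in the model.
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.1),
                    nn.Linear(8, 8), nn.Dropout(0.2))

def reduce_p(layer):
    if isinstance(layer, nn.Dropout):
        layer.p = 0

net.apply(reduce_p)
print([l.p for l in net.sublayers() if isinstance(l, nn.Dropout)])  # [0, 0]
```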

apps/pretrained_compound/ChemRL/GEM-2/src/paddle_utils.py (+4 −4)

@@ -37,8 +37,8 @@ def dist_mean(array, distributed=False):
     n = len(array)
     x_sum = 0 if n == 0 else np.sum(array)
     if distributed:
-        n = dist_all_reduce(paddle.to_tensor(n, dtype='int64')).numpy()[0]
-        x_sum = dist_all_reduce(paddle.to_tensor(x_sum, dtype='float32')).numpy()[0]
+        n = int(dist_all_reduce(paddle.to_tensor(n, dtype='int64')))
+        x_sum = float(dist_all_reduce(paddle.to_tensor(x_sum, dtype='float32')))
     x_mean = 0 if n == 0 else x_sum / n
     return x_mean

@@ -47,14 +47,14 @@ def dist_sum(array, distributed=False):
     n = len(array)
     x_sum = 0 if n == 0 else np.sum(array)
     if distributed:
-        x_sum = dist_all_reduce(paddle.to_tensor(x_sum, dtype='float32')).numpy()[0]
+        x_sum = float(dist_all_reduce(paddle.to_tensor(x_sum, dtype='float32')))
     return x_sum
 
 
 def dist_length(array, distributed=False):
     n = len(array)
     if distributed:
-        n = dist_all_reduce(paddle.to_tensor(n, dtype='int64')).numpy()[0]
+        n = int(dist_all_reduce(paddle.to_tensor(n, dtype='int64')))
     return n

apps/pretrained_compound/ChemRL/GEM-2/train_gem2.py (+2 −20)

@@ -80,7 +80,7 @@ def get_train_steps_per_epoch(dataset_len, args):
     min_data_len = paddle.to_tensor(dataset_len)
     from paddle.distributed import ReduceOp
     dist.all_reduce(min_data_len, ReduceOp.MIN)
-    dataset_len = min_data_len.numpy()[0]
+    dataset_len = int(min_data_len)
     logging.info(f'min dataset len: {dataset_len}')
     return int(dataset_len / args.batch_size) - 5
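Context for the hunk above: reducing the per-rank dataset lengths with MIN gives every worker the same step count, so uneven shards cannot desynchronize collective operations. A hedged standalone sketch (values are hypothetical):

```python
import paddle
import paddle.distributed as dist
from paddle.distributed import ReduceOp

dist.init_parallel_env()  # required before any collective op

# Hypothetical per-rank shard length; in train_gem2.py this is the local
# dataset length passed into get_train_steps_per_epoch.
local_len = paddle.to_tensor(9973, dtype='int64')
dist.all_reduce(local_len, ReduceOp.MIN)  # every rank now holds the global MIN

# int() replaces the old `.numpy()[0]`, matching the zero-tensor fix.
steps_per_epoch = int(local_len) // 512   # 512 = hypothetical global batch size
```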

@@ -228,24 +228,6 @@ def evaluate(args, epoch_id, model, test_dataset, collate_fn):
     return mean_mae
 
 
-def adjust_dropout(model_config, encoder_config, last_ck_path):
-    """
-    adjust the dropout rate of the model to achieve better performance
-    """
-    encoder_config.init_dropout_rate = 0
-    encoder_config.optimus_block.first_body_axial_attention_dropout = 0
-    encoder_config.optimus_block.pair_dropout_rate = 0
-    encoder_config.optimus_block.first_body_axial_attention.dropout_rate = 0
-    encoder_config.optimus_block.node_ffn.dropout_rate = 0
-    encoder_config.optimus_block.second_body_first_axis.dropout_rate = 0
-    encoder_config.optimus_block.second_body_second_axis.dropout_rate = 0
-    encoder_config.optimus_block.pair_ffn.dropout_rate = 0
-    model = MolRegressionModel(model_config, encoder_config)
-    model.set_state_dict(paddle.load(last_ck_path))
-    print('Load state_dict from %s' % last_ck_path)
-    return model
-
-
 def main(args):
     """
     Call the configuration function of the model, build the model and load data, then start training.

@@ -326,7 +308,7 @@ def _read_json(path):
             ema.register()
 
         if epoch_id == 69:
-            model = adjust_dropout(model_config, encoder_config, f'./{args.model_dir}/epoch_{epoch_id - 1}.pdparams')
+            model.encoder.reduce_dropout()
 
         ## train
         s_time = time.time()

apps/protein_folding/helixfold-single/README.md (+2 −2)

@@ -16,8 +16,7 @@ To reproduce the results reported in our paper, specific environment settings are
 
 ## Installation
 Except those listed in `requirements.txt`, the PaddlePaddle `dev` package is required to run HelixFold.
-Visit [here](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html) to
-install PaddlePaddle `dev`. Also, we provide a package here if your machine environment is Nvidia A100 with
+Visit [here](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html) to install PaddlePaddle `dev`. Also, we provide a package here if your machine environment is Nvidia A100 with
 cuda=11.2.
 
 ```bash

@@ -45,6 +44,7 @@ python helixfold_single_inference.py \
 
 - `init_model`: the trained model.
 - `fasta_file`: the fasta file which contains the protein sequence to be predicted.
+- `output_dir`: the path to the output.
 
 The output is organized as:

(new file; filename hidden in this view)

@@ -0,0 +1,2 @@
+>T1026 FBNSV, , 172 residues|
+MVSNWNWSGKKGRRTPRRGYTRPFKSAVPTTRVVVHQSAVLKKDDVSGSEIKPEGDVARYKIRKVMLSCTLRMRPGELVNYLIVKCSSPIVNWSAAFTAPALMVKESCQDMITIIGKGKVESNGVAGSDCTKSFNKFIRLGAGISQTQHLYVVMYTSEAVKTVLEHRVYIEV
