Commit 00eda7f

Merge pull request #21 from Fairly/dev

Update LinearRNA tutorials

2 parents fa80789 + c4803b7, commit 00eda7f

12 files changed: +281 -524 lines changed

README.md

Lines changed: 17 additions & 15 deletions
@@ -14,6 +14,19 @@ PaddleHelix is a machine-learning-based bio-computing framework aiming at facili
 > * Drug discovery
 > * Precision medicine
 
+## Features
+* Highly Efficient: We provide LinearRNA, a highly efficient toolkit for mRNA vaccine development. LinearFold and LinearPartition achieve O(n) complexity in RNA-folding prediction, which is hundreds of times faster than traditional folding techniques.
+<p align="center">
+<img src="./.github/LinearRNA.jpg" align="middle"
+</p>
+
+* Large-scale Representation Learning and Transfer Learning: Self-supervised learning for molecule representations offers prospects of a breakthrough in tasks with limited annotation, including drug profiling, drug-target interaction, protein-protein interaction, RNA-RNA interaction, protein folding, RNA folding, and molecule design. PaddleHelix implements a variety of representation learning algorithms and state-of-the-art large-scale pre-trained models to help developers start from "the shoulders of giants" quickly.
+<p align="center">
+<img src="./.github/paddlehelix_features.jpg" align="middle"
+</p>
+
+* Easy-to-use APIs: PaddleHelix provides frequently used structures and pre-trained models. You can easily use those components to build up your models and systems.
+
 ## Installation
 
 ### OS support
@@ -74,24 +87,13 @@ conda deactivate
 ## Documentation
 
 ### Tutorials
-* We provide abundant [tutorials](./tutorials) to navigate the directory and start quickly.
+* We provide abundant [tutorials](./tutorials) to help you navigate the directory and start quickly.
 * PaddleHelix is based on [PaddlePaddle](https://github.com/paddlepaddle/paddle), a high-performance parallelized deep learning platform.
 
-### Features
-* Highly Efficient: We provide LinearRNA, a highly efficient toolkit for mRNA vaccine development. LinearFold and LinearPartition achieve O(n) complexity in RNA-folding prediction, which is hundreds of times faster than traditional folding techniques.
-<p align="center">
-<img src="./.github/LinearRNA.jpg" align="middle"
-</p>
-
-* Large-scale Representation Learning and Transfer Learning: Self-supervised learning for molecule representations offers prospects of a breakthrough in tasks with limited annotation, including drug profiling, drug-target interaction, protein-protein interaction, RNA-RNA interaction, protein folding, RNA folding, and molecule design. PaddleHelix implements a variety of representation learning algorithms and state-of-the-art large-scale pre-trained models to help developers start from "the shoulders of giants" quickly.
-<p align="center">
-<img src="./.github/paddlehelix_features.jpg" align="middle"
-</p>
-
-* Easy-to-use APIs: PaddleHelix provides frequently used structures and pre-trained models. You can easily use those components to build up your models and systems.
-
-## Examples
+### Examples
 * [Representation Learning - Compounds](./apps/pretrained_compound)
 * [Representation Learning - Proteins](./apps/pretrained_protein)
 * [Drug-Target Interaction](./apps/drug_target_interaction)
 * [LinearRNA](./c/pahelix/toolkit/linear_rna)
+
+### [The API reference](https://readthedocs.org/projects/paddlehelix/)

README_cn.md

Lines changed: 30 additions & 27 deletions
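@@ -9,16 +9,31 @@
 ![python version](https://img.shields.io/badge/python-3.6+-orange.svg)
 ![support os](https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-yellow.svg)
 
-PaddleHelix (螺旋桨) is a machine-learning-based bio-computing toolkit that aims to accelerate progress in the areas below
+PaddleHelix 螺旋桨 is a machine-learning-based bio-computing toolkit that aims to accelerate progress in the following areas:
 > * Vaccine design
 > * Drug discovery
 > * Precision medicine
 
+## Features
+
+* **High performance**: The LinearRNA family of high-performance algorithms supports mRNA vaccine design. For example, LinearFold and LinearPartition quickly and accurately locate low-energy RNA secondary structures, running hundreds or even thousands of times faster than traditional methods.
+<p align="center">
+<img src="./.github/LinearRNA.jpg" align="middle"
+</p>
+
+* Bio-computing tools powered by large-scale **representation pre-training** and **transfer learning**: Progress in self-supervised learning for molecular representations has brought new breakthroughs to many bio-computing tasks with very few samples, including molecular property prediction, drug-target interaction, protein-protein interaction, RNA-RNA interaction, protein folding, and RNA folding. PaddleHelix provides a wide range of industry-leading representation learning methods and models, so developers can build on large-scale models to get started on their tasks quickly, standing on the shoulders of giants.
+<p align="center">
+<img src="./.github/paddlehelix_features.jpg" align="middle"
+</p>
+
+* Easy-to-use APIs: PaddleHelix provides model structures and pre-trained models commonly used in bio-computing. Users can invoke these models through very simple interfaces to quickly build their own networks and systems.
+----
+
 ## Installation
 
 ### OS support
 
-Windows,Linux and OSX
+Windows, Linux and OSX
 
 ### Python version
 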
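@@ -41,63 +56,51 @@ Python 3.6, 3.7
 
 ### Installation commands
 
-Because the paddlehelix package depends on the latest paddlepaddle (2.0.0rc0 or above) and on rdkit, which cannot be installed directly with `pip`, we recommend creating a new conda environment to run the code. The commands are as follows:
+Because the PaddleHelix package depends on the latest paddlepaddle (2.0.0rc0 or above) and on rdkit, which cannot be installed directly with `pip`, we recommend creating a new conda environment to run the code. The commands are as follows
 
-* If you have never used conda before, see this page to install conda:
+* If you have never used conda before, see this page to install conda:
 
 https://docs.conda.io/projects/conda/en/latest/user-guide/install/
 
-* After installing conda, create a new conda environment:
+* After installing conda, create a new conda environment:
 
 ```bash
 conda create -n paddlehelix python=3.7
 ```
 
-* Activate the conda environment with the following command:
+* Activate the conda environment with the following command:
 
 ```bash
 conda activate paddlehelix
 ```
 
-* Before installing paddlehelix, first use conda to install the dependency rdkit:
+* Before installing PaddleHelix, first use conda to install rdkit:
 ```bash
 conda install -c conda-forge rdkit
 ```
-* Once rdkit is installed, you can install paddlehelix with the pip command:
+* Once rdkit is installed, install PaddleHelix with pip
 ```bash
 pip install paddlehelix
 ```
 
-* Once paddlehelix is installed, you can run the code
+* Wait for the PaddleHelix installation to finish!
 
-* To exit the current conda environment, use the following command:
+* To exit the current conda environment, use the following command
 
 ```bash
 conda deactivate
 ```
-
+----
 ## Documentation
 
 ### Tutorials
 * We provide a large number of [tutorial examples](./tutorials) to help developers quickly understand and use the framework
-* PaddleHelix is built on the open-source deep learning framework [飞桨](https://github.com/paddlepaddle/paddle), which performs especially well.
+* PaddleHelix is built on the open-source deep learning framework [飞桨 (PaddlePaddle)](https://github.com/paddlepaddle/paddle), which performs especially well.
 
-### Features
-
-* **High performance**: The LinearRNA family of high-performance algorithms supports mRNA vaccine design. For example, LinearFold and LinearPartition quickly and accurately locate low-energy RNA secondary structures, running hundreds or even thousands of times faster than traditional methods
-<p align="center">
-<img src="./.github/LinearRNA.jpg" align="middle"
-</p>
-
-* Bio-computing tools powered by large-scale **representation pre-training** and **transfer learning**: Progress in self-supervised learning for molecular representations has brought new breakthroughs to many bio-computing tasks with very few samples, including molecular property prediction, molecule-drug interaction, protein-protein interaction, RNA-RNA interaction, protein folding, and RNA folding. PaddleHelix provides a wide range of industry-leading representation learning methods and models, so developers can build on large-scale models to get started on their tasks quickly, standing on the shoulders of giants.
-<p align="center">
-<img src="./.github/paddlehelix_features.jpg" align="middle"
-</p>
-
-* Easy-to-use APIs: PaddleHelix provides model structures and pre-trained models commonly used in bio-computing. Users can invoke these models through very simple interfaces to quickly build their own networks and systems.
-
-## Usage examples
+### Usage examples
 * [Representation Learning - Compounds](./apps/pretrained_compound)
 * [Representation Learning - Proteins](./apps/pretrained_protein)
 * [Drug-Molecule Interaction Prediction](./apps/drug_target_interaction)
 * [LinearRNA](./c/pahelix/toolkit/linear_rna)
+
+### [The API reference](https://readthedocs.org/projects/paddlehelix/)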
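
The installation steps in the README_cn.md diff above could be chained into one script, roughly as sketched here. This is only an illustration under stated assumptions: the `run` helper and the `ENV_NAME`, `PY_VERSION`, and `RUN` variables are hypothetical names introduced for this sketch, and it swaps `conda activate` for `conda install -n` and `conda run -n`, because `conda activate` usually needs shell initialization that a non-interactive script may not have. By default it only prints the commands.

```bash
#!/usr/bin/env bash
# Sketch only: chains the README install commands into one script.
# Dry-run by default (commands are printed, not executed); set RUN=1 to execute.
set -euo pipefail

ENV_NAME="${ENV_NAME:-paddlehelix}"   # environment name used in the README
PY_VERSION="${PY_VERSION:-3.7}"       # the README lists Python 3.6 and 3.7

# run: hypothetical helper that either echoes or executes a command.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

install_paddlehelix() {
  # 1. Create a fresh conda environment.
  run conda create -y -n "$ENV_NAME" "python=$PY_VERSION"
  # 2. Per the README, rdkit is installed from conda-forge rather than with pip.
  run conda install -y -n "$ENV_NAME" -c conda-forge rdkit
  # 3. Install PaddleHelix with the pip that belongs to the new environment.
  run conda run -n "$ENV_NAME" pip install paddlehelix
}

install_paddlehelix
```

Running the script as-is prints the three commands for review; `RUN=1 bash install.sh` (the file name is arbitrary) would execute them.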

tutorials/README.md

Lines changed: 5 additions & 3 deletions
@@ -2,7 +2,9 @@ English | [简体中文](README_cn.md)
 
 # Background
 
-Machine learning (ML), especially deep learning (DL), is playing an increasingly important role in the pharmaceutical industry and bioinformatics. For instance, DL-based methods have been found to predict [drug-target interaction](https://www.researchgate.net/publication/334088358_GraphDTA_prediction_of_drug-target_binding_affinity_using_graph_convolutional_networks) and [molecule properties](https://pubmed.ncbi.nlm.nih.gov/30165565/) with reasonable precision and quite low computational cost, whereas these properties could previously be obtained only through *in vivo* / *in vitro* experiments or computationally expensive simulations (molecular dynamics simulation, etc.). As another example, *in silico* [RNA folding](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_Prediction_of_RNA_Secondary_Structures) and [protein folding](https://www.researchgate.net/publication/338619491_Improved_protein_structure_prediction_using_potentials_from_deep_learning) are becoming more achievable with the help of deep neural models. The use of ML and DL can greatly improve efficiency and thus reduce the cost of drug discovery, vaccine design, etc. In contrast to the powerful ability of DL models, a key challenge in applying them in the drug industry is the tension between their demand for large amounts of training data and the limited annotated data available. Recently, self-supervised learning has seen tremendous success in natural language processing and computer vision, showing that a large corpus of unlabeled data can be beneficial to learning universal tasks. Molecule representations are in a similar situation. We have a large amount of unlabeled data, including protein sequences (over 100 million) and compounds (over 50 million), but relatively little annotated data.
+Machine learning (ML), especially deep learning (DL), is playing an increasingly important role in the pharmaceutical industry and bioinformatics. For instance, DL-based methods have been found to predict [drug-target interaction](https://www.researchgate.net/publication/334088358_GraphDTA_prediction_of_drug-target_binding_affinity_using_graph_convolutional_networks) and [molecule properties](https://pubmed.ncbi.nlm.nih.gov/30165565/) with reasonable precision and quite low computational cost, whereas these properties could previously be obtained only through *in vivo* / *in vitro* experiments or computationally expensive simulations (molecular dynamics simulation, etc.). As another example, *in silico* [RNA folding](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_Prediction_of_RNA_Secondary_Structures) and [protein folding](https://www.researchgate.net/publication/338619491_Improved_protein_structure_prediction_using_potentials_from_deep_learning) are becoming more achievable with the help of deep neural models. The use of ML and DL can greatly improve efficiency and thus reduce the cost of drug discovery, vaccine design, etc.
+
+In contrast to the powerful ability of DL models, a key challenge in applying them in the drug industry is the tension between their demand for large amounts of training data and the limited annotated data available. Recently, self-supervised learning has seen tremendous success in natural language processing and computer vision, showing that a large corpus of unlabeled data can be beneficial to learning universal tasks. Molecule representations are in a similar situation. We have a large amount of unlabeled data, including protein sequences (over 100 million) and compounds (over 50 million), but relatively little annotated data. It is quite promising to adopt DL-based pretraining techniques in the representation learning of chemical compounds, proteins, RNA, etc.
 
 **PaddleHelix** is a high-performance ML-based bio-computing framework. It features large-scale representation learning and easy-to-use APIs, providing pharmaceutical and biological researchers and engineers with convenient access to the most up-to-date and state-of-the-art AI tools.
 
@@ -16,11 +18,11 @@ Machine learning (ML), especially deep learning (DL), is playing an increasingly
 * [Predicting Drug-Target Interaction](drug_target_interaction_tutorial.ipynb)
 * [Compound Representation Learning and Property Prediction](compound_property_prediction_tutorial.ipynb)
 * [Protein Representation Learning and Property Prediction](protein_pretrain_and_property_prediction_tutorial.ipynb)
-* [Predicting RNA Secondary Structured](linearrna_tutorial.ipynb)
+* [Predicting RNA Secondary Structure](linearrna_tutorial.ipynb)
 
 # Run tutorials locally
 
-The tutorials are written as Jupyter Notebooks and designed to run smoothly on your own machine. If you don't have Jupyter installed, please refer to [here](https://jupyter.org/install).
+The tutorials are written as Jupyter Notebooks and designed to run smoothly on your own machine. If you don't have Jupyter installed, please refer to [here](https://jupyter.org/install). Please also install PaddleHelix before proceeding ([instructions](../README.md)).
 
 After installing Jupyter, please go through the following steps:
 
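
The tutorials README asks for Jupyter and PaddleHelix to be installed before the notebooks are run. A minimal sketch of a check for those prerequisites follows. It makes two assumptions: the helper names `check_cmd` and `check_py_module` are hypothetical, and the Python module name `pahelix` is inferred from the `./c/pahelix/toolkit/linear_rna` path in the README diff, which this commit does not confirm.

```bash
#!/usr/bin/env bash
# Sketch only: reports whether the tutorial prerequisites appear to be present.
set -euo pipefail

# check_cmd NAME: prints "NAME: ok" if NAME is on PATH, otherwise "NAME: missing".
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: ok"
  else
    echo "$1: missing"
  fi
}

# check_py_module MODULE: tries to import MODULE with the chosen interpreter
# (override the interpreter with the PYTHON environment variable).
check_py_module() {
  if "${PYTHON:-python}" -c "import $1" >/dev/null 2>&1; then
    echo "$1: ok"
  else
    echo "$1: missing"
  fi
}

check_cmd jupyter
# "pahelix" is an assumed module name, inferred from the ./c/pahelix/ path.
check_py_module pahelix
```

Both helpers only report status and never abort, so the script can be run inside or outside the conda environment to compare results.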

tutorials/README_cn.md

Lines changed: 8 additions & 6 deletions
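@@ -2,7 +2,9 @@
 
 # Background
 
-Machine learning, especially deep learning, is playing an increasingly important role in the pharmaceutical industry and bioinformatics. For example, deep-learning-based methods can be used to predict [drug-target interaction](https://www.researchgate.net/publication/334088358_-GraphDTA_-drug-target_-binding_-affinity_使用_-graph-convolative_-networks) and [molecular properties](https://pubmed.ncbi.nlm.nih.gov/30165565/), reaching acceptable prediction accuracy at fairly low computational cost, whereas such data could previously be obtained only through in vivo / in vitro experiments or computationally very expensive simulation methods (such as molecular dynamics simulation). As another example, with the help of deep neural networks we can better solve the two problems of *in silico* [RNA folding](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_预测_RNA_次级结构) and [protein folding](https://www.researchgate.net/publication/338619491_改进的_蛋白质结构_预测_使用来自_deep_学习的潜力). Applying machine learning and deep learning can greatly improve problem-solving efficiency and thereby reduce the cost of industrial applications such as drug discovery and vaccine design. Deep learning models have powerful learning ability, and a key challenge in applying them to the pharmaceutical industry is resolving the tension between the models' need for large amounts of training data and the limited labeled data. In recent years, self-supervised learning has achieved great success in natural language processing and computer vision, showing that models can learn useful information from large amounts of unlabeled data and general-purpose tasks. The problem of molecular representation is in a similar situation. We have large amounts of unlabeled data, including protein sequences (over 100 million) and compound records (over 50 million), but relatively little of it is labeled.
+Machine learning, especially deep learning, is playing an increasingly important role in the pharmaceutical industry and bioinformatics. For example, deep-learning-based methods can be used to predict [drug-target interaction](https://www.researchgate.net/publication/334088358_-GraphDTA_-drug-target_-binding_-affinity_使用_-graph-convolative_-networks) and [molecular properties](https://pubmed.ncbi.nlm.nih.gov/30165565/), reaching acceptable prediction accuracy at fairly low computational cost, whereas such data could previously be obtained only through in vivo / in vitro experiments or computationally very expensive simulation methods (such as molecular dynamics simulation). As another example, with the help of deep neural networks we can better solve the two problems of *in silico* [RNA folding](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_预测_RNA_次级结构) and [protein folding](https://www.researchgate.net/publication/338619491_改进的_蛋白质结构_预测_使用来自_deep_学习的潜力). Applying machine learning and deep learning can greatly improve problem-solving efficiency and thereby reduce the cost of industrial applications such as drug discovery and vaccine design.
+
+Deep learning models have powerful learning ability, and a key challenge in applying them to the pharmaceutical industry is resolving the tension between the models' need for large amounts of training data and the limited labeled data. In recent years, self-supervised learning has achieved great success in natural language processing and computer vision, showing that models can learn useful information from large amounts of unlabeled data and general-purpose tasks. In molecular representation, the situation is very similar: we have large amounts of unlabeled data, including protein sequences (over 100 million) and compound records (over 50 million), but relatively little of it is labeled. Applying deep-learning-based pretraining techniques to the representation learning of compounds, proteins, RNA, etc. is a very promising direction.
 
 **PaddleHelix** is a high-performance machine learning framework developed specifically for bio-computing tasks. It features large-scale representation learning and easy-to-use APIs, and we aim to provide researchers and engineers in the pharmaceutical and biological fields with the latest and most advanced AI tools.
 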
@@ -13,14 +15,14 @@
 </p>
 
 # Tutorials
-* [Drug-target interaction prediction](drug_target_interaction_tutorial.ipynb)
-* [Compound representation learning and property prediction](compound_property_prediction_tutorial.ipynb)
-* [Protein representation learning and property prediction](protein_pretrain_and_property_prediction_tutorial.ipynb)
-* [RNA secondary structure prediction](linearrna_tutorial.ipynb)
+* [Drug-target interaction prediction](drug_target_interaction_tutorial_cn.ipynb)
+* [Compound representation learning and property prediction](compound_property_prediction_tutorial_cn.ipynb)
+* [Protein representation learning and property prediction](protein_pretrain_and_property_prediction_tutorial_cn.ipynb)
+* [RNA secondary structure prediction](linearrna_tutorial_cn.ipynb)
 
 # Run locally
 
-Our tutorials are written as Jypyter Notebooks and can easily be run on your local computer. If you have not installed Jupyter, see [here](https://jupyter.org/install)
+Our tutorials are written as Jupyter Notebooks and can easily be run on your local computer. If you have not installed Jupyter, see [here](https://jupyter.org/install). Please also install PaddleHelix ([instructions](../README_cn.md)).
 
 After installing Jupyter, follow the steps below to run the tutorials:
 

tutorials/compound_property_prediction_tutorial.ipynb

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 "source": [
 "# Compound representation learning and property prediction\n",
 "\n",
-"In this tutorial, we will go through how to run a Graph Neural Network (GNN) model for compound property prediction. In particular, we will demonstrate how to pretrain and finetune the model in the downstream tasks.\n",
+"In this tutorial, we will go through how to run a Graph Neural Network (GNN) model for compound property prediction. In particular, we will demonstrate how to pretrain and finetune the model in the downstream tasks. If you are interested in more details, please refer to the README for \"[info graph](https://github.com/PaddlePaddle/PaddleHelix/apps/pretrained_compound/info_graph)\" and \"[pretrained GNN](https://github.com/PaddlePaddle/PaddleHelix/apps/pretrained_compound/pretrain_gnns)\".\n",
 "\n",
 "# Part I: Pretraining\n",
 "\n",
