Update LinearRNA tutorials #21

Merged
merged 7 commits into from
Dec 16, 2020
32 changes: 17 additions & 15 deletions README.md
@@ -14,6 +14,19 @@ PaddleHelix is a machine-learning-based bio-computing framework aiming at facili
> * Drug discovery
> * Precision medicine

## Features
* Highly Efficient: We provide LinearRNA, a highly efficient toolkit for mRNA vaccine development. LinearFold and LinearPartition achieve O(n) complexity in RNA-folding prediction, which is hundreds of times faster than traditional folding techniques.
<p align="center">
<img src="./.github/LinearRNA.jpg" align="middle" />
</p>

* Large-scale Representation Learning and Transfer Learning: Self-supervised learning for molecule representations offers prospects of a breakthrough in tasks with limited annotation, including drug profiling, drug-target interaction, protein-protein interaction, RNA-RNA interaction, protein folding, RNA folding, and molecule design. PaddleHelix implements a variety of representation learning algorithms and state-of-the-art large-scale pre-trained models to help developers quickly start from "the shoulders of giants".
<p align="center">
<img src="./.github/paddlehelix_features.jpg" align="middle" />
</p>

* Easy-to-use APIs: PaddleHelix provides frequently used structures and pre-trained models. You can easily use those components to build up your models and systems.

## Installation

### OS support
@@ -74,24 +87,13 @@ conda deactivate
## Documentation

### Tutorials
* We provide abundant [tutorials](./tutorials) to navigate the directory and start quickly.
* We provide abundant [tutorials](./tutorials) to help you navigate the directory and start quickly.
* PaddleHelix is based on [PaddlePaddle](https://github.com/paddlepaddle/paddle), a high-performance Parallelized Deep Learning Platform.

### Features
* Highly Efficient: We provide LinearRNA, a highly efficient toolkit for mRNA vaccine development. LinearFold and LinearPartition achieve O(n) complexity in RNA-folding prediction, which is hundreds of times faster than traditional folding techniques.
<p align="center">
<img src="./.github/LinearRNA.jpg" align="middle" />
</p>

* Large-scale Representation Learning and Transfer Learning: Self-supervised learning for molecule representations offers prospects of a breakthrough in tasks with limited annotation, including drug profiling, drug-target interaction, protein-protein interaction, RNA-RNA interaction, protein folding, RNA folding, and molecule design. PaddleHelix implements a variety of representation learning algorithms and state-of-the-art large-scale pre-trained models to help developers quickly start from "the shoulders of giants".
<p align="center">
<img src="./.github/paddlehelix_features.jpg" align="middle" />
</p>

* Easy-to-use APIs: PaddleHelix provides frequently used structures and pre-trained models. You can easily use those components to build up your models and systems.

## Examples
### Examples
* [Representation Learning - Compounds](./apps/pretrained_compound)
* [Representation Learning - Proteins](./apps/pretrained_protein)
* [Drug-Target Interaction](./apps/drug_target_interaction)
* [LinearRNA](./c/pahelix/toolkit/linear_rna)

### [The API reference](https://readthedocs.org/projects/paddlehelix/)
57 changes: 30 additions & 27 deletions README_cn.md
@@ -9,16 +9,31 @@
![python version](https://img.shields.io/badge/python-3.6+-orange.svg)
![support os](https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-yellow.svg)

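PaddleHelix (螺旋桨) is a machine-learning-based bio-computing toolkit that aims to accelerate progress in the following areas
PaddleHelix (螺旋桨) is a machine-learning-based bio-computing toolkit that aims to accelerate progress in the following areas:
> * Vaccine design
> * Drug discovery
> * Precision medicine

## Features

* **High performance**: PaddleHelix provides the LinearRNA family of high-performance algorithms to support mRNA vaccine design. For example, LinearFold and LinearPartition can quickly and accurately locate low-energy RNA secondary structures, running hundreds or even thousands of times faster than traditional methods.
<p align="center">
<img src="./.github/LinearRNA.jpg" align="middle" />
</p>

* Bio-computing tools backed by large-scale **representation pre-training** and **transfer learning**: Advances in self-supervised learning for molecular representations have brought breakthroughs to many bio-computing tasks where labeled samples are very scarce, including molecular property prediction, drug-target interaction, protein-protein interaction, RNA-RNA interaction, protein folding, and RNA folding. PaddleHelix provides a broad set of leading representation learning methods and models, so that developers can build on large-scale models to get started on their target tasks quickly, standing on the shoulders of giants.
<p align="center">
<img src="./.github/paddlehelix_features.jpg" align="middle" />
</p>

* Easy-to-use APIs: PaddleHelix provides model structures and pre-trained models commonly used in bio-computing. Users can call these models through very simple interfaces and quickly build their own networks and systems.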
----

## Installation

### OS support

Windows, Linux and OSX
Windows, Linux and OSX

### Python version

@@ -41,63 +56,51 @@ Python 3.6, 3.7

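### Installation commands

Because the paddlehelix package depends on the latest version of paddlepaddle (2.0.0rc0 or above) and on rdkit, which cannot be installed directly with `pip`, we recommend creating a new conda environment to run the code. The commands are as follows:
Because the PaddleHelix package depends on the latest version of paddlepaddle (2.0.0rc0 or above) and on rdkit, which cannot be installed directly with `pip`, we recommend creating a new conda environment to run the code. The commands are as follows

* If you have never used conda before, you can refer to this page to install it:
* If you have never used conda before, you can refer to this page to install it:

https://docs.conda.io/projects/conda/en/latest/user-guide/install/

* After installing conda, create a new conda environment:
* After installing conda, create a new conda environment:

```bash
conda create -n paddlehelix python=3.7
```

* Activate the conda environment with the following command:
* Activate the conda environment with the following command:

```bash
conda activate paddlehelix
```

* Before installing paddlehelix, first install the dependency rdkit with conda:
* Before installing PaddleHelix, first install rdkit with conda:
```bash
conda install -c conda-forge rdkit
```
* Once rdkit is installed, you can install paddlehelix with pip:
* Once rdkit is installed, install PaddleHelix with pip
```bash
pip install paddlehelix
```

* Once paddlehelix is installed, you can run the code
* Wait for the PaddleHelix installation to finish!

* To exit the current conda environment, use the following command:
* To exit the current conda environment, use the following command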

```bash
conda deactivate
```

----
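## Documentation

### Tutorials
* We provide a large number of [tutorials](./tutorials) to help developers quickly understand and use the framework
* PaddleHelix is built on [飞桨](https://github.com/paddlepaddle/paddle), an open-source deep learning framework with particularly strong performance.
* PaddleHelix is built on [飞桨 (PaddlePaddle)](https://github.com/paddlepaddle/paddle), an open-source deep learning framework with particularly strong performance.

### Features

* **High performance**: PaddleHelix provides the LinearRNA family of high-performance algorithms to support mRNA vaccine design. For example, LinearFold and LinearPartition can quickly and accurately locate low-energy RNA secondary structures, running hundreds or even thousands of times faster than traditional methods
<p align="center">
<img src="./.github/LinearRNA.jpg" align="middle" />
</p>

* Bio-computing tools backed by large-scale **representation pre-training** and **transfer learning**: Advances in self-supervised learning for molecular representations have brought breakthroughs to many bio-computing tasks where labeled samples are very scarce, including molecular property prediction, molecule-drug interaction, protein-protein interaction, RNA-RNA interaction, protein folding, and RNA folding. PaddleHelix provides a broad set of leading representation learning methods and models, so that developers can build on large-scale models to get started on their target tasks quickly, standing on the shoulders of giants.
<p align="center">
<img src="./.github/paddlehelix_features.jpg" align="middle" />
</p>

* Easy-to-use APIs: PaddleHelix provides model structures and pre-trained models commonly used in bio-computing. Users can call these models through very simple interfaces and quickly build their own networks and systems.

## Examples
### Examples
* [Representation Learning - Compounds](./apps/pretrained_compound)
* [Representation Learning - Proteins](./apps/pretrained_protein)
* [Drug-Molecule Interaction Prediction](./apps/drug_target_interaction)
* [LinearRNA](./c/pahelix/toolkit/linear_rna)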

### [The API reference](https://readthedocs.org/projects/paddlehelix/)
8 changes: 5 additions & 3 deletions tutorials/README.md
@@ -2,7 +2,9 @@ English | [简体中文](README_cn.md)

# Background

Machine learning (ML), especially deep learning (DL), is playing an increasingly important role in the pharmaceutical industry and bioinformatics. For instance, DL-based methods have been found to predict [drug-target interaction](https://www.researchgate.net/publication/334088358_GraphDTA_prediction_of_drug-target_binding_affinity_using_graph_convolutional_networks) and [molecule properties](https://pubmed.ncbi.nlm.nih.gov/30165565/) with reasonable precision and quite low computational cost, whereas those properties could previously be obtained only through *in vivo* / *in vitro* experiments or computationally expensive simulations (molecular dynamics simulation, etc.). As another example, *in silico* [RNA folding](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_Prediction_of_RNA_Secondary_Structures) and [protein folding](https://www.researchgate.net/publication/338619491_Improved_protein_structure_prediction_using_potentials_from_deep_learning) are becoming more feasible with the help of deep neural models. The use of ML and DL can greatly improve efficiency and thus reduce the cost of drug discovery, vaccine design, etc. Despite the power of DL models, a key challenge in applying them in the drug industry is the mismatch between their demand for large amounts of training data and the limited annotated data available. Recently, self-supervised learning has achieved tremendous success in natural language processing and computer vision, showing that a large corpus of unlabeled data can be beneficial to learning universal tasks. Molecule representation learning is in a similar situation: we have a large amount of unlabeled data, including protein sequences (over 100 million) and compounds (over 50 million), but relatively little annotated data.
Machine learning (ML), especially deep learning (DL), is playing an increasingly important role in the pharmaceutical industry and bioinformatics. For instance, DL-based methods have been found to predict [drug-target interaction](https://www.researchgate.net/publication/334088358_GraphDTA_prediction_of_drug-target_binding_affinity_using_graph_convolutional_networks) and [molecule properties](https://pubmed.ncbi.nlm.nih.gov/30165565/) with reasonable precision and quite low computational cost, whereas those properties could previously be obtained only through *in vivo* / *in vitro* experiments or computationally expensive simulations (molecular dynamics simulation, etc.). As another example, *in silico* [RNA folding](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_Prediction_of_RNA_Secondary_Structures) and [protein folding](https://www.researchgate.net/publication/338619491_Improved_protein_structure_prediction_using_potentials_from_deep_learning) are becoming more feasible with the help of deep neural models. The use of ML and DL can greatly improve efficiency and thus reduce the cost of drug discovery, vaccine design, etc.

Despite the power of DL models, a key challenge in applying them in the drug industry is the mismatch between their demand for large amounts of training data and the limited annotated data available. Recently, self-supervised learning has achieved tremendous success in natural language processing and computer vision, showing that a large corpus of unlabeled data can be beneficial to learning universal tasks. Molecule representation learning is in a similar situation: we have a large amount of unlabeled data, including protein sequences (over 100 million) and compounds (over 50 million), but relatively little annotated data. It is quite promising to adopt DL-based pretraining techniques in the representation learning of chemical compounds, proteins, RNA, etc.

**PaddleHelix** is a high-performance ML-based bio-computing framework. It features large-scale representation learning and easy-to-use APIs, giving pharmaceutical and biological researchers and engineers convenient access to the most up-to-date and state-of-the-art AI tools.

@@ -16,11 +18,11 @@ Machine learning (ML), especially deep learning (DL), is playing an increasingly
* [Predicting Drug-Target Interaction](drug_target_interaction_tutorial.ipynb)
* [Compound Representation Learning and Property Prediction](compound_property_prediction_tutorial.ipynb)
* [Protein Representation Learning and Property Prediction](protein_pretrain_and_property_prediction_tutorial.ipynb)
* [Predicting RNA Secondary Structured](linearrna_tutorial.ipynb)
* [Predicting RNA Secondary Structure](linearrna_tutorial.ipynb)

# Run tutorials locally

The tutorials are written as Jupyter Notebooks and designed to run smoothly on your own machine. If you don't have Jupyter installed, please refer to [the installation guide](https://jupyter.org/install).
The tutorials are written as Jupyter Notebooks and designed to run smoothly on your own machine. If you don't have Jupyter installed, please refer to [the installation guide](https://jupyter.org/install). Please also install PaddleHelix before proceeding ([instructions](../README.md)).
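
One way to confirm the prerequisites before opening the notebooks is to check that the relevant Python modules can be located in the active environment. The sketch below is illustrative only: the import names `paddle`, `rdkit`, and `pahelix` are assumptions about how the installed packages are exposed (they are not stated in this README), and `find_missing` is a hypothetical helper, not part of PaddleHelix.

```python
import importlib.util

# Assumed import names (not confirmed by this README): the pip/conda packages
# paddlepaddle, rdkit and paddlehelix are taken to import as these modules.
ASSUMED_MODULES = ("paddle", "rdkit", "pahelix")


def find_missing(modules=ASSUMED_MODULES):
    """Return the top-level module names that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All assumed modules were found.")
```

If any name is reported as missing, revisit the installation instructions linked above before running the tutorials.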

After installing Jupyter, please go through the following steps:

14 changes: 8 additions & 6 deletions tutorials/README_cn.md
@@ -2,7 +2,9 @@

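# Background

Machine learning, especially deep learning, is playing an increasingly important role in the pharmaceutical industry and bioinformatics. For example, deep-learning-based methods can be used to predict [drug-target interactions](https://www.researchgate.net/publication/334088358_-GraphDTA_-drug-target_-binding_-affinity_使用_-graph-convolative_-networks) and [molecular properties](https://pubmed.ncbi.nlm.nih.gov/30165565/), reaching acceptable accuracy at fairly low computational cost, whereas such data could previously be obtained only through in vivo / in vitro experiments or computationally very expensive simulations (such as molecular dynamics). As another example, with the help of deep neural networks we can better tackle the *in silico* [RNA folding](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_预测_RNA_次级结构) and [protein folding](https://www.researchgate.net/publication/338619491_改进的_蛋白质结构_预测_使用来自_deep_学习的潜力) problems. Applying machine learning and deep learning can greatly improve problem-solving efficiency and thus reduce the cost of industrial applications such as drug discovery and vaccine design. Deep learning models have powerful learning capacity, and a key challenge in applying them to the pharmaceutical industry is resolving the tension between the models' need for large amounts of training data and the limited labeled data available. In recent years, self-supervised learning has achieved great success in natural language processing and computer vision, showing that models can learn useful information from large amounts of unlabeled data and general-purpose tasks. The situation is similar for molecular representation. We have large amounts of unlabeled data, including protein sequences (over 100 million) and compound records (over 50 million), but relatively little of it is labeled.
Machine learning, especially deep learning, is playing an increasingly important role in the pharmaceutical industry and bioinformatics. For example, deep-learning-based methods can be used to predict [drug-target interactions](https://www.researchgate.net/publication/334088358_-GraphDTA_-drug-target_-binding_-affinity_使用_-graph-convolative_-networks) and [molecular properties](https://pubmed.ncbi.nlm.nih.gov/30165565/), reaching acceptable accuracy at fairly low computational cost, whereas such data could previously be obtained only through in vivo / in vitro experiments or computationally very expensive simulations (such as molecular dynamics). As another example, with the help of deep neural networks we can better tackle the *in silico* [RNA folding](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_预测_RNA_次级结构) and [protein folding](https://www.researchgate.net/publication/338619491_改进的_蛋白质结构_预测_使用来自_deep_学习的潜力) problems. Applying machine learning and deep learning can greatly improve problem-solving efficiency and thus reduce the cost of industrial applications such as drug discovery and vaccine design.

Deep learning models have powerful learning capacity, and a key challenge in applying them to the pharmaceutical industry is resolving the tension between the models' need for large amounts of training data and the limited labeled data available. In recent years, self-supervised learning has achieved great success in natural language processing and computer vision, showing that models can learn useful information from large amounts of unlabeled data and general-purpose tasks. The situation is very similar for molecular representation: we have large amounts of unlabeled data, including protein sequences (over 100 million) and compound records (over 50 million), but relatively little of it is labeled. Applying deep-learning-based pre-training techniques to the representation learning of compounds, proteins, RNA, and so on is a very promising direction.

**PaddleHelix** is a high-performance machine learning framework developed specifically for bio-computing tasks. It features large-scale representation learning and easy-to-use APIs, and we hope to provide researchers and engineers in the pharmaceutical and biological fields with the latest and most advanced AI tools.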

@@ -13,14 +15,14 @@
</p>

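# Tutorials
* [Drug-Target Interaction Prediction](drug_target_interaction_tutorial.ipynb)
* [Compound Representation Learning and Property Prediction](compound_property_prediction_tutorial.ipynb)
* [Protein Representation Learning and Property Prediction](protein_pretrain_and_property_prediction_tutorial.ipynb)
* [RNA Secondary Structure Prediction](linearrna_tutorial.ipynb)
* [Drug-Target Interaction Prediction](drug_target_interaction_tutorial_cn.ipynb)
* [Compound Representation Learning and Property Prediction](compound_property_prediction_tutorial_cn.ipynb)
* [Protein Representation Learning and Property Prediction](protein_pretrain_and_property_prediction_tutorial_cn.ipynb)
* [RNA Secondary Structure Prediction](linearrna_tutorial_cn.ipynb)

# Run locally

Our tutorials are written as Jypyter Notebooks and can be run conveniently on your local machine. If you have not installed Jupyter, please see [here](https://jupyter.org/install).
Our tutorials are written as Jupyter Notebooks and can be run conveniently on your local machine. If you have not installed Jupyter, please see [here](https://jupyter.org/install). Please also install PaddleHelix ([instructions](../README_cn.md))

After installing Jupyter, follow these steps to run the tutorials: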

2 changes: 1 addition & 1 deletion tutorials/compound_property_prediction_tutorial.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Compound representation learning and property prediction\n",
"\n",
"In this tutorial, we will go through how to run a Graph Neural Network (GNN) model for compound property prediction. In particular, we will demonstrate how to pretrain and finetune the model in the downstream tasks.\n",
"In this tutorial, we will go through how to run a Graph Neural Network (GNN) model for compound property prediction. In particular, we will demonstrate how to pretrain and finetune the model in the downstream tasks. If you are interested in more details, please refer to the README for \"[info graph](https://github.com/PaddlePaddle/PaddleHelix/apps/pretrained_compound/info_graph)\" and \"[pretrained GNN](https://github.com/PaddlePaddle/PaddleHelix/apps/pretrained_compound/pretrain_gnns)\".\n",
"\n",
"# Part I: Pretraining\n",
"\n",