diff --git a/README.md b/README.md index b9116c8d..f534206b 100644 --- a/README.md +++ b/README.md @@ -14,6 +14,19 @@ PaddleHelix is a machine-learning-based bio-computing framework aiming at facili > * Drug discovery > * Precision medicine +## Features +* Highly Efficient: We provide LinearRNA, a highly efficient toolkit for mRNA vaccine development. LinearFold & LinearPartition achieve O(n) complexity in RNA-folding prediction, which is hundreds of times faster than traditional folding techniques. +

+ + +* Large-scale Representation Learning and Transfer Learning: Self-supervised learning for molecule representations offers the prospect of a breakthrough in tasks with limited annotation, including drug profiling, drug-target interaction, protein-protein interaction, RNA-RNA interaction, protein folding, RNA folding, and molecule design. PaddleHelix implements a variety of representation learning algorithms and state-of-the-art large-scale pre-trained models to help developers quickly start from "the shoulders of giants". +

+ + +* Easy-to-use APIs: PaddleHelix provides frequently used structures and pre-trained models. You can easily use those components to build up your models and systems. + ## Installation ### OS support @@ -74,24 +87,13 @@ conda deactivate ## Documentation ### Tutorials -* We provide abundant [tutorials](./tutorials) to navigate the directory and start quickly. +* We provide abundant [tutorials](./tutorials) to help you navigate the directory and start quickly. * PaddleHelix is based on [PaddlePaddle](https://github.com/paddlepaddle/paddle), a high-performance Parallelized Deep Learning Platform. -### Features -* Highly Efficent: We provide LinearRNA - highly efficient toolkit for mRNA vaccine development. LinearFold & LinearParitition achieves O(n) complexity in RNA-folding prediction, which is hundreds of times faster than traditional folding techniques. -

- - -* Large-scale Representation Learning and Transfer Learning: Self-supervised learning for molecule representations offers prospects of a breakthrough in tasks with limited annotation, including drug profiling, drug-target interaction, protein-protein interaction, RNA-RNA interaction, protein folding, RNA folding, and molecule design. PaddleHelix implements a variety of representation learning algorithms and state-of-the-art large-scale pre-trained models to help developers to start from "the shoulders of giants" quickly. -

- - -* Easy-to-use APIs: PaddleHelix provide frequently used structures and pre-trained models. You can easily use those components to build up your models and systems. - -## Examples +### Examples * [Representation Learning - Compounds](./apps/pretrained_compound) * [Representation Learning - Proteins](./apps/pretrained_protein) * [Drug-Target Interaction](./apps/drug_target_interaction) * [LinearRNA](./c/pahelix/toolkit/linear_rna) + +### [The API reference](https://readthedocs.org/projects/paddlehelix/) diff --git a/README_cn.md b/README_cn.md index 75703183..52267d6c 100644 --- a/README_cn.md +++ b/README_cn.md @@ -9,16 +9,31 @@ ![python version](https://img.shields.io/badge/python-3.6+-orange.svg) ![support os](https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-yellow.svg) -PaddleHelix(螺旋桨)是一个基于机器学习的生物计算工具集,致力于加速下面领域的进展 +PaddleHelix(螺旋桨)是一个基于机器学习的生物计算工具集,致力于加速如下领域的进展: > * 疫苗设计 > * 新药发现 > * 精准医疗 +## 特色 + +* **高性能**:提供了 LinearRNA 系列高性能算法助力 mRNA 疫苗设计。例如,LinearFold 和 LinearPartition 能够迅速、准确地定位能量较低的 RNA 二级结构,性能相比传统方法提升数百甚至上千倍。 +

+ + +* 由大规模 **表示预训练** 和 **迁移学习** 支撑的生物计算工具:随着自监督学习用于分子表示训练的进展,为样本量非常稀少的很多生物计算任务带来了全新的突破,这些任务包括分子性质预测,药物-靶点相互作用,蛋白质-蛋白质相互作用,RNA-RNA 相互作用,蛋白质折叠,RNA 折叠等等领域。螺旋桨广泛提供了业界最领先的表示学习方法和模型,使得开发者可以基于大规模模型快速切入需求的任务,站在巨人的肩膀上。 +

+ + +* 简单易用的 API 接口:螺旋桨提供了生物计算中常用的模型结构和预训练模型,用户可以用非常简单的接口调起这些模型,快速组建自己的网络和系统。 +---- + ## 安装 ### 操作系统支持 -Windows,Linux 以及OSX +Windows,Linux 以及 OSX ### Python 版本 @@ -41,63 +56,51 @@ Python 3.6, 3.7 ### 安装命令 -因为paddlehelix安装包的依赖有最新版的paddlepaddle(2.0.0rc0或以上),以及无法直接使用`pip`命令直接安装的rdkit, 因此我们建议创建一个新的conda环境来运行代码,具体命令如下: +因为 PaddleHelix 安装包的依赖有最新版的 paddlepaddle(2.0.0rc0 或以上),以及无法直接使用 `pip` 命令直接安装的 rdkit,因此我们建议创建一个新的 conda 环境来运行代码,具体命令如下: -* 如果你之前从来没有使用过conda,可以参考这个网页来安装conda: +* 如果你之前从来没有使用过 conda,可以参考这个网页来安装 conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/ -* 在安装完conda之后, 可以开始创建一个新的conda环境: +* 在安装完 conda 之后, 可以开始创建一个新的 conda 环境: ```bash conda create -n paddlehelix python=3.7 ``` -* 使用如下命令激活conda环境: +* 使用如下命令激活 conda 环境: ```bash conda activate paddlehelix ``` -* 在安装paddlehelix之前,首先需要使用conda安装依赖rdkit: +* 在安装 PaddleHelix 之前,首先需要使用 conda 安装 rdkit: ```bash conda install -c conda-forge rdkit ``` -* rdkit安装完成之后可以使用pip命令来安装paddlehelix了: +* rdkit 安装完成之后,使用 pip 命令安装 PaddleHelix ```bash pip install paddlehelix ``` -* paddlehelix安装完成之后就可以运行代码了 +* 等待 PaddleHelix 安装完成! -* 如果想要退出当前conda环境,可以使用下列命令: +* 如果想要退出当前 conda 环境,可以使用下列命令: ```bash conda deactivate ``` - +---- ## 文档 ### 教学 * 我们提供了大量的[教学实例](./tutorials)以方便开发者快速了解和使用该框架 -* PaddleHelix基于[飞桨](https://github.com/paddlepaddle/paddle)开源深度学习框架实现,该框架在性能表现上尤其出色。 +* PaddleHelix 基于[飞桨(PaddlePaddle)](https://github.com/paddlepaddle/paddle)开源深度学习框架实现,该框架在性能表现上尤其出色。 -### 特点 - -* **高性能**:提供了LinearRNA系列高性能算法助力mRNA疫苗设计。例如,LinearFold和LinearParition能够迅速准确定位能量较低RNA二级结构,性能相比传统方法提升数百甚至上千倍 -

- - -* 由大规模**表示预训练**和**迁移学习**支撑的生物计算工具: 随着自监督学习用于分子表示训练的进展,为样本量非常稀少的很多生物计算任务带来了全新的突破,这些任务包括分子性质预测,分子-药物作用,蛋白质-蛋白质作用,RNA-RNA作用,蛋白质折叠,RNA折叠等等领域。螺旋桨广泛提供了业界最领先的表示学习方法和模型,使得开发者可以基于大规模模型快速切入需求的任务,站在巨人的肩膀上。 -

- - -* 简单易用API接口: 螺旋桨提供了生物计算中常用的模型结构和预训练模型,用户可以用非常简单的接口调起这些模型,快速组建自己的网络和系统。 - -## 使用示例 +### 使用示例 * [表示学习 - 化合物](./apps/pretrained_compound) * [表示学习 - 蛋白质](./apps/pretrained_protein) * [药物-分子作用预测](./apps/drug_target_interaction) * [LinearRNA](./c/pahelix/toolkit/linear_rna) + +### [The API reference](https://readthedocs.org/projects/paddlehelix/) diff --git a/tutorials/README.md b/tutorials/README.md index 7a158536..dad2a66a 100644 --- a/tutorials/README.md +++ b/tutorials/README.md @@ -2,7 +2,9 @@ English | [简体中文](README_cn.md) # Backgrouds -Machine learning (ML), especially deep learning (DL), is playing an increasingly important role in the pharmaceutical industry and bio-informatics. For instance, the DL-based methodology is found to predict the [drug-target interaction](https://www.researchgate.net/publication/334088358_GraphDTA_prediction_of_drug-target_binding_affinity_using_graph_convolutional_networks) and [molecule properties](https://pubmed.ncbi.nlm.nih.gov/30165565/) with reasonable precision and quite low computational cost, while those properties can only be accessed through *in vivo* / *in vitro* experiments or computationally expensive simulations (molecular dynamics simulation etc.) before. As another example, *in silico* [RNA folding](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_Prediction_of_RNA_Secondary_Structures) and [protein folding](https://www.researchgate.net/publication/338619491_Improved_protein_structure_prediction_using_potentials_from_deep_learning) are becoming more likely to be accomplished with the help of deep neural models. The usage of ML and DL can greatly improve efficiency, and thus reduce the cost of drug discovery, vaccine design, etc. In contrast to the powerful ability of DL metrics, a key challenge lying in utilizing them in the drug industry is the contradiction between the demand for huge data for training and the limited annotated data. Recently, there is tremendous success in adopting self-supervised learning in natural language processing and computer vision, showing that a large corpus of unlabeled data can be beneficial to learning universal tasks. In molecule representations, there is a similar situation. We have large amount of unlabeled data, including protein sequences (over 100 million) and compounds (over 50 million) but relatively small annotated data. +Machine learning (ML), especially deep learning (DL), is playing an increasingly important role in the pharmaceutical industry and bio-informatics. For instance, the DL-based methodology is found to predict the [drug-target interaction](https://www.researchgate.net/publication/334088358_GraphDTA_prediction_of_drug-target_binding_affinity_using_graph_convolutional_networks) and [molecule properties](https://pubmed.ncbi.nlm.nih.gov/30165565/) with reasonable precision and quite low computational cost, while those properties can only be accessed through *in vivo* / *in vitro* experiments or computationally expensive simulations (molecular dynamics simulation etc.) before. As another example, *in silico* [RNA folding](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_Prediction_of_RNA_Secondary_Structures) and [protein folding](https://www.researchgate.net/publication/338619491_Improved_protein_structure_prediction_using_potentials_from_deep_learning) are becoming more likely to be accomplished with the help of deep neural models. 
The use of ML and DL can greatly improve efficiency, and thus reduce the cost of drug discovery, vaccine design, etc. + +Despite the powerful learning ability of DL methods, a key challenge in applying them in the drug industry is the contradiction between the demand for huge amounts of training data and the limited annotated data available. Recently, there has been tremendous success in adopting self-supervised learning in natural language processing and computer vision, showing that a large corpus of unlabeled data can be beneficial for learning universal tasks. The situation for molecule representations is similar: we have a large amount of unlabeled data, including protein sequences (over 100 million) and compounds (over 50 million), but relatively little annotated data. It is therefore quite promising to adopt DL-based pre-training techniques for the representation learning of chemical compounds, proteins, RNA, etc. **PaddleHelix** is a high-performance ML-based bio-computing framework. It features large scale representation learning and easy-to-use APIs, providing pharmaceutical and biological researchers and engineers convenient access to the most up-to-date and state-of-the-art AI tools. @@ -16,11 +18,11 @@ Machine learning (ML), especially deep learning (DL), is playing an increasingly * [Predicting Drug-Target Interaction](drug_target_interaction_tutorial.ipynb) * [Compound Representation Learning and Property Prediction](compound_property_prediction_tutorial.ipynb) * [Protein Representation Learning and Property Prediction](protein_pretrain_and_property_prediction_tutorial.ipynb) -* [Predicting RNA Secondary Structured](linearrna_tutorial.ipynb) +* [Predicting RNA Secondary Structure](linearrna_tutorial.ipynb) # Run tutorials locally -The tutorials are written as Jupyter Notebooks and designed to be smoothly run on you own machine. If you don't have Jupyter installed, please refer to [here](https://jupyter.org/install). +The tutorials are written as Jupyter Notebooks and designed to run smoothly on your own machine. If you don't have Jupyter installed, please refer to [here](https://jupyter.org/install). Please also install PaddleHelix before proceeding ([instructions](../README.md)).
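Before launching Jupyter, a quick way to confirm that the environment is complete is a short import check. The sketch below is illustrative and only assumes the packages named in the installation instructions (paddlepaddle, paddlehelix, rdkit):

```python
# Minimal environment check before running the tutorials.
# Assumes PaddleHelix was installed with `pip install paddlehelix`
# and rdkit with `conda install -c conda-forge rdkit`.
import paddle           # PaddlePaddle (2.0.0rc0 or above is required)
import pahelix          # the package installed by `pip install paddlehelix`
from rdkit import Chem  # rdkit comes from conda, not pip

print("paddle version:", paddle.__version__)
print("rdkit OK:", Chem.MolFromSmiles("CCO") is not None)
```

If any of these imports fails, revisit the conda steps in the main README before running the notebooks.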
After the installation of Jypyter, please go through the following steps: diff --git a/tutorials/README_cn.md b/tutorials/README_cn.md index 95a9a558..1394bf1b 100644 --- a/tutorials/README_cn.md +++ b/tutorials/README_cn.md @@ -2,7 +2,9 @@ # 背景 -机器学习,特别是深度学习正在制药工业和生物信息学中发挥着越来越重要的作用。例如,基于深度学习的方法可以用于预测[药物-靶点相互作用](https://www.researchgate.net/publication/334088358_-GraphDTA_-drug-target_-binding_-affinity_使用_-graph-convolative_-networks)和[分子性质](https://pubmed.ncbi.nlm.nih.gov/30165565/),从而以相当低的计算成本达到可接受的预测精度,而这些数据以前只能通过体内/体外实验或计算复杂度极高的仿真方法(如分子动力学模拟等)来获得。另一个例子是借助深层神经网络,我们可以更好地解决 *in silico* [RNA折叠](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_预测_RNA_次级结构)和[蛋白质折叠](https://www.researchgate.net/publication/338619491_改进的_蛋白质结构_预测_使用来自_deep_学习的潜力)这两个问题。机器学习和深度学习的应用可以大大提高问题求解的效率,从而降低药物发现、疫苗设计等工业应用的成本。深度学习模型具有强大的学习能力,而将其应用于制药行业的一个关键挑战在于如何解决模型对大量训练数据的需求与有限的标注数据之间的矛盾。近年来,自监督学习在自然语言处理和计算机视觉领域取得了巨大的成功,表明模型可以从大量的未标注数据和通用任务中学习到有益的信息。在分子表示的问题上,也有类似的情况。我们有大量未标记的数据,包括蛋白质序列(超过1亿条)和化合物信息(超过5000万条)等,但其中有标注的数据较少。 +机器学习,特别是深度学习正在制药工业和生物信息学中发挥着越来越重要的作用。例如,基于深度学习的方法可以用于预测[药物-靶点相互作用](https://www.researchgate.net/publication/334088358_-GraphDTA_-drug-target_-binding_-affinity_使用_-graph-convolative_-networks)和[分子性质](https://pubmed.ncbi.nlm.nih.gov/30165565/),从而以相当低的计算成本达到可接受的预测精度,而这些数据以前只能通过体内/体外实验或计算复杂度极高的仿真方法(如分子动力学模拟等)来获得。另一个例子是借助深层神经网络,我们可以更好地解决 *in silico* [RNA折叠](https://www.researchgate.net/publication/344954534_LinearFold_Linear-Time_预测_RNA_次级结构)和[蛋白质折叠](https://www.researchgate.net/publication/338619491_改进的_蛋白质结构_预测_使用来自_deep_学习的潜力)这两个问题。机器学习和深度学习的应用可以大大提高问题求解的效率,从而降低药物发现、疫苗设计等工业应用的成本。 + +深度学习模型具有强大的学习能力,而将其应用于制药行业的一个关键挑战在于如何解决模型对大量训练数据的需求与有限的标注数据之间的矛盾。近年来,自监督学习在自然语言处理和计算机视觉领域取得了巨大的成功,表明模型可以从大量的未标注数据和通用任务中学习到有益的信息。在分子表示(molecular representation)的问题上,情况十分相似,我们有大量未标记的数据,包括蛋白质序列(超过1亿条)和化合物信息(超过5000万条)等,但其中有标注的数据较少。将基于深度学习的预训练技术应用于化合物、蛋白质、RNA等的表示学习中,是一个非常有前景的方向。 **PaddleHelix** 是一个高性能并且专为生物计算任务开发的机器学习框架。它的特色在于大规模的表示学习(representation learning)和易用的API,我们期望为制药和生物领域的研究人员和工程师提供最新和最先进的AI工具。 @@ -13,14 +15,14 @@

# 教程 -* [药物-靶点相互作用预测](drug_target_interaction_tutorial.ipynb) -* [化合物表示学习和性质预测](compound_property_prediction_tutorial.ipynb) -* [蛋白质表示学习和性质预测](protein_pretrain_and_property_prediction_tutorial.ipynb) -* [RNA二级结构预测](linearrna_tutorial.ipynb) +* [药物-靶点相互作用预测](drug_target_interaction_tutorial_cn.ipynb) +* [化合物表示学习和性质预测](compound_property_prediction_tutorial_cn.ipynb) +* [蛋白质表示学习和性质预测](protein_pretrain_and_property_prediction_tutorial_cn.ipynb) +* [RNA二级结构预测](linearrna_tutorial_cn.ipynb) # 在本地运行 -我们的教程以 Jypyter Notebook 的形式编写,可以方便的在你的本地计算机上运行。如果你没有安装过 Jupyter,请看[这里](https://jupyter.org/install)。 +我们的教程以 Jupyter Notebook 的形式编写,可以方便的在你的本地计算机上运行。如果你没有安装过 Jupyter,请看[这里](https://jupyter.org/install)。另外也请安装好 PaddleHelix([教程](../README_cn.md))。 安装好 Jupyter 之后,请按照以下的步骤来运行: diff --git a/tutorials/compound_property_prediction_tutorial.ipynb b/tutorials/compound_property_prediction_tutorial.ipynb index 56e996aa..d4b623cf 100644 --- a/tutorials/compound_property_prediction_tutorial.ipynb +++ b/tutorials/compound_property_prediction_tutorial.ipynb @@ -6,7 +6,7 @@ "source": [ "# Compound representation learning and property prediction\n", "\n", - "In this tuorial, we will go through how to run a Graph Neural Network (GNN) model for compound property prediction. In particular, we will demonstrate how to pretrain and finetune the model in the downstream tasks.\n", + "In this tuorial, we will go through how to run a Graph Neural Network (GNN) model for compound property prediction. In particular, we will demonstrate how to pretrain and finetune the model in the downstream tasks. If you are intersted in more details, please refer to the README for \"[info graph](https://github.com/PaddlePaddle/PaddleHelix/apps/pretrained_compound/info_graph)\" and \"[pretrained GNN](https://github.com/PaddlePaddle/PaddleHelix/apps/pretrained_compound/pretrain_gnns)\".\n", "\n", "# Part I: Pretraining\n", "\n", diff --git a/tutorials/compound_property_prediction_tutorial_cn.ipynb b/tutorials/compound_property_prediction_tutorial_cn.ipynb index 0f582199..3f2f800b 100644 --- a/tutorials/compound_property_prediction_tutorial_cn.ipynb +++ b/tutorials/compound_property_prediction_tutorial_cn.ipynb @@ -6,7 +6,7 @@ "source": [ "# 化合物表示学习和性质预测\n", "\n", - "在这篇教程中,我们将介绍如何运用图神经网络(GNN)模型来预测化合物的性质。具体来说,我们将演示如何对其进行预训练(pretrain),如何针对下游任务进行模型微调(finetune),并利用最终的模型进行推断(inference)。\n", + "在这篇教程中,我们将介绍如何运用图神经网络(GNN)模型来预测化合物的性质。具体来说,我们将演示如何对其进行预训练(pretrain),如何针对下游任务进行模型微调(finetune),并利用最终的模型进行推断(inference)。如果你想了解更多细节,请查阅 \"[info graph](https://github.com/PaddlePaddle/PaddleHelix/apps/pretrained_compound/info_graph/README_cn.md)\" 和 \"[pretrained GNN](https://github.com/PaddlePaddle/PaddleHelix/apps/pretrained_compound/pretrain_gnns/README_cn.md)\" 的详细解释.\n", "\n", "# 第一部分:预训练\n", "\n", diff --git a/tutorials/drug_target_interaction_tutorial.ipynb b/tutorials/drug_target_interaction_tutorial.ipynb index 8770ed21..081b287f 100644 --- a/tutorials/drug_target_interaction_tutorial.ipynb +++ b/tutorials/drug_target_interaction_tutorial.ipynb @@ -49,14 +49,16 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 9, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ - "['data_gen.py',\n", + "['davis',\n", + " 'PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz',\n", + " 'data_gen.py',\n", " 'data_preprocess.py',\n", " '__pycache__',\n", " 'model.py',\n", @@ -66,7 +68,7 @@ ] }, "metadata": {}, - "execution_count": 1 + "execution_count": 9 } ], "source": [ @@ -84,23 +86,61 @@ "## Prepare dataset" ] }, 
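If `wget` is unavailable (for example on Windows), the same archive can be fetched from within Python. This is a minimal sketch using only the standard library, with the URL taken from the wget cell that follows:

```python
# Download and extract the processed Davis dataset without wget.
# The URL is the same one used in the wget cell below.
import os
import tarfile
import urllib.request

url = "https://baidu-nlp.bj.bcebos.com/PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz"
archive = "davis.tgz"

urllib.request.urlretrieve(url, archive)    # fetch the ~22 MB archive
with tarfile.open(archive, "r:gz") as tar:  # the archive unpacks into ./davis
    tar.extractall(".")

print(os.listdir("./davis/processed"))      # expect: ['train', 'test']
```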
+ { + "source": [ + "Download the Davis dataset using `wget`." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "--2020-12-16 16:24:35-- https://baidu-nlp.bj.bcebos.com/PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz\n", + "正在解析主机 baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)... 10.70.0.165\n", + "正在连接 baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)|10.70.0.165|:443... 已连接。\n", + "已发出 HTTP 请求,正在等待回应... 200 OK\n", + "长度:23301615 (22M) [application/gzip]\n", + "正在保存至: “PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz.1”\n", + "\n", + "PaddleHelix%2Fdatas 100%[===================>] 22.22M 4.65MB/s 用时 5.7s \n", + "\n", + "2020-12-16 16:24:41 (3.87 MB/s) - 已保存 “PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz.1” [23301615/23301615])\n", + "\n", + "\u001b[34mtest\u001b[m\u001b[m \u001b[34mtrain\u001b[m\u001b[m\n" + ] + } + ], + "source": [ + "# download and decompress the data\n", + "!wget \"https://baidu-nlp.bj.bcebos.com/PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz\"\n", + "!tar -zxf \"PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz\"\n", + "!ls \"./davis/processed\"" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Suppose you have download the processed Davis dataset, please refer the script `data_gen.py` for the implementation of `DTADataset` class, which is a stream dataset wrapper for [PGL](https://github.com/PaddlePaddle/PGL)." + "Suppose you have download the processed Davis dataset , please refer to the script `data_gen.py` for the implementation of `DTADataset` class, which is a stream dataset wrapper for [PGL](https://github.com/PaddlePaddle/PGL)." ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 11, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ - "[INFO] 2020-12-16 13:22:00,650 [mp_reader.py: 23]:\tujson not install, fail back to use json instead\n" + "[INFO] 2020-12-16 16:24:46,122 [mp_reader.py: 23]:\tujson not install, fail back to use json instead\n" ] } ], @@ -119,18 +159,18 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ - "train_data = '/mnt/xueyang/Datasets/PaddleHelix/davis/processed/train'\n", - "test_data = '/mnt/xueyang/Datasets/PaddleHelix/davis/processed/test'\n", + "train_data = './davis/processed/train'\n", + "test_data = './davis/processed/test'\n", "max_protein_len = 1000 # set -1 to use full sequence" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ @@ -140,12 +180,12 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 14, "metadata": {}, "outputs": [ { - "name": "stdout", "output_type": "stream", + "name": "stdout", "text": [ "25046 5010\n" ] diff --git a/tutorials/drug_target_interaction_tutorial_cn.ipynb b/tutorials/drug_target_interaction_tutorial_cn.ipynb index 1f050a2e..230f54e3 100644 --- a/tutorials/drug_target_interaction_tutorial_cn.ipynb +++ b/tutorials/drug_target_interaction_tutorial_cn.ipynb @@ -56,12 +56,15 @@ "output_type": "execute_result", "data": { "text/plain": [ - "['data_gen.py',\n", + "['davis',\n", + " 'PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz',\n", + " 'data_gen.py',\n", " 'data_preprocess.py',\n", " '__pycache__',\n", " 'model.py',\n", " 'utils.py',\n", " 'train.py',\n", + " 
'PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz.1',\n", " 'demos']" ] }, @@ -84,6 +87,44 @@ "## 数据集的准备" ] }, + { + "source": [ + "使用 `wget` 命令下载 Davis 数据集。" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "--2020-12-16 16:27:22-- https://baidu-nlp.bj.bcebos.com/PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz\n", + "正在解析主机 baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)... 10.70.0.165\n", + "正在连接 baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)|10.70.0.165|:443... 已连接。\n", + "已发出 HTTP 请求,正在等待回应... 200 OK\n", + "长度:23301615 (22M) [application/gzip]\n", + "正在保存至: “PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz.2”\n", + "\n", + "PaddleHelix%2Fdatas 100%[===================>] 22.22M 7.91MB/s 用时 2.8s \n", + "\n", + "2020-12-16 16:27:25 (7.91 MB/s) - 已保存 “PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz.2” [23301615/23301615])\n", + "\n", + "\u001b[34mtest\u001b[m\u001b[m \u001b[34mtrain\u001b[m\u001b[m\n" + ] + } + ], + "source": [ + "# download and decompress the data\n", + "!wget \"https://baidu-nlp.bj.bcebos.com/PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz\"\n", + "!tar -zxf \"PaddleHelix%2Fdatasets%2Fdti_datasets%2Fdavis.tgz\"\n", + "!ls \"./davis/processed\"" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -123,8 +164,8 @@ "metadata": {}, "outputs": [], "source": [ - "train_data = '/mnt/xueyang/Datasets/PaddleHelix/davis/processed/train'\n", - "test_data = '/mnt/xueyang/Datasets/PaddleHelix/davis/processed/test'\n", + "train_data = './davis/processed/train'\n", + "test_data = './davis/processed/test'\n", "max_protein_len = 1000 # 设置为-1时使用全长蛋白质序列" ] }, diff --git a/tutorials/linearrna_tutorial.ipynb b/tutorials/linearrna_tutorial.ipynb index 3abd78cf..9bec4c2c 100644 --- a/tutorials/linearrna_tutorial.ipynb +++ b/tutorials/linearrna_tutorial.ipynb @@ -4,44 +4,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# LinearRNA Tutorial" + "# Predicting RNA secondary structure with LinearRNA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "LinearRNA includes a series of linear-time prediction algorithms/softwares for RNA secondary structures analysis: **LinearFold** and **LinearPartition**. " - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['linear_rna.h', 'linear_rna.cpp', 'utils', 'linear_fold', 'linear_partition']" - ] - }, - "execution_count": 1, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import os\n", - "#os.chdir('../apps/drug_target_interaction/graph_dta/')\n", - "os.chdir('../c/pahelix/toolkit/linear_rna/linear_rna')\n", - "os.listdir(os.getcwd())" + "LinearRNA includes a series of linear-time prediction algorithms/softwares for RNA secondary structure analysis: **LinearFold** and **LinearPartition**. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## LinearFold" + "# Part I: LinearFold" ] }, { @@ -49,21 +26,21 @@ "metadata": {}, "source": [ "**LinearFold** is the first linear-time prediction algorithm/software for RNA secondary structures. \n", - "The LinearFold paper has been accepted by ISMB, a top-level conference on computational biology and published on Bioinformatics, an authoritative journal. 
The link of the paper is: [LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic programming and beam search](https://academic.oup.com/bioinformatics/article/35/14/i295/5529205)." + "The LinearFold paper has been accepted by ISMB, a top-level conference on computational biology and published on Bioinformatics, an authoritative journal. The link of the paper is: [LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic programming and beam search](https://academic.oup.com/bioinformatics/article/35/14/i295/5529205)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### structure prediction" + "## RNA structure prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "#### machine learning model" + "### Machine learning model" ] }, { @@ -72,40 +49,38 @@ "metadata": {}, "outputs": [ { + "output_type": "execute_result", "data": { "text/plain": [ "('..((((.(((....)))...))))....((((((............................))))))....',\n", " 0.4548597317188978)" ] }, - "execution_count": 1, "metadata": {}, - "output_type": "execute_result" + "execution_count": 1 } ], "source": [ - "import sys\n", - "sys.path.append('../build/c/pahelix/toolkit/linear_rna')\n", - "import linear_rna\n", + "import pahelix.toolkit.linear_rna as linear_rna\n", "input_sequence = \"AACUCCGCCAGGCCUGGAAGGGAGCAACGGUAGUGACACUCUCUGUGUGCGUAGGUUGCCUAGCUACCAUUU\"\n", "linear_rna.linear_fold_c(input_sequence)" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, "metadata": {}, "outputs": [ { + "output_type": "execute_result", "data": { "text/plain": [ "('..(.(((......)((........))(((......(.......).))).....))..)..............',\n", " -27.328358240425587)" ] }, - "execution_count": 2, "metadata": {}, - "output_type": "execute_result" + "execution_count": 3 } ], "source": [ @@ -118,10 +93,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### parameter setting:\n", - "- rna_sequence: the input RNA sequence to predict the secondary structure;\n", - "- beam_size: int (default 100), set 0 to turn off the beam pruning;\n", - "- use_constraints: bool (default False), enable adding constraints when predicting structures;\n", + "### Parameter setting\n", + "- rna_sequence: the input RNA sequence to predict the secondary structure.\n", + "- beam_size: int (default 100), set 0 to turn off the beam pruning.\n", + "- use_constraints: bool (default False), enable adding constraints when predicting structures.\n", "- constraint: string (default \"\"), the constraint sequence. It works when the parameter use_constraints is Ture. The constraint sequence should have the same length as the RNA sequence. \"? . ( )\" indicates a position for which the proper matching is unknown, unpaired, left or right parenthesis respectively. The parentheses must be well-banlanced and non-crossing.\n", "- no_sharp_turn: bool (default True), disable sharpturn in prediction." ] @@ -130,25 +105,25 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### thermodynamic model\n", - "the parameters are the same as the machine learning-based model." + "### Thermodynamic model\n", + "The parameters are the same as the machine learning-based model." 
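To make the constraint mechanism concrete, here is a small sketch that folds the tutorial's example sequence while forcing its first two bases to stay unpaired. The constraint string is an illustrative assumption (any "? . ( )" string of the same length works), and the keyword arguments are the ones listed in the parameter setting above:

```python
# Constrained folding with LinearFold: a minimal, illustrative sketch.
# ".." forces the first two bases to be unpaired; "?" leaves a position free.
import pahelix.toolkit.linear_rna as linear_rna

input_sequence = "AACUCCGCCAGGCCUGGAAGGGAGCAACGGUAGUGACACUCUCUGUGUGCGUAGGUUGCCUAGCUACCAUUU"
constraint = ".." + "?" * (len(input_sequence) - 2)  # must match the sequence length

# machine learning model
print(linear_rna.linear_fold_c(input_sequence, use_constraints=True, constraint=constraint))

# thermodynamic model (same parameters)
print(linear_rna.linear_fold_v(input_sequence, beam_size=100, use_constraints=True, constraint=constraint))
```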
] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "metadata": {}, "outputs": [ { + "output_type": "execute_result", "data": { "text/plain": [ "('..((((.(((....)))...))))....((((((.((((.....))))...((((...))))))))))....',\n", " -18.4)" ] }, - "execution_count": 3, "metadata": {}, - "output_type": "execute_result" + "execution_count": 4 } ], "source": [ @@ -157,19 +132,19 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "metadata": {}, "outputs": [ { + "output_type": "execute_result", "data": { "text/plain": [ "('..(.(((......)((........))(((......(.......).))).....))..)..............',\n", " 13.4)" ] }, - "execution_count": 4, "metadata": {}, - "output_type": "execute_result" + "execution_count": 5 } ], "source": [ @@ -181,347 +156,40 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## LinearPartition" + "# Part II: LinearPartition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**LienarPartition** is the first linear-time partition function and base pair probabilities calculation algorithm/software for RNA secondary structures. The LinearPartition paper has been accepted by ISMB, a top-level conference on computational biology and published on Bioinformatics, an authoritative journal. The link of the paper is: [LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities](https://academic.oup.com/bioinformatics/article/36/Supplement_1/i258/5870487)." + "**LienarPartition** is the first linear-time partition function and base pair probabilities calculation algorithm/software for RNA secondary structures. The LinearPartition paper has been accepted by ISMB, a top-level conference on computational biology and published on Bioinformatics, an authoritative journal. 
The link of the paper is: [LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities](https://academic.oup.com/bioinformatics/article/36/Supplement_1/i258/5870487)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### partition function and base pair proababilities calculation" + "## Partition function and base pair probabilities calculation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "#### machine linearing model" + "### Machine learning model" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 7, "metadata": {}, "outputs": [ { + "output_type": "execute_result", "data": { "text/plain": [ - "(7.182411193847656,\n", - " [(1, 15, 0.0004622228443622589),\n", - " (1, 31, 0.00015781447291374207),\n", - " (1, 34, 0.0002961978316307068),\n", - " (1, 40, 1.765415072441101e-05),\n", - " (1, 48, 5.81108033657074e-05),\n", - " (1, 52, 9.354203939437866e-05),\n", - " (1, 56, 0.00031825900077819824),\n", - " (1, 57, 0.0059296488761901855),\n", - " (1, 61, 0.00020839273929595947),\n", - " (1, 65, 0.00048786774277687073),\n", - " (1, 70, 0.0007950402796268463),\n", - " (1, 71, 0.002164561301469803),\n", - " (1, 72, 0.0016718171536922455),\n", - " (2, 15, 0.0015095695853233337),\n", - " (2, 31, 0.0009824782609939575),\n", - " (2, 34, 0.013957291841506958),\n", - " (2, 46, 0.00043483078479766846),\n", - " (2, 48, 0.0005135945975780487),\n", - " (2, 52, 0.00095348060131073),\n", - " (2, 56, 0.00834539532661438),\n", - " (2, 57, 0.00020313262939453125),\n", - " (2, 61, 0.0002741888165473938),\n", - " (2, 65, 0.00031565502285957336),\n", - " (2, 70, 0.0024125613272190094),\n", - " (2, 71, 0.0017434321343898773),\n", - " (2, 72, 0.00041290372610092163),\n", - " (3, 7, 0.0008168257772922516),\n", - " (3, 11, 0.008151888847351074),\n", - " (3, 12, 0.005629867315292358),\n", - " (3, 16, 0.001522894948720932),\n", - " (3, 17, 0.002167344093322754),\n", - " (3, 20, 0.006776213645935059),\n", - " (3, 21, 0.004337608814239502),\n", - " (3, 22, 0.026105105876922607),\n", - " (3, 24, 0.14537853002548218),\n", - " (3, 29, 0.00028303638100624084),\n", - " (3, 30, 0.0010201036930084229),\n", - " (3, 33, 0.02000981569290161),\n", - " (3, 35, 0.0015420466661453247),\n", - " (3, 45, 0.0005938038229942322),\n", - " (3, 47, 0.0007119551301002502),\n", - " (3, 49, 0.0002944953739643097),\n", - " (3, 51, 0.001286044716835022),\n", - " (3, 54, 0.005294471979141235),\n", - " (3, 55, 0.00849539041519165),\n", - " (3, 58, 0.0011745169758796692),\n", - " (3, 63, 0.004118800163269043),\n", - " (4, 10, 0.00503191351890564),\n", - " (4, 11, 0.004627823829650879),\n", - " (4, 12, 0.0028157830238342285),\n", - " (4, 16, 0.001822788268327713),\n", - " (4, 17, 0.0008588656783103943),\n", - " (4, 18, 0.03738701343536377),\n", - " (4, 19, 0.0039004385471343994),\n", - " (4, 20, 0.002740301191806793),\n", - " (4, 21, 0.02638840675354004),\n", - " (4, 22, 0.03388082981109619),\n", - " (4, 23, 0.15423643589019775),\n", - " (4, 24, 0.0003489069640636444),\n", - " (4, 26, 0.0002142190933227539),\n", - " (4, 27, 0.00020888447761535645),\n", - " (4, 29, 0.0006322525441646576),\n", - " (4, 30, 0.00016719475388526917),\n", - " (4, 32, 0.015675753355026245),\n", - " (4, 33, 0.0003027692437171936),\n", - " (4, 35, 0.00015425309538841248),\n", - " (4, 36, 0.0005160309374332428),\n", - " (4, 49, 0.00012025237083435059),\n", - " (4, 51, 0.00012722238898277283),\n", - " (4, 53, 0.0051382482051849365),\n", - " (4, 54, 0.006822943687438965),\n", - " (4, 
55, 0.0005917400121688843),\n", - " (4, 58, 0.00016247481107711792),\n", - " (4, 62, 0.0031983554363250732),\n", - " (4, 63, 0.00020073726773262024),\n", - " (4, 66, 0.0006130710244178772),\n", - " (4, 69, 0.0005395784974098206),\n", - " (5, 11, 0.004743069410324097),\n", - " (5, 12, 0.0859522819519043),\n", - " (5, 16, 0.0016122832894325256),\n", - " (5, 17, 0.04930245876312256),\n", - " (5, 20, 0.027035415172576904),\n", - " (5, 21, 0.0754665732383728),\n", - " (5, 22, 0.1489909291267395),\n", - " (5, 24, 0.0009400211274623871),\n", - " (5, 29, 0.0002566128969192505),\n", - " (5, 30, 0.03200578689575195),\n", - " (5, 33, 0.002335749566555023),\n", - " (5, 35, 0.00054892897605896),\n", - " (5, 45, 0.000279158353805542),\n", - " (5, 47, 0.00019484758377075195),\n", - " (5, 49, 0.00034235045313835144),\n", - " (5, 51, 0.0004806928336620331),\n", - " (5, 54, 0.001391671597957611),\n", - " (5, 55, 0.005109161138534546),\n", - " (5, 58, 0.002447940409183502),\n", - " (5, 63, 0.0012036822736263275),\n", - " (6, 11, 0.09708786010742188),\n", - " (6, 12, 0.0018821656703948975),\n", - " (6, 16, 0.04884237051010132),\n", - " (6, 17, 0.0015244334936141968),\n", - " (6, 20, 0.08314257860183716),\n", - " (6, 21, 0.12704682350158691),\n", - " (6, 22, 0.0009751878678798676),\n", - " (6, 24, 0.0024432316422462463),\n", - " (6, 29, 0.034866511821746826),\n", - " (6, 30, 0.0003223419189453125),\n", - " (6, 33, 0.0013894736766815186),\n", - " (6, 35, 0.00040587037801742554),\n", - " (6, 45, 0.00019099563360214233),\n", - " (6, 47, 0.0008939355611801147),\n", - " (6, 49, 0.0016529299318790436),\n", - " (6, 51, 0.016640305519104004),\n", - " (6, 54, 0.005631953477859497),\n", - " (6, 55, 0.00018457695841789246),\n", - " (6, 58, 0.0010196901857852936),\n", - " (6, 63, 0.0022887922823429108),\n", - " (7, 13, 0.011170625686645508),\n", - " (7, 14, 0.003929793834686279),\n", - " (7, 15, 0.033252835273742676),\n", - " (7, 25, 0.009845316410064697),\n", - " (7, 28, 0.03560483455657959),\n", - " (7, 31, 0.007103413343429565),\n", - " (7, 34, 0.0005918294191360474),\n", - " (7, 37, 0.00013744458556175232),\n", - " (7, 43, 0.00011819601058959961),\n", - " (7, 44, 8.602067828178406e-05),\n", - " (7, 46, 0.0010782815515995026),\n", - " (7, 48, 0.0021608732640743256),\n", - " (7, 50, 0.018468260765075684),\n", - " (7, 52, 0.003180205821990967),\n", - " (7, 56, 0.00973278284072876),\n", - " (7, 57, 0.0003688931465148926),\n", - " (7, 59, 0.007527649402618408),\n", - " (7, 60, 0.0012912601232528687),\n", - " (7, 61, 0.00042457133531570435),\n", - " (7, 64, 0.04046368598937988),\n", - " (7, 65, 0.00024250522255897522),\n", - " (7, 67, 0.0005980059504508972),\n", - " (7, 68, 0.0005687177181243896),\n", - " (7, 70, 0.0002057589590549469),\n", - " (7, 71, 0.00017112120985984802),\n", - " (7, 72, 0.0002051815390586853),\n", - " (8, 12, 0.009244143962860107),\n", - " (8, 16, 0.005479782819747925),\n", - " (8, 17, 0.21605318784713745),\n", - " (8, 20, 0.013513147830963135),\n", - " (8, 21, 0.011645674705505371),\n", - " (8, 22, 0.002902120351791382),\n", - " (8, 24, 0.01012575626373291),\n", - " (8, 29, 0.0003161430358886719),\n", - " (8, 30, 0.008874207735061646),\n", - " (8, 33, 0.0006380230188369751),\n", - " (8, 35, 9.339675307273865e-05),\n", - " (8, 45, 0.001000765711069107),\n", - " (8, 47, 0.002293657511472702),\n", - " (8, 49, 0.01654091477394104),\n", - " (8, 51, 0.003729313611984253),\n", - " (8, 54, 0.0004991143941879272),\n", - " (8, 55, 0.012955158948898315),\n", - " (8, 58, 0.007416486740112305),\n", - " (8, 63, 
0.043755531311035156),\n", - " (9, 16, 0.2228570580482483),\n", - " (9, 17, 0.010128110647201538),\n", - " (9, 20, 0.01358705759048462),\n", - " (9, 21, 0.002870023250579834),\n", - " (9, 22, 0.000507798045873642),\n", - " (9, 24, 0.0021577104926109314),\n", - " (9, 29, 0.008514046669006348),\n", - " (9, 30, 0.00019711628556251526),\n", - " (9, 33, 0.00017926841974258423),\n", - " (9, 35, 0.00028144195675849915),\n", - " (9, 45, 0.004837304353713989),\n", - " (9, 47, 0.0033318698406219482),\n", - " (9, 49, 0.0017511546611785889),\n", - " (9, 51, 0.0018434971570968628),\n", - " (9, 54, 0.012669652700424194),\n", - " (9, 55, 0.0002368353307247162),\n", - " (9, 58, 0.0030380189418792725),\n", - " (9, 63, 0.0019221603870391846),\n", - " (10, 15, 0.17589616775512695),\n", - " (10, 31, 3.317743539810181e-05),\n", - " (10, 34, 0.00022894889116287231),\n", - " (10, 40, 0.0001468062400817871),\n", - " (10, 42, 0.00017512962222099304),\n", - " (10, 44, 0.004744917154312134),\n", - " (10, 46, 0.002226322889328003),\n", - " (10, 48, 0.0016778036952018738),\n", - " (10, 52, 0.0008255057036876678),\n", - " (10, 56, 0.00017959997057914734),\n", - " (10, 57, 0.001654982566833496),\n", - " (10, 61, 0.04626733064651489),\n", - " (10, 65, 0.0009683743119239807),\n", - " (10, 70, 0.000538993626832962),\n", - " (10, 71, 0.0005774311721324921),\n", - " (10, 72, 0.001341603696346283),\n", - " (11, 15, 0.0011742450296878815),\n", - " (11, 25, 0.008123189210891724),\n", - " (11, 28, 0.0042219460010528564),\n", - " (11, 31, 2.90796160697937e-05),\n", - " (11, 37, 0.00012592226266860962),\n", - " (11, 39, 0.00016996636986732483),\n", - " (11, 40, 8.694082498550415e-05),\n", - " (11, 41, 0.00020815059542655945),\n", - " (11, 42, 0.00016628578305244446),\n", - " (11, 43, 0.004593878984451294),\n", - " (11, 44, 0.0009049177169799805),\n", - " (11, 46, 0.0006507895886898041),\n", - " (11, 48, 0.0007310137152671814),\n", - " (11, 50, 0.00395733118057251),\n", - " (11, 52, 0.002096358686685562),\n", - " (11, 56, 0.0006698817014694214),\n", - " (11, 57, 0.015850424766540527),\n", - " (11, 59, 0.0007002614438533783),\n", - " (11, 60, 0.05610603094100952),\n", - " (11, 61, 0.0014368519186973572),\n", - " (11, 64, 0.00126662477850914),\n", - " (11, 65, 0.0042133331298828125),\n", - " (11, 67, 0.0008630640804767609),\n", - " (11, 68, 0.006175577640533447),\n", - " (11, 70, 0.0007995143532752991),\n", - " (11, 71, 0.0020865797996520996),\n", - " (11, 72, 0.0011804252862930298),\n", - " (12, 25, 0.013226866722106934),\n", - " (12, 28, 0.0005642510950565338),\n", - " (12, 31, 0.001819457858800888),\n", - " (12, 34, 0.00023008882999420166),\n", - " (12, 39, 0.00013705715537071228),\n", - " (12, 40, 0.00015447288751602173),\n", - " (12, 41, 0.00022390484809875488),\n", - " (12, 42, 0.0022322610020637512),\n", - " (12, 43, 0.0008905753493309021),\n", - " (12, 44, 0.00016807392239570618),\n", - " (12, 46, 0.0032880306243896484),\n", - " (12, 48, 0.006122291088104248),\n", - " (12, 50, 0.014869928359985352),\n", - " (12, 52, 0.003486722707748413),\n", - " (12, 56, 0.023375272750854492),\n", - " (12, 57, 0.00016527995467185974),\n", - " (12, 59, 0.053273141384124756),\n", - " (12, 60, 0.0016719922423362732),\n", - " (12, 61, 8.777529001235962e-05),\n", - " (12, 64, 0.006330639123916626),\n", - " (12, 65, 0.00017600879073143005),\n", - " (12, 67, 0.006108105182647705),\n", - " (12, 68, 0.0004602111876010895),\n", - " (12, 70, 0.0021186135709285736),\n", - " (12, 71, 0.00137435644865036),\n", - " (12, 72, 0.0011010058224201202),\n", - " 
(13, 17, 0.0008790753781795502),\n", - " (13, 20, 0.002657536417245865),\n", - " (13, 21, 0.03750556707382202),\n", - " (13, 22, 0.10807883739471436),\n", - " (13, 24, 0.011932432651519775),\n", - " (13, 29, 0.0001623108983039856),\n", - " (13, 30, 0.0025212839245796204),\n", - " (13, 33, 0.00030111148953437805),\n", - " (13, 35, 9.37432050704956e-05),\n", - " (13, 45, 0.0040864646434783936),\n", - " (13, 47, 0.0074340105056762695),\n", - " (13, 49, 0.015464037656784058),\n", - " (13, 51, 0.004499077796936035),\n", - " (13, 54, 0.000272948294878006),\n", - " (13, 55, 0.03105640411376953),\n", - " (13, 58, 0.042548537254333496),\n", - " (13, 63, 0.005898922681808472),\n", - " (14, 20, 0.0391959547996521),\n", - " (14, 21, 0.12018775939941406),\n", - " (14, 22, 0.007683068513870239),\n", - " (14, 24, 0.004466891288757324),\n", - " (14, 29, 0.0024731531739234924),\n", - " (14, 30, 0.0012836195528507233),\n", - " (14, 33, 0.0002957656979560852),\n", - " (14, 35, 8.28765332698822e-05),\n", - " (14, 45, 0.0028058290481567383),\n", - " (14, 47, 0.00513535737991333),\n", - " (14, 49, 0.005341023206710815),\n", - " (14, 51, 0.0016194544732570648),\n", - " (14, 54, 0.03227001428604126),\n", - " (14, 55, 0.010436713695526123),\n", - " (14, 58, 0.00033552199602127075),\n", - " (14, 63, 0.003631681203842163),\n", - " (15, 19, 0.02284419536590576),\n", - " (15, 20, 0.11268025636672974),\n", - " (15, 21, 0.005476862192153931),\n", - " (15, 22, 0.0013911649584770203),\n", - " (15, 23, 0.004423260688781738),\n", - " (15, 24, 0.0003324933350086212),\n", - " (15, 26, 0.0017650313675403595),\n", - " (15, 27, 0.00036427006125450134),\n", - " (15, 29, 0.0013628862798213959),\n", - " (15, 32, 0.0002611130475997925),\n", - " (15, 33, 3.1597912311553955e-05),\n", - " (15, 38, 0.00028830766677856445),\n", - " (15, 45, 0.005253881216049194),\n", - " (15, 47, 0.0028781890869140625),\n", - " (15, 49, 0.0013617761433124542),\n", - " (15, 51, 0.0011769458651542664),\n", - " (15, 53, 0.026831328868865967),\n", - " (15, 54, 0.009345054626464844),\n", - " (15, 55, 0.0004412718117237091),\n", - " (15, 58, 0.0001741647720336914),\n", - " (15, 62, 0.0039209723472595215),\n", - " (15, 63, 0.00017056241631507874),\n", - " (15, 66, 0.0023140311241149902),\n", - " (15, 69, 0.036376237869262695),\n", - " (16, 25, 0.0024813786149024963),\n", - " (16, 28, 0.002208299934864044),\n", + "99934864044),\n", " (16, 31, 0.00016783550381660461),\n", " (16, 37, 0.00037275999784469604),\n", " (16, 39, 0.00014984235167503357),\n", @@ -1104,15 +772,11 @@ " (66, 72, 0.00034242868423461914)])" ] }, - "execution_count": 5, "metadata": {}, - "output_type": "execute_result" + "execution_count": 7 } ], "source": [ - "import sys\n", - "sys.path.append('../build/c/pahelix/toolkit/linear_rna')\n", - "import linear_rna\n", "input_sequence = \"AACUCCGCCAGGCCUGGAAGGGAGCAACGGUAGUGACACUCUCUGUGUGCGUAGGUUGCCUAGCUACCAUUU\"\n", "linear_rna.linear_partition_c(input_sequence)" ] @@ -1121,43 +785,72 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### parameter setting:\n", + "### Parameter setting\n", "- rna_sequence: string, the input RNA sequence to calculate partition function and base pair probabities. \n", - "- beam_size: int (default 100),set 0 to turn off the beam pruning;\n", - "- bp_cutoff: double (default 0.0), only output base pairs with correponding proabilities whose values larger than the bp_cutoff (between 0 and 1);\n", - "- no_sharp_turn: bool (default False), enable sharpturn in prediction." 
+ "- beam_size: int (default 100), set 0 to turn off the beam pruning.\n", + "- bp_cutoff: double (default 0.0), only output base pairs with correponding proabilities whose values larger than the bp_cutoff (between 0 and 1).\n", + "- no_sharp_turn: bool (default True), enable sharpturn in prediction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "#### thermodynamic model\n", - "the parameters are the same as the machine learning model" + "### Thermodynamic model\n", + "The parameters are the same as the machine learning model." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "metadata": {}, - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(-19.403078079223633,\n", + " [(3, 24, 0.8039621710777283),\n", + " (4, 23, 0.8239085078239441),\n", + " (5, 22, 0.8219183087348938),\n", + " (6, 21, 0.8141640424728394),\n", + " (8, 17, 0.8630755543708801),\n", + " (9, 16, 0.8678420186042786),\n", + " (10, 15, 0.7041950225830078),\n", + " (29, 68, 0.8213568925857544),\n", + " (30, 67, 0.8230675458908081),\n", + " (31, 66, 0.8223718404769897),\n", + " (32, 65, 0.8192430138587952),\n", + " (33, 64, 0.7854012250900269),\n", + " (34, 63, 0.690216600894928),\n", + " (36, 48, 0.8604442477226257),\n", + " (37, 47, 0.91445392370224),\n", + " (38, 46, 0.9144330024719238),\n", + " (39, 45, 0.9141958951950073),\n", + " (52, 62, 0.564305305480957),\n", + " (53, 61, 0.6316171884536743),\n", + " (54, 60, 0.6385029554367065),\n", + " (55, 59, 0.6180907487869263)])" + ] + }, + "metadata": {}, + "execution_count": 8 + } + ], "source": [ "linear_rna.linear_partition_v(input_sequence, bp_cutoff = 0.5)" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" + "name": "python3", + "display_name": "Python 3.7.9 64-bit ('patu': conda)", + "metadata": { + "interpreter": { + "hash": "3e0911fe6af7a6beeb9019ce4fe0b0d7b8f33d578060495e40865e7435c4d93f" + } + } }, "language_info": { "codemirror_mode": { @@ -1169,10 +862,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.3" + "version": "3.7.9-final" } }, "nbformat": 4, "nbformat_minor": 4 -} - +} \ No newline at end of file diff --git a/tutorials/linearrna_tutorial_cn.ipynb b/tutorials/linearrna_tutorial_cn.ipynb index b483d0e4..8bde0c98 100644 --- a/tutorials/linearrna_tutorial_cn.ipynb +++ b/tutorials/linearrna_tutorial_cn.ipynb @@ -4,65 +4,42 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# LinearRNA 教程" + "# 使用 LinearRNA 进行 RNA 二级结构预测" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "LinearRNA包括一系列的线性时间RNA二级结构分析算法: **LinearFold**和**LinearPartition**" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['linear_rna.h', 'linear_rna.cpp', 'utils', 'linear_fold', 'linear_partition']" - ] - }, - "execution_count": 1, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import os\n", - "#os.chdir('../apps/drug_target_interaction/graph_dta/')\n", - "os.chdir('../c/pahelix/toolkit/linear_rna/linear_rna')\n", - "os.listdir(os.getcwd())" + "LinearRNA 包括一系列的线性时间 RNA 二级结构分析算法: **LinearFold** 和 **LinearPartition**。关于这个主题的更多信息请查阅[这里](https://github.com/PaddlePaddle/PaddleHelix/c/pahelix/toolkit/linear_rna/README_cn.md)。" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ - "## LinearFold" + "# 第一部分:LinearFold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**LinearFold**是第一个以线性时间预测RNA二级结构的算法, 可将RNA二级结构预测的时间大大降低,LinearFold论文已经在计算生物学顶级会议ISMB及生物信息学权威杂志Bioinformatics 上发表。论文链接请见:[LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic programming and beam search](http://academic.oup.com/bioinformatics/article/35/14/i295/5529205)。" + "**LinearFold** 是第一个以线性时间预测 RNA 二级结构的算法,可将 RNA 二级结构预测的时间大大降低。LinearFold 的论文已经在计算生物学顶级会议 ISMB 及生物信息学权威杂志 Bioinformatics 上发表。论文链接:[LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic programming and beam search](http://academic.oup.com/bioinformatics/article/35/14/i295/5529205)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### 二级结构预测" + "## RNA 二级结构预测" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "机器学习模型 " + "### 机器学习模型 " ] }, { @@ -83,9 +60,7 @@ } ], "source": [ - "import sys\n", - "sys.path.append('../build/c/pahelix/toolkit/linear_rna')\n", - "import linear_rna\n", + "import pahelix.toolkit.linear_rna as linear_rna\n", "input_sequence = \"AACUCCGCCAGGCCUGGAAGGGAGCAACGGUAGUGACACUCUCUGUGUGCGUAGGUUGCCUAGCUACCAUUU\"\n", "linear_rna.linear_fold_c(input_sequence)" ] @@ -117,18 +92,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### 参数说明" + "### 参数说明" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "* rna_sequence: string, 需要预测结构的RNA序列\n", - "* beam_size: int (缺省值100), 控制beam pruning size的参数,设置为0关闭beam pruning。该参数越大,则预测速度越慢,而与精确搜索相比近似效果越好;\n", - "* use_constraints: bool (缺省值False), 在预测二级结构时增加约束条件。当为True时, constraint参数需要提供约束序列;\n", - "* constraint: string (缺省值空字符串), 二级结构预测约束条件。当提供约束序列时, use_constraints参数需要设置为True。该约束须与输入的RNA序列长度相同,每个点位可以指定“? . ( )”四种符号中的一种,其中“?”表示该点位无限制,“.”表示该点位必须是unpaired,“(”与“)”表示该点位必须是paired。注意“(”与“)”必须数量相等,即相互匹配。具体操作请参考运行实例。\n", - "- no_sharp_turn: bool (缺省值True), 不允许在预测的hairpin结构中出现sharp turn。" + "* rna_sequence: string,需要预测结构的 RNA 序列。\n", + "* beam_size: int(缺省值 100),控制 beam pruning size 的参数,设置为 0 关闭 beam pruning。该参数越大,则预测速度越慢,而与精确搜索相比近似效果越好。\n", + "* use_constraints: bool(缺省值 False),在预测二级结构时增加约束条件。当为 True 时,constraint 参数需要提供约束序列;\n", + "* constraint: string(缺省值空字符串),二级结构预测约束条件。当提供约束序列时,use_constraints 参数需要设置为 True。该约束须与输入的 RNA 序列长度相同,每个点位可以指定“? . 
( )”四种符号中的一种,其中“?”表示该点位无限制,“.”表示该点位必须是 unpaired,“(”与“)”表示该点位必须是 paired。注意“(”与“)”必须数量相等,即相互匹配。具体操作请参考运行实例。\n", + "- no_sharp_turn: bool(缺省值 True),不允许在预测的 hairpin 结构中出现 sharp turn。" ] }, { @@ -136,7 +111,7 @@ "metadata": {}, "source": [ "### 热力学模型\n", - "参数和机器学习模型一致" + "参数和机器学习模型一致。" ] }, { @@ -186,16 +161,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### LinearPartition" + "# 第二部分:LinearPartition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "LinearPartition是世界最快RNA配分方程和碱基对概率预测算法。该算法功能更加强大,可以模拟RNA序列在平衡态时成千上万种不同结构的分布,并预测碱基对概率矩阵。LinearPartition算法同样被ISMB顶会接收并在Bioinformatics杂志上发表,论文链接请见:[LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities](https://academic.oup.com/bioinformatics/article/36/Supplement_1/i258/5870487)。" + "**LinearPartition** 是世界上最快的 RNA 配分方程和碱基对概率预测算法。该算法功能更加强大,可以模拟 RNA 序列在平衡态时成千上万种不同结构的分布,并预测碱基对概率矩阵。LinearPartition 算法同样被 ISMB 顶会接收并在 Bioinformatics 杂志上发表。论文链接:[LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities](https://academic.oup.com/bioinformatics/article/36/Supplement_1/i258/5870487)" ] }, + { + "source": [ + "## 配分方程和碱基对概率预测算法" + ], + "cell_type": "markdown", + "metadata": {} + }, { "cell_type": "markdown", "metadata": {}, @@ -1108,9 +1090,6 @@ } ], "source": [ - "import sys\n", - "sys.path.append('../build/c/pahelix/toolkit/linear_rna')\n", - "import linear_rna\n", "input_sequence = \"AACUCCGCCAGGCCUGGAAGGGAGCAACGGUAGUGACACUCUCUGUGUGCGUAGGUUGCCUAGCUACCAUUU\"\n", "linear_rna.linear_partition_c(input_sequence)" ] @@ -1119,11 +1098,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### 参数说明\n", - "* rna_sequence: string, 需要计算配分函数和碱基对概率的RNA序列\n", - "* beam_size: int (缺省值100), 控制beam pruning size的参数,默认值为100。该参数越大,则预测速度越慢,而与精确搜索相比近似效果越好;\n", - "* bp_cutoff: double (缺省值0.9), 只输出概率大于等于bp_cutoff的碱基对及其概率, 0 <= bp_cutoff <= 1; \n", - "* no_sharp_turn: bool (缺省值True), 不允许在预测的hairpin结构中出现sharp turn, 默认为False。" + "### 参数说明\n", + "* rna_sequence: string,需要计算配分函数和碱基对概率的RNA序列。\n", + "* beam_size: int(缺省值 100),控制 beam pruning size 的参数,默认值为 100。该参数越大,则预测速度越慢,而与精确搜索相比近似效果越好。\n", + "* bp_cutoff: double(缺省值 0.9),只输出概率大于等于 bp_cutoff 的碱基对及其概率,0 <= bp_cutoff <= 1。\n", + "* no_sharp_turn: bool(缺省值 True),不允许在预测的 hairpin 结构中出现 sharp turn。" ] }, { @@ -1131,7 +1110,7 @@ "metadata": {}, "source": [ "### 热力学模型\n", - "参数和机器学习模型一致" + "参数和机器学习模型一致。" ] }, { @@ -1174,20 +1153,17 @@ "source": [ "linear_rna.linear_partition_v(input_sequence, bp_cutoff = 0.5)" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" + "name": "python3", + "display_name": "Python 3.7.9 64-bit ('patu': conda)", + "metadata": { + "interpreter": { + "hash": "3e0911fe6af7a6beeb9019ce4fe0b0d7b8f33d578060495e40865e7435c4d93f" + } + } }, "language_info": { "codemirror_mode": { @@ -1199,10 +1175,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.3" + "version": "3.7.9-final" } }, "nbformat": 4, "nbformat_minor": 4 -} - +} \ No newline at end of file diff --git a/tutorials/protein_pretrain_and_property_prediction_tutorial.ipynb b/tutorials/protein_pretrain_and_property_prediction_tutorial.ipynb index 2f82d23c..b374fdb9 100644 --- a/tutorials/protein_pretrain_and_property_prediction_tutorial.ipynb +++ b/tutorials/protein_pretrain_and_property_prediction_tutorial.ipynb @@ 
-11,7 +11,7 @@ "source": [ "# Protein pretraining and property prediction\n", "\n", - "In this tuorial, we will go through how to run a sequence model for protein property prediction. In particular, we will demonstrate how to pretrain it and how to finetune in the downstream tasks.\n", + "In this tuorial, we will go through how to build a simply sequence model for protein property prediction. In particular, we will demonstrate how to pretrain it and how to finetune it in the downstream tasks. More details of this topic can be found [here](https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/pretrained_protein/tape).\n", "\n", "In recent years, with sequencing technology development, the protein sequence database scale has significantly increased. However, the cost of obtaining labeled protein sequences is still very high, as it requires biological experiments. Besides, due to the inadequate number of labeled samples, the model has a high probability of overfitting the data. Borrowing the ideas from natural language processing (NLP), we can pre-train numerous unlabeled sequences by self-supervised learning. In this way, we can extract useful biological information from proteins and transfer them to other tagged tasks to make these tasks training faster and more stable convergence. These instructions refer to the work of paper TAPE, providing the model implementation of Transformer, LSTM, and ResNet.\n", "\n", @@ -345,4 +345,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file diff --git a/tutorials/protein_pretrain_and_property_prediction_tutorial_cn.ipynb b/tutorials/protein_pretrain_and_property_prediction_tutorial_cn.ipynb index 469d2e2b..c90512f1 100644 --- a/tutorials/protein_pretrain_and_property_prediction_tutorial_cn.ipynb +++ b/tutorials/protein_pretrain_and_property_prediction_tutorial_cn.ipynb @@ -11,7 +11,7 @@ "source": [ "# 蛋白质预训练和性质预测\n", "\n", - "在这份教程中,我们将介绍如何构建一个序列模型来进行蛋白质性质预测。具体来说,我们将展示如何对模型进行预训练并针对下游任务进行微调。\n", + "在这份教程中,我们将介绍如何构建一个序列模型来进行蛋白质性质预测。具体来说,我们将展示如何对模型进行预训练并针对下游任务进行微调。关于这个主题的更多详细介绍请查阅[这里](https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/pretrained_protein/tape/README_cn.md)。\n", "\n", "近年来,随着测序技术的发展,蛋白质序列数据库的规模显著扩大。然而,必须通过湿实验才能够获得的有标注蛋白序列的成本仍然很高。此外,由于标记样本数量不足,模型有很高的概率过拟合数据。借鉴自然语言处理(NLP)的思想,通过自监督学习可以在大量无标注的蛋白序列上进行预训练。这样,我们就可以从蛋白质序列中提取有用的生物信息,并将其迁移到其他有标注的任务中,使这些任务的训练速度更快和更稳定地收敛。本教程的内容参考了 TAPE 的工作,提供了 Transformer、LSTM 和 ResNet 的模型实现。\n", "\n", @@ -345,4 +345,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file
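As a closing illustration of the kind of sequence model the protein tutorial builds on, below is a minimal LSTM classifier sketch written directly against the `paddle.nn` API. It is not the TAPE implementation shipped with PaddleHelix; the vocabulary size, hidden dimensions, and number of classes are illustrative placeholders.

```python
# A minimal protein-sequence classifier sketch using paddle.nn.
# This is NOT the TAPE model from PaddleHelix; it only illustrates the overall
# shape of an LSTM-based sequence model for protein property prediction.
import paddle
import paddle.nn as nn

class ProteinLSTMClassifier(nn.Layer):
    def __init__(self, vocab_size=26, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # amino-acid tokens -> vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)   # [batch, seq_len, embed_dim]
        _, (h, _) = self.lstm(x)        # h: [num_layers, batch, hidden_dim]
        return self.classifier(h[-1])   # logits: [batch, num_classes]

model = ProteinLSTMClassifier()
dummy_batch = paddle.randint(0, 26, shape=[4, 100])  # 4 fake sequences of length 100
print(model(dummy_batch).shape)                      # expect [4, 2]
```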