From 117f7ac126c677d62a04c548a2a4522aba4c1f12 Mon Sep 17 00:00:00 2001 From: kpzn768 Date: Tue, 8 Apr 2025 13:53:42 +0200 Subject: [PATCH 1/3] First version of tutorial A --- tutorials/A_basic_usage.ipynb | 405 ++++++++++++++++++++++++++++++++++ 1 file changed, 405 insertions(+) create mode 100644 tutorials/A_basic_usage.ipynb diff --git a/tutorials/A_basic_usage.ipynb b/tutorials/A_basic_usage.ipynb new file mode 100644 index 0000000..301488f --- /dev/null +++ b/tutorials/A_basic_usage.ipynb @@ -0,0 +1,405 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "m8Oc1PiNjOwB" + }, + "source": [ + "# Tutorial A - basic usage\n", + "\n", + "In this tutorial you will learn the basics of running retrosynthesis experiments with AiZynthFinder.\n", + "\n", + "After the completion of this tutorial, you will know:\n", + "* How to download public models and data files\n", + "* How to write a simple configuration file\n", + "* How to select models to be used in search\n", + "* How to select stock to be used in search\n", + "* How to perform a retrosynthesis search\n", + "* How to perform basic analysis of the output\n", + "\n", + "\n", + "We will start with installing the package from PyPI" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "pkieNY8ikT-I", + "outputId": "067de950-6a3e-4573-ed7c-e821633b45fe" + }, + "outputs": [], + "source": [ + "!pip install --quiet aizynthfinder\n", + "!pip install --ignore-installed Pillow==9.0.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZeZe5rNCq-DG" + }, + "source": [ + "### Download public data files\n", + "\n", + "Throughout this tutorial we will use publicly available models and data files.\n", + "These can be downloaded to our local folder using a convenient tool.\n", + "\n", + "We will download\n", + "- Expansion models trained on the USPTO data\n", + "- Filter model trained on 
USPTO data\n", + "- ZINC stock file" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JUDc5WN3rJvn", + "outputId": "19b11cc9-4944-4cbc-9be6-cd4bbf5539f7" + }, + "outputs": [], + "source": [ + "!mkdir --parents data && download_public_data data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uK-fa0tRp7Eu" + }, + "source": [ + "### The aizynthfinder configuration file\n", + "\n", + "The main Python interface to AiZynthFinder is a class called `AiZynthFinder`. This interface is instantiated with a configuration, either from disk in the form of a YAML file or from a dictionary.\n", + "\n", + "The configuration is central to the execution and holds information about:\n", + "- What models to use\n", + "- What stock to use\n", + "- How to configure the search algorithm\n", + "- What score to compute for the routes\n", + "\n", + "In this tutorial, we will only look at the first two; the others will be covered in upcoming tutorials.\n", + "\n", + "The script that we used above to download the public models also provided us with a config file that looks like this\n", + "\n", + "```\n", + "expansion:\n", + "  uspto:\n", + "    - uspto_model.onnx\n", + "    - uspto_templates.csv.gz\n", + "  ringbreaker:\n", + "    - uspto_ringbreaker_model.onnx\n", + "    - uspto_ringbreaker_templates.csv.gz\n", + "filter:\n", + "  uspto: uspto_filter_model.onnx\n", + "stock:\n", + "  zinc: zinc_stock.hdf5\n", + "```\n", + "\n", + "The `expansion` section specifies the expansion models to load into memory. This does not, however, mean that they will be used in the search.\n", + "\n", + "Here we load two models, one general and one specific for breaking rings. 
The `uspto` and `ringbreaker` are labels for the models that we can use to reference the models in the setup of the search.\n", + "\n", + "The two files specified for each model are 1) an ONNX model file containing the weights of the neural network, and 2) a CSV file with metadata on the templates.\n", + "\n", + "The `filter` section similarly specifies the filter model. Here, only one file needs to be specified - the ONNX model weights.\n", + "\n", + "Finally, the `stock` section specifies the stock to load. Here we load one that we will refer to as `zinc` and the compounds in this stock will be loaded from `zinc_stock.hdf5`.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "twFTFrLFJZQT" + }, + "source": [ + "### Initializing AiZynthFinder interface\n", + "\n", + "Now we can start to set up the retrosynthesis search using the `AiZynthFinder` interface. We will also initialize the logging level so that we get some useful information printed to the screen." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "h9fFdHKIsoCA" + }, + "outputs": [], + "source": [ + "import logging\n", + "from aizynthfinder.utils.logging import setup_logger\n", + "setup_logger(logging.INFO)\n", + "\n", + "from aizynthfinder.aizynthfinder import AiZynthFinder" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wHCx5Ylcrvwi", + "outputId": "50956318-63bf-49f4-eb31-ce359067e5bd" + }, + "outputs": [], + "source": [ + "finder = AiZynthFinder(\"data/config.yml\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "STFgCAbQJ5cw" + }, + "source": [ + "When instantiating the `AiZynthFinder` class with our config file, we see that the two template-based models, the filter model, and the stock file are loaded.\n", + "\n", + "Even though they are loaded into memory, they are not automatically used in the search. 
For this we need to select what stock and models we want to use.\n", + "\n", + "We will start by selecting all stocks (although we only loaded one) and the expansion policy with the tag `uspto`, i.e., the general expansion model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1ZimRagnr2iG", + "outputId": "db6ea36f-b48b-4378-f684-3b548bd0c7e7" + }, + "outputs": [], + "source": [ + "finder.stock.select_all()\n", + "finder.expansion_policy.select(\"uspto\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6rUHvg-hKbMY" + }, + "source": [ + "### Starting a search\n", + "\n", + "There are two steps to a retrosynthesis search once you have set up the interface\n", + "- Set the target SMILES\n", + "- Initiate the search" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 167 + }, + "id": "r3551jWIsEPw", + "outputId": "4a47ceb4-dec0-4717-c001-2ba20a7a09da" + }, + "outputs": [], + "source": [ + "finder.target_smiles = \"Cc1cccc(C)c1N(CC(=O)Nc1ccc(-c2ncon2)cc1)C(=O)C1CCS(=O)(=O)CC1\"\n", + "display(finder.target_mol.rd_mol)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "16YQt5SnKt1v", + "outputId": "857cd61a-4749-419c-84cb-4ddb8d1aff65" + }, + "outputs": [], + "source": [ + "finder.tree_search(show_progress=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Pav4ZcN8K6pC" + }, + "source": [ + "That was quick, right?\n", + "\n", + "\n", + "### Analysis of the output\n", + "\n", + "Now we need to extract routes from the retrosynthesis search tree" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "UbIcEUA7sSFC", + "outputId": 
"6846a701-f33b-4637-c2cd-048cd1dae0fb" + }, + "outputs": [], + "source": [ + "finder.build_routes()\n", + "finder.analysis.tree_statistics()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mCpFoCrdMf1_" + }, + "source": [ + "The `tree_statistics` method returns some general information about the search tree and the top-ranked routes.\n", + "\n", + "We can for instance read that:\n", + "- There are 618 nodes in the search tree\n", + "- The depth of the search tree is 6\n", + "- There are 174 routes in the search tree, of which 37 are solved (all starting materials are in stock)\n", + "- The top-ranked route is a 2-step route with 3 starting materials\n", + "\n", + "We only extract the top-ranked routes by default." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "SJgJLnWHPAA1", + "outputId": "b05ae733-5ce9-491f-dd63-b61f48d08750" + }, + "outputs": [], + "source": [ + "len(finder.routes)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FglAk9CQPB9-" + }, + "source": [ + "We can visualize the top-ranked route like this" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 492 + }, + "id": "-mniLc6OsWjx", + "outputId": "785050bf-8997-49f5-9b3e-02534a29c7c5" + }, + "outputs": [], + "source": [ + "finder.routes.reaction_trees[0].to_image()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2H-sWEwaOQcj" + }, + "source": [ + "We can iterate over all the starting materials and display them together with their SMILES strings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 519 + }, + "id": "k-tPWUGKNGQq", + "outputId": "be411dd6-12c6-4024-f561-257e21ca43b5" + }, + "outputs": [], + "source": [ + "for mol in finder.routes.reaction_trees[0].leafs():\n", + " 
print(mol.smiles)\n", + " display(mol.rd_mol)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mU8nXMN6Oijj" + }, + "source": [ + "We can compute some scores for the extracted routes. You will learn more about how this is done in a forthcoming tutorial.\n", + "\n", + "We will import `pandas` so that we can get a nice-looking table" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 332 + }, + "id": "4s7C9j6bN0kH", + "outputId": "c9d56ad5-3f9a-4d07-844d-6367ce7e883f" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "finder.routes.compute_scores(*finder.scorers.objects())\n", + "pd.DataFrame(\n", + " finder.routes.all_scores\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MlS-PRlsOu0f" + }, + "source": [ + "That is all for now!\n", + "\n", + "Let's continue with the next tutorial where you will learn how to do more advanced route analysis." + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From 3739eb81cdb94e43caea9cb2e782f729949fc63f Mon Sep 17 00:00:00 2001 From: kpzn768 Date: Thu, 10 Apr 2025 11:25:54 +0200 Subject: [PATCH 2/3] B tutorial --- tutorials/B_route_analysis.ipynb | 845 +++++++++++++++++++++++++++++++ tutorials/README.md | 23 + 2 files changed, 868 insertions(+) create mode 100644 tutorials/B_route_analysis.ipynb create mode 100644 tutorials/README.md diff --git a/tutorials/B_route_analysis.ipynb b/tutorials/B_route_analysis.ipynb new file mode 100644 index 0000000..9b07abe --- /dev/null +++ b/tutorials/B_route_analysis.ipynb @@ -0,0 +1,845 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "k5Imv4YnxCNi" + }, + "source": [ + "# Tutorial B - route analysis\n", + "\n", + "In this tutorial you will learn 
advanced analysis techniques for route predictions\n", + "\n", + "After the completion of this tutorial, you will know:\n", + "* How to extract routes from retrosynthesis search trees\n", + "* How to score routes with AiZynthFinder\n", + "* How to score routes with rxnutils\n", + "* How to calculate route similarities\n", + "* How to cluster routes\n", + "\n", + "\n", + "We will start with installing packages from PyPI" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 474 + }, + "collapsed": true, + "id": "KjGZ4225wyiH", + "outputId": "00640252-f625-46b1-f54e-7a65ddf1f2ab" + }, + "outputs": [], + "source": [ + "!pip install --quiet aizynthfinder\n", + "!pip install --quiet reaction-utils[models]\n", + "!pip install --ignore-installed Pillow==9.0.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1ICfB1uEx1EI" + }, + "source": [ + "### Setup\n", + "\n", + "As with the basic tutorial, we will work with public data and models" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d5L50Gptxz2q", + "outputId": "74507335-974b-4733-e415-aeeb8ee90d28" + }, + "outputs": [], + "source": [ + "!mkdir --parents data && download_public_data data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-zVK9IokxwS7" + }, + "source": [ + "And we will set up the `aizynthfinder` interface similarly to the basic tutorial as well...\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Czg-SycnyTrU" + }, + "outputs": [], + "source": [ + "import logging\n", + "from aizynthfinder.utils.logging import setup_logger\n", + "setup_logger(logging.INFO)\n", + "\n", + "from aizynthfinder.aizynthfinder import AiZynthFinder" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": 
"https://localhost:8080/" + }, + "id": "jvXblyBpyU3y", + "outputId": "8d2efacd-fd82-4dac-a183-faee4d66da77" + }, + "outputs": [], + "source": [ + "finder = AiZynthFinder(\"data/config.yml\")\n", + "finder.stock.select_all()\n", + "finder.expansion_policy.select(\"uspto\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2YduQUNgyfve" + }, + "source": [ + "... and run retrosynthesis analysis on amenamevir" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 185 + }, + "id": "J8G7l8zqyXe4", + "outputId": "da2f7e0d-5945-4430-833f-ddf1f241fd9e" + }, + "outputs": [], + "source": [ + "finder.target_smiles = \"Cc1cccc(C)c1N(CC(=O)Nc1ccc(-c2ncon2)cc1)C(=O)C1CCS(=O)(=O)CC1\"\n", + "display(finder.target_mol.rd_mol)\n", + "finder.tree_search(show_progress=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "25FpUcV_WUQ1" + }, + "source": [ + "### Extracting routes\n", + "\n", + "In the previous tutorial we used the `build_routes` method to extract routes from the search. By default this only returns a limited set of routes based on the score used to guide the search." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KV6BDun5WqFE", + "outputId": "de0d33d5-f4a1-4a61-9156-77a37aec73e7" + }, + "outputs": [], + "source": [ + "finder.build_routes()\n", + "len(finder.routes)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W1dZimGqWnFY" + }, + "source": [ + "We can control how many routes that we want to generate. 
The algorithm works with a minimum and a maximum number of routes: more than the minimum can be returned if multiple routes have the same score, but never more than the maximum." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "kD3EOyCo0yx4", + "outputId": "e4c1712a-146c-46cf-d200-95c8ed83c6f5" + }, + "outputs": [], + "source": [ + "from aizynthfinder.analysis.utils import RouteSelectionArguments\n", + "sel_args = RouteSelectionArguments(nmin=10, nmax=15)\n", + "finder.build_routes(sel_args)\n", + "len(finder.routes)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "2ar6PfyrXRhl", + "outputId": "bdd9b876-0126-4a81-a28c-02263c3f513f" + }, + "outputs": [], + "source": [ + "sel_args = RouteSelectionArguments(nmin=10)\n", + "finder.build_routes(sel_args)\n", + "len(finder.routes)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XcgtdblIX_xt" + }, + "source": [ + "We can also make it return all solved routes, i.e. routes leading to starting material in stock." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1r1E2DRE1Czr", + "outputId": "51226874-e264-441f-c2cc-4473248cc85b" + }, + "outputs": [], + "source": [ + "sel_args = RouteSelectionArguments(return_all=True)\n", + "finder.build_routes(sel_args)\n", + "len(finder.routes)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-ORtpgG-YNHU" + }, + "source": [ + "### Scoring\n", + "\n", + "Now we will explore some ways to score the generated routes. 
We had a preview of this in the previous tutorial and now we will dig into this some more.\n", + "\n", + "There are a number of route scores available in the `aizynthfinder` package but we will also use some more elaborate route scores available in the `rxnutils` package.\n", + "\n", + "In `aizynthfinder`, the available scores are typically accessible through `finder.scorers`, which is a collection of scorer objects. These can be loaded from the configuration file or created by users.\n", + "\n", + "There are a number of scores that are loaded automatically, because they are simple, fast, and do not require any external input." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rd4vmS6BZRzV", + "outputId": "48dc6ce1-d53d-4d47-ae07-9b15b15af568" + }, + "outputs": [], + "source": [ + "finder.scorers.names()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FRNtzbqxZOqC" + }, + "source": [ + "The scorers are:\n", + "\n", + "- \"state score\", the score used to guide the tree search (a combination of the number of steps in the route and the fraction of starting material in stock)\n", + "- \"number of reactions\", the total number of steps in the route\n", + "- \"number of pre-cursors\", the number of starting materials\n", + "- \"number of pre-cursors in stock\", the number of starting materials in stock\n", + "- \"average template occurrence\", the average database occurrence of the templates used in the route\n", + "\n", + "We can then use them to score the routes..." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "sI5X_yQa4noH", + "outputId": "717d77fd-d44b-42f5-daf0-9cddcfa6a650" + }, + "outputs": [], + "source": [ + "finder.build_routes()\n", + "finder.routes.compute_scores(*finder.scorers.objects())\n", + "finder.routes.all_scores[0]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DAbIjRHCcdNB" + }, + "source": [ + "... and we can put them in a nice table using Pandas" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 332 + }, + "id": "b95Rlg0QcgJk", + "outputId": "9822eec6-58e8-4665-a848-c6200d7c2c52" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "pd.DataFrame(\n", + " finder.routes.all_scores\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TyhIWbKUh7NI" + }, + "source": [ + "Next, we can instantiate another available scorer that is a route score suggested by [Badowski and co-workers](https://pubs.rsc.org/en/content/articlelanding/2019/sc/c8sc05611k).\n", + "\n", + "The limitation with this score for our current experiment is that:\n", + "- we don't have a stock that provides cost of the starting material\n", + "- we don't have a cost of each reaction\n", + "- we don't have an estimate of the yield in each step\n", + "\n", + "Therefore, we are using an assumed yield of 0.8, a reaction cost of 1 and the cost of starting material is 1 for anything that is in stock and 10 for everything else." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 349 + }, + "id": "0bTdOvJthaQU", + "outputId": "d67e8526-c08e-4998-c462-1b7b99eb5e8c" + }, + "outputs": [], + "source": [ + "from aizynthfinder.context.scoring import RouteCostScorer\n", + "scorer = RouteCostScorer(finder.config)\n", + "finder.routes.compute_scores(scorer)\n", + "pd.DataFrame(\n", + " finder.routes.all_scores\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PlOz8mSOjvbD" + }, + "source": [ + "Finally, we will create a custom scorer that takes the difference compared to the maximum depth of the search and scales it with a sigmoid-like function\n", + "\n", + "To implement a custom scorer, we need to implement at least three functions\n", + "- `__repr__` that returns a string label for the scorer. This is what is shown as the column headers above.\n", + "- `_score_node` that should implement the scoring for the nodes in the search tree\n", + "- `_score_reaction_tree` that should implement the scoring for a synthesis route\n", + "\n", + "(here we are cheating a bit on the `_score_reaction_tree` function, it is not exactly like the `_score_node`. 
See if you can spot the difference!)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 349 + }, + "id": "EyQMULGukS7x", + "outputId": "d26c4478-bc4d-4580-ce3c-ef50d853d499" + }, + "outputs": [], + "source": [ + "from aizynthfinder.context.scoring.scorers import Scorer\n", + "\n", + "class DeltaNumberOfTransformsScorer(Scorer):\n", + "\n", + "    def __init__(self, config):\n", + "        super().__init__(\n", + "            config,\n", + "            scaler_params={\"name\": \"squash\", \"slope\": -1, \"yoffset\": 0, \"xoffset\": 3},\n", + "        )\n", + "\n", + "    def __repr__(self):\n", + "        return \"delta number of transforms\"\n", + "\n", + "    def _score_node(self, node):\n", + "        return self._config.search.max_transforms - node.state.max_transforms\n", + "\n", + "    def _score_reaction_tree(self, tree):\n", + "        return self._config.search.max_transforms - len(list(tree.reactions()))\n", + "\n", + "custom_scorer = DeltaNumberOfTransformsScorer(finder.config)\n", + "finder.routes.compute_scores(custom_scorer)\n", + "pd.DataFrame(\n", + " finder.routes.all_scores\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n1Xa78iznN3H" + }, + "source": [ + "### Scoring with `rxnutils`\n", + "\n", + "Now, we will have a look at some route scoring algorithms that are available in the `rxnutils` package.\n", + "\n", + "Before we do this, we will assign classification to the predicted reactions based on the template hash. 
This is because the scores we will use are based on the NextMove reaction class, which is missing from the public USPTO model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZDxlbZ7Ounbz" + }, + "outputs": [], + "source": [ + "hash2class = {\n", + "    '00cfc3ff089c8152916b2b19d242a80c85c8d70fda9ab8b0e8a1e7b52bf3380d': '2.1.1',\n", + "    '110b771d77121b19f0a012d7c67263575768c8985321b84bf1629ea79c879d51': '6.2.2',\n", + "    '19f2a0f4c6a46a956748171b40376b6f742843423bcde2abbd8b3a6cf6b0a7b9': '9.3.1',\n", + "    '21e32ba7f5a8b5ab7b02ba748a13ecbd720a7475e580f0e1ebc3ce935e54b5a2': '2.1.1',\n", + "    '264a7579c52f48f3c97db6b8bc61252d596b1a7b85cd90bbc48eca4a4c46134c': '2.1.1',\n", + "    '3d2ded365fb5f402223c094af5dc49e39510bb1325a199cb96a43a0828c1a888': '2.1.2',\n", + "    '49f31ca31dd097e43d87899a3976cdfcec8d77ac8f69c7b16a07a0d183b5f742': '1.3.6',\n", + "    '634b5eedc2112400cedf8f3edde0c3e221ba4f01b8b7f27953b2811c42fc31f1': '2.1.10',\n", + "    '6d0c1070464abadf312613f8d200b60386b42b62476555a9b48aad53e2e82868': '6.2.1',\n", + "    '866e1a8ac889f536ad4f364036065c6e47f9d15c5f533dbb707d5e47a3434b1a': '9.3.1',\n", + "    '8bdb325ee96ea0b293bc31507a025e93f4907432629dbb7ba3f79c7ed22c5afe': '3.1.2',\n", + "    'b12cae7f7da913000dcdba7def39559511339ce3330a7426c5ca0d488d1da652': '2.1.1',\n", + "    'c42f96a0658c03bc12dc20efce06e66daaeccc675d73631eea24ea610a7227b8': '7.9.2',\n", + "    'e913b1094150d416bf8203e6e8c96f70bb606ea82d757cc0bdffcc4e7fad5f5a': '2.6.3'\n", + "}\n", + "for reaction_tree in finder.routes.reaction_trees:\n", + "    for reaction in reaction_tree.reactions():\n", + "        template_hash = reaction.metadata[\"template_hash\"]\n", + "        reaction.metadata[\"classification\"] = hash2class.get(template_hash, reaction.metadata[\"classification\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BHF3Yov4xvxS" + }, + "source": [ + "Then we will convert our `aizynthfinder` routes to the internal `SynthesisRoute` object of `rxnutils`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "baCWMQ-z4xJI", + "outputId": "a7fcffea-8f68-478f-df31-2f685baa05eb" + }, + "outputs": [], + "source": [ + "from rxnutils.routes.readers import read_aizynthfinder_dict\n", + "routes = [\n", + " read_aizynthfinder_dict(reaction_tree.to_dict())\n", + " for reaction_tree in finder.routes.reaction_trees]\n", + "routes[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-c73tPKPnqQ3" + }, + "source": [ + "These objects offer much of the same functionality as routes in `AiZynthFinder`. For example, you can extract metadata or analyze the structure of the route" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "iTWqKYSRn3Bi", + "outputId": "6518bb15-5225-4098-c8fe-5fe52623c73d" + }, + "outputs": [], + "source": [ + "routes[0].reaction_data()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "uvUw6BvPn6mz", + "outputId": "fe6d7565-ec0c-4acb-e815-de7073e8f4aa" + }, + "outputs": [], + "source": [ + "routes[0].leaves()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P13v_K2HoOGW" + }, + "source": [ + "Next, we will download some data and model files that we will need to score our routes." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "_IAyXgbCoXC2", + "outputId": "7459a660-89bf-4786-df37-a2f9efd523f6" + }, + "outputs": [], + "source": [ + "!mkdir files_scoring\n", + "!wget https://zenodo.org/records/14533779/files/reaction_class_ranks.csv?download=1 -O files_scoring/reaction_class_ranks.csv\n", + "!wget https://zenodo.org/records/14533779/files/deepset_route_scoring_sdf.onnx?download=1 -O files_scoring/deepset_route_scoring_sdf.onnx\n", + "!wget https://zenodo.org/records/14533779/files/scscore_model_1024_bits.onnx?download=1 -O files_scoring/scscore_model_1024_bits.onnx" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4DWHvCt9rWK8" + }, + "source": [ + "We will start with a score based on a ranking of the reaction classes. We have analysed the internal reaction data of AstraZeneca and have classified NextMove reaction classes based on how often such reactions have been carried out and how often they succeed.\n", + "\n", + "We will also favour the most highly ranked classes. This could also be used to favour some reaction classes that are ranked low when looking at historical data but are preferred for some other reason." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZlH6WJ8kqkeE" + }, + "outputs": [], + "source": [ + "df = pd.read_csv(\"files_scoring/reaction_class_ranks.csv\", sep = \",\")\n", + "reaction_class_ranks = dict(zip(df[\"reaction_class\"], df[\"rank_score\"]))\n", + "preferred_classes = df[df[\"rank_score\"]>=5].reaction_class" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 349 + }, + "id": "3_eAw_0oqy-w", + "outputId": "80b5e3c4-6608-4008-a787-1d81e2d572eb" + }, + "outputs": [], + "source": [ + "from rxnutils.routes.scoring import reaction_class_rank_score\n", + "scores_df = pd.DataFrame(finder.routes.all_scores)\n", + "scores_df[\"reaction class rank\"] = [\n", + " reaction_class_rank_score(route, reaction_class_ranks, preferred_classes)\n", + " for route in routes\n", + "]\n", + "scores_df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YOBYJeZjyf6N" + }, + "source": [ + "Next, we will use a trained model developed by [Kaski and co-workers](https://chemrxiv.org/engage/chemrxiv/article-details/67acb4f06dde43c90896605c).\n", + "\n", + "This requires both a trained SCScore model for featurizing the reactions, and the deepset model itself." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3eGJZ6vQvfkQ" + }, + "outputs": [], + "source": [ + "from rxnutils.chem.features.sc_score import SCScore\n", + "from rxnutils.routes.deepset.scoring import DeepsetModelClient, deepset_route_score\n", + "\n", + "# Setup SCScore model\n", + "scscorer = SCScore(\"files_scoring/scscore_model_1024_bits.onnx\")\n", + "# Setup the Deepset model client\n", + "deepset_client = DeepsetModelClient(\"files_scoring/deepset_route_scoring_sdf.onnx\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zWmOCHhsy4Wn" + }, + "source": [ + "We will then compute the raw learned score, which is an abstract distance to the experimental routes used for training the model. The lower the score, the better.\n", + "\n", + "We will also correct this for route length, i.e. the number of steps, based on expert evaluation of predicted routes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 366 + }, + "id": "ZysM5XlRvpuF", + "outputId": "82f7d689-746d-4d83-aee6-de7d489f3d91" + }, + "outputs": [], + "source": [ + "scores_df[\"deepset score\"] = [\n", + " deepset_route_score(route, deepset_client, scscorer, reaction_class_ranks)\n", + " for route in routes\n", + "]\n", + "scores_df[\"expert augment score\"] = 0.97 * scores_df[\"deepset score\"] - 0.43 * scores_df[\"number of reactions\"]\n", + "scores_df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DUGvGkjlzNb7" + }, + "source": [ + "### Route similarity and clustering\n", + "\n", + "In `rxnutils` there are routines to compute the similarity of routes based on an algorithm from [Genheden and Shields](https://pubs.rsc.org/en/content/articlelanding/2025/dd/d4dd00292j).\n", + "\n", + "We can visualize these similarities using a heatmap\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": 
"FalWUqP82BGi" + }, + "outputs": [], + "source": [ + "sel_args = RouteSelectionArguments(nmin=10)\n", + "finder.build_routes(sel_args)\n", + "routes_large_set = [\n", + " read_aizynthfinder_dict(reaction_tree.to_dict())\n", + " for reaction_tree in finder.routes.reaction_trees\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 453 + }, + "id": "zAWn6SGEzM0v", + "outputId": "990a9441-96ed-45d6-f0bc-d2c622d8ef11" + }, + "outputs": [], + "source": [ + "import seaborn as sn\n", + "from rxnutils.routes.comparison import simple_route_similarity\n", + "similarities = simple_route_similarity(routes)\n", + "sn.heatmap(similarities)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xPT-6WNU9wjG" + }, + "source": [ + "We see that route 1 is rather different from most other routes.\n", + "\n", + "You can visualize routes 0 and 1 and try to figure out why they are different." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 489 + }, + "id": "iRYYd-CNz2te", + "outputId": "fe7f1b7a-6c0d-47e6-8765-415f666f0b64" + }, + "outputs": [], + "source": [ + "routes[0].image()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 598 + }, + "id": "HLqXkxSLz4-N", + "outputId": "4ba1194e-ecd7-40eb-8ffa-1caae176c14f" + }, + "outputs": [], + "source": [ + "routes[1].image()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Tqth2bND97Yy" + }, + "source": [ + "Next, we will show how you can cluster routes, but we will start by plotting the dendrogram. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 434 + }, + "id": "Yl06SY3H2lRH", + "outputId": "ceb1a0cb-55a1-4072-bce8-5d688a1e1bfd" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from scipy.cluster.hierarchy import dendrogram\n", + "from sklearn.cluster import AgglomerativeClustering\n", + "\n", + "model2 = AgglomerativeClustering(linkage=\"single\", metric=\"precomputed\", n_clusters=None, distance_threshold=0.0)\n", + "model2.fit(1.0-similarities)\n", + "counts = np.zeros(len(model2.distances_))\n", + "matrix = np.column_stack([model2.children_, model2.distances_, counts])\n", + "_ = dendrogram(\n", + " matrix,\n", + " color_threshold=0.0,\n", + " labels=np.arange(0, len(routes)),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U7Srq33P-mNK" + }, + "source": [ + "This of course reflects what we saw in the heatmap: route 1 is rather dissimilar to the other routes, and route 0 is most similar to routes 4-6.\n", + "\n", + "If we try to create 3 clusters from these routes, we find that most routes end up in a single big cluster" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "hzjHuf6I-De_", + "outputId": "5625abd7-ba4a-44c2-f784-1592b53aea26" + }, + "outputs": [], + "source": [ + "model = AgglomerativeClustering(linkage=\"single\", metric=\"precomputed\", n_clusters=3)\n", + "model.fit(1.0-similarities)\n", + "model.labels_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HHcTWlIp-66-" + }, + "source": [ + "Try to experiment with extracting more routes from the search and repeat the analysis of scores, similarities and clusters.\n", + "\n", + "In the next tutorial, we will explore how to adjust the retrosynthesis search in various ways." 
+ ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/tutorials/README.md b/tutorials/README.md new file mode 100644 index 0000000..7b8563f --- /dev/null +++ b/tutorials/README.md @@ -0,0 +1,23 @@ +# Tutorial + + +## A - basic usage + +[Google co-lab](https://colab.research.google.com/github/MolecularAI/aizynthfinder/blob/tutorials/tutorials/A_basic_usage.ipynb) + +* How to download public models and data files +* How to write a simple configuration file +* How to select models to be used in search +* How to select stock to be used in search +* How to perform a retrosynthesis search +* How to perform basic analysis of the output + +## B - route analysis + +[Google co-lab](https://colab.research.google.com/github/MolecularAI/aizynthfinder/blob/tutorials/tutorials/B_route_analysis.ipynb) + +* How to extract routes from retrosynthesis search trees +* How to score routes with AiZynthFinder +* How to score route with rxnutils +* How to calculate route similarities +* How to cluster routes \ No newline at end of file From 5f2637072a924576df4ab46a6cfd4e62ad97d306 Mon Sep 17 00:00:00 2001 From: kpzn768 Date: Fri, 11 Apr 2025 10:48:19 +0200 Subject: [PATCH 3/3] Add tutorial C --- tutorials/C_search_parameters.ipynb | 965 ++++++++++++++++++++++++++++ tutorials/README.md | 11 +- 2 files changed, 975 insertions(+), 1 deletion(-) create mode 100644 tutorials/C_search_parameters.ipynb diff --git a/tutorials/C_search_parameters.ipynb b/tutorials/C_search_parameters.ipynb new file mode 100644 index 0000000..22fb231 --- /dev/null +++ b/tutorials/C_search_parameters.ipynb @@ -0,0 +1,965 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "DqELadNv79Vb" + }, + "source": [ + "# Tutorial C - search parameters\n", + "\n", + "In this tutorial you will learn how to control and 
modify the retrosynthesis search algorithm.\n", + "\n", + "After the completion of this tutorial, you will know:\n", + "* How to modify the stock\n", + "* How to add custom stock rules\n", + "* How to use common search parameters\n", + "* How to select and use different search algorithms\n", + "\n", + "We will start by installing the packages from PyPI" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "8EB7onol770a", + "outputId": "651517a8-3826-4374-ada0-1f89f882b56a" + }, + "outputs": [], + "source": [ + "!pip install --quiet aizynthfinder\n", + "!pip install --quiet reaction-utils[models]\n", + "!pip install --ignore-installed Pillow==9.0.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LzSuzSbv8tL_" + }, + "source": [ + "### Setup\n", + "\n", + "As in the basic tutorial, we will work with public data and models" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "WycuYZCx8sIa", + "outputId": "4042b989-4fb9-4ed1-83b5-3086a3de7c4a" + }, + "outputs": [], + "source": [ + "!mkdir --parents data && download_public_data data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mPHeOh0v8yGc" + }, + "source": [ + "And we will set up the aizynthfinder interface as in the basic tutorial...\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-BrbcnBJ8yqq" + }, + "outputs": [], + "source": [ + "import logging\n", + "from aizynthfinder.utils.logging import setup_logger\n", + "setup_logger(logging.INFO)\n", + "\n", + "from aizynthfinder.aizynthfinder import AiZynthFinder\n", + "from rdkit import Chem\n", + "from rdkit.Chem import Descriptors" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" +
}, + "id": "Y6o24r_k80oD", + "outputId": "af950503-b94e-4f96-ee32-7130679040c9" + }, + "outputs": [], + "source": [ + "finder = AiZynthFinder(\"data/config.yml\")\n", + "finder.stock.select_all()\n", + "finder.expansion_policy.select(\"uspto\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vDqPOq7M8-2b" + }, + "source": [ + "... and set it up to do retrosynthesis on amenamevir" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 185 + }, + "id": "nVdrJ0SO89zC", + "outputId": "59c86a6b-9801-42f8-84f0-3ccadc8523b3" + }, + "outputs": [], + "source": [ + "finder.target_smiles = \"Cc1cccc(C)c1N(CC(=O)Nc1ccc(-c2ncon2)cc1)C(=O)C1CCS(=O)(=O)CC1\"\n", + "display(finder.target_mol.rd_mol)\n", + "finder.tree_search(show_progress=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 492 + }, + "id": "bzdOOCldBxMZ", + "outputId": "883ec5fb-7c25-4255-c942-d31cdf064c3a" + }, + "outputs": [], + "source": [ + "finder.build_routes()\n", + "finder.routes.reaction_trees[0].to_image()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "_TP4MFobZzr8", + "outputId": "15aef6cf-3e6c-42af-fed4-4db7499288fc" + }, + "outputs": [], + "source": [ + "for leaf in finder.routes.reaction_trees[0].leafs():\n", + " print(leaf.smiles, Descriptors.ExactMolWt(leaf.rd_mol))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gnm_eWbJZe7H" + }, + "source": [ + "### Modify the stock\n", + "\n", + "We see that some of the starting materials contain several rings and are rather heavy. 
Perhaps you have a use case where you want to constrain the starting material much more without modifying your stock file.\n", + "\n", + "We will look at a few different ways to do this.\n", + "\n", + "First, you can use built-in functionality to constrain the stock using\n", + "- Amount in the stock\n", + "- Price in the stock\n", + "- Count of elements\n", + "\n", + "For the amount and price constraints, you would need a stock that contains this information. Because we use a version of ZINC without it, we will look at the last option." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "TPT-Y2S9HRqO", + "outputId": "200c778a-c9a5-4757-e904-7b75680f7f49" + }, + "outputs": [], + "source": [ + "finder.stock.set_stop_criteria({\"counts\": {\"C\": 8}})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HB3_hqDLcDsc" + }, + "source": [ + "Here we constrain the stock to anything with eight carbon atoms or fewer." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 598 + }, + "id": "dRqvyVjvcBz4", + "outputId": "1f249286-4fe8-4174-9252-671d6057ab80" + }, + "outputs": [], + "source": [ + "finder.prepare_tree() # This is important to reset the previous search!\n", + "finder.tree_search(show_progress=True)\n", + "finder.build_routes()\n", + "finder.routes.reaction_trees[0].to_image()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B1bNb56Mc_dV" + }, + "source": [ + "This is a bit clunky and imprecise, so instead we will build our own stock class that implements a constraint based on mass.\n", + "\n", + "We need to subclass `StockQueryMixin`, which provides some default functionality for a stock. 
Some of these functionalities can be overridden, but the only one that needs to be implemented is the `__contains__` method.\n", + "\n", + "This method takes a single argument, a `Molecule` object internal to `aizynthfinder`, and should return True if the molecule is in stock or False otherwise.\n", + "\n", + "You can use the `.rd_mol` or `.smiles` properties of the molecule object to access the RDKit molecule object or the SMILES string." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LYdCUj4cc-20" + }, + "outputs": [], + "source": [ + "from aizynthfinder.context.stock.queries import StockQueryMixin\n", + "\n", + "class MassCriteriaStock(StockQueryMixin):\n", + "\n", + " def __init__(self, mass_limit=180):\n", + " self._mass_limit = mass_limit\n", + "\n", + " def __contains__(self, mol):\n", + " return Descriptors.ExactMolWt(mol.rd_mol) < self._mass_limit" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cOfBgHK7eTm4" + }, + "source": [ + "This stock will only return True for molecules with a mass less than a given limit.\n", + "\n", + "Let's load it into our `finder` object and use it in the search."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "vgJ_dEJBed-g", + "outputId": "0b607756-0d53-41d8-b144-2403dd84eb43" + }, + "outputs": [], + "source": [ + "mass_stock = MassCriteriaStock()\n", + "finder.stock.load(mass_stock, \"mass\")\n", + "finder.stock.select(\"mass\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 655 + }, + "id": "xiRqd_wiepCz", + "outputId": "077134e1-f694-4582-8fb3-8b6a0b20671c" + }, + "outputs": [], + "source": [ + "finder.prepare_tree() # This is important to reset the previous search!\n", + "finder.tree_search(show_progress=True)\n", + "finder.build_routes()\n", + "finder.routes.reaction_trees[0].to_image()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6TlG1iKefinz" + }, + "source": [ + "We can also combine our custom stock with the ZINC stock" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "sFvLRhptfiDQ", + "outputId": "e3e2e452-4c04-45be-d036-33937d09e714" + }, + "outputs": [], + "source": [ + "class MassCriteriaStock2(StockQueryMixin):\n", + "\n", + " def __init__(self, molecule_stock, mass_limit=180):\n", + " self._molecule_stock = molecule_stock\n", + " self._mass_limit = mass_limit\n", + "\n", + " def __contains__(self, mol):\n", + " if Descriptors.ExactMolWt(mol.rd_mol) >= self._mass_limit:\n", + " return False\n", + " return mol in self._molecule_stock\n", + "\n", + "mass_stock2 = MassCriteriaStock2(finder.stock[\"zinc\"])\n", + "finder.stock.load(mass_stock2, \"mass2\")\n", + "finder.stock.select(\"mass2\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 661 + }, + "id": "BogqAb--gRTa", + "outputId": 
"4991ee1f-bb8a-467b-e081-b35ebaf6ea8e" + }, + "outputs": [], + "source": [ + "finder.prepare_tree() # This is important to reset the previous search!\n", + "finder.tree_search(show_progress=True)\n", + "finder.build_routes()\n", + "finder.routes.reaction_trees[0].to_image()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ItggMtV3gtfy" + }, + "source": [ + "\n", + "**Exercises**\n", + "\n", + "- Change the mass limit and explore its effect\n", + "- Implement a stock class that constrain the number of rings in the starting material" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jgGOEGq6g7H_" + }, + "source": [ + "### Modify search parameters\n", + "\n", + "Now we will explore some common search parameters\n", + "- Number of iterations\n", + "- Search depth\n", + "- Search width\n", + "\n", + "and see how they affect the search\n", + "\n", + "For this we will use the regular ZINC stock and a new molecule that `aizynthfinder` have more problem to break down" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "yuOnMeL1xdPa", + "outputId": "d537af21-9cde-4339-c0ee-6272dbf610b6" + }, + "outputs": [], + "source": [ + "finder.stock.select(\"zinc\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 167 + }, + "id": "g62Wnsophpgo", + "outputId": "1efe17d9-168f-47d9-c6e8-f431e2a3958a" + }, + "outputs": [], + "source": [ + "finder.target_smiles = \"Cc1ccc(F)c(C(=O)NC(CNC(=O)Cn2c(=O)[nH]c3nc(F)c(F)cc32)C)c1\"\n", + "display(finder.target_mol.rd_mol)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rpptPGvOx5Ut" + }, + "source": [ + "With default search setting we do not find any solved routes.." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "a9i59hsPh3-a", + "outputId": "f73f3cea-2dde-4599-89ce-362888b2459b" + }, + "outputs": [], + "source": [ + "# These are the default parameters, in case you want to go back to the earlier state\n", + "# finder.config.search.time_limit = 120\n", + "# finder.config.search.iteration_limit = 100\n", + "# finder.config.search.max_transforms = 6\n", + "finder.prepare_tree()\n", + "finder.tree_search(show_progress=True)\n", + "finder.build_routes()\n", + "finder.extract_statistics()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BnVXxZvtyAAd" + }, + "source": [ + "We will start by increasing the number of iterations in the search. For this it is also advisable to \"disable\" the time limit by setting it to something big" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "zsV_hYEsinNZ", + "outputId": "78367f7d-a43e-44ee-fada-70134ac42be4" + }, + "outputs": [], + "source": [ + "finder.config.search.time_limit = 3600\n", + "finder.config.search.iteration_limit = 200\n", + "# finder.config.search.max_transforms = 6\n", + "finder.prepare_tree()\n", + "finder.tree_search(show_progress=True)\n", + "finder.build_routes()\n", + "finder.extract_statistics()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cSc-Ie4Dy0Cn" + }, + "source": [ + "Try increasing the iteration limit further..."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "SKXw10vCy8Gl", + "outputId": "28054e89-24b8-48e8-e010-d8729bfb19b3" + }, + "outputs": [], + "source": [ + "for limit in [300, 400, 500, 1000]:\n", + " finder.config.search.iteration_limit = limit\n", + " finder.prepare_tree()\n", + " finder.tree_search(show_progress=True)\n", + " finder.build_routes()\n", + " print(\"Number of solved routes: \", finder.extract_statistics()[\"number_of_solved_routes\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DuQ2_vlZzgC6" + }, + "source": [ + "Display the first route and try to understand why `aizynthfinder` cannot break down the molecule" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 539 + }, + "id": "jaLrRputiv7j", + "outputId": "5211d4af-79f6-4434-9210-ac2644ba2781" + }, + "outputs": [], + "source": [ + "finder.routes.images[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "852yUpzYzp4r" + }, + "source": [ + "Next, we will try to increase the search depth" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uo-u4Aqezs-t" + }, + "outputs": [], + "source": [ + "finder.config.search.time_limit = 3600\n", + "finder.config.search.iteration_limit = 300\n", + "finder.config.search.max_transforms = 12\n", + "finder.prepare_tree()\n", + "finder.tree_search(show_progress=True)\n", + "finder.build_routes()\n", + "finder.extract_statistics()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KTe329QBz0Gc" + }, + "source": [ + "Adjust the maximum search depth and iteration limit, and display the routes. 
Try to figure out why `aizynthfinder` cannot break down this compound to commercial starting material\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LcWf2SPy0Ej3" + }, + "source": [ + "Changing the search width is a bit more involved, because it depends on the expansion model we use. Here, we will adjust it for the template-based model that we are using, but be aware that other expansion models might work differently" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KiTxYSZz0Q1w", + "outputId": "e2d140b1-b72f-4290-879e-7a592def7dcb" + }, + "outputs": [], + "source": [ + "finder.expansion_policy[\"uspto\"].cutoff_number" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bgLoCDL50p03", + "outputId": "d1923790-ad32-493a-cabd-329ae7fd8cd9" + }, + "outputs": [], + "source": [ + "finder.expansion_policy[\"uspto\"].cutoff_number = 100\n", + "finder.config.search.time_limit = 3600\n", + "finder.config.search.iteration_limit = 300\n", + "finder.config.search.max_transforms = 12\n", + "finder.prepare_tree()\n", + "finder.tree_search(show_progress=True)\n", + "finder.build_routes()\n", + "finder.extract_statistics()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xKe4pPGv0wSw" + }, + "source": [ + "### Search algorithm\n", + "\n", + "The default search algorithm in `aizynthfinder` is Monte Carlo Tree Search (MCTS), but there are other alternatives available.\n", + "\n", + "Here we will explore two alternatives:\n", + "- [Retro*](https://arxiv.org/abs/2006.15820)\n", + "- [Multi-objective MCTS](https://www.sciencedirect.com/science/article/pii/S2667318525000066)\n", + "\n", + "We will start with Retro*, which requires a trained model that scores potential solutions that the search produces" + ] + }, + { + "cell_type": "code", + "execution_count": 
null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bSG2BJox1VwS", + "outputId": "c71ac499-8af5-47bd-fc12-ee5e330f59b0" + }, + "outputs": [], + "source": [ + "!wget https://github.com/MolecularAI/PaRoutes/raw/refs/heads/main/publication/retrostar_value_model.pickle -O retrostar_value_model.pickle" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "drLCm7MF2ONA" + }, + "source": [ + "Then we change the search algorithm in the configuration of our `finder` object" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "v1W1_22V2Sn4", + "outputId": "ae2d7a73-33bd-41ba-8698-90db15cf1580" + }, + "outputs": [], + "source": [ + "finder.config.search.algorithm = \"aizynthfinder.search.retrostar.search_tree.SearchTree\"\n", + "finder.config.search.algorithm_config = {\n", + " \"molecule_cost\": {\n", + " \"cost\": \"aizynthfinder.search.retrostar.cost.RetroStarCost\",\n", + " \"model_path\": \"retrostar_value_model.pickle\"\n", + " }\n", + "}\n", + "finder.prepare_tree()\n", + "finder.tree" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Egew_avm4hVI" + }, + "source": [ + "We will return to amenamevir and the default search parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "EpSxAvvD26BD", + "outputId": "9a677593-659e-493d-fe9e-39f9f6084f8f" + }, + "outputs": [], + "source": [ + "finder.target_smiles = \"Cc1cccc(C)c1N(CC(=O)Nc1ccc(-c2ncon2)cc1)C(=O)C1CCS(=O)(=O)CC1\"\n", + "finder.config.search.iteration_limit = 100\n", + "finder.config.search.max_transforms = 6\n", + "finder.expansion_policy[\"uspto\"].cutoff_number = 50\n", + "finder.tree_search(show_progress=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 481 
+ }, + "id": "4pbeHDjQ2c3U", + "outputId": "84caef94-fdc5-4c4a-96be-1fd0667e8785" + }, + "outputs": [], + "source": [ + "finder.build_routes()\n", + "finder.routes.images[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KDLXYHIo40zT" + }, + "source": [ + "**Exercise**\n", + "\n", + "Run a search with both MCTS and Retro* and compare the output of the `finder.extract_statistics()` method." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XLYfuIg24-aI" + }, + "source": [ + "Next, we will set up a multi-objective MCTS search with two objectives:\n", + "\n", + "- The fraction of starting material in stock\n", + "- The average template occurrence as a simple proxy for route quality\n", + "\n", + "Take it as an exercise to try out other objectives. In principle, any of the scores that we explored in the previous tutorial can be used as an objective" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "oq-EFoCy6S9S", + "outputId": "aed05d8b-1a30-4aba-d215-30d37f7bdb78" + }, + "outputs": [], + "source": [ + "from aizynthfinder.context.scoring import FractionInStockScorer, AverageTemplateOccurrenceScorer\n", + "scorer1 = FractionInStockScorer(finder.config)\n", + "scorer2 = AverageTemplateOccurrenceScorer(\n", + " finder.config,\n", + " #scaler_params={\"name\": \"squash\", \"slope\": 0.001, \"xoffset\": 5000, \"yoffset\": 0}\n", + ")\n", + "finder.scorers.load(scorer1)\n", + "finder.scorers.load(scorer2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Tu6QSYBD_2Gn" + }, + "source": [ + "The `AverageTemplateOccurrenceScorer` scorer is an unbounded scorer, and on a completely different scale compared to the `FractionInStockScorer` scorer. 
But that should not matter for the MO-MCTS algorithm.\n", + "\n", + "However, you can try to uncomment the row above that suggests a sigmoid-like function to scale the scorer between 0 and 1.\n", + "\n", + "Next, we will set up the algorithm. The search algorithm is simply selected by choosing \"mcts\", because `aizynthfinder` will figure out from the other settings that we want a multi-objective search.\n", + "\n", + "In principle we only need to set `search_rewards` to a list of scorer names. But since we ran Retro* above, we need to restore the default settings of the other MCTS parameters as well." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "TmJzk5lw8lWM", + "outputId": "efd4cd6c-21b4-4995-f7a9-0e046f1ffdf3" + }, + "outputs": [], + "source": [ + "finder.config.search.algorithm = \"mcts\"\n", + "finder.config.search.algorithm_config = {\n", + " \"search_rewards\": [\"fraction in stock\", \"average template occurrence\"],\n", + " \"C\": 1.4,\n", + " \"default_prior\": 0.5,\n", + " \"use_prior\": True,\n", + " \"prune_cycles_in_search\": True,\n", + " \"immediate_instantiation\": (),\n", + " \"mcts_grouping\": None,\n", + " \"search_rewards_weights\": [],\n", + "}\n", + "finder.prepare_tree()\n", + "finder.tree" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "109Q2emSAfRy" + }, + "source": [ + "We will run retrosynthesis for amenamevir with default settings, but feel free to try other target compounds and/or settings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "nRvYhW5_9rZy", + "outputId": "3fa85930-4fcd-4fe8-d4e3-336207fe6c58" + }, + "outputs": [], + "source": [ + "finder.target_smiles = \"Cc1cccc(C)c1N(CC(=O)Nc1ccc(-c2ncon2)cc1)C(=O)C1CCS(=O)(=O)CC1\"\n", + "finder.config.search.iteration_limit = 100\n", + "finder.config.search.max_transforms 
= 6\n", + "finder.expansion_policy[\"uspto\"].cutoff_number = 50\n", + "finder.tree_search(show_progress=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5iLrd36O8YWZ" + }, + "source": [ + "When extracting routes from a multi-objective search it is also advantageous to extract routes on the Pareto front(s) of the objectives used in the search.\n", + "\n", + "This can be accomplished by providing a list of scorer names to the `scorer` argument of the `build_routes` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1njsgVzp8TQS", + "outputId": "8d43e534-4c2e-4f00-9780-db232e474c95" + }, + "outputs": [], + "source": [ + "finder.build_routes(\n", + " scorer=[\"fraction in stock\", \"average template occurrence\"]\n", + ")\n", + "finder.extract_statistics()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ps6ZnoSuA-iW" + }, + "source": [ + "If you check out the scores of the routes, you see that there are two scores computed. All of the extracted routes are solved, but the \"average template occurrence\" shows some variance." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "iNmgh43i_c13", + "outputId": "7ab440b8-4f28-4cd5-9e93-3c2367aab54d" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "pd.DataFrame(finder.routes.all_scores)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0AfNxxRkBLPI" + }, + "source": [ + "We can also plot these routes on a two-dimensional plot for the two objectives."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 523 + }, + "id": "bATAnG4d-dL6", + "outputId": "7a5a632b-4640-412e-cf7c-1d06efb43554" + }, + "outputs": [], + "source": [ + "from aizynthfinder.interfaces.gui.utils import pareto_fronts_plot\n", + "pareto_fronts_plot(finder.routes)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2nzQGx8A_gS6" + }, + "source": [ + "**Exercise**\n", + "\n", + "Run MO-MCTS with other objectives and/or target compounds and plot the Pareto fronts of the found solutions.\n", + "\n", + "That is all for this tutorial!" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/tutorials/README.md b/tutorials/README.md index 7b8563f..c176b38 100644 --- a/tutorials/README.md +++ b/tutorials/README.md @@ -20,4 +20,13 @@ * How to score routes with AiZynthFinder * How to score route with rxnutils * How to calculate route similarities -* How to cluster routes \ No newline at end of file +* How to cluster routes + +## C - search parameters + +[Google co-lab](https://colab.research.google.com/github/MolecularAI/aizynthfinder/blob/tutorials/tutorials/C_search_parameters.ipynb) + +* How to modify the stock +* How to add custom stock rules +* How to use common search parameters +* How to select and use different search algorithms \ No newline at end of file