WO2023279436A1 - Drug molecule intelligent generation method based on reinforcement learning and docking - Google Patents

Drug molecule intelligent generation method based on reinforcement learning and docking

Info

Publication number: WO2023279436A1
Application number: PCT/CN2021/107490
Authority: WO (WIPO PCT)
Prior art keywords: fragments, fragment, molecules, molecule, reinforcement learning
Other languages: French (fr), Chinese (zh)
Inventors: 魏志强, 王茜, 刘昊, 李阳阳, 王卓亚
Original Assignee: 中国海洋大学 (Ocean University of China)
Priority date: (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 中国海洋大学
Priority to JP2022543606A (published as JP7387962B2)
Publication of WO2023279436A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50: Molecular design, e.g. of drugs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70: Machine learning, data mining or chemometrics
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/90: Programming languages; Computing architectures; Database systems; Data warehousing
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • The invention relates to the fields of medicinal chemistry and computer technology, and in particular to an intelligent generation method for drug molecules based on reinforcement learning and docking.
  • The invention provides a method for intelligently generating drug molecules based on reinforcement learning and docking.
  • The method is based on an Actor-critic reinforcement learning model and docking simulation, and is used to generate new drug molecules with optimal properties.
  • The Actor network is modeled with a bidirectional Transformer Encoder mechanism and a DenseNet network.
  • A drug molecule intelligent generation method based on reinforcement learning and docking, which comprises the following steps:
  • Step 1 Construct a virtual fragment combination library for drug design
  • The virtual fragment combination library of drug molecules is constructed by fragmenting a set of molecules with an existing toolkit; when a molecule is split, the fragments are not classified, and all fragments are treated identically;
  • Step 2 Calculate fragment similarity for molecular fragment encoding
  • Step 3 Generate and optimize molecules based on Actor-critic reinforcement learning model
  • The Actor-critic reinforcement learning model is used to generate and optimize molecules. A molecule is modified by selecting a single fragment and one bit in that fragment's representation, and then flipping the value of that bit: if it is 0 it becomes 1, and vice versa. This makes the degree of change applied to the molecule trackable; the leading bits of the encoding remain unchanged, so the model is only allowed to change bits at the end, forcing it to search only among molecules near known compounds;
  • The Actor-critic reinforcement learning model starts from the fragmented molecular state, namely the current state. The Actor extracts and inspects all fragments, introduces the position information of the different fragments in the molecule, uses the Transformer Encoder mechanism to compute attention coefficients for the fragments of each molecule, and then, through the probabilities output by the DenseNet network, decides which fragment to replace and which fragment to replace it with. The new state is scored according to how well it satisfies all constraints; the Critic then passes to the Actor the difference, the TD-Error, between the reward plus the value of the new state and the value of the current state: if it is positive, the Actor's action is reinforced; if it is negative, the action is discouraged. The current state is then replaced by the new state, and the process repeats a given number of times;
  • The reward mechanism of the reinforcement learning model predicts the reward by building a perceptron model.
  • The perceptron model has two stages, training and prediction. During training, the data set has two sources: the positive samples are molecules with known activity reported in the literature, and the negative samples are an equal number of molecules randomly sampled from the ZINC library.
  • After shuffling the samples, the calculated activity information obtained by docking and the intrinsic molecular property information computed with an existing toolkit are used as input.
  • The model can thus learn the latent relationship between the calculated activity information, the property information, and whether a molecule is truly active;
  • During prediction, the model takes as input the calculated activity information of the generated molecules (obtained by virtually docking the generated molecules against existing PDB files of disease-related targets with fast drug docking software) and the intrinsic property information of the generated molecules (computed with a general-purpose software package), and predicts whether a generated molecule truly has real activity, thereby further optimizing the activity of the generated molecules;
  • The Actor in the reinforcement learning model is rewarded every time it generates a valid molecule, and receives a higher reward if it manages to produce molecules that meet the prediction model's expectations.
  • In step 1, when a molecule is split, all single bonds extending from a ring atom are broken, and a fragment chain list is created to record and store the original split points, which later serve as connection points for molecular design; the method allows exchanging fragments with different numbers of attachment points as long as the total number of attachment points remains constant; the open-source toolkit RDKit is used for the molecular cleavage in this process; fragments with more than 12 heavy atoms are discarded, as are fragments with 4 or more attachment points;
  • TMCS: maximum common substructure Tanimoto similarity (Tanimoto-MCS)
  • The TMCS similarity between two molecules M1 and M2 is defined as TMCS(M1, M2) = mcs(M1, M2) / (atoms(M1) + atoms(M2) - mcs(M1, M2))
  • The molecular fragment encodings in step 2 are created by constructing a balanced binary tree based on fragment similarity; the tree is then used to generate a binary string for each fragment and, by extension, a binary string representing each molecule. The order of attachment points is treated as an identifier for each fragment. When assembling the tree, the similarity between all fragments is computed, and fragment pairs are formed in a greedy bottom-up fashion: the two most similar fragments are paired first, and the process is repeated to join the two pairs with the most similar fragments into a new tree with four leaves. The similarity between two subtrees is measured as the maximum similarity between any two fragments of those trees. The joining process is repeated until all fragments are joined into a single tree;
  • Once every fragment is stored in the binary tree, the tree is used to generate encodings for all fragments. The path from the root to the leaf where a fragment is stored determines its encoding: for each branch in the tree, a one ("1") is appended to the encoding when going left, and a zero ("0") when going right. Thus, the rightmost character in the encoding corresponds to the branch closest to the fragment.
  • The invention is based on an Actor-critic reinforcement learning model and a docking simulation method for generating new molecules.
  • The model learns how to modify and improve molecules so that they have the desired properties.
  • The present invention differs from previous reinforcement learning methods in that it focuses on generating new compounds that are structurally close to existing compounds by transforming fragments of lead compounds, thereby narrowing the searched chemical space.
  • The present invention is based on the Actor-critic reinforcement learning model; the Actor network is modeled with a bidirectional Transformer Encoder mechanism and a DenseNet network, introduces the position information of the different fragments in a molecule, uses the Transformer Encoder mechanism to compute attention coefficients for the fragments of each molecule, preserves the relative or absolute position information of fragments within the molecule, and enables parallel training.
  • The reward mechanism of the reinforcement learning builds a single-layer perceptron model.
  • The input of this model contains two kinds of information: molecule-related property information and activity information.
  • The activity information is obtained by docking the generated molecules against disease-related targets with docking software, further optimizing the activity of the generated molecules.
  • The method of the present invention is estimated to generate more than 2 million candidate molecules for the targets corresponding to a specific disease.
  • By adding more than 1,000 ultra-high-dimensional parameters in the molecular docking part and fusing molecular activity with related property information, the method of the present invention can generate molecules of which more than 80% are optimized, high-quality AI molecules.
  • The method of the present invention relies on a large-scale supercomputing platform, and the molecule generation speed is significantly improved.
  • Figure 1 is the virtual molecular fragment library of Mpro-related compounds
  • Figure 2 is the subpart of the binary tree containing all fragments of Mpro-related compounds
  • Figure 3 is the framework diagram of the Actor-critic reinforcement learning model
  • Figure 4 shows the details of the Actor in the Actor-critic reinforcement learning model
  • Figure 5 shows the generated active compound molecules for the COVID-19 Mpro target.
  • The main goal of this example is the generation of active compounds against the COVID-19 Mpro target: starting from an initial set of lead compounds, these molecules are improved and optimized by replacing some of their fragments, producing new active compounds against the Mpro target with the desired properties.
  • This embodiment is based on an Actor-critic reinforcement learning model and a docking simulation method for generating new drug molecules with optimal properties. The technical solution of this embodiment is described in detail below.
  • A drug molecule intelligent generation method based on the Actor-critic reinforcement learning model and docking, which comprises the following steps:
  • Step 1 Construct a virtual fragment combination library for drug design.
  • The virtual fragment combination library of drug molecules is constructed by fragmenting a set of molecules.
  • The virtual fragment library in this example is built jointly from 10172 compounds related to the Mpro target in the medicinal chemistry database ChEMBL and 175 lead compounds for the Mpro target obtained in the laboratory by molecular docking screening, as shown in Figure 1.
  • A common approach to fragmenting molecules is to group them into categories such as ring structures, side chains, and linkers. When splitting molecules, essentially the same scheme is followed, but the fragments are not sorted into categories; all fragments are thus treated identically. To break a molecule, all single bonds extending from a ring atom are broken.
  • Step 2 Calculate fragment similarity for molecular fragment encoding.
  • Step 2.1 Calculate the similarity between fragments
  • mcs(M1,M2) is the number of atoms in the largest common substructure of molecules M1 and M2
  • atoms(M1) and atoms(M2) are the number of atoms in molecules M1 and M2, respectively.
  • An advantage of Tanimoto-MCS similarity is that it directly compares the structures of the fragments and thus does not depend on any other specific representation. This approach usually works well when comparing "drug-like" molecules.
  • The Levenshtein distance is defined as the minimum number of insertions, deletions, and substitutions required to make two strings identical. Considering the effect of transposition operations on the edit distance, this embodiment finally adopts the Damerau-Levenshtein distance, an improvement on the Levenshtein distance; the Damerau-Levenshtein distance between two strings is then defined as:
  • All fragments are encoded into binary strings. These strings are created by building a balanced binary tree based on segment similarity. This tree is then used to generate binary strings for each fragment, and thus in extensions, to generate binary strings representing molecules. The order of attachment points is treated as an identifier for each fragment.
  • the path from the root to the leaf where the fragments are stored determines the encoding of each fragment.
  • A one ("1") is appended to the encoding when going left, and a zero ("0") when going right, as shown in Figure 2; thus, the rightmost character in the encoding corresponds to the branch closest to the fragment.
  • Step 3 Generate and optimize molecules based on the Actor-critic reinforcement learning model.
  • The present invention uses an Actor-critic reinforcement learning model to generate and optimize molecules; a modification is made by selecting a single fragment of the molecule and one bit in that fragment's representation, and then flipping the value of that bit, i.e., if it is 0 it becomes 1, and vice versa.
  • This allows tracking of the degree of change applied to the molecule, since modification bits at the end of the code will represent changes for very similar fragments, while changes at the beginning will represent changes for very different types of fragments.
  • The leading bits of the encoding remain the same, so the model is only allowed to change bits at the end, forcing it to search only for molecules near known compounds, as shown in Figure 3.
  • The Actor-critic reinforcement learning model starts from the fragmented molecular state, namely the current state S.
  • The Actor extracts and inspects all fragments and uses the bidirectional Transformer Encoder mechanism and the DenseNet network to decide which fragment to replace and which fragment to replace it with; the action Ai taken by the Actor yields the new state Si.
  • The new state Si is given a score R according to how well it satisfies all the constraints.
  • The Actor network is modeled with a bidirectional Transformer Encoder mechanism and a DenseNet network; it introduces the position information of the different fragments in a molecule and uses the Transformer Encoder mechanism to compute attention coefficients for the fragments of each molecule.
  • This structure reads the encoded fragments representing one molecule at a time.
  • The forward and backward outputs are concatenated, and an estimate of the probability distribution over which fragment to change and what to change it to is computed by passing the concatenated representation through a DenseNet neural network.
  • each molecule is constructed as a sequence of fragments, which is passed to the Transformer encoder mechanism in one pass.
  • the importance of different fragments is obtained by calculating the attention coefficients of different fragments in each molecule.
  • The forward and backward Transformer Encoder outputs a vectorized representation of the molecule that captures the correlations between its fragments; finally, the concatenated result is classified by the DenseNet network, which computes an estimate of the probability distribution over which fragment to change and what to change it to, as shown in Figure 4.
  • a major challenge in drug discovery is designing molecules optimized for multiple properties that may not correlate well.
  • two different classes of properties were selected that characterize the viability of a molecule as a suitable drug.
  • The aim of the invention is to generate drug molecules that more closely match the properties of real active molecules, i.e., molecules in the target's "sweet spot".
  • The selected properties include the intrinsic property information of the molecule itself (e.g., MW, clogP, and PSA) and the calculated activity information of the molecule (i.e., the docking results of the molecule against the targets corresponding to a specific disease).
  • the model includes two stages of training and prediction.
  • The data set has two sources: the positive samples are molecules with known activity reported in the literature, and the negative samples are an equal number of molecules randomly sampled from the ZINC library. After the samples are shuffled, the calculated activity information obtained by docking and the intrinsic molecular property information computed with an existing toolkit are used as input. After multiple rounds of training, the model learns the latent relationship between the calculated activity information, the property information, and whether a molecule is truly active.
  • The model takes the calculated activity information of the generated molecules, obtained by virtual molecular docking of the generated molecules against disease-related targets using fast drug docking software.
  • The model uses drug docking software, such as LeDock, to dock the at most 512 molecules generated per epoch against the existing PDB files of 380 different conformations of COVID-19 Mpro-related targets.
  • The intrinsic property information of the generated molecules is computed with the general-purpose software package RDKit.
  • In total, 1143 ultra-high-dimensional parameters combining the calculated activity information and the intrinsic property information of the molecule itself are used as input to the single-layer perceptron to predict whether a generated molecule truly has real activity, thereby further optimizing the activity of the generated molecules.
  • An actor in this reinforcement learning framework is rewarded for each valid molecule it produces, with higher rewards if it manages to produce molecules that match the prediction model's expectations.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Medicinal Chemistry (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A drug molecule intelligent generation method based on reinforcement learning and docking, belonging to the technical field of drug chemistry and computers. The method comprises the following steps: 1) constructing a virtual fragment combination library for drug design; 2) calculating fragment similarity, and performing molecular fragment coding; and 3) generating and optimizing a molecule on the basis of an actor-critic model for reinforcement learning. In the described method, on the basis of a lead compound, the chemical space of a search is reduced. Transformer modeling is used by means of an actor-critic model for reinforcement learning, position information of a molecular fragment is introduced, and relative or absolute position information of the fragment in a molecule is stored, thereby achieving parallel training. Furthermore, by means of establishing a single-layer perceptron model, a reward mechanism further optimizes the activity of a generated molecule.

Description

A method for intelligent generation of drug molecules based on reinforcement learning and docking

Technical Field
The invention relates to the fields of medicinal chemistry and computer technology, and in particular to an intelligent generation method for drug molecules based on reinforcement learning and docking.
Background Art
Designing and manufacturing safe and effective compounds is central to medicinal chemistry. In terms of money and time, this is a long, complex, and difficult multi-parameter optimization process. Promising compounds carry a high risk (>90%) of failing in clinical trials, resulting in unnecessary waste of resources. The average cost of bringing a new drug to market now far exceeds $1 billion, and the average time from discovery to market is 13 years. For some substances, the average time from discovery to commercial production can be even longer, e.g., 25 years for high-energy molecules. A critical first step in molecular discovery is generating a pool of candidates for computational study or for synthesis and characterization. This is a daunting task because the chemical space of possible molecules is enormous: the number of potential drug-like compounds is estimated to be between 10^23 and 10^60, while the number of all compounds ever synthesized is on the order of 10^8. Heuristics, such as Lipinski's "rule of five" for pharmaceuticals, can help narrow the space of possibilities, but significant challenges remain.
With the revolution in computer technology, using AI for drug discovery is becoming a trend. Traditionally, combinations of various computational models have been used toward this goal, such as quantitative structure-activity relationships (QSAR), molecular substitution, molecular simulation, and molecular docking. But traditional approaches are combinatorial in nature and often make most generated molecules unstable or unsynthesizable. In recent years, many generative models based on deep learning have emerged for designing drug-like compounds, such as molecule generation methods based on variational autoencoders and on generative adversarial networks. However, current methods still need improvement in the generation speed, validity, and molecular activity of candidate compounds.
Summary of the Invention
The invention provides a method for intelligently generating drug molecules based on reinforcement learning and docking. The method is based on an Actor-critic reinforcement learning model and docking simulation, and is used to generate new drug molecules with optimal properties. The Actor network is modeled with a bidirectional Transformer Encoder mechanism and a DenseNet network.
To solve the above problems, the present invention is realized through the following technical solution:
A drug molecule intelligent generation method based on reinforcement learning and docking, comprising the following steps:
Step 1. Construct a virtual fragment combination library for drug design.

The virtual fragment combination library of drug molecules is constructed by fragmenting a set of molecules with an existing toolkit; when a molecule is split, the fragments are not classified, and all fragments are treated identically.
Step 2. Calculate fragment similarity for molecular fragment encoding

A combination of existing methods for computing chemical similarity is used to measure the similarity between different molecular fragments. By constructing a similarity-based balanced binary tree, all fragments are encoded into binary strings, so that similar fragments obtain similar encodings.
Step 3. Generate and optimize molecules based on the Actor-critic reinforcement learning model

(1) Framework of the Actor-critic reinforcement learning model

The Actor-critic reinforcement learning model is used to generate and optimize molecules. A molecule is modified by selecting a single fragment and one bit in that fragment's representation, and then flipping the value of that bit: if it is 0 it becomes 1, and vice versa. This makes the degree of change applied to the molecule trackable; the leading bits of the encoding remain unchanged, so the model is only allowed to change bits at the end, forcing it to search only among molecules near known compounds.

The Actor-critic reinforcement learning model starts from the fragmented molecular state, namely the current state. The Actor extracts and inspects all fragments, introduces the position information of the different fragments in the molecule, uses the Transformer Encoder mechanism to compute attention coefficients for the fragments of each molecule, and then, through the probabilities output by the DenseNet network, decides which fragment to replace and which fragment to replace it with. The new state is scored according to how well it satisfies all constraints. The Critic then passes to the Actor the difference, the TD-Error, between the reward plus the value of the new state and the value of the current state: if it is positive, the Actor's action is reinforced; if it is negative, the action is discouraged. The current state is then replaced by the new state, and the process repeats a given number of times.
(2) Optimization of the reward mechanism of the reinforcement learning model

Molecules are designed to be optimized for two kinds of properties: the intrinsic property information of the molecule itself and the calculated activity information of the molecule. The reward mechanism of the reinforcement learning model predicts the reward by building a perceptron model, which has two stages, training and prediction. During training, the data set has two sources: the positive samples are molecules with known activity reported in the literature, and the negative samples are an equal number of molecules randomly sampled from the ZINC library. After shuffling the positive and negative samples, the calculated activity information obtained by docking and the intrinsic molecular property information computed with an existing toolkit are used as input; after multiple rounds of training, the model learns the latent relationship between the calculated activity information, the property information, and whether a molecule is truly active. During prediction, the model takes as input the calculated activity information of the generated molecules (obtained by virtually docking the generated molecules against existing PDB files of disease-related targets with fast drug docking software) and the intrinsic property information of the generated molecules (computed with a general-purpose software package), and predicts whether a generated molecule truly has real activity, thereby further optimizing the activity of the generated molecules. The Actor in the reinforcement learning model is rewarded every time it generates a valid molecule, and receives a higher reward if it manages to produce molecules that meet the prediction model's expectations.
Further, in step 1, when a molecule is split, all single bonds extending from a ring atom are broken, and a fragment chain list is created to record and store the original split points, which later serve as connection points for molecular design. The method allows exchanging fragments with different numbers of attachment points as long as the total number of attachment points remains constant. The open-source toolkit RDKit is used for the molecular cleavage in this process. Fragments with more than 12 heavy atoms are discarded, as are fragments with 4 or more attachment points.
Further, when computing the inter-fragment similarity in step 2, the maximum common substructure Tanimoto-MCS (TMCS) similarity is used to compare "drug-like" molecules; for smaller fragments, the Damerau-Levenshtein distance, an improvement on the Levenshtein distance, is introduced. With base cases d_{a,b}(i, 0) = i and d_{a,b}(0, j) = j, the Damerau-Levenshtein distance between two strings is defined as:

$$
d_{a,b}(i,j)=\min\begin{cases}
d_{a,b}(i-1,\,j)+1 & \text{(deletion)}\\
d_{a,b}(i,\,j-1)+1 & \text{(insertion)}\\
d_{a,b}(i-1,\,j-1)+\mathbf{1}_{(a_i\neq b_j)} & \text{(substitution)}\\
d_{a,b}(i-2,\,j-2)+1 & \text{(transposition, if } i,j>1,\ a_i=b_{j-1},\ a_{i-1}=b_j\text{)}
\end{cases}
$$
The TMCS similarity between two molecules M1 and M2 is defined as:

$$
\mathrm{TMCS}(M_1,M_2)=\frac{\mathrm{mcs}(M_1,M_2)}{\mathrm{atoms}(M_1)+\mathrm{atoms}(M_2)-\mathrm{mcs}(M_1,M_2)}
$$
The similarity between two molecules M1 and M2, with corresponding SMILES representations S1 and S2, is then measured as Max(TMCS(M1, M2), DL(S1, S2)).
Further, for the molecular fragment encoding in step 2: the binary strings are created by constructing a balanced binary tree based on fragment similarity; the tree is then used to generate a binary string for each fragment and, by extension, a binary string representing each molecule. The order of attachment points is treated as an identifier for each fragment. When assembling the tree, the similarity between all fragments is computed, and fragment pairs are formed in a greedy bottom-up fashion: the two most similar fragments are paired first, and the process is repeated to join the two pairs with the most similar fragments into a new tree with four leaves. The similarity between two subtrees is measured as the maximum similarity between any two fragments of those trees. The joining process is repeated until all fragments are joined into a single tree.
Once every fragment is stored in the binary tree, the tree is used to generate encodings for all fragments. The path from the root to the leaf where a fragment is stored determines its encoding: for each branch in the tree, a one ("1") is appended to the encoding when going left, and a zero ("0") when going right. Thus, the rightmost character in the encoding corresponds to the branch closest to the fragment.
Advantageous effects of the present invention compared with the prior art:
The invention is based on an Actor-critic reinforcement learning model and a docking simulation method for generating new molecules. The model learns how to modify and improve molecules so that they have the desired properties.
(1) The present invention differs from previous reinforcement learning methods in that it focuses on generating new compounds that are structurally close to existing compounds by transforming fragments of lead compounds, thereby narrowing the searched chemical space.

(2) The present invention is based on the Actor-critic reinforcement learning model. The Actor network is modeled with a bidirectional Transformer Encoder mechanism and a DenseNet network; it introduces the position information of the different fragments in a molecule, uses the Transformer Encoder mechanism to compute attention coefficients for the fragments of each molecule, preserves the relative or absolute position information of fragments within the molecule, and enables parallel training.

(3) The reward mechanism of the reinforcement learning builds a single-layer perceptron model whose input contains two kinds of information: molecule-related property information and activity information. The activity information is obtained by docking the generated molecules against disease-related targets with docking software, further optimizing the activity of the generated molecules.

(4) In terms of candidate scale, the method of the present invention is estimated to generate more than 2 million candidate molecules for the targets corresponding to a specific disease.

(5) By adding more than 1,000 ultra-high-dimensional parameters in the molecular docking part and fusing molecular activity with related property information, the method of the present invention can generate molecules of which more than 80% are optimized, high-quality AI molecules.

(6) The method of the present invention relies on a large-scale supercomputing platform, and the molecule generation speed is significantly improved.
Description of the Drawings
Figure 1 is the virtual molecular fragment library of Mpro-related compounds;
Figure 2 is the subpart of the binary tree containing all fragments of Mpro-related compounds;
Figure 3 is the framework diagram of the Actor-critic reinforcement learning model;
Figure 4 shows the details of the Actor in the Actor-critic reinforcement learning model;
Figure 5 shows the generated active compound molecules for the COVID-19 Mpro target.
Detailed Description
The technical solution of the present invention is further explained below through an embodiment in conjunction with the accompanying drawings, but the protection scope of the present invention is not limited in any form by the embodiment.

Example 1

The main goal of this example is the generation of active compounds against the COVID-19 Mpro target: starting from an initial set of lead compounds, these molecules are improved and optimized by replacing some of their fragments, producing new active compounds against the Mpro target with the desired properties. This example is based on an Actor-critic reinforcement learning model and a docking simulation method for generating new drug molecules with optimal properties. The technical solution of this example is described in detail below.
A drug molecule intelligent generation method based on the Actor-critic reinforcement learning model and docking, comprising the following steps:

Step 1. Construct a virtual fragment combination library for drug design.
The virtual fragment combination library of drug molecules is constructed by fragmenting a set of molecules. The virtual fragment library in this example is built jointly from 10172 compounds related to the Mpro target in the medicinal chemistry database ChEMBL and 175 lead compounds for the Mpro target obtained in the laboratory by molecular docking screening, as shown in Figure 1. A common approach to fragmenting molecules is to group the fragments into categories such as ring structures, side chains, and linkers. When splitting molecules, essentially the same scheme is followed, but the fragments are not sorted into categories; all fragments are treated identically. To break a molecule, all single bonds extending from a ring atom are broken. When a molecule is split, a fragment chain list is created to record and store the original split points, which serve as connection points for subsequent molecular design. The method allows exchanging fragments with different numbers of attachment points as long as the total number of attachment points remains the same. The existing open-source cheminformatics toolkit RDKit is used for the molecular cleavage. In this process, fragments with more than 12 heavy atoms are discarded, as are fragments with 4 or more attachment points. These constraints are enforced to reduce complexity while still generating a large number of interesting candidates.
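As an illustration of this splitting scheme, the sketch below (Python with RDKit; the example SMILES is hypothetical and the attachment-point bookkeeping is simplified) breaks every acyclic single bond that touches a ring atom and applies the two fragment filters described above:

```python
from rdkit import Chem

MAX_HEAVY_ATOMS = 12     # fragments with more heavy atoms are discarded
MAX_ATTACH_POINTS = 3    # fragments with 4 or more attachment points are discarded

def fragment_molecule(mol):
    """Break all acyclic single bonds that extend from a ring atom;
    dummy atoms (*) left by RDKit mark the original split points."""
    cut_bonds = [
        b.GetIdx() for b in mol.GetBonds()
        if b.GetBondType() == Chem.BondType.SINGLE
        and not b.IsInRing()
        and (b.GetBeginAtom().IsInRing() or b.GetEndAtom().IsInRing())
    ]
    if not cut_bonds:
        return [mol]
    pieces = Chem.FragmentOnBonds(mol, cut_bonds)
    frags = Chem.GetMolFrags(pieces, asMols=True)
    kept = []
    for f in frags:
        attach = sum(1 for a in f.GetAtoms() if a.GetAtomicNum() == 0)
        heavy = sum(1 for a in f.GetAtoms() if a.GetAtomicNum() > 1)
        if heavy <= MAX_HEAVY_ATOMS and attach <= MAX_ATTACH_POINTS:
            kept.append(f)
    return kept

# Hypothetical example molecule, not one from the Mpro library:
mol = Chem.MolFromSmiles("c1ccccc1CCNC(=O)c1ccncc1")
print([Chem.MolToSmiles(f) for f in fragment_molecule(mol)])
```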
Step 2. Calculate fragment similarity for molecular fragment encoding.

Step 2.1 Calculate the similarity between fragments

In this example, all fragments are encoded as binary strings, and the goal of the encoding is that similar fragments should obtain similar encodings. The similarity between fragments must therefore be measured. There are many ways to compute chemical similarity. A molecular fingerprint is a direct binary encoding in which similar molecules should, in principle, receive similar codes; however, when comparing molecular fragments, with their inherently sparse representations, fingerprints turn out to be less useful for this purpose. A chemically intuitive way to measure the similarity between molecules is the maximum common substructure Tanimoto-MCS (TMCS) similarity:
$$
\mathrm{TMCS}(M_1,M_2)=\frac{\mathrm{mcs}(M_1,M_2)}{\mathrm{atoms}(M_1)+\mathrm{atoms}(M_2)-\mathrm{mcs}(M_1,M_2)}
$$
Here, mcs(M1, M2) is the number of atoms in the largest common substructure of molecules M1 and M2, and atoms(M1) and atoms(M2) are the numbers of atoms in M1 and M2, respectively.
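A minimal sketch of this similarity with RDKit's FMCS module (the timeout value is an assumption):

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

def tmcs_similarity(m1, m2):
    """Tanimoto coefficient computed over the maximum common substructure."""
    mcs = rdFMCS.FindMCS([m1, m2], timeout=10).numAtoms
    a1, a2 = m1.GetNumAtoms(), m2.GetNumAtoms()
    return mcs / (a1 + a2 - mcs)

m1 = Chem.MolFromSmiles("c1ccccc1O")   # hypothetical fragments
m2 = Chem.MolFromSmiles("c1ccccc1N")
print(round(tmcs_similarity(m1, m2), 3))
```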
An advantage of Tanimoto-MCS similarity is that it directly compares the structures of the fragments and thus does not depend on any other specific representation. This approach usually works well when comparing "drug-like" molecules. However, using Tanimoto-MCS similarity for smaller fragments has drawbacks. The present invention therefore introduces the Levenshtein distance, a common method for measuring the similarity between two text strings. The Levenshtein distance is defined as the minimum number of insertions, deletions, and substitutions required to make two strings identical. Considering the effect of transposition operations on the edit distance, this example finally adopts the Damerau-Levenshtein distance, which improves on the Levenshtein distance. With base cases d_{a,b}(i, 0) = i and d_{a,b}(0, j) = j, the Damerau-Levenshtein distance between two strings is defined as:
$$
d_{a,b}(i,j)=\min\begin{cases}
d_{a,b}(i-1,\,j)+1 & \text{(deletion)}\\
d_{a,b}(i,\,j-1)+1 & \text{(insertion)}\\
d_{a,b}(i-1,\,j-1)+\mathbf{1}_{(a_i\neq b_j)} & \text{(substitution)}\\
d_{a,b}(i-2,\,j-2)+1 & \text{(transposition, if } i,j>1,\ a_i=b_{j-1},\ a_{i-1}=b_j\text{)}
\end{cases}
$$
As a compromise, the similarity between two molecules M1 and M2, with corresponding SMILES representations S1 and S2, is measured as

Max(TMCS(M1, M2), DL(S1, S2))
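A plain-Python sketch of the (restricted) Damerau-Levenshtein distance follows; the text does not specify how the DL distance is normalized before being compared with the TMCS similarity in the Max(...) expression above, so the conversion to a similarity shown here is an assumption:

```python
def damerau_levenshtein(s1, s2):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]

def dl_similarity(s1, s2):
    """Assumed normalization: map the distance into [0, 1] for comparison."""
    return 1.0 - damerau_levenshtein(s1, s2) / max(len(s1), len(s2), 1)

# Combined measure from the text, using the tmcs_similarity sketch above:
# similarity = max(tmcs_similarity(m1, m2), dl_similarity(smiles1, smiles2))
```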
Step 2.2 Molecular fragment encoding

All fragments are encoded into binary strings. These strings are created by building a balanced binary tree based on fragment similarity. The tree is then used to generate a binary string for each fragment and, by extension, a binary string representing each molecule. The order of attachment points is treated as an identifier for each fragment. When assembling the tree, the similarity between all fragments is computed. Fragment pairs are then formed in a greedy bottom-up fashion, where the two most similar fragments are paired first. The process is repeated, joining the two pairs with the most similar fragments into a new tree with four leaves. The similarity between two subtrees is measured as the maximum similarity between any two fragments of those trees. The joining process is repeated until all fragments are joined into a single tree.
Since every fragment is stored in the binary tree, the tree can be used to generate encodings for all fragments. The path from the root to the leaf where a fragment is stored determines its encoding. For each branch in the tree, a one ("1") is appended to the encoding when going left, and a zero ("0") when going right, as shown in Figure 2; thus, the rightmost character in the encoding corresponds to the branch closest to the fragment.
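A sketch of the tree construction and code assignment (fragments are represented by their SMILES strings, similarity stands for the combined measure from step 2.1, and the level-wise greedy pairing is one interpretation of the description above):

```python
class Node:
    def __init__(self, fragment=None, left=None, right=None):
        self.fragment, self.left, self.right = fragment, left, right

def pair_level(trees, similarity):
    """Greedily pair the most similar subtrees of the current level.
    Each entry is (node, leaf_fragments); subtree similarity is the
    maximum similarity between any two fragments of the subtrees."""
    out = []
    rest = list(trees)
    while len(rest) > 1:
        _, i, j = max(
            ((max(similarity(a, b) for a in rest[i][1] for b in rest[j][1]), i, j)
             for i in range(len(rest)) for j in range(i + 1, len(rest))),
            key=lambda t: t[0])
        out.append((Node(left=rest[i][0], right=rest[j][0]),
                    rest[i][1] + rest[j][1]))
        rest = [t for k, t in enumerate(rest) if k not in (i, j)]
    out.extend(rest)          # an odd subtree is carried up unpaired
    return out

def build_tree(fragments, similarity):
    trees = [(Node(fragment=f), [f]) for f in fragments]
    while len(trees) > 1:
        trees = pair_level(trees, similarity)
    return trees[0][0]

def assign_codes(node, prefix="", codes=None):
    """Left branch appends '1', right branch appends '0'."""
    codes = {} if codes is None else codes
    if node.fragment is not None:
        codes[node.fragment] = prefix
    else:
        assign_codes(node.left, prefix + "1", codes)
        assign_codes(node.right, prefix + "0", codes)
    return codes
```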
Step 3. Generate and optimize molecules based on the Actor-critic reinforcement learning model.

Step 3.1 Framework of the Actor-critic reinforcement learning model

The present invention uses an Actor-critic reinforcement learning model to generate and optimize molecules. A modification is made by selecting a single fragment of the molecule and one bit in that fragment's representation, and then flipping the value of that bit: if it is 0 it becomes 1, and vice versa. This makes the degree of change applied to the molecule trackable, since flipping a bit at the end of the encoding represents a change to a very similar fragment, while a change at the beginning represents a change to a very different type of fragment. The leading bits of the encoding remain unchanged, so the model is only allowed to change bits at the end, forcing it to search only among molecules near known compounds, as shown in Figure 3.
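The action itself is a constrained bit flip; a minimal sketch, where the number of mutable trailing bits n_mutable is an assumed hyperparameter:

```python
def flip_bit(encoding: str, position: int, n_mutable: int) -> str:
    """Flip one bit of a fragment encoding. Only the last n_mutable
    positions may change, so the leading bits (coarse similarity
    structure) stay fixed and the search remains near known compounds."""
    if position < len(encoding) - n_mutable:
        raise ValueError("only trailing bits may be modified")
    flipped = "1" if encoding[position] == "0" else "0"
    return encoding[:position] + flipped + encoding[position + 1:]

print(flip_bit("110100", position=5, n_mutable=2))  # -> 110101
```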
The Actor-critic reinforcement learning model starts from the fragmented molecular state, namely the current state S. The Actor extracts and inspects all fragments and uses the bidirectional Transformer Encoder mechanism and the DenseNet network to decide which fragment to replace and which fragment to replace it with; the action Ai taken by the Actor yields the new state Si. The new state Si is given a score R according to how well it satisfies all the constraints. The Critic then examines the difference, the TD-error, between the reward plus the value of Si and the value of S, and passes it to the Actor. If it is positive, the Actor's action Ai is reinforced; if it is negative, the action is discouraged. The current state is then replaced by the new state, and the process repeats a given number of times. The loss function is loss = -log(prob) * td_error.
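A minimal PyTorch sketch of this update loop, using the loss given above (network sizes, the discount factor, and the simple feed-forward actor standing in for the Transformer/DenseNet stack of step 3.2 are all assumptions):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        # Placeholder for the bidirectional Transformer Encoder + DenseNet actor.
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_actions))
    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)    # P(action | state)

class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)               # V(s)

def ac_update(actor, critic, opt_a, opt_c, s, a, r, s_next, gamma=0.99):
    # TD-error: reward plus discounted value of the new state, minus V(s).
    td_error = r + gamma * critic(s_next).detach() - critic(s)
    critic_loss = td_error.pow(2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    # Actor loss exactly as in the text: loss = -log(prob) * td_error.
    prob = actor(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)
    actor_loss = (-torch.log(prob) * td_error.detach()).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```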
Step 3.2 Network structure of the Actor in the reinforcement learning model

The Actor network is modeled with a bidirectional Transformer Encoder mechanism and a DenseNet network. It introduces the position information of the different fragments in a molecule and uses the Transformer Encoder mechanism to compute attention coefficients for the fragments of each molecule. The structure reads the encoded fragments representing one molecule at a time, concatenates the forward and backward outputs, and computes an estimate of the probability distribution over which fragment to change and what to change it to by passing the concatenated representation through a DenseNet neural network.
Because the probability of replacing a fragment depends on the preceding and trailing fragments of the molecule, each molecule is constructed as a sequence of fragments, which is passed to the Transformer Encoder mechanism in one pass. The importance of the different fragments is obtained by computing attention coefficients for the fragments of each molecule. The forward and backward Transformer Encoder then outputs a vectorized representation of the molecule that captures the correlations between its fragments. Finally, the concatenated result is classified by the DenseNet network, which computes an estimate of the probability distribution over which fragment to change and what to change it to, as shown in Figure 4.
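A sketch of this encoder stage in PyTorch (the vocabulary size, embedding width, head count, and layer count are assumptions, and the DenseNet classification head is omitted):

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, MAX_FRAGS = 4096, 128, 64   # assumed sizes

emb = nn.Embedding(VOCAB, D_MODEL)          # fragment embeddings
pos = nn.Embedding(MAX_FRAGS, D_MODEL)      # fragment position information
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)

frag_ids = torch.randint(0, VOCAB, (1, 10))  # one molecule as 10 fragment ids
positions = torch.arange(10).unsqueeze(0)
x = emb(frag_ids) + pos(positions)           # inject position information
h = encoder(x)   # self-attention weighs every fragment against all others
print(h.shape)   # torch.Size([1, 10, 128]): per-fragment contextual vectors
```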
Step 3.3 Optimization of the reward mechanism of the reinforcement learning model

A major challenge in drug discovery is designing molecules optimized for multiple properties that may not correlate well. To show that the proposed method can handle this situation, two different classes of properties were selected that characterize the viability of a molecule as a drug. The aim of the invention is to generate drug molecules that more closely match the properties of real active molecules, i.e., molecules in the target's "sweet spot". As described above, the selected properties include the intrinsic property information of the molecule itself (e.g., MW, clogP, and PSA) and the calculated activity information of the molecule (i.e., the docking results of the molecule against the targets corresponding to a specific disease). It is particularly worth emphasizing that the reward mechanism of the reinforcement learning model in the present invention predicts the reward by building a single-layer perceptron model. The model has two stages, training and prediction. During training, the data set has two sources: the positive samples are molecules with known activity reported in the literature, and the negative samples are an equal number of molecules randomly sampled from the ZINC library. After shuffling the positive and negative samples, the calculated activity information obtained by docking and the intrinsic molecular property information computed with an existing toolkit are used as input; after multiple rounds of training, the model learns the latent relationship between the calculated activity information, the property information, and whether a molecule is truly active. During prediction, the model takes the calculated activity information of the generated molecules, obtained by virtual molecular docking of the generated molecules against disease-related targets using fast drug docking software. The model uses docking software such as LeDock to dock the at most 512 molecules generated per epoch against the existing PDB files of 380 different conformations of COVID-19 Mpro-related targets. The intrinsic property information of the generated molecules is computed with the general-purpose software package RDKit. In total, 1143 ultra-high-dimensional parameters combining the calculated activity information and the intrinsic property information of the molecule itself are used as input to the single-layer perceptron to predict whether a generated molecule truly has real activity, thereby further optimizing the activity of the generated molecules.
An actor in this reinforcement learning framework is rewarded for each valid molecule it produces, with higher rewards if it manages to produce molecules that match the prediction model's expectations.
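A sketch of this reward predictor (the 1143-dimensional input matches the parameter count given above; the optimizer, learning rate, and feature layout are assumptions):

```python
import torch
import torch.nn as nn

# Single-layer perceptron: 1143 features -> probability of being truly active.
reward_model = nn.Sequential(nn.Linear(1143, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def train_step(x, y):
    """x: [batch, 1143] feature rows (shuffled positives and ZINC negatives);
    y: [batch] labels, 1.0 = known active, 0.0 = sampled decoy."""
    pred = reward_model(x).squeeze(-1)
    loss = loss_fn(pred, y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# The predicted probability can then serve as the RL reward signal, e.g.:
# reward = reward_model(features_of(new_molecule)).item()
```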
The finally generated active compound molecules for the COVID-19 Mpro target are shown in Figure 5.
It should be noted that although the embodiment of the present invention described above is illustrative, it does not limit the present invention, and the present invention is therefore not restricted to the above specific implementation. Without departing from the principles of the present invention, any other implementation obtained by those skilled in the art under the teaching of the present invention is deemed to fall within the protection of the present invention.

Claims (4)

  1. A method for intelligently generating drug molecules based on reinforcement learning and docking, characterized in that the method comprises the following steps:
    Step 1. Construct a virtual fragment combination library for drug design.
    The virtual fragment combination library of drug molecules is built by fragmenting a set of molecules with an existing toolkit; when molecules are split, the fragments are not classified, and all fragments are treated identically.
    Step 2. Compute fragment similarity and encode molecular fragments.
    An existing combination of chemical similarity measures is used to quantify the similarity between different molecular fragments, and a similarity-based balanced binary tree is constructed so that every fragment is encoded as a binary string; similar fragments therefore receive similar encodings.
    Step 3. Generate and optimize molecules with an Actor-Critic reinforcement learning model.
    (1) Framework of the Actor-Critic reinforcement learning model.
    An Actor-Critic reinforcement learning model is used to generate and optimize molecules. A molecule is modified by selecting one of its fragments and one bit in that fragment's encoding, then flipping the value of that bit: if it is 0 it becomes 1, and vice versa. This makes the degree of change applied to the molecule traceable; the leading bits of an encoding remain unchanged, so the model is only allowed to flip bits at the end, forcing it to search only among molecules close to known compounds.
    The Actor-Critic model starts from a fragmented molecule, the current state. The Actor extracts and inspects all fragments, introduces the positional information of each fragment within the molecule, computes attention coefficients over the fragments of each molecule with a Transformer encoder, and outputs probabilities through a DenseNet network that decide which fragments to replace and what to replace them with. The new state is scored according to how well it satisfies all constraints; the Critic then computes the TD-Error, the difference between the value of the new state plus the reward received and the value of the current state, and passes it to the Actor: if it is positive, the Actor's action is reinforced; if it is negative, the action is discouraged. The current state is then replaced by the new state, and the process repeats a given number of times (a minimal sketch of this update loop follows this claim).
    (2) Optimization of the reward mechanism of the reinforcement learning model.
    Molecules are optimized for two kinds of properties: the intrinsic properties of the molecule itself and its computed activity. The reward mechanism of the reinforcement learning model predicts the reward with a perceptron model, which has a training stage and a prediction stage. During training, the data set is drawn from two sources: the positive samples are molecules reported as active in the existing literature, and the negative samples are an equal number of molecules randomly sampled from the ZINC library. After the positive and negative samples are shuffled, the computed activity information obtained by docking them in turn, together with the intrinsic molecular properties computed with existing toolkits, is used as input; after multiple rounds of training, the model learns the latent relationship between the computed activity and property information and whether a molecule is truly active. During prediction, the model takes as input the computed activity of a generated molecule, obtained by virtually docking it against existing PDB files of disease-related targets with advanced, fast docking software, and the intrinsic properties of the generated molecule, computed with a general-purpose software package, and predicts whether the generated molecule really is active, thereby further optimizing the activity of the generated molecules. The Actor in the reinforcement learning model is rewarded for every valid molecule it produces, and receives a higher reward if it manages to produce molecules that meet the prediction model's expectations (a sketch of such a reward predictor also follows this claim).
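To make the bit-flip action and the Critic's TD-Error concrete, here is a minimal, illustrative Python sketch. The function names, the fixed-leading-bit convention, the discount factor, and the toy values are assumptions for illustration only; the patented model's Transformer-encoder/DenseNet Actor and the molecular scoring are omitted.

```python
import random

def flip_tail_bit(bits, n_fixed):
    """One action: flip a single bit of a fragment encoding.
    The first n_fixed (leading) bits stay untouched, restricting the
    search to molecules close to known compounds."""
    pos = random.randrange(n_fixed, len(bits))  # only tail bits may change
    new_bits = list(bits)
    new_bits[pos] = 1 - new_bits[pos]           # 0 -> 1, 1 -> 0
    return new_bits

def td_error(reward, value_new, value_current, gamma=0.99):
    """Critic signal: r + gamma * V(s') - V(s). A positive value reinforces
    the Actor's action; a negative value discourages it."""
    return reward + gamma * value_new - value_current

# One iteration of the loop described in step 3(1); in the full model the
# action would come from the Actor network rather than random.randrange,
# and delta would scale the Actor's policy update.
state = [1, 0, 1, 1, 0, 0]                      # toy fragment encoding
new_state = flip_tail_bit(state, n_fixed=3)
delta = td_error(reward=1.0, value_new=0.7, value_current=0.5)
state = new_state                               # replace and repeat
```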
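And a minimal sketch of the perceptron-style reward predictor of step 3(2), assuming scikit-learn and pre-computed feature vectors (docking-derived activity features plus intrinsic descriptors). The feature extraction, network size, and reward scaling are illustrative assumptions, not the patented configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_reward_predictor(X_active, X_zinc):
    """Train on known actives (positives) vs. random ZINC samples (negatives).
    Each row = docking score features + intrinsic property features."""
    X = np.vstack([X_active, X_zinc])
    y = np.concatenate([np.ones(len(X_active)), np.zeros(len(X_zinc))])
    idx = np.random.default_rng(0).permutation(len(X))  # shuffle pos/neg order
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
    clf.fit(X[idx], y[idx])
    return clf

def reward(clf, features, base=1.0, bonus=4.0):
    """Every valid molecule earns a base reward; molecules the predictor
    deems active earn proportionally more."""
    return base + bonus * clf.predict_proba([features])[0, 1]
```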
  2. The method for intelligently generating drug molecules based on reinforcement learning and docking according to claim 1, characterized in that in step 1, when a molecule is split, all single bonds extending from a ring atom are broken; when splitting a molecule, a fragment chain list is created to record and store the original split points, so that they can later serve as attachment points for molecular design; fragments with different numbers of attachment points may be exchanged provided that the total number of attachment points remains unchanged; the open-source toolkit RDKit is used for the molecular fragmentation in this process; fragments with more than 12 heavy atoms are discarded, as are fragments with 4 or more attachment points.
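A minimal sketch of this splitting rule using RDKit, the toolkit named in the claim. The exact bond selection and split-point bookkeeping of the patented method may differ; treat this as an illustration of the stated filters (at most 12 heavy atoms, fewer than 4 attachment points), with dummy '*' atoms standing in for the recorded split points.

```python
from rdkit import Chem

def fragment_molecule(smiles):
    """Break single, non-ring bonds that extend from a ring atom, then
    filter fragments by heavy-atom count and number of attachment points."""
    mol = Chem.MolFromSmiles(smiles)
    cut_bonds = [
        b.GetIdx() for b in mol.GetBonds()
        if b.GetBondType() == Chem.BondType.SINGLE
        and not b.IsInRing()
        and (b.GetBeginAtom().IsInRing() or b.GetEndAtom().IsInRing())
    ]
    if not cut_bonds:
        return [mol]
    # Dummy '*' atoms mark the original split points / attachment points.
    fragmented = Chem.FragmentOnBonds(mol, cut_bonds, addDummies=True)
    kept = []
    for f in Chem.GetMolFrags(fragmented, asMols=True):
        n_attach = sum(1 for a in f.GetAtoms() if a.GetAtomicNum() == 0)
        n_heavy = sum(1 for a in f.GetAtoms() if a.GetAtomicNum() > 1)
        if n_heavy <= 12 and n_attach < 4:   # filters stated in claim 2
            kept.append(f)
    return kept
```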
  3. The method for intelligently generating drug molecules based on reinforcement learning and docking according to claim 1, characterized in that when computing the similarity between fragments in step 2, "drug-like" molecules are compared using the Tanimoto similarity of the maximum common substructure (Tanimoto-MCS), while for smaller fragments the Damerau-Levenshtein distance, a refinement of the Levenshtein distance, is introduced; the Damerau-Levenshtein distance between two strings a and b is given by the standard recursion:

$$
d_{a,b}(i,j)=\min\begin{cases}
0 & \text{if } i=j=0,\\
d_{a,b}(i-1,j)+1 & \text{if } i>0,\\
d_{a,b}(i,j-1)+1 & \text{if } j>0,\\
d_{a,b}(i-1,j-1)+\mathbf{1}_{(a_i\neq b_j)} & \text{if } i>0 \text{ and } j>0,\\
d_{a,b}(i-2,j-2)+1 & \text{if } i,j>1,\ a_i=b_{j-1},\ a_{i-1}=b_j.
\end{cases}
$$

    The TMCS distance between two molecules M1 and M2 is defined as:

$$
d_{\mathrm{TMCS}}(M_1,M_2)=1-\frac{|\mathrm{MCS}(M_1,M_2)|}{|M_1|+|M_2|-|\mathrm{MCS}(M_1,M_2)|}
$$

    where |MCS(M1, M2)| is the number of atoms in the maximum common substructure and |M1|, |M2| are the atom counts of the two molecules. The similarity between two molecules M1 and M2, with corresponding SMILES representations S1 and S2, is then measured by combining the two distances:

    [Formula PCTCN2021107490-appb-100003: combined similarity built from d_DL(S1, S2) and d_TMCS(M1, M2)]
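A minimal sketch of both distances, assuming RDKit for the MCS computation. The Damerau-Levenshtein code below is the common dynamic-programming (optimal string alignment) variant of the recursion above; the combining function of the claim is not reproduced since its exact form is not recoverable here.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

def damerau_levenshtein(a, b):
    """DP table over the recursion: deletion, insertion, substitution,
    plus transposition of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def tmcs_distance(m1, m2):
    """1 - Tanimoto similarity of the maximum common substructure,
    measured in atom counts (hydrogens implicit)."""
    common = rdFMCS.FindMCS([m1, m2]).numAtoms
    return 1.0 - common / (m1.GetNumAtoms() + m2.GetNumAtoms() - common)

# Example: phenol vs. aniline differ by one SMILES character.
m1, m2 = Chem.MolFromSmiles("c1ccccc1O"), Chem.MolFromSmiles("c1ccccc1N")
print(damerau_levenshtein("c1ccccc1O", "c1ccccc1N"))  # 1
print(round(tmcs_distance(m1, m2), 3))
```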
  4. The method for intelligently generating drug molecules based on reinforcement learning and docking according to claim 1, characterized in that the molecular fragments in step 2 are encoded as follows: the binary strings are created by building a balanced binary tree based on fragment similarity; the tree is then used to generate a binary string for each fragment and, by extension, a binary string representing a molecule; the order of attachment points is treated as the identifier of each fragment; when assembling the tree, the similarity between all fragments is computed, and fragment pairs are then formed in a greedy, bottom-up manner: the two most similar fragments are paired first, and the process is repeated, joining the two pairs containing the most similar fragments into a new tree with four leaves; the similarity between two subtrees is measured as the maximum similarity between any two fragments of those trees; the joining process is repeated until all fragments are connected in a single tree.
    Once every fragment is stored in the binary tree, the tree is used to generate encodings for all fragments: the path from the root to the leaf storing a fragment determines that fragment's encoding; for each branch in the tree, a 1 is appended to the encoding if the path goes left, and a 0 if it goes right; the rightmost character of an encoding therefore corresponds to the branch closest to the fragment.
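A minimal sketch of the greedy tree assembly and path-based encoding, under simplifying assumptions: an abstract similarity(f1, f2) callable is assumed, fragments are hashable (e.g., SMILES strings), the sketch merges the two most similar subtrees at each step (an approximation of the pairing order described above), and perfect balance is not enforced.

```python
class Node:
    def __init__(self, fragment=None, left=None, right=None):
        self.fragment, self.left, self.right = fragment, left, right

def leaves(node):
    """All fragments stored under this subtree."""
    if node.fragment is not None:
        return [node.fragment]
    return leaves(node.left) + leaves(node.right)

def build_tree(fragments, similarity):
    """Greedy bottom-up merging: repeatedly join the two subtrees whose
    closest pair of fragments is most similar (subtree similarity = max
    similarity between any two of their fragments)."""
    nodes = [Node(fragment=f) for f in fragments]
    while len(nodes) > 1:
        i, j = max(
            ((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))),
            key=lambda ij: max(similarity(x, y)
                               for x in leaves(nodes[ij[0]])
                               for y in leaves(nodes[ij[1]])),
        )
        merged = Node(left=nodes[i], right=nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]

def encode(node, prefix=""):
    """Root-to-leaf path determines each fragment's code:
    append '1' going left, '0' going right."""
    if node.fragment is not None:
        return {node.fragment: prefix}
    codes = encode(node.left, prefix + "1")
    codes.update(encode(node.right, prefix + "0"))
    return codes
```

With this layout, two fragments that are paired early share a long common prefix, which is exactly the property the bit-flip action of claim 1 exploits: flipping a trailing bit moves to a nearby, chemically similar fragment.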
PCT/CN2021/107490 2021-07-09 2021-07-21 Drug molecule intelligent generation method based on reinforcement learning and docking WO2023279436A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2022543606A JP7387962B2 (en) 2021-07-09 2021-07-21 Intelligent generation method of drug molecules based on reinforcement learning and docking

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110780433.3A CN113488116B (en) 2021-07-09 2021-07-09 Drug molecule intelligent generation method based on reinforcement learning and docking
CN202110780433.3 2021-07-09

Publications (1)

Publication Number Publication Date
WO2023279436A1 true WO2023279436A1 (en) 2023-01-12

Family

ID=77938422

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/107490 WO2023279436A1 (en) 2021-07-09 2021-07-21 Drug molecule intelligent generation method based on reinforcement learning and docking

Country Status (3)

Country Link
JP (1) JP7387962B2 (en)
CN (1) CN113488116B (en)
WO (1) WO2023279436A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117831646A (en) * 2023-11-29 2024-04-05 重庆大学 Molecular orientation intelligent generation method based on molecular fragment chemical space deconstruction

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115762661A (en) * 2022-11-21 2023-03-07 苏州沃时数字科技有限公司 Molecular design and structure optimization method, system, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534164A (en) * 2019-09-26 2019-12-03 广州费米子科技有限责任公司 Drug molecule generation method based on deep learning
WO2020146356A1 (en) * 2019-01-07 2020-07-16 President And Fellows Of Harvard College Machine learning techniques for determining therapeutic agent dosages
CN112136181A (en) * 2018-03-29 2020-12-25 伯耐沃伦人工智能科技有限公司 Molecular design using reinforcement learning
WO2021026037A1 (en) * 2019-08-02 2021-02-11 Flagship Pioneering Innovations Vi, Llc Machine learning guided polypeptide design
CN112820361A (en) * 2019-11-15 2021-05-18 北京大学 Drug molecule generation method based on confrontation and imitation learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200168302A1 (en) * 2017-07-20 2020-05-28 The University Of North Carolina At Chapel Hill Methods, systems and non-transitory computer readable media for automated design of molecules with desired properties using artificial intelligence
US20210271968A1 (en) * 2018-02-09 2021-09-02 Deepmind Technologies Limited Generative neural network systems for generating instruction sequences to control an agent performing a task
US20210057050A1 (en) * 2019-08-23 2021-02-25 Insilico Medicine Ip Limited Workflow for generating compounds with biological activity against a specific biological target
CN110970099B (en) * 2019-12-10 2023-04-28 北京大学 Drug molecule generation method based on regularized variation automatic encoder
CN111508568B (en) * 2020-04-20 2023-08-29 腾讯科技(深圳)有限公司 Molecule generation method, molecule generation device, computer readable storage medium and terminal device
CN112116963A (en) * 2020-09-24 2020-12-22 深圳智药信息科技有限公司 Automated drug design method, system, computing device and computer-readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112136181A (en) * 2018-03-29 2020-12-25 伯耐沃伦人工智能科技有限公司 Molecular design using reinforcement learning
WO2020146356A1 (en) * 2019-01-07 2020-07-16 President And Fellows Of Harvard College Machine learning techniques for determining therapeutic agent dosages
WO2021026037A1 (en) * 2019-08-02 2021-02-11 Flagship Pioneering Innovations Vi, Llc Machine learning guided polypeptide design
CN110534164A (en) * 2019-09-26 2019-12-03 广州费米子科技有限责任公司 Drug molecule generation method based on deep learning
CN112820361A (en) * 2019-11-15 2021-05-18 北京大学 Drug molecule generation method based on confrontation and imitation learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117831646A (en) * 2023-11-29 2024-04-05 重庆大学 Molecular orientation intelligent generation method based on molecular fragment chemical space deconstruction
CN117831646B (en) * 2023-11-29 2024-09-03 重庆大学 Molecular orientation intelligent generation method based on molecular fragment chemical space deconstruction

Also Published As

Publication number Publication date
CN113488116B (en) 2023-03-10
CN113488116A (en) 2021-10-08
JP7387962B2 (en) 2023-11-29
JP2023531846A (en) 2023-07-26

Similar Documents

Publication Publication Date Title
Zhang et al. Motif-based graph self-supervised learning for molecular property prediction
Maziarz et al. Learning to extend molecular scaffolds with structural motifs
Yin et al. Learning to mine aligned code and natural language pairs from stack overflow
WO2023279436A1 (en) Drug molecule intelligent generation method based on reinforcement learning and docking
CN111090461B (en) Code annotation generation method based on machine translation model
WO2022108664A1 (en) Automated merge conflict resolution with transformers
Wang et al. Retrosynthesis prediction with an interpretable deep-learning framework based on molecular assembly tasks
Yu et al. Adversarial active learning for the identification of medical concepts and annotation inconsistency
Fu et al. MOLER: Incorporate molecule-level reward to enhance deep generative model for molecule optimization
Wei et al. Probabilistic generative transformer language models for generative design of molecules
Hu et al. Deep learning methods for small molecule drug discovery: a survey
Prokhorov et al. Generating knowledge graph paths from textual definitions using sequence-to-sequence models
Bouchard-Côté et al. Improved reconstruction of protolanguage word forms
He et al. Neural unsupervised reconstruction of protolanguage word forms
Shingjergji et al. Relation extraction from DailyMed structured product labels by optimally combining crowd, experts and machines
Liu et al. Inverse Molecular Design with Multi-Conditional Diffusion Guidance
Krishna et al. Neural approaches for data driven dependency parsing in Sanskrit
Damani et al. Black box recursive translations for molecular optimization
Ko et al. Syntactic approach to extracting key elements of work modification cause in change-order documents
Alberga et al. DeLA-DrugSelf: Empowering multi-objective de novo design through SELFIES molecular representation
EP3977310B1 (en) Method for consolidating dynamic knowledge organization systems
Engkvist et al. Molecular De Novo Design Through Deep Generative Models
CN114842924A (en) Optimized de novo drug design method
Wang et al. Deep reinforcement learning and docking simulations for autonomous molecule generation in de novo drug design
Gaines et al. A deep molecular generative model based on multi-resolution graph variational Autoencoders

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2022543606

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21948928

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21948928

Country of ref document: EP

Kind code of ref document: A1