CN113140267B

CN113140267B - Directional molecule generation method based on graph neural network

Info

Publication number: CN113140267B
Application number: CN202110318381.8A
Authority: CN
Inventors: 王坤峰; 赖欣; 杨培松; 阳庆元; 俞度立
Original assignee: Beijing University of Chemical Technology
Current assignee: Beijing University of Chemical Technology
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2024-03-29
Anticipated expiration: 2041-03-25
Also published as: CN113140267A

Abstract

The invention relates to a method for generating directed molecules based on a graph neural network, in the technical field of material molecules. The method comprises the following steps: converting an organic molecular structure into a molecular graph by topological mapping of the graph structure, and using the embedded representation of the molecular graph as the input to a graph neural network model; learning the molecular graph, including the representations of its nodes and edges, with the graph neural network model through a message propagation process; using the learned representations to make the decisions required during graph generation; and, in the decision process, adding new structures to the existing graph in a form that conforms to the rules of organic chemistry, where the probability of each addition event depends on the generation history of the graph. A chemical valence constraint check on the finally generated new molecules ensures their chemical validity. For a given database of organic molecules, the invention can generate new, chemically valid molecular structures with properties similar to those of the original molecules.

Description

Directional molecule generation method based on graph neural network

Technical Field

The invention relates to the technical field of material molecules, and in particular to a method for generating directed molecules based on a graph neural network.

Background

Graphs are ubiquitous in daily life. They are used to model complex systems with many kinds of topology in economics, the natural sciences, and the social sciences, such as social networks, efficacy networks in the biomedical field, and the synthesis and property prediction of material molecules in the chemical materials field, and they have practical significance in many real-world scenarios. Examples include recommending content and users of interest in a social network, identifying the functions of proteins in a protein-protein interaction (PPI) network, and predicting how well existing materials achieve certain physical and chemical properties. In recent years, machine learning on graph representations has become an efficient approach to downstream graph analysis tasks, including node classification, link prediction, and cluster detection.

Designing new molecular structures with desired properties is an important problem in applications such as materials science and drug discovery. The problem is challenging because chemical space is discrete and the search space is huge, estimated to be as large as 10³³. Machine learning techniques work well in these areas because large volumes of data are available. Machine learning models for molecule generation in this field mainly include deep generative models based on variational autoencoders (VAEs), generative adversarial networks (GANs), the junction tree variational autoencoder (JT-VAE), and iterative generation models based on neural networks.

Among these, deep generative models such as variational autoencoders (VAEs) represent a molecular structure as a linear string and generate that linear representation; a VAE uses auto-encoding together with a variational bound to infer the hidden variables of a Bayesian graphical model. Generative adversarial network (GAN) models are optimized through a discriminator network so that the distribution of the data produced by the generator directly fits the distribution of the training data, and molecules are generated from it. The junction tree VAE (JT-VAE) makes clever use of a tree structure to encode and decode molecular structures. The Graph Convolutional Policy Network (GCPN) treats the generation of a molecular graph as an iterative decision process and generates nodes and edges based on the existing graph substructure. Another related work, Molecular Recurrent Neural Networks (MRNNs), proposes an iterative sampling model that shows impressive performance in molecular graph generation. Both GCPN and MRNN demonstrate that iteratively generating nodes and edges from the graph itself is a practical graph generation method.

The existing methods have the following problems. First, methods such as the variational autoencoder mainly generate new molecules from the SMILES string representation of molecules and cannot process non-Euclidean data, such as molecular data in the form of molecular graphs. Second, the sampling process differs: the variational autoencoder uses one-shot sampling rather than sequential iterative sampling. Third, the validity of the generated molecules needs improvement: existing molecule generation algorithms generate atoms in a random order, so a large number of invalid molecules are produced and generation efficiency is low.

Disclosure of Invention

Therefore, the invention aims to provide a method for generating directed molecules based on a graph neural network. In the specific implementation, the chemical molecular structure is mapped into a molecular graph and the graph is learned by a graph neural network; in addition, a breadth-first search algorithm determines the node generation order, which addresses the problem in the prior art that a random order produces a large number of invalid molecules.
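As an illustration of a breadth-first node ordering, the following minimal Python sketch orders the atoms of a small molecular graph starting from a chosen atom. The adjacency-list format, the function name `bfs_order`, and the tie-breaking by atom index are assumptions made for this example; the source does not specify them.

```python
from collections import deque

def bfs_order(adjacency, start):
    """Return the nodes reachable from `start` in breadth-first order."""
    order = []
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Visit neighbors in index order so the result is deterministic.
        for neighbor in sorted(adjacency[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

# Hypothetical four-atom branched skeleton with bonds 0-1, 1-2, 1-3.
adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(bfs_order(adjacency, 0))
```

With such an ordering, every atom after the first is generated next to an atom that already exists in the partial graph, which is one way to read the motivation given above.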

To achieve the above object, the present invention provides a method for generating directed molecules based on a graph neural network, comprising:

step a, converting an organic molecular structure into a molecular graph through topological mapping of the graph structure, converting the atoms and chemical bonds of the organic molecular structure into the nodes and edges of the molecular graph, and using the embedded representation of the molecular graph as the input to a graph neural network model;

step b, learning the input representation of the molecular graph with the graph neural network model, which uses a message propagation process to learn the representations of the nodes and edges of the molecular graph and to generate representations of new nodes and edges;

step c, making decisions on the molecular graph with the graph neural network model during learning, and adding the new node or new edge to the molecular graph to generate a new molecular structure;

and step d, performing a chemical valence constraint check on the new molecule having the new molecular structure, so as to generate a valid new molecular structure with chemical properties similar to those of the original molecules.

Further, in step a, the process of converting the organic molecular structure into the molecular graph includes:

step a-1, reading the data file of the input molecule, and obtaining the corresponding molecular structure according to the type of the data file;

step a-2, mapping the atoms and chemical bonds of the molecular structure to corresponding nodes and edges;

and step a-3, generating from the mapped molecular structure a molecular graph containing the nodes and the edges.
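Steps a-2 and a-3 can be sketched as follows. This is a minimal illustration, assuming step a-1 has already produced a list of element symbols and a list of bonds; the function name `molecule_to_graph` and the `(i, j, order)` bond format are hypothetical, and parsing a real molecule file would normally be left to a cheminformatics toolkit.

```python
def molecule_to_graph(atoms, bonds):
    """Map atoms to nodes and bonds to undirected edges.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, order) tuples indexing into `atoms`
    """
    nodes = {index: symbol for index, symbol in enumerate(atoms)}
    edges = {}
    for i, j, order in bonds:
        if i == j or i not in nodes or j not in nodes:
            raise ValueError(f"invalid bond ({i}, {j})")
        # Store each undirected edge once, keyed by its sorted endpoints.
        edges[(min(i, j), max(i, j))] = order
    return nodes, edges

# Ethanol (SMILES "CCO"): hydrogens are implicit, so only heavy atoms appear.
nodes, edges = molecule_to_graph(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
print(nodes)
print(edges)
```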

Further, in step b, the input representation of the molecular graph is as follows: any molecular graph is written $G = (V, E)$, where $V$ is the set of nodes in the graph and $E$ is the set of edges. Each node of the molecular graph is represented by a vector, the node embedding vector $h_v$, where $v \in V$. All node embedding vectors of the molecular graph are summed in a higher-dimensional space to obtain the molecular graph embedding vector $h_G$, calculated as:

$$h_G = \sum_{v \in V} h_v^{G}$$

wherein $h_v^{G}$ is the node embedding vector $h_v$ mapped from the lower dimension to the higher dimension.
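The summation of mapped node embedding vectors can be sketched in plain Python as below. The linear map given by `weights` is an assumption standing in for whatever learned mapping raises the node embeddings to the higher dimension, and the toy numbers are illustrative only.

```python
def linear_map(weights, vector):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def graph_embedding(node_embeddings, weights):
    """Sum the mapped node embeddings to obtain the graph embedding."""
    total = [0.0] * len(weights)
    for h_v in node_embeddings.values():
        mapped = linear_map(weights, h_v)  # lower dimension -> higher dimension
        total = [t + m for t, m in zip(total, mapped)]
    return total

# Two-dimensional node embeddings mapped to three dimensions (toy weights).
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
node_embeddings = {0: [1.0, 2.0], 1: [0.5, -1.0], 2: [0.0, 3.0]}
print(graph_embedding(node_embeddings, weights))
```

Because the result is a sum over nodes, it does not depend on the order in which the nodes are listed.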

Further, the message propagation process comprises the nodes iteratively propagating and aggregating information from their local neighbors on the molecular graph. In each round of propagation, a message vector is computed on each edge of the molecular graph; once the computation is complete, each node collects all of its incoming messages into a message set and updates its own representation, that is, its node embedding vector. The message set is aggregated as:

$$a_v = \sum_{m:(m,v) \in E} \mathbf{m}_{m \to v}, \qquad \mathbf{m}_{m \to v} = f_e(h_m, h_v, x_{m,v})$$

and the node embedding vector is updated as:

$$h_v' = f_n(a_v, h_v)$$

wherein $a_v$ is the aggregated message for node $v$, $\mathbf{m}_{m \to v}$ is the message vector from $m$ to $v$, computed by the message function $f_e$, $f_n$ is the node update function, and $x_{m,v}$ is the feature vector of the edge with endpoints $m$ and $v$.
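One round of the message propagation described above can be sketched as follows. The message function `f_e` and the node update `f_n` are simple hand-written stand-ins chosen for this example (the embodiment uses learned neural networks for them), and summation is assumed as the aggregation.

```python
def propagate_once(node_embeddings, edges, f_e, f_n):
    """Run one round of message propagation on an undirected graph.

    edges maps (m, v) pairs to an edge feature; messages flow both ways.
    """
    aggregated = {v: [0.0] * len(h) for v, h in node_embeddings.items()}
    for (m, v), x_mv in edges.items():
        for src, dst in ((m, v), (v, m)):
            message = f_e(node_embeddings[src], node_embeddings[dst], x_mv)
            aggregated[dst] = [a + b for a, b in zip(aggregated[dst], message)]
    # Every node is updated from the embeddings of the previous round.
    return {v: f_n(aggregated[v], node_embeddings[v]) for v in node_embeddings}

# Toy stand-ins for the learned networks (assumptions, not trained models).
def f_e(h_src, h_dst, bond_order):
    return [bond_order * x for x in h_src]

def f_n(a_v, h_v):
    return [0.5 * (a + h) for a, h in zip(a_v, h_v)]

node_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
edges = {(0, 1): 1.0, (1, 2): 2.0}
updated = propagate_once(node_embeddings, edges, f_e, f_n)
print(updated)
```

Calling `propagate_once` repeatedly on its own output gives the multi-round propagation, with each extra round letting a node see one hop further into the graph.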

Further, during message propagation, for a set of node embeddings $h_V$, one round of propagation, expressed as $\mathrm{prop}(h_V, G)$, returns a set of transformed node embedded representations $h_V'$. After the message propagation update has been iterated for $T$ rounds, the embedded representation $h_V^{(T)}$ updated over $T$ rounds is calculated as:

$$h_V^{(T)} = \mathrm{prop}^{(T)}(h_V, G)$$

wherein $h_V$ is the embedded representation of the node set of the graph, $G$ is the molecular graph itself, and $\mathrm{prop}^{(T)}$ is the message function applied for $T$ rounds.

Further, the node embeddings aggregate information from the neighbors of each node over multiple rounds of propagation. The general aggregation operation applied to the embeddings propagated by $\mathrm{prop}$ is calculated as:

$$h_G = R(h_V, G)$$

wherein $R$ is a general aggregation function and $h_G$ is the graph embedded representation.

Further, the molecular graph generation model defines a probability distribution over the possible outcomes of each step and thereby sequentially generates, for the defined molecular graph, a distribution over the corresponding decision sequence. The probability of adding a node is calculated as:

$$f_{addnode}(G) = \mathrm{softmax}(f_{an}(h_G))$$

and the probability of adding an edge is calculated as:

$$f_{addedge}(G) = \sigma(f_{ae}(h_G))$$

wherein $G$ is the molecular graph itself, $f_{addnode}$ is the probability of adding a node, $f_{addedge}$ is the probability of adding an edge, $f_{an}$ and $f_{ae}$ are two different multi-layer perceptrons, and $\sigma$ is the logistic sigmoid.
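A minimal sketch of how such decision probabilities could be computed from a graph embedding is given below. It assumes that a single dense layer stands in for each multi-layer perceptron, that a softmax normalizes the node decision and a logistic sigmoid the edge decision, and that the last softmax entry means "stop adding nodes"; all weights and names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(weights, bias, vector):
    """One fully connected layer without activation."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def add_node_probs(h_G, weights, bias):
    """Probability of each node decision (atom types plus a stop option)."""
    return softmax(dense(weights, bias, h_G))

def add_edge_prob(h_G, weights, bias):
    """Probability of adding another edge."""
    return sigmoid(dense(weights, bias, h_G)[0])

h_G = [1.5, 4.0, 5.5]
W_an = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
b_an = [0.0, 0.0, 0.0]
W_ae = [[0.2, -0.1, 0.0]]
b_ae = [0.0]
print(add_node_probs(h_G, W_an, b_an))
print(add_edge_prob(h_G, W_ae, b_ae))
```

Sampling from these probabilities one decision at a time yields the decision sequence whose distribution the model defines.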

Further, the decision process of adding a node comprises taking the molecular graph $G$ and the node representations $h_V$ as input, generating the necessary intermediate parameters, and deciding whether to add a new node or to stop adding nodes, with a chemical valence constraint introduced in the process to ensure that the structure of the existing molecular graph conforms to the chemical rules, wherein the chemical validity of the molecular graph is calculated as follows:

Further, validating the generated new molecule comprises applying the chemical valence constraint during decision-making to check whether the bond connections of the current molecule exceed the valence allowed by the chemical rules, expressed by the following formula:

$$\sum_{j} |B_{ij}| \le \mathrm{Valency}(X_i)$$

wherein $|B_{ij}|$ represents the chemical valence of one of the edges $B_{ij}$ and $X_i$ represents the corresponding atom. In each edge generation step, the valence constraint of the edge is checked against the atom; if the newly added bond breaks the valence constraint, the edge is rejected and an edge satisfying the constraint is generated instead. The generation process terminates if one of the following conditions is met:

the size of the generated molecular graph reaches an upper limit,

no bond is generated between the newly generated atom and the previous subgraph.

When a termination condition is met, the missing valences of the molecule are filled with hydrogen atoms, and the molecule thus produced is a valid chemical molecule.
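The valence check on a proposed bond can be sketched as follows. The table `MAX_VALENCE` lists common valences of neutral atoms and is an assumption of this example, as are the function names and the edge dictionary format; the source does not say which atom types or valence table it uses.

```python
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}  # common neutral-atom valences

def used_valence(atom_index, edges):
    """Sum of bond orders over all edges incident to the atom."""
    return sum(order for (i, j), order in edges.items() if atom_index in (i, j))

def can_add_bond(nodes, edges, i, j, order):
    """Return True if a bond of `order` between i and j respects both valences."""
    if i == j or (min(i, j), max(i, j)) in edges:
        return False
    for atom in (i, j):
        if used_valence(atom, edges) + order > MAX_VALENCE[nodes[atom]]:
            return False
    return True

# Partial graph: a carbon double-bonded to one oxygen, second oxygen unattached.
nodes = {0: "C", 1: "O", 2: "O"}
edges = {(0, 1): 2}
print(can_add_bond(nodes, edges, 0, 2, 2))  # second C=O bond
print(can_add_bond(nodes, edges, 0, 2, 3))  # triple bond to oxygen
```

In a generation loop, a proposal for which `can_add_bond` returns False would be rejected and a new edge sampled in its place.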

Compared with the prior art, the method converts the organic molecular structure into a molecular graph by topological mapping, learns the distribution of nodes and edges in the molecular graph with a graph neural network using a message propagation algorithm, and uses this distribution to carry out the decision process that generates new molecules. The molecular structures produced by the decision process satisfy the organic chemistry rules built into the model, and once generation is complete a chemical valence constraint check ensures the chemical validity of the generated molecules. The method therefore efficiently generates, for a given database of organic molecules, new chemically valid molecular structures with properties similar to those of the original molecules.

Drawings

FIG. 1 is a flowchart of a method for generating directed molecules based on a graph neural network according to an embodiment of the present invention;

FIG. 2 is a schematic diagram showing a linear representation of molecular data provided in an embodiment of the present invention;

FIG. 3 is a schematic diagram of the message propagation process between nodes in a molecular graph according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the molecule generation process according to an embodiment of the present invention;

FIG. 5 shows the molecules generated by the present invention, together with their QED scores.

Detailed Description

In order that the objects and advantages of the invention will become more apparent, the invention will be further described with reference to the following examples; it should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.

It should be noted that, in the description of the present invention, terms such as "upper," "lower," "left," "right," "inner," "outer," and the like indicate directions or positional relationships based on the directions or positional relationships shown in the drawings, which are merely for convenience of description, and do not indicate or imply that the apparatus or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.

Furthermore, it should be noted that, in the description of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.

Fig. 1 is a flowchart of the method for generating directed molecules based on a graph neural network according to an embodiment of the present invention. The method provided by the embodiment comprises the following steps:

step S101, mapping an organic molecular structure to a molecular graph through the graph topology, converting the atoms and chemical bonds of the molecule into the nodes and edges of the molecular graph, and using the molecular graph containing the nodes and edges as the input to a graph neural network model;

step S102, using the graph neural network model and the message propagation process to learn the molecular graph generated in step S101, including the representations of its nodes and edges, the learned representation of the molecular graph being used to make the decisions required during graph generation;

step S103, in the decision process, adding new structures to the existing graph in a form that conforms to the chemical rules of organic molecules, wherein the probability of each addition event depends on the generation history of the graph, and checking the generated new molecule against the chemical valence constraint to ensure its chemical validity.

Referring to Fig. 2, in step a, the process of converting the organic molecular structure into the molecular graph is as follows:

Specifically, the molecular data files stored in general molecular databases usually hold a linear representation of each molecule in the form of a SMILES string, in which hydrogen atoms are generally omitted. SMILES is a line notation, that is, a way of expressing the structure of a compound in a single line of text. It is the most widely used line notation because of its simplicity.

Referring to Fig. 3, in the method for generating directed molecules based on a graph neural network according to this embodiment, learning the molecular graph with the graph neural network model comprises the following.

Representation of the molecular graph: any molecular graph is written $G = (V, E)$, where $V$ is the node set of the molecular graph and $E$ is its edge set. Each node of the molecular graph is represented by a vector, called the node embedding vector $h_v$, where $v \in V$. All node embedding vectors of the molecular graph are summed in a higher-dimensional space to obtain the molecular graph embedding vector $h_G$, calculated as:

$$h_G = \sum_{v \in V} h_v^{G}$$

wherein $h_v^{G}$ is obtained by mapping the node embedding vector $h_v$ from the lower dimension to the higher dimension.

Message propagation process: the nodes iteratively propagate and aggregate information from their local neighbors on the molecular graph (see Fig. 3(a)). A message vector is computed on each edge, and each node aggregates all of its incoming messages and updates its representation (see Fig. 3(b)). The message aggregation is calculated as:

$$a_v = \sum_{m:(m,v) \in E} \mathbf{m}_{m \to v}, \qquad \mathbf{m}_{m \to v} = f_e(h_m, h_v, x_{m,v})$$

and the node embedding vector is updated as:

$$h_v' = f_n(a_v, h_v)$$

wherein $a_v$ is the aggregated message for node $v$, $\mathbf{m}_{m \to v}$ is the message vector from $m$ to $v$, computed by the message function $f_e$, $f_n$ is the node update function, and $x_{m,v}$ is the feature vector of the edge with endpoints $m$ and $v$. In this embodiment, the message function and the node update function are both neural networks, and fully connected neural networks are selected.

Specifically, for a set of node embedded representations $h_V$, one round of message propagation returns a set of transformed node embedded representations $h_V'$, defined as $h_V' = \mathrm{prop}(h_V, G)$. After the update has been iterated for $T$ rounds, the node embedded representation is calculated as:

$$h_V^{(T)} = \mathrm{prop}^{(T)}(h_V, G)$$

wherein $h_V$ is the embedded representation of the node set of the graph, $G$ is the molecular graph itself, and $\mathrm{prop}^{(T)}$ is the message function applied for $T$ rounds.

Specifically, the node embeddings aggregate information from the neighbors of each node, and multiple rounds of propagation are performed without changing the graph structure so as to aggregate information over a larger neighborhood. The general aggregation operation applied to the embeddings propagated by $\mathrm{prop}$ is defined as:

$$h_G = R(h_V, G)$$

wherein $R$ is a general aggregation function and $h_G$ is the molecular graph embedded representation.

Generating decisions on the molecular graph with the graph neural network comprises the following: the molecular graph generation model defines a probability distribution over the possible outcomes of each step and thereby sequentially generates, for the defined molecular graph, a distribution over the corresponding decision sequence, which comprises a decision process of adding nodes and a decision process of adding edges.

The probability of adding a node is calculated as:

$$f_{addnode}(G) = \mathrm{softmax}(f_{an}(h_G))$$

The probability of adding an edge is calculated as:

$$f_{addedge}(G) = \sigma(f_{ae}(h_G))$$

wherein $f_{addnode}$ is the probability of adding a node, $f_{an}$ is a multi-layer perceptron, $f_{addedge}$ is the probability of adding an edge, $f_{ae}$ is a multi-layer perceptron different from $f_{an}$, and $\sigma$ is the logistic sigmoid.

Specifically, in the decision process of adding a node, the molecular graph $G$ and the node embedded representation of the molecular graph are taken as input. The necessary intermediate parameters are then generated from the input to decide whether to add a new node or to stop adding nodes. A chemical valence constraint is introduced while adding a new node, constraining the molecular graph so that its structure conforms to the chemical rules, and the chemical validity of the molecular graph is calculated as follows:

During the probability calculation, the $T$-round message propagation function is run to update the node embedding vectors to $h_V^{(T)}$. After the update is complete, the molecular graph embedding vector $h_G$ is computed and the output is predicted by a standard multi-layer perceptron followed by a softmax or a logistic sigmoid. Once the prediction is complete, the new node vectors $h_V^{(T)}$ are carried over to the next step. In this embodiment, $f_{an}(\cdot)$ is the multi-layer perceptron that maps the molecular graph embedding vector $h_G$ to the action output space, and the probability of adding a new node or of stopping is calculated in this process. The decision process of adding an edge in this embodiment is the same as that of adding a node.

Fig. 4 is a schematic diagram of the molecule generation process of the method for generating directed molecules based on a graph neural network according to the present invention.

The molecule generation process of the embodiment is a sequence of decisions. During the decisions to add nodes and edges, the validity of the newly generated molecular structure is verified against the chemical rules of molecules.

Specifically, a chemical valence constraint is introduced in the generation decision process to check whether the bond connections of the newly generated molecular structure exceed the valence allowed by the chemical rules. The bond connection condition is calculated as:

$$\sum_{j} |B_{ij}| \le \mathrm{Valency}(X_i)$$

wherein $|B_{ij}|$ is the chemical valence of one of the chemical bonds $B_{ij}$, and $X_i$ is the corresponding atom.

Specifically, in each edge generation step, the valence constraint of the chemical bond corresponding to the edge is checked against the atoms it connects. If the newly added bond breaks the valence constraint, the bond is rejected and a bond conforming to the valence constraint is generated instead. The generation process terminates when either of the following conditions is met:

(1) the size of the generated molecular graph reaches the upper limit;

(2) no bond is generated between the newly generated atom and the previous subgraph.

When the molecule generation process terminates, the missing valences of the molecule are filled with hydrogen atoms, producing a new, chemically valid molecule.
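The two termination conditions and the final hydrogen filling can be sketched as below. The valence table, the function names, and the way the conditions are passed in (`num_atoms`, `max_atoms`, `bonds_to_new_atom`) are assumptions for illustration; the source does not fix the upper limit or the atom types.

```python
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}  # common neutral-atom valences

def should_terminate(num_atoms, max_atoms, bonds_to_new_atom):
    """Stop when the graph is full or the newest atom attached no bond."""
    return num_atoms >= max_atoms or bonds_to_new_atom == 0

def implicit_hydrogens(nodes, edges):
    """Number of hydrogens needed to fill each atom's remaining valence."""
    used = {index: 0 for index in nodes}
    for (i, j), order in edges.items():
        used[i] += order
        used[j] += order
    return {index: MAX_VALENCE[symbol] - used[index]
            for index, symbol in nodes.items()}

# Heavy-atom graph of ethanol: C-C-O with single bonds.
nodes = {0: "C", 1: "C", 2: "O"}
edges = {(0, 1): 1, (1, 2): 1}
print(should_terminate(num_atoms=3, max_atoms=3, bonds_to_new_atom=1))
print(implicit_hydrogens(nodes, edges))
```

For the ethanol graph the counts add up to six hydrogens, matching the formula C2H6O.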

Fig. 5 shows molecular graphs generated by the method for generating directed molecules based on a graph neural network according to the present invention.

In the embodiments of the present invention, QED scores are used to evaluate how similar the generated new molecules are to the molecules in the original molecular database.

Specifically, QED (quantitative estimate of drug-likeness) evaluates drug-likeness by combining a plurality of molecular descriptors. In this embodiment, the generated molecules and the molecules in the original molecular database are scored in this way, and the degree of similarity is determined from the resulting scores.

Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.

The foregoing description is only of the preferred embodiments of the invention and is not intended to limit the invention; various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for generating directed molecules based on a graph neural network, characterized by comprising the following steps:

step a, converting an organic molecular structure into a molecular graph through topological mapping of the graph structure, converting the atoms and chemical bonds of the organic molecular structure into the nodes and edges of the molecular graph, and using the embedded representation of the molecular graph as the input to a graph neural network model;

step d, performing a chemical valence constraint check on the new molecule having the new molecular structure, so as to generate a new molecular structure with chemical properties similar to those of the original molecules;

the molecular graph generation model defines a probability distribution over the possible outcomes of each step and thereby sequentially generates, for the defined molecular graph, a distribution over the corresponding decision sequence, the distribution comprising a decision process of adding nodes and a decision process of adding edges, wherein the probability of adding a node is calculated as:

$$f_{addnode}(G) = \mathrm{softmax}(f_{an}(h_G)),$$

the probability of adding an edge is calculated as:

$$f_{addedge}(G) = \sigma(f_{ae}(h_G)),$$

wherein $G$ is the molecular graph itself, $f_{addnode}$ is the probability of adding a node, $f_{addedge}$ is the probability of adding an edge, $f_{an}$ and $f_{ae}$ are two different multi-layer perceptrons, and $\sigma$ is the logistic sigmoid.

2. The method of generating directed molecules based on a graph neural network according to claim 1, wherein in the step a, the process of converting the organic molecular structure into the molecular graph includes:

step a-1, reading a data file of an input molecule, and acquiring a corresponding molecular structure according to the type of the data file of the molecule;

3. The method for generating directed molecules based on a graph neural network according to claim 1, wherein in step b, the molecular graph representation comprises: for any molecular graph $G = (V, E)$, where $V$ is the node set of the molecular graph and $E$ is the edge set of the molecular graph, each node of the molecular graph is represented by a vector, the node embedding vector $h_v$, where $v \in V$; all node embedding vectors of the molecular graph are summed in a higher-dimensional space to obtain the molecular graph embedding vector $h_G$, calculated as:

$$h_G = \sum_{v \in V} h_v^{G},$$

4. The method for generating directed molecules based on a graph neural network according to claim 3, wherein the message propagation process comprises the nodes iteratively propagating and aggregating information from their local neighbors on the molecular graph; in each round of propagation, a message vector is computed on each edge of the molecular graph, and once the computation is complete each node collects all of its incoming messages into a message set and updates its own representation, that is, its node embedding vector, the message set being aggregated as:

$$a_v = \sum_{m:(m,v) \in E} \mathbf{m}_{m \to v}, \qquad \mathbf{m}_{m \to v} = f_e(h_m, h_v, x_{m,v}),$$

the node embedding vector being updated as:

$$h_v' = f_n(a_v, h_v),$$

wherein $a_v$ is the aggregated message for node $v$, $\mathbf{m}_{m \to v}$ is the message vector from $m$ to $v$, computed by the message function $f_e$, $f_n$ is the node update function, and $x_{m,v}$ is the feature vector of the edge with endpoints $m$ and $v$.

5. The method for generating directed molecules based on a graph neural network according to claim 3, wherein, during message propagation, for a set of node embeddings $h_V$, one round of propagation, expressed as $\mathrm{prop}(h_V, G)$, returns a set of transformed node embedded representations $h_V'$; after the message propagation update has been iterated for $T$ rounds, the embedded representation $h_V^{(T)}$ updated over $T$ rounds is calculated as:

$$h_V^{(T)} = \mathrm{prop}^{(T)}(h_V, G),$$

6. The method for generating directed molecules based on a graph neural network according to claim 5, wherein the node embeddings aggregate information from the neighbors of each node over multiple rounds of propagation, and the general aggregation operation applied to the embeddings propagated by $\mathrm{prop}$ is calculated as:

$$h_G = R(h_V, G),$$

7. The method for generating directed molecules based on a graph neural network according to claim 6, wherein the decision process of adding a node comprises taking the molecular graph $G$ and the node representations $h_V$ as input, generating the necessary intermediate parameters, and deciding whether to add a new node or to stop adding nodes, with a chemical valence constraint introduced in the process to ensure that the structure of the existing molecular graph conforms to the chemical rules, wherein the chemical validity of the molecular graph is calculated as follows:

。

8. The method for generating directed molecules based on a graph neural network according to claim 1, wherein validating the generated new molecule comprises applying the chemical valence constraint during decision-making to check whether the bond connections of the current molecule exceed the valence allowed by the chemical rules, expressed by the following formula:

$$\sum_{j} |B_{ij}| \le \mathrm{Valency}(X_i),$$

wherein $|B_{ij}|$ represents the chemical valence of one of the edges $B_{ij}$ and $X_i$ represents the corresponding atom; in each edge generation step, the valence constraint of the edge is checked against the atom; if the newly added bond breaks the valence constraint, the edge is rejected and an edge satisfying the constraint is generated instead, and the generation process terminates if one of the following conditions is met:

the size of the generated molecular graph reaches an upper limit,