CN115331751A

CN115331751A - Chemical pathway analysis and prediction method based on machine learning and terminal equipment

Info

Publication number: CN115331751A
Application number: CN202110504261.7A
Authority: CN
Inventors: 张毅; 周龙飞; 吴振东
Original assignee: Smic Future Beijing Technology Co ltd
Current assignee: Smic Future Beijing Technology Co ltd
Priority date: 2021-05-10
Filing date: 2021-05-10
Publication date: 2022-11-11

Abstract

The application provides a machine-learning-based chemical pathway analysis and prediction method that addresses the low design efficiency and low prediction accuracy of existing chemical reaction path prediction techniques. The system uses a four-layer architecture, comprising a data support layer, a data calculation layer, a rule network layer and a path prediction layer, to predict unknown reaction paths quickly and with high accuracy. The method constructs a large-scale underlying chemical database based on graph representation, which reflects the structural changes of molecules during chemical reactions. Using a graph convolutional neural network model and a fast subgraph matching detection technique, it extracts information such as chemical molecular features and reaction modes, improving the accuracy of path prediction. The path prediction technique searches the reaction rule network from the reactant to the target product, which improves the efficiency of biological reaction path prediction and reduces its cost.

Description

Chemical pathway analysis and prediction method based on machine learning and terminal equipment

Technical Field

The invention belongs to the technical field of data mining and machine learning, and particularly relates to an intelligent design and prediction method for an unknown reaction path of a chemical molecule and a terminal device.

Background

A reaction path is the series of chemical reactions through which reactants, under enzyme catalysis, are converted into the corresponding products. Predicting biological reaction paths helps in synthesizing required target products and supports innovative research in the chemical and medical fields.

Existing path prediction techniques rely on a large number of chemical experiments and on research experience. The traditional chemical databases they use cannot represent how the structure of a chemical molecule changes as it takes part in a reaction. Path prediction built on such databases suffers from slow analysis and mining and from large prediction errors, and is easily limited by factors such as experimental equipment and experimental environment, which greatly restricts the design efficiency and prediction accuracy for biological reactions.

Disclosure of Invention

In order to solve the problems in the traditional path prediction technology, the invention provides a chemical path analysis prediction method based on machine learning and a terminal device, which can realize quick and high-precision prediction of unknown reaction paths. As shown in fig. 3, the overall structure is:

(1) The data support layer is used for constructing a chemical database in a graph representation mode and providing bottom data support for data calculation;

(2) The data calculation layer is used for carrying out reaction mode mining and reaction rule extraction work by combining reaction equation data and the self structure of chemical molecules through a rapid subgraph matching detection technology;

(3) The rule network layer is used for constructing a complete reaction rule network by means of the reaction rules obtained by calculation;

(4) The path prediction layer uses a path prediction technique to rapidly predict unknown reaction paths.

The implementation flow of the invention is shown in FIG. 4, and the implementation steps are as follows:

(1) Based on molecular structures in SMILES (Simplified Molecular Input Line Entry System) format and on chemical reaction equation data, constructing a localized chemical pathway database using a graph representation method;

(2) Inputting the chemical molecule attribute graph into a graph convolutional neural network, converting the graph from a topological structure into a d-dimensional vector, and extracting and aggregating the feature vectors of the whole chemical substance structure by PCA (Principal Component Analysis);

(3) Separating a reactant set and a product set in a chemical reaction, and pairing the reactant set and the compounds in the product set pairwise to form a compound reaction pair;

(4) Aiming at each reaction pair, finding out the constant substructure of the two compounds in the reaction process by utilizing a subgraph matching technology, thereby further abstracting a specific reaction mode corresponding to the reaction pair;

(5) Using a threshold value to filter information such as the reaction modes and the subgraphs added to and deleted from the reactant to form the product, storing the retained information as reaction rules, and constructing a reaction rule network graph;

(6) Matching the reactant A and the target product B with the reaction rules in the database one by utilizing a subgraph matching technology to obtain a rule set related to the reactant A and the target product B;

(7) Carrying out a path analysis search on the reaction rule network graph by means of a path prediction technique to obtain possible paths from the rule set related to reactant A to the rule set related to target product B, and giving a probability value for each path as an assessment of the feasibility of the reaction path.

The invention has the following advantages:

(1) A large-scale bottom-layer chemical database is constructed based on a graph representation mode, and the structural change of chemical molecules in chemical reaction can be well reflected;

(2) By means of the graph convolution neural network model and the rapid sub-graph matching detection technology, information such as chemical molecular characteristics and reaction modes is accurately extracted, and accuracy of path prediction is effectively improved;

(3) The rapid search from the reactant to the target product in the reaction rule network is realized by means of a path prediction technology, the prediction efficiency of the biological reaction path is greatly improved, and the prediction cost is reduced.

Drawings

FIG. 1 is a schematic diagram of a characterization method for all types of chemical reaction schemes;

FIG. 2 is a schematic diagram of a graph convolution neural network model for analytical calculations on chemical molecules;

FIG. 3 is a general architecture diagram for implementing the path prediction method proposed by the present invention;

FIG. 4 is a schematic diagram of a flow chart of an implementation of the path prediction method proposed by the present invention;

FIG. 5 is a schematic structural diagram of the terminal device described in the embodiment of the present application.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. The specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. In addition, for convenience of description, only a part, not all of the contents related to the present invention are shown in the drawings.

The embodiment is based on a terminal device running a Linux operating system and a corresponding development environment thereof.

(1) The construction method of the chemical pathway database comprises the following steps:

1. First, a mol structure file in V2000 format is obtained for every chemical molecule. This format defines the atoms and chemical bonds of a molecular structure in a uniform manner. For each atom, the mol file records information such as its three-dimensional position in the drawn structure and its atom type. For each chemical bond, the mol file records the numbers of the atoms it connects, the bond type, the stereo configuration of the bond, and so on.

2. Based on the mol file, the structure of each chemical molecule is characterized as an attribute graph. The vertices of the graph represent the atoms, labelled with their element types, and the edges represent the bonds between atoms. Following this characterization, the effective structural information in the mol file, such as vertices and edges, is extracted, converted into an attribute graph and stored in the database.

3. General chemical reactions fall into the following classes according to the types of reactants and products: isomerization reaction, abbreviated as A → B; combination reaction, abbreviated as A + B → C; decomposition reaction, abbreviated as A → B + C; replacement reaction, abbreviated as A + BC → B + AC; metathesis reaction, abbreviated as AB + CD → AD + CB. Complex chemical reactions can be simplified step by step and treated as a sequence of reactions of the above classes. Chemical reaction data are parsed and stored according to the following steps:

1) Decomposing the chemical reaction, and separating out all reactants and products to form a set;

2) Each chemical molecule in the reaction is matched by name against the chemical molecule data set. If no match is found, the molecule is considered an insignificant by-product of the reaction (e.g., H2O, H+) and is ignored;

3) A virtual node is placed at the reaction input end and another at the reaction output end, which completes the representation of the reaction equation (see FIG. 1); the reaction is stored in the database in the form of a directed graph.
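The two storage steps above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: parse_molblock reads the fixed-width counts, atom and bond rows of a V2000 mol block into an attribute graph, and reaction_graph builds a directed graph with two virtual nodes. The function names, the virtual node labels "IN" and "OUT", the edge layout (reactant to IN, IN to OUT, OUT to product) and the helper make_molblock are all assumptions made for the example.

```python
def make_molblock(name, atoms, bonds):
    """Hypothetical helper: build a small V2000 mol block with fixed-width columns."""
    head = [name, "", "",
            f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000"]
    atom_rows = [f"{x:10.4f}{y:10.4f}{z:10.4f} {sym:<3} 0  0  0  0  0  0  0  0  0  0  0  0"
                 for sym, x, y, z in atoms]
    bond_rows = [f"{a:3d}{b:3d}{t:3d}{s:3d}" for a, b, t, s in bonds]
    return "\n".join(head + atom_rows + bond_rows + ["M  END"])


def parse_molblock(molblock):
    """Parse a V2000 mol block into (vertices, edges) of an attribute graph."""
    lines = molblock.splitlines()
    counts = lines[3]  # counts line: atom count and bond count, 3 columns each
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    vertices = {}
    for i in range(n_atoms):
        row = lines[4 + i]
        vertices[i + 1] = {  # atoms are numbered from 1 in the bond block
            "element": row[31:34].strip(),
            "xyz": (float(row[0:10]), float(row[10:20]), float(row[20:30])),
        }
    edges = []
    for j in range(n_bonds):
        row = lines[4 + n_atoms + j]
        edges.append((int(row[0:3]), int(row[3:6]),
                      {"type": int(row[6:9]), "stereo": int(row[9:12])}))
    return vertices, edges


def reaction_graph(reactants, products, known_molecules):
    """Directed graph reactant -> IN -> OUT -> product; unknown names are skipped."""
    graph = {"IN": ["OUT"], "OUT": []}
    for name in reactants:
        if name in known_molecules:
            graph.setdefault(name, []).append("IN")
    for name in products:
        if name in known_molecules:
            graph["OUT"].append(name)
            graph.setdefault(name, [])
    return graph


# Heavy atoms of methanol: one C, one O, one single bond.
methanol = make_molblock("methanol",
                         [("C", 0.0, 0.0, 0.0), ("O", 1.43, 0.0, 0.0)],
                         [(1, 2, 1, 0)])
vertices, edges = parse_molblock(methanol)

# A + B -> C with water as an unmatched by-product that is dropped.
rxn = reaction_graph(["A", "B", "H2O"], ["C"], known_molecules={"A", "B", "C"})
```

Because every reaction class listed above reduces to "some molecules in, some molecules out", the same two virtual nodes cover A → B, A + B → C and the other classes without special cases.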

(2) A molecular attribute graph G and a vector W representing the molecular features are input into a graph convolutional neural network. With reference to FIG. 2, the graph is embedded through a number of combined layers (convolution layer, pooling layer and activation layer), and the analysis result for the chemical substance is finally output and predicted. The graph convolutional neural network works as follows:

1. Associate a feature vector of an initial dimension with each node of the molecular attribute graph G; the vector encodes the local subgraph of the molecule around the node, and a random unit-norm vector is assigned to each local subgraph;

2. In each layer of the model, update all node embedding vectors by replacing each vector with the average of its adjacent vectors;

3. Apply a linear transformation to the result using the trained model parameters, and pass each coordinate of the result through a ReLU activation function;

4. After a number of layers set by another hyperparameter, average the embedding vectors of all final nodes to obtain a d-dimensional graph embedding vector;

5. Combine the other feature vectors of the chemical substance derived from its SMILES structure with the d-dimensional graph embedding vector, input the combined vector into a feature aggregation layer, and extract and aggregate the feature vectors of the whole chemical substance structure through PCA;

6. Obtain an 11-dimensional output vector through a neural network, and use a Softmax layer to generate a vector whose probabilities sum to 1, thereby outputting and predicting the analysis result for the chemical substance.
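Steps 1 to 4 and the final Softmax can be sketched in plain Python. This is an illustrative sketch under several assumptions: the element label of an atom stands in for its "local subgraph", the layer weights are untrained identity matrices, the 11-dimensional output layer uses random placeholder weights, and the PCA aggregation of step 5 is omitted. The names gcn_embed, unit_vector and softmax are hypothetical.

```python
import math
import random


def unit_vector(rng, d):
    """Random d-dimensional vector scaled to unit norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gcn_embed(adj, labels, layer_weights, d, seed=0):
    """Mean-aggregation graph embedding; returns a d-dimensional vector."""
    rng = random.Random(seed)
    table = {}  # step 1: one random unit-norm vector per distinct label
    h = {}
    for v in adj:
        if labels[v] not in table:
            table[labels[v]] = unit_vector(rng, d)
        h[v] = table[labels[v]]
    for W in layer_weights:  # one d x d matrix per layer
        new_h = {}
        for v, nbrs in adj.items():
            # step 2: replace each vector by the average over adjacent vectors
            if nbrs:
                mean = [sum(h[u][k] for u in nbrs) / len(nbrs) for k in range(d)]
            else:
                mean = h[v]  # isolated node keeps its own vector
            # step 3: linear transformation followed by ReLU
            new_h[v] = [max(0.0, sum(W[r][k] * mean[k] for k in range(d)))
                        for r in range(d)]
        h = new_h
    # step 4: average all final node embeddings
    return [sum(vec[k] for vec in h.values()) / len(h) for k in range(d)]


def softmax(z):
    """Numerically stable softmax; the outputs sum to 1."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


d = 4
identity = [[1.0 if r == k else 0.0 for k in range(d)] for r in range(d)]
adj = {1: [2], 2: [1, 3], 3: [2]}   # heavy atoms of ethanol: C-C-O
labels = {1: "C", 2: "C", 3: "O"}
embedding = gcn_embed(adj, labels, [identity, identity], d)

# Placeholder 11-way output layer (step 6) with untrained random weights.
rng = random.Random(1)
W_out = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(11)]
logits = [sum(W_out[r][k] * embedding[k] for k in range(d)) for r in range(11)]
probs = softmax(logits)
```

In a trained model the identity matrices and W_out would be replaced by learned parameters; the data flow between the steps stays the same.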

(3) A reaction R comprises two sets of graphs: the first set contains the reactants and the second set contains the synthesized products. We refer to these as the reactant set and the product set of R. A pathway P(A, B) from molecule A to molecule B is a reaction chain "R1; R2; …; Rn" in which the product set of each reaction and the reactant set of the next reaction share at least one chemical molecule.
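The chain condition in (3) can be checked mechanically. The sketch below assumes each reaction is a dict holding a set of reactants and a set of products, and reads the condition as "some product of one reaction is a reactant of the next". It additionally assumes that A must appear among the reactants of the first reaction and B among the products of the last. The function name is_pathway is hypothetical.

```python
def is_pathway(chain, source, target):
    """Return True if `chain` is a valid pathway from `source` to `target`."""
    if not chain:
        return False
    if source not in chain[0]["reactants"] or target not in chain[-1]["products"]:
        return False
    # consecutive reactions must share at least one molecule
    return all(prev["products"] & nxt["reactants"]
               for prev, nxt in zip(chain, chain[1:]))


r1 = {"reactants": {"A", "X"}, "products": {"M"}}
r2 = {"reactants": {"M"}, "products": {"B"}}

valid = is_pathway([r1, r2], "A", "B")      # M links r1 to r2
broken = is_pathway([r2, r1], "M", "M")     # B is not a reactant of r1
```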

(4) To quantify the structural change of a chemical molecule during a reaction, we first establish a mapping between the graphs in the reactant set and the graphs in the product set. By comparing the structures of the mapped molecules, we can quantify this change. We call this a reactant-product mapping (RPM) and use the notation RPM(A, B) to indicate that reactant A is mapped to product B.

(5) If the same structural change occurs in one or more reactions, it constitutes a reaction mode. We mine the reaction modes as follows:

1. Separate the reactant set and the product set from the chemical reaction;

2. Screen the important compounds in the two sets by aligning them against the compound database. Through this process, chemical molecules of no interest are excluded from rule mining;

3. After screening, pair the compounds in the reactant set with the compounds in the product set, two by two, to form compound reaction pairs;

4. Determine the reaction center. For each reaction pair, use the fast subgraph matching detection technique to establish an isomorphic mapping between the reactant graph and the target product graph by graph matching, and find the substructure of the two compounds that remains unchanged during the reaction, namely the reaction center. The matching procedure from reactant A to product B is as follows:

1) Select an initial vertex us from the query graph q of product B, and perform a BFS on q to generate a BFS tree Tq;

2) Traverse the candidate regions in parallel from a plurality of starting vertices. For each candidate region, perform a parallel depth-first search on the data graph g of reactant A using the query tree Tq to obtain a candidate vertex set CVS;

3) Sort the paths of the query tree Tq in ascending order according to the CVS to obtain the matching order of the vertices of the query graph q;

4) Following the determined matching order, perform subgraph matching using the region traversal results and generate all subgraph isomorphism mappings in parallel, completing the determination of the reaction center.

5. The reaction center tells us where the change occurs: the reaction center for RPM(A, B) is the set of vertices in product B at which new edges are added or existing edges are removed during the transition from A to B;

6. Determine the reaction characteristics. The reaction characteristics are a subgraph of the product. The reaction characteristics can be changed by the addition or removal of subgraphs. When there are multiple reaction centers, there are also multiple reaction characteristics, each representing the neighborhood around its corresponding reaction center;

7. The reaction center identifies the location of the change, and the reaction characteristics encode the potential driver behind the change. We refer to a reaction center together with its corresponding reaction characteristics as a reaction mode. At this point, the mining of a single reaction mode is complete.
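The matching procedure in step 4 can be illustrated with a much simpler, sequential stand-in. The sketch below is not the parallel region-traversal algorithm described above: it orders the query vertices by a BFS from a chosen start vertex, builds candidate sets by vertex label only, and enumerates mappings by backtracking. It assumes a connected query graph and checks only that query edges are preserved. The names bfs_order and subgraph_matches are hypothetical.

```python
from collections import deque


def bfs_order(adj, start):
    """Vertices of a connected graph in BFS order from `start`."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in sorted(adj[u]):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


def subgraph_matches(q_adj, q_lab, g_adj, g_lab, start):
    """All label-preserving mappings of query graph q into data graph g."""
    order = bfs_order(q_adj, start)
    # candidate vertex set per query vertex, filtered by label only
    cands = {u: [v for v in g_adj if g_lab[v] == q_lab[u]] for u in order}
    results = []

    def extend(i, mapping):
        if i == len(order):
            results.append(dict(mapping))
            return
        u = order[i]
        for v in cands[u]:
            if v in mapping.values():
                continue  # keep the mapping injective
            # every already-mapped query neighbour must be adjacent in g
            if all(mapping[w] in g_adj[v] for w in q_adj[u] if w in mapping):
                mapping[u] = v
                extend(i + 1, mapping)
                del mapping[u]

    extend(0, {})
    return results


# Query: a C-O fragment. Data graph: heavy atoms of ethanol, C1-C2-O3.
q_adj = {"a": {"b"}, "b": {"a"}}
q_lab = {"a": "C", "b": "O"}
g_adj = {1: {2}, 2: {1, 3}, 3: {2}}
g_lab = {1: "C", 2: "C", 3: "O"}

matches = subgraph_matches(q_adj, q_lab, g_adj, g_lab, start="a")
```

Only the carbon bonded to oxygen can host the fragment, so a single mapping is found. The region traversal and candidate-size ordering of the source method serve to prune this same search and to spread it across parallel hardware.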

(6) The reaction modes of each reaction R in the database are mined, and the following are obtained from each RPM(A, B): 1. the reaction center; 2. the reaction characteristics; 3. the added and deleted subgraphs; 4. all reactants in reaction R other than A (these reactants are the enzymes or co-reactants that facilitate the reaction). We denote the information extracted from RPM(A, B) as L(A, B).
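Part of the information in L(A, B) can be derived directly from a vertex mapping. The sketch below assumes the mapping from vertices of A to vertices of B is already known, treats the added and deleted subgraphs as plain edge sets, and takes the reaction center to be the endpoints of those edges. Extraction of the reaction characteristics (the neighborhood subgraphs) is omitted, and edges of A that touch unmapped vertices are ignored. The function name extract_rule_info and the example molecule names are hypothetical.

```python
def extract_rule_info(a_edges, b_edges, mapping, reaction_reactants, a_name):
    """Added/removed edges, reaction center and co-reactants for one RPM(A, B)."""
    # edges of A expressed in B's vertex numbering
    a_in_b = {frozenset((mapping[u], mapping[v]))
              for u, v in a_edges if u in mapping and v in mapping}
    b_set = {frozenset(e) for e in b_edges}
    added = b_set - a_in_b      # present in B, absent in A
    removed = a_in_b - b_set    # present in A, absent in B
    center = set().union(*added, *removed)
    return {
        "added": sorted(tuple(sorted(e)) for e in added),
        "removed": sorted(tuple(sorted(e)) for e in removed),
        "center": sorted(center),
        "co_reactants": sorted(set(reaction_reactants) - {a_name}),
    }


info = extract_rule_info(
    a_edges=[(1, 2), (2, 3)],
    b_edges=[(10, 20), (20, 40)],
    mapping={1: 10, 2: 20, 3: 30},
    reaction_reactants=["ethanol", "NAD+"],
    a_name="ethanol",
)
```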

(7) Given a threshold h, L(A, B) is called a reaction rule if it occurs more than h times during reaction mode mining. In essence, a reaction rule encodes the conditions required for a reaction to produce a predictable output. The occurrence frequency of each reaction rule L(A, B) is recorded, and a probability is assigned using logistic regression, providing a basis for subsequent path prediction.

(8) A reaction rule network graph is constructed from all the reaction rules. Each node in the graph is a reaction rule, each edge is a reaction between rules, and the probability corresponding to the rules is used as the edge weight.
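The thresholding in (7) amounts to counting and filtering. The sketch below assumes each mined L(A, B) has been reduced to a hashable key, keeps those seen strictly more than h times, and maps the frequency to a probability with a logistic function. The source does not say how the logistic regression is fitted, so the bias and weight values here are placeholders, not fitted coefficients. The names mine_rules and rule_probability are hypothetical.

```python
import math
from collections import Counter


def mine_rules(observations, h):
    """Keep the patterns that occur strictly more than h times."""
    counts = Counter(observations)
    return {rule: n for rule, n in counts.items() if n > h}


def rule_probability(count, bias=-2.0, weight=0.5):
    """Logistic mapping from occurrence count to a probability (placeholder coefficients)."""
    return 1.0 / (1.0 + math.exp(-(bias + weight * count)))


observations = ["r1"] * 5 + ["r2"] * 2 + ["r3"]
rules = mine_rules(observations, h=2)          # r2 occurs exactly h times and is dropped
probabilities = {rule: rule_probability(n) for rule, n in rules.items()}
```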

(9) The prediction from reactant A to target product B is realized by means of a path prediction technology, and the method comprises the following steps:

1. Match the reactant A against the reaction rules in the database one by one using the subgraph matching technique to obtain the rule set applicable to A;

2. Match the target product B against the reaction rules in the database one by one using the subgraph matching technique to obtain the rule set applicable to B;

3. With reference to the rule sets of A and B, carry out a path analysis search on the reaction rule network graph to obtain the possible paths from the rule set of A to the rule set of B;

4. Using the probabilities assigned to the edges of the rule network, give the probability value of each path as an assessment of the feasibility of the reaction path.
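The search in steps 3 and 4 can be sketched as a depth-first enumeration of simple paths. The source does not name a specific path prediction algorithm, so this sketch makes several assumptions: the rule network is a dict of dicts holding edge probabilities, a path's probability is the product of its edge probabilities, and the search stops extending a path once it reaches any rule in B's set. The name find_paths and the max_len cutoff are hypothetical.

```python
def find_paths(network, sources, targets, max_len=6):
    """Enumerate simple paths from any source rule to any target rule.

    network: {rule: {next_rule: edge_probability}}
    Returns a list of (probability, path), most probable first.
    """
    results = []

    def walk(node, path, prob):
        if node in targets:
            results.append((prob, path))
            return
        if len(path) >= max_len:
            return  # bound the search depth
        for nxt, p in network.get(node, {}).items():
            if nxt not in path:  # simple paths only, no revisits
                walk(nxt, path + [nxt], prob * p)

    for s in sorted(sources):
        walk(s, [s], 1.0)
    return sorted(results, key=lambda r: r[0], reverse=True)


network = {
    "r1": {"r2": 0.9, "r3": 0.5},
    "r2": {"r4": 0.8},
    "r3": {"r4": 0.9},
    "r4": {},
}
paths = find_paths(network, sources={"r1"}, targets={"r4"})
best_probability, best_path = paths[0]
```

For large rule networks, a shortest-path search over negative log-probabilities would find the single most probable path without enumerating all of them; the enumeration here is kept because the text asks for every possible path with its probability.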

FIG. 5 is a schematic structural diagram of a terminal device in an embodiment of the present application; the above embodiment runs on this terminal device. The terminal device includes, but is not limited to, a desktop computer, a high-performance notebook, a cloud server, and other computing devices.

The above examples are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims

1. A method for graph characterization of chemical molecular structures, wherein each chemical molecule is characterized as an undirected graph, individual chemical atoms are characterized using nodes of the graph, and connections between atoms are characterized using undirected edges of the graph.

2. A graph characterization method for chemical reactions is characterized in that each chemical reaction is characterized by using a directed graph, nodes of the graph represent a chemical molecule, directed edges of the graph represent the participation direction of the chemical molecule in the reaction, and a virtual node is arranged at each of the input end and the output end of the reaction to complete graph characterization for all types of chemical reactions.

3. A method for constructing a chemical pathway database, wherein a chemical molecule is stored according to the molecular structure diagram characterization method of claim 1, and a chemical reaction is stored according to the chemical reaction pathway diagram characterization method of claim 2.

4. An analytical computation method for chemical molecules, characterized in that the analytical computation is carried out using a graph convolutional neural network model and comprises the following steps:

step 1) firstly, constructing a corresponding attribute map aiming at the SMILES molecular structure of a chemical substance;

step 2) inputting the attribute graph into a graph convolution neural network, and realizing the embedding of the graph through the calculation of a plurality of combination layers (convolution layers, pooling layers and activation layers), namely finishing the work of converting the topological structure of the graph into a d-dimensional vector;

step 3) combining other feature vectors of the chemical substance derived from the SMILES structure with the d-dimensional graph embedding vector, inputting the combined vector into a feature aggregation layer, and extracting and aggregating the feature vectors of the whole chemical molecular structure by PCA;

step 4) outputting and predicting the chemical molecule analysis result through a Softmax layer.

5. A fast subgraph matching detection technology is characterized in that the fast subgraph matching is achieved through a subgraph isomorphism algorithm based on region traversal by means of heterogeneous hardware.

6. A method for mining reaction patterns, which is characterized in that the rapid subgraph matching detection technology of claim 5 is adopted to realize two processes of determining reaction centers and mining reaction characteristics.

7. A method for extracting reaction rules, characterized in that, given a database of reactions, all "reactant-product" pairings of each reaction R are identified. From each pairing, the following information is extracted and stored: (1) the reaction mode of claim 6; (2) the subgraphs added to and deleted from the reactant to form the product; (3) all reactants in reaction R except the one in the pairing. The information extracted from a single pairing is referred to as a reaction rule.

8. A method of constructing a reaction rule network, wherein each node in the reaction rule network is a reaction rule according to claim 7, and two rules are connected by an edge if they are likely to form a reaction path.

9. A method for predicting unknown reaction paths, characterized in that, the reaction rule network of claim 8 is used in combination with a path prediction technique to realize the prediction of unknown reaction paths.

10. A terminal device comprising a CPU, a GPU and a memory unit and a computer program operable on the terminal device, characterized in that the terminal device is capable of correctly performing the steps of any of claims 1 to 9.