CN115331751A - Chemical pathway analysis and prediction method based on machine learning and terminal equipment - Google Patents

Chemical pathway analysis and prediction method based on machine learning and terminal equipment

Info

Publication number
CN115331751A
Authority
CN
China
Prior art keywords
reaction
chemical
graph
prediction
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110504261.7A
Other languages
Chinese (zh)
Inventor
张毅
周龙飞
吴振东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Smic Future Beijing Technology Co ltd
Original Assignee
Smic Future Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smic Future Beijing Technology Co ltd filed Critical Smic Future Beijing Technology Co ltd
Priority to CN202110504261.7A priority Critical patent/CN115331751A/en
Publication of CN115331751A publication Critical patent/CN115331751A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a machine-learning-based chemical pathway analysis and prediction method that addresses the low design efficiency and low prediction accuracy of existing chemical reaction path prediction techniques. The method uses a four-layer architecture (a data support layer, a data calculation layer, a rule network layer, and a path prediction layer) to achieve fast, high-precision prediction of unknown reaction paths. It constructs a large-scale underlying chemical database using a graph representation, which reflects well the structural changes of molecules during chemical reactions. Using a graph convolutional neural network model and a fast subgraph matching detection technique, it accurately extracts information such as molecular features and reaction patterns, which improves the accuracy of path prediction. Using a path prediction technique, it performs fast search from reactant to target product in the reaction rule network, which improves the efficiency of biological reaction path prediction and reduces its cost.

Description

Chemical pathway analysis and prediction method based on machine learning and terminal equipment
Technical Field
The invention belongs to the technical field of data mining and machine learning, and in particular relates to an intelligent design and prediction method for unknown reaction paths of chemical molecules, and to a terminal device.
Background
A reaction path is the series of chemical reactions through which reactants, under enzyme catalysis, are converted into the corresponding products. Predicting biological reaction paths helps researchers synthesize required target products and supports innovative research in chemistry and medicine.
Existing path prediction techniques rely on large numbers of chemical experiments and accumulated research experience. The traditional chemical databases they use cannot represent how the structure of a chemical molecule changes as it takes part in a reaction. Path prediction built on such databases is slow to analyze and mine, produces large prediction errors, and is easily limited by factors such as experimental equipment and experimental environment, which greatly restricts the design efficiency and prediction accuracy for biological reactions.
Disclosure of Invention
To solve the problems of traditional path prediction techniques, the invention provides a machine-learning-based chemical pathway analysis and prediction method and a terminal device, which enable fast, high-precision prediction of unknown reaction paths. As shown in FIG. 3, the overall architecture is:
(1) The data support layer constructs a chemical database using a graph representation and provides the underlying data for data calculation;
(2) The data calculation layer combines reaction equation data with the structure of the chemical molecules and uses a fast subgraph matching detection technique to mine reaction patterns and extract reaction rules;
(3) The rule network layer constructs a complete reaction rule network from the computed reaction rules;
(4) The path prediction layer uses a path prediction technique to rapidly predict unknown reaction paths.
The implementation flow of the invention is shown in FIG. 4, and the implementation steps are as follows:
(1) Based on SMILES (Simplified Molecular Input Line Entry System) molecular structures and chemical reaction equation data, construct a local chemical pathway database using a graph representation;
(2) Input the attribute graph of each chemical molecule into a graph convolutional neural network, convert the graph from a topological structure into a d-dimensional vector, and extract and aggregate the feature vectors of the whole chemical structure using PCA (Principal Component Analysis);
(3) Separate the reactant set and the product set of each chemical reaction, and pair the compounds of the reactant set with those of the product set to form compound reaction pairs;
(4) For each reaction pair, use subgraph matching to find the substructure of the two compounds that remains unchanged during the reaction, and from it abstract the specific reaction pattern corresponding to the reaction pair;
(5) Use a threshold to decide which information (reaction patterns, the subgraphs added to and deleted from the reactant to form the product, and so on) is stored as a reaction rule, and construct the reaction rule network graph;
(6) Use subgraph matching to match reactant A and target product B against the reaction rules in the database one by one, obtaining the rule sets related to A and to B;
(7) Use a path prediction technique to search the reaction rule network graph for possible paths from the rule set related to reactant A to the rule set related to target product B, and report a probability value for each path as an assessment of the feasibility of that reaction path.
The invention has the following advantages:
(1) A large-scale underlying chemical database is constructed using a graph representation, which reflects well the structural changes of chemical molecules during reactions;
(2) The graph convolutional neural network model and the fast subgraph matching detection technique accurately extract information such as molecular features and reaction patterns, which improves the accuracy of path prediction;
(3) The path prediction technique enables fast search from reactant to target product in the reaction rule network, which improves the efficiency of biological reaction path prediction and reduces its cost.
Drawings
FIG. 1 is a schematic diagram of the characterization method for all types of chemical reactions;
FIG. 2 is a schematic diagram of the graph convolutional neural network model used for analytical calculations on chemical molecules;
FIG. 3 is the overall architecture diagram of the path prediction method proposed by the present invention;
FIG. 4 is a flow chart of the implementation of the path prediction method proposed by the present invention;
FIG. 5 is a schematic structural diagram of the terminal device described in the embodiment of the present application.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. The specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. In addition, for convenience of description, only a part, not all of the contents related to the present invention are shown in the drawings.
The embodiment is based on a terminal device running a Linux operating system and a corresponding development environment thereof.
(1) The chemical pathway database is constructed as follows:
1. First, obtain the mol structure file in V2000 format for every chemical molecule. This format defines the atoms and chemical bonds of the molecular structure in a uniform way: for each atom, the mol file records its three-dimensional position when the structure is drawn, its atom type, and similar information; for each chemical bond, the mol file records the numbers of the atoms it connects, the bond type (bond order), the stereo configuration of the bond, and so on.
2. From the mol file, characterize the structure of each chemical molecule as an attribute graph. The vertices of the graph represent atoms and their element types, and the edges represent the bonds between atoms. Following this characterization, the structural information in the mol file (vertices, edges, and so on) is extracted, converted into an attribute graph, and stored in the database.
3. Chemical reactions are generally classified by the types of their reactants and products: isomerization, abbreviated A → B; combination, abbreviated A + B → C; decomposition, abbreviated A → B + C; displacement, abbreviated A + BC → B + AC; and metathesis, abbreviated AB + CD → AD + CB. Complex chemical reactions can be simplified step by step and treated as a sequence of reactions from these classes. Chemical reaction data are parsed and stored as follows:
1) Decompose the chemical reaction and separate all reactants and products into two sets;
2) For each chemical molecule in the reaction, align it by name against the chemical molecule data set. If no match is found, the molecule is treated as an insignificant by-product of the reaction (e.g., H2O, H+) and is ignored;
3) Place one virtual node at the input end of the reaction and one at the output end, which completes the representation of the reaction equation (see FIG. 1); the reaction is stored in the database as a directed graph.
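The database construction steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's implementation: the function names, the dictionary layout of the attribute graph, the node labels "IN" and "OUT", and the way edges attach to the virtual nodes are all hypothetical (FIG. 1 is not reproduced here). The parser reads the fixed-width fields of a V2000 mol block; the sample molecule is hand-written and contains only the two heavy atoms of formaldehyde.

```python
# Hand-written V2000 mol block (heavy atoms of formaldehyde only).
FORMALDEHYDE_HEAVY_ATOMS = "\n".join([
    "formaldehyde (heavy atoms only)",
    "  hand-written example",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  2  0",
    "M  END",
])


def parse_molfile_v2000(molblock):
    """Turn a V2000 mol block into an attribute graph (vertices and edges)."""
    lines = molblock.splitlines()
    counts = lines[3]                       # counts line follows 3 header lines
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    vertices = {}
    for i in range(n_atoms):
        row = lines[4 + i]
        vertices[i + 1] = {                 # atoms are numbered from 1
            "element": row[31:34].strip(),
            "xyz": (float(row[0:10]), float(row[10:20]), float(row[20:30])),
        }
    edges = []
    for j in range(n_bonds):
        row = lines[4 + n_atoms + j]
        edges.append({
            "atoms": (int(row[0:3]), int(row[3:6])),
            "order": int(row[6:9]),
            "stereo": int(row[9:12]),
        })
    return {"vertices": vertices, "edges": edges}


def reaction_to_digraph(reactants, products, known_molecules):
    """Directed-graph form of a reaction with virtual input/output nodes.

    Molecules that cannot be aligned with the molecule data set are dropped.
    The edge layout (reactant -> IN -> OUT -> product) is an assumption.
    """
    edges = [(m, "IN") for m in reactants if m in known_molecules]
    edges.append(("IN", "OUT"))
    edges += [("OUT", m) for m in products if m in known_molecules]
    return edges


graph = parse_molfile_v2000(FORMALDEHYDE_HEAVY_ATOMS)
reaction = reaction_to_digraph(["A", "H2O"], ["B", "C"], {"A", "B", "C"})
```

In this sketch the unmatched molecule "H2O" is ignored, so the reaction A + H2O → B + C is stored with the same four edges as a decomposition A → B + C.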
(2) Input the molecular attribute graph G and a vector W representing the molecular features into the graph convolutional neural network. As shown in FIG. 2, a stack of combined layers (convolution, pooling, and activation layers) computes the graph embedding, and the network finally outputs the predicted analysis result for the chemical substance. The graph convolutional neural network works as follows:
1. Associate a feature vector of the initial dimension with each node of the molecular attribute graph G. This vector encodes the local subgraph of the molecule around the node, and each distinct local subgraph is assigned a random unit-norm vector;
2. In each layer of the model, update all node embedding vectors by replacing each vector with the average of its adjacent vectors;
3. Apply a linear transformation to the updated vectors using the trained model parameters, and pass each coordinate of the result through the ReLU activation function;
4. After a number of layers set by a further hyperparameter, average the final embedding vectors of all nodes to obtain a d-dimensional graph embedding vector;
5. Combine other feature vectors of the chemical substance derived from its SMILES structure with the d-dimensional graph embedding vector, input the combined vectors into a feature aggregation layer, and extract and aggregate the feature vectors of the whole chemical structure using PCA;
6. Pass the result through a neural network to obtain an 11-dimensional output vector, and apply a Softmax layer to turn it into a vector of probabilities that sum to 1, which is the output prediction for the chemical substance.
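A minimal pure-Python sketch of steps 1 to 4 and of the final Softmax may help make the data flow concrete. It rests on several assumptions that are not in the source: the weights are random rather than trained, each node (rather than each distinct local subgraph) receives its own random unit-norm vector, the PCA aggregation of step 5 is omitted, and the 11-dimensional head is a single linear map. All function and variable names are hypothetical.

```python
import math
import random


def mat_vec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def graph_embedding(neighbors, layers, d, seed=0):
    """Mean-aggregation graph convolution followed by mean pooling.

    neighbors: dict mapping each node to the list of its adjacent nodes.
    layers:    list of d x d weight matrices, one per layer.
    """
    rng = random.Random(seed)
    h = {}
    for node in neighbors:                      # step 1: random unit-norm vectors
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        h[node] = [x / norm for x in v]
    for w in layers:
        new_h = {}
        for node, nbrs in neighbors.items():
            if nbrs:                            # step 2: average of adjacent vectors
                mean = [sum(h[u][k] for u in nbrs) / len(nbrs) for k in range(d)]
            else:                               # isolated node keeps its vector
                mean = h[node]
            # step 3: linear transformation, then ReLU on each coordinate
            new_h[node] = [max(0.0, x) for x in mat_vec(w, mean)]
        h = new_h
    n = len(h)                                  # step 4: average over all nodes
    return [sum(vec[k] for vec in h.values()) / n for k in range(d)]


def softmax(z):
    """Numerically stable softmax; the outputs sum to 1."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


d = 4
rng = random.Random(1)
layers = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)] for _ in range(2)]
head = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(11)]   # untrained head
chain = {0: [1], 1: [0, 2], 2: [1]}             # three atoms in a chain
embedding = graph_embedding(chain, layers, d)
probs = softmax(mat_vec(head, embedding))
```

The number of entries in `layers` plays the role of the layer-count hyperparameter mentioned in step 4.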
(3) A reaction R comprises two sets of graphs: the first set contains the reactants and the second set contains the synthesized products. We use the symbol shown in Figure BSA0000241508730000021 to denote the reactant set of R and the symbol shown in Figure BSA0000241508730000031 to denote its product set. A pathway P(A, B) from molecule A to molecule B is a reaction chain "R1; R2; …; Rn" in which the set shown in Figure BSA0000241508730000032 for one reaction and the set shown in Figure BSA0000241508730000033 for the next reaction share at least one chemical molecule.
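The chain condition can be checked mechanically. The sketch below is one reading of the definition and makes two assumptions that the text does not state explicitly: that the shared molecule links the product set of one reaction to the reactant set of the next, and that A must appear among the reactants of the first reaction and B among the products of the last. The `Reaction` type and `is_pathway` function are hypothetical names.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    name: str
    reactants: frozenset
    products: frozenset


def is_pathway(chain, start, target):
    """Return True if `chain` is a valid reaction chain from start to target."""
    if not chain:
        return False
    if start not in chain[0].reactants or target not in chain[-1].products:
        return False
    # Consecutive reactions must share at least one chemical molecule.
    return all(prev.products & nxt.reactants for prev, nxt in zip(chain, chain[1:]))


r1 = Reaction("R1", frozenset({"A"}), frozenset({"X"}))
r2 = Reaction("R2", frozenset({"X", "cofactor"}), frozenset({"B"}))
forward_ok = is_pathway([r1, r2], "A", "B")
reverse_ok = is_pathway([r2, r1], "A", "B")
```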
(4) To quantify the structural change of a chemical molecule during a reaction, we first establish a mapping between the graphs in the set shown in Figure BSA0000241508730000034 and the graphs in the set shown in Figure BSA0000241508730000035. By comparing the structures of the mapped molecules, we can quantify this change. We call this a reactant-product mapping (RPM) and write RPM(A, B) to indicate that reactant A is mapped to product B.
(5) If the same structural change occurs in one or more reactions, it constitutes a reaction pattern. We mine reaction patterns as follows:
1. Separate the reactant set (Figure BSA0000241508730000036) and the product set (Figure BSA0000241508730000037) of the chemical reaction;
2. Screen the important compounds in the two sets by aligning them against the compound database. This step excludes chemical molecules of no interest from rule mining;
3. After screening, pair the compounds of the reactant set with those of the product set to form compound reaction pairs;
4. Determine the reaction center. For each reaction pair, use the fast subgraph matching detection technique to establish an isomorphic mapping between the reactant graph and the target product graph, and find the substructure of the two compounds that remains unchanged during the reaction, from which the reaction center is determined. The matching procedure from reactant A to product B is as follows:
1) Select an initial vertex u_s from the query graph q of product B, and perform a BFS on q starting from u_s to generate the BFS tree Tq;
2) Traverse the candidate regions in parallel from multiple starting vertices. For each candidate region, perform a parallel depth-first search on the data graph g of reactant A using the query tree Tq to obtain the candidate vertex set (CVS);
3) Sort the paths of the query tree Tq in ascending order according to the CVS to obtain the matching order of the vertices of the query graph q;
4) Following this matching order, perform subgraph matching using the region traversal results and generate all subgraph isomorphism mappings in parallel, which completes the determination of the reaction center.
5. The reaction center tells us where the change occurs: the reaction center of RPM(A, B) is the set of vertices in product B at which edges are added or removed during the transition from A to B;
6. Determine the reaction features. A reaction feature is a subgraph of the product, and it can change through the addition or removal of subgraphs. When there are multiple reaction centers there are also multiple reaction features, each representing the neighborhood around its corresponding reaction center;
7. The reaction center identifies the location of the change, and the reaction feature encodes the potential driver behind the change. We call a reaction center together with its corresponding reaction feature a reaction pattern. This completes the mining of a single reaction pattern.
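Once an atom mapping between reactant and product is available, the reaction center defined in item 5 can be computed by comparing edge sets. The sketch below assumes the mapping has already been produced by the subgraph matching step (it does not implement the parallel, region-traversal matcher described above) and ignores bond-order changes; the function name, argument layout, and integer vertex ids are hypothetical.

```python
def reaction_center(reactant_edges, product_edges, mapping):
    """Find product vertices that gain or lose an edge.

    reactant_edges, product_edges: iterables of (u, v) vertex pairs.
    mapping: dict from reactant vertex to the product vertex it maps to.
    Returns (center, added, removed) with edges as frozensets of product ids.
    """
    kept = set()      # reactant edges expressed in product vertex ids
    center = set()
    for u, v in reactant_edges:
        if u in mapping and v in mapping:
            kept.add(frozenset((mapping[u], mapping[v])))
        else:
            # Bond to an atom that leaves the molecule: the surviving
            # endpoint loses an edge and belongs to the reaction center.
            center.update(mapping[x] for x in (u, v) if x in mapping)
    product = {frozenset(edge) for edge in product_edges}
    added = product - kept
    removed = kept - product
    for edge in added | removed:
        center.update(edge)
    return center, added, removed


# Reactant: chain 1-2-3 with a leaving atom 4 on vertex 3.
# Product: vertices 10, 20, 30 with a new ring-closing edge 10-30.
center, added, removed = reaction_center(
    reactant_edges=[(1, 2), (2, 3), (3, 4)],
    product_edges=[(10, 20), (20, 30), (10, 30)],
    mapping={1: 10, 2: 20, 3: 30},
)
```

Here vertex 30 enters the center because it loses its bond to the leaving atom, and vertices 10 and 30 enter because of the added edge; a reaction feature could then be taken as a neighborhood around these vertices.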
(6) Mine the reaction patterns of every reaction R in the database, and from each RPM(A, B) obtain: 1. the reaction center; 2. the reaction features; 3. the added and deleted subgraphs; 4. all reactants in reaction R other than A (these reactants are the enzymes or co-reactants that facilitate the reaction). We denote the information extracted from RPM(A, B) as L(A, B).
(7) Given a threshold h, L(A, B) is called a reaction rule if it occurs more than h times during reaction pattern mining. In essence, a reaction rule encodes the conditions a reaction requires to produce a predictable output. The occurrence frequency of each reaction rule L(A, B) is recorded, and a probability is assigned to it using logistic regression, which provides the basis for subsequent path prediction.
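The thresholding step can be sketched as a frequency count. The source does not say which features or fitted coefficients the logistic regression uses, so the probability function below is only a stand-in: it applies the logistic function to the occurrence count with hypothetical, unfitted parameters `w` and `b`. The pattern keys are placeholder strings standing for hashable encodings of L(A, B).

```python
import math
from collections import Counter


def extract_rules(patterns, h):
    """Keep the patterns that occur more than h times, with their counts."""
    counts = Counter(patterns)
    return {pattern: n for pattern, n in counts.items() if n > h}


def rule_probability(count, w=0.5, b=-1.0):
    """Logistic function of the count; w and b are illustrative, not fitted."""
    return 1.0 / (1.0 + math.exp(-(w * count + b)))


mined = ["L1"] * 5 + ["L2"] * 2 + ["L3"]
rules = extract_rules(mined, h=2)
probabilities = {rule: rule_probability(n) for rule, n in rules.items()}
```

With h = 2, a pattern seen exactly twice is not promoted to a rule, because the text requires more than h occurrences.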
(8) Construct the reaction rule network graph from all the reaction rules. Each node in the graph is a reaction rule, each edge is a reaction linking two rules, and the probability associated with the rules is used as the edge weight.
(9) The prediction from reactant A to target product B is carried out with the path prediction technique, as follows:
1. Use subgraph matching to match reactant A against the reaction rules in the database one by one, obtaining the rule set applicable to A;
2. Use subgraph matching to match target product B against the reaction rules in the database one by one, obtaining the rule set applicable to B;
3. Using the rule sets of A and B, search the reaction rule network graph for possible paths from the rule set of A to the rule set of B;
4. Using the probabilities assigned to the edges of the rule network, report the probability value of each path as an assessment of the feasibility of that reaction path.
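The source does not name the search algorithm behind the path prediction technique. As one plausible sketch, the code below treats the probability of a path as the product of its edge probabilities (an assumption) and finds the single most probable path with Dijkstra's algorithm on the weights -log(p). The method described above reports every possible path with its probability; this sketch returns only the best one. The rule names and edge probabilities are made up for illustration, and every probability must lie in (0, 1].

```python
import heapq
import math


def most_probable_path(edge_probs, sources, targets):
    """Most probable path from any source rule to any target rule.

    edge_probs: dict mapping (rule_u, rule_v) to a probability in (0, 1].
    Returns (path, probability), or (None, 0.0) if no path exists.
    """
    graph = {}
    for (u, v), p in edge_probs.items():
        graph.setdefault(u, []).append((v, -math.log(p)))
    heap = [(0.0, s, [s]) for s in sources]
    heapq.heapify(heap)
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node in targets:
            return path, math.exp(-cost)
        for nxt, weight in graph.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (cost + weight, nxt, path + [nxt]))
    return None, 0.0


edge_probs = {
    ("r1", "r2"): 0.9,
    ("r2", "r4"): 0.8,
    ("r1", "r3"): 0.5,
    ("r3", "r4"): 0.5,
}
best_path, best_prob = most_probable_path(edge_probs, sources={"r1"}, targets={"r4"})
```

In this example the route through r2 has probability 0.9 × 0.8 = 0.72, which beats the route through r3 at 0.25.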
FIG. 5 is a schematic structural diagram of the terminal device in an embodiment of the present application; the embodiment above is implemented on this terminal device. The terminal device includes, but is not limited to, a desktop computer, a high-performance notebook, a cloud server, and other computing devices.
The above examples are only preferred embodiments of the present invention. The protection scope of the present invention is not limited to these examples; all technical solutions that fall under the idea of the present invention belong to its protection scope. Modifications and refinements that those skilled in the art may make without departing from the principle of the invention are also considered to be within the protection scope of the invention.

Claims (10)

1. A graph characterization method for chemical molecular structures, characterized in that each chemical molecule is characterized as an undirected graph, the individual atoms are represented by the nodes of the graph, and the bonds between atoms are represented by the undirected edges of the graph.
2. A graph characterization method for chemical reactions, characterized in that each chemical reaction is characterized as a directed graph, each node of the graph represents a chemical molecule, the directed edges of the graph represent the direction in which the chemical molecule participates in the reaction, and one virtual node is placed at the input end and one at the output end of the reaction, thereby completing the graph characterization of all types of chemical reactions.
3. A method for constructing a chemical pathway database, characterized in that chemical molecules are stored according to the graph characterization method of claim 1 and chemical reactions are stored according to the graph characterization method of claim 2.
4. An analytical calculation method for chemical molecules, characterized in that the analytical calculation is carried out using a graph convolutional neural network model and comprises the following steps:
step 1) construct the corresponding attribute graph from the SMILES molecular structure of the chemical substance;
step 2) input the attribute graph into the graph convolutional neural network and compute the graph embedding through a stack of combined layers (convolution, pooling, and activation layers), that is, convert the topological structure of the graph into a d-dimensional vector;
step 3) combine other feature vectors of the chemical substance derived from its SMILES structure with the d-dimensional graph embedding vector, input them into a feature aggregation layer, and extract and aggregate the feature vectors of the whole molecular structure using PCA;
step 4) output the predicted analysis result for the chemical molecule through a Softmax layer.
5. A fast subgraph matching detection technique, characterized in that fast subgraph matching is achieved by a region-traversal-based subgraph isomorphism algorithm running on heterogeneous hardware.
6. A method for mining reaction patterns, characterized in that the fast subgraph matching detection technique of claim 5 is used to carry out the two processes of determining reaction centers and mining reaction features.
7. A method for extracting reaction rules, characterized in that, given a database of reactions, all "reactant-product" pairings of each reaction R are identified, and from each pairing the following information is extracted and stored: (1) the reaction pattern of claim 6; (2) the subgraphs added to and deleted from the reactant to form the product; (3) all reactants in reaction R other than the one in the pairing. The information extracted from a single pairing is called a reaction rule.
8. A method for constructing a reaction rule network, characterized in that each node in the reaction rule network is a reaction rule according to claim 7, and two rules are connected by an edge if they are likely to form a reaction path.
9. A method for predicting unknown reaction paths, characterized in that the reaction rule network of claim 8 is combined with a path prediction technique to predict unknown reaction paths.
10. A terminal device comprising a CPU, a GPU, a memory unit, and a computer program operable on the terminal device, characterized in that the terminal device is capable of correctly performing the steps of any of claims 1 to 9.
CN202110504261.7A 2021-05-10 2021-05-10 Chemical pathway analysis and prediction method based on machine learning and terminal equipment Pending CN115331751A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110504261.7A CN115331751A (en) 2021-05-10 2021-05-10 Chemical pathway analysis and prediction method based on machine learning and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110504261.7A CN115331751A (en) 2021-05-10 2021-05-10 Chemical pathway analysis and prediction method based on machine learning and terminal equipment

Publications (1)

Publication Number Publication Date
CN115331751A true CN115331751A (en) 2022-11-11

Family

ID=83912076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110504261.7A Pending CN115331751A (en) 2021-05-10 2021-05-10 Chemical pathway analysis and prediction method based on machine learning and terminal equipment

Country Status (1)

Country Link
CN (1) CN115331751A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115831248A (en) * 2023-02-20 2023-03-21 新疆独山子石油化工有限公司 Method and device for determining reaction rule, electronic equipment and storage medium
CN115841851A (en) * 2023-02-20 2023-03-24 新疆独山子石油化工有限公司 Method and device for constructing hydrocracking molecular reaction rule
CN115831248B (en) * 2023-02-20 2023-06-06 新疆独山子石油化工有限公司 Method and device for determining reaction rules, electronic equipment and storage medium
CN115841851B (en) * 2023-02-20 2023-06-06 新疆独山子石油化工有限公司 Construction method and device of hydrocracking molecular-level reaction rule

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination