CN116106461B - Method and device for predicting liquid chromatograph retention time based on deep graph network - Google Patents
Method and device for predicting liquid chromatography retention time based on deep graph network
- Publication number
- CN116106461B (application CN202211374166.0A)
- Authority
- CN
- China
- Prior art keywords
- information
- layer
- graph network
- retention time
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N30/00—Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
- G01N30/02—Column chromatography
- G01N30/86—Signal analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention relates to a method and a device for predicting liquid chromatography retention time based on a deep graph network. The method comprises the steps of obtaining molecular structure information of the chemical substance to be detected, and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features, and an adjacency matrix; and inputting the graph network information into a trained deep graph network model for predicting the liquid chromatography retention time, and predicting the liquid chromatography retention time using the deep graph network model. The deep graph network model comprises a graph network layer, a readout layer, and a linear layer; the graph network layer introduces molecular edge information into the message-passing process, introduces residual connections, and increases model depth to improve the prediction effect; the readout layer is an attention-based readout layer. The method for predicting liquid chromatography retention time based on a deep graph network can improve prediction accuracy.
Description
Technical Field
The invention belongs to the technical field of liquid chromatography and information processing, and particularly relates to a method and a device for predicting liquid chromatography retention time based on a deep graph network.
Background
In the last few decades, liquid chromatography-mass spectrometry (LC-MS) has been the most efficient method for identifying small-molecule structures due to its high sensitivity and high selectivity. While tandem mass spectrometry (MS/MS) information has proven useful for characterizing structures, relying solely on tandem mass spectrometry is insufficient to determine structures because tandem mass spectrometry databases are extremely limited. Faced with this challenge, retention time has been used to aid compound identification. The retention time is the time from when a sample enters the column until it exits the column and is detected by mass spectrometry. Because the retention time provides orthogonal information beyond that obtained by tandem mass spectrometry, its ability to reduce the number of possible structures during identification makes it an important means of excluding false-positive identifications. How to accurately predict the liquid chromatography retention time, including the retention time under different liquid phase conditions, is the main problem to be solved by the present invention.
Currently, studies on this problem are limited, and conventional machine learning methods, such as Bayesian ridge regression and random forests, are used to predict retention time based on molecular fingerprints or molecular descriptors. However, molecular fingerprints or descriptors can only represent part of the nature of chemical molecules and cannot use the information of the overall molecular structure.
Disclosure of Invention
Aiming at the low prediction accuracy of existing traditional machine learning based on molecular fingerprints or molecular descriptors, the invention provides a method for predicting liquid chromatography retention time based on a deep graph network, so as to improve prediction accuracy.
The technical scheme adopted by the invention is as follows:
a method for predicting liquid chromatography retention time based on a deep graph network, comprising the steps of:
acquiring molecular structure information of the chemical substance to be detected, and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features, and an adjacency matrix;
and inputting the graph network information into a trained deep graph network model for predicting the liquid chromatography retention time, and predicting the liquid chromatography retention time using the deep graph network model.
Further, the node features include: atom type, chiral center type, chirality, degree, formal charge, hybridization, aromaticity, whether a hydrogen-bond donor or acceptor, whether a heteroatom, whether in a ring, number of hydrogen atoms, number of radical electrons, number of valence electrons, Crippen LogP contribution, Crippen molar refractivity contribution, Gasteiger charge, mass number, and topological polar surface area contribution. The edge features include: bond type, whether conjugated, whether part of a ring, whether rotatable, and the stereochemical information of the chemical bond. The adjacency matrix is constructed from the molecular chemical bonds.
Further, the deep graph network model comprises a graph network layer, a readout layer, and a linear layer; the graph network layer introduces molecular edge information into the message-passing process, introduces residual connections, and increases model depth to improve the prediction effect.
Further, the processing procedure of the graph network layer comprises the following steps:
transmitting the edge information between a source node u and a target node v, together with the information of the source node u, to the target node v, and aggregating at the target node v with a softmax function to obtain updated information m^l;
processing the updated information m^l with a linear layer, applying a nonlinear activation function σ, and finally summing the updated molecular information and the original molecular information h_v^l, i.e., performing the residual connection operation.
Further, the readout layer adopts an attention mechanism; the attention-based readout layer comprises a super virtual node, which is connected to each atomic node in the molecule, and the code of the super virtual node is first obtained by summation and then updated using the following formulas:
e_i = concat(c, n_i) · W + b
α_i = exp(e_i) / Σ_{j∈V} exp(e_j)
c = Σ_{i∈V} α_i · n_i
h_k, c_k = GRU(h_{k-1}, c_{k-1})
where c is the code of the super virtual node, n_i represents the code of each atomic node in the molecule, e_i is the weight after passing through the linear layer, α_i is the importance coefficient normalized by softmax (the α_i sum to one), V represents all atomic nodes in the molecule, GRU is a gated recurrent unit, c_k is the code of the super virtual node after the k-th pass through the graph attention mechanism, and h_k is the molecular code after the k-th update.
Further, the linear layer comprises 2 dense layers, wherein the hidden dimension of the first layer is 1024; after the first layer, a linear rectification function (ReLU) is applied, and the second layer projects the dimension to 1 to predict the retention time.
Further, the training process of the deep graph network model comprises the following steps: selecting a retention time dataset, dividing it into a training set, a verification set, and a test set, constructing the graph network information, and training the deep graph network model with a SmoothL1 loss function using the adaptive moment estimation (Adam) algorithm.
An apparatus for predicting liquid chromatography retention time based on a deep graph network, comprising:
a graph network information construction module for acquiring the molecular structure information of the chemical substance to be detected and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features, and an adjacency matrix;
and a retention time prediction module for inputting the graph network information into a trained deep graph network model for liquid chromatography retention time prediction, and predicting the liquid chromatography retention time using the deep graph network model.
The beneficial effects of the invention are as follows:
aiming at the problem that the existing traditional machine learning based on molecular fingerprints or molecular descriptors is low in prediction accuracy, the invention firstly proposes to introduce a deep map network to perform retention time prediction, and aiming at the problem of chemical substance retention time prediction, performs multiple optimization on a model, and further achieves the effect of improving prediction accuracy. Compared with the traditional machine learning method, the graph network model can use atomic level descriptors and meanwhile use structural information (graph network information) of chemical substances, so that a better prediction effect can be achieved.
The invention develops a deep graph convolutional network (DeepGCN-RT) model, which for the first time introduces residual connections into the model, introduces the edge (chemical bond) information of molecules, and introduces an attention-based graph network "readout" module, obtaining the model with the best prediction effect to date on the METLIN small-molecule retention time dataset (SMRT).
Furthermore, given that different studies typically use different liquid chromatography conditions, the present invention compares the effect of the developed model on other liquid chromatography datasets. The results show that, compared with literature models, the model developed by the invention significantly improves prediction accuracy on both the SMRT dataset and the transfer learning datasets. Finally, in LC-MS-based molecular identification using the RIKEN-PlaSMA dataset, DeepGCN-RT shows great advantages in reducing the number of candidate structures and improving top-k identification accuracy.
Drawings
Fig. 1. The model structure of the present invention.
Figure 2. Loss in training process of the present invention.
FIG. 3. Structure identification on the RIKEN-PlaSMA dataset. (a) Average number of candidate structures under different identification modes: the abscissa distinguishes structure identification with the MSFinder software alone from identification with MSFinder combined with the retention time prediction model developed in this study, and the ordinate gives the average number of candidate structures per chromatographic peak (averaged over the 100 chromatographic peaks). (b) Identification accuracy: the abscissa indicates whether the top-1, top-2, top-5, top-10, top-15, and top-20 candidate structures contain the true structure, the ordinate gives the proportion of correctly identified molecular structures, and the identification type denotes the identification means used (MSFinder alone versus MSFinder together with DeepGCN-RT).
Fig. 4. Predicted effect of the model of the present invention on the METLIN retention time dataset; the abscissa is the experimentally determined true retention time and the ordinate is the retention time predicted by the model developed in this study.
Fig. 5 is a histogram of the prediction error of the inventive model on the METLIN retention time dataset, with the abscissa representing the prediction error and the ordinate representing the corresponding count (count).
Detailed Description
The present invention will be further described in detail with reference to the following examples and drawings, so that the above objects, features and advantages of the present invention can be more clearly understood.
The invention relates to a method for predicting retention time based on a deep graph network, which includes the construction of chemical substance graph network information: node features, edge features, and the adjacency matrix. The chemical substance deep learning model adopted is a deep graph network model, which: introduces edge information into the message-passing process; uses residual connections to construct a deep graph network; and improves the "readout" module of the model with an attention-based readout to achieve a better prediction effect. The architecture of the model is shown in Fig. 1.
The specific scheme of the method for predicting the retention time based on the deep graph network is as follows:
1. construction of chemical substance graph network information
The construction of chemical graph network information includes constructing node features, edge features, and adjacency matrices.
The node features include: atom type, chiral center type, chirality, degree, formal charge, hybridization, aromaticity, whether a hydrogen-bond donor or acceptor, whether a heteroatom, whether in a ring, number of hydrogen atoms, number of radical electrons, number of valence electrons, Crippen LogP contribution, Crippen molar refractivity contribution, Gasteiger charge, mass number (divided by 100), and topological polar surface area contribution.
The adjacency matrix is built using chemical bonds of chemicals. In addition, the invention also introduces edge features into the information transfer process, wherein the edge features comprise: bond type, whether conjugated, whether part of a ring, whether rotatable, and steric structural information of the chemical bond.
The above information is constructed into node features, edge features, and the adjacency matrix using the open-source software RDKit, and this information is input into the graph network to predict the retention time.
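As an illustration of this construction step, the following is a minimal pure-Python sketch (a hypothetical stand-in for the RDKit-based pipeline, which is not reproduced here) showing how a symmetric adjacency matrix is assembled from a molecule's bond list:

```python
# Sketch: build a symmetric adjacency matrix from a molecule's bond list.
# In the actual pipeline, RDKit supplies the atoms and bonds; here a plain
# list of (u, v) atom-index pairs stands in for illustration.

def build_adjacency(num_atoms, bonds):
    """bonds: list of (u, v) atom-index pairs, one entry per chemical bond."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for u, v in bonds:
        adj[u][v] = 1
        adj[v][u] = 1  # chemical bonds are undirected
    return adj

# Ethanol heavy-atom skeleton C-C-O: atoms 0, 1, 2 and two bonds.
adj = build_adjacency(3, [(0, 1), (1, 2)])
print(adj)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

Node and edge feature rows (one per atom and per bond) would be stacked alongside this matrix in the same atom/bond order.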
2. Construction of deep layer graph network model
As shown in FIG. 1, the DeepGCN-RT model of the present invention consists of a graph network layer (GNN Layer), a readout layer (GNN Readout), and a linear layer (Dense Layer).
1. Graph network layer (GNN Layer)
The graph network is a graph convolutional network. The invention makes the following improvements on the basis of the GCN proposed by Kensert et al. (Kensert, A.; Bouwmeester, R.; Efthymiadis, K., et al., Graph convolutional networks for improved prediction and interpretability of chromatographic retention data. Anal. Chem. 2021, 93(47), 15633-15641.): adding the edge (chemical bond) information of molecules to the graph network model; adding residual connections to improve the model structure; and increasing the depth of the model to improve the predictive effect.
The GCN layer of Kensert et al. is as follows:

h_v^(l+1) = σ( Σ_{u∈N(v)} (1/c_uv) · h_u^l · W^l + b^l )   (1)

where u and v are the source node and the target node respectively, and N(v) is the set of all source nodes of v. c_uv is the square root of the node degree. σ denotes a nonlinear function. h_v^(l+1) is the molecular embedding of the target node v after l+1 updates, h_v^l is the molecular embedding of the target node v after l updates, l is the number of updates, b^l is the bias parameter of the l-th layer, and W^l is the weight parameter of the l-th layer.
The GCN layer first transmits the edge information between u and v, together with the information of the source node u, to the target node v; the target node v then aggregates with a softmax function, as shown in equations (2) and (3):

m_uv^l = h_u^l + e_uv   (2)
m^l = Σ_{u∈N(v)} softmax(m_uv^l) · m_uv^l   (3)

where h_u^l and e_uv denote the information of the source node and the edge information respectively, and m^l denotes the updated information. The information of the source node refers to the node features described above, the edge information refers to the edge features described above, and the source and target nodes are determined by the adjacency matrix described above.
Then, the obtained updated information m^l is processed with a linear layer (where l is the number of updates, b^l is the bias parameter of the l-th layer, and W^l is the weight parameter of the l-th layer) and passed through the nonlinear activation function σ. Finally, the updated molecular information and the original molecular information h_v^l are summed (i.e., the residual connection operation), as follows:

h_v^(l+1) = h_v^l + σ(m^l · W^l + b^l)   (4)
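The message-passing step described above (edge-augmented messages, softmax aggregation, a linear layer with activation, then a residual connection) can be sketched in pure Python with scalar node and edge features. This is an illustrative simplification under assumed scalar parameters, not the actual model code:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gcn_layer(h, e, neighbors, W, b):
    """One message-passing step with scalar features for clarity.

    h: node information h_u^l; e: edge information e[(u, v)];
    neighbors[v]: the source nodes u of target v (from the adjacency matrix);
    W, b: scalar linear-layer parameters.
    """
    h_next = []
    for v in range(len(h)):
        msgs = [h[u] + e[(u, v)] for u in neighbors[v]]  # node + edge info
        w = softmax(msgs)                                # softmax aggregation
        m = sum(wi * mi for wi, mi in zip(w, msgs))      # aggregated message
        act = max(0.0, m * W + b)                        # linear layer + ReLU as sigma
        h_next.append(h[v] + act)                        # residual connection
    return h_next

h = [0.5, 1.0, 0.2]
e = {(0, 1): 0.1, (1, 0): 0.1, (1, 2): 0.3, (2, 1): 0.3}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
print(gcn_layer(h, e, neighbors, W=1.0, b=0.0))
```

Because the residual term is added after a non-negative activation here, each updated node value is at least its previous value; in the real model the activation and dimensions differ.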
2. readout layer (GNN Readout)
Currently, graph-based readout mostly employs simple operations such as averaging and summation. To improve the prediction accuracy of the model, the invention adopts a readout layer based on an attention mechanism. Specifically, after the message-passing process, a molecular code is obtained for each atomic node in the molecule. The invention first creates a "super virtual" node and connects it to each atomic node. The code of the "super virtual" node is first obtained by summation and then updated using the following formulas:

e_i = concat(c, n_i) · W + b   (5)
α_i = exp(e_i) / Σ_{j∈V} exp(e_j)   (6)
c = Σ_{i∈V} α_i · n_i   (7)
h_k, c_k = GRU(h_{k-1}, c_{k-1})   (8)

where c is the code of the "super virtual" node, n_i represents the code of each atomic node in the molecule, and V represents all atomic nodes in the molecule. e_i is the weight after passing through the linear layer. α_i is the importance coefficient normalized by softmax, and the α_i sum to one. GRU is a gated recurrent unit. c_k is the code of the super virtual node after the k-th pass through the graph attention mechanism, and h_k is the molecular code after the k-th update.
The attention-based readout of the invention achieves a better retention time prediction effect because the graph attention mechanism can effectively capture the information useful for the target task. In addition, the gated recurrent unit performs well at retaining information and filtering out invalid information. Combining the two achieves a better effect in capturing the global features of chemical molecules.
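The attention weighting at the heart of this readout (linear scoring, softmax normalization, weighted summation) can be sketched as follows. Scalar node codes and scalar weights are used for clarity, and the GRU update of Eq. (8) is omitted, so this is only an illustration, not the model's readout code:

```python
import math

def attention_readout(c, nodes, w, b):
    """Attention readout sketch with scalar codes.

    c: super-virtual-node code; nodes: atomic node codes n_i;
    w, b: assumed linear-layer parameters applied to concat(c, n_i).
    """
    # Score each atomic node: e_i = concat(c, n_i) * W + b
    # (scalar form: w[0]*c + w[1]*n_i + b)
    e = [w[0] * c + w[1] * n + b for n in nodes]
    # Softmax normalization: the alpha_i sum to one
    m = max(e)
    exps = [math.exp(x - m) for x in e]
    total = sum(exps)
    alphas = [x / total for x in exps]
    # Attention-weighted sum updates the super virtual node
    c_new = sum(a * n for a, n in zip(alphas, nodes))
    return c_new, alphas

c_new, alphas = attention_readout(c=0.0, nodes=[1.0, 2.0, 3.0], w=(0.5, 0.5), b=0.0)
print(round(sum(alphas), 6))  # 1.0
```

With these assumed weights, larger node codes receive larger attention coefficients, so the updated super-virtual-node code is pulled toward the most heavily weighted atoms.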
3. Linear Layer (Dense Layer)
The code from the readout layer is input to the linear layer, which consists of 2 dense layers; the hidden dimension of the first layer is 1024. After the first layer, a linear rectification function (ReLU) is applied, and the second layer then projects the dimension to 1 for retention time prediction.
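A minimal sketch of such a two-layer dense head follows, with a much smaller hidden dimension than the 1024 used by the model and randomly initialized weights, purely for illustration:

```python
import random

def dense_head(x, W1, b1, W2, b2):
    """Two-layer dense head sketch: input -> hidden (ReLU) -> scalar output."""
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
              for row, bi in zip(W1, b1)]                 # first layer + ReLU
    return sum(w * h for w, h in zip(W2, hidden)) + b2    # project to dimension 1

random.seed(0)
d_in, d_hidden = 4, 8  # the patent uses hidden dimension 1024; small here
W1 = [[random.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [0.0] * d_hidden
W2 = [random.uniform(-0.5, 0.5) for _ in range(d_hidden)]
rt = dense_head([0.1, 0.2, 0.3, 0.4], W1, b1, W2, 0.0)
print(rt)  # a single scalar: the predicted retention time
```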
3. Retention time prediction
Training phase: an existing dataset containing the structural information of chemical substances and their experimentally measured retention times, such as the METLIN retention time dataset, is divided into a training set, a verification set, and a test set; the graph network information is constructed using the graph-network-information construction part described above; and the DeepGCN-RT model is then trained with a SmoothL1 loss function using the adaptive moment estimation (Adam) algorithm.
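For reference, the SmoothL1 loss used in training is quadratic for small errors and linear for large ones, which makes it less sensitive to retention-time outliers than a pure squared error. A minimal per-sample sketch, assuming the conventional threshold beta = 1 (the text does not state the threshold):

```python
def smooth_l1(pred, target, beta=1.0):
    """SmoothL1 loss: 0.5*d^2/beta for |d| < beta, else |d| - 0.5*beta."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

print(smooth_l1(10.5, 10.0))  # 0.125 (quadratic regime, small error)
print(smooth_l1(12.0, 10.0))  # 1.5   (linear regime, large error)
```

In practice the loss is averaged over a batch before the Adam update step.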
Retention time prediction phase: the simplified molecular-input line-entry system (SMILES) string of the chemical substance to be detected is obtained; the descriptors and molecular structure information of the chemical substance are extracted using the open-source software RDKit to complete the construction of the graph network information; the constructed graph network information (i.e., node features, edge features, and adjacency matrix) is input into the trained DeepGCN-RT model; and the model outputs the retention time prediction result.
4. Examples
1. Model training
The METLIN retention time dataset, derived from the METLIN laboratory and containing the structural information and experimentally determined retention times of 80038 chemical substances, was selected for model training. The invention divides the dataset into a training set, a verification set, and a test set, and on this basis constructs the graph network information using the graph-network-information construction part.
The training process of the model is based on this dataset, using a SmoothL1 loss function and the adaptive moment estimation (Adam) algorithm. The hidden layer dimension of the model was 200, the dense layer dimension was 1024, the dropout ratio was 0.1, and the batch size was 64. The training results are shown in Fig. 2, where train_loss denotes the training-set loss during training, valid_mae the mean absolute error on the verification set, and test_mae the mean absolute error on the test set.
Fig. 4 shows the predictive effect of the model of the present invention on the METLIN retention time dataset, and Fig. 5 the prediction error of the model on the same dataset. As can be seen from Fig. 4 and Fig. 5, the prediction error of the model of the present invention is small and the prediction accuracy is high.
2. The technical proposal of the invention has the beneficial effects that
The effect of the retention time prediction model developed by the invention is far better than that of a model reported in literature.
2.1 Comparison of the model Effect of the invention with the model Effect of the prior art
Comparing the effect of the model of the present invention with literature models, as shown in Table 1, the mean absolute error (MAE) of the model of the invention is the lowest, and its median absolute error (MedAE) and mean absolute percentage error (MAPE) are lower than those of the models reported in the literature.
TABLE 1 comparison of the effects of the model of the invention (DeepGCN-RT) and the literature model
Model | MAE(s)↓ | MedAE(s)↓ | MAPE↓ | R2↑ | Reference |
GCN | 29.4 | - | 0.04 | 0.89 | Kensert et al.,Anal.Chem.2021 |
DNNpwa | 39.62 | 25.08 | 0.05 | 0.85 | Ju et al.,Anal.Chem.2021 |
GNN-RT | 39.87 | 25.24 | 0.05 | 0.85 | Yang et al.,Anal.Chem.2021 |
DeepGCN-RT | 26.46 | 12.39 | 0.03 | 0.89 | - |
Among them, the results of GCN, DNNpwa, GNN-RT are cited in the following documents:
Kensert, A.; Bouwmeester, R.; Efthymiadis, K., et al., Graph convolutional networks for improved prediction and interpretability of chromatographic retention data. Anal. Chem. 2021, 93(47), 15633-15641.
Ju, R.; Liu, X.; Zheng, F., et al., Deep Neural Network Pretrained by Weighted Autoencoders and Transfer Learning for Retention Time Prediction of Small Molecules. Anal. Chem. 2021, 93(47), 15651-15658.
Yang, Q.; Ji, H.; Lu, H., et al., Prediction of liquid chromatographic retention time with graph neural networks to assist in small molecule identification. Anal. Chem. 2021, 93(4), 2200-2206.
Furthermore, the present invention explores the effect of the residual connection and of model depth on the prediction effect, as shown in Table 2. Overall, at the same number of layers, adding the residual connection clearly improves the effect; and with the residual connection in place, the effect of the model gradually improves as the depth of the model increases.
TABLE 2 influence of residual connection and model depth on model effect
In addition, the effects of different readout layers are shown in Table 3, where DeepGCN-RT uses the attention-based readout. It can be seen that the average readout is better than the sum readout, while the attention-based readout introduced by the present invention performs best.
TABLE 3 Effect of different readout layers
2.2 Transfer learning effect
Since different studies generally use different liquid phase conditions, a model built on the SMRT dataset cannot be directly used for datasets acquired under other liquid phase conditions. To test the generalization ability of the model, 7 reversed-phase liquid chromatography datasets and 2 hydrophilic interaction chromatography datasets were collected from the PredRet database (Stanstrup, J.; Neumann, S.; Vrhovsek, U., PredRet: prediction of retention time by direct mapping between multiple chromatographic systems. Anal. Chem. 2015, 87(18), 9421-8.), and the model obtained by SMRT training was used for transfer learning to obtain the transfer learning model DeepGCN-RT-TL. The model effects are shown in Table 4:
Table 4. Comparison of transfer learning effects
It can be found that the effect of the model of the invention is far better than that of the models reported in the literature. The results of DNNpwa-TL and GNN-RT-TL are respectively cited from the following documents:
Ju, R.; Liu, X.; Zheng, F., et al., Deep Neural Network Pretrained by Weighted Autoencoders and Transfer Learning for Retention Time Prediction of Small Molecules. Anal. Chem. 2021, 93(47), 15651-15658.
Yang, Q.; Ji, H.; Lu, H., et al., Prediction of liquid chromatographic retention time with graph neural networks to assist in small molecule identification. Anal. Chem. 2021, 93(4), 2200-2206.
2.3 Application of model to small molecule structure identification
A retention time prediction model is ultimately built for the structural identification of compounds. The present invention therefore selects the RIKEN-PlaSMA dataset from the MoNA database for structural identification. The dataset consists of 434 small-molecule compounds; 334 compounds are taken to establish a transfer learning model, and the remaining 100 compounds are used for structure identification. Structure identification uses the MSFinder software together with the transfer learning model of the invention, with the results shown in Fig. 3. It can be seen that DeepGCN-RT of the present invention shows great advantages in reducing the number of candidate structures and improving top-k identification accuracy: the average number of candidate structures is reduced from 50 to 35, and the top-k accuracy is also significantly improved.
In summary, the present invention provides a method for predicting retention time based on a deep graph network. The method performs better than all existing models reported in the literature.
Although the above case analysis of the method of the present invention is based on liquid chromatography, the application of the invention is not limited to liquid chromatography; the model of this study can also be applied to other chromatographic techniques, such as gas chromatography.
Based on the same inventive concept, another embodiment of the present invention provides an apparatus for predicting liquid chromatography retention time based on a deep graph network, comprising:
the graph network information construction module is used for acquiring the molecular structure information of the chemical substance to be analyzed and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features and an adjacency matrix;
and the retention time prediction module is used for inputting the graph network information into a trained deep graph network model for liquid chromatography retention time prediction, and predicting the liquid chromatography retention time with the deep graph network model.
The specific implementation of each module is described in the foregoing description of the method of the present invention.
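The graph network information construction module can be sketched as follows, using a hand-written description of ethanol's heavy atoms (C-C-O). In practice a cheminformatics toolkit such as RDKit would parse the structure and supply the full descriptor set (Crippen contributions, Gasteiger charges, etc.); the three placeholder features used here, and the function names, are illustrative assumptions.

```python
def build_graph(atoms, bonds):
    """atoms: list of per-atom feature dicts; bonds: list of (i, j, feature dict)."""
    n = len(atoms)
    adjacency = [[0] * n for _ in range(n)]   # adjacency follows chemical bonds
    edge_features = {}
    for i, j, feat in bonds:
        adjacency[i][j] = adjacency[j][i] = 1
        edge_features[(i, j)] = edge_features[(j, i)] = feat
    # placeholder node features (the patent lists ~19 descriptors)
    node_features = [[a["formal_charge"], a["num_h"], int(a["aromatic"])]
                     for a in atoms]
    return node_features, edge_features, adjacency

# ethanol, heavy atoms only: C(0)-C(1)-O(2)
atoms = [{"formal_charge": 0, "num_h": 3, "aromatic": False},
         {"formal_charge": 0, "num_h": 2, "aromatic": False},
         {"formal_charge": 0, "num_h": 1, "aromatic": False}]
bonds = [(0, 1, {"type": "single", "conjugated": False, "in_ring": False}),
         (1, 2, {"type": "single", "conjugated": False, "in_ring": False})]
node_features, edge_features, adjacency = build_graph(atoms, bonds)
```

The adjacency matrix encodes only bonded pairs, while bond properties (type, conjugation, ring membership) travel separately as edge features, matching the three-part graph information described above.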
Based on the same inventive concept, another embodiment of the present invention provides a computer device (computer, server, smart phone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
The above-disclosed embodiments of the present invention are intended to aid understanding of the invention and to enable it to be put into practice. It will be understood by those of ordinary skill in the art that various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention. The invention should not be limited to what has been disclosed in the embodiments of the specification; its scope is defined by the claims.
Claims (6)
1. A method for predicting liquid chromatography retention time based on a deep graph network, comprising the steps of:
acquiring molecular structure information of a chemical substance to be analyzed, and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features and an adjacency matrix;
inputting the graph network information into a trained deep graph network model for predicting the retention time of liquid chromatography, and predicting the retention time of the liquid chromatography by using the deep graph network model;
the node features include: atom type, chiral center type, chirality, atom degree, formal charge, hybridization, aromaticity, whether a hydrogen-bond donor, whether a hydrogen-bond acceptor, whether a heteroatom, whether in a ring, the number of hydrogen atoms, the number of radical electrons, the number of valence electrons, the Crippen LogP contribution, the Crippen molar refractivity contribution, the Gasteiger charge, the mass number, and the topological polar surface area contribution; the edge features include: the bond type, whether conjugated, whether part of a ring, whether rotatable, and the stereochemical information of the chemical bond; the adjacency matrix is constructed according to the molecular chemical bonds;
the deep graph network model comprises a graph network layer, a readout layer and a linear layer; the graph network layer introduces the chemical bond information of the molecule into the message passing process and introduces residual connections, so that the model depth can be increased to improve the prediction effect; the graph network is a graph convolutional network;
the processing procedure of the graph network layer comprises the following steps:
transmitting the edge information between a source node u and a target node v, together with the information of the source node u, to the target node v, and aggregating at the target node v using a softmax function to obtain updated information m_l;
processing the updated information m_l with a linear layer and a nonlinear activation function σ, and finally adding the updated molecular information to the original molecular information, namely performing a residual connection operation;
the readout layer is an attention-mechanism-based readout layer; the attention-mechanism-based readout layer comprises a super virtual node connected to each atomic node in the molecule, and the encoding of the super virtual node is first obtained by summation and then updated using the following formulas:
e_i = concat(c, n_i) * W + b
α_i = softmax(e_i)
h_k, c_k = GRU(h_{k-1}, c_{k-1})
wherein c is the encoding of the super virtual node; n_i denotes the encoding of each atomic node in the molecule; e_i is the weight after the linear layer; α_i is the importance coefficient, normalized with a softmax over all atomic nodes in the molecule so that the coefficients sum to one; GRU is a gated recurrent unit; c_k is the encoding of the super virtual node computed by the k-th pass of the graph attention mechanism; h_k is the molecular encoding after the k-th update.
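The graph network layer recited in claim 1 can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the patented architecture: the scalar attention score (the sum of message features) and the feature dimensions are chosen only for demonstration.

```python
import torch
import torch.nn as nn

class GraphLayer(nn.Module):
    """One layer: edge-augmented messages, softmax aggregation, linear + residual."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.act = nn.ReLU()  # the nonlinear activation σ

    def forward(self, h, edge_index, edge_attr):
        src, dst = edge_index                 # each of shape (num_edges,)
        msg = h[src] + edge_attr              # bond information enters the message
        weight = torch.zeros(msg.shape[0])
        for v in dst.unique():                # softmax over edges entering node v
            mask = dst == v
            weight[mask] = torch.softmax(msg[mask].sum(-1), dim=0)
        m_l = torch.zeros_like(h)             # aggregated updated information m_l
        m_l.index_add_(0, dst, weight.unsqueeze(-1) * msg)
        return h + self.act(self.lin(m_l))    # residual connection
```

Because the layer output is the input plus a transformed update, many such layers can be stacked without vanishing gradients, which is what allows the model depth to be increased.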
2. The method of claim 1, wherein the linear layer comprises 2 linear layers; the hidden dimension of the first layer is 1024, a linear rectification function (ReLU) is applied after the first layer, and the second layer then projects the dimension to 1 to predict the retention time.
3. The method of claim 1, wherein the training process of the deep graph network model comprises: selecting a retention time dataset, dividing it into a training set, a validation set and a test set, constructing the graph network information, and training the deep graph network model with a SmoothL1 loss function and the adaptive moment estimation (Adam) algorithm.
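The training procedure of claim 3 can be sketched as follows. The 8:1:1 split ratio, learning rate and epoch count are illustrative assumptions not stated in the claim.

```python
import random
import torch
import torch.nn as nn

def split_dataset(data, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Randomly split a dataset into training, validation and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def train(model, x, y, epochs=100, lr=1e-3):
    # adaptive moment estimation (Adam) optimizer with SmoothL1 loss
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.SmoothL1Loss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x).squeeze(-1), y)
        loss.backward()
        opt.step()
    return loss.item()
```

SmoothL1 behaves like a squared error near zero and like an absolute error for large residuals, making the training less sensitive to outlier retention times than a plain MSE.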
4. An apparatus for predicting liquid chromatography retention time based on a deep graph network, comprising:
the graph network information construction module is used for acquiring the molecular structure information of the chemical substance to be analyzed and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features and an adjacency matrix;
the retention time prediction module is used for inputting the graph network information into a trained deep graph network model for liquid chromatography retention time prediction, and predicting the liquid chromatography retention time with the deep graph network model;
the node features include: atom type, chiral center type, chirality, atom degree, formal charge, hybridization, aromaticity, whether a hydrogen-bond donor, whether a hydrogen-bond acceptor, whether a heteroatom, whether in a ring, the number of hydrogen atoms, the number of radical electrons, the number of valence electrons, the Crippen LogP contribution, the Crippen molar refractivity contribution, the Gasteiger charge, the mass number, and the topological polar surface area contribution; the edge features include: the bond type, whether conjugated, whether part of a ring, whether rotatable, and the stereochemical information of the chemical bond; the adjacency matrix is constructed according to the molecular chemical bonds;
the deep graph network model comprises a graph network layer, a readout layer and a linear layer; the graph network layer introduces the chemical bond information of the molecule into the message passing process and introduces residual connections, so that the model depth can be increased to improve the prediction effect; the graph network is a graph convolutional network;
the processing procedure of the graph network layer comprises the following steps:
transmitting the edge information between a source node u and a target node v, together with the information of the source node u, to the target node v, and aggregating at the target node v using a softmax function to obtain updated information m_l;
processing the updated information m_l with a linear layer and a nonlinear activation function σ, and finally adding the updated molecular information to the original molecular information, namely performing a residual connection operation;
the readout layer is an attention-mechanism-based readout layer; the attention-mechanism-based readout layer comprises a super virtual node connected to each atomic node in the molecule, and the encoding of the super virtual node is first obtained by summation and then updated using the following formulas:
e_i = concat(c, n_i) * W + b
α_i = softmax(e_i)
h_k, c_k = GRU(h_{k-1}, c_{k-1})
wherein c is the encoding of the super virtual node; n_i denotes the encoding of each atomic node in the molecule; e_i is the weight after the linear layer; α_i is the importance coefficient, normalized with a softmax over all atomic nodes in the molecule so that the coefficients sum to one; GRU is a gated recurrent unit; c_k is the encoding of the super virtual node computed by the k-th pass of the graph attention mechanism; h_k is the molecular encoding after the k-th update.
5. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-3.
6. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211374166.0A CN116106461B (en) | 2022-11-03 | 2022-11-03 | Method and device for predicting liquid chromatograph retention time based on deep graph network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116106461A CN116106461A (en) | 2023-05-12 |
CN116106461B true CN116106461B (en) | 2024-02-06 |
Family
ID=86258567
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111899510A (en) * | 2020-07-28 | 2020-11-06 | 南京工程学院 | Intelligent traffic system flow short-term prediction method and system based on divergent convolution and GAT |
CN113192559A (en) * | 2021-05-08 | 2021-07-30 | 中山大学 | Protein-protein interaction site prediction method based on deep map convolution network |
CN113241128A (en) * | 2021-04-29 | 2021-08-10 | 天津大学 | Molecular property prediction method based on molecular space position coding attention neural network model |
CN113241130A (en) * | 2021-06-08 | 2021-08-10 | 西南交通大学 | Molecular structure prediction method based on graph convolution network |
CN113299354A (en) * | 2021-05-14 | 2021-08-24 | 中山大学 | Small molecule representation learning method based on Transformer and enhanced interactive MPNN neural network |
CN114121178A (en) * | 2021-12-07 | 2022-03-01 | 中国计量科学研究院 | Chromatogram retention index prediction method and device based on graph convolution network |
CN114565187A (en) * | 2022-04-01 | 2022-05-31 | 吉林大学 | Traffic network data prediction method based on graph space-time self-coding network |
CN114629674A (en) * | 2021-11-11 | 2022-06-14 | 北京计算机技术及应用研究所 | Attention mechanism-based industrial control network security risk assessment method |
CN114818515A (en) * | 2022-06-24 | 2022-07-29 | 中国海洋大学 | Multidimensional time sequence prediction method based on self-attention mechanism and graph convolution network |
CN115148302A (en) * | 2022-05-18 | 2022-10-04 | 上海天鹜科技有限公司 | Compound property prediction method based on graph neural network and multi-task learning |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |