CN116106461B - Method and device for predicting liquid chromatograph retention time based on deep graph network - Google Patents
Method and device for predicting liquid chromatography retention time based on deep graph network
- Publication number
- CN116106461B (application CN202211374166.0A)
- Authority
- CN
- China
- Prior art keywords
- information
- layer
- graph network
- retention time
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N30/00—Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
- G01N30/02—Column chromatography
- G01N30/86—Signal analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention relates to a method and a device for predicting liquid chromatography retention time based on a deep graph network. The method comprises the steps of obtaining molecular structure information of the chemical substance to be detected, and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features, and an adjacency matrix; and inputting the graph network information into a trained deep graph network model for predicting the liquid chromatography retention time, and predicting the liquid chromatography retention time using the deep graph network model. The deep graph network model comprises a graph network layer, a readout layer, and a linear layer; the graph network layer introduces molecular edge information into the message-passing process, introduces residual connections, and increases model depth to improve the prediction effect; the readout layer is an attention-based readout layer. The method for predicting liquid chromatography retention time based on a deep graph network can improve prediction accuracy.
Description
Technical Field
The invention belongs to the technical field of liquid chromatography and information processing, and particularly relates to a method and a device for predicting liquid chromatography retention time based on a deep graph network.
Background
In the last few decades, liquid chromatography-mass spectrometry (LC-MS) has been the most efficient method for identifying small-molecule structures due to its high sensitivity and high selectivity. While tandem mass spectrometry (MS/MS) information has proven useful for characterizing structures, relying solely on tandem mass spectrometry is insufficient to determine structures because tandem mass spectrometry databases are extremely limited. Faced with this challenge, retention time has been used to aid compound identification. The retention time is the time from when a sample enters the column until it exits the column and is detected by mass spectrometry. Because the retention time provides orthogonal information beyond that obtained by tandem mass spectrometry, its ability to reduce the number of possible structures during identification makes it an important means of excluding false-positive identifications. How to accurately predict the liquid chromatography retention time, including the retention time under different liquid phase conditions, is the main problem to be solved by the present invention.
Currently, studies on this problem are limited, and conventional machine learning methods, such as Bayesian ridge regression and random forests, are used to predict retention time based on molecular fingerprints or molecular descriptors. However, molecular fingerprints or descriptors can only represent part of the nature of chemical molecules and cannot use the information of the overall molecular structure.
Disclosure of Invention
Aiming at the low prediction accuracy of existing traditional machine learning based on molecular fingerprints or molecular descriptors, the invention provides a method for predicting liquid chromatography retention time based on a deep graph network, so as to improve prediction accuracy.
The technical scheme adopted by the invention is as follows:
a method for predicting liquid chromatography retention time based on a deep graph network, comprising the steps of:
acquiring molecular structure information of the chemical substance to be detected, and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features, and an adjacency matrix;
and inputting the graph network information into a trained deep graph network model for predicting the liquid chromatography retention time, and predicting the liquid chromatography retention time using the deep graph network model.
Further, the node features include: atom type, chiral center type, chirality, degree, formal charge, hybridization, aromaticity, whether a hydrogen-bond donor or acceptor, whether a heteroatom, whether in a ring, number of hydrogen atoms, number of radical electrons, number of valence electrons, Crippen LogP contribution, Crippen molar refractivity contribution, Gasteiger charge, mass number, and topological polar surface area contribution. The edge features include: bond type, whether conjugated, whether part of a ring, whether rotatable, and the stereochemical information of the chemical bond. The adjacency matrix is constructed from the molecular chemical bonds.
Further, the deep graph network model comprises a graph network layer, a readout layer, and a linear layer; the graph network layer introduces molecular edge information into the message-passing process, introduces residual connections, and increases model depth to improve the prediction effect.
Further, the processing procedure of the graph network layer comprises the following steps:
transmitting the edge information between a source node u and a target node v, together with the information of the source node u, to the target node v, and aggregating at the target node v with a softmax function to obtain updated information m^l;
processing the updated information m^l with a linear layer, applying a nonlinear activation function σ, and finally summing the updated molecular information and the original molecular information h_v^l, i.e., performing the residual connection operation.
Further, the readout layer adopts an attention mechanism; the attention-based readout layer comprises a super virtual node, which is connected to each atomic node in the molecule, and the code of the super virtual node is first obtained by summation and then updated using the following formulas:
e_i = concat(c, n_i) · W + b
α_i = exp(e_i) / Σ_{j∈V} exp(e_j)
c = Σ_{i∈V} α_i · n_i
h_k, c_k = GRU(h_{k-1}, c_{k-1})
where c is the code of the super virtual node, n_i represents the code of each atomic node in the molecule, e_i is the weight after passing through the linear layer, α_i is the importance coefficient normalized by softmax (the α_i sum to one), V represents all atomic nodes in the molecule, GRU is a gated recurrent unit, c_k is the code of the super virtual node after the k-th pass through the graph attention mechanism, and h_k is the molecular code after the k-th update.
Further, the linear layer comprises 2 dense layers, wherein the hidden dimension of the first layer is 1024; after the first layer, a linear rectification function (ReLU) is applied, and the second layer projects the dimension to 1 to predict the retention time.
Further, the training process of the deep graph network model comprises the following steps: selecting a retention time dataset, dividing it into a training set, a verification set, and a test set, constructing the graph network information, and training the deep graph network model with a SmoothL1 loss function using the adaptive moment estimation (Adam) algorithm.
An apparatus for predicting liquid chromatography retention time based on a deep graph network, comprising:
a graph network information construction module for acquiring the molecular structure information of the chemical substance to be detected and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features, and an adjacency matrix;
and a retention time prediction module for inputting the graph network information into a trained deep graph network model for liquid chromatography retention time prediction, and predicting the liquid chromatography retention time using the deep graph network model.
The beneficial effects of the invention are as follows:
aiming at the problem that the existing traditional machine learning based on molecular fingerprints or molecular descriptors is low in prediction accuracy, the invention firstly proposes to introduce a deep map network to perform retention time prediction, and aiming at the problem of chemical substance retention time prediction, performs multiple optimization on a model, and further achieves the effect of improving prediction accuracy. Compared with the traditional machine learning method, the graph network model can use atomic level descriptors and meanwhile use structural information (graph network information) of chemical substances, so that a better prediction effect can be achieved.
The invention develops a deep graph convolutional network (DeepGCN-RT) model, which for the first time introduces residual connections into the model, introduces the edge (chemical bond) information of molecules, and introduces an attention-based graph network "readout" module, obtaining the model with the best prediction effect to date on the METLIN small-molecule retention time dataset (SMRT).
Furthermore, given that different studies typically use different liquid chromatography conditions, the present invention compares the effect of the developed model on other liquid chromatography datasets. The results show that, compared with literature models, the model developed by the invention significantly improves prediction accuracy on both the SMRT dataset and the transfer learning datasets. Finally, in LC-MS-based molecular identification using the RIKEN-PlaSMA dataset, DeepGCN-RT shows great advantages in reducing the number of candidate structures and improving top-k identification accuracy.
Drawings
Fig. 1. The model structure of the present invention.
Figure 2. Loss in training process of the present invention.
FIG. 3. Structure identification on the RIKEN-PlaSMA dataset. (a) Average number of candidate structures under different identification modes: the abscissa distinguishes structure identification with the MSFinder software alone from identification with MSFinder combined with the retention time prediction model developed in this study, and the ordinate gives the average number of candidate structures per chromatographic peak (averaged over the 100 chromatographic peaks). (b) Identification accuracy: the abscissa indicates whether the top-1, top-2, top-5, top-10, top-15, and top-20 candidate structures contain the true structure, the ordinate gives the proportion of correctly identified molecular structures, and the identification type denotes the identification means used (MSFinder alone versus MSFinder together with DeepGCN-RT).
Fig. 4. Predicted effect of the model of the present invention on the METLIN retention time dataset; the abscissa is the experimentally determined true retention time and the ordinate is the retention time predicted by the model developed in this study.
Fig. 5 is a histogram of the prediction error of the inventive model on the METLIN retention time dataset, with the abscissa representing the prediction error and the ordinate representing the corresponding count (count).
Detailed Description
The present invention will be further described in detail with reference to the following examples and drawings, so that the above objects, features and advantages of the present invention can be more clearly understood.
The invention relates to a method for predicting retention time based on a deep graph network, which includes the construction of chemical substance graph network information: node features, edge features, and the adjacency matrix. The chemical substance deep learning model adopted is a deep graph network model, which: introduces edge information into the message-passing process; uses residual connections to construct a deep graph network; and improves the "readout" module of the model with an attention-based readout to achieve a better prediction effect. The architecture of the model is shown in Fig. 1.
The specific scheme of the method for predicting the retention time based on the deep graph network is as follows:
1. construction of chemical substance graph network information
The construction of chemical graph network information includes constructing node features, edge features, and adjacency matrices.
The node features include: atom type, chiral center type, chirality, degree, formal charge, hybridization, aromaticity, whether a hydrogen-bond donor or acceptor, whether a heteroatom, whether in a ring, number of hydrogen atoms, number of radical electrons, number of valence electrons, Crippen LogP contribution, Crippen molar refractivity contribution, Gasteiger charge, mass number (divided by 100), and topological polar surface area contribution.
The adjacency matrix is built using chemical bonds of chemicals. In addition, the invention also introduces edge features into the information transfer process, wherein the edge features comprise: bond type, whether conjugated, whether part of a ring, whether rotatable, and steric structural information of the chemical bond.
The above information is constructed into node features, edge features, and the adjacency matrix using the open-source software RDKit, and this information is input into the graph network to predict the retention time.
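As an illustration of this construction step, the following is a minimal pure-Python sketch (a hypothetical stand-in for the RDKit-based pipeline, which is not reproduced here) showing how a symmetric adjacency matrix is assembled from a molecule's bond list:

```python
# Sketch: build a symmetric adjacency matrix from a molecule's bond list.
# In the actual pipeline, RDKit supplies the atoms and bonds; here a plain
# list of (u, v) atom-index pairs stands in for illustration.

def build_adjacency(num_atoms, bonds):
    """bonds: list of (u, v) atom-index pairs, one entry per chemical bond."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for u, v in bonds:
        adj[u][v] = 1
        adj[v][u] = 1  # chemical bonds are undirected
    return adj

# Ethanol heavy-atom skeleton C-C-O: atoms 0, 1, 2 and two bonds.
adj = build_adjacency(3, [(0, 1), (1, 2)])
print(adj)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

Node and edge feature rows (one per atom and per bond) would be stacked alongside this matrix in the same atom/bond order.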
2. Construction of deep layer graph network model
As shown in FIG. 1, the DeepGCN-RT model of the present invention consists of a graph network layer (GNN Layer), a readout layer (GNN Readout), and a linear layer (Dense Layer).
1. Graph network layer (GNN Layer)
The graph network is a graph convolutional network. The invention makes the following improvements on the basis of the GCN proposed by Kensert et al. (Kensert, A.; Bouwmeester, R.; Efthymiadis, K., et al., Graph convolutional networks for improved prediction and interpretability of chromatographic retention data. Anal. Chem. 2021, 93(47), 15633-15641.): adding the edge (chemical bond) information of molecules to the graph network model; adding residual connections to improve the model structure; and increasing the depth of the model to improve the predictive effect.
The GCN layer of Kensert et al. is as follows:

h_v^(l+1) = σ( Σ_{u∈N(v)} (1/c_uv) · h_u^l · W^l + b^l )   (1)

where u and v are the source node and the target node respectively, and N(v) is the set of all source nodes of v. c_uv is the square root of the node degree. σ denotes a nonlinear function. h_v^(l+1) is the molecular embedding of the target node v after l+1 updates, h_v^l is the molecular embedding of the target node v after l updates, l is the number of updates, b^l is the bias parameter of the l-th layer, and W^l is the weight parameter of the l-th layer.
The GCN layer first transmits the edge information between u and v, together with the information of the source node u, to the target node v; the target node v then aggregates with a softmax function, as shown in equations (2) and (3):

m_uv^l = h_u^l + e_uv   (2)
m^l = Σ_{u∈N(v)} softmax(m_uv^l) · m_uv^l   (3)

where h_u^l and e_uv denote the information of the source node and the edge information respectively, and m^l denotes the updated information. The information of the source node refers to the node features described above, the edge information refers to the edge features described above, and the source and target nodes are determined by the adjacency matrix described above.
Then, the obtained updated information m^l is processed with a linear layer (where l is the number of updates, b^l is the bias parameter of the l-th layer, and W^l is the weight parameter of the l-th layer) and passed through the nonlinear activation function σ. Finally, the updated molecular information and the original molecular information h_v^l are summed (i.e., the residual connection operation), as follows:

h_v^(l+1) = h_v^l + σ(m^l · W^l + b^l)   (4)
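The message-passing step described above (edge-augmented messages, softmax aggregation, a linear layer with activation, then a residual connection) can be sketched in pure Python with scalar node and edge features. This is an illustrative simplification under assumed scalar parameters, not the actual model code:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gcn_layer(h, e, neighbors, W, b):
    """One message-passing step with scalar features for clarity.

    h: node information h_u^l; e: edge information e[(u, v)];
    neighbors[v]: the source nodes u of target v (from the adjacency matrix);
    W, b: scalar linear-layer parameters.
    """
    h_next = []
    for v in range(len(h)):
        msgs = [h[u] + e[(u, v)] for u in neighbors[v]]  # node + edge info
        w = softmax(msgs)                                # softmax aggregation
        m = sum(wi * mi for wi, mi in zip(w, msgs))      # aggregated message
        act = max(0.0, m * W + b)                        # linear layer + ReLU as sigma
        h_next.append(h[v] + act)                        # residual connection
    return h_next

h = [0.5, 1.0, 0.2]
e = {(0, 1): 0.1, (1, 0): 0.1, (1, 2): 0.3, (2, 1): 0.3}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
print(gcn_layer(h, e, neighbors, W=1.0, b=0.0))
```

Because the residual term is added after a non-negative activation here, each updated node value is at least its previous value; in the real model the activation and dimensions differ.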
2. readout layer (GNN Readout)
Currently, graph-based readout mostly employs simple operations such as averaging and summation. To improve the prediction accuracy of the model, the invention adopts a readout layer based on an attention mechanism. Specifically, after the message-passing process, a molecular code is obtained for each atomic node in the molecule. The invention first creates a "super virtual" node and connects it to each atomic node. The code of the "super virtual" node is first obtained by summation and then updated using the following formulas:

e_i = concat(c, n_i) · W + b   (5)
α_i = exp(e_i) / Σ_{j∈V} exp(e_j)   (6)
c = Σ_{i∈V} α_i · n_i   (7)
h_k, c_k = GRU(h_{k-1}, c_{k-1})   (8)

where c is the code of the "super virtual" node, n_i represents the code of each atomic node in the molecule, and V represents all atomic nodes in the molecule. e_i is the weight after passing through the linear layer. α_i is the importance coefficient normalized by softmax, and the α_i sum to one. GRU is a gated recurrent unit. c_k is the code of the super virtual node after the k-th pass through the graph attention mechanism, and h_k is the molecular code after the k-th update.
The attention-based readout of the invention achieves a better retention time prediction effect because the graph attention mechanism can effectively capture the information useful for the target task. In addition, the gated recurrent unit performs well at retaining information and filtering out invalid information. Combining the two achieves a better effect in capturing the global features of chemical molecules.
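The attention weighting at the heart of this readout (linear scoring, softmax normalization, weighted summation) can be sketched as follows. Scalar node codes and scalar weights are used for clarity, and the GRU update of Eq. (8) is omitted, so this is only an illustration, not the model's readout code:

```python
import math

def attention_readout(c, nodes, w, b):
    """Attention readout sketch with scalar codes.

    c: super-virtual-node code; nodes: atomic node codes n_i;
    w, b: assumed linear-layer parameters applied to concat(c, n_i).
    """
    # Score each atomic node: e_i = concat(c, n_i) * W + b
    # (scalar form: w[0]*c + w[1]*n_i + b)
    e = [w[0] * c + w[1] * n + b for n in nodes]
    # Softmax normalization: the alpha_i sum to one
    m = max(e)
    exps = [math.exp(x - m) for x in e]
    total = sum(exps)
    alphas = [x / total for x in exps]
    # Attention-weighted sum updates the super virtual node
    c_new = sum(a * n for a, n in zip(alphas, nodes))
    return c_new, alphas

c_new, alphas = attention_readout(c=0.0, nodes=[1.0, 2.0, 3.0], w=(0.5, 0.5), b=0.0)
print(round(sum(alphas), 6))  # 1.0
```

With these assumed weights, larger node codes receive larger attention coefficients, so the updated super-virtual-node code is pulled toward the most heavily weighted atoms.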
3. Linear Layer (Dense Layer)
The code from the readout layer is input to the linear layer, which consists of 2 dense layers; the hidden dimension of the first layer is 1024. After the first layer, a linear rectification function (ReLU) is applied, and the second layer then projects the dimension to 1 for retention time prediction.
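A minimal sketch of such a two-layer dense head follows, with a much smaller hidden dimension than the 1024 used by the model and randomly initialized weights, purely for illustration:

```python
import random

def dense_head(x, W1, b1, W2, b2):
    """Two-layer dense head sketch: input -> hidden (ReLU) -> scalar output."""
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
              for row, bi in zip(W1, b1)]                 # first layer + ReLU
    return sum(w * h for w, h in zip(W2, hidden)) + b2    # project to dimension 1

random.seed(0)
d_in, d_hidden = 4, 8  # the patent uses hidden dimension 1024; small here
W1 = [[random.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [0.0] * d_hidden
W2 = [random.uniform(-0.5, 0.5) for _ in range(d_hidden)]
rt = dense_head([0.1, 0.2, 0.3, 0.4], W1, b1, W2, 0.0)
print(rt)  # a single scalar: the predicted retention time
```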
3. Retention time prediction
Training phase: an existing dataset containing the structural information of chemical substances and their experimentally measured retention times, such as the METLIN retention time dataset, is divided into a training set, a verification set, and a test set; the graph network information is constructed using the graph-network-information construction part described above; and the DeepGCN-RT model is then trained with a SmoothL1 loss function using the adaptive moment estimation (Adam) algorithm.
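For reference, the SmoothL1 loss used in training is quadratic for small errors and linear for large ones, which makes it less sensitive to retention-time outliers than a pure squared error. A minimal per-sample sketch, assuming the conventional threshold beta = 1 (the text does not state the threshold):

```python
def smooth_l1(pred, target, beta=1.0):
    """SmoothL1 loss: 0.5*d^2/beta for |d| < beta, else |d| - 0.5*beta."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

print(smooth_l1(10.5, 10.0))  # 0.125 (quadratic regime, small error)
print(smooth_l1(12.0, 10.0))  # 1.5   (linear regime, large error)
```

In practice the loss is averaged over a batch before the Adam update step.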
Retention time prediction phase: the simplified molecular-input line-entry system (SMILES) string of the chemical substance to be detected is obtained; the descriptors and molecular structure information of the chemical substance are extracted using the open-source software RDKit to complete the construction of the graph network information; the constructed graph network information (i.e., node features, edge features, and adjacency matrix) is input into the trained DeepGCN-RT model; and the model outputs the retention time prediction result.
4. Examples
1. Model training
The METLIN retention time dataset, derived from the METLIN laboratory and containing the structural information and experimentally determined retention times of 80038 chemical substances, was selected for model training. The invention divides the dataset into a training set, a verification set, and a test set, and on this basis constructs the graph network information using the graph-network-information construction part.
The training process of the model is based on this dataset, using a SmoothL1 loss function and the adaptive moment estimation (Adam) algorithm. The hidden layer dimension of the model was 200, the dense layer dimension was 1024, the dropout ratio was 0.1, and the batch size was 64. The training results are shown in Fig. 2, where train_loss denotes the training-set loss during training, valid_mae the mean absolute error on the verification set, and test_mae the mean absolute error on the test set.
Fig. 4 shows the predictive effect of the model of the present invention on the METLIN retention time dataset, and Fig. 5 the prediction error of the model on the same dataset. As can be seen from Fig. 4 and Fig. 5, the prediction error of the model of the present invention is small and the prediction accuracy is high.
2. The technical proposal of the invention has the beneficial effects that
The effect of the retention time prediction model developed by the invention is far better than that of a model reported in literature.
2.1 Comparison of the model Effect of the invention with the model Effect of the prior art
Comparing the effect of the model of the present invention with literature models, as shown in Table 1, the mean absolute error (MAE) of the model of the invention is the lowest, and its median absolute error (MedAE) and mean absolute percentage error (MAPE) are lower than those of the models reported in the literature.
TABLE 1 comparison of the effects of the model of the invention (DeepGCN-RT) and the literature model
Model | MAE(s)↓ | MedAE(s)↓ | MAPE↓ | R2↑ | Reference |
GCN | 29.4 | - | 0.04 | 0.89 | Kensert et al.,Anal.Chem.2021 |
DNNpwa | 39.62 | 25.08 | 0.05 | 0.85 | Ju et al.,Anal.Chem.2021 |
GNN-RT | 39.87 | 25.24 | 0.05 | 0.85 | Yang et al.,Anal.Chem.2021 |
DeepGCN-RT | 26.46 | 12.39 | 0.03 | 0.89 | - |
Among them, the results of GCN, DNNpwa, GNN-RT are cited in the following documents:
Kensert, A.; Bouwmeester, R.; Efthymiadis, K., et al., Graph convolutional networks for improved prediction and interpretability of chromatographic retention data. Anal. Chem. 2021, 93(47), 15633-15641.
Ju, R.; Liu, X.; Zheng, F., et al., Deep Neural Network Pretrained by Weighted Autoencoders and Transfer Learning for Retention Time Prediction of Small Molecules. Anal. Chem. 2021, 93(47), 15651-15658.
Yang, Q.; Ji, H.; Lu, H., et al., Prediction of liquid chromatographic retention time with graph neural networks to assist in small molecule identification. Anal. Chem. 2021, 93(4), 2200-2206.
Furthermore, the present invention explores the effect of the residual connection and of model depth on the prediction effect, as shown in Table 2. Overall, at the same number of layers, adding the residual connection clearly improves the effect; and with the residual connection in place, the effect of the model gradually improves as the depth of the model increases.
TABLE 2 influence of residual connection and model depth on model effect
In addition, the effects of different readout layers are shown in Table 3, where DeepGCN-RT uses the attention-based readout. It can be seen that the average readout is better than the sum readout, while the attention-based readout introduced by the present invention performs best.
TABLE 3 Effect of different readout layers
2.2 Transfer learning effect
Since different studies generally use different liquid phase conditions, a model built on the SMRT dataset cannot be directly used for datasets acquired under other liquid phase conditions. To test the generalization ability of the model, 7 reversed-phase liquid chromatography datasets and 2 hydrophilic interaction chromatography datasets were collected from the PredRet database (Stanstrup, J.; Neumann, S.; Vrhovsek, U., PredRet: prediction of retention time by direct mapping between multiple chromatographic systems. Anal. Chem. 2015, 87(18), 9421-8.), and the model obtained by SMRT training was used for transfer learning to obtain the transfer learning model DeepGCN-RT-TL. The model effects are shown in Table 4:
Table 4. Comparison of transfer learning effects
It can be found that the effect of the model of the invention is far better than that of the models reported in the literature. The results of DNNpwa-TL and GNN-RT-TL are respectively cited from the following documents:
Ju, R.; Liu, X.; Zheng, F., et al., Deep Neural Network Pretrained by Weighted Autoencoders and Transfer Learning for Retention Time Prediction of Small Molecules. Anal. Chem. 2021, 93(47), 15651-15658.
Yang, Q.; Ji, H.; Lu, H., et al., Prediction of liquid chromatographic retention time with graph neural networks to assist in small molecule identification. Anal. Chem. 2021, 93(4), 2200-2206.
2.3 Application of model to small molecule structure identification
A retention time prediction model is ultimately built for the structural identification of compounds. The present invention therefore selects the RIKEN-PlaSMA dataset from the MoNA database for structural identification. The dataset consists of 434 small-molecule compounds; 334 compounds are taken to establish a transfer learning model, and the remaining 100 compounds are used for structure identification. Structure identification uses the MSFinder software together with the transfer learning model of the invention, with the results shown in Fig. 3. It can be seen that DeepGCN-RT of the present invention shows great advantages in reducing the number of candidate structures and improving top-k identification accuracy: the average number of candidate structures is reduced from 50 to 35, and the top-k accuracy is also significantly improved.
In summary, the present invention provides a method for predicting retention time based on a deep graph network. The method performs better than all existing models reported in the literature.
Although the above case analysis of the method of the present invention is based on liquid chromatography, the application of the invention is not limited to liquid chromatography; the model of this study can also be applied to other chromatographic techniques, such as gas chromatography.
Based on the same inventive concept, another embodiment of the present invention provides an apparatus for predicting liquid chromatography retention time based on a deep graph network, comprising:
the graph network information construction module is used for acquiring the molecular structure information of the chemical substance to be analyzed and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features and an adjacency matrix;
and the retention time prediction module is used for inputting the graph network information into a trained deep graph network model for liquid chromatography retention time prediction, and predicting the liquid chromatography retention time with the deep graph network model.
The specific implementation of each module is described in the foregoing description of the method of the present invention.
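The graph network information construction module can be sketched as follows, using a hand-written description of ethanol's heavy atoms (C-C-O). In practice a cheminformatics toolkit such as RDKit would parse the structure and supply the full descriptor set (Crippen contributions, Gasteiger charges, etc.); the three placeholder features used here, and the function names, are illustrative assumptions.

```python
def build_graph(atoms, bonds):
    """atoms: list of per-atom feature dicts; bonds: list of (i, j, feature dict)."""
    n = len(atoms)
    adjacency = [[0] * n for _ in range(n)]   # adjacency follows chemical bonds
    edge_features = {}
    for i, j, feat in bonds:
        adjacency[i][j] = adjacency[j][i] = 1
        edge_features[(i, j)] = edge_features[(j, i)] = feat
    # placeholder node features (the patent lists ~19 descriptors)
    node_features = [[a["formal_charge"], a["num_h"], int(a["aromatic"])]
                     for a in atoms]
    return node_features, edge_features, adjacency

# ethanol, heavy atoms only: C(0)-C(1)-O(2)
atoms = [{"formal_charge": 0, "num_h": 3, "aromatic": False},
         {"formal_charge": 0, "num_h": 2, "aromatic": False},
         {"formal_charge": 0, "num_h": 1, "aromatic": False}]
bonds = [(0, 1, {"type": "single", "conjugated": False, "in_ring": False}),
         (1, 2, {"type": "single", "conjugated": False, "in_ring": False})]
node_features, edge_features, adjacency = build_graph(atoms, bonds)
```

The adjacency matrix encodes only bonded pairs, while bond properties (type, conjugation, ring membership) travel separately as edge features, matching the three-part graph information described above.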
Based on the same inventive concept, another embodiment of the present invention provides a computer device (computer, server, smart phone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
The above-disclosed embodiments of the present invention are intended to aid understanding of the invention and to enable it to be put into practice. It will be understood by those of ordinary skill in the art that various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention. The invention should not be limited to what has been disclosed in the embodiments of the specification; its scope is defined by the claims.
Claims (6)
1. A method for predicting liquid chromatography retention time based on a deep graph network, comprising the steps of:
acquiring molecular structure information of a chemical substance to be analyzed, and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features and an adjacency matrix;
inputting the graph network information into a trained deep graph network model for predicting the retention time of liquid chromatography, and predicting the retention time of the liquid chromatography by using the deep graph network model;
the node features include: atom type, chiral center type, chirality, atom degree, formal charge, hybridization, aromaticity, whether a hydrogen-bond donor, whether a hydrogen-bond acceptor, whether a heteroatom, whether in a ring, the number of hydrogen atoms, the number of radical electrons, the number of valence electrons, the Crippen LogP contribution, the Crippen molar refractivity contribution, the Gasteiger charge, the mass number, and the topological polar surface area contribution; the edge features include: the bond type, whether conjugated, whether part of a ring, whether rotatable, and the stereochemical information of the chemical bond; the adjacency matrix is constructed according to the molecular chemical bonds;
the deep graph network model comprises a graph network layer, a readout layer and a linear layer; the graph network layer introduces the chemical bond information of the molecule into the message passing process and introduces residual connections, so that the model depth can be increased to improve the prediction effect; the graph network is a graph convolutional network;
the processing procedure of the graph network layer comprises the following steps:
transmitting the edge information between a source node u and a target node v, together with the information of the source node u, to the target node v, and aggregating at the target node v using a softmax function to obtain updated information m_l;
processing the updated information m_l with a linear layer and a nonlinear activation function σ, and finally adding the updated molecular information to the original molecular information, namely performing a residual connection operation;
the readout layer is an attention-mechanism-based readout layer; the attention-mechanism-based readout layer comprises a super virtual node connected to each atomic node in the molecule, and the encoding of the super virtual node is first obtained by summation and then updated using the following formulas:
e_i = concat(c, n_i) * W + b
α_i = softmax(e_i)
h_k, c_k = GRU(h_{k-1}, c_{k-1})
wherein c is the encoding of the super virtual node; n_i denotes the encoding of each atomic node in the molecule; e_i is the weight after the linear layer; α_i is the importance coefficient, normalized with a softmax over all atomic nodes in the molecule so that the coefficients sum to one; GRU is a gated recurrent unit; c_k is the encoding of the super virtual node computed by the k-th pass of the graph attention mechanism; h_k is the molecular encoding after the k-th update.
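The graph network layer recited in claim 1 can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the patented architecture: the scalar attention score (the sum of message features) and the feature dimensions are chosen only for demonstration.

```python
import torch
import torch.nn as nn

class GraphLayer(nn.Module):
    """One layer: edge-augmented messages, softmax aggregation, linear + residual."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.act = nn.ReLU()  # the nonlinear activation σ

    def forward(self, h, edge_index, edge_attr):
        src, dst = edge_index                 # each of shape (num_edges,)
        msg = h[src] + edge_attr              # bond information enters the message
        weight = torch.zeros(msg.shape[0])
        for v in dst.unique():                # softmax over edges entering node v
            mask = dst == v
            weight[mask] = torch.softmax(msg[mask].sum(-1), dim=0)
        m_l = torch.zeros_like(h)             # aggregated updated information m_l
        m_l.index_add_(0, dst, weight.unsqueeze(-1) * msg)
        return h + self.act(self.lin(m_l))    # residual connection
```

Because the layer output is the input plus a transformed update, many such layers can be stacked without vanishing gradients, which is what allows the model depth to be increased.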
2. The method of claim 1, wherein the linear layer comprises 2 linear layers; the hidden dimension of the first layer is 1024, a linear rectification function (ReLU) is applied after the first layer, and the second layer then projects the dimension to 1 to predict the retention time.
3. The method of claim 1, wherein the training process of the deep graph network model comprises: selecting a retention time dataset, dividing it into a training set, a validation set and a test set, constructing the graph network information, and training the deep graph network model with a SmoothL1 loss function and the adaptive moment estimation (Adam) algorithm.
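The training procedure of claim 3 can be sketched as follows. The 8:1:1 split ratio, learning rate and epoch count are illustrative assumptions not stated in the claim.

```python
import random
import torch
import torch.nn as nn

def split_dataset(data, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Randomly split a dataset into training, validation and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def train(model, x, y, epochs=100, lr=1e-3):
    # adaptive moment estimation (Adam) optimizer with SmoothL1 loss
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.SmoothL1Loss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x).squeeze(-1), y)
        loss.backward()
        opt.step()
    return loss.item()
```

SmoothL1 behaves like a squared error near zero and like an absolute error for large residuals, making the training less sensitive to outlier retention times than a plain MSE.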
4. An apparatus for predicting liquid chromatography retention time based on a deep graph network, comprising:
the graph network information construction module is used for acquiring the molecular structure information of the chemical substance to be analyzed and constructing graph network information according to the molecular structure information, wherein the graph network information comprises node features, edge features and an adjacency matrix;
the retention time prediction module is used for inputting the graph network information into a trained deep graph network model for liquid chromatography retention time prediction, and predicting the liquid chromatography retention time with the deep graph network model;
the node features include: atom type, chiral center type, chirality, atom degree, formal charge, hybridization, aromaticity, whether a hydrogen-bond donor, whether a hydrogen-bond acceptor, whether a heteroatom, whether in a ring, the number of hydrogen atoms, the number of radical electrons, the number of valence electrons, the Crippen LogP contribution, the Crippen molar refractivity contribution, the Gasteiger charge, the mass number, and the topological polar surface area contribution; the edge features include: the bond type, whether conjugated, whether part of a ring, whether rotatable, and the stereochemical information of the chemical bond; the adjacency matrix is constructed according to the molecular chemical bonds;
the deep graph network model comprises a graph network layer, a readout layer and a linear layer; the graph network layer introduces the chemical bond information of the molecule into the message passing process and introduces residual connections, so that the model depth can be increased to improve the prediction effect; the graph network is a graph convolutional network;
the processing procedure of the graph network layer comprises the following steps:
transmitting the edge information between a source node u and a target node v, together with the information of the source node u, to the target node v, and aggregating at the target node v using a softmax function to obtain updated information m_l;
processing the updated information m_l with a linear layer and a nonlinear activation function σ, and finally adding the updated molecular information to the original molecular information, namely performing a residual connection operation;
the readout layer is an attention-mechanism-based readout layer; the attention-mechanism-based readout layer comprises a super virtual node connected to each atomic node in the molecule, and the encoding of the super virtual node is first obtained by summation and then updated using the following formulas:
e_i = concat(c, n_i) * W + b
α_i = softmax(e_i)
h_k, c_k = GRU(h_{k-1}, c_{k-1})
wherein c is the encoding of the super virtual node; n_i denotes the encoding of each atomic node in the molecule; e_i is the weight after the linear layer; α_i is the importance coefficient, normalized with a softmax over all atomic nodes in the molecule so that the coefficients sum to one; GRU is a gated recurrent unit; c_k is the encoding of the super virtual node computed by the k-th pass of the graph attention mechanism; h_k is the molecular encoding after the k-th update.
5. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-3.
6. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211374166.0A CN116106461B (en) | 2022-11-03 | 2022-11-03 | Method and device for predicting liquid chromatograph retention time based on deep graph network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116106461A CN116106461A (en) | 2023-05-12 |
CN116106461B true CN116106461B (en) | 2024-02-06 |
Family
ID=86258567
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111899510A (en) * | 2020-07-28 | 2020-11-06 | 南京工程学院 | Intelligent traffic system flow short-term prediction method and system based on divergent convolution and GAT |
CN113192559A (en) * | 2021-05-08 | 2021-07-30 | 中山大学 | Protein-protein interaction site prediction method based on deep map convolution network |
CN113241128A (en) * | 2021-04-29 | 2021-08-10 | 天津大学 | Molecular property prediction method based on molecular space position coding attention neural network model |
CN113241130A (en) * | 2021-06-08 | 2021-08-10 | 西南交通大学 | Molecular structure prediction method based on graph convolution network |
CN113299354A (en) * | 2021-05-14 | 2021-08-24 | 中山大学 | Small molecule representation learning method based on Transformer and enhanced interactive MPNN neural network |
CN114121178A (en) * | 2021-12-07 | 2022-03-01 | 中国计量科学研究院 | Chromatogram retention index prediction method and device based on graph convolution network |
CN114565187A (en) * | 2022-04-01 | 2022-05-31 | 吉林大学 | Traffic network data prediction method based on graph space-time self-coding network |
CN114629674A (en) * | 2021-11-11 | 2022-06-14 | 北京计算机技术及应用研究所 | Attention mechanism-based industrial control network security risk assessment method |
CN114818515A (en) * | 2022-06-24 | 2022-07-29 | 中国海洋大学 | Multidimensional time sequence prediction method based on self-attention mechanism and graph convolution network |
CN115148302A (en) * | 2022-05-18 | 2022-10-04 | 上海天鹜科技有限公司 | Compound property prediction method based on graph neural network and multi-task learning |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |