CN114997366A - Protein structure model quality evaluation method based on graph neural network - Google Patents
Protein structure model quality evaluation method based on graph neural network
- Publication number
- CN114997366A, CN202210557804.6A
- Authority
- CN
- China
- Prior art keywords
- protein
- graph
- model
- network
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Biology (AREA)
- Biotechnology (AREA)
- Biomedical Technology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Databases & Information Systems (AREA)
- Physiology (AREA)
- Bioethics (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
A protein structure model quality evaluation method based on a graph neural network comprises: extracting, as input, the global three-dimensional coordinates of all heavy atoms (167 types) in the protein decoy to be evaluated; preprocessing the structure with the dual-space relax protocol of the Rosetta software; analyzing and computing the original node features and edge features; feeding these features into a graph neural network model built on an attention mechanism and graph pooling; and computing, with pre-trained network parameters, a global score and a local score that reflect the difference between the decoy structure and the real native protein structure at the protein level and the amino acid residue level, respectively. The invention discriminates more finely among decoy structures whose accuracy is close to that of the native structure, thereby improving the accuracy of protein structure model evaluation on high-accuracy data sets.
Description
Technical Field
The invention relates to a technology in the field of bioengineering, in particular to a protein structure model quality evaluation method based on a graph neural network, which evaluates the accuracy of a model structure when the corresponding native structure is unknown.
Background
Protein structure model evaluation aims to quantify, without knowledge of the native structure, how accurate a decoy structure generated by a protein structure prediction method is relative to that native structure. For protein structural bioinformatics, a model evaluation tool is essential infrastructure: a good evaluation method provides important guidance for tasks such as protein structure prediction, structure-based protein function prediction and structure-based protein docking. Protein structure model evaluation methods are generally divided into multi-model methods and single-model methods. The former take the whole pool of protein structure models as input and evaluate a target model by comparing it with the other models in the pool, using the average structural similarity as the quality index; the latter predict quality using only the sequence and structure information of the individual decoy itself. Multi-model methods can reach spuriously high accuracy when most models in the pool are of good quality, but their running time often grows exponentially. Single-model methods perform more consistently, depend less on the quality distribution of the models being evaluated, and are closer to what a model evaluation algorithm is expected to do, namely to learn the true structure that a sequence is likely to form under natural conditions. In 2020, the models led by AlphaFold 2, Google's latest AI algorithm, demonstrated at the CASP14 structure prediction competition that deep learning on amino acid co-evolution data is sufficient to predict decoy structures of considerable accuracy for all protein sequences of unknown structure.
Because past evaluation techniques were geared toward recognizing decoy structures that are only roughly native-like, the appearance of large numbers of highly accurate decoy structures has caused the accuracy of almost all evaluation methods to drop considerably. In other words, although model evaluation is an important guiding component of structure prediction, the prior art cannot keep up with the rapid development of structure prediction algorithms.
Existing multi-model quality evaluation methods are slow and strongly affected by the distribution of the protein structure model pool; single-model quality evaluation methods are closer to a real solution of the problem, but their learning performance is poor; and almost all existing quality evaluation methods have difficulty making reliable predictions on high-accuracy protein models.
Disclosure of Invention
To address the poor performance of existing protein structure model evaluation methods on sets of high-accuracy protein structure models, the invention provides a protein structure model quality evaluation method based on a graph neural network. It combines deep learning with domain knowledge of protein structure and applies a neural network built on a triple attention mechanism to protein structure model evaluation. It is the first to focus on, and to be optimized for, the evaluation of high-accuracy protein structure models, and it discriminates more finely among decoy structures whose accuracy is close to that of the native structure, thereby improving the accuracy of protein structure model evaluation on high-accuracy data sets.
The invention is realized by the following technical scheme:
The invention relates to a protein structure model quality evaluation method based on a graph neural network, in which the global three-dimensional coordinates of all heavy atoms (167 types) in the protein decoy to be evaluated are extracted as input and preprocessed with the dual-space relax protocol of the Rosetta software; the original node features and edge features are obtained by analysis and calculation and fed into a graph neural network model built on an attention mechanism and graph pooling; and pre-trained network parameters are used to compute a global score and a local score that reflect the difference between the decoy structure and the real native protein structure at the protein level and the amino acid residue level, respectively.
The 167 heavy atoms are all the atoms of relatively large atomic mass (that is, the non-hydrogen atoms) contained in the 20 standard amino acid residues that make up natural proteins; their positions relative to other atoms are considered to determine the backbone position and the relative placement of the protein in space.
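The figure of 167 can be checked by counting the non-hydrogen atoms of each standard residue as it appears inside a chain. The sketch below does that and builds a flat one-hot index from the counts. The ordering of residues and of atoms within a residue is an assumption made for illustration (the text does not give the patent's own indexing), and `atom_type_index` and `one_hot` are hypothetical helper names.

```python
# Heavy (non-hydrogen) atoms per standard residue inside a chain:
# backbone N, CA, C, O plus the side chain; the terminal OXT is excluded.
HEAVY_ATOM_COUNTS = {
    "GLY": 4, "ALA": 5, "SER": 6, "CYS": 6, "VAL": 7,
    "THR": 7, "PRO": 7, "ILE": 8, "LEU": 8, "ASP": 8,
    "ASN": 8, "MET": 8, "GLU": 9, "GLN": 9, "LYS": 9,
    "HIS": 10, "PHE": 11, "ARG": 11, "TYR": 12, "TRP": 14,
}

TOTAL_TYPES = sum(HEAVY_ATOM_COUNTS.values())


def atom_type_index(residue, position):
    """Flat index of the position-th heavy atom of a residue (assumed ordering)."""
    offset = 0
    for name, count in HEAVY_ATOM_COUNTS.items():
        if name == residue:
            if not 0 <= position < count:
                raise ValueError("atom position out of range for " + residue)
            return offset + position
        offset += count
    raise KeyError(residue)


def one_hot(index, size):
    """One-hot vector of the given size with a single 1 at index."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Node feature of the first heavy atom of a glycine residue.
gly_first = one_hot(atom_type_index("GLY", 0), TOTAL_TYPES)
```

Keying the atom type on the pair (residue type, atom position) rather than on the chemical element alone is what makes the vocabulary 167 entries long.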
The graph neural network model built on the attention mechanism and graph pooling comprises: a graph neural computation network formed by connecting in series four network layers of identical architecture, and a graph pooling network, connected to it, formed by two pooling layers in parallel, wherein: each network layer of the graph neural computation network consists of a triple attention mechanism and a channel attention mechanism connected in series, and the two parallel pooling layers of the graph pooling network perform local pooling and global pooling, respectively, on the node embeddings of the upstream graph.
The pre-trained model network parameters are obtained by training the model architecture on the training set for ten epochs and evaluating it on the validation set and the test set, specifically:
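The channel attention half of each layer can be pictured as a squeeze-and-excitation style gate: average each channel over the graph, pass the result through two small fully connected layers (ReLU, then sigmoid), and rescale every channel by the resulting weight. The sketch below is an illustration under stated assumptions, not the patent's layer: the averaging axis, the absence of biases, and the weight matrices `w1` and `w2` are all hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-x)) for x in vector]


def squeeze_excite(features, w1, w2):
    """Rescale each channel of `features` (rows = nodes, columns = channels).

    w1 has shape (hidden, channels) and w2 has shape (channels, hidden).
    Returns the rescaled features and the per-channel attention weights.
    """
    n = len(features)
    channels = len(features[0])
    # Squeeze: one average value per channel.
    squeezed = [sum(row[c] for row in features) / n for c in range(channels)]
    # Excite: bottleneck MLP ending in a sigmoid gate in (0, 1).
    weights = sigmoid(matvec(w2, relu(matvec(w1, squeezed))))
    scaled = [[x * w for x, w in zip(row, weights)] for row in features]
    return scaled, weights


# With all-zero weights the gate is sigmoid(0) = 0.5 for every channel.
scaled, gate = squeeze_excite([[1.0, 2.0], [3.0, 4.0]],
                              w1=[[0.0, 0.0]],
                              w2=[[0.0], [0.0]])
```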
Step one, take a sample protein decoy structure from the training set, apply dual-space relax, then obtain the original node features and edge features from the structural analysis data and construct the training data queue $Q_{train} = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$, wherein: $X$ is the protein graph constructed from the original features of the sample, and $Y$ is the true label value of the sample.
Step two, feed the data pairs in the training data queue $Q_{train}$ into the graph neural network model in batches to obtain the output $\delta_\theta(X)$, then compute the loss $Loss = \sum_{batch} (Y - \delta_\theta(X))^2 + \sum_{label > 0.5} (Y - \delta_\theta(X))^2$, wherein: $\delta_\theta$ is the graph neural network model.
Step three, according to the loss function, use the stochastic gradient descent algorithm to update the parameters $\theta_k$ of the network $\delta_\theta$ at the k-th iteration, specifically $\theta_{k+1} = \theta_k - \lambda \nabla_{\theta_k} Loss$, wherein: $\lambda$ is the preset learning rate, 0.001, and $\nabla_{\theta_k} Loss$ is the corresponding gradient obtained from the loss function.
Step four, after all samples in the training set have been traversed, feed the samples of the validation set into the network $\delta_\theta$ for testing, compute the correlation coefficient and the absolute difference between the predicted values and the label values of all validation samples, compare them with the results of previously trained models, and keep the better version by replacing the stored one.
Step five, after the specified number of training iterations, load the historically best model parameters stored according to the validation set, feed in the samples of the test set, compute the prediction scores of the model on the test set and the resulting correlation coefficient, and judge from the correlation coefficient whether the current model design is reasonable, so as to decide whether to stop training or to adjust the relevant hyper-parameters.
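The loss of step two and the update of step three can be sketched in a few lines of plain Python. This is a minimal illustration rather than the patent's implementation: the names `emphasized_squared_loss` and `sgd_step` are hypothetical, scalar predictions stand in for the network output, and the gradient is passed in directly instead of being obtained by back-propagation.

```python
def emphasized_squared_loss(labels, preds, threshold=0.5):
    """Squared error over the batch plus a second squared-error term
    restricted to samples whose label exceeds the threshold."""
    base = sum((y - p) ** 2 for y, p in zip(labels, preds))
    high = sum((y - p) ** 2 for y, p in zip(labels, preds) if y > threshold)
    return base + high


def sgd_step(theta, grad, lr=0.001):
    """One gradient descent update: theta_next = theta - lr * grad."""
    return [t - lr * g for t, g in zip(theta, grad)]


# Toy batch: the first sample has a label above 0.5, so its error counts twice.
labels = [0.9, 0.3]
preds = [0.7, 0.4]
loss = emphasized_squared_loss(labels, preds)

# Toy parameter vector and gradient.
theta_next = sgd_step([1.0, -2.0], [10.0, -20.0])
```

Counting the error of high-label samples twice is what steers the model toward the high-accuracy decoys the method targets.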
The training set and the validation set are obtained by taking experimentally determined protein structures of different amino acid sequences from a human protein database, perturbing them with different prediction methods and prediction software to generate a large number of decoy structures, and dividing these decoys in a ratio of 8:2.
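An 8:2 division can be sketched as below. The text does not say whether the split is made per decoy or per target protein, nor whether it is random, so the shuffled per-decoy split, the seed and the function name `split_decoys` are assumptions for illustration only.

```python
import random


def split_decoys(decoy_ids, train_fraction=0.8, seed=0):
    """Shuffle the decoy identifiers and cut them into train and validation."""
    ids = list(decoy_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]


all_ids = ["decoy_%d" % i for i in range(10)]
train_ids, val_ids = split_decoys(all_ids)
```

When decoys of the same target end up on both sides of the cut, validation scores can look better than they should; splitting by target avoids that, if the data allow it.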
The test set is the CASP standard data set, the official data set provided by the CASP competition. CASP, the Critical Assessment of protein Structure Prediction, is a worldwide, community-wide experiment. Each round of CASP selects a batch of protein sequences whose structures are not yet known but are about to be solved by leading laboratories around the world, provides them to all participating groups for structure prediction, and then collects all the predicted structures and passes them to the groups in the model evaluation category for scoring. Its official data set is therefore the gold standard in the field of research to which the invention belongs.
The protein graph is a data representation of a decoy structure. Each protein graph consists of nodes and edges, wherein the nodes represent the heavy atoms in the protein and the edges represent the relation between each node atom and its 50 nearest neighbor atoms in three-dimensional space. Each node and each edge carries original features.
The original node feature is a one-hot vector that encodes the atom type at the node.
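The edge set of such a graph can be sketched as a k-nearest-neighbor search over atom coordinates. The helper `knn_adjacency` below is hypothetical and uses a brute-force search, which is adequate for illustration but not for large proteins.

```python
import math


def knn_adjacency(coords, k=50):
    """adj[i][j] = 1 when atom j is among the k atoms closest to atom i."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(coords[i], coords[j]))
        for j in others[:k]:
            adj[i][j] = 1
    return adj


# Four atoms on a line; k = 1 keeps only the single nearest neighbor.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
toy_adj = knn_adjacency(toy_coords, k=1)
```

The resulting matrix is directed: atom 2 lists atom 1 as its nearest neighbor, but atom 1 lists atom 0. Whether the graph is symmetrized afterwards is not stated in the text.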
The edge characteristics specifically include:
Category 1) edge distance: for every two connected atoms in the protein graph, their relative distance in the sample is computed. In this work, a vector representation of the distance is generated by Gaussian expansion, with means spaced uniformly over [0, 15] with a step of 0.4.
Category 2) edge coordinates: to account for the pronounced anisotropy of relative atomic positions within proteins, the distances above are supplemented with a set of directional features. Each heavy atom B forms a local reference frame together with its preceding atom A and its following atom C in the amino acid sequence; three unit vectors computed from the coordinates of A, B and C define this frame. When the edge coordinate feature of any two atoms $B_i$, $B_j$ in space is considered, the three-dimensional coordinates of $B_i$ and $B_j$ are each re-expressed in the local reference frame of the other. The edge (i, j) of the protein graph thus obtains 2 x 3 features, namely the projection of the spatial coordinates of atom $B_i$ ($B_j$) onto the local coordinate system of atom $B_j$ ($B_i$).
Category 3) chemical bond: a Boolean feature, set to 1 if the two connected atoms are joined by a chemical bond in the actual protein structure and to 0 otherwise.
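The Gaussian expansion of category 1 turns a scalar distance into a smooth vector whose entries peak where the distance matches a center. In the sketch below, the centers start at 0 and advance in steps of 0.4 without exceeding 15, as the text describes; the Gaussian width, and the reading that the last center is the largest multiple of 0.4 not above 15, are assumptions, and `gaussian_expand` is a hypothetical name.

```python
import math


def gaussian_expand(distance, start=0.0, stop=15.0, step=0.4, width=0.4):
    """Expand a distance into Gaussian responses at evenly spaced centers."""
    n_centers = int(math.floor((stop - start) / step + 1e-9)) + 1
    centers = [start + i * step for i in range(n_centers)]
    return [math.exp(-((distance - c) ** 2) / (2.0 * width ** 2))
            for c in centers]


# A distance of 4.0 coincides with the center at index 10 (10 * 0.4).
features = gaussian_expand(4.0)
```

A smooth encoding of this kind lets the network treat nearby distances as similar instead of learning from one raw scalar.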
The true label value is the accuracy score of a decoy structure relative to its corresponding real structure. Like the prediction scores, the true label values are divided into a global and a local category. When the native structure corresponding to the sequence is known, the global and local label values are computed by two algorithms, the global distance test (GDT-TS) and the local distance difference test (lDDT), respectively, wherein: the GDT-TS score is based on the largest set of amino acid residues whose alpha carbon atoms in the model structure lie within a defined distance cutoff of their positions in the native structure after iterative superposition of the two structures. lDDT is a superposition-free score that evaluates the local distance differences of all atoms in the model, including validation of stereochemistry; it is calculated over all atom pairs of the native protein structure and measures how well the environment in the native structure is reproduced in the decoy structure. These two methods are the gold standards for measuring global and local structural similarity, respectively.
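GDT-TS is conventionally reported as the average, over the cutoffs 1, 2, 4 and 8 Å, of the fraction of alpha carbons within each cutoff. The sketch below computes that average from per-residue distances for one fixed superposition, which is a simplifying assumption: the full algorithm searches over many superpositions to maximize each fraction, so this value is a lower bound. The name `gdt_ts_from_distances` is hypothetical.

```python
def gdt_ts_from_distances(ca_distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average fraction of alpha carbons within each distance cutoff."""
    n = len(ca_distances)
    fractions = [sum(1 for d in ca_distances if d <= c) / n for c in cutoffs]
    return sum(fractions) / len(fractions)


# Four residues: fractions are 1/4, 2/4, 3/4 and 3/4 at the four cutoffs.
toy_gdt = gdt_ts_from_distances([0.5, 1.5, 3.0, 9.0])
```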
The correlation coefficient is Pearson's correlation coefficient $\rho_{X,Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y}$, wherein: $cov(X, Y)$ is the covariance of the two vectors, and $\sigma$ is the standard deviation of a vector.
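As a worked example of this definition, the function below computes the coefficient directly from the covariance and the two standard deviations, using population (divide by n) estimates throughout; the name `pearson` is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)


rho_up = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
rho_down = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The function raises a division error if either vector is constant, since the standard deviation is then zero.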
The invention also relates to a system for implementing the method, comprising: a protein graph generation module, an attention-based feature extraction network, a graph pooling network, a loss function module, a parameter updating module and a validation module, wherein: the protein graph generation module extracts the original node features and edge features from the sample decoy structure data and assembles them into an original protein graph; the attention-based feature extraction network obtains the high-dimensional features of the protein graph by forward computation; the graph pooling network applies simple calculation and averaging to the high-dimensional features to obtain the global and local prediction scores; the predicted scores and the true scores of the sample are sent together to the loss function module to compute the loss; the parameter updating module updates the network parameters from the loss by back-propagation and gradient descent; and training is iterated until the validation module judges that the model has reached the expected performance.
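The pooling stage can be pictured as averaging atom embeddings, once over the whole graph for the global score and once per residue for the local scores, followed by a small fully connected layer. The sketch below is an assumption-laden illustration: the averaging follows the "average processing" described above, but the sigmoid output, the single linear layer and every name in the code are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[c] for v in vectors) / n for c in range(len(vectors[0]))]


def linear_sigmoid(x, weights, bias):
    """Single fully connected unit followed by a sigmoid, giving a score in (0, 1)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def pool_scores(node_embeddings, residue_of_atom, w_global, b_global, w_local, b_local):
    """Return (global score, list of per-residue local scores)."""
    global_score = linear_sigmoid(mean_vector(node_embeddings), w_global, b_global)
    groups = {}
    for emb, res in zip(node_embeddings, residue_of_atom):
        groups.setdefault(res, []).append(emb)
    local_scores = [linear_sigmoid(mean_vector(groups[res]), w_local, b_local)
                    for res in sorted(groups)]
    return global_score, local_scores


# Three atoms in two residues, with 2-dimensional embeddings.
g_score, l_scores = pool_scores(
    [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]], [0, 0, 1],
    w_global=[1.0, 1.0], b_global=0.0, w_local=[1.0, 1.0], b_local=0.0)
```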
Technical effects
With a shorter running time than the current leading evaluation algorithms, the invention achieves comparable accuracy on the authoritative public data sets and obtains the best results on their high-accuracy subsets.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a graph neural network model training process constructed based on an attention mechanism and a graph pooling technique;
FIG. 3 is a schematic diagram of a neural network architecture implementing an attention-based mechanism.
Detailed Description
As shown in fig. 1, the present embodiment relates to a method for evaluating a protein structure model based on deep learning, which includes the following steps:
Step 1) Extracting the Cartesian coordinates in three-dimensional space of the heavy atoms of each residue in the decoy structure, then preprocessing the structure with the dual-space relax protocol of the Rosetta software, which relaxes the structure under dual-space constraints.
Step 2) Calculating the Euclidean distance between each pair of heavy atoms from their coordinates, constructing an adjacency matrix from the distances, and calculating the original node features and edge features of each heavy atom, specifically:
Step 2.1) Given a protein of L residues containing a total of N heavy atoms, wherein the Cartesian coordinates of the i-th heavy atom in three-dimensional space are $c_i = (x_i, y_i, z_i)$ and those of the j-th heavy atom are $c_j = (x_j, y_j, z_j)$, the Euclidean distance between these two atoms is $d_{ij} = \lVert c_i - c_j \rVert$. In practice, a vector representation of the distance is generated by Gaussian basis expansion, with means spaced uniformly over [0, 15] with a step of 0.4. The distance matrix of the protein is $D = (d_{ij}) \in \mathbb{R}^{N \times N}$, and the set of atomic coordinates of the protein is $C = \{c_1, c_2, \ldots, c_N\}$.
Step 2.2) Based on the distance matrix obtained above, the adjacency matrix $A = (a_{ij})$ is obtained by traversing the heavy atoms and searching for the k heavy atoms closest to each one in three-dimensional space, wherein: $a_{ij}$ indicates whether heavy atoms i and j are connected by an edge in the protein graph, and k is a preset hyper-parameter; in practice, k = 50.
Step 2.3) The original node feature of each heavy atom is obtained from its attributes in the protein, and the set of original node features is denoted $V = \{v_1, v_2, \ldots, v_N\}$, where $v_i$ is a 167-dimensional one-hot vector that uniquely identifies the atom type.
Step 2.4) Based on the adjacency matrix, the distance matrix and the coordinate set $C$, edge features other than the Euclidean distance feature can be computed, namely the contact feature and the projection feature, wherein: the contact matrix has entries $b_{ij}$ and uses the hyper-parameter $d_0 = 8$; the projection matrix has entries $p_{ij}$. Given the coordinates $c_i = (x_i, y_i, z_i)$ of the i-th heavy atom, the coordinates of its preceding and following heavy atoms along the amino acid sequence are $c_{i-1} = (x_{i-1}, y_{i-1}, z_{i-1})$ and $c_{i+1} = (x_{i+1}, y_{i+1}, z_{i+1})$, and the local reference frame of the i-th heavy atom is computed from the coordinates of these three heavy atoms. The local reference frame of the j-th heavy atom is obtained in the same way. The projection vector $p_{ij}$ is then the concatenation of the projections of the connecting vector $(c_j - c_i)$ onto the two local reference frames. The edge features are thus defined as $\mu_{ij} = [d_{ij}, b_{ij}, p_{ij}]$, 47 dimensions in total.
Step 2.5) In summary, a protein graph is defined by its adjacency relations, node features and edge features, in which $a_{ij}$ indicates whether the i-th and j-th heavy atoms are connected, $v_i$ is the original node feature of the i-th heavy atom, and $\mu_{ij}$ is the edge feature between the i-th and j-th heavy atoms.
Step 3) Constructing the training data queue $Q_{train} = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$, wherein: $X$ is the protein graph constructed from the original features of the sample, $Y$ is the true label value of the sample, and $Y_n \in (0, 1)$.
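The projection feature of step 2.4 can be sketched as follows. The formula for the local reference frame did not survive in this text, so the orthonormal frame built here (first axis along the bond from the preceding atom, third axis normal to the plane of the three atoms) is an assumption chosen for illustration; only the final step, concatenating the projections of $(c_j - c_i)$ onto the two frames, follows the description directly. All function names are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)


def local_frame(prev_atom, atom, next_atom):
    """Assumed frame: three orthonormal axes from three consecutive atoms.
    Fails with a division by zero when the three atoms are collinear."""
    e1 = unit(sub(atom, prev_atom))
    e3 = unit(cross(e1, sub(next_atom, atom)))
    e2 = cross(e3, e1)
    return (e1, e2, e3)


def projection_feature(frame_i, frame_j, c_i, c_j):
    """Six numbers: (c_j - c_i) projected onto frame i, then onto frame j."""
    link = sub(c_j, c_i)
    return [dot(e, link) for e in frame_i] + [dot(e, link) for e in frame_j]


frame_i = local_frame((-1, 0, 0), (0, 0, 0), (0, 1, 0))
frame_j = local_frame((2, 2, 4), (2, 3, 4), (3, 3, 4))
p_ij = projection_feature(frame_i, frame_j, (0, 0, 0), (2, 3, 4))
```

Because the frames move with the molecule, these six numbers do not change when the whole structure is rotated or translated, which is the point of using local frames for directional features.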
Step 4) Constructing the graph neural network learning framework, which comprises an attention-based graph convolution network and a graph pooling network. The training data are fed into the graph convolution network to compute the high-dimensional features of the protein graph, and graph pooling is then applied to these features to obtain the local score and the global score of the sample. The predicted scores and the true scores of the sample are sent together to the loss function module to compute the loss; the parameter updating module updates the network parameters from the loss by back-propagation and gradient descent; and training is iterated until the validation module judges that the model has reached the expected performance.
Step 4.1) Protein graphs are drawn in batches and first passed through the attention-based graph convolution network to obtain high-dimensional features. The network consists of 4 layers of identical structure. In each layer, the output is computed from the high-dimensional feature map $e^{(k)}$ and the attention weights $\omega^{(k)}$ that the squeeze-and-excitation module derives from $e^{(k)}$; $e^{(k)}$ is obtained from $\tau^{(k)}$ through a fully connected layer; and $\tau^{(k)}$ is the high-dimensional feature map produced by the triple attention module from four feature maps, which are obtained from the original feature map $z^{(k)}$ through four multi-head fully connected layers of identical architecture. Wherein: the superscript k is the index of the current network layer and takes the value 1, 2, 3 or 4; the subscript h indexes the attention heads, and the number of heads is a preset hyper-parameter, 3 in this work; the softmax function is used, $\sigma(\cdot)$ is the sigmoid function and $\delta(\cdot)$ is the ReLU nonlinear activation function; element-wise multiplication and concatenation between vectors are used; and each head of the k-th layer has its own convolution weight matrix and convolution bias matrix.
Step 4.2) Denote the new node embeddings obtained by the forward computation as $v'_i$. Graph pooling is applied to them to obtain the global score and the local score of the k-th amino acid, the latter being pooled over the set of heavy atoms that constitute the k-th amino acid; the global and local fully connected layers have their own weights and biases, the latter denoted $b_l$ and $b_m$.
Step 4.3) According to the loss function, the stochastic gradient descent algorithm is used to update the parameters $\theta_k$ of the network $\delta_\theta$ at the k-th iteration, specifically $\theta_{k+1} = \theta_k - \lambda \nabla_{\theta_k} Loss$, wherein: $\lambda$ is the preset learning rate and $\nabla_{\theta_k} Loss$ is the corresponding gradient obtained from the loss function. Specifically, for a sample composed of L amino acids, the network outputs a final score vector Score and the corresponding true label vector is Y; the loss of the sample is computed from Score and Y, wherein $\varepsilon(\cdot)$ is a step function.
Step 4.4) After the specified number of iterations, the model parameters that performed best on the validation set are loaded from the storage path and used for prediction on the test set. Whether the hyper-parameters need to be adjusted is judged from the Pearson coefficient of the two vectors (Score, Y) on the test set, $\rho_{Score,Y} = \frac{cov(Score, Y)}{\sigma_{Score}\,\sigma_{Y}}$, wherein: $cov(\cdot)$ is the covariance and $\sigma$ is the standard deviation.
Step 5) A decoy structure awaiting prediction is first processed by step 1 and step 2 to obtain its protein graph; the result obtained by feeding the protein graph into the graph neural network is the prediction score of the decoy structure.
This example uses DeepAccNet (from Hiranuma, N., Park, H., Baek, M. et al. Improved protein structure refinement guided by deep learning based accuracy estimation. Nat Commun 12, 1340 (2021), https://doi.org/10.1038/s41467-021-, Pearson's correlation and AUC with respect to the global score on the test set were used as the criteria for evaluating model quality.
To demonstrate the advantage of the invention in judging high-accuracy structures, the samples with true label values greater than 0.5 in the CASP13 and CASP14 decoy structure data sets were selected to form two new test sets, CASP13_H and CASP14_H, and the comparison was repeated on these two sets.
The final results on the CASP13 and CASP14 ranking tasks are shown in Tables 1-2. Compared with the top four single-model evaluation methods in the competition, the Pearson correlation and AUC of this method stay within an acceptable range of the leading group, wherein: the AUC is the area under the ROC curve and reflects the ability of a method to distinguish decoy structures of poor accuracy from those of good accuracy (separated by whether GDT-TS is greater than 0.5).
Table 1: comparison at CASP13
Method | AUC | PEARSON |
ProQ3D | 95.83% | 0.8510 |
MESHI-enrich-server | 95.08% | 0.8318 |
ProQ3 | 95.05% | 0.8316 |
MASS1 | 94.28% | 0.8096 |
The present method | 92.32% | 0.8247 |
Table 2: comparison at CASP14
Method | AUC | PEARSON |
ProQ4 | 80.79% | 0.6278 |
Bhattacharya-QDeep | 78.88% | 0.6249 |
QMEANDisCo | 78.84% | 0.6120 |
Bhattacharya-QDeepU | 78.38% | 0.6084 |
The present method | 82.81% | 0.6383 |
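The AUC columns in Tables 1-2 rest on a binary split of the decoys by whether GDT-TS exceeds 0.5. A minimal sketch of that computation is shown below, using the rank (Mann-Whitney) formulation of the area under the ROC curve: the fraction of (good, poor) decoy pairs in which the good decoy receives the higher predicted score, with ties counted as half. The function name `roc_auc` and the toy values are illustrative, not data from the tables.

```python
def roc_auc(is_positive, scores):
    """Area under the ROC curve as the fraction of correctly ordered pairs."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: true GDT-TS values and predicted global scores for four decoys.
true_gdt = [0.8, 0.6, 0.4, 0.2]
predicted = [0.9, 0.3, 0.5, 0.1]
good = [g > 0.5 for g in true_gdt]
toy_auc = roc_auc(good, predicted)
```

Of the four (good, poor) pairs here, three are ordered correctly, giving an AUC of 0.75.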
The final results on the CASP13_H and CASP14_H ranking tasks are shown in Tables 3-4. Compared with the 14 best-ranked evaluation methods in the competition (including multi-model methods, which often score better on these metrics), this method obtains a clear lead in Pearson correlation and is better able to rank and select the decoys that are closest to the native structure.
Table 3: comparison at CASP13_H
Method | PEARSON |
MUfoldQA_T | 0.7636 |
MULTICOM_CLUSTER | 0.8022 |
Davis-EMAconsensus | 0.7733 |
ModFOLDclust2 | 0.7777 |
FALCON-QA | 0.7516 |
RaptorX-DeepQA | 0.6791 |
Bhattacharya-ClustQ | 0.7124 |
ProQ3D | 0.6317 |
FaeNNz | 0.8115 |
Pcons | 0.5867 |
Pcomb | 0.5560 |
MESHI-enrich-server | 0.7772 |
ProQ3 | 0.7188 |
MASS1 | 0.4661 |
The present method | 0.8646 |
Table 4: comparison at CASP14_H
Similarly, for 1000 decoy structures randomly selected from the validation set, the prediction time of this method is compared with that of the other top-ranked methods, all run on CPU, as shown in Table 5.
Table 5: predictive time comparisons over 1000 randomly chosen baits
Method | Time |
ProQ4 | 27min 22s |
Bhattacharya-QDeep | 31min 3s |
QMEANDisCo | 2h 3min 0s |
Bhattacharya-QDeepU | 33min 10s |
The present method | 10min 46s |
Compared with the prior art, this method needs less prediction time while its evaluation accuracy is comparable; on the high-accuracy samples of the authoritative data sets, its evaluation accuracy exceeds that of all current techniques.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Claims (9)
1. A protein structure model quality evaluation method based on a graph neural network, characterized in that the global three-dimensional coordinates of all heavy atoms (167 types) in the protein decoy to be evaluated are extracted as input and preprocessed with the dual-space relax protocol of the Rosetta software; the original node features and edge features are obtained by analysis and calculation and fed into a graph neural network model built on an attention mechanism and graph pooling; and pre-trained network parameters are used to compute a global score and a local score that reflect the difference between the decoy structure and the real native protein structure at the protein level and the amino acid residue level, respectively.
2. The protein structure model quality evaluation method based on a graph neural network as claimed in claim 1, wherein the graph neural network model built on the attention mechanism and graph pooling comprises: a graph neural computation network formed by connecting in series four network layers of identical architecture, and a graph pooling network, connected to it, formed by two pooling layers in parallel, wherein: each network layer of the graph neural computation network consists of a triple attention mechanism and a channel attention mechanism connected in series, and the two parallel pooling layers of the graph pooling network perform local pooling and global pooling, respectively, on the node embeddings of the upstream graph.
3. The method for evaluating the quality of the protein structure model based on the graph neural network as claimed in claim 1 or 2, wherein the pre-trained model network parameters are obtained by repeatedly training the model architecture on a training set for ten rounds and then respectively evaluating the model architecture on a verification set and a test set, and specifically comprises the following steps:
firstly, taking out a sample protein bait structure from a training set, carrying out dual-space relax, analyzing data from the structure to obtain original node characteristics and edge characteristics, and constructing a training data queueWherein: x is a protein map constructed by the original characteristics of the sample, and Y is a real label value corresponding to the sample;
step two, inputting the data pairs in the training data queue sequentially, in batches, into the graph neural network model to obtain the output $\delta_\theta(X)$, and then calculating the loss function value $\mathrm{Loss} = \sum_{\mathrm{batch}} (Y - \delta_\theta(X))^2 + \sum_{\mathrm{label} > 0.5} (Y - \delta_\theta(X))^2$, wherein: $\delta_\theta$ is the graph neural network model;
step three, updating the $k$-th parameters $\delta_k$ of the $\delta_\theta$ network with a stochastic gradient descent algorithm according to the loss function, specifically $\delta_{k+1} = \delta_k - \lambda \nabla_{\delta_k} \mathrm{Loss}$, wherein: $\lambda$ is a preset learning rate of 0.001, and $\nabla_{\delta_k} \mathrm{Loss}$ is the corresponding gradient obtained from the loss function;
step four, after traversing all samples in the training set, inputting the samples of the validation set into the $\delta_\theta$ network for testing, calculating the correlation coefficient and the absolute difference between the predicted values and the label values of all samples in the validation set, comparing them with the results of the previously trained model, and keeping the better version by replacing the stored one;
step five, after the specified number of training iterations is completed, loading the best historical model parameters saved according to the validation set, inputting the samples of the test set, calculating the prediction scores of the model on the test set and the corresponding correlation coefficient, and then judging from the correlation coefficient whether the current model design is reasonable, so as to decide whether to terminate training or adjust the relevant hyper-parameters.
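The training procedure of claim 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a linear model `x @ theta` is swapped in for the graph neural network so that the gradient can be written by hand, model selection uses only the Pearson correlation on the validation data (the claim also compares absolute differences), and all function names are hypothetical. The loss follows the formula of step two, so samples with a label above 0.5 are counted twice.

```python
import numpy as np

def claim_loss(y, pred):
    # Sum of squared errors over the batch, plus the same term again
    # for samples whose label exceeds 0.5.
    sq = (y - pred) ** 2
    return float(sq.sum() + sq[y > 0.5].sum())

def claim_loss_grad(x, y, theta):
    # Analytic gradient of claim_loss for the linear stand-in model pred = x @ theta.
    weight = 1.0 + (y > 0.5)          # 2 for labels > 0.5, otherwise 1
    return -2.0 * x.T @ (weight * (y - x @ theta))

def sgd_step(theta, grad, lr=0.001):
    # Gradient descent update with the learning rate of 0.001 stated in step three.
    return theta - lr * grad

def train(x_tr, y_tr, x_val, y_val, epochs=10, batch=8, lr=0.001):
    theta = np.zeros(x_tr.shape[1])
    best_theta, best_r = theta.copy(), -np.inf
    for _ in range(epochs):                       # ten passes over the training set
        for s in range(0, len(x_tr), batch):      # batches taken in queue order
            xb, yb = x_tr[s:s + batch], y_tr[s:s + batch]
            theta = sgd_step(theta, claim_loss_grad(xb, yb, theta), lr)
        # Step four: score on the validation data, keep the better version.
        r = float(np.corrcoef(x_val @ theta, y_val)[0, 1])
        if r > best_r:
            best_theta, best_r = theta.copy(), r
    return best_theta, best_r
```

The parameters returned by `train` are the ones that would then be evaluated once on the test set, as described in step five.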
4. The method for evaluating the quality of the protein structure model based on the graph neural network as claimed in claim 3, wherein the training set and the validation set are obtained by using different prediction methods and prediction software to perturb experimentally determined protein structures with different amino acid sequences from a human protein database, thereby generating a large number of protein-like decoy structures, and dividing these decoy structures at a ratio of 8:2;
the test set is the CASP standard data set, i.e. the official data set provided by the CASP competition. CASP, the Critical Assessment of protein Structure Prediction, is a worldwide community experiment. In each round, CASP selects a batch of protein sequences whose structures are not yet known but are about to be solved by laboratories around the world, provides them to all participating groups for structure prediction, then collects all the predicted structures and provides them to the protein structure model quality evaluation groups for scoring. Its official data set is therefore the gold standard in the research field of the invention.
5. The method of claim 3, wherein the protein graph is a data representation of a protein-like decoy structure, and each protein graph consists of nodes and edges, wherein the nodes represent the heavy atoms of the protein and the edges represent the relationship between each node atom and its 50 nearest neighbors in three-dimensional space. Each node and each edge carries original features; the original node feature is a one-hot vector encoding the atom type at the node.
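The graph construction of claim 5 can be sketched with NumPy as follows. The function names are hypothetical, and the default one-hot width of 167 is an assumption taken from the heavy-atom count in claim 1. The dense distance matrix is chosen for brevity; a spatial index would be the usual choice for large structures.

```python
import numpy as np

def one_hot(atom_type_ids, n_types=167):
    # Original node feature: one-hot encoding of the atom type at each node.
    feats = np.zeros((len(atom_type_ids), n_types))
    feats[np.arange(len(atom_type_ids)), atom_type_ids] = 1.0
    return feats

def knn_edges(coords, k=50):
    """Return a (2, n * k) array of directed edges from each atom to its k nearest atoms."""
    n = len(coords)
    k = min(k, n - 1)                    # small structures have fewer than k neighbors
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)       # an atom is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]
    src = np.repeat(np.arange(n), k)
    return np.stack([src, nbrs.ravel()])
```

For example, `knn_edges(coords, k=50)` on an array of heavy-atom coordinates yields the edge list, and `one_hot(atom_type_ids)` yields the matching node feature matrix.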
6. The method for evaluating the quality of the protein structure model based on the graph neural network according to claim 5, wherein the edge features specifically comprise:
category 1) edge distance: for every pair of atoms connected by an edge in the protein graph, their distance in the sample is calculated and expanded into a vector representation by Gaussian expansion;
category 2) edge coordinates: each heavy atom B, together with its preceding atom A and its following atom C in the amino acid sequence, forms a local reference system defined by three unit vectors; when the edge coordinate features of any two atoms $B_i$, $B_j$ in space are considered, the three-dimensional coordinates of $B_i$ and $B_j$ are each transformed into the local reference system of the other, so that the edge $(i, j)$ in the protein graph obtains 2 × 3 features, namely the projection of the spatial coordinates of atom $B_i$ ($B_j$) onto the local coordinate system of atom $B_j$ ($B_i$);
category 3) chemical bond: a Boolean feature set to 1 when the edge between the two atoms corresponds to a chemical bond in the actual protein structure, and to 0 otherwise.
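The three edge feature categories of claim 6 can be illustrated with the sketch below. Several details are assumptions, because the text does not fix them: the Gaussian centers and width, the Gram-Schmidt construction of the local frame from atoms A, B and C, and all function names. The frame construction requires that A, B and C are not collinear.

```python
import numpy as np

def gaussian_expansion(d, centers=np.linspace(0.0, 10.0, 16), sigma=0.5):
    # Category 1: expand each distance into a vector of Gaussian responses.
    d = np.asarray(d, dtype=float)
    return np.exp(-((d[..., None] - centers) ** 2) / (2.0 * sigma ** 2))

def local_frame(a, b, c):
    # Assumed construction: orthonormal frame at atom B from its sequence
    # neighbors A and C (Gram-Schmidt); rows are the three unit vectors.
    e1 = (a - b) / np.linalg.norm(a - b)
    v = (c - b) - ((c - b) @ e1) * e1
    e2 = v / np.linalg.norm(v)
    e3 = np.cross(e1, e2)
    return np.stack([e1, e2, e3])

def edge_coordinates(pos_i, frame_i, pos_j, frame_j):
    # Category 2: 2 x 3 features, each atom expressed in the other's local frame.
    return np.concatenate([frame_j @ (pos_i - pos_j), frame_i @ (pos_j - pos_i)])

def bond_flag(i, j, bonds):
    # Category 3: 1.0 if (i, j) is a chemical bond in the structure, else 0.0.
    return float((i, j) in bonds or (j, i) in bonds)
```

Concatenating the three outputs for an edge gives its original feature vector; with the default settings above that is 16 + 6 + 1 values per edge.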
7. The method for evaluating the quality of the protein structure model based on the graph neural network according to claim 3, wherein the true label value is an accurate score of a protein-like decoy structure relative to its corresponding real structure; the true label values are divided into global and local categories: when the native structure corresponding to the sequence is known, the global and local label values are computed by two algorithms, namely the global distance test total score (GDT-TS) and the local distance difference test (lDDT), respectively.
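To make the local label of claim 7 concrete, the sketch below computes a simplified lDDT-style score per residue. It is an illustration under assumptions, not the reference implementation: it uses a single representative atom per residue (for example Cα), skips the stereochemistry checks of the full method, and uses the commonly cited 15 Å inclusion radius and 0.5, 1, 2 and 4 Å tolerance thresholds. The function names are hypothetical. GDT-TS is not shown because it additionally requires structural superposition.

```python
import numpy as np

def pairwise_distances(xyz):
    return np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)

def lddt_per_residue(ref_xyz, model_xyz, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Fraction of reference distances preserved in the model, per residue."""
    d_ref = pairwise_distances(ref_xyz)
    d_mod = pairwise_distances(model_xyz)
    n = len(ref_xyz)
    # Only pairs that are close in the reference structure are considered.
    considered = (d_ref < radius) & ~np.eye(n, dtype=bool)
    diff = np.abs(d_ref - d_mod)
    # For each pair, the fraction of thresholds under which the distance is preserved.
    preserved = np.mean([diff < t for t in thresholds], axis=0)
    return (preserved * considered).sum(axis=1) / np.maximum(considered.sum(axis=1), 1)
```

Because the score compares internal distances only, it needs no superposition of the decoy onto the native structure, which is what makes it suitable as a per-residue label.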
8. The method for evaluating the quality of the protein structure model based on the graph neural network according to claim 3, wherein the correlation coefficient is the Pearson correlation coefficient $\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$, wherein: $\mathrm{cov}(X, Y)$ is the covariance of the two vectors, and $\sigma$ is the standard deviation of a vector.
9. A system for implementing the protein structure model quality evaluation method based on the graph neural network as claimed in any one of claims 1 to 8, characterized by comprising: a protein graph generation module, an attention-based feature extraction network, a graph pooling network, a loss function module, a parameter updating module and a verification module, wherein: the protein graph generation module extracts the original node features and edge features from the sample protein decoy structure data and assembles them into an original protein graph; the attention-based feature extraction network obtains the high-dimensional features of the protein graph through a forward pass; the graph pooling network applies simple arithmetic and averaging to the high-dimensional features to obtain the global and local prediction scores respectively; the predicted scores and the true scores of the sample are sent together to the loss function module to calculate the loss; the parameter updating module updates the network parameters by back propagation and gradient descent according to the loss; and training is iterated until the verification module judges that the model has achieved the expected performance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210557804.6A CN114997366A (en) | 2022-05-19 | 2022-05-19 | Protein structure model quality evaluation method based on graph neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210557804.6A CN114997366A (en) | 2022-05-19 | 2022-05-19 | Protein structure model quality evaluation method based on graph neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114997366A (en) | 2022-09-02 |
Family
ID=83026978
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210557804.6A Pending CN114997366A (en) | 2022-05-19 | 2022-05-19 | Protein structure model quality evaluation method based on graph neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114997366A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116206676A (en) * | 2023-04-28 | 2023-06-02 | 中国人民解放军军事科学院军事医学研究院 | Immunogen prediction system and method based on protein three-dimensional structure and graph neural network |
CN116206676B (en) * | 2023-04-28 | 2023-09-26 | 中国人民解放军军事科学院军事医学研究院 | Immunogen prediction system and method based on protein three-dimensional structure and graph neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112101430B (en) | Anchor frame generation method for image target detection processing and lightweight target detection method | |
CN110851645B (en) | Image retrieval method based on similarity maintenance under deep metric learning | |
CN110289050B (en) | Drug-target interaction prediction method based on graph convolution sum and word vector | |
CN110569901A (en) | Channel selection-based countermeasure elimination weak supervision target detection method | |
CN113298230B (en) | Prediction method based on unbalanced data set generated against network | |
CN112489723B (en) | DNA binding protein prediction method based on local evolution information | |
CN115311502A (en) | Remote sensing image small sample scene classification method based on multi-scale double-flow architecture | |
CN114997366A (en) | Protein structure model quality evaluation method based on graph neural network | |
CN115063664A (en) | Model learning method, training method and system for industrial vision detection | |
CN114723784A (en) | Pedestrian motion trajectory prediction method based on domain adaptation technology | |
CN114529552A (en) | Remote sensing image building segmentation method based on geometric contour vertex prediction | |
CN110263125B (en) | Service discovery method based on extreme learning machine | |
CN112884222A (en) | Time-period-oriented LSTM traffic flow density prediction method | |
CN113707213B (en) | Protein structure rapid classification method based on contrast graph neural network | |
CN112949599B (en) | Candidate content pushing method based on big data | |
CN113177078B (en) | Approximate query processing algorithm based on condition generation model | |
CN114627076A (en) | Industrial detection method combining active learning and deep learning technologies | |
CN115083511A (en) | Peripheral gene regulation and control feature extraction method based on graph representation learning and attention | |
CN116071636B (en) | Commodity image retrieval method | |
Mu | Implementation of Music Genre Classifier Using KNN Algorithm | |
CN117437786B (en) | Real-time traffic flow prediction method based on artificial intelligence for traffic network | |
Bouzaachane | Applying Face Recognition in Video Surveillance for Security Systems | |
Hassan et al. | COMPARATIVE ANALYSIS OF CLASSIFICATION BASED ON CELLULAR LOCALIZATION DATA USING MACHINE LEARNING | |
CN116955713A (en) | Method for generating protein index, method and device for querying protein fragment | |
CN116913379A (en) | Directional protein transformation method based on iterative optimization pre-training large model sampling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||