CN114997366A - Protein structure model quality evaluation method based on graph neural network - Google Patents

Info

Publication number
CN114997366A
Authority
CN
China
Prior art keywords
protein
graph
model
network
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210557804.6A
Other languages
Chinese (zh)
Inventor
张沛东
沈红斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202210557804.6A priority Critical patent/CN114997366A/en
Publication of CN114997366A publication Critical patent/CN114997366A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Databases & Information Systems (AREA)
  • Physiology (AREA)
  • Bioethics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

A method for evaluating the quality of a protein structure model based on a graph neural network. The global three-dimensional coordinates of all heavy atoms (167 types) in the protein decoy to be predicted are extracted as input and preprocessed with the dual-space relax protocol of the Rosetta software. Original node features and edge features are obtained by analysis and calculation and fed into a graph neural network model built on attention mechanisms and graph pooling. Computation with pre-trained network parameters then yields a global score and a local score that reflect, at the protein level and the amino acid residue level respectively, the difference between the decoy structure and the true native protein structure. The invention attends more discriminatively to decoy structures whose accuracy is close to that of the native structure, thereby improving the accuracy of protein structure model evaluation on high-accuracy data sets.

Description

Protein structure model quality evaluation method based on graph neural network
Technical Field
The invention relates to a technology in the field of bioengineering, in particular to a protein structure model quality evaluation method based on a graph neural network, which is used for evaluating the accuracy of a model structure when the corresponding native structure is unknown.
Background
Protein structure model evaluation aims to quantify the accuracy of a protein decoy structure generated by a protein structure prediction model relative to its native structure, without knowledge of that native structure. In protein structural bioinformatics, a model evaluation tool is an extremely important piece of infrastructure: a good evaluation method provides important guidance for tasks such as protein structure prediction, structure-based protein function prediction, and structure-based protein docking. Protein structure model evaluation is generally divided into multi-model methods and single-model methods. The former take the whole pool of protein structure models as input and evaluate a target model by comparing it with the other models in the pool, using the average structural similarity as the quality index. The latter use only the sequence and structure information of the individual decoy structure as input to predict its quality. Multi-model methods can reach deceptively high accuracy when most models in the pool are of good quality, but their running time often grows exponentially. Single-model methods perform more consistently, depend less on the quality distribution of the models being evaluated, and are closer to what a protein structure model evaluation algorithm is expected to do, namely to learn which true structure a sequence is likely to form under natural conditions. In 2020, models led by AlphaFold 2, Google's latest AI algorithm, demonstrated in the CASP14 structure prediction competition that deep learning on amino acid co-evolution data is sufficient to predict decoy structures of considerable accuracy for protein sequences of unknown structure.
Because past evaluation techniques were geared toward telling apart decoy structures that differ substantially in accuracy, the appearance of large numbers of highly accurate decoy structures has sharply reduced the precision of almost all evaluation methods. In other words, although the model evaluation algorithm is an important guiding component of the structure prediction algorithm, the prior art cannot keep up with the rapid development of structure prediction.
Existing multi-model quality evaluation methods are slow and strongly affected by the distribution of the protein structure model pool. Single-model quality evaluation methods are closer to a solution of the problem in this field, but their learning performance is poor. Almost all existing quality evaluation methods have difficulty making reliable predictions for high-accuracy protein models.
Disclosure of Invention
To address the poor performance of existing protein structure model evaluation methods on sets of high-accuracy protein structure models, the invention provides a protein structure model quality evaluation method based on a graph neural network. It combines deep learning with domain knowledge of protein structure and applies a neural network built on a triple attention mechanism to protein structure model evaluation. It is the first to focus on, and to be optimized for, the evaluation of high-accuracy protein structure models, and it attends more discriminatively to decoy structures whose accuracy is close to that of the native structure, thereby improving the accuracy of protein structure model evaluation on high-accuracy data sets.
The invention is realized by the following technical scheme:
The invention relates to a protein structure model quality evaluation method based on a graph neural network. The global three-dimensional coordinates of all heavy atoms (167 types) in the protein decoy to be predicted are extracted as input and preprocessed with the dual-space relax protocol of the Rosetta software. Original node features and edge features are obtained by analysis and calculation and fed into a graph neural network model built on attention mechanisms and graph pooling. Computation with pre-trained network parameters then yields a global score and a local score that reflect, at the protein level and the amino acid residue level respectively, the difference between the decoy structure and the true native protein structure.
The 167 heavy atoms are all the atoms of relatively large atomic mass contained in the 20 standard amino acid residues that make up natural proteins. They are considered to determine the backbone position of the protein in space and its position relative to other atoms.
The graph neural network model built on attention mechanisms and graph pooling comprises: a graph neural computation network built by connecting in series four network layers of identical architecture, and a graph pooling network, connected to the graph neural computation network, that consists of two pooling layers in parallel, wherein: each network layer in the graph neural computation network consists of a triple attention mechanism and a channel attention mechanism connected in series, and the two parallel pooling layers in the graph pooling network perform local pooling and global pooling, respectively, on the node embeddings of the upstream graph.
The pre-trained model network parameters are obtained by training the model architecture on the training set for ten rounds and then evaluating it on the validation set and the test set, specifically:
Step 1: take a sample protein decoy structure from the training set, apply dual-space relax, obtain the original node features and edge features from the structure analysis data, and construct a training data queue $Q_{train} = \{(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\}$, where X is the protein graph constructed from the original features of the sample and Y is the true label value of the sample.
Step 2: feed the data pairs in the training data queue $Q_{train}$ into the graph neural network model in batches to obtain the output $\delta_\theta(X)$, then compute the loss function value $Loss = \sum_{batch} (Y - \delta_\theta(X))^2 + \sum_{label > 0.5} (Y - \delta_\theta(X))^2$, where $\delta_\theta$ is the graph neural network model.
Step 3: according to the loss function, update the parameters $\theta_k$ of the network $\delta_\theta$ at the k-th iteration using stochastic gradient descent, specifically $\theta_{k+1} = \theta_k - \lambda \nabla_{\theta_k} Loss$, where $\lambda = 0.001$ is the preset learning rate and $\nabla_{\theta_k} Loss$ is the corresponding gradient obtained from the loss function.
Step 4: after all samples in the training set have been traversed, feed the samples of the validation set into the network $\delta_\theta$ for testing, compute the correlation coefficient and the absolute difference between the predicted values and the label values of all validation samples, compare them with the results of previously trained models, and keep the better version in storage.
Step 5: after the specified number of training iterations, load the historically best model parameters stored according to the validation set, feed in the samples of the test set, and compute the prediction scores of the model on the test set and the resulting correlation coefficient. The correlation coefficient is then used to judge whether the current model design is reasonable and to decide whether to stop training or to adjust the relevant hyper-parameters.
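The loss of step 2 and the parameter update of step 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `weighted_squared_loss` and `sgd_step` are hypothetical, the parameters are treated as a flat list of floats, and the gradients are assumed to be supplied by whatever automatic differentiation the actual implementation uses.

```python
def weighted_squared_loss(y_true, y_pred, threshold=0.5):
    """Squared error summed over the batch, plus a second squared-error
    term for samples whose true label exceeds `threshold`.

    This mirrors Loss = sum_batch (Y - f(X))^2 + sum_{label>0.5} (Y - f(X))^2,
    so high-accuracy samples are effectively counted twice.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        err = (y - p) ** 2
        total += err          # first sum: every sample in the batch
        if y > threshold:
            total += err      # second sum: only samples with label > 0.5
    return total


def sgd_step(params, grads, lr=0.001):
    """One gradient descent update: theta_{k+1} = theta_k - lr * grad."""
    return [theta - lr * g for theta, g in zip(params, grads)]
```

Counting the high-accuracy samples twice is one plausible reading of why the second sum is present: it pushes the model to fit decoys with labels above 0.5 more closely than the rest.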
The training set and the validation set are obtained as follows: different prediction methods and prediction software are used to perturb experimentally determined protein structures with different amino acid sequences taken from the human protein database, generating a large number of protein decoy structures, which are then divided in a ratio of 8:2.
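A minimal sketch of the 8:2 division is given below. The function name `split_train_val`, the fixed random seed, and the choice to shuffle individual decoys are assumptions made for illustration; the text does not say whether the split is made per decoy or per target protein.

```python
import random


def split_train_val(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples reproducibly and cut them at `train_fraction`."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


train, val = split_train_val(range(10))
print(len(train), len(val))
```

If decoys of the same target protein end up in both subsets, validation scores may look better than they should; splitting by target instead of by decoy would avoid this, at the cost of a less exact 8:2 ratio.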
The test set is the CASP standard data set, the official data set provided by the CASP competition. CASP (Critical Assessment of protein Structure Prediction) is a worldwide experiment open to the entire community. In each CASP round, a batch of protein sequences is selected whose structures are not yet known but are about to be solved by laboratories around the world. These sequences are given to all participating groups for structure prediction, and all predicted structures are then collected and provided to the protein structure model evaluation groups for scoring. Its official data set is therefore the gold standard in the research field of the invention.
The protein graph is a data representation of a protein decoy structure. Each protein graph consists of defined nodes and edges, where the nodes represent the heavy atoms in the protein and the edges represent the relation between each node atom and its 50 nearest neighbor atoms in three-dimensional space. Each node and each edge carries original features.
The original node features encode the atom type at the node as a one-hot vector.
The edge characteristics specifically include:
Category 1) Edge distance: for every pair of atoms connected in the protein graph, their distance in the sample is calculated. In this work, a vector representation of the distance is generated using a Gaussian expansion whose means are spaced uniformly over [0, 15] with a step size of 0.4.
Category 2) edge coordinates: to account for the apparent anisotropy of the relative atomic positions within proteins, the above distances are supplemented with a set of directional features. For each heavy atom B, it may form a local reference frame together with its preceding atom a and its following atom C in the amino acid sequence. Three unit vectors defining the reference system
Figure BDA0003652967710000031
Respectively, as calculated below.
Figure BDA0003652967710000032
Thus, when computing the edge coordinate feature for any two atoms $B_i$ and $B_j$ in space, the three-dimensional coordinates of $B_i$ and $B_j$ are each expressed in the local reference frame of the other. The edge (i, j) in the protein graph thereby obtains 2 × 3 features, namely the projection of the spatial coordinates of atom $B_i$ ($B_j$) onto the local coordinate system of atom $B_j$ ($B_i$).
Category 3) Chemical bond: a Boolean feature set to 1 if the two connected atoms are joined by a chemical bond in the actual protein structure, and to 0 otherwise.
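The Gaussian expansion of the edge distance in category 1 can be sketched as follows. The text fixes only the range [0, 15] and the step 0.4; the function name `gaussian_expansion`, the half-open treatment of the range (which gives 38 centres, 0.0 to 14.8), and the Gaussian width equal to the step size are assumptions made for this illustration.

```python
import math


def gaussian_expansion(distance, start=0.0, stop=15.0, step=0.4, width=None):
    """Expand a scalar distance into a vector of Gaussian basis responses.

    Centres are start, start + step, ... below `stop`. `width` is the
    standard deviation of each Gaussian; it defaults to `step` (assumed,
    not stated in the source).
    """
    if width is None:
        width = step
    n_centres = int(math.ceil((stop - start) / step))
    centres = [start + i * step for i in range(n_centres)]
    return [math.exp(-((distance - c) ** 2) / (2.0 * width ** 2)) for c in centres]


features = gaussian_expansion(4.0)
print(len(features), features.index(max(features)))
```

Compared with a raw scalar distance, the expanded vector lets a downstream linear layer respond differently to different distance ranges. A 38-dimensional distance vector would also be consistent with the 47-dimensional edge feature mentioned later in the description if the remaining features contribute 9 dimensions, but that breakdown is an inference, not something the text states.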
The true label value is the accuracy score of a protein decoy structure relative to its corresponding native structure. Like the predicted scores, the true label values are divided into a global and a local category. When the native structure corresponding to the sequence is known, the global and local label values are computed with two algorithms, the global distance test (GDT-TS) and the local distance difference test (lDDT), respectively, wherein: the GDT-TS score is based on the largest set of alpha carbon atoms of the amino acid residues in the model structure that lie within a defined distance cutoff of their positions in the native structure after iterative superposition of the two structures. lDDT is a superposition-free score that evaluates local distance differences of all atoms in the model, including validation of stereochemistry. It is computed over all atom pairs of the native protein structure and measures how well the local environment in the native structure is reproduced in the decoy structure. These two methods are the gold standards for measuring structural similarity at the global and local level, respectively.
The correlation coefficient is the Pearson correlation coefficient $\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$, where $\mathrm{cov}(X, Y)$ is the covariance of the two vectors and $\sigma$ is the standard deviation of a vector.
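For reference, the Pearson correlation coefficient defined above can be computed directly from its definition. The sketch below uses only the standard library; the function name `pearson` is a choice made for this illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation: cov(x, y) / (std(x) * std(y)).

    The 1/n factors of covariance and standard deviations cancel,
    so plain sums of centred products are used.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)
```

In the setting of this method, `x` would hold the predicted scores and `y` the true label values of the samples in a data set.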
The invention also relates to a system for implementing the method, comprising: a protein graph generation module, an attention-based feature extraction network, a graph pooling network, a loss function module, a parameter updating module, and a validation module, wherein: the protein graph generation module extracts the original node features and edge features from the sample protein decoy structure data and assembles them into an original protein graph; the attention-based feature extraction network obtains the high-dimensional features of the protein graph through forward computation; the graph pooling network applies simple computation and averaging to the high-dimensional features to obtain the global and local prediction scores, respectively; the obtained scores and the true scores of the sample are sent together to the loss function module to compute the loss; the parameter updating module updates the network parameters from the loss using back propagation and gradient descent; training is iterated until the validation module judges that the model has reached the expected performance.
Technical effects
On authoritative public data sets, the invention achieves accuracy similar to that of the current state-of-the-art evaluation algorithms with a shorter running time, and obtains the best result on the high-accuracy subsets.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a graph neural network model training process constructed based on an attention mechanism and a graph pooling technique;
FIG. 3 is a schematic diagram of a neural network architecture implementing an attention-based mechanism.
Detailed Description
As shown in FIG. 1, the present embodiment relates to a protein structure model evaluation method based on deep learning, which includes the following steps:
Step 1) Extract the Cartesian coordinates in three-dimensional space of the heavy atoms of each residue in the protein decoy structure, then preprocess the structure with the dual-space relax protocol of the Rosetta software, which relaxes the structure under dual-space constraints.
Step 2) Calculate the Euclidean distance between each pair of heavy atoms from their coordinates, construct an adjacency matrix from the distances, and calculate the original node features and edge features of each heavy atom, specifically:
Step 2.1) Given a protein comprising L residues and containing N heavy atoms in total, where the Cartesian coordinate of the i-th heavy atom in three-dimensional space is $c_i = (x_i, y_i, z_i)$ and that of the j-th heavy atom is $c_j = (x_j, y_j, z_j)$, the Euclidean distance between these two atoms is $d_{ij} = \lVert c_i - c_j \rVert$. In practical application, a vector representation of the distance is generated using a Gaussian basis expansion whose means are spaced uniformly over [0, 15] with a step size of 0.4. The distance matrix of the protein is $D = \{d_{ij}\} \in \mathbb{R}^{N \times N}$, and the set of atomic coordinates of the protein is $C = \{c_1, c_2, \dots, c_N\}$.
Step 2.2) based on the distance matrix obtained above
Figure BDA0003652967710000053
The adjacency matrix can be obtained by traversing and searching M adjacent atoms which are closest to each heavy atom in three-dimensional space
Figure BDA0003652967710000054
where $a_{ij}$ indicates whether the i-th and j-th heavy atoms are connected by an edge in the protein graph, and k is a hyper-parameter for normalization; in practical application, k = 50.
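The nearest-neighbour search of step 2.2 can be sketched as below. This is a brute-force illustration, not the patented implementation: the function name `knn_adjacency` is hypothetical, and it is assumed that an edge (i, j) is set when atom j is among the k atoms closest to atom i, with k = 50 as in the text.

```python
import math


def knn_adjacency(coords, k=50):
    """Return an N x N 0/1 matrix with a[i][j] = 1 when atom j is one of
    the k atoms nearest to atom i (atom i itself is excluded)."""
    n = len(coords)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        # Distances from atom i to every other atom, sorted ascending.
        neighbours = sorted(
            (math.dist(coords[i], coords[j]), j) for j in range(n) if j != i
        )
        for _, j in neighbours[:k]:
            adjacency[i][j] = 1
    return adjacency


atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(knn_adjacency(atoms, k=1))
```

Note that a matrix built this way is not necessarily symmetric: atom j can be among the nearest neighbours of atom i without the reverse being true. The brute-force search costs O(N² log N); for large proteins a spatial index such as a k-d tree would be the usual replacement.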
Step 2.3) obtaining the original node characteristics of each heavy atom according to the attributes of each heavy atom in the protein, and recording the collection of the original node characteristics as
Figure BDA0003652967710000055
where $v_i$ is a 167-dimensional one-hot vector that uniquely identifies the atom type.
Step 2.4) based on the adjacency matrix
Figure BDA0003652967710000056
Distance matrix
Figure BDA0003652967710000057
and the coordinate set C, edge features other than the Euclidean distance feature can be computed, such as contact features and projection features, wherein: the contact matrix
Figure BDA0003652967710000058
with hyper-parameter $d_0 = 8$,
Figure BDA0003652967710000059
Figure BDA00036529677100000510
Projection matrix
Figure BDA00036529677100000511
Given the coordinate of the i-th heavy atom, $c_i = (x_i, y_i, z_i)$, and the coordinates of its preceding and following heavy atoms in the amino acid sequence, $c_{i-1} = (x_{i-1}, y_{i-1}, z_{i-1})$ and $c_{i+1} = (x_{i+1}, y_{i+1}, z_{i+1})$, the local reference frame can be calculated from the coordinates of the three heavy atoms according to the following formulas:
Figure BDA00036529677100000512
Figure BDA00036529677100000513
the local coordinate system of the jth heavy atom can be obtained in the same way
Figure BDA00036529677100000514
And
Figure BDA00036529677100000515
The projection vector $p_{ij}$ is then the concatenation of the projections of the vector connecting the two heavy atoms, $(c_j - c_i)$, onto the two local coordinate systems. One can thus define
Figure BDA00036529677100000516
where $\mu_{ij} = [d_{ij}, b_{ij}, p_{ij}]$ is a 47-dimensional feature in total.
Step 2.5) In summary, a protein graph is defined by its adjacency matrix, node features, and edge features: $a_{ij}$ indicates whether the i-th and j-th heavy atoms are connected in the graph, $v_i$ is the original node feature of the i-th heavy atom, and $\mu_{ij}$ is the edge feature between the i-th and j-th heavy atoms.
Step 3) Construct a training data queue $Q_{train} = \{(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\}$, where X is the protein graph constructed from the original features of the sample, Y is the true label value corresponding to the sample, and $Y_n \in (0, 1)$.
Step 4) Construct a graph neural network learning framework comprising an attention-based graph convolution network and a graph pooling network. The training data are fed into the graph convolution network to compute the high-dimensional features of the protein graph, and graph pooling is then applied to these features to obtain the local score and the global score of the sample. The obtained scores and the true scores of the sample are sent together to the loss function module to compute the loss; the parameter updating module updates the network parameters from the loss using back propagation and gradient descent; training is iterated until the validation module judges that the model has reached the expected performance.
Step 4.1) A batch of protein graphs is first fed into the attention-based graph convolution network to obtain high-dimensional features. The network consists of 4 layers of identical structure, and the overall computation of each layer is:
Figure BDA0003652967710000061
where $\omega^{(k)}$ is the attention weight obtained by passing the high-dimensional feature map $e^{(k)}$ through the squeeze-and-excitation module, specifically:
Figure BDA0003652967710000062
$e^{(k)}$ is obtained from $\tau^{(k)}$ through a fully connected layer, and $\tau^{(k)}$ is the high-dimensional feature map produced by the triple attention module, specifically:
Figure BDA0003652967710000063
Figure BDA0003652967710000064
where
Figure BDA0003652967710000065
are the four feature maps obtained by passing the original feature map $z^{(k)}$ through four multi-head fully connected layers of identical structure, with the subscript h denoting the head:
Figure BDA0003652967710000066
Figure BDA0003652967710000067
Figure BDA0003652967710000068
where: the superscript k is the index of the current network layer and takes the value 1, 2, 3 or 4;
Figure BDA00036529677100000627
is the softmax function, $\sigma(\cdot)$ is the sigmoid function, and $\delta(\cdot)$ is the ReLU nonlinear activation function; $\odot$ is element-wise multiplication, and
Figure BDA0003652967710000069
is the concatenation of vectors;
Figure BDA00036529677100000610
and
Figure BDA00036529677100000611
are the convolution weight matrix and the convolution bias matrix of the k-th layer, respectively, where the subscript h denotes the attention head; in this work the number of heads is a preset hyper-parameter equal to 3.
Step 4.2) With the new node embeddings obtained by the forward computation denoted $v'_i$, graph pooling is used to obtain the global and local scores, specifically: the global score is
Figure BDA00036529677100000612
and the local score of the k-th amino acid is
Figure BDA00036529677100000613
where
Figure BDA00036529677100000614
is the set of heavy atoms that make up the k-th amino acid, and
Figure BDA00036529677100000615
$b_l$,
Figure BDA00036529677100000616
$b_m$ are the weights and biases of the global and local fully connected layers, respectively.
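Since the pooling formulas survive only as images, the following is a loose sketch of what global and local pooling over node embeddings could look like, not a reconstruction of the patented formulas. It assumes mean pooling over atoms, a single fully connected layer per branch, and a sigmoid output so that scores fall in (0, 1); the function name `pool_scores` and all parameter names are hypothetical.

```python
import math


def _mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def _linear_sigmoid(vector, weight, bias):
    """Single-output fully connected layer followed by a sigmoid."""
    z = sum(w * v for w, v in zip(weight, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def pool_scores(embeddings, residue_of_atom, w_global, b_global, w_local, b_local):
    """Global score from the mean over all atoms; one local score per
    residue from the mean over that residue's atoms."""
    global_score = _linear_sigmoid(_mean_vector(embeddings), w_global, b_global)
    atoms_by_residue = {}
    for embedding, residue in zip(embeddings, residue_of_atom):
        atoms_by_residue.setdefault(residue, []).append(embedding)
    local_scores = {
        residue: _linear_sigmoid(_mean_vector(group), w_local, b_local)
        for residue, group in atoms_by_residue.items()
    }
    return global_score, local_scores
```

Here `embeddings` holds one vector per heavy atom and `residue_of_atom` gives the residue index of each atom, so the grouping step plays the role of the set of heavy atoms that make up the k-th amino acid.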
Step 4.3) According to the loss function, the parameters $\theta_k$ of the network $\delta_\theta$ at the k-th iteration are updated using stochastic gradient descent, specifically $\theta_{k+1} = \theta_k - \lambda \nabla_{\theta_k} Loss$, where $\lambda$ is the preset learning rate and $\nabla_{\theta_k} Loss$ is the corresponding gradient obtained from the loss function. Specifically, assuming that the sample consists of L amino acids, the final score obtained by the network is
Figure BDA00036529677100000619
Figure BDA00036529677100000620
the corresponding true label value is likewise
Figure BDA00036529677100000621
Figure BDA00036529677100000622
and the loss function of the sample is
Figure BDA00036529677100000623
Figure BDA00036529677100000624
where $\varepsilon(\cdot)$ is a step function.
Step 4.4) After the specified number of iterations, load from the storage path the model parameters that performed best on the validation set and predict on the test set. The Pearson coefficient of the two vectors (Score, Y) on the test set, $\rho = \frac{\mathrm{cov}(Score, Y)}{\sigma_{Score}\,\sigma_{Y}}$, is used to judge whether the hyper-parameters need to be adjusted, where $\mathrm{cov}(\cdot)$ is the covariance and $\sigma$ is the standard deviation.
Step 5) For a protein decoy structure awaiting prediction, first apply the processing of step 1 and step 2 to obtain its protein graph. The result obtained by feeding this protein graph into the graph neural network is the prediction score of the decoy structure.
This example uses DeepAccNet (from Hiranuma, N., Park, H., Baek, M. et al. Improved protein structure refinement guided by deep learning based accuracy estimation. Nat Commun 12, 1340 (2021), https://doi.org/10.1038/s41467-021-). The Pearson correlation and the AUC computed against the global scores on the test set are used as the criteria for evaluating the model.
To demonstrate the advantage of the invention in judging high-accuracy structures, the samples with true label values greater than 0.5 in the CASP13 and CASP14 protein decoy structure data sets are selected to form two new test data sets, CASP13_H and CASP14_H, and the comparison is repeated on these two data sets.
The final results on the CASP13 and CASP14 ranking tasks are shown in Tables 1-2. Compared with the four best single-model evaluation methods in each competition, the Pearson correlation and AUC of this method remain in the same tier, within an acceptable range, wherein: the AUC is defined as the area under the ROC curve and reflects the ability of the method to distinguish decoy structures of poor accuracy from those of good accuracy (the boundary being whether GDT-TS is greater than 0.5).
Table 1: Comparison on CASP13
Method                  AUC      Pearson
ProQ3D                  95.83%   0.8510
MESHI-enrich-server     95.08%   0.8318
ProQ3                   95.05%   0.8316
MASS1                   94.28%   0.8096
This method             92.32%   0.8247
Table 2: Comparison on CASP14
Method                  AUC      Pearson
ProQ4                   80.79%   0.6278
Bhattacharya-QDeep      78.88%   0.6249
QMEANDisCo              78.84%   0.6120
Bhattacharya-QDeepU     78.38%   0.6084
This method             82.81%   0.6383
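The AUC values reported in Tables 1-2 can be understood through the following sketch, which computes the area under the ROC curve as the fraction of (good, poor) decoy pairs that the predicted scores rank correctly. The function name `roc_auc` is hypothetical, and the pairwise formulation is a standard equivalent of the ROC area, not necessarily the procedure used to produce the tables.

```python
def roc_auc(true_scores, predicted_scores, threshold=0.5):
    """AUC for separating decoys with true score > threshold (positives)
    from the rest (negatives), using predicted scores as the ranking.

    Equals the probability that a random positive is ranked above a
    random negative; ties count as half.
    """
    positives = [p for t, p in zip(true_scores, predicted_scores) if t > threshold]
    negatives = [p for t, p in zip(true_scores, predicted_scores) if t <= threshold]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

With `true_scores` set to the GDT-TS labels and `threshold=0.5`, this matches the boundary between poor and good decoys used in the text. The pairwise loop is quadratic in the number of decoys; a rank-based computation would be preferable for large data sets.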
The final results on the CASP13_H and CASP14_H ranking tasks are shown in Tables 3-4. Compared with the 14 best-ranked evaluation methods in the competition (including multi-model methods, which often score better on these metrics), this method obtains a clear lead in Pearson correlation and can better rank and select the protein decoys whose accuracy is closest to the native structure.
Table 3: Comparison on CASP13_H
Method                  Pearson
MUfoldQA_T              0.7636
MULTICOM_CLUSTER        0.8022
Davis-EMAconsensus      0.7733
ModFOLDclust2           0.7777
FALCON-QA               0.7516
RaptorX-DeepQA          0.6791
Bhattacharya-ClustQ     0.7124
ProQ3D                  0.6317
FaeNNz                  0.8115
Pcons                   0.5867
Pcomb                   0.5560
MESHI-enrich-server     0.7772
ProQ3                   0.7188
MASS1                   0.4661
This method             0.8646
Table 4: Comparison on CASP14_H
Figure BDA0003652967710000081
Figure BDA0003652967710000091
Similarly, for 1000 protein decoy structures randomly selected from the validation set, the prediction time of this method is compared with that of other top-ranked methods, all run on CPU, as shown in Table 5.
Table 5: Prediction time on 1000 randomly chosen decoys
Method                  Time
ProQ4                   27min 22s
Bhattacharya-QDeep      31min 3s
QMEANDisCo              2h 3min 0s
Bhattacharya-QDeepU     33min 10s
This method             10min 46s
Compared with the prior art, the method needs less prediction time while reaching comparable evaluation accuracy. On the high-accuracy samples of the authoritative data sets, the evaluation accuracy of the method is superior to that of all current techniques.
Those skilled in the art may modify the foregoing embodiments in many different ways without departing from the spirit and scope of the invention, which is defined by the appended claims; all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (9)

1. A protein structure model quality evaluation method based on a graph neural network, characterized in that the global three-dimensional coordinates of all 167 kinds of heavy atoms in a protein bait to be evaluated are extracted as input; the protein bait is preprocessed with the dual-space relax protocol of the Rosetta software; original node features and edge features are obtained by parsing and calculation and are input into a graph neural network model built on an attention mechanism and a graph pooling technique; and computation with the pre-trained model network parameters yields a global score and local scores that reflect, at the protein level and the amino acid residue level respectively, the difference between the protein bait structure and the real native protein structure.
2. The method for evaluating the quality of the protein structure model based on the graph neural network as claimed in claim 1, wherein the graph neural network model built on the attention mechanism and the graph pooling technique comprises: a graph neural computation network built by connecting four network layers of identical architecture in series, followed by a graph pooling network formed by two pooling layers connected in parallel, wherein: each network layer in the graph neural computation network consists of a triple attention mechanism and a channel attention mechanism connected in series, and the two parallel pooling layers in the graph pooling network perform local pooling and global pooling, respectively, on the node embeddings of the upstream graph.
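The two parallel pooling layers of claim 2 can be illustrated as follows. This is only a sketch under stated assumptions: both pooling operations are taken to be plain averages (claim 9 mentions "average processing" but gives no formula), local pooling is assumed to group atom embeddings by residue, and the names `local_pool`, `global_pool` and `residue_index` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def local_pool(node_emb: Sequence[Sequence[float]],
               residue_index: Sequence[int]) -> Dict[int, List[float]]:
    """Assumed local pooling: average the atom embeddings of each residue."""
    groups = defaultdict(list)
    for emb, res in zip(node_emb, residue_index):
        groups[res].append(emb)
    return {res: mean_vector(rows) for res, rows in sorted(groups.items())}

def global_pool(node_emb: Sequence[Sequence[float]]) -> List[float]:
    """Assumed global pooling: average the embeddings of all atoms."""
    return mean_vector(node_emb)

# Toy upstream graph: 4 atoms with 2-dimensional embeddings, 2 residues.
node_emb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
residue_index = [0, 0, 1, 1]

per_residue = local_pool(node_emb, residue_index)   # feeds the local scores
whole_protein = global_pool(node_emb)               # feeds the global score
print(per_residue, whole_protein)
```

In a real model each pooled vector would still pass through a readout that maps it to a score; that step is omitted here because the claims do not specify it.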
3. The method for evaluating the quality of the protein structure model based on the graph neural network as claimed in claim 1 or 2, wherein the pre-trained model network parameters are obtained by training the model architecture on a training set for ten rounds and then evaluating it on a validation set and a test set respectively, specifically comprising the following steps:
step one: a sample protein bait structure is taken from the training set and subjected to dual-space relax; the structure is then parsed to obtain the original node features and edge features, and a training data queue is constructed
Figure FDA0003652967700000011
wherein: X is the protein graph constructed from the original features of the sample, and Y is the true label value corresponding to the sample;
step two: the data pairs in the training data queue
Figure FDA0003652967700000014
are input in order, in batches, into the graph neural network model to obtain the output $\delta_\theta(X)$, and the loss function value is then calculated as $\mathrm{Loss} = \sum_{\mathrm{batch}} (Y - \delta_\theta(X))^2 + \sum_{\mathrm{label} > 0.5} (Y - \delta_\theta(X))^2$, wherein: $\delta_\theta$ is the graph neural network model;
step three: according to the loss function, the stochastic gradient descent algorithm is used to update the parameters $\delta_k$ of the $\delta_\theta$ network at the k-th iteration, specifically
Figure FDA0003652967700000012
wherein: $\lambda$ is the preset learning rate, 0.001, and
Figure FDA0003652967700000013
is the corresponding gradient value obtained from the loss function;
step four: after all samples in the training set have been traversed, the samples in the validation set are input into the $\delta_\theta$ network for evaluation; the correlation coefficient and the absolute difference between the predicted values and the label values of all samples in the validation set are calculated and compared with the results of previously trained models, and the better version is kept and saved;
step five: after the specified number of training iterations is completed, the historically best model parameters, as judged and saved on the validation set, are retrieved; the samples in the test set are input, the prediction scores of the model on the test set are calculated and the correlation coefficient is obtained; whether the current model design is reasonable is then assessed from the correlation coefficient, in order to decide whether to stop training or to adjust the relevant hyper-parameters.
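Steps two to four can be sketched on a toy problem. The example below is a hedged illustration, not the patented network: a one-variable linear model stands in for the graph neural network, the update is assumed to be the usual gradient step (parameters minus learning rate times gradient), and model selection uses only the mean absolute difference on the validation set. The loss follows claim 3: every sample contributes its squared error once, and samples whose label exceeds 0.5 contribute it a second time. All function names are hypothetical.

```python
LEARNING_RATE = 0.001  # learning rate stated in claim 3

def predict(theta, x):
    """Toy stand-in for the network: theta = (w, b), output w * x + b."""
    w, b = theta
    return w * x + b

def loss(theta, batch):
    """Squared error over the batch plus squared error again where label > 0.5."""
    total = 0.0
    for x, y in batch:
        err = (y - predict(theta, x)) ** 2
        total += err
        if y > 0.5:
            total += err
    return total

def gradient(theta, batch):
    """Analytic gradient of the loss above with respect to (w, b)."""
    gw = gb = 0.0
    for x, y in batch:
        weight = 2.0 if y > 0.5 else 1.0
        residual = y - predict(theta, x)
        gw += -2.0 * weight * residual * x
        gb += -2.0 * weight * residual
    return gw, gb

def sgd_step(theta, batch, lr=LEARNING_RATE):
    """One gradient-descent update: theta minus lr times gradient."""
    gw, gb = gradient(theta, batch)
    return theta[0] - lr * gw, theta[1] - lr * gb

def mean_abs_diff(theta, samples):
    return sum(abs(y - predict(theta, x)) for x, y in samples) / len(samples)

def train(train_batches, val_set, rounds=10):
    """Train for a fixed number of rounds, keeping the best validation model."""
    theta = (0.0, 0.0)
    best_theta, best_val = theta, mean_abs_diff(theta, val_set)
    for _ in range(rounds):
        for batch in train_batches:
            theta = sgd_step(theta, batch)
        val = mean_abs_diff(theta, val_set)
        if val < best_val:          # keep the better version
            best_theta, best_val = theta, val
    return best_theta, best_val

batch = [(1.0, 0.8), (2.0, 0.2)]    # (feature, label) pairs, labels in [0, 1]
best_theta, best_val = train([batch], batch)
print(best_theta, best_val)
```

The second sum in the loss effectively doubles the weight of high-quality samples (label above 0.5), which is consistent with the emphasis the description places on high-accuracy baits.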
4. The method for evaluating the quality of the protein structure model based on the graph neural network as claimed in claim 3, wherein the training set and the validation set are obtained by using different prediction methods and prediction software to perturb the experimentally determined protein structures of different amino acid sequences in a human protein database, thereby generating a large number of protein bait structures, and dividing these structures at a ratio of 8:2;
the test set is the CASP standard data set, i.e. the official data set provided by the CASP competition. CASP, the Critical Assessment of protein Structure Prediction, is a worldwide community experiment. In each round, CASP selects a batch of protein sequences whose structures are not yet known but are about to be solved by laboratories around the world, and provides them to all participating groups for structure prediction; all predicted structures are then collected and passed to the groups in the protein structure model evaluation category for scoring. Its official data set is therefore the gold standard in the research field to which the invention belongs.
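The 8:2 division of claim 4 can be sketched as a shuffled split. The claim does not say whether the division is made per bait or per target protein; the sketch below shuffles individual items, and the function name, the seed and the sample identifiers are assumptions made for illustration.

```python
import random

def split_train_val(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples reproducibly and split them at train_fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical bait identifiers.
baits = [f"bait_{i}" for i in range(10)]
train_set, val_set = split_train_val(baits)
print(len(train_set), len(val_set))
```

If several baits derive from the same native structure, splitting by target rather than by bait keeps near-identical structures from appearing on both sides of the split.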
5. The method of claim 3, wherein the protein graph is the data representation of a protein bait structure, and each protein graph is composed of nodes and edges, wherein the nodes represent heavy atoms in the protein and the edges represent the relationship between each node atom and its 50 nearest neighbors in three-dimensional space. Each node and each edge has original features; the original node feature is a one-hot vector encoding the atom type at the node.
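The graph construction of claim 5 can be sketched as a nearest-neighbor search over atom coordinates. This is a minimal brute-force illustration: the function names are hypothetical, the toy example uses k = 2 instead of the 50 neighbors named in the claim, and a practical implementation would use a spatial index rather than sorting all pairwise distances.

```python
import math

def one_hot(index, size):
    """One-hot vector of the given size with a 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec

def build_protein_graph(coords, atom_types, num_types, k=50):
    """Return (nodes, edges): one-hot node features and k-nearest-neighbor edges.

    Each edge is a tuple (i, j, distance) from atom i to one of its k nearest atoms.
    """
    n = len(coords)
    nodes = [one_hot(t, num_types) for t in atom_types]
    edges = []
    for i in range(n):
        dists = sorted(
            (math.dist(coords[i], coords[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            edges.append((i, j, d))
    return nodes, edges

# Toy example: four atoms on a line, three atom types, two neighbors per atom.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
atom_types = [0, 1, 2, 1]
nodes, edges = build_protein_graph(coords, atom_types, num_types=3, k=2)
print(nodes)
print(edges)
```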
6. The method for evaluating the quality of the protein structure model based on the graph neural network according to claim 5, wherein the edge features specifically comprise:
category 1) edge distance: for every two atoms to be connected in the protein graph, their relative distance in the sample is calculated, and a vector representation of the distance is generated by Gaussian expansion;
category 2) edge coordinates: each heavy atom B forms a local reference system together with its preceding atom A and its following atom C in the amino acid sequence; the three unit vectors of the reference system are respectively
Figure FDA0003652967700000021
When the edge coordinate feature between any two atoms $B_i$ and $B_j$ in space is considered, the three-dimensional coordinates of $B_i$ and $B_j$ are each transformed into the local reference system of the other; the edge (i, j) in the protein graph thereby obtains 2 × 3 features, namely the projection of the spatial coordinates of atom $B_i$ ($B_j$) onto the local coordinate system of atom $B_j$ ($B_i$);
category 3) chemical bond: a Boolean feature, set to 1 when the connection between the two atoms is a chemical bond in the actual protein structure and to 0 otherwise.
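The three edge feature categories of claim 6 can be sketched together. Several details are assumptions: the Gaussian centers, their number and the width are not given in the claims, and the three unit vectors of the local reference system are defined in a figure that is not reproduced here, so the sketch substitutes a common Gram-Schmidt construction (first axis along B to C, second axis from the component of B to A orthogonal to it, third axis as their cross product). All function names are hypothetical.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def gaussian_expansion(d, d_max=10.0, num_centers=5, sigma=1.0):
    """Expand a scalar distance on evenly spaced Gaussian centers (assumed values)."""
    step = d_max / (num_centers - 1)
    return [math.exp(-((d - k * step) ** 2) / (2 * sigma ** 2))
            for k in range(num_centers)]

def local_frame(a, b, c):
    """Assumed orthonormal frame at atom b; fails if a, b, c are collinear."""
    e1 = unit(sub(c, b))
    v = sub(a, b)
    e2 = unit(sub(v, [dot(v, e1) * x for x in e1]))
    e3 = cross(e1, e2)
    return e1, e2, e3

def project(point, origin, frame):
    """Coordinates of point in the frame centered at origin."""
    rel = sub(point, origin)
    return [dot(rel, e) for e in frame]

def edge_features(bi, frame_i, bj, frame_j, bonded):
    """Distance expansion + 2 x 3 mutual projections + Boolean bond flag."""
    d = math.dist(bi, bj)
    return (gaussian_expansion(d)
            + project(bi, bj, frame_j)   # B_i seen from the frame of B_j
            + project(bj, bi, frame_i)   # B_j seen from the frame of B_i
            + [1 if bonded else 0])

bi, bj = [0.0, 0.0, 0.0], [2.5, 0.0, 0.0]
frame_i = local_frame([0.0, 1.0, 0.0], bi, [1.0, 0.0, 0.0])
frame_j = local_frame([2.5, 1.0, 0.0], bj, [3.5, 0.0, 0.0])
feats = edge_features(bi, frame_i, bj, frame_j, bonded=True)
print(feats)
```

Expressing each atom in the other atom's frame makes these six features independent of the global orientation of the structure, whatever exact frame definition is used.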
7. The method for evaluating the quality of the protein structure model based on the graph neural network according to claim 3, wherein the true label value is the exact score of a protein bait structure relative to its corresponding real structure; the true label values fall into two categories, global and local: when the native structure corresponding to the sequence is known, the global and local label values are computed by two algorithms, the global distance test (GDT-TS) and the local distance difference test (lDDT), respectively.
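The global label of claim 7 can be illustrated with a simplified GDT-TS calculation. GDT-TS is commonly defined as the average, over the cutoffs 1, 2, 4 and 8 Å, of the fraction of residues whose distance to the native position falls within the cutoff. The sketch assumes the per-residue distances after one fixed superposition are already available; the full algorithm additionally searches for the superposition that maximizes each fraction, which is omitted here.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS in [0, 1] from per-residue distances in angstroms."""
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return sum(fractions) / len(cutoffs)

# Hypothetical distances between corresponding residues of bait and native structure.
distances = [0.5, 1.5, 3.0, 9.0]
score = gdt_ts(distances)
print(score)
```

A score on this 0 to 1 scale matches the label > 0.5 threshold used in the loss function of claim 3.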
8. The method for evaluating the quality of the protein structure model based on the graph neural network according to claim 3, wherein the correlation coefficient is the Pearson correlation coefficient
Figure FDA0003652967700000031
wherein: cov(X, Y) is the covariance of the two vectors, and σ is the standard deviation of each vector.
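The Pearson correlation coefficient described in claim 8 is the covariance of the two vectors divided by the product of their standard deviations. A minimal sketch in plain Python follows; the function name and the example scores are illustrative, not taken from the patent.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)

# Hypothetical predicted scores and true label values for five baits.
predicted = [0.30, 0.45, 0.60, 0.72, 0.90]
labels = [0.25, 0.50, 0.55, 0.80, 0.85]
print(pearson(predicted, labels))
```

The coefficient is undefined when either vector is constant, since its standard deviation is then zero.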
9. A system for implementing the protein structure model quality evaluation method based on the graph neural network as claimed in any one of claims 1 to 8, characterized by comprising: a protein graph generation module, a feature extraction network based on an attention mechanism, a graph pooling network, a loss function module, a parameter updating module and a verification module, wherein: the protein graph generation module extracts the original node features and edge features from the sample protein bait structure data and builds them into an original protein graph; the feature extraction network based on the attention mechanism obtains the high-dimensional features of the protein graph through forward propagation; the graph pooling network applies simple computation and averaging to the high-dimensional features to obtain the global and local prediction scores respectively; the obtained scores are sent, together with the true scores corresponding to the sample, to the loss function module to calculate the loss; the parameter updating module updates the network parameters according to the loss using back propagation and gradient descent; and training is iterated until the verification module judges that the model achieves the expected effect.
CN202210557804.6A 2022-05-19 2022-05-19 Protein structure model quality evaluation method based on graph neural network Pending CN114997366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210557804.6A CN114997366A (en) 2022-05-19 2022-05-19 Protein structure model quality evaluation method based on graph neural network

Publications (1)

Publication Number Publication Date
CN114997366A true CN114997366A (en) 2022-09-02

Family

ID=83026978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210557804.6A Pending CN114997366A (en) 2022-05-19 2022-05-19 Protein structure model quality evaluation method based on graph neural network

Country Status (1)

Country Link
CN (1) CN114997366A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116206676A (en) * 2023-04-28 2023-06-02 中国人民解放军军事科学院军事医学研究院 Immunogen prediction system and method based on protein three-dimensional structure and graph neural network
CN116206676B (en) * 2023-04-28 2023-09-26 中国人民解放军军事科学院军事医学研究院 Immunogen prediction system and method based on protein three-dimensional structure and graph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination