CN113707213B - Protein structure rapid classification method based on contrast graph neural network - Google Patents
- Publication number: CN113707213B
- Application number: CN202111047262.XA
- Authority: CN (China)
- Legal status: Active
Classifications
- G16B5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
- G06N3/04 - Neural networks; architecture, e.g. interconnection topology
- G06N3/084 - Learning methods; backpropagation, e.g. using gradient descent
- G16B20/30 - Detection of binding sites or motifs
- G16B50/00 - ICT programming tools or database systems specially adapted for bioinformatics
- Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
A method for rapidly classifying protein structures based on a contrastive graph neural network: the global coordinates of the alpha-carbon atoms of all residues in the protein to be predicted are extracted in three-dimensional space, an adjacency matrix and original node features are computed from these coordinates, and the result is fed into a neural network model based on a momentum contrastive learning framework to obtain a descriptor of the protein structure. The invention combines deep learning with domain knowledge of protein structure to generate more discriminative descriptors, thereby identifying structures similar to a target protein more accurately and improving the precision of protein structure classification.
Description
Technical Field
The invention relates to a technology in the field of bioengineering, and in particular to a method for rapidly classifying protein structures based on a contrastive graph neural network.
Background
The purpose of protein structure comparison is to measure the structural similarity between two different proteins. For structural bioinformatics, structure comparison tools are infrastructure: they are indispensable to tasks such as protein structure prediction, protein molecular docking, and structure-based protein function prediction. Protein structure comparison methods fall into two families. Alignment-based methods are often time-consuming and, given the rapid growth of protein structure data, cannot meet the needs of large-scale protein structure retrieval. Characterization-based methods typically convert the coordinates of all backbone atoms of a protein into a fixed-length vector, and then measure the similarity between two structures by the distance or correlation coefficient between the vectors; a fixed-length vector with rotational invariance is called a descriptor of the protein structure.
Disclosure of Invention
Existing protein structure characterization methods rely on hand-designed features; methods based on structural alignment are inefficient and cannot meet the demands of large-scale protein structure retrieval; and other characterization-based methods have relatively low precision and struggle to find enough similar structures. To address these shortcomings, a protein structure rapid classification method based on a contrastive graph neural network is provided, which combines deep learning with domain knowledge of protein structure to generate more discriminative descriptors, thereby identifying structures similar to the target protein more accurately and improving the precision of protein structure classification.
The invention is realized by the following technical scheme:
The invention relates to a method for rapidly classifying protein structures based on a contrastive graph neural network: the global coordinates of the alpha-carbon atoms of all residues in the protein to be predicted are extracted in three-dimensional space, an adjacency matrix and original node features are computed from these coordinates, and the result is fed into a neural network model based on a momentum contrastive learning framework to obtain a descriptor of the protein structure.
The neural network model comprises two graph-neural-network encoders with identical architectures. Training samples are constructed by computing the similarity between every two protein structures in a training dataset and then sampling structure pairs from the training dataset using a method of dynamically dividing positive and negative samples.
During training, the distance between the two descriptors output by the neural network model is measured by the length-scaled cosine distance, and whether training has reached its target is determined from the data in the validation set together with the length-scaled cosine distance.
The adjacency matrix is obtained as follows: extract the Cartesian coordinates of the alpha-carbon atom of each residue of the protein structure in three-dimensional space, compute the Euclidean distance between each pair of residues from these coordinates, and construct the adjacency matrix from the distances. Specifically:
Step 1) For a protein comprising $L$ residues, the alpha-carbon atom of the $i$-th residue has Cartesian coordinates $v_i=(x_i,y_i,z_i)$ and that of the $j$-th residue has coordinates $v_j=(x_j,y_j,z_j)$; the Euclidean distance between these two residues is $d_{ij}=\|v_i-v_j\|$, and the distance matrix of the protein is $D\in\mathbb{R}^{L\times L}$ with entries $D_{ij}=d_{ij}$.
Step 2) From the distance matrix $D$ obtained above, the adjacency matrix $A$ is obtained by a normalization governed by two hyperparameters $\omega$ and $\varepsilon$, both greater than 0.
Step 3) Obtain the distance-based original node features of each residue from the Cartesian coordinates of the alpha-carbon atoms in three-dimensional space. The set of residue coordinates is $V=\{v_1,v_2,\ldots,v_{L-1},v_L\}$, and $V_{i:j}=\{v_i,v_{i+1},\ldots,v_{j-2},v_{j-1}\}$ ($i<j$) denotes the set of coordinates from the $i$-th up to (but not including) the $j$-th residue of the sequence. The distance-based original node feature vector of the $i$-th residue is $x_i\in[0,+\infty)^K$, where $K=\sum_{m=0}^{M-1}2^m=2^M-1$ is the length of the vector, $M$ is a hyperparameter controlling the length of $x_i$, and $m\in\{0,1,\ldots,M-1\}$ together with $g\in\{1,2,\ldots,2^m\}$ index a set of reference points; the $k$-th element of $x_i$ is the Euclidean distance between $v_i$ and the coordinates of the $k$-th reference point.
Step 4) Obtain the angle-based original node feature of each residue from the Cartesian coordinates of the alpha-carbon atoms in three-dimensional space: for the coordinates of three consecutive residues in the protein sequence, $v_{i-1}$, $v_i$, $v_{i+1}$, the angle-based original node feature of the $i$-th residue is the angle they form at $v_i$.
Step 5) Concatenate the distance-based and angle-based original node features to obtain the original node feature of the $i$-th residue, $x_i = x_i^{\mathrm{dist}} \oplus x_i^{\mathrm{angle}}$, where $\oplus$ denotes the concatenation operation; the original node feature matrix of a protein structure with $L$ residues is then $X=[x_1,x_2,\ldots,x_L]^{\mathrm{T}}$, where $\mathrm{T}$ is the transpose operation.
The similarity is obtained as follows: when the protein structure training dataset contains $N$ structures, denote it $D_{\mathrm{train}}=\{X_1,X_2,\ldots,X_i,\ldots,X_N\}$, where $X_i$ is the $i$-th protein structure. The TM-score between the $i$-th and $j$-th structures, computed with the TM-align algorithm, is taken as their structural similarity and denoted $\mathrm{TM}(X_i,X_j)$; the TM-score takes values in $[0,1]$.
The method for dynamically dividing the positive and negative samples specifically comprises the following steps:
Step i) Choose any structure $X_a$ from the training dataset $D_{\mathrm{train}}$ and compute the TM-score between it and every other structure in $D_{\mathrm{train}}$, then sort the structures in descending order of TM-score. Randomly sample a structure $X_b\neq X_a$ from $D_{\mathrm{train}}$: if $X_b$ ranks within the first $K\%$, then $(X_a,X_b)$ is a positive sample structure pair; otherwise it is a negative sample structure pair, where $K$ is a preset hyperparameter, $K\in\mathbb{N}$ and $K\in(0,100]$;
Step ii) Build the training data queue $\{(X_{a_1},X_{b_1}),(X_{a_2},X_{b_2}),\ldots\}$, in which every structure pair is a positive sample pair and, for any structure pair $(X_{a_i},X_{b_i})$ in the queue, $\mathrm{TM}(X_{a_i},X_{b_j})<\mathrm{TM}(X_{a_i},X_{b_i})$ holds whenever $0<i-j\le L_N$, where $L_N$ is the length of the negative sample queue in the momentum contrastive learning framework.
The graph-neural-network-based encoder comprises a plurality of fully connected layers, a BiLSTM module and a plurality of graph convolutional layers, and the training process comprises:
Step (1) The adjacency matrices and original node features of the structure pairs in the training data queue are input into the contrastive learning framework batch by batch, and the loss function $\mathcal{L}=-\log\frac{\exp(y_q\cdot y_k/\tau)}{\exp(y_q\cdot y_k/\tau)+\sum_{i=1}^{L_N}\exp(y_q\cdot y_i/\tau)}$ is computed, where $y_q$ and $y_k$ are the descriptors obtained by feeding the two structures of a positive sample pair into $\varepsilon_q$ and $\varepsilon_k$ respectively, $y_i$ is the descriptor obtained by feeding the $i$-th structure of the negative sample queue into $\varepsilon_k$, and $\tau$ is a preset temperature coefficient.
Step (2) Update the parameters $\theta_q$ of $\varepsilon_q$ with stochastic gradient descent on the loss function, then update the parameters $\theta_k$ of $\varepsilon_k$ from $\theta_q$ by $\theta_k\leftarrow m\cdot\theta_k+(1-m)\cdot\theta_q$, where $m\in(0,1]$ is a preset momentum coefficient.
Step (3) Add the second structure $X_b$ of every structure pair used in this iteration to the negative sample queue; once the number of structures in the negative sample queue reaches the preset length $L_N$, the structure added to the queue earliest is removed.
Step (4) After the designated number of training iterations is completed, the structure data of the training set and the validation set are fed into $\varepsilon_q$ to obtain descriptors, and the length-scaled cosine distance is computed between every descriptor of the validation set and every descriptor of the training set, where $l_a$, $l_b$ and $l_{\max}$ are the length of the protein in the validation set, the length of the protein in the training set and the length of the longest protein in the training set, and $y_a$ and $y_b$ are the descriptors of the protein structures in the validation set and the training set respectively. After the pairwise distances between the descriptors of the validation set and the training set are obtained, the current model can be evaluated against the true structural similarities, and a decision made on whether to terminate training or reduce the learning rate.
The invention also relates to a system implementing the method, comprising a feature extraction module, two encoders, a verification module and a parameter updating module, wherein: the feature extraction module extracts a distance matrix and node features from the structure data of each of the two proteins in a structure pair and feeds them into the two encoders respectively; the two encoders encode the input features into fixed-length output vectors; the verification module computes the distance between the two output vectors, evaluates its deviation from the true similarity between the two structures, and computes the loss; the parameters of one encoder are updated by backpropagation on this loss, and the parameters of the other encoder are updated by the momentum method.
Technical effects
The invention addresses the shortcomings of the prior art in both the precision and the speed of protein structure comparison. Compared with the prior art, the dynamic training-data partitioning strategy lets the model learn finer-grained similarity relationships, and the length-scaled cosine distance further corrects the similarity between descriptors. Better results are achieved on both the ranking and classification tasks without sacrificing computational efficiency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of an embodiment momentum contrast learning framework;
fig. 3 is a schematic diagram of the graph-neural-network-based encoder architecture of an embodiment.
Detailed Description
As shown in fig. 1, this embodiment relates to a method for rapid classification of protein structures based on a contrastive graph neural network, comprising the following steps:
Step 1) First extract the Cartesian coordinates of the alpha-carbon atom of each residue of the protein structure in three-dimensional space, then compute the Euclidean distance between each pair of residues from these coordinates, construct the adjacency matrix from the distances, and compute the relative coordinates and angles of each residue as original node features. Specifically:
Step 1.1) For a protein comprising $L$ residues, the alpha-carbon atom of the $i$-th residue has Cartesian coordinates $v_i=(x_i,y_i,z_i)$ and that of the $j$-th residue has coordinates $v_j=(x_j,y_j,z_j)$; the Euclidean distance between these two residues is $d_{ij}=\|v_i-v_j\|$, and the distance matrix of the protein is $D\in\mathbb{R}^{L\times L}$ with entries $D_{ij}=d_{ij}$.
Step 1.2) From the distance matrix $D$ obtained above, the adjacency matrix $A$ is obtained by a normalization governed by two hyperparameters $\omega$ and $\varepsilon$; in this example $\omega=4$ and $\varepsilon=2$.
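A minimal NumPy sketch of steps 1.1 and 1.2. The patent states only that the adjacency matrix is a normalization of the distance matrix with hyperparameters $\omega$ and $\varepsilon$ (here $\omega=4$, $\varepsilon=2$); the radial-basis kernel below is an illustrative stand-in for that unspecified normalization, not the patent's exact formula.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between C-alpha coordinates: (L, 3) -> (L, L)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def adjacency_matrix(coords, omega=4.0, eps=2.0):
    """Normalized adjacency from the distance matrix.

    Assumption: the patent's exact normalization is not reproduced in the text,
    so this stand-in maps distances into (0, 1] via the radial-basis kernel
    A_ij = exp(-(d_ij / omega) ** eps), using the two positive hyperparameters.
    """
    d = distance_matrix(coords)
    return np.exp(-((d / omega) ** eps))
```

Because the kernel is monotone decreasing in distance, nearby residues get edge weights close to 1 and distant residues close to 0, which is the qualitative behaviour the graph convolutions rely on.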
Step 1.3) Obtain the distance-based original node features of each residue from the Cartesian coordinates of the alpha-carbon atoms in three-dimensional space. The set of residue coordinates is $V=\{v_1,v_2,\ldots,v_{L-1},v_L\}$, and $V_{i:j}=\{v_i,v_{i+1},\ldots,v_{j-2},v_{j-1}\}$ ($i<j$) denotes the set of coordinates from the $i$-th up to (but not including) the $j$-th residue of the sequence.
The distance-based original node feature vector of the $i$-th residue is $x_i\in[0,+\infty)^K$, where the $k$-th element of $x_i$ ($k=0,1,\ldots,K-1$) is the Euclidean distance between $v_i$ and the coordinates of the $k$-th reference point, $K=\sum_{m=0}^{M-1}2^m=2^M-1$ is the length of the vector, and $M$ is a hyperparameter controlling the length of $x_i$, with $m\in\{0,1,\ldots,M-1\}$ and $g\in\{1,2,\ldots,2^m\}$ indexing the reference points; in this example $M=5$.
Step 1.4) Obtain the angle-based original node feature of each residue from the Cartesian coordinates of the alpha-carbon atoms in three-dimensional space: for the coordinates of three consecutive residues in the protein sequence, $v_{i-1}$, $v_i$, $v_{i+1}$, the angle-based original node feature of the $i$-th residue is the angle they form at $v_i$.
Step 1.5) Concatenate the distance-based and angle-based original node features to obtain the original node feature of the $i$-th residue, $x_i = x_i^{\mathrm{dist}} \oplus x_i^{\mathrm{angle}}$, where $\oplus$ denotes the concatenation operation. The original node feature matrix of a protein structure with $L$ residues is then $X=[x_1,x_2,\ldots,x_L]^{\mathrm{T}}$, where $\mathrm{T}$ is the transpose operation.
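The node-feature construction of steps 1.3 to 1.5 can be sketched as follows. The patent does not spell out how the $2^M-1$ reference points are placed, so the version below assumes they are the centroids of the $2^m$ contiguous segments of the chain for each $m$; it also assumes the angle feature is the planar angle at residue $i$, zero-padded at the two chain ends. Both are labelled assumptions.

```python
import numpy as np

def reference_points(coords, M=5):
    """Assumed reconstruction of the 2^M - 1 reference points: for each
    m in {0..M-1}, split the chain into 2^m contiguous segments and take
    the centroid of each (the patent's exact definition is not given)."""
    L = len(coords)
    points = []
    for m in range(M):
        bounds = np.linspace(0, L, 2 ** m + 1).astype(int)
        for g in range(2 ** m):
            points.append(coords[bounds[g]:bounds[g + 1]].mean(axis=0))
    return np.stack(points)                                   # (2^M - 1, 3)

def node_features(coords, M=5):
    """Distance features to the reference points plus the angle at each residue."""
    refs = reference_points(coords, M)                        # (K, 3)
    dist_feat = np.linalg.norm(coords[:, None, :] - refs[None, :, :], axis=-1)
    # Angle at residue i formed by its two sequence neighbours (assumption:
    # undefined end residues are padded with zero).
    u = coords[:-2] - coords[1:-1]
    w = coords[2:] - coords[1:-1]
    cos = (u * w).sum(-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(w, axis=-1))
    angle = np.zeros(len(coords))
    angle[1:-1] = np.arccos(np.clip(cos, -1.0, 1.0))
    return np.concatenate([dist_feat, angle[:, None]], axis=1)  # (L, K + 1)
```

With $M=5$ as in this example, each residue gets a $31+1=32$-dimensional feature vector; since all entries are distances and an angle, the representation is invariant to global rotation and translation of the structure.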
Step 2) For every protein structure in the training dataset, compute its similarity to all other protein structures. Specifically: when the protein structure training dataset contains $N$ structures, denote it $D_{\mathrm{train}}=\{X_1,X_2,\ldots,X_i,\ldots,X_N\}$, where $X_i$ is the $i$-th protein structure; in this embodiment $N$ is 13265. The TM-score between the $i$-th and $j$-th structures, computed with the TM-align algorithm, is taken as their structural similarity and denoted $\mathrm{TM}(X_i,X_j)$; the TM-score takes values in $[0,1]$.
Step 3) constructing a training data queue from a training data set sampling structure pair by using a method of dynamically dividing positive and negative samples, which specifically comprises the following steps:
Step 3.1) Choose any structure $X_a$ from the training dataset $D_{\mathrm{train}}$ and compute, as described above, the TM-score between it and every other structure in $D_{\mathrm{train}}$, then sort the structures in descending order of TM-score. Randomly sample a structure $X_b\neq X_a$ from $D_{\mathrm{train}}$: if $X_b$ ranks within the first 30%, the pair $(X_a,X_b)$ is called a positive sample structure pair; otherwise it is called a negative sample structure pair;
Step 3.2) Build the training data queue $\{(X_{a_1},X_{b_1}),(X_{a_2},X_{b_2}),\ldots\}$, in which every structure pair is a positive sample pair and, for any structure pair $(X_{a_i},X_{b_i})$ in the queue, $\mathrm{TM}(X_{a_i},X_{b_j})<\mathrm{TM}(X_{a_i},X_{b_i})$ holds whenever $0<i-j\le L_N$, where $L_N$ is the length of the negative sample queue in the momentum contrastive learning framework.
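A compact sketch of the dynamic positive/negative division of steps 3.1 and 3.2, assuming the pairwise TM-scores have been precomputed into a matrix. Taking at least one positive structure when the top-K% cutoff rounds down to zero is an added safeguard, not something the patent specifies.

```python
import random

def is_positive(a_idx, b_idx, tm, K=30.0):
    """Label the pair (X_a, X_b): positive iff X_b ranks within the top K%
    of all other structures when sorted by TM-score against X_a
    (K = 30 in this embodiment). `tm` is an N x N TM-score matrix."""
    n = len(tm)
    others = [j for j in range(n) if j != a_idx]
    ranked = sorted(others, key=lambda j: tm[a_idx][j], reverse=True)
    n_pos = max(1, int(len(ranked) * K / 100.0))   # safeguard: at least one positive
    return b_idx in set(ranked[:n_pos])

def sample_pair(a_idx, tm, K=30.0, rng=random):
    """Randomly draw X_b != X_a and attach its dynamic positive/negative label."""
    b_idx = rng.choice([j for j in range(len(tm)) if j != a_idx])
    return b_idx, is_positive(a_idx, b_idx, tm, K)
```

Because the label depends on where $X_b$ ranks among all candidates for this particular $X_a$, the positive/negative boundary adapts to each anchor structure rather than using one fixed TM-score threshold.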
Step 4) Build a momentum contrastive learning framework containing two graph-neural-network encoders, as shown in fig. 2. The training data are input into this framework to generate descriptors and train the model, and the decision of when to terminate training is made from the data in the validation set together with the length-scaled cosine distance. The two encoders have exactly the same architecture and are denoted $\varepsilon_q$ and $\varepsilon_k$. Taking $\varepsilon_q$ as an example, as shown in fig. 3, it contains multiple fully connected layers, a BiLSTM module and multiple graph convolutional layers. The model is trained as follows:
step 4.1) queuing training dataThe adjacency matrix and the original node characteristics of the structure pairs in (a) are sequentially input into a contrast learning framework in the form of batch, and a loss function is calculated>Wherein: y is q And y k Is that two structures in positive sample structure pair are respectively input epsilon q And epsilon k The resulting descriptors, y i Is the input of the ith structure in the negative sample queue to ε k τ is a predetermined temperature coefficient.
Step 4.2) Update the parameters $\theta_q$ of $\varepsilon_q$ with stochastic gradient descent on the loss, then update the parameters $\theta_k$ of $\varepsilon_k$ from $\theta_q$ by $\theta_k\leftarrow m\cdot\theta_k+(1-m)\cdot\theta_q$, where $m\in(0,1]$ is a preset momentum coefficient.
Step 4.3) Add the second structure of every structure pair used in this iteration to the negative sample queue; once the number of structures in the negative sample queue reaches the preset length $L_N$, the structure added to the queue earliest is removed.
Step 4.4) After the specified number of training iterations is completed, the structure data of the training set and the validation set are fed into $\varepsilon_q$ to obtain descriptors, and the length-scaled cosine distance is computed between every descriptor of the validation set and every descriptor of the training set, where $l_a$, $l_b$ and $l_{\max}$ are the length of the protein in the validation set, the length of the protein in the training set and the length of the longest protein in the training set, and $y_a$ and $y_b$ are the descriptors of the protein structures in the validation set and the training set respectively. After the pairwise distances between the descriptors of the validation set and the training set are obtained, the current model can be evaluated against the true structural similarities, and a decision made on whether to terminate training or reduce the learning rate.
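Steps 4.1 to 4.3 follow the standard momentum-contrast recipe. The NumPy sketch below isolates its three moving parts: the InfoNCE loss over one positive pair and the negative queue, the momentum update $\theta_k\leftarrow m\cdot\theta_k+(1-m)\cdot\theta_q$, and the FIFO queue of length $L_N$. The temperature value $\tau=0.07$ is an assumed default; the patent only calls it a preset coefficient.

```python
import numpy as np
from collections import deque

def info_nce_loss(y_q, y_k, neg_queue, tau=0.07):
    """InfoNCE objective for one positive pair (y_q, y_k) against the
    descriptors currently held in the negative queue."""
    logits = [np.dot(y_q, y_k) / tau] + [np.dot(y_q, y_i) / tau for y_i in neg_queue]
    logits = np.array(logits)
    logits -= logits.max()                      # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

def momentum_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied per parameter array."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

def update_negative_queue(queue, new_key, L_N=1024):
    """FIFO negative-sample queue: enqueue the newest key descriptor and
    drop the oldest once the preset length L_N is exceeded."""
    queue.append(new_key)
    while len(queue) > L_N:
        queue.popleft()
```

In a real implementation $\varepsilon_q$ would be updated by backpropagation through the loss while $\varepsilon_k$ receives only the momentum update, so the keys in the queue stay consistent across iterations.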
Step 5) For a query protein structure, first extract its adjacency matrix and original node features as in step 1, then input them into the first encoder $\varepsilon_q$ of the momentum contrastive learning framework; the output is the structure descriptor of the protein.
This example uses the protein structure database SCOPe v2.07 as training and validation sets and performs 5-fold cross-validation on the data. After removing entries at 40% sequence redundancy and cleaning the data, the dataset contains a total of 13265 protein domains, each belonging to one of 7 classes.
First the adjacency matrices and original node features of all structures are computed, then the pairwise similarities between structures are computed with the TM-align algorithm, and the training data queue is built with the dynamic training-data partitioning strategy. The training data are fed batch by batch into the momentum contrastive learning framework to train the whole model, with a batch size of 64. After each iteration, the second structure of each structure pair in the current batch is added to the negative sample queue, whose length is 1024. After about 1500 iterations, all structures in the training and validation sets are fed into the network to obtain descriptors; the length-scaled cosine distance between each structure in the validation set and every structure descriptor in the training set is computed and compared against the true similarity of the two structures to assess the quality of the descriptors. Training is terminated once the performance of the model stops improving, and the result of the last evaluation is reported.
On the ranking task the final results are shown in the table below; compared with the current best methods, every metric is substantially improved. Compared with the second-best method, DeepFold, this method improves AUPRC by about 6%, and the Top-1, Top-5 and Top-10 hit ratios by 12.2%, 14.2% and 14.7% respectively.
Here AUPRC is the area under the precision-recall curve, and the Top-K hit ratio is the number of true similar proteins among the top K retrieved structures divided by min(K, N_r), where N_r is the number of truly similar proteins of the target protein.
| Method | AUPRC | Top-1 hit ratio | Top-5 hit ratio | Top-10 hit ratio |
| --- | --- | --- | --- | --- |
| SGM | 0.4559 | 0.5591 | 0.5328 | 0.5579 |
| SSEF | 0.0377 | 0.0833 | 0.0579 | 0.0607 |
| DeepFold | 0.4990 | 0.6061 | 0.5677 | 0.5930 |
| This method | 0.6595 | 0.7282 | 0.7101 | 0.7400 |
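The Top-K hit ratio reported in the table above can be computed as follows; `retrieved` is a ranked result list and `relevant` is the set of truly similar structures of the query (both names are illustrative).

```python
def top_k_hit_ratio(retrieved, relevant, K):
    """Top-K hit ratio: the number of truly similar structures among the
    first K retrieved, divided by min(K, N_r), where N_r = |relevant|."""
    hits = sum(1 for s in retrieved[:K] if s in relevant)
    return hits / min(K, len(relevant))
```

The min(K, N_r) denominator keeps the metric in [0, 1] even when a query has fewer than K truly similar structures in the database.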
On the classification task, the descriptors obtained by this method are fed into a logistic regression classifier, and the classes of all proteins in SCOPe at the Class level (which comprises 7 classes) are cross-validated. As shown in the table below, this method clearly improves on every metric over the current best methods. Compared with the second-best method, DeepFold, the average F1-score and the accuracy are improved by 5.1% and 3.7% respectively. The average F1-score is the mean of the classifier's F1-scores over the 7 classes.
| Method | Average F1-score | Accuracy |
| --- | --- | --- |
| SGM | 0.6289 | 0.8354 |
| SSEF | 0.4920 | 0.7470 |
| DeepFold | 0.7615 | 0.8887 |
| This method | 0.8124 | 0.9258 |
The running speed of this method is further compared with the other algorithms: the similarities between 1914 protein structures in an independent dataset were computed with every method (3,663,396 structure comparisons in total), all running on a single logical core of an Intel Xeon CPU E5-2630 v4, and the running time of each method was recorded, with the results shown in the table below. Compared with the alignment-based method TM-align, all characterization-based methods are markedly faster. Among the characterization-based methods, this method is slightly slower than SGM and SSEF (both of which have lower precision on the ranking and classification tasks) but its average time is of the same order, and it is faster than DeepFold. When precomputation is introduced (the descriptors of all structures in the database are computed before querying, so that at query time only the distances between the query structure's descriptor and the stored descriptors need to be computed), the gap between this method and SGM and SSEF narrows further.
The foregoing embodiments may be modified in numerous ways by those skilled in the art without departing from the principles and spirit of the invention; the scope of the invention is defined by the claims and not by the foregoing embodiments, and all such implementations fall within the scope of the invention.
Claims (5)
1. A method for rapidly classifying protein structures based on a contrastive graph neural network, characterized in that the global coordinates of the alpha-carbon atoms of all residues in a protein to be predicted are extracted in three-dimensional space, an adjacency matrix and original node features are then computed from the global coordinates, and the result is input into a neural network model based on a momentum contrastive learning framework to obtain a descriptor of the protein structure;
the neural network model comprises two graph-neural-network encoders with identical architectures, whose training samples are obtained by computing the similarity between every two protein structures in a training dataset and then sampling structure pairs from the training dataset using a method of dynamically dividing positive and negative samples;
the neural network model is trained as follows: the distance between the two descriptors output by the neural network model is measured by the length-scaled cosine distance, and whether training has reached its target is determined from the data in the validation set together with the length-scaled cosine distance;
the method for dynamically dividing the positive and negative samples specifically comprises the following steps:
step i) from the training dataset D_train, select any structure X_a; compute the TM-score between X_a and every other structure in D_train and arrange those structures in descending order of TM-score; randomly sample from D_train a structure X_b other than X_a; if X_b ranks within the top K%, then (X_a, X_b) is a positive sample structure pair, otherwise a negative sample structure pair, wherein: K is a preset hyperparameter, K ∈ ℕ and K ∈ (0, 100];
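The dynamic labelling of step i) can be sketched as follows (a hedged sketch: the function name, the precomputed TM-score matrix, and the example value of K are illustrative assumptions, not part of the claim):

```python
import numpy as np

def label_pair(tm_scores, a, b, k_percent):
    """Dynamically label the sampled pair (X_a, X_b) as positive or negative.

    tm_scores: (N, N) matrix of precomputed TM-scores over the training set
    a, b:      indices of the two sampled structures (a != b)
    k_percent: the hyperparameter K, in (0, 100]
    Returns True for a positive sample structure pair.
    """
    # Rank all other structures by TM-score to X_a, in descending order.
    others = [j for j in range(tm_scores.shape[0]) if j != a]
    others.sort(key=lambda j: tm_scores[a, j], reverse=True)
    # (X_a, X_b) is positive iff X_b falls within the top K% of the ranking.
    cutoff = int(np.ceil(len(others) * k_percent / 100.0))
    return b in others[:cutoff]
```

Because the ranking is recomputed relative to each sampled anchor X_a, the positive/negative boundary adapts to how close X_a's structural neighbours are, rather than using one fixed TM-score threshold.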
step ii) construct the training data queue as a sequence of structure pairs {(X_a1, X_b1), (X_a2, X_b2), …}, wherein: every structure pair is a positive sample structure pair and satisfies: for any two structure pairs (X_ai, X_bi) and (X_aj, X_bj) in the queue with 0 < i − j ≤ L_N, TM(X_aj, X_bj) < TM(X_ai, X_bi), where L_N is the length of the negative sample queue in the momentum contrastive learning framework;
the graph-neural-network-based encoder comprises several fully connected layers, a BiLSTM module and several graph convolution layers; the training process comprises:
step (1) the adjacency matrices and original node features of the structure pairs in the training data queue are input batch by batch into the contrastive learning framework, and the loss function is calculated as L = −log( exp(y_q·y_k/τ) / (exp(y_q·y_k/τ) + Σ_{i=1}^{L_N} exp(y_q·y_i/τ)) ), wherein: y_q and y_k are the descriptors obtained by inputting the two structures of a positive sample structure pair into the graph-neural-network-based encoders ε_q and ε_k respectively, y_i is the descriptor obtained by inputting the i-th structure of the negative sample queue into ε_k, and τ is a preset temperature coefficient;
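The loss of step (1) is the InfoNCE objective of the momentum contrast framework. A minimal NumPy sketch (batch shapes, the function name, and the default τ are illustrative assumptions):

```python
import numpy as np

def infonce_loss(y_q, y_k, queue, tau=0.07):
    """InfoNCE loss: -log( exp(y_q.y_k/tau) / sum over positive + negatives ).

    y_q:   (B, D) descriptors from the query encoder eps_q
    y_k:   (B, D) descriptors of the positive partners from eps_k
    queue: (L_N, D) descriptors of the negative-sample queue (from eps_k)
    """
    l_pos = np.sum(y_q * y_k, axis=1, keepdims=True)  # (B, 1) positive logits
    l_neg = y_q @ queue.T                             # (B, L_N) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / tau
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair sits in column 0, so the loss is its negative log-prob.
    return float(-log_prob[:, 0].mean())
```

When the query descriptor is close to its positive partner and far from every queued negative, the loss approaches zero, which is the behaviour the training loop optimizes toward.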
step (2) update the parameters θ_q of ε_q with a stochastic gradient descent algorithm based on the loss function, then update the parameters θ_k of ε_k using θ_q, specifically: θ_k ← m·θ_k + (1 − m)·θ_q, wherein: m ∈ (0, 1] is a preset momentum coefficient;
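The momentum update of step (2) is an exponential moving average over the key encoder's parameters; a minimal sketch (plain floats stand in for parameter tensors, which is an illustrative simplification):

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """Momentum update of the key encoder: theta_k <- m*theta_k + (1-m)*theta_q.

    theta_k, theta_q: flat lists of floats standing in for the parameters
    m: momentum coefficient in (0, 1]
    """
    return [m * tk + (1.0 - m) * tq for tk, tq in zip(theta_k, theta_q)]
```

With m close to 1, ε_k drifts slowly toward ε_q, which keeps the descriptors stored in the negative queue approximately consistent across iterations.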
step (3) add the second structure X_b of every structure pair used in the current iteration to the negative sample queue; when the number of structures in the negative sample queue has reached the preset length L_N, remove the structures that were added to the queue earliest;
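The first-in-first-out eviction of step (3) maps directly onto a bounded deque; a sketch (the queue length and the string placeholders for descriptors are illustrative):

```python
from collections import deque

# Negative-sample queue with preset length L_N: appending beyond capacity
# automatically evicts the earliest-enqueued entries, matching step (3).
L_N = 4
queue = deque(maxlen=L_N)
for i in range(6):
    queue.append(f"y_{i}")  # placeholder for the descriptor of X_b
```

After six appends only the four most recent descriptors remain, i.e. the two oldest were discarded.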
step (4) after the designated number of training iterations is completed, input the structure data of the training set and validation set into ε_q to obtain descriptors, and calculate the length-scaled cosine distance between every descriptor of the verification set and every descriptor of the training set, wherein: l_a, l_b and l_max are respectively the length of the protein in the verification set, the length of the protein in the training set and the length of the longest protein in the training set, and y_a and y_b are respectively the descriptors of the protein structures in the verification set and the training set; the resulting distances are compared with the true structural similarities between the corresponding structures to evaluate the current model and to decide whether to terminate training or to reduce the learning rate.
2. The rapid protein structure classification method based on a contrast graph neural network according to claim 1, wherein the adjacency matrix is obtained by: extracting the Cartesian coordinates in three-dimensional space of the alpha carbon atom of each residue in the protein structure, calculating the Euclidean distance between every residue pair from those coordinates, and constructing the adjacency matrix from the distances.
3. The rapid protein structure classification method based on a contrast graph neural network according to claim 1, wherein the adjacency matrix is obtained by the following steps:
step 1) for a protein comprising L residues, let the Cartesian coordinates in three-dimensional space of the alpha carbon atom of its i-th residue be v_i = (x_i, y_i, z_i) and those of the j-th residue be v_j = (x_j, y_j, z_j); the Euclidean distance between these two residues is d_ij = ‖v_i − v_j‖₂, and the distance matrix of the protein is D ∈ ℝ^{L×L} with entries d_ij;
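The distance matrix of step 1) can be computed in one broadcast over the C-alpha coordinates; a minimal NumPy sketch (the function name is an assumption):

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between alpha-carbon coordinates.

    coords: (L, 3) array, row i holding (x_i, y_i, z_i) of residue i's C-alpha
    Returns the (L, L) matrix with d_ij = ||v_i - v_j||_2.
    """
    diff = coords[:, None, :] - coords[None, :, :]  # (L, L, 3) displacement vectors
    return np.sqrt((diff ** 2).sum(axis=-1))
```

The result is symmetric with a zero diagonal, as expected of a residue-pair distance matrix.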
step 2) based on the distance matrix obtained above, the adjacency matrix is obtained by a normalization involving two hyperparameters ω and ε, both greater than 0;
step 3) obtain the distance-based original node features of each residue, namely the relative coordinates of each residue, from the Cartesian coordinates in three-dimensional space of the alpha carbon atoms: the set of residue coordinates is V = {v_1, v_2, …, v_{L−1}, v_L}, and V_{i:j} = {v_i, v_{i+1}, …, v_{j−2}, v_{j−1}} denotes the set of coordinates from the i-th to the j-th residue of a given protein, i < j; the distance-based original node feature vector of the i-th residue is x_i ∈ [0, +∞)^P, wherein: P is the length of the vector and P = 2^M − 1, M is a hyperparameter controlling the length of x_i, m ∈ {0, 1, …, M−1} and k ∈ {1, 2, 3, …, 2^m}; with v_i the coordinates of the i-th residue, the k-th element of x_i is the Euclidean distance between v_i and the coordinates of the k-th reference point;
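The claim's exact reference-point formula is not legible in this extraction. One plausible reading, stated here as an assumption and not confirmed by the source, is that at level m the chain is split into 2^m contiguous segments and the k-th reference point is the centroid of segment k, which yields 2^0 + 2^1 + … + 2^{M−1} = 2^M − 1 distances per residue:

```python
import numpy as np

def distance_features(coords, M=3):
    """Distance-based node features x_i in [0, inf)^P with P = 2**M - 1.

    ASSUMPTION: at level m the chain is split into 2**m contiguous segments
    and the k-th reference point is the centroid of segment k; the claim's
    exact reference-point definition is not recoverable from this text.
    coords: (L, 3) array of C-alpha coordinates.
    """
    L = coords.shape[0]
    refs = []
    for m in range(M):
        for seg in np.array_split(np.arange(L), 2 ** m):
            refs.append(coords[seg].mean(axis=0))     # centroid reference point
    refs = np.stack(refs)                             # (2**M - 1, 3)
    diff = coords[:, None, :] - refs[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))          # (L, 2**M - 1), all >= 0
```

Whatever the precise reference-point definition, the feature length P = 2^M − 1 and the non-negativity x_i ∈ [0, +∞)^P follow directly from the construction.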
step 4) obtain the angle-based original node feature of each residue from the Cartesian coordinates in three-dimensional space of the alpha carbon atoms: from the coordinates v_{i−1}, v_i, v_{i+1} of three consecutive residues in the protein sequence, the angle-based original node feature of the i-th residue is obtained;
step 5) concatenate the distance-based and angle-based original node features to obtain the original node feature of the i-th residue, wherein: ⊕ denotes the concatenation operation; the original node feature matrix of a protein structure containing L residues is then obtained by stacking the per-residue feature vectors, with L the number of residues and T denoting the transpose operation.
4. The rapid protein structure classification method based on a contrast graph neural network according to claim 1, wherein the similarity is obtained by: when the protein structure training dataset contains N structures, denote it D_train = {X_1, X_2, …, X_i, …, X_N}, wherein: X_i represents the i-th protein structure; the TM-score between the i-th and j-th structures, calculated with the TM-align algorithm and denoted TM(X_i, X_j), is used as their structural similarity; the value of the TM-score lies in the range [0, 1].
5. A rapid protein structure classification system based on a contrast graph neural network, implementing the method of any one of claims 1-4, comprising a feature extraction module, two encoders, a verification module and a parameter updating module, wherein: the feature extraction module extracts a distance matrix and node features from the structure data of each of the two proteins in a structure pair and inputs the extracted features into the two encoders respectively; the two encoders encode the input features into fixed-length vector outputs; the verification module calculates the distance between the two output vectors, evaluates the difference between that distance and the true similarity of the two structures, and calculates the loss; and, according to the loss, the parameters of one encoder are updated with a back-propagation algorithm and the parameters of the other encoder are updated with a momentum method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111047262.XA CN113707213B (en) | 2021-09-08 | 2021-09-08 | Protein structure rapid classification method based on contrast graph neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113707213A CN113707213A (en) | 2021-11-26 |
CN113707213B true CN113707213B (en) | 2024-03-08 |
Family
ID=78660858
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111047262.XA Active CN113707213B (en) | 2021-09-08 | 2021-09-08 | Protein structure rapid classification method based on contrast graph neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113707213B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024095126A1 (en) * | 2022-11-02 | 2024-05-10 | Basf Se | Systems and methods for using natural language processing (nlp) to predict protein function similarity |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110689920A (en) * | 2019-09-18 | 2020-01-14 | 上海交通大学 | Protein-ligand binding site prediction algorithm based on deep learning |
CN111710375A (en) * | 2020-05-13 | 2020-09-25 | 中国科学院计算机网络信息中心 | Molecular property prediction method and system |
CN112767554A (en) * | 2021-04-12 | 2021-05-07 | 腾讯科技(深圳)有限公司 | Point cloud completion method, device, equipment and storage medium |
CN113192559A (en) * | 2021-05-08 | 2021-07-30 | 中山大学 | Protein-protein interaction site prediction method based on deep map convolution network |
Non-Patent Citations (1)
Title |
---|
Deep Wasserstein Graph Discriminant Learning for Graph Classification;Tong Zhang et al.;《The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)》;20210518;全文 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |