CN113707213B - Protein structure rapid classification method based on contrast graph neural network - Google Patents

Protein structure rapid classification method based on contrast graph neural network

Info

Publication number
CN113707213B
CN113707213B (application CN202111047262.XA)
Authority
CN
China
Prior art keywords
protein
residue
training
neural network
structures
Prior art date
Legal status
Active
Application number
CN202111047262.XA
Other languages
Chinese (zh)
Other versions
CN113707213A (en)
Inventor
夏春秋
沈红斌
潘小勇
冯世豪
夏莹
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202111047262.XA
Publication of CN113707213A
Application granted
Publication of CN113707213B

Classifications

    • G16B 5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G06N 3/04: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N 3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G16B 20/30: ICT specially adapted for functional genomics or proteomics; detection of binding sites or motifs
    • G16B 50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • Y02A 90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

A method for rapid classification of protein structures based on a contrast graph neural network comprises: extracting the global three-dimensional coordinates of the alpha-carbon atoms of all residues in the protein to be predicted, computing an adjacency matrix and original node features from these coordinates, and inputting them into a neural network model based on a momentum contrastive learning framework to obtain a descriptor of the protein structure. The invention combines deep learning with domain knowledge of protein structure to generate more discriminative descriptors, so that structures similar to a target protein are identified more accurately and the precision of protein structure classification is improved.

Description

Protein structure rapid classification method based on contrast graph neural network
Technical Field
The invention relates to a technology in the field of bioengineering, and in particular to a method for rapidly classifying protein structures based on a contrast graph neural network.
Background
The purpose of protein structure comparison is to measure the structural similarity between two different proteins. In structural bioinformatics, structure comparison tools are part of the infrastructure: they are indispensable for tasks such as protein structure prediction, protein molecular docking, and structure-based protein function prediction. Protein structure comparison methods fall into two categories. Alignment-based methods are often time-consuming and, with the rapid growth of protein structure data, cannot meet the needs of large-scale protein structure retrieval. Characterization-based methods typically convert the coordinates of all backbone atoms of a protein into a fixed-length vector and then measure the similarity between two structures by the distance or correlation coefficient between their vectors; a fixed-length vector with rotational invariance is called a descriptor of the protein structure.
Disclosure of Invention
Existing protein structure characterization methods depend on manually designed features; structure-comparison-based methods are inefficient and cannot meet the needs of large-scale protein structure retrieval; and other characterization-based methods have relatively low precision and have difficulty finding enough similar structures. To address these shortcomings, a rapid protein structure classification method based on a contrast graph neural network is provided. It combines deep learning with domain knowledge of protein structure to generate more discriminative descriptors, thereby identifying structures similar to the target protein more accurately and improving the precision of protein structure classification.
The invention is realized by the following technical scheme:
The invention relates to a method for rapid classification of protein structures based on a contrast graph neural network: the global three-dimensional coordinates of the alpha-carbon atoms of all residues in the protein to be predicted are extracted, an adjacency matrix and original node features are computed from these coordinates, and they are input into a neural network model based on a momentum contrastive learning framework to obtain a descriptor of the protein structure.
The neural network model comprises two graph-neural-network-based encoders with identical architectures; their training samples are obtained by calculating the similarity between every two protein structures in a training data set and then sampling structure pairs from the training data set with a method that dynamically divides positive and negative samples.
During training, the distance between two descriptors output by the neural network model is measured by a length-scaled cosine distance, and whether training has reached its target is determined from the data in a validation set together with this length-scaled cosine distance.
The adjacency matrix is obtained as follows: the Cartesian coordinates of the alpha-carbon atom of each residue of the protein structure are extracted in three-dimensional space, the Euclidean distance between each pair of residues is computed from the residue coordinates, and the adjacency matrix is constructed from these distances. Specifically:
Step 1) For a protein comprising L residues, let the Cartesian coordinate of the alpha-carbon atom of the i-th residue be $v_i=(x_i,y_i,z_i)$ and that of the j-th residue be $v_j=(x_j,y_j,z_j)$; the Euclidean distance between the two residues is $d_{ij}=\|v_i-v_j\|_2$, and the distance matrix of the protein is $D=(d_{ij})\in\mathbb{R}^{L\times L}$.
Step 2) From the distance matrix $D$, the adjacency matrix $A=(a_{ij})$ is obtained by applying an element-wise normalization to each $d_{ij}$, parameterized by two hyperparameters $\omega$ and $\varepsilon$, both greater than 0.
Step 3) From the Cartesian coordinates of the alpha-carbon atoms, the distance-based original node features of each residue, i.e. its coordinates relative to a set of reference points, are obtained. The set of residue coordinates is $V=\{v_1,v_2,\ldots,v_{L-1},v_L\}$, and $V_{i:j}=\{v_i,v_{i+1},\ldots,v_{j-2},v_{j-1}\}$ ($i<j$) denotes the set of coordinates from the i-th to the j-th residue of the sequence. The distance-based original node feature vector of the i-th residue is $x_i^{\mathrm{dist}}\in[0,+\infty)^K$, where $K=2^M-1$ is the length of the vector, $M$ is a hyperparameter controlling this length, $m\in\{0,1,\ldots,M-1\}$ and $g\in\{1,2,\ldots,2^m\}$ index the reference points, $c_k$ denotes the coordinates of the k-th reference point determined by the pair $(m,g)$ from $V$, $v_i$ denotes the coordinates of the i-th residue, and the k-th element of $x_i^{\mathrm{dist}}$ is the Euclidean distance $\|v_i-c_k\|_2$, $k=0,1,\ldots,K-1$.
Step 4) From the Cartesian coordinates of the alpha-carbon atoms, the angle-based original node feature of each residue is obtained: for the coordinates $v_{i-1}$, $v_i$, $v_{i+1}$ of three consecutive residues on the protein sequence, the angle-based original node feature $x_i^{\mathrm{ang}}$ of the i-th residue is derived from the angle formed at $v_i$ by the vectors $v_{i-1}-v_i$ and $v_{i+1}-v_i$.
Step 5) The distance-based and angle-based original node features are concatenated to give the original node feature of the i-th residue, $x_i=x_i^{\mathrm{dist}}\oplus x_i^{\mathrm{ang}}$, where $\oplus$ denotes the concatenation operation. The original node feature matrix of the protein structure is then $X=[x_1,x_2,\ldots,x_L]^{\mathrm{T}}$, where L is the number of residues and T denotes transposition.
The similarity is obtained as follows: when the protein structure training data set contains N structures, it is denoted $D_{train}=\{X_1,X_2,\ldots,X_i,\ldots,X_N\}$, where $X_i$ denotes the i-th protein structure. The similarity (TM-score) between the i-th and j-th structures is computed with the TM-align algorithm, taken as their structural similarity and denoted $\mathrm{TM}(X_i,X_j)$; the TM-score takes values in $[0,1]$.
The method for dynamically dividing the positive and negative samples specifically comprises the following steps:
Step i) Choose any structure $X_a$ from the training data set $D_{train}$, compute the TM-score between $X_a$ and every other structure in $D_{train}$, and sort these structures in descending order of TM-score; then randomly sample a structure $X_b$ other than $X_a$ from $D_{train}$. If $X_b$ ranks within the top K%, $(X_a,X_b)$ is a positive sample structure pair; otherwise it is a negative sample structure pair. Here K is a preset hyperparameter, $K\in\mathbb{N}$ and $K\in(0,100]$.
Step ii) The training data queue consists of structure pairs $(X_{a_1},X_{b_1}),(X_{a_2},X_{b_2}),\ldots$, where every structure pair is a positive sample structure pair and satisfies: for any two structure pairs $(X_{a_i},X_{b_i})$ and $(X_{a_j},X_{b_j})$ in the queue with $0<i-j\le L_N$, $\mathrm{TM}(X_{a_i},X_{b_j})<\mathrm{TM}(X_{a_i},X_{b_i})$, where $L_N$ is the length of the negative sample queue in the momentum contrastive learning framework.
The graph-neural-network-based encoder comprises a plurality of fully connected layers, a BiLSTM module and a plurality of graph convolution layers, and the training process comprises:
Step (1) The adjacency matrices and original node features of the structure pairs in the training data queue are input batch by batch into the contrastive learning framework, and the loss function is computed as $\mathcal{L}=-\log\frac{\exp(y_q\cdot y_k/\tau)}{\exp(y_q\cdot y_k/\tau)+\sum_{i=1}^{L_N}\exp(y_q\cdot y_i/\tau)}$, where $y_q$ and $y_k$ are the descriptors obtained by feeding the two structures of a positive sample pair into the two encoders $\varepsilon_q$ and $\varepsilon_k$ respectively, $y_i$ is the descriptor obtained by feeding the i-th structure in the negative sample queue into $\varepsilon_k$, and $\tau$ is a preset temperature coefficient.
Step (2) Based on the loss function, the parameters $\theta_q$ of $\varepsilon_q$ are updated with a stochastic gradient descent algorithm, and $\theta_q$ is then used to update the parameters $\theta_k$ of $\varepsilon_k$ as $\theta_k\leftarrow m\cdot\theta_k+(1-m)\cdot\theta_q$, where $m\in(0,1]$ is a preset momentum coefficient.
Step (3) The second structure $X_b$ of every structure pair used in the current iteration is added to the negative sample queue; once the number of structures in the negative sample queue reaches the preset length $L_N$, the structure that entered the queue earliest is removed.
Step (4) After the specified number of training iterations, the structure data in the training set and the validation set are input into $\varepsilon_q$ to obtain descriptors, and the length-scaled cosine distance between every descriptor in the validation set and every descriptor in the training set is computed; this distance rescales the cosine distance between the descriptors $y_a$ and $y_b$ according to $l_a$, $l_b$ and $l_{max}$, where $l_a$ and $l_b$ are the lengths of the corresponding proteins in the validation set and the training set, $l_{max}$ is the length of the longest protein in the training set, and $y_a$ and $y_b$ are the protein structure descriptors in the validation set and the training set respectively. After the pairwise distances between validation-set and training-set descriptors are obtained, the current model is evaluated against the true structural similarities, and it is decided whether to terminate training or reduce the learning rate.
The invention also relates to a system implementing the method, comprising a feature extraction module, two encoders, a verification module and a parameter updating module, wherein: the feature extraction module extracts a distance matrix and node features from the structure data of each of the two proteins in a structure pair and feeds them into the two encoders respectively; the two encoders encode the input features into fixed-length vector outputs; the verification module computes the distance between the two output vectors, evaluates the difference between this distance and the true similarity of the two structures, and computes the loss; and according to the loss, the parameters of one encoder are updated with a back-propagation algorithm and the parameters of the other encoder are updated with the momentum method.
Technical effects
The invention overcomes the shortcomings of the prior art in both the precision and the speed of protein structure comparison. Compared with the prior art, the dynamic training-data partitioning strategy enables the model to learn finer-grained similarity relationships, and the length-scaled cosine distance further corrects the similarity between descriptors, so that better results are achieved on the ranking and classification tasks respectively without losing computational efficiency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the momentum contrastive learning framework in an embodiment;
FIG. 3 is a schematic diagram of the graph-neural-network-based encoder architecture in an embodiment.
Detailed Description
As shown in FIG. 1, this embodiment relates to a method for rapid classification of protein structures based on a contrast graph neural network, which includes the following steps:
Step 1) First, the Cartesian coordinates of the alpha-carbon atom of each residue of the protein structure are extracted in three-dimensional space; the Euclidean distance between each pair of residues is then computed from the residue coordinates, an adjacency matrix is constructed from these distances, and the relative coordinates and angles of each residue are computed as original node features. Specifically:
Step 1.1) For a protein comprising L residues, let the Cartesian coordinate of the alpha-carbon atom of the i-th residue be $v_i=(x_i,y_i,z_i)$ and that of the j-th residue be $v_j=(x_j,y_j,z_j)$; the Euclidean distance between the two residues is $d_{ij}=\|v_i-v_j\|_2$, and the distance matrix of the protein is $D=(d_{ij})\in\mathbb{R}^{L\times L}$.
Step 1.2) From the distance matrix $D$, the adjacency matrix $A=(a_{ij})$ is obtained by applying an element-wise normalization to each $d_{ij}$, parameterized by the two hyperparameters $\omega$ and $\varepsilon$; in this example $\omega=4$ and $\varepsilon=2$.
Step 1.3) From the Cartesian coordinates of the alpha-carbon atoms, the distance-based original node features of each residue are obtained. The set of residue coordinates is $V=\{v_1,v_2,\ldots,v_{L-1},v_L\}$, and $V_{i:j}=\{v_i,v_{i+1},\ldots,v_{j-2},v_{j-1}\}$ ($i<j$) denotes the set of coordinates from the i-th to the j-th residue of the sequence. The distance-based original node feature vector of the i-th residue is $x_i^{\mathrm{dist}}\in[0,+\infty)^K$, where $K=2^M-1$ is the length of the vector, $M$ is the hyperparameter controlling this length ($M=5$ in this example), $m\in\{0,1,\ldots,M-1\}$ and $g\in\{1,2,\ldots,2^m\}$ index the reference points, $c_k$ denotes the coordinates of the k-th reference point determined by the pair $(m,g)$ from $V$, $v_i$ denotes the coordinates of the i-th residue, and the k-th element of $x_i^{\mathrm{dist}}$ is the Euclidean distance $\|v_i-c_k\|_2$, $k=0,1,\ldots,K-1$.
Step 1.4) From the Cartesian coordinates of the alpha-carbon atoms, the angle-based original node feature of each residue is obtained: for the coordinates $v_{i-1}$, $v_i$, $v_{i+1}$ of three consecutive residues on the protein sequence, the angle-based original node feature $x_i^{\mathrm{ang}}$ of the i-th residue is derived from the angle formed at $v_i$ by the vectors $v_{i-1}-v_i$ and $v_{i+1}-v_i$.
Step 1.5) The distance-based and angle-based original node features are concatenated to give the original node feature of the i-th residue, $x_i=x_i^{\mathrm{dist}}\oplus x_i^{\mathrm{ang}}$, where $\oplus$ denotes the concatenation operation. The original node feature matrix of the protein structure is $X=[x_1,x_2,\ldots,x_L]^{\mathrm{T}}$, where L is the number of residues and T denotes transposition.
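The feature extraction of steps 1.1-1.5 can be sketched in Python/NumPy as below. Since the exact adjacency normalization and the construction of the reference points are not fully specified above, the sketch uses an assumed decreasing normalization of $d_{ij}$ with the hyperparameters ω and ε and assumed segment-centroid reference points; all function names are illustrative and not taken from the patent. Coordinates are assumed to be an (L, 3) NumPy array of alpha-carbon positions.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between the C-alpha coordinates (L x 3)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)                      # (L, L)

def adjacency(dist, omega=4.0, eps=2.0):
    """Assumed normalization of distances into adjacency weights: a smooth,
    decreasing function of d_ij parameterized by omega and eps (placeholder
    for the normalization of step 1.2)."""
    return 1.0 / (1.0 + (dist / omega) ** eps)

def distance_features(coords, M=5):
    """Assumed multi-scale reference points: for m = 0..M-1 the chain is cut
    into 2**m contiguous segments whose centroids give the K = 2**M - 1
    reference points; residue i is described by its distance to each of them.
    Assumes the chain has at least 2**(M-1) residues."""
    L = len(coords)
    refs = []
    for m in range(M):
        bounds = np.linspace(0, L, 2 ** m + 1).astype(int)
        for g in range(2 ** m):
            refs.append(coords[bounds[g]:bounds[g + 1]].mean(axis=0))
    refs = np.stack(refs)                                      # (K, 3)
    return np.linalg.norm(coords[:, None, :] - refs[None, :, :], axis=-1)

def angle_features(coords):
    """Cosine of the angle at residue i formed by its two sequence neighbours
    (assumed concrete form of the angle-based feature); chain ends are zero-padded."""
    u = coords[:-2] - coords[1:-1]
    w = coords[2:] - coords[1:-1]
    cos = np.sum(u * w, axis=-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(w, axis=-1))
    return np.pad(cos, (1, 1))[:, None]                        # (L, 1)

def node_features(coords, M=5):
    """Step 1.5: concatenate distance-based and angle-based node features."""
    return np.concatenate([distance_features(coords, M), angle_features(coords)], axis=1)
```

For one structure, A = adjacency(distance_matrix(coords)) and X = node_features(coords) would then give the graph inputs consumed by the encoder.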
Step 2) For every protein structure in the training data set, its similarity to all other protein structures is computed. Specifically, when the protein structure training data set contains N structures, it is denoted $D_{train}=\{X_1,X_2,\ldots,X_i,\ldots,X_N\}$, where $X_i$ denotes the i-th protein structure; in this embodiment N = 13265. The TM-score between the i-th and j-th structures, computed with the TM-align algorithm, is taken as their structural similarity and denoted $\mathrm{TM}(X_i,X_j)$; its value lies in $[0,1]$.
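Pairwise TM-scores of this kind are typically obtained by running the external TM-align program on every structure pair. A minimal sketch is given below; it assumes a TMalign binary on the PATH and structures stored as PDB files, parses the first "TM-score=" line of the standard text output, and treats the score as symmetric for simplicity, so the paths and parsing details are assumptions rather than part of the patent.

```python
import re
import subprocess
from itertools import combinations

def tm_score(pdb_a, pdb_b):
    """Run TM-align on two PDB files and return the first reported TM-score."""
    out = subprocess.run(["TMalign", pdb_a, pdb_b],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"TM-score=\s*([0-9.]+)", out)
    if match is None:
        raise ValueError("could not parse TM-align output")
    return float(match.group(1))

def similarity_matrix(pdb_files):
    """N x N matrix of TM(X_i, X_j) over a list of PDB files (symmetrized)."""
    n = len(pdb_files)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = tm_score(pdb_files[i], pdb_files[j])
    return sim
```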
Step 3) A training data queue is constructed by sampling structure pairs from the training data set with the method of dynamically dividing positive and negative samples, specifically:
Step 3.1) Choose any structure $X_a$ from the training data set $D_{train}$ and compute, as described above, the TM-score between $X_a$ and every other structure in $D_{train}$; sort the structures in descending order of TM-score, and randomly sample a structure $X_b$ other than $X_a$ from $D_{train}$. If $X_b$ ranks within the top 30%, $(X_a,X_b)$ is called a positive sample structure pair; otherwise it is called a negative sample structure pair.
Step 3.2) The training data queue consists of structure pairs $(X_{a_1},X_{b_1}),(X_{a_2},X_{b_2}),\ldots$, where every structure pair is a positive sample structure pair and satisfies: for any two structure pairs $(X_{a_i},X_{b_i})$ and $(X_{a_j},X_{b_j})$ in the queue with $0<i-j\le L_N$, $\mathrm{TM}(X_{a_i},X_{b_j})<\mathrm{TM}(X_{a_i},X_{b_i})$, where $L_N$ is the length of the negative sample queue in the momentum contrastive learning framework.
Step 4) A momentum contrastive learning framework is built which, as shown in FIG. 2, comprises two graph-neural-network-based encoders. The training data are input into this framework to generate descriptors and train the model, and the decision of when to terminate training is based on the data in the validation set and the length-scaled cosine distance. The two encoders have exactly the same architecture and are denoted $\varepsilon_q$ and $\varepsilon_k$. Taking $\varepsilon_q$ as an example, as shown in FIG. 3, it contains multiple fully connected layers, a BiLSTM module and multiple graph convolution layers.
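FIG. 3 specifies the encoder only at the level of fully connected layers, a BiLSTM module and graph convolution layers, so the layer sizes, depth, graph-convolution form and pooling in the PyTorch sketch below are assumptions chosen to make it self-contained rather than the patented architecture.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Minimal graph convolution: propagate node features with the adjacency
    matrix and apply a learned linear map (assumed form)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)

    def forward(self, x, adj):                    # x: (L, d_in), adj: (L, L)
        return torch.relu(self.linear(adj @ x))

class StructureEncoder(nn.Module):
    """Assumed instantiation of the encoder of FIG. 3: fully connected layers,
    a BiLSTM over the residue sequence, graph convolutions, then mean pooling
    into a fixed-length, unit-norm descriptor."""
    def __init__(self, d_node, d_hidden=128, d_out=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(d_node, d_hidden), nn.ReLU(),
                                nn.Linear(d_hidden, d_hidden), nn.ReLU())
        self.bilstm = nn.LSTM(d_hidden, d_hidden // 2,
                              batch_first=True, bidirectional=True)
        self.gconv1 = GraphConv(d_hidden, d_hidden)
        self.gconv2 = GraphConv(d_hidden, d_out)

    def forward(self, x, adj):                    # x: (L, d_node), adj: (L, L)
        h = self.fc(x)
        h, _ = self.bilstm(h.unsqueeze(0))        # (1, L, d_hidden)
        h = h.squeeze(0)
        h = self.gconv1(h, adj)
        h = self.gconv2(h, adj)
        y = h.mean(dim=0)                         # pool residues -> descriptor
        return nn.functional.normalize(y, dim=0)
```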
step 4.1) queuing training dataThe adjacency matrix and the original node characteristics of the structure pairs in (a) are sequentially input into a contrast learning framework in the form of batch, and a loss function is calculated>Wherein: y is q And y k Is that two structures in positive sample structure pair are respectively input epsilon q And epsilon k The resulting descriptors, y i Is the input of the ith structure in the negative sample queue to ε k τ is a predetermined temperature coefficient.
Step 4.2) updating ε using a random gradient descent algorithm based on the Loss obtained q Parameter θ q Then utilize theta q Updating epsilon k Parameter θ k The method specifically comprises the following steps: θ k ←m·θ k +(1-m)·θ q Wherein: m epsilon (0, 1)]Is a preset momentum coefficient.
Step 4.3) adding the second structure in the structure pair used this time to the negative sample queue, when the number of structures in the negative sample queue has reached the preset length L N The structure that was first added to the queue is removed.
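One training iteration of steps 4.1-4.3 can be sketched as follows, combining the InfoNCE loss, the gradient update of ε_q, the momentum update of ε_k and the maintenance of the negative sample queue. The sketch processes a single positive pair rather than a batch, and the optimizer choice and momentum value are assumptions; encoder_q and encoder_k stand for the two encoders ε_q and ε_k.

```python
import torch

def train_step(encoder_q, encoder_k, optimizer, pair, queue,
               tau=0.07, momentum=0.999, queue_len=1024):
    """One momentum-contrast iteration for a single positive structure pair.

    pair  : ((x_a, adj_a), (x_b, adj_b)) node features and adjacency matrices
    queue : list of detached descriptors of previously used second structures
    """
    (x_a, adj_a), (x_b, adj_b) = pair
    y_q = encoder_q(x_a, adj_a)                    # query descriptor
    with torch.no_grad():
        y_k = encoder_k(x_b, adj_b)                # positive key descriptor

    # InfoNCE loss over the positive key and the negatives in the queue.
    pos = torch.exp(torch.dot(y_q, y_k) / tau)
    neg = sum(torch.exp(torch.dot(y_q, y_i) / tau) for y_i in queue)
    loss = -torch.log(pos / (pos + neg))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # SGD update of theta_q

    # Momentum update: theta_k <- m * theta_k + (1 - m) * theta_q.
    with torch.no_grad():
        for p_k, p_q in zip(encoder_k.parameters(), encoder_q.parameters()):
            p_k.mul_(momentum).add_(p_q, alpha=1.0 - momentum)

    # Enqueue the positive key's structure; dequeue the oldest once full.
    queue.append(y_k.detach())
    if len(queue) > queue_len:
        queue.pop(0)
    return loss.item()
```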
Step 4.4) After the specified number of training iterations, the structure data in the training set and the validation set are input into $\varepsilon_q$ to obtain descriptors, and the length-scaled cosine distance between every descriptor in the validation set and every descriptor in the training set is computed; this distance rescales the cosine distance between the descriptors $y_a$ and $y_b$ according to $l_a$, $l_b$ and $l_{max}$, where $l_a$ and $l_b$ are the lengths of the corresponding proteins in the validation set and the training set, $l_{max}$ is the length of the longest protein in the training set, and $y_a$ and $y_b$ are the protein structure descriptors in the validation set and the training set respectively. After the pairwise distances between validation-set and training-set descriptors are obtained, the current model is evaluated against the true structural similarities, and it is decided whether to terminate training or reduce the learning rate.
Step 5) For a query protein structure, the adjacency matrix and the original node features are first extracted as in step 1 and then input into the first encoder $\varepsilon_q$ of the momentum contrastive learning framework; the resulting output is the structure descriptor of the protein.
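After training, retrieval therefore reduces to encoding the query with ε_q and ranking database structures by descriptor distance. The sketch below reuses the feature helpers from the step-1 sketch and uses a plain cosine similarity; the patent's length-scaled cosine distance, which additionally rescales by the protein lengths l_a, l_b and l_max, is not reproduced here, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def encode_structure(encoder_q, coords):
    """Extract step-1 features for one structure and encode them into a descriptor."""
    x = torch.tensor(node_features(coords), dtype=torch.float32)
    adj = torch.tensor(adjacency(distance_matrix(coords)), dtype=torch.float32)
    with torch.no_grad():
        return encoder_q(x, adj)

def retrieve(encoder_q, query_coords, db_descriptors, top_k=10):
    """Rank pre-computed database descriptors against the query structure."""
    y_q = encode_structure(encoder_q, query_coords)
    sims = torch.stack([F.cosine_similarity(y_q, y_b, dim=0) for y_b in db_descriptors])
    return torch.topk(sims, k=min(top_k, len(db_descriptors)))
```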
This example uses the protein structure database SCOPe v2.07 as the training and validation sets and performs 5-fold cross-validation on the data. After redundancy removal at 40% sequence identity and data cleaning, the data set contains a total of 13265 protein domains, each belonging to one of 7 classes.
First, the adjacency matrices and original node features of all structures are computed, the pairwise similarities between structures are calculated with the TM-align algorithm, and a training data queue is constructed with the dynamic training-data partitioning strategy. The training data are fed into the momentum contrastive learning framework batch by batch to train the whole model, with a batch size of 64. After each iteration, the second structure of every structure pair in the current batch is added to the negative sample queue, whose length is 1024. After about 1500 iterations, all structures in the training set and the validation set are input into the network to obtain descriptors; the length-scaled cosine distance between each validation-set structure and every training-set structure descriptor is computed and compared with the true similarity of the two structures to evaluate the quality of the descriptors. Training is terminated when the model's performance no longer improves, and the result of the last evaluation is reported.
On the ranking task, the final results are shown in the table below; compared with the current best methods, every metric is greatly improved. Relative to the second-best method DeepFold, the method improves AUPRC by about 6% and the Top-1, Top-5 and Top-10 hit ratios by 12.2%, 14.2% and 14.7% respectively.
Here AUPRC is the area under the precision-recall curve, and the Top-K hit ratio is the number of true similar proteins among the top K retrieved structures divided by $\min(K,N_r)$, where $N_r$ is the number of true similar proteins of the target protein.
Method    AUPRC    Top-1 hit ratio    Top-5 hit ratio    Top-10 hit ratio
SGM 0.4559 0.5591 0.5328 0.5579
SSEF 0.0377 0.0833 0.0579 0.0607
DeepFold 0.4990 0.6061 0.5677 0.5930
The method 0.6595 0.7282 0.7101 0.7400
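The Top-K hit ratio as reconstructed above can be computed with the short helper below (variable names are illustrative):

```python
def top_k_hit_ratio(ranked_ids, true_similar_ids, k):
    """ranked_ids: database structures sorted by descriptor distance to the query;
    true_similar_ids: the N_r structures that are truly similar to the query."""
    n_r = len(true_similar_ids)
    if n_r == 0:
        return 0.0
    truth = set(true_similar_ids)
    hits = sum(1 for s in ranked_ids[:k] if s in truth)
    return hits / min(k, n_r)
```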
On the classification task, the descriptors produced by the method are fed into a logistic regression classifier, and the class of every protein in SCOPe at the Class level (which comprises 7 classes) is predicted with cross-validation. As shown in the table below, the method clearly improves every metric over the current best methods; compared with the second-best method DeepFold, the average F1-score and the accuracy are improved by 5.1% and 3.7% respectively. The average F1-score is the mean of the F1-scores obtained by the classifier on the 7 classes.
Method    Average F1-score    Accuracy
SGM 0.6289 0.8354
SSEF 0.4920 0.7470
DeepFold 0.7615 0.8887
The method 0.8124 0.9258
The running speed of the method is further compared with that of other algorithms: the similarities between 1914 protein structures in an independent data set are computed with every method (3,663,396 structure comparisons in total), all running on a single logical core of an Intel Xeon CPU E5-2630 v4, and the running time of each method is recorded. Compared with the alignment-based method TM-align, all characterization-based methods are markedly faster. Among the characterization-based methods, the method is slightly slower than SGM and SSEF (both of which have lower precision on the ranking and classification tasks), but its average time is of the same order and it is faster than DeepFold. When pre-computation is introduced (the descriptors of all structures in the database are computed before querying, so that at query time only the distances between the query structure's descriptor and the stored descriptors need to be computed), the gap between the method and SGM and SSEF narrows further.
The foregoing embodiments may be modified in various ways by those skilled in the art without departing from the principle and spirit of the invention. The scope of the invention is defined by the claims and is not limited by the foregoing embodiments, and all implementations within that scope fall within the protection of the invention.

Claims (5)

1. A method for rapid classification of protein structures based on a contrast graph neural network, characterized in that the global coordinates of the alpha-carbon atoms of all residues in a protein to be predicted are extracted in three-dimensional space, an adjacency matrix and original node features are then computed from the global coordinates, and they are input into a neural network model based on a momentum contrastive learning framework to obtain a descriptor of the protein structure;
the neural network model comprises two graph-neural-network-based encoders with the same architecture, wherein the training samples of the encoders are obtained by calculating the similarity between every two protein structures in a training data set and then sampling structure pairs from the training data set with a method of dynamically dividing positive and negative samples;
the neural network model is trained such that the distance between two descriptors it outputs is measured by a length-scaled cosine distance, and whether training has reached its target is determined from the data in a validation set and this length-scaled cosine distance;
the method for dynamically dividing the positive and negative samples specifically comprises the following steps:
Step i) choosing any structure $X_a$ from the training data set $D_{train}$, computing the TM-score between $X_a$ and every other structure in $D_{train}$, sorting these structures in descending order of TM-score, and randomly sampling a structure $X_b$ other than $X_a$ from $D_{train}$: if $X_b$ ranks within the top K%, $(X_a,X_b)$ is a positive sample structure pair; otherwise it is a negative sample structure pair, where K is a preset hyperparameter, $K\in\mathbb{N}$ and $K\in(0,100]$;
Step ii) forming the training data queue from structure pairs $(X_{a_1},X_{b_1}),(X_{a_2},X_{b_2}),\ldots$, where every structure pair is a positive sample structure pair and satisfies: for any two structure pairs $(X_{a_i},X_{b_i})$ and $(X_{a_j},X_{b_j})$ in the queue with $0<i-j\le L_N$, $\mathrm{TM}(X_{a_i},X_{b_j})<\mathrm{TM}(X_{a_i},X_{b_i})$, where $L_N$ is the length of the negative sample queue in the momentum contrastive learning framework;
the graph-neural-network-based encoder comprises a plurality of fully connected layers, a BiLSTM module and a plurality of graph convolution layers, and the training process comprises:
Step (1) inputting the adjacency matrices and original node features of the structure pairs in the training data queue batch by batch into the contrastive learning framework and computing the loss function $\mathcal{L}=-\log\frac{\exp(y_q\cdot y_k/\tau)}{\exp(y_q\cdot y_k/\tau)+\sum_{i=1}^{L_N}\exp(y_q\cdot y_i/\tau)}$, where $y_q$ and $y_k$ are the descriptors obtained by feeding the two structures of a positive sample pair into the graph-neural-network-based encoders $\varepsilon_q$ and $\varepsilon_k$ respectively, $y_i$ is the descriptor obtained by feeding the i-th structure in the negative sample queue into $\varepsilon_k$, and $\tau$ is a preset temperature coefficient;
Step (2) updating the parameters $\theta_q$ of $\varepsilon_q$ with a stochastic gradient descent algorithm based on the loss function, and then using $\theta_q$ to update the parameters $\theta_k$ of $\varepsilon_k$ as $\theta_k\leftarrow m\cdot\theta_k+(1-m)\cdot\theta_q$, where $m\in(0,1]$ is a preset momentum coefficient;
Step (3) adding the second structure $X_b$ of every structure pair used in the current iteration to the negative sample queue, and, once the number of structures in the negative sample queue reaches the preset length $L_N$, removing the structure that entered the queue earliest;
Step (4) after the specified number of training iterations, inputting the structure data in the training set and the validation set into $\varepsilon_q$ to obtain descriptors, and computing the length-scaled cosine distance between every descriptor in the validation set and every descriptor in the training set, this distance rescaling the cosine distance between the descriptors $y_a$ and $y_b$ according to $l_a$, $l_b$ and $l_{max}$, where $l_a$ and $l_b$ are the lengths of the corresponding proteins in the validation set and the training set, $l_{max}$ is the length of the longest protein in the training set, and $y_a$ and $y_b$ are the protein structure descriptors in the validation set and the training set respectively; after the pairwise distances between validation-set and training-set descriptors are obtained, evaluating the current model according to the true structural similarities and determining whether to terminate training of the model or reduce the learning rate.
2. The rapid protein structure classification method based on a contrast graph neural network according to claim 1, wherein the adjacency matrix is obtained by extracting the Cartesian coordinates of the alpha-carbon atom of each residue of the protein structure in three-dimensional space, calculating the Euclidean distance between each pair of residues from the residue coordinates, and constructing the adjacency matrix from these distances.
3. The rapid protein structure classification method based on a contrast graph neural network according to claim 1, wherein the adjacency matrix is obtained by the following steps:
Step 1) for a protein comprising L residues, letting the Cartesian coordinate of the alpha-carbon atom of the i-th residue be $v_i=(x_i,y_i,z_i)$ and that of the j-th residue be $v_j=(x_j,y_j,z_j)$, the Euclidean distance between the two residues being $d_{ij}=\|v_i-v_j\|_2$ and the distance matrix of the protein being $D=(d_{ij})\in\mathbb{R}^{L\times L}$;
Step 2) obtaining the adjacency matrix $A=(a_{ij})$ from the distance matrix $D$ by applying an element-wise normalization to each $d_{ij}$, parameterized by two hyperparameters $\omega$ and $\varepsilon$, both greater than 0;
Step 3) obtaining, from the Cartesian coordinates of the alpha-carbon atoms, the distance-based original node features of each residue, i.e. its coordinates relative to a set of reference points; the set of residue coordinates is $V=\{v_1,v_2,\ldots,v_{L-1},v_L\}$, and $V_{i:j}=\{v_i,v_{i+1},\ldots,v_{j-2},v_{j-1}\}$ ($i<j$) denotes the set of coordinates from the i-th to the j-th residue of the sequence; the distance-based original node feature vector of the i-th residue is $x_i^{\mathrm{dist}}\in[0,+\infty)^P$, where $P=2^M-1$ is the length of the vector, $M$ is a hyperparameter controlling this length, $m\in\{0,1,\ldots,M-1\}$ and $g\in\{1,2,\ldots,2^m\}$ index the reference points, $c_k$ denotes the coordinates of the k-th reference point determined by the pair $(m,g)$ from $V$, $v_i$ denotes the coordinates of the i-th residue, and the k-th element of $x_i^{\mathrm{dist}}$ is the Euclidean distance $\|v_i-c_k\|_2$;
Step 4) obtaining, from the Cartesian coordinates of the alpha-carbon atoms, the angle-based original node feature of each residue: for the coordinates $v_{i-1}$, $v_i$, $v_{i+1}$ of three consecutive residues on the protein sequence, the angle-based original node feature $x_i^{\mathrm{ang}}$ of the i-th residue is derived from the angle formed at $v_i$ by the vectors $v_{i-1}-v_i$ and $v_{i+1}-v_i$;
Step 5) concatenating the distance-based and angle-based original node features to give the original node feature of the i-th residue, $x_i=x_i^{\mathrm{dist}}\oplus x_i^{\mathrm{ang}}$, where $\oplus$ denotes the concatenation operation; the original node feature matrix of the protein structure is $X=[x_1,x_2,\ldots,x_L]^{\mathrm{T}}$, where L is the number of residues and T denotes transposition.
4. The rapid protein structure classification method based on a contrast graph neural network according to claim 1, wherein the similarity is obtained as follows: when the protein structure training data set contains N structures, it is denoted $D_{train}=\{X_1,X_2,\ldots,X_i,\ldots,X_N\}$, where $X_i$ denotes the i-th protein structure; the TM-score between the i-th and j-th structures, computed with the TM-align algorithm, is taken as their structural similarity and denoted $\mathrm{TM}(X_i,X_j)$, with values in $[0,1]$.
5. A rapid protein structure classification system based on a contrast graph neural network, implementing the method of any one of claims 1-4 and comprising a feature extraction module, two encoders, a verification module and a parameter updating module, wherein: the feature extraction module extracts a distance matrix and node features from the structure data of each of the two proteins in a structure pair and feeds them into the two encoders respectively; the two encoders encode the input features into fixed-length vector outputs; the verification module computes the distance between the two output vectors, evaluates the difference between this distance and the true similarity of the two structures, and computes the loss; and according to the loss, the parameters of one encoder are updated with a back-propagation algorithm and the parameters of the other encoder are updated with the momentum method.
CN202111047262.XA 2021-09-08 2021-09-08 Protein structure rapid classification method based on contrast graph neural network Active CN113707213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111047262.XA CN113707213B (en) 2021-09-08 2021-09-08 Protein structure rapid classification method based on contrast graph neural network

Publications (2)

Publication Number Publication Date
CN113707213A CN113707213A (en) 2021-11-26
CN113707213B (en) 2024-03-08

Family

ID=78660858

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024095126A1 (en) * 2022-11-02 2024-05-10 Basf Se Systems and methods for using natural language processing (nlp) to predict protein function similarity

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689920A (en) * 2019-09-18 2020-01-14 上海交通大学 Protein-ligand binding site prediction algorithm based on deep learning
CN111710375A (en) * 2020-05-13 2020-09-25 中国科学院计算机网络信息中心 Molecular property prediction method and system
CN112767554A (en) * 2021-04-12 2021-05-07 腾讯科技(深圳)有限公司 Point cloud completion method, device, equipment and storage medium
CN113192559A (en) * 2021-05-08 2021-07-30 中山大学 Protein-protein interaction site prediction method based on deep map convolution network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tong Zhang et al.; "Deep Wasserstein Graph Discriminant Learning for Graph Classification"; The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21); 2021-05-18; entire document *

Also Published As

Publication number Publication date
CN113707213A (en) 2021-11-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant