CN113707213B - Protein structure rapid classification method based on contrast graph neural network - Google Patents
- Publication number: CN113707213B
- Application number: CN202111047262.XA
- Authority: CN (China)
- Legal status: Active
Classifications
- G16B5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
- G06N3/04 - Neural networks; architecture, e.g. interconnection topology
- G06N3/084 - Learning methods; backpropagation, e.g. using gradient descent
- G16B20/30 - Detection of binding sites or motifs
- G16B50/00 - ICT programming tools or database systems specially adapted for bioinformatics
- Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
A method for rapidly classifying protein structures based on a contrastive graph neural network: the global coordinates of the alpha-carbon atoms of all residues in the protein to be predicted are extracted in three-dimensional space, an adjacency matrix and original node features are computed from these coordinates, and the result is fed into a neural network model based on a momentum contrastive learning framework to obtain a descriptor of the protein structure. The invention combines deep learning with domain knowledge of protein structure to generate more discriminative descriptors, thereby identifying structures similar to a target protein more accurately and improving the precision of protein structure classification.
Description
Technical Field
The invention relates to a technology in the field of bioengineering, and in particular to a method for rapidly classifying protein structures based on a contrastive graph neural network.
Background
The purpose of protein structure comparison is to measure the structural similarity between two different proteins. For structural bioinformatics, structure comparison tools are infrastructure: they are indispensable to tasks such as protein structure prediction, protein molecular docking, and structure-based protein function prediction. Protein structure comparison methods fall into two families. Alignment-based methods are often time-consuming and, given the rapid growth of protein structure data, cannot meet the needs of large-scale protein structure retrieval. Characterization-based methods typically convert the coordinates of all backbone atoms of a protein into a fixed-length vector, and then measure the similarity between two structures by the distance or correlation coefficient between the vectors; a fixed-length vector with rotational invariance is called a descriptor of the protein structure.
Disclosure of Invention
Existing protein structure characterization methods rely on hand-designed features; methods based on structural alignment are inefficient and cannot meet the demands of large-scale protein structure retrieval; and other characterization-based methods have relatively low precision and struggle to find enough similar structures. To address these shortcomings, a protein structure rapid classification method based on a contrastive graph neural network is provided, which combines deep learning with domain knowledge of protein structure to generate more discriminative descriptors, thereby identifying structures similar to the target protein more accurately and improving the precision of protein structure classification.
The invention is realized by the following technical scheme:
The invention relates to a method for rapidly classifying protein structures based on a contrastive graph neural network: the global coordinates of the alpha-carbon atoms of all residues in the protein to be predicted are extracted in three-dimensional space, an adjacency matrix and original node features are computed from these coordinates, and the result is fed into a neural network model based on a momentum contrastive learning framework to obtain a descriptor of the protein structure.
The neural network model comprises two graph-neural-network encoders with identical architectures. Training samples are constructed by computing the similarity between every two protein structures in a training dataset and then sampling structure pairs from the training dataset using a method of dynamically dividing positive and negative samples.
During training, the distance between the two descriptors output by the neural network model is measured by the length-scaled cosine distance, and whether training has reached its target is determined from the data in the validation set together with the length-scaled cosine distance.
The adjacency matrix is obtained as follows: extract the Cartesian coordinates of the alpha-carbon atom of each residue of the protein structure in three-dimensional space, compute the Euclidean distance between each pair of residues from these coordinates, and construct the adjacency matrix from the distances. Specifically:
Step 1) For a protein comprising $L$ residues, the alpha-carbon atom of the $i$-th residue has Cartesian coordinates $v_i=(x_i,y_i,z_i)$ and that of the $j$-th residue has coordinates $v_j=(x_j,y_j,z_j)$; the Euclidean distance between these two residues is $d_{ij}=\|v_i-v_j\|$, and the distance matrix of the protein is $D\in\mathbb{R}^{L\times L}$ with entries $D_{ij}=d_{ij}$.
Step 2) From the distance matrix $D$ obtained above, the adjacency matrix $A$ is obtained by a normalization governed by two hyperparameters $\omega$ and $\varepsilon$, both greater than 0.
Step 3) Obtain the distance-based original node features of each residue from the Cartesian coordinates of the alpha-carbon atoms in three-dimensional space. The set of residue coordinates is $V=\{v_1,v_2,\ldots,v_{L-1},v_L\}$, and $V_{i:j}=\{v_i,v_{i+1},\ldots,v_{j-2},v_{j-1}\}$ ($i<j$) denotes the set of coordinates from the $i$-th up to (but not including) the $j$-th residue of the sequence. The distance-based original node feature vector of the $i$-th residue is $x_i\in[0,+\infty)^K$, where $K=\sum_{m=0}^{M-1}2^m=2^M-1$ is the length of the vector, $M$ is a hyperparameter controlling the length of $x_i$, and $m\in\{0,1,\ldots,M-1\}$ together with $g\in\{1,2,\ldots,2^m\}$ index a set of reference points; the $k$-th element of $x_i$ is the Euclidean distance between $v_i$ and the coordinates of the $k$-th reference point.
Step 4) Obtain the angle-based original node feature of each residue from the Cartesian coordinates of the alpha-carbon atoms in three-dimensional space: for the coordinates of three consecutive residues in the protein sequence, $v_{i-1}$, $v_i$, $v_{i+1}$, the angle-based original node feature of the $i$-th residue is the angle they form at $v_i$.
Step 5) Concatenate the distance-based and angle-based original node features to obtain the original node feature of the $i$-th residue, $x_i = x_i^{\mathrm{dist}} \oplus x_i^{\mathrm{angle}}$, where $\oplus$ denotes the concatenation operation; the original node feature matrix of a protein structure with $L$ residues is then $X=[x_1,x_2,\ldots,x_L]^{\mathrm{T}}$, where $\mathrm{T}$ is the transpose operation.
The similarity is obtained as follows: when the protein structure training dataset contains $N$ structures, denote it $D_{\mathrm{train}}=\{X_1,X_2,\ldots,X_i,\ldots,X_N\}$, where $X_i$ is the $i$-th protein structure. The TM-score between the $i$-th and $j$-th structures, computed with the TM-align algorithm, is taken as their structural similarity and denoted $\mathrm{TM}(X_i,X_j)$; the TM-score takes values in $[0,1]$.
The method for dynamically dividing the positive and negative samples specifically comprises the following steps:
Step i) Choose any structure $X_a$ from the training dataset $D_{\mathrm{train}}$ and compute the TM-score between it and every other structure in $D_{\mathrm{train}}$, then sort the structures in descending order of TM-score. Randomly sample a structure $X_b\neq X_a$ from $D_{\mathrm{train}}$: if $X_b$ ranks within the first $K\%$, then $(X_a,X_b)$ is a positive sample structure pair; otherwise it is a negative sample structure pair, where $K$ is a preset hyperparameter, $K\in\mathbb{N}$ and $K\in(0,100]$;
Step ii) Build the training data queue $\{(X_{a_1},X_{b_1}),(X_{a_2},X_{b_2}),\ldots\}$, in which every structure pair is a positive sample pair and, for any structure pair $(X_{a_i},X_{b_i})$ in the queue, $\mathrm{TM}(X_{a_i},X_{b_j})<\mathrm{TM}(X_{a_i},X_{b_i})$ holds whenever $0<i-j\le L_N$, where $L_N$ is the length of the negative sample queue in the momentum contrastive learning framework.
The graph-neural-network-based encoder comprises a plurality of fully connected layers, a BiLSTM module and a plurality of graph convolutional layers, and the training process comprises:
Step (1) The adjacency matrices and original node features of the structure pairs in the training data queue are input into the contrastive learning framework batch by batch, and the loss function $\mathcal{L}=-\log\frac{\exp(y_q\cdot y_k/\tau)}{\exp(y_q\cdot y_k/\tau)+\sum_{i=1}^{L_N}\exp(y_q\cdot y_i/\tau)}$ is computed, where $y_q$ and $y_k$ are the descriptors obtained by feeding the two structures of a positive sample pair into $\varepsilon_q$ and $\varepsilon_k$ respectively, $y_i$ is the descriptor obtained by feeding the $i$-th structure of the negative sample queue into $\varepsilon_k$, and $\tau$ is a preset temperature coefficient.
Step (2) Update the parameters $\theta_q$ of $\varepsilon_q$ with stochastic gradient descent on the loss function, then update the parameters $\theta_k$ of $\varepsilon_k$ from $\theta_q$ by $\theta_k\leftarrow m\cdot\theta_k+(1-m)\cdot\theta_q$, where $m\in(0,1]$ is a preset momentum coefficient.
Step (3) Add the second structure $X_b$ of every structure pair used in this iteration to the negative sample queue; once the number of structures in the negative sample queue reaches the preset length $L_N$, the structure added to the queue earliest is removed.
Step (4) After the designated number of training iterations is completed, the structure data of the training set and the validation set are fed into $\varepsilon_q$ to obtain descriptors, and the length-scaled cosine distance is computed between every descriptor of the validation set and every descriptor of the training set, where $l_a$, $l_b$ and $l_{\max}$ are the length of the protein in the validation set, the length of the protein in the training set and the length of the longest protein in the training set, and $y_a$ and $y_b$ are the descriptors of the protein structures in the validation set and the training set respectively. After the pairwise distances between the descriptors of the validation set and the training set are obtained, the current model can be evaluated against the true structural similarities, and a decision made on whether to terminate training or reduce the learning rate.
The invention also relates to a system implementing the method, comprising a feature extraction module, two encoders, a verification module and a parameter updating module, wherein: the feature extraction module extracts a distance matrix and node features from the structure data of each of the two proteins in a structure pair and feeds them into the two encoders respectively; the two encoders encode the input features into fixed-length output vectors; the verification module computes the distance between the two output vectors, evaluates its deviation from the true similarity between the two structures, and computes the loss; the parameters of one encoder are updated by backpropagation on this loss, and the parameters of the other encoder are updated by the momentum method.
Technical effects
The invention addresses the shortcomings of the prior art in both the precision and the speed of protein structure comparison. Compared with the prior art, the dynamic training-data partitioning strategy lets the model learn finer-grained similarity relationships, and the length-scaled cosine distance further corrects the similarity between descriptors. Better results are achieved on both the ranking and classification tasks without sacrificing computational efficiency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of an embodiment momentum contrast learning framework;
fig. 3 is a schematic diagram of the graph-neural-network-based encoder architecture of an embodiment.
Detailed Description
As shown in fig. 1, this embodiment relates to a method for rapid classification of protein structures based on a contrastive graph neural network, comprising the following steps:
Step 1) First extract the Cartesian coordinates of the alpha-carbon atom of each residue of the protein structure in three-dimensional space, then compute the Euclidean distance between each pair of residues from these coordinates, construct the adjacency matrix from the distances, and compute the relative coordinates and angles of each residue as original node features. Specifically:
Step 1.1) For a protein comprising $L$ residues, the alpha-carbon atom of the $i$-th residue has Cartesian coordinates $v_i=(x_i,y_i,z_i)$ and that of the $j$-th residue has coordinates $v_j=(x_j,y_j,z_j)$; the Euclidean distance between these two residues is $d_{ij}=\|v_i-v_j\|$, and the distance matrix of the protein is $D\in\mathbb{R}^{L\times L}$ with entries $D_{ij}=d_{ij}$.
Step 1.2) From the distance matrix $D$ obtained above, the adjacency matrix $A$ is obtained by a normalization governed by two hyperparameters $\omega$ and $\varepsilon$; in this example $\omega=4$ and $\varepsilon=2$.
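A minimal NumPy sketch of steps 1.1 and 1.2. The patent states only that the adjacency matrix is a normalization of the distance matrix with hyperparameters $\omega$ and $\varepsilon$ (here $\omega=4$, $\varepsilon=2$); the radial-basis kernel below is an illustrative stand-in for that unspecified normalization, not the patent's exact formula.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between C-alpha coordinates: (L, 3) -> (L, L)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def adjacency_matrix(coords, omega=4.0, eps=2.0):
    """Normalized adjacency from the distance matrix.

    Assumption: the patent's exact normalization is not reproduced in the text,
    so this stand-in maps distances into (0, 1] via the radial-basis kernel
    A_ij = exp(-(d_ij / omega) ** eps), using the two positive hyperparameters.
    """
    d = distance_matrix(coords)
    return np.exp(-((d / omega) ** eps))
```

Because the kernel is monotone decreasing in distance, nearby residues get edge weights close to 1 and distant residues close to 0, which is the qualitative behaviour the graph convolutions rely on.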
Step 1.3) Obtain the distance-based original node features of each residue from the Cartesian coordinates of the alpha-carbon atoms in three-dimensional space. The set of residue coordinates is $V=\{v_1,v_2,\ldots,v_{L-1},v_L\}$, and $V_{i:j}=\{v_i,v_{i+1},\ldots,v_{j-2},v_{j-1}\}$ ($i<j$) denotes the set of coordinates from the $i$-th up to (but not including) the $j$-th residue of the sequence.
The distance-based original node feature vector of the $i$-th residue is $x_i\in[0,+\infty)^K$, where the $k$-th element of $x_i$ ($k=0,1,\ldots,K-1$) is the Euclidean distance between $v_i$ and the coordinates of the $k$-th reference point, $K=\sum_{m=0}^{M-1}2^m=2^M-1$ is the length of the vector, and $M$ is a hyperparameter controlling the length of $x_i$, with $m\in\{0,1,\ldots,M-1\}$ and $g\in\{1,2,\ldots,2^m\}$ indexing the reference points; in this example $M=5$.
Step 1.4) Obtain the angle-based original node feature of each residue from the Cartesian coordinates of the alpha-carbon atoms in three-dimensional space: for the coordinates of three consecutive residues in the protein sequence, $v_{i-1}$, $v_i$, $v_{i+1}$, the angle-based original node feature of the $i$-th residue is the angle they form at $v_i$.
Step 1.5) Concatenate the distance-based and angle-based original node features to obtain the original node feature of the $i$-th residue, $x_i = x_i^{\mathrm{dist}} \oplus x_i^{\mathrm{angle}}$, where $\oplus$ denotes the concatenation operation. The original node feature matrix of a protein structure with $L$ residues is then $X=[x_1,x_2,\ldots,x_L]^{\mathrm{T}}$, where $\mathrm{T}$ is the transpose operation.
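The node-feature construction of steps 1.3 to 1.5 can be sketched as follows. The patent does not spell out how the $2^M-1$ reference points are placed, so the version below assumes they are the centroids of the $2^m$ contiguous segments of the chain for each $m$; it also assumes the angle feature is the planar angle at residue $i$, zero-padded at the two chain ends. Both are labelled assumptions.

```python
import numpy as np

def reference_points(coords, M=5):
    """Assumed reconstruction of the 2^M - 1 reference points: for each
    m in {0..M-1}, split the chain into 2^m contiguous segments and take
    the centroid of each (the patent's exact definition is not given)."""
    L = len(coords)
    points = []
    for m in range(M):
        bounds = np.linspace(0, L, 2 ** m + 1).astype(int)
        for g in range(2 ** m):
            points.append(coords[bounds[g]:bounds[g + 1]].mean(axis=0))
    return np.stack(points)                                   # (2^M - 1, 3)

def node_features(coords, M=5):
    """Distance features to the reference points plus the angle at each residue."""
    refs = reference_points(coords, M)                        # (K, 3)
    dist_feat = np.linalg.norm(coords[:, None, :] - refs[None, :, :], axis=-1)
    # Angle at residue i formed by its two sequence neighbours (assumption:
    # undefined end residues are padded with zero).
    u = coords[:-2] - coords[1:-1]
    w = coords[2:] - coords[1:-1]
    cos = (u * w).sum(-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(w, axis=-1))
    angle = np.zeros(len(coords))
    angle[1:-1] = np.arccos(np.clip(cos, -1.0, 1.0))
    return np.concatenate([dist_feat, angle[:, None]], axis=1)  # (L, K + 1)
```

With $M=5$ as in this example, each residue gets a $31+1=32$-dimensional feature vector; since all entries are distances and an angle, the representation is invariant to global rotation and translation of the structure.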
Step 2) For every protein structure in the training dataset, compute its similarity to all other protein structures. Specifically: when the protein structure training dataset contains $N$ structures, denote it $D_{\mathrm{train}}=\{X_1,X_2,\ldots,X_i,\ldots,X_N\}$, where $X_i$ is the $i$-th protein structure; in this embodiment $N$ is 13265. The TM-score between the $i$-th and $j$-th structures, computed with the TM-align algorithm, is taken as their structural similarity and denoted $\mathrm{TM}(X_i,X_j)$; the TM-score takes values in $[0,1]$.
Step 3) constructing a training data queue from a training data set sampling structure pair by using a method of dynamically dividing positive and negative samples, which specifically comprises the following steps:
Step 3.1) Choose any structure $X_a$ from the training dataset $D_{\mathrm{train}}$ and compute, as described above, the TM-score between it and every other structure in $D_{\mathrm{train}}$, then sort the structures in descending order of TM-score. Randomly sample a structure $X_b\neq X_a$ from $D_{\mathrm{train}}$: if $X_b$ ranks within the first 30%, the pair $(X_a,X_b)$ is called a positive sample structure pair; otherwise it is called a negative sample structure pair;
Step 3.2) Build the training data queue $\{(X_{a_1},X_{b_1}),(X_{a_2},X_{b_2}),\ldots\}$, in which every structure pair is a positive sample pair and, for any structure pair $(X_{a_i},X_{b_i})$ in the queue, $\mathrm{TM}(X_{a_i},X_{b_j})<\mathrm{TM}(X_{a_i},X_{b_i})$ holds whenever $0<i-j\le L_N$, where $L_N$ is the length of the negative sample queue in the momentum contrastive learning framework.
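A compact sketch of the dynamic positive/negative division of steps 3.1 and 3.2, assuming the pairwise TM-scores have been precomputed into a matrix. Taking at least one positive structure when the top-K% cutoff rounds down to zero is an added safeguard, not something the patent specifies.

```python
import random

def is_positive(a_idx, b_idx, tm, K=30.0):
    """Label the pair (X_a, X_b): positive iff X_b ranks within the top K%
    of all other structures when sorted by TM-score against X_a
    (K = 30 in this embodiment). `tm` is an N x N TM-score matrix."""
    n = len(tm)
    others = [j for j in range(n) if j != a_idx]
    ranked = sorted(others, key=lambda j: tm[a_idx][j], reverse=True)
    n_pos = max(1, int(len(ranked) * K / 100.0))   # safeguard: at least one positive
    return b_idx in set(ranked[:n_pos])

def sample_pair(a_idx, tm, K=30.0, rng=random):
    """Randomly draw X_b != X_a and attach its dynamic positive/negative label."""
    b_idx = rng.choice([j for j in range(len(tm)) if j != a_idx])
    return b_idx, is_positive(a_idx, b_idx, tm, K)
```

Because the label depends on where $X_b$ ranks among all candidates for this particular $X_a$, the positive/negative boundary adapts to each anchor structure rather than using one fixed TM-score threshold.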
Step 4) Build a momentum contrastive learning framework containing two graph-neural-network encoders, as shown in fig. 2. The training data are input into this framework to generate descriptors and train the model, and the decision of when to terminate training is made from the data in the validation set together with the length-scaled cosine distance. The two encoders have exactly the same architecture and are denoted $\varepsilon_q$ and $\varepsilon_k$. Taking $\varepsilon_q$ as an example, as shown in fig. 3, it contains multiple fully connected layers, a BiLSTM module and multiple graph convolutional layers. The model is trained as follows:
step 4.1) queuing training dataThe adjacency matrix and the original node characteristics of the structure pairs in (a) are sequentially input into a contrast learning framework in the form of batch, and a loss function is calculated>Wherein: y is q And y k Is that two structures in positive sample structure pair are respectively input epsilon q And epsilon k The resulting descriptors, y i Is the input of the ith structure in the negative sample queue to ε k τ is a predetermined temperature coefficient.
Step 4.2) Update the parameters $\theta_q$ of $\varepsilon_q$ with stochastic gradient descent on the loss, then update the parameters $\theta_k$ of $\varepsilon_k$ from $\theta_q$ by $\theta_k\leftarrow m\cdot\theta_k+(1-m)\cdot\theta_q$, where $m\in(0,1]$ is a preset momentum coefficient.
Step 4.3) Add the second structure of every structure pair used in this iteration to the negative sample queue; once the number of structures in the negative sample queue reaches the preset length $L_N$, the structure added to the queue earliest is removed.
Step 4.4) After the specified number of training iterations is completed, the structure data of the training set and the validation set are fed into $\varepsilon_q$ to obtain descriptors, and the length-scaled cosine distance is computed between every descriptor of the validation set and every descriptor of the training set, where $l_a$, $l_b$ and $l_{\max}$ are the length of the protein in the validation set, the length of the protein in the training set and the length of the longest protein in the training set, and $y_a$ and $y_b$ are the descriptors of the protein structures in the validation set and the training set respectively. After the pairwise distances between the descriptors of the validation set and the training set are obtained, the current model can be evaluated against the true structural similarities, and a decision made on whether to terminate training or reduce the learning rate.
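Steps 4.1 to 4.3 follow the standard momentum-contrast recipe. The NumPy sketch below isolates its three moving parts: the InfoNCE loss over one positive pair and the negative queue, the momentum update $\theta_k\leftarrow m\cdot\theta_k+(1-m)\cdot\theta_q$, and the FIFO queue of length $L_N$. The temperature value $\tau=0.07$ is an assumed default; the patent only calls it a preset coefficient.

```python
import numpy as np
from collections import deque

def info_nce_loss(y_q, y_k, neg_queue, tau=0.07):
    """InfoNCE objective for one positive pair (y_q, y_k) against the
    descriptors currently held in the negative queue."""
    logits = [np.dot(y_q, y_k) / tau] + [np.dot(y_q, y_i) / tau for y_i in neg_queue]
    logits = np.array(logits)
    logits -= logits.max()                      # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

def momentum_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied per parameter array."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

def update_negative_queue(queue, new_key, L_N=1024):
    """FIFO negative-sample queue: enqueue the newest key descriptor and
    drop the oldest once the preset length L_N is exceeded."""
    queue.append(new_key)
    while len(queue) > L_N:
        queue.popleft()
```

In a real implementation $\varepsilon_q$ would be updated by backpropagation through the loss while $\varepsilon_k$ receives only the momentum update, so the keys in the queue stay consistent across iterations.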
Step 5) For a query protein structure, first extract its adjacency matrix and original node features as in step 1, then input them into the first encoder $\varepsilon_q$ of the momentum contrastive learning framework; the output is the structure descriptor of the protein.
This example uses the protein structure database SCOPe v2.07 as training and validation sets and performs 5-fold cross-validation on the data. After removing entries at 40% sequence redundancy and cleaning the data, the dataset contains a total of 13265 protein domains, each belonging to one of 7 classes.
First the adjacency matrices and original node features of all structures are computed, then the pairwise similarities between structures are computed with the TM-align algorithm, and the training data queue is built with the dynamic training-data partitioning strategy. The training data are fed batch by batch into the momentum contrastive learning framework to train the whole model, with a batch size of 64. After each iteration, the second structure of each structure pair in the current batch is added to the negative sample queue, whose length is 1024. After about 1500 iterations, all structures in the training and validation sets are fed into the network to obtain descriptors; the length-scaled cosine distance between each structure in the validation set and every structure descriptor in the training set is computed and compared against the true similarity of the two structures to assess the quality of the descriptors. Training is terminated once the performance of the model stops improving, and the result of the last evaluation is reported.
On the ranking task the final results are shown in the table below; compared with the current best methods, every metric is substantially improved. Compared with the second-best method, DeepFold, this method improves AUPRC by about 6%, and the Top-1, Top-5 and Top-10 hit ratios by 12.2%, 14.2% and 14.7% respectively.
Here AUPRC is the area under the precision-recall curve, and the Top-K hit ratio is the number of true similar proteins among the top K retrieved structures divided by min(K, N_r), where N_r is the number of truly similar proteins of the target protein.
| Method | AUPRC | Top-1 hit ratio | Top-5 hit ratio | Top-10 hit ratio |
| --- | --- | --- | --- | --- |
| SGM | 0.4559 | 0.5591 | 0.5328 | 0.5579 |
| SSEF | 0.0377 | 0.0833 | 0.0579 | 0.0607 |
| DeepFold | 0.4990 | 0.6061 | 0.5677 | 0.5930 |
| This method | 0.6595 | 0.7282 | 0.7101 | 0.7400 |
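The Top-K hit ratio reported in the table above can be computed as follows; `retrieved` is a ranked result list and `relevant` is the set of truly similar structures of the query (both names are illustrative).

```python
def top_k_hit_ratio(retrieved, relevant, K):
    """Top-K hit ratio: the number of truly similar structures among the
    first K retrieved, divided by min(K, N_r), where N_r = |relevant|."""
    hits = sum(1 for s in retrieved[:K] if s in relevant)
    return hits / min(K, len(relevant))
```

The min(K, N_r) denominator keeps the metric in [0, 1] even when a query has fewer than K truly similar structures in the database.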
On the classification task, the descriptors obtained by this method are fed into a logistic regression classifier, and the classes of all proteins in SCOPe at the Class level (which comprises 7 classes) are cross-validated. As shown in the table below, this method clearly improves on every metric over the current best methods. Compared with the second-best method, DeepFold, the average F1-score and the accuracy are improved by 5.1% and 3.7% respectively. The average F1-score is the mean of the classifier's F1-scores over the 7 classes.
| Method | Average F1-score | Accuracy |
| --- | --- | --- |
| SGM | 0.6289 | 0.8354 |
| SSEF | 0.4920 | 0.7470 |
| DeepFold | 0.7615 | 0.8887 |
| This method | 0.8124 | 0.9258 |
The running speed of this method is further compared with the other algorithms: the similarities between 1914 protein structures in an independent dataset were computed with every method (3,663,396 structure comparisons in total), all running on a single logical core of an Intel Xeon CPU E5-2630 v4, and the running time of each method was recorded, with the results shown in the table below. Compared with the alignment-based method TM-align, all characterization-based methods are markedly faster. Among the characterization-based methods, this method is slightly slower than SGM and SSEF (both of which have lower precision on the ranking and classification tasks) but its average time is of the same order, and it is faster than DeepFold. When precomputation is introduced (the descriptors of all structures in the database are computed before querying, so that at query time only the distances between the query structure's descriptor and the stored descriptors need to be computed), the gap between this method and SGM and SSEF narrows further.
The foregoing embodiments may be modified in numerous ways by those skilled in the art without departing from the principles and spirit of the invention; the scope of the invention is defined by the claims and not by the foregoing embodiments, and all such implementations fall within the scope of the invention.
Claims (5)
1. A method for rapidly classifying protein structures based on a contrastive graph neural network, characterized in that the global coordinates of the alpha-carbon atoms of all residues in a protein to be predicted are extracted in three-dimensional space, an adjacency matrix and original node features are then computed from the global coordinates, and the result is input into a neural network model based on a momentum contrastive learning framework to obtain a descriptor of the protein structure;
the neural network model comprises two graph-neural-network encoders with identical architectures, whose training samples are obtained by computing the similarity between every two protein structures in a training dataset and then sampling structure pairs from the training dataset using a method of dynamically dividing positive and negative samples;
the neural network model is trained as follows: the distance between the two descriptors output by the neural network model is measured by the length-scaled cosine distance, and whether training has reached its target is determined from the data in the validation set together with the length-scaled cosine distance;
the method for dynamically dividing the positive and negative samples specifically comprises the following steps:
step i) from the training dataset D_train, select any structure X_a; compute the TM-score between X_a and every other structure in D_train and arrange those structures in descending order of TM-score; randomly sample from D_train a structure X_b other than X_a; if X_b ranks within the top K%, then (X_a, X_b) is a positive sample structure pair, otherwise a negative sample structure pair, wherein: K is a preset hyperparameter, K ∈ ℕ and K ∈ (0, 100];
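The dynamic labelling of step i) can be sketched as follows (a hedged sketch: the function name, the precomputed TM-score matrix, and the example value of K are illustrative assumptions, not part of the claim):

```python
import numpy as np

def label_pair(tm_scores, a, b, k_percent):
    """Dynamically label the sampled pair (X_a, X_b) as positive or negative.

    tm_scores: (N, N) matrix of precomputed TM-scores over the training set
    a, b:      indices of the two sampled structures (a != b)
    k_percent: the hyperparameter K, in (0, 100]
    Returns True for a positive sample structure pair.
    """
    # Rank all other structures by TM-score to X_a, in descending order.
    others = [j for j in range(tm_scores.shape[0]) if j != a]
    others.sort(key=lambda j: tm_scores[a, j], reverse=True)
    # (X_a, X_b) is positive iff X_b falls within the top K% of the ranking.
    cutoff = int(np.ceil(len(others) * k_percent / 100.0))
    return b in others[:cutoff]
```

Because the ranking is recomputed relative to each sampled anchor X_a, the positive/negative boundary adapts to how close X_a's structural neighbours are, rather than using one fixed TM-score threshold.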
step ii) construct the training data queue as a sequence of structure pairs {(X_a1, X_b1), (X_a2, X_b2), …}, wherein: every structure pair is a positive sample structure pair and satisfies: for any two structure pairs (X_ai, X_bi) and (X_aj, X_bj) in the queue with 0 < i − j ≤ L_N, TM(X_aj, X_bj) < TM(X_ai, X_bi), where L_N is the length of the negative sample queue in the momentum contrastive learning framework;
the graph-neural-network-based encoder comprises several fully connected layers, a BiLSTM module and several graph convolution layers; the training process comprises:
step (1) the adjacency matrices and original node features of the structure pairs in the training data queue are input batch by batch into the contrastive learning framework, and the loss function is calculated as L = −log( exp(y_q·y_k/τ) / (exp(y_q·y_k/τ) + Σ_{i=1}^{L_N} exp(y_q·y_i/τ)) ), wherein: y_q and y_k are the descriptors obtained by inputting the two structures of a positive sample structure pair into the graph-neural-network-based encoders ε_q and ε_k respectively, y_i is the descriptor obtained by inputting the i-th structure of the negative sample queue into ε_k, and τ is a preset temperature coefficient;
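The loss of step (1) is the InfoNCE objective of the momentum contrast framework. A minimal NumPy sketch (batch shapes, the function name, and the default τ are illustrative assumptions):

```python
import numpy as np

def infonce_loss(y_q, y_k, queue, tau=0.07):
    """InfoNCE loss: -log( exp(y_q.y_k/tau) / sum over positive + negatives ).

    y_q:   (B, D) descriptors from the query encoder eps_q
    y_k:   (B, D) descriptors of the positive partners from eps_k
    queue: (L_N, D) descriptors of the negative-sample queue (from eps_k)
    """
    l_pos = np.sum(y_q * y_k, axis=1, keepdims=True)  # (B, 1) positive logits
    l_neg = y_q @ queue.T                             # (B, L_N) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / tau
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair sits in column 0, so the loss is its negative log-prob.
    return float(-log_prob[:, 0].mean())
```

When the query descriptor is close to its positive partner and far from every queued negative, the loss approaches zero, which is the behaviour the training loop optimizes toward.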
step (2) update the parameters θ_q of ε_q with a stochastic gradient descent algorithm based on the loss function, then update the parameters θ_k of ε_k using θ_q, specifically: θ_k ← m·θ_k + (1 − m)·θ_q, wherein: m ∈ (0, 1] is a preset momentum coefficient;
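The momentum update of step (2) is an exponential moving average over the key encoder's parameters; a minimal sketch (plain floats stand in for parameter tensors, which is an illustrative simplification):

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """Momentum update of the key encoder: theta_k <- m*theta_k + (1-m)*theta_q.

    theta_k, theta_q: flat lists of floats standing in for the parameters
    m: momentum coefficient in (0, 1]
    """
    return [m * tk + (1.0 - m) * tq for tk, tq in zip(theta_k, theta_q)]
```

With m close to 1, ε_k drifts slowly toward ε_q, which keeps the descriptors stored in the negative queue approximately consistent across iterations.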
step (3) add the second structure X_b of every structure pair used in the current iteration to the negative sample queue; when the number of structures in the negative sample queue has reached the preset length L_N, remove the structures that were added to the queue earliest;
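The first-in-first-out eviction of step (3) maps directly onto a bounded deque; a sketch (the queue length and the string placeholders for descriptors are illustrative):

```python
from collections import deque

# Negative-sample queue with preset length L_N: appending beyond capacity
# automatically evicts the earliest-enqueued entries, matching step (3).
L_N = 4
queue = deque(maxlen=L_N)
for i in range(6):
    queue.append(f"y_{i}")  # placeholder for the descriptor of X_b
```

After six appends only the four most recent descriptors remain, i.e. the two oldest were discarded.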
step (4) after the designated number of training iterations is completed, input the structure data of the training set and validation set into ε_q to obtain descriptors, and calculate the length-scaled cosine distance between every descriptor of the verification set and every descriptor of the training set, wherein: l_a, l_b and l_max are respectively the length of the protein in the verification set, the length of the protein in the training set and the length of the longest protein in the training set, and y_a and y_b are respectively the descriptors of the protein structures in the verification set and the training set; the resulting distances are compared with the true structural similarities between the corresponding structures to evaluate the current model and to decide whether to terminate training or to reduce the learning rate.
2. The rapid protein structure classification method based on a contrast graph neural network according to claim 1, wherein the adjacency matrix is obtained by: extracting the Cartesian coordinates in three-dimensional space of the alpha carbon atom of each residue in the protein structure, calculating the Euclidean distance between every residue pair from those coordinates, and constructing the adjacency matrix from the distances.
3. The rapid protein structure classification method based on a contrast graph neural network according to claim 1, wherein the adjacency matrix is obtained by the following steps:
step 1) for a protein comprising L residues, let the Cartesian coordinates in three-dimensional space of the alpha carbon atom of its i-th residue be v_i = (x_i, y_i, z_i) and those of the j-th residue be v_j = (x_j, y_j, z_j); the Euclidean distance between these two residues is d_ij = ‖v_i − v_j‖₂, and the distance matrix of the protein is D ∈ ℝ^{L×L} with entries d_ij;
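The distance matrix of step 1) can be computed in one broadcast over the C-alpha coordinates; a minimal NumPy sketch (the function name is an assumption):

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between alpha-carbon coordinates.

    coords: (L, 3) array, row i holding (x_i, y_i, z_i) of residue i's C-alpha
    Returns the (L, L) matrix with d_ij = ||v_i - v_j||_2.
    """
    diff = coords[:, None, :] - coords[None, :, :]  # (L, L, 3) displacement vectors
    return np.sqrt((diff ** 2).sum(axis=-1))
```

The result is symmetric with a zero diagonal, as expected of a residue-pair distance matrix.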
step 2) based on the distance matrix obtained above, the adjacency matrix is obtained by a normalization involving two hyperparameters ω and ε, both greater than 0;
step 3) obtain the distance-based original node features of each residue, namely the relative coordinates of each residue, from the Cartesian coordinates in three-dimensional space of the alpha carbon atoms: the set of residue coordinates is V = {v_1, v_2, …, v_{L−1}, v_L}, and V_{i:j} = {v_i, v_{i+1}, …, v_{j−2}, v_{j−1}} denotes the set of coordinates from the i-th to the j-th residue of a given protein, i < j; the distance-based original node feature vector of the i-th residue is x_i ∈ [0, +∞)^P, wherein: P is the length of the vector and P = 2^M − 1, M is a hyperparameter controlling the length of x_i, m ∈ {0, 1, …, M−1} and k ∈ {1, 2, 3, …, 2^m}; with v_i the coordinates of the i-th residue, the k-th element of x_i is the Euclidean distance between v_i and the coordinates of the k-th reference point;
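The claim's exact reference-point formula is not legible in this extraction. One plausible reading, stated here as an assumption and not confirmed by the source, is that at level m the chain is split into 2^m contiguous segments and the k-th reference point is the centroid of segment k, which yields 2^0 + 2^1 + … + 2^{M−1} = 2^M − 1 distances per residue:

```python
import numpy as np

def distance_features(coords, M=3):
    """Distance-based node features x_i in [0, inf)^P with P = 2**M - 1.

    ASSUMPTION: at level m the chain is split into 2**m contiguous segments
    and the k-th reference point is the centroid of segment k; the claim's
    exact reference-point definition is not recoverable from this text.
    coords: (L, 3) array of C-alpha coordinates.
    """
    L = coords.shape[0]
    refs = []
    for m in range(M):
        for seg in np.array_split(np.arange(L), 2 ** m):
            refs.append(coords[seg].mean(axis=0))     # centroid reference point
    refs = np.stack(refs)                             # (2**M - 1, 3)
    diff = coords[:, None, :] - refs[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))          # (L, 2**M - 1), all >= 0
```

Whatever the precise reference-point definition, the feature length P = 2^M − 1 and the non-negativity x_i ∈ [0, +∞)^P follow directly from the construction.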
step 4) obtain the angle-based original node feature of each residue from the Cartesian coordinates in three-dimensional space of the alpha carbon atoms: from the coordinates v_{i−1}, v_i, v_{i+1} of three consecutive residues in the protein sequence, the angle-based original node feature of the i-th residue is obtained;
step 5) concatenate the distance-based and angle-based original node features to obtain the original node feature of the i-th residue, wherein: ⊕ denotes the concatenation operation; the original node feature matrix of a protein structure containing L residues is then obtained by stacking the per-residue feature vectors, with L the number of residues and T denoting the transpose operation.
4. The rapid protein structure classification method based on a contrast graph neural network according to claim 1, wherein the similarity is obtained by: when the protein structure training dataset contains N structures, denote it D_train = {X_1, X_2, …, X_i, …, X_N}, wherein: X_i represents the i-th protein structure; the TM-score between the i-th and j-th structures, calculated with the TM-align algorithm and denoted TM(X_i, X_j), is used as their structural similarity; the value of the TM-score lies in the range [0, 1].
5. A rapid protein structure classification system based on a contrast graph neural network, implementing the method of any one of claims 1-4, comprising a feature extraction module, two encoders, a verification module and a parameter updating module, wherein: the feature extraction module extracts a distance matrix and node features from the structure data of each of the two proteins in a structure pair and inputs the extracted features into the two encoders respectively; the two encoders encode the input features into fixed-length vector outputs; the verification module calculates the distance between the two output vectors, evaluates the difference between that distance and the true similarity of the two structures, and calculates the loss; and, according to the loss, the parameters of one encoder are updated with a back-propagation algorithm and the parameters of the other encoder are updated with a momentum method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111047262.XA CN113707213B (en) | 2021-09-08 | 2021-09-08 | Protein structure rapid classification method based on contrast graph neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113707213A CN113707213A (en) | 2021-11-26 |
CN113707213B true CN113707213B (en) | 2024-03-08 |
Family
ID=78660858
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111047262.XA Active CN113707213B (en) | 2021-09-08 | 2021-09-08 | Protein structure rapid classification method based on contrast graph neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113707213B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024095126A1 (en) * | 2022-11-02 | 2024-05-10 | Basf Se | Systems and methods for using natural language processing (nlp) to predict protein function similarity |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110689920A (en) * | 2019-09-18 | 2020-01-14 | 上海交通大学 | Protein-ligand binding site prediction algorithm based on deep learning |
CN111710375A (en) * | 2020-05-13 | 2020-09-25 | 中国科学院计算机网络信息中心 | Molecular property prediction method and system |
CN112767554A (en) * | 2021-04-12 | 2021-05-07 | 腾讯科技(深圳)有限公司 | Point cloud completion method, device, equipment and storage medium |
CN113192559A (en) * | 2021-05-08 | 2021-07-30 | 中山大学 | Protein-protein interaction site prediction method based on deep map convolution network |
Non-Patent Citations (1)
Title |
---|
Deep Wasserstein Graph Discriminant Learning for Graph Classification;Tong Zhang et al.;《The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)》;20210518;全文 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |