CN114550824A - Protein folding identification method and system based on embedding characteristics and unbalanced classification loss - Google Patents


Info

Publication number
CN114550824A
CN114550824A (application CN202210110503.9A; granted as CN114550824B)
Authority
CN
China
Prior art keywords
protein
folding
training
model
vector
Prior art date
Legal status
Granted
Application number
CN202210110503.9A
Other languages
Chinese (zh)
Other versions
CN114550824B (en)
Inventor
Zhang Lei (张蕾)
Yang Wei (杨伟)
Wen Yunguang (文云光)
Current Assignee
Henan University
Original Assignee
Henan University
Priority date
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202210110503.9A priority Critical patent/CN114550824B/en
Publication of CN114550824A publication Critical patent/CN114550824A/en
Application granted granted Critical
Publication of CN114550824B publication Critical patent/CN114550824B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10: Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

The invention discloses a protein fold recognition method and system based on embedding features and an unbalanced classification loss. The method first generates an embedding matrix for a given protein chain using the pre-trained protein language model ProtT5-XL-UniRef50, then converts the embedding matrix into a fixed-length feature vector by computing mean values and cosine similarities. By using embedding features, the invention avoids time-consuming multiple sequence alignment operations. In addition, since protein fold data exhibit a markedly imbalanced class distribution, the invention trains the multilayer perceptron network with the label-distribution-aware margin loss designed for imbalanced classification tasks, thereby enhancing the learning of sparse fold classes. The protein fold recognition network model provided by the invention can thus quickly and accurately predict the fold class of a given protein chain.

Description

Protein folding identification method and system based on embedding characteristics and unbalanced classification loss
Technical Field
The invention belongs to the technical field of computational biology, and particularly relates to a protein folding identification method and system based on embedding characteristics and unbalanced classification loss.
Background
Proteins are biological macromolecules composed of 20 standard amino acid types and play a very important role in many biological processes. The amino acid sequence of a protein determines its tertiary structure, and its function depends significantly on that tertiary structure. Owing to the rapid development of protein sequencing technology, the number of proteins in sequence databases currently far exceeds the number of proteins of known structure. Computational methods for predicting protein structure have therefore become an essential means of narrowing the gap between the number of known sequences and the number of known structures. In tertiary structure prediction, an important subtask is finding proteins with similar structures: for a protein of unknown structure, when a structurally similar protein exists in the PDB database, its structure can be accurately modeled by taking that protein as a template. In particular, protein fold recognition helps to find such structurally similar proteins.
The recently released SCOPe 2.08 database divides protein structures into 12 broad classes: all-alpha proteins, all-beta proteins, alpha/beta proteins, alpha+beta proteins, multi-domain proteins, membrane and cell surface proteins and peptides, small proteins, coiled-coil proteins, low-resolution protein structures, peptides, designed proteins, and artifacts. The stable-release statistics published for SCOPe 2.08 cover the first 7 structural classes, which contain 290, 180, 148, 396, 74, 69, and 100 fold classes respectively, for a total of 1257 fold classes, 2067 superfamilies, and 5084 families. In particular, as new protein structures are released, the number of known superfamilies varies somewhat [Chandonia, J.M., N.K. Fox, and S.E. Brenner, SCOPe: classification of large macromolecular structures in the structural classification of proteins-extended database. Nucleic Acids Res, 2019. 47(D1): p. D475-D481]. If two different superfamilies are found to be related to each other owing to the appearance of a new structure, they are merged into the same superfamily. Furthermore, a single domain may be split into multiple domains owing to the discovery of different evolutionary relationships.
Currently, many machine learning algorithms, such as support vector machines [Yan, K., et al., Protein Fold Recognition by Combining Support Vector Machines and Pairwise Sequence Similarity Scores. IEEE/ACM Trans Comput Biol Bioinform, 2021. 18(5): p. 2008-2016], neural networks [Villegas-Morcillo, A., V. Sanchez, and A.M. Gomez, FoldHSphere: deep hyperspherical embeddings for protein fold recognition. BMC Bioinformatics, 2021. 22(1): p. 490], ensemble classifiers, and random forests, have been successfully applied to fold recognition. However, these existing methods not only ignore the imbalance of the fold dataset when training the model, but also mostly use PSSM profile features or HMM profile features as the main fold recognition features. Notably, generating these profile features requires time-consuming multiple sequence alignment against large-scale sequence databases.
Disclosure of Invention
Aiming at the problems that existing protein fold recognition methods ignore the imbalance of fold datasets when training the model, and that they mostly rely on PSSM (position-specific scoring matrix) profile features or HMM profile features, whose generation requires time-consuming multiple sequence alignment against large-scale sequence databases, the invention provides a protein fold recognition method and system that performs fold recognition with embedding features derived from a pre-trained protein language model and introduces an unbalanced classification loss during training to enhance the learning of sparse fold classes. The fold recognition method provided by the invention can quickly and accurately predict the fold class of a protein.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a protein folding identification method based on embedding characteristics and unbalanced classification loss on one hand, which comprises the following steps:
Step 1: determining a protein fold training dataset and a test dataset, each comprising a plurality of protein chains;
Step 2: generating an embedding matrix for each protein chain in the protein fold training dataset using the pre-trained protein language model ProtT5-XL-UniRef50;
Step 3: converting the embedding matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
Step 4: constructing a protein fold recognition network model, wherein the model is a multilayer perceptron consisting of three fully connected layers, the last of which is a normalized fully connected layer;
Step 5: adopting the label-distribution-aware margin loss designed for unbalanced classification as the loss function for training the fold recognition network model;
Step 6: training the fold recognition network model based on the protein fold training dataset and the loss function;
Step 7: predicting the fold class of protein chains based on the protein fold test dataset and the trained fold recognition network model.
Further, the step 2 comprises:
For any protein chain of length L in the protein fold training dataset, first convert all characters in its amino acid sequence to uppercase, take the converted sequence as the input of the model ProtT5-XL-UniRef50, then run the model in half precision and save the output of its encoder, obtaining an embedding feature matrix of size L × 1024.
Further, the step 3 comprises:
For a given embedding matrix E of size L × 1024, first calculate the mean of each column of the embedding matrix to obtain a feature representation of length 1024:

f_col_mean = [m_1, m_2, ..., m_1024]^T, where m_j = (1/L) * Σ_{i=1}^{L} E_{ij} and L is the number of rows of the embedding matrix E;

then calculate the mean of each row of the embedding matrix to obtain a vector of length L:

f_row_mean = [r_1, r_2, ..., r_L]^T, where r_i = (1/1024) * Σ_{j=1}^{1024} E_{ij};

then calculate the cosine similarity between f_row_mean and each column vector of the embedding matrix to obtain a feature representation of length 1024:

f_cos_sim = [s_1, s_2, ..., s_j, ..., s_1024]^T,

where the cosine similarity s_j is computed as

s_j = <f_row_mean, E_:j> / (||f_row_mean|| * ||E_:j||),

with E_:j the j-th column of E, <·,·> the inner product of two vectors, and ||·|| the length (Euclidean norm) of a vector;

finally, concatenate the two vectors f_col_mean and f_cos_sim into a single vector representing the features of the protein chain; through the above operations each protein chain can be represented as a feature vector of 2048 dimensions.
Further, the step 6 further includes:
and initializing network parameters of the multilayer perceptron by adopting a default weight initialization method of a PyTorch deep learning framework.
Further, the step 7 includes:
step 2 and step 3 are performed first to represent the protein chains in the protein folding test dataset as feature vectors, which are then input into a trained folding recognition network model, and the folding class with the highest score is assigned to the protein chain.
In another aspect, the present invention provides a protein fold identification system based on embedding features and unbalanced classification loss, comprising:
a data set determination module for determining a protein folding training data set and a test data set, both of which comprise a plurality of protein chains;
the embedded matrix generation module is used for generating an embedded matrix of a protein chain in the protein folding training data set by adopting a pre-trained protein language model ProtT5-XL-UniRef 50;
a feature vector obtaining module for converting the embedding matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
the model building module is used for building a protein folding identification network model, the protein folding identification network model is a multilayer perceptron consisting of three full connection layers, and the last full connection layer of the multilayer perceptron adopts a normalized full connection layer;
a loss function obtaining module, configured to use the label-distribution-aware margin loss designed for unbalanced classification as the loss function for training the fold recognition network model;
the model training module is used for training a folding recognition network model based on a protein folding training data set and a loss function for training a folding recognition network;
and the folding type identification module is used for predicting the folding type of the protein chain based on the protein folding test data set and the trained folding identification network model.
Further, the embedded matrix generation module is specifically configured to:
For any protein chain of length L in the protein fold training dataset, first convert all characters in its amino acid sequence to uppercase, take the converted sequence as the input of the model ProtT5-XL-UniRef50, then run the model in half precision and save the output of its encoder, obtaining an embedding feature matrix of size L × 1024.
Further, the feature vector derivation module is specifically configured to:
For a given embedding matrix E of size L × 1024, first calculate the mean of each column of the embedding matrix to obtain a feature representation of length 1024:

f_col_mean = [m_1, m_2, ..., m_1024]^T, where m_j = (1/L) * Σ_{i=1}^{L} E_{ij} and L is the number of rows of the embedding matrix E;

then calculate the mean of each row of the embedding matrix to obtain a vector of length L:

f_row_mean = [r_1, r_2, ..., r_L]^T, where r_i = (1/1024) * Σ_{j=1}^{1024} E_{ij};

then calculate the cosine similarity between f_row_mean and each column vector of the embedding matrix to obtain a feature representation of length 1024:

f_cos_sim = [s_1, s_2, ..., s_j, ..., s_1024]^T,

where the cosine similarity s_j is computed as

s_j = <f_row_mean, E_:j> / (||f_row_mean|| * ||E_:j||),

with E_:j the j-th column of E, <·,·> the inner product of two vectors, and ||·|| the length (Euclidean norm) of a vector;

finally, concatenate the two vectors f_col_mean and f_cos_sim into a single vector representing the features of the protein chain; through the above operations each protein chain can be represented as a feature vector of 2048 dimensions.
Further, the model training module is further configured to:
and initializing network parameters of the multilayer perceptron by adopting a default weight initialization method of a PyTorch deep learning framework.
Further, the folding category identifying module is specifically configured to:
the method comprises the steps of firstly executing an embedding matrix generation module and a feature vector obtaining module to represent protein chains in a protein folding test data set as feature vectors, then inputting the feature vectors into a trained folding recognition network model, and allocating the folding category with the highest score to the protein chains.
Compared with the prior art, the invention has the following beneficial effects:
the method comprises the steps of firstly generating an embedded matrix of a given protein chain by adopting a pre-trained protein language model ProtT5-XL-UniRef50, and then converting the embedded matrix into a feature vector with a fixed length by calculating mean value and cosine similarity. In particular, by using the embedding feature, the present invention avoids time consuming multiple sequence alignment operations. In addition, considering that protein folding data has obvious unbalanced class distribution, the invention adopts a multilayer perceptron network designed for label distribution conscious interval loss training of unbalanced classification tasks, and thus the learning capacity of sparse folding classes is enhanced. In conclusion, the protein folding recognition network model provided by the invention can quickly and accurately predict the folding category of a given protein chain.
Drawings
FIG. 1 is a basic flow chart of a protein fold identification method based on embedding features and unbalanced classification loss according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a protein folding recognition network structure according to an embodiment of the present invention;
FIG. 3 is a flow chart of protein folding class prediction according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a protein fold recognition system architecture based on embedding features and unbalanced classification loss according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
As shown in FIG. 1, a protein fold identification method based on embedding features and unbalanced classification loss comprises:
Step S101: determining a protein fold training dataset and a test dataset, each comprising a plurality of protein chains;
Step S102: generating an embedding matrix for each protein chain in the protein fold training dataset using the pre-trained protein language model ProtT5-XL-UniRef50;
Step S103: converting the embedding matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
Step S104: constructing a protein fold recognition network model, wherein the model is a multilayer perceptron consisting of three fully connected layers, the last of which is a normalized fully connected layer;
Step S105: adopting the label-distribution-aware margin loss designed for unbalanced classification as the loss function for training the fold recognition network model;
Step S106: training the fold recognition network model based on the protein fold training dataset and the loss function;
Step S107: predicting the fold class of protein chains based on the protein fold test dataset and the trained fold recognition network model.
Further, in step S101, the invention adopts the dataset of the protein structure prediction challenge held on the iFLYTEK open platform in 2021 (https://challenge.xfyun.cn/topic/info?type=protein&ch=dc-web-01) as the protein fold dataset. The dataset is derived from the protein structure classification database ASTRAL SCOPe 2.07 and consists of 11843 protein chains, with sequence identity between any two chains below 40%. In particular, to ensure that each fold class has enough samples for training, every selected fold class contains no fewer than 10 protein chains. In addition, the dataset is further divided into a training set of 9472 protein chains and a test set of 2371 protein chains. The number of fold classes in the dataset is 245.
Further, in step S102, the invention first downloads the pre-trained protein language model ProtT5-XL-UniRef50 from https://zenodo.org/record/4644188#.YZxmZcVBy70. Then, for any given protein chain of length L, all characters in its amino acid sequence are converted to uppercase, the converted sequence is taken as the input of the model ProtT5-XL-UniRef50, and the model is run in half precision with the output of its encoder saved, yielding an embedding feature matrix of size L × 1024. This means that each amino acid residue corresponds to a 1024-dimensional embedding feature.
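The embedding step above can be sketched as follows, assuming the publicly released HuggingFace checkpoint "Rostlab/prot_t5_xl_uniref50" and the `transformers` library; the function names and the mapping of rare residue codes (U, Z, O, B) to X follow common ProtT5 usage examples and are illustrative, not prescribed by the patent.

```python
import re

def preprocess(seq: str) -> str:
    """Uppercase the sequence, map rare amino-acid codes (U, Z, O, B) to X,
    and insert spaces between residues as the ProtT5 tokenizer expects."""
    seq = re.sub(r"[UZOB]", "X", seq.upper())
    return " ".join(seq)

def embed(seq: str):
    """Return an L x 1024 embedding matrix for one protein chain."""
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    name = "Rostlab/prot_t5_xl_uniref50"
    tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(name).half().eval()  # half precision

    ids = tokenizer(preprocess(seq), return_tensors="pt")
    with torch.no_grad():
        out = model(**ids).last_hidden_state   # (1, L+1, 1024); final token is </s>
    return out[0, : len(seq)].float()          # L x 1024 embedding matrix
```

The model call in `embed` downloads the full checkpoint, so it is kept inside the function and not executed at import time; only the lightweight `preprocess` helper runs eagerly.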
Further, in step S103, in order to predict the fold class of a given protein chain, protein chains of different lengths need to be represented as feature vectors of fixed size. For a given embedding matrix E of size L × 1024, the invention first calculates the mean of each column of the embedding matrix to obtain a feature representation of length 1024:

f_col_mean = [m_1, m_2, ..., m_1024]^T, where m_j = (1/L) * Σ_{i=1}^{L} E_{ij} and L is the number of rows of the embedding matrix E;

then the mean of each row of the embedding matrix is calculated to obtain a vector of length L:

f_row_mean = [r_1, r_2, ..., r_L]^T, where r_i = (1/1024) * Σ_{j=1}^{1024} E_{ij};

then the cosine similarity between f_row_mean and each column vector of the embedding matrix is calculated to obtain a feature representation of length 1024:

f_cos_sim = [s_1, s_2, ..., s_j, ..., s_1024]^T.

In particular, the cosine similarity s_j is computed as

s_j = <f_row_mean, E_:j> / (||f_row_mean|| * ||E_:j||),

where E_:j denotes the j-th column of E, <·,·> denotes the inner product of two vectors, and ||·|| denotes the length (Euclidean norm) of a vector.

Finally, the invention concatenates the two vectors f_col_mean and f_cos_sim into a single vector representing the features of the protein chain. Clearly, through the above operations each protein chain can be represented as a feature vector of 2048 dimensions.
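The fixed-length feature construction just described can be expressed compactly; a minimal NumPy sketch (the function name is illustrative), where E stands for the L × 1024 embedding matrix produced in step S102:

```python
import numpy as np

def chain_features(E: np.ndarray) -> np.ndarray:
    """Convert an L x 1024 embedding matrix into a fixed 2048-d feature vector."""
    f_col_mean = E.mean(axis=0)                      # length 1024: mean of each column
    f_row_mean = E.mean(axis=1)                      # length L: mean of each row
    # Cosine similarity between f_row_mean and each column of E.
    denom = np.linalg.norm(f_row_mean) * np.linalg.norm(E, axis=0)
    f_cos_sim = (f_row_mean @ E) / denom             # length 1024
    return np.concatenate([f_col_mean, f_cos_sim])   # length 2048
```

Chains of any length L map to the same 2048-dimensional representation, which is what allows a fixed-size perceptron to consume them in the next step.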
Further, in step S104, the fold recognition network model designed by the invention is a multilayer perceptron composed of three fully connected layers. FIG. 2 shows the model architecture of the fold recognition network, where X denotes an input matrix of size N × 2048, Z denotes an output matrix of size N × 245, N denotes the number of protein chains in a mini-batch, and FC denotes a fully connected layer performing a linear transformation on its input. Each of the first two fully connected layers is followed by a ReLU activation. In particular, the last layer of the multilayer perceptron is a normalized fully connected layer. For a given input vector x, the normalized fully connected layer computes the output vector z as follows:

z_l = τ * <a_l, x> / (||a_l|| * ||x||), l = 1, 2, ..., 245,

where a_l denotes the l-th column of the weight matrix A of the fully connected layer and the hyperparameter τ is a scaling factor. In the present invention, we set τ to 18.
It is worth noting that the invention does not employ a deeper network model for protein fold recognition. This is because the embedding features employed are derived from the pre-trained deep protein language model ProtT5-XL-UniRef50. Since ProtT5-XL-UniRef50 is a deep Transformer network with about 3 billion parameters, the output features of its encoder already contain contextual information for each residue, so a deeper network model is not required for fold recognition based on embedding features. In the specific implementation, the output dimension of fully connected layer FC1 is 2048 and that of FC2 is 1024.
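A sketch of the three-layer perceptron of FIG. 2 under the dimensions stated above (2048 → 2048 → 1024 → 245, τ = 18). Note that PyTorch stores a linear layer's weight as (out_features, in_features), so each row of `weight` below plays the role of a column a_l of the matrix A; class and attribute names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedLinear(nn.Module):
    """Cosine classifier: z_l = tau * <a_l, x> / (||a_l|| * ||x||)."""
    def __init__(self, in_features: int, out_features: int, tau: float = 18.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features).uniform_(-1, 1))
        self.tau = tau

    def forward(self, x):
        # Normalize both the input rows and the class-weight rows, then scale.
        return self.tau * F.normalize(x, dim=1) @ F.normalize(self.weight, dim=1).t()

class FoldNet(nn.Module):
    """Three-layer perceptron: FC1 (2048) -> ReLU -> FC2 (1024) -> ReLU -> normalized FC."""
    def __init__(self, n_classes: int = 245, tau: float = 18.0):
        super().__init__()
        self.fc1 = nn.Linear(2048, 2048)
        self.fc2 = nn.Linear(2048, 1024)
        self.fc3 = NormalizedLinear(1024, n_classes, tau)

    def forward(self, x):
        return self.fc3(F.relu(self.fc2(F.relu(self.fc1(x)))))
```

Because the final layer outputs scaled cosines, every logit is bounded by τ in absolute value, which keeps the margins added by the loss in step S105 on a comparable scale across classes.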
Further, in step S105, a statistical analysis of the training dataset shows that, of the 245 fold classes, 17.96% contain no fewer than 50 samples, 35.92% contain between 20 and 50 samples, and 46.12% contain fewer than 20 samples. The protein fold recognition problem is therefore clearly an unbalanced classification problem. In particular, when the fold recognition network is trained with the common cross-entropy loss, the information of sparse classes tends to be overwhelmed by that of frequent classes during training, so only limited classification accuracy can be obtained. To this end, the invention trains the fold recognition network with the label-distribution-aware margin (LDAM) loss designed for unbalanced classification [Cao, K., et al., Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems 32, 2019], so as to enhance the learning of sparse classes. Specifically, let n_l denote the number of protein chains of the l-th fold class in the training dataset, and define the vector γ = [γ_1, γ_2, ..., γ_245], where

γ_l = n_l^(-1/4).

The label-dependent margin of the l-th fold class is then defined as Δ_l = γ_l / γ_max, where γ_max denotes the maximum value of the vector γ. Note that sparse fold classes have larger margins. Based on the output matrix Z of the fold recognition network and the corresponding label vector y, the label-distribution-aware margin loss is defined as

L = -(1/N) * Σ_{i=1}^{N} log [ exp(s * (z_{i,y_i} - Δ_{y_i})) / ( exp(s * (z_{i,y_i} - Δ_{y_i})) + Σ_{l≠y_i} exp(s * z_{i,l}) ) ],

where z_{i,l} denotes the element of Z in row i and column l, and the hyperparameter s adjusts the size of the margin; in the present invention we set it to 5.
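The margin computation and loss above can be sketched in PyTorch following the public LDAM formulation (Cao et al., 2019): per-class margins decreasing in the class size are subtracted from the true-class logit before a scaled cross-entropy. The function name and exact exponent normalization are illustrative.

```python
import torch
import torch.nn.functional as F

def ldam_loss(z: torch.Tensor, y: torch.Tensor,
              class_counts: torch.Tensor, s: float = 5.0) -> torch.Tensor:
    """z: (N, C) logits; y: (N,) integer labels; class_counts: (C,) samples per class."""
    gamma = class_counts.float() ** -0.25       # gamma_l = n_l^(-1/4)
    delta = gamma / gamma.max()                 # sparse classes get larger margins
    # Subtract the margin only at each sample's true class, then scale by s.
    z_m = z - delta[None, :] * F.one_hot(y, z.size(1))
    return F.cross_entropy(s * z_m, y)
```

Since a larger margin lowers the true-class logit, a sample from a sparse class incurs a higher loss at the same logits, which is exactly the mechanism that strengthens learning of sparse fold classes.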
Further, in step S106, in order to train the designed fold recognition network, the invention initializes the network parameters of the multilayer perceptron with the default weight initialization method of the PyTorch deep learning framework. In particular, to initialize the weight matrix A of the normalized fully connected layer, the invention first samples a matrix of the same size as A from the uniform distribution U(-1,1), then normalizes each of its columns to a unit vector and uses the result to initialize A. In addition, during training the mini-batch size is set to 64, the optimizer is AdamW with a learning rate of 0.001, the weight decay hyperparameter is set to 0.001, and the maximum number of epochs is 10.
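The initialization of the normalized layer's weight matrix A described above, sampling from U(-1,1) and then normalizing each column to a unit vector, can be sketched as follows; the AdamW configuration is shown as a comment, and all names are illustrative.

```python
import torch

def init_normalized_weight(out_features: int, in_features: int) -> torch.Tensor:
    """Sample a matrix of the target size from U(-1, 1), then scale each
    column to unit Euclidean length, as described for the weight matrix A."""
    A = torch.empty(out_features, in_features).uniform_(-1.0, 1.0)
    return A / A.norm(dim=0, keepdim=True)

# Training configuration stated in the text (net is the fold recognition network):
# optimizer = torch.optim.AdamW(net.parameters(), lr=0.001, weight_decay=0.001)
# mini-batch size 64, at most 10 epochs
```

Column-normalizing the initial weights matches the cosine form of the normalized layer, so the initial logits already lie in the [-τ, τ] range the layer produces after training updates.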
Further, in step S107, in order to predict the fold class of a given protein chain, step S102 and step S103 are first performed to represent the chain as a 2048-dimensional feature vector. The feature vector is then fed into the trained fold recognition network, yielding a 245-dimensional output vector (logit score vector). Finally, the fold class corresponding to the largest element of the output vector is assigned to the protein chain. FIG. 3 shows the flow chart of predicting the fold class of a protein chain.
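Step S107 reduces to an argmax over the network's logit vector; a minimal sketch, assuming `model` is the trained fold recognition network and `features` is the 2048-dimensional vector from step S103 (names illustrative):

```python
import torch

def predict_fold(model: torch.nn.Module, features: torch.Tensor) -> int:
    """Return the index of the highest-scoring fold class for one protein chain."""
    model.eval()
    with torch.no_grad():
        logits = model(features.unsqueeze(0))  # (1, n_classes) logit score vector
    return int(logits.argmax(dim=1).item())
```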
On the basis of the above embodiment, as shown in FIG. 4, the present invention further provides a protein fold recognition system based on embedding features and unbalanced classification loss, comprising:
a data set determination module for determining a protein folding training data set and a test data set, both of which comprise a plurality of protein chains;
the embedded matrix generation module is used for generating an embedded matrix of a protein chain in the protein folding training data set by adopting a pre-trained protein language model ProtT5-XL-UniRef 50;
a feature vector obtaining module for converting the embedding matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
the model construction module is used for constructing a protein folding identification network model, the protein folding identification network model is a multilayer perceptron consisting of three fully-connected layers, and the last fully-connected layer of the multilayer perceptron adopts a normalized fully-connected layer;
a loss function obtaining module, configured to use the label-distribution-aware margin loss designed for unbalanced classification as the loss function for training the fold recognition network model;
the model training module is used for training a folding recognition network model based on a protein folding training data set and a loss function for training a folding recognition network;
and the folding type identification module is used for predicting the folding type of the protein chain based on the protein folding test data set and the trained folding identification network model.
Further, the embedded matrix generation module is specifically configured to:
For any protein chain of length L in the protein fold training dataset, first convert all characters in its amino acid sequence to uppercase, take the converted sequence as the input of the model ProtT5-XL-UniRef50, then run the model in half precision and save the output of its encoder, obtaining an embedding feature matrix of size L × 1024.
Further, the feature vector derivation module is specifically configured to:
For a given embedding matrix E of size L × 1024, first calculate the mean of each column of the embedding matrix to obtain a feature representation of length 1024:

f_col_mean = [m_1, m_2, ..., m_1024]^T, where m_j = (1/L) * Σ_{i=1}^{L} E_{ij} and L is the number of rows of the embedding matrix E;

then calculate the mean of each row of the embedding matrix to obtain a vector of length L:

f_row_mean = [r_1, r_2, ..., r_L]^T, where r_i = (1/1024) * Σ_{j=1}^{1024} E_{ij};

then calculate the cosine similarity between f_row_mean and each column vector of the embedding matrix to obtain a feature representation of length 1024:

f_cos_sim = [s_1, s_2, ..., s_j, ..., s_1024]^T,

where the cosine similarity s_j is computed as

s_j = <f_row_mean, E_:j> / (||f_row_mean|| * ||E_:j||),

with E_:j the j-th column of E, <·,·> the inner product of two vectors, and ||·|| the length (Euclidean norm) of a vector;

finally, concatenate the two vectors f_col_mean and f_cos_sim into a single vector representing the features of the protein chain; through the above operations each protein chain can be represented as a feature vector of 2048 dimensions.
Further, the model training module is further configured to:
initializing the network parameters of the multilayer perceptron by adopting the default weight initialization method of the PyTorch deep learning framework.
Further, the folding category identification module is specifically configured to:
the embedded matrix generation module and the feature vector derivation module are first executed to represent the protein chains in the protein folding test dataset as feature vectors; the feature vectors are then input into the trained folding recognition network model, and the folding category with the highest score is assigned to each protein chain.
In summary, the invention first generates an embedding matrix for a given protein chain using the pre-trained protein language model ProtT5-XL-UniRef50, and then converts the embedding matrix into a fixed-length feature vector by computing mean values and cosine similarities. Notably, by using embedding features, the invention avoids time-consuming multiple sequence alignment operations. In addition, considering that protein folding data exhibit a markedly unbalanced class distribution, the invention trains the multilayer perceptron network with the label-distribution-aware margin loss designed for unbalanced classification tasks, thereby strengthening learning on sparse folding categories. As a result, the protein folding recognition network model provided by the invention can quickly and accurately predict the folding category of a given protein chain.
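For concreteness, the classifier and loss can be sketched in PyTorch; the hidden width, margin cap, and scaling factor below are illustrative hyperparameters not specified here, and the per-class margin proportional to $n_j^{-1/4}$ follows the label-distribution-aware margin (LDAM) loss of Cao et al. (NeurIPS 2019):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedLinear(nn.Module):
    """Final layer: L2-normalizes both features and class weights, so the
    output logits are cosine similarities in [-1, 1]."""
    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, in_features))

    def forward(self, x):
        return F.normalize(x, dim=1) @ F.normalize(self.weight, dim=1).t()

class FoldMLP(nn.Module):
    """Three fully-connected layers; the last one is normalized."""
    def __init__(self, num_classes: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2048, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            NormalizedLinear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)

class LDAMLoss(nn.Module):
    """Label-distribution-aware margin loss: the true-class logit is reduced
    by a margin proportional to n_j ** -0.25, so sparse folding categories
    receive larger margins."""
    def __init__(self, class_counts, max_margin: float = 0.5, scale: float = 30.0):
        super().__init__()
        m = torch.tensor(class_counts, dtype=torch.float).pow(-0.25)
        self.margins = m * (max_margin / m.max())
        self.scale = scale

    def forward(self, logits, target):
        adjusted = logits.clone()
        adjusted[torch.arange(len(target)), target] -= self.margins[target]
        return F.cross_entropy(self.scale * adjusted, target)

# illustrative usage with 5 fold classes and imbalanced per-class counts
model = FoldMLP(num_classes=5)
x = torch.randn(8, 2048)
logits = model(x)
target = torch.randint(0, 5, (8,))
loss = LDAMLoss(class_counts=[120, 40, 15, 6, 2])(logits, target)
```

At prediction time the folding category for each test chain is simply `logits.argmax(dim=1)`.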
The above describes only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also fall within the protection scope of the present invention.

Claims (10)

1. A method for identifying protein folding based on embedding features and unbalanced classification loss, comprising:
step 1: determining a protein folding training dataset and a test dataset, each comprising a plurality of protein chains;
step 2: generating an embedded matrix of a protein chain in the protein folding training dataset by adopting the pre-trained protein language model ProtT5-XL-UniRef50;
step 3: converting the embedded matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
step 4: constructing a protein folding recognition network model, wherein the protein folding recognition network model is a multilayer perceptron composed of three fully-connected layers, and the last fully-connected layer of the multilayer perceptron is a normalized fully-connected layer;
step 5: adopting the label-distribution-aware margin loss for unbalanced classification as the loss function for training the folding recognition network model;
step 6: training a folding recognition network model based on a protein folding training data set and a loss function for training a folding recognition network;
step 7: predicting the folding category of the protein chain based on the protein folding test dataset and the trained folding recognition network model.
2. The method for identifying protein folding based on embedding features and unbalanced classification loss according to claim 1, wherein the step 2 comprises:
for any protein chain of length l in the protein folding training dataset, converting all characters of its amino acid sequence to uppercase, taking the converted amino acid sequence as the input of the model ProtT5-XL-UniRef50, running the model in half-precision mode and saving the output of the model's encoder, thereby obtaining an embedded feature matrix of size l × 1024.
3. The method for identifying protein folding based on embedding features and unbalanced classification loss according to claim 2, wherein the step 3 comprises:
for a given embedding matrix E of size l × 1024, first calculating the mean of each column of the embedding matrix to obtain a feature vector of length 1024:

$$f_{\mathrm{col\_mean}} = \frac{1}{l}\sum_{i=1}^{l} E_{i,:}$$

where $l$ represents the number of rows of the embedding matrix $E$ and $E_{i,:}$ denotes its $i$-th row;

then calculating the mean of each row of the embedding matrix to obtain a vector of length l:

$$f_{\mathrm{row\_mean}} = \frac{1}{1024}\sum_{j=1}^{1024} E_{:,j}$$

where $E_{:,j}$ denotes the $j$-th column of $E$;

then calculating the cosine similarity between $f_{\mathrm{row\_mean}}$ and each column vector of the embedding matrix to obtain a feature representation of length 1024:

$$f_{\mathrm{cos\_sim}} = [s_1, s_2, \ldots, s_j, \ldots, s_{1024}]^{T}$$

where the cosine similarity $s_j$ is calculated according to the following formula:

$$s_j = \frac{\langle f_{\mathrm{row\_mean}},\, E_{:,j}\rangle}{\lVert f_{\mathrm{row\_mean}}\rVert \,\lVert E_{:,j}\rVert}$$

where $\langle\cdot,\cdot\rangle$ represents the inner product of two vectors and $\lVert\cdot\rVert$ represents the length (Euclidean norm) of a vector;

finally, concatenating the two vectors $f_{\mathrm{col\_mean}}$ and $f_{\mathrm{cos\_sim}}$ into a single vector representing the features of the protein chain; through these operations, each protein chain is represented as a 2048-dimensional feature vector.
4. The method for identifying protein folding based on embeddings and unbalanced classification loss according to claim 1, wherein the step 6 further comprises:
initializing the network parameters of the multilayer perceptron by adopting the default weight initialization method of the PyTorch deep learning framework.
5. The method of claim 1, wherein the step 7 comprises:
step 2 and step 3 are first performed to represent the protein chains in the protein folding test dataset as feature vectors; the feature vectors are then input into the trained folding recognition network model, and the folding category with the highest score is assigned to each protein chain.
6. A protein fold identification system based on embedding features and unbalanced classification loss, comprising:
a data set determination module for determining a protein folding training data set and a test data set, both of which comprise a plurality of protein chains;
the embedded matrix generation module is used for generating an embedded matrix of a protein chain in the protein folding training dataset by adopting the pre-trained protein language model ProtT5-XL-UniRef50;
the characteristic vector obtaining module is used for converting the embedded matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
the model construction module is used for constructing a protein folding recognition network model, wherein the protein folding recognition network model is a multilayer perceptron composed of three fully-connected layers, and the last fully-connected layer of the multilayer perceptron is a normalized fully-connected layer;
a loss function obtaining module, configured to use the label-distribution-aware margin loss for unbalanced classification as the loss function for training the folding recognition network model;
the model training module is used for training a folding recognition network model based on a protein folding training data set and a loss function for training a folding recognition network;
and the folding type identification module is used for predicting the folding type of the protein chain based on the protein folding test data set and the trained folding identification network model.
7. The system according to claim 6, wherein the embedding matrix generation module is specifically configured to:
for any protein chain of length l in the protein folding training dataset, converting all characters of its amino acid sequence to uppercase, taking the converted amino acid sequence as the input of the model ProtT5-XL-UniRef50, running the model in half-precision mode and saving the output of the model's encoder, thereby obtaining an embedded feature matrix of size l × 1024.
8. The system according to claim 7, wherein the feature vector derivation module is specifically configured to:
for a given embedding matrix E of size l × 1024, first calculating the mean of each column of the embedding matrix to obtain a feature vector of length 1024:

$$f_{\mathrm{col\_mean}} = \frac{1}{l}\sum_{i=1}^{l} E_{i,:}$$

where $l$ represents the number of rows of the embedding matrix $E$ and $E_{i,:}$ denotes its $i$-th row;

then calculating the mean of each row of the embedding matrix to obtain a vector of length l:

$$f_{\mathrm{row\_mean}} = \frac{1}{1024}\sum_{j=1}^{1024} E_{:,j}$$

where $E_{:,j}$ denotes the $j$-th column of $E$;

then calculating the cosine similarity between $f_{\mathrm{row\_mean}}$ and each column vector of the embedding matrix to obtain a feature representation of length 1024:

$$f_{\mathrm{cos\_sim}} = [s_1, s_2, \ldots, s_j, \ldots, s_{1024}]^{T}$$

where the cosine similarity $s_j$ is calculated according to the following formula:

$$s_j = \frac{\langle f_{\mathrm{row\_mean}},\, E_{:,j}\rangle}{\lVert f_{\mathrm{row\_mean}}\rVert \,\lVert E_{:,j}\rVert}$$

where $\langle\cdot,\cdot\rangle$ represents the inner product of two vectors and $\lVert\cdot\rVert$ represents the length (Euclidean norm) of a vector;

finally, concatenating the two vectors $f_{\mathrm{col\_mean}}$ and $f_{\mathrm{cos\_sim}}$ into a single vector representing the features of the protein chain; through these operations, each protein chain is represented as a 2048-dimensional feature vector.
9. The system of claim 6, wherein the model training module is further configured to:
initializing the network parameters of the multilayer perceptron by adopting the default weight initialization method of the PyTorch deep learning framework.
10. The system according to claim 6, wherein the fold class identification module is specifically configured to:
the embedded matrix generation module and the characteristic vector obtaining module are first executed to represent the protein chains in the protein folding test dataset as feature vectors; the feature vectors are then input into the trained folding recognition network model, and the folding category with the highest score is assigned to each protein chain.
CN202210110503.9A 2022-01-29 2022-01-29 Protein folding identification method and system based on embedding characteristics and unbalanced classification loss Active CN114550824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210110503.9A CN114550824B (en) 2022-01-29 2022-01-29 Protein folding identification method and system based on embedding characteristics and unbalanced classification loss

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210110503.9A CN114550824B (en) 2022-01-29 2022-01-29 Protein folding identification method and system based on embedding characteristics and unbalanced classification loss

Publications (2)

Publication Number Publication Date
CN114550824A true CN114550824A (en) 2022-05-27
CN114550824B CN114550824B (en) 2022-11-22

Family

ID=81674378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210110503.9A Active CN114550824B (en) 2022-01-29 2022-01-29 Protein folding identification method and system based on embedding characteristics and unbalanced classification loss

Country Status (1)

Country Link
CN (1) CN114550824B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020061510A1 (en) * 2000-10-16 2002-05-23 Jonathan Miller Method and system for designing proteins and protein backbone configurations
CN101794351A (en) * 2010-03-09 2010-08-04 哈尔滨工业大学 Protein secondary structure engineering prediction method based on large margin nearest central point
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
CN112116950A (en) * 2020-09-10 2020-12-22 南京理工大学 Protein folding identification method based on depth measurement learning
CN112116949A (en) * 2020-09-10 2020-12-22 南京理工大学 Protein folding identification method based on triple loss
CN112292693A (en) * 2018-05-18 2021-01-29 渊慧科技有限公司 Meta-gradient update of reinforcement learning system training return function
CN113593633A (en) * 2021-08-02 2021-11-02 中国石油大学(华东) Drug-protein interaction prediction model based on convolutional neural network
CN113593631A (en) * 2021-08-09 2021-11-02 山东大学 Method and system for predicting protein-polypeptide binding site
CN113643756A (en) * 2021-08-09 2021-11-12 安徽工业大学 Protein interaction site prediction method based on deep learning
CN113727994A (en) * 2019-05-02 2021-11-30 德克萨斯大学董事会 System and method for improving stability of synthetic protein


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BURCU ÇARKLI YAVUZ et al.: "Prediction of Protein Secondary Structure With Clonal Selection Algorithm and Multilayer Perceptron", IEEE Access *
张光亚 et al.: "Protein Recognition with a Multilayer Perceptron Model Using the BP Algorithm", Journal of Huaqiao University (Natural Science Edition) *
潘越: "Minimal Feature Extraction of Protein Fold Types Based on Convolutional Neural Networks", China Master's Theses Full-text Database, Basic Sciences *

Also Published As

Publication number Publication date
CN114550824B (en) 2022-11-22

Similar Documents

Publication Publication Date Title
Wang et al. Learning context-sensitive similarity by shortest path propagation
CN109923557A (en) Use continuous regularization training joint multitask neural network model
CN111626048A (en) Text error correction method, device, equipment and storage medium
US20190073443A1 (en) Methods and systems for producing an expanded training set for machine learning using biological sequences
CN108763865A (en) A kind of integrated learning approach of prediction DNA protein binding sites
Deng et al. META-DDIE: predicting drug–drug interaction events with few-shot learning
CN112035620B (en) Question-answer management method, device, equipment and storage medium of medical query system
Cai et al. Evolutionary computation techniques for multiple sequence alignment
CN113140254B (en) Meta-learning drug-target interaction prediction system and prediction method
US20230237084A1 (en) Method and apparatus for question-answering using a database consist of query vectors
CN111259115B (en) Training method and device for content authenticity detection model and computing equipment
CN114420212A (en) Escherichia coli strain identification method and system
CN116206688A (en) Multi-mode information fusion model and method for DTA prediction
CN103473416A (en) Protein-protein interaction model building method and device
CN114550824B (en) Protein folding identification method and system based on embedding characteristics and unbalanced classification loss
CN112632264A (en) Intelligent question and answer method and device, electronic equipment and storage medium
Guerrini et al. Lightweight metagenomic classification via eBWT
CN112085245A (en) Protein residue contact prediction method based on deep residual error neural network
Delteil et al. MATrIX--Modality-Aware Transformer for Information eXtraction
Zou et al. Predicting RNA secondary structure based on the class information and Hopfield network
Dotan et al. Effect of tokenization on transformers for biological sequences
KR102187586B1 (en) Data processing apparatus and method for discovering new drug candidates
Wells et al. Chainsaw: protein domain segmentation with fully convolutional neural networks
CN112069322B (en) Text multi-label analysis method and device, electronic equipment and storage medium
CN114706971A (en) Biomedical document type determination method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant