CN114550824A - Protein folding identification method and system based on embedding characteristics and unbalanced classification loss - Google Patents
- Publication number
- CN114550824A (Application No. CN202210110503.9A)
- Authority
- CN
- China
- Prior art keywords
- protein
- folding
- training
- model
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a protein fold recognition method and system based on embedding features and an imbalanced classification loss. The method first generates an embedding matrix for a given protein chain using the pre-trained protein language model ProtT5-XL-UniRef50, and then converts the embedding matrix into a fixed-length feature vector by computing column means and cosine similarities. Notably, by using embedding features, the invention avoids time-consuming multiple sequence alignment. In addition, because protein fold data exhibit a markedly imbalanced class distribution, the invention trains a multilayer perceptron with the label-distribution-aware margin loss designed for imbalanced classification tasks, thereby strengthening learning on sparse fold classes. In summary, the proposed protein fold recognition network model can quickly and accurately predict the fold class of a given protein chain.
Description
Technical Field
The invention belongs to the technical field of computational biology, and in particular relates to a protein fold recognition method and system based on embedding features and an imbalanced classification loss.
Background
Proteins are biological macromolecules composed of 20 standard types of amino acid, and they play very important roles in many biological processes. The amino acid sequence of a protein determines its tertiary structure, and its function depends significantly on that tertiary structure. Owing to the rapid development of protein sequencing technology, the number of proteins in sequence databases currently far exceeds the number of proteins with known structures. Computational methods for predicting protein structure have therefore become an essential means of narrowing the gap between the number of known sequences and the number of known structures. In tertiary structure prediction, an important subtask is finding proteins with similar structures. For a protein of unknown structure, if a structurally similar protein exists in the PDB database, its structure can be accurately modeled using that protein as a template. Protein fold recognition, in particular, can help find proteins with similar structures.
The recently released SCOPe 2.08 database divides protein structures into 12 broad classes: all-alpha proteins, all-beta proteins, alpha/beta proteins, alpha+beta proteins, multi-domain proteins, membrane and cell surface proteins and peptides, small proteins, coiled coil proteins, low-resolution protein structures, peptides, designed proteins, and artifacts. The stable statistics published for SCOPe 2.08 cover the first 7 structural classes, which contain 290, 180, 148, 396, 74, 69, and 100 fold classes respectively, for a total of 1257 fold classes, 2067 superfamilies, and 5084 families. Notably, as new protein structures are released, the number of known superfamilies changes somewhat [Chandonia, J.M., N.K. Fox, and S.E. Brenner, SCOPe: classification of large macromolecular structures in the structural classification of proteins-extended database. Nucleic Acids Res, 2019. 47(D1): p. D475-D481]. If two different superfamilies are found to be related because of a newly released structure, they are merged into the same superfamily. Furthermore, a single domain may be split into multiple domains when distinct evolutionary relationships are discovered.
Currently, many machine learning algorithms, such as support vector machines [Yan, K., et al., Protein Fold Recognition by Combining Support Vector Machines and Pairwise Sequence Similarity Scores. IEEE/ACM Trans Comput Biol Bioinform, 2021. 18(5): p. 2008-2016], neural networks [Villegas-Morcillo, A., V. Sanchez, and A.M. Gomez, FoldHSphere: deep hyperspherical embeddings for protein fold recognition. BMC Bioinformatics, 2021. 22(1): p. 490], ensemble classifiers, and random forests, have been successfully applied to fold recognition. However, these existing methods not only ignore the imbalance of the fold dataset when training the model, but also mostly use PSSM or HMM profile features as the main fold recognition features. Notably, generating these profile features requires time-consuming multiple sequence alignments against large-scale sequence databases.
Disclosure of Invention
Existing protein fold recognition methods ignore the imbalance of the fold dataset when training the model, and mostly use PSSM or HMM profile features as the main fold recognition features, whose generation requires time-consuming multiple sequence alignment against large-scale sequence databases. To address these problems, the invention provides a protein fold recognition method and system that perform fold recognition with embedding features derived from a pre-trained protein language model, and that introduce an imbalanced classification loss during training to strengthen learning on sparse fold classes. The proposed fold recognition method can quickly and accurately predict the fold class of a protein.
In order to achieve the purpose, the invention adopts the following technical scheme:
In one aspect, the invention provides a protein fold recognition method based on embedding features and an imbalanced classification loss, comprising the following steps:
Step 1: determining a protein fold training dataset and a test dataset, each comprising a plurality of protein chains;
Step 2: generating an embedding matrix for each protein chain in the protein fold training dataset using the pre-trained protein language model ProtT5-XL-UniRef50;
Step 3: converting the embedding matrix into a fixed-length feature vector for the protein chain by computing column means and cosine similarities;
Step 4: constructing a protein fold recognition network model, wherein the model is a multilayer perceptron consisting of three fully-connected layers, the last of which is a normalized fully-connected layer;
Step 5: adopting the label-distribution-aware margin loss designed for imbalanced classification as the loss function for training the fold recognition network model;
Step 6: training the fold recognition network model based on the protein fold training dataset and the loss function;
Step 7: predicting the fold classes of protein chains based on the protein fold test dataset and the trained fold recognition network model.
Further, the step 2 comprises:
for any protein chain with the length of L in the protein folding training data set, firstly converting all characters in the amino acid sequence of the protein chain into capital characters, taking the converted amino acid sequence as the input of a model ProtT5-XL-UniRef50, finally running the model in a semi-precision mode and storing the output of an encoder of the model, and obtaining an embedded feature matrix with the size of L multiplied by 1024.
Further, the step 3 comprises:
for a given embedding matrix E of size lx 1024, first calculate the mean value of each column of the embedding matrix, obtain a length-1024 signature:
where l represents the number of rows of the embedding matrix E;
then, calculating the average value of each row of the embedding matrix to obtain a vector with the length of L:
then f is calculatedrow_meanAnd obtaining a feature representation with the length of 1024 according to the cosine similarity of each column vector of the embedded matrix:
fcos_sim=[s1,s2,...,sj,...s1024]T
cosine similarity sjThe calculation is carried out according to the following formula:
wherein <, > represents the inner product of two vectors, | | | - | represents the length of the vector;
finally, two vectors f are dividedcol_meanAnd fcos_simThe protein chains are spliced into a vector to represent the characteristics of the protein chains, and each protein chain can be represented into a characteristic vector with 2048 dimensions through the operation.
Further, the step 6 further includes:
and initializing network parameters of the multilayer perceptron by adopting a default weight initialization method of a PyTorch deep learning framework.
Further, the step 7 includes:
step 2 and step 3 are performed first to represent the protein chains in the protein folding test dataset as feature vectors, which are then input into a trained folding recognition network model, and the folding class with the highest score is assigned to the protein chain.
In another aspect, the invention provides a protein fold recognition system based on embedding features and an imbalanced classification loss, comprising:
a data set determination module for determining a protein fold training dataset and a test dataset, both comprising a plurality of protein chains;
an embedded matrix generation module for generating an embedding matrix for each protein chain in the protein fold training dataset using the pre-trained protein language model ProtT5-XL-UniRef50;
a feature vector derivation module for converting the embedding matrix into a fixed-length feature vector for the protein chain by computing column means and cosine similarities;
a model construction module for constructing a protein fold recognition network model, wherein the model is a multilayer perceptron consisting of three fully-connected layers, the last of which is a normalized fully-connected layer;
a loss function derivation module for adopting the label-distribution-aware margin loss designed for imbalanced classification as the loss function for training the fold recognition network model;
a model training module for training the fold recognition network model based on the protein fold training dataset and the loss function;
and a fold class identification module for predicting the fold classes of protein chains based on the protein fold test dataset and the trained fold recognition network model.
Further, the embedded matrix generation module is specifically configured to:
for any protein chain with the length of L in the protein folding training dataset, converting all characters in the amino acid sequence of the protein chain into capital characters, taking the converted amino acid sequence as the input of a model ProtT5-XL-UniRef50, finally running the model in a semi-precision mode and storing the output of an encoder of the model, and obtaining an embedded feature matrix with the size of L multiplied by 1024.
Further, the feature vector derivation module is specifically configured to:
for a given embedding matrix E of size lx 1024, first calculate the mean value of each column of the embedding matrix, obtain a length-1024 signature:
where l represents the number of rows of the embedding matrix E;
then, calculating the mean value of each row of the embedding matrix to obtain a vector with the length of L:
then f is calculatedrow_meanAnd obtaining a feature representation with the length of 1024 according to the cosine similarity of each column vector of the embedded matrix:
fcos_sim=[s1,s2,...,sj,...s1024]T
cosine similarity sjThe calculation is performed according to the following formula:
wherein <, > represents the inner product of two vectors, | | | - | represents the length of the vector;
finally, two vectors f are dividedcol_meanAnd fcos_simThe protein chains are spliced into a vector to represent the characteristics of the protein chains, and each protein chain can be represented into a characteristic vector with 2048 dimensions through the operation.
Further, the model training module is further configured to:
and initializing network parameters of the multilayer perceptron by adopting a default weight initialization method of a PyTorch deep learning framework.
Further, the folding category identifying module is specifically configured to:
the method comprises the steps of firstly executing an embedding matrix generation module and a feature vector obtaining module to represent protein chains in a protein folding test data set as feature vectors, then inputting the feature vectors into a trained folding recognition network model, and allocating the folding category with the highest score to the protein chains.
Compared with the prior art, the invention has the following beneficial effects:
the method comprises the steps of firstly generating an embedded matrix of a given protein chain by adopting a pre-trained protein language model ProtT5-XL-UniRef50, and then converting the embedded matrix into a feature vector with a fixed length by calculating mean value and cosine similarity. In particular, by using the embedding feature, the present invention avoids time consuming multiple sequence alignment operations. In addition, considering that protein folding data has obvious unbalanced class distribution, the invention adopts a multilayer perceptron network designed for label distribution conscious interval loss training of unbalanced classification tasks, and thus the learning capacity of sparse folding classes is enhanced. In conclusion, the protein folding recognition network model provided by the invention can quickly and accurately predict the folding category of a given protein chain.
Drawings
FIG. 1 is a basic flow chart of a protein folding identification method based on intercalation characteristics and unbalanced classification loss according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a protein folding recognition network structure according to an embodiment of the present invention;
FIG. 3 is a flow chart of protein folding class prediction according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a protein fold recognition system architecture based on embedding features and an imbalanced classification loss according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
As shown in FIG. 1, a protein fold recognition method based on embedding features and an imbalanced classification loss comprises:
Step S101: determining a protein fold training dataset and a test dataset, each comprising a plurality of protein chains;
Step S102: generating an embedding matrix for each protein chain in the protein fold training dataset using the pre-trained protein language model ProtT5-XL-UniRef50;
Step S103: converting the embedding matrix into a fixed-length feature vector for the protein chain by computing column means and cosine similarities;
Step S104: constructing a protein fold recognition network model, wherein the model is a multilayer perceptron consisting of three fully-connected layers, the last of which is a normalized fully-connected layer;
Step S105: adopting the label-distribution-aware margin loss designed for imbalanced classification as the loss function for training the fold recognition network model;
Step S106: training the fold recognition network model based on the protein fold training dataset and the loss function;
Step S107: predicting the fold classes of protein chains based on the protein fold test dataset and the trained fold recognition network model.
Further, in step S101, the invention adopts the dataset of the protein structure prediction challenge held on the iFLYTEK open platform in 2021 (https://challenge.xfyun.cn/topic/info?type=protein&ch=dc-web-01) as the protein fold dataset. The dataset is derived from the protein structure classification database ASTRAL SCOPe 2.07 and consists of 11843 protein chains, with sequence identity below 40% between any two chains. In particular, to ensure that each fold class has enough training samples, every selected fold class contains no fewer than 10 protein chains. The dataset is further divided into a training set of 9472 protein chains and a test set of 2371 protein chains, and the number of fold classes in the dataset is 245.
Further, in step S102, the invention first downloads the pre-trained protein language model ProtT5-XL-UniRef50 from https://zenodo.org/record/4644188#.YZxmZcVBy70. Then, for any given protein chain of length L, all characters of its amino acid sequence are converted to uppercase, the converted sequence is taken as the input of the model ProtT5-XL-UniRef50, and finally the model is run in half precision and the output of its encoder is saved, yielding an embedding feature matrix of size L × 1024. This means that each amino acid residue corresponds to an embedding feature of dimension 1024.
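The preprocessing in step S102 (uppercasing the sequence before it is fed to ProtT5-XL-UniRef50) can be sketched as follows. The helper name `preprocess_sequence` is illustrative, not from the patent; the space-separation of residues and the mapping of rare amino acids to `X` are assumptions based on the published usage conventions of ProtT5 models in the Hugging Face `transformers` library, which the patent does not spell out.

```python
def preprocess_sequence(seq: str) -> str:
    """Uppercase a raw amino-acid sequence, map non-standard residues
    (U, Z, O, B) to X, and space-separate residues, as commonly done for
    ProtT5 input. The patent itself only specifies the uppercasing step."""
    seq = seq.upper()
    seq = "".join(c if c not in "UZOB" else "X" for c in seq)
    return " ".join(seq)

# Hypothetical usage with the transformers library (requires downloading
# the ~11 GB model, so it is left as comments here):
#   from transformers import T5Tokenizer, T5EncoderModel
#   tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")
#   model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").half()
#   ids = tokenizer(preprocess_sequence("mktay"), return_tensors="pt")
#   emb = model(**ids).last_hidden_state  # roughly (1, L+1, 1024)

print(preprocess_sequence("mktay"))  # M K T A Y
```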
Further, in step S103, in order to predict the fold class of a given protein chain, protein chains of different lengths need to be represented as feature vectors of fixed size. For a given embedding matrix E of size L × 1024, the invention first computes the mean of each column of the embedding matrix to obtain a feature representation of length 1024:
f_col_mean = [m_1, m_2, ..., m_1024]^T, where m_j = (1/L) Σ_{i=1}^{L} E_{i,j}
and L is the number of rows of the embedding matrix E.
Then the mean of each row of the embedding matrix is computed to obtain a vector of length L:
f_row_mean = [r_1, r_2, ..., r_L]^T, where r_i = (1/1024) Σ_{j=1}^{1024} E_{i,j}.
Then the cosine similarity between f_row_mean and each column vector of the embedding matrix is computed to obtain a feature representation of length 1024:
f_cos_sim = [s_1, s_2, ..., s_j, ..., s_1024]^T.
In particular, the cosine similarity s_j is computed as
s_j = <f_row_mean, E_{:,j}> / (||f_row_mean|| · ||E_{:,j}||),
where E_{:,j} denotes the j-th column of E, <·,·> denotes the inner product of two vectors, and ||·|| denotes the length (Euclidean norm) of a vector.
Finally, the invention concatenates the two vectors f_col_mean and f_cos_sim into one vector to represent the features of the protein chain. Clearly, each protein chain can thus be represented as a 2048-dimensional feature vector.
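The feature construction of step S103 — column means of the embedding matrix, plus cosine similarities between the row-mean vector and each column, concatenated into a 2048-dimensional vector — can be sketched in NumPy as follows. All function and variable names are illustrative.

```python
import numpy as np

def embedding_to_feature(E: np.ndarray) -> np.ndarray:
    """Convert an L x 1024 embedding matrix into a fixed 2048-dim vector:
    the 1024 column means, concatenated with the cosine similarity between
    the row-mean vector (length L) and each column of E."""
    f_col_mean = E.mean(axis=0)                    # length 1024
    f_row_mean = E.mean(axis=1)                    # length L
    # cosine similarity of f_row_mean with each column E[:, j]
    num = E.T @ f_row_mean                         # length 1024
    denom = np.linalg.norm(E, axis=0) * np.linalg.norm(f_row_mean)
    f_cos_sim = num / np.maximum(denom, 1e-12)     # guard against zero norms
    return np.concatenate([f_col_mean, f_cos_sim])  # length 2048

# A chain of length 57 maps to a fixed-size feature vector.
E = np.random.default_rng(0).normal(size=(57, 1024))
f = embedding_to_feature(E)
print(f.shape)  # (2048,)
```

Note that the output size is independent of the chain length L, which is exactly what allows chains of different lengths to share one classifier input dimension.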
Further, in step S104, the fold recognition network model designed by the invention is a multilayer perceptron composed of three fully-connected layers. FIG. 2 shows the model architecture, where X denotes an input matrix of size N × 2048, Z denotes an output matrix of size N × 245, N is the number of protein chains in a mini-batch, and FC denotes a fully-connected layer that applies a linear transformation to its input. Each of the first two fully-connected layers is followed by a ReLU activation function. In particular, the last layer of the multilayer perceptron is a normalized fully-connected layer. For a given input vector x, the normalized fully-connected layer computes the output vector z = [z_1, ..., z_245]^T as
z_l = τ · <a_l, x> / (||a_l|| · ||x||),
where a_l denotes the l-th column of the weight matrix A of the fully-connected layer and the hyperparameter τ is a scaling factor. In the invention, τ is set to 18.
It is worth noting that the invention does not employ a deeper network model for protein fold recognition. This is because the embedding features adopted here are derived from the pre-trained deep protein language model ProtT5-XL-UniRef50. Since ProtT5-XL-UniRef50 is a deep Transformer network with about 3 billion parameters, the output features of its encoder already contain contextual information for each residue, so a deeper network is not required for fold recognition based on embedding features. In the specific implementation, the output dimension of fully-connected layer FC1 is 2048 and that of FC2 is 1024.
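The normalized fully-connected layer of step S104 can be sketched in NumPy as scaled cosine similarity between the input and each weight column. Normalizing both the input and the weight columns follows the common cosine-classifier formulation (as in the LDAM reference implementation) and is an assumption here; the patent text explicitly normalizes the columns of A at initialization. Names are illustrative.

```python
import numpy as np

TAU = 18.0  # scaling factor tau from the text

def normalized_fc(x: np.ndarray, A: np.ndarray, tau: float = TAU) -> np.ndarray:
    """Normalized fully-connected layer: tau times the cosine similarity
    between the L2-normalized input x and each L2-normalized column of the
    weight matrix A. A has shape (in_dim, n_classes)."""
    x_hat = x / np.linalg.norm(x)
    A_hat = A / np.linalg.norm(A, axis=0, keepdims=True)
    return tau * (A_hat.T @ x_hat)

rng = np.random.default_rng(1)
x = rng.normal(size=1024)
A = rng.normal(size=(1024, 245))
z = normalized_fc(x, A)
print(z.shape)  # (245,)
```

Because each logit is tau times a cosine, every output is bounded in [-tau, tau], which keeps the logit scale fixed regardless of feature or weight magnitudes.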
Further, in step S105, a statistical analysis of the training dataset shows that, among the 245 fold classes, 17.96% have no fewer than 50 samples, 35.92% have between 20 and 50 samples, and 46.12% have fewer than 20 samples. Protein fold recognition is therefore clearly an imbalanced classification problem. In particular, when the fold recognition network is trained with the ordinary cross-entropy loss, the information of sparse classes is usually drowned out by that of frequent classes during training, so only limited classification accuracy can be obtained. For this reason, the invention trains the fold recognition network with the label-distribution-aware margin (LDAM) loss designed for imbalanced classification [Cao, K., et al., Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems 32, 2019], so as to strengthen learning on sparse classes. Specifically, let n_l denote the number of protein chains of the l-th fold class in the training dataset, and define the vector γ = [γ_1, γ_2, ..., γ_245], where γ_l = n_l^(-1/4) following the LDAM formulation. The label-dependent margin for the l-th fold class is then defined as δ_l = γ_l / γ_max, where γ_max denotes the maximum value of the vector γ. Note that sparse fold classes obtain larger margins. Based on the output matrix Z of the fold recognition network and the corresponding label vector y, the label-distribution-aware margin loss is defined as
L = -(1/N) Σ_{i=1}^{N} log[ exp(z_{i,y_i} - s·δ_{y_i}) / ( exp(z_{i,y_i} - s·δ_{y_i}) + Σ_{j≠y_i} exp(z_{i,j}) ) ],
where z_{i,j} is the (i,j)-th element of Z and the hyperparameter s adjusts the size of the margin; it is set to 5 in the invention.
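A minimal NumPy sketch of a label-distribution-aware margin loss of the kind used in step S105. The choice gamma_l = n_l^(-1/4) follows the cited LDAM paper and is stated here as an assumption; the function subtracts a scaled per-class margin from the true-class logit and then applies standard cross entropy. Names are illustrative.

```python
import numpy as np

def ldam_loss(Z, y, n_per_class, s=5.0):
    """Label-distribution-aware margin loss (after Cao et al., NeurIPS 2019).
    Z: (N, C) logits; y: (N,) integer labels; n_per_class: (C,) sample counts.
    Sparse classes get margins delta_l close to 1; frequent classes get
    smaller margins, and s scales all margins."""
    gamma = n_per_class ** -0.25
    delta = gamma / gamma.max()                     # delta = 1 for rarest class
    Z_m = Z.astype(float).copy()
    Z_m[np.arange(len(y)), y] -= s * delta[y]       # subtract margin on true class
    # standard cross entropy on the margin-adjusted logits
    Z_m -= Z_m.max(axis=1, keepdims=True)           # numerical stability
    log_p = Z_m - np.log(np.exp(Z_m).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

# With all-zero logits, plain cross entropy would give log(3); the margin
# penalizes the true class, so the LDAM loss is strictly larger.
Z = np.zeros((2, 3))
y = np.array([0, 2])
loss = ldam_loss(Z, y, n_per_class=np.array([100, 50, 10]))
```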
Further, in step S106, to train the designed fold recognition network, the invention initializes the network parameters of the multilayer perceptron using the default weight initialization method of the PyTorch deep learning framework. In particular, to initialize the weight matrix A of the normalized fully-connected layer, the invention first samples a matrix of the same size as A from the uniform distribution U(-1, 1), then normalizes each of its columns to a unit vector and uses the result to initialize A. In addition, the mini-batch size during training is set to 64, the optimizer is AdamW with a learning rate of 0.001, the weight-decay hyperparameter is set to 0.001, and the maximum number of training epochs is 10.
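The initialization of the normalized layer's weight matrix described in step S106 — uniform sampling from U(-1, 1) followed by column normalization — can be sketched as follows; the function name and default dimensions (2048-d features would use in_dim=1024 after FC2) are illustrative.

```python
import numpy as np

def init_normalized_weight(in_dim=1024, n_classes=245, seed=0):
    """Initialize the weight matrix A of the normalized fully-connected
    layer: sample entries from U(-1, 1), then scale each column to unit
    Euclidean length, as described in the text."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(in_dim, n_classes))
    A /= np.linalg.norm(A, axis=0, keepdims=True)
    return A

A = init_normalized_weight()
print(A.shape)  # (1024, 245)
```

After this initialization every class-weight column lies on the unit sphere, matching the cosine form of the normalized output layer.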
Further, in step S107, to predict the fold class of a given protein chain, step S102 and step S103 are first performed to represent it as a 2048-dimensional feature vector. The feature vector is then fed into the trained fold recognition network to obtain a 245-dimensional output vector of logit scores. Finally, the fold class corresponding to the largest component of the output vector is assigned to the protein chain. FIG. 3 shows the flow of predicting the fold class of a protein chain.
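The prediction rule of step S107 reduces to an argmax over the 245 logit scores; a trivial sketch with an illustrative function name:

```python
import numpy as np

def predict_fold(logits: np.ndarray) -> int:
    """Assign the fold class whose logit score is largest."""
    return int(np.argmax(logits))

# A 245-dimensional logit vector with its peak at class 42.
logits = np.zeros(245)
logits[42] = 3.5
print(predict_fold(logits))  # 42
```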
On the basis of the above embodiment, as shown in FIG. 4, the invention further provides a protein fold recognition system based on embedding features and an imbalanced classification loss, comprising:
a data set determination module for determining a protein fold training dataset and a test dataset, both comprising a plurality of protein chains;
an embedded matrix generation module for generating an embedding matrix for each protein chain in the protein fold training dataset using the pre-trained protein language model ProtT5-XL-UniRef50;
a feature vector derivation module for converting the embedding matrix into a fixed-length feature vector for the protein chain by computing column means and cosine similarities;
a model construction module for constructing a protein fold recognition network model, wherein the model is a multilayer perceptron consisting of three fully-connected layers, the last of which is a normalized fully-connected layer;
a loss function derivation module for adopting the label-distribution-aware margin loss designed for imbalanced classification as the loss function for training the fold recognition network model;
a model training module for training the fold recognition network model based on the protein fold training dataset and the loss function;
and a fold class identification module for predicting the fold classes of protein chains based on the protein fold test dataset and the trained fold recognition network model.
Further, the embedded matrix generation module is specifically configured to:
for any protein chain with the length of L in the protein folding training dataset, converting all characters in the amino acid sequence of the protein chain into capital characters, taking the converted amino acid sequence as the input of a model ProtT5-XL-UniRef50, finally running the model in a semi-precision mode and storing the output of an encoder of the model, and obtaining an embedded feature matrix with the size of L multiplied by 1024.
Further, the feature vector derivation module is specifically configured to:
for a given embedding matrix E of size lx 1024, first calculate the mean value of each column of the embedding matrix, obtain a length-1024 signature:
where l represents the number of rows of the embedding matrix E;
then, calculating the mean value of each row of the embedding matrix to obtain a vector with the length of L:
then f is calculatedrow_meanAnd obtaining a feature representation with the length of 1024 according to the cosine similarity of each column vector of the embedded matrix:
fcos_sim=[s1,s2,...,sj,...s1024]T
cosine similarity sjThe calculation is performed according to the following formula:
wherein <, > represents the inner product of two vectors, | | | · | | | represents the length of the vectors;
finally, two vectors fcol_meanAnd fcos_simThe protein chains are spliced into a vector to represent the characteristics of the protein chains, and each protein chain can be represented into a characteristic vector with 2048 dimensions through the operation.
Further, the model training module is further configured to:
and initializing network parameters of the multilayer perceptron by adopting a default weight initialization method of a PyTorch deep learning framework.
Further, the folding category identification module is specifically configured to:
the method comprises the steps of firstly executing an embedding matrix generation module and a feature vector obtaining module to represent protein chains in a protein folding test data set as feature vectors, then inputting the feature vectors into a trained folding recognition network model, and allocating the folding category with the highest score to the protein chains.
In summary, the invention first generates an embedding matrix for a given protein chain using the pre-trained protein language model ProtT5-XL-UniRef50, and then converts the embedding matrix into a fixed-length feature vector by computing column means and cosine similarities. Notably, by using embedding features, the invention avoids time-consuming multiple sequence alignment. In addition, because protein fold data exhibit a markedly imbalanced class distribution, the invention trains a multilayer perceptron with the label-distribution-aware margin loss designed for imbalanced classification tasks, thereby strengthening learning on sparse fold classes. The proposed protein fold recognition network model can therefore quickly and accurately predict the fold class of a given protein chain.
The foregoing describes only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the invention, and such modifications and improvements shall also fall within the protection scope of the invention.
Claims (10)
1. A method for identifying protein folding based on embedding features and unbalanced classification loss, comprising:
step 1: determining a protein folding training data set and a protein folding test data set, each comprising a plurality of protein chains;
step 2: generating an embedding matrix for each protein chain in the protein folding training data set by using the pre-trained protein language model ProtT5-XL-UniRef50;
step 3: converting the embedding matrix into a fixed-length feature vector of the protein chain by calculating mean values and cosine similarities;
step 4: constructing a protein folding recognition network model, wherein the protein folding recognition network model is a multilayer perceptron composed of three fully connected layers, and the last fully connected layer of the multilayer perceptron is a normalized fully connected layer;
step 5: adopting the label-distribution-aware margin loss designed for unbalanced classification as the loss function for training the folding recognition network model;
step 6: training the folding recognition network model based on the protein folding training data set and the loss function;
step 7: predicting the folding category of a protein chain based on the protein folding test data set and the trained folding recognition network model.
2. The method for identifying protein folding based on embedding features and unbalanced classification loss according to claim 1, wherein the step 2 comprises:
for any protein chain of length L in the protein folding training data set, converting all characters in the amino acid sequence of the protein chain to uppercase, using the converted amino acid sequence as the input of the model ProtT5-XL-UniRef50, running the model in half-precision mode and saving the output of the encoder of the model, thereby obtaining an embedding feature matrix of size L × 1024.
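A minimal input-preparation sketch for this embedding step is shown below. The claim itself only specifies uppercasing; the space-separation of residues and the mapping of rare residues to X follow the publicly documented usage of the Rostlab ProtT5 release on Hugging Face and are assumptions here, as are the commented-out model-loading lines, which are left inactive because they download several gigabytes of weights:

```python
import re

def prepare_sequence(seq: str) -> str:
    """Uppercase an amino-acid sequence and space-separate its residues."""
    seq = seq.upper()
    seq = re.sub(r"[UZOB]", "X", seq)  # map rare residues to X (ProtT5 convention)
    return " ".join(seq)

print(prepare_sequence("mkTayIAk"))    # M K T A Y I A K

# Hypothetical embedding extraction with the public ProtT5-XL-UniRef50 weights:
# from transformers import T5Tokenizer, T5EncoderModel
# import torch
# tok = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
# model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").half().eval()
# ids = tok(prepare_sequence("MKTAYIAK"), return_tensors="pt")
# with torch.no_grad():
#     emb = model(input_ids=ids.input_ids,
#                 attention_mask=ids.attention_mask).last_hidden_state
# emb = emb[0, :-1]  # drop the trailing special token -> L x 1024 per-residue matrix
```

Running the encoder in half precision (`.half()`) matches the "semi-precision mode" named in the claim and roughly halves the memory footprint of the 1024-wide embeddings.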
3. The method for identifying protein folding based on embedding features and unbalanced classification loss according to claim 2, wherein the step 3 comprises:
for a given embedding matrix E of size L × 1024, first calculating the mean of each column of the embedding matrix to obtain a feature representation of length 1024:
f_col_mean = [c_1, c_2, ..., c_j, ..., c_1024]^T, with c_j = (1/L) Σ_{i=1}^{L} E_{ij},
where L denotes the number of rows of the embedding matrix E;
then calculating the mean of each row of the embedding matrix to obtain a vector of length L:
f_row_mean = [r_1, r_2, ..., r_i, ..., r_L]^T, with r_i = (1/1024) Σ_{j=1}^{1024} E_{ij};
then calculating the cosine similarity between f_row_mean and each column vector of the embedding matrix to obtain a feature representation of length 1024:
f_cos_sim = [s_1, s_2, ..., s_j, ..., s_1024]^T,
where the cosine similarity s_j is calculated according to the following formula:
s_j = <f_row_mean, E_j> / (||f_row_mean|| · ||E_j||),
in which E_j denotes the j-th column of E, <·, ·> denotes the inner product of two vectors, and ||·|| denotes the length of a vector;
finally, concatenating the two vectors f_col_mean and f_cos_sim into a single vector representing the protein chain, whereby each protein chain is represented as a 2048-dimensional feature vector.
4. The method for identifying protein folding based on embedding features and unbalanced classification loss according to claim 1, wherein the step 6 further comprises:
initializing the network parameters of the multilayer perceptron with the default weight initialization method of the PyTorch deep learning framework.
5. The method of claim 1, wherein the step 7 comprises:
step 2 and step 3 are first performed to represent the protein chains in the protein folding test data set as feature vectors; the feature vectors are then input into the trained folding recognition network model, and the folding category with the highest score is assigned to each protein chain.
6. A protein folding identification system based on embedding features and unbalanced classification loss, comprising:
a data set determination module for determining a protein folding training data set and a test data set, both of which comprise a plurality of protein chains;
the embedded matrix generation module is used for generating an embedded matrix of a protein chain in the protein folding training data set by adopting a pre-trained protein language model ProtT5-XL-UniRef 50;
the characteristic vector obtaining module is used for converting the embedded matrix into a characteristic vector with a fixed length of a protein chain by calculating the similarity of a mean value and a cosine;
the model construction module is used for constructing a protein folding identification network model, the protein folding identification network model is a multilayer perceptron consisting of three fully-connected layers, and the last fully-connected layer of the multilayer perceptron adopts a normalized fully-connected layer;
a loss function obtaining module, configured to use the label-distribution-aware margin loss designed for unbalanced classification as the loss function for training the folding recognition network model;
the model training module is used for training a folding recognition network model based on a protein folding training data set and a loss function for training a folding recognition network;
and a folding category identification module, configured to predict the folding category of a protein chain based on the protein folding test data set and the trained folding recognition network model.
7. The system according to claim 6, wherein the embedding matrix generation module is specifically configured to:
for any protein chain of length L in the protein folding training data set, converting all characters in the amino acid sequence of the protein chain to uppercase, using the converted amino acid sequence as the input of the model ProtT5-XL-UniRef50, running the model in half-precision mode and saving the output of the encoder of the model, thereby obtaining an embedding feature matrix of size L × 1024.
8. The system according to claim 7, wherein the feature vector derivation module is specifically configured to:
for a given embedding matrix E of size L × 1024, first calculating the mean of each column of the embedding matrix to obtain a feature representation of length 1024:
f_col_mean = [c_1, c_2, ..., c_j, ..., c_1024]^T, with c_j = (1/L) Σ_{i=1}^{L} E_{ij},
where L denotes the number of rows of the embedding matrix E;
then calculating the mean of each row of the embedding matrix to obtain a vector of length L:
f_row_mean = [r_1, r_2, ..., r_i, ..., r_L]^T, with r_i = (1/1024) Σ_{j=1}^{1024} E_{ij};
then calculating the cosine similarity between f_row_mean and each column vector of the embedding matrix to obtain a feature representation of length 1024:
f_cos_sim = [s_1, s_2, ..., s_j, ..., s_1024]^T,
where the cosine similarity s_j is calculated according to the following formula:
s_j = <f_row_mean, E_j> / (||f_row_mean|| · ||E_j||),
in which E_j denotes the j-th column of E, <·, ·> denotes the inner product of two vectors, and ||·|| denotes the length of a vector;
finally, concatenating the two vectors f_col_mean and f_cos_sim into a single vector representing the protein chain, whereby each protein chain is represented as a 2048-dimensional feature vector.
9. The system of claim 6, wherein the model training module is further configured to:
the network parameters of the multilayer perceptron are initialized with the default weight initialization method of the PyTorch deep learning framework.
10. The system according to claim 6, wherein the fold class identification module is specifically configured to:
the embedding matrix generation module and the feature vector obtaining module are first executed to represent the protein chains in the protein folding test data set as feature vectors; the feature vectors are then input into the trained folding recognition network model, and the folding category with the highest score is assigned to each protein chain.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210110503.9A CN114550824B (en) | 2022-01-29 | 2022-01-29 | Protein folding identification method and system based on embedding characteristics and unbalanced classification loss |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114550824A true CN114550824A (en) | 2022-05-27 |
CN114550824B CN114550824B (en) | 2022-11-22 |
Family
ID=81674378
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210110503.9A Active CN114550824B (en) | 2022-01-29 | 2022-01-29 | Protein folding identification method and system based on embedding characteristics and unbalanced classification loss |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114550824B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020061510A1 (en) * | 2000-10-16 | 2002-05-23 | Jonathan Miller | Method and system for designing proteins and protein backbone configurations |
CN101794351A (en) * | 2010-03-09 | 2010-08-04 | 哈尔滨工业大学 | Protein secondary structure engineering prediction method based on large margin nearest central point |
CN111667884A (en) * | 2020-06-12 | 2020-09-15 | 天津大学 | Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism |
CN112116950A (en) * | 2020-09-10 | 2020-12-22 | 南京理工大学 | Protein folding identification method based on depth measurement learning |
CN112116949A (en) * | 2020-09-10 | 2020-12-22 | 南京理工大学 | Protein folding identification method based on triple loss |
CN112292693A (en) * | 2018-05-18 | 2021-01-29 | 渊慧科技有限公司 | Meta-gradient update of reinforcement learning system training return function |
CN113593633A (en) * | 2021-08-02 | 2021-11-02 | 中国石油大学(华东) | Drug-protein interaction prediction model based on convolutional neural network |
CN113593631A (en) * | 2021-08-09 | 2021-11-02 | 山东大学 | Method and system for predicting protein-polypeptide binding site |
CN113643756A (en) * | 2021-08-09 | 2021-11-12 | 安徽工业大学 | Protein interaction site prediction method based on deep learning |
CN113727994A (en) * | 2019-05-02 | 2021-11-30 | 德克萨斯大学董事会 | System and method for improving stability of synthetic protein |
Non-Patent Citations (3)
Title |
---|
BURCU ÇARKLI YAVUZ et al.: "Prediction of Protein Secondary Structure With Clonal Selection Algorithm and Multilayer Perceptron", IEEE Access * |
ZHANG Guangya et al.: "Protein recognition with a multilayer perceptron model using the BP algorithm", Journal of Huaqiao University (Natural Science) * |
PAN Yue: "Minimal feature extraction of protein fold types based on convolutional neural networks", China Masters' Theses Full-text Database, Basic Sciences * |
Also Published As
Publication number | Publication date |
---|---|
CN114550824B (en) | 2022-11-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hoff et al. | Gene prediction in metagenomic fragments: a large scale machine learning approach | |
CN112614538A (en) | Antibacterial peptide prediction method and device based on protein pre-training characterization learning | |
CN112711953A (en) | Text multi-label classification method and system based on attention mechanism and GCN | |
CN111626048A (en) | Text error correction method, device, equipment and storage medium | |
Chang et al. | Protein motif extraction with neuro-fuzzy optimization | |
CN112035620B (en) | Question-answer management method, device, equipment and storage medium of medical query system | |
US20190073443A1 (en) | Methods and systems for producing an expanded training set for machine learning using biological sequences | |
CN112988963B (en) | User intention prediction method, device, equipment and medium based on multi-flow nodes | |
Cai et al. | Evolutionary computation techniques for multiple sequence alignment | |
CN114420212A (en) | Escherichia coli strain identification method and system | |
US20230237084A1 (en) | Method and apparatus for question-answering using a database consist of query vectors | |
CN116206688A (en) | Multi-mode information fusion model and method for DTA prediction | |
CN112632264A (en) | Intelligent question and answer method and device, electronic equipment and storage medium | |
Dotan et al. | Effect of tokenization on transformers for biological sequences | |
CN114550824B (en) | Protein folding identification method and system based on embedding characteristics and unbalanced classification loss | |
CN113656601A (en) | Doctor-patient matching method, device, equipment and storage medium | |
CN113450870A (en) | Method and system for matching drug with target protein | |
Guerrini et al. | Lightweight metagenomic classification via eBWT | |
Delteil et al. | MATrIX--Modality-Aware Transformer for Information eXtraction | |
Zou et al. | Predicting RNA secondary structure based on the class information and Hopfield network | |
CN115497105A (en) | Multi-modal hate cause detection method based on multi-task learning network | |
KR102187586B1 (en) | Data processing apparatus and method for discovering new drug candidates | |
CN114706971A (en) | Biomedical document type determination method and device | |
WO2020138588A1 (en) | Data processing device and method for discovering new drug candidate material | |
Ogul et al. | Subcellular localization prediction with new protein encoding schemes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||