CN114550824A - Protein folding identification method and system based on embedding characteristics and unbalanced classification loss - Google Patents


Info

Publication number
CN114550824A
CN114550824A (application CN202210110503.9A; granted as CN114550824B)
Authority
CN
China
Prior art keywords
protein
folding
training
model
vector
Prior art date
Legal status
Granted
Application number
CN202210110503.9A
Other languages
Chinese (zh)
Other versions
CN114550824B (en)
Inventor
Zhang Lei (张蕾)
Yang Wei (杨伟)
Wen Yunguang (文云光)
Current Assignee
Henan University
Original Assignee
Henan University
Priority date
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202210110503.9A priority Critical patent/CN114550824B/en
Publication of CN114550824A publication Critical patent/CN114550824A/en
Application granted granted Critical
Publication of CN114550824B publication Critical patent/CN114550824B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10: Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

The invention discloses a protein fold recognition method and system based on embedding features and an unbalanced classification loss. The method first generates an embedding matrix for a given protein chain using the pre-trained protein language model ProtT5-XL-UniRef50, then converts the embedding matrix into a fixed-length feature vector by computing mean values and cosine similarities. By using embedding features, the invention avoids time-consuming multiple sequence alignment operations. In addition, since protein fold data exhibit a markedly imbalanced class distribution, the invention trains the multilayer perceptron network with the label-distribution-aware margin loss designed for imbalanced classification tasks, thereby enhancing the learning of sparse fold classes. The protein fold recognition network model provided by the invention can thus quickly and accurately predict the fold class of a given protein chain.

Description

Protein folding identification method and system based on embedding characteristics and unbalanced classification loss
Technical Field
The invention belongs to the technical field of computational biology, and particularly relates to a protein folding identification method and system based on embedding characteristics and unbalanced classification loss.
Background
Proteins are biological macromolecules composed of 20 standard amino acid types and play a very important role in many biological processes. The amino acid sequence of a protein determines its tertiary structure, and its function depends significantly on that tertiary structure. Owing to the rapid development of protein sequencing technology, the number of proteins in sequence databases currently far exceeds the number of proteins of known structure. Computational methods for predicting protein structure have therefore become an essential means of narrowing the gap between the number of known sequences and the number of known structures. In tertiary structure prediction, an important subtask is finding proteins with similar structures: for a protein of unknown structure, when a structurally similar protein exists in the PDB database, its structure can be accurately modeled by taking that protein as a template. In particular, protein fold recognition helps to find such structurally similar proteins.
The recently released SCOPe 2.08 database divides protein structures into 12 broad classes: all-alpha proteins, all-beta proteins, alpha/beta proteins, alpha+beta proteins, multi-domain proteins, membrane and cell surface proteins and peptides, small proteins, coiled-coil proteins, low-resolution protein structures, peptides, designed proteins, and artifacts. The stable-release statistics published for SCOPe 2.08 cover the first 7 structural classes, which contain 290, 180, 148, 396, 74, 69, and 100 fold classes respectively, for a total of 1257 fold classes, 2067 superfamilies, and 5084 families. In particular, as new protein structures are released, the number of known superfamilies varies somewhat [Chandonia, J.M., N.K. Fox, and S.E. Brenner, SCOPe: classification of large macromolecular structures in the structural classification of proteins-extended database. Nucleic Acids Res, 2019. 47(D1): p. D475-D481]. If two different superfamilies are found to be related to each other owing to the appearance of a new structure, they are merged into the same superfamily. Furthermore, a single domain may be split into multiple domains owing to the discovery of different evolutionary relationships.
Currently, many machine learning algorithms, such as support vector machines [Yan, K., et al., Protein Fold Recognition by Combining Support Vector Machines and Pairwise Sequence Similarity Scores. IEEE/ACM Trans Comput Biol Bioinform, 2021. 18(5): p. 2008-2016], neural networks [Villegas-Morcillo, A., V. Sanchez, and A.M. Gomez, FoldHSphere: deep hyperspherical embeddings for protein fold recognition. BMC Bioinformatics, 2021. 22(1): p. 490], ensemble classifiers, and random forests, have been successfully applied to fold recognition. However, these existing methods not only ignore the imbalance of the fold dataset when training the model, but also mostly use PSSM profile features or HMM profile features as the main fold recognition features. Notably, generating these profile features requires time-consuming multiple sequence alignment against large-scale sequence databases.
Disclosure of Invention
Aiming at the problems that existing protein fold recognition methods ignore the imbalance of fold datasets when training the model, and that they mostly rely on PSSM (position-specific scoring matrix) profile features or HMM profile features, whose generation requires time-consuming multiple sequence alignment against large-scale sequence databases, the invention provides a protein fold recognition method and system that performs fold recognition with embedding features derived from a pre-trained protein language model and introduces an unbalanced classification loss during training to enhance the learning of sparse fold classes. The fold recognition method provided by the invention can quickly and accurately predict the fold class of a protein.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a protein folding identification method based on embedding characteristics and unbalanced classification loss on one hand, which comprises the following steps:
Step 1: determining a protein fold training dataset and a test dataset, each comprising a plurality of protein chains;
Step 2: generating an embedding matrix for each protein chain in the protein fold training dataset using the pre-trained protein language model ProtT5-XL-UniRef50;
Step 3: converting the embedding matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
Step 4: constructing a protein fold recognition network model, wherein the model is a multilayer perceptron consisting of three fully connected layers, the last of which is a normalized fully connected layer;
Step 5: adopting the label-distribution-aware margin loss designed for unbalanced classification as the loss function for training the fold recognition network model;
Step 6: training the fold recognition network model based on the protein fold training dataset and the loss function;
Step 7: predicting the fold class of protein chains based on the protein fold test dataset and the trained fold recognition network model.
Further, the step 2 comprises:
For any protein chain of length L in the protein fold training dataset, first convert all characters in its amino acid sequence to uppercase, take the converted sequence as the input of the model ProtT5-XL-UniRef50, then run the model in half precision and save the output of its encoder, obtaining an embedding feature matrix of size L × 1024.
Further, the step 3 comprises:
For a given embedding matrix E of size L × 1024, first calculate the mean of each column of the embedding matrix to obtain a feature representation of length 1024:

f_col_mean = [m_1, m_2, ..., m_1024]^T, where m_j = (1/L) * Σ_{i=1}^{L} E_{ij} and L is the number of rows of the embedding matrix E;

then calculate the mean of each row of the embedding matrix to obtain a vector of length L:

f_row_mean = [r_1, r_2, ..., r_L]^T, where r_i = (1/1024) * Σ_{j=1}^{1024} E_{ij};

then calculate the cosine similarity between f_row_mean and each column vector of the embedding matrix to obtain a feature representation of length 1024:

f_cos_sim = [s_1, s_2, ..., s_j, ..., s_1024]^T,

where the cosine similarity s_j is computed as

s_j = <f_row_mean, E_:j> / (||f_row_mean|| * ||E_:j||),

with E_:j the j-th column of E, <·,·> the inner product of two vectors, and ||·|| the length (Euclidean norm) of a vector;

finally, concatenate the two vectors f_col_mean and f_cos_sim into a single vector representing the features of the protein chain; through the above operations each protein chain can be represented as a feature vector of 2048 dimensions.
Further, the step 6 further includes:
and initializing network parameters of the multilayer perceptron by adopting a default weight initialization method of a PyTorch deep learning framework.
Further, the step 7 includes:
step 2 and step 3 are performed first to represent the protein chains in the protein folding test dataset as feature vectors, which are then input into a trained folding recognition network model, and the folding class with the highest score is assigned to the protein chain.
In another aspect, the present invention provides a protein fold identification system based on embedding features and unbalanced classification loss, comprising:
a data set determination module for determining a protein folding training data set and a test data set, both of which comprise a plurality of protein chains;
the embedded matrix generation module is used for generating an embedded matrix of a protein chain in the protein folding training data set by adopting a pre-trained protein language model ProtT5-XL-UniRef 50;
a feature vector obtaining module for converting the embedding matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
the model building module is used for building a protein folding identification network model, the protein folding identification network model is a multilayer perceptron consisting of three full connection layers, and the last full connection layer of the multilayer perceptron adopts a normalized full connection layer;
a loss function obtaining module, configured to use the label-distribution-aware margin loss designed for unbalanced classification as the loss function for training the fold recognition network model;
the model training module is used for training a folding recognition network model based on a protein folding training data set and a loss function for training a folding recognition network;
and the folding type identification module is used for predicting the folding type of the protein chain based on the protein folding test data set and the trained folding identification network model.
Further, the embedded matrix generation module is specifically configured to:
For any protein chain of length L in the protein fold training dataset, first convert all characters in its amino acid sequence to uppercase, take the converted sequence as the input of the model ProtT5-XL-UniRef50, then run the model in half precision and save the output of its encoder, obtaining an embedding feature matrix of size L × 1024.
Further, the feature vector derivation module is specifically configured to:
For a given embedding matrix E of size L × 1024, first calculate the mean of each column of the embedding matrix to obtain a feature representation of length 1024:

f_col_mean = [m_1, m_2, ..., m_1024]^T, where m_j = (1/L) * Σ_{i=1}^{L} E_{ij} and L is the number of rows of the embedding matrix E;

then calculate the mean of each row of the embedding matrix to obtain a vector of length L:

f_row_mean = [r_1, r_2, ..., r_L]^T, where r_i = (1/1024) * Σ_{j=1}^{1024} E_{ij};

then calculate the cosine similarity between f_row_mean and each column vector of the embedding matrix to obtain a feature representation of length 1024:

f_cos_sim = [s_1, s_2, ..., s_j, ..., s_1024]^T,

where the cosine similarity s_j is computed as

s_j = <f_row_mean, E_:j> / (||f_row_mean|| * ||E_:j||),

with E_:j the j-th column of E, <·,·> the inner product of two vectors, and ||·|| the length (Euclidean norm) of a vector;

finally, concatenate the two vectors f_col_mean and f_cos_sim into a single vector representing the features of the protein chain; through the above operations each protein chain can be represented as a feature vector of 2048 dimensions.
Further, the model training module is further configured to:
and initializing network parameters of the multilayer perceptron by adopting a default weight initialization method of a PyTorch deep learning framework.
Further, the folding category identifying module is specifically configured to:
the method comprises the steps of firstly executing an embedding matrix generation module and a feature vector obtaining module to represent protein chains in a protein folding test data set as feature vectors, then inputting the feature vectors into a trained folding recognition network model, and allocating the folding category with the highest score to the protein chains.
Compared with the prior art, the invention has the following beneficial effects:
the method comprises the steps of firstly generating an embedded matrix of a given protein chain by adopting a pre-trained protein language model ProtT5-XL-UniRef50, and then converting the embedded matrix into a feature vector with a fixed length by calculating mean value and cosine similarity. In particular, by using the embedding feature, the present invention avoids time consuming multiple sequence alignment operations. In addition, considering that protein folding data has obvious unbalanced class distribution, the invention adopts a multilayer perceptron network designed for label distribution conscious interval loss training of unbalanced classification tasks, and thus the learning capacity of sparse folding classes is enhanced. In conclusion, the protein folding recognition network model provided by the invention can quickly and accurately predict the folding category of a given protein chain.
Drawings
FIG. 1 is a basic flow chart of a protein fold identification method based on embedding features and unbalanced classification loss according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a protein folding recognition network structure according to an embodiment of the present invention;
FIG. 3 is a flow chart of protein folding class prediction according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a protein fold recognition system architecture based on embedding features and unbalanced classification loss according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
As shown in FIG. 1, a protein fold identification method based on embedding features and unbalanced classification loss comprises:
Step S101: determining a protein fold training dataset and a test dataset, each comprising a plurality of protein chains;
Step S102: generating an embedding matrix for each protein chain in the protein fold training dataset using the pre-trained protein language model ProtT5-XL-UniRef50;
Step S103: converting the embedding matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
Step S104: constructing a protein fold recognition network model, wherein the model is a multilayer perceptron consisting of three fully connected layers, the last of which is a normalized fully connected layer;
Step S105: adopting the label-distribution-aware margin loss designed for unbalanced classification as the loss function for training the fold recognition network model;
Step S106: training the fold recognition network model based on the protein fold training dataset and the loss function;
Step S107: predicting the fold class of protein chains based on the protein fold test dataset and the trained fold recognition network model.
Further, in step S101, the invention adopts the dataset of the protein structure prediction challenge held on the iFLYTEK open platform in 2021 (https://challenge.xfyun.cn/topic/info?type=protein&ch=dc-web-01) as the protein fold dataset. The dataset is derived from the protein structure classification database ASTRAL SCOPe 2.07 and consists of 11843 protein chains, with sequence identity between any two chains below 40%. In particular, to ensure that each fold class has enough samples for training, every selected fold class contains no fewer than 10 protein chains. In addition, the dataset is further divided into a training set of 9472 protein chains and a test set of 2371 protein chains. The number of fold classes in the dataset is 245.
Further, in step S102, the invention first downloads the pre-trained protein language model ProtT5-XL-UniRef50 from https://zenodo.org/record/4644188#.YZxmZcVBy70. Then, for any given protein chain of length L, all characters in its amino acid sequence are converted to uppercase, the converted sequence is taken as the input of the model ProtT5-XL-UniRef50, and the model is run in half precision with the output of its encoder saved, yielding an embedding feature matrix of size L × 1024. This means that each amino acid residue corresponds to a 1024-dimensional embedding feature.
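The embedding step above can be sketched as follows, assuming the publicly released HuggingFace checkpoint "Rostlab/prot_t5_xl_uniref50" and the `transformers` library; the function names and the mapping of rare residue codes (U, Z, O, B) to X follow common ProtT5 usage examples and are illustrative, not prescribed by the patent.

```python
import re

def preprocess(seq: str) -> str:
    """Uppercase the sequence, map rare amino-acid codes (U, Z, O, B) to X,
    and insert spaces between residues as the ProtT5 tokenizer expects."""
    seq = re.sub(r"[UZOB]", "X", seq.upper())
    return " ".join(seq)

def embed(seq: str):
    """Return an L x 1024 embedding matrix for one protein chain."""
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    name = "Rostlab/prot_t5_xl_uniref50"
    tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(name).half().eval()  # half precision

    ids = tokenizer(preprocess(seq), return_tensors="pt")
    with torch.no_grad():
        out = model(**ids).last_hidden_state   # (1, L+1, 1024); final token is </s>
    return out[0, : len(seq)].float()          # L x 1024 embedding matrix
```

The model call in `embed` downloads the full checkpoint, so it is kept inside the function and not executed at import time; only the lightweight `preprocess` helper runs eagerly.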
Further, in step S103, in order to predict the fold class of a given protein chain, protein chains of different lengths need to be represented as feature vectors of fixed size. For a given embedding matrix E of size L × 1024, the invention first calculates the mean of each column of the embedding matrix to obtain a feature representation of length 1024:

f_col_mean = [m_1, m_2, ..., m_1024]^T, where m_j = (1/L) * Σ_{i=1}^{L} E_{ij} and L is the number of rows of the embedding matrix E;

then the mean of each row of the embedding matrix is calculated to obtain a vector of length L:

f_row_mean = [r_1, r_2, ..., r_L]^T, where r_i = (1/1024) * Σ_{j=1}^{1024} E_{ij};

then the cosine similarity between f_row_mean and each column vector of the embedding matrix is calculated to obtain a feature representation of length 1024:

f_cos_sim = [s_1, s_2, ..., s_j, ..., s_1024]^T.

In particular, the cosine similarity s_j is computed as

s_j = <f_row_mean, E_:j> / (||f_row_mean|| * ||E_:j||),

where E_:j denotes the j-th column of E, <·,·> denotes the inner product of two vectors, and ||·|| denotes the length (Euclidean norm) of a vector.

Finally, the invention concatenates the two vectors f_col_mean and f_cos_sim into a single vector representing the features of the protein chain. Clearly, through the above operations each protein chain can be represented as a feature vector of 2048 dimensions.
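The fixed-length feature construction just described can be expressed compactly; a minimal NumPy sketch (the function name is illustrative), where E stands for the L × 1024 embedding matrix produced in step S102:

```python
import numpy as np

def chain_features(E: np.ndarray) -> np.ndarray:
    """Convert an L x 1024 embedding matrix into a fixed 2048-d feature vector."""
    f_col_mean = E.mean(axis=0)                      # length 1024: mean of each column
    f_row_mean = E.mean(axis=1)                      # length L: mean of each row
    # Cosine similarity between f_row_mean and each column of E.
    denom = np.linalg.norm(f_row_mean) * np.linalg.norm(E, axis=0)
    f_cos_sim = (f_row_mean @ E) / denom             # length 1024
    return np.concatenate([f_col_mean, f_cos_sim])   # length 2048
```

Chains of any length L map to the same 2048-dimensional representation, which is what allows a fixed-size perceptron to consume them in the next step.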
Further, in step S104, the fold recognition network model designed by the invention is a multilayer perceptron composed of three fully connected layers. FIG. 2 shows the model architecture of the fold recognition network, where X denotes an input matrix of size N × 2048, Z denotes an output matrix of size N × 245, N denotes the number of protein chains in a mini-batch, and FC denotes a fully connected layer performing a linear transformation on its input. Each of the first two fully connected layers is followed by a ReLU activation. In particular, the last layer of the multilayer perceptron is a normalized fully connected layer. For a given input vector x, the normalized fully connected layer computes the output vector z as follows:

z_l = τ * <a_l, x> / (||a_l|| * ||x||), l = 1, 2, ..., 245,

where a_l denotes the l-th column of the weight matrix A of the fully connected layer and the hyperparameter τ is a scaling factor. In the present invention, we set τ to 18.
It is worth noting that the invention does not employ a deeper network model for protein fold recognition. This is because the embedding features employed are derived from the pre-trained deep protein language model ProtT5-XL-UniRef50. Since ProtT5-XL-UniRef50 is a deep Transformer network with about 3 billion parameters, the output features of its encoder already contain contextual information for each residue, so a deeper network model is not required for fold recognition based on embedding features. In the specific implementation, the output dimension of fully connected layer FC1 is 2048 and that of FC2 is 1024.
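A sketch of the three-layer perceptron of FIG. 2 under the dimensions stated above (2048 → 2048 → 1024 → 245, τ = 18). Note that PyTorch stores a linear layer's weight as (out_features, in_features), so each row of `weight` below plays the role of a column a_l of the matrix A; class and attribute names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedLinear(nn.Module):
    """Cosine classifier: z_l = tau * <a_l, x> / (||a_l|| * ||x||)."""
    def __init__(self, in_features: int, out_features: int, tau: float = 18.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features).uniform_(-1, 1))
        self.tau = tau

    def forward(self, x):
        # Normalize both the input rows and the class-weight rows, then scale.
        return self.tau * F.normalize(x, dim=1) @ F.normalize(self.weight, dim=1).t()

class FoldNet(nn.Module):
    """Three-layer perceptron: FC1 (2048) -> ReLU -> FC2 (1024) -> ReLU -> normalized FC."""
    def __init__(self, n_classes: int = 245, tau: float = 18.0):
        super().__init__()
        self.fc1 = nn.Linear(2048, 2048)
        self.fc2 = nn.Linear(2048, 1024)
        self.fc3 = NormalizedLinear(1024, n_classes, tau)

    def forward(self, x):
        return self.fc3(F.relu(self.fc2(F.relu(self.fc1(x)))))
```

Because the final layer outputs scaled cosines, every logit is bounded by τ in absolute value, which keeps the margins added by the loss in step S105 on a comparable scale across classes.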
Further, in step S105, a statistical analysis of the training dataset shows that, of the 245 fold classes, 17.96% contain no fewer than 50 samples, 35.92% contain between 20 and 50 samples, and 46.12% contain fewer than 20 samples. The protein fold recognition problem is therefore clearly an unbalanced classification problem. In particular, when the fold recognition network is trained with the common cross-entropy loss, the information of sparse classes tends to be overwhelmed by that of frequent classes during training, so only limited classification accuracy can be obtained. To this end, the invention trains the fold recognition network with the label-distribution-aware margin (LDAM) loss designed for unbalanced classification [Cao, K., et al., Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems 32, 2019], so as to enhance the learning of sparse classes. Specifically, let n_l denote the number of protein chains of the l-th fold class in the training dataset, and define the vector γ = [γ_1, γ_2, ..., γ_245], where

γ_l = n_l^(-1/4).

The label-dependent margin of the l-th fold class is then defined as Δ_l = γ_l / γ_max, where γ_max denotes the maximum value of the vector γ. Note that sparse fold classes have larger margins. Based on the output matrix Z of the fold recognition network and the corresponding label vector y, the label-distribution-aware margin loss is defined as

L = -(1/N) * Σ_{i=1}^{N} log [ exp(s * (z_{i,y_i} - Δ_{y_i})) / ( exp(s * (z_{i,y_i} - Δ_{y_i})) + Σ_{l≠y_i} exp(s * z_{i,l}) ) ],

where z_{i,l} denotes the element of Z in row i and column l, and the hyperparameter s adjusts the size of the margin; in the present invention we set it to 5.
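The margin computation and loss above can be sketched in PyTorch following the public LDAM formulation (Cao et al., 2019): per-class margins decreasing in the class size are subtracted from the true-class logit before a scaled cross-entropy. The function name and exact exponent normalization are illustrative.

```python
import torch
import torch.nn.functional as F

def ldam_loss(z: torch.Tensor, y: torch.Tensor,
              class_counts: torch.Tensor, s: float = 5.0) -> torch.Tensor:
    """z: (N, C) logits; y: (N,) integer labels; class_counts: (C,) samples per class."""
    gamma = class_counts.float() ** -0.25       # gamma_l = n_l^(-1/4)
    delta = gamma / gamma.max()                 # sparse classes get larger margins
    # Subtract the margin only at each sample's true class, then scale by s.
    z_m = z - delta[None, :] * F.one_hot(y, z.size(1))
    return F.cross_entropy(s * z_m, y)
```

Since a larger margin lowers the true-class logit, a sample from a sparse class incurs a higher loss at the same logits, which is exactly the mechanism that strengthens learning of sparse fold classes.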
Further, in step S106, in order to train the designed fold recognition network, the invention initializes the network parameters of the multilayer perceptron with the default weight initialization method of the PyTorch deep learning framework. In particular, to initialize the weight matrix A of the normalized fully connected layer, the invention first samples a matrix of the same size as A from the uniform distribution U(-1,1), then normalizes each of its columns to a unit vector and uses the result to initialize A. In addition, during training the mini-batch size is set to 64, the optimizer is AdamW with a learning rate of 0.001, the weight decay hyperparameter is set to 0.001, and the maximum number of epochs is 10.
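The initialization of the normalized layer's weight matrix A described above, sampling from U(-1,1) and then normalizing each column to a unit vector, can be sketched as follows; the AdamW configuration is shown as a comment, and all names are illustrative.

```python
import torch

def init_normalized_weight(out_features: int, in_features: int) -> torch.Tensor:
    """Sample a matrix of the target size from U(-1, 1), then scale each
    column to unit Euclidean length, as described for the weight matrix A."""
    A = torch.empty(out_features, in_features).uniform_(-1.0, 1.0)
    return A / A.norm(dim=0, keepdim=True)

# Training configuration stated in the text (net is the fold recognition network):
# optimizer = torch.optim.AdamW(net.parameters(), lr=0.001, weight_decay=0.001)
# mini-batch size 64, at most 10 epochs
```

Column-normalizing the initial weights matches the cosine form of the normalized layer, so the initial logits already lie in the [-τ, τ] range the layer produces after training updates.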
Further, in step S107, in order to predict the fold class of a given protein chain, step S102 and step S103 are first performed to represent the chain as a 2048-dimensional feature vector. The feature vector is then fed into the trained fold recognition network, yielding a 245-dimensional output vector (logit score vector). Finally, the fold class corresponding to the largest element of the output vector is assigned to the protein chain. FIG. 3 shows the flow chart of predicting the fold class of a protein chain.
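Step S107 reduces to an argmax over the network's logit vector; a minimal sketch, assuming `model` is the trained fold recognition network and `features` is the 2048-dimensional vector from step S103 (names illustrative):

```python
import torch

def predict_fold(model: torch.nn.Module, features: torch.Tensor) -> int:
    """Return the index of the highest-scoring fold class for one protein chain."""
    model.eval()
    with torch.no_grad():
        logits = model(features.unsqueeze(0))  # (1, n_classes) logit score vector
    return int(logits.argmax(dim=1).item())
```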
On the basis of the above embodiment, as shown in FIG. 4, the present invention further provides a protein fold recognition system based on embedding features and unbalanced classification loss, comprising:
a data set determination module for determining a protein folding training data set and a test data set, both of which comprise a plurality of protein chains;
the embedded matrix generation module is used for generating an embedded matrix of a protein chain in the protein folding training data set by adopting a pre-trained protein language model ProtT5-XL-UniRef 50;
a feature vector obtaining module for converting the embedding matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
the model construction module is used for constructing a protein folding identification network model, the protein folding identification network model is a multilayer perceptron consisting of three fully-connected layers, and the last fully-connected layer of the multilayer perceptron adopts a normalized fully-connected layer;
a loss function obtaining module, configured to use the label-distribution-aware margin loss designed for unbalanced classification as the loss function for training the fold recognition network model;
the model training module is used for training a folding recognition network model based on a protein folding training data set and a loss function for training a folding recognition network;
and the folding type identification module is used for predicting the folding type of the protein chain based on the protein folding test data set and the trained folding identification network model.
Further, the embedded matrix generation module is specifically configured to:
For any protein chain of length L in the protein fold training dataset, first convert all characters in its amino acid sequence to uppercase, take the converted sequence as the input of the model ProtT5-XL-UniRef50, then run the model in half precision and save the output of its encoder, obtaining an embedding feature matrix of size L × 1024.
Further, the feature vector derivation module is specifically configured to:
For a given embedding matrix E of size L × 1024, first calculate the mean of each column of the embedding matrix to obtain a feature representation of length 1024:

f_col_mean = [m_1, m_2, ..., m_1024]^T, where m_j = (1/L) * Σ_{i=1}^{L} E_{ij} and L is the number of rows of the embedding matrix E;

then calculate the mean of each row of the embedding matrix to obtain a vector of length L:

f_row_mean = [r_1, r_2, ..., r_L]^T, where r_i = (1/1024) * Σ_{j=1}^{1024} E_{ij};

then calculate the cosine similarity between f_row_mean and each column vector of the embedding matrix to obtain a feature representation of length 1024:

f_cos_sim = [s_1, s_2, ..., s_j, ..., s_1024]^T,

where the cosine similarity s_j is computed as

s_j = <f_row_mean, E_:j> / (||f_row_mean|| * ||E_:j||),

with E_:j the j-th column of E, <·,·> the inner product of two vectors, and ||·|| the length (Euclidean norm) of a vector;

finally, concatenate the two vectors f_col_mean and f_cos_sim into a single vector representing the features of the protein chain; through the above operations each protein chain can be represented as a feature vector of 2048 dimensions.
Further, the model training module is further configured to:
initializing the network parameters of the multilayer perceptron by adopting the default weight initialization method of the PyTorch deep learning framework.
Further, the folding category identification module is specifically configured to:
the embedded matrix generation module and the feature vector derivation module are first executed to represent the protein chains in the protein folding test dataset as feature vectors; the feature vectors are then input into the trained folding recognition network model, and the folding category with the highest score is assigned to each protein chain.
In summary, the invention first generates an embedding matrix for a given protein chain using the pre-trained protein language model ProtT5-XL-UniRef50, and then converts the embedding matrix into a fixed-length feature vector by computing mean values and cosine similarities. Notably, by using embedding features, the invention avoids time-consuming multiple sequence alignment operations. In addition, considering that protein folding data exhibit a markedly unbalanced class distribution, the invention trains the multilayer perceptron network with the label-distribution-aware margin loss designed for unbalanced classification tasks, thereby strengthening learning on sparse folding categories. As a result, the protein folding recognition network model provided by the invention can quickly and accurately predict the folding category of a given protein chain.
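For concreteness, the classifier and loss can be sketched in PyTorch; the hidden width, margin cap, and scaling factor below are illustrative hyperparameters not specified here, and the per-class margin proportional to $n_j^{-1/4}$ follows the label-distribution-aware margin (LDAM) loss of Cao et al. (NeurIPS 2019):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedLinear(nn.Module):
    """Final layer: L2-normalizes both features and class weights, so the
    output logits are cosine similarities in [-1, 1]."""
    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, in_features))

    def forward(self, x):
        return F.normalize(x, dim=1) @ F.normalize(self.weight, dim=1).t()

class FoldMLP(nn.Module):
    """Three fully-connected layers; the last one is normalized."""
    def __init__(self, num_classes: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2048, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            NormalizedLinear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)

class LDAMLoss(nn.Module):
    """Label-distribution-aware margin loss: the true-class logit is reduced
    by a margin proportional to n_j ** -0.25, so sparse folding categories
    receive larger margins."""
    def __init__(self, class_counts, max_margin: float = 0.5, scale: float = 30.0):
        super().__init__()
        m = torch.tensor(class_counts, dtype=torch.float).pow(-0.25)
        self.margins = m * (max_margin / m.max())
        self.scale = scale

    def forward(self, logits, target):
        adjusted = logits.clone()
        adjusted[torch.arange(len(target)), target] -= self.margins[target]
        return F.cross_entropy(self.scale * adjusted, target)

# illustrative usage with 5 fold classes and imbalanced per-class counts
model = FoldMLP(num_classes=5)
x = torch.randn(8, 2048)
logits = model(x)
target = torch.randint(0, 5, (8,))
loss = LDAMLoss(class_counts=[120, 40, 15, 6, 2])(logits, target)
```

At prediction time the folding category for each test chain is simply `logits.argmax(dim=1)`.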
The above describes only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also fall within the protection scope of the present invention.

Claims (10)

1. A method for identifying protein folding based on embedding features and unbalanced classification loss, comprising:
step 1: determining a protein folding training dataset and a test dataset, each comprising a plurality of protein chains;
step 2: generating an embedded matrix of a protein chain in the protein folding training dataset by adopting the pre-trained protein language model ProtT5-XL-UniRef50;
step 3: converting the embedded matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
step 4: constructing a protein folding recognition network model, wherein the protein folding recognition network model is a multilayer perceptron composed of three fully-connected layers, and the last fully-connected layer of the multilayer perceptron is a normalized fully-connected layer;
step 5: adopting the label-distribution-aware margin loss for unbalanced classification as the loss function for training the folding recognition network model;
step 6: training a folding recognition network model based on a protein folding training data set and a loss function for training a folding recognition network;
step 7: predicting the folding category of the protein chain based on the protein folding test dataset and the trained folding recognition network model.
2. The method for identifying protein folding based on embedding features and unbalanced classification loss according to claim 1, wherein the step 2 comprises:
for any protein chain of length l in the protein folding training dataset, converting all characters of its amino acid sequence to uppercase, taking the converted amino acid sequence as the input of the model ProtT5-XL-UniRef50, running the model in half-precision mode and saving the output of the model's encoder, thereby obtaining an embedded feature matrix of size l × 1024.
3. The method for identifying protein folding based on embedding features and unbalanced classification loss according to claim 2, wherein the step 3 comprises:
for a given embedding matrix E of size l × 1024, first calculating the mean of each column of the embedding matrix to obtain a feature vector of length 1024:

$$f_{\mathrm{col\_mean}} = \frac{1}{l}\sum_{i=1}^{l} E_{i,:}$$

where $l$ represents the number of rows of the embedding matrix $E$ and $E_{i,:}$ denotes its $i$-th row;

then calculating the mean of each row of the embedding matrix to obtain a vector of length l:

$$f_{\mathrm{row\_mean}} = \frac{1}{1024}\sum_{j=1}^{1024} E_{:,j}$$

where $E_{:,j}$ denotes the $j$-th column of $E$;

then calculating the cosine similarity between $f_{\mathrm{row\_mean}}$ and each column vector of the embedding matrix to obtain a feature representation of length 1024:

$$f_{\mathrm{cos\_sim}} = [s_1, s_2, \ldots, s_j, \ldots, s_{1024}]^{T}$$

where the cosine similarity $s_j$ is calculated according to the following formula:

$$s_j = \frac{\langle f_{\mathrm{row\_mean}},\, E_{:,j}\rangle}{\lVert f_{\mathrm{row\_mean}}\rVert \,\lVert E_{:,j}\rVert}$$

where $\langle\cdot,\cdot\rangle$ represents the inner product of two vectors and $\lVert\cdot\rVert$ represents the length (Euclidean norm) of a vector;

finally, concatenating the two vectors $f_{\mathrm{col\_mean}}$ and $f_{\mathrm{cos\_sim}}$ into a single vector representing the features of the protein chain; through these operations, each protein chain is represented as a 2048-dimensional feature vector.
4. The method for identifying protein folding based on embeddings and unbalanced classification loss according to claim 1, wherein the step 6 further comprises:
initializing the network parameters of the multilayer perceptron by adopting the default weight initialization method of the PyTorch deep learning framework.
5. The method of claim 1, wherein the step 7 comprises:
step 2 and step 3 are first performed to represent the protein chains in the protein folding test dataset as feature vectors; the feature vectors are then input into the trained folding recognition network model, and the folding category with the highest score is assigned to each protein chain.
6. A protein fold identification system based on embedding features and unbalanced classification loss, comprising:
a data set determination module for determining a protein folding training data set and a test data set, both of which comprise a plurality of protein chains;
the embedded matrix generation module is used for generating an embedded matrix of a protein chain in the protein folding training dataset by adopting the pre-trained protein language model ProtT5-XL-UniRef50;
the characteristic vector obtaining module is used for converting the embedded matrix into a fixed-length feature vector of the protein chain by computing mean values and cosine similarities;
the model construction module is used for constructing a protein folding recognition network model, wherein the protein folding recognition network model is a multilayer perceptron composed of three fully-connected layers, and the last fully-connected layer of the multilayer perceptron is a normalized fully-connected layer;
a loss function obtaining module, configured to use the label-distribution-aware margin loss for unbalanced classification as the loss function for training the folding recognition network model;
the model training module is used for training a folding recognition network model based on a protein folding training data set and a loss function for training a folding recognition network;
and the folding type identification module is used for predicting the folding type of the protein chain based on the protein folding test data set and the trained folding identification network model.
7. The system according to claim 6, wherein the embedding matrix generation module is specifically configured to:
for any protein chain of length l in the protein folding training dataset, converting all characters of its amino acid sequence to uppercase, taking the converted amino acid sequence as the input of the model ProtT5-XL-UniRef50, running the model in half-precision mode and saving the output of the model's encoder, thereby obtaining an embedded feature matrix of size l × 1024.
8. The system according to claim 7, wherein the feature vector derivation module is specifically configured to:
for a given embedding matrix E of size l × 1024, first calculating the mean of each column of the embedding matrix to obtain a feature vector of length 1024:

$$f_{\mathrm{col\_mean}} = \frac{1}{l}\sum_{i=1}^{l} E_{i,:}$$

where $l$ represents the number of rows of the embedding matrix $E$ and $E_{i,:}$ denotes its $i$-th row;

then calculating the mean of each row of the embedding matrix to obtain a vector of length l:

$$f_{\mathrm{row\_mean}} = \frac{1}{1024}\sum_{j=1}^{1024} E_{:,j}$$

where $E_{:,j}$ denotes the $j$-th column of $E$;

then calculating the cosine similarity between $f_{\mathrm{row\_mean}}$ and each column vector of the embedding matrix to obtain a feature representation of length 1024:

$$f_{\mathrm{cos\_sim}} = [s_1, s_2, \ldots, s_j, \ldots, s_{1024}]^{T}$$

where the cosine similarity $s_j$ is calculated according to the following formula:

$$s_j = \frac{\langle f_{\mathrm{row\_mean}},\, E_{:,j}\rangle}{\lVert f_{\mathrm{row\_mean}}\rVert \,\lVert E_{:,j}\rVert}$$

where $\langle\cdot,\cdot\rangle$ represents the inner product of two vectors and $\lVert\cdot\rVert$ represents the length (Euclidean norm) of a vector;

finally, concatenating the two vectors $f_{\mathrm{col\_mean}}$ and $f_{\mathrm{cos\_sim}}$ into a single vector representing the features of the protein chain; through these operations, each protein chain is represented as a 2048-dimensional feature vector.
9. The system of claim 6, wherein the model training module is further configured to:
initializing the network parameters of the multilayer perceptron by adopting the default weight initialization method of the PyTorch deep learning framework.
10. The system according to claim 6, wherein the fold class identification module is specifically configured to:
the embedded matrix generation module and the characteristic vector obtaining module are first executed to represent the protein chains in the protein folding test dataset as feature vectors; the feature vectors are then input into the trained folding recognition network model, and the folding category with the highest score is assigned to each protein chain.
CN202210110503.9A 2022-01-29 2022-01-29 Protein folding identification method and system based on embedding characteristics and unbalanced classification loss Active CN114550824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210110503.9A CN114550824B (en) 2022-01-29 2022-01-29 Protein folding identification method and system based on embedding characteristics and unbalanced classification loss

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210110503.9A CN114550824B (en) 2022-01-29 2022-01-29 Protein folding identification method and system based on embedding characteristics and unbalanced classification loss

Publications (2)

Publication Number Publication Date
CN114550824A true CN114550824A (en) 2022-05-27
CN114550824B CN114550824B (en) 2022-11-22

Family

ID=81674378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210110503.9A Active CN114550824B (en) 2022-01-29 2022-01-29 Protein folding identification method and system based on embedding characteristics and unbalanced classification loss

Country Status (1)

Country Link
CN (1) CN114550824B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020061510A1 (en) * 2000-10-16 2002-05-23 Jonathan Miller Method and system for designing proteins and protein backbone configurations
CN101794351A (en) * 2010-03-09 2010-08-04 哈尔滨工业大学 Protein secondary structure engineering prediction method based on large margin nearest central point
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
CN112116950A (en) * 2020-09-10 2020-12-22 南京理工大学 Protein folding identification method based on depth measurement learning
CN112116949A (en) * 2020-09-10 2020-12-22 南京理工大学 Protein folding identification method based on triple loss
CN112292693A (en) * 2018-05-18 2021-01-29 渊慧科技有限公司 Meta-gradient update of reinforcement learning system training return function
CN113593633A (en) * 2021-08-02 2021-11-02 中国石油大学(华东) Drug-protein interaction prediction model based on convolutional neural network
CN113593631A (en) * 2021-08-09 2021-11-02 山东大学 Method and system for predicting protein-polypeptide binding site
CN113643756A (en) * 2021-08-09 2021-11-12 安徽工业大学 Protein interaction site prediction method based on deep learning
CN113727994A (en) * 2019-05-02 2021-11-30 德克萨斯大学董事会 System and method for improving stability of synthetic protein


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BURCU ÇARKLI YAVUZ et al.: "Prediction of Protein Secondary Structure With Clonal Selection Algorithm and Multilayer Perceptron", IEEE Access *
张光亚 et al.: "Protein Recognition with a Multilayer Perceptron Model Using the BP Algorithm", Journal of Huaqiao University (Natural Science Edition) *
潘越: "Minimal Feature Extraction of Protein Fold Types Based on Convolutional Neural Networks", China Master's Theses Full-text Database, Basic Sciences *

Also Published As

Publication number Publication date
CN114550824B (en) 2022-11-22

Similar Documents

Publication Publication Date Title
Wang et al. Learning context-sensitive similarity by shortest path propagation
CN109923557A (en) Use continuous regularization training joint multitask neural network model
CN111626048A (en) Text error correction method, device, equipment and storage medium
US20190073443A1 (en) Methods and systems for producing an expanded training set for machine learning using biological sequences
CN108763865A (en) A kind of integrated learning approach of prediction DNA protein binding sites
Deng et al. META-DDIE: predicting drug–drug interaction events with few-shot learning
CN112035620B (en) Question-answer management method, device, equipment and storage medium of medical query system
Cai et al. Evolutionary computation techniques for multiple sequence alignment
CN113140254B (en) Meta-learning drug-target interaction prediction system and prediction method
US20230237084A1 (en) Method and apparatus for question-answering using a database consist of query vectors
CN111259115B (en) Training method and device for content authenticity detection model and computing equipment
CN114420212A (en) Escherichia coli strain identification method and system
CN116206688A (en) Multi-mode information fusion model and method for DTA prediction
CN103473416A (en) Protein-protein interaction model building method and device
CN114550824B (en) Protein folding identification method and system based on embedding characteristics and unbalanced classification loss
CN112632264A (en) Intelligent question and answer method and device, electronic equipment and storage medium
Guerrini et al. Lightweight metagenomic classification via eBWT
CN112085245A (en) Protein residue contact prediction method based on deep residual error neural network
Delteil et al. MATrIX--Modality-Aware Transformer for Information eXtraction
Zou et al. Predicting RNA secondary structure based on the class information and Hopfield network
Dotan et al. Effect of tokenization on transformers for biological sequences
KR102187586B1 (en) Data processing apparatus and method for discovering new drug candidates
Wells et al. Chainsaw: protein domain segmentation with fully convolutional neural networks
CN112069322B (en) Text multi-label analysis method and device, electronic equipment and storage medium
CN114706971A (en) Biomedical document type determination method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant