CN117476106B - Multi-class unbalanced protein secondary structure prediction method and system - Google Patents
Multi-class unbalanced protein secondary structure prediction method and system
- Publication number
- CN117476106B (application CN202311804115.1A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- layer
- secondary structure
- output
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G16B25/00 — ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G06N3/0455 — Auto-encoder networks; Encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Learning methods
- G16B5/00 — ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
Abstract
The invention discloses a method and a system for predicting the secondary structure of multi-class unbalanced proteins, and relates to the technical field of computer-aided drug research and development. The method comprises: obtaining a target protein sequence to be predicted; inputting the target protein sequence into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence; and outputting the secondary structure predicted value of the target protein sequence.
Description
Technical Field
The invention relates to the technical field of computer-aided drug development, in particular to a method and a system for predicting the secondary structure of multi-class unbalanced proteins.
Background
Proteins are important organic molecules for building and repairing human tissue, and a large number of computational methods have been developed around the problem of protein secondary structure prediction. Common methods include DeepACLSTM and MUFold-SS. The DeepACLSTM model uses protein sequence features and profile features, where the profile feature is a sequence feature profile constructed from multiple sequence alignment, specifically a position-specific scoring matrix (PSSM); it combines an asymmetric convolutional neural network (ACNN), formed by a one-dimensional convolution with a 1×42 kernel and a two-dimensional convolution with a 3×1 kernel, with a bidirectional long short-term memory (BiLSTM) network to obtain local and global correlations among amino acid residues. The feature matrix of the MUFold-SS model is composed of amino acid physicochemical properties, PSI-BLAST profiles and HHblits profiles, contains rich evolutionary information, and uses parallel convolutions with kernel sizes of 1 and 3 to form a deep convolutional network that extracts local and global correlations between amino acids. However, when predicting protein secondary structure, both methods depend strongly on global amino-acid context and give low prediction accuracy for the secondary structures of rare classes.
Disclosure of Invention
In order to overcome the defects of the background art, the invention mainly addresses the problems that the prior art depends strongly on global amino-acid context and has low prediction accuracy for rare-class protein secondary structures. The invention provides a method and a system for predicting multi-class unbalanced protein secondary structure; the method reduces the overall dependence on amino acids and improves the prediction accuracy for rare-class protein secondary structures.
In order to achieve the above object, a first aspect of the present invention provides a method for predicting multi-class unbalanced protein secondary structure, comprising:
obtaining a target protein sequence to be predicted;
inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted; protein secondary structures fall into 8 classes, denoted L, B, E, G, I, H, S and T respectively, but among these 8 classes the proportions of B, G and S are low, especially class G, and rare classes are easily drowned out during training, which reduces prediction accuracy; protein secondary structure prediction therefore suffers from a serious class-imbalance problem. The pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
a PyTorch function layer, for processing the preprocessed input sample data to obtain a first output matrix and a second output matrix;
a MobileNetV2 layer, for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
a Transformer layer, for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix among protein sequences;
a two-layer bidirectional gated recurrent unit layer, for processing the association matrix among protein sequences to obtain a global feature matrix; and
a convolution layer and a fully connected layer, for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
and outputting the predicted value of the secondary structure of the target protein sequence to be predicted.
Optionally, the step of weighting and batch normalizing the input sample data includes:
processing the input sample data according to a feature matrix obtained by horizontally concatenating a position-specific scoring matrix, a hidden Markov model feature matrix and an amino acid physicochemical property feature matrix, to obtain weighted sample data;
respectively obtaining a mask matrix with all elements initialized to False and an input feature matrix with all elements initialized to 0;
obtaining an updated mask matrix and an updated input feature matrix according to the protein chain index, protein chain length and maximum protein chain length obtained by traversing the protein sequences of each batch in the weighted sample data;
respectively calculating the mean value and the variance of the updated mask matrix and the updated input feature matrix by using a preset first formula and a preset second formula;
and processing the mean and variance by using a batch normalization layer pre-constructed according to a preset third formula to obtain a batch normalization output matrix.
Optionally, the preset third formula comprises:

$$X^{*} = \gamma\cdot\frac{X_{B,C,max\_L}-\mu}{\sqrt{var+\epsilon}}+\beta$$

wherein μ represents the mean obtained by processing the updated mask matrix and the updated input feature matrix with the preset first formula, var represents the variance obtained by processing the updated mask matrix and the updated input feature matrix with the preset second formula, X_{B,C,max_L} represents the input weighted sample data, X* represents the output of the batch normalization layer, β and γ are the parameters to be learned in the batch normalization layer, and ε = 0.01;

the preset first formula and the preset second formula are respectively:

$$\mu = \frac{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}\,X_{bcl}}{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}},\qquad var_{1} = \frac{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}\,(X_{bcl}-\mu)^{2}}{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}}$$

wherein μ represents the mean of all sample data, var_1 represents the variance of all sample data, M_{bl} denotes the mask matrix, X_{bcl} is the element value of each amino acid vector in each protein sequence, B is the batch size (batchsize) set during training, here set to 16; b indexes the 16 proteins traversed in the batch, so b = [1,16]; max_L refers to the maximum sequence length in the batch of proteins; and L represents the sequence length of the proteins in the batch.
optionally, the pyrach function layer includes an nn.conv1d function, an f.relu function, a maskedbatch norm1d function, and an nn.dropout function connected in this order, where the argument of the nn.conv1d function includes a convolution size of an output channel number 57, an input channel number 448, a convolution kernel length 3, padding = 1, bias = False, the input of the f.relu function is the output of the nn.conv1d function, the maskedbatch norm1d function has a dimension of 448, and the probability that the neuron of the nn.dropout function is not activated is 0.2.
Optionally, iteratively processing the second output matrix includes:
taking the second output matrix output by the nn.Dropout function as the initial input matrix of the MobileNetV2 layer to obtain an initial output matrix;
taking the initial output matrix as the input of the MobileNetV2 layer and then, for several iterations, taking the intermediate output matrix produced by the MobileNetV2 layer as its input again, to obtain the local feature matrix among protein sequences.
Optionally, the Transformer has two encoder layers, and each encoder layer is provided with eight attention heads.
Optionally, the convolution layer has 57 output channels and a convolution kernel size of 1, and the fully connected layer has an input size of 57 and an output size of 8.
Optionally, after the convolution layer and the fully connected layer sequentially process the global feature matrix to obtain the protein secondary structure predicted value, the method further comprises:
calculating a prediction error using a label-distribution-aware margin loss function, and back-propagating the obtained prediction error to update the parameters β and γ of the batch normalization layer.
In another aspect, the present invention provides a multi-class unbalanced protein secondary structure prediction system, comprising:
the input module is used for acquiring a target protein sequence to be predicted;
the processing module, for inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted; the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprising:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
the PyTorch function layer, for processing the preprocessed input sample data to obtain a first output matrix and a second output matrix;
the MobileNetV2 layer, for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
the Transformer layer, for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix among protein sequences;
the two-layer bidirectional gated recurrent unit layer, for processing the association matrix among protein sequences to obtain a global feature matrix;
the convolution layer and the fully connected layer, for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
and the output module is used for outputting the predicted value of the secondary structure of the target protein sequence to be predicted.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a method and a system for predicting a secondary structure of a multi-class unbalanced protein, which are characterized in that a target protein sequence to be predicted is input into a pre-constructed multi-class unbalanced protein secondary structure prediction model based on a Transformer to obtain a secondary structure predicted value of the target protein sequence to be predicted; the pre-constructed transducer-based multi-class unbalanced protein secondary structure prediction model comprises the following components: a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data; the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix; the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences; the transducer layer is used for processing a third output matrix obtained by adding the first output matrix and the local feature matrix to obtain an association matrix between protein sequences; the two-layer bidirectional gating cyclic unit layer is used for processing the incidence matrix among protein sequences to obtain a global feature matrix; and the convolution layer and the full-connection layer are used for sequentially processing the global feature matrix to obtain the predicted value of the protein secondary structure. When the method predicts the secondary structure of the protein, the overall dependence on amino acid is low, and the prediction precision of the secondary structure of the rare protein is improved.
Drawings
FIG. 1 is a flow chart of a method for predicting the secondary structure of a multi-class unbalanced protein;
FIG. 2 is a histogram of performance comparisons of models of a multi-class unbalanced protein secondary structure prediction method under different features;
FIG. 3 is a histogram of performance comparisons of models of a multi-class unbalanced protein secondary structure prediction method under different BN's;
FIG. 4 is a histogram of performance comparisons of a model of the multi-class unbalanced protein secondary structure prediction method with or without the Transformer;
FIG. 5 is a schematic diagram of a system for predicting the secondary structure of a plurality of unbalanced proteins.
Detailed Description
The invention will be further described with reference to specific examples and figures, which are not intended to limit the invention.
FIG. 1 is a flowchart of the method for predicting multi-class unbalanced protein secondary structure provided by the invention. As shown in FIG. 1, the method comprises the following steps:
101. Obtaining the target protein sequence to be predicted.
102. Inputting the target protein sequence to be predicted into the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain the secondary structure predicted value of the target protein sequence to be predicted. The pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises:
and the data preprocessing layer is used for carrying out weighting processing and batch normalization processing on the input sample data.
The step of weighting and batch normalization of the input sample data comprises:
and processing input sample data according to the position specificity scoring matrix, the hidden Markov model characteristic matrix and the characteristic matrix obtained by the amino acid physicochemical property characteristic matrix of the horizontal splicing to obtain weighted sample data.
Specifically, the PSI-BLAST convergence parameter e is set to 0.001, and PSI-BLAST is run for two iterations against the UniRef database to generate a position-specific scoring matrix (PSSM) of size L×20, where L is the length of the protein sequence.
HHblits is run for four iterations against the Uniprot20 database to generate a hidden Markov model (HMM) feature matrix of size L×30. Uniprot20 is a protein sequence database containing the sequence information of all species and all proteins in UniProtKB (the protein knowledgebase), 70,432,686 protein sequences in total. It includes both the Swiss-Prot and TrEMBL sections: the Swiss-Prot section contains manually annotated, high-quality protein sequences, while the TrEMBL section contains automatically annotated protein sequences and unverified predicted protein sequences. It can be downloaded from http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/.
The obtained position-specific scoring matrix, hidden Markov model feature matrix and seven amino acid physicochemical property feature matrices are concatenated and fused, the amino acid physicochemical properties comprising sheet probability, helix probability, isoelectric point, hydrophobicity, van der Waals volume, polarizability and graph shape index, to obtain a feature matrix of size L×57. The three fused matrices provide rich information for the model, effectively improving the model prediction accuracy.
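As an illustration of this fusion step, the following sketch concatenates the three per-residue feature blocks along the feature dimension; the function name and the zero-filled toy inputs are hypothetical, and only the dimensions (20 + 30 + 7 = 57) are taken from the description above.

```python
import numpy as np

def build_feature_matrix(pssm, hmm, physchem):
    """Horizontally concatenate the three per-residue feature blocks.

    pssm:     (L, 20) position-specific scoring matrix from PSI-BLAST
    hmm:      (L, 30) hidden Markov model profile from HHblits
    physchem: (L, 7)  seven physicochemical properties per amino acid
    returns:  (L, 57) fused feature matrix
    """
    assert pssm.shape[0] == hmm.shape[0] == physchem.shape[0]
    return np.concatenate([pssm, hmm, physchem], axis=1)

# toy usage for a protein of length L = 5
L = 5
features = build_feature_matrix(np.zeros((L, 20)), np.zeros((L, 30)), np.zeros((L, 7)))
print(features.shape)  # (5, 57)
```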
A mask matrix with all elements initialized to False and an input feature matrix with all elements initialized to 0 are obtained respectively.
The protein chain index is obtained by traversing the protein sequences of each batch in the weighted sample data, and the updated mask matrix and updated input feature matrix are obtained using the protein chain length and the maximum protein chain length.
Specifically, a mask matrix is introduced into the batch normalization layer of the network. For each batch, the B protein chains are traversed to obtain the index batch_idx and length L of each protein chain in the batch and the maximum protein chain length max_L in the batch; all elements of the mask matrix M_{B,max_L} are initialized to False, and all elements of the input feature matrix X_{B,C,max_L} are initialized to 0. For each batch_idx of the mask matrix M_{B,max_L}, elements 0 to ProteinLen−1 of the corresponding row of M_{B,max_L} are updated to True according to batch_idx and ProteinLen, where ProteinLen denotes the true length of the protein sequence, giving the final mask matrix M_{B,max_L} for each batch. For X_{B,C,max_L}, each batch_idx corresponds to one protein chain; X[batch_idx, ProteinLen] is determined according to batch_idx, and X is filled into X_{B,C,max_L} to obtain the final X_{B,C,max_L}. In the concrete implementation, the input X has dimension B×L×F and Masks has dimension B×L; the padded part of X is masked out, the non-padded part is taken out and put into a new tensor, the dimension of the new tensor is adjusted according to the length of the feature vector describing each amino acid (num_features), and the mean and variance in the batch normalization layer are computed from this new tensor.
The mask matrix is introduced to prevent the zero-padded positions from affecting the accuracy of the feature extraction results at other positions during subsequent feature extraction.
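A minimal sketch of this padding-and-masking step is given below, assuming the batch arrives as a list of per-protein feature matrices; the helper name and variable names are illustrative, mirroring the M_{B,max_L} and X_{B,C,max_L} construction described above.

```python
import torch

def build_masked_batch(proteins):
    """Pad a batch of (L_i, C) protein feature matrices and build the mask.

    proteins: list of float tensors, protein i has shape (ProteinLen_i, C)
    returns:  X of shape (B, C, max_L), zero-padded, and boolean M of shape (B, max_L)
    """
    B = len(proteins)
    C = proteins[0].shape[1]
    max_L = max(p.shape[0] for p in proteins)
    X = torch.zeros(B, C, max_L)                 # all elements initialized to 0
    M = torch.zeros(B, max_L, dtype=torch.bool)  # all elements initialized to False
    for batch_idx, p in enumerate(proteins):
        protein_len = p.shape[0]
        M[batch_idx, :protein_len] = True        # positions 0..ProteinLen-1 become True
        X[batch_idx, :, :protein_len] = p.t()    # fill the real residues, rest stays padding
    return X, M
```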
And respectively calculating the mean and variance of the updated mask matrix and the updated input feature matrix by using a preset first formula and a preset second formula.
And processing the mean and variance by using a batch normalization layer pre-constructed according to a preset third formula to obtain a batch normalization output matrix.
Specifically, the preset third formula comprises:

$$X^{*} = \gamma\cdot\frac{X_{B,C,max\_L}-\mu}{\sqrt{var+\epsilon}}+\beta$$

wherein μ represents the mean obtained by processing the updated mask matrix and the updated input feature matrix with the preset first formula, var represents the variance obtained by processing the updated mask matrix and the updated input feature matrix with the preset second formula, X_{B,C,max_L} represents the input weighted sample data, X* represents the output of the batch normalization layer, β and γ are the parameters to be learned in the batch normalization layer, and ε = 0.01.

The preset first formula and the preset second formula are respectively:

$$\mu = \frac{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}\,X_{bcl}}{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}},\qquad var_{1} = \frac{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}\,(X_{bcl}-\mu)^{2}}{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}}$$

wherein μ represents the mean of all sample data, var_1 represents the variance of all sample data, M_{bl} denotes the mask matrix, X_{bcl} is the element value of each amino acid vector in each protein sequence, B is the batch size (batchsize) set during training, here set to 16; b indexes the 16 proteins traversed in the batch, so b = [1,16]; max_L refers to the maximum sequence length in the batch of proteins; and L represents the sequence length of the proteins in the batch.
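The following sketch shows one way such a masked batch normalization could be written in PyTorch. The class name MaskedBatchNorm1d matches the function referred to in the text, but the body is a reconstruction from the formulas above, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class MaskedBatchNorm1d(nn.Module):
    """Batch normalization over (B, C, max_L) inputs that ignores padded positions."""
    def __init__(self, num_features, eps=0.01):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, num_features, 1))   # learnable gamma
        self.beta = nn.Parameter(torch.zeros(1, num_features, 1))   # learnable beta

    def forward(self, x, mask):
        # x: (B, C, max_L), mask: (B, max_L) with True at real residues
        m = mask.unsqueeze(1).to(x.dtype)                  # (B, 1, max_L)
        n = m.sum(dim=(0, 2), keepdim=True)                # number of unmasked positions
        mu = (x * m).sum(dim=(0, 2), keepdim=True) / n     # masked per-channel mean
        var = ((x - mu) ** 2 * m).sum(dim=(0, 2), keepdim=True) / n  # masked variance
        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        return (self.gamma * x_hat + self.beta) * m        # keep padded positions at zero
```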
The PyTorch function layer is used for processing the preprocessed input sample data to obtain a first output matrix and a second output matrix.
The PyTorch function layer comprises an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function and an nn.Dropout function connected in sequence, wherein the arguments of the nn.Conv1d function comprise an input channel number of 57, an output channel number of 448, a convolution kernel length of 3, padding = 1 and bias = False; the input of the F.relu function is the output of the nn.Conv1d function; the MaskedBatchNorm1d function has a feature dimension of 448; and the probability that a neuron of the nn.Dropout function is dropped is 0.2.
Specifically, B samples subjected to weighting and normalization are randomly selected from the training samples to form an input matrix X_{B,C,L}. Taking X_{B,C,L} as input, the calls out1 = nn.Conv1d(57, 448, 3, padding=1, bias=False), out = F.relu(out1), out = MaskedBatchNorm1d(448), out = nn.Dropout(0.2) are executed to obtain the matrix X_{B,C,L}.
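A sketch of this function layer as a module is shown below; the layer order and hyperparameters follow the text, and it assumes the MaskedBatchNorm1d class sketched above is in scope.

```python
import torch.nn as nn
import torch.nn.functional as F

class FunctionLayer(nn.Module):
    """Conv1d -> ReLU -> MaskedBatchNorm1d -> Dropout block described in the text."""
    def __init__(self, in_channels=57, out_channels=448, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.bn = MaskedBatchNorm1d(out_channels)   # masked normalization sketched above
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask):
        # x: (B, 57, L) fused feature matrix, mask: (B, L)
        out1 = self.conv(x)            # first output matrix, (B, 448, L)
        out = F.relu(out1)
        out = self.bn(out, mask)
        out = self.drop(out)           # second output matrix, (B, 448, L)
        return out1, out
```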
The MobileNetV2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences.
The second output matrix output by the nn.Dropout function is taken as the initial input matrix of the MobileNetV2 layer to obtain an initial output matrix of the MobileNetV2 layer.
The initial output matrix is then used as the input of the MobileNetV2 layer, and several iterations are performed in which the intermediate output matrix produced by the MobileNetV2 layer is fed back as its input, to obtain the local feature matrix among protein sequences.
Specifically, the local feature extraction code of the MobileNetV2 network is run for n iterations: the matrix X_{B,C,L} is taken as the input of the first iteration, and the i-th output is taken as the input of the (i+1)-th iteration, where i = 1, 2, 3, …, n−1, giving the local feature matrix X_L among the amino acids in the sequence.
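The sketch below shows one way to realize this iterated local-feature extraction with a MobileNetV2-style inverted residual block adapted to 1-D sequences; the expansion factor and the number of iterations n are illustrative assumptions rather than values stated in the patent.

```python
import torch.nn as nn

class InvertedResidual1d(nn.Module):
    """1-D MobileNetV2-style block: expand -> depthwise conv -> project, with skip."""
    def __init__(self, channels=448, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv1d(channels, hidden, 1, bias=False), nn.BatchNorm1d(hidden), nn.ReLU6(),
            nn.Conv1d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm1d(hidden), nn.ReLU6(),
            nn.Conv1d(hidden, channels, 1, bias=False), nn.BatchNorm1d(channels),
        )

    def forward(self, x):
        return x + self.block(x)      # residual connection keeps the channel count fixed

def extract_local_features(x, block, n=3):
    """Feed the i-th output back in as the (i+1)-th input, n times in total."""
    for _ in range(n):
        x = block(x)
    return x                          # local feature matrix X_L
```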
The Transformer layer is used for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix among protein sequences.
The Transformer has two encoder layers, and each encoder layer is provided with eight attention heads.
Specifically, the local feature matrix X_L and the feature out1 are added to obtain X_R; a two-layer, eight-head Transformer Encoder is run with X_R as input to obtain the association matrix X_E among all amino acid sequences.
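A minimal PyTorch sketch of this two-layer, eight-head encoder follows; d_model = 448 is assumed from the channel width used earlier, and the padding mask is passed so that attention ignores padded positions.

```python
import torch.nn as nn

class FeatureEnhancer(nn.Module):
    """Two-layer, eight-head Transformer encoder over per-residue features."""
    def __init__(self, d_model=448, nhead=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, out1, x_local, mask):
        # out1, x_local: (B, C, L); mask: (B, L) with True at real residues
        x_r = (out1 + x_local).transpose(1, 2)                # (B, L, C) third output matrix X_R
        x_e = self.encoder(x_r, src_key_padding_mask=~mask)   # association matrix X_E
        return x_e.transpose(1, 2)                            # back to (B, C, L)
```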
The two-layer bidirectional gated recurrent unit layer is used for processing the association matrix among protein sequences to obtain the global feature matrix.
Specifically, a two-layer bidirectional gated recurrent unit (BiGRU) is run with the inter-amino-acid association matrix X_E as input to obtain the global feature matrix X_G of the protein sequence.
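The sketch below runs a two-layer BiGRU over the association matrix; the hidden size is an assumption chosen so the bidirectional output keeps a convenient width, not a value stated in the patent.

```python
import torch.nn as nn

class GlobalFeatureExtractor(nn.Module):
    """Two-layer bidirectional GRU producing the global feature matrix X_G."""
    def __init__(self, input_size=448, hidden_size=224):
        super().__init__()
        self.bigru = nn.GRU(input_size, hidden_size, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, x_e):
        # x_e: (B, C, L) association matrix -> (B, L, C) for the GRU
        x_g, _ = self.bigru(x_e.transpose(1, 2))
        return x_g.transpose(1, 2)    # (B, 2*hidden_size, L) global feature matrix
```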
The convolution layer and the fully connected layer are used for sequentially processing the global feature matrix to obtain the protein secondary structure predicted value.
The convolution layer has 57 output channels and a convolution kernel size of 1, and the fully connected layer has an input size of 57 and an output size of 8.
Specifically, a one-dimensional convolution with 57 output channels and a convolution kernel of 1 is run with the global feature matrix X_G as input to obtain the final feature X_F; a fully connected layer with an input size of 57 and an output size of 8 is then run with X_F as input to obtain the protein secondary structure predicted value P.
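A sketch of this classification head is shown below; the 448-dimensional input width is assumed from the preceding layers, while the 57-channel 1×1 convolution and the 57-to-8 fully connected layer follow the description.

```python
import torch.nn as nn

class SecondaryStructureHead(nn.Module):
    """1x1 Conv1d down to 57 channels, then a per-residue 57 -> 8 linear classifier."""
    def __init__(self, in_channels=448, mid_channels=57, num_classes=8):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, mid_channels, kernel_size=1)
        self.fc = nn.Linear(mid_channels, num_classes)

    def forward(self, x_g):
        # x_g: (B, C, L) global feature matrix
        x_f = self.conv(x_g)                      # final feature X_F, (B, 57, L)
        logits = self.fc(x_f.transpose(1, 2))     # (B, L, 8) per-residue class scores P
        return logits
```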
After the convolution layer and the fully connected layer sequentially process the global feature matrix to obtain the protein secondary structure predicted value, the method further comprises:
calculating a prediction error using a label-distribution-aware margin loss function, and back-propagating the obtained prediction error to update the parameters β and γ of the batch normalization layer.
Specifically, the label-distribution-aware margin loss function code is run to calculate the prediction error, and the obtained prediction error is back-propagated to update the parameters of the batch normalization layer, including β and γ, as well as the weights of the network. The maximum number of iteration steps is set to 200, the learning rate to 0.0001 and the batch size to 16, and the model is iterated in a loop; when the prediction accuracy on the validation set no longer improves or the maximum number of iteration steps is reached, training ends and the Transformer-based multi-class unbalanced protein secondary structure prediction model is obtained.
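For reference, the sketch below follows the published label-distribution-aware margin (LDAM) formulation, with per-class margins proportional to n_j^(-1/4); the class counts, margin cap and scaling constant are illustrative assumptions and may differ from the settings used in the patent.

```python
import torch
import torch.nn.functional as F

class LDAMLoss(torch.nn.Module):
    """Label-distribution-aware margin loss: larger margins for rarer classes."""
    def __init__(self, class_counts, max_margin=0.5, scale=30.0):
        super().__init__()
        counts = torch.tensor(class_counts, dtype=torch.float)
        margins = 1.0 / torch.sqrt(torch.sqrt(counts))           # m_j proportional to n_j^(-1/4)
        self.margins = margins * (max_margin / margins.max())
        self.scale = scale

    def forward(self, logits, targets):
        # logits: (N, 8) per-residue scores, targets: (N,) class indices 0..7
        margin = self.margins.to(logits.device)[targets]          # margin of each true class
        one_hot = F.one_hot(targets, num_classes=logits.size(1)).bool()
        adjusted = torch.where(one_hot, logits - margin.unsqueeze(1), logits)
        return F.cross_entropy(self.scale * adjusted, targets)
```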
TABLE 1
Table 1 shows a comparison with state-of-the-art methods on the CB513 data set, with the best performance in bold. As can be seen from Table 1, the proposed model clearly outperforms the other reference methods on the CB513 data set: it maintains the prediction accuracy of the common classes among the 8 secondary structure classes while improving the prediction of the rare classes B, G and S to a certain extent.
TABLE 2
Table 2 shows a comparison with state-of-the-art methods on the CASP12 data set, with the best performance in bold. On CASP12 the overall prediction accuracy is 0.02% lower than MUFold_SS, but for the rarer classes such as B, S and G the prediction accuracy improves more than with the other methods; because these classes are few in number, they have little impact on the overall prediction accuracy.
TABLE 3
Table 3 shows a comparison with state-of-the-art methods on the CASP13 data set, with the best performance in bold. On CASP13 the overall prediction accuracy is the highest and the improvement for rare classes is also large, but the prediction accuracy on classes L and T still lags the other methods to some extent.
TABLE 4
Table 4 shows a comparison with state-of-the-art methods on the CASP14 data set, with the best performance in bold. On CASP14 the overall prediction accuracy is 0.05% higher than MUFold_SS; the improvement in overall accuracy is limited, but the improvement in prediction accuracy for rare classes is larger. From the above results it can be seen that the lower a structural class's proportion (the smaller its sample size), the lower its Q8 accuracy; the probability of the "I" type structure appearing in the eight-state scheme is less than one in a thousand, and almost no method can predict it correctly. It is reasonable to believe that the prediction effect would improve further if the sample size of the low-frequency classes could be amplified; data augmentation for protein secondary structure prediction is therefore a direction worth studying.
FIG. 2 is a histogram of performance comparisons of models of a multi-class unbalanced protein secondary structure prediction method under different features;
FIG. 3 is a comparative histogram of the performance of the model of the multi-class unbalanced protein secondary structure prediction method under different BN settings, where BN refers to the batch normalization (Batch Normalization) layer;
FIG. 4 is a histogram of performance comparisons of the model of the multi-class unbalanced protein secondary structure prediction method with or without the Transformer.
As the information source and basis of structure prediction, different amino acid encodings contain different amounts of evolutionary information, and the exploration of feature representations was carried out under the model parameters that gave the best prediction results in the experiments. Here we first analyse the influence of the different encoding modes on prediction accuracy one by one and then perform combination experiments on them. FIG. 2 shows, on the validation set and with Q8 accuracy as the metric, the influence of 9 different input features on model prediction performance. Specifically, we first consider PSSM encoding and HMM encoding separately; one-hot encoding and physicochemical property encoding are position-independent and contain no evolutionary information. The experimental results show that PSSM performs better than HMM, so the evolutionary information contained in PSSM is likely richer. Then PSSM and HMM are each combined with one-hot encoding and with physicochemical property encoding, and PSSM encoding is combined with HMM encoding. When PSSM encoding is combined with one-hot encoding or with physicochemical property encoding, the model prediction performance is close in the two cases; when HMM is combined with each of them, the performance is lower than the corresponding PSSM combinations, which again indicates that PSSM contains more information, or is more suitable, for secondary structure prediction. However, when HMM is combined with physicochemical property encoding the prediction performance is lower than with one-hot encoding, which does not match expectations; the subsequent experiments nevertheless show that, in the evolutionary process by which a protein forms a stable structure, different amino acids have different properties, these properties influence how the protein interacts with surrounding amino acids in the sequence, and they thus influence the structure of the protein. When PSSM encoding and HMM encoding are combined, the prediction performance improves noticeably relative to the previous four combinations: PSSM contains, summarized over different sequences, the probability that the same amino acid mutates into other amino acids during the formation of a stable protein structure, while HMM encoding contains the matching-state probabilities, transition frequencies and local diversity of different amino acids, so the two encodings contain complementary information that is useful for protein secondary structure prediction. On the basis of the combination of PSSM and HMM encodings, one-hot encoding and physicochemical property encoding are added in turn; the combination of PSSM and HMM encodings with physicochemical property encoding is the best encoding mode, the prediction accuracy reaches its highest, and this also matches expectations.
For the modified batch normalization (Batch Normalization) layer, as shown in FIG. 3, after the mask matrix is introduced, Q8 accuracy on test set CB513 improves by 0.23% and F1 improves by about 0.05%. The feature vectors at padded positions may become non-zero during feature extraction and can then affect the secondary structure prediction results for amino acids at non-padded positions; introducing the mask matrix alleviates this problem to a certain extent, improves the accuracy of feature extraction, and thereby further improves the accuracy of secondary structure prediction.
FIG. 4 compares the performance of the network with and without the feature enhancement module. As shown in the figure, after the feature enhancement module is added to the network, the prediction accuracy improves by 0.16% and the F1 value increases by 0.79%, which shows that adding the two-layer, eight-head Transformer Encoder before the long-range modeling strengthens the expression of relatedness among residues in the sequence, so that the subsequent BiGRU can capture long-range dependencies more flexibly. In summary, the invention obtains the secondary structure predicted value of the target protein sequence to be predicted by inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model. The pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises: a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data; a PyTorch function layer for processing the preprocessed input sample data to obtain a first output matrix and a second output matrix; a MobileNetV2 layer for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences; a Transformer layer for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix among protein sequences; a two-layer bidirectional gated recurrent unit layer for processing the association matrix among protein sequences to obtain the global feature matrix; and a convolution layer and a fully connected layer for sequentially processing the global feature matrix to obtain the protein secondary structure predicted value. When the method predicts protein secondary structure, its overall dependence on amino acids is low, and the prediction accuracy for rare-class protein secondary structures is improved.
103. Outputting the secondary structure predicted value P of the target protein sequence to be predicted.
FIG. 5 is a schematic diagram of a structure 200 of a system for predicting secondary structures of multiple types of unbalanced proteins, as shown in FIG. 5, the apparatus comprises:
an input module 201, configured to obtain a target protein sequence to be predicted.
The processing module 202 is configured to input the target protein sequence to be predicted into the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain the secondary structure predicted value of the target protein sequence to be predicted. The pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises:
and the data preprocessing layer is used for carrying out weighting processing and batch normalization processing on the input sample data.
The PyTorch function layer, for processing the preprocessed input sample data to obtain a first output matrix and a second output matrix.
The MobileNetV2 layer, for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences.
The Transformer layer, for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix among protein sequences.
The two-layer bidirectional gated recurrent unit layer, for processing the association matrix among protein sequences to obtain the global feature matrix.
The convolution layer and the fully connected layer, for sequentially processing the global feature matrix to obtain the protein secondary structure predicted value.
And the output module 203 is used for outputting the predicted value of the secondary structure of the target protein sequence to be predicted.
It will be appreciated by those skilled in the art that the present invention may take the form of a computer program product embodied on one or more computer-usable storage media. Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (6)
1. A method for predicting multi-class unbalanced protein secondary structure, comprising:
obtaining a target protein sequence to be predicted;
inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted; the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprising:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
a PyTorch function layer, for processing the preprocessed input sample data to obtain a first output matrix and a second output matrix;
a MobileNetV2 layer, for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
a Transformer layer, for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix among protein sequences;
a two-layer bidirectional gated recurrent unit layer, for processing the association matrix among protein sequences to obtain a global feature matrix; and
a convolution layer and a fully connected layer, for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
outputting a secondary structure predicted value of the target protein sequence to be predicted;
wherein the PyTorch function layer comprises an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function and an nn.Dropout function connected in sequence, the arguments of the nn.Conv1d function comprising an input channel number of 57, an output channel number of 448, a convolution kernel length of 3, padding = 1 and bias = False; the input of the F.relu function is the output of the nn.Conv1d function, the MaskedBatchNorm1d function has a feature dimension of 448, and the probability that a neuron of the nn.Dropout function is dropped is 0.2;
the Transformer has two encoder layers, and each encoder layer is provided with eight attention heads; and
the convolution layer has 57 output channels with a convolution kernel of 1, and the fully connected layer has an input size of 57 and an output size of 8.
2. The method for predicting multi-class unbalanced protein secondary structure according to claim 1, wherein the step of weighting and batch normalizing the input sample data comprises:
processing the input sample data according to a feature matrix obtained by horizontally concatenating a position-specific scoring matrix, a hidden Markov model feature matrix and an amino acid physicochemical property feature matrix, to obtain weighted sample data;
respectively obtaining a mask matrix with all elements initialized to False and an input feature matrix with all elements initialized to 0;
obtaining an updated mask matrix and an updated input feature matrix according to the protein chain index, protein chain length and maximum protein chain length obtained by traversing the protein sequences of each batch in the weighted sample data;
respectively calculating the mean value and the variance of the updated mask matrix and the updated input feature matrix by using a preset first formula and a preset second formula;
and processing the mean and variance by using a batch normalization layer pre-constructed according to a preset third formula to obtain a batch normalization output matrix.
3. The method for predicting multi-class unbalanced protein secondary structure according to claim 2, wherein the preset third formula comprises:

$$X^{*} = \gamma\cdot\frac{X_{B,C,max\_L}-\mu}{\sqrt{var+\epsilon}}+\beta$$

wherein μ represents the mean obtained by processing the updated mask matrix and the updated input feature matrix with the preset first formula, var represents the variance obtained by processing the updated mask matrix and the updated input feature matrix with the preset second formula, X_{B,C,max_L} represents the input weighted sample data, X* represents the output of the batch normalization layer, β and γ are the parameters to be learned in the batch normalization layer, and ε = 0.01;

the preset first formula and the preset second formula are respectively:

$$\mu = \frac{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}\,X_{bcl}}{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}},\qquad var_{1} = \frac{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}\,(X_{bcl}-\mu)^{2}}{\sum_{b=1}^{B}\sum_{l=1}^{max\_L} M_{bl}}$$

wherein μ represents the mean of all sample data, var_1 represents the variance of all sample data, M_{bl} denotes the mask matrix, X_{bcl} is the element value of each amino acid vector in each protein sequence, B is the batch size (batchsize) set during training, here set to 16; b indexes the 16 proteins traversed in the batch, so b = [1,16]; max_L refers to the maximum sequence length in the batch of proteins; and L represents the sequence length of the proteins in the batch.
4. The method for predicting multi-class unbalanced protein secondary structure according to claim 1, wherein iteratively processing the second output matrix comprises:
taking the second output matrix output by the nn.Dropout function as the initial input matrix of the MobileNetV2 layer to obtain an initial output matrix; and
taking the initial output matrix as the input of the MobileNetV2 layer and then, for several iterations, taking the intermediate output matrix produced by the MobileNetV2 layer as its input again, to obtain the local feature matrix among protein sequences.
5. The method for predicting multi-class unbalanced protein secondary structure according to claim 3, wherein after the convolution layer and the fully connected layer sequentially process the global feature matrix to obtain the protein secondary structure predicted value, the method further comprises:
calculating a prediction error using a label-distribution-aware margin loss function, and back-propagating the obtained prediction error to update the parameters β and γ of the batch normalization layer.
6. A multi-class unbalanced protein secondary structure prediction system, comprising:
the input module is used for acquiring a target protein sequence to be predicted;
the processing module, for inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted; the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprising:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
the PyTorch function layer, for processing the preprocessed input sample data to obtain a first output matrix and a second output matrix;
the MobileNetV2 layer, for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
the Transformer layer, for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix among protein sequences;
the two-layer bidirectional gated recurrent unit layer, for processing the association matrix among protein sequences to obtain a global feature matrix;
the convolution layer and the fully connected layer, for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
the output module is used for outputting the predicted value of the secondary structure of the target protein sequence to be predicted;
wherein the PyTorch function layer comprises an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function and an nn.Dropout function connected in sequence, the arguments of the nn.Conv1d function comprising an input channel number of 57, an output channel number of 448, a convolution kernel length of 3, padding = 1 and bias = False; the input of the F.relu function is the output of the nn.Conv1d function, the MaskedBatchNorm1d function has a feature dimension of 448, and the probability that a neuron of the nn.Dropout function is dropped is 0.2;
the Transformer has two encoder layers, and each encoder layer is provided with eight attention heads; and
the convolution layer has 57 output channels with a convolution kernel of 1, and the fully connected layer has an input size of 57 and an output size of 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311804115.1A CN117476106B (en) | 2023-12-26 | 2023-12-26 | Multi-class unbalanced protein secondary structure prediction method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311804115.1A CN117476106B (en) | 2023-12-26 | 2023-12-26 | Multi-class unbalanced protein secondary structure prediction method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117476106A CN117476106A (en) | 2024-01-30 |
CN117476106B true CN117476106B (en) | 2024-04-02 |
Family
ID=89633271
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311804115.1A Active CN117476106B (en) | 2023-12-26 | 2023-12-26 | Multi-class unbalanced protein secondary structure prediction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117476106B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118658528A (en) * | 2024-08-20 | 2024-09-17 | 电子科技大学长三角研究院(衢州) | Construction method of specific myoglobin prediction model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220122689A1 (en) * | 2020-10-15 | 2022-04-21 | Salesforce.Com, Inc. | Systems and methods for alignment-based pre-training of protein prediction models |
- 2023-12-26 CN CN202311804115.1A patent/CN117476106B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111667884A (en) * | 2020-06-12 | 2020-09-15 | 天津大学 | Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism |
CN112767997A (en) * | 2021-02-04 | 2021-05-07 | 齐鲁工业大学 | Protein secondary structure prediction method based on multi-scale convolution attention neural network |
CN114974397A (en) * | 2021-02-23 | 2022-08-30 | 腾讯科技(深圳)有限公司 | Training method of protein structure prediction model and protein structure prediction method |
CN113178229A (en) * | 2021-05-31 | 2021-07-27 | 吉林大学 | Deep learning-based RNA and protein binding site recognition method |
CN115458039A (en) * | 2022-08-08 | 2022-12-09 | 北京分子之心科技有限公司 | Single-sequence protein structure prediction method and system based on machine learning |
CN115662501A (en) * | 2022-10-25 | 2023-01-31 | 浙江大学杭州国际科创中心 | Protein generation method based on position specificity weight matrix |
CN116486900A (en) * | 2023-04-25 | 2023-07-25 | 徐州医科大学 | Drug target affinity prediction method based on depth mode data fusion |
Non-Patent Citations (1)
Title |
---|
Protein secondary structure prediction based on a self-attention mechanism and GAN; Yang Lu, Dong Hongwei; Sciencepaper Online (China) Selected Papers; 2023-06-15; Vol. 16, No. 2; pp. 148-159 *
Also Published As
Publication number | Publication date |
---|---|
CN117476106A (en) | 2024-01-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||