CN117476106B - Multi-class unbalanced protein secondary structure prediction method and system - Google Patents

Multi-class unbalanced protein secondary structure prediction method and system

Info

Publication number
CN117476106B
Authority
CN
China
Prior art keywords
matrix
layer
secondary structure
output
processing
Prior art date
Legal status
Active
Application number
CN202311804115.1A
Other languages
Chinese (zh)
Other versions
CN117476106A (en)
Inventor
何朝政
赵君研
肖秦琨
付玲
Current Assignee
Xi'an Huisuan Intelligent Technology Co ltd
Original Assignee
Xi'an Huisuan Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xi'an Huisuan Intelligent Technology Co ltd filed Critical Xi'an Huisuan Intelligent Technology Co ltd
Priority to CN202311804115.1A priority Critical patent/CN117476106B/en
Publication of CN117476106A publication Critical patent/CN117476106A/en
Application granted granted Critical
Publication of CN117476106B publication Critical patent/CN117476106B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 - ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Genetics & Genomics (AREA)
  • Physiology (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a multi-class unbalanced protein secondary structure prediction method and system, and relates to the technical field of computer-aided drug research and development. The method comprises: obtaining a target protein sequence to be predicted; inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted; and outputting the secondary structure predicted value of the target protein sequence.

Description

Multi-class unbalanced protein secondary structure prediction method and system
Technical Field
The invention relates to the technical field of computer-aided drug development, in particular to a method and a system for predicting a secondary structure of a multi-class unbalanced protein.
Background
Proteins are important organic molecules for building and repairing human tissue, and a large number of computational methods have been developed around the problem of protein secondary structure prediction. Common methods include DeepACLSTM and MUFold-SS. The DeepACLSTM model uses protein sequence features and profile features, where the profile features are sequence feature profiles constructed by multiple sequence alignment, specifically a position-specific scoring matrix (Position-Specific Scoring Matrix, PSSM); it combines an asymmetric convolutional neural network (asymmetric convolutional neural network, ACNN), formed by one-dimensional convolutions with a 1×42 kernel and two-dimensional convolutions with a 3×1 kernel, with a bidirectional long short-term memory (bidirectional long short term memory, BiLSTM) network to capture local and global correlations among amino acid residues. The feature matrix of the MUFold-SS model is composed of amino acid physicochemical properties, PSI-BLAST profiles and HHblits profiles, and therefore contains rich evolutionary information; it uses parallel convolutions with kernel sizes of 1 and 3 to form a deep convolutional network that extracts local and global correlations between amino acids. However, when these two methods predict protein secondary structure, they exhibit a strong global dependence on amino acids and a low prediction accuracy for rare protein secondary-structure classes.
Disclosure of Invention
In order to overcome the defects in the background art, namely the problems in the prior art that protein secondary structure prediction depends strongly on the global amino acid context and that the prediction accuracy for rare secondary-structure classes is low, the invention provides a multi-class unbalanced protein secondary structure prediction method and system; the method has a low global dependence on amino acids and improves the prediction accuracy of rare-class protein secondary structures.
In order to achieve the above object, a first aspect of the present invention provides a multi-class unbalanced protein secondary structure prediction method, comprising:
obtaining a target protein sequence to be predicted;
inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted; the unbalanced protein secondary structures fall into 8 classes, denoted L, B, E, G, I, H, S and T respectively; among these 8 structures the proportions of B, G and S are low, especially structure G, and rare classes are easily overwhelmed during training, which reduces prediction accuracy, so protein secondary structure prediction suffers from a pronounced class-imbalance problem. The pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix;
the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
the Transformer layer is used for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences;
the two-layer bidirectional gated recurrent unit layer is used for processing the association matrix between protein sequences to obtain a global feature matrix;
the convolution layer and the full connection layer are used for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
and outputting the predicted value of the secondary structure of the target protein sequence to be predicted.
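By way of illustration, the following is a minimal PyTorch sketch of how these components can be wired together; the 448-channel hidden width follows the description below, while the stand-in local-feature block, the number of local iterations and all helper names are assumptions for illustration only and do not define the claimed method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProteinSSPredictor(nn.Module):
    """End-to-end skeleton: preprocessing front-end, local features,
    Transformer reinforcement, BiGRU, and the output head."""
    def __init__(self, in_feats=57, hidden=448, n_classes=8):
        super().__init__()
        self.conv_in = nn.Conv1d(in_feats, hidden, 3, padding=1, bias=False)  # pytorch function layer
        self.drop = nn.Dropout(0.2)
        self.local = nn.Conv1d(hidden, hidden, 3, padding=1)   # stand-in for the MobileNet v2 block
        enc = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)               # Transformer layer
        self.bigru = nn.GRU(hidden, hidden // 2, num_layers=2,
                            bidirectional=True, batch_first=True)             # two-layer BiGRU
        self.conv_out = nn.Conv1d(hidden, in_feats, 1)                        # convolution layer
        self.fc = nn.Linear(in_feats, n_classes)                              # full connection layer

    def forward(self, x):                      # x: (B, 57, L) preprocessed features
        out1 = self.conv_in(x)                 # first output matrix
        out2 = self.drop(F.relu(out1))         # second output matrix
        local = out2
        for _ in range(3):                     # iterative local feature extraction (n = 3 assumed)
            local = local + self.local(local)
        x_r = (out1 + local).transpose(1, 2)   # third output matrix, (B, L, 448)
        x_e = self.encoder(x_r)                # association matrix
        x_g, _ = self.bigru(x_e)               # global feature matrix
        x_f = self.conv_out(x_g.transpose(1, 2)).transpose(1, 2)
        return self.fc(x_f)                    # per-residue secondary structure logits
```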
Optionally, the step of weighting and batch normalizing the input sample data includes:
processing the input sample data according to a feature matrix obtained by horizontally splicing a position-specific scoring matrix, a hidden Markov model feature matrix and an amino acid physicochemical property feature matrix, to obtain weighted sample data;
respectively obtaining mask matrixes with all initial elements being false and input feature matrixes with all initial elements being 0;
according to protein chain indexes, protein chain lengths and protein chain maximum lengths obtained by traversing protein sequences of each batch in the weighted sample data, an updated mask matrix and an updated input feature matrix are obtained;
respectively calculating the mean value and the variance of the updated mask matrix and the updated input feature matrix by using a preset first formula and a preset second formula;
and processing the mean and variance by using a batch normalization layer pre-constructed according to a preset third formula to obtain a batch normalization output matrix.
Optionally, the preset third formula is:

X* = γ · (X_{B,C,max_L} − μ) / sqrt(var + ε) + β

wherein μ represents the mean obtained by processing the updated mask matrix and the updated input feature matrix with the preset first formula, var represents the variance obtained by processing the updated mask matrix and the updated input feature matrix with the preset second formula, X_{B,C,max_L} represents the input weighted sample data, X* represents the output of the batch normalization layer, and β and γ are the parameters to be learned in the batch normalization layer, where ε = 0.01;

the preset first formula and the preset second formula are respectively:

μ = Σ_{b,l} M_{bl} · X_{bcl} / Σ_{b,l} M_{bl}

var₁ = Σ_{b,l} M_{bl} · (X_{bcl} − μ)² / Σ_{b,l} M_{bl}

wherein the sums run over b = 1, …, B and l = 1, …, max_L for each channel c, μ represents the mean of all sample data, var₁ represents the variance of all sample data, M_{bl} denotes the mask matrix, X_{bcl} is the element value of each amino acid vector in each protein sequence, B is the batch size (batchsize) set during training, here set to 16, so b ∈ [1, 16], max_L is the maximum sequence length in the batch of proteins, and l indexes the sequence positions of the proteins in the batch;
optionally, the pyrach function layer includes an nn.conv1d function, an f.relu function, a maskedbatch norm1d function, and an nn.dropout function connected in this order, where the argument of the nn.conv1d function includes a convolution size of an output channel number 57, an input channel number 448, a convolution kernel length 3, padding = 1, bias = False, the input of the f.relu function is the output of the nn.conv1d function, the maskedbatch norm1d function has a dimension of 448, and the probability that the neuron of the nn.dropout function is not activated is 0.2.
Optionally, iteratively processing the second output matrix includes:
taking the second output matrix output by the nn.Dropout function as the initial input matrix of the MobileNet v2 layer to obtain an initial output matrix;
taking the initial output matrix as the input of the MobileNet v2 layer, and iterating a plurality of times with the intermediate output matrix processed by the MobileNet v2 layer as the input of the MobileNet v2 layer, to obtain the local feature matrix between protein sequences.
Optionally, the Transformer has two encoder layers, and each encoder is provided with a set of eight attention heads.
Optionally, the convolution layer has 57 output channels, the convolution kernel of the convolution layer is 1, and the full connection layer has an input size of 57 and an output size of 8.
Optionally, after the convolution layer and the full-connection layer for sequentially processing the global feature matrix to obtain the predicted value of the protein secondary structure, the method further includes:
calculating a prediction error by using a label-distribution-aware margin loss function, and back-propagating the obtained prediction error to update the parameters β and γ of the batch normalization layer.
In another aspect, the present invention provides a multi-class unbalanced protein secondary structure prediction system, comprising:
the input module is used for acquiring a target protein sequence to be predicted;
the processing module is used for inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted; the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix;
the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
the Transformer layer is used for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences;
the two-layer bidirectional gated recurrent unit layer is used for processing the association matrix between protein sequences to obtain a global feature matrix;
the convolution layer and the full connection layer are used for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
and the output module is used for outputting the predicted value of the secondary structure of the target protein sequence to be predicted.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a method and a system for predicting a secondary structure of a multi-class unbalanced protein, which are characterized in that a target protein sequence to be predicted is input into a pre-constructed multi-class unbalanced protein secondary structure prediction model based on a Transformer to obtain a secondary structure predicted value of the target protein sequence to be predicted; the pre-constructed transducer-based multi-class unbalanced protein secondary structure prediction model comprises the following components: a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data; the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix; the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences; the transducer layer is used for processing a third output matrix obtained by adding the first output matrix and the local feature matrix to obtain an association matrix between protein sequences; the two-layer bidirectional gating cyclic unit layer is used for processing the incidence matrix among protein sequences to obtain a global feature matrix; and the convolution layer and the full-connection layer are used for sequentially processing the global feature matrix to obtain the predicted value of the protein secondary structure. When the method predicts the secondary structure of the protein, the overall dependence on amino acid is low, and the prediction precision of the secondary structure of the rare protein is improved.
Drawings
FIG. 1 is a flow chart of a method for predicting the secondary structure of a multi-class unbalanced protein;
FIG. 2 is a histogram of performance comparisons of models of a multi-class unbalanced protein secondary structure prediction method under different features;
FIG. 3 is a histogram of performance comparisons of models of a multi-class unbalanced protein secondary structure prediction method under different BN's;
FIG. 4 is a histogram of performance comparisons of a model of a multi-class unbalanced protein secondary structure prediction method with or without a Transformer;
FIG. 5 is a schematic diagram of a multi-class unbalanced protein secondary structure prediction system.
Detailed Description
The invention will be further described with reference to specific examples and figures, which are not intended to limit the invention.
FIG. 1 is a flowchart of a method for predicting a secondary structure of a multi-class unbalanced protein, provided by the invention, as shown in FIG. 1, the method for predicting a secondary structure of a multi-class unbalanced protein provided by the invention comprises the following steps:
101. and obtaining a target protein sequence to be predicted.
102. Inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted. The pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises:
and the data preprocessing layer is used for carrying out weighting processing and batch normalization processing on the input sample data.
The step of weighting and batch normalization of the input sample data comprises:
and processing input sample data according to the position specificity scoring matrix, the hidden Markov model characteristic matrix and the characteristic matrix obtained by the amino acid physicochemical property characteristic matrix of the horizontal splicing to obtain weighted sample data.
Specifically, the PSI-BLAST convergence parameter e is set to 0.001, and PSI-BLAST is run for two iterations against the Uniref database to generate a position-specific scoring matrix (PSSM) of size L×20, where L is the length of the protein sequence.
The HHblits program is run for four iterations against the Uniprot20 database to generate a hidden Markov model (HMM) feature matrix of size L×30. Uniprot20 is a protein sequence database containing the sequence information of all species and all proteins in UniProtKB (the protein knowledge base), 70432686 protein sequences in total, comprising both Swiss-Prot and TrEMBL. The Swiss-Prot portion contains manually annotated, high-quality protein sequences, and the TrEMBL portion contains automatically annotated protein sequences and unverified predicted protein sequences; the database can be downloaded from http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/.
The obtained position-specific scoring matrix, the hidden Markov model feature matrix and the seven amino acid physicochemical property feature matrices are spliced and fused, where the amino acid physicochemical properties comprise sheet probability, helix probability, isoelectric point, hydrophobicity, van der Waals volume, polarizability and graph shape index, to obtain a feature matrix of size L×57; the three fused matrices provide rich information for the model and effectively improve the model prediction accuracy.
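By way of illustration, a minimal sketch of the L×57 feature assembly is given below; the function name and the use of NumPy arrays are assumptions for illustration only.

```python
import numpy as np

def build_feature_matrix(pssm: np.ndarray, hmm: np.ndarray, physchem: np.ndarray) -> np.ndarray:
    """Horizontally splice the PSSM (L x 20), the HMM profile (L x 30) and the
    seven physicochemical property columns (L x 7) into one L x 57 matrix."""
    assert pssm.shape[1] == 20 and hmm.shape[1] == 30 and physchem.shape[1] == 7
    assert pssm.shape[0] == hmm.shape[0] == physchem.shape[0]   # same sequence length L
    return np.hstack([pssm, hmm, physchem]).astype(np.float32)  # shape (L, 57)
```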
And respectively acquiring mask matrixes with all initial elements being false and input feature matrixes with all initial elements being 0.
Protein chain indexes are obtained by traversing the protein sequences of each batch in the weighted sample data, and an updated mask matrix and an updated input feature matrix are obtained by using the protein chain lengths and the maximum protein chain length.
Specifically, a mask matrix is introduced into the batch normalization layer of the network. For each batch, the B protein chains are traversed to obtain the index batch_idx and the length L of each protein chain in the batch, together with the maximum protein chain length max_L in the batch. All elements of the mask matrix M_{B,max_L} are initialized to False, and all elements of the input feature matrix X_{B,C,max_L} are initialized to 0. For each batch_idx of the mask matrix M_{B,max_L}, elements 0 to ProteinLen−1 of the corresponding row are updated to True according to batch_idx and ProteinLen, where ProteinLen denotes the true length of the protein sequence, yielding the final mask matrix M_{B,max_L} for each batch. For X_{B,C,max_L}, each batch_idx corresponds to a protein chain; according to batch_idx, the features X of that chain are filled into X_{B,C,max_L} as X[batch_idx, :ProteinLen], yielding the final X_{B,C,max_L}. In implementation, the input X has dimension B×L×F and Masks has dimension B×L; the padded part of X is masked, the non-padded part is taken out and put into a new tensor, the dimension of the new tensor is adjusted according to num_features, the length of the feature vector describing each amino acid, and the mean and variance in the batch normalization layer are computed from this new tensor.
The mask matrix is introduced in order to prevent the zero-padded positions from affecting the accuracy of the feature extraction results at other positions during subsequent feature extraction.
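By way of illustration, a minimal PyTorch sketch of the padding and mask construction described above follows; the helper name pad_batch is an assumption for illustration only.

```python
import torch

def pad_batch(features):
    """Pad a batch of per-protein feature matrices (each ProteinLen x C) to the
    maximum length in the batch and build the boolean mask M_{B,max_L}."""
    B = len(features)
    C = features[0].shape[1]
    max_L = max(f.shape[0] for f in features)
    X = torch.zeros(B, C, max_L)                    # X_{B,C,max_L}, zero-padded
    M = torch.zeros(B, max_L, dtype=torch.bool)     # mask, initialized to False
    for batch_idx, f in enumerate(features):
        protein_len = f.shape[0]
        X[batch_idx, :, :protein_len] = torch.as_tensor(f, dtype=torch.float32).T  # real residues
        M[batch_idx, :protein_len] = True           # mark non-padded positions as True
    return X, M
```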
And respectively calculating the mean and variance of the updated mask matrix and the updated input feature matrix by using a preset first formula and a preset second formula.
And processing the mean and variance by using a batch normalization layer pre-constructed according to a preset third formula to obtain a batch normalization output matrix.
Specifically, the preset third formula is:

X* = γ · (X_{B,C,max_L} − μ) / sqrt(var + ε) + β

wherein μ represents the mean obtained by processing the updated mask matrix and the updated input feature matrix with the preset first formula, var represents the variance obtained by processing the updated mask matrix and the updated input feature matrix with the preset second formula, X_{B,C,max_L} represents the input weighted sample data, X* represents the output of the batch normalization layer, and β and γ are the parameters to be learned in the batch normalization layer, where ε = 0.01.

The preset first formula and the preset second formula are respectively:

μ = Σ_{b,l} M_{bl} · X_{bcl} / Σ_{b,l} M_{bl}

var₁ = Σ_{b,l} M_{bl} · (X_{bcl} − μ)² / Σ_{b,l} M_{bl}

wherein the sums run over b = 1, …, B and l = 1, …, max_L for each channel c, μ represents the mean of all sample data, var₁ represents the variance of all sample data, M_{bl} denotes the mask matrix, X_{bcl} is the element value of each amino acid vector in each protein sequence, B is the batch size (batchsize) set during training, here set to 16, so b ∈ [1, 16], max_L is the maximum sequence length in the batch of proteins, and l indexes the sequence positions of the proteins in the batch.
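By way of illustration, a minimal PyTorch sketch of a masked batch normalization module consistent with the formulas above (learnable β and γ, ε = 0.01, statistics computed only over non-padded positions) is given below; the class interface is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class MaskedBatchNorm1d(nn.Module):
    """Batch normalization that excludes zero-padded positions from the
    per-channel mean and variance."""
    def __init__(self, num_features: int, eps: float = 0.01):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, num_features, 1))   # γ
        self.beta = nn.Parameter(torch.zeros(1, num_features, 1))   # β

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (B, C, max_L); mask: (B, max_L), True at real residues
        m = mask.unsqueeze(1).float()                                 # (B, 1, max_L)
        n = m.sum(dim=(0, 2), keepdim=True).clamp(min=1)              # number of valid positions
        mu = (x * m).sum(dim=(0, 2), keepdim=True) / n                # masked mean per channel
        var = ((x - mu) ** 2 * m).sum(dim=(0, 2), keepdim=True) / n   # masked variance per channel
        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        return (self.gamma * x_hat + self.beta) * m                   # keep padded positions at zero
```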
And the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix.
The pytorch function layer includes an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function and an nn.Dropout function connected in this order, where the parameters of the nn.Conv1d function include 57 input channels, 448 output channels, a convolution kernel length of 3, padding=1 and bias=False, the input of the F.relu function is the output of the nn.Conv1d function, the dimension of the MaskedBatchNorm1d function is 448, and the dropout probability of the nn.Dropout function is 0.2.
Specifically, B samples subjected to weighting processing and normalization processing are randomly selected from the training samples to form an input matrix X_{B,C,L}. With X_{B,C,L} as the input, out1 = nn.Conv1d(57, 448, 3, padding=1, bias=False), out = F.relu(out1), out = MaskedBatchNorm1d(448) and out = nn.Dropout(0.2) are called in turn to obtain the matrix X_{B,C,L}.
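By way of illustration, the block below sketches this first processing stage with the stated parameters (Conv1d from 57 to 448 channels, kernel 3, padding 1, no bias; ReLU; masked batch normalization over 448 channels; dropout 0.2). It assumes the MaskedBatchNorm1d class from the sketch above is in scope; the module name FrontEnd is an assumption for illustration only.

```python
import torch.nn as nn
import torch.nn.functional as F

class FrontEnd(nn.Module):
    """pytorch function layer: Conv1d -> ReLU -> masked batch norm -> dropout."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(57, 448, 3, padding=1, bias=False)
        self.bn = MaskedBatchNorm1d(448)    # from the sketch above (assumed in scope)
        self.drop = nn.Dropout(0.2)

    def forward(self, x, mask):
        out1 = self.conv(x)        # first output matrix, fed to the Transformer branch
        out = F.relu(out1)
        out = self.bn(out, mask)
        out2 = self.drop(out)      # second output matrix, fed to the MobileNet v2 layer
        return out1, out2
```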
And the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences.
The second output matrix output by the nn.Dropout function is taken as the initial input matrix of the MobileNet v2 layer to obtain an initial output matrix of the MobileNet v2 layer.
The initial output matrix is then used as the input of the MobileNet v2 layer, and multiple iterations are performed with the intermediate output matrix processed by the MobileNet v2 layer as the input of the MobileNet v2 layer, to obtain the local feature matrix between protein sequences.
Specifically, the local feature extraction code of the MobileNet v2 network is run for n iterations: the matrix X_{B,C,L} obtained above is taken as the input of the first iteration, and the i-th output is taken as the input of the (i+1)-th iteration, where i = 1, 2, 3, …, n−1, to obtain the local feature matrix X_L among the amino acids in the sequence.
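By way of illustration, a minimal sketch of this iterative local-feature extraction follows; the 1-D inverted-residual block is a stand-in for the MobileNet v2 local feature extraction code, and the expansion factor is an assumption for illustration only.

```python
import torch.nn as nn

class InvertedResidual1d(nn.Module):
    """1-D analogue of a MobileNet v2 inverted-residual block with a skip connection."""
    def __init__(self, channels: int = 448, expand: int = 2):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv1d(channels, hidden, 1, bias=False), nn.BatchNorm1d(hidden), nn.ReLU6(inplace=True),
            nn.Conv1d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise convolution
            nn.BatchNorm1d(hidden), nn.ReLU6(inplace=True),
            nn.Conv1d(hidden, channels, 1, bias=False), nn.BatchNorm1d(channels),
        )

    def forward(self, x):
        return x + self.block(x)    # residual connection

def local_features(x, block: InvertedResidual1d, n: int):
    """Feed the i-th output back in as the (i+1)-th input for n iterations."""
    for _ in range(n):
        x = block(x)
    return x                        # local feature matrix X_L
```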
The Transformer layer is used for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences.
The Transformer has two encoder layers, and each encoder is provided with a set of eight attention heads.
Specifically, the local feature matrix X_L and the feature out1 are added to obtain X_R; a two-layer, eight-head Transformer Encoder program is run with X_R as its input to obtain the association matrix X_E between all amino acid sequences.
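By way of illustration, a minimal PyTorch sketch of this feature-reinforcement step follows; d_model = 448 and batch-first processing are assumptions consistent with the feature sizes above.

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=448, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)   # two layers, eight heads each

def relate(x_l, out1, mask):
    """X_R = X_L + out1 is passed through the encoder; padded positions are masked out."""
    x_r = (x_l + out1).transpose(1, 2)               # (B, max_L, 448)
    x_e = encoder(x_r, src_key_padding_mask=~mask)   # mask is True at real residues
    return x_e                                       # association matrix X_E
```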
The two-layer bidirectional gated recurrent unit layer is used for processing the association matrix between protein sequences to obtain the global feature matrix.
Specifically, a two-layer bidirectional gated recurrent unit (BiGRU) is run with the association matrix X_E between amino acid sequences as its input to obtain the global feature matrix X_G of the protein sequence.
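By way of illustration, a minimal sketch of the two-layer BiGRU follows; the hidden size of 224 is an assumption chosen so that the two concatenated directions preserve the 448-dimensional feature width.

```python
import torch.nn as nn

bigru = nn.GRU(input_size=448, hidden_size=224, num_layers=2,
               bidirectional=True, batch_first=True)

def global_features(x_e):
    x_g, _ = bigru(x_e)    # (B, max_L, 448): global feature matrix X_G
    return x_g
```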
And the convolution layer and the full-connection layer are used for sequentially processing the global feature matrix to obtain the predicted value of the protein secondary structure.
The convolution layer has 57 output channels, the convolution kernel of the convolution layer is 1, and the full connection layer has an input size of 57 and an output size of 8.
Specifically, a one-dimensional convolution program with 57 output channels and a convolution kernel of 1 is run with the global feature matrix X_G as its input to obtain the final feature X_F; a full connection layer program with an input size of 57 and an output size of 8 is then run with the final feature X_F as its input to obtain the protein secondary structure predicted value P.
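By way of illustration, a minimal sketch of this output head (a one-dimensional convolution with 57 output channels and kernel size 1, followed by a full connection layer mapping 57 to 8 classes) is given below; the module name is an assumption for illustration only.

```python
import torch.nn as nn

class Head(nn.Module):
    def __init__(self, in_channels: int = 448):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, 57, kernel_size=1)
        self.fc = nn.Linear(57, 8)

    def forward(self, x_g):
        # x_g: (B, max_L, 448) global feature matrix
        x_f = self.conv(x_g.transpose(1, 2)).transpose(1, 2)   # (B, max_L, 57): final feature X_F
        return self.fc(x_f)                                    # (B, max_L, 8): predicted value P (logits)
```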
After the convolution layer and the full-connection layer which are used for sequentially processing the global feature matrix to obtain the predicted value of the protein secondary structure, the method further comprises the following steps:
and calculating a prediction error by using a label distribution perception marginal loss function, and reversely propagating and updating parameters beta and gamma of the batch normalization layer by the obtained prediction error.
Specifically, the label-distribution-aware margin loss function code is run to calculate the prediction error, and the obtained prediction error is back-propagated to update the parameters of the batch normalization layer, including β and γ, and the weights of the network. The maximum number of iteration steps is set to 200, the learning rate to 0.0001 and the batch size to 16, and the model is iterated in a loop; training ends when the prediction accuracy on the validation set no longer improves or the maximum number of iteration steps is reached, yielding the Transformer-based multi-class unbalanced protein secondary structure prediction model.
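By way of illustration, a minimal sketch of a label-distribution-aware margin (LDAM) loss of the kind referred to above is given below; the scaling factor s, the maximum margin and the assumption that per-class residue counts are available are illustrative and not taken from the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDAMLoss(nn.Module):
    """Rarer classes receive larger margins proportional to n_j^(-1/4) (Cao et al., 2019)."""
    def __init__(self, class_counts, max_m: float = 0.5, s: float = 30.0):
        super().__init__()
        counts = torch.as_tensor(class_counts, dtype=torch.float)
        m = 1.0 / counts.pow(0.25)                         # n_j^(-1/4)
        self.register_buffer("m", m * (max_m / m.max()))   # rescale so the largest margin is max_m
        self.s = s

    def forward(self, logits, target):
        # logits: (N, n_classes) flattened over residues; target: (N,)
        margin = self.m[target].unsqueeze(1)                           # per-sample margin
        target_logit = logits.gather(1, target.unsqueeze(1)) - margin  # subtract margin at the true class
        adjusted = logits.scatter(1, target.unsqueeze(1), target_logit)
        return F.cross_entropy(self.s * adjusted, target)

# Training settings from the description: at most 200 iteration steps,
# learning rate 0.0001, batch size 16; the choice of Adam is an assumption.
# optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
```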
TABLE 1
Table 1 shows a comparison with the state-of-the-art methods on the CB513 data set, with the best performance in bold. As can be seen from Table 1, on the CB513 data set the performance of the proposed model is clearly superior to the other reference methods: the prediction accuracy of the common classes among the 8 secondary-structure classes is maintained, and the prediction of the rare classes B, G and S is improved to a certain extent.
TABLE 2
Table 2 shows a comparison with the state-of-the-art methods on the CASP12 data set, with the best performance in bold. On the CASP12 data set the overall prediction accuracy is 0.02% lower than that of MUFold_SS, but for the rarer classes, such as B, S and G, the prediction accuracy improves more than with the other methods; because these classes are few in number, however, they have little impact on the overall prediction accuracy.
TABLE 3
Table 3 shows a comparison with the state-of-the-art methods on the CASP13 data set, with the best performance in bold. On the CASP13 data set the overall prediction accuracy is the highest and the improvement for the rare classes is also large, but the prediction accuracy on classes L and T still shows a certain gap compared with the other methods.
TABLE 4
Table 4 shows a comparison with the state-of-the-art methods on the CASP14 data set, with the best performance in bold. On the CASP14 data set the overall prediction accuracy improves by 0.05% over MUFold_SS; the overall improvement is limited, but the prediction accuracy for the rare classes improves by a larger margin. From the above results it can be seen that the lower the proportion (the smaller the sample size) of a structural class, the lower its Q_8 accuracy; the probability that the "I"-type structure among the eight states appears is less than one in a thousand, and almost all methods have difficulty predicting it correctly. It is reasonable to believe that the prediction effect would improve further if the sample size of the low-frequency classes could be amplified. Therefore, data augmentation for protein secondary structure prediction is a direction worth studying.
FIG. 2 is a histogram of performance comparisons of models of a multi-class unbalanced protein secondary structure prediction method under different features;
FIG. 3 is a comparative histogram of the performance of a model of a multi-class unbalanced protein secondary structure prediction method under different BN's, where BN refers to the batch normalization layer, batch Normalization;
FIG. 4 is a histogram of performance comparisons of a multi-class unbalanced protein secondary structure prediction method with or without a Transformer model.
As the information source and basis of structure prediction, different amino acid encodings contain different amounts of evolutionary information, and the exploration of feature representations was carried out under the model parameters that gave the best prediction results in the experiments. The influence of the different encoding modes on prediction accuracy is first analysed one by one, and combination experiments are then performed on them. FIG. 2 shows, on the validation set and with Q_8 accuracy as the criterion, the influence of 9 different input features on model prediction performance. Specifically, PSSM encoding and HMM encoding are first considered separately, because one-hot encoding and physicochemical-property encoding are position-independent and contain no evolutionary information. The experimental results show that the PSSM performs better than the HMM, so the evolutionary information contained in the PSSM is likely to be richer. PSSM and HMM are then each combined with one-hot encoding and with physicochemical-property encoding, and PSSM encoding is combined with HMM encoding. When PSSM encoding is combined with one-hot encoding or with physicochemical-property encoding, the model prediction performance is similar; when the HMM is combined with either of them, the performance is lower than that of the corresponding PSSM combinations, which again indicates that the PSSM contains more information, or is better suited to secondary structure prediction. When the PSSM is combined with physicochemical-property encoding, the prediction performance is lower than with one-hot encoding, which does not match expectations; however, the subsequent experiments show that, in the evolutionary process by which a protein forms a stable structure, the differing properties of the amino acids influence how a protein interacts with the surrounding amino acids in the sequence and thus influence its structure. When PSSM encoding and HMM encoding are combined, the prediction performance improves markedly relative to the previous four combinations: the PSSM contains the probabilities that the same amino acid mutates into other amino acids in different sequences during the formation of a stable protein structure, while the HMM encoding contains the match-state probabilities, transition frequencies and local diversity of different amino acids, so the two encodings contain a certain amount of complementary information that is useful for predicting the protein secondary structure. On the basis of the combination of PSSM and HMM encodings, one-hot encoding and physicochemical-property encoding are further added; the combination of PSSM encoding, HMM encoding and physicochemical-property encoding is the best encoding mode, its prediction accuracy is the highest, and this meets expectations.
For the modified batch normalization (Batch Normalization) layer, as shown in FIG. 3, after the mask matrix is introduced, Q_8 accuracy on the test set CB513 improves by 0.23% and F_1 improves by about 0.05%. During feature extraction, the feature vectors at padded positions may become non-zero vectors and affect the secondary-structure prediction results for the amino acids at non-padded positions; introducing the mask matrix alleviates this problem to a certain extent, improves the accuracy of feature extraction, and thus further improves the accuracy of secondary structure prediction.
FIG. 4 compares the network performance with and without the feature reinforcement module. As shown in the figure, after the feature reinforcement module is added to the network, the prediction accuracy improves by 0.16% and the F_1 value increases by 0.79%, which shows that adding the two-layer, eight-head Transformer Encoder before the long-range interaction step strengthens the expression of the correlations between residues in the sequence, so that the subsequent BiGRU can capture long-range dependencies more flexibly. In summary, the invention obtains the secondary structure predicted value of the target protein sequence to be predicted by inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model. The pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises: a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data; a pytorch function layer for processing the preprocessed input sample data to obtain a first output matrix and a second output matrix; a MobileNet v2 layer for iteratively processing the second output matrix to obtain a local feature matrix between protein sequences; a Transformer layer for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences; a two-layer bidirectional gated recurrent unit layer for processing the association matrix between protein sequences to obtain a global feature matrix; and a convolution layer and a full connection layer for sequentially processing the global feature matrix to obtain the protein secondary structure predicted value. When the method predicts protein secondary structure, its global dependence on amino acids is low, and the prediction accuracy for rare protein secondary-structure classes is improved.
103. And outputting the predicted value P of the secondary structure of the target protein sequence to be predicted.
FIG. 5 is a schematic structural diagram of a multi-class unbalanced protein secondary structure prediction system 200; as shown in FIG. 5, the system comprises:
an input module 201, configured to obtain a target protein sequence to be predicted.
The processing module 202 is configured to input the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model and obtain a secondary structure predicted value of the target protein sequence to be predicted. The pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises:
and the data preprocessing layer is used for carrying out weighting processing and batch normalization processing on the input sample data.
And the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix.
And the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences.
The Transformer layer is used for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences.
The two-layer bidirectional gated recurrent unit layer is used for processing the association matrix between protein sequences to obtain the global feature matrix.
And the convolution layer and the full-connection layer are used for sequentially processing the global feature matrix to obtain the predicted value of the protein secondary structure.
And the output module 203 is used for outputting the predicted value of the secondary structure of the target protein sequence to be predicted.
It will be appreciated by those skilled in the art that the present invention may take the form of a computer program product embodied on one or more computer-usable storage media. Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims (6)

1. A multi-class unbalanced protein secondary structure prediction method, comprising:
obtaining a target protein sequence to be predicted;
inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted; the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix;
the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
the Transformer layer is used for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences;
the two-layer bidirectional gated recurrent unit layer is used for processing the association matrix between protein sequences to obtain a global feature matrix;
the convolution layer and the full connection layer are used for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
outputting a secondary structure predicted value of the target protein sequence to be predicted;
the pyrach function layer comprises an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function and an nn.Dropout function which are connected in sequence, wherein parameters of the nn.Conv1d function comprise a convolution size of 57 output channels, 448 input channels, 3 convolution kernels, padding=1, bias bias=false, the input of the F.relu function is the output of the nn.Conv1d function, the dimension of the MaskedBatchNorm1d function is 448, and the probability that neurons of the nn.Dropout function are not activated is 0.2;
the Transformer has two encoder layers, and each encoder is provided with a set of eight attention heads;
the convolution layer has 57 output channels, the convolution kernel of the convolution layer is 1, and the full connection layer has an input size of 57 and an output size of 8.
2. The method of claim 1, wherein the step of weighting and batch normalizing the input sample data comprises:
processing the input sample data according to a feature matrix obtained by horizontally splicing a position-specific scoring matrix, a hidden Markov model feature matrix and an amino acid physicochemical property feature matrix, to obtain weighted sample data;
respectively obtaining mask matrixes with all initial elements being false and input feature matrixes with all initial elements being 0;
according to protein chain indexes, protein chain lengths and protein chain maximum lengths obtained by traversing protein sequences of each batch in the weighted sample data, an updated mask matrix and an updated input feature matrix are obtained;
respectively calculating the mean value and the variance of the updated mask matrix and the updated input feature matrix by using a preset first formula and a preset second formula;
and processing the mean and variance by using a batch normalization layer pre-constructed according to a preset third formula to obtain a batch normalization output matrix.
3. The multi-class unbalanced protein secondary structure prediction method according to claim 2, wherein the preset third formula is:

X* = γ · (X_{B,C,max_L} − μ) / sqrt(var + ε) + β

wherein μ represents the mean obtained by processing the updated mask matrix and the updated input feature matrix with the preset first formula, var represents the variance obtained by processing the updated mask matrix and the updated input feature matrix with the preset second formula, X_{B,C,max_L} represents the input weighted sample data, X* represents the output of the batch normalization layer, and β and γ are the parameters to be learned in the batch normalization layer, where ε = 0.01;

the preset first formula and the preset second formula are respectively:

μ = Σ_{b,l} M_{bl} · X_{bcl} / Σ_{b,l} M_{bl}

var₁ = Σ_{b,l} M_{bl} · (X_{bcl} − μ)² / Σ_{b,l} M_{bl}

wherein the sums run over b = 1, …, B and l = 1, …, max_L for each channel c, μ represents the mean of all sample data, var₁ represents the variance of all sample data, M_{bl} denotes the mask matrix, X_{bcl} is the element value of each amino acid vector in each protein sequence, B is the batch size (batchsize) set during training, here set to 16, so b ∈ [1, 16], max_L is the maximum sequence length in the batch of proteins, and l indexes the sequence positions of the proteins in the batch.
4. The method of claim 1, wherein iteratively processing the second output matrix comprises:
taking the second output matrix output by the nn.Dropout function as the initial input matrix of the MobileNet v2 layer to obtain an initial output matrix;
taking the initial output matrix as the input of the MobileNet v2 layer, and performing a plurality of iterations with the intermediate output matrix processed by the MobileNet v2 layer as the input of the MobileNet v2 layer, to obtain the local feature matrix between protein sequences.
5. A method for predicting a secondary structure of a multi-class unbalanced protein according to claim 3, wherein after the convolution layer and the full-connection layer for sequentially processing the global feature matrix to obtain the predicted value of the secondary structure of the protein, the method further comprises:
calculating a prediction error by using a label-distribution-aware margin loss function, and back-propagating the obtained prediction error to update the parameters β and γ of the batch normalization layer.
6. A multi-class unbalanced protein secondary structure prediction system, comprising:
the input module is used for acquiring a target protein sequence to be predicted;
the processing module is used for inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted; the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix;
the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
the Transformer layer is used for processing a third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences;
the two-layer bidirectional gated recurrent unit layer is used for processing the association matrix between protein sequences to obtain a global feature matrix;
the convolution layer and the full connection layer are used for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
the output module is used for outputting the predicted value of the secondary structure of the target protein sequence to be predicted;
the pyrach function layer comprises an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function and an nn.Dropout function which are connected in sequence, wherein parameters of the nn.Conv1d function comprise a convolution size of 57 output channels, 448 input channels, 3 convolution kernels, padding=1, bias bias=false, the input of the F.relu function is the output of the nn.Conv1d function, the dimension of the MaskedBatchNorm1d function is 448, and the probability that neurons of the nn.Dropout function are not activated is 0.2;
the Transformer has two encoder layers, and each encoder is provided with a set of eight attention heads;
the convolution layer has 57 output channels, the convolution kernel of the convolution layer is 1, and the full connection layer has an input size of 57 and an output size of 8.
CN202311804115.1A 2023-12-26 2023-12-26 Multi-class unbalanced protein secondary structure prediction method and system Active CN117476106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311804115.1A CN117476106B (en) 2023-12-26 2023-12-26 Multi-class unbalanced protein secondary structure prediction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311804115.1A CN117476106B (en) 2023-12-26 2023-12-26 Multi-class unbalanced protein secondary structure prediction method and system

Publications (2)

Publication Number Publication Date
CN117476106A CN117476106A (en) 2024-01-30
CN117476106B (en) 2024-04-02

Family

ID=89633271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311804115.1A Active CN117476106B (en) 2023-12-26 2023-12-26 Multi-class unbalanced protein secondary structure prediction method and system

Country Status (1)

Country Link
CN (1) CN117476106B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118658528A (en) * 2024-08-20 2024-09-17 电子科技大学长三角研究院(衢州) Construction method of specific myoglobin prediction model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
CN112767997A (en) * 2021-02-04 2021-05-07 齐鲁工业大学 Protein secondary structure prediction method based on multi-scale convolution attention neural network
CN113178229A (en) * 2021-05-31 2021-07-27 吉林大学 Deep learning-based RNA and protein binding site recognition method
CN114974397A (en) * 2021-02-23 2022-08-30 腾讯科技(深圳)有限公司 Training method of protein structure prediction model and protein structure prediction method
CN115458039A (en) * 2022-08-08 2022-12-09 北京分子之心科技有限公司 Single-sequence protein structure prediction method and system based on machine learning
CN115662501A (en) * 2022-10-25 2023-01-31 浙江大学杭州国际科创中心 Protein generation method based on position specificity weight matrix
CN116486900A (en) * 2023-04-25 2023-07-25 徐州医科大学 Drug target affinity prediction method based on depth mode data fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220122689A1 (en) * 2020-10-15 2022-04-21 Salesforce.Com, Inc. Systems and methods for alignment-based pre-training of protein prediction models

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
CN112767997A (en) * 2021-02-04 2021-05-07 齐鲁工业大学 Protein secondary structure prediction method based on multi-scale convolution attention neural network
CN114974397A (en) * 2021-02-23 2022-08-30 腾讯科技(深圳)有限公司 Training method of protein structure prediction model and protein structure prediction method
CN113178229A (en) * 2021-05-31 2021-07-27 吉林大学 Deep learning-based RNA and protein binding site recognition method
CN115458039A (en) * 2022-08-08 2022-12-09 北京分子之心科技有限公司 Single-sequence protein structure prediction method and system based on machine learning
CN115662501A (en) * 2022-10-25 2023-01-31 浙江大学杭州国际科创中心 Protein generation method based on position specificity weight matrix
CN116486900A (en) * 2023-04-25 2023-07-25 徐州医科大学 Drug target affinity prediction method based on depth mode data fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Protein secondary structure prediction based on a self-attention mechanism and GAN; 杨璐, 董洪伟; 《中国科技论文在线精品论文》; 2023-06-15; Vol. 16, No. 02; 148-159 *

Also Published As

Publication number Publication date
CN117476106A (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN111667884B (en) Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
CN111291183B (en) Method and device for carrying out classification prediction by using text classification model
CN107622182B (en) Method and system for predicting local structural features of protein
CN117476106B (en) Multi-class unbalanced protein secondary structure prediction method and system
CN111105013B (en) Optimization method of countermeasure network architecture, image description generation method and system
CN109598387A (en) Forecasting of Stock Prices method and system based on two-way cross-module state attention network model
EP3912042B1 (en) A deep learning model for learning program embeddings
CN114743600B (en) Deep learning prediction method of target-ligand binding affinity based on gated attention mechanism
CN112258262A (en) Conversation recommendation method based on convolution self-attention network
CN115222998B (en) Image classification method
Tian et al. Joint learning model for underwater acoustic target recognition
CN115422369B (en) Knowledge graph completion method and device based on improved TextRank
CN114528835A (en) Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN114708903A (en) Method for predicting distance between protein residues based on self-attention mechanism
CN112151127A (en) Unsupervised learning drug virtual screening method and system based on molecular semantic vector
CN112488301A (en) Food inversion method based on multitask learning and attention mechanism
Downey et al. alineR: An R package for optimizing feature-weighted alignments and linguistic distances
CN117976035A (en) Protein SNO site prediction method of feature fusion deep learning network
CN117494815A (en) File-oriented credible large language model training and reasoning method and device
Onu et al. A fully tensorized recurrent neural network
Eyraud et al. TAYSIR Competition: Transformer+\textscrnn: Algorithms to Yield Simple and Interpretable Representations
CN117831609A (en) Protein secondary structure prediction method and device and computer device
CN112884019B (en) Image language conversion method based on fusion gate circulation network model
CN115964475A (en) Dialogue abstract generation method for medical inquiry
CN113177608A (en) Neighbor model feature selection method and device for incomplete data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant