CN112837741B - Protein secondary structure prediction method based on cyclic neural network - Google Patents
- Publication number
- CN112837741B (application number CN202110097155.1A)
- Authority
- CN
- China
- Prior art keywords
- protein sequence
- secondary structure
- residue
- protein
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/20—Protein or domain folding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
According to the method, the input protein sequence to be subjected to secondary structure prediction is first encoded by one-hot encoding to obtain an L×20 feature matrix M1, and a protein position-specific scoring matrix (PSSM) is generated using the PSI-BLAST program. Then, M1 and the PSSM are added by matrix addition to obtain a matrix M2. Next, a feature vector is obtained for each residue. Third, a recurrent neural network framework is constructed, a sample set is built from protein sequences with known secondary structure labels obtained from the PDB database, and the constructed recurrent neural network is trained. Finally, the feature vectors of the residues in the protein sequence to be predicted are input into the trained model, and the secondary structure class of each residue in the protein sequence is predicted from the output probability values. The method has low computational cost and high prediction accuracy.
Description
Technical Field
The invention relates to the fields of bioinformatics, deep learning, and computer applications, and in particular to a protein secondary structure prediction method based on a recurrent neural network.
Background
Prediction of the secondary structure of a protein is an important intermediate step in the prediction of its tertiary structure, and serves as a bridge linking the protein sequence and the tertiary structure. Accurate identification of protein secondary structure helps reveal the complex dependency between protein sequences and tertiary structures, and also supports protein functional analysis and drug design.
Current deep learning methods for protein secondary structure prediction include: SSpro8 (Pollastri G., et al. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles [J]. Proteins: Structure, Function & Genetics, 2002), which predicts 3-class and 8-class protein secondary structure using recurrent neural networks and profile files; BLSTM (Sønderby S. K., Winther O. Protein Secondary Structure Prediction with Long Short Term Memory Networks [J]. Computer Science, 2014), which predicts protein secondary structure with long short-term memory networks; and GSN (Zhou J., Troyanskaya O. Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction [J]. Computer Science, 2014: 745-753). Compared with traditional machine learning methods, deep learning-based methods can more fully extract amino acid features and hidden patterns in protein sequences. Although existing protein secondary structure prediction methods have achieved good results, they extract features only along the amino acid residue dimension. As a result, these methods may ignore important features hidden in the protein sequence feature vectors that could be useful for predicting secondary structure.
In summary, existing protein secondary structure prediction methods still fall short of practical application requirements in terms of computational cost and prediction accuracy, and improvement is urgently needed.
Disclosure of Invention
In order to overcome the shortcomings of existing protein secondary structure prediction methods in terms of computational cost and prediction accuracy, the invention provides a protein secondary structure prediction method based on a recurrent neural network that has low computational cost and high prediction accuracy.
The technical scheme adopted for solving the technical problems is as follows:
a method for predicting protein secondary structure based on a recurrent neural network, the method comprising the steps of:
1) Inputting a protein sequence having L residues, to be subjected to protein secondary structure prediction, and denoting it S;
2) Encoding the protein sequence S by one-hot encoding to obtain a feature matrix of size L×20, denoted M1, whose element in the ith row and jth column is expressed as:

M1(i, j) = 1 if Ti = Aj, and M1(i, j) = 0 otherwise,

wherein A1, A2, ..., A20 represent the residues of the 20 common amino acid types, and Ti denotes the type of the ith residue in sequence S;
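As an illustration of the one-hot encoding in step 2), a minimal sketch follows. The alphabetical ordering of the 20 single-letter amino acid codes and the function name are assumptions; the method only requires some fixed ordering A1, ..., A20.

```python
import numpy as np

# Assumed fixed ordering A1..A20 of the 20 common amino acid types
# (alphabetical single-letter codes; any fixed order works for the method).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_encode(sequence: str) -> np.ndarray:
    """Return M1 of shape (L, 20) with M1[i, j] = 1 if residue i is
    amino acid Aj, and 0 otherwise."""
    index = {aa: j for j, aa in enumerate(AMINO_ACIDS)}
    m1 = np.zeros((len(sequence), 20))
    for i, residue in enumerate(sequence):
        m1[i, index[residue]] = 1.0
    return m1
```

Each row of the resulting matrix contains exactly one 1, marking the residue type at that position.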
3) Scoring the protein sequence S using the PSI-BLAST (https://www.ebi.ac.uk/Tools/sss/psiblast/) program to generate a protein position-specific scoring matrix, denoted PSSM;
4) Adding M1 and the PSSM by matrix addition to obtain a matrix of size L×20, denoted M2;
5) Computing the mean of each row of matrix M2 to obtain a vector of size L×1, denoted H;
6) Computing the variance of each row of matrix M2 to obtain a vector of size L×1, denoted V;
7) For the protein sequence S, obtaining a feature matrix F of size L×20 by the following formula, in which the element in the ith row and jth column of F is expressed as:

F(i, j) = (M2(i, j) - Hi) / Vi,

wherein M2(i, j) denotes the element in the ith row and jth column of M2, Hi denotes the ith element of H, and Vi denotes the ith element of V;
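Steps 5) to 7) amount to normalizing each row of M2 by its own mean and variance. A minimal sketch, assuming a NumPy representation of M2 (the function name `residue_features` is hypothetical); note that the division is by the row variance, as the formula states, not by the standard deviation:

```python
import numpy as np

def residue_features(m2: np.ndarray) -> np.ndarray:
    """Compute F[i, j] = (M2[i, j] - H_i) / V_i, where H and V hold the
    per-row means and variances of M2 (steps 5-7)."""
    h = m2.mean(axis=1, keepdims=True)  # L x 1 vector H of row means
    v = m2.var(axis=1, keepdims=True)   # L x 1 vector V of row variances
    return (m2 - h) / v
```

Each row of the returned matrix F is then the feature vector of the corresponding residue (step 8).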
8) Taking any row fi, i = 1, 2, ..., L, of the matrix F as the feature vector of the ith residue in the protein sequence S;
9) Obtaining protein sequences with known residue secondary structure classes from the PDB (https://www.rcsb.org/) database as a training set, wherein the residue secondary structure classes comprise buried, intermediate, and exposed; generating the feature vectors of all residues in the training set using steps 2)-8), and combining them with the residue secondary structure class labels to construct the training sample set;
10) Constructing a recurrent neural network for predicting the secondary structure classes of the residues in the protein sequence S, the network consisting of three parts: the first part is a convolutional part composed of a convolutional layer with kernel size 1×20, a convolutional layer with kernel size 3×1, a normalization layer, and a pooling layer; the second part is a recurrent part composed of two LSTM layers; the last part is a fully connected part composed of two fully connected layers, wherein the output of each layer serves as the input of the next layer, and a sigmoid activation function keeps the output values of the network in the range (0, 1);
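The three-part network of step 10) can be sketched roughly as follows in PyTorch. The channel and hidden sizes, the pooling configuration, and the use of bidirectional LSTMs (suggested by the BLSTM remark under the beneficial effects) are assumptions; the patent itself fixes only the kernel sizes, the layer counts, and the sigmoid output.

```python
import torch
import torch.nn as nn

class SecondaryStructureRNN(nn.Module):
    """Sketch of the three-part network: convolutional part, recurrent
    part (two LSTM layers), fully connected part with sigmoid output."""
    def __init__(self, channels=64, hidden=64, n_classes=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(1, 20)),   # 1 x 20 kernel
            nn.Conv2d(channels, channels, kernel_size=(3, 1),
                      padding=(1, 0)),                     # 3 x 1 kernel
            nn.BatchNorm2d(channels),                      # normalization layer
            nn.MaxPool2d(kernel_size=(3, 1), stride=1,
                         padding=(1, 0)),                  # length-preserving pooling (assumed)
        )
        self.lstm = nn.LSTM(channels, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes), nn.Sigmoid(),    # outputs in (0, 1)
        )

    def forward(self, f):                  # f: (batch, L, 20) feature matrix F
        x = self.conv(f.unsqueeze(1))      # -> (batch, C, L, 1)
        x = x.squeeze(3).transpose(1, 2)   # -> (batch, L, C)
        x, _ = self.lstm(x)                # -> (batch, L, 2 * hidden)
        return self.fc(x)                  # per-residue class probabilities
```

The 1×20 kernel collapses the 20 feature channels per residue, while the 3×1 kernel relates neighboring residues, matching the asymmetric-convolution rationale in the description.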
11) Training the recurrent neural network constructed in step 10) using the training sample set constructed in step 9), the training stage adjusting the parameters in the network using a cross-entropy loss function, written as:

Y = -[u·log(ŷ) + (1 - u)·log(1 - ŷ)],

wherein u denotes the true label of the residue to be predicted in the protein sequence, ŷ denotes the output value of the network model corresponding to the predicted residue class, and Y denotes the difference between the predicted output and the true label;
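Taking u as a per-residue label and ŷ as a sigmoid output, the loss of step 11) can be written out directly. The binary form of the cross-entropy is an assumption consistent with the (0, 1) sigmoid output described in step 10):

```python
import math

def cross_entropy(u: float, y_hat: float) -> float:
    """Binary cross-entropy Y = -(u*log(y_hat) + (1-u)*log(1-y_hat)),
    where u is the true label and y_hat the model output in (0, 1)."""
    return -(u * math.log(y_hat) + (1 - u) * math.log(1 - y_hat))
```

The loss shrinks toward 0 as ŷ approaches u, which is what drives the parameter updates during training.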
12) Inputting the feature vectors of the residues in the sequence S into the model trained in step 11), the model outputting a probability value for each residue secondary structure class, wherein the class with the maximum probability value is the predicted secondary structure class of the residue.
The technical concept of the invention is as follows: first, the input protein sequence to be subjected to secondary structure prediction is encoded by one-hot encoding to obtain an L×20 feature matrix M1, and a protein position-specific scoring matrix PSSM is generated using the PSI-BLAST program. Then, M1 and the PSSM are added by matrix addition to obtain a matrix M2. Next, the feature vector of each residue is obtained as follows: the mean and variance of each row of matrix M2 are computed; from each element of M2 the mean of its row is subtracted, and the result is divided by the variance of its row, producing the corresponding element of a feature matrix F, wherein each row of F is the feature vector of the corresponding residue. Third, a recurrent neural network framework is constructed; a sample set is built from protein sequences with known secondary structure labels obtained from the PDB database, and the constructed recurrent neural network is trained. Finally, the feature vectors of the residues in the protein sequence to be predicted are input into the trained model, and the secondary structure class of each residue in the protein sequence is predicted from the output probability values. The invention thereby provides a protein secondary structure prediction method based on a recurrent neural network with low computational cost and high prediction accuracy.
The beneficial effects of the invention are as follows: on the one hand, the convolutional part of the network uses asymmetric convolutions to extract local relations between residues, laying the groundwork for further improving the accuracy of protein secondary structure prediction; on the other hand, the BLSTM part of the network captures long-range interdependencies among residues, which better ensures the accuracy of protein secondary structure prediction.
Drawings
FIG. 1 is a schematic diagram of the protein secondary structure prediction method based on a recurrent neural network.
FIG. 2 shows the secondary structure prediction result obtained for the protein sequence 1F2C using the recurrent-neural-network-based protein secondary structure prediction method.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to FIG. 1 and FIG. 2, a protein secondary structure prediction method based on a recurrent neural network includes the following steps:
1) Inputting a protein sequence having L residues, to be subjected to protein secondary structure prediction, and denoting it S;
2) Encoding the protein sequence S by one-hot encoding to obtain a feature matrix of size L×20, denoted M1, whose element in the ith row and jth column is expressed as:

M1(i, j) = 1 if Ti = Aj, and M1(i, j) = 0 otherwise,

wherein A1, A2, ..., A20 represent the residues of the 20 common amino acid types, and Ti denotes the type of the ith residue in sequence S;
3) Scoring the protein sequence S using the PSI-BLAST (https://www.ebi.ac.uk/Tools/sss/psiblast/) program to generate a protein position-specific scoring matrix, denoted PSSM;
4) Adding M1 and the PSSM by matrix addition to obtain a matrix of size L×20, denoted M2;
5) Computing the mean of each row of matrix M2 to obtain a vector of size L×1, denoted H;
6) Computing the variance of each row of matrix M2 to obtain a vector of size L×1, denoted V;
7) For the protein sequence S, obtaining a feature matrix F of size L×20 by the following formula, in which the element in the ith row and jth column of F is expressed as:

F(i, j) = (M2(i, j) - Hi) / Vi,

wherein M2(i, j) denotes the element in the ith row and jth column of M2, Hi denotes the ith element of H, and Vi denotes the ith element of V;
8) Taking any row fi, i = 1, 2, ..., L, of the matrix F as the feature vector of the ith residue in the protein sequence S;
9) Obtaining protein sequences with known residue secondary structure classes from the PDB (https://www.rcsb.org/) database as a training set, wherein the residue secondary structure classes comprise buried, intermediate, and exposed; generating the feature vectors of all residues in the training set using steps 2)-8), and combining them with the residue secondary structure class labels to construct the training sample set;
10) Constructing a recurrent neural network for predicting the secondary structure classes of the residues in the protein sequence S, the network consisting of three parts: the first part is a convolutional part composed of a convolutional layer with kernel size 1×20, a convolutional layer with kernel size 3×1, a normalization layer, and a pooling layer; the second part is a recurrent part composed of two LSTM layers; the last part is a fully connected part composed of two fully connected layers, wherein the output of each layer serves as the input of the next layer, and a sigmoid activation function keeps the output values of the network in the range (0, 1);
11) Training the recurrent neural network constructed in step 10) using the training sample set constructed in step 9), the training stage adjusting the parameters in the network using a cross-entropy loss function, written as:

Y = -[u·log(ŷ) + (1 - u)·log(1 - ŷ)],

wherein u denotes the true label of the residue to be predicted in the protein sequence, ŷ denotes the output value of the network model corresponding to the predicted residue class, and Y denotes the difference between the predicted output and the true label;
12) Inputting the feature vectors of the residues in the protein sequence S into the model trained in step 11), the model outputting a probability value for each residue secondary structure class, wherein the class with the maximum probability value is the predicted secondary structure class of the residue.
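Step 12) reduces to an argmax over the per-class probability values. A minimal sketch, where the ordering of the three class labels from step 9) is an assumption:

```python
def predict_classes(probabilities):
    """For each residue, pick the class with the largest probability
    value (step 12). The label ordering below is assumed."""
    labels = ("buried", "intermediate", "exposed")
    return [labels[max(range(len(p)), key=p.__getitem__)]
            for p in probabilities]
```

For example, a residue whose three output probabilities peak in the third position would be assigned the third class.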
This embodiment takes the prediction of the secondary structure of protein 1F2C as an example; the recurrent-neural-network-based protein secondary structure prediction method comprises the following steps:
1) Inputting the protein sequence 1F2C, having 215 residues, to be subjected to protein secondary structure prediction, and denoting it S;
2) Encoding the protein sequence S by one-hot encoding to obtain a matrix of size 215×20, denoted M1, whose element in the ith row and jth column is expressed as:

M1(i, j) = 1 if Ti = Aj, and M1(i, j) = 0 otherwise,

wherein A1, A2, ..., A20 represent the 20 common amino acid types, and Ti denotes the type of the ith residue in the protein sequence 1F2C;
3) Scoring the protein sequence S using the PSI-BLAST (https://www.ebi.ac.uk/Tools/sss/psiblast/) program to generate a protein position-specific scoring matrix, denoted PSSM;
4) Adding M1 and the PSSM by matrix addition to obtain a matrix of size L×20, denoted M2;
5) Computing the mean of each row of matrix M2 to obtain a vector of size L×1, denoted H;
6) Computing the variance of each row of matrix M2 to obtain a vector of size L×1, denoted V;
7) For the protein sequence S, obtaining a feature matrix F of size L×20 by the following formula, in which the element in the ith row and jth column of F is expressed as:

F(i, j) = (M2(i, j) - Hi) / Vi,

wherein M2(i, j) denotes the element in the ith row and jth column of M2, Hi denotes the ith element of H, and Vi denotes the ith element of V;
8) Taking any row fi, i = 1, 2, ..., L, of the matrix F as the feature vector of the ith residue in the protein sequence S;
9) Obtaining protein sequences with known residue secondary structure classes from the PDB (https://www.rcsb.org/) database as a training set, wherein the residue secondary structure classes comprise buried, intermediate, and exposed; generating the feature vectors of all residues in the training set using steps 2)-8), and combining them with the residue secondary structure class labels to construct the training sample set;
10) Constructing a recurrent neural network for predicting the secondary structure classes of the residues in the protein sequence S, the network consisting of three parts: the first part is a convolutional part composed of a convolutional layer with kernel size 1×20, a convolutional layer with kernel size 3×1, a normalization layer, and a pooling layer; the second part is a recurrent part composed of two LSTM layers; the last part is a fully connected part composed of two fully connected layers, wherein the output of each layer serves as the input of the next layer, and a sigmoid activation function keeps the output values of the network in the range (0, 1);
11) Training the recurrent neural network constructed in step 10) using the training sample set constructed in step 9), the training stage adjusting the parameters in the network using a cross-entropy loss function, written as:

Y = -[u·log(ŷ) + (1 - u)·log(1 - ŷ)],

wherein u denotes the true label of the residue to be predicted in the protein sequence, ŷ denotes the output value of the network model corresponding to the predicted residue class, and Y denotes the difference between the predicted output and the true label;
12) Inputting the feature vectors of the residues in the protein sequence S into the model trained in step 11), the model outputting a probability value for each residue secondary structure class, wherein the class with the maximum probability value is the predicted secondary structure class of the residue.
The secondary structure prediction result obtained for the protein sequence 1F2C using the above method is shown in FIG. 2.
The above description presents the prediction result obtained by taking the secondary structure of the protein sequence 1F2C as an example and does not limit the scope of implementation of the invention; various modifications and improvements may be made without departing from the scope of the invention, and such modifications and improvements shall not be excluded from the protection scope of the invention.
Claims (1)
1. A method for predicting a protein secondary structure based on a recurrent neural network, the method comprising the steps of:
1) Inputting a protein sequence having L residues, to be subjected to protein secondary structure prediction, and denoting it S;
2) Encoding the protein sequence S by one-hot encoding to obtain a feature matrix of size L×20, denoted M1, whose element in the ith row and jth column is expressed as:

M1(i, j) = 1 if Ti = Aj, and M1(i, j) = 0 otherwise,

wherein A1, A2, ..., A20 represent the residues of the 20 common amino acid types, and Ti denotes the type of the ith residue in sequence S;
3) Scoring the protein sequence S using the PSI-BLAST program to generate a protein position-specific scoring matrix, denoted PSSM;
4) Adding M1 and the PSSM by matrix addition to obtain a matrix of size L×20, denoted M2;
5) Computing the mean of each row of matrix M2 to obtain a vector of size L×1, denoted H;
6) Computing the variance of each row of matrix M2 to obtain a vector of size L×1, denoted V;
7) For the protein sequence S, obtaining a feature matrix F of size L×20 by the following formula, in which the element in the ith row and jth column of F is expressed as:

F(i, j) = (M2(i, j) - Hi) / Vi,

wherein M2(i, j) denotes the element in the ith row and jth column of M2, Hi denotes the ith element of H, and Vi denotes the ith element of V;
8) Taking any row fi, i = 1, 2, ..., L, of the matrix F as the feature vector of the ith residue in the protein sequence S;
9) Obtaining protein sequences with known residue secondary structure classes from the PDB database as a training set, wherein the residue secondary structure classes comprise buried, intermediate, and exposed; generating the feature vectors of all residues in the training set using steps 2)-8), and combining them with the residue secondary structure class labels to construct the training sample set;
10) Constructing a recurrent neural network for predicting the secondary structure classes of the residues in the protein sequence S, the network consisting of three parts: the first part is a convolutional part composed of a convolutional layer with kernel size 1×20, a convolutional layer with kernel size 3×1, a normalization layer, and a pooling layer; the second part is a recurrent part composed of two LSTM layers; the last part is a fully connected part composed of two fully connected layers, wherein the output of each layer serves as the input of the next layer, and a sigmoid activation function keeps the output values of the network in the range (0, 1);
11) Training the network constructed in step 10) using the training sample set constructed in step 9), the training stage adjusting the parameters in the network using a cross-entropy loss function, written as:

Y = -[u·log(ŷ) + (1 - u)·log(1 - ŷ)],

wherein u denotes the true label of the residue to be predicted in the protein sequence, ŷ denotes the output value of the network model corresponding to the predicted residue class, and Y denotes the difference between the predicted output and the true label;
12) Inputting the feature vectors of the residues in the protein sequence S into the model trained in step 11), the model outputting a probability value for each residue secondary structure class, wherein the class with the maximum probability value is the predicted secondary structure class of the residue.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110097155.1A CN112837741B (en) | 2021-01-25 | 2021-01-25 | Protein secondary structure prediction method based on cyclic neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110097155.1A CN112837741B (en) | 2021-01-25 | 2021-01-25 | Protein secondary structure prediction method based on cyclic neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112837741A CN112837741A (en) | 2021-05-25 |
CN112837741B true CN112837741B (en) | 2024-04-16 |
Family
ID=75931348
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110097155.1A Active CN112837741B (en) | 2021-01-25 | 2021-01-25 | Protein secondary structure prediction method based on cyclic neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112837741B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113851192B (en) * | 2021-09-15 | 2023-06-30 | 安庆师范大学 | Training method and device for amino acid one-dimensional attribute prediction model and attribute prediction method |
CN114496068A (en) * | 2022-01-27 | 2022-05-13 | 中国农业银行股份有限公司 | Protein secondary structure prediction method, device, equipment and storage medium |
CN115312119B (en) * | 2022-10-09 | 2023-04-07 | 之江实验室 | Method and system for identifying protein structural domain based on protein three-dimensional structure image |
CN116312754B (en) * | 2023-03-16 | 2023-10-03 | 安庆师范大学 | Protein structure prediction method based on mixed deep learning model |
CN116705150A (en) * | 2023-06-05 | 2023-09-05 | 国家超级计算天津中心 | Method, device, equipment and medium for determining gene expression efficiency |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110689920A (en) * | 2019-09-18 | 2020-01-14 | 上海交通大学 | Protein-ligand binding site prediction algorithm based on deep learning |
CN111667880A (en) * | 2020-05-27 | 2020-09-15 | 浙江工业大学 | Protein residue contact map prediction method based on depth residual error neural network |
CN112085245A (en) * | 2020-07-21 | 2020-12-15 | 浙江工业大学 | Protein residue contact prediction method based on deep residual error neural network |
CN112149885A (en) * | 2020-09-07 | 2020-12-29 | 浙江工业大学 | Ligand binding residue prediction method based on sequence template |
- 2021-01-25: CN application CN202110097155.1A — patent CN112837741B (Active)
Also Published As
Publication number | Publication date |
---|---|
CN112837741A (en) | 2021-05-25 |
Legal Events
Date | Code | Title
---|---|---
| PB01 | Publication
| SE01 | Entry into force of request for substantive examination
| GR01 | Patent grant