CN110853703A - Semi-supervised learning prediction method for protein secondary structure - Google Patents


Info

Publication number
CN110853703A
CN110853703A (application CN201910982228.8A)
Authority
CN
China
Prior art keywords
semi
protein
neural network
secondary structure
gan
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910982228.8A
Other languages
Chinese (zh)
Inventor
宫秀军 (Gong Xiujun)
赵兴海 (Zhao Xinghai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910982228.8A priority Critical patent/CN110853703A/en
Publication of CN110853703A publication Critical patent/CN110853703A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B 15/20 Protein or domain folding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B 40/20 Supervised data analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a semi-supervised learning method for predicting protein secondary structure, comprising the following steps: (1) acquiring a protein sequence data set; (2) performing data cleaning and feature extraction on the acquired data set; (3) building a Semi-GAN neural network model; (4) training the Semi-GAN neural network model; (5) tuning the parameters of the Semi-GAN neural network model; (6) evaluating the Semi-GAN neural network model. The invention can build a semi-supervised prediction model for protein secondary structure even when a large portion of the data lacks labels, saving substantial manual annotation effort and expense.

Description

Semi-supervised learning prediction method for protein secondary structure
Technical Field
The invention relates to the fields of bioinformatics and deep learning and addresses a key prediction problem in bioinformatics: training a deep learning classification model on a protein data set with missing labels to predict protein secondary structure.
Background
Protein secondary structure prediction is the inference of the secondary structure of a protein fragment based on its amino acid sequence. In bioinformatics and theoretical chemistry, protein secondary structure prediction is very important for medicine and biotechnology, such as drug design and design of novel enzymes. Since secondary structure can be used to find distant relationships of proteins with unaligned primary structure, combining secondary structure information with simple sequence information can improve the accuracy of their alignment. Finally, protein secondary structure prediction also plays an important role in protein tertiary structure prediction. The secondary structure of the protein can determine the structural type of the partial fragment of the protein, and thus the degree of freedom of the partial fragment of the protein in the tertiary structure can be reduced. Therefore, accurate secondary structure prediction is likely to improve the accuracy of protein tertiary structure prediction.
The objective of protein secondary structure prediction is to predict whether the residue at the center of an amino acid sequence fragment is in an α-helix, a β-sheet, or a random coil. Although it is generally thought that sufficient amino acid sequence information is enough to determine the three-dimensional structure of a protein, this is difficult in practice, especially when the protein secondary structure labels are missing.
Disclosure of Invention
The object of the present invention is to overcome the deficiencies of the prior art: although several deep learning methods for secondary structure prediction have been developed, the problem of semi-supervised classification of secondary structures has not been studied before. A semi-supervised learning prediction method for protein secondary structure is therefore provided, in which the discriminator of a generative adversarial network (GAN) is modified into a classifier to perform semi-supervised prediction of protein secondary structure.
The purpose of the invention is realized by the following technical scheme:
a semi-supervised learning prediction method for protein secondary structure comprises the following steps:
(1) acquiring a protein sequence data set;
(2) carrying out data cleaning and feature extraction on the acquired data set;
(3) building a Semi-GAN neural network model; the Semi-GAN neural network model comprises a generator, a discriminator and a loss function, wherein the generator comprises three deconvolution neural networks, each followed by normalization, with a Leaky ReLU activation function adopted to prevent overfitting; the discriminator uses a network structure of convolutional neural networks, normalization, and ReLU activation functions; the loss function divides the discriminator loss into two parts: one, the unsupervised loss, represents the GAN problem; the other, the supervised loss, computes the probability of each real class; for the unsupervised loss, the discriminator must distinguish real training samples from fake samples produced by the generator; in both cases a binary classification problem is being handled; since the probability for a real sample should be close to 1 and for a fake sample close to 0, the sigmoid cross-entropy function is used to compute the loss; samples from the training set are assigned the label 1 to maximize their probability of being real; synthetic samples from the generator are assigned the label 0 to maximize their probability of being fake;
(4) training a Semi-GAN neural network model:
(5) adjusting parameters of the Semi-GAN neural network model;
(6) the Semi-GAN neural network model was evaluated.
Further, the data set used in step (1) is the CullPDB data set consisting of 6133 proteins, each protein having 39900 features; the 6133 proteins × 39900 features can be reshaped into 6133 proteins × 700 amino acids × 57 features.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects: the invention is the first to apply semi-supervised learning to protein secondary structure prediction. A semi-supervised prediction model of protein secondary structure can be built even when the protein data set has a large number of missing labels, avoiding manual annotation of that data; annotating protein sequences is difficult work that requires substantial manpower and financial resources.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
Fig. 2 is a schematic structural diagram of the Semi-GAN neural network model in this embodiment.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a method for predicting a protein secondary structure by semi-supervised learning, which is shown in figure 1 and specifically comprises the following steps:
first, acquiring a protein data set
First, a data set needs to be acquired. In this example, the data set used is the CullPDB data set, consisting of 6133 proteins with 39900 features each; the 6133 proteins × 39900 features can be reshaped into 6133 proteins × 700 amino acids × 57 features. The following table lists the protein secondary structure classes and the frequency of occurrence of each class:
description of protein secondary structure classes and class frequencies in the data set.
In this example, each amino acid chain is represented by a 700 × 57 matrix to keep the data size consistent: 700 is the peptide chain length and 57 is the number of features per amino acid. When the end of a chain is reached, the remainder of the vector is simply labeled "NoSeq" (unlabeled padding). Of the 57 features, 22 represent the primary structure (20 amino acids, 1 for an unknown amino acid, and 1 for "NoSeq"), 22 are the protein profile (laid out identically to the primary structure), and 9 cover the secondary structure (the 8 possible states plus, likewise, "NoSeq").
CB513 is a common test data set and serves as the independent test set here, while CB6133 is used as the training data set. Because of redundancy between CB513 and CB6133, CB6133 is filtered by removing sequences with more than 25% sequence similarity to sequences in CB513; after filtering, the 5534 proteins remaining in CB6133 are used as training samples. Protein sequence profiles carrying evolutionary information have been a breakthrough in protein secondary structure prediction, so Position-Specific Scoring Matrix (PSSM) features are used here; these widely used features are obtained with the DSSP program (Define Secondary Structure of Proteins, which assigns the secondary structure labels) and the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST). The data used for training contain features and labels in 56 channels (22 PSSM, 22 amino acid sequence, 2 carbon and nitrogen termini, 2 solvent accessibility labels, 8 secondary structure labels). The training data cover 700 amino acids per chain, which is believed to provide a good balance between efficiency and coverage, since most protein chains are shorter than 700 amino acids. In training and testing, shorter sequences (fewer than 700 amino acids) are padded with 0s.
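The flat-to-tensor reshape described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patent's code: the protein count is shrunk from 6133 to 64 to keep the sketch light, and the column ranges for the primary- and secondary-structure channels are an assumed layout, not one stated in the text.

```python
import numpy as np

# Illustrative stand-in for the flat CullPDB layout (the real set has
# 6133 proteins; a small count keeps this sketch light on memory).
n_proteins, seq_len, n_features = 64, 700, 57
flat = np.zeros((n_proteins, seq_len * n_features), dtype=np.float32)

# Reshape proteins x 39900 into proteins x residues x per-residue features.
data = flat.reshape(n_proteins, seq_len, n_features)

# Assumed column layout (hypothetical, for illustration only): 22
# primary-structure channels first, then 9 secondary-structure channels
# (8 states + "NoSeq" padding).
primary = data[:, :, 0:22]
secondary = data[:, :, 22:31]
```

The key point is that 700 × 57 = 39900, so the flat feature vector and the per-residue tensor carry exactly the same numbers.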
Secondly, cleaning the data and extracting the characteristics
In this example, only one protein secondary structure label is output at a time, so the amino acid sequence is specially processed. Given a database of 700-amino-acid sequences, a sliding window is set up to return batch matrices to the model for predicting protein secondary structure. The window is a small portion of the complete protein string; this sliding window is essentially similar to a one-dimensional convolution.
The window size should be greater than 11, because the average length of an α-helix is approximately 11 residues and the average length of a β-strand is approximately 6. A range of sizes from 11 to 23 was tested, and 17 produced the best results (performance/training-time trade-off). The window is shifted one unit at a time, predicting the central amino acid.
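The windowing step can be sketched as follows. This is a hedged illustration of the idea, not the patent's implementation; the zero-padding at both ends (so the first and last residues also get a window) is an assumption consistent with the later statement that short sequences are padded with 0s.

```python
import numpy as np

def sliding_windows(features, window=17):
    """Cut a (seq_len, n_features) protein into overlapping windows.

    One window per central residue; the sequence is zero-padded at both
    ends so every residue, including the first and last, gets a window.
    Window size 17 follows the performance/training-time trade-off above.
    """
    half = window // 2
    seq_len, _ = features.shape
    padded = np.pad(features, ((half, half), (0, 0)))
    return np.stack([padded[i:i + window] for i in range(seq_len)])

# Toy protein: 30 residues x 57 features.
protein = np.random.rand(30, 57)
batch = sliding_windows(protein)   # one (17, 57) window per residue
```

Shifting by one unit per window is what makes this "essentially similar to a one-dimensional convolution" with stride 1.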
The cullpdb+profile_6133 data set is processed with this method, and the final result is split 80% as a training set, 10% as a cross-validation set, and 10% as a test set. The same operation is applied to the CB513 data set, which serves as the model's independent test set.
Thirdly, building a Semi-GAN neural network model
When constructing a GAN for generating samples, the generator and the discriminator are trained simultaneously. After training, the generator can be discarded, because it is only used to train the discriminator/classifier. In this embodiment, the generator is used only to help the discriminator during training: it acts as an additional source of information from which the discriminator obtains, in effect, raw unlabeled training data. These unlabeled data are key to improving the performance of the discriminator. Furthermore, in a conventional sample-generating GAN the discriminator has only one role: computing the probability that its input is real.
First, the work to be done: to turn the discriminator into a semi-supervised classifier, the discriminator must learn the probability of each class in the raw data set in addition to solving the GAN problem. In other words, for each input datum, the discriminator must produce its specific class probabilities. An ordinary generative GAN discriminator has a single sigmoid output unit, whose value represents the probability that the input data is real (value close to 1) or fake (value close to 0). That is, from the discriminator's perspective, a value close to 1 means the sample is likely from the training set, and a value close to 0 means the sample is likely from the generator network. Through this probability, the discriminator sends a signal to the generator, which lets the generator adjust its parameters during training and thereby improve its ability to create realistic data.
Second, given the 8-state classification of protein secondary structure, the discriminator (from the ordinary GAN) must be converted into a 9-class classifier. For this purpose, its sigmoid output can be replaced with a softmax over 9 outputs: the first 8 give the class probabilities for the protein secondary structure data set, and the 9th class covers all fake data from the generator. If the 9th-class probability is set to 0, then the sum of the first 8 probabilities represents the same probability that would be computed with the sigmoid function.
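The relationship between the 9-way softmax and the original sigmoid can be verified numerically. A minimal sketch, assuming illustrative logit values: with the fake-class logit fixed at 0, the summed probability of the 8 real classes equals the sigmoid of the log-sum-exp of the real-class logits, recovering the single real/fake probability the text refers to.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# 9 logits: 8 real secondary-structure classes + 1 class for fake samples.
real_logits = np.array([1.0, 0.5, -0.2, 0.3, 0.0, -1.0, 0.7, 0.2])
logits = np.append(real_logits, 0.0)   # fake-class logit fixed at 0
probs = softmax(logits)

# "Is this sample real?" = sum of the 8 real-class probabilities ...
p_real = probs[:8].sum()

# ... which equals sigmoid(logsumexp(real_logits)).
lse = np.log(np.exp(real_logits).sum())
p_real_sigmoid = 1.0 / (1.0 + np.exp(-lse))
```

This identity is why no extra parameters are needed for the fake class: the 9th output can be held at zero.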
Finally, a penalty needs to be set so that the discriminator can perform both of the following operations:
(i) helping the generator learn to generate realistic samples. To do this, the discriminator must be instructed to distinguish between true and false samples.
(ii) The generator's samples and labeled and unlabeled training data are used to help classify the data set.
In summary, there are three different sources of training data for the discriminator.
Real data with labels: these are data-label pairs, as in any conventional supervised classification problem. Real data without labels: for these, the classifier only learns that the data are real. Data from the generator: for these, the discriminator learns to classify them as fake samples.
The aim of the invention is to predict, with sufficient accuracy, which class a secondary structure belongs to when the protein's secondary structure label is missing. Not only the 3-state prediction but also the 8-state prediction needs attention, since the 8-state prediction conveys more structural information; see Table 1. At present, the evolutionary information in position-specific scoring matrices (PSSM) is recognized as the most suitable informative feature for this research.
Specifically, the overall structure of the Semi-GAN neural network model in this embodiment is shown in fig. 2;
the generator, following a very standard implementation described in the DCGAN paper. The method includes taking a random vector z as input. Reshaping it into a 4D tensor and inputting it into a series of deconvolution neural networks, where three deconvolution neural networks are set, and respectively performing Batch Normalization on them to accelerate the optimization of the gradient, and then using the learky ReLU function as the activation function to prevent overfitting.
The discriminator is modified into a multi-class classifier as required. Here, a DCGAN-like architecture is designed from several blocks of convolutional neural network + BN (normalization) and ReLU activation functions, with strided convolutions used to reduce the dimensionality of the feature vector. Not all convolutions perform this reduction: when the dimensionality of the feature vector is to be kept constant, the convolution uses a stride of 1, otherwise a stride of 2 is used. For stable learning, BN is applied for normalization (except in the first convolutional layer). The 2D convolution window (kernel or filter) is set to 3 × 3 for all convolutions. Any classifier can run into problems if it is not well designed, and one of the most likely drawbacks when training a large classifier on a very limited data set is overfitting. An overfitted classifier typically shows a significant gap between training error (low) and test error (high): the model captures the structure of the training data set well but cannot generalize to unseen examples because it trusts the training data too much. To prevent this, dropout regularization is applied. Finally, instead of a fully connected layer on top of the convolution stack, Global Average Pooling (GAP) is performed: the feature maps are averaged over their spatial dimensions, collapsing each to a single value.
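Global Average Pooling, as used at the top of the discriminator, reduces each feature map to one number. A minimal NumPy sketch (channel-last layout is an assumption for illustration):

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Average each channel over its spatial dimensions.

    feature_maps: (batch, height, width, channels)
    returns:      (batch, channels)
    """
    return feature_maps.mean(axis=(1, 2))

# Toy activations: batch of 2, 4x4 spatial grid, 3 channels.
x = np.arange(2 * 4 * 4 * 3, dtype=float).reshape(2, 4, 4, 3)
pooled = global_average_pooling(x)   # shape (2, 3)
```

Because GAP has no weights, it also acts as a mild regularizer compared with a fully connected layer, which fits the overfitting concern discussed above.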
Loss function: as the core of the invention, the discriminator loss is divided into two parts. One, the unsupervised loss, represents the GAN problem; the other, the supervised loss, computes the individual real-class probabilities. For the unsupervised loss, the discriminator must distinguish real training samples from fake samples produced by the generator. As in a normal GAN, half of the time the discriminator receives unlabeled samples from the training set and the other half it receives synthetic unlabeled samples from the generator; in both cases a binary classification problem is being handled. Since the probability for a real sample should be close to 1 and for a fake sample close to 0, the sigmoid cross-entropy function is used to compute the loss. Samples from the training set are assigned the label 1 to maximize their probability of being real; synthetic samples from the generator are assigned the label 0 to maximize their probability of being fake.
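The unsupervised part of the discriminator loss can be sketched directly from this description. The logit values below are made up for illustration; the structure (label 1 for training samples, label 0 for generator samples, sigmoid cross-entropy on both halves) follows the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_cross_entropy(logits, targets):
    """Binary cross-entropy on logits, averaged over the batch."""
    p = sigmoid(logits)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

# Hypothetical discriminator logits for a mixed batch: real training
# samples (target 1) and generator samples (target 0).
real_logits = np.array([2.0, 1.5, 3.0])
fake_logits = np.array([-1.0, -2.5, 0.5])

d_loss_real = sigmoid_cross_entropy(real_logits, np.ones(3))
d_loss_fake = sigmoid_cross_entropy(fake_logits, np.zeros(3))
unsupervised_loss = d_loss_real + d_loss_fake
```

The supervised part would simply add a standard softmax cross-entropy over the 8 real classes, computed only on the labeled subset of the batch.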
Fourthly, training and adjusting parameters of the model;
Finally, a grid search is used to select hyper-parameters such as the number of network layers, the learning rate, the dropout coefficient, and the Adam optimizer parameters, and prediction results are obtained for protein data sets with different labeled-data ratios. The results are as follows:
the study was semi-supervised trained using the cutlpdb, experimental tests were performed using the cb513, and to arrive at a semi-supervised learned data set to inject noise into the labels, a parameter was set to specify the label data ratio in the training set. The overall performance of the deep network (semi-GAN) was evaluated by performing several sets of experiments. In a first set of experiments, the cullpdb + profile _6133 dataset was trained and tested. 80%, 60%, 40%, 20% have been trained and all data are referred to tables 1, 2 and 3 respectively.
Table 1: predicted expression of the secondary Structure of the Q8 protein
[Table 1 is reproduced as an image in the original publication.]
Table 2: predicted expression of the secondary Structure of the Q3 protein
[Table 2 is reproduced as an image in the original publication.]
Table 3: predicted global trend table for protein secondary structure
[Table 3 is reproduced as an image in the original publication.]
The final experimental results of this embodiment match what was expected at the outset: although the accuracy of protein secondary structure prediction improves as the proportion of labeled data increases, the difference in accuracy is small. A semi-supervised prediction model can therefore be built for protein secondary structure even when a large amount of label-missing data is present, saving substantial manpower and financial resources.
The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (2)

1. A semi-supervised learning prediction method for protein secondary structure is characterized by comprising the following steps:
(1) acquiring a protein sequence data set;
(2) carrying out data cleaning and feature extraction on the acquired data set;
(3) building a Semi-GAN neural network model; the Semi-GAN neural network model comprises a generator, a discriminator and a loss function, wherein the generator comprises three deconvolution neural networks, each followed by normalization, with a Leaky ReLU activation function adopted to prevent overfitting; the discriminator uses a network structure of convolutional neural networks, normalization, and ReLU activation functions; the loss function divides the discriminator loss into two parts: one, the unsupervised loss, represents the GAN problem; the other, the supervised loss, computes the probability of each real class; for the unsupervised loss, the discriminator must distinguish real training samples from fake samples produced by the generator; in both cases a binary classification problem is being handled; in order to make the probability for a real sample close to 1 and for a fake sample close to 0, the sigmoid cross-entropy function is used to compute the loss; samples from the training set are assigned the label 1 to maximize their probability of being real; synthetic samples from the generator are assigned the label 0 to maximize their probability of being fake;
(4) training a Semi-GAN neural network model:
(5) adjusting parameters of the Semi-GAN neural network model;
(6) the Semi-GAN neural network model was evaluated.
2. The semi-supervised learning prediction method for protein secondary structure of claim 1, wherein the data set used in step (1) is the CullPDB data set consisting of 6133 proteins, each protein having 39900 features; the 6133 proteins × 39900 features can be reshaped into 6133 proteins × 700 amino acids × 57 features.
CN201910982228.8A 2019-10-16 2019-10-16 Semi-supervised learning prediction method for protein secondary structure Pending CN110853703A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910982228.8A CN110853703A (en) 2019-10-16 2019-10-16 Semi-supervised learning prediction method for protein secondary structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910982228.8A CN110853703A (en) 2019-10-16 2019-10-16 Semi-supervised learning prediction method for protein secondary structure

Publications (1)

Publication Number Publication Date
CN110853703A true CN110853703A (en) 2020-02-28

Family

ID=69597529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910982228.8A Pending CN110853703A (en) 2019-10-16 2019-10-16 Semi-supervised learning prediction method for protein secondary structure

Country Status (1)

Country Link
CN (1) CN110853703A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001329A (en) * 2020-08-26 2020-11-27 东莞太力生物工程有限公司 Method and device for predicting protein expression amount, computer device and storage medium
CN113066528A (en) * 2021-04-12 2021-07-02 山西大学 Protein classification method based on active semi-supervised graph neural network
CN113851192A (en) * 2021-09-15 2021-12-28 安庆师范大学 Amino acid one-dimensional attribute prediction model training method and device and attribute prediction method
WO2022178949A1 (en) * 2021-02-26 2022-09-01 平安科技(深圳)有限公司 Semantic segmentation method and apparatus for electron microtomography data, device, and medium
CN115312119A (en) * 2022-10-09 2022-11-08 之江实验室 Method and system for identifying protein structural domain based on protein three-dimensional structure image

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101847181A (en) * 2010-04-30 2010-09-29 天津大学 Tissue-specific gene and regulatory factor data storage method
CN102184346A (en) * 2011-05-09 2011-09-14 天津大学 Method for constructing and analyzing tissue-specific interaction topology network
WO2018028255A1 (en) * 2016-08-11 2018-02-15 深圳市未来媒体技术研究院 Image saliency detection method based on adversarial network
CN109311937A (en) * 2016-06-15 2019-02-05 新加坡科技研究局 Method of the enhancing for the chromatographic performance of protein purification
US20190122120A1 (en) * 2017-10-20 2019-04-25 Dalei Wu Self-training method and system for semi-supervised learning with generative adversarial networks
CN110097103A (en) * 2019-04-22 2019-08-06 西安电子科技大学 Based on the semi-supervision image classification method for generating confrontation network
CN110110745A (en) * 2019-03-29 2019-08-09 上海海事大学 Based on the semi-supervised x-ray image automatic marking for generating confrontation network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101847181A (en) * 2010-04-30 2010-09-29 天津大学 Tissue-specific gene and regulatory factor data storage method
CN102184346A (en) * 2011-05-09 2011-09-14 天津大学 Method for constructing and analyzing tissue-specific interaction topology network
CN109311937A (en) * 2016-06-15 2019-02-05 新加坡科技研究局 Method of the enhancing for the chromatographic performance of protein purification
WO2018028255A1 (en) * 2016-08-11 2018-02-15 深圳市未来媒体技术研究院 Image saliency detection method based on adversarial network
US20190122120A1 (en) * 2017-10-20 2019-04-25 Dalei Wu Self-training method and system for semi-supervised learning with generative adversarial networks
CN110110745A (en) * 2019-03-29 2019-08-09 上海海事大学 Based on the semi-supervised x-ray image automatic marking for generating confrontation network
CN110097103A (en) * 2019-04-22 2019-08-06 西安电子科技大学 Based on the semi-supervision image classification method for generating confrontation network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李洪顺; 于华; 宫秀军 (Li Hongshun; Yu Hua; Gong Xiujun): "A deep learning model for predicting RNA-binding proteins using only sequence information", Journal of Computer Research and Development (计算机研究与发展) *
赖向阳; 宫秀军; 韩来明 (Lai Xiangyang; Gong Xiujun; Han Laiming): "K-Medoids clustering based on a genetic algorithm under the MapReduce architecture", Computer Science (计算机科学) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001329A (en) * 2020-08-26 2020-11-27 东莞太力生物工程有限公司 Method and device for predicting protein expression amount, computer device and storage medium
CN112001329B (en) * 2020-08-26 2021-11-30 深圳太力生物技术有限责任公司 Method and device for predicting protein expression amount, computer device and storage medium
WO2022178949A1 (en) * 2021-02-26 2022-09-01 平安科技(深圳)有限公司 Semantic segmentation method and apparatus for electron microtomography data, device, and medium
CN113066528A (en) * 2021-04-12 2021-07-02 山西大学 Protein classification method based on active semi-supervised graph neural network
CN113066528B (en) * 2021-04-12 2022-07-19 山西大学 Protein classification method based on active semi-supervised graph neural network
CN113851192A (en) * 2021-09-15 2021-12-28 安庆师范大学 Amino acid one-dimensional attribute prediction model training method and device and attribute prediction method
CN115312119A (en) * 2022-10-09 2022-11-08 之江实验室 Method and system for identifying protein structural domain based on protein three-dimensional structure image
US11908140B1 (en) 2022-10-09 2024-02-20 Zhejiang Lab Method and system for identifying protein domain based on protein three-dimensional structure image

Similar Documents

Publication Publication Date Title
CN110853703A (en) Semi-supervised learning prediction method for protein secondary structure
Nguyen et al. Multi-class support vector machines for protein secondary structure prediction
CN109977994B (en) Representative image selection method based on multi-example active learning
Nguyen et al. Learning graph representation via frequent subgraphs
Kang Rotation-invariant wafer map pattern classification with convolutional neural networks
CN111581116B (en) Cross-project software defect prediction method based on hierarchical data screening
CN111325264A (en) Multi-label data classification method based on entropy
CN112116950B (en) Protein folding identification method based on depth measurement learning
Tao et al. RDEC: integrating regularization into deep embedded clustering for imbalanced datasets
CN116013428A (en) Drug target general prediction method, device and medium based on self-supervision learning
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN108549915B (en) Image hash code training model algorithm based on binary weight and classification learning method
KR102272921B1 (en) Hierarchical object detection method for extended categories
Plasencia-Calana et al. Towards scalable prototype selection by genetic algorithms with fast criteria
CN110413792B (en) High-influence defect report identification method
Liu et al. Multi-class classification of support vector machines based on double binary tree
Dong et al. A region selection model to identify unknown unknowns in image datasets
Zhao et al. BatSort: Enhanced Battery Classification with Transfer Learning for Battery Sorting and Recycling
CN110427973A (en) A kind of classification method towards ambiguity tagging sample
Mehta et al. Dynamic classification of defect structures in molecular dynamics simulation data
CN112465884B (en) Multi-element remote sensing image change detection method based on generated characteristic representation network
CN113177604B (en) High-dimensional data feature selection method based on improved L1 regularization and clustering
Tambouratzis Improving the clustering performance of the scanning n-tuple method by using self-supervised algorithms to introduce subclasses
Gustafsson Searching for rare traffic signs
Plasencia-Calana et al. Scalable prototype selection by genetic algorithms and hashing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240112