CN111508556A

CN111508556A - Protein contact map prediction method based on single sequence and full convolution neural network

Info

Publication number: CN111508556A
Application number: CN201911068072.9A
Authority: CN
Inventors: 於东军; 陈明猜
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2019-11-04
Filing date: 2019-11-04
Publication date: 2020-08-07

Abstract

The invention discloses a protein contact map prediction method based on a single sequence and a fully convolutional neural network. The invention combines single-sequence encoding with a deep fully convolutional neural network, so that the computational model can make predictions from a single protein sequence alone. This removes the contact map prediction algorithm's dependence on homologous sequences and improves prediction accuracy for low-homology proteins. The method uses only a simple, direct encoding of the sequence as its features, which keeps the algorithm efficient and simple and improves both the running efficiency of the prediction model and the usability of the prediction method.

Description

Protein contact map prediction method based on single sequence and full convolution neural network

Technical Field

The invention relates to the field of bioinformatics prediction of protein structures, and in particular to a protein contact map prediction method based on a single sequence and a deep fully convolutional neural network.

Background

The protein contact map contains important spatial geometric constraint information about a protein and helps solve structure-related bioinformatics problems such as 3D structure modeling and drug design. Determining protein contact maps with conventional wet-lab experimental methods is very expensive and time-consuming. Developing automated computational methods for protein contact map prediction using bioinformatics knowledge is therefore an urgent need. In recent years, a large number of computational methods for protein contact map prediction have been developed.

Most protein contact prediction algorithms are based on the observation that spatially close residues tend to mutate together. This evolutionary information is available from homologous sequences. Several typical statistical algorithms further analyze such co-mutation: coupling analysis, inverse covariance estimation based on direct coupling analysis, and pseudo-likelihood maximization based on direct coupling analysis. Indirect coupling is a key problem that these algorithms attempt to solve.

However, about one third of the 15000 protein families lack homologs of known structure, and over 90% of sequences have no known homologous sequences. In such cases the accuracy of these advanced methods is greatly reduced, and accuracy on low-homology sequences remains low. The contact map prediction may then even harm structure prediction rather than assist it.

Disclosure of Invention

In order to solve the problem of low accuracy on low-homology sequences in protein contact map prediction, the invention aims to provide a high-accuracy protein contact map prediction method based on a single sequence and a deep fully convolutional neural network.

The technical solution for realizing the purpose of the invention is as follows:

Compared with the prior art, the invention has the following notable advantages:

1. Improved prediction accuracy for low-homology proteins: combining single-sequence encoding with a deep fully convolutional neural network lets the computational model predict from a single protein sequence alone, which removes the contact map prediction algorithm's dependence on homologous sequences.

2. Improved running efficiency of the prediction model and usability of the prediction method: most other contact map prediction algorithms rely on intermediate results of other algorithms to complete prediction, or perform complex, time-consuming matrix computations to obtain input features. The method avoids these drawbacks and uses only a simple, direct encoding of the sequence as its features, so the algorithm is efficient and simple.

Drawings

FIG. 1 is a schematic flow chart of the protein contact map prediction method.

Detailed Description

The invention relates to a protein contact map prediction method based on a single sequence and a deep fully convolutional neural network, comprising the following steps:

The first stage: training the model

Step 1.1: Screen all known protein structures and sequences in the PDB database (before April 2019). Entries with incomplete information or with inconsistent sequence information are removed, as are protein sequences longer than 400 or shorter than 40 residues. CD-HIT is used to remove redundancy so that the sequence similarity between any two protein sequences is below 70%. The resulting data set contains 25394 proteins in total. The proteins are then sorted by PDB ID: the first 1200 are taken as the test set, the next 1000 as the validation set used for parameter tuning and early stopping, and the remaining 23194 as the training set. Since the PDB ID is independent of the nature and type of the protein sequence itself, this selection is effectively random.
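The length filter and the ID-ordered split of Step 1.1 can be sketched as follows. This is a minimal illustration, assuming the completeness checks and the CD-HIT redundancy removal have already been applied; the function name `filter_and_split` and the toy IDs are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Tuple


def filter_and_split(
    sequences: Dict[str, str],
    min_len: int = 40,
    max_len: int = 400,
    n_test: int = 1200,
    n_val: int = 1000,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (test_ids, val_ids, train_ids) after length filtering."""
    # Keep only sequences whose length lies in [min_len, max_len].
    kept = [pid for pid, seq in sequences.items() if min_len <= len(seq) <= max_len]
    # Sorting by ID gives a fixed order that is unrelated to sequence content.
    kept.sort()
    test_ids = kept[:n_test]
    val_ids = kept[n_test:n_test + n_val]
    train_ids = kept[n_test + n_val:]
    return test_ids, val_ids, train_ids


# Toy usage with made-up IDs and sequences (not real PDB entries).
toy = {
    "id_c": "A" * 50,
    "id_a": "G" * 120,
    "id_b": "K" * 10,    # too short, dropped
    "id_e": "L" * 401,   # too long, dropped
    "id_d": "V" * 400,
}
test_ids, val_ids, train_ids = filter_and_split(toy, n_test=1, n_val=1)
print(test_ids, val_ids, train_ids)  # ['id_a'] ['id_c'] ['id_d']
```

With the default sizes, a filtered set of 25394 entries yields 1200 test, 1000 validation and 23194 training proteins, matching the counts stated in Step 1.1.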

Step 1.2: Encode the protein sequence. Each residue pair is encoded as a 441-dimensional one-hot feature vector: the position in the feature vector that corresponds to the residue pair is set to 1, and all other positions are set to 0.
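A minimal sketch of the residue-pair one-hot encoding is given below. The text only states the dimension 441; the 21-letter alphabet (20 standard amino acids plus `X` for anything else, giving 21 × 21 = 441) and the index ordering used here are assumptions made for illustration, and the function names are hypothetical.

```python
from typing import List

# Assumption: 20 standard amino acids plus 'X' for any other symbol (21 * 21 = 441).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
INDEX = {aa: k for k, aa in enumerate(ALPHABET)}
N_TYPES = len(ALPHABET)          # 21
PAIR_DIM = N_TYPES * N_TYPES     # 441


def pair_index(a: str, b: str) -> int:
    """0-based index of the ordered residue pair (a, b), in the range [0, 441)."""
    ia = INDEX.get(a, INDEX["X"])
    ib = INDEX.get(b, INDEX["X"])
    return ia * N_TYPES + ib


def one_hot_pair(a: str, b: str) -> List[int]:
    """441-dimensional vector with a single 1 at the pair's position."""
    vec = [0] * PAIR_DIM
    vec[pair_index(a, b)] = 1
    return vec


def encode_sequence(seq: str) -> List[List[List[int]]]:
    """Return an L x L x 441 nested list for a sequence of length L."""
    return [[one_hot_pair(a, b) for b in seq] for a in seq]


features = encode_sequence("MKV")
print(len(features), len(features[0]), len(features[0][0]))  # 3 3 441
```

The encoded input grows with the square of the sequence length, which is why the feature tensor has the same L × L layout as the contact map the network predicts.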

Step 1.3: Build the network model. A fully convolutional neural network using residual blocks and convolutional layers is designed. All intermediate convolutional layers have 96 convolution kernels of size 3 × 3 and use the exponential linear unit (ELU) as the activation function; the final output layer has 1 convolution kernel. All convolutional layers use 'same' padding, so the network outputs an L × L matrix in which each element represents the probability that the residue pair at the corresponding position is in contact. A total of 30 residual blocks are used. During training, a cross-entropy loss function is used to compute the loss.

In addition, to prevent overfitting, we applied L2 regularization with a coefficient of 0.05 to all weights in the network.
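To illustrate why 'same' padding lets one network handle proteins of any length, the following toy sketch applies a 3 × 3 convolution and a residual block to a single-channel L × L map in plain Python. It is not the 96-kernel, 30-block network described above: the single channel, the fixed kernel values, and the conv, ELU, conv, skip-add ordering inside the block are all assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def conv3x3_same(x: Matrix, kernel: Matrix) -> Matrix:
    """3x3 convolution with zero 'same' padding: the output is L x L like the input."""
    n = len(x)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    r, c = i + di, j + dj
                    if 0 <= r < n and 0 <= c < n:   # positions outside the map count as zero
                        acc += kernel[di + 1][dj + 1] * x[r][c]
            out[i][j] = acc
    return out


def residual_block(x: Matrix, k1: Matrix, k2: Matrix) -> Matrix:
    """conv -> ELU -> conv, then add the block input (skip connection)."""
    h = conv3x3_same(x, k1)
    h = [[elu(v) for v in row] for row in h]
    h = conv3x3_same(h, k2)
    return [[a + b for a, b in zip(rx, rh)] for rx, rh in zip(x, h)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


kernel = [[0.1] * 3 for _ in range(3)]   # arbitrary illustrative weights
for length in (5, 8):                     # two different "protein lengths"
    x = [[float((i + j) % 2) for j in range(length)] for i in range(length)]
    h = residual_block(x, kernel, kernel)
    probs = [[sigmoid(v) for v in row] for row in conv3x3_same(h, kernel)]
    print(length, len(probs), len(probs[0]))
```

Because every layer preserves the L × L shape, the same weights can be applied to inputs of different sizes, and the final sigmoid turns each output element into a value between 0 and 1.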

The second stage: based on the protein sequence information input by the user, a protein contact map is predicted using the trained prediction model, as follows:

Step 2.1: Receive the target protein sequence information from the web page, encode the sequence on the server, and complete the prediction using the pre-trained model. The prediction result is then returned to the web page.

The invention is further described below with reference to the accompanying drawings.

FIG. 1 shows the system structure of the prediction method of the invention. With reference to FIG. 1, according to an embodiment of the invention, a protein contact map prediction method based on a single sequence and a deep fully convolutional neural network comprises the following steps:

First, a data set is screened and constructed; protein sequences are encoded with one-hot encoding to construct the input features, and the distances between residues are computed from the positions of the protein atoms to construct the training labels. Second, the model parameters are optimized by gradient descent using the Adaptive Moment Estimation (Adam) algorithm, and the number of gradient descent iterations is controlled according to the accuracy of the model on the validation set. Finally, after deployment on a web server, the user's protein input is received, encoded and fed into the prediction model, and the model's prediction result is returned to the page.
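The label construction from atomic positions can be sketched as below. The text does not state which atom represents a residue or which distance cutoff defines a contact; the single representative coordinate per residue and the 8 Å threshold used here are common conventions assumed for illustration, and `contact_map` is a hypothetical helper name.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def contact_map(coords: Sequence[Coord], threshold: float = 8.0) -> List[List[int]]:
    """L x L binary map: 1 where two residues are closer than `threshold` angstroms.

    Assumption: one representative atom per residue and an 8 angstrom cutoff.
    """
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Four made-up residue positions along a line (not taken from a real structure).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(coords):
    print(row)
```

The resulting map is symmetric with ones on the diagonal, and it has the same L × L shape as the network output, so it can be compared with the prediction element by element.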

The foregoing process will be described in more detail with reference to the accompanying drawings.

The first stage: training the model

Step 1.1: Screen all known protein structures and sequences in the PDB database (before April 2019). Entries with incomplete information or with inconsistent sequence information are removed, as are protein sequences longer than 400 or shorter than 40 residues. CD-HIT is used to remove redundancy so that the sequence similarity between any two protein sequences is below 70%. The resulting data set contains 25394 proteins in total. We then randomly selected 1200 proteins as the test set, 1000 proteins as the validation set used for parameter tuning and early stopping, and the remaining 23194 proteins as the training set.

Here CD-HIT is used to remove redundancy; the sequence similarity equals the ratio of the number of identical amino acids to the length of the sequence.
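The similarity definition stated above can be written as a formula. The symbols are introduced here for illustration only: N_identical denotes the number of identical amino acids in the alignment of sequences a and b, and L denotes the sequence length used for normalization. The text does not say which of the two sequences supplies L.

```latex
\mathrm{similarity}(a, b) = \frac{N_{\mathrm{identical}}(a, b)}{L}
```

Under this definition, the 70% threshold of Step 1.1 keeps only one representative from any group of sequences whose pairwise similarity reaches 0.7 or more.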

where the residue pair index is n (1 ≤ n ≤ 441), n is converted into the feature vector V, and i is the index of an element in V.
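The conversion of the pair index n into the one-hot feature vector V can be written as follows. This is a reconstruction from the verbal description in Step 1.2 (one position set to 1, all others set to 0), not a copy of the original formula.

```latex
V_i =
\begin{cases}
1, & i = n \\
0, & i \neq n
\end{cases}
\qquad i = 1, \dots, 441
```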

where R(x) is the ELU function, x is its argument, and α is a constant.
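For reference, the standard definition of the exponential linear unit, written with the symbols R, x and α named above, is shown below. It is given here as the usual textbook form and is assumed to match the formula the text refers to.

```latex
R(x) =
\begin{cases}
x, & x > 0 \\
\alpha \left( e^{x} - 1 \right), & x \le 0
\end{cases}
```

For positive inputs the function is the identity, and for negative inputs it saturates smoothly towards -α.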

J = -(y log(p) + (1 - y) log(1 - p))    (4)

where p is the predicted contact probability of the residue pair, y is the label, and J is the cross-entropy loss function.
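Equation (4) can be evaluated for a single residue pair as in the short sketch below. The clipping of p away from 0 and 1 is an added numerical safeguard and is not part of the formula in the text.

```python
import math


def cross_entropy(y: int, p: float, eps: float = 1e-12) -> float:
    """J = -(y*log(p) + (1 - y)*log(1 - p)) for one residue pair."""
    p = min(max(p, eps), 1.0 - eps)   # added safeguard against log(0), not in equation (4)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


print(round(cross_entropy(1, 0.9), 4))  # 0.1054: a true contact predicted with high probability
print(round(cross_entropy(1, 0.1), 4))  # 2.3026: a true contact predicted with low probability
```

The loss is small when the predicted probability agrees with the label and grows quickly as the prediction moves towards the wrong class.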

The model is optimized with the Adaptive Moment Estimation (Adam) technique. Adam can be viewed as a combination of RMSprop and stochastic gradient descent (SGD) with momentum. Like RMSprop, it scales the learning rate using the squared gradients, and like SGD with momentum, it replaces the gradient itself with a moving average of the gradient. Adam is an adaptive learning rate method that computes individual learning rates for different parameters: it uses estimates of the first and second moments of the gradient to adjust the learning rate of each weight of the neural network.

where m is the moment and X is a random variable.
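The usual definition of the n-th moment of a random variable, using the symbols m and X named above, is shown below. It is given here as the standard form and is assumed to be the formula the text refers to.

```latex
m_n = \mathbb{E}\left[ X^{n} \right]
```

The first moment (n = 1) is the mean, and the second moment (n = 2) is the mean of the squared values; Adam estimates both for the gradient.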

where m and v are moving averages, g is the current gradient, and β1 and β2 are hyperparameters whose default values are 0.9 and 0.999, respectively. The moving-average vectors are initialized to zero at the first iteration.
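A single Adam update for one scalar weight can be sketched as follows. The moving averages and the defaults 0.9 and 0.999 follow the text; the bias-correction terms, the learning rate and the small constant `eps` follow the commonly published form of Adam and are assumptions here, as is the function name `adam_step`.

```python
import math
from typing import Tuple


def adam_step(
    w: float, g: float, m: float, v: float, t: int,
    lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8,
) -> Tuple[float, float, float]:
    """One Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1.0 - beta1) * g          # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * g * g      # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)             # correct the bias caused by zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy problem: minimize f(w) = w^2, whose gradient is 2w, starting from w = 1.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, lr=0.01)
print(w)
```

Because the update divides the averaged gradient by the square root of the averaged squared gradient, the very first step has a size of about `lr` regardless of how large the raw gradient is.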

Furthermore, to prevent overfitting, we applied L2 regularization with a coefficient of 0.05 to all weights in the network.
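One common way to write the regularized objective is shown below, where J is the cross-entropy loss, the w_k are the network weights and λ is the regularization coefficient. The exact form is an assumption: the text gives only the coefficient 0.05, and some implementations use λ/2 in place of λ.

```latex
J_{\mathrm{total}} = J + \lambda \sum_{k} w_k^{2}, \qquad \lambda = 0.05
```

The added term penalizes large weights, which discourages the network from fitting noise in the training set.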

An exemplary algorithmic process for optimizing the model parameters with the gradient descent algorithm is as follows:
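The original algorithm listing is not reproduced in this text. As an illustration of controlling the number of training iterations by validation accuracy, a minimal early-stopping loop might look like the sketch below; the `patience` rule, the callback names and the scripted accuracy values are hypothetical and are not taken from the patent.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    train_one_epoch: Callable[[], None],
    validate: Callable[[], float],
    max_epochs: int = 100,
    patience: int = 5,
) -> Tuple[int, float]:
    """Stop once validation accuracy has not improved for `patience` epochs in a row."""
    best_acc, best_epoch, waited = float("-inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()            # one pass of optimizer updates over the training set
        acc = validate()             # accuracy on the validation set
        if acc > best_acc:
            # A real run would also save the model weights at this point.
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_acc


# Usage with a scripted validation curve in place of a real model.
curve = iter([0.30, 0.42, 0.47, 0.46, 0.45, 0.44, 0.60])
best_epoch, best_acc = train_with_early_stopping(
    train_one_epoch=lambda: None,
    validate=lambda: next(curve),
    patience=3,
)
print(best_epoch, best_acc)  # 3 0.47
```

In this toy run training stops after the sixth epoch, so the later value 0.60 is never reached; a larger `patience` trades extra training time for a lower risk of stopping too early.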

In summary, compared with existing prediction methods, the method has the following significant advantage: it solves the problem of low prediction accuracy on low-homology protein sequences and improves the prediction accuracy of the model.

The advantages of the invention are as follows. First, other contact map prediction algorithms rely on intermediate results of other algorithms to complete prediction, or perform complex, time-consuming matrix computations to obtain input features; the method avoids these drawbacks and uses only a simple, direct encoding of the sequence as its features, so the algorithm is efficient and simple. Second, other contact map prediction algorithms reach high accuracy only when the target protein has many homologous sequences, whereas the method constructs its features from a single sequence, so the accuracy of the model does not depend on sequence homology. Third, because a fully convolutional neural network is used, protein sequences of variable length can be handled and the contact map prediction is obtained directly.

Claims

1. A protein contact map prediction method based on a single sequence and a fully convolutional neural network, comprising the following steps:

the first stage: training the model

step 1.1: constructing a data set for model testing and training; screening all sequences in the PDB database; using CD-HIT to remove redundancy from the data set; clustering all sequences according to the parameter settings, outputting the longest sequence in each cluster as the representative sequence, and also giving the names of the sequences in each cluster for similarity analysis; randomly dividing the data into a training set, a test set and a validation set;

step 1.2: encoding the protein; first converting the one-dimensional protein sequence into residue pairs, each residue pair being one-hot encoded; calculating the distances between residues from the PDB structure information to obtain the protein contact map;

step 1.3: designing the network model structure; designing a fully convolutional neural network using residual blocks and convolutional layers; training the network with the Adaptive Moment Estimation technique, and adjusting parameters according to the accuracy on the validation set;

the second stage: based on the protein sequence information input by the user, predicting the protein contact map using the trained prediction model, as follows:

receiving the target protein sequence information from the web page, encoding the sequence on the server, completing the prediction using the pre-trained model, and returning the prediction result to the web page.

2. The protein contact map prediction method based on a single sequence and a fully convolutional neural network of claim 1, wherein one-hot encoding is performed in units of residue pairs on the input protein sequence to form the input of the model,

wherein the residue pair index is n, 1 ≤ n ≤ 441, n is converted into the feature vector V, and i is the index of an element in V.

3. The protein contact map prediction method based on a single sequence and a fully convolutional neural network of claim 1, wherein a data set for training the model is constructed; all known protein structures and sequences in the PDB database are screened with CD-HIT so that no two protein sequences are homologous; incomplete data are removed; protein sequences that are too long or too short are removed; a small part of the samples in the data set is then randomly taken as the test set and the validation set, and the remaining samples are used as the training set.

4. The protein contact map prediction method based on a single sequence and a fully convolutional neural network of claim 1, wherein the network model connects residual blocks in series; all convolutional layers in the residual blocks use the exponential linear unit activation function, and the last convolutional layer uses the Sigmoid function; the weights of the convolutional layers are iteratively updated using the Adaptive Moment Estimation technique, and the validation set is used to adjust parameters and control the number of training iterations.