DNA binding residue prediction method based on deep convolutional neural network
Technical Field
The invention relates to the fields of bioinformatics, pattern recognition and computer application, in particular to a DNA binding residue prediction method based on a deep convolutional neural network.
Background
Protein-ligand interactions are ubiquitous and indispensable in life processes and play an important role in biomolecular recognition and signaling. DNA is one such ligand. Accurately identifying the DNA binding residues in a protein sequence helps in understanding protein function, analyzing the interaction mechanism between proteins and DNA, and designing drug target proteins, and is therefore of considerable biological significance.
A survey of the literature shows that many methods for predicting DNA binding residues in protein sequences have been proposed, for example: DISPLAR (Tjong H, Zhou H. An accurate method for predicting DNA-binding sites on protein surfaces [J]. Nucleic Acids Research, 2007, 35(5): 1465-1477), DELIA (Xia C, Pan X, Shen H, et al. Protein-ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data [J]. Bioinformatics), the convolutional neural network method of Zeng et al. (Zeng H, et al. Convolutional neural network architectures for predicting DNA-protein binding [J]. Bioinformatics, 2016, 32(12)), and ENSEMBLE-CNN (Zhang Y, Qiao S, Ji S, et al. Prediction of DNA Binding Sites in Protein Sequences by an Ensemble Deep Learning Method [C]. International Conference on Intelligent Computing, 2018: 301-). Although these existing methods can predict DNA binding residues in a protein sequence, they generally rely on large amounts of experimental data together with machine learning algorithms, which makes them costly. In addition, because noise in the training set receives insufficient attention, their prediction accuracy is not guaranteed to be optimal and needs further improvement.
In summary, existing DNA binding residue prediction methods fall well short of the requirements of practical application in both computational cost and prediction accuracy, and urgently need improvement.
Disclosure of Invention
In order to overcome the shortcomings of existing DNA binding residue prediction methods in computational cost and prediction accuracy, the invention provides a DNA binding residue prediction method based on a deep convolutional neural network that has low computational cost and high prediction accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for DNA-binding residue prediction based on deep convolutional neural network, the method comprising the steps of:
1) inputting a protein sequence S of L residues for which DNA binding residues are to be predicted;
2) for the protein sequence S, using the PSI-BLAST program (https://toolkit.tuebingen.mpg.de/tools/psiblst) to search the protein sequence database swissprot (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) to generate a position-specific scoring matrix of size L × 20, denoted PSSM;
3) for the protein sequence S, using the PSSpred program (https://zhanglab.ccmb.med.umich.edu/PSSpred) to search the protein sequence database nr (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr) to generate a protein secondary structure matrix of size L × 3, denoted PSS;
4) combining the two-dimensional matrices obtained in steps 2) and 3) into an L × 23 feature matrix, denoted F;
5) adding 8 rows of zeros before and after F, and then, starting from the 9th row of F and ending at the (L-9)th row of F, taking the residue corresponding to the middle row as the prediction target and taking the 8 rows adjacent to it on each side as the feature matrix of that residue;
6) constructing a deep convolutional neural network to predict the DNA binding residues of the protein sequence S, wherein the network comprises eight layers: the first seven are convolutional layers and the last is a fully-connected layer; each convolutional layer comprises a two-dimensional convolution layer, a normalization layer and a pooling layer; the output of each layer serves as the input of the next layer; and the fully-connected layer uses a sigmoid activation function so that the output value of the network lies in the range (0, 1);
7) generating residue samples from a protein sequence with known binding residues through steps 2) to 5), repeating this for each such sequence to construct a training set, and dividing the training set into M groups of training subsets, wherein each training subset contains all positive residue samples of the training set, and negative samples are randomly added to each training subset at a positive-to-negative sample ratio of 1:2;
8) using the M groups of training subsets from step 7) to train the deep convolutional neural network built in step 6), wherein each training run uses a binary cross-entropy loss function to adjust the parameters of the network, so that M deep convolutional neural network models are obtained in total, the binary cross-entropy loss function being written as:
$$Y = -\left[u \log \hat{u} + (1 - u)\log(1 - \hat{u})\right]$$
where u represents the true label of the residue to be determined in the protein sequence, $\hat{u}$ represents the predicted output value of the network model, and Y represents the difference between the predicted output and the true label;
9) inputting the residue samples generated from the protein sequence S into the M models obtained in step 8), and setting an output probability threshold, denoted threshold, for each model, wherein a residue whose output value is greater than threshold is predicted by that model to be a binding residue; each residue sample in S is thus predicted by the M models to give M prediction results, and the prediction made by the majority of the M results is taken as the final prediction result.
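Steps 4) and 5) can be illustrated with a minimal pure-Python sketch. The function names `combine_features` and `make_residue_windows` are hypothetical, and the sketch assumes that each residue's feature matrix is the centre row plus the 8 rows on each side (17 × 23) and that one window is produced for every residue; the row range stated in step 5) may follow a different indexing convention.

```python
def combine_features(pssm, pss):
    """Concatenate an L x 20 PSSM and an L x 3 PSS row by row into an L x 23 matrix F."""
    if len(pssm) != len(pss):
        raise ValueError("PSSM and PSS must have the same number of rows")
    return [list(a) + list(b) for a, b in zip(pssm, pss)]


def make_residue_windows(features, half_width=8):
    """Zero-pad F with half_width rows at each end and cut one window per residue.

    Assumption: the window for residue i holds the residue's own row in the
    middle and half_width rows on each side, i.e. 2 * half_width + 1 rows.
    """
    if not features:
        return []
    n_cols = len(features[0])
    zero_row = [0.0] * n_cols  # shared padding row, never modified
    padded = [zero_row] * half_width + [list(r) for r in features] + [zero_row] * half_width
    size = 2 * half_width + 1
    # Window i starts at padded row i, so its middle row is the original row i.
    return [padded[i:i + size] for i in range(len(features))]


# Toy input with L = 5 residues (values are placeholders, not real profiles).
toy_pssm = [[float(i)] * 20 for i in range(1, 6)]
toy_pss = [[0.1, 0.2, 0.7] for _ in range(5)]
toy_windows = make_residue_windows(combine_features(toy_pssm, toy_pss))
print(len(toy_windows), len(toy_windows[0]), len(toy_windows[0][0]))
```

Zero padding lets residues near either end of the sequence receive a window of the same shape as interior residues, so all samples can be fed to the same network.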
The technical conception of the invention is as follows: first, for an input protein sequence of L residues whose DNA binding residues are to be predicted, the matrices PSSM and PSS are obtained using the PSI-BLAST program and the PSSpred program; second, the two matrices are combined into a feature matrix F; third, the protein sequence is processed into residue samples; fourth, a deep convolutional neural network is built, a data set is constructed from protein sequences with known binding residues and divided into ten data subsets, and ten network models are trained on the ten subsets; finally, the protein sequence to be predicted is processed into residue samples, which are input into the ten trained network models, and whether each residue in the protein sequence is a binding residue is predicted by integrating the prediction results of the ten models.
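The construction of the training subsets (every positive sample in each subset, negatives drawn at random at a ratio of 1:2) might be sketched as follows. The function name `build_training_subsets`, the `seed` parameter, and the choice to draw negatives independently for each subset and without replacement within a subset are assumptions made for illustration; the text only states that negatives are added randomly.

```python
import random


def build_training_subsets(positives, negatives, n_subsets=10, neg_per_pos=2, seed=0):
    """Return n_subsets lists of (sample, label) pairs.

    Each subset holds all positive samples (label 1) plus randomly drawn
    negative samples (label 0) at a positive-to-negative ratio of 1:neg_per_pos.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Cap the draw if there are fewer negatives than the ratio asks for.
    n_neg = min(len(negatives), neg_per_pos * len(positives))
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(negatives, n_neg)  # without replacement within a subset
        subset = [(x, 1) for x in positives] + [(x, 0) for x in sampled]
        rng.shuffle(subset)
        subsets.append(subset)
    return subsets


# Placeholder sample identifiers stand in for residue feature matrices.
toy_subsets = build_training_subsets(["p0", "p1", "p2"], [f"n{i}" for i in range(30)])
print(len(toy_subsets), len(toy_subsets[0]))
```

Because binding residues are much rarer than non-binding ones, reusing all positives in every subset while varying the negatives keeps each model's training data balanced and still exposes the ensemble to many different negatives.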
The beneficial effects of the invention are as follows: on the one hand, starting from a feature matrix derived from sequence information, the protein sequence is processed into residue samples and a deep convolutional network model is built, which lays the groundwork for improving prediction accuracy; on the other hand, ten data subsets are constructed and used to train ten network models, and the prediction results of the ten models are integrated, which further improves the efficiency and accuracy of DNA binding residue prediction.
Drawings
FIG. 1 is a schematic diagram of a deep convolutional neural network-based DNA binding residue prediction method.
FIG. 2 shows the result of DNA binding residue prediction of protein sequence 1X3C using a deep convolutional neural network-based prediction method.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 and 2, a DNA binding residue prediction method based on a deep convolutional neural network includes the following steps:
1) inputting a protein sequence S of L residues for which DNA binding residues are to be predicted;
2) for the protein sequence S, using the PSI-BLAST program (https://toolkit.tuebingen.mpg.de/tools/psiblst) to search the protein sequence database swissprot (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) to generate a position-specific scoring matrix of size L × 20, denoted PSSM;
3) for the protein sequence S, using the PSSpred program (https://zhanglab.ccmb.med.umich.edu/PSSpred) to search the protein sequence database nr (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr) to generate a protein secondary structure matrix of size L × 3, denoted PSS;
4) combining the two-dimensional matrices obtained in steps 2) and 3) into an L × 23 feature matrix, denoted F;
5) adding 8 rows of zeros before and after F, and then, starting from the 9th row of F and ending at the (L-9)th row of F, taking the residue corresponding to the middle row as the prediction target and taking the 8 rows adjacent to it on each side as the feature matrix of that residue;
6) constructing a deep convolutional neural network to predict the DNA binding residues of the protein sequence S, wherein the network comprises eight layers: the first seven are convolutional layers and the last is a fully-connected layer; each convolutional layer comprises a two-dimensional convolution layer, a normalization layer and a pooling layer; the output of each layer serves as the input of the next layer; and the fully-connected layer uses a sigmoid activation function so that the output value of the network lies in the range (0, 1);
7) generating residue samples from a protein sequence with known binding residues through steps 2) to 5), repeating this for each such sequence to construct a training set, and dividing the training set into M groups of training subsets (here M = 10), wherein each training subset contains all positive residue samples of the training set, and negative samples are randomly added to each training subset at a positive-to-negative sample ratio of 1:2;
8) using the M groups of training subsets from step 7) to train the deep convolutional neural network built in step 6), wherein each training run uses a binary cross-entropy loss function to adjust the parameters of the network, so that M deep convolutional neural network models are obtained in total, the binary cross-entropy loss function being written as:
$$Y = -\left[u \log \hat{u} + (1 - u)\log(1 - \hat{u})\right]$$
where u represents the true label of the residue to be determined in the protein sequence, $\hat{u}$ represents the predicted output value of the network model, and Y represents the difference between the predicted output and the true label;
9) inputting the residue samples generated from the protein sequence S into the M models obtained in step 8), and setting an output probability threshold, denoted threshold, for each model, wherein a residue whose output value is greater than threshold is predicted by that model to be a binding residue; each residue sample in S is thus predicted by the M models to give M prediction results, and the prediction made by the majority of the M results is taken as the final prediction result.
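The binary cross-entropy loss named in step 8) can be evaluated numerically as in the sketch below. It assumes the standard form of the loss, writes `u` for the true label (0 or 1) and `u_hat` for the sigmoid output of a model, and clips `u_hat` away from 0 and 1 to avoid taking the logarithm of zero; the names `u_hat`, `eps`, `binary_cross_entropy` and `mean_loss` are illustrative assumptions.

```python
import math


def binary_cross_entropy(u, u_hat, eps=1e-12):
    """Loss Y for one residue: -[u*log(u_hat) + (1-u)*log(1-u_hat)]."""
    u_hat = min(max(u_hat, eps), 1.0 - eps)  # keep both logarithms finite
    return -(u * math.log(u_hat) + (1 - u) * math.log(1.0 - u_hat))


def mean_loss(labels, predictions):
    """Average the per-residue loss over a batch of samples."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    return sum(binary_cross_entropy(u, p) for u, p in zip(labels, predictions)) / len(labels)


# A confident correct prediction costs less than a confident wrong one.
print(binary_cross_entropy(1, 0.9), binary_cross_entropy(1, 0.1))
```

The loss approaches zero as the predicted output approaches the true label and grows without bound as the prediction approaches the opposite label, which is what drives the parameter updates during training.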
In this embodiment, the DNA binding residue prediction of the protein sequence 1X3C is taken as an example, and a DNA binding residue prediction method based on a deep convolutional neural network includes the following steps:
1) inputting the protein 1X3C, which has 73 residues and for which DNA binding residues are to be predicted, and denoting it S;
2) for the protein sequence S, using the PSI-BLAST program (https://toolkit.tuebingen.mpg.de/tools/psiblst) to search the protein sequence database swissprot (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) to generate a position-specific scoring matrix of size 73 × 20, denoted PSSM;
3) for the protein sequence S, using the PSSpred program (https://zhanglab.ccmb.med.umich.edu/PSSpred) to search the protein sequence database nr (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr) to generate a protein secondary structure matrix of size 73 × 3, denoted PSS;
4) combining the two-dimensional matrices obtained in steps 2) and 3) into a 73 × 23 feature matrix, denoted F;
5) adding 8 rows of zeros before and after F, and then, starting from the 9th row of F and ending at the 64th row of F, taking the residue corresponding to the middle row as the prediction target and taking the 8 rows adjacent to it on each side as the feature matrix of that residue;
6) constructing a deep convolutional neural network to predict the DNA binding residues of the protein sequence S, wherein the network comprises eight layers: the first seven are convolutional layers and the last is a fully-connected layer; each convolutional layer comprises a two-dimensional convolution layer, a normalization layer and a pooling layer; the output of each layer serves as the input of the next layer; and the fully-connected layer uses a sigmoid activation function so that the output value of the network lies in the range (0, 1);
7) generating residue samples from a protein sequence with known binding residues through steps 2) to 5), repeating this for each such sequence to construct a training set, and dividing the training set into ten groups of training subsets, wherein each training subset contains all positive residue samples of the training set, and negative samples are randomly added to each training subset at a positive-to-negative sample ratio of 1:2;
8) using the ten groups of training subsets from step 7) to train the deep convolutional neural network built in step 6), wherein each training run uses a binary cross-entropy loss function to adjust the parameters of the network, so that ten deep convolutional neural network models are obtained in total, the binary cross-entropy loss function being written as:
$$Y = -\left[u \log \hat{u} + (1 - u)\log(1 - \hat{u})\right]$$
where u represents the true label of the residue to be determined in the protein sequence, $\hat{u}$ represents the predicted output value of the network model, and Y represents the difference between the predicted output and the true label;
9) inputting the residue samples generated from the protein sequence S into the ten models obtained in step 8), and setting an output probability threshold, denoted threshold, for each model, wherein a residue whose output value is greater than threshold is predicted by that model to be a binding residue; each residue sample in S is thus predicted by the ten models to give ten prediction results, and the prediction made by the majority of the ten results is taken as the final prediction result.
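The thresholding and majority vote of step 9) can be sketched as follows. The text does not give a value for threshold or say how a tie among an even number of models is resolved, so the default of 0.5 and the rule that a tie counts as non-binding are assumptions, as is the function name `majority_vote`.

```python
def majority_vote(model_outputs, threshold=0.5):
    """Combine per-model output probabilities into one 0/1 call per residue.

    model_outputs[m][j] is the output of model m for residue j.
    Assumptions: threshold defaults to 0.5, and a residue is called binding
    only when strictly more than half of the models vote for it.
    """
    if not model_outputs:
        return []
    n_models = len(model_outputs)
    n_residues = len(model_outputs[0])
    final = []
    for j in range(n_residues):
        # A model votes "binding" when its output exceeds the threshold.
        votes = sum(1 for outputs in model_outputs if outputs[j] > threshold)
        final.append(1 if 2 * votes > n_models else 0)
    return final


# Ten hypothetical models scoring three residues (values are placeholders).
toy_outputs = [[0.9, 0.2, 0.6]] * 6 + [[0.1, 0.8, 0.6]] * 4
print(majority_vote(toy_outputs))
```

For the 73-residue example, `model_outputs` would hold ten lists of 73 probabilities, and the returned list would mark each residue as binding (1) or non-binding (0).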
The above describes the prediction result obtained by the present invention, taking the DNA binding residue prediction of protein sequence 1X3C as an example. It is not intended to limit the scope of the present invention, and various modifications and improvements can be made without departing from the scope of the present invention.