CN111785321B

CN111785321B - DNA binding residue prediction method based on deep convolutional neural network

Info

Publication number: CN111785321B
Application number: CN202010533489.4A
Authority: CN
Inventors: 胡俊; 白岩松; 樊学强; 郑琳琳; 张贵军
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Guangzhou Zhaoji Biotechnology Co ltd; Shenzhen Xinrui Gene Technology Co ltd
Priority date: 2020-06-12
Filing date: 2020-06-12
Publication date: 2022-04-05
Anticipated expiration: 2040-06-12
Also published as: CN111785321A

Abstract

A DNA binding residue prediction method based on a deep convolutional neural network. First, for an input protein sequence with L residues whose ligand binding residues are to be predicted, the psi-blast program and the PSSpred program are used to obtain the matrices PSSM and PSS. Second, the two matrices are combined into a feature matrix F. Third, the protein sequence is processed into residue samples. Fourth, a deep convolutional neural network is built, a data set is constructed from protein sequences with known binding residues and divided into M data subsets, and M network models are trained on these M subsets. Finally, the protein sequence to be predicted is processed into residue samples and input into the M trained network models, and the prediction results of the M models are combined to predict whether each residue in the protein sequence is a binding residue. The invention has low computational cost and high prediction accuracy.

Description

DNA binding residue prediction method based on deep convolutional neural network

Technical Field

The invention relates to the fields of bioinformatics, pattern recognition and computer application, in particular to a DNA binding residue prediction method based on a deep convolutional neural network.

Background

Protein-ligand interactions are ubiquitous and indispensable in life processes and play a very important role in biomolecular recognition and signaling. DNA is one kind of ligand molecule. Accurately identifying the DNA binding residues in a protein sequence helps in understanding protein function, analyzing the interaction mechanism between proteins and DNA molecules, and designing drug target proteins, and therefore has important biological significance.

A survey of the literature shows that many methods have been proposed for predicting DNA binding residues in protein sequences, for example: DISPLAR (Tjong H, Zhou H. DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces [J]. Nucleic Acids Research, 2007, 35(5): 1465-1477), DELIA (Xia C, Pan X, Shen H, et al. Protein-ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data [J]. Bioinformatics), a convolutional neural network method (Zeng H, et al. Convolutional neural network architectures for predicting DNA-protein binding [J]. Bioinformatics, 2016, 32(12)), and ENSEMBLE-CNN (Zhang Y, Qiao S, Ji S, et al. Prediction of DNA binding sites in protein sequences by an ensemble deep learning method [C]. International Conference on Intelligent Computing, 2018). Although the existing methods can predict DNA binding residues in protein sequences, they generally rely on large amounts of experimental data and on machine learning algorithms, which makes them costly; moreover, because noise in the training set is not given enough attention, their prediction accuracy is not guaranteed to be optimal and needs further improvement.

In summary, existing DNA binding residue prediction methods still fall well short of the requirements of practical application in both computational cost and prediction accuracy, and improvement is urgently needed.

Disclosure of Invention

In order to overcome the defects of the existing DNA binding residue prediction method in two aspects of calculation cost and prediction precision, the invention provides a DNA binding residue prediction method based on a deep convolutional neural network, which is low in calculation cost and high in prediction precision.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a method for DNA-binding residue prediction based on deep convolutional neural network, the method comprising the steps of:

1) inputting a protein sequence S with L residues whose DNA binding residues are to be predicted;

2) for the protein sequence S, using the psi-blast (https://toolkit.tuebingen.mpg.de/tools/psiblst) program to search the protein sequence database swissprot (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) to generate a position-specific scoring matrix of size L × 20, denoted PSSM;

3) for the protein sequence S, using the PSSpred (https://zhanglab.ccmb.med.umich.edu/PSSpred) program to search the protein sequence database nr (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr) to generate a protein secondary structure matrix of size L × 3, denoted PSS;

4) combining the two matrices obtained in steps 2) and 3) into an L × 23 feature matrix, denoted F;

5) padding F with 8 rows of zeros before its first row and 8 rows of zeros after its last row; then, starting from the 9th row of F and ending at the (L-9)th row of F, taking the residue corresponding to the middle row as the prediction target, and taking the 8 rows of data adjacent to it on each side as the feature matrix of that residue;

6) constructing a deep convolutional neural network to predict the DNA binding residues of the protein sequence S, wherein the network has eight layers: the first seven are convolutional layers and the last is a fully connected layer; each convolutional layer comprises a two-dimensional convolution layer, a normalization layer and a pooling layer; the output of each layer serves as the input of the next layer; and the fully connected layer uses a sigmoid activation function to map the output of the convolutional layers into the range (0, 1);

7) generating residue samples from protein sequences with known binding residues through steps 2) to 5), and repeating this procedure to construct a training set; dividing the training set into M training subsets, wherein each training subset contains all positive residue samples in the training set, and negative samples are randomly added to each training subset at a positive-to-negative sample ratio of 1:2;

8) training the deep convolutional neural network built in 6) on each of the M training subsets from 7), wherein each training run uses a two-class cross-entropy loss function to adjust the parameters of the network, yielding M deep convolutional neural network models in total; the two-class cross-entropy loss function is written as:

$$Y = -\left[\,u \log \hat{u} + (1 - u)\log\left(1 - \hat{u}\right)\right]$$

where $u$ denotes the true label of the residue to be determined in the protein sequence, $\hat{u}$ denotes the predicted output value of the network model, and $Y$ denotes the difference between the predicted output and the true label;

9) inputting the residue samples generated from the protein sequence S into the M models obtained in 8), and setting an output probability threshold, denoted threshold, for each model, wherein a position whose output value is greater than threshold is a binding residue predicted by that model; each residue sample in S is predicted by the M models to generate M prediction results, and the majority outcome among the M prediction results is taken as the final prediction result.
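Steps 4) and 5) can be illustrated with a minimal NumPy sketch. It assumes that the PSSM and PSS matrices are already available as arrays (random placeholders stand in for real psi-blast and PSSpred output), and that one window of 17 rows (the center row plus 8 rows on each side) is cut for every residue; the row range stated in step 5) may index the windows differently. The function names `build_feature_matrix` and `residue_windows` are hypothetical.

```python
import numpy as np


def build_feature_matrix(pssm, pss):
    """Concatenate an L x 20 PSSM and an L x 3 PSS column-wise into an L x 23 matrix F."""
    pssm = np.asarray(pssm, dtype=float)
    pss = np.asarray(pss, dtype=float)
    if pssm.shape[0] != pss.shape[0]:
        raise ValueError("PSSM and PSS must have the same number of rows")
    return np.hstack([pssm, pss])


def residue_windows(F, half_width=8):
    """Zero-pad F by half_width rows on each side and cut one window per residue.

    Assumption: every residue gets a window of 2 * half_width + 1 rows
    centered on its own row.
    """
    n_residues = F.shape[0]
    padded = np.pad(F, ((half_width, half_width), (0, 0)), mode="constant")
    size = 2 * half_width + 1
    return np.stack([padded[i:i + size] for i in range(n_residues)])


# Placeholder inputs for a 73-residue protein (not real PSSM/PSS values).
rng = np.random.default_rng(0)
L = 73
pssm = rng.normal(size=(L, 20))
pss = rng.random(size=(L, 3))

F = build_feature_matrix(pssm, pss)
windows = residue_windows(F)
print(F.shape, windows.shape)
```

Under these assumptions each residue sample is a 17 × 23 matrix, and the samples for residues near either end of the sequence contain the zero rows added by the padding.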

The technical concept of the invention is as follows: first, for an input protein sequence with L residues whose ligand binding residues are to be predicted, the matrices PSSM and PSS are obtained using the psi-blast program and the PSSpred program; second, the two matrices are combined into a feature matrix F; third, the protein sequence is processed into residue samples; fourth, a deep convolutional neural network is built, a data set is constructed from protein sequences with known binding residues and divided into ten data subsets, and ten network models are trained on these ten subsets; finally, the protein sequence to be predicted is processed into residue samples and input into the ten trained network models, and the prediction results of the ten models are integrated to predict whether each residue in the protein sequence is a binding residue.
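The construction of the data subsets described in step 7) (each subset keeps every positive sample and adds randomly chosen negatives at a positive-to-negative ratio of 1:2) might look like the following sketch. The function name `make_balanced_subsets`, the seed, and the toy label vector are hypothetical, and drawing negatives independently for each subset is an assumption: the text does not say whether a negative sample may appear in more than one subset.

```python
import numpy as np


def make_balanced_subsets(labels, m=10, neg_per_pos=2, seed=0):
    """Return m index arrays; each holds all positives plus randomly drawn negatives.

    Assumption: negatives are drawn independently for every subset, without
    replacement inside a subset, so subsets may share negative samples.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    # Cap the draw in case fewer negatives exist than the ratio asks for.
    n_neg = min(neg_per_pos * len(pos_idx), len(neg_idx))
    subsets = []
    for _ in range(m):
        sampled_neg = rng.choice(neg_idx, size=n_neg, replace=False)
        subsets.append(np.concatenate([pos_idx, sampled_neg]))
    return subsets


# Toy labels: 50 binding residues (1) and 950 non-binding residues (0).
labels = np.array([1] * 50 + [0] * 950)
subsets = make_balanced_subsets(labels, m=10)
print(len(subsets), len(subsets[0]))
```

Each index array can then be used to select the residue samples on which one of the ten network models is trained.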

The beneficial effects of the invention are as follows: on the one hand, starting from a feature matrix of sequence information, the protein sequence is processed into residue samples and a deep convolutional network model is built, which lays the groundwork for improving prediction accuracy; on the other hand, ten data subsets are constructed and used to train ten network models, and the prediction results of the ten models are integrated, which further improves the efficiency and accuracy of DNA binding residue prediction.
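The integration of the model outputs by majority vote, as in step 9), can be sketched as follows. The default threshold of 0.5 and the tie rule (a tie counts as non-binding) are assumptions, since the text fixes neither; `ensemble_predict` is a hypothetical name, and the probabilities are made-up values rather than real model outputs.

```python
import numpy as np


def ensemble_predict(probabilities, threshold=0.5):
    """Majority vote over per-model outputs.

    probabilities: array of shape (M, n_residues) with values in (0, 1).
    Assumption: a residue is called binding only when strictly more than
    half of the M models give it an output above the threshold.
    """
    probs = np.asarray(probabilities, dtype=float)
    votes = probs > threshold          # per-model binary predictions
    n_models = votes.shape[0]
    return votes.sum(axis=0) * 2 > n_models


# Made-up outputs of three models for four residues.
probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.1, 0.4, 0.6],
    [0.7, 0.3, 0.7, 0.2],
]
final = ensemble_predict(probs)
print(final.tolist())  # [True, False, True, False]
```

With an even number of models, such as ten, a five-to-five split is possible, so whichever tie rule is chosen should be applied consistently.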

Drawings

FIG. 1 is a schematic diagram of a deep convolutional neural network-based DNA binding residue prediction method.

FIG. 2 shows the result of DNA binding residue prediction of protein sequence 1X3C using a deep convolutional neural network-based prediction method.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 and FIG. 2, a DNA binding residue prediction method based on a deep convolutional neural network includes the following steps:

7) generating residue samples from protein sequences with known binding residues through steps 2) to 5), and repeating this procedure to construct a training set; dividing the training set into M training subsets (here M is taken as 10), wherein each training subset contains all positive residue samples in the training set, and negative samples are randomly added to each training subset at a positive-to-negative sample ratio of 1:2;

In this embodiment, DNA binding residue prediction for the protein sequence 1X3C is taken as an example, and the DNA binding residue prediction method based on a deep convolutional neural network includes the following steps:

1) inputting the protein 1X3C, which has 73 residues and whose DNA binding residues are to be predicted, and denoting it S;

2) for the protein sequence S, using the psi-blast (https://toolkit.tuebingen.mpg.de/tools/psiblst) program to search the protein sequence database swissprot (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) to generate a position-specific scoring matrix of size 73 × 20, denoted PSSM;

3) for the protein sequence S, using the PSSpred (https://zhanglab.ccmb.med.umich.edu/PSSpred) program to search the protein sequence database nr (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr) to generate a protein secondary structure matrix of size 73 × 3, denoted PSS;

4) combining the two matrices obtained in steps 2) and 3) into a 73 × 23 feature matrix, denoted F;

5) padding F with 8 rows of zeros before its first row and 8 rows of zeros after its last row; then, starting from the 9th row of F and ending at the 64th row of F, taking the residue corresponding to the middle row as the prediction target, and taking the 8 rows of data adjacent to it on each side as the feature matrix of that residue;

7) generating residue samples from protein sequences with known binding residues through steps 2) to 5), and repeating this procedure to construct a training set; dividing the training set into ten training subsets, wherein each training subset contains all positive residue samples in the training set, and negative samples are randomly added to each training subset at a positive-to-negative sample ratio of 1:2;

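8) training the deep convolutional neural network built in 6) on each of the ten training subsets from 7), wherein each training run uses a two-class cross-entropy loss function to adjust the parameters of the network, yielding ten deep convolutional neural network models in total; the two-class cross-entropy loss function is written as:

$$Y = -\left[\,u \log \hat{u} + (1 - u)\log\left(1 - \hat{u}\right)\right]$$

where $u$ denotes the true label of the residue to be determined in the protein sequence, $\hat{u}$ denotes the predicted output value of the network model, and $Y$ denotes the difference between the predicted output and the true label;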

9) inputting the residue samples generated from the protein sequence S into the ten models obtained in step 8), and setting an output probability threshold, denoted threshold, for each model, wherein a position whose output value is greater than threshold is a binding residue predicted by that model; each residue sample in S is predicted by the ten models to generate ten prediction results, and the majority outcome among the ten prediction results is taken as the final prediction result.
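The two-class cross-entropy loss named in step 8) can be written out in a few lines of NumPy. This sketch uses the standard per-sample form of binary cross-entropy; the function name `binary_cross_entropy`, the argument name `u_hat` for the predicted output, and the clipping constant `eps` are assumptions added for illustration and numerical safety.

```python
import numpy as np


def binary_cross_entropy(u, u_hat, eps=1e-7):
    """Per-sample binary cross-entropy between true labels u and predictions u_hat.

    u:     true label(s), 1 for a binding residue and 0 otherwise.
    u_hat: sigmoid output(s) of the network, expected in (0, 1).
    eps:   clipping constant (an assumption) that keeps log() finite.
    """
    u = np.asarray(u, dtype=float)
    u_hat = np.clip(np.asarray(u_hat, dtype=float), eps, 1 - eps)
    return -(u * np.log(u_hat) + (1 - u) * np.log(1 - u_hat))


# Made-up labels and model outputs for four residue samples.
labels = np.array([1, 0, 1, 0])
outputs = np.array([0.9, 0.2, 0.4, 0.7])
per_sample = binary_cross_entropy(labels, outputs)
print(per_sample, per_sample.mean())
```

The loss is small when the output is close to the true label and grows as the two diverge, which is what drives the parameter updates during the training of each model.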

The above describes the prediction result obtained by the invention, taking DNA binding residue prediction for the protein sequence 1X3C as an example. It is not intended to limit the scope of the invention, and various modifications and improvements may be made without departing from that scope.

Claims

1. A DNA binding residue prediction method based on a deep convolutional neural network, characterized in that the prediction method comprises the following steps:

1) Input a protein sequence S with L residues whose DNA binding residues are to be predicted;

2) For the protein sequence S, use the psi-blast program to search the protein sequence database swissprot to generate a position-specific scoring matrix of size L×20, denoted as PSSM;

3) For the protein sequence S, use the PSSpred program to search the protein sequence database nr to generate a protein secondary structure matrix of size L×3, denoted as PSS;

4) Combine the two matrices obtained in steps 2) and 3) into an L × 23 feature matrix, denoted as F;

5) Pad F with 8 rows of zeros before its first row and 8 rows of zeros after its last row; then, starting from the 9th row of F and ending at the (L-9)th row of F, take the residue corresponding to the middle row as the prediction target, and take the 8 rows of data adjacent to it on each side as the feature matrix of that residue;

6) Build a deep convolutional neural network to predict the DNA binding residues of the protein sequence S. The network has eight layers: the first seven are convolutional layers and the last is a fully connected layer. Each convolutional layer contains a two-dimensional convolution layer, a normalization layer and a pooling layer, and the output of each layer is used as the input of the next layer. The fully connected layer uses the sigmoid activation function to map the output of the convolutional layers into the range (0, 1);

7) Generate residue samples from protein sequences with known binding residues through steps 2) to 5), repeat this procedure to construct a training set, and divide the training set into M training subsets. Each training subset contains all positive residue samples in the training set, and negative samples are randomly added to each training subset at a positive-to-negative sample ratio of 1:2;

8) Use each of the M training subsets from 7) to train the deep convolutional neural network built in 6). Each training run uses the two-class cross-entropy loss function to adjust the parameters of the network, yielding M deep convolutional neural network models in total. The two-class cross-entropy loss function is written as:

$$Y = -\left[\,u \log \hat{u} + (1 - u)\log\left(1 - \hat{u}\right)\right]$$

where $u$ denotes the true label of the residue to be determined in the protein sequence, $\hat{u}$ denotes the predicted output value of the network model, and $Y$ denotes the difference between the predicted output and the true label;

9) Input the residue samples generated from the protein sequence S into the M models obtained in 8). For each model, set an output probability threshold, denoted threshold; a position whose output value is greater than threshold is a binding residue predicted by that model. Each residue sample in S is predicted by the M models to generate M prediction results, and the majority outcome among the M prediction results is taken as the final prediction result.