CN114512188B

CN114512188B - DNA binding protein recognition method based on improved protein sequence position specificity matrix

Info

Publication number: CN114512188B
Application number: CN202210274125.8A
Authority: CN
Inventors: 冉坤; 彭绍亮; 赵雄君; 潘亮; 王练; 刘文娟
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2022-03-20
Filing date: 2022-03-20
Publication date: 2024-04-05
Anticipated expiration: 2042-03-20
Also published as: CN114512188A

Abstract

The invention discloses a DNA binding protein recognition method based on an improved protein sequence position specificity matrix, comprising the following steps: S1, initializing parameters; S2, constructing DNA binding protein sequence information; S3, representing a protein sequence with a position-specific scoring matrix; S4, normalizing the position-specific scoring matrix to obtain an improved position-specific scoring matrix; S5, inputting the improved matrix into a convolutional neural network; S6, inputting the output of the convolutional neural network into a bidirectional long short-term memory network; S7, weighting the hidden features produced by the different memory cells with a time-distributed dense layer; S8, inputting the output of the dense layer into a flatten layer; S9, inputting the improved position-specific scoring matrix into a random forest model to obtain a decision result for a specific protein sequence; S10, inputting the output of step S8 and the decision result of step S9 into a scoring layer, and computing the final prediction score according to the set weights. The invention improves prediction performance and accuracy.

Description

DNA binding protein recognition method based on improved protein sequence position specificity matrix

Technical Field

The invention relates to the interdisciplinary technical field of bioinformatics and computer technology, and in particular to a DNA binding protein recognition method based on an improved protein sequence position specificity matrix.

Background

DNA binding proteins (DBPs) play an important role in a variety of biological processes, such as DNA replication, transcriptional control, chromatin stability and modification, epigenetic regulation, post-transcriptional gene regulation, alternative splicing, and translation. They are also implicated in certain diseases, such as cancer and myeloid leukemia. Because the binding of these proteins to DNA also plays an important role in gene expression, accurate recognition of DNA binding proteins is of great importance.

Experimental techniques such as chromatin immunoprecipitation on microarrays, X-ray crystallography, and filter binding assays can identify DNA binding proteins accurately, but they are expensive and time-consuming. Computational methods are inexpensive and, particularly in the post-genomic era, are a good complement to experimental techniques. In recent years, computational methods based on machine learning algorithms have received widespread attention for their encouraging performance. Given a protein sequence as input, machine-learning-based methods have proven effective at automatically predicting whether the protein binds to DNA.

Improving the accuracy of models for DNA binding protein recognition is therefore significant, and using such predictions to discover important potential functions, such as DNA replication and transcription, and their mechanisms of action is of great scientific importance.

Disclosure of Invention

The invention aims to provide a DNA binding protein identification method based on an improved protein sequence position specificity matrix, so as to overcome the defects in the prior art.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

a method for DNA binding protein recognition based on an improved protein sequence position specificity matrix comprising the steps of:

S1, initializing parameters, including: setting the network input dimension $dim$ and the network sequence length $l$; setting the number and size of the filters of the first convolutional layer of the convolutional neural network to $n_1$ and $size_1$; setting the number and size of the filters of the second convolutional layer to $n_2$ and $size_2$; setting the pooling kernel size of the max pooling layer to $size_3$; setting the number of neurons of the bidirectional long short-term memory network to $n_3$; setting the number of nodes of the fully connected layer to $n_4$; and denoting the final prediction score of the DNA binding protein as $score_{DBP}$, the neural network prediction result as $score_1$, the random forest prediction result as $score_2$, the weight of the neural network prediction result as $w_1$, and the weight of the random forest prediction result as $w_2$;

S2, constructing DNA binding protein sequence information;

S3, for a given protein sequence $S = S_1 S_2 \ldots S_L$, representing the protein sequence with a position-specific scoring matrix (PSSM), where $S_i$ ($1 \le i \le L$) denotes the amino acid at the $i$-th position of $S$ and $L$ is the length of $S$;

S4, normalizing the position-specific scoring matrix, decomposing the normalized matrix into $n$ submatrices, computing the local position-specific scoring matrix features of all submatrices, and expressing the protein sequence as a feature vector of a specific dimension, thereby obtaining an improved position-specific scoring matrix;

S5, inputting the improved position-specific scoring matrix into a convolutional neural network in which two convolutional layers are stacked in sequence, the output of each layer serving as the input of the next layer, with ReLU as the activation function of the convolutional layers;

S6, inputting the output of the convolutional neural network into the bidirectional long short-term memory network, with ReLU as the activation function;

S7, weighting the hidden features produced by the different memory cells with a time-distributed dense layer;

S8, inputting the output of the dense layer into a flatten layer, which converts the result into one-dimensional data, and inputting the one-dimensional data into the fully connected layer to obtain the output, the output node using sigmoid as its activation function;

S9, inputting the improved position-specific scoring matrix obtained in step S4 into a random forest model, and obtaining a decision result for the specific protein sequence from the decision trees of the random forest;

S10, inputting the output of step S8 and the decision result of step S9 into a scoring layer, and computing the final prediction score according to the set weights, where the prediction score corresponds to a confidence level: the higher the score, the more likely the recognition is correct.

Further, the step S2 specifically includes:

S20, obtaining proteins annotated with the gene ontology (GO) term "DNA-binding" from the annotated protein sequence database Swiss-Prot as positive samples $S^+$;

S21, collecting proteins not annotated with the gene ontology (GO) term "DNA-binding" from the annotated protein sequence database Swiss-Prot as negative samples $S^-$;

S22, removing from the positive samples $S^+$ and the negative samples $S^-$ any protein whose length is less than a set value;

S23, removing homologous proteins from the positive samples $S^+$ and the negative samples $S^-$, using a cut-off threshold equal to a first set value and a coverage equal to a second set value of the sequence length.

Further, the negative samples $S^-$ in step S21 are selected from proteins of known structure.

Further, step S23 specifically uses CD-HIT and BLASTCLUST to remove, from the positive samples $S^+$ and the negative samples $S^-$, homologous proteins with a cut-off threshold of 0.35 and a coverage of 90% of the sequence length.

Further, in the step S3, an e-value threshold is set to 0.001 and the iteration number is set to 10, and a corresponding position-specific score matrix is generated through PSI-BLAST.
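As an illustration of the PSI-BLAST settings above, the following is a minimal sketch that assembles a command line for the NCBI BLAST+ `psiblast` program. The file names and the database name are hypothetical placeholders, and mapping the "e-value threshold" to the `-evalue` option is an assumption; the text may instead intend the profile inclusion threshold (`-inclusion_ethresh`).

```python
from typing import List


def build_psiblast_command(
    query_fasta: str,
    database: str,
    pssm_out: str,
    evalue: float = 0.001,
    num_iterations: int = 10,
) -> List[str]:
    """Assemble a BLAST+ psiblast argument list that writes an ASCII PSSM.

    Assumption: the e-value threshold of 0.001 is passed as -evalue.
    """
    return [
        "psiblast",
        "-query", query_fasta,                    # single-sequence FASTA file
        "-db", database,                          # pre-formatted protein database
        "-evalue", str(evalue),                   # e-value threshold from the text
        "-num_iterations", str(num_iterations),   # 10 iterations from the text
        "-out_ascii_pssm", pssm_out,              # PSSM written as plain text
    ]


# Hypothetical file and database names, for illustration only.
cmd = build_psiblast_command("protein.fasta", "swissprot", "protein.pssm")
print(" ".join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and a formatted database are installed; the sketch itself only builds the arguments.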

Further, the step S4 specifically includes:

S40, dividing the position-specific scoring matrix into $n$ submatrices ($n \ge 1$), where each of the first $n-1$ submatrices has $L/n$ rows and 20 columns, the last submatrix has $L - (n-1)L/n$ rows and 20 columns, and each submatrix retains the evolutionary information contained in the position-specific scoring matrix;

S41, computing the local position-specific scoring matrix features of each submatrix, where 20 local features are computed for each of the first $n-1$ submatrices by combining the evolutionary information, and the features of the last submatrix are computed from the values of the first $n-1$ submatrices.
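The row split of step S40 can be sketched as follows. The text does not state the normalization function or the formula for the local features, so the logistic normalization and the per-column mean used here are assumptions, and the last block is treated like the others rather than derived from the first $n-1$ blocks. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalize(pssm: Matrix) -> Matrix:
    """Map every raw score into (0, 1) with a logistic function (assumed)."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in pssm]


def split_rows(pssm: Matrix, n: int) -> List[Matrix]:
    """Split into n blocks: the first n-1 have L // n rows, the last has the rest."""
    length = len(pssm)
    if n < 1 or n > length:
        raise ValueError("n must satisfy 1 <= n <= L")
    step = length // n
    blocks = [pssm[k * step:(k + 1) * step] for k in range(n - 1)]
    blocks.append(pssm[(n - 1) * step:])  # L - (n-1) * (L // n) rows
    return blocks


def local_features(block: Matrix) -> List[float]:
    """Column means of one block: 20 values for a 20-column PSSM (assumed)."""
    rows = len(block)
    return [sum(column) / rows for column in zip(*block)]


def improved_pssm_features(pssm: Matrix, n: int) -> List[float]:
    """Concatenate the local features of all n blocks into one vector."""
    features: List[float] = []
    for block in split_rows(normalize(pssm), n):
        features.extend(local_features(block))
    return features


# Toy 7 x 20 matrix standing in for a real PSSM.
toy_pssm = [[float((i + j) % 5 - 2) for j in range(20)] for i in range(7)]
features = improved_pssm_features(toy_pssm, n=3)
print(len(features))
```

With this layout every protein, whatever its length $L$, is mapped to a vector of $20n$ values, which is what allows sequences of different lengths to share one model input.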

Further, the step S9 specifically includes:

S90, sampling the improved position-specific scoring matrix obtained in step S4 with replacement to obtain a plurality of sample sets;

S91, randomly extracting $m$ features from the candidate features to serve as the candidate features for the decision at the current node, selecting from these candidates the feature used to split the training samples, and building a decision tree with each sample set as training samples; once the sample set has been generated and the features determined, each single decision tree is built with the CART algorithm and is not pruned;

S92, after the set number of decision trees has been obtained, voting on the outputs of the decision trees, the class with the most votes being taken as the decision of the random forest.

Further, step S90 specifically comprises: each time, randomly drawing $n_2$ samples with replacement from the original $n_1$ training samples.
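The sampling of step S90 and the vote of step S92 can be sketched with the standard library alone. Tree construction with CART (step S91) is omitted, the vote list is made up for illustration, and the function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def bootstrap_indices(n_1: int, n_2: int, rng: random.Random) -> List[int]:
    """Draw n_2 indices from range(n_1) with replacement; repeats are allowed."""
    return [rng.randrange(n_1) for _ in range(n_2)]


def majority_vote(tree_outputs: Sequence[int]) -> int:
    """Return the class predicted by the largest number of trees.

    On a tie, the class that appears first in tree_outputs wins.
    """
    return Counter(tree_outputs).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# One bootstrap sample set per tree; each set indexes the original samples.
sample_sets = [bootstrap_indices(n_1=100, n_2=100, rng=rng) for _ in range(5)]

# Illustrative outputs of five trees for one protein (1 = DNA binding).
votes = [1, 0, 1, 1, 0]
decision = majority_vote(votes)
print(decision)
```

Because the draw is with replacement, a sample set usually contains some training samples several times and omits others, which is what makes the individual trees differ.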

Further, the step S10 specifically includes:

S100, obtaining the DNA binding protein prediction score $score_1$ from the output of step S8 and the DNA binding protein prediction score $score_2$ from the decision result of step S9;

S110, calculating the final prediction score from the weights $w_1$ and $w_2$ as follows:

$$score_{DBP} = score_1 \cdot w_1 + score_2 \cdot w_2.$$
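A minimal sketch of the scoring layer follows. The numeric scores and weights are illustrative assumptions; the text gives no values for $w_1$ and $w_2$ and does not say whether they must sum to 1.

```python
def final_score(score_1: float, score_2: float, w_1: float, w_2: float) -> float:
    """Weighted combination of the neural network and random forest scores."""
    return score_1 * w_1 + score_2 * w_2


# Illustrative values only: network output 0.8, forest decision 1.0.
example = final_score(score_1=0.8, score_2=1.0, w_1=0.6, w_2=0.4)
print(round(example, 2))  # about 0.88
```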

Compared with the prior art, the invention has the following advantages. First, the DNA binding protein recognition method based on the improved protein sequence position specificity matrix improves prediction accuracy by constructing positive and negative samples of DNA binding proteins. Second, it improves the PSSM and learns the spatial and temporal sequence information of DNA binding proteins through a convolutional neural network, a bidirectional long short-term memory network, and a random forest model, which improves the recognition performance for DNA binding proteins. Finally, by setting different weights, the outputs of the neural network and the random forest model are weighted to obtain the final prediction score, which improves prediction performance and accuracy.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a DNA binding protein recognition method of the present invention based on an improved protein sequence position specificity matrix.

FIG. 2 is a structural diagram of the neural network used in the DNA binding protein recognition method of the present invention based on an improved protein sequence position specificity matrix.

Detailed Description

The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the advantages and features of the present invention can be more easily understood by those skilled in the art, thereby making clear and defining the scope of the present invention.

Referring to FIG. 1, the present embodiment discloses a DNA binding protein recognition method based on an improved protein sequence position specificity matrix, comprising the steps of:

Step S1, initializing parameters, including: setting the network input dimension $dim$ and the network sequence length $l$; setting the number and size of the filters of the first convolutional layer of the convolutional neural network to $n_1$ and $size_1$; setting the number and size of the filters of the second convolutional layer to $n_2$ and $size_2$; setting the pooling kernel size of the max pooling layer to $size_3$; setting the number of neurons of the bidirectional long short-term memory network to $n_3$; setting the number of nodes of the fully connected layer to $n_4$; and denoting the final prediction score of the DNA binding protein as $score_{DBP}$, the neural network prediction result as $score_1$, the random forest prediction result as $score_2$, the weight of the neural network prediction result as $w_1$, and the weight of the random forest prediction result as $w_2$.

Step S2, constructing DNA binding protein sequence information.

Specifically, step S2 comprises the following steps:

Step S20, obtaining proteins annotated with the gene ontology (GO) term "DNA-binding" from the annotated protein sequence database Swiss-Prot as positive samples $S^+$.

Step S21, collecting proteins not annotated with the gene ontology (GO) term "DNA-binding" from the annotated protein sequence database Swiss-Prot as negative samples $S^-$. To ensure the quality of the negative samples, the negative samples $S^-$ are selected from proteins of known structure.

Step S22, removing from the positive samples $S^+$ and the negative samples $S^-$ any protein whose length is less than a set value (40 in this embodiment).

Step S23, removing from the positive samples $S^+$ and the negative samples $S^-$ homologous proteins with a cut-off threshold of a first set value (0.35) and a coverage of a second set value (90%) of the sequence length; in this embodiment the CD-HIT and BLASTCLUST methods are used.
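The length filter of step S22 can be sketched as below, using the set value of 40 from this embodiment. The sequence identifiers and sequences are made up for illustration, and the homology removal of step S23 is not shown because it relies on the external CD-HIT and BLASTCLUST tools.

```python
from typing import Dict


def drop_short_sequences(seqs: Dict[str, str], min_length: int = 40) -> Dict[str, str]:
    """Keep only the sequences whose length is at least min_length."""
    return {name: seq for name, seq in seqs.items() if len(seq) >= min_length}


# Made-up sequences of length 39, 40 and 60.
positives = {
    "P1": "M" * 39,
    "P2": "M" * 40,
    "P3": "MK" * 30,
}
kept = drop_short_sequences(positives)
print(sorted(kept))  # ['P2', 'P3']
```

The same function would be applied to the negative samples so that both sets obey the same minimum length.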

Step S3, for a given protein sequence $S = S_1 S_2 \ldots S_L$, representing the protein sequence with a position-specific scoring matrix (PSSM), where $S_i$ ($1 \le i \le L$) denotes the amino acid at the $i$-th position of $S$ and $L$ is the length of $S$.

In this embodiment, the e-value threshold is set to 0.001 and the number of iterations is set to 10, and the corresponding position-specific scoring matrix is generated by PSI-BLAST.

Step S4, normalizing the position-specific scoring matrix (PSSM), decomposing the normalized matrix into $n$ submatrices, computing the local position-specific scoring matrix features of all submatrices, and expressing the protein sequence as a feature vector of a specific dimension, thereby obtaining an improved position-specific scoring matrix $IMPSSM = \{\, x \mid x = normalization(PSSM(i)),\ 0 < i < n+1 \,\}$.

Specifically, step S4 in this embodiment comprises:

Step S40, dividing the position-specific scoring matrix into $n$ submatrices ($n \ge 1$), where each of the first $n-1$ submatrices has $L/n$ rows and 20 columns, the last submatrix has $L - (n-1)L/n$ rows and 20 columns, and each submatrix retains the evolutionary information contained in the position-specific scoring matrix (PSSM).

Step S41, computing the local position-specific scoring matrix features of each submatrix, where 20 local features are computed for each of the first $n-1$ submatrices by combining the evolutionary information, and the features of the last submatrix are computed from the values of the first $n-1$ submatrices.

Step S5, inputting the improved position-specific scoring matrix (IMPSSM) into a convolutional neural network in which two convolutional layers are stacked in sequence, the output of each layer serving as the input of the next layer, with ReLU as the activation function of the convolutional layers.

Step S6, inputting the output of the convolutional neural network into the bidirectional long short-term memory network, with ReLU as the activation function.

Step S7, after the bidirectional long short-term memory network, weighting the hidden features produced by the different memory cells with a time-distributed dense layer.

Step S8, inputting the output of the dense layer into a flatten layer, which converts the result into one-dimensional data, and inputting the one-dimensional data into the fully connected layer to obtain the output, the output node using sigmoid as its activation function.
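To make the layer stack of steps S5 to S8 concrete, the sketch below traces tensor shapes through it without any deep learning library. Several points are assumptions not stated in the text: unit stride and no padding in the convolutions, a max pooling layer placed after the second convolution with stride equal to $size_3$, a bidirectional layer that returns one concatenated forward and backward state per time step, a hypothetical width `td_units` for the time-distributed dense layer, and a single sigmoid node after the fully connected layer. The numeric values in the example are also illustrative.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def trace_shapes(
    l: int, dim: int,
    n_1: int, size_1: int,
    n_2: int, size_2: int,
    size_3: int, n_3: int,
    td_units: int, n_4: int,
) -> List[Tuple[str, Shape]]:
    """Return (layer name, output shape) pairs, batch dimension omitted."""
    shapes: List[Tuple[str, Shape]] = [("input", (l, dim))]
    steps = l - size_1 + 1                 # conv 1, no padding, stride 1
    shapes.append(("conv1_relu", (steps, n_1)))
    steps = steps - size_2 + 1             # conv 2, no padding, stride 1
    shapes.append(("conv2_relu", (steps, n_2)))
    steps = steps // size_3                # max pooling, stride = pool size
    shapes.append(("max_pool", (steps, n_2)))
    shapes.append(("bilstm", (steps, 2 * n_3)))           # forward + backward
    shapes.append(("time_distributed_dense", (steps, td_units)))
    shapes.append(("flatten", (steps * td_units,)))
    shapes.append(("fully_connected", (n_4,)))
    shapes.append(("sigmoid_output", (1,)))
    if steps < 1:
        raise ValueError("sequence too short for the chosen filter sizes")
    return shapes


# Illustrative hyperparameters, not taken from the text.
layers = trace_shapes(l=100, dim=20, n_1=64, size_1=3, n_2=64, size_2=3,
                      size_3=2, n_3=32, td_units=16, n_4=64)
for name, shape in layers:
    print(f"{name:24s} {shape}")
```

Tracing shapes this way shows why the flatten layer is needed: the time-distributed dense layer still emits one vector per time step, and the fully connected layer expects a single one-dimensional input.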

Step S9, inputting the improved position-specific scoring matrix obtained in step S4 into a random forest model, and obtaining a decision result for the specific protein sequence from the decision trees of the random forest.

Specifically, step S9 comprises:

Step S90, sampling the improved position-specific scoring matrix obtained in step S4 with replacement to obtain a plurality of sample sets; specifically, each time $n_2$ samples are randomly drawn with replacement from the original $n_1$ training samples, so a sample set may contain duplicate samples.

Step S91, randomly extracting $m$ features from the candidate features to serve as the candidate features for the decision at the current node, selecting from these candidates the feature used to split the training samples, and building a decision tree with each sample set as training samples; once the sample set has been generated and the features determined, each single decision tree is built with the CART algorithm and is not pruned.

Step S92, after the set number of decision trees has been obtained, voting on the outputs of the decision trees, the class with the most votes being taken as the decision of the random forest.

Step S10, inputting the output of step S8 and the decision result of step S9 into a scoring layer, and computing the final prediction score according to the set weights.

Specifically, step S10 comprises:

Step S100, obtaining the DNA binding protein prediction score $score_1$ from the output of step S8 and the DNA binding protein prediction score $score_2$ from the decision result of step S9.

Step S110, calculating the final prediction score from the weights $w_1$ and $w_2$ as follows:

$$score_{DBP} = score_1 \cdot w_1 + score_2 \cdot w_2.$$

In summary, the invention first improves prediction accuracy by constructing positive and negative samples of DNA binding proteins. Second, it improves the PSSM and learns the spatial and temporal sequence information of DNA binding proteins through a convolutional neural network, a bidirectional long short-term memory network, and a random forest model, which improves the recognition performance for DNA binding proteins. Finally, by setting different weights, the outputs of the neural network and the random forest model are weighted to obtain the final prediction score, which improves prediction performance and accuracy.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, the patentees may make various modifications or alterations within the scope of the appended claims, and such modifications and alterations are intended to fall within the scope of the invention as described in the claims.

Claims

1. A method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix comprising the steps of:

S1, initializing parameters, including: setting the network input dimension $dim$ and the network sequence length $l$; setting the number and size of the filters of the first convolutional layer of the convolutional neural network to $n_1$ and $size_1$; setting the number and size of the filters of the second convolutional layer to $n_2$ and $size_2$; setting the pooling kernel size of the max pooling layer to $size_3$; setting the number of neurons of the bidirectional long short-term memory network to $n_3$; setting the number of nodes of the fully connected layer to $n_4$; and denoting the final prediction score of the DNA binding protein as $score_{DBP}$, the neural network prediction result as $score_1$, the random forest prediction result as $score_2$, the weight of the neural network prediction result as $w_1$, and the weight of the random forest prediction result as $w_2$;

S2, constructing DNA binding protein sequence information;

S10, inputting the output of step S8 and the decision result of step S9 into a scoring layer, and computing the final prediction score according to the set weights.

2. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 1, wherein said step S2 specifically comprises:

3. The method for DNA binding protein recognition based on the improved protein sequence position specificity matrix according to claim 2, wherein the negative samples $S^-$ in step S21 are selected from proteins of known structure.

4. The method for DNA binding protein recognition based on the improved protein sequence position specificity matrix according to claim 2, wherein said step S23 specifically uses CD-HIT and BLASTCLUST to remove, from the positive samples $S^+$ and the negative samples $S^-$, homologous proteins with a cut-off threshold of 0.35 and a coverage of 90% of the sequence length.

5. The method for identifying a DNA binding protein based on the improved protein sequence position specificity matrix according to claim 1, wherein the e-value threshold is set to 0.001 and the number of iterations is set to 10 in the step S3, and the corresponding position specificity score matrix is generated by PSI-BLAST.

6. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 1, wherein said step S4 specifically comprises:

7. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 1, wherein said step S9 specifically comprises:

8. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 7, wherein said step S90 specifically comprises: each time, randomly drawing $n_2$ samples with replacement from the original $n_1$ training samples.

9. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 1, wherein said step S10 specifically comprises:

$$score_{DBP} = score_1 \cdot w_1 + score_2 \cdot w_2.$$