CN114512188B - DNA binding protein recognition method based on improved protein sequence position specificity matrix - Google Patents
- Publication number
- CN114512188B
- Authority
- CN
- China
- Prior art keywords
- position specificity
- matrix
- protein sequence
- score
- dna binding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a DNA binding protein recognition method based on an improved protein sequence position specificity matrix, comprising the following steps: S1, initializing parameters; S2, constructing DNA binding protein sequence information; S3, representing a protein sequence with a position-specific scoring matrix (PSSM); S4, normalizing the PSSM to obtain an improved PSSM; S5, inputting the improved PSSM into a convolutional neural network; S6, inputting the output of the convolutional neural network into a bidirectional long short-term memory (BiLSTM) network; S7, weighting the hidden features generated by the different memory cells with a time-distributed dense layer; S8, inputting the output of the dense layer into a flatten layer; S9, inputting the improved PSSM into a random forest model to obtain a decision result for the given protein sequence; S10, inputting the output of step S8 and the decision result of step S9 into a scoring layer, and computing the final prediction score according to the set weights. The invention improves prediction performance and accuracy.
Description
Technical Field
The invention relates to the technical field at the intersection of bioinformatics and computer science, and in particular to a DNA binding protein recognition method based on an improved protein sequence position specificity matrix.
Background
DNA binding proteins (DBPs) play an important role in a variety of biological processes, such as DNA replication, transcriptional control, chromatin stability and modification, epigenetic regulation, post-transcriptional gene regulation, alternative splicing, and translation. They are also involved in certain diseases, such as cancer and myeloid leukemia. Because the binding of these proteins to DNA also plays an important role in gene expression, accurate recognition of DNA binding proteins is of great importance.
Experimental techniques such as chromatin immunoprecipitation on microarrays, X-ray crystallography, and filter binding assays can accurately identify DNA binding proteins, but these methods are expensive and time-consuming. Particularly in the post-genomic era, computational methods are inexpensive and are a good complement to experimental techniques. In recent years, computational methods based on machine learning algorithms have received widespread attention for their encouraging performance. Given a protein sequence as input, machine-learning-based methods have proven effective in automatically predicting whether the protein binds to DNA.
Improving the accuracy of models for DNA binding protein recognition is therefore significant, and using this knowledge to discover potentially important functions in DNA replication, transcription, and related processes, together with their mechanisms of action, is of great scientific value.
Disclosure of Invention
The invention aims to provide a DNA binding protein recognition method based on an improved protein sequence position specificity matrix, so as to overcome the defects of the prior art.
To achieve the above purpose, the invention adopts the following technical scheme:
A DNA binding protein recognition method based on an improved protein sequence position specificity matrix, comprising the following steps:
S1, initializing parameters, including: setting the network input dimension $dim$ and the network sequence length $l$; setting the number and size of the filters of the first convolutional layer of the convolutional neural network to $n_1$ and $size_1$; setting the number and size of the filters of the second convolutional layer to $n_2$ and $size_2$; setting the pooling kernel size of the max pooling layer to $size_3$; setting the number of neurons of the bidirectional long short-term memory (BiLSTM) network to $n_3$; setting the number of nodes of the fully connected layer to $n_4$; denoting the final prediction score of the DNA binding protein as $score_{DBP}$, the neural network prediction result as $score_1$, and the random forest prediction result as $score_2$; and denoting the weight of the neural network prediction result as $w_1$ and the weight of the random forest prediction result as $w_2$;
S2, constructing DNA binding protein sequence information;
S3, for a given protein sequence $S$, using a position-specific scoring matrix to represent the protein sequence $S = S_1 S_2 \ldots S_L$, where $S_i$ ($1 \le i \le L$) denotes the amino acid at the $i$-th position in $S$ and $L$ is the length of $S$;
S4, normalizing the position-specific scoring matrix, decomposing the normalized matrix into $n$ submatrices, calculating the local position-specific scoring matrix features of all submatrices, and expressing the protein sequence as a feature vector of a specific dimension, to obtain an improved position-specific scoring matrix;
S5, inputting the improved position-specific scoring matrix into a convolutional neural network in which two convolutional layers are stacked in sequence, the output of the previous layer serving as the input of the next layer, and the convolutional layers using ReLU as the activation function;
S6, inputting the output of the convolutional neural network into the bidirectional long short-term memory network, which uses ReLU as the activation function;
S7, weighting the hidden features generated by the different memory cells with a time-distributed dense layer;
S8, inputting the output of the dense layer into a flatten layer to convert the result into one-dimensional data, and inputting the one-dimensional data into the fully connected layer to obtain the output, wherein the output node uses sigmoid as the activation function;
S9, inputting the improved position-specific scoring matrix obtained in step S4 into a random forest model, and obtaining a decision result for the given protein sequence from the decision trees of the random forest;
S10, inputting the output of step S8 and the decision result of step S9 into a scoring layer, and computing the final prediction score according to the set weights, wherein the prediction score corresponds to a confidence level: the higher the score, the more likely the recognition is correct.
Further, the step S2 specifically includes:
S20, obtaining proteins annotated with the gene classification term DNA-binding from the annotated protein sequence database Swiss-Prot as positive samples $S^{+}$;
S21, collecting proteins whose annotations are unrelated to the gene classification term DNA-binding from the annotated protein sequence database Swiss-Prot as negative samples $S^{-}$;
S22, removing proteins whose length is smaller than a set value from the positive samples $S^{+}$ and the negative samples $S^{-}$;
S23, removing from the positive samples $S^{+}$ and the negative samples $S^{-}$ homologous proteins at a cutoff threshold equal to a first set value and a coverage equal to a second set value of the sequence length.
Further, the negative samples $S^{-}$ in step S21 are selected from proteins of known structure.
Further, step S23 specifically uses CD-HIT and BLASTCLUST to remove from the positive samples $S^{+}$ and the negative samples $S^{-}$ homologous proteins at a cutoff threshold of 0.35 and a coverage of 90% of the sequence length.
Further, in step S3, the e-value threshold is set to 0.001 and the number of iterations is set to 10, and the corresponding position-specific scoring matrix is generated by PSI-BLAST.
Further, the step S4 specifically includes:
S40, dividing the position-specific scoring matrix into $n$ submatrices, wherein each of the first $n-1$ submatrices has $L/n$ rows and 20 columns, the last submatrix has $L-(n-1)L/n$ rows and 20 columns, and each submatrix retains the evolutionary information contained in the position-specific scoring matrix, where $n \ge 1$;
S41, calculating the local position-specific scoring matrix features of each submatrix, wherein 20 local features are calculated for each of the first $n-1$ submatrices by combining the evolutionary information, and the features of the last submatrix are calculated from the values of the first $n-1$ submatrices.
Further, the step S9 specifically includes:
S90, sampling the improved position-specific scoring matrix obtained in step S4 with replacement to obtain a plurality of sample sets;
S91, randomly drawing $m$ features from the full feature set as the candidate features for the decision at the current node, and selecting from these candidates the feature used to split the training samples; constructing one decision tree from each sample set used as training data, wherein, once the sample set has been generated and the features determined, each individual decision tree is computed with the CART algorithm without pruning;
S92, after the set number of decision trees has been obtained, voting on the outputs of the decision trees according to the random forest method, and taking the class with the most votes as the decision of the random forest.
Further, step S90 specifically draws $n_2$ samples at random with replacement from the original $n_1$ training samples each time.
Further, the step S10 specifically includes:
S100, obtaining the DNA binding protein prediction score $score_1$ output by step S8 and the DNA binding protein prediction score $score_2$ of the decision result of step S9, respectively;
S110, calculating the final prediction score from the weights $w_1$ and $w_2$ as follows:
$score_{DBP} = score_1 \cdot w_1 + score_2 \cdot w_2$.
Compared with the prior art, the invention has the following advantages: first, the DNA binding protein recognition method based on the improved protein sequence position specificity matrix improves prediction accuracy by constructing positive and negative samples of DNA binding proteins; second, it learns the spatial and temporal sequence information of DNA binding proteins through a convolutional neural network, a bidirectional long short-term memory network, and a random forest model, and improves the PSSM, thereby improving DNA binding protein recognition performance; finally, by setting different weights, the outputs of the neural network and the random forest model are weighted to obtain the final prediction score, improving prediction performance and accuracy.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art may derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of a DNA binding protein recognition method of the present invention based on an improved protein sequence position specificity matrix.
FIG. 2 is a block diagram of the neural network used in the DNA binding protein recognition method of the present invention based on an improved protein sequence position specificity matrix.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the present invention can be more easily understood by those skilled in the art and the scope of protection of the present invention is defined more clearly.
Referring to FIG. 1, the present embodiment discloses a DNA binding protein recognition method based on an improved protein sequence position specificity matrix, comprising the steps of:
Step S1, initializing parameters, including: setting the network input dimension $dim$ and the network sequence length $l$; setting the number and size of the filters of the first convolutional layer of the convolutional neural network to $n_1$ and $size_1$; setting the number and size of the filters of the second convolutional layer to $n_2$ and $size_2$; setting the pooling kernel size of the max pooling layer to $size_3$; setting the number of neurons of the bidirectional long short-term memory (BiLSTM) network to $n_3$; setting the number of nodes of the fully connected layer to $n_4$; denoting the final prediction score of the DNA binding protein as $score_{DBP}$, the neural network prediction result as $score_1$, and the random forest prediction result as $score_2$; and denoting the weight of the neural network prediction result as $w_1$ and the weight of the random forest prediction result as $w_2$.
Step S2, constructing DNA binding protein sequence information.
Specifically, step S2 includes the following steps:
Step S20, obtaining proteins annotated with the gene classification term DNA-binding from the annotated protein sequence database Swiss-Prot as positive samples $S^{+}$.
Step S21, collecting proteins whose annotations are unrelated to the gene classification term DNA-binding from the annotated protein sequence database Swiss-Prot as negative samples $S^{-}$. To ensure the quality of the negative samples, $S^{-}$ is selected from proteins of known structure.
Step S22, removing proteins whose length is smaller than a set value (40 in this embodiment) from the positive samples $S^{+}$ and the negative samples $S^{-}$.
Step S23, removing from the positive samples $S^{+}$ and the negative samples $S^{-}$ homologous proteins at a cutoff threshold equal to a first set value (0.35) and a coverage equal to a second set value (90%) of the sequence length; in this embodiment the removal is performed with CD-HIT and BLASTCLUST.
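The length filter of step S22 can be illustrated with a minimal Python sketch. The function name, the protein IDs, and the toy sequences are hypothetical; only the threshold of 40 comes from this embodiment. The homology removal of step S23 relies on external tools and is not reproduced here.

```python
def filter_by_length(sequences, min_len=40):
    """Keep only sequences with at least `min_len` residues.

    `sequences` maps a (hypothetical) protein ID to its amino-acid string.
    """
    return {pid: seq for pid, seq in sequences.items() if len(seq) >= min_len}


# Toy data: the IDs and sequences are made up for illustration.
positives = {"P_short": "MKV", "P_long": "A" * 45}
negatives = {"N_edge": "G" * 40, "N_short": "G" * 39}

positives = filter_by_length(positives)
negatives = filter_by_length(negatives)
print(sorted(positives), sorted(negatives))
```

A sequence of exactly 40 residues is kept, since the text removes only proteins shorter than the set value.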
Step S3, for a given protein sequence $S$, using a position-specific scoring matrix (PSSM) to represent the protein sequence $S = S_1 S_2 \ldots S_L$, where $S_i$ ($1 \le i \le L$) denotes the amino acid at the $i$-th position in $S$ and $L$ is the length of $S$.
In this embodiment, the e-value threshold is set to 0.001 and the number of iterations is set to 10, and the corresponding position-specific scoring matrix is generated by PSI-BLAST.
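As a hedged illustration of these settings, the following Python sketch builds an argument list for the `psiblast` program of NCBI BLAST+. The file names and the database name are hypothetical. The text does not say which e-value option is meant, so the sketch assumes that 0.001 is the threshold for including a hit in the PSSM (`-inclusion_ethresh`) rather than the reporting threshold (`-evalue`).

```python
def psiblast_command(query_fasta, database, pssm_out,
                     e_threshold=0.001, iterations=10):
    """Build the argument list for one PSI-BLAST run that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        # Assumption: 0.001 is the PSSM inclusion threshold, not -evalue.
        "-inclusion_ethresh", str(e_threshold),
        "-out_ascii_pssm", pssm_out,
    ]


# Hypothetical file and database names, for illustration only.
cmd = psiblast_command("query.fasta", "swissprot", "query.pssm")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and a formatted protein database are installed.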
Step S4, normalizing the position-specific scoring matrix (PSSM), decomposing the normalized matrix into $n$ submatrices, calculating the local PSSM features of all submatrices, and expressing the protein sequence as a feature vector of a specific dimension, to obtain the improved position-specific scoring matrix $\mathrm{IMPSSM} = \{\, x \mid x = \mathrm{normalization}(\mathrm{PSSM}(i)),\ 0 < i < n+1 \,\}$.
Specifically, step S4 in this embodiment includes:
Step S40, dividing the position-specific scoring matrix into $n$ submatrices, wherein each of the first $n-1$ submatrices has $L/n$ rows and 20 columns, the last submatrix has $L-(n-1)L/n$ rows and 20 columns, and each submatrix retains the evolutionary information contained in the position-specific scoring matrix (PSSM), where $n \ge 1$.
Step S41, calculating the local position-specific scoring matrix features of each submatrix, wherein 20 local features are calculated for each of the first $n-1$ submatrices by combining the evolutionary information, and the features of the last submatrix are calculated from the values of the first $n-1$ submatrices.
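The row partition of step S40 can be sketched in Python as follows. Several choices are assumptions rather than details given in the text: the logistic function used for normalization, integer division for $L/n$, and column means as the 20 local features of a block. For simplicity the sketch also computes the features of the last block directly from its own rows, which is a simplification of step S41.

```python
import math


def normalize(pssm):
    """Map every raw score into (0, 1) with a logistic function (an assumption)."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in pssm]


def split_rows(pssm, n):
    """Split an L x 20 matrix into n blocks: L // n rows each, remainder in the last.

    Requires L >= n so that no block is empty.
    """
    step = len(pssm) // n
    blocks = [pssm[k * step:(k + 1) * step] for k in range(n - 1)]
    blocks.append(pssm[(n - 1) * step:])
    return blocks


def local_features(block):
    """Column means of one block: 20 local features (an assumed feature definition)."""
    return [sum(col) / len(block) for col in zip(*block)]


# Toy PSSM with L = 10 rows and 20 columns of made-up integer scores.
toy_pssm = [[(i + j) % 7 - 3 for j in range(20)] for i in range(10)]
blocks = split_rows(normalize(toy_pssm), n=3)
features = [f for block in blocks for f in local_features(block)]
print([len(b) for b in blocks], len(features))
```

With $L = 10$ and $n = 3$, the first two blocks have 3 rows each and the last has $10 - 2 \cdot 3 = 4$ rows, giving a feature vector of $3 \times 20 = 60$ values.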
Step S5, inputting the improved position-specific scoring matrix (IMPSSM) into a convolutional neural network in which two convolutional layers are stacked in sequence, the output of the previous layer serving as the input of the next layer, and the convolutional layers using ReLU as the activation function.
Step S6, inputting the output of the convolutional neural network into the bidirectional long short-term memory network, which uses ReLU as the activation function.
Step S7, after the bidirectional long short-term memory network, weighting the hidden features generated by the different memory cells with a time-distributed dense layer.
Step S8, inputting the output of the dense layer into a flatten layer to convert the result into one-dimensional data, and inputting the one-dimensional data into the fully connected layer to obtain the output, wherein the output node uses sigmoid as the activation function.
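To make the data flow of steps S5 to S8 concrete, the following Python sketch traces the tensor shape after each layer. It performs shape arithmetic only and is not a trained model. The assumptions are: one-dimensional convolutions with stride 1 and no padding, a max pooling layer placed after the second convolution with stride equal to the pool size, and a BiLSTM that returns every time step and concatenates both directions. The parameter `td_units` (width of the time-distributed dense layer) is hypothetical, and all numeric values are illustrative.

```python
def trace_shapes(l, dim, n1, size1, n2, size2, size3, n3, n4, td_units):
    """Return (layer, output shape) pairs for the stack described in S5 to S8.

    Assumptions: 1-D convolutions with stride 1 and no padding, pooling with
    stride equal to the pool size, and a BiLSTM that returns every time step
    and concatenates both directions. `td_units` is a hypothetical parameter.
    """
    shapes = [("input", (l, dim))]
    steps = l - size1 + 1
    shapes.append(("conv1 + ReLU", (steps, n1)))
    steps = steps - size2 + 1
    shapes.append(("conv2 + ReLU", (steps, n2)))
    steps = steps // size3
    shapes.append(("max pooling", (steps, n2)))
    shapes.append(("BiLSTM", (steps, 2 * n3)))
    shapes.append(("time-distributed dense", (steps, td_units)))
    shapes.append(("flatten", (steps * td_units,)))
    shapes.append(("fully connected", (n4,)))
    shapes.append(("sigmoid output", (1,)))
    return shapes


# All hyperparameter values below are illustrative, not taken from the patent.
for name, shape in trace_shapes(l=100, dim=20, n1=64, size1=3, n2=64,
                                size2=3, size3=2, n3=32, n4=16, td_units=8):
    print(f"{name:24s} {shape}")
```

Under these assumptions a sequence length of 100 shrinks to 98 and then 96 time steps through the two convolutions, to 48 after pooling, and the flatten layer yields $48 \times 8 = 384$ values.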
Step S9, inputting the improved position-specific scoring matrix obtained in step S4 into a random forest model, and obtaining a decision result for the given protein sequence from the decision trees of the random forest.
Specifically, step S9 includes:
Step S90, sampling the improved position-specific scoring matrix obtained in step S4 with replacement to obtain a plurality of sample sets; specifically, each time $n_2$ samples are drawn at random with replacement from the original $n_1$ training samples (the drawn set may contain duplicate samples).
Step S91, randomly drawing $m$ features from the full feature set as the candidate features for the decision at the current node, and selecting from these candidates the feature used to split the training samples; constructing one decision tree from each sample set used as training data, wherein, once the sample set has been generated and the features determined, each individual decision tree is computed with the CART algorithm without pruning.
Step S92, after the set number of decision trees has been obtained, voting on the outputs of the decision trees according to the random forest method, and taking the class with the most votes as the decision of the random forest.
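The sampling and voting parts of steps S90 to S92 can be sketched with the Python standard library. The function names and toy data are hypothetical, and growing the individual CART trees is deliberately left out; the sketch only shows drawing with replacement, choosing $m$ candidate features for a node, and the majority vote.

```python
import random
from collections import Counter


def bootstrap_sample(samples, n2, rng):
    """Draw n2 items from `samples` uniformly at random, with replacement."""
    return [rng.choice(samples) for _ in range(n2)]


def candidate_features(num_features, m, rng):
    """Pick m distinct feature indices to consider at one tree node."""
    return rng.sample(range(num_features), m)


def majority_vote(tree_outputs):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_outputs).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
training = [("x%d" % i, i % 2) for i in range(6)]  # made-up (features, label) pairs
sample_set = bootstrap_sample(training, n2=6, rng=rng)
node_features = candidate_features(num_features=60, m=7, rng=rng)
decision = majority_vote([1, 0, 1, 1, 0])
print(len(sample_set), len(node_features), decision)
```

Because sampling is with replacement, `sample_set` may contain the same training item more than once, which matches the note in step S90.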
Step S10, inputting the output of step S8 and the decision result of step S9 into a scoring layer, and computing the final prediction score according to the set weights.
Specifically, step S10 includes:
Step S100, obtaining the DNA binding protein prediction score $score_1$ output by step S8 and the DNA binding protein prediction score $score_2$ of the decision result of step S9.
Step S110, calculating the final prediction score from the weights $w_1$ and $w_2$ as follows:
$score_{DBP} = score_1 \cdot w_1 + score_2 \cdot w_2$.
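The weighted fusion of step S110 amounts to a single expression, shown here as a small Python sketch. The function name and the numeric values are illustrative; the patent gives no concrete weights.

```python
def final_score(score1, score2, w1, w2):
    """Weighted sum of the neural-network score and the random-forest score."""
    return score1 * w1 + score2 * w2


# Illustrative values only; the patent does not give concrete weights.
score_dbp = final_score(score1=0.8, score2=0.5, w1=0.6, w2=0.4)
print(round(score_dbp, 2))
```

If $w_1 + w_2 = 1$ and both input scores lie in $[0, 1]$, the result also lies in $[0, 1]$; the text does not state this constraint, so it is noted here only as a convenient choice.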
According to the invention, first, prediction accuracy is improved by constructing positive and negative samples of DNA binding proteins; second, the spatial and temporal sequence information of DNA binding proteins is learned through a convolutional neural network, a bidirectional long short-term memory network, and a random forest model, and the PSSM is improved, thereby improving DNA binding protein recognition performance; finally, by setting different weights, the outputs of the neural network and the random forest model are weighted to obtain the final prediction score, improving prediction performance and accuracy.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, the patentee may make various modifications or alterations within the scope of the appended claims, and such modifications and alterations fall within the scope of the invention as described in the claims.
Claims (9)
1. A method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix comprising the steps of:
S1, initializing parameters, including: setting the network input dimension $dim$ and the network sequence length $l$; setting the number and size of the filters of the first convolutional layer of the convolutional neural network to $n_1$ and $size_1$; setting the number and size of the filters of the second convolutional layer to $n_2$ and $size_2$; setting the pooling kernel size of the max pooling layer to $size_3$; setting the number of neurons of the bidirectional long short-term memory (BiLSTM) network to $n_3$; setting the number of nodes of the fully connected layer to $n_4$; denoting the final prediction score of the DNA binding protein as $score_{DBP}$, the neural network prediction result as $score_1$, and the random forest prediction result as $score_2$; and denoting the weight of the neural network prediction result as $w_1$ and the weight of the random forest prediction result as $w_2$;
S2, constructing DNA binding protein sequence information;
S3, for a given protein sequence $S$, using a position-specific scoring matrix to represent the protein sequence $S = S_1 S_2 \ldots S_L$, where $S_i$ ($1 \le i \le L$) denotes the amino acid at the $i$-th position in $S$ and $L$ is the length of $S$;
S4, normalizing the position-specific scoring matrix, decomposing the normalized matrix into $n$ submatrices, calculating the local position-specific scoring matrix features of all submatrices, and expressing the protein sequence as a feature vector of a specific dimension, to obtain an improved position-specific scoring matrix;
S5, inputting the improved position-specific scoring matrix into a convolutional neural network in which two convolutional layers are stacked in sequence, the output of the previous layer serving as the input of the next layer, and the convolutional layers using ReLU as the activation function;
S6, inputting the output of the convolutional neural network into the bidirectional long short-term memory network, which uses ReLU as the activation function;
S7, weighting the hidden features generated by the different memory cells with a time-distributed dense layer;
S8, inputting the output of the dense layer into a flatten layer to convert the result into one-dimensional data, and inputting the one-dimensional data into the fully connected layer to obtain the output, wherein the output node uses sigmoid as the activation function;
S9, inputting the improved position-specific scoring matrix obtained in step S4 into a random forest model, and obtaining a decision result for the given protein sequence from the decision trees of the random forest;
S10, inputting the output of step S8 and the decision result of step S9 into a scoring layer, and computing the final prediction score according to the set weights.
2. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 1, wherein said step S2 specifically comprises:
S20, obtaining proteins annotated with the gene classification term DNA-binding from the annotated protein sequence database Swiss-Prot as positive samples $S^{+}$;
S21, collecting proteins whose annotations are unrelated to the gene classification term DNA-binding from the annotated protein sequence database Swiss-Prot as negative samples $S^{-}$;
S22, removing proteins whose length is smaller than a set value from the positive samples $S^{+}$ and the negative samples $S^{-}$;
S23, removing from the positive samples $S^{+}$ and the negative samples $S^{-}$ homologous proteins at a cutoff threshold equal to a first set value and a coverage equal to a second set value of the sequence length.
3. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 2, wherein the negative samples $S^{-}$ in step S21 are selected from proteins of known structure.
4. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 2, wherein said step S23 specifically uses CD-HIT and BLASTCLUST to remove from the positive samples $S^{+}$ and the negative samples $S^{-}$ homologous proteins at a cutoff threshold of 0.35 and a coverage of 90% of the sequence length.
5. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 1, wherein in said step S3 the e-value threshold is set to 0.001 and the number of iterations is set to 10, and the corresponding position-specific scoring matrix is generated by PSI-BLAST.
6. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 1, wherein said step S4 specifically comprises:
S40, dividing the position-specific scoring matrix into $n$ submatrices, wherein each of the first $n-1$ submatrices has $L/n$ rows and 20 columns, the last submatrix has $L-(n-1)L/n$ rows and 20 columns, and each submatrix retains the evolutionary information contained in the position-specific scoring matrix, where $n \ge 1$;
S41, calculating the local position-specific scoring matrix features of each submatrix, wherein 20 local features are calculated for each of the first $n-1$ submatrices by combining the evolutionary information, and the features of the last submatrix are calculated from the values of the first $n-1$ submatrices.
7. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 1, wherein said step S9 specifically comprises:
S90, sampling with replacement from the position specificity score matrix improved in step S4 to obtain a plurality of sample sets;
S91, randomly extracting m features from the candidate features to serve as the candidate features for the decision at the current node, selecting from them the feature used to split the training samples, and constructing a decision tree with each sample set as the training samples, wherein, once the sample set has been generated and the features determined, each single decision tree is built with the CART algorithm without pruning;
and S92, after the set number of decision trees has been obtained, voting on the outputs of the decision trees by the random forest method, so that the class with the most votes is taken as the decision of the random forest.
8. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 7, wherein said step S90 specifically comprises randomly drawing, each time, $n_2$ samples with replacement from the original $n_1$ training samples.
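The sampling and voting parts of claims 7 and 8 can be sketched in a few lines. This is a minimal sketch using only the Python standard library; growing the CART trees themselves is omitted, and the helper names, the toy data and the values of $n_1$, $n_2$ and m are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(samples, n2, rng):
    """Draw n2 items from the training samples with replacement (S90)."""
    return [rng.choice(samples) for _ in range(n2)]


def choose_candidate_features(num_features, m, rng):
    """Pick m distinct feature indices to consider at one node (S91)."""
    return sorted(rng.sample(range(num_features), m))


def majority_vote(tree_outputs):
    """Return the class label predicted by the most trees (S92)."""
    return Counter(tree_outputs).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
training = [("seq%d" % i, i % 2) for i in range(10)]  # (id, label), n1 = 10
bag = bootstrap_sample(training, 8, rng)              # n2 = 8
node_features = choose_candidate_features(20, 4, rng) # m = 4 of 20 features
decision = majority_vote([1, 0, 1, 1, 0])             # outputs of five trees
print(len(bag), node_features, decision)
```

Because sampling is with replacement, a bag usually contains repeated samples and leaves some originals out. Library implementations such as scikit-learn's `RandomForestClassifier` (with `bootstrap=True` and `max_features`) bundle the same three steps, though the claims do not name a particular library.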
9. The method for identifying a DNA binding protein based on an improved protein sequence position specificity matrix according to claim 1, wherein said step S10 specifically comprises:
S100, obtaining the prediction score $\mathrm{score}_1$ of the DNA binding protein output by step S8 and the prediction score $\mathrm{score}_2$ of the DNA binding protein from the decision result of step S9;
S110, calculating the final prediction score according to the different weights $w_1$ and $w_2$ as follows:
$$\mathrm{score}_{DBP} = \mathrm{score}_1 \cdot w_1 + \mathrm{score}_2 \cdot w_2.$$
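The weighted fusion of claim 9 is a single linear combination, sketched below. The numeric scores and weights are illustrative assumptions only; the claims do not give values for $w_1$ and $w_2$.

```python
def combine_scores(score_1, score_2, w_1, w_2):
    """Fuse the neural-network score (S8) and the random-forest score (S9)."""
    return score_1 * w_1 + score_2 * w_2


# Illustrative values: equal weights on two hypothetical model outputs.
final_score = combine_scores(0.80, 0.60, 0.5, 0.5)
print(round(final_score, 3))
```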
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210274125.8A CN114512188B (en) | 2022-03-20 | 2022-03-20 | DNA binding protein recognition method based on improved protein sequence position specificity matrix |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114512188A (en) | 2022-05-17 |
CN114512188B (en) | 2024-04-05 |
Family
ID=81553408
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210274125.8A Active CN114512188B (en) | 2022-03-20 | 2022-03-20 | DNA binding protein recognition method based on improved protein sequence position specificity matrix |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114512188B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105808975A (en) * | 2016-03-14 | 2016-07-27 | 南京理工大学 | Multi-core-learning and Boosting algorithm based protein-DNA binding site prediction method |
CN111785321A (en) * | 2020-06-12 | 2020-10-16 | 浙江工业大学 | DNA binding residue prediction method based on deep convolutional neural network |
CN112489723A (en) * | 2020-12-01 | 2021-03-12 | 南京理工大学 | DNA binding protein prediction method based on local evolution information |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA3100607A1 (en) * | 2018-05-23 | 2019-11-28 | Envisagenics, Inc. | Systems and methods for analysis of alternative splicing |
Non-Patent Citations (1)
Title |
---|
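Prediction of DNA-protein binding sites based on feature fusion; Xue Guangfu; Scientific and Technological Innovation (科学技术创新); 2020-06-05 (No. 16); full text *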
Also Published As
Publication number | Publication date |
---|---|
CN114512188A (en) | 2022-05-17 |
Similar Documents
Publication | Title |
---|---|
CN111798921A (en) | RNA binding protein prediction method and device based on multi-scale attention convolution neural network |
CN111370073B (en) | Medicine interaction rule prediction method based on deep learning |
CN114927162A (en) | Multi-set correlation phenotype prediction method based on hypergraph representation and Dirichlet distribution |
CN111933212A (en) | Clinical omics data processing method and device based on machine learning |
JP2024524795A (en) | Gene phenotype prediction based on graph neural networks |
CN108427865B (en) | Method for predicting correlation between LncRNA and environmental factors |
CN114819056B (en) | Single-cell data integration method based on domain countermeasure and variation inference |
CN109727637B (en) | Method for identifying key proteins based on mixed frog-leaping algorithm |
CN112927753A (en) | Method for identifying interface hot spot residues of protein and RNA (ribonucleic acid) compound based on transfer learning |
CN114091603A (en) | Spatial transcriptome cell clustering and analyzing method |
Yu et al. | SANPolyA: a deep learning method for identifying Poly (A) signals |
Wang et al. | A brief review of machine learning methods for RNA methylation sites prediction |
CN114783526A (en) | Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder |
CN114743600A (en) | Gate-controlled attention mechanism-based deep learning prediction method for target-ligand binding affinity |
Yan et al. | A review about RNA–protein-binding sites prediction based on deep learning |
CN115472221A (en) | Protein fitness prediction method based on deep learning |
Luo et al. | A Caps-UBI model for protein ubiquitination site prediction |
CN114512188B (en) | DNA binding protein recognition method based on improved protein sequence position specificity matrix |
Borah et al. | A review on advancements in feature selection and feature extraction for high-dimensional NGS data analysis |
CN116705192A (en) | Drug virtual screening method and device based on deep learning |
WO2023148684A1 (en) | Local steps in latent space and descriptors-based molecules filtering for conditional molecular generation |
CN114627964B (en) | Prediction enhancer based on multi-core learning and intensity classification method and classification equipment thereof |
CN115691661A (en) | Gene coding breeding prediction method and device based on graph clustering |
Durge et al. | Heuristic analysis of genomic sequence processing models for high efficiency prediction: A statistical perspective |
CN113223622B (en) | miRNA-disease association prediction method based on meta-path |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||