CN113506595A - Method for identifying DNA promoter element based on information theory
- Publication number
- CN113506595A
- Authority
- CN
- China
- Prior art keywords
- information
- sequence
- promoter
- trinucleotide
- mer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The invention discloses a method for identifying DNA promoter elements based on information theory. The method is based on a double-layer recognition model for judging different types of promoters. The double-layer recognition model carries out promoter sequence recognition through the following steps: step 101: acquiring a promoter sequence dataset from an Escherichia coli database; step 102: extracting position-specific frequencies of trinucleotide composition information and dinucleotide composition information from the DNA promoter sequence data with the PSTNP algorithm; step 103: optimizing the position-specific frequency information of the trinucleotide composition information and dinucleotide composition information. The promoter element type identification layer resamples the datasets of the different promoter types with the SMOTE algorithm. The invention addresses the prediction of DNA promoters and their specific types, and applies an information-theoretic method to optimize the features extracted from sequence frequency information, thereby markedly improving prediction accuracy.
Description
Technical Field
The invention belongs to the field of functional element prediction algorithms in bioinformatics, and particularly relates to a method for identifying DNA promoter elements based on information theory.
Background
Promoters are DNA regulatory elements located upstream of a gene, near its transcription start site. They control the initiation of gene-specific transcription and determine the timing and level of gene expression. Identifying promoters allows genes to be located accurately, so promoter identification is important for studying gene structure and annotating gene information at the genome level. A promoter is recognized by a sigma factor when RNA polymerase binds to it specifically; sigma factors differ in function and structure, and promoters are accordingly classified into six types: σ24, σ28, σ32, σ38, σ54 and σ70. At present, researchers still identify these promoters mainly by biological experiments. However, because such experiments consume considerable time and materials, computational methods are increasingly favored for classification.
Disclosure of Invention
The invention aims to provide a method for accurately and efficiently predicting DNA promoter elements and their types. The PSTNP algorithm used by the invention extracts position-specific nucleotide information well, and the invention further improves PSTNP with an information content scoring matrix, which describes the difference between the frequency matrices of the positive and negative samples more clearly. The position-specific feature matrices of the trinucleotides and dinucleotides are combined, and a two-layer prediction model based on a support vector machine is constructed: the first layer judges whether a sequence is a promoter, and the second layer predicts the type of each identified promoter, with good prediction performance.
The invention addresses the identification and prediction of DNA promoter elements and their types, and comprises the following steps in sequence:
A method for identifying a DNA promoter element based on information theory, the method being based on a double-layer identification model for judging different types of promoters, the double-layer identification model being composed of a promoter element identification layer and a promoter element type identification layer, wherein:
promoter sequence recognition is carried out through the following steps:
step 101: acquiring a promoter sequence dataset from an Escherichia coli database;
step 102: performing position-specific frequency extraction of trinucleotide composition information and dinucleotide composition information on the DNA promoter sequence data by the PSTNP algorithm; the position-specific frequencies comprise the trinucleotide and dinucleotide position-specific frequency information $F^{+}$ and $F^{-}$ on the positive and negative datasets;
step 103: optimizing the position-specific frequency information of the trinucleotide composition information and dinucleotide composition information by the following formula;
wherein $F^{+}$ and $F^{-}$ respectively represent the frequency matrices obtained from the positive and negative sample frequency information, and the result represents, for each of the $4^k$ trinucleotides or dinucleotides at each of the $81-k+1$ positions in the sequence, the degree of difference between its positive and negative sample frequencies;
and the promoter element type identification layer resamples the datasets of the different promoter types by using the SMOTE algorithm.
Further, the position-specific nucleotide composition information of the promoter sequence obtained in step 102 is generated by the following steps:
2.1 each 81 bp sequence sample S is written as:
$S = N_1 N_2 \cdots N_l \cdots N_{81}$
where $N_l$ denotes the nucleotide at the l-th position and is one of A, C, G, T;
2.2 extracting position-specific information of the promoter sequence by the k-mer method, taking k = 3 and k = 2 respectively;
2.3 calculating the position-specific frequency information of the trinucleotides and dinucleotides over the entire positive and negative datasets, expressed as the frequency matrices $F^{+}$ and $F^{-}$ respectively,
where each entry of $F^{+}$ or $F^{-}$ is the frequency with which one of the $4^k$ trinucleotides ($3mer_i$) or dinucleotides ($2mer_i$) occurs at one of the $81-k+1$ positions in the sequence; $3mer_i$ ranges over AAA, AAC, …, TTT, and $2mer_i$ ranges over AA, AC, …, TT.
Further, the optimization in step 103 of the position-specific frequency information of the trinucleotide composition information and the dinucleotide composition information further includes:
with this position-specific frequency information, each sequence sample S can be represented as:
$S = [\phi_1, \phi_2, \ldots, \phi_w, \ldots, \phi_{81-k+1}]^{T}$
where T is the transpose operator.
For trinucleotides, $\phi_w$ is defined as follows:
where w is the position of the trinucleotide in the sequence, and the score represents the degree of difference between the positive and negative sample frequencies of the trinucleotide ($3mer_i$, one of the $4^k$ possible) occurring at the w-th position in the sequence; $3mer_i$ ranges over AAA, AAC, …, TTT.
For dinucleotides, $\phi_w$ is defined as follows:
where w is the position of the dinucleotide in the sequence, and the score represents the degree of difference between the positive and negative sample frequencies of the dinucleotide ($2mer_i$, one of the $4^k$ possible) occurring at the w-th position in the sequence; $2mer_i$ ranges over AA, AC, …, TT.
Advantageous Effects
The invention uses an information-theoretic method to process sequence frequency information for DNA promoter element identification and type prediction. The PSTNP algorithm extracts position-specific nucleotide information from the promoter sequence to represent the sequence, and PSTNP is improved with an information content scoring matrix that enlarges the difference between the frequency-matrix distributions of the positive and negative samples. The trinucleotide and dinucleotide feature matrices are combined to retain as much useful feature information as possible. Finally, the invention constructs a two-layer prediction model based on a support vector machine: the first layer judges whether a sequence is a promoter; the second layer predicts the specific type of each identified promoter, with good prediction performance. The prediction accuracy of the invention is higher than that of other existing models, which makes it valuable for research on DNA promoter element recognition and type prediction.
Drawings
FIG. 1 is a flow chart of the computational process of the present invention;
FIG. 2 is a performance comparison of different k-mer selections;
FIG. 3 is a performance comparison of three information theory algorithms in feature optimization;
FIG. 4 is a comparison of the performance of six feature selection algorithms;
FIG. 5 is a performance comparison of different resampling strategies in the second-layer promoter type prediction;
FIG. 6 is a performance comparison with two existing promoter prediction models.
Detailed Description
Promoters determine the initiation of DNA sequence-specific transcription and are important regulatory elements required for gene expression. Identifying and locating promoters helps to locate genes accurately and guides the annotation of structural and functional information in a genome. During gene transcription, RNA polymerase needs a specific sigma protein factor to assist recognition when it binds to a particular promoter, so the sigma factor is commonly used to label the promoter type as σ24, σ28, σ32, σ38, σ54 or σ70. Traditional biological experiments for identifying promoters and their types are time-consuming, laborious and costly; by comparison, identifying and classifying promoters with bioinformatics algorithms is more economical and convenient.
The basic idea of the invention is as follows: extract position-specific information from the promoter sequence, optimize and improve the features, and construct a two-layer prediction model based on a support vector machine. The first layer judges whether a sequence is a promoter; the second layer predicts the specific type of each identified promoter.
The invention mainly comprises the following steps. First, a DNA promoter sequence dataset is constructed. Then the PSTNP algorithm is used to obtain position-specific k-mer nucleotide composition information of the promoter sequences, the extracted sequence information is optimized with an information content scoring matrix, and the trinucleotide and dinucleotide feature matrices are merged to obtain more feature information. Finally, a prediction model is constructed with a support vector machine algorithm to identify promoters and their types. The flow chart of the whole calculation process is shown in FIG. 1. With this two-layer prediction model, better prediction results are obtained than with other existing models.
Step (1): obtaining an Escherichia coli (E. coli K-12) promoter sequence dataset from the RegulonDB database (version 9.3) and removing redundant sequences with CD-HIT;
Step (2): obtaining position-specific trinucleotide composition information and dinucleotide composition information of the DNA promoter sequences by the PSTNP algorithm;
Step (3): calculating an information content scoring matrix and optimizing the extracted sequence information based on information theory;
Step (4): merging the feature matrices of the trinucleotides and the dinucleotides;
Step (5): constructing a prediction model with a support vector machine and identifying DNA promoter sequences;
Step (6): resampling the datasets of the different promoter subtypes with the SMOTE algorithm to address dataset imbalance;
Step (7): constructing a prediction model and identifying the different types of promoter sequences.
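Steps (5) to (7) amount to a two-stage decision flow. The following is a minimal sketch of that control flow only; `is_promoter` and `promoter_type` are hypothetical callables standing in for the two trained support vector machine layers, and the label strings are illustrative names, not identifiers from the patent.

```python
from typing import Callable, Optional, Sequence

# Illustrative labels for the six sigma-factor promoter types named in the text.
SIGMA_TYPES = ("sigma24", "sigma28", "sigma32", "sigma38", "sigma54", "sigma70")


def two_layer_predict(
    features: Sequence[float],
    is_promoter: Callable[[Sequence[float]], bool],
    promoter_type: Callable[[Sequence[float]], str],
) -> Optional[str]:
    """Return None for a non-promoter, otherwise the predicted sigma type."""
    # Layer 1: promoter vs. non-promoter.
    if not is_promoter(features):
        return None
    # Layer 2: only sequences accepted by layer 1 are assigned a type.
    label = promoter_type(features)
    if label not in SIGMA_TYPES:
        raise ValueError(f"unexpected promoter type: {label!r}")
    return label
```

Keeping the two layers as separate callables mirrors the text: the type classifier is trained on promoters only, so it should never be asked about a sequence the first layer rejected.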
Further, the position-specific nucleotide composition information of the DNA promoter sequence in step (2) is generated by the following steps:
2.1 each 81 bp sequence sample S is written as:
$S = N_1 N_2 \cdots N_l \cdots N_{81}$
where $N_l$ denotes the nucleotide at the l-th position and is one of A, C, G, T.
2.2 extracting position-specific information of the promoter sequence by the k-mer method, taking k = 3 and k = 2 respectively;
2.3 calculating the position-specific frequency information of the trinucleotides and dinucleotides over the entire positive and negative datasets, expressed as the frequency matrices $F^{+}$ and $F^{-}$ respectively,
where each entry of $F^{+}$ or $F^{-}$ is the frequency with which one of the $4^k$ trinucleotides ($3mer_i$) or dinucleotides ($2mer_i$) occurs at one of the $81-k+1$ positions in the sequence; $3mer_i$ ranges over AAA, AAC, …, TTT, and $2mer_i$ ranges over AA, AC, …, TT.
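Steps 2.1 to 2.3 can be sketched as a counting routine. This is an illustration under stated assumptions rather than the patent's implementation: the function names are hypothetical, the matrix is laid out with one row per k-mer (in the order AAA, AAC, …, TTT) and one column per position, and k-mers containing symbols other than A, C, G, T are skipped.

```python
from itertools import product


def kmer_index(k):
    """Map each of the 4**k k-mers over A, C, G, T to a row index (AA..A is 0)."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def position_specific_frequencies(sequences, k):
    """Return a 4**k x (L-k+1) matrix of k-mer frequencies per position.

    All sequences must share one length L (81 in the text above).
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must have equal length")
    index = kmer_index(k)
    n_pos = length - k + 1
    counts = [[0] * n_pos for _ in range(4 ** k)]
    for seq in sequences:
        for j in range(n_pos):
            row = index.get(seq[j:j + k].upper())
            if row is not None:  # skip k-mers with symbols other than A/C/G/T
                counts[row][j] += 1
    n = len(sequences)
    # Divide counts by the number of sequences to obtain frequencies.
    return [[c / n for c in row] for row in counts]
```

Calling the function once on the positive dataset and once on the negative dataset, for k = 3 and again for k = 2, would yield the four frequency matrices the text refers to.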
Further, the optimization of the PSTNP algorithm with the information content score matrix in step (3) is as follows:
3.1 using the information content score matrix, the sequence information is optimized by an information-theoretic method; the process is expressed as:
where $F^{+}$ and $F^{-}$ respectively represent the frequency matrices obtained from the positive and negative sample frequency information;
the result represents, for each of the $4^k$ trinucleotides or dinucleotides at each of the $81-k+1$ positions in the sequence, the degree of difference between its positive and negative sample frequencies.
3.2 each sequence sample S can then be represented as:
$S = [\phi_1, \phi_2, \ldots, \phi_w, \ldots, \phi_{81-k+1}]^{T}$
where T is the transpose operator.
For trinucleotides, $\phi_w$ is defined as follows:
where w is the position of the trinucleotide in the sequence, and the score represents the degree of difference between the positive and negative sample frequencies of the trinucleotide ($3mer_i$, one of the $4^k$ possible) occurring at the w-th position in the sequence; $3mer_i$ ranges over AAA, AAC, …, TTT.
For dinucleotides, $\phi_w$ is defined as follows:
where w is the position of the dinucleotide in the sequence, and the score represents the degree of difference between the positive and negative sample frequencies of the dinucleotide ($2mer_i$, one of the $4^k$ possible) occurring at the w-th position in the sequence; $2mer_i$ ranges over AA, AC, …, TT.
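The encoding in 3.2, together with the merging of trinucleotide and dinucleotide features in step (4), can be sketched as a table lookup. The patent's information content score formula is not reproduced in this text, so the sketch assumes a stand-in: `difference_matrix` computes the plain entrywise difference of the positive and negative frequency matrices used by the original PSTNP method. All function names are hypothetical.

```python
from itertools import product


def kmer_index(k):
    """Map each of the 4**k k-mers over A, C, G, T to a row index."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def difference_matrix(f_pos, f_neg):
    """Stand-in score (assumption): entrywise F+ minus F-, as in plain PSTNP.

    This is NOT the patent's information content score, which is not given here.
    """
    return [[p - n for p, n in zip(rp, rn)] for rp, rn in zip(f_pos, f_neg)]


def encode(seq, score, k):
    """Return [phi_1, ..., phi_{L-k+1}]: the score of the k-mer at each position."""
    index = kmer_index(k)
    features = []
    for w in range(len(seq) - k + 1):
        row = index.get(seq[w:w + k].upper())
        # k-mers with unknown symbols (e.g. N) get a neutral score of 0.0.
        features.append(score[row][w] if row is not None else 0.0)
    return features


def encode_combined(seq, score3, score2):
    """Concatenate trinucleotide and dinucleotide features (79 + 80 for 81 bp)."""
    return encode(seq, score3, 3) + encode(seq, score2, 2)
```

Because `encode` accepts any score matrix of the right shape, a different scoring rule can be swapped in without touching the encoding step.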
With this calculation method, 5-fold cross-validation was carried out for all prediction experiments. First, when obtaining position-specific k-mer nucleotide composition information, different choices such as 1-mer, 2-mer and 3-mer were tried; the performance comparison is shown in FIG. 2. Combining the position-specific features of trinucleotides and dinucleotides performs best, with an overall accuracy significantly higher than the alternatives, so this hybrid method was chosen for feature extraction. Next, three information-theoretic methods were tried for improving the PSTNP algorithm, as shown in FIG. 3. The features constructed with the information content score matrix give the best predictions (Acc: 90.05%), with a clear advantage over the original PSTNP method and over features processed with KL divergence or JS divergence. After processing the features with this best scheme, different classification algorithms were used for prediction under 5-fold cross-validation, as shown in FIG. 4. The classifier built with an SVM gives the best results in both Acc (90.05%) and MCC (0.68).
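For reference, the two divergences named in the comparison have standard definitions, sketched below. The text does not say how they were applied to the frequency matrices, so the inputs here are simply two hypothetical discrete distributions, for example the same position column taken from the positive and the negative frequency matrix. The `eps` guard is an implementation assumption.

```python
from math import log2


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits; eps guards against log(0) and division by zero."""
    return sum(
        pi * log2((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric, and at most 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

KL divergence is asymmetric and grows without bound when q assigns near-zero probability where p does not; JS divergence avoids both properties by comparing each distribution with their average.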
At the second layer of the classifier, the six promoter types are identified and classified. The numbers of samples in the datasets of the different promoter types are extremely unbalanced. The six subsets were therefore resampled to construct three different datasets: the original dataset Data I, the dataset Data II undersampled with CD-HIT, and the dataset Data III oversampled with SMOTE. The classification results on the three datasets are compared in FIG. 5. On the original dataset Data I, the classification results differ greatly between promoter types and the performance is unbalanced. Each subset of Data II is small, with only 96 samples; on this dataset the best set of results was obtained. Data III, processed with SMOTE, has a more reliable dataset size of 500 samples per subset. Its results, whether sensitivity, specificity, Acc or MCC, are more balanced than those on the untreated Data I but slightly worse than those on Data II.
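The SMOTE oversampling used for Data III can be illustrated with a small pure-Python sketch. It follows the basic SMOTE idea (interpolate between a minority sample and one of its nearest minority neighbours) and is not the implementation used in the patent; the function name, the default `k=5` and the fixed random seed are assumptions.

```python
import random


def smote(samples, n_target, k=5, seed=0):
    """Oversample `samples` (a list of feature vectors) up to n_target vectors."""
    if len(samples) < 2:
        raise ValueError("SMOTE needs at least two samples")
    rng = random.Random(seed)
    k = min(k, len(samples) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    result = [list(s) for s in samples]  # originals are kept unchanged
    while len(result) < n_target:
        i = rng.randrange(len(samples))
        base = samples[i]
        # k nearest neighbours of `base` among the other original samples.
        others = [s for j, s in enumerate(samples) if j != i]
        neighbours = sorted(others, key=lambda s: sq_dist(base, s))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        result.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return result
```

With `n_target=500` applied to each promoter-type subset, this would produce subsets of the size described for Data III. Synthetic points always lie on a segment between two real samples, so they stay inside the region the real data already covers.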
Finally, 5-fold cross-validation was used to compare the performance of different classifiers on the promoter classification problem. The invention was compared with two other classification methods on the same dataset, as shown in FIG. 6. The results show that the iPro2L-PSTKNC classifier provided by the invention performs best at the first layer, with Acc reaching 90.05% when distinguishing promoters from non-promoters. At the second layer, the model also has the best performance for each promoter subtype, with accuracy above 91%.
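The 5-fold cross-validation protocol can be sketched as an index splitter. This is a plain shuffled split under assumed defaults; whether the original experiments stratified the folds by class is not stated, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds nearly equal folds.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test_idx = folds[f]
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train_idx, test_idx
```

In each round a model is trained on four folds and evaluated on the remaining one, and the reported metrics are aggregated over the five rounds.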
In conclusion, the invention provides an improved feature extraction algorithm based on PSTNP that effectively describes the position-specific nucleotide information of promoter sequences. An SVM algorithm is then applied to build the classification model, and 5-fold cross-validation is used to evaluate its performance. In addition, promoters that have been identified are further classified by type, and a resampling algorithm is used to handle the imbalance between the datasets of the different promoter types. Compared with the most advanced existing classifiers, the proposed prediction model improves markedly on evaluation metrics such as sensitivity, specificity, accuracy and MCC (Matthews correlation coefficient), providing a more effective method for promoter prediction and identification. The calculation process is simple, easy to implement and widely applicable.
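The four evaluation metrics named above have standard definitions in terms of a binary confusion matrix. The sketch below computes them from counts of true positives, true negatives, false positives and false negatives; the function name and the convention of returning 0.0 for an undefined ratio are assumptions.

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy and MCC from confusion counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    sn = tp / (tp + fn) if tp + fn else 0.0   # sensitivity (recall on positives)
    sp = tn / (tn + fp) if tn + fp else 0.0   # specificity (recall on negatives)
    acc = (tp + tn) / total
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

MCC ranges from -1 to 1 and uses all four counts, which makes it more informative than accuracy alone when the classes are unbalanced, as they are for the promoter types here.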
Claims (3)
1. A method for identifying a DNA promoter element based on information theory, wherein the method is based on a double-layer identification model for judging different types of promoters, the double-layer identification model being composed of a promoter element identification layer and a promoter element type identification layer, wherein:
promoter sequence recognition is carried out through the following steps:
step 101: acquiring a promoter sequence dataset from an Escherichia coli database;
step 102: performing position-specific frequency extraction of trinucleotide composition information and dinucleotide composition information on the DNA promoter sequence data by the PSTNP algorithm; the position-specific frequencies comprise the position-specific frequency information $F^{+}$ and $F^{-}$ of the trinucleotides and dinucleotides on the positive and negative datasets;
step 103: optimizing the position-specific frequency information of the trinucleotide composition information and dinucleotide composition information by the following formula;
wherein $F^{+}$ and $F^{-}$ respectively represent the frequency matrices obtained from the positive and negative sample frequency information, and the result represents, for each of the $4^k$ trinucleotides or dinucleotides at each of the $81-k+1$ positions in the sequence, the degree of difference between its positive and negative sample frequencies;
and the promoter element type identification layer resamples the datasets of the different types of promoters by using the SMOTE algorithm.
2. The method of claim 1, wherein the position-specific nucleotide composition information of the promoter sequence obtained in step 102 is generated by the following steps:
2.1 each 81 bp sequence sample S is written as:
$S = N_1 N_2 \cdots N_l \cdots N_{81}$
where $N_l$ denotes the nucleotide at the l-th position and is one of A, C, G, T;
2.2 extracting position-specific information of the promoter sequence by the k-mer method, taking k = 3 and k = 2 respectively;
2.3 calculating the position-specific frequency information of the trinucleotides and dinucleotides over the entire positive and negative datasets, expressed as the frequency matrices $F^{+}$ and $F^{-}$ respectively.
3. The method of claim 1, wherein the optimization in step 103 of the position-specific frequency information of the trinucleotide composition information and the dinucleotide composition information further comprises:
with this position-specific frequency information, each sequence sample S can be represented as:
$S = [\phi_1, \phi_2, \ldots, \phi_w, \ldots, \phi_{81-k+1}]^{T}$
where T is the transpose operator.
For trinucleotides, $\phi_w$ is defined as follows:
where w is the position of the trinucleotide in the sequence, and the score represents the degree of difference between the positive and negative sample frequencies of the trinucleotide ($3mer_i$, one of the $4^k$ possible) occurring at the w-th position in the sequence; $3mer_i$ ranges over AAA, AAC, …, TTT;
for dinucleotides, $\phi_w$ is defined as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110907396.8A CN113506595A (en) | 2021-08-09 | 2021-08-09 | Method for identifying DNA promoter element based on information theory |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113506595A true CN113506595A (en) | 2021-10-15 |
Family
ID=78015853
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2006249249A1 (en) * | 1997-06-03 | 2007-01-11 | Rutgers, The State University Of New Jersey | Plastid promoters for transgene expression in the plastids of higher plants |
CN111161793A (en) * | 2020-01-09 | 2020-05-15 | 青岛科技大学 | Stacking integration based N in RNA6Method for predicting methyladenosine modification site |
Non-Patent Citations (4)
Title |
---|
WENYING HE: "EnhancerPred2.0: predicting enhancers and their strength based on position-specific trinucleotide propensity and electron–ion interaction potential feature selection", MOLECULAR BIOSYSTEMS * |
YINUO LYU et al.: "iEnhancer-KL: A Novel Two-Layer Predictor", IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS * |
YINUO LYU et al.: "iPro2L-PSTKNC: A Two-Layer Predictor for Discovering Various Types of Promoters by Position Specific of Nucleotide Composition", IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS * |
LIU FENG: "Study on microsatellite distribution in the complete cpDNA sequence of Dendrobium huoshanense and its molecular identification", JOURNAL OF CHINESE MEDICINAL MATERIALS (中药材) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20211015 |