CN113506595A

CN113506595A - Method for identifying DNA promoter element based on information theory

Info

Publication number: CN113506595A
Application number: CN202110907396.8A
Authority: CN
Inventors: 郭菲; 吕一诺; 何文颖; 唐继军; 曹晶
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-08-09
Filing date: 2021-08-09
Publication date: 2021-10-15

Abstract

The invention discloses a method for identifying DNA promoter elements based on information theory. The method is based on a double-layer recognition model for distinguishing different types of promoters, wherein the double-layer recognition model recognizes promoter sequences through the following steps: step 101: acquiring a promoter sequence data set from an Escherichia coli database; step 102: extracting position-specific frequencies of trinucleotide composition information and dinucleotide composition information from the DNA promoter sequence data with the PSTNP algorithm; step 103: optimizing the position-specific frequency information of the trinucleotide composition information and the dinucleotide composition information. The promoter element type identification layer resamples the data sets of the different promoter types with the SMOTE algorithm. The invention addresses the prediction of DNA promoters and their specific types, and applies an information-theoretic method to optimize the features extracted from the sequence frequency information, thereby markedly improving prediction accuracy.

Description

Method for identifying DNA promoter element based on information theory

Technical Field

The invention belongs to the field of functional element prediction algorithms in bioinformatics, and particularly relates to a method for identifying DNA promoter elements based on information theory.

Background

Promoters are DNA regulatory elements located upstream of a gene, near its transcription start site. They control the initiation of gene-specific transcription and determine the timing and level of gene expression. Identifying a promoter allows the corresponding gene to be located accurately, so promoter identification is important for studying gene structure and annotating gene information at the genome level. A promoter is recognized by sigma factors of differing function and structure when it binds specifically to RNA polymerase, and promoters are accordingly classified into six types: σ²⁴, σ²⁸, σ³², σ³⁸, σ⁵⁴ and σ⁷⁰. At present, researchers still identify these promoters mainly by biological experiments. However, because such experiments consume considerable time and materials, computational methods are increasingly favored for classification.

Disclosure of Invention

The invention aims to provide a method for accurately and efficiently predicting DNA promoter elements and their types. The PSTNP algorithm used by the invention extracts position-specific nucleotide information well, and the invention further improves PSTNP with an information content scoring matrix, which describes the difference between the frequency matrices of the positive and negative samples more clearly. The position-specific feature matrices of the trinucleotides and dinucleotides are combined, and a two-layer prediction model based on a support vector machine is finally constructed: the first layer judges whether a sequence is a promoter, and the second layer further predicts the type of each identified promoter, with good prediction performance.

The invention addresses the identification and prediction of DNA promoter elements and their types, and is characterized as follows:

A method for identifying a DNA promoter element based on information theory, the method being based on a double-layer identification model for distinguishing different types of promoters, the double-layer identification model being composed of a promoter element identification layer and a promoter element type identification layer, wherein:

the double-layer identification model performs promoter sequence recognition through the following steps:

Step 101: acquiring a promoter sequence data set from an Escherichia coli database;

Step 102: extracting position-specific frequencies of trinucleotide composition information and dinucleotide composition information from the DNA promoter sequence data with the PSTNP algorithm; the position-specific frequencies include the trinucleotide and dinucleotide position-specific frequency information F⁺ and F⁻ on the positive and negative data sets;

Step 103: optimizing the position-specific frequency information of the trinucleotide composition information and the dinucleotide composition information with an information content scoring formula computed from F⁺ and F⁻,

wherein F⁺ and F⁻ respectively represent the frequency matrices obtained from the positive-sample and negative-sample frequency information, and each entry of the optimized matrix represents the degree of difference between the positive-sample and negative-sample frequencies of one of the 4^k trinucleotides or dinucleotides appearing at one of the 81-k+1 positions in the sequence.

The promoter element type identification layer resamples the data sets of the different promoter types with the SMOTE algorithm.

Further, the nucleotide composition information with position specificity of the promoter sequence obtained in step 102 is generated by the following steps:

2.1 for each 81 bp sequence sample S:

S = N₁N₂…N_l…N₈₁

where N_l denotes the nucleotide at the l-th position and is one of A, C, G, T;

2.2 extracting position-specific information from the promoter sequence with a k-mer method, taking k = 3 and k = 2 respectively;

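2.3 calculating the position-specific frequency information F⁺ and F⁻ of the trinucleotides and dinucleotides over the entire positive and negative data sets, respectively, expressed as follows:

$$F^{+} = \begin{bmatrix} f^{+}_{1,1} & f^{+}_{1,2} & \cdots & f^{+}_{1,81-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{+}_{4^{k},1} & f^{+}_{4^{k},2} & \cdots & f^{+}_{4^{k},81-k+1} \end{bmatrix}$$

and

$$F^{-} = \begin{bmatrix} f^{-}_{1,1} & f^{-}_{1,2} & \cdots & f^{-}_{1,81-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{-}_{4^{k},1} & f^{-}_{4^{k},2} & \cdots & f^{-}_{4^{k},81-k+1} \end{bmatrix}$$

where $f^{+}_{i,j}$ or $f^{-}_{i,j}$ represents the frequency with which the i-th of the 4^k trinucleotides (3mer_i) or dinucleotides (2mer_i) appears at the j-th of the 81-k+1 positions in the sequence; 3mer_i represents AAA, AAC, …, TTT, and 2mer_i represents AA, AC, …, TT.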
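Steps 2.1 to 2.3 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the invention: the function names `kmer_index` and `position_specific_frequencies` are hypothetical, the toy sequences are 5 nt long rather than 81 bp, and k-mers containing characters other than A, C, G, T are simply skipped.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_index(k):
    """Map each of the 4**k k-mers (AAA, AAC, ..., TTT for k = 3) to a row index."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def position_specific_frequencies(sequences, k):
    """Return a 4**k x (L-k+1) matrix of position-specific k-mer frequencies.

    Entry [i][j] is the fraction of sequences in which the i-th k-mer
    starts at position j (0-based here, 1-based in the text).
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    index = kmer_index(k)
    n_pos = length - k + 1
    counts = [[0] * n_pos for _ in range(4 ** k)]
    for seq in sequences:
        for j in range(n_pos):
            row = index.get(seq[j:j + k])
            if row is not None:  # assumption: skip k-mers with ambiguous bases
                counts[row][j] += 1
    n = len(sequences)
    return [[c / n for c in row] for row in counts]


# Toy positive set (hypothetical data, 5 nt instead of 81 bp).
positives = ["ACGTA", "ACGTT", "TCGTA"]
F_pos = position_specific_frequencies(positives, k=3)
```

Applying the same function to the negative data set would give F⁻, and calling it with k = 2 would give the dinucleotide matrices.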

Further, the optimization in step 103 of the position-specific frequency information of the trinucleotide composition information and the dinucleotide composition information further includes:

after the position-specific frequency information of the trinucleotide composition information and the dinucleotide composition information is optimized, each sequence sample S can be represented as:

S = [φ₁, φ₂, …, φ_w, …, φ_{81-k+1}]^T

where T is the transpose operator.

For trinucleotides, φ_w is defined as the optimized matrix entry of the trinucleotide located at position w,

where w is the position of the trinucleotide in the sequence, and the entry represents the degree of difference between the positive-sample and negative-sample frequencies of the i-th of the 4^k trinucleotides (3mer_i) appearing at the w-th position in the sequence; 3mer_i represents AAA, AAC, …, TTT.

For dinucleotides, φ_w is defined as the optimized matrix entry of the dinucleotide located at position w,

where w is the position of the dinucleotide in the sequence, and the entry represents the degree of difference between the positive-sample and negative-sample frequencies of the i-th of the 4^k dinucleotides (2mer_i) appearing at the w-th position in the sequence; 2mer_i represents AA, AC, …, TT.
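The encoding of a sequence sample as [φ₁, …, φ_{81-k+1}]^T can be sketched in Python as a table lookup. This is a sketch under assumptions: `encode_sequence` and `score_matrix` are hypothetical names, the scoring matrix is taken as an input because the information content scoring formula itself is not reproduced in this text, the toy values are made up, and assigning 0.0 to k-mers containing ambiguous bases is an assumption of the sketch.

```python
from itertools import product


def kmer_index(k):
    """Map each of the 4**k k-mers to a row index (AA, AC, ..., TT for k = 2)."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def encode_sequence(seq, score_matrix, k):
    """Return [phi_1, ..., phi_(L-k+1)] for one sequence.

    phi_w is the score-matrix entry of the k-mer that starts at position w;
    score_matrix has 4**k rows and L-k+1 columns.
    """
    index = kmer_index(k)
    features = []
    for w in range(len(seq) - k + 1):
        row = index.get(seq[w:w + k])
        # Assumption: k-mers with ambiguous bases contribute 0.0.
        features.append(score_matrix[row][w] if row is not None else 0.0)
    return features


# Toy 16 x 3 score matrix for k = 2 and 4-nt sequences (made-up values).
toy_scores = [[0.0] * 3 for _ in range(16)]
toy_scores[kmer_index(2)["AC"]][0] = 0.5
toy_scores[kmer_index(2)["CG"]][1] = -0.25
phi = encode_sequence("ACGT", toy_scores, k=2)
```

For an 81 bp sample the same lookup yields 79 values with k = 3 and 80 values with k = 2.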

Advantageous Effects

The invention processes sequence frequency information with an information-theoretic method to identify DNA promoter elements and predict their types. The PSTNP algorithm extracts position-specific nucleotide information from the promoter sequence to represent the sequence, and an information content scoring matrix improves PSTNP by enlarging the difference between the discrete distributions of the positive-sample and negative-sample frequency matrices. Combining the trinucleotide and dinucleotide feature matrices captures as much useful feature information as possible. Finally, the invention constructs a two-layer prediction model based on a support vector machine: the first layer judges whether a sequence is a promoter; the second layer further predicts the specific type of each identified promoter, with good prediction performance. The prediction accuracy of the invention is higher than that of other existing models, which is significant for research on DNA promoter element identification and type prediction.

Drawings

FIG. 1 is a flow chart of the computational process of the present invention;

FIG. 2 is a performance comparison of different k-mer selections;

FIG. 3 is a performance comparison of three information theory algorithms in feature optimization;

FIG. 4 is a comparison of the performance of six feature selection algorithms;

FIG. 5 is a performance comparison of different resampling strategies in the second-layer promoter type prediction;

FIG. 6 is a performance comparison with two existing promoter prediction models.

Detailed Description

Promoters determine the initiation of DNA sequence-specific transcription and are regulatory elements essential for gene expression. Identifying and locating promoters helps locate genes accurately and provides important guidance for annotating the structural and functional information of a genome. During gene transcription, RNA polymerase requires a specific sigma protein factor to assist recognition when it binds specifically to a given promoter, so the sigma factor is commonly used to label the promoter type: σ²⁴, σ²⁸, σ³², σ³⁸, σ⁵⁴ or σ⁷⁰. Traditional biological experiments for identifying promoters and their types are time-consuming, labor-intensive and costly; by comparison, identifying and classifying promoters with bioinformatics algorithms is more economical and convenient.

The basic idea of the invention is as follows: position-specific information is extracted from the promoter sequence, the features are optimized and improved, and a two-layer prediction model based on a support vector machine is constructed. The first layer judges whether a sequence is a promoter; the second layer further predicts the specific type of each identified promoter.

The invention mainly comprises the following steps. First, a DNA promoter sequence data set is constructed. Next, the PSTNP algorithm is used to obtain the position-specific k-mer nucleotide composition information of the promoter sequences, the extracted sequence information is optimized with an information content scoring matrix, and the trinucleotide and dinucleotide feature matrices are merged to obtain more feature information. Finally, a prediction model is constructed with the support vector machine algorithm to identify promoters and their types. The flow chart of the whole calculation process is shown in FIG. 1. The double-layer prediction model obtains better prediction results than other existing models.

Step (1): obtaining an Escherichia coli (E. coli K-12) promoter sequence data set from the RegulonDB database (version 9.3) and removing redundant sequences with CD-HIT;

Step (2): obtaining the position-specific trinucleotide composition information and dinucleotide composition information of the DNA promoter sequences with the PSTNP algorithm;

Step (3): calculating an information content scoring matrix and optimizing the extracted sequence information based on information theory;

Step (4): merging the feature matrices of the trinucleotides and the dinucleotides;

Step (5): constructing a prediction model with a support vector machine and identifying DNA promoter sequences;

Step (6): resampling the data sets of the different promoter subtypes with the SMOTE algorithm to address the imbalance of the data sets;

Step (7): constructing a prediction model and identifying the different types of promoter sequences.
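Steps (4), (5) and (7) can be pictured with a short Python sketch of feature merging and two-layer dispatch. The helper names `merge_features` and `two_layer_predict` are hypothetical, and the two classifiers are stand-in callables rather than trained support vector machines; the sketch only illustrates how the two layers are chained. For an 81 bp sequence the trinucleotide encoding has 81-3+1 = 79 values and the dinucleotide encoding has 81-2+1 = 80 values, so the merged vector has 159 values.

```python
def merge_features(tri_features, di_features):
    """Concatenate trinucleotide and dinucleotide feature vectors (step 4)."""
    return list(tri_features) + list(di_features)


def two_layer_predict(features, is_promoter, promoter_type):
    """Run the first layer, then the second layer only for predicted promoters.

    is_promoter: callable returning True/False (stand-in for the first-layer model).
    promoter_type: callable returning a type label (stand-in for the second-layer model).
    """
    if not is_promoter(features):
        return "non-promoter"
    return promoter_type(features)


# Made-up feature values with the dimensions of an 81 bp sequence.
tri = [0.1] * 79   # 81 - 3 + 1 positions
di = [0.2] * 80    # 81 - 2 + 1 positions
merged = merge_features(tri, di)

# Stand-in decision rules; real models would be trained on labeled data.
label = two_layer_predict(
    merged,
    is_promoter=lambda x: sum(x) > 0,
    promoter_type=lambda x: "sigma70",
)
```

Keeping the two layers as separate callables mirrors the description: the second-layer model is only consulted for sequences that the first layer accepts as promoters.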

Further, the position-specific nucleotide composition information of the DNA promoter sequence in step (2) is generated by the following steps:

2.1 for each 81 bp sequence sample S:

S = N₁N₂…N_l…N₈₁

where N_l denotes the nucleotide at the l-th position and is one of A, C, G, T.

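The position-specific frequency information F⁺ and F⁻ over the positive and negative data sets is expressed as:

$$F^{+} = \begin{bmatrix} f^{+}_{1,1} & f^{+}_{1,2} & \cdots & f^{+}_{1,81-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{+}_{4^{k},1} & f^{+}_{4^{k},2} & \cdots & f^{+}_{4^{k},81-k+1} \end{bmatrix}$$

and

$$F^{-} = \begin{bmatrix} f^{-}_{1,1} & f^{-}_{1,2} & \cdots & f^{-}_{1,81-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{-}_{4^{k},1} & f^{-}_{4^{k},2} & \cdots & f^{-}_{4^{k},81-k+1} \end{bmatrix}$$

where $f^{+}_{i,j}$ or $f^{-}_{i,j}$ represents the frequency with which the i-th of the 4^k trinucleotides or dinucleotides appears at the j-th of the 81-k+1 positions in the sequence.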

Further, the optimization of the PSTNP algorithm with the information content scoring matrix in step (3) is as follows:

3.1 the sequence information is optimized with the information content scoring matrix, using a method based on information theory;

3.2 each sequence sample S can then be represented as:

S = [φ₁, φ₂, …, φ_w, …, φ_{81-k+1}]^T

where T is the transpose operator.

For trinucleotides, φ_w is defined as the optimized matrix entry of the trinucleotide located at position w,

where w is the position of the trinucleotide in the sequence, and the entry represents the degree of difference between the positive-sample and negative-sample frequencies of the i-th of the 4^k trinucleotides (3mer_i) appearing at the w-th position in the sequence; 3mer_i represents AAA, AAC, …, TTT.

For dinucleotides, φ_w is defined as the optimized matrix entry of the dinucleotide located at position w,

where w is the position of the dinucleotide in the sequence, and the entry represents the degree of difference between the positive-sample and negative-sample frequencies of the i-th of the 4^k dinucleotides (2mer_i) appearing at the w-th position in the sequence; 2mer_i represents AA, AC, …, TT.

Following this calculation method, 5-fold cross-validation was applied to all prediction experiments. First, when obtaining the position-specific k-mer nucleotide composition information, different choices such as 1-mer, 2-mer and 3-mer were tried; the performance comparison is shown in FIG. 2. The method combining the position-specific features of trinucleotides and dinucleotides performs best, with an overall accuracy significantly higher than the others, so this hybrid method was chosen for feature extraction. Next, three information-theoretic methods were tried for improving the PSTNP algorithm, as shown in FIG. 3. The features constructed with the information content scoring matrix give the best predictions (Acc: 90.05%), with a significant advantage over the original PSTNP method and over features processed with KL divergence or JS divergence. After the features were processed with this best scheme, different classification algorithms were used for prediction under 5-fold cross-validation, as shown in FIG. 4. The classifier constructed with SVM gives the best results on Acc (90.05%) and MCC (0.68).
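The evaluation indexes Acc and MCC and the 5-fold split can be sketched in plain Python as follows. The function names `accuracy_and_mcc` and `k_fold_indices` are hypothetical, the labels are toy data, and the fold assignment is a simple unshuffled round-robin chosen for brevity; the text does not state how the folds were drawn.

```python
import math


def accuracy_and_mcc(y_true, y_pred):
    """Compute accuracy and Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    acc = (tp + tn) / len(pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is reported as 0.0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


def k_fold_indices(n_samples, k=5):
    """Return k (train, test) index splits using round-robin assignment."""
    folds = []
    for f in range(k):
        test = [i for i in range(n_samples) if i % k == f]
        train = [i for i in range(n_samples) if i % k != f]
        folds.append((train, test))
    return folds


# Toy labels: 1 = promoter, 0 = non-promoter (hypothetical data).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
acc, mcc = accuracy_and_mcc(y_true, y_pred)
folds = k_fold_indices(10, k=5)
```

In a cross-validated experiment the model would be trained on each `train` split, evaluated on the matching `test` split, and the indexes averaged over the five folds.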

At the second layer of the classifier, the six promoter types are identified and classified. The numbers of samples in the data sets of the different promoter types are extremely unbalanced. The six subsets were therefore resampled, and three data sets were constructed: the original data set Data I, the data set Data II undersampled with CD-HIT, and the data set Data III oversampled with SMOTE. The classification results on the three data sets are compared in FIG. 5. On the original data set Data I, the classification results differ greatly between promoter types and the performance is not balanced. Each subset of Data II is small, with only 96 samples, yet this data set gave the best set of results. The SMOTE-processed data set Data III has a more reliable size, with 500 samples per subset. Its results for sensitivity, specificity, Acc and MCC are more balanced than those of the unprocessed Data I, but slightly worse than those of Data II.
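The oversampling idea behind SMOTE can be sketched in plain Python: each synthetic sample is an interpolation between a minority-class sample and one of its nearest neighbors in the same class. This is a simplified illustration, not the resampling code used to build Data III; the function name `smote`, the default `k_neighbors=5`, and the fixed random seed are assumptions of the sketch.

```python
import random


def smote(samples, n_new, k_neighbors=5, seed=0):
    """Generate n_new synthetic samples by interpolating between neighbors.

    samples: list of feature vectors (lists of floats) from one minority class.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Nearest neighbors of the base sample within the same class.
        others = [s for s in samples if s is not base]
        neighbors = sorted(others, key=lambda s: sq_dist(base, s))[:k_neighbors]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class with three 2-D samples (hypothetical data).
minority = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
new_samples = smote(minority, n_new=4, k_neighbors=2)
```

To bring a subset up to a target size such as the 500 samples per subset of Data III, `n_new` would be set to the target size minus the number of original samples.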

Finally, the performance of different classifiers on the promoter classification problem was compared under 5-fold cross-validation. The invention was compared with two other classification methods on the same data set, as shown in FIG. 6. The results show that the iPro2L-PSTKNC classifier provided by the invention performs best at the first layer, with Acc reaching 90.05% when distinguishing promoters from non-promoters. At the second layer, the model also performs best for every promoter subtype, with accuracy above 91%.

In conclusion, the invention provides an improved feature extraction algorithm based on PSTNP that effectively describes the position-specific nucleotide information of promoter sequences. An SVM algorithm is then applied to build a classification model, whose performance is evaluated with 5-fold cross-validation. In addition, promoters that have been identified are further classified by type, and a resampling algorithm is used to handle the imbalance between the data sets of the different promoter types. Compared with the most advanced existing classifiers, the prediction and classification model provided by the invention is markedly improved on evaluation indexes such as sensitivity, specificity, accuracy and MCC, providing a more effective method for promoter prediction and identification; the model is also simple to compute, easy to implement and widely usable.

Claims

1. A method for identifying a DNA promoter element based on information theory, wherein the method is based on a double-layer identification model for distinguishing different types of promoters, the double-layer identification model being composed of a promoter element identification layer and a promoter element type identification layer, wherein:

the double-layer identification model performs promoter sequence recognition through the following steps:

Step 102: extracting position-specific frequencies of trinucleotide composition information and dinucleotide composition information from the DNA promoter sequence data with the PSTNP algorithm; the position-specific frequencies include the position-specific frequency information F⁺ and F⁻ of the trinucleotides and dinucleotides on the positive and negative data sets;

Step 103: optimizing the position-specific frequency information of the trinucleotide composition information and the dinucleotide composition information with an information content scoring formula computed from F⁺ and F⁻,

wherein each entry of the optimized matrix represents the degree of difference between the positive-sample and negative-sample frequencies of one of the 4^k trinucleotides or dinucleotides appearing at one of the 81-k+1 positions in the sequence.

The promoter element type identification layer resamples the data sets of the different types of promoters with the SMOTE algorithm.

2. The method of claim 1, wherein the position-specific nucleotide composition information of the promoter sequence obtained in step 102 is generated by the following steps:

2.1 for each 81 bp sequence sample S:

S = N₁N₂…N_l…N₈₁

where N_l denotes the nucleotide at the l-th position and is one of A, C, G, T;

2.3 calculating the position-specific frequency information F⁺ and F⁻ of the trinucleotides and dinucleotides over the entire positive and negative data sets, respectively, expressed as follows:

$$F^{+} = \begin{bmatrix} f^{+}_{1,1} & f^{+}_{1,2} & \cdots & f^{+}_{1,81-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{+}_{4^{k},1} & f^{+}_{4^{k},2} & \cdots & f^{+}_{4^{k},81-k+1} \end{bmatrix}$$

and

$$F^{-} = \begin{bmatrix} f^{-}_{1,1} & f^{-}_{1,2} & \cdots & f^{-}_{1,81-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{-}_{4^{k},1} & f^{-}_{4^{k},2} & \cdots & f^{-}_{4^{k},81-k+1} \end{bmatrix}$$

where $f^{+}_{i,j}$ or $f^{-}_{i,j}$ represents the frequency with which the i-th of the 4^k trinucleotides (3mer_i) or dinucleotides (2mer_i) appears at the j-th of the 81-k+1 positions in the sequence; 3mer_i represents AAA, AAC, …, TTT, and 2mer_i represents AA, AC, …, TT.

3. The method of claim 1, wherein the optimization in step 103 of the position-specific frequency information of the trinucleotide composition information and the dinucleotide composition information further comprises:

after the position-specific frequency information of the trinucleotide composition information and the dinucleotide composition information is optimized, each sequence sample S can be represented as:

S = [φ₁, φ₂, …, φ_w, …, φ_{81-k+1}]^T

where T is the transpose operator.

For trinucleotides, φ_w is defined as the optimized matrix entry of the trinucleotide located at position w,

where w is the position of the trinucleotide in the sequence, and the entry represents the degree of difference between the positive-sample and negative-sample frequencies of the i-th of the 4^k trinucleotides (3mer_i) appearing at the w-th position in the sequence; 3mer_i represents AAA, AAC, …, TTT;

for dinucleotides, φ_w is defined as the optimized matrix entry of the dinucleotide located at position w,

where w is the position of the dinucleotide in the sequence, and the entry represents the degree of difference between the positive-sample and negative-sample frequencies of the i-th of the 4^k dinucleotides (2mer_i) appearing at the w-th position in the sequence; 2mer_i represents AA, AC, …, TT.