CN112365925A

CN112365925A - Bidirectional dinucleotide position specific preference and mutual information DNA/RNA sequence coding method

Info

Publication number: CN112365925A
Application number: CN202011236138.3A
Authority: CN
Inventors: 王明钊; 谢娟英; 许升全
Original assignee: Shaanxi Normal University
Current assignee: Shaanxi Normal University
Priority date: 2020-11-09
Filing date: 2020-11-09
Publication date: 2021-02-12

Abstract

A DNA/RNA sequence coding method based on bidirectional dinucleotide position-specific preference and point mutual information comprises the steps of constructing a nucleotide position-specific preference matrix of the DNA/RNA sequences, constructing a bidirectional dinucleotide position-specific preference matrix of the DNA/RNA sequences, determining the point mutual information values of the DNA/RNA sequence nucleotides, combining features, and encoding the DNA/RNA sequences. In order to extract more dinucleotide position information from DNA/RNA sequence data, a parameter α is introduced to express the distance between the two nucleotides of a dinucleotide, and the numerical feature vectors obtained for different values of α are combined into a global high-dimensional numerical feature vector, which performs very well in recognizing 4mC methylation sites of DNA and m⁶A methylation sites of RNA. The DNA/RNA numerical feature data obtained by the invention contain more classification information, have low redundancy among features, and yield trained models with high recognition accuracy; the method can be used for encoding DNA/RNA sequences.

Description

Bidirectional dinucleotide position specific preference and mutual information DNA/RNA sequence coding method

Technical Field

The invention belongs to the technical field of sequence data analysis, and particularly relates to a DNA/RNA sequence coding method.

Background

Machine learning techniques play a key role in the DNA/RNA methylation site recognition problem in the post-genomic era. In general, solving this problem with machine learning comprises 3 main steps: sequence data encoding, model construction, and performance assessment. Sequence data encoding is the most important of these: how to extract numerical features containing more classification information from DNA/RNA sequence data is the key to accurately identifying DNA/RNA methylation sites.

At present, existing coding methods suffer from low feature dimensionality and an inability to extract key identification information from DNA/RNA sequence data, so the accuracy of the resulting prediction and recognition models is low. Encoding the DNA/RNA sequence data with several coding methods and combining the encoded numerical features into high-dimensional features can overcome the shortcomings of a single coding method, but it causes high redundancy among the numerical features and wastes computing resources, and the improvement in the accuracy of a DNA/RNA methylation site recognition model is very limited. Therefore, developing a DNA/RNA sequence data encoding method that effectively encodes DNA/RNA sequences into high-dimensional numerical features containing more classification information, with low redundancy between features, is the key to solving the problem and a research hotspot in this field.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a bidirectional dinucleotide position specificity preference and mutual information DNA/RNA sequence coding method which has the advantages of more classified identification information, low redundancy among characteristics and high identification and prediction accuracy of an established model.

The technical scheme adopted for solving the technical problems comprises the following steps:

(1) construction of DNA/RNA sequence nucleotide position specificity preference matrix

Given a DNA/RNA sequence dataset D, the dataset consists of a positive class dataset and a negative class dataset.

A nucleotide position-specific preference matrix is determined for the positive class dataset, where A, C, G, X are the 4 nucleotides of DNA/RNA (X denotes nucleotide T in a DNA sequence dataset and nucleotide U in an RNA sequence dataset), i is the position of the nucleotide, 1 ≤ i ≤ l, i is a finite positive integer, l is the nucleotide length of a DNA/RNA sequence sample, and l is odd. The entries of the matrix are the frequencies of occurrence of nucleotides A, C, G, X at the i-th position across all sequence samples in the positive class dataset.

A nucleotide position-specific preference matrix is determined in the same way for the negative class dataset; its entries are the frequencies of occurrence of nucleotides A, C, G, X at the i-th position across all sequence samples in the negative class dataset.
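As an illustration of step (1), the position-specific frequencies can be tabulated as in the minimal Python sketch below. The function name `nucleotide_psp_matrix`, the dictionary layout and the toy sequences are assumptions made for illustration only; the patent defines the matrix by a formula and does not prescribe an implementation.

```python
from collections import Counter


def nucleotide_psp_matrix(sequences, alphabet="ACGT"):
    """Return {nucleotide: [frequency at position 1..l]} for equal-length sequences."""
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = {nt: [0.0] * length for nt in alphabet}
    for i in range(length):
        # Count which nucleotide each sample carries at 0-based index i.
        counts = Counter(seq[i] for seq in sequences)
        for nt in alphabet:
            matrix[nt][i] = counts[nt] / len(sequences)
    return matrix


# Hypothetical toy "positive class" dataset with l = 5.
positive = ["ACGTA", "ACCTA", "TCGTA"]
psp_pos = nucleotide_psp_matrix(positive)
print(psp_pos["A"])
```

For an RNA dataset the same sketch would be called with `alphabet="ACGU"`, and the negative class matrix is obtained by passing the negative class sequences instead.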

(2) Construction of DNA/RNA sequence bidirectional dinucleotide position specificity preference matrix

A forward dinucleotide position-specific preference matrix is determined for the positive class dataset, where AA, AC, …, XX are the 16 dinucleotides formed from the 4 nucleotides A, C, G, X of DNA/RNA; α represents the distance between the two nucleotides, 0 ≤ α ≤ (l-3)/2, and α is a finite non-negative integer; j is the position of the nucleotide, α+2 ≤ j ≤ l-α-1, and j is a finite positive integer. The entries of the matrix are the frequencies of occurrence of the dinucleotides AA, AC, …, XX at the j-th and (j+α+1)-th positions across all sequence samples in the positive class dataset.

A backward dinucleotide position-specific preference matrix is determined for the positive class dataset; its entries are the frequencies of occurrence of the dinucleotides AA, AC, …, XX at the j-th and (j-α-1)-th positions across all sequence samples in the positive class dataset.

A forward dinucleotide position-specific preference matrix is determined for the negative class dataset; its entries are the frequencies of occurrence of the dinucleotides AA, AC, …, XX at the j-th and (j+α+1)-th positions across all sequence samples in the negative class dataset.

A backward dinucleotide position-specific preference matrix is determined for the negative class dataset; its entries are the frequencies of occurrence of the dinucleotides AA, AC, …, XX at the j-th and (j-α-1)-th positions across all sequence samples in the negative class dataset.
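The forward and backward matrices of step (2) differ only in which partner position is paired with position j. The sketch below shows one way to tabulate them in Python; the function name `dinucleotide_psp`, the `direction` argument and the toy sequences are hypothetical, and the dinucleotide is written with the nucleotide at position j first, which is an assumption about ordering.

```python
from collections import Counter
from itertools import product


def dinucleotide_psp(sequences, alpha, direction="forward", alphabet="ACGT"):
    """Map each 1-based position j in [alpha+2, l-alpha-1] to {dinucleotide: frequency}."""
    length = len(sequences[0])
    # Forward pairs j with j+alpha+1; backward pairs j with j-alpha-1.
    step = alpha + 1 if direction == "forward" else -(alpha + 1)
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    result = {}
    for j in range(alpha + 2, length - alpha):  # j = alpha+2 .. l-alpha-1 inclusive
        counts = Counter(seq[j - 1] + seq[j - 1 + step] for seq in sequences)
        result[j] = {pair: counts[pair] / len(sequences) for pair in pairs}
    return result


# Hypothetical toy dataset with l = 5.
seqs = ["ACGTA", "ACCTA", "TCGTA"]
fwd0 = dinucleotide_psp(seqs, alpha=0, direction="forward")
bwd0 = dinucleotide_psp(seqs, alpha=0, direction="backward")
fwd1 = dinucleotide_psp(seqs, alpha=1, direction="forward")
print(sorted(fwd0), sorted(fwd1))
```

Restricting j to α+2 ≤ j ≤ l-α-1 guarantees that both the forward partner j+α+1 and the backward partner j-α-1 fall inside the sequence, so the same positions are covered in both directions.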

(3) Determination of mutual information values of nucleotides of DNA/RNA sequences

(3.1) The forward point mutual information value of a nucleotide of the DNA/RNA sequence to be encoded is determined in the positive class dataset, where x is the nucleotide at the j-th position, x ∈ {A, C, G, X}, and z is the nucleotide at the (j+α+1)-th position, z ∈ {A, C, G, X}. The value is computed from three quantities: the frequency of occurrence of the dinucleotide xz at the j-th and (j+α+1)-th positions, the frequency of occurrence of nucleotide x at the j-th position, and the frequency of occurrence of nucleotide z at the (j+α+1)-th position, each taken over all sequence samples of the positive class dataset.

The backward point mutual information value of a nucleotide of the DNA/RNA sequence to be encoded is determined in the positive class dataset, where y is the nucleotide at the (j-α-1)-th position, y ∈ {A, C, G, X}. The value is computed from the frequency of occurrence of the dinucleotide xy at the j-th and (j-α-1)-th positions, the frequency of occurrence of nucleotide x at the j-th position, and the frequency of occurrence of nucleotide y at the (j-α-1)-th position, each taken over all sequence samples of the positive class dataset.

The point mutual information coding value of the nucleotide at the j-th position of a DNA/RNA sequence sample to be encoded in the positive class dataset is defined from the forward point mutual information value and the backward point mutual information value.

A DNA/RNA sequence sample of length l is thus encoded as a point mutual information feature vector V⁺ of length l-2α-2.

(3.2) The forward point mutual information value of a nucleotide of the DNA/RNA sequence to be encoded is determined in the negative class dataset. The value is computed from the frequency of occurrence of the dinucleotide xz at the j-th and (j+α+1)-th positions, the frequency of occurrence of nucleotide x at the j-th position, and the frequency of occurrence of nucleotide z at the (j+α+1)-th position, each taken over all sequence samples in the negative class dataset.

The backward point mutual information value of a nucleotide of the DNA/RNA sequence to be encoded is determined in the negative class dataset. The value is computed from the frequency of occurrence of the dinucleotide xy at the j-th and (j-α-1)-th positions, the frequency of occurrence of nucleotide x at the j-th position, and the frequency of occurrence of nucleotide y at the (j-α-1)-th position, each taken over all sequence samples in the negative class dataset.

The point mutual information coding value of the nucleotide at the j-th position of a DNA/RNA sequence sample to be encoded in the negative class dataset is defined from the forward point mutual information value and the backward point mutual information value.

A DNA/RNA sequence sample of length l is thus encoded as a point mutual information feature vector V⁻ of length l-2α-2.

(3.3) For a given DNA/RNA sequence sample of length l to be encoded, the feature vector V is determined by subtracting the corresponding elements of the vectors V⁺ and V⁻:

V = [V_{α+2}, V_{α+3}, …, V_{l-α-1}]
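Step (3) can be sketched in Python as follows. Several details are assumptions rather than statements of the patent, whose formulas are given separately: the point mutual information is taken as log2 of the pair frequency divided by the product of the two single-nucleotide frequencies, the forward and backward values are summed, a dinucleotide never observed in the reference dataset contributes 0, and V is taken as V⁺ minus V⁻. The names `frequencies`, `pmi_vector` and `encode` are hypothetical.

```python
import math
from collections import Counter


def frequencies(items):
    """Relative frequency of each distinct item."""
    counts = Counter(items)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def pmi_vector(seq, dataset, alpha):
    """Forward plus backward PMI of seq at positions j = alpha+2 .. l-alpha-1 (1-based)."""
    gap = alpha + 1
    values = []
    for j in range(alpha + 2, len(seq) - alpha):
        c = j - 1  # 0-based index of position j
        p_x = frequencies(s[c] for s in dataset).get(seq[c], 0.0)
        total = 0.0
        for other in (c + gap, c - gap):  # forward partner, then backward partner
            pair = (seq[c], seq[other])
            p_pair = frequencies((s[c], s[other]) for s in dataset).get(pair, 0.0)
            p_other = frequencies(s[other] for s in dataset).get(seq[other], 0.0)
            if p_pair > 0.0:  # assumption: unseen pairs contribute 0
                total += math.log2(p_pair / (p_x * p_other))
        values.append(total)
    return values


def encode(seq, positives, negatives, alpha):
    """Element-wise difference V+ - V- (the order of subtraction is an assumption)."""
    v_pos = pmi_vector(seq, positives, alpha)
    v_neg = pmi_vector(seq, negatives, alpha)
    return [a - b for a, b in zip(v_pos, v_neg)]


# Hypothetical toy datasets with l = 5.
positives = ["ACGTA", "ACCTA", "TCGTA"]
negatives = ["AAGTA", "ACGAA", "TCGTT"]
v0 = encode("ACGTA", positives, negatives, alpha=0)
v1 = encode("ACGTA", positives, negatives, alpha=1)
print(len(v0), len(v1))
```

With l = 5 the vector has l-2α-2 elements, that is 3 for α = 0 and 1 for α = 1, matching the length stated for V⁺ and V⁻.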

(4) feature combination

When the parameter α takes the value 0, the feature vector is V(0) = [V_2, V_3, V_4, …, V_{l-2}, V_{l-1}], with l-2 elements; when α is 1, the feature vector is V(1) = [V_3, V_4, V_5, …, V_{l-3}, V_{l-2}], with l-4 elements; …; when α is (l-5)/2, the feature vector is V((l-5)/2) = [V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}], with 3 elements; when α is (l-3)/2, the feature vector is V((l-3)/2) = [V_{(l+1)/2}], with 1 element. The feature vectors determined by the different values of the parameter α are combined into a high-dimensional feature vector [V(0), V(1), …, V((l-5)/2), V((l-3)/2)] with (l-1)²/4 elements.
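The element count (l-1)²/4 follows because the per-α lengths l-2α-2 for α = 0, 1, …, (l-3)/2 are the odd numbers l-2, l-4, …, 3, 1, and the sum of the first (l-1)/2 odd numbers is ((l-1)/2)². The short Python check below confirms this for the two sequence lengths used in the examples; the function name `feature_lengths` is hypothetical.

```python
def feature_lengths(l):
    """Length of V(alpha) for alpha = 0 .. (l-3)/2, given an odd sequence length l."""
    if l < 3 or l % 2 == 0:
        raise ValueError("l must be an odd integer >= 3")
    return [l - 2 * alpha - 2 for alpha in range((l - 3) // 2 + 1)]


for l in (41, 51):
    lengths = feature_lengths(l)
    # The combined vector has (l-1)^2 / 4 elements.
    assert sum(lengths) == (l - 1) ** 2 // 4
    print(l, len(lengths), sum(lengths))
```

For l = 41 this gives 20 values of α and 400 features in total; for l = 51 it gives 25 values of α and 625 features.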

(5) DNA/RNA sequence coding

Using steps (1) to (4) above, the DNA/RNA sequence dataset D is encoded as a numerical dataset D', where s is the number of samples in the numerical dataset D', s is a finite positive integer, and (l-1)²/4 is the number of features of the numerical dataset D'.

In order to encode DNA/RNA sequence data into numerical data that contain more classification information and have low redundancy among features, the invention provides a bidirectional dinucleotide position-specific preference method, in which DNA/RNA sequence samples are encoded into numerical feature samples by a point mutual information method based on the nucleotide position-specific preference matrices and the bidirectional dinucleotide position-specific preference matrices of the positive and negative class DNA/RNA sequence data. In order to extract more dinucleotide position information from a DNA/RNA sequence sample, a parameter α is introduced to express the dinucleotide spacing when constructing the bidirectional dinucleotide position-specific preference matrices, and the numerical feature vectors determined by different values of α are combined into a global high-dimensional numerical feature vector that contains more classification information and has low redundancy among features. The DNA/RNA sequence coding method was compared in simulation experiments with 7 existing coding methods. The experimental results show that, for recognition and prediction of N⁴-methylcytosine (4mC) sites of Caenorhabditis elegans DNA, the support vector machine model established with the coding method of the invention reaches an accuracy, sensitivity, specificity, MCC, AUROC and AUPRC of 0.905, 0.902, 0.909, 0.811, 0.967 and 0.966 respectively, all higher than those of the 7 comparative coding methods; for recognition and prediction of N⁶-methyladenosine (m⁶A) sites of yeast RNA, the support vector machine model established with the coding method of the invention reaches an accuracy, sensitivity, specificity, MCC, AUROC and AUPRC of 0.905, 0.916, 0.894, 0.810, 0.968 and 0.967 respectively, all higher than those of the 7 comparative coding methods.

Drawings

FIG. 1 is a flow chart of the method of example 1 of the present invention.

FIG. 2 shows the AUROC curves of the support vector machine models established based on the present invention and the 7 comparative encoding methods, respectively, for recognition and prediction of N⁴-methylcytosine sites of Caenorhabditis elegans DNA.

FIG. 3 shows the AUPRC curves of the support vector machine models established based on the present invention and the 7 comparative encoding methods, respectively, for recognition and prediction of N⁴-methylcytosine sites of Caenorhabditis elegans DNA.

FIG. 4 shows the AUROC curves of the support vector machine models established based on the present invention and the 7 comparative encoding methods, respectively, for recognition and prediction of N⁶-methyladenosine sites of yeast RNA.

FIG. 5 shows the AUPRC curves of the support vector machine models established based on the present invention and the 7 comparative encoding methods, respectively, for recognition and prediction of N⁶-methyladenosine sites of yeast RNA.

Detailed Description

The present invention will be described in further detail below with reference to the drawings and examples, but the present invention is not limited to the embodiments described below.

Example 1

Taking as an example the N⁴-methylcytosine (4mC) dataset of C. elegans DNA from the document "iDNA4mC: identifying DNA N⁴-methylcytosine sites based on nucleotide chemical properties", the dataset has 3108 DNA sequence samples, of which the positive class dataset (true N⁴-methylcytosine samples) contains 1554 samples and the negative class dataset (non-N⁴-methylcytosine samples) contains 1554 samples, and the length l of each sequence sample is 41. The bidirectional dinucleotide position-specific preference and mutual information DNA sequence coding method of this example consists of the following steps (see FIG. 1):

(1) construction of DNA sequence nucleotide position specificity preference matrix

Given a DNA sequence dataset D, the dataset consists of a positive class dataset and a negative class dataset.

A nucleotide position-specific preference matrix is determined for the positive class dataset, where A, C, G, T are the 4 nucleotides of DNA, i is the position of the nucleotide, 1 ≤ i ≤ l, i is a finite positive integer, l is the nucleotide length of a DNA sequence sample, and l is odd; in this embodiment l is 41. The entries of the matrix are the frequencies of occurrence of nucleotides A, C, G, T at the i-th position across all sequence samples in the positive class dataset.

A nucleotide position-specific preference matrix is determined in the same way for the negative class dataset; its entries are the frequencies of occurrence of nucleotides A, C, G, T at the i-th position across all sequence samples in the negative class dataset.

(2) Construction of a DNA sequence bidirectional dinucleotide position-specific preference matrix

A forward dinucleotide position-specific preference matrix is determined for the positive class dataset, where AA, AC, …, TT are the 16 dinucleotides formed from the 4 nucleotides A, C, G, T of DNA; α represents the distance between the two nucleotides, 0 ≤ α ≤ (l-3)/2, and α is a finite non-negative integer, with 0 ≤ α ≤ 19 in this embodiment; j is the position of the nucleotide, α+2 ≤ j ≤ l-α-1, and j is a finite positive integer, with α+2 ≤ j ≤ 40-α in this embodiment. The entries of the matrix are the frequencies of occurrence of the dinucleotides AA, AC, …, TT at the j-th and (j+α+1)-th positions across all sequence samples in the positive class dataset.

A backward dinucleotide position-specific preference matrix is determined for the positive class dataset; its entries are the frequencies of occurrence of the dinucleotides AA, AC, …, TT at the j-th and (j-α-1)-th positions across all sequence samples in the positive class dataset.

A forward dinucleotide position-specific preference matrix is determined for the negative class dataset; its entries are the frequencies of occurrence of the dinucleotides AA, AC, …, TT at the j-th and (j+α+1)-th positions across all sequence samples in the negative class dataset.

A backward dinucleotide position-specific preference matrix is determined for the negative class dataset; its entries are the frequencies of occurrence of the dinucleotides AA, AC, …, TT at the j-th and (j-α-1)-th positions across all sequence samples in the negative class dataset.

(3) Determination of mutual information values of nucleotides of DNA sequences

(3.1) The forward point mutual information value of a nucleotide of the DNA sequence to be encoded is determined in the positive class dataset, where x is the nucleotide at the j-th position, x ∈ {A, C, G, T}, and z is the nucleotide at the (j+α+1)-th position, z ∈ {A, C, G, T}.

The backward point mutual information value of a nucleotide of the DNA sequence to be encoded is determined in the positive class dataset, where y is the nucleotide at the (j-α-1)-th position, y ∈ {A, C, G, T}.

The point mutual information coding value of the nucleotide at the j-th position of a DNA sequence sample to be encoded in the positive class dataset is defined from the forward point mutual information value and the backward point mutual information value.

A DNA sequence sample of length l is thus encoded as a point mutual information feature vector V⁺ of length l-2α-2; in this example l is 41.

(3.2) The forward point mutual information value of a nucleotide of the DNA sequence to be encoded is determined in the negative class dataset.

The backward point mutual information value of a nucleotide of the DNA sequence to be encoded is determined in the negative class dataset.

The point mutual information coding value of the nucleotide at the j-th position of a DNA sequence sample to be encoded in the negative class dataset is defined from the forward point mutual information value and the backward point mutual information value.

A DNA sequence sample of length l is thus encoded as a point mutual information feature vector V⁻ of length l-2α-2; in this example l is 41.

(3.3) For a given DNA sequence sample of length l to be encoded, the feature vector V is determined by subtracting the corresponding elements of the vectors V⁺ and V⁻:

V = [V_{α+2}, V_{α+3}, …, V_{l-α-1}]

(4) feature combination

When the parameter α takes the value 0, the feature vector is V(0) = [V_2, V_3, V_4, …, V_{l-2}, V_{l-1}], with l-2 elements; when α is 1, the feature vector is V(1) = [V_3, V_4, V_5, …, V_{l-3}, V_{l-2}], with l-4 elements; …; when α is (l-5)/2, the feature vector is V((l-5)/2) = [V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}], with 3 elements; when α is (l-3)/2, the feature vector is V((l-3)/2) = [V_{(l+1)/2}], with 1 element. The feature vectors determined by the different values of the parameter α are combined into a high-dimensional feature vector [V(0), V(1), …, V((l-5)/2), V((l-3)/2)] with (l-1)²/4 elements. In this embodiment l is 41.

(5) DNA sequence coding

Using steps (1) to (4) above, the DNA sequence dataset D is encoded as a numerical dataset D', where s is the number of samples in the numerical dataset D', s is a finite positive integer (3108 in this embodiment), and (l-1)²/4 is the number of features of the numerical dataset D'. This completes the DNA sequence coding.

To verify the beneficial effects of the present invention, the inventors applied the DNA sequence encoding method of Example 1 and the PSNP (position-specific nucleotide propensity), PSDP (position-specific dinucleotide propensity), KNF (K-nucleotide frequencies), KSNPF (K-spaced nucleotide pair frequencies), NPPS (nucleotide pair position specificity), PBE (positional binary encoding) and NCPNC (nucleotide chemical property and nucleotide composition) encoding methods to recognition and prediction of N⁴-methylcytosine sites of C. elegans DNA, and compared the performance of the support vector machine models established with each coding method. The average classification accuracy (Accuracy), sensitivity (Sensitivity), specificity (Specificity), MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) under 10-fold cross-validation were used for evaluation. The experimental method is as follows:

1. The sequence samples of the N⁴-methylcytosine dataset of C. elegans DNA are encoded according to Example 1.

2. Data set normalization

The numerical dataset D' is normalized by the min-max normalization method, where g_{m,n} is the n-th feature value of the m-th sample of the numerical dataset D', g'_{m,n} is the normalized value of g_{m,n}, max(g_n) and min(g_n) are the maximum and minimum feature values in the n-th column of the numerical dataset D', 1 ≤ m ≤ s, 1 ≤ n ≤ (l-1)²/4, and m and n are finite positive integers; in this embodiment l is 41 and s is 3108.
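A common form of min-max normalization maps each column to the interval [0, 1] via g' = (g - min) / (max - min). The Python sketch below assumes that form; the target interval, the handling of constant columns (set to 0.0 here) and the function name `min_max_normalize` are assumptions, not details confirmed by the patent.

```python
def min_max_normalize(rows):
    """Scale each column of a list-of-lists to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Hypothetical 3-sample, 3-feature dataset; the last column is constant.
data = [[1.0, 10.0, 5.0], [3.0, 20.0, 5.0], [2.0, 40.0, 5.0]]
normalized = min_max_normalize(data)
print(normalized)
```

In a cross-validation setting the column minima and maxima could alternatively be computed on the training folds only; the patent text describes normalizing the whole dataset D' before partitioning.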

3. Partitioning a data set

The normalized numerical dataset D' is divided into 10 parts by the K-fold cross-validation method (K = 10). Each part is taken in turn as the test set D'_Te, with the remaining 9 parts as the training set D'_Tr, for 10 runs in total; in each run the ratio of the training set D'_Tr to the test set D'_Te is 9:1.
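The partitioning can be sketched in plain Python as below. The shuffling with a fixed seed and the absence of class stratification are assumptions, since the patent does not say how samples are assigned to folds; `k_fold_indices` is a hypothetical helper name.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: random assignment to folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# s = 3108 samples as in Example 1.
splits = list(k_fold_indices(3108, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Because 3108 is not divisible by 10, the folds differ in size by at most one sample, so the 9:1 ratio holds only approximately.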

4. Training models and tests

The support vector machine model is trained with the training set D'_Tr, and its performance is tested with the test set D'_Te.

The 7 comparative methods were run in the same way according to steps 2-4 of the experimental method for recognition and prediction of N⁴-methylcytosine sites of Caenorhabditis elegans DNA. The experimental results for average classification accuracy (Accuracy), sensitivity (Sensitivity), specificity (Specificity) and MCC are shown in Table 1, those for AUROC in FIG. 2, and those for AUPRC in FIG. 3.

Table 1 experimental results comparing example 1 method with 7 methods

As can be seen from Table 1, the support vector machine model established based on the DNA sequence coding method of the present invention reaches an accuracy, sensitivity, specificity and MCC of 0.905, 0.902, 0.909 and 0.811 respectively for recognition and prediction of N⁴-methylcytosine sites of Caenorhabditis elegans DNA, all higher than those of the 7 comparative encoding methods.

As can be seen from FIG. 2, the support vector machine model established based on the DNA sequence coding method of the present invention reaches an AUROC value of 0.967 for recognition and prediction of N⁴-methylcytosine sites of Caenorhabditis elegans DNA, higher than those of the 7 comparative coding methods.

As can be seen from FIG. 3, the support vector machine model established based on the DNA sequence coding method of the present invention reaches an AUPRC value of 0.966 for recognition and prediction of N⁴-methylcytosine sites of Caenorhabditis elegans DNA, higher than those of the 7 comparative coding methods.
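The threshold-based metrics named in the evaluation (accuracy, sensitivity, specificity and MCC) have standard definitions in terms of the confusion counts of a binary classifier. The Python sketch below uses those standard definitions; the function name `binary_metrics` and the confusion counts in the usage line are hypothetical and are not taken from the patent's experiments.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical confusion counts for one test fold.
metrics = binary_metrics(tp=90, fp=10, tn=90, fn=10)
print(metrics)
```

Under 10-fold cross-validation these values would be computed per fold and then averaged, which is how the text describes the reported figures.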

Example 2

Taking as an example the N⁶-methyladenosine (m⁶A) dataset of yeast RNA from the document "Benchmark data for identifying N⁶-methyladenosine sites in the Saccharomyces cerevisiae genome", the dataset has 2614 RNA sequence samples, of which the positive class dataset (true N⁶-methyladenosine samples) contains 1307 samples and the negative class dataset (non-N⁶-methyladenosine samples) contains 1307 samples, and the length l of each sequence sample is 51. The bidirectional dinucleotide position-specific preference and mutual information RNA sequence coding method of this example consists of the following steps (see FIG. 1):

(1) Construction of RNA sequence nucleotide position specificity preference matrix

Given an RNA sequence dataset D, the dataset consists of a positive class dataset and a negative class dataset.

A nucleotide position-specific preference matrix is determined for the positive class dataset, where A, C, G, U are the 4 nucleotides of RNA, i is the position of the nucleotide, 1 ≤ i ≤ l, i is a finite positive integer, l is the nucleotide length of an RNA sequence sample, and l is odd; in this embodiment l is 51. The entries of the matrix are the frequencies of occurrence of nucleotides A, C, G, U at the i-th position across all sequence samples in the positive class dataset.

A nucleotide position-specific preference matrix is determined in the same way for the negative class dataset; its entries are the frequencies of occurrence of nucleotides A, C, G, U at the i-th position across all sequence samples in the negative class dataset.

(2) Construction of RNA sequence bidirectional dinucleotide position specificity preference matrix

A forward dinucleotide position-specific preference matrix is determined for the positive class dataset, where AA, AC, …, UU are the 16 dinucleotides formed from the 4 nucleotides A, C, G, U of RNA; α represents the distance between the two nucleotides, 0 ≤ α ≤ (l-3)/2, and α is a finite non-negative integer, with 0 ≤ α ≤ 24 in this embodiment; j is the position of the nucleotide, α+2 ≤ j ≤ l-α-1, and j is a finite positive integer, with α+2 ≤ j ≤ 50-α in this embodiment. The entries of the matrix are the frequencies of occurrence of the dinucleotides AA, AC, …, UU at the j-th and (j+α+1)-th positions across all sequence samples in the positive class dataset.

A backward dinucleotide position-specific preference matrix is determined for the positive class dataset; its entries are the frequencies of occurrence of the dinucleotides AA, AC, …, UU at the j-th and (j-α-1)-th positions across all sequence samples in the positive class dataset.

A forward dinucleotide position-specific preference matrix is determined for the negative class dataset; its entries are the frequencies of occurrence of the dinucleotides AA, AC, …, UU at the j-th and (j+α+1)-th positions across all sequence samples in the negative class dataset.

A backward dinucleotide position-specific preference matrix is determined for the negative class dataset; its entries are the frequencies of occurrence of the dinucleotides AA, AC, …, UU at the j-th and (j-α-1)-th positions across all sequence samples in the negative class dataset.

(3) Determination of mutual information values of nucleotides of RNA sequences

(3.1) The forward point mutual information value of a nucleotide of the RNA sequence to be encoded is determined in the positive class dataset, where x is the nucleotide at the j-th position, x ∈ {A, C, G, U}, and z is the nucleotide at the (j+α+1)-th position, z ∈ {A, C, G, U}. The value is computed from the frequency of occurrence of the dinucleotide xz at the j-th and (j+α+1)-th positions, the frequency of occurrence of nucleotide x at the j-th position, and the frequency of occurrence of nucleotide z at the (j+α+1)-th position, each taken over all sequence samples in the positive class dataset.

The backward point mutual information value of a nucleotide of the RNA sequence to be encoded is determined in the positive class dataset, where y is the nucleotide at the (j-α-1)-th position, y ∈ {A, C, G, U}.

The point mutual information coding value of the nucleotide at the j-th position of an RNA sequence sample to be encoded in the positive class dataset is defined from the forward point mutual information value and the backward point mutual information value.

An RNA sequence sample of length l is thus encoded as a point mutual information feature vector V⁺ of length l-2α-2; in this example l is 51.

(3.2) The forward point mutual information value of a nucleotide of the RNA sequence to be encoded is determined in the negative class dataset.

The backward point mutual information value of a nucleotide of the RNA sequence to be encoded is determined in the negative class dataset; one of the quantities it is computed from is the frequency of occurrence of the dinucleotide xy at the j-th and (j-α-1)-th positions across all sequence samples in the negative class dataset.

The point mutual information coding value of the nucleotide at the j-th position of an RNA sequence sample to be encoded in the negative class dataset is defined from the forward point mutual information value and the backward point mutual information value.

An RNA sequence sample of length l is thus encoded as a point mutual information feature vector V⁻ of length l-2α-2; in this example l is 51.

(3.3) For a given RNA sequence sample of length l to be encoded, the feature vector V is determined by subtracting the corresponding elements of the vectors V⁺ and V⁻:

V = [V_{α+2}, V_{α+3}, …, V_{l-α-1}]

(4) feature combination

When the parameter α takes the value 0, the feature vector is V(0) = [V_2, V_3, V_4, …, V_{l-2}, V_{l-1}], with l-2 elements; when α is 1, the feature vector is V(1) = [V_3, V_4, V_5, …, V_{l-3}, V_{l-2}], with l-4 elements; …; when α is (l-5)/2, the feature vector is V((l-5)/2) = [V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}], with 3 elements; when α is (l-3)/2, the feature vector is V((l-3)/2) = [V_{(l+1)/2}], with 1 element. The feature vectors determined by the different values of the parameter α are combined into a high-dimensional feature vector [V(0), V(1), …, V((l-5)/2), V((l-3)/2)] with (l-1)²/4 elements. In this embodiment l is 51.

(5) RNA sequence coding

Using steps (1) to (4) above, the RNA sequence dataset D is encoded as a numerical dataset D', where s is the number of samples in the numerical dataset D', s is a finite positive integer (2614 in this embodiment), and (l-1)²/4 is the number of features of the numerical dataset D'. This completes the RNA sequence coding.

To verify the beneficial effects of the present invention, the inventors applied the RNA sequence encoding method of Example 2 and the PSNP (position-specific nucleotide propensity), PSDP (position-specific dinucleotide propensity), KNF (K-nucleotide frequencies), KSNPF (K-spaced nucleotide pair frequencies), NPPS (nucleotide pair position specificity), PBE (positional binary encoding) and NCPNC (nucleotide chemical property and nucleotide composition) encoding methods to recognition and prediction of N⁶-methyladenosine sites of yeast RNA, and compared the performance of the support vector machine models established with each coding method. The average classification accuracy (Accuracy), sensitivity (Sensitivity), specificity (Specificity), MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) under 10-fold cross-validation were used for evaluation. The experimental method is as follows:

1. The sequence samples of the yeast RNA N⁶-methyladenosine data set are encoded according to Example 2.

2. Data set normalization

The numerical data set D' is normalized by the min-max method:

g'_{m,n} = (g_{m,n} - min(g_n)) / (max(g_n) - min(g_n))

where g_{m,n} is the n-th feature value of the m-th sample of the numerical data set D', g'_{m,n} is the normalized value of g_{m,n}, max(g_n) and min(g_n) are the maximum and minimum feature values in the n-th column of D', 1 ≤ m ≤ s, 1 ≤ n ≤ (l-1)²/4, and m and n are finite positive integers. In this embodiment l is 51 and s is 2614.
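As an illustration of the normalization step, the following is a minimal sketch of column-wise min-max scaling, assuming the usual form in which each value has its column minimum subtracted and is divided by the column range. The function name `min_max_normalize` and the toy data are hypothetical, and the handling of a constant column is an assumption, since the source does not address that case.

```python
def min_max_normalize(data):
    """Scale every column of a list of equal-length rows to the range [0, 1]."""
    n_cols = len(data[0])
    col_min = [min(row[n] for row in data) for n in range(n_cols)]
    col_max = [max(row[n] for row in data) for n in range(n_cols)]

    normalized = []
    for row in data:
        new_row = []
        for n, g in enumerate(row):
            span = col_max[n] - col_min[n]
            # Assumption: a constant column (span == 0) is mapped to 0.0.
            new_row.append((g - col_min[n]) / span if span else 0.0)
        normalized.append(new_row)
    return normalized


# Toy data set: 3 samples (rows) by 3 features (columns).
toy = [[1.0, 10.0, 5.0], [3.0, 20.0, 5.0], [2.0, 40.0, 5.0]]
scaled = min_max_normalize(toy)
print(scaled)
```

Scaling each feature column independently keeps features with large raw ranges from dominating a distance-based or kernel-based classifier such as a support vector machine.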

3. Partitioning a data set
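The evaluation uses 10-fold cross-validation. The sketch below shows one way such a partition could be produced; the shuffling seed, the function names `k_fold_indices` and `train_test_splits`, and the absence of class stratification are all assumptions, not details taken from the source.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::k] for f in range(k)]


def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for f, test in enumerate(folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test


# s = 2614 samples, as in this embodiment.
folds = k_fold_indices(2614, k=10)
splits = list(train_test_splits(folds))
print([len(fold) for fold in folds])
```

Each sample appears in exactly one test fold, so averaging a metric over the 10 splits uses every sample once for testing.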

4. Model training and testing

Steps 2 to 4 of the experimental method were carried out in the same way for the 7 comparative methods. For the recognition and prediction of N⁶-methyladenosine sites in yeast RNA, the average classification Accuracy, Sensitivity, Specificity and MCC are shown in Table 2.

The results of AUROC are shown in FIG. 4, and the results of AUPRC are shown in FIG. 5.
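Accuracy, Sensitivity, Specificity and MCC are all functions of the confusion-matrix counts of a binary classifier. The sketch below uses their standard definitions; the function name `classification_metrics`, the label convention (1 for positive, 0 for negative) and the toy labels are assumptions for illustration and are not the experimental data.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and MCC for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # Convention assumed here: MCC is reported as 0.0 when undefined.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy example: 4 positive and 4 negative samples, one error in each class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a 10-fold cross-validation, these values would be computed on each test fold and then averaged.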

Table 2. Experimental results of Example 2 compared with the 7 comparative methods

As can be seen from Table 2, the support vector machine model built on the RNA sequence coding method of the present invention achieves an accuracy, sensitivity, specificity and MCC of 0.905, 0.916, 0.894 and 0.810, respectively, for the recognition and prediction of N⁶-methyladenosine sites in yeast RNA, all higher than those of the 7 comparative encoding methods.

As can be seen from FIG. 4, the support vector machine model built on the RNA sequence coding method of the present invention achieves an AUROC of 0.968 for the recognition and prediction of N⁶-methyladenosine sites in yeast RNA, higher than the 7 comparative encoding methods.

As can be seen from FIG. 5, the support vector machine model built on the RNA sequence coding method of the present invention achieves an AUPRC of 0.967 for the recognition and prediction of N⁶-methyladenosine sites in yeast RNA, higher than the 7 comparative encoding methods.

Claims

1. A method for coding DNA/RNA sequences with bidirectional dinucleotide position-specific preference and mutual information, which is characterized by comprising the following steps:

Given a DNA/RNA sequence dataset D, the dataset consists of a positive class dataset and a negative class dataset;

the occurrence frequencies of nucleotides A, C, G, X at the i-th position of all sequence samples in the positive class data set, respectively;

wherein

the occurrence frequencies of nucleotides A, C, G, X at the i-th position of all sequence samples in the negative class data set, respectively;

wherein AA, AC, … and XX are the 16 dinucleotides formed by the 4 nucleotides A, C, G, X of DNA/RNA; α represents the distance between the two nucleotides, 0 ≤ α ≤ (l-3)/2, and α is a finite non-negative integer; j is the position of the nucleotide, α+2 ≤ j ≤ l-α-1, and j is a finite positive integer,

the occurrence frequencies of dinucleotides AA, AC, … and XX at positions j and j+α+1 of all sequence samples in the positive class data set, respectively;

wherein

the occurrence frequencies of dinucleotides AA, AC, … and XX at positions j and j-α-1 of all sequence samples in the positive class data set, respectively;

wherein

the occurrence frequencies of dinucleotides AA, AC, … and XX at positions j and j+α+1 of all sequence samples in the negative class data set, respectively;

wherein

the occurrence frequencies of dinucleotides AA, AC, … and XX at positions j and j-α-1 of all sequence samples in the negative class data set, respectively;

is the frequency of occurrence of nucleotide z at position j + α +1 of all sequence samples of the positive dataset;

determining the backward mutual information value of the nucleotide of the DNA/RNA sequence to be coded in the positive data set according to the following formula

Wherein y is the nucleotide at position j- α -1, y ∈ { A, C, G, X },

is the frequency of occurrence of nucleotide y at the j-alpha-1 position of all sequence samples in the positive dataset;

Defined as forward point mutual information value

And backward point mutual information value

(3.2) determining the forward mutual information value of the nucleotide sequence of the DNA/RNA sequence to be encoded in the negative class data set according to the following formula

wherein

the frequency of occurrence of nucleotide z at position j+α+1 of all sequence samples in the negative class data set;

wherein

the frequency of occurrence of nucleotide y at position j-α-1 of all sequence samples in the negative class data set;

Defined as forward point mutual information value

And backward point mutual information value

V = [V_{α+2}, V_{α+3}, …, V_j]

(4) feature combination

When the parameter α is 0, the feature vector V(0) is [V_2, V_3, V_4, …, V_{l-2}, V_{l-1}] and has l-2 elements; when α is 1, the feature vector V(1) is [V_3, V_4, V_5, …, V_{l-3}, V_{l-2}] and has l-4 elements; …; when α is (l-5)/2, the feature vector V((l-5)/2) is [V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}] and has 3 elements; when α is (l-3)/2, the feature vector V((l-3)/2) is [V_{(l+1)/2}] and has 1 element; the feature vectors determined by the different values of α are combined into a high-dimensional feature vector [V(0), V(1), …, V((l-5)/2), V((l-3)/2)] with (l-1)²/4 elements;

(5) DNA/RNA sequence coding