CN112365925A - Bidirectional dinucleotide position specific preference and mutual information DNA/RNA sequence coding method - Google Patents


Publication number
CN112365925A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011236138.3A
Other languages
Chinese (zh)
Inventor
王明钊
谢娟英
许升全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Application filed by Shaanxi Normal University
Priority to CN202011236138.3A
Publication of CN112365925A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Data Mining & Analysis (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A DNA/RNA sequence coding method based on bidirectional dinucleotide position-specific preference and point mutual information comprises the steps of constructing a nucleotide position-specific preference matrix of the DNA/RNA sequences, constructing a bidirectional dinucleotide position-specific preference matrix of the DNA/RNA sequences, determining the point mutual information values of the DNA/RNA sequence nucleotides, combining features, and coding the DNA/RNA sequences. To extract more dinucleotide position information from DNA/RNA sequence data, a parameter α is introduced to express the distance between the two nucleotides of a dinucleotide, and the numerical feature vectors obtained for different values of α are combined into a global high-dimensional numerical feature vector, which performs very well in recognizing 4mC methylation sites of DNA and m6A methylation sites of RNA. The DNA/RNA numerical feature data obtained by the invention contain more classification information, have low redundancy among features, and yield trained models with high recognition accuracy; the method can be used for coding DNA/RNA sequences.

Description

Bidirectional dinucleotide position specific preference and mutual information DNA/RNA sequence coding method
Technical Field
The invention belongs to the technical field of sequence data analysis, and particularly relates to a DNA/RNA sequence coding method.
Background
Machine learning techniques play a key role in the DNA/RNA methylation site recognition problem in the post-genomic era. In general, solving this problem with machine learning involves three main steps: sequence data encoding, model construction, and performance assessment. Sequence data encoding is the most important of the three: extracting numerical features that contain more classification information from DNA/RNA sequence data is the key to accurately identifying DNA/RNA methylation sites.
At present, existing coding methods produce low-dimensional features and cannot extract the key identification information from DNA/RNA sequence data, so the prediction and recognition models built on them have low accuracy. Encoding the DNA/RNA sequence data with several coding methods and combining the resulting numerical features into high-dimensional features can compensate for the shortcomings of a single coding method, but it causes high redundancy among the numerical features and wastes computing resources, and it improves the accuracy of DNA/RNA methylation site recognition models only slightly. Therefore, developing a coding method that effectively encodes DNA/RNA sequences into high-dimensional numerical features containing more classification information, with low redundancy among features, is the key to solving this problem and a research hot spot in this field.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a bidirectional dinucleotide position-specific preference and mutual information DNA/RNA sequence coding method whose features contain more classification information and have low redundancy, and whose established models have high recognition and prediction accuracy.
The technical scheme adopted for solving the technical problems comprises the following steps:
(1) Construction of DNA/RNA sequence nucleotide position specificity preference matrix
Given a DNA/RNA sequence dataset D, the dataset consists of a positive class dataset and a negative class dataset.
Determining a nucleotide position-specific preference matrix for the positive class dataset according to the following formula
Figure BDA0002766724740000021
Figure BDA0002766724740000022
Wherein A, C, G, and X are the 4 nucleotides of DNA/RNA; X denotes nucleotide T in a DNA sequence dataset and nucleotide U in an RNA sequence dataset; i is the nucleotide position, 1 ≤ i ≤ l, and i is a finite positive integer; l is the nucleotide length of a DNA/RNA sequence sample, and l is an odd number;
Figure BDA0002766724740000023
are, respectively, the occurrence frequencies of nucleotides A, C, G, X at the i-th position of all sequence samples in the positive class dataset.
Determining a nucleotide position-specific preference matrix for a negative class dataset as follows
Figure BDA0002766724740000024
Figure BDA0002766724740000025
Wherein
Figure BDA0002766724740000026
are, respectively, the occurrence frequencies of nucleotides A, C, G, X at the i-th position of all sequence samples in the negative class dataset.
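Step (1) can be illustrated with a minimal Python sketch. The matrix formulas themselves are only available as images, so this assumes each matrix entry is simply the fraction of sequences in one class that carry a given nucleotide at a given position. The function name `position_frequency_matrix` and the toy sequences are hypothetical and are not taken from the patent.

```python
from collections import Counter


def position_frequency_matrix(seqs, alphabet="ACGT"):
    """Return {nucleotide: [frequency at position 1..l]} for one class.

    Assumption: an entry is the fraction of sequences in `seqs` that
    have the given nucleotide at the given position.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    matrix = {nt: [0.0] * length for nt in alphabet}
    for i in range(length):
        counts = Counter(s[i] for s in seqs)  # nucleotide counts at position i+1
        for nt in alphabet:
            matrix[nt][i] = counts[nt] / len(seqs)
    return matrix


# Toy positive-class data (hypothetical, far smaller than a real dataset).
positive = ["ACGTA", "ACCTA", "TCGTA", "ACGAA"]
pos_matrix = position_frequency_matrix(positive)
print(pos_matrix["A"])  # [0.75, 0.0, 0.0, 0.25, 1.0]
```

The same function would be called once on the positive class dataset and once on the negative class dataset; for RNA the alphabet argument would be "ACGU".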
(2) Construction of DNA/RNA sequence bidirectional dinucleotide position specificity preference matrix
Determining a forward dinucleotide position-specific preference matrix for a positive class dataset as follows
Figure BDA0002766724740000027
Figure BDA0002766724740000028
Wherein AA, AC, …, XX are the 16 dinucleotides formed from the 4 nucleotides A, C, G, X of DNA/RNA; α is the distance between the two nucleotides (α = 0 means they are adjacent), 0 ≤ α ≤ (l-3)/2, and α is a finite non-negative integer; j is the nucleotide position, α+2 ≤ j ≤ l-α-1, and j is a finite positive integer;
Figure BDA0002766724740000029
are, respectively, the occurrence frequencies of the dinucleotides AA, AC, …, XX at the jth and (j+α+1)th positions of all sequence samples in the positive class dataset.
Determining a backward dinucleotide position-specific preference matrix for the positive class dataset according to the following formula
Figure BDA00027667247400000210
Figure BDA0002766724740000031
Wherein
Figure BDA0002766724740000032
are, respectively, the occurrence frequencies of the dinucleotides AA, AC, …, XX at the jth and (j-α-1)th positions of all sequence samples in the positive class dataset.
Determining a forward dinucleotide position-specific preference matrix for the negative class dataset as follows
Figure BDA0002766724740000033
Figure BDA0002766724740000034
Wherein
Figure BDA0002766724740000035
are, respectively, the occurrence frequencies of the dinucleotides AA, AC, …, XX at the jth and (j+α+1)th positions of all sequence samples in the negative class dataset.
Determining a backward dinucleotide position-specific preference matrix for a negative class dataset as follows
Figure BDA0002766724740000036
Figure BDA0002766724740000037
Wherein
Figure BDA0002766724740000038
are, respectively, the occurrence frequencies of the dinucleotides AA, AC, …, XX at the jth and (j-α-1)th positions of all sequence samples in the negative class dataset.
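The forward and backward matrices of step (2) differ only in which partner position is paired with position j. The sketch below is a hedged illustration: it stores only the pairs that actually occur, rather than a full 16-row matrix, and the name `gapped_pair_frequencies` and the toy data are hypothetical.

```python
from collections import Counter


def gapped_pair_frequencies(seqs, alpha, direction):
    """Frequencies of nucleotide pairs separated by `alpha` positions.

    Returns {j: {pair: frequency}} for the 1-based positions
    j = alpha+2 .. l-alpha-1, where `pair` is
      forward : nucleotide at j followed by nucleotide at j + alpha + 1
      backward: nucleotide at j followed by nucleotide at j - alpha - 1
    """
    if direction not in ("forward", "backward"):
        raise ValueError("direction must be 'forward' or 'backward'")
    length = len(seqs[0])
    offset = alpha + 1 if direction == "forward" else -(alpha + 1)
    table = {}
    for j in range(alpha + 2, length - alpha):  # j runs up to l - alpha - 1
        counts = Counter(s[j - 1] + s[j - 1 + offset] for s in seqs)
        table[j] = {pair: n / len(seqs) for pair, n in counts.items()}
    return table


# Toy positive-class data (hypothetical).
positive = ["ACGTA", "ACCTA", "TCGTA", "ACGAA"]
forward0 = gapped_pair_frequencies(positive, 0, "forward")
backward0 = gapped_pair_frequencies(positive, 0, "backward")
print(sorted(forward0))  # [2, 3, 4]
print(forward0[2])       # {'CG': 0.75, 'CC': 0.25}
print(backward0[2])      # {'CA': 0.75, 'CT': 0.25}
```

With l = 5 the largest admissible spacing is α = (l-3)/2 = 1, for which only j = 3 remains.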
(3) Determination of mutual information values of nucleotides of DNA/RNA sequences
(3.1) determining the forward mutual information value of the nucleotide sequence of the DNA/RNA sequence to be encoded in the positive data set according to the following formula
Figure BDA0002766724740000039
Figure BDA00027667247400000310
Wherein x is the nucleotide at the jth position, x ∈ {A, C, G, X}; z is the nucleotide at the (j+α+1)th position, z ∈ {A, C, G, X};
Figure BDA00027667247400000311
is the frequency of the occurrence of the dinucleotide xz at the jth and jth + alpha +1 positions of all sequence samples of the positive dataset,
Figure BDA0002766724740000041
is the frequency of occurrence of nucleotide x at the jth position of all sequence samples in the positive data set,
Figure BDA0002766724740000042
is the frequency of occurrence of nucleotide z at position j + α +1 of all sequence samples of the positive dataset.
Determining the backward point mutual information value of the nucleotides of the DNA/RNA sequence to be encoded in the positive class dataset according to the following formula
Figure BDA0002766724740000043
Figure BDA0002766724740000044
Wherein y is the nucleotide at the (j-α-1)th position, y ∈ {A, C, G, X};
Figure BDA0002766724740000045
is the occurrence frequency of the dinucleotide xy at the jth and (j-α-1)th positions of all sequence samples of the positive class dataset,
Figure BDA0002766724740000046
is the frequency of occurrence of nucleotide y at the j- α -1 position of all sequence samples of the positive dataset.
The point mutual information coding value of the nucleotide at the jth position of a DNA/RNA sequence sample to be encoded in the positive class dataset,
Figure BDA0002766724740000047
is defined by the forward point mutual information value
Figure BDA0002766724740000048
and the backward point mutual information value
Figure BDA0002766724740000049
A DNA/RNA sequence sample of length l is encoded as a point mutual information feature vector V+ of length l-2α-2:
Figure BDA00027667247400000410
Figure BDA00027667247400000411
(3.2) Determining the forward point mutual information value of the nucleotides of the DNA/RNA sequence to be encoded in the negative class dataset according to the following formula
Figure BDA00027667247400000412
Figure BDA00027667247400000413
Wherein
Figure BDA00027667247400000414
is the occurrence frequency of the dinucleotide xz at the jth and (j+α+1)th positions of all sequence samples in the negative class dataset,
Figure BDA00027667247400000415
is the occurrence frequency of nucleotide x at the jth position of all sequence samples in the negative class dataset, and
Figure BDA00027667247400000416
is the occurrence frequency of nucleotide z at the (j+α+1)th position of all sequence samples in the negative class dataset.
Determining the backward point mutual information value of the DNA/RNA sequence nucleotide to be coded in the negative class data set according to the following formula
Figure BDA00027667247400000417
Figure BDA00027667247400000418
Wherein
Figure BDA00027667247400000419
is the occurrence frequency of the dinucleotide xy at the jth and (j-α-1)th positions of all sequence samples in the negative class dataset, and
Figure BDA00027667247400000420
is the occurrence frequency of nucleotide y at the (j-α-1)th position of all sequence samples in the negative class dataset.
The point mutual information coding value of the nucleotide at the jth position of a DNA/RNA sequence sample to be encoded in the negative class dataset,
Figure BDA00027667247400000421
is defined by the forward point mutual information value
Figure BDA0002766724740000051
and the backward point mutual information value
Figure BDA0002766724740000052
A DNA/RNA sequence sample of length l is encoded as a point mutual information feature vector V- of length l-2α-2:
Figure BDA0002766724740000053
Figure BDA0002766724740000054
(3.3) For a given DNA/RNA sequence sample to be encoded of length l, the feature vector V is determined by subtracting the corresponding elements of the vectors V+ and V-:
V = [V_{α+2}, V_{α+3}, …, V_j]
Figure BDA0002766724740000055
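The formulas of step (3) are given as images that did not survive extraction, so the Python sketch below is only one plausible reading and not the patent's definitive computation. It assumes the textbook pointwise mutual information, log2(p(x,z) / (p(x) p(z))), with a base-2 logarithm; it assumes the forward and backward values at position j are summed; it assumes V is V+ minus V-; and it assumes a pair that never occurs in a class dataset contributes 0. The helper names `pmi`, `encode`, and `feature_vector` and the toy sequences are hypothetical.

```python
import math


def freq(seqs, predicate):
    """Fraction of sequences in `seqs` that satisfy `predicate`."""
    return sum(1 for s in seqs if predicate(s)) / len(seqs)


def pmi(seqs, j, k, x, z):
    """Pointwise mutual information of x at position j and z at position k (1-based)."""
    p_xz = freq(seqs, lambda s: s[j - 1] == x and s[k - 1] == z)
    if p_xz == 0.0:
        return 0.0  # assumption: unseen pairs contribute nothing
    p_x = freq(seqs, lambda s: s[j - 1] == x)
    p_z = freq(seqs, lambda s: s[k - 1] == z)
    return math.log2(p_xz / (p_x * p_z))  # assumption: base-2 logarithm


def encode(sample, seqs, alpha):
    """Encode `sample` against one class dataset; the result has l-2*alpha-2 values."""
    l = len(sample)
    values = []
    for j in range(alpha + 2, l - alpha):  # j = alpha+2 .. l-alpha-1
        x = sample[j - 1]
        fwd = pmi(seqs, j, j + alpha + 1, x, sample[j + alpha])
        bwd = pmi(seqs, j, j - alpha - 1, x, sample[j - alpha - 2])
        values.append(fwd + bwd)  # assumption: forward and backward values are summed
    return values


def feature_vector(sample, positive, negative, alpha):
    """Element-wise difference of the two class encodings (assumed order: V+ minus V-)."""
    v_pos = encode(sample, positive, alpha)
    v_neg = encode(sample, negative, alpha)
    return [p - n for p, n in zip(v_pos, v_neg)]


# Toy datasets (hypothetical).
positive = ["ACGTA", "ACCTA", "TCGTA", "ACGAA"]
negative = ["GGCTA", "ATCGA", "CCGTT", "AGGTA"]
v0 = feature_vector("ACGTA", positive, negative, alpha=0)
v1 = feature_vector("ACGTA", positive, negative, alpha=1)
print(len(v0), len(v1))  # 3 1
```

Under these assumptions a positive entry of V means the nucleotide pairing around position j is more characteristic of the positive class than of the negative class.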
(4) feature combination
When α = 0, the feature vector V(0) = [V_2, V_3, V_4, …, V_{l-2}, V_{l-1}] has l-2 elements; when α = 1, the feature vector V(1) = [V_3, V_4, V_5, …, V_{l-3}, V_{l-2}] has l-4 elements; …; when α = (l-5)/2, the feature vector V((l-5)/2) = [V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}] has 3 elements; when α = (l-3)/2, the feature vector V((l-3)/2) = [V_{(l+1)/2}] has 1 element. The feature vectors determined by the different values of the parameter α are combined into a high-dimensional feature vector [V(0), V(1), …, V((l-5)/2), V((l-3)/2)] with (l-1)^2/4 elements.
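The element count (l-1)^2/4 follows because the segment lengths l-2, l-4, …, 3, 1 are the first (l-1)/2 odd numbers. A small Python check, using the hypothetical helper name `segment_lengths`, confirms this for the two sequence lengths used in the examples:

```python
def segment_lengths(l):
    """Length of V(alpha) for alpha = 0 .. (l-3)/2; l must be odd and at least 3."""
    if l < 3 or l % 2 == 0:
        raise ValueError("l must be an odd integer >= 3")
    return [l - 2 * alpha - 2 for alpha in range((l - 3) // 2 + 1)]


for l in (41, 51):
    lengths = segment_lengths(l)
    # number of segments, longest segment, shortest segment, total features
    print(l, len(lengths), lengths[0], lengths[-1], sum(lengths))
# 41 20 39 1 400
# 51 25 49 1 625
```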
(5) DNA/RNA sequence coding
Using the above-mentioned steps (1) to (4), the DNA/RNA sequence dataset D is encoded as a numerical dataset D',
Figure BDA0002766724740000056
where s is the number of samples in the numerical dataset D', s is a finite positive integer, and (l-1)^2/4 is the number of features of the numerical dataset D'.
To encode DNA/RNA sequence data into numerical data that contains more classification information and has low redundancy among features, the invention provides a bidirectional dinucleotide position-specific preference method: based on the nucleotide position-specific preference matrices and the bidirectional dinucleotide position-specific preference matrices of the positive and negative class DNA/RNA sequence data, DNA/RNA sequence samples are encoded into numerical feature samples by the point mutual information method. To extract more dinucleotide position information from a DNA/RNA sequence sample, a parameter α expressing the dinucleotide spacing is introduced when constructing the bidirectional dinucleotide position-specific preference matrices, and the numerical feature vectors determined by the different values of α are combined into a global high-dimensional numerical feature vector that contains more classification information and has low redundancy among features. The DNA/RNA sequence coding method of the invention was compared with 7 existing coding methods in simulation experiments. The results show that, for recognition and prediction of N4-methylcytosine (4mC) sites in Caenorhabditis elegans DNA, the support vector machine model built on the coding method of the invention reaches an accuracy, sensitivity, specificity, MCC, AUROC, and AUPRC of 0.905, 0.902, 0.909, 0.811, 0.967, and 0.966, respectively, all higher than the 7 comparison coding methods. For recognition and prediction of N6-methyladenosine (m6A) sites in yeast RNA, the support vector machine model built on the coding method of the invention reaches an accuracy, sensitivity, specificity, MCC, AUROC, and AUPRC of 0.905, 0.916, 0.894, 0.810, 0.968, and 0.967, respectively, all higher than the 7 comparison coding methods.
Drawings
FIG. 1 is a flow chart of the method of example 1 of the present invention.
FIG. 2 shows the AUROC curves for recognition and prediction of N4-methylcytosine sites in Caenorhabditis elegans DNA by the support vector machine models built on the method of the invention and on the 7 comparison coding methods, respectively.
FIG. 3 shows the AUPRC curves for recognition and prediction of N4-methylcytosine sites in Caenorhabditis elegans DNA by the support vector machine models built on the method of the invention and on the 7 comparison coding methods, respectively.
FIG. 4 shows the AUROC curves for recognition and prediction of N6-methyladenosine sites in yeast RNA by the support vector machine models built on the method of the invention and on the 7 comparison coding methods, respectively.
FIG. 5 shows the AUPRC curves for recognition and prediction of N6-methyladenosine sites in yeast RNA by the support vector machine models built on the method of the invention and on the 7 comparison coding methods, respectively.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and examples, but the present invention is not limited to the embodiments described below.
Example 1
This example uses the Caenorhabditis elegans DNA N4-methylcytosine (4mC) dataset from the document "iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties". The dataset has 3108 DNA sequence samples, of which 1554 are positive class samples (true N4-methylcytosine sites) and 1554 are negative class samples (not N4-methylcytosine sites), and the length l of each sequence sample is 41. The bidirectional dinucleotide position-specific preference and mutual information DNA sequence coding method of this example consists of the following steps (see FIG. 1):
(1) Construction of DNA sequence nucleotide position specificity preference matrix
Given a DNA sequence dataset D, the dataset consists of a positive class dataset and a negative class dataset.
Determining a nucleotide position-specific preference matrix for the positive class dataset according to the following formula
Figure BDA0002766724740000071
Figure BDA0002766724740000072
Wherein A, C, G, and T are the 4 nucleotides of DNA; i is the nucleotide position, 1 ≤ i ≤ l, and i is a finite positive integer; l is the nucleotide length of a DNA sequence sample, l is an odd number, and l is 41 in this embodiment;
Figure BDA0002766724740000073
Figure BDA0002766724740000074
are, respectively, the occurrence frequencies of nucleotides A, C, G, T at the i-th position of all sequence samples in the positive class dataset.
Determining a nucleotide position-specific preference matrix for a negative class dataset as follows
Figure BDA0002766724740000075
Figure BDA0002766724740000076
Wherein
Figure BDA0002766724740000077
are, respectively, the occurrence frequencies of nucleotides A, C, G, T at the i-th position of all sequence samples in the negative class dataset.
(2) Construction of a DNA sequence bidirectional dinucleotide position-specific preference matrix
Determining a forward dinucleotide position-specific preference matrix for a positive class dataset as follows
Figure BDA0002766724740000078
Figure BDA0002766724740000079
Wherein AA, AC, …, TT are the 16 dinucleotides formed from the 4 nucleotides A, C, G, T of DNA; α is the distance between the two nucleotides (α = 0 means they are adjacent), 0 ≤ α ≤ (l-3)/2, α is a finite non-negative integer, and 0 ≤ α ≤ 19 in this embodiment; j is the nucleotide position, α+2 ≤ j ≤ l-α-1, which is α+2 ≤ j ≤ 40-α in this embodiment, and j is a finite positive integer;
Figure BDA0002766724740000081
are, respectively, the occurrence frequencies of the dinucleotides AA, AC, …, TT at the jth and (j+α+1)th positions of all sequence samples in the positive class dataset.
Determining a backward dinucleotide position-specific preference matrix for the positive class dataset according to the following formula
Figure BDA0002766724740000082
Figure BDA0002766724740000083
Wherein
Figure BDA0002766724740000084
are, respectively, the occurrence frequencies of the dinucleotides AA, AC, …, TT at the jth and (j-α-1)th positions of all sequence samples in the positive class dataset.
Determining a forward dinucleotide position-specific preference matrix for the negative class dataset as follows
Figure BDA0002766724740000085
Figure BDA0002766724740000086
Wherein
Figure BDA0002766724740000087
are, respectively, the occurrence frequencies of the dinucleotides AA, AC, …, TT at the jth and (j+α+1)th positions of all sequence samples in the negative class dataset.
Determining a backward dinucleotide position-specific preference matrix for a negative class dataset as follows
Figure BDA0002766724740000088
Figure BDA0002766724740000089
Wherein
Figure BDA00027667247400000810
are, respectively, the occurrence frequencies of the dinucleotides AA, AC, …, TT at the jth and (j-α-1)th positions of all sequence samples in the negative class dataset.
(3) Determination of mutual information values of nucleotides of DNA sequences
(3.1) determining the forward mutual information value of the nucleotides of the DNA sequence to be encoded in the positive data set according to the following formula
Figure BDA00027667247400000811
Figure BDA0002766724740000091
Wherein x is the nucleotide at the jth position, x ∈ {A, C, G, T}; z is the nucleotide at the (j+α+1)th position, z ∈ {A, C, G, T};
Figure BDA0002766724740000092
is the frequency of the occurrence of the dinucleotide xz at the jth and jth + alpha +1 positions of all sequence samples of the positive dataset,
Figure BDA0002766724740000093
is the frequency of occurrence of nucleotide x at the jth position of all sequence samples in the positive data set,
Figure BDA0002766724740000094
is the frequency of occurrence of nucleotide z at position j + α +1 of all sequence samples of the positive dataset.
Determining the backward mutual information value of the nucleotide of the DNA sequence to be encoded in the positive data set according to the following formula
Figure BDA0002766724740000095
Figure BDA0002766724740000096
Wherein y is the nucleotide at the (j-α-1)th position, y ∈ {A, C, G, T};
Figure BDA0002766724740000097
is the occurrence frequency of the dinucleotide xy at the jth and (j-α-1)th positions of all sequence samples of the positive class dataset,
Figure BDA0002766724740000098
is the frequency of occurrence of nucleotide y at the j- α -1 position of all sequence samples of the positive dataset.
The point mutual information coding value of the nucleotide at the jth position of a DNA sequence sample to be encoded in the positive class dataset,
Figure BDA0002766724740000099
is defined by the forward point mutual information value
Figure BDA00027667247400000910
and the backward point mutual information value
Figure BDA00027667247400000911
A DNA sequence sample of length l is encoded as a point mutual information feature vector V+ of length l-2α-2:
Figure BDA00027667247400000912
Figure BDA00027667247400000913
The value of l in this example is 41.
(3.2) determining the forward mutual information value of the nucleotides of the DNA sequence to be encoded in the negative data set according to the following formula
Figure BDA00027667247400000914
Figure BDA00027667247400000915
Wherein
Figure BDA00027667247400000916
is the occurrence frequency of the dinucleotide xz at the jth and (j+α+1)th positions of all sequence samples in the negative class dataset,
Figure BDA00027667247400000917
is the occurrence frequency of nucleotide x at the jth position of all sequence samples in the negative class dataset, and
Figure BDA00027667247400000918
is the occurrence frequency of nucleotide z at the (j+α+1)th position of all sequence samples in the negative class dataset.
Determining the backward point mutual information value of the DNA sequence nucleotide to be coded in the negative class data set according to the following formula
Figure BDA00027667247400000919
Figure BDA0002766724740000101
Wherein
Figure BDA0002766724740000102
is the occurrence frequency of the dinucleotide xy at the jth and (j-α-1)th positions of all sequence samples in the negative class dataset, and
Figure BDA0002766724740000103
is the occurrence frequency of nucleotide y at the (j-α-1)th position of all sequence samples in the negative class dataset.
The point mutual information coding value of the nucleotide at the jth position of a DNA sequence sample to be encoded in the negative class dataset,
Figure BDA0002766724740000104
is defined by the forward point mutual information value
Figure BDA0002766724740000105
and the backward point mutual information value
Figure BDA0002766724740000106
A DNA sequence sample of length l is encoded as a point mutual information feature vector V- of length l-2α-2:
Figure BDA0002766724740000107
Figure BDA0002766724740000108
The value of l in the example is 41.
(3.3) For a given DNA sequence sample to be encoded of length l, the feature vector V is determined by subtracting the corresponding elements of the vectors V+ and V-:
V = [V_{α+2}, V_{α+3}, …, V_j]
Figure BDA0002766724740000109
(4) feature combination
When α = 0, the feature vector V(0) = [V_2, V_3, V_4, …, V_{l-2}, V_{l-1}] has l-2 elements; when α = 1, the feature vector V(1) = [V_3, V_4, V_5, …, V_{l-3}, V_{l-2}] has l-4 elements; …; when α = (l-5)/2, the feature vector V((l-5)/2) = [V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}] has 3 elements; when α = (l-3)/2, the feature vector V((l-3)/2) = [V_{(l+1)/2}] has 1 element. The feature vectors determined by the different values of the parameter α are combined into a high-dimensional feature vector [V(0), V(1), …, V((l-5)/2), V((l-3)/2)] with (l-1)^2/4 elements. In this embodiment, the value of l is 41.
(5) DNA sequence coding
Using the above-mentioned steps (1) to (4), the DNA sequence data set D is encoded into a numerical data set D',
Figure BDA00027667247400001010
where s is the number of samples in the numerical dataset D', s is a finite positive integer, and s is 3108 in this embodiment; (l-1)^2/4 is the number of features of the numerical dataset D' (400 for l = 41). This completes the DNA sequence coding.
To verify the beneficial effects of the present invention, the inventors applied the DNA sequence coding method of Example 1 and the PSNP (position-specific nucleotide propensity), PSDP (position-specific dinucleotide propensity), KNF (K-nucleotide frequencies), KSNPF (K-spaced nucleotide pair frequencies), NPPS (nucleotide pair position specificity), PBE (positional array encoding), and NCPNC (nucleotide sequence encoding) coding methods to the recognition and prediction of N4-methylcytosine sites in C. elegans DNA, and compared the performance of the support vector machine models built on each coding method. The evaluation metrics were the average classification accuracy (Accuracy), sensitivity (Sensitivity), specificity (Specificity), MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve), and AUPRC (area under the precision-recall curve) under the K-fold cross-validation method. The experimental method is as follows:
1. Encode the sequence samples of the C. elegans DNA N4-methylcytosine dataset according to Example 1.
2. Data set normalization
The numerical dataset D' is normalized by the min-max normalization method as follows:
Figure BDA0002766724740000111
where g_{m,n} is the nth feature value of the mth sample of the numerical dataset D', g'_{m,n} is the normalized value of g_{m,n}, max(g_n) and min(g_n) are the maximum and minimum feature values in the nth column of the numerical dataset D', 1 ≤ m ≤ s, 1 ≤ n ≤ (l-1)^2/4, and m and n are finite positive integers; in this embodiment, l is 41 and s is 3108.
3. Partitioning a data set
The normalized numerical dataset D' is divided into 10 parts by the K-fold cross-validation method (K = 10). Each part in turn serves as the test set D'_Te while the remaining 9 parts serve as the training set D'_Tr, for 10 runs in total; in each run the ratio of the training set D'_Tr to the test set D'_Te is 9:1.
4. Training models and tests
The support vector machine model is trained with the training set D'_Tr, and its performance is tested with the test set D'_Te.
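Steps 2 and 3 above can be sketched in plain Python. The normalization formula is only available as an image, so the sketch assumes the usual min-max form g' = (g - min) / (max - min) per column and maps a constant column to 0. Shuffling the sample indices before splitting, the fixed seed, and the function names `min_max_normalize` and `k_fold_splits` are likewise assumptions made for illustration.

```python
import random


def min_max_normalize(rows):
    """Scale every column of a list-of-lists feature table to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (g - lo) / (hi - lo) if hi > lo else 0.0  # assumption: constant column -> 0
            for g, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: shuffle before splitting
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, folds[f]


# Toy feature table (hypothetical) and the sample count of Example 1.
data = [[1.0, 10.0, 5.0], [3.0, 20.0, 5.0], [2.0, 40.0, 5.0]]
scaled = min_max_normalize(data)
splits = list(k_fold_splits(3108, k=10))
print(scaled[2])                                          # [0.5, 1.0, 0.0]
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 2797 311
```

Because 3108 is not divisible by 10, the folds hold 310 or 311 samples, so the 9:1 ratio is approximate in each run.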
The 7 comparison methods were run through the same Steps 2-4 of the experimental procedure for recognition and prediction of N4-methylcytosine sites in Caenorhabditis elegans DNA. The average classification accuracy (Accuracy), sensitivity (Sensitivity), specificity (Specificity), and MCC are shown in Table 1, the AUROC results are shown in FIG. 2, and the AUPRC results are shown in FIG. 3.
Table 1. Experimental results of the method of Example 1 compared with the 7 comparison methods
Figure BDA0002766724740000121
As can be seen from Table 1, the support vector machine model established based on the DNA sequence coding method of the present invention is suitable for N of caenorhabditis elegans DNA4The accuracy, sensitivity, specificity and MCC of recognition prediction of the methylcytosine locus reach 0.905, 0.902, 0.909 and 0.811 respectively, which are higher than those of other 7 comparative encoding methods.
As can be seen from FIG. 2, the support vector machine model established based on the DNA sequence coding method of the present invention is directed to N of caenorhabditis elegans DNA4The AUROC value predicted by methylcytosine site recognition was 0.967, higher than the other 7 comparative coding methods.
As can be seen from FIG. 3, the support vector machine model established based on the DNA sequence coding method of the present invention was used for N of caenorhabditis elegans DNA4Recognition of the methylcytosine site predicts an AUPRC value of 0.966, higher than for the other 7 comparative coding methods.
Example 2
This example uses the yeast RNA N6-methyladenosine (m6A) data set from the document "Benchmark data for identifying N6-methyladenosine sites in the Saccharomyces cerevisiae genome". The data set contains 2614 RNA sequence samples: 1307 positive class samples, which are true N6-methyladenosine sites, and 1307 negative class samples, which are not N6-methyladenosine sites. The length l of each sequence sample is 51. The bidirectional dinucleotide position-specific preference and mutual information RNA sequence coding method of this example consists of the following steps (see FIG. 1).
(1) Construction of the RNA sequence nucleotide position-specific preference matrix
Given an RNA sequence data set D, the data set consists of a positive class data set and a negative class data set.
The nucleotide position-specific preference matrix of the positive class data set is determined according to the following formula
Figure BDA0002766724740000131
Figure BDA0002766724740000132
where A, C, G and U are the 4 nucleotides of RNA; i is the nucleotide position, 1 ≤ i ≤ l, and i is a finite positive integer; l is the nucleotide length of the RNA sequence samples and is an odd number (51 in this embodiment); and
Figure BDA0002766724740000133
Figure BDA0002766724740000134
are the occurrence frequencies of nucleotides A, C, G and U, respectively, at the i-th position of all sequence samples in the positive class data set.
The nucleotide position-specific preference matrix of the negative class data set is determined according to the following formula
Figure BDA0002766724740000135
Figure BDA0002766724740000136
where
Figure BDA0002766724740000137
are the occurrence frequencies of nucleotides A, C, G and U, respectively, at the i-th position of all sequence samples in the negative class data set.
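As an illustration of step (1), the position-specific nucleotide frequencies of one class can be computed as below. This is a sketch under stated assumptions: the helper name `position_frequencies` and the four toy sequences are hypothetical, and the code uses 0-based list indices where the text uses 1-based positions i.

```python
def position_frequencies(sequences, alphabet="ACGU"):
    """Frequency of each nucleotide at each position over equal-length sequences."""
    length = len(sequences[0])
    counts = {nt: [0] * length for nt in alphabet}
    for seq in sequences:
        for i, nt in enumerate(seq):
            counts[nt][i] += 1
    n = len(sequences)
    # One row per nucleotide, one column per position (index 0 is position i = 1).
    return {nt: [c / n for c in row] for nt, row in counts.items()}


# Toy positive class data set with l = 5 (hypothetical sequences).
positive = ["ACGUA", "ACGAA", "UCGUA", "ACCUA"]
freq = position_frequencies(positive)
```

The same function applied to the negative class data set gives the second matrix; for DNA the alphabet argument would be "ACGT".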
(2) Construction of the RNA sequence bidirectional dinucleotide position-specific preference matrices
The forward dinucleotide position-specific preference matrix of the positive class data set is determined according to the following formula
Figure BDA0002766724740000138
Figure BDA0002766724740000139
where AA, AC, …, UU are the 16 dinucleotides formed from the 4 RNA nucleotides A, C, G and U; α is the distance between the two nucleotides, 0 ≤ α ≤ (l−3)/2, α is a finite non-negative integer, and 0 ≤ α ≤ 24 in this embodiment; j is the nucleotide position, α+2 ≤ j ≤ l−α−1, j is a finite positive integer, and α+2 ≤ j ≤ 50−α in this embodiment; and
Figure BDA00027667247400001310
are the occurrence frequencies of the dinucleotides AA, AC, …, UU, respectively, at the j-th and (j+α+1)-th positions of all sequence samples in the positive class data set.
The backward dinucleotide position-specific preference matrix of the positive class data set is determined according to the following formula
Figure BDA0002766724740000141
Figure BDA0002766724740000142
where
Figure BDA0002766724740000143
are the occurrence frequencies of the dinucleotides AA, AC, …, UU, respectively, at the j-th and (j−α−1)-th positions of all sequence samples in the positive class data set.
The forward dinucleotide position-specific preference matrix of the negative class data set is determined according to the following formula
Figure BDA0002766724740000144
Figure BDA0002766724740000145
where
Figure BDA0002766724740000146
are the occurrence frequencies of the dinucleotides AA, AC, …, UU, respectively, at the j-th and (j+α+1)-th positions of all sequence samples in the negative class data set.
The backward dinucleotide position-specific preference matrix of the negative class data set is determined according to the following formula
Figure BDA0002766724740000147
Figure BDA0002766724740000148
where
Figure BDA0002766724740000149
are the occurrence frequencies of the dinucleotides AA, AC, …, UU, respectively, at the j-th and (j−α−1)-th positions of all sequence samples in the negative class data set.
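The forward and backward dinucleotide frequencies of step (2) differ only in which partner position is paired with position j. The sketch below illustrates this for one class and one value of α; the helper name `dinucleotide_frequencies`, the `direction` argument and the toy sequences are assumptions, and positions j are 1-based as in the text.

```python
from collections import Counter


def dinucleotide_frequencies(sequences, alpha, direction):
    """Map each position j (1-based) to {dinucleotide: frequency} for gap alpha.

    direction="forward" pairs position j with j + alpha + 1;
    direction="backward" pairs position j with j - alpha - 1.
    """
    l = len(sequences[0])
    n = len(sequences)
    step = alpha + 1 if direction == "forward" else -(alpha + 1)
    table = {}
    # j runs over alpha + 2 .. l - alpha - 1 inclusive, so both partners exist.
    for j in range(alpha + 2, l - alpha):
        counts = Counter(seq[j - 1] + seq[j - 1 + step] for seq in sequences)
        table[j] = {pair: c / n for pair, c in counts.items()}
    return table


# Toy positive class data set with l = 5 (hypothetical sequences).
positive = ["ACGUA", "ACGAA", "UCGUA", "ACCUA"]
fwd = dinucleotide_frequencies(positive, alpha=0, direction="forward")
bwd = dinucleotide_frequencies(positive, alpha=0, direction="backward")
```

Only dinucleotides that actually occur are stored; absent pairs have frequency 0. With l = 5 and α = 1 the only admissible position is j = 3, which matches the bound α+2 ≤ j ≤ l−α−1.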
(3) Determination of the pointwise mutual information values of RNA sequence nucleotides
(3.1) The forward pointwise mutual information value of a nucleotide of the RNA sequence to be encoded is determined in the positive class data set according to the following formula
Figure BDA00027667247400001410
Figure BDA00027667247400001411
where x is the nucleotide at the j-th position, x ∈ {A, C, G, U}; z is the nucleotide at the (j+α+1)-th position, z ∈ {A, C, G, U};
Figure BDA0002766724740000151
is the occurrence frequency of the dinucleotide xz at the j-th and (j+α+1)-th positions of all sequence samples in the positive class data set;
Figure BDA0002766724740000152
is the occurrence frequency of nucleotide x at the j-th position of all sequence samples in the positive class data set;
Figure BDA0002766724740000153
is the occurrence frequency of nucleotide z at the (j+α+1)-th position of all sequence samples in the positive class data set.
The backward pointwise mutual information value of a nucleotide of the RNA sequence to be encoded is determined in the positive class data set according to the following formula
Figure BDA0002766724740000154
Figure BDA0002766724740000155
where y is the nucleotide at the (j−α−1)-th position, y ∈ {A, C, G, U};
Figure BDA0002766724740000156
is the occurrence frequency of the dinucleotide xy at the j-th and (j−α−1)-th positions of all sequence samples in the positive class data set;
Figure BDA0002766724740000157
is the occurrence frequency of nucleotide y at the (j−α−1)-th position of all sequence samples in the positive class data set.
The pointwise mutual information coding value of the nucleotide at the j-th position of the RNA sequence sample to be encoded in the positive class data set,
Figure BDA0002766724740000158
is defined from the forward pointwise mutual information value
Figure BDA0002766724740000159
and the backward pointwise mutual information value
Figure BDA00027667247400001510
The RNA sequence sample of length l is thereby encoded into a pointwise mutual information feature vector V+ of length l−2α−2:
Figure BDA00027667247400001511
Figure BDA00027667247400001512
In this embodiment l is 51.
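The formulas of step (3.1) are given only as figures, so the following sketch rests on two loudly stated assumptions: it uses the standard pointwise mutual information log2(p(x,z) / (p(x)·p(z))), and it combines the forward and backward values by summation. Neither choice is confirmed by the text, and the helper names `pmi` and `encode_against` are hypothetical.

```python
import math


def pmi(sequences, pos_a, pos_b, x, y):
    """Standard PMI of nucleotide x at pos_a and y at pos_b (1-based positions)."""
    n = len(sequences)
    p_x = sum(s[pos_a - 1] == x for s in sequences) / n
    p_y = sum(s[pos_b - 1] == y for s in sequences) / n
    p_xy = sum(s[pos_a - 1] == x and s[pos_b - 1] == y for s in sequences) / n
    if p_xy == 0.0:
        return 0.0  # assumption: a pair never seen in this class contributes 0
    return math.log2(p_xy / (p_x * p_y))


def encode_against(sample, sequences, alpha):
    """Encode one sample against one class data set for a given gap alpha."""
    l = len(sample)
    vector = []
    for j in range(alpha + 2, l - alpha):      # j = alpha+2 .. l-alpha-1
        x = sample[j - 1]                      # nucleotide at position j
        z = sample[j + alpha]                  # nucleotide at position j+alpha+1
        y = sample[j - alpha - 2]              # nucleotide at position j-alpha-1
        forward = pmi(sequences, j, j + alpha + 1, x, z)
        backward = pmi(sequences, j, j - alpha - 1, x, y)
        vector.append(forward + backward)      # assumption: forward + backward
    return vector


# Toy positive class data set with l = 5 (hypothetical sequences).
positive = ["ACGUA", "ACGAA", "UCGUA", "ACCUA"]
v_plus = encode_against("ACGUA", positive, alpha=0)
```

The resulting vector has l−2α−2 elements, as stated in the text. Running the same function against the negative class data set gives the counterpart vector, and the element-wise difference of the two gives the feature vector of step (3.3).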
(3.2) The forward pointwise mutual information value of a nucleotide of the RNA sequence to be encoded is determined in the negative class data set according to the following formula
Figure BDA00027667247400001513
Figure BDA00027667247400001514
where
Figure BDA00027667247400001515
is the occurrence frequency of the dinucleotide xz at the j-th and (j+α+1)-th positions of all sequence samples in the negative class data set;
Figure BDA00027667247400001516
is the occurrence frequency of nucleotide x at the j-th position of all sequence samples in the negative class data set;
Figure BDA00027667247400001517
is the occurrence frequency of nucleotide z at the (j+α+1)-th position of all sequence samples in the negative class data set.
The backward pointwise mutual information value of a nucleotide of the RNA sequence to be encoded is determined in the negative class data set according to the following formula
Figure BDA00027667247400001518
Figure BDA00027667247400001519
where
Figure BDA00027667247400001520
is the occurrence frequency of the dinucleotide xy at the j-th and (j−α−1)-th positions of all sequence samples in the negative class data set;
Figure BDA0002766724740000161
is the occurrence frequency of nucleotide y at the (j−α−1)-th position of all sequence samples in the negative class data set.
The pointwise mutual information coding value of the nucleotide at the j-th position of the RNA sequence sample to be encoded in the negative class data set,
Figure BDA0002766724740000162
is defined from the forward pointwise mutual information value
Figure BDA0002766724740000163
and the backward pointwise mutual information value
Figure BDA0002766724740000164
The RNA sequence sample of length l is thereby encoded into a pointwise mutual information feature vector V- of length l−2α−2:
Figure BDA0002766724740000165
Figure BDA0002766724740000166
In this embodiment l is 51.
(3.3) For an RNA sequence sample to be encoded of given length l, the feature vector V is determined by subtracting the corresponding elements of the vectors V+ and V-:
V = [V_{α+2}, V_{α+3}, …, V_j, …, V_{l−α−1}]
Figure BDA0002766724740000167
(4) Feature combination
When the parameter α is 0, the feature vector V(0) is [V_2, V_3, V_4, …, V_{l−2}, V_{l−1}], with l−2 elements; when α is 1, the feature vector V(1) is [V_3, V_4, V_5, …, V_{l−3}, V_{l−2}], with l−4 elements; …; when α is (l−5)/2, the feature vector V((l−5)/2) is [V_{(l−1)/2}, V_{(l+1)/2}, V_{(l+3)/2}], with 3 elements; when α is (l−3)/2, the feature vector V((l−3)/2) is [V_{(l+1)/2}], with 1 element. The feature vectors determined by the different values of the parameter α are combined into a high-dimensional feature vector [V(0), V(1), …, V((l−5)/2), V((l−3)/2)] with (l−1)²/4 elements. In this embodiment l is 51.
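The element count (l−1)²/4 follows because the per-α lengths l−2, l−4, …, 3, 1 are the first (l−1)/2 odd numbers, whose sum is ((l−1)/2)². A short check of the index layout, with a hypothetical helper `feature_layout`:

```python
def feature_layout(l):
    """Return {alpha: [j, ...]}: the 1-based positions j contributing to V(alpha)."""
    if l % 2 == 0 or l < 3:
        raise ValueError("l must be an odd integer >= 3")
    # alpha runs over 0 .. (l-3)/2 and j over alpha+2 .. l-alpha-1.
    return {alpha: list(range(alpha + 2, l - alpha))
            for alpha in range((l - 3) // 2 + 1)}


layout = feature_layout(51)                      # l = 51 as in Example 2
total = sum(len(js) for js in layout.values())   # combined feature dimension
```

For l = 51 the combined vector has 625 features, and for l = 41 (Example 1) it has 400.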
(5) RNA sequence coding
Using steps (1) to (4) above, the RNA sequence data set D is encoded into the numerical data set D',
Figure BDA0002766724740000168
where s is the number of samples in the numerical data set D' and is a finite positive integer (2614 in this embodiment), and (l−1)²/4 is the number of features of the numerical data set D'. This completes the RNA sequence coding.
To verify the beneficial effects of the present invention, the inventors applied the RNA sequence coding method of Example 2 and seven comparative coding methods, namely PSNP (position-specific nucleotide propensity), PSDP (position-specific dinucleotide propensity), KNF (K-nucleotide frequencies), KSNPF (K-spaced nucleotide pair frequencies), NPPS (nucleotide pair position specificity), PBE (positional binary encoding) and NCPNC (nucleotide chemical property and nucleotide composition), to the recognition and prediction of N6-methyladenosine sites in yeast RNA, and compared the performance of the support vector machine models built on each coding method. The models were evaluated by the average classification accuracy (Accuracy), sensitivity (Sensitivity), specificity (Specificity), MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) under 10-fold cross-validation. The experimental method is as follows:
1. Encode the sequence samples of the yeast RNA N6-methyladenosine data set according to Example 2.
2. Data set normalization
The numerical data set D' is normalized by the min-max method as follows:
$$g'_{m,n} = \frac{g_{m,n} - \min(g_n)}{\max(g_n) - \min(g_n)}$$
where g_{m,n} is the n-th feature value of the m-th sample of the numerical data set D', g'_{m,n} is the normalized value of g_{m,n}, max(g_n) and min(g_n) are the maximum and minimum feature values in the n-th column of the numerical data set D', 1 ≤ m ≤ s, 1 ≤ n ≤ (l−1)²/4, and m and n are finite positive integers. In this embodiment l is 51 and s is 2614.
3. Partitioning a data set
The normalized numerical data set D' is divided into 10 parts by K-fold cross-validation (K = 10). Each part is taken in turn as the test set D'_Te and the remaining 9 parts as the training set D'_Tr, for 10 runs in total; in each run the ratio of the training set D'_Tr to the test set D'_Te is 9:1.
4. Training and testing the model
A support vector machine model is trained on the training set D'_Tr, and its performance is tested on the test set D'_Te.
The 7 comparative methods were run through the same Steps 2 to 4 of the experimental method for the recognition and prediction of N6-methyladenosine sites in yeast RNA. The average classification accuracy (Accuracy), sensitivity (Sensitivity), specificity (Specificity) and MCC are shown in Table 2.
The AUROC results are shown in FIG. 4, and the AUPRC results are shown in FIG. 5.
Table 2. Experimental results of the method of Example 2 compared with the 7 comparative methods
Figure BDA0002766724740000181
As can be seen from Table 2, the support vector machine model built on the RNA sequence coding method of the present invention achieves an accuracy, sensitivity, specificity and MCC of 0.905, 0.916, 0.894 and 0.810, respectively, for the recognition and prediction of N6-methyladenosine sites in yeast RNA, all higher than those of the other 7 comparative coding methods.
As can be seen from FIG. 4, the support vector machine model built on the RNA sequence coding method of the present invention achieves an AUROC of 0.968 for the recognition and prediction of N6-methyladenosine sites in yeast RNA, higher than that of the other 7 comparative coding methods.
As can be seen from FIG. 5, the support vector machine model built on the RNA sequence coding method of the present invention achieves an AUPRC of 0.967 for the recognition and prediction of N6-methyladenosine sites in yeast RNA, higher than that of the other 7 comparative coding methods.

Claims (1)

1. A method for coding DNA/RNA sequences with bidirectional dinucleotide position-specific preference and mutual information, which is characterized by comprising the following steps:
(1) construction of DNA/RNA sequence nucleotide position specificity preference matrix
Given a DNA/RNA sequence dataset D, the dataset consists of a positive class dataset and a negative class dataset;
determining a nucleotide position-specific preference matrix for a positive data set according to
Figure FDA0002766724730000011
Figure FDA0002766724730000012
where A, C, G and X are the 4 nucleotides of DNA/RNA, X denotes nucleotide T in a DNA sequence data set and nucleotide U in an RNA sequence data set, i is the nucleotide position, 1 ≤ i ≤ l, i is a finite positive integer, l is the nucleotide length of the DNA/RNA sequence samples and is an odd number, and
Figure FDA0002766724730000013
are the occurrence frequencies of nucleotides A, C, G and X, respectively, at the i-th position of all sequence samples in the positive class data set;
determining a nucleotide position-specific preference matrix for a negative class dataset as follows
Figure FDA0002766724730000014
Figure FDA0002766724730000015
where
Figure FDA0002766724730000016
are the occurrence frequencies of nucleotides A, C, G and X, respectively, at the i-th position of all sequence samples in the negative class data set;
(2) construction of DNA/RNA sequence bidirectional dinucleotide position specificity preference matrix
Determining a forward dinucleotide position-specific preference matrix for a positive class dataset as follows
Figure FDA0002766724730000017
Figure FDA0002766724730000021
where AA, AC, …, XX are the 16 dinucleotides formed from the 4 DNA/RNA nucleotides A, C, G and X, α is the distance between the two nucleotides, 0 ≤ α ≤ (l−3)/2, α is a finite non-negative integer, j is the nucleotide position, α+2 ≤ j ≤ l−α−1, j is a finite positive integer, and
Figure FDA0002766724730000022
are the occurrence frequencies of the dinucleotides AA, AC, …, XX, respectively, at the j-th and (j+α+1)-th positions of all sequence samples in the positive class data set;
determining a backward dinucleotide position-specific preference matrix for a positive data set according to
Figure FDA0002766724730000023
Figure FDA0002766724730000024
where
Figure FDA0002766724730000025
are the occurrence frequencies of the dinucleotides AA, AC, …, XX, respectively, at the j-th and (j−α−1)-th positions of all sequence samples in the positive class data set;
determining a forward dinucleotide position-specific preference matrix for the negative class dataset as follows
Figure FDA0002766724730000026
Figure FDA0002766724730000027
where
Figure FDA0002766724730000028
are the occurrence frequencies of the dinucleotides AA, AC, …, XX, respectively, at the j-th and (j+α+1)-th positions of all sequence samples in the negative class data set;
determining a backward dinucleotide position-specific preference matrix for a negative class dataset as follows
Figure FDA0002766724730000029
Figure FDA0002766724730000031
where
Figure FDA0002766724730000032
are the occurrence frequencies of the dinucleotides AA, AC, …, XX, respectively, at the j-th and (j−α−1)-th positions of all sequence samples in the negative class data set;
(3) determination of the pointwise mutual information values of DNA/RNA sequence nucleotides
(3.1) determining the forward pointwise mutual information value of a nucleotide of the DNA/RNA sequence to be encoded in the positive class data set according to the following formula
Figure FDA0002766724730000033
Figure FDA0002766724730000034
where x is the nucleotide at the j-th position, x ∈ {A, C, G, X}, z is the nucleotide at the (j+α+1)-th position, z ∈ {A, C, G, X},
Figure FDA0002766724730000035
is the frequency of the occurrence of the dinucleotide xz at the jth and jth + alpha +1 positions of all sequence samples of the positive dataset,
Figure FDA0002766724730000036
is the frequency of occurrence of nucleotide x at the jth position of all sequence samples in the positive data set,
Figure FDA0002766724730000037
is the frequency of occurrence of nucleotide z at position j + α +1 of all sequence samples of the positive dataset;
determining the backward pointwise mutual information value of a nucleotide of the DNA/RNA sequence to be encoded in the positive class data set according to the following formula
Figure FDA0002766724730000038
Figure FDA0002766724730000039
Wherein y is the nucleotide at position j- α -1, y ∈ { A, C, G, X },
Figure FDA00027667247300000310
the occurrence frequency of the dinucleotide xy at the jth and jth-alpha-1 position of all sequence samples of the positive type data set,
Figure FDA00027667247300000311
is the frequency of occurrence of nucleotide y at the j-alpha-1 position of all sequence samples in the positive dataset;
the pointwise mutual information coding value of the nucleotide at the j-th position of the DNA/RNA sequence sample to be encoded in the positive class data set,
Figure FDA00027667247300000312
is defined from the forward pointwise mutual information value
Figure FDA00027667247300000313
and the backward pointwise mutual information value
Figure FDA00027667247300000314
the DNA/RNA sequence sample of length l is thereby encoded into a pointwise mutual information feature vector V+ of length l−2α−2:
Figure FDA00027667247300000315
Figure FDA00027667247300000316
(3.2) determining the forward pointwise mutual information value of a nucleotide of the DNA/RNA sequence to be encoded in the negative class data set according to the following formula
Figure FDA0002766724730000041
Figure FDA00027667247300000413
where
Figure FDA0002766724730000042
is the occurrence frequency of the dinucleotide xz at the j-th and (j+α+1)-th positions of all sequence samples in the negative class data set,
Figure FDA00027667247300000414
is the occurrence frequency of nucleotide x at the j-th position of all sequence samples in the negative class data set,
Figure FDA00027667247300000415
is the occurrence frequency of nucleotide z at the (j+α+1)-th position of all sequence samples in the negative class data set;
determining the backward pointwise mutual information value of a nucleotide of the DNA/RNA sequence to be encoded in the negative class data set according to the following formula
Figure FDA0002766724730000043
Figure FDA0002766724730000044
where
Figure FDA0002766724730000045
is the occurrence frequency of the dinucleotide xy at the j-th and (j−α−1)-th positions of all sequence samples in the negative class data set,
Figure FDA0002766724730000046
is the occurrence frequency of nucleotide y at the (j−α−1)-th position of all sequence samples in the negative class data set;
the pointwise mutual information coding value of the nucleotide at the j-th position of the DNA/RNA sequence sample to be encoded in the negative class data set,
Figure FDA0002766724730000047
is defined from the forward pointwise mutual information value
Figure FDA0002766724730000048
and the backward pointwise mutual information value
Figure FDA0002766724730000049
the DNA/RNA sequence sample of length l is thereby encoded into a pointwise mutual information feature vector V- of length l−2α−2:
Figure FDA00027667247300000410
Figure FDA00027667247300000411
(3.3) for a DNA/RNA sequence sample to be encoded of given length l, the feature vector V is determined by subtracting the corresponding elements of the vectors V+ and V-:
V = [V_{α+2}, V_{α+3}, …, V_j, …, V_{l−α−1}]
Figure FDA00027667247300000412
(4) feature combination
when the parameter α is 0, the feature vector V(0) is [V_2, V_3, V_4, …, V_{l−2}, V_{l−1}], with l−2 elements; when α is 1, the feature vector V(1) is [V_3, V_4, V_5, …, V_{l−3}, V_{l−2}], with l−4 elements; …; when α is (l−5)/2, the feature vector V((l−5)/2) is [V_{(l−1)/2}, V_{(l+1)/2}, V_{(l+3)/2}], with 3 elements; when α is (l−3)/2, the feature vector V((l−3)/2) is [V_{(l+1)/2}], with 1 element; the feature vectors determined by the different values of the parameter α are combined into a high-dimensional feature vector [V(0), V(1), …, V((l−5)/2), V((l−3)/2)] with (l−1)²/4 elements;
(5) DNA/RNA sequence coding
Using the above-mentioned steps (1) to (4), the DNA/RNA sequence dataset D is encoded as a numerical dataset D',
Figure FDA0002766724730000051
where s is the number of samples in the numerical data set D' and is a finite positive integer, and (l−1)²/4 is the number of features of the numerical data set D'.
CN202011236138.3A 2020-11-09 2020-11-09 Bidirectional dinucleotide position specific preference and mutual information DNA/RNA sequence coding method Pending CN112365925A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011236138.3A CN112365925A (en) 2020-11-09 2020-11-09 Bidirectional dinucleotide position specific preference and mutual information DNA/RNA sequence coding method


Publications (1)

Publication Number Publication Date
CN112365925A true CN112365925A (en) 2021-02-12

Family

ID=74510310


Country Status (1)

Country Link
CN (1) CN112365925A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination