CN112365925A - Bidirectional dinucleotide position specific preference and mutual information DNA/RNA sequence coding method - Google Patents
- Publication number
- CN112365925A CN202011236138.3A
- Authority
- CN
- China
- Prior art keywords
- dna
- nucleotide
- alpha
- sequence
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
Abstract
A DNA/RNA sequence coding method based on bidirectional dinucleotide position-specific preference and pointwise mutual information comprises the steps of constructing a nucleotide position-specific preference matrix of the DNA/RNA sequences, constructing a bidirectional dinucleotide position-specific preference matrix of the DNA/RNA sequences, determining the pointwise mutual information values of the DNA/RNA sequence nucleotides, combining the features, and coding the DNA/RNA sequences. To extract more dinucleotide position information from DNA/RNA sequence data, a parameter α is introduced to express the distance between the two nucleotides of a dinucleotide, and the numerical feature vectors obtained for different values of α are combined into a global high-dimensional numerical feature vector, which performs very well in recognizing 4mC methylation sites in DNA and m6A methylation sites in RNA. The numerical feature data obtained by the invention contain more classification information, have low redundancy among features, and yield trained models with high recognition accuracy. The method can be used for coding DNA/RNA sequences.
Description
Technical Field
The invention belongs to the technical field of sequence data analysis, and particularly relates to a DNA/RNA sequence coding method.
Background
Machine learning techniques play a key role in the DNA/RNA methylation site recognition problem in the post-genomic era. In general, solving this problem with machine learning involves 3 main steps: sequence data encoding, model construction, and performance assessment. Sequence data encoding is the most important of these: extracting numerical features that carry more class-discriminating information from DNA/RNA sequence data is the key to accurately identifying DNA/RNA methylation sites.
Existing coding methods produce low-dimensional features and cannot extract the key discriminating information from DNA/RNA sequence data, so the prediction and recognition models built on them have low accuracy. Coding the DNA/RNA sequence data with several coding methods and combining the coded numerical features into high-dimensional features can compensate for the shortcomings of a single coding method, but it causes high redundancy among the numerical features and wastes computing resources, and the resulting improvement in the accuracy of the DNA/RNA methylation site recognition model is very limited. Therefore, developing a DNA/RNA sequence data coding method that effectively encodes DNA/RNA sequences into high-dimensional numerical features containing more class-discriminating information, with low redundancy among features, is the key to solving the problem and a research hot spot in this field.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a DNA/RNA sequence coding method based on bidirectional dinucleotide position-specific preference and pointwise mutual information, whose coded features contain more class-discriminating information and have low redundancy, and whose resulting models have high recognition and prediction accuracy.
The technical scheme adopted to solve the above technical problem comprises the following steps:
(1) Construction of the DNA/RNA sequence nucleotide position-specific preference matrix
Given a DNA/RNA sequence dataset D consisting of a positive class dataset and a negative class dataset, the nucleotide position-specific preference matrix of the positive class dataset is determined. Its entries are the occurrence frequencies of nucleotides A, C, G, X at the i-th position over all sequence samples in the positive class dataset, where A, C, G, X are the 4 nucleotides of DNA/RNA, X denotes nucleotide T in a DNA sequence dataset and nucleotide U in an RNA sequence dataset, i is the nucleotide position, 1 ≤ i ≤ l, i is a finite positive integer, l is the nucleotide length of a DNA/RNA sequence sample, and l is odd.
The nucleotide position-specific preference matrix of the negative class dataset is determined likewise. Its entries are the occurrence frequencies of nucleotides A, C, G, X at the i-th position over all sequence samples in the negative class dataset.
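Step (1) can be illustrated with a minimal Python sketch. The function name `position_nucleotide_freqs`, the list-of-dicts layout, and the toy sequences are assumptions made for illustration only; the patent does not prescribe a data structure, and its matrix notation is not reproduced here.

```python
from collections import Counter

def position_nucleotide_freqs(seqs, alphabet="ACGT"):
    """Return freqs[i][x]: fraction of sequences with nucleotide x at 0-based position i."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        # Nucleotides that never occur at position i get frequency 0.0.
        freqs.append({x: counts[x] / len(seqs) for x in alphabet})
    return freqs

# Toy data (hypothetical, not taken from the patent's datasets).
positive = ["ACGTA", "ACGTT", "TCGAA"]
negative = ["GGCTA", "ACCTA", "TGCAT"]
pos_freqs = position_nucleotide_freqs(positive)  # one matrix per class
neg_freqs = position_nucleotide_freqs(negative)
```

For RNA data the same sketch applies with `alphabet="ACGU"`.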
(2) Construction of the DNA/RNA sequence bidirectional dinucleotide position-specific preference matrix
The forward dinucleotide position-specific preference matrix of the positive class dataset is determined. Its entries are the occurrence frequencies of the dinucleotides AA, AC, …, XX at positions j and j+α+1 over all sequence samples in the positive class dataset, where AA, AC, …, XX are the 16 dinucleotides formed from the 4 nucleotides A, C, G, X of DNA/RNA, α is the distance between the two nucleotides, 0 ≤ α ≤ (l-3)/2, α is a non-negative integer, j is the nucleotide position, α+2 ≤ j ≤ l-α-1, and j is a finite positive integer.
The backward dinucleotide position-specific preference matrix of the positive class dataset is determined. Its entries are the occurrence frequencies of the dinucleotides AA, AC, …, XX at positions j and j-α-1 over all sequence samples in the positive class dataset.
The forward dinucleotide position-specific preference matrix of the negative class dataset is determined. Its entries are the occurrence frequencies of the dinucleotides AA, AC, …, XX at positions j and j+α+1 over all sequence samples in the negative class dataset.
The backward dinucleotide position-specific preference matrix of the negative class dataset is determined. Its entries are the occurrence frequencies of the dinucleotides AA, AC, …, XX at positions j and j-α-1 over all sequence samples in the negative class dataset.
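The forward and backward dinucleotide frequencies of step (2) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function name `dinucleotide_freqs`, the `direction` argument, and the dictionary layout keyed by the 1-based position j are all assumptions.

```python
from collections import Counter

def dinucleotide_freqs(seqs, alpha, direction="forward"):
    """Map each 1-based position j (alpha+2 <= j <= l-alpha-1) to the frequencies
    of the dinucleotide formed by position j and its partner position:
    j+alpha+1 for "forward", j-alpha-1 for "backward"."""
    l = len(seqs[0])
    step = alpha + 1 if direction == "forward" else -(alpha + 1)
    table = {}
    for j in range(alpha + 2, l - alpha):  # upper bound l-alpha-1 is inclusive
        # First letter is the nucleotide at j, second letter is the partner.
        counts = Counter(s[j - 1] + s[j - 1 + step] for s in seqs)
        table[j] = {pair: c / len(seqs) for pair, c in counts.items()}
    return table

seqs = ["ACGTA", "ACGTT", "TCGAA"]  # toy data, hypothetical
fwd0 = dinucleotide_freqs(seqs, alpha=0, direction="forward")
bwd0 = dinucleotide_freqs(seqs, alpha=0, direction="backward")
fwd1 = dinucleotide_freqs(seqs, alpha=1, direction="forward")
```

Calling the function once per class and per direction yields the four matrices described above for a given α.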
(3) Determination of the pointwise mutual information values of DNA/RNA sequence nucleotides
(3.1) The forward pointwise mutual information value of a nucleotide of the DNA/RNA sequence to be encoded is determined in the positive class dataset from three quantities: the occurrence frequency of the dinucleotide xz at positions j and j+α+1 over all sequence samples of the positive class dataset, the occurrence frequency of nucleotide x at position j over all sequence samples of the positive class dataset, and the occurrence frequency of nucleotide z at position j+α+1 over all sequence samples of the positive class dataset. Here x is the nucleotide at position j, x ∈ {A, C, G, X}, and z is the nucleotide at position j+α+1, z ∈ {A, C, G, X}.
The backward pointwise mutual information value of a nucleotide of the DNA/RNA sequence to be encoded is determined in the positive class dataset from the occurrence frequency of the dinucleotide xy at positions j and j-α-1 over all sequence samples of the positive class dataset, the occurrence frequency of nucleotide x at position j, and the occurrence frequency of nucleotide y at position j-α-1 over all sequence samples of the positive class dataset. Here y is the nucleotide at position j-α-1, y ∈ {A, C, G, X}.
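For orientation, the standard definition of pointwise mutual information, applied to the three frequencies named above, can be written as below. This is a sketch under stated assumptions: the symbols $f^{+}$ and $p^{+}$, the superscripts F and B, and the unspecified logarithm base are notation chosen for this note and are not taken from the patent's own formulas.

```latex
\mathrm{PMI}^{+,F}_{j}(x,z) = \log \frac{f^{+}_{xz}(j,\; j+\alpha+1)}{p^{+}_{x}(j)\, p^{+}_{z}(j+\alpha+1)},
\qquad
\mathrm{PMI}^{+,B}_{j}(x,y) = \log \frac{f^{+}_{xy}(j,\; j-\alpha-1)}{p^{+}_{x}(j)\, p^{+}_{y}(j-\alpha-1)}
```

Here $f^{+}_{xz}(j, j+\alpha+1)$ stands for the dinucleotide frequency in the positive class dataset and $p^{+}_{x}(j)$ for the single-nucleotide frequency. The negative class values follow by replacing the positive class frequencies with the negative class ones.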
The pointwise mutual information coding value, in the positive class dataset, of the nucleotide at position j of a DNA/RNA sequence sample to be encoded is defined from the forward pointwise mutual information value and the backward pointwise mutual information value. A DNA/RNA sequence sample of length l is thereby encoded as a pointwise mutual information feature vector V+ of length l-2α-2:
$V^{+} = [V^{+}_{\alpha+2}, V^{+}_{\alpha+3}, \ldots, V^{+}_{l-\alpha-1}]$
(3.2) The forward pointwise mutual information value of a nucleotide of the DNA/RNA sequence to be encoded is determined in the negative class dataset from the occurrence frequency of the dinucleotide xz at positions j and j+α+1 over all sequence samples in the negative class dataset, the occurrence frequency of nucleotide x at position j over all sequence samples in the negative class dataset, and the occurrence frequency of nucleotide z at position j+α+1 over all sequence samples in the negative class dataset.
The backward pointwise mutual information value of a nucleotide of the DNA/RNA sequence to be encoded is determined in the negative class dataset from the occurrence frequency of the dinucleotide xy at positions j and j-α-1 over all sequence samples in the negative class dataset, the occurrence frequency of nucleotide x at position j, and the occurrence frequency of nucleotide y at position j-α-1 over all sequence samples in the negative class dataset.
The pointwise mutual information coding value, in the negative class dataset, of the nucleotide at position j of a DNA/RNA sequence sample to be encoded is defined from the forward pointwise mutual information value and the backward pointwise mutual information value. A DNA/RNA sequence sample of length l is thereby encoded as a pointwise mutual information feature vector V- of length l-2α-2:
$V^{-} = [V^{-}_{\alpha+2}, V^{-}_{\alpha+3}, \ldots, V^{-}_{l-\alpha-1}]$
(3.3) For a DNA/RNA sequence sample to be encoded of given length l, the feature vector V is determined by subtracting the corresponding elements of the vectors V+ and V-:
$V = [V_{\alpha+2}, V_{\alpha+3}, \ldots, V_{l-\alpha-1}]$
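Step (3) can be sketched end to end for a single value of α. The sketch below rests on several assumptions that are not stated in the text: a base-2 logarithm, a small smoothing constant `eps` to avoid taking the log of zero, the forward and backward values being combined by summation, and the difference being taken as V+ minus V-. The helper names `_freq`, `pmi`, and `encode` are hypothetical.

```python
import math
from collections import Counter

def _freq(seqs, positions):
    """Frequency of each nucleotide tuple observed at the given 0-based positions."""
    counts = Counter(tuple(s[p] for p in positions) for s in seqs)
    return {key: c / len(seqs) for key, c in counts.items()}

def pmi(seqs, sample, j, k, eps=1e-6):
    """PMI of the sample's nucleotides at 1-based positions j and k, with
    frequencies estimated from seqs. eps is an ASSUMED smoothing constant."""
    x, z = sample[j - 1], sample[k - 1]
    joint = _freq(seqs, (j - 1, k - 1)).get((x, z), 0.0)
    px = _freq(seqs, (j - 1,)).get((x,), 0.0)
    pz = _freq(seqs, (k - 1,)).get((z,), 0.0)
    return math.log2((joint + eps) / (px * pz + eps))  # ASSUMPTION: log base 2

def encode(sample, pos, neg, alpha):
    """Feature vector V for one gap alpha, of length l - 2*alpha - 2."""
    l = len(sample)
    vec = []
    for j in range(alpha + 2, l - alpha):  # j = alpha+2 .. l-alpha-1
        per_class = []
        for seqs in (pos, neg):
            fwd = pmi(seqs, sample, j, j + alpha + 1)
            bwd = pmi(seqs, sample, j, j - alpha - 1)
            per_class.append(fwd + bwd)  # ASSUMPTION: forward and backward are summed
        vec.append(per_class[0] - per_class[1])  # ASSUMPTION: V+ minus V-
    return vec

pos = ["ACGTA", "ACGTT", "TCGAA"]  # toy data, hypothetical
neg = ["GGCTA", "ACCTA", "TGCAT"]
v0 = encode("ACGTA", pos, neg, alpha=0)
v1 = encode("ACGTA", pos, neg, alpha=1)
```

The sketch recomputes frequencies on every call for readability; a practical version would precompute the preference matrices of steps (1) and (2) once per class.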
(4) Feature combination
When α = 0, the feature vector is $V(0) = [V_2, V_3, V_4, \ldots, V_{l-2}, V_{l-1}]$, with l-2 elements. When α = 1, the feature vector is $V(1) = [V_3, V_4, V_5, \ldots, V_{l-3}, V_{l-2}]$, with l-4 elements. Continuing in this way, when α = (l-5)/2 the feature vector is $V((l-5)/2) = [V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}]$, with 3 elements, and when α = (l-3)/2 the feature vector is $V((l-3)/2) = [V_{(l+1)/2}]$, with 1 element. The feature vectors determined by the different values of α are combined into a high-dimensional feature vector with $(l-1)^2/4$ elements: $[V(0), V(1), \ldots, V((l-5)/2), V((l-3)/2)]$.
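The element count follows from summing the odd numbers 1, 3, …, l-2, which gives ((l-1)/2)² = (l-1)²/4. A short check of this arithmetic, with the hypothetical helper name `feature_lengths`:

```python
def feature_lengths(l):
    """Lengths l - 2*alpha - 2 of V(alpha) for alpha = 0 .. (l-3)/2."""
    if l < 3 or l % 2 == 0:
        raise ValueError("l must be an odd integer >= 3")
    return [l - 2 * alpha - 2 for alpha in range((l - 3) // 2 + 1)]

lengths_41 = feature_lengths(41)
total_41 = sum(lengths_41)               # equals (41 - 1) ** 2 // 4
total_51 = sum(feature_lengths(51))      # equals (51 - 1) ** 2 // 4
```

For the sample lengths used in the examples, l = 41 yields 400 features and l = 51 yields 625.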
(5) DNA/RNA sequence coding
Using steps (1) to (4), the DNA/RNA sequence dataset D is encoded as a numerical dataset D', where s is the number of samples in D', s is a finite positive integer, and (l-1)²/4 is the number of features of D'.
To encode DNA/RNA sequence data into numerical data that contain more class-discriminating information with low redundancy among features, the invention proposes a bidirectional dinucleotide position-specific preference method: based on the nucleotide position-specific preference matrices and the bidirectional dinucleotide position-specific preference matrices of the positive and negative class DNA/RNA sequence data, DNA/RNA sequence samples are encoded into numerical feature samples by the pointwise mutual information method. To extract more dinucleotide position information from a DNA/RNA sequence sample, a parameter α expressing the dinucleotide spacing is introduced when constructing the bidirectional dinucleotide position-specific preference matrices, and the numerical feature vectors determined by the different values of α are combined into a global high-dimensional numerical feature vector that contains more class-discriminating information and has low redundancy among features. The coding method of the invention was compared with 7 existing coding methods in simulation experiments. The results show that the support vector machine model built on the coding method of the invention reaches accuracy, sensitivity, specificity, MCC, AUROC, and AUPRC of 0.905, 0.902, 0.909, 0.811, 0.967, and 0.966, respectively, for N4-methylcytosine (4mC) site recognition in Caenorhabditis elegans DNA, all higher than the other 7 comparison coding methods. For N6-methyladenosine (m6A) site recognition in yeast RNA, the support vector machine model built on the coding method of the invention reaches accuracy, sensitivity, specificity, MCC, AUROC, and AUPRC of 0.905, 0.916, 0.894, 0.810, 0.968, and 0.967, respectively, all higher than the other 7 comparison coding methods.
Drawings
FIG. 1 is a flow chart of the method of Example 1 of the present invention.
FIG. 2 shows the AUROC curves of the support vector machine models built on the coding method of the invention and on the 7 comparison coding methods for N4-methylcytosine site recognition in Caenorhabditis elegans DNA.
FIG. 3 shows the AUPRC curves of the support vector machine models built on the coding method of the invention and on the 7 comparison coding methods for N4-methylcytosine site recognition in Caenorhabditis elegans DNA.
FIG. 4 shows the AUROC curves of the support vector machine models built on the coding method of the invention and on the 7 comparison coding methods for N6-methyladenosine site recognition in yeast RNA.
FIG. 5 shows the AUPRC curves of the support vector machine models built on the coding method of the invention and on the 7 comparison coding methods for N6-methyladenosine site recognition in yeast RNA.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and examples, but the present invention is not limited to the embodiments described below.
Example 1
This example uses the N4-methylcytosine (4mC) dataset of C. elegans DNA from the document "iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties". The dataset has 3108 DNA sequence samples: 1554 positive class samples (true N4-methylcytosine sites) and 1554 negative class samples (not N4-methylcytosine sites). The length l of each sequence sample is 41. The bidirectional dinucleotide position-specific preference and pointwise mutual information DNA sequence coding method of this example consists of the following steps (see FIG. 1):
(1) Construction of the DNA sequence nucleotide position-specific preference matrix
Given a DNA sequence dataset D consisting of a positive class dataset and a negative class dataset, the nucleotide position-specific preference matrix of the positive class dataset is determined. Its entries are the occurrence frequencies of nucleotides A, C, G, T at the i-th position over all sequence samples in the positive class dataset, where A, C, G, T are the 4 nucleotides of DNA, i is the nucleotide position, 1 ≤ i ≤ l, i is a finite positive integer, l is the nucleotide length of a DNA sequence sample, and l is odd. In this embodiment l is 41.
The nucleotide position-specific preference matrix of the negative class dataset is determined likewise. Its entries are the occurrence frequencies of nucleotides A, C, G, T at the i-th position over all sequence samples in the negative class dataset.
(2) Construction of the DNA sequence bidirectional dinucleotide position-specific preference matrix
The forward dinucleotide position-specific preference matrix of the positive class dataset is determined. Its entries are the occurrence frequencies of the dinucleotides AA, AC, …, TT at positions j and j+α+1 over all sequence samples in the positive class dataset, where AA, AC, …, TT are the 16 dinucleotides formed from the 4 nucleotides A, C, G, T of DNA, α is the distance between the two nucleotides, 0 ≤ α ≤ (l-3)/2, α is a non-negative integer, j is the nucleotide position, α+2 ≤ j ≤ l-α-1, and j is a finite positive integer. In this embodiment 0 ≤ α ≤ 19 and α+2 ≤ j ≤ 40-α.
The backward dinucleotide position-specific preference matrix of the positive class dataset is determined. Its entries are the occurrence frequencies of the dinucleotides AA, AC, …, TT at positions j and j-α-1 over all sequence samples in the positive class dataset.
The forward dinucleotide position-specific preference matrix of the negative class dataset is determined. Its entries are the occurrence frequencies of the dinucleotides AA, AC, …, TT at positions j and j+α+1 over all sequence samples in the negative class dataset.
The backward dinucleotide position-specific preference matrix of the negative class dataset is determined. Its entries are the occurrence frequencies of the dinucleotides AA, AC, …, TT at positions j and j-α-1 over all sequence samples in the negative class dataset.
(3) Determination of the pointwise mutual information values of DNA sequence nucleotides
(3.1) The forward pointwise mutual information value of a nucleotide of the DNA sequence to be encoded is determined in the positive class dataset from the occurrence frequency of the dinucleotide xz at positions j and j+α+1 over all sequence samples of the positive class dataset, the occurrence frequency of nucleotide x at position j over all sequence samples of the positive class dataset, and the occurrence frequency of nucleotide z at position j+α+1 over all sequence samples of the positive class dataset. Here x is the nucleotide at position j, x ∈ {A, C, G, T}, and z is the nucleotide at position j+α+1, z ∈ {A, C, G, T}.
The backward pointwise mutual information value of a nucleotide of the DNA sequence to be encoded is determined in the positive class dataset from the occurrence frequency of the dinucleotide xy at positions j and j-α-1 over all sequence samples of the positive class dataset, the occurrence frequency of nucleotide x at position j, and the occurrence frequency of nucleotide y at position j-α-1 over all sequence samples of the positive class dataset. Here y is the nucleotide at position j-α-1, y ∈ {A, C, G, T}.
The pointwise mutual information coding value, in the positive class dataset, of the nucleotide at position j of a DNA sequence sample to be encoded is defined from the forward pointwise mutual information value and the backward pointwise mutual information value. A DNA sequence sample of length l is thereby encoded as a pointwise mutual information feature vector V+ of length l-2α-2:
$V^{+} = [V^{+}_{\alpha+2}, V^{+}_{\alpha+3}, \ldots, V^{+}_{l-\alpha-1}]$
In this embodiment l is 41.
(3.2) The forward pointwise mutual information value of a nucleotide of the DNA sequence to be encoded is determined in the negative class dataset from the occurrence frequency of the dinucleotide xz at positions j and j+α+1 over all sequence samples in the negative class dataset, the occurrence frequency of nucleotide x at position j over all sequence samples in the negative class dataset, and the occurrence frequency of nucleotide z at position j+α+1 over all sequence samples in the negative class dataset.
The backward pointwise mutual information value of a nucleotide of the DNA sequence to be encoded is determined in the negative class dataset from the occurrence frequency of the dinucleotide xy at positions j and j-α-1 over all sequence samples in the negative class dataset, the occurrence frequency of nucleotide x at position j, and the occurrence frequency of nucleotide y at position j-α-1 over all sequence samples in the negative class dataset.
The pointwise mutual information coding value, in the negative class dataset, of the nucleotide at position j of a DNA sequence sample to be encoded is defined from the forward pointwise mutual information value and the backward pointwise mutual information value. A DNA sequence sample of length l is thereby encoded as a pointwise mutual information feature vector V- of length l-2α-2:
$V^{-} = [V^{-}_{\alpha+2}, V^{-}_{\alpha+3}, \ldots, V^{-}_{l-\alpha-1}]$
In this embodiment l is 41.
(3.3) For a DNA sequence sample to be encoded of given length l, the feature vector V is determined by subtracting the corresponding elements of the vectors V+ and V-:
$V = [V_{\alpha+2}, V_{\alpha+3}, \ldots, V_{l-\alpha-1}]$
(4) Feature combination
When α = 0, the feature vector is $V(0) = [V_2, V_3, V_4, \ldots, V_{l-2}, V_{l-1}]$, with l-2 elements. When α = 1, the feature vector is $V(1) = [V_3, V_4, V_5, \ldots, V_{l-3}, V_{l-2}]$, with l-4 elements. Continuing in this way, when α = (l-5)/2 the feature vector is $V((l-5)/2) = [V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}]$, with 3 elements, and when α = (l-3)/2 the feature vector is $V((l-3)/2) = [V_{(l+1)/2}]$, with 1 element. The feature vectors determined by the different values of α are combined into a high-dimensional feature vector with $(l-1)^2/4$ elements: $[V(0), V(1), \ldots, V((l-5)/2), V((l-3)/2)]$. In this embodiment l is 41, so the combined feature vector has 400 elements.
(5) DNA sequence coding
Using steps (1) to (4), the DNA sequence dataset D is encoded as a numerical dataset D', where s is the number of samples in D', s is a finite positive integer, and (l-1)²/4 is the number of features of D'. In this embodiment s is 3108. This completes the DNA sequence coding.
To verify the beneficial effects of the present invention, the inventors applied the DNA sequence coding method of Example 1 and the PSNP (position-specific nucleotide propensity), PSDP (position-specific dinucleotide propensity), KNF (K-nucleotide frequencies), KSNPF (K-spaced nucleotide pair frequencies), NPPS (nucleotide pair position specificity), PBE (positional binary encoding), and NCPNC coding methods to the recognition and prediction of N4-methylcytosine sites in C. elegans DNA, and compared the performance of the support vector machine models built on each coding method. The evaluation metrics, averaged over 10-fold cross-validation, were classification accuracy (Accuracy), sensitivity (Sensitivity), specificity (Specificity), MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve), and AUPRC (area under the precision-recall curve). The experimental method is as follows:
1. The sequence samples of the N4-methylcytosine dataset of C. elegans DNA are encoded according to Example 1.
2. Data set normalization
The numerical dataset D' is normalized by the min-max method:
$g'_{m,n} = \dfrac{g_{m,n} - \min(g_n)}{\max(g_n) - \min(g_n)}$
where $g_{m,n}$ is the n-th feature value of the m-th sample of the numerical dataset D', $g'_{m,n}$ is the normalized value of $g_{m,n}$, $\max(g_n)$ and $\min(g_n)$ are the maximum and minimum feature values in the n-th column of D', 1 ≤ m ≤ s, 1 ≤ n ≤ (l-1)²/4, and m and n are finite positive integers. In this embodiment l is 41 and s is 3108.
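A minimal sketch of column-wise min-max normalization follows, assuming the usual rescaling of each feature column to the range [0, 1]. The function name `min_max_normalize` and the handling of constant columns (mapped to 0.0) are assumptions of this sketch, since the text does not address that case.

```python
def min_max_normalize(rows):
    """Rescale every column of a list-of-rows matrix to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        # ASSUMPTION: a constant column (max == min) is mapped to 0.0.
        [(v - a) / (b - a) if b > a else 0.0 for v, a, b in zip(row, lo, hi)]
        for row in rows
    ]

toy = [[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]]  # 3 samples, 2 features (hypothetical)
normalized = min_max_normalize(toy)
```

In a cross-validation setting, computing the column minima and maxima on the training folds only is a common way to avoid leaking test information; the text normalizes D' as a whole before partitioning.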
3. Partitioning the data set
The normalized numerical dataset D' is divided into 10 parts by the K-fold cross-validation method (K = 10). Each part in turn serves as the test set D'Te and the remaining 9 parts serve as the training set D'Tr, for 10 runs in total. In each run the ratio of the training set D'Tr to the test set D'Te is 9:1.
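The partitioning can be sketched with the standard library alone. The generator name `k_fold_indices`, the shuffling step, and the fixed seed are assumptions of this sketch; the text does not say whether samples are shuffled or stratified before splitting.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)        # ASSUMPTION: shuffle before splitting
    folds = [idx[i::k] for i in range(k)]   # k nearly equal-sized parts
    for i in range(k):
        test = folds[i]
        train = [t for f_i, fold in enumerate(folds) if f_i != i for t in fold]
        yield train, test

splits = list(k_fold_indices(3108, k=10))   # 3108 samples as in Example 1
```

With 3108 samples the folds hold 310 or 311 samples each, so the 9:1 ratio is approximate rather than exact.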
4. Training and testing the model
The support vector machine model is trained with the training set D'Tr, and its performance is tested with the test set D'Te.
For the 7 comparison coding methods, experiments were carried out by the same procedure, following steps 2 to 4 of the experimental method, to recognize and predict N4-methylcytosine sites in C. elegans DNA. The average classification accuracy (Accuracy), sensitivity (Sensitivity), specificity (Specificity), and MCC are shown in Table 1, the AUROC results are shown in FIG. 2, and the AUPRC results are shown in FIG. 3.
Table 1 Experimental results of the method of Example 1 compared with the 7 comparison methods
As can be seen from Table 1, the support vector machine model built on the DNA sequence coding method of the present invention reaches accuracy, sensitivity, specificity, and MCC of 0.905, 0.902, 0.909, and 0.811, respectively, for N4-methylcytosine site recognition in C. elegans DNA, all higher than the other 7 comparison coding methods.
As can be seen from FIG. 2, the support vector machine model built on the DNA sequence coding method of the present invention reaches an AUROC of 0.967 for N4-methylcytosine site recognition in C. elegans DNA, higher than the other 7 comparison coding methods.
As can be seen from FIG. 3, the support vector machine model built on the DNA sequence coding method of the present invention reaches an AUPRC of 0.966 for N4-methylcytosine site recognition in C. elegans DNA, higher than the other 7 comparison coding methods.
Example 2
Take as an example the N6-methyladenosine (m6A) data set of yeast RNA from the document "Benchmark data for identifying N6-methyladenosine sites in the Saccharomyces cerevisiae genome". The data set contains 2614 RNA sequence samples: 1307 positive class samples, which are true N6-methyladenosine sites, and 1307 negative class samples, which are not N6-methyladenosine sites. Each sequence sample has length l = 51. The bidirectional dinucleotide position-specific preference and mutual information RNA sequence coding method of this example consists of the following steps (see FIG. 1).
(1) Construction of RNA sequence nucleotide position specificity preference matrix
Given an RNA sequence dataset D, the dataset consists of a positive class dataset and a negative class dataset.
where A, C, G, U are the 4 nucleotides of RNA; i is the nucleotide position, 1 ≤ i ≤ l, and i is a finite positive integer; l is the nucleotide length of an RNA sequence sample and is odd, with l = 51 in this embodiment; and the matrix entries are the occurrence frequencies of nucleotides A, C, G, U at the i-th position of all sequence samples in the positive class dataset.
The nucleotide position-specific preference matrix of the negative class dataset is determined as follows:
where the matrix entries are the occurrence frequencies of nucleotides A, C, G, U at the i-th position of all sequence samples in the negative class dataset.
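As a rough illustration of step (1), the sketch below computes a 4 × l table of per-position nucleotide frequencies for one class of samples. The function name `position_frequencies` and the dictionary layout are assumptions of this sketch; the patent's own matrix notation is not reproduced in the text. All sequences are assumed to have the same length and to contain only characters from the given alphabet.

```python
def position_frequencies(sequences, alphabet="ACGU"):
    """Return {nucleotide: [frequency at position 1, ..., frequency at position l]}."""
    length = len(sequences[0])
    total = len(sequences)
    counts = {nt: [0] * length for nt in alphabet}
    for seq in sequences:
        for i, nt in enumerate(seq):
            counts[nt][i] += 1
    # Convert raw counts into occurrence frequencies over all samples of the class.
    return {nt: [c / total for c in row] for nt, row in counts.items()}


if __name__ == "__main__":
    positive = ["ACG", "AUG", "CCG", "AAG"]
    for nt, row in position_frequencies(positive).items():
        print(nt, row)
```

The same function would be called once on the positive class samples and once on the negative class samples, with `alphabet="ACGT"` for DNA.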
(2) Construction of RNA sequence bidirectional dinucleotide position specificity preference matrix
The forward dinucleotide position-specific preference matrix of the positive class dataset is determined as follows:
where AA, AC, …, UU are the 16 dinucleotides formed from the 4 RNA nucleotides A, C, G, U; α is the interval between the two nucleotides, 0 ≤ α ≤ (l-3)/2, and α is a finite non-negative integer, with 0 ≤ α ≤ 24 in this embodiment; j is the nucleotide position, α+2 ≤ j ≤ l-α-1, and j is a finite positive integer, with α+2 ≤ j ≤ 50-α in this embodiment; and the matrix entries are the occurrence frequencies of dinucleotides AA, AC, …, UU at the j-th and (j+α+1)-th positions of all sequence samples in the positive class dataset.
The backward dinucleotide position-specific preference matrix of the positive class dataset is determined as follows:
where the matrix entries are the occurrence frequencies of dinucleotides AA, AC, …, UU at the j-th and (j-α-1)-th positions of all sequence samples in the positive class dataset.
The forward dinucleotide position-specific preference matrix of the negative class dataset is determined as follows:
where the matrix entries are the occurrence frequencies of dinucleotides AA, AC, …, UU at the j-th and (j+α+1)-th positions of all sequence samples in the negative class dataset.
The backward dinucleotide position-specific preference matrix of the negative class dataset is determined as follows:
where the matrix entries are the occurrence frequencies of dinucleotides AA, AC, …, UU at the j-th and (j-α-1)-th positions of all sequence samples in the negative class dataset.
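The forward and backward dinucleotide tables of step (2) can be sketched as below. The function name `dinucleotide_frequencies`, the `direction` argument, and the nested-dictionary output are assumptions of this sketch. Positions follow the 1-based convention of the text, with j running from α+2 to l-α-1; the pair is written as the nucleotide at position j followed by its partner.

```python
from collections import Counter


def dinucleotide_frequencies(sequences, alpha, direction="forward"):
    """Return {j: {dinucleotide: frequency}} for 1-based j in alpha+2 .. l-alpha-1."""
    length = len(sequences[0])
    total = len(sequences)
    # Forward pairs position j with j+alpha+1, backward pairs j with j-alpha-1.
    step = alpha + 1 if direction == "forward" else -(alpha + 1)
    table = {}
    for j in range(alpha + 2, length - alpha):
        counts = Counter(seq[j - 1] + seq[j - 1 + step] for seq in sequences)
        table[j] = {pair: c / total for pair, c in counts.items()}
    return table


if __name__ == "__main__":
    samples = ["ACGUA", "ACGUC", "AGGUA"]
    print(dinucleotide_frequencies(samples, alpha=0, direction="forward"))
    print(dinucleotide_frequencies(samples, alpha=0, direction="backward"))
```

Dinucleotides that never occur at a position are simply absent from the inner dictionary, which corresponds to a frequency of 0.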
(3) Determination of mutual information values of nucleotides of RNA sequences
(3.1) The forward pointwise mutual information value, in the positive class dataset, of a nucleotide of the RNA sequence to be encoded is determined as follows:
where x is the nucleotide at the j-th position, x ∈ {A, C, G, U}; z is the nucleotide at the (j+α+1)-th position, z ∈ {A, C, G, U}; and the three frequencies in the formula are, respectively, the occurrence frequency of dinucleotide xz at the j-th and (j+α+1)-th positions, the occurrence frequency of nucleotide x at the j-th position, and the occurrence frequency of nucleotide z at the (j+α+1)-th position, each taken over all sequence samples of the positive class dataset.
The backward pointwise mutual information value, in the positive class dataset, of a nucleotide of the RNA sequence to be encoded is determined as follows:
where y is the nucleotide at the (j-α-1)-th position, y ∈ {A, C, G, U}; and the frequencies in the formula are the occurrence frequency of dinucleotide xy at the j-th and (j-α-1)-th positions and the occurrence frequency of nucleotide y at the (j-α-1)-th position, each taken over all sequence samples of the positive class dataset.
The pointwise mutual information coding value, in the positive class dataset, of the nucleotide at the j-th position of the RNA sequence sample to be encoded is defined in terms of the forward pointwise mutual information value and the backward pointwise mutual information value. The RNA sequence sample of length l is thereby encoded into a pointwise mutual information feature vector V+ of length l-2α-2:
The value of l in this example is 51.
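The per-class encoding described in (3.1) can be sketched as follows. Several details are assumptions because the formulas are not reproduced in the text: the sketch uses the standard pointwise mutual information log(f_pair / (f_first · f_second)) with a natural logarithm, maps a zero pair frequency to 0.0, and combines the forward and backward values by summing them. The names `pmi`, `freq`, and `encode_pmi` are hypothetical.

```python
import math


def pmi(f_pair, f_first, f_second):
    """Standard pointwise mutual information (assumed form, natural log)."""
    if f_pair == 0.0:
        return 0.0  # Assumption: unseen pairs contribute 0.
    return math.log(f_pair / (f_first * f_second))


def freq(sequences, conditions):
    """Fraction of sequences matching every (0-based position, nucleotide) pair."""
    hits = sum(all(seq[p] == nt for p, nt in conditions) for seq in sequences)
    return hits / len(sequences)


def encode_pmi(sample, sequences, alpha):
    """Encode one sample against one class data set; returns l-2*alpha-2 values."""
    length = len(sample)
    vector = []
    for j in range(alpha + 2, length - alpha):  # 1-based j = alpha+2 .. l-alpha-1
        here, fwd, bwd = j - 1, j + alpha, j - alpha - 2  # 0-based indices
        x, z, y = sample[here], sample[fwd], sample[bwd]
        forward = pmi(freq(sequences, [(here, x), (fwd, z)]),
                      freq(sequences, [(here, x)]), freq(sequences, [(fwd, z)]))
        backward = pmi(freq(sequences, [(here, x), (bwd, y)]),
                       freq(sequences, [(here, x)]), freq(sequences, [(bwd, y)]))
        # ASSUMPTION: forward and backward values are summed; the source
        # does not show how they are combined.
        vector.append(forward + backward)
    return vector


if __name__ == "__main__":
    positive = ["ACGUACG", "AAGUACC", "ACGUUCG"]
    print(encode_pmi("ACGUACG", positive, alpha=1))
```

Calling `encode_pmi` with the positive class samples gives a vector playing the role of V+, and calling it with the negative class samples gives the counterpart V-.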
(3.2) The forward pointwise mutual information value, in the negative class dataset, of a nucleotide of the RNA sequence to be encoded is determined as follows:
where the three frequencies in the formula are, respectively, the occurrence frequency of dinucleotide xz at the j-th and (j+α+1)-th positions, the occurrence frequency of nucleotide x at the j-th position, and the occurrence frequency of nucleotide z at the (j+α+1)-th position, each taken over all sequence samples of the negative class dataset.
The backward pointwise mutual information value, in the negative class dataset, of a nucleotide of the RNA sequence to be encoded is determined as follows:
where the frequencies in the formula are the occurrence frequency of dinucleotide xy at the j-th and (j-α-1)-th positions and the occurrence frequency of nucleotide y at the (j-α-1)-th position, each taken over all sequence samples of the negative class dataset.
The pointwise mutual information coding value, in the negative class dataset, of the nucleotide at the j-th position of the RNA sequence sample to be encoded is defined in terms of the forward pointwise mutual information value and the backward pointwise mutual information value. The RNA sequence sample of length l is thereby encoded into a pointwise mutual information feature vector V- of length l-2α-2:
The value of l in this example is 51.
(3.3) For a given RNA sequence sample of length l to be encoded, the feature vector V is determined by subtracting the corresponding elements of vectors V+ and V-:
V = [V_{α+2}, V_{α+3}, …, V_j]
(4) Feature combination
When α = 0, the feature vector V(0) is [V_2, V_3, V_4, …, V_{l-2}, V_{l-1}] with l-2 elements; when α = 1, V(1) is [V_3, V_4, V_5, …, V_{l-3}, V_{l-2}] with l-4 elements; …; when α = (l-5)/2, V((l-5)/2) is [V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}] with 3 elements; and when α = (l-3)/2, V((l-3)/2) is [V_{(l+1)/2}] with 1 element. The feature vectors determined for the different values of α are combined into a high-dimensional feature vector [V(0), V(1), …, V((l-5)/2), V((l-3)/2)] with (l-1)^2/4 elements. In this embodiment, l is 51.
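The element count can be checked with a short sketch. Summing l-2α-2 over α = 0, …, (l-3)/2 gives the odd numbers from l-2 down to 1, whose total is (l-1)^2/4. The helper names `alpha_values`, `feature_dimension`, and `combine` are hypothetical.

```python
def alpha_values(length):
    """All intervals alpha = 0 .. (l-3)/2 for an odd sequence length l."""
    return range(0, (length - 3) // 2 + 1)


def feature_dimension(length):
    """Total number of features: the sum of l-2*alpha-2 over every alpha."""
    return sum(length - 2 * a - 2 for a in alpha_values(length))


def combine(vectors_by_alpha):
    """Concatenate per-alpha feature vectors in increasing order of alpha."""
    combined = []
    for a in sorted(vectors_by_alpha):
        combined.extend(vectors_by_alpha[a])
    return combined


if __name__ == "__main__":
    for l in (41, 51):
        print(l, feature_dimension(l), (l - 1) ** 2 // 4)
```

For l = 51 this gives 25 values of α and 625 features per sample; for l = 41 it gives 400.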
(5) RNA sequence coding
Using steps (1) to (4) above, the RNA sequence data set D is encoded into a numerical data set D', where s is the number of samples in D' and is a finite positive integer (s = 2614 in this embodiment), and (l-1)^2/4 is the number of features of D'. This completes the RNA sequence coding.
To verify the beneficial effects of the present invention, the inventors applied the RNA sequence coding method of Example 2 and the PSNP (position-specific nucleotide propensity), PSDP (position-specific dinucleotide propensity), KNF (k-nucleotide frequencies), KSNPF (k-spaced nucleotide pair frequencies), NPPS (nucleotide pair position specificity), PBE (position specific array), and NCPNC (nucleotide sequence and nucleotide composition) coding methods to the recognition and prediction of N6-methyladenosine sites in yeast RNA, and compared the performance of the support vector machine models built on each coding method. The models were evaluated by the average classification accuracy, sensitivity, specificity, MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve), and AUPRC (area under the precision-recall curve) of 10-fold cross-validation. The experimental method is as follows:
1. The sequence samples of the N6-methyladenosine data set of yeast RNA are encoded according to Example 2.
2. Data set normalization
The numerical data set D' is normalized by the min-max method:
where g_{m,n} is the n-th feature value of the m-th sample of the numerical data set D', g'_{m,n} is the normalized value of g_{m,n}, and max(g_n) and min(g_n) are the maximum and minimum feature values in the n-th column of D'; 1 ≤ m ≤ s and 1 ≤ n ≤ (l-1)^2/4, where m and n are finite positive integers. In this embodiment, l is 51 and s is 2614.
3. Partitioning a data set
The normalized numerical data set D' is divided into 10 parts using K-fold cross-validation (K = 10). Each part in turn serves as the test set D'_Te while the remaining 9 parts form the training set D'_Tr, giving 10 runs in total; in each run the ratio of training set D'_Tr to test set D'_Te is 9:1.
4. Training models and tests
A support vector machine model is trained on the training set D'_Tr, and its performance is tested on the test set D'_Te.
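The threshold-based metrics used to test the model can be computed from a confusion matrix as sketched below, using their standard definitions. The function name `classification_metrics` is hypothetical, and the sketch assumes the test fold contains both positive and negative samples.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: MCC is 0.0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}


if __name__ == "__main__":
    print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
```

In a 10-fold run, these values would be computed on each test fold and then averaged.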
The 7 comparative experiments followed the same steps 2-4 of the experimental procedure for the recognition and prediction of N6-methyladenosine sites in yeast RNA. The average classification accuracy, sensitivity, specificity, and MCC are shown in Table 2.
The AUROC results are shown in FIG. 4, and the AUPRC results are shown in FIG. 5.
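For reference, AUROC can be computed directly from classifier scores as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise definition; the function name `auroc` is hypothetical, labels are assumed to be 1 for positive and 0 for negative, and AUPRC is not covered.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The double loop is quadratic in the number of samples, which is acceptable for test folds of a few hundred sequences.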
Table 2. Experimental results of the method of Example 2 compared with the 7 comparison methods
As can be seen from Table 2, the support vector machine model built on the RNA sequence coding method of the present invention achieves an accuracy, sensitivity, specificity, and MCC of 0.905, 0.916, 0.894, and 0.810, respectively, for the recognition and prediction of N6-methyladenosine sites in yeast RNA, higher than those of the 7 comparison coding methods.
As can be seen from FIG. 4, the support vector machine model built on the RNA sequence coding method of the present invention achieves an AUROC of 0.968 for the recognition and prediction of N6-methyladenosine sites in yeast RNA, higher than those of the 7 comparison coding methods.
As can be seen from FIG. 5, the support vector machine model built on the RNA sequence coding method of the present invention achieves an AUPRC of 0.967 for the recognition and prediction of N6-methyladenosine sites in yeast RNA, higher than those of the 7 comparison coding methods.
Claims (1)
1. A bidirectional dinucleotide position-specific preference and mutual information DNA/RNA sequence coding method, characterized by comprising the following steps:
(1) construction of the DNA/RNA sequence nucleotide position-specific preference matrix
given a DNA/RNA sequence dataset D, the dataset consists of a positive class dataset and a negative class dataset;
where A, C, G, X are the 4 nucleotides of DNA/RNA, X denoting nucleotide T in a DNA sequence dataset and nucleotide U in an RNA sequence dataset; i is the nucleotide position, 1 ≤ i ≤ l, and i is a finite positive integer; l is the nucleotide length of a DNA/RNA sequence sample and is odd; and the matrix entries are the occurrence frequencies of nucleotides A, C, G, X at the i-th position of all sequence samples in the positive class dataset;
the nucleotide position-specific preference matrix of the negative class dataset is determined as follows
where the matrix entries are the occurrence frequencies of nucleotides A, C, G, X at the i-th position of all sequence samples in the negative class dataset;
(2) construction of the DNA/RNA sequence bidirectional dinucleotide position-specific preference matrix
the forward dinucleotide position-specific preference matrix of the positive class dataset is determined as follows
where AA, AC, …, XX are the 16 dinucleotides formed from the 4 DNA/RNA nucleotides A, C, G, X; α is the interval between the two nucleotides, 0 ≤ α ≤ (l-3)/2, and α is a finite non-negative integer; j is the nucleotide position, α+2 ≤ j ≤ l-α-1, and j is a finite positive integer; and the matrix entries are the occurrence frequencies of dinucleotides AA, AC, …, XX at the j-th and (j+α+1)-th positions of all sequence samples in the positive class dataset;
the backward dinucleotide position-specific preference matrix of the positive class dataset is determined as follows
where the matrix entries are the occurrence frequencies of dinucleotides AA, AC, …, XX at the j-th and (j-α-1)-th positions of all sequence samples in the positive class dataset;
the forward dinucleotide position-specific preference matrix of the negative class dataset is determined as follows
where the matrix entries are the occurrence frequencies of dinucleotides AA, AC, …, XX at the j-th and (j+α+1)-th positions of all sequence samples in the negative class dataset;
the backward dinucleotide position-specific preference matrix of the negative class dataset is determined as follows
where the matrix entries are the occurrence frequencies of dinucleotides AA, AC, …, XX at the j-th and (j-α-1)-th positions of all sequence samples in the negative class dataset;
(3) determination of mutual information values of nucleotides of DNA/RNA sequences
(3.1) the forward pointwise mutual information value, in the positive class dataset, of a nucleotide of the DNA/RNA sequence to be encoded is determined as follows
where x is the nucleotide at the j-th position, x ∈ {A, C, G, X}; z is the nucleotide at the (j+α+1)-th position, z ∈ {A, C, G, X}; and the three frequencies in the formula are, respectively, the occurrence frequency of dinucleotide xz at the j-th and (j+α+1)-th positions, the occurrence frequency of nucleotide x at the j-th position, and the occurrence frequency of nucleotide z at the (j+α+1)-th position, each taken over all sequence samples of the positive class dataset;
the backward pointwise mutual information value, in the positive class dataset, of a nucleotide of the DNA/RNA sequence to be encoded is determined as follows
where y is the nucleotide at the (j-α-1)-th position, y ∈ {A, C, G, X}; and the frequencies in the formula are the occurrence frequency of dinucleotide xy at the j-th and (j-α-1)-th positions and the occurrence frequency of nucleotide y at the (j-α-1)-th position, each taken over all sequence samples of the positive class dataset;
the pointwise mutual information coding value, in the positive class dataset, of the nucleotide at the j-th position of the DNA/RNA sequence sample to be encoded is defined in terms of the forward pointwise mutual information value and the backward pointwise mutual information value, and the DNA/RNA sequence sample of length l is encoded into a pointwise mutual information feature vector V+ of length l-2α-2:
(3.2) the forward pointwise mutual information value, in the negative class dataset, of a nucleotide of the DNA/RNA sequence to be encoded is determined as follows
where the three frequencies in the formula are, respectively, the occurrence frequency of dinucleotide xz at the j-th and (j+α+1)-th positions, the occurrence frequency of nucleotide x at the j-th position, and the occurrence frequency of nucleotide z at the (j+α+1)-th position, each taken over all sequence samples of the negative class dataset;
the backward pointwise mutual information value, in the negative class dataset, of a nucleotide of the DNA/RNA sequence to be encoded is determined as follows
where the frequencies in the formula are the occurrence frequency of dinucleotide xy at the j-th and (j-α-1)-th positions and the occurrence frequency of nucleotide y at the (j-α-1)-th position, each taken over all sequence samples of the negative class dataset;
the pointwise mutual information coding value, in the negative class dataset, of the nucleotide at the j-th position of the DNA/RNA sequence sample to be encoded is defined in terms of the forward pointwise mutual information value and the backward pointwise mutual information value, and the DNA/RNA sequence sample of length l is encoded into a pointwise mutual information feature vector V- of length l-2α-2:
(3.3) for a given DNA/RNA sequence sample of length l to be encoded, the feature vector V is determined by subtracting the corresponding elements of vectors V+ and V-:
V = [V_{α+2}, V_{α+3}, …, V_j]
(4) feature combination
when α = 0, the feature vector V(0) is [V_2, V_3, V_4, …, V_{l-2}, V_{l-1}] with l-2 elements; when α = 1, V(1) is [V_3, V_4, V_5, …, V_{l-3}, V_{l-2}] with l-4 elements; …; when α = (l-5)/2, V((l-5)/2) is [V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}] with 3 elements; and when α = (l-3)/2, V((l-3)/2) is [V_{(l+1)/2}] with 1 element; the feature vectors determined for the different values of α are combined into a high-dimensional feature vector [V(0), V(1), …, V((l-5)/2), V((l-3)/2)] with (l-1)^2/4 elements;
(5) DNA/RNA sequence coding
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011236138.3A CN112365925A (en) | 2020-11-09 | 2020-11-09 | Bidirectional dinucleotide position specific preference and mutual information DNA/RNA sequence coding method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011236138.3A CN112365925A (en) | 2020-11-09 | 2020-11-09 | Bidirectional dinucleotide position specific preference and mutual information DNA/RNA sequence coding method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112365925A true CN112365925A (en) | 2021-02-12 |
Family
ID=74510310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011236138.3A Pending CN112365925A (en) | 2020-11-09 | 2020-11-09 | Bidirectional dinucleotide position specific preference and mutual information DNA/RNA sequence coding method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112365925A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250718A (en) * | 2016-07-29 | 2016-12-21 | 於铉 | N1-methyladenosine site estimation method based on individually balanced Boosting algorithm |
CN111161793A (en) * | 2020-01-09 | 2020-05-15 | 青岛科技大学 | Method for predicting N6-methyladenosine modification sites in RNA based on stacking integration |
CN112365924A (en) * | 2020-11-09 | 2021-02-12 | 陕西师范大学 | Bidirectional trinucleotide position specificity preference and point combined mutual information DNA/RNA sequence coding method |
-
2020
- 2020-11-09 CN CN202011236138.3A patent/CN112365925A/en active Pending
Non-Patent Citations (5)
Title |
---|
GUANG-QING LI et al.: "TargetM6A: Identifying N6-Methyladenosine Sites From RNA Sequences via Position-Specific Nucleotide Propensities and a Support Vector Machine", IEEE TRANSACTIONS ON NANOBIOSCIENCE, vol. 15, no. 7, 10 August 2016 (2016-08-10), pages 674 - 682, XP011638122, DOI: 10.1109/TNB.2016.2599115 * |
MINGZHAO WANG et al.: "M6A-BiNP: predicting N6-methyladenosine sites based on bidirectional position-specific propensities of polynucleotides and pointwise joint mutual information", RNA BIOLOGY, vol. 18, no. 12, 23 June 2021 (2021-06-23), pages 2498 - 2512 * |
MINGZHAO WANG et al.: "PSP-PJMI: an innovative feature representation algorithm for identifying DNA N4-methylcytosine sites", INF SCI., vol. 606, 31 August 2022 (2022-08-31), pages 968 - 983 * |
PENGWEI XING et al.: "Identifying N6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine", SCIENTIFIC REPORTS, vol. 7, 25 April 2017 (2017-04-25), pages 46757 * |
谢娟英 et al.: "面向甲基化修饰位点预测的DNA/RNA序列特征编码算法研究进展" (Research progress on DNA/RNA sequence feature encoding algorithms for methylation modification site prediction), 中国科学: 生命科学, vol. 53, no. 6, 24 November 2022 (2022-11-24), pages 841 - 875 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||