CN111223522B - Method for identifying lncRNA based on fuzzy k-mer utilization rate - Google Patents
- Publication number
- CN111223522B (application CN202010010135.1A; also published as CN111223522A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Abstract
The invention discloses a method for identifying lncRNA based on fuzzy k-mer utilization rate, which specifically comprises the following steps: step 1, preprocessing RNA sequence data; step 2, defining a traditional k-mer and a fuzzy k-mer, and calculating the use frequency of the traditional k-mer; step 3, determining the correspondence between the traditional k-mer and the fuzzy k-mer; step 4, solving the correspondence matrix $c_m$ of the traditional k-mer and the fuzzy k-mer; and step 5, training a prediction model by using the fuzzy k-mer. The implementation of the invention can help to systematically and accurately identify long non-coding RNA in various species and various cell types from large-scale high-throughput sequencing data.
Description
Technical Field
The invention belongs to the technical field of identification of long-chain non-coding RNA (lncRNA), and relates to a method for identifying lncRNA based on fuzzy k-mer utilization rate.
Background
In the field of molecular biology, non-coding RNAs are one of the current research hotspots. Among non-coding RNAs, microRNA (miRNA) and long non-coding RNA (lncRNA) are the main focus of research. Research on microRNA is relatively mature, and scientists are now turning to long non-coding RNA, which has important biomedical functions.
Long non-coding RNAs were originally thought to be merely by-products of genome transcription, that is, transcriptional "noise" without any biological function. With the gradual discovery of the functions of non-coding RNA genes such as Xist and HOTAIR, long non-coding RNA genes have been found to have very important functions, and their number greatly exceeds that of protein-coding genes. The functions of long non-coding RNA mainly include transcriptional interference, regulation of gene expression, transcriptional activation, X-chromosome inactivation, nuclear transport, genomic imprinting, and chromatin modification, and they are closely related to the occurrence, development, diagnosis, and treatment of diseases.
The identification of long non-coding RNA is a prerequisite for studying it and is important basic frontier work. Identifying long non-coding RNAs from raw transcriptome experimental data is not easy; it requires multiple steps of calculation and analysis that combine various data and tools. One of the critical steps is assessing the coding potential of transcripts.
An algorithm named PLEK, which distinguishes protein-coding genes from long non-coding RNA genes using k-mer features, was published in BMC Bioinformatics. The algorithm is particularly useful for identifying long non-coding RNAs from large-scale de novo assembled transcriptomes. Extensive experiments show that accuracy increases as k increases, but so does the amount of computation. To balance accuracy against computational effort, k=5 was finally chosen. In addition, as k increases, the k-mer frequencies become sparse, that is, most of the calculated k-mer frequencies are 0. The calculation of k-mers is also affected when SNPs or indels are present in the transcripts.
In view of the above problems, a method for identifying lncRNA based on fuzzy k-mer utilization rate is provided, in which the fuzzy k-mer makes the calculation of k-mer use frequency more robust.
Disclosure of Invention
The invention aims to provide a method for identifying lncRNA based on fuzzy k-mer utilization rate. The method adopts strict filtering conditions to collect reliable mRNA and lncRNA sequences, so that the output of the subsequent identification model is more reliable and systematic error is reduced; and it adopts the fuzzy k-mer, so that the computational complexity is reduced.
The technical scheme adopted by the invention is that the method for identifying the lncRNA based on the fuzzy k-mer utilization rate comprises the following steps:
step 1, preprocessing RNA sequence data;
step 2, defining a traditional k-mer and a fuzzy k-mer, and calculating the use frequency of the traditional k-mer;
step 3, determining the corresponding relation between the traditional k-mer and the fuzzy k-mer;
step 4, solving the correspondence matrix $c_m$ of the traditional k-mer and the fuzzy k-mer;
step 5, training a prediction model by using the fuzzy k-mer.
The present invention is also characterized in that,
the specific process of the step 1 is as follows:
transcript mRNA sequences and annotations of human protein-coding genes are downloaded from the RefSeq database, human long non-coding RNA is collected from GENCODE v17, and mRNA and lncRNA with putative, predicted, or pseudogene annotations are excluded.
The specific process of the step 2 is as follows:
the definition of a traditional k-mer is: the traditional k-mers are denoted as $U=\{u_j\}$, $1\le j\le N=4^{t}$, where the sequence length is $t$;
the definition of a fuzzy k-mer is: the fuzzy k-mers are denoted as $V=\{v_i\}$, where the sequence length is $n$, $x$ positions carry specified bases, the remaining $n-x$ positions cannot be matched, and $x$ is required to be smaller than $n$;
the traditional k-mer use frequency is calculated as follows:
each time the sequence in the sliding window matches a specific sequence $i$, the number of uses $N_i$ of the specific sequence $i$ is increased by 1, and the k-mer use frequency $F_i$ is calculated using the following equation (1):
$$F_i=\frac{N_i}{M_k}\qquad(1);$$
where $M_k$ is the total number of times a sliding window of length $k$ can slide along an RNA sequence.
The specific process of the step 3 is as follows:
the correspondence between the traditional k-mers and the fuzzy k-mers is described by a matrix $A_{M\times N}=[a_{i,j}]$ whose elements take the value 1 or 0: $a_{i,j}=1$ if $u_j$ matches $v_i$, and $a_{i,j}=0$ otherwise;
let $x$ be a vector of length $N$ whose element $x_j$ is the number of occurrences of $u_j$; let $y$ be a vector of length $M$ whose element $y_i$ is the number of occurrences of $v_i$; the correspondence between the traditional k-mers and the fuzzy k-mers is then:
$$y=Ax\qquad(2).$$
the specific process of the step 4 is as follows:
to obtain the minimum $L_2$-norm solution of equation (2), a kernel function is defined as:
where $N_m(S_1,S_2)$ denotes the number of fuzzy k-mers with $m$ mismatches between sequence $S_1$ and sequence $S_2$;
where $r=m_1+m_2-2t-m$ and $b=4$.
The specific process of the step 5 is as follows:
first, the svm-scale program in the LIBSVM package is used to normalize the adjusted use frequencies to the range 0 to 1; then, a support vector machine with a radial basis function kernel is adopted as the binary classifier; the optimized support vector machine parameter C and kernel parameter gamma are obtained using the grid.py script in the LIBSVM package; during the parameter search, 10-fold cross-validation is used to evaluate the performance of the classification model corresponding to each pair of C and gamma parameters.
The beneficial effects of the invention are as follows:
(1) Theoretical significance: the number of long non-coding RNAs far exceeds that of protein-coding genes, and their biomedical functions have been confirmed; the identification of long non-coding RNA lays a theoretical foundation for research on gene regulation, the cell cycle, and the occurrence and development of complex diseases.
(2) Biological significance: the identification of long non-coding RNA creates conditions for elucidating the mechanisms that regulate gene expression, can provide causal explanations for research on species evolution and organism diversity, and deepens the understanding of cell differentiation and genome-level molecular regulation mechanisms.
(3) Application value: implementation of the algorithm will help to systematically and accurately identify long non-coding RNAs in various species and various cell types from large-scale high-throughput sequencing data. The algorithm can also be extended to the identification of other types of RNA molecules, such as microRNA (miRNA) and Piwi-interacting RNA (piRNA).
Detailed Description
The present invention will be described in detail with reference to the following embodiments.
The invention discloses a method for identifying lncRNA based on fuzzy k-mer utilization rate, which specifically comprises the following steps:
step 1, data preprocessing:
The present algorithm employs a machine learning approach and therefore requires accurate and authoritative training and test datasets. RefSeq (maintained by NCBI) and GENCODE are two databases that provide comprehensive, well-annotated, non-redundant RNA sequence datasets. Transcript mRNA sequences and annotations of human protein-coding genes can be downloaded from the RefSeq database (release 60), and human long non-coding RNA (lncRNA) can be collected from GENCODE v17. There are 34691 human mRNAs longer than 200 nt and 22389 lncRNAs longer than 200 nt. mRNA and lncRNA with annotations such as putative, predicted, or pseudogene are excluded, ensuring that the processed data is of high quality and reliable.
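The filtering described in step 1 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the sequences are available as FASTA text and that the annotation terms appear in the FASTA header line, and the function names and sample headers are hypothetical.

```python
# Annotation terms named in the text; looking for them in the FASTA header is an assumption.
EXCLUDED_TERMS = ("putative", "predicted", "pseudogene")
MIN_LENGTH = 200  # only transcripts longer than 200 nt are kept


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def keep_record(header, sequence):
    """Return True if the record passes the length and annotation filters."""
    if len(sequence) <= MIN_LENGTH:
        return False
    lowered = header.lower()
    return not any(term in lowered for term in EXCLUDED_TERMS)


def filter_fasta(text):
    """Return the list of (header, sequence) records that pass both filters."""
    return [(h, s) for h, s in parse_fasta(text) if keep_record(h, s)]


# Hypothetical records: one kept, one excluded by annotation, one excluded by length.
sample = (
    ">NM_1 Homo sapiens gene A, mRNA\n" + "ACGT" * 60 + "\n"
    ">XM_2 PREDICTED: Homo sapiens gene B, mRNA\n" + "ACGT" * 60 + "\n"
    ">NR_3 short transcript\n" + "ACGT" * 10 + "\n"
)
kept = filter_fasta(sample)
print([header for header, _ in kept])
```

Matching on header text is only one way to apply the exclusion; a real pipeline would more likely use the structured annotation fields shipped with each database.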
Step 2, calculating the use frequency of the traditional k-mer:
The calculation of the k-mer use frequency depends on the width and the step size of the sliding window. The sliding window step size adopted by the invention is 1 base (1 nt). The width of the sliding window varies from 1 to k. For distinguishing mRNA from lncRNA, k has a value of 6.
For RNA, a k-mer is a sequence of k bases, each being A, C, G or T. For k=1 to 6, 5460 sequences can be obtained (4096+1024+256+64+16+4: 4096 6-mers, 1024 5-mers, 256 4-mers, 64 3-mers, 16 2-mers, and 4 1-mers).
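The feature count quoted above can be reproduced by enumerating every k-mer over the four-letter alphabet for k = 1 to 6. The short sketch below does this; the helper name `all_kmers` is illustrative only.

```python
from itertools import product

ALPHABET = "ACGT"


def all_kmers(k):
    """Return every sequence of length k over the alphabet, in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


# One feature per k-mer, for every window width from 1 to 6.
feature_names = [kmer for k in range(1, 7) for kmer in all_kmers(k)]
counts_per_k = {k: len(all_kmers(k)) for k in range(1, 7)}

print(counts_per_k)        # {1: 4, 2: 16, 3: 64, 4: 256, 5: 1024, 6: 4096}
print(len(feature_names))  # 5460
```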
If the sequence in the sliding window matches a particular sequence $i$, the number of uses of that sequence (denoted $N_i$) is increased by 1. $M_k$ is the total number of times a sliding window of length $k$ can slide along an RNA sequence. The use frequency $F_i$ is calculated as:
$$F_i=\frac{N_i}{M_k}\qquad(1);$$
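Equation (1) can be illustrated with a short sketch that slides a window of width k along a sequence with a step of 1 nt. It assumes that $M_k$ equals the sequence length minus k plus 1 (the number of window positions) and that the sequence contains only A, C, G and T; the function name is hypothetical.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Return {k-mer: F_i} with F_i = N_i / M_k for a window of width k and step 1."""
    sequence = sequence.upper()
    total_windows = len(sequence) - k + 1  # M_k: number of window positions
    if total_windows <= 0:
        return {}
    # N_i: how many windows contain exactly the k-mer i.
    counts = Counter(sequence[i:i + k] for i in range(total_windows))
    return {kmer: n / total_windows for kmer, n in counts.items()}


# "ACGTACGT" has 7 windows of width 2: AC, CG, GT, TA, AC, CG, GT.
freqs = kmer_frequencies("ACGTACGT", 2)
print(freqs["AC"], freqs["TA"])
```

k-mers that never occur are simply absent from the returned dictionary, which is one way to cope with the sparsity that arises as k grows.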
For RNA sequences, each base may be any of {A, C, G, T}; {A, C, G, T} may be regarded as an alphabet of length $l=4$. The definitions of a traditional k-mer and a fuzzy k-mer in a sequence are given below.
A fuzzy k-mer has length $n$ and contains $x$ perfectly matched positions.
The traditional k-mers are denoted as $U=\{u_j\}$, $1\le j\le N=4^{t}$, where the sequence length is $t$.
The fuzzy k-mers are denoted as $V=\{v_i\}$, where the sequence length is $n$, $x$ positions carry specified bases, the remaining $n-x$ positions cannot be matched, and $x$ must be smaller than $n$.
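One way to picture a fuzzy k-mer is as a pattern of length n in which x positions hold a base and the other n-x positions are wildcards. The sketch below enumerates such patterns and tests whether a traditional k-mer matches one. The wildcard symbol `*` and the function names are assumptions made for illustration, and the pattern count $\binom{n}{x}4^{x}$ follows from this enumeration rather than from the text.

```python
from itertools import combinations, product

ALPHABET = "ACGT"
WILDCARD = "*"  # hypothetical symbol for a position that carries no base


def fuzzy_kmers(n, x):
    """Enumerate all patterns of length n with exactly x specified bases."""
    if not 0 < x < n:
        raise ValueError("x must satisfy 0 < x < n")
    patterns = []
    for positions in combinations(range(n), x):
        for bases in product(ALPHABET, repeat=x):
            pattern = [WILDCARD] * n
            for pos, base in zip(positions, bases):
                pattern[pos] = base
            patterns.append("".join(pattern))
    return patterns


def matches(kmer, pattern):
    """True if kmer agrees with pattern at every specified (non-wildcard) position."""
    return len(kmer) == len(pattern) and all(
        p == WILDCARD or p == c for c, p in zip(kmer, pattern)
    )


patterns = fuzzy_kmers(3, 2)
print(len(patterns))           # 3 position choices * 16 base choices
print(matches("ACG", "A*G"))   # agrees at positions 0 and 2
print(matches("ACG", "A*T"))   # disagrees at position 2
```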
Step 3, determining the corresponding relation between the traditional k-mer and the fuzzy k-mer:
The correspondence between the traditional k-mers and the fuzzy k-mers is described by a matrix $A_{M\times N}=[a_{i,j}]$ whose elements take the value 1 or 0: $a_{i,j}=1$ if $u_j$ matches $v_i$, and $a_{i,j}=0$ otherwise.
Here, "$u_j$ matches $v_i$" means that the bases at exactly the $x$ specified positions of the fuzzy k-mer $v_i$ agree with the bases at the same positions of the traditional k-mer $u_j$. Let $x$ be a vector of length $N$ whose element $x_j$ is the number of occurrences of $u_j$. Let $y$ be a vector of length $M$ whose element $y_i$ is the number of occurrences of $v_i$.
The correspondence between the traditional k-mers and the fuzzy k-mers is:
$$y=Ax\qquad(2);$$
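The relation y = Ax can be made concrete on a small case. The sketch below builds the 0/1 matrix A for traditional k-mers and fuzzy k-mers of the same length (taking the two lengths to be equal is an assumption of this sketch), then converts a vector of traditional k-mer counts into fuzzy k-mer counts. It uses NumPy, and all names are illustrative.

```python
from itertools import combinations, product

import numpy as np

ALPHABET = "ACGT"


def build_correspondence(n, x):
    """Return (kmers, patterns, A) where A[i, j] = 1 if k-mer j matches fuzzy pattern i."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=n)]  # u_j, N = 4**n
    patterns = []  # v_i, stored as tuples of (position, base) for the x specified positions
    for positions in combinations(range(n), x):
        for bases in product(ALPHABET, repeat=x):
            patterns.append(tuple(zip(positions, bases)))
    A = np.zeros((len(patterns), len(kmers)), dtype=np.int64)
    for i, pattern in enumerate(patterns):
        for j, kmer in enumerate(kmers):
            if all(kmer[pos] == base for pos, base in pattern):
                A[i, j] = 1
    return kmers, patterns, A


kmers, patterns, A = build_correspondence(n=3, x=2)

# Traditional k-mer counts x: "ACG" seen twice, "ACT" seen once.
x_counts = np.zeros(len(kmers), dtype=np.int64)
x_counts[kmers.index("ACG")] = 2
x_counts[kmers.index("ACT")] = 1

# Fuzzy k-mer counts y = A x.
y_counts = A @ x_counts
ac_any = patterns.index(((0, "A"), (1, "C")))  # the pattern "AC" followed by a wildcard
print(A.shape, y_counts[ac_any])
```

In this toy case A has 48 rows and 64 columns, so the fuzzy representation is smaller than the traditional one, and both "ACG" and "ACT" contribute to the same fuzzy count.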
Step 4, solving the correspondence matrix of the traditional k-mer and the fuzzy k-mer:
In formula (2), the vector x has length N and the vector y has length M. Because M is smaller than N, the rank of the matrix A is smaller than N, and the solution of formula (2) is not unique.
To estimate the vector corresponding to y, the maximum-entropy estimate of the vector x would be optimal. Because it requires nonlinear calculation, which is computationally difficult, an alternative is to find the minimum $L_2$-norm solution of formula (2).
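To illustrate what a minimum $L_2$-norm solution of an under-determined system y = Ax looks like, the sketch below uses a toy matrix with M = 2 rows and N = 3 columns and the Moore-Penrose pseudoinverse from NumPy. The matrix and vectors are made-up values chosen for readability; this is a generic illustration, not the kernel derivation used in the method.

```python
import numpy as np

# Toy under-determined system y = A x with M = 2 equations and N = 3 unknowns.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
y = np.array([2.0, 2.0])

# Among all x with A x = y, the pseudoinverse selects the one with the smallest Euclidean norm.
x_min_norm = np.linalg.pinv(A) @ y

# Any other solution differs from it by a null-space vector of A and has a larger norm.
null_vector = np.array([1.0, -1.0, 1.0])  # A @ null_vector is the zero vector
x_other = x_min_norm + 0.5 * null_vector

print(x_min_norm, np.linalg.norm(x_min_norm))
print(x_other, np.linalg.norm(x_other))
```

Both vectors reproduce y exactly, which shows why an extra criterion such as the smallest norm is needed to pick a single answer.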
For an RNA transcript X, the feature vector has one element per fuzzy k-mer: its $i$-th element is the number of occurrences of the $i$-th fuzzy k-mer in the transcript, and its length is the number of k-mers.
To obtain the minimum $L_2$-norm solution of equation (2), a kernel function is defined as:
where $N_m(S_1,S_2)$ denotes the number of fuzzy k-mers with $m$ mismatches between sequence $S_1$ and sequence $S_2$.
where $r=m_1+m_2-2t-m$ and $b=4$ is the number of bases (A, T, G, C). For two k-mers $u_1$ and $u_2$ of length $l$ with $m$ mismatches, there are $l-m$ perfectly matched positions. All possible k-mers of length $l$ can be enumerated; each is denoted $u$ and has $m_1$ bases that do not match $u_1$ and $m_2$ bases that do not match $u_2$. Suppose that, of the $m_1$ mismatches, $t$ are located at the $l-m$ matched positions and $m_1-t$ are located at the $m$ mismatched positions. The $t$ positions can then be chosen in $\binom{l-m}{t}$ ways, with $(b-1)^{t}$ possible values. The remaining $r$ positions, at which $u$ matches neither $u_1$ nor $u_2$, can be chosen in $\binom{m}{r}$ ways, with $(b-2)^{r}$ possible values.
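The counting argument above can be checked by brute force on a small case. The sketch below enumerates every sequence u of length l, records its mismatch counts against two fixed sequences, and compares the tallies with a closed-form count. Two things are assumptions of this sketch rather than statements from the text: the value $(b-1)^{t}$ for the t positions, and the final factor $\binom{m-r}{m_1-t-r}$, which chooses where u agrees with $u_2$ instead of $u_1$ among the remaining mismatched positions. The example sequences are arbitrary.

```python
from collections import Counter
from itertools import product
from math import comb

ALPHABET = "ACGT"
b = len(ALPHABET)


def hamming(a, c):
    """Number of positions at which two equal-length sequences differ."""
    return sum(p != q for p, q in zip(a, c))


def count_formula(l, m, m1, m2, t):
    """Closed-form number of sequences u with m1 mismatches to u1, m2 to u2, t at matched positions."""
    r = m1 + m2 - 2 * t - m   # positions where u differs from both u1 and u2
    only_u1 = m1 - t - r      # positions where u differs from u1 but equals u2
    if min(t, r, only_u1) < 0 or t > l - m or r > m or only_u1 > m - r:
        return 0
    return (comb(l - m, t) * (b - 1) ** t
            * comb(m, r) * (b - 2) ** r
            * comb(m - r, only_u1))


def count_brute_force(u1, u2):
    """Tally (m1, m2, t) over every sequence u of the same length as u1 and u2."""
    matched = [i for i, (p, q) in enumerate(zip(u1, u2)) if p == q]
    tally = Counter()
    for letters in product(ALPHABET, repeat=len(u1)):
        u = "".join(letters)
        t = sum(u[i] != u1[i] for i in matched)
        tally[(hamming(u, u1), hamming(u, u2), t)] += 1
    return tally


u1, u2 = "ACGTA", "ACGCC"  # l = 5, differing at the last two positions (m = 2)
l, m = len(u1), hamming(u1, u2)
tally = count_brute_force(u1, u2)
agree = all(count_formula(l, m, m1, m2, t) == n for (m1, m2, t), n in tally.items())
print(agree)
```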
Step 5, training a prediction model by using the fuzzy k-mer:
For any mRNA and lncRNA sequences, the above steps yield the fuzzy k-mer utilization vector of each sequence; the set of utilization vectors of mRNA forms the positive class and the set of utilization vectors of lncRNA forms the negative class. The positive and negative class datasets are used to train a model that predicts which type a new RNA belongs to.
Here, a support vector machine (LIBSVM) is used as the predictive model, and the new model is tested on an independent dataset. Test datasets from two reliable and authoritative databases, NCBI RefSeq and GENCODE, are selected to test the new model.
First, the svm-scale program in the LIBSVM package (version 3.17) is used to normalize the adjusted use frequencies to the range 0 to 1. Then, a support vector machine with a radial basis function kernel (with kernel parameter gamma) is used as the binary classifier. The optimized support vector machine parameter C and kernel parameter gamma are obtained using the grid.py script in the LIBSVM package. During the parameter search, 10-fold cross-validation is used to evaluate the performance of the classification model corresponding to each pair of C and gamma parameters. Finally, an SVM binary prediction model with the optimized C and gamma parameters is established.
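The training procedure can be approximated in Python with scikit-learn, whose `SVC` class is built on LIBSVM. This is a stand-in for the svm-scale and grid.py command-line tools named in the text, not a reproduction of them: the feature vectors are synthetic random data, and the small grid of C and gamma values is chosen only to keep the example fast.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for fuzzy k-mer usage vectors:
# 30 "mRNA" samples (label 1, positive class) and 30 "lncRNA" samples (label 0, negative class).
n_per_class, n_features = 30, 8
X = np.vstack([
    rng.normal(loc=0.6, scale=0.1, size=(n_per_class, n_features)),
    rng.normal(loc=0.4, scale=0.1, size=(n_per_class, n_features)),
])
y = np.array([1] * n_per_class + [0] * n_per_class)

# Scale every feature to [0, 1], then classify with an RBF-kernel SVM.
pipeline = Pipeline([
    ("scale", MinMaxScaler(feature_range=(0, 1))),
    ("svm", SVC(kernel="rbf")),
])

# Illustrative power-of-two grid; the real search range is not given in the text.
param_grid = {
    "svm__C": [2.0 ** e for e in (-1, 1, 3)],
    "svm__gamma": [2.0 ** e for e in (-5, -3, -1)],
}

# 10-fold cross-validation scores each (C, gamma) pair.
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Placing the scaler inside the pipeline means the scaling range is learned from the training folds only, which avoids leaking information from the held-out fold into the model.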
The invention will be further described with reference to the application principles.
To identify lncRNA, other RNAs must be excluded accurately. First, in line with the general characteristics of lncRNA, only RNA sequences longer than 200 bases are considered in the present invention. In addition, lncRNA must be distinguished from mRNA. Building an accurate classification model is important, and training and testing it requires datasets that are as accurate as possible; ideally, gold-standard datasets. The method employs the NCBI RefSeq and GENCODE databases, the most authoritative and widely accepted datasets at present, which most methods also use. Unlike other methods, the present invention excludes mRNA and lncRNA with annotations such as putative, predicted, and pseudogene, because the annotations of these data are inaccurate. The data remaining after such processing is more accurate and reliable.
The k in k-mer is variable, and different methods use different values of k. When k is too small, the characteristic features of the sequence cannot be captured; when k is too large, the computational complexity rises sharply. Therefore, selecting an appropriate k is critical. PLEK is an existing algorithm for predicting lncRNA. In PLEK, the value of k is 5, a value obtained by weighing accuracy against computational complexity. Because the present algorithm uses fuzzy k-mers, in which some of the k bases may be unspecified, k=6 is used in the present invention in order to capture enough information in the sequence.
Fuzzy k-mers can be mapped to traditional k-mers. This has the advantage that it both avoids the rapid growth of the amount of calculation caused by an overly large k and avoids the failure to represent the characteristics of the sequence caused by an overly small k. The correspondence between the fuzzy k-mer and the traditional k-mer in the invention is as follows:
The correspondence between the traditional k-mers and the fuzzy k-mers is $A_{M\times N}=[a_{i,j}]$, whose elements take the value 1 or 0: $a_{i,j}=1$ if $u_j$ matches $v_i$, and $a_{i,j}=0$ otherwise.
Here, "$u_j$ matches $v_i$" means that the bases at exactly the $x$ specified positions of the fuzzy k-mer $v_i$ agree with the bases at the same positions of the traditional k-mer $u_j$.
The solution of the correspondence matrix of the traditional k-mers and fuzzy k-mers is not unique. A feasible approach is to obtain the minimum $L_2$-norm solution. For an RNA transcript X, the feature vector has one element per fuzzy k-mer: its $i$-th element is the number of occurrences of the $i$-th fuzzy k-mer in the transcript, and its length is the number of k-mers. The kernel function is defined as:
The identification model adopts a support vector machine, which has excellent classification performance. For the features of the computation space obtained with $y=Ax$, a support vector machine is used to train the model with 10-fold cross-validation. The obtained model parameters are then tested using the test data. Finally, the optimal parameters obtained are taken as the model parameters.
The method solves the problems of low prediction accuracy or high calculation complexity in the prior art; the invention can screen reliable mRNA and lncRNA; the invention can identify long-chain non-coding RNA, lays a foundation for researching the functions of the long-chain non-coding RNA, and provides a new idea for explaining the occurrence and development mechanism of diseases.
When k is large, the k-mer usage matrix is sparse. A carefully designed fuzzy k-mer feature should therefore have the following properties: (1) Low computational complexity. Datasets keep growing and consume ever more computing resources, so features should be chosen that are convenient to compute and have low computational complexity. (2) The ability to reflect the biological characteristics of the RNA sequence. The purpose of computational features is to solve biological problems; suitable computational features reflect the essential biological characteristics of the RNA sequence and achieve good results. Such features are highly comprehensible and interpretable and lay a good foundation for subsequent data analysis and functional experiments. (3) Strong generalization ability. The features designed here must first be suitable for sequence analysis of long non-coding RNAs and mRNAs; to make the best use of them, it is desirable to extend them to other biological problems, such as predicting miRNAs and piRNAs.
The correspondence between the traditional k-mers and fuzzy k-mers is not unique, so a suitable, or "optimal", correspondence must be selected. Which mapping, then, is optimal? The invention adopts the minimum $L_2$-norm solution. The software implementation and final test results show that the method provided by the invention is reasonable.
In feature space, the similarity measure between samples is critical. For fuzzy k-mers, a suitable kernel function is designed to calculate the similarity between two samples, and it meets requirements such as ease of implementation and low computational complexity.
Previous studies have used traditional k-mers as biological features to study problems in the biological domain. The invention is the first to use fuzzy k-mers to study the sequence characteristics of protein-coding RNA (mRNA) and long non-coding RNA. It is the first to study the identification of long non-coding RNA based on fuzzy k-mers, and it designs a support vector machine kernel function specifically for fuzzy k-mers. The research results of the invention can be extended to problems such as "identifying genes", "predicting piRNA", "category-specific motif discovery", and "miRNA classification".
Claims (3)
1. A method for identifying lncRNA based on fuzzy k-mer usage, characterized by: the method specifically comprises the following steps:
step 1, preprocessing RNA sequence data;
step 2, defining a traditional k-mer and a fuzzy k-mer, and calculating the use frequency of the traditional k-mer;
the specific process of the step 2 is as follows:
the definition of a traditional k-mer is: the traditional k-mers are denoted as $U=\{u_j\}$, $1\le j\le N=4^{t}$, where the sequence length is $t$;
the definition of a fuzzy k-mer is: the fuzzy k-mers are denoted as $V=\{v_i\}$, where the sequence length is $n$, $x$ positions carry specified bases, the remaining $n-x$ positions cannot be matched, and $x$ is required to be smaller than $n$;
the traditional k-mer use frequency is calculated as follows:
each time the sequence in the sliding window matches a specific sequence $i$, the number of uses $N_i$ of the specific sequence $i$ is increased by 1, and the k-mer use frequency $F_i$ is calculated using the following equation (1):
$$F_i=\frac{N_i}{M_k}\qquad(1);$$
where $M_k$ is the total number of times a sliding window of length $k$ can slide along the RNA sequence;
step 3, determining the corresponding relation between the traditional k-mer and the fuzzy k-mer;
the specific process of the step 3 is as follows:
the correspondence between the traditional k-mers and the fuzzy k-mers is described by a matrix $A_{M\times N}=[a_{i,j}]$ whose elements take the value 1 or 0: $a_{i,j}=1$ if $u_j$ matches $v_i$, and $a_{i,j}=0$ otherwise;
let $x$ be a vector of length $N$ whose element $x_j$ is the number of occurrences of $u_j$; let $y$ be a vector of length $M$ whose element $y_i$ is the number of occurrences of $v_i$; the correspondence between the traditional k-mers and the fuzzy k-mers is then:
$$y=Ax\qquad(2);$$
step 4, solving the correspondence matrix $c_m$ of the traditional k-mer and the fuzzy k-mer;
The specific process of the step 4 is as follows:
to obtain the minimum $L_2$-norm solution of equation (2), a kernel function is defined as:
where $N_m(S_1,S_2)$ denotes the number of fuzzy k-mers with $m$ mismatches between sequence $S_1$ and sequence $S_2$;
where $r=m_1+m_2-2t-m$ and $b=4$;
step 5, training a prediction model by using the fuzzy k-mer.
2. The method for identifying lncRNA based on fuzzy k-mer usage of claim 1, wherein: the specific process of the step 1 is as follows:
transcript mRNA sequences and annotations of human protein-coding genes are downloaded from the RefSeq database, human long non-coding RNA is collected from GENCODE v17, and mRNA and lncRNA with putative, predicted, or pseudogene annotations are excluded.
3. The method for identifying lncRNA based on fuzzy k-mer usage of claim 1, wherein: the specific process of the step 5 is as follows:
first, the svm-scale program in the LIBSVM package is used to normalize the adjusted use frequencies to the range 0 to 1; then, a support vector machine with a radial basis function kernel is adopted as the binary classifier; the optimized support vector machine parameter C and kernel parameter gamma are obtained using the grid.py script in the LIBSVM package; during the parameter search, 10-fold cross-validation is used to evaluate the performance of the classification model corresponding to each pair of C and gamma parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010010135.1A CN111223522B (en) | 2020-01-06 | 2020-01-06 | Method for identifying lncRNA based on fuzzy k-mer utilization rate |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111223522A CN111223522A (en) | 2020-06-02 |
CN111223522B true CN111223522B (en) | 2023-04-28 |
Family
ID=70832290
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010010135.1A Active CN111223522B (en) | 2020-01-06 | 2020-01-06 | Method for identifying lncRNA based on fuzzy k-mer utilization rate |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111223522B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116994645B (en) * | 2023-08-01 | 2024-04-09 | 西安理工大学 | Prediction method of piRNA and mRNA target pair based on interactive reasoning network |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010014636A1 (en) * | 2008-07-28 | 2010-02-04 | Rockefeller University | Methods for identifying rna segments bound by rna-binding proteins or ribonucleoprotein complexes |
CN108595913A (en) * | 2018-05-11 | 2018-09-28 | 武汉理工大学 | Differentiate the supervised learning method of mRNA and lncRNA |
CN108876001A (en) * | 2018-05-03 | 2018-11-23 | 东北大学 | A kind of Short-Term Load Forecasting Method based on twin support vector machines |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7907769B2 (en) * | 2004-05-13 | 2011-03-15 | The Charles Stark Draper Laboratory, Inc. | Image-based methods for measuring global nuclear patterns as epigenetic markers of cell differentiation |
Non-Patent Citations (2)
Title |
---|
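Chang Zheng; Meng Jun; Shi Yunsheng; Mo Fengran. lncRNA identification and function prediction based on multi-feature fusion. CAAI Transactions on Intelligent Systems. 2018, (06), full text. * |
Li Junhao; Yang Jianhua; Qu Lianghu. Applications of bioinformatics in long non-coding RNA research. Progress in Physiological Sciences. 2016, (03), full text. * |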
Also Published As
Publication number | Publication date |
---|---|
CN111223522A (en) | 2020-06-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||