CN111223522B - Method for identifying lncRNA based on fuzzy k-mer utilization rate - Google Patents


Info

Publication number
CN111223522B
Authority
CN
China
Prior art keywords
mer
fuzzy
sequence
traditional
mers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010010135.1A
Other languages
Chinese (zh)
Other versions
CN111223522A (en)
Inventor
李爱民
费蓉
刘雅君
周红芳
刘光明
王磊
黑新宏
周中银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202010010135.1A
Publication of CN111223522A
Application granted
Publication of CN111223522B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Abstract

The invention discloses a method for identifying lncRNA based on fuzzy k-mer utilization rate, which specifically comprises the following steps: step 1, preprocessing RNA sequence data; step 2, defining a traditional k-mer and a fuzzy k-mer, and calculating the usage frequency of the traditional k-mer; step 3, determining the correspondence between the traditional k-mer and the fuzzy k-mer; step 4, solving the correspondence matrix $c_m$ of the traditional k-mer and the fuzzy k-mer; and step 5, training a prediction model using the fuzzy k-mer. The implementation of the invention can help to systematically and accurately identify long-chain non-coding RNA in various species and various cells from large-scale high-throughput sequencing data.

Description

Method for identifying lncRNA based on fuzzy k-mer utilization rate
Technical Field
The invention belongs to the technical field of identification of long-chain non-coding RNA (lncRNA), and relates to a method for identifying lncRNA based on fuzzy k-mer utilization rate.
Background
In molecular biology, non-coding RNAs are a current research hotspot. Among non-coding RNAs, microRNA (miRNA) and long non-coding RNA (lncRNA) are the main research subjects. Research on microRNA is relatively mature, and scientists are now investigating long-chain non-coding RNA, which has important biomedical functions.
Long non-coding RNAs were originally thought to be mere by-products of genome transcription, simply "noise" with no biological function. With the gradual discovery of the functions of non-coding RNA genes such as Xist and HOTAIR, long-chain non-coding RNA genes were found to have very important functions and to greatly outnumber protein-coding genes. The functions of long-chain non-coding RNA mainly include transcriptional interference, regulation of gene expression, transcriptional activation, X-chromosome inactivation, nuclear transport, genomic imprinting and chromatin modification, and they are closely related to the occurrence, development, diagnosis and treatment of diseases.
Identifying long-chain non-coding RNA is a prerequisite for studying it and is important basic frontier work. Identifying long-chain non-coding RNAs from raw transcriptome experimental data is not easy: it requires multiple steps of calculation and analysis that combine various data and tools. One of the critical steps is assessing the coding capacity of transcripts.
An algorithm named PLEK, published in BMC Bioinformatics, distinguishes protein-coding genes from long non-coding RNA genes using k-mer features. The algorithm is particularly useful for identifying long-chain non-coding RNAs from large-scale de novo assembled transcriptomes. A large number of experiments show that accuracy increases as k increases, but the amount of computation increases as well. To balance accuracy and computational effort, k=5 was finally chosen. In addition, as k increases, the k-mer frequency matrix becomes sparse, that is, most of the calculated k-mer frequencies are 0. The calculation of k-mers is also affected when SNPs or indels are present in the transcripts.
In view of the above problems, a method for identifying lncRNA based on fuzzy k-mer utilization rate is provided; the fuzzy k-mer is more robust in the calculation of k-mer usage frequency.
Disclosure of Invention
The invention aims to provide a method for identifying lncRNA based on fuzzy k-mer utilization rate. The method applies strict filtering conditions to collect reliable mRNA and lncRNA sequences, so that the output of the subsequent identification model is more reliable and systematic error is reduced; it also adopts the fuzzy k-mer, which reduces the computational complexity.
The technical scheme adopted by the invention is that the method for identifying the lncRNA based on the fuzzy k-mer utilization rate comprises the following steps:
step 1, preprocessing RNA sequence data;
step 2, defining a traditional k-mer and a fuzzy k-mer, and calculating the use frequency of the traditional k-mer;
step 3, determining the corresponding relation between the traditional k-mer and the fuzzy k-mer;
step 4, solving the correspondence matrix $c_m$ of the traditional k-mer and the fuzzy k-mer;
step 5, training a prediction model by using the fuzzy k-mer.
The present invention is also characterized in that,
the specific process of the step 1 is as follows:
transcript mRNA sequences and annotations of human protein-coding genes are downloaded from the RefSeq database, human long non-coding RNA is collected from GENCODE v17, and mRNA and lncRNA annotated as putative, predicted or pseudogene are excluded.
The specific process of the step 2 is as follows:
the definition of a conventional k-mer is: the set of conventional k-mers is denoted $U=\{u_j\}$, $1 \le j \le N = 4^t$, where $t$ is the sequence length;
the definition of fuzzy k-mers is: the set of fuzzy k-mers is denoted $V=\{v_i\}$,
Figure BDA0002356852210000031
where the sequence length is $n$, of which $n-x$ positions cannot be matched and $x$ positions carry bases, and $x$ is required to be smaller than $n$;
the traditional k-mer usage frequency is calculated as follows:
when the sequence in the sliding window matches the specific sequence $i$, the usage count $N_i$ of the specific sequence $i$ is increased by 1, and the k-mer usage frequency $F_i$ is calculated using the following equation (1):
$F_i = N_i / M_k$ (1);
where $M_k$ is the total number of times a sliding window of length $k$ can slide along an RNA sequence.
The specific process of the step 3 is as follows:
the one-to-one correspondence between conventional k-mers and fuzzy k-mers is described by the matrix $A_{M\times N}=[a_{i,j}]$, whose elements take the value 1 or 0:
Figure BDA0002356852210000041
let $x$ be a vector of length $N$ whose element $x_j$ is the count of $u_j$; let $y$ be a vector of length $M$ whose element $y_i$ is the count of $v_i$; the one-to-one correspondence between conventional k-mers and fuzzy k-mers is:
$y = Ax$ (2).
the specific process of the step 4 is as follows:
for RNA transcript X, the feature vector is
Figure BDA0002356852210000042
where
Figure BDA0002356852210000043
is the number of occurrences of the $i$-th fuzzy k-mer in the transcript;
Figure BDA0002356852210000044
is the number of k-mers;
to obtain the minimum $L_2$-norm solution of equation (2), a kernel function is defined as:
Figure BDA0002356852210000045
Figure BDA0002356852210000046
where $N_m(S_1,S_2)$ denotes the number of fuzzy k-mers with $m$ mismatches between sequence $S_1$ and sequence $S_2$;
Figure BDA0002356852210000047
where $r=m_1+m_2-2t-m$ and $b=4$.
The specific process of the step 5 is as follows:
firstly, the adjusted usage frequencies are normalized to the range 0 to 1 using the svm-scale program in the LIBSVM package; then, a support vector machine with a radial basis function kernel is adopted as the binary classifier; the optimized SVM parameter C and kernel parameter gamma are obtained using the grid.py script in the LIBSVM package; during the parameter search, 10-fold cross-validation is used to evaluate the performance of the classification model corresponding to each pair of C and gamma parameters.
The beneficial effects of the invention are as follows:
(1) Theoretical significance: long-chain non-coding RNAs far outnumber protein-coding genes, and their biomedical functions have been confirmed; identifying long-chain non-coding RNA lays a theoretical foundation for research on gene regulation, the cell cycle, and the occurrence and development of complex diseases.
(2) Biological significance: the identification of long-chain non-coding RNA creates conditions for elucidating gene expression regulation mechanisms, can also provide causal explanations for research on species evolution and organism diversity, and deepens the understanding of cell differentiation and genome-level molecular regulation mechanisms.
(3) Application value: implementation of the algorithm will help to systematically and accurately identify long non-coding RNAs in various species and various cells from large-scale high-throughput sequencing data. The algorithm can also be extended to the identification of other types of RNA molecules, such as microRNA (miRNA) and Piwi-interacting RNA (piRNA).
Detailed Description
The present invention will be described in detail with reference to the following embodiments.
The invention discloses a method for identifying lncRNA based on fuzzy k-mer utilization rate, which specifically comprises the following steps:
step 1, data preprocessing:
The present algorithm employs a machine learning approach and therefore requires accurate and authoritative training and test datasets. RefSeq (maintained by NCBI) and GENCODE are two databases that provide comprehensive, well-annotated, non-redundant RNA sequence datasets. Transcript mRNA sequences and annotations for human protein-coding genes can be downloaded from the RefSeq database (release 60), and human long-chain non-coding RNA (lncRNA) can be collected from GENCODE v17. There are 34691 human mRNAs longer than 200 nt and 22389 lncRNAs longer than 200 nt. mRNA and lncRNA annotated as putative, predicted, pseudogene and the like are excluded, ensuring that the processed data are of high quality and reliable.
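The filtering described in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the sequences are available as FASTA text whose header lines carry the annotation, and the names `Transcript`, `parse_fasta`, `keep_transcript`, `EXCLUDED_TERMS` and `MIN_LENGTH` are hypothetical. Matching the excluded annotations by case-insensitive substring search is also an assumption.

```python
from dataclasses import dataclass

# Annotation keywords named in the text; substring matching is an assumption.
EXCLUDED_TERMS = ("putative", "predicted", "pseudogene")
MIN_LENGTH = 200  # only transcripts longer than 200 nt are kept


@dataclass
class Transcript:
    identifier: str
    description: str
    sequence: str


def _make_record(header, chunks):
    # Split "id free text" at the first space; the rest is the annotation.
    identifier, _, description = header.partition(" ")
    return Transcript(identifier, description, "".join(chunks))


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of Transcript records."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def keep_transcript(record):
    """Keep records longer than MIN_LENGTH without an excluded annotation."""
    if len(record.sequence) <= MIN_LENGTH:
        return False
    description = record.description.lower()
    return not any(term in description for term in EXCLUDED_TERMS)
```

A caller would apply `keep_transcript` to the mRNA set and the lncRNA set separately, so that both classes pass through the same filter.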
Step 2, calculating the use frequency of the traditional k-mer:
the calculation of the k-mer usage frequency is related to the width of the sliding window and the step size of the sliding window. The sliding window step length adopted by the invention is 1 base (1 nt). The width of the sliding window varies from 1 to k. For the differentiation of mRNA and lncRNA, k has a value of 6.
For RNA, k-mers have a sequence of k A, C, G or T bases. For k=1 to 6, 5460 sequences (4096+1024+256+64+16+4, 4096 6-mer sequences, 1024 5-mer sequences, 256 4-mer sequences, 64 3-mer sequences, 16 2-mer sequences, 4 1-mer sequences) can be obtained.
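The count of 5460 sequences can be checked by enumerating every k-mer for k = 1 to 6. The sketch below does this with the standard library; the function names `all_kmers` and `kmer_index` are illustrative and not taken from the patent.

```python
from itertools import product

ALPHABET = "ACGT"


def all_kmers(k):
    """Return every string of length k over the alphabet, in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_index(max_k=6):
    """Map each k-mer with 1 <= k <= max_k to a column index in a feature vector."""
    index = {}
    for k in range(1, max_k + 1):
        for kmer in all_kmers(k):
            index[kmer] = len(index)
    return index
```

With `max_k=6` the index holds 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 entries, matching the figure given in the text.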
If the sequence in the sliding window matches a particular sequence $i$, the usage count of sequence $i$ (denoted $N_i$) is increased by 1. $M_k$ is the total number of times a sliding window of length $k$ can slide along an RNA sequence. The usage frequency $F_i$ is calculated as:
$F_i = N_i / M_k$ (1);
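Equation (1) can be sketched in a few lines. The code assumes that $M_k$ equals the number of window positions, that is, the sequence length minus k plus 1 for a step of 1 nt; the text does not spell this out, so treat it as an interpretation. The name `kmer_frequencies` is hypothetical.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Return {k-mer: F_i} with F_i = N_i / M_k for a window of width k, step 1."""
    m_k = len(sequence) - k + 1  # assumed meaning of M_k: number of window positions
    if m_k <= 0:
        return {}
    counts = Counter(sequence[i:i + k] for i in range(m_k))
    return {kmer: n / m_k for kmer, n in counts.items()}
```

For the sequence "ACGTACGT" and k = 2 there are 7 windows, and "AC" occurs twice, so its frequency is 2/7. k-mers that never occur are simply absent from the result, which is where the sparsity for large k comes from.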
For RNA sequences, each base may be any of {A, C, G, T}; {A, C, G, T} is referred to as the alphabet, and its length $l$ is $l=4$. The definitions of a conventional k-mer and a fuzzy k-mer in a sequence are given below.
A fuzzy k-mer has length $n$ and contains $x$ perfectly matched positions.
The set of conventional k-mers is denoted $U=\{u_j\}$, $1 \le j \le N = 4^t$, where $t$ is the sequence length.
The set of fuzzy k-mers is denoted $V=\{v_i\}$,
Figure BDA0002356852210000061
where the sequence length is $n$, of which $n-x$ positions cannot be matched and $x$ positions carry bases; $x$ must be smaller than $n$.
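One way to make the fuzzy k-mer definition concrete is to enumerate patterns of length n in which x positions carry a base and the other n-x positions are wildcards. The sketch below is an illustration under assumptions: the wildcard symbol `*`, the function name `fuzzy_kmers`, and the choice to enumerate every placement of the x base positions are not taken from the patent, whose index formula is only available as an image.

```python
from itertools import combinations, product

ALPHABET = "ACGT"
WILDCARD = "*"  # hypothetical symbol for a position that is not matched


def fuzzy_kmers(n, x):
    """Enumerate patterns of length n with x base positions and n - x wildcards."""
    if not 0 <= x < n:
        raise ValueError("x must satisfy 0 <= x < n")
    patterns = []
    for positions in combinations(range(n), x):
        for bases in product(ALPHABET, repeat=x):
            pattern = [WILDCARD] * n
            for pos, base in zip(positions, bases):
                pattern[pos] = base
            patterns.append("".join(pattern))
    return patterns
```

Under this construction, n = 3 and x = 2 give 3 choices of base positions times 16 base assignments, that is, 48 patterns such as "A*C" or "*GT".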
Step 3, determining the corresponding relation between the traditional k-mer and the fuzzy k-mer:
The one-to-one correspondence between conventional k-mers and fuzzy k-mers is described by the matrix $A_{M\times N}=[a_{i,j}]$, whose elements take the value 1 or 0:
Figure BDA0002356852210000071
Here, "$u_j$ matches $v_i$" means that the bases at exactly the $x$ specified positions of the fuzzy k-mer $v_i$ correspond one-to-one with those of the conventional k-mer $u_j$. Let $x$ be a vector of length $N$ whose element $x_j$ is the count of $u_j$. Let $y$ be a vector of length $M$ whose element $y_i$ is the count of $v_i$.
The one-to-one correspondence between conventional k-mers and fuzzy k-mers is:
$y = Ax$ (2);
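Equation (2) can be illustrated on a toy case. The sketch below assumes that an entry of A is 1 when the conventional k-mer agrees with the fuzzy k-mer at every non-wildcard position and 0 otherwise, which is how the matching rule above reads; the wildcard symbol `*` and the names `matches`, `build_matrix` and `mat_vec` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"
WILDCARD = "*"  # hypothetical symbol for an unmatched position


def matches(kmer, pattern):
    """True if kmer agrees with pattern at every non-wildcard position."""
    return len(kmer) == len(pattern) and all(
        p == WILDCARD or p == c for c, p in zip(kmer, pattern))


def build_matrix(kmers, patterns):
    """Rows follow the fuzzy patterns v_i, columns the conventional k-mers u_j."""
    return [[1 if matches(u, v) else 0 for u in kmers] for v in patterns]


def mat_vec(a, x):
    """Compute y = A x for a list-of-rows matrix and a count vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


# Toy case: 2-mers (N = 16) against patterns with one base position (M = 8).
kmers = ["".join(p) for p in product(ALPHABET, repeat=2)]
patterns = [b + WILDCARD for b in ALPHABET] + [WILDCARD + b for b in ALPHABET]
A = build_matrix(kmers, patterns)
```

In this toy case every 2-mer matches exactly two patterns, so the entries of y sum to twice the entries of x. Because M = 8 is smaller than N = 16, many different x vectors give the same y.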
Step 4, solving the correspondence matrix of the traditional k-mer and the fuzzy k-mer:
In equation (2), the vector $x$ has length $N$ and the vector $y$ has length $M$. Because $M$ is smaller than $N$, the rank of the matrix $A$ is smaller than $N$, so the solution is not unique.
To estimate the corresponding vector $y$, the maximum-entropy solution for the vector $x$ would be optimal. Because that calculation is nonlinear and difficult, an alternative is to find the minimum $L_2$-norm solution of equation (2).
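For reference, the minimum $L_2$-norm solution of an underdetermined linear system has a standard closed form in terms of the Moore-Penrose pseudoinverse. The statement below is the general linear-algebra result, not a formula quoted from the patent; the symbols $\hat{x}$ and $A^{+}$ are introduced here, and the full-row-rank case is an assumption that may not hold for every choice of fuzzy k-mers.

```latex
\hat{x} \;=\; \arg\min_{x}\ \lVert x \rVert_2
\quad \text{subject to} \quad y = A x,
\qquad\text{which gives}\qquad
\hat{x} \;=\; A^{+} y .
% If A has full row rank (an assumption), the pseudoinverse reduces to
% A^{+} = A^{\top} \left( A A^{\top} \right)^{-1}.
```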
For RNA transcript X, the feature vector is
Figure BDA0002356852210000072
where
Figure BDA0002356852210000073
is the number of occurrences of the $i$-th fuzzy k-mer in the transcript, and
Figure BDA0002356852210000074
is the number of k-mers.
To obtain the minimum $L_2$-norm solution of equation (2), a kernel function is defined as:
Figure BDA0002356852210000075
Figure BDA0002356852210000076
where $N_m(S_1,S_2)$ denotes the number of fuzzy k-mers with $m$ mismatches between sequence $S_1$ and sequence $S_2$.
Figure BDA0002356852210000081
where $r=m_1+m_2-2t-m$ and $b=4$ is the number of bases (A, T, G, C). For two k-mers $u_1$ and $u_2$ of length $l$ with $m$ mismatches, there are $l-m$ perfectly matched positions. All possible k-mers of length $l$ can be enumerated; denote such a k-mer by $u$, with $m_1$ bases that do not match $u_1$ and $m_2$ bases that do not match $u_2$. Suppose that, of the $m_1$ mismatches, $t$ are located in the $l-m$ matched positions and the other $m_1-t$ are located in the $m$ mismatched positions. Thus, the $t$ positions can be chosen in
Figure BDA0002356852210000082
ways, and the $t$ mismatched bases can take $(b-1)^t$ values. For the remaining $r$ mismatched positions there are
Figure BDA0002356852210000083
ways to select the positions, with $(b-2)^r$ possible values.
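The counting argument above can be checked numerically. The sketch below sums over t the product of the factors named in the text and compares the result with a brute-force enumeration. Two points are assumptions rather than content of the patent: the factor for a mismatch in a matched position is taken as (b-1)^t, since such a base has b-1 alternatives, and a final binomial factor is added to place the positions where u agrees with u_2 but not u_1, which the text does not mention (its formula is only available as an image). The names `count_formula`, `count_brute_force` and `hamming` are hypothetical.

```python
from itertools import product
from math import comb


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def count_formula(l, m, m1, m2, b=4):
    """Count l-mers u with m1 mismatches to u1 and m2 to u2, given hamming(u1, u2) = m."""
    total = 0
    for t in range(min(l - m, m1, m2) + 1):
        r = m1 + m2 - 2 * t - m   # positions where u differs from both u1 and u2
        only_u1 = m1 - t - r      # positions where u equals u2 but not u1
        only_u2 = m2 - t - r      # positions where u equals u1 but not u2
        if r < 0 or only_u1 < 0 or only_u2 < 0:
            continue
        total += (comb(l - m, t) * (b - 1) ** t   # mismatches in matched positions
                  * comb(m, r) * (b - 2) ** r     # positions differing from both
                  * comb(m - r, only_u1))         # assumed extra placement factor
    return total


def count_brute_force(u1, u2, m1, m2, alphabet="ACGT"):
    """Enumerate every l-mer and count those with the required mismatches."""
    return sum(1 for u in product(alphabet, repeat=len(u1))
               if hamming(u, u1) == m1 and hamming(u, u2) == m2)
```

For short k-mers the two functions can be compared over every pair (m1, m2); agreement on such cases supports the reading of the formula used here, although it does not prove that the patent's image contains the same expression.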
Step 5, training a prediction model by using the fuzzy k-mer:
For any mRNA and lncRNA sequences, the above steps yield the fuzzy k-mer usage vector of each sequence, where the set of usage vectors for mRNA forms the positive class and the set of usage vectors for lncRNA forms the negative class. The positive-class and negative-class datasets are used to train a model that predicts which type a new RNA belongs to.
Here, a support vector machine (LIBSVM) is used as the prediction model, and the new model is tested on an independent dataset. Test datasets from two reliable, authoritative databases, NCBI RefSeq and GENCODE, are selected to test the new model.
First, the svm-scale program in the LIBSVM package (version 3.17) is used to normalize the adjusted usage frequencies to the range 0 to 1. Then, a support vector machine with a radial basis function kernel (with kernel parameter gamma) is used as the binary classifier. The optimized SVM parameter C and kernel parameter gamma are obtained using the grid.py script in the LIBSVM package. During the parameter search, 10-fold cross-validation is used to evaluate the performance of the classification model corresponding to each pair of C and gamma parameters. Finally, an SVM binary classification model with the optimized C and gamma parameters is established.
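The pieces surrounding the SVM training in step 5 can be sketched with the standard library: min-max scaling of each feature to [0, 1], the radial basis function kernel, the split into 10 folds, and a grid of candidate (C, gamma) pairs. This is an illustration only; the actual training and search are delegated to LIBSVM in the text. The function names are hypothetical, mapping a constant column to the lower bound is a choice made here, and the log2 ranges in `parameter_grid` are illustrative values, not parameters stated in the patent.

```python
import math


def scale_columns(rows, lower=0.0, upper=1.0):
    """Min-max scale every feature column of a list of rows to [lower, upper]."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    return [[lower if hi == lo else lower + (upper - lower) * (v - lo) / (hi - lo)
             for v, lo, hi in zip(row, mins, maxs)]
            for row in rows]


def rbf_kernel(u, v, gamma):
    """Radial basis function kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def k_fold_indices(n_samples, k=10):
    """Assign sample indices to k folds in round-robin order."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def parameter_grid(log2c=range(-5, 16, 2), log2g=range(3, -16, -2)):
    """Candidate (C, gamma) pairs on a log2 grid (illustrative ranges)."""
    return [(2.0 ** c, 2.0 ** g) for c in log2c for g in log2g]
```

Each (C, gamma) pair from the grid would be scored by training on nine folds and testing on the held-out fold, repeated for all ten folds, and the best-scoring pair kept for the final model.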
The invention will be further described with reference to the application principles.
To identify lncRNA, other RNAs must be excluded accurately. First, according to the general characteristics of lncRNA, only RNA sequences longer than 200 bases are considered in the present invention. In addition, lncRNA must be distinguished from mRNA. Building an accurate classification model is important, and training and testing the model requires datasets that are as accurate as possible, ideally gold-standard datasets. The method uses the NCBI RefSeq and GENCODE databases, which are currently the most authoritative and widely accepted datasets and are used by most methods. Unlike other methods, the present invention excludes mRNA and lncRNA annotated as putative, predicted, pseudogene and the like, because the annotations of these data are inaccurate. The data remaining after this processing are more accurate and reliable.
The value of k in the k-mer is variable, and different methods use different values. When k is too small, the characteristic features of the sequence cannot be captured; when k is too large, the computational complexity rises sharply. Selecting an appropriate k is therefore critical. PLEK is an existing algorithm for predicting lncRNA. In PLEK, the value of k is 5, a threshold obtained by weighing accuracy against computational complexity. Because the present algorithm uses fuzzy k-mers, in which some of the k bases may be unspecified, k=6 is used in the present invention in order to capture enough information in the sequence.
Fuzzy k-mers can be mapped to conventional k-mers. This has two advantages: it avoids the rapid growth in computation caused by an overly large k, and it avoids the failure to represent sequence characteristics caused by an overly small k. The correspondence between the fuzzy k-mer and the traditional k-mer in the invention is as follows:
The one-to-one correspondence between the traditional k-mer and the fuzzy k-mer is $A_{M\times N}=[a_{i,j}]$, whose elements take the value 1 or 0:
Figure BDA0002356852210000101
Here, "$u_j$ matches $v_i$" means that the bases at exactly the $x$ specified positions of the fuzzy k-mer $v_i$ correspond one-to-one with those of the conventional k-mer $u_j$.
The solution of the one-to-one correspondence matrix of the conventional k-mers and fuzzy k-mers is not unique. A feasible approach is to obtain the minimum $L_2$-norm solution. For RNA transcript X, the feature vector is
Figure BDA0002356852210000102
for the number of occurrences of the $i$-th fuzzy k-mer in the transcript, and
Figure BDA0002356852210000103
is the number of k-mers. The kernel function is defined as:
Figure BDA0002356852210000104
The identification model adopts a support vector machine, which has excellent classification performance. For the features of the computation space obtained with $y=Ax$, a support vector machine is used to train the model, with 10-fold cross-validation applied during training. The obtained model parameters are tested using the test data. Finally, the optimal parameters obtained are taken as the model parameters.
The method addresses the low prediction accuracy or high computational complexity of the prior art. The invention can screen reliable mRNA and lncRNA; it can identify long-chain non-coding RNA, which lays a foundation for research on the functions of long-chain non-coding RNA and provides a new approach to explaining the mechanisms by which diseases occur and develop.
When k is large, the k-mer usage matrix is sparse. The fuzzy k-mer feature is therefore carefully designed to have the following properties: (1) Low computational complexity. Datasets keep growing and consume more and more computing resources, so features should be chosen that are convenient to compute and have low computational complexity. (2) The ability to reflect the biological characteristics of the RNA sequence. The purpose of computational features is to solve biological problems; suitable features reflect the essential biological characteristics of the RNA sequence and give good results. Such features are easy to understand and interpret, and they lay a good foundation for subsequent data analysis and functional experiments. (3) Strong generalization ability. The features designed here are first of all suited to sequence analysis of long non-coding RNAs and mRNAs; to make the best possible use of them, it is desirable to extend them to other biological problems, such as predicting miRNAs and piRNAs.
The correspondence between the conventional k-mers and fuzzy k-mers is not unique, so a suitable, or "optimal", correspondence must be selected. Which mapping, then, is optimal? The invention adopts the minimum $L_2$-norm solution. The software implementation and the final test results indicate that the method provided by the invention is reasonable.
In feature space, the similarity measure between samples is critical. For fuzzy k-mers, a suitable kernel function is designed to calculate the similarity between two samples, and it meets requirements such as ease of implementation and low computational complexity.
Previous studies of biological problems have used traditional k-mers as biological features. The invention is the first to use fuzzy k-mers to study the sequence characteristics of protein-coding RNA (mRNA) and long non-coding RNA. It is also the first to study the identification of long-chain non-coding RNA based on fuzzy k-mers, and it designs a support vector machine kernel function for the fuzzy k-mer. The results of the invention can be extended to problems such as gene identification, piRNA prediction, category-specific motif discovery and miRNA classification.

Claims (3)

1. A method for identifying lncRNA based on fuzzy k-mer usage, characterized by: the method specifically comprises the following steps:
step 1, preprocessing RNA sequence data;
step 2, defining a traditional k-mer and a fuzzy k-mer, and calculating the use frequency of the traditional k-mer;
the specific process of the step 2 is as follows:
the definition of a conventional k-mer is: the set of conventional k-mers is denoted $U=\{u_j\}$, $1 \le j \le N = 4^t$, where $t$ is the sequence length;
the definition of fuzzy k-mers is: the set of fuzzy k-mers is denoted
Figure FDA0004070037050000011
where the sequence length is $n$, of which $n-x$ positions cannot be matched and $x$ positions carry bases, and $x$ is required to be smaller than $n$;
the traditional k-mer usage frequency is calculated as follows:
when the sequence in the sliding window matches the specific sequence $i$, the usage count $N_i$ of the specific sequence $i$ is increased by 1, and the k-mer usage frequency $F_i$ is calculated using the following equation (1):
$F_i = N_i / M_k$ (1);
where $M_k$ is the total number of times a sliding window of length $k$ can slide along the RNA sequence;
step 3, determining the corresponding relation between the traditional k-mer and the fuzzy k-mer;
the specific process of the step 3 is as follows:
the one-to-one correspondence between conventional k-mers and fuzzy k-mers is described by the matrix $A_{M\times N}=[a_{i,j}]$, whose elements take the value 1 or 0:
Figure FDA0004070037050000021
let $x$ be a vector of length $N$ whose element $x_j$ is the count of $u_j$; let $y$ be a vector of length $M$ whose element $y_i$ is the count of $v_i$; the one-to-one correspondence between conventional k-mers and fuzzy k-mers is:
$y = Ax$ (2);
step 4, solving the correspondence matrix $c_m$ of the traditional k-mer and the fuzzy k-mer;
The specific process of the step 4 is as follows:
for RNA transcript X, the feature vector is
Figure FDA0004070037050000022
where
Figure FDA0004070037050000023
is the number of occurrences of the $i$-th fuzzy k-mer in the transcript;
Figure FDA0004070037050000024
is the number of k-mers;
to obtain the minimum $L_2$-norm solution of equation (2), a kernel function is defined as:
Figure FDA0004070037050000025
Figure FDA0004070037050000026
where $N_m(S_1,S_2)$ denotes the number of fuzzy k-mers with $m$ mismatches between sequence $S_1$ and sequence $S_2$;
Figure FDA0004070037050000027
where $r=m_1+m_2-2t-m$ and $b=4$;
step 5, training a prediction model by using the fuzzy k-mer.
2. The method for identifying lncRNA based on fuzzy k-mer usage of claim 1, wherein: the specific process of the step 1 is as follows:
transcript mRNA sequences and annotations of human protein-coding genes are downloaded from the RefSeq database, human long non-coding RNA is collected from GENCODE v17, and mRNA and lncRNA annotated as putative, predicted or pseudogene are excluded.
3. The method for identifying lncRNA based on fuzzy k-mer usage of claim 1, wherein: the specific process of the step 5 is as follows:
firstly, the adjusted usage frequencies are normalized to the range 0 to 1 using the svm-scale program in the LIBSVM package; then, a support vector machine with a radial basis function kernel is adopted as the binary classifier; the optimized SVM parameter C and kernel parameter gamma are obtained using the grid.py script in the LIBSVM package; during the parameter search, 10-fold cross-validation is used to evaluate the performance of the classification model corresponding to each pair of C and gamma parameters.
CN202010010135.1A 2020-01-06 2020-01-06 Method for identifying lncRNA based on fuzzy k-mer utilization rate Active CN111223522B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010010135.1A CN111223522B (en) 2020-01-06 2020-01-06 Method for identifying lncRNA based on fuzzy k-mer utilization rate

Publications (2)

Publication Number Publication Date
CN111223522A CN111223522A (en) 2020-06-02
CN111223522B true CN111223522B (en) 2023-04-28

Family

ID=70832290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010010135.1A Active CN111223522B (en) 2020-01-06 2020-01-06 Method for identifying lncRNA based on fuzzy k-mer utilization rate

Country Status (1)

Country Link
CN (1) CN111223522B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994645B (en) * 2023-08-01 2024-04-09 西安理工大学 Prediction method of piRNA and mRNA target pair based on interactive reasoning network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010014636A1 (en) * 2008-07-28 2010-02-04 Rockefeller University Methods for identifying rna segments bound by rna-binding proteins or ribonucleoprotein complexes
CN108595913A (en) * 2018-05-11 2018-09-28 武汉理工大学 Differentiate the supervised learning method of mRNA and lncRNA
CN108876001A (en) * 2018-05-03 2018-11-23 东北大学 A kind of Short-Term Load Forecasting Method based on twin support vector machines

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7907769B2 (en) * 2004-05-13 2011-03-15 The Charles Stark Draper Laboratory, Inc. Image-based methods for measuring global nuclear patterns as epigenetic markers of cell differentiation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
常征 ; 孟军 ; 施云生 ; 莫冯然 ; .多特征融合的lncRNA识别与其功能预测.智能系统学报.2018,(06),全文. *
李俊豪 ; 杨建华 ; 屈良鹄 ; .生物信息学在长非编码RNA研究中的应用.生理科学进展.2016,(03),全文. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant