CN111223522B

CN111223522B - Method for identifying lncRNA based on fuzzy k-mer utilization rate

Info

Publication number: CN111223522B
Application number: CN202010010135.1A
Authority: CN
Inventors: 李爱民; 费蓉; 刘雅君; 周红芳; 刘光明; 王磊; 黑新宏; 周中银
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2020-01-06
Filing date: 2020-01-06
Publication date: 2023-04-28
Anticipated expiration: 2040-01-06
Also published as: CN111223522A

Abstract

The invention discloses a method for identifying lncRNA based on fuzzy k-mer utilization rate, which comprises the following steps: step 1, preprocessing RNA sequence data; step 2, defining traditional k-mers and fuzzy k-mers, and calculating the usage frequency of the traditional k-mers; step 3, determining the correspondence between the traditional k-mers and the fuzzy k-mers; step 4, solving the correspondence matrix $c_m$ of the traditional k-mers and the fuzzy k-mers; and step 5, training a prediction model using the fuzzy k-mers. Implementing the invention helps to identify long non-coding RNA systematically and accurately in various species and cell types from large-scale high-throughput sequencing data.

Description

Method for identifying lncRNA based on fuzzy k-mer utilization rate

Technical Field

The invention belongs to the technical field of identification of long-chain non-coding RNA (lncRNA), and relates to a method for identifying lncRNA based on fuzzy k-mer utilization rate.

Background

In molecular biology, non-coding RNAs are a current research hotspot. Among them, microRNA (miRNA) and long non-coding RNA (lncRNA) are of particular research importance. Research on microRNA is relatively mature, and scientists are now investigating long non-coding RNA, which has important biomedical functions.

Long non-coding RNAs were originally thought to be mere by-products of genome transcription, simple "noise" without any biological function. As the functions of non-coding RNA genes such as Xist and HOTAIR were gradually discovered, long non-coding RNA genes were found to have very important functions and to greatly outnumber protein-coding genes. The functions of long non-coding RNA mainly include transcriptional interference, regulation of gene expression, transcriptional activation, X-chromosome inactivation, nuclear transport, genomic imprinting, and chromatin modification, and they are closely related to the occurrence, development, diagnosis, and treatment of diseases.

Identifying long non-coding RNA is a prerequisite for studying it and is important foundational work at the research frontier. Identifying long non-coding RNAs from raw transcriptome experimental data is not easy; it requires multiple steps of calculation and analysis that combine various data and tools. One of the critical steps is assessing the coding capacity of transcripts.

An algorithm named PLEK, which distinguishes protein-coding genes from long non-coding RNA genes using k-mer features, was published in BMC Bioinformatics. The algorithm is particularly useful for identifying long non-coding RNAs in large-scale de novo assembled transcriptomes. Extensive experiments show that accuracy increases as k increases, but so does the amount of computation. To balance accuracy against computational effort, k=5 was finally chosen. In addition, as k increases, the k-mer calculation produces a sparse matrix, that is, most of the calculated k-mer frequencies are 0. The calculation of k-mers is also affected when SNPs or indels are present in the transcripts.

In view of the above problems, a method for identifying lncRNA based on fuzzy k-mer utilization rate is provided, in which the fuzzy k-mer makes the calculation of k-mer usage frequency more robust.

Disclosure of Invention

The invention aims to provide a method for identifying lncRNA based on fuzzy k-mer utilization rate, which applies strict filtering conditions to collect reliable mRNA and lncRNA sequences, so that the output of the subsequent identification model is more reliable and systematic error is reduced; and which adopts fuzzy k-mers, so that the computational complexity is reduced.

The technical scheme adopted by the invention is that the method for identifying the lncRNA based on the fuzzy k-mer utilization rate comprises the following steps:

step 1, preprocessing RNA sequence data;

step 2, defining a traditional k-mer and a fuzzy k-mer, and calculating the use frequency of the traditional k-mer;

step 3, determining the corresponding relation between the traditional k-mer and the fuzzy k-mer;

step 4, solving the correspondence matrix $c_m$ of the traditional k-mers and the fuzzy k-mers;

step 5, training a prediction model using the fuzzy k-mers.

The present invention is also characterized in that,

the specific process of the step 1 is as follows:

transcript mRNA sequences and annotations of human protein-coding genes are downloaded from the RefSeq database, human long non-coding RNAs are collected from GENCODE v17, and mRNAs and lncRNAs annotated as putative, predicted, or pseudogene are excluded.

The specific process of the step 2 is as follows:

the definition of a conventional k-mer is: the set of conventional k-mers is denoted $U = \{u_j\}$, $1 \le j \le N = 4^t$, where t is the sequence length;

the definition of a fuzzy k-mer is: the set of fuzzy k-mers is denoted $V = \{v_i\}$; each fuzzy k-mer has sequence length n, of which x positions carry specified bases and the remaining $n - x$ positions are left unmatched, and x is required to be smaller than n;

the traditional k-mer use frequency calculation process is as follows:

when the sequence in the sliding window matches a specific sequence i, the usage count $N_i$ of that sequence is increased by 1, and the k-mer usage frequency $F_i$ is calculated using equation (1):

$F_i = N_i / M_k$ (1);

wherein $M_k$ is the total number of positions that a sliding window of length k can take as it slides along the RNA sequence.

The specific process of the step 3 is as follows:

the correspondence between the conventional k-mers and the fuzzy k-mers is expressed as a matrix $A_{M \times N} = [a_{i,j}]$ whose elements take the value 1 or 0: $a_{i,j} = 1$ if $u_j$ matches $v_i$, and $a_{i,j} = 0$ otherwise;

let x be a vector of length N whose element $x_j$ is the number of occurrences of $u_j$, and let y be a vector of length M whose element $y_i$ is the number of occurrences of $v_i$; the correspondence between the conventional k-mers and the fuzzy k-mers is then:

$y = Ax$ (2).

the specific process of the step 4 is as follows:

for an RNA transcript X, the feature vector is defined from the number of occurrences of the i-th fuzzy k-mer in the transcript and the number of k-mers;

to obtain the minimum $L_2$-norm solution of equation (2), a kernel function is defined in which $N_m(S_1, S_2)$ denotes the number of fuzzy k-mers with m mismatches between sequence $S_1$ and sequence $S_2$, $r = m_1 + m_2 - 2t - m$, and $b = 4$.

The specific process of the step 5 is as follows:

first, the adjusted usage frequencies are normalized to the range 0 to 1 using the svm-scale program in the LIBSVM package; then, a support vector machine with a radial basis function kernel is adopted as a binary classifier; the optimized support vector machine parameter C and kernel parameter gamma are obtained using the grid.py script in the LIBSVM package; during the parameter search, 10-fold cross-validation is used to evaluate the performance of the classification model corresponding to each pair of C and gamma parameters.

The beneficial effects of the invention are as follows:

(1) Theoretical significance: long non-coding RNAs far outnumber protein-coding genes, and their biomedical functions have been confirmed; identifying long non-coding RNA lays a theoretical foundation for research on gene regulation, the cell cycle, and the occurrence and development of complex diseases.

(2) Biological significance: identifying long non-coding RNA creates the conditions for elucidating mechanisms of gene expression regulation, can help explain species evolution and organism diversity, and deepens the understanding of cell differentiation and genome-level molecular regulation mechanisms.

(3) Application value: implementing the algorithm helps to identify long non-coding RNAs systematically and accurately in various species and cell types from large-scale high-throughput sequencing data. The algorithm can also be extended to the identification of other types of RNA molecules, such as microRNA (miRNA) and Piwi-interacting RNA (piRNA).

Detailed Description

The present invention will be described in detail with reference to the following embodiments.

The invention discloses a method for identifying lncRNA based on fuzzy k-mer utilization rate, which specifically comprises the following steps:

step 1, data preprocessing:

the present algorithm employs a machine learning approach and therefore requires accurate and authoritative training and test datasets. The NCBI RefSeq database and the GENCODE database provide comprehensive, well-annotated, non-redundant RNA sequence datasets. Transcript mRNA sequences and annotations of human protein-coding genes can be downloaded from the RefSeq database (release 60), and human long non-coding RNAs (lncRNA) can be collected from GENCODE v17. In human there are 34691 mRNAs longer than 200 nt and 22389 lncRNAs longer than 200 nt. mRNAs and lncRNAs annotated as putative, predicted, pseudogene, and the like are excluded, which ensures that the processed data are of high quality and reliable.
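The filtering described in step 1 can be sketched as follows. This is a minimal illustration, assuming the sequences are available as FASTA text and that the annotation terms appear in the record header; the source does not say how the annotations are stored, and the function names `parse_fasta`, `keep_record`, and `filter_records` are hypothetical.

```python
EXCLUDED_TERMS = ("putative", "predicted", "pseudogene")
MIN_LENGTH = 200  # nt; only transcripts longer than this are kept


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def keep_record(header, sequence):
    """Keep long transcripts whose header has none of the excluded terms.

    Matching on the header text is an assumption made for illustration.
    """
    lowered = header.lower()
    if any(term in lowered for term in EXCLUDED_TERMS):
        return False
    return len(sequence) > MIN_LENGTH


def filter_records(text):
    """Return the (header, sequence) pairs that pass both filters."""
    return [(h, s) for h, s in parse_fasta(text) if keep_record(h, s)]
```

Applied to a RefSeq or GENCODE FASTA file read into a string, `filter_records` would return only the records that are longer than 200 nt and free of the excluded annotation terms.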

Step 2, calculating the use frequency of the traditional k-mer:

the calculation of the k-mer usage frequency is related to the width of the sliding window and the step size of the sliding window. The sliding window step length adopted by the invention is 1 base (1 nt). The width of the sliding window varies from 1 to k. For the differentiation of mRNA and lncRNA, k has a value of 6.

For RNA, a k-mer is a sequence of k bases, each of which is A, C, G or T. For k=1 to 6, a total of 5460 sequences is obtained (4096+1024+256+64+16+4: 4096 6-mer sequences, 1024 5-mer sequences, 256 4-mer sequences, 64 3-mer sequences, 16 2-mer sequences, and 4 1-mer sequences).

If the sequence in the sliding window matches a particular sequence i, the usage count of that sequence (denoted $N_i$) is increased by 1. $M_k$ is the total number of positions that a sliding window of length k can take as it slides along the RNA sequence. The usage frequency $F_i$ is calculated as follows:

$F_i = N_i / M_k$ (1).
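The frequency calculation of equation (1) can be sketched as below, assuming a window step of 1 nt and window widths from 1 to k as described in this step. The function names are illustrative, and taking $M_k$ as the sequence length minus k plus 1 is one reading of "the total number of times a sliding window of length k can slide along an RNA sequence".

```python
from itertools import product

BASES = "ACGT"


def all_kmers(max_k):
    """Enumerate every sequence of length 1..max_k over the alphabet."""
    return ["".join(p) for k in range(1, max_k + 1)
            for p in product(BASES, repeat=k)]


def kmer_usage(sequence, max_k=6):
    """Return {k-mer: F_i} with F_i = N_i / M_k, window step of 1 nt."""
    sequence = sequence.upper()
    counts = {kmer: 0 for kmer in all_kmers(max_k)}
    windows = {}
    for k in range(1, max_k + 1):
        m_k = max(len(sequence) - k + 1, 0)  # window positions for width k
        windows[k] = m_k
        for start in range(m_k):
            window = sequence[start:start + k]
            if window in counts:  # skips windows containing other symbols
                counts[window] += 1
    return {kmer: (n / windows[len(kmer)] if windows[len(kmer)] else 0.0)
            for kmer, n in counts.items()}
```

With `max_k=6` the returned dictionary has 5460 entries, one per conventional k-mer of length 1 to 6.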

For RNA sequences, each base may be any of {A, C, G, T}; the set {A, C, G, T} is referred to as the alphabet, with alphabet length l=4. The definitions of a conventional k-mer and a fuzzy k-mer in a sequence are given below.

A fuzzy k-mer has length n and contains x perfectly matched positions.

The set of conventional k-mers is denoted $U = \{u_j\}$, $1 \le j \le N = 4^t$, where t is the sequence length.

The set of fuzzy k-mers is denoted $V = \{v_i\}$. Each fuzzy k-mer has sequence length n, of which x positions carry specified bases and the remaining $n - x$ positions are left unmatched; x must be smaller than n.
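As an illustration of this definition, the sketch below enumerates fuzzy k-mers of length n with exactly x specified positions. The wildcard symbol `*` and the function name are assumptions introduced here; the source does not prescribe a notation for the unmatched positions.

```python
from itertools import combinations, product

BASES = "ACGT"
WILDCARD = "*"  # hypothetical symbol for an unspecified position


def fuzzy_kmers(n, x):
    """Enumerate length-n patterns with exactly x specified bases."""
    if not 0 < x < n:
        raise ValueError("x must satisfy 0 < x < n")
    patterns = []
    for positions in combinations(range(n), x):
        for bases in product(BASES, repeat=x):
            pattern = [WILDCARD] * n
            for pos, base in zip(positions, bases):
                pattern[pos] = base
            patterns.append("".join(pattern))
    return patterns
```

Under this reading there are "n choose x" ways to pick the specified positions and 4 bases for each of them, so for example `fuzzy_kmers(3, 2)` yields 3 × 16 = 48 patterns such as `A*C`.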

Step 3, determining the corresponding relation between the traditional k-mer and the fuzzy k-mer:

Here, "$u_j$ matches $v_i$" means that the bases at exactly the x specified positions of the fuzzy k-mer $v_i$ coincide with the bases at the same positions of the conventional k-mer $u_j$. Let x be a vector of length N whose element $x_j$ is the number of occurrences of $u_j$. Let y be a vector of length M whose element $y_i$ is the number of occurrences of $v_i$.

The correspondence between the traditional k-mers and the fuzzy k-mers is:

$y = Ax$ (2).
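Equation (2) can be made concrete on a small case. The sketch below builds the 0/1 matrix A for given n and x and applies $y = Ax$ to the conventional k-mer counts of a sequence. The wildcard symbol `*`, the helper names, and the row and column ordering are assumptions for illustration only.

```python
from itertools import combinations, product

BASES = "ACGT"
WILDCARD = "*"  # hypothetical symbol for an unspecified position


def conventional_kmers(n):
    return ["".join(p) for p in product(BASES, repeat=n)]


def fuzzy_kmers(n, x):
    out = []
    for positions in combinations(range(n), x):
        for bases in product(BASES, repeat=x):
            pattern = [WILDCARD] * n
            for pos, base in zip(positions, bases):
                pattern[pos] = base
            out.append("".join(pattern))
    return out


def matches(u, v):
    """True when u agrees with v at every specified position of v."""
    return all(b == WILDCARD or a == b for a, b in zip(u, v))


def correspondence_matrix(n, x):
    """Return (U, V, A) with A[i][j] = 1 if U[j] matches V[i], else 0."""
    U, V = conventional_kmers(n), fuzzy_kmers(n, x)
    A = [[1 if matches(u, v) else 0 for u in U] for v in V]
    return U, V, A


def fuzzy_counts(sequence, n, x):
    """Count conventional n-mers (vector x), then apply y = A x."""
    U, V, A = correspondence_matrix(n, x)
    index = {u: j for j, u in enumerate(U)}
    x_vec = [0] * len(U)
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        if window in index:
            x_vec[index[window]] += 1
    y_vec = [sum(a * c for a, c in zip(row, x_vec)) for row in A]
    return dict(zip(V, y_vec))
```

For the sequence `ACGACT` with n=3 and x=2, the windows are ACG, CGA, GAC and ACT, so the fuzzy k-mer `AC*` receives a count of 2 and `A*G` a count of 1.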

Step 4, solving the correspondence matrix of the traditional k-mers and the fuzzy k-mers:

In equation (2), the vector x has length N and the vector y has length M. Because M is smaller than N, the rank of matrix A is smaller than N, so the solution x of equation (2) is not unique.

To estimate x from the vector y, the maximum-entropy solution would be optimal. Because that calculation is nonlinear and difficult, an alternative is to find the minimum $L_2$-norm solution of equation (2).
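For reference, the minimum $L_2$-norm solution of a consistent underdetermined linear system has a standard closed form through the Moore-Penrose pseudoinverse. The hat notation and the symbol $A^{+}$ are introduced here for illustration and do not appear in the source.

```latex
\hat{x} \;=\; \arg\min_{x \,:\, Ax = y} \lVert x \rVert_2 \;=\; A^{+} y,
\qquad
A^{+} = A^{\mathsf T}\,(A A^{\mathsf T})^{-1}
\quad \text{when } A \text{ has full row rank.}
```

When A does not have full row rank, the pseudoinverse still exists and $A^{+} y$ remains the minimum-norm solution, but the explicit product form on the right no longer applies.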

For an RNA transcript X, the feature vector is defined from the number of occurrences of the i-th fuzzy k-mer in the transcript and the number of k-mers.

To obtain the minimum $L_2$-norm solution of equation (2), a kernel function is defined in which $N_m(S_1, S_2)$ denotes the number of fuzzy k-mers with m mismatches between sequence $S_1$ and sequence $S_2$, $r = m_1 + m_2 - 2t - m$, and $b = 4$ is the number of bases (A, T, G, C).

For two k-mers $u_1$ and $u_2$ of length l with m mismatches between them, there are $l - m$ perfectly matched positions. Enumerate all possible k-mers of length l and denote one of them by u, which has $m_1$ mismatched bases with $u_1$ and $m_2$ mismatched bases with $u_2$. Suppose that, of the $m_1$ mismatches, t lie in the $l - m$ matched positions and $m_1 - t$ lie in the m mismatched positions. There are then $\binom{l-m}{t}$ ways to choose the positions of the t mismatches, with $(b-1)^t$ possible base assignments, because each such position can hold any base other than the one shared by $u_1$ and $u_2$. For the remaining r mismatched positions there are $\binom{m}{r}$ ways to choose the positions, with $(b-2)^r$ possible base assignments.
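The counting argument can be checked numerically on a small case. The sketch below compares a closed-form count with brute-force enumeration. It takes $(b-1)^t$ base choices for the t positions and $(b-2)^r$ for the r positions, and it adds one further factor, $\binom{m-r}{m_1-t-r}$, for the positions where u agrees with $u_2$ but not with $u_1$; that last factor is an assumption needed to complete the count and is not stated in the text. Function names are illustrative.

```python
from itertools import product
from math import comb

BASES = "ACGT"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for p, q in zip(a, b) if p != q)


def count_by_formula(l, m, m1, m2, b=4):
    """Count length-l sequences u with m1 mismatches to u1 and m2 to u2,
    where u1 and u2 themselves differ at m positions."""
    total = 0
    for t in range(0, l - m + 1):
        r = m1 + m2 - 2 * t - m      # positions where u differs from both
        only_u1 = m1 - t - r         # positions where u equals u2 but not u1
        if r < 0 or only_u1 < 0 or only_u1 > m - r:
            continue
        total += (comb(l - m, t) * (b - 1) ** t
                  * comb(m, r) * (b - 2) ** r
                  * comb(m - r, only_u1))
    return total


def count_by_enumeration(u1, u2, m1, m2):
    """Brute-force count over all sequences of the same length."""
    return sum(1 for p in product(BASES, repeat=len(u1))
               if hamming(p, u1) == m1 and hamming(p, u2) == m2)
```

For example, with `u1 = "ACGT"` and `u2 = "ACTA"` (so l=4 and m=2), `count_by_formula(4, 2, m1, m2)` can be compared with `count_by_enumeration(u1, u2, m1, m2)` for every pair of $m_1$ and $m_2$ between 0 and 4.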

Step 5, training a prediction model by using the fuzzy k-mer:

For any mRNA and lncRNA sequences, the above steps yield the fuzzy k-mer utilization vector of each sequence; the set of utilization vectors of mRNA forms the positive class and the set of utilization vectors of lncRNA forms the negative class. The positive and negative class datasets are used to train a model that predicts which type a new RNA belongs to.

Here, a support vector machine (LIBSVM) is used as the prediction model, and the new model is tested on an independent dataset. Test datasets from two authoritative and reliable databases, NCBI RefSeq and GENCODE, are selected to test the new model.

First, the svm-scale program in the LIBSVM package (version 3.17) is used to normalize the adjusted usage frequencies to the range 0 to 1. Then, a support vector machine with a radial basis function kernel (with kernel parameter gamma) is used as a binary classifier. The optimized support vector machine parameter C and kernel parameter gamma are obtained using the grid.py script in the LIBSVM package. During the parameter search, 10-fold cross-validation is used to evaluate the performance of the classification model corresponding to each pair of C and gamma parameters. Finally, a binary SVM prediction model with the optimized C and gamma parameters is established.
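The bookkeeping around this training procedure can be sketched in plain Python. The code below is not LIBSVM and does not train an SVM. It only illustrates min-max scaling to the range 0 to 1, a log2 grid of (C, gamma) pairs, and a 10-fold partition of sample indices. The default grid ranges are assumed to follow grid.py's documented defaults; the source does not state which ranges were searched, and treating constant features as 0 is a choice made here.

```python
import random


def scale_01(rows):
    """Min-max scale each feature column to [0, 1].

    Constant columns are mapped to 0.0 (an assumption of this sketch).
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]


def parameter_grid(log2c=(-5, 15, 2), log2g=(3, -15, -2)):
    """All (C, gamma) pairs on a log2 grid given as (begin, end, step)."""
    def axis(begin, end, step):
        values, e = [], begin
        while (step > 0 and e <= end) or (step < 0 and e >= end):
            values.append(2.0 ** e)
            e += step
        return values
    return [(c, g) for c in axis(*log2c) for g in axis(*log2g)]


def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds test folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::n_folds] for i in range(n_folds)]
```

In a full pipeline, each (C, gamma) pair from `parameter_grid` would be scored by training on nine folds and testing on the tenth, and the pair with the best cross-validated accuracy would be kept. The scaling ranges learned on the training data would also be reused for the test data.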

The invention will be further described with reference to the application principles.

To identify lncRNA, other RNAs must be excluded accurately. First, in line with the general characteristics of lncRNA, the invention considers only RNA sequences longer than 200 bases. In addition, lncRNA must be distinguished from mRNA. Building an accurate classification model is essential, and training and testing that model requires datasets that are as accurate as possible, ideally gold-standard datasets. The method uses the NCBI RefSeq and GENCODE databases, which are currently the most authoritative and widely accepted and are used by most methods. Unlike other methods, the invention excludes mRNAs and lncRNAs carrying annotations such as putative, predicted, or pseudogene, because the annotations of these records are inaccurate. The data remaining after this processing are more accurate and reliable.

The k in k-mer is variable, and different methods use different values of k. When k is too small, the characteristic features of the sequence cannot be captured; when k is too large, the computational complexity rises sharply. Selecting an appropriate k is therefore critical. PLEK is an existing algorithm for predicting lncRNA. In PLEK, k is 5, a value obtained by weighing accuracy against computational complexity. Because the present algorithm uses fuzzy k-mers, in which some of the k bases may be unspecified, the invention uses k=6 in order to capture enough information from the sequence.

Fuzzy k-mers can be mapped to conventional k-mers. This has two advantages: it avoids the rapid growth in computation caused by an overly large k, and it avoids the failure to represent sequence characteristics caused by an overly small k. The correspondence between the fuzzy k-mers and the conventional k-mers in the invention is as follows:

the correspondence between the conventional k-mers and the fuzzy k-mers is the matrix $A_{M \times N} = [a_{i,j}]$, whose elements take the value 1 or 0: $a_{i,j} = 1$ if $u_j$ matches $v_i$, and $a_{i,j} = 0$ otherwise.

Here, "$u_j$ matches $v_i$" means that the bases at exactly the x specified positions of the fuzzy k-mer $v_i$ coincide with the bases at the same positions of the conventional k-mer $u_j$.

The solution of the correspondence matrix of the conventional k-mers and the fuzzy k-mers is not unique. A feasible approach is to take the minimum $L_2$-norm solution. For an RNA transcript X, the feature vector is defined from the number of occurrences of the i-th fuzzy k-mer in the transcript and the number of k-mers. A kernel function is then defined for these features.

The identification model is a support vector machine, which has excellent classification performance. For the features obtained with $y = Ax$, a support vector machine is used to train the model with 10-fold cross-validation. The resulting model parameters are evaluated on the test data, and the optimal parameters are finally taken as the model parameters.

The method solves the problems of low prediction accuracy or high calculation complexity in the prior art; the invention can screen reliable mRNA and lncRNA; the invention can identify long-chain non-coding RNA, lays a foundation for researching the functions of the long-chain non-coding RNA, and provides a new idea for explaining the occurrence and development mechanism of diseases.

When k is large, the k-mer usage forms a sparse matrix. A carefully designed fuzzy k-mer feature should therefore have the following properties. (1) Low computational complexity. Datasets keep growing and consume ever more computing resources, so features should be chosen that are convenient to compute and inexpensive. (2) The ability to reflect the biological characteristics of the RNA sequence. Computational features exist to solve biological problems; suitable features reflect the essential biological characteristics of the RNA sequence and give good results. Such features are easy to understand and interpret, and they lay a good foundation for subsequent data analysis and functional experiments. (3) Strong generalization. The features are designed first of all for sequence analysis of long non-coding RNAs and mRNAs, and to make the best use of them it is desirable to extend them to other biological problems, such as predicting miRNAs and piRNAs.

The correspondence between the conventional k-mers and the fuzzy k-mers is not unique, so a suitable, or "optimal", correspondence must be selected. Which mapping is optimal? The invention adopts the minimum $L_2$-norm solution. The software implementation and the final test results show that the method provided by the invention is reasonable.

In feature space, the similarity measure between samples is critical. For fuzzy k-mers, a suitable kernel function is designed to calculate the similarity between two samples, and it meets requirements such as ease of implementation and low computational complexity.
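A mismatch-based similarity of this kind can be sketched as follows. The code reads $N_m(S_1, S_2)$ as the number of pairs of equal-length windows, one from each sequence, that differ at exactly m positions, and it assumes that the kernel is a weighted sum of these counts. The extracted text does not show the kernel formula, so both the weighted-sum form and the coefficient values passed in are placeholders rather than the patent's definition.

```python
from collections import Counter


def windows(sequence, length):
    """All contiguous substrings of the given length."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def mismatch_profile(s1, s2, length):
    """Return Counter {m: number of window pairs with m mismatches}."""
    profile = Counter()
    for a in windows(s1, length):
        for b in windows(s2, length):
            m = sum(1 for p, q in zip(a, b) if p != q)
            profile[m] += 1
    return profile


def kernel_value(s1, s2, length, coefficients):
    """Weighted sum of the mismatch profile: sum over m of N_m * c_m.

    coefficients[m] is a placeholder weight for m mismatches.
    """
    profile = mismatch_profile(s1, s2, length)
    return sum(profile[m] * c for m, c in enumerate(coefficients))
```

This pairwise loop is quadratic in sequence length. It is meant to show what is being counted, not to be an efficient implementation.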

Previous studies of biological problems have used traditional k-mers as biological features. The invention is the first to use fuzzy k-mers to study the sequence characteristics of protein-coding RNA (mRNA) and long non-coding RNA. It is also the first to study the identification of long non-coding RNA based on fuzzy k-mers, and it designs a support vector machine kernel function for fuzzy k-mers. The results of the invention can be extended to problems such as "identifying genes", "predicting piRNA", "category-specific motif discovery", and "miRNA classification".