CN103218543B

CN103218543B - Method and system for distinguishing protein-coding genes from non-coding genes

Info

Publication number: CN103218543B
Application number: CN201310102224.9A
Authority: CN
Inventors: 赵屹; 孙亮; 罗海涛
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2013-03-27
Filing date: 2013-03-27
Publication date: 2016-04-13
Anticipated expiration: 2033-03-27
Also published as: CN103218543A

Abstract

The invention provides a method and system for distinguishing protein-coding genes from non-coding genes. It distinguishes the two kinds of gene by features at the sequence level. These features do not depend on known data for the species, require no conservation information, and work well for identifying long non-coding RNAs. Besides a strong advantage in accuracy, the method is simple to operate, depends on few auxiliary files, and has a processing time clearly better than that of known methods.

Description

Method and system for distinguishing protein-coding genes from non-coding genes

Technical field

The present invention relates to the life sciences, and in particular to a method and system for distinguishing protein-coding genes from non-coding genes.

Background technology

At present, two main methods are used internationally to distinguish protein-coding genes (hereinafter "coding genes") from non-coding genes:

The CPC method, developed by the College of Life Sciences at Peking University, relies on information such as the predicted open reading frame of a gene and a database of known proteins to judge whether a nucleotide sequence is a coding gene or a non-coding gene. The method depends too heavily on the open-reading-frame prediction method and on known databases, so it has obvious shortcomings in judging novel genes and long non-coding genes. According to our own evaluation, its accuracy for long non-coding genes is very low.

PhyloCSF is a method that has been adopted internationally in recent years. It relies on multi-species sequence alignments to obtain conserved regions, and judges whether a sequence is coding or non-coding from the strength of conservation of the sequence under test. However, many species have no whole-genome sequence at all, so multi-species alignment information cannot be obtained for them. For many species, therefore, the conservation of a sequence cannot be measured and its coding or non-coding potential cannot be judged. In addition, long non-coding genes contain multiple conserved modules (subsequences), so judging coding potential from conserved regions alone is too one-sided. Our own evaluation of this method also shows that its accuracy is very low.

Summary of the invention

To solve the above problems, the invention provides a method and system for distinguishing coding genes from non-coding genes. It mainly uses frequency statistics of adjacent codon pairs in a sequence to accurately separate coding sequences from non-coding sequences and from the sequences produced by reading a coding region in the other five reading frames. It does not depend on known data for the species, requires no conservation information, and works well for identifying long non-coding RNAs.

To achieve the above object, the invention provides a method for distinguishing coding genes from non-coding genes, the method comprising:

Step 1: divide the sample set into a positive and a negative training set according to coding and non-coding sequences, and perform steps 2 to 4 on each of the two training sets;

Step 2: in the training set, count the occurrence frequency of each adjacent nucleotide triplet (ANT) in coding sequences, non-coding sequences, and intergenic sequences, build an occurrence-frequency matrix for each, and build a scoring matrix from the three occurrence-frequency matrices by a log2-ratio operation;

Step 3: use the scoring matrix to score sliding windows, giving the window score (S-score), which serves as the first feature of the classification model; use a dynamic programming algorithm to find, in the arrays converted from the coding and non-coding sequences of the sample set, the region with the maximum sub-segment sum as the feature subsequence (MLCDS); and use the length of the MLCDS as the second feature of the classification model;

Step 4: obtain the third feature of the classification model from X and Y_i, i ∈ {1, 2, 3, 4, 5, 6}, where X is the length of the MLCDS in its reading frame and Y_i is the length of the MLCDS in each of the six reading frames;

obtain the fourth feature of the classification model from S and E_j, j ∈ {1, 2, 3, 4, 5}, where S is the S-score of the MLCDS extracted in the correct reading frame among the six reading frames of the nucleotide sequence, and E_j is the S-score of the MLCDS extracted in each of the remaining five incorrect reading frames;

apply a log2-ratio operation to the occurrence frequencies of single nucleotide triplets in coding and non-coding regions to obtain the nucleotide-triplet preference as the fifth feature of the classification model;

Step 5: train the classification model with the positive and negative sets of feature vectors formed from the five features of the two training sets, and use the classification model to predict on the sequence to be distinguished, obtaining the distinguishing result.

The occurrence frequency X_{i}F is computed as:

X_{i}N = \sum_{j=1}^{n} S_{j}(X_{i})

T = \sum_{i=1}^{m} X_{i}N = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} S_{j}(X_{i}) \right), \quad m = 64 \times 64, \quad n \in \{1, \ldots, N\}

X_{i}F = \frac{X_{i}N}{T}

Here X_{i} denotes an ANT of a given type; S_{j}(X_{i}) is the number of occurrences of X_{i} in the j-th sequence of a given class of sequence set; X_{i}N is the number of occurrences of this ANT in the whole sequence set; T is the total number of occurrences of ANTs of all types in the data set; m is the number of ANT types; and n is the number of sequences in the set.
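The frequency formulas above can be illustrated with a short Python sketch. It is not the implementation of the invention: the function names `count_ants` and `ant_frequencies`, the dictionary representation keyed by six-letter ANT strings, and the choice to read each sequence as non-overlapping triplets from a given offset are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]  # the 64 triplets


def count_ants(seq, frame=0):
    """Count ANT occurrences S_j(X_i) in one sequence.

    Assumption: the sequence is read as non-overlapping triplets starting at
    `frame`, and each pair of consecutive triplets is one ANT occurrence.
    """
    seq = seq.upper()
    triplets = [seq[k:k + 3] for k in range(frame, len(seq) - 2, 3)]
    counts = Counter()
    for left, right in zip(triplets, triplets[1:]):
        if set(left + right) <= set(BASES):  # skip ambiguous bases such as N
            counts[left + right] += 1
    return counts


def ant_frequencies(sequences):
    """Return X_i F = X_i N / T for all 64 * 64 ANT types in a sequence set."""
    totals = Counter()
    for seq in sequences:  # X_i N: sum of S_j(X_i) over the n sequences
        totals.update(count_ants(seq))
    grand_total = sum(totals.values())  # T
    return {
        a + b: (totals[a + b] / grand_total if grand_total else 0.0)
        for a in TRIPLETS
        for b in TRIPLETS
    }
```

Calling `ant_frequencies` once per sequence class (coding, non-coding, intergenic) would give the three frequency tables from which the scoring matrix is built.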

Further, step 3 comprises:

Step 31: scan every transcript sequence of the coding and non-coding sequences with a sliding window in all six reading frames;

Step 32: during the scan, score each sub-window of the sliding window with the scoring matrix, so that a transcript made up of a nucleotide sequence is converted into six arrays whose elements are the window scores of the sub-windows;

Step 33: using the maximum sub-segment sum method of dynamic programming, find in each of the six arrays the sub-segment with the largest sum, obtaining six candidate maximum sub-segments;

Step 34: among the six candidate maximum sub-segments, select the one with the highest score as the feature subsequence of the transcript, that is, as its CDS-like region.

The maximum sub-segment sum X in step 33 is computed as:

X = \max_{1 \leq i \leq j \leq n} \left\{ \sum_{k=i}^{j} a[k] \right\}

where a[k] is the k-th element of the array for a given reading frame, and i and j are the start and end positions in that array of the sub-segment that attains the maximum sum.
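The maximum sub-segment sum can be found in a single pass with the standard dynamic programming recurrence (often called Kadane's algorithm). The Python sketch below is one common way to do it and is offered only as an illustration; the function name `max_subsegment` and the 1-based, inclusive return positions are assumptions chosen to match the indices i and j in the formula.

```python
def max_subsegment(a):
    """Return (X, i, j): the maximum sub-segment sum of `a` and the
    1-based, inclusive start and end positions of that sub-segment."""
    if not a:
        raise ValueError("array must not be empty")
    best_sum, best_i, best_j = a[0], 1, 1
    cur_sum, cur_i = a[0], 1
    for k in range(2, len(a) + 1):
        value = a[k - 1]
        if cur_sum < 0:
            # A negative running sum can only hurt: restart at position k.
            cur_sum, cur_i = value, k
        else:
            cur_sum += value
        if cur_sum > best_sum:
            best_sum, best_i, best_j = cur_sum, cur_i, k
    return best_sum, best_i, best_j
```

Applied to each of the six window-score arrays of a transcript, this yields the six candidate sub-segments of step 33; the candidate with the largest sum is then taken in step 34.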

Further, in step 5, the sequences to be distinguished are divided into positive and negative training sets, the positive and negative training sets are converted into the input format required by a support vector machine (SVM), and they are fed into the SVM to train the classification model, obtaining the distinguishing result.

To achieve the above object, the invention also provides a system for distinguishing coding genes from non-coding genes, the system comprising:

a pre-processing module, configured to divide the sample set into a positive and a negative training set according to coding and non-coding sequences, and to run the frequency statistics module through the feature extraction module on each of the two training sets;

a frequency statistics module, configured to count, in the training set, the occurrence frequency of each ANT in coding sequences, non-coding sequences, and intergenic sequences, to build an occurrence-frequency matrix for each, and to build a scoring matrix from the three occurrence-frequency matrices by a log2-ratio operation;

a sequence extraction module, in which the scoring matrix is used to score sliding windows, giving the window score (S-score), which serves as the first feature of the classification model; a dynamic programming algorithm is used to find, in the arrays converted from the coding and non-coding sequences of the sample set, the region with the maximum sub-segment sum as the feature subsequence (MLCDS); and the length of the MLCDS is used as the second feature of the classification model;

a feature extraction module, which obtains the third feature of the classification model from X and Y_i, i ∈ {1, 2, 3, 4, 5, 6}, where X is the length of the MLCDS in its reading frame and Y_i is the length of the MLCDS in each of the six reading frames;

and which obtains the fourth feature of the classification model from S and E_j, j ∈ {1, 2, 3, 4, 5}, where S is the S-score of the MLCDS in its reading frame and E_j is the S-score of the MLCDS in each of the remaining reading frames;

a distinguishing-result acquisition module, which trains the classification model with the positive and negative sets of feature vectors formed from the five features of the two training sets, and uses the classification model to predict on the sequence to be distinguished, obtaining the distinguishing result.

The occurrence frequency X_{i}F is computed as:

X_{i}N = \sum_{j=1}^{n} S_{j}(X_{i})

T = \sum_{i=1}^{m} X_{i}N = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} S_{j}(X_{i}) \right), \quad m = 64 \times 64, \quad n \in \{1, \ldots, N\}

X_{i}F = \frac{X_{i}N}{T}

Further, the sequence extraction module comprises:

a scanning module, which scans every transcript sequence of the coding and non-coding sequences with a sliding window in all six reading frames;

a scoring module, in which the scoring matrix scores each sub-window of the sliding window during the scan, so that a transcript made up of a nucleotide sequence is converted into six arrays whose elements are the window scores of the sub-windows;

a candidate sub-segment acquisition module, which uses the maximum sub-segment sum method of dynamic programming to find in each of the six arrays the sub-segment with the largest sum, obtaining six candidate maximum sub-segments;

a sequence selection module, which selects, among the six candidate maximum sub-segments, the one with the highest score as the feature subsequence of the transcript, that is, as its CDS-like region.

The maximum sub-segment sum X in the candidate sub-segment acquisition module is computed as:

X = \max_{1 \leq i \leq j \leq n} \left\{ \sum_{k=i}^{j} a[k] \right\}

Further, in the distinguishing-result acquisition module, the sequences to be distinguished are divided into positive and negative training sets, the positive and negative training sets are converted into the input format required by a support vector machine (SVM), and they are fed into the SVM to train the classification model, obtaining the distinguishing result.

The beneficial effects of the invention are as follows:

The frequency of adjacent codon pairs in a sequence is found to be an effective means of distinguishing coding from non-coding sequences. Frequency statistics of adjacent codon pairs accurately separate coding sequences from non-coding sequences and from the sequences produced by reading a coding region in the other five reading frames.

Using the idea of dynamic programming, a sliding-window analysis can extract from a complete sequence a subsequence that represents it. In a test on the full set of human coding-gene transcripts, more than 98% of the subsequences found by the invention overlap wholly or partly with the standard CDS region.

The concept of a transcript's internal difference is proposed. Protein-coding transcripts and non-coding transcripts differ from each other in certain attributes and features, but the difference within each transcript is even more pronounced. When a protein-coding transcript is translated in six reading frames to give six different sequences, one of them differs very significantly from the other five in both the length of the representative sequence and its coding potential. Non-coding transcripts show no such internal difference.

Combining five features, namely the score, length, proportion, and value of the maximum sub-segment of a sequence together with the codon frequency distribution, gives the best classification performance.

The invention is described below with reference to the drawings and specific embodiments, which are not to be taken as limiting the invention.

Description of the drawings

Fig. 1a is a flowchart of the method of the invention for distinguishing coding genes from non-coding genes;

Fig. 1b is a schematic diagram of the system of the invention for distinguishing coding genes from non-coding genes;

Fig. 2 is a schematic comparison of the ROC curves of the invention and the two prior-art methods;

Fig. 3 shows the occurrence-frequency matrices of the invention;

Fig. 4 is a schematic diagram of the tendency of ANTs to occur in coding versus non-coding sequences;

Fig. 5 is a schematic diagram of the S-score distributions;

Fig. 6 is a schematic diagram of the sub-window scores of the reading-frame sequences;

Fig. 7 is a schematic diagram of the degree of overlap between the MLCDS and the standard CDS sequence;

Fig. 8 is a schematic diagram of the mean-value statistics for coding and non-coding RNAs.

Embodiment

Fig. 1a is a flowchart of the method of the invention for distinguishing coding genes from non-coding genes. As shown in Fig. 1a, the method comprises:

Step 3: use the scoring matrix to score sliding windows, giving the window score (S-score), which serves as the first feature of the classification model; use a dynamic programming algorithm to find, in the arrays converted from the coding and non-coding sequences of the sample set, the region with the maximum sub-segment sum as the feature subsequence (MLCDS); and use the length of the MLCDS as the second feature of the classification model. Here the maximum sub-segment sum refers to the contiguous sub-segment with the largest sum in an array made up of positive and negative numbers; that is, removing the first or last element of this contiguous sub-segment, or adding an element adjacent to it in the array, reduces its sum;

The occurrence frequency X_{i}F is computed as:

X_{i}N = \sum_{j=1}^{n} S_{j}(X_{i})

T = \sum_{i=1}^{m} X_{i}N = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} S_{j}(X_{i}) \right), \quad m = 64 \times 64, \quad n \in \{1, \ldots, N\}

X_{i}F = \frac{X_{i}N}{T}

Further, step 3 comprises:

X = \max_{1 \leq i \leq j \leq n} \left\{ \sum_{k=i}^{j} a[k] \right\}

Fig. 1b is a schematic diagram of the system of the invention for distinguishing coding genes from non-coding genes. As shown in Fig. 1b, the system comprises:

a pre-processing module 100, configured to divide the sample set into a positive and a negative training set according to coding and non-coding sequences, and to run the frequency statistics module through the feature extraction module on each of the two training sets;

a frequency statistics module 200, configured to count, in the training set, the occurrence frequency of each ANT in coding sequences, non-coding sequences, and intergenic sequences, to build an occurrence-frequency matrix for each, and to build a scoring matrix from the three occurrence-frequency matrices by a log2-ratio operation;

a sequence extraction module 300, in which the scoring matrix is used to score sliding windows, giving the window score (S-score), which serves as the first feature of the classification model; a dynamic programming algorithm is used to find, in the arrays converted from the coding and non-coding sequences of the sample set, the region with the maximum sub-segment sum as the feature subsequence (MLCDS); and the length of the MLCDS is used as the second feature of the classification model. Here the maximum sub-segment sum refers to the contiguous sub-segment with the largest sum in an array made up of positive and negative numbers; that is, removing the first or last element of this contiguous sub-segment, or adding an element adjacent to it in the array, reduces its sum;

a feature extraction module 400, which obtains the third feature of the classification model from X and Y_i, i ∈ {1, 2, 3, 4, 5, 6}, where X is the length of the MLCDS in its reading frame and Y_i is the length of the MLCDS in each of the six reading frames;

a distinguishing-result acquisition module 500, which trains the classification model with the positive and negative sets of feature vectors formed from the five features of the two training sets, and uses the classification model to predict on the sequence to be distinguished, obtaining the distinguishing result.

The occurrence frequency X_{i}F is computed as:

X_{i}N = \sum_{j=1}^{n} S_{j}(X_{i})

T = \sum_{i=1}^{m} X_{i}N = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} S_{j}(X_{i}) \right), \quad m = 64 \times 64, \quad n \in \{1, \ldots, N\}

X_{i}F = \frac{X_{i}N}{T}

Further, the sequence extraction module 300 comprises:

X = \max_{1 \leq i \leq j \leq n} \left\{ \sum_{k=i}^{j} a[k] \right\}

Further, in the distinguishing-result acquisition module 500, the sequences to be distinguished are divided into positive and negative training sets, the positive and negative training sets are converted into the input format required by a support vector machine (SVM), and they are fed into the SVM to train the classification model, obtaining the distinguishing result.

Fig. 2 is a schematic comparison of the ROC curves of the invention and the two prior-art methods. The invention (CNCI) is compared with CPC and PhyloCSF on a mouse test set, and ROC curves are given. On the basis of the ROC curves, a more intuitive evaluation index is given here: the minimum error rate point, that is, the point at which the product of the average false-positive rate and the false-negative rate reaches its minimum. In other words, the minimum error rate point is the point at which the classification model is balanced, and it reflects the actual performance of the model. The whole test and comparison was carried out in three steps. First, we ran CNCI on a local server on the coding and non-coding RNAs of the test sample and recorded the results. Second, we submitted the coding and non-coding RNA data of the test sample to the CPC web server at Peking University, waited for the results, and recorded them. Third, to run PhyloCSF (downloaded from http://compbio.mit.edu/PhyloCSF), we uploaded the BED file of the test sample and the whole-genome annotation of mouse to the Galaxy platform, used the "stitchMAFblocks" function on that platform to compute multiple sequence alignment data for 29 mammals, converted the data under Unix into the input format required by PhyloCSF, and then ran the software according to the PhyloCSF help documentation with the following parameters: --minCodons=30, --orf=ATGStop, --strategy=fixed, --frames=3, and --removeRefGaps. All three programs give each classified sequence a score. Zero is the boundary between coding and non-coding transcripts: the larger the score, the stronger the coding potential of the sequence, and vice versa. The ROC curves and the minimum error rates here were likewise computed by sliding a threshold over the final scores given by the three programs. In the figure, the ROC curves of the three programs are drawn as one solid line and two dashed lines of different styles, and the minimum error rate points are marked with filled squares. From this figure we can be confident that, on the mouse test set, CNCI classifies more reliably than CPC and PhyloCSF and has a lower minimum error rate: 0.05, 0.11, and 0.28, respectively (Fig. 2). In both running time and convenience, CNCI is also more than three times better than the other two methods.

The overall CNCI framework has two main parts: construction of the CNCI scoring matrix and building of the classification model. First, in the species used as the training set, we count the occurrence frequency of each ANT in coding sequences, non-coding sequences, and intergenic sequences, and build three occurrence-frequency matrices. Taking the occurrence frequency of ANTs in intergenic sequences as a reference, the invention applies a log2-ratio operation to the ANT frequency matrix of coding sequences and the ANT frequency matrix of non-coding sequences, thereby constructing the CNCI scoring matrix, which is proposed for the first time in this invention.

Second, we build the classification model on the basis of this scoring matrix. A sliding window first scans every transcript sequence in all six reading frames. During the scan, the CNCI scoring matrix scores each sub-window, so that a transcript made up of a nucleotide sequence is converted into six arrays. The maximum sub-segment sum method of dynamic programming is then used to find, in each array, the sub-segment with the largest sum. After all six frames have been scanned, CNCI automatically selects, among the six candidate maximum sub-segments, the one with the highest score as the feature subsequence of the transcript, that is, as the most CDS-like sequence (MLCDS).

Finally, CNCI uses the sliding-window method to compute the MLCDS sets of the coding and non-coding sequence sets of the training sample, and extracts from each MLCDS five features that are biologically and statistically meaningful. Once feature extraction is complete, the features are fed into the SVM model for training according to the positive and negative sample labels. After training, the model can be used to classify unknown RNA sequences.

In this study, we analyze the usage frequency of ANTs in CDS and non-coding regions, where non-coding regions comprise non-coding RNAs and intron sequences. There are 64 kinds of codon in a CDS region, or of nucleotide triplet in a genomic sequence, and therefore 64 × 64 kinds of ANT. We counted the occurrence frequency of each ANT in the three different types of sequence set according to the following formulas:

X_{i}N = \sum_{j=1}^{n} S_{j}(X_{i})

T = \sum_{i=1}^{m} X_{i}N = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} S_{j}(X_{i}) \right), \quad m = 64 \times 64, \quad n \in \{1, \ldots, N\}

X_{i}F = \frac{X_{i}N}{T}

In these formulas, X_{i} denotes an ANT of a given type; S_{j}(X_{i}) is the number of occurrences of X_{i} in the j-th sequence of a given class of sequence set; X_{i}N is the number of occurrences of this ANT in the whole sequence set, computed by the first formula; and T is the total number of occurrences of ANTs of all types in the data set. Here m is the number of ANT types and n is the number of sequences in the set. X_{i}F is thus the usage frequency of a given ANT in a given class of data set.

Using the three data sets of coding regions, non-coding RNAs, and intron sequences, we calculated ANT occurrence frequencies with the formulas above and built occurrence-frequency matrices on the human and mouse RNA data sets (Fig. 3). The horizontal and vertical axes each represent the 64 types of nucleotide triplet, and a darker color indicates a higher occurrence frequency for that ANT. For both human and mouse, the ANT frequency maps of the coding regions are very similar. This shows that ANT usage in human and mouse follows the same evolutionary model: because the evolutionary distance between the two species is not large, their protein-coding regions are subject to the same type of selection pressure. By contrast, ANT usage in non-coding and intron sequences differs markedly from the map for CDS regions, in both human and mouse. These statistics show that ANTs have a biased distribution between coding and non-coding regions, which gives a sound theoretical basis for distinguishing coding from non-coding transcripts.

The above describes the usage frequency of ANTs in different gene regions of human and mouse, and demonstrates that ANT usage differs markedly between coding and non-coding regions. It does not, however, show the concrete form of this difference: whether a given ANT tends to occur more in coding or in non-coding regions, and whether some ANTs occur preferentially in one or the other. We therefore calculated the log2-ratio of the occurrence frequency of each ANT in the human coding RNA set to its occurrence frequency in the human non-coding RNA set, and defined the matrix:

\text{H-matrix} = \log_{2} \left( \frac{\text{HC-matrix}}{\text{HN-matrix}} \right)

The same method was used to define the matrix for the mouse set:

\text{M-matrix} = \log_{2} \left( \frac{\text{MC-matrix}}{\text{MN-matrix}} \right)
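The element-wise log2-ratio of two frequency matrices can be sketched in Python as follows. This is only an illustration: the function name `log2_ratio_matrix`, the dictionary representation keyed by six-letter ANT strings, and the `pseudocount` parameter (added here so that ANTs with zero observed frequency do not cause a division by zero or a logarithm of zero) are assumptions and are not described in the text.

```python
import math


def log2_ratio_matrix(coding_freq, noncoding_freq, pseudocount=1e-6):
    """Element-wise log2(coding / non-coding) for every ANT key.

    `pseudocount` is an illustrative assumption that keeps the ratio finite
    when an ANT has zero frequency in one of the two sets.
    """
    keys = set(coding_freq) | set(noncoding_freq)
    return {
        ant: math.log2(
            (coding_freq.get(ant, 0.0) + pseudocount)
            / (noncoding_freq.get(ant, 0.0) + pseudocount)
        )
        for ant in keys
    }
```

A positive entry means the ANT is relatively more frequent in the coding set, a negative entry that it is relatively more frequent in the non-coding set.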

As shown in Fig. 4, each point represents the tendency of a given ANT to occur in coding versus non-coding sequences: the closer the color is to grey, the more the ANT favors coding regions; the closer it is to black, the more it favors non-coding regions. The ANTs that clearly tend to occur in non-coding regions all contain a stop codon; that is, they begin or end with TAG, TAA, or TGA or their complements. The ANTs that tend to occur in coding regions are essentially random. This also shows that ANT usage is essentially uniform in non-coding regions, whereas some ANTs are used specifically in coding regions. The maps computed from the two matrices are also essentially the same, which shows that the scoring matrix is broadly applicable across the evolutionary distance from human to mouse. Because the human genome has been studied most intensively, the human transcript data in the major databases are currently the largest in volume and the highest in quality compared with other species. All the training samples used below are therefore human data, and the CNCI classification model referred to below is the model trained on human RNA data.

Based on the CNCI scoring matrix built above, we define an index called the sequence score (S-score) to measure the coding potential of a sequence. The S-score is calculated as follows:

S = \sum_{i=1}^{n} H_{p}(x_{i})

In this expression, S is the S-score of the sequence, H_{p} is the CNCI scoring matrix, x_{i} is the i-th ANT of the sequence, and n is the length of the sequence after conversion into units of nucleotide triplets (one nucleotide triplet = 3 nucleotides). With the S-score, any sequence can be given a score that reflects its coding potential: the larger the value, the stronger the coding potential, and vice versa. The invention verified on data whether the S-score can effectively distinguish true coding sequences, non-coding sequences, and the other five reading frames of coding sequences. First, we calculated the scores of the six reading frames of 30,507 human CDS sequences and of 18,566 long non-coding RNAs, obtaining seven arrays of sequence scores, and divided the sequence scores in these seven arrays by distribution density. Second, using the human-based CNCI scoring matrix, we calculated the S-scores of the true CDS region, shifted reading frame 1 of the CDS region, shifted reading frame 2 of the CDS region, antisense reading frame 1 of the CDS region, antisense reading frame 2 of the CDS region, antisense reading frame 3 of the CDS region, and long non-coding RNAs (Fig. 5). In the figure, the solid black line is the S-score distribution of true CDS regions, the five dashed black lines are the S-score distributions of the five incorrect reading frames, and the black shaded area is the S-score distribution of long non-coding RNAs. The figure clearly shows that true CDS regions generally have higher S-scores, the great majority above zero, whereas the S-scores of the incorrect reading frames and of long non-coding RNAs are almost all below zero, and the score density distributions of these two kinds of sequence hardly differ. Two conclusions follow. First, the S-score clearly separates coding sequences from incorrect reading frames and from non-coding sequences. Second, incorrect reading frames and non-coding sequences are almost indistinguishable by S-score.
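The S-score sum can be written as a few lines of Python. This is a minimal sketch under stated assumptions: the function name `s_score`, a scoring matrix stored as a dictionary keyed by six-letter ANT strings, ANTs formed from consecutive non-overlapping triplets starting at offset `frame`, and a score of 0 for any ANT missing from the matrix are all illustrative choices rather than details given in the text.

```python
def s_score(seq, score_matrix, frame=0):
    """Sum the scoring-matrix entries H_p(x_i) over the ANTs of `seq`.

    Assumptions: triplets are read without overlap from offset `frame`,
    each pair of consecutive triplets is one ANT, and ANTs missing from
    `score_matrix` contribute 0.
    """
    seq = seq.upper()
    triplets = [seq[k:k + 3] for k in range(frame, len(seq) - 2, 3)]
    return sum(
        score_matrix.get(left + right, 0.0)
        for left, right in zip(triplets, triplets[1:])
    )
```

Evaluating the same sequence with `frame=0`, `frame=1`, and `frame=2`, and likewise on its reverse complement, gives the scores of the six reading frames compared in Fig. 5.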

The purpose of this research is to develop a method that can distinguish coding RNA from non-coding RNA. Although the CNCI scoring matrix identifies CDS regions and other kinds of non-coding sequences well, a real protein-coding RNA contains, in addition to its CDS region, a 5' untranslated region (5' UTR) and a longer 3' untranslated region (3' UTR). Both are non-coding regions on a protein-coding transcript, and their presence considerably lowers the S-score of a coding sequence, so the scoring matrix cannot be applied directly to full-length RNA sequences. To solve this problem, we introduce a sliding window to assist the analysis of full-length RNA. The window spans N (10 to 100) nucleotide triplets and has a step of one nucleotide triplet, that is, each slide moves the window by 3 nucleotides. In this way a full-length transcript sequence is divided into n subsequences (n is the length of the sequence in nucleotide triplets minus 1; the conversion to triplets makes it convenient to scan the sequence in six reading frames), and the length of each subsequence is the size of the sliding window. We score each subwindow with the CNCI scoring matrix to obtain its S-score. Because a full-length transcript sequence has six reading frames, the sliding window scans any sequence six times, once per reading frame, so that any sequence is converted into six independent arrays whose elements are the S-scores of the subwindows in the corresponding reading frame. We used the human protein-coding transcript set and the non-coding transcript set of the gencode.v11 database as test sets, calculated the subwindow score distribution of each sequence, and integrated the distributions of the two sets. The plots show that, among the six score sets of one coding transcript, one curve (drawn as a bold solid line) differs markedly from the others; the part of this curve that projects upward is the real CDS region. Transcripts of non-coding genes do not have this property. To verify that this phenomenon found by the present invention is general and reproducible, we normalized 30,507 sequences by sampling interpolation, scaling them all to the same length; the bold solid line represents the subwindow scores of the correct reading frame of all coding genes (Fig. 6).
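The sliding-window scan described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names and the `score` dictionary keyed by pairs of adjacent triplets are hypothetical, the S-score of a window is assumed to be the sum of the scoring-matrix entries of its adjacent triplet pairs, and the sequence is assumed to be written in the DNA alphabet (A, C, G, T).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse strand of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Split a sequence into its six reading frames, each a list of triplets.

    Frames 0-2 are the forward strand at offsets 0, 1, 2; frames 3-5 are the
    reverse strand at offsets 0, 1, 2 (this numbering is an assumption).
    """
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frames.append(
                [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            )
    return frames


def window_scores(triplets, score, window=50):
    """Score every window of `window` triplets, sliding one triplet at a time.

    `score` maps a pair of adjacent triplets to its scoring-matrix entry;
    pairs missing from the mapping are scored 0.0 (an assumption).
    """
    pair_scores = [score.get((a, b), 0.0) for a, b in zip(triplets, triplets[1:])]
    width = window - 1  # a window of N triplets contains N - 1 adjacent pairs
    if width < 1 or len(pair_scores) < width:
        return []
    current = sum(pair_scores[:width])
    scores = [current]
    for k in range(width, len(pair_scores)):
        # Slide by one triplet: add the new pair, drop the oldest pair.
        current += pair_scores[k] - pair_scores[k - width]
        scores.append(current)
    return scores


def scan_transcript(seq, score, window=50):
    """Convert one transcript into six arrays of window S-scores."""
    return [window_scores(frame, score, window) for frame in six_frames(seq)]
```

Calling `scan_transcript(seq, score, window=50)` returns six lists, one per reading frame; a transcript shorter than the window yields empty lists.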

From the above research we found that a coding sequence usually has one special reading frame whose subwindow array contains a long, continuous run of high S-scores, whereas non-coding sequences do not show this property. The key to the RNA classification problem is therefore how to extract, from any coding or non-coding RNA, the subsequence with the best coding potential, such that a subsequence extracted from coding RNA is significantly higher in both S-score and length than one extracted from non-coding RNA. Here we adopt the idea of dynamic programming and extract this subsequence by finding the maximum subsegment sum. Because in coding RNA this subsequence is essentially the CDS region, we name it the most-like CDS (MLCDS). The procedure for finding the maximum subsegment sum and the MLCDS is as follows:

X = \max_{1 \leq i \leq j \leq n} \sum_{k = i}^{j} a[k]

Here X is the maximum subsegment sum of one reading frame, a[k] (i ≤ k ≤ j) are the elements of the maximum subsegment that attains this sum, and i and j are the start and end positions of this maximum subsegment in the reading frame. Our goal is to calculate X, i and j, so we define the following refined formula:

b[j] = \max_{1 \leq m \leq j} \sum_{k = m}^{j} a[k]

Here m is a variable start position, b[j] is the local maximum subsegment sum, that is, the largest sum of any subsegment ending at position j, and a[j] is the rightmost element of that subsegment. We can therefore state the following:

X = \max_{1 \leq j \leq n} b [j]

From the definition of b[j] we can conclude that when b[j-1] > 0, whatever the value of a[j], b[j] = b[j-1] + a[j]; and when b[j-1] ≤ 0, whatever the value of a[j], b[j] = a[j]. Therefore:

b[j] = \max \{ b[j - 1] + a[j], a[j] \}, 1 \leq j \leq n, b[0] = 0
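The recurrence above is the classic maximum-subarray scheme (often called Kadane's algorithm). A minimal Python sketch is given here for illustration; the function name `max_subsegment` and the use of 0-based, inclusive positions are assumptions of this sketch, not part of the original text.

```python
def max_subsegment(a):
    """Return (X, i, j) for a non-empty score array `a`.

    X is the maximum subsegment sum, and i and j are the 0-based, inclusive
    start and end positions of the subsegment that attains it.
    """
    if not a:
        raise ValueError("score array must not be empty")
    best_sum, best_i, best_j = a[0], 0, 0
    b, start = a[0], 0  # b holds b[j]: the best sum of a subsegment ending at j
    for j in range(1, len(a)):
        if b > 0:
            b += a[j]            # b[j] = b[j-1] + a[j]
        else:
            b, start = a[j], j   # b[j] = a[j]; a new subsegment starts at j
        if b > best_sum:         # X = max over j of b[j]
            best_sum, best_i, best_j = b, start, j
    return best_sum, best_i, best_j
```

The array is traversed once, so the running time is linear in the number of subwindows; ties are resolved in favor of the earliest subsegment.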

Through the steps above we obtain the maximum subsegment sum of the subwindow array for any reading frame of any RNA, that is, the subsequence with the greatest coding potential, the MLCDS. The six reading frames yield six candidate MLCDSs; we select the one with the largest value as the real MLCDS, and the reading frame in which it lies is taken as the correct reading frame. Because the same operation is applied to coding and non-coding RNA alike, an MLCDS is always obtained from a non-coding RNA as well.
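The selection among the six candidates can be sketched as follows. The `Candidate` record, its field names and the frame numbering are hypothetical and serve only to illustrate the step; each candidate is assumed to have been obtained beforehand by the maximum subsegment sum procedure for its reading frame.

```python
from typing import NamedTuple, Sequence


class Candidate(NamedTuple):
    frame: int    # reading frame index, 0..5 (hypothetical numbering)
    score: float  # maximum subsegment sum found in this frame
    start: int    # first subwindow index of the subsegment
    end: int      # last subwindow index of the subsegment (inclusive)

    @property
    def length(self) -> int:
        """Length of the candidate in subwindows."""
        return self.end - self.start + 1


def pick_mlcds(candidates: Sequence[Candidate]) -> Candidate:
    """Return the candidate with the largest score; its frame is taken as
    the correct reading frame. Ties go to the first candidate listed."""
    if len(candidates) != 6:
        raise ValueError("expected one candidate per reading frame (six in total)")
    return max(candidates, key=lambda c: c.score)
```

The same function is applied to coding and non-coding transcripts, so a non-coding transcript also yields an MLCDS, typically with a lower score and shorter length.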

The extraction of the MLCDS is the core of the whole classification model; the quality of the MLCDSs directly determines the final classification performance of CNCI. To establish the credibility of the model, we therefore need to assess comprehensively the quality of the MLCDSs we find. Because non-coding RNAs have no reference CDS regions, the present invention takes the 30,507 standard CDSs on hg19 as reference. From their original full-length RNA sequences we extracted the reading frame of each CDS and its start and end positions in that reading frame, and, using this information as the reference, compared it with the MLCDSs found from the full-length RNAs (Fig. 7). The numbers in the figure denote the degree of overlap between the MLCDS and the standard CDS, and the percentages in the pie chart give the proportion of MLCDSs with that degree of overlap. For example, the part labeled 1 shows that MLCDSs overlapping the standard CDS by more than 95% account for 25% of those found, and the part labeled 9 shows that only 1.7% of MLCDSs overlap their corresponding CDS by less than 40%.
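One way to quantify the degree of overlap is sketched below. The text does not state the exact definition used for Fig. 7, so this sketch assumes that the degree of overlap is the fraction of the reference CDS covered by the MLCDS, with both given as half-open (start, end) intervals in the same reading frame; the function name is hypothetical.

```python
def overlap_fraction(mlcds, cds):
    """Fraction of the reference CDS interval covered by the MLCDS interval.

    Both arguments are (start, end) pairs with `end` exclusive, expressed in
    the same coordinate system and reading frame (an assumption of this sketch).
    """
    m_start, m_end = mlcds
    c_start, c_end = cds
    if c_end <= c_start:
        raise ValueError("reference CDS must have positive length")
    shared = max(0, min(m_end, c_end) - max(m_start, c_start))
    return shared / (c_end - c_start)
```

For instance, an MLCDS at (100, 400) compared with a reference CDS at (130, 430) shares 270 of the 300 reference positions, giving an overlap of 0.9.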

To train a classification model that distinguishes coding RNA from long non-coding RNA, the present invention defines five classification features. First, coding RNAs generally have a long CDS region of high coding quality, whereas non-coding RNAs do not. Recent research has found that some long non-coding RNAs may contain short peptide chains, but their length and coding potential are far inferior to those of a real CDS region, and the MLCDS we find reflects exactly this characteristic. The S-score of the MLCDS and its length are therefore defined as the two most basic classification features, named M-Score and M-Length respectively. Second, as discussed above, among the six reading frames of a coding RNA there is always one whose subwindow S-score distribution differs markedly from those of the other five, whereas non-coding RNA lacks this attribute. Based on this phenomenon we define two original features from the perspective of the differences within a sequence itself; this method and idea of feature extraction have not been reported in other related literature. The difference within a sequence means that, for a coding RNA, there must be some index reflecting the fact that one of its six reading frames contains a special sequence that increases the dispersion among the six arrays, whereas for a non-coding RNA the dispersion is much less pronounced. These two new features, LENGTH-Percentage and SCORE-distance, are defined as follows:

LENGTH\text{-}Percentage = \frac{X}{\sum_{i = 1}^{6} Y_{i}}, i \in \{1,2,3,4,5,6\}

X is the length of the MLCDS in the correct reading frame, and Y_i is the length of the MLCDS in each of the six reading frames.

SCORE\text{-}distance = \frac{\sum_{j = 1}^{5} (S - E_{j})}{5}, j \in \{1,2,3,4,5\}

S is the S-score of the MLCDS in the correct reading frame, and E_j is the S-score of the MLCDS in each of the other five reading frames.
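The two features can be computed directly from the six candidate MLCDSs. The following Python sketch assumes that the lengths and S-scores of the six candidates are held in two lists indexed by reading frame and that `correct` is the index of the correct reading frame; these names are introduced here for illustration only.

```python
def length_percentage(lengths, correct):
    """LENGTH-Percentage: X divided by the sum of Y_i over the six frames."""
    if len(lengths) != 6:
        raise ValueError("expected six MLCDS lengths, one per reading frame")
    total = sum(lengths)
    if total == 0:
        raise ValueError("the MLCDS lengths must not all be zero")
    return lengths[correct] / total


def score_distance(scores, correct):
    """SCORE-distance: the mean of (S - E_j) over the other five frames."""
    if len(scores) != 6:
        raise ValueError("expected six MLCDS S-scores, one per reading frame")
    s = scores[correct]
    others = [e for k, e in enumerate(scores) if k != correct]
    return sum(s - e for e in others) / 5
```

With lengths [300, 40, 35, 50, 45, 30] and the correct frame at index 0, LENGTH-Percentage is 300/500 = 0.6; with S-scores [120, 10, 5, -5, 0, 10], SCORE-distance is (110 + 115 + 125 + 120 + 110)/5 = 116.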

To verify that these features really distinguish coding from non-coding sequences, we computed the four features above on the transcript sets of well-annotated known coding and non-coding RNAs and obtained the mean value of each feature in the coding and non-coding sets (see Table 1).

Table 1

To verify that these four features are generally applicable across species, we selected human and mouse, the two species with the best gene annotation. The results show that all four features classify well in both species and that their values in the coding and non-coding gene sets differ clearly, so we selected these four features for training the classification model.

Finally, we define a 61-dimensional feature from the occurrence frequency of each single nucleotide triplet in the MLCDS region (excluding the 3 stop codons) to complete the CNCI classification model. Nucleotide triplets are called codons in CDS regions; they are directly related to protein translation and are elements with biological significance. Since proteins are the material that carries out biological functions, their composition can hardly be random, and the 20 amino acids will not appear uniformly on a polypeptide chain. The codons of proteins essential to the organism are used many times in CDS regions, so the usage frequency of these triplets, or codons, is far higher than the mean occurrence frequency of the 20 amino acids, and the opposite holds for those that are rarely used. We again assume that the occurrence frequency of these triplets in non-coding sequences is approximately equal to the mean, and compute the log2-ratio of the occurrence frequencies of each triplet in coding and non-coding regions; triplets that tend to occur in coding regions then have values above the mean, and those that tend to occur in non-coding regions have lower values. The data sets used for these statistics are again the 30,507 coding RNAs on hg19 and the 18,566 non-coding RNAs in Gencode (v11) (Fig. 8). The ordinate is the log-ratio of the occurrence frequency of a codon in coding regions to its occurrence frequency in non-coding regions, and the abscissa lists the 61 nucleotide triplets other than the stop codons. It can be seen that most nucleotide triplets show a clear preference for either coding or non-coding RNA, and even the few triplets whose preference is not very pronounced contribute a positive weight to the overall classification. We therefore add this 61-dimensional single nucleotide triplet preference feature to the CNCI model to strengthen the classification. At this point, all features of the classification model have been defined.
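The 61-dimensional triplet feature and the log2-ratio shown in Fig. 8 can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the MLCDS is assumed to be given in its own reading frame, triplets containing characters other than A, C, G, T are ignored, and the small pseudocount that avoids division by zero is an addition of this sketch that the text does not mention.

```python
import math
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
# The 61 non-stop triplets in a fixed (alphabetical) order.
CODONS = [c for c in map("".join, product("ACGT", repeat=3)) if c not in STOP_CODONS]
CODON_SET = set(CODONS)


def codon_frequencies(mlcds):
    """Return the 61-dimensional vector of triplet frequencies of an MLCDS."""
    seq = mlcds.upper().replace("U", "T")  # accept RNA or DNA spelling
    triplets = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    counts = Counter(t for t in triplets if t in CODON_SET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


def log2_ratio(coding_freq, noncoding_freq, pseudo=1e-6):
    """Per-triplet log2 ratio of coding to non-coding frequency.

    `pseudo` is a hypothetical pseudocount; with pseudo=0 a zero frequency
    in either input raises an error.
    """
    return [
        math.log2((c + pseudo) / (n + pseudo))
        for c, n in zip(coding_freq, noncoding_freq)
    ]
```

A positive value marks a triplet that is used more often in coding regions, a negative value one that is used more often in non-coding regions.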

Following the analysis procedure, features are extracted from the human coding and non-coding RNA data sets; the resulting feature vectors are divided into positive and negative training sets, organized into the input format required by the SVM, and used to train the classification model. The SVM used here is LIBSVM version 3.0 with the RBF kernel, and, to prevent overfitting, all parameters, including C and g, are left at their default values.
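LIBSVM reads training data as text lines of the form `<label> <index>:<value> ...` with feature indices starting at 1. The sketch below shows one way to write the feature vectors in that format; the function names, the use of +1 for coding and -1 for non-coding samples, and the order of the features within a vector are assumptions made for illustration.

```python
def to_libsvm_line(label, features):
    """Format one sample as a LIBSVM input line, e.g. '+1 1:116 2:300'."""
    parts = [f"{label:+d}"]
    parts += [f"{index}:{value:g}" for index, value in enumerate(features, start=1)]
    return " ".join(parts)


def training_lines(coding_vectors, noncoding_vectors):
    """Label coding samples +1 and non-coding samples -1 (assumed convention)."""
    lines = [to_libsvm_line(+1, v) for v in coding_vectors]
    lines += [to_libsvm_line(-1, v) for v in noncoding_vectors]
    return lines


def write_training_file(path, coding_vectors, noncoding_vectors):
    """Write the positive and negative training sets to one LIBSVM file."""
    with open(path, "w", encoding="ascii") as handle:
        for line in training_lines(coding_vectors, noncoding_vectors):
            handle.write(line + "\n")
```

With the LIBSVM command-line tools, running `svm-train` on such a file without further options trains a C-SVC with the RBF kernel and the default values of C and gamma, which matches the settings described above.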

Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the claims appended to the present invention.

Claims

1. A method for distinguishing coding genes from non-coding genes, characterized in that it comprises:

Step 2, counting the occurrence frequency X_iF of each adjacent nucleotide triplet (ANT) in the coding sequences, non-coding sequences and intergenic region sequences of the training set, building an occurrence frequency matrix for each, and building a scoring matrix from the three occurrence frequency matrices by log2-ratio computation, wherein the occurrence frequency X_iF is calculated as:

X_{i}N = \sum_{j = 1}^{n} S_{j}(X_{i})

T = \sum_{i = 1}^{m} X_{i}N = \sum_{i = 1}^{m} \left( \sum_{j = 1}^{n} S_{j}(X_{i}) \right); m = 64 \times 64; n = (1, \ldots, N)

X_{i}F = \frac{X_{i}N}{T}

wherein X_i denotes the i-th type of ANT, S_j(X_i) is the number of occurrences of X_i in the j-th sequence of a given class of sequence set, X_iN is the number of occurrences of this ANT in the whole sequence set, T is the total number of occurrences of all types of ANT in the whole sequence set, m is the number of ANT types, and n is the number of sequences contained in the set of that class;

Step 4, using LENGTH\text{-}Percentage = \frac{X}{\sum_{i = 1}^{6} Y_{i}} to obtain the third feature of the classification model, wherein X is the length of the MLCDS in the reading frame and Y_i is the length of the MLCDS in each of the six reading frames;

and using SCORE\text{-}distance = \frac{\sum_{j = 1}^{5} (S - E_{j})}{5} to obtain the fourth feature of the classification model, wherein S is the S-score of the MLCDS extracted from the correct reading frame among the six reading frames of the nucleotide sequence, and E_j is the S-score of the MLCDS extracted from each of the other five incorrect reading frames;

2. The method for distinguishing coding genes from non-coding genes according to claim 1, characterized in that said step 2 comprises:

Step 21, scanning each transcript sequence of the coding sequences and non-coding sequences with a sliding window in six reading frames;

Step 22, scoring each subwindow of the sliding window with the scoring matrix during the scan, so that a transcript consisting of a nucleotide sequence is converted into six arrays whose elements are the window scores of the subwindows;

Step 23, finding in each of the six arrays, by the maximum subsegment sum method of dynamic programming, the subsegment with the largest sum, thereby obtaining six candidate maximum subsegments;

Step 24, using the scoring matrix to select, from the six candidate maximum subsegments, the one with the highest score as the CDS-like feature subsequence of the transcript.

3. The method for distinguishing coding genes from non-coding genes according to claim 2, characterized in that the maximum subsegment sum x in said step 23 is calculated as:

x = \max_{1 \leq i \leq j \leq n} \sum_{k = i}^{j} a[k]

4. The method for distinguishing coding genes from non-coding genes according to claim 1, characterized in that, in said step 5, the sequences to be distinguished are divided into positive and negative training sets, the positive and negative training sets are converted into the input format required by a support vector machine (SVM) and fed into the SVM to train the classification model, and the distinguishing result is obtained.

5. A system for distinguishing coding genes from non-coding genes, characterized in that it comprises:

a frequency statistics module, configured to count the occurrence frequency X_iF of each ANT in the coding sequences, non-coding sequences and intergenic region sequences of the training set, build an occurrence frequency matrix for each, and build a scoring matrix from the three occurrence frequency matrices by log2-ratio computation, wherein the occurrence frequency X_iF is calculated as:

X_{i}N = \sum_{j = 1}^{n} S_{j}(X_{i})

T = \sum_{i = 1}^{m} X_{i}N = \sum_{i = 1}^{m} \left( \sum_{j = 1}^{n} S_{j}(X_{i}) \right); m = 64 \times 64; n = (1, \ldots, N)

X_{i}F = \frac{X_{i}N}{T}

a sequence extraction module, configured to calculate the window score S-score by scoring with the scoring matrix over a sliding window, as the first feature of the classification model, and to use a dynamic programming algorithm to find, in each of the arrays converted from the coding sequences and non-coding sequences of the sample sequences, the region with the maximum subsegment sum as the feature subsequence MLCDS, the length of the MLCDS being the second feature of the classification model;

a feature extraction module, configured to use LENGTH\text{-}Percentage = \frac{X}{\sum_{i = 1}^{6} Y_{i}} to obtain the third feature of the classification model, wherein X is the length of the MLCDS in the reading frame and Y_i is the length of the MLCDS in each of the six reading frames;

and to use SCORE\text{-}distance = \frac{\sum_{j = 1}^{5} (S - E_{j})}{5} to obtain the fourth feature of the classification model, wherein S is the S-score of the MLCDS in the reading frame and E_j is the S-score of the MLCDS in each of the other reading frames;

6. The system for distinguishing coding genes from non-coding genes according to claim 5, characterized in that the sequence extraction module comprises:

7. The system for distinguishing coding genes from non-coding genes according to claim 6, characterized in that the maximum subsegment sum x in the candidate subsegment acquisition module is calculated as:

x = \max_{1 \leq i \leq j \leq n} \sum_{k = i}^{j} a[k]

8. The system for distinguishing coding genes from non-coding genes according to claim 5, characterized in that, in the distinguishing result acquisition module, the sequences to be distinguished are divided into positive and negative training sets, the positive and negative training sets are converted into the input format required by a support vector machine (SVM) and fed into the SVM to train the classification model, and the distinguishing result is obtained.