Method and device for predicting ribonucleic acid pseudoknot structures based on k-stems
Technical field
The invention belongs to the field of bioinformatics engineering and relates to a method for predicting the pseudoknot structure of ribonucleic acid (hereinafter RNA), and more particularly to a method and device for predicting RNA pseudoknot structures based on k-stems.
Background art
RNA is one of the most important macromolecules in biological systems. It performs many functions in vivo and is the template for protein synthesis. RNA secondary structure prediction is used in protein function analysis and is the basis of RNA tertiary structure prediction. The pseudoknot is the most widespread structural unit in RNA and an extremely complex and stable RNA structure. Pseudoknots have structural, catalytic and regulatory functions in RNA molecules and are a key focus of current research on RNA structure prediction.
Two main approaches are used for RNA secondary structure prediction. The earlier approach is comparative sequence analysis, which compares the primary structures of RNAs that have the same biological function in different organisms. Its difficulties are that homologous sequences are hard to obtain for many RNA molecules and that it requires a great deal of manual effort and is therefore inefficient. For these reasons the minimum free energy method is now the main approach.
The minimum free energy approach rests on the premise that a stable RNA secondary structure has the lowest free energy. The PKNOTS algorithm, which is based on minimum free energy, computes arbitrary planar pseudoknots and some non-planar pseudoknots in O(n^6) time and O(n^4) space. PKNOTS can only handle RNA sequences shorter than 140 bases and so cannot meet the need to predict the structure of longer RNA sequences. The PknotsRG algorithm computes simple nested pseudoknots formed by two stem regions, in which any two pseudoknots are either side by side or nested. In fact, pseudoknots formed by internal loops and bulges are common in RNA, and crossing pseudoknots also play an important role, so neither can be ignored. Planar pseudoknots are the broadest pseudoknot subclass and include both the pseudoknots formed by internal loops and bulges and the crossing pseudoknots mentioned above. Of all the sequences in the PseudoBase database, only one folds into a non-planar pseudoknot; the rest all fold into planar pseudoknots. The invention therefore mainly considers the computation of arbitrary planar pseudoknots.
Zuker was the first to apply dynamic programming to the nearest-neighbor model, proposing the MFOLD algorithm. After more than 20 years of updates and development, it has become a widely used RNA secondary structure prediction method worldwide. For an RNA sequence of n nucleotides, MFOLD predicts the optimal secondary structure in O(n^3) time and O(n^2) space. For RNA sequences shorter than 700 nucleotides, MFOLD correctly predicts about 73% of the RNA base pairs, but its accuracy falls for long sequences and for some subclasses. The algorithm gives only a rough framework for tertiary structure prediction, and because of its inherent limitations MFOLD cannot predict pseudoknots or more complex tertiary structures.
Summary of the invention
The technical problem solved by the invention is to provide a k-stem-based method for predicting RNA structures, especially RNA structures that contain pseudoknots, which reduces the time and space complexity of prediction and improves prediction accuracy.
A method for predicting ribonucleic acid pseudoknot structures based on k-stems according to the invention comprises the following steps:
Input a ribonucleic acid base sequence;
Define the pseudoknot and the k-stem, k ≥ 1;
Search the bases from left to right and mark all k-stems found;
Search for pseudoknots, using the property that crossing base pairs of two or more k-stems form a pseudoknot;
Calculate the minimum energy of the ribonucleic acid pseudoknot structure comprising k-stems;
Output the pseudoknot structure of the ribonucleic acid.
A 1-stem (denoted S1[i, j]) is closed by base pairs (i, j) and (r, s) ∈ S. If a (k-1)-stem is closed by base pairs (r′, s′) and (k, l) ∈ S, with i < r < r′ < k < l < s′ < s < j and v = r′ - r + s - s′ > 2, then the structure closed by (i, j) and (k, l) ∈ S is called a k-stem (denoted Sk[i, j]). Crossing base pairs in two k-stems form a pseudoknot. When the bases are searched from left to right, 1-stems are sought first; if a 1-stem is found, all bases in it are marked. 2-stems, 3-stems, ..., k-stems are then sought in the same way, and if one is found, all bases in it are marked.
A device for predicting ribonucleic acid pseudoknot structures based on k-stems comprises:
Input unit: inputs a ribonucleic acid base sequence;
Definition unit: defines 1-stems, 2-stems, ..., k-stems;
Search unit: searches the bases from left to right and marks the bases in all 1-stems, 2-stems, ..., k-stems found;
Pseudoknot structure search unit: searches for pseudoknots, using the property that crossing base pairs of two or more k-stems form a pseudoknot;
Pseudoknot calculation unit: calculates the minimum energy of the ribonucleic acid pseudoknot structure comprising k-stems;
Output unit: outputs the pseudoknot structure of the ribonucleic acid base sequence according to the minimum energy principle.
The method of the invention outperforms the PKNOTS algorithm in search speed, accuracy, sensitivity and specificity. It is therefore more effective than the PKNOTS algorithm for predicting planar pseudoknots.
Brief description of the drawings
Fig. 1 is a flow chart of the k-stem-based RNA pseudoknot structure prediction method of the invention;
Fig. 2 is a flow chart of the k-stem processing of the invention;
Fig. 3 shows the prediction device, corresponding to Fig. 1, for predicting RNA pseudoknot structures;
Fig. 4 is an example of an RNA pseudoknot structure of the invention;
Fig. 5 illustrates W and V in the calculation of the minimum energy of an RNA pseudoknot structure according to the invention.
Detailed description of the embodiments
The concepts of RNA sequence, base pair, pseudoknot and so on are explained first.
RNA primary structure: the order of the four bases along the RNA chain. An RNA base sequence is generally written from the 5′ end to the 3′ end, so the whole sequence s is expressed as s = s1s2…sn, where si denotes the i-th base of the RNA sequence and si ∈ {A, U, G, C}. An RNA subsequence si,j is a fragment of s, expressed as si,j = si…sj.
Base pair: if si·sj ∈ {AU, CG, GU}, then si·sj forms a base pair. The stacking energy of base pairs is negative.
RNA secondary structure: a set of base pairs in the RNA sequence, denoted S. For any base pairs si·sj ∈ S and si′·sj′ ∈ S, i = i′ if and only if j = j′; that is, a base cannot form base pairs with two or more bases at the same time.
Pseudoknot: if base pairs si·sj and si′·sj′ ∈ S satisfy i < i′ < j < j′, then the sequence si...si′...sj...sj′ forms a pseudoknot structure.
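The definitions above can be checked mechanically. The following is a minimal sketch, not part of the original disclosure: the function names are hypothetical, base pairs are assumed to be given as position tuples (i, j) with i < j, and the pairing rule is taken to apply in either order (AU or UA, and so on).

```python
# Illustrative helpers; all names are hypothetical and not taken from the source.
CANONICAL = {"AU", "UA", "CG", "GC", "GU", "UG"}


def is_base_pair(a, b):
    """True if bases a and b can pair (AU, CG or GU, in either order)."""
    return a + b in CANONICAL


def is_secondary_structure(pairs):
    """True if no position takes part in more than one base pair."""
    used = [p for pair in pairs for p in pair]
    return len(used) == len(set(used))


def crossing_pairs(pairs):
    """Return every two base pairs (i, j), (i2, j2) with i < i2 < j < j2."""
    ordered = sorted(pairs)
    return [
        (a, b)
        for n, a in enumerate(ordered)
        for b in ordered[n + 1:]
        if a[0] < b[0] < a[1] < b[1]
    ]
```

For instance, base pairs at positions (1, 19) and (7, 30) satisfy 1 < 7 < 19 < 30 and are reported as crossing, whereas the nested pairs (1, 10) and (2, 9) are not.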
Fig. 1 is a flow chart of the method for predicting RNA pseudoknot structures based on stem regions according to the invention. The method comprises the following steps: input an RNA sequence; define the pseudoknot and the k-stem (k ≥ 1); search the bases from left to right and mark all k-stems found; search for pseudoknots, using the property that crossing base pairs of two or more k-stems form a pseudoknot; calculate the minimum energy of the ribonucleic acid pseudoknot structure comprising k-stems; output the pseudoknot structure of the ribonucleic acid. Fig. 3 shows the prediction device, corresponding to Fig. 1, for predicting RNA pseudoknot structures based on stem regions. The device comprises: an input unit, which inputs a ribonucleic acid base sequence; a definition unit, which defines the pseudoknot and the k-stem, k ≥ 1; a search unit, which searches the bases from left to right and marks the bases in all 1-stems, 2-stems, ..., k-stems found; a pseudoknot structure search unit, which searches for pseudoknots using the property that crossing base pairs of two or more k-stems form a pseudoknot; a pseudoknot energy calculation unit, which calculates the minimum energy of the ribonucleic acid pseudoknot structure comprising k-stems; and an output unit, which outputs the pseudoknot structure of the ribonucleic acid base sequence according to the minimum energy principle.
Fig. 2 is a flow chart of the k-stem processing according to the invention. Input a sequence s = s1s2…sn and search the bases from left to right. If there exist i and j such that si and sj pair, j - i ≥ 6, and there are three or more consecutive adjacent base pairs si·sj, s(i+1)·s(j-1), ..., sk·sl, then the interval closed by base pairs si·sj and sk·sl is defined as a 1-stem, and all paired bases in the 1-stem are marked. The search for paired bases then continues among the free bases enclosed by the 1-stem; if three or more base pairs are found, a 2-stem is defined and all paired bases in the 2-stem are marked. The search continues among the free bases enclosed by the 1-stem and the 2-stem; if three or more base pairs are found, a 3-stem is defined and all paired bases in the 3-stem are marked; and so on until the k-stem is found. If base pairs of two or more k-stems cross, they form a pseudoknot.
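The 1-stem step of this flow can be sketched as follows. This is an illustrative reading only, not the patented procedure: the function name, the 0-based indexing, and the greedy choice of trying the rightmost partner j first are assumptions, and the later 2-stem to k-stem passes are not covered.

```python
# Hypothetical sketch of the left-to-right 1-stem search (0-based positions).
PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def find_1_stems(seq, min_pairs=3, min_span=6):
    """Return 1-stems as tuples (i, j, k, l): outer pair (i, j), inner pair (k, l)."""
    n = len(seq)
    marked = [False] * n  # bases already assigned to a stem
    stems = []
    for i in range(n):
        if marked[i]:
            continue
        # Assumption: try the rightmost candidate partner first, keeping j - i >= min_span.
        for j in range(n - 1, i + min_span - 1, -1):
            if marked[j] or seq[i] + seq[j] not in PAIRS:
                continue
            k, l = i, j
            # Extend the run of adjacent pairs (i+1, j-1), (i+2, j-2), ...
            while (k + 1 < l - 1
                   and not marked[k + 1] and not marked[l - 1]
                   and seq[k + 1] + seq[l - 1] in PAIRS):
                k, l = k + 1, l - 1
            if k - i + 1 >= min_pairs:
                stems.append((i, j, k, l))
                for p in range(i, k + 1):  # mark the 5' strand
                    marked[p] = True
                for p in range(l, j + 1):  # mark the 3' strand
                    marked[p] = True
                break
    return stems
```

On the toy sequence GGGAAACCC the sketch reports a single 1-stem closed by the pairs (0, 8) and (2, 6); a sequence with no complementary bases yields an empty list.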
Definition: in an RNA subsequence si,j, if (i, j), (i+1, j-1), ..., (k, l) are all base pairs with i < k < l < j, then the structure closed by (i, j) and (k, l) ∈ S is called a 1-stem, denoted S1[i, j]. If a 1-stem S1[i, j] is closed by (i, j) and (r, s) ∈ S and a 1-stem S1[r′, s′] is closed by (r′, s′) and (k, l) ∈ S, with i < r < r′ < k < l < s′ < s < j and v = r′ - r + s - s′ > 2, then the structure closed by (i, j) and (k, l) ∈ S is called a 2-stem, denoted S2[i, j].
Similarly, if S1[i, j] is closed by (i, j) and (r, s) ∈ S and a (k-1)-stem is closed by (r′, s′) and (k, l) ∈ S, with i < r < r′ < k < l < s′ < s < j and v = r′ - r + s - s′ > 2, then the structure closed by (i, j) and (k, l) ∈ S is called a k-stem, denoted Sk[i, j]. The minimum energy of Sk[i, j] is denoted ESk(i, j), and the length of the k-stem Sk[i, j] is denoted LSk(i, j) = k - i + 1 or RSk(i, j) = j - l + 1.
A 2-stem S2[i, j] consists of two nested 1-stems and its internal unpaired bases. Let E2(r, r′ : s′, s) denote the energy of the 2-loop formed by base pairs (r, s) and (r′, s′), and let ES1(i, j) and ES1(r′, s′) denote the energies of the 1-stems closed by base pairs (i, j) and (r′, s′) respectively. Then ES2(i, j) = ES1(i, j) + E2(r, r′ : s′, s) + ES1(r′, s′). Similarly, ESk(i, j) = ES1(i, j) + E2(r, r′ : s′, s) + ESk-1(r′, s′).
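Unrolling this recurrence shows that the energy of a k-stem is the sum of the energies of its k nested 1-stems plus the energies of the k - 1 2-loops that join them. The sketch below illustrates that arithmetic only; the function name and the numeric energies are made up for the example and are not values from the source.

```python
def k_stem_energy(stem_energies, loop_energies):
    """Evaluate ESk = ES1 + E2 + ES(k-1) recursively.

    stem_energies: ES1 of each nested 1-stem, outermost first (length k).
    loop_energies: E2 of the 2-loop joining consecutive 1-stems (length k - 1).
    """
    if not stem_energies or len(loop_energies) != len(stem_energies) - 1:
        raise ValueError("expected k stem energies and k - 1 loop energies")
    if len(stem_energies) == 1:
        return stem_energies[0]  # base case: a 1-stem
    return (stem_energies[0] + loop_energies[0]
            + k_stem_energy(stem_energies[1:], loop_energies[1:]))


# Hypothetical energies: a 2-stem built from 1-stems of -4.0 and -3.0 joined by a 2-loop of +1.5.
es2 = k_stem_energy([-4.0, -3.0], [1.5])  # -4.0 + 1.5 + (-3.0)
```

With these made-up numbers the 2-stem energy is -5.5, and appending a third 1-stem of -2.0 through a 2-loop of +0.5 gives a 3-stem energy of -7.0.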
Let LS(i, j) ∈ {LS1(i, j), LS2(i, j)} and ES(i, j) ∈ {ES1(i, j), ES2(i, j)}. In the method of the invention, the free energies and lengths of 1-stems and 2-stems are precomputed in O(n^3) time and stored in the triangular matrices ES(i, j) and LS(i, j) respectively; the calculation is given in Program 1.
Similarly, the formula for ESk(i, j) shows that computing k-stems takes O(n^3) time and O(n^2) space. The calculation of k-stems (k ≥ 3) is implemented by the dynamic programming algorithm below.
A k-stem consists of stems and 2-loops, and its free energy is the sum of the energies of the stacks and loops it contains. Any pseudoknot can be decomposed into k-stems and multi-branch loops.
Embodiment 1:
In RNA pseudoknot structure prediction, when k = 1 or k = 2, the 1-stems and 2-stems are calculated by the following program.
Program 1: calculation of the energy and length of 1-stems and 2-stems
Fig. 4 shows a simple pseudoknot formed by two 1-stems (S1[1, 19] and S1[7, 30]) and three subsequences (s6,6, s13,14 and s20,24). Because each 1-stem is determined by two parameters, storing the 1-stems requires O(n^2) space; the time complexity of computing a pseudoknot is therefore O(n^4) and the space complexity is O(n^2).
From Fig. 4: W(1, 30) = ES1(1, 19) + ES1(7, 30) + W(6, 6) + W(13, 14) + W(20, 24)
Embodiment 2:
Given a sequence s = s1s2…sn and a sequence fragment si,j = si…sj, 1 < i < j < n, let W(i, j) be the minimum energy of the secondary structure S, including pseudoknots, corresponding to the subsequence si,j. Let V(i, j) be the minimum energy of the secondary structure S, including pseudoknots, corresponding to the subsequence si,j when si and sj form the base pair (i, j).
Fig. 5 illustrates the calculation of W(i, j) and V(i, j). W(i, j), including pseudoknot structures, is calculated according to the following four cases:
1) sj is an unpaired base and the pairing relationship of bases si and sj-1 is undetermined, as in Fig. 5.1: W(i, j) = W(i, j-1);
2) si is an unpaired base and the pairing relationship of bases si+1 and sj is undetermined, as in Fig. 5.2: W(i, j) = W(i+1, j);
3) si and sk, and sk+1 and sj, do not form base pairs and lie in the secondary structures of the different subsequences si,k and sk+1,j, i < k < j, as in Fig. 5.3: W(i, j) = min over i < k < j of (W(i, k) + W(k+1, j));
4) si and sj form the base pair (i, j), as in Fig. 5.4: W(i, j) = min(V(i, j)).
V(i, j), including pseudoknot structures, is calculated according to the following three cases:
(1) S is a 1-loop closed by base pair (i, j), as in Fig. 5.5: V(i, j) = min E1(i, j);
(2) S is a 2-loop closed by base pairs (i, j) and (k, l), as in Fig. 5.6:
V(i, j) = min(E2(i, k : l, j) + V(k, l)), i < k < l < j, u = (k - i + j - l) - 2 < U;
(3) S is a k-loop (k ≥ 3) or a pseudoknot structure, i < k < j, as in Fig. 5.6. V(i, j) is calculated using WM together with the penalties M and P, where M denotes the penalty for forming a multi-branch loop and P denotes the penalty for each base pair in the multi-branch loop. WM has the same calculation formula as W but different parameters: WM is used specifically for predicting the structure of sequence fragments inside a multi-branch loop, whereas W is used only for predicting the structure of sequence fragments that have no enclosing base pair.
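To make the shape of these recursions concrete, the following is a deliberately reduced sketch. It covers cases 1) to 4) of W and cases (1) and (2) of V only; the multi-branch loop and pseudoknot case (3), including WM, M and P, is omitted. The energy functions e1 and e2, the minimum hairpin size and the bound max_unpaired (standing in for U) are toy placeholders chosen for illustration, not the energy model of the invention, and positions are 0-based.

```python
from functools import lru_cache

PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}
INF = float("inf")


def min_energy(seq, min_hairpin=3, max_unpaired=30):
    """Toy W/V recursion without multi-branch loops or pseudoknots (short sequences only)."""

    def e1(i, j):
        return 3.0  # placeholder 1-loop (hairpin) energy

    def e2(i, k, l, j):
        u = (k - i + j - l) - 2  # unpaired bases in the 2-loop
        return -2.0 if u == 0 else 0.5 * u  # placeholder: stack favourable, loops penalised

    @lru_cache(maxsize=None)
    def V(i, j):
        # Minimum energy of s[i..j] given that (i, j) is a base pair.
        if j - i - 1 < min_hairpin or seq[i] + seq[j] not in PAIRS:
            return INF
        best = e1(i, j)  # case (1): 1-loop
        for k in range(i + 1, j - 1):  # case (2): 2-loop closed by (i, j) and (k, l)
            for l in range(k + 1, j):
                if (k - i + j - l) - 2 < max_unpaired:
                    best = min(best, e2(i, k, l, j) + V(k, l))
        return best

    @lru_cache(maxsize=None)
    def W(i, j):
        # Minimum energy of s[i..j]; an empty or single-base fragment costs 0.
        if j <= i:
            return 0.0
        best = min(W(i, j - 1), W(i + 1, j), V(i, j))  # cases 1), 2) and 4)
        for k in range(i + 1, j):  # case 3): split into s[i..k] and s[k+1..j]
            best = min(best, W(i, k) + W(k + 1, j))
        return best

    return W(0, len(seq) - 1)
```

With these placeholder energies, the hairpin GGGAAACCC scores 3.0 for the loop plus two stacks of -2.0 each, for a total of -1.0. A faithful implementation would replace e1 and e2 with a nearest-neighbor energy model and add the k-stem and multi-branch terms described above.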
Experimental comparison of the method of the invention with the PKNOTS algorithm
The PknotsRG method can only predict simple nested pseudoknots formed by two 1-stems and cannot predict crossing pseudoknots. The method of the invention was implemented in C++ and compared with the PKNOTS method.
Table 1. Computation time of the method of the invention compared with the PKNOTS method
The computation times are compared in Table 1. The method of the invention was tested on a PC with a 3.0 GHz dual-core CPU and 4 GB of memory, while the PKNOTS algorithm was tested on a Silicon Graphics Origin200 high-performance computer. As Table 1 shows, for an RNA sequence of 75 bases the method of the invention takes 51 seconds and the PKNOTS algorithm takes 20 minutes. For an RNA sequence of 105 bases, the method of the invention takes 225 seconds and the PKNOTS algorithm takes 235 minutes. For an RNA sequence of 200 bases, the method of the invention takes 72 minutes, whereas the PKNOTS algorithm cannot complete the calculation. In fact, the method of the invention can successfully predict the secondary structure of RNA sequences longer than 1500 bases.
Although embodiments of the invention are disclosed above, they are described only to facilitate understanding of the invention and are not intended to limit it. Any person skilled in the art to which the invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention shall still be as defined by the appended claims.