CN110111838B - Method and device for predicting pseudoknot-containing RNA folding structure based on extended structure

Info

Publication number
CN110111838B
Authority
CN
China
Prior art keywords
extended
ribonucleic acid
pseudoknot
base
rna
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910367639.6A
Other languages
Chinese (zh)
Other versions
CN110111838A (en
Inventor
刘振栋
刘芳含
李跃军
李恒斐
郝凡昌
徐俊丽
杨朝晖
勾红领
王继伟
杨玉荣
侯铁
李恒武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201910367639.6A
Publication of CN110111838A
Application granted
Publication of CN110111838B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/10: Nucleic acid folding

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Biochemistry (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method and a device for predicting a pseudoknot-containing ribonucleic acid folding structure based on an extended structure. The method comprises the following steps: inputting an arbitrary ribonucleic acid base sequence; defining the pseudoknot and the extended structure; establishing a ribonucleic acid pseudoknot structure characteristic model and a mathematical model containing the pseudoknot and the extended structure; calculating the minimum base free energy of the characteristic model; and outputting the ribonucleic acid folding structure containing the pseudoknot. The device comprises an input unit, an initialization unit, a storage unit, a calculation unit and an output unit. The invention performs its calculation on the basis of the extended structure and introduces continuous stacking and coaxial stacking based on the extended structure. This helps to form a complete and accurate RNA folding structure comprising continuous stacks, extended structures, loop structures and pseudoknot structures. According to the invention, its search speed, accuracy, sensitivity and specificity are markedly better than those of the prior art, and it is more effective than the prior art in predicting planar and non-planar pseudoknot structures.

Description

Method and device for predicting pseudoknot-containing RNA folding structure based on extended structure
Technical Field
The invention relates to a method for predicting the pseudoknot structure and the extended structure of ribonucleic acid (RNA), in particular to a method and a device for predicting a pseudoknot-containing ribonucleic acid folding structure based on the extended structure, and belongs to the field of bioinformatics engineering.
Background
Ribonucleic acid (RNA) is a single strand transcribed from one strand of DNA as a template according to the principle of complementary base pairing. It is a carrier of genetic information found in biological cells and in some viruses and viroids. RNA is a long-chain molecule formed by the condensation of ribonucleotides through phosphodiester bonds. One ribonucleotide molecule consists of a phosphate, a ribose and a base. RNA has four kinds of bases: adenine (A), guanine (G), cytosine (C) and uracil (U). Its main function is to express genetic information as protein; it is the bridge in the conversion of genetic information into phenotype.
RNA is one of the three most important types of biological macromolecules in biological systems; it performs multiple functions in organisms and is the template for protein synthesis. RNA folding structure prediction is used for protein function analysis and is the basis for predicting RNA tertiary structure. The pseudoknot is the most widespread structural unit in RNA and is a very complex and stable RNA structure. Pseudoknots have structural, catalytic and regulatory functions in RNA molecules, and the pseudoknot structure is the focus of current research on RNA structure prediction.
Two main approaches are used to predict RNA folding structure. The earlier approach is comparative sequence analysis, i.e., comparing primary structures that serve the same biological function in different organisms. Its difficulties are that homologous sequences are not readily available for many RNA molecules, and that the method requires a great deal of manual work and is inefficient. The minimum free energy method is therefore mainly used at present. The theoretical basis of the minimum free energy algorithm is that the stable folded structure has the minimum free energy. The PKNOTS algorithm, which is based on the minimum free energy algorithm, computes arbitrary planar pseudoknots and some non-planar pseudoknots in $O(n^6)$ time and $O(n^4)$ space. PKNOTS can only handle RNA sequences shorter than 140 bases and cannot meet the need to predict the structure of longer sequences. The PknotsRG algorithm computes simple nested pseudoknots formed by two stem regions, in which any two pseudoknots are either side by side or nested. In fact, pseudoknots containing internal loops and bulges are ubiquitous in RNA, and crossing pseudoknots also play an important role, so neither can be ignored. Planar pseudoknots are the most widespread class of pseudoknots and include those containing internal loops and bulges as well as crossing pseudoknots. Only one of all the sequences in the PseudoBase database folds into a non-planar pseudoknot; the rest fold into planar pseudoknots. We therefore mainly consider the calculation of arbitrary planar pseudoknots.
Zuker first applied a dynamic programming algorithm to the nearest-neighbor model and proposed the MFOLD algorithm. After more than two decades of continuous improvement and development, MFOLD is now one of the most widely used RNA folding structure prediction methods in the world. For an RNA sequence containing n nucleotides, MFOLD predicts the optimal folding structure in $O(n^3)$ time and $O(n^2)$ space. At present, for RNA sequences shorter than 700 nucleotides, MFOLD correctly predicts about 73% of base pairs; the prediction accuracy is lower for longer RNA sequences and for some classes of molecules. In addition, owing to the limitations of the algorithm, MFOLD cannot predict pseudoknots or more complex tertiary interactions, which greatly limits its application.
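As an illustration of the $O(n^3)$-time, $O(n^2)$-space dynamic programming used by pseudoknot-free predictors, the following minimal sketch maximizes the number of nested base pairs (the Nussinov recurrence) instead of minimizing nearest-neighbor free energy. It is a simplified stand-in and not MFOLD itself; the function name `max_nested_pairs`, the 0-based indexing and the minimum loop size of 3 are assumptions made for the example.

```python
# Allowed pairs (Watson-Crick plus GU wobble), in both orientations.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def max_nested_pairs(seq, min_loop=3):
    """Maximum number of non-crossing base pairs in seq (Nussinov-style DP)."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]  # O(n^2) table, best[i][j] for s_i..s_j
    for span in range(min_loop + 1, n):  # shorter spans cannot hold a pair
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

Because every pair (k, j) splits the interval into two independent sub-intervals, this recurrence can never produce crossing pairs, which is the structural reason such algorithms cannot predict pseudoknots.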
Chinese patent document CN103235902A discloses a method for predicting an RNA structure containing pseudoknots, comprising: determining all structural units, including pseudoknots, in the RNA sequence to be predicted, and placing all structural units known to be present in a pool of structural units $S_0 = \{s_1, s_2, s_3, \dots, s_n\}$, where n is the total number of structural units and $s_n$ denotes the nth structural unit; determining $U = \{U_1, U_2, \dots, U_r, \dots, U_R\}$ by iteration over all structural units in the RNA sequence to be predicted, where $U_r$ denotes the lower-energy RNA structure obtained in the r-th iteration and R is the total number of iterations; determining, from the sum of the free energies of the elements of $U_r$ and the frequency with which each element occurs in all RNA structures, the similarity value between each element of U and the actual RNA structure; and predicting the elements of U with high similarity values as the RNA structure of the RNA sequence to be predicted.
CN104298894A discloses a method and a device for predicting a ribonucleic acid pseudoknot structure based on k-stems, comprising the following steps: inputting a ribonucleic acid base sequence; defining pseudoknots and k-stems ($k \ge 1$); searching the RNA bases and k-stems from left to right, and determining and marking all k-stems found; searching for pseudoknots according to the property that crossing k-stems form a pseudoknot; calculating the minimum free energy of the ribonucleic acid pseudoknot structure comprising the k-stems; and outputting the pseudoknot structure of the ribonucleic acid.
CN104765983A discloses a method and a device for predicting a ribonucleic acid pseudoknot structure based on a semi-extended structure, comprising the following steps: inputting a ribonucleic acid base sequence; defining a semi-extended structure; establishing a representation model of the ribonucleic acid pseudoknot structure containing k-stems and semi-extended structures, together with the corresponding formula for the minimum energy; and outputting the pseudoknot structure of the ribonucleic acid base sequence according to the minimum energy principle.
Although these methods are more effective than the PKNOTS algorithm at predicting pseudoknot structures, their pseudoknot representation models suffer from low parameter precision, inaccurate free energy values and large errors in the calculation method. As a result, their search speed, accuracy, sensitivity and specificity in pseudoknot structure prediction fall short of the ideal and need further improvement.
Therefore, it is necessary to introduce the concept of the extended structure and to specify the internal and external base pairing rules for the two RNA sequence segments $s_{i,k}$ and $s_{l,j}$, so that the model is closer to the real structure, the defects of low parameter precision, inaccurate free energy values and large calculation errors in predicting pseudoknot-containing RNA folding structures are overcome, and search speed, accuracy, sensitivity and specificity are markedly improved.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a method for predicting a pseudoknot-containing ribonucleic acid folding structure based on an extended structure, which greatly reduces the time and space complexity of such prediction, gives a higher search speed and accuracy, and markedly improves sensitivity and specificity. A device for implementing the method is also provided.
The method of the invention for predicting a pseudoknot-containing ribonucleic acid folding structure based on an extended structure comprises the following steps:
(1) inputting an arbitrary ribonucleic acid base sequence, defining the pseudoknot and defining the extended structure;
A sequence $s = s_1 s_2 \dots s_n$ is input and bases are searched at random. If there exist i and j such that $s_i$ and $s_j$ pair, $j - i \ge 3$, and s contains more than three consecutive adjacent base pairs $s_i \cdot s_j$, $s_{i+1} \cdot s_{j-1}$, …, $s_k \cdot s_l$, then the region closed by the base pairs $s_i \cdot s_j$ and $s_k \cdot s_l$ is determined to be a continuous stack, and all paired bases in the stack are marked. The search for paired bases continues among the free bases enclosed by the continuous stack, and a further continuous stack is determined if more than three base pairs exist. A pseudoknot is formed if more than two continuous stacks cross. After the continuous stacks have been determined, a continuous stack together with two base sequences containing free bases is determined to be an extended structure. A pseudoknot is formed by the cross pairing of two base pairs; a pseudoknot structure is formed by the cross pairing of more than two continuous stacks or extended structures;
(2) establishing a ribonucleic acid pseudoknot structure characteristic model and a mathematical model containing the pseudoknot and the extended structure;
(3) calculating the minimum base free energy of the characteristic model;
(4) outputting the ribonucleic acid folding structure containing the pseudoknot according to the result calculated under the minimum base free energy principle.
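The stack search described under step (1) can be sketched as follows. This is only an illustrative sketch: it enumerates candidate stacks exhaustively instead of searching at random, uses 0-based indices, and returns each maximal run of pairs (i, j), (i+1, j-1), …, (k, l) as a tuple `(i, j, k, l)`. The function name `find_stacks` and the parameters `min_pairs` and `min_gap` are assumptions; in particular, the minimum number of pairs is left as a parameter because the text does not settle whether its threshold of three is inclusive.

```python
# Allowed pairs (AU, CG, GU), written in both orientations.
PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def find_stacks(seq, min_pairs=3, min_gap=3):
    """Return maximal gap-free stacks as (i, j, k, l) with pairs (i,j) ... (k,l)."""
    n = len(seq)

    def pairs(a, b):
        # Bases a and b pair if they are in range, at least min_gap apart
        # (j - i >= 3 in the text), and chemically compatible.
        return 0 <= a < b < n and b - a >= min_gap and seq[a] + seq[b] in PAIRS

    stacks = []
    for i in range(n):
        for j in range(i + min_gap, n):
            if not pairs(i, j):
                continue
            if pairs(i - 1, j + 1):
                continue  # (i, j) is not the outermost pair of its run
            k, l = i, j
            while pairs(k + 1, l - 1):  # extend the run inward
                k, l = k + 1, l - 1
            if k - i + 1 >= min_pairs:
                stacks.append((i, j, k, l))
    return stacks
```

For the hairpin `"GGGAAACCC"` the only stack with at least three pairs is closed by (0, 8) on the outside and (2, 6) on the inside.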
An extended structure consists of two ribonucleic acid sequence segments $s_{i,k}$ and $s_{l,j}$, with $i<k<l<j$. The crossing of segments $s_{i,k}$ and $s_{l,j}$ may form a pseudoknot structure.
In the two ribonucleic acid sequence segments $s_{i,k}$ and $s_{l,j}$, there exist p and q, $i<p<q<k$, such that $s_{p,q}$ and $s_{l,j}$ form a continuous stack and there is no pairing between bases inside segment $s_{i,k}$, i.e.: for m and n with $m<n$, if (m, n) is a base pair, then $m<i$ or $n>k$, or $m<i$ and $n<i$, or $k>m$ and $k>n$; then the segments $s_{i,k}$ and $s_{l,j}$ form an extended structure, and $P[i,k:l,j]$ denotes the optimal extended structure. Alternatively, in the two ribonucleic acid sequence segments $s_{i,k}$ and $s_{l,j}$, there exist r and s, $l<r<s<j$, such that $s_{r,s}$ and $s_{i,k}$ form a continuous stack and there is no pairing between bases inside segment $s_{l,j}$, i.e.: for m and n with $m<n$, if (m, n) is a base pair, then $m<l$ or $n>j$, or $m<l$ and $n<l$, or $k>j$ and $k>j$; then the segments $s_{i,k}$ and $s_{l,j}$ form an extended structure, and $P[i,k:l,j]$ denotes its optimal extended structure.
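A minimal sketch of the first case of this definition is given below. It rests on two interpretive assumptions that the text does not spell out: "$s_{p,q}$ and $s_{l,j}$ form a continuous stack" is read as base p+t pairing with base j-t for every t (so q-p must equal j-l), and "no pairing between bases inside $s_{i,k}$" is read as no base pair having both ends inside [i, k]. The function name `is_extended_structure` and the representation of a structure as a set of index pairs (m, n) with m < n are hypothetical.

```python
def is_extended_structure(pairs, i, k, l, j):
    """Check the first case of the extended-structure definition.

    pairs is a set of (m, n) index tuples with m < n.
    """
    if not (i < k < l < j):
        return False
    # No base pair may have both of its ends inside segment s_{i,k}.
    if any(i <= m and n <= k for m, n in pairs):
        return False
    width = j - l  # a gap-free stack against s_{l,j} needs q - p == j - l
    for p in range(i + 1, k - width):  # enforces i < p and q = p + width < k
        if all((p + t, j - t) in pairs for t in range(width + 1)):
            return True
    return False
```

For example, with pairs {(2, 12), (3, 11), (4, 10)} the segments $s_{1,5}$ and $s_{10,12}$ pass the check with p = 2 and q = 4, while adding the internal pair (1, 5) makes it fail.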
The extended structure and the pseudoknot are used to improve the representation model of the folding structure and the calculation method for continuous stacking, and the calculation parameters are optimized. The values of the nearest-neighbor Watson-Crick free energy parameters, the pseudoknot energy parameters and the base-pair stacking parameters are optimized and improved.
W(i, j) is the minimum free energy of the pseudoknot-containing, extended-structure-based RNA folding structure S corresponding to the subsequence $s_{i,j}$ when the bases $s_i$ and $s_j$ of the extended structure do not form the base pair (i, j). The cases for calculating W(i, j) include: (1) in the extended structure, $s_i$ and $s_j$ do not take part in forming a stack, $s_i$ and $s_j$ are unpaired bases, and $s_i$ and $s_j$ do not form the base pair (i, j) and lie in the RNA folding structures corresponding to the different subsequences $s_{i,k}$ and $s_{k+1,j}$, $i<k<j$; (2) $s_i$ and $s_j$ do not form the base pair (i, j), and $s_{i,j}$ consists of one extended structure and one subsequence, or of two extended structures, or of two extended structures and one subsequence.
V(i, j) is the minimum energy of the pseudoknot-containing, extended-structure-based RNA folding structure S corresponding to the subsequence $s_{i,j}$ when the bases $s_i$ and $s_j$ form the base pair (i, j). The cases for calculating V(i, j) include: S is a continuous stack closed by the base pair (i, j) in the extended structure; or S is a pseudoknot-containing stack in the extended structure closed by the base pairs (i, j) and (k, l), $i<k<j$; or S is a pseudoknot-containing closed stack in the extended structure with $i<k<j$, $k<r<l$; and so on.
The cases for calculating an extended structure containing a pseudoknot include: (1) an extended structure composed of another extended structure and one or several unpaired bases; (2) an extended structure composed of another extended structure and a subsequence containing base pairs; (3) an extended structure composed of two other extended structures; (4) two extended structures that cross to form a pseudoknot structure.
The minimum free energies of W(i, j), V(i, j) and the pseudoknot-containing extended structures are calculated using a dynamic programming algorithm.
The device for predicting a pseudoknot-containing ribonucleic acid folding structure based on the extended structure, which implements the above method, comprises:
an input unit: inputting a ribonucleic acid base sequence;
an initialization unit: defining the pseudoknot and defining the extended structure;
a storage unit: storing the established pseudoknot model and the ribonucleic acid folding structure characteristic model of the extended structure, and storing the corresponding parameters, data structures and formulas for the minimum base free energy;
a calculation unit: calculating the free energy value and the probability value;
an output unit: outputting the pseudoknot-containing folding structure of the RNA base sequence based on the extended structure, according to the minimum free energy principle and the statistical probability of occurrence.
The invention introduces the concept of the extended structure, performs its calculation on the basis of the extended structure, and defines the extended structure precisely. The internal and external base pairing rules for the two ribonucleic acid sequence segments $s_{i,k}$ and $s_{l,j}$ are specified so that they are closer to the real structure, i.e., there exist p and q, $i<p<q<k$, such that $s_{p,q}$ and $s_{l,j}$ form a continuous stack and there is no pairing between bases inside segment $s_{i,k}$, i.e.: for m and n with $m<n$, if (m, n) is a base pair, then $m<i$ or $n>k$, or $m<i$ and $n<i$, or $k>m$ and $k>n$. The extended structure and the pseudoknot are used to improve the representation model of the folding structure and the calculation method for continuous stacking, and the calculation parameters are optimized. The values of the nearest-neighbor free energy parameters and the base-pair stacking parameters are optimized and improved. The defects of the pseudoknot representation model in the semi-extended structure, namely low parameter precision, inaccurate free energy values and large errors in the calculation method, are overcome, so that search speed, accuracy, sensitivity and specificity are markedly improved over the prior art: the prediction accuracy reaches 93.7%, the average sensitivity reaches 98.2%, and the average specificity reaches 97.5%. The invention is more effective than the prior art in predicting planar and non-planar pseudoknots.
Drawings
FIG. 1 is a flow chart of the method of the invention for predicting a pseudoknot-containing RNA folding structure based on the extended structure;
FIG. 2 is a flow chart of the search for continuous stacks and extended structures according to the invention;
FIG. 3 is a flow chart of the processing units in the prediction device of the invention;
FIG. 4 is a schematic diagram of an example of an RNA folding structure;
FIG. 5 is an example of the improved energy parameters and calculation method for the RNA folding structure according to the invention;
FIG. 6 is a model representation of the minimum free energies W(i, j) and V(i, j) after improvement and optimization in pseudoknot-containing RNA according to the invention;
FIG. 7 is a partial schematic representation of the pseudoknot-containing RNA extended structure of the invention.
Detailed Description
First, the concepts of RNA sequence, base pair, pseudoknot and so on are explained.
RNA sequence: the order of the four bases on the side chain of an RNA molecule, generally represented by A, U, G and C. Base pair: if $s_i \cdot s_j \in \{AU, CG, GU\}$, then $s_i \cdot s_j$ form a base pair. The energy of base pair stacking is negative. Pseudoknot: if $s_i \cdot s_j \in \{AU, CG, GU\}$, $s_k \cdot s_l \in \{AU, CG, GU\}$ and $i<k<j<l$, then the base pairs $s_i \cdot s_j$ and $s_k \cdot s_l$ form a pseudoknot.
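These two definitions translate directly into predicates. In the minimal sketch below, the names `can_pair` and `forms_pseudoknot` are hypothetical, and the pair set {AU, CG, GU} is assumed to be unordered, so both orientations are accepted.

```python
# {AU, CG, GU} from the definition, accepted in either orientation (assumption).
PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def can_pair(x, y):
    """True if bases x and y can form a base pair."""
    return x + y in PAIRS


def forms_pseudoknot(pair_a, pair_b):
    """True if the two base pairs cross, i.e. i < k < j < l after sorting."""
    (i, j), (k, l) = sorted([pair_a, pair_b])
    return i < k < j < l
```

Nested pairs such as (1, 10) and (2, 9), and side-by-side pairs such as (1, 5) and (6, 10), do not satisfy the crossing condition, whereas (1, 10) and (5, 15) do.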
RNA primary structure: the order of the four bases on the side chain of the RNA sequence. The RNA sequence is conventionally written from the 5' end to the 3' end, so the whole sequence s is denoted $s = s_1 s_2 \dots s_n$, where $s_i$ is the i-th base of the RNA sequence and $s_i \in \{A, U, G, C\}$. An RNA base subsequence $s_{i,j}$ is a sequence fragment of s, written $s_{i,j} = s_i \dots s_j$.
RNA secondary structure: the set of base pairs in the RNA sequence constitutes the RNA folding structure, denoted S. For any base pairs, if $s_i \cdot s_j \in S$, $s_{i'} \cdot s_{j'} \in S$ and $i = i'$, then $j = j'$, i.e., one base cannot form base pairs with two or more bases at the same time. The base pairs and the free bases can form loop structures such as hairpin loops, stacks, internal loops, external loops and bulges. RNA tertiary structure: the structure formed by further folding and twisting of the RNA secondary structure according to the principles of folding dynamics.
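The constraint that one base takes part in at most one base pair can be checked mechanically. The following sketch assumes a structure is given as a list of 0-based index pairs (i, j) with i < j over a sequence of length n; the function name `is_valid_structure` is hypothetical.

```python
def is_valid_structure(pairs, n):
    """True if every base index in 0..n-1 occurs in at most one pair."""
    used = set()
    for i, j in pairs:
        if not (0 <= i < j < n):
            return False  # malformed or out-of-range pair
        if i in used or j in used:
            return False  # a base would pair with two partners
        used.update((i, j))
    return True
```

Note that this check deliberately allows crossing pairs, since pseudoknots are valid members of the folding structure S considered here.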
Referring to FIG. 1, the method of the invention for predicting a pseudoknot-containing RNA folding structure based on the extended structure comprises the following steps: inputting a ribonucleic acid base sequence; defining the pseudoknot and the extended structure; establishing a mathematical representation model of the ribonucleic acid pseudoknot structure that contains pseudoknots and is based on the extended structure; calculating the minimum energy of the model; and outputting the folding structure of the ribonucleic acid according to the minimum free energy principle.
FIG. 2 shows the continuous stacking process of the invention. A sequence $s = s_1 s_2 \dots s_n$ is input and bases are searched at random. If there exist i and j such that $s_i$ and $s_j$ pair, $j - i \ge 3$, and s contains more than three consecutive adjacent base pairs $s_i \cdot s_j$, $s_{i+1} \cdot s_{j-1}$, …, $s_k \cdot s_l$, then the region closed by the base pairs $s_i \cdot s_j$ and $s_k \cdot s_l$ is determined to be a stack, and all paired bases in the stack are marked. The search for paired bases continues among the free bases enclosed by the stack, and a continuous stack is determined if more than three base pairs exist. A pseudoknot is formed if more than two continuous stacks cross.
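The final step of this process, detecting pseudoknots as crossings between stacks, can be sketched as below. Each stack is assumed to be a tuple `(i, j, k, l)` whose outer pair is (i, j) and inner pair is (k, l), so its 5' arm covers i..k and its 3' arm covers l..j; this representation and the names `stacks_cross` and `find_pseudoknots` are assumptions made for the example.

```python
from itertools import combinations


def stacks_cross(a, b):
    """True if every pair of one stack crosses every pair of the other."""
    a, b = sorted([a, b])  # make a the stack that starts first
    ai, aj, ak, al = a
    bi, bj, bk, bl = b
    # Required arm order: a's 5' arm, b's 5' arm, a's 3' arm, b's 3' arm.
    return ak < bi and bk < al and aj < bl


def find_pseudoknots(stacks):
    """Return all pairs of stacks that cross each other."""
    return [(a, b) for a, b in combinations(stacks, 2) if stacks_cross(a, b)]
```

A nested stack such as `(8, 16, 10, 14)` inside `(0, 20, 2, 18)` fails the last comparison and is therefore not reported as a pseudoknot.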
FIG. 3 shows the processing units of the prediction device that follows the extended-structure processing flow of the invention: a ribonucleic acid input unit, a data storage unit, a continuous stack search and determination unit, an extended structure search and determination unit, and an output unit for the pseudoknot-containing structure.
FIG. 4 shows schematic diagrams of the primary, secondary and tertiary structures corresponding to the RNA folding structure. In the folding process, the ribonucleic acid base sequence of the RNA can be regarded as the primary structure; the primary structure folds according to the base pairing rules into the secondary structure, which comprises internal loops, bulges, external loops, hairpin loops and so on; and the secondary structure can form the tertiary structure through further folding and twisting.
Definition 1: In an RNA base sequence $s_{i,j}$, if (i, j), (i+1, j-1), …, (k, l) are all base pairs and there is no cross pairing, $i<k<l<j$, then the structure closed by $(i,j)$ and $(k,l) \in S$ is called a stack, denoted $T_1[i,j]$. If a stack $T_1[i,j]$ is closed by $(i,j)$ and $(r,s) \in S$, a stack $T_1[r',s']$ is closed by $(r',s')$ and $(k,l) \in S$, there is no cross pairing among the bases, $i<r<r'<k<l<s'<s<j$, and $v = r' - r + s - s' > 2$, then the RNA folding structure closed by $(i,j)$ and $(k,l) \in S$ is called a 2-order continuous stack, denoted $T_2[i,j]$.
Similarly, if $T_1[i,j]$ is closed by $(i,j)$ and $(r,s) \in S$, a (k-1)-order continuous stack is closed by $(r',s')$ and $(k,l) \in S$, $i<r<r'<k<l<s'<s<j$, $v = r' - r + s - s' > 2$, and there is no cross pairing among the bases, then the structure closed by $(i,j)$ and $(k,l) \in S$ is called a k-order continuous stack, denoted $T_k[i,j]$. The energy of $T_k[i,j]$ is denoted $ET_k(i,j)$, and its length can be expressed as $LT_k(i,j) = k - i + 1$ or $RT_k(i,j) = j - l + 1$. $T_2[i,j]$ consists of two nested stacks and the unpaired bases inside them. Let $E_2(r,r':s',s)$ denote the energy of the continuous stack structure of the base pairs (r, s) and (r', s'), $ET_1(i,j)$ the energy of the stack closed by the base pair (i, j), and $ET_1(r',s')$ the energy of the stack closed by the base pair (r', s'). Then $ET_2(i,j) = ET_1(i,j) + E_2(r,r':s',s) + ET_1(r',s') + a$, where a is a compensation parameter. Similarly, $ET_k(i,j) = ET_1(i,j) + E_2(r,r':s',s) + ET_{k-1}(r',s') + b$, where b is a compensation parameter.
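The recursion for $ET_k$ can be written out as a short function. The sketch below assumes the stacks are listed from the outermost to the innermost, that `stack_energies[t]` holds the $ET_1$ value of the t-th stack and `junction_energies[t]` holds the $E_2$ term between stack t and stack t+1; the function name and these argument conventions are hypothetical, and the numbers in the usage note are made up purely for illustration.

```python
def continuous_stack_energy(stack_energies, junction_energies, a, b):
    """Energy ET_k of a k-order continuous stack, outermost stack first.

    ET_1 is the single stack energy; ET_2 adds compensation a;
    ET_k for k >= 3 adds compensation b at each further level.
    """
    k = len(stack_energies)
    if k == 0 or len(junction_energies) != k - 1:
        raise ValueError("need k stack energies and k-1 junction energies")
    if k == 1:
        return stack_energies[0]
    # ET_{k-1} of the inner part, closed by (r', s').
    inner = continuous_stack_energy(stack_energies[1:], junction_energies[1:], a, b)
    compensation = a if k == 2 else b
    return stack_energies[0] + junction_energies[0] + inner + compensation
```

With illustrative values, `continuous_stack_energy([-3.0, -2.0], [1.0], 0.5, 0.25)` evaluates $ET_2$ as -3.0 + 1.0 - 2.0 + 0.5 = -3.5.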
Let $LT(i,j) \in \{LT_1(i,j), LT_2(i,j)\}$ and $ET(i,j) \in \{ET_1(i,j), ET_2(i,j)\}$. In the method of the invention, the free energies and lengths of the continuous stacks are pre-processed in $O(n^3)$ time and stored in the triangular matrices ES(i, j) and LS(i, j), respectively.
Likewise, it follows from the formula for $ET_k(i,j)$ that the time complexity of calculating continuous stacks is $O(n^3)$ and the space complexity is $O(n^2)$. The calculation of continuous stacks can be implemented by a dynamic programming algorithm.
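As a simplified illustration of such a pre-processing table, the sketch below fills a triangular length matrix for gap-free (1-order) stacks only: the entry for (i, j) is the number of consecutive pairs (i, j), (i+1, j-1), … starting at (i, j). It does not handle the higher-order stacks or energies of the method; the function name `stack_length_table`, the 0-based indexing and the `min_gap` parameter are assumptions.

```python
PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def stack_length_table(seq, min_gap=3):
    """ls[i][j] = length of the gap-free stack whose outer pair is (i, j)."""
    n = len(seq)
    ls = [[0] * n for _ in range(n)]  # only the upper triangle is used
    for span in range(min_gap, n):  # shorter spans first, so inner entries exist
        for i in range(n - span):
            j = i + span
            if seq[i] + seq[j] in PAIRS:
                # Reuse the already computed inner entry when it can hold a pair.
                inner = ls[i + 1][j - 1] if span - 2 >= min_gap else 0
                ls[i][j] = 1 + inner
    return ls
```

Each entry depends only on the entry one step further in, which is why the table can be filled in order of increasing span.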
Definition 2: An extended structure consists of two RNA base sequence segments $s_{i,k}$ and $s_{l,j}$, $i<k<l<j$, that satisfy the following conditions. There exist p and q, $i<p<q<k$, such that $s_{p,q}$ and $s_{l,j}$ form a continuous stack and there is no pairing between bases inside segment $s_{i,k}$, i.e.: for m and n with $m<n$, if (m, n) is a base pair, then $m<i$ or $n>k$, or $m<i$ and $n<i$, or $k>m$ and $k>n$; then the segments $s_{i,k}$ and $s_{l,j}$ form an extended structure. Let $P[i,k:l,j]$ denote the optimal extended structure, let $EP(i,k:l,j)$ denote the energy of $P[i,k:l,j]$, and let $LP(i,k:l,j) = j - l + 1$ or $LP(i,k:l,j) = k - i + 1$ denote the length of $P[i,k:l,j]$. Once the extended structure $P[i,k:l,j]$ is determined, $LP(i,k:l,j)$ is also uniquely determined. Using $LP(i,k:l,j)$, $P[i,k:l,j]$ can be stored in $O(n^3)$ space.
FIG. 5 shows an example of calculating, from the base pairing rules and stacking parameters, the energy of a given 12-base RNA sequence during RNA folding.
In the RNA folding structure, the procedure for calculating k-order continuous stacks and extended structures is as follows:
Calculation of the energy and length of k-order continuous stacks and extended structures in an RNA structure
// Note: (i, j) denotes the base pair formed by the RNA bases s_i and s_j; g denotes the compensation coefficient for k-order continuous stacking in the RNA folding structure; P' denotes the offset for one base pair of a pseudoknot in the extended structure; Q' denotes the penalty for one unpaired base of a pseudoknot in the extended structure. //
Algorithm(S,k)
1. [Figure BDA0002048748160000051: this step is an image and is not reproduced in the text]
2. [Figure BDA0002048748160000061: this step is an image and is not reproduced in the text]
3. For r = 4 to n
4. For i = 1 to n-r
5. j ← i+r+2;
6. If (i,j) & (i+1,j-1)
7. { LS1(i,j) = 1; k ← i; l ← j;
// Improved calculation of the energy and length of k-order continuous stacks and extended structures
8. ES1(i,j) ← ES1(i,j) + g*E1(k,k+1:l-1,l) + g*E2(k,k+1:l-1,l)
   + g*E2(k+1,k+2:l-2,l-1);
9. While (k,l) & (k+1,l-1) & (k+2,l-2) ((l-k)>4)
10. ESi(i,j) ← ESi(i,j) + g*Ei+1(k,k+1:l-1,l) + g*Ei+1(k+1,k+2:l-2,l-1);
11. LSi(i,j)++; k++; l--;
Loop
12. ESi(i,j) ← ESi(i,j) + P';
// Improved calculation of the energy and length of k-order continuous stacks and extended structures in the RNA structure
13. If (k = i+2 & l = j-2)
14. While k = i to i+U+1
15. for l = j-U-1+k-i to j
16. If (k,l)
17. V ← ESi(i,j) + g*Ei+1(i,k:l,j) + ESi(k,l) + (k-i+j-l-2)*Q' - 1;
18. W ← g*Ei+1(i,k:l,j) + ESi(k,l) + (k-i+j-l-2)*Q' + 2
19. If (V < ESi+1(i,j) & W < ESi+1(i,j))
20. ESi+1(i,j) ← V;
21. LSi+1(i,j) ← LSi(i,j) + LSi(k,l);
Loop
22. End while
The RNA folding structure can be decomposed into one extended structure and one subsequence, or into two crossing extended structures and one subsequence. An extended structure can be decomposed into k-order continuous stacks and multi-branch loops, so pseudoknots can be represented recursively. An extended structure may itself contain a pseudoknot, and the crossing of two extended structures may in turn form a pseudoknot structure, so the extended prediction method can cover crossing pseudoknots.
The extended structure and the k-order continuous stacking model are introduced; the extended structure is calculated using k-order continuous stacks, nested and non-nested pseudoknot structures are calculated using the crossing of extended structures, and a new mathematical representation model of the RNA folding structure is established. On the basis of this new model of the pseudoknot-containing folding structure, a dynamic programming algorithm is designed and implemented to predict RNA folding structures containing arbitrary planar and non-planar pseudoknots.
The classical PknotsRG algorithm cannot predict crossing pseudoknots, and the prediction method and device based on the semi-extended structure suffer from defects in the pseudoknot structure representation model, defects in the free energy parameters, a lack of optimization, and so on. The invention can use the extended PknotsRG algorithm to predict arbitrary planar and non-planar pseudoknots. The calculation of a pseudoknot structure formed by one extended structure and one subsequence, or by two extended structures and one subsequence, is added to the MFOLD calculation model to form the pseudoknot calculation model; graphical representations of the basic model are given in FIGS. 2 and 3.
FIG. 6 is a partial schematic diagram of the improved W(i, j) and V(i, j), based on the extended structure and the minimum free energy principle during RNA folding; the definitions and calculation process are as follows.
FIG. 7 is a partial schematic diagram of the pseudoknot-containing RNA folding structure based on the extended structure, which may comprise at least 8 cases. Screening and optimization over the various cases can be included in the calculation process.
Given a sequence s ═ s1s2…snSequence fragment si,j=si…sjI is more than 1 and less than j and less than n. Let W (i, j) be at siAnd sjIn the case where the base pair (i, j) is not formed, the subsequence si,jThe corresponding RNA fold structure S containing the pseudoknot. Let V (i, j) be siAnd sjWhen the base pair (i, j) is formed, the subsequence si,jThe corresponding RNA fold structure S containing the pseudoknot.
The calculation formulas of W (i, j) and V (i, j) in the mathematical model are given below.
$V(i, j)$ is calculated from the following three cases: $S$ is a stack, $S$ is a 2-order continuous stack, or $S$ is a k-order continuous stack ($k \ge 3$), $i < k < j$.
Let $E_k(i, j)$ be the minimum energy of the k-order continuous stack closed by the base pair $(i, j) \in S$. If $(i, j), (k, l) \in S$ and $1 \le i < k < l < j \le n$, then $(i, k : l, j)$ is the 2-order continuous stack closed by the base pairs $(i, j)$ and $(k, l)$, with energy $E_2(i, k : l, j)$.
In the improved parameters and free energy calculation for the extended structure, the invention expresses the energy as a function of the number $u$ of unpaired bases and the number $k$ of base pairs: $E_k = B + kM + uP$, where $B$ represents the offset value for constituting one extended structure, $M$ represents the offset value of each base pair in the extended structure, and $P$ represents the offset value of each unpaired base in the extended structure.
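As a minimal sketch of the linear energy form described above (one offset B per extended structure, M per base pair, P per unpaired base), the following Python function evaluates it. The function name and the numeric parameter values are hypothetical and chosen only for illustration; they are not the energy parameters of the invention.

```python
def extended_structure_energy(k, u, B, M, P):
    """Energy of one extended structure with k base pairs and u unpaired bases.

    B: offset for forming one extended structure
    M: offset per base pair
    P: offset per unpaired base
    """
    if k < 0 or u < 0:
        raise ValueError("k and u must be non-negative counts")
    return B + k * M + u * P


# Illustrative values only (hypothetical, not taken from the patent).
energy = extended_structure_energy(k=3, u=2, B=1.0, M=-2.0, P=0.5)
print(energy)
```

With such a form, each additional base pair or unpaired base changes the energy by a fixed amount, so the energy of an extended structure can be updated incrementally as the structure grows.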
1) $W(i, j) = V(i, j) = +\infty$, if $j - i < 4$
2) $V(i, j) = +\infty$, if bases $i$ and $j$ do not form a base pair
3) $W(i, i) = 0$, and base $i$ cannot pair with itself
4)
Figure BDA0002048748160000071
Using a dynamic programming algorithm, the calculation starts from subsequences of 3 nucleotides of the RNA base sequence:
Figure BDA0002048748160000072
the minimum free energy of all subsequences of 3 nucleotides is calculated first, and so on until $W(1, n)$ is calculated. For $j - i \ge d \ge 3$, $V(i, j)$ and $W(i, j)$ are calculated from the values for bases $i'$ and $j'$ with $j' - i' < d$.
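The bottom-up fill order described above can be sketched as follows. Because the recurrence in item 4) is available here only as an image, this sketch substitutes a deliberately simplified, pseudoknot-free recurrence with one fixed energy per base pair, in the style of the Nussinov algorithm. The pairing rules, the `pair_energy` value, the function name, and the use of W as the best energy over all structures of a subsequence (initialized to 0 for short spans rather than +∞) are all assumptions made for illustration, not the model of the invention.

```python
import math

# Assumed pairing rules (Watson-Crick plus G-U wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_SPAN = 4  # a pair (i, j) needs j - i >= 4, mirroring base case 1)


def fold_energy(seq, pair_energy=-1.0):
    """Toy minimum energy of seq, filled by increasing span d = j - i."""
    n = len(seq)
    if n == 0:
        return 0.0
    # V[i][j]: best energy of seq[i..j] when i and j are paired (inf if impossible).
    V = [[math.inf] * n for _ in range(n)]
    # W[i][j]: best energy of seq[i..j] over all structures (0 = fully unpaired).
    W = [[0.0] * n for _ in range(n)]
    for d in range(MIN_SPAN, n):          # shorter spans are already final
        for i in range(n - d):
            j = i + d
            if (seq[i], seq[j]) in PAIRS:
                V[i][j] = pair_energy + W[i + 1][j - 1]
            best = min(V[i][j], W[i + 1][j], W[i][j - 1])
            for k in range(i + 1, j):     # split into two independent parts
                best = min(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W[0][n - 1]


print(fold_energy("GGGAAAACCC"))
```

Each cell with span d reads only cells with a smaller span, which is why the tables can be filled in a single pass from short subsequences up to the full sequence. The triple loop gives cubic time and the two tables give quadratic space for this simplified recurrence.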
The method of the invention was compared experimentally with the PKNOTS algorithm and the semi-extended structure method; it was implemented in VC++. On this basis, the energy parameters were optimized and all sequences of the PseudoBase database and the Rfam database were calculated. The Pknots algorithm and the LP algorithm can only predict some planar pseudoknots, and the PKNOTS algorithm is currently the best algorithm for predicting arbitrary planar pseudoknots and some non-planar pseudoknots. Therefore, the test results of the method of the invention are mainly compared with the PKNOTS algorithm and the semi-extended structure method. First, the test sets of the PKNOTS algorithm and the semi-extended structure method were calculated. The energy parameters used are the same as those of the PKNOTS algorithm and the semi-extended structure method, but the extended structure is defined precisely and coaxial stacking based on the extended structure is introduced, including pseudoknot-containing coaxial stacking based on the extended structure. This favors the formation of an accurate and complete RNA folding structure, including continuous stacks, loop structures and pseudoknot structures. The defects of the semi-extended structure, such as the low parameter precision of its pseudoknot representation model, inaccurate free energy values and large errors in the calculation method, are overcome and the calculation method is improved. The calculation results are as follows.
Note: according to common knowledge in the art, the computing time of a computer generally depends jointly on the CPU clock frequency, the motherboard architecture and the memory size. With the same computer configuration, an improvement in computation time is obtained mainly through an improvement of the computation method (algorithm).
TABLE 1 comparison of computation times for the method of the present invention with the semi-extended structure algorithm, PKNOTS algorithm
Figure BDA0002048748160000081
TABLE 2 comparison of different results of the method of the invention with the PKNOTS algorithm
TABLE 3 comparison of different results of the semi-extended method disclosed in CN104765983A with the PKNOTS algorithm
Figure BDA0002048748160000092
TABLE 4 comparison of the different results of the process of the present invention with the semi-extensive process disclosed in CN104765983A
Figure BDA0002048748160000093
Figure BDA0002048748160000101
Table 1 compares the computation time of the method of the invention with that of the PKNOTS algorithm. The method of the invention was tested on a PC with 4MB memory, while the PKNOTS algorithm was tested on a Silicon Graphics Origin 200 high-performance computer with 4GB memory. As can be seen from Table 1, to calculate an RNA sequence 75 bases in length the method of the invention takes 21 seconds, while the PKNOTS algorithm takes 20 minutes. For an RNA sequence 105 bases in length, the method of the invention takes 97 seconds, while the PKNOTS algorithm takes 235 minutes. For an RNA sequence 200 bases in length, the method of the invention takes 26 minutes, whereas the PKNOTS algorithm cannot complete the calculation. In fact, the method of the invention can successfully predict the folding structure of RNA sequences with a length of 1000 bases or more.
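The timings quoted above can be turned into speedup factors with a few lines of Python. This is only a worked calculation on the figures stated in the text; the names `TIMINGS_SECONDS` and `speedup` are hypothetical, and the ratios are raw wall-clock ratios that do not correct for the different machines used in the two tests.

```python
# Sequence length in bases -> (method of the invention, PKNOTS), both in seconds.
TIMINGS_SECONDS = {
    75: (21, 20 * 60),
    105: (97, 235 * 60),
}


def speedup(length):
    """Ratio of PKNOTS time to the time of the method of the invention."""
    ours, pknots = TIMINGS_SECONDS[length]
    return pknots / ours


for length in sorted(TIMINGS_SECONDS):
    print(f"{length} bases: about {speedup(length):.1f}x faster")
```

The 200-base case is left out because no PKNOTS time is reported for it.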
Since the method of the invention is based on extended structure calculations and introduces more coaxial stacks, in particular pseudoknot coaxial stacks, the method of the invention facilitates the formation of a complete stem region and a correct pseudoknot structure.
Under the same energy parameters, Table 2 compares the results of the method of the invention with those of the PKNOTS algorithm in terms of sensitivity and specificity for 15 different sequences: the average sensitivity of the method of the invention is 98.2%, better than the 71.7% of the PKNOTS algorithm; the average specificity of the method of the invention is 97.5%, better than the 70.6% of the PKNOTS algorithm, a marked improvement. Table 3 compares the results of the semi-extended method with those of the PKNOTS algorithm in terms of sensitivity and specificity for 15 different RNA sequences: the average sensitivity of the semi-extended method is 88.1%, better than the 71.7% of the PKNOTS algorithm; the average specificity of the semi-extended method is 86.3%, better than the 70.6% of the PKNOTS algorithm.
Table 4 compares the results of the method of the invention with those of the semi-extended method in terms of sensitivity and specificity for 15 different RNA sequences: the average sensitivity of the method of the invention is 98.2%, better than the 88.1% of the semi-extended method; the average specificity of the method of the invention is 97.5%, better than the 86.3% of the semi-extended method, a clear improvement.
Therefore, test results show that the search speed, the average sensitivity and the average specificity of the method are obviously better than those of the semi-extended method and the PKNOTS algorithm in the prior art.
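The text does not state how sensitivity and specificity are computed. A minimal sketch, assuming the convention common in RNA secondary structure evaluation, is given below: sensitivity is the fraction of reference base pairs that are predicted, and specificity is the fraction of predicted base pairs that are in the reference. The function name and the example base pairs are hypothetical.

```python
def sensitivity_specificity(reference, predicted):
    """Compare two collections of base pairs given as (i, j) tuples."""
    reference, predicted = set(reference), set(predicted)
    tp = len(reference & predicted)   # correctly predicted pairs
    fn = len(reference - predicted)   # reference pairs that were missed
    fp = len(predicted - reference)   # predicted pairs not in the reference
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


# Hypothetical example: three of four reference pairs recovered, one wrong pair.
reference = [(1, 20), (2, 19), (3, 18), (8, 14)]
predicted = [(1, 20), (2, 19), (3, 18), (9, 13)]
print(sensitivity_specificity(reference, predicted))
```

Averaging these two values over a set of test sequences would give figures of the kind reported in Tables 2 to 4, under the stated assumption about the definitions.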
Test results on the PseudoBase international RNA database.
PseudoBase is an RNA pseudoknot database. The method was tested on all 245 sequences of the PseudoBase database and on some sequences of the Rfam 14.1 database; 381 sequences were predicted to contain pseudoknots, and the pseudoknots were predicted correctly for 357 of them, an accuracy of 93.7%.
An extended structure and k-order continuous stacking are introduced to establish a new mathematical representation model of RNA pseudoknots. Based on this model, a method with time complexity $O(n^3)$ and space complexity $O(n^2)$ is provided to predict RNA folding structures containing arbitrary planar pseudoknots and more complex non-planar pseudoknots.
The PKNOTS algorithm uses $O(n^6)$ time and $O(n^4)$ space to calculate folding structures containing planar pseudoknots and some non-planar pseudoknots, where a calculated pseudoknot is represented by no more than two gap structures. The method of the invention has a time complexity of $O(n^3)$ and a space complexity of $O(n^3)$, so the time and space complexity of calculating pseudoknots is markedly improved over the PKNOTS algorithm. Pseudoknots can be represented by no more than two extended structures, and the length of the calculated RNA sequence can exceed 1600 bases. The test results show that the method of the invention has better search speed, accuracy, sensitivity and specificity than the semi-extended method and the PKNOTS algorithm. Therefore, the method of the invention is more effective in predicting planar and non-planar pseudoknots than the semi-extended method and the PKNOTS algorithm.
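To give a feel for the complexity orders quoted above, the sketch below evaluates the leading growth terms for the sequence lengths used in the timing comparison. It ignores constant factors and lower-order terms, so the numbers are orders of magnitude only and not predictions of running time or memory use; the function name is hypothetical.

```python
def growth_terms(n):
    """Leading growth terms for a sequence of n bases (constants ignored)."""
    return {"n^2": n**2, "n^3": n**3, "n^4": n**4, "n^6": n**6}


for n in (75, 105, 200):
    terms = growth_terms(n)
    ratio = terms["n^6"] // terms["n^3"]  # how much faster n^6 grows than n^3
    print(n, terms, "n^6 / n^3 =", ratio)
```

For 200 bases the fourth-power term already reaches 1.6 billion table cells, which is consistent with a fourth-power space requirement becoming impractical at lengths where cubic or quadratic tables remain small.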
The method of the invention can calculate nested pseudoknot and crossing pseudoknot structures in RNA folding that are formed from substructures such as stacks, hairpin loops, internal loops, bulges and multi-branch loops.
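The difference between nested and crossing base pairs can be illustrated with a small check. This is a minimal sketch, assuming base pairs are given as (i, j) tuples with i < j; two pairs (i, j) and (k, l) cross when i < k < j < l, and are nested or disjoint otherwise. The function names are hypothetical, and the check only detects crossings; it does not predict structures.

```python
def pairs_cross(p, q):
    """True if base pairs p and q interleave, i.e. i < k < j < l."""
    (i, j), (k, l) = sorted((p, q))  # order the two pairs by their 5' position
    return i < k < j < l


def has_pseudoknot(pairs):
    """True if any two base pairs in the structure cross each other."""
    pairs = sorted(pairs)
    return any(
        pairs_cross(p, q)
        for index, p in enumerate(pairs)
        for q in pairs[index + 1:]
    )


print(has_pseudoknot([(1, 10), (2, 9), (12, 20)]))  # nested and disjoint pairs
print(has_pseudoknot([(1, 10), (2, 9), (5, 15)]))   # (2, 9) and (5, 15) cross
```

The pairwise check is quadratic in the number of base pairs, which is adequate for inspecting a single predicted structure.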
Although the embodiments of the present invention have been described above, the above descriptions are only for the convenience of understanding the present invention, and are not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A method for predicting a pseudoknot-containing ribonucleic acid folding structure based on an extended structure, characterized in that the method comprises the following steps:
(1) inputting an arbitrary segment of a ribonucleic acid base sequence, defining a pseudoknot and defining an extended structure;
inputting a sequence $s = s_1 s_2 \dots s_N$ and searching for bases at random; if there exist $i$, $j$ such that $s_i$ and $s_j$ pair, $j - i \ge 3$, and there are more than three consecutive adjacent base pairs $s_i \cdot s_j$, $s_{i+1} \cdot s_{j-1}$, …, $s_k \cdot s_l$, then the interval closed by the base pairs $s_i \cdot s_j$ and $s_k \cdot s_l$ is determined to be a continuous stack, and all paired bases in the continuous stack are marked; the search for pairable bases continues among the free bases closed by the continuous stack, and a continuous stack is determined if more than three base pairs exist; a pseudoknot is formed if more than two continuous stacks cross; after the continuous stacks are determined, a continuous stack together with two base sequences containing free bases is determined to be an extended structure; a pseudoknot is formed by the crossing pairing of two base pairs; a pseudoknot structure is formed by the crossing pairing of more than two continuous stacks or extended structures;
(2) establishing a ribonucleic acid pseudoknot structure characteristic model and a mathematical model containing pseudoknots and an extended structure;
(3) calculating the minimum base free energy of the characteristic model;
(4) according to the calculation result and the minimum free energy principle, outputting the ribonucleic acid folding structure containing the pseudoknot.
2. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 1, wherein: an extended structure consists of two ribonucleic acid sequence segments $s_{i,k}$ and $s_{l,j}$, $i < k < l < j$.
3. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 2, wherein:
if, for the two ribonucleic acid sequence segments $s_{i,k}$ and $s_{l,j}$, there exist $p$ and $q$, $i < p < q < k$, such that $s_{p,q}$ and $s_{l,j}$ constitute a continuous stack and there is no pairing between the internal bases of segment $s_{i,k}$, i.e.: if there exist $m$ and $n$ with $m < n$ and $(m, n)$ is a base pair, then $m < i$ or $n > k$, or $m < i$ and $n < i$, or $k > m$ and $k > n$; then the segments $s_{i,k}$ and $s_{l,j}$ form an extended structure, and $P[i, k : l, j]$ denotes its optimal extended structure;
or if, for the two ribonucleic acid sequence segments $s_{i,k}$ and $s_{l,j}$, there exist $r$ and $s$, $l < r < s < j$, such that $s_{r,s}$ and $s_{i,k}$ constitute a continuous stack and there is no pairing between the internal bases of segment $s_{l,j}$, i.e.: if there exist $m$ and $n$ with $m < n$ and $(m, n)$ is a base pair, then $m < l$ or $n > j$, or $m < l$ and $n < l$, or $k > j$ and $k > j$; then the segments $s_{i,k}$ and $s_{l,j}$ form an extended structure, and $P[i, k : l, j]$ denotes its optimal extended structure.
4. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 1, wherein: $W(i, j)$ is the minimum free energy of the pseudoknot-containing, extended-structure-based RNA folding structure $S$ corresponding to the subsequence $s_{i,j}$ when the two extended-structure bases $s_i$ and $s_j$ do not form the base pair $(i, j)$, and the cases for calculating $W(i, j)$ include: (1) in the extended structure, $s_i$ and $s_j$ do not participate in forming a stack, $s_i$ and $s_j$ are unpaired bases, and $s_i$ and $s_j$ do not form the base pair $(i, j)$ and lie in the RNA folding structures corresponding to the different subsequences $s_{i,k}$ and $s_{k+1,j}$, $i < k < j$; (2) $s_i$ and $s_j$ do not form the base pair $(i, j)$, and $s_{i,j}$ consists of one extended structure and one subsequence, or of two extended structures, or of two extended structures and one subsequence.
5. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 1, wherein: $V(i, j)$ is the minimum free energy of the pseudoknot-containing, extended-structure-based RNA folding structure $S$ corresponding to the subsequence $s_{i,j}$ when the bases $s_i$ and $s_j$ form the base pair $(i, j)$, and the cases for calculating $V(i, j)$ include: $S$ is a continuous stack closed by the base pair $(i, j)$ in the extended structure; or $S$ is a pseudoknot-containing stack in the extended structure closed by the base pairs $(i, j)$ and $(k, l)$, $i < k < j$; or $S$ is a pseudoknot-containing closed stack in the extended structure with $i < k < j$, $k < r < l$, and so on.
6. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 1, wherein: the cases for calculating an extended structure containing a pseudoknot include: (1) one extended structure is composed of another extended structure and one or several unpaired bases; (2) one extended structure is composed of another extended structure and a subsequence containing base pairs; (3) one extended structure is formed by two other extended structures; (4) two extended structures cross to form a pseudoknot structure.
7. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 1, wherein: the minimum free energies of $W(i, j)$, $V(i, j)$ and the extended structure containing pseudoknots are calculated using a dynamic programming algorithm.
8. A device for predicting a pseudoknot-containing ribonucleic acid folding structure based on an extended structure, comprising:
an input unit: inputting a ribonucleic acid base sequence;
an initialization unit: defining a pseudoknot and defining an extended structure;
a storage unit: storing the established pseudoknot model and the ribonucleic acid folding structure characteristic model of the extended structure, and storing the corresponding parameters, data structures and calculation formulas of the minimum free energy;
a calculation unit: calculating a free energy value and a probability value;
an output unit: outputting the pseudoknot-containing ribonucleic acid base sequence folding structure based on the extended structure according to the minimum free energy principle and the statistical probability of occurrence.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910367639.6A CN110111838B (en) 2019-05-05 2019-05-05 Method and device for predicting RNA folding structure containing false knot based on expansion structure

Publications (2)

Publication Number Publication Date
CN110111838A CN110111838A (en) 2019-08-09
CN110111838B true CN110111838B (en) 2020-02-25

Family

ID=67488095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910367639.6A Active CN110111838B (en) 2019-05-05 2019-05-05 Method and device for predicting RNA folding structure containing false knot based on expansion structure

Country Status (1)

Country Link
CN (1) CN110111838B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110648719B (en) * 2019-09-23 2021-03-05 吉林大学 Local structure gastric cancer drug-resistant lncRNA secondary structure prediction method based on energy and probability
CN114093420B (en) * 2022-01-11 2022-05-27 山东建筑大学 XGboost-based DNA recombination site prediction method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908102A (en) * 2010-08-13 2010-12-08 山东建筑大学 Ribosomal stalk based predicting method and device of RNA (Ribonucleic Acid) secondary structure
CN104765983A (en) * 2015-04-23 2015-07-08 山东建筑大学 Predicting method and device of ribonucleic pseudoknot structure based on half-extension structure
CN109599146A (en) * 2018-11-08 2019-04-09 武汉科技大学 A kind of band false knot nucleic acid Structure Prediction Methods based on multi-objective genetic algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100614509B1 (en) * 2002-01-29 2006-08-23 학교법인 인하학원 Visualization Method of RNA Pseudoknot Structures

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Algorithm and Scheme in RNA Structure Prediction including Pseudoknots";Zhendong Liu et.al.;《2018 14th International Conference on Computational Intelligence and Security》;20181231;全文 *
"Approximation Algorithm of the RNA Pseudoknotted Structure Prediction Baesed on MFE";Zhendong Liu et.al.;《Proceeding of the IEEE International Conference on Information and Automation》;20131231;全文 *

Also Published As

Publication number Publication date
CN110111838A (en) 2019-08-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant