CN110111838B

CN110111838B - Method and device for predicting pseudoknot-containing RNA folding structure based on extended structure

Info

Publication number: CN110111838B
Application number: CN201910367639.6A
Authority: CN
Inventors: 刘振栋; 刘芳含; 李跃军; 李恒斐; 郝凡昌; 徐俊丽; 杨朝晖; 勾红领; 王继伟; 杨玉荣; 侯铁; 李恒武
Original assignee: Individual
Current assignee: Individual
Priority date: 2019-05-05
Filing date: 2019-05-05
Publication date: 2020-02-25
Anticipated expiration: 2039-05-05
Also published as: CN110111838A

Abstract

The invention provides a method and a device for predicting a pseudoknot-containing ribonucleic acid (RNA) folding structure based on an extended structure. The method comprises the following steps: inputting an arbitrary ribonucleic acid base sequence; defining the pseudoknot and the extended structure; establishing a feature model of the RNA pseudoknot structure and a mathematical model containing the pseudoknot and the extended structure; calculating the minimum free energy of the feature model; and outputting the RNA folding structure containing the pseudoknot. The device comprises an input unit, an initialization unit, a storage unit, a calculation unit and an output unit. The invention performs the calculation on the basis of the extended structure and introduces continuous stacking and coaxial stacking based on the extended structure, which helps form a complete and accurate RNA folding structure comprising continuous stacks, extended structures, loop structures and pseudoknot structures. Its search speed, accuracy, sensitivity and specificity are markedly better than those of the prior art, and it is more effective than the prior art in predicting planar and non-planar pseudoknot structures.

Description

Method and device for predicting pseudoknot-containing RNA folding structure based on extended structure

Technical Field

The invention relates to a method for predicting the pseudoknot structure and extended structure of ribonucleic acid (RNA), in particular to a method and a device for predicting a pseudoknot-containing RNA folding structure based on the extended structure, and belongs to the field of bioinformatics engineering.

Background

Ribonucleic acid (RNA) is a single strand formed by transcription, using one strand of DNA as a template and following the base complementary pairing principle. It is a carrier of genetic information present in biological cells and in some viruses and viroids. RNA is a long-chain molecule formed by the condensation of ribonucleotides via phosphodiester bonds. One ribonucleotide molecule consists of a phosphate, a ribose and a base. RNA has 4 kinds of bases: adenine (A), guanine (G), cytosine (C) and uracil (U). Its main function is to realize the expression of genetic information as protein, serving as a bridge in the transformation of genetic information into phenotype.

RNA is one of the three most important types of biological macromolecules in biological systems; it performs multiple functions in organisms and is the template for protein synthesis. RNA folding structure prediction is used for protein function analysis and is the basis for predicting RNA tertiary structure. The pseudoknot is the most widespread structural unit in RNA and is a very complex and stable RNA structure. Pseudoknots have structural, catalytic and regulatory functions in RNA molecules, and the pseudoknot structure is a key topic of current research on RNA structure prediction.

Two main approaches are used to predict RNA folding structure. The earlier approach is comparative sequence analysis, that is, comparison of primary structures that serve the same biological function in different organisms. Its difficulties are that homologous sequences are not readily available for many RNA molecules and that it requires a great deal of manual work and is inefficient, so the minimum free energy method is mainly used at present. The theoretical basis of the minimum free energy algorithm is that the free energy of the stable folded structure is minimal. The PKNOTS algorithm, which is based on the minimum free energy principle, computes arbitrary planar pseudoknots and some non-planar pseudoknots in O(n^6) time and O(n^4) space. The PKNOTS algorithm can only handle RNA sequences shorter than 140 bases and cannot meet the need to predict the structure of longer sequences. The PknotsRG algorithm computes simple nested pseudoknots formed by two stem regions, in which any two pseudoknots are either side by side or nested. In fact, pseudoknots consisting of internal loops and bulges are ubiquitous in RNA, and crossing pseudoknots also play an important role, so neither can be ignored. Planar pseudoknots are the most widespread class of pseudoknots, including those consisting of internal loops and bulges as well as crossing ones. Only one of the sequences in the PseudoBase database folds into a non-planar pseudoknot; the remaining sequences fold into planar pseudoknots. We therefore mainly consider the calculation of arbitrary planar pseudoknots.

Zuker first applied a dynamic programming algorithm to the nearest-neighbor model and proposed the MFOLD algorithm. After more than two decades of continuous improvement and development, MFOLD is now one of the most widely used RNA folding structure prediction methods in the world. For an RNA sequence of n nucleotides, MFOLD predicts the optimal folding structure in O(n^3) time and O(n^2) space. At present, for RNA sequences shorter than 700 nucleotides, MFOLD correctly predicts about 73% of base pairs; its accuracy is lower for longer RNA sequences and for some classes of molecules. In addition, owing to the limitations of the algorithm, MFOLD cannot predict pseudoknots or more complex tertiary interactions, which greatly limits its application.

Chinese patent document CN103235902A discloses a method for predicting RNA structure containing pseudoknots, comprising: determining all structural units, including pseudoknots, in the RNA sequence to be predicted and placing all structural units known to be present in a structural unit pool S_0 = {s_1, s_2, s_3, …, s_n}, where n is the total number of structural units and s_n represents the n-th structural unit; determining U = {U_1, U_2, …, U_r, …, U_R} by iteration over all structural units in the RNA sequence to be predicted, where U_r represents the lower-energy RNA structure obtained in the r-th iteration and R is the total number of iterations; determining, for each U_r, the similarity value between each element of U_r and the actual RNA structure from the sum of the free energy of each element and the frequency of occurrence of each element in all RNA structures; and predicting the elements of U with high similarity values as the RNA structure of the RNA sequence to be predicted.

CN104298894A discloses a method and a device for predicting an RNA pseudoknot structure based on k-stems, comprising the following steps: inputting a ribonucleic acid base sequence; defining pseudoknots and k-stems (k ≥ 1); searching the RNA bases and k-stems from left to right, and determining and marking all the k-stems found; searching for pseudoknots according to the property that crossing k-stems form a pseudoknot; calculating the minimum free energy of the ribonucleic acid pseudoknot structure comprising the k-stems; and outputting the pseudoknot structure of the ribonucleic acid.

CN104765983A discloses a method and an apparatus for predicting an RNA pseudoknot structure based on a semi-extended structure, comprising the following steps: inputting a ribonucleic acid base sequence; defining a semi-extended structure; establishing a representation model of the ribonucleic acid pseudoknot structure containing k-stems and the semi-extended structure, together with the corresponding formula for the minimum energy; and outputting the pseudoknot structure of the ribonucleic acid base sequence according to the minimum energy principle.

Although the above methods are more effective than the PKNOTS algorithm in predicting pseudoknot structures, their pseudoknot representation models suffer from defects such as low parameter precision, inaccurate free energy values and large errors in the calculation method, so their search speed, accuracy, sensitivity and specificity in pseudoknot structure prediction are not ideal and need further improvement.

It is therefore necessary to propose the concept of the extended structure and to specify the internal and external base pairing rules of the two RNA sequence segments s_{i,k} and s_{l,j} so that they are closer to the real structure, thereby overcoming the defects of low parameter precision, inaccurate free energy values and large calculation errors in predicting pseudoknot-containing RNA folding structures, and markedly improving search speed, accuracy, sensitivity and specificity.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a method for predicting a pseudoknot-containing ribonucleic acid folding structure based on an extended structure, which greatly reduces the time and space complexity of the prediction, gives a faster search and higher accuracy, and markedly improves sensitivity and specificity. A device for implementing the method is also provided.

The method of the invention for predicting a pseudoknot-containing ribonucleic acid folding structure based on the extended structure comprises the following steps:

(1) inputting an arbitrary ribonucleic acid base sequence, defining the pseudoknot and defining the extended structure;

inputting a sequence s = s_1 s_2 … s_n and searching its bases: if there exist i and j such that s_i pairs with s_j, j − i ≥ 3, and s contains three or more consecutive adjacent base pairs s_i·s_j, s_{i+1}·s_{j−1}, …, s_k·s_l, then the interval enclosed by the base pairs s_i·s_j and s_k·s_l is determined to be a continuous stack, and all paired bases in the stack are marked; the search for paired bases continues among the free bases enclosed by the continuous stack, and a further continuous stack is determined if three or more base pairs exist; a pseudoknot is formed if two or more continuous stacks cross; after the continuous stack is determined, the continuous stack together with two base sequences containing free bases is determined to be an extended structure; a pseudoknot is formed by the cross pairing of two base pairs; a pseudoknot structure is formed by the cross pairing of two or more continuous stacks or extended structures;

(2) establishing a ribonucleic acid pseudoknot structure characteristic model and a mathematical model containing pseudoknots and an extended structure;

(3) calculating the minimum base free energy of the characteristic model;

(4) outputting the ribonucleic acid folding structure containing the pseudoknot from the calculation result, according to the minimum free energy principle.

An extended structure consists of two ribonucleic acid sequence segments s_{i,k} and s_{l,j}, with i < k < l < j. The segments s_{i,k} and s_{l,j} may cross to form a pseudoknot structure.

An extended structure is formed in either of two cases. In the first, for the two ribonucleic acid sequence segments s_{i,k} and s_{l,j} there exist p and q, i < p < q < k, such that s_{p,q} and s_{l,j} constitute a continuous stack and there is no pairing between bases inside the segment s_{i,k}; that is, if (m, n) is a base pair with m < n, then m < i or n > k, or m < i and n < i, or m > k and n > k. The segments s_{i,k} and s_{l,j} then form an extended structure, and P[i, k: l, j] denotes the optimal extended structure. In the second, there exist r and s, l < r < s < j, such that s_{r,s} and s_{i,k} constitute a continuous stack and there is no pairing between bases inside the segment s_{l,j}; that is, if (m, n) is a base pair with m < n, then m < l or n > j, or m < l and n < l, or m > j and n > j. The segments s_{i,k} and s_{l,j} then form an extended structure, and P[i, k: l, j] denotes its optimal extended structure.

The representation model of the folding structure and the calculation method for continuous stacking are improved using the extended structure and the pseudoknot, and the calculation parameters are optimized. The values of the nearest-neighbor Watson-Crick free energy parameters, the pseudoknot energy parameters and the base-pair stacking parameters are optimized and improved.

W(i, j) is the minimum free energy of the pseudoknot-containing, extended-structure-based RNA folding structure S corresponding to the subsequence s_{i,j} when the bases s_i and s_j do not form the base pair (i, j). The cases for calculating W(i, j) include: (1) in the extended structure, s_i and s_j do not take part in forming a stack and are unpaired bases; s_i and s_j do not form the base pair (i, j) and lie in the RNA folding structures corresponding to the different subsequences s_{i,k} and s_{k+1,j}, i < k < j; (2) s_i and s_j do not form the base pair (i, j), and s_{i,j} consists of one extended structure and one subsequence, or of two extended structures, or of two extended structures and one subsequence.

V(i, j) is the minimum energy of the pseudoknot-containing, extended-structure-based RNA folding structure S corresponding to the subsequence s_{i,j} when the bases s_i and s_j form the base pair (i, j). The cases for calculating V(i, j) include: S is a continuous stack closed by the base pair (i, j) in the extended structure; or S is a stack containing a pseudoknot in the extended structure, closed by the base pairs (i, j) and (k, l), i < k < j; or S is a stack closed within a pseudoknot in the extended structure, k < j, k < r < l; and so on.

The cases for calculating an extended structure containing a pseudoknot include: (1) an extended structure is composed of another extended structure and one or several unpaired bases; (2) an extended structure is composed of another extended structure and a subsequence containing base pairs; (3) an extended structure is composed of two other extended structures; (4) two extended structures cross to form a pseudoknot structure.

The minimum free energies of W(i, j), V(i, j) and the extended structures containing pseudoknots are calculated using a dynamic programming algorithm.

The device for implementing the above method, which predicts the pseudoknot-containing ribonucleic acid folding structure based on the extended structure, comprises the following units:

an input unit: inputting a ribonucleic acid base sequence;

an initialization unit: defining the pseudoknot and defining the extended structure;

a storage unit: storing the established pseudoknot model and the ribonucleic acid folding structure characteristic model of the extended structure, and storing corresponding parameters, data structures and calculation formulas of minimum base free energy;

a calculation unit: calculating a free energy value and a probability value;

an output unit: outputting the pseudoknot-containing, extended-structure-based folding structure of the RNA base sequence according to the minimum free energy principle and the statistical probability of occurrence.

The invention puts forward the concept of the extended structure, performs the calculation on the basis of the extended structure, and defines the extended structure precisely: the internal and external base pairing rules of the two ribonucleic acid sequence segments s_{i,k} and s_{l,j} are specified so that they are closer to the real structure. That is, there exist p and q, i < p < q < k, such that s_{p,q} and s_{l,j} constitute a continuous stack and there is no pairing between bases inside the segment s_{i,k}: if (m, n) is a base pair with m < n, then m < i or n > k, or m < i and n < i, or m > k and n > k. The representation model of the folding structure and the calculation method for continuous stacking are improved using the extended structure and the pseudoknot, and the calculation parameters are optimized; in particular, the values of the nearest-neighbor free energy parameters and the base-pair stacking parameters are optimized and improved. The defects of the pseudoknot representation model in the semi-extended structure, such as low parameter precision, inaccurate free energy values and large errors in the calculation method, are overcome, so that the search speed, accuracy, sensitivity and specificity are markedly improved over the prior art: the prediction accuracy reaches 93.7%, the average sensitivity reaches 98.2%, and the average specificity reaches 97.5%. The invention is also more effective than the prior art in predicting planar and non-planar pseudoknots.

Drawings

FIG. 1 is a flow chart of the method of the invention for predicting a pseudoknot-containing RNA folding structure based on the extended structure;

FIG. 2 is a flow chart of the search for continuous stacks and extended structures according to the invention;

FIG. 3 is a flow chart of the processing units in the prediction device of the invention;

FIG. 4 is a schematic diagram of an example of an RNA folding structure;

FIG. 5 is an example of the improved energy parameters and calculation method for the RNA folding structure according to the invention;

FIG. 6 is a model representation of the minimum free energy of W(i, j) and V(i, j) after improvement and optimization for pseudoknot-containing RNA according to the invention;

FIG. 7 is a partial schematic representation of a pseudoknot-containing RNA extended structure of the invention.

Detailed Description

First, the concepts of RNA sequence, base pair, pseudoknot and so on are explained.

RNA sequence: the sequence of the four bases on the side chain of an RNA molecule, generally represented by A, U, G and C. Base pair: if s_i·s_j ∈ {AU, CG, GU}, then s_i and s_j form a base pair. The energy of base-pair stacking is negative. Pseudoknot: if s_i·s_j ∈ {AU, CG, GU}, s_k·s_l ∈ {AU, CG, GU} and i < k < j < l, then the base pairs s_i·s_j and s_k·s_l form a pseudoknot.
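
The pairing rule and the crossing condition i < k < j < l above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation; the names `PAIRABLE`, `can_pair` and `forms_pseudoknot` are hypothetical, and positions are assumed to be 1-based as in the text.

```python
# Pairs named in the text: AU, CG and GU, accepted in either orientation.
PAIRABLE = {frozenset("AU"), frozenset("CG"), frozenset("GU")}


def can_pair(a: str, b: str) -> bool:
    """Return True if bases a and b may form a base pair."""
    return frozenset((a, b)) in PAIRABLE


def forms_pseudoknot(pair1, pair2) -> bool:
    """Return True if two base pairs cross, i.e. i < k < j < l.

    Each pair is a tuple of two 1-based positions; the order of the
    two pairs and of the positions inside a pair does not matter.
    """
    (i, j), (k, l) = sorted((tuple(sorted(pair1)), tuple(sorted(pair2))))
    return i < k < j < l
```

For example, the pairs (1, 10) and (5, 15) cross and so form a pseudoknot under this definition, whereas (1, 10) and (3, 8) are nested and do not.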

RNA primary structure: the sequence of the four bases on the side chain of the RNA. By convention the RNA sequence runs from the 5' end to the 3' end, so the entire sequence s is written s = s_1 s_2 … s_n, where s_i is the i-th base of the RNA sequence and s_i ∈ {A, U, G, C}. An RNA base subsequence s_{i,j} is a sequence fragment of s, written s_{i,j} = s_i … s_j.

RNA secondary structure: the set of base pairs in the RNA sequence constitutes the RNA folding structure, denoted S. For any base pairs, if s_i·s_j ∈ S, s_{i′}·s_{j′} ∈ S and i = i′, then j = j′; that is, a base cannot form base pairs with two or more bases at the same time. The base pairs and the free bases can form loop structures such as hairpin loops, stacks, internal loops, external loops and bulges. RNA tertiary structure: the structure formed when, according to the principles of folding dynamics, the RNA secondary structure is further folded and twisted.
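
The constraint that one base takes part in at most one base pair can be checked as follows. This is a small sketch under the assumption that a structure S is given as a collection of position pairs; the function name `is_valid_structure` is illustrative and does not appear in the patent.

```python
from collections import Counter


def is_valid_structure(pairs) -> bool:
    """Return True if every position occurs in at most one base pair."""
    counts = Counter(pos for pair in pairs for pos in pair)
    return all(count == 1 for count in counts.values())
```

A structure such as [(1, 10), (2, 9)] passes, while [(1, 10), (1, 8)] fails because base 1 would pair twice.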

Referring to FIG. 1, the method of the invention for predicting a pseudoknot-containing RNA folding structure based on the extended structure comprises the following steps: inputting a ribonucleic acid base sequence; defining the pseudoknot and the extended structure; establishing a mathematical representation model of the pseudoknot-containing ribonucleic acid structure based on the extended structure; calculating the minimum energy of the model; and outputting the folding structure of the ribonucleic acid according to the minimum free energy principle.

FIG. 2 shows the continuous stack search process of the invention: a sequence s = s_1 s_2 … s_n is input and its bases are searched; if there exist i and j such that s_i pairs with s_j, j − i ≥ 3, and s contains three or more consecutive adjacent base pairs s_i·s_j, s_{i+1}·s_{j−1}, …, s_k·s_l, then the interval enclosed by the base pairs s_i·s_j and s_k·s_l is determined to be a stack; all paired bases in the stack are marked; the search for paired bases continues among the free bases enclosed by the stack, and a continuous stack is determined if three or more base pairs exist; if two or more continuous stacks cross, a pseudoknot is formed.
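
The stack search described above can be illustrated with a short Python sketch. It is not the patent's procedure: it enumerates candidate stacks exhaustively rather than by the search order of FIG. 2, it reads the threshold as "three or more" pairs through the hypothetical parameter `min_pairs`, and the names `find_stacks` and `stacks_cross` are assumptions made for illustration. Positions are 1-based.

```python
PAIRABLE = {frozenset("AU"), frozenset("CG"), frozenset("GU")}


def can_pair(a, b):
    return frozenset((a, b)) in PAIRABLE


def find_stacks(seq, min_pairs=3, min_span=3):
    """Return (i, j, k, l) for every maximal run (i,j),(i+1,j-1),...,(k,l)."""
    n = len(seq)

    def ok(i, j):
        # Positions must be inside the sequence, satisfy j - i >= min_span,
        # and hold bases that can pair.
        return 1 <= i and j <= n and j - i >= min_span and can_pair(seq[i - 1], seq[j - 1])

    stacks = []
    for i in range(1, n + 1):
        for j in range(i + min_span, n + 1):
            if not ok(i, j) or ok(i - 1, j + 1):
                continue  # not pairable, or not the outermost pair of its run
            k, l = i, j
            while ok(k + 1, l - 1):  # extend the run inward
                k, l = k + 1, l - 1
            if k - i + 1 >= min_pairs:
                stacks.append((i, j, k, l))
    return stacks


def stacks_cross(a, b):
    """True if the arms of two stacks interleave, which forms a pseudoknot."""
    (ai, aj, ak, al), (bi, bj, bk, bl) = sorted((a, b))
    return ak < bi and bk < al and aj < bl
```

For the toy sequence GGGAAAACCC the only candidate with at least three pairs is the run (1, 10), (2, 9), (3, 8), reported as (1, 10, 3, 8). Choosing which candidate stacks actually occur together is left to the energy calculation and is not attempted here.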

FIG. 3 shows the processing units of the prediction device, following the extended structure processing flow of the invention. They include a ribonucleic acid input unit, a data storage unit, a continuous stack search and determination unit, an extended structure search and determination unit, and a unit for outputting the structure containing the pseudoknot.

FIG. 4 shows schematic diagrams of the primary, secondary and tertiary structures corresponding to the RNA folding structure. In the folding process of RNA, the ribonucleic acid base sequence can be regarded as the primary structure; by folding according to the base pairing rules, the primary structure forms the secondary structure, which comprises internal loops, bulges, external loops, hairpin loops and the like; and the secondary structure can form the tertiary structure through further folding and twisting.

Definition 1: In an RNA base sequence s_{i,j}, if (i, j), (i+1, j−1), …, (k, l) are all base pairs and there are no crossing pairs, i < k < l < j, then the structure enclosed by (i, j) and (k, l) ∈ S is called a stack, denoted T_1[i, j]. If the stack T_1[i, j] is closed by (i, j) and (r, s) ∈ S, the stack T_1[r′, s′] is closed by (r′, s′) and (k, l) ∈ S, there are no crossing pairs among the enclosed bases, i < r < r′ < k < l < s′ < s < j, and v = r′ − r + s − s′ > 2, then the RNA folding structure enclosed by (i, j) and (k, l) ∈ S is called a 2-order continuous stack, denoted T_2[i, j].
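
The index conditions of Definition 1 for a 2-order continuous stack can be written down directly. The sketch below only checks the ordering i < r < r′ < k < l < s′ < s < j and the threshold v > 2; it does not verify that the positions actually pair. The names `interior_size` and `is_second_order_stack`, and the argument names `r_in` and `s_in` standing for r′ and s′, are illustrative.

```python
def interior_size(r, r_in, s_in, s):
    """v = r' - r + s - s', computed from the pairs (r, s) and (r', s')."""
    return (r_in - r) + (s - s_in)


def is_second_order_stack(i, r, r_in, k, l, s_in, s, j):
    """Check the index conditions for T_2[i, j] in Definition 1."""
    ordered = i < r < r_in < k < l < s_in < s < j
    return ordered and interior_size(r, r_in, s_in, s) > 2
```

With i, r, r′, k = 1, 3, 5, 7 and l, s′, s, j = 14, 16, 18, 20 the value is v = 4, so the conditions hold; moving r′ to 4 and s′ to 17 gives v = 2 and the conditions fail.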

Similarly, if T_1[i, j] is closed by (i, j) and (r, s) ∈ S, the (k−1)-order continuous stack is closed by (r′, s′) and (k, l) ∈ S, i < r < r′ < k < l < s′ < s < j, v = r′ − r + s − s′ > 2, and there are no crossing pairs among the enclosed bases, then the structure enclosed by (i, j) and (k, l) ∈ S is called a k-order continuous stack, denoted T_k[i, j]. The energy of T_k[i, j] is denoted ET_k(i, j), and its length can be expressed as LT_k(i, j) = k − i + 1 or RT_k(i, j) = j − l + 1. T_2[i, j] consists of two nested stacks and their internal unpaired bases. Let E_2(r, r′: s′, s) denote the energy of the continuous stacking structure of the base pairs (r, s) and (r′, s′), let ET_1(i, j) denote the energy of the stack closed by the base pair (i, j), and let ET_1(r′, s′) denote the energy of the stack closed by the base pair (r′, s′). Then ET_2(i, j) = ET_1(i, j) + E_2(r, r′: s′, s) + ET_1(r′, s′) + a, where a is a compensation parameter. Likewise, ET_k(i, j) = ET_1(i, j) + E_2(r, r′: s′, s) + ET_{k−1}(r′, s′) + b, where b is a compensation parameter.
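
The recurrence for ET_k can be unrolled from the innermost stack outward. The following sketch assumes the ET_1 values of the nested stacks and the E_2 values between neighbouring stacks are already known; the function name, the list-based interface and all numbers in the usage note are hypothetical and are not energy parameters from the patent.

```python
def continuous_stack_energy(stack_energies, junction_energies, a, b):
    """Energy ET_k of a k-order continuous stack.

    stack_energies:    ET_1 of each nested stack, outermost first (k values).
    junction_energies: E_2 between consecutive stacks, outermost first (k - 1 values).
    a, b:              compensation parameters for order 2 and for order >= 3.
    """
    if len(junction_energies) != len(stack_energies) - 1:
        raise ValueError("need exactly one junction energy between neighbouring stacks")
    total = stack_energies[-1]  # innermost stack: ET_1
    order = 1
    inner_to_outer = zip(reversed(stack_energies[:-1]), reversed(junction_energies))
    for et1, e2 in inner_to_outer:
        order += 1
        compensation = a if order == 2 else b
        # ET_order = ET_1(outer stack) + E_2 + ET_(order-1)(inner part) + compensation
        total = et1 + e2 + total + compensation
    return total
```

With made-up values ET_1 = -3.0 and -2.0, E_2 = 1.0 and a = 0.5, the 2-order energy is -3.0 + 1.0 - 2.0 + 0.5 = -3.5.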

Let LT(i, j) ∈ {LT_1(i, j), LT_2(i, j)} and ET(i, j) ∈ {ET_1(i, j), ET_2(i, j)}. In the method of the invention, the free energies and lengths of the continuous stacks are precomputed in O(n^3) time and stored in the triangular matrices ES(i, j) and LS(i, j), respectively.

Likewise, the calculation formula of ET_k(i, j) shows that the time complexity of calculating the continuous stacks is O(n^3) and the space complexity is O(n^2). The calculation of the continuous stacks can be implemented by a dynamic programming algorithm.

Definition 2: An extended structure consists of two RNA base sequence segments s_{i,k} and s_{l,j}, i < k < l < j, satisfying the following conditions. There exist p and q, i < p < q < k, such that s_{p,q} and s_{l,j} constitute a continuous stack, and there is no pairing between bases inside the segment s_{i,k}; that is, if (m, n) is a base pair with m < n, then m < i or n > k, or m < i and n < i, or m > k and n > k. The segments s_{i,k} and s_{l,j} then constitute an extended structure. Let P[i, k: l, j] denote the optimal extended structure, let EP(i, k: l, j) denote the energy of P[i, k: l, j], and let LP(i, k: l, j) = j − l + 1 or LP(i, k: l, j) = k − i + 1 denote the length of P[i, k: l, j]. Once the extended structure P[i, k: l, j] is determined, LP(i, k: l, j) is also uniquely determined. P[i, k: l, j] can be stored through LP(i, k: l, j) using O(n^3) space.
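
One way to read Definition 2 is as a predicate on a set of base pairs. The sketch below is an interpretation, not the patent's code: it reads "no pairing inside s_{i,k}" as "no base pair has both ends in [i, k]", it reads "continuous stack" as a run of at least `min_pairs` consecutive nested pairs, and it covers only the case in which the interior of s_{i,k} stacks with s_{l,j}. The function name and the `min_pairs` parameter are hypothetical.

```python
def is_extended_structure(pairs, i, k, l, j, min_pairs=3):
    """Check whether segments s_{i,k} and s_{l,j} satisfy the reading above."""
    if not (i < k < l < j):
        return False
    pairset = {tuple(sorted(p)) for p in pairs}
    # No base pair may have both of its bases inside s_{i,k}.
    if any(i <= m and n <= k for m, n in pairset):
        return False
    # Look for a run (m, n), (m+1, n-1), ... that joins the interior of
    # s_{i,k} (i < m < k) to s_{l,j} (l <= n <= j).
    for m, n in pairset:
        run = 0
        while (m + run, n - run) in pairset and i < m + run < k and l <= n - run <= j:
            run += 1
        if run >= min_pairs:
            return True
    return False
```

For instance, with pairs (3, 20), (4, 19), (5, 18) and segments s_{1,8} and s_{15,22} the predicate holds; adding the pair (2, 7), which lies inside s_{1,8}, makes it fail.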

FIG. 5 shows an example of calculating the energy of a given RNA sequence comprising 12 bases during RNA folding, based on the base pairing rules and stacking parameters.

In the RNA folding structure, the energy and length of the k-order continuous stacks and extended structures are calculated by the following procedure:

Calculation of the energy and length of k-order continuous stacks and extended structures in the RNA structure

// Note: (i, j) denotes the base pair formed by the RNA bases s_i and s_j, and g denotes the compensation coefficient for k-order continuous stacking in the RNA folding structure. P' denotes the offset for one base pair of a pseudoknot in the extended structure, and Q' denotes the penalty for one unpaired base of the pseudoknot in the extended structure. //

Algorithm(S,k)

1.

2.

3. For r = 4 to n

4. For i = 1 to n-r

5. j ← i+r+2;

6. If (i, j) & (i+1, j-1)

7. { LS_1(i, j) = 1; k ← i; l ← j;


8. ES_1(i, j) ← ES_1(i, j) + g*E_1(k, k+1: l-1, l) + g*E_2(k, k+1: l-1, l) + g*E_2(k+1, k+2: l-2, l-1);

9. While (k, l) & (k+1, l-1) & (k+2, l-2) ((l-k) > 4)

10. ES_i(i, j) ← ES_i(i, j) + g*E_{i+1}(k, k+1: l-1, l) + g*E_{i+1}(k+1, k+2: l-2, l-1);

11. LS_i(i, j)++; k++; l--;

Loop

12. ES_i(i, j) ← ES_i(i, j) + P';


13. If (k = i+2 & l = j-2)

14. While k = i to i+U+1

15. For l = j-U-1+k-i to j

16. If (k, l)

17. V ← ES_i(i, j) + g*E_{i+1}(i, k: l, j) + ES_i(k, l) + (k-i+j-l-2)*Q' - 1;

18. W ← g*E_{i+1}(i, k: l, j) + ES_i(k, l) + (k-i+j-l-2)*Q' + 2

19. If (V < ES_{i+1}(i, j) & W < ES_{i+1}(i, j))

20. ES_{i+1}(i, j) ← V;

21. LS_{i+1}(i, j) ← LS_i(i, j) + LS_i(k, l);

Loop

22. End while

The RNA folding structure can be decomposed into one extended structure and one subsequence, or into two crossing extended structures and one subsequence. An extended structure can be decomposed into k-order continuous stacks and multi-branch loops, so that pseudoknots can be represented recursively. An extended structure may itself contain a pseudoknot, and the crossing of two extended structures may in turn form a pseudoknot structure, so the extended prediction method can cover crossing pseudoknots.

The extended structure and the k-order continuous stacking model are introduced; the extended structure is calculated using k-order continuous stacks, and nested and non-nested pseudoknot structures are calculated from the crossing of extended structures, which establishes a new mathematical representation model of the RNA folding structure. A dynamic programming algorithm is designed and implemented on the basis of this new mathematical model of the pseudoknot-containing folding structure, and it predicts RNA folding structures containing arbitrary planar and non-planar pseudoknots.

The classical PknotsRG algorithm cannot predict crossing pseudoknots, and the prediction method and device based on the semi-extended structure suffer from defects in the pseudoknot structure representation model and in the free energy parameters, which have not been optimized. The invention can predict arbitrary planar and non-planar pseudoknots by using an extended PknotsRG algorithm. The calculation of a pseudoknot structure formed by one extended structure and one subsequence, or by two extended structures and one subsequence, is added to the MFOLD calculation model to form the pseudoknot calculation model; a graphical representation of the basic model is given in FIGS. 2 and 3.

FIG. 6 is a partial schematic diagram showing the improved W(i, j) and V(i, j), based on the extended structure and the minimum free energy principle during RNA folding; their definitions and calculation process are given below.

FIG. 7 is a partial schematic diagram of a pseudoknot-containing RNA folding structure based on the extended structure, which may comprise at least 8 cases. Screening and optimization over these cases can be included in the calculation process.

Given a sequence s = s_1 s_2 … s_n and a sequence fragment s_{i,j} = s_i … s_j, 1 < i < j < n. Let W(i, j) be the minimum free energy of the pseudoknot-containing RNA folding structure S corresponding to the subsequence s_{i,j} when s_i and s_j do not form the base pair (i, j). Let V(i, j) be the minimum free energy of the pseudoknot-containing RNA folding structure S corresponding to the subsequence s_{i,j} when s_i and s_j form the base pair (i, j).

The calculation formulas of W (i, j) and V (i, j) in the mathematical model are given below.

V(i, j) is calculated from the following three cases: S is a stack; S is a 2-order continuous stack; S is a k-order continuous stack (k ≥ 3), i < k < j.

Let E_k(i, j) be the minimum energy of the k-order continuous stack closed by the base pair (i, j) ∈ S. If (i, j), (k, l) ∈ S and 1 ≤ i < k < l < j ≤ n, then (i, k: l, j) is the 2-order continuous stack closed by the base pairs (i, j) and (k, l), with energy E_2(i, k: l, j).

In the improved parameters and free energy calculation for the extended structure, the invention expresses the energy as a function of the number u of unpaired bases and the number k of base pairs: E_k = B + kM + uP, where B is the offset value for forming one extended structure, M is the offset value for each base pair in the extended structure, and P is the offset value for each unpaired base in the extended structure.
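
The linear form E_k = B + kM + uP is easy to evaluate once B, M and P are fixed. The patent text does not give their values here, so the numbers in the usage note are placeholders chosen only to show the arithmetic; the function and argument names are likewise illustrative.

```python
def extended_structure_energy(k_pairs, u_unpaired, B, M, P):
    """Evaluate E_k = B + k*M + u*P for one extended structure.

    k_pairs:    number k of base pairs in the extended structure
    u_unpaired: number u of unpaired bases in the extended structure
    B, M, P:    offset for the structure, per base pair, and per unpaired base
    """
    return B + k_pairs * M + u_unpaired * P
```

With the placeholder values B = 7.0, M = 0.5 and P = 0.25, an extended structure with k = 4 base pairs and u = 3 unpaired bases would be assigned 7.0 + 4 × 0.5 + 3 × 0.25 = 9.75.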

1) W(i, j) = V(i, j) = +∞, if j − i < 4

2) V(i, j) = +∞, if bases i and j do not form a base pair

3) W(i, i) = 0, since base i cannot pair with itself

4)

Using a dynamic programming algorithm, the calculation starts from the subsequences of 3 nucleotides of the RNA base sequence: the minimum free energy of every subsequence of 3 nucleotides is calculated first, and so on for longer subsequences until W(1, n) is obtained. For bases with j − i ≥ d ≥ 3, V(i, j) and W(i, j) are calculated from the values for bases i′ and j′ with j′ − i′ < d.
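
The boundary conditions 1) to 3) and the order in which the tables are filled can be sketched as follows. The recurrence of item 4) is not reproduced in the text, so this sketch deliberately stops at table initialization and traversal order; the names `init_tables` and `fill_order` and the 1-based table layout are assumptions made for illustration.

```python
import math


def init_tables(n):
    """Create 1-based (n+1) x (n+1) tables W and V with the boundary values.

    Every entry starts at +infinity, which covers conditions 1) and 2);
    W(i, i) = 0 is condition 3). Row and column 0 are unused.
    """
    W = [[math.inf] * (n + 1) for _ in range(n + 1)]
    V = [[math.inf] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        W[i][i] = 0.0
    return W, V


def fill_order(n, d_min=3):
    """Yield (i, j) with j - i = d for d = d_min, d_min + 1, ..., n - 1.

    Entries with a smaller span are always produced first, so a
    recurrence for (i, j) may rely on every (i', j') with j' - i' < d.
    """
    for d in range(d_min, n):
        for i in range(1, n - d + 1):
            yield i, i + d
```

For n = 5 the traversal visits (1, 4), (2, 5) and finally (1, 5), so the last entry computed is the one for the whole sequence, matching the statement that the calculation ends with W(1, n).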

The method is compared with experiments of a PKNOTS algorithm and a semi-extended structure method, and is realized by VC + + programming and compared with the PKNOTS algorithm. On the basis, optimizing energy parameters, and calculating all sequences of the PseudoBase database and the Rfam database. The Pknots algorithm and the LP algorithm can only predict partial plane false knots, and the PKNOTS algorithm is the best algorithm for predicting any plane false knot and partial non-plane false knots at present. Therefore, the test results of the method of the present invention are mainly compared with the pknot algorithm and the semi-extended structure method. Firstly, a PKNOTS algorithm and a semi-extension structure method test set are calculated, the used energy parameters are the same as those of the PKNOTS algorithm and the semi-extension structure method, but the extension structure is accurately defined, and coaxial stacking based on the extension structure is introduced, including the coaxial stacking including the pseudoknot based on the extension structure, so that the method is favorable for forming an accurate and complete RNA folding structure, including continuous stacking, a ring structure and a pseudoknot structure. The defects of low parameter precision, inaccurate free energy value, large error of the calculation method and the like of the representation model of the false knot in the semi-extension structure are overcome, the calculation method is improved, and the calculation result is as follows.

Note: as is commonly known in the art, the computation time of a computer generally depends jointly on the CPU clock frequency, the motherboard architecture and the memory size. With the same computer configuration, an improvement in computation time is obtained mainly through an improvement of the calculation method (algorithm).

TABLE 1 comparison of computation times for the method of the present invention with the semi-extended structure algorithm, PKNOTS algorithm

TABLE 2 comparison of different results of the method of the invention with the PKNOTS algorithm

TABLE 3 comparison of different results of the semi-extended method disclosed in CN104765983A with the PKNOTS algorithm

TABLE 4 comparison of different results of the method of the present invention with the semi-extended method disclosed in CN104765983A

Table 1 compares the computation time of the method of the invention with that of the PKNOTS algorithm. The method of the invention was tested on a PC with 4MB memory, while the PKNOTS algorithm was tested on a Silicon Graphics Origin 200 high-performance computer with 4GB memory. As can be seen from Table 1, for an RNA sequence 75 bases in length the method of the present invention takes 21 seconds, while the PKNOTS algorithm takes 20 minutes. For an RNA sequence 105 bases in length, the method of the present invention takes 97 seconds, while the PKNOTS algorithm takes 235 minutes. For an RNA sequence 200 bases in length, the method of the present invention takes 26 minutes, whereas the PKNOTS algorithm cannot complete the calculation. In fact, the method of the present invention can successfully predict the folding structure of RNA sequences with a length of 1000 bases or more.
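As a small worked illustration, the timings quoted above can be turned into run-time ratios. Only the figures stated in the text are used. Because the two programs ran on different machines, the ratios are indicative and are not a pure measure of algorithmic speed-up. The names `timings` and `speedup` are illustrative.

```python
# Timings quoted in the text for Table 1, converted to seconds.
timings = {
    75: {"invention_s": 21, "pknots_s": 20 * 60},
    105: {"invention_s": 97, "pknots_s": 235 * 60},
}


def speedup(length):
    """Ratio of PKNOTS run time to the run time of the method of the invention."""
    t = timings[length]
    return t["pknots_s"] / t["invention_s"]


for length in sorted(timings):
    print(f"{length} bases: about {speedup(length):.1f}x")
```

Under these figures the ratio is about 57 for the 75-base sequence and about 145 for the 105-base sequence.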

Since the method of the invention is based on extended structure calculations and introduces more coaxial stacks, in particular pseudoknot coaxial stacks, the method of the invention facilitates the formation of a complete stem region and a correct pseudoknot structure.

Table 2 compares the results of the method of the present invention with those of the PKNOTS algorithm under the same energy parameters. For the sensitivity and specificity on 15 different sequences, the average sensitivity of the method of the present invention is 98.2%, better than the 71.7% of the PKNOTS algorithm, and the average specificity of the method of the invention is 97.5%, better than the 70.6% of the PKNOTS algorithm; the improvement is remarkable. Table 3 compares the results of the semi-extended method with those of the PKNOTS algorithm. For the sensitivity and specificity on the 15 different RNA sequences, the average sensitivity of the semi-extended method is 88.1%, better than the 71.7% of the PKNOTS algorithm, and the average specificity of the semi-extended method is 86.3%, better than the 70.6% of the PKNOTS algorithm.

Table 4 compares the results of the method of the present invention with those of the semi-extended method. For the sensitivity and specificity on 15 different RNA sequences, the average sensitivity of the method of the present invention is 98.2%, better than the 88.1% of the semi-extended method, and the average specificity of the method of the invention is 97.5%, better than the 86.3% of the semi-extended method; the improvement is obvious.
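The text does not state how sensitivity and specificity are computed. A minimal sketch follows, assuming the convention commonly used when benchmarking RNA structure prediction: sensitivity is the fraction of reference base pairs that are predicted, and specificity is the fraction of predicted base pairs that appear in the reference. The function name and the example pairs are hypothetical.

```python
def sensitivity_specificity(predicted, reference):
    """Compare two collections of base pairs given as (i, j) tuples.

    Assumed definitions (not stated in the text):
      sensitivity = TP / (TP + FN), specificity = TP / (TP + FP).
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # pairs predicted and in the reference
    fn = len(reference - predicted)   # reference pairs that were missed
    fp = len(predicted - reference)   # predicted pairs not in the reference
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tp / (tp + fp) if tp + fp else 0.0
    return sens, spec


reference = {(0, 12), (1, 11), (2, 10), (5, 20)}
predicted = {(0, 12), (1, 11), (2, 10), (6, 19)}
print(sensitivity_specificity(predicted, reference))
```

In this made-up example three of the four reference pairs are recovered and three of the four predicted pairs are correct, so both values are 0.75.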

Therefore, test results show that the search speed, the average sensitivity and the average specificity of the method are obviously better than those of the semi-extended method and the PKNOTS algorithm in the prior art.

Results of testing of the PseudoBase international RNA database.

PseudoBase is an RNA pseudoknot database. The method was tested on all 245 sequences of the PseudoBase database and on part of the sequences of the Rfam 14.1 database. It predicted that 381 sequences contain pseudoknots, of which 357 were predicted with the correct pseudoknot, a correct rate of 93.7%.

An extended structure and k-order continuous stacking are introduced to establish a new mathematical representation model of RNA pseudoknots. Based on this model, a method with time complexity O(n³) and space complexity O(n²) is provided, which predicts RNA folding structures comprising arbitrary planar pseudoknots and more complex non-planar pseudoknots.

The PKNOTS algorithm, with time complexity O(n⁶) and space complexity O(n⁴), calculates folding structures comprising planar pseudoknots and some non-planar pseudoknots, where a calculated pseudoknot is represented by no more than two gap structures. The method of the invention has time complexity O(n³) and space complexity O(n³), so the time and space complexity of calculating pseudoknots is markedly improved compared with the PKNOTS algorithm. Pseudoknots can be represented by no more than two extended structures, and the length of the calculated RNA sequence can exceed 1600 bases. The test results show that the method of the invention has better search speed, accuracy, sensitivity and specificity than the semi-extended method and the PKNOTS algorithm. Therefore, the method of the invention is more effective in predicting planar and non-planar pseudoknots than the semi-extended method and the PKNOTS algorithm.

The method of the invention can calculate RNA folding structures with nested pseudoknots and crossing pseudoknots formed by substructures such as stacks, hairpin loops, internal loops, bulges and multi-branch loops.

Although the embodiments of the present invention have been described above, the above descriptions are only for the convenience of understanding the present invention, and are not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for predicting a pseudoknot-containing ribonucleic acid folding structure based on an extended structure, characterized by comprising the following steps:

inputting a sequence s = s₁s₂…s_N and searching for bases at random; if there exist i and j such that s_i pairs with s_j, j − i ≥ 3, and more than three consecutive adjacent base pairs s_i·s_j, s_(i+1)·s_(j-1), …, s_k·s_l exist in s, then the interval closed by base pairs s_i·s_j and s_k·s_l is determined as a continuous stack, and all paired bases in the continuous stack are marked; paired bases continue to be searched among the free bases enclosed by the continuous stack, and a continuous stack is determined if more than three base pairs exist; a pseudoknot is formed if more than two continuous stacks cross; after a continuous stack is determined, the continuous stack together with two base sequences containing free bases is determined as an extended structure; a pseudoknot is formed by the cross pairing of two base pairs; a pseudoknot structure is formed by the cross pairing of more than two continuous stacks or extended structures;

(3) calculating the minimum base free energy of the characteristic model;
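The notions of continuous stack and crossing used in claim 1 can be illustrated with a short sketch. It is only an illustration under stated assumptions: base pairs are given as 0-based (i, j) tuples with i < j, a run of pairs (i, j), (i+1, j−1), … counts as a continuous stack once it reaches `min_len` pairs (set to 3 here, since the exact threshold intended by "more than three" is an interpretation), and two stacks are treated as forming a pseudoknot when any of their pairs cross, that is i < k < j < l. All function names are hypothetical.

```python
def find_stacks(pairs, min_len=3):
    """Group base pairs into maximal runs (i, j), (i+1, j-1), ... of length >= min_len."""
    pair_set = set(pairs)
    stacks = []
    for (i, j) in sorted(pair_set):
        if (i - 1, j + 1) in pair_set:
            continue  # not the outermost pair of its run
        run = [(i, j)]
        while (run[-1][0] + 1, run[-1][1] - 1) in pair_set:
            run.append((run[-1][0] + 1, run[-1][1] - 1))
        if len(run) >= min_len:
            stacks.append(run)
    return stacks


def crosses(p, q):
    """True if base pairs p and q cross: i < k < j < l after ordering by 5' end."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l


def has_pseudoknot(stacks):
    """True if any two stacks contain a pair of crossing base pairs."""
    return any(
        crosses(p, q)
        for x, a in enumerate(stacks)
        for b in stacks[x + 1:]
        for p in a
        for q in b
    )


# Two crossing stacks (H-type pattern) versus two nested stacks.
crossing = [(0, 12), (1, 11), (2, 10), (5, 20), (6, 19), (7, 18)]
nested = [(0, 20), (1, 19), (2, 18), (5, 12), (6, 11), (7, 10)]
print(has_pseudoknot(find_stacks(crossing)), has_pseudoknot(find_stacks(nested)))
```

In the made-up example, the first pair list yields two stacks that cross, while the second yields two stacks where one is nested inside the other.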

2. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 1, wherein: an extended structure consists of two ribonucleic acid sequence segments s_(i,k) and s_(l,j), where i<k<l<j.

3. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 2, wherein:

in the two ribonucleic acid sequence segments s_(i,k) and s_(l,j), there exist p and q, i<p<q<k, such that s_(p,q) and s_(l,j) constitute a continuous stack and there is no pairing between the internal bases of segment s_(i,k), i.e.: if m and n exist with m<n and (m, n) is a base pair, then mk, or mm and k>n; then segments s_(i,k) and s_(l,j) form an extended structure, and P[i, k: l, j] denotes its optimal extended structure;

or, in the two ribonucleic acid sequence segments s_(i,k) and s_(l,j), there exist r and s, l<r<s<j, such that s_(r,s) and s_(i,k) constitute a continuous stack and there is no pairing between the internal bases of segment s_(l,j), i.e.: if m and n exist with m<n and (m, n) is a base pair, then m<l or n>j, or m<l and n<l, or k>j and k>j; then segments s_(i,k) and s_(l,j) form an extended structure, and P[i, k: l, j] denotes its optimal extended structure.

4. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 1, wherein: W(i, j) is the minimum free energy of the pseudoknot-containing, extended-structure-based RNA folding structure S corresponding to the subsequence s_(i,j) when the two bases s_i and s_j of the extended structure do not form the base pair (i, j); the cases for calculating W(i, j) include: (1) in the extended structure, s_i and s_j do not participate in forming a stack, s_i and s_j are unpaired bases, s_i and s_j do not form the base pair (i, j), and they lie in the RNA folding structures corresponding to different subsequences s_(i,k) and s_(k+1,j), i<k<j; (2) s_i and s_j do not form the base pair (i, j), and s_(i,j) consists of one extended structure and one subsequence, or of two extended structures, or of two extended structures and one subsequence.

5. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 1, wherein: V(i, j) is the minimum free energy of the pseudoknot-containing, extended-structure-based RNA folding structure S corresponding to the subsequence s_(i,j) when bases s_i and s_j form the base pair (i, j); the cases for calculating V(i, j) include: S is a continuous stack closed by the base pair (i, j) in the extended structure; or S is a pseudoknot-containing stack closed by the base pairs (i, j) and (k, l) in the extended structure, i<k<j; or S is a pseudo-knot in the extended structure<k<j,k<r<l closed stack, and so on.

6. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 1, wherein: the case of calculating an extended structure containing a pseudoknot includes: (1) one extension structure is composed of another extension structure and one or several unpaired bases; (2) one extension structure is composed of another extension structure and a subsequence containing base pairs; (3) one expansion structure is formed by the other two expansion structures; (4) the two extension structures are crossed to form a false knot structure.

7. The method for predicting the folding structure of pseudoknot extended-structure-containing ribonucleic acid according to claim 1, wherein: the minimum free energies of W(i, j), V(i, j) and the extended structure containing pseudoknots are calculated by using a dynamic programming algorithm.

8. A device for predicting the folding structure of pseudoknot-containing extended-structure-based ribonucleic acid, comprising:

an input unit: inputting a ribonucleic acid base sequence;

a storage unit: storing the established pseudoknot model and the ribonucleic acid folding structure characteristic model of the extended structure, and storing corresponding parameters, data structures and calculation formulas of minimum free energy;

a calculation unit: calculating a free energy value and a probability value;

an output unit: outputting, according to the minimum free energy principle and the statistical probability of occurrence, the pseudoknot-containing folding structure of the ribonucleic acid base sequence based on the extended structure.