CN104537278A

CN104537278A - Hardware acceleration method for prediction of RNA secondary structure with pseudoknots

Info

Publication number: CN104537278A
Application number: CN201410717249.4A
Authority: CN
Inventors: 夏飞; 金国庆; 沈金华
Original assignee: Naval University of Engineering PLA
Current assignee: Naval University of Engineering PLA
Priority date: 2014-12-01
Filing date: 2014-12-01
Publication date: 2015-04-22

Abstract

The invention discloses a method for accelerating the prediction of RNA secondary structure with pseudoknots based on a four-dimensional dynamic programming method; its purpose is to speed up that prediction. In the technical scheme, a heterogeneous computing system is built from a host and a reconfigurable algorithm accelerator. The host sends the formatted thermodynamic model parameters and the encoded RNA sequence to the reconfigurable algorithm accelerator, whose seven computing modules execute the non-backtracking PKNOTS computation in MPMD mode. During computation, a matrix dimension-reduction method decomposes each four-dimensional matrix into N three-dimensional matrices; fine-grained parallelism is then achieved by a task partitioning strategy of layer-by-layer segmentation, round-robin division into regions within each layer, and column-parallel processing within each region. The n PEs in each computing module work in SPMD mode and simultaneously compute n data items located in different columns of the region. The method accelerates the prediction of RNA secondary structure with pseudoknots; the technique is novel, performance is high, and cost is low.

Description

Hardware acceleration method for RNA secondary structure prediction with pseudoknots

Technical field

The present invention relates to a method for accelerating RNA secondary structure prediction with pseudoknots based on a four-dimensional dynamic programming method; its purpose is to increase the speed of that prediction.

Background technology

RNA secondary structure is an important basis for identifying ncRNA and is the foundation and prerequisite for studying RNA function. Experimental techniques are the most reliable way to obtain RNA secondary structure; the main structure determination methods at present are X-ray diffraction and nuclear magnetic resonance. Although experimentally obtained results are accurate and reliable, the process is very time-consuming and expensive, so computational methods for RNA structure prediction are particularly important. Predicting the secondary structure of an RNA sequence with computers and mathematical models has been widely adopted in recent years and has become a hot topic in RNA research. RNA secondary structure prediction methods generally comprise the following three parts:

(1) Geometric representation

Paired bases drive the stability of RNA secondary structure, whereas the various loops formed by unpaired bases all reduce it, so the core of RNA secondary structure prediction is finding the paired bases in the sequence. Base pairing is usually represented by a nested bracket string such as "(()())" with "." characters inserted, in which each left bracket "(", right bracket ")" and dot "." corresponds to one base in the sequence. The bases at the positions of a matched left and right bracket form a complementary pair, and "." indicates that the base at that position is unpaired and lies in a loop. Fig. 1 is a schematic diagram of the secondary structure of an RNA sequence.
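The bracket representation described above can be decoded with a simple stack. The following is a minimal sketch, not part of the patent; the function name `parse_dot_bracket` and the example string are illustrative assumptions.

```python
def parse_dot_bracket(structure: str) -> list[tuple[int, int]]:
    """Return the 1-based (i, j) base pairs encoded by a nested bracket string."""
    stack: list[int] = []             # positions of "(" still waiting for a partner
    pairs: list[tuple[int, int]] = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":           # "." marks an unpaired (loop) base
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Hypothetical example string, chosen only for illustration.
example_pairs = parse_dot_bracket("((..)(.))")
print(example_pairs)
```

Because a single bracket type can only express nested pairs, this notation cannot encode the crossing pairs that form a pseudoknot.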

(2) Scoring function

Through experimental measurement and statistical analysis of RNA sequences with known structures, the interactions between adjacent base pairs and between base pairs and unpaired bases are parameterized, and a scoring function assigns a score to each possible RNA secondary structure, so that the quality of a prediction can be evaluated.

(3) Search strategy

RNA secondary structure prediction is not an exhaustive enumeration of all possible structures; an optimization method must be chosen to search the structure space quickly and find the secondary structure corresponding to a global maximum (or minimum) of the score.

Definition 1: Suppose R is an RNA sequence of length n, R = r_1 r_2 r_3 ... r_n. The pair (i, j) denotes that bases r_i and r_j of R form a complementary pair; i, j, k, l denote the positions of bases r_i, r_j, r_k, r_l in the sequence, with 1≤i≤j≤n and 1≤k≤l≤n. The RNA secondary structure prediction problem is then to find the set S of base pairs in R for which the scoring function y = f(g(x_1), g(x_2), ..., g(x_i)) attains its global maximum (or minimum), where f is a composite function and x_i denotes the subsequence r_1 r_2 ... r_i (1≤i≤n).

Two main classes of RNA secondary structure prediction methods currently exist. The first is ab initio prediction, which takes a single RNA sequence as input. The Nussinov algorithm is the earliest single-sequence RNA structure prediction algorithm; it predicts the structure by finding the one with the largest number of base pairs and is therefore also called the maximum base pairing algorithm. Because the method considers only the contribution of paired bases to secondary structure stability, its prediction accuracy is poor.
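The maximum base pairing idea can be written as a short dynamic program. This is a textbook-style sketch rather than the patent's code; the allowed pairs (Watson-Crick plus G-U wobble) and the `min_loop` parameter are assumptions made for the illustration.

```python
def nussinov_max_pairs(seq: str, min_loop: int = 0) -> int:
    """Maximum number of nested complementary pairs in seq (Nussinov-style DP)."""
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]   # best[i][j]: max pairs within seq[i..j]

    def get(a: int, b: int) -> int:
        return best[a][b] if a <= b else 0   # empty interval contributes nothing

    for span in range(1, n):             # solve shorter subsequences first
        for i in range(n - span):
            j = i + span
            score = get(i + 1, j)        # case 1: base i stays unpaired
            # case 2: base i pairs with k, leaving at least min_loop bases between
            for k in range(i + 1 + min_loop, j + 1):
                if (seq[i], seq[k]) in allowed:
                    score = max(score, 1 + get(i + 1, k - 1) + get(k + 1, j))
            best[i][j] = score
    return best[0][n - 1]


print(nussinov_max_pairs("GGGAAACCC", min_loop=3))
```

The table is filled from the shortest subsequences outward, which is the same search strategy the later sections extend to four dimensions.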

Because base pairing lowers the energy of an RNA molecule and stabilizes its structure, the minimum free energy (MFE) algorithm assumes that, at a given temperature, an RNA molecule reaches thermodynamic equilibrium through conformational adjustment, minimizing its free energy and reaching its most stable state; the secondary structure in that state is taken to be the true secondary structure of the RNA. The minimum free energy algorithm was proposed by M. Zuker in 1981 and is also known as the Zuker algorithm. It computes not a simple count of base pairs but the free energy of subsequences. The basic idea rests on the assumption that the free energies of the substructures of an RNA secondary structure are independent and additive: a table of free energy parameters for each substructure is determined experimentally, the free energies of the substructures that may form are summed, and the minimum free energy of the whole RNA sequence equals the minimum over all possible sums of substructure energies. The Zuker algorithm is currently the best structure prediction algorithm for a single RNA sequence and gives particularly good results for microRNA; its drawback is that it does not support prediction of RNA secondary structures containing pseudoknots.

Prediction based on the stochastic context-free grammar (SCFG) model is also a typical structure prediction method for a single RNA sequence. SCFG is currently the probabilistic model best suited to describing and modeling RNA secondary structure and occupies an important position in RNA secondary structure prediction research. The standard alignment algorithm based on the SCFG model is the Cocke-Younger-Kasami (CYK) algorithm [18] [19]. The CYK algorithm aligns a single sequence against the covariance model (CM) of a single RNA family to decide whether the sequence belongs to that family and, further, to obtain its secondary structure. Although the CYK algorithm takes a single sequence as input, building the covariance model of a family requires a large number of RNA sequences for parameter estimation.

The methods above are all single-sequence ab initio methods. With the development of genome sequencing, more and more RNA sequences are known, which makes it possible to predict RNA secondary structure by comparative genomics. Methods of this second class take several homologous RNA sequences, or an alignment built from them, as input; their theoretical basis is that the structure of biological sequences is more conserved than the sequence. Homology-based methods rely on multiple sequence alignment: a multiple alignment tool such as ClustalW first builds the RNA multiple sequence alignment, and the conserved structure of the group of sequences is then obtained by mutation detection. A typical representative is RNAalifold, which extends the MFE method to RNA multiple sequence alignments: while computing the average minimum free energy of the group of sequences it also takes covariation information into account, and it predicts the consensus secondary structure of the homologous sequences by combining energy calculation with covariation scores.

The algorithms above, whether based on a single sequence or on a comparison of multiple homologous sequences, all restrict the relative positions of base pairs (i, j) and (k, l): pairs satisfying i<k<j<l or k<i<l<j are not admitted, so the crossing structures formed by bases are not considered and none of these algorithms can predict pseudoknots. Yet pseudoknots play a very important role in viral genome replication and in the regulation of protein synthesis [34] and are an important component of RNA tertiary structure, so pseudoknot prediction has become a hot topic in RNA secondary structure prediction. RNA secondary structure prediction with pseudoknots under the MFE model has been proved to be NP-complete. To make algorithms practical, researchers have used experimental observations to tighten the constraints on pseudoknot prediction and reduce algorithmic complexity, making computational prediction of pseudoknots feasible. Several approximate algorithms supporting pseudoknot prediction now exist. In 1999 Rivas and Eddy first predicted RNA pseudoknots with a dynamic programming algorithm whose time complexity is O(n⁶) and space complexity is O(n⁴). Other work in the literature reduced the computational complexity to O(n⁵) by constraining the types of pseudoknot, and further to O(n⁴), but the resulting algorithms can predict only the simplest pseudoknots and are of limited practical use. Because the algorithm of Rivas and Eddy predicts planar pseudoknots and restricted non-planar pseudoknots well, it is generally acknowledged as the most complete and authoritative RNA secondary structure prediction algorithm supporting pseudoknots.

Although the algorithms above use different geometric representations and scoring functions, they all use the same search strategy: following the idea of dynamic programming, the structure prediction problem for the whole sequence is decomposed into structure prediction problems for a series of subsequences, and the optimal solution for the whole sequence is built up step by step from the shortest subsequences. According to the classification of dynamic programming problems proposed in the introduction, conventional structure prediction is a three-dimensional dynamic programming problem, whereas RNA structure prediction with pseudoknots is a four-dimensional dynamic programming problem.

The discussion so far concerns RNA secondary structure prediction without pseudoknots. In fact, the pseudoknot is also a common type of RNA secondary structure, and it plays a very important role in the formation of RNA tertiary structure and in the functional activity of RNA.

Definition 2: In condition (3) of Definition 1, if (k, l) ∈ S and i<k<j<l or k<i<l<j holds, then the crossing structure formed by base pair (i, j) and base pair (k, l) is called a pseudoknot. Fig. 2 shows a pseudoknot in an RNA secondary structure and its corresponding arc diagram.

Therefore, on the basis of Definition 2, relaxing the restriction on base positions in condition (3), that is, allowing crossing pairs (i, j) and (k, l) with i<k<j<l or k<i<l<j among the bases r_i, r_j, r_k, r_l of the sequence R = r_1 r_2 r_3 ... r_n, extends the ordinary RNA secondary structure prediction problem to the RNA secondary structure prediction problem with pseudoknots. Although pseudoknots are few in number compared with ordinary base pairs, they are an important component of RNA tertiary structure, and in recent years pseudoknot prediction capability has gradually become an important measure of the performance of RNA secondary structure prediction algorithms.
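The crossing condition of Definition 2 is easy to test on a set of base pairs. The sketch below is illustrative only; the function name `crossing_pairs` and the example pair lists are assumptions, not data from the patent.

```python
from itertools import combinations


def crossing_pairs(pairs):
    """Return every two base pairs (i, j), (k, l) that satisfy i < k < j < l."""
    # Normalize each pair so the smaller position comes first, then sort by it.
    normalized = sorted((min(a, b), max(a, b)) for a, b in pairs)
    crossings = []
    for (i, j), (k, l) in combinations(normalized, 2):
        # After sorting, i <= k always holds, so the mirrored case
        # k < i < l < j of Definition 2 is covered by this single test.
        if i < k < j < l:
            crossings.append(((i, j), (k, l)))
    return crossings


nested = [(1, 9), (2, 5), (6, 8)]     # hypothetical nested structure
knotted = [(1, 6), (2, 5), (4, 9)]    # hypothetical structure with crossing pairs
print(crossing_pairs(nested))
print(crossing_pairs(knotted))
```

An empty result means the structure is purely nested; any returned pair of pairs is a pseudoknot in the sense of Definition 2.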

Among all algorithms that support pseudoknot prediction, the PKNOTS algorithm proposed by Rivas and Eddy predicts planar pseudoknots and restricted non-planar pseudoknots well. It is currently recognized as the best RNA secondary structure prediction algorithm supporting pseudoknots, and its predictions have been verified experimentally.

The PKNOTS algorithm is a single-sequence ab initio method based on the MFE model. It also uses dynamic programming: the structure prediction problem for a sequence is decomposed into structure prediction problems for shorter subsequences, and the structure of the sequence itself is obtained from the structures of its subsequences. Because each pseudoknot needs the four parameters i, j, k, l to describe it, the dynamic programming recurrence needs a four-level nested loop to find the optimum, hence the name four-dimensional dynamic programming problem. The input of PKNOTS is a single RNA sequence, and the output is the base pairing result, including pseudoknots if any exist.

Because structure prediction with pseudoknots has very high computational complexity, the algorithm is unsuitable for longer sequences. Tests show that, on an AMD Phenom 9650 Quad CPU, the PKNOTS-1.05 program needs 140 s to predict the structure of a sequence of length 64 bps, and more than 9000 s (about 2.5 hours) for a sequence of length 128 bps. Many improved algorithms based on PKNOTS support pseudoknot prediction, but they all trade accuracy and correctness for execution speed, and their predictions are poor. There are also other pseudoknot prediction methods, such as weighted matching algorithms based on stacking stability, which have a favorable computational complexity but give good predictions only for particular types of pseudoknot. By comparison, PKNOTS predicts clearly better than the other algorithms, but its high time and space complexity limits its practicality: at present it can predict only the structures of short sequences of a few dozen bases.
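The two timings quoted above are roughly what an O(n⁶) algorithm would give. The sketch below is a back-of-the-envelope check that assumes runtime scales purely as n⁶ with no lower-order terms or memory effects; the helper name `extrapolate_runtime` is hypothetical.

```python
def extrapolate_runtime(t_ref: float, n_ref: int, n_new: int, exponent: int = 6) -> float:
    """Scale a measured runtime from length n_ref to n_new, assuming t ~ n**exponent."""
    return t_ref * (n_new / n_ref) ** exponent


# Doubling the length multiplies an O(n^6) runtime by 2**6 = 64.
estimate = extrapolate_runtime(140.0, 64, 128)
print(f"{estimate:.0f} s, about {estimate / 3600:.1f} hours")
```

Under this assumption the 140 s measurement extrapolates to 8960 s, which is of the same order as the reported figure of more than 9000 s.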

In 2010, Krishnan et al. carried out the first parallelization study of a four-dimensional dynamic programming algorithm, targeting PKNOTS on the IBM Cell multicore processor. Compared with the standard PKNOTS software running on a general-purpose microprocessor, the parallel version running on the Sony PlayStation 3 platform achieved a speedup of about 3 times, but it supports only RNA sequences shorter than 100 bps. This work, based on an FPGA platform, studies the four-dimensional dynamic programming problem arising in RNA secondary structure prediction with pseudoknots. It proposes a data dependence analysis method that extracts the computational characteristics of complex high-dimensional dynamic programming problems, implements storage optimization and fine-grained parallelism on that basis, and achieves an overall speedup of 3 to 5 times over the existing serial implementation.

The computation of the PKNOTS algorithm involves three two-dimensional matrices VX, WX and WBX and four four-dimensional matrices VHX, ZHX, YHX and WHX. The time complexity of the algorithm is O(n⁶) and the space complexity is O(n⁴), where n is the sequence length. The recurrences for the four-dimensional matrices VHX, ZHX and YHX are as follows:

$$
VHX(i, j : k, l) = \min \begin{cases}
\mathrm{EIS}^{2}(i, j : k, l) \\
\mathrm{EIS}^{2}(i, j : r, s) + VHX(r, s : k, l) \\
\mathrm{EIS}^{2}(r, s : k, l) + VHX(i, j : r, s) \\
WHX(i + 1, j - 1 : k - 1, l + 1) + M
\end{cases} \tag{1}
$$

$$
ZHX(i, j : k, l) = \min \begin{cases}
VHX(i, j : k, l) + P \\
ZHX(i, j : k - 1, l) + Q \\
ZHX(i, j : k, l + 1) + Q \\
ZHX(i, j : r, l) + WX(r + 1, k) \\
ZHX(i, j : k, s) + WX(l, s - 1) \\
\mathrm{EIS}^{2}(i, j : r, s) + ZHX(r, s : k, l) \\
WHX(i + 1, j - 1 : k, l) + P + M
\end{cases} \tag{2}
$$

$$
YHX(i, j : k, l) = \min \begin{cases}
VHX(i, j : k, l) + P \\
YHX(i + 1, j : k, l) + Q \\
YHX(i, j - 1 : k, l) + Q \\
YHX(r + 1, j : k, l) + WX(i, r) \\
YHX(i, s : k, l) + WX(s + 1, j) \\
\mathrm{EIS}^{2}(r, s : k, l) + YHX(i, j : r, s) \\
WHX(i, j : k - 1, l + 1) + P + M
\end{cases} \tag{3}
$$

The variables i, j, k, l, r, s in the formulas above denote positions of bases in the RNA sequence and also serve as element coordinates in the matrices; they satisfy i≤r≤k≤l≤s≤j. VHX, ZHX and YHX are four-dimensional dynamic programming matrices, and P, Q, M and EIS² are energy parameters. Compared with the recurrences of the Zuker algorithm, PKNOTS takes the pseudoknot, a special substructure, into account, so the search space of candidate structures grows and the energy calculation for each element gains candidate branches accordingly. The basic idea is still the minimum energy model: each step selects the minimum among the energies of all possible substructures as the local structure, and the principle that the minimum energy of the whole RNA sequence equals the minimum over all possible sums of substructure energies is unchanged.

Fig. 3 shows the computation space of the four-dimensional dynamic programming algorithm. If the four-dimensional matrix is drawn as a two-dimensional plan, the relations among the element subscripts show that, in PKNOTS, the two higher dimensions i and j of each four-dimensional matrix form a two-dimensional upper triangular matrix, and each unit in it (called a Cell) is itself a two-dimensional triangular matrix.
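The triangle-of-triangles layout can be made concrete by enumerating the index space. This is a minimal sketch that assumes the inner coordinates of Cell (i, j) satisfy i ≤ k ≤ l ≤ j, as suggested by the subscript relation given earlier; the function names are hypothetical.

```python
from math import comb


def gap_matrix_indices(n: int):
    """Yield (i, j, k, l) with 1 <= i <= k <= l <= j <= n, grouped by Cell (i, j)."""
    for i in range(1, n + 1):
        for j in range(i, n + 1):          # Cell (i, j) of the outer upper triangle
            for k in range(i, j + 1):
                for l in range(k, j + 1):  # inner triangle stored in that Cell
                    yield (i, j, k, l)


def cell_size(i: int, j: int) -> int:
    """Number of elements in the triangular Cell (i, j)."""
    m = j - i + 1
    return m * (m + 1) // 2


def total_entries(n: int) -> int:
    """Closed form for the number of valid index tuples: C(n + 3, 4)."""
    return comb(n + 3, 4)


print(sum(1 for _ in gap_matrix_indices(4)), total_entries(4))
```

The count grows as roughly n⁴/24, which matches the O(n⁴) space complexity stated for the algorithm.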

Further analysis of the recurrences shows that the three two-dimensional matrices VX, WX and WBX and the four four-dimensional matrices VHX, ZHX, YHX and WHX of the PKNOTS algorithm are linked by self-reference and mutual-reference relationships more complex than those of three-dimensional dynamic programming problems. Fig. 4 shows the data dependences between the matrices: circles represent matrices, arrows represent data dependences, and a self-loop indicates that a matrix depends on its own data. WHX also has a self-loop, which is omitted from the figure for simplicity.

Because the structure prediction algorithm with pseudoknots extends the conventional structure prediction algorithm based on the minimum free energy model, and both use dynamic programming to compute from local to global energy region by region, it shares the computational characteristics of the RNA secondary structure prediction algorithms described earlier: (1) the basic computation region is a triangle; (2) the computation involves a large number of parameter table lookups; (3) memory access alternates between rows and columns and is irregular, and the data dependence distance changes as the computation position moves. In addition, because the algorithm relaxes the positional restriction on paired bases, the larger search space makes memory scheduling and I/O bandwidth the design bottlenecks, and the higher computational complexity makes data dependence analysis and the design of a parallelization strategy more challenging.

A literature survey shows that hardware acceleration research in sequence analysis, both in China and abroad, is at present mostly confined to the primary structure level of sequences; no hardware acceleration of higher-dimensional, and in particular four-dimensional, structure prediction algorithms has been reported.

Summary of the invention

In view of the defects of existing methods, the object of the present invention is to propose, for the first time, a method for accelerating RNA secondary structure prediction with pseudoknots based on a four-dimensional dynamic programming method, so as to increase the speed of that prediction.

The technical scheme of the present invention is as follows. A heterogeneous computing system consisting of a host and a reconfigurable algorithm accelerator is built first. The host then sends the formatted thermodynamic model parameters and the encoded RNA sequence to the reconfigurable algorithm accelerator, whose seven computing modules execute the non-backtracking PKNOTS computation in MPMD mode. During computation, a matrix dimension-reduction method decomposes each four-dimensional matrix into N three-dimensional matrices, and fine-grained parallel computation is then achieved by a task partitioning strategy of layer-by-layer segmentation, round-robin division into regions within each layer, and column-parallel processing within each region. The n PEs inside each computing module work in SPMD mode and simultaneously compute n data items located in different columns of the region, where n is a natural number.

The PKNOTS computation is carried out by three two-dimensional matrix computing modules and four four-dimensional matrix computing modules. The three two-dimensional matrix computing modules are PE_VX, PE_WX and PE_WBX; the four four-dimensional matrix computing modules are PE_WHX, PE_VHX, PE_ZHX and PE_YHX.

The two-dimensional matrix computing modules PE_VX, PE_WX and PE_WBX have the same structure. Each contains a sub-PE controller, a sub-PE computing unit, a local memory and a data transfer register. The sub-PE controller controls the timing of computation and memory access. The core of the sub-PE computing unit is a 32-bit adder that adds two input operands; its result is written simultaneously to the local memory and to the data transfer register. The local memory caches the results for one full column of the two-dimensional matrix, whereas the data transfer register holds only the result for the current element, for immediate use by the next computing module.

The four-dimensional matrix computing modules PE_WHX, PE_VHX, PE_ZHX and PE_YHX have the same structure. Their function is the parallel computation of a two-dimensional Cell, each Cell being a two-dimensional triangular matrix.

Against the background of the demand for high-performance computing from complex structure prediction algorithms in RNA sequence analysis, the present invention builds on a heterogeneous architecture that couples a general-purpose microprocessor with an FPGA hardware algorithm accelerator. Starting from the extraction of the dynamic computational characteristics of typical methods, it studies optimization methods for complex data dependences and irregular memory access, and realizes fine-grained parallelism for a typical algorithm, thereby accelerating computation efficiently. On this basis it proposes a fine-grained parallel algorithm, based on reconfigurable hardware, for predicting RNA secondary structure with pseudoknots. This gives the family of algorithms in this specific field a basic hardware structure template and a parallel programming framework, laying the foundation for reducing the design complexity of algorithm accelerators and for generating accelerators quickly.

The fine-grained parallel computing method for four-dimensional dynamic programming matrices disclosed by the invention not only speeds up RNA secondary structure prediction with pseudoknots; as a parallel programming template and design framework for heterogeneous architectures, the method and the hardware accelerator can also guide rapid accelerator generation and provide a technical reference for solving high-dimensional dynamic programming matrix problems in other fields.

Brief description of the drawings

Fig. 1 is a schematic diagram of the secondary structure of an RNA sequence.

Fig. 2 shows a pseudoknot in an RNA secondary structure and its corresponding arc diagram.

Fig. 3 shows the computation space of the four-dimensional dynamic programming algorithm.

Fig. 4 shows the data dependences between the matrices.

Fig. 5 is a flowchart of the "time-space domain overlap" data dependence analysis.

Fig. 6 shows the internal structure of a two-dimensional matrix computing module.

Fig. 7 shows the computation process for a four-dimensional matrix.

Fig. 8 shows the computation process for a three-dimensional matrix.

Fig. 9 shows the linear array inside a four-dimensional matrix computing module.

Fig. 10 shows the parallel computing structure of the four-dimensional dynamic programming algorithm.

Embodiment

The invention is further described below with reference to an embodiment:

The present invention first builds a heterogeneous computing system consisting of a host and a reconfigurable algorithm accelerator. The host then sends the formatted thermodynamic model parameters and the encoded RNA sequence to the reconfigurable algorithm accelerator, whose seven computing modules execute the non-backtracking PKNOTS computation in MPMD mode. During computation, a matrix dimension-reduction method decomposes each four-dimensional matrix into N three-dimensional matrices, and fine-grained parallel computation is then achieved by a task partitioning strategy of layer-by-layer segmentation, round-robin division into regions within each layer, and column-parallel processing within each region. The n PEs inside each computing module work in SPMD mode and simultaneously compute n data items located in different columns of the region, where n is a natural number.
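One possible reading of the region and column partitioning is sketched below. It assumes that a layer's columns are cut into consecutive regions of at most n columns and that, inside a region, PE p takes the p-th column; the patent text does not spell out this mapping, so the function name `partition_layer` and the mapping itself are assumptions.

```python
def partition_layer(num_columns: int, num_pe: int) -> list[dict[int, int]]:
    """Split columns 0..num_columns-1 into regions and map each column to a PE.

    Each region is returned as {pe_index: column_index}. Regions are processed
    one after another (round-robin), while the columns inside a region are
    handled in parallel by different PEs.
    """
    if num_pe < 1:
        raise ValueError("at least one PE is required")
    regions = []
    for start in range(0, num_columns, num_pe):
        columns = range(start, min(start + num_pe, num_columns))
        regions.append({pe: col for pe, col in enumerate(columns)})
    return regions


# Hypothetical layer with 5 columns processed by n = 2 PEs.
for region_id, mapping in enumerate(partition_layer(5, 2)):
    print(region_id, mapping)
```

With this mapping the last region may be only partly filled, so some PEs idle there; a real design would have to decide how to handle that imbalance.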

The PKNOTS computation is carried out by three two-dimensional matrix computing modules and four four-dimensional matrix computing modules. The three two-dimensional matrix computing modules are PE_VX, PE_WX and PE_WBX; the four four-dimensional matrix computing modules are PE_WHX, PE_VHX, PE_ZHX and PE_YHX. The parallel computing structure of the seven modules (PEs) is shown in Fig. 10. Each PE is responsible for computing one matrix. Modules are named PE_ followed by the name of the matrix they compute (for example, module PE_VX is responsible for computing matrix VX).

The seven modules (PEs) in Fig. 10 form a PE array. At each step the PE array simultaneously computes the unit (i, j) with the same index in all seven matrices. For the two-dimensional matrices VX, WX and WBX this unit is a single element; for the four-dimensional matrices WHX, VHX, YHX and ZHX it is one layer in Fig. 7 (that is, one triangular region), called a "Cell".

The first three modules, the two-dimensional matrix computing modules PE_VX, PE_WX and PE_WBX, have the same structure, shown in Fig. 6. Each contains a sub-PE controller (Sub PE Controller), a sub-PE computing unit (Sub_PE), a local memory (Mem) and a data transfer register (Trans Regs). The connections between the components are shown in Fig. 6, where arrows indicate the direction of data transfer. The sub-PE controller controls the timing of computation and memory access. The core of the sub-PE computing unit is a 32-bit adder that adds two input operands; its result is written simultaneously to the local memory (Mem) and to the data transfer register. The local memory caches the results for one full column of the two-dimensional matrix, whereas the data transfer register holds only the result for the current element, for immediate use by the next computing module.
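The data path of a two-dimensional module can be mimicked in software to show what each storage element holds. This is a behavioral sketch, not a hardware description; the class name, the method names and the unsigned wraparound behavior of the adder are assumptions.

```python
class TwoDimModuleModel:
    """Behavioral model of a 2D matrix module: adder, column memory, transfer register."""

    MASK_32 = 0xFFFFFFFF

    def __init__(self, column_length: int):
        self.local_mem = [None] * column_length  # Mem: results for one full column
        self.trans_reg = None                    # Trans Regs: current element only

    def compute(self, row: int, operand_a: int, operand_b: int) -> int:
        # 32-bit addition; wrapping on overflow is an assumption of this sketch.
        result = (operand_a + operand_b) & self.MASK_32
        # The result is written to both destinations at the same time.
        self.local_mem[row] = result
        self.trans_reg = result
        return result


module = TwoDimModuleModel(column_length=4)
module.compute(0, 5, 7)
module.compute(1, 0xFFFFFFFF, 1)
print(module.local_mem, module.trans_reg)
```

The local memory keeps every result of the column, while the transfer register is overwritten on each step, which is why it can feed the next module without a memory read.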

The last four modules, the four-dimensional matrix computing modules PE_WHX, PE_VHX, PE_ZHX and PE_YHX, have the same structure, shown in Fig. 9. Each contains a linear PE array. Because the function of a four-dimensional matrix computing module is the parallel computation of a two-dimensional "Cell", and each Cell is a two-dimensional triangular matrix, the multi-PE linear array structure of Fig. 9 is used, and multiple columns of elements are computed in parallel under a "round-robin division by columns" strategy. All sub-processing units of the PE array in Fig. 9 have the same structure, which is the sub-module structure shown in Fig. 9. The sub-PE controller (Sub PE Controller) assigns tasks: each time, it loads the computation of one column of the triangular matrix onto the corresponding sub-processing unit (Sub PE), and it keeps the array synchronized. Once computation starts, each sub-PE in the array computes one element of its current column at a time, so that the array as a whole computes one diagonal of a region in Fig. 8(b) synchronously. As the element being computed by each sub-PE moves upward, the diagonal being computed by the whole PE array moves upward with it, and the parallel computation of the two-dimensional "Cell" is completed step by step.
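The diagonal-by-diagonal progress of the linear array can be illustrated with a small schedule generator. The sketch assumes an upper triangular Cell with 0-based (row, column) coordinates in which each PE starts on the main diagonal of its column and moves upward one row per step; these conventions and the name `wavefront_schedule` are assumptions, not details taken from the figures.

```python
def wavefront_schedule(columns: list[int]) -> list[list[tuple[int, int]]]:
    """For each step, list the (row, col) elements the PEs compute together.

    PE p owns columns[p]. At step t it computes element (col - t, col), so all
    elements of one step lie on the same diagonal (col - row == t).
    """
    if not columns:
        return []
    steps = []
    for t in range(max(columns) + 1):
        # A PE whose column is exhausted (row index below 0) idles.
        active = [(col - t, col) for col in columns if col - t >= 0]
        steps.append(active)
    return steps


# Hypothetical region made of columns 2 and 3, one PE per column.
for step, elements in enumerate(wavefront_schedule([2, 3])):
    print(step, elements)
```

Each step touches one element per column, and successive steps move to the next diagonal above, matching the upward-moving diagonal described in the text.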

The present invention uses a data dependence analysis method called "time-space domain overlap". Program analysis generates operation execution order tables, from which data dependences are extracted. The entries of the execution order tables of multiple units are then merged in the time and space domains to establish the mapping from data sources to destinations (time-domain overlap finds independent operations, enabling parallel computation; space-domain overlap finds shared data sources, enabling data reuse). A memory access scheduling matrix is then built to guide the optimization of data scheduling and the design of the parallelization strategy.

Fig. 5 shows the main flow of the "time-space domain overlap" data dependence analysis method, that is, the process that leads from source code to the data loading and transfer schematic. The process comprises four steps: operation type and data source statistics; data dependence and operand source analysis; merging of the multi-unit execution order tables; and generation of the memory access scheduling matrix.

1. Operation type and data source statistics

For the processing unit corresponding to each matrix, the operation types and data sources are listed in the order in which the software code executes. When a loop statement is encountered, the loop is unrolled; taking the "Cell" of the four-dimensional matrix as the basic data block, the variation of the loop variables and of the data dependence region is analyzed statistically, the trajectory of the elements on which the current element's computation depends is derived, and the data dependences are extracted. According to the data dependences, the operations in the code are numbered in execution order and their operation type, data source and associated sequence numbers are listed, producing the operation execution order table of the source code shown in Table 1. In Table 1, an entry in the operation type column may represent either a single operation or a group of operations of the same type in a loop body. In the operand source column, YHX(1,1) means that the current operation depends on data in Cell (1,1) of matrix YHX, and Para(1,1) means that it depends on data in region (1,1) of the parameter table. The associated sequence number column indicates that the result of the current operation will be used by the subsequent operations with those sequence numbers, that is, a read-after-write dependence exists between them. If this column is empty, the operation does not depend on the result of any earlier operation.

Table 1: Operation execution order table

2. Data dependence and operand source analysis

Table 1 reflects the serial execution of the code; analyzing Table 1 yields the data dependences between operations. The operations are then renumbered according to an execution order that satisfies the true data dependences: operations with data dependences are assigned sequential numbers, operations without data dependences are assigned the same number, and operations with the same number are merged. Because operations with the same number can execute in parallel, that is, the actual execution time of the group is the same, this step is called time overlap. Next, the data sources of the time-overlapped operations are analyzed, and identical entries in the operand source column are merged. Identical data sources for different operations in the same row mean that the actual memory addresses are identical or adjacent and that the accessed data belong to the same Cell, so these accesses can be regarded as overlapping in address space. These two steps yield the execution order table after space-time overlap processing, shown in Table 2. Finally, entries that are identical in the operand source column of adjacent rows of the table are marked.
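The renumbering and merging described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the operation table `OPS`, the function names `time_overlap` and `adjacent_reuse`, and the operation types are all hypothetical, and the sketch assumes that sequence numbers follow the serial order so that every producer precedes its consumers.

```python
from collections import defaultdict

# Hypothetical operation table (not the patent's Table 1):
# sequence number -> (operation type, operand sources, used_by sequence numbers)
OPS = {
    1: ("add", ["YHX(1,1)", "Para(1,1)"], [3]),
    2: ("add", ["YHX(1,1)", "Para(1,2)"], [3]),
    3: ("min", ["WHX(1,1)"], [4]),
    4: ("add", ["WHX(1,1)", "ZHX(1,1)"], []),
}

def time_overlap(ops):
    """Renumber operations by dependence level and merge equal levels."""
    level = {}
    for seq in sorted(ops):  # assumption: producers precede consumers
        level.setdefault(seq, 1)
        for consumer in ops[seq][2]:
            level[consumer] = max(level.get(consumer, 1), level[seq] + 1)
    rows = defaultdict(lambda: {"ops": [], "sources": []})
    for seq in sorted(ops):
        row = rows[level[seq]]
        row["ops"].append(seq)
        for src in ops[seq][1]:  # space overlap: keep each source once per row
            if src not in row["sources"]:
                row["sources"].append(src)
    return dict(rows)

def adjacent_reuse(rows):
    """Mark data sources shared by adjacent rows (candidates for on-chip caching)."""
    levels = sorted(rows)
    return {
        (a, b): [s for s in rows[a]["sources"] if s in rows[b]["sources"]]
        for a, b in zip(levels, levels[1:])
    }

merged = time_overlap(OPS)
for lvl in sorted(merged):
    print(lvl, merged[lvl])
print(adjacent_reuse(merged))
```

In this hypothetical table, operations 1 and 2 are independent and share one row, and WHX(1,1) is flagged because it is used by two adjacent rows.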

Table 2: Execution order table after space-time overlap processing

Comparing Table 1 with Table 2 shows that the first two entries of Table 1 have no data dependence between them and have therefore been merged; the operand sources of the two groups of addition operations partially overlap, so their dependence on Cell (1,1) of the YHX matrix has been merged. Inspecting adjacent entries in the operand source column of Table 2 shows that YHX(1,1) and WHX(1,1) are used by adjacent operations and are therefore marked. When the same data source is used by adjacent operations, it can be considered for on-chip caching during memory scheduling, so that data reuse reduces the off-chip memory access overhead.

3. Merging of the multi-unit execution order tables

Following steps 1 and 2, an operation execution order table is built for each processing unit, and the tables are then merged into a single multi-unit execution order table. Table 3 lists the execution order tables of three computing modules side by side; for each module it gives the data sources and the related sequence numbers that indicate data dependence.

Table 3: Multi-unit execution order table

Because different computing modules simultaneously compute the elements with the same coordinates in different matrices, and the results are stored on the FPGA chip, any data dependence among these elements is handled by data reuse over the on-chip data transfer network and does not involve off-chip memory scheduling. This step therefore does not consider data dependences between computing modules. Consequently, in Table 2, operations in the same column (the same computing module) are ordered, while operations in the same row (the same sequence number) can execute in parallel.

Next, time overlap processing is applied again while keeping the relative vertical execution order within each module unchanged: the execution order of each computing module is shifted up or down so that operations in different computing modules that use the same data source are placed in the same row of the table wherever possible.

Table 4: Multi-unit execution order table after time overlap

Comparing Table 3 with Table 4 shows the following. Because the first operation of computing module 2 and the third operation of computing module 1 both use ZHX(1,1), the first row of computing module 2 in Table 3 is moved to the third row of Table 4; computing module 2 is therefore idle during the first two execution time slots. For the same reason, the second operation of computing module 3 is moved to the fourth row of Table 4, aligned with the second operation of computing module 2. Because the first operation of computing module 3 and the first operation of computing module 1 both use YHX(1,1) and Para(1,2), the position of the first row of computing module 3 is unchanged.

4. Generation of the memory-access scheduling matrix

Data sources are merged once more on the basis of Table 4. First, for every row, the data sources needed by all computing modules are arranged by execution sequence number, identical data sources are merged horizontally, and the memory-access scheduling matrix is generated according to the loading order of the data sources. Second, memory accesses that are adjacent in execution time are examined; if they use the same data source, they are merged vertically, so that the data are reused from the data buffers inside the FPGA and repeated loading is avoided. This yields the final memory-access scheduling matrix. Table 5 is a schematic of the memory-access scheduling matrix: the rows are data source addresses in loading order, and the columns are the destinations of data transfer. "1" indicates that the data on the left can be used by the corresponding computing module, and "0" indicates that it cannot. "●" indicates that the corresponding data block has already been loaded into the FPGA and is valid, so it need not be loaded again from off-chip.

Table 5: Memory-access scheduling matrix

Data Source	Computing module 1	Computing module 2	Computing module 3
YHX(1,1)	1	0	1
Para(1,2)	1	0	1
Para(1,1)	1	0	0
VHX(1,1)	0	0	1
YHX(1,1)●	1	0	0
WHX(1,1)	1	0	0
WHX(1,1)●	1	0	0
ZHX(1,1)	1	1	0
Para(1,1)●	0	1	0
Para(1,2)●	0	1	0
ZHX(1,1)●	0	1	1
VHX(1,1)●	0	1	1
YHX(1,1)●	0	1	0
ZHX(1,1)●	0	1	1
WHX(1,1)●	0	0	1
…	…	…	…

The following factors must be considered when generating the memory-access scheduling matrix: (1) data dependence: if the different data sources on which the current computation depends are stored in different memory modules, they are loaded simultaneously through different channels; (2) if different data sources are stored in the same memory module, they are loaded in the order in which they are used, so that the pipeline starts as early as possible; (3) if the uses of the data sources are not dependent on one another, large data blocks are loaded first; (4) if an I/O channel is idle and a buffer inside the FPGA is free, the next data block is prefetched immediately.
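The construction of a scheduling matrix of the kind shown in Table 5 can be sketched as follows. This is an illustration under stated assumptions rather than the patented procedure: the input `STEPS` is hypothetical, the function name `build_schedule` is invented for this sketch, and the on-chip buffer is assumed never to evict a block once it has been loaded. Channel selection and prefetching, factors (1) to (4) above, are not modeled.

```python
# Hypothetical aligned execution table: one dict per execution time slot,
# mapping a computing module index to the data sources it uses in that slot.
STEPS = [
    {1: ["YHX(1,1)", "Para(1,2)", "Para(1,1)"],
     3: ["YHX(1,1)", "Para(1,2)", "VHX(1,1)"]},
    {1: ["YHX(1,1)", "WHX(1,1)"]},
]

def build_schedule(steps, n_modules):
    """Return rows of (data source, already_on_chip, per-module use flags)."""
    resident = set()  # assumption: loaded blocks stay valid on chip
    matrix = []
    for step in steps:
        # Horizontal merge: each source appears once per time slot.
        order = []
        for module in sorted(step):
            for src in step[module]:
                if src not in order:
                    order.append(src)
        for src in order:
            flags = [1 if src in step.get(m, []) else 0
                     for m in range(1, n_modules + 1)]
            # Vertical merge: a resident source is reused, not reloaded.
            matrix.append((src, src in resident, flags))
            resident.add(src)
    return matrix

schedule = build_schedule(STEPS, n_modules=3)
for src, reused, flags in schedule:
    print(src + ("●" if reused else ""), *flags, sep="\t")
loads = sum(1 for _, reused, _ in schedule if not reused)
print("off-chip loads:", loads, "of", len(schedule), "requests")
```

Rows flagged as reused correspond to the "●" entries of the matrix; only the remaining rows cause off-chip traffic.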

Experimental results show that using the final memory-access scheduling matrix to guide off-chip memory access scheduling, data distribution and data reuse reduces memory access requests by about 50%, thereby effectively reducing memory access overhead.

Because the two-dimensional matrices VX, WX and WBX have small storage requirements and simple computation, they are kept on the FPGA chip, and their computation and storage need little consideration in the design. The computation of the four-dimensional matrices WHX, VHX, YHX and ZHX is the core of the PKNOTS algorithm. According to the data dependences between the matrices shown in Fig. 4, the WHX matrix lies at the core of the data dependence graph; this section therefore uses the WHX matrix to illustrate the filling process of the four-dimensional matrices.

The four-dimensional triangular matrix WHX(i, j, k, l) can be decomposed into N three-dimensional triangular matrices WHX_i(j, k, l) (1 ≤ i ≤ N). Each three-dimensional matrix WHX_i(j, k, l) consists of N two-dimensional triangular matrices of side length N, and each two-dimensional upper triangular matrix corresponds to one Cell in Fig. 3.

As shown in Fig. 7, computation proceeds with the Cell as the basic unit, starting from layer 1 of WHX_1 (Cell_1): once layer 1 of matrix WHX_1(j, k, l), namely WHX_1(1, k, l), has been computed, layer 2, namely WHX_1(2, k, l), is computed, and so on until the last layer of WHX_1 is complete. Layers 1, 2, ..., N of the second three-dimensional matrix WHX_2 are computed next, followed by WHX_3, until layer N of the last matrix WHX_N is complete. The dashed lines in the figure and the numbers at the upper-right corner of each two-dimensional triangular matrix indicate the computation order of the Cells.

For each three-dimensional matrix WHX_i(j, k, l) in the four-dimensional matrix, the region-partitioned computation strategy shown in Fig. 8 is adopted: each layer (Cell) is split by columns into several regions, which are then computed one by one. In Fig. 8(a), every layer of the three-dimensional matrix is divided into three regions; the region numbers indicate the order in which the regions are computed, and the dashed arrows indicate the computation order of the elements within each region. Within each region, tasks are assigned by the column-wise round-robin strategy shown in Fig. 8(b): multiple processing elements compute the current region of the current Cell in parallel along the matrix diagonals, proceeding from bottom to top.
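The traversal described above (Cells in order, each Cell split by columns into regions) can be sketched as follows. The function names `cell_order` and `split_regions` are hypothetical, and the sketch assumes that each region is as wide as the number of PEs, with a narrower final region when the column count is not a multiple of that width.

```python
def cell_order(n):
    """Order in which Cells (i, j) are computed: all layers j of WHX_1,
    then all layers of WHX_2, and so on up to WHX_n."""
    return [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]

def split_regions(n_cols, width):
    """Split columns 1..n_cols of one Cell into consecutive regions of at
    most `width` columns; regions are computed in the returned order."""
    return [(start, min(start + width - 1, n_cols))
            for start in range(1, n_cols + 1, width)]

print(cell_order(2))        # Cells of a tiny example with N = 2
print(split_regions(6, 2))  # three regions, as in the Fig. 8(a) layout
print(split_regions(7, 3))  # last region is narrower
```

Each tuple returned by `split_regions` is an inclusive range of column numbers.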

For each region shown in Fig. 8(b), each PE is responsible for computing one column of the current region, and the column number of the elements computed by a PE corresponds one-to-one to the index of that PE in the array. The p shaded columns in the figure represent the region currently being computed; they are assigned to p PEs and computed in parallel. Each PE starts from the bottom of its own column and proceeds upward. When computation of the current region starts, the elements computed by all PEs lie on the main diagonal of the triangular matrix (the cells marked with an asterisk in the figure are the initial positions): PE_1 computes element (k, l), PE_2 computes element (k+1, l+1), ..., and PE_p computes element (k+p-1, l+p-1). According to the data dependences of the algorithm, elements on the same diagonal do not depend on one another, so the p elements on the different PEs can be computed in parallel. Moreover, because elements on the same diagonal require equal amounts of computation, all PEs can advance upward in lockstep, and at any time the elements being computed by the PE array always lie on the same diagonal of the matrix. Because the columns of a triangular matrix contain different numbers of elements, a PE suspends computation and enters a waiting state when the row coordinate of its element is k = 1 (if the results must be written back to off-chip memory, the PE issues a write-back request while waiting). The PEs in the array thus enter the synchronized waiting state one after another in index order and issue their write-back requests.
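The lockstep advance of the PE array along diagonals can be illustrated with the short sketch below. It assumes an upper triangular Cell with 1-based coordinates (row, column) and row ≤ column, so that the bottom of each column is its main-diagonal element; the function name `wavefront` is hypothetical.

```python
def wavefront(first_col, p):
    """List, per time step, the elements computed by p PEs on one region.

    PE q (0-based) owns column first_col + q, starts on the main diagonal
    and moves up one row per step; it stops after finishing row 1.
    """
    cols = [first_col + q for q in range(p)]
    steps = []
    t = 0
    while True:
        active = [(c - t, c) for c in cols if c - t >= 1]
        if not active:
            break
        steps.append(active)
        t += 1
    return steps

for t, elems in enumerate(wavefront(first_col=1, p=3)):
    print("step", t, elems)
```

At every step, all active elements satisfy column - row = t, so they lie on the same diagonal, and the PE with the lowest index runs out of rows first, matching the order in which the PEs enter the waiting state.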

The other three four-dimensional matrices VHX, YHX and ZHX are computed in the same order as WHX. The three two-dimensional matrices VX, WX and WBX are filled column by column from left to right, with each column filled from bottom to top. To achieve parallel computation, seven computing modules (PEs) are designed, each responsible for the computation of one matrix. At each step the PE array simultaneously computes the unit (i, j) with the same index in all seven matrices: for the two-dimensional matrices VX, WX and WBX this unit is a single element, whereas for the four-dimensional matrices WHX, VHX, YHX and ZHX it is one layer in Fig. 7, that is, one "Cell".
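The fill order of the two-dimensional matrices can be written out as a small sketch. It assumes an upper triangular n × n matrix with 1-based coordinates (row, column), so that "bottom" means the main-diagonal element of each column; the function name `fill_order_2d` is hypothetical.

```python
def fill_order_2d(n):
    """Columns left to right; within each column, from the diagonal upward."""
    order = []
    for col in range(1, n + 1):
        for row in range(col, 0, -1):
            order.append((row, col))
    return order

print(fill_order_2d(3))
```

Under this order, every element to the left of and below a given element has already been filled when that element is reached.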

Fig. 10 shows the parallel computing structure for the four-dimensional dynamic programming algorithm, based on a heterogeneous multi-PE linear array. It consists mainly of an array control module, a multi-PE computing array, a storage module, and an array synchronization and write-back control module. The array control module is responsible for initializing the computing array, assigning tasks, and controlling the switching of computation regions.

The computing array consists of seven PE modules, which compute the seven free-energy matrices respectively; all PEs are peers and are connected to the data bus. Each computing module has independent input and output data buffers (Data Buf and Cache): Data Buf buffers the data loaded from off-chip, and Cache stores the computation results of the module. The computational logic and output data buffers of all PEs are interconnected through a data transfer network, so that data are reused by sharing. The data buffers of the whole computing array and the private data Caches of the PEs are implemented with multi-port BlockRAM blocks on the FPGA chip. To avoid access conflicts, each computing module keeps its own copy of the RNA sequence and of the free-energy parameter table, implemented with the distributed memory resources on the FPGA chip. In addition, inter-PE data transfer register groups are provided for fast transfer of PE computation results. The array synchronization and write-back control module is connected to the output buffer and computational logic of each PE; it synchronizes the PE array and writes the results held in the Caches back to off-chip memory in turn.

Comparative test: a hardware PKNOTS algorithm accelerator was implemented on a test platform consisting of a general-purpose computer and an algorithm accelerator. The host is configured with an Intel Core 2 Quad Q9400 2.66 GHz processor and 4.0 GB of main memory. The accelerator hardware mainly comprises one Xilinx Virtex-7 series FPGA chip (XC7VX485T) and three DDR3-1600 DRAM modules with a capacity of 8 GB. The accelerator is connected to the host through an SFP+ optical fiber data channel (implemented with the GTX transceivers integrated in the XC7VX485T chip), with an effective data transfer bandwidth of up to 10 Gb/s. The accelerator supports dynamic reconfiguration and can switch between CM models of different scales within 60 ms; compared with conventional configuration methods such as JTAG or parallel SelectMAP, whose configuration time is on the order of seconds, the configuration efficiency of the FPGA is improved by 2 to 3 orders of magnitude. The RNA secondary structure prediction software is PKNOTS-1.08, developed by Elena Rivas of the Washington University School of Medicine. Tests were run on three different platforms: the Intel Core 2 Quad Q9400, the Intel Xeon(R) X5670 CPU, and the FPGA algorithm accelerator.

The experimental results (Table 6) show that the XC7VX485T FPGA platform can accommodate only one PKNOTS acceleration engine. The main reason is that buffering the "Cell" data blocks of the four-dimensional matrices occupies too much memory: storage resource utilization reaches 82%, whereas logic resource utilization is only 28%. Because the operations are mainly multiply-add operations and the design contains no large-scale multiplexers (MUX) or centralized memory access ports, the system clock frequency reaches 210 MHz; insufficient storage resources are thus the main bottleneck of the implementation. With the XC7VX1140T, currently the largest commercial FPGA device, at least 2 PKNOTS acceleration engines could be implemented, so that the structures of 2 RNA sequences could be predicted simultaneously and longer sequences could be supported.

Table 6: Implementation results of the four-dimensional dynamic programming algorithm on the FPGA platform

Parallel effect

Table 7: PKNOTS algorithm speedup (time unit: seconds)

For the experiment, 4 groups of RNA sequences with lengths between 30 and 176 bp were selected. The average execution time of the PKNOTS-1.08 program was measured on the Intel Q9400 and Intel Xeon(R) X5670 CPU platforms and compared with that of the hardware accelerator. As Table 7 shows, the algorithm accelerator achieves a speedup of 2 times for structure prediction of a sequence of 30 bases, and a speedup of 51.8 times when the test sequence is 176 bp long. A speedup of more than 25 times is also obtained relative to the Intel Xeon(R) X5670. Limited by the logic and storage capacity of the XC7VX485T FPGA device, pseudoknotted structure prediction is currently possible only for sequences shorter than 256 bp. Synthesis results from the Xilinx EDA tools show that an XC7VX1140T chip can hold 2 PKNOTS acceleration engines and predict the structures of 2 sequences simultaneously, giving a speedup of more than 60 times relative to current mainstream CPU platforms.

Claims

1. A method for hardware acceleration of the prediction of RNA secondary structures with pseudoknots, wherein a heterogeneous computing system consisting of a host and a reconfigurable algorithm accelerator is first built; the host then sends the formatted thermodynamic model parameters and the encoded RNA sequence to the reconfigurable algorithm accelerator; the seven computing modules of the algorithm accelerator execute the non-backtracking PKNOTS algorithm in MPMD mode; during computation, a matrix dimension reduction method decomposes the four-dimensional matrix into N three-dimensional matrices, and fine-grained parallel computation is then achieved by a task partitioning strategy of round-robin partitioning by region within each layer and column-parallel processing within each region; and the n PEs inside each computing module compute, in SPMD mode, n data items located in different columns of the region simultaneously, n being a natural number.

2. The method for hardware acceleration of the prediction of RNA secondary structures with pseudoknots according to claim 1, characterized in that the PKNOTS computation is implemented by three two-dimensional matrix computing modules and four four-dimensional matrix computing modules; the three two-dimensional matrix computing modules are PE_VX, PE_WX and PE_WBX, and the four four-dimensional matrix computing modules are PE_WHX, PE_VHX, PE_ZHX and PE_YHX.

3. The method for hardware acceleration of the prediction of RNA secondary structures with pseudoknots according to claim 2, characterized in that the two-dimensional matrix computing modules PE_VX, PE_WX and PE_WBX are identical in structure, each containing a sub-PE controller, a sub-PE computing unit, a local memory and a data transfer register; the sub-PE controller controls the timing of computation and data memory access; the core of the sub-PE computing unit is a 32-bit adder that adds two input operands, and its result is written simultaneously to the local memory and to the data transfer register; the local memory buffers the computation results of an entire column of the two-dimensional matrix, while the data transfer register stores only the result of the current element, for immediate use by the next computing module.

4. The method for hardware acceleration of the prediction of RNA secondary structures with pseudoknots according to claim 2, characterized in that the four-dimensional matrix computing modules PE_WHX, PE_VHX, PE_ZHX and PE_YHX are identical in structure; the function of a four-dimensional matrix computing module is to compute a two-dimensional Cell in parallel, each Cell being a two-dimensional triangular matrix.