CN103377035A

CN103377035A - Pipeline parallelization method for coarse-grained streaming application

Info

Publication number: CN103377035A
Application number: CN2012101075275A
Authority: CN
Inventors: 刘鹏; 黄春明; 史册; 于绩洋; 刘扬帆; 郭俊; 姚庆栋
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2012-04-12
Filing date: 2012-04-12
Publication date: 2013-10-30

Abstract

The invention discloses a pipeline parallelization method for coarse-grained streaming applications. The pipeline parallelization method includes performing classic data profiling and dependency analysis on serial C-codes to acquire a task dependence graph, performing dependence transformation on the task dependence graph to acquire a directed acyclic graph, building a system feature graph, performing task scheduling on the directed acyclic graph according to the system feature graph and judging whether a task scheduling result meets performance requirements or not, if not, then aggregating and splitting task of the directed acyclic graph to acquire a new directed acyclic graph, selecting and calculating the highest-cost task of the new directed acyclic graph to acquire a new calculated hot spot region, returning to performing the dependency analysis again, segmenting and modifying the serial C-codes according to the task scheduling result so as to obtain parallelized C-codes, encoding to generate parallel executable files through an encoder, and loading the parallel executable files to a target hardware platform to execute. The pipeline parallelization method is adaptable to multilayer nested loop structures and capable of extracting parallelism of the multilayer loop.

Description

Flowing water parallel method for the application of coarse particle degree stream

Technical field

The present invention relates to a kind of computer application field, particularly a kind of stream to the coarse particle degree is used the method that realizes the flowing water parallelization.

Background technology

In order to take full advantage of the resource of multiple nucleus system, must solve the problem of parallel programming.Because the popular and for a long time set serial programming thinking of programmer of C language so that in a large number the c program code carry over, and these application programs are usually as upper layer application or its Software tool of multiple nucleus system.Nearly 85% embedded system development person is still using the C/C++ coding at present.Use if allow the programmer rewrite this class with new parallel programming language, then development difficulty is large, the construction cycle is long.Therefore need at present the urgent executed in parallel efficient of serial c program on multiple nucleus system that improves.Yet actual conditions are the effective ways that also lack at present the C/C++ parallelization.

Use at the stream that built-in field is widely used, process in the supervisor such as audio frequency, video, encryption and signal, this class method has abundant concurrency, and they have some common characteristics: 1) data-driven; 2) comprise abundant loop structure, and mostly be the multilayer circulation body; 3) there is the complex control dependence between function.Therefore, extract the concurrency of hiding in the stream application, need rationally efficiently streamlined disposal route.

The flowing water parallel method that extracts in the application program at present can be classified as three classes: IMT (IndependentMulti-Threading), CMT (CyclicMulti-Threading) and PMT (Pipelined Multi-Threading).Wherein the IMT technology does not allow cross-thread rely on to exist, and during this technology was used at first and calculates based on the science of array, foremost IMT technology was DOALL, the situation of Existence dependency relationship not between each iteration in it is only applicable to circulate.The CMT technology is to the replenishing and expansion of IMT technology, mainly is in order to be applicable to the situation of Existence dependency relationship between iteration, DOACROSS belongs to the CMT technology, and it guarantees data interaction between iteration by having inserted synchronization mechanism.DSWP (Decoupled Software Pipeline) technology belongs to the PMT technology, and the DSWP technology is intended to the pipelined parallel of development cycle body.The processing mode that the DSWP technology is compared the DOACROSS technology is different, and the DSWP technology is that loop code is cut, and is divided into different threads and is assigned on the different processor cores, and each thread is arranged according to the parallel mode of flowing water.But there is following defective in above method: 1) they mainly are for extracting the fine granularity Thread level parallelism, the method of polymerization is taked in the processing that they rely on control, this point very easily formed the set of tasks of coarsegrain for coarse particle degree application program, become performance bottleneck; 2) above method is only applicable to the exploitation to innermost loop, and is undesirable for the effect structure of multilayer nest circulation; 3) DOACROSS technology and DOALL technology are only applicable to the application of the regular memory access of tool and simple control dependence, and the science that is similar to is calculated, and most application program has erratic control stream and complicated memory access, accesses such as pointer.

Also there are some research work to extract concurrency from the structure dissection to source program, but this class work often stresses to carry out modeling and information search in the structure to program, and lack conversion for the dependence between modules, rely on the programmer manually to finish aspect the concurrency and do not have systematic method to do guidance extracting.

Summary of the invention

Technical matters to be solved by this invention provides a kind of flowing water parallel method that can flow for the coarse particle degree application, be applicable to the structure of multilayer nest circulation, can extract the concurrency of multilayer circulation.

For solving the problems of the technologies described above, the technical solution adopted in the present invention provides a kind of flowing water parallel method of using for coarse particle degree stream, may further comprise the steps:

A) serial C code is carried out the typical data analysis and obtain calculating the hot spot region;

B) will calculate the hot spot region carries out dependency analysis and obtains Task Dependent figure TDG;

C) Task Dependent figure is relied on conversion and obtain directed acyclic graph TDGAT;

D) hardware model of setting up target hardware platform obtains architectural feature figure ACG;

E) for architectural feature figure directed acyclic graph is carried out task scheduling and obtain as a result SR of task scheduling;

F) judge whether the task scheduling result satisfies performance requirement, if meet the demands, execution in step G then) and subsequent step, if do not meet the demands, then the task in the directed acyclic graph is carried out polymerization and fractionation, obtain new directed acyclic graph, select the task of computing cost maximum in the new directed acyclic graph, obtain new calculating hot spot region, return again step B) continue to carry out;

G) according to the task scheduling result serial C code is cut apart, revised and obtain the parallel C code;

H) use the compiler compiling that is applicable to target hardware platform to generate parallel executable file;

I) described parallel executable file is loaded on the target hardware platform carries out.

Steps A) " serial C code is carried out the typical data analysis obtains calculating the hot spot region " described in may further comprise the steps:

A) insert Debugging message in serial C code, compiling generates the serial executable file;

B) allow described serial executable file carry out at computer platform, collect the essential information of runtime environment, be specially the relevant information of obtaining internal storage access, function calling relationship and branching selection;

C) allow described serial executable file carry out at target hardware platform, collect the information relevant with described target hardware platform, be specially the size that takies of the instruction space of computing cost, program of function and data space, choose the function of computing cost maximum as calculating the hot spot region.

Step B) " will calculate hot spot region carry out dependency analysis obtain Task Dependent figure TDG " described in comprises and carries out static scanning to calculating the hot spot region, follow the tracks of the function read-write situation of calculating in the hot spot region, set up data dependence relation and control dependence between each function, according to the data dependence relation between each function and control dependence, set up Task Dependent figure TDG, TDG=(V, E, w _v, w _e), wherein V represents the set of task, E represents the set of dependence edge, i.e. data dependence relation between task and control dependence, w _vBe the data structure of task, used and stored corresponding quantitative information, w _eBe the data structure of dependence edge, be used for storing corresponding quantitative information.

Step C) " Task Dependent figure is relied on conversion obtains directed acyclic graph TDGAT " described in and may further comprise the steps:

A) judge whether Task Dependent figure TDG exists control to rely on, if exist, then eliminate control and rely on, enter step b), if do not exist, then enter step b);

B) judge whether Task Dependent figure TDG exists data dependence between iteration, if exist, then eliminate data dependence between redundant iteration, enter step c), if do not exist, then enter step c);

C) judge whether Task Dependent figure TDG exists the ring-type data dependence, if exist, then set up strong connected component SCC, generate directed acyclic graph TDGAT, if do not exist, then directly generate directed acyclic graph TDGAT;

D) through dependence obtain after the conversion directed acyclic graph TDGAT=(V ', E ', w ' _v, w ' _e), the wherein set of V ' expression task, the set of E ' expression dependence edge, w ' _vBe the data structure of task, used and stored corresponding quantitative information, w ' _eIt is the data structure of dependence edge, be used for storing corresponding quantitative information, its structure and TDG are basically identical, difference is only to exist among the TDGAT data dependence limit, it is the data dependence relation between task, each node is a strong connected component SCC among the TDGAT, does not have branch's control task among the TDGAT, the w ' among the TDGAT _vAnd w ' _eThe data field that comprises and the w among the TDG _vAnd w _eIdentical.

Step D) figure of architectural feature described in ACG=(P, R, w _p, w _r), for the characteristic of describing target hardware, wherein P represents the set of processor, R represents the link set between processor, w _pThe size of storage space on the processor, w _rThe bandwidth of expression link.

Step e) " for architectural feature figure directed acyclic graph is carried out task scheduling and obtains as a result SR of task scheduling " described in may further comprise the steps:

I) priority that sets the tasks is selected ready task;

II) select suitable processor for ready task.

Step I) " priority sets the tasks " may further comprise the steps described in:

A) with task according to data dependence relation, be divided in the different task groups, task division that can executed in parallel is to same task groups, task possibility Existence dependency relationship in the different task group, for these task groups arrange priority, the task groups priority that more early can begin to carry out is higher, and limit priority is 0;

B) for the different task in each task groups arranges task priority TP, according to following formula calculation task priority, task priority TP is larger to each task, and task priority is higher:

Known

The longest path length of the bl end task that is current task in the directed acyclic graph TDGAT wherein, bl is larger, indicates that this task is larger on the impact of follow-up work, should priority processing; DynamicMem is the size of the communication buffer of this required by task, DynamicMem (n ' _i)=; _jC (e ' _Ij), wherein c (e ' _Ij) expression dependence edge e _IjThe traffic; StaticMem is the data of task and the size of instruction storage space;

≠ be weight coefficient,

≠=1/M, P are the set of processor, p _jBe a processor among the P, t represents task n _i' at processor p _jOn execution time, R is average data transfer rate in the link, i.e. link bandwidth, M is the average storage size in each processor, cp is the critical path among the TDGAT; Len is path, i.e. computing cost on this path and communication overhead sum.

Step II) " for ready task is selected suitable processor " described in may further comprise the steps:

Ready task calculated can obtain factors A vailableFactor, i.e. AF, is defined as follows AF as the standard of weighing this processor and whether be fit to ready task with AF:

Known

Wherein DL (n ' _i, p _j) be the data localitys, expression task n ' _iBe assigned to processor p _jUpper can be multiplexing the data space size or the size of diminishbb communication buffer; DRT (n ' _i) be the data ready time, represent task n ' _iThe time that can begin to carry out; PEAT (n ' _i, p _j) be the processor ready time, expression processor p _jFree time, can begin the n ' that executes the task _iTime, this time equals processor p _jA upper task n ' who carries out _jConcluding time, as task n ' _iAnd p _jA upper task n ' who carries out _jWhen having the Branch Tasks of mutex relation, PEAT (n ' _i, p _j) equal task PEAT (n ' _j, p _j), i.e. task n ' _iWith task n ' _jHas identical processor ready time; T (n ' _i, p _j) be computing cost, representative is with task n ' _iBe assigned to processor p _jUpper execution needed computing time; * be adjust DL (n ' _i, p _j) the coefficient factor of proportion in Available Factor.

Step F) " task in the directed acyclic graph being carried out polymerization and fractionation " described in may further comprise the steps:

A) according to task scheduling SR as a result, the working condition of checking each processor on the target hardware platform;

B) task that computing cost is less condenses together, and is assigned on the same processor, to improve the utilization ratio of processor as far as possible;

C) can not be assigned to the little task of computing cost on the same processor by polymerization and split and come, the free time that is inserted into each processor carries out in the groove.

Step G) described in " according to the task scheduling result to serial C code cut apart, revise obtain the parallel C code " may further comprise the steps:

A) serial C program in machine code section corresponding to each task among the figure dependency graph TDGAT that set the tasks, the program code of correspondence is extracted, redefine and be packaged into function body, this function body is concrete C code corresponding to task, is responsible for the input data are carried out calculation process;

B) for guaranteeing that this task can be to the correct data of other tasks transmission, in function body, revise and add corresponding the processing, be specially the useful data that generates is copied, divides or distributes local space to preserve;

C) communicate by letter and synchronization statements with the end insertion in the stem of function body, because the execution of task has atomic properties, before any data interaction operation only allows the computing execution of the task that occurs in to begin or after finishing, these communications and synchronization statements are provided by operating system, obtain at last the parallel C code through above processing.

Beneficial effect: the method that the present invention proposes is based on didactic, by the loop optimization Stepwise Refinement, adjusts the pipelining-stage granularity, progressively excavates out the concurrency of multilayer circulation with this; Setting up on the basis of TDG, for dependence complicated between task, propose a kind ofly to be different from traditional simple aggregation of passing through and to eliminate the method that ring-type relies on, by branching polymerization, supposition and identification that redundancy is relied on, can effectively eliminate control and rely on and untie the data dependence ring; The present invention has also defined the output TDGAT that relies on conversion, and TDGAT compares with TDG, and each task is a relatively complete full-mesh component SCC; The method for scheduling task that the present invention proposes has taken into full account the variation of task dependencies, dispatching method to be reducing the execution time as optimization aim as far as possible, realizes optimization that storage is taken by increasing data locality and multiplexing storage space on the basis that guarantees the execution time; The task scheduling that the present invention adopts has also considered to exist the situation of Branch Tasks, by the alternative between the judgement task, rationally utilizes hardware resource.

Description of drawings

By reference to the accompanying drawings, other characteristics of the present invention and advantage can be from below by becoming clearer the explanation of giving an example the preferred implementation that principle of the present invention is made an explanation.

Fig. 1 the present invention is directed to the schematic flow sheet that the coarse particle degree flows a kind of embodiment of the flowing water parallel method of using;

Fig. 2 the present invention is directed to the schematic flow sheet that relies on conversion in a kind of embodiment of the flowing water parallel method that coarse particle degree stream uses;

Fig. 3 is the schematic flow sheet that the present invention is directed to task scheduling in a kind of embodiment of the flowing water parallel method that coarse particle degree stream uses.

Embodiment

Below in conjunction with accompanying drawing embodiments of the present invention are described in detail:

As shown in Figure 1, a kind of flowing water parallel method of using for coarse particle degree stream.

Step 1, process analysis.The major function of process analysis is to select to calculate hot spot region CH (computing hotspot), and it is set up the Task Dependent graph model.Regard the modules of program as TU task unit, and the relation between them is represented by the limit between the task.Process analysis is divided into typical data analysis and dependency analysis two parts.Typical data analysis is in order to obtain the runtime environment information of program, and dependency analysis is the relation that relies in order to obtain data dependence and control.Process analysis comprises that serial C code is carried out typical data analysis to be obtained calculating the hot spot region and will calculate the hot spot region and carry out dependency analysis and obtain Task Dependent figure TDG (Task Dependence Graph).

Wherein serial C code is carried out typical data analysis and obtain calculating the hot spot region and be included in the serial C code and insert Debugging message, compiling generates the serial executable file; Allow described serial executable file carry out at computer platform, collect the essential information of runtime environment, be specially the relevant information of obtaining internal storage access, function calling relationship and branching selection; Allow described serial executable file carry out at target hardware platform, collect the information relevant with described target hardware platform, be specially the size that takies of the instruction space of computing cost, program of function and data space, choose the function of computing cost maximum as calculating the hot spot region.

To calculate the hot spot region carries out dependency analysis and obtains Task Dependent figure TDG and comprise and carry out static scanning to calculating the hot spot region, follow the tracks of the function read-write situation of calculating in the hot spot region, set up data dependence relation and control dependence between each function, according to the data dependence relation between each function and control dependence, set up Task Dependent figure TDG, TDG=(V, E, w _v, w _e), wherein V represents the set of task, E represents the set of dependence edge, i.e. data dependence relation between task and control dependence, w _vBe the data structure of task, used and stored corresponding quantitative information, w _eBe the data structure of dependence edge, be used for storing corresponding quantitative information.

Represent a task, this task shows as one section statement block in the C code, and this statement block can be function, and n also comprises nested subfunction, also can be some continuous statements.

n _i，n _j∈V。e _IjRepresent one group from task n _iTo task n _jDependence, data dependence relation or control dependence, at this moment pre (n _j)=n _iTitle task n _iBe task n _jFirst sequence task, same succ (n _i)=n _jTitle task n _jBe task n _iSubsequent tasks.The different 1. control task V of four classes that divide of function according to task _c, namely statement is judged by loop control statement or branch; 2. cycle task V _l, namely in loop body, can repeatedly being carried out of task; 3. Branch Tasks V _b, the task in branched structure; 4. common task V _o, i.e. type except above three kinds of tasks.The limit also is divided into 1. data dependence limit E between iteration accordingly _i, represent the data dependence relation between different iteration in the loop body; 2. control dependence edge E _c, the dependence that representative is initiated by the control node; 3. general data dependence edge E _o, the type of representative except above two kinds of dependence edges.Set up data structure for each node task and dependence edge in addition and stored corresponding quantitative information, as shown in the table:

To the attribute i of branch _bDefine a kind of data structure of describing the Branch Tasks characteristic, the branch information tabulation:

Each Branch Tasks has such record list item.Each territory of this list item is defined as follows: branch_level is branch's rank, and if the number of the node of divergence of passing through when representation program is carried out this node is the non-branch node of this node level=0 then.Branch_label[i] be branch's label, the name of i node of divergence of expression present node process.Branch_condition[i] be branch condition, conditional value when representing present node through i node of divergence, branch_exclusive is the tabulation of branch alternative, branch_exclusive[k] k Mutex-task in the Mutex-task tabulation of expression present node.

To cycle attribute i _lDefine a kind of data structure of describing the cycle task characteristic, the cyclical information tabulation:

Each cycle task has such record list item.Each territory is defined as follows in the table: loop_level is the circulation rank, when representation program is carried out this node the number of the cycle control node that must process, reflected the residing loop body degree of depth of this cycle task.Loop_label[i] be the circulation label, represent the name of the control node of the residing i layer circulation of present node.Loop_num[i] be number of iterations, the iteration total degree of the residing i layer circulation of expression present node, loop_childnode[k] be the cycle task child node, record depends on k task of this cycle task.

TDG is suitable for coarseness, has the control structures such as circulation, branch, satisfies the program of data flow driven characteristics.TDG combines the characteristics of c program, can be preferably the control flow structure of c program is represented to unite with data flow architecture.

Step 2: rely on conversion.May there be complicated dependence among the Task Dependent figure through process analysis foundation, concurrency between task has been hidden, be that follow-up task scheduling is prepared in order to extract concurrency, proposition is to the dependence transform method of Task Dependent figure, the dependence edge that deletion is redundant, task is carried out suitable polymerization and fractionation, finally obtain clear and fully to express the directed acyclic graph TDGAT of program parallelization.Rely on shift process as shown in Figure 2, specifically comprise judging a) whether Task Dependent figure TDG exists control to rely on, if exist, then eliminate control and rely on, enter step b), if do not exist, then enter step b); B) judge whether Task Dependent figure TDG exists data dependence between iteration, if exist, then eliminate data dependence between redundant iteration, enter step c), if do not exist, then enter step c); C) judge whether Task Dependent figure TDG exists the ring-type data dependence, if exist, then set up strong connected component SCC, generate directed acyclic graph TDGAT, if do not exist, then directly generate directed acyclic graph TDGAT; D) through dependence obtain after the conversion directed acyclic graph TDGAT=(V ', E ', w ' _v, w ' _e), the wherein set of V ' expression task, the set of E ' expression dependence edge, w ' _vBe the data structure of task, used and stored corresponding quantitative information, w ' _eIt is the data structure of dependence edge, be used for storing corresponding quantitative information, its structure and TDG are basically identical, difference is only to exist among the TDGAT data dependence limit, it is the data dependence relation between task, each node is a strong connected component SCC among the TDGAT, does not have branch's control task among the TDGAT, the w ' among the TDGAT _vAnd w ' _eThe data field that comprises and the w among the TDG _vAnd w _eIdentical.

Rely on conversion and be divided into three steps: eliminate control and rely on, eliminate between redundant iteration data dependence and set up strong connected component SCC.

Eliminate the control dependence and comprise that eliminating branch's control relies on and eliminate the cycle control dependence.

Eliminating branch's control dependence specifically comprises: the alternative of Branch Tasks, and there is and only has the selected execution of a meeting in some branches under the same Rule of judgment at synchronization, that is to say that these branches are mutual exclusions, and this is the characteristic of Branch Tasks.Eliminate branch's control dependence edge, in order to indicate alternative, we upgrade the alternative tabulation branch_exclusive of branch of each Branch Tasks when eliminating branch's control dependence.Concrete steps are: the 1) i of each branch node in the scanning _bInformation namely satisfies l (n)=1,5 or 7 node; Suppose that pending branch node is n _j, n _jPV _B, n _jBranch information i _bThe rank branch_level=N of branch of middle record begins to scan one by one n from i=0 _jBranch information i _bThe label branch_label[i of branch of middle record] and branch condition branch_condition[i].By the label branch_label[i of branch] can navigate to domination n _jThe branch control node of i layer branch.If i＞N-1 represents branch node n _jProcessing finish, search for next branch node.2) traversal n _jAll output branchs control dependence edges of i layer branch control node, find other Branch Tasks that arranged by this control node.Check respectively the label branch_label of branch and the branch condition branch_condition of these Branch Tasks, judge these tasks whether with task n _jMutual exclusion.If mutual exclusion is set up, the task with these joins n so _jBranch alternative tabulation branch_exclusive in.3) branch is controlled the content of node, copy merging to branch node n _jIn.4) repeating step 1) to 3), until handle branch nodes all among the TDG, namely eliminate all branch's control nodes and branch's control dependence edge, leave out at last branch's control nodes all among the TDG.

Eliminating cycle control relies on and specifically comprises: the cycle task information i that provides according to TDG _l, can know the behavioral characteristic of this loop body, comprise the degree of depth of this circulation, carry out number of times, the information such as the dependence between the iteration.If a loop body is always carried out fixing iterations or always repeatedly carried out (carrying out minimum iterations/maximum iterations＞1/2), we claim this loop body to lay particular stress on so.If a loop body is laid particular stress on, think that so the behavior of this loop body is highly predictable.The cycle control node is negligible to the domination of cycle task so.Eliminate the cycle control dependence edge, to the processing of cycle control dependence edge, adopt the supposition technology to process, concrete steps are as follows: the cyclical information i that observes each cycle task _lCirculation rank loop_level, circulation label loop_label and the number of iterations loop_num of middle record confirm whether this circulation has to lay particular stress on, i.e. biased.If this circulation is laid particular stress on, think that so the behavior of this class circulation is foreseeable, they are selected as circulation to be inferred.All cycle controls of eliminating in the selected loop body rely on.Insert code and realize the wrong mechanism of inferring.In each cycle task, insert non-conditional control statement, such as while (true).The acquiescence cycle task can repeatedly be carried out, and all waits the order of cycle control task to be checked no longer at every turn.Realize that the wrong function of inferring mechanism is inserted in the afterbody task in the loop body.Mistake infers that the major function of mechanism is achieved in that at first, and mistake infers that function can distribute the data result in the last round of iteration of the each circulation of buffer zone record, and whenever enters and new do Data Update one time once taking turns iteration; Then, this function can be inquired about the state of cycle control task, to determine this time infer whether hit; If prediction thinks so that accurately this iteration is effective, task continue to be carried out, otherwise, infers that unsuccessfully this time iteration is invalid, inferred by mistake and be responsible for data rewind to the result of effective iteration of last time and withdraw from circulation.

The data dependence concrete steps comprise that all dependences of temporary transient deletion are apart from d between the iteration of elimination redundancy _DepGreater than data dependence limit between 1 iteration; Then, to eliminate d _DepGreater than the TDG behind the data dependence between 1 iteration be the basis, find out the strong connected component among the figure, each strong connected component is labeled as a SCC.Comprise at least one task among each SCC.In case SCC divides and to finish, and just can learn which SCC is each task be assigned among the TDG, and can know that strong connected component between different task is apart from d _s(n _i, n _j).The strong distance that is communicated with refers to any two node n among the figure _iAnd n _jThe number of the SCC at interval between the SCC at place separately.If n _i, and n _jBetween do not have communication path, d so _s(n _i, n _j)=-1.If n _i, and n _jBetween have communication path, but two tasks are directly to link to each other, and are middle without other nodes, so d _s(n _i, n _j)=0.If n _i, and n _jBetween have many communication paths, an and the longest paths (n _i, n _I+l..., n _I+k, n _j) passed through k node, so d _s(n _i, n _j)=k+l.At last, whether the data dependence limit belongs to data dependence limit between redundant iteration between the iteration of deleting before judging, if it is deletion, otherwise again add back among the TDG.One of condition is satisfied on the data dependence limit between redundant iteration:

Suppose to exist data dependence limit e between iteration _IjPE _i, e _IjStart from task n _i, end at task n _j, n _i, n _jPV _ld _Dep(e _Ij) expression dependence edge e _IjThe dependence distance, d _s(n _i, n _j) expression task n _iWith task n _jStrong connected component distance.

If d _Dep(e _Ij)＞d _s(n _i, n _j), d _s(n _i, n _j) 〉=1 is if expression is carried out follow-up pipelining-stage division, so n with the basis that is divided into of present SCC _iThe SCC at place is to n _jThe data dependence of place SCC always can in time be satisfied, so e _IjBe data dependence limit between the iteration of redundancy.If d _Dep(e _IjD is then inquired about in)=1 _s(n _i, n _j) value.If d _s(n _i, n _j)=-1 illustrates task n _jWith task n _jIn same SCC, e _IjBe the data interaction of SCC inside, e _IjBe data dependence limit between the iteration of redundancy.If d _s(n _i, n _j)=0 is actually d _Dep(e _Ij)＞d _s(n _i, n _j), d _s(n _i, n _jWork as d in) 〉=1 _s(n _i, n _jThe specific example of)=0 o'clock, the e of this moment _IjAlso belong to data dependence limit between redundant iteration.

Seek remaining ring-type data dependence in task image, the limit and the task that form these ring-types dependences are carried out polymerization, Task Dependent figure TDG becomes a directed acyclic graph TDGAT.

Step 3: hardware description.The hardware model that hardware description is namely set up target hardware platform obtains architectural feature figure ACG.Wherein said architectural feature figure ACG=(P, R, w _p, w _r), for the characteristic of describing target hardware, P represents the set of processor, R represents the link set between processor, w _pThe size of storage space on the processor, w _rThe bandwidth of expression link.

Step 4: task scheduling.The task scheduling fundamental purpose is under the prerequisite of given directed acyclic graph, different tasks is assigned on the different processors, and determines their execution sequence and execution time, in order to realize optimization aim.If parallel scheme fails to reach performance requirement, task is split and polymerization, adjust its granularity, and reselect the calculating hot spot region, repeat the operation of above-mentioned one or two steps.Task scheduling comprises that for architectural feature figure directed acyclic graph being carried out task scheduling obtains as a result SR and judge whether the task scheduling result satisfies performance requirement of task scheduling, if meet the demands, then enter next step, if do not meet the demands, then the task in the directed acyclic graph is carried out polymerization and fractionation, obtain new directed acyclic graph, select the task of computing cost maximum in the new directed acyclic graph, obtain new calculating hot spot region, the hot spot region of will calculating of returning again in the step 1 process analysis is carried out dependency analysis and is obtained Task Dependent figure TDG, execution in step two again, and the execution in step quadruple is newly judged again.

Wherein the schematic flow sheet of task scheduling as shown in Figure 3, its detailed process comprises the priority that sets the tasks, and selects ready task and is the suitable processor of ready task selection.The priority that sets the tasks comprises task according to data dependence relation, be divided in the different task groups, task division that can executed in parallel is to same task groups, task possibility Existence dependency relationship in the different task group, for these task groups arrange priority, the task groups priority that more early can begin to carry out is higher, and limit priority is 0; For the different task in each task groups arranges task priority TP, according to following formula calculation task priority, task priority TP is larger to each task, and task priority is higher:

Known

≠ be weight coefficient,

The priority of task is set, adopts method dynamic and static combination that two-stage priority is set in the present embodiment.Can satisfy simultaneously dependence, the task of executed in parallel is called with batch task at one time.Different tasks can be assigned in the different task groups according to affiliated batch difference.Any two tasks in each task groups are separate, and the task in the different task group may Existence dependency relationship.In order to distinguish the execution sequence of different task group, for each task groups is provided with priority, i.e. task groups priority BP (Batch Priority).The priority of less this task groups of explanation of BP is higher, can more early begin to carry out.Task priority TP (Task Priority), TP represents the priority between the different task in the same task groups.Because can executed in parallel with batch task, may there be so the situation of contention hardware resource between them.In order to solve the problem of resource contention, with batch task task priority TP need to be set.It is nonsensical that the TP of task in the different task group is compared.Each task can be assigned to corresponding task groups, and the BP of each grouping of initialization.The algorithm false code that BP arranges is as follows:

TP is static priority, no longer changes after once setting.And BP is dynamic priority.Whenever there being task to be scheduled, need the division of updating task group.The algorithm false code that dynamically updates task groups is as follows:

Describedly select suitable processor to comprise ready task calculated to obtain factors A vailableFactor for ready task, i.e. AF, is defined as follows AF as the standard of weighing this processor and whether be fit to ready task with AF:

Known

Wherein DL (n ' _i, p _j) be the data localitys, expression task n ' _iBe assigned to processor p _jUpper can be multiplexing the data space size or the size of diminishbb communication buffer; DRT (n ' _j) be the data ready time, represent task n ' _iThe time that can begin to carry out; PEAT (n ' _i, p _j) be the processor ready time, expression processor p _jFree time, can begin the n ' that executes the task _iTime, this time equals processor p _jA upper task n ' who carries out _jConcluding time, as task n ' _iAnd p _jA upper task n ' who carries out _jWhen having the Branch Tasks of mutex relation, PEAT (n ' _i, p _j) equal task PEAT (n ' _j, p _j), i.e. task n ' _iWith task n ' _jHas identical processor ready time; T (n ' _i, p _j) be computing cost, representative is with task n ' _iBe assigned to processor p _jUpper execution needed computing time; * be adjust DL (n ' _i, p _j) the coefficient factor of proportion in Available Factor.AF (n ' _i, p _j) larger, if illustrate task n ' _iBe assigned to processor p _jUpper more energy Optimized Operation performance.For the criterion of task choosing processor as follows:

Task in the directed acyclic graph is carried out polymerization and fractionation to be comprised according to task scheduling SR as a result, the working condition of checking each processor on the target hardware platform; The task that computing cost is less condenses together, and is assigned on the same processor, to improve the utilization ratio of processor as far as possible; Can not be assigned to the little task of computing cost on the same processor by polymerization and split and come, the free time that is inserted into each processor carries out in the groove.

Step 5: parallel codes generates.Parallel codes generates and to comprise according to the task scheduling result serial C code is cut apart, revised and obtain the compiler compiling that parallel C code and use be applicable to target hardware platform and generate the executable file that walks abreast.Describedly according to the task scheduling result serial C code is cut apart, revised and obtain the parallel C code and comprise serial C program in machine code section corresponding to each task among the figure dependency graph TDGAT that sets the tasks, the program code of correspondence is extracted, redefine and be packaged into function body, this function body is concrete C code corresponding to task, is responsible for the input data are carried out calculation process; For guaranteeing that this task can be to the correct data of other tasks transmission, in function body, revise and add corresponding the processing, be specially the useful data that generates is copied, divides or distributes local space to preserve; Stem at function body is communicated by letter and synchronization statements with the end insertion, because the execution of task has atomic properties, before any data interaction operation only allows the computing execution of the task that occurs in to begin or after finishing, these communications and synchronization statements are provided by operating system, obtain at last the parallel C code through above processing.

Step 6: parallel codes is carried out.Parallel codes is carried out and to be about to described parallel executable file and to be loaded on the target hardware platform and to carry out.

Although described by reference to the accompanying drawings embodiments of the present invention, those of ordinary skills can make various distortion or modification within the scope of the appended claims.

Claims

1. flowing water parallel method of using for coarse particle degree stream is characterized in that may further comprise the steps:

2. the flowing water parallel method of using for coarse particle degree stream according to claim 1 is characterized in that steps A) described in " serial C code is carried out the typical data analysis obtains calculating the hot spot region " may further comprise the steps:

3. the flowing water parallel method of using for coarse particle degree stream according to claim 1, it is characterized in that: " will calculate hot spot region carry out dependency analysis obtain Task Dependent figure TDG " step B) comprises and carries out static scanning to calculating the hot spot region, follow the tracks of the function read-write situation of calculating in the hot spot region, set up data dependence relation and control dependence between each function, according to the data dependence relation between each function and control dependence, set up Task Dependent figure TDG, TDG=(V, E, w _v, w _e), wherein V represents the set of task, E represents the set of dependence edge, i.e. data dependence relation between task and control dependence, w _vBe the data structure of task, be used for storing corresponding quantitative information, w _eBe the data structure of dependence edge, be used for storing corresponding quantitative information.

4. the flowing water parallel method of using for coarse particle degree stream according to claim 1 is characterized in that step C) described in " Task Dependent figure is relied on conversion obtains directed acyclic graph TDGAT " may further comprise the steps:

D) through dependence obtain after the conversion directed acyclic graph TDGAT=(V ', E ', w ' _v, w ' _e), the wherein set of V ' expression task, the set of E ' expression dependence edge, w ' _vBe the data structure of task, be used for storing corresponding quantitative information, w ' _eIt is the data structure of dependence edge, be used for storing corresponding quantitative information, its structure and TDG are basically identical, difference is only to exist among the TDGAT data dependence limit, it is the data dependence relation between task, each node is a strong connected component SCC among the TDGAT, does not have branch's control task among the TDGAT, the w ' among the TDGAT _vAnd w ' _eThe data field that comprises and the w among the TDG _vAnd w _eIdentical.

5. the flowing water parallel method of using for coarse particle degree stream according to claim 1 is characterized in that: the figure of architectural feature step D) ACG=(P, R, w _p, w _r), be used for describing the characteristic of target hardware, wherein P represents the set of processor, R represents the link set between processor, w _pThe size of storage space on the processor, w _rThe bandwidth of expression link.

6. the flowing water parallel method of using for coarse particle degree stream according to claim 1 is characterized in that step e) described in " for architectural feature figure directed acyclic graph is carried out task scheduling and obtains as a result SR of task scheduling " may further comprise the steps:

I) priority that sets the tasks is selected ready task;

II) select suitable processor for ready task.

7. the flowing water parallel method of using for coarse particle degree stream according to claim 6 is characterized in that step I) described in " priority sets the tasks " may further comprise the steps:

Known TDGAT=(V ', E ', w ' _v, w ' _e)

TP(n′ _i)＝α×bl(n′ _i)+β×DynamicMem(n′ _i)+γ×StaticMem(n′ _i)

The longest path length of the bl end task that is current task in the directed acyclic graph TDGAT wherein, bl is larger, indicates that this task is larger on the impact of follow-up work, should priority processing; DynamicMem is the size of the communication buffer of this required by task, DynamicMem (n ' _i)=∑ _jC (e ' _Ij), wherein c (e ' _Ij) expression dependence edge e _IjThe traffic; StaticMem is the data of task and the size of instruction storage space; α, β and γ are weight coefficient, α=1/len (cp); β=1/ (R * min (t (n ' _i, p _j)),

γ=1/M, P are the set of processor, p _jBe a processor among the P, t represents task n _i' at processor p _jOn execution time, R is average data transfer rate in the link, i.e. link bandwidth, M is the average storage size in each processor, cp is the critical path among the TDGAT, len is path, i.e. computing cost on this path and communication overhead sum.

8. the flowing water parallel method of using for coarse particle degree stream according to claim 6, it is characterized in that: " for ready task is selected suitable processor " Step II) comprises that calculating can be obtained factors A vailableFactor to ready task, be AF,, as the standard of weighing this processor and whether be fit to ready task AF is defined as follows with AF:

Known TDGAT=(V ', E ', w ' _v, w ' _e), ACG=(P, R, w _p, w _r)

{&ForAll; n}_{i}^{'} &Element; V^{'},

&ForAll; p_{j} &Element; P

AvailableFactor (n ' _i, p _j)=λ DL (n ' _i, p _j)-max{DRT (n ' _i), PEAT (n ' _i, p _j)-t (n ' _i, p _j) wherein DL (n ' _i, p _i) be the data localitys, expression task n ' _iBe assigned to processor p _jUpper can be multiplexing the data space size or the size of diminishbb communication buffer; DRT (n ' _j) be the data ready time, represent task n ' _iThe time that can begin to carry out; PEAT (n ' _i, p _j) be the processor ready time, expression processor p _jFree time, can begin the n ' that executes the task _iTime, this time equals processor p _jA upper task n ' who carries out _jConcluding time, as task n ' _iAnd p _jA upper task n ' who carries out _jWhen having the Branch Tasks of mutex relation, PEAT (n ' _i, p _j) equal task PEAT (n ' _j, p _j), i.e. task n ' _iWith task n ' _jHas identical processor ready time; T (n ' _i, p _j) be computing cost, representative is with task n ' _iBe assigned to processor p _jUpper execution needed computing time; λ be adjust DL (n ' _i, p _j) the coefficient factor of proportion in Available Factor.

9. the flowing water parallel method of using for coarse particle degree stream according to claim 1 is characterized in that step F) described in " task in the directed acyclic graph is carried out polymerization and fractionation " may further comprise the steps:

10. the flowing water parallel method of using for coarse particle degree stream according to claim 1 is characterized in that step G) described in " according to the task scheduling result to serial C code cut apart, revise obtain the parallel C code " may further comprise the steps: