CN103377035A - Pipeline parallelization method for coarse-grained streaming application - Google Patents

Pipeline parallelization method for coarse-grained streaming application Download PDF

Info

Publication number
CN103377035A
CN103377035A CN2012101075275A CN201210107527A
Authority
CN
China
Prior art keywords
task
data
processor
dependence
directed acyclic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012101075275A
Other languages
Chinese (zh)
Inventor
刘鹏
黄春明
史册
于绩洋
刘扬帆
郭俊
姚庆栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN2012101075275A priority Critical patent/CN103377035A/en
Publication of CN103377035A publication Critical patent/CN103377035A/en
Pending legal-status Critical Current

Links

Images

Abstract

The invention discloses a pipeline parallelization method for coarse-grained streaming applications. The method comprises: performing typical-data profiling and dependence analysis on serial C code to obtain a task dependence graph; performing dependence transformation on the task dependence graph to obtain a directed acyclic graph; building an architecture characteristic graph; performing task scheduling on the directed acyclic graph according to the architecture characteristic graph and judging whether the scheduling result meets the performance requirement; if not, aggregating and splitting the tasks of the directed acyclic graph to obtain a new directed acyclic graph, selecting the task with the largest computing cost in the new graph as the new computing hotspot region, and returning to the dependence analysis; partitioning and modifying the serial C code according to the scheduling result to obtain parallelized C code; compiling it with a compiler to generate a parallel executable file; and loading the parallel executable file onto the target hardware platform for execution. The method suits multi-level nested loop structures and can extract the parallelism of multi-level loops.

Description

Pipeline parallelization method for coarse-grained streaming applications
Technical field
The present invention relates to the field of computer applications, and in particular to a method for pipeline parallelization of coarse-grained streaming applications.
Background technology
To make full use of the resources of a multi-core system, the problem of parallel programming must be solved. Owing to the popularity of the C language and programmers' long-established habits of serial programming, a large body of legacy C code remains in use, and these programs often serve as the upper-layer applications or software tools of multi-core systems. At present nearly 85% of embedded-system developers still code in C/C++. Asking programmers to rewrite such applications in a new parallel programming language would make development difficult and lengthen the development cycle. There is therefore an urgent need to improve the parallel execution efficiency of serial C programs on multi-core systems, yet effective methods for parallelizing C/C++ are still lacking.
Streaming applications are widely used in the embedded field, in programs such as audio, video, encryption and signal processing. Such programs contain abundant parallelism and share several characteristics: 1) they are data-driven; 2) they contain many loops, mostly multi-level nested loop bodies; 3) complex control dependences exist between functions. Extracting the parallelism hidden in streaming applications therefore requires a reasonable and efficient pipelining method.
Existing methods for extracting pipeline parallelism from application programs fall into three classes: IMT (Independent Multi-Threading), CMT (Cyclic Multi-Threading) and PMT (Pipelined Multi-Threading). IMT does not allow dependences to exist between threads; it was first applied to array-based scientific computing, and its best-known representative, DOALL, applies only to loops whose iterations carry no dependences. CMT supplements and extends IMT, mainly to handle the case where dependences exist between iterations; DOACROSS, a CMT technique, inserts synchronization mechanisms to guarantee the data exchange between iterations. DSWP (Decoupled Software Pipelining) belongs to PMT and aims to exploit the pipeline parallelism of loop bodies. DSWP processes loops differently from DOACROSS: it cuts the loop code into different threads assigned to different processor cores, with the threads arranged in pipelined fashion. The above methods have the following drawbacks: 1) they mainly target the extraction of fine-grained thread-level parallelism and handle control dependences by aggregation, which for coarse-grained applications easily forms over-large task sets that become performance bottlenecks; 2) they apply only to innermost loops and are ineffective for multi-level nested loop structures; 3) DOACROSS and DOALL suit only applications with regular memory accesses and simple control dependences, similar to scientific computing, whereas most application programs have irregular control flow and complex memory accesses, such as pointer accesses.
Some research work extracts parallelism through structural analysis of the source program, but such work usually emphasizes modeling and information gathering over the program structure, lacks transformations of the dependences between modules, and offers no systematic guidance for extracting parallelism, leaving it to be completed manually by the programmer.
Summary of the invention
The technical problem to be solved by the present invention is to provide a pipeline parallelization method for coarse-grained streaming applications that suits multi-level nested loop structures and can extract the parallelism of multi-level loops.
To solve the above technical problem, the technical solution adopted by the present invention provides a pipeline parallelization method for coarse-grained streaming applications, comprising the following steps:
A) performing typical-data profiling on the serial C code to obtain the computing hotspot region;
B) performing dependence analysis on the computing hotspot region to obtain a task dependence graph TDG;
C) performing dependence transformation on the task dependence graph to obtain a directed acyclic graph TDGAT;
D) building a hardware model of the target hardware platform to obtain an architecture characteristic graph ACG;
E) performing task scheduling on the directed acyclic graph against the architecture characteristic graph to obtain a scheduling result SR;
F) judging whether the scheduling result meets the performance requirement; if so, executing step G) and the subsequent steps; if not, aggregating and splitting the tasks of the directed acyclic graph to obtain a new directed acyclic graph, selecting the task with the largest computing cost in the new graph as the new computing hotspot region, and returning to step B);
G) partitioning and modifying the serial C code according to the scheduling result to obtain parallel C code;
H) compiling with a compiler suited to the target hardware platform to generate a parallel executable file;
I) loading the parallel executable file onto the target hardware platform for execution.
The typical-data profiling of the serial C code in step A) comprises the following steps:
a) inserting debugging information into the serial C code and compiling it to generate a serial executable file;
b) running the serial executable file on a computer platform and collecting basic information about the runtime environment, specifically information on memory accesses, function call relationships and branch selections;
c) running the serial executable file on the target hardware platform and collecting information relevant to that platform, specifically the computing cost of each function and the instruction-space and data-space footprint of the program, then choosing the function with the largest computing cost as the computing hotspot region.
The dependence analysis of step B) comprises statically scanning the computing hotspot region, tracing the read/write behavior of the functions within it, and establishing the data dependences and control dependences between the functions; from the data dependences and control dependences between the functions the task dependence graph TDG = (V, E, w_v, w_e) is built, where V is the set of tasks; E is the set of dependence edges, i.e. the data dependences and control dependences between tasks; w_v is the data structure of a task, used to store the corresponding quantitative information; and w_e is the data structure of a dependence edge, used to store the corresponding quantitative information.
The dependence transformation of step C) comprises the following steps:
a) judging whether control dependences exist in the task dependence graph TDG; if so, eliminating them; then entering step b);
b) judging whether inter-iteration data dependences exist in the TDG; if so, eliminating the redundant inter-iteration data dependences; then entering step c);
c) judging whether cyclic data dependences exist in the TDG; if so, building strongly connected components SCC and then generating the directed acyclic graph TDGAT; otherwise generating TDGAT directly;
d) the dependence transformation yields TDGAT = (V', E', w'_v, w'_e), where V' is the set of tasks, E' the set of dependence edges, w'_v the data structure of a task, used to store the corresponding quantitative information, and w'_e the data structure of a dependence edge, used to store the corresponding quantitative information. Its structure is essentially the same as that of the TDG; the differences are that TDGAT contains only data dependence edges, i.e. data dependences between tasks, that each node of TDGAT is a strongly connected component SCC, and that TDGAT contains no branch control tasks. The data fields of w'_v and w'_e in TDGAT are identical to those of w_v and w_e in the TDG.
The architecture characteristic graph of step D) is ACG = (P, R, w_p, w_r), which describes the characteristics of the target hardware, where P is the set of processors, R the set of links between processors, w_p the size of the storage space on a processor, and w_r the bandwidth of a link.
The task scheduling of step E) comprises the following steps:
I) determining task priorities and selecting a ready task;
II) selecting a suitable processor for the ready task.
Determining task priorities in step I) comprises the following steps:
a) dividing the tasks into different task groups according to their data dependences, tasks that can execute in parallel being placed in the same group; dependences may exist between tasks in different groups. Each task group is given a priority: the earlier a group can begin execution, the higher its priority, the highest priority being 0;
b) setting a task priority TP for each task within a task group; the larger TP, the higher the task's priority. TP is computed as

TP(n'_i) = bl(n'_i) + λ · (DynamicMem(n'_i) + StaticMem(n'_i))

where bl(n'_i) is the length of the longest path from the current task to the end task in the directed acyclic graph TDGAT; the larger bl, the greater the influence of this task on subsequent tasks and the earlier it should be processed. DynamicMem(n'_i) is the size of the communication buffer required by the task, DynamicMem(n'_i) = Σ_j c(e'_ij), where c(e'_ij) is the traffic on dependence edge e'_ij. StaticMem(n'_i) is the size of the data and instruction storage space of the task. λ is a weight coefficient, λ = 1/M. The length of a path is len = Σ t(n'_i, p_j) + Σ c(e'_ij)/R, i.e. the sum of the computing cost and the communication cost along the path, and bl(n'_i) is the largest len over all paths from n'_i to the end task; cp is the critical path of TDGAT, i.e. the path with the largest len. Here P is the set of processors, p_j a processor in P, t(n'_i, p_j) the execution time of task n'_i on processor p_j, R the average data transmission rate of a link, i.e. the link bandwidth, and M the average storage size of each processor.
Selecting a suitable processor for the ready task in step II) comprises the following:
an availability factor AvailableFactor, abbreviated AF, is computed for the ready task and used as the criterion for judging whether a processor suits it; AF is defined as

AF(n'_i, p_j) = max(DRT(n'_i), PEAT(n'_i, p_j)) + t(n'_i, p_j) - μ · DL(n'_i, p_j)

the processor with the smallest AF being selected, where DL(n'_i, p_j) is the data locality, i.e. the size of the data space that can be reused, or of the communication buffer that can be reduced, when task n'_i is assigned to processor p_j; DRT(n'_i) is the data ready time, the time at which task n'_i can begin execution; PEAT(n'_i, p_j) is the processor ready time, the time at which processor p_j becomes idle and can begin executing task n'_i, equal to the finish time of the previous task n'_j executed on p_j; when task n'_i and the previous task n'_j executed on p_j are branch tasks in a mutual-exclusion relation, PEAT(n'_i, p_j) equals PEAT(n'_j, p_j), i.e. tasks n'_i and n'_j have the same processor ready time; t(n'_i, p_j) is the computing cost, the computation time needed to execute task n'_i on processor p_j; and μ is a coefficient adjusting the proportion of DL(n'_i, p_j) within the AvailableFactor.
Aggregating and splitting the tasks of the directed acyclic graph in step F) comprises the following steps:
a) checking the working condition of each processor on the target hardware platform according to the scheduling result SR;
b) aggregating tasks with small computing cost and assigning them to the same processor, so as to raise processor utilization as far as possible;
c) splitting apart small-cost tasks that cannot be aggregated onto the same processor and inserting them into the idle slots of the processors.
Partitioning and modifying the serial C code according to the scheduling result in step G) comprises the following steps:
a) determining the serial C code section corresponding to each task in the directed acyclic graph TDGAT, extracting the corresponding program code, and redefining and packaging it into a function body; this function body is the concrete C code of the task and is responsible for processing the input data;
b) modifying the function body and adding the corresponding handling to guarantee that the task passes correct data to other tasks, specifically copying, partitioning, or allocating local space to preserve the useful data it generates;
c) inserting communication and synchronization statements at the head and tail of the function body; because the execution of a task is atomic, any data interaction is allowed to occur only before the computation of the task begins or after it finishes; these communication and synchronization statements are provided by the operating system. After the above processing the parallel C code is obtained.
Beneficial effects: the method proposed by the present invention is heuristic; through stepwise refinement of loop optimization it adjusts the pipeline-stage granularity and gradually uncovers the parallelism of multi-level loops. On the basis of the TDG, for the complex dependences between tasks, it proposes a method of eliminating cyclic dependences that differs from the traditional approach of simple aggregation: through branch aggregation, speculation and the identification of redundant dependences, control dependences can be eliminated effectively and data dependence cycles untied. The invention also defines the output of the dependence transformation, TDGAT; compared with the TDG, each task of TDGAT is a relatively complete strongly connected component SCC. The task scheduling method proposed fully considers the changes in the task dependences; it takes reducing the execution time as the optimization objective and, while guaranteeing the execution time, optimizes storage occupancy by increasing data locality and reusing storage space. The task scheduling adopted also considers the presence of branch tasks and makes rational use of hardware resources by judging the mutual exclusion between tasks.
Description of drawings
In conjunction with the accompanying drawings, other features and advantages of the present invention will become clearer from the following description of preferred embodiments, given by way of example, that explain the principle of the invention.
Fig. 1 is a schematic flow chart of an embodiment of the pipeline parallelization method for coarse-grained streaming applications of the present invention;
Fig. 2 is a schematic flow chart of the dependence transformation in an embodiment of the pipeline parallelization method for coarse-grained streaming applications of the present invention;
Fig. 3 is a schematic flow chart of the task scheduling in an embodiment of the pipeline parallelization method for coarse-grained streaming applications of the present invention.
Embodiment
Embodiments of the present invention are described in detail below in conjunction with the accompanying drawings.
As shown in Fig. 1, a pipeline parallelization method for coarse-grained streaming applications proceeds as follows.
Step 1: program analysis. The main function of program analysis is to select the computing hotspot CH (computing hotspot) and build the task dependence graph model for it. The modules of the program are regarded as task units, and the relations between them are represented by edges between tasks. Program analysis is divided into two parts, typical-data profiling and dependence analysis. Typical-data profiling obtains the runtime environment information of the program; dependence analysis obtains the data dependences and control dependences. Program analysis comprises performing typical-data profiling on the serial C code to obtain the computing hotspot region, and performing dependence analysis on the computing hotspot region to obtain the task dependence graph TDG (Task Dependence Graph).
The typical-data profiling of the serial C code comprises inserting debugging information into the serial C code and compiling it to generate a serial executable file; running the serial executable file on a computer platform and collecting basic information about the runtime environment, specifically information on memory accesses, function call relationships and branch selections; and running the serial executable file on the target hardware platform and collecting information relevant to that platform, specifically the computing cost of each function and the instruction-space and data-space footprint of the program, then choosing the function with the largest computing cost as the computing hotspot region.
The dependence analysis of the computing hotspot region comprises statically scanning the computing hotspot region, tracing the read/write behavior of the functions within it, and establishing the data dependences and control dependences between the functions, from which the task dependence graph TDG = (V, E, w_v, w_e) is built, where V is the set of tasks; E is the set of dependence edges, i.e. the data dependences and control dependences between tasks; w_v is the data structure of a task, used to store the corresponding quantitative information; and w_e is the data structure of a dependence edge, used to store the corresponding quantitative information.
A node n ∈ V represents a task; a task appears as a statement block in the C code. The block may be a function, in which case n also includes its nested sub-functions, or it may be several consecutive statements.
An edge e_ij ∈ E, with n_i, n_j ∈ V, represents a group of dependences from task n_i to task n_j, either data dependences or control dependences; pre(n_j) = n_i means task n_i is a predecessor of task n_j, and likewise succ(n_i) = n_j means task n_j is a successor of task n_i. According to their function, tasks are divided into four classes: 1. control tasks V_c, i.e. loop control statements or branch judgment statements; 2. loop tasks V_l, i.e. tasks that may be executed repeatedly inside a loop body; 3. branch tasks V_b, i.e. tasks inside a branch structure; 4. ordinary tasks V_o, i.e. all types other than the above three. Edges are correspondingly divided into: 1. inter-iteration data dependence edges E_i, representing data dependences between different iterations of a loop body; 2. control dependence edges E_c, representing dependences initiated by control nodes; 3. ordinary data dependence edges E_o, representing all types other than the above two. In addition, a data structure is established for every task node and dependence edge to store the corresponding quantitative information, as shown in the table:
[Table: quantitative information fields of the task data structure w_v and the dependence edge data structure w_e]
For the branch attribute i_b, a data structure describing the characteristics of branch tasks, the branch information list, is defined with the fields branch_level, branch_label[], branch_condition[] and branch_exclusive[].
Each branch task has such a record entry. The fields of the entry are defined as follows: branch_level is the branch rank, the number of branch nodes passed through when the program executes this node; if branch_level = 0 the node is a non-branch node. branch_label[i] is the branch label, the name of the i-th branch node the current node passes through. branch_condition[i] is the branch condition, the condition value at the i-th branch node the current node passes through. branch_exclusive is the branch mutual-exclusion list, and branch_exclusive[k] is the k-th mutually exclusive task in the mutual-exclusion task list of the current node.
For the loop attribute i_l, a data structure describing the characteristics of loop tasks, the loop information list, is defined with the fields loop_level, loop_label[], loop_num[] and loop_childnode[].
Each loop task has such a record entry. The fields of the entry are defined as follows: loop_level is the loop rank, the number of loop control nodes that must be handled when the program executes this node, reflecting the loop-body depth at which this loop task sits. loop_label[i] is the loop label, the name of the control node of the i-th loop level enclosing the current node. loop_num[i] is the iteration count, the total number of iterations of the i-th loop level enclosing the current node. loop_childnode[k] is the loop-task child node, recording the k-th task that depends on this loop task.
The TDG is suited to coarse-grained programs that contain control structures such as loops and branches and exhibit data-flow-driven characteristics. It combines the characteristics of C programs and can well represent the control flow structure of a C program united with its data flow structure.
Step 2: dependence transformation. The task dependence graph built by program analysis may contain complex dependences that hide the parallelism between tasks. To extract the parallelism and prepare for the subsequent task scheduling, a dependence transformation method for the task dependence graph is proposed: redundant dependence edges are deleted and tasks are suitably aggregated and split, finally yielding a directed acyclic graph TDGAT that clearly and fully expresses the parallelism of the program. The dependence transformation process, shown in Fig. 2, specifically comprises: a) judging whether control dependences exist in the task dependence graph TDG; if so, eliminating them; then entering step b); b) judging whether inter-iteration data dependences exist in the TDG; if so, eliminating the redundant inter-iteration data dependences; then entering step c); c) judging whether cyclic data dependences exist in the TDG; if so, building strongly connected components SCC and generating the directed acyclic graph TDGAT; otherwise generating TDGAT directly; d) the dependence transformation yields TDGAT = (V', E', w'_v, w'_e), where V' is the set of tasks, E' the set of dependence edges, w'_v the data structure of a task, used to store the corresponding quantitative information, and w'_e the data structure of a dependence edge, used to store the corresponding quantitative information; its structure is essentially the same as that of the TDG, the differences being that TDGAT contains only data dependence edges, i.e. data dependences between tasks, that each node of TDGAT is a strongly connected component SCC, and that TDGAT contains no branch control tasks; the data fields of w'_v and w'_e in TDGAT are identical to those of w_v and w_e in the TDG.
The dependence transformation is divided into three steps: eliminating control dependences, eliminating redundant inter-iteration data dependences, and building strongly connected components SCC.
Eliminating control dependences comprises eliminating branch control dependences and eliminating loop control dependences.
Eliminating branch control dependences specifically comprises the following. Among the branches under the same judgment condition there is one and only one branch selected for execution at a given moment; that is, these branches are mutually exclusive, which is the characteristic of branch tasks. When eliminating the branch control dependence edges, to record this mutual exclusion, the branch mutual-exclusion list branch_exclusive of each branch task is updated. The concrete steps are: 1) scan the i_b information of each branch node, i.e. the nodes satisfying l(n) = 1, 5 or 7; suppose the branch node to be processed is n_j, n_j ∈ V_b, and the branch rank recorded in its branch information i_b is branch_level = N; starting from i = 0, scan one by one the branch label branch_label[i] and branch condition branch_condition[i] recorded in i_b; through branch_label[i] the branch control node of the i-th branch level dominating n_j can be located; if i > N-1, the processing of branch node n_j is finished and the next branch node is searched for. 2) Traverse all outgoing branch control dependence edges of the branch control node of the i-th branch level of n_j to find the other branch tasks dominated by this control node; check the branch labels branch_label and branch conditions branch_condition of these branch tasks respectively to judge whether these tasks are mutually exclusive with task n_j; if the mutual exclusion holds, add these tasks to the branch mutual-exclusion list branch_exclusive of n_j. 3) Copy and merge the content of the branch control node into branch node n_j. 4) Repeat steps 1) to 3) until all branch nodes in the TDG have been handled, i.e. all branch control nodes and branch control dependence edges have been eliminated; finally delete all branch control nodes in the TDG.
Eliminating loop control dependences specifically comprises the following. From the loop task information i_l given by the TDG, the behavioral characteristics of the loop body can be known, including the depth of the loop, its execution count, and the dependences between iterations. If a loop body always executes a fixed number of iterations, or always executes many times (minimum iteration count / maximum iteration count > 1/2), the loop body is said to be biased. If a loop body is biased, its behavior is considered highly predictable, and the domination of the loop tasks by the loop control node is then negligible. The loop control dependence edges are eliminated by processing them with a speculation technique, the concrete steps being as follows. Examine the loop rank loop_level, loop label loop_label and iteration count loop_num recorded in the loop information i_l of each loop task to confirm whether the loop is biased. If the loop is biased, the behavior of such a loop is considered predictable and it is selected as a loop to be speculated. All loop control dependences within the selected loop body are eliminated. Code is then inserted to realize the mis-speculation mechanism: an unconditional control statement, such as while(true), is inserted into each loop task, so that by default a loop task executes repeatedly and no longer waits each round for the order of the loop control task to be checked. The function realizing the mis-speculation mechanism is inserted into the tail task of the loop body. The mis-speculation mechanism works as follows: first, the mis-speculation function allocates a buffer recording the data results of the previous iteration of each loop, updating the data once per iteration; then the function queries the state of the loop control task to determine whether this speculation hit; if the prediction was accurate, this iteration is valid and the task continues to execute; otherwise the speculation failed, this iteration is invalid, and the mis-speculation function rolls the data back to the result of the last valid iteration and exits the loop.
Eliminating redundant inter-iteration data dependences proceeds in the following concrete steps. First, temporarily delete all inter-iteration data dependence edges whose dependence distance d_dep is greater than 1. Then, on the basis of the TDG with the d_dep > 1 inter-iteration edges removed, find the strongly connected components of the graph and label each one as an SCC; each SCC contains at least one task. Once the SCC partition is complete, it is known which SCC each task of the TDG is assigned to, and the strongly connected component distance d_s(n_i, n_j) between different tasks can be determined. The strongly connected distance is the number of SCCs separating the SCCs containing any two nodes n_i and n_j. If there is no communication path between n_i and n_j, then d_s(n_i, n_j) = -1. If there is a communication path between n_i and n_j but the two tasks are directly connected with no other node in between, then d_s(n_i, n_j) = 0. If there are several communication paths between n_i and n_j and the longest path (n_i, n_{i+1}, ..., n_{i+k}, n_j) passes through k nodes, then d_s(n_i, n_j) = k + 1. Finally, judge whether each previously deleted inter-iteration data dependence edge is a redundant inter-iteration data dependence edge; if so, delete it permanently, otherwise add it back into the TDG. A redundant inter-iteration data dependence edge satisfies one of the following conditions:
Suppose there is an inter-iteration data dependence edge e_ij ∈ E that starts at task n_i and ends at task n_j, with n_i, n_j ∈ V; d_dep(e_ij) denotes the dependence distance of edge e_ij, and d_s(n_i, n_j) denotes the strongly connected component distance between tasks n_i and n_j.
If d_dep(e_ij) > d_s(n_i, n_j) with d_s(n_i, n_j) ≥ 1, then if the subsequent division into pipeline stages is carried out on the basis of the present SCC partition, the data dependence from the SCC containing n_i to the SCC containing n_j can always be satisfied in time, so e_ij is a redundant inter-iteration data dependence edge. If d_dep(e_ij) = 1, the value of d_s(n_i, n_j) is queried. If d_s(n_i, n_j) = -1, tasks n_i and n_j lie in the same SCC, e_ij is data interaction internal to that SCC, and e_ij is a redundant inter-iteration data dependence edge. If d_s(n_i, n_j) = 0, this is in effect the special case of the condition d_dep(e_ij) > d_s(n_i, n_j), d_s(n_i, n_j) ≥ 1 at d_s(n_i, n_j) = 0, and e_ij likewise belongs to the redundant inter-iteration data dependence edges.
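The redundancy conditions above can be collected into a single predicate. This sketch assumes d_dep and d_s(n_i, n_j) have already been computed elsewhere (e.g. by an SCC pass); the function name is illustrative:

```c
/* Redundancy test for an inter-iteration data dependence edge e_ij.
 * Encoding of d_s follows the text: -1 = n_i and n_j in the same SCC
 * (no inter-SCC path), 0 = directly connected SCCs, k+1 = longest
 * path passes through k intermediate nodes. */
int is_redundant_edge(int d_dep, int d_s) {
    if (d_s >= 1 && d_dep > d_s)   /* satisfied in time by the pipeline */
        return 1;
    if (d_dep == 1) {
        if (d_s == -1) return 1;   /* intra-SCC data interaction */
        if (d_s == 0)  return 1;   /* special case of d_dep > d_s */
    }
    return 0;                      /* edge must be kept in the TDG */
}
```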
Search the task graph for the remaining cyclic data dependences and aggregate the edges and tasks that form these cycles; the task dependence graph TDG thereby becomes a directed acyclic graph TDGAT.
Step 3: hardware description. Hardware description means building a hardware model of the target hardware platform to obtain the architecture characteristic graph ACG. The architecture characteristic graph ACG = (P, R, w_p, w_r) describes the characteristics of the target hardware: P is the set of processors, R is the set of links between processors, w_p is the size of the storage space on a processor, and w_r is the bandwidth of a link.
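One possible in-memory rendering of the four-tuple ACG = (P, R, w_p, w_r) is sketched below; the patent prescribes only the tuple components, so the field names are illustrative:

```c
#include <stddef.h>

/* Architecture characteristic graph ACG = (P, R, w_p, w_r). */
typedef struct {
    int    id;
    size_t mem_bytes;      /* w_p: storage space on this processor */
} Processor;

typedef struct {
    int    src, dst;       /* endpoint processors of the link */
    double bandwidth;      /* w_r: bandwidth of the link */
} Link;

typedef struct {
    Processor *P;  int num_procs;   /* P: processor set */
    Link      *R;  int num_links;   /* R: inter-processor link set */
} ACG;
```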
Step 4: task scheduling. The main purpose of task scheduling is, given the directed acyclic graph, to assign the different tasks to different processors and to determine their execution order and execution times so as to realize the optimization goal. If the parallel scheme fails to reach the performance requirement, tasks are split and aggregated to adjust their granularity, the computation hot-spot region is reselected, and the operations of the preceding one or two steps are repeated. Task scheduling comprises performing task scheduling of the directed acyclic graph against the architecture characteristic graph to obtain the task scheduling result SR, and judging whether the result satisfies the performance requirement. If it does, proceed to the next step. If it does not, aggregate and split the tasks in the directed acyclic graph to obtain a new directed acyclic graph, select the task with the largest computation cost in the new directed acyclic graph to obtain a new computation hot-spot region, return to the dependence analysis of the computation hot-spot region in the program analysis of step 1 to obtain a task dependence graph TDG, execute step 2 again, and make the judgment of step 4 anew.
The flow of task scheduling is shown schematically in Figure 3. Its detailed process comprises setting task priorities, selecting a ready task, and selecting a suitable processor for the ready task. Setting task priorities comprises dividing the tasks into different task groups according to their data dependence relations: tasks that can execute in parallel are divided into the same task group, while tasks in different task groups may have dependence relations. A priority is set for each task group; the earlier a task group can begin executing, the higher its priority, with 0 the highest priority. For the different tasks in each task group, a task priority TP is set for each task according to the following formula; the larger a task's TP, the higher its priority:
Known TDGAT = (V′, E′, w′_v, w′_e)
∀n′_i ∈ V′:
TP(n′_i) = α × bl(n′_i) + β × DynamicMem(n′_i) + γ × StaticMem(n′_i)
where bl is the length of the longest path from the current task to the end task in the directed acyclic graph TDGAT; the larger bl is, the greater the influence of this task on subsequent tasks, and the earlier it should be processed. DynamicMem is the size of the communication buffer required by the task, DynamicMem(n′_i) = Σ_j c(e′_ij), where c(e′_ij) is the traffic on dependence edge e′_ij. StaticMem is the size of the task's data and instruction storage space. α, β and γ are weight coefficients:
α = 1/len(cp)
β = 1/(R × min_{p_j ∈ P} t(n′_i, p_j))
γ = 1/M
where P is the set of processors, p_j is a processor in P, t(n′_i, p_j) is the execution time of task n′_i on processor p_j, R is the average data transfer rate of the links, i.e. the link bandwidth, M is the average storage size of each processor, cp is the critical path in TDGAT, and len is the length of a path, i.e. the sum of the computation and communication costs along it.
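The TP formula above reduces to a small computation once its inputs are known. This sketch assumes bl, DynamicMem, StaticMem, len(cp), R, the minimum execution time, and M are all precomputed; the parameter names are illustrative:

```c
/* TP(n_i) = α·bl + β·DynamicMem + γ·StaticMem with α = 1/len(cp),
 * β = 1/(R·min_j t(n_i, p_j)) and γ = 1/M. */
double task_priority(double bl,          /* longest path to the end task */
                     double dyn_mem,     /* Σ_j c(e_ij): communication buffers */
                     double static_mem,  /* data + instruction space */
                     double len_cp,      /* cost of the critical path */
                     double R,           /* average link bandwidth */
                     double min_t,       /* min over processors of t(n_i, p_j) */
                     double M)           /* average per-processor storage */
{
    double alpha = 1.0 / len_cp;
    double beta  = 1.0 / (R * min_t);
    double gamma = 1.0 / M;
    return alpha * bl + beta * dyn_mem + gamma * static_mem;
}
```

The three coefficients normalize each term against its platform-wide scale, so path length, buffer demand, and static footprint contribute on comparable footings.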
To set the task priorities, this embodiment adopts a two-level priority scheme combining static and dynamic methods. Tasks that can satisfy their dependences and execute in parallel at the same time are called tasks of the same batch. Different tasks are assigned to different task groups according to the batch they belong to. Any two tasks within a task group are independent of one another, while tasks in different task groups may have dependence relations. To distinguish the execution order of the different task groups, each task group is given a priority, the batch priority BP (Batch Priority); the smaller the BP, the higher the group's priority and the earlier it can begin executing. The task priority TP (Task Priority) expresses the priority among the different tasks within the same task group. Because tasks of the same batch can execute in parallel, they may contend for hardware resources; to resolve this resource contention, tasks of the same batch need a task priority TP. Comparing the TPs of tasks in different task groups is meaningless. Each task is assigned to its corresponding task group, and the BP of each group is initialized. The pseudocode of the BP-setting algorithm is as follows:
[Pseudocode of the BP-setting algorithm, provided as a figure in the original publication.]
TP is a static priority that does not change once set, whereas BP is a dynamic priority: whenever a task is scheduled, the division into task groups must be updated. The pseudocode of the algorithm for dynamically updating the task groups is as follows:
[Pseudocode of the task-group update algorithm, provided as a figure in the original publication.]
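Since the BP pseudocode survives only as figures, the batching idea can be reconstructed as a sketch: tasks with no predecessors form batch 0, and every task's batch is one more than that of its latest predecessor. This layering is an illustrative reconstruction under that assumption, not the patent's own algorithm:

```c
#define MAX_TASKS 16

/* Batch-priority (BP) assignment sketch.  dep[i][j] != 0 means task j
 * depends on task i; bp[i] receives the batch index of task i. */
void assign_bp(int n, int dep[MAX_TASKS][MAX_TASKS], int bp[MAX_TASKS]) {
    for (int i = 0; i < n; i++)
        bp[i] = 0;
    /* relax BP(j) = max(BP(j), BP(i) + 1) to a fixed point; the graph
     * is acyclic (TDGAT), so this terminates */
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (dep[i][j] && bp[j] < bp[i] + 1) {
                    bp[j] = bp[i] + 1;
                    changed = 1;
                }
    }
}
```

Tasks sharing a BP value form one batch and can run in parallel; their relative order within the batch is then decided by the static TP.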
Selecting a suitable processor for a ready task comprises computing the available factor AvailableFactor, abbreviated AF, of the ready task, and using AF as the criterion for weighing whether a processor suits the ready task. AF is defined as follows:
Known TDGAT = (V′, E′, w′_v, w′_e), ACG = (P, R, w_p, w_r)
∀n′_i ∈ V′, ∀p_j ∈ P:
AvailableFactor(n′_i, p_j) = λ × DL(n′_i, p_j) − max{DRT(n′_i), PEAT(n′_i, p_j)} − t(n′_i, p_j)
where DL(n′_i, p_j) is the data locality, the size of the data space that can be reused, or of the communication buffer that can be reduced, when task n′_i is assigned to processor p_j; DRT(n′_i) is the data ready time, the time at which task n′_i can begin executing; PEAT(n′_i, p_j) is the processor ready time, the time at which processor p_j becomes free and can begin executing task n′_i, which equals the finish time of the previous task n′_j executed on p_j. When task n′_i and the previous task n′_j executed on p_j are branch tasks in a mutual-exclusion relation, PEAT(n′_i, p_j) equals PEAT(n′_j, p_j), i.e. tasks n′_i and n′_j have the same processor ready time. t(n′_i, p_j) is the computation cost, the computation time needed to execute task n′_i on processor p_j; λ is a coefficient factor adjusting the proportion of DL(n′_i, p_j) within the AvailableFactor. The larger AF(n′_i, p_j) is, the more assigning task n′_i to processor p_j can optimize scheduling performance. The criterion for a task to choose its processor is therefore to select the processor with the largest AF.
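The AF definition and the largest-AF selection rule above can be sketched directly. The per-processor inputs (DL, PEAT, t) are assumed precomputed, and the struct and function names are illustrative:

```c
/* AvailableFactor sketch:
 * AF = λ·DL(n_i, p_j) - max{DRT(n_i), PEAT(n_i, p_j)} - t(n_i, p_j);
 * the task is placed on the processor with the largest AF. */
typedef struct {
    double dl;    /* DL: reusable data space / saved communication buffer */
    double peat;  /* PEAT: time at which the processor becomes free */
    double t;     /* t: computation cost of the task on this processor */
} ProcStats;

static double max2(double a, double b) { return a > b ? a : b; }

double available_factor(double lambda, double drt, const ProcStats *s) {
    return lambda * s->dl - max2(drt, s->peat) - s->t;
}

int pick_processor(double lambda, double drt, const ProcStats *s, int n) {
    int best = 0;          /* index of the processor with the largest AF */
    for (int j = 1; j < n; j++)
        if (available_factor(lambda, drt, &s[j]) >
            available_factor(lambda, drt, &s[best]))
            best = j;
    return best;
}
```

A processor with good data locality and an early free slot thus wins even when its raw computation cost is slightly higher.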
Aggregating and splitting the tasks in the directed acyclic graph comprises: according to the task scheduling result SR, inspecting the working condition of each processor on the target hardware platform; condensing tasks of small computation cost together and assigning them to the same processor, so as to raise processor utilization as far as possible; and splitting apart small-cost tasks that cannot be aggregated onto the same processor, inserting them into the idle execution slots of the processors.
Step 5: parallel code generation. Parallel code generation comprises partitioning and modifying the serial C code according to the task scheduling result to obtain parallel C code, and compiling it with a compiler suited to the target hardware platform to generate a parallel executable file. Partitioning and modifying the serial C code according to the task scheduling result comprises: determining the serial C code segment corresponding to each task in the task dependence graph TDGAT, extracting the corresponding program code, and redefining and packaging it into a function body; this function body is the concrete C code of the task and is responsible for performing the computation on the input data. To guarantee that the task can transmit correct data to other tasks, the function body is modified with corresponding handling added, specifically copying or partitioning the generated useful data, or allocating local space to preserve it. Communication and synchronization statements are inserted at the head and tail of the function body: because task execution is atomic, any data interaction operation is only allowed to occur before the task's computation begins or after it finishes. These communication and synchronization statements are provided by the operating system. After the above processing the parallel C code is obtained.
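The shape of a generated task function, communication at head and tail with the extracted serial code in between, can be sketched as follows. The patent leaves the communication primitives to the operating system and does not name them; `recv_input`/`send_output` here are illustrative stand-ins that copy through static channel buffers so the sketch is self-contained:

```c
#include <string.h>

#define BLK 4

/* Stand-ins for the OS-provided communication/synchronization
 * primitives: copy through static channel buffers. */
static int chan_in[BLK], chan_out[BLK];

static void recv_input(int *buf)        { memcpy(buf, chan_in, sizeof chan_in); }
static void send_output(const int *buf) { memcpy(chan_out, buf, sizeof chan_out); }

/* Generated task function: head and tail communicate, the body is the
 * extracted serial computation (here a doubling loop as an example). */
void task_scale(void) {
    int in[BLK], out[BLK];
    recv_input(in);                   /* head: synchronize + fetch input */
    for (int i = 0; i < BLK; i++)
        out[i] = in[i] * 2;           /* body: extracted serial C code */
    send_output(out);                 /* tail: publish result downstream */
}
```

Keeping all data interaction at the head and tail preserves the atomicity the text requires: no exchange happens while the computation is in flight.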
Step 6: parallel code execution. Parallel code execution means loading the parallel executable file onto the target hardware platform and executing it.
Although embodiments of the present invention have been described with reference to the accompanying drawings, those of ordinary skill in the art may make various variations or modifications within the scope of the appended claims.

Claims (10)

1. A pipeline parallelization method for coarse-grained streaming applications, characterized in that it comprises the following steps:
A) performing typical-data profiling on the serial C code to obtain the computation hot-spot region;
B) performing dependence analysis on the computation hot-spot region to obtain a task dependence graph TDG;
C) performing dependence transformation on the task dependence graph to obtain a directed acyclic graph TDGAT;
D) building a hardware model of the target hardware platform to obtain an architecture characteristic graph ACG;
E) performing task scheduling of the directed acyclic graph against the architecture characteristic graph to obtain a task scheduling result SR;
F) judging whether the task scheduling result satisfies the performance requirement; if it does, executing step G) and the subsequent steps; if it does not, aggregating and splitting the tasks in the directed acyclic graph to obtain a new directed acyclic graph, selecting the task with the largest computation cost in the new directed acyclic graph to obtain a new computation hot-spot region, and returning to step B) to continue;
G) partitioning and modifying the serial C code according to the task scheduling result to obtain parallel C code;
H) compiling with a compiler suited to the target hardware platform to generate a parallel executable file;
I) loading the parallel executable file onto the target hardware platform for execution.
2. The pipeline parallelization method for coarse-grained streaming applications according to claim 1, characterized in that "performing typical-data profiling on the serial C code to obtain the computation hot-spot region" in step A) comprises the following steps:
a) inserting debugging information into the serial C code and compiling it to generate a serial executable file;
b) running the serial executable file on a computer platform and collecting basic information about the runtime environment, specifically the information about memory accesses, function call relations, and branch selection;
c) running the serial executable file on the target hardware platform and collecting the information relevant to the target hardware platform, specifically the computation cost of each function and the occupied sizes of the program's instruction space and data space, and choosing the function with the largest computation cost as the computation hot-spot region.
3. The pipeline parallelization method for coarse-grained streaming applications according to claim 1, characterized in that "performing dependence analysis on the computation hot-spot region to obtain a task dependence graph TDG" in step B) comprises statically scanning the computation hot-spot region, tracking the read/write behaviour of the functions in the hot-spot region, establishing the data dependence and control dependence relations between the functions, and building the task dependence graph TDG according to those relations, TDG = (V, E, w_v, w_e), where V is the set of tasks, E is the set of dependence edges, i.e. the data dependence and control dependence relations between tasks, w_v is the data structure of a task, used to store the corresponding quantitative information, and w_e is the data structure of a dependence edge, used to store the corresponding quantitative information.
4. The pipeline parallelization method for coarse-grained streaming applications according to claim 1, characterized in that "performing dependence transformation on the task dependence graph to obtain a directed acyclic graph TDGAT" in step C) comprises the following steps:
a) judging whether control dependences exist in the task dependence graph TDG; if so, eliminating the control dependences and entering step b); if not, entering step b) directly;
b) judging whether inter-iteration data dependences exist in the task dependence graph TDG; if so, eliminating the redundant inter-iteration data dependences and entering step c); if not, entering step c) directly;
c) judging whether cyclic data dependences exist in the task dependence graph TDG; if so, building strongly connected components SCC and generating the directed acyclic graph TDGAT; if not, generating the directed acyclic graph TDGAT directly;
d) obtaining after the dependence transformation the directed acyclic graph TDGAT = (V′, E′, w′_v, w′_e), where V′ is the set of tasks, E′ is the set of dependence edges, w′_v is the data structure of a task, used to store the corresponding quantitative information, and w′_e is the data structure of a dependence edge, used to store the corresponding quantitative information; its structure is basically identical to that of the TDG, the differences being that only data dependence edges, i.e. data dependence relations between tasks, exist in the TDGAT, that each node of the TDGAT is a strongly connected component SCC, and that no branch control tasks exist in the TDGAT; the data fields contained in w′_v and w′_e of the TDGAT are identical to those of w_v and w_e of the TDG.
5. The pipeline parallelization method for coarse-grained streaming applications according to claim 1, characterized in that the architecture characteristic graph ACG = (P, R, w_p, w_r) of step D) is used to describe the characteristics of the target hardware, where P is the set of processors, R is the set of links between processors, w_p is the size of the storage space on a processor, and w_r is the bandwidth of a link.
6. The pipeline parallelization method for coarse-grained streaming applications according to claim 1, characterized in that "performing task scheduling of the directed acyclic graph against the architecture characteristic graph to obtain a task scheduling result SR" in step E) comprises the following steps:
I) setting task priorities and selecting a ready task;
II) selecting a suitable processor for the ready task.
7. The pipeline parallelization method for coarse-grained streaming applications according to claim 6, characterized in that "setting task priorities" in step I) comprises the following steps:
a) dividing the tasks into different task groups according to their data dependence relations, dividing tasks that can execute in parallel into the same task group, tasks in different task groups possibly having dependence relations; setting priorities for these task groups, the earlier a task group can begin executing the higher its priority, the highest priority being 0;
b) setting a task priority TP for each of the different tasks in each task group according to the following formula; the larger a task's TP, the higher its priority:
Known TDGAT = (V′, E′, w′_v, w′_e)
∀n′_i ∈ V′:
TP(n′_i) = α × bl(n′_i) + β × DynamicMem(n′_i) + γ × StaticMem(n′_i)
where bl is the length of the longest path from the current task to the end task in the directed acyclic graph TDGAT; the larger bl is, the greater the influence of this task on subsequent tasks, and the earlier it should be processed; DynamicMem is the size of the communication buffer required by the task, DynamicMem(n′_i) = Σ_j c(e′_ij), where c(e′_ij) is the traffic on dependence edge e′_ij; StaticMem is the size of the task's data and instruction storage space; α, β and γ are weight coefficients, α = 1/len(cp), β = 1/(R × min_{p_j ∈ P} t(n′_i, p_j)), γ = 1/M, where P is the set of processors, p_j is a processor in P, t(n′_i, p_j) is the execution time of task n′_i on processor p_j, R is the average data transfer rate of the links, i.e. the link bandwidth, M is the average storage size of each processor, cp is the critical path in TDGAT, and len is the length of a path, i.e. the sum of the computation and communication costs along it.
8. The pipeline parallelization method for coarse-grained streaming applications according to claim 6, characterized in that "selecting a suitable processor for the ready task" in step II) comprises computing the available factor AvailableFactor, abbreviated AF, of the ready task, and using AF as the criterion for weighing whether a processor suits the ready task, AF being defined as follows:
Known TDGAT = (V′, E′, w′_v, w′_e), ACG = (P, R, w_p, w_r)
∀n′_i ∈ V′, ∀p_j ∈ P:
AvailableFactor(n′_i, p_j) = λ × DL(n′_i, p_j) − max{DRT(n′_i), PEAT(n′_i, p_j)} − t(n′_i, p_j)
where DL(n′_i, p_j) is the data locality, the size of the data space that can be reused, or of the communication buffer that can be reduced, when task n′_i is assigned to processor p_j; DRT(n′_i) is the data ready time, the time at which task n′_i can begin executing; PEAT(n′_i, p_j) is the processor ready time, the time at which processor p_j becomes free and can begin executing task n′_i, which equals the finish time of the previous task n′_j executed on p_j; when task n′_i and the previous task n′_j executed on p_j are branch tasks in a mutual-exclusion relation, PEAT(n′_i, p_j) equals PEAT(n′_j, p_j), i.e. tasks n′_i and n′_j have the same processor ready time; t(n′_i, p_j) is the computation cost, the computation time needed to execute task n′_i on processor p_j; λ is a coefficient factor adjusting the proportion of DL(n′_i, p_j) within the AvailableFactor.
9. The pipeline parallelization method for coarse-grained streaming applications according to claim 1, characterized in that "aggregating and splitting the tasks in the directed acyclic graph" in step F) comprises the following steps:
a) according to the task scheduling result SR, inspecting the working condition of each processor on the target hardware platform;
b) condensing tasks of small computation cost together and assigning them to the same processor, so as to raise processor utilization as far as possible;
c) splitting apart small-cost tasks that cannot be aggregated onto the same processor and inserting them into the idle execution slots of the processors.
10. The pipeline parallelization method for coarse-grained streaming applications according to claim 1, characterized in that "partitioning and modifying the serial C code according to the task scheduling result to obtain parallel C code" in step G) comprises the following steps:
a) determining the serial C code segment corresponding to each task in the task dependence graph TDGAT, extracting the corresponding program code, and redefining and packaging it into a function body, this function body being the concrete C code of the task, responsible for performing the computation on the input data;
b) to guarantee that the task can transmit correct data to other tasks, modifying the function body and adding corresponding handling, specifically copying or partitioning the generated useful data or allocating local space to preserve it;
c) inserting communication and synchronization statements at the head and tail of the function body; because task execution is atomic, any data interaction operation is only allowed to occur before the task's computation begins or after it finishes; these communication and synchronization statements are provided by the operating system; the parallel C code is obtained after the above processing.
CN2012101075275A 2012-04-12 2012-04-12 Pipeline parallelization method for coarse-grained streaming application Pending CN103377035A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101075275A CN103377035A (en) 2012-04-12 2012-04-12 Pipeline parallelization method for coarse-grained streaming application


Publications (1)

Publication Number Publication Date
CN103377035A true CN103377035A (en) 2013-10-30

Family

ID=49462204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012101075275A Pending CN103377035A (en) 2012-04-12 2012-04-12 Pipeline parallelization method for coarse-grained streaming application

Country Status (1)

Country Link
CN (1) CN103377035A (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631659A (en) * 2013-12-16 2014-03-12 武汉科技大学 Schedule optimization method for communication energy consumption in on-chip network
CN103902362A (en) * 2014-04-29 2014-07-02 浪潮电子信息产业股份有限公司 Method for parallelizing SHIFT module serial codes in GTC software
CN103955406A (en) * 2014-04-14 2014-07-30 浙江大学 Super block-based based speculation parallelization method
CN104536898A (en) * 2015-01-19 2015-04-22 浙江大学 C-program parallel region detecting method
CN105468445A (en) * 2015-11-20 2016-04-06 Tcl集团股份有限公司 WEB-based Spark application program scheduling method and system
CN105589736A (en) * 2015-12-21 2016-05-18 西安电子科技大学 Hardware description language simulation acceleration method based on net list segmentation and multithreading paralleling
CN106208101A (en) * 2016-08-17 2016-12-07 上海格蒂能源科技有限公司 Intelligent follow-up electric energy apparatus for correcting and show merit analyze method
CN107193535A (en) * 2017-05-16 2017-09-22 中国人民解放军信息工程大学 Based on the parallel implementation method of the nested cyclic vector of SIMD extension part and its device
CN107239334A (en) * 2017-05-31 2017-10-10 清华大学无锡应用技术研究院 Handle the method and device irregularly applied
CN107688488A (en) * 2016-08-03 2018-02-13 中国移动通信集团湖北有限公司 A kind of optimization method and device of the task scheduling based on metadata
CN108139931A (en) * 2015-10-16 2018-06-08 高通股份有限公司 It synchronizes to accelerate task subgraph by remapping
CN108255492A (en) * 2016-12-28 2018-07-06 学校法人早稻田大学 The generation method of concurrent program and parallelizing compilers device
CN108536514A (en) * 2017-03-01 2018-09-14 龙芯中科技术有限公司 A kind of recognition methods of hotspot approach and device
CN108733832A (en) * 2018-05-28 2018-11-02 北京阿可科技有限公司 The distributed storage method of directed acyclic graph
CN108762905A (en) * 2018-05-24 2018-11-06 苏州乐麟无线信息科技有限公司 A kind for the treatment of method and apparatus of multitask event
CN109783157A (en) * 2018-12-29 2019-05-21 深圳云天励飞技术有限公司 A kind of method and relevant apparatus of algorithm routine load
CN109951556A (en) * 2019-03-27 2019-06-28 联想(北京)有限公司 A kind of Spark task processing method and system
CN110377428A (en) * 2019-07-23 2019-10-25 上海盈至自动化科技有限公司 A kind of collecting and distributing type data analysis and Control system
CN111062646A (en) * 2019-12-31 2020-04-24 芜湖哈特机器人产业技术研究院有限公司 Multilayer nested loop task dispatching method
WO2020083050A1 (en) * 2018-10-23 2020-04-30 华为技术有限公司 Data stream processing method and related device
WO2020259560A1 (en) * 2019-06-24 2020-12-30 华为技术有限公司 Method and apparatus for inserting synchronization instruction
CN112486658A (en) * 2020-12-17 2021-03-12 华控清交信息科技(北京)有限公司 Task scheduling method and device for task scheduling
CN112631610A (en) * 2020-11-30 2021-04-09 上海交通大学 Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure
CN112835696A (en) * 2021-02-08 2021-05-25 兴业数字金融服务(上海)股份有限公司 Multi-tenant task scheduling method, system and medium
CN113302592A (en) * 2018-12-12 2021-08-24 纬湃科技有限责任公司 Method for controlling an engine control unit having a multi-core processor
CN113704076A (en) * 2021-10-27 2021-11-26 北京每日菜场科技有限公司 Task optimization method and device, electronic equipment and computer readable medium
CN115658274A (en) * 2022-11-14 2023-01-31 之江实验室 Modular scheduling method and device for neural network reasoning in core grain and computing equipment
CN117610320A (en) * 2024-01-23 2024-02-27 中国人民解放军国防科技大学 Directed acyclic graph workflow engine cyclic scheduling method, device and equipment

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101013384A (en) * 2007-02-08 2007-08-08 浙江大学 Model-based method for analyzing schedulability of real-time system
US20090106187A1 (en) * 2007-10-18 2009-04-23 Nec Corporation Information processing apparatus having process units operable in parallel
CN101860752A (en) * 2010-05-07 2010-10-13 浙江大学 Video code stream parallelization method for embedded multi-core system
CN101989192A (en) * 2010-11-04 2011-03-23 浙江大学 Method for automatically parallelizing program


Non-Patent Citations (1)

Title
陶文质 (Tao Wenzhi): "Research on Variable-Granularity Parallelization Methods for Media Applications Based on MPSoC", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631659A (en) * 2013-12-16 2014-03-12 武汉科技大学 Schedule optimization method for communication energy consumption in on-chip network
CN103955406A (en) * 2014-04-14 2014-07-30 浙江大学 Super block-based based speculation parallelization method
CN103902362A (en) * 2014-04-29 2014-07-02 浪潮电子信息产业股份有限公司 Method for parallelizing SHIFT module serial codes in GTC software
CN103902362B (en) * 2014-04-29 2018-05-18 浪潮电子信息产业股份有限公司 A kind of method to GTC software SHIFT module serial code parallelizations
CN104536898B (en) * 2015-01-19 2017-10-31 浙江大学 The detection method of c program parallel regions
CN104536898A (en) * 2015-01-19 2015-04-22 浙江大学 C-program parallel region detecting method
CN108139931A (en) * 2015-10-16 2018-06-08 高通股份有限公司 It synchronizes to accelerate task subgraph by remapping
CN105468445A (en) * 2015-11-20 2016-04-06 Tcl集团股份有限公司 WEB-based Spark application program scheduling method and system
CN105468445B (en) * 2015-11-20 2020-01-14 Tcl集团股份有限公司 WEB-based Spark application program scheduling method and system
CN105589736A (en) * 2015-12-21 2016-05-18 西安电子科技大学 Hardware description language simulation acceleration method based on net list segmentation and multithreading paralleling
CN105589736B (en) * 2015-12-21 2019-03-26 西安电子科技大学 Hardware description language based on netlist segmentation and multi-threaded parallel emulates accelerated method
CN107688488B (en) * 2016-08-03 2020-10-20 中国移动通信集团湖北有限公司 Metadata-based task scheduling optimization method and device
CN107688488A (en) * 2016-08-03 2018-02-13 中国移动通信集团湖北有限公司 A kind of optimization method and device of the task scheduling based on metadata
CN106208101A (en) * 2016-08-17 2016-12-07 上海格蒂能源科技有限公司 Intelligent follow-up electric energy apparatus for correcting and show merit analyze method
CN108255492A (en) * 2016-12-28 2018-07-06 学校法人早稻田大学 The generation method of concurrent program and parallelizing compilers device
CN108536514B (en) * 2017-03-01 2020-10-27 龙芯中科技术有限公司 Hot spot method identification method and device
CN108536514A (en) * 2017-03-01 2018-09-14 龙芯中科技术有限公司 A kind of recognition methods of hotspot approach and device
CN107193535A (en) * 2017-05-16 2017-09-22 中国人民解放军信息工程大学 Based on the parallel implementation method of the nested cyclic vector of SIMD extension part and its device
CN107193535B (en) * 2017-05-16 2019-11-08 中国人民解放军信息工程大学 Based on the parallel implementation method of the nested cyclic vector of SIMD extension component and its device
CN107239334A (en) * 2017-05-31 2017-10-10 清华大学无锡应用技术研究院 Handle the method and device irregularly applied
CN108762905A (en) * 2018-05-24 2018-11-06 苏州乐麟无线信息科技有限公司 Multitask event processing method and apparatus
CN108733832A (en) * 2018-05-28 2018-11-02 北京阿可科技有限公司 Distributed storage method for directed acyclic graphs
WO2020083050A1 (en) * 2018-10-23 2020-04-30 华为技术有限公司 Data stream processing method and related device
US11900113B2 (en) 2018-10-23 2024-02-13 Huawei Technologies Co., Ltd. Data flow processing method and related device
CN113302592A (en) * 2018-12-12 2021-08-24 纬湃科技有限责任公司 Method for controlling an engine control unit having a multi-core processor
US11907757B2 (en) 2018-12-12 2024-02-20 Vitesco Technologies GmbH Method for controlling a multicore-processor engine control unit
CN109783157A (en) * 2018-12-29 2019-05-21 深圳云天励飞技术有限公司 Algorithm program loading method and related apparatus
WO2020134830A1 (en) * 2018-12-29 2020-07-02 深圳云天励飞技术有限公司 Algorithm program loading method and related apparatus
CN109951556A (en) * 2019-03-27 2019-06-28 联想(北京)有限公司 Spark task processing method and system
US11934832B2 (en) 2019-06-24 2024-03-19 Huawei Technologies Co., Ltd. Synchronization instruction insertion method and apparatus
WO2020259560A1 (en) * 2019-06-24 2020-12-30 华为技术有限公司 Method and apparatus for inserting synchronization instruction
CN110377428A (en) * 2019-07-23 2019-10-25 上海盈至自动化科技有限公司 Distributed data analysis and control system
CN111062646B (en) * 2019-12-31 2023-11-24 芜湖哈特机器人产业技术研究院有限公司 Multilayer nested loop task dispatching method
CN111062646A (en) * 2019-12-31 2020-04-24 芜湖哈特机器人产业技术研究院有限公司 Multilayer nested loop task dispatching method
CN112631610B (en) * 2020-11-30 2022-04-26 上海交通大学 Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure
CN112631610A (en) * 2020-11-30 2021-04-09 上海交通大学 Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure
CN112486658A (en) * 2020-12-17 2021-03-12 华控清交信息科技(北京)有限公司 Task scheduling method and device for task scheduling
CN112835696A (en) * 2021-02-08 2021-05-25 兴业数字金融服务(上海)股份有限公司 Multi-tenant task scheduling method, system and medium
CN112835696B (en) * 2021-02-08 2023-12-05 兴业数字金融服务(上海)股份有限公司 Multi-tenant task scheduling method, system and medium
CN113704076A (en) * 2021-10-27 2021-11-26 北京每日菜场科技有限公司 Task optimization method and device, electronic equipment and computer readable medium
CN115658274A (en) * 2022-11-14 2023-01-31 之江实验室 Modular scheduling method and device for neural network reasoning in core grain and computing equipment
CN117610320A (en) * 2024-01-23 2024-02-27 中国人民解放军国防科技大学 Directed acyclic graph workflow engine cyclic scheduling method, device and equipment
CN117610320B (en) * 2024-01-23 2024-04-02 中国人民解放军国防科技大学 Directed acyclic graph workflow engine cyclic scheduling method, device and equipment

Similar Documents

Publication Publication Date Title
CN103377035A (en) Pipeline parallelization method for coarse-grained streaming application
CN105956021B (en) Automated task parallelization method and system suitable for distributed machine learning
CN104965761B (en) Multi-granularity partitioning and scheduling method for serial programs based on GPU/CPU hybrid architecture
KR102257028B1 (en) Apparatus and method for allocating deep learning task adaptively based on computing platform
CN103858099B (en) Method and system for application execution, circuit with machine instructions
CN101963918B (en) Method for realizing virtual execution environment of central processing unit (CPU)/graphics processing unit (GPU) heterogeneous platform
US9152389B2 (en) Trace generating unit, system, and program of the same
Walsh et al. Efficient learning of action schemas and web-service descriptions.
US20100023731A1 (en) Generation of parallelized program based on program dependence graph
Zheng et al. AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures
CN110383247A (en) Method, computer-readable medium and heterogeneous computing system performed by computer
CN103970602A (en) Data flow program scheduling method for X86 multi-core processors
US20140281407A1 (en) Methods and apparatus to compile instructions for a vector of instruction pointers processor architecture
CN112183735A (en) Method and device for generating operation data and related product
CN101655783B (en) Forward-looking multithreading partitioning method
Yi et al. Fast training of deep learning models over multiple gpus
Yaacov BPPy: Behavioral Programming in Python
Boucheneb et al. Optimal reachability in cost time Petri nets
CN103559069B (en) Cross-file process optimization method based on algebraic system
Bures et al. Towards intelligent ensembles
Carle et al. Static extraction of memory access profiles for multi-core interference analysis of real-time tasks
JP4946323B2 (en) Parallelization program generation method, parallelization program generation apparatus, and parallelization program generation program
Van der Poll Formal methods in software development: A road less travelled
Nguyen et al. No! Not another deep learning framework
Francis et al. Optimisation modelling for software developers

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20131030