Summary of the invention
The invention provides a storage-aware multilevel partition scheduling method. Its purpose is to manage, allocate and schedule resources by repeatedly and reasonably partitioning the iteration space, thereby overcoming the long completion time and high energy consumption caused by the unreasonable resource allocation of prior-art scheduling strategies, and at the same time overcoming the data loss easily caused by the limited capacity of local storage.
A storage-aware multilevel partition scheduling method comprises the following steps:
Step 1: one execution of all tasks is taken as one iteration; the iteration space, formed by repeatedly executing the group of tasks that have an execution order, is taken as the partitioning object. Determine the directions of the partition vectors (P_i, P_j) of the iteration space, where the size of a partition along the P_j direction is f and its size along the P_i direction is h. Find the two outermost dependences CW and CCW in the dependence set D between tasks, and set P_i = CCW and P_j = CW;
Step 2: determine the relational expressions between the partition vector sizes f and h and the amount of data the current iteration must load and store, and the schedule length Ls of the current iteration;
Step 3: determine the sizes f and h of the partition vectors according to strategy one and strategy two;
1) Set f to 1 and compute h according to strategy one and strategy two, obtaining h1 and h2 respectively;
Strategy one: 2*NUM_other + NUM_top + NUM_next ≤ M_s
Strategy two: (NUM_top + NUM_other)*Tw + NUM_other*Tr ≤ Ls*f*h;
2) If h1 > h2, then the value of h is h2; compute f according to strategy one and strategy two, obtaining f1 and f2 respectively, and go to 3); otherwise the value of the partition size h is h1 and the value of f is 1;
3) If f1 > 1, the partition size is determined as f1*h2; otherwise the partition size is f*h1 with f = 1;
Step 4: apply the iteration retiming technique, dispersing the delays between tasks so as to change the inner-loop dependences between tasks, and reconstruct the partition space;
Step 5: divide the partition space according to the first-level partition size f*h; each sub-partition produced by the first-level partitioning is treated as one node, that is, as one cluster task, forming a new iteration space; partition each sub-partition in turn according to step 1 to obtain the direction vectors (P2_i, P2_j) of the second-level partitioning;
Step 6: determine the size of the second-level partition vectors;
The size of the second-level partition vector along the P2_i direction is N_core and its size along the P2_j direction is 1, where N_core is the number of processor cores;
Step 7: obtain the execution order graph of the partitioned iteration space according to the two partition vectors obtained, and schedule the tasks according to the execution order graph.
The concrete process of determining the directions of the partition vectors in step 1 is as follows:
A dependence between tasks refers to the execution order between the tasks and is denoted d_k = (d_ki, d_kj), where d_ki represents the execution dependence of the two tasks in the inner loop and d_kj represents their execution dependence in the outer loop. Find the two outermost dependences CW and CCW in the dependence set D between tasks, and set P_i = CCW and P_j = CW;
CCW, the counterclockwise boundary vector, refers to the vector with the largest angle to the j vector, and CW, the clockwise boundary vector, refers to the vector with the smallest angle to the j vector.
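As an illustration, the angular-extremum rule above can be sketched as follows. This is a minimal sketch under assumptions: delays are 2-tuples (d_i, d_j), the angle is measured from the j axis, and the function name is hypothetical; the exact angle convention used in the embodiment may differ.

```python
import math

def boundary_vectors(delays):
    """Pick the CW/CCW boundary vectors from a set of delay vectors
    (d_i, d_j): CCW is the vector with the largest angle to the j axis
    (the vector (0, 1)), CW the one with the smallest angle."""
    def angle_to_j_axis(d):
        di, dj = d
        # atan2(di, dj) is the angle between (di, dj) and (0, 1).
        return math.atan2(di, dj)
    nonzero = [d for d in delays if d != (0, 0)]
    cw = min(nonzero, key=angle_to_j_axis)
    ccw = max(nonzero, key=angle_to_j_axis)
    return cw, ccw
```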
Here, a dependence between tasks refers to the execution order between the tasks, that is, the relation that one task must wait until another task finishes before it can be executed. In a computer the task execution order is usually represented by a graph: each node in the graph represents a task, and an edge between two nodes represents the dependence between their tasks, namely the specific constraint imposed on the task execution order. One iteration means that every task is executed once, and all iterations constitute the iteration space. An iteration is a loop: i indicates that a task is executed for the i-th time in the inner loop, and j indicates the j-th loop of the outer loop, i.e. all tasks being executed for the j-th time.
The relational expressions in step 2 between the partition vector sizes f and h and the amount of data the current iteration must load and store are as follows:
(1) The amount of data the current iteration must load and store comprises two parts: the first part, the amount of data produced by the current iteration, is NUM_next + NUM_top + NUM_other; the second part, the amount of data prefetched for the current iteration and the next iteration, is NUM_other.
Here NUM_next denotes the amount of data produced by this partition and urgently needed by the next partition; NUM_top denotes the amount of data produced by this partition and needed by the partition that is perpendicular to the P_i direction and adjacent to the current partition; NUM_other denotes the amount of data produced by this partition and needed by all other partitions except the next partition and the top partition;
The schedule length Ls in step 2 refers to the time taken to execute one iteration.
In strategy one and strategy two of step 3, NUM_next denotes the data produced by this partition and urgently needed by the next partition; NUM_top denotes the data produced by this partition and needed by the partition perpendicular to the P_i direction and adjacent to the current partition; NUM_other denotes the data produced by this partition and needed by all other partitions except the next partition and the top partition; Ls denotes the schedule length of each iteration; Tr denotes the time needed to read one datum from main memory; Tw denotes the time needed to write one datum to main memory; M_s denotes the capacity of the SPM (scratch-pad memory).
Retiming is a technique that optimizes the loop period by redistributing delays, and rotation scheduling is a resource-constrained scheduling optimization strategy based on retiming that obtains a more compact schedule by redistributing delays. The partition scheduling technique combines iteration retiming with prefetching: each iteration is regarded as one point, the iteration space is then partitioned (note that one iteration means executing all tasks once, and the iteration space comprises all iterations), and the partitions are executed one by one. Because of the dependences between tasks, partitioning must focus on how to make the partitions legal; that is, there must be no cycle between partitions, so that the partitions can be scheduled and executed one after another.
Beneficial effect
The invention provides a storage-aware multilevel partition scheduling method. The first step derives the correct shape of the first-level partition from the dependences between tasks, that is, determines the directions of the two partition vectors, denoted (P_i, P_j). The second step obtains the relational expressions between the partition vector sizes f and h and the amount of data each loop must load and store, and the schedule length Ls of each iteration. The third step decides the sub-partition size of the first-level partitioning according to the local storage capacity and the schedule length. The fourth step uses the iteration retiming technique to change the dependences between tasks and reconstruct the partition positions. The fifth step, on the basis of the first-level partitioning, treats each sub-partition of the first-level partitioning as one node, rebuilds a partition space, and performs the second-level partitioning. Finally, the execution order graph of the partitioned iteration space is obtained from the two partition vectors, and the tasks are scheduled according to it. The storage-aware multilevel partition scheduling method takes both memory capacity and memory latency into account; by partitioning and by adjusting the dependences between tasks it improves task parallelism, reduces the number of write operations, and reduces the average scheduling time.
The selection of partition shape and direction is very strict: because of the dependences between tasks, an unreasonable partitioning scheme will make the tasks unexecutable, while a reasonable partitioning reduces the task scheduling time and the number of write operations generated. The method of the invention takes both memory capacity and memory latency into account and effectively improves the overall performance of the system.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings.
To manage resource blocks more conveniently, we first model the task execution order and represent it with two-dimensional coordinates: i represents the direction of the inner loop and j represents the direction of the outer loop. Each loop can be denoted iteration(i, j), and one iteration means that all tasks are executed once.
In this example, the configuration of the computer executing the tasks is examined: the processor has 3 cores, the SPM embedded in each core has a capacity of 128 KB, each core has 3 computing units, the time needed to read one datum from main memory is Tr = 2 clock cycles, and the time needed to write one datum to main memory is Tw = 4 clock cycles.
Fig. 1 is a flowchart of the storage-aware multilevel partition scheduling method of the present invention; its concrete operation steps are as follows:
Step 1: one execution of all tasks is taken as one iteration; the iteration space, formed by repeatedly executing the group of tasks that have an execution order, is taken as the partitioning object. Determine the directions of the partition vectors (P_i, P_j) of the iteration space, where the size of a partition along the P_j direction is f and its size along the P_i direction is h. Find the two outermost dependences CW and CCW in the dependence set D between tasks, and set P_i = CCW and P_j = CW;
Here, a dependence between tasks refers to the execution order between the tasks, that is, the relation that one task must wait until another task finishes before it can be executed. In a computer the task execution order is usually represented by a graph: each node represents a task, and an edge between two nodes represents the dependence between their tasks, namely the specific constraint imposed on the task execution order. One iteration means that every task is executed once, and all iterations constitute the iteration space. An iteration is a loop, and every task is executed many times; the execution instance of a task A is denoted A[i, j], meaning that task A is executed once in the i-th loop of the inner loop and the j-th loop of the outer loop.
Fig. 3 shows a concrete two-dimensional application task model and the corresponding task graph MDFG. The task graph MDFG = <V, E, d, t> is a two-dimensional graph with node weights and edge weights, where V is the set of nodes, each node representing one task of the application; E represents the dependences between tasks, and (u, v) ∈ E means that a dependence exists between node u and node v; d(e) = (d_i, d_j) represents a delay and describes the concrete dependence between two tasks.
If the dependence between task a and task b is expressed as d_k = (x, y), it means that task b of iteration(i, j) depends on task a of iteration(i-x, j-y); d_k = (0, 0) represents a dependence between tasks of the same iteration. The two-dimensional application in Fig. 2 comprises 4 tasks, A, B, C and D; for example, the dependence between task A and task B is d_k = (0, 0), and the dependence between task C and task D is d_k = (0, 1). The relation of CCW and CW to the dependence set D = {d1, d2, d3, d4, d5} between tasks is shown in Fig. 4.
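The delay semantics just described can be illustrated with a short sketch; the helper name producer_iteration is hypothetical.

```python
def producer_iteration(consumer, delay):
    """Given that a task of iteration (i, j) depends on another task
    with delay d_k = (x, y), return the iteration (i - x, j - y) in
    which the producing task instance runs."""
    i, j = consumer
    x, y = delay
    return (i - x, j - y)

# Task D of iteration (3, 5), with dependence d_k = (0, 1) on task C,
# uses the C produced in iteration (3, 4); d_k = (0, 0) stays within
# the same iteration.
```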
In this example there are 3 nonzero delay vectors in the dependence set D: (0, 1), (1, 0) and (1, 1). Hence (1, 0) is the CW vector and (1, 1) is the CCW vector, and the first-level partition vectors are P_i = CCW and P_j = CW;
Cyclic applications have the basic property that every iteration (loop) executes the same tasks in the same order, so the dependences possessed by each iteration are identical. To find CW (the clockwise boundary vector) and CCW (the counterclockwise boundary vector) more easily, we represent all dependences as vectors; CCW then refers to the vector with the largest angle to the j vector, and CW refers to the vector with the smallest angle to the j vector.
Find the two outermost dependences CW and CCW in D, with P_i = CCW and P_j = CW, i.e. d_4 = CCW and d_3 = CW;
Step 2: compute the relational expressions between f and h and the amount of data the current iteration must load and store, expressed as equations containing h or f respectively (that is, assuming the other of f and h is fixed), and the schedule length Ls of the current iteration;
(1) The amount of data the current iteration must load and store comprises two parts: the first part, the amount of data produced by the current iteration, is NUM_next + NUM_top + NUM_other; the second part, the amount of data prefetched for the current iteration and the next iteration, is NUM_other. (2) The schedule length refers to the time taken to execute one iteration. In this example all task execution times are set equal; assuming one iteration contains n tasks executed on m cores, the schedule length is Ls = (n/m) × the execution time of one task, and the execution time of one task is one unit of time, i.e. 1 clock cycle.
Here NUM_next denotes the amount of data produced by this partition and urgently needed by the next partition; NUM_top denotes the amount of data produced by this partition and needed by the partition perpendicular to the P_i direction and adjacent to the current partition; NUM_other denotes the amount of data produced by this partition and needed by all other partitions except the next partition and the top partition;
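The schedule-length formula of (2) can be illustrated as follows; the text leaves the rounding of n/m implicit, so rounding up is an assumption here.

```python
import math

def schedule_length(n_tasks, n_cores, task_time=1):
    """Ls = (n/m) * per-task time, for n equal-time tasks spread over
    m cores. We round up, since a partially filled round of cores
    still occupies a whole time step."""
    return math.ceil(n_tasks / n_cores) * task_time

# 4 tasks on 3 cores with unit task time need two time steps.
```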
Step 3: determine the partition size f*h;
1) With f = 1, obtain h1 and h2 from strategy one and strategy two respectively;
Strategy one: 2*NUM_other + NUM_top + NUM_next ≤ M_s
Strategy two: (NUM_top + NUM_other)*Tw + NUM_other*Tr ≤ Ls*f*h;
where NUM_next denotes the amount of data produced by this partition and urgently needed by the next partition; NUM_top denotes the amount of data produced by this partition and needed by the partition perpendicular to the P_i direction and adjacent to the current partition; NUM_other denotes the amount of data produced by this partition and needed by all other partitions except the next partition and the top partition; Ls denotes the schedule length of each iteration; Tr denotes the time needed to read one datum from main memory; Tw denotes the time needed to write one datum to main memory; M_s denotes the capacity of the SPM (scratch-pad memory);
2) Compare h1 and h2: if h1 is greater than h2, set h = h2 and compute f1 using strategy one and strategy two; otherwise the partition size is f*h1 with f = 1;
3) If f1 is greater than 1, the partition size is determined as f1*h2; otherwise the partition size is f*h1 with f = 1;
In this example, the first-level partition size is found to be f = 1, h = 4.
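A sketch of the step-3 procedure is given below. The patent does not specify NUM_next, NUM_top and NUM_other as functions of f and h (they depend on the application's delay set), so the models passed in are assumptions; so is the reading of strategy one as a capacity upper bound on h and strategy two as a latency-hiding lower bound, which the text leaves implicit.

```python
def first_level_partition_size(num_next, num_top, num_other,
                               Ms, Tr, Tw, Ls, max_size=64):
    """Sketch of step 3. num_next/num_top/num_other are assumed models
    (f, h) -> data volume; Ms is the SPM capacity, Tr/Tw the per-datum
    read/write times, Ls the schedule length of one iteration."""

    def fits_spm(f, h):       # strategy one: working set fits the SPM
        return 2 * num_other(f, h) + num_top(f, h) + num_next(f, h) <= Ms

    def hides_traffic(f, h):  # strategy two: traffic hidden by compute
        traffic = (num_top(f, h) + num_other(f, h)) * Tw \
                  + num_other(f, h) * Tr
        return traffic <= Ls * f * h

    # 1) f = 1: h1 = largest h inside the SPM, h2 = smallest h whose
    #    computation time covers the memory traffic.
    h1 = max((h for h in range(1, max_size + 1) if fits_spm(1, h)),
             default=0)
    h2 = next((h for h in range(1, max_size + 1) if hides_traffic(1, h)),
              None)

    # 2)-3) If capacity still has slack at h2, fix h = h2 and grow f
    #       under both strategies; otherwise keep f = 1, h = h1.
    if h2 is not None and h1 > h2:
        f1 = max((f for f in range(1, max_size + 1)
                  if fits_spm(f, h2) and hides_traffic(f, h2)),
                 default=1)
        if f1 > 1:
            return f1, h2
    return 1, h1
```

With toy linear models (NUM_next growing with h, NUM_top with f, NUM_other constant) the procedure returns a concrete (f, h) pair; the real models must come from the application's dependence set.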
After the first-level partitioning of the iteration space, the partition diagrams are shown in Fig. 5: diagram (a) is an unreasonable partitioning, and diagram (b) is the execution order graph obtained from the partitioning of diagram (a); it can be seen that this execution order contains a cycle and cannot be executed. Diagram (c) is a reasonable partitioning obtained with the method of the invention, and diagram (d) is the execution order graph obtained from diagram (c). As diagram (d) shows, after the first-level partitioning only two kinds of dependences remain between partitions: (1, 0) and (1, 1).
Step 4: apply the iteration retiming technique, dispersing the delays between tasks so as to change the inner-loop dependences between tasks, and reconstruct the partition space;
Retiming is a technique that optimizes the loop period by redistributing delays, and rotation scheduling is a resource-constrained scheduling optimization strategy based on retiming that obtains a more compact schedule by redistributing delays. The partition scheduling technique combines iteration retiming with prefetching: each iteration is regarded as one point, the iteration space is then partitioned, and the partitions are executed one by one. Because of the dependences between tasks, partitioning must focus on how to make the partitions legal; that is, there must be no cycle between partitions, so that the partitions can be scheduled and executed one after another.
To obtain a more compact schedule we apply iteration retiming to change the dependences between tasks; the iteration retiming technique reconstructs the dependences between tasks by redistributing the delays between them, thereby shortening the task execution period. To preserve the row-wise execution order, the retiming must not change the dependences between partitions, so we change only the innermost-loop delays of the dependences between tasks. For example, if the delay d_3 = (1, 1) exists between task A and task B, then after the delays are redistributed the delay between A and B becomes d_3 = (0, 1); that is, before the iteration retiming the execution of task A in iteration(i, j) had to depend on task B in iteration(i+1, j-1), whereas after the retiming changes the delay, the execution of task A in iteration(i, j) depends on the B executed in iteration(i, j-1), as shown in Fig. 6.
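The delay redistribution can be sketched with the classical retiming relation d'(e) = d(e) + r(v) - r(u) for an edge u -> v; the per-node retiming values in the example call are illustrative assumptions.

```python
def retime_edge(delay, r_u, r_v):
    """Classical retiming: for an edge u -> v with delay d(e), the
    retimed delay is d'(e) = d(e) + r(v) - r(u), applied
    component-wise. Legality requires every retimed delay to remain
    non-negative."""
    return tuple(d + rv - ru for d, rv, ru in zip(delay, r_v, r_u))

# The d_3 = (1, 1) dependence becomes (0, 1) when the retiming values
# of the two endpoints differ by -1 along the inner dimension.
```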
Step 5: divide the partition space according to the first-level partition size f*h; each sub-partition produced by the first-level partitioning is treated as one node, that is, as one cluster task, forming a new iteration space; obtain the directions P2_i and P2_j of the second-level partitioning according to step 1.
The first-level partitioning partitions the iteration space, while the second-level partitioning partitions the sub-partitions of the first-level partitioning; each first-level sub-partition is then denoted partition(i, j).
The framework of the task scheduling is shown in Fig. 7 and Fig. 8: Fig. 7 illustrates the execution order of the task scheduling, and Fig. 8 illustrates the scheduling of one task. In the method, for convenience, we classify the first-level partitions (first_level_partition) into three classes relative to the currently executing first_level_partition: the next first_level_partition, the top first_level_partition and the other first_level_partitions. Each first_level_partition is further divided into regions according to where its data are used, as shown in Fig. 8(a); there are four regions in total: the first region holds the produced data that tasks of this partition will use, the second region holds the produced data that tasks in the next first_level_partition need, the third region holds the produced data that tasks in the top first_level_partition need, and the fourth region holds the produced data that the other first_level_partitions need. In this way we can quickly compute, for each delay d(e): d_k = (d_ki, d_kj) in a first_level_partition, the amount of produced data supplied to the other first_level_partitions:
A_goto_top(d_k) = area(PQVU) = d_ki*(f - d_kj)
A_goto_next(d_k) = area(VSWX) = d_kj*(h - d_ki)
A_goto_other(d_k) = area(UVRS) = d_ki*d_kj
Further, once the partition size of a first_level_partition is determined, we can quickly compute how much of the data produced inside a first_level_partition must be stored.
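The area formulas above translate directly into a small helper; this sketch assumes d_ki ≤ h and d_kj ≤ f so that the areas are non-negative.

```python
def partition_data_areas(d_ki, d_kj, f, h):
    """For a delay d_k = (d_ki, d_kj) inside a first-level partition of
    size f x h, return the amounts of produced data consumed by the
    top, next and other partitions, per the area formulas above."""
    a_top = d_ki * (f - d_kj)    # A_goto_top  = area(PQVU)
    a_next = d_kj * (h - d_ki)   # A_goto_next = area(VSWX)
    a_other = d_ki * d_kj        # A_goto_other = area(UVRS)
    return a_top, a_next, a_other
```

For instance, with the first-level partition size f = 1, h = 4 of this example, a delay (1, 0) sends one datum to the top partition and nothing elsewhere.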
Step 6: determine the size of the second-level partition; the number of processor cores, N_core = 3, is obtained from the hardware configuration information. The size along the P2_i direction is then N_core = 3, and the size along the P2_j direction is 1.
Step 7: obtain the execution order graph of the partitioned iteration space according to the two partition vectors obtained, and schedule the tasks according to the execution order graph.
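As an illustration of step 7, the sketch below orders a grid of partitions from given inter-partition dependence vectors (for example, the (1, 0) and (1, 1) that remain after the first-level partitioning in Fig. 5). The wavefront/topological order it produces is one legal schedule; the function is an assumption, not the patent's exact procedure.

```python
def execution_order(n_i, n_j, deps):
    """Topologically order an n_i x n_j grid of partitions whose
    inter-partition dependences are the given delay vectors:
    partition (i, j) depends on partition (i - x, j - y) for each
    (x, y) in deps. Raises if the execution order graph is cyclic."""
    order, done = [], set()
    pending = {(i, j) for i in range(n_i) for j in range(n_j)}
    while pending:
        ready = [p for p in pending
                 if all((p[0] - x, p[1] - y) in done
                        or not (0 <= p[0] - x < n_i
                                and 0 <= p[1] - y < n_j)
                        for x, y in deps)]
        if not ready:  # no partition is executable: illegal partitioning
            raise ValueError("cyclic execution order graph")
        for p in sorted(ready):
            order.append(p)
            done.add(p)
            pending.discard(p)
    return order
```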
Fig. 9 compares the method of the invention, TLP, with two other algorithms, List and IRP, in average task scheduling time on multiple benchmarks. As the figure shows, TLP (the storage-aware multilevel partition task scheduling strategy) clearly outperforms the other two scheduling algorithms in average task scheduling time, with a performance improvement of about 30%. This is because the storage-aware partition scheduling strategy considers not only task parallelism when partitioning and scheduling but also memory latency, ensuring that the memory schedule is no longer than the processor schedule; some memory delays are thereby avoided and waiting time is saved, which improves system performance and reduces scheduling time.
Fig. 10 compares the method of the invention, TLP, with the two other algorithms, List and IRP, in the number of write operations on multiple benchmarks. As the figure shows, TLP (the storage-aware multilevel partition task scheduling strategy) reduces write operations by about 45% on average compared with the other two algorithms. This is because the partitioning strategy of the invention fully considers the capacity of local storage: through two levels of partitioning it ensures, as far as possible, that the data needed by the partition handled by each core reside in local storage, which saves a large number of write operations and correspondingly reduces scheduling time and energy consumption, improving system performance. However, when the capacity of local storage is fixed, the data to be stored grow as the task scale expands, the number of write operations also increases, the average task scheduling time rises, and system performance falls.
IIR, 2D, WDF(1), WDF(2), DPCM(1), DPCM(2), DPCM(3), FLOYD(1), FLOYD(2) and FLOYD(3) in Fig. 9 and Fig. 10 are data-processing benchmarks.
The tasks in the present invention are multidimensional DSP applications, but the proposed storage-aware multilevel partitioning strategy can be extended to n-dimensional DSP and other applications with cyclic characteristics.