Summary of the invention
The invention provides a memory-aware multi-level partition scheduling method. Its object is, by reasonably partitioning the iteration space multiple times and by managing the allocation and scheduling of resources, to overcome the unreasonable resource allocation of the prior art that leads to long completion times and high energy consumption when a scheduling strategy executes, and at the same time to overcome the data loss that easily arises from the limitations of local memory.
A memory-aware multi-level partition scheduling method comprises the following steps:
Step 1: one execution of all tasks is taken as one iteration, and the iteration space built from the repeated executions of a group of tasks with an execution order is taken as the partitioning object. The directions of the partition vectors (P_i, P_j) of the iteration space are determined; the partition size is f in the P_j direction and h in the P_i direction. The two outermost dependences CW and CCW are found from the dependence set D between tasks, and P_i = CCW, P_j = CW;
Step 2: determine the relational expressions between the data sizes that the current iteration needs to load and store and the partition vector sizes f and h, together with the schedule length Ls of the current iteration;
Step 3: determine the sizes f and h of the partition vectors according to strategy one and strategy two:
1) Set f = 1 and calculate h according to strategy one and strategy two, obtaining h1 and h2 respectively;
Strategy one: 2 × NUM_other + NUM_top + NUM_next ≤ Ms;
Strategy two: (NUM_top + NUM_other) × Tw + NUM_other × Tr ≤ Ls × f × h;
2) If h1 > h2, then h takes the value h2; strategy one and strategy two are used to calculate f, obtaining f1 and f2 respectively, and the method proceeds to 3);
Otherwise, the partition size h takes the value h1 and f is 1;
3) If f1 > 1, the partition size is f1 × h2; otherwise the partition size is f × h1 with f = 1;
Step 4: using the iterational retiming technique, the delays between tasks are redistributed to change the inner-loop dependences between tasks, and the partition space is reconstructed;
Step 5: the first-level partition space is divided according to the partition size f × h; each sub-partition produced by the first-level partitioning is treated as one node, i.e. as one task cluster, constituting a new iteration space; each sub-partition is then partitioned according to step 1, obtaining the direction vectors (P2_i, P2_j) of the second-level partitioning;
Step 6: determine the size of the second-level partition vectors;
The second-level partition size is N_core in the P2_i direction and 1 in the P2_j direction, where N_core is the number of processor cores;
Step 7: after the iteration space is partitioned according to the two partition vectors obtained, an execution-order graph is obtained, and the tasks are scheduled according to the execution-order graph.
In step 1, the directions of the partition vectors are determined as follows:
A dependence between tasks refers to the execution order between tasks and is denoted d_k = (d_ki, d_kj), where d_ki represents the execution dependence of the two tasks in the inner loop and d_kj their execution dependence in the outer loop. The two outermost dependences CW and CCW are found from the dependence set D between tasks, and P_i = CCW, P_j = CW;
CCW is the counterclockwise extreme vector, i.e. the vector with the largest angle to the j axis, and CW is the clockwise extreme vector, i.e. the vector with the smallest angle to the j axis.
Here a dependence between tasks refers to the execution order between tasks, i.e. the relation that one task has to wait until another task completes before it can be executed. The execution order of tasks is typically represented in a computer by a graph: each node in the graph represents a task, and an edge between two nodes represents a dependence between tasks, i.e. a specific restriction on the task execution order. One iteration means that all tasks are executed once, and all iterations constitute the iteration space. One iteration is one loop instance: i indicates the i-th execution in the inner loop, and j indicates the j-th execution of the outer loop, i.e. the j-th time all tasks are executed.
In step 2, the relational expressions between the data sizes that the current iteration needs to load and store and the partition vector sizes f and h are as follows:
(1) The data that the current iteration needs to load and store consist of two parts: the first part, the data produced by the current iteration, of size NUM_next + NUM_top + NUM_other; the second part, the data prefetched for the current and following iterations, of size NUM_other;
where NUM_next denotes the size of the data produced by this partition and urgently needed by the next partition, and NUM_top denotes the size of the data produced by this partition and required by the partition that is perpendicular to the P_i direction and adjacent to the current partition;
NUM_other denotes the size of the data produced by this partition and needed by all other partitions except the next partition and the top partition;
In step 2, the schedule length Ls refers to the time one iteration takes to execute.
In strategy one and strategy two of step 3, NUM_next denotes the data produced by this partition and urgently needed by the next partition; NUM_top denotes the data produced by this partition and required by the partition that is perpendicular to the P_i direction and adjacent to the current partition; NUM_other denotes the data produced by this partition and needed by all other partitions except the next partition and the top partition; Ls denotes the schedule length of each iteration; Tr denotes the time required to read data from main memory; Tw denotes the time required to write data to main memory; and Ms denotes the capacity of the scratch-pad memory (SPM).
Retiming is a technique that optimizes the loop period by redistributing delays, and rotation scheduling is a resource-constrained scheduling optimization strategy based on retiming that obtains a more compact schedule by redistributing delays. The partition scheduling technique combines iterational retiming with prefetching: each iteration is regarded as one point, and the iteration space is then divided (note that one iteration means all tasks are executed once, and the iteration space comprises all iterations) so that the partitions are executed one after another. Because of the dependences between tasks, the partitioning must be chosen carefully so that it is legal, i.e. there is no dependence cycle between blocks and the partitions can be scheduled and executed one by one.
Beneficial effects
The invention provides a memory-aware multi-level partition scheduling method. In the first step, the correct shape of the first-level partitioning, i.e. the directions of the two partition vectors, denoted (P_i, P_j), is derived from the task dependences. In the second step, the relational expressions between the data sizes each iteration needs to load and store and the partition vector sizes f and h, and the schedule length Ls of each iteration, are obtained. In the third step, the sub-partition size of the first-level partitioning is determined from the local memory capacity and the schedule length. In the fourth step, the iterational retiming technique is used to change the dependences between tasks and reconstruct the partition positions. In the fifth step, on the basis of the first-level partitioning, each sub-partition of the first-level partitioning is treated as one node to rebuild a partition space, and the second-level partitioning is carried out. After the iteration space is partitioned according to the two partition vectors obtained, an execution-order graph is obtained, and the tasks are scheduled according to it. The memory-aware multi-level partition scheduling method takes both memory capacity and memory latency into account; by executing in partitions and by adjusting the dependences between tasks, it improves task parallelism, reduces the number of write operations and reduces the average scheduling time.
The choice of partition shape and direction is very strict: because of the dependences between tasks, an unreasonable partitioning scheme makes the tasks unexecutable, whereas a reasonable partitioning reduces the task scheduling time and the number of write operations. The method of the invention takes memory capacity and memory latency into account and effectively improves the overall performance of the system.
Detailed description of the invention
The present invention is described in further detail below with reference to the accompanying drawings.
For more convenient block resource management, the task execution order is first modeled and represented in two-dimensional coordinates: i represents the direction of the inner loop and j the direction of the outer loop. Each loop instance can be denoted iteration(i, j), and one iteration means that all tasks are executed once.
In this example, the configuration of the computer executing the tasks is examined: the processor has 3 cores, the SPM embedded on each core has a capacity of 128 KB, each core has 3 computing units, the time required to read data from main memory is Tr = 2 clock cycles, and the time required to write data to main memory is Tw = 4 clock cycles.
Fig. 1 is a flow chart of the memory-aware multi-level partition scheduling method of the invention; the concrete operation steps are as follows:
Step 1: one execution of all tasks is taken as one iteration, and the iteration space built from the repeated executions of the group of tasks with an execution order is taken as the partitioning object. The directions of the partition vectors (P_i, P_j) of the iteration space are determined; the partition size is f in the P_j direction and h in the P_i direction. The two outermost dependences CW and CCW are found from the dependence set D between tasks, and P_i = CCW, P_j = CW;
Here a dependence between tasks refers to the execution order between tasks, i.e. the relation that one task has to wait until another task completes before it can be executed. The execution order of tasks is typically represented in a computer by a graph: each node represents a task, and an edge between two nodes represents a dependence between tasks, i.e. a specific restriction on the task execution order. One iteration means that all tasks are executed once, and all iterations constitute the iteration space. One iteration is one loop instance: each task is executed a number of times, and a particular execution of a task A is denoted A[i, j], meaning that task A is executed once in the i-th inner-loop iteration of the j-th outer-loop iteration.
Fig. 3 shows a concrete two-dimensional application task model and the corresponding task graph MDFG. The task graph MDFG = <V, E, d, t> is a two-dimensional graph with node weights and edge weights, where a node of V represents a task of the application, E represents the dependences between tasks, (u, v) ∈ E means that a dependence exists between node u and node v, and d(e) = (d_i, d_j) represents a delay describing the concrete dependence between two tasks.
If the dependence from task a to task b is expressed as d_k = (x, y), it means that task b of iteration(i, j) depends on task a of iteration(i - x, j - y). d_k = (0, 0) represents a dependence between tasks of the same iteration. The two-dimensional application in Fig. 2 comprises four tasks, A, B, C and D; the dependence from task A to task B is d_k = (0, 0), and the dependence from task C to task D is d_k = (0, 1). The relation between CCW, CW and the dependence set D = {d1, d2, d3, d4, d5} between tasks is shown in Fig. 4.
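The delay semantics described above can be sketched as follows; this is an illustrative Python fragment, and the function name depends_on is an assumption for demonstration, not part of the claimed method.

```python
def depends_on(dk, i, j):
    """For a delay dk = (x, y), the task in iteration (i, j) depends on
    the producing task in iteration (i - x, j - y)."""
    x, y = dk
    return (i - x, j - y)

# dk = (0, 1): task D in iteration (i, j) depends on task C of the
# previous outer-loop iteration (i, j - 1)
```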
In this example, the dependence set D contains three non-zero delay vectors, (0, 1), (1, 0) and (-1, 1). Hence (1, 0) is the CW vector, (-1, 1) is the CCW vector, and the first-level partition vectors are P_i = CCW and P_j = CW;
Cyclic applications have the basic property that the tasks executed in every iteration (loop) are identical and the execution order of the tasks is identical every time, so the dependences of every iteration are consistent. To find CW (the clockwise extreme vector) and CCW (the counterclockwise extreme vector) more easily, all dependences are represented as vectors; CCW is then the vector with the largest angle to the j axis and CW the vector with the smallest angle to the j axis.
The two outermost dependences CW and CCW are found from D, with P_i = CCW and P_j = CW, i.e. d4 = CCW and d3 = CW;
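The selection of CW and CCW by angle to the j axis can be sketched in Python as follows; the helper name and the use of atan2 to measure the signed angle from the +j axis are illustrative assumptions, not part of the claimed method.

```python
import math

def extreme_dependences(deps):
    """Return (CW, CCW): the non-zero dependence vectors with the
    smallest and largest signed angle to the +j axis. For a vector
    (di, dj), the angle measured counterclockwise from +j is
    atan2(-di, dj)."""
    nonzero = [d for d in deps if d != (0, 0)]
    angle = lambda d: math.atan2(-d[0], d[1])
    return min(nonzero, key=angle), max(nonzero, key=angle)

# for the example set {(0, 1), (1, 0), (-1, 1)} this yields
# CW = (1, 0) and CCW = (-1, 1), i.e. Pi = CCW, Pj = CW
cw, ccw = extreme_dependences([(0, 0), (0, 1), (1, 0), (-1, 1)])
```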
Step 2: calculate the relational expressions between the data sizes that the current iteration needs to load and store and f and h, expressed as equations in h or f respectively (i.e. assuming f or h has already been determined), together with the schedule length Ls of the current iteration;
(1) The data that the current iteration needs to load and store consist of two parts: the first part, the data produced by the current iteration, of size NUM_next + NUM_top + NUM_other; the second part, the data prefetched for the current and following iterations, of size NUM_other. (2) The schedule length refers to the time one iteration takes to execute. In this example, all task execution times are assumed equal; if one iteration contains n tasks and m cores execute them, then the schedule length Ls = (n/m) × the execution time of each task, and the execution time of each task is the unit time, i.e. one clock cycle.
where NUM_next denotes the size of the data produced by this partition and urgently needed by the next partition; NUM_top denotes the size of the data produced by this partition and required by the partition perpendicular to the P_i direction and adjacent to the current partition; NUM_other denotes the size of the data produced by this partition and needed by all other partitions except the next partition and the top partition;
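The schedule length of one iteration can be sketched as below, assuming unit task execution times as in the example; taking the ceiling of n/m so that a partial row of tasks still occupies a full time step is an assumption here (the text writes n/m).

```python
import math

def schedule_length(n_tasks, n_cores, task_time=1):
    """Schedule length Ls of one iteration: n tasks executed on m cores,
    each task taking task_time (one clock cycle in the example)."""
    return math.ceil(n_tasks / n_cores) * task_time

# e.g. 4 tasks on 3 cores with unit task time gives Ls = 2 clock cycles
```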
Step 3: determine the partition size f × h;
1) With f = 1, h1 and h2 are derived from strategy one and strategy two respectively;
Strategy one: 2 × NUM_other + NUM_top + NUM_next ≤ Ms;
Strategy two: (NUM_top + NUM_other) × Tw + NUM_other × Tr ≤ Ls × f × h;
where NUM_next denotes the size of the data produced by this partition and urgently needed by the next partition; NUM_top denotes the size of the data produced by this partition and required by the partition perpendicular to the P_i direction and adjacent to the current partition; NUM_other denotes the size of the data produced by this partition and needed by all other partitions except the next partition and the top partition; Ls denotes the schedule length of each iteration; Tr denotes the time required to read data from main memory; Tw denotes the time required to write data to main memory; and Ms denotes the capacity of the scratch-pad memory (SPM);
2) Compare h1 and h2: if h1 is greater than h2, set h = h2 and use strategy one and strategy two to calculate f1; otherwise the partition size is f × h1 with f = 1;
3) If f1 is greater than 1, the partition size is f1 × h2; otherwise the partition size is f × h1 with f = 1;
In this example, the first-level partition size is found to be f = 1, h = 4.
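The sizing procedure of step 3 can be sketched as follows. This is a Python illustration only: the NUM_* quantities are passed in as hypothetical callables of (f, h), the bounded linear search is an assumption, and the workload model used in the test is invented for demonstration rather than taken from the example.

```python
def pick_partition_size(num_next, num_top, num_other, Ms, Tw, Tr, Ls, h_max=1024):
    """Two-strategy partition sizing sketch. num_* are callables
    (f, h) -> data size; Ms is the SPM capacity."""
    def capacity_ok(f, h):   # strategy one: working set fits in the SPM
        return 2 * num_other(f, h) + num_top(f, h) + num_next(f, h) <= Ms
    def latency_ok(f, h):    # strategy two: memory traffic hidden by compute
        return ((num_top(f, h) + num_other(f, h)) * Tw
                + num_other(f, h) * Tr <= Ls * f * h)
    # 1) with f = 1: h1 = largest h meeting strategy one,
    #                h2 = smallest h meeting strategy two
    h1 = max((h for h in range(1, h_max) if capacity_ok(1, h)), default=1)
    h2 = min((h for h in range(1, h_max) if latency_ok(1, h)), default=h_max)
    if h1 > h2:
        # 2)-3) capacity slack remains: fix h = h2 and grow f
        f1 = max((f for f in range(1, h_max)
                  if capacity_ok(f, h2) and latency_ok(f, h2)), default=1)
        if f1 > 1:
            return f1, h2
    return 1, h1
```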
Fig. 5 is a schematic diagram of the iteration space after the first-level partitioning. Panel (a) shows an unreasonable partitioning and panel (b) the execution-order graph obtained from it; this execution order contains a cycle and therefore cannot be executed. Panel (c) shows the reasonable partitioning obtained with the method of the invention, and panel (d) the execution-order graph obtained from (c). As panel (d) shows, after the first-level partitioning only the two dependences (1, 0) and (-1, 1) remain between blocks.
Step 4: using the iterational retiming technique, the delays between tasks are redistributed to change the inner-loop dependences between tasks, and the partition space is reconstructed;
Retiming is a technique that optimizes the loop period by redistributing delays, and rotation scheduling is a resource-constrained scheduling optimization strategy based on retiming that obtains a more compact schedule by redistributing delays. The partition scheduling technique combines iterational retiming with prefetching: each iteration is regarded as one point, the iteration space is divided, and the partitions are executed one after another. Because of the dependences between tasks, the partitioning must be chosen carefully so that it is legal, i.e. there is no dependence cycle between blocks and the partitions can be scheduled and executed one by one.
To obtain a more compact schedule, the dependences between tasks are changed by one application of the iterational retiming technique, which redistributes the delays between tasks to reconstruct their dependences and thereby shortens the task execution period. To keep the row-by-row execution order, the retiming must not change the dependences between partitions, so only the innermost-loop dependences between tasks are changed. For example, if the delay between task A and task B is d3 = (-1, 1), then after the delays are redistributed the delay between A and B becomes d3 = (0, 1); that is, before the retiming the task A executed in iteration(i, j) depends on the task B in iteration(i + 1, j - 1), whereas after the retiming the A executed in iteration(i, j) depends on the B executed in iteration(i, j - 1), as shown in Fig. 6.
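The effect of the retiming on a delay vector can be sketched as follows; the representation (an edge dictionary and per-node shifts along the inner i-dimension) and the sign convention d'(u→v) = d + r(v) - r(u) are illustrative assumptions.

```python
def retime(edges, r):
    """Apply a retiming r (a per-node shift along the inner i-dimension)
    to the delays of a dependence graph. edges maps an edge (u, v) to a
    delay (di, dj); the retimed delay on u -> v is
    (di + r[v] - r[u], dj), so the outer-loop component dj, and hence
    the dependences between partitions, are unchanged."""
    return {(u, v): (d[0] + r[v] - r[u], d[1])
            for (u, v), d in edges.items()}

# the example from the text: the delay (-1, 1) on the edge from B to A
# becomes (0, 1), so A in iteration (i, j) now depends on B in (i, j - 1)
retimed = retime({("B", "A"): (-1, 1)}, {"A": 1, "B": 0})
```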
Step 5: the first-level partition space is divided according to the partition size f × h; each sub-partition produced by the first-level partitioning is treated as one node, i.e. as one task cluster, constituting a new iteration space, and the directions P2_i and P2_j of the second-level partitioning are obtained according to step 1.
The first-level partitioning partitions the iteration space, and the second-level partitioning partitions the sub-partitions of the first-level partitioning; each sub-block of the first-level partitioning is denoted partition(i, j).
The framework of the task scheduling is shown in Fig. 7 and Fig. 8: Fig. 7 describes the execution order of the task scheduling and Fig. 8 describes the scheduling of one task. In the method, for convenience, the first-level partitions (first_level_partition) are divided into three classes relative to the first_level_partition currently executing: the next first_level_partition, the top first_level_partition and the other first_level_partitions. Each first_level_partition is further divided into sub-regions according to where its data are used. As shown in Fig. 8(a), a partition is divided into four regions: the first region represents data used by the tasks of this partition itself; the second region represents data needed by tasks in the next first_level_partition; the third region represents data needed by tasks in the top first_level_partition; and the fourth region represents data needed by the other first_level_partitions. From each delay d(e): d_k = (d_ki, d_kj) in a first_level_partition, the data produced for the other first_level_partitions can then be calculated quickly:
A_goto_top(d_k) = area(PQVU) = d_ki × (f - d_kj)
A_goto_next(d_k) = area(VSWX) = d_kj × (h - d_ki)
A_goto_other(d_k) = area(UVRS) = d_ki × d_kj
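These area formulas can be sketched directly in Python; the function name and the dictionary layout are illustrative assumptions.

```python
def transfer_areas(dk, f, h):
    """For one delay dk = (dki, dkj) inside a first_level_partition of
    size f x h, the areas of the regions whose data are consumed by the
    top, next and other partitions, per the area formulas in the text."""
    dki, dkj = dk
    return {"top":   dki * (f - dkj),   # area(PQVU)
            "next":  dkj * (h - dki),   # area(VSWX)
            "other": dki * dkj}         # area(UVRS)
```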
Further, once the partition size of a first_level_partition has been determined, the data produced and stored inside a first_level_partition can be calculated quickly.
Step 6: determine the size of the second-level partitioning. The number of processor cores, N_core = 3, is obtained from the hardware configuration information; the size in the P2_i direction is then N_core = 3 and the size in the P2_j direction is 1.
Step 7: after the iteration space is partitioned according to the two partition vectors obtained, an execution-order graph is obtained, and the tasks are scheduled according to the execution-order graph.
As shown in Fig. 9, several benchmarks are used to test the performance, in average task scheduling time, of the method of the invention, TLP, against two other algorithms, List and IRP. As the figure shows, the method of the invention, TLP (the memory-aware multi-level partition task scheduling strategy), clearly outperforms the other two scheduling algorithms in average task scheduling time, with a performance improvement of about 30%. This is because the memory-aware partition scheduling strategy considers not only task parallelism but also memory latency when partitioning and scheduling, ensuring that the memory access time does not exceed the processor scheduling time; this avoids some memory stalls and saves waiting time, thereby improving system performance and reducing scheduling time.
The performance in write operations is shown in Fig. 10, where several benchmarks test the method of the invention, TLP, against the two other algorithms, List and IRP. As the figure shows, TLP (the memory-aware multi-level partition task scheduling strategy) reduces write operations by about 45% on average compared with the other two algorithms. This is because the partition strategy of the invention fully considers the capacity of the local memory: through the two levels of partitioning it ensures, as far as possible, that the data required by the partition each core processes reside in local memory, which saves a large number of write operations and thus reduces the consumption of scheduling time and energy, improving system performance. However, when the capacity of the local memory is fixed, the data to be stored grow as the task scale grows, the number of write operations also increases, the average task scheduling time therefore increases, and system performance decreases.
IIR, 2D, WDF(1 in Fig. 9 and Figure 10), WDF(2), DPCM(1), DPCM(2), DPCM(3), FLOYD(1), FLOYD(2) and FLOYD(3) be data handling utility bencmark and benchmark.
In the present invention the tasks are multidimensional DSP applications, but the proposed memory-aware multi-level partition strategy can be extended to n-dimensional DSP applications and to other applications with cyclic characteristics.