CN103246563A - Multi-layer block scheduling method with storage sensing function - Google Patents

Multi-layer block scheduling method with storage sensing function

Info

Publication number: CN103246563A (application CN201310145363.XA); granted as CN103246563B
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: Wang Yan (王艳), Li Kenli (李肯立), Du Jiayi (杜家宜), Tang Zhuo (唐卓), Xiao Zheng (肖正), Zhu Ningbo (朱宁波)
Assignee (original and current): Hunan University
Application filed by Hunan University; priority/filing date 2013-04-24
Legal status: Active (granted)

Abstract

The invention discloses a multi-layer block scheduling method with storage awareness. The method comprises: first, determining the directions of the partition vectors (P_i, P_j) from the task dependencies; second, deriving the relation between the partition-vector magnitudes f and h and the size of the data to be loaded and stored in each iteration, together with the schedule length Ls of one iteration; third, determining the sub-block size of the first-level partition from the local-memory capacity and the schedule length; fourth, changing the task dependencies by iterational retiming and reconstructing the positions of the partitions; fifth, treating each sub-block of the first-level partition as a node, rebuilding the partition space, and performing a second-level partition; finally, dividing the iteration space according to the two partition vectors to obtain an execution-order graph and scheduling the tasks according to it. By combining memory capacity with memory latency and adjusting both the partitions and the task dependencies, the method improves task parallelism, reduces the number of write operations, and lowers the average scheduling time.

Description

A multi-layer partition scheduling method with storage awareness
Technical field
The present invention relates to a multi-layer partition scheduling method with storage awareness.
Background technology
Most scientific and data-signal-processing applications are iterative, recursive loops. Such tasks meet two challenges when executed on an embedded multiprocessor. First, most data-signal-processing tasks are computation-intensive and data-intensive; for this class of application, an inefficient scheduling strategy produces a large number of write operations and therefore wastes a great deal of time and energy. Second, CPU speed has developed far faster than memory access speed, and the slowness of memory has seriously hindered the improvement of system performance. Although an embedded multiprocessor has its own instruction set and can flexibly realize different computing tasks through software programming, it is constrained by instruction encoding and execution order, by the memory-access bottleneck, and by its fixed control hierarchy, so it often cannot reach top speed and optimal efficiency.
Prefetching is a technique, proposed to counter memory latency, that can effectively improve system performance: data are placed into the cache before they are requested, so that long memory latency can be tolerated. Prior-art prefetching strategies fall into three classes: hardware-based, software-based, and combined hardware/software. A hardware-based prefetching strategy, however, requires extra support logic attached to the cache and depends on dynamic information available during execution. A software-based prefetching strategy relies on compiler technology to analyze a static program and insert prefetch instructions into the program code. Too many prefetch operations, though, lead to an unbalanced schedule, and the memory latency can still be very long.
For this reason, many embedded multiprocessors use a scratch-pad memory (SPM) instead of a cache. An SPM is a small memory embedded on the chip; it is compiler-supported and software-managed. The SPM can in fact be regarded as the local memory of each core: it can further optimize system performance and effectively reduce energy consumption. For large-scale data-signal-processing tasks, however, an improper resource-management scheduling strategy still produces a large number of write operations.
To increase the locality of data, much research has been devoted to integrated resource management based on task behavior. In traditional multidimensional task-resource-management methods, tasks are executed in row-column or column-row order according to their dependencies. Because of the limited size of the local memory, this execution style forces a large amount of data to be written back to main memory and can even cause data loss.
Summary of the invention
The invention provides a multi-layer partition scheduling method with storage awareness. Its purpose is to manage, allocate, and schedule resources by partitioning the iteration space reasonably and repeatedly, thereby overcoming the long completion time and high energy consumption caused by unreasonable resource allocation in prior-art scheduling strategies, and at the same time overcoming the data loss that the limited local memory easily causes.
A multi-layer partition scheduling method with storage awareness comprises the following steps:
Step 1: one execution of all tasks is called an iteration, and the iteration space, built from the repeated executions of the group of tasks with a fixed execution order, is taken as the partition object. Determine the directions of the partition vectors (P_i, P_j) of the iteration space; the size of a partition in the P_j direction is f and its size in the P_i direction is h. Find the two outermost dependencies CW and CCW in the dependence set D between tasks, and set P_i = CCW and P_j = CW;
Step 2: determine the relation between the size of the data that the current iteration must load and store and the partition-vector magnitudes f and h, together with the schedule length Ls of the current iteration;
Step 3: determine the partition-vector sizes f and h according to strategy one and strategy two;
1) set f to 1 and compute h from strategy one and from strategy two, obtaining h1 and h2 respectively;
Strategy one: 2·NUM_other + NUM_top + NUM_next ≤ M_s;
Strategy two: (NUM_top + NUM_other)·Tw + NUM_other·Tr ≤ Ls × f × h;
2) if h1 > h2, the value of h is h2; compute f from strategy one and strategy two, obtaining f1 and f2 respectively, and go to 3);
otherwise the partition size h is h1 and f is 1;
3) if f1 > 1, the partition size is f1 × h2; otherwise the partition size is f × h1 with f = 1;
Step 4: apply iterational retiming to spread the delays between tasks, change the inner-loop dependencies between tasks, and reconstruct the partition space;
Step 5: divide the partition space according to the first-level partition size f × h, treat each sub-block produced by the first-level partition as a node, i.e. as a cluster task, to form a new iteration space, and partition each sub-block in turn according to step 1, obtaining the direction vectors (P2_i, P2_j) of the second-level partition;
Step 6: determine the sizes of the second-level partition vectors;
The second-level partition vector has size N_core in the P2_i direction and size 1 in the P2_j direction, where N_core is the number of processor cores;
Step 7: obtain the execution-order graph of the partitioned iteration space from the two partition vectors, and schedule the tasks according to the execution-order graph.
The direction of the partition vectors in step 1 is determined concretely as follows:
A dependence between tasks refers to the execution order between the tasks and is written d_k = (d_ki, d_kj), where d_ki represents the execution dependence of the two tasks in the inner loop and d_kj their execution dependence in the outer loop; the two outermost dependencies CW and CCW are found in the dependence set D between tasks, with P_i = CCW and P_j = CW;
CCW, the counterclockwise boundary vector, is the vector with the largest angle to the j vector; CW, the clockwise boundary vector, is the vector with the smallest angle to the j vector.
Here, a dependence between tasks refers to their execution order, i.e. the relation that one task must wait until another task finishes before it can be executed. In a computer the execution order of tasks is usually represented by a graph: each node represents a task, and an edge between two nodes represents the dependence between them, that is, the specific restriction imposed on the task execution order. One iteration means that every task is executed once, and all iterations make up the iteration space. An iteration is a loop: i indicates that a task is executed in the i-th pass of the inner loop, and j indicates the j-th pass of the outer loop, i.e. the j-th time all tasks are executed.
The relation in step 2 between the size of the data that the current iteration must load and store and the partition-vector magnitudes f and h is as follows:
(1) The size of the data that the current iteration must load and store comprises two parts: the first part, the size of the data produced by the current iteration, is NUM_next + NUM_top + NUM_other; the second part, the size of the data preloaded for the current and the next iteration, is NUM_other;
NUM_other = Σ_{d_k} A_goto_other(d_k) = Σ_{d_k} d_ki · d_kj
NUM_top = Σ_{d_k} A_goto_top(d_k) = Σ_{d_k} d_ki · (f − d_kj)
NUM_next = Σ_{d_k} A_goto_next(d_k) = Σ_{d_k} d_kj · (h − d_ki)
where NUM_next is the size of the data that this partition produces and the next partition urgently needs; NUM_top is the size of the data that this partition produces and that the partition perpendicular to the P_i direction and adjacent to the current partition needs; NUM_other is the size of the data that this partition produces and that all partitions other than the next partition and the top partition need;
The schedule length Ls in step 2 is the time taken to execute one iteration.
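The three NUM quantities are plain sums over the dependence set. The minimal Python sketch below computes them from the formulas above; the function name and the representation of each dependence d_k as a pair (d_ki, d_kj) are illustrative assumptions, not part of the patent:

```python
def num_sizes(deps, f, h):
    """Data volumes produced by one f-by-h partition, summed over all
    dependencies d_k = (d_ki, d_kj), per the three formulas above."""
    num_other = sum(d_ki * d_kj for d_ki, d_kj in deps)
    num_top = sum(d_ki * (f - d_kj) for d_ki, d_kj in deps)
    num_next = sum(d_kj * (h - d_ki) for d_ki, d_kj in deps)
    return num_other, num_top, num_next

# Dependence set of the running example below: D = {(0,1), (1,0), (1,1)}
print(num_sizes([(0, 1), (1, 0), (1, 1)], f=1, h=4))  # -> (1, 1, 7)
```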
In strategy one and strategy two of step 3, NUM_next is the data that this partition produces and the next partition urgently needs; NUM_top is the data that this partition produces and that the partition perpendicular to the P_i direction and adjacent to the current partition needs; NUM_other is the data that this partition produces and that all partitions other than the next and top partitions need; Ls is the schedule length of one iteration; Tr is the time needed to read one datum from main memory; Tw is the time needed to write one datum to main memory; and M_s is the capacity of the SPM (scratch-pad memory).
Retiming is a technique that optimizes the loop period by redistributing delays, and rotation scheduling is a resource-constrained scheduling optimization strategy based on retiming: it obtains a more compact schedule by redistributing delays. The partition scheduling technique combines iterational retiming with prefetching: each iteration is regarded as a point and the iteration space is then partitioned (note that one iteration means executing all tasks once, and the iteration space comprises all iterations), after which the partitions are executed one by one. Because of the dependencies between tasks, the key concern during partitioning is how to make the partition legal: there must be no cycle between blocks, so that the blocks can be scheduled and executed one after another.
Beneficial effects
The invention provides a multi-layer partition scheduling method with storage awareness. The first step derives the correct shape of the first-level partition from the task dependencies, i.e. it determines the directions of the two partition vectors, written (P_i, P_j). The second step obtains the relation between the size of the data each iteration must load and store and the partition-vector magnitudes f and h, together with the schedule length Ls of each iteration. The third step decides the sub-block size of the first-level partition from the local-memory size and the schedule length. The fourth step uses iterational retiming to change the dependencies between tasks and reconstructs the partition positions. The fifth step, on the basis of the first-level partition, treats each sub-block of the first-level partition as a node, rebuilds the partition space, and performs the second-level partition. Finally, the execution-order graph of the partitioned iteration space is obtained from the two partition vectors, and the tasks are scheduled according to it. The method takes both memory capacity and memory latency into account, and by partitioning and by adjusting the dependencies between tasks it improves task parallelism, reduces the number of write operations, and lowers the average scheduling time.
The choice of partition shape and direction is very strict: because of the dependencies between tasks, an unreasonable partition scheme makes the tasks impossible to execute, while a reasonable partition reduces the task-scheduling time and the number of write operations generated. The method of the invention takes memory capacity and memory latency into account together and effectively improves overall system performance.
Description of drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is a schematic diagram of the iteration space;
Fig. 3 shows a two-dimensional application task model and the corresponding task graph MDFG, where (a) is the two-dimensional application task model and (b) is the MDFG obtained from (a);
Fig. 4 shows the positions of the inter-task dependencies relative to the counterclockwise boundary CCW and the clockwise boundary CW, where (a) is a schematic diagram of the CCW and CW regions and (b) shows the positions of the dependencies in those regions;
Fig. 5 shows reasonable and unreasonable partitions, where (a) is an unreasonable partition, (b) is the execution-order graph obtained from (a), (c) is a reasonable partition, and (d) is the execution-order graph obtained from (c);
Fig. 6 illustrates changing the inner-loop dependencies between tasks by retiming;
Fig. 7 shows the partition scheduling order: (a) the plane formed by the second-level partitions, (b) the relation between the second-level and first-level partitions, and (c) the execution order of the partitions;
Fig. 8 shows the scheduling relation between the processor part and the memory part;
Fig. 9 is a schematic comparison of the write-operation counts of several scheduling methods;
Fig. 10 is a schematic comparison of the average task scheduling times of several scheduling methods.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings.
To make resource and block management more convenient, the task execution order is first modeled and represented in two-dimensional coordinates, where i is the direction of the inner loop and j the direction of the outer loop; each loop instance is written iteration(i, j), and one iteration means that every task is executed once.
In this example, the configuration of the computer executing the tasks is inspected: the processor has 3 cores, the SPM embedded on each core has a capacity of 128 KB, each core has 3 computing units, the time to read one datum from main memory is Tr = 2 clock cycles, and the time to write one datum to main memory is Tw = 4 clock cycles.
Fig. 1 is the flow chart of the multi-layer partition scheduling method with storage awareness of the present invention; its concrete operation steps are as follows:
Step 1: one execution of all tasks is called an iteration, and the iteration space, built from the repeated executions of the group of tasks with a fixed execution order, is taken as the partition object. Determine the directions of the partition vectors (P_i, P_j) of the iteration space; the size of a partition in the P_j direction is f and its size in the P_i direction is h. Find the two outermost dependencies CW and CCW in the dependence set D between tasks, and set P_i = CCW and P_j = CW;
Here, a dependence between tasks refers to their execution order, i.e. the relation that one task must wait until another task finishes before it can be executed; in a computer the execution order is usually represented by a graph, where each node represents a task and each edge represents the dependence between two nodes, i.e. the specific restriction on the execution order. One iteration means all tasks are executed once, and all iterations make up the iteration space. An iteration is a loop; every task is executed many times, and a particular execution of a task A is written A[i, j], meaning that task A is executed once in the i-th pass of the inner loop during the j-th pass of the outer loop.
Fig. 3 shows a concrete two-dimensional application task model and the corresponding task graph MDFG. The task graph MDFG = <V, E, d, t> is a two-dimensional graph with node weights and edge weights, where V is the set of nodes, each node representing one task of the application; E represents the dependencies between tasks, (u, v) ∈ E meaning that a dependence exists between node u and node v; and d(e) = (d_i, d_j) represents a delay that describes the concrete dependence between the two tasks.
If the dependence of task b on task a is d_k = (x, y), then task b of iteration(i, j) depends on task a of iteration(i − x, j − y); d_k = (0, 0) denotes a dependence between tasks of the same iteration. The two-dimensional application of Fig. 3 contains four tasks, A, B, C and D; for example, the dependence from task A to task B is d_k = (0, 0), and the dependence from task C to task D is d_k = (0, 1). The relation of CCW and CW to the dependence set D = {d1, d2, d3, d4, d5} between tasks is shown in Fig. 4.
In this example the dependence set D contains three nonzero delay vectors: (0, 1), (1, 0) and (1, 1). Hence (1, 0) is the CW vector and (1, 1) is CCW, and the first-level partition vectors are P_i = CCW and P_j = CW;
A cyclic application has a basic property: every iteration (loop) executes the same tasks in the same order, so the dependencies of every iteration are identical. To find CW (the clockwise boundary vector) and CCW (the counterclockwise boundary vector) more easily, all dependencies are represented as vectors; CCW is then the vector with the largest angle to the j vector, and CW the vector with the smallest angle to the j vector.
The two outermost dependencies CW and CCW are found in D, with P_i = CCW and P_j = CW; here d4 = CCW and d3 = CW;
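This selection can be sketched as follows. To reproduce the worked result (CW = (1, 0), CCW = (1, 1)), the sketch assumes that dependence vectors whose inner-loop component is zero are skipped and that angles are measured from the i axis; both the filtering and the axis convention are assumptions made for illustration:

```python
import math

def boundary_vectors(deps):
    """Pick the two outermost dependence vectors CW and CCW of the set D.

    Sketch only: dependencies whose inner-loop component d_i is zero are
    skipped, and the angle of each remaining vector is measured from the
    i axis; CW is the vector with the smallest angle, CCW the largest.
    """
    def angle(v):
        return math.atan2(v[1], v[0])  # 0 on the i axis, pi/2 on the j axis

    candidates = [(di, dj) for di, dj in deps if di > 0]
    cw = min(candidates, key=angle)
    ccw = max(candidates, key=angle)
    return cw, ccw

# Running example: D = {(0, 1), (1, 0), (1, 1)}
cw, ccw = boundary_vectors([(0, 1), (1, 0), (1, 1)])
print("P_j = CW =", cw, " P_i = CCW =", ccw)  # -> (1, 0) and (1, 1)
```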
Step 2: compute the relation between the size of the data the current iteration must load and store and f and h, expressed as an equation in h or f respectively (that is, assuming the other of f and h is fixed), together with the schedule length Ls of the current iteration;
(1) The size of the data the current iteration must load and store comprises two parts: the first part, the data produced by the current iteration, is NUM_next + NUM_top + NUM_other; the second part, the data preloaded for the current and the next iteration, is NUM_other. (2) The schedule length is the time taken to execute one iteration; in this example all task execution times are assumed identical, so if one iteration contains n tasks and m cores execute them, then Ls = (n/m) × the execution time of one task, where the execution time of one task is one unit of time, i.e. one clock cycle.
NUM_other = Σ_{d_k} A_goto_other(d_k) = Σ_{d_k} d_ki · d_kj
NUM_top = Σ_{d_k} A_goto_top(d_k) = Σ_{d_k} d_ki · (f − d_kj)
NUM_next = Σ_{d_k} A_goto_next(d_k) = Σ_{d_k} d_kj · (h − d_ki)
where NUM_next is the size of the data that this partition produces and the next partition urgently needs; NUM_top is the size of the data that this partition produces and that the partition perpendicular to the P_i direction and adjacent to the current partition needs; NUM_other is the size of the data that this partition produces and that all partitions other than the next partition and the top partition need;
Step 3: determine the partition size f × h;
1) with f = 1, obtain h1 and h2 from strategy one and strategy two respectively;
Strategy one: 2·NUM_other + NUM_top + NUM_next ≤ M_s;
Strategy two: (NUM_top + NUM_other)·Tw + NUM_other·Tr ≤ Ls × f × h;
where NUM_next is the size of the data that this partition produces and the next partition urgently needs; NUM_top is the size of the data that this partition produces and that the partition perpendicular to the P_i direction and adjacent to the current partition needs; NUM_other is the size of the data that this partition produces and that all partitions other than the next and top partitions need; Ls is the schedule length of one iteration; Tr is the time needed to read one datum from main memory; Tw is the time needed to write one datum to main memory; and M_s is the capacity of the SPM (scratch-pad memory);
2) compare h1 and h2: if h1 > h2, set h = h2 and compute f1 (and f2) from strategy one and strategy two; otherwise the partition size is f × h1 with f = 1;
3) if f1 > 1, the partition size is f1 × h2; otherwise the partition size is f × h1 with f = 1;
In this example the first-level partition size is found to be f = 1, h = 4.
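A sketch of the step-3 search is given below, assuming a bounded linear scan for the largest/smallest feasible sizes; the scan bounds, the parameter values, and the function name are illustrative assumptions, and the example's exact result f = 1, h = 4 depends on task parameters not restated here:

```python
def find_partition_size(deps, M_s, Tr, Tw, Ls, h_max=1024, f_max=1024):
    """Sketch of the step-3 search for the first-level partition size f*h.

    Strategy one caps the block by the SPM capacity M_s; strategy two asks
    that the write-back/prefetch traffic of a block fit under the block's
    compute time Ls*f*h.  The bounded linear scans are an illustrative
    assumption, not the patent's own search procedure.
    """
    def nums(f, h):
        other = sum(di * dj for di, dj in deps)
        top = sum(di * (f - dj) for di, dj in deps)
        nxt = sum(dj * (h - di) for di, dj in deps)
        return other, top, nxt

    def s1(f, h):  # strategy one: 2*NUM_other + NUM_top + NUM_next <= M_s
        other, top, nxt = nums(f, h)
        return 2 * other + top + nxt <= M_s

    def s2(f, h):  # strategy two: (NUM_top+NUM_other)*Tw + NUM_other*Tr <= Ls*f*h
        other, top, _ = nums(f, h)
        return (top + other) * Tw + other * Tr <= Ls * f * h

    f = 1
    h1 = max((h for h in range(1, h_max + 1) if s1(f, h)), default=1)      # capacity bound
    h2 = min((h for h in range(1, h_max + 1) if s2(f, h)), default=h_max)  # latency bound
    if h1 > h2:
        h = h2  # memory allows more than latency hiding needs: fix h, grow f
        f1 = max((g for g in range(1, f_max + 1) if s1(g, h)), default=1)
        f2 = min((g for g in range(1, f_max + 1) if s2(g, h)), default=f_max)  # computed as in the text, but only f1 decides
        if f1 > 1:
            return f1, h2
    return 1, h1

# Made-up parameters for illustration only (M_s in data words, Ls assumed):
print(find_partition_size([(0, 1), (1, 0), (1, 1)], M_s=32, Tr=2, Tw=4, Ls=2))
```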
Fig. 5 shows the iteration space after the first-level partition: (a) is an unreasonable partition and (b) is the execution-order graph obtained from (a), from which it can be seen that the execution order contains a cycle and cannot be carried out; (c) is the reasonable partition obtained with the method of the invention and (d) is the execution-order graph obtained from (c). From (d) it can be seen that after the first-level partition only two dependencies remain between blocks, (1, 0) and (1, 1).
Step 4: apply iterational retiming to spread the delays between tasks, change the inner-loop dependencies between tasks, and reconstruct the partition space;
Retiming is a technique that optimizes the loop period by redistributing delays, and rotation scheduling is a resource-constrained scheduling optimization strategy based on retiming: it obtains a more compact schedule by redistributing delays. The partition scheduling technique combines iterational retiming with prefetching: each iteration is regarded as a point, the iteration space is partitioned, and the partitions are then executed one by one. Because of the dependencies between tasks, the key concern during partitioning is how to make the partition legal, that is, there must be no cycle between blocks, so that the blocks can be scheduled and executed one after another.
To obtain a more compact schedule, iterational retiming is performed to change the dependencies between tasks: retiming redistributes the delays between tasks and thereby rebuilds the dependencies, shortening the execution period of the tasks. To preserve the row-wise execution order, retiming must not change the dependencies between partitions, so only the innermost-loop dependencies between tasks are changed. For example, if a delay d3 = (1, 1) exists between task A and task B, then after the delay is redistributed the delay between A and B becomes d3 = (0, 1); that is, before retiming, the execution of task A in iteration(i, j) depends on task B in iteration(i + 1, j − 1), while after retiming it depends on the B executed in iteration(i, j − 1), as shown in Fig. 6.
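A minimal sketch of this delay redistribution follows, assuming the multidimensional retiming rule d'(e) = d(e) + r(u) − r(v) for an edge u → v; sign conventions differ across the literature, and the retiming values used below are assumptions chosen to reproduce the example:

```python
from typing import Dict, Tuple

Vec = Tuple[int, int]

def retime(edges: Dict[Tuple[str, str], Vec], r: Dict[str, Vec]) -> Dict[Tuple[str, str], Vec]:
    """Multidimensional retiming sketch: an edge u -> v with delay d(e)
    gets the new delay d(e) + r(u) - r(v).  A legal retiming must keep
    every new delay consistent with the partition execution order."""
    out = {}
    for (u, v), (di, dj) in edges.items():
        ru, rv = r.get(u, (0, 0)), r.get(v, (0, 0))
        out[(u, v)] = (di + ru[0] - rv[0], dj + ru[1] - rv[1])
    return out

# Example from the text: delay d3 = (1, 1) on edge A -> B becomes (0, 1)
# under the assumed retiming r(A) = (0, 0), r(B) = (1, 0).
print(retime({("A", "B"): (1, 1)}, {"A": (0, 0), "B": (1, 0)}))  # {('A', 'B'): (0, 1)}
```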
Step 5: divide the first-level partition space according to the partition size f × h; treat each sub-block produced by the first-level partition as a node, i.e. as a cluster task, to form a new iteration space, and obtain the second-level partition directions P2_i and P2_j according to step 1.
The first-level partition partitions the iteration space, and the second-level partition partitions the sub-blocks of the first-level partition; each first-level sub-block is denoted partition(i, j).
The task-scheduling framework is shown in Fig. 7 and Fig. 8: Fig. 7 illustrates the execution order of task scheduling, and Fig. 8 illustrates the scheduling of one task. In the method, for convenience, the first-level partitions (first_level_partition) are divided into three classes relative to the currently executing first_level_partition: the next first_level_partition, the top first_level_partition, and the other first_level_partitions. Each first_level_partition is further divided into regions according to where its data are used, as shown in Fig. 8(a): four regions in total, where the first region holds data to be used by tasks of this partition, the second region holds data needed by tasks in the next first_level_partition, the third region holds data needed by tasks in the top first_level_partition, and the fourth region holds data needed by the other first_level_partitions. In this way, for each delay d(e): d_k = (d_ki, d_kj) in a first_level_partition, the amount of data produced for the other first_level_partitions can be computed quickly:
A_goto_top(d_k) = area(PQVU) = d_ki · (f − d_kj)
A_goto_next(d_k) = area(VSWX) = d_kj · (h − d_ki)
A_goto_other(d_k) = area(UVRS) = d_ki · d_kj
Further, once the partition size of a first_level_partition has been determined, the amount of data produced inside one first_level_partition that must be stored can be computed quickly:
NUM_other = Σ_{d_k} A_goto_other(d_k) = Σ_{d_k} d_ki · d_kj
NUM_top = Σ_{d_k} A_goto_top(d_k) = Σ_{d_k} d_ki · (f − d_kj)
NUM_next = Σ_{d_k} A_goto_next(d_k) = Σ_{d_k} d_kj · (h − d_ki)
Step 6: determine the second-level partition size. The number of processor cores, N_core = 3, is obtained from the hardware configuration; the size in the P2_i direction is therefore N_core = 3, and the size in the P2_j direction is 1.
Step 7: obtain the execution-order graph of the partitioned iteration space from the two partition vectors, and schedule the tasks according to the execution-order graph.
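A schematic sketch of the resulting block execution order follows; the grid dimensions and the row-by-row walk are illustrative assumptions consistent with steps 5 to 7 (N_core sub-blocks along P2_i issued together, rows advancing one at a time along P2_j):

```python
def schedule_blocks(n_rows, n_cols, n_core=3):
    """Sketch of the step-7 execution order for a grid of first-level
    sub-blocks.  Rows advance along P2_j; within a row, N_core consecutive
    sub-blocks along P2_i form one second-level partition and are issued
    to the cores together.  The grid dimensions are illustrative."""
    order = []
    for row in range(n_rows):  # one row of sub-blocks per P2_j step
        for base in range(0, n_cols, n_core):  # second-level groups
            group = [(row, base + k) for k in range(n_core) if base + k < n_cols]
            order.append(group)
    return order

for step, group in enumerate(schedule_blocks(2, 7)):
    print(f"t{step}: cores execute sub-blocks {group}")
```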
Several benchmarks are used to test the average task scheduling time of the method of the invention, TLP, against two other algorithms, List and IRP (see Fig. 10). As the figure shows, TLP (the multi-layer partition task-scheduling strategy with storage awareness) clearly outperforms the other two scheduling algorithms in average task scheduling time, with a performance gain of about 30%. This is because the storage-aware partition scheduling strategy considers not only task parallelism but also memory latency: it guarantees that the memory schedule is no longer than the processor schedule, thereby avoiding some memory stalls, saving waiting time, improving system performance, and reducing scheduling time.
In terms of write operations, the same benchmarks are used to test the method of the invention, TLP, against the two other algorithms, List and IRP (see Fig. 9). As the figure shows, TLP (the multi-layer partition task-scheduling strategy with storage awareness) reduces write operations by about 45% on average compared with the other two algorithms. This is because the partition strategy of the invention fully accounts for the capacity of the local memory: through the two levels of partitioning it keeps the data needed by each block that a core processes in local memory as far as possible, saving a large number of write operations and correspondingly reducing scheduling time and energy consumption, thereby improving system performance. When the local-memory capacity is fixed, however, the data to be stored grow with the task scale, the number of write operations grows too, the average task scheduling time rises, and system performance falls.
IIR, 2D, WDF(1), WDF(2), DPCM(1), DPCM(2), DPCM(3), FLOYD(1), FLOYD(2) and FLOYD(3) in Fig. 9 and Fig. 10 are data-processing benchmarks.
The tasks in the present invention are multidimensional DSP applications, but the proposed multi-layer partition strategy with storage awareness can be extended to n-dimensional DSP applications and to other applications with cyclic characteristics.

Claims (4)

1. A multi-layer partition scheduling method with storage awareness, characterized in that it comprises the following steps:
Step 1: one execution of all tasks is called an iteration, and the iteration space, built from the repeated executions of the group of tasks with a fixed execution order, is taken as the partition object. Determine the directions of the partition vectors (P_i, P_j) of the iteration space; the size of a partition in the P_j direction is f and its size in the P_i direction is h. Find the two outermost dependencies CW and CCW in the dependence set D between tasks, and set P_i = CCW and P_j = CW, where a dependence refers to the execution order between tasks;
Step 2: determine the relation between the size of the data that the current iteration must load and store and the partition-vector magnitudes f and h, together with the schedule length Ls of the current iteration;
Step 3: determine the partition-vector sizes f and h according to strategy one and strategy two;
1) set f to 1 and compute h from strategy one and from strategy two, obtaining h1 and h2 respectively;
Strategy one: 2·NUM_other + NUM_top + NUM_next ≤ M_s;
Strategy two: (NUM_top + NUM_other)·Tw + NUM_other·Tr ≤ Ls × f × h;
2) if h1 > h2, the value of h is h2; compute f from strategy one and strategy two, obtaining f1 and f2 respectively, and go to 3); otherwise the partition size h is h1 and f is 1;
3) if f1 > 1, the partition size is f1 × h2; otherwise the partition size is f × h1 with f = 1;
Step 4: apply iterational retiming to spread the delays between tasks, change the inner-loop dependencies between tasks, and reconstruct the partition space;
Step 5: divide the partition space according to the first-level partition size f × h, treat each sub-block produced by the first-level partition as a node, i.e. as a cluster task, to form a new iteration space, and partition each sub-block in turn according to step 1, obtaining the direction vectors (P2_i, P2_j) of the second-level partition;
Step 6: determine the sizes of the second-level partition vectors;
The second-level partition vector has size N_core in the P2_i direction and size 1 in the P2_j direction, where N_core is the number of processor cores;
Step 7: obtain the execution-order graph of the partitioned iteration space from the two partition vectors, and schedule the tasks according to the execution-order graph.
2. The multi-layer partition scheduling method with storage awareness according to claim 1, characterized in that the direction of the partition vectors in step 1 is determined concretely as follows:
A dependence between tasks refers to the execution order between the tasks and is written d_k = (d_ki, d_kj), where d_ki represents the execution dependence of the two tasks in the inner loop and d_kj their execution dependence in the outer loop; the two outermost dependencies CW and CCW are found in the dependence set D between tasks, with P_i = CCW and P_j = CW;
CCW, the counterclockwise boundary vector, is the vector with the largest angle to the j vector; CW, the clockwise boundary vector, is the vector with the smallest angle to the j vector.
3. The multi-layer partition scheduling method with storage awareness according to claim 1, characterized in that the relation in step 2 between the size of the data that the current iteration must load and store and the partition-vector magnitudes f and h is as follows:
(1) The size of the data that the current iteration must load and store comprises two parts: the first part, the size of the data produced by the current iteration, is NUM_next + NUM_top + NUM_other; the second part, the size of the data preloaded for the current and the next iteration, is NUM_other;
NUM_other = Σ_{d_k} A_goto_other(d_k) = Σ_{d_k} d_ki · d_kj
NUM_top = Σ_{d_k} A_goto_top(d_k) = Σ_{d_k} d_ki · (f − d_kj)
NUM_next = Σ_{d_k} A_goto_next(d_k) = Σ_{d_k} d_kj · (h − d_ki)
where NUM_next is the size of the data that this partition produces and the next partition urgently needs; NUM_top is the size of the data that this partition produces and that the partition perpendicular to the P_i direction and adjacent to the current partition needs; NUM_other is the size of the data that this partition produces and that all partitions other than the next partition and the top partition need;
The schedule length Ls in step 2 is the time taken to execute one iteration.
4. The multi-layer partition scheduling method with storage awareness according to claim 1, characterized in that, in strategy one and strategy two of step 3, NUM_next is the data that this partition produces and the next partition urgently needs; NUM_top is the data that this partition produces and that the partition perpendicular to the P_i direction and adjacent to the current partition needs; NUM_other is the data that this partition produces and that all partitions other than the next and top partitions need; Ls is the schedule length of one iteration; Tr is the time needed to read one datum from main memory; Tw is the time needed to write one datum to main memory; and M_s is the capacity of the SPM (scratch-pad memory).
CN201310145363.XA 2013-04-24 2013-04-24 A multi-layer partition scheduling method with storage awareness Active CN103246563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310145363.XA CN103246563B (en) 2013-04-24 2013-04-24 A multi-layer partition scheduling method with storage awareness

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310145363.XA CN103246563B (en) 2013-04-24 2013-04-24 A multi-layer partition scheduling method with storage awareness

Publications (2)

Publication Number Publication Date
CN103246563A true CN103246563A (en) 2013-08-14
CN103246563B CN103246563B (en) 2016-06-08

Family

ID=48926094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310145363.XA Active CN103246563B (en) 2013-04-24 2013-04-24 A multi-layer partition scheduling method with storage awareness

Country Status (1)

Country Link
CN (1) CN103246563B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639769A (en) * 2008-07-30 2010-02-03 国际商业机器公司 Method and device for splitting and sequencing dataset in multiprocessor system
CN101980168A (en) * 2010-11-05 2011-02-23 北京云快线软件服务有限公司 Dynamic partitioning transmission method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639769A (en) * 2008-07-30 2010-02-03 国际商业机器公司 Method and device for splitting and sequencing dataset in multiprocessor system
CN101980168A (en) * 2010-11-05 2011-02-23 北京云快线软件服务有限公司 Dynamic partitioning transmission method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHUN JASON XUE等: "Iterational Retiming with Partitioning:Loop Scheduling with Complete Memory Latency Hiding", 《ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS》 *
ZHONG WANG等: "Optimal Loop Scheduling for Hiding Memory Latency Based on Two Level Partitioning and Prefetching", 《SIGNAL PROCESSING,IEEE TRANSACTIONS ON》 *

Also Published As

Publication number Publication date
CN103246563B (en) 2016-06-08

Similar Documents

Publication Publication Date Title
Venkataraman et al. Presto: distributed machine learning and graph processing with sparse matrices
CN112306678B (en) Method and system for parallel processing of algorithms based on heterogeneous many-core processor
CN103150265B (en) The fine-grained data distribution method of isomery storer on Embedded sheet
CN102708009B (en) Method for sharing GPU (graphics processing unit) by multiple tasks based on CUDA (compute unified device architecture)
CN105468439B (en) The self-adaptive parallel method of neighbours in radii fixus is traversed under CPU-GPU isomery frame
JP2020518881A (en) Computer-implemented method, computer-readable medium and heterogeneous computing system
RU2012138911A (en) METHOD, SYSTEM AND EQUIPMENT OF SPACE OF EXECUTION
Ji et al. RSVM: a region-based software virtual memory for GPU
Zhang et al. Optimizing the Barnes-Hut algorithm in UPC
Li et al. A simple yet effective balanced edge partition model for parallel computing
Diener et al. Evaluating thread placement based on memory access patterns for multi-core processors
Raju et al. A survey on techniques for cooperative CPU-GPU computing
Melab et al. A GPU-accelerated branch-and-bound algorithm for the flow-shop scheduling problem
Maggioni et al. AdELL: An adaptive warp-balancing ELL format for efficient sparse matrix-vector multiplication on GPUs
Holst et al. High-throughput logic timing simulation on GPGPUs
JP2015516633A (en) Apparatus, system, and memory management method
Mantovani et al. Performance issues on many-core processors: A D2Q37 Lattice Boltzmann scheme as a test-case
CN108108242B (en) Storage layer intelligent distribution control method based on big data
Tang et al. Optimizing and auto-tuning iterative stencil loops for GPUs with the in-plane method
CN109522127B (en) Fluid machinery simulation program heterogeneous acceleration method based on GPU
Wittmann et al. Modeling and analyzing performance for highly optimized propagation steps of the lattice Boltzmann method on sparse lattices
Cecilia et al. Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE
CN103246563A (en) Multi-layer block scheduling method with storage sensing function
Boyer Improving Resource Utilization in Heterogeneous CPU-GPU Systems
Hugo et al. A runtime approach to dynamic resource allocation for sparse direct solvers

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Li Kenli, Wang Yan, Du Jiayi, Tang Zhuo, Xiao Zheng, Zhu Ningbo

Inventor before: Wang Yan, Li Kenli, Du Jiayi, Tang Zhuo, Xiao Zheng, Zhu Ningbo

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: WANG YAN LI KENLI DU JIAYI TANG ZHUO XIAO ZHENG ZHU NINGBO TO: LI KENLI WANG YAN DU JIAYI TANG ZHUO XIAO ZHENG ZHU NINGBO

C14 Grant of patent or utility model
GR01 Patent grant