CN103246563B - A kind of multilamellar piecemeal dispatching method with storage perception - Google Patents

A kind of multilamellar piecemeal dispatching method with storage perception Download PDF

Info

Publication number
CN103246563B
CN103246563B CN201310145363.XA CN201310145363A CN103246563B CN 103246563 B CN103246563 B CN 103246563B CN 201310145363 A CN201310145363 A CN 201310145363A CN 103246563 B CN103246563 B CN 103246563B
Authority
CN
China
Prior art keywords
piecemeal
task
num
iteration
dependence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310145363.XA
Other languages
Chinese (zh)
Other versions
CN103246563A (en
Inventor
李肯立
王艳
杜家宜
唐卓
肖正
朱宁波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN201310145363.XA priority Critical patent/CN103246563B/en
Publication of CN103246563A publication Critical patent/CN103246563A/en
Application granted granted Critical
Publication of CN103246563B publication Critical patent/CN103246563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a kind of multilamellar piecemeal dispatching method with storage perception, the first step, draw the direction of piecemeal vector (Pi, Pj) according to the dependence of task; Second step, it is thus achieved that every time circulate the relational expression of the required size of data loaded and preserve and piecemeal vector magnitude f and h and the scheduling length Ls of each iteration; 3rd step, determines the sub-piecemeal size of piecemeal according to local storage size and scheduling length; 4th step, utilizes the dependence between iteration weight clocking technique change task, reconstructs piecemeal position; 5th step, is used as every sub-piecemeal of first time piecemeal as a node and rebuilds a segmented spaces, carry out second time piecemeal; Obtain execution sequence figure after iteration space being carried out piecemeal according to two piecemeal vectors, according to execution sequence figure, task is scheduling; The method, in conjunction with memory span and storage delay, by piecemeal with to the adjustment of dependence between task, improves the degree of parallelism of task, reduces the quantity of write operation and reduce average scheduled time.

Description

A kind of multilamellar piecemeal dispatching method with storage perception
Technical field
The present invention relates to a kind of multilamellar piecemeal dispatching method with storage perception.
Background technology
The application that processes of most of science and data signal is all iterative recursive circulation. This generic task encounters two challenges when performing on embedded multiprocessor: first, it is calculate responsive type and the application of data responsive type that most of data signals process task, for this kind of application, the bad scheduling strategy of efficiency will produce substantial amounts of write operation, therefore can consume substantial amounts of time and energy consumption; Second, receiving speed relative to memorizer, the development of CPU speed is excessively quick, and the storage speed of memorizer slowly seriously hinders the raising of systematic function. Although embedded multiprocessor has a set of one's own instruction set, software programming can be passed through and realize different calculating tasks flexibly, but it is affected by coding and the execution sequence restriction of instruction, the restriction of memory access bottleneck and fixing control architecture, tends not to arrive maximum speed and optimum efficiency.
Prefetch the technology that (prefetching) strategy is a kind of performance that can be effectively improved system proposed for storage delay, namely before data have demand, just these data are stored in cache memory (cache), so can tolerate storing time delay for a long time.The strategy that prefetches of the prior art can be divided three classes: hardware based prefetches strategy, prefetching strategy and prefetching strategy based on hardware and software based on software. But hardware based some supporter of policy mandates that prefetch are linked to cache memory cache, and depend on dynamically available information in the process performed. And based on software prefetch strategy depend on compiler technologies go analyze one section of static routine, and in program code add prefetched instruction. But, too many pre-extract operation will cause that a unbalanced scheduling and storage time delay can be very long.
For this, a lot of embedded multiprocessors have used SPM to be a kind of minimum storage being embedded on chip to replace Cache, SPM, are a kind of compiler support and the memorizer can being managed by software. SPM memorizer can essentially regard the local storage of each core core as, can optimize the performance of system further, and can effectively reduce the consumption of energy. But for large-scale data signal processing tasks, resource management's scheduling strategy performs process and will produce substantial amounts of write operation improperly.
In order to increase the locally stored of data, a lot of research situations about being devoted to according to task are implement resource integration management. Traditional multidimensional task resource management method, execution task is that the dependence according to task performs with row-column (row-column) or column-row order. Due to the restriction of local storage, this execution method can produce substantial amounts of data to be needed, in write main memory, even to cause very multidata loss.
Summary of the invention
The invention provides a kind of multilamellar piecemeal dispatching method with storage perception, its object is to, by iteration space being carried out reasonably repeatedly piecemeal, resource is managed distribution and scheduling, the resource unreasonable distribution existed in prior art is overcome to cause when scheduling strategy performs, deadline length and the many problem of energy expenditure, overcome the restriction due to local storage, it is easy to the problem causing loss of data simultaneously.
A kind of multilamellar piecemeal dispatching method with storage perception, comprises the following steps:
Step 1: all of task is executed once as an iteration, performs between the iterative space that a group task with execution sequence repeatedly builds as piecemeal object needing, it is determined that the piecemeal vector (P in iteration spacei, Pj) direction, piecemeal is sized to f on Pj direction, and piecemeal is sized to h on Pi direction, and two that find out ragged edge from the dependence set D between task rely on CW and CCW, Pi=CCW and Pj=CW;
Step 2: determine the relational expression of the required size of data loaded and preserve of current iteration and piecemeal vector magnitude f and h and the scheduling length Ls of current iteration;
Step 3: strategically one and tactful two determine size f and the h of piecemeal vector;
1) set f as 1, calculate h according to strategy one and strategy two, respectively obtain h1 and h2;
Strategy one: 2NUMother+NUMtop+NUMnext��Ms;
Strategy two: (NUMtop+NUMother)Tw+NUMother�� Tr��Ls �� f �� h;
2) if h1 > h2, then the value of h is h2, adopts strategy one and strategy two to calculate f, respectively obtains f1 and f2, enter 3);
Otherwise, the value of piecemeal size h is the value of h1, f is 1;
3) if f1 > 1, piecemeal is sized to f1*h2; Otherwise piecemeal is sized to f*h1 and f=1;
Step 4: adopting iteration weight clocking technique, between the time delay change task between decentralized task, the dependence of nexine circulation, reconstructs segmented spaces;
Step 5: divide first time segmented spaces according to piecemeal size f*h, each sub-piecemeal produced by first time piecemeal is used as a node namely as a bunch of task, constitute new iteration space, successively every sub-piecemeal is carried out piecemeal according to step 1, it is thus achieved that the direction vector (P2 of second time piecemeali, P2j);
Step 6: determine the size of second time piecemeal vector;
Second time piecemeal vector is at P2iDirection is sized to Ncore, at P2jDirection is sized to 1, NcoreQuantity for processor cores;
Step 7: obtain execution sequence figure after iteration space being carried out piecemeal according to two the piecemeal vectors obtained, according to execution sequence figure, task is scheduling.
In described step 1 piecemeal vector direction specifically determine that process is as follows:
Dependence between task refers to the execution sequence between task, uses dk=(dki, dkj) represent, wherein dkiRepresent the execution dependence that two tasks circulate, d at nexinekjRepresenting two tasks execution dependence at outer loop, two that find out ragged edge from the dependence set D between task rely on CW and CCW, Pi=CCW and Pj=CW;
CCW is counterclockwise, and interval vector refers to the vector maximum with j vector angle, and CW interval vector clockwise refers to the vector minimum with j vector angle.
Wherein, dependence between task refers to the execution sequence between task, namely task has to wait for the relation that just can be performed after another task completes, the execution sequence of task is represented in a computer typically by figure, one task of each node on behalf in figure, the limit between node and node represents the dependence between task i.e. the specific restriction suffered by tasks carrying order; One iteration represents all of task and is all executed once, and all of iteration constitutes iteration space; One i.e. circulation of iteration, i represents that certain task i-th in a circulation (nexine circulation) is performed, and j represents that jth circulation (outer loop) i.e. all task jth time are performed.
In described step 2, the required size of data loaded and preserve of current iteration is as follows with the relational expression of piecemeal vector magnitude f and h:
(1) the required size of data loaded and preserve of current iteration includes two parts: Part I, size of data produced by current iteration is NUMnext+NUMtop+NUMother; Part II, loads current iteration in advance and following iteration needs the size of data used to be NUMother;
NUM other = Σ d k A goto _ others ( d k ) = Σ d k ( d ki ) ( d kj )
NUM top = Σ d k A goto - top ( d k ) = Σ d k d ki ( f - d kj )
NUM next = Σ d k A goto - next ( d k ) = Σ d k d kj ( h - d ki )
Wherein, NUMnextThis piecemeal is represented to produce and next piecemeal is badly in need of the size of data used, NUMtopRepresent that this piecemeal produces and is positioned at vertical with Pi direction, the size of data used required for simultaneously adjacent with current piecemeal piecemeal;
NUMotherRepresenting that this piecemeal produces and except next piecemeal (nextpartition) and top piecemeal (toppartition), other all piecemeals need the size of data used;
In described step 2, scheduling length Ls refers to the time that an iteration performs.
NUM in strategy one and tactful two in described step 3nextThis piecemeal is represented to produce and next piecemeal is badly in need of the data used, NUMtopRepresent that this piecemeal produces and is positioned at vertical with Pi direction, the data used required for simultaneously adjacent with current piecemeal piecemeal; NUMotherRepresenting that this piecemeal produces and except next piecemeal and top piecemeal, other all piecemeals need the data used; Ls represents the scheduling length of each iteration, and Tr represents and reads the time required for data from main memory, and Tw represents and writes data to the time required for main memory; Ms refers to SPM(scratch-pad storage) amount of capacity.
During restatement, (retiming) is a kind of technology being optimized cycle period by assignment latency, and rotating scheduling is a kind of resource limit Optimized Operation strategy based on weight clocking technique, and it obtains a greater compactness of scheduling by redistributing delay. Piecemeal dispatching technique is in conjunction with iteration weight clocking technique and prefetching technique, each iteration is regarded as a point, then iteration space is divided and (it should be noted that, an iteration refers to all of tasks carrying once, and iteration space comprises all of iteration), the execution of right one piecemeal of later piecemeal. Due to the dependence between task, so when piecemeal, piecemeal will be considered how emphatically so that piecemeal is reasonable. That is do not have endless loop between block and block, it is possible to one piecemeal of a piecemeal be scheduling perform.
Beneficial effect
The invention provides a kind of multilamellar piecemeal dispatching method with storage perception, the first step, show that namely the correct shape of first time piecemeal determines the direction of two piecemeal vectors according to the dependence of task, be denoted as (Pi, Pj);Second step, it is thus achieved that every time circulate the relational expression of the required size of data loaded and preserve and piecemeal vector magnitude f and h and the scheduling length Ls of each iteration; 3rd step, determines the sub-piecemeal size of first time piecemeal according to local storage size and scheduling length; 4th step, utilizes the dependence between iteration weight clocking technique change task, reconstructs piecemeal position; 5th step, on the basis of first time piecemeal, is used as every sub-piecemeal of first time piecemeal as a node and rebuilds a segmented spaces, carry out second time piecemeal; Obtain execution sequence figure after iteration space being carried out piecemeal according to two the piecemeal vectors obtained, according to execution sequence figure, task is scheduling; The multilamellar piecemeal dispatching method with storage perception has considered memory span and storage delay, is performed by piecemeal and to the adjustment of dependence between task, improves the degree of parallelism of task, decreases the quantity of write operation and reduces average scheduled time.
Selection for Partitional form and direction is very strict, and due to the dependence between task, irrational segment partition scheme will cause that task cannot perform, and reasonably piecemeal will reduce the time of task scheduling and the generation of write operation; The inventive method considers memory span and storage delay, the effective overall performance improving system.
Accompanying drawing explanation
Fig. 1 is the flow chart of the present invention;
Fig. 2 is iteration space schematic diagram;
Fig. 3 is the task image MDFG of two dimensional application task model and correspondence, and wherein, figure (a) is a two dimensional application task model, and figure (b) is according to figure (a) MDFG obtained figure;
Fig. 4 is dependence between task at the location drawing of interval CCW counterclockwise and interval CW clockwise, and wherein figure (a) is CCW and CW area schematic, and figure (b) is the location drawing in CCW and CW region of the dependence between task;
The reasonable piecemeal of Fig. 5 and unreasonable piecemeal schematic diagram, wherein, figure (a) is unreasonable piecemeal, and figure (b) is the execution sequence figure obtained according to figure (a) piecemeal, and figure (c) is reasonable piecemeal, and figure (d) is according to figure (c) the execution sequence figure obtained;
Fig. 6 is the nexine dependence schematic diagram between application weight clocking technique change task;
Fig. 7 piecemeal dispatching sequence schemes, and figure (a) represents that the face over one's competence being made up of second time piecemeal, figure (b) represent the relation of first time piecemeal and second time piecemeal, the execution sequence that figure (c) is piecemeal;
The scheduling relation of Fig. 8 processor part and memory portion;
Fig. 9 is the write operation number contrast schematic diagram of multiple dispatching method;
The task average scheduled time contrast schematic diagram of the multiple dispatching method of Figure 10.
Detailed description of the invention
Below in conjunction with accompanying drawing, the present invention is described in further detail.
For more convenient resource block management, first tasks carrying order is carried out modelling process by us, use two-dimensional coordinate represents, i represents the direction that nexine circulates, j represents the direction of outer loop, each circulation can use iteration iteration(i, j) represents, an iteration represents all of task and is all executed once.
In this example, check the configuration of the computer of execution task, the kernel number obtaining computer processor is 3, the amount of capacity of the SPM inlayed on each kernel is the computing unit number of 128KB and each kernel is 3, clock cycle time Tr=2clockcycle(required for data is read from main memory), write data to clock cycle time Tw=4clockcycle(required for main memory).
As it is shown in figure 1, be a kind of flow chart with the multilamellar piecemeal dispatching method storing perception of the present invention, its concrete operation step is as follows:
Step 1: all of task is executed once as an iteration, using between the iterative space that the group task with execution sequence that need to perform repeatedly builds as piecemeal object, it is determined that the piecemeal vector (P in iteration spacei, Pj) direction, piecemeal is sized to f on Pj direction, and piecemeal is sized to h on Pi direction, and two that find out ragged edge from the dependence set D between task rely on CW and CCW, Pi=CCW and Pj=CW;
Wherein, dependence between task refers to the execution sequence between task, namely task has to wait for the relation that just can be performed after another task completes, the execution sequence of task is represented in a computer typically by figure, one task of each node on behalf in figure, the limit between node and node represents the dependence between task i.e. the specific restriction suffered by tasks carrying order; One iteration represents all of task and is all executed once, and all of iteration constitutes iteration space; One i.e. circulation of iteration, each task will be performed a number of times, and which time of a task A performs us and use A [i, j] represent, A [i, j] represents that task A i-th will circulate in nexine circulates, and in outer loop, jth circulation is executed once.
As shown in Figure 3, it is the task image MDFG of a concrete two dimensional application task model and correspondence, task image MDFG=<V, E, d, t>it is the X-Y scheme with node weights and limit weights, wherein a V representation node, namely represent a task in this application, E represents the dependence between task, (u, v) �� E means to also exist between node u and node v dependence, and d (e)=(di,dj) represent a delay, describe the concrete dependence between two tasks.
If the dependence between task a to task b is expressed as dk=(x y), then means that (i, task b j) depends on the task a of iteration iteration (i-x, j-y) to iteration iteration. dk=(0,0) represents the dependence between the task of same iteration. Two dimensional application in Fig. 2 comprises 4 tasks, respectively A, B, C and D, if the dependence between task A to task B is dk=(0,0), the dependence between task C to task D is dk=(0,1). The relation of dependence set D={d1, d2, d3, d4, d5} between CCW and CW and task, as shown in Figure 4.
In this example, there is 3 nonzero-lag vectors (0,1) (1,0) and (-1,1) in dependence D set. So (1,0) is CW vector, (-1,1) is CCW, and the first time vector of piecemeal is Pi=CCW and Pj=CW;
Because cycle applications has basic characteristics: the task that iteration (circulation) performs is all identical, and the task order performed is all identical every time every time, so the possessed dependence of iteration is all consistent every time. In order to find out CW(interval vector clockwise more easily) and CCW(interval vector counterclockwise), we use vector representation all of dependence, then CCW refers to the vector maximum with j vector angle, and CW refers to the vector minimum with j vector angle.
Two that find out ragged edge from D rely on CW and CCW, Pi=CCW and Pj=CW, i.e. d4=CCW, d3=CW;
Step 2: calculate the relational expression between required size of data and f and h loaded and preserve of current iteration, represent with the equation containing h or f respectively, and the scheduling length Ls of current iteration, namely suppose f or h it has been determined that;
(1) the required size of data loaded and preserve of current iteration includes two parts: Part I, size of data produced by current iteration is NUMnext+NUMtop+NUMother; Second: load current iteration in advance and following iteration needs the size of data used to be NUMother; (2) scheduling length refers to the time that an iteration performs, in this example, set all of task execution time all consistent, assume in an iteration containing n task, and have m core to perform, so scheduling length Ls=(n/m) execution time of �� each task, the execution time of each task is unit interval i.e. 1 clock cycle.
NUM other = &Sigma; d k A goto _ others ( d k ) = &Sigma; d k ( d ki ) ( d kj )
NUM top = &Sigma; d k A goto - top ( d k ) = &Sigma; d k d ki ( f - d kj )
NUM next = &Sigma; d k A goto - next ( d k ) = &Sigma; d k d kj ( h - d ki )
Wherein, NUMnextThis piecemeal is represented to produce and next piecemeal is badly in need of the size of data used, NUMtopRepresent that this piecemeal produces and is positioned at the size of data used required for vertical with Pi direction and adjacent with current piecemeal piecemeal; NUMotherRepresent the size of data that this piecemeal produces and other all piecemeal needs are used except next piecemeal (nextpartition) and top piecemeal (toppartition);
Step 3: determine the size f*h of piecemeal;
1) as f=1, h1 and h2 is drawn respectively according to strategy one and strategy two;
Strategy one: 2NUMother+NUMtop+NUMnext��Ms;
Strategy two: (NUMtop+NUMother)Tw+NUMother�� Tr��Ls �� f �� h;
Wherein: NUMnextThis piecemeal is represented to produce and next piecemeal is badly in need of the size of data used, NUMtopRepresent that this piecemeal produces and is positioned at the size of data used required for vertical with Pi direction and adjacent with current piecemeal piecemeal; NUMotherRepresent the size of data that this piecemeal produces and other all piecemeal needs are used except next piecemeal institute top piecemeal; Ls represents the scheduling length of each iteration, and Tr represents and reads the time required for data from main memory, and Tw represents and writes data to the time required for main memory; Ms refers to SPM(scratch-pad storage) amount of capacity;
2) judge the size of h1 and h2, if h1 is more than h2, then make h=h2, Utilization strategies one and strategy two calculate f1; Otherwise piecemeal is sized to f*h1 and f=1;
3) if f1 is more than 1, then piecemeal is sized to f1*h2; Otherwise piecemeal is sized to f*h1 and f=1;
In this example, try to achieve first time piecemeal size f=1, h=4.
After iteration space carries out first time piecemeal, as it is shown in figure 5, wherein, figure (a) is unreasonable piecemeal to its piecemeal schematic diagram, and figure (b) is the execution sequence figure obtained according to figure (a) piecemeal, therefrom finds out that this execution sequence exists endless loop, it is impossible to perform; Figure (c) is the reasonable piecemeal using the inventive method to obtain, and figure (d) is according to figure (c) the execution sequence figure obtained; From figure (d) it can be seen that after first time piecemeal, the dependence between block and block is only remaining (1,0), (-1,1) two kinds.
Step 4: adopting iteration weight clocking technique, between the time delay change task between decentralized task, the dependence of nexine circulation, reconstructs segmented spaces;
During restatement, (retiming) is a kind of technology being optimized cycle period by assignment latency, and rotating scheduling is a kind of resource limit Optimized Operation strategy based on weight clocking technique, and it obtains a greater compactness of scheduling by redistributing delay. Piecemeal dispatching technique, in conjunction with iteration weight clocking technique and prefetching technique, is regarded a point as each iteration, then iteration space is divided, the execution of right one piecemeal of later piecemeal. Due to the dependence between task, so when piecemeal, piecemeal will be considered how emphatically so that piecemeal is reasonable. That is not endless loop between block and block, it is possible to one piecemeal of a piecemeal be scheduling perform.
In order to obtain a greater compactness of scheduling, we carry out the dependence between change task by carrying out an iteration weight clocking technique, iteration weight clocking technique is to utilize the dependence between the delay reconstruction task between decentralized task, shortens the execution cycle of task with this.In order to keep the execution sequence of row-wise, during iteration restatement, need the dependence ensureing not change between piecemeal and piecemeal, so the Circular dependency relation of innermost layer between our a change task, such as exist between task A and task B and postpone as d3=(-1,1), after passing through to disperse to postpone, the delay between A and B becomes d3=(0,1), say, that before not utilizing iteration weight clocking technique, iteration(i, j) in perform task A be necessarily dependent upon iteration iteration(i+1, j-1) in task B, utilize iteration weight clocking technique change postpone after, iteration(i, performing in j) of task A depends on the B performed in iteration (i, j-1), as shown in Figure 6.
Step 5: divide first time segmented spaces according to piecemeal size f*h, is used as each sub-piecemeal produced by first time piecemeal as a node namely as a bunch of task, constitutes new iteration space, obtain the direction P2 of piecemeal for the second time according to step 1iAnd P2j��
First time piecemeal is that iteration space carries out piecemeal, and second time piecemeal is that the sub-piecemeal to first time piecemeal carries out piecemeal, then the sub-block of each first time piecemeal be defined as partition (i, j).
As shown in Figure 7 and Figure 8, Fig. 7 describes the execution sequence of task scheduling to the framework of task scheduling, and Fig. 8 describes the scheduling of a task. in the method, for convenience's sake, first time piecemeal (first_level_partition) is divided three classes by we according to the situation of the first_level_partition being currently executing: nextfirst_level_partition, topfirst_level_partition and otherfirst_level_partition. and to it, subregion is carried out according to the position that utilizes of data for each first_level_partition piecemeal, as shown in Figure 8 (a), one is divided into four regions, first region, the task of representing produced this piecemeal of data will be used, task in data nextfirst_level_partition produced by second region representation needs to use, task in data topfirst_level_partition produced by 3rd region representation needs to use, 4th region refers to that the data otherfirst_level_partition of generation needs to use. so we can quickly calculate each delay (d(e) in a first_level_partition): dk=(dki,dkj), the data for other first_level_partition of generation:
Agoto_top(dk)=area(PQVU)=dki(f-dkj)
Agoto_next(dk)=area(VSWX)=dkj(h-dki)
Agoto_other(dk)=area(UVRS)=dkidkj
Further, when we determined that the piecemeal size of first_level_partition, we quickly can calculate the data being produced and storing inside a first_level_partition.
NUM other = &Sigma; d k A goto _ others ( d k ) = &Sigma; d k ( d ki ) ( d kj )
NUM top = &Sigma; d k A goto - top ( d k ) = &Sigma; d k d ki ( f - d kj )
NUM next = &Sigma; d k A goto - next ( d k ) = &Sigma; d k d kj ( h - di )
Step 6: determine the size of second time piecemeal; The quantity N of processor cores is obtained from hardware configuration informationcore=3. Then P2iDirection be sized to Ncore=3, P2jDirection be sized to 1.
Step 7: obtain execution sequence figure after iteration space being carried out piecemeal according to two the piecemeal vectors obtained, according to execution sequence figure, task is scheduling.
As it is shown in figure 9, apply multiple bencmark and benchmark tests the inventive method TLP and other two kinds of algorithm List and IRP performances on task average scheduled time; As seen from the figure, the inventive method TLP(has the multilamellar piecemeal task scheduling strategy of storage perception) performance in task average scheduled time is substantially better than other two kinds of dispatching algorithms. Performance increase rate reaches about 30%.This is because there is the piecemeal scheduling strategy storing perception when carrying out piecemeal scheduling, consider not only the degree of parallelism of task, also fully take into account storage delay, ensure the scheduling time no longer than the processor scheduling time of memorizer, this avoid some storage time delays, saving the waiting time, thus improve the performance of system, decreasing scheduling time.
Performance in write operation as shown in Figure 10, is applied multiple bencmark and benchmark is tested the inventive method TLP and other two kinds of algorithm List and IRP performances in write operation; As seen from the figure, the inventive method TLP(has the multilamellar piecemeal task scheduling strategy of storage perception) than other two kinds of algorithms, write operation on average decreases about 45%. This is because the partition strategy of the present invention has fully taken into account the capacity of local storage, through twice piecemeal, there is local storage as far as possible in the data required for ensureing each kernel treatable piecemeal of core, which save substantial amounts of write operation, consequently reduce the consumption of scheduling time and energy, thus improve the performance of system. But when the capacity of local storage is certain, along with the expansion of task scale, the data of required storage increase, and the number of write operation also will increase, thus task average scheduled time can increase, the performance of system can reduce.
IIR, 2D, WDF(1 in Fig. 9 and Figure 10), WDF(2), DPCM(1), DPCM(2), DPCM(3), FLOYD(1), FLOYD(2) and FLOYD(3) be data handling utility bencmark and benchmark.
In the present invention, task is multidimensional DSP application, but the multilamellar partition strategy with storage consciousness proposed can expand to the n DSP tieed up and other have in the application of cycle specificity.

Claims (1)

1. A memory-aware multi-layer partition scheduling method, characterized in that it comprises the following steps:
Step 1: One execution of all tasks is called an iteration; the iteration space, built from a group of tasks with an execution order that must be executed repeatedly, is the partitioning object. Determine the directions of the partition vectors (Pi, Pj) of the iteration space, where the partition size is f in the Pj direction and h in the Pi direction: find the two outermost dependences CW and CCW in the dependence set D between tasks, and set Pi = CCW and Pj = CW; a dependence denotes the execution order between tasks;
Step 2: Determine the relation between the data size to be loaded and stored for the current iteration and the partition-vector sizes f and h, and the schedule length Ls of the current iteration;
Step 3: Determine the partition-vector sizes f and h according to strategy one and strategy two:
1) Set f to 1 and compute h from strategy one and from strategy two, obtaining h1 and h2 respectively;
Strategy one: 2NUMother + NUMtop + NUMnext ≤ Ms;
Strategy two: (NUMtop + NUMother)Tw + NUMother·Tr ≤ Ls·f·h;
2) If h1 > h2, then h takes the value h2; compute f using strategy one and strategy two, obtaining f1 and f2, and go to 3); otherwise h takes the value h1, f is 1, and go to step 4;
3) If f1 > 1, then f takes the value f1 and the partition size is f1*h2; otherwise the partition size is f*h1 with f = 1;
Step 4: Apply the iterational retiming technique to distribute delays among tasks and change the inner-loop dependences between tasks, reconstructing the partition space;
Step 5: Divide the first-level partition space by the partition size f*h; treat each sub-partition produced by the first-level partitioning as one node, i.e. as one task cluster, to form a new iteration space; then partition each sub-partition according to step 1 to obtain the direction vectors (P2i, P2j) of the second-level partitioning;
Step 6: Determine the size of the second-level partition vector;
The second-level partition size is Ncore in the P2i direction and 1 in the P2j direction, where Ncore is the number of processor cores;
Step 7: Obtain the execution-order graph after partitioning the iteration space according to the two partition vectors, and schedule the tasks according to the execution-order graph;
The directions of the partition vectors in step 1 are determined as follows:
A dependence between tasks denotes their execution order and is written dk = (dki, dkj), where dki denotes the execution dependence of two tasks in the inner loop and dkj their execution dependence in the outer loop; the two outermost dependences CW and CCW are found in the dependence set D, with Pi = CCW and Pj = CW;
CCW, the counter-clockwise extreme vector, is the dependence with the largest angle to the j vector; CW, the clockwise extreme vector, is the dependence with the smallest angle to the j vector;
In step 2, the relation between the data size to be loaded and stored for the current iteration and the partition-vector sizes f and h is as follows:
(1) The data to be loaded and stored for the current iteration comprises two parts: first, the data produced by the current iteration, of size NUMnext + NUMtop + NUMother; second, the data loaded in advance for the current and following iterations, of size NUMother;
NUMother = Σdk Agoto_others(dk) = Σdk (dki)(dkj)
NUMtop = Σdk Agoto-top(dk) = Σdk dki(f − dkj)
NUMnext = Σdk Agoto-next(dk) = Σdk dkj(h − dki)
where NUMnext denotes the data produced by this partition and urgently needed by the next partition; NUMtop denotes the data produced by this partition and needed by the partition that is perpendicular to the Pi direction and adjacent to the current partition; NUMother denotes the data produced by this partition and needed by all other partitions except the next and the top partitions;
Agoto_others(dk) is an intermediate variable denoting the data generated for otherfirst_level_partition (other first-level partitions) by the dependence dk when the partition executes its tasks;
Agoto-top(dk) is an intermediate variable denoting the data generated for topfirst_level_partition by the dependence dk when the partition executes its tasks;
Agoto-next(dk) is an intermediate variable denoting the data generated for nextfirst_level_partition by the dependence dk when the partition executes its tasks;
otherfirst_level_partition, topfirst_level_partition and nextfirst_level_partition are described relative to the current first-level partition, currentfirstlevelpartition: otherfirst_level_partition denotes any other first-level partition, topfirst_level_partition the top first-level partition, and nextfirst_level_partition the next first-level partition;
In step 2, the schedule length Ls is the time one iteration takes to execute;
In strategy one and strategy two of step 3, NUMnext denotes the data produced by this partition and urgently needed by the next partition; NUMtop denotes the data produced by this partition and needed by the partition perpendicular to the Pi direction and adjacent to the current partition; NUMother denotes the data produced by this partition and needed by all other partitions except the next and the top partitions; Ls is the schedule length of each iteration; Tr is the time required to read data from main memory; Tw is the time required to write data to main memory; Ms is the capacity of the scratch-pad memory SPM.
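Strategy one (the scratch-pad capacity bound) and strategy two (hiding memory-access time within the schedule length Ls·f·h) of step 3 can be illustrated with a brute-force sketch. This is a simplified reading under stated assumptions: the claim's separate f1/f2 computation is collapsed into a single feasibility search, and `partition_size`, `h_max` and the example numbers are hypothetical:

```python
def partition_size(D, Ms, Ls, Tr, Tw, h_max=64):
    """Sketch of step 3: choose a first-level partition size (f, h).

    NUMother = sum dki*dkj        (data for other partitions)
    NUMtop   = sum dki*(f - dkj)  (data for the top partition)
    NUMnext  = sum dkj*(h - dki)  (data for the next partition)
    Strategy one: 2*NUMother + NUMtop + NUMnext <= Ms
    Strategy two: (NUMtop + NUMother)*Tw + NUMother*Tr <= Ls*f*h
    """
    def nums(f, h):
        other = sum(di * dj for di, dj in D)
        top = sum(di * (f - dj) for di, dj in D)
        nxt = sum(dj * (h - di) for di, dj in D)
        return other, top, nxt

    def s1(f, h):                      # data fits in the SPM capacity
        o, t, n = nums(f, h)
        return 2 * o + t + n <= Ms

    def s2(f, h):                      # memory time hidden in Ls*f*h
        o, t, _ = nums(f, h)
        return (t + o) * Tw + o * Tr <= Ls * f * h

    f = 1
    h1 = max((h for h in range(1, h_max) if s1(f, h)), default=1)
    h2 = min((h for h in range(1, h_max) if s2(f, h)), default=h_max)
    if h1 > h2:
        # capacity allows more than latency hiding requires: fix h = h2
        # and grow f as far as both strategies stay satisfied
        f = max((g for g in range(1, h_max) if s1(g, h2) and s2(g, h2)),
                default=1)
        return f, h2
    return 1, h1

print(partition_size([(1, 0), (1, 1), (0, 1)], Ms=20, Ls=1, Tr=2, Tw=1))
# -> (6, 4)
```

With a smaller scratch-pad (e.g. Ms = 8 in the same example) the capacity bound binds first, and the sketch returns f = 1 with the largest h that still fits, matching branch 2) of step 3.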
CN201310145363.XA 2013-04-24 2013-04-24 A memory-aware multi-layer partition scheduling method Active CN103246563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310145363.XA CN103246563B (en) 2013-04-24 2013-04-24 A memory-aware multi-layer partition scheduling method

Publications (2)

Publication Number Publication Date
CN103246563A CN103246563A (en) 2013-08-14
CN103246563B true CN103246563B (en) 2016-06-08

Family

ID=48926094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310145363.XA Active CN103246563B (en) 2013-04-24 2013-04-24 A memory-aware multi-layer partition scheduling method

Country Status (1)

Country Link
CN (1) CN103246563B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639769A (en) * 2008-07-30 2010-02-03 国际商业机器公司 Method and device for splitting and sequencing dataset in multiprocessor system
CN101980168A (en) * 2010-11-05 2011-02-23 北京云快线软件服务有限公司 Dynamic partitioning transmission method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Iterational Retiming with Partitioning: Loop Scheduling with Complete Memory Latency Hiding; Chun Jason Xue et al.; ACM Transactions on Embedded Computing Systems; 2010-02-28; Vol. 9, No. 3; entire document *
Optimal Loop Scheduling for Hiding Memory Latency Based on Two Level Partitioning and Prefetching; Zhong Wang et al.; IEEE Transactions on Signal Processing; 2001-11-30; Vol. 49, No. 11; entire document *



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information

Inventor after: Li Kenli

Inventor after: Wang Yan

Inventor after: Du Jiayi

Inventor after: Tang Zhuo

Inventor after: Xiao Zheng

Inventor after: Zhu Ningbo

Inventor before: Wang Yan

Inventor before: Li Kenli

Inventor before: Du Jiayi

Inventor before: Tang Zhuo

Inventor before: Xiao Zheng

Inventor before: Zhu Ningbo

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: WANG YAN LI KENLI DU JIAYI TANG ZHUO XIAO ZHENG ZHU NINGBO TO: LI KENLI WANG YAN DU JIAYI TANG ZHUO XIAO ZHENG ZHU NINGBO

C14 Grant of patent or utility model
GR01 Patent grant