CN105487838A - Task-level parallel scheduling method and system for dynamically reconfigurable processor - Google Patents

Info

Publication number
CN105487838A
CN105487838A (application number CN201510817591.6A)
Authority
CN
China
Prior art keywords
processing unit
reconfigurable processing
reconfigurable
task
master controller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510817591.6A
Other languages
Chinese (zh)
Other versions
CN105487838B (en
Inventor
田丰硕
赵仲元
绳伟光
何卫锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201510817591.6A priority Critical patent/CN105487838B/en
Publication of CN105487838A publication Critical patent/CN105487838A/en
Application granted granted Critical
Publication of CN105487838B publication Critical patent/CN105487838B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution

Abstract

The invention proposes a task-level parallel scheduling method and system for a dynamically reconfigurable processor. The system comprises a master controller, a plurality of reconfigurable processing units, a main memory, a direct memory access device, and a system bus. Each reconfigurable processing unit consists of a co-controller, a plurality of reconfigurable processing element arrays responsible for reconfigurable computation, and a plurality of shared memories used for data storage; the reconfigurable processing element arrays and shared memories are arranged adjacently, and each shared memory can be read and written by the two reconfigurable processing element arrays connected to it. With the proposed method and system, different scheduling modes can be applied to different tasks by adjusting the scheduling method, so that essentially all parallel tasks can be well accelerated in parallel on the reconfigurable processor.

Description

Task-level parallel scheduling method and system for a dynamically reconfigurable processor
Technical field
The present invention relates to the field of computing, and in particular to a task-level parallel scheduling method and system for a dynamically reconfigurable processor.
Background art
Processor computing paradigms have traditionally fallen into two classes. General-purpose computing on von Neumann processors is extremely flexible, but its instruction-stream-driven execution model, limited arithmetic units, and limited memory bandwidth leave overall performance and power consumption unsatisfactory. Dedicated computing can optimize structures and circuits for a specific application and needs no instruction set, so it executes quickly with low power consumption. However, dedicated computing systems have a fatal defect: their flexibility and extensibility are very poor, and the ever more complex applications that keep emerging often cannot be accommodated by simple extension. A different dedicated system must be designed for each application, so hardware design often cannot keep up with the pace at which applications evolve. Meanwhile, the design cycle of a dedicated computing system is long, and the one-time engineering investment is too high. Reconfigurable computing emerged against this background as a computing paradigm that combines the flexibility of software with the efficiency of hardware. Reconfigurable computing technology combines the advantages of general-purpose processors and ASICs, offering the efficiency of hardware together with the programmability of software. It strikes a better balance among key metrics such as performance, power consumption, and flexibility, filling the gap between general-purpose and dedicated computing.
Mainstream processors in today's computers include multi-core CPUs with 2 to 8 cores and many-core GPUs. This design trend has made parallel processing a hot topic, and parallel algorithms and parallel programming have become knowledge that programmers must understand and master. In June 2007, NVIDIA introduced CUDA, a hardware/software architecture that treats the GPU as a data-parallel device. In the CUDA model the CPU acts as the host and the GPU as a coprocessor; a CUDA parallel function that runs on the GPU is called a kernel. A kernel function is not a complete program, but the portion of a CUDA program that can be executed in parallel. In this way the CPU and GPU divide the work cleanly, achieving parallelism at multiple levels.
However, no unified parallel-processing specification exists for the general reconfigurable processors targeted by the present invention. A typical reconfigurable processor architecture contains a general-purpose processor and one or more reconfigurable processing units (Reconfigurable Processing Unit, RPU). To assign the tasks of a multitask application on such a coarse-grained reconfigurable processor, the present invention proposes an RPU allocation and task scheduling method and system.
Summary of the invention
The present invention proposes a task-level parallel scheduling method and system for a dynamically reconfigurable processor. By adjusting the scheduling method, different scheduling modes can be applied to different tasks, so that essentially all parallel tasks can be accelerated in parallel on the reconfigurable processor.
To achieve the above object, the present invention proposes a task-level parallel scheduling system for a dynamically reconfigurable processor, comprising a master controller, a plurality of reconfigurable processing units, a main memory, a direct memory access device, and a system bus,
wherein each reconfigurable processing unit consists of a co-controller, a plurality of reconfigurable processing element arrays responsible for reconfigurable computation, and a plurality of shared memories used for data storage; the reconfigurable processing element arrays and shared memories are arranged adjacently, and each shared memory can be read and written by the two reconfigurable processing element arrays connected to it.
Further, the master controller executes the serial code in a program that is not suited to processing by the reconfigurable processing units, and is responsible for scheduling, starting, and running the plurality of reconfigurable processing units.
Further, the reconfigurable processing units are responsible for the computation-intensive parallelizable code in the program.
Further, the co-controller transfers the data and configuration information required for the computations of the reconfigurable processing element arrays, and controls the start, run, and termination of the arrays.
To achieve the above object, the present invention also proposes a task-level parallel scheduling method for a dynamically reconfigurable processor, comprising the following steps:
encapsulating the computation-intensive parallelizable code of the application program as kernel functions;
compiling the serial-section code and the parallel-section code separately to generate executable code suitable for the master controller and the reconfigurable processing units, respectively;
the master controller executing the serial-section code;
when kernel-function code is reached, the master controller scheduling and distributing the kernel-function code to reconfigurable processing units for processing.
Further, the master controller's scheduling of the reconfigurable processing units is divided into two parallel modes, synchronous call and asynchronous call:
in a synchronous call, the master controller finds reconfigurable processing units that are not running, loads the executable code and configuration information, and suspends itself; in the synchronous call, multiple reconfigurable processing units are invoked, each processing a different data block; after all the reconfigurable processing units finish, the processing results are updated via the synchronous function's return value, and the master controller resumes executing serial code;
in an asynchronous call, the master controller finds reconfigurable processing units that are not running and, without interrupting itself, loads the executable code and configuration information and starts the units; the master controller continues running until it needs data returned by a reconfigurable processing unit, at which point it stops and waits for the unit to finish computing and return the data.
Further, when the kernel function has few instructions, such that a single reconfigurable processing unit can complete the entire kernel's computation on its own, the multiple reconfigurable processing element arrays inside that unit execute the same configuration information in parallel, each array computing on the data in its own shared memory.
Further, when the kernel function has many instructions, such that all its statements cannot be executed in one pass, the kernel function is divided into multiple subtasks of equal length, and the configuration information of the subtasks is distributed in order to multiple reconfigurable processing element arrays. Because each array can read and write both the adjacent upper-layer and lower-layer shared memories, each shared memory is divided into equal-sized blocks A and B. During pipelined task execution, each array first reads data from block A of the upper-layer shared memory and writes results to block B of the lower-layer shared memory; after that pass, it reads data from block B of the upper-layer shared memory and writes results to block A of the lower-layer shared memory. In parallel with these two passes, data is transferred to and from main memory using the portions of the first and last shared memories not involved in computation.
Compared with the prior art, the above technical scheme includes the following innovations and beneficial effects (advantages):
1. The task-level parallel scheduling method of the present invention is designed for a specific three-layer heterogeneous coarse-grained reconfigurable processor. The data-intensive and computation-intensive parts of an application are packaged as kernel functions; the master controller handles the processing of serial code and the allocation of reconfigurable processing units; kernel functions are handed to the reconfigurable processing units, which have stronger parallel computing capability, and within each unit the reconfigurable arrays are flexibly allocated to perform the computation. In this way the parallel computing capability of the multi-level heterogeneous coarse-grained reconfigurable processor is fully exploited, and, together with a dedicated compiler, computation-intensive applications can be fully accelerated in parallel.
2. Building on the GPU parallel computing tool CUDA, the present invention ports its parallel scheduling method to a multi-level heterogeneous coarse-grained reconfigurable processor, proposes a new pipelined scheduling mode, and extends the scheduling method to different kinds of tasks.
3. The present invention realizes multitask scheduling across many reconfigurable processing units. By adjusting the scheduling method, different scheduling modes can be applied to different tasks, avoiding the constraints a single scheduling mode imposes on tasks, so that essentially all parallel tasks can be accelerated in parallel on the reconfigurable processor.
Brief description of the drawings
Fig. 1 is a structural diagram of the task-level parallel scheduling system of the dynamically reconfigurable processor according to a preferred embodiment of the present invention.
Fig. 2 is a flowchart of the task-level parallel scheduling method of the dynamically reconfigurable processor according to a preferred embodiment of the present invention.
Fig. 3 and Fig. 4 are schematic diagrams of the synchronous scheduling method used when the kernel function has few instructions.
Fig. 5 is a schematic diagram of the asynchronous scheduling method used when the kernel function has many instructions.
Detailed description of the embodiments
Specific embodiments of the present invention are given below in conjunction with the accompanying drawings, but the invention is not limited to the following embodiments. The advantages and features of the invention will become clearer from the following description and the claims. It should be noted that the drawings are all in greatly simplified form and not drawn to precise scale; they serve only to conveniently and clearly illustrate the embodiments of the present invention.
Referring to Fig. 1, a structural diagram of the task-level parallel scheduling system of the dynamically reconfigurable processor according to a preferred embodiment, the box on the right side is an enlarged view of RPU1 showing its internal functional structure. The present invention proposes a task-level parallel scheduling system for a dynamically reconfigurable processor, comprising a master controller ARM11, a plurality of reconfigurable processing units (Reconfigurable Processing Unit, RPU), a main memory DDR, a direct memory access device (Direct Memory Access, DMA), and a system bus AHB, wherein each reconfigurable processing unit RPU consists of a co-controller ARM7, a plurality of reconfigurable processing element arrays (Processing Element Array, PEA) responsible for reconfigurable computation, and a plurality of shared memories (Shared Memory, SM) used for data storage; the PEAs and SMs are arranged adjacently, and each SM can be read and written by the two PEAs connected to it.
According to the preferred embodiment, the master controller ARM11 executes the serial code in a program that is not suited to RPU processing, and is responsible for scheduling, starting, and running the RPUs. The RPUs are responsible for the computation-intensive parallelizable code in the program. Further, the co-controller ARM7 transfers the data and configuration information required for the computations of the PEAs and controls the start, run, and termination of the PEAs.
According to the preferred embodiment, each reconfigurable processing unit RPU comprises 4 PEAs and 4 SMs.
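The three-layer hierarchy described above can be sketched as plain data structures. This is an illustrative model only, not part of the patent; the class and field names (`RPU`, `PEA`, `SM`) simply follow the abbreviations used in the text:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PEA:
    """Reconfigurable processing element array: runs one configuration package at a time."""
    idx: int

@dataclass
class SM:
    """Shared memory, readable/writable by the two adjacent PEAs."""
    idx: int

@dataclass
class RPU:
    """Reconfigurable processing unit: one ARM7 co-controller, 4 PEAs, 4 SMs."""
    co_controller: str = "ARM7"
    peas: List[PEA] = field(default_factory=lambda: [PEA(i) for i in range(4)])
    sms: List[SM] = field(default_factory=lambda: [SM(i) for i in range(4)])

@dataclass
class ReconfigurableProcessor:
    """Top level: ARM11 master controller plus several RPUs on an AHB bus with DDR and DMA."""
    master_controller: str = "ARM11"
    rpus: List[RPU] = field(default_factory=lambda: [RPU() for _ in range(4)])

proc = ReconfigurableProcessor()
```

The number of RPUs is left at an arbitrary 4 here; the patent only requires "a plurality", while the 4-PEA/4-SM split per RPU is the one stated in the preferred embodiment.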
Referring to Fig. 2, a flowchart of the task-level parallel scheduling method of the dynamically reconfigurable processor according to the preferred embodiment, the present invention also proposes a task-level parallel scheduling method for a dynamically reconfigurable processor, comprising the following steps:
Step S100: encapsulate the computation-intensive parallelizable code of the application program as kernel functions;
Step S200: compile the serial-section code and the parallel-section code separately to generate executable code suitable for the master controller and the reconfigurable processing units, respectively;
Step S300: the master controller executes the serial-section code;
Step S400: when kernel-function code is reached, the master controller schedules and distributes the kernel-function code to reconfigurable processing units for processing.
For a general reconfigurable processor, parallel execution mostly means using the high-speed computation of configurable components to process large amounts of repetitive, computation-intensive work. The more complex multi-level heterogeneous reconfigurable processor targeted by the present invention contains computing modules at three levels, namely the master controller, the co-controller, and the PEA, each with independent memory, together forming a three-layer reconfigurable heterogeneous architecture. When processing an application, task-level parallelism therefore includes not only the parallelism of the reconfigurable units but also the allocation and scheduling of multiple RPUs.
For the compiled C program of an application, the computation-intensive parallelizable code is encapsulated as kernel functions; one application may contain several kernel functions differing in purpose and complexity. During compilation, the serial and parallel sections are compiled separately into executable code for the master-controller part and the RPU part. Kernel functions exhibit two levels of parallelism: synchronous/asynchronous parallelism across different RPUs, and code parallelism across the 4 PEAs inside an RPU.
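By analogy with CUDA's host/kernel split, the compilation flow above can be illustrated with a toy host program. This is a sketch only: the patent's real toolchain compiles GR-C into ARM11 executable code and PEA configuration packages, and the `kernel` decorator below is a hypothetical stand-in for that serial/parallel split:

```python
kernels = {}

def kernel(fn):
    """Mark a function as a parallel section to be 'compiled' for the RPUs."""
    kernels[fn.__name__] = fn
    return fn

@kernel
def vec_add(a, b):
    # computation-intensive, data-parallel body -> would run on an RPU
    return [x + y for x, y in zip(a, b)]

def host_main():
    # serial section: runs on the master controller (ARM11)
    a, b = [1, 2, 3], [4, 5, 6]
    # kernel call site: the master controller dispatches this to an RPU
    return kernels["vec_add"](a, b)

result = host_main()
```

As in CUDA, the kernel is not a complete program; it is the parallelizable step that the host hands off, while everything else stays in the serial section.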
For a multitask application, at execution time the master controller first executes the serial-section code; when a kernel section is reached, the master controller's RPU scheduler allocates RPUs, in one of two parallel modes, synchronous call or asynchronous call:
in a synchronous call, the master controller finds reconfigurable processing units that are not running, loads the executable code and configuration information, and suspends itself; in the synchronous call, multiple reconfigurable processing units are invoked, each processing a different data block; after all the reconfigurable processing units finish, the processing results are updated via the synchronous function's return value, and the master controller resumes executing serial code;
in an asynchronous call, the master controller finds reconfigurable processing units that are not running and, without interrupting itself, loads the executable code and configuration information and starts the units; the master controller continues running until it needs data returned by a reconfigurable processing unit, at which point it stops and waits for the unit to finish computing and return the data. A kernel function with a large data volume can be processed by multiple RPUs computing in parallel via a synchronous call, while different kernels with no mutual dependence can run in parallel via asynchronous calls.
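The two call modes can be mimicked with ordinary threads. This is a behavioural sketch only: real dispatch loads configuration packages over the system bus, and the thread-per-RPU model and helper names here are assumptions for illustration:

```python
import threading

def run_on_rpu(code, block, results, i):
    # stand-in for one idle RPU executing loaded code on its data block
    results[i] = code(block)

def sync_call(code, blocks):
    """Synchronous call: the master controller suspends until every RPU finishes."""
    results = [None] * len(blocks)
    threads = [threading.Thread(target=run_on_rpu, args=(code, blk, results, i))
               for i, blk in enumerate(blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()          # master controller is suspended here
    return results        # results merged via the synchronous return value

def async_call(code, block):
    """Asynchronous call: the master keeps running; it joins only when it needs the data."""
    result = [None]
    t = threading.Thread(target=run_on_rpu, args=(code, block, result, 0))
    t.start()
    return t, result      # master continues serial work, joins t later

double = lambda xs: [2 * x for x in xs]
sync_out = sync_call(double, [[1, 2], [3, 4]])   # two RPUs, two data blocks
handle, async_out = async_call(double, [5, 6])
handle.join()             # master now needs the data, so it stops and waits
```

The design point mirrored here is that a synchronous call buys data parallelism across RPUs for one large kernel, while asynchronous calls overlap independent kernels with the master's serial code.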
After the RPUs executing a kernel function have been allocated and scheduled, task parallelism likewise exists inside each RPU, which is the second level of parallel computation. In the hardware architecture on which the present invention relies, each RPU contains 4 PEAs and corresponding SMs, where each SM can be read and written by the two PEAs connected to it; on this architecture, two modes of parallel computation are available depending on the kind of kernel.
Referring to Fig. 3 and Fig. 4, schematic diagrams of the synchronous scheduling method used when the kernel function has few instructions: when a single reconfigurable processing unit can complete the entire kernel's computation on its own, its internal PEAs execute the same configuration information in parallel, each PEA computing on the data in its own SM, which can shorten the kernel execution time to as little as one quarter. The first row of boxes, block1 to block4, represents the co-controller copying data from DDR into the left half of each SM; the circles represent PEA execution; the second row of boxes, block1 to block4, represents the PEAs writing computed data to the right half of each SM; and the third row of boxes, block1 to block4, represents the co-controller writing data from the right half of each SM out to DDR. The above constitutes the loop body in the middle of the program.
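For this few-instruction case, the 4 PEAs of one RPU run the identical configuration over 4 data blocks, an ordinary data-parallel split. A minimal sketch, assuming an even block split (the patent does not fix block sizes) and with the sequential loop standing in for the four PEAs running concurrently:

```python
def split_blocks(data, n=4):
    """Split the input into n near-equal blocks, one per PEA's shared memory."""
    size = (len(data) + n - 1) // n
    return [data[i * size:(i + 1) * size] for i in range(n)]

def small_kernel_dispatch(config, data):
    """All 4 PEAs execute the identical configuration on their own SM's block,
    cutting kernel time to roughly a quarter of single-PEA execution."""
    out = []
    for block in split_blocks(data):   # conceptually in parallel on PEA1..PEA4
        out.extend(config(block))
    return out

squared = small_kernel_dispatch(lambda blk: [x * x for x in blk], list(range(8)))
```

Since every PEA holds the same configuration, the only per-PEA state is the data block in its SM, which is what makes the four-way speedup possible without recompiling the kernel.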
When the kernel function has many instructions, such that all its statements cannot be executed in one pass, a pipelined mode of task parallelism is needed. Concretely, the kernel function is divided into no more than 4 subtasks of as nearly equal length as possible, and the subtasks' configuration information is given in order to the 4 PEAs. Because each PEA can read and write both the adjacent upper-layer and lower-layer SMs, each SM is divided into equal-sized blocks A and B. During pipelined task execution, each PEA first reads data from block A of the upper-layer SM and writes results to block B of the lower-layer SM; after that pass, it reads data from block B of the upper-layer SM and writes results to block A of the lower-layer SM. In parallel with these two passes, data is transferred to and from main memory using the portions of the first and last SMs not involved in computation. In this way there is no extra data-transfer time anywhere in the process, and the kernel function executes as an uninterrupted pipeline.
Referring to Fig. 5, a schematic diagram of the asynchronous scheduling method used when the kernel function has many instructions: as shown in Fig. 5, the programmer divides all the instructions in the loop into several parts by some method; when the compiler processes the GR-C source, these parts are compiled into different PEA configuration packages, and the 4 PEAs are invoked in sequence, each delayed by one clock period relative to the previous one, achieving a pipelining effect. In period one, PEA1 computes on the data of part A in SM1 and puts the result into block B of SM2; in period two, PEA1 computes on the data of part B in SM1 while PEA2 computes on the block-B data of SM2, putting results into block A of SM2 and SM3 respectively; the pipeline proceeds by analogy, and after PEA4 finishes computing it copies its data directly into main memory, until all data are processed. The figure shows the complete execution flow inside an RPU by task division, where block1 to block5 represent different data streams and the rightmost box represents data output from SM3 to DDR.
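The pipelined schedule of Figs. 4 and 5 can be modelled as a simple timing table: PEA k starts one period after PEA k-1, and each SM ping-pongs between its A and B halves so that reads never collide with writes. The period/PEA indexing below is an illustrative reconstruction of the flow described, not a cycle-accurate model of the hardware:

```python
def pipeline_schedule(n_blocks, n_peas=4):
    """Return {period: [(pea, block, half), ...]}: PEA k processes data block b
    in period b + k, reading the A half of its upper SM on even blocks and the
    B half on odd blocks (the ping-pong alternation described in the text)."""
    sched = {}
    for b in range(n_blocks):
        for k in range(n_peas):
            half = "A" if b % 2 == 0 else "B"   # ping-pong between SM halves
            sched.setdefault(b + k, []).append((k, b, half))
    return sched

s = pipeline_schedule(5)
# After the fill phase (periods 0..2), all four PEAs are busy every period,
# so the pipeline runs without extra data-movement stalls.
```

The table makes the patent's claim concrete: once the pipeline is full, every period has one entry per PEA, and the alternating halves leave the other half of each SM free for the co-controller's DDR transfers.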
Although the present invention is disclosed above by way of preferred embodiments, they are not intended to limit the invention. A person of ordinary skill in the art can make various modifications and variations without departing from the spirit and scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the claims.

Claims (8)

1. A task-level parallel scheduling system for a dynamically reconfigurable processor, characterized in that it comprises a master controller, a plurality of reconfigurable processing units, a main memory, a direct memory access device, and a system bus,
wherein each reconfigurable processing unit consists of a co-controller, a plurality of reconfigurable processing element arrays responsible for reconfigurable computation, and a plurality of shared memories used for data storage; the reconfigurable processing element arrays and shared memories are arranged adjacently, and each shared memory can be read and written by the two reconfigurable processing element arrays connected to it.
2. The task-level parallel scheduling system for a dynamically reconfigurable processor according to claim 1, characterized in that the master controller executes the serial code in a program that is not suited to processing by the reconfigurable processing units, and is responsible for scheduling, starting, and running the plurality of reconfigurable processing units.
3. The task-level parallel scheduling system for a dynamically reconfigurable processor according to claim 1, characterized in that the reconfigurable processing units are responsible for the computation-intensive parallelizable code in the program.
4. The task-level parallel scheduling system for a dynamically reconfigurable processor according to claim 1, characterized in that the co-controller transfers the data and configuration information required for the computations of the reconfigurable processing element arrays, and controls the start, run, and termination of the arrays.
5. A task-level parallel scheduling method for a dynamically reconfigurable processor, characterized in that it comprises the following steps:
encapsulating the computation-intensive parallelizable code of the application program as kernel functions;
compiling the serial-section code and the parallel-section code separately to generate executable code suitable for the master controller and the reconfigurable processing units, respectively;
the master controller executing the serial-section code;
when kernel-function code is reached, the master controller scheduling and distributing the kernel-function code to reconfigurable processing units for processing.
6. The task-level parallel scheduling method for a dynamically reconfigurable processor according to claim 5, characterized in that the master controller's scheduling of the reconfigurable processing units is divided into two parallel modes, synchronous call and asynchronous call:
in a synchronous call, the master controller finds reconfigurable processing units that are not running, loads the executable code and configuration information, and suspends itself; in the synchronous call, multiple reconfigurable processing units are invoked, each processing a different data block; after all the reconfigurable processing units finish, the processing results are updated via the synchronous function's return value, and the master controller resumes executing serial code;
in an asynchronous call, the master controller finds reconfigurable processing units that are not running and, without interrupting itself, loads the executable code and configuration information and starts the units; the master controller continues running until it needs data returned by a reconfigurable processing unit, at which point it stops and waits for the unit to finish computing and return the data.
7. The task-level parallel scheduling method for a dynamically reconfigurable processor according to claim 5, characterized in that when the kernel function has few instructions, such that a single reconfigurable processing unit can complete the entire kernel's computation on its own, the multiple reconfigurable processing element arrays inside that unit execute the same configuration information in parallel, each array computing on the data in its own shared memory.
8. The task-level parallel scheduling method for a dynamically reconfigurable processor according to claim 5, characterized in that when the kernel function has many instructions, such that all its statements cannot be executed in one pass, the kernel function is divided into multiple subtasks of equal length, and the configuration information of the subtasks is distributed in order to multiple reconfigurable processing element arrays; because each array can read and write both the adjacent upper-layer and lower-layer shared memories, each shared memory is divided into equal-sized blocks A and B; during pipelined task execution, each array first reads data from block A of the upper-layer shared memory and writes results to block B of the lower-layer shared memory, then after that pass reads data from block B of the upper-layer shared memory and writes results to block A of the lower-layer shared memory; in parallel with these two passes, data is transferred to and from main memory using the portions of the first and last shared memories not involved in computation.
CN201510817591.6A 2015-11-23 2015-11-23 Task-level parallel scheduling method and system for a dynamically reconfigurable processor Active CN105487838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510817591.6A CN105487838B (en) 2015-11-23 2015-11-23 Task-level parallel scheduling method and system for a dynamically reconfigurable processor

Publications (2)

Publication Number Publication Date
CN105487838A true CN105487838A (en) 2016-04-13
CN105487838B CN105487838B (en) 2018-01-26

Family

ID=55674840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510817591.6A Active CN105487838B (en) 2015-11-23 2015-11-23 Task-level parallel scheduling method and system for a dynamically reconfigurable processor

Country Status (1)

Country Link
CN (1) CN105487838B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984677A (en) * 2014-05-30 2014-08-13 东南大学 Embedded reconfigurable system based on large-scale coarseness and processing method thereof
CN104134210A (en) * 2014-07-22 2014-11-05 兰州交通大学 2D-3D medical image parallel registration method based on combination similarity measure
US20140331025A1 (en) * 2013-05-03 2014-11-06 Samsung Electronics Co., Ltd. Reconfigurable processor and operation method thereof
CN105302525A (en) * 2015-10-16 2016-02-03 上海交通大学 Parallel processing method for reconfigurable processor with multilayer heterogeneous structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LOU, Jiechao et al.: "Design of an automatic task compiler framework for heterogeneous coarse-grained reconfigurable processors", Microelectronics & Computer (《微电子学与计算机》) *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105867994A (en) * 2016-04-20 2016-08-17 上海交通大学 Instruction scheduling optimization method for coarse-grained reconfigurable architecture complier
CN106095552A (en) * 2016-06-07 2016-11-09 华中科技大学 A kind of Multi-Task Graph processing method based on I/O duplicate removal and system
CN106095552B (en) * 2016-06-07 2019-06-28 华中科技大学 A kind of Multi-Task Graph processing method and system based on I/O duplicate removal
CN106648883A (en) * 2016-09-14 2017-05-10 上海鲲云信息科技有限公司 FPGA-based dynamic reconfigurable hardware acceleration method and system
CN106648883B (en) * 2016-09-14 2020-02-04 深圳鲲云信息科技有限公司 Dynamic reconfigurable hardware acceleration method and system based on FPGA
CN114168526B (en) * 2017-03-14 2024-01-12 珠海市芯动力科技有限公司 Reconfigurable parallel processing
CN114168526A (en) * 2017-03-14 2022-03-11 珠海市芯动力科技有限公司 Reconfigurable parallel processing
CN110275771B (en) * 2018-03-15 2021-12-14 中国移动通信集团有限公司 Service processing method, Internet of things charging infrastructure system and storage medium
CN110275771A (en) * 2018-03-15 2019-09-24 中国移动通信集团有限公司 A kind of method for processing business, Internet of Things billing infrastructure system and storage medium
CN109672524A (en) * 2018-12-12 2019-04-23 东南大学 SM3 algorithm wheel iteration system and alternative manner based on coarseness reconstruction structure
CN109672524B (en) * 2018-12-12 2021-08-20 东南大学 SM3 algorithm round iteration system and iteration method based on coarse-grained reconfigurable architecture
CN110059050B (en) * 2019-04-28 2023-07-25 北京美联东清科技有限公司 AI supercomputer based on high-performance reconfigurable elastic calculation
CN110096474A (en) * 2019-04-28 2019-08-06 北京超维度计算科技有限公司 A kind of high-performance elastic computing architecture and method based on Reconfigurable Computation
CN110059050A (en) * 2019-04-28 2019-07-26 北京超维度计算科技有限公司 AI supercomputer based on the restructural elastic calculation of high-performance
CN112579090A (en) * 2019-09-27 2021-03-30 无锡江南计算技术研究所 Asynchronous parallel I/O programming framework method under heterogeneous many-core architecture
CN110765046A (en) * 2019-11-07 2020-02-07 首都师范大学 DMA transmission device and method for dynamically reconfigurable high-speed serial bus
CN111897580B (en) * 2020-09-29 2021-01-12 北京清微智能科技有限公司 Instruction scheduling system and method for reconfigurable array processor
CN111897580A (en) * 2020-09-29 2020-11-06 北京清微智能科技有限公司 Instruction scheduling system and method for reconfigurable array processor
CN111930319A (en) * 2020-09-30 2020-11-13 北京清微智能科技有限公司 Data storage and reading method and system for multi-library memory
CN111930319B (en) * 2020-09-30 2021-09-03 北京清微智能科技有限公司 Data storage and reading method and system for multi-library memory
CN112463719A (en) * 2020-12-04 2021-03-09 上海交通大学 In-memory computing method realized based on coarse-grained reconfigurable array
CN112559441A (en) * 2020-12-11 2021-03-26 清华大学无锡应用技术研究院 Control method of digital signal processor
CN112540793A (en) * 2020-12-18 2021-03-23 清华大学 Reconfigurable processing unit array supporting multiple access modes and control method and device
CN112486908A (en) * 2020-12-18 2021-03-12 清华大学 Hierarchical multi-RPU multi-PEA reconfigurable processor
WO2022134426A1 (en) * 2020-12-23 2022-06-30 北京清微智能科技有限公司 Instruction distribution method and system in reconfigurable processor, and storage medium
CN112256632A (en) * 2020-12-23 2021-01-22 北京清微智能科技有限公司 Instruction distribution method and system in reconfigurable processor
CN113553031A (en) * 2021-06-04 2021-10-26 中国人民解放军战略支援部队信息工程大学 Software definition variable structure computing framework and left-right brain integrated resource joint distribution method realized by using same
CN113568731A (en) * 2021-09-24 2021-10-29 苏州浪潮智能科技有限公司 Task scheduling method, chip and electronic equipment

Also Published As

Publication number Publication date
CN105487838B (en) 2018-01-26

Similar Documents

Publication Publication Date Title
CN105487838A (en) Task-level parallel scheduling method and system for dynamically reconfigurable processor
EP3404587B1 (en) Cnn processing method and device
US10095657B2 (en) Processor, accelerator, and direct memory access controller within a core reading/writing local synchronization flag area for parallel
Prakash et al. Energy-efficient execution of data-parallel applications on heterogeneous mobile platforms
CN103279445A (en) Computing method and super-computing system for computing task
Agullo et al. Multifrontal QR factorization for multicore architectures over runtime systems
CN104375805A (en) Method for simulating parallel computation process of reconfigurable processor through multi-core processor
US8615770B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
CN107180010A (en) Heterogeneous computing system and method
CN101655828B (en) Design method for high efficiency super computing system based on task data flow drive
CN104657111A (en) Parallel computing method and device
CN114035916A (en) Method for compiling and scheduling calculation graph and related product
Madhu et al. Compiling HPC kernels for the REDEFINE CGRA
CN114970294A (en) Three-dimensional strain simulation PCG parallel optimization method and system based on Shenwei architecture
Segal et al. High level programming for heterogeneous architectures
Moustafa et al. 3D cartesian transport sweep for massively parallel architectures with PARSEC
US8959497B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
Inagaki et al. Performance evaluation of a 3d-stencil library for distributed memory array accelerators
CN110262884B (en) Running method for multi-program multi-data-stream partition parallel in core group based on Shenwei many-core processor
Silberstein GPUs: High-performance accelerators for parallel applications: the multicore transformation (ubiquity symposium)
CN107329818A Task scheduling processing method and device
Liu et al. Parallel implementation and optimization of regional ocean modeling system (ROMS) based on sunway SW26010 many-core processor
CN112148361B (en) Method and system for transplanting encryption algorithm of processor
Nikov et al. High-performance simultaneous multiprocessing for heterogeneous System-on-Chip
Taghiyev et al. Parallel matrix multiplication for various implementations

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant