CN101833438A - General data processing method based on multiple parallel - Google Patents

General data processing method based on multiple parallel

Info

Publication number
CN101833438A
CN101833438A
Authority
CN
China
Legal status
Pending
Application number
CN201010150549A
Other languages
Chinese (zh)
Inventor
许端清
杨鑫
赵磊
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201010150549A priority Critical patent/CN101833438A/en
Publication of CN101833438A publication Critical patent/CN101833438A/en
Pending legal-status Critical Current


Abstract

The invention discloses a general data processing method based on multiple forms of parallelism, comprising the steps of: (1) dividing an application program that performs data processing into a number of execution behaviors; (2) dividing all of the execution behaviors into a number of tasks according to the basic operations the behaviors perform on data; (3) dividing the data to be processed by the application program into static data and dynamic data; and (4) running computation tasks on the GPU (Graphics Processing Unit) and logic-judgement tasks on the CPU (Central Processing Unit) until execution of the application program finishes. The method is specially optimized for complex algorithms with dynamic execution behavior and irregular data structures. During processing it manages data dynamically according to the principle of memory locality and the SIMD (Single Instruction, Multiple Data) execution mechanism, so that the application program makes the fullest possible use of the hardware's computing and memory resources.

Description

A general data processing method based on multiple forms of parallelism
Technical field
The present invention relates to the field of parallel computing, and in particular to a general data-parallel processing method based on a heterogeneous multi-core architecture.
Background technology
With the rapid development of science and technology, high-performance computing has become a research method of strategic significance. Together with traditional theoretical study and laboratory experiment, it forms a set of complementary, interrelated research methods in modern science and engineering, often called the three "pillars" of 21st-century scientific research. High-performance computers are applied mainly in scientific research and development, telecommunications, finance and government, and their contribution to national development has been substantial; to accelerate the pace of informatization, a growing number of fields are adopting high-performance computing. High-performance computing greatly speeds up calculation and shortens development and production cycles; its application broadens research capability and promotes the development of modern science and engineering. Accelerating the development of high-performance computing is of great strategic importance for promoting China's capability for independent innovation in science and technology, enhancing national competitiveness, safeguarding national security, promoting economic development, and building an innovation-oriented country.
In the evolution of high-performance computing, minicomputers based on the RISC architecture once dominated the market; with the development of the x86 architecture, x86 machines, which held an overwhelming price advantage, eventually replaced minicomputers in the form of clusters. Although part of the demand for massive computation can be met by building distributed systems, distributed systems suffer from large communication overhead, high failure rates, complex and expensive data access structures, and data security and confidentiality that are difficult to control. With the rapid improvement in the computing power of processors, particularly the GPU (Graphics Processing Unit), and their low price, high-performance computing is gradually entering the desktop (low-end) field, so that every researcher, scientist and engineer can own a personal supercomputer, solve problems faster and accelerate the pace of scientific development. A current GPU contains hundreds of processing units, delivers about 1 TFLOPS of single-precision and over 80 GFLOPS of double-precision floating-point performance, and can have 4 GB of video memory with more than 100 GB/s of bandwidth. Although the GPU was originally a processor designed specifically for graphics computation, its powerful computing performance, low energy consumption, low price and small footprint make it particularly suitable for large-scale parallel computation, and it has rapidly appeared in many non-graphics areas of high-performance computing. Nowadays many important scientific and engineering projects are trying to add GPU computing power to their codes, and software engineers eagerly expect their work to gain remarkable performance from the GPU.
However, most existing application programs cannot obtain an immediate performance improvement when ported directly to the GPU, and may even suffer performance degradation. This is mainly because these programs and their structures were not designed for the architectural characteristics of the GPU and cannot exploit its full computing power. Using concurrent application programs to perform data processing efficiently is usually complex and time-consuming work.
Summary of the invention
The invention provides a data processing method that combines data parallelism, task parallelism and pipeline parallelism, so that an application program can use the computing and storage resources of the hardware as effectively as possible while processing data.
A general data processing method based on multiple forms of parallelism, carried out in a computer with a GPU and a CPU:
(1) The application program that performs data processing is divided into a number of execution behaviors.
Each execution behavior performs at least one basic operation on data, for example a data access or a data store, or a computation.
(2) According to the basic operation type and computation type of the execution behaviors, all execution behaviors are divided into several tasks; that is, execution behaviors that treat data similarly are placed in the same computation task.
Similar execution behaviors are those with identical computing operations or similar storage operations; a similar storage operation means that the accesses to data remain within a local range of the storage area.
This division satisfies the SIMD (Single Instruction, Multiple Data) execution characteristic of the hardware and the locality characteristic of storage access.
Each task completes a specified computation and should be as small and single-purposed as possible when divided; depending on the circumstances, tasks may execute in parallel or serially.
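Step (2) above amounts to grouping execution behaviors that perform the same kind of basic operation into one task. The following is a minimal illustrative sketch of that grouping; the behavior representation and the operation-type names (`compute`, `load`, `store`) are assumptions for illustration, not taken from the patent:

```python
from collections import defaultdict

def group_into_tasks(behaviors):
    """Group execution behaviors that share the same basic operation
    type into one task, as in step (2).  Each behavior is modeled as a
    (op_type, payload) pair."""
    tasks = defaultdict(list)
    for op_type, payload in behaviors:
        tasks[op_type].append(payload)
    return dict(tasks)

behaviors = [("compute", "a*b"), ("load", "x"),
             ("compute", "c+d"), ("store", "y")]
tasks = group_into_tasks(behaviors)
# tasks now holds one task per operation type, each short and single-purposed
```

Behaviors in one group can then run under a single SIMD kernel, while different groups may run in parallel or serially, as the text describes.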
(3) The data that the application program needs to process is divided into static data and dynamic data. A storage space (storage pool) is allocated in the video memory of the computer that runs the application program, and within it separate storage areas are allocated for static data and dynamic data: one part of the space stores static data and the remaining space stores dynamic data.
Static data is data that does not change during execution of the application program, and dynamic data is the new data produced during execution. All of this information is recorded in advance in a configuration file.
(4) According to the way they process data, the tasks of step (2) are divided into computation-type tasks and logic-judgement-type tasks; computation-type tasks run on the GPU and logic-judgement-type tasks run on the CPU. The invention adopts a multiple-parallel execution mode based on pipeline parallelism, data parallelism and task parallelism until execution of the application program is complete.
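The routing rule of step (4) can be stated in a few lines. This is an illustrative sketch only; the task representation and the `kind` field are hypothetical names, not from the patent:

```python
def dispatch(task):
    """Route a task to a processor by its kind, per step (4):
    computation-type tasks go to the GPU, logic-judgement-type
    tasks go to the CPU."""
    return "GPU" if task["kind"] == "compute" else "CPU"

placement = [dispatch(t) for t in (
    {"kind": "compute"}, {"kind": "logic"}, {"kind": "compute"})]
```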
A pipeline is a producer-consumer execution pattern that suits the computation flow of most application programs; by regrouping data, a pipeline balances the workload effectively and prevents one stage from producing too much output and unbalancing the load of the whole computation. The data-parallel execution model, such as CUDA (Compute Unified Device Architecture), the mainstream programming model, can flexibly exploit the SIMD characteristic of the hardware on large homogeneous data sets and hide memory-access latency. The task execution model is an extensible execution pattern that can express the dependence relations between the units of a program and their dynamic execution behaviors. To exploit the characteristics of the heterogeneous multi-core architecture fully and use hardware resources reasonably, the invention adopts a pipeline-parallel data processing mode: the application execution pipeline, which involves a large amount of computation, runs on the GPU, while the data/task scheduling pipeline, which involves a large amount of logic judgement, runs on the CPU. The two pipelines execute asynchronously in parallel, with the data/task scheduling pipeline running ahead of the application execution pipeline. This pipeline-parallel execution scheme guarantees the independence and concurrency of the computation while avoiding expensive synchronization operations such as atomics and locks.
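The two-pipeline scheme described above (a scheduling pipeline that sorts and packs work, running ahead of an execution pipeline that consumes it) can be illustrated with an ordinary producer-consumer sketch. The bounded queue and the toy "compute task" below are illustrative stand-ins, not the patent's implementation:

```python
import queue
import threading

work = queue.Queue(maxsize=4)   # bounded queue balances the load between stages
results = []

def scheduling_pipeline(items):
    """CPU-side pipeline: regroups (here, sorts) each batch before handing
    it to the execution pipeline, then sends a sentinel."""
    for batch in items:
        work.put(sorted(batch))
    work.put(None)

def execution_pipeline():
    """Stand-in for the GPU-side pipeline: consumes batches asynchronously."""
    while (batch := work.get()) is not None:
        results.append(sum(batch))   # placeholder for a computation task

sched = threading.Thread(target=scheduling_pipeline, args=([[3, 1], [2, 4]],))
execu = threading.Thread(target=execution_pipeline)
sched.start(); execu.start()
sched.join(); execu.join()
```

Because the queue decouples the two threads, the producer can run ahead of the consumer without atomics or explicit locks, mirroring the asynchronous-parallel behavior the text claims.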
Within the application execution pipeline, the execution behaviors inside one computation task execute in data-parallel fashion, while the execution behaviors of different computation tasks execute asynchronously in task-parallel fashion.
Some programs may produce unpredictable volumes of data and unbalance the load of the whole execution pipeline. Under the data-parallel execution pattern a task is likely to produce a large amount of new data that is difficult to store and use at the same time; the new data may also rapidly consume the limited video memory, and because its generation is random and unpredictable it complicates data management. Therefore, before a task executes in step (4), the method judges whether the size of the new data the task will produce exceeds the remaining space of the current storage pool; if it does, the data the task needs to process is grouped and handled in batches. This greatly reduces the burden that algorithms involving massive data place on system storage and bandwidth, ensures that the data in video memory is exactly the data the threads need for computation, further strengthens the parallel efficiency of the threads, and improves the effective use of the hardware.
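The pre-execution check and batching just described can be sketched as a simple planning function. The byte-counting model is an illustrative simplification (the patent notes the concrete estimation method depends on the application):

```python
def plan_batches(n_items, item_bytes, free_bytes):
    """Before running a task, estimate its output size; if the whole
    input would overflow the storage pool's free space, split it into
    batches that fit.  Returns the size of each batch."""
    per_batch = max(1, free_bytes // item_bytes)  # at least one item per batch
    return [min(per_batch, n_items - i) for i in range(0, n_items, per_batch)]

batches = plan_batches(10, 4, 16)   # 10 items of 4 bytes, 16 bytes free
```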
Because unpredictable new tasks may appear while tasks are running, the invention adopts dynamic priority-based scheduling and at the same time manages the transfer of the data corresponding to the scheduled tasks.
Every task (both the tasks of step (2) and new tasks that appear at run time) is given a priority state; when a new task appears, the task with the highest priority is selected to run according to the priority states of all tasks.
The priority of a task is measured mainly by the position of its required data in the memory hierarchy, the type of processor it requires, and the size of its required data set. Task scheduling is performed in a data-driven manner, according to the types of idle processors and the tasks corresponding to the static data currently in the storage pool. Specifically, priority is assigned from high to low according to the following principles:
(1) the task's execution does not require static data;
(2) the required data is in cache;
(3) the task possesses enough similar data, or the data it produces can help raise the execution priority of other tasks, or several tasks execute similarly;
(4) the required data is in GPU video memory;
(5) the required data is in CPU main memory;
(6) the required data must be transferred from hard disk to main memory;
(7) the required data set is too small to make full use of the hardware's computing power.
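One plausible encoding of the seven principles is an ordered rule list where a lower matching rule index means a higher priority. The task fields below (`needs_static_data`, `data_in`, and so on) are hypothetical names invented for this sketch:

```python
# Ordered rules: the first rule a task matches determines its rank
# (rank 0 = highest priority, following principles (1)-(7) above).
RULES = [
    lambda t: not t.get("needs_static_data", True),   # (1) no static data needed
    lambda t: t.get("data_in") == "cache",            # (2) data in cache
    lambda t: t.get("similar_data", False),           # (3) enough similar data
    lambda t: t.get("data_in") == "gpu_memory",       # (4) data in GPU memory
    lambda t: t.get("data_in") == "cpu_memory",       # (5) data in CPU memory
    lambda t: t.get("data_in") == "disk",             # (6) data on disk
    lambda t: t.get("dataset_too_small", False),      # (7) data set too small
]

def priority_rank(task):
    for rank, rule in enumerate(RULES):
        if rule(task):
            return rank
    return len(RULES)

def pick_next(tasks):
    """Select the highest-priority (lowest-rank) pending task."""
    return min(tasks, key=priority_rank)

tasks = [
    {"needs_static_data": True, "data_in": "disk"},
    {"needs_static_data": True, "data_in": "cache"},
]
best = pick_next(tasks)   # the cache-resident task wins
```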
The method is implemented on mature heterogeneous multi-core architectures, such as the Fermi architecture recently released by NVIDIA or the Larrabee architecture announced by Intel; such architectures generally offer more than 1 TFLOPS of floating-point capability, more than 20 cores, hundreds of hardware threads and a complex memory hierarchy.
The data processing method of the invention is specially optimized for complex algorithms with dynamic execution behavior and irregular data structures. During data processing it manages data dynamically according to the principle of memory locality and the SIMD execution mechanism, so that the application program uses the computing and storage resources of the hardware as effectively as possible. With this method, high-performance parallel application programs can be developed quickly and easily, which undoubtedly speeds up program development and saves development cost.
Description of drawings
Fig. 1 shows the performance of CUDA and of the model of the invention as scene complexity increases.
Embodiment
The feasibility of the invention was verified on a PC equipped with a 4-core Intel Xeon 3.7 GHz CPU and an Nvidia GTX285 (1 GB video memory). A DLL (dynamic link library) implementing the above method was realized on the basis of the PTX instruction set; a ray tracing algorithm from graphics, which exhibits a large amount of dynamic, irregular behavior, was redesigned and rewritten according to the proposed method, and the results were compared and analysed against code written with Nvidia's CUDA programming model, as follows.
The application program is divided into a number of computation tasks. To satisfy the SIMD/SIMT operation and local memory-access characteristics of the hardware, computations with similar execution behaviors or similar memory-access behaviors are encapsulated in one computation task; each computation task is as small and single-purposed as possible, and depending on the circumstances computation tasks may execute in parallel or serially. Inside a computation task the computation proceeds in data-parallel fashion, while between computation tasks it proceeds asynchronously in task-parallel fashion. Each computation task is given a state, in order to handle execution between computation tasks that may depend on one another.
According to the characteristics of the computation in the ray tracing algorithm, six computation tasks were created in the application execution pipeline, performing ray generation, acceleration-structure traversal, primitive intersection, shading and shadow computation, while ray sorting and ray-packet assembly are performed in the data/task scheduling pipeline. These tasks all parallelize well, i.e. they offer a wide SIMD execution width, but the recursive nature of rays is likely to make SIMD utilization drop sharply as recursion proceeds. In addition, a deferred-computation technique further improves SIMD utilization: if the shading task cannot produce enough rays after its computation to form a complete ray packet, the intersection computation is deferred until a complete ray packet has formed; similarly, if the intersection task cannot produce enough rays for shading, the shading computation is also deferred.
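The deferred-computation idea above — postpone a stage until a full ray packet is available so the SIMD lanes stay busy — can be sketched with a small buffer. `PACKET_SIZE` and the ray representation are illustrative assumptions, not values from the patent:

```python
PACKET_SIZE = 4   # hypothetical SIMD packet width

class PacketBuffer:
    """Collect rays emitted by one stage; dispatch downstream work only
    when a complete packet is available (the deferred-computation idea)."""
    def __init__(self):
        self.pending = []      # rays waiting for a full packet
        self.dispatched = []   # complete packets handed to the next stage

    def emit(self, ray):
        self.pending.append(ray)
        if len(self.pending) >= PACKET_SIZE:
            self.dispatched.append(self.pending[:PACKET_SIZE])
            del self.pending[:PACKET_SIZE]

buf = PacketBuffer()
for r in range(6):   # shading emits six rays; only one full packet forms
    buf.emit(r)
```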
Data is divided into static data and dynamic data: static data does not change during execution of the application program, while dynamic data is the continuously changing new data produced during execution. At initialization a storage pool is set up; according to the concrete application, a certain space in video memory is allocated for static data and the remaining space is occupied by dynamic data. All of this information is recorded in a configuration file.
The static data required by some application programs may exceed the size of the video memory, so static data may have to be scheduled dynamically during execution; since the size of each imported block is not necessarily identical to the previous one, fragments may arise in the static and dynamic data regions. To avoid fragmentation and use video memory effectively, a bidirectional allocation method is adopted in video memory: static data is stored at the low-address end of the storage pool and dynamic data at the high-address end.
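The bidirectional allocation scheme can be sketched as a pool with two bump pointers growing toward each other, so no fragment forms between the two regions. This is a hypothetical simplification (sizes in abstract units, no data scheduling or eviction):

```python
class StoragePool:
    """Bidirectional allocator: static data grows from the low end,
    dynamic data from the high end; the free space is always the single
    contiguous gap between the two pointers."""
    def __init__(self, size):
        self.low = 0       # next free low address (static data)
        self.high = size   # one past the last free high address (dynamic data)

    def alloc_static(self, n):
        if self.low + n > self.high:
            raise MemoryError("pool exhausted")
        addr, self.low = self.low, self.low + n
        return addr

    def alloc_dynamic(self, n):
        if self.high - n < self.low:
            raise MemoryError("pool exhausted")
        self.high -= n
        return self.high
```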
As mentioned above, to exploit the heterogeneous multi-core architecture fully and use hardware resources reasonably, the invention designs a pipeline-parallel execution scheme that combines the application execution pipeline with the data/task scheduling pipeline: the application execution pipeline, which involves a large amount of computation, runs on the GPU, and the data/task scheduling pipeline, which involves a large amount of logic judgement, runs on the CPU; the two pipelines execute asynchronously in parallel, with the scheduling pipeline running ahead of the application execution pipeline. This scheme guarantees the independence and concurrency of the computation while avoiding expensive synchronization operations such as atomics and locks.
In the implementation, the data/task scheduling pipeline is designed according to the following three principles: 1. accesses to static data should be kept, as far as possible, in the fastest layer of the hardware storage hierarchy (cache, shared memory, etc.), and at the same time each data access should be postponed as long as possible until it becomes unavoidable; 2. priority is given to tasks that possess enough similar data, whose produced data can help raise the execution priority of other tasks, or whose execution is similar to that of several other tasks; 3. task scheduling is data-driven, according to the types of idle processors and the tasks corresponding to the static data currently in the storage pool.
1) A data analyser is designed to control the use of data dynamically, to solve the problem that some programs may produce unpredictable volumes of data and unbalance the load of the whole execution pipeline. Under the data-parallel execution pattern a computation task is likely to produce a large amount of new data that is difficult to store and use at the same time; the new data may also rapidly consume the limited video memory, and because its generation is random and unpredictable it complicates data management. The data analyser therefore judges, before every execution of a computation task, whether the size of the new data to be produced exceeds the currently remaining video memory (the concrete estimation method depends on the application); once the judgement shows that the remaining capacity would be exceeded, the input data is grouped and processed in batches. This greatly reduces the burden that algorithms involving massive data place on system storage and bandwidth, ensures that the data in video memory is exactly the data the threads need for computation, further strengthens the parallel efficiency of the threads, and improves the effective use of the hardware.
A data buffer is established for every computation task to manage the data the task produces or consumes on each execution. Because data production and consumption in some complex algorithms is dynamic and irregular, these data must be reorganized so that, to satisfy the locality principle and the SIMD operating characteristic, computation concentrates as far as possible on local data sets and can continue to execute effectively on the hardware. The powerful bandwidth of current hardware and the powerful logic-processing capability of the CPU make dynamic data reorganization entirely feasible.
2) A task scheduler is designed to schedule the unpredictable task execution sequence dynamically on the basis of priority, managing at the same time the transfer of the data corresponding to the scheduled tasks. On-demand scheduling is adopted: when a processor becomes available, 1. a semaphore is set to pin the scheduler; 2. the whole pending task sequence is scanned, and the task with the highest priority is selected and marked; 3. the scheduler is released.
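The three-step dispatch just described (pin the scheduler with a semaphore, scan and mark the highest-priority task, release) can be sketched as follows; the task format and field names are hypothetical:

```python
import threading

sched_lock = threading.Semaphore(1)   # 1. the semaphore that pins the scheduler

def acquire_next(pending):
    """Scan the pending sequence, mark and return the highest-priority
    unmarked task; the semaphore is released automatically on exit."""
    with sched_lock:                                        # 1. pin
        ready = [t for t in pending if not t["marked"]]
        if not ready:
            return None
        best = max(ready, key=lambda t: t["priority"])      # 2. scan
        best["marked"] = True                               #    and mark
        return best                                         # 3. release

pending = [{"priority": 1, "marked": False},
           {"priority": 5, "marked": False}]
first = acquire_next(pending)   # the priority-5 task is selected first
```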
Priority determination is the core of this scheduler. Considering the characteristics of the mixed processing resources, the priority of a task is measured mainly by the position of its required data in the memory hierarchy, the type of processor it requires and the size of its required data set. Specifically, priority is assigned from high to low according to the following principles: (1) the task's execution does not require static data; (2) the required data is in cache; (3) the task possesses enough similar data, or the data it produces can help raise the execution priority of other tasks, or several tasks execute similarly; (4) the required data is in GPU video memory; (5) the required data is in CPU main memory; (6) the required data must be transferred from hard disk to main memory; (7) the required data set is too small to make full use of the hardware's computing power.
Test scenes of different geometric complexity were selected, with Bunny, Fairy and BART Kitchen as the test model files; Fairy is a dynamic scene with two reflection passes, and the drawing resolution is 1024*1024. Each scene was tested both with the method of the invention and with the CUDA programming model; the results are shown in Table 1, and the method of the invention obtains better performance than CUDA. The pipeline-parallel mechanism uses the hardware's computing and storage resources reasonably and balances the load of the processing cores through priority-based task scheduling.
Table 1

                CUDA    The inventive method
Bunny           9.3     11.1
Fairy           4.3     5.6
BART Kitchen    3.8     5.1

Table 1: frames drawn per second for the scenes Bunny, Fairy and BART Kitchen at 1024*1024 resolution, using CUDA and using this model respectively.
To verify how well the method uses the hardware in parallel, the utilization of the scalar processors was tested; this directly reflects whether the scheduling and organization of data and tasks can exploit the parallel execution capability of the algorithm on the hardware to the greatest extent. Note that ALU occupancy was not used as the test standard, because even when thread slots are occupied the ALUs may not be fully used, owing to memory-access latency or poor SIMD efficiency. As shown in Table 2, compared with the CUDA programming model, the method of the invention uses the computing resources of the GPU more effectively.
Table 2

                CUDA    The inventive method
Bunny           90%     90%
Fairy           72%     85%
BART Kitchen    69%     80%

Table 2: comparison of GPU utilization under CUDA and under this model.
To show that the method can organize data and schedule tasks dynamically in scenes of different complexity without load imbalance, Fig. 1 shows that the CUDA programming model exhibits obvious load imbalance and poor resource utilization when scene complexity increases sharply, finally causing performance degradation, while the method of the invention maintains comparatively stable performance throughout.

Claims (6)

1. A general data processing method based on multiple forms of parallelism, characterized in that, in a computer with a GPU and a CPU:
(1) the application program that performs data processing is divided into a number of execution behaviors;
(2) according to the similarity of the execution behaviors with respect to data or computation, all execution behaviors are divided into several tasks;
(3) the data that the application program needs to process is divided into static data and dynamic data; a storage space is allocated in the video memory of the computer that runs the application program, and within this storage space separate storage areas are allocated for static data and dynamic data;
(4) execution pipelines are established on the GPU and the CPU respectively; according to the way they process data, the tasks of step (2) are divided into computation-type tasks and logic-judgement-type tasks; computation-type tasks run on the GPU and logic-judgement-type tasks run on the CPU until execution of the application program is complete, wherein in the GPU pipeline the interior of each task runs in data-parallel fashion and the tasks run with respect to one another in task-parallel fashion.
2. The general data processing method according to claim 1, characterized in that each execution behavior performs at least one basic operation or one computing operation on data.
3. The data-parallel processing method according to claim 1, characterized in that the static data is data that does not change during execution of the application program, and the dynamic data is data produced during execution of the application program.
4. The data-parallel processing method according to claim 1, characterized in that execution pipelines are established on the GPU and the CPU respectively; according to the way they process data, the tasks are divided into computation-type tasks and logic-judgement-type tasks; computation-type tasks run on the GPU and logic-judgement-type tasks run on the CPU, and the two pipelines execute in parallel.
5. The data-parallel processing method according to claim 1, characterized in that, in step (4), the execution behaviors inside one task run in data-parallel fashion, and the execution behaviors of different tasks run asynchronously in parallel.
6. The data-parallel processing method according to claim 1, characterized in that every task is given a priority state, and when a new task appears the task with the highest priority is selected to run according to the priority states of all tasks.
CN201010150549A 2010-04-19 2010-04-19 General data processing method based on multiple parallel Pending CN101833438A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010150549A CN101833438A (en) 2010-04-19 2010-04-19 General data processing method based on multiple parallel

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010150549A CN101833438A (en) 2010-04-19 2010-04-19 General data processing method based on multiple parallel

Publications (1)

Publication Number Publication Date
CN101833438A true CN101833438A (en) 2010-09-15

Family

ID=42717518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010150549A Pending CN101833438A (en) 2010-04-19 2010-04-19 General data processing method based on multiple parallel

Country Status (1)

Country Link
CN (1) CN101833438A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567084A (en) * 2010-12-31 2012-07-11 新奥特(北京)视频技术有限公司 Multi-task parallel scheduling mechanism
CN103197976A (en) * 2013-04-11 2013-07-10 华为技术有限公司 Method and device for processing tasks of heterogeneous system
CN104040500A (en) * 2011-11-15 2014-09-10 英特尔公司 Scheduling thread execution based on thread affinity
CN104102476A (en) * 2014-08-04 2014-10-15 浪潮(北京)电子信息产业有限公司 High-dimensional data stream canonical correlation parallel computation method and high-dimensional data stream canonical correlation parallel computation device in irregular steam
CN104331271A (en) * 2014-11-18 2015-02-04 李桦 Parallel computing method and system for CFD (Computational Fluid Dynamics)
CN104699461A (en) * 2013-12-10 2015-06-10 Arm有限公司 Configuring thread scheduling on a multi-threaded data processing apparatus
CN102567084B (en) * 2010-12-31 2016-12-14 新奥特(北京)视频技术有限公司 A multi-task parallel scheduling mechanism
CN106537863A (en) * 2013-10-17 2017-03-22 马维尔国际贸易有限公司 Processing concurrency in a network device
CN106886503A (en) * 2017-02-08 2017-06-23 无锡十月中宸科技有限公司 heterogeneous system, data processing method and device
CN106941522A (en) * 2017-03-13 2017-07-11 广州五舟科技股份有限公司 Lightweight distributed computing platform and its data processing method
CN108595211A (en) * 2018-01-05 2018-09-28 百度在线网络技术(北京)有限公司 Method and apparatus for output data
CN110334049A (en) * 2019-07-02 2019-10-15 上海联影医疗科技有限公司 Data processing method, device, computer equipment and storage medium
CN110580527A (en) * 2018-06-08 2019-12-17 上海寒武纪信息科技有限公司 method and device for generating universal machine learning model and storage medium
CN110688327A (en) * 2019-09-30 2020-01-14 百度在线网络技术(北京)有限公司 Video memory management method and device, electronic equipment and computer readable storage medium
US11726754B2 (en) 2018-06-08 2023-08-15 Shanghai Cambricon Information Technology Co., Ltd. General machine learning model, and model file generation and parsing method


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1538296A (en) * 2003-02-18 2004-10-20 Multithreaded kernel for graphics processing unit
CN101091175A (en) * 2004-09-16 2007-12-19 NVIDIA Corporation Load balancing
CN101354780A (en) * 2007-07-26 2009-01-28 LG Electronics Inc. Graphic data processing apparatus and method
CN101526934A (en) * 2009-04-21 2009-09-09 Inspur Electronic Information Industry Co., Ltd. Construction method of GPU and CPU combined processor

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
TANG MING: "Processors: Expecting a Bumper Year", Computer World (《计算机世界》), 2010-01-11 *
LEI WANG, et al.: "Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment", International Conference on Computer Science and Information Technology 2008 *
ANONYMOUS: "CPU+GPU: Hybrid Processors Improve Internal Performance and Interconnect Efficiency", New PC (《新电脑》) *
TANG MING: "Processors: Expecting a Bumper Year", Computer World (《计算机世界》) *
QIAN YUE: "Applied Research on the CUDA Programming Model of Graphics Processors", Computer & Digital Engineering (《计算机与数字工程》) *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567084B (en) * 2010-12-31 2016-12-14 Xin'aote (Beijing) Video Technology Co., Ltd. Multi-task parallel scheduling mechanism
CN102567084A (en) * 2010-12-31 2012-07-11 Xin'aote (Beijing) Video Technology Co., Ltd. Multi-task parallel scheduling mechanism
CN104040500A (en) * 2011-11-15 2014-09-10 Intel Corporation Scheduling thread execution based on thread affinity
CN104040500B (en) * 2011-11-15 2018-03-30 Intel Corporation Scheduling thread execution based on thread affinity
CN103197976A (en) * 2013-04-11 2013-07-10 Huawei Technologies Co., Ltd. Method and device for processing tasks of heterogeneous system
CN106537863A (en) * 2013-10-17 2017-03-22 Marvell International Trade Ltd. Processing concurrency in a network device
CN104699461A (en) * 2013-12-10 2015-06-10 Arm Limited Configuring thread scheduling on a multi-threaded data processing apparatus
CN104699461B (en) * 2013-12-10 2019-04-05 Arm Limited Configuring thread scheduling on a multi-threaded data processing apparatus
US10733012B2 (en) 2013-12-10 2020-08-04 Arm Limited Configuring thread scheduling on a multi-threaded data processing apparatus
CN104102476A (en) * 2014-08-04 2014-10-15 Inspur (Beijing) Electronic Information Industry Co., Ltd. High-dimensional data stream canonical correlation parallel computation method and device for irregular streams
CN104331271A (en) * 2014-11-18 2015-02-04 Li Hua Parallel computing method and system for CFD (Computational Fluid Dynamics)
CN106886503A (en) * 2017-02-08 2017-06-23 Wuxi Shiyue Zhongchen Technology Co., Ltd. Heterogeneous system, data processing method and device
CN106941522A (en) * 2017-03-13 2017-07-11 Guangzhou Wuzhou Technology Co., Ltd. Lightweight distributed computing platform and data processing method thereof
CN106941522B (en) * 2017-03-13 2019-12-10 Guangzhou Wuzhou Technology Co., Ltd. Lightweight distributed computing platform and data processing method thereof
CN108595211A (en) * 2018-01-05 2018-09-28 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for outputting data
CN108595211B (en) * 2018-01-05 2021-11-26 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for outputting data
CN110580527B (en) * 2018-06-08 2022-12-02 Shanghai Cambricon Information Technology Co., Ltd. Method and device for generating universal machine learning model and storage medium
CN110580527A (en) * 2018-06-08 2019-12-17 Shanghai Cambricon Information Technology Co., Ltd. Method and device for generating universal machine learning model and storage medium
US11726754B2 (en) 2018-06-08 2023-08-15 Shanghai Cambricon Information Technology Co., Ltd. General machine learning model, and model file generation and parsing method
CN110334049A (en) * 2019-07-02 2019-10-15 Shanghai United Imaging Healthcare Co., Ltd. Data processing method, device, computer equipment and storage medium
CN110688327A (en) * 2019-09-30 2020-01-14 Baidu Online Network Technology (Beijing) Co., Ltd. Video memory management method and device, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN101833438A (en) General data processing method based on multiple parallel
CN102902512B (en) Multi-threaded parallel processing method based on multi-thread programming and message queues
US8990827B2 (en) Optimizing data warehousing applications for GPUs using dynamic stream scheduling and dispatch of fused and split kernels
CN102981807B (en) Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN103279445A (en) Computing method and supercomputing system for computing tasks
Li et al. Performance modeling in CUDA streams—A means for high-throughput data processing
CN101777007B (en) Parallel function simulation system for on-chip multi-core processor and method thereof
CN105487838A (en) Task-level parallel scheduling method and system for dynamically reconfigurable processor
CN102193830A (en) Many-core environment-oriented division mapping/reduction parallel programming model
CN101655828B (en) Design method for high efficiency super computing system based on task data flow drive
CN110297661A (en) Parallel computing method, system and medium based on AMP framework DSP operating system
CN101840329A (en) Data parallel processing method based on graph topological structure
Tan et al. Optimizing the LINPACK algorithm for large-scale PCIe-based CPU-GPU heterogeneous systems
Zhang et al. Comparison and analysis of GPGPU and parallel computing on multi-core CPU
Żurek et al. The comparison of parallel sorting algorithms implemented on different hardware platforms
Wang et al. Task scheduling of parallel processing in CPU-GPU collaborative environment
CN103810041A (en) Parallel computing method supporting dynamic scaling (compand)
Du et al. Feature-aware task scheduling on CPU-FPGA heterogeneous platforms
Chen et al. Integrated research of parallel computing: Status and future
CN111177979A (en) Fluid dynamics software GASFLOW optimization method based on OpenMP
CN112559032B (en) Many-core program reconstruction method based on circulation segment
CN102902511A (en) Parallel information processing system
Rashid A GPU accelerated parallel heuristic for the 2D knapsack problem with rectangular pieces
Zhong et al. Parallel multisets sorting using aperiodic multi-round distribution strategy on heterogeneous multi-core clusters
Maurya et al. An approach to parallel sorting using ternary search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Xu Duanqing
Inventor after: Yang Xin
Inventor after: Zhao Lei
Inventor after: Fang Yingming
Inventor before: Xu Duanqing
Inventor before: Yang Xin
Inventor before: Zhao Lei

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: XU DUANQING YANG XIN ZHAO LEI TO: XU DUANQING YANG XIN ZHAO LEI FANG YINGMING

C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20100915