CN104965689A - Hybrid parallel computing method and device for CPUs/GPUs - Google Patents

Hybrid parallel computing method and device for CPUs/GPUs

Info

Publication number: CN104965689A
Application number: CN201510264320.2A
Authority: CN (China)
Prior art keywords: task, waiting, gpu, computing node, waiting task
Other languages: Chinese (zh)
Inventor: 李清玉
Assignee: Inspur Electronic Information Industry Co Ltd
Legal status: Pending
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201510264320.2A (priority date 2015-05-22)
Publication of CN104965689A

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a CPU/GPU hybrid parallel computing method and device. The method comprises the following steps: building a computing cluster from more than one computing node, each computing node comprising a CPU and a GPU, and determining a scheduling policy; acquiring more than one pending task; caching the acquired pending tasks in a task queue; scheduling the pending tasks in the task queue to more than one computing node; in each computing node to which tasks are scheduled, having the CPU preprocess the scheduled tasks one by one and, each time a task is preprocessed, map it into the video memory of the GPU; and having the GPU compute the tasks mapped into video memory and return the computation results. The scheme increases the computing efficiency of the computing nodes.

Description

A CPU/GPU hybrid parallel computing method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a CPU/GPU hybrid parallel computing method and device.
Background technology
With the rapid development of computer technology, the scale of data to be processed keeps growing. To address the increasingly pressing problem of big-data processing, the MapReduce programming model has been proposed. MapReduce is a distributed programming model that conveniently distributes massive data sets across the nodes of a computing cluster so that multiple nodes process them cooperatively, thereby enabling fast processing of large data sets.
To further improve the computing performance of MapReduce, both academia and industry have carried out much related research. The emergence of the single GPU (Graphics Processing Unit) has brought a huge boost to system performance: a GPU contains up to hundreds of stream processing cores, and its computing performance exceeds the TFLOPS level, equivalent to a high-performance computing cluster, so it can realize fast computation over massive data.
However, a MapReduce programming model implemented with a single GPU alone still suffers from low computing efficiency.
Summary of the invention
In view of this, the invention provides a CPU/GPU hybrid parallel computing method and device to solve the problem of low computing efficiency in the prior art.
An embodiment of the invention provides a CPU/GPU hybrid parallel computing method, wherein more than one computing node is used to build a computing cluster, each computing node comprising a CPU and a GPU, and a scheduling policy is determined; the method further comprises:
acquiring more than one pending task;
caching the acquired pending tasks in a task queue;
scheduling, according to the scheduling policy, the pending tasks in the task queue to more than one computing node;
in each computing node to which pending tasks are scheduled, having the CPU preprocess the scheduled tasks one by one and, each time a task is preprocessed, map it into the video memory of the GPU;
having the GPU compute the tasks mapped into video memory and return the computation results.
Preferably, before the pending tasks in the task queue are scheduled to the computing nodes, the method further comprises:
traversing the pending tasks in the task queue; each time a pending task is visited, obtaining and recording its operation attribute; after the traversal of the task queue ends, merging pending tasks with the same operation attribute into a single task; grouping the merged tasks, and creating a hash index area according to the grouped tasks so that the grouped tasks are kept in the hash index area.
Preferably, the GPU computing the tasks mapped into video memory comprises:
dividing the task mapped into video memory into more than one task block, assigning a corresponding Map task to each task block, and distributing the Map task of each task block onto the SM (streaming multiprocessor) processors of the GPU, so that each SM processor performs the Map operation on its task block;
relocating the intermediate tasks within GPU video memory through a Shuffle operation, and aggregating the results of the Map stage in the Reduce stage.
Preferably, the method further comprises: presetting an access control list (ACL), the ACL comprising the correspondence between tasks and the users holding the right to operate them;
and, before acquiring the pending tasks, further comprises: determining, according to the ACL, whether the user submitting a pending task holds the operation right for that task and, if so, performing the operation of acquiring the task.
An embodiment of the invention further provides a CPU/GPU hybrid parallel computing device, wherein more than one computing node is used to build a computing cluster, each computing node comprising a CPU and a GPU, and a scheduling policy is determined; the device comprises:
a task cache module, configured to acquire more than one pending task and cache the acquired tasks in a task queue;
a task scheduling module, configured to schedule, according to the scheduling policy, the pending tasks in the task queue to more than one computing node;
a computing node, configured to, when pending tasks are scheduled to it, use its CPU to preprocess the scheduled tasks one by one and, each time a task is preprocessed, map it into the video memory of its GPU, and to use the GPU to compute the tasks mapped into video memory and return the computation results.
Preferably, the task cache module is configured to traverse the pending tasks in the task queue; each time a pending task is visited, obtain and record its operation attribute; after the traversal of the task queue ends, merge pending tasks with the same operation attribute into a single task; group the merged tasks, and create a hash index area according to the grouped tasks so that the grouped tasks are kept in the hash index area.
Preferably, the computing node is configured to divide the task mapped into video memory into more than one task block, assign a corresponding Map task to each task block, distribute the Map task of each task block onto the SM processors of the GPU so that each SM processor performs the Map operation on its task block, relocate the intermediate tasks within GPU video memory through a Shuffle operation, and aggregate the results of the Map stage in the Reduce stage.
Preferably, the device further comprises:
a security module, configured to determine, according to a preset access control list, whether the user submitting a pending task holds the operation right for that task and, if so, perform the operation of acquiring the task, the access control list comprising the correspondence between tasks and the users holding the right to operate them.
The embodiments of the invention provide a CPU/GPU hybrid parallel computing method and device in which the CPU and GPU are used in combination: the CPU preprocesses pending tasks and the GPU computes the preprocessed tasks, and while the GPU is computing, the CPU can continue to acquire and preprocess further pending tasks. CPU/GPU parallel computation is thus achieved, which not only expands the computing capability of the GPU but also improves the computing efficiency of the computing nodes.
Accompanying drawing explanation
Fig. 1 is a flowchart of the method provided by an embodiment of the invention;
Fig. 2 is a flowchart of the method provided by another embodiment of the invention;
Fig. 3 is a schematic structural diagram of the device provided by an embodiment of the invention.
Embodiment
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
An embodiment of the invention provides a CPU/GPU hybrid parallel computing method; referring to Fig. 1, more than one computing node is used to build a computing cluster, each computing node comprising a CPU and a GPU, and a scheduling policy is determined. The method may comprise the following steps:
Step 101: acquire more than one pending task.
Step 102: cache the acquired pending tasks in a task queue.
Step 103: schedule, according to the scheduling policy, the pending tasks in the task queue to more than one computing node.
Step 104: in each computing node to which pending tasks are scheduled, the CPU preprocesses the scheduled tasks one by one and, each time a task is preprocessed, maps it into the video memory of the GPU.
Step 105: the GPU computes the tasks mapped into video memory and returns the computation results.
In this scheme, the CPU and GPU are used in combination: the CPU preprocesses pending tasks and the GPU computes the preprocessed tasks, and while the GPU is computing, the CPU can continue to acquire and preprocess further tasks. CPU/GPU parallel computation is thus achieved, which not only expands the computing capability of the GPU but also improves the computing efficiency of the computing nodes.
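As an illustration of the overlap claimed in steps 101-105, the following is a minimal Java sketch; the Task type, preprocess() and gpuCompute() are hypothetical stand-ins (the GPU stage is stubbed on the CPU), and a hand-off queue stands in for the video-memory mapping. The point it shows is that the CPU thread can preprocess task i+1 while the GPU thread computes task i.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // CPU thread preprocesses tasks one by one and hands each result to the
    // GPU thread, so preprocessing of the next task overlaps GPU computation.
    public class HybridPipeline {
        record Task(String payload) {}

        private final BlockingQueue<Task> taskQueue = new ArrayBlockingQueue<>(64); // step 102
        private final BlockingQueue<Task> gpuQueue = new ArrayBlockingQueue<>(64);  // stands in for the video-memory hand-off

        private Task preprocess(Task t) { return new Task(t.payload().trim()); }    // CPU stage (step 104), placeholder work

        private String gpuCompute(Task t) { return "result(" + t.payload() + ")"; } // GPU stage (step 105), stubbed on the CPU

        public void run() throws InterruptedException {
            Thread cpu = new Thread(() -> {
                try { while (true) gpuQueue.put(preprocess(taskQueue.take())); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            Thread gpu = new Thread(() -> {
                try { while (true) System.out.println(gpuCompute(gpuQueue.take())); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            cpu.start();
            gpu.start();
            for (int i = 0; i < 4; i++) taskQueue.put(new Task(" task-" + i + " ")); // step 101
            Thread.sleep(200);                                                       // let the pipeline drain
            cpu.interrupt();
            gpu.interrupt();
        }

        public static void main(String[] args) throws InterruptedException {
            new HybridPipeline().run();
        }
    }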
To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
An embodiment of the invention provides a CPU/GPU hybrid parallel computing method; referring to Fig. 2, the method may comprise the following steps:
201: build a computing cluster from more than one computing node, each computing node comprising a CPU and a GPU.
In this embodiment, because the computing efficiency of a single GPU or a single CPU processing massive data is low, hybrid parallel computation using both the CPU and the GPU of each computing node can be considered to improve the computing efficiency over massive data.
Referring to Fig. 2, the CPU/GPU MapReduce hybrid parallel computing cluster provided by this embodiment comprises a task cache module, a task scheduling module, a security module, and more than one computing node, namely computing node 1, computing node 2, ..., computing node N, where each computing node contains a CPU, a GPU, main memory, and a local disk.
202: a user submits more than one pending task to the hybrid parallel computing cluster.
In this embodiment, the security module presets an ACL (Access Control List), which may contain the correspondence between tasks and the users holding the right to operate them. According to the settings in the ACL, the security module can allow or restrict a user submitting a pending task. For example, suppose the pending task submitted by user A is read task a: if the security module determines that user A is not among the users with the read right for task a in the ACL, it can send user A a prompt refusing the read of task a; if it determines that user A holds the operation right for reading task a in the ACL, it allows the pending task submitted by user A.
In a preferred embodiment of the invention, the security module can also be configured to allow or forbid users to view and modify tasks submitted by other users. The security module provides communication security protection, ensuring secure communication between users and the modules of the hybrid parallel computing cluster and preventing the leakage of sensitive data.
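A minimal sketch of the access check the security module could perform is shown below; the ACL structure (a map from a task right to the set of users holding it) and all names are illustrative assumptions, with the user and task mirroring the example above.

    import java.util.Map;
    import java.util.Set;

    // ACL maps a task right to the users holding it; submission of a task
    // is accepted only when the submitting user appears in that set.
    public class AclGate {
        private final Map<String, Set<String>> acl = Map.of(
                "read:a", Set.of("userB", "userC"));   // user A deliberately absent

        public boolean mayOperate(String user, String taskRight) {
            return acl.getOrDefault(taskRight, Set.of()).contains(user);
        }

        public static void main(String[] args) {
            AclGate gate = new AclGate();
            // Mirrors the example above: user A submits read task a and is refused.
            System.out.println(gate.mayOperate("userA", "read:a") ? "accept task" : "refuse read of task a");
            System.out.println(gate.mayOperate("userB", "read:a") ? "accept task" : "refuse read of task a");
        }
    }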
203: when the security module allows the pending tasks submitted by the user, the task cache module caches them in the task queue.
In this embodiment, the task cache module is added at the service access end to make full use of the network bandwidth and improve system efficiency. Tasks submitted by users are first kept in the task queue. Besides storing tasks, the task queue can also be traversed: each time a pending task is visited, its operation attribute is obtained and recorded; after the traversal of the task queue ends, similar or identical tasks are merged dynamically, adapting to scenarios in which users submit large numbers of similar or repeated tasks. Similar or identical tasks are tasks whose operation attributes are similar or the same, for example tasks that are all read operations, all reads of task a, all write operations, all access operations, or all open operations.
In this embodiment, a hash index area can also be created for the merged tasks: the merged tasks are grouped, and a hash index area is created according to the grouped tasks so that the grouped tasks are kept in the hash index area, as shown in the sketch below. Furthermore, through a cached-task compression design, the multiple subtasks contained in each pending task can be compressed into one pending task, reducing the computation load and improving bandwidth transfer efficiency.
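The merging and hash-index grouping of step 203 might look like the following minimal Java sketch; the Task type and the attribute strings are illustrative assumptions, and a HashMap stands in for the hash index area.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // One traversal of the task queue: record each task's operation attribute
    // and group tasks sharing an attribute; the resulting map is the hash
    // index area, keyed by that attribute.
    public class TaskMerger {
        record Task(String opAttribute, String data) {}

        static Map<String, List<Task>> mergeByAttribute(List<Task> queue) {
            Map<String, List<Task>> hashIndex = new HashMap<>();
            for (Task t : queue) {                               // visit pending tasks one by one
                hashIndex.computeIfAbsent(t.opAttribute(), k -> new ArrayList<>()).add(t);
            }
            return hashIndex;
        }

        public static void main(String[] args) {
            List<Task> queue = List.of(
                    new Task("read:a", "u1"), new Task("write", "u2"), new Task("read:a", "u3"));
            // Both read:a submissions land in one group and can be executed as one task.
            System.out.println(mergeByAttribute(queue));
        }
    }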
204: the task scheduling module schedules the pending tasks to the computing nodes according to the task scheduling policy.
In this embodiment, the task scheduling module can schedule the pending tasks taking into account factors such as memory surplus, task size, the current load of each computing node, and task priority; the task scheduling policy of this embodiment can be any strategy in the prior art (one example is sketched at the end of this step).
For example, the task scheduling module schedules the pending tasks to computing node 1, computing node 2, and computing node N respectively; refer to Fig. 2.
The message-mechanism-based task scheduling module supports real-time scheduling policies and dynamically computes the resources of every node to achieve optimal task scheduling; it is modeled on the MapReduce framework and provides a series of external interfaces so that developers can implement and deploy applications quickly; and it uses the basic Hadoop communication protocol to enhance overall extensibility.
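As one hypothetical example of such a strategy (the patent leaves the policy open), the following Java sketch picks, for each pending task, a node with enough free memory and the lightest current load; the Node and Task types and the scoring rule are assumptions for illustration only.

    import java.util.Comparator;
    import java.util.List;

    // Choose a target node: filter out nodes whose memory surplus cannot hold
    // the task, then prefer the node running the fewest tasks.
    public class Scheduler {
        record Node(String name, long freeMemoryMb, int runningTasks) {}
        record Task(String name, long sizeMb) {}

        static Node pickNode(List<Node> nodes, Task task) {
            return nodes.stream()
                    .filter(n -> n.freeMemoryMb() >= task.sizeMb())       // memory surplus check
                    .min(Comparator.comparingInt(Node::runningTasks))     // then lightest load
                    .orElseThrow(() -> new IllegalStateException("no node can hold " + task.name()));
        }

        public static void main(String[] args) {
            List<Node> cluster = List.of(
                    new Node("node1", 4096, 3), new Node("node2", 8192, 1), new Node("nodeN", 2048, 0));
            System.out.println(pickNode(cluster, new Task("t1", 3000)).name()); // -> node2
        }
    }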
205: in each computing node to which pending tasks are scheduled, the CPU preprocesses the scheduled tasks one by one and, each time a task is preprocessed, maps it into the video memory of the GPU.
In this embodiment, the CPU preprocesses each pending task in order to map the required data into GPU video memory.
206: the GPU computes the tasks mapped into video memory and returns the computation results.
In this embodiment, the GPU uses the MapReduce computation module to compute the tasks mapped into video memory as follows:
A: divide the task mapped into video memory into more than one task block, assign a corresponding Map task to each task block, and distribute the Map task of each task block onto the SM processors of the GPU, so that each SM processor performs the Map operation on its task block.
Each SM processor performs the Map operation on its task block as follows:
First, a set of Java annotation comments (similar to OpenMP directives) of the form "// #gmp parallel for" is designed and added in Hadoop. These annotations are used inside Map functions so that programmers can mark the code they wish to run on the GPU.
Then, the source code containing the Java annotations is compiled, yielding Java bytecode that still carries the annotations.
Next, a new Java class loader, named GPUClassLoader, is designed on the basis of the traditional Java class loader. The GPUClassLoader can identify the annotated parts of the Java bytecode (i.e. the parts that need to run on the GPU); it is deployed on each computing node.
Then, the GPUClassLoader automatically detects the local computing environment and checks whether it is available; this environment can be CUDA (Compute Unified Device Architecture). If it is unavailable, computation proceeds directly on the CPU; if it is available, the concrete CUDA version is detected and the annotated code sections (i.e. the parts that need to run on the GPU) are identified.
For the annotated parts of the identified Java bytecode, the GPUClassLoader generates corresponding CUDA code, comprising a kernel function section and a runtime section, and compiles both. The compiled CUDA code is invoked through JNI, the related data is copied to GPU video memory, and the CUDA code runs on the GPU. With JNI, the call succeeds only when the code section satisfies certain independence conditions and GPU resources are available in the computing environment; otherwise an error prompt is issued.
After the GPU computation ends, the results of the CUDA code are copied back to local main memory, where the Map function obtains them. The unmarked code sections of the Map function run on the CPU.
Afterwards, the scheduling node tracks the running state of all Map tasks and reruns any failed Map task; once all Map tasks complete, the Map process ends.
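The annotate-and-fall-back mechanism described above might be sketched as follows. The "// #gmp parallel for" marker is taken from the text, but the gpuAvailable flag and the native entry point are reconstructions, not a real Hadoop or CUDA API: a real implementation would generate and compile CUDA for the annotated region and invoke it through JNI, while unannotated code stays on the CPU.

    // Minimal sketch of a Map function carrying the GPU annotation.
    public class AnnotatedMapper {

        // Hypothetical JNI bridge to the CUDA code the class loader would generate.
        private static native int[] squareOnGpu(int[] values);

        static boolean gpuAvailable = false;   // would be set by GPUClassLoader after probing CUDA

        public static int[] map(int[] values) {
            if (gpuAvailable) {
                return squareOnGpu(values);    // compiled CUDA code runs on the GPU via JNI
            }
            // #gmp parallel for              <- region the loader would offload when CUDA exists
            int[] out = new int[values.length];
            for (int i = 0; i < values.length; i++) {
                out[i] = values[i] * values[i];
            }
            return out;                        // unannotated path: runs on the CPU
        }

        public static void main(String[] args) {
            int[] r = map(new int[]{1, 2, 3});
            System.out.println(r[0] + "," + r[1] + "," + r[2]);   // 1,4,9
        }
    }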
B: relocate the intermediate tasks within GPU video memory through a Shuffle operation, providing intermediate results for the ensuing Reduce operation.
C: aggregate the results of the Map stage in the Reduce stage; the aggregated Map-stage results are returned to main memory as streaming data, and the CPU issues the instructions that write them to network I/O.
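A minimal Java sketch of steps B and C is given below, with an in-memory list standing in for GPU video memory and illustrative key/value types: Shuffle groups and orders the Map output by key, and Reduce aggregates each group before the result streams back.

    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Shuffle: relocate/sort Map output by key; Reduce: aggregate each group.
    public class ShuffleReduce {
        record KV(String key, long value) {}

        static Map<String, Long> shuffleAndReduce(List<KV> mapOutput) {
            Map<String, Long> grouped = new TreeMap<>();          // key-ordered, as after Shuffle
            for (KV kv : mapOutput) {
                grouped.merge(kv.key(), kv.value(), Long::sum);   // Reduce-stage aggregation
            }
            return grouped;
        }

        public static void main(String[] args) {
            List<KV> mapOutput = List.of(new KV("a", 1), new KV("b", 2), new KV("a", 3));
            System.out.println(shuffleAndReduce(mapOutput));      // {a=4, b=2}
        }
    }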
207: the computing cluster returns the computation results to the client.
An embodiment of the invention provides a CPU/GPU hybrid parallel computing device; referring to Fig. 3, more than one computing node is used to build a computing cluster, each computing node comprising a CPU and a GPU, and a scheduling policy is determined. The device comprises:
a task cache module 301, configured to acquire more than one pending task and cache the acquired tasks in a task queue;
a task scheduling module 302, configured to schedule, according to the scheduling policy, the pending tasks in the task queue to more than one computing node;
a computing node 303, configured to, when pending tasks are scheduled to it, use its CPU to preprocess the scheduled tasks one by one and, each time a task is preprocessed, map it into the video memory of its GPU, and to use the GPU to compute the tasks mapped into video memory and return the computation results.
Further, the task cache module 301 is configured to traverse the pending tasks in the task queue; each time a pending task is visited, obtain and record its operation attribute; after the traversal of the task queue ends, merge pending tasks with the same operation attribute into a single task; group the merged tasks, and create a hash index area according to the grouped tasks so that the grouped tasks are kept in the hash index area.
Further, the computing node 303 is configured to divide the task mapped into video memory into more than one task block, assign a corresponding Map task to each task block, distribute the Map task of each task block onto the SM processors of the GPU so that each SM processor performs the Map operation on its task block, relocate the intermediate tasks within GPU video memory through a Shuffle operation, and aggregate the results of the Map stage in the Reduce stage.
The device further comprises:
a security module 304, configured to determine, according to a preset access control list, whether the user submitting a pending task holds the operation right for that task and, if so, perform the operation of acquiring the task, the access control list comprising the correspondence between tasks and the users holding the right to operate them.
In summary, the embodiments of the invention can achieve at least the following beneficial effects:
1. By using the CPU and GPU in combination, the CPU preprocesses pending tasks and the GPU computes the preprocessed tasks, and while the GPU is computing, the CPU can continue to acquire and preprocess further tasks. CPU/GPU parallel computation is thus achieved, which not only expands the computing capability of the GPU but also improves the computing efficiency of the computing nodes.
2. The task cache module is added to make full use of the network bandwidth and improve system efficiency. Tasks submitted by users are first kept in the task queue. Besides storing tasks, the task queue dynamically merges similar tasks, adapting to scenarios in which users submit large numbers of similar or repeated tasks. Similar tasks are grouped by building a hash index area, and a cached-task compression design reduces the computation load and improves bandwidth transfer efficiency.
3. By integrating and transforming Mars, the single-machine MapReduce computing framework that supports only GPUs, and having the CPU preprocess and map the required data into GPU video memory, the computing capability of a single node is expanded, realizing a MapReduce hybrid parallel computing framework that supports both GPU and CPU. This framework combines the advantages of complex CPU scheduling and GPU parallel computing, suits compute-intensive applications, exploits the strengths of large-scale GPU clusters to significantly improve the computing performance and efficiency of MapReduce, and enables the transparent and efficient development of parallel applications.
As for the information interaction and implementation details between the units of the above device, since they are based on the same concept as the method embodiments of the invention, the specific content can be found in the description of the method embodiments and is not repeated here.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relation or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or device that comprises it.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be implemented by hardware instructed by a program; the program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disk, or optical disk.
Finally, it should be noted that the above are only preferred embodiments of the invention, intended solely to illustrate its technical solutions, not to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention is included in the protection scope of the invention.

Claims (8)

1. A CPU/GPU hybrid parallel computing method, characterized in that more than one computing node is used to build a computing cluster, each computing node comprising a CPU and a GPU, and a scheduling policy is determined; the method further comprises:
acquiring more than one pending task;
caching the acquired pending tasks in a task queue;
scheduling, according to the scheduling policy, the pending tasks in the task queue to more than one computing node;
in each computing node to which pending tasks are scheduled, having the CPU preprocess the scheduled tasks one by one and, each time a task is preprocessed, map it into the video memory of the GPU;
having the GPU compute the tasks mapped into video memory and return the computation results.
2. The method according to claim 1, characterized in that, before the pending tasks in the task queue are scheduled to the computing nodes, the method further comprises:
traversing the pending tasks in the task queue; each time a pending task is visited, obtaining and recording its operation attribute; after the traversal of the task queue ends, merging pending tasks with the same operation attribute into a single task; grouping the merged tasks, and creating a hash index area according to the grouped tasks so that the grouped tasks are kept in the hash index area.
3. The method according to claim 1, characterized in that the GPU computing the tasks mapped into video memory comprises:
dividing the task mapped into video memory into more than one task block, assigning a corresponding Map task to each task block, and distributing the Map task of each task block onto the SM processors of the GPU, so that each SM processor performs the Map operation on its task block;
relocating the intermediate tasks within GPU video memory through a Shuffle operation, and aggregating the results of the Map stage in the Reduce stage.
4. The method according to any one of claims 1-3, characterized in that
it further comprises: presetting an access control list, the access control list comprising the correspondence between tasks and the users holding the right to operate them;
and, before acquiring the pending tasks, further comprises: determining, according to the access control list, whether the user submitting a pending task holds the operation right for that task and, if so, performing the operation of acquiring the task.
5. A CPU/GPU hybrid parallel computing device, characterized in that more than one computing node is used to build a computing cluster, each computing node comprising a CPU and a GPU, and a scheduling policy is determined; the device comprises:
a task cache module, configured to acquire more than one pending task and cache the acquired tasks in a task queue;
a task scheduling module, configured to schedule, according to the scheduling policy, the pending tasks in the task queue to more than one computing node;
a computing node, configured to, when pending tasks are scheduled to it, use its CPU to preprocess the scheduled tasks one by one and, each time a task is preprocessed, map it into the video memory of its GPU, and to use the GPU to compute the tasks mapped into video memory and return the computation results.
6. The device according to claim 5, characterized in that
the task cache module is configured to traverse the pending tasks in the task queue; each time a pending task is visited, obtain and record its operation attribute; after the traversal of the task queue ends, merge pending tasks with the same operation attribute into a single task; group the merged tasks, and create a hash index area according to the grouped tasks so that the grouped tasks are kept in the hash index area.
7. The device according to claim 5, characterized in that the computing node is configured to divide the task mapped into video memory into more than one task block, assign a corresponding Map task to each task block, distribute the Map task of each task block onto the SM processors of the GPU so that each SM processor performs the Map operation on its task block, relocate the intermediate tasks within GPU video memory through a Shuffle operation, and aggregate the results of the Map stage in the Reduce stage.
8. The device according to any one of claims 5-7, characterized in that it further comprises:
a security module, configured to determine, according to a preset access control list, whether the user submitting a pending task holds the operation right for that task and, if so, perform the operation of acquiring the task, the access control list comprising the correspondence between tasks and the users holding the right to operate them.
CN201510264320.2A 2015-05-22 2015-05-22 Hybrid parallel computing method and device for CPUs/GPUs Pending CN104965689A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510264320.2A CN104965689A (en) 2015-05-22 2015-05-22 Hybrid parallel computing method and device for CPUs/GPUs

Publications (1)

Publication Number Publication Date
CN104965689A true CN104965689A (en) 2015-10-07

Family

ID=54219724

Country Status (1)

Country Link
CN (1) CN104965689A (en)



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20151007)