CN107766148B - Heterogeneous cluster and task processing method and device

Heterogeneous cluster and task processing method and device

Info

Publication number
CN107766148B
Authority
CN
China
Prior art keywords
task
training
computing
cluster
computing nodes
Prior art date
Legal status
Active
Application number
CN201710775681.2A
Other languages
Chinese (zh)
Other versions
CN107766148A (en)
Inventor
温圣召
周汉清
刘传秀
张家军
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710775681.2A priority Critical patent/CN107766148B/en
Publication of CN107766148A publication Critical patent/CN107766148A/en
Application granted granted Critical
Publication of CN107766148B publication Critical patent/CN107766148B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention discloses a heterogeneous cluster and a task processing method and device. The heterogeneous cluster comprises a plurality of computing nodes, a cluster resource management module, an Ethernet switch and an IB network switch. The computing nodes are used for performing parallel computing, and each computing node comprises a GPU and a CPU. The cluster resource management module is used for allocating the computing nodes in the cluster and the GPU resources on the computing nodes. Each computing node and the cluster resource management module are interconnected through the Ethernet switch, and the computing nodes are interconnected with one another through the IB network switch. With this scheme, the heterogeneous cluster can be used for parallel computation, the acceleration ratio is improved, and the execution efficiency of deep learning tasks is improved.

Description

Heterogeneous cluster and task processing method and device
[Technical Field]
The invention relates to computer application technology, and in particular to a heterogeneous cluster and a task processing method and device.
[Background of the Invention]
With the development of big data and deep learning technology, training and learning on massive data with deep learning methods to obtain an accurate cognitive model has received increasing attention from technology companies. More complex and powerful deep models can reveal the complex and rich information carried in massive data and make more accurate predictions about future or unknown events. Deep learning applications include speech recognition, image recognition, natural language processing, search-advertising CTR prediction and the like; the computation involved is enormous, and the large number of computing tasks becomes the bottleneck restricting model training in deep learning.
If a traditional single-CPU scheme is adopted for model training, the required computation time can reach months or even years, which seriously hinders the application of deep learning technology to various businesses. A typical case is picture distribution, where a set of picture perception models needs to be learned with a deep learning scheme; trained on a traditional single-CPU machine, the theoretical training time would be about 100 years.
At present, CPU clusters are commonly used to accelerate deep learning applications. With CPU-cluster acceleration, the acceleration ratio can be increased by enlarging the cluster, but in the deep learning field, as the cluster grows, the amount of data that must be communicated for synchronization grows as well, communication takes an ever heavier share of the time, and the achievable gain in acceleration ratio is limited.
[Summary of the Invention]
Various aspects of the application provide a heterogeneous cluster and a task processing method and device, which can perform parallel computation by using the heterogeneous cluster and improve the acceleration ratio.
In one aspect of the present application, a heterogeneous cluster is provided, including:
the system comprises a plurality of computing nodes, a cluster resource management module, an Ethernet switch and an IB network switch; wherein:
the computing nodes are used for performing parallel computing; each computing node comprises a GPU and a CPU;
the cluster resource management module is used for distributing the computing nodes in the cluster and the GPU resources on the computing nodes;
each computing node and the cluster resource management module are interconnected through an Ethernet switch;
the compute nodes are interconnected by IB network switches.
The above aspect and any possible implementation further provide an implementation in which each computing node employs dual CPUs and multiple GPUs, and the CPU in each computing node communicates with the GPUs through a PCIE interface.
The above aspect and any possible implementation further provide an implementation in which the CPU is configured to perform logic control over the GPUs in the computing node where it is located and to control the computation that needs to be executed on the GPUs;
the GPUs are used to perform the parallel computation that needs to be accelerated.
The above aspect and any possible implementation further provide an implementation in which the CPU of each computing node is interconnected with the CPUs of the other computing nodes and with the cluster resource management module through the Ethernet switch, and the GPUs of each computing node are interconnected with the GPUs of the other computing nodes through the IB network switch.
The above aspect and any possible implementation further provide an implementation in which the CPU of each computing node and the cluster resource management module are connected to an external shared memory through the Ethernet switch.
The above aspect and any possible implementation further provide an implementation in which the cluster resource management module is specifically configured to:
schedule computing nodes in the heterogeneous cluster and distribute the task; and
allocate GPU resources on the scheduled computing nodes and bind the GPU resources with the task.
In another aspect of the present invention, a task processing method applied to the heterogeneous cluster is provided, including:
receiving a task;
allocating computing nodes and GPU resources on the computing nodes for the tasks;
and triggering each computing node to execute the task.
The above aspect and any possible implementation further provide an implementation in which receiving a task includes: receiving a task input by a user through a front-end page.
The above aspect and any possible implementation further provide an implementation in which allocating a computing node and resources on the computing node for the task includes:
scheduling computing nodes in the heterogeneous cluster and distributing the task; and
allocating GPU resources on the scheduled computing nodes and binding the GPU resources with the task.
The above aspect and any possible implementation further provide an implementation in which triggering each computing node to execute the task includes:
generating, according to the task, a task instruction instructing the scheduled computing nodes to perform parallel computing, and triggering each computing node to acquire data for the task from the shared memory, perform the parallel computing, and store the execution result data of the task in the shared memory.
In another aspect of the present invention, a task processing device applied to the above heterogeneous cluster is provided, including:
the receiving module is used for receiving the task;
the scheduling module is used for distributing the computing nodes and the resources on the computing nodes for the tasks;
the execution module is used for triggering each computing node to execute the task;
wherein the heterogeneous cluster is the heterogeneous cluster described above.
The above aspect and any possible implementation further provide an implementation in which the receiving module is specifically configured to:
receive a task input by a user through a front-end page.
The above aspect and any possible implementation further provide an implementation in which the scheduling module is specifically configured to:
schedule computing nodes in the heterogeneous cluster and distribute the task; and
allocate GPU resources on the scheduled computing nodes and bind the GPU resources with the task.
The above aspect and any possible implementation further provide an implementation in which the execution module is specifically configured to:
generate, according to the task, a task instruction instructing the scheduled computing nodes to perform parallel computing, and trigger each computing node to acquire data for the task from the shared memory, perform the parallel computing, and store the execution result data of the task in the shared memory.
In another aspect of the present invention, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
In another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method as set forth above.
As can be seen from the above, with the scheme of the invention, deep learning applications can be accelerated by means of the heterogeneous cluster, and the execution efficiency of deep learning tasks is improved.
[Description of the Drawings]
FIG. 1 is a block diagram of a heterogeneous cluster according to the present invention;
FIG. 2 is a flowchart of a task processing method of the heterogeneous cluster according to the present invention;
FIG. 3 is a block diagram of a task processing device of a heterogeneous cluster according to the present invention;
FIG. 4 illustrates a block diagram of an exemplary computer system/server 012 suitable for use in implementing embodiments of the invention.
[Detailed Description of Embodiments]
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic structural diagram of a heterogeneous cluster according to the present invention. As shown in Fig. 1, the cluster includes a plurality of computing nodes, a cluster resource management module, an Ethernet switch, and an IB (InfiniBand) network switch, wherein:
the computing node is used for carrying out deep learning application; each computing node comprises a GPU and a CPU; each compute node is connected to all other compute nodes.
Preferably, the computing nodes adopt a double-path CPU and a multi-path GPU; the number of GPU cards in each computing node is at least 2; and the CPU in each computing node communicates with the GPU through a PCIE interface.
Preferably, each computing node further comprises an Ethernet card and an IB card; the CPU of each computing node is interconnected with the shared memory through the Ethernet switch; and the GPUs of each computing node are interconnected through the IB network switch.
The cluster resource management module is used for allocating the computing nodes in the cluster and the CPU and GPU resources on the computing nodes, which comprises: scheduling the computing nodes in the cluster and distributing the deep learning task; and allocating GPU resources on the scheduled computing nodes and binding the GPU resources with the deep learning task. The cluster resource management module is connected with the computing nodes and the shared memory through the Ethernet switch.
The Ethernet switch is used for connecting the CPUs of all the computing nodes in the cluster, realizing data interaction between each computing node and the outside, for example, downloading data from the shared memory to the computing nodes and uploading data to the shared memory.
The IB network switch is used for connecting the GPUs of all the computing nodes in the cluster and realizing data interaction among those GPUs; the IB network switch is connected only to the computing nodes inside the cluster and not to any device outside the cluster.
In a preferred implementation of this embodiment,
each computing node employs dual low-clock-frequency CPUs, for example Intel(R) Xeon E5-2620 CPUs, with 2 CPUs per computing node; the CPUs are used to perform logic control over the GPUs in the computing node and to control the computation that needs to be executed on the GPUs;
each computing node employs multiple GPUs, for example NVIDIA Tesla K40 GPU accelerator cards, with 2 or more GPU accelerator cards installed in each computing node; the GPUs are used to execute the parallel computation that needs to be accelerated in the deep learning process;
the CPUs in each computing node communicate with the GPUs via the PCIE standard;
each computing node can be configured with a 10G Ethernet card, through which the CPUs are connected to the external shared memory for data interaction with it;
each computing node can be configured with a 40G RDMA network card, which connects the GPUs to the GPUs of the other computing nodes in the cluster via the IB network.
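As a non-limiting illustration of the node configuration described above, the following Python sketch models one compute node's hardware profile (dual Xeon E5-2620 CPUs, at least two Tesla K40 accelerator cards, PCIE inside the node, a 10G Ethernet NIC and a 40G RDMA/IB NIC); the class and field names are assumptions chosen for illustration and are not part of the patented scheme.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ComputeNodeConfig:
    """Hardware profile of one compute node as described above (illustrative only)."""
    hostname: str
    cpus: List[str] = field(default_factory=lambda: ["Intel Xeon E5-2620"] * 2)  # dual low-clock CPUs
    gpus: List[str] = field(default_factory=lambda: ["NVIDIA Tesla K40"] * 2)    # at least 2 GPU cards
    gpu_interconnect: str = "PCIE"   # CPU <-> GPU link inside the node
    ethernet_nic_gbps: int = 10      # 10G Ethernet card: CPU <-> shared memory / management
    ib_nic_gbps: int = 40            # 40G RDMA card: GPU <-> GPU across nodes via the IB switch

# Example: a small heterogeneous cluster built from such nodes.
cluster_nodes = [ComputeNodeConfig(hostname=f"node{i:02d}") for i in range(4)]
```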
The GPU heterogeneous cluster can be configured with more than 100 GPU accelerator cards for large-scale computation. The GPU heterogeneous cluster handles comprehensive offline training tasks, covering speech, image, natural language, and other deep learning applications; each computing node can be deployed with a deep learning computing framework such as Caffe, Torch, Theano, Cuda-Convnet, KALDI, or another deep learning framework, and each deep learning framework can be used for speech training, image training, natural language training, advertisement training, or other training.
In a preferred implementation of the embodiment of the invention,
the cluster resource management module is directly connected with all the computing nodes in the cluster and is used for optimally allocating the computing nodes in the cluster and the GPU resources on the computing nodes according to the user's requirement configuration; specifically:
the cluster resource management module allocates the computing nodes in the cluster and the resources on them according to the received deep learning task, which includes:
Node allocation: allocating, according to the deep learning task, the computing nodes in the cluster that will execute the deep learning task, and starting the subtasks corresponding to the deep learning task on the allocated computing nodes respectively.
Resource allocation: allocating GPU resources on the allocated computing nodes according to the subtasks started on them and binding the GPU resources with the subtasks; the resources remaining on a computing node can be allocated to other subtasks when the resource requests of other tasks are served.
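A minimal sketch of the two allocation steps above, assuming a simple in-memory view of the free GPUs per node; the function name, task fields, and data structures are illustrative assumptions rather than the module's actual interface.

```python
def allocate_task(task, nodes, free_gpus):
    """Sketch of node allocation followed by resource allocation and binding.
    `task` carries the requested node count and GPUs per subtask;
    `free_gpus` maps hostname -> list of idle GPU ids (all names illustrative)."""
    # 1. Node allocation: pick as many nodes as the deep learning task requests.
    chosen = [h for h in nodes
              if len(free_gpus[h]) >= task["gpus_per_subtask"]][: task["num_nodes"]]
    if len(chosen) < task["num_nodes"]:
        raise RuntimeError("not enough compute nodes with free GPU resources")

    # 2. Resource allocation: bind GPUs on each chosen node to the subtask started there;
    #    whatever remains on the node stays available for other tasks' requests.
    bindings = {}
    for host in chosen:
        gpus = [free_gpus[host].pop() for _ in range(task["gpus_per_subtask"])]
        bindings[host] = {"subtask": f"{task['name']}-{host}", "gpus": gpus}
    return bindings
```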
The Ethernet switch adopts a 10G Ethernet switch for connecting the CPUs of the computing nodes in the cluster, realizing data interaction between the computing nodes and the external shared memory, for example downloading data from the shared memory to the computing nodes and uploading data to the shared memory. The shared memory is a parallel distributed storage system and supports multi-threaded parallel reading and writing. The training data for the deep learning task in the shared memory is constructed as an RDD (Resilient Distributed Dataset) object.
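The patent names RDD objects but no concrete framework. Assuming a PySpark environment and a hypothetical shared-storage path, exposing the training data as an RDD for parallel reads might look like the following sketch.

```python
from pyspark import SparkContext

# Illustrative only: the application name and the HDFS-style path are placeholders.
sc = SparkContext(appName="deep-learning-training-data")

# Parallel, distributed read of the training data held in the shared storage.
training_rdd = sc.textFile("hdfs://shared-storage/datasets/img/train/")
print(training_rdd.count())
```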
The cluster resource management module is also used for sending the training data address of the deep learning task and the execution result data address of the deep learning task which are distributed to each computing node.
The IB network switch adopts a 40G IB network switch and is used for connecting the computing nodes in the cluster, realizing data interaction among the computing nodes in the cluster; the IB network switch is connected only to the GPUs of the computing nodes inside the cluster and not to any switch outside the cluster. The training tasks on the nodes use the 40G IB network to realize rapid data synchronization.
In a preferred implementation manner of this embodiment, instead of the IB network, dedicated networks such as Myrinet, QsNet, and SCI may also be used to implement the connection between the computing nodes.
In the embodiment of the present invention, a deep learning network training task is taken as an example. Fig. 2 is a schematic flow chart of a task processing method based on the heterogeneous cluster shown in Fig. 1; as shown in Fig. 2, the method includes the following steps:
step S201, receiving a deep learning task;
step S202, distributing computing nodes and resources on the computing nodes for the deep learning task;
and step S203, triggering each computing node to execute the deep learning task.
The main execution body of the method in fig. 2 is a cluster resource management module.
In a preferred implementation of step S201,
the cluster resource management module receives deep learning task information input by a user through a front-end page, where the deep learning task information may include, but is not limited to: the computation graph to be executed for deep learning, the number of nodes for executing the deep learning task, the deep learning library interface that needs to be called for executing the deep learning, the training data address of the deep learning task, and the execution result data address of the deep learning task.
The deep learning task needs to perform computation over a dataflow graph, so submitting a deep learning task requires submitting the corresponding computation graph; that is, the computing task is submitted in the form of a graph.
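For illustration only, the task information listed above could be submitted from the front-end page as a payload like the following; the concrete keys, paths, and the choice of Caffe as the library interface are assumptions, not the patent's actual interface.

```python
# Hypothetical deep learning task description mirroring the fields listed above.
deep_learning_task = {
    "name": "image-perception-training",
    "computation_graph": "hdfs://shared-storage/jobs/img/graph.pb",  # the task is submitted as a graph
    "num_nodes": 8,                     # number of compute nodes to execute the task
    "dl_library_interface": "caffe",    # deep learning library interface to be called
    "training_data_address": "hdfs://shared-storage/datasets/img/train/",
    "result_data_address": "hdfs://shared-storage/jobs/img/output/",
}
```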
In a preferred implementation of step S202,
the cluster resource management module determines the number of nodes for the deep learning task from the deep learning task information, allocates the computing nodes in the cluster that will execute the deep learning task, calls the deep learning library through the deep learning library interface included in the deep learning task information, and starts the subtasks corresponding to the deep learning task on the allocated computing nodes respectively.
GPU resources on the allocated computing nodes are allocated according to the subtasks started on them and bound with the subtasks. Preferably, the resources remaining on a computing node are allocated to other subtasks when other tasks' resource requests are served.
Starting the subtasks corresponding to the deep learning task on each allocated computing node further comprises:
the cluster resource management module receives the host names and port numbers returned by the computing nodes, generates a subtask network list according to the subtasks corresponding to the computing nodes and the returned host names and port numbers, and sends the subtask network list to each computing node, so that each computing node establishes connections among the subtasks according to the subtask network list.
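A sketch of how the subtask network list described above might be assembled from the (hostname, port) pairs returned by the computing nodes; the data shapes and the example parameter-server/trainer endpoints are illustrative assumptions.

```python
def build_subtask_network_list(subtasks):
    """`subtasks` maps each allocated compute node to the (subtask_type, hostname, port)
    reported back after its subtask starts; names are illustrative."""
    return [
        {"node": node, "type": stype, "endpoint": f"{host}:{port}"}
        for node, (stype, host, port) in subtasks.items()
    ]

# Each compute node receives the same list and connects its local subtask
# (e.g. parameter server or trainer) to the other subtasks' endpoints.
subtasks = {
    "node01": ("parameter_server", "node01", 6170),
    "node02": ("trainer", "node02", 6171),
    "node03": ("trainer", "node03", 6172),
}
print(build_subtask_network_list(subtasks))
```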
In a preferred implementation of this embodiment, the subtask types may include, but are not limited to: a parameter server subtask, a trainer subtask.
In a preferred implementation of step S203,
the cluster resource management module generates, according to the deep learning task, a task instruction instructing the scheduled computing nodes to perform distributed training of the deep learning network, and triggers the subtasks started in each computing node to acquire the training data for the deep learning task from the shared memory, perform the deep learning computation, and store the execution result data of the deep learning task in the shared memory.
Specifically, the cluster resource management module sends the training data address of the deep learning task and the execution result data address of the deep learning task allocated to each computing node to the CPU of that computing node; the CPU of each computing node acquires the training data for the deep learning task from the shared memory according to the training data address and sends it to the GPUs; the GPUs of each computing node perform the deep learning training and send the execution result data of the training to the CPU; and the CPU of each computing node uploads the execution result data of the deep learning task to the shared memory according to the execution result data address.
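The per-node execution flow above can be summarized in the following sketch; the download/upload helpers (Ethernet I/O against the shared memory) and the GPU training routine are injected as placeholders because the patent does not fix concrete APIs.

```python
def run_subtask_on_node(training_data_address, result_data_address,
                        download, upload, train_on_gpu):
    """Minimal sketch of one compute node's execution flow (illustrative assumptions)."""
    # CPU side: fetch the training data from the shared memory over the 10G Ethernet.
    batches = download(training_data_address)
    # GPU side: run the accelerated parallel computation; gradient/data synchronization
    # with peer GPUs happens over the 40G IB network inside `train_on_gpu`.
    results = train_on_gpu(batches)
    # CPU side: upload the execution result data back to the shared memory.
    upload(result_data_address, results)
```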
It should be noted that the foregoing method embodiments are described as a series of acts or combinations for simplicity in explanation, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Fig. 3 is a structural diagram of a task processing device of a heterogeneous cluster according to the present invention, where the device may be disposed in a cluster resource management module to complete operations in the method embodiment shown in fig. 2. As shown in fig. 3, includes:
a receiving module 301, configured to receive a deep learning task;
a scheduling module 302, configured to allocate a compute node and resources on the compute node for the deep learning task;
and the execution module 303 is configured to trigger each computing node to execute the deep learning task.
In a preferred implementation of the receiving module 301,
the receiving module 301 receives deep learning task information input by a user through a front-end page, which may include, but is not limited to: the computation graph to be executed for deep learning, the number of nodes for executing the deep learning task, the deep learning library interface that needs to be called for executing the deep learning, the training data address of the deep learning task, and the execution result data address of the deep learning task.
In a preferred implementation of the scheduling module 302,
the scheduling module 302 determines the number of nodes for the deep learning task from the deep learning task information, allocates the computing nodes in the cluster that will execute the deep learning task, calls the deep learning library through the deep learning library interface included in the deep learning task information, and starts the subtasks corresponding to the deep learning task on the allocated computing nodes respectively.
GPU resources on the allocated computing nodes are allocated according to the subtasks started on them and bound with the subtasks. Preferably, the resources remaining on a computing node are allocated to other subtasks when other tasks' resource requests are served.
Starting the subtasks corresponding to the deep learning task on each allocated computing node further comprises:
the scheduling module 302 receives the host names and port numbers returned by the computing nodes, generates a subtask network list according to the subtasks corresponding to the computing nodes and the returned host names and port numbers, and sends the subtask network list to each computing node, so that each computing node establishes connections among the subtasks according to the subtask network list.
In a preferred implementation of this embodiment, the subtask types may include, but are not limited to: a parameter server subtask, a trainer subtask.
In a preferred implementation of the execution module 303,
the execution module 303 generates, according to the deep learning task, a task instruction instructing the scheduled computing nodes to perform distributed training of the deep learning network, and triggers the subtasks started in each computing node to acquire the training data for the deep learning task from the shared memory, perform the deep learning computation, and store the execution result data of the deep learning task in the shared memory.
The execution module 303 sends the training data address of the deep learning task and the execution result data address of the deep learning task allocated to each computing node to the CPU of that computing node; the CPU of each computing node acquires the training data for the deep learning task from the shared memory according to the training data address and sends it to the GPUs; the GPUs of each computing node perform the deep learning training and send the execution result data of the training to the CPU; and the CPU of each computing node uploads the execution result data of the deep learning task to the shared memory according to the execution result data address.
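As an illustrative composition only, the three modules could be wired together inside the cluster resource management module roughly as follows; the class and method names (receive/allocate/trigger) are assumptions, not interfaces specified by the patent.

```python
class TaskProcessingDevice:
    """Illustrative wiring of the receiving, scheduling and execution modules."""
    def __init__(self, receiving_module, scheduling_module, execution_module):
        self.receiving = receiving_module    # receiving module 301
        self.scheduling = scheduling_module  # scheduling module 302
        self.execution = execution_module    # execution module 303

    def handle(self):
        task = self.receiving.receive()            # task info from the front-end page
        bindings = self.scheduling.allocate(task)  # compute nodes + GPU resources bound to the task
        self.execution.trigger(task, bindings)     # each node trains and writes results back
```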
By adopting the scheme provided by the invention, the heterogeneous cluster can be used for parallel computation, the acceleration ratio is improved, and the execution efficiency of the deep learning task is greatly improved.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Fig. 4 illustrates a block diagram of an exemplary computer system/server 012 suitable for use in implementing embodiments of the invention. The computer system/server 012 shown in fig. 4 is only an example, and should not bring any limitation to the function and the scope of use of the embodiment of the present invention.
As shown in fig. 4, the computer system/server 012 is embodied as a general purpose computing device. The components of computer system/server 012 may include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 that couples various system components including the system memory 028 and the processing unit 016.
Bus 018 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Computer system/server 012 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 012 and includes both volatile and nonvolatile media, removable and non-removable media.
System memory 028 can include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)030 and/or cache memory 032. The computer system/server 012 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 034 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be connected to bus 018 via one or more data media interfaces. Memory 028 can include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the present invention.
Program/utility 040 having a set (at least one) of program modules 042 can be stored, for example, in memory 028, such program modules 042 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof might include an implementation of a network environment. Program modules 042 generally perform the functions and/or methodologies of embodiments of the present invention as described herein.
The computer system/server 012 may also communicate with one or more external devices 014 (e.g., a keyboard, a pointing device, a display 024, etc.), with one or more devices that enable a user to interact with the computer system/server 012, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer system/server 012 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 022. Also, the computer system/server 012 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 020. As shown in fig. 4, the network adapter 020 communicates with the other modules of the computer system/server 012 via bus 018. It should be appreciated that although not shown in fig. 4, other hardware and/or software modules may be used in conjunction with the computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 016 executes the programs stored in the system memory 028, thereby performing the functions and/or methods of the described embodiments of the present invention.
The computer program described above may be provided in a computer storage medium encoded with a computer program that, when executed by one or more computers, causes the one or more computers to perform the method flows and/or apparatus operations shown in the above-described embodiments of the invention.
With the development of time and technology, the meaning of media is more and more extensive, and the propagation path of computer programs is not limited to tangible media any more, and can also be downloaded from a network directly and the like. Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. A heterogeneous cluster, comprising: a plurality of computing nodes, a cluster resource management module, an Ethernet switch and an IB network switch; wherein:
the computing nodes are used for performing parallel computing; each computing node comprises a GPU and a CPU;
the cluster resource management module is used for distributing computing nodes for executing the deep learning task in a cluster according to the deep learning task information and respectively starting subtasks corresponding to the deep learning task on the distributed computing nodes; distributing GPU resources on the distributed computing nodes according to the subtasks started on the distributed computing nodes; sending the training data address of the deep learning task and the execution result data address of the deep learning task which are distributed to each computing node to the CPU of each computing node;
the CPU of each computing node is interconnected with CPUs of other computing nodes, the cluster resource management module and the shared memory through the Ethernet switch, and acquires training data from the shared memory according to the training data address and sends the training data to the GPU; acquiring execution result data of deep learning training performed by a GPU, and uploading the execution result data to a shared memory according to the execution result data address;
the GPUs of each computing node are interconnected through an IB network switch, and the IB network switch is only connected with the GPUs of each computing node in the cluster; and the training tasks distributed by the GPUs realize the rapid synchronization of data through an IB network.
2. The heterogeneous cluster of claim 1, wherein each computing node employs dual CPUs and multiple GPUs; and the CPU in each computing node communicates with the GPUs through a PCIE interface.
3. The heterogeneous cluster of claim 1,
the CPU is used for carrying out logic control on the GPU in the computing node where the CPU is located and controlling the computation which needs to be executed on the GPU;
the GPU is used to perform parallel computations that require acceleration.
4. The heterogeneous cluster of claim 1, wherein the cluster resource management module is further configured to bind the allocated GPU resources with a training task.
5. A task processing method applied to the heterogeneous cluster of any one of claims 1 to 4, comprising:
receiving a training task;
distributing a computing node and GPU resources on the computing node for the training task;
and triggering each computing node to execute the training task.
6. The task processing method of claim 5, wherein receiving a training task comprises: receiving the training task input by the user through the front-end page.
7. The task processing method of claim 5, wherein the allocating computing nodes and resources on computing nodes for the training task comprises:
scheduling computing nodes in the heterogeneous cluster and distributing training tasks;
and allocating GPU resources on the scheduled computing nodes, and binding the GPU resources with the training tasks.
8. The task processing method of claim 5, wherein the triggering each computing node to execute the training task comprises:
generating, according to the training task, a task instruction instructing the scheduled computing nodes to perform parallel computing, and triggering each computing node to acquire data for the training task from the shared memory, perform the parallel computing, and store the execution result data of the training task in the shared memory.
9. A task processing device applied to the heterogeneous cluster according to any one of claims 1 to 4, comprising:
the receiving module is used for receiving the training task;
the scheduling module is used for distributing the computing nodes and GPU resources on the computing nodes for the training tasks;
and the execution module is used for triggering each computing node to execute the training task.
10. The task processing device according to claim 9, wherein the receiving module is specifically configured to:
and receiving the training task input by the user through the front page.
11. The task processing device according to claim 9, wherein the scheduling module is specifically configured to:
scheduling computing nodes in the heterogeneous cluster and distributing training tasks;
and allocating GPU resources on the scheduled computing nodes, and binding the GPU resources with the training tasks.
12. The task processing device according to claim 9, wherein the execution module is specifically configured to:
generate, according to the training task, a task instruction instructing the scheduled computing nodes to perform parallel computing, trigger each computing node to acquire data for the training task from the shared memory, perform the parallel computing, and store the execution result data of the training task in the shared memory.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method as claimed in claim 5 when executing the program.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of claim 5.
CN201710775681.2A 2017-08-31 2017-08-31 Heterogeneous cluster and task processing method and device Active CN107766148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710775681.2A CN107766148B (en) 2017-08-31 2017-08-31 Heterogeneous cluster and task processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710775681.2A CN107766148B (en) 2017-08-31 2017-08-31 Heterogeneous cluster and task processing method and device

Publications (2)

Publication Number Publication Date
CN107766148A CN107766148A (en) 2018-03-06
CN107766148B true CN107766148B (en) 2021-02-19

Family

ID=61265338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710775681.2A Active CN107766148B (en) 2017-08-31 2017-08-31 Heterogeneous cluster and task processing method and device

Country Status (1)

Country Link
CN (1) CN107766148B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110389824A (en) * 2018-04-20 2019-10-29 伊姆西Ip控股有限责任公司 Handle method, equipment and the computer program product of calculating task
CN109062700A (en) * 2018-08-21 2018-12-21 郑州云海信息技术有限公司 A kind of method for managing resource and server based on distributed system
CN111274023B (en) * 2018-12-05 2022-11-22 上海寒武纪信息科技有限公司 Data processing method, device, computer system and storage medium
CN109976911B (en) * 2019-03-25 2021-04-20 哈尔滨工程大学 Self-adaptive resource scheduling method
CN109995862B (en) * 2019-03-29 2021-10-15 北京百度网讯科技有限公司 Resource scheduling method and terminal
CN109992422A (en) * 2019-04-11 2019-07-09 北京朗镜科技有限责任公司 A kind of method for scheduling task towards GPU resource, device and system
CN110399222B (en) * 2019-07-25 2022-01-21 北京邮电大学 GPU cluster deep learning task parallelization method and device and electronic equipment
CN110515732B (en) * 2019-08-23 2021-06-18 中国人民解放军国防科技大学 Task allocation method based on deep learning inference of resource-constrained robot
CN111147603A (en) * 2019-09-30 2020-05-12 华为技术有限公司 Method and device for networking reasoning service
CN110889492B (en) * 2019-11-25 2022-03-08 北京百度网讯科技有限公司 Method and apparatus for training deep learning models
CN112965809A (en) * 2019-12-12 2021-06-15 深圳市优必选科技股份有限公司 Deep learning task processing system and method
CN111159078B (en) * 2019-12-30 2022-05-06 联想长风科技(北京)有限公司 Electronic equipment
US11620502B2 (en) * 2020-01-30 2023-04-04 Alibaba Group Holding Limited Hyper-square implementation of tree AllReduce algorithm for distributed parallel deep learning
US11561840B2 (en) * 2020-01-30 2023-01-24 Alibaba Group Holding Limited Efficient inter-chip interconnect topology for distributed parallel deep learning
US11520640B2 (en) * 2020-01-30 2022-12-06 Alibaba Group Holding Limited Efficient and more advanced implementation of ring-AllReduce algorithm for distributed parallel deep learning
CN113204412A (en) * 2020-01-31 2021-08-03 伊姆西Ip控股有限责任公司 Method, electronic device, and computer storage medium for task scheduling
CN111327692A (en) * 2020-02-05 2020-06-23 北京百度网讯科技有限公司 Model training method and device and cluster system
CN111415007B (en) * 2020-03-26 2023-01-17 中科寒武纪科技股份有限公司 Method and device for calculating data, board card and computer readable storage medium
CN111680791B (en) * 2020-06-16 2023-04-18 北京字节跳动网络技术有限公司 Communication method, device and system suitable for heterogeneous environment
CN112148453A (en) * 2020-09-29 2020-12-29 深圳致星科技有限公司 Computing chip for privacy computation and network computing system
CN114827151A (en) * 2022-05-20 2022-07-29 合肥边缘智芯科技有限公司 Heterogeneous server cluster and data forwarding method, device and equipment
CN114866510B (en) * 2022-05-25 2023-06-30 山东省计算中心(国家超级计算济南中心) Cross-network and off-site interconnection communication method and system based on InfiniBand network
CN116980420B (en) * 2023-09-22 2023-12-15 新华三技术有限公司 Cluster communication method, system, device, equipment and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105227669A (en) * 2015-10-15 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of aggregated structure system of CPU and the GPU mixing towards degree of depth study

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102135949B (en) * 2011-03-01 2013-06-19 浪潮(北京)电子信息产业有限公司 Computing network system, method and device based on graphic processing unit
CN102541804B (en) * 2011-12-26 2014-04-02 中国人民解放军信息工程大学 Multi-GPU (graphic processing unit) interconnection system structure in heterogeneous system
US9275498B2 (en) * 2012-08-09 2016-03-01 Qualcomm Incorporated GPU-accelerated path rendering
WO2014094410A1 (en) * 2012-12-20 2014-06-26 中国科学院近代物理研究所 Particle flow simulation system and method
CN106462498B (en) * 2014-06-23 2019-08-02 利奇德股份有限公司 Modularization architecture for exchanging for data-storage system
US9298769B1 (en) * 2014-09-05 2016-03-29 Futurewei Technologies, Inc. Method and apparatus to facilitate discrete-device acceleration of queries on structured data
CN106529682A (en) * 2016-10-28 2017-03-22 北京奇虎科技有限公司 Method and apparatus for processing deep learning task in big-data cluster

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105227669A (en) * 2015-10-15 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of aggregated structure system of CPU and the GPU mixing towards degree of depth study

Also Published As

Publication number Publication date
CN107766148A (en) 2018-03-06

Similar Documents

Publication Publication Date Title
CN107766148B (en) Heterogeneous cluster and task processing method and device
CN108537543B (en) Parallel processing method, device, equipment and storage medium for blockchain data
US10552161B2 (en) Cluster graphical processing unit (GPU) resource sharing efficiency by directed acyclic graph (DAG) generation
US9830677B2 (en) Graphics processing unit resource sharing
US10776144B2 (en) Address space management with respect to a coherent accelerator processor interface architecture
US10310908B2 (en) Dynamic usage balance of central processing units and accelerators
US9875124B2 (en) Data assignment and data scheduling for physical machine in a virtual machine environment
JP2020537784A (en) Machine learning runtime library for neural network acceleration
US9501318B2 (en) Scheduling and execution of tasks based on resource availability
CN107678752B (en) Task processing method and device for heterogeneous cluster
US10109030B1 (en) Queue-based GPU virtualization and management system
CN107025256B (en) Method and system for reducing reactivation time of cloud-based services
US20180341516A1 (en) Processing jobs using task dependencies
US20180067764A1 (en) Smart reduce task scheduler
US10031781B2 (en) Estimating job start times on workload management systems
US20140152680A1 (en) System and method for efficient resource management of a signal flow programmed digital signal processor code
CN110909527B (en) Text processing model running method and device, electronic equipment and storage medium
US10783003B2 (en) Method, device, and computer readable medium for managing dedicated processing resources
US10409762B2 (en) Remote direct memory access-based on static analysis of asynchronous blocks
US20180210860A1 (en) System, method and computer program product for dense/sparse linear system solver accelerator
CN111767059A (en) Deployment method and device of deep learning model, electronic equipment and storage medium
US10228982B2 (en) Hyper-threaded processor allocation to nodes in multi-tenant distributed software systems
US9176910B2 (en) Sending a next request to a resource before a completion interrupt for a previous request
EP4295229A1 (en) Asynchronous distributed data flow for machine learning workloads
US20210073033A1 (en) Memory management using coherent accelerator functionality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant