CN107766148B - Heterogeneous cluster and task processing method and device

Heterogeneous cluster and task processing method and device

Info

Publication number
CN107766148B
Authority
CN
China
Prior art keywords
task
training
computing
cluster
computing nodes
Prior art date
Legal status
Active
Application number
CN201710775681.2A
Other languages
Chinese (zh)
Other versions
CN107766148A (en)
Inventor
温圣召
周汉清
刘传秀
张家军
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710775681.2A priority Critical patent/CN107766148B/en
Publication of CN107766148A publication Critical patent/CN107766148A/en
Application granted granted Critical
Publication of CN107766148B publication Critical patent/CN107766148B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention discloses a heterogeneous cluster and a task processing method and device. The heterogeneous cluster comprises a plurality of computing nodes, a cluster resource management module, an Ethernet switch and an IB network switch. The computing nodes are used for performing parallel computing, and each computing node comprises a GPU and a CPU. The cluster resource management module is used for allocating the computing nodes in the cluster and the GPU resources on the computing nodes. Each computing node and the cluster resource management module are interconnected through the Ethernet switch, and the computing nodes are interconnected with one another through the IB network switch. With this scheme, the heterogeneous cluster can be used for parallel computation, the acceleration ratio is improved, and the execution efficiency of deep learning tasks is improved.

Description

Heterogeneous cluster and task processing method and device
[Technical Field]
The invention relates to computer application technology, and in particular to a heterogeneous cluster and a task processing method and device.
[Background of the Invention]
With the development of big data and deep learning technology, training and learning on massive data with deep learning methods to obtain an accurate cognitive model has received increasing attention from technology companies. More complex and powerful deep models can reveal the complex and rich information carried in massive data and make more accurate predictions about future or unknown events. Deep learning applications include speech recognition, image recognition, natural language processing, search-advertising CTR prediction and the like; the computation involved is enormous, and the large number of computing tasks becomes the bottleneck restricting model training in deep learning.
If a traditional single-CPU scheme is adopted for model training, the required computation time can reach months or even years, which seriously hinders the application of deep learning technology to various businesses. A typical case is picture distribution, where a set of picture perception models needs to be learned with a deep learning scheme; trained on a traditional single-CPU machine, the theoretical training time would be about 100 years.
At present, CPU clusters are commonly used to accelerate deep learning applications. With CPU-cluster acceleration, the acceleration ratio can be increased by enlarging the cluster, but in the deep learning field, as the cluster grows, the amount of data that must be communicated for synchronization grows as well, communication takes an ever heavier share of the time, and the achievable gain in acceleration ratio is limited.
[Summary of the Invention]
Various aspects of the application provide a heterogeneous cluster and a task processing method and device, which can perform parallel computation by using the heterogeneous cluster and improve the acceleration ratio.
In one aspect of the present application, a heterogeneous cluster is provided, including:
the system comprises a plurality of computing nodes, a cluster resource management module, an Ethernet switch and an IB network switch; wherein:
the computing nodes are used for performing parallel computing; each computing node comprises a GPU and a CPU;
the cluster resource management module is used for distributing the computing nodes in the cluster and the GPU resources on the computing nodes;
each computing node and the cluster resource management module are interconnected through an Ethernet switch;
the compute nodes are interconnected by IB network switches.
The above aspect and any possible implementation further provide an implementation in which each computing node employs dual CPUs and multiple GPUs, and the CPU in each computing node communicates with the GPUs through a PCIE interface.
The above aspect and any possible implementation further provide an implementation in which the CPU is configured to perform logic control over the GPUs in the computing node where it is located and to control the computation that needs to be executed on the GPUs;
the GPUs are used to perform the parallel computation that needs to be accelerated.
The above aspect and any possible implementation further provide an implementation in which the CPU of each computing node is interconnected with the CPUs of the other computing nodes and with the cluster resource management module through the Ethernet switch, and the GPUs of each computing node are interconnected with the GPUs of the other computing nodes through the IB network switch.
The above aspect and any possible implementation further provide an implementation in which the CPU of each computing node and the cluster resource management module are connected to an external shared memory through the Ethernet switch.
The above aspect and any possible implementation further provide an implementation in which the cluster resource management module is specifically configured to:
schedule computing nodes in the heterogeneous cluster and distribute the task; and
allocate GPU resources on the scheduled computing nodes and bind the GPU resources with the task.
In another aspect of the present invention, a task processing method applied to the heterogeneous cluster is provided, including:
receiving a task;
allocating computing nodes and GPU resources on the computing nodes for the tasks;
and triggering each computing node to execute the task.
The above aspect and any possible implementation further provide an implementation in which receiving a task includes: receiving a task input by a user through a front-end page.
The above aspect and any possible implementation further provide an implementation in which allocating a computing node and resources on the computing node for the task includes:
scheduling computing nodes in the heterogeneous cluster and distributing the task; and
allocating GPU resources on the scheduled computing nodes and binding the GPU resources with the task.
The above aspect and any possible implementation further provide an implementation in which triggering each computing node to execute the task includes:
generating, according to the task, a task instruction instructing the scheduled computing nodes to perform parallel computing, and triggering each computing node to acquire data for the task from the shared memory, perform the parallel computing, and store the execution result data of the task in the shared memory.
In another aspect of the present invention, a task processing device applied to the above heterogeneous cluster is provided, including:
the receiving module is used for receiving the task;
the scheduling module is used for distributing the computing nodes and the resources on the computing nodes for the tasks;
the execution module is used for triggering each computing node to execute the task;
wherein the heterogeneous cluster is the heterogeneous cluster described above.
The above aspect and any possible implementation further provide an implementation in which the receiving module is specifically configured to:
receive a task input by a user through a front-end page.
The above aspect and any possible implementation further provide an implementation in which the scheduling module is specifically configured to:
schedule computing nodes in the heterogeneous cluster and distribute the task; and
allocate GPU resources on the scheduled computing nodes and bind the GPU resources with the task.
The above aspect and any possible implementation further provide an implementation in which the execution module is specifically configured to:
generate, according to the task, a task instruction instructing the scheduled computing nodes to perform parallel computing, and trigger each computing node to acquire data for the task from the shared memory, perform the parallel computing, and store the execution result data of the task in the shared memory.
In another aspect of the present invention, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
In another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method as set forth above.
As can be seen from the above, with the scheme of the invention, deep learning applications can be accelerated by means of the heterogeneous cluster, and the execution efficiency of deep learning tasks is improved.
[Description of the Drawings]
FIG. 1 is a block diagram of a heterogeneous cluster according to the present invention;
FIG. 2 is a flowchart of a task processing method of the heterogeneous cluster according to the present invention;
FIG. 3 is a block diagram of a task processing device of a heterogeneous cluster according to the present invention;
FIG. 4 illustrates a block diagram of an exemplary computer system/server 012 suitable for use in implementing embodiments of the invention.
[Detailed Description of Embodiments]
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic structural diagram of a heterogeneous cluster according to the present invention. As shown in Fig. 1, the cluster includes a plurality of computing nodes, a cluster resource management module, an Ethernet switch, and an IB (InfiniBand) network switch, wherein:
the computing node is used for carrying out deep learning application; each computing node comprises a GPU and a CPU; each compute node is connected to all other compute nodes.
Preferably, the computing nodes adopt a double-path CPU and a multi-path GPU; the number of GPU cards in each computing node is at least 2; and the CPU in each computing node communicates with the GPU through a PCIE interface.
Preferably, each computing node further comprises an Ethernet card and an IB card; the CPU of each computing node is interconnected with the shared memory through the Ethernet switch; and the GPUs of each computing node are interconnected through the IB network switch.
The cluster resource management module is used for allocating the computing nodes in the cluster and the CPU and GPU resources on the computing nodes, which comprises: scheduling the computing nodes in the cluster and distributing the deep learning task; and allocating GPU resources on the scheduled computing nodes and binding the GPU resources with the deep learning task. The cluster resource management module is connected with the computing nodes and the shared memory through the Ethernet switch.
The Ethernet switch is used for connecting the CPUs of all the computing nodes in the cluster, realizing data interaction between each computing node and the outside, for example, downloading data from the shared memory to the computing nodes and uploading data to the shared memory.
The IB network switch is used for connecting the GPUs of all the computing nodes in the cluster and realizing data interaction among those GPUs; the IB network switch is connected only to the computing nodes inside the cluster and not to any device outside the cluster.
In a preferred implementation of this embodiment,
each computing node employs dual low-clock-frequency CPUs, for example Intel(R) Xeon E5-2620 CPUs, with 2 CPUs per computing node; the CPUs are used to perform logic control over the GPUs in the computing node and to control the computation that needs to be executed on the GPUs;
each computing node employs multiple GPUs, for example NVIDIA Tesla K40 GPU accelerator cards, with 2 or more GPU accelerator cards installed in each computing node; the GPUs are used to execute the parallel computation that needs to be accelerated in the deep learning process;
the CPUs in each computing node communicate with the GPUs via the PCIE standard;
each computing node can be configured with a 10G Ethernet card, through which the CPUs are connected to the external shared memory for data interaction with it;
each computing node can be configured with a 40G RDMA network card, which connects the GPUs to the GPUs of the other computing nodes in the cluster via the IB network.
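As a non-limiting illustration of the node configuration described above, the following Python sketch models one compute node's hardware profile (dual Xeon E5-2620 CPUs, at least two Tesla K40 accelerator cards, PCIE inside the node, a 10G Ethernet NIC and a 40G RDMA/IB NIC); the class and field names are assumptions chosen for illustration and are not part of the patented scheme.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ComputeNodeConfig:
    """Hardware profile of one compute node as described above (illustrative only)."""
    hostname: str
    cpus: List[str] = field(default_factory=lambda: ["Intel Xeon E5-2620"] * 2)  # dual low-clock CPUs
    gpus: List[str] = field(default_factory=lambda: ["NVIDIA Tesla K40"] * 2)    # at least 2 GPU cards
    gpu_interconnect: str = "PCIE"   # CPU <-> GPU link inside the node
    ethernet_nic_gbps: int = 10      # 10G Ethernet card: CPU <-> shared memory / management
    ib_nic_gbps: int = 40            # 40G RDMA card: GPU <-> GPU across nodes via the IB switch

# Example: a small heterogeneous cluster built from such nodes.
cluster_nodes = [ComputeNodeConfig(hostname=f"node{i:02d}") for i in range(4)]
```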
The GPU heterogeneous cluster can be configured with more than 100 GPU accelerator cards for large-scale computation. The GPU heterogeneous cluster handles comprehensive offline training tasks, covering speech, image, natural language, and other deep learning applications; each computing node can be deployed with a deep learning computing framework such as Caffe, Torch, Theano, Cuda-Convnet, KALDI, or another deep learning framework, and each deep learning framework can be used for speech training, image training, natural language training, advertisement training, or other training.
In a preferred implementation of the embodiment of the invention,
the cluster resource management module is directly connected with all the computing nodes in the cluster and is used for optimally allocating the computing nodes in the cluster and the GPU resources on the computing nodes according to the user's requirement configuration; specifically:
the cluster resource management module allocates the computing nodes in the cluster and the resources on them according to the received deep learning task, which includes:
Node allocation: allocating, according to the deep learning task, the computing nodes in the cluster that will execute the deep learning task, and starting the subtasks corresponding to the deep learning task on the allocated computing nodes respectively.
Resource allocation: allocating GPU resources on the allocated computing nodes according to the subtasks started on them and binding the GPU resources with the subtasks; the resources remaining on a computing node can be allocated to other subtasks when the resource requests of other tasks are served.
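A minimal sketch of the two allocation steps above, assuming a simple in-memory view of the free GPUs per node; the function name, task fields, and data structures are illustrative assumptions rather than the module's actual interface.

```python
def allocate_task(task, nodes, free_gpus):
    """Sketch of node allocation followed by resource allocation and binding.
    `task` carries the requested node count and GPUs per subtask;
    `free_gpus` maps hostname -> list of idle GPU ids (all names illustrative)."""
    # 1. Node allocation: pick as many nodes as the deep learning task requests.
    chosen = [h for h in nodes
              if len(free_gpus[h]) >= task["gpus_per_subtask"]][: task["num_nodes"]]
    if len(chosen) < task["num_nodes"]:
        raise RuntimeError("not enough compute nodes with free GPU resources")

    # 2. Resource allocation: bind GPUs on each chosen node to the subtask started there;
    #    whatever remains on the node stays available for other tasks' requests.
    bindings = {}
    for host in chosen:
        gpus = [free_gpus[host].pop() for _ in range(task["gpus_per_subtask"])]
        bindings[host] = {"subtask": f"{task['name']}-{host}", "gpus": gpus}
    return bindings
```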
The Ethernet switch adopts a 10G Ethernet switch for connecting the CPUs of the computing nodes in the cluster, realizing data interaction between the computing nodes and the external shared memory, for example downloading data from the shared memory to the computing nodes and uploading data to the shared memory. The shared memory is a parallel distributed storage system and supports multi-threaded parallel reading and writing. The training data for the deep learning task in the shared memory is constructed as an RDD (Resilient Distributed Dataset) object.
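The patent names RDD objects but no concrete framework. Assuming a PySpark environment and a hypothetical shared-storage path, exposing the training data as an RDD for parallel reads might look like the following sketch.

```python
from pyspark import SparkContext

# Illustrative only: the application name and the HDFS-style path are placeholders.
sc = SparkContext(appName="deep-learning-training-data")

# Parallel, distributed read of the training data held in the shared storage.
training_rdd = sc.textFile("hdfs://shared-storage/datasets/img/train/")
print(training_rdd.count())
```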
The cluster resource management module is also used for sending the training data address of the deep learning task and the execution result data address of the deep learning task which are distributed to each computing node.
The IB network switch adopts a 40G IB network switch and is used for connecting the computing nodes in the cluster, realizing data interaction among the computing nodes in the cluster; the IB network switch is connected only to the GPUs of the computing nodes inside the cluster and not to any switch outside the cluster. The training tasks on the nodes use the 40G IB network to realize rapid data synchronization.
In a preferred implementation manner of this embodiment, instead of the IB network, dedicated networks such as Myrinet, QsNet, and SCI may also be used to implement the connection between the computing nodes.
In the embodiment of the present invention, a deep learning network training task is taken as an example. Fig. 2 is a schematic flow chart of a task processing method based on the heterogeneous cluster shown in Fig. 1; as shown in Fig. 2, the method includes the following steps:
step S201, receiving a deep learning task;
step S202, distributing computing nodes and resources on the computing nodes for the deep learning task;
and step S203, triggering each computing node to execute the deep learning task.
The main execution body of the method in fig. 2 is a cluster resource management module.
In a preferred implementation of step S201,
the cluster resource management module receives deep learning task information input by a user through a front-end page, where the deep learning task information may include, but is not limited to: the computation graph to be executed for deep learning, the number of nodes for executing the deep learning task, the deep learning library interface that needs to be called for executing the deep learning, the training data address of the deep learning task, and the execution result data address of the deep learning task.
The deep learning task needs to perform computation over a dataflow graph, so submitting a deep learning task requires submitting the corresponding computation graph; that is, the computing task is submitted in the form of a graph.
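For illustration only, the task information listed above could be submitted from the front-end page as a payload like the following; the concrete keys, paths, and the choice of Caffe as the library interface are assumptions, not the patent's actual interface.

```python
# Hypothetical deep learning task description mirroring the fields listed above.
deep_learning_task = {
    "name": "image-perception-training",
    "computation_graph": "hdfs://shared-storage/jobs/img/graph.pb",  # the task is submitted as a graph
    "num_nodes": 8,                     # number of compute nodes to execute the task
    "dl_library_interface": "caffe",    # deep learning library interface to be called
    "training_data_address": "hdfs://shared-storage/datasets/img/train/",
    "result_data_address": "hdfs://shared-storage/jobs/img/output/",
}
```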
In a preferred implementation of step S202,
the cluster resource management module determines the number of nodes for the deep learning task from the deep learning task information, allocates the computing nodes in the cluster that will execute the deep learning task, calls the deep learning library through the deep learning library interface included in the deep learning task information, and starts the subtasks corresponding to the deep learning task on the allocated computing nodes respectively.
GPU resources on the allocated computing nodes are allocated according to the subtasks started on them and bound with the subtasks. Preferably, the resources remaining on a computing node are allocated to other subtasks when other tasks' resource requests are served.
Starting the subtasks corresponding to the deep learning task on each allocated computing node further comprises:
the cluster resource management module receives the host names and port numbers returned by the computing nodes, generates a subtask network list according to the subtasks corresponding to the computing nodes and the returned host names and port numbers, and sends the subtask network list to each computing node, so that each computing node establishes connections among the subtasks according to the subtask network list.
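A sketch of how the subtask network list described above might be assembled from the (hostname, port) pairs returned by the computing nodes; the data shapes and the example parameter-server/trainer endpoints are illustrative assumptions.

```python
def build_subtask_network_list(subtasks):
    """`subtasks` maps each allocated compute node to the (subtask_type, hostname, port)
    reported back after its subtask starts; names are illustrative."""
    return [
        {"node": node, "type": stype, "endpoint": f"{host}:{port}"}
        for node, (stype, host, port) in subtasks.items()
    ]

# Each compute node receives the same list and connects its local subtask
# (e.g. parameter server or trainer) to the other subtasks' endpoints.
subtasks = {
    "node01": ("parameter_server", "node01", 6170),
    "node02": ("trainer", "node02", 6171),
    "node03": ("trainer", "node03", 6172),
}
print(build_subtask_network_list(subtasks))
```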
In a preferred implementation of this embodiment, the subtask types may include, but are not limited to: a parameter server subtask, a trainer subtask.
In a preferred implementation of step S203,
the cluster resource management module generates, according to the deep learning task, a task instruction instructing the scheduled computing nodes to perform distributed training of the deep learning network, and triggers the subtasks started in each computing node to acquire the training data for the deep learning task from the shared memory, perform the deep learning computation, and store the execution result data of the deep learning task in the shared memory.
Specifically, the cluster resource management module sends the training data address of the deep learning task and the execution result data address of the deep learning task allocated to each computing node to the CPU of that computing node; the CPU of each computing node acquires the training data for the deep learning task from the shared memory according to the training data address and sends it to the GPUs; the GPUs of each computing node perform the deep learning training and send the execution result data of the training to the CPU; and the CPU of each computing node uploads the execution result data of the deep learning task to the shared memory according to the execution result data address.
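The per-node execution flow above can be summarized in the following sketch; the download/upload helpers (Ethernet I/O against the shared memory) and the GPU training routine are injected as placeholders because the patent does not fix concrete APIs.

```python
def run_subtask_on_node(training_data_address, result_data_address,
                        download, upload, train_on_gpu):
    """Minimal sketch of one compute node's execution flow (illustrative assumptions)."""
    # CPU side: fetch the training data from the shared memory over the 10G Ethernet.
    batches = download(training_data_address)
    # GPU side: run the accelerated parallel computation; gradient/data synchronization
    # with peer GPUs happens over the 40G IB network inside `train_on_gpu`.
    results = train_on_gpu(batches)
    # CPU side: upload the execution result data back to the shared memory.
    upload(result_data_address, results)
```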
It should be noted that the foregoing method embodiments are described as a series of acts or combinations for simplicity in explanation, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Fig. 3 is a structural diagram of a task processing device of a heterogeneous cluster according to the present invention, where the device may be disposed in a cluster resource management module to complete operations in the method embodiment shown in fig. 2. As shown in fig. 3, includes:
a receiving module 301, configured to receive a deep learning task;
a scheduling module 302, configured to allocate a compute node and resources on the compute node for the deep learning task;
and the execution module 303 is configured to trigger each computing node to execute the deep learning task.
In a preferred implementation of the receiving module 301,
the receiving module 301 receives deep learning task information input by a user through a front-end page, which may include, but is not limited to: the computation graph to be executed for deep learning, the number of nodes for executing the deep learning task, the deep learning library interface that needs to be called for executing the deep learning, the training data address of the deep learning task, and the execution result data address of the deep learning task.
In a preferred implementation of the scheduling module 302,
the scheduling module 302 determines the number of nodes for the deep learning task from the deep learning task information, allocates the computing nodes in the cluster that will execute the deep learning task, calls the deep learning library through the deep learning library interface included in the deep learning task information, and starts the subtasks corresponding to the deep learning task on the allocated computing nodes respectively.
GPU resources on the allocated computing nodes are allocated according to the subtasks started on them and bound with the subtasks. Preferably, the resources remaining on a computing node are allocated to other subtasks when other tasks' resource requests are served.
Starting the subtasks corresponding to the deep learning task on each allocated computing node further comprises:
the scheduling module 302 receives the host names and port numbers returned by the computing nodes, generates a subtask network list according to the subtasks corresponding to the computing nodes and the returned host names and port numbers, and sends the subtask network list to each computing node, so that each computing node establishes connections among the subtasks according to the subtask network list.
In a preferred implementation of this embodiment, the subtask types may include, but are not limited to: a parameter server subtask, a trainer subtask.
In a preferred implementation of the execution module 303,
the execution module 303 generates, according to the deep learning task, a task instruction instructing the scheduled computing nodes to perform distributed training of the deep learning network, and triggers the subtasks started in each computing node to acquire the training data for the deep learning task from the shared memory, perform the deep learning computation, and store the execution result data of the deep learning task in the shared memory.
The execution module 303 sends the training data address of the deep learning task and the execution result data address of the deep learning task allocated to each computing node to the CPU of that computing node; the CPU of each computing node acquires the training data for the deep learning task from the shared memory according to the training data address and sends it to the GPUs; the GPUs of each computing node perform the deep learning training and send the execution result data of the training to the CPU; and the CPU of each computing node uploads the execution result data of the deep learning task to the shared memory according to the execution result data address.
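As an illustrative composition only, the three modules could be wired together inside the cluster resource management module roughly as follows; the class and method names (receive/allocate/trigger) are assumptions, not interfaces specified by the patent.

```python
class TaskProcessingDevice:
    """Illustrative wiring of the receiving, scheduling and execution modules."""
    def __init__(self, receiving_module, scheduling_module, execution_module):
        self.receiving = receiving_module    # receiving module 301
        self.scheduling = scheduling_module  # scheduling module 302
        self.execution = execution_module    # execution module 303

    def handle(self):
        task = self.receiving.receive()            # task info from the front-end page
        bindings = self.scheduling.allocate(task)  # compute nodes + GPU resources bound to the task
        self.execution.trigger(task, bindings)     # each node trains and writes results back
```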
By adopting the scheme provided by the invention, the heterogeneous cluster can be used for parallel computation, the acceleration ratio is improved, and the execution efficiency of the deep learning task is greatly improved.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Fig. 4 illustrates a block diagram of an exemplary computer system/server 012 suitable for use in implementing embodiments of the invention. The computer system/server 012 shown in fig. 4 is only an example, and should not bring any limitation to the function and the scope of use of the embodiment of the present invention.
As shown in fig. 4, the computer system/server 012 is embodied as a general purpose computing device. The components of computer system/server 012 may include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 that couples various system components including the system memory 028 and the processing unit 016.
Bus 018 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Computer system/server 012 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 012 and includes both volatile and nonvolatile media, removable and non-removable media.
System memory 028 can include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)030 and/or cache memory 032. The computer system/server 012 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 034 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be connected to bus 018 via one or more data media interfaces. Memory 028 can include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the present invention.
Program/utility 040 having a set (at least one) of program modules 042 can be stored, for example, in memory 028, such program modules 042 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof might include an implementation of a network environment. Program modules 042 generally perform the functions and/or methodologies of embodiments of the present invention as described herein.
The computer system/server 012 may also communicate with one or more external devices 014 (e.g., a keyboard, a pointing device, a display 024, etc.), with one or more devices that enable a user to interact with the computer system/server 012, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer system/server 012 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 022. Also, the computer system/server 012 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 020. As shown in fig. 4, the network adapter 020 communicates with the other modules of the computer system/server 012 via bus 018. It should be appreciated that although not shown in fig. 4, other hardware and/or software modules may be used in conjunction with the computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 016 executes the programs stored in the system memory 028, thereby performing the functions and/or methods of the described embodiments of the present invention.
The computer program described above may be provided in a computer storage medium encoded with a computer program that, when executed by one or more computers, causes the one or more computers to perform the method flows and/or apparatus operations shown in the above-described embodiments of the invention.
With the development of time and technology, the meaning of media is more and more extensive, and the propagation path of computer programs is not limited to tangible media any more, and can also be downloaded from a network directly and the like. Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. A heterogeneous cluster, comprising: a plurality of computing nodes, a cluster resource management module, an Ethernet switch and an IB network switch; wherein:
the computing nodes are used for performing parallel computing; each computing node comprises a GPU and a CPU;
the cluster resource management module is used for distributing computing nodes for executing the deep learning task in a cluster according to the deep learning task information and respectively starting subtasks corresponding to the deep learning task on the distributed computing nodes; distributing GPU resources on the distributed computing nodes according to the subtasks started on the distributed computing nodes; sending the training data address of the deep learning task and the execution result data address of the deep learning task which are distributed to each computing node to the CPU of each computing node;
the CPU of each computing node is interconnected with CPUs of other computing nodes, the cluster resource management module and the shared memory through the Ethernet switch, and acquires training data from the shared memory according to the training data address and sends the training data to the GPU; acquiring execution result data of deep learning training performed by a GPU, and uploading the execution result data to a shared memory according to the execution result data address;
the GPUs of each computing node are interconnected through an IB network switch, and the IB network switch is only connected with the GPUs of each computing node in the cluster; and the training tasks distributed by the GPUs realize the rapid synchronization of data through an IB network.
2. The heterogeneous cluster of claim 1, wherein each computing node employs dual CPUs and multiple GPUs; and the CPU in each computing node communicates with the GPUs through a PCIE interface.
3. The heterogeneous cluster of claim 1,
the CPU is used for carrying out logic control on the GPU in the computing node where the CPU is located and controlling the computation which needs to be executed on the GPU;
the GPU is used to perform parallel computations that require acceleration.
4. The heterogeneous cluster of claim 1, wherein the cluster resource management module is further configured to bind the allocated GPU resources with a training task.
5. A task processing method applied to the heterogeneous cluster of any one of claims 1 to 4, comprising:
receiving a training task;
distributing a computing node and GPU resources on the computing node for the training task;
and triggering each computing node to execute the training task.
6. The task processing method of claim 5, wherein receiving a training task comprises: receiving the training task input by the user through the front-end page.
7. The task processing method of claim 5, wherein the allocating computing nodes and resources on computing nodes for the training task comprises:
scheduling computing nodes in the heterogeneous cluster and distributing training tasks;
and allocating GPU resources on the scheduled computing nodes, and binding the GPU resources with the training tasks.
8. The task processing method of claim 5, wherein the triggering each computing node to execute the training task comprises:
generating, according to the training task, a task instruction instructing the scheduled computing nodes to perform parallel computing, and triggering each computing node to acquire data for the training task from the shared memory, perform the parallel computing, and store the execution result data of the training task in the shared memory.
9. A task processing device applied to the heterogeneous cluster according to any one of claims 1 to 4, comprising:
the receiving module is used for receiving the training task;
the scheduling module is used for distributing the computing nodes and GPU resources on the computing nodes for the training tasks;
and the execution module is used for triggering each computing node to execute the training task.
10. The task processing device according to claim 9, wherein the receiving module is specifically configured to:
and receiving the training task input by the user through the front page.
11. The task processing device according to claim 9, wherein the scheduling module is specifically configured to:
scheduling computing nodes in the heterogeneous cluster and distributing training tasks;
and allocating GPU resources on the scheduled computing nodes, and binding the GPU resources with the training tasks.
12. The task processing device according to claim 9, wherein the execution module is specifically configured to:
generate, according to the training task, a task instruction instructing the scheduled computing nodes to perform parallel computing, trigger each computing node to acquire data for the training task from the shared memory, perform the parallel computing, and store the execution result data of the training task in the shared memory.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method as claimed in claim 5 when executing the program.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of claim 5.
CN201710775681.2A 2017-08-31 2017-08-31 Heterogeneous cluster and task processing method and device Active CN107766148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710775681.2A CN107766148B (en) 2017-08-31 2017-08-31 Heterogeneous cluster and task processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710775681.2A CN107766148B (en) 2017-08-31 2017-08-31 Heterogeneous cluster and task processing method and device

Publications (2)

Publication Number Publication Date
CN107766148A CN107766148A (en) 2018-03-06
CN107766148B true CN107766148B (en) 2021-02-19

Family

ID=61265338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710775681.2A Active CN107766148B (en) 2017-08-31 2017-08-31 Heterogeneous cluster and task processing method and device

Country Status (1)

Country Link
CN (1) CN107766148B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110389824A (en) * 2018-04-20 2019-10-29 伊姆西Ip控股有限责任公司 Handle method, equipment and the computer program product of calculating task
CN109062700A (en) * 2018-08-21 2018-12-21 郑州云海信息技术有限公司 A kind of method for managing resource and server based on distributed system
CN111274023B (en) * 2018-12-05 2022-11-22 上海寒武纪信息科技有限公司 Data processing method, device, computer system and storage medium
CN109976911B (en) * 2019-03-25 2021-04-20 哈尔滨工程大学 Self-adaptive resource scheduling method
CN109995862B (en) * 2019-03-29 2021-10-15 北京百度网讯科技有限公司 Resource scheduling method and terminal
CN109992422A (en) * 2019-04-11 2019-07-09 北京朗镜科技有限责任公司 A kind of method for scheduling task towards GPU resource, device and system
CN110399222B (en) * 2019-07-25 2022-01-21 北京邮电大学 GPU cluster deep learning task parallelization method and device and electronic equipment
CN110515732B (en) * 2019-08-23 2021-06-18 中国人民解放军国防科技大学 Task allocation method based on deep learning inference of resource-constrained robot
CN111147603A (en) * 2019-09-30 2020-05-12 华为技术有限公司 Method and device for networking reasoning service
CN110889492B (en) * 2019-11-25 2022-03-08 北京百度网讯科技有限公司 Method and apparatus for training deep learning models
CN112965809A (en) * 2019-12-12 2021-06-15 深圳市优必选科技股份有限公司 Deep learning task processing system and method
CN111159078B (en) * 2019-12-30 2022-05-06 联想长风科技(北京)有限公司 Electronic equipment
US11620502B2 (en) * 2020-01-30 2023-04-04 Alibaba Group Holding Limited Hyper-square implementation of tree AllReduce algorithm for distributed parallel deep learning
US11561840B2 (en) * 2020-01-30 2023-01-24 Alibaba Group Holding Limited Efficient inter-chip interconnect topology for distributed parallel deep learning
US11520640B2 (en) * 2020-01-30 2022-12-06 Alibaba Group Holding Limited Efficient and more advanced implementation of ring-AllReduce algorithm for distributed parallel deep learning
CN113204412A (en) * 2020-01-31 2021-08-03 伊姆西Ip控股有限责任公司 Method, electronic device, and computer storage medium for task scheduling
CN111327692A (en) * 2020-02-05 2020-06-23 北京百度网讯科技有限公司 Model training method and device and cluster system
CN111415007B (en) * 2020-03-26 2023-01-17 中科寒武纪科技股份有限公司 Method and device for calculating data, board card and computer readable storage medium
CN111680791B (en) * 2020-06-16 2023-04-18 北京字节跳动网络技术有限公司 Communication method, device and system suitable for heterogeneous environment
CN112148453A (en) * 2020-09-29 2020-12-29 深圳致星科技有限公司 Computing chip for privacy computation and network computing system
CN114827151A (en) * 2022-05-20 2022-07-29 合肥边缘智芯科技有限公司 Heterogeneous server cluster and data forwarding method, device and equipment
CN114866510B (en) * 2022-05-25 2023-06-30 山东省计算中心(国家超级计算济南中心) Cross-network and off-site interconnection communication method and system based on InfiniBand network
CN116980420B (en) * 2023-09-22 2023-12-15 新华三技术有限公司 Cluster communication method, system, device, equipment and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105227669A (en) * 2015-10-15 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of aggregated structure system of CPU and the GPU mixing towards degree of depth study

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102135949B (en) * 2011-03-01 2013-06-19 浪潮(北京)电子信息产业有限公司 Computing network system, method and device based on graphic processing unit
CN102541804B (en) * 2011-12-26 2014-04-02 中国人民解放军信息工程大学 Multi-GPU (graphic processing unit) interconnection system structure in heterogeneous system
US9275498B2 (en) * 2012-08-09 2016-03-01 Qualcomm Incorporated GPU-accelerated path rendering
WO2014094410A1 (en) * 2012-12-20 2014-06-26 中国科学院近代物理研究所 Particle flow simulation system and method
CN106462498B (en) * 2014-06-23 2019-08-02 利奇德股份有限公司 Modularization architecture for exchanging for data-storage system
US9298769B1 (en) * 2014-09-05 2016-03-29 Futurewei Technologies, Inc. Method and apparatus to facilitate discrete-device acceleration of queries on structured data
CN106529682A (en) * 2016-10-28 2017-03-22 北京奇虎科技有限公司 Method and apparatus for processing deep learning task in big-data cluster

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105227669A (en) * 2015-10-15 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of aggregated structure system of CPU and the GPU mixing towards degree of depth study

Also Published As

Publication number Publication date
CN107766148A (en) 2018-03-06

Similar Documents

Publication Publication Date Title
CN107766148B (en) Heterogeneous cluster and task processing method and device
CN108537543B (en) Parallel processing method, device, equipment and storage medium for blockchain data
US10552161B2 (en) Cluster graphical processing unit (GPU) resource sharing efficiency by directed acyclic graph (DAG) generation
US9830677B2 (en) Graphics processing unit resource sharing
US10776144B2 (en) Address space management with respect to a coherent accelerator processor interface architecture
US10310908B2 (en) Dynamic usage balance of central processing units and accelerators
US9875124B2 (en) Data assignment and data scheduling for physical machine in a virtual machine environment
JP2020537784A (en) Machine learning runtime library for neural network acceleration
US9501318B2 (en) Scheduling and execution of tasks based on resource availability
CN107678752B (en) Task processing method and device for heterogeneous cluster
US10109030B1 (en) Queue-based GPU virtualization and management system
CN107025256B (en) Method and system for reducing reactivation time of cloud-based services
US20180341516A1 (en) Processing jobs using task dependencies
US20180067764A1 (en) Smart reduce task scheduler
US10031781B2 (en) Estimating job start times on workload management systems
US20140152680A1 (en) System and method for efficient resource management of a signal flow programmed digital signal processor code
CN110909527B (en) Text processing model running method and device, electronic equipment and storage medium
US10783003B2 (en) Method, device, and computer readable medium for managing dedicated processing resources
US10409762B2 (en) Remote direct memory access-based on static analysis of asynchronous blocks
US20180210860A1 (en) System, method and computer program product for dense/sparse linear system solver accelerator
CN111767059A (en) Deployment method and device of deep learning model, electronic equipment and storage medium
US10228982B2 (en) Hyper-threaded processor allocation to nodes in multi-tenant distributed software systems
US9176910B2 (en) Sending a next request to a resource before a completion interrupt for a previous request
EP4295229A1 (en) Asynchronous distributed data flow for machine learning workloads
US20210073033A1 (en) Memory management using coherent accelerator functionality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant