CN109086134A - Method and device for running a deep learning job - Google Patents

Method and device for running a deep learning job Download PDF

Info

Publication number
CN109086134A
CN109086134A CN201810793520.0A
Authority
CN
China
Prior art keywords
deep learning
compute node
docker
image
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810793520.0A
Other languages
Chinese (zh)
Inventor
袁绍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201810793520.0A
Publication of CN109086134A
Legal status: Pending (current)

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a method and device for running a deep learning job. The method comprises: receiving a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job; scheduling the deep learning job according to the usage and load of the compute nodes; when the deep learning job is scheduled to compute nodes, pushing the user-selected docker image from the image repository and creating a docker container on each compute node in the cluster; and mapping the hardware resources that each compute node allocates according to the deep learning job into the docker image, and running the deep learning job using the hardware resources mapped into the docker image and the docker container. In this way, the time and effort a user spends running deep learning jobs on a cluster are reduced.

Description

Method and device for running a deep learning job
Technical field
The present invention relates to computer clusters, and in particular to a method and device for running a deep learning job.
Background art
The concept of deep learning originates from research on artificial neural networks; a multilayer perceptron with several hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of data. The mainstream deep learning frameworks at present include tensorflow, caffe, pytorch and mxnet. A cluster is a high-performance, highly scalable and cost-effective computer cluster formed by interconnecting a group of computer systems through a high-performance network or a local area network so that they present a single system image. As cluster systems are widely used in scientific computing, commercial operations and other fields, the role they play becomes more and more important, and they are increasingly becoming an indispensable tool in these fields.
When a cluster is applied to deep learning, the deep learning jobs require a large amount of computation, so the cluster needs a large number of compute nodes to provide abundant hardware resources (for example, GPU (Graphics Processing Unit) resources). However, because the number of cluster nodes is large, it is difficult to schedule and provide the hardware resources in a unified way, so the utilization of the cluster's hardware resources is low, and scheduling the hardware resources of the cluster's nodes costs the user a great deal of time and effort. In addition, different deep learning frameworks have different framework dependencies, and the user has to configure a different training environment for each framework before model training, which also takes considerable time and effort.
Summary of the invention
In order to solve the above technical problems, the present invention provides a method and device for running a deep learning job, which can reduce the time and effort a user spends running deep learning jobs on a cluster.
To achieve the above objective, in one aspect, an embodiment of the present invention provides a method for running a deep learning job, the method comprising:
receiving a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job;
scheduling the deep learning job according to the usage and load of the compute nodes;
when the deep learning job is scheduled to compute nodes, pushing the user-selected docker image from the image repository and creating a docker container on each compute node in the cluster;
mapping the hardware resources that the compute node allocates according to the deep learning job into the docker image, and running the deep learning job using the hardware resources mapped into the docker image and the docker container.
Further, in an alternative embodiment, the required resources include:
the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
Further, in an alternative embodiment, the compute nodes and the management node in the cluster share stored files by means of the Network File System (NFS);
after the step of running the deep learning job using the hardware resources mapped into the docker image and the docker container, the method further includes:
storing the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node.
Further, in an alternative embodiment, after the step of creating a docker container on each compute node in the cluster, the method further includes:
configuring the cluster using the overlay network tool flannel.
To achieve the above objective, in another aspect, an embodiment of the present invention provides a device for running a deep learning job, the device comprising a user selection receiving module, a job scheduling module, a container creation module and a job running module; wherein,
the user selection receiving module is configured to receive a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job;
the job scheduling module is configured to schedule the deep learning job according to the usage and load of the compute nodes;
the container creation module is configured to, when the deep learning job is scheduled to compute nodes, push the user-selected docker image from the image repository and create a docker container on each compute node in the cluster;
the job running module is configured to map the hardware resources that the compute node allocates according to the deep learning job into the docker image, and to run the deep learning job using the hardware resources mapped into the docker image and the docker container.
Further, in an alternative embodiment, the required resources include:
the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
Further, in an alternative embodiment, the compute nodes and the management node in the cluster share stored files by means of the Network File System (NFS);
the device further includes a model file storage module, which is configured to: after the job running module runs the deep learning job using the hardware resources mapped into the docker image and the docker container, store the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node.
Further, in an alternative embodiment, the device further includes a cluster configuration module, which is configured to: after the container creation module creates a docker container on each compute node in the cluster, configure the cluster using the overlay network tool flannel.
The beneficial effect of the embodiments of the present invention is that the user can select, through a client, the hardware resources required to run a deep learning job, and the CPU and GPU resources among the hardware resources are allocated dynamically by the scheduler software. This ensures high utilization of the cluster's hardware resources and reduces the time and effort the user spends scheduling them. Different deep learning frameworks can run conveniently and efficiently on the whole cluster, so the user does not have to configure a different framework environment for each framework; and because the deep learning job runs in a docker container at the bottom layer, dependency conflicts between different frameworks are avoided. The time and effort the user spends configuring environments are thereby reduced.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description, or may be understood by practicing the invention. The objectives and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims and the accompanying drawings.
Brief description of the drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present invention and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solution of the present invention and do not constitute a limitation of it.
Fig. 1 is a flowchart of a method for running a deep learning job provided by an embodiment of the present invention;
Fig. 2 is a block diagram of a device for running a deep learning job provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, provided there is no conflict, the embodiments in the present application and the features in the embodiments may be combined with one another in any way.
The steps shown in the flowchart of the accompanying drawings may be executed in a computer system such as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one given here.
In one aspect, the embodiments of the present invention provide a method for running a deep learning job. As shown in Fig. 1, the method includes steps S101 to S107.
Step S101: receive a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job.
Here, the user selects, through a web page of the client, the resources required to run the deep learning job and the docker image of the deep learning job, and may also select or enter a training script. The client uses the B/S (Browser/Server) architecture, a network structure that emerged with the rise of the web, in which the web browser is the main application software on the client side. This mode unifies the client side and concentrates the core of the system's functionality on the server, which simplifies the development, maintenance and use of the system. The client only needs a browser installed, such as Netscape Navigator or Internet Explorer, while the server runs a database such as SQL Server, Oracle or MySQL. The browser exchanges data with the database through the web server.
After this, the client sends a request to the management node in the cluster; the request may be an HTTP (HyperText Transfer Protocol) request. Upon receiving the request, the management node forwards it to the scheduler software.
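The patent does not specify the format of this request. Purely as an illustrative sketch, a job submission from the client to the management node might look like the following, where the endpoint path, port and field names are assumptions rather than part of the disclosure:

```python
import requests

# Hypothetical job-submission payload: the fields mirror the resources the user
# selects in the web page (CPU, GPU, framework type, queue) plus the docker
# image and training script.
job_request = {
    "cpu_cores": 8,
    "gpus": 2,
    "framework": "tensorflow",
    "queue": "gpu_queue",
    "docker_image": "registry.example.com:5000/tensorflow-gpu:latest",
    "train_script": "/share/jobs/train.py",
}

# The management-node address and API endpoint are assumed for illustration only.
response = requests.post("http://mgmt-node:8080/api/jobs", json=job_request, timeout=10)
response.raise_for_status()
print("submitted job id:", response.json().get("job_id"))
```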
Step S103: schedule the deep learning job according to the usage and load of the compute nodes.
Here, the resource management software TORQUE works with the job scheduler software Maui to schedule the deep learning job according to the usage and load of each compute node in the cluster, assigning the deep learning job to the compute nodes, and each compute node provides the hardware resources required to run the deep learning job.
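The disclosure does not give a concrete submission command for TORQUE/Maui. As a hedged sketch only, a job of this kind might be submitted from the management node roughly as follows; the queue name, resource line and script path are assumptions, and the exact qsub options available depend on the TORQUE installation:

```python
import subprocess
import textwrap

# A minimal PBS/TORQUE job script; Maui then places the job on compute nodes
# according to their current usage and load.
pbs_script = textwrap.dedent("""\
    #!/bin/bash
    #PBS -N dl-job
    #PBS -q gpu_queue
    #PBS -l nodes=1:ppn=8:gpus=2
    #PBS -l walltime=24:00:00
    cd $PBS_O_WORKDIR
    python /share/jobs/train.py
""")

with open("dl_job.pbs", "w") as f:
    f.write(pbs_script)

# qsub prints the new job id on stdout when submission succeeds.
job_id = subprocess.check_output(["qsub", "dl_job.pbs"], text=True).strip()
print("submitted:", job_id)
```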
Step S105: when the deep learning job is scheduled to compute nodes, push the user-selected docker image from the image repository to each compute node in the cluster, and create a docker container on each compute node in the cluster.
Here, the user-selected docker image is pushed to each compute node, and a docker container is created on each compute node that will execute the deep learning job.
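The patent does not state which API distributes the image. One plausible sketch, assuming the docker-py client is available on a compute node and that the cluster runs a private registry at an illustrative address, is simply to pull the selected image:

```python
import docker

# Connect to the local docker daemon on the compute node.
client = docker.from_env()

# Pull the user-selected image from the cluster's private image repository;
# the registry address and tag are illustrative assumptions.
image = client.images.pull("registry.example.com:5000/tensorflow-gpu", tag="latest")
print("pulled image:", image.tags)
```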
Step S107: map the hardware resources that the compute node allocates according to the deep learning job into the docker image, and run the deep learning job using the hardware resources mapped into the docker image and the docker container.
Docker is an open-source application container engine that lets developers package their application (here, the deep learning job) and its dependencies into a portable container, which can then be published to any popular Linux machine; virtualization can also be achieved. Containers use a sandbox mechanism and have no interfaces to one another. Because docker does not depend on any particular language, framework or system, running the deep learning job in docker at the bottom layer avoids conflicts between the framework dependencies (framework dependency packages) of different deep learning frameworks.
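As an illustration of step S107 (a sketch under assumptions, not the patent's prescribed implementation), a compute node could start the container with the allocated GPUs and the job's working directory mapped in roughly as follows; the image name, paths and GPU indices are hypothetical, and GPU visibility is assumed to be controlled through the nvidia container runtime's NVIDIA_VISIBLE_DEVICES variable:

```python
import docker

client = docker.from_env()

# Map the allocated hardware resources into the container: the job's working
# directory is mounted as a volume, and the GPUs assigned by the scheduler are
# exposed through the nvidia container runtime.
container = client.containers.run(
    "registry.example.com:5000/tensorflow-gpu:latest",
    command="python /workspace/train.py",
    runtime="nvidia",                               # assumes nvidia-docker2 is installed
    environment={"NVIDIA_VISIBLE_DEVICES": "0,1"},  # GPUs allocated to this job
    volumes={"/share/jobs/job-001": {"bind": "/workspace", "mode": "rw"}},
    detach=True,
)
print("started container:", container.short_id)
```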
The beneficial effect of the embodiments of the present invention is that the user can select, through a client, the hardware resources required to run a deep learning job, and the CPU and GPU resources among the hardware resources are allocated dynamically by the scheduler software. This ensures high utilization of the cluster's hardware resources and reduces the time and effort the user spends scheduling them. Different deep learning frameworks can run conveniently and efficiently on the whole cluster, so the user does not have to configure a different framework environment for each framework; and because the deep learning job runs in a docker container at the bottom layer, dependency conflicts between different frameworks are avoided. The time and effort the user spends configuring environments are thereby reduced.
Further, in an alternative embodiment, the required resources include: the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
Further, in an alternative embodiment, the compute nodes and the management node in the cluster share stored files by means of NFS (Network File System). NFS is one of the file systems supported by FreeBSD; it allows computers in a network to share resources with one another over a TCP/IP network.
After step S107, the method further includes: storing the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node. The user can then obtain the model file from the management node.
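The NFS configuration itself is not detailed in the disclosure. A minimal sketch of how the shared directory might be set up is given below; the shared path, export options, network range and host name are assumptions, and the commands used are the standard Linux NFS tools:

```python
import subprocess

SHARE_DIR = "/share"       # assumed shared directory for model files
MGMT_HOST = "mgmt-node"    # assumed management-node host name


def export_share_on_management_node() -> None:
    """Run on the management node: export the shared directory to the cluster network."""
    with open("/etc/exports", "a") as exports:
        exports.write(f"{SHARE_DIR} 192.168.0.0/24(rw,sync,no_root_squash)\n")
    subprocess.check_call(["exportfs", "-ra"])


def mount_share_on_compute_node() -> None:
    """Run on a compute node: mount the export so model files reach the management node."""
    subprocess.check_call(["mount", "-t", "nfs", f"{MGMT_HOST}:{SHARE_DIR}", SHARE_DIR])
```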
Further, in one embodiment, after step S105 the method further includes: configuring the cluster using the overlay network tool flannel.
When docker containers are created on the compute nodes, the docker containers on two different compute nodes cannot communicate with each other because of the nature of docker containers. The cluster is therefore configured by deploying the overlay network tool flannel, which plans the IP addresses of the docker containers so that docker containers on different compute nodes can communicate with one another. The working directory is mapped in from the compute node that serves as the docker host, the GPU resource mapping is set, and the GPU runtime environment is configured.
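The flannel deployment is not spelled out in the patent. As a hedged sketch of one common pattern (assumed here, not taken from the disclosure), the overlay-network definition is written into etcd, flanneld runs on every node, and the docker daemon is then restarted to use the subnet that flannel assigns:

```python
import json
import subprocess

# Overlay-network definition that flanneld reads from etcd; the CIDR and the
# vxlan backend are illustrative choices, not values taken from the patent.
flannel_config = {"Network": "10.10.0.0/16", "Backend": {"Type": "vxlan"}}

# Store the configuration in etcd (etcd v2 key layout assumed).
subprocess.check_call(
    ["etcdctl", "set", "/coreos.com/network/config", json.dumps(flannel_config)]
)

# Once flanneld is running on a node it writes its subnet lease to
# /run/flannel/subnet.env; restarting the docker daemon with that subnet makes
# container IP addresses routable across compute nodes.
```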
In another aspect, the embodiments of the present invention provide a device for running a deep learning job. As shown in Fig. 2, the device includes a user selection receiving module 201, a job scheduling module 203, a container creation module 205 and a job running module 207; wherein,
the user selection receiving module 201 is configured to receive a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job;
the job scheduling module 203 is configured to schedule the deep learning job according to the usage and load of the compute nodes;
the container creation module 205 is configured to, when the deep learning job is scheduled to compute nodes, push the user-selected docker image from the image repository and create a docker container on each compute node in the cluster;
the job running module 207 is configured to map the hardware resources that the compute node allocates according to the deep learning job into the docker image, and to run the deep learning job using the hardware resources mapped into the docker image and the docker container.
The beneficial effect of the embodiments of the present invention is that the user can select, through a client, the hardware resources required to run a deep learning job, and the CPU and GPU resources among the hardware resources are allocated dynamically by the scheduler software. This ensures high utilization of the cluster's hardware resources and reduces the time and effort the user spends scheduling them. Different deep learning frameworks can run conveniently and efficiently on the whole cluster, so the user does not have to configure a different framework environment for each framework; and because the deep learning job runs in a docker container at the bottom layer, dependency conflicts between different frameworks are avoided. The time and effort the user spends configuring environments are thereby reduced.
Further, in an alternative embodiment, the required resources include:
the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
Further, in an alternative embodiment, the compute nodes and the management node in the cluster share stored files by means of the Network File System (NFS);
the device further includes a model file storage module, which is configured to: after the job running module 207 runs the deep learning job using the hardware resources mapped into the docker image and the docker container, store the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node.
Further, in an alternative embodiment, the device further includes a cluster configuration module, which is configured to: after the container creation module creates a docker container on each compute node in the cluster, configure the cluster using the overlay network tool flannel.
Although the embodiments disclosed herein are as described above, the foregoing is provided only to facilitate understanding of the present invention and is not intended to limit it. Any person skilled in the field to which the present invention pertains may make modifications and variations in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be subject to the scope defined by the appended claims.

Claims (8)

1. A method for running a deep learning job, characterized by comprising:
receiving a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job;
scheduling the deep learning job according to the usage and load of compute nodes;
when the deep learning job is scheduled to compute nodes, pushing the user-selected docker image from an image repository and creating a docker container on each compute node in the cluster;
mapping the hardware resources that the compute node allocates according to the deep learning job into the docker image, and running the deep learning job using the hardware resources mapped into the docker image and the docker container.
2. The method according to claim 1, wherein the required resources comprise:
the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
3. The method according to claim 1, wherein the compute nodes and a management node in the cluster share stored files by means of the Network File System (NFS);
after the step of running the deep learning job using the hardware resources mapped into the docker image and the docker container, the method further comprises:
storing the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node.
4. The method according to claim 1, wherein after the step of creating a docker container on each compute node in the cluster, the method further comprises:
configuring the cluster using the overlay network tool flannel.
5. A device for running a deep learning job, characterized by comprising: a user selection receiving module, a job scheduling module, a container creation module and a job running module; wherein,
the user selection receiving module is configured to receive a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job;
the job scheduling module is configured to schedule the deep learning job according to the usage and load of compute nodes;
the container creation module is configured to, when the deep learning job is scheduled to compute nodes, push the user-selected docker image from an image repository and create a docker container on each compute node in the cluster;
the job running module is configured to map the hardware resources that the compute node allocates according to the deep learning job into the docker image, and to run the deep learning job using the hardware resources mapped into the docker image and the docker container.
6. The device according to claim 5, wherein the required resources comprise:
the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
7. The device according to claim 5, wherein the compute nodes and a management node in the cluster share stored files by means of the Network File System (NFS);
the device further comprises a model file storage module, which is configured to: after the job running module runs the deep learning job using the hardware resources mapped into the docker image and the docker container, store the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node.
8. The device according to claim 5, characterized in that the device further comprises a cluster configuration module, which is configured to: after the container creation module creates a docker container on each compute node in the cluster, configure the cluster using the overlay network tool flannel.
CN201810793520.0A 2018-07-19 2018-07-19 Method and device for running a deep learning job Pending CN109086134A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810793520.0A CN109086134A (en) 2018-07-19 2018-07-19 Method and device for running a deep learning job

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810793520.0A CN109086134A (en) 2018-07-19 2018-07-19 Method and device for running a deep learning job

Publications (1)

Publication Number Publication Date
CN109086134A true CN109086134A (en) 2018-12-25

Family

ID=64837778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810793520.0A Pending CN109086134A (en) 2018-07-19 2018-07-19 Method and device for running a deep learning job

Country Status (1)

Country Link
CN (1) CN109086134A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120317447A1 (en) * 2011-06-07 2012-12-13 Microsoft Corporation Propagating unobserved exceptions in distributed execution environments
CN102880832A (en) * 2012-08-28 2013-01-16 曙光信息产业(北京)有限公司 Method for implementing mass data management system under colony
CN105224256A (en) * 2015-10-13 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of storage system
CN106708622A (en) * 2016-07-18 2017-05-24 腾讯科技(深圳)有限公司 Cluster resource processing method and system, and resource processing cluster
CN106790660A (en) * 2017-01-18 2017-05-31 咪咕视讯科技有限公司 A kind of dispositions method and device for realizing distributed memory system
CN107135257A (en) * 2017-04-28 2017-09-05 东方网力科技股份有限公司 Task is distributed in a kind of node cluster method, node and system
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
CN107450961A (en) * 2017-09-22 2017-12-08 济南浚达信息技术有限公司 A kind of distributed deep learning system and its building method, method of work based on Docker containers
CN107783818A (en) * 2017-10-13 2018-03-09 北京百度网讯科技有限公司 Deep learning task processing method, device, equipment and storage medium
CN108062246A (en) * 2018-01-25 2018-05-22 北京百度网讯科技有限公司 For the resource regulating method and device of deep learning frame

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109857475A (en) * 2018-12-27 2019-06-07 深圳云天励飞技术有限公司 A kind of method and device of frame management
CN109857475B (en) * 2018-12-27 2020-06-16 深圳云天励飞技术有限公司 Framework management method and device
CN109508238A (en) * 2019-01-05 2019-03-22 咪付(广西)网络技术有限公司 A kind of resource management system and method for deep learning
CN112114931B (en) * 2019-06-21 2023-12-26 富联精密电子(天津)有限公司 Deep learning program configuration method and device, electronic equipment and storage medium
CN112114931A (en) * 2019-06-21 2020-12-22 鸿富锦精密电子(天津)有限公司 Deep learning program configuration method and device, electronic equipment and storage medium
CN112148348A (en) * 2019-06-28 2020-12-29 杭州海康威视数字技术股份有限公司 Task processing method and device and storage medium
CN112148348B (en) * 2019-06-28 2023-10-20 杭州海康威视数字技术股份有限公司 Task processing method, device and storage medium
CN110688230B (en) * 2019-10-17 2022-06-24 广州文远知行科技有限公司 Synchronous training method and device, computer equipment and storage medium
CN110688230A (en) * 2019-10-17 2020-01-14 广州文远知行科技有限公司 Synchronous training method and device, computer equipment and storage medium
CN111090456A (en) * 2019-12-06 2020-05-01 浪潮(北京)电子信息产业有限公司 Construction method, device, equipment and medium for deep learning development environment
CN111190713A (en) * 2019-12-26 2020-05-22 曙光信息产业(北京)有限公司 Job scheduling management method and device
WO2021155667A1 (en) * 2020-02-05 2021-08-12 北京百度网讯科技有限公司 Model training method and apparatus, and clustering system
US11249749B2 (en) 2020-03-26 2022-02-15 Red Hat, Inc. Automatic generation of configuration files
CN112422651A (en) * 2020-11-06 2021-02-26 电子科技大学 Cloud resource scheduling performance bottleneck prediction method based on reinforcement learning
CN112416585A (en) * 2020-11-20 2021-02-26 南京大学 GPU resource management and intelligent scheduling method for deep learning
CN112416585B (en) * 2020-11-20 2024-03-15 南京大学 Deep learning-oriented GPU resource management and intelligent scheduling method
CN112203291A (en) * 2020-12-03 2021-01-08 中国科学院自动化研究所 Cluster control method for area coverage and connectivity maintenance based on knowledge embedding
CN113542352A (en) * 2021-06-08 2021-10-22 支付宝(杭州)信息技术有限公司 Node joint modeling method and node
CN113542352B (en) * 2021-06-08 2024-04-09 支付宝(杭州)信息技术有限公司 Node joint modeling method and node
CN114090183A (en) * 2021-11-25 2022-02-25 北京字节跳动网络技术有限公司 Application starting method and device, computer equipment and storage medium
WO2023174163A1 (en) * 2022-03-15 2023-09-21 之江实验室 Neural model storage system for brain-inspired computer operating system, and method

Similar Documents

Publication Publication Date Title
CN109086134A (en) Method and device for running a deep learning job
CN103516777B (en) Method and system for provisioning in a cloud computing environment
JP6659544B2 (en) Automated experimental platform
CN105593813B (en) Presentation interpreter for visualizing data provided from a constrained-environment container
US11481616B2 (en) Framework for providing recommendations for migration of a database to a cloud computing system
CN108958892A (en) Method and apparatus for creating a container for a deep learning job
Srinivasan et al. An overview of service-oriented architecture, web services and grid computing
US8719833B2 (en) Adaptive demand-driven load balancing
CN102760074B (en) Method and system for scalability of high-load business processes
US10178163B2 (en) Server-processor hybrid system for processing data
EP3032442B1 (en) Modeling and simulation of infrastructure architecture for big data
CN103425529A (en) System and method for migrating virtual machines between networked computing environments based on resource utilization
Marinescu Cloud Computing and Computer Clouds
Kale Guide to cloud computing for business and technology managers: from distributed computing to cloudware applications
CN102291445A (en) Cloud computing management system based on virtual resources
CN109117252A (en) Container-based task processing method and system, and container cluster management system
CN115202729A (en) Container service-based mirror image generation method, device, equipment and medium
US20090132582A1 (en) Processor-server hybrid system for processing data
JP5822414B2 (en) General-purpose simulation system using social network interface
Chen et al. Web-FEM: An internet-based finite-element analysis framework with 3D graphics and parallel computing environment
Willis et al. Container-based analysis environments for low-barrier access to research data
Rahmatulloh et al. Event-Driven Architecture to Improve Performance and Scalability in Microservices-Based Systems
Tahboub et al. Novel Approach for Remote Energy Meter Reading Using Mobile Agents
AU2015101031A4 (en) System and a method for modelling the performance of information systems
CN115361382A (en) Data processing method, device, equipment and storage medium based on data group

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181225