CN111400021B - Deep learning method, device and system - Google Patents


Info

Publication number
CN111400021B
Authority
CN
China
Prior art keywords
deep learning
task
cluster
task queue
big data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910000910.2A
Other languages
Chinese (zh)
Other versions
CN111400021A (en)
Inventor
丛鹏宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Priority to CN201910000910.2A
Publication of CN111400021A
Application granted
Publication of CN111400021B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, servers, terminals

Abstract

The invention provides a deep learning method, device and system. The deep learning method comprises the following steps: acquiring a deep learning task submitted by a user; allocating the deep learning task to a task queue corresponding to the type of the deep learning task; scheduling resources in a first resource sub-cluster and executing the deep learning tasks in a first task queue; and scheduling resources in a second resource sub-cluster through a control node of a big data platform and executing the deep learning tasks in a second task queue. The deep learning tasks in the first task queue are not based on the big data platform, and the deep learning tasks in the second task queue are based on the big data platform. Embodiments of the invention can support most deep learning frameworks, are well compatible with the big data platform, and reduce the network overhead of calling data on the big data platform.

Description

Deep learning method, device and system
Technical Field
The invention relates to the technical field of cloud computing, in particular to a deep learning method, device and system.
Background
At present, the graphics processing unit (GPU) is widely used in artificial-intelligence-related fields due to its strong computing power; in deep learning tasks in particular, the GPU can greatly accelerate model training and inference. For larger-scale data or larger models, however, training still consumes a long time even with single-GPU or multi-GPU acceleration, so a GPU server cluster is an indispensable component in artificial intelligence algorithm research and application.
Kubernetes (K8s) is currently a mainstream container orchestration and management tool and one of the key technologies of the containerization and microservice era. It has a strong community, develops rapidly, and effectively supports the isolation and scheduling of resources such as the CPU, memory and GPU.
Specifically, K8s+Docker is the resource scheduling scheme adopted by most current clustered deep learning systems and has inherent advantages in supporting open-source deep learning frameworks. However, the K8s+Docker scheme cannot be effectively compatible with a traditional big data platform such as Hadoop: calling data on the big data platform is complex and requires considerable network overhead.
Disclosure of Invention
Embodiments of the invention provide a deep learning method, device and system to solve the problem that existing deep learning resource scheduling schemes cannot be effectively compatible with a big data platform.
In a first aspect, an embodiment of the present invention provides a deep learning method, including:
acquiring a deep learning task submitted by a user;
allocating the deep learning task to a task queue corresponding to the type of the deep learning task;
scheduling resources in the first resource sub-cluster, and executing deep learning tasks in the first task queue;
scheduling resources in the second resource sub-cluster through a control node of the big data platform, and executing a deep learning task in a second task queue;
the type of the deep learning task in the first task queue is not based on a big data platform, and the type of the deep learning task in the second task queue is based on the big data platform.
In a second aspect, an embodiment of the present invention provides a deep learning apparatus, including:
the acquisition module is used for acquiring a deep learning task submitted by a user;
the distribution module is used for distributing the deep learning task to a task queue corresponding to the type of the deep learning task;
the first execution module is used for scheduling resources in the first resource sub-cluster and executing the deep learning tasks in the first task queue;
the second execution module is used for scheduling resources in the second resource sub-cluster through the control node of the big data platform and executing the deep learning task in the second task queue;
the type of the deep learning task in the first task queue is not based on a big data platform, and the type of the deep learning task in the second task queue is based on the big data platform.
In a third aspect, an embodiment of the present invention provides a deep learning system, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, may implement the steps of the deep learning method.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, can implement the steps of the deep learning method described above.
According to the deep learning method provided by the embodiment of the invention, a deep learning task submitted by a user is acquired and allocated to the task queue corresponding to its type; resources in the first resource sub-cluster are scheduled to execute the deep learning tasks in the first task queue, and resources in the second resource sub-cluster are scheduled through the control node of the big data platform to execute the deep learning tasks in the second task queue. Hybrid resource scheduling is thereby realized, integrating the execution of mainstream deep learning tasks with that of deep learning tasks based on the big data platform, so that most deep learning frameworks are supported, the big data platform is well accommodated, and the network overhead of calling data on the big data platform is reduced.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly described below. The drawings in the following description are obviously only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without inventive labor.
FIG. 1 is a flow chart of a deep learning method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating resource allocation according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a deep learning process according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a deep learning apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a deep learning system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The deep learning method in the embodiments of the invention is applicable both to deep learning tasks based on a big data platform (big data platform application scenarios) and to deep learning tasks not based on one, and realizes hybrid resource scheduling. It thus integrates the execution of mainstream deep learning tasks with that of big-data-platform-based tasks, supporting most deep learning frameworks while remaining well compatible with the big data platform and reducing the network overhead of calling data on it.
Referring to fig. 1, an embodiment of the present invention provides a deep learning method applied to a deep learning system, where the method includes the following steps:
step 101: and acquiring a deep learning task submitted by a user.
The type of the deep learning task in this step is either based on a big data platform or not. A task not based on a big data platform can be understood as a mainstream deep learning task.
Step 102: and distributing the deep learning task to a task queue corresponding to the type of the deep learning task.
It can be understood that, since a deep learning task either is or is not based on a big data platform, the task queues involved in the embodiment of the invention include two types: a task queue for big-data-platform-based tasks and a task queue for tasks not based on the big data platform.
In a specific implementation, the task allocation process of step 102 may be implemented by a task classifier in a deep learning system.
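A task classifier of this kind can be sketched in a few lines of Python. This is a minimal illustration only: the framework names, queue labels and dictionary-based task representation are assumptions for the example, not values specified by the embodiment.

```python
# Minimal sketch of the task classifier in steps 101-102: route a submitted
# deep learning task into the queue matching its type. The framework names
# below are illustrative assumptions, not an exhaustive list.
BIG_DATA_FRAMEWORKS = {"tensorflow-on-spark", "caffe-on-spark"}

# Second task queue (big-data-platform based) and first task queue (not).
queues = {"spark_yarn_queue": [], "k8s_docker_queue": []}

def classify_task(task):
    """Return the queue name for a task described by a dict with a 'framework' field."""
    if task["framework"] in BIG_DATA_FRAMEWORKS:
        return "spark_yarn_queue"
    return "k8s_docker_queue"

def submit(task):
    """Step 102: allocate the task to the queue corresponding to its type."""
    queues[classify_task(task)].append(task)
```

In this sketch, a task such as `{"framework": "mxnet"}` would land in the K8s+Docker queue; the real classifier may also consider the data type of the task.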
Step 103: and scheduling resources in the first resource sub-cluster, and executing the deep learning tasks in the first task queue.
Step 104: and scheduling the resources in the second resource sub-cluster through the control node of the big data platform, and executing the deep learning task in the second task queue.
The type of the deep learning task in the first task queue is not based on a big data platform, and the type of the deep learning task in the second task queue is based on the big data platform. The first resource sub-cluster may include at least one resource sub-cluster and the second resource sub-cluster may include at least one resource sub-cluster.
It can be understood that, to achieve compatibility with the big data platform, the whole resource cluster of the deep learning system may be divided into a plurality of resource sub-clusters. Each resource sub-cluster corresponds to one task queue (i.e., one task type; several resource sub-clusters may correspond to the same task queue), and the resources in a sub-cluster execute the deep learning tasks of its corresponding queue. In this way, tasks in the big-data-platform-based task queue are executed by the resource sub-clusters divided for them, and tasks in the queue not based on the big data platform by theirs, so that most deep learning frameworks are supported while compatibility with the big data platform is maintained.
Optionally, the resource in the embodiment of the present invention may be selected as a GPU resource.
According to the deep learning method provided by the embodiment of the invention, a deep learning task submitted by a user is acquired and allocated to the task queue corresponding to its type; resources in the first resource sub-cluster are scheduled to execute the deep learning tasks in the first task queue, and resources in the second resource sub-cluster are scheduled through the control node of the big data platform to execute the deep learning tasks in the second task queue. Hybrid resource scheduling is thereby realized, integrating the execution of mainstream deep learning tasks with that of deep learning tasks based on the big data platform, so that most deep learning frameworks are supported, the big data platform is well accommodated, and the network overhead of calling data on the big data platform is reduced.
In a specific implementation of the embodiment, the resource sub-clusters divided for deep learning tasks not based on the big data platform may be managed by a control node in the deep learning system (for example, a K8s master node), while the resource sub-clusters divided for big-data-platform-based tasks may be managed by the control node of the big data platform (for example, the Yarn master node of a Hadoop platform) after the control node in the deep learning system has configured their resources and network. Specifically, step 104 may include:
and submitting the deep learning tasks in the second task queue to a control node of a big data platform, scheduling the resources in the second resource sub-cluster by the control node of the big data platform, and executing the deep learning tasks in the second task queue.
Therefore, executing tasks through the control node of the big data platform can reduce the network overhead of calling data on the big data platform.
In the embodiment of the present invention, when the deep learning task in the task queue is executed, the deep learning task can be executed according to a First In First Out (FIFO) principle.
Optionally, the performing the deep learning task in the first task queue in step 103 may include:
and sequentially executing the deep learning tasks in the first task queue in a first-in first-out mode.
Optionally, the performing the deep learning task in the second task queue in step 104 may include:
sequentially executing the deep learning tasks in the second task queue in a first-in first-out manner.
Therefore, the tasks in the task queue are executed according to the FIFO principle, the prior tasks can be guaranteed to be executed preferentially, and the user requirements are met.
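The FIFO principle can be sketched directly with a double-ended queue (an illustration of the ordering only; the task names are placeholders):

```python
from collections import deque

# Tasks leave the queue in the order they arrived (FIFO), so the
# earliest-submitted task is always executed first.
task_queue = deque(["task-1", "task-2", "task-3"])

executed = []
while task_queue:
    executed.append(task_queue.popleft())  # take the oldest task
# executed is now ["task-1", "task-2", "task-3"]
```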
In the embodiment of the invention, in order to guarantee the execution efficiency of user-submitted deep learning tasks, one or more hybrid sub-clusters may be divided from the whole resource cluster in addition to the resource sub-clusters divided for each task type. The resources in a hybrid sub-cluster are candidate resources, to be called upon when the resources in some resource sub-cluster are insufficient.
Optionally, when the resource in the first resource sub-cluster cannot meet the requirement of the deep learning task in the first task queue in step 103, the method further includes:
scheduling resources in the third resource sub-cluster, and executing the deep learning task in the first task queue; wherein the resources in the third resource sub-cluster are candidate resources.
Optionally, when the resource in the second resource sub-cluster cannot meet the requirement of the deep learning task in the second task queue in step 104, the method further includes:
receiving a resource request message sent by a control node of a big data platform;
according to the resource request message, selecting resources from a fourth resource sub-cluster and distributing the resources to the second resource sub-cluster; wherein the resource in the fourth resource sub-cluster is a candidate resource;
and scheduling the resources in the second resource sub-cluster after the resources are allocated through the control node of the big data platform, and executing the deep learning task in the second task queue.
It is to be understood that the third resource sub-cluster and the fourth resource sub-cluster may or may not be identical. The third resource sub-cluster may include at least one partitioned resource sub-cluster, and the fourth resource sub-cluster may include at least one partitioned resource sub-cluster.
Therefore, dividing out candidate resources increases the resources available to deep learning tasks and guarantees the execution efficiency of the tasks submitted by users.
It will be appreciated that when the entire resource cluster is divided into a plurality of resource sub-clusters (including a hybrid sub-cluster), the resources in each resource sub-cluster may be dynamically adjusted based on actual demand.
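The expansion path of the second branch above — the control node of the big data platform sends a resource request message, and the system moves candidate resources into the second resource sub-cluster — can be sketched as follows. The pool sizes and the simple counter model are assumptions for illustration, not taken from the embodiment.

```python
# Hedged sketch of handling a resource request message: GPUs are moved
# from a candidate (fourth) pool into the second resource sub-cluster.
# The counts are illustrative.
pools = {"second_sub_cluster": 1, "candidate": 3}  # free GPU counts

def handle_resource_request(gpus_requested):
    """Grant as many candidate GPUs as possible to the second sub-cluster."""
    granted = min(gpus_requested, pools["candidate"])
    pools["candidate"] -= granted
    pools["second_sub_cluster"] += granted
    return granted
```

After the borrowed task completes, the control node would be notified to release the resources, moving the granted count back to the candidate pool.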
The deep learning process in the embodiment of the present invention is described below with reference to specific examples.
At present, the deep learning frameworks available on a Hadoop-based big data platform are mainly Caffe (a convolutional neural network framework) and Tensorflow (a machine learning library); such frameworks are few and poorly extensible. The deep learning frameworks based on K8s and Docker are many, such as Tensorflow, Caffe, MXNet, Torch, Theano and Kaldi, but they cannot be compatible with the big data platform. Accordingly, the specific embodiments of the invention realize a deep learning system based on hybrid K8s and Yarn resource scheduling, supporting the vast majority of deep learning frameworks while remaining well compatible with the big data platform.
In the specific example of the invention, taking K8s/Yarn hybrid resource scheduling of GPU resources as an example and as shown in fig. 2, the whole GPU cluster may be divided into three sub-clusters: a K8s sub-cluster, a hybrid sub-cluster and a Yarn sub-cluster. The K8s master node in the deep learning system performs uniform resource allocation management, and the resources in the Yarn sub-cluster are packaged into containers and provided to the Hadoop big data platform, so that the Yarn master node of the Hadoop platform schedules them.
The K8s sub-cluster is completely managed by the K8s master node and executes deep learning tasks scheduled via K8s+Docker. After the K8s master node configures its resources and network, the Yarn sub-cluster is completely managed by the Yarn master node of the Hadoop platform and executes tasks scheduled via Spark+Yarn. The hybrid sub-cluster executes both types of tasks, according to the FIFO principle, when the resources of the other two sub-clusters are insufficient. The number of GPUs in each of the three sub-clusters may be determined as appropriate.
The three types of sub-clusters may be managed as follows. When the K8s sub-cluster has received no task, its GPU resources are all idle; GPU resources are allocated when a new task is received and released after the task completes. For the Yarn sub-cluster, the K8s master node initially packages all of its GPU resources into a number of virtual machines and hands them to the Yarn master node for management. When the resources of the other two sub-clusters are insufficient (the GPU resources in the K8s sub-cluster are fully occupied, or the GPU resources in the Yarn sub-cluster cannot satisfy a Spark+Yarn-based deep learning task), the hybrid sub-cluster temporarily allocates resources to the new task according to its type, and the resources are released after the task completes.
Referring to fig. 3, the deep learning process in the embodiment of the present invention may include the following steps:
s1: a user submits a deep learning task (hereinafter referred to as a task);
s2: after receiving a task submitted by the user, the task classifier judges the task type according to the data type and the framework type, allocates big-data-platform-based tasks (for example, Tensorflow or Caffe tasks) to the Spark+Yarn task queue, and allocates the other tasks (not based on the big data platform) to the K8s+Docker task queue;
s3: for the K8s+Docker task queue, tasks are taken out sequentially according to the FIFO principle, i.e., in the order in which they entered the queue, and are submitted to the K8s master node. The K8s master node first issues a task to the K8s sub-cluster (i.e., calls GPU resources in the K8s sub-cluster to execute it); if the resources in the K8s sub-cluster are insufficient, the task is issued to the hybrid sub-cluster, and if the resources in the hybrid sub-cluster are also insufficient, the task enters a waiting state until resources suffice. Resources are released immediately after the task finishes;
s4: for the Spark+Yarn task queue, tasks are taken out sequentially according to the FIFO principle, i.e., in the order in which they entered the queue, and are submitted to the Yarn master node of the big data platform. Tasks not using the GPU are submitted to the original Spark cluster for execution. Tasks using the GPU are first submitted to the Yarn sub-cluster (i.e., GPU resources in the Yarn sub-cluster are called to execute them); if the resources in the Yarn sub-cluster are insufficient, the Yarn master node applies to the K8s master node to allocate resources from the hybrid sub-cluster to expand the Yarn sub-cluster, and if the resources in the hybrid sub-cluster are also insufficient, the task enters a waiting state until resources suffice. After a task involving the hybrid sub-cluster finishes, the K8s master node is immediately notified to release the resources.
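The common dispatch rule of steps s3 and s4 — try the sub-cluster matching the task's type first, fall back to the hybrid sub-cluster, otherwise wait — can be sketched as follows. The GPU counts and type labels are assumptions for the example, not values from the embodiment.

```python
# Sketch of the s3/s4 dispatch rule. A task is first issued to the
# sub-cluster matching its type; if that lacks free GPUs, the hybrid
# sub-cluster is tried; otherwise the task waits. Counts are illustrative.
free_gpus = {"k8s": 1, "yarn": 1, "hybrid": 2}

def dispatch(task_type, gpus_needed):
    """Return the sub-cluster that runs the task, or 'waiting'."""
    primary = "yarn" if task_type == "spark_yarn" else "k8s"
    for sub in (primary, "hybrid"):
        if free_gpus[sub] >= gpus_needed:
            free_gpus[sub] -= gpus_needed
            return sub
    return "waiting"  # wait until resources are released

def release(sub, gpus):
    """Called when a task finishes, so waiting tasks can be retried."""
    free_gpus[sub] += gpus
```

In a real system the waiting state would be retried whenever `release` fires; the sketch only models the allocation decision itself.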
The above embodiment describes the deep learning method of the present invention, and the deep learning apparatus of the present invention will be described with reference to the embodiment and the drawings.
Referring to fig. 4, an embodiment of the present invention provides a deep learning apparatus, which is applied to a deep learning system, and includes:
an obtaining module 41, configured to obtain a deep learning task submitted by a user;
an allocating module 42, configured to allocate the deep learning task to a task queue corresponding to a type of the deep learning task;
a first executing module 43, configured to schedule resources in the first resource sub-cluster, and execute the deep learning task in the first task queue;
a second execution module 44, configured to schedule resources in the second resource sub-cluster through a control node of the big data platform, and execute a deep learning task in the second task queue;
the type of the deep learning task in the first task queue is not based on a big data platform, and the type of the deep learning task in the second task queue is based on the big data platform.
According to the deep learning device provided by the embodiment of the invention, a deep learning task submitted by a user is acquired and allocated to the task queue corresponding to its type; resources in the first resource sub-cluster are scheduled to execute the deep learning tasks in the first task queue, and resources in the second resource sub-cluster are scheduled through the control node of the big data platform to execute the deep learning tasks in the second task queue. Hybrid resource scheduling is thereby realized, integrating the execution of mainstream deep learning tasks with that of deep learning tasks based on the big data platform, so that most deep learning frameworks are supported, the big data platform is well accommodated, and the network overhead of calling data on the big data platform is reduced.
In this embodiment of the present invention, the second executing module 44 is specifically configured to:
and submitting the deep learning tasks in the second task queue to a control node of the big data platform, scheduling the resources in the second resource sub-cluster by the control node of the big data platform, and executing the deep learning tasks in the second task queue.
Optionally, the first executing module 43 is specifically configured to:
sequentially executing the deep learning tasks in the first task queue in a first-in first-out mode;
and/or,
the second execution module 44 is specifically configured to:
and sequentially executing the deep learning tasks in the second task queue according to a first-in first-out mode.
Optionally, when the resource in the first resource sub-cluster cannot meet the requirement of the deep learning task in the first task queue, the first execution module 43 is further configured to:
scheduling resources in a third resource sub-cluster, and executing deep learning tasks in the first task queue;
alternatively,
when the resource in the second resource sub-cluster cannot meet the requirement of the deep learning task in the second task queue, the apparatus further comprises:
the receiving module is used for receiving a resource request message sent by a control node of the big data platform;
the assignment module 42 is further configured to: according to the resource request message, selecting resources from a fourth resource sub-cluster and distributing the resources to the second resource sub-cluster;
the first execution module 44 is further configured to: and scheduling the resources in the second resource sub-cluster after the resources are allocated through the control node of the big data platform, and executing the deep learning task in the second task queue.
And the resources in the third resource sub-cluster are candidate resources, and the resources in the fourth resource sub-cluster are candidate resources.
Optionally, the resource is a GPU resource.
In addition, an embodiment of the present invention further provides a deep learning system, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the computer program, when executed by the processor, can implement each process of the deep learning method embodiment, and can achieve the same technical effect, and details are not repeated here to avoid repetition.
Specifically, referring to fig. 5, an embodiment of the present invention further provides a deep learning system, which includes a bus 51, a transceiver 52, an antenna 53, a bus interface 54, a processor 55, and a memory 56.
In an embodiment of the present invention, the deep learning system further includes: a computer program stored on the memory 56 and executable on the processor 55.
In particular, the computer program may, when executed by the processor 55, implement the steps of:
acquiring a deep learning task submitted by a user;
allocating the deep learning task to a task queue corresponding to the type of the deep learning task;
scheduling resources in the first resource sub-cluster, and executing deep learning tasks in the first task queue;
scheduling resources in the second resource sub-cluster through a control node of the big data platform, and executing a deep learning task in a second task queue;
the type of the deep learning task in the first task queue is not based on a big data platform, and the type of the deep learning task in the second task queue is based on the big data platform.
In fig. 5, the bus architecture (represented by bus 51) may include any number of interconnected buses and bridges. Bus 51 links together various circuits, including one or more processors represented by processor 55 and memory represented by memory 56. Bus 51 may also link various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore not described further herein. Bus interface 54 provides an interface between bus 51 and transceiver 52. Transceiver 52 may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by processor 55 is transmitted over a wireless medium via antenna 53; antenna 53 also receives data and transmits it to processor 55.
The processor 55 is responsible for managing the bus 51 and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 56 may be used to store data used by processor 55 in performing operations.
Alternatively, the processor 55 may be a CPU, ASIC, FPGA or CPLD.
The embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, can implement each process of the deep learning method embodiment, and can achieve the same technical effect, and is not described herein again to avoid repetition.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the description of the foregoing embodiments, it will be clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general-purpose hardware platform, and certainly may also be implemented by hardware; in many cases, however, the former is the better implementation. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art may make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements should also fall within the protection scope of the present invention.

Claims (10)

1. A deep learning method, comprising:
acquiring a deep learning task submitted by a user;
allocating the deep learning task to a task queue corresponding to the type of the deep learning task;
scheduling resources in the first resource sub-cluster, and executing deep learning tasks in the first task queue;
scheduling resources in the second resource sub-cluster through a control node of the big data platform, and executing a deep learning task in a second task queue;
the type of the deep learning task in the first task queue is not based on a big data platform, and the type of the deep learning task in the second task queue is based on the big data platform.
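The dispatch flow of claim 1 can be sketched in a few lines. This is an illustrative sketch only; the task-type labels, queue names, and dictionary fields are hypothetical and not part of the claimed method.

```python
from collections import deque

# Hypothetical task-type labels.
BIG_DATA = "big_data"          # task is based on the big data platform
STANDALONE = "standalone"      # task is not based on the big data platform

first_queue = deque()          # tasks not based on the big data platform
second_queue = deque()         # tasks based on the big data platform

def submit(task):
    """Allocate a deep learning task to the queue matching its type."""
    if task["type"] == BIG_DATA:
        second_queue.append(task)
    else:
        first_queue.append(task)

# A user submits two tasks; each lands in the queue matching its type.
submit({"name": "train_cnn", "type": STANDALONE})
submit({"name": "spark_dl", "type": BIG_DATA})
```

Tasks in `first_queue` would then be executed with resources scheduled from the first resource sub-cluster, and tasks in `second_queue` with resources scheduled through the control node of the big data platform.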
2. The method of claim 1, wherein scheduling, by a control node of a big data platform, resources in a second resource sub-cluster to perform deep learning tasks in a second task queue comprises:
and submitting the deep learning tasks in the second task queue to a control node of the big data platform, scheduling the resources in the second resource sub-cluster by the control node of the big data platform, and executing the deep learning tasks in the second task queue.
3. The method of claim 1, wherein the performing deep learning tasks in the first task queue comprises:
sequentially executing the deep learning tasks in the first task queue in a first-in first-out mode;
and/or,
the executing of the deep learning task in the second task queue includes:
and sequentially executing the deep learning tasks in the second task queue according to a first-in first-out mode.
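The first-in first-out execution order described in claim 3 can be sketched as follows. The function and task names are hypothetical placeholders.

```python
from collections import deque

def run_fifo(queue, execute):
    """Execute the tasks in a queue sequentially, oldest first."""
    results = []
    while queue:
        task = queue.popleft()   # first-in first-out: oldest task first
        results.append(execute(task))
    return results

q = deque(["task_a", "task_b", "task_c"])
order = run_fifo(q, lambda t: t)
# tasks complete in submission order: ["task_a", "task_b", "task_c"]
```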
4. The method of claim 1, wherein when a resource in the first resource sub-cluster fails to meet a requirement of a deep learning task in the first task queue, the method further comprises:
scheduling resources in a third resource sub-cluster, and executing a deep learning task in the first task queue;
alternatively,
when the resource in the second resource sub-cluster cannot meet the requirement of the deep learning task in the second task queue, the method further comprises:
receiving a resource request message sent by a control node of the big data platform;
according to the resource request message, selecting resources from a fourth resource sub-cluster and distributing the resources to the second resource sub-cluster;
scheduling resources in the second resource sub-cluster after the resources are allocated through a control node of the big data platform, and executing a deep learning task in the second task queue;
and the resources in the third resource sub-cluster are candidate resources, and the resources in the fourth resource sub-cluster are candidate resources.
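The fallback behavior of claim 4 — drawing on candidate resources when a sub-cluster cannot meet a task's demand — can be sketched as below. This is a simplified illustration under stated assumptions: resources are modeled as plain lists of GPU names, and the function name and demand counting are hypothetical.

```python
def ensure_resources(primary, candidate, demand):
    """Move candidate resources into the primary sub-cluster until it
    can cover `demand` GPUs, if possible. Returns True on success."""
    while len(primary) < demand and candidate:
        primary.append(candidate.pop())   # allocate a candidate resource
    return len(primary) >= demand

# The second resource sub-cluster holds one GPU but the task needs two,
# so one GPU is allocated from the fourth (candidate) sub-cluster.
primary = ["gpu0"]
candidate = ["gpu1", "gpu2"]
ok = ensure_resources(primary, candidate, demand=2)
```

After the allocation succeeds, the control node of the big data platform would schedule the enlarged sub-cluster to execute the queued task, as the claim describes.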
5. The method of any of claims 1-4, wherein the resource is a graphics processing unit (GPU) resource.
6. A deep learning apparatus, comprising:
the acquisition module is used for acquiring a deep learning task submitted by a user;
the distribution module is used for distributing the deep learning task to a task queue corresponding to the type of the deep learning task;
the first execution module is used for scheduling resources in the first resource sub-cluster and executing the deep learning tasks in the first task queue;
the second execution module is used for scheduling resources in the second resource sub-cluster through the control node of the big data platform and executing the deep learning task in the second task queue;
the type of the deep learning task in the first task queue is not based on a big data platform, and the type of the deep learning task in the second task queue is based on the big data platform.
7. The apparatus of claim 6, wherein the second execution module is specifically configured to:
and submitting the deep learning tasks in the second task queue to a control node of the big data platform, scheduling the resources in the second resource sub-cluster by the control node of the big data platform, and executing the deep learning tasks in the second task queue.
8. The apparatus of claim 6, wherein the first execution module is specifically configured to:
sequentially executing the deep learning tasks in the first task queue in a first-in first-out mode;
and/or,
the second execution module is specifically configured to:
and sequentially executing the deep learning tasks in the second task queue according to a first-in first-out mode.
9. A deep learning system comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the deep learning method as claimed in any one of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the deep learning method according to any one of claims 1 to 5.
CN201910000910.2A 2019-01-02 2019-01-02 Deep learning method, device and system Active CN111400021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910000910.2A CN111400021B (en) 2019-01-02 2019-01-02 Deep learning method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910000910.2A CN111400021B (en) 2019-01-02 2019-01-02 Deep learning method, device and system

Publications (2)

Publication Number Publication Date
CN111400021A CN111400021A (en) 2020-07-10
CN111400021B true CN111400021B (en) 2023-03-31

Family

ID=71432026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910000910.2A Active CN111400021B (en) 2019-01-02 2019-01-02 Deep learning method, device and system

Country Status (1)

Country Link
CN (1) CN111400021B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035220A (en) * 2020-09-30 2020-12-04 北京百度网讯科技有限公司 Processing method, device and equipment for operation task of development machine and storage medium
CN113918507B (en) * 2021-12-09 2022-04-08 之江实验室 Method and device for adapting deep learning framework to AI acceleration chip

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11087234B2 (en) * 2016-01-29 2021-08-10 Verizon Media Inc. Method and system for distributed deep machine learning
US10572773B2 (en) * 2017-05-05 2020-02-25 Intel Corporation On the fly deep learning in machine learning for autonomous machines
CN107783818B (en) * 2017-10-13 2021-12-24 北京百度网讯科技有限公司 Deep learning task processing method, device, equipment and storage medium
CN108062246B (en) * 2018-01-25 2019-06-14 北京百度网讯科技有限公司 Resource regulating method and device for deep learning frame

Also Published As

Publication number Publication date
CN111400021A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
US10572290B2 (en) Method and apparatus for allocating a physical resource to a virtual machine
CN106919445A (en) A kind of method and apparatus of the container of Parallel Scheduling in the cluster
CN112380020A (en) Computing power resource allocation method, device, equipment and storage medium
WO2022002148A1 (en) Resource scheduling method, resource scheduling system, and device
TW202127249A (en) Machine learning workload orchestration in heterogeneous clusters
CN111400021B (en) Deep learning method, device and system
EP4242843A1 (en) Graphics card memory management method and apparatus, device, and system
CN114374609B (en) Deep learning job operation method and system based on RDMA equipment
CN108170417B (en) Method and device for integrating high-performance job scheduling framework in MESOS cluster
CN106878389B (en) Method and device for resource scheduling in cloud system
EP4113298A1 (en) Task scheduling method, computing device and storage medium
CN113296926B (en) Resource allocation method, computing device and storage medium
CN116166395A (en) Task scheduling method, device, medium and electronic equipment
CN114138488A (en) Cloud-native implementation method and system based on elastic high-performance computing
CN111597035B (en) Simulation engine time propulsion method and system based on multithreading
CN109639599B (en) Network resource scheduling method and system, storage medium and scheduling device
US20230037293A1 (en) Systems and methods of hybrid centralized distributive scheduling on shared physical hosts
CN113254143B (en) Virtualized network function network element arrangement scheduling method, device and system
CN113225269B (en) Container-based workflow scheduling method, device and system and storage medium
CN111796932A (en) GPU resource scheduling method
US20220318656A1 (en) Model parameter sharing between inference application instances in processing unit of information processing system
CN117435142B (en) IO request scheduling method and storage device
US20230050163A1 (en) Apparatuses and methods for scheduling computing resources
CN106900013A (en) The methods, devices and systems of resource management are carried out in the BBU of CRAN systems
CN113760798A (en) RDMA device allocation method, computing device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant