CN111274036A - Deep learning task scheduling method based on speed prediction - Google Patents

Deep learning task scheduling method based on speed prediction

Info

Publication number
CN111274036A
CN111274036A
Authority
CN
China
Prior art keywords
task
speed
training
tasks
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010068852.XA
Other languages
Chinese (zh)
Other versions
CN111274036B (en)
Inventor
曹春
马晓星
徐经纬
李青坪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202010068852.XA priority Critical patent/CN111274036B/en
Publication of CN111274036A publication Critical patent/CN111274036A/en
Application granted granted Critical
Publication of CN111274036B publication Critical patent/CN111274036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a deep learning task scheduling method based on speed prediction, which comprises two parts: speed model construction and task scheduling. The speed model construction part builds a neural network model to predict the image-processing speed of each task running in a cluster, and comprises a training stage and a prediction stage. In the training stage, each task is first profiled: its training speed under different distributed configurations in the cluster is collected to build a data set; the features of each task trained in the cluster are then collected, and a speed model is constructed and trained on the data set built in the previous step. The prediction stage is integrated into the task scheduling part. The task scheduling part uses the speed model to predict the training speed of tasks under different configurations and determines the cluster's resource allocation with a customized simulated annealing algorithm, thereby making effective use of cluster resources.

Description

Deep learning task scheduling method based on speed prediction
Technical Field
The invention relates to a scheduling method for deep learning tasks based on speed prediction, and in particular to a scheduling method for distributed deep learning training tasks that jointly considers resource allocation and task placement.
Background
Deep learning simulates a human neural network to make a series of complex decisions or predictions. As deep learning application scenarios grow more diverse, neural network models become more complex, and the data sets required to train them grow larger, so traditional single-GPU training can hardly support complex models. Distributed Deep Learning (DDL) aims to improve training efficiency by using multiple GPUs to accelerate the training process, so that a complex model can finish training and enter service quickly. However, a GPU cluster may host multiple DDL training tasks, and an unreasonable resource allocation prevents each task from training at its fastest speed, hurting overall training efficiency. It is therefore important to study how to schedule DDL tasks so that cluster resources are used effectively.
Parameter server architecture
In distributed deep learning, the parameter server architecture is responsible for parameter synchronization among multiple working nodes. It has two kinds of nodes: parameter server nodes (PS) and computing nodes (Worker). The PS stores the global model parameters, receives the gradients pushed by each Worker, applies the updates, and lets each Worker pull the updated parameters. Each Worker stores a local copy of the global parameters, processes its share of the data set, and pushes the computed gradients to the PS; after the PS finishes updating the parameters, the Worker pulls them back and starts the next iteration. The parameter server architecture was proposed in 2010, and in 2014 Mu Li brought it to wide public attention. The architecture implements inter-node communication at the bottom layer and provides a transparent interface: a user only needs to write PS-side and Worker-side code against the interface and supply a global configuration to start distributed training with multiple GPUs, which improves the training efficiency of complex models.
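As an illustration of the push/update/pull cycle described above, here is a minimal single-process sketch in Python. All class and function names are our own, not from the patent, and a real parameter server runs the PS and Workers on separate machines; this only simulates one synchronous data-parallel loop.

```python
class ParameterServer:
    """Holds the global model parameters and applies pushed gradients."""
    def __init__(self, dim, lr=0.1):
        self.params = [0.0] * dim
        self.lr = lr

    def push(self, gradients):
        # Average the gradients pushed by all Workers, then update parameters.
        n = len(gradients)
        for i in range(len(self.params)):
            avg = sum(g[i] for g in gradients) / n
            self.params[i] -= self.lr * avg

    def pull(self):
        # Workers fetch the updated parameters before the next iteration.
        return list(self.params)

def worker_gradient(params, shard):
    # Toy gradient of a squared-error objective sum((p - x)^2) on the shard.
    return [sum(2.0 * (p - x) for x in shard) / len(shard) for p in params]

# One hundred synchronous training iterations with two Workers.
ps = ParameterServer(dim=1)
shards = [[1.0, 2.0], [3.0, 4.0]]           # each Worker owns a data shard
for _ in range(100):
    replicas = [ps.pull() for _ in shards]  # each Worker pulls a local copy
    grads = [worker_gradient(r, s) for r, s in zip(replicas, shards)]
    ps.push(grads)                          # PS aggregates and updates

print(round(ps.params[0], 6))  # 2.5, the mean of all data points
```

The fixed point of the averaged update is the minimizer of the combined objective, so both Workers converge to the same global parameters even though each only ever sees its own shard.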
Cluster scheduler
The cluster scheduler is mainly responsible for cluster resource management, task placement, and monitoring the execution state of tasks. Traditional schedulers do not customize scheduling to the characteristics of deep learning tasks, so in a deep learning cluster it is difficult for them to maximize resource utilization. Both academia and industry have proposed many solutions for deep learning tasks. Optimus is a scheduler that aims to minimize the average completion time of tasks: it fits a mathematical model to predict the training speed of DDL tasks under the parameter server architecture, and fits another model to predict the number of epochs a task still needs to complete training. When scheduling, it uses the speed model and the remaining-epoch model to compute which task gains the most (i.e., reduces its training time the most) from one additional PS or Worker, and greedily assigns PSs and Workers to tasks until an additional PS or Worker brings no gain or the cluster's resources are exhausted. Gaia, a scheduler open-sourced by Tencent, exploits GPU point-to-point communication and, by detecting the PCIe connection type between GPUs (SYS, PIX, etc.), ensures that DDL tasks train under high bandwidth and low latency. However, none of these methods accounts for the effect of task placement on resource allocation, and in fact the former has a large effect on the latter. For example, if a task is allocated two Workers, the training effect differs depending on whether the two Workers are placed on the same node or on different nodes.
It is therefore necessary to consider resource allocation and task placement jointly in the scheduling method.
Disclosure of Invention
The purpose of the invention is as follows: existing scheduling methods do not consider the influence of task placement when allocating resources to a DDL task. The common approach is to allocate resources to the DDL task based on heuristic rules or on some model, and then place the allocated resources on the nodes based on other heuristic rules (for example, balancing node load, or packing tasks as tightly as possible). Such methods substitute a chosen heuristic for the actual influence of task placement on resource allocation, and can hardly achieve an optimal scheduling effect in real scenarios. Aiming at these shortcomings, the invention provides a deep learning task scheduling method based on speed prediction, with the advantages of accurate speed prediction, efficient resource allocation, and fast task training completion.
The technical scheme is as follows: a scheduling method for deep learning tasks based on speed prediction, comprising a speed model construction stage and a task scheduling stage:
Speed model construction stage
(1) Speed model data set construction: collect the training speed of each task from its training state in the cluster, and prepare a data set for training the speed model;
(2) implementation of the speed model: build a deep-learning-based trainer for the speed model, whose input is the data set constructed in the previous step and whose output is the training speed of a task (i.e., the speed at which it processes images);
Task scheduling stage
(1) Resource allocation and task placement: the scheduler considers resource allocation and task placement jointly, predicts the training speed of each task under different configurations (i.e., different resource allocation amounts and placement nodes) through the speed model, and determines the optimal configuration for each task, so as to use cluster resources effectively;
(2) task operation: after the scheduler has computed a configuration for each task, it schedules the tasks to run in the cluster and monitors their running state.
In the speed model data set construction of the speed model construction stage: tasks run in the cluster under different configurations and reach different training speeds; the training speed of a task under different configurations is obtained by sampling from all of its possible running configurations in the cluster, thereby constructing a data set.
Beneficial effects: compared with the prior art, the deep learning task scheduling method based on speed prediction uses the speed model to schedule resource allocation and task placement jointly, so that each task can complete training at a higher speed. Compared with traditional DDL task scheduling methods, it offers high accuracy, more reasonable resource allocation, and less interference between running tasks, and can be applied to the scheduling of deep learning clusters. The method has the following advantages:
(1) It solves the problem of resource allocation in a multi-task environment.
In a cluster, multi-task scheduling has always been a key research direction, since whether cluster resource utilization can be maximized determines the cost paid by users. In deep learning, the GPU is the usual computing resource for training, and GPUs are very expensive, so users want to maximize the utilization of GPU resources. The method provided by the invention performs multi-task GPU scheduling well, improves GPU utilization, and solves the GPU resource allocation problem.
(2) High accuracy of speed prediction.
The invention constructs a neural network model to predict the training speed of DDL tasks and includes the task's placement among the features of the speed model, so the training speed of a DDL task can be predicted under different placements with high accuracy.
(3) Less interference between running tasks.
Because one feature of the speed model is the number of tasks running on each node, the influence of other tasks on a given task is reflected through this feature, so the interference a task suffers from other tasks while running can be reduced.
(4) High portability.
The invention is built on the Kubernetes resource manager and runs each task in a Docker container. The speed model can use any of several open-source deep learning frameworks (MXNet/TensorFlow/PyTorch) and can be computed on a CPU or a GPU.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a diagram of the deep neural network model for speed prediction.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in FIG. 1, the scheduling method of the deep learning task based on speed prediction uses a customized simulated annealing algorithm to schedule resource allocation and task placement jointly: taking the sum of the speeds of all tasks as the evaluation function, the simulated annealing search finds a better configuration, thereby using cluster resources effectively. The method comprises a speed model construction stage and a task scheduling stage:
■ Speed model construction stage
1. Speed model data set construction
Based on the different configurations under which tasks run in the cluster, the features of the speed model are obtained as the input features of the speed model trainer. The parameter synchronization architecture of distributed deep learning adopted by the invention is the parameter server architecture, and the features of the speed model comprise the number of parameter servers (PS), the number of computing nodes (Worker), the type of model used by the task, the batch_size, the placement of the Workers on the nodes, and the number of other tasks already running on each node. Based on these features, a data set is sampled from all possible cases of task training in the cluster.
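One possible encoding of the listed features into a single sample vector is sketched below. The field names, the one-hot model vocabulary, and the exact layout are our own assumptions; the patent names the features but not how they are arranged.

```python
# Assumed vocabulary of model types for a one-hot encoding (illustrative).
MODEL_TYPES = ["resnet50", "vgg16", "inception3"]

def encode_config(num_ps, num_workers, model_type, batch_size,
                  workers_per_node, other_tasks_per_node):
    """Flatten one task configuration into a feature vector.

    workers_per_node[i]     : how many of this task's Workers sit on node i
    other_tasks_per_node[i] : how many other tasks already run on node i
    """
    model_onehot = [1.0 if m == model_type else 0.0 for m in MODEL_TYPES]
    return ([float(num_ps), float(num_workers)]
            + model_onehot
            + [float(batch_size)]
            + [float(w) for w in workers_per_node]
            + [float(t) for t in other_tasks_per_node])

# A task with 2 PS, 3 Workers, ResNet-50, batch_size 32 on a 4-node cluster,
# Workers placed as [2, 1, 0, 0] and other-task counts [1, 0, 2, 0]:
x = encode_config(2, 3, "resnet50", 32, [2, 1, 0, 0], [1, 0, 2, 0])
print(len(x))  # 2 + 3 + 1 + 4 + 4 = 14 features
```

Because the per-node placement and the per-node count of co-located tasks are part of the vector, two configurations with identical resource amounts but different placements map to different inputs, which is what lets the model distinguish their speeds.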
2. Implementation of the speed model
A deep-learning-based trainer for the speed model is built: its input is the data set constructed in the previous step, and its output is the training speed of the task (i.e., the speed at which it processes images). A Sequential model with a two-layer fully-connected network is constructed using the TensorFlow Keras deep learning framework. The input of the model is a sample from the data set, with the sample features supplied as a vector, and the label is the training speed of the task (the speed is normalized during training, which helps improve accuracy). The model iterates with the back-propagation algorithm, using the ReLU function as the activation function and Adam as the optimizer, until the error between the predicted speed and the true value on the test data set is within the expected range, completing the deep learning training process.
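Such a model (Sequential, two fully-connected layers, ReLU activation, Adam optimizer) might be sketched in TensorFlow Keras as follows. The hidden-layer width of 64 and the feature count of 14 are assumptions, since the patent does not specify them.

```python
import tensorflow as tf

NUM_FEATURES = 14  # assumed: PS count, Worker count, model one-hot,
                   # batch_size, per-node Worker placement, per-node task counts

# Two-layer fully-connected network: one hidden ReLU layer and a linear
# output for the (normalized) training speed in images per second.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
# Adam optimizer; back-propagation runs inside fit() against an MSE loss.
model.compile(optimizer="adam", loss="mse")
print(model.output_shape)  # (None, 1)
```

Training would then be `model.fit(X, y_normalized, ...)` on the sampled data set, with the held-out test error checked against the expected range before the model is handed to the scheduler.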
■ Task scheduling stage
1. Resource allocation and task placement
The scheduler considers resource allocation and task placement jointly, predicts the training speed of each task under different configurations (i.e., different resource allocation amounts and placement nodes) through the speed model, and determines the optimal configuration for each task, so as to use cluster resources effectively. The whole process comprises the following steps: (1) construct a task queue, in which all tasks submitted by users wait to be scheduled; (2) at the start of each scheduling period, the scheduler takes all tasks out of the scheduling queue and generates an initial configuration for them, which includes allocating PSs and Workers to each task and generating a set of initial placement nodes for the allocated tasks; (3) the scheduler then adjusts the configuration of each task with the customized simulated annealing algorithm, so that each task can train at a higher speed; the goal is to make the average completion time of all tasks as short as possible while keeping the cluster's resource utilization at a high level.
When the initial configuration is generated, each task is initially allocated one PS and one Worker, each of which occupies a certain amount of CPU, GPU, and memory resources, until every task has one PS and one Worker or the remaining resources in the cluster are insufficient. This uniform resource allocation makes it easier for the configuration adjustment phase to converge to a better configuration. After the initial configuration is set, the customized simulated annealing algorithm is applied: the initial configuration gives the current solution, a set of new solutions is generated by exploring neighborhoods, and the evaluation functions of the current and new solutions are compared; if the new solution satisfies the criterion it is accepted, otherwise it is accepted with a certain probability. The customized simulated annealing algorithm generates neighborhoods in three ways, namely allocation of a Worker, exchange of Workers, and transfer of a Worker, and adjusts task configurations through these neighborhoods to reach a better configuration. The allocation and placement of PSs are not in the search space of the simulated annealing algorithm: by default the number of PSs is set slightly larger than the number of Workers, and the PSs are placed evenly on the nodes that contain Workers; if cluster resources are insufficient, the number of PSs is reduced, keeping the difference between the numbers of PSs and Workers of each task no larger than a given value.
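The acceptance rule above ("accept if better, otherwise accept with a certain probability") is the standard Metropolis criterion of simulated annealing. A minimal Python version for a maximizing objective might look like this; the function name is illustrative.

```python
import math
import random

def accept(current_value, new_value, temperature):
    """Metropolis acceptance for a maximizing objective
    (here: the sum of predicted task speeds).

    Better solutions are always accepted; worse ones are accepted with
    probability exp(delta / T), so at a high temperature the search can
    escape local optima, and at a low temperature it settles down.
    """
    delta = new_value - current_value
    if delta >= 0:
        return True
    return random.random() < math.exp(delta / temperature)

print(accept(10.0, 12.0, 5.0))   # True: a better configuration
print(accept(12.0, 10.0, 1e-3))  # False: worse, and T is nearly zero
```

Lowering the temperature on a fixed schedule turns the search from near-random exploration into near-greedy hill climbing, which is why the text can stop once the temperature drops below a threshold.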
Among the three neighborhood moves, allocation of a Worker means assigning one more Worker from the cluster's remaining resources to a task; exchange of Workers means that, for two different tasks, the nodes on which one Worker of each is placed are swapped, exploring the predicted training effect under different placements; transfer of a Worker means moving one Worker from one task to another. The scheduler explores each neighborhood with the sum of the predicted speeds of all tasks as the evaluation function; when the simulated annealing temperature drops below a certain value, the search ends and a set of better configurations is obtained. Meanwhile, to prevent slow-training tasks from starving for a long time, the invention sets a waiting-time threshold for each task; if a task's waiting time exceeds the threshold, it is allocated at least one PS and one Worker.
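The three neighborhood moves can be sketched as operations on a per-task list of Worker placements. The data layout and names are our own; the patent specifies the moves themselves, not the representation.

```python
import random

# Cluster state: for each task, the list of nodes hosting one Worker each.
def allocate_worker(placements, free_nodes, task):
    """Allocation: give `task` one more Worker from the cluster's free pool."""
    node = random.choice(free_nodes)
    placements[task].append(node)

def swap_workers(placements, task_a, task_b):
    """Exchange: swap the placement node of one Worker between two tasks."""
    i = random.randrange(len(placements[task_a]))
    j = random.randrange(len(placements[task_b]))
    placements[task_a][i], placements[task_b][j] = \
        placements[task_b][j], placements[task_a][i]

def transfer_worker(placements, task_from, task_to):
    """Transfer: move one Worker from one task to another."""
    i = random.randrange(len(placements[task_from]))
    node = placements[task_from].pop(i)
    placements[task_to].append(node)

random.seed(1)
placements = {"jobA": ["node1", "node1"], "jobB": ["node2"]}
transfer_worker(placements, "jobA", "jobB")
print(sorted(len(v) for v in placements.values()))  # [1, 2]
```

Note that allocation grows the total Worker count while exchange and transfer keep it fixed, so the annealing search can both redistribute the current resources and claim idle ones.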
2. Task execution
After the configuration of each task has been computed, the scheduler deploys the tasks into the cluster, running each task in a container with Kubernetes as the cluster resource manager. Tasks with an effective configuration (i.e., tasks allocated a PS and a Worker) are submitted to the Kubernetes cluster to start training, and a checkpoint is set for each task to store its model parameters. After the tasks are started, the scheduler periodically checks the health of each task and restarts any task that crashes; if a new task joins the queue or a task completes training, a new round of scheduling is started, so that every task completes training smoothly.
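The periodic health-check-and-restart loop might be structured as follows. This is a toy sketch: `check_health`, `restart`, and `reschedule` stand in for the Kubernetes-backed operations the patent describes and are purely illustrative.

```python
import time

def run_scheduling_loop(tasks, check_health, restart, reschedule,
                        interval=30, max_rounds=1):
    """Periodically check each running task; restart crashed ones and
    trigger a new scheduling round when a task finishes."""
    for _ in range(max_rounds):
        finished = []
        for task in tasks:
            status = check_health(task)
            if status == "crashed":
                restart(task)           # bring a failed task back up
            elif status == "completed":
                finished.append(task)
        if finished:                    # a finished task frees resources,
            for t in finished:          # so start a new scheduling round
                tasks.remove(t)
            reschedule(tasks)
        if max_rounds > 1:
            time.sleep(interval)

# Toy drive: one healthy, one crashed, one completed task.
events = []
run_scheduling_loop(
    tasks=["t1", "t2", "t3"],
    check_health=lambda t: {"t1": "running", "t2": "crashed",
                            "t3": "completed"}[t],
    restart=lambda t: events.append(("restart", t)),
    reschedule=lambda ts: events.append(("reschedule", tuple(ts))),
)
print(events)  # [('restart', 't2'), ('reschedule', ('t1', 't2'))]
```

A production version would query the Kubernetes API for pod status instead of the injected callbacks, but the control flow (check, restart, reschedule on completion or arrival) is the same as described.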
In summary, the invention can be applied to deep learning cluster resource scheduling, with great benefit to improving resource utilization and saving cost. The method accurately predicts the training speed of a task under different configurations, so scheduling decisions rest on reliable results. Because the technique is built on the Kubernetes cluster resource manager, users can reproduce it conveniently without dealing with the inconveniences of the underlying environment, and it is easy to deploy and maintain. Meanwhile, the speed prediction model can be implemented with various deep learning frameworks, so users can choose the framework they know best. The technique therefore has high application value.

Claims (10)

1. A scheduling method of a deep learning task based on speed prediction, characterized by comprising a speed model construction stage and a task scheduling stage:
speed model construction stage
(1) speed model data set construction: collecting the training speed of each task from its training state in the cluster, in preparation for training the speed model;
(2) implementation of the speed model: building a deep-learning-based trainer for the speed model, whose input is the data set constructed in the previous step and whose output is the training speed of a task;
task scheduling stage
(1) resource allocation and task placement: the scheduler considers resource allocation and task placement jointly, predicts the training speed of each task under different configurations through the speed model, and determines the optimal configuration for each task, so as to use cluster resources effectively; wherein different configurations means different resource allocation amounts and placement nodes;
(2) task operation: after the scheduler has computed a configuration for each task, it schedules the tasks to run in the cluster and monitors their running state.
2. The method for scheduling a deep learning task based on speed prediction according to claim 1, wherein in the speed model data set construction of the speed model construction stage: tasks run in the cluster under different configurations and reach different training speeds; the training speed of a task under different configurations is obtained by sampling from all of its possible running configurations in the cluster, thereby constructing a data set.
3. The method for scheduling a deep learning task based on speed prediction according to claim 1, wherein the features of the speed model are obtained based on the different configurations under which tasks run in the cluster, and serve as the input features of the speed model trainer; the parameter synchronization architecture of distributed deep learning adopted by the method is the parameter server architecture, and the features of the speed model comprise the number of parameter servers (PS), the number of computing nodes (Worker), the type of model used by the task, the batch size, the placement of the Workers on the nodes, and the number of other tasks already running on each node; the speed model is based on deep learning, and a two-layer fully-connected network is constructed to predict the training speed of the task.
4. The method for scheduling a deep learning task based on speed prediction according to claim 1, wherein in the implementation of the speed model in the speed model construction stage: a Sequential model with a two-layer fully-connected network is constructed using the TensorFlow Keras deep learning framework; the input of the model is a sample from the data set, with the sample features supplied as a vector, and the label of the model is the training speed of the task.
5. The method for scheduling a deep learning task based on speed prediction according to claim 4, wherein the model iterates with the back-propagation algorithm and uses Adam as the optimizer, until the error between the speed predicted by the model and the true value on the test data set is within the expected range, thereby completing the deep learning training process.
6. The method for scheduling a deep learning task based on speed prediction according to claim 1, wherein in the resource allocation and task placement of the task scheduling stage: a task queue is constructed, in which all tasks submitted by users wait to be scheduled; at the start of each scheduling period, the scheduler takes all tasks out of the scheduling queue and generates an initial configuration for them, which includes allocating PSs and Workers to each task and generating a set of initial placement nodes for the allocated tasks; the scheduler then adjusts the configuration of each task with a customized simulated annealing algorithm, so that each task can train at a higher speed, with the goal of making the average completion time of all tasks as short as possible.
7. The method for scheduling a deep learning task based on speed prediction according to claim 6, wherein the initial configuration is generated as follows: initially, each task is allocated one PS and one Worker, each of which occupies CPU, GPU, and memory resources, until every task has one PS and one Worker or the remaining resources in the cluster are insufficient.
8. The method for scheduling a deep learning task based on speed prediction according to claim 6, wherein the task configuration is adjusted as follows: a customized simulated annealing algorithm is used, in which the initial configuration gives the current solution, a set of new solutions is generated by exploring neighborhoods, and the evaluation functions of the current and new solutions are compared; if the new solution satisfies the criterion it is accepted, otherwise it is accepted with a preset probability; the customized simulated annealing algorithm generates neighborhoods in three ways, namely allocation of a Worker, exchange of Workers, and transfer of a Worker, and adjusts task configurations through these neighborhoods to reach a better configuration; the allocation and placement of PSs are not in the search space of the simulated annealing algorithm: by default the number of PSs is set larger than the number of Workers, and the PSs are placed evenly on the nodes that contain Workers; if cluster resources are insufficient, the number of PSs is reduced, keeping the difference between the numbers of PSs and Workers of each task no larger than a given value.
9. The scheduling method of a deep learning task based on speed prediction according to claim 8, wherein among the three neighborhood moves of the simulated annealing algorithm, allocation of a Worker means assigning one more Worker from the cluster's remaining resources to a task; exchange of Workers means that, for two different tasks, the nodes on which one Worker of each is placed are swapped, exploring the predicted training effect under different placements; transfer of a Worker means moving one Worker from one task to another; the scheduler explores each neighborhood with the sum of the predicted speeds of all tasks as the evaluation function, and when the simulated annealing temperature drops below a preset value, the search ends and a set of better configurations is obtained; meanwhile, to prevent slow-training tasks from starving for a long time, a waiting-time threshold is set for each task, and if a task's waiting time exceeds the threshold, it is allocated at least one PS and one Worker.
10. The method for scheduling a deep learning task based on speed prediction according to claim 1, wherein in the task operation of the task scheduling stage: after the configuration of each task has been computed, the scheduler deploys the tasks into the cluster, running each task in a container with Kubernetes as the cluster resource manager; tasks with an effective configuration are submitted to the Kubernetes cluster to start training, and a checkpoint is set for each task to store its model parameters; after the tasks are started, the scheduler periodically checks the health of each task and restarts any task that crashes; if a new task joins the queue or a task completes training, a new round of scheduling is started, so that every task completes training smoothly; a task with an effective configuration is a task that has been allocated a PS and a Worker.
CN202010068852.XA 2020-01-21 2020-01-21 Scheduling method of deep learning task based on speed prediction Active CN111274036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010068852.XA CN111274036B (en) 2020-01-21 2020-01-21 Scheduling method of deep learning task based on speed prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010068852.XA CN111274036B (en) 2020-01-21 2020-01-21 Scheduling method of deep learning task based on speed prediction

Publications (2)

Publication Number Publication Date
CN111274036A true CN111274036A (en) 2020-06-12
CN111274036B CN111274036B (en) 2023-11-07

Family

ID=70997634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010068852.XA Active CN111274036B (en) 2020-01-21 2020-01-21 Scheduling method of deep learning task based on speed prediction

Country Status (1)

Country Link
CN (1) CN111274036B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101536A (en) * 2020-08-30 2020-12-18 西南电子技术研究所(中国电子科技集团公司第十研究所) Lightweight distributed multi-task collaboration framework
CN112433819A (en) * 2020-11-30 2021-03-02 中国科学院深圳先进技术研究院 Heterogeneous cluster scheduling simulation method and device, computer equipment and storage medium
CN112463340A (en) * 2020-12-10 2021-03-09 武汉工程大学 Tensorflow-based multi-task flexible scheduling method and system
CN112667591A (en) * 2021-01-12 2021-04-16 北京工业大学 Data center task interference prediction method based on mass logs
CN113033098A (en) * 2021-03-26 2021-06-25 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm
CN113282411A (en) * 2021-05-19 2021-08-20 复旦大学 Distributed neural network training system based on edge equipment
WO2022048557A1 (en) * 2020-09-07 2022-03-10 华为云计算技术有限公司 Ai model training method and apparatus, and computing device and storage medium
CN116812427A (en) * 2023-08-28 2023-09-29 河北因朵科技有限公司 Automatic file taking and archiving control system and method for unmanned warehouse
CN117349775A (en) * 2023-10-30 2024-01-05 浙江大学 Cluster computing-oriented abnormal subtask identification method and device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105446816A (en) * 2015-11-11 2016-03-30 华南理工大学 Heterogeneous platform oriented energy consumption optimization scheduling method
CN106529682A (en) * 2016-10-28 2017-03-22 北京奇虎科技有限公司 Method and apparatus for processing deep learning task in big-data cluster
CN107888669A (en) * 2017-10-31 2018-04-06 武汉理工大学 A kind of extensive resource scheduling system and method based on deep learning neutral net
CN108694090A (en) * 2018-04-16 2018-10-23 江苏润和软件股份有限公司 A kind of cloud computing resource scheduling method of Based on Distributed machine learning
CN108920259A (en) * 2018-03-30 2018-11-30 华为技术有限公司 Deep learning job scheduling method, system and relevant device
CN109710406A (en) * 2018-12-21 2019-05-03 腾讯科技(深圳)有限公司 Data distribution and its model training method, device and computing cluster
CN110413391A (en) * 2019-07-24 2019-11-05 上海交通大学 Deep learning task service method for ensuring quality and system based on container cluster
WO2019226652A1 (en) * 2018-05-22 2019-11-28 Pure Storage, Inc. Auto-scaling a software application
CN110533183A (en) * 2019-08-30 2019-12-03 东南大学 The model partition and task laying method of heterogeneous network perception in a kind of assembly line distribution deep learning


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101536A (en) * 2020-08-30 2020-12-18 西南电子技术研究所(中国电子科技集团公司第十研究所) Lightweight distributed multi-task collaboration framework
WO2022048557A1 (en) * 2020-09-07 2022-03-10 华为云计算技术有限公司 Ai model training method and apparatus, and computing device and storage medium
CN112433819A (en) * 2020-11-30 2021-03-02 中国科学院深圳先进技术研究院 Heterogeneous cluster scheduling simulation method and device, computer equipment and storage medium
CN112433819B (en) * 2020-11-30 2024-04-19 中国科学院深圳先进技术研究院 Simulation method and device for heterogeneous cluster scheduling, computer equipment and storage medium
CN112463340A (en) * 2020-12-10 2021-03-09 武汉工程大学 Tensorflow-based multi-task flexible scheduling method and system
CN112667591A (en) * 2021-01-12 2021-04-16 北京工业大学 Data center task interference prediction method based on mass logs
CN113033098B (en) * 2021-03-26 2022-05-17 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm
CN113033098A (en) * 2021-03-26 2021-06-25 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm
CN113282411A (en) * 2021-05-19 2021-08-20 复旦大学 Distributed neural network training system based on edge equipment
CN113282411B (en) * 2021-05-19 2022-03-22 复旦大学 Distributed neural network training system based on edge equipment
CN116812427A (en) * 2023-08-28 2023-09-29 河北因朵科技有限公司 Automatic file taking and archiving control system and method for unmanned warehouse
CN116812427B (en) * 2023-08-28 2023-11-14 河北因朵科技有限公司 Automatic file taking and archiving control system and method for unmanned warehouse
CN117349775A (en) * 2023-10-30 2024-01-05 浙江大学 Cluster computing-oriented abnormal subtask identification method and device
CN117349775B (en) * 2023-10-30 2024-04-26 浙江大学 Cluster computing-oriented abnormal subtask identification method and device

Also Published As

Publication number Publication date
CN111274036B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
CN111274036B (en) Scheduling method of deep learning task based on speed prediction
CN111756812B (en) Energy consumption perception edge cloud cooperation dynamic unloading scheduling method
Wang et al. Distributed machine learning with a serverless architecture
CN108958916B (en) Workflow unloading optimization method under mobile edge environment
Peng et al. Dl2: A deep learning-driven scheduler for deep learning clusters
CN104636204B (en) A kind of method for scheduling task and device
CN110389816B (en) Method, apparatus and computer readable medium for resource scheduling
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
CN115248728A (en) Distributed training task scheduling method, system and device for intelligent computing
CN110058924A (en) A kind of container dispatching method of multiple-objection optimization
CN113794748B (en) Performance-aware service function chain intelligent deployment method and device
CN112685153A (en) Micro-service scheduling method and device and electronic equipment
CN116069512B (en) Serverless efficient resource allocation method and system based on reinforcement learning
CN115237581A (en) Heterogeneous computing power-oriented multi-strategy intelligent scheduling method and device
Bian et al. Online evolutionary batch size orchestration for scheduling deep learning workloads in GPU clusters
CN115543626A (en) Power defect image simulation method adopting heterogeneous computing resource load balancing scheduling
CN114356578A (en) Parallel computing method, device, equipment and medium for natural language processing model
CN109976873A (en) The scheduling scheme acquisition methods and dispatching method of containerization distributed computing framework
CN116996941A (en) Calculation force unloading method, device and system based on cooperation of cloud edge ends of distribution network
CN115756789A (en) GPU scheduling optimization method for deep learning inference service system
CN110928683B (en) Edge computing resource allocation method based on two types of intensive virtual machines
CN108053026A (en) A kind of mobile application background request adaptive scheduling algorithm
Oberthür et al. Flexible resource management for self-x systems: An evaluation
CN113656494A (en) Synchronization method and system of parameter server and readable storage medium
CN113821313A (en) Task scheduling method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant