CN112416585A - GPU resource management and intelligent scheduling method for deep learning - Google Patents

GPU resource management and intelligent scheduling method for deep learning

Info

Publication number
CN112416585A
Authority
CN
China
Prior art keywords
deep learning
resource
job
scheduling
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011310749.8A
Other languages
Chinese (zh)
Other versions
CN112416585B (en)
Inventor
顾荣
刘率
王肇康
袁春风
黄宜华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202011310749.8A priority Critical patent/CN112416585B/en
Publication of CN112416585A publication Critical patent/CN112416585A/en
Application granted granted Critical
Publication of CN112416585B publication Critical patent/CN112416585B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/45558 — Hypervisor-specific management and integration aspects
    • G06F 9/4881 — Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/5011 — Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016 — Allocation of resources to service a request, the resource being the memory
    • G06N 20/00 — Machine learning
    • G06N 3/126 — Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G06F 2009/4557 — Distribution of virtual machine instances; migration and load balancing
    • G06F 2009/45575 — Starting, stopping, suspending or resuming virtual machine instances
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a deep-learning-oriented GPU resource management and intelligent scheduling method comprising the following steps: first, a user submits a deep learning job through a front-end interface component, the job comprising the deep learning program to be executed and its training data set; second, after verification, the job is added to the pending queue of the corresponding scheduler; third, an independent job manager is started for the job; fourth, the job manager applies to the resource manager for the computing resources the job needs; fifth, the runtime features of the pending job are modeled and analyzed; sixth, a resource scheduling scheme is generated from the job features and the characteristics of the cluster's compute nodes; seventh, the job is scheduled to the designated compute nodes according to the scheduling scheme; and eighth, the job executor starts containers and executes the deep learning program. The method addresses the low GPU resource utilization and poor job execution performance of conventional cluster resource scheduling methods in deep learning scenarios.

Description

GPU resource management and intelligent scheduling method for deep learning
Technical Field
The invention relates to the technical field of cluster resource scheduling, and in particular to a deep-learning-oriented GPU resource management and intelligent scheduling method.
Background
Research and practice in recent years show that deep learning achieves higher accuracy than traditional machine learning techniques in fields such as computer vision and speech recognition, and it is therefore widely used. Deep learning model training is computationally intensive; a Graphics Processing Unit (GPU) executes such simple but large-scale computational tasks more efficiently, and has therefore become an important basic computing resource for running deep learning programs.
Since GPU cards are often expensive, deploying an independent private cluster for each user (group) is costly, and users do not perform model training continuously, so users often share GPU resources to reduce cost. To avoid conflicts and make full use of cluster resources, large pools of resources such as GPUs must be managed efficiently, and user jobs must be scheduled uniformly and reasonably.
GPU resource management and scheduling in deep learning scenarios faces the following problems:
In terms of resource utilization: with the rapid development of hardware technology, new GPU cards are released continuously, so a cluster usually contains GPU cards of different models that differ greatly in both compute capability and video memory. Indiscriminate allocation leaves some jobs under-provisioned during execution and others over-provisioned. Moreover, because mature and efficient GPU virtualization technology is lacking, a GPU is usually used exclusively by one job; yet small development and test jobs have low resource demands, so exclusive use aggravates resource waste.
In terms of resource scheduling strategy: much deep learning model training is still performed on a single machine with a single GPU card. However, in pursuit of higher accuracy, model networks grow ever deeper, parameters ever more numerous, and training data sets ever larger; a single GPU card can hardly accommodate them and becomes a performance bottleneck, so distributed training modes of single-machine multi-card and multi-machine multi-card have emerged. Unlike big data applications, the multiple instances of a distributed deep learning job exchange complex and massive data and synchronize information with one another, so an unreasonable resource scheduling scheme can greatly degrade job execution performance.
Therefore, designing a scheduling mechanism such that the scheduler still achieves good GPU resource utilization and job execution performance in a deep learning scenario is a very challenging task.
Disclosure of Invention
Purpose of the invention: in view of the problems and deficiencies of the prior art, the invention aims to provide a deep-learning-oriented GPU resource management and intelligent scheduling method that solves the low GPU resource utilization and poor job execution performance of existing systems in deep learning scenarios.
Technical scheme: to achieve the above purpose, the technical solution adopted by the invention provides a deep-learning-oriented GPU resource management and intelligent scheduling method comprising the following steps:
(1) a user submits a deep learning job ("job" for short) through a front-end interface component, the job comprising the deep learning program to be executed, the program's input data set, and the job's task-division information;
(2) the job undergoes parameter validity checking and authority verification and is then added to the specified pending queue to await scheduling;
(3) when the job is selected for scheduling, an independent job manager is started for it to take charge of its subsequent operation;
(4) the job manager applies to the global resource manager for the computing resources each task needs, according to the task division of the deep learning job;
(5) the job's features are modeled and analyzed by an intelligent prediction model of job resource demand ("prediction model" for short), covering the job's runtime GPU compute power, GPU video memory, CPU, memory and network bandwidth demands, and a job-execution resource demand vector is generated;
(6) a resource scheduling scheme for the job is generated from the resource demand vector returned in step (5), combined with the job's distributed architecture and the cluster network topology;
(7) the job is scheduled to the designated compute nodes through a push mechanism according to the resource scheduling scheme;
(8) the job executor starts a separate running container for each task of the job to execute the deep learning program.
Further, in step (3): since most current deep learning frameworks lack an elasticity mechanism, a group (gang) scheduling mechanism is adopted to avoid the resource deadlock that can arise when multiple jobs are scheduled simultaneously; that is, resources are allocated to the next job only after all resource requirements of the previous job have been satisfied.
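The all-or-nothing condition of group scheduling can be sketched in a few lines — a minimal illustration under assumed `Job`/`Cluster` types (hypothetical names, not the patent's implementation), counting only GPUs:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    task_demands: list  # GPUs required by each task of the job

@dataclass
class Cluster:
    free_gpus: int

    def can_satisfy(self, job: Job) -> bool:
        # Gang condition: every task of the job must be placeable at once.
        return sum(job.task_demands) <= self.free_gpus

def gang_schedule(queue: list, cluster: Cluster) -> list:
    """Allocate jobs strictly in order; stop at the first job whose full
    resource demand cannot be met, avoiding partial allocation and the
    resulting resource deadlock."""
    started = []
    for job in queue:
        if not cluster.can_satisfy(job):
            break  # the next job waits until this one fits entirely
        cluster.free_gpus -= sum(job.task_demands)
        started.append(job.name)
    return started
```

For example, with 5 free GPUs, a job with tasks needing [2, 2] GPUs starts, but a following job needing [4] waits rather than grabbing a partial allocation.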
Further, in step (3): to reduce the load on the job scheduler, an independent job manager is started for each job. The job manager is responsible for the job's life-cycle management, which includes resource application, pushing tasks to compute nodes, monitoring the running state, retrying failed tasks, and so on.
Further, in step (4): since most current distributed deep learning frameworks use a static mapping mode, the job's task division is already fixed before execution, so the scheduling system only needs to allocate resources and decide a resource scheduling scheme for the pre-divided tasks.
Further, in step (5): an intelligent prediction model of job resource demand is established. Its input features include the task division, hyper-parameter settings and dataset scale; its output label is the job resource demand vector ("vector" for short), covering CPU, memory, GPU compute power, GPU video memory and network bandwidth. The regression problem corresponding to the prediction model is solved with a conventional machine learning algorithm.
Further, in step (5): features such as the actual resource usage of similar jobs during historical runs are collected, and the prediction model then predicts the resource demand features of subsequent jobs.
Further, in step (6): first, the execution order of jobs is determined by fair scheduling between queues and first-come-first-served scheduling within a queue, and the job to be scheduled is selected; then the job's distributed topology and the cluster network topology are extracted, and a network communication cost model is built from the job's resource demand features; finally, a heuristic genetic algorithm solves the model and generates the scheduling scheme.
Beneficial effects: through the deep-learning-oriented GPU resource management and intelligent scheduling method, the invention effectively addresses low GPU resource utilization and poor job execution performance in deep learning scenarios. First, the invention abstracts the common features of existing mainstream deep learning frameworks and provides a framework-independent service interface, giving good framework compatibility and usability. Second, the invention provides an intelligent prediction model of job resource demand that predicts a pending job's runtime features from historical scheduling data, automatically determining the job resource demand vector and strengthening scheduling. Third, unlike prior methods that treat the pending job entirely as a black box, the invention uses the collected information and considers both the job's distributed topology and the cluster network topology during scheduling, generating a more efficient scheduling scheme and improving job execution performance.
Drawings
FIG. 1 is a schematic flow diagram of the overall process of the present invention;
FIG. 2 is a schematic diagram illustrating a resource scheduling scheme using block coding according to the present invention;
fig. 3 is a flowchart of the scheduling policy of the present invention.
Detailed Description
The present invention is further illustrated by the following figures and specific examples. These embodiments are to be understood as merely illustrating, not limiting, the invention; after reading this specification, modifications in various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims.
The invention provides a deep learning-oriented GPU resource management and intelligent scheduling method, which solves the problems of low GPU resource utilization rate and poor operation execution performance in a deep learning scene.
As shown in fig. 1, the complete process of the present invention comprises eight stages: job submission, authority verification, job manager startup, resource application, job feature modeling and analysis, resource scheduling scheme generation, job distribution, and execution. The specific embodiments are described below.
The job submission stage corresponds to step (1) of the technical scheme. Specific implementation: a user submits a deep learning job through the visual management front end or an API (application programming interface); the job comprises an executable deep learning program, the input training set the program needs, the job's task division, and the program's startup parameters. In the scheduling system of the invention, a job is defined as follows: a job consists of several tasks; for example, in the Parameter Server architecture these include both parameter servers and worker nodes (Workers). A single-machine single-card or single-machine multi-card job has only one task, whereas a multi-machine multi-card job comprises multiple tasks, each corresponding to one parameter server or worker node. Since most deep learning frameworks currently lack an elasticity mechanism, the number and division of tasks are specified by the user at submission. During scheduling, one task is scheduled to run on one compute node (physical machine), and one compute node may run multiple tasks simultaneously. Jobs are the basic unit of user submission; tasks are the basic unit the system schedules for execution.
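The job/task model described above can be written down as a small data structure — an illustrative sketch with hypothetical type names, not the system's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Role(Enum):
    PS = "parameter_server"
    WORKER = "worker"

@dataclass
class Task:
    role: Role   # in the Parameter Server architecture: PS or Worker
    gpus: int    # GPU cards this task asks for

@dataclass
class Job:
    """Basic unit of user submission; tasks are the basic unit scheduled."""
    program: str                     # executable deep learning program
    dataset: str                     # input training set
    tasks: list = field(default_factory=list)  # fixed at submission time

    def is_distributed(self) -> bool:
        # Multi-machine multi-card jobs have several tasks.
        return len(self.tasks) > 1

# A multi-machine multi-card job: one parameter server and two workers.
job = Job("train.py", "imagenet",
          [Task(Role.PS, 0), Task(Role.WORKER, 2), Task(Role.WORKER, 2)])
```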
The authority verification stage corresponds to step (2). Specific implementation: after receiving the job submitted by the user, the system checks the validity and integrity of the job parameters and verifies that the user has the right to submit to the specified pending queue; once verification passes, the job is added to that queue of the scheduler and the request is recorded.
The job manager startup stage corresponds to step (3). Specific implementation: the scheduler determines the execution order of pending jobs according to the fair scheduling principle. When a job is selected to begin scheduled execution, a separate job manager is started to take charge of the job's subsequent life-cycle flow.
The resource application stage corresponds to step (4). Specific implementation: the job manager applies for computing resources for each task according to the job's task division, until the resource requirements of all the job's tasks are satisfied.
The job feature modeling and analysis stage corresponds to step (5). Specific implementation: the scheduling system collects the actual usage of CPU, memory, GPU compute power, GPU video memory and network bandwidth by similar jobs during historical runs, trains a job resource demand vector prediction model on this data using a random forest algorithm (a conventional machine learning model), and then uses the model to predict the runtime resource demand vector of the pending job, so that suitable resources can be allocated and a suitable scheduling scheme selected.
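A minimal stand-in for this prediction step can be sketched by grouping historical usage records under a job signature (task division, hyper-parameter settings, dataset scale) and averaging them — a deliberately simple substitute for the random-forest regressor; all names and numbers here are hypothetical:

```python
from collections import defaultdict

RESOURCES = ("cpu", "mem_gb", "gpu_flops", "gpu_mem_gb", "net_mbps")

def train_predictor(history):
    """Stand-in for the random-forest regressor: average the observed
    resource usage of historical jobs that share a signature."""
    by_sig = defaultdict(list)
    for signature, usage in history:
        by_sig[signature].append(usage)
    return {sig: tuple(sum(col) / len(col) for col in zip(*rows))
            for sig, rows in by_sig.items()}

def predict(model, signature):
    """Return the predicted resource demand vector for a pending job."""
    return dict(zip(RESOURCES, model[signature]))

# Two historical runs of a similar job (made-up figures).
history = [(("2workers", "bs=64", "10GB"), (8, 32, 14.0, 16, 900)),
           (("2workers", "bs=64", "10GB"), (8, 30, 12.0, 16, 1100))]
model = train_predictor(history)
demand = predict(model, ("2workers", "bs=64", "10GB"))  # averaged vector
```

The averaged `demand` dictionary plays the role of the job-execution resource demand vector handed to the scheduler in step (6).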
The resource scheduling scheme generation stage corresponds to step (6). Specific implementation: first, compute nodes and GPUs that do not meet the requirements are filtered out to obtain a candidate node list; next, using the job resource demand vector returned in step (5), the GPU model that best matches the job's GPU compute power requirement is determined, and nodes with that GPU model are selected from the candidates as the next candidate list (if that model's GPU resources are insufficient, a model of similar performance is used instead); finally, a heuristic grouping genetic algorithm generates a better resource scheduling scheme.
To solve the invention's resource scheduling problem, the scheduling target must be formally defined. For a distributed model-training job, different resource scheduling schemes greatly affect job execution performance, which is determined mainly by network communication quality; network communication must therefore be considered when generating the resource scheduling scheme, and other indicators, such as reducing resource fragmentation, should be considered as far as possible on that basis. Different scheduling schemes are first evaluated on factors such as the job topology and the cluster network topology, and a Score is computed from two parts: the network communication overhead Cost_network and the node matching fitness Fitness_node. The goal of scheduling is to minimize this score.

Score = Cost_network + Σ Fitness_node
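This scoring objective can be exercised on a toy example — the cost and fitness terms below are made up purely for illustration and are not the patent's actual cost model:

```python
def score(scheme, comm_cost, node_fitness):
    """Score = Cost_network + sum of per-node Fitness_node; lower is better."""
    return comm_cost(scheme) + sum(node_fitness(n) for n in scheme)

# Toy cost model: scheme maps node -> tasks; cross-node traffic dominates.
def comm_cost(scheme):
    nodes_used = sum(1 for tasks in scheme.values() if tasks)
    return 10 * max(0, nodes_used - 1)   # pay for every extra node crossed

def node_fitness(node):
    # 0 would be a perfect GPU-model match; positive values penalize mismatch.
    return {"n1": 0, "n2": 3}.get(node, 5)

compact   = {"n1": ["t1", "t2"], "n2": []}
scattered = {"n1": ["t1"], "n2": ["t2"]}
assert score(compact, comm_cost, node_fitness) < score(scattered, comm_cost, node_fitness)
```

Under this toy model, packing both tasks onto one node avoids the cross-node communication charge and therefore scores lower (better).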
The following describes how the resource scheduling scheme generation problem of the invention is solved with a grouping genetic algorithm:
1) Encoding. Fig. 2 illustrates how group encoding represents a resource scheduling scheme under the invention's scheduling policy (two schemes that place 8 tasks onto several compute nodes). In genetic-algorithm terms, each chromosome represents a resource scheduling scheme, each gene locus represents a task, and each genome represents a compute node; crossover and mutation operate on whole compute nodes as units.
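The group (block) encoding can be shown directly — a hypothetical pair of chromosomes placing 8 tasks, in the spirit of fig. 2:

```python
# One chromosome = one resource scheduling scheme.
# Keys are compute nodes (the genome unit for crossover/mutation);
# values are the tasks (gene loci) placed on that node.
chromosome_a = {"node1": ["t1", "t2", "t3"], "node2": ["t4", "t5"],
                "node3": ["t6", "t7", "t8"]}
chromosome_b = {"node1": ["t1", "t4"], "node2": ["t2", "t3", "t5"],
                "node3": ["t6"], "node4": ["t7", "t8"]}

def tasks_of(chromosome):
    """In a valid chromosome every task appears exactly once."""
    return sorted(t for tasks in chromosome.values() for t in tasks)

# Both schemes place the same 8 tasks, just grouped differently.
assert tasks_of(chromosome_a) == tasks_of(chromosome_b)
```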
2) Initial population generation. Two simple algorithms, First-Fit and Random-Fit, generate several initial resource scheduling schemes. First-Fit schedules each task onto the first compute node where it can be placed; Random-Fit randomly selects a node that meets the requirement. Both algorithms have low complexity and run fast enough.
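The two initializers can be sketched as follows, reducing each task's demand to a single GPU count (a hypothetical simplification of the full demand vector):

```python
import random

def first_fit(tasks, capacity):
    """Schedule each task onto the first compute node that can hold it."""
    placement = {node: [] for node in capacity}
    free = dict(capacity)
    for task, need in tasks:
        for node in capacity:
            if free[node] >= need:
                placement[node].append(task)
                free[node] -= need
                break
        else:
            raise RuntimeError(f"no node can place {task}")
    return placement

def random_fit(tasks, capacity, rng=random.Random(0)):
    """Schedule each task onto a randomly chosen node that meets its need."""
    placement = {node: [] for node in capacity}
    free = dict(capacity)
    for task, need in tasks:
        node = rng.choice([n for n in capacity if free[n] >= need])
        placement[node].append(task)
        free[node] -= need
    return placement

# Tasks as (name, GPUs needed); two nodes with 4 and 2 free GPUs.
tasks = [("t1", 2), ("t2", 2), ("t3", 1)]
capacity = {"n1": 4, "n2": 2}
```

First-Fit is deterministic; Random-Fit yields diverse placements, which is exactly what an initial population needs.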
3) Fitness function and selection strategy. Since the invention's scheduling goal is to reduce communication overhead, the fitness function Fitness is the negative of the scheduling target (the smaller the network communication overhead, the better the fitness). To speed up convergence, a new variable NumNodes_using, the number of compute nodes the scheme requires, is added on this basis, so that when network overheads are close the algorithm preferentially selects the resource scheduling scheme that uses fewer nodes.

Fitness = -(Cost_network + Σ Fitness_node + NumNodes_using)
The tournament method is chosen as the selection strategy. It runs multiple elimination rounds and keeps the best candidate in each; it requires no full sort of the population, has low complexity, can be parallelized, and has small time overhead, making it well suited to the online scheduling scenario of the invention.
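Tournament selection itself is short — a generic sketch, not the patent's exact parameters:

```python
import random

def tournament_select(population, fitness, k=3, rng=random.Random(42)):
    """One tournament round: sample k candidates and keep the fittest.
    No full sort of the population is needed, so rounds are cheap and
    independent rounds can run in parallel."""
    return max(rng.sample(population, k), key=fitness)

# Schemes identified by id; fitness is the negated score (higher is better).
population = ["s1", "s2", "s3", "s4", "s5"]
fitness = {"s1": -30, "s2": -12, "s3": -25, "s4": -18, "s5": -40}.get
winner = tournament_select(population, fitness)  # fittest of a random trio
```

With `k` equal to the population size the tournament degenerates into picking the global best; smaller `k` trades selection pressure for diversity.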
4) Crossover and mutation rules. Crossover proceeds as follows: first, two schemes X and Y are selected from the current resource scheduling schemes by the selection strategy, and a crossover point (a compute node) and a crossover position are chosen in each; next, the selected compute node and the tasks on it are inserted at the crossover position of the other scheme; then, because the new scheme may now contain duplicate compute nodes and tasks, the duplicates must be deleted, and since the basic unit of crossover and mutation is a compute node, the duplicated nodes and the nodes holding duplicated tasks are deleted; finally, because the tasks on deleted nodes are removed along with them, those tasks are re-placed onto the remaining compute nodes using the First-Fit algorithm. Mutation is similar: a resource scheduling scheme Y is selected, a compute node is chosen at random, the node and the tasks on it are deleted, and the deleted tasks are re-placed onto the remaining compute nodes by First-Fit, yielding a new scheme.
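The crossover repair described above (inject one node, clear duplicates, re-place orphaned tasks with First-Fit) can be sketched as follows — a simplified illustration that clears duplicated nodes rather than removing them from the cluster, with unit-size tasks in the example:

```python
import copy

def first_fit_place(tasks, scheme, capacity, need):
    """Re-insert orphaned tasks onto the first node with enough room."""
    load = {n: sum(need[t] for t in ts) for n, ts in scheme.items()}
    for task in tasks:
        for node in scheme:
            if load[node] + need[task] <= capacity[node]:
                scheme[node].append(task)
                load[node] += need[task]
                break

def crossover(x, y, cross_node, capacity, need):
    """Inject one compute node (the genome unit) from scheme x into y,
    then repair: clear the duplicated node and any node holding a
    duplicated task, and re-place the displaced tasks with First-Fit."""
    child = copy.deepcopy(y)
    injected = set(x[cross_node])
    orphans = []
    for node in child:
        if node == cross_node or injected & set(child[node]):
            # Keep this node's non-duplicate tasks for re-placement.
            orphans += [t for t in child[node] if t not in injected]
            child[node] = []
    child[cross_node] = list(x[cross_node])
    first_fit_place(orphans, child, capacity, need)
    return child

# Unit-size tasks on two 2-GPU nodes (hypothetical example).
x = {"n1": ["t1", "t2"], "n2": ["t3", "t4"]}
y = {"n1": ["t1", "t3"], "n2": ["t2", "t4"]}
need = {t: 1 for t in ["t1", "t2", "t3", "t4"]}
child = crossover(x, y, "n1", {"n1": 2, "n2": 2}, need)
```

The child inherits x's grouping of n1 intact, and the displaced tasks t3 and t4 are re-placed by First-Fit, so every task still appears exactly once.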
Fig. 3 shows the resource scheduling scheme generation flow of the invention. Because of resource fragmentation, the selected jobs may not all be schedulable at once; in that case the scheduler waits for a period and tries again to generate a valid resource scheduling scheme. When resources are scattered, the generated scheme may not be good enough, and the scheduler can likewise wait a period to see whether more, and more suitable, resources are released. Finally, once the generated scheme is good enough or the scheduling algorithm's running time exceeds its limit, the job is scheduled according to the current best resource scheduling scheme.
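This wait-and-retry loop can be sketched as follows — `generate_scheme` and `good_enough` are hypothetical hooks standing in for the genetic algorithm and the quality threshold:

```python
import time
from types import SimpleNamespace

def schedule_with_retry(generate_scheme, good_enough, wait_s=5, timeout_s=60):
    """Regenerate the scheduling scheme while resources are fragmented or
    the scheme is poor; on algorithm timeout, fall back to the best seen."""
    best = None
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        scheme = generate_scheme()   # may be None if nothing fits yet
        if scheme is not None and (best is None or scheme.score < best.score):
            best = scheme
        if best is not None and good_enough(best):
            return best              # good enough: schedule the job now
        time.sleep(wait_s)           # wait for more resources to be released
    return best                       # timeout: use the current best scheme

# Simulated attempts: no scheme, a mediocre one, then a good one.
schemes = iter([None, SimpleNamespace(score=20), SimpleNamespace(score=5)])
result = schedule_with_retry(lambda: next(schemes, None),
                             good_enough=lambda s: s.score <= 5,
                             wait_s=0, timeout_s=5)
```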
The job distribution stage corresponds to step (7). Specific implementation: after the resource requirements of all the job's tasks are satisfied, the job manager pushes the job's tasks to the corresponding compute nodes according to the resource scheduling scheme, where they wait to be executed.
The job execution stage corresponds to step (8). Specific implementation: first, a corresponding running environment (container) is created for the job, and the container's available resources are limited according to the job's resource demand; after the container starts, the user's deep learning program contained in the job is downloaded to a specified location inside the container; the training data set needed for model training is then mounted at the corresponding local directory; next, the user's deep learning program ("program" for short) is launched via a start command and its running state is continuously monitored; finally, after the program finishes, its output files are transferred to external reliable storage (HDFS), the container is destroyed, and the system resources it occupied are released.
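The container lifecycle above can be exercised with a stand-in container object — `FakeContainer` merely records the calls and is not a real container runtime API:

```python
class FakeContainer:
    """Stand-in for a real container runtime: records lifecycle calls so
    the flow can be exercised without a container daemon (hypothetical API)."""
    def __init__(self, limits):
        self.limits, self.log = limits, ["created"]   # resources capped here
    def download(self, program): self.log.append(f"downloaded {program}")
    def mount(self, dataset): self.log.append(f"mounted {dataset}")
    def run(self, cmd): self.log.append(f"ran {cmd}")
    def upload_output(self, dest): self.log.append(f"uploaded to {dest}")
    def destroy(self): self.log.append("destroyed")

def execute_task(container, program, dataset, cmd, hdfs_dest):
    """One task's lifecycle, following the stages above: fetch the user
    program, mount the training set, run and monitor, archive the output,
    and always destroy the container to release its resources."""
    try:
        container.download(program)
        container.mount(dataset)
        container.run(cmd)
        container.upload_output(hdfs_dest)
    finally:
        container.destroy()   # release resources even on failure

c = FakeContainer({"gpus": 1, "mem_gb": 16})
execute_task(c, "train.py", "imagenet", "python train.py", "hdfs:///out")
```

The `try`/`finally` mirrors the requirement that the container is destroyed and its resources released whether or not the program succeeds.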
The invention provides a deep-learning-oriented GPU resource management and intelligent scheduling method. By modeling and analyzing jobs, the system can effectively predict job resource demand in advance. Compared with common scheduling methods (scattered, centralized, random), the invention's resource scheduling method reduces the execution time of a single job by 33.5% to 59.5%, and further reduces the average Job Completion Time (JCT) of multiple jobs by 10%. Compared with the conventional Kubernetes system, it reduces average job completion time by 48%. As for system scalability, the scheduling system's throughput remains stable as cluster nodes are added, showing good scalability. The deep-learning-oriented GPU resource management and intelligent scheduling method proposed by the invention thus has a significant performance optimization effect.

Claims (7)

1. A deep-learning-oriented GPU resource management and intelligent scheduling method, comprising the following steps:
(1) a user submits a deep learning job through a front-end interface component, the job comprising the deep learning program to be executed, the program's input data set, and the job's task-division information;
(2) performing parameter validity checking and authority verification on the deep learning job, then adding it to the specified pending queue to await scheduling;
(3) when the deep learning job is selected for scheduling, starting an independent job manager for it to take charge of its subsequent operation;
(4) the job manager applying to the global resource manager for the computing resources each task needs, according to the task division of the deep learning job;
(5) modeling and analyzing the job's features with the intelligent prediction model of job resource demand, covering the GPU compute power, GPU video memory, CPU, memory and network bandwidth resource demand features at job runtime, and generating a job-execution resource demand vector;
(6) generating a resource scheduling scheme for the deep learning job from the resource demand vector returned in step (5), combined with the job's distributed architecture and the cluster network topology;
(7) scheduling the deep learning job to the designated compute nodes through a push mechanism according to the resource scheduling scheme;
(8) the job executor starting a separate running container for each task of the deep learning job to execute the deep learning program.
2. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (3), a gang (group) scheduling mechanism is adopted: resource allocation for the next job starts only after all resource requirements of the previous job have been satisfied.
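The gang-scheduling rule of claim 2 can be sketched as follows; a minimal, hypothetical illustration in which jobs hold only GPUs and each queue entry is `(name, gpus_per_task, n_tasks)` (these simplifications are assumptions, not from the patent):

```python
from collections import deque

def gang_schedule(jobs, free_gpus):
    """All-or-nothing admission: a job starts only if every one of its
    tasks can be granted resources, and the next job is not considered
    until the job at the head of the queue has been fully satisfied."""
    queue = deque(jobs)
    running = []
    while queue:
        name, gpus_per_task, n_tasks = queue[0]
        demand = gpus_per_task * n_tasks
        if demand > free_gpus:
            break  # head-of-line job is blocked; do not skip ahead
        queue.popleft()
        free_gpus -= demand
        running.append(name)
    return running, free_gpus

# "c" would fit in the 6 GPUs left after "a", but gang scheduling
# refuses to jump past the blocked job "b"
running, free = gang_schedule([("a", 1, 2), ("b", 2, 4), ("c", 1, 1)], free_gpus=8)
print(running, free)  # ['a'] 6
```

This avoids the deadlock in which several distributed jobs each hold part of their GPUs and none can start, at the cost of some head-of-line blocking.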
3. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (3), an independent job manager is started for each deep learning job; the job manager is responsible for the job's life cycle management, including applying for resources, pushing the job to computing nodes, monitoring the running state, and retrying failed tasks.
4. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (4), the task division of the deep learning job is determined before execution, so the scheduling system only needs to allocate resources and determine a resource scheduling scheme for the pre-divided tasks.
5. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (5), an intelligent job resource demand prediction model is established; its input features include the task division, the hyper-parameter settings, and the data set scale, and its output label is the job execution resource demand vector, which covers CPU, memory, GPU computing power, GPU memory, and network bandwidth; the regression problem corresponding to the model is solved with a traditional machine learning algorithm.
6. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (5), the actual resource demand characteristics of similar jobs are collected from historical jobs, and the intelligent job resource demand prediction model uses them to predict the resource demand characteristics of subsequent deep learning jobs.
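The idea of claims 5 and 6 can be sketched as predicting a new job's demand vector from the recorded demand of the most similar historical job. The patent formulates this as a regression problem solved with a traditional machine learning algorithm; a nearest-neighbour lookup stands in here for illustration, and the feature tuple, demand tuple, and all numbers are made up:

```python
def predict_demand(job_features, history):
    """Return the demand vector of the historical job whose input
    features are closest to `job_features`.
    Features: (task_count, batch_size, dataset_gb).
    Demand:   (cpu_cores, mem_gb, gpu_compute_share, gpu_mem_gb, bw_gbps).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, demand = min(history, key=lambda rec: sq_dist(rec[0], job_features))
    return demand

# actual demand measured for previously executed jobs (illustrative)
history = [
    ((2, 32, 10),   (4, 16, 0.5, 8, 1)),
    ((8, 256, 150), (16, 64, 1.0, 32, 10)),
]
print(predict_demand((2, 64, 12), history))  # close to the first job
```

A real implementation would normalize the features and fit a regression model over many historical records rather than copying one neighbour.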
7. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (6), the execution order of jobs is first determined and the job to be scheduled is selected, following fair scheduling between queues and first-come-first-served scheduling within each queue; then the job's distributed topology and the cluster network topology are extracted, and a network communication cost model is established from the job's resource demand characteristics; finally, a resource scheduling scheme is generated by solving this model with a heuristic genetic algorithm.
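The heuristic genetic algorithm of claim 7 can be sketched as evolving task-to-node placements that minimise cross-node communication under per-node GPU capacity. This is an illustrative toy, not the patent's model: the cost function, one-GPU-per-task assumption, and all GA parameters are assumptions:

```python
import random

def schedule_ga(n_tasks, nodes, comm, generations=60, pop_size=30, seed=0):
    """Place n_tasks on nodes (nodes[k] = GPU capacity of node k, one GPU
    per task) minimising the communication volume comm[i][j] (i < j) that
    crosses node boundaries. Returns (placement, cost)."""
    rng = random.Random(seed)

    def cost(placement):
        load = [0] * len(nodes)
        for node in placement:
            load[node] += 1
        # heavy penalty for exceeding a node's GPU capacity
        penalty = sum(max(0, l - cap) for l, cap in zip(load, nodes)) * 1e6
        cross = sum(comm[i][j]
                    for i in range(n_tasks) for j in range(i + 1, n_tasks)
                    if placement[i] != placement[j])
        return penalty + cross

    # random initial population of placements
    pop = [[rng.randrange(len(nodes)) for _ in range(n_tasks)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_tasks)       # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                # mutation
                child[rng.randrange(n_tasks)] = rng.randrange(len(nodes))
            children.append(child)
        pop = survivors + children
    best = min(pop, key=cost)
    return best, cost(best)

# tasks 0-1 and 2-3 talk heavily to each other; two nodes with 2 GPUs each
comm = [
    [0, 10, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 10],
    [0, 0, 0, 0],
]
placement, total = schedule_ga(4, [2, 2], comm)
print(placement, total)  # chatty pairs end up co-located
```

The patent's solver would additionally fold the cluster network topology and the predicted bandwidth demand into the cost model; the GA skeleton is the same.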
CN202011310749.8A 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method Active CN112416585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011310749.8A CN112416585B (en) 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011310749.8A CN112416585B (en) 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method

Publications (2)

Publication Number Publication Date
CN112416585A true CN112416585A (en) 2021-02-26
CN112416585B CN112416585B (en) 2024-03-15

Family

ID=74776959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011310749.8A Active CN112416585B (en) 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method

Country Status (1)

Country Link
CN (1) CN112416585B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094116A (en) * 2021-04-01 2021-07-09 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on load characteristic analysis
CN113608722A (en) * 2021-07-31 2021-11-05 云南电网有限责任公司信息中心 Algorithm packaging method based on distributed technology
CN113791906A (en) * 2021-08-09 2021-12-14 戴西(上海)软件有限公司 Scheduling system and optimization algorithm based on GPU resources in artificial intelligence and engineering fields
CN115202850A (en) * 2022-09-09 2022-10-18 国家超级计算天津中心 Job scheduling method and device, electronic equipment and storage medium
WO2022262167A1 (en) * 2021-06-15 2022-12-22 上海商汤科技开发有限公司 Cluster resource scheduling method and apparatus, electronic device and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106959891A (en) * 2017-03-30 2017-07-18 山东超越数控电子有限公司 A kind of cluster management method and system for realizing GPU scheduling
CN108881446A (en) * 2018-06-22 2018-11-23 深源恒际科技有限公司 A kind of artificial intelligence plateform system based on deep learning
CN109034396A (en) * 2018-07-11 2018-12-18 北京百度网讯科技有限公司 Method and apparatus for handling the deep learning operation in distributed type assemblies
CN109086134A (en) * 2018-07-19 2018-12-25 郑州云海信息技术有限公司 A kind of operation method and device of deep learning operation
CN109189401A (en) * 2018-07-06 2019-01-11 曙光信息产业(北京)有限公司 A kind of dispositions method and system of deep learning frame
US20190087383A1 (en) * 2017-09-19 2019-03-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Intelligent big data system, and method and apparatus for providing intelligent big data service
US20190332422A1 (en) * 2018-04-26 2019-10-31 International Business Machines Corporation Dynamic accelerator scheduling and grouping for deep learning jobs in a computing cluster
CN110399222A (en) * 2019-07-25 2019-11-01 北京邮电大学 GPU cluster deep learning task parallel method, device and electronic equipment
CN110442451A (en) * 2019-07-12 2019-11-12 中电海康集团有限公司 A kind of polymorphic type GPU cluster resource management dispatching method and system towards deep learning
CN111090456A (en) * 2019-12-06 2020-05-01 浪潮(北京)电子信息产业有限公司 Construction method, device, equipment and medium for deep learning development environment
KR102140730B1 (en) * 2019-12-17 2020-08-04 (주) 씨이랩 Method and system for providing develop environment of deep learning based gpu
CN111694656A (en) * 2020-04-22 2020-09-22 北京大学 Cluster resource scheduling method and system based on multi-agent deep reinforcement learning


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
VÍCTOR CAMPOS ET AL.: "Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster", Procedia Computer Science *
LIU Hui et al.: "A self-learning load-balancing scheduling algorithm for heterogeneous GPU clusters", Journal of Xi'an Shiyou University (Natural Science Edition), vol. 30, no. 3
LIN Jian et al.: "Research on the adaptation problem of deep learning cloud services", Software Guide (软件导刊) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094116A (en) * 2021-04-01 2021-07-09 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on load characteristic analysis
CN113094116B (en) * 2021-04-01 2022-10-11 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on load characteristic analysis
WO2022262167A1 (en) * 2021-06-15 2022-12-22 上海商汤科技开发有限公司 Cluster resource scheduling method and apparatus, electronic device and storage medium
CN113608722A (en) * 2021-07-31 2021-11-05 云南电网有限责任公司信息中心 Algorithm packaging method based on distributed technology
CN113791906A (en) * 2021-08-09 2021-12-14 戴西(上海)软件有限公司 Scheduling system and optimization algorithm based on GPU resources in artificial intelligence and engineering fields
CN115202850A (en) * 2022-09-09 2022-10-18 国家超级计算天津中心 Job scheduling method and device, electronic equipment and storage medium
CN115202850B (en) * 2022-09-09 2022-12-20 国家超级计算天津中心 Job scheduling method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112416585B (en) 2024-03-15

Similar Documents

Publication Publication Date Title
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
JP4781089B2 (en) Task assignment method and task assignment device
US20200257968A1 (en) Self-learning scheduler for application orchestration on shared compute cluster
Gu et al. Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters
CN113377540A (en) Cluster resource scheduling method and device, electronic equipment and storage medium
CN111381950A (en) Task scheduling method and system based on multiple copies for edge computing environment
CN114741207B (en) GPU resource scheduling method and system based on multi-dimensional combination parallelism
CN111768006A (en) Artificial intelligence model training method, device, equipment and storage medium
TWI786564B (en) Task scheduling method and apparatus, storage media and computer equipment
CN114237869B (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN111274036A (en) Deep learning task scheduling method based on speed prediction
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN115586961A (en) AI platform computing resource task scheduling method, device and medium
CN113946431B (en) Resource scheduling method, system, medium and computing device
AlOrbani et al. Load balancing and resource allocation in smart cities using reinforcement learning
CN110221902A (en) A kind of data transmission method and relevant apparatus based on virtual machine
CN116010051A (en) Federal learning multitasking scheduling method and device
CN116932201A (en) Multi-resource sharing scheduling method for deep learning training task
CN114677222A (en) Parallel transaction processing method, system and computer storage medium for block chain
Thai et al. Algorithms for optimising heterogeneous Cloud virtual machine clusters
US11934870B2 (en) Method for scheduling a set of computing tasks in a supercomputer
Sunder et al. Load balancing optimization based on enhanced genetic algorithm in cloud computing
US10402514B2 (en) Modeling and simulation of distributed computing frameworks
Biswas et al. A Machine Learning Approach for Predicting Efficient CPU Scheduling Algorithm
CN113391886A (en) Task scheduling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant