CN112416585A - GPU resource management and intelligent scheduling method for deep learning - Google Patents

GPU resource management and intelligent scheduling method for deep learning

Info

Publication number
CN112416585A
Authority
CN
China
Prior art keywords
deep learning
resource
job
scheduling
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011310749.8A
Other languages
Chinese (zh)
Other versions
CN112416585B (en)
Inventor
顾荣
刘率
王肇康
袁春风
黄宜华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202011310749.8A priority Critical patent/CN112416585B/en
Publication of CN112416585A publication Critical patent/CN112416585A/en
Application granted granted Critical
Publication of CN112416585B publication Critical patent/CN112416585B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/45558 — Hypervisor-specific management and integration aspects
    • G06F 9/4881 — Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/5011 — Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016 — Allocation of resources to service a request, the resource being the memory
    • G06N 20/00 — Machine learning
    • G06N 3/126 — Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G06F 2009/4557 — Distribution of virtual machine instances; migration and load balancing
    • G06F 2009/45575 — Starting, stopping, suspending or resuming virtual machine instances
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a deep-learning-oriented GPU resource management and intelligent scheduling method comprising the following steps: first, a user submits a deep learning job through a front-end interface component, the job comprising the deep learning program to be executed and its training data set; second, after verification, the job is added to the pending queue of the corresponding scheduler; third, an independent job manager is started for the job; fourth, the job manager applies to the resource manager for the computing resources the job needs; fifth, the runtime features of the pending job are modeled and analyzed; sixth, a resource scheduling scheme is generated from the job features and the characteristics of the cluster's compute nodes; seventh, the job is scheduled to the designated compute nodes according to the scheduling scheme; and eighth, the job executor starts containers and executes the deep learning program. The method addresses the low GPU resource utilization and poor job execution performance of conventional cluster resource scheduling methods in deep learning scenarios.

Description

GPU resource management and intelligent scheduling method for deep learning
Technical Field
The invention relates to the technical field of cluster resource scheduling, and in particular to a deep-learning-oriented GPU resource management and intelligent scheduling method.
Background
Research and practice in recent years show that deep learning achieves higher accuracy than traditional machine learning techniques in fields such as computer vision and speech recognition, and it is therefore widely used. Deep learning model training is computationally intensive; a Graphics Processing Unit (GPU) executes such simple but large-scale computational tasks more efficiently, and has therefore become an important basic computing resource for running deep learning programs.
Since GPU cards are often expensive, deploying an independent private cluster for each user (group) is costly, and users do not perform model training continuously, so users often share GPU resources to reduce cost. To avoid conflicts and make full use of cluster resources, large pools of resources such as GPUs must be managed efficiently, and user jobs must be scheduled uniformly and reasonably.
GPU resource management and scheduling in deep learning scenarios faces the following problems:
In terms of resource utilization: with the rapid development of hardware technology, new GPU cards are released continuously, so a cluster usually contains GPU cards of different models that differ greatly in both compute capability and video memory. Indiscriminate allocation leaves some jobs under-provisioned during execution and others over-provisioned. Moreover, because mature and efficient GPU virtualization technology is lacking, a GPU is usually used exclusively by one job; yet small development and test jobs have low resource demands, so exclusive use aggravates resource waste.
In terms of resource scheduling strategy: much deep learning model training is still performed on a single machine with a single GPU card. However, in pursuit of higher accuracy, model networks grow ever deeper, parameters ever more numerous, and training data sets ever larger; a single GPU card can hardly accommodate them and becomes a performance bottleneck, so distributed training modes of single-machine multi-card and multi-machine multi-card have emerged. Unlike big data applications, the multiple instances of a distributed deep learning job exchange complex and massive data and synchronize information with one another, so an unreasonable resource scheduling scheme can greatly degrade job execution performance.
Therefore, designing a scheduling mechanism such that the scheduler still achieves good GPU resource utilization and job execution performance in a deep learning scenario is a very challenging task.
Disclosure of Invention
Purpose of the invention: in view of the problems and deficiencies of the prior art, the invention aims to provide a deep-learning-oriented GPU resource management and intelligent scheduling method that solves the low GPU resource utilization and poor job execution performance of existing systems in deep learning scenarios.
Technical scheme: to achieve the above purpose, the technical solution adopted by the invention provides a deep-learning-oriented GPU resource management and intelligent scheduling method comprising the following steps:
(1) a user submits a deep learning job ("job" for short) through a front-end interface component, the job comprising the deep learning program to be executed, the program's input data set, and the job's task-division information;
(2) the job undergoes parameter validity checking and authority verification and is then added to the specified pending queue to await scheduling;
(3) when the job is selected for scheduling, an independent job manager is started for it to take charge of its subsequent operation;
(4) the job manager applies to the global resource manager for the computing resources each task needs, according to the task division of the deep learning job;
(5) the job's features are modeled and analyzed by an intelligent prediction model of job resource demand ("prediction model" for short), covering the job's runtime GPU compute power, GPU video memory, CPU, memory and network bandwidth demands, and a job-execution resource demand vector is generated;
(6) a resource scheduling scheme for the job is generated from the resource demand vector returned in step (5), combined with the job's distributed architecture and the cluster network topology;
(7) the job is scheduled to the designated compute nodes through a push mechanism according to the resource scheduling scheme;
(8) the job executor starts a separate running container for each task of the job to execute the deep learning program.
Further, in step (3): since most current deep learning frameworks lack an elasticity mechanism, a group (gang) scheduling mechanism is adopted to avoid the resource deadlock that can arise when multiple jobs are scheduled simultaneously; that is, resources are allocated to the next job only after all resource requirements of the previous job have been satisfied.
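The all-or-nothing condition of group scheduling can be sketched in a few lines — a minimal illustration under assumed `Job`/`Cluster` types (hypothetical names, not the patent's implementation), counting only GPUs:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    task_demands: list  # GPUs required by each task of the job

@dataclass
class Cluster:
    free_gpus: int

    def can_satisfy(self, job: Job) -> bool:
        # Gang condition: every task of the job must be placeable at once.
        return sum(job.task_demands) <= self.free_gpus

def gang_schedule(queue: list, cluster: Cluster) -> list:
    """Allocate jobs strictly in order; stop at the first job whose full
    resource demand cannot be met, avoiding partial allocation and the
    resulting resource deadlock."""
    started = []
    for job in queue:
        if not cluster.can_satisfy(job):
            break  # the next job waits until this one fits entirely
        cluster.free_gpus -= sum(job.task_demands)
        started.append(job.name)
    return started
```

For example, with 5 free GPUs, a job with tasks needing [2, 2] GPUs starts, but a following job needing [4] waits rather than grabbing a partial allocation.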
Further, in step (3): to reduce the load on the job scheduler, an independent job manager is started for each job. The job manager is responsible for the job's life-cycle management, which includes resource application, pushing tasks to compute nodes, monitoring the running state, retrying failed tasks, and so on.
Further, in step (4): since most current distributed deep learning frameworks use a static mapping mode, the job's task division is already fixed before execution, so the scheduling system only needs to allocate resources and decide a resource scheduling scheme for the pre-divided tasks.
Further, in step (5): an intelligent prediction model of job resource demand is established. Its input features include the task division, hyper-parameter settings and dataset scale; its output label is the job resource demand vector ("vector" for short), covering CPU, memory, GPU compute power, GPU video memory and network bandwidth. The regression problem corresponding to the prediction model is solved with a conventional machine learning algorithm.
Further, in step (5): features such as the actual resource usage of similar jobs during historical runs are collected, and the prediction model then predicts the resource demand features of subsequent jobs.
Further, in step (6): first, the execution order of jobs is determined by fair scheduling between queues and first-come-first-served scheduling within a queue, and the job to be scheduled is selected; then the job's distributed topology and the cluster network topology are extracted, and a network communication cost model is built from the job's resource demand features; finally, a heuristic genetic algorithm solves the model and generates the scheduling scheme.
Beneficial effects: through the deep-learning-oriented GPU resource management and intelligent scheduling method, the invention effectively addresses low GPU resource utilization and poor job execution performance in deep learning scenarios. First, the invention abstracts the common features of existing mainstream deep learning frameworks and provides a framework-independent service interface, giving good framework compatibility and usability. Second, the invention provides an intelligent prediction model of job resource demand that predicts a pending job's runtime features from historical scheduling data, automatically determining the job resource demand vector and strengthening scheduling. Third, unlike prior methods that treat the pending job entirely as a black box, the invention uses the collected information and considers both the job's distributed topology and the cluster network topology during scheduling, generating a more efficient scheduling scheme and improving job execution performance.
Drawings
FIG. 1 is a schematic flow diagram of the overall process of the present invention;
FIG. 2 is a schematic diagram illustrating a resource scheduling scheme using block coding according to the present invention;
fig. 3 is a flowchart of the scheduling policy of the present invention.
Detailed Description
The present invention is further illustrated by the following figures and specific examples. These embodiments are to be understood as merely illustrating, not limiting, the invention; after reading this specification, modifications in various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims.
The invention provides a deep learning-oriented GPU resource management and intelligent scheduling method, which solves the problems of low GPU resource utilization rate and poor operation execution performance in a deep learning scene.
As shown in fig. 1, the complete process of the present invention comprises eight stages: job submission, authority verification, job manager startup, resource application, job feature modeling and analysis, resource scheduling scheme generation, job distribution, and execution. The specific embodiments are described below.
The job submission stage corresponds to step (1) of the technical scheme. Specific implementation: a user submits a deep learning job through the visual management front end or an API (application programming interface); the job comprises an executable deep learning program, the input training set the program needs, the job's task division, and the program's startup parameters. In the scheduling system of the invention, a job is defined as follows: a job consists of several tasks; for example, in the Parameter Server architecture these include both parameter servers and worker nodes (Workers). A single-machine single-card or single-machine multi-card job has only one task, whereas a multi-machine multi-card job comprises multiple tasks, each corresponding to one parameter server or worker node. Since most deep learning frameworks currently lack an elasticity mechanism, the number and division of tasks are specified by the user at submission. During scheduling, one task is scheduled to run on one compute node (physical machine), and one compute node may run multiple tasks simultaneously. Jobs are the basic unit of user submission; tasks are the basic unit the system schedules for execution.
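The job/task model described above can be written down as a small data structure — an illustrative sketch with hypothetical type names, not the system's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Role(Enum):
    PS = "parameter_server"
    WORKER = "worker"

@dataclass
class Task:
    role: Role   # in the Parameter Server architecture: PS or Worker
    gpus: int    # GPU cards this task asks for

@dataclass
class Job:
    """Basic unit of user submission; tasks are the basic unit scheduled."""
    program: str                     # executable deep learning program
    dataset: str                     # input training set
    tasks: list = field(default_factory=list)  # fixed at submission time

    def is_distributed(self) -> bool:
        # Multi-machine multi-card jobs have several tasks.
        return len(self.tasks) > 1

# A multi-machine multi-card job: one parameter server and two workers.
job = Job("train.py", "imagenet",
          [Task(Role.PS, 0), Task(Role.WORKER, 2), Task(Role.WORKER, 2)])
```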
The authority verification stage corresponds to step (2). Specific implementation: after receiving the job submitted by the user, the system checks the validity and integrity of the job parameters and verifies that the user has the right to submit to the specified pending queue; once verification passes, the job is added to that queue of the scheduler and the request is recorded.
The job manager startup stage corresponds to step (3). Specific implementation: the scheduler determines the execution order of pending jobs according to the fair scheduling principle. When a job is selected to begin scheduled execution, a separate job manager is started to take charge of the job's subsequent life-cycle flow.
The resource application stage corresponds to step (4). Specific implementation: the job manager applies for computing resources for each task according to the job's task division, until the resource requirements of all the job's tasks are satisfied.
The job feature modeling and analysis stage corresponds to step (5). Specific implementation: the scheduling system collects the actual usage of CPU, memory, GPU compute power, GPU video memory and network bandwidth by similar jobs during historical runs, trains a job resource demand vector prediction model on this data using a random forest algorithm (a conventional machine learning model), and then uses the model to predict the runtime resource demand vector of the pending job, so that suitable resources can be allocated and a suitable scheduling scheme selected.
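A minimal stand-in for this prediction step can be sketched by grouping historical usage records under a job signature (task division, hyper-parameter settings, dataset scale) and averaging them — a deliberately simple substitute for the random-forest regressor; all names and numbers here are hypothetical:

```python
from collections import defaultdict

RESOURCES = ("cpu", "mem_gb", "gpu_flops", "gpu_mem_gb", "net_mbps")

def train_predictor(history):
    """Stand-in for the random-forest regressor: average the observed
    resource usage of historical jobs that share a signature."""
    by_sig = defaultdict(list)
    for signature, usage in history:
        by_sig[signature].append(usage)
    return {sig: tuple(sum(col) / len(col) for col in zip(*rows))
            for sig, rows in by_sig.items()}

def predict(model, signature):
    """Return the predicted resource demand vector for a pending job."""
    return dict(zip(RESOURCES, model[signature]))

# Two historical runs of a similar job (made-up figures).
history = [(("2workers", "bs=64", "10GB"), (8, 32, 14.0, 16, 900)),
           (("2workers", "bs=64", "10GB"), (8, 30, 12.0, 16, 1100))]
model = train_predictor(history)
demand = predict(model, ("2workers", "bs=64", "10GB"))  # averaged vector
```

The averaged `demand` dictionary plays the role of the job-execution resource demand vector handed to the scheduler in step (6).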
The resource scheduling scheme generation stage corresponds to step (6). Specific implementation: first, compute nodes and GPUs that do not meet the requirements are filtered out to obtain a candidate node list; next, using the job resource demand vector returned in step (5), the GPU model that best matches the job's GPU compute power requirement is determined, and nodes with that GPU model are selected from the candidates as the next candidate list (if that model's GPU resources are insufficient, a model of similar performance is used instead); finally, a heuristic grouping genetic algorithm generates a better resource scheduling scheme.
To solve the invention's resource scheduling problem, the scheduling target must be formally defined. For a distributed model-training job, different resource scheduling schemes greatly affect job execution performance, which is determined mainly by network communication quality; network communication must therefore be considered when generating the resource scheduling scheme, and other indicators, such as reducing resource fragmentation, should be considered as far as possible on that basis. Different scheduling schemes are first evaluated on factors such as the job topology and the cluster network topology, and a Score is computed from two parts: the network communication overhead Cost_network and the node matching fitness Fitness_node. The goal of scheduling is to minimize this score.

Score = Cost_network + Σ Fitness_node
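This scoring objective can be exercised on a toy example — the cost and fitness terms below are made up purely for illustration and are not the patent's actual cost model:

```python
def score(scheme, comm_cost, node_fitness):
    """Score = Cost_network + sum of per-node Fitness_node; lower is better."""
    return comm_cost(scheme) + sum(node_fitness(n) for n in scheme)

# Toy cost model: scheme maps node -> tasks; cross-node traffic dominates.
def comm_cost(scheme):
    nodes_used = sum(1 for tasks in scheme.values() if tasks)
    return 10 * max(0, nodes_used - 1)   # pay for every extra node crossed

def node_fitness(node):
    # 0 would be a perfect GPU-model match; positive values penalize mismatch.
    return {"n1": 0, "n2": 3}.get(node, 5)

compact   = {"n1": ["t1", "t2"], "n2": []}
scattered = {"n1": ["t1"], "n2": ["t2"]}
assert score(compact, comm_cost, node_fitness) < score(scattered, comm_cost, node_fitness)
```

Under this toy model, packing both tasks onto one node avoids the cross-node communication charge and therefore scores lower (better).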
The following describes how the resource scheduling scheme generation problem of the invention is solved with a grouping genetic algorithm:
1) Encoding. Fig. 2 illustrates how group encoding represents a resource scheduling scheme under the invention's scheduling policy (two schemes that place 8 tasks onto several compute nodes). In genetic-algorithm terms, each chromosome represents a resource scheduling scheme, each gene locus represents a task, and each genome represents a compute node; crossover and mutation operate on whole compute nodes as units.
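The group (block) encoding can be shown directly — a hypothetical pair of chromosomes placing 8 tasks, in the spirit of fig. 2:

```python
# One chromosome = one resource scheduling scheme.
# Keys are compute nodes (the genome unit for crossover/mutation);
# values are the tasks (gene loci) placed on that node.
chromosome_a = {"node1": ["t1", "t2", "t3"], "node2": ["t4", "t5"],
                "node3": ["t6", "t7", "t8"]}
chromosome_b = {"node1": ["t1", "t4"], "node2": ["t2", "t3", "t5"],
                "node3": ["t6"], "node4": ["t7", "t8"]}

def tasks_of(chromosome):
    """In a valid chromosome every task appears exactly once."""
    return sorted(t for tasks in chromosome.values() for t in tasks)

# Both schemes place the same 8 tasks, just grouped differently.
assert tasks_of(chromosome_a) == tasks_of(chromosome_b)
```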
2) Initial population generation. Two simple algorithms, First-Fit and Random-Fit, generate several initial resource scheduling schemes. First-Fit schedules each task onto the first compute node where it can be placed; Random-Fit randomly selects a node that meets the requirement. Both algorithms have low complexity and run fast enough.
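The two initializers can be sketched as follows, reducing each task's demand to a single GPU count (a hypothetical simplification of the full demand vector):

```python
import random

def first_fit(tasks, capacity):
    """Schedule each task onto the first compute node that can hold it."""
    placement = {node: [] for node in capacity}
    free = dict(capacity)
    for task, need in tasks:
        for node in capacity:
            if free[node] >= need:
                placement[node].append(task)
                free[node] -= need
                break
        else:
            raise RuntimeError(f"no node can place {task}")
    return placement

def random_fit(tasks, capacity, rng=random.Random(0)):
    """Schedule each task onto a randomly chosen node that meets its need."""
    placement = {node: [] for node in capacity}
    free = dict(capacity)
    for task, need in tasks:
        node = rng.choice([n for n in capacity if free[n] >= need])
        placement[node].append(task)
        free[node] -= need
    return placement

# Tasks as (name, GPUs needed); two nodes with 4 and 2 free GPUs.
tasks = [("t1", 2), ("t2", 2), ("t3", 1)]
capacity = {"n1": 4, "n2": 2}
```

First-Fit is deterministic; Random-Fit yields diverse placements, which is exactly what an initial population needs.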
3) Fitness function and selection strategy. Since the invention's scheduling goal is to reduce communication overhead, the fitness function Fitness is the negative of the scheduling target (the smaller the network communication overhead, the better the fitness). To speed up convergence, a new variable NumNodes_using, the number of compute nodes the scheme requires, is added on this basis, so that when network overheads are close the algorithm preferentially selects the resource scheduling scheme that uses fewer nodes.

Fitness = -(Cost_network + Σ Fitness_node + NumNodes_using)
The tournament method is chosen as the selection strategy. It runs multiple elimination rounds and keeps the best candidate in each; it requires no full sort of the population, has low complexity, can be parallelized, and has small time overhead, making it well suited to the online scheduling scenario of the invention.
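Tournament selection itself is short — a generic sketch, not the patent's exact parameters:

```python
import random

def tournament_select(population, fitness, k=3, rng=random.Random(42)):
    """One tournament round: sample k candidates and keep the fittest.
    No full sort of the population is needed, so rounds are cheap and
    independent rounds can run in parallel."""
    return max(rng.sample(population, k), key=fitness)

# Schemes identified by id; fitness is the negated score (higher is better).
population = ["s1", "s2", "s3", "s4", "s5"]
fitness = {"s1": -30, "s2": -12, "s3": -25, "s4": -18, "s5": -40}.get
winner = tournament_select(population, fitness)  # fittest of a random trio
```

With `k` equal to the population size the tournament degenerates into picking the global best; smaller `k` trades selection pressure for diversity.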
4) Crossover and mutation rules. Crossover proceeds as follows: first, two schemes X and Y are selected from the current resource scheduling schemes by the selection strategy, and a crossover point (a compute node) and a crossover position are chosen in each; next, the selected compute node and the tasks on it are inserted at the crossover position of the other scheme; then, because the new scheme may now contain duplicate compute nodes and tasks, the duplicates must be deleted, and since the basic unit of crossover and mutation is a compute node, the duplicated nodes and the nodes holding duplicated tasks are deleted; finally, because the tasks on deleted nodes are removed along with them, those tasks are re-placed onto the remaining compute nodes using the First-Fit algorithm. Mutation is similar: a resource scheduling scheme Y is selected, a compute node is chosen at random, the node and the tasks on it are deleted, and the deleted tasks are re-placed onto the remaining compute nodes by First-Fit, yielding a new scheme.
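The crossover repair described above (inject one node, clear duplicates, re-place orphaned tasks with First-Fit) can be sketched as follows — a simplified illustration that clears duplicated nodes rather than removing them from the cluster, with unit-size tasks in the example:

```python
import copy

def first_fit_place(tasks, scheme, capacity, need):
    """Re-insert orphaned tasks onto the first node with enough room."""
    load = {n: sum(need[t] for t in ts) for n, ts in scheme.items()}
    for task in tasks:
        for node in scheme:
            if load[node] + need[task] <= capacity[node]:
                scheme[node].append(task)
                load[node] += need[task]
                break

def crossover(x, y, cross_node, capacity, need):
    """Inject one compute node (the genome unit) from scheme x into y,
    then repair: clear the duplicated node and any node holding a
    duplicated task, and re-place the displaced tasks with First-Fit."""
    child = copy.deepcopy(y)
    injected = set(x[cross_node])
    orphans = []
    for node in child:
        if node == cross_node or injected & set(child[node]):
            # Keep this node's non-duplicate tasks for re-placement.
            orphans += [t for t in child[node] if t not in injected]
            child[node] = []
    child[cross_node] = list(x[cross_node])
    first_fit_place(orphans, child, capacity, need)
    return child

# Unit-size tasks on two 2-GPU nodes (hypothetical example).
x = {"n1": ["t1", "t2"], "n2": ["t3", "t4"]}
y = {"n1": ["t1", "t3"], "n2": ["t2", "t4"]}
need = {t: 1 for t in ["t1", "t2", "t3", "t4"]}
child = crossover(x, y, "n1", {"n1": 2, "n2": 2}, need)
```

The child inherits x's grouping of n1 intact, and the displaced tasks t3 and t4 are re-placed by First-Fit, so every task still appears exactly once.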
Fig. 3 shows the resource scheduling scheme generation flow of the invention. Because of resource fragmentation, the selected jobs may not all be schedulable at once; in that case the scheduler waits for a period and tries again to generate a valid resource scheduling scheme. When resources are scattered, the generated scheme may not be good enough, and the scheduler can likewise wait a period to see whether more, and more suitable, resources are released. Finally, once the generated scheme is good enough or the scheduling algorithm's running time exceeds its limit, the job is scheduled according to the current best resource scheduling scheme.
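This wait-and-retry loop can be sketched as follows — `generate_scheme` and `good_enough` are hypothetical hooks standing in for the genetic algorithm and the quality threshold:

```python
import time
from types import SimpleNamespace

def schedule_with_retry(generate_scheme, good_enough, wait_s=5, timeout_s=60):
    """Regenerate the scheduling scheme while resources are fragmented or
    the scheme is poor; on algorithm timeout, fall back to the best seen."""
    best = None
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        scheme = generate_scheme()   # may be None if nothing fits yet
        if scheme is not None and (best is None or scheme.score < best.score):
            best = scheme
        if best is not None and good_enough(best):
            return best              # good enough: schedule the job now
        time.sleep(wait_s)           # wait for more resources to be released
    return best                       # timeout: use the current best scheme

# Simulated attempts: no scheme, a mediocre one, then a good one.
schemes = iter([None, SimpleNamespace(score=20), SimpleNamespace(score=5)])
result = schedule_with_retry(lambda: next(schemes, None),
                             good_enough=lambda s: s.score <= 5,
                             wait_s=0, timeout_s=5)
```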
The job distribution stage corresponds to step (7). Specific implementation: after the resource requirements of all the job's tasks are satisfied, the job manager pushes the job's tasks to the corresponding compute nodes according to the resource scheduling scheme, where they wait to be executed.
The job execution stage corresponds to step (8). Specific implementation: first, a corresponding running environment (container) is created for the job, and the container's available resources are limited according to the job's resource demand; after the container starts, the user's deep learning program contained in the job is downloaded to a specified location inside the container; the training data set needed for model training is then mounted at the corresponding local directory; next, the user's deep learning program ("program" for short) is launched via a start command and its running state is continuously monitored; finally, after the program finishes, its output files are transferred to external reliable storage (HDFS), the container is destroyed, and the system resources it occupied are released.
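The container lifecycle above can be exercised with a stand-in container object — `FakeContainer` merely records the calls and is not a real container runtime API:

```python
class FakeContainer:
    """Stand-in for a real container runtime: records lifecycle calls so
    the flow can be exercised without a container daemon (hypothetical API)."""
    def __init__(self, limits):
        self.limits, self.log = limits, ["created"]   # resources capped here
    def download(self, program): self.log.append(f"downloaded {program}")
    def mount(self, dataset): self.log.append(f"mounted {dataset}")
    def run(self, cmd): self.log.append(f"ran {cmd}")
    def upload_output(self, dest): self.log.append(f"uploaded to {dest}")
    def destroy(self): self.log.append("destroyed")

def execute_task(container, program, dataset, cmd, hdfs_dest):
    """One task's lifecycle, following the stages above: fetch the user
    program, mount the training set, run and monitor, archive the output,
    and always destroy the container to release its resources."""
    try:
        container.download(program)
        container.mount(dataset)
        container.run(cmd)
        container.upload_output(hdfs_dest)
    finally:
        container.destroy()   # release resources even on failure

c = FakeContainer({"gpus": 1, "mem_gb": 16})
execute_task(c, "train.py", "imagenet", "python train.py", "hdfs:///out")
```

The `try`/`finally` mirrors the requirement that the container is destroyed and its resources released whether or not the program succeeds.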
The invention provides a deep-learning-oriented GPU resource management and intelligent scheduling method. By modeling and analyzing jobs, the system can effectively predict job resource demand in advance. Compared with common scheduling methods (scattered, centralized, random), the invention's resource scheduling method reduces the execution time of a single job by 33.5% to 59.5%, and further reduces the average Job Completion Time (JCT) of multiple jobs by 10%. Compared with the conventional Kubernetes system, it reduces average job completion time by 48%. As for system scalability, the scheduling system's throughput remains stable as cluster nodes are added, showing good scalability. The deep-learning-oriented GPU resource management and intelligent scheduling method proposed by the invention thus has a significant performance optimization effect.

Claims (7)

1. A deep-learning-oriented GPU resource management and intelligent scheduling method, comprising the following steps:
(1) a user submits a deep learning job through a front-end interface component, the job comprising the deep learning program to be executed, the program's input data set, and the job's task-division information;
(2) performing parameter validity checking and authority verification on the deep learning job, then adding it to the specified pending queue to await scheduling;
(3) when the deep learning job is selected for scheduling, starting an independent job manager for it to take charge of its subsequent operation;
(4) the job manager applying to the global resource manager for the computing resources each task needs, according to the task division of the deep learning job;
(5) modeling and analyzing the job's features with the intelligent prediction model of job resource demand, covering the GPU compute power, GPU video memory, CPU, memory and network bandwidth resource demand features at job runtime, and generating a job-execution resource demand vector;
(6) generating a resource scheduling scheme for the deep learning job from the resource demand vector returned in step (5), combined with the job's distributed architecture and the cluster network topology;
(7) scheduling the deep learning job to the designated compute nodes through a push mechanism according to the resource scheduling scheme;
(8) the job executor starting a separate running container for each task of the deep learning job to execute the deep learning program.
2. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (3), a gang (group) scheduling mechanism is adopted: resource allocation for the next job starts only after all resource requirements of the previous job have been satisfied.
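The gang-scheduling rule of claim 2 can be sketched as follows; a minimal, hypothetical illustration in which jobs hold only GPUs and each queue entry is `(name, gpus_per_task, n_tasks)` (these simplifications are assumptions, not from the patent):

```python
from collections import deque

def gang_schedule(jobs, free_gpus):
    """All-or-nothing admission: a job starts only if every one of its
    tasks can be granted resources, and the next job is not considered
    until the job at the head of the queue has been fully satisfied."""
    queue = deque(jobs)
    running = []
    while queue:
        name, gpus_per_task, n_tasks = queue[0]
        demand = gpus_per_task * n_tasks
        if demand > free_gpus:
            break  # head-of-line job is blocked; do not skip ahead
        queue.popleft()
        free_gpus -= demand
        running.append(name)
    return running, free_gpus

# "c" would fit in the 6 GPUs left after "a", but gang scheduling
# refuses to jump past the blocked job "b"
running, free = gang_schedule([("a", 1, 2), ("b", 2, 4), ("c", 1, 1)], free_gpus=8)
print(running, free)  # ['a'] 6
```

This avoids the deadlock in which several distributed jobs each hold part of their GPUs and none can start, at the cost of some head-of-line blocking.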
3. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (3), an independent job manager is started for each deep learning job; the job manager is responsible for the job's life cycle management, including applying for resources, pushing the job to computing nodes, monitoring the running state, and retrying failed tasks.
4. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (4), the task division of the deep learning job is determined before execution, so the scheduling system only needs to allocate resources and determine a resource scheduling scheme for the pre-divided tasks.
5. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (5), an intelligent job resource demand prediction model is established; its input features include the task division, the hyper-parameter settings, and the data set scale, and its output label is the job execution resource demand vector, which covers CPU, memory, GPU computing power, GPU memory, and network bandwidth; the regression problem corresponding to the model is solved with a traditional machine learning algorithm.
6. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (5), the actual resource demand characteristics of similar jobs are collected from historical jobs, and the intelligent job resource demand prediction model uses them to predict the resource demand characteristics of subsequent deep learning jobs.
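The idea of claims 5 and 6 can be sketched as predicting a new job's demand vector from the recorded demand of the most similar historical job. The patent formulates this as a regression problem solved with a traditional machine learning algorithm; a nearest-neighbour lookup stands in here for illustration, and the feature tuple, demand tuple, and all numbers are made up:

```python
def predict_demand(job_features, history):
    """Return the demand vector of the historical job whose input
    features are closest to `job_features`.
    Features: (task_count, batch_size, dataset_gb).
    Demand:   (cpu_cores, mem_gb, gpu_compute_share, gpu_mem_gb, bw_gbps).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, demand = min(history, key=lambda rec: sq_dist(rec[0], job_features))
    return demand

# actual demand measured for previously executed jobs (illustrative)
history = [
    ((2, 32, 10),   (4, 16, 0.5, 8, 1)),
    ((8, 256, 150), (16, 64, 1.0, 32, 10)),
]
print(predict_demand((2, 64, 12), history))  # close to the first job
```

A real implementation would normalize the features and fit a regression model over many historical records rather than copying one neighbour.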
7. The deep-learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (6), the execution order of jobs is first determined and the job to be scheduled is selected, following fair scheduling between queues and first-come-first-served scheduling within each queue; then the job's distributed topology and the cluster network topology are extracted, and a network communication cost model is established from the job's resource demand characteristics; finally, a resource scheduling scheme is generated by solving this model with a heuristic genetic algorithm.
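The heuristic genetic algorithm of claim 7 can be sketched as evolving task-to-node placements that minimise cross-node communication under per-node GPU capacity. This is an illustrative toy, not the patent's model: the cost function, one-GPU-per-task assumption, and all GA parameters are assumptions:

```python
import random

def schedule_ga(n_tasks, nodes, comm, generations=60, pop_size=30, seed=0):
    """Place n_tasks on nodes (nodes[k] = GPU capacity of node k, one GPU
    per task) minimising the communication volume comm[i][j] (i < j) that
    crosses node boundaries. Returns (placement, cost)."""
    rng = random.Random(seed)

    def cost(placement):
        load = [0] * len(nodes)
        for node in placement:
            load[node] += 1
        # heavy penalty for exceeding a node's GPU capacity
        penalty = sum(max(0, l - cap) for l, cap in zip(load, nodes)) * 1e6
        cross = sum(comm[i][j]
                    for i in range(n_tasks) for j in range(i + 1, n_tasks)
                    if placement[i] != placement[j])
        return penalty + cross

    # random initial population of placements
    pop = [[rng.randrange(len(nodes)) for _ in range(n_tasks)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_tasks)       # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                # mutation
                child[rng.randrange(n_tasks)] = rng.randrange(len(nodes))
            children.append(child)
        pop = survivors + children
    best = min(pop, key=cost)
    return best, cost(best)

# tasks 0-1 and 2-3 talk heavily to each other; two nodes with 2 GPUs each
comm = [
    [0, 10, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 10],
    [0, 0, 0, 0],
]
placement, total = schedule_ga(4, [2, 2], comm)
print(placement, total)  # chatty pairs end up co-located
```

The patent's solver would additionally fold the cluster network topology and the predicted bandwidth demand into the cost model; the GA skeleton is the same.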
CN202011310749.8A 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method Active CN112416585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011310749.8A CN112416585B (en) 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011310749.8A CN112416585B (en) 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method

Publications (2)

Publication Number Publication Date
CN112416585A true CN112416585A (en) 2021-02-26
CN112416585B CN112416585B (en) 2024-03-15

Family

ID=74776959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011310749.8A Active CN112416585B (en) 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method

Country Status (1)

Country Link
CN (1) CN112416585B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094116A (en) * 2021-04-01 2021-07-09 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on load characteristic analysis
CN113608722A (en) * 2021-07-31 2021-11-05 云南电网有限责任公司信息中心 Algorithm packaging method based on distributed technology
CN113791906A (en) * 2021-08-09 2021-12-14 戴西(上海)软件有限公司 Scheduling system and optimization algorithm based on GPU resources in artificial intelligence and engineering fields
CN115202850A (en) * 2022-09-09 2022-10-18 国家超级计算天津中心 Job scheduling method and device, electronic equipment and storage medium
WO2022262167A1 (en) * 2021-06-15 2022-12-22 上海商汤科技开发有限公司 Cluster resource scheduling method and apparatus, electronic device and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106959891A (en) * 2017-03-30 2017-07-18 山东超越数控电子有限公司 A kind of cluster management method and system for realizing GPU scheduling
CN108881446A (en) * 2018-06-22 2018-11-23 深源恒际科技有限公司 A kind of artificial intelligence plateform system based on deep learning
CN109034396A (en) * 2018-07-11 2018-12-18 北京百度网讯科技有限公司 Method and apparatus for handling the deep learning operation in distributed type assemblies
CN109086134A (en) * 2018-07-19 2018-12-25 郑州云海信息技术有限公司 A kind of operation method and device of deep learning operation
CN109189401A (en) * 2018-07-06 2019-01-11 曙光信息产业(北京)有限公司 A kind of dispositions method and system of deep learning frame
US20190087383A1 (en) * 2017-09-19 2019-03-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Intelligent big data system, and method and apparatus for providing intelligent big data service
US20190332422A1 (en) * 2018-04-26 2019-10-31 International Business Machines Corporation Dynamic accelerator scheduling and grouping for deep learning jobs in a computing cluster
CN110399222A (en) * 2019-07-25 2019-11-01 北京邮电大学 GPU cluster deep learning task parallel method, device and electronic equipment
CN110442451A (en) * 2019-07-12 2019-11-12 中电海康集团有限公司 A kind of polymorphic type GPU cluster resource management dispatching method and system towards deep learning
CN111090456A (en) * 2019-12-06 2020-05-01 浪潮(北京)电子信息产业有限公司 Construction method, device, equipment and medium for deep learning development environment
KR102140730B1 (en) * 2019-12-17 2020-08-04 (주) 씨이랩 Method and system for providing develop environment of deep learning based gpu
CN111694656A (en) * 2020-04-22 2020-09-22 北京大学 Cluster resource scheduling method and system based on multi-agent deep reinforcement learning


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
VÍCTOR CAMPOS ET AL.: "Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster", Procedia Computer Science *
LIU Hui et al.: "A self-learning load-balancing scheduling algorithm for heterogeneous GPU clusters", Journal of Xi'an Shiyou University (Natural Science Edition), vol. 30, no. 3
LIN Jian et al.: "Research on the adaptation problem of deep learning cloud services", Software Guide (软件导刊) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094116A (en) * 2021-04-01 2021-07-09 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on load characteristic analysis
CN113094116B (en) * 2021-04-01 2022-10-11 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on load characteristic analysis
WO2022262167A1 (en) * 2021-06-15 2022-12-22 上海商汤科技开发有限公司 Cluster resource scheduling method and apparatus, electronic device and storage medium
CN113608722A (en) * 2021-07-31 2021-11-05 云南电网有限责任公司信息中心 Algorithm packaging method based on distributed technology
CN113791906A (en) * 2021-08-09 2021-12-14 戴西(上海)软件有限公司 Scheduling system and optimization algorithm based on GPU resources in artificial intelligence and engineering fields
CN115202850A (en) * 2022-09-09 2022-10-18 国家超级计算天津中心 Job scheduling method and device, electronic equipment and storage medium
CN115202850B (en) * 2022-09-09 2022-12-20 国家超级计算天津中心 Job scheduling method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112416585B (en) 2024-03-15

Similar Documents

Publication Publication Date Title
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
JP4781089B2 (en) Task assignment method and task assignment device
US20200257968A1 (en) Self-learning scheduler for application orchestration on shared compute cluster
Gu et al. Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters
CN113377540A (en) Cluster resource scheduling method and device, electronic equipment and storage medium
CN111381950A (en) Task scheduling method and system based on multiple copies for edge computing environment
CN114741207B (en) GPU resource scheduling method and system based on multi-dimensional combination parallelism
CN111768006A (en) Artificial intelligence model training method, device, equipment and storage medium
TWI786564B (en) Task scheduling method and apparatus, storage media and computer equipment
CN114237869B (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN111274036A (en) Deep learning task scheduling method based on speed prediction
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN115586961A (en) AI platform computing resource task scheduling method, device and medium
CN113946431B (en) Resource scheduling method, system, medium and computing device
AlOrbani et al. Load balancing and resource allocation in smart cities using reinforcement learning
CN110221902A (en) A kind of data transmission method and relevant apparatus based on virtual machine
CN116010051A (en) Federal learning multitasking scheduling method and device
CN116932201A (en) Multi-resource sharing scheduling method for deep learning training task
CN114677222A (en) Parallel transaction processing method, system and computer storage medium for block chain
Thai et al. Algorithms for optimising heterogeneous Cloud virtual machine clusters
US11934870B2 (en) Method for scheduling a set of computing tasks in a supercomputer
Sunder et al. Load balancing optimization based on enhanced genetic algorithm in cloud computing
US10402514B2 (en) Modeling and simulation of distributed computing frameworks
Biswas et al. A Machine Learning Approach for Predicting Efficient CPU Scheduling Algorithm
CN113391886A (en) Task scheduling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant