CN113568725A - Deep learning job priority scheduling method and deep learning job system - Google Patents

Deep learning job priority scheduling method and deep learning job system

Info

Publication number
CN113568725A
Authority
CN
China
Prior art keywords
job
time
priority
jobs
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110794626.4A
Other languages
Chinese (zh)
Inventor
周悦媛
章家维
杨康
邵恩
谭光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202110794626.4A
Publication of CN113568725A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F 9/00
    • G06F 2209/48 Indexing scheme relating to G06F 9/48
    • G06F 2209/484 Precedence

Abstract

The invention provides a deep learning job priority scheduling method, comprising the following steps: in any job scheduling period, acquiring predicted working parameters of all available GPUs in a GPU cluster and predicted job parameters of all jobs in a waiting queue of the GPU cluster; predicting the remaining execution time of each job according to the predicted working parameters and the predicted job parameters; taking the product of the remaining execution time of any job and the estimated resource count of the job as the job area of the job; and selecting the job with the smallest job area among all jobs and setting it to the highest priority in the current scheduling period. The invention also provides a deep learning job system and a data processing device.

Description

Deep learning job priority scheduling method and deep learning job system
Technical Field
The invention relates to the technical field of deep learning, and in particular to a time-estimation-based deep learning job priority scheduling method and a deep learning job system.
Background
Priority algorithms originated in the process scheduling problem of operating systems. Early operating systems designed only single-machine priority algorithms, since a single process typically ran on only a single CPU. The main purpose of these priority algorithms is to avoid multiple jobs requesting CPU resources simultaneously while optimizing, as far as possible, common performance metrics such as average job response time, average turnaround time, and response ratio.
Different priority algorithms are tailored to different scenarios and optimization targets.
For example, to minimize the average waiting time of all jobs, a shortest-job-first algorithm may be used: the priority of a job decreases monotonically with its execution time, so shorter jobs have higher priority. A mathematical derivation can strictly prove that such an algorithm is optimal for this metric, but from a fairness perspective it suffers from severe starvation. In addition, there are the following classical priority algorithms:
SJF (shortest job first): a classic operating-system task scheduling algorithm in which tasks with short execution times are executed first. It generally obtains good response times but is not suitable for GPU cluster scheduling, mainly because a single process typically runs on only a single CPU, whereas deep learning training jobs tend to run on multiple GPU devices.
LRF (least resources first): a scheduling algorithm widely used for resource-sensitive tasks. It considers only the resource usage of jobs; jobs with lower resource usage have higher priority and are scheduled first.
RAND (random priority): the next task scheduled from the queue is chosen completely at random or with some randomness.
To address the starvation problem, the highest-response-ratio-next algorithm may be chosen. In this algorithm the response ratio is defined as the time a job has already waited divided by the time it needs to execute, and a job's priority increases monotonically with its response ratio. Since a job's priority thus grows with its waiting time, a job that has waited too long acquires a high response ratio and is eventually scheduled, avoiding job starvation.
For executing jobs with high real-time requirements while placing particular emphasis on fairness, there is also the time-slice round-robin algorithm, which allows a process to be interrupted during execution. It divides time into many small slices and assigns each slice to a process, which creates the illusion that every process runs continuously and reduces response time. Because time slices are allocated uniformly across jobs, the algorithm is strongly fair. This notion of time slicing with preemption can also be applied to deep learning training jobs: Yujeong Choi et al. preempt on NPUs, letting jobs in the waiting queue decide whether to preempt the executing job according to each job's urgency, which reduces response time while achieving fairness.
If one simply wants to reduce the complexity of the scheduling algorithm and shorten its time overhead, a first-come-first-served algorithm can be used. It sorts jobs by arrival time: jobs that arrive early have high priority and execute first, and jobs that arrive late the opposite. Since the waiting queue requires no repeated sorting and insertion, the algorithm itself consumes little time.
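For illustration only (not part of the patent), the classical policies above can be written as priority key functions. The sketch below is in Python; the Job record and its field names are assumptions, and the response ratio follows the definition used above (waiting time divided by execution time):

    import random
    from dataclasses import dataclass

    @dataclass
    class Job:
        arrival: float    # submission time
        wait_time: float  # time spent waiting so far
        exec_time: float  # (estimated) execution time
        n_gpus: int       # number of GPUs requested

    # Each function returns a sort key; the job with the smallest key runs next.
    def sjf(job):  return job.exec_time                   # shortest job first
    def lrf(job):  return job.n_gpus                      # least resources first
    def fcfs(job): return job.arrival                     # first come, first served
    def rand(job): return random.random()                 # random priority
    def hrrn(job): return -job.wait_time / job.exec_time  # highest response ratio next

    def pick_next(queue, key):
        return min(queue, key=key)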
These algorithms serve their own optimization targets well in single-machine scheduling, but they are no longer fully applicable in multi-machine scheduling, because they cannot exploit the information of how many processors a job needs to occupy. A multi-machine algorithm has more factors to consider.
Multi-machine scheduling divides into two scenarios: the scheduled jobs each occupy only one processor, or each may occupy multiple processors. The former is a special case of the latter, yet even for the former, the problem of finding a scheduling algorithm that optimizes certain important metrics is NP-complete. Although the optimal problem remains intractable, if the search for an optimal solution is given up, a considerable number of heuristic scheduling algorithms are currently available for various demand scenarios.
There is much prior work related to priorities. One common method is backfilling: under certain conditions, jobs that occupy fewer resources are advanced past higher-priority jobs that occupy more resources, so as to fill idle resources and thereby improve cluster utilization. The research of Eric Gaussier et al. uses backfilling on top of various simple heuristic priority algorithms, developed with a dynamic multi-armed-bandit learning method, to gradually converge to the priority algorithm that best matches the workload characteristics, thereby reducing the average latency of jobs. The prior art also proposes a priority scheduling policy with secondary queues: all Node values in the cluster are computed by a fixed rule to generate a first Node priority queue; a Pod priority queue is obtained through a dynamic priority algorithm; the two queues filter out unschedulable Nodes to generate a second Node priority queue; the Node with the highest priority is selected from that queue and bound to the Pod popped from the Pod priority queue; after a successful binding the next Pod scheduling cycle begins; after a failed binding, a built-in priority algorithm selects a preferred Node binding from the second Node priority queue, and if that binding fails again, the Pod has no suitable Node to run on and enters the next Pod scheduling cycle. That prior invention uses a neural network, in a black-box manner, to obtain the weights of the various indices of a job in the cluster and then ranks priorities according to these weights.
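As a sketch of the backfilling idea only (a minimal first-fit form that omits the reservation guarantee full backfilling adds; it reuses the assumed Job record above):

    def backfill(queue, free_gpus):
        # queue is sorted by priority; advance the first lower-priority
        # job that fits into the currently idle GPUs
        for job in queue:
            if job.n_gpus <= free_gpus:
                return job
        return None  # nothing fits; wait for resources to free up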
Other algorithms include Gang Scheduling, which imitates the design of the single-machine time-slice round-robin algorithm. Like round-robin, Gang Scheduling divides time into smaller slices and, within a given slice, allocates a certain number of processors to each job according to its needs. Like round-robin, this algorithm can improve fairness and reduce response time.
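A minimal sketch of one Gang Scheduling time slice (illustrative only; the round-robin rotation rule and the Job record are assumptions carried over from above):

    from collections import deque

    def gang_schedule_slice(jobs: deque, total_gpus: int):
        # Pick the set of jobs to run in the next time slice: take jobs
        # from the front of the queue until GPUs run out, then rotate the
        # chosen jobs to the back so other jobs get the following slice.
        running, free = [], total_gpus
        for _ in range(len(jobs)):
            if jobs[0].n_gpus > free:
                break
            job = jobs.popleft()
            free -= job.n_gpus
            running.append(job)
        jobs.extend(running)
        return running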
Methods for dealing with faults fall mainly into two categories: reducing the fault overhead and reducing the fault probability.
At the software level, the most common way to reduce fault overhead is the Checkpoint technique: save points are inserted periodically while a job executes, and when the job fails it can resume from the last save point after restarting, avoiding the large waste of time caused by re-executing the job from the beginning. But saving a Checkpoint itself incurs a high time overhead, so an appropriate saving frequency must be chosen. Two examples of related work follow. Han Li et al. consider in detail the transmission paths of data among machine nodes while a job executes and save Checkpoints only when data must be transmitted across nodes, ensuring that after a fault each node re-executes only its own work, without re-executing the content of nodes that did not fail. This reduces Checkpoint frequency while avoiding the high fault overhead caused by having too few Checkpoints. Ismail Akturk et al. propose weighing the value of saving a Checkpoint against its time overhead: Checkpoints whose saving overhead is large are simply not saved, and after a fault a small part of the program is re-executed to regenerate the data, avoiding a large amount of memory-access overhead.
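For illustration (not from the cited works), the trade-off in checkpoint frequency can be explored numerically with the expected-time model derived later in this description, adding a hypothetical per-checkpoint saving cost c:

    import math

    def expected_time(T, tau, n, lam, delta, c):
        # Expected completion time of a job of fault-free length T that
        # checkpoints every tau seconds on n GPUs with per-GPU fault rate
        # lam, recovery time delta, and an assumed cost c per checkpoint.
        tau0 = (math.exp(n * lam * tau) - 1) * (1 / (n * lam) + delta)
        return (T / tau) * (tau0 + c)

    # Scan candidate intervals and keep the cheapest one (illustrative values).
    T, n, lam, delta, c = 36_000.0, 8, 1e-6, 300.0, 30.0
    best = min((expected_time(T, tau, n, lam, delta, c), tau)
               for tau in (600, 1200, 1800, 3600, 7200))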
There is also much related work on reducing the fault probability. In particular, Ajeya Naithani et al. use a scheduling method to mitigate the fault problem, hoping to reduce the frequency of faults. They found experimentally that, in heterogeneous processor architectures, different jobs executed on different kinds of processors have different fault probabilities. To reduce the average number of faults, jobs executing on different processor types can be continually swapped toward the placement that faults least often.
The existing priority policies have the following problems: 1) the influence of cluster faults is not considered, and scheduling targets only the fault-free case; 2) the average response time of jobs in the cluster, an important index of user service quality that reflects the interval from job submission to the user obtaining the result, cannot be minimized; 3) some existing priority policies use machine learning algorithms to help determine job priority, and such priority algorithms are poorly interpretable.
Disclosure of Invention
Aiming at the above problems, the invention provides a deep learning job priority scheduling method based on time estimation, comprising the following steps: in any job scheduling period, acquiring the predicted working parameters of all available GPUs in a GPU cluster and the predicted job parameters of all jobs in the waiting queue of the GPU cluster; predicting the remaining execution time of each job according to the predicted working parameters and the predicted job parameters; taking the product of the remaining execution time of any job and the estimated resource count of the job as the job area of the job; and selecting the job with the smallest job area among all jobs and setting it to the highest priority in the current scheduling period.
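As an illustration only (not part of the claims), one scheduling period of these steps can be sketched as follows; the Job fields and the remaining_time helper are assumed names, with remaining_time standing in for the prediction formula given below:

    def schedule_period(waiting_queue, lam, delta):
        # One job scheduling period of the smallest-area-first (SAF) policy.
        for job in waiting_queue:
            # predicted remaining execution time under cluster faults
            job.t0 = remaining_time(job.T, job.n, lam, job.tau, delta)
            # job area = remaining execution time x estimated GPU count
            job.area = job.t0 * job.n
        # the job with the smallest area receives the highest priority
        return min(waiting_queue, key=lambda j: j.area)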
In the deep learning job priority scheduling method of the invention, the predicted working parameters include: a fault parameter λ reflecting the average fault probability p of all the available GPUs, where p = 1 - e^{-λ}, and the average recovery time δ after a GPU fault. The predicted job parameters include: the fault-free theoretical remaining run length T of each job, the estimated resource count n, and the job scheduling period τ. The remaining execution time of any job is

T_0 = (T/τ)·τ_0,

where τ_0 is the predicted execution time of the job within one job scheduling period τ,

τ_0 = ((1 - p')/p')·(1/(nλ) + δ) = (e^{nλτ} - 1)·(1/(nλ) + δ),

and p' is the probability that the job suffers no fault within one job scheduling period τ; within one period, p' = e^{-nλτ}.
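These formulas transcribe directly into code (an illustrative sketch, not part of the claims; the names match the snippet above):

    import math

    def tau0(n, lam, tau, delta):
        # predicted execution time of one period of fault-free length tau:
        # tau_0 = (e^{n*lam*tau} - 1) * (1/(n*lam) + delta)
        return (math.exp(n * lam * tau) - 1) * (1 / (n * lam) + delta)

    def remaining_time(T, n, lam, tau, delta):
        # predicted remaining execution time T_0 = (T / tau) * tau_0
        # for a job with fault-free theoretical remaining run length T
        return (T / tau) * tau0(n, lam, tau, delta)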
The deep learning job priority scheduling method of the invention executes all deep learning jobs in at least one job scheduling period.
The invention also provides a deep learning job system, comprising: a parameter acquisition module for acquiring, in any job scheduling period, the predicted working parameters of all available GPUs in a GPU cluster and the predicted job parameters of all jobs in the waiting queue of the GPU cluster, and for predicting the remaining execution time of each job according to the predicted working parameters and the predicted job parameters; and a priority scheduling module for setting priorities in the current scheduling period according to the job areas of the jobs, taking the product of the remaining execution time of any job and the estimated resource count of the job as the job area of the job, selecting the job with the smallest job area among all jobs, and setting it to the highest priority in the current scheduling period.
In the deep learning job system of the invention, the predicted working parameters include: a fault parameter λ reflecting the average fault probability p of all the available GPUs, where p = 1 - e^{-λ}, and the average recovery time δ after a GPU fault. The predicted job parameters include: the fault-free theoretical remaining run length T of each job, the estimated resource count n, and the job scheduling period τ. The remaining execution time of any job is

T_0 = (T/τ)·τ_0,

where τ_0 is the predicted execution time of the job within one job scheduling period τ,

τ_0 = ((1 - p')/p')·(1/(nλ) + δ),

and p' is the probability that the job suffers no fault within one job scheduling period τ; within one period, p' = e^{-nλτ}.
The deep learning job system of the invention executes all deep learning jobs in at least one job scheduling period.
The invention also provides a computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, implement the deep learning job priority scheduling method described above.
The invention further provides a data processing apparatus, comprising: a GPU cluster; a processor; and a computer-readable storage medium, wherein, when the processor retrieves and executes the computer-executable instructions in the computer-readable storage medium, deep learning jobs are scheduled for execution on the GPU cluster.
Drawings
FIG. 1 is a flowchart of job priority acquisition in the present invention.
FIG. 2 is a schematic diagram of job run time prediction without checkpointing.
FIG. 3 is a schematic diagram of job run time prediction with checkpointing.
FIG. 4 is a schematic diagram of a data processing apparatus of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In researching priority policies for large-scale distributed deep learning jobs, the inventor found that directly applying the shortest-job-first policy used in traditional CPU clusters to a GPU cluster is unreasonable. The main reason is that a single process typically runs on only a single CPU, whereas deep learning training jobs tend to run on multiple GPU devices. The inventor therefore proved that setting the highest priority for the job with the smallest product of job run time and required GPU count yields the optimal average turnaround time, and, defining this product as the job area, the invention adopts a smallest-area-first (SAF) policy.
In addition, a naive smallest-area-first policy does not consider that cluster faults may greatly change a job's remaining run time, making the computed job area inaccurate and degrading the scheduling effect. Through extensive experiments and mathematical modeling, the inventor found that this defect can be remedied by estimating the remaining execution time of a training job from the cluster fault conditions. The invention therefore provides a deep learning job priority scheduling method based on time estimation; the difficulty lies in estimating the real running time under cluster-fault interference from the theoretical execution time.
The invention addresses, from two aspects, the problem that existing cluster deep learning job priority policies cannot guarantee optimal average turnaround time. On the one hand, a strict mathematical proof shows that existing priority policies cannot achieve optimal average turnaround time with large clusters and many jobs, whereas the smallest-area policy of the invention can. On the other hand, faults in a large-scale cluster become more frequent as the cluster grows, and a priority policy that ignores cluster faults cannot adapt to large clusters; the invention therefore provides a technique for estimating the remaining time of deep learning training jobs from cluster fault information and applies it within the smallest-area-first priority policy, so that the average response time of jobs in the cluster is reduced to the greatest extent.
Based on this prediction of the remaining execution time of deep learning training jobs from cluster fault information, the scheduling method can estimate job execution time under cluster-fault interference from the theoretical execution time; it adopts a smallest-area-first (SAF) scheduling policy, reducing the average response time of deep learning training jobs in the cluster; and the priority is obtained by a heuristic algorithm, so the method is fully interpretable.
Compared with algorithms such as SJF (shortest job first), FCFS (first come, first served), LRF (least resources first) and RAND (random priority), the method of the invention achieves the best average turnaround time. The theoretical proof follows.
Two constraints are first defined.
Constraint 1: the number of cards in the cluster is much larger than the number of cards required by any single job. Under this condition it can be asserted that, as long as jobs remain in the job queue, the utilization of the cluster stays at a high level close to 1. The following lemma holds.
Lemma 1: Let the total number of GPUs in the cluster be n_cluster, and let {job_i, i > 0} be a sufficiently long work queue in which job_i occupies n_job_i GPUs; write n_max := max_i n_job_i. Then at any time, if jobs remain in the work queue, the GPU utilization U of the cluster satisfies

U >= 1 - n_max/n_cluster.

In particular, if n_cluster >> max_i n_job_i, then 1 - U << 1.
Proof: Without loss of generality, let the priority of the jobs in the work queue decrease with their index. The jobs in the queue are scheduled onto the cluster one by one by priority until the highest-priority job remaining in the work queue cannot be scheduled. By contradiction, suppose the number of occupied GPUs satisfies n_occupied < n_cluster - n_max. Let k be the index of the highest-priority waiting job; then

n_job_k <= n_max <= n_cluster - n_occupied,

so the number of free GPUs is not less than the number of GPUs required by job k, which can therefore be scheduled. This contradicts the assumption that scheduling has stopped, so the original proposition is established: U = n_occupied/n_cluster >= 1 - n_max/n_cluster. If n_cluster >> n_max, then 1 - U <= n_max/n_cluster << 1. ∎
In fact, the constraint n_cluster >> n_max commonly holds in large-scale clusters. Thus it can be guaranteed that, with sufficiently many jobs, the GPU cards in the cluster are almost entirely in the occupied or faulty state, and rarely idle because of contention.
Constraint 2: the total time for all jobs to execute is much longer than the time required by any single job. This condition is essentially similar to the previous one. Record the wall-clock span of executing k priority-adjacent jobs starting from the m-th in the work queue, i.e., the time from when the first of these jobs starts to when the last of them ends, as t_{m,k}; let the execution duration of job_i be t_job_i and write t_max := max_i t_job_i. Suppose

t_{m,k} >> t_max.

Then it can be asserted that while this batch executes there is a period of at least t_{m,k} - 2·t_max during which the proportion of the cluster's GPUs occupied by the k jobs satisfies the inequality obeyed by U in Lemma 1; further, the average occupancy of the cluster by these k jobs over the span t_{m,k} is guaranteed to lie within a certain range. Suppose that after scheduling with some priority algorithm, each job_i is scheduled to begin execution at time b_i and completes execution at time e_i. Then the following Lemma 2 holds.
Lemma 2: With the span of executing the k priority-adjacent jobs starting from the m-th recorded as t_{m,k}, the average utilization

U_bar := ( Σ_{i=m}^{m+k-1} n_job_i·t_job_i ) / ( n_cluster·t_{m,k} )

satisfies the inequality

U_bar >= (1 - n_max/n_cluster)·(1 - 2·t_max/t_{m,k}).

Proof: Since the k jobs discussed are adjacent in priority, it is guaranteed that before time b_m the first m - 1 jobs have all been scheduled; thus after b_m + t_max the first m - 1 jobs must all have finished. Similarly, before e_{m+k-1} - t_max the (m+k)-th and subsequent jobs must not yet have been scheduled. Hence during [b_m + t_max, e_{m+k-1} - t_max] only these k jobs can be running on the cluster.
By Lemma 1, the utilization of the cluster in this period is at least 1 - n_max/n_cluster, so over this period the areas of these jobs satisfy

Σ_{i=m}^{m+k-1} n_job_i·t_job_i >= n_cluster·(1 - n_max/n_cluster)·(t_{m,k} - 2·t_max),

whence

U_bar >= (1 - n_max/n_cluster)·(1 - 2·t_max/t_{m,k}).

The lemma is established. ∎
The following demonstrates that the scheduling method of the present invention has an optimum in terms of average turnaround time:
this proof can be translated into finding a near optimal algorithm under constraints. Finding an approximately optimal algorithm, however, requires finding several properties that the optimal algorithm satisfies. The conditions that the near-optimal priority algorithm needs to satisfy under the above two constraints will be described below. Here, a simplified assumption may be made that all jobs arrive at an initial time, and information of all jobs is available at the initial time. For the case where a job arrives at different times, the priority of this job may be set using a priority algorithm at each job arrival time.
Now assume that an optimal priority algorithm has been found, and that under it each job_i is scheduled to begin execution at time b_i and completes execution at time e_i. Record the time taken from the scheduling of the first job to the completion of all jobs as t_all, and divide all jobs into sets according to their start times: for k = 1, ..., N, let

S_k := { job_i : b_i ∈ [ (k - 1)·t_all/N, k·t_all/N ) }. [set definition reconstructed from context]

The relative numbers of jobs in these sets can now be estimated by considering two of the sets and exchanging their priorities as a whole, which yields the following lemma.
Lemma 3: Under the above notation, for k_1 < k_2 < N, writing |S_k| for the number of elements of S_k, the count |S_{k_2}| exceeds |S_{k_1}| by at most a correction term [equation not recoverable from the source] that becomes negligible under the two constraints.
Proof: The division produces N mutually disjoint sets; select the k_1-th and the k_2-th of them, where S_{k_1} contains the jobs whose priority ranks run over the positions of set k_1, and S_{k_2} those over the positions of set k_2. Assume that under this optimal priority algorithm the average turnaround time of all jobs is l_0.
Now exchange the priorities of the two sets in their entirety: move all jobs of S_{k_1} back to the rank positions previously occupied by S_{k_2}, and move all jobs of S_{k_2} forward to the positions previously occupied by S_{k_1}. It is then estimated how the average turnaround time of all jobs changes after the priority order is adjusted.
Consider the following schedule. For each job of S_{k_2}, advance its start time by an equal amount of roughly (k_2 - k_1)·t_all/N; for each of the sets S_{k_1+1} through S_{k_2-1}, delay the start of each job by 2·t_max; for S_{k_1}, delay the start of each job by roughly (k_2 - k_1)·t_all/N; for S_{k_2+1} and the later sets, delay the start of each job by 4·t_max. [shift values reconstructed from context] Since all jobs within each part are delayed or advanced by an equal length of time, no conflicts arise among the jobs within a part. By the chosen shift values, the earliest start time of a job of S_{k_2} becomes t_all·(k_1 - 1)/N + t_max, which is later than the completion of every job of S_{k_1 - 1} and of the earlier sets, so these two parts of the jobs do not conflict; the same computation shows that none of the parts conflict. Hence the above is one realizable scheduling scheme, and under it the average turnaround time is found to be some value l_1 in which the advances of the |S_{k_2}| moved-forward jobs enter negatively and the delays of the |S_{k_1}| moved-back jobs and the O(t_max) shifts of the remaining jobs enter positively [equation not recoverable].
Moreover, in this scheme the order of all start times agrees with the adjusted priority order and no conflicts occur, so under the adjusted priority order every job has the opportunity to be scheduled at this timing or earlier; that is, the average turnaround time of all jobs under the adjusted priority order is at most l_1. But by the initial assumption l_0 is the average turnaround time obtained by the optimal priority algorithm, so l_0 <= l_1. Rearranging this inequality gives the bound stated in the lemma. ∎
From the above it is clear that the number of jobs in a later set must not be much greater than that in an earlier set; and when the number of jobs is large enough for the second constraint to hold, the number of jobs in the later set is not greater than that in the earlier set.
At the same time, the span from the start of the first job to the end of the last job of each set lies between t_all/N and t_all/N + t_max, so by Lemma 2 the sums of the areas of the jobs contained in the different sets are estimated to be close to one another, almost unchanged from set to set. Since the later set contains fewer jobs than the earlier one, the area occupied on average by each of its jobs must be larger.
But this alone is not enough to settle the question: if most of the area in a set is occupied by jobs with particularly large areas, then even if most jobs in the set are very small, the average area per job can still be larger than in other sets. To find a superior algorithm, the allocation of jobs with different areas within each set must be studied.
Therefore, following Lemma 3 and considering the exchange of jobs with smaller areas in a later set against jobs with larger areas in an earlier set, the following corollary is obtained.
Corollary 1: Let there be a job queue {job_i, i > 0}, where job_i has execution time t_job_i and occupies n_job_i GPU cards; write A_job_i := n_job_i·t_job_i. Let the jobs in the cluster be scheduled according to the optimal priority algorithm, let the time from the scheduling of the first job to the completion of all jobs be t_all, and let each job_i be scheduled to begin execution at time b_i. Divide the jobs into the sets S_k defined above, and for k_1 < k_2 < N take subsets I_1 ⊆ S_{k_1} and I_2 ⊆ S_{k_2} with |I_1| = |I_2|. Then, under the stated constraints, the size relationship

Σ_{i∈I_1} A_job_i - Σ_{j∈I_2} A_job_j <= [an upper bound, not recoverable from the source, that is small relative to the total area of a single set when k_2 - k_1 is large]

holds.
Proof: Write I_1 = {i_k, k = 1, 2, ..., |I_1|} and I_2 = {j_k, k = 1, 2, ..., |I_2|}. Define "exchanging the priorities of the two sets" to mean: for each k greater than 0 and not exceeding |I_1|, exchange the priorities of the i_k-th and the j_k-th jobs, and at the same time move job i_k from I_1 into I_2 and job j_k from I_2 into I_1. The import of the corollary is then: select from the k_1-th and the k_2-th sets two subsets I_1 and I_2 with equal element counts, and attempt to exchange their priorities to check whether the average turnaround time improves.
First define the variation

Δ(I_1, I_2) := Σ_{i∈I_1} A_job_i - Σ_{j∈I_2} A_job_j, [reconstructed from context]

which, by the definitions in the corollary, is the quantity to be bounded.
Consider first only the case in which the area of every job of I_1 is not smaller than that of every job of I_2. If this condition does not hold, delete elements of I_1 from small to large, and an equal number of elements of I_2 from large to small, until the condition is satisfied; denote the two resulting sets I_1' and I_2'. In this process the sum of the areas of the jobs deleted from I_1 is smaller than the sum deleted from I_2, so, by the definition of the variation above, Δ(I_1, I_2) <= Δ(I_1', I_2'); therefore, if Δ(I_1', I_2') admits the upper bound stated in the corollary, then Δ(I_1, I_2) admits the same upper bound.
Since the two subsets contain the same number of jobs, after the priorities are adjusted the priority of every job outside I_1 and I_2 is unchanged in all the other sets. Then, as in Lemma 3, consider all the jobs in five parts (the jobs moved forward; the jobs of higher priority than all adjusted jobs; S_{k_1}; S_{k_2} and its neighbouring sets; the sets after S_{k_2}) [partition reconstructed from context] and analyze the change of their turnaround times.
(i) For the jobs moved forward into the positions of S_{k_1}: a job with a large area is replaced with a job with a small area, and the start time of each such job is advanced; the turnaround time of these jobs is shortened by at least a non-negative value.
(ii) For the jobs whose priority is higher than that of all the jobs being adjusted: the priority adjustment has no effect on their start times, so their turnaround-time change is 0.
(iii) For S_{k_1} itself: only the start times of its jobs are affected, since the start and end times of all higher-priority jobs do not change; it therefore suffices to consider the window in which the jobs of this part execute. Note that smaller jobs move in and larger jobs move out, so the sum of the areas of all jobs of this part is reduced. Before the adjustment, the sum of the areas did not exceed the capacity n_cluster·(t_max + t_all/N) that the corresponding window can carry; by Lemma 2, after the adjustment the time elapsed from the start of the first job of this part to the completion of the last is bounded [equation not recoverable], and since before the adjustment this elapsed time was at least t_all/N, it is lengthened by at most an amount denoted Δe [equation not recoverable]. Moreover, since the start and end times of the higher-priority jobs are unchanged, the first job of this part starts at most t_max later, so its completion is delayed by at most Δe. If the start of every subsequent job is then delayed by Δe, these jobs do not conflict with the jobs of this part; that is, each such job may be scheduled at that time or earlier. Hence the start of each of these jobs is delayed by at most Δe, equivalently advanced by at least -Δe, and its turnaround time is likewise shortened by at least -Δe.
(iv) For S_{k_2} and its neighbouring sets: consider the completion time of their last job and the start time of their first job; all jobs of these sets execute between those two instants. By Lemma 2, the completion time of the last job is bounded [equation not recoverable]. Since all jobs of the earlier parts have completed before that instant, the start times of the jobs of this part are not later than that bound; and since, before the priority adjustment, the start times in this set were not earlier than (k_2 - 1)·t_all/N, it can be deduced that after the adjustment each job of this part starts at most a bounded length of time later [equation not recoverable], and its turnaround time increases by at most this value.
(v) For the sets after S_{k_2}: the start time of their first job is not later than a corresponding bound [equation not recoverable]; before the priority adjustment, the first of these jobs started no earlier than k_2·t_all/N, so after the adjustment each such job is deferred by at most a bounded length of time [equation not recoverable].
Let the average turnaround time of all jobs before the priority adjustment be l_0 and after the adjustment be l_2. Summing the turnaround-time changes of the five parts above gives an inequality of the form

l_2 <= l_0 + [expression (2), not recoverable from the source], with an auxiliary quantity δ' defined by [expression (3), not recoverable].

Because the original algorithm is the optimal algorithm, the average turnaround time after the priority adjustment cannot be shorter than before, so the left side of inequality (2) must not be less than 0. Combining this with (3) and rearranging yields inequality (4) [not recoverable]. By Lemma 3, for each i between k_1 + 1 and k_2 - 1 the set sizes |S_i| satisfy the condition of that lemma, which bounds the middle terms; substituting this bound back into expression (4) yields expression (5) [not recoverable]. Expression (5) is then arranged into small-quantity form using the two constraints set out above: from the definition of δ' in (3), combined with (5), the bound

Σ_{i∈I_1} A_job_i - Σ_{j∈I_2} A_job_j <= [an upper bound that is small relative to the total area of a single set when k_2 - k_1 is large]

holds for arbitrary I_1 ⊆ S_{k_1} and I_2 ⊆ S_{k_2} with |I_1| = |I_2|, which is the result the corollary asserts. ∎
To illustrate that this conclusion is more meaningful than Lemma 3, consider again the example mentioned after Lemma 3: for k_2 > k_1, suppose S_{k_2} consists of a small number of jobs with particularly large areas together with several jobs of very small area. In that case most of the area is occupied by the particularly large jobs, so the count |S_{k_2}| alone may not be restrictive and Lemma 3 gives no limitation. In Corollary 1, however, choose I_2 to be the subset of S_{k_2} formed by the smallest-area jobs making up half of its total count, and I_1 to be the subset of S_{k_1} formed by an equal number of its largest-area jobs. Because S_{k_2} contains many jobs of extremely small area, the areas of the jobs of I_2 are all extremely small and so is their sum; whereas the area distribution of the jobs of S_{k_1} is uniform, so the sum of the areas of the jobs of I_1 is larger. This directly violates the requirement of Corollary 1; hence Corollary 1 is a stronger constraint than Lemma 3.
It is particularly noteworthy that n_cluster·t_all/N is approximately equal to the sum of the areas of all jobs of each single set S_k. This means that the reasoning above cannot directly show that the bound is a small quantity relative to the total area; indeed, for k_2 - k_1 = 1 the corollary gives no limitation at all. However, with a long enough work queue, N can become very large under the constraints of this chapter, so the difference k_2 - k_1 between the set indices to which two specific jobs belong can also become very large. In that case the right-hand term of the inequality is a small quantity relative to the sum of the areas of all jobs of a single set, and the constraint of the corollary becomes very strong.
Then, when the number of job types is limited, the area of a single job is bounded. Hence the absolute value of the difference between the areas of two distinct jobs has a minimum value, denoted ΔA, and the area of a single job has a maximum value, denoted A_max. Take two instants t_1 and t_2 with t_2 > t_1 and t_2 - t_1 = O(t_all); it can then be asserted that almost all jobs starting near time t_1 occupy smaller areas than those starting near time t_2. To see this, divide [t_1, t_2] equally into the three large periods [t_1, (2t_1 + t_2)/3), [(2t_1 + t_2)/3, (t_1 + 2t_2)/3), [(t_1 + 2t_2)/3, t_2), subdivide each large period equally into small periods, and let the set of jobs whose start times fall in each small period be the corresponding S_k as above. For two jobs lying in the first and the third large period respectively, the difference between their set indices k_1 and k_2 is at least some N_0. Suppose that among the jobs of the first large period there are in total n jobs whose areas are larger than those of jobs of the third large period, out of m jobs in total; then by Corollary 1 [inequality not recoverable] and by Lemma 2 [inequality not recoverable], the ratio n/m is bounded. When the two constraints are satisfied, N_0 can be made sufficiently large, so n/m is very small. That is, the jobs ordered in reverse of the smallest-area-first order are very few compared with the total number of jobs, which explains the optimality of the smallest-area-first algorithm.
The time-estimation-based deep learning job priority scheduling method of the invention can be used for both single-queue and multi-queue scheduling. As shown in FIG. 1, for single-queue scheduling, priority acquisition is implemented as follows:
S1: acquire the average fault probability p of all GPUs (or all available GPUs) in the cluster, where p = 1 - e^{-λ}, and the average recovery time δ after a GPU fault.
S2: acquire the information of all jobs in the cluster waiting queue (the jobs numbered 0, 1, 2, ...), including each job's theoretical remaining run length T, required GPU resource count n, and checkpoint saving interval τ.
S3: for each job in the cluster, estimate the remaining execution time T_0.
S4: according to the calculation result of step S3, update the remaining execution times of all jobs in the cluster waiting queue.
S5: calculate the areas of all jobs in the cluster waiting queue, record the area of the i-th job as S_i, and add it to the job information data.
S6: find the j-th job such that S_j <= S_k holds for every k ∈ {0, 1, 2, ...}; the j-th job has the highest priority, i.e., it is the next job to be scheduled.
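Continuing the earlier sketches (illustrative values only), a two-job example shows how S1-S6 can rank jobs differently from shortest-job-first: a short wide job can have a larger area than a long narrow one.

    from types import SimpleNamespace

    lam, delta = 1e-6, 300.0  # S1: fault parameter and recovery time
    jobs = [
        SimpleNamespace(T=36_000.0, n=1, tau=3600.0),  # long, narrow
        SimpleNamespace(T=7_200.0,  n=8, tau=3600.0),  # short, wide
    ]
    for job in jobs:  # S3-S5
        job.t0 = remaining_time(job.T, job.n, lam, job.tau, delta)
        job.area = job.t0 * job.n
    # S6: the areas come to about 36,076 and 58,578, so the long narrow
    # job is scheduled first, whereas shortest-job-first would pick the
    # short wide one.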
The step S3 specifically includes:
S31: when each interval contains only one epoch, as shown in FIG. 2, an error restart may occur but no checkpoints are saved; the execution time of the job on the cluster for one interval is estimated as

τ_0 = (e^{nλτ} - 1)·(1/(nλ) + δ).

S32: with the τ_0 obtained from S31, the execution time for the case shown in FIG. 3, where checkpoints are saved every τ time, is estimated as

T_0 = (T/τ)·τ_0.

The step S31 specifically includes:
S311: define the average per-second error probability of a GPU in the cluster as p, so the average per-second probability of no error is 1 - p. If job J is assigned to n GPU cards, the probability that none of the n cards errs in one second is (1 - p)^n, and the probability that no machine errs for t seconds is (1 - p)^{nt}. From p = 1 - e^{-λ} we obtain e^{-λ} = 1 - p, so the probability that the job executes on the n GPU cards for time t without error is e^{-nλt}.
S312: define P(success) as the probability that the job executes to completion without ever erring, P(success) = e^{-nλt}; define f(t)·dt as the probability that the job first errs between time t and time t + dt.
S313: using the data acquired in step S1, step S2 and step S312, the expectation E of the total completion time of the job is

E = P(success)·T + ∫_0^T f(t)·(t + δ + E) dt.

S314: by the definitions of step S312, P(success) = e^{-nλT} and f(t) = nλ·e^{-nλt}. Substituting into the total-time expectation of step S313 gives

E = T·e^{-nλT} + ∫_0^T nλ·e^{-nλt}·(t + δ + E) dt,

and evaluating the integral and solving for E yields

E = (e^{nλT} - 1)·(1/(nλ) + δ).

Replacing T by τ and simplifying gives

τ_0 = (e^{nλτ} - 1)·(1/(nλ) + δ).
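As a numerical check (illustrative values, not from the patent): with n = 8, λ = 10^-6 per second, τ = 3600 s and δ = 300 s, nλτ = 0.0288, so τ_0 = (e^{0.0288} - 1)·(1/(8·10^-6) + 300) ≈ 0.02922 × 125300 ≈ 3661 s; for a job with T = 36000 s this gives T_0 = (36000/3600) × 3661 ≈ 36610 s, i.e., faults and recoveries add roughly 1.7% to the fault-free run length.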
The invention also provides a computer-readable storage medium and a data processing device, as shown in FIG. 4, comprising a GPU cluster, a processor, and a computer-readable storage medium. The computer-readable storage medium of the invention stores computer-executable instructions which, when executed by the processor, prioritize and schedule the deep learning jobs executing on the GPU cluster. Those skilled in the art will understand that all or part of the steps of the above method may be implemented by instructing relevant hardware (e.g., a processor, FPGA, ASIC, etc.) through a program, and the program may be stored in a readable storage medium such as a read-only memory, a magnetic disk, or an optical disk. All or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, the modules in the above embodiments may be implemented in hardware, for example by an integrated circuit, or in software, for example by a processor executing programs/instructions stored in a memory. Embodiments of the invention are not limited to any specific combination of hardware and software.
The invention uses a heuristic algorithm to estimate the remaining execution time of a deep learning job, takes the product of the estimated time and the quantity of resources the job requires as its area, and finally evaluates priority by a smallest-area-first policy that is mathematically proven to make the average waiting time optimal. Compared with the prior art, the priority policy of the invention has the following advantages: 1) it accounts for the fact that once the number of nodes and GPU devices in a cluster reaches a certain scale, the cluster fault probability rises and the influence of faults is no longer negligible; 2) it obtains shorter average response times; 3) it uses a purely heuristic algorithm and is therefore fully interpretable.
The above embodiments are merely illustrative and not restrictive; those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the invention, so all equivalent technical solutions also fall within the scope of the invention, which is defined by the appended claims.

Claims (10)

1. A deep learning job priority scheduling method, characterized by comprising the following steps:
in any job scheduling period, acquiring predicted working parameters of all available GPUs in a GPU cluster and predicted job parameters of all jobs in a waiting queue of the GPU cluster;
predicting the remaining execution time of each job according to the predicted working parameters and the predicted job parameters;
taking the product of the remaining execution time of any job and the estimated resource count of the job as the job area of the job; and
selecting the job with the smallest job area among all jobs and setting it to the highest priority in the current scheduling period.
2. The deep learning job priority scheduling method of claim 1, wherein the predicted working parameters comprise: a fault parameter λ reflecting the average fault probability p of all the available GPUs, where p = 1 - e^{-λ}, and the average recovery time δ after a GPU fault; and the predicted job parameters comprise: the fault-free theoretical remaining run length T of each job, the estimated resource count n, and the job scheduling period τ;
the remaining execution time of any job being
T_0 = (T/τ)·τ_0,
wherein τ_0 is the predicted execution time of the job within one job scheduling period τ,
τ_0 = ((1 - p')/p')·(1/(nλ) + δ),
and p' is the probability that the job suffers no fault within one job scheduling period τ.
3. The deep learning job priority scheduling method of claim 2, wherein, within one job scheduling period τ, p' = e^{-nλτ}.
4. The deep-learning job priority scheduling method of claim 1, wherein all deep-learning jobs are executed in at least one job scheduling period.
5. A deep learning job system, characterized by comprising:
a parameter acquisition module for acquiring, in any job scheduling period, predicted working parameters of all available GPUs in a GPU cluster and predicted job parameters of all jobs in a waiting queue of the GPU cluster, and for predicting the remaining execution time of each job according to the predicted working parameters and the predicted job parameters; and
a priority scheduling module for setting priorities in the current scheduling period according to job areas, taking the product of the remaining execution time of any job and the estimated resource count of the job as the job area of the job, selecting the job with the smallest job area among all jobs, and setting it to the highest priority in the current scheduling period.
6. The deep learning job system of claim 5, wherein the predicted working parameters comprise: a fault parameter λ reflecting the average fault probability p of all the available GPUs, where p = 1 - e^{-λ}, and the average recovery time δ after a GPU fault; and the predicted job parameters comprise: the fault-free theoretical remaining run length T of each job, the estimated resource count n, and the job scheduling period τ;
the remaining execution time of any job being
T_0 = (T/τ)·τ_0,
wherein τ_0 is the predicted execution time of the job within one job scheduling period τ,
τ_0 = ((1 - p')/p')·(1/(nλ) + δ),
and p' is the probability that the job suffers no fault within one job scheduling period τ.
7. The deep learning job system of claim 6, wherein, within one job scheduling period τ, p' = e^{-nλτ}.
8. The deep learning job system of claim 5, wherein all deep learning jobs are executed in at least one job scheduling cycle.
9. A computer-readable storage medium storing computer-executable instructions, which when executed by a processor implement the deep-learning job priority scheduling method of any one of claims 1 to 4.
10. A data processing apparatus, comprising:
a GPU cluster;
a processor; and
the computer-readable storage medium of claim 9, wherein the processor, when retrieving and executing the computer-executable instructions in the computer-readable storage medium, schedules deep learning jobs for execution on the GPU cluster.
CN202110794626.4A 2021-07-14 2021-07-14 Deep learning job priority scheduling method and deep learning job system Pending CN113568725A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110794626.4A CN113568725A (en) 2021-07-14 2021-07-14 Deep learning job priority scheduling method and deep learning job system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110794626.4A CN113568725A (en) 2021-07-14 2021-07-14 Deep learning job priority scheduling method and deep learning job system

Publications (1)

Publication Number Publication Date
CN113568725A true CN113568725A (en) 2021-10-29

Family

ID=78164761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110794626.4A Pending CN113568725A (en) 2021-07-14 2021-07-14 Deep learning job priority scheduling method and deep learning job system

Country Status (1)

Country Link
CN (1) CN113568725A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101339521A (en) * 2008-07-28 2009-01-07 华中科技大学 Tasks priority dynamic dispatching algorithm
CN106445673A (en) * 2016-10-14 2017-02-22 苏州光蓝信息技术有限公司 Fault-tolerant task scheduling method oriented to mixed-criticality real-time system
CN106980532A (en) * 2016-01-18 2017-07-25 西安中兴新软件有限责任公司 A kind of job scheduling method and device
CN107193655A (en) * 2017-05-17 2017-09-22 南京大学 A kind of fair resource dispatching method towards big data processing based on utility function
CN110471758A (en) * 2019-07-02 2019-11-19 中国电力科学研究院有限公司 A kind of network analysis applications multi-user concurrent job scheduling system and method
CN111274021A (en) * 2020-02-27 2020-06-12 苏宁云计算有限公司 GPU cluster task scheduling and distributing method
CN111694656A (en) * 2020-04-22 2020-09-22 北京大学 Cluster resource scheduling method and system based on multi-agent deep reinforcement learning
CN112035251A (en) * 2020-07-14 2020-12-04 中科院计算所西部高等技术研究院 Deep learning training system and method based on reinforcement learning operation layout
CN112256434A (en) * 2020-10-30 2021-01-22 中国科学院信息工程研究所 Resource matching method in encrypted data cracking scene


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘素芹, "面积最大优先调度的预约回填算法" ("A reservation backfilling algorithm with largest-area-first scheduling"), 《微计算机应用》 (Microcomputer Applications), vol. 29, no. 12, 31 December 2008, pages 5-9 *
王罡, "可重构加速平台下基于面积性能比的多任务调度优化策略研究" ("Research on multi-task scheduling optimization strategies based on the area-performance ratio on reconfigurable acceleration platforms"), Chinese master's theses collection, page 4 *

Similar Documents

Publication Publication Date Title
US8924976B2 (en) Task scheduling method and apparatus
CN111381950B (en) Multi-copy-based task scheduling method and system for edge computing environment
US8015564B1 (en) Method of dispatching tasks in multi-processor computing environment with dispatching rules and monitoring of system status
EP2742426B1 (en) Network-aware coordination of virtual machine migrations in enterprise data centers and clouds
US7752622B1 (en) Method and apparatus for flexible job pre-emption
Van Tilborg et al. Foundations of real-time computing: Scheduling and resource management
Bril et al. Worst-case response time analysis of real-time tasks under fixed-priority scheduling with deferred preemption
US7743378B1 (en) Method and apparatus for multi-dimensional priority determination for job scheduling
US9442760B2 (en) Job scheduling using expected server performance information
JP3922070B2 (en) Distributed control method and apparatus
US7076781B2 (en) Resource reservation for large-scale job scheduling
Hui et al. Improved strategies for dynamic load balancing
US7844968B1 (en) System for predicting earliest completion time and using static priority having initial priority and static urgency for job scheduling
US20120167101A1 (en) System and method for proactive task scheduling
US7984447B1 (en) Method and apparatus for balancing project shares within job assignment and scheduling
Chen et al. Adaptive multiple-workflow scheduling with task rearrangement
CN103699433B (en) One kind dynamically adjusts number of tasks purpose method and system in Hadoop platform
US8214836B1 (en) Method and apparatus for job assignment and scheduling using advance reservation, backfilling, and preemption
US20070195356A1 (en) Job preempt set generation for resource management
CN108509280B (en) Distributed computing cluster locality scheduling method based on push model
CN111026519A (en) Distributed task priority scheduling method and system and storage medium
CN111367644A (en) Task scheduling method and device for heterogeneous fusion system
CN112579271A (en) Real-time task scheduling method, module, terminal and storage medium for non-real-time operating system
CN113568725A (en) Deep learning job priority scheduling method and deep learning job system
CN111708799B (en) Spark task processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination