CN113568725A - Deep learning job priority scheduling method and deep learning job system - Google Patents

Deep learning job priority scheduling method and deep learning job system

Info

Publication number
CN113568725A
Authority
CN
China
Prior art keywords
job
time
priority
jobs
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110794626.4A
Other languages
Chinese (zh)
Inventor
周悦媛
章家维
杨康
邵恩
谭光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202110794626.4A
Publication of CN113568725A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F 9/00
    • G06F 2209/48 Indexing scheme relating to G06F 9/48
    • G06F 2209/484 Precedence

Abstract

The invention provides a deep learning job priority scheduling method, comprising the following steps: in any job scheduling period, acquiring predicted working parameters of all available GPUs in a GPU cluster and predicted job parameters of all jobs in a waiting queue of the GPU cluster; predicting the remaining execution time of each job according to the predicted working parameters and the predicted job parameters; taking the product of the remaining execution time of any job and the estimated resource count of the job as the job area of the job; and selecting the job with the smallest job area among all jobs and setting it to the highest priority in the current scheduling period. The invention also provides a deep learning job system and a data processing device.

Description

Deep learning job priority scheduling method and deep learning job system
Technical Field
The invention relates to the technical field of deep learning, and in particular to a time-estimation-based deep learning job priority scheduling method and a deep learning job system.
Background
Priority algorithms originated in the process scheduling problem of operating systems. Early operating systems designed only single-machine priority algorithms, since a single process typically ran on only a single CPU. The main purpose of these priority algorithms is to avoid multiple jobs requesting CPU resources simultaneously while optimizing, as far as possible, common performance metrics such as average job response time, average turnaround time, and response ratio.
Different priority algorithms are tailored to different scenarios and optimization targets.
For example, to minimize the average waiting time of all jobs, a shortest-job-first algorithm may be used: the priority of a job decreases monotonically with its execution time, so shorter jobs have higher priority. A mathematical derivation can strictly prove that such an algorithm is optimal for this metric, but from a fairness perspective it suffers from severe starvation. In addition, there are the following classical priority algorithms:
SJF (shortest job first): a classic operating-system task scheduling algorithm in which tasks with short execution times are executed first. It generally obtains good response times but is not suitable for GPU cluster scheduling, mainly because a single process typically runs on only a single CPU, whereas deep learning training jobs tend to run on multiple GPU devices.
LRF (least resources first): a scheduling algorithm widely used for resource-sensitive tasks. It considers only the resource usage of jobs; jobs with lower resource usage have higher priority and are scheduled first.
RAND (random priority): the next task scheduled from the queue is chosen completely at random or with some randomness.
To address the starvation problem, the highest-response-ratio-next algorithm may be chosen. In this algorithm the response ratio is defined as the time a job has already waited divided by the time it needs to execute, and a job's priority increases monotonically with its response ratio. Since a job's priority thus grows with its waiting time, a job that has waited too long acquires a high response ratio and is eventually scheduled, avoiding job starvation.
For executing jobs with high real-time requirements while placing particular emphasis on fairness, there is also the time-slice round-robin algorithm, which allows a process to be interrupted during execution. It divides time into many small slices and assigns each slice to a process, which creates the illusion that every process runs continuously and reduces response time. Because time slices are allocated uniformly across jobs, the algorithm is strongly fair. This notion of time slicing with preemption can also be applied to deep learning training jobs: Yujeong Choi et al. preempt on NPUs, letting jobs in the waiting queue decide whether to preempt the executing job according to each job's urgency, which reduces response time while achieving fairness.
If one simply wants to reduce the complexity of the scheduling algorithm and shorten its time overhead, a first-come-first-served algorithm can be used. It sorts jobs by arrival time: jobs that arrive early have high priority and execute first, and jobs that arrive late the opposite. Since the waiting queue requires no repeated sorting and insertion, the algorithm itself consumes little time.
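For illustration only (not part of the patent), the classical policies above can be written as priority key functions. The sketch below is in Python; the Job record and its field names are assumptions, and the response ratio follows the definition used above (waiting time divided by execution time):

    import random
    from dataclasses import dataclass

    @dataclass
    class Job:
        arrival: float    # submission time
        wait_time: float  # time spent waiting so far
        exec_time: float  # (estimated) execution time
        n_gpus: int       # number of GPUs requested

    # Each function returns a sort key; the job with the smallest key runs next.
    def sjf(job):  return job.exec_time                   # shortest job first
    def lrf(job):  return job.n_gpus                      # least resources first
    def fcfs(job): return job.arrival                     # first come, first served
    def rand(job): return random.random()                 # random priority
    def hrrn(job): return -job.wait_time / job.exec_time  # highest response ratio next

    def pick_next(queue, key):
        return min(queue, key=key)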
These algorithms serve their own optimization targets well in single-machine scheduling, but they are no longer fully applicable in multi-machine scheduling, because they cannot exploit the information of how many processors a job needs to occupy. A multi-machine algorithm has more factors to consider.
Multi-machine scheduling divides into two scenarios: the scheduled jobs each occupy only one processor, or each may occupy multiple processors. The former is a special case of the latter, yet even for the former, the problem of finding a scheduling algorithm that optimizes certain important metrics is NP-complete. Although the optimal problem remains intractable, if the search for an optimal solution is given up, a considerable number of heuristic scheduling algorithms are currently available for various demand scenarios.
There is much prior work related to priorities. One common method is backfilling: under certain conditions, jobs that occupy fewer resources are advanced past higher-priority jobs that occupy more resources, so as to fill idle resources and thereby improve cluster utilization. The research of Eric Gaussier et al. uses backfilling on top of various simple heuristic priority algorithms, developed with a dynamic multi-armed-bandit learning method, to gradually converge to the priority algorithm that best matches the workload characteristics, thereby reducing the average latency of jobs. The prior art also proposes a priority scheduling policy with secondary queues: all Node values in the cluster are computed by a fixed rule to generate a first Node priority queue; a Pod priority queue is obtained through a dynamic priority algorithm; the two queues filter out unschedulable Nodes to generate a second Node priority queue; the Node with the highest priority is selected from that queue and bound to the Pod popped from the Pod priority queue; after a successful binding the next Pod scheduling cycle begins; after a failed binding, a built-in priority algorithm selects a preferred Node binding from the second Node priority queue, and if that binding fails again, the Pod has no suitable Node to run on and enters the next Pod scheduling cycle. That prior invention uses a neural network, in a black-box manner, to obtain the weights of the various indices of a job in the cluster and then ranks priorities according to these weights.
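As a sketch of the backfilling idea only (a minimal first-fit form that omits the reservation guarantee full backfilling adds; it reuses the assumed Job record above):

    def backfill(queue, free_gpus):
        # queue is sorted by priority; advance the first lower-priority
        # job that fits into the currently idle GPUs
        for job in queue:
            if job.n_gpus <= free_gpus:
                return job
        return None  # nothing fits; wait for resources to free up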
Other algorithms include Gang Scheduling, which imitates the design of the single-machine time-slice round-robin algorithm. Like round-robin, Gang Scheduling divides time into smaller slices and, within a given slice, allocates a certain number of processors to each job according to its needs. Like round-robin, this algorithm can improve fairness and reduce response time.
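A minimal sketch of one Gang Scheduling time slice (illustrative only; the round-robin rotation rule and the Job record are assumptions carried over from above):

    from collections import deque

    def gang_schedule_slice(jobs: deque, total_gpus: int):
        # Pick the set of jobs to run in the next time slice: take jobs
        # from the front of the queue until GPUs run out, then rotate the
        # chosen jobs to the back so other jobs get the following slice.
        running, free = [], total_gpus
        for _ in range(len(jobs)):
            if jobs[0].n_gpus > free:
                break
            job = jobs.popleft()
            free -= job.n_gpus
            running.append(job)
        jobs.extend(running)
        return running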
Methods for dealing with faults fall mainly into two categories: reducing the fault overhead and reducing the fault probability.
At the software level, the most common way to reduce fault overhead is the Checkpoint technique: save points are inserted periodically while a job executes, and when the job fails it can resume from the last save point after restarting, avoiding the large waste of time caused by re-executing the job from the beginning. But saving a Checkpoint itself incurs a high time overhead, so an appropriate saving frequency must be chosen. Two examples of related work follow. Han Li et al. consider in detail the transmission paths of data among machine nodes while a job executes and save Checkpoints only when data must be transmitted across nodes, ensuring that after a fault each node re-executes only its own work, without re-executing the content of nodes that did not fail. This reduces Checkpoint frequency while avoiding the high fault overhead caused by having too few Checkpoints. Ismail Akturk et al. propose weighing the value of saving a Checkpoint against its time overhead: Checkpoints whose saving overhead is large are simply not saved, and after a fault a small part of the program is re-executed to regenerate the data, avoiding a large amount of memory-access overhead.
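For illustration (not from the cited works), the trade-off in checkpoint frequency can be explored numerically with the expected-time model derived later in this description, adding a hypothetical per-checkpoint saving cost c:

    import math

    def expected_time(T, tau, n, lam, delta, c):
        # Expected completion time of a job of fault-free length T that
        # checkpoints every tau seconds on n GPUs with per-GPU fault rate
        # lam, recovery time delta, and an assumed cost c per checkpoint.
        tau0 = (math.exp(n * lam * tau) - 1) * (1 / (n * lam) + delta)
        return (T / tau) * (tau0 + c)

    # Scan candidate intervals and keep the cheapest one (illustrative values).
    T, n, lam, delta, c = 36_000.0, 8, 1e-6, 300.0, 30.0
    best = min((expected_time(T, tau, n, lam, delta, c), tau)
               for tau in (600, 1200, 1800, 3600, 7200))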
There is also much related work on reducing the fault probability. In particular, Ajeya Naithani et al. use a scheduling method to mitigate the fault problem, hoping to reduce the frequency of faults. They found experimentally that, in heterogeneous processor architectures, different jobs executed on different kinds of processors have different fault probabilities. To reduce the average number of faults, jobs executing on different processor types can be continually swapped toward the placement that faults least often.
The existing priority policies have the following problems: 1) the influence of cluster faults is not considered, and scheduling targets only the fault-free case; 2) the average response time of jobs in the cluster, an important index of user service quality that reflects the interval from job submission to the user obtaining the result, cannot be minimized; 3) some existing priority policies use machine learning algorithms to help determine job priority, and such priority algorithms are poorly interpretable.
Disclosure of Invention
Aiming at the above problems, the invention provides a deep learning job priority scheduling method based on time estimation, comprising the following steps: in any job scheduling period, acquiring the predicted working parameters of all available GPUs in a GPU cluster and the predicted job parameters of all jobs in the waiting queue of the GPU cluster; predicting the remaining execution time of each job according to the predicted working parameters and the predicted job parameters; taking the product of the remaining execution time of any job and the estimated resource count of the job as the job area of the job; and selecting the job with the smallest job area among all jobs and setting it to the highest priority in the current scheduling period.
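As an illustration only (not part of the claims), one scheduling period of these steps can be sketched as follows; the Job fields and the remaining_time helper are assumed names, with remaining_time standing in for the prediction formula given below:

    def schedule_period(waiting_queue, lam, delta):
        # One job scheduling period of the smallest-area-first (SAF) policy.
        for job in waiting_queue:
            # predicted remaining execution time under cluster faults
            job.t0 = remaining_time(job.T, job.n, lam, job.tau, delta)
            # job area = remaining execution time x estimated GPU count
            job.area = job.t0 * job.n
        # the job with the smallest area receives the highest priority
        return min(waiting_queue, key=lambda j: j.area)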
In the deep learning job priority scheduling method of the invention, the predicted working parameters include: a fault parameter λ reflecting the average fault probability p of all the available GPUs, where p = 1 - e^{-λ}, and the average recovery time δ after a GPU fault. The predicted job parameters include: the fault-free theoretical remaining run length T of each job, the estimated resource count n, and the job scheduling period τ. The remaining execution time of any job is

T_0 = (T/τ)·τ_0,

where τ_0 is the predicted execution time of the job within one job scheduling period τ,

τ_0 = ((1 - p')/p')·(1/(nλ) + δ) = (e^{nλτ} - 1)·(1/(nλ) + δ),

and p' is the probability that the job suffers no fault within one job scheduling period τ; within one period, p' = e^{-nλτ}.
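These formulas transcribe directly into code (an illustrative sketch, not part of the claims; the names match the snippet above):

    import math

    def tau0(n, lam, tau, delta):
        # predicted execution time of one period of fault-free length tau:
        # tau_0 = (e^{n*lam*tau} - 1) * (1/(n*lam) + delta)
        return (math.exp(n * lam * tau) - 1) * (1 / (n * lam) + delta)

    def remaining_time(T, n, lam, tau, delta):
        # predicted remaining execution time T_0 = (T / tau) * tau_0
        # for a job with fault-free theoretical remaining run length T
        return (T / tau) * tau0(n, lam, tau, delta)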
The deep learning job priority scheduling method of the invention executes all deep learning jobs in at least one job scheduling period.
The invention also provides a deep learning job system, comprising: a parameter acquisition module for acquiring, in any job scheduling period, the predicted working parameters of all available GPUs in a GPU cluster and the predicted job parameters of all jobs in the waiting queue of the GPU cluster, and for predicting the remaining execution time of each job according to the predicted working parameters and the predicted job parameters; and a priority scheduling module for setting priorities in the current scheduling period according to the job areas of the jobs, taking the product of the remaining execution time of any job and the estimated resource count of the job as the job area of the job, selecting the job with the smallest job area among all jobs, and setting it to the highest priority in the current scheduling period.
In the deep learning job system of the invention, the predicted working parameters include: a fault parameter λ reflecting the average fault probability p of all the available GPUs, where p = 1 - e^{-λ}, and the average recovery time δ after a GPU fault. The predicted job parameters include: the fault-free theoretical remaining run length T of each job, the estimated resource count n, and the job scheduling period τ. The remaining execution time of any job is

T_0 = (T/τ)·τ_0,

where τ_0 is the predicted execution time of the job within one job scheduling period τ,

τ_0 = ((1 - p')/p')·(1/(nλ) + δ),

and p' is the probability that the job suffers no fault within one job scheduling period τ; within one period, p' = e^{-nλτ}.
The deep learning job system of the invention executes all deep learning jobs in at least one job scheduling period.
The invention also provides a computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, implement the deep learning job priority scheduling method described above.
The invention further provides a data processing apparatus, comprising: a GPU cluster; a processor; and a computer-readable storage medium, wherein, when the processor retrieves and executes the computer-executable instructions in the computer-readable storage medium, deep learning jobs are scheduled for execution on the GPU cluster.
Drawings
FIG. 1 is a flowchart of job priority acquisition in the present invention.
FIG. 2 is a schematic diagram of job run time prediction without checkpointing.
FIG. 3 is a schematic diagram of job run time prediction with checkpointing.
FIG. 4 is a schematic diagram of a data processing apparatus of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In researching priority policies for large-scale distributed deep learning jobs, the inventor found that directly applying the shortest-job-first policy used in traditional CPU clusters to a GPU cluster is unreasonable. The main reason is that a single process typically runs on only a single CPU, whereas deep learning training jobs tend to run on multiple GPU devices. The inventor therefore proved that setting the highest priority for the job with the smallest product of job run time and required GPU count yields the optimal average turnaround time, and, defining this product as the job area, the invention adopts a smallest-area-first (SAF) policy.
In addition, a naive smallest-area-first policy does not consider that cluster faults may greatly change a job's remaining run time, making the computed job area inaccurate and degrading the scheduling effect. Through extensive experiments and mathematical modeling, the inventor found that this defect can be remedied by estimating the remaining execution time of a training job from the cluster fault conditions. The invention therefore provides a deep learning job priority scheduling method based on time estimation; the difficulty lies in estimating the real running time under cluster-fault interference from the theoretical execution time.
The invention addresses, from two aspects, the problem that existing cluster deep learning job priority policies cannot guarantee optimal average turnaround time. On the one hand, a strict mathematical proof shows that existing priority policies cannot achieve optimal average turnaround time with large clusters and many jobs, whereas the smallest-area policy of the invention can. On the other hand, faults in a large-scale cluster become more frequent as the cluster grows, and a priority policy that ignores cluster faults cannot adapt to large clusters; the invention therefore provides a technique for estimating the remaining time of deep learning training jobs from cluster fault information and applies it within the smallest-area-first priority policy, so that the average response time of jobs in the cluster is reduced to the greatest extent.
Based on this prediction of the remaining execution time of deep learning training jobs from cluster fault information, the scheduling method can estimate job execution time under cluster-fault interference from the theoretical execution time; it adopts a smallest-area-first (SAF) scheduling policy, reducing the average response time of deep learning training jobs in the cluster; and the priority is obtained by a heuristic algorithm, so the method is fully interpretable.
Compared with algorithms such as SJF (shortest job first), FCFS (first come, first served), LRF (least resources first) and RAND (random priority), the method of the invention achieves the best average turnaround time. The theoretical proof follows.
Two constraints are first defined.
Constraint 1: the number of cards in the cluster is much larger than the number of cards required by any single job. Under this condition it can be asserted that, as long as jobs remain in the job queue, the utilization of the cluster stays at a high level close to 1. The following lemma holds.
Lemma 1: Let the total number of GPUs in the cluster be n_cluster, and let {job_i, i > 0} be a sufficiently long work queue in which job_i occupies n_job_i GPUs; write n_max := max_i n_job_i. Then at any time, if jobs remain in the work queue, the GPU utilization U of the cluster satisfies

U >= 1 - n_max/n_cluster.

In particular, if n_cluster >> max_i n_job_i, then 1 - U << 1.
Proof: Without loss of generality, let the priority of the jobs in the work queue decrease with their index. The jobs in the queue are scheduled onto the cluster one by one by priority until the highest-priority job remaining in the work queue cannot be scheduled. By contradiction, suppose the number of occupied GPUs satisfies n_occupied < n_cluster - n_max. Let k be the index of the highest-priority waiting job; then

n_job_k <= n_max <= n_cluster - n_occupied,

so the number of free GPUs is not less than the number of GPUs required by job k, which can therefore be scheduled. This contradicts the assumption that scheduling has stopped, so the original proposition is established: U = n_occupied/n_cluster >= 1 - n_max/n_cluster. If n_cluster >> n_max, then 1 - U <= n_max/n_cluster << 1. ∎
In fact, the constraint n_cluster >> n_max commonly holds in large-scale clusters. Thus it can be guaranteed that, with sufficiently many jobs, the GPU cards in the cluster are almost entirely in the occupied or faulty state, and rarely idle because of contention.
Constraint 2: the total time for all jobs to execute is much longer than the time required by any single job. This condition is essentially similar to the previous one. Record the wall-clock span of executing k priority-adjacent jobs starting from the m-th in the work queue, i.e., the time from when the first of these jobs starts to when the last of them ends, as t_{m,k}; let the execution duration of job_i be t_job_i and write t_max := max_i t_job_i. Suppose

t_{m,k} >> t_max.

Then it can be asserted that while this batch executes there is a period of at least t_{m,k} - 2·t_max during which the proportion of the cluster's GPUs occupied by the k jobs satisfies the inequality obeyed by U in Lemma 1; further, the average occupancy of the cluster by these k jobs over the span t_{m,k} is guaranteed to lie within a certain range. Suppose that after scheduling with some priority algorithm, each job_i is scheduled to begin execution at time b_i and completes execution at time e_i. Then the following Lemma 2 holds.
Lemma 2: With the span of executing the k priority-adjacent jobs starting from the m-th recorded as t_{m,k}, the average utilization

U_bar := ( Σ_{i=m}^{m+k-1} n_job_i·t_job_i ) / ( n_cluster·t_{m,k} )

satisfies the inequality

U_bar >= (1 - n_max/n_cluster)·(1 - 2·t_max/t_{m,k}).

Proof: Since the k jobs discussed are adjacent in priority, it is guaranteed that before time b_m the first m - 1 jobs have all been scheduled; thus after b_m + t_max the first m - 1 jobs must all have finished. Similarly, before e_{m+k-1} - t_max the (m+k)-th and subsequent jobs must not yet have been scheduled. Hence during [b_m + t_max, e_{m+k-1} - t_max] only these k jobs can be running on the cluster.
By Lemma 1, the utilization of the cluster in this period is at least 1 - n_max/n_cluster, so over this period the areas of these jobs satisfy

Σ_{i=m}^{m+k-1} n_job_i·t_job_i >= n_cluster·(1 - n_max/n_cluster)·(t_{m,k} - 2·t_max),

whence

U_bar >= (1 - n_max/n_cluster)·(1 - 2·t_max/t_{m,k}).

The lemma is established. ∎
The following demonstrates that the scheduling method of the present invention has an optimum in terms of average turnaround time:
this proof can be translated into finding a near optimal algorithm under constraints. Finding an approximately optimal algorithm, however, requires finding several properties that the optimal algorithm satisfies. The conditions that the near-optimal priority algorithm needs to satisfy under the above two constraints will be described below. Here, a simplified assumption may be made that all jobs arrive at an initial time, and information of all jobs is available at the initial time. For the case where a job arrives at different times, the priority of this job may be set using a priority algorithm at each job arrival time.
Now assume that an optimal priority algorithm has been found, and that under it each job_i is scheduled to begin execution at time b_i and completes execution at time e_i. Record the time taken from the scheduling of the first job to the completion of all jobs as t_all, and divide all jobs into sets according to their start times: for k = 1, ..., N, let

S_k := { job_i : b_i ∈ [ (k - 1)·t_all/N, k·t_all/N ) }. [set definition reconstructed from context]

The relative numbers of jobs in these sets can now be estimated by considering two of the sets and exchanging their priorities as a whole, which yields the following lemma.
Lemma 3: Under the above notation, for k_1 < k_2 < N, writing |S_k| for the number of elements of S_k, the count |S_{k_2}| exceeds |S_{k_1}| by at most a correction term [equation not recoverable from the source] that becomes negligible under the two constraints.
Proof: The division produces N mutually disjoint sets; select the k_1-th and the k_2-th of them, where S_{k_1} contains the jobs whose priority ranks run over the positions of set k_1, and S_{k_2} those over the positions of set k_2. Assume that under this optimal priority algorithm the average turnaround time of all jobs is l_0.
Now exchange the priorities of the two sets in their entirety: move all jobs of S_{k_1} back to the rank positions previously occupied by S_{k_2}, and move all jobs of S_{k_2} forward to the positions previously occupied by S_{k_1}. It is then estimated how the average turnaround time of all jobs changes after the priority order is adjusted.
Consider the following schedule. For each job of S_{k_2}, advance its start time by an equal amount of roughly (k_2 - k_1)·t_all/N; for each of the sets S_{k_1+1} through S_{k_2-1}, delay the start of each job by 2·t_max; for S_{k_1}, delay the start of each job by roughly (k_2 - k_1)·t_all/N; for S_{k_2+1} and the later sets, delay the start of each job by 4·t_max. [shift values reconstructed from context] Since all jobs within each part are delayed or advanced by an equal length of time, no conflicts arise among the jobs within a part. By the chosen shift values, the earliest start time of a job of S_{k_2} becomes t_all·(k_1 - 1)/N + t_max, which is later than the completion of every job of S_{k_1 - 1} and of the earlier sets, so these two parts of the jobs do not conflict; the same computation shows that none of the parts conflict. Hence the above is one realizable scheduling scheme, and under it the average turnaround time is found to be some value l_1 in which the advances of the |S_{k_2}| moved-forward jobs enter negatively and the delays of the |S_{k_1}| moved-back jobs and the O(t_max) shifts of the remaining jobs enter positively [equation not recoverable].
Moreover, in this scheme the order of all start times agrees with the adjusted priority order and no conflicts occur, so under the adjusted priority order every job has the opportunity to be scheduled at this timing or earlier; that is, the average turnaround time of all jobs under the adjusted priority order is at most l_1. But by the initial assumption l_0 is the average turnaround time obtained by the optimal priority algorithm, so l_0 <= l_1. Rearranging this inequality gives the bound stated in the lemma. ∎
From the above it is clear that the number of jobs in a later set must not be much greater than that in an earlier set; and when the number of jobs is large enough for the second constraint to hold, the number of jobs in the later set is not greater than that in the earlier set.
At the same time, the span from the start of the first job to the end of the last job of each set lies between t_all/N and t_all/N + t_max, so by Lemma 2 the sums of the areas of the jobs contained in the different sets are estimated to be close to one another, almost unchanged from set to set. Since the later set contains fewer jobs than the earlier one, the area occupied on average by each of its jobs must be larger.
But this alone is not enough to settle the question: if most of the area in a set is occupied by jobs with particularly large areas, then even if most jobs in the set are very small, the average area per job can still be larger than in other sets. To find a superior algorithm, the allocation of jobs with different areas within each set must be studied.
Therefore, following Lemma 3 and considering the exchange of jobs with smaller areas in a later set against jobs with larger areas in an earlier set, the following corollary is obtained.
Corollary 1: Let there be a job queue {job_i, i > 0}, where job_i has execution time t_job_i and occupies n_job_i GPU cards; write A_job_i := n_job_i·t_job_i. Let the jobs in the cluster be scheduled according to the optimal priority algorithm, let the time from the scheduling of the first job to the completion of all jobs be t_all, and let each job_i be scheduled to begin execution at time b_i. Divide the jobs into the sets S_k defined above, and for k_1 < k_2 < N take subsets I_1 ⊆ S_{k_1} and I_2 ⊆ S_{k_2} with |I_1| = |I_2|. Then, under the stated constraints, the size relationship

Σ_{i∈I_1} A_job_i - Σ_{j∈I_2} A_job_j <= [an upper bound, not recoverable from the source, that is small relative to the total area of a single set when k_2 - k_1 is large]

holds.
Proof: Write I_1 = {i_k, k = 1, 2, ..., |I_1|} and I_2 = {j_k, k = 1, 2, ..., |I_2|}. Define "exchanging the priorities of the two sets" to mean: for each k greater than 0 and not exceeding |I_1|, exchange the priorities of the i_k-th and the j_k-th jobs, and at the same time move job i_k from I_1 into I_2 and job j_k from I_2 into I_1. The import of the corollary is then: select from the k_1-th and the k_2-th sets two subsets I_1 and I_2 with equal element counts, and attempt to exchange their priorities to check whether the average turnaround time improves.
First define the variation

Δ(I_1, I_2) := Σ_{i∈I_1} A_job_i - Σ_{j∈I_2} A_job_j, [reconstructed from context]

which, by the definitions in the corollary, is the quantity to be bounded.
Consider first only the case in which the area of every job of I_1 is not smaller than that of every job of I_2. If this condition does not hold, delete elements of I_1 from small to large, and an equal number of elements of I_2 from large to small, until the condition is satisfied; denote the two resulting sets I_1' and I_2'. In this process the sum of the areas of the jobs deleted from I_1 is smaller than the sum deleted from I_2, so, by the definition of the variation above, Δ(I_1, I_2) <= Δ(I_1', I_2'); therefore, if Δ(I_1', I_2') admits the upper bound stated in the corollary, then Δ(I_1, I_2) admits the same upper bound.
Since the two subsets contain the same number of jobs, after the priorities are adjusted the priority of every job outside I_1 and I_2 is unchanged in all the other sets. Then, as in Lemma 3, consider all the jobs in five parts (the jobs moved forward; the jobs of higher priority than all adjusted jobs; S_{k_1}; S_{k_2} and its neighbouring sets; the sets after S_{k_2}) [partition reconstructed from context] and analyze the change of their turnaround times.
(i) For the jobs moved forward into the positions of S_{k_1}: a job with a large area is replaced with a job with a small area, and the start time of each such job is advanced; the turnaround time of these jobs is shortened by at least a non-negative value.
(ii) For the jobs whose priority is higher than that of all the jobs being adjusted: the priority adjustment has no effect on their start times, so their turnaround-time change is 0.
(iii) For S_{k_1} itself: only the start times of its jobs are affected, since the start and end times of all higher-priority jobs do not change; it therefore suffices to consider the window in which the jobs of this part execute. Note that smaller jobs move in and larger jobs move out, so the sum of the areas of all jobs of this part is reduced. Before the adjustment, the sum of the areas did not exceed the capacity n_cluster·(t_max + t_all/N) that the corresponding window can carry; by Lemma 2, after the adjustment the time elapsed from the start of the first job of this part to the completion of the last is bounded [equation not recoverable], and since before the adjustment this elapsed time was at least t_all/N, it is lengthened by at most an amount denoted Δe [equation not recoverable]. Moreover, since the start and end times of the higher-priority jobs are unchanged, the first job of this part starts at most t_max later, so its completion is delayed by at most Δe. If the start of every subsequent job is then delayed by Δe, these jobs do not conflict with the jobs of this part; that is, each such job may be scheduled at that time or earlier. Hence the start of each of these jobs is delayed by at most Δe, equivalently advanced by at least -Δe, and its turnaround time is likewise shortened by at least -Δe.
(iv) For S_{k_2} and its neighbouring sets: consider the completion time of their last job and the start time of their first job; all jobs of these sets execute between those two instants. By Lemma 2, the completion time of the last job is bounded [equation not recoverable]. Since all jobs of the earlier parts have completed before that instant, the start times of the jobs of this part are not later than that bound; and since, before the priority adjustment, the start times in this set were not earlier than (k_2 - 1)·t_all/N, it can be deduced that after the adjustment each job of this part starts at most a bounded length of time later [equation not recoverable], and its turnaround time increases by at most this value.
(v) For the sets after S_{k_2}: the start time of their first job is not later than a corresponding bound [equation not recoverable]; before the priority adjustment, the first of these jobs started no earlier than k_2·t_all/N, so after the adjustment each such job is deferred by at most a bounded length of time [equation not recoverable].
Let the average turnaround time of all jobs before the priority adjustment be l_0 and after the adjustment be l_2. Summing the turnaround-time changes of the five parts above gives an inequality of the form

l_2 <= l_0 + [expression (2), not recoverable from the source], with an auxiliary quantity δ' defined by [expression (3), not recoverable].

Because the original algorithm is the optimal algorithm, the average turnaround time after the priority adjustment cannot be shorter than before, so the left side of inequality (2) must not be less than 0. Combining this with (3) and rearranging yields inequality (4) [not recoverable]. By Lemma 3, for each i between k_1 + 1 and k_2 - 1 the set sizes |S_i| satisfy the condition of that lemma, which bounds the middle terms; substituting this bound back into expression (4) yields expression (5) [not recoverable]. Expression (5) is then arranged into small-quantity form using the two constraints set out above: from the definition of δ' in (3), combined with (5), the bound

Σ_{i∈I_1} A_job_i - Σ_{j∈I_2} A_job_j <= [an upper bound that is small relative to the total area of a single set when k_2 - k_1 is large]

holds for arbitrary I_1 ⊆ S_{k_1} and I_2 ⊆ S_{k_2} with |I_1| = |I_2|, which is the result the corollary asserts. ∎
To illustrate that this conclusion is more meaningful than Lemma 3, consider again the example mentioned after Lemma 3: for k_2 > k_1, suppose S_{k_2} consists of a small number of jobs with particularly large areas together with several jobs of very small area. In that case most of the area is occupied by the particularly large jobs, so the count |S_{k_2}| alone may not be restrictive and Lemma 3 gives no limitation. In Corollary 1, however, choose I_2 to be the subset of S_{k_2} formed by the smallest-area jobs making up half of its total count, and I_1 to be the subset of S_{k_1} formed by an equal number of its largest-area jobs. Because S_{k_2} contains many jobs of extremely small area, the areas of the jobs of I_2 are all extremely small and so is their sum; whereas the area distribution of the jobs of S_{k_1} is uniform, so the sum of the areas of the jobs of I_1 is larger. This directly violates the requirement of Corollary 1; hence Corollary 1 is a stronger constraint than Lemma 3.
It is particularly noteworthy that n_cluster·t_all/N is approximately equal to the sum of the areas of all jobs of each single set S_k. This means that the reasoning above cannot directly show that the bound is a small quantity relative to the total area; indeed, for k_2 - k_1 = 1 the corollary gives no limitation at all. However, with a long enough work queue, N can become very large under the constraints of this chapter, so the difference k_2 - k_1 between the set indices to which two specific jobs belong can also become very large. In that case the right-hand term of the inequality is a small quantity relative to the sum of the areas of all jobs of a single set, and the constraint of the corollary becomes very strong.
Then, when the number of job types is limited, the area of a single job is bounded. Hence the absolute value of the difference between the areas of two distinct jobs has a minimum value, denoted ΔA, and the area of a single job has a maximum value, denoted A_max. Take two instants t_1 and t_2 with t_2 > t_1 and t_2 - t_1 = O(t_all); it can then be asserted that almost all jobs starting near time t_1 occupy smaller areas than those starting near time t_2. To see this, divide [t_1, t_2] equally into the three large periods [t_1, (2t_1 + t_2)/3), [(2t_1 + t_2)/3, (t_1 + 2t_2)/3), [(t_1 + 2t_2)/3, t_2), subdivide each large period equally into small periods, and let the set of jobs whose start times fall in each small period be the corresponding S_k as above. For two jobs lying in the first and the third large period respectively, the difference between their set indices k_1 and k_2 is at least some N_0. Suppose that among the jobs of the first large period there are in total n jobs whose areas are larger than those of jobs of the third large period, out of m jobs in total; then by Corollary 1 [inequality not recoverable] and by Lemma 2 [inequality not recoverable], the ratio n/m is bounded. When the two constraints are satisfied, N_0 can be made sufficiently large, so n/m is very small. That is, the jobs ordered in reverse of the smallest-area-first order are very few compared with the total number of jobs, which explains the optimality of the smallest-area-first algorithm.
The time-estimation-based deep learning job priority scheduling method of the invention can be used for both single-queue and multi-queue scheduling. As shown in FIG. 1, for single-queue scheduling, priority acquisition is implemented as follows:
S1: acquire the average fault probability p of all GPUs (or all available GPUs) in the cluster, where p = 1 - e^{-λ}, and the average recovery time δ after a GPU fault.
S2: acquire the information of all jobs in the cluster waiting queue (the jobs numbered 0, 1, 2, ...), including each job's theoretical remaining run length T, required GPU resource count n, and checkpoint saving interval τ.
S3: for each job in the cluster, estimate the remaining execution time T_0.
S4: according to the calculation result of step S3, update the remaining execution times of all jobs in the cluster waiting queue.
S5: calculate the areas of all jobs in the cluster waiting queue, record the area of the i-th job as S_i, and add it to the job information data.
S6: find the j-th job such that S_j <= S_k holds for every k ∈ {0, 1, 2, ...}; the j-th job has the highest priority, i.e., it is the next job to be scheduled.
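Continuing the earlier sketches (illustrative values only), a two-job example shows how S1-S6 can rank jobs differently from shortest-job-first: a short wide job can have a larger area than a long narrow one.

    from types import SimpleNamespace

    lam, delta = 1e-6, 300.0  # S1: fault parameter and recovery time
    jobs = [
        SimpleNamespace(T=36_000.0, n=1, tau=3600.0),  # long, narrow
        SimpleNamespace(T=7_200.0,  n=8, tau=3600.0),  # short, wide
    ]
    for job in jobs:  # S3-S5
        job.t0 = remaining_time(job.T, job.n, lam, job.tau, delta)
        job.area = job.t0 * job.n
    # S6: the areas come to about 36,076 and 58,578, so the long narrow
    # job is scheduled first, whereas shortest-job-first would pick the
    # short wide one.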
The step S3 specifically includes:
S31: when each interval contains only one epoch, as shown in FIG. 2, an error restart may occur but no checkpoints are saved; the execution time of the job on the cluster for one interval is estimated as

τ_0 = (e^{nλτ} - 1)·(1/(nλ) + δ).

S32: with the τ_0 obtained from S31, the execution time for the case shown in FIG. 3, where checkpoints are saved every τ time, is estimated as

T_0 = (T/τ)·τ_0.

The step S31 specifically includes:
S311: define the average per-second error probability of a GPU in the cluster as p, so the average per-second probability of no error is 1 - p. If job J is assigned to n GPU cards, the probability that none of the n cards errs in one second is (1 - p)^n, and the probability that no machine errs for t seconds is (1 - p)^{nt}. From p = 1 - e^{-λ} we obtain e^{-λ} = 1 - p, so the probability that the job executes on the n GPU cards for time t without error is e^{-nλt}.
S312: define P(success) as the probability that the job executes to completion without ever erring, P(success) = e^{-nλt}; define f(t)·dt as the probability that the job first errs between time t and time t + dt.
S313: using the data acquired in step S1, step S2 and step S312, the expectation E of the total completion time of the job is

E = P(success)·T + ∫_0^T f(t)·(t + δ + E) dt.

S314: by the definitions of step S312, P(success) = e^{-nλT} and f(t) = nλ·e^{-nλt}. Substituting into the total-time expectation of step S313 gives

E = T·e^{-nλT} + ∫_0^T nλ·e^{-nλt}·(t + δ + E) dt,

and evaluating the integral and solving for E yields

E = (e^{nλT} - 1)·(1/(nλ) + δ).

Replacing T by τ and simplifying gives

τ_0 = (e^{nλτ} - 1)·(1/(nλ) + δ).
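As a numerical check (illustrative values, not from the patent): with n = 8, λ = 10^-6 per second, τ = 3600 s and δ = 300 s, nλτ = 0.0288, so τ_0 = (e^{0.0288} - 1)·(1/(8·10^-6) + 300) ≈ 0.02922 × 125300 ≈ 3661 s; for a job with T = 36000 s this gives T_0 = (36000/3600) × 3661 ≈ 36610 s, i.e., faults and recoveries add roughly 1.7% to the fault-free run length.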
The invention also provides a computer-readable storage medium and a data processing device, as shown in FIG. 4, comprising a GPU cluster, a processor, and a computer-readable storage medium. The computer-readable storage medium of the invention stores computer-executable instructions which, when executed by the processor, prioritize and schedule the deep learning jobs executing on the GPU cluster. Those skilled in the art will understand that all or part of the steps of the above method may be implemented by instructing relevant hardware (e.g., a processor, FPGA, ASIC, etc.) through a program, and the program may be stored in a readable storage medium such as a read-only memory, a magnetic disk, or an optical disk. All or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, the modules in the above embodiments may be implemented in hardware, for example by an integrated circuit, or in software, for example by a processor executing programs/instructions stored in a memory. Embodiments of the invention are not limited to any specific combination of hardware and software.
The invention uses a heuristic algorithm to estimate the remaining execution time of a deep learning job, takes the product of the estimated time and the quantity of resources the job requires as its area, and finally evaluates priority by a smallest-area-first policy that is mathematically proven to make the average waiting time optimal. Compared with the prior art, the priority policy of the invention has the following advantages: 1) it accounts for the fact that once the number of nodes and GPU devices in a cluster reaches a certain scale, the cluster fault probability rises and the influence of faults is no longer negligible; 2) it obtains shorter average response times; 3) it uses a purely heuristic algorithm and is therefore fully interpretable.
The above embodiments are merely illustrative and not restrictive; those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the invention, so all equivalent technical solutions also fall within the scope of the invention, which is defined by the appended claims.

Claims (10)

1. A deep learning job priority scheduling method, characterized by comprising the following steps:
in any job scheduling period, acquiring predicted working parameters of all available GPUs in a GPU cluster and predicted job parameters of all jobs in a waiting queue of the GPU cluster;
predicting the remaining execution time of each job according to the predicted working parameters and the predicted job parameters;
taking the product of the remaining execution time of any job and the estimated resource count of the job as the job area of the job; and
selecting the job with the smallest job area among all jobs and setting it to the highest priority in the current scheduling period.
2. The deep learning job priority scheduling method of claim 1, wherein the predicted working parameters comprise: a fault parameter λ reflecting the average fault probability p of all the available GPUs, where p = 1 - e^{-λ}, and the average recovery time δ after a GPU fault; and the predicted job parameters comprise: the fault-free theoretical remaining run length T of each job, the estimated resource count n, and the job scheduling period τ;
the remaining execution time of any job being
T_0 = (T/τ)·τ_0,
wherein τ_0 is the predicted execution time of the job within one job scheduling period τ,
τ_0 = ((1 - p')/p')·(1/(nλ) + δ),
and p' is the probability that the job suffers no fault within one job scheduling period τ.
3. The deep learning job priority scheduling method of claim 2, wherein, within one job scheduling period τ, p' = e^{-nλτ}.
4. The deep-learning job priority scheduling method of claim 1, wherein all deep-learning jobs are executed in at least one job scheduling period.
5. A deep learning job system, characterized by comprising:
a parameter acquisition module for acquiring, in any job scheduling period, predicted working parameters of all available GPUs in a GPU cluster and predicted job parameters of all jobs in a waiting queue of the GPU cluster, and for predicting the remaining execution time of each job according to the predicted working parameters and the predicted job parameters; and
a priority scheduling module for setting priorities in the current scheduling period according to job areas, taking the product of the remaining execution time of any job and the estimated resource count of the job as the job area of the job, selecting the job with the smallest job area among all jobs, and setting it to the highest priority in the current scheduling period.
6. The deep learning job system of claim 5, wherein the predicted working parameters comprise: a fault parameter λ reflecting the average fault probability p of all the available GPUs, where p = 1 - e^{-λ}, and the average recovery time δ after a GPU fault; and the predicted job parameters comprise: the fault-free theoretical remaining run length T of each job, the estimated resource count n, and the job scheduling period τ;
the remaining execution time of any job being
T_0 = (T/τ)·τ_0,
wherein τ_0 is the predicted execution time of the job within one job scheduling period τ,
τ_0 = ((1 - p')/p')·(1/(nλ) + δ),
and p' is the probability that the job suffers no fault within one job scheduling period τ.
7. The deep learning job system of claim 6, wherein, within one job scheduling period τ, p' = e^{-nλτ}.
8. The deep learning job system of claim 5, wherein all deep learning jobs are executed in at least one job scheduling cycle.
9. A computer-readable storage medium storing computer-executable instructions, which when executed by a processor implement the deep-learning job priority scheduling method of any one of claims 1 to 4.
10. A data processing apparatus, comprising:
a GPU cluster;
a processor; and
the computer-readable storage medium of claim 9, wherein the processor, when retrieving and executing the computer-executable instructions in the computer-readable storage medium, schedules deep learning jobs for execution on the GPU cluster.
CN202110794626.4A 2021-07-14 2021-07-14 Deep learning job priority scheduling method and deep learning job system Pending CN113568725A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110794626.4A CN113568725A (en) 2021-07-14 2021-07-14 Deep learning job priority scheduling method and deep learning job system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110794626.4A CN113568725A (en) 2021-07-14 2021-07-14 Deep learning job priority scheduling method and deep learning job system

Publications (1)

Publication Number Publication Date
CN113568725A true CN113568725A (en) 2021-10-29

Family

ID=78164761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110794626.4A Pending CN113568725A (en) 2021-07-14 2021-07-14 Deep learning job priority scheduling method and deep learning job system

Country Status (1)

Country Link
CN (1) CN113568725A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101339521A (en) * 2008-07-28 2009-01-07 华中科技大学 Tasks priority dynamic dispatching algorithm
CN106445673A (en) * 2016-10-14 2017-02-22 苏州光蓝信息技术有限公司 Fault-tolerant task scheduling method oriented to mixed-criticality real-time system
CN106980532A (en) * 2016-01-18 2017-07-25 西安中兴新软件有限责任公司 A kind of job scheduling method and device
CN107193655A (en) * 2017-05-17 2017-09-22 南京大学 A kind of fair resource dispatching method towards big data processing based on utility function
CN110471758A (en) * 2019-07-02 2019-11-19 中国电力科学研究院有限公司 A kind of network analysis applications multi-user concurrent job scheduling system and method
CN111274021A (en) * 2020-02-27 2020-06-12 苏宁云计算有限公司 GPU cluster task scheduling and distributing method
CN111694656A (en) * 2020-04-22 2020-09-22 北京大学 Cluster resource scheduling method and system based on multi-agent deep reinforcement learning
CN112035251A (en) * 2020-07-14 2020-12-04 中科院计算所西部高等技术研究院 Deep learning training system and method based on reinforcement learning operation layout
CN112256434A (en) * 2020-10-30 2021-01-22 中国科学院信息工程研究所 Resource matching method in encrypted data cracking scene


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘素芹, "面积最大优先调度的预约回填算法" ("A reservation backfilling algorithm with largest-area-first scheduling"), 《微计算机应用》 (Microcomputer Applications), vol. 29, no. 12, 31 December 2008, pages 5-9 *
王罡, "可重构加速平台下基于面积性能比的多任务调度优化策略研究" ("Research on multi-task scheduling optimization strategies based on the area-performance ratio on reconfigurable acceleration platforms"), Chinese master's theses collection, page 4 *

Similar Documents

Publication Publication Date Title
US8924976B2 (en) Task scheduling method and apparatus
CN111381950B (en) Multi-copy-based task scheduling method and system for edge computing environment
US8015564B1 (en) Method of dispatching tasks in multi-processor computing environment with dispatching rules and monitoring of system status
EP2742426B1 (en) Network-aware coordination of virtual machine migrations in enterprise data centers and clouds
US7752622B1 (en) Method and apparatus for flexible job pre-emption
Van Tilborg et al. Foundations of real-time computing: Scheduling and resource management
Bril et al. Worst-case response time analysis of real-time tasks under fixed-priority scheduling with deferred preemption
US7743378B1 (en) Method and apparatus for multi-dimensional priority determination for job scheduling
US9442760B2 (en) Job scheduling using expected server performance information
JP3922070B2 (en) Distributed control method and apparatus
US7076781B2 (en) Resource reservation for large-scale job scheduling
Hui et al. Improved strategies for dynamic load balancing
US7844968B1 (en) System for predicting earliest completion time and using static priority having initial priority and static urgency for job scheduling
US20120167101A1 (en) System and method for proactive task scheduling
US7984447B1 (en) Method and apparatus for balancing project shares within job assignment and scheduling
Chen et al. Adaptive multiple-workflow scheduling with task rearrangement
CN103699433B (en) One kind dynamically adjusts number of tasks purpose method and system in Hadoop platform
US8214836B1 (en) Method and apparatus for job assignment and scheduling using advance reservation, backfilling, and preemption
US20070195356A1 (en) Job preempt set generation for resource management
CN108509280B (en) Distributed computing cluster locality scheduling method based on push model
CN111026519A (en) Distributed task priority scheduling method and system and storage medium
CN111367644A (en) Task scheduling method and device for heterogeneous fusion system
CN112579271A (en) Real-time task scheduling method, module, terminal and storage medium for non-real-time operating system
CN113568725A (en) Deep learning job priority scheduling method and deep learning job system
CN111708799B (en) Spark task processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination