CN116501486A - Cluster resource scheduling method, system, terminal and storage medium - Google Patents

Cluster resource scheduling method, system, terminal and storage medium

Info

Publication number
CN116501486A
Authority
CN
China
Prior art keywords
job
cluster
elastic
resource amount
elastic training
Prior art date
Legal status
Pending
Application number
CN202310295370.1A
Other languages
Chinese (zh)
Inventor
曹绍猛
刘昌松
徐莉芳
陈红宇
靳新
Current Assignee
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202310295370.1A
Publication of CN116501486A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5083 - Techniques for rebalancing the load in a distributed system
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a cluster resource scheduling method, system, terminal and storage medium, in which a resource manager, a task life cycle manager and a scheduler are innovatively designed on the basis of a mainstream AI computing platform. When a job to be processed is an elastic training job, the allocated computing resource amount of the elastic training job is determined from the predicted idle resource amount of the cluster in a first preset time period and the required computing resource amount of the elastic training job; job nodes are allocated in the cluster for the elastic training job according to the real-time idle resource amount of the cluster and the allocated computing resource amount, and the job nodes are invoked to execute the elastic training job in a second preset time period; the allocated computing resource amount of the elastic training job is then adjusted based on the predicted idle resource amount of the cluster in the next first preset time period; and job nodes are reallocated according to the real-time idle resource amount and the adjusted allocated computing resource amount and invoked to continue executing the elastic training job in the next second preset time period, so that the cluster utilization rate is effectively improved.

Description

Cluster resource scheduling method, system, terminal and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a cluster resource scheduling method, system, terminal, and storage medium.
Background
With the rapid development of artificial intelligence, a single computing unit (such as an AI computing card) cannot support the training and optimization of the model, so that the deep learning model is deployed in the cluster, and the training and optimization efficiency of the deep learning model is improved through computing resources provided by the cluster. The cluster is composed of a plurality of nodes, and a plurality of computing units can be arranged on each node.
The training and optimization of a deep learning model may be divided into a plurality of job tasks, and each job task may require a different amount of computing resources. In the prior art, the computing resources of the cluster are often allocated to each job task evenly or in a pre-configured manner, which causes problems such as a low cluster utilization rate, serious job task queuing and low cluster efficiency.
Disclosure of Invention
The invention mainly aims to provide a cluster resource scheduling method, a system, a terminal and a computer readable storage medium, and aims to solve the problems of low cluster utilization rate, serious job task queuing, low cluster efficiency and the like in the training and optimizing processes of a deep learning model in the prior art.
In order to achieve the above object, the present invention provides a cluster resource scheduling method, which includes:
acquiring a to-be-processed operation for training a deep learning model;
under the condition that the job to be processed is an elastic training job, determining the allocation calculation resource quantity of the elastic training job according to the predicted idle resource quantity of the cluster in a first preset time period and the required calculation resource quantity of the elastic training job;
according to the real-time idle resource quantity of the cluster and the distributed computing resource quantity of the elastic training job, distributing nodes in the cluster for the elastic training job as job nodes of the elastic training job so as to call the job nodes to execute the elastic training job in a second preset time period;
adjusting the allocated computing resource amount of the elastic training job based on the predicted free resource amount of the cluster in a next first preset time period;
and reallocating the operation node of the elastic training operation for the elastic training operation according to the real-time idle resource quantity of the cluster and the allocated calculation resource quantity regulated by the elastic training operation, so as to call the reallocated operation node to continuously execute the elastic training operation in the next second preset time period, continuously execute the step of regulating the allocated calculation resource quantity of the elastic training operation based on the predicted idle resource quantity of the cluster in the next first preset time period until the elastic training operation is completed.
In some embodiments of the present invention, the determining the allocated computing resource amount of the elastic training job according to the predicted free resource amount of the cluster in the first preset time period and the required computing resource amount of the elastic training job specifically includes:
when the predicted free resource amount in the first preset time period of the cluster is larger than the maximum value of the required computing resource amount of the elastic training job, taking the minimum value of the required computing resource amount of the elastic training job as a job fixed resource amount of the elastic training job; and taking the first difference value as the operation elastic resource amount of the elastic training operation;
wherein the first difference is a difference between the maximum value and the minimum value of the required calculation resource amount of the elastic training operation;
taking the minimum value of the required computing resource amount of the elastic training job as a job fixed resource amount of the elastic training job in the case that the predicted free resource amount of the cluster is smaller than the maximum value of the required computing resource amount of the elastic training job and is greater than or equal to the minimum value of the required computing resource amount; and taking the second difference value as the job elastic resource amount of the elastic training job;
wherein the second difference is a difference between the predicted free resource amount of the cluster and the minimum value of the required computing resource amount of the elastic training job;
taking the minimum value of the required computing resource amount of the elastic training job as a job fixed resource amount of the elastic training job and taking 0 as a job elastic resource amount of the elastic training job when the predicted free resource amount of the cluster is smaller than the minimum value of the required computing resource amount of the elastic training job;
and determining the allocation computing resource quantity of the elastic training job according to the job fixed resource quantity and the job elastic resource quantity of the elastic training job.
In some embodiments of the present invention, before determining the allocated computing resource amount of the elastic training job based on the predicted free resource amount of the cluster within a first preset time period and the required computing resource amount of the elastic training job, the method further comprises:
predicting the predicted available resource amount of the cluster in the first preset time period through a preset cluster resource prediction model; and
acquiring the released available resource amount released by the currently running job to be processed of the cluster in the first preset time period;
And determining the predicted idle resource amount of the cluster in the first preset time period according to the predicted available resource amount and the released available resource amount.
In some embodiments of the present invention, the adjusting the allocated computing resource amount of the elastic training job based on the predicted free resource amount of the cluster in the next first preset time period specifically includes:
expanding the job elastic resource amount of the elastic training job under the condition that the predicted idle resource amount in the next first preset time period of the cluster is larger than 0 and the job elastic resource amount of the elastic training job is smaller than a first comparison value;
the first comparison value is a difference value between a maximum value and a minimum value of the required calculation resource amount of the elastic training operation;
reducing the job elastic resource amount of the elastic training job when the waiting job resource amount is greater than 0 and the job elastic resource amount of the elastic training job is greater than 0 in a next first preset time period of the cluster;
and adjusting the allocated computing resource amount of the elastic training job according to the job elastic resource amount after the elastic training job is expanded or reduced.
In some embodiments of the present invention, in a case where the predicted free resource amount in the next first preset time period of the cluster is greater than 0 and the job elastic resource amount of the elastic training job is less than a first comparison value, expanding the job elastic resource amount of the elastic training job specifically includes:
when the predicted free resource amount of the cluster in the next first preset time period is larger than a second comparison value, the second comparison value is used as a capacity expansion resource amount;
the second comparison value is obtained by subtracting the minimum value of the required computing resource amount of the elastic training job and the job elastic resource amount from the maximum value of the required computing resource amount;
taking the predicted idle resource amount of the cluster in the next first preset time period as a capacity expansion resource amount under the condition that the predicted idle resource amount of the cluster in the next first preset time period is smaller than or equal to a second comparison value;
and expanding the capacity of the operation elastic resource amount of the elastic training operation according to the capacity expansion resource amount.
In some embodiments of the present invention, the allocating a node of the cluster to the elastic training job according to the real-time free resource amount of the cluster and the allocated computing resource amount of the elastic training job, as a job node of the elastic training job, specifically includes:
Determining whether an amount of real-time free resources of the cluster meets the amount of allocated computing resources of the elastic training job;
under the condition that the real-time idle resource quantity of the cluster meets the allocation calculation resource quantity of the elastic training job, allocating the node of the cluster for the elastic training job as the job node of the elastic training job;
when the real-time idle resource quantity of the cluster does not meet the allocation calculation resource quantity of the elastic training job, adjusting the elastic training job from a scheduling queue of a preset queue to a waiting scheduling queue of the preset queue;
and after a preset waiting scheduling time, readjusting the elastic training job from the waiting scheduling queue to the scheduling queue, and allocating the node of the cluster to the elastic training job as the job node of the elastic training job in the case that the real-time idle resource amount of the cluster meets the allocated computing resource amount of the elastic training job, until the elastic training job is executed.
In some embodiments of the present invention, after invoking the job node to perform the elastic training job within a second preset time period, the method further comprises:
And adjusting the elastic training job from a scheduling queue of a preset queue to a secondary scheduling queue of the preset queue, so that after adjusting the allocation calculation resource amount of the elastic training job based on the predicted idle resource amount of the cluster in the next first preset time period, the elastic training job is scheduled from the secondary scheduling queue, and the elastic training job is continuously executed in the next second preset time period.
In some embodiments of the invention, after acquiring the job to be processed for deep learning model training, the method further comprises:
under the condition that the job to be processed is an elastic fixed job, determining the allocation calculation resource amount of the elastic fixed job according to the predicted idle resource amount of the cluster in the first preset time period and the required calculation resource amount of the elastic fixed job;
and according to the real-time idle resource quantity of the cluster and the distributed computing resource quantity of the elastic fixed job, distributing the node of the cluster for the elastic fixed job as the job node of the elastic fixed job so as to call the job node to execute the elastic fixed job.
In some embodiments of the invention, after acquiring the job to be processed for deep learning model training, the method further comprises:
and under the condition that the job to be processed is an inelastic job, according to the real-time idle resource quantity of the cluster and the required calculation resource quantity of the inelastic job, distributing nodes in the cluster for the inelastic job as the job nodes of the inelastic job so as to call the job nodes to execute the inelastic job.
In some embodiments of the invention, the method further comprises:
and in the case that a job node of a preemptible job to be processed is preempted by another job to be processed, controlling the job node to suspend execution of the job to be processed, and after the execution of the other job to be processed is completed, controlling the preempted job node to continue executing the job to be processed.
In order to achieve the above object, the present invention further provides a cluster resource scheduling system, the system comprising: the system comprises a job module, a resource manager, a task life cycle manager and a scheduler;
the operation module is used for acquiring an operation to be processed for training the deep learning model;
The resource manager is used for determining the allocation calculation resource amount of the elastic training job according to the predicted idle resource amount of the cluster in a first preset time period and the required calculation resource amount of the elastic training job under the condition that the job to be processed is the elastic training job;
the scheduler is used for distributing nodes in the cluster for the elastic training job according to the real-time idle resource quantity of the cluster and the distributed computing resource quantity of the elastic training job, and the nodes are used as job nodes of the elastic training job so as to call the job nodes to execute the elastic training job in a second preset time period;
the task life cycle manager is used for adjusting the distributed computing resource amount of the elastic training job based on the predicted idle resource amount of the cluster in the next first preset time period;
the scheduler is further configured to reallocate, for the elastic training job, a job node of the elastic training job according to the real-time idle resource amount of the cluster and the allocated computing resource amount adjusted by the elastic training job, so as to invoke the reallocated job node to continue to execute the elastic training job in a next second preset time period, and continue to execute the step of adjusting the allocated computing resource amount of the elastic training job based on the predicted idle resource amount of the cluster in the next first preset time period until the execution of the elastic training job is completed.
To achieve the above object, the present invention also provides a computer readable storage medium, wherein the computer readable storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps in the cluster resource scheduling method as set forth in any one of the above.
In order to achieve the above object, the present invention further provides a terminal, which is characterized by comprising: a processor and a memory; the memory has stored thereon a computer readable program executable by the processor; the steps in the cluster resource scheduling method according to any one of the preceding claims are implemented when the processor executes the computer readable program.
According to the method, the predicted idle resource quantity of the cluster in the first preset time period is obtained through prediction, the computing resource quantity is distributed to the elastic training operation, and then the corresponding operation node is distributed to the elastic training operation according to the real-time idle resource quantity of the cluster and the distributed computing resource quantity so as to call the operation node to execute the elastic training operation. And in the execution process of the elastic training job, the allocation calculation resource amount of the elastic training job is adjusted according to the predicted idle resource amount of the cluster in the next first preset time period, so that the job node is reallocated for the elastic training job, the reallocated job node is called to calculate and execute the elastic training job in the next second preset time period, the adjustment of the calculation resource amount of the elastic training job is realized based on the dynamic change of the resources of the cluster, fragmented idle resources are fully utilized, the utilization rate of the cluster resources is improved, the problem of serious cluster queuing condition is solved, and the training of a deep learning network model is accelerated.
Drawings
FIG. 1 is a schematic diagram of a cluster resource scheduling system according to an embodiment of the present invention;
FIG. 2 is a flowchart of a cluster resource scheduling method according to an embodiment of the present invention;
fig. 3 is a flowchart of step S220 provided in an embodiment of the present invention;
fig. 4 is a flowchart of step S310 provided in an embodiment of the present invention;
FIG. 5 is another flowchart of a cluster resource scheduling method according to an embodiment of the present invention;
fig. 6 is a flowchart of step S250 provided in an embodiment of the present invention;
fig. 7 is a flowchart of step S610 provided in an embodiment of the present invention;
FIG. 8 is another flowchart of a cluster resource scheduling method according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clear and clear, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a schematic structural diagram of a cluster resource scheduling system provided by an embodiment of the present invention, where, as shown in fig. 1, the cluster resource scheduling system provided by the embodiment of the present invention mainly includes: job module 100, resource manager 200, task lifecycle manager 300, scheduler 400.
The job module 100 is configured to construct a job to be processed for training of the deep learning model based on a user operation to acquire the job to be processed and job basic information of the job to be processed.
Wherein the job to be processed includes: an elastic training job, an elastic fixed job and an inelastic job. It will be appreciated that whether a job to be processed is an elastic training job, an elastic fixed job or an inelastic job is determined based on the user operation, that is, it is specified by the user.
The job basic information of the job to be processed may include job attributes and a required computing resource amount. The job attributes include preemptible and non-preemptible. The required computing resource amount refers to the computing resource amount of the job to be processed specified by the user, which may be represented by an interval [min, max], that is, the required computing resource amount of the job to be processed has a maximum value max and a minimum value min.
Preemptible means that another job to be processed with a higher priority may preempt the computing resources of the job during its execution; the preempted job to be processed is suspended, and after the execution of the higher-priority job is completed, the preempted job continues to be executed on the basis of its original computing resources. Non-preemptible means that if the job to be processed is preempted by another job with a higher priority, an error is reported and the job is not resumed after the execution of the higher-priority job is completed.
It will be appreciated that the job attributes and the required computing resource amount of the job to be processed are specified by the user when constructing the job to be processed, and can be determined by the job module 100 from the user operation.
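For illustration only, the job basic information described above could be represented by a simple data structure; a minimal Python sketch (names such as JobSpec and its fields are illustrative assumptions, not part of the disclosure) might look like this:

from dataclasses import dataclass
from enum import Enum

class JobType(Enum):
    ELASTIC_TRAINING = "elastic training job"
    ELASTIC_FIXED = "elastic fixed job"
    INELASTIC = "inelastic job"

@dataclass
class JobSpec:
    job_type: JobType    # specified by the user when constructing the job
    preemptible: bool    # job attribute: whether other jobs may preempt its resources
    min_resources: int   # minimum of the required computing resource interval [min, max]
    max_resources: int   # maximum of the required computing resource interval [min, max]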
The resource manager 200 is configured to predict an amount of idle resources in a first preset period of time of the cluster, determine an amount of allocated computing resources for the elastic training job and the elastic fixed job, and monitor an amount of real-time idle resources of the cluster in real time.
The allocated computing resource amount referred to here is the computing resource amount allocated by the resource manager 200 on the basis of the required computing resource amount of an elastic training job or an elastic fixed job. In the embodiment of the present invention, the resource manager 200 does not need to allocate a computing resource amount for an inelastic job, that is, an inelastic job has no allocated computing resource amount. In addition, the allocated computing resource amount of an elastic training job is the sum of its job elastic resource amount and its job fixed resource amount, and the allocated computing resource amount of an elastic fixed job is its job fixed resource amount.
The task lifecycle manager 300 is configured to adjust an allocation computing resource amount of the elastic training job according to the predicted idle resource amounts of the clusters in different first preset time periods. In an embodiment of the present invention, the task lifecycle manager 300 adjusts the amount of computing resources allocated by adjusting the amount of job elastic resources of the elastic training job.
The scheduler 400 is configured to allocate a corresponding job node to a job to be processed according to the real-time idle resource amount of the cluster acquired by the resource manager 200 in real time, so as to implement training of the deep learning model by calling the corresponding job node to execute the job to be processed. Training as referred to herein may refer to pre-training, model optimization, and the like.
Therefore, in the embodiment of the invention, an elastic training job constructed by the user can dynamically adjust the computing resources used for its execution according to the dynamic change of the cluster's resources during execution; it is suitable for distributed machine learning, makes full use of the available resources of the cluster for fast model training, and is generally used for pre-training of a deep learning model. An elastic fixed job is allocated corresponding computing resources for training based on the available resources of the cluster, but cannot be elastically adjusted according to the dynamic change of the cluster's resources. An inelastic job does not need computing resources allocated based on the available resources of the cluster; it only needs the fixed computing resources specified by the user, and is commonly used for model performance improvement, leaderboard-style benchmark runs, model parallelism in parallel computing, and the like.
In some embodiments of the present invention, a plurality of preset queues are set in the scheduler, namely: a scheduling queue, a waiting scheduling queue, a non-scheduling queue, a secondary scheduling queue and a preemption queue.
A newly constructed job to be processed is added to the scheduling queue to wait for scheduling; the jobs to be processed that are waiting to be scheduled reside in the scheduling queue.
The waiting scheduling queue holds jobs to be processed whose first scheduling attempt failed because of insufficient cluster idle resources; after the corresponding preset time, the jobs in the waiting scheduling queue are added back to the scheduling queue to be scheduled again.
The non-scheduling queue holds jobs to be processed that fail to be scheduled for reasons such as parameter configuration errors or unbound storage. Jobs in the non-scheduling queue cannot be scheduled.
The secondary scheduling queue holds elastic training jobs that have been scheduled successfully. The elastic training jobs in the secondary scheduling queue wait for secondary scheduling, and their allocated computing resources are expanded or reduced in cooperation with the task lifecycle manager.
The preemption queue holds super jobs. In the embodiment of the invention, a super job enters this queue when its first scheduling attempt fails because of insufficient cluster idle resources; after the corresponding preset time, if the cluster idle resources still cannot satisfy the super job, the super job may preempt the computing resources of any preemptible job to be processed.
The embodiment of the cluster resource scheduling system provided by the embodiment of the present invention is basically the same as each embodiment of the following cluster resource scheduling method, and will not be described herein.
The invention provides a cluster resource scheduling method, and fig. 2 is a flowchart of the cluster resource scheduling method provided by the embodiment of the invention. As shown in fig. 2, the cluster resource scheduling method provided by the embodiment of the invention at least includes the following steps:
s210, acquiring a to-be-processed job for training a deep learning model and job basic information of the to-be-processed job.
In the embodiment of the invention, the cluster resource scheduling method is applied to a cluster resource scheduling system.
Specifically, the job module 100 of the cluster resource scheduling system may acquire, based on the operation of the user, a job to be processed for deep learning model training and the job basic information of the job to be processed. As can be seen from the above, the job to be processed includes at least three types: an elastic training job, an elastic fixed job and an inelastic job; and the job basic information of the job to be processed may include at least the job attributes and the required computing resource amount, which are not described here again.
S220, under the condition that the job to be processed is an elastic training job, determining the allocation calculation resource quantity of the elastic training job according to the predicted idle resource quantity of the cluster in the first preset time period and the required calculation resource quantity of the elastic training job.
Specifically, as shown in fig. 3, the above step S220 may be implemented at least by the following method:
s310, obtaining the predicted idle resource quantity of the cluster in a first preset time period.
In the embodiment of the present invention, time may be divided into a plurality of first preset time periods according to a corresponding preset time interval. The resource manager 200 obtains the predicted amount of free resources within the first preset time period.
Specifically, as shown in fig. 4, step S310 may be implemented at least by:
s410, predicting the predicted available resource quantity of the cluster in a first preset time period through a preset cluster resource prediction model.
In the embodiment of the present invention, the preset cluster resource prediction model may be an ARIMA model.
Specifically, a preset cluster resource prediction model may be set in the resource manager 200. Taking an ARIMA model as an example, by using the ARIMA model for cluster resource prediction set by the resource manager 200, the available resource quantity of the cluster in a certain future time range is predicted according to the historical resource information of the cluster, and the predicted available resource quantity of the cluster in a first preset time period is obtained.
The resource manager 200 may monitor the resource situation of the cluster in real time, that is, may record historical resource information of the clusters of different time nodes at correspondingly preset time intervals (e.g., minutes). Wherein the historical resource information of each time node includes: the total amount of computing resources of the cluster of current time nodes, the amount of available resources (i.e., the amount of free resources) of the cluster.
In addition, the construction of an ARIMA model suitable for cluster resource prediction by using historical data belongs to the prior art, and is not described in detail in the embodiment of the present invention.
Through the above scheme, the available resource amount of the cluster in the first preset time period can be predicted based on the historical resource information of the cluster, that is, the predicted available resource amount is obtained.
S420, acquiring the released available resource amount released by the currently executing job to be processed in the cluster in a first preset time period.
Specifically, the resource manager 200 may obtain, through task running time of the historical job, an amount of available resources released (i.e., an amount of released available resources) by the currently executing job to be processed of the cluster within the first preset period of time.
In an embodiment of the present invention, the resource manager 200 may record and store the task running time of completed jobs (i.e., historical jobs) together with the basic information of the historical jobs. For different versions of the same job, jobs with the same data set and the same required computing resource amount have similar task running times. Therefore, by obtaining the task running time of the historical job corresponding to each currently executing job to be processed in the cluster, the amount of available resources released by the cluster within the first preset time period can be determined.
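As an illustrative sketch only (the data layout and the runtime_history lookup are assumptions, not part of the disclosure), the released available resource amount could be estimated along the following lines:

def estimate_released_resources(running_jobs, runtime_history, period_end):
    # Sum the resources of currently running jobs whose expected finish time,
    # looked up from historical runs with the same data set and the same
    # required computing resource amount, falls inside the first preset time period.
    released = 0
    for job in running_jobs:
        expected_runtime = runtime_history[(job["dataset"], job["requested"])]
        if job["start_time"] + expected_runtime <= period_end:
            released += job["allocated"]
    return released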
S430, determining the predicted idle resource amount of the cluster in the first preset time period according to the predicted available resource amount and the released available resource amount of the cluster in the first preset time period.
In particular, the resource manager 200 may predict the amount of available resources (i.e., predict the amount of free resources) over a certain time horizon in the future through two dimensions. These two dimensions are respectively: 1) The change condition of historical computing resources of the cluster; 2) Task runtime of historical jobs. Wherein, the change condition of the historical computing resources of the cluster can be represented by the historical resource information of the cluster.
Further, the predicted idle resource amount in the first preset time period = a × predicted available resource amount + b × released available resource amount + z.
Wherein a, b and z are preset parameters that can be adjusted according to the actual application. In the embodiment of the present invention, the default values may be a = 0.8, b = 0.2 and z = 0.
Through the above steps S410 to S430, the idle resources of the cluster in the first preset time period can be fully predicted, where the first preset time period can be 30 minutes, 1 hour, etc., and can be adjusted based on the actual situation.
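A minimal sketch of the weighted combination described above, assuming the default parameter values given in the text (a = 0.8, b = 0.2, z = 0); the function name is illustrative:

def predict_idle_resources(predicted_available, released_available, a=0.8, b=0.2, z=0.0):
    # Weighted combination of the two prediction dimensions: the ARIMA-based
    # prediction of available resources and the resources expected to be released.
    return a * predicted_available + b * released_available + z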
S320, when the predicted free resource amount of the cluster in the first preset time period is larger than the maximum value of the required computing resource amount of the elastic training job, taking the minimum value of the required computing resource amount of the elastic training job as the job fixed resource amount of the elastic training job and taking the first difference value as the job elastic resource amount of the elastic training job.
The first difference is the difference between the maximum value max and the minimum value min of the required calculation resource amount of the elastic training operation.
S330, when the predicted free resource amount of the cluster in the first preset time period is smaller than the maximum value of the required computing resource amount of the elastic training job and is greater than or equal to the minimum value of the required computing resource amount, taking the minimum value of the required computing resource amount of the elastic training job as the job fixed resource amount of the elastic training job and taking the second difference value as the job elastic resource amount of the elastic training job.
The second difference value is a difference value between a predicted idle resource amount of the cluster in a first preset time period and a minimum value min of a required computing resource amount of the elastic training operation.
S340, in the case that the predicted free resource amount of the cluster in the first preset time period is smaller than the minimum value of the required computing resource amount of the elastic training job, taking the minimum value of the required computing resource amount of the elastic training job as the job fixed resource amount of the elastic training job and taking 0 as the job elastic resource amount of the elastic training job.
Through the above embodiment, the job fixed resource amount of the elastic training job is the minimum value of its required computing resource amount, which guarantees the basic resource requirement of the elastic training job. The job elastic resource amount of the elastic training job is determined according to the predicted idle resource amount of the cluster within a certain future time range, so that fragmented resources in the cluster can be utilized more fully, the utilization rate of the cluster is further improved, and the completion speed of the elastic training job is increased.
S350, determining the allocation calculation resource quantity of the elastic training job according to the job fixed resource quantity and the job elastic resource quantity of the elastic training job.
The allocation computing resource amount of the elastic training job is the sum value of the job fixed resource amount and the job elastic resource amount, namely:
allocated computing resource amount of the elastic training job = job elastic resource amount + job fixed resource amount.
In the embodiment of the invention, for the elastic training job, according to the situation that the cluster has available resource in a certain time range in the future, the resource manager 200 allocates corresponding computing resource for the elastic training job, thereby improving the utilization rate of the cluster, fully utilizing the resources in the cluster, and improving the completion speed of the elastic training job.
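For illustration, the three cases of steps S320 to S340 and the summation of step S350 can be sketched as follows (function and variable names are illustrative assumptions):

def determine_allocation(predicted_idle, min_req, max_req):
    # The job fixed resource amount is always the minimum requirement (S320-S340).
    fixed = min_req
    if predicted_idle > max_req:        # S320: prediction exceeds the maximum request
        elastic = max_req - min_req
    elif predicted_idle >= min_req:     # S330: partial headroom above the minimum
        elastic = predicted_idle - min_req
    else:                               # S340: only the minimum is guaranteed
        elastic = 0
    allocated = fixed + elastic         # S350: allocated = fixed + elastic
    return allocated, fixed, elastic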
S230, according to the real-time idle resource quantity of the cluster and the distributed computing resource quantity of the elastic training job, distributing nodes in the cluster for the elastic training job as job nodes of the elastic training job.
Specifically, the real-time idle resource amount of the cluster can be acquired first; and under the condition that the real-time idle resource quantity of the cluster is larger than or equal to the distributed computing resource quantity of the elastic training job, distributing nodes in the cluster for the elastic training job as job nodes of the elastic training job.
Further, the resource manager 200 obtains the real-time idle resource amount of the cluster by means of real-time monitoring. The scheduler compares the real-time idle resource quantity of the cluster with the distributed computing resource quantity of the elastic training job, and can indicate that the current cluster can execute the elastic training job under the condition that the real-time idle resource quantity of the cluster is larger than or equal to the distributed computing resource quantity of the elastic training job, and at the moment, the scheduler distributes corresponding nodes for the elastic training job as job nodes of the elastic training job according to the resource use condition of each node in the cluster.
It will be appreciated that the amount of computing resources may vary from job node to job node in the cluster, as may the amount of free resources per job node. Therefore, when the operation node of the elastic training operation executes the elastic training operation, the amount of the computing resource used by each operation node for the elastic training operation may be different.
For example, cluster A is made up of multiple nodes, each node including 8 AI computing cards, and the number of available computing cards remaining on each node differs. When the allocated computing resource amount for an elastic training job is 128 cards, 32 nodes may be used, with each node being allocated 4 AI computing cards to execute the elastic training job.
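The disclosure does not prescribe how the allocated cards are spread over the nodes; purely as an illustration, a greedy placement consistent with the example above (128 cards over nodes with free AI computing cards) could be sketched as:

def assign_job_nodes(allocated_cards, free_cards_per_node):
    # Greedily pick nodes until their free AI computing cards cover the
    # allocated computing resource amount; returns {node: cards} or None.
    assignment = {}
    remaining = allocated_cards
    for node, free in free_cards_per_node.items():
        if remaining <= 0:
            break
        take = min(free, remaining)
        if take > 0:
            assignment[node] = take
            remaining -= take
    return assignment if remaining <= 0 else None

For instance, calling assign_job_nodes(128, {f"node{i}": 4 for i in range(32)}) would place 4 cards on each of the 32 nodes.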
As can be seen from the above, there are a plurality of preset queues in the scheduler, and the scheduling queue holds the jobs to be processed that are waiting to be scheduled. In some embodiments of the present invention, the jobs to be processed in the scheduling queue may be ordered as follows:
calculating the job score value of each job to be processed according to the waiting time, the allocated calculation resource amount, the priority, the job attribute and the job type of each job to be processed in the scheduling queue;
and sequencing the jobs to be processed in the scheduling queue based on the job score value of each job to be processed in the scheduling queue.
Further, the job score value of the job to be processed = f(waiting time, reciprocal of the allocated computing resource amount, priority, whether preemption is allowed, job type).
For example, the job score value of the job to be processed = a × waiting time + b × reciprocal of the allocated computing resource amount + c × priority + d × whether preemption is allowed + e × job type. Wherein a, b, c, d and e are all parameters; they may be initialized to 1 and subsequently adjusted according to usage.
In the embodiment of the invention, the higher the job score value of the job to be processed is, the closer to the front end of the scheduling queue is, and the earlier the job score value is scheduled.
The above-mentioned operation type means that the operation to be processed is an elastic training operation, an elastic fixing operation or an inelastic operation. Different scores may be preconfigured for different job types.
It will be appreciated that the job attribute is whether to allow preemption, and different scores may be configured in advance for the job to be processed that can be preempted and cannot be preempted, so as to calculate a job score value of the job to be processed.
By the above method, when many jobs to be processed are waiting in the scheduling queue, the jobs to be processed that need to be executed more urgently are scheduled first through the ordering.
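For illustration only, the scoring and ordering described above might be sketched as follows; the PendingJob fields and the job_type_score encoding are assumptions made for this example:

from dataclasses import dataclass

@dataclass
class PendingJob:
    waiting_time: float
    allocated_resources: float   # its reciprocal is used, so smaller requests score higher
    priority: int
    preemptible: bool
    job_type_score: int          # pre-configured score for the job type

def job_score(job, a=1.0, b=1.0, c=1.0, d=1.0, e=1.0):
    # a*waiting time + b*(1/allocated resources) + c*priority
    # + d*(whether preemption is allowed) + e*(job type score); weights start at 1.
    return (a * job.waiting_time
            + b * (1.0 / job.allocated_resources)
            + c * job.priority
            + d * (1 if job.preemptible else 0)
            + e * job.job_type_score)

def order_scheduling_queue(jobs):
    # A higher score places the job closer to the front of the scheduling queue.
    return sorted(jobs, key=job_score, reverse=True)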
In addition, in the actual application process, there is a situation that the real-time idle resource amount of the cluster cannot meet the requirement of the elastic training operation, as shown in fig. 5, the method for scheduling cluster resources provided by the embodiment of the invention may further include the following steps:
s510, adding the elastic training job from a scheduling queue of a preset queue to a waiting scheduling queue when the real-time free resource amount of the cluster does not meet the allocated computing resource amount of the elastic training job.
The fact that the real-time idle resource amount of the cluster does not meet the allocated computing resource amount of the elastic training job means that the real-time idle resource amount of the cluster is smaller than the allocated computing resource amount of the elastic training job, and therefore the cluster cannot execute the elastic training job currently.
Thus, in this case, the scheduler adjusts the flexible training job from the scheduling queue into the waiting scheduling queue.
S520, after the preset time, the elastic training job is added to the scheduling queue from the waiting scheduling queue, and under the condition that the real-time idle resource quantity of the cluster meets the distributed computing resource quantity of the elastic training job, the node of the cluster is distributed to the elastic training job and used as the job node of the elastic training job until the elastic training job is executed.
The waiting scheduling time can be set for the to-be-processed job in the waiting scheduling queue in the scheduler, the scheduler adds the elastic training job exceeding the waiting scheduling time in the waiting scheduling queue to the scheduling queue again, and under the condition that the real-time idle resource quantity of the cluster meets the allocation computing resource of the elastic training job, the node of the cluster is allocated for the elastic training job and serves as the job node of the elastic training job.
It can be understood that, if the real-time free resource amount of the cluster still does not meet the allocated computing resource amount of the elastic training job, the above step S520 continues to be executed until the elastic training job is executed.
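A minimal sketch of the queue movement in steps S510 and S520 (the job and queue representations are assumptions for this example, not part of the disclosure):

def try_schedule(job, real_time_free, scheduling_queue, waiting_queue):
    # S510: if real-time free resources cannot cover the job's allocated amount,
    # move it from the scheduling queue to the waiting scheduling queue.
    if real_time_free < job["allocated"]:
        scheduling_queue.remove(job)
        waiting_queue.append(job)
        return False
    # Otherwise the scheduler can assign job nodes to the elastic training job.
    return True

# S520: after the preset waiting scheduling time, a job in the waiting scheduling
# queue is moved back to the scheduling queue and try_schedule is attempted again.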
S240, calling a job node of the elastic training job, and executing the elastic training job in a second preset time period.
In this embodiment of the present invention, after the job nodes of the elastic training job are determined, the job nodes of the elastic training job are invoked and the elastic training job is executed in the second preset time period.
After the corresponding job nodes are allocated to the elastic training job, the job nodes executing the elastic training job remain unchanged within the second preset time period. The second preset time period mentioned here may be the same as or different from the first preset time period, and may be adjusted according to the actual situation.
S250, based on the predicted idle resource amount of the cluster in the next first preset time period, the allocated computing resource amount of the elastic training job is adjusted.
In the embodiment of the present invention, the task lifecycle manager 300 adjusts the allocated computing resource amount of the elastic training job by predicting the idle resource amount of the cluster in the next first preset time period.
Specifically, as shown in fig. 6, the above step S250 may be implemented at least by:
s610, expanding the operation elastic resource amount of the elastic training operation under the condition that the predicted idle resource amount in the next first preset time period of the cluster is larger than 0 and the operation elastic resource amount of the elastic training operation is smaller than a first comparison value.
The first comparison value is a difference between a maximum value and a minimum value of the required calculation resource amount of the elastic training operation.
In an embodiment of the present invention, the task lifecycle manager 300 monitors each elastic training job for adjusting the amount of allocated computing resources for the elastic training job.
Specifically, the task lifecycle manager 300 obtains, via the resource manager 200, a predicted amount of free resources for a next first preset time period of the cluster. Under the condition that the predicted free resource amount in the next first preset time period of the cluster is greater than 0 and the job elastic resource amount of the elastic training job is smaller than the first comparison value, the task life cycle manager 300 expands the job elastic resource amount of the elastic training job.
Further, as shown in fig. 7, the above step S610 may be implemented at least by:
S710, taking the second comparison value as the capacity expansion resource amount in the case that the predicted free resource amount of the cluster in the next first preset time period is larger than the second comparison value.
The second comparison value is obtained by subtracting the minimum value of the required computing resource amount of the elastic training job and the job elastic resource amount from the maximum value of the required computing resource amount.
S720, taking the predicted idle resource amount of the cluster in the next first preset time period as the capacity expansion resource amount in the case that the predicted idle resource amount of the cluster in the next first preset time period is smaller than or equal to the second comparison value.
S730, expanding the job elastic resource amount of the elastic training job according to the capacity expansion resource amount.
It can be understood that expanding the job elastic resource amount of the elastic training job means: taking the sum of the capacity expansion resource amount and the current job elastic resource amount as the adjusted job elastic resource amount.
From the above, the task lifecycle manager 300 determines the capacity expansion resource amount of the elastic training task by predicting the idle resource amount of the cluster in the next first preset period.
S620, when the waiting job resource amount of the cluster is larger than 0 and the job elastic resource amount of the elastic training job is larger than 0, the job elastic resource amount of the elastic training job is reduced.
The resource amount of the waiting job refers to the resource amount required by the job waiting to be executed in the cluster resource scheduling management system.
In the embodiment of the invention, the difference between the predicted idle resource amount in the previous first preset time period and the predicted idle resource amount in the next first preset time period can be used as the reduction resource amount, while ensuring that the resource amount remaining after the reduction is not smaller than the minimum value of the resource interval set for the elastic training job; the job elastic resource amount of the elastic training job is then reduced according to the reduction resource amount.
It can also be understood that reducing the job elastic resource amount of the elastic training job according to the reduction resource amount means: taking the difference between the current job elastic resource amount and the reduction resource amount as the adjusted job elastic resource amount.
S630, adjusting the allocation calculation resource amount of the elastic training job according to the job elastic resource amount after the expansion or the reduction of the elastic training job.
In the embodiment of the invention, the job elastic resource amount of the elastic training job is expanded or reduced according to the predicted idle resource amount of the cluster in the next first preset time period, so that the adjustment of the allocated computing resource amount of the elastic training job is realized and the allocation adapts to the dynamic change of the cluster's resources.
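For illustration, the expansion and reduction logic of steps S610 to S630 (with S710 to S730 determining the expansion amount) can be sketched as follows; the clamping of the reduction so that the elastic amount stays non-negative is a simplification of the wording above, and all names are illustrative:

def adjust_elastic_amount(predicted_idle_prev, predicted_idle_next,
                          waiting_job_resources, elastic, min_req, max_req):
    first_cmp = max_req - min_req                    # first comparison value
    if predicted_idle_next > 0 and elastic < first_cmp:
        second_cmp = max_req - min_req - elastic     # second comparison value
        # S710/S720: expand by the smaller of the prediction and the remaining headroom
        expand_by = second_cmp if predicted_idle_next > second_cmp else predicted_idle_next
        elastic += expand_by                         # S730
    elif waiting_job_resources > 0 and elastic > 0:
        # S620: reduce using the drop between successive idle-resource predictions,
        # never letting the elastic amount fall below zero (the fixed part stays at min_req)
        reduce_by = max(0, predicted_idle_prev - predicted_idle_next)
        elastic = max(0, elastic - reduce_by)
    return min_req + elastic, elastic                # S630: adjusted allocation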
S260, reallocating the operation nodes for the elastic training operation according to the real-time idle resource quantity of the cluster and the allocated calculation resource quantity adjusted by the elastic training operation, so as to call the reallocated operation nodes to continuously execute the elastic training operation in the next second preset time period.
In the embodiment of the present invention, the scheduler 400 reallocates the job nodes for the elastic training job according to the real-time idle resource amount of the cluster obtained from the resource manager 200 and the adjusted allocated computing resource amount of the elastic training job, and invokes the reallocated job nodes to continue to execute the elastic training job in the next second preset time period.
It should be noted that, the step S260 may refer to the step S230, and is not described herein.
S270, continuing to execute the above steps S250 to S260 until the elastic training job is completed.
The task lifecycle manager 300 may monitor the completion progress of the elastic training job in real time, and continue executing the above steps S250-S260 until the elastic training job is completed if the elastic training is not completed.
In some embodiments of the present invention, after the job node is invoked to execute the elastic training job in the second preset time period, the elastic training job may be further added to the secondary scheduling queue from the scheduling queue of the preset queue, so that after the allocated computing resource amount of the elastic training job is adjusted based on the predicted free resource amount of the cluster in the next first preset time period, the elastic training job is scheduled from the secondary scheduling queue, and the elastic training job is continuously executed in the next second preset time period.
In the embodiment of the invention, a secondary scheduling queue is provided for elastic training jobs, which improves the accuracy of the elastic adjustment of elastic training jobs and effectively avoids jobs being missed.
In addition, the job to be processed may be an elastic fixing job as well. Therefore, as shown in fig. 8, the cluster resource scheduling method provided by the present invention after step S210 may further include the following steps:
S810, under the condition that the job to be processed is an elastic fixed job, determining the allocation calculation resource quantity of the elastic fixed job according to the predicted idle resource quantity of the cluster in the first preset time period and the required calculation resource quantity of the elastic fixed job.
Wherein the allocated computing resource amount of the elastic fixed job is different from that of the elastic training job: the allocated computing resource amount of the elastic fixed job includes only the job fixed resource amount.
Specifically, when the predicted free resource amount in the first preset time period of the cluster is greater than the maximum value of the required computing resource amount of the elastic fixed job, the resource manager 200 takes the maximum value of the required computing resource amount of the elastic fixed job as the job fixed resource amount of the elastic fixed job, that is, the allocated computing resource amount of the elastic fixed job;
when the predicted free resource amount in the first preset time period of the cluster is smaller than the maximum value of the required computing resource amount of the elastic fixed job and larger than the minimum value of the required computing resource amount of the elastic fixed job, the resource manager 200 takes the predicted free resource amount in the first preset time period of the cluster as the job fixed resource amount of the elastic fixed job, namely the allocation computing resource amount of the elastic fixed job;
When the predicted free resource amount in the first preset time period of the cluster is smaller than the minimum value of the required computing resource amount of the elastic fixed job, the resource manager 200 takes the estimated resource amount as the job fixed resource amount of the elastic fixed job, namely the allocated computing resource amount of the elastic fixed job.
Wherein estimated resource amount = number of available resources at the future average waiting time − a × (amount of resources awaited by tasks with higher priority than the job) − b × (amount of resources awaited by tasks with lower priority than the job).
The above a and b are variable parameters, and can be adjusted according to actual conditions.
For example: suppose the number of available resources at the future average waiting time is 128 cards. By accessing an interface of the intelligent scheduler, the resource manager 200 queries the waiting queue and finds that tasks with higher priority than the current job are waiting for 64 cards and tasks with lower priority are waiting for 32 cards. The above formula (assuming a = 1.2 and b = 0.8) gives an estimated resource amount of 128 − 1.2 × 64 − 0.8 × 32 = 25.6, which is rounded up to 26. If 26 were smaller than the minimum required resource amount min of the elastic fixed job, min would be used instead; likewise, if it were larger than the maximum required resource amount max, max would be used.
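The calculation above can be written as a short Python sketch; the function name and the min/max values in the usage line are illustrative assumptions, while a = 1.2 and b = 0.8 are only the example values assumed in the passage.

```python
import math

def estimate_fixed_resource_amount(available_at_avg_wait, higher_priority_waiting,
                                   lower_priority_waiting, job_min, job_max,
                                   a=1.2, b=0.8):
    """Estimated resource amount = available resources at the future average
    waiting time - a * higher-priority waiting resources - b * lower-priority
    waiting resources, rounded up and clamped to the job's [min, max] range."""
    estimated = available_at_avg_wait - a * higher_priority_waiting - b * lower_priority_waiting
    estimated = math.ceil(estimated)
    return max(job_min, min(job_max, estimated))

# Numbers from the example: 128 - 1.2 * 64 - 0.8 * 32 = 25.6 -> 26 cards.
# job_min and job_max are hypothetical here; the example does not specify them.
print(estimate_fixed_resource_amount(128, 64, 32, job_min=8, job_max=64))  # prints 26
```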
S820, allocating nodes of the cluster for the elastic fixed job as the job nodes of the elastic fixed job according to the real-time idle resource amount of the cluster and the allocated computing resource amount of the elastic fixed job, so as to invoke the job nodes to execute the elastic fixed job.
It may be appreciated that allocating cluster nodes for the elastic fixed job may follow the manner of allocating nodes for the elastic training job, which is not repeated here.
In some embodiments of the present invention, in a case where a job node of a preemptible job to be processed is preempted by another job to be processed, the job node is controlled to suspend the job to be processed that is being executed, and after the execution of the other job to be processed is completed, the preempted job node is controlled to continue executing the job to be processed. With this scheme, persistent execution of the preempted job to be processed is ensured, the job does not need to restart from the beginning after being preempted, and cluster resources are further saved.
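As a rough sketch of this suspend-and-resume behaviour (all names and the step-counting checkpoint are illustrative assumptions, not the claimed mechanism), a preempted job can keep its progress and continue from where it stopped:

```python
import enum

class JobState(enum.Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"
    FINISHED = "finished"

class PreemptibleJob:
    """Keeps enough progress information that a preempted job resumes from
    where it stopped instead of restarting from the beginning."""

    def __init__(self, name, total_steps):
        self.name = name
        self.total_steps = total_steps
        self.completed_steps = 0            # acts as the resume point
        self.state = JobState.RUNNING

    def run(self, steps):
        if self.state is not JobState.RUNNING:
            return
        self.completed_steps = min(self.total_steps, self.completed_steps + steps)
        if self.completed_steps == self.total_steps:
            self.state = JobState.FINISHED

    def suspend(self):
        # Called when the job node is preempted by another job to be processed.
        if self.state is JobState.RUNNING:
            self.state = JobState.SUSPENDED

    def resume(self):
        # Called after the preempting job finishes; execution continues from
        # completed_steps, so no work done before preemption is lost.
        if self.state is JobState.SUSPENDED:
            self.state = JobState.RUNNING
```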
According to the cluster resource scheduling method described above, an amount of computing resources is allocated to the elastic training job based on the predicted idle resource amount of the cluster in the first preset time period, that is, the allocated computing resource amount of the elastic training job is obtained, and corresponding job nodes are allocated to the elastic training job according to the real-time idle resource amount of the cluster and the allocated computing resource amount, so that the job nodes are invoked to execute the elastic training job. During execution of the elastic training job, the allocated computing resource amount of the elastic training job is adjusted according to the predicted idle resource amount of the cluster in the next first preset time period, job nodes are reallocated for the elastic training job, and the reallocated job nodes are invoked to continue executing the elastic training job in the next second preset time period. The amount of computing resources of the elastic training job is thus adjusted based on the dynamic change of the cluster's resources, fragmented idle resources are fully utilized, the utilization rate of cluster resources is improved, serious cluster queuing is alleviated, and training of the deep learning network model is accelerated.
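For readers who prefer pseudocode, the per-period adjustment summarized above can be sketched as follows; the shrink-to-zero behaviour and all parameter names are assumptions made for illustration, since this passage does not fix the exact amount by which the elastic part is reduced.

```python
def adjust_elastic_allocation(predicted_free, waiting_job_resources,
                              job_min, job_max, job_elastic):
    """Sketch of one adjustment round: expand the job elastic resource amount
    when the cluster is predicted to have free resources, shrink it when other
    jobs are waiting. Quantities are in resource units (e.g. accelerator cards)."""
    headroom = job_max - job_min - job_elastic   # remaining room to grow
    if predicted_free > 0 and headroom > 0:
        # Expand by the predicted free amount, but never beyond the headroom.
        job_elastic += headroom if predicted_free > headroom else predicted_free
    elif waiting_job_resources > 0 and job_elastic > 0:
        # Release elastic resources so waiting jobs can be scheduled (assumed
        # here to release all of them; the exact shrink amount is not fixed above).
        job_elastic = 0
    return job_min + job_elastic   # new allocated computing resource amount

# Example: a job with min 8 and max 32 cards, currently holding 4 elastic cards,
# grows to 8 + 16 = 24 cards when 12 cards are predicted to be free.
print(adjust_elastic_allocation(predicted_free=12, waiting_job_resources=0,
                                job_min=8, job_max=32, job_elastic=4))
```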
As shown in fig. 1, the cluster resource scheduling system provided by the present invention includes a job module 100, a resource manager 200, a task lifecycle manager 300, and a scheduler 400.
The job module 100 is used for acquiring a job to be processed for training of the deep learning model.
The resource manager 200 is configured to determine, when the job to be processed is an elastic training job, an allocated computing resource amount of the elastic training job according to a predicted idle resource amount of the cluster in a first preset time period and a required computing resource amount of the elastic training job.
The scheduler 400 is configured to allocate nodes in the cluster for the elastic training job according to the real-time idle resource amount of the cluster and the allocated computing resource amount of the elastic training job, as job nodes of the elastic training job, so as to invoke the job nodes to execute the elastic training job in a second preset time period.
The task lifecycle manager 300 is configured to adjust an allocated computing resource amount of the elastic training job based on a predicted free resource amount of the cluster in a next first preset time period.
The scheduler 400 is further configured to reallocate job nodes for the elastic training job according to the real-time idle resource amount of the cluster and the adjusted allocated computing resource amount of the elastic training job, so as to invoke the reallocated job nodes to continue executing the elastic training job in the next second preset time period, and to enable the task lifecycle manager 300 to continue adjusting the allocated computing resource amount of the elastic training job based on the predicted idle resource amount of the cluster in the next first preset time period until the execution of the elastic training job is completed.
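The division of labour among the four components can be pictured with the toy skeleton below; the classes, method signatures, and numeric values are hypothetical placeholders chosen only to make the sketch executable, not the actual interfaces of the system.

```python
class JobModule:
    def fetch_pending_job(self):
        # A job to be processed, described by its required resource range (cards).
        return {"name": "demo-elastic-job", "min": 8, "max": 32}

class ResourceManager:
    def initial_allocation(self, job, predicted_free):
        # Decide the allocated computing resource amount within [min, max],
        # bounded by the predicted free resources of the cluster.
        return max(job["min"], min(job["max"], predicted_free))

class Scheduler:
    def allocate_nodes(self, job, allocation, real_time_free):
        # Grant job nodes only up to what is really free right now.
        return min(allocation, real_time_free)

class TaskLifecycleManager:
    def adjust_allocation(self, job, allocation, predicted_free):
        # Grow toward the job's maximum when resources are predicted to be free.
        return min(job["max"], allocation + max(0, predicted_free))

if __name__ == "__main__":
    job = JobModule().fetch_pending_job()
    allocation = ResourceManager().initial_allocation(job, predicted_free=16)
    nodes = Scheduler().allocate_nodes(job, allocation, real_time_free=12)
    allocation = TaskLifecycleManager().adjust_allocation(job, allocation, predicted_free=8)
    print(nodes, allocation)   # 12 nodes granted now, 24 cards planned next period
```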
Based on the cluster resource scheduling method, the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to realize the steps of the cluster resource scheduling method described in the embodiment.
Based on the above cluster resource scheduling method, the invention also provides a terminal, as shown in fig. 9, which comprises at least one processor 80, a display screen 81, and a memory 82, and may also include a communication interface 83 and a bus 84. The processor 80, the display screen 81, the memory 82, and the communication interface 83 may communicate with one another via the bus 84. The display screen 81 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 83 may transmit information. The processor 80 may invoke logic instructions in the memory 82 to perform the cluster resource scheduling method of the above-described embodiments.
Further, the logic instructions in the memory 82 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product.
The memory 82, as a computer readable storage medium, may be configured to store a software program, a computer executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 80 executes functional applications and data processing, i.e. implements the methods of the embodiments described above, by running software programs, instructions or modules stored in the memory 82.
The memory 82 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and at least one application program required for a function, and the storage data area may store data created according to the use of the terminal, and the like. In addition, the memory 82 may include high-speed random access memory and may also include nonvolatile memory. For example, a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or another medium capable of storing program code may be used, or a transitory storage medium may be used.
All embodiments in the application are described in a progressive manner, and identical and similar parts of all embodiments are mutually referred, so that each embodiment mainly describes differences from other embodiments. In particular, for system, terminal and storage medium embodiments, the description is relatively simple, as it is substantially similar to method embodiments, with reference to the partial description of method embodiments being relevant.
The system, the terminal and the storage medium provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the system, the terminal and the storage medium also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the system, the terminal and the storage medium are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Of course, those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program instructing relevant hardware (e.g., a processor, a controller, etc.); the program may be stored on a computer readable storage medium and, when executed, may include the flows of the above-described methods. The computer readable storage medium may be a memory, a magnetic disk, an optical disk, etc.
It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.

Claims (13)

1. A method for scheduling cluster resources, the method comprising:
acquiring a job to be processed for deep learning model training;
under the condition that the job to be processed is an elastic training job, determining the allocated computing resource amount of the elastic training job according to the predicted idle resource amount of the cluster in a first preset time period and the required computing resource amount of the elastic training job;
according to the real-time idle resource amount of the cluster and the allocated computing resource amount of the elastic training job, allocating nodes in the cluster for the elastic training job as job nodes of the elastic training job, so as to invoke the job nodes to execute the elastic training job in a second preset time period;
adjusting the allocated computing resource amount of the elastic training job based on the predicted free resource amount of the cluster in a next first preset time period;
And reallocating job nodes for the elastic training job according to the real-time idle resource amount of the cluster and the adjusted allocated computing resource amount of the elastic training job, so as to invoke the reallocated job nodes to continue executing the elastic training job in the next second preset time period, and continuing to execute the step of adjusting the allocated computing resource amount of the elastic training job based on the predicted idle resource amount of the cluster in the next first preset time period until the elastic training job is completed.
2. The method for scheduling cluster resources according to claim 1, wherein the determining the allocated computing resource amount of the elastic training job according to the predicted idle resource amount of the cluster in the first preset time period and the required computing resource amount of the elastic training job specifically includes:
when the predicted free resource amount of the cluster in the first preset time period is larger than the maximum value of the required computing resource amount of the elastic training job, taking the minimum value of the required computing resource amount of the elastic training job as a job fixed resource amount of the elastic training job; and taking the first difference value as the job elastic resource amount of the elastic training job;
Wherein the first difference is a difference between the maximum value and the minimum value of the required calculation resource amount of the elastic training operation;
in a case where the real-time free resource amount of the cluster is smaller than the maximum value of the required computing resource amount of the elastic training job and is greater than or equal to the minimum value of the required computing resource amount, taking the minimum value of the required computing resource amount of the elastic training job as a job fixed resource amount of the elastic training job; and taking the second difference value as the job elastic resource amount of the elastic training job;
wherein the second difference is a difference between the real-time free resource amount of the cluster and the minimum value of the required computing resource amount of the elastic training job;
taking the minimum value of the required computing resource amount of the elastic training job as a job fixed resource amount of the elastic training job and taking 0 as a job elastic resource amount of the elastic training job when the real-time free resource amount of the cluster is smaller than the minimum value of the required computing resource amount of the elastic training job;
and determining the allocation computing resource quantity of the elastic training job according to the job fixed resource quantity and the job elastic resource quantity of the elastic training job.
3. The method of cluster resource scheduling according to claim 1, wherein before determining the allocated computing resource amount of the elastic training job based on the predicted free resource amount of the cluster within the first preset time period and the required computing resource amount of the elastic training job, the method further comprises:
predicting the predicted available resource amount of the cluster in the first preset time period through a preset cluster resource prediction model; and
acquiring the released available resource amount released by the currently running job to be processed of the cluster in the first preset time period;
and determining the predicted idle resource amount of the cluster in the first preset time period according to the predicted available resource amount and the released available resource amount.
4. The method for scheduling cluster resources according to claim 1, wherein said adjusting the allocated computing resource amount of the elastic training job based on the predicted free resource amount of the cluster in the next first preset time period specifically comprises:
expanding the job elastic resource amount of the elastic training job under the condition that the predicted idle resource amount of the cluster in the next first preset time period is larger than 0 and the job elastic resource amount of the elastic training job is smaller than a first comparison value;
wherein the first comparison value is the difference between the maximum value and the minimum value of the required computing resource amount of the elastic training job;
reducing the job elastic resource amount of the elastic training job under the condition that the waiting job resource amount of the cluster in the next first preset time period is greater than 0 and the job elastic resource amount of the elastic training job is greater than 0;
and adjusting the allocated computing resource amount of the elastic training job according to the job elastic resource amount after the elastic training job is expanded or reduced.
5. The method for scheduling cluster resources according to claim 4, wherein expanding the job elastic resource amount of the elastic training job in a case where the predicted free resource amount in the next first preset time period of the cluster is greater than 0 and the job elastic resource amount of the elastic training job is smaller than a first comparison value, specifically includes:
when the predicted free resource amount of the cluster in the next first preset time period is larger than a second comparison value, the second comparison value is used as a capacity expansion resource amount;
the second comparison value is the value obtained by subtracting the minimum value of the required computing resource amount of the elastic training job and the current job elastic resource amount from the maximum value of the required computing resource amount of the elastic training job;
Taking the predicted idle resource amount of the cluster in the next first preset time period as a capacity expansion resource amount under the condition that the predicted idle resource amount of the cluster in the next first preset time period is smaller than or equal to a second comparison value;
and expanding the job elastic resource amount of the elastic training job according to the capacity expansion resource amount.
6. The method for scheduling cluster resources according to claim 1, wherein the allocating the node of the cluster for the elastic training job according to the real-time idle resource amount of the cluster and the allocated computing resource amount of the elastic training job, as the job node of the elastic training job, specifically includes:
determining whether an amount of real-time free resources of the cluster meets the amount of allocated computing resources of the elastic training job;
under the condition that the real-time idle resource quantity of the cluster meets the allocation calculation resource quantity of the elastic training job, allocating the node of the cluster for the elastic training job as the job node of the elastic training job;
when the real-time idle resource quantity of the cluster does not meet the allocation calculation resource quantity of the elastic training job, adjusting the elastic training job from a scheduling queue of a preset queue to a waiting scheduling queue of the preset queue;
And after a preset waiting scheduling time elapses, readjusting the elastic training job from the waiting scheduling queue to the scheduling queue, and, in a case where the real-time idle resource amount of the cluster meets the allocated computing resource amount of the elastic training job, allocating a node of the cluster to the elastic training job as the job node of the elastic training job, until the elastic training job is executed.
7. The cluster resource scheduling method of claim 1, wherein after invoking the job node to execute the elastic training job within a second preset time period, the method further comprises:
and adjusting the elastic training job from a scheduling queue of a preset queue to a secondary scheduling queue of the preset queue, so that after adjusting the allocation calculation resource amount of the elastic training job based on the predicted idle resource amount of the cluster in the next first preset time period, the elastic training job is scheduled from the secondary scheduling queue, and the elastic training job is continuously executed in the next second preset time period.
8. The cluster resource scheduling method of claim 1, wherein after acquiring the job to be processed for deep learning model training, the method further comprises:
Under the condition that the job to be processed is an elastic fixed job, determining the allocated computing resource amount of the elastic fixed job according to the predicted idle resource amount of the cluster in the first preset time period and the required computing resource amount of the elastic fixed job;
and allocating the node of the cluster for the elastic fixed job as the job node of the elastic fixed job according to the real-time idle resource amount of the cluster and the allocated computing resource amount of the elastic fixed job, so as to invoke the job node to execute the elastic fixed job.
9. The cluster resource scheduling method of claim 1, wherein after acquiring the job to be processed for deep learning model training, the method further comprises:
and under the condition that the job to be processed is an inelastic job, allocating nodes in the cluster for the inelastic job as the job nodes of the inelastic job according to the real-time idle resource amount of the cluster and the required computing resource amount of the inelastic job, so as to invoke the job nodes to execute the inelastic job.
10. The cluster resource scheduling method of claim 1, further comprising:
And under the condition that a job node of a preemptible job to be processed is preempted by another job to be processed, controlling the job node to suspend the job to be processed that is being executed, and after the execution of the other job to be processed is completed, controlling the preempted job node to continue executing the job to be processed.
11. A clustered resource scheduling system, the system comprising: the system comprises a job module, a resource manager, a task life cycle manager and a scheduler;
the job module is used for acquiring a job to be processed for deep learning model training;
the resource manager is used for determining the allocated computing resource amount of the elastic training job according to the predicted idle resource amount of the cluster in a first preset time period and the required computing resource amount of the elastic training job under the condition that the job to be processed is an elastic training job;
the scheduler is used for allocating nodes in the cluster for the elastic training job as job nodes of the elastic training job according to the real-time idle resource amount of the cluster and the allocated computing resource amount of the elastic training job, so as to invoke the job nodes to execute the elastic training job in a second preset time period;
The task life cycle manager is used for adjusting the distributed computing resource amount of the elastic training job based on the predicted idle resource amount of the cluster in the next first preset time period;
the scheduler is further configured to reallocate job nodes for the elastic training job according to the real-time idle resource amount of the cluster and the adjusted allocated computing resource amount of the elastic training job, so as to invoke the reallocated job nodes to continue executing the elastic training job in the next second preset time period, and to continue executing the step of adjusting the allocated computing resource amount of the elastic training job based on the predicted idle resource amount of the cluster in the next first preset time period until the execution of the elastic training job is completed.
12. A computer readable storage medium storing one or more programs executable by one or more processors to implement the steps in the cluster resource scheduling method of any one of claims 1-10.
13. A terminal, comprising: a processor and a memory; the memory has stored thereon a computer readable program executable by the processor; the processor, when executing the computer readable program, implements the steps of the cluster resource scheduling method of any one of claims 1-10.
