CN114996001A - Distributed machine learning task GPU resource scheduling and distributing method and system - Google Patents

Info

Publication number
CN114996001A
CN114996001A (application CN202210562623.2A)
Authority
CN
China
Prior art keywords
training
job
server
node
deployment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210562623.2A
Other languages
Chinese (zh)
Inventor
袁天宜
蒋从锋
欧东阳
闫龙川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210562623.2A priority Critical patent/CN114996001A/en
Publication of CN114996001A publication Critical patent/CN114996001A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5038 Allocation of resources considering the execution order of a plurality of tasks, e.g. taking priority or time-dependency constraints into consideration
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a GPU resource scheduling and allocation method and system for distributed machine learning training tasks. The method comprises the following steps: S1, allocating a parameter server and a working node to each training task; S2, fitting a training speed model by collecting the training speeds of a job under different combinations of parameter server and working node counts; S3, collecting the loss value of each training iteration of a job, fitting the job's loss-versus-iteration curve, and calculating the job's remaining completion time; S4, the node allocation module allocating an appropriate number of working nodes and parameter servers to each job according to the training speed model; S5, the job placement module selecting, under a shortest-job-first policy, the placement strategy with the lowest cost based on distributed communication delay and queuing time. The system comprises a node allocation module and a job placement module. The invention reduces the average completion time of training jobs in the cluster.

Description

Distributed machine learning task GPU resource scheduling distribution method and system
Technical Field
The invention relates to a method for GPU resource scheduling and allocation, and in particular to a resource scheduling and allocation method and system for machine learning training tasks that adopt the parameter server architecture in GPU data centers deploying virtualization technology at large scale.
Background
As artificial intelligence technology matures, more and more machine learning applications are emerging, and Internet companies usually choose to deploy machine learning training tasks on cloud computing servers. Machine learning training is floating-point compute intensive and therefore relies heavily on GPU resources, which are computationally powerful but costly.
At present, the GPU of a single node generally cannot cope with massive training data, so machine learning training tasks usually adopt a distributed parameter server architecture. Under this architecture, the training data of the model is partitioned into smaller shards and distributed to multiple working nodes for training, and in each training cycle every working node communicates with the parameter server nodes to exchange gradients and parameters. The parameter server architecture currently supports two main training modes: 1) asynchronous training, in which the training progress of different working nodes in a job is not synchronized, and the parameter server updates its model partition every time it receives a gradient from a working node; 2) synchronous training, in which all working nodes in a job advance in lockstep at every step, and the parameter server updates the parameters only after collecting gradients from all working nodes.
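To make the two modes concrete, a toy sketch of a parameter server partition follows; the class and method names are illustrative, not from the patent:

```python
class ParameterServerPartition:
    """Toy model partition illustrating the two update modes described above;
    `apply` stands in for the real optimizer step."""

    def __init__(self, params, num_workers):
        self.params = params            # this server's 1/p slice of the model
        self.num_workers = num_workers  # w workers push gradients to it
        self.pending = []

    def push_async(self, grad):
        # Asynchronous mode: update the partition on every received gradient.
        self.apply(grad)

    def push_sync(self, grad):
        # Synchronous mode: buffer gradients and update only once all w
        # workers of the current step have reported.
        self.pending.append(grad)
        if len(self.pending) == self.num_workers:
            avg = [sum(g) / self.num_workers for g in zip(*self.pending)]
            self.apply(avg)
            self.pending.clear()

    def apply(self, grad, lr=0.01):
        # Plain SGD update on this partition's parameters.
        self.params = [p - lr * g for p, g in zip(self.params, grad)]
```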
A distributed machine learning framework therefore typically needs to schedule a task on multiple GPUs simultaneously, i.e., gang scheduling, which increases the risk of resource fragmentation and low utilization in a shared cluster. In addition, multi-GPU training implies synchronizing model parameters across nodes, so inter-node network communication latency can strongly affect the completion time of training jobs; node placement must therefore achieve good locality so that inter-machine communication is fast.
Cloud computing resource providers typically use traditional cluster schedulers such as Kubernetes or YARN, which simply treat a machine learning workload as a big-data processing job: they assign the job to GPU-equipped machines at startup and guarantee exclusive access to those machines while the job runs. However, machine learning training tasks differ substantially from previous workloads, and current data centers generally lack an understanding of their characteristics, so resources are still provisioned for machine learning workloads in a coarse-grained manner.
Coarse-grained resource allocation wastes data center resources, which increases the energy consumption and cost of the data center; compared with ordinary computing resources, wasted machine learning compute is far more expensive. Meanwhile, machine learning workloads usually adopt a distributed structure; if the scheduler lacks locality awareness for distributed nodes, job waiting time and communication time increase, cluster efficiency drops, training job completion slows, and jobs may even fail. A reasonable machine learning task scheduling model therefore needs to be designed around the characteristics of machine learning training tasks.
At present, GPU schedulers in data centers rely on manual configuration supplied when users submit jobs; improper configuration greatly degrades training performance and resource efficiency. For example, allocating too many GPUs may make other jobs queue for GPUs too long and leave each GPU underutilized, while allocating too few GPUs leads to excessive run times and inefficient training. In a large-scale GPU cluster, however, it is difficult to make optimal scheduling decisions manually, because the right choice depends on the dynamic cluster load at job run time.
Disclosure of Invention
To address these problems, the invention manages the resource quota and placement position of every node of a job within the resource allocation specified by the user, shortening the training completion time of the whole machine learning model. The specific technical scheme is as follows:
one aspect of the present invention provides a GPU resource scheduling allocation method for a distributed machine learning training task, comprising the steps of:
s1, the scheduling system allocates a parameter server and a work node for each job to be allocated in the training job queue, and places the jobs in a server cluster to prevent starvation.
And S2, the scheduling system tries to run in the cluster with different number of combinations of parameter servers and working nodes for each training job, and a training speed model S (p, w) of the training job is fitted by collecting the training speeds of the job under different number of combinations of parameter servers and working nodes.
And S3, the dispatching system monitoring system acquires the loss value of each training iteration of the training operation, and is used for fitting the loss value iteration curve of the operation and calculating the residual iteration times of the training model for achieving convergence.
And S4, the node distribution module distributes proper working node quantity and parameter server quantity for each job according to the training speed model S (p, w) of the training job by using a greedy strategy, and adds the distributed job to the queue to be placed.
And S5, the job placement module performs placement scheduling on the training jobs in the queue to be placed according to the sequence of the residual completion time T from small to large, calculates the cost of different placement strategies based on the communication delay and the queuing time of the distributed training jobs, and selects the placement strategy with the lowest cost to deploy the jobs to the GPU cluster.
S6, cycling the step S1 to the step S5 at regular intervals of scheduling.
In step S1, when there is a training job to be scheduled in the training job queue, the scheduling system checks available resources of the nodes according to the resource requirements attached to the training job submitted by the user, and if there is an available node to which the resource required by the user can be allocated, allocates a parameter server and a work node to the training job to avoid starvation of the job due to waiting for too long, and the job to which a parameter server and a work node are temporarily allocated is marked as a job to be adjusted.
In step S2, the training speed model of a machine learning training task is described as follows. For a training job, let p be the number of allocated parameter servers and w the number of working nodes, with the training data set divided evenly across the w working nodes. Each parameter server holds 1/p of the model parameters, and its bandwidth B is shared by the w working nodes that push gradients to it and pull parameters back, so the time T required for one training step on a working node is related to the node allocation as:

T = t_f + t_b + θ_w · 2Mw/(pB) + θ_p · t_c
wherein the specific meanings of the symbols in the above formula are listed in the following table:

TABLE 1 Job training time symbols

T: time required for one training step on a working node
t_f: time of the forward propagation phase on a working node
t_b: time of the backward propagation phase on a working node
M: model size, i.e. the total number of parameter bytes
B: bandwidth of a parameter server
w: number of working nodes of the job
p: number of parameter servers of the job
t_c: time required to update the parameters on a parameter server
θ_w: communication delay coefficient of a working node
θ_p: communication delay coefficient of a parameter server
Therefore, with p parameter servers and w working nodes, the number of training steps per unit time is w/T, and the training speed of the job is modeled as:

S(p, w) = w/T = w / (t_f + t_b + θ_w · 2Mw/(pB) + θ_p · t_c)
further, the step S2 includes the following steps:
and S21, intercepting data within 500MB in a training data set submitted by a user by the scheduling system for the job to be adjusted, which is distributed with 1 parameter server and 1 working node in the job queue.
And S22, the scheduling system arranges training jobs in the cluster to train the intercepted training data set according to different parameter server numbers p and different working node numbers w. And the scheduling system monitors the training process, finishes training when the training iteration times reach 10 times, and records the average training time distributed by the current node.
And S23, collecting the average training time of the training operation under different node configurations, and fitting the training speed model S (p, w) of the machine learning operation by adopting a minimum dichotomy.
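As an illustration, a minimal sketch of the S23 fit in Python follows; the sample timings are placeholders, and lumping the model constants into two fitted coefficients (a for t_f + t_b + θ_p·t_c, b for θ_w·2M/B, both constant for a fixed job) is an assumption of the sketch:

```python
import numpy as np
from scipy.optimize import curve_fit

# Trial measurements from step S22: (p, w) -> average time per training step
# in seconds. These numbers are placeholders, not values from the patent.
samples = {(1, 1): 2.10, (1, 3): 2.60, (2, 2): 2.15, (2, 4): 2.45, (3, 5): 2.50}

def step_time(pw, a, b):
    # T(p, w) = a + b * (w / p), with the constants lumped as described above.
    p, w = pw
    return a + b * (w / p)

pw = np.array(list(samples), dtype=float).T        # shape (2, n): rows p, w
times = np.array(list(samples.values()))
(a, b), _ = curve_fit(step_time, pw, times, bounds=(0, np.inf))  # non-negative fit

def speed(p, w):
    """Fitted training speed S(p, w) = w / T(p, w), in steps per unit time."""
    return w / step_time((p, w), a, b)
```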
In step S3, the scheduling system integrates with the distributed machine learning training framework to collect the loss value and iteration count of each training iteration of a job to be adjusted, and fits a loss curve from these values. Because most training tasks use stochastic gradient descent as the gradient update algorithm, the loss-versus-iteration curve is modeled as:

l = 1/(α·k + β) + γ

where l is the training loss value, k is the number of iteration steps the training has completed, and α, β and γ are non-negative model coefficients.
Further, the step S3 includes the following steps:
s31, the scheduling system monitors each training operation and collects the number of iterations and loss value after one training iteration of the operation is finished. The data is then preprocessed and points that exceed the range of loss values are replaced with the average of the loss of the last few steps.
And S32, using a non-negative least square solver, and according to the iteration times and the loss value points (k, l) collected by the scheduling system, wherein k is the currently finished iteration times of the operation, and l is the loss value obtained by the current training iteration. And calculating the loss curve model coefficients alpha, beta and gamma for fitting the current operation. And calculating the number of steps required by model convergence according to a predefined convergence threshold value when the user submits. Let the user-specified convergence threshold be l th Then the condition satisfied by the loss function to achieve convergence is
Figure BDA0003656813920000061
Based on this equation, the total number of iterations required for training convergence is calculated as
Figure BDA0003656813920000062
S33, combining the operation performance model S (p, w) of the step S2 to obtain the one-time training iteration time of each working node
Figure BDA0003656813920000063
The number of iterations of the job that has been completed at present is k, and the time T (k) required for the job training to be completed can be calculated total -k)/S(p,w)。
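A sketch of steps S32-S33 in the same vein; SciPy's bounded least-squares fit stands in for the non-negative least squares solver, and `s_pw` denotes the job's fitted speed S(p, w) under its current allocation:

```python
import math
import numpy as np
from scipy.optimize import curve_fit

def loss_model(k, alpha, beta, gamma):
    # l(k) = 1 / (alpha*k + beta) + gamma, non-negative coefficients.
    return 1.0 / (alpha * k + beta) + gamma

def remaining_time(steps, losses, l_th, k_done, s_pw):
    """Fit the loss curve, derive the total iterations k_total needed to reach
    the threshold l_th, and return (k_total - k_done) / S(p, w)."""
    (alpha, beta, gamma), _ = curve_fit(
        loss_model, np.asarray(steps, float), np.asarray(losses, float),
        bounds=(0.0, np.inf))           # enforce non-negativity of the fit
    if l_th <= gamma:
        raise ValueError("threshold below the fitted loss floor; never converges")
    k_total = math.ceil((1.0 / (l_th - gamma) - beta) / alpha)
    return (k_total - k_done) / s_pw
```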
In step S4, the scheduling system uses a greedy strategy to find a good parameter server count p and working node count w for each training job to be adjusted. Let J be the set of training jobs to be adjusted in the current job queue, and let T_j be the remaining training completion time of job j computed in step S3. The optimization goal of the algorithm is:

min Σ_{j∈J} T_j

where

T_j = R_j / S_j(p_j, w_j)

with R_j = k_total,j − k_j the remaining number of training iterations of job j.
Meanwhile, node allocation must also respect resource availability; the constraints are, for every resource type r:

Σ_{j∈J} (p_j · r_j^ps + w_j · r_j^w) ≤ C_r,  p_j ∈ Z+,  w_j ∈ Z+

where r_j^ps is the amount of resource r occupied by one parameter server of job j and r_j^w is the amount of resource r occupied by one working node of job j. The resource capacity constraint requires that the total occupation of each resource r does not exceed its total capacity C_r.
This is a nonlinear integer programming problem, so a linear programming solver cannot be applied, and blindly searching for the optimal solution would delay scheduling decisions because the search space is too large. The method therefore adopts a greedy strategy to shrink the search space, trading some scheduling quality for better scheduling efficiency.
With w working nodes and p parameter servers allocated to job j, the remaining training completion time of job j is T = R_j / S_j(p, w). Allocating 1 additional parameter server node or 1 additional GPU compute node shortens the job's training time by ΔT, where ΔT_p is the time saved by 1 additional parameter server node and ΔT_w is the time saved by 1 additional working node:

ΔT_p = R_j/S_j(p, w) − R_j/S_j(p + 1, w)
ΔT_w = R_j/S_j(p, w) − R_j/S_j(p, w + 1)

Dividing the training time saved by the amount of resources the new node occupies gives the benefit Ben of the allocation, where Ben_p is the resource allocation benefit of 1 additional parameter server node and Ben_w is that of 1 additional working node:

Ben_p = ΔT_p / r_j^ps
Ben_w = ΔT_w / r_j^w

In each round, the greedy strategy traverses all jobs, computes for each job the larger of the two benefits, and takes it as the resource allocation benefit Ben_j of training job j:

Ben_j = max(Ben_p, Ben_w)
further, the step S4 includes the following steps:
s41, the scheduling system calculates the income Ben distributed by the parameter calculation server for each training job to be adjusted p And the work node allocates the profit Ben w The larger one is selected as the resource allocation profit Ben ═ max (Ben) p ,Ben w )。
S42, after the calculation is finished, the scheduling system compares the resource allocation resource income Ben of all the jobs and selects the resource allocation incomeMaximum job if the parameter server allocation gain of the job is greater than the job node allocation gain, i.e. Ben p >Ben w Then an additional parameter server is assigned to the job, otherwise an additional work node is assigned.
S43, repeating the steps S41 to S42 until the cluster available resources are insufficient to continue deploying new nodes.
And S44, the scheduling system records the number of the working nodes and the number of the parameter servers distributed by each job to be scheduled, marks the job to be deployed as the job to be deployed, and deploys the job in the next step.
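A compact sketch of the S41-S43 loop follows; reducing the per-resource capacities C_r to a single scalar `capacity`, and the default unit resource demands, are simplifying assumptions of the sketch:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Job:
    remaining_iters: float              # R_j from the loss-curve fit
    speed: Callable[[int, int], float]  # fitted training speed model S_j(p, w)
    p: int = 1                          # parameter servers (initial S1 allocation)
    w: int = 1                          # working nodes (initial S1 allocation)
    r_ps: float = 1.0                   # resources occupied by one PS node
    r_w: float = 1.0                    # resources occupied by one worker node

def greedy_allocate(jobs, capacity):
    """Repeatedly grant the single extra node (PS or worker) with the largest
    benefit Ben = saved time / occupied resources, until nothing fits."""
    while True:
        best = None
        for j in jobs:
            t_now = j.remaining_iters / j.speed(j.p, j.w)
            ben_p = (t_now - j.remaining_iters / j.speed(j.p + 1, j.w)) / j.r_ps
            ben_w = (t_now - j.remaining_iters / j.speed(j.p, j.w + 1)) / j.r_w
            kind, ben = ("ps", ben_p) if ben_p > ben_w else ("w", ben_w)
            cost = j.r_ps if kind == "ps" else j.r_w
            if cost <= capacity and (best is None or ben > best[0]):
                best = (ben, j, kind, cost)
        if best is None:
            return                       # S43: resources exhausted
        _, j, kind, cost = best
        if kind == "ps":
            j.p += 1
        else:
            j.w += 1
        capacity -= cost
```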
Further, the step S5 includes the following steps:
and S51, calculating the residual completion time T of each training job according to the principle of short job priority for each job to be deployed in the job queue, sequencing the jobs to be deployed from small to large according to the T, and storing the jobs to be deployed in one job queue to be deployed.
And S52, the scheduling system sequences all the servers in the server cluster from large to small according to the available amount of the residual resources and maintains a server available resource priority queue.
And S53, the scheduling system selects the first job j of the job queue to be deployed for placing and scheduling, traverses the available resource priority queue of the server, selects the first server S of the queue, and joins the deployment server set S of the job j.
S54, inquiring whether the available resources of the deployment server set can meet the resource requirement of the job j, and if so, calculating the communication delay of the deployment server set S
Figure BDA0003656813920000081
Adding deployment scenario cost key-value pairs
Figure BDA0003656813920000082
And jumping to the step S56 in the deployment plan list DeploymentPlans, otherwise, entering the step S55.
S55, continuously traversing the job queue to be deployed, and selecting the server S which is not in the deployment server set S of the job j Join deployment server set S and jump back to step S54.
S56 traversing other already running machine learning jobs j on servers in the deployment server set S other Sorting according to the residual completion time from small to large, and storing the data in other job sets J other In (1).
S57 traversing the set J other Task j of other If j is other The occupied resource of the server S is r, other servers in the set S are compared, if S ' is less than or equal to r and S ' belongs to S and S ' is not equal to S, S ' in the set S is deleted and recorded as S ', and communication delay is calculated
Figure BDA0003656813920000091
And operation j other Is left over for the completion time T j And deploy the scheme cost key value pair
Figure BDA0003656813920000092
Added to the deployment scenario list deploymentpans.
S58, traversing all key value pairs in the deployment plan list DeploymentPlans, selecting the minimum cost as a deployment plan, deploying the job j to a deployment server set corresponding to the minimum cost, deleting the job j from a job queue to be deployed, deploying in the next scheduling interval if no available resource exists in the cluster at the moment, and otherwise, jumping to the step S53.
Further, in step S54 the communication delay Delay_S of job j deployed on the server set S is calculated as follows: the scheduling system trial-transmits the parameter data produced by one training step of job j between the nodes of the different server pairs in the deployment server set S; each distinct server pair is trial-transmitted 10 times and the average transmission time t̄(s1, s2) is recorded. The communication delay of job j deployed on the server set S is then

Delay_S = k · max{ t̄(s1, s2) : s1, s2 ∈ S, s1 ≠ s2 }

where k is the number of convergence iterations remaining for job j.
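A sketch of this delay estimate; `probe` is an assumed measurement hook (one trial transfer of the job's per-step parameter traffic between two servers), not a function defined by the patent:

```python
import itertools
import statistics

def placement_delay(servers, probe, k_remaining, trials=10):
    """Delay_S = k * max, over distinct server pairs (s1, s2) in S, of the
    average time to transfer one step's parameter traffic between the pair."""
    worst = 0.0
    for s1, s2 in itertools.combinations(servers, 2):
        avg = statistics.mean(probe(s1, s2) for _ in range(trials))
        worst = max(worst, avg)
    return k_remaining * worst   # a single-server set has no pairs, so cost 0
```

Note that a set containing a single server yields zero delay, which matches the preference for locality: placement cost is driven entirely by the slowest inter-server link, scaled by how many iterations remain.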
Another aspect of the invention provides a GPU resource scheduling and allocation system for distributed machine learning training jobs, comprising a node allocation module and a job placement module. The node allocation module considers the benefit of node allocation to job training speed; the job placement module is aware of job completion time and balances server cluster communication delay against queuing wait time, so that the average completion time of machine learning training tasks in the GPU cluster is minimized.
When a new task arrives at the scheduling system, the system first allocates 1 working node and 1 parameter server to it to prevent starvation. The node allocation module computes the training speed model and, based on it, repeatedly selects the node allocation scheme with the currently largest benefit and assigns additional nodes to training jobs, maximizing the system's node allocation benefit and reducing the average completion time of all training jobs.
After node allocation is determined, the job is handed to the job placement module, which arranges placement decisions in shortest-job-first order according to the computed remaining completion time. Compared with a first-in-first-out policy, this avoids long jobs being deployed first while short jobs wait for a long time, effectively reducing the overall job waiting time of the cluster. The job placement module computes the communication delay and waiting time produced by placing a job on different server sets, records them as placement costs, and selects the placement strategy with the lowest cost, balancing the communication delay of distributed machine learning training against queuing wait time and further reducing the average completion time of distributed machine training tasks.
Further, the node allocation module computes, for each job, the training speed improvement produced by allocating one additional parameter server or one additional working node, and selects the job with the largest improvement in the current cluster; if allocating an additional parameter server to that job improves speed more than allocating a working node, an additional parameter server is allocated, otherwise an additional working node is allocated.
Furthermore, in the job placement module, the jobs to be deployed are sorted by remaining completion time in increasing order. Each time, the scheduling system selects the head job of the deployment queue for placement scheduling, traverses the server available-resource priority queue, selects the head server, and adds it to the job's deployment server set S. It checks whether the available resources of the deployment server set satisfy the job's resource requirements; if not, it continues traversing the server queue and selects a server s′ not yet in the deployment server set S of job j, adding it to S until S satisfies the deployment resource requirements.
When the server set S satisfies the deployment resource requirements of job j, the communication delay Delay_S of S is computed and the deployment plan cost key-value pair {plan: S, cost: Delay_S} is added to the deployment plan list DeploymentPlans. The other machine learning jobs j_other already running on the servers in S are traversed in increasing order of remaining completion time; if j_other occupies resources r on its server s, the other servers in the set are compared, and if the resources contributed by another server s′ do not exceed r, s′ is deleted from the set and the reduced set is denoted S′; the communication delay Delay_S′ and the remaining completion time T_j_other of job j_other are computed, and the cost key-value pair {plan: S′, cost: Delay_S′ + T_j_other} is added to DeploymentPlans. All key-value pairs in DeploymentPlans are then traversed, the minimum cost is selected as the deployment plan, the job is deployed onto the corresponding server set and deleted from the deployment queue; if no resources are available in the cluster at this point, deployment continues in the next scheduling interval, otherwise the next job is deployed.
The advantages and beneficial effects of the invention are as follows. The node allocation module, aware of the relationship between a job's node allocation and its training speed, uses a greedy algorithm to select the node allocation with the largest speed improvement benefit, maximizing that benefit while preserving scheduling efficiency. The job placement module adopts a shortest-job-first placement order based on completion-time awareness, effectively reducing the average job waiting time. The placement module is also communication-delay aware: it fully accounts for the impact of delay on the training efficiency of distributed machine learning tasks, balances job queuing time, and ensures the placement strategy has minimal cost, thereby reducing the average completion time of the cluster.
Drawings
Fig. 1 is an overall architecture diagram of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
After a user submits a new machine learning training task, the task enters the queue of jobs to be allocated and is initialized with 1 parameter server and 1 working node. As shown in FIG. 1, there are n jobs to be allocated in the queue, and the initial state, denoted p:1, w:1, indicates that the current job has been allocated 1 parameter server and 1 working node.
The node allocation module computes a training speed model for the jobs in the allocation queue, generates node allocation decisions from it, and adds the jobs to the placement queue in order of remaining completion time. As shown in FIG. 1, after the node allocation module generates its decisions, the parameter server and working node counts of the jobs in the placement queue change: job 1 has 1 parameter server and 3 working nodes, while training job 2 has 2 parameter servers and 2 working nodes. The job placement module then generates placement decisions for the jobs to be placed and determines the server cluster locations where they run; in FIG. 1, training job 1 is placed on GPU server 1, and training job 2 is placed on GPU servers 1 and 2.
A specific embodiment of the invention comprises the following steps:
(1) The scheduling system allocates a parameter server and a working node to each job to be allocated in the training job queue and places them in the server cluster to prevent starvation.
(2) For each training job, the scheduling system trial-runs different combinations of parameter server and working node counts in the cluster and fits the machine learning training speed model S(p, w) from the training speeds observed under those combinations.
S(p, w) is

S(p, w) = w / (t_f + t_b + θ_w · 2Mw/(pB) + θ_p · t_c)

where p is the number of parameter servers of the job; w is the number of working nodes of the job; t_f is the time of the forward propagation phase on a working node; t_b is the time of the backward propagation phase on a working node; t_c is the time required to update the parameters on a parameter server; M is the model size, i.e. the total number of parameter bytes; B is the bandwidth of a parameter server; θ_w is the communication delay coefficient of a working node; and θ_p is the communication delay coefficient of a parameter server.
Furthermore, for the jobs to be adjusted in the job queue (each already allocated 1 parameter server and 1 working node), the scheduling system extracts up to 500 MB of the training data set submitted by the user and launches training jobs that train on the extracted data set in the cluster under different combinations of parameter server count p and working node count w. To ensure both the validity of the fitted speed model and the efficiency of fitting, the range of the parameter server count p is set to [1, 3] and the range of the working node count w to [1, 5]. The scheduling system monitors the training process, stops each trial once the number of training iterations reaches 10, and records the average training time under the current node allocation. The average training times under the different node configurations are collected, and the training speed model S(p, w) of the machine learning job is fitted by least squares.
(3) The monitoring component of the scheduling system collects the loss value of each training iteration of a job, fits the job's loss-versus-iteration curve, and computes the number of remaining iterations needed for the training model to converge.
The convergence curve model of a job is:

l = 1/(α·k + β) + γ

where l is the training loss value, k is the number of iteration steps the training job has completed, and α, β and γ are non-negative model coefficients.
Further, the scheduling system monitors each training job and collects the completed iteration count and loss value after each training iteration finishes. The data is then preprocessed, and points outside the plausible loss value range are replaced by the average of the loss values of the 2 preceding and 2 following steps. Using a non-negative least squares solver on the collected (k, l) points, where k is the number of iterations the job has completed and l is the loss value of the current training iteration, the loss curve model coefficients α, β and γ of the current job are fitted, and the number of steps required for convergence is computed from the convergence threshold predefined at user submission. Combining the job speed model S(p, w) of step S2, the training time of one step on each working node is 1/S(p, w); with k iterations of the job completed, the time required to finish the job's training is calculated as T = (k_total − k)/S(p, w).
(4) The node allocation module uses a greedy strategy to allocate an appropriate number of working nodes and parameter servers to each job according to its training speed model S(p, w), and adds the allocated jobs to the placement queue.
Further, the node allocation policy is described as follows:
For each training job to be adjusted, the scheduling system computes the parameter server allocation benefit Ben_p and the working node allocation benefit Ben_w and selects the larger as the job's resource allocation benefit Ben = max(Ben_p, Ben_w). After the computation, the scheduling system compares the resource allocation benefits Ben of all jobs and selects the job with the maximum benefit; if that job's parameter server allocation benefit exceeds its working node allocation benefit, i.e. Ben_p > Ben_w, an additional parameter server is allocated, otherwise an additional working node. These operations repeat until the available cluster resources are insufficient to deploy any new node.
(5) The job placement module schedules the training jobs in the placement queue in increasing order of remaining completion time T, computes the cost of different placement strategies based on distributed training communication delay and queuing wait time, and deploys each job to the GPU cluster under the lowest-cost placement strategy.
Further, the job placement policy is described as follows:
The job placement module sorts the jobs to be deployed by remaining completion time in increasing order. Each time, the scheduling system selects the head job of the deployment queue for placement scheduling, traverses the server available-resource priority queue, selects the head server, and adds it to the job's deployment server set S. It checks whether the available resources of the deployment server set satisfy the job's resource requirements; if not, it continues traversing the server queue and selects a server s′ not yet in the deployment server set S of job j, adding it to S until S satisfies the deployment resource requirements.
When the server set S satisfies the deployment resource requirements of job j, the communication delay Delay_S of S is computed and the deployment plan cost key-value pair {plan: S, cost: Delay_S} is added to the deployment plan list DeploymentPlans. The other machine learning jobs j_other already running on the servers in S are traversed in increasing order of remaining completion time; if j_other occupies resources r on its server s, the other servers in the set are compared, and if the resources contributed by another server s′ do not exceed r, s′ is deleted from the set and the reduced set is denoted S′; the communication delay Delay_S′ and the remaining completion time T_j_other of j_other are computed, and the cost key-value pair {plan: S′, cost: Delay_S′ + T_j_other} is added to DeploymentPlans. All key-value pairs in DeploymentPlans are traversed, the minimum cost is selected as the deployment plan, the job is deployed onto the corresponding server set and removed from the deployment queue; if no resources are available in the cluster at this point, deployment continues in the next scheduling interval, otherwise the next job is deployed.
(6) Steps (1) to (5) are repeated at a fixed scheduling interval.
To implement the GPU resource scheduling system provided by the invention, the scheduling system must monitor, for each training job, the working node and parameter server allocation, the loss value produced by each training iteration, and the number of iterations executed so far; after a job's nodes are allocated, the monitoring system must also record the training speed model S(p, w) of each job.
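Taken together, the scheduling cycle admits a minimal top-level sketch; the method names and the interval value are illustrative assumptions, not an API defined by the patent:

```python
import time

SCHEDULING_INTERVAL_S = 60   # interval length is an assumption, not from the patent

def scheduling_cycle(scheduler):
    """Steps (1)-(6) as one loop over an assumed scheduler object."""
    while True:
        scheduler.bootstrap_new_jobs()       # (1) give each new job 1 PS + 1 worker
        scheduler.fit_speed_models()         # (2) trial runs -> S(p, w)
        scheduler.fit_loss_curves()          # (3) loss curve -> remaining iterations
        scheduler.allocate_nodes_greedily()  # (4) max-benefit node allocation
        scheduler.place_jobs()               # (5) lowest-cost placement, SJF order
        time.sleep(SCHEDULING_INTERVAL_S)    # (6) repeat at a fixed interval
```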

Claims (9)

1. A distributed machine learning training task GPU resource scheduling distribution method is characterized by comprising the following steps:
s1, the dispatching system allocates a parameter server and a working node for each job to be allocated in the training job queue, and places the parameter server and the working node into a server cluster to prevent starvation;
s2, the dispatching system tries to run in the cluster by different numbers of parameter servers and working nodes combined for each training job; fitting a training speed model S (p, w) of the training operation by collecting training speeds of the operation under different parameter server and working node number combinations;
s3, the monitoring system of the dispatching system collects the loss value of each training iteration of the training operation, and is used for fitting the loss value iteration curve of the operation and calculating the residual iteration times of the training model reaching convergence;
s4, the node distribution module distributes proper working node quantity and parameter server quantity for each job according to the training speed model S (p, w) of the training job by using a greedy strategy, and the distributed job is added into a queue to be placed;
s5, the job placement module performs placement scheduling on the training jobs in the queue to be placed according to the sequence of the residual completion time T from small to large, calculates the cost of different placement strategies based on the communication delay and the queuing time of the distributed training jobs, and selects the placement strategy with the lowest cost to deploy the jobs to the GPU cluster;
s6, cycling the step S1 to the step S5 at regular intervals of scheduling.
2. The method according to claim 1, wherein the step S2 includes the following steps:
s21, the scheduling system intercepts the data within 500MB in the training data set submitted by the user to the job to be adjusted which is distributed with a parameter server and a working node in the job queue;
s22, the scheduling system arranges the training operation in the cluster to train the intercepted training data set according to the combination of the number p of different parameter servers and the number w of the working nodes;
the scheduling system monitors the training process, finishes training when the number of training iterations reaches 10, and records the average training time distributed by the current node;
s23, collecting the average training time of the training operation under different node configurations, and adopting a minimum dichotomy to fit the training speed model S (p, w) of the machine learning operation:
Figure FDA0003656813910000021
where p is the number of parameter servers for the job, w is the number of working nodes for the job, t f Is the time of the forward propagation phase on the working node, t b Is the time of the back propagation phase on the working node, t c Is the time required for updating parameters on the parameter server, M is the size of the model in the working stage, i.e. the total number of bytes of the parameters, B is the bandwidth value of the parameter server, theta w Is the communication delay coefficient of the working node, theta p Is the communication delay coefficient of the parameter server.
3. The method according to claim 1, wherein the step S3 includes the following steps:
s31, the dispatching system monitors each training operation, and collects the number of iterations and loss value after one training iteration is finished;
replacing points exceeding the loss value range with the loss average values of the last steps;
s32, calculating loss curve model coefficients alpha, beta and gamma for fitting the current operation by using a non-negative least square solver according to the number of iterations and loss value points (k, l) collected by the scheduling system; wherein k is the current iteration number of the operation, and l is the loss value obtained by the current training iteration;
calculating the number of steps required by model convergence according to a predefined convergence threshold when a user submits:
let the user-given convergence threshold be l th Then the condition satisfied by the loss function to achieve convergence is
Figure FDA0003656813910000031
Based on this equation, the total number of iterations required for training convergence is calculated as
Figure FDA0003656813910000032
Figure FDA0003656813910000033
S33, combining the training speed model S(p, w) of step S2, obtaining the time of one training iteration per working node as 1/S(p, w), and, with k the number of iterations the job has completed, calculating the time required for the job training to finish as T = (k_total − k)/S(p, w).
4. The method according to claim 1, wherein the step S4 includes the following steps:
s41 calculation of each training job to be adjusted by the scheduling system, calculation of profit Ben allocated by the parameter server p And the working node allocates the profit Ben w The larger one is selected as the resource allocation profit Ben ═ max (Ben) p ,Ben w );
S42, the scheduling system compares the resource allocation resource income Ben of all the jobs and selects the job with the maximum resource allocation income; if the parameter server allocation profit of the operation is larger than the work node allocation profit, i.e. Ben p >Ben w If the job is a virtual job, allocating an additional parameter server to the job, otherwise, allocating an additional working node;
s43, repeating the steps S41 to S42 until the cluster available resources are not enough to continue to deploy the new node;
and S44, the scheduling system records the number of the working nodes and the number of the parameter servers distributed by each job to be scheduled, marks the job to be deployed as the job to be deployed, and deploys the job in the next step.
5. The method according to claim 1, wherein the step S5 includes the following steps:
s51, calculating the residual completion time T of each training job according to the principle of short job priority for each job to be deployed in the job queue, sequencing the jobs to be deployed from small to large according to the T, and storing the jobs to be deployed in one job queue to be deployed;
s52, the dispatching system sorts all servers in the server cluster from large to small according to the available amount of the residual resources, and maintains a priority queue of the available resources of the servers;
s53, the scheduling system selects the first job j of the job queue to be deployed for placing and scheduling, traverses the available resource priority queue of the server, selects the first server S of the queue, and joins the deployment server set S of the job j;
s54, inquiring whether the available resources of the deployment server set can meet the resource requirement of the job j, if so, calculating the communication delay of the deployment server set S
Figure FDA0003656813910000041
Saving deployment scenario cost key-value pairs
Figure FDA0003656813910000042
Jumping to step S56 in the deployment plan list DeploymentPlans, otherwise, entering step S55;
s55, continuously traversing the job queue to be deployed, and selecting the server S which is not in the deployment server set S of the job j Adding the data into a deployment server set S, and jumping back to the step S54;
s56 traversing other already running machine learning jobs j on servers in the deployment server set S other Sorting according to the residual completion time from small to large, and storing the data in other job sets J other Performing the following steps;
s57 traversing the set J other Task j of other If j is other The occupied resource of the server S is r, other servers in the set S are compared, if S ' is less than or equal to r and S ' belongs to S and S ' is not equal to S, S ' in the set S is deleted and marked as S '; calculating communication delay
Figure FDA0003656813910000043
And operation j other Is left over for the completion time T j And deploy the scheme cost key value pair
Figure FDA0003656813910000044
Adding a deployment scenario list DeploymentPlans;
s58, traversing all key value pairs in the deployment plan list DeploymentPlans, selecting the minimum cost as a deployment plan, deploying the job j to a deployment server set corresponding to the minimum cost, deleting the job j from a job queue to be deployed, deploying in the next scheduling interval if no available resource exists in the cluster at the moment, and otherwise, jumping to the step S53.
6. The distributed machine learning training task GPU resource scheduling and allocation method according to claim 5, characterized in that the communication delay Delay_S of job j deployed on the server set S is calculated as follows:
the scheduling system trial-transmits the parameter data generated by one training step of job j between the nodes of the different server pairs in the deployment server set S; each distinct server pair is trial-transmitted 10 times and the average transmission time t̄(s1, s2) is recorded, where s1, s2 ∈ S and s1 ≠ s2; then the communication delay of job j deployed on the server set S is

Delay_S = k · max{ t̄(s1, s2) : s1, s2 ∈ S, s1 ≠ s2 }

where k is the number of remaining convergence iterations of job j.
7. A GPU resource scheduling and allocation system for distributed machine learning training jobs, comprising a node allocation module and a job placement module, characterized in that:
the node allocation module considers the benefit of node allocation to job training speed; the job placement module is aware of job completion time and balances server cluster communication delay against queuing wait time, so that the average completion time of machine learning training tasks in the GPU cluster is minimized;
when a new task arrives at the scheduling system, the scheduling system first allocates one working node and one parameter server to it to prevent starvation; the node allocation module calculates the training speed model and, based on it, selects the node allocation scheme with the currently largest benefit and allocates additional nodes to training jobs, maximizing the system's node allocation benefit and reducing the average completion time of all training jobs;
after node allocation is determined, the job is handed to the job placement module, which arranges placement decisions in shortest-job-first order according to the calculated remaining completion time; the job placement module calculates the communication delay and waiting time produced by placing jobs on different server sets, records them as placement costs, and selects the placement strategy with the lowest placement cost.
8. The system according to claim 7, characterized in that in the node allocation module, the training speed improvement produced by allocating an additional parameter server or an additional working node is calculated for each job; the job with the largest improvement in the current cluster is selected, and if allocating an additional parameter server to that job improves speed more than allocating a working node, an additional parameter server is allocated, otherwise an additional working node.
9. The system according to claim 7, characterized in that in the job placement module, the jobs to be deployed are sorted by remaining completion time in increasing order; each time, the scheduling system selects the head job of the deployment queue for placement scheduling, traverses the server available-resource priority queue, selects the head server, and adds it to the job's deployment server set S;
whether the available resources of the deployment server set satisfy the job's resource requirements is checked; if not, the server queue is traversed further and a server s′ not in the deployment server set S of job j is selected and added to S until S satisfies the deployment resource requirements;
when the server set S satisfies the deployment resource requirements of job j, the communication delay Delay_S of S is calculated and the deployment plan cost key-value pair {plan: S, cost: Delay_S} is added to the deployment plan list DeploymentPlans; the other machine learning jobs j_other already running on the servers in S are traversed in increasing order of remaining completion time; if j_other occupies resources r on its server s, the other servers in the set are compared, and if the resources contributed by another server s′ do not exceed r, s′ is deleted from the set and the reduced set is denoted S′; the communication delay Delay_S′ and the remaining completion time T_j_other of job j_other are calculated, and the deployment plan cost key-value pair {plan: S′, cost: Delay_S′ + T_j_other} is added to the deployment plan list DeploymentPlans;
all key-value pairs in the deployment plan list DeploymentPlans are traversed, the minimum cost is selected as the deployment plan, the job is deployed onto the corresponding deployment server set and deleted from the deployment queue; if no resources are available in the cluster at this moment, deployment continues in the next scheduling interval, otherwise the next job is deployed.
CN202210562623.2A 2022-05-23 2022-05-23 Distributed machine learning task GPU resource scheduling and distributing method and system Pending CN114996001A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210562623.2A CN114996001A (en) 2022-05-23 2022-05-23 Distributed machine learning task GPU resource scheduling and distributing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210562623.2A CN114996001A (en) 2022-05-23 2022-05-23 Distributed machine learning task GPU resource scheduling and distributing method and system

Publications (1)

Publication Number Publication Date
CN114996001A true CN114996001A (en) 2022-09-02

Family

ID=83026422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210562623.2A Pending CN114996001A (en) 2022-05-23 2022-05-23 Distributed machine learning task GPU resource scheduling and distributing method and system

Country Status (1)

Country Link
CN (1) CN114996001A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116155750A (en) * 2023-04-19 2023-05-23 之江实验室 Deep learning job resource placement method, system, equipment and storage medium
CN116204327A (en) * 2023-05-06 2023-06-02 阿里巴巴(中国)有限公司 Distributed system communication scheduling method and distributed machine learning system
CN116755779A (en) * 2023-08-18 2023-09-15 腾讯科技(深圳)有限公司 Method, device, equipment, storage medium and chip for determining cycle interval
CN116755779B (en) * 2023-08-18 2023-12-05 腾讯科技(深圳)有限公司 Method, device, equipment, storage medium and chip for determining cycle interval
CN116755893A (en) * 2023-08-22 2023-09-15 之江实验室 Job scheduling method and device of deep learning-oriented distributed computing system
CN116755893B (en) * 2023-08-22 2023-11-17 之江实验室 Job scheduling method and device of deep learning-oriented distributed computing system
CN116954873A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system
CN116954873B (en) * 2023-09-21 2024-01-23 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system
CN117632444A (en) * 2024-01-26 2024-03-01 之江实验室 NPU fault-tolerant scheduling system of computer cluster
CN117632444B (en) * 2024-01-26 2024-06-11 之江实验室 NPU fault-tolerant scheduling system of computer cluster

Similar Documents

Publication Publication Date Title
CN114996001A (en) Distributed machine learning task GPU resource scheduling and distributing method and system
CN111722910B (en) Cloud job scheduling and resource allocation method
CN110351348B (en) Cloud computing resource scheduling optimization method based on DQN
CN104239141A (en) Task optimized-scheduling method in data center on basis of critical paths of workflow
CN114787830A (en) Machine learning workload orchestration in heterogeneous clusters
CN112463390A (en) Distributed task scheduling method and device, terminal equipment and storage medium
CN112052092B (en) Risk-aware edge computing task allocation method
CN108123998B (en) Heuristic request scheduling method for delay sensitive application in multi-cloud data center
CN114610474A (en) Multi-strategy job scheduling method and system in heterogeneous supercomputing environment
CN114500560B (en) Edge node service deployment and load balancing method for minimizing network delay
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN113672391B (en) Parallel computing task scheduling method and system based on Kubernetes
CN112817728A (en) Task scheduling method, network device and storage medium
CN114172558A (en) Task unloading method based on edge calculation and unmanned aerial vehicle cluster cooperation in vehicle network
CN113032102A (en) Resource rescheduling method, device, equipment and medium
CN114090239B (en) Method and device for dispatching edge resources based on model reinforcement learning
CN110084507B (en) Scientific workflow scheduling optimization method based on hierarchical perception in cloud computing environment
CN114691372A (en) Group intelligent control method of multimedia end edge cloud system
CN112306642B (en) Workflow scheduling method based on stable matching game theory
CN116302578B (en) QoS (quality of service) constraint stream application delay ensuring method and system
CN104731662B (en) A kind of resource allocation methods of variable concurrent job
CN116954905A (en) Task scheduling and migration method for large Flink data
CN109522129A (en) A kind of resource method for dynamically balancing, device and relevant device
CN116360922A (en) Cluster resource scheduling method, device, computer equipment and storage medium
CN115145383A (en) Adaptive energy-saving selection method for CPU/GPU server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination