WO2024103463A1 - Elastic deep learning job scheduling method and system, and computer device - Google Patents

Elastic deep learning job scheduling method and system, and computer device

Info

Publication number
WO2024103463A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
nodes
pipeline
preemption
deep learning
Prior art date
Application number
PCT/CN2022/137723
Other languages
French (fr)
Chinese (zh)
Inventor
叶可江 (YE Kejiang)
段诩 (DUAN Xu)
须成忠 (XU Chengzhong)
Original Assignee
深圳先进技术研究院 (Shenzhen Institute of Advanced Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 (Shenzhen Institute of Advanced Technology)
Publication of WO2024103463A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of information technology, and in particular to a method, system and computer device for elastic deep learning job scheduling.
  • Data parallelism and pipeline parallelism are the current mainstream solutions for distributed deep learning training.
  • Data parallelism distributes training data to different nodes for simultaneous calculation, pushes gradients to each node through synchronous communication, and updates model weights.
  • Pipeline parallelism partitions the model and assigns it to different nodes, and divides the mini-batch into micro-batches, so that calculations on different nodes can be parallelized through pipelines.
  • Each node stores only a part of the model parameters, thus avoiding the GPU memory bottleneck.
  • The network communication overhead of pipeline parallelism is lower than that of data parallelism and of inter-operator model parallelism.
  • Mainstream distributed training solutions usually use hybrid parallel training, which combines data parallelism and pipeline parallelism.
  • Dharma Shukla et al. proposed preemptive job scheduling in Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads, achieving transparent preemption and elasticity of nodes by swapping data between GPU memory and CPU memory to time-slice between different jobs.
  • Sanjith Athlur et al. proposed a pipeline-parallel strategy for spot instances in Varuna: Scalable, Low-cost Training of Massive Deep Learning Models, which reconfigures nodes after they are preempted.
  • Varuna uses pipeline schedules and continuous checkpointing to enable jobs to recover from a checkpoint when preempted. John Thorpe et al., in Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs, proposed adding redundant computation to the pipeline's bubbles, so that a preempted node can be replaced by its predecessor as long as no consecutive nodes are preempted; otherwise the system degenerates to reconfiguration and checkpoint recovery.
  • One of the purposes of this application is to provide an elastic deep learning job scheduling method comprising the preemption and return of nodes.
  • the step of preempting a node includes the following steps:
  • The additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighbor nodes of the unloaded node are converted from delayed computation to immediate computation; the synchronous communication network topology is then adjusted so that the neighbor nodes join the topology and take the unloaded node's place during the synchronization phase.
  • the steps of obtaining partition configuration and pipeline orchestration specifically include the following steps:
  • the following steps are specifically included:
  • The step of receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node specifically includes the following steps:
  • The optimal preemption sequence, i.e. the node sequence satisfying the constraint given in the filing's equation image (not reproduced here), is calculated according to the partition configuration and the pipeline schedule, and the preempted node is unloaded.
  • the step of returning the node specifically includes the following steps:
  • the nodes that have completed the job are added to the standby queue, and the preempted job obtains the available nodes from the standby queue.
  • the obtained available nodes load the corresponding model partitions and pipeline configurations, and restore the state from the checkpoint;
  • the returned node asynchronously pulls the intermediate states and gradients from neighboring nodes and performs forced synchronization at the next synchronization barrier;
  • The additional partitions of the neighbor nodes revert to delayed computation and are computed within the bubble time.
  • In the step of calculating the optimal return sequence according to the partition configuration and the pipeline schedule, the node sequence satisfying the constraint given in the filing's equation image sends a return message to the neighboring nodes N_{i-1} and N_{i+1} in the return sequence.
  • The step in which the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier guarantees that, after the barrier, the data [L_{i-1}, L_i] of node N_i is identical to the data in the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes N_{i-1} and N_{i+1}; the normal synchronous communication network topology is restored after synchronization.
  • The second purpose of this application is to provide an elastic deep learning job scheduling system, including a processing unit for handling the preemption and return of nodes.
  • the third object of the present application is to provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the method described is implemented.
  • The elastic deep learning job scheduling method, system and computer device provided in the present application include the preemption and return of nodes, so that nodes of pipeline-parallel jobs in the cluster can be preempted without affecting training.
  • Preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.
  • FIG. 1 is a flowchart of the steps for obtaining the partition configuration and pipeline orchestration provided in Embodiment 1 of the present application.
  • FIG. 2 is a flowchart of the steps of returning a node provided in Embodiment 1 of the present application.
  • FIG. 3 is a schematic diagram of the structure of the computer device provided in Embodiment 3 of the present application.
  • first and second are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, a feature defined as “first” or “second” may explicitly or implicitly include one or more of the features. In the description of this application, the meaning of “plurality” is two or more, unless otherwise clearly and specifically defined.
  • An embodiment of the present application provides an elastic deep learning job scheduling method, including the steps of preempting and returning nodes. The implementation of each step is described in detail below.
  • FIG. 1 shows the steps of preempting a node in this embodiment, comprising the following steps S110 to S160; the implementation of each step is described in detail below.
  • Step S110: Obtain the partition configuration and pipeline orchestration.
  • the steps of obtaining partition configuration and pipeline orchestration specifically include the following steps:
  • Step S113: Distribute the corresponding model partitions to each node of the job, and establish the network topology for pipeline-parallel computation and data-parallel synchronous communication.
  • Step S120: Partition the model of the preemptible node, distribute it into the bubbles of the neighboring nodes, and perform delayed computation.
  • the following steps are specifically included:
  • Step S130: Receive a preemption instruction, calculate the preemption sequence and unload the preempted node.
  • The step of receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node specifically includes the following steps:
  • The optimal preemption sequence, i.e. the node sequence satisfying the constraint given in the filing's equation image, is calculated according to the partition configuration and the pipeline schedule, and the preempted node is unloaded.
  • Step S140: Determine whether the preemption sequence involves a key node that cannot be preempted.
  • Step S150: If so, reconfigure the model partitions of all nodes and restore the state from the checkpoint.
  • Step S160: If not, the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes of the unloaded node are converted from delayed computation to immediate computation; the synchronous communication network topology is then adjusted so that the neighboring nodes join the topology and replace the unloaded node during the synchronization phase.
  • The above steps realize node preemption: the model of a preemptible node is evenly distributed to its neighboring nodes as a backup, so that the backup computation can be completed within the bubble time; when a node is preempted, a neighboring node can seamlessly take over its computation.
  • FIG. 2 is a flowchart of the steps of returning a node in this embodiment, specifically comprising the following steps S210 to S250.
  • The implementation of each step is described in detail below.
  • Step S210: Add the nodes that have completed their jobs to a standby queue; the preempted job obtains available nodes from the standby queue.
  • Step S220: Calculate the optimal return sequence according to the partition configuration and the pipeline schedule.
  • In the step of calculating the optimal return sequence according to the partition configuration and the pipeline schedule, the node sequence satisfying the constraint given in the filing's equation image sends a return message to the neighboring nodes N_{i-1} and N_{i+1} in the return sequence.
  • Step S230: The obtained available nodes load the corresponding model partitions and pipeline configuration, and restore the state from the checkpoint.
  • Step S240: The returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes, and performs forced synchronization at the next synchronization barrier.
  • The step in which the returned node asynchronously pulls the intermediate state and gradients and performs forced synchronization at the next synchronization barrier guarantees that, after the barrier, the data [L_{i-1}, L_i] of node N_i is identical to the data in the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes N_{i-1} and N_{i+1}; the normal synchronous communication network topology is restored after synchronization.
  • Step S250: The additional partitions of the neighboring nodes revert to delayed computation and are computed within the bubble time.
  • The elastic deep learning job scheduling method provided in the present application includes the preemption and return of nodes, so that nodes of pipeline-parallel jobs in the cluster can be preempted without affecting training.
  • Preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.
  • This embodiment also provides an elastic deep learning job scheduling system, including a processing unit for handling the preemption and return of nodes.
  • The elastic deep learning job scheduling system includes the preemption and return of nodes, so that nodes of pipeline-parallel jobs in the cluster can be preempted without affecting training.
  • Preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.
  • the computer device 50 includes a processor 51 and a memory 52 coupled to the processor 51 .
  • The memory 52 stores program instructions for implementing the above-mentioned elastic deep learning job scheduling method.
  • the processor 51 is used to execute program instructions stored in the memory 52 to implement the elastic deep learning job scheduling method.
  • the processor 51 may also be referred to as a CPU (Central Processing Unit).
  • the processor 51 may be an integrated circuit chip having signal processing capabilities.
  • the processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

An elastic deep learning job scheduling method and system, and a computer device, comprising the preemption and return of nodes. In this way, nodes of pipeline-parallel jobs in a cluster can be preempted without affecting training; preemption within a certain range requires neither reconfiguration nor restoration from a checkpoint, and preempted nodes can be returned to their jobs at any time, thereby improving cluster resource utilization and shortening the average job completion time.

Description

Elastic deep learning job scheduling method, system and computer device

Technical Field

The present application relates to the field of information technology, and in particular to an elastic deep learning job scheduling method, system and computer device.

Background Art

Distributed training migrates a model from a single machine with a single GPU to a cluster of multiple machines and multiple GPUs, using the computing resources of many nodes to accelerate training. Data parallelism and pipeline parallelism are the current mainstream approaches to distributed deep learning training. Data parallelism distributes the training data to different nodes for simultaneous computation, pushes gradients to every node through synchronous communication, and updates the model weights. Pipeline parallelism partitions the model across different nodes and divides each mini-batch into micro-batches, so that computations on different nodes can proceed in parallel in a pipelined fashion. Each node stores only a part of the model parameters, which avoids the GPU memory bottleneck. The network communication overhead of pipeline parallelism is lower than that of data parallelism and of inter-operator model parallelism. Mainstream distributed training solutions usually adopt hybrid parallel training, combining data parallelism and pipeline parallelism.
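To make the pipeline-parallel mechanism above concrete, the following minimal sketch (illustrative only, not taken from the filing) lays out a simple fill-and-drain forward schedule for P stages and m micro-batches; the empty cells are the pipeline "bubbles" that the method described later exploits for backup computation:

```python
# Illustrative sketch: which micro-batch each pipeline stage computes at
# each time step in a fill-and-drain forward schedule. "." marks a bubble,
# i.e. an idle slot on that stage.

def pipeline_schedule(num_stages: int, num_microbatches: int):
    steps = num_stages + num_microbatches - 1
    grid = []
    for stage in range(num_stages):
        row = [t - stage if 0 <= t - stage < num_microbatches else None
               for t in range(steps)]
        grid.append(row)
    return grid

grid = pipeline_schedule(num_stages=4, num_microbatches=6)
for stage, row in enumerate(grid):
    print(f"stage {stage}: " + " ".join("." if mb is None else str(mb) for mb in row))
print("bubble slots:", sum(mb is None for row in grid for mb in row))  # 4 * 3 = 12
```

Each stage idles for P - 1 of the P + m - 1 steps, and it is exactly these idle slots that the delayed backup computation described below is scheduled into.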
Dharma Shukla et al. proposed preemptive job scheduling in Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads, achieving transparent preemption and elasticity of nodes by swapping data between GPU memory and CPU memory to time-slice between different jobs. Sanjith Athlur et al. proposed a pipeline-parallel strategy for spot instances in Varuna: Scalable, Low-cost Training of Massive Deep Learning Models, which reconfigures nodes after they are preempted; Varuna uses pipeline schedules and continuous checkpointing so that jobs can recover from a checkpoint when preempted. John Thorpe et al., in Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs, addressed preemption of spot instances by adding redundant computation to the pipeline's bubbles: as long as no consecutive nodes are preempted, any preempted node can be replaced by its predecessor; when consecutive nodes are preempted, the system degenerates to reconfiguration and checkpoint recovery.

Existing cluster scheduling strategies usually preempt computing resources by time-slice switching, recover from a checkpoint after reconfiguration, or keep redundant backups on additional nodes. The drawbacks of these techniques are as follows: both time-slice switching and reconfigured parallelism must restore state from a checkpoint, incurring a large memory-swapping overhead; redundant computation is only viable on spot instances, brings a large extra cost in a cluster, and cannot complete all computation within the bubble time; and there is as yet no general mechanism for preemption that does not require checkpoint recovery.
Summary of the Invention

In view of this, it is necessary to provide, against the defects of the prior art, an elastic deep learning job scheduling method, system and computer device that improve cluster resource utilization and reduce the average job completion time.

To solve the above problems, the present application adopts the following technical solutions:

One purpose of this application is to provide an elastic deep learning job scheduling method comprising the preemption and return of nodes.
In some embodiments, the step of preempting a node includes the following steps:
obtaining the partition configuration and pipeline orchestration;
partitioning the model of the preemptible node, distributing it into the bubbles of the neighboring nodes, and performing delayed computation;
receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node;
determining whether the preemption sequence involves a key node that cannot be preempted;
if so, reconfiguring the model partitions of all nodes and restoring the state from the checkpoint;
if not, converting the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighbor nodes of the unloaded node from delayed computation to immediate computation, then adjusting the synchronous communication network topology so that the neighbor nodes join the topology and take the unloaded node's place during the synchronization phase.
In some embodiments, the step of obtaining the partition configuration and pipeline orchestration specifically includes the following steps:
obtaining the partition configuration and pipeline orchestration from the scheduler: letting G be the total number of GPUs and P and D the numbers of GPUs used for pipeline parallelism and data parallelism respectively, the partition size satisfies G = P × D, together with the partition hyperparameters mini-batch size and micro-batch size;
using offline profiling data of the model running on the GPU, namely the per-layer forward computation time F = {f_i}, the per-layer backward propagation time B = {b_i}, the per-layer GPU memory usage M = {m_i} and the GPU memory upper bound sup(M), establishing a partition L = {L_i} satisfying the constraints given in the filing's equation images (not reproduced in this text extraction);
distributing the corresponding model partitions to each node of the job, and establishing the network topology for pipeline-parallel computation and data-parallel synchronous communication.
In some embodiments, the step of partitioning the model of the preemptible node, distributing it into the bubbles of the neighboring nodes and performing delayed computation specifically includes the following steps:
repartitioning the model of the preemptible node N_i evenly at a split point L'_i (the exact expression is given as an equation image in the filing) and storing the two halves on the node's two neighbors in the pipeline, so that the computation of the partitions [L_{i-1}, L'_i] and [L'_i, L_i] is deferred on the neighboring nodes N_{i-1} and N_{i+1} into their bubble time as additional computation, i.e. converted to delayed computation.
In some embodiments, the step of receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node specifically includes the following steps:
after receiving the preemption instruction, calculating the optimal preemption sequence according to the partition configuration and the pipeline schedule, i.e. the node sequence satisfying the constraint given in the filing's equation image, and unloading the preempted node.
In some embodiments, the step of returning nodes specifically includes the following steps:
adding the nodes that have completed their jobs to a standby queue, the preempted job obtaining available nodes from the standby queue;
calculating the optimal return sequence according to the partition configuration and the pipeline schedule;
the obtained available nodes loading the corresponding model partitions and pipeline configuration, and restoring the state from the checkpoint;
the returned node asynchronously pulling the intermediate state and gradients from the neighboring nodes, and performing forced synchronization at the next synchronization barrier;
the additional partitions of the neighboring nodes reverting to delayed computation, computed within the bubble time.
In some embodiments, in the step of calculating the optimal return sequence according to the partition configuration and the pipeline schedule, the node sequence satisfying the constraint given in the filing's equation image sends a return message to the neighboring nodes N_{i-1} and N_{i+1} in the return sequence.
In some embodiments, the step in which the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier specifically includes the following: the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier, i.e. it is guaranteed that after the barrier the data [L_{i-1}, L_i] of node N_i is identical to the data in the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes N_{i-1} and N_{i+1}; the normal synchronous communication network topology is restored after synchronization.

A second purpose of this application is to provide an elastic deep learning job scheduling system, including a processing unit for handling the preemption and return of nodes.

A third purpose of this application is to provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the described method when executing the computer program.

The above technical solutions adopted by this application have the following beneficial effects:

The elastic deep learning job scheduling method, system and computer device provided in the present application include the preemption and return of nodes, so that nodes of pipeline-parallel jobs in a cluster can be preempted without affecting training; preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.
Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of this application more clearly, the drawings needed in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; for a person of ordinary skill in the art, other drawings can be derived from them without creative effort.

FIG. 1 is a flowchart of the steps for obtaining the partition configuration and pipeline orchestration provided in Embodiment 1 of the present application.

FIG. 2 is a flowchart of the steps of returning a node provided in Embodiment 1 of the present application.

FIG. 3 is a schematic diagram of the structure of the computer device provided in Embodiment 3 of the present application.
Detailed Description of the Embodiments

The embodiments of the present application are described in detail below, with examples shown in the accompanying drawings, where identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, intended to explain the present application, and are not to be construed as limiting it.

In the description of this application, it should be understood that terms indicating orientation or positional relationships, such as "upper", "lower", "horizontal", "inner" and "outer", are based on the orientations or positional relationships shown in the drawings, are used only for convenience and brevity of description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be understood as limiting the present application.

In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of this application, "plurality" means two or more, unless otherwise clearly and specifically defined.

To make the purposes, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the drawings and embodiments.
Embodiment 1

An embodiment of the present application provides an elastic deep learning job scheduling method, including the steps of preempting and returning nodes; the implementation of each step is described in detail below.

Referring to FIG. 1, the step of preempting a node in this embodiment includes the following steps S110 to S160; the implementation of each step is described in detail below.

Step S110: Obtain the partition configuration and pipeline orchestration.
In some embodiments, the step of obtaining the partition configuration and pipeline orchestration specifically includes the following steps:

Step S111: Obtain the partition configuration and pipeline orchestration from the scheduler. Let G be the total number of GPUs and P and D the numbers of GPUs used for pipeline parallelism and data parallelism respectively; then the partition size satisfies G = P × D, together with the partition hyperparameters mini-batch size and micro-batch size.

Step S112: Using offline profiling data of the model running on the GPU, namely the per-layer forward computation time F = {f_i}, the per-layer backward propagation time B = {b_i}, the per-layer GPU memory usage M = {m_i} and the GPU memory upper bound sup(M), establish a partition L = {L_i} satisfying the constraint given in the filing's equation image (not reproduced in this text extraction).

Step S113: Distribute the corresponding model partitions to each node of the job, and establish the network topology for pipeline-parallel computation and data-parallel synchronous communication.

Through the above steps S111 to S113, the partition configuration and pipeline orchestration can be obtained; a hypothetical sketch of step S112 follows.
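The filing gives the exact partitioning constraint of step S112 only as an equation image; the sketch below assumes one common formulation, balancing per-stage compute time f_i + b_i across P contiguous stages while keeping each stage's memory below sup(M). Function and variable names are illustrative, and the sketch assumes sup(M) is loose enough that P stages suffice:

```python
# Hypothetical sketch of step S112 under an assumed balanced-partition
# objective (the filing's exact constraints are not reproduced here).

def partition_layers(F, B, M, sup_M, P):
    """Greedily split layers 0..n-1 into P contiguous stages (half-open ranges)."""
    n = len(F)
    cost = [f + b for f, b in zip(F, B)]   # per-layer compute: f_i + b_i
    target = sum(cost) / P                 # ideal per-stage compute time
    stages, start, c_acc, m_acc = [], 0, 0.0, 0.0
    for i in range(n):
        must_close = m_acc + M[i] > sup_M              # memory bound sup(M)
        want_close = c_acc >= target and len(stages) < P - 1
        if i > start and (must_close or want_close):
            stages.append((start, i))                  # close the current stage
            start, c_acc, m_acc = i, 0.0, 0.0
        c_acc += cost[i]
        m_acc += M[i]
    stages.append((start, n))
    return stages

F = [2, 2, 3, 3, 2, 2, 3, 3]   # per-layer forward times f_i
B = [4, 4, 6, 6, 4, 4, 6, 6]   # per-layer backward times b_i
M = [1, 1, 1, 1, 1, 1, 1, 1]   # per-layer memory usage m_i
print(partition_layers(F, B, M, sup_M=4, P=4))  # [(0, 3), (3, 5), (5, 7), (7, 8)]
```

A production partitioner would typically use dynamic programming over the profiled costs rather than this greedy pass, but the inputs F, B, M and sup(M) are the same ones named in step S112.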
Step S120: Partition the model of the preemptible node, distribute it into the bubbles of the neighboring nodes, and perform delayed computation.

In some embodiments, this step specifically includes the following: repartition the model of the preemptible node N_i evenly at a split point L'_i (the exact expression is given as an equation image in the filing) and store the two halves on the node's two neighbors in the pipeline, so that the computation of the partitions [L_{i-1}, L'_i] and [L'_i, L_i] is deferred on the neighboring nodes N_{i-1} and N_{i+1} into their bubble time as additional computation, i.e. converted to delayed computation; a sketch follows.
Step S130: Receive a preemption instruction, calculate the preemption sequence and unload the preempted node.

In some embodiments, this step specifically includes the following: after receiving the preemption instruction, calculate the optimal preemption sequence according to the partition configuration and the pipeline schedule, i.e. the node sequence satisfying the constraint given in the filing's equation image, and unload the preempted node.
Step S140: Determine whether the preemption sequence involves a key node that cannot be preempted.

Step S150: If so, reconfigure the model partitions of all nodes and restore the state from the checkpoint.

Step S160: If not, convert the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes of the unloaded node from delayed computation to immediate computation, then adjust the synchronous communication network topology so that the neighboring nodes join the topology and take the unloaded node's place during the synchronization phase.

It can be understood that the above steps realize node preemption: the model of a preemptible node is evenly distributed to its neighboring nodes as a backup, so that the backup computation can be completed within the bubble time, and when a node is preempted, a neighboring node can seamlessly take over its computation, as the following sketch illustrates.
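The following hypothetical sketch ties steps S130 to S160 together: if the victim is a key node, the method falls back to full reconfiguration and checkpoint restore (S150); otherwise the neighbors' delayed backups are promoted to immediate computation and the synchronization topology is patched (S160). All names are illustrative:

```python
# Hypothetical sketch of steps S130-S160 on plain dicts; the actual
# sequence computation and reconfiguration path are stubbed out.

def preempt(nodes, sync_topology, victim, key_nodes):
    if victim in key_nodes:
        # S140/S150: a key node is involved -- reconfigure every node's
        # model partition and restore state from the checkpoint (not shown).
        return "reconfigure_and_restore"
    nodes[victim]["active"] = False                 # S130: unload the victim
    for nb in (victim - 1, victim + 1):             # S160: delayed -> immediate
        nodes[nb]["immediate"].extend(nodes[nb]["delayed"])
        nodes[nb]["delayed"].clear()
    sync_topology.discard(victim)                   # victim leaves the topology
    sync_topology.update({victim - 1, victim + 1})  # neighbors sync in its place
    return "neighbors_take_over"

nodes = [{"active": True, "delayed": [], "immediate": []} for _ in range(4)]
nodes[0]["delayed"], nodes[2]["delayed"] = [(4, 6)], [(6, 8)]  # backups of N_1
print(preempt(nodes, sync_topology={0, 1, 2, 3}, victim=1, key_nodes={0}))
print(nodes[0]["immediate"], nodes[2]["immediate"])  # [(4, 6)] [(6, 8)]
```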
Referring to FIG. 2, the flow of returning a node in this embodiment specifically includes the following steps S210 to S250; the implementation of each step is described in detail below.

Step S210: Add the nodes that have completed their jobs to a standby queue; the preempted job obtains available nodes from the standby queue.

Step S220: Calculate the optimal return sequence according to the partition configuration and the pipeline schedule.
In some embodiments, in the step of calculating the optimal return sequence according to the partition configuration and the pipeline schedule, the node sequence satisfying the constraint given in the filing's equation image sends a return message to the neighboring nodes N_{i-1} and N_{i+1} in the return sequence.
Step S230: The obtained available nodes load the corresponding model partitions and pipeline configuration, and restore the state from the checkpoint.

Step S240: The returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes, and performs forced synchronization at the next synchronization barrier.

In some embodiments, this step specifically includes the following: the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier, i.e. it is guaranteed that after the barrier the data [L_{i-1}, L_i] of node N_i is identical to the data in the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes N_{i-1} and N_{i+1}; the normal synchronous communication network topology is restored after synchronization.

Step S250: The additional partitions of the neighboring nodes revert to delayed computation and are computed within the bubble time.

It can be understood that, through the above steps, the return of a node is achieved; a sketch of the whole return flow follows.
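A hypothetical end-to-end sketch of steps S210 to S250; names are illustrative, and the asynchronous pull and synchronization barrier are reduced to bookkeeping on plain dicts:

```python
# Hypothetical sketch of steps S210-S250: take a node from the standby
# queue, restore it into a vacated pipeline slot, and revert the neighbors'
# extra partitions to delayed (bubble-time) computation.

from collections import deque

standby = deque()

def return_node(nodes, sync_topology, slot):
    node = standby.popleft()                   # S210: take an available node
    nodes[slot]["active"] = True               # S230: load partition + pipeline
    nodes[slot]["state"] = "checkpoint"        #       config, restore checkpoint
    for nb in (slot - 1, slot + 1):
        nodes[slot]["pulled_from"].append(nb)  # S240: async pull, forced sync
                                               #       at the next barrier
        nodes[nb]["delayed"].extend(nodes[nb]["immediate"])  # S250: revert to
        nodes[nb]["immediate"].clear()                       #       delayed
    sync_topology.add(slot)                    # normal sync topology restored
    return node

standby.append("freed-node-7")
nodes = [{"active": True, "delayed": [], "immediate": [], "pulled_from": []}
         for _ in range(4)]
nodes[1]["active"] = False                     # slot vacated by a preemption
nodes[0]["immediate"], nodes[2]["immediate"] = [(4, 6)], [(6, 8)]
print(return_node(nodes, sync_topology={0, 2, 3}, slot=1))  # freed-node-7
```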
The elastic deep learning job scheduling method provided in the present application includes the preemption and return of nodes, so that nodes of pipeline-parallel jobs in a cluster can be preempted without affecting training; preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.

Embodiment 2

This embodiment further provides an elastic deep learning job scheduling system, including a processing unit for handling the preemption and return of nodes.

For the detailed operation of the system provided in this Embodiment 2, refer to Embodiment 1; it is not repeated here.

The elastic deep learning job scheduling system provided in the present application includes the preemption and return of nodes, so that nodes of pipeline-parallel jobs in a cluster can be preempted without affecting training; preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.
Embodiment 3

Referring to FIG. 3, a schematic diagram of the structure of the computer device of an embodiment of the present application: the computer device 50 includes a processor 51 and a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the above-mentioned elastic deep learning job scheduling method.
The processor 51 is configured to execute the program instructions stored in the memory 52 to implement the elastic deep learning job scheduling method.

The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capability. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or any conventional processor, etc.

It can be understood that the technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments have been described; however, as long as such combinations are not contradictory, they should be considered within the scope of this specification.

The above are only preferred embodiments of the present application and describe only its technical principles. These descriptions are intended to explain the principles of the present application and cannot be interpreted in any way as limiting its scope of protection. Based on the explanations herein, any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application, and any other specific implementations that a person skilled in the art can conceive without creative effort, shall fall within the scope of protection of the present application.

Claims (10)

  1. An elastic deep learning job scheduling method, characterized by comprising the preemption and return of nodes.
  2. The elastic deep learning job scheduling method according to claim 1, characterized in that the step of preempting a node includes the following steps:
    obtaining the partition configuration and pipeline orchestration;
    partitioning the model of the preemptible node, distributing it into the bubbles of the neighboring nodes, and performing delayed computation;
    receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node;
    determining whether the preemption sequence involves a key node that cannot be preempted;
    if so, reconfiguring the model partitions of all nodes and restoring the state from the checkpoint;
    if not, converting the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighbor nodes of the unloaded node from delayed computation to immediate computation, then adjusting the synchronous communication network topology so that the neighbor nodes join the topology and take the unloaded node's place during the synchronization phase.
  3. The elastic deep learning job scheduling method according to claim 2, characterized in that the step of obtaining the partition configuration and pipeline orchestration specifically includes the following steps:
    obtaining the partition configuration and pipeline orchestration from the scheduler: letting G be the total number of GPUs and P and D the numbers of GPUs used for pipeline parallelism and data parallelism respectively, the partition size satisfies G = P × D, together with the partition hyperparameters mini-batch size and micro-batch size;
    using offline profiling data of the model running on the GPU, namely the per-layer forward computation time F = {f_i}, the per-layer backward propagation time B = {b_i}, the per-layer GPU memory usage M = {m_i} and the GPU memory upper bound sup(M), establishing a partition L = {L_i} satisfying the constraints given in the filing's equation images (not reproduced in this text extraction);
    distributing the corresponding model partitions to each node of the job, and establishing the network topology for pipeline-parallel computation and data-parallel synchronous communication.
  4. The elastic deep learning job scheduling method according to claim 2, characterized in that the step of partitioning the model of the preemptible node, distributing it into the bubbles of the neighboring nodes and performing delayed computation specifically includes the following steps:
    repartitioning the model of the preemptible node N_i evenly at a split point L'_i (the exact expression is given as an equation image in the filing) and storing the two halves on the node's two neighbors in the pipeline, so that the computation of the partitions [L_{i-1}, L'_i] and [L'_i, L_i] is deferred on the neighboring nodes N_{i-1} and N_{i+1} into their bubble time as additional computation, i.e. converted to delayed computation.
  5. The elastic deep learning job scheduling method according to claim 2, characterized in that the step of receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node specifically includes the following steps:
    after receiving the preemption instruction, calculating the optimal preemption sequence according to the partition configuration and the pipeline schedule, i.e. the node sequence satisfying the constraint given in the filing's equation image, and unloading the preempted node.
  6. The elastic deep learning job scheduling method according to claim 1 or 2, characterized in that the step of returning nodes specifically includes the following steps:
    adding the nodes that have completed their jobs to a standby queue, the preempted job obtaining available nodes from the standby queue;
    calculating the optimal return sequence according to the partition configuration and the pipeline schedule;
    the obtained available nodes loading the corresponding model partitions and pipeline configuration, and restoring the state from the checkpoint;
    the returned node asynchronously pulling the intermediate state and gradients from the neighboring nodes, and performing forced synchronization at the next synchronization barrier;
    the additional partitions of the neighboring nodes reverting to delayed computation, computed within the bubble time.
  7. The elastic deep learning job scheduling method according to claim 6, characterized in that, in the step of calculating the optimal return sequence according to the partition configuration and the pipeline schedule, the node sequence satisfying the constraint given in the filing's equation image sends a return message to the neighboring nodes N_{i-1} and N_{i+1} in the return sequence.
  8. The elastic deep learning job scheduling method according to claim 1, characterized in that the step in which the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier specifically includes the following: the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier, i.e. it is guaranteed that after the barrier the data [L_{i-1}, L_i] of node N_i is identical to the data in the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes N_{i-1} and N_{i+1}; the normal synchronous communication network topology is restored after synchronization.
  9. An elastic deep learning job scheduling system, characterized by comprising a processing unit for handling the preemption and return of nodes.
  10. A computer device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the method according to any one of claims 1 to 8 when executing the computer program.
PCT/CN2022/137723 2022-11-18 2022-12-08 Elastic deep learning job scheduling method and system, and computer device WO2024103463A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211443945.1 2022-11-18
CN202211443945.1A CN116069495A (en) 2022-11-18 2022-11-18 Method, system and computer equipment for scheduling elastic deep learning job

Publications (1)

Publication Number Publication Date
WO2024103463A1

Family

ID=86177731

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/137723 WO2024103463A1 (en) 2022-11-18 2022-12-08 Elastic deep learning job scheduling method and system, and computer device

Country Status (2)

Country Link
CN (1) CN116069495A (en)
WO (1) WO2024103463A1 (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105159769A (en) * 2015-09-11 2015-12-16 国电南瑞科技股份有限公司 Distributed job scheduling method suitable for heterogeneous computational capability cluster
CN108319522A (en) * 2018-02-02 2018-07-24 绿欣科技发展(北京)有限公司 A method of reinforcing distributed memory system reliability
CN109213594A (en) * 2017-07-06 2019-01-15 阿里巴巴集团控股有限公司 Method, apparatus, equipment and the computer storage medium that resource is seized
CN113014663A (en) * 2021-03-12 2021-06-22 中南大学 Task and resource matching method supporting cross-node computing task survivability and succession
CN113168569A (en) * 2018-11-30 2021-07-23 国际商业机器公司 Decentralized distributed deep learning
CN114077486A (en) * 2021-11-22 2022-02-22 内蒙古大学 MapReduce task scheduling method and system
KR20220122175A (en) * 2021-02-26 2022-09-02 고려대학교 산학협력단 Massively parallel deep learning method and apparatus
CN115190629A (en) * 2022-07-11 2022-10-14 北京通广龙电子科技有限公司 Distributed dynamic resource allocation method, system, device and storage medium
CN115248728A (en) * 2022-09-21 2022-10-28 之江实验室 Distributed training task scheduling method, system and device for intelligent computing


Also Published As

Publication number Publication date
CN116069495A (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN110619595B (en) Graph calculation optimization method based on interconnection of multiple FPGA accelerators
US10873623B2 (en) Dynamically modifying a cluster of computing nodes used for distributed execution of a program
US20190312772A1 (en) Topology-aware provisioning of hardware accelerator resources in a distributed environment
US8108876B2 (en) Modifying an operation of one or more processors executing message passing interface tasks
US8234652B2 (en) Performing setup operations for receiving different amounts of data while processors are performing message passing interface tasks
US8127300B2 (en) Hardware based dynamic load balancing of message passing interface tasks
US8312464B2 (en) Hardware based dynamic load balancing of message passing interface tasks by modifying tasks
CN114169427B (en) Distributed training method, device and equipment based on end-to-end self-adaptation
US8788879B2 (en) Non-volatile memory for checkpoint storage
US10365980B1 (en) Storage system with selectable cached and cacheless modes of operation for distributed storage virtualization
US8108718B2 (en) Checkpointing in massively parallel processing
US10990435B2 (en) Virtual redundancy for active-standby cloud applications
KR20140080434A (en) Device and method for optimization of data processing in a mapreduce framework
Wang et al. Elasticutor: Rapid elasticity for realtime stateful stream processing
US20090064166A1 (en) System and Method for Hardware Based Dynamic Load Balancing of Message Passing Interface Tasks
Sudarsan et al. ReSHAPE: A framework for dynamic resizing and scheduling of homogeneous applications in a parallel environment
Mei et al. Fault-tolerant dynamic rescheduling for heterogeneous computing systems
US11520673B2 (en) Maintenance operations based on analysis of collected data
Wang et al. An efficient and non-intrusive GPU scheduling framework for deep learning training systems
US20200125461A1 (en) Effective backup of data used by multiple nodes executing parallel processing
Rodríguez-Pascual et al. Job migration in hpc clusters by means of checkpoint/restart
WO2024103463A1 (en) Elastic deep learning job scheduling method and system, and computer device
US10635336B1 (en) Cache-based partition allocation
US10474545B1 (en) Storage system with distributed input-output sequencing
Li et al. Easyscale: Accuracy-consistent elastic training for deep learning