WO2024103463A1 - Elastic deep learning job scheduling method and system, and computer device - Google Patents

Elastic deep learning job scheduling method and system, and computer device

Info

Publication number
WO2024103463A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
nodes
pipeline
preemption
deep learning
Prior art date
Application number
PCT/CN2022/137723
Other languages
French (fr)
Chinese (zh)
Inventor
叶可江 (YE Kejiang)
段诩 (DUAN Xu)
须成忠 (XU Chengzhong)
Original Assignee
深圳先进技术研究院 (Shenzhen Institute of Advanced Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 (Shenzhen Institute of Advanced Technology)
Publication of WO2024103463A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of information technology, and in particular to a method, system and computer device for elastic deep learning job scheduling.
  • Data parallelism and pipeline parallelism are the current mainstream solutions for distributed deep learning training.
  • Data parallelism distributes training data to different nodes for simultaneous calculation, pushes gradients to each node through synchronous communication, and updates model weights.
  • Pipeline parallelism partitions the model and assigns it to different nodes, and divides the mini-batch into micro-batches, so that calculations on different nodes can be parallelized through pipelines.
  • Each node stores only a part of the model parameters, thus avoiding the GPU memory bottleneck.
  • The network communication overhead of pipeline parallelism is lower than that of data parallelism and of inter-operator model parallelism.
  • Mainstream distributed training solutions usually use hybrid parallel training, which combines data parallelism and pipeline parallelism.
  • Dharma Shukla et al. proposed preemptive job scheduling in Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads, achieving transparent preemption and elasticity of nodes by swapping data between GPU memory and CPU memory to time-slice between different jobs.
  • Sanjith Athlur et al. proposed a pipeline-parallel strategy for spot instances in Varuna: Scalable, Low-cost Training of Massive Deep Learning Models, which reconfigures nodes after they are preempted.
  • Varuna uses pipeline schedules and continuous checkpointing to enable jobs to recover from a checkpoint when preempted. John Thorpe et al., in Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs, proposed adding redundant computation to the pipeline's bubbles, so that a preempted node can be replaced by its predecessor as long as no consecutive nodes are preempted; otherwise the system degenerates to reconfiguration and checkpoint recovery.
  • One of the purposes of this application is to provide an elastic deep learning job scheduling method comprising the preemption and return of nodes.
  • the step of preempting a node includes the following steps:
  • The additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighbor nodes of the unloaded node are converted from delayed computation to immediate computation; the synchronous communication network topology is then adjusted so that the neighbor nodes join the topology and take the unloaded node's place during the synchronization phase.
  • the steps of obtaining partition configuration and pipeline orchestration specifically include the following steps:
  • the following steps are specifically included:
  • The step of receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node specifically includes the following steps:
  • The optimal preemption sequence, i.e. the node sequence satisfying the constraint given in the filing's equation image (not reproduced here), is calculated according to the partition configuration and the pipeline schedule, and the preempted node is unloaded.
  • the step of returning the node specifically includes the following steps:
  • the nodes that have completed the job are added to the standby queue, and the preempted job obtains the available nodes from the standby queue.
  • the obtained available nodes load the corresponding model partitions and pipeline configurations, and restore the state from the checkpoint;
  • the returned node asynchronously pulls the intermediate states and gradients from neighboring nodes and performs forced synchronization at the next synchronization barrier;
  • The additional partitions of the neighbor nodes revert to delayed computation and are computed within the bubble time.
  • In the step of calculating the optimal return sequence according to the partition configuration and the pipeline schedule, the node sequence satisfying the constraint given in the filing's equation image sends a return message to the neighboring nodes N_{i-1} and N_{i+1} in the return sequence.
  • The step in which the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier guarantees that, after the barrier, the data [L_{i-1}, L_i] of node N_i is identical to the data in the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes N_{i-1} and N_{i+1}; the normal synchronous communication network topology is restored after synchronization.
  • The second purpose of this application is to provide an elastic deep learning job scheduling system, including a processing unit for handling the preemption and return of nodes.
  • the third object of the present application is to provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the method described is implemented.
  • The elastic deep learning job scheduling method, system and computer device provided in the present application include the preemption and return of nodes, so that nodes of pipeline-parallel jobs in the cluster can be preempted without affecting training.
  • Preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.
  • FIG. 1 is a flowchart of the steps for obtaining the partition configuration and pipeline orchestration provided in Embodiment 1 of the present application.
  • FIG. 2 is a flowchart of the steps of returning a node provided in Embodiment 1 of the present application.
  • FIG. 3 is a schematic diagram of the structure of the computer device provided in Embodiment 3 of the present application.
  • first and second are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, a feature defined as “first” or “second” may explicitly or implicitly include one or more of the features. In the description of this application, the meaning of “plurality” is two or more, unless otherwise clearly and specifically defined.
  • An embodiment of the present application provides an elastic deep learning job scheduling method, including the steps of preempting and returning nodes. The implementation of each step is described in detail below.
  • FIG. 1 shows the steps of preempting a node in this embodiment, comprising the following steps S110 to S160; the implementation of each step is described in detail below.
  • Step S110: Obtain the partition configuration and pipeline orchestration.
  • the steps of obtaining partition configuration and pipeline orchestration specifically include the following steps:
  • Step S113: Distribute the corresponding model partitions to each node of the job, and establish the network topology for pipeline-parallel computation and data-parallel synchronous communication.
  • Step S120: Partition the model of the preemptible node, distribute it into the bubbles of the neighboring nodes, and perform delayed computation.
  • the following steps are specifically included:
  • Step S130: Receive a preemption instruction, calculate the preemption sequence and unload the preempted node.
  • The step of receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node specifically includes the following steps:
  • The optimal preemption sequence, i.e. the node sequence satisfying the constraint given in the filing's equation image, is calculated according to the partition configuration and the pipeline schedule, and the preempted node is unloaded.
  • Step S140: Determine whether the preemption sequence involves a key node that cannot be preempted.
  • Step S150: If so, reconfigure the model partitions of all nodes and restore the state from the checkpoint.
  • Step S160: If not, the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes of the unloaded node are converted from delayed computation to immediate computation; the synchronous communication network topology is then adjusted so that the neighboring nodes join the topology and replace the unloaded node during the synchronization phase.
  • The above steps realize node preemption: the model of a preemptible node is evenly distributed to its neighboring nodes as a backup, so that the backup computation can be completed within the bubble time; when a node is preempted, a neighboring node can seamlessly take over its computation.
  • FIG. 2 is a flowchart of the steps of returning a node in this embodiment, specifically comprising the following steps S210 to S250.
  • The implementation of each step is described in detail below.
  • Step S210: Add the nodes that have completed their jobs to a standby queue; the preempted job obtains available nodes from the standby queue.
  • Step S220: Calculate the optimal return sequence according to the partition configuration and the pipeline schedule.
  • In the step of calculating the optimal return sequence according to the partition configuration and the pipeline schedule, the node sequence satisfying the constraint given in the filing's equation image sends a return message to the neighboring nodes N_{i-1} and N_{i+1} in the return sequence.
  • Step S230: The obtained available nodes load the corresponding model partitions and pipeline configuration, and restore the state from the checkpoint.
  • Step S240: The returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes, and performs forced synchronization at the next synchronization barrier.
  • The step in which the returned node asynchronously pulls the intermediate state and gradients and performs forced synchronization at the next synchronization barrier guarantees that, after the barrier, the data [L_{i-1}, L_i] of node N_i is identical to the data in the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes N_{i-1} and N_{i+1}; the normal synchronous communication network topology is restored after synchronization.
  • Step S250: The additional partitions of the neighboring nodes revert to delayed computation and are computed within the bubble time.
  • The elastic deep learning job scheduling method provided in the present application includes the preemption and return of nodes, so that nodes of pipeline-parallel jobs in the cluster can be preempted without affecting training.
  • Preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.
  • This embodiment also provides an elastic deep learning job scheduling system, including a processing unit for handling the preemption and return of nodes.
  • The elastic deep learning job scheduling system includes the preemption and return of nodes, so that nodes of pipeline-parallel jobs in the cluster can be preempted without affecting training.
  • Preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.
  • the computer device 50 includes a processor 51 and a memory 52 coupled to the processor 51 .
  • The memory 52 stores program instructions for implementing the above-mentioned elastic deep learning job scheduling method.
  • the processor 51 is used to execute program instructions stored in the memory 52 to implement the elastic deep learning job scheduling method.
  • the processor 51 may also be referred to as a CPU (Central Processing Unit).
  • the processor 51 may be an integrated circuit chip having signal processing capabilities.
  • the processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

An elastic deep learning job scheduling method and system, and a computer device, comprising the preemption and return of nodes. In this way, nodes of pipeline-parallel jobs in a cluster can be preempted without affecting training; preemption within a certain range requires neither reconfiguration nor restoration from a checkpoint, and preempted nodes can be returned to their jobs at any time, thereby improving cluster resource utilization and shortening the average job completion time.

Description

Elastic deep learning job scheduling method, system and computer device

Technical Field

The present application relates to the field of information technology, and in particular to an elastic deep learning job scheduling method, system and computer device.

Background Art

Distributed training migrates a model from a single machine with a single GPU to a cluster of multiple machines and multiple GPUs, using the computing resources of many nodes to accelerate training. Data parallelism and pipeline parallelism are the current mainstream approaches to distributed deep learning training. Data parallelism distributes the training data to different nodes for simultaneous computation, pushes gradients to every node through synchronous communication, and updates the model weights. Pipeline parallelism partitions the model across different nodes and divides each mini-batch into micro-batches, so that computations on different nodes can proceed in parallel in a pipelined fashion. Each node stores only a part of the model parameters, which avoids the GPU memory bottleneck. The network communication overhead of pipeline parallelism is lower than that of data parallelism and of inter-operator model parallelism. Mainstream distributed training solutions usually adopt hybrid parallel training, combining data parallelism and pipeline parallelism.
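To make the pipeline-parallel mechanism above concrete, the following minimal sketch (illustrative only, not taken from the filing) lays out a simple fill-and-drain forward schedule for P stages and m micro-batches; the empty cells are the pipeline "bubbles" that the method described later exploits for backup computation:

```python
# Illustrative sketch: which micro-batch each pipeline stage computes at
# each time step in a fill-and-drain forward schedule. "." marks a bubble,
# i.e. an idle slot on that stage.

def pipeline_schedule(num_stages: int, num_microbatches: int):
    steps = num_stages + num_microbatches - 1
    grid = []
    for stage in range(num_stages):
        row = [t - stage if 0 <= t - stage < num_microbatches else None
               for t in range(steps)]
        grid.append(row)
    return grid

grid = pipeline_schedule(num_stages=4, num_microbatches=6)
for stage, row in enumerate(grid):
    print(f"stage {stage}: " + " ".join("." if mb is None else str(mb) for mb in row))
print("bubble slots:", sum(mb is None for row in grid for mb in row))  # 4 * 3 = 12
```

Each stage idles for P - 1 of the P + m - 1 steps, and it is exactly these idle slots that the delayed backup computation described below is scheduled into.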
Dharma Shukla et al. proposed preemptive job scheduling in Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads, achieving transparent preemption and elasticity of nodes by swapping data between GPU memory and CPU memory to time-slice between different jobs. Sanjith Athlur et al. proposed a pipeline-parallel strategy for spot instances in Varuna: Scalable, Low-cost Training of Massive Deep Learning Models, which reconfigures nodes after they are preempted; Varuna uses pipeline schedules and continuous checkpointing so that jobs can recover from a checkpoint when preempted. John Thorpe et al., in Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs, addressed preemption of spot instances by adding redundant computation to the pipeline's bubbles: as long as no consecutive nodes are preempted, any preempted node can be replaced by its predecessor; when consecutive nodes are preempted, the system degenerates to reconfiguration and checkpoint recovery.

Existing cluster scheduling strategies usually preempt computing resources by time-slice switching, recover from a checkpoint after reconfiguration, or keep redundant backups on additional nodes. The drawbacks of these techniques are as follows: both time-slice switching and reconfigured parallelism must restore state from a checkpoint, incurring a large memory-swapping overhead; redundant computation is only viable on spot instances, brings a large extra cost in a cluster, and cannot complete all computation within the bubble time; and there is as yet no general mechanism for preemption that does not require checkpoint recovery.
Summary of the Invention

In view of this, it is necessary to provide, against the defects of the prior art, an elastic deep learning job scheduling method, system and computer device that improve cluster resource utilization and reduce the average job completion time.

To solve the above problems, the present application adopts the following technical solutions:

One purpose of this application is to provide an elastic deep learning job scheduling method comprising the preemption and return of nodes.
In some embodiments, the step of preempting a node includes the following steps:
obtaining the partition configuration and pipeline orchestration;
partitioning the model of the preemptible node, distributing it into the bubbles of the neighboring nodes, and performing delayed computation;
receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node;
determining whether the preemption sequence involves a key node that cannot be preempted;
if so, reconfiguring the model partitions of all nodes and restoring the state from the checkpoint;
if not, converting the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighbor nodes of the unloaded node from delayed computation to immediate computation, then adjusting the synchronous communication network topology so that the neighbor nodes join the topology and take the unloaded node's place during the synchronization phase.
In some embodiments, the step of obtaining the partition configuration and pipeline orchestration specifically includes the following steps:
obtaining the partition configuration and pipeline orchestration from the scheduler: letting G be the total number of GPUs and P and D the numbers of GPUs used for pipeline parallelism and data parallelism respectively, the partition size satisfies G = P × D, together with the partition hyperparameters mini-batch size and micro-batch size;
using offline profiling data of the model running on the GPU, namely the per-layer forward computation time F = {f_i}, the per-layer backward propagation time B = {b_i}, the per-layer GPU memory usage M = {m_i} and the GPU memory upper bound sup(M), establishing a partition L = {L_i} satisfying the constraints given in the filing's equation images (not reproduced in this text extraction);
distributing the corresponding model partitions to each node of the job, and establishing the network topology for pipeline-parallel computation and data-parallel synchronous communication.
In some embodiments, the step of partitioning the model of the preemptible node, distributing it into the bubbles of the neighboring nodes and performing delayed computation specifically includes the following steps:
repartitioning the model of the preemptible node N_i evenly at a split point L'_i (the exact expression is given as an equation image in the filing) and storing the two halves on the node's two neighbors in the pipeline, so that the computation of the partitions [L_{i-1}, L'_i] and [L'_i, L_i] is deferred on the neighboring nodes N_{i-1} and N_{i+1} into their bubble time as additional computation, i.e. converted to delayed computation.
In some embodiments, the step of receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node specifically includes the following steps:
after receiving the preemption instruction, calculating the optimal preemption sequence according to the partition configuration and the pipeline schedule, i.e. the node sequence satisfying the constraint given in the filing's equation image, and unloading the preempted node.
In some embodiments, the step of returning nodes specifically includes the following steps:
adding the nodes that have completed their jobs to a standby queue, the preempted job obtaining available nodes from the standby queue;
calculating the optimal return sequence according to the partition configuration and the pipeline schedule;
the obtained available nodes loading the corresponding model partitions and pipeline configuration, and restoring the state from the checkpoint;
the returned node asynchronously pulling the intermediate state and gradients from the neighboring nodes, and performing forced synchronization at the next synchronization barrier;
the additional partitions of the neighboring nodes reverting to delayed computation, computed within the bubble time.
In some embodiments, in the step of calculating the optimal return sequence according to the partition configuration and the pipeline schedule, the node sequence satisfying the constraint given in the filing's equation image sends a return message to the neighboring nodes N_{i-1} and N_{i+1} in the return sequence.
In some embodiments, the step in which the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier specifically includes the following: the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier, i.e. it is guaranteed that after the barrier the data [L_{i-1}, L_i] of node N_i is identical to the data in the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes N_{i-1} and N_{i+1}; the normal synchronous communication network topology is restored after synchronization.

A second purpose of this application is to provide an elastic deep learning job scheduling system, including a processing unit for handling the preemption and return of nodes.

A third purpose of this application is to provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the described method when executing the computer program.

The above technical solutions adopted by this application have the following beneficial effects:

The elastic deep learning job scheduling method, system and computer device provided in the present application include the preemption and return of nodes, so that nodes of pipeline-parallel jobs in a cluster can be preempted without affecting training; preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.
Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of this application more clearly, the drawings needed in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; for a person of ordinary skill in the art, other drawings can be derived from them without creative effort.

FIG. 1 is a flowchart of the steps for obtaining the partition configuration and pipeline orchestration provided in Embodiment 1 of the present application.

FIG. 2 is a flowchart of the steps of returning a node provided in Embodiment 1 of the present application.

FIG. 3 is a schematic diagram of the structure of the computer device provided in Embodiment 3 of the present application.
Detailed Description of the Embodiments

The embodiments of the present application are described in detail below, with examples shown in the accompanying drawings, where identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, intended to explain the present application, and are not to be construed as limiting it.

In the description of this application, it should be understood that terms indicating orientation or positional relationships, such as "upper", "lower", "horizontal", "inner" and "outer", are based on the orientations or positional relationships shown in the drawings, are used only for convenience and brevity of description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be understood as limiting the present application.

In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of this application, "plurality" means two or more, unless otherwise clearly and specifically defined.

To make the purposes, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the drawings and embodiments.
Embodiment 1

An embodiment of the present application provides an elastic deep learning job scheduling method, including the steps of preempting and returning nodes; the implementation of each step is described in detail below.

Referring to FIG. 1, the step of preempting a node in this embodiment includes the following steps S110 to S160; the implementation of each step is described in detail below.

Step S110: Obtain the partition configuration and pipeline orchestration.
In some embodiments, the step of obtaining the partition configuration and pipeline orchestration specifically includes the following steps:

Step S111: Obtain the partition configuration and pipeline orchestration from the scheduler. Let G be the total number of GPUs and P and D the numbers of GPUs used for pipeline parallelism and data parallelism respectively; then the partition size satisfies G = P × D, together with the partition hyperparameters mini-batch size and micro-batch size.

Step S112: Using offline profiling data of the model running on the GPU, namely the per-layer forward computation time F = {f_i}, the per-layer backward propagation time B = {b_i}, the per-layer GPU memory usage M = {m_i} and the GPU memory upper bound sup(M), establish a partition L = {L_i} satisfying the constraint given in the filing's equation image (not reproduced in this text extraction).

Step S113: Distribute the corresponding model partitions to each node of the job, and establish the network topology for pipeline-parallel computation and data-parallel synchronous communication.

Through the above steps S111 to S113, the partition configuration and pipeline orchestration can be obtained; a hypothetical sketch of step S112 follows.
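The filing gives the exact partitioning constraint of step S112 only as an equation image; the sketch below assumes one common formulation, balancing per-stage compute time f_i + b_i across P contiguous stages while keeping each stage's memory below sup(M). Function and variable names are illustrative, and the sketch assumes sup(M) is loose enough that P stages suffice:

```python
# Hypothetical sketch of step S112 under an assumed balanced-partition
# objective (the filing's exact constraints are not reproduced here).

def partition_layers(F, B, M, sup_M, P):
    """Greedily split layers 0..n-1 into P contiguous stages (half-open ranges)."""
    n = len(F)
    cost = [f + b for f, b in zip(F, B)]   # per-layer compute: f_i + b_i
    target = sum(cost) / P                 # ideal per-stage compute time
    stages, start, c_acc, m_acc = [], 0, 0.0, 0.0
    for i in range(n):
        must_close = m_acc + M[i] > sup_M              # memory bound sup(M)
        want_close = c_acc >= target and len(stages) < P - 1
        if i > start and (must_close or want_close):
            stages.append((start, i))                  # close the current stage
            start, c_acc, m_acc = i, 0.0, 0.0
        c_acc += cost[i]
        m_acc += M[i]
    stages.append((start, n))
    return stages

F = [2, 2, 3, 3, 2, 2, 3, 3]   # per-layer forward times f_i
B = [4, 4, 6, 6, 4, 4, 6, 6]   # per-layer backward times b_i
M = [1, 1, 1, 1, 1, 1, 1, 1]   # per-layer memory usage m_i
print(partition_layers(F, B, M, sup_M=4, P=4))  # [(0, 3), (3, 5), (5, 7), (7, 8)]
```

A production partitioner would typically use dynamic programming over the profiled costs rather than this greedy pass, but the inputs F, B, M and sup(M) are the same ones named in step S112.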
Step S120: Partition the model of the preemptible node, distribute it into the bubbles of the neighboring nodes, and perform delayed computation.

In some embodiments, this step specifically includes the following: repartition the model of the preemptible node N_i evenly at a split point L'_i (the exact expression is given as an equation image in the filing) and store the two halves on the node's two neighbors in the pipeline, so that the computation of the partitions [L_{i-1}, L'_i] and [L'_i, L_i] is deferred on the neighboring nodes N_{i-1} and N_{i+1} into their bubble time as additional computation, i.e. converted to delayed computation; a sketch follows.
Step S130: Receive a preemption instruction, calculate the preemption sequence and unload the preempted node.

In some embodiments, this step specifically includes the following: after receiving the preemption instruction, calculate the optimal preemption sequence according to the partition configuration and the pipeline schedule, i.e. the node sequence satisfying the constraint given in the filing's equation image, and unload the preempted node.
Step S140: Determine whether the preemption sequence involves a key node that cannot be preempted.

Step S150: If so, reconfigure the model partitions of all nodes and restore the state from the checkpoint.

Step S160: If not, convert the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes of the unloaded node from delayed computation to immediate computation, then adjust the synchronous communication network topology so that the neighboring nodes join the topology and take the unloaded node's place during the synchronization phase.

It can be understood that the above steps realize node preemption: the model of a preemptible node is evenly distributed to its neighboring nodes as a backup, so that the backup computation can be completed within the bubble time, and when a node is preempted, a neighboring node can seamlessly take over its computation, as the following sketch illustrates.
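The following hypothetical sketch ties steps S130 to S160 together: if the victim is a key node, the method falls back to full reconfiguration and checkpoint restore (S150); otherwise the neighbors' delayed backups are promoted to immediate computation and the synchronization topology is patched (S160). All names are illustrative:

```python
# Hypothetical sketch of steps S130-S160 on plain dicts; the actual
# sequence computation and reconfiguration path are stubbed out.

def preempt(nodes, sync_topology, victim, key_nodes):
    if victim in key_nodes:
        # S140/S150: a key node is involved -- reconfigure every node's
        # model partition and restore state from the checkpoint (not shown).
        return "reconfigure_and_restore"
    nodes[victim]["active"] = False                 # S130: unload the victim
    for nb in (victim - 1, victim + 1):             # S160: delayed -> immediate
        nodes[nb]["immediate"].extend(nodes[nb]["delayed"])
        nodes[nb]["delayed"].clear()
    sync_topology.discard(victim)                   # victim leaves the topology
    sync_topology.update({victim - 1, victim + 1})  # neighbors sync in its place
    return "neighbors_take_over"

nodes = [{"active": True, "delayed": [], "immediate": []} for _ in range(4)]
nodes[0]["delayed"], nodes[2]["delayed"] = [(4, 6)], [(6, 8)]  # backups of N_1
print(preempt(nodes, sync_topology={0, 1, 2, 3}, victim=1, key_nodes={0}))
print(nodes[0]["immediate"], nodes[2]["immediate"])  # [(4, 6)] [(6, 8)]
```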
Referring to FIG. 2, the flow of returning a node in this embodiment specifically includes the following steps S210 to S250; the implementation of each step is described in detail below.

Step S210: Add the nodes that have completed their jobs to a standby queue; the preempted job obtains available nodes from the standby queue.

Step S220: Calculate the optimal return sequence according to the partition configuration and the pipeline schedule.
In some embodiments, in the step of calculating the optimal return sequence according to the partition configuration and the pipeline schedule, the node sequence satisfying the constraint given in the filing's equation image sends a return message to the neighboring nodes N_{i-1} and N_{i+1} in the return sequence.
Step S230: The obtained available nodes load the corresponding model partitions and pipeline configuration, and restore the state from the checkpoint.

Step S240: The returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes, and performs forced synchronization at the next synchronization barrier.

In some embodiments, this step specifically includes the following: the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier, i.e. it is guaranteed that after the barrier the data [L_{i-1}, L_i] of node N_i is identical to the data in the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes N_{i-1} and N_{i+1}; the normal synchronous communication network topology is restored after synchronization.

Step S250: The additional partitions of the neighboring nodes revert to delayed computation and are computed within the bubble time.

It can be understood that, through the above steps, the return of a node is achieved; a sketch of the whole return flow follows.
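A hypothetical end-to-end sketch of steps S210 to S250; names are illustrative, and the asynchronous pull and synchronization barrier are reduced to bookkeeping on plain dicts:

```python
# Hypothetical sketch of steps S210-S250: take a node from the standby
# queue, restore it into a vacated pipeline slot, and revert the neighbors'
# extra partitions to delayed (bubble-time) computation.

from collections import deque

standby = deque()

def return_node(nodes, sync_topology, slot):
    node = standby.popleft()                   # S210: take an available node
    nodes[slot]["active"] = True               # S230: load partition + pipeline
    nodes[slot]["state"] = "checkpoint"        #       config, restore checkpoint
    for nb in (slot - 1, slot + 1):
        nodes[slot]["pulled_from"].append(nb)  # S240: async pull, forced sync
                                               #       at the next barrier
        nodes[nb]["delayed"].extend(nodes[nb]["immediate"])  # S250: revert to
        nodes[nb]["immediate"].clear()                       #       delayed
    sync_topology.add(slot)                    # normal sync topology restored
    return node

standby.append("freed-node-7")
nodes = [{"active": True, "delayed": [], "immediate": [], "pulled_from": []}
         for _ in range(4)]
nodes[1]["active"] = False                     # slot vacated by a preemption
nodes[0]["immediate"], nodes[2]["immediate"] = [(4, 6)], [(6, 8)]
print(return_node(nodes, sync_topology={0, 2, 3}, slot=1))  # freed-node-7
```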
The elastic deep learning job scheduling method provided in the present application includes the preemption and return of nodes, so that nodes of pipeline-parallel jobs in a cluster can be preempted without affecting training; preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.

Embodiment 2

This embodiment further provides an elastic deep learning job scheduling system, including a processing unit for handling the preemption and return of nodes.

For the detailed operation of the system provided in this Embodiment 2, refer to Embodiment 1; it is not repeated here.

The elastic deep learning job scheduling system provided in the present application includes the preemption and return of nodes, so that nodes of pipeline-parallel jobs in a cluster can be preempted without affecting training; preemption within a certain range requires neither reconfiguration nor checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the resource utilization of the cluster and reducing the average job completion time.
Embodiment 3

Referring to FIG. 3, a schematic diagram of the structure of the computer device of an embodiment of the present application: the computer device 50 includes a processor 51 and a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the above-mentioned elastic deep learning job scheduling method.
The processor 51 is configured to execute the program instructions stored in the memory 52 to implement the elastic deep learning job scheduling method.

The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capability. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or any conventional processor, etc.

It can be understood that the technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments have been described; however, as long as such combinations are not contradictory, they should be considered within the scope of this specification.

The above are only preferred embodiments of the present application and describe only its technical principles. These descriptions are intended to explain the principles of the present application and cannot be interpreted in any way as limiting its scope of protection. Based on the explanations herein, any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application, and any other specific implementations that a person skilled in the art can conceive without creative effort, shall fall within the scope of protection of the present application.

Claims (10)

  1. An elastic deep learning job scheduling method, characterized by comprising the preemption and return of nodes.
  2. The elastic deep learning job scheduling method according to claim 1, characterized in that the step of preempting a node includes the following steps:
    obtaining the partition configuration and pipeline orchestration;
    partitioning the model of the preemptible node, distributing it into the bubbles of the neighboring nodes, and performing delayed computation;
    receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node;
    determining whether the preemption sequence involves a key node that cannot be preempted;
    if so, reconfiguring the model partitions of all nodes and restoring the state from the checkpoint;
    if not, converting the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighbor nodes of the unloaded node from delayed computation to immediate computation, then adjusting the synchronous communication network topology so that the neighbor nodes join the topology and take the unloaded node's place during the synchronization phase.
  3. The elastic deep learning job scheduling method according to claim 2, characterized in that the step of obtaining the partition configuration and pipeline orchestration specifically includes the following steps:
    obtaining the partition configuration and pipeline orchestration from the scheduler: letting G be the total number of GPUs and P and D the numbers of GPUs used for pipeline parallelism and data parallelism respectively, the partition size satisfies G = P × D, together with the partition hyperparameters mini-batch size and micro-batch size;
    using offline profiling data of the model running on the GPU, namely the per-layer forward computation time F = {f_i}, the per-layer backward propagation time B = {b_i}, the per-layer GPU memory usage M = {m_i} and the GPU memory upper bound sup(M), establishing a partition L = {L_i} satisfying the constraints given in the filing's equation images (not reproduced in this text extraction);
    distributing the corresponding model partitions to each node of the job, and establishing the network topology for pipeline-parallel computation and data-parallel synchronous communication.
  4. The elastic deep learning job scheduling method according to claim 2, characterized in that the step of partitioning the model of the preemptible node, distributing it into the bubbles of the neighboring nodes and performing delayed computation specifically includes the following steps:
    repartitioning the model of the preemptible node N_i evenly at a split point L'_i (the exact expression is given as an equation image in the filing) and storing the two halves on the node's two neighbors in the pipeline, so that the computation of the partitions [L_{i-1}, L'_i] and [L'_i, L_i] is deferred on the neighboring nodes N_{i-1} and N_{i+1} into their bubble time as additional computation, i.e. converted to delayed computation.
  5. The elastic deep learning job scheduling method according to claim 2, characterized in that the step of receiving a preemption instruction, calculating the preemption sequence and unloading the preempted node specifically includes the following steps:
    after receiving the preemption instruction, calculating the optimal preemption sequence according to the partition configuration and the pipeline schedule, i.e. the node sequence satisfying the constraint given in the filing's equation image, and unloading the preempted node.
  6. The elastic deep learning job scheduling method according to claim 1 or 2, characterized in that the step of returning nodes specifically includes the following steps:
    adding the nodes that have completed their jobs to a standby queue, the preempted job obtaining available nodes from the standby queue;
    calculating the optimal return sequence according to the partition configuration and the pipeline schedule;
    the obtained available nodes loading the corresponding model partitions and pipeline configuration, and restoring the state from the checkpoint;
    the returned node asynchronously pulling the intermediate state and gradients from the neighboring nodes, and performing forced synchronization at the next synchronization barrier;
    the additional partitions of the neighboring nodes reverting to delayed computation, computed within the bubble time.
  7. The elastic deep learning job scheduling method according to claim 6, characterized in that, in the step of calculating the optimal return sequence according to the partition configuration and the pipeline schedule, the node sequence satisfying the constraint given in the filing's equation image sends a return message to the neighboring nodes N_{i-1} and N_{i+1} in the return sequence.
  8. The elastic deep learning job scheduling method according to claim 1, characterized in that the step in which the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier specifically includes the following: the returned node asynchronously pulls the intermediate state and gradients from the neighboring nodes and performs forced synchronization at the next synchronization barrier, i.e. it is guaranteed that after the barrier the data [L_{i-1}, L_i] of node N_i is identical to the data in the additional partitions [L_{i-1}, L'_i] and [L'_i, L_i] stored on the neighboring nodes N_{i-1} and N_{i+1}; the normal synchronous communication network topology is restored after synchronization.
  9. An elastic deep learning job scheduling system, characterized by comprising a processing unit for handling the preemption and return of nodes.
  10. A computer device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the method according to any one of claims 1 to 8 when executing the computer program.
PCT/CN2022/137723 2022-11-18 2022-12-08 Elastic deep learning job scheduling method and system, and computer device WO2024103463A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211443945.1 2022-11-18
CN202211443945.1A CN116069495A (en) 2022-11-18 2022-11-18 Method, system and computer equipment for scheduling elastic deep learning job

Publications (1)

Publication Number Publication Date
WO2024103463A1

Family

ID=86177731

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/137723 WO2024103463A1 (en) 2022-11-18 2022-12-08 Elastic deep learning job scheduling method and system, and computer device

Country Status (2)

Country Link
CN (1) CN116069495A (en)
WO (1) WO2024103463A1 (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105159769A (en) * 2015-09-11 2015-12-16 国电南瑞科技股份有限公司 Distributed job scheduling method suitable for heterogeneous computational capability cluster
CN108319522A (en) * 2018-02-02 2018-07-24 绿欣科技发展(北京)有限公司 A method of reinforcing distributed memory system reliability
CN109213594A (en) * 2017-07-06 2019-01-15 阿里巴巴集团控股有限公司 Method, apparatus, equipment and the computer storage medium that resource is seized
CN113014663A (en) * 2021-03-12 2021-06-22 中南大学 Task and resource matching method supporting cross-node computing task survivability and succession
CN113168569A (en) * 2018-11-30 2021-07-23 国际商业机器公司 Decentralized distributed deep learning
CN114077486A (en) * 2021-11-22 2022-02-22 内蒙古大学 MapReduce task scheduling method and system
KR20220122175A (en) * 2021-02-26 2022-09-02 고려대학교 산학협력단 Massively parallel deep learning method and apparatus
CN115190629A (en) * 2022-07-11 2022-10-14 北京通广龙电子科技有限公司 Distributed dynamic resource allocation method, system, device and storage medium
CN115248728A (en) * 2022-09-21 2022-10-28 之江实验室 Distributed training task scheduling method, system and device for intelligent computing


Also Published As

Publication number Publication date
CN116069495A (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN110619595B (en) Graph calculation optimization method based on interconnection of multiple FPGA accelerators
US10873623B2 (en) Dynamically modifying a cluster of computing nodes used for distributed execution of a program
US20190312772A1 (en) Topology-aware provisioning of hardware accelerator resources in a distributed environment
US8108876B2 (en) Modifying an operation of one or more processors executing message passing interface tasks
US8234652B2 (en) Performing setup operations for receiving different amounts of data while processors are performing message passing interface tasks
US8127300B2 (en) Hardware based dynamic load balancing of message passing interface tasks
US8312464B2 (en) Hardware based dynamic load balancing of message passing interface tasks by modifying tasks
CN114169427B (en) Distributed training method, device and equipment based on end-to-end self-adaptation
US8788879B2 (en) Non-volatile memory for checkpoint storage
US10365980B1 (en) Storage system with selectable cached and cacheless modes of operation for distributed storage virtualization
US8108718B2 (en) Checkpointing in massively parallel processing
US10990435B2 (en) Virtual redundancy for active-standby cloud applications
KR20140080434A (en) Device and method for optimization of data processing in a mapreduce framework
Wang et al. Elasticutor: Rapid elasticity for realtime stateful stream processing
US20090064166A1 (en) System and Method for Hardware Based Dynamic Load Balancing of Message Passing Interface Tasks
Sudarsan et al. ReSHAPE: A framework for dynamic resizing and scheduling of homogeneous applications in a parallel environment
Mei et al. Fault-tolerant dynamic rescheduling for heterogeneous computing systems
US11520673B2 (en) Maintenance operations based on analysis of collected data
Wang et al. An efficient and non-intrusive GPU scheduling framework for deep learning training systems
US20200125461A1 (en) Effective backup of data used by multiple nodes executing parallel processing
Rodríguez-Pascual et al. Job migration in hpc clusters by means of checkpoint/restart
WO2024103463A1 (en) Elastic deep learning job scheduling method and system, and computer device
US10635336B1 (en) Cache-based partition allocation
US10474545B1 (en) Storage system with distributed input-output sequencing
Li et al. Easyscale: Accuracy-consistent elastic training for deep learning