WO2024060788A1 - Intelligent-computing-oriented adaptive adjustment system and method for pipeline-parallel training - Google Patents

Intelligent-computing-oriented adaptive adjustment system and method for pipeline-parallel training

Info

Publication number
WO2024060788A1
Authority
WO
WIPO (PCT)
Prior art keywords
computing
computing node
node
current
sub
Prior art date
Application number
PCT/CN2023/105618
Other languages
French (fr)
Chinese (zh)
Inventor
朱世强
李勇
程稳
陈�光
曾令仿
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室 filed Critical 之江实验室
Priority to JP2023573533A priority Critical patent/JP2024535971A/en
Publication of WO2024060788A1 publication Critical patent/WO2024060788A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to the field of intelligent computing, and in particular to an adaptive adjustment system and method for pipeline parallel training oriented to intelligent computing.
  • pipeline parallelism is a common distributed training method.
  • the pipeline parallel method divides the model into multiple stages by layer, and each stage is deployed on a GPU.
  • forward computation proceeds sequentially through the stages, the loss function is computed at the last stage, and backward computation then proceeds from the last stage back to the first stage.
  • the idle waiting time between forward and reverse calculations at different stages in the entire process is not the same.
  • by executing multiple mini-batches in each stage at the same time, or splitting a mini-batch into multiple micro-batches for concurrent execution,
  • multiple stages of pipeline execution are realized, reducing GPU idle waiting time and improving efficiency.
  • different layers of a deep learning model have different computing, memory, and communication requirements; balancing computing, memory, and network resources across the stages is crucial to improving the efficiency of pipeline parallel computing. A minimal sketch of such a layer-to-stage division follows.
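  • As an illustration only (not part of the claimed method): the following minimal Python sketch divides a model's layers into pipeline stages by balancing a hypothetical per-layer compute cost. The costs, stage count, and greedy heuristic are all assumptions for illustration; this is the kind of static division that the invention argues degrades once the model changes.

```python
# Hypothetical per-layer compute costs; a real profiler would supply these.
layer_costs = [4, 2, 7, 3, 5, 6, 1, 2]

def split_into_stages(costs, num_stages):
    """Greedily group consecutive layers into `num_stages` stages with
    roughly equal total cost (a static division, fixed before training)."""
    target = sum(costs) / num_stages
    stages, current, acc = [], [], 0.0
    for i, c in enumerate(costs):
        current.append(i)
        acc += c
        layers_left = len(costs) - i - 1
        stages_left = num_stages - len(stages) - 1
        if len(stages) < num_stages - 1 and (acc >= target or layers_left == stages_left):
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages

for s, layers in enumerate(split_into_stages(layer_costs, num_stages=4)):
    load = sum(layer_costs[i] for i in layers)
    print(f"stage {s} (one GPU): layers {layers}, load {load}")
```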
  • the purpose of the present invention is to provide an adaptive adjustment system and method for pipeline parallel training oriented to intelligent computing, which solves the prior-art problem that static layer division suffers reduced overall efficiency once the model changes.
  • Embodiments of the present invention provide an adaptive adjustment system for pipeline parallel training for intelligent computing.
  • the computing cluster includes multiple computing nodes.
  • the multiple computing nodes can communicate with each other.
  • Each computing node includes at least one CPU and at least one GPU.
  • the model to be trained includes multiple layers of sub-models.
  • the training process of the model to be trained includes a forward calculation stage and a reverse calculation stage: in the forward calculation stage, parameters are transferred in sequence from the first-layer sub-model of the multi-layer sub-models to the last-layer sub-model; in the reverse calculation stage, parameters are transferred in sequence from the last-layer sub-model back to the first-layer sub-model; each computing node is used to train at least one sub-model. The system includes:
  • a monitoring module, which monitors and collects the resource operating status of each computing node in the computing cluster, determines from that status whether each node's computing task division is balanced, and, when a node's computing task division is unbalanced, determines the node's imbalance type;
  • an adjustment module, which, when a computing node's computing task division is unbalanced, determines an adjustment strategy according to the node's imbalance type and adjusts the distribution of sub-models in the computing cluster according to that strategy;
  • the adjustment includes at least one of the following: migrating at least some layers of at least some sub-models of the unbalanced computing node to other computing nodes; controlling the unbalanced computing node to perform CPU-GPU memory swapping or recomputation, or to cancel the CPU-GPU memory swapping or recomputation currently in progress; and adjusting the network topology of the computing cluster.
  • the present invention also provides a pipeline parallel training adaptive adjustment method for intelligent computing, wherein a computing cluster includes multiple computing nodes, the multiple computing nodes can communicate with each other, each computing node includes at least one CPU and at least one GPU, the model to be trained includes multiple layers of sub-models, and the training process of the model to be trained includes a forward calculation stage and a reverse calculation stage, wherein in the forward calculation stage, parameters are sequentially transferred from the first layer of sub-models of the multiple layers of the sub-models to the last layer of sub-models, and in the reverse calculation stage, parameters are sequentially transferred from the last layer of sub-models to the first layer of sub-models, and each computing node is used to train at least one sub-model; the method includes:
  • the monitoring module monitors and collects the resource operating status of each computing node in the computing cluster, determines from that status whether each node's computing task division is balanced, and, when a node's computing task division is unbalanced, determines the node's imbalance type;
  • the adjustment module determines an adjustment strategy according to the imbalance type of the computing node, and adjusts the distribution of sub-models in the computing cluster according to the adjustment strategy;
  • the adjustment includes at least one of the following: migrating at least some layers of at least some sub-models of the unbalanced computing node to other computing nodes; controlling the unbalanced computing node to perform CPU-GPU memory swapping or recomputation, or to cancel the CPU-GPU memory swapping or recomputation currently in progress; and adjusting the network topology of the computing cluster.
  • the invention also provides an adaptive adjustment device for pipeline parallel training for intelligent computing, which includes a memory and one or more processors, with executable code stored in the memory.
  • when the one or more processors execute the executable code, they implement the above-mentioned adaptive adjustment method of pipeline parallel training for intelligent computing.
  • the present invention also provides a computer-readable storage medium on which a program is stored.
  • when the program is executed by a processor, the above-mentioned adaptive adjustment method for pipeline parallel training for intelligent computing is implemented.
  • when a computing node's computing task division is unbalanced, the adjustment module dynamically adjusts the distribution of sub-models in the computing cluster, effectively improving the dynamic adaptability of pipeline parallelism and the GPU utilization of the intelligent computing cluster.
  • Figure 1 is a schematic structural diagram of a computing cluster provided by an embodiment of the present invention.
  • Figure 2 is a schematic structural diagram of a pipeline parallel training adaptive adjustment system for intelligent computing provided by an embodiment of the present invention
  • Figure 3 is a schematic flow chart of a calculation adjustment strategy provided by an embodiment of the present invention.
  • Figure 4 is a schematic flowchart of a memory adjustment strategy provided by an embodiment of the present invention.
  • Figure 5 is a schematic flowchart of a topology adjustment strategy provided by an embodiment of the present invention.
  • Figure 6 is a schematic flowchart of an adaptive adjustment method for pipelined parallel training for intelligent computing provided by an embodiment of the present invention
  • FIG. 7 is a schematic structural diagram of an adaptive adjustment device for pipeline parallel training for intelligent computing provided by an embodiment of the present invention.
  • Computing clusters generally support multiple tenants, especially in public cloud scenarios.
  • the performance of computing nodes in the computing cluster will also change due to changes in shared tasks. Therefore, an adaptive adjustment method for pipeline parallel layer division is urgently needed to adapt to dynamically changing scenarios.
  • a computing cluster in an embodiment of the present invention may include multiple computing nodes.
  • the multiple computing nodes can communicate with each other.
  • Each computing node includes at least one CPU and at least one GPU.
  • the computing cluster may include computing node 1, computing node 2, ..., computing node N, where N is a positive integer, and N is greater than or equal to 3.
  • the model to be trained in the embodiment of the present invention may be a neural network model or other types of models, such as a mathematical model to be trained.
  • the model to be trained may include multiple layers of sub-models, and the model to be trained is trained in a pipeline parallel manner.
  • the training process of the model to be trained includes a forward calculation stage and a reverse calculation stage.
  • the forward calculation stage the parameters are sequentially transferred from the first layer of the multi-layer sub-model to the last layer of the sub-model
  • the reverse calculation stage the parameters are sequentially transferred from the last layer of the sub-model to the first layer of the sub-model.
  • a round of training iteration, i.e., a round of the training process, may also be referred to as a training iteration or a training process.
  • the model to be trained is a neural network model.
  • the neural network model includes a first layer network, a second layer network, a third layer network and a fourth layer network.
  • the first layer network, the second layer network, the third layer network The network and the fourth-layer network are connected sequentially.
  • the first-layer network is the first-layer sub-model
  • the fourth-layer network is the last-layer sub-model.
  • in the forward calculation phase, parameters are transferred in sequence from the first-layer network to the second-layer, third-layer, and fourth-layer networks; in the reverse calculation phase, parameters are transferred in sequence from the fourth-layer network to the third-layer, second-layer, and first-layer networks.
  • it should be noted that:
  • the types of the first layer network, the second layer network and the third layer network can be designed according to needs.
  • the first layer network is the input layer
  • the second layer network is the convolution layer
  • the third layer network is the pooling layer
  • the fourth-layer network is the output layer.
  • Each computing node in the computing cluster is used to train at least one sub-model, that is, each computing node is assigned at least one sub-model, thereby improving the GPU utilization of the intelligent computing cluster.
  • an embodiment of the present invention provides a pipeline parallel training adaptive adjustment system for intelligent computing.
  • the system may include a monitoring module 10 and an adjustment module 20 .
  • the monitoring module 10 is responsible for monitoring and collecting the resource operating status of each computing node in the computing cluster, determining from that status whether each node's computing task division is balanced, and, when a computing node's computing tasks are divided unbalancedly, determining the imbalance type of that node.
  • when the monitoring module 10 determines that the computing cluster contains computing nodes whose computing task division is unbalanced, it notifies the adjustment module 20 of those nodes and the corresponding imbalance types.
  • for example, the monitoring module 10 monitors and collects the resource operating status of computing node 1, computing node 2, ..., computing node N; based on that status it determines that the computing task division of computing node 2 is unbalanced and further determines computing node 2's imbalance type. The monitoring module 10 then notifies the adjustment module 20 of the imbalance in computing node 2's task division and its type.
  • the adjustment module 20 is used to determine an adjustment strategy according to the imbalance type of the computing node when the computing tasks of the computing nodes are unevenly divided, and adjust the distribution of sub-models in the computing cluster according to the adjustment strategy.
  • when the adjustment module 20 receives indication information from the monitoring module 10 indicating that the computing cluster contains computing nodes whose computing task division is unbalanced, it determines the adjustment strategy according to the imbalance type of those nodes and adjusts the distribution of sub-models in the computing cluster according to that strategy.
  • the indication information carries the imbalance type of the computing nodes whose computing tasks are divided unbalancedly.
  • adjusting the distribution of sub-models in the computing cluster may include at least one of the following methods: migrating at least some layers of at least some sub-models of the unbalanced computing node to other computing nodes; controlling the node to perform, or to cancel, CPU-GPU memory swapping or recomputation; and adjusting the network topology of the computing cluster.
  • the adjustment module 20 of the pipeline parallel training adaptive adjustment system for intelligent computing of the embodiment of the present invention dynamically adjusts the distribution of sub-models in the computing cluster when a node's computing tasks are unbalanced, effectively improving the dynamic adaptability of pipeline parallelism and the GPU utilization of intelligent computing clusters.
  • the resource operating status may include information such as computing delay, GPU utilization, network transmission delay, and memory usage; that is, the monitoring module 10 monitors and collects the computing delay, GPU utilization, network transmission delay, memory usage, and similar information of each computing node in the computing cluster. Specifically, during each training iteration the monitoring module 10 monitors and collects this information for both the forward and the backward calculation phase. Collecting comprehensive operating information in this way helps in selecting subsequent adjustment strategies, thereby effectively improving the GPU utilization of the computing cluster. It can be understood that in other embodiments the resource operating status may include only part of this information.
  • the way in which the monitoring module 10 collects the resource operating status of each computing node during each training iteration can be set as needed. For example, in some embodiments, after each round of iterative training, each computing node in the cluster sends its computing delay, GPU utilization, network transmission delay, memory usage, and other information for that round to the monitoring module 10.
  • in other embodiments, the monitoring module 10 actively reads the computing delay, GPU utilization, network transmission delay, memory usage, and other information of each computing node from the nodes themselves.
  • for instance, the monitoring module 10 can periodically read this information from each computing node in the cluster, with the reading cycle set as needed.
  • for example, the monitoring module 10 reads the computing delay, GPU utilization, network transmission delay, memory usage, and other information of each computing node every 10 minutes. A pull-based sketch of such a collection cycle follows.
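  • Purely as an illustration of such a pull-based collection cycle, here is a minimal Python sketch. The patent does not specify an API; `NodeStats`, `read_node_stats`, and the field names are assumptions.

```python
import time
from dataclasses import dataclass

@dataclass
class NodeStats:
    compute_delay: float  # per-iteration computing delay (seconds)
    gpu_util: float       # GPU utilization, 0.0 .. 1.0
    net_delay: float      # network transmission delay (milliseconds)
    mem_usage: float      # fraction of GPU memory in use, 0.0 .. 1.0

def read_node_stats(node_id: int) -> NodeStats:
    # Placeholder: a real monitor would query the node (e.g. over RPC)
    # for its latest forward/backward-phase metrics.
    return NodeStats(1.0, 0.5, 10.0, 0.6)

def monitor(node_ids, period_s=600):
    """Read every node's stats on a fixed cycle; the 10-minute default
    mirrors the example period given in the text."""
    while True:
        yield {n: read_node_stats(n) for n in node_ids}
        time.sleep(period_s)
```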
  • the monitoring module 10 determines whether a computing node's computing task division is balanced based on each node's resource operating status. Specifically, when, based on the current computing node's resource operating status, at least one of the following cases is found to hold, the computing task division of the current computing node is determined to be unbalanced. Case 1: the computing delay of the current computing node is greater than or equal to a preset delay threshold.
  • the size of the preset delay threshold can be set as needed. For example, when the computing delay of the current computing node exceeds the computing delays of more than half of the other computing nodes, the computing task division of the current computing node is determined to be unbalanced.
  • Case 2: the memory usage of the current computing node is greater than or equal to a preset memory usage threshold, and the GPU utilization of the current computing node is less than the average GPU utilization of all computing nodes in the computing cluster.
  • the size of the preset memory usage threshold can be set as needed. For example, when the memory usage of the current computing node exceeds 90% and its GPU utilization is less than the average GPU utilization of all computing nodes in the cluster, the computing task division of the current computing node is determined to be unbalanced.
  • Case 3: the network delay of the current computing node exceeds a preset multiple of the maximum network delay of the other computing nodes in the computing cluster, where the preset multiple is greater than or equal to 1.
  • the size of the preset multiple can be set as needed. For example, when the network transmission delay of the current computing node exceeds twice the maximum access delay between the other computing nodes in the cluster, the computing task division of the current computing node is determined to be unbalanced.
  • when a node's computing task division is unbalanced, the monitoring module 10 determines the imbalance type. Specifically, when the computing delay of the current computing node is greater than or equal to the preset delay threshold, and/or the memory usage of the current computing node is greater than or equal to the preset memory usage threshold while its GPU utilization is less than the average GPU utilization of all computing nodes in the cluster, the imbalance type of the current computing node includes: too many layers assigned to the current computing stage. When the network delay of the current computing node exceeds the preset multiple of the maximum network delay of the other computing nodes in the cluster, the imbalance type of the current computing node includes: network abnormality. These checks are sketched below.
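  • The three cases and the resulting imbalance types can be sketched as follows; the stats objects are assumed to look like `NodeStats` above, and the thresholds are the example values from the text.

```python
def classify_imbalance(node, others, delay_threshold,
                       mem_threshold=0.9, net_multiple=2.0):
    """Return the imbalance types of `node` given the stats of all
    other computing nodes; an empty list means balanced."""
    types = []
    avg_util = (node.gpu_util + sum(o.gpu_util for o in others)) / (len(others) + 1)

    # Case 1 and Case 2 both indicate too many layers in this stage.
    if node.compute_delay >= delay_threshold or (
            node.mem_usage >= mem_threshold and node.gpu_util < avg_util):
        types.append("too_many_layers_in_stage")

    # Case 3: network delay far above every other node's maximum.
    if node.net_delay > net_multiple * max(o.net_delay for o in others):
        types.append("network_abnormality")

    return types
```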
  • the adjustment strategy includes a calculation adjustment strategy.
  • the calculation adjustment strategy is responsible for adjusting computing nodes with very high computing delays.
  • the calculation adjustment strategy re-adjusts the layers assigned to such a computing node to reduce its computing tasks, thereby reducing its computing delay.
  • for such nodes, the adjustment module 20 uses the calculation adjustment strategy to re-allocate layers.
  • the calculation adjustment strategy of the embodiment of the present invention may include: when the current computing node uses CPU-GPU memory swapping or recomputation, canceling the CPU-GPU memory swapping or recomputation; and, after canceling CPU-GPU memory swapping or recomputation, if the memory required to execute the sub-models on the current computing node exceeds the node's maximum memory, migrating at least some layers of at least some sub-models of the current computing node to other computing nodes for execution, based on the GPU utilization of the computing nodes before and after the current computing node.
  • canceling the current computing node's ongoing CPU-GPU memory swapping or recomputation frees memory that the node can use for computation and reduces its computing tasks, thereby reducing its computing delay; if this suffices, the adjustment module 20 ends the adjustment.
  • when re-allocating, the memory required for computation and the maximum memory of the current computing node are considered: when the required memory exceeds the node's maximum memory, the sub-models are reallocated, which also reduces the current computing node's computing tasks and thereby its computing delay.
  • migrating at least some layers of at least some sub-models of the current computing node to other computing nodes may include at least one of the following steps:
  • Step I: take the computing node preceding the current computing node as the initial target computing node. Compare the GPU utilization of the initial target computing node with that of the computing node preceding it: if the GPU utilization of the initial target computing node is less than that of its preceding computing node, take the initial target computing node as the final target computing node; if it is greater,
  • take the preceding computing node as the new initial target computing node and continue the comparison forward in sequence until the first computing node is reached; then migrate at least some layers of at least some sub-models of the current computing node to the final target computing node for execution.
  • when allocating sub-models in the computing cluster, the memory required for computation and the maximum memory of the current computing node are considered. When the required memory exceeds the node's maximum memory, the sub-models are reallocated: migrating at least some layers of at least some sub-models of the current computing node to a computing node with lower GPU utilization among the computing nodes before it in the current computing phase helps minimize the computing delay of the computing cluster.
  • Step II: take the computing node following the current computing node as the initial target computing node. Compare the GPU utilization of the initial target computing node with that of the computing node following it.
  • if the GPU utilization of the initial target computing node is less than that of its following computing node, take the initial target computing node as the final target computing node; if it is greater, take the following computing node as the new initial target computing node and continue the comparison backward in sequence until the last computing node is reached; then migrate at least some layers of at least some sub-models of the current computing node to the final target computing node for execution.
  • in Step I and Step II, the current calculation stage is the stage of the training iteration in which the current computing node with very high computing delay resides.
  • in some embodiments, when adjusting at least some layers of at least some sub-models of the current computing node according to the GPU utilization of the computing nodes before and after it, Step I is performed first:
  • if, among the computing nodes before the current computing node in the current computing phase, there is a node whose GPU utilization is smaller than that of the current computing node, the final target computing node is found and the adjustment ends.
  • otherwise, Step II is executed: if, among the computing nodes after the current computing node in the current computing phase, there is a node whose GPU utilization is smaller than that of the current computing node, the final target computing node is found and the adjustment ends; if no such node exists among the subsequent computing nodes either, the sub-models cannot be reallocated.
  • in other embodiments, when adjusting at least some layers of at least some sub-models of the current computing node according to the GPU utilization of the computing nodes before and after it, Step II is performed first:
  • if no suitable node is found, Step I is then executed: if, among the computing nodes before the current computing node in the current computing phase, there is a node whose GPU utilization is smaller than that of the current computing node, the final target computing node is found and the adjustment ends; if no such node exists among the preceding computing nodes either, the sub-models cannot be reallocated.
  • the calculation adjustment strategy may also include: after at least some layers of at least some sub-models of the current computing node are migrated to other computing nodes for execution, the current computing node regenerates its model parameters and updates its model version information. In this way, whenever a calculation stage's layers change, the stage regenerates its model parameters and updates the version number.
  • the same batch of training data uses the same version of the model to ensure training consistency. Because the pipeline parallel method itself supports multi-version parameters, old model versions are released through pipeline parallel model version management, as sketched below.
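  • A minimal sketch of this version bookkeeping, with all names hypothetical: each stage keeps parameters per version, bumps the version when its layers change, and releases versions no in-flight batch still needs.

```python
class StageParams:
    """Per-stage multi-version parameter store (illustrative only)."""

    def __init__(self):
        self.version = 0
        self.params = {0: self._fresh_params()}

    def _fresh_params(self):
        return {}  # placeholder for the stage's real parameter tensors

    def on_layers_changed(self):
        # Called after layers are migrated to or from this stage:
        # regenerate parameters and bump the version number.
        self.version += 1
        self.params[self.version] = self._fresh_params()

    def params_for_batch(self, batch_version):
        # A batch runs end-to-end with the version it started with.
        return self.params[batch_version]

    def release_before(self, oldest_in_flight):
        # Release versions that no in-flight batch still references.
        for v in [v for v in self.params if v < oldest_in_flight]:
            del self.params[v]
```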
  • the calculation adjustment strategy may include the following steps:
  • if Node_target is the previous computing node, continue to compare with Node_target's previous computing node Node_before: if the GPU utilization of Node_before is less than that of Node_target, continue to migrate Node_target's layers forward. If Node_target is the next computing node, compare with Node_target's next computing node Node_after: if the GPU utilization of Node_after is less than that of Node_target, continue to migrate Node_target's layers backward. Proceed in sequence until the last computing node is reached.
  • the calculation stage whose layers changed regenerates the model parameters and updates the version number.
  • the same batch of training data uses the same version of the model to ensure training consistency; because pipeline parallelism itself supports multi-version parameters, old model versions are released using pipeline parallelism's model version management. A sketch of the target-node search follows.
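  • The target-node search can be sketched as below. It simplifies the embodiment by always starting from the immediate neighbour; `gpu_util` is a hypothetical list of utilizations indexed by pipeline position.

```python
def find_target(gpu_util, slow_idx, direction):
    """direction=-1 searches the nodes before `slow_idx` (Step I),
    direction=+1 the nodes after it (Step II). Returns the index of the
    final target node, or None if no neighbour beats the slow node."""
    cand = slow_idx + direction
    if not 0 <= cand < len(gpu_util) or gpu_util[cand] >= gpu_util[slow_idx]:
        return None
    while True:
        nxt = cand + direction
        # Keep migrating in this direction while the next node is even idler.
        if 0 <= nxt < len(gpu_util) and gpu_util[nxt] < gpu_util[cand]:
            cand = nxt
        else:
            return cand

util = [0.55, 0.40, 0.92, 0.70, 0.35]                 # hypothetical utilizations
target = find_target(util, slow_idx=2, direction=-1)  # Step I first
if target is None:
    target = find_target(util, slow_idx=2, direction=+1)  # then Step II
print("migrate layers from node 2 to node", target)
```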
  • the adjustment strategy includes a memory adjustment strategy.
  • the memory adjustment strategy is responsible for adjusting computing nodes with very high memory usage but low GPU utilization.
  • the memory adjustment strategy re-adjusts the layers of such a computing node to reduce its computing tasks, thereby reducing its computing delay.
  • for example, when the memory usage of the current computing node exceeds 90% but its GPU utilization is lower than the average GPU utilization of all computing nodes in the cluster, the current computing node is considered unbalanced in the current computing stage
  • (the current computing stage can be a forward calculation stage or a reverse calculation stage).
  • the current computing stage is the stage of the training iteration in which the computing node with very high memory usage but very low GPU utilization resides.
  • the memory adjustment strategy of the embodiment of the present invention may include: when the GPU overhead of recomputation on the current computing node is greater than the GPU overhead of CPU-GPU memory swapping, the current computing node uses CPU-GPU memory swapping to reduce its memory usage;
  • conversely, the current computing node uses recomputation to reduce its memory usage. This addresses the problem that small GPU memory, or the model itself (e.g., model parameters and intermediate variables), leads to high memory usage but low computing efficiency.
  • this embodiment first reduces the memory usage of the current calculation stage through recalculation or CPU-GPU memory swapping. If the GPU overhead of recalculation exceeds the GPU overhead of memory swapping, then memory swapping is used; conversely, if the GPU overhead of memory swapping exceeds the GPU overhead of recalculation, then recalculation is used.
  • the memory adjustment strategy may also include: determining the computing time of the current computing node based on the original task training time of the current computing node and the time required for the current computing node to perform recalculation or CPU-GPU memory exchange.
  • if the computing time of the current computing node is greater than or equal to the average task training time of all computing nodes in the cluster, at least some sub-units of at least some sub-models of the current computing node are migrated to adjacent computing nodes for execution;
  • if the computing time of the current computing node is less than the average task training time of all computing nodes in the cluster, the current computing node serves as a target node to which other computing nodes can migrate sub-units of their layers. Because memory swapping or recomputation adds extra computing overhead, the time it requires is added to the original task training time of the current computing node to give the calculation time T_adjust of this stage.
  • the memory adjustment strategy may include the following steps:
  • Decide whether to use recomputation or memory swapping: first, reduce the memory usage of this stage through recomputation or CPU-GPU memory swapping. If the GPU overhead of recomputation exceeds that of memory swapping, use memory swapping; conversely, if the GPU overhead of memory swapping exceeds that of recomputation, use recomputation.
  • Update the calculation time of the current calculation stage: add the time required for memory swapping or recomputation to the original task training time of the current computing node to give the calculation time T_adjust of this stage.
  • Compare with the average computing time of the other computing stages: compare the computing time T_adjust with the average computing time T_average of each stage in the pipeline. A sketch of this decision procedure follows.
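  • These three steps can be summarized in the following sketch; all inputs are hypothetical measurements, and the returned action names are illustrative.

```python
def memory_adjust(recompute_gpu_cost, swap_gpu_cost,
                  base_train_time, recompute_time, swap_time, t_average):
    # Step 1: pick whichever technique has the lower GPU overhead.
    if recompute_gpu_cost > swap_gpu_cost:
        technique, extra = "cpu_gpu_memory_swap", swap_time
    else:
        technique, extra = "recompute", recompute_time

    # Step 2: fold the technique's cost into the stage's computing time.
    t_adjust = base_train_time + extra

    # Step 3: compare with the pipeline-stage average.
    if t_adjust >= t_average:
        action = "migrate_some_sub_units_to_adjacent_node"
    else:
        action = "accept_layers_migrated_from_other_nodes"
    return technique, t_adjust, action

print(memory_adjust(recompute_gpu_cost=3.0, swap_gpu_cost=2.0,
                    base_train_time=10.0, recompute_time=1.5,
                    swap_time=2.5, t_average=11.0))
```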
  • the adjustment strategy includes a topology adjustment strategy.
  • the topology adjustment strategy is responsible for adjusting certain computing nodes with very high network delays. For example, if the network transmission delay of a certain computing node is more than twice the maximum access delay between other computing nodes in the computing cluster, it is considered that the network of the computing node may be abnormal, and a topology adjustment strategy is used to readjust it.
  • the topology adjustment strategy of the embodiment of the present invention may include: selecting the three consecutive computing nodes with the smallest network delay to the current computing node, and determining the maximum network delay among the computing nodes whose network is currently normal; exchanging tasks between the current computing node and the middle node of the three consecutive computing nodes; and separately determining whether the network delay of each of the two computing nodes after the task exchange exceeds that maximum network delay. If it does, continue by selecting the three consecutive computing nodes with the next-smallest delay and repeating the task exchange process until all computing nodes have been traversed; if no exchange stays within the maximum network delay, the network topology adjustment of the computing cluster ends. If the delay does not exceed the maximum network delay, model parameters and intermediate variables are exchanged between the two computing nodes.
  • the topology adjustment strategy may also include: if the memory usage of either of the two computing nodes that swap tasks is greater than or equal to the preset memory usage threshold, the memory adjustment strategy is used to continue adjusting the distribution of sub-models in the computing cluster; if, after that, the memory usage of either of the two nodes is still greater than or equal to the preset memory usage threshold, the calculation adjustment strategy is used to migrate at least some sub-units of at least some sub-models of the node whose memory usage remains at or above the threshold.
  • the topology adjustment strategy may include the following steps:
  • Determine whether the access delay exceeds twice the maximum access delay of the other computing nodes: when communication between the current computing node Node_slow and the computing nodes of the preceding and following stages is slow, first test the network access delay between the current computing node and the other nodes. If the network access delay between the current computing node and its adjacent computing nodes is more than twice the maximum access delay between the other computing nodes, it is judged that the network may be abnormal and topology adjustment continues; otherwise, topology adjustment for this computing node stops.
  • Select the three consecutive computing nodes with the smallest access delay: select the three consecutive computing nodes Node_A, Node_B, and Node_C that have the smallest access delay to the current computing node Node_slow.
  • Determine whether the delay is normal: after exchanging tasks, determine whether the delay between Node_B (in Node_slow's position) and its preceding and following computing nodes is normal, that is, whether it does not exceed the maximum access delay of the currently normal computing nodes. If the delay is abnormal, proceed to determine whether this is the last batch of candidate computing nodes; if the delay is normal, proceed to determine whether the memory is sufficient. A sketch of this procedure follows.
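  • A sketch of this traversal, under stated assumptions: `delay(a, b)` is an assumed pairwise measurement, `neighbours(n)` returns a node's pipeline neighbours, and the accept/reject check approximates the "delays stay within the normal maximum" test above.

```python
def topology_adjust(nodes, slow, delay, neighbours, max_normal_delay):
    """Try to swap `slow`'s task with the middle node of the run of three
    consecutive nodes closest to it; return the swap partner or None."""
    runs = [nodes[i:i + 3] for i in range(len(nodes) - 2)
            if slow not in nodes[i:i + 3]]
    # Traverse candidate runs from smallest to next-smallest total delay.
    runs.sort(key=lambda run: sum(delay(slow, n) for n in run))

    for run in runs:
        node_b = run[1]  # the middle node of the three
        # After a hypothetical task swap, each node must reach the other's
        # pipeline neighbours within the normal maximum access delay.
        b_ok = all(delay(node_b, n) <= max_normal_delay for n in neighbours(slow))
        s_ok = all(delay(slow, n) <= max_normal_delay for n in neighbours(node_b))
        if b_ok and s_ok:
            return node_b  # accepted: exchange parameters and intermediates
    return None  # all candidates traversed; end topology adjustment
```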
  • the present invention also provides an adaptive adjustment method for pipelined parallel training oriented to intelligent computing.
  • the adaptive adjustment method for pipelined parallel training oriented to intelligent computing may include:
  • the monitoring module is responsible for monitoring and collecting the resource operation status of each computing node in the computing cluster, and based on the resource operation status of each computing node, determines whether the computing task division of the computing node is balanced, and when the computing task division of the computing node is unbalanced When, determine the imbalance type of the computing node;
  • the adjustment module determines the adjustment strategy according to the imbalance type of the computing nodes, and adjusts the distribution of sub-models in the computing cluster according to the adjustment strategy;
  • the adjustments include at least one of the following: migrating at least some layers of at least some sub-models of the unbalanced computing node to other computing nodes; controlling the unbalanced computing node to perform CPU-GPU memory swapping or recomputation, or to cancel the CPU-GPU memory swapping or recomputation currently in progress; and adjusting the network topology of the computing cluster.
  • the present invention also provides an adaptive adjustment device for pipeline parallel training for intelligent computing, which includes a memory and one or more processors.
  • the memory stores executable code.
  • when the processors execute the executable code, they implement the above-mentioned adaptive adjustment method of pipeline parallel training for intelligent computing.
  • the present invention also provides a computer-readable storage medium on which a program is stored.
  • when the program is executed by a processor, the above-mentioned adaptive adjustment method for pipeline parallel training for intelligent computing is implemented.
  • the present invention also provides an embodiment of an adaptive adjustment device for pipelined parallel training for intelligent computing.
  • an embodiment of the present invention provides a pipeline parallel training adaptive adjustment device for intelligent computing, including a memory and one or more processors.
  • the memory stores executable code
  • when the one or more processors execute the executable code, they implement the pipeline parallel training adaptive adjustment method for intelligent computing of the above embodiment.
  • the embodiments of the pipeline parallel training adaptive adjustment device for intelligent computing provided by the embodiments of the present invention can be applied to any device with data processing capabilities.
  • the device with data processing capability can be, for example, a computer or a similar device or apparatus.
  • the device embodiments may be implemented by software, by hardware, or by a combination of software and hardware. Taking software implementation as an example, as a logical device it is formed by the processor of the device with data processing capability reading the corresponding computer program instructions from non-volatile memory into memory and running them. At the hardware level, Figure 7 shows a hardware structure diagram of the device with data processing capability on which the adaptive adjustment device for pipeline parallel training for intelligent computing of the embodiment of the present invention resides.
  • besides what is shown in Figure 7, the device with data processing capability on which the embodiment resides may also include other hardware according to its actual functions; no further details are given here.
  • since the device embodiment basically corresponds to the method embodiment, refer to the description of the method embodiment for relevant details.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated.
  • the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present invention. Persons of ordinary skill in the art can understand and implement this without creative effort.
  • Embodiments of the present invention also provide a computer-readable storage medium on which a program is stored.
  • when the program is executed by a processor, the adaptive adjustment method of pipeline parallel training for intelligent computing of the above embodiments is implemented.
  • the computer-readable storage medium may be an internal storage unit of any device with data processing capabilities as described in any of the foregoing embodiments, such as a hard disk or a memory.
  • the computer-readable storage medium can also be an external storage device of the device with data processing capability, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash Card provided on the device.
  • the computer-readable storage medium may also include both an internal storage unit and an external storage device of any device with data processing capabilities.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by any device with data processing capabilities, and can also be used to temporarily store data that has been output or is to be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Advance Control (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Provided in the present invention are an intelligent-computing-oriented adaptive adjustment system and method for pipeline-parallel training. The system comprises a monitoring module and an adjustment module, wherein when computing tasks of each of computing nodes are unevenly divided, the adjustment module determines an adjustment policy according to an unevenness type of the computing node, and adjusts the allocation of sub-models in a computing cluster according to the adjustment policy. The adjustment comprises at least one of the following: transferring layers of at least some sub-models of the computing node, the computing tasks of which are unevenly divided, from the computing node to another computing node; controlling the computing node, the computing tasks of which are unevenly divided, to execute CPU-GPU memory exchange or recomputing, or controlling the computing node, the computing tasks of which are unevenly divided, to cancel the currently executed CPU-GPU memory exchange or recomputing; and adjusting a network topology structure of the computing cluster. The present invention can dynamically adjust the allocation of sub-models in a computing cluster.

Description

Pipeline parallel training adaptive adjustment system and method for intelligent computing

Technical Field
The invention relates to the field of intelligent computing, and in particular to an adaptive adjustment system and method for pipeline parallel training oriented to intelligent computing.
Background
The emergence of deep learning has brought great changes to natural language processing, audio and video processing, integrated media, and other fields. However, as deep learning models grow, some large models have more than tens of billions of parameters, and such large-scale models are often trained by building distributed machine learning systems. Distributed training can break through the computing power limit of a single GPU; speeding up model training by distributing it across multiple machines and multiple GPU cards has become a very common approach.
Among these, pipeline parallelism is a common distributed training method. The pipeline parallel method divides the model by layer into multiple stages, each deployed on a GPU; the stages perform forward computation in sequence, the loss function is computed at the last stage, and backward computation then proceeds from the last stage back to the first. The idle waiting time between forward and backward computation differs across stages. By executing multiple mini-batches in each stage at the same time (or splitting a mini-batch into multiple micro-batches for concurrent execution), the stages run as a pipeline, reducing GPU idle time and improving efficiency. However, different layers of a deep learning model differ in their compute, memory, and communication requirements, so balancing compute, memory, and network resources across stages is crucial to the efficiency of pipeline parallel computing.
Current methods generally divide layers statically; for example, PipeDream uses dynamic programming to divide layers. However, current AI frameworks such as PyTorch support dynamic computation graphs, so the model may change across training periods, and a static division faces reduced overall efficiency after the model (such as a neural network model) changes.
Summary of the Invention
The purpose of the present invention is to provide an adaptive adjustment system and method for pipeline parallel training oriented to intelligent computing, which solves the prior-art problem that static layer division suffers reduced overall efficiency after the model changes.
The technical solutions adopted by the present invention are as follows:
Embodiments of the present invention provide an adaptive adjustment system for pipeline parallel training oriented to intelligent computing. The computing cluster includes multiple computing nodes that can communicate with each other, and each computing node includes at least one CPU and at least one GPU. The model to be trained includes multiple layers of sub-models, and its training process includes a forward calculation stage and a reverse calculation stage: in the forward calculation stage, parameters are transferred in sequence from the first-layer sub-model of the multi-layer sub-models to the last-layer sub-model; in the reverse calculation stage, parameters are transferred in sequence from the last-layer sub-model back to the first-layer sub-model. Each computing node is used to train at least one sub-model. The system includes:
a monitoring module, configured to monitor and collect the resource operating status of each computing node in the computing cluster, determine from that status whether each node's computing task division is balanced, and, when a computing node's computing task division is unbalanced, determine the node's imbalance type;

an adjustment module, configured to, when a computing node's computing task division is unbalanced, determine an adjustment strategy according to the node's imbalance type and adjust the distribution of sub-models in the computing cluster according to the adjustment strategy;

wherein the adjustment includes at least one of the following:

migrating at least some layers of at least some sub-models of a computing node whose computing tasks are divided unbalancedly from that node to other computing nodes;

controlling the computing node whose computing tasks are divided unbalancedly to perform CPU-GPU memory swapping or recomputation, or controlling it to cancel the currently executed CPU-GPU memory swapping or recomputation;

adjusting the network topology of the computing cluster.
The present invention also provides an adaptive adjustment method for pipeline parallel training oriented to intelligent computing, wherein a computing cluster includes multiple computing nodes that can communicate with each other, each computing node includes at least one CPU and at least one GPU, the model to be trained includes multiple layers of sub-models, and the training process of the model to be trained includes a forward calculation stage and a reverse calculation stage, wherein in the forward calculation stage, parameters are transferred in sequence from the first-layer sub-model of the multiple layers of sub-models to the last-layer sub-model, and in the reverse calculation stage, parameters are transferred in sequence from the last-layer sub-model back to the first-layer sub-model, and each computing node is used to train at least one sub-model; the method includes:
the monitoring module monitors and collects the resource operating status of each computing node in the computing cluster, determines from that status whether each node's computing task division is balanced, and, when a computing node's computing task division is unbalanced, determines the node's imbalance type;

when a computing node's computing task division is unbalanced, the adjustment module determines an adjustment strategy according to the node's imbalance type and adjusts the distribution of sub-models in the computing cluster according to the adjustment strategy;

wherein the adjustment includes at least one of the following:

migrating at least some layers of at least some sub-models of a computing node whose computing tasks are divided unbalancedly from that node to other computing nodes;

controlling the computing node whose computing tasks are divided unbalancedly to perform CPU-GPU memory swapping or recomputation, or controlling it to cancel the currently executed CPU-GPU memory swapping or recomputation;

adjusting the network topology of the computing cluster.
The invention also provides an adaptive adjustment device for pipeline parallel training oriented to intelligent computing, which includes a memory and one or more processors, with executable code stored in the memory; when the one or more processors execute the executable code, they implement the above adaptive adjustment method for pipeline parallel training oriented to intelligent computing.
本发明还提供一种计算机可读存储介质,其上存储有程序,该程序被处理器执行时,实现上述的面向智能计算的流水并行训练自适应调整方法。The present invention also provides a computer-readable storage medium on which a program is stored. When the program is executed by a processor, the above-mentioned adaptive adjustment method for pipeline parallel training for intelligent computing is implemented.
本发明的有益效果是:调整模块在述计算节点的计算任务划分不均衡时,动态调整子模型在计算集群中的分配,有效提升了流水并行的动态适应能力,提升了智能计算集群的GPU利用率。The beneficial effects of the present invention are: when the computing task division of the computing nodes is unbalanced, the adjustment module dynamically adjusts the distribution of sub-models in the computing cluster, effectively improving the dynamic adaptability of pipeline parallelism and improving the GPU utilization of the intelligent computing cluster. Rate.
Description of the Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Apparently, the drawings described below are merely some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a computing cluster provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of an adaptive adjustment system for pipeline-parallel training oriented to intelligent computing provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a compute adjustment strategy provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of a memory adjustment strategy provided by an embodiment of the present invention;
Fig. 5 is a schematic flowchart of a topology adjustment strategy provided by an embodiment of the present invention;
Fig. 6 is a schematic flowchart of an adaptive adjustment method for pipeline-parallel training oriented to intelligent computing provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an adaptive adjustment apparatus for pipeline-parallel training oriented to intelligent computing provided by an embodiment of the present invention.
Reference signs: 10, monitoring module; 20, adjustment module.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
It should be noted that, as long as there is no conflict, the features in the following embodiments and implementations may be combined with one another.
Computing clusters generally support multiple tenants, especially in public cloud scenarios, and the performance of the computing nodes in a cluster varies as the shared workloads change. Therefore, an adaptive adjustment method for the layer division of pipeline parallelism is urgently needed to adapt to such dynamically changing scenarios.
Referring to Fig. 1, the computing cluster in an embodiment of the present invention may include a plurality of computing nodes that can communicate with one another, each computing node including at least one CPU and at least one GPU. As shown in Fig. 1, the computing cluster may include computing node 1, computing node 2, ..., computing node N, where N is a positive integer and N is greater than or equal to 3.
The model to be trained in the embodiments of the present invention may be a neural network model, or may be another type of model, such as a mathematical model to be trained.
In the embodiments of the present invention, the model to be trained may include multiple layers of sub-models, and the model is trained in a pipeline-parallel manner. Specifically, the training process of the model to be trained includes a forward computation phase and a backward computation phase. In the forward computation phase, parameters are passed in sequence from the first-layer sub-model to the last-layer sub-model; in the backward computation phase, parameters are passed in sequence from the last-layer sub-model back to the first-layer sub-model. It should be noted that one training iteration (i.e., one round of the training process, which may also be called one training pass) includes one forward computation phase and one backward computation phase.
By way of example, the model to be trained is a neural network model that includes a first-layer network, a second-layer network, a third-layer network and a fourth-layer network connected in sequence, where the first-layer network is the first-layer sub-model and the fourth-layer network is the last-layer sub-model. In the forward computation phase, parameters are passed from the first-layer network to the second-layer, third-layer and fourth-layer networks in sequence; in the backward computation phase, parameters are passed from the fourth-layer network to the third-layer, second-layer and first-layer networks in sequence. It should be noted that the types of the layers can be designed as needed; for example, the first-layer network is an input layer, the second-layer network is a convolutional layer, the third-layer network is a pooling layer, and the fourth-layer network is an output layer.
Each computing node in the computing cluster is used to train at least one sub-model, i.e., each computing node is assigned at least one sub-model, thereby improving the GPU utilization of the intelligent computing cluster.
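For illustration only (not part of the claimed subject matter), the following minimal Python sketch shows one possible form of such an initial layer assignment, with each computing node holding a contiguous block of sub-model layers; the names ComputeNode and partition_layers are assumptions of this sketch.

```python
# Minimal sketch (illustrative only): a pipeline-parallel assignment in which each
# computing node trains a contiguous slice of the model's layer sub-models.
from dataclasses import dataclass, field

@dataclass
class ComputeNode:
    node_id: int
    layers: list = field(default_factory=list)  # indices of assigned sub-model layers

def partition_layers(num_layers: int, nodes: list) -> None:
    """Assign layers 0..num_layers-1 to nodes in contiguous, near-equal blocks."""
    per_node, rest = divmod(num_layers, len(nodes))
    start = 0
    for i, node in enumerate(nodes):
        count = per_node + (1 if i < rest else 0)
        node.layers = list(range(start, start + count))
        start += count

nodes = [ComputeNode(i) for i in range(4)]
partition_layers(16, nodes)   # e.g. a 16-layer model over 4 nodes: 4 layers each
```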
Referring to Fig. 2, an embodiment of the present invention provides an adaptive adjustment system for pipeline-parallel training oriented to intelligent computing. The system may include a monitoring module 10 and an adjustment module 20.
In the embodiments of the present invention, the monitoring module 10 is responsible for monitoring and collecting the resource operation status of each computing node in the computing cluster, determining from the resource operation status of each computing node whether the division of computing tasks on that node is balanced, and, when the division of computing tasks on a computing node is unbalanced, determining the imbalance type of that node. When the monitoring module 10 determines that a computing node with an unbalanced division of computing tasks exists in the cluster, it notifies the adjustment module 20 of the existence of that node and of the corresponding imbalance type. Following the above example, the monitoring module 10 monitors and collects the resource operation status of computing node 1, computing node 2, ..., computing node N, determines from this status that the division of computing tasks on computing node 2 is unbalanced, and further determines the imbalance type of computing node 2. The monitoring module 10 then notifies the adjustment module 20 of the imbalance on computing node 2 and of its type.
The adjustment module 20 is used to determine, when the division of computing tasks on a computing node is unbalanced, an adjustment strategy according to the imbalance type of that node, and to adjust the allocation of sub-models within the computing cluster according to the adjustment strategy. In the embodiments of the present invention, upon receiving from the monitoring module 10 indication information indicating that a computing node with an unbalanced division of computing tasks exists in the cluster, the adjustment module 20 determines the adjustment strategy according to the imbalance type of that node and adjusts the allocation of sub-models within the cluster accordingly. The indication information carries the imbalance type of the computing node whose computing tasks are unevenly divided.
In the embodiments of the present invention, the adjustment of the allocation of sub-models within the computing cluster may include at least one of the following:
(1) migrating at least some layers of at least some sub-models of a computing node whose computing tasks are unevenly divided from that computing node to other computing nodes;
(2) controlling a computing node whose computing tasks are unevenly divided to perform CPU-GPU memory swapping or recomputation, or controlling such a node to cancel the CPU-GPU memory swapping or recomputation currently being performed;
(3) adjusting the network topology of the computing cluster.
When the division of computing tasks on a computing node is unbalanced, the adjustment module 20 of the adaptive adjustment system for pipeline-parallel training oriented to intelligent computing in the embodiments of the present invention dynamically adjusts the allocation of sub-models within the computing cluster, which effectively improves the dynamic adaptability of pipeline parallelism and increases the GPU utilization of the intelligent computing cluster.
In this embodiment, the resource operation status may include information such as computation latency, GPU utilization, network transmission latency and memory usage; that is, the monitoring module 10 monitors and collects the computation latency, GPU utilization, network transmission latency, memory usage and other information of each computing node in the cluster. Specifically, the monitoring module 10 monitors and collects this information for the forward computation phase and the backward computation phase of each training iteration. Monitoring and collecting such comprehensive runtime information assists the subsequent selection of adjustment strategies and thereby effectively improves the GPU utilization of the computing cluster. It is understood that, in other embodiments, the resource operation status may include only part of this information.
The way in which the monitoring module 10 collects the resource operation status of each computing node during each training iteration can be configured as needed. For example, in some embodiments, after each round of iterative training, each computing node in the cluster sends its computation latency, GPU utilization, network transmission latency, memory usage and other information for that round to the monitoring module 10.
In other embodiments, the monitoring module 10 actively reads the computation latency, GPU utilization, network transmission latency, memory usage and other information from each computing node in the cluster. For example, the monitoring module 10 may read this information periodically, with a period that can be set as needed; e.g., the monitoring module 10 reads this information from each computing node every 10 minutes.
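For illustration only, a minimal Python sketch of the periodic-pull collection mode described above follows; NodeMetrics and read_metrics are assumed helpers of this sketch, not an API defined by the embodiments.

```python
# Minimal sketch (illustrative only) of periodic metric collection by the monitor.
import time
from dataclasses import dataclass

@dataclass
class NodeMetrics:
    compute_latency_s: float   # per-iteration computation latency
    gpu_utilization: float     # 0.0 .. 1.0
    network_latency_s: float   # network transmission latency
    memory_usage: float        # 0.0 .. 1.0

def read_metrics(node_id: int) -> NodeMetrics:
    """Assumed stub: query one computing node for its latest metrics."""
    raise NotImplementedError

def monitor_loop(node_ids, period_s=600):        # e.g. every 10 minutes
    while True:
        snapshot = {nid: read_metrics(nid) for nid in node_ids}
        yield snapshot                           # handed to imbalance detection
        time.sleep(period_s)
```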
Optionally, when determining from the resource operation status of each computing node whether the division of computing tasks on that node is balanced, the monitoring module 10 specifically determines that the division of computing tasks on the current computing node is unbalanced when, according to the resource operation status of the current node, at least one of the following cases holds:
Case 1: the computation latency of the current computing node is greater than or equal to a preset latency threshold.
The preset latency threshold can be set as needed; for example, when the computation latency of the current computing node exceeds that of the other computing nodes by more than half, the division of computing tasks on the current node is determined to be unbalanced.
Case 2: the memory usage of the current computing node is greater than or equal to a preset memory usage threshold, and the GPU utilization of the current node is lower than the average GPU utilization of all computing nodes in the cluster.
The preset memory usage threshold can be set as needed; for example, when the memory usage of the current computing node exceeds 90% and its GPU utilization is lower than the average GPU utilization of all computing nodes in the cluster, the division of computing tasks on the current node is determined to be unbalanced.
Case 3: the network latency of the current computing node exceeds a preset multiple of the maximum network latency of the other computing nodes in the cluster, the preset multiple being greater than or equal to 1.
The preset multiple can be set as needed; for example, when the network transmission latency of the current computing node is more than twice the maximum access latency of the other computing nodes in the cluster, the division of computing tasks on the current node is determined to be unbalanced.
Optionally, when the computing tasks of a computing node are unevenly divided, the monitoring module 10 determines the imbalance type as follows: when the computation latency of the current computing node is greater than or equal to the preset latency threshold, and/or the memory usage of the current node is greater than or equal to the preset memory usage threshold while its GPU utilization is lower than the average GPU utilization of all computing nodes in the cluster, the imbalance type of the current node includes: too many layers allocated in the current computation phase. When the network latency of the current computing node exceeds the preset multiple of the maximum network latency of the other computing nodes in the cluster, the imbalance type of the current node includes: network anomaly.
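For illustration only, the following Python sketch (reusing the NodeMetrics fields from the monitoring sketch above) implements one possible form of these three checks and the type classification; the constant values mirror the examples in the text, and all names are assumptions of this sketch.

```python
# Minimal sketch (illustrative only) of the three imbalance checks and the
# resulting type classification.
LATENCY_FACTOR = 1.5      # "exceeds that of the other nodes by more than half"
MEMORY_THRESHOLD = 0.9    # "memory usage exceeds 90%"
NETWORK_FACTOR = 2.0      # "more than twice the maximum latency of the others"

def classify_imbalance(node_id, snapshot):
    m = snapshot[node_id]
    others = [v for k, v in snapshot.items() if k != node_id]
    avg_gpu = sum(v.gpu_utilization for v in snapshot.values()) / len(snapshot)
    types = []
    # Case 1 and Case 2 both indicate too many layers in the current phase.
    if m.compute_latency_s > LATENCY_FACTOR * max(v.compute_latency_s for v in others):
        types.append("too_many_layers")
    if m.memory_usage >= MEMORY_THRESHOLD and m.gpu_utilization < avg_gpu:
        types.append("too_many_layers")
    # Case 3 indicates a network anomaly.
    if m.network_latency_s > NETWORK_FACTOR * max(v.network_latency_s for v in others):
        types.append("network_anomaly")
    return sorted(set(types))   # an empty list means the node is balanced
```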
When the computation latency of the current computing node is greater than or equal to the preset latency threshold, the adjustment strategy includes a compute adjustment strategy. The compute adjustment strategy is responsible for adjusting computing nodes whose computation latency is very high: it re-adjusts the layers of such a node to reduce its computing tasks and thereby its computation latency. For example, when the computation latency of the current computing node Node_adjust exceeds that of the other computing nodes by more than half, Node_adjust is considered to have been allocated too many layers (i.e., sub-models) in the current computation phase (which may be the forward or the backward computation phase), and the adjustment module 20 performs reallocation using the compute adjustment strategy.
The compute adjustment strategy of the embodiments of the present invention may include: when the current computing node is using CPU-GPU memory swapping or recomputation, cancelling the CPU-GPU memory swapping or recomputation; after the cancellation, if the memory required for the current node to execute its sub-models exceeds the maximum memory of the current node, migrating at least some layers of at least some sub-models of the current node to other computing nodes for execution, according to the GPU utilization of the computing node preceding the current node and the GPU utilization of the computing node following it. If the required memory does not exceed the maximum memory of the current node, the adjustment module 20 ends the adjustment. Cancelling the CPU-GPU memory swapping or recomputation in progress on the current node frees memory that the node can use for computation and reduces its computing tasks, thereby reducing its computation latency. Moreover, when adjusting the allocation of sub-models within the cluster, the memory required for computation and the maximum memory of the current node are taken into account; when the required memory exceeds the maximum memory of the current node, the sub-models are reallocated, which likewise reduces the computing tasks of the current node and thereby its computation latency.
Migrating at least some layers of at least some sub-models of the current computing node to other computing nodes according to the GPU utilization of the preceding and following computing nodes may include at least one of the following steps:
I. When the GPU utilization of the computing node preceding the current node is lower than that of the computing node following it, the preceding node is taken as the initial target computing node. With the preceding node as the initial target, the GPU utilization of the initial target node is compared with that of the node preceding it: if the GPU utilization of the initial target node is lower than that of its preceding node, the initial target node is taken as the final target computing node; if the GPU utilization of the initial target node is higher than that of its preceding node, the node preceding the initial target becomes the new initial target, and the forward comparison continues in sequence up to the frontmost node. At least some layers of at least some sub-models of the current node are then migrated to the final target computing node for execution. Taking the required memory and the maximum memory of the current node into account, and migrating layers to the node with the lower GPU utilization among the nodes preceding the current node in the current computation phase when the required memory exceeds the maximum memory, helps minimize the computation latency of the cluster.
II. When the GPU utilization of the computing node preceding the current node is higher than that of the computing node following it, the following node is taken as the initial target computing node. With the following node as the initial target, the GPU utilization of the initial target node is compared with that of the node following it: if the GPU utilization of the initial target node is lower than that of its following node, the initial target node is taken as the final target computing node; if the GPU utilization of the initial target node is higher than that of its following node, the node following the initial target becomes the new initial target, and the backward comparison continues in sequence down to the last node. At least some layers of at least some sub-models of the current node are then migrated to the final target computing node for execution. Migrating layers to the node with the lower GPU utilization among the nodes following the current node in the current computation phase likewise helps minimize the computation latency of the cluster.
It should be noted that, in steps I and II, the current computation phase is the computation phase of the training iteration in which the computing node with the very high computation latency currently is.
Optionally, in some embodiments, when migrating at least some layers of at least some sub-models of the current node to other computing nodes according to the GPU utilization of the preceding and following computing nodes, the adjustment module first performs step I: if, among the computing nodes that precede the current node in the current computation phase, there is a node whose GPU utilization is lower than that of the current node, the final target computing node is found and the adjustment ends. If no such preceding node exists, step II is performed: if, among the computing nodes that follow the current node in the current computation phase, there is a node whose GPU utilization is lower than that of the current node, the final target computing node is found and the adjustment ends; if no such following node exists either, the sub-models cannot be reallocated.
In other embodiments, the adjustment module first performs step II: if, among the computing nodes that follow the current node in the current computation phase, there is a node whose GPU utilization is lower than that of the current node, the final target computing node is found and the adjustment ends. If no such following node exists, step I is performed: if, among the computing nodes that precede the current node in the current computation phase, there is a node whose GPU utilization is lower than that of the current node, the final target computing node is found and the adjustment ends; if no such preceding node exists either, the sub-models cannot be reallocated.
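For illustration only, the following Python sketch captures the directional target search of steps I and II under the assumption that nodes are indexed in pipeline order and gpu_util maps each index to its GPU utilization; all names are assumptions of this sketch.

```python
# Minimal sketch (illustrative only) of the directional target search in steps I/II.
def find_target(gpu_util, cur):
    """Walk away from `cur` in the direction of the less-utilized neighbour while
    GPU utilization keeps decreasing; return the final target index, or None."""
    prev_u = gpu_util[cur - 1] if cur > 0 else float("inf")
    next_u = gpu_util[cur + 1] if cur < len(gpu_util) - 1 else float("inf")
    step = -1 if prev_u < next_u else 1          # step I walks forward, step II backward
    target = cur + step
    if not 0 <= target < len(gpu_util):
        return None                              # no neighbour in that direction
    while 0 <= target + step < len(gpu_util) and \
          gpu_util[target + step] < gpu_util[target]:
        target += step                           # keep moving while utilization drops
    # Reallocation only helps if the target is less utilized than the current node.
    return target if gpu_util[target] < gpu_util[cur] else None
```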
Further, in some embodiments, the compute adjustment strategy may also include: after at least some layers of at least some sub-models of the current node have been migrated to other computing nodes for execution, the current node regenerates its model parameters and updates its model version information. In this way, a stage whose layers have changed regenerates its model parameters and updates the version number. The same batch of training data is trained with the same model version, which guarantees training consistency. Because pipeline parallelism itself supports multiple parameter versions, old model versions are released through the model version management of pipeline parallelism.
By way of example, in a specific embodiment, referring to Fig. 3, the compute adjustment strategy may include the following steps:
When the computation latency of the current computing node Node_adjust is high, first determine whether Node_adjust is using CPU-GPU memory swapping or recomputation; if so, cancel the CPU-GPU memory swapping or recomputation first. If, after the cancellation, the memory requirement does not exceed the maximum memory of Node_adjust, the adjustment ends. Otherwise, continue with the next step.
Then compare the GPU utilization of the two stages on the computing nodes immediately before and after Node_adjust, and migrate layers of Node_adjust to the computing node Node_target with the lower GPU utilization.
If Node_target is the preceding node, continue by comparing with Node_target's preceding node Node_before: if the GPU utilization of Node_before is lower than that of Node_target, continue migrating the layers further forward. If Node_target is the following node, compare with Node_target's following node Node_after: if the GPU utilization of Node_after is lower than that of Node_target, continue migrating the layers further backward. Proceed in sequence until the last computing node in that direction is reached.
Finally, a stage whose layers have changed regenerates its model parameters and updates the version number. The same batch of training data is trained with the same model version, which guarantees training consistency. Because pipeline parallelism itself supports multiple parameter versions, old model versions are released through the model version management of pipeline parallelism.
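For illustration only, the following Python sketch strings the Fig. 3 steps together, reusing find_target from the sketch above; the node fields (swap_enabled, recompute_enabled, required_mem, max_mem, version, layers) are assumptions of this sketch.

```python
# Minimal sketch (illustrative only) of the Fig. 3 compute adjustment flow.
def compute_adjust(node, nodes, gpu_util, cur):
    if node.swap_enabled or node.recompute_enabled:
        node.swap_enabled = node.recompute_enabled = False  # cancel swap/recompute first
        if node.required_mem() <= node.max_mem:
            return                                          # freed memory was enough
    target = find_target(gpu_util, cur)
    if target is None:
        return                                              # no less-utilized neighbour
    if target < cur:
        layer = node.layers.pop(0)                          # move front boundary layer forward
        nodes[target].layers.append(layer)
    else:
        layer = node.layers.pop()                           # move rear boundary layer backward
        nodes[target].layers.insert(0, layer)
    for n in (node, nodes[target]):                         # stages whose layers changed
        n.version += 1                                      # regenerate params, bump version
```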
When the memory usage of the current computing node is greater than or equal to the preset memory usage threshold and its GPU utilization is lower than the average GPU utilization of all computing nodes in the cluster, the adjustment strategy includes a memory adjustment strategy. The memory adjustment strategy is responsible for adjusting computing nodes whose memory usage is very high while their GPU utilization is low: it re-adjusts the layers of such a node to reduce its computing tasks and thereby its computation latency. For example, when the memory usage of the current node exceeds 90% while its GPU utilization is below the average GPU utilization of all computing nodes in the cluster, the current node is considered to have been allocated too many layers in the current computation phase (which may be the forward or the backward computation phase), and the memory adjustment strategy is used for reallocation. It should be noted that the current computation phase is the computation phase of the training iteration in which the node with the very high memory usage but low GPU utilization currently is.
The memory adjustment strategy of the embodiments of the present invention may include: when the GPU overhead of recomputation on the current node is greater than the GPU overhead of CPU-GPU memory swapping, the current node uses CPU-GPU memory swapping to reduce its memory usage; when the GPU overhead of recomputation is smaller than that of CPU-GPU memory swapping, the current node uses recomputation to reduce its memory usage. This addresses the case in which the GPU memory is small, or model parameters, intermediate variables and other properties of the model itself make memory usage high while computational effectiveness is low: because GPU memory is limited, GPU computing efficiency cannot be further improved by migrating layers from adjacent stages. This embodiment therefore first reduces the memory usage of the current computation phase through recomputation or CPU-GPU memory swapping. If the GPU overhead of recomputation exceeds that of memory swapping, memory swapping is used; conversely, if the GPU overhead of memory swapping exceeds that of recomputation, recomputation is used.
Further, in some embodiments, the memory adjustment strategy may also include: determining the computation time of the current node from the training time of its original tasks plus the time required for the current node to perform recomputation or CPU-GPU memory swapping; when the computation time of the current node is greater than or equal to the average task training time of all computing nodes in the cluster, migrating at least some layers of at least some sub-models of the current node to its adjacent computing nodes for execution; when the computation time of the current node is less than the average task training time of all computing nodes in the cluster, using the current node as a target node into which layers migrated from other computing nodes are moved. Because memory swapping and recomputation both add extra computation overhead, the time they require is added to the original task training time of the current node and taken as the computation time T_adjust of this stage. T_adjust is then compared with the average computation time T_average of the stages in the pipeline: if T_adjust < T_average, the stage is treated as one with spare computation capacity (i.e., a lighter computational load), and the layer migration strategy moves layers into it from adjacent stages to balance the computational efficiency of the stages; if T_adjust >= T_average, some layers are split off from the current stage and migrated to adjacent stages. This ultimately minimizes the computation latency of the cluster.
By way of example, in a specific embodiment, referring to Fig. 4, the memory adjustment strategy may include the following steps:
Decide between recomputation and memory swapping: first reduce the memory usage of this stage through recomputation or CPU-GPU memory swapping. If the GPU overhead of recomputation exceeds that of memory swapping, use memory swapping; conversely, if the GPU overhead of memory swapping exceeds that of recomputation, use recomputation.
Update the computation time of the current stage: add the time required for memory swapping or recomputation to the original task training time of the current node, and take the sum as the computation time T_adjust of this stage.
Compare with the average computation time of the other stages: compare T_adjust with the average computation time T_average of the stages in the pipeline.
Adjust using the compute adjustment strategy: if T_adjust < T_average, treat this stage as one with spare capacity, and use the layer migration strategy to move layers in from adjacent stages to balance the computational efficiency of the stages; if T_adjust >= T_average, split some layers off from this stage and migrate them to adjacent stages.
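For illustration only, the following Python sketch condenses the Fig. 4 flow; the cost and timing fields (recompute_cost, swap_cost, base_time) are assumed inputs of this sketch.

```python
# Minimal sketch (illustrative only) of the Fig. 4 memory adjustment flow.
def memory_adjust(node, t_average):
    # 1) Pick the memory-reduction technique with the smaller GPU overhead.
    if node.recompute_cost > node.swap_cost:
        node.swap_enabled, extra = True, node.swap_cost
    else:
        node.recompute_enabled, extra = True, node.recompute_cost
    # 2) Update this stage's computation time with the added overhead.
    t_adjust = node.base_time + extra
    # 3) / 4) Compare with the pipeline average and migrate layers accordingly.
    if t_adjust < t_average:
        return "migrate_in"    # spare capacity: pull layers from adjacent stages
    else:
        return "migrate_out"   # overloaded: split layers off to adjacent stages
```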
When the network latency of the current computing node exceeds the preset multiple of the maximum network latency of the other computing nodes in the cluster, the adjustment strategy includes a topology adjustment strategy, which is responsible for adjusting computing nodes with very high network latency. For example, if the network transmission latency of a computing node is more than twice the maximum access latency between the other computing nodes in the cluster, the network of that node may be abnormal, and the topology adjustment strategy is used for readjustment.
The topology adjustment strategy of the embodiments of the present invention may include: after it has been determined that the latency of the current node is more than twice the maximum network latency of the computing nodes whose network is not delayed, selecting the three consecutive computing nodes with the smallest network latency to the current node; exchanging tasks between the current node and the middle node of the three consecutive nodes; determining, for each of the two nodes whose tasks were exchanged, whether the network latency to its preceding and following computing nodes exceeds the maximum network latency; if it does, selecting the three consecutive nodes with the next smallest latency and repeating the exchange process until all computing nodes have been traversed, and, if no node not exceeding the maximum network latency is found, ending the network topology adjustment of the cluster; if it does not, exchanging the model parameters and intermediate variables between the two nodes whose tasks were exchanged.
Further, in some embodiments, the topology adjustment strategy may also include: if the memory usage of either of the two nodes whose tasks were exchanged is greater than or equal to the preset memory usage threshold, continuing to adjust the allocation of sub-models within the cluster using the memory adjustment strategy; if, after the memory adjustment strategy has been applied, the memory usage of either of the two nodes is still greater than or equal to the preset memory usage threshold, using the compute adjustment strategy to migrate at least some layers of at least some sub-models of the node whose memory usage remains at or above the threshold.
By way of example, in a specific embodiment, referring to Fig. 5, the topology adjustment strategy may include the following steps:
Determine whether the access latency is more than twice the maximum access latency of the other computing nodes: after the communication between the current node Node_slow and the other computing nodes of the preceding and following stages has become slow, first test the network access latency between the current node and the other nodes. If the network access latency between the current node and its adjacent nodes is more than twice the maximum access latency between the other computing nodes, the network may be abnormal, and the topology adjustment continues; otherwise, the topology adjustment of this node stops.
Select the three consecutive computing nodes with the smallest access latency: select the three consecutive computing nodes Node_A, Node_B, Node_C with the smallest access latency to the current node Node_slow.
Exchange with the middle node: exchange Node_slow with Node_B, the middle node of the three consecutive nodes.
Determine whether the latency is normal: test whether the latency between Node_B, Node_slow and their preceding and following computing nodes is normal, i.e., does not exceed the maximum access latency of the currently normal computing nodes. If the latency is abnormal, proceed to the step below of determining whether this is the last batch of computing nodes. If the latency is normal, proceed to the step below of determining whether the memory is sufficient.
Determine whether this is the last batch of computing nodes: if not, Node_slow continues by selecting the three consecutive nodes with the next smallest latency and exchanging with their middle node, until all computing nodes have been traversed. If no exchange node satisfying the conditions has been found after traversing all computing nodes, this network topology adjustment ends.
Determine whether the memory is sufficient: exchange the model parameters and intermediate variables between Node_slow and Node_B, and determine whether the memory is sufficient.
Use the memory adjustment strategy: if the memory is insufficient, adjust using the memory adjustment strategy; if it is still insufficient, reduce the memory requirement by migrating layers with the compute adjustment strategy. If the memory is sufficient, this network topology adjustment ends.
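For illustration only, the following Python sketch outlines one possible form of the Fig. 5 swap search; latency(a, b) is an assumed probe of the network latency between nodes a and b, nodes are indexed in pipeline order, and the ranking of candidate triples by summed latency is an assumption of this sketch.

```python
# Minimal sketch (illustrative only) of the Fig. 5 task-swap search.
def swap_tasks(nodes, i, j):
    nodes[i], nodes[j] = nodes[j], nodes[i]      # exchange the stage tasks of two nodes

def topology_adjust(slow, nodes, latency, max_normal_latency):
    # Rank candidate middle nodes by the total latency from `slow` to each group
    # of three consecutive nodes (a proxy for "smallest access latency").
    middles = sorted((b for b in range(1, len(nodes) - 1) if b != slow),
                     key=lambda b: sum(latency(slow, k) for k in (b - 1, b, b + 1)))
    for b in middles:
        swap_tasks(nodes, slow, b)
        # Latency is "normal" if no link around the swapped nodes exceeds the
        # maximum access latency of the currently normal computing nodes.
        ok = all(latency(x, x + 1) <= max_normal_latency
                 for x in (slow - 1, slow, b - 1, b) if 0 <= x < len(nodes) - 1)
        if ok:
            # Exchange model parameters and intermediate variables here (assumed
            # transfer step), then check memory sufficiency as described above.
            return b
        swap_tasks(nodes, slow, b)               # revert and try the next candidate
    return None                                  # traversal exhausted: end adjustment
```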
The present invention further provides an adaptive adjustment method for pipeline-parallel training oriented to intelligent computing. Referring to Fig. 6, the method may include:
S100: a monitoring module monitors and collects the resource operation status of each computing node in the computing cluster, determines from the resource operation status of each computing node whether the division of computing tasks on that node is balanced, and, when the division of computing tasks on a computing node is unbalanced, determines the imbalance type of the computing node;
S200: when the division of computing tasks on a computing node is unbalanced, an adjustment module determines an adjustment strategy according to the imbalance type of the computing node and, according to the adjustment strategy, adjusts the allocation of sub-models within the computing cluster;
wherein the adjustment includes at least one of the following:
migrating at least some layers of at least some sub-models of a computing node whose computing tasks are unevenly divided from that computing node to other computing nodes;
controlling a computing node whose computing tasks are unevenly divided to perform CPU-GPU memory swapping or recomputation, or controlling such a node to cancel the CPU-GPU memory swapping or recomputation currently being performed;
adjusting the network topology of the computing cluster.
The present invention further provides an adaptive adjustment apparatus for pipeline-parallel training oriented to intelligent computing, including a memory and one or more processors, wherein the memory stores executable code, and the one or more processors, when executing the executable code, implement the above adaptive adjustment method for pipeline-parallel training oriented to intelligent computing.
The present invention further provides a computer-readable storage medium on which a program is stored, wherein the program, when executed by a processor, implements the above adaptive adjustment method for pipeline-parallel training oriented to intelligent computing.
Corresponding to the foregoing embodiments of the adaptive adjustment method for pipeline-parallel training oriented to intelligent computing, the present invention also provides embodiments of an adaptive adjustment apparatus for pipeline-parallel training oriented to intelligent computing.
Referring to Fig. 7, an adaptive adjustment apparatus for pipeline-parallel training oriented to intelligent computing provided by an embodiment of the present invention includes a memory and one or more processors, wherein the memory stores executable code, and the one or more processors, when executing the executable code, implement the adaptive adjustment method for pipeline-parallel training oriented to intelligent computing of the above embodiments.
The embodiments of the adaptive adjustment apparatus for pipeline-parallel training oriented to intelligent computing provided by the embodiments of the present invention can be applied to any device with data processing capability, which may be a device or apparatus such as a computer. The apparatus embodiments may be implemented by software, or by hardware or a combination of hardware and software. Taking software implementation as an example, as an apparatus in the logical sense, it is formed by the processor of the device with data processing capability in which it is located reading the corresponding computer program instructions from a non-volatile memory into memory and running them. In terms of hardware, Fig. 7 shows a hardware structure diagram of a device with data processing capability in which the adaptive adjustment apparatus for pipeline-parallel training oriented to intelligent computing of the embodiments of the present invention is located; in addition to the processor, memory, network interface and non-volatile memory shown in Fig. 7, the device in which the apparatus of the embodiment is located may also include other hardware according to the actual functions of that device, which will not be described again here.
For details of the implementation process of the functions and effects of the units in the above apparatus, refer to the implementation process of the corresponding steps in the above method; they are not repeated here.
As for the apparatus embodiments, since they substantially correspond to the method embodiments, reference may be made to the relevant descriptions of the method embodiments. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement this without creative effort.
An embodiment of the present invention further provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the adaptive adjustment method for pipeline-parallel training oriented to intelligent computing of the above embodiments is implemented.
The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of any device with data processing capability, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card or a flash card (Flash Card) provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of any device with data processing capability. The computer-readable storage medium is used to store the computer program and the other programs and data required by the device with data processing capability, and may also be used to temporarily store data that has been or is to be output.
The above descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (14)

1. An intelligent-computing-oriented adaptive adjustment system for pipeline-parallel training, wherein a computing cluster comprises a plurality of computing nodes capable of communicating with one another, each computing node comprises at least one CPU and at least one GPU, a model to be trained comprises multiple layers of sub-models, and a training process of the model to be trained comprises a forward computation stage and a backward computation stage, wherein in the forward computation stage parameters are passed in sequence from the first-layer sub-model of the multiple layers of sub-models to the last-layer sub-model, in the backward computation stage parameters are passed in sequence from the last-layer sub-model to the first-layer sub-model, and each computing node is configured to train at least one sub-model; the system comprises:
    a monitoring module, configured to monitor and collect the resource operation status of each computing node in the computing cluster, determine, according to the resource operation status of each computing node, whether the division of computing tasks on that computing node is balanced, and, when the division of computing tasks on the computing node is unbalanced, determine the imbalance type of the computing node;
    an adjustment module, configured to, when the division of computing tasks on the computing node is unbalanced, determine an adjustment strategy according to the imbalance type of the computing node, and adjust the allocation of sub-models within the computing cluster according to the adjustment strategy;
    wherein the adjustment comprises at least one of the following:
    migrating layers of at least part of the sub-models of a computing node whose computing tasks are unevenly divided from that computing node to other computing nodes;
    controlling a computing node whose computing tasks are unevenly divided to perform CPU-GPU memory swapping or recomputation, or controlling such a computing node to cancel the CPU-GPU memory swapping or recomputation currently being performed;
    adjusting the network topology of the computing cluster.
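Purely by way of orientation, and not as part of the claimed subject matter, the following minimal Python skeleton illustrates how the two modules of claim 1 could cooperate; all class, method, and string names here are hypothetical and chosen only for illustration.

    class MonitoringModule:
        """Collects each node's resource status and classifies imbalance (claim 1)."""

        def collect(self, cluster):
            # Placeholder: a real implementation would query each node's
            # computation latency, GPU utilization, network latency and memory usage.
            return {node: node.stats() for node in cluster}

        def classify(self, node, stats):
            # Returns None when the node's task division is balanced, otherwise an
            # imbalance type such as "too_many_layers" or "network_abnormality"
            # (see claims 2 and 3 for the concrete conditions).
            raise NotImplementedError

    class AdjustmentModule:
        """Maps an imbalance type to one of the claimed adjustments (claim 1)."""

        def adjust(self, node, imbalance_type):
            if imbalance_type == "too_many_layers":
                pass  # migrate layers, or enable/cancel CPU-GPU swapping or recomputation
            elif imbalance_type == "network_abnormality":
                pass  # adjust the network topology of the computing cluster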
2. The intelligent-computing-oriented adaptive adjustment system for pipeline-parallel training according to claim 1, wherein the resource operation status comprises computation latency, GPU utilization, network transmission latency, and memory usage;
    when determining, according to the resource operation status of each computing node, whether the division of computing tasks on that computing node is balanced, the monitoring module is specifically configured to:
    determine that the division of computing tasks on the current computing node is unbalanced when, according to the resource operation status of the current computing node, at least one of the following conditions is found to hold:
    the computation latency of the current computing node is greater than or equal to a preset latency threshold;
    the memory usage of the current computing node is greater than or equal to a preset memory usage threshold, and the GPU utilization of the current computing node is less than the average GPU utilization of all computing nodes in the computing cluster;
    the network latency of the current computing node exceeds a preset multiple of the maximum network latency of the other computing nodes in the computing cluster, wherein the preset multiple is greater than or equal to 1.
3. The intelligent-computing-oriented adaptive adjustment system for pipeline-parallel training according to claim 2, wherein, when determining the imbalance type of a computing node whose computing tasks are unevenly divided, the monitoring module is specifically configured to:
    when the computation latency of the current computing node is greater than or equal to the preset latency threshold, and/or the memory usage of the current computing node is greater than or equal to the preset memory usage threshold and the GPU utilization of the current computing node is less than the average GPU utilization of all computing nodes in the computing cluster, determine that the imbalance type of the current computing node comprises: too many layers allocated to the current computation stage;
    when the network latency of the current computing node exceeds the preset multiple of the maximum network latency of the other computing nodes in the computing cluster, determine that the imbalance type of the current computing node comprises: network abnormality.
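Purely by way of illustration, and not as part of the claims, the sketch below expresses the detection conditions of claims 2 and 3 in Python; the NodeStats fields and the three threshold constants are hypothetical values chosen for illustration.

    from dataclasses import dataclass
    from typing import List, Set

    @dataclass
    class NodeStats:
        compute_latency: float   # seconds per micro-batch (hypothetical unit)
        gpu_utilization: float   # fraction in [0, 1]
        network_latency: float   # seconds
        memory_usage: float      # fraction in [0, 1]

    LATENCY_THRESHOLD = 0.5   # preset latency threshold (illustrative)
    MEMORY_THRESHOLD = 0.9    # preset memory usage threshold (illustrative)
    PRESET_MULTIPLE = 1.5     # preset multiple >= 1 for network latency

    def imbalance_types(node: NodeStats, cluster: List[NodeStats]) -> Set[str]:
        """Return the imbalance types of `node`; an empty set means balanced."""
        types: Set[str] = set()
        avg_gpu = sum(n.gpu_utilization for n in cluster) / len(cluster)
        max_other = max(n.network_latency for n in cluster if n is not node)
        if node.compute_latency >= LATENCY_THRESHOLD:
            types.add("too_many_layers")        # claim 3, first condition
        if node.memory_usage >= MEMORY_THRESHOLD and node.gpu_utilization < avg_gpu:
            types.add("too_many_layers")        # claim 3, second condition
        if node.network_latency > PRESET_MULTIPLE * max_other:
            types.add("network_abnormality")    # claim 3, network condition
        return types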
4. The intelligent-computing-oriented adaptive adjustment system for pipeline-parallel training according to claim 3, wherein, when the computation latency of the current computing node is greater than or equal to the preset latency threshold, the adjustment strategy comprises a computation adjustment strategy;
    the computation adjustment strategy comprises:
    when the current computing node uses CPU-GPU memory swapping or recomputation, cancelling the CPU-GPU memory swapping or recomputation used by the current computing node;
    after the CPU-GPU memory swapping or recomputation used by the current computing node is cancelled, if the memory required by the current computing node to execute the sub-models on the current computing node exceeds the maximum memory of the current computing node, migrating at least some layers of at least part of the sub-models of the current computing node to other computing nodes for execution, according to the GPU utilization of the computing node preceding the current computing node and the GPU utilization of the computing node following the current computing node.
5. The intelligent-computing-oriented adaptive adjustment system for pipeline-parallel training according to claim 4, wherein migrating at least some layers of at least part of the sub-models of the current computing node to other computing nodes for execution according to the GPU utilization of the computing node preceding the current computing node and the GPU utilization of the computing node following the current computing node comprises:
    when the GPU utilization of the computing node preceding the current computing node is less than the GPU utilization of the computing node following the current computing node, taking the computing node preceding the current computing node as an initial target computing node;
    when the computing node preceding the current computing node is the initial target computing node, comparing the GPU utilization of the initial target computing node with the GPU utilization of the computing node preceding the initial target computing node; if the GPU utilization of the initial target computing node is less than the GPU utilization of the computing node preceding it, taking the initial target computing node as the final target computing node; if the GPU utilization of the initial target computing node is greater than the GPU utilization of the computing node preceding it, taking the computing node preceding the initial target computing node as a new initial target computing node and continuing the forward migration comparison in sequence, until the frontmost target computing node is reached;
    migrating at least some layers of at least part of the sub-models of the current computing node to the final target computing node for execution.
6. The intelligent-computing-oriented adaptive adjustment system for pipeline-parallel training according to claim 4, wherein migrating at least some layers of at least part of the sub-models of the current computing node to other computing nodes for execution according to the GPU utilization of the computing node preceding the current computing node and the GPU utilization of the computing node following the current computing node comprises:
    when the GPU utilization of the computing node preceding the current computing node is greater than the GPU utilization of the computing node following the current computing node, taking the computing node following the current computing node as an initial target computing node;
    when the computing node following the current computing node is the initial target computing node, comparing the GPU utilization of the initial target computing node with the GPU utilization of the computing node following the initial target computing node; if the GPU utilization of the initial target computing node is less than the GPU utilization of the computing node following it, taking the initial target computing node as the final target computing node; if the GPU utilization of the initial target computing node is greater than the GPU utilization of the computing node following it, taking the computing node following the initial target computing node as a new initial target computing node and continuing the backward migration comparison in sequence, until the rearmost target computing node is reached;
    migrating at least some sub-units of at least part of the sub-models of the current computing node to the final target computing node for execution.
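As a non-limiting illustration of the symmetric search of claims 5 and 6, the following sketch walks from the current node toward whichever neighbour is less utilized and stops at a local minimum of GPU utilization; `utils` is a hypothetical list of per-node GPU utilizations ordered by pipeline position, and the node at index `cur` is assumed to have a neighbour on each side.

    def pick_target_node(utils, cur):
        """Index of the final target node for layer migration (claims 5 and 6).

        Moves front-ward (claim 5) when the preceding neighbour is less
        utilized than the following one, otherwise back-ward (claim 6),
        and keeps moving while the next node in that direction is even
        less utilized, or until the end of the pipeline is reached.
        """
        step = -1 if utils[cur - 1] < utils[cur + 1] else 1
        target = cur + step
        while 0 <= target + step < len(utils) and utils[target + step] < utils[target]:
            target += step
        return target

For example, with utils = [0.2, 0.5, 0.9, 0.7] and cur = 2, the walk proceeds front-ward and returns index 0, the frontmost target computing node in the sense of claim 5.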
7. The intelligent-computing-oriented adaptive adjustment system for pipeline-parallel training according to claim 5 or 6, wherein the computation adjustment strategy further comprises:
    after at least some layers of at least part of the sub-models of the current computing node are migrated to other computing nodes for execution, regenerating, by the current computing node, model parameters, and updating the model version information of the current computing node.
8. The intelligent-computing-oriented adaptive adjustment system for pipeline-parallel training according to claim 3 or 4, wherein, when the memory usage of the current computing node is greater than or equal to the preset memory usage threshold and the GPU utilization of the current computing node is less than the average GPU utilization of all computing nodes in the computing cluster, the adjustment strategy comprises a memory adjustment strategy;
    the memory adjustment strategy comprises:
    when the GPU overhead of recomputation on the current computing node is greater than the GPU overhead of CPU-GPU memory swapping on the current computing node, using, by the current computing node, CPU-GPU memory swapping to reduce the memory usage of the current computing node;
    when the GPU overhead of recomputation on the current computing node is less than the GPU overhead of CPU-GPU memory swapping on the current computing node, using, by the current computing node, recomputation to reduce the memory usage of the current computing node.
9. The intelligent-computing-oriented adaptive adjustment system for pipeline-parallel training according to claim 8, wherein the memory adjustment strategy further comprises:
    determining the computation duration of the current computing node according to the original task training duration of the current computing node and the duration required by the current computing node to perform the recomputation or the CPU-GPU memory swapping;
    when the computation duration of the current computing node is greater than or equal to the average task training duration of all computing nodes in the computing cluster, migrating at least some sub-units of at least part of the sub-models of the current computing node to a computing node adjacent to the current computing node for execution;
    when the computation duration of the current computing node is less than the average task training duration of all computing nodes in the computing cluster, taking the current computing node as the target computing node into which the sub-units of other computing nodes performing layer migration are migrated.
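Again purely by way of illustration, and not as part of the claims, a sketch of the memory adjustment strategy of claims 8 and 9; the attribute names on `node` (the GPU-overhead and duration estimates) and the returned tuple are hypothetical.

    def memory_adjustment(node, cluster_avg_train_time):
        """Pick swapping vs. recomputation (claim 8) and a migration role (claim 9).

        `node` is assumed to expose hypothetical estimates:
          recompute_gpu_cost / swap_gpu_cost : GPU overhead of each option
          recompute_time / swap_time         : extra wall-clock time of each option
          base_train_time                    : original task training duration
        """
        if node.recompute_gpu_cost > node.swap_gpu_cost:
            chosen, extra = "cpu_gpu_swap", node.swap_time
        else:
            chosen, extra = "recompute", node.recompute_time
        total = node.base_train_time + extra
        if total >= cluster_avg_train_time:
            role = "migrate_out"   # claim 9: shed sub-units to an adjacent node
        else:
            role = "migrate_in"    # claim 9: accept sub-units migrated by other nodes
        return chosen, role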
10. The intelligent-computing-oriented adaptive adjustment system for pipeline-parallel training according to claim 3 or 4, wherein, when the network latency of the current computing node exceeds the preset multiple of the maximum network latency of the other computing nodes in the computing cluster, the adjustment strategy comprises a topology adjustment strategy;
    the topology adjustment strategy comprises:
    selecting the three consecutive computing nodes with the smallest network latency to the current computing node, and determining the maximum network latency among the computing nodes whose networks are not currently delayed;
    exchanging tasks between the current computing node and the middle computing node of the three consecutive computing nodes;
    separately determining whether the network latency of the computing nodes preceding and following each of the two computing nodes whose tasks were exchanged exceeds the maximum network latency; if it does, continuing to select the three consecutive computing nodes with the next-smallest latency and repeating the task exchange process until all computing nodes have been traversed, and, if no computing node not exceeding the maximum network latency exists after the traversal, ending the network topology adjustment of the computing cluster;
    if it does not, exchanging the model parameters and intermediate variables between the two computing nodes whose tasks were exchanged.
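Finally, an illustrative sketch of the task-exchange loop of claim 10; `latency` is a hypothetical pairwise latency matrix, `stages` a hypothetical node-to-stage assignment, and `max_ok` a placeholder for the maximum network latency among the non-delayed nodes described in the claim.

    def neighbours_within(latency, a, b, bound):
        """True if the pipeline neighbours of both exchanged nodes stay within `bound`."""
        n = len(latency)
        return all(latency[x][nb] <= bound
                   for x in (a, b)
                   for nb in (x - 1, x + 1) if 0 <= nb < n)

    def adjust_topology(latency, stages, cur):
        """Exchange tasks between the abnormal node and candidate middle nodes."""
        n = len(latency)
        max_ok = max(latency[cur][i] for i in range(n) if i != cur)  # placeholder bound
        # Middle nodes of three-consecutive-node windows, smallest latency first.
        for mid in sorted((m for m in range(1, n - 1) if m != cur),
                          key=lambda m: latency[cur][m]):
            stages[cur], stages[mid] = stages[mid], stages[cur]  # tentative exchange
            if neighbours_within(latency, cur, mid, max_ok):
                return True   # keep it; model parameters and intermediates follow
            stages[cur], stages[mid] = stages[mid], stages[cur]  # undo, try next window
        return False          # all nodes traversed: end the topology adjustment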
11. The intelligent-computing-oriented adaptive adjustment system for pipeline-parallel training according to claim 10, wherein the topology adjustment strategy further comprises:
    if the memory usage of either of the two computing nodes whose tasks were exchanged is greater than or equal to the preset memory usage threshold, continuing to adjust the allocation of sub-models within the computing cluster using the memory adjustment strategy;
    after the memory adjustment strategy has been used to continue adjusting the allocation of sub-models within the computing cluster, if the memory usage of either of the two computing nodes whose tasks were exchanged is still greater than or equal to the preset memory usage threshold, using the computation adjustment strategy to migrate at least some sub-units of at least part of the sub-models of the computing node whose memory usage is still greater than or equal to the preset memory usage threshold.
12. An intelligent-computing-oriented adaptive adjustment method for pipeline-parallel training, wherein a computing cluster comprises a plurality of computing nodes capable of communicating with one another, each computing node comprises at least one CPU and at least one GPU, a model to be trained comprises multiple layers of sub-models, and a training process of the model to be trained comprises a forward computation stage and a backward computation stage, wherein in the forward computation stage parameters are passed in sequence from the first-layer sub-model of the multiple layers of sub-models to the last-layer sub-model, in the backward computation stage parameters are passed in sequence from the last-layer sub-model to the first-layer sub-model, and each computing node is used to train at least one sub-model; the method comprises:
    monitoring and collecting, by a monitoring module, the resource operation status of each computing node in the computing cluster, determining, according to the resource operation status of each computing node, whether the division of computing tasks on that computing node is balanced, and, when the division of computing tasks on the computing node is unbalanced, determining the imbalance type of the computing node;
    determining, by an adjustment module, when the division of computing tasks on the computing node is unbalanced, an adjustment strategy according to the imbalance type of the computing node, and adjusting the allocation of sub-models within the computing cluster according to the adjustment strategy;
    wherein the adjustment comprises at least one of the following:
    migrating at least some layers of at least part of the sub-models of a computing node whose computing tasks are unevenly divided from that computing node to other computing nodes;
    controlling a computing node whose computing tasks are unevenly divided to perform CPU-GPU memory swapping or recomputation, or controlling such a computing node to cancel the CPU-GPU memory swapping or recomputation currently being performed;
    adjusting the network topology of the computing cluster.
13. An intelligent-computing-oriented adaptive adjustment apparatus for pipeline-parallel training, comprising a memory and one or more processors, wherein executable code is stored in the memory, and, when executing the executable code, the one or more processors are configured to implement the intelligent-computing-oriented adaptive adjustment method for pipeline-parallel training according to claim 12.
14. A computer-readable storage medium on which a program is stored, wherein, when the program is executed by a processor, the intelligent-computing-oriented adaptive adjustment method for pipeline-parallel training according to claim 12 is implemented.
PCT/CN2023/105618 2022-09-21 2023-07-04 Intelligent-computing-oriented adaptive adjustment system and method for pipeline-parallel training WO2024060788A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2023573533A JP2024535971A (en) 2022-09-21 2023-07-04 Self-adaptive tuning system and method for pipeline parallel training for intelligent computing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211147981.3A CN115237580B (en) 2022-09-21 2022-09-21 Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN202211147981.3 2022-09-21

Publications (1)

Publication Number Publication Date
WO2024060788A1 true WO2024060788A1 (en) 2024-03-28

Family

ID=83681984

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/105618 WO2024060788A1 (en) 2022-09-21 2023-07-04 Intelligent-computing-oriented adaptive adjustment system and method for pipeline-parallel training

Country Status (3)

Country Link
JP (1) JP2024535971A (en)
CN (1) CN115237580B (en)
WO (1) WO2024060788A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115237580B (en) * 2022-09-21 2022-12-16 之江实验室 Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN115437795B (en) * 2022-11-07 2023-03-24 东南大学 Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception
CN116050499B (en) * 2023-04-03 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Self-adaptive model partitioning method, system and equipment in model parallel training

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533183A (en) * 2019-08-30 2019-12-03 东南大学 The model partition and task laying method of heterogeneous network perception in a kind of assembly line distribution deep learning
CN112183668A (en) * 2020-11-03 2021-01-05 支付宝(杭州)信息技术有限公司 Method and device for training service models in parallel
CN112784968A (en) * 2021-01-29 2021-05-11 东南大学 Hybrid pipeline parallel method for accelerating distributed deep neural network training
CN114035937A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
CN114490065A (en) * 2022-01-27 2022-05-13 中国科学院微电子研究所 Load prediction method, device and equipment
CN115237580A (en) * 2022-09-21 2022-10-25 之江实验室 Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10884795B2 (en) * 2018-04-26 2021-01-05 International Business Machines Corporation Dynamic accelerator scheduling and grouping for deep learning jobs in a computing cluster
US12056604B2 (en) * 2018-05-23 2024-08-06 Microsoft Technology Licensing, Llc Highly performant pipeline parallel deep neural network training
CN113326002A (en) * 2021-05-22 2021-08-31 清华大学 Cloud edge cooperative control system based on computing migration and migration decision generation method
CN113312178A (en) * 2021-05-24 2021-08-27 河海大学 Assembly line parallel training task allocation method based on deep reinforcement learning
CN114780247B (en) * 2022-05-17 2022-12-13 中国地质大学(北京) Flow application scheduling method and system with flow rate and resource sensing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533183A (en) * 2019-08-30 2019-12-03 东南大学 The model partition and task laying method of heterogeneous network perception in a kind of assembly line distribution deep learning
CN112183668A (en) * 2020-11-03 2021-01-05 支付宝(杭州)信息技术有限公司 Method and device for training service models in parallel
CN112784968A (en) * 2021-01-29 2021-05-11 东南大学 Hybrid pipeline parallel method for accelerating distributed deep neural network training
CN114035937A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
CN114490065A (en) * 2022-01-27 2022-05-13 中国科学院微电子研究所 Load prediction method, device and equipment
CN115237580A (en) * 2022-09-21 2022-10-25 之江实验室 Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method

Also Published As

Publication number Publication date
CN115237580A (en) 2022-10-25
JP2024535971A (en) 2024-10-04
CN115237580B (en) 2022-12-16


Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2023573533

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23867083

Country of ref document: EP

Kind code of ref document: A1