CN115454655B - Dynamic layer migration method in asynchronous pipeline parallel training process - Google Patents

Dynamic layer migration method in asynchronous pipeline parallel training process

Info

Publication number
CN115454655B
CN115454655B (application CN202211415887.1A)
Authority
CN
China
Prior art keywords
node
layer
migration
emigration
migrated
Prior art date
Legal status
Active
Application number
CN202211415887.1A
Other languages
Chinese (zh)
Other versions
CN115454655A (en)
Inventor
何华森
王清河
凌志
沙沫
姜晓枫
谭小彬
杨坚
Current Assignee
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Original Assignee
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority date
Filing date
Publication date
Application filed by Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority to CN202211415887.1A
Publication of CN115454655A
Application granted
Publication of CN115454655B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/50 Indexing scheme relating to G06F9/50
    • G06F 2209/5018 Thread allocation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of network information and discloses a dynamic layer migration method in an asynchronous pipeline parallel training process, comprising the following steps: sending a layer migration message; preparing the resources related to layer migration; performing the layer migration; finishing the layer migration and cleaning up the related resources. When the layer migration is executed, the emigration node performs the weight emigration operation for the layer to be migrated before each forward propagation execution, and the immigration node performs the weight immigration operation for the layer to be migrated after each backward propagation execution, so that the weight transmission process overlaps with the computation process. The method can serve as a supplementary module of an existing asynchronous pipeline parallel training framework; it can adjust the model structure and migrate part of the layers between adjacent nodes without interrupting the current training process, breaks the limitation that the original asynchronous pipeline parallel framework can only train with a fixed model partition structure, and provides a more flexible and effective means of optimizing deep neural network training.

Description

Dynamic layer migration method in asynchronous pipeline parallel training process
Technical Field
The invention relates to the technical field of network information, in particular to a dynamic layer migration method in an asynchronous pipeline parallel training process.
Background
With the development of artificial intelligence and big data technology, the neural networks used for deep learning have grown larger and larger: some high-accuracy deep neural networks reach 1000 layers, and their parameter counts exceed 100 billion. Such ultra-large-scale deep neural networks require a distributed cluster to cooperate in training. One training approach is to split the complete model into several serially dependent blocks, where one block contains several serially executed layers of the model; each compute node (stage) is responsible for training one block, and the whole cluster is coordinated to train the entire model in a pipelined manner. This approach is commonly referred to as model parallelism or pipeline parallelism.
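As a minimal sketch of this block partitioning (a PyTorch-style illustration with hypothetical layer sizes and split points, not the partitioning scheme of any particular framework):

```python
# Minimal sketch of block partitioning for pipeline parallelism
# (hypothetical layer sizes and split points; not the patent's implementation).
import torch.nn as nn

full_model = nn.Sequential(                      # complete model with 6 modules
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10), nn.LogSoftmax(dim=-1),
)

def partition(model: nn.Sequential, boundaries):
    """Split a sequential model into consecutive blocks at the given indices."""
    blocks, start = [], 0
    for end in list(boundaries) + [len(model)]:
        blocks.append(nn.Sequential(*list(model.children())[start:end]))
        start = end
    return blocks

# Stage 0 trains modules 0-1, stage 1 trains modules 2-3, stage 2 trains 4-5;
# activations flow forward and gradients flow backward between adjacent stages.
stage_blocks = partition(full_model, boundaries=[2, 4])
```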
PipeDream is a common asynchronous pipeline parallel framework; it supports pipeline-parallel training and achieves high GPU utilization. However, PipeDream can only statically partition the model to be trained into blocks before the cluster starts training, and it keeps that partition structure until the whole training completes. It cannot dynamically adjust the partition structure during training, that is, it cannot dynamically migrate part of the layers of one node to another node for training.
Deep neural network training usually runs on a distributed computing cluster in a data center. In an actual cluster, training performance may degrade because of resource heterogeneity (for example, different GPU models and computing power on different nodes), network fluctuation (for example, network faults), or multi-task preemption (for example, multiple tasks running on one node). Therefore, during training, the model share on different nodes (i.e., the model partition structure described above) needs to be dynamically adjusted according to the conditions of resources, tasks, network state, and so on, and part of the layers need to be migrated between nodes to balance the computation load and computation time on each node, so that the total training time is ultimately shortened.
To train deep neural networks efficiently, an effective dynamic layer migration method is needed that can migrate partial layers between different nodes quickly and efficiently while satisfying the following requirements: (1) the current training process is not interrupted; (2) the amount of migrated data is as small as possible; and (3) the migration process can overlap with the computation process to reduce the time overhead of migration.
Disclosure of Invention
To solve the above technical problems, the present invention provides a dynamic layer migration method in an asynchronous pipeline parallel training process. A dynamic layer migration module implemented on the basis of the invention runs on each distributed compute node as a supplementary module of an existing asynchronous pipeline parallel training framework. It is responsible for managing and executing all resources and processes involved in the invention, including sending and receiving migration messages; establishing and deleting communication connections between nodes; initializing, adjusting, and deleting the layers of a specified model on a specified node; and sending and receiving data such as the intermediate variables and weights of the specified model layers (i.e., migrating the specified layers between adjacent nodes). This is not emphasized again in the following description.
In order to solve the technical problems, the invention adopts the following technical scheme:
a dynamic layer migration method in an asynchronous pipeline parallel training process is applied to a distributed machine learning scene, and specifically comprises the following steps:
step one, sending a layer migration message:
the emigration node sends a layer migration message to the immigration node, attaching the emigration node's current batch data serial number k and the description information of the layers to be migrated; the description information of the layers to be migrated comprises the number of layers to be migrated and the labels of the layers to be migrated;
step two, preparing layer migration related resources:
dividing the original model structure in the emigration node into two parts, one part being the layers retained in the emigration node and the other part being the layers that need to be migrated to the immigration node, namely the layers to be migrated; initializing the layers to be migrated in the immigration node according to the description information of the layers to be migrated, and adding them to the original model structure of the immigration node; establishing additional migration communication connections between the emigration node and the immigration node for the intermediate variables and the weights of the layers to be migrated; each migration communication connection is an independent thread;
step three, executing layer migration:
the emigration node performs the weight emigration operation of the layer to be migrated before each forward propagation execution, and the immigration node performs the weight immigration operation of the layer to be migrated after each backward propagation execution, so that the weight transmission process overlaps with the computation process;
step four, finishing the layer migration and cleaning the related resources:
after the weight migration operations of all layers to be migrated have been completed, the resources occupied by the migrated layers in the emigration node are deleted; the original communication connection between the emigration node and the immigration node, as well as the migration communication connections, are deleted; a new communication connection between the emigration node and the immigration node is established; a layer to be migrated is called a migrated layer once it has been migrated from the emigration node to the immigration node.
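The four steps can be summarized, from the emigration node's point of view, by the following sketch; all helper names (send_migration_msg, open_migration_conn, and so on) are hypothetical stand-ins for the module described above, not an actual framework API:

```python
# Sketch of the emigration node's view of the four steps (hypothetical helpers;
# assumes a PipeDream-style loop that alternates forward and backward work).
def dynamic_layer_migration(stage, layers_to_move, dst_stage, k):
    # Step 1: send the layer migration message, attaching the current batch
    #         serial number k and the description (count + labels) of the
    #         layers to be migrated.
    stage.send_migration_msg(dst_stage, batch_idx=k,
                             num_layers=len(layers_to_move),
                             labels=[layer.name for layer in layers_to_move])

    # Step 2: split the local model into retained layers and layers to migrate,
    #         and open dedicated migration connections (one thread each) for the
    #         intermediate variables and the weights of those layers.
    retained, to_migrate = stage.split_model(layers_to_move)
    act_conn = stage.open_migration_conn(dst_stage, kind="intermediate")
    weight_conn = stage.open_migration_conn(dst_stage, kind="weights")

    # Step 3: overlap weight transfer with computation: push the weights of a
    #         layer to be migrated right before a forward pass, while the
    #         immigration node pulls them in after its backward passes.
    for layer in to_migrate:
        weight_conn.send_async(layer.state_dict())  # runs in the connection thread
        stage.run_forward_pass()                    # computation proceeds meanwhile

    # Step 4: all weights are out: free the migrated layers, tear down the old
    #         inter-node connection and the migration connections, and build a
    #         new inter-node connection matching the new partition boundary.
    stage.delete_layers(to_migrate)
    act_conn.close()
    weight_conn.close()
    stage.rebuild_neighbor_connection(dst_stage)
```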
Further, in step one, if the emigration node is before the immigration node, that is, the serial number of the emigration node is smaller than that of the immigration node, the emigration node sends the layer migration message before its last forward propagation execution; if the emigration node is behind the immigration node, the emigration node sends the layer migration message before its last back propagation execution.
Further, after the emigration node sends the layer migration message: if the emigration node is before the immigration node, the emigration node executes the content of step two related to the emigration node before the forward propagation of batch data serial number k+1 is executed; if the emigration node is behind the immigration node, that content is executed before the forward propagation of batch data serial number k+2 is executed;
after the immigration node receives the layer migration message and the attached batch data serial number k: if the emigration node is before the immigration node, the immigration node executes the content of step two related to the immigration node before the back propagation of batch data serial number k-N+t+1 is executed, where N is the total number of nodes and t is the serial number of the immigration node; if the emigration node is behind the immigration node, that content is executed before the back propagation of batch data serial number k-N+t+2 is executed.
Further, in step two, the layers retained on the emigration node form the new model structure of the emigration node, and the layers to be migrated are added to the original model structure of the immigration node to obtain the new model structure of the immigration node;
in the layer migration process of step three, an alternate calculation method is used, which specifically comprises the following steps:
when the emigration node and the immigration node execute forward propagation, new model structures of the emigration node and the immigration node are respectively used for calculation, and the migration communication connection is used for transmitting intermediate variables of a layer to be migrated; and when the emigration node and the immigration node execute back propagation, the original model structures of the emigration node and the immigration node are respectively used for calculation, and the original communication connection between the nodes is used for transmitting the intermediate variable of the layer to be migrated.
Compared with the prior art, the invention has the beneficial technical effects that:
the invention provides a dynamic layer migration method in an asynchronous pipeline training process under a distributed machine learning scene aiming at an asynchronous pipeline parallel training framework of the existing distributed deep neural network training, which can be used as a supplementary module of the existing asynchronous pipeline parallel training framework and can realize a dynamic adjusting function of the deep neural network in the training process.
The dynamic layer migration method can realize model structure adjustment between adjacent nodes and partial layer migration operation without interrupting the current training process, breaks through the limitation that the original asynchronous pipeline parallel framework can only use a fixed model division structure for training, and provides a more flexible and effective mode for deep neural network training optimization.
The dynamic layer migration method uses an alternate calculation method during the execution of dynamic layer migration. The alternate calculation method trades a small space cost for the time cost that would otherwise be incurred by recomputing and retransmitting large amounts of data, so the layer migration operation can be completed with almost no impact on the original training process. At the same time, the extra data that must be transmitted is limited to the weights of the layers to be migrated; transmission of large amounts of unnecessary data is avoided, the communication overhead is greatly reduced, and the communication bandwidth of the original training process is effectively kept free.
On the other hand, the dynamic layer migration method adopts a multithreading mechanism, so the communication process and the computation process of training can be executed in parallel; the communication time is hidden inside the computation time, node waiting time caused by communication is effectively reduced or even avoided, and the dynamic layer migration operation is completed almost transparently.
Meanwhile, the dynamic layer migration method supports multiple mutually exclusive pairs of adjacent nodes executing dynamic layer migration operations simultaneously; there is no need to wait for one pair of adjacent nodes to finish before another pair starts, which greatly improves the efficiency of dynamic model adjustment during deep neural network training.
In addition, the dynamic layer migration method can execute dynamic layer migration according to a user-specified strategy and number of layers. The whole function is modularized, so it can easily be embedded into an existing asynchronous pipeline parallel framework without extensive modification of that framework, and it can be flexibly combined with various scheduling strategies or other complex functions in common applications, enabling rapid dynamic model adjustment and effectively optimizing the training efficiency of the whole deep neural network.
Drawings
FIG. 1 is a timing diagram illustrating a dynamic layer migration method according to the present invention;
FIG. 2 is a schematic diagram of the alternate calculation method in the dynamic layer migration method of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
A dynamic layer migration method in an asynchronous pipeline parallel training process in a distributed machine learning scene comprises the following steps:
step S1, sending a layer migration message;
s2, preparing layer migration related resources;
s3, executing layer migration;
and S4, finishing the layer migration and cleaning the related resources.
Further, in step S1, the node that needs to migrate layers out (the emigration node) sends a layer migration message migrate to the target node that the layers will migrate into (the immigration node), attaching the emigration node's current batch data (batch) serial number k and the description information of the layers that need to be migrated (the layers to be migrated), including the number of layers to be migrated, the labels of the layers to be migrated, and so on, in order to notify the immigration node of the layer migration operation to be performed. If the emigration node is before the immigration node, that is, the serial number of the emigration node is smaller than that of the immigration node, the emigration node sends the layer migration message migrate before its last forward propagation execution; conversely, if the emigration node is behind the immigration node, the emigration node sends the layer migration message migrate before its last back propagation execution.
After the emigration node sends the layer migration message migrate: if the emigration node is before the immigration node, the emigration node executes the content of step S2 related to the emigration node before the forward propagation of batch data serial number k+1 is executed; conversely, if the emigration node is behind the immigration node, that content is executed before the forward propagation of batch data serial number k+2 is executed;
after receiving the layer migration message migrate and the attached batch data serial number k, the immigration node executes the content of step S2 related to the immigration node before the back propagation of batch data serial number k-N+t+1 is executed if the emigration node is before the immigration node, where N is the total number of nodes and t is the serial number of the immigration node (node serial numbers are numbered from 0 by default); conversely, if the emigration node is behind the immigration node, that content is executed before the back propagation of batch data serial number k-N+t+2 is executed.
Further, step S2 includes the steps of:
step S21, adjusting the current model structure of the node:
in this step, the emigration node divides its original model structure into two parts: one part is the layers retained in the emigration node, and the other part is the layers that need to be migrated to the target node, namely the layers to be migrated; correspondingly, the immigration node initializes the layers to be migrated and adds them to its original model structure;
step S22, establishing communication connection of intermediate variables and weights of the layer to be migrated:
in this step, the emigration node and the immigration node each establish an additional communication connection for the intermediate variables and an additional communication connection for the weights of the layers to be migrated. These connections are called migration communication connections and are distinct from the original inter-node communication connection; they are dedicated to providing the dynamic layer migration service. They run in multithreaded mode, that is, each migration communication connection is a new thread, so that communication time and computation time can effectively overlap and interference with the original training process is reduced.
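A minimal sketch of one such migration communication thread, assuming a PyTorch torch.distributed point-to-point setup (the queue-based worker and its names are illustrative, not the patent's implementation):

```python
# Sketch: each migration communication connection runs in its own thread so that
# weight/intermediate-variable transfer overlaps with computation (illustrative;
# assumes an already initialized torch.distributed process group).
import queue
import threading
import torch.distributed as dist

def migration_send_worker(tensor_queue: queue.Queue, dst_rank: int, tag: int):
    """Drain a queue of tensors and send each one to the destination node."""
    while True:
        t = tensor_queue.get()
        if t is None:              # sentinel: the connection is being torn down
            break
        dist.send(t.contiguous(), dst=dst_rank, tag=tag)

# One thread per migration communication connection, e.g. one for the weights of
# the layer to be migrated and another for its intermediate variables.
weight_queue = queue.Queue()
weight_thread = threading.Thread(
    target=migration_send_worker, args=(weight_queue, 1, 100), daemon=True)
weight_thread.start()
```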
Further, in step S3, the emigration node always performs the weight emigration operation of the layer to be migrated before each forward propagation is executed, and the immigration node always performs the weight immigration operation of the layer to be migrated after each backward propagation is executed. This ensures that the weights received by the immigration node are the latest, and it overlaps the weight transmission process with the computation process to reduce the time overhead of layer migration;
assume that the serial number of the emigration node is s, the serial number of the immigration node is t, and the total number of nodes is N. If the emigration node is before the immigration node, i.e. s = t-1, the emigration node executes N-s-1 weight emigration operations, involving N-s-1 pairs of forward and backward propagations; if the emigration node is behind the immigration node, i.e. s = t+1, the emigration node executes N-s+1 weight emigration operations, involving N-s+1 pairs of forward and backward propagations; the immigration node always executes N-t weight immigration operations, involving N-t pairs of forward and backward propagations;
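The operation counts in this paragraph can be transcribed directly as follows (illustrative names; adjacent-node migration only, as described above):

```python
# Number of weight emigration / immigration operations, as stated above
# (s: emigration-node serial number, t: immigration-node serial number,
#  N: total number of nodes).
def weight_migration_counts(s, t, N):
    if s == t - 1:      # emigration node directly before the immigration node
        emigration_ops = N - s - 1
    elif s == t + 1:    # emigration node directly after the immigration node
        emigration_ops = N - s + 1
    else:
        raise ValueError("layer migration happens between adjacent nodes only")
    immigration_ops = N - t
    return emigration_ops, immigration_ops
```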
on the other hand, during layer migration the batches in the concurrently executing back propagation passes were computed with the original model structure that existed before the layer migration started, so those batches would become invalid once the model structure is adjusted. To avoid the extra computation overhead of recomputing them, an alternate calculation method is used during layer migration. Specifically, when a node executes forward propagation, it computes with the new model structure, because forward propagation uses new batches that are being computed for the first time, and it transfers the intermediate variables over the migration communication connection. When a node executes back propagation, it computes with the original model structure, because back propagation uses old batches previously computed with the original model structure, and it transfers the intermediate variables over the original inter-node communication connection. This avoids invalidating the old batches that still need back propagation when the model changes, avoids extra recomputation overhead, and ensures that only one of the migration communication connection and the original communication connection is used at any moment, so the new communication connection never competes for network communication bandwidth.
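The alternate calculation rule can be sketched as follows; new_block, old_block, migration_conn and original_conn are hypothetical stand-ins for a node's new/original model structures and its two kinds of communication connections:

```python
# Sketch of the alternate calculation method during layer migration
# (hypothetical objects; not the patent's implementation).
def forward_during_migration(node, batch):
    # Forward passes see only new batches, so they use the NEW model structure
    # and ship the layer-to-be-migrated's intermediate variable over the
    # dedicated migration communication connection.
    out = node.new_block(batch)
    node.migration_conn.send(out)
    return out

def backward_during_migration(node, grad_in):
    # Backward passes belong to batches whose forward ran on the OLD structure,
    # so they use the ORIGINAL model structure and the ORIGINAL inter-node
    # connection, avoiding any recomputation of those old batches.
    grad_out = node.old_block.backward(grad_in)
    node.original_conn.send(grad_out)
    return grad_out
```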
Further, step S4 includes the steps of:
step S41, deleting the original model structure:
in this step, after the weight migration operations of all layers to be migrated have been completed, the emigration node deletes the migrated layers from the GPU and releases the GPU memory space they occupied; a layer to be migrated is called a migrated layer once it has been migrated from the emigration node to the immigration node.
Step S42, deleting the original communication connection between the nodes and the migration communication connection used by the dynamic layer migration:
in this step, the emigration node and the immigration node delete the original inter-node communication connection between them and release and reclaim the resources of the corresponding communication threads; at the same time, they delete the migration communication connections for dynamic layer migration established in step S22, including the communication connections for the intermediate variables and the weights of the layers to be migrated, and release and reclaim the resources of those communication threads;
step S43, establishing new communication connection between nodes:
in this step, because the communication content between the relevant nodes changes before and after the dynamic layer migration operation, a new inter-node communication connection needs to be established. The emigration node and the immigration node each create the corresponding sending thread and receiving thread, and use the newly established communication connection for communication in the subsequent training process.
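A sketch of this cleanup and reconnection on the emigration node is shown below; apart from torch.cuda.empty_cache (a real PyTorch call that releases cached GPU memory), the helpers are hypothetical:

```python
# Sketch of step S4 on the emigration node (hypothetical helpers except
# torch.cuda.empty_cache).
import torch

def finish_migration(stage, migrated_layers, dst_stage,
                     original_conn, migration_conns):
    # S41: drop the migrated layers and free the GPU memory they occupied.
    for layer in migrated_layers:
        stage.model.remove(layer)
    torch.cuda.empty_cache()

    # S42: tear down the original inter-node connection and the migration
    #      connections (intermediate variables + weights), reclaiming threads.
    original_conn.close()
    for conn in migration_conns:
        conn.close()

    # S43: the stage boundary moved, so build a new send/receive thread pair
    #      that matches the new model partition.
    stage.establish_connection(dst_stage)
```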
Examples
In this embodiment, the total number of the computation nodes of the cluster participating in the parallel training of the asynchronous pipeline is four, that is, N =4, and the four nodes are respectively recorded as stage 0, stage 1, stage 2, and stage 3. The batch data (batch) in the present embodiment includes nine batches, which are respectively denoted as batch 1, batch 2, batch 3, batch 4, batch 5, batch 6, batch 7, batch 8, and batch 9.
The original model structure on the stage 0 comprises a layer a, a layer b and a layer c, and the original model structure on the stage 1 comprises a layer d and a layer e; the layer migration requirement related in this embodiment is to dynamically migrate the layer c on the stage 0 from the stage 0 to the stage 1, and ensure that the original training process is not interrupted, so that the dynamic layer migration method in the asynchronous pipeline parallel training process in the distributed machine learning scene according to this embodiment is used to implement the requirement.
In step S1 of this embodiment, the emigration node is located before the immigration node. Therefore, after receiving the dynamic layer migration instruction, stage 0 sends a layer migration message migrate to stage 1 before its last forward propagation execution, that is, before the forward propagation of batch 5, attaching the current batch data serial number k=5 and the description information of the layers to be migrated, which includes the number of layers to be migrated (1) and the label of the layer to be migrated (c). Stage 0 then executes the content of step S2 related to the emigration node before the forward propagation of batch 6; stage 1, upon receiving the layer migration message migrate and the attached batch data serial number k=5, executes the content of step S2 related to the immigration node before the back propagation of batch 3.
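These batch indices agree with the timing formulas of steps S1 and S2; applying the illustrative s2_trigger_batches helper sketched earlier to the embodiment's values:

```python
# Embodiment values: N = 4 nodes, emigration node stage 0 (s = 0),
# immigration node stage 1 (t = 1), migrate message carries k = 5.
fwd, bwd = s2_trigger_batches(k=5, N=4, s=0, t=1)
print(fwd)  # 6: stage 0 prepares resources before the forward pass of batch 6
print(bwd)  # 3: stage 1 prepares resources before the backward pass of batch 3
```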
Step S2 of this embodiment, preparing the resources related to layer migration, includes the following steps:
step S21, adjusting the current model structure of the node:
in step S21, as shown in fig. 1, stage 0 divides its original model structure into two parts: one part is the set of layer a and layer b, and the other part is the layer c to be migrated; both parts remain in stage 0 for the time being. Stage 1 initializes layer c according to the description information of the layer to be migrated received in step S1 and places layer c in stage 1 together with the original layer d and layer e;
s22, establishing communication connection of intermediate variables and weights of the layer to be migrated;
in step S22, stage 0 and stage 1 negotiate to establish a communication connection for the intermediate variable of the layer c to be migrated, namely the input data of layer c, and negotiate to establish a communication connection for the weight of the layer c to be migrated; the two connections are collectively called the migration communication connection. The migration communication connection operates in multithreaded mode so that the communication process and the computation process are parallelized.
Step S3 of this embodiment includes several dynamic weight migration operations. As shown in fig. 1, according to the description of step S3, stage 0 needs to perform the weight emigration operation of the layer c to be migrated 3 times in this step, and similarly stage 1 needs to perform the weight immigration operation of the layer c to be migrated 3 times. Stage 0 transmits the weight of the layer c to be migrated to stage 1 while executing the forward propagation of batch 6; stage 1 receives the weight of the layer c to be migrated from stage 0 while executing the backward propagation of batch 3 and finishes receiving it when the backward propagation of batch 3 finishes. The communication time of the weight of layer c is therefore hidden inside the computation time of the node, and no extra communication time is introduced. The weight migration for the subsequent batch 7 and batch 8 proceeds in the same way as for batch 6;
on the other hand, the alternate calculation method is also used in this step, as shown in fig. 1 and fig. 2. During the execution of the layer migration, the forward propagation passes of stage 0 and stage 1 are executed with the new model structures: stage 0 executes its forward pass using only layer a and layer b, and transmits the output of layer b, that is, the input of layer c, to stage 1 over the newly established migration communication connection; stage 1 receives the input of layer c over the newly established migration communication connection and executes its forward pass using layer c, layer d and layer e together. The back propagation passes of stage 0 and stage 1 are executed with the original model structures: stage 1 executes its backward pass using only layer e and layer d, and transmits the backward result of layer d, that is, the backward input of layer c, to stage 0 over the original inter-node communication connection; stage 0 receives the backward input of layer c over the original inter-node communication connection and executes its backward pass using layer c, layer b and layer a together. Specifically, in this embodiment, while this step is being executed, stage 0 and stage 1 execute the forward propagation of batch 6, batch 7 and batch 8 with the new model structure, and execute the backward propagation of batch 3, batch 4 and batch 5 with the old model structure.
Step S4 of this embodiment, ending the layer migration and clearing the relevant resources, includes the following steps:
step S41, deleting the original model structure:
in step S41, as shown in fig. 2, after the dynamic layer migration operation is completed, stage 0 deletes the migrated layer c from its model structure and reclaims the resources occupied by the original layer c, including host memory, GPU memory, and so on;
step S42, deleting the original communication connection between nodes and the migration communication connection used by the migration of the dynamic layer:
in step S42, stage 0 and stage 1 delete the original inter-node communication connection, which in this embodiment comprises the communication connection for the output data of layer c and the communication connection for the backward input data of layer c, deleting the related threads and reclaiming the related resources; at the same time, stage 0 and stage 1 delete the migration communication connections for dynamic layer migration established in step S22, which comprise the communication connection for the intermediate variable of layer c and the communication connection for the weight of layer c, deleting the related threads and reclaiming the related resources;
step S43, establishing new communication connection between nodes:
in step S43, stage 0 and stage 1 negotiate to establish a new inter-node communication connection, which in this embodiment comprises the communication connection for the output data of layer b and the communication connection for the backward input data of layer b; these connections operate in multithreaded mode so that the communication process and the computation process are parallelized.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not to be construed as limiting the claims.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims (4)

1. A dynamic layer migration method in an asynchronous pipeline parallel training process is characterized in that the dynamic layer migration method is applied to a distributed machine learning scene and specifically comprises the following steps:
step one, sending a layer migration message:
the emigration node sends a layer migration message to the immigration node, attaching the emigration node's current batch data serial number k and the description information of the layers to be migrated; the description information of the layers to be migrated comprises the number of layers to be migrated and the labels of the layers to be migrated;
step two, preparing layer migration related resources:
dividing the original model structure in the emigration node into two parts, one part being the layers retained in the emigration node and the other part being the layers that need to be migrated to the immigration node, namely the layers to be migrated; initializing the layers to be migrated in the immigration node according to the description information of the layers to be migrated, and adding them to the original model structure of the immigration node; establishing additional migration communication connections between the emigration node and the immigration node for the intermediate variables and the weights of the layers to be migrated; each migration communication connection is an independent thread;
step three, executing layer migration:
the emigration node performs the weight emigration operation of the layer to be migrated before each forward propagation execution, and the immigration node performs the weight immigration operation of the layer to be migrated after each backward propagation execution, so that the weight transmission process overlaps with the computation process;
step four, finishing the layer migration and cleaning the related resources:
after the weight migration operations of all layers to be migrated have been completed, the resources occupied by the migrated layers in the emigration node are deleted; the original communication connection between the emigration node and the immigration node, as well as the migration communication connections, are deleted; a new communication connection between the emigration node and the immigration node is established; a layer to be migrated is called a migrated layer once it has been migrated from the emigration node to the immigration node.
2. The method for dynamic layer migration in the parallel training process of asynchronous pipelines according to claim 1, characterized in that: in step one, if the emigration node is before the immigration node, that is, the serial number of the emigration node is smaller than that of the immigration node, the emigration node sends the layer migration message before its last forward propagation execution; if the emigration node is behind the immigration node, the emigration node sends the layer migration message before its last back propagation execution.
3. The method for dynamic layer migration in the parallel training process of asynchronous pipelines according to claim 1, characterized in that: after the emigration node sends the layer migration message, if the emigration node is before the immigration node, the emigration node executes the content of step two related to the emigration node before the forward propagation of batch data serial number k+1 is executed; if the emigration node is behind the immigration node, that content is executed before the forward propagation of batch data serial number k+2 is executed;
after the immigration node receives the layer migration message and the attached batch data serial number k, if the emigration node is before the immigration node, the immigration node executes the content of step two related to the immigration node before the back propagation of batch data serial number k-N+t+1 is executed, where N is the total number of nodes and t is the serial number of the immigration node; if the emigration node is behind the immigration node, that content is executed before the back propagation of batch data serial number k-N+t+2 is executed.
4. The method for dynamic layer migration in the parallel training process of asynchronous pipelines according to claim 1, characterized in that: in step two, the layers retained on the emigration node form the new model structure of the emigration node, and the layers to be migrated are added to the original model structure of the immigration node to obtain the new model structure of the immigration node;
in the layer migration process of step three, an alternate calculation method is used, which specifically comprises the following steps:
when the emigration node and the immigration node execute forward propagation, new model structures of the emigration node and the immigration node are respectively used for calculation, and the migration communication connection is used for transmitting intermediate variables of a layer to be migrated; and when the emigration node and the immigration node execute back propagation, the original model structures of the emigration node and the immigration node are respectively used for calculation, and the original communication connection between the nodes is used for transmitting the intermediate variable of the layer to be migrated.
CN202211415887.1A 2022-11-11 2022-11-11 Dynamic layer migration method in asynchronous pipeline parallel training process Active CN115454655B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211415887.1A CN115454655B (en) 2022-11-11 2022-11-11 Dynamic layer migration method in asynchronous pipeline parallel training process

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211415887.1A CN115454655B (en) 2022-11-11 2022-11-11 Dynamic layer migration method in asynchronous pipeline parallel training process

Publications (2)

Publication Number Publication Date
CN115454655A CN115454655A (en) 2022-12-09
CN115454655B true CN115454655B (en) 2023-03-10

Family

ID=84295391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211415887.1A Active CN115454655B (en) 2022-11-11 2022-11-11 Dynamic layer migration method in asynchronous pipeline parallel training process

Country Status (1)

Country Link
CN (1) CN115454655B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050499B (en) * 2023-04-03 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Self-adaptive model partitioning method, system and equipment in model parallel training

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506667A (en) * 2020-12-22 2021-03-16 北京航空航天大学杭州创新研究院 Deep neural network training method based on multi-task optimization
JP2022025392A (en) * 2020-07-29 2022-02-10 公立大学法人秋田県立大学 Machine learning device and method for mechanical learning
US11367002B1 (en) * 2021-01-06 2022-06-21 Guangdong University Of Technology Method for constructing and training decentralized migration diagram neural network model for production process

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11373266B2 (en) * 2017-05-05 2022-06-28 Intel Corporation Data parallelism and halo exchange for distributed machine learning
US10585703B2 (en) * 2017-06-03 2020-03-10 Apple Inc. Dynamic operation allocation for neural networks
US20200210840A1 (en) * 2018-12-31 2020-07-02 Microsoft Technology Licensing, Llc Adjusting precision and topology parameters for neural network training based on a performance metric
US11741370B2 (en) * 2019-08-28 2023-08-29 International Business Machines Corporation Transfer learning based on cross-domain homophily influences

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022025392A (en) * 2020-07-29 2022-02-10 公立大学法人秋田県立大学 Machine learning device and method for mechanical learning
CN112506667A (en) * 2020-12-22 2021-03-16 北京航空航天大学杭州创新研究院 Deep neural network training method based on multi-task optimization
US11367002B1 (en) * 2021-01-06 2022-06-21 Guangdong University Of Technology Method for constructing and training decentralized migration diagram neural network model for production process

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WBATimeNet: A deep neural network approach for VM live migration in the cloud; Ashish Mangalampalli; Future Generation Computer Systems; 2022-05-26; full text *

Also Published As

Publication number Publication date
CN115454655A (en) 2022-12-09

Similar Documents

Publication Publication Date Title
Li et al. Coding for distributed fog computing
US10764125B2 (en) Method and device for training model in distributed system
CN110262901B (en) Data processing method and data processing system
CN115454655B (en) Dynamic layer migration method in asynchronous pipeline parallel training process
CN111191728B (en) Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
Zhan et al. Pipe-torch: Pipeline-based distributed deep learning in a gpu cluster with heterogeneous networking
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN111324630B (en) MPI-based neural network architecture search parallelization method and equipment
CN109889440A (en) A kind of correcting and eleting codes failure node reconstruct routing resource based on maximum spanning tree
CN116644803B (en) Distributed cooperative training control method, system, device, equipment and storage medium
Kim et al. Efficient large-scale deep learning framework for heterogeneous multi-gpu cluster
CN112256653B (en) Data sampling method and device
US20200387800A1 (en) Scheduling method and related apparatus
CN112114951A (en) Bottom-up distributed scheduling system and method
CN114844781B (en) Method and system for optimizing Shuffle performance for encoding MapReduce under Rack architecture
Mithila et al. Latency-based vector scheduling of many-task applications for a hybrid cloud
CN113821323A (en) Offline job task scheduling algorithm oriented to hybrid deployment data center scene
CN116432743B (en) Method for improving throughput of reinforcement learning system
CN110532091B (en) Graph computation edge vector load balancing method and device based on graph processor
CN116452951B (en) Remote sensing information extraction model distributed training method based on central data pool
US11907725B2 (en) Communication in a computer having multiple processors
CN113138831B (en) Network resetting method and acceleration distributed training method and system based on same
CN113592089A (en) Gradient synchronization method for compressed sensing under distributed deep learning training scene
Li et al. Tree-Based Elastic Parameter Server to Schedule Resources to Accelerate Distributed'Training
Zhao et al. High-throughput Sampling, Communicating and Training for Reinforcement Learning Systems

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant