CN115454655B - Dynamic layer migration method in asynchronous pipeline parallel training process - Google Patents

Dynamic layer migration method in asynchronous pipeline parallel training process

Info

Publication number
CN115454655B
CN115454655B (application CN202211415887.1A)
Authority
CN
China
Prior art keywords
node
layer
migration
emigration
migrated
Prior art date
Legal status
Active
Application number
CN202211415887.1A
Other languages
Chinese (zh)
Other versions
CN115454655A (en)
Inventor
何华森
王清河
凌志
沙沫
姜晓枫
谭小彬
杨坚
Current Assignee
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Original Assignee
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority date
Filing date
Publication date
Application filed by Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority to CN202211415887.1A
Publication of CN115454655A
Application granted
Publication of CN115454655B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/50 Indexing scheme relating to G06F9/50
    • G06F 2209/5018 Thread allocation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of network information and discloses a dynamic layer migration method in an asynchronous pipeline parallel training process, comprising the following steps: sending a layer migration message; preparing the resources related to layer migration; performing the layer migration; finishing the layer migration and cleaning up the related resources. When the layer migration is executed, the emigration node performs the weight emigration operation for the layer to be migrated before each forward propagation execution, and the immigration node performs the weight immigration operation for the layer to be migrated after each backward propagation execution, so that the weight transmission process overlaps with the computation process. The method can serve as a supplementary module of an existing asynchronous pipeline parallel training framework; it can adjust the model structure and migrate part of the layers between adjacent nodes without interrupting the current training process, breaks the limitation that the original asynchronous pipeline parallel framework can only train with a fixed model partition structure, and provides a more flexible and effective means of optimizing deep neural network training.

Description

Dynamic layer migration method in asynchronous pipeline parallel training process
Technical Field
The invention relates to the technical field of network information, in particular to a dynamic layer migration method in an asynchronous pipeline parallel training process.
Background
With the development of artificial intelligence and big data technology, the neural networks used for deep learning have grown larger and larger: some high-accuracy deep neural networks reach 1000 layers, and their parameter counts exceed 100 billion. Such ultra-large-scale deep neural networks require a distributed cluster to cooperate in training. One training approach is to split the complete model into several serially dependent blocks, where one block contains several serially executed layers of the model; each compute node (stage) is responsible for training one block, and the whole cluster is coordinated to train the entire model in a pipelined manner. This approach is commonly referred to as model parallelism or pipeline parallelism.
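As a minimal sketch of this block partitioning (a PyTorch-style illustration with hypothetical layer sizes and split points, not the partitioning scheme of any particular framework):

```python
# Minimal sketch of block partitioning for pipeline parallelism
# (hypothetical layer sizes and split points; not the patent's implementation).
import torch.nn as nn

full_model = nn.Sequential(                      # complete model with 6 modules
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10), nn.LogSoftmax(dim=-1),
)

def partition(model: nn.Sequential, boundaries):
    """Split a sequential model into consecutive blocks at the given indices."""
    blocks, start = [], 0
    for end in list(boundaries) + [len(model)]:
        blocks.append(nn.Sequential(*list(model.children())[start:end]))
        start = end
    return blocks

# Stage 0 trains modules 0-1, stage 1 trains modules 2-3, stage 2 trains 4-5;
# activations flow forward and gradients flow backward between adjacent stages.
stage_blocks = partition(full_model, boundaries=[2, 4])
```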
PipeDream is a common asynchronous pipeline parallel framework; it supports pipeline-parallel training and achieves high GPU utilization. However, PipeDream can only statically partition the model to be trained into blocks before the cluster starts training, and it keeps that partition structure until the whole training completes. It cannot dynamically adjust the partition structure during training, that is, it cannot dynamically migrate part of the layers of one node to another node for training.
Deep neural network training usually runs on a distributed computing cluster in a data center. In an actual cluster, training performance may degrade because of resource heterogeneity (for example, different GPU models and computing power on different nodes), network fluctuation (for example, network faults), or multi-task preemption (for example, multiple tasks running on one node). Therefore, during training, the model share on different nodes (i.e., the model partition structure described above) needs to be dynamically adjusted according to the conditions of resources, tasks, network state, and so on, and part of the layers need to be migrated between nodes to balance the computation load and computation time on each node, so that the total training time is ultimately shortened.
To train deep neural networks efficiently, an effective dynamic layer migration method is needed that can migrate partial layers between different nodes quickly and efficiently while satisfying the following requirements: (1) the current training process is not interrupted; (2) the amount of migrated data is as small as possible; and (3) the migration process can overlap with the computation process to reduce the time overhead of migration.
Disclosure of Invention
To solve the above technical problems, the present invention provides a dynamic layer migration method in an asynchronous pipeline parallel training process. A dynamic layer migration module implemented on the basis of the invention runs on each distributed compute node as a supplementary module of an existing asynchronous pipeline parallel training framework. It is responsible for managing and executing all resources and processes involved in the invention, including sending and receiving migration messages; establishing and deleting communication connections between nodes; initializing, adjusting, and deleting the layers of a specified model on a specified node; and sending and receiving data such as the intermediate variables and weights of the specified model layers (i.e., migrating the specified layers between adjacent nodes). This is not emphasized again in the following description.
In order to solve the technical problems, the invention adopts the following technical scheme:
a dynamic layer migration method in an asynchronous pipeline parallel training process is applied to a distributed machine learning scene, and specifically comprises the following steps:
step one, sending a layer migration message:
the emigration node sends a layer migration message to the immigration node, attaching the emigration node's current batch data serial number k and the description information of the layers to be migrated; the description information of the layers to be migrated comprises the number of layers to be migrated and the labels of the layers to be migrated;
step two, preparing layer migration related resources:
dividing the original model structure in the emigration node into two parts, one part being the layers retained in the emigration node and the other part being the layers that need to be migrated to the immigration node, namely the layers to be migrated; initializing the layers to be migrated in the immigration node according to the description information of the layers to be migrated, and adding them to the original model structure of the immigration node; establishing additional migration communication connections between the emigration node and the immigration node for the intermediate variables and the weights of the layers to be migrated; each migration communication connection is an independent thread;
step three, executing layer migration:
the emigration node performs the weight emigration operation of the layer to be migrated before each forward propagation execution, and the immigration node performs the weight immigration operation of the layer to be migrated after each backward propagation execution, so that the weight transmission process overlaps with the computation process;
step four, finishing the layer migration and cleaning the related resources:
after the weight migration operations of all layers to be migrated have been completed, the resources occupied by the migrated layers in the emigration node are deleted; the original communication connection between the emigration node and the immigration node, as well as the migration communication connections, are deleted; a new communication connection between the emigration node and the immigration node is established; a layer to be migrated is called a migrated layer once it has been migrated from the emigration node to the immigration node.
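The four steps can be summarized, from the emigration node's point of view, by the following sketch; all helper names (send_migration_msg, open_migration_conn, and so on) are hypothetical stand-ins for the module described above, not an actual framework API:

```python
# Sketch of the emigration node's view of the four steps (hypothetical helpers;
# assumes a PipeDream-style loop that alternates forward and backward work).
def dynamic_layer_migration(stage, layers_to_move, dst_stage, k):
    # Step 1: send the layer migration message, attaching the current batch
    #         serial number k and the description (count + labels) of the
    #         layers to be migrated.
    stage.send_migration_msg(dst_stage, batch_idx=k,
                             num_layers=len(layers_to_move),
                             labels=[layer.name for layer in layers_to_move])

    # Step 2: split the local model into retained layers and layers to migrate,
    #         and open dedicated migration connections (one thread each) for the
    #         intermediate variables and the weights of those layers.
    retained, to_migrate = stage.split_model(layers_to_move)
    act_conn = stage.open_migration_conn(dst_stage, kind="intermediate")
    weight_conn = stage.open_migration_conn(dst_stage, kind="weights")

    # Step 3: overlap weight transfer with computation: push the weights of a
    #         layer to be migrated right before a forward pass, while the
    #         immigration node pulls them in after its backward passes.
    for layer in to_migrate:
        weight_conn.send_async(layer.state_dict())  # runs in the connection thread
        stage.run_forward_pass()                    # computation proceeds meanwhile

    # Step 4: all weights are out: free the migrated layers, tear down the old
    #         inter-node connection and the migration connections, and build a
    #         new inter-node connection matching the new partition boundary.
    stage.delete_layers(to_migrate)
    act_conn.close()
    weight_conn.close()
    stage.rebuild_neighbor_connection(dst_stage)
```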
Further, in step one, if the emigration node is before the immigration node, that is, the serial number of the emigration node is smaller than that of the immigration node, the emigration node sends the layer migration message before its last forward propagation execution; if the emigration node is behind the immigration node, the emigration node sends the layer migration message before its last back propagation execution.
Further, after the emigration node sends the layer migration message: if the emigration node is before the immigration node, the emigration node executes the content of step two related to the emigration node before the forward propagation of batch data serial number k+1 is executed; if the emigration node is behind the immigration node, that content is executed before the forward propagation of batch data serial number k+2 is executed;
after the immigration node receives the layer migration message and the attached batch data serial number k: if the emigration node is before the immigration node, the immigration node executes the content of step two related to the immigration node before the back propagation of batch data serial number k-N+t+1 is executed, where N is the total number of nodes and t is the serial number of the immigration node; if the emigration node is behind the immigration node, that content is executed before the back propagation of batch data serial number k-N+t+2 is executed.
Further, in step two, the layers retained on the emigration node form the new model structure of the emigration node, and the layers to be migrated are added to the original model structure of the immigration node to obtain the new model structure of the immigration node;
in the layer migration process of step three, an alternate calculation method is used, which specifically comprises the following steps:
when the emigration node and the immigration node execute forward propagation, new model structures of the emigration node and the immigration node are respectively used for calculation, and the migration communication connection is used for transmitting intermediate variables of a layer to be migrated; and when the emigration node and the immigration node execute back propagation, the original model structures of the emigration node and the immigration node are respectively used for calculation, and the original communication connection between the nodes is used for transmitting the intermediate variable of the layer to be migrated.
Compared with the prior art, the invention has the beneficial technical effects that:
the invention provides a dynamic layer migration method in an asynchronous pipeline training process under a distributed machine learning scene aiming at an asynchronous pipeline parallel training framework of the existing distributed deep neural network training, which can be used as a supplementary module of the existing asynchronous pipeline parallel training framework and can realize a dynamic adjusting function of the deep neural network in the training process.
The dynamic layer migration method can realize model structure adjustment between adjacent nodes and partial layer migration operation without interrupting the current training process, breaks through the limitation that the original asynchronous pipeline parallel framework can only use a fixed model division structure for training, and provides a more flexible and effective mode for deep neural network training optimization.
The dynamic layer migration method uses an alternate calculation method during the execution of dynamic layer migration. The alternate calculation method trades a small space cost for the time cost that would otherwise be incurred by recomputing and retransmitting large amounts of data, so the layer migration operation can be completed with almost no impact on the original training process. At the same time, the extra data that must be transmitted is limited to the weights of the layers to be migrated; transmission of large amounts of unnecessary data is avoided, the communication overhead is greatly reduced, and the communication bandwidth of the original training process is effectively kept free.
On the other hand, the dynamic layer migration method adopts a multithreading mechanism, so the communication process and the computation process of training can be executed in parallel; the communication time is hidden inside the computation time, node waiting time caused by communication is effectively reduced or even avoided, and the dynamic layer migration operation is completed almost transparently.
Meanwhile, the dynamic layer migration method supports multiple mutually exclusive pairs of adjacent nodes executing dynamic layer migration operations simultaneously; there is no need to wait for one pair of adjacent nodes to finish before another pair starts, which greatly improves the efficiency of dynamic model adjustment during deep neural network training.
In addition, the dynamic layer migration method can execute dynamic layer migration according to a user-specified strategy and number of layers. The whole function is modularized, so it can easily be embedded into an existing asynchronous pipeline parallel framework without extensive modification of that framework, and it can be flexibly combined with various scheduling strategies or other complex functions in common applications, enabling rapid dynamic model adjustment and effectively optimizing the training efficiency of the whole deep neural network.
Drawings
FIG. 1 is a timing diagram illustrating a dynamic layer migration method according to the present invention;
FIG. 2 is a schematic diagram of the alternate calculation method in the dynamic layer migration method of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
A dynamic layer migration method in an asynchronous pipeline parallel training process in a distributed machine learning scene comprises the following steps:
step S1, sending a layer migration message;
s2, preparing layer migration related resources;
s3, executing layer migration;
and S4, finishing the layer migration and cleaning the related resources.
Further, in step S1, the node that needs to migrate layers out (the emigration node) sends a layer migration message migrate to the target node that the layers will migrate into (the immigration node), attaching the emigration node's current batch data (batch) serial number k and the description information of the layers that need to be migrated (the layers to be migrated), including the number of layers to be migrated, the labels of the layers to be migrated, and so on, in order to notify the immigration node of the layer migration operation to be performed. If the emigration node is before the immigration node, that is, the serial number of the emigration node is smaller than that of the immigration node, the emigration node sends the layer migration message migrate before its last forward propagation execution; conversely, if the emigration node is behind the immigration node, the emigration node sends the layer migration message migrate before its last back propagation execution.
After the emigration node sends the layer migration message migrate: if the emigration node is before the immigration node, the emigration node executes the content of step S2 related to the emigration node before the forward propagation of batch data serial number k+1 is executed; conversely, if the emigration node is behind the immigration node, that content is executed before the forward propagation of batch data serial number k+2 is executed;
after receiving the layer migration message migrate and the attached batch data serial number k, the immigration node executes the content of step S2 related to the immigration node before the back propagation of batch data serial number k-N+t+1 is executed if the emigration node is before the immigration node, where N is the total number of nodes and t is the serial number of the immigration node (node serial numbers are numbered from 0 by default); conversely, if the emigration node is behind the immigration node, that content is executed before the back propagation of batch data serial number k-N+t+2 is executed.
Further, step S2 includes the steps of:
step S21, adjusting the current model structure of the node:
in this step, the emigration node divides its original model structure into two parts: one part is the layers retained in the emigration node, and the other part is the layers that need to be migrated to the target node, namely the layers to be migrated; correspondingly, the immigration node initializes the layers to be migrated and adds them to its original model structure;
step S22, establishing communication connection of intermediate variables and weights of the layer to be migrated:
in this step, the emigration node and the immigration node each establish an additional communication connection for the intermediate variables and an additional communication connection for the weights of the layers to be migrated. These connections are called migration communication connections and are distinct from the original inter-node communication connection; they are dedicated to providing the dynamic layer migration service. They run in multithreaded mode, that is, each migration communication connection is a new thread, so that communication time and computation time can effectively overlap and interference with the original training process is reduced.
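A minimal sketch of one such migration communication thread, assuming a PyTorch torch.distributed point-to-point setup (the queue-based worker and its names are illustrative, not the patent's implementation):

```python
# Sketch: each migration communication connection runs in its own thread so that
# weight/intermediate-variable transfer overlaps with computation (illustrative;
# assumes an already initialized torch.distributed process group).
import queue
import threading
import torch.distributed as dist

def migration_send_worker(tensor_queue: queue.Queue, dst_rank: int, tag: int):
    """Drain a queue of tensors and send each one to the destination node."""
    while True:
        t = tensor_queue.get()
        if t is None:              # sentinel: the connection is being torn down
            break
        dist.send(t.contiguous(), dst=dst_rank, tag=tag)

# One thread per migration communication connection, e.g. one for the weights of
# the layer to be migrated and another for its intermediate variables.
weight_queue = queue.Queue()
weight_thread = threading.Thread(
    target=migration_send_worker, args=(weight_queue, 1, 100), daemon=True)
weight_thread.start()
```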
Further, in step S3, the emigration node always performs the weight emigration operation of the layer to be migrated before each forward propagation is executed, and the immigration node always performs the weight immigration operation of the layer to be migrated after each backward propagation is executed. This ensures that the weights received by the immigration node are the latest, and it overlaps the weight transmission process with the computation process to reduce the time overhead of layer migration;
assume that the serial number of the emigration node is s, the serial number of the immigration node is t, and the total number of nodes is N. If the emigration node is before the immigration node, i.e. s = t-1, the emigration node executes N-s-1 weight emigration operations, involving N-s-1 pairs of forward and backward propagations; if the emigration node is behind the immigration node, i.e. s = t+1, the emigration node executes N-s+1 weight emigration operations, involving N-s+1 pairs of forward and backward propagations; the immigration node always executes N-t weight immigration operations, involving N-t pairs of forward and backward propagations;
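The operation counts in this paragraph can be transcribed directly as follows (illustrative names; adjacent-node migration only, as described above):

```python
# Number of weight emigration / immigration operations, as stated above
# (s: emigration-node serial number, t: immigration-node serial number,
#  N: total number of nodes).
def weight_migration_counts(s, t, N):
    if s == t - 1:      # emigration node directly before the immigration node
        emigration_ops = N - s - 1
    elif s == t + 1:    # emigration node directly after the immigration node
        emigration_ops = N - s + 1
    else:
        raise ValueError("layer migration happens between adjacent nodes only")
    immigration_ops = N - t
    return emigration_ops, immigration_ops
```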
on the other hand, during layer migration the batches in the concurrently executing back propagation passes were computed with the original model structure that existed before the layer migration started, so those batches would become invalid once the model structure is adjusted. To avoid the extra computation overhead of recomputing them, an alternate calculation method is used during layer migration. Specifically, when a node executes forward propagation, it computes with the new model structure, because forward propagation uses new batches that are being computed for the first time, and it transfers the intermediate variables over the migration communication connection. When a node executes back propagation, it computes with the original model structure, because back propagation uses old batches previously computed with the original model structure, and it transfers the intermediate variables over the original inter-node communication connection. This avoids invalidating the old batches that still need back propagation when the model changes, avoids extra recomputation overhead, and ensures that only one of the migration communication connection and the original communication connection is used at any moment, so the new communication connection never competes for network communication bandwidth.
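The alternate calculation rule can be sketched as follows; new_block, old_block, migration_conn and original_conn are hypothetical stand-ins for a node's new/original model structures and its two kinds of communication connections:

```python
# Sketch of the alternate calculation method during layer migration
# (hypothetical objects; not the patent's implementation).
def forward_during_migration(node, batch):
    # Forward passes see only new batches, so they use the NEW model structure
    # and ship the layer-to-be-migrated's intermediate variable over the
    # dedicated migration communication connection.
    out = node.new_block(batch)
    node.migration_conn.send(out)
    return out

def backward_during_migration(node, grad_in):
    # Backward passes belong to batches whose forward ran on the OLD structure,
    # so they use the ORIGINAL model structure and the ORIGINAL inter-node
    # connection, avoiding any recomputation of those old batches.
    grad_out = node.old_block.backward(grad_in)
    node.original_conn.send(grad_out)
    return grad_out
```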
Further, step S4 includes the steps of:
step S41, deleting the original model structure:
in this step, after the weight migration operations of all layers to be migrated have been completed, the emigration node deletes the migrated layers from the GPU and releases the GPU memory space they occupied; a layer to be migrated is called a migrated layer once it has been migrated from the emigration node to the immigration node.
Step S42, deleting the original communication connection between the nodes and the migration communication connection used by the dynamic layer migration:
in this step, the emigration node and the immigration node delete the original inter-node communication connection between them and release and reclaim the resources of the corresponding communication threads; at the same time, they delete the migration communication connections for dynamic layer migration established in step S22, including the communication connections for the intermediate variables and the weights of the layers to be migrated, and release and reclaim the resources of those communication threads;
step S43, establishing new communication connection between nodes:
in this step, because the communication content between the relevant nodes changes before and after the dynamic layer migration operation, a new inter-node communication connection needs to be established. The emigration node and the immigration node each create the corresponding sending thread and receiving thread, and use the newly established communication connection for communication in the subsequent training process.
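A sketch of this cleanup and reconnection on the emigration node is shown below; apart from torch.cuda.empty_cache (a real PyTorch call that releases cached GPU memory), the helpers are hypothetical:

```python
# Sketch of step S4 on the emigration node (hypothetical helpers except
# torch.cuda.empty_cache).
import torch

def finish_migration(stage, migrated_layers, dst_stage,
                     original_conn, migration_conns):
    # S41: drop the migrated layers and free the GPU memory they occupied.
    for layer in migrated_layers:
        stage.model.remove(layer)
    torch.cuda.empty_cache()

    # S42: tear down the original inter-node connection and the migration
    #      connections (intermediate variables + weights), reclaiming threads.
    original_conn.close()
    for conn in migration_conns:
        conn.close()

    # S43: the stage boundary moved, so build a new send/receive thread pair
    #      that matches the new model partition.
    stage.establish_connection(dst_stage)
```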
Examples
In this embodiment, the total number of the computation nodes of the cluster participating in the parallel training of the asynchronous pipeline is four, that is, N =4, and the four nodes are respectively recorded as stage 0, stage 1, stage 2, and stage 3. The batch data (batch) in the present embodiment includes nine batches, which are respectively denoted as batch 1, batch 2, batch 3, batch 4, batch 5, batch 6, batch 7, batch 8, and batch 9.
The original model structure on the stage 0 comprises a layer a, a layer b and a layer c, and the original model structure on the stage 1 comprises a layer d and a layer e; the layer migration requirement related in this embodiment is to dynamically migrate the layer c on the stage 0 from the stage 0 to the stage 1, and ensure that the original training process is not interrupted, so that the dynamic layer migration method in the asynchronous pipeline parallel training process in the distributed machine learning scene according to this embodiment is used to implement the requirement.
In step S1 of this embodiment, the emigration node is located before the immigration node. Therefore, after receiving the dynamic layer migration instruction, stage 0 sends a layer migration message migrate to stage 1 before its last forward propagation execution, that is, before the forward propagation of batch 5, attaching the current batch data serial number k=5 and the description information of the layers to be migrated, which includes the number of layers to be migrated (1) and the label of the layer to be migrated (c). Stage 0 then executes the content of step S2 related to the emigration node before the forward propagation of batch 6; stage 1, upon receiving the layer migration message migrate and the attached batch data serial number k=5, executes the content of step S2 related to the immigration node before the back propagation of batch 3.
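These batch indices agree with the timing formulas of steps S1 and S2; applying the illustrative s2_trigger_batches helper sketched earlier to the embodiment's values:

```python
# Embodiment values: N = 4 nodes, emigration node stage 0 (s = 0),
# immigration node stage 1 (t = 1), migrate message carries k = 5.
fwd, bwd = s2_trigger_batches(k=5, N=4, s=0, t=1)
print(fwd)  # 6: stage 0 prepares resources before the forward pass of batch 6
print(bwd)  # 3: stage 1 prepares resources before the backward pass of batch 3
```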
Step S2 of this embodiment, preparing the resources related to layer migration, includes the following steps:
step S21, adjusting the current model structure of the node:
in step S21, as shown in fig. 1, stage 0 divides its original model structure into two parts: one part is the set of layer a and layer b, and the other part is the layer c to be migrated; both parts remain in stage 0 for the time being. Stage 1 initializes layer c according to the description information of the layer to be migrated received in step S1 and places layer c in stage 1 together with the original layer d and layer e;
s22, establishing communication connection of intermediate variables and weights of the layer to be migrated;
in step S22, stage 0 and stage 1 negotiate to establish a communication connection for the intermediate variable of the layer c to be migrated, namely the input data of layer c, and negotiate to establish a communication connection for the weight of the layer c to be migrated; the two connections are collectively called the migration communication connection. The migration communication connection operates in multithreaded mode so that the communication process and the computation process are parallelized.
Step S3 of this embodiment includes several dynamic weight migration operations. As shown in fig. 1, according to the description of step S3, stage 0 needs to perform the weight emigration operation of the layer c to be migrated 3 times in this step, and similarly stage 1 needs to perform the weight immigration operation of the layer c to be migrated 3 times. Stage 0 transmits the weight of the layer c to be migrated to stage 1 while executing the forward propagation of batch 6; stage 1 receives the weight of the layer c to be migrated from stage 0 while executing the backward propagation of batch 3 and finishes receiving it when the backward propagation of batch 3 finishes. The communication time of the weight of layer c is therefore hidden inside the computation time of the node, and no extra communication time is introduced. The weight migration for the subsequent batch 7 and batch 8 proceeds in the same way as for batch 6;
on the other hand, the alternate calculation method is also used in this step, as shown in fig. 1 and fig. 2. During the execution of the layer migration, the forward propagation passes of stage 0 and stage 1 are executed with the new model structures: stage 0 executes its forward pass using only layer a and layer b, and transmits the output of layer b, that is, the input of layer c, to stage 1 over the newly established migration communication connection; stage 1 receives the input of layer c over the newly established migration communication connection and executes its forward pass using layer c, layer d and layer e together. The back propagation passes of stage 0 and stage 1 are executed with the original model structures: stage 1 executes its backward pass using only layer e and layer d, and transmits the backward result of layer d, that is, the backward input of layer c, to stage 0 over the original inter-node communication connection; stage 0 receives the backward input of layer c over the original inter-node communication connection and executes its backward pass using layer c, layer b and layer a together. Specifically, in this embodiment, while this step is being executed, stage 0 and stage 1 execute the forward propagation of batch 6, batch 7 and batch 8 with the new model structure, and execute the backward propagation of batch 3, batch 4 and batch 5 with the old model structure.
Step S4 of this embodiment, ending the layer migration and clearing the relevant resources, includes the following steps:
step S41, deleting the original model structure:
in step S41, as shown in fig. 2, after the dynamic layer migration operation is completed, stage 0 deletes the migrated layer c from its model structure and reclaims the resources occupied by the original layer c, including host memory, GPU memory, and so on;
step S42, deleting the original communication connection between nodes and the migration communication connection used by the migration of the dynamic layer:
in step S42, stage 0 and stage 1 delete the original inter-node communication connection, which in this embodiment comprises the communication connection for the output data of layer c and the communication connection for the backward input data of layer c, deleting the related threads and reclaiming the related resources; at the same time, stage 0 and stage 1 delete the migration communication connections for dynamic layer migration established in step S22, which comprise the communication connection for the intermediate variable of layer c and the communication connection for the weight of layer c, deleting the related threads and reclaiming the related resources;
step S43, establishing new communication connection between nodes:
in step S43, stage 0 and stage 1 negotiate to establish a new inter-node communication connection, which in this embodiment comprises the communication connection for the output data of layer b and the communication connection for the backward input data of layer b; these connections operate in multithreaded mode so that the communication process and the computation process are parallelized.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not to be construed as limiting the claims.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims (4)

1. A dynamic layer migration method in an asynchronous pipeline parallel training process is characterized in that the dynamic layer migration method is applied to a distributed machine learning scene and specifically comprises the following steps:
step one, sending a layer migration message:
the emigration node sends a layer migration message to the immigration node, attaching the emigration node's current batch data serial number k and the description information of the layers to be migrated; the description information of the layers to be migrated comprises the number of layers to be migrated and the labels of the layers to be migrated;
step two, preparing layer migration related resources:
dividing the original model structure in the emigration node into two parts, one part being the layers retained in the emigration node and the other part being the layers that need to be migrated to the immigration node, namely the layers to be migrated; initializing the layers to be migrated in the immigration node according to the description information of the layers to be migrated, and adding them to the original model structure of the immigration node; establishing additional migration communication connections between the emigration node and the immigration node for the intermediate variables and the weights of the layers to be migrated; each migration communication connection is an independent thread;
step three, executing layer migration:
the emigration node performs the weight emigration operation of the layer to be migrated before each forward propagation execution, and the immigration node performs the weight immigration operation of the layer to be migrated after each backward propagation execution, so that the weight transmission process overlaps with the computation process;
step four, finishing the layer migration and cleaning the related resources:
after the weight migration operations of all layers to be migrated have been completed, the resources occupied by the migrated layers in the emigration node are deleted; the original communication connection between the emigration node and the immigration node, as well as the migration communication connections, are deleted; a new communication connection between the emigration node and the immigration node is established; a layer to be migrated is called a migrated layer once it has been migrated from the emigration node to the immigration node.
2. The method for dynamic layer migration in the parallel training process of asynchronous pipelines according to claim 1, characterized in that: in step one, if the emigration node is before the immigration node, that is, the serial number of the emigration node is smaller than that of the immigration node, the emigration node sends the layer migration message before its last forward propagation execution; if the emigration node is behind the immigration node, the emigration node sends the layer migration message before its last back propagation execution.
3. The method for dynamic layer migration in the parallel training process of asynchronous pipelines according to claim 1, characterized in that: after the emigration node sends the layer migration message, if the emigration node is before the immigration node, the emigration node executes the content of step two related to the emigration node before the forward propagation of batch data serial number k+1 is executed; if the emigration node is behind the immigration node, that content is executed before the forward propagation of batch data serial number k+2 is executed;
after the immigration node receives the layer migration message and the attached batch data serial number k, if the emigration node is before the immigration node, the immigration node executes the content of step two related to the immigration node before the back propagation of batch data serial number k-N+t+1 is executed, where N is the total number of nodes and t is the serial number of the immigration node; if the emigration node is behind the immigration node, that content is executed before the back propagation of batch data serial number k-N+t+2 is executed.
4. The method for dynamic layer migration in the parallel training process of asynchronous pipelines according to claim 1, characterized in that: in step two, the layers retained on the emigration node form the new model structure of the emigration node, and the layers to be migrated are added to the original model structure of the immigration node to obtain the new model structure of the immigration node;
in the layer migration process of step three, an alternate calculation method is used, which specifically comprises the following steps:
when the emigration node and the immigration node execute forward propagation, new model structures of the emigration node and the immigration node are respectively used for calculation, and the migration communication connection is used for transmitting intermediate variables of a layer to be migrated; and when the emigration node and the immigration node execute back propagation, the original model structures of the emigration node and the immigration node are respectively used for calculation, and the original communication connection between the nodes is used for transmitting the intermediate variable of the layer to be migrated.
CN202211415887.1A 2022-11-11 2022-11-11 Dynamic layer migration method in asynchronous pipeline parallel training process Active CN115454655B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211415887.1A CN115454655B (en) 2022-11-11 2022-11-11 Dynamic layer migration method in asynchronous pipeline parallel training process

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211415887.1A CN115454655B (en) 2022-11-11 2022-11-11 Dynamic layer migration method in asynchronous pipeline parallel training process

Publications (2)

Publication Number Publication Date
CN115454655A CN115454655A (en) 2022-12-09
CN115454655B true CN115454655B (en) 2023-03-10

Family

ID=84295391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211415887.1A Active CN115454655B (en) 2022-11-11 2022-11-11 Dynamic layer migration method in asynchronous pipeline parallel training process

Country Status (1)

Country Link
CN (1) CN115454655B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050499B (en) * 2023-04-03 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Self-adaptive model partitioning method, system and equipment in model parallel training

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506667A (en) * 2020-12-22 2021-03-16 北京航空航天大学杭州创新研究院 Deep neural network training method based on multi-task optimization
JP2022025392A (en) * 2020-07-29 2022-02-10 公立大学法人秋田県立大学 Machine learning device and method for mechanical learning
US11367002B1 (en) * 2021-01-06 2022-06-21 Guangdong University Of Technology Method for constructing and training decentralized migration diagram neural network model for production process

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11373266B2 (en) * 2017-05-05 2022-06-28 Intel Corporation Data parallelism and halo exchange for distributed machine learning
US10585703B2 (en) * 2017-06-03 2020-03-10 Apple Inc. Dynamic operation allocation for neural networks
US20200210840A1 (en) * 2018-12-31 2020-07-02 Microsoft Technology Licensing, Llc Adjusting precision and topology parameters for neural network training based on a performance metric
US11741370B2 (en) * 2019-08-28 2023-08-29 International Business Machines Corporation Transfer learning based on cross-domain homophily influences

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022025392A (en) * 2020-07-29 2022-02-10 公立大学法人秋田県立大学 Machine learning device and method for mechanical learning
CN112506667A (en) * 2020-12-22 2021-03-16 北京航空航天大学杭州创新研究院 Deep neural network training method based on multi-task optimization
US11367002B1 (en) * 2021-01-06 2022-06-21 Guangdong University Of Technology Method for constructing and training decentralized migration diagram neural network model for production process

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WBATimeNet: A deep neural network approach for VM live migration in the cloud; Ashish Mangalampalli; Future Generation Computer Systems; 2022-05-26; full text *

Also Published As

Publication number Publication date
CN115454655A (en) 2022-12-09

Similar Documents

Publication Publication Date Title
Li et al. Coding for distributed fog computing
US10764125B2 (en) Method and device for training model in distributed system
CN110262901B (en) Data processing method and data processing system
CN115454655B (en) Dynamic layer migration method in asynchronous pipeline parallel training process
CN111191728B (en) Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
Zhan et al. Pipe-torch: Pipeline-based distributed deep learning in a gpu cluster with heterogeneous networking
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN111324630B (en) MPI-based neural network architecture search parallelization method and equipment
CN109889440A (en) A kind of correcting and eleting codes failure node reconstruct routing resource based on maximum spanning tree
CN116644803B (en) Distributed cooperative training control method, system, device, equipment and storage medium
Kim et al. Efficient large-scale deep learning framework for heterogeneous multi-gpu cluster
CN112256653B (en) Data sampling method and device
US20200387800A1 (en) Scheduling method and related apparatus
CN112114951A (en) Bottom-up distributed scheduling system and method
CN114844781B (en) Method and system for optimizing Shuffle performance for encoding MapReduce under Rack architecture
Mithila et al. Latency-based vector scheduling of many-task applications for a hybrid cloud
CN113821323A (en) Offline job task scheduling algorithm oriented to hybrid deployment data center scene
CN116432743B (en) Method for improving throughput of reinforcement learning system
CN110532091B (en) Graph computation edge vector load balancing method and device based on graph processor
CN116452951B (en) Remote sensing information extraction model distributed training method based on central data pool
US11907725B2 (en) Communication in a computer having multiple processors
CN113138831B (en) Network resetting method and acceleration distributed training method and system based on same
CN113592089A (en) Gradient synchronization method for compressed sensing under distributed deep learning training scene
Li et al. Tree-Based Elastic Parameter Server to Schedule Resources to Accelerate Distributed'Training
Zhao et al. High-throughput Sampling, Communicating and Training for Reinforcement Learning Systems

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant