CN115550173A

CN115550173A - Dynamic calculation communication scheduling method based on WFBP and link characteristics

Info

Publication number: CN115550173A
Application number: CN202211164657.2A
Authority: CN
Inventors: 肖利民; 贾志斌; 王良; 刘禹廷; 郭为
Original assignee: Beijing Tiantian Microchip Semiconductor Technology Co ltd
Current assignee: Beijing Tiantian Microchip Semiconductor Technology Co ltd
Priority date: 2022-09-23
Filing date: 2022-09-23
Publication date: 2022-12-30

Abstract

The invention discloses a dynamic calculation communication scheduling method based on WFBP and link characteristics, which comprises the steps of dynamically configuring a gradient operation mode of a back propagation layer during each transmission based on WFBP and link characteristics through distributed pre-training; starting distributed training, wherein in each round of training, each node conducts forward propagation layer by layer based on the updated zone bits of each propagation layer; calculating loss according to the loss function after the forward propagation is finished, starting backward propagation, and propagating the loss into each layer to calculate gradient; based on ready messages sent by the synchronization manager and the nodes, each node performs back propagation layer by layer and sets updated flag bits. The invention can dynamically adapt to adopt different scheduling strategies, maximally overlap calculation and communication and reduce the total training time.

Description

Dynamic calculation communication scheduling method based on WFBP and link characteristics

Technical Field

The invention belongs to the technical field of distributed communication, and particularly relates to a dynamic calculation communication scheduling method based on WFBP and link characteristics.

Background

The distributed deep learning has definite data dependence, forward propagation and backward propagation are carried out symmetrically, the forward propagation uses parameters to carry out result calculation, the backward propagation calculates the gradient of each parameter according to the result, and gradient optimization parameters are used. Because the forward propagation and the backward propagation are symmetrical, the gradient of the parameter used by the i-th layer in the forward propagation is obtained at the i-th layer which is the reciprocal of the backward propagation, and the gradient can be used for parameter optimization at the i-th layer of the next forward propagation at the latest. In addition, the back propagation process has a relatively definite gradient generation period, i.e. after each layer of the back propagation is calculated, an optimized gradient of the layer parameter is generated.

Due to the definite data dependency and the relatively definite gradient generation period, the overall synchronous parallel (BSP) can obtain a relatively good scheduling mode. When the node carries out the back propagation calculation process, each time the gradient of the parameter of one layer is calculated, the gradient of the layer is synchronized by using an Allreduce mode, meanwhile, the calculation of the next layer is carried out in the back propagation mode, and therefore the parallel of calculation and communication is achieved. Furthermore, the gradients can be fused or split, the finally calculated gradient data is transmitted preferentially, and the waiting time before the next round of forward propagation is shortened as much as possible. In the practice of distributed communication, data can be transmitted after being segmented or aggregated according to requirements, and the gradient can be scheduled by adopting the two methods in the distributed deep learning. The theoretical basis of gradient aggregation is that the existing transmission media such as PCIe and network cannot achieve a high bandwidth utilization rate due to the large head information ratio and the frequent and inefficient encoding processing when transmitting a small data amount packet, so that data is aggregated, packed into a larger data packet, and then transmitted and communicated through the media, thereby achieving a higher bandwidth utilization rate. However, packing data means that some gradients calculated in advance may wait for one or more subsequent gradient calculations to be completed before being transmitted, so that the communication delay of the gradients is relatively low. Another gradient segmentation mode aims to construct an approximately interruptible transmission mode, under the condition of extreme segmentation, the high-priority gradient can be immediately synchronized once being calculated, parameters are directly updated to perform forward propagation first-layer calculation of the next round of training after the synchronization of the last-layer gradient of backward propagation is completed, and subsequent gradient communication can be parallel to the forward propagation calculation.

When the BSP synchronization mode is used for distributed deep learning training, the gradient synchronization process is started after all the counter propagation processes are completed by the respective computing nodes. The method seriously loses the parallelism of calculation and communication, prolongs the waiting time between the iterations of the calculation nodes and wastes GPU calculation resources. In addition, the existing gradient scheduling algorithm does not consider the bottleneck of the bottom transmission link in the transmission of small data packets, and blind aggregation and segmentation can further reduce the bandwidth utilization and influence the communication efficiency.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a dynamic computation communication scheduling method based on WFBP and link characteristics, which dynamically determines the method of using tensor division or aggregation according to the result of pre-training, avoids modifying the content of too many frames by setting a synchronization manager, ensures that different scheduling strategies can be adopted dynamically and adaptively under the environment of different tensor types, maximally overlaps computation and communication, and reduces the total training time.

In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:

a dynamic calculation communication scheduling method based on WFBP and link characteristics comprises the following steps:

step 1: dynamically configuring a gradient operation mode of a back propagation layer during each transmission based on WFBP and link characteristics through distributed pre-training;

step 2: starting distributed training according to the configuration in the step 1, and performing forward propagation on each node layer by layer in each training round based on the updated zone bits of each propagation layer;

and 3, step 3: calculating loss according to the loss function after the forward propagation is finished, starting backward propagation, and propagating the loss into each layer to calculate gradient;

and 4, step 4: based on ready messages sent by the synchronization manager and the nodes, each node performs back propagation layer by layer and sets updated flag bits.

In order to optimize the technical scheme, the specific measures adopted further comprise:

the configuration of the gradient operation mode of the backward propagation layer in each transmission in step 1 is as follows:

if the size of the gradient data is larger than or equal to a set value relative to the performance of the link, a gradient segmentation method is used for dividing the gradient to be transmitted of each layer to adapt to the performance of the link, a higher priority is given to the layer which is positioned at the front in the neural network, the gradient which is positioned at the back of the layer during reverse transmission is synchronously completed among all nodes in an interruptible transmission mode, the parameter of the first layer which is transmitted in the forward direction is optimized and calculated, and the next round of forward transmission training is started;

and if the size of the gradient data is smaller than a set value relative to the link performance, aggregating the gradient data calculated by the time division layers during the back propagation by using a gradient aggregation method and transmitting the aggregated gradient data.

In the step 2, each node performs forward propagation layer by layer, checks whether an updated flag bit of the current layer is set before a certain layer starts, and if not, indicates that gradient propagation and updating of the previous layer are not completed and needs to wait for completion; if the updated flag bit is set, the parameters of the current layer are updated, the forward propagation of the current layer is normally carried out, the updated flag bit is reset after the parameters are finished, and the forward propagation of the next layer is circularly started.

In the step 3, the gradient of each layer is collected for processing by distributed training, and then the new gradient is returned to each node for updating the parameters.

The synchronization manager in the above step 4 is configured to perform global synchronization after the back propagation is completed, and then start the next round of training.

In the step 4, each node performs backward propagation layer by layer, after the L-th layer gradient is calculated, the gradient is immediately transmitted by using the gradient processing strategy determined before, meanwhile, the calculation of the L-1-th layer gradient is started in parallel, then, a ready message is sent to the synchronization manager, the parameters of the L-th layer are immediately updated after the L-layer parameters are combined, an updated flag bit is set after the update is completed, the next round of forward propagation is waited, and the backward propagation of the next layer is started circularly.

In the above step 4, when updating the L-th layer parameter, the discontinuous gradient is saved by using a buffer area.

The invention has the following beneficial effects:

the invention dynamically determines to use a tensor division or tensor aggregation method to carry out communication in distributed deep learning according to the result of pre-training, avoids modifying the content of excessive frames by arranging a desynchronization manager, collects the training progress of each node, coordinates with the frames, bypasses the global synchronization of the frames to realize fine-grained scheduling, ensures that different scheduling strategies can be adopted in a dynamic adaptation manner under the environment of different tensor types, maximally overlaps calculation and communication, and reduces the total training time. The method comprises the following specific steps:

aiming at the problem of long waiting time between computing node iterations, a computing communication scheduling algorithm based on AllReduce is researched. Due to clear data dependence of distributed deep learning, the more definite data generation time and the relatively typical synchronization mode can enable calculation and communication to be sufficiently parallel in the training process. Scheduling is used as a component for integrating resources in a system, performance characteristics of different components need to be optimized, a targeted polymerization degree is configured for different transmission links including PCIe and InfiniBand, and the like, bandwidth utilization rate during transmission is maximized by using link transmission characteristics, and transmission delay with high priority gradient is reduced. Meanwhile, different training platforms are also considered, and in order to reserve the codes of the users to the maximum extent and not change the framework as much as possible, the scheduling method provided by the invention is between the codes of the users and the deep learning framework. The user code is first processed by the scheduling program to complete the division of the task and dynamically determine the scheduling method. And finally, calling the framework to complete the actual operation. The general scheduling flow is as follows:

before starting distributed deep learning training, performing a certain iteration number of pre-training processes, and configuring gradient operation of a back propagation layer during each transmission according to data characteristics transmitted by a model and used link performance characteristics: if the size of the gradient data is larger, a gradient segmentation method is used, so that the high-priority gradient, namely the gradient with a relatively backward level in reverse transmission, is synchronously completed among all nodes as quickly as possible in an approximately interruptible transmission mode, the first-layer parameters of forward transmission are optimized and calculated, and the next round of forward transmission training is started; if the gradient data size is smaller, a gradient aggregation method is used for aggregating the gradient data calculated by layers during backward propagation so as to reduce bandwidth waste caused by the size of the head during link transmission and improve the transmission efficiency of the aggregated high-priority gradient in the link.

In the deep learning training process, the generation time of the data and the size of the data volume have certain periodicity, so that the data characteristics of the distributed deep learning training of the current round can be roughly mastered by using a pre-training method, and the adaptive link characteristics are scheduled.

Drawings

FIG. 1 is a flow chart of forward and backward propagation for a certain training round;

FIG. 2 is a schematic diagram of using gradient partitioning for large tensors;

FIG. 3 is a schematic diagram of aggregation using gradients for small tensors;

FIG. 4 is a global synchronization process in which nodes bypass a framework by sending ready information through a synchronization manager.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

As shown in fig. 1 to 4, a dynamic computation communication scheduling method based on WFBP and link characteristics according to the present invention dynamically determines, according to a pre-training result, a method of tensor division or tensor aggregation to perform communication in distributed deep learning, avoids modifying contents of too many frames by setting a synchronization manager, the synchronization manager collects training schedules of nodes, coordinates with the frames, and implements fine-grained scheduling by bypassing global synchronization of the frames, so as to ensure that different scheduling strategies can be dynamically adapted to be adopted in environments of different tensor types, maximally overlap computation and communication, and reduce overall training time, specifically, the method includes the following steps:

step 1: determining subsequent training parameters through distributed pre-training, wherein a gradient operation mode of a back propagation layer during each transmission is configured according to data characteristics and used link performance of model transmission;

before formally starting distributed training, a pre-training process with a certain number of rounds is carried out.

In the pre-training process, some parameters of subsequent training need to be determined, namely, aiming at different transmission links, including PCIe and InfiniBand and the like, targeted polymerization degrees are configured, the bandwidth utilization rate during transmission is maximized by using the transmission characteristics of the links, and the transmission delay with high priority gradient is reduced. The method specifically comprises the step of configuring the gradient operation of a back propagation layer in each transmission according to the data characteristics and the used link performance of model transmission.

Configuring the gradient operation during reverse propagation in the step 1, including determining to use a segmentation or polymerization method for the gradient;

if the size of the gradient data is larger than or equal to a set value, a gradient segmentation method is used, so that a high-priority gradient, namely a gradient with a relatively backward level in reverse transmission, is synchronously completed among all nodes as quickly as possible in an approximately interruptible transmission mode, the first-layer parameters of forward transmission are optimized and calculated, and the next round of forward transmission training is started;

if the size of the gradient data is smaller than a set value, a gradient aggregation method is used for aggregating the gradient data calculated by time division layers during backward propagation so as to reduce bandwidth waste caused by the size of a head during link transmission and improve the transmission efficiency of the aggregated high-priority gradient in the link.

Step 2: starting distributed training according to the parameters in the step 1, and performing forward propagation on each node layer by layer based on updated zone bits set in each propagation layer;

in step 2, each node is propagated forward layer by layer, whether an updated flag bit of the layer is set is checked before a certain layer starts, the flag bit is set after the parameters of the previous layer are updated, if the flag bit is not set, the gradient propagation and updating of the previous layer are not finished, and the step needs to wait for finishing;

if the flag bit is set, the parameter of the current layer is updated, the forward propagation of the current layer is normally carried out, the updated flag bit is reset after the flag bit is set, and the forward propagation of the next layer is circularly started.

And 3, step 3: calculating loss according to the loss function after the forward propagation is finished, and preparing to start backward propagation;

The synchronization manager, when using distributed training in some deep learning frameworks, introduces many synchronization operations internally, i.e. there is a global synchronization after the back propagation is completed, and then the next round of training can be started, which causes an obstacle to the more elaborate scheduling of the present invention. The present invention uses the synchronization manager to "fool" the framework, sending a ready message to the manager when a node is ready for a gradient at a certain level, and then can transmit, bypassing the global synchronization set in the framework.

In step 4, each node performs backward propagation layer by layer, after the L-th layer gradient is calculated, the gradient is immediately transmitted by using the previously determined gradient processing strategy, the calculation of the L-1-th layer gradient is started in parallel in the process, then a ready message needs to be sent to the synchronous manager, the L-th layer parameter can be immediately updated after the L-layer parameter combination processing is completed, an updated flag bit is set after the updating is completed, the next round of forward propagation is waited, and the backward propagation of the next layer is circularly started.

Example 1

Taking training in the pytorech framework as an example, when distributed training is used in this framework, it introduces many synchronization operations internally, i.e. there is a global synchronization after the back propagation is completed, and then the next round of training can be started. The global synchronization confirms whether all nodes are ready, and the next training round can be started only after the nodes are ready, which causes obstacles for more elaborate scheduling of the invention. At a finer granularity, the use of global synchronization is not necessary, because each layer in the neural network only depends on the result of one round of back propagation on the layer when the layer needs to forward propagate, and if the parameter transmission and model update of the layer are completed, the forward propagation of the layer can be started immediately. In short, it is necessary to make appropriate modifications to the global synchronization and then perform fine-grained scheduling.

In a pytorech, after global synchronization means that each node is ready, it relies on each node completing its own computational task and sending a ready message to the controller. On the premise of not modifying the training frame, the invention can send false ready information to pretend to complete the operation of global synchronization. In fact, the node may have started data transmission instead of waiting for other nodes to be ready, so that the global synchronization in the pytorech is named as real death, and the scheduling of the invention is facilitated. But at the same time an additional control mechanism needs to be introduced to ensure that when a layer starts the forward propagation process, the backward propagation parameter update of the previous round is completed. The simplest solution is to set an updated flag bit after the update of a certain layer of parameters is completed. Before the forward propagation starts, it needs to judge whether the flag bit is set, if not, it waits. After the layer forward propagation is over, the flag bit is reset, waiting for the next backward propagation. When a new round of forward propagation is started in the original pytorech, it is not necessary to determine whether a certain layer of parameters is updated, because all the parameters are in an updated state after the global synchronization. Therefore, the method can realize finer-grained scheduling.

After the problem of dependence on the original framework is solved, attention can be focused on a scheduling mode, namely, the mode for updating the parameters, the time for scheduling and the method for determining the parameters required during scheduling. When the user code submits the model to the Pytorch, the scheduler performs some pre-training operations, and configures the gradient of each transmission according to the characteristics of the model transmission data and the performance of the used link. The configuration mainly refers to determining the magnitude of each transmission gradient, and there are two main ways.

First, when the gradient data size of each layer is large relative to the link performance, the gradient segmentation method is used to divide the gradient to be transmitted of each layer to adapt to the link performance, so that the network can transmit more quickly, and the layer ahead in the neural network is given higher priority and transmitted earlier, and the higher priority gradient means that the gradient will be used earlier in the next forward propagation. This is of great help to start the next training as soon as possible.

Secondly, if the gradient size of each layer is smaller than the performance of the link, the gradient aggregation mode is used to combine and transmit the gradients of the layers, so as to avoid the burden of the communication of too much small data volume to the network.

The general flow in the training process is described below.

After the model submitted by the user starts to run, the scheduler can firstly perform a plurality of rounds of pre-training processes, configure the gradient size transmitted in the process according to the performance of the network link, and determine the use of a segmentation or aggregation mode. These parameters will be maintained during subsequent training until the training is complete.

In each training round, forward propagation (it is necessary to determine whether the updated flag bit is set), calculation of new parameters, and resetting of the flag bit are performed.

The total loss is calculated according to the loss function at the end of the network, and then the process of back propagation is followed, and the total loss is propagated to each layer to calculate the gradient.

The distributed training collects the gradient of each layer for processing, and then returns the new gradient to each node for updating the parameters.

And (4) performing reverse propagation in a reverse order, transmitting the gradient immediately by using the gradient processing strategy determined before after the L-layer gradient is calculated, and starting the calculation of the L-1-th layer gradient in parallel in the process.

A ready message then needs to be issued to the sync manager of the pytorech.

And after the L-layer parameters are combined, updating the parameters of the L-th layer can be started immediately, and after the updating is finished, an updated zone bit is set to wait for the next round of forward propagation.

Note that when updating the L-th layer parameters, the gradient reached may be fragmented and discontinuous due to the possible gradient slicing operation, so a large buffer needs to be opened to store the discontinuous gradients.

The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.

Claims

1. A dynamic computation communication scheduling method based on WFBP and link characteristics is characterized by comprising the following steps:

step 1: dynamically configuring a gradient operation mode of a back propagation layer in each transmission based on WFBP and link characteristics through distributed pre-training;

step 2: starting distributed training according to the configuration in the step 1, and performing forward propagation on each node layer by layer in each training cycle based on the updated zone bits of each propagation layer;

and step 3: calculating loss according to the loss function after the forward propagation is finished, starting backward propagation, and propagating the loss to each layer to calculate gradient;

2. The method of claim 1, wherein the configuration of the gradient operation mode of the back propagation layer at each transmission in step 1 is as follows:

and if the size of the gradient data is smaller than a set value relative to the performance of the link, aggregating the gradient data calculated by the time division layers during backward propagation by using a gradient aggregation method and transmitting the aggregated gradient data.

3. The method as claimed in claim 1, wherein in step 2, each node performs forward propagation layer by layer, checks whether an updated flag bit of the current layer is set before a certain layer starts, and if not, indicates that gradient propagation and updating of the previous layer are not completed and needs to wait for completion; if the updated flag bit is set, the parameters of the current layer are updated, the forward propagation of the current layer is normally carried out, the updated flag bit is reset after the completion, and the forward propagation of the next layer is circularly started.

4. The method of claim 1, wherein in step 3, distributed training collects gradients of each layer for processing, and returns new gradients to each node for parameter updating.

5. The method of claim 1, wherein the synchronization manager of step 4 is configured to perform global synchronization after the back propagation is completed, and then start the next round of training.

6. The method according to claim 1, wherein in step 4, each node performs backward propagation layer by layer, after the L-th layer gradient is calculated, the gradient is immediately transmitted using the gradient processing policy determined before, and meanwhile, the L-1-th layer gradient calculation is started in parallel, then a ready message is sent to the synchronization manager, the L-th layer parameter is updated immediately after the L-layer parameter combination processing is completed, an updated flag bit is set after the update is completed, and a next round of forward propagation is waited, and the backward propagation of the next layer is cyclically started.

7. The method according to claim 1, wherein in step 4, when updating the L-th layer parameter, the discontinuous gradient is stored in a buffer.