CN114970830A - Flexible communication method for accelerating data parallel distributed deep learning training - Google Patents

Flexible communication method for accelerating data parallel distributed deep learning training

Info

Publication number
CN114970830A
CN114970830A (application CN202210651078.4A)
Authority
CN
China
Prior art keywords
layer
neural network
parameters
completed
communication operation
Prior art date
Legal status
Pending
Application number
CN202210651078.4A
Other languages
Chinese (zh)
Inventor
马胜
侯翔
黎铁军
吴利舟
张建民
罗莉
蒋威
易啸
徐睿
王波
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date: 2022-06-09
Filing date: 2022-06-09
Publication date: 2022-08-30
Application filed by National University of Defense Technology
Priority to CN202210651078.4A
Publication of CN114970830A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multi Processors (AREA)

Abstract

A flexible communication method for accelerating data-parallel distributed deep learning training comprises the following steps. In the back-propagation and parameter-update phases: when every computing node has finished updating the parameters of its local DNN model, the ongoing communication operations used to synchronize the DNN model parameters of all nodes are suspended and the communication state is saved; the communication operation that synchronizes the first-layer neural network parameters of all nodes is then performed, the first-layer neural network is updated, and the next training cycle begins. In the forward-propagation phase of the next training cycle, as soon as computation of the first-layer activations begins, the synchronization operations left unfinished in the previous cycle for the parameters of the second layer through the last layer are started in order and the corresponding layers are updated. When the activation computation of the nth-layer neural network has finished and the (n+1)th-layer neural network has been updated, the activation computation of the (n+1)th layer begins. The method is simple in principle, easy to implement, and significantly improves the speed of data-parallel distributed training.

Description

Flexible communication method for accelerating data parallel distributed deep learning training
Technical Field
The invention relates generally to the technical field of distributed deep learning training, and in particular to a flexible communication method for accelerating data-parallel distributed deep learning training.
Background
Deep learning profoundly influences daily life and plays an indispensable role in fields such as speech recognition, machine translation, and image classification. Deep learning comprises two steps: training and inference. The size of the training data set and the complexity of the deep neural network (DNN) model are two important factors affecting the accuracy of deep learning inference results.
In recent years, with the development of deep learning algorithms, DNN models have become increasingly complex. At the same time, the growth of the Internet has made it easy to acquire large-scale data sets for training DNN models. The surge in model parameters and training data improves the inference accuracy of DNN models, but it also places higher demands on the computational and memory resources of the deep learning training system. Against this background, continuing to train on a single-accelerator system will become increasingly inefficient and will fall far short of the needs of academia and industry.
To address the low training efficiency caused by the rapid growth of model parameters and training data, practitioners have proposed training DNN models with distributed deep learning training systems. In a distributed training system, the training data and the DNN model are distributed across multiple single-accelerator systems, which then perform the training task collectively. This training mode not only lowers the storage and compute requirements placed on each single-accelerator system, but also markedly shortens the time needed to train the DNN model through the cooperative training of the single-accelerator systems.
Data parallelism is one of the most important modes of distributed training and is characterized as follows:
First, a randomly selected mini-batch of training data is divided evenly, and each portion is distributed to one of the single-accelerator systems in the distributed training system.
Second, each single-accelerator system stores a copy of the complete DNN model and trains it with its local data set; the sketch below illustrates this setup.
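The following minimal sketch is not part of the patent; the helper shard_minibatch and the batch shape are hypothetical and only show one randomly selected mini-batch being split evenly while every node keeps a full model replica.
```python
import numpy as np

def shard_minibatch(batch, num_nodes):
    """Split one randomly selected mini-batch evenly; shard i goes to node i."""
    assert len(batch) % num_nodes == 0, "an even split is assumed here"
    shard_size = len(batch) // num_nodes
    return [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_nodes)]

# Each single-accelerator system holds a full copy of the DNN model
# and trains it on its own shard of the mini-batch.
minibatch = np.random.randn(64, 784)              # hypothetical 64-sample batch
shards = shard_minibatch(minibatch, num_nodes=4)  # one shard per node
assert all(len(s) == 16 for s in shards)
```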
At present, the conventional process of training a DNN model in the data-parallel mode is as follows:
First, in the forward-propagation phase, each single-accelerator system computes the activations of its local DNN model layer by layer from front to back.
Then, in the back-propagation phase, the gradients of each layer's neural network parameters (including weights and biases) are computed layer by layer from back to front, and the local DNN model is updated according to these gradients.
Finally, after the parameters of any nth-layer neural network have been updated, the parameter-update phase for that layer begins: the parameters are synchronized across the distributed training system through a communication operation to obtain the global parameters, and the DNN model is updated. After the model has been updated, the next training cycle begins. The sketch below summarizes this conventional loop.
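A hedged sketch of that loop follows; the names model, layer.backward, layer.update, loss_grad_fn, and allreduce are illustrative stand-ins rather than the patent's or any library's API, and the point is only that each layer's synchronization is serialized behind the previous one while nothing overlaps with the forward pass.
```python
def conventional_training_step(model, shard, loss_grad_fn, allreduce):
    # Forward propagation: activations are computed front to back.
    # No communication happens here, so the communication module is idle.
    activations = model.forward(shard)

    # Back propagation + parameter update: gradients are computed back to
    # front, and each layer's parameters are synchronized with a blocking
    # collective, so layer 1 is synchronized only after every deeper layer.
    grad = loss_grad_fn(activations)
    for layer in reversed(model.layers):
        grad, param_grad = layer.backward(grad)
        layer.update(allreduce(param_grad))  # blocks until all nodes contribute
    # Only after the last (layer-1) synchronization can the next cycle begin.
```
This serialization is exactly what leaves the communication module idle during forward propagation and the compute module idle while the layer-1 synchronization waits its turn.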
However, the conventional training process described above still has technical problems that directly affect training efficiency:
First: in the forward-propagation phase, only the computation operations of the neural network take place in the distributed training system and there is no communication between the single-accelerator systems, so the functional module responsible for communication is idle.
Second: in the back-propagation and parameter-update phases, the synchronization of the layer-1 neural network parameters cannot start immediately after those parameters have been computed; the communication for the parameters of the last layer down to the second layer must finish first, and during this period the functional module responsible for neural network computation in the distributed training system is idle.
In view of the foregoing, a method is needed that increases the overlap between computation operations and communication operations in data-parallel distributed training and thereby improves the speed of data-parallel distributed training.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, the invention provides a flexible communication method for accelerating data-parallel distributed deep learning training that is simple in principle, easy to implement, and significantly improves the speed of data-parallel distributed training.
In order to solve the technical problems, the invention adopts the following technical scheme:
a flexible communication method for accelerating data parallel distributed deep learning training comprises the following steps:
in the back propagation phase and the parameter update phase of the distributed training loop: when the parameters of the local DNN models of all the computing nodes are updated, suspending the ongoing communication operation for synchronizing the DNN model parameters of all the computing nodes, and saving the communication state;
after the communication state is stored, synchronizing the communication operation of the first-layer neural network parameters of all the nodes, updating the first-layer neural network, and then starting the next training cycle;
in the forward propagation stage of the next training cycle, when the activation of the first layer of neural network is started to be calculated, the synchronous communication operation of the parameters from the second layer to the last layer of neural network which are not completed in the previous training cycle is started in sequence, and each layer of neural network is updated;
when the activation calculation of the nth layer of neural network is completed and the (n + 1) th layer of neural network is updated, starting the activation calculation of the (n + 1) th layer of neural network; wherein n is more than or equal to 1 and is less than the number of DNN model layers.
As a further improvement of the method of the invention: the distributed training cycle comprises a forward-propagation phase, a back-propagation phase, and a parameter-update phase; the forward-propagation phase completes the computation of the local DNN model's activations on each single-accelerator system; the back-propagation phase computes the parameters used to update the local DNN model; and the parameter-update phase synchronizes the local parameters of all single-accelerator systems through communication operations and updates the DNN model.
As a further improvement of the method of the invention: the parameters include weights and biases.
As a further improvement of the method of the invention: in the back-propagation and parameter-update phases, the parameters of the DNN model are computed in order from the last layer to layer 1, and after the parameters of any layer have been computed, the synchronization of that layer's parameters is started once the synchronization already in progress in the system has finished.
As a further improvement of the method of the invention: in the forward-propagation phase, before the activations of any layer of the neural network are computed, the synchronization of that layer's parameters from the previous training cycle has been completed, i.e., that layer of the model has been updated.
As a further improvement of the method of the invention: in the back-propagation and parameter-update phases, after the parameters of the layer-1 neural network have been computed, the ongoing synchronization of the DNN model parameters in the system is suspended and the communication state is saved.
As a further improvement of the method of the invention: after the communication has been suspended, the synchronization of the layer-1 neural network parameters is started and completed, and the layer-1 neural network is updated.
As a further improvement of the method of the invention: after the layer-1 neural network has been updated, the next training cycle starts; in the forward-propagation phase, computation of the layer-1 activations begins; at the same time, the synchronization of the layer-2 neural network parameters left unfinished in the previous training cycle is started, and once it completes, the layer-2 neural network is updated.
As a further improvement of the method of the invention: after the synchronization of the layer-2 neural network parameters has completed, the unfinished synchronization of the layer-3 neural network parameters from the previous training cycle is started and the layer-3 neural network is updated; and so on, until all parameter synchronizations left unfinished in the previous training cycle have completed and every layer of the neural network has been updated.
As a further improvement of the method of the invention: when the computation of the layer-1 activations has finished and the layer-2 neural network has been updated, computation of the layer-2 activations begins; and so on, until the activations of every layer of the neural network have been computed.
Compared with the prior art, the invention has the following advantages:
The flexible communication method for accelerating data-parallel distributed deep learning training is simple in principle, easy to implement, and markedly improves the speed of data-parallel distributed training; by executing the activation computations in parallel with the synchronization of the parameters left over from the previous training cycle, the invention greatly accelerates data-parallel distributed deep learning training.
Drawings
Fig. 1 is a schematic diagram of the training process of the forward-propagation phase of the first training cycle in a specific application example of the invention.
Fig. 2 is a schematic diagram of the training process of the forward-propagation phase from the second training cycle to the last training cycle in a specific application example of the invention.
Fig. 3 is a schematic diagram of the training process of the back-propagation and parameter-synchronization phases from the first training cycle to the penultimate training cycle in a specific application example of the invention.
Fig. 4 is a schematic diagram of the training process of the back-propagation and parameter-synchronization phases of the last training cycle in a specific application example of the invention.
Fig. 5 is a schematic flow chart of the method of the invention in a specific application.
Detailed Description
The invention will be described in further detail below with reference to the drawings and specific examples.
As shown in Fig. 5, the flexible communication method for accelerating data-parallel distributed deep learning training of the invention includes:
in the back-propagation and parameter-update phases of a distributed training cycle: when every computing node has finished updating the parameters (including weights and biases) of its local DNN model, suspending the ongoing communication operations used to synchronize the DNN model parameters of all nodes and saving the communication state;
after the communication state has been saved, starting and completing the communication operation that synchronizes the first-layer neural network parameters of all nodes, updating the first-layer neural network, and then starting the next training cycle;
in the forward-propagation phase of the next training cycle, beginning to compute the activations of the first-layer neural network while starting, in order, the synchronization operations for the parameters of the second layer through the last layer left unfinished in the previous training cycle, and updating each of those layers;
when the activation computation of the nth-layer neural network (1 ≤ n < the number of DNN model layers) has finished and the (n+1)th-layer neural network has been updated, starting the activation computation of the (n+1)th-layer neural network.
In a specific application example, a distributed training system composed of multiple single-accelerator systems performs data-parallel distributed training in cycles of three phases, executed in order: a forward-propagation phase, a back-propagation phase, and a parameter-update phase, where the parameters include weights and biases. The forward-propagation phase completes the computation of the local DNN model's activations on each single-accelerator system; the back-propagation phase computes the parameters used to update the local DNN model; and the parameter-update phase synchronizes the local parameters of all single-accelerator systems through communication operations and updates the DNN model.
In a specific application example, in the back-propagation and parameter-update phases of a distributed training cycle, the parameters of the DNN model are computed in order from the last layer to layer 1, and after the parameters of any layer have been computed, the synchronization of that layer's parameters is started once the synchronization already in progress in the system has finished.
In a specific application example, in the forward-propagation phase of a distributed training cycle, before the activations of any layer of the neural network are computed, the synchronization of that layer's parameters from the previous training cycle has been completed, i.e., that layer of the model has been updated.
In a specific application example, in the back-propagation and parameter-update phases of the distributed training cycle, after the parameters of the layer-1 neural network have been computed, the ongoing synchronization of the DNN model parameters in the system is suspended and the communication state is saved.
In a specific application example, after the communication has been suspended, the synchronization of the layer-1 neural network parameters is started and completed, and the layer-1 neural network is updated.
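The description leaves open how an in-flight synchronization is suspended and how its state is saved; the sketch below assumes one plausible realization, in which a layer's gradient is reduced chunk by chunk and the saved communication state is simply the index of the next chunk. The primitive allreduce_chunk is a hypothetical stand-in for the system's real collective.
```python
class ResumableSync:
    """Synchronization of one layer's gradient that can be paused and resumed."""

    def __init__(self, layer_id, grad_chunks, allreduce_chunk):
        self.layer_id = layer_id
        self.chunks = grad_chunks        # the layer's gradient, split into chunks
        self.allreduce_chunk = allreduce_chunk
        self.next_chunk = 0              # the saved communication state
        self.suspended = False

    def run(self):
        """Reduce chunks in order until finished or until suspend() is called."""
        while self.next_chunk < len(self.chunks) and not self.suspended:
            i = self.next_chunk
            self.chunks[i] = self.allreduce_chunk(self.chunks[i])
            self.next_chunk = i + 1      # progress survives a suspension
        return self.done()

    def suspend(self):
        # Called once the layer-1 parameters are ready: the ongoing
        # synchronization stops after the current chunk, and next_chunk
        # records where to resume in the next cycle's forward phase.
        self.suspended = True

    def resume(self):
        self.suspended = False
        return self.run()

    def done(self):
        return self.next_chunk == len(self.chunks)
```
Under this assumption, suspending after the layer-1 parameters are ready costs at most one extra chunk of communication, and the forward phase of the next cycle can resume each layer exactly where it stopped.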
In a specific application example, after the layer-1 neural network has been updated, the next training cycle starts. In the forward-propagation phase, computation of the layer-1 activations begins; at the same time, the synchronization of the layer-2 neural network parameters left unfinished in the previous training cycle (if any) is started, and the layer-2 neural network is updated.
In a specific application example, after the synchronization of the layer-2 neural network parameters has completed, the unfinished synchronization of the layer-3 neural network parameters from the previous training cycle (if any) is started and the layer-3 neural network is updated. And so on, until all parameter synchronizations left unfinished in the previous training cycle (if any) have completed and every layer of the neural network has been updated.
In a specific application example, when the computation of the layer-1 activations has finished and the layer-2 neural network has been updated, computation of the layer-2 activations begins. And so on, until the activations of every layer of the neural network have been computed.
In a specific application example, the detailed process of the invention is as follows:
as shown in fig. 1, in a specific application example, the training process of the forward propagation stage of the first training cycle is as follows: and calculating and activating layer by layer from the neural network of the layer 1 to the neural network of the last layer, wherein synchronous communication operation is not performed in the training process.
As shown in Fig. 2, in a specific application example, the training process of the forward-propagation phase from the second training cycle to the last training cycle is as follows:
For the communication operations: first, the unfinished synchronization of the layer-2 neural network parameters from the previous training cycle is started and the layer-2 neural network is updated; then the synchronization of the layer-3 neural network parameters is started and the layer-3 neural network is updated; and so on, until all synchronizations left unfinished in the previous training cycle have completed and every layer of the neural network has been updated.
For the computation operations: first, the activations of the layer-1 neural network are computed; then, when that computation has finished and the layer-2 neural network has been updated, computation of the layer-2 activations begins; and so on, i.e., when the activation computation of the nth-layer neural network (1 < n < the number of DNN model layers) has finished and the (n+1)th-layer neural network has been updated, computation of the (n+1)th-layer activations begins, until the activations of every layer of the DNN model have been computed. The sketch below illustrates how these two streams of operations interleave.
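A hedged sketch of this forward phase follows; it assumes the ResumableSync-style handles sketched earlier plus illustrative layer.forward and layer.apply_update methods that are not defined by the patent. A background thread finishes the leftover synchronizations of layers 2 through L strictly in order, while the compute loop waits only for the one layer it is about to evaluate.
```python
import threading

def overlapped_forward(layers, pending_syncs, x):
    """Forward phase of any training cycle after the first.

    layers[0] (layer 1) was updated before this cycle began;
    pending_syncs[l] is the suspended synchronization of layers[l] for l >= 1.
    """
    updated = [threading.Event() for _ in layers]
    updated[0].set()                     # layer 1 is already up to date

    def drain_pending():
        # Finish the leftover synchronizations of layers 2..L strictly in
        # order, updating each layer as soon as its global gradient arrives.
        for l in range(1, len(layers)):
            pending_syncs[l].resume()    # continues from the saved state
            layers[l].apply_update(pending_syncs[l].chunks)
            updated[l].set()

    comm = threading.Thread(target=drain_pending)
    comm.start()

    for l, layer in enumerate(layers):
        updated[l].wait()                # a layer must be updated before its activation
        x = layer.forward(x)             # the previous iteration produced this layer's input
    comm.join()
    return x
```
Waiting on updated[l] enforces the rule that the activation of layer n+1 is computed only after layer n's activation is finished and layer n+1 has been updated.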
As shown in Fig. 3, in a specific application example, the training process of the back-propagation and parameter-synchronization phases from the first training cycle to the penultimate training cycle is as follows:
For the computation operations: the local DNN model parameters are computed layer by layer from the last-layer neural network to the layer-1 neural network.
For the communication operations: when the parameters of the local nth-layer neural network (1 < n ≤ the number of DNN model layers) have been computed and the synchronization currently in progress in the system has finished, the synchronization of the nth-layer parameters is started; after the parameters of the local layer-1 neural network have been computed, the ongoing communication in the system is suspended and the communication state is saved; once the state has been saved, the synchronization of the layer-1 parameters is started and completed, and the layer-1 neural network is updated. The sketch below illustrates this schedule.
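The following sketch shows that schedule in code; layer.backward, layer.apply_update, start_sync, and sync_now are illustrative assumptions (start_sync could, for example, create the ResumableSync handles sketched above and queue them behind the synchronization already in progress), and only the ordering described here is intended to be faithful.
```python
def backward_with_flexible_comm(layers, loss_grad, start_sync, sync_now):
    """Back-propagation / parameter-synchronization phase (all cycles but the last).

    Returns the handles of the synchronizations that were suspended, so the
    next cycle's forward phase can finish them in layer order.
    """
    in_flight = []                       # per-layer syncs, queued in launch order
    grad = loss_grad

    # Local parameters are computed layer by layer from the last layer to layer 1.
    for l in range(len(layers) - 1, -1, -1):
        grad, param_grad = layers[l].backward(grad)

        if l > 0:
            # Layer n (n > 1): its synchronization starts once the one
            # already in progress in the system has finished.
            in_flight.append(start_sync(l, param_grad))
        else:
            # Layer 1: suspend whatever is still communicating and save its
            # state, then synchronize and update layer 1 immediately so the
            # next training cycle can begin.
            suspended = [s for s in in_flight if not s.done()]
            for s in suspended:
                s.suspend()
            layers[0].apply_update(sync_now(param_grad))
            return suspended
```
In the last training cycle there is nothing to hand over to a following cycle, so no synchronization is suspended and every layer is synchronized to completion.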
As shown in Fig. 4, in a specific application example, the training process of the back-propagation and parameter-synchronization phases of the last training cycle is as follows:
For the computation operations: the local DNN model parameters are computed layer by layer from the last-layer neural network to the layer-1 neural network.
For the communication operations: when the parameters of the local nth-layer neural network (1 ≤ n ≤ the number of DNN model layers) have been computed and the synchronization currently in progress in the system has finished, the synchronization of the nth-layer parameters is started, and the update of the DNN model parameters is finally completed.
As shown in Fig. 5, in the forward-propagation phase, the synchronizations of the parameters of the second layer through the last layer left unfinished in the previous training cycle are performed in order. In the back-propagation and parameter-synchronization phases, as soon as the parameters of any layer of the local DNN model have been computed, the synchronization of that layer's parameters can be started; when the layer-1 neural network parameters have been computed, the ongoing communication is suspended and the communication state is saved; finally, the synchronization of the layer-1 parameters is performed.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.

Claims (10)

1. A flexible communication method for accelerating data-parallel distributed deep learning training, characterized by comprising the following steps:
in the back-propagation and parameter-update phases of a distributed training cycle: when every computing node has finished updating the parameters of its local DNN model, suspending the ongoing communication operations used to synchronize the DNN model parameters of all computing nodes, and saving the communication state;
after the communication state has been saved, performing the communication operation that synchronizes the first-layer neural network parameters of all nodes, updating the first-layer neural network, and then starting the next training cycle;
in the forward-propagation phase of the next training cycle, when computation of the first-layer activations begins, starting in order the synchronization operations left unfinished in the previous training cycle for the parameters of the second layer through the last layer, and updating each of those layers;
when the activation computation of the nth-layer neural network has finished and the (n+1)th-layer neural network has been updated, starting the activation computation of the (n+1)th-layer neural network, wherein n ≥ 1 and n is less than the number of DNN model layers.
2. The flexible communication method for accelerating data-parallel distributed deep learning training according to claim 1, wherein the distributed training cycle comprises a forward-propagation phase, a back-propagation phase, and a parameter-update phase; the forward-propagation phase completes the computation of the local DNN model's activations on each single-accelerator system; the back-propagation phase computes the parameters used to update the local DNN model; and the parameter-update phase synchronizes the local parameters of all single-accelerator systems through communication operations and updates the DNN model.
3. The flexible communication method for accelerating data-parallel distributed deep learning training according to claim 2, wherein the parameters include weights and biases.
4. The flexible communication method for accelerating data-parallel distributed deep learning training according to any one of claims 1 to 3, wherein in the back-propagation and parameter-update phases the parameters of the DNN model are computed in order from the last layer to layer 1, and after the parameters of any layer have been computed, the synchronization of that layer's parameters is started once the synchronization already in progress in the system has finished.
5. The flexible communication method for accelerating data-parallel distributed deep learning training according to any one of claims 1 to 3, wherein in the forward-propagation phase, before the activations of any layer of the neural network are computed, the synchronization of that layer's parameters from the previous training cycle has been completed, i.e., that layer of the model has been updated.
6. The flexible communication method for accelerating data-parallel distributed deep learning training according to any one of claims 1 to 3, wherein in the back-propagation and parameter-update phases, after the parameters of the layer-1 neural network have been computed, the ongoing synchronization of the DNN model parameters in the system is suspended and the communication state is saved.
7. The flexible communication method for accelerating data-parallel distributed deep learning training according to claim 6, wherein after the communication has been suspended, the synchronization of the layer-1 neural network parameters is started and completed, and the layer-1 neural network is updated.
8. The flexible communication method for accelerating data-parallel distributed deep learning training according to claim 7, wherein after the layer-1 neural network has been updated, the next training cycle starts; in the forward-propagation phase, computation of the layer-1 activations begins; at the same time, the synchronization of the layer-2 neural network parameters left unfinished in the previous training cycle is started, and once it completes, the layer-2 neural network is updated.
9. The flexible communication method for accelerating data-parallel distributed deep learning training according to claim 8, wherein after the synchronization of the layer-2 neural network parameters has completed, the unfinished synchronization of the layer-3 neural network parameters from the previous training cycle is started and the layer-3 neural network is updated; and so on, until all parameter synchronizations left unfinished in the previous training cycle have completed and every layer of the neural network has been updated.
10. The flexible communication method for accelerating data-parallel distributed deep learning training according to claim 9, wherein when the computation of the layer-1 activations has finished and the layer-2 neural network has been updated, computation of the layer-2 activations begins; and so on, until the activations of every layer of the neural network have been computed.
CN202210651078.4A 2022-06-09 2022-06-09 Flexible communication method for accelerating data parallel distributed deep learning training Pending CN114970830A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210651078.4A CN114970830A (en) 2022-06-09 2022-06-09 Flexible communication method for accelerating data parallel distributed deep learning training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210651078.4A CN114970830A (en) 2022-06-09 2022-06-09 Flexible communication method for accelerating data parallel distributed deep learning training

Publications (1)

Publication Number Publication Date
CN114970830A 2022-08-30

Family

ID=82960635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210651078.4A Pending CN114970830A (en) 2022-06-09 2022-06-09 Flexible communication method for accelerating data parallel distributed deep learning training

Country Status (1)

Country Link
CN (1) CN114970830A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600687A (en) * 2022-11-08 2023-01-13 北京百度网讯科技有限公司(Cn) Model training method, device, equipment and storage medium
CN116596091A (en) * 2022-11-08 2023-08-15 北京百度网讯科技有限公司 Model training method, device, equipment and storage medium
CN116596091B (en) * 2022-11-08 2024-02-02 北京百度网讯科技有限公司 Model training method, device, equipment and storage medium
CN116258197A (en) * 2023-05-16 2023-06-13 之江实验室 Distributed training acceleration method and system based on parameter calculation and communication scheduling
CN116258197B (en) * 2023-05-16 2023-09-08 之江实验室 Distributed training acceleration method and system based on parameter calculation and communication scheduling

Similar Documents

Publication Publication Date Title
CN114970830A (en) Flexible communication method for accelerating data parallel distributed deep learning training
CN109902818B (en) Distributed acceleration method and system for deep learning training task
CN114756383B (en) Distributed computing method, system, equipment and storage medium
Yu et al. LLR: Learning learning rates by LSTM for training neural networks
CN110533183B (en) Task placement method for heterogeneous network perception in pipeline distributed deep learning
CN109257429A (en) A kind of calculating unloading dispatching method based on deeply study
CN111858009A (en) Task scheduling method of mobile edge computing system based on migration and reinforcement learning
CN106156810A (en) General-purpose machinery learning algorithm model training method, system and calculating node
CN111274036A (en) Deep learning task scheduling method based on speed prediction
CN111882060A (en) Single-step delay stochastic gradient descent training method for machine learning
CN114356578B (en) Parallel computing method, device, equipment and medium for natural language processing model
CN111860828A (en) Neural network training method, storage medium and equipment
CN114237869B (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN112686383B (en) Method, system and device for reducing distributed random gradient of communication parallelism
CN115293342A (en) Deep convolutional neural network parallel training method based on hybrid parallel
CN113159287A (en) Distributed deep learning method based on gradient sparsity
CN115186806A (en) Distributed graph neural network training method supporting cross-node automatic differentiation
CN111831354A (en) Data precision configuration method, device, chip array, equipment and medium
KR102463147B1 (en) Massively parallel deep learning method and apparatus
CN117331700B (en) Computing power network resource scheduling system and method
CN114186671A (en) Large-batch decentralized distributed image classifier training method and system
CN117744759A (en) Text information identification method and device, storage medium and electronic equipment
CN112862083A (en) Deep neural network inference method and device under edge environment
CN116339942A (en) Self-adaptive scheduling method of distributed training task based on reinforcement learning
US11928598B2 (en) Method and system for distributed neural network training

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination