CN112529163A - Distributed training gradient compression acceleration method based on AllReduce - Google Patents

Distributed training gradient compression acceleration method based on AllReduce

Info

Publication number
CN112529163A
Authority
CN
China
Prior art keywords
gradient
allreduce
acceleration method
training
compression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011504384.2A
Other languages
Chinese (zh)
Inventor
谢远东
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202011504384.2A
Publication of CN112529163A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an AllReduce-based distributed training gradient compression acceleration method. Within a node (intra-node), gradients are converted from FP32 to FP16; between nodes (inter-node), gradients are compressed with the error-feedback SGD (EF-SGD) method, which loses less gradient information than sparsification. In addition, compared with the Params Server communication architecture, the AllReduce architecture eliminates the bandwidth bottleneck.

Description

Distributed training gradient compression acceleration method based on AllReduce
Technical Field
The invention relates to the technical field of deep learning, in particular to an AllReduce-based distributed training gradient compression acceleration method.
Background
Existing centralized distributed training methods based on a parameter server architecture, and sparsification methods that select only a subset of gradient values, have several problems: the sparsification method loses a large amount of gradient information; using the same gradient compression method for both intra-node and inter-node communication further increases that loss; and, compared with AllReduce, the Params Server communication architecture itself has a bandwidth bottleneck.
Disclosure of Invention
The invention provides an AllReduce-based distributed training gradient compression acceleration method that addresses the synchronous-communication bandwidth problem in training models with large numbers of parameters.
The technical solution adopted to solve the above problems is as follows:
according to an aspect of the embodiments of the present invention, there is provided an AllReduce-based distributed training gradient compression acceleration method, including: an AllReduce distributed depth gradient compression training architecture is adopted, a parameter server does not exist in the AllReduce distributed depth gradient compression training architecture, an annular closed-loop transmission path is formed between working machines, and compressed gradients are transmitted between GPUs; and converting the gradient in the intra-node compression module from FP32 to FP 16; and the gradient is compressed using an error feedback random gradient descent algorithm.
Preferably, the error-feedback stochastic gradient descent algorithm comprises: for each training step, obtaining the value p_t.
Preferably, the error-feedback stochastic gradient descent algorithm further comprises: applying a gradient compression algorithm to the value p_t.
Preferably, the value p_t is p_t = ηg_t + e_t, where g_t is the stochastic gradient value and e_t is the error (deviation) value.
Preferably, the initial value of e_t is 0.
Preferably, the gradient compression is implemented as: using the top-k algorithm to take the top k values of p_t and performing data integration.
Preferably, the error-feedback stochastic gradient descent algorithm further comprises updating the parameters: x_{t+1} = x_t - Δ_t, e_{t+1} = p_t - Δ_t.
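For clarity, the error-feedback update described above can be written compactly as follows. This is a restatement of the formulas in this section; the minus signs follow the standard error-feedback form of the update rule and the original equation images are not reproduced here.

```latex
% EF-SGD update, restated from the formulas in this disclosure
\begin{aligned}
p_t      &= \eta\, g_t + e_t, \qquad e_0 = 0        && \text{(error-compensated gradient)} \\
\Delta_t &= C(p_t) = \mathrm{topk}(p_t)             && \text{(gradient compression)} \\
x_{t+1}  &= x_t - \Delta_t                          && \text{(parameter update)} \\
e_{t+1}  &= p_t - \Delta_t                          && \text{(error fed back to the next step)}
\end{aligned}
```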
According to another aspect of the embodiments of the present invention, a storage medium is provided, where the storage medium includes a stored program, and when the program runs, a device in which the storage medium is located is controlled to execute the AllReduce-based distributed training gradient compression acceleration method.
In this way, FP32 is converted to FP16 for intra-node communication, and the EF-SGD method is used to compress the gradient for inter-node communication, which loses less information than the sparsification method. In addition, the AllReduce architecture eliminates the bandwidth bottleneck of the Params Server communication architecture.
Drawings
FIG. 1 is a schematic diagram of the distributed deep gradient compression training architecture of the Params Server structure;
FIG. 2 is the AllReduce-based distributed deep gradient compression training architecture of the present invention;
FIG. 3 is a schematic diagram of the Ring-AllReduce architecture according to an embodiment of the present invention;
FIG. 4 is a schematic node connection diagram according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
The embodiment of the invention provides an Allreduce-based distributed training gradient compression acceleration method. As explained in detail below.
Fig. 1 shows the distributed gradient compression training architecture of the Params Server (PS) structure, and fig. 2 shows the AllReduce-based distributed gradient compression training architecture according to an embodiment of the present invention. In the PS architecture, the GPUs within each machine form a closed loop and exchange the intra-node-compressed gradient; the working machines are not directly connected to each other, and the inter-node-compressed gradient is transmitted between the working machines and the parameter server. The AllReduce architecture has no parameter server: the working machines form a ring-shaped closed-loop transmission path, and compressed gradients are transmitted between GPUs.
AllReduce is in fact a class of algorithms whose goal is to efficiently integrate (reduce) data held by different machines and then distribute the result back to every machine. In deep learning applications the data are usually vectors or matrices, and the reduction operation is typically Sum, Max, Min, etc.
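As a minimal, single-process sketch (an illustration of the semantics, not the patent's implementation), the following Python snippet simulates a sum AllReduce across several workers: after the operation, every worker holds the elementwise sum of all workers' tensors.

```python
import numpy as np

def simulated_allreduce(worker_tensors, op=np.add):
    """Simulate AllReduce: reduce all workers' tensors, then give every
    worker a copy of the result (single-process stand-in for the real
    collective operation)."""
    result = worker_tensors[0].copy()
    for t in worker_tensors[1:]:
        result = op(result, t)                       # reduce step (Sum here)
    return [result.copy() for _ in worker_tensors]   # broadcast step

# Example: 4 workers, each holding a gradient vector of length 3.
workers = [np.full(3, fill_value=i, dtype=np.float32) for i in range(4)]
reduced = simulated_allreduce(workers)
print(reduced[0])   # [6. 6. 6.] -- every worker ends up with the same sum
```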
There are many concrete implementations of AllReduce. The simplest is for every worker (which may be a GPU) to send its own data to every other worker, but this wastes a great deal of bandwidth.
A slightly better implementation uses a master-slave architecture: one worker is designated as the master, all other workers send their data to the master, and the master performs the reduction and then distributes the result back to the other workers. In this implementation, however, the master easily becomes the bottleneck of the whole network.
Ring-AllReduce, shown in FIG. 3, is one form of AllReduce. Data are passed between GPUs along a ring. If P is the number of GPUs and N is the size of the parameter data to be transmitted, the amount of data each GPU needs to transmit is 2N(P-1)/P.
It can be seen that this transfer amount is essentially independent of P (the number of GPUs), i.e. the per-GPU communication volume does not grow as more GPUs are added. In the PS structure, by contrast, the amount of data that must be transferred grows with the number of working machines, so the PS structure has a bandwidth bottleneck compared with AllReduce.
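The comparison can be illustrated numerically with a short sketch (not from the patent). It assumes each worker holds N gradient values and that a single parameter server must receive and send a full copy of the gradient per worker each step; that server-side formula is an assumption used only for illustration.

```python
def ring_allreduce_per_gpu(N, P):
    """Data each GPU sends in Ring-AllReduce: 2*N*(P-1)/P values."""
    return 2 * N * (P - 1) / P

def param_server_traffic(N, P):
    """Assumed parameter-server traffic per step: N values received from
    and N values sent back to each of the P workers."""
    return 2 * N * P

N = 25_000_000  # e.g. roughly a ResNet-50-sized parameter count
for P in (2, 4, 8, 16):
    print(P, ring_allreduce_per_gpu(N, P), param_server_traffic(N, P))
# Ring-AllReduce per-GPU traffic stays below 2N no matter how large P is,
# while the parameter server's traffic grows linearly with P.
```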
Moreover, because the bandwidth between GPUs within a machine differs greatly from the bandwidth between machines, using the same gradient compression algorithm in both cases is not an optimal combination.
As shown in fig. 4, within a node (intra-node) the GPUs are connected through NVLink, a PCIe switch, or a PCIe Host Bridge, so the speed and bandwidth far exceed those of the network cards between servers. Applying a sparsification algorithm here would therefore needlessly lose a large amount of gradient information during training. For this reason, the intra-node compression module only converts the gradient from FP32 to FP16: this halves the communication bandwidth while having only a slight effect on the accuracy of the forward and backward passes of the network being trained.
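A minimal NumPy sketch of this intra-node step (a stand-in for the actual GPU code, which the patent does not provide): casting an FP32 gradient tensor to FP16 halves the number of bytes that must be moved, at the cost of a small amount of precision.

```python
import numpy as np

grad_fp32 = np.random.randn(1_000_000).astype(np.float32)  # simulated gradient
grad_fp16 = grad_fp32.astype(np.float16)                    # intra-node compression

print(grad_fp32.nbytes, grad_fp16.nbytes)   # 4000000 vs 2000000 bytes (halved)
# small rounding error introduced by the FP16 cast
print(np.max(np.abs(grad_fp32 - grad_fp16.astype(np.float32))))
```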
Inter-node communication, by contrast, is limited by the network bandwidth between servers, so the gradient is compressed as much as possible while keeping the impact on precision small.
To this end, the Error-Feedback stochastic gradient descent (EF-SGD) algorithm is used:
the method comprises the following steps:
1. decompress for each training to get pt=ηgt+et,gtFor random gradient descent
Figure BDA0002844450360000041
Where e istThe initial value is 0 as the deviation value.
2.Δt=C(pt) And C is a gradient compression algorithm, wherein a topk algorithm is adopted, namely k values before the gradient are taken for data integration (Reduce).
3. Updating parameter xt+1=xtt,et+1=ptt
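A compact, single-process Python sketch of these three steps follows (an illustration under the names used above, not the patent's distributed implementation). The top-k compressor here keeps the k largest-magnitude entries of p_t and zeroes the rest; selecting by magnitude is an assumption, since the description only says the top k values are taken. The discarded remainder is carried forward as the error e_{t+1}.

```python
import numpy as np

def topk_compress(p, k):
    """C(p): keep the k largest-magnitude entries of p, zero the rest."""
    delta = np.zeros_like(p)
    idx = np.argsort(np.abs(p))[-k:]   # indices of the k largest |p_i|
    delta[idx] = p[idx]
    return delta

def ef_sgd_step(x, grad, e, lr=0.1, k=10):
    """One EF-SGD step: p_t = lr*g_t + e_t, Delta_t = C(p_t),
    x_{t+1} = x_t - Delta_t, e_{t+1} = p_t - Delta_t."""
    p = lr * grad + e             # step 1: error-compensated gradient
    delta = topk_compress(p, k)   # step 2: gradient compression
    x_new = x - delta             # step 3a: parameter update
    e_new = p - delta             # step 3b: feed the compression error forward
    return x_new, e_new

# Toy usage: minimize f(x) = 0.5*||x||^2, whose gradient is x itself.
x = np.random.randn(100).astype(np.float32)
e = np.zeros_like(x)              # e_0 = 0
for _ in range(200):
    x, e = ef_sgd_step(x, grad=x, e=e, lr=0.1, k=10)
print(np.linalg.norm(x))          # norm is much smaller than at initialization
```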
With the EF-SGD method of the present invention, training ResNet-50 on ImageNet is accelerated by 10% while the accuracy remains unchanged.
In this way, FP32 is converted to FP16 for intra-node communication, and the EF-SGD method is used to compress the gradient for inter-node communication, which loses less information than the sparsification method. In addition, the AllReduce architecture eliminates the bandwidth bottleneck of the Params Server communication architecture.
The AllReduce-based distributed training gradient compression acceleration method provided by the embodiment of the invention may be implemented in the form of a software functional module and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. An AllReduce-based distributed training gradient compression acceleration method, characterized by comprising the following steps:
adopting an AllReduce distributed deep gradient compression training architecture, in which there is no parameter server, the working machines form a ring-shaped closed-loop transmission path, and compressed gradients are transmitted between GPUs;
converting the gradient from FP32 to FP16 in the intra-node compression module; and
compressing the gradient using an error-feedback stochastic gradient descent algorithm.
2. The AllReduce-based distributed training gradient compression acceleration method according to claim 1, wherein the error-feedback stochastic gradient descent algorithm comprises:
for each training step, obtaining the value p_t.
3. The AllReduce-based distributed training gradient compression acceleration method according to claim 2, wherein the error-feedback stochastic gradient descent algorithm further comprises:
applying a gradient compression algorithm to the value p_t.
4. The AllReduce-based distributed training gradient compression acceleration method according to claim 3, wherein the value p_t is p_t = ηg_t + e_t, where g_t is the stochastic gradient value and e_t is the error (deviation) value.
5. The AllReduce-based distributed training gradient compression acceleration method according to claim 4, wherein the initial value of e_t is 0.
6. The AllReduce-based distributed training gradient compression acceleration method according to claim 5, wherein the gradient compression is implemented as: using the top-k algorithm to take the top k values of p_t for data integration.
7. The AllReduce-based distributed training gradient compression acceleration method according to claim 5, wherein the error-feedback stochastic gradient descent algorithm further comprises updating the parameters: x_{t+1} = x_t - Δ_t, e_{t+1} = p_t - Δ_t.
8. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute the AllReduce-based distributed training gradient compression acceleration method according to any one of claims 1 to 7.
CN202011504384.2A 2020-12-17 2020-12-17 Distributed training gradient compression acceleration method based on AllReduce Pending CN112529163A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011504384.2A CN112529163A (en) 2020-12-17 2020-12-17 Distributed training gradient compression acceleration method based on AllReduce

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011504384.2A CN112529163A (en) 2020-12-17 2020-12-17 Distributed training gradient compression acceleration method based on AllReduce

Publications (1)

Publication Number Publication Date
CN112529163A true CN112529163A (en) 2021-03-19

Family

ID=75001529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011504384.2A Pending CN112529163A (en) 2020-12-17 2020-12-17 Distributed training gradient compression acceleration method based on AllReduce

Country Status (1)

Country Link
CN (1) CN112529163A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472347A (en) * 2018-10-15 2019-03-15 中山大学 A kind of gradient compression method of distribution deep learning
WO2020081399A1 (en) * 2018-10-15 2020-04-23 Nam Sung Kim Network-centric architecture and algorithms to accelerate distributed training of neural networks
US20200311539A1 (en) * 2019-03-28 2020-10-01 International Business Machines Corporation Cloud computing data compression for allreduce in deep learning
CN110633798A (en) * 2019-09-12 2019-12-31 北京金山数字娱乐科技有限公司 Parameter updating method and device in distributed training
CN111917579A (en) * 2020-07-30 2020-11-10 云知声智能科技股份有限公司 Distributed training method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shuai Zheng et al., "Communication-efficient distributed blockwise momentum SGD with error-feedback," NIPS '19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 11450-11460 *

Similar Documents

Publication Publication Date Title
CN110929886B (en) Model training and predicting method and system
Sattler et al. Robust and communication-efficient federated learning from non-iid data
US20210073678A1 (en) Method, apparatus and system for secure vertical federated learning
US8904149B2 (en) Parallelization of online learning algorithms
Lin et al. Design of optimal sparse feedback gains via the alternating direction method of multipliers
JP6227813B1 (en) Distributed deep learning device and distributed deep learning system
JP7095675B2 (en) Information processing equipment, information processing methods, and programs
EP3336760A1 (en) Combined adversarial learning of inverse image manipulation operations
US20190213470A1 (en) Zero injection for distributed deep learning
EP4206943A1 (en) Graph data processing method and apparatus, computer device and storage medium
CN113487035B (en) Control pulse determining method and device for quantum gate and electronic equipment
CN110505218B (en) Grid data self-adaptive compression transmission method based on JSON and computer storage medium
US20210149985A1 (en) Method and apparatus for processing large-scale distributed matrix product
US11853391B1 (en) Distributed model training
CN110263917B (en) Neural network compression method and device
CN106911777A (en) A kind of data processing method and server
CN117764132A (en) Data transmission method, device and system, electronic equipment and storage medium
Zhang et al. Decentralized optimal control for the mean field LQG problem of multi-agent systems
CN111695701B (en) System for realizing data set construction processing based on federal learning and construction generation method thereof
CN112529163A (en) Distributed training gradient compression acceleration method based on AllReduce
US11943277B2 (en) Conversion system, method and program
CN111695689B (en) Natural language processing method, device, equipment and readable storage medium
CN110334067B (en) Sparse matrix compression method, device, equipment and storage medium
KR102105951B1 (en) Constructing method of classification restricted boltzmann machine and computer apparatus for classification restricted boltzmann machine
Engelmann Distributed Optimization with Application to Power Systems and Control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210319