CN112529163A - Distributed training gradient compression acceleration method based on AllReduce - Google Patents

Distributed training gradient compression acceleration method based on AllReduce

Info

Publication number
CN112529163A
Authority
CN
China
Prior art keywords
gradient
allreduce
acceleration method
training
compression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011504384.2A
Other languages
Chinese (zh)
Inventor
谢远东
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202011504384.2A
Publication of CN112529163A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an AllReduce-based distributed training gradient compression acceleration method. Within a node (intra-node), gradients are converted from FP32 to FP16; between nodes (inter-node), gradients are compressed with the error-feedback SGD (EF-SGD) method, which loses less gradient information than sparsification. In addition, compared with the Params Server communication architecture, the AllReduce architecture eliminates the bandwidth bottleneck.

Description

Distributed training gradient compression acceleration method based on AllReduce
Technical Field
The invention relates to the technical field of deep learning, in particular to an AllReduce-based distributed training gradient compression acceleration method.
Background
Existing centralized distributed training methods based on a parameter server architecture, and sparsification methods that select only a subset of gradient values, have several problems: the sparsification method loses a large amount of gradient information; using the same gradient compression method for both intra-node and inter-node communication further increases that loss; and, compared with AllReduce, the Params Server communication architecture itself has a bandwidth bottleneck.
Disclosure of Invention
The invention provides an AllReduce-based distributed training gradient compression acceleration method that addresses the synchronous-communication bandwidth problem in training models with large numbers of parameters.
The technical solution adopted to solve the above problems is as follows:
according to an aspect of the embodiments of the present invention, there is provided an AllReduce-based distributed training gradient compression acceleration method, including: an AllReduce distributed depth gradient compression training architecture is adopted, a parameter server does not exist in the AllReduce distributed depth gradient compression training architecture, an annular closed-loop transmission path is formed between working machines, and compressed gradients are transmitted between GPUs; and converting the gradient in the intra-node compression module from FP32 to FP 16; and the gradient is compressed using an error feedback random gradient descent algorithm.
Preferably, the error-feedback stochastic gradient descent algorithm comprises: for each training step, obtaining the value p_t.
Preferably, the error-feedback stochastic gradient descent algorithm further comprises: applying a gradient compression algorithm to the value p_t.
Preferably, the value p_t is p_t = ηg_t + e_t, where g_t is the stochastic gradient value and e_t is the error (deviation) value.
Preferably, the initial value of e_t is 0.
Preferably, the gradient compression is implemented as: using the top-k algorithm to take the top k values of p_t and performing data integration.
Preferably, the error-feedback stochastic gradient descent algorithm further comprises updating the parameters: x_{t+1} = x_t - Δ_t, e_{t+1} = p_t - Δ_t.
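For clarity, the error-feedback update described above can be written compactly as follows. This is a restatement of the formulas in this section; the minus signs follow the standard error-feedback form of the update rule and the original equation images are not reproduced here.

```latex
% EF-SGD update, restated from the formulas in this disclosure
\begin{aligned}
p_t      &= \eta\, g_t + e_t, \qquad e_0 = 0        && \text{(error-compensated gradient)} \\
\Delta_t &= C(p_t) = \mathrm{topk}(p_t)             && \text{(gradient compression)} \\
x_{t+1}  &= x_t - \Delta_t                          && \text{(parameter update)} \\
e_{t+1}  &= p_t - \Delta_t                          && \text{(error fed back to the next step)}
\end{aligned}
```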
According to another aspect of the embodiments of the present invention, a storage medium is provided, where the storage medium includes a stored program, and when the program runs, a device in which the storage medium is located is controlled to execute the AllReduce-based distributed training gradient compression acceleration method.
In this way, FP32 is converted to FP16 for intra-node communication, and the EF-SGD method is used to compress the gradient for inter-node communication, which loses less information than the sparsification method. In addition, the AllReduce architecture eliminates the bandwidth bottleneck of the Params Server communication architecture.
Drawings
FIG. 1 is a schematic diagram of the distributed deep gradient compression training architecture of the Params Server structure;
FIG. 2 is the AllReduce-based distributed deep gradient compression training architecture of the present invention;
FIG. 3 is a schematic diagram of the Ring-AllReduce architecture according to an embodiment of the present invention;
FIG. 4 is a schematic node connection diagram according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
The embodiment of the invention provides an Allreduce-based distributed training gradient compression acceleration method. As explained in detail below.
Fig. 1 shows the distributed gradient compression training architecture of the Params Server (PS) structure, and fig. 2 shows the AllReduce-based distributed gradient compression training architecture according to an embodiment of the present invention. In the PS architecture, the GPUs within each machine form a closed loop and exchange the intra-node-compressed gradient; the working machines are not directly connected to each other, and the inter-node-compressed gradient is transmitted between the working machines and the parameter server. The AllReduce architecture has no parameter server: the working machines form a ring-shaped closed-loop transmission path, and compressed gradients are transmitted between GPUs.
AllReduce is in fact a class of algorithms whose goal is to efficiently integrate (reduce) data held by different machines and then distribute the result back to every machine. In deep learning applications the data are usually vectors or matrices, and the reduction operation is typically Sum, Max, Min, etc.
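As a minimal, single-process sketch (an illustration of the semantics, not the patent's implementation), the following Python snippet simulates a sum AllReduce across several workers: after the operation, every worker holds the elementwise sum of all workers' tensors.

```python
import numpy as np

def simulated_allreduce(worker_tensors, op=np.add):
    """Simulate AllReduce: reduce all workers' tensors, then give every
    worker a copy of the result (single-process stand-in for the real
    collective operation)."""
    result = worker_tensors[0].copy()
    for t in worker_tensors[1:]:
        result = op(result, t)                       # reduce step (Sum here)
    return [result.copy() for _ in worker_tensors]   # broadcast step

# Example: 4 workers, each holding a gradient vector of length 3.
workers = [np.full(3, fill_value=i, dtype=np.float32) for i in range(4)]
reduced = simulated_allreduce(workers)
print(reduced[0])   # [6. 6. 6.] -- every worker ends up with the same sum
```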
There are many concrete implementations of AllReduce. The simplest is for every worker (which may be a GPU) to send its own data to every other worker, but this wastes a great deal of bandwidth.
A slightly better implementation uses a master-slave architecture: one worker is designated as the master, all other workers send their data to the master, and the master performs the reduction and then distributes the result back to the other workers. In this implementation, however, the master easily becomes the bottleneck of the whole network.
Ring-AllReduce, shown in FIG. 3, is one form of AllReduce. Data are passed between GPUs along a ring. If P is the number of GPUs and N is the size of the parameter data to be transmitted, the amount of data each GPU needs to transmit is 2N(P-1)/P.
It can be seen that this transfer amount is essentially independent of P (the number of GPUs), i.e. the per-GPU communication volume does not grow as more GPUs are added. In the PS structure, by contrast, the amount of data that must be transferred grows with the number of working machines, so the PS structure has a bandwidth bottleneck compared with AllReduce.
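The comparison can be illustrated numerically with a short sketch (not from the patent). It assumes each worker holds N gradient values and that a single parameter server must receive and send a full copy of the gradient per worker each step; that server-side formula is an assumption used only for illustration.

```python
def ring_allreduce_per_gpu(N, P):
    """Data each GPU sends in Ring-AllReduce: 2*N*(P-1)/P values."""
    return 2 * N * (P - 1) / P

def param_server_traffic(N, P):
    """Assumed parameter-server traffic per step: N values received from
    and N values sent back to each of the P workers."""
    return 2 * N * P

N = 25_000_000  # e.g. roughly a ResNet-50-sized parameter count
for P in (2, 4, 8, 16):
    print(P, ring_allreduce_per_gpu(N, P), param_server_traffic(N, P))
# Ring-AllReduce per-GPU traffic stays below 2N no matter how large P is,
# while the parameter server's traffic grows linearly with P.
```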
Moreover, because the bandwidth between GPUs within a machine differs greatly from the bandwidth between machines, using the same gradient compression algorithm in both cases is not an optimal combination.
As shown in fig. 4, within a node (intra-node) the GPUs are connected through NVLink, a PCIe switch, or a PCIe Host Bridge, so the speed and bandwidth far exceed those of the network cards between servers. Applying a sparsification algorithm here would therefore needlessly lose a large amount of gradient information during training. For this reason, the intra-node compression module only converts the gradient from FP32 to FP16: this halves the communication bandwidth while having only a slight effect on the accuracy of the forward and backward passes of the network being trained.
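A minimal NumPy sketch of this intra-node step (a stand-in for the actual GPU code, which the patent does not provide): casting an FP32 gradient tensor to FP16 halves the number of bytes that must be moved, at the cost of a small amount of precision.

```python
import numpy as np

grad_fp32 = np.random.randn(1_000_000).astype(np.float32)  # simulated gradient
grad_fp16 = grad_fp32.astype(np.float16)                    # intra-node compression

print(grad_fp32.nbytes, grad_fp16.nbytes)   # 4000000 vs 2000000 bytes (halved)
# small rounding error introduced by the FP16 cast
print(np.max(np.abs(grad_fp32 - grad_fp16.astype(np.float32))))
```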
Inter-node communication, by contrast, is limited by the network bandwidth between servers, so the gradient is compressed as much as possible while keeping the impact on precision small.
To this end, the Error-Feedback stochastic gradient descent (EF-SGD) algorithm is used:
the method comprises the following steps:
1. decompress for each training to get pt=ηgt+et,gtFor random gradient descent
Figure BDA0002844450360000041
Where e istThe initial value is 0 as the deviation value.
2.Δt=C(pt) And C is a gradient compression algorithm, wherein a topk algorithm is adopted, namely k values before the gradient are taken for data integration (Reduce).
3. Updating parameter xt+1=xtt,et+1=ptt
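A compact, single-process Python sketch of these three steps follows (an illustration under the names used above, not the patent's distributed implementation). The top-k compressor here keeps the k largest-magnitude entries of p_t and zeroes the rest; selecting by magnitude is an assumption, since the description only says the top k values are taken. The discarded remainder is carried forward as the error e_{t+1}.

```python
import numpy as np

def topk_compress(p, k):
    """C(p): keep the k largest-magnitude entries of p, zero the rest."""
    delta = np.zeros_like(p)
    idx = np.argsort(np.abs(p))[-k:]   # indices of the k largest |p_i|
    delta[idx] = p[idx]
    return delta

def ef_sgd_step(x, grad, e, lr=0.1, k=10):
    """One EF-SGD step: p_t = lr*g_t + e_t, Delta_t = C(p_t),
    x_{t+1} = x_t - Delta_t, e_{t+1} = p_t - Delta_t."""
    p = lr * grad + e             # step 1: error-compensated gradient
    delta = topk_compress(p, k)   # step 2: gradient compression
    x_new = x - delta             # step 3a: parameter update
    e_new = p - delta             # step 3b: feed the compression error forward
    return x_new, e_new

# Toy usage: minimize f(x) = 0.5*||x||^2, whose gradient is x itself.
x = np.random.randn(100).astype(np.float32)
e = np.zeros_like(x)              # e_0 = 0
for _ in range(200):
    x, e = ef_sgd_step(x, grad=x, e=e, lr=0.1, k=10)
print(np.linalg.norm(x))          # norm is much smaller than at initialization
```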
With the EF-SGD method of the present invention, training ResNet-50 on ImageNet is accelerated by 10% while the accuracy remains unchanged.
In this way, FP32 is converted to FP16 for intra-node communication, and the EF-SGD method is used to compress the gradient for inter-node communication, which loses less information than the sparsification method. In addition, the AllReduce architecture eliminates the bandwidth bottleneck of the Params Server communication architecture.
The AllReduce-based distributed training gradient compression acceleration method provided by the embodiment of the invention may be implemented in the form of a software functional module and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. An AllReduce-based distributed training gradient compression acceleration method, characterized by comprising the following steps:
adopting an AllReduce distributed deep gradient compression training architecture, in which there is no parameter server, the working machines form a ring-shaped closed-loop transmission path, and compressed gradients are transmitted between GPUs;
converting the gradient from FP32 to FP16 in the intra-node compression module; and
compressing the gradient using an error-feedback stochastic gradient descent algorithm.
2. The AllReduce-based distributed training gradient compression acceleration method according to claim 1, wherein the error-feedback stochastic gradient descent algorithm comprises:
for each training step, obtaining the value p_t.
3. The AllReduce-based distributed training gradient compression acceleration method according to claim 2, wherein the error-feedback stochastic gradient descent algorithm further comprises:
applying a gradient compression algorithm to the value p_t.
4. The AllReduce-based distributed training gradient compression acceleration method according to claim 3, wherein the value p_t is p_t = ηg_t + e_t, where g_t is the stochastic gradient value and e_t is the error (deviation) value.
5. The AllReduce-based distributed training gradient compression acceleration method according to claim 4, wherein the initial value of e_t is 0.
6. The AllReduce-based distributed training gradient compression acceleration method according to claim 5, wherein the gradient compression is implemented as: using the top-k algorithm to take the top k values of p_t for data integration.
7. The AllReduce-based distributed training gradient compression acceleration method according to claim 5, wherein the error-feedback stochastic gradient descent algorithm further comprises updating the parameters: x_{t+1} = x_t - Δ_t, e_{t+1} = p_t - Δ_t.
8. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute the AllReduce-based distributed training gradient compression acceleration method according to any one of claims 1 to 7.
CN202011504384.2A 2020-12-17 2020-12-17 Distributed training gradient compression acceleration method based on AllReduce Pending CN112529163A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011504384.2A CN112529163A (en) 2020-12-17 2020-12-17 Distributed training gradient compression acceleration method based on AllReduce

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011504384.2A CN112529163A (en) 2020-12-17 2020-12-17 Distributed training gradient compression acceleration method based on AllReduce

Publications (1)

Publication Number Publication Date
CN112529163A true CN112529163A (en) 2021-03-19

Family

ID=75001529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011504384.2A Pending CN112529163A (en) 2020-12-17 2020-12-17 Distributed training gradient compression acceleration method based on AllReduce

Country Status (1)

Country Link
CN (1) CN112529163A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472347A (en) * 2018-10-15 2019-03-15 中山大学 A kind of gradient compression method of distribution deep learning
WO2020081399A1 (en) * 2018-10-15 2020-04-23 Nam Sung Kim Network-centric architecture and algorithms to accelerate distributed training of neural networks
US20200311539A1 (en) * 2019-03-28 2020-10-01 International Business Machines Corporation Cloud computing data compression for allreduce in deep learning
CN110633798A (en) * 2019-09-12 2019-12-31 北京金山数字娱乐科技有限公司 Parameter updating method and device in distributed training
CN111917579A (en) * 2020-07-30 2020-11-10 云知声智能科技股份有限公司 Distributed training method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shuai Zheng et al., "Communication-efficient distributed blockwise momentum SGD with error-feedback," NIPS '19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 11450-11460 *

Similar Documents

Publication Publication Date Title
CN110929886B (en) Model training and predicting method and system
Sattler et al. Robust and communication-efficient federated learning from non-iid data
US20210073678A1 (en) Method, apparatus and system for secure vertical federated learning
US8904149B2 (en) Parallelization of online learning algorithms
Lin et al. Design of optimal sparse feedback gains via the alternating direction method of multipliers
JP6227813B1 (en) Distributed deep learning device and distributed deep learning system
JP7095675B2 (en) Information processing equipment, information processing methods, and programs
EP3336760A1 (en) Combined adversarial learning of inverse image manipulation operations
US20190213470A1 (en) Zero injection for distributed deep learning
EP4206943A1 (en) Graph data processing method and apparatus, computer device and storage medium
CN113487035B (en) Control pulse determining method and device for quantum gate and electronic equipment
CN110505218B (en) Grid data self-adaptive compression transmission method based on JSON and computer storage medium
US20210149985A1 (en) Method and apparatus for processing large-scale distributed matrix product
US11853391B1 (en) Distributed model training
CN110263917B (en) Neural network compression method and device
CN106911777A (en) A kind of data processing method and server
CN117764132A (en) Data transmission method, device and system, electronic equipment and storage medium
Zhang et al. Decentralized optimal control for the mean field LQG problem of multi-agent systems
CN111695701B (en) System for realizing data set construction processing based on federal learning and construction generation method thereof
CN112529163A (en) Distributed training gradient compression acceleration method based on AllReduce
US11943277B2 (en) Conversion system, method and program
CN111695689B (en) Natural language processing method, device, equipment and readable storage medium
CN110334067B (en) Sparse matrix compression method, device, equipment and storage medium
KR102105951B1 (en) Constructing method of classification restricted boltzmann machine and computer apparatus for classification restricted boltzmann machine
Engelmann Distributed Optimization with Application to Power Systems and Control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210319