CN112529163A - Distributed training gradient compression acceleration method based on AllReduce - Google Patents
- Publication number
- CN112529163A (application CN202011504384.2A)
- Authority
- CN
- China
- Prior art keywords
- gradient
- allreduce
- acceleration method
- training
- compression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an AllReduce-based distributed training gradient compression acceleration method in which the gradient is converted from FP32 to FP16 for intra-node communication and is compressed with the error-feedback SGD (EF-SGD) method for inter-node communication, so that less gradient information is lost than with sparsification methods; moreover, the AllReduce architecture eliminates the bandwidth bottleneck of the Parameter Server communication structure.
Description
Technical Field
The invention relates to the technical field of deep learning, in particular to an AllReduce-based distributed training gradient compression acceleration method.
Background
The existing approaches, namely centralized distributed training based on the parameter-server mode and sparsification methods that select only part of the gradient values, have several problems: the sparsification method loses a large amount of gradient information; using the same gradient compression method for both intra-node and inter-node communication further increases the loss of gradient information; and the Parameter Server communication architecture itself has a bandwidth bottleneck relative to AllReduce.
Disclosure of Invention
The invention provides an AllReduce-based distributed training gradient compression acceleration method which addresses the synchronous-communication bandwidth problem of training models with a large number of parameters.
The technical scheme for solving the technical problems is as follows:
According to an aspect of the embodiments of the present invention, there is provided an AllReduce-based distributed training gradient compression acceleration method, including: adopting an AllReduce distributed deep gradient compression training architecture, in which no parameter server exists, the working machines form a ring-shaped closed-loop transmission path, and compressed gradients are transmitted between the GPUs; converting the gradient in the intra-node compression module from FP32 to FP16; and compressing the gradient using an error-feedback stochastic gradient descent algorithm.
Preferably, the error-feedback stochastic gradient descent algorithm comprises: obtaining, at each training step, the value p_t.
Preferably, the error-feedback stochastic gradient descent algorithm further comprises: compressing the value p_t with a gradient compression algorithm.
Preferably, the value p_t is p_t = η·g_t + e_t, where g_t is the stochastic gradient value and e_t is the error (deviation) value.
Preferably, the initial value of e_t is 0.
Preferably, the gradient compression is implemented as: taking the top k values of p_t with the top-k algorithm and performing data integration.
Preferably, the error-feedback stochastic gradient descent algorithm further comprises updating the parameters: x_{t+1} = x_t - Δ_t, e_{t+1} = p_t - Δ_t.
According to another aspect of the embodiments of the present invention, a storage medium is provided. The storage medium includes a stored program, and when the program runs, the device on which the storage medium is located is controlled to execute the AllReduce-based distributed training gradient compression acceleration method.
In this way, the gradient is converted from FP32 to FP16 for intra-node communication, and the EF-SGD method is used to compress the gradient for inter-node communication, so that less gradient information is lost than with the sparsification method. Moreover, the AllReduce architecture eliminates the bandwidth bottleneck of the Parameter Server communication architecture.
Drawings
FIG. 1 is a schematic diagram of the distributed deep gradient compression training architecture of the Parameter Server (PS) structure;
FIG. 2 is the AllReduce-based distributed deep gradient compression training architecture according to the present invention;
FIG. 3 is a schematic diagram of the Ring AllReduce architecture according to an embodiment of the present invention;
FIG. 4 is a schematic node connection diagram according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
The embodiment of the invention provides an Allreduce-based distributed training gradient compression acceleration method. As explained in detail below.
FIG. 1 shows the distributed gradient compression training architecture of the Parameter Server (PS) structure, and FIG. 2 shows the AllReduce-based distributed gradient compression training architecture according to an embodiment of the present invention. In the PS architecture, the GPUs of each working machine form a closed loop among themselves and pass the gradient after intra-node compression; the working machines have no communication connection with one another, and the inter-node compressed gradient is transmitted between the working machines and the parameter server. The AllReduce architecture has no parameter server: the working machines form a ring-shaped closed-loop transmission path, and compressed gradients are transmitted between the GPUs.
AllReduce is in fact a class of algorithms whose aim is to efficiently integrate (reduce) data held on different machines and then distribute the result back to every machine. In deep learning applications the data are usually vectors or matrices, and the reduction operation commonly used is Sum, Max, Min, and so on.
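The reduce-then-broadcast semantics can be illustrated with a minimal sketch in plain Python; the list representation of worker data and the sum reduction below are illustrative assumptions, not the patent's implementation:

```python
from typing import List

def allreduce_sum(worker_data: List[List[float]]) -> List[List[float]]:
    """Naive AllReduce: reduce (element-wise sum), then broadcast to every worker."""
    num_elems = len(worker_data[0])
    # Reduce: element-wise sum across all workers.
    reduced = [sum(w[i] for w in worker_data) for i in range(num_elems)]
    # Broadcast: every worker receives a copy of the reduced result.
    return [list(reduced) for _ in worker_data]

# Example: 3 workers, each holding a 4-element gradient vector.
workers = [[1.0, 2.0, 3.0, 4.0],
           [0.5, 0.5, 0.5, 0.5],
           [2.0, 0.0, 1.0, 3.0]]
print(allreduce_sum(workers))  # every worker ends up with [3.5, 2.5, 4.5, 7.5]
```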
There are many concrete implementations of AllReduce. The simplest is for every worker (which may be a GPU) to send its own data to all the other workers, but this approach wastes a great deal of bandwidth.
A slightly better implementation uses a master-slave architecture: one worker is designated the master, all the remaining workers send their data to the master, and after the master completes the reduction computation it distributes the result back to the other workers. In such an implementation, however, the master easily becomes the bottleneck of the whole network.
Ring AllReduce, shown in FIG. 3, is one form of AllReduce. Data are passed between the GPUs along a ring; if P is the number of GPUs and N is the total number of parameters to be transmitted, the amount each GPU must transmit is 2(P-1)·N/P.
It can be seen that this per-GPU transmission amount is essentially independent of P (the number of GPUs), i.e. the required communication bandwidth does not grow as the number of GPUs increases. In the PS structure, by contrast, the amount of parameters to be transmitted grows with the number of working machines, so the PS structure has a bandwidth bottleneck compared with the AllReduce method.
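This bandwidth comparison can be checked with a small arithmetic sketch in Python; the parameter count N and GPU counts P below are arbitrary example values, and the parameter-server figure counts the traffic handled by the server itself:

```python
def ring_allreduce_traffic_per_gpu(n_params: int, p_gpus: int) -> float:
    """Data volume each GPU sends in Ring AllReduce: 2*(P-1)*N/P elements."""
    return 2 * (p_gpus - 1) * n_params / p_gpus

def ps_server_traffic(n_params: int, p_workers: int) -> int:
    """Data volume at the parameter server: it receives N elements from each of
    the P workers and sends N back to each, i.e. 2*N*P elements in total."""
    return 2 * n_params * p_workers

N = 25_000_000  # illustrative parameter count (roughly ResNet50-sized)
for P in (2, 4, 8, 16):
    print(P,
          round(ring_allreduce_traffic_per_gpu(N, P) / 1e6, 1),  # stays below 2N
          round(ps_server_traffic(N, P) / 1e6, 1))               # grows linearly with P
```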
Moreover, because the bandwidth between GPUs within a machine differs from the bandwidth between machines, using the same gradient compression algorithm for both is not the optimal combination.
As shown in FIG. 4, the GPUs within a node (intra-node) are connected through NVLink, a PCIe switch, or a PCIe host bridge, whose speed and bandwidth far exceed those of the network cards connecting the servers. A sparsification algorithm would therefore needlessly discard a large amount of gradient information during model training, so the intra-node compression module only converts the gradient from FP32 to FP16. This halves the intra-node communication bandwidth while having only a slight influence on the accuracy of the forward and backward passes of the network being trained.
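A minimal sketch of this intra-node step, assuming the gradient is held as a NumPy FP32 array (the patent's module operates on GPU tensors, which is not reproduced here):

```python
import numpy as np

def intra_node_compress(grad_fp32: np.ndarray) -> np.ndarray:
    """Halve intra-node traffic by casting the FP32 gradient to FP16."""
    return grad_fp32.astype(np.float16)

def intra_node_decompress(grad_fp16: np.ndarray) -> np.ndarray:
    """Cast back to FP32 before the optimizer consumes the gradient."""
    return grad_fp16.astype(np.float32)

grad = np.random.randn(1024).astype(np.float32)
compressed = intra_node_compress(grad)
print(grad.nbytes, "->", compressed.nbytes)  # 4096 -> 2048 bytes
print(np.max(np.abs(grad - intra_node_decompress(compressed))))  # small rounding error
```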
Inter-node communication, on the other hand, is limited by the network bandwidth between servers, so the gradient is compressed as much as possible while keeping the impact on accuracy small.
An Error-Feedback stochastic gradient descent (EF-SGD) algorithm is used.
The method comprises the following steps:
1. At each training step, obtain p_t = η·g_t + e_t, where g_t is the stochastic gradient value and e_t is the error (deviation) value, whose initial value is 0.
2. Compute Δ_t = C(p_t), where C is a gradient compression algorithm; here the top-k algorithm is adopted, i.e. the top k values of the gradient are taken for data integration (Reduce).
3. Update the parameters: x_{t+1} = x_t - Δ_t, e_{t+1} = p_t - Δ_t.
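A minimal single-worker NumPy sketch of these three steps is given below; the learning rate, vector size, and random stand-in gradient are illustrative assumptions, and the AllReduce exchange of the compressed values between workers is omitted:

```python
import numpy as np

def topk_compress(p: np.ndarray, k: int) -> np.ndarray:
    """C(p_t): keep the k largest-magnitude entries of p_t, zero out the rest."""
    delta = np.zeros_like(p)
    idx = np.argsort(np.abs(p))[-k:]  # indices of the top-k values
    delta[idx] = p[idx]
    return delta

def ef_sgd_step(x, g, e, lr=0.01, k=32):
    """One EF-SGD step: p_t = lr*g_t + e_t, delta_t = C(p_t),
    x_{t+1} = x_t - delta_t, e_{t+1} = p_t - delta_t."""
    p = lr * g + e               # step 1: error-compensated update value
    delta = topk_compress(p, k)  # step 2: gradient compression (top-k)
    x_new = x - delta            # step 3: parameter update
    e_new = p - delta            #         carry the uncompressed residual forward
    return x_new, e_new

# Toy usage on a 1024-dimensional parameter vector.
x = np.zeros(1024)
e = np.zeros(1024)               # e_0 = 0
for _ in range(10):
    g = np.random.randn(1024)    # stand-in for a stochastic gradient
    x, e = ef_sgd_step(x, g, e)
```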
With the EF-SGD method of the invention, training ResNet50 on ImageNet is accelerated by 10% while the accuracy remains the same.
In this way, the gradient is converted from FP32 to FP16 for intra-node communication, and the EF-SGD method is used to compress the gradient for inter-node communication, so that less gradient information is lost than with the sparsification method. Moreover, the AllReduce architecture eliminates the bandwidth bottleneck of the Parameter Server communication architecture.
The AllReduce-based distributed training gradient compression acceleration method provided by the embodiment of the invention can be implemented in the form of a software functional module and, when sold or used as an independent product, can be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (8)
1. A distributed training gradient compression acceleration method based on AllReduce is characterized by comprising the following steps:
an AllReduce distributed deep gradient compression training architecture is adopted, wherein no parameter server exists in the AllReduce distributed deep gradient compression training architecture, a ring-shaped closed-loop transmission path is formed between the working machines, and compressed gradients are transmitted between the GPUs; and
the gradient in the intra-node compression module is converted from FP32 to FP16; and
the gradient is compressed using an error-feedback stochastic gradient descent algorithm.
2. The AllReduce-based distributed training gradient compression acceleration method according to claim 1, wherein the error-feedback stochastic gradient descent algorithm comprises:
obtaining, at each training step, the value p_t.
3. The AllReduce-based distributed training gradient compression acceleration method according to claim 2, wherein the error-feedback stochastic gradient descent algorithm further comprises:
compressing the value p_t with a gradient compression algorithm.
4. The AllReduce-based distributed training gradient compression acceleration method according to claim 3, wherein the value p_t is p_t = η·g_t + e_t, where g_t is the stochastic gradient value and e_t is the error (deviation) value.
5. The AllReduce-based distributed training gradient compression acceleration method according to claim 4, wherein
the initial value of e_t is 0.
6. The AllReduce-based distributed training gradient compression acceleration method according to claim 5, wherein
the gradient compression is implemented as: taking the top k values of p_t with the top-k algorithm for data integration.
7. The AllReduce-based distributed training gradient compression acceleration method according to claim 5, wherein the error-feedback stochastic gradient descent algorithm further comprises:
updating the parameters: x_{t+1} = x_t - Δ_t, e_{t+1} = p_t - Δ_t.
8. A storage medium characterized in that,
the storage medium comprises a stored program, wherein when the program runs, the device on which the storage medium is located is controlled to execute the AllReduce-based distributed training gradient compression acceleration method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011504384.2A CN112529163A (en) | 2020-12-17 | 2020-12-17 | Distributed training gradient compression acceleration method based on AllReduce |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011504384.2A CN112529163A (en) | 2020-12-17 | 2020-12-17 | Distributed training gradient compression acceleration method based on AllReduce |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112529163A true CN112529163A (en) | 2021-03-19 |
Family
ID=75001529
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011504384.2A Pending CN112529163A (en) | 2020-12-17 | 2020-12-17 | Distributed training gradient compression acceleration method based on AllReduce |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112529163A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109472347A (en) * | 2018-10-15 | 2019-03-15 | Sun Yat-sen University (中山大学) | A gradient compression method for distributed deep learning |
WO2020081399A1 (en) * | 2018-10-15 | 2020-04-23 | Nam Sung Kim | Network-centric architecture and algorithms to accelerate distributed training of neural networks |
US20200311539A1 (en) * | 2019-03-28 | 2020-10-01 | International Business Machines Corporation | Cloud computing data compression for allreduce in deep learning |
CN110633798A (en) * | 2019-09-12 | 2019-12-31 | 北京金山数字娱乐科技有限公司 | Parameter updating method and device in distributed training |
CN111917579A (en) * | 2020-07-30 | 2020-11-10 | 云知声智能科技股份有限公司 | Distributed training method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Shuai Zheng et al.: "Communication-efficient distributed blockwise momentum SGD with error-feedback", NIPS'19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 11450-11460 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210319 |