CN111722937A - Deep learning weight updating method and device - Google Patents


Info

Publication number: CN111722937A
Application number: CN201910217885.3A
Authority: CN (China)
Prior art keywords: segment, weight, precision, processing unit, central processing
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 林立翔, 龚志刚
Current Assignee: Alibaba Group Holding Ltd
Original Assignee: Alibaba Group Holding Ltd
Application filed by: Alibaba Group Holding Ltd
Priority date / Filing date: 2019-03-21 (priority to CN201910217885.3A)
Publication date: 2020-09-29 (publication of CN111722937A)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/545 Interprogram communication where tasks reside in different layers, e.g. user- and kernel-space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention provides a method and a device for updating deep learning weights. The weight gradient is divided into a plurality of segments, and the collective communication and the weight update of each segment are directly pipelined: once the collective communication of the current segment is completed, the weight update corresponding to that segment is performed immediately, while the next segment performs its collective communication in parallel on the communication link.

Description

Deep learning weight updating method and device
Technical Field
The invention relates to the technical field of neural network deep learning, and in particular to a deep learning weight updating method and device.
Background
Deep learning is currently a new field in machine learning research. Its motivation is to establish and simulate a neural network that analyzes and learns like the human brain, interpreting data by imitating the mechanisms of the brain. Deep learning applications include speech recognition, image recognition, natural language processing, and so on; the amount of computation in these applications is enormous and requires large-scale deep learning computation.
To shorten the processing time of deep learning applications and improve computing efficiency, a high-density computing mode in which a Central Processing Unit (CPU) cooperates with multiple Graphics Processing Units (GPUs) is usually adopted: the GPUs, with their powerful parallel computing capability, execute the time-consuming forward and backward computation, while the remaining work dictated by the algorithmic characteristics of the deep learning application, namely parameter update computation, data reading and distribution, and neural network model update computation, is completed by the CPU.
However, in GPU-based deep learning, within one iteration of multi-machine, multi-card (multi-GPU) training, each GPU performs forward and backward computation on different data at the same time and generates its own weight gradients. Before the weight update can be performed, the weight gradient on each GPU must therefore be exchanged with the corresponding weight gradients on the other GPUs to obtain the actual weight gradient produced by all of the data in this iteration. Collective communication (Allreduce) is a commonly used weight-gradient reduction operation that cyclically transmits the weight gradients among the GPUs until the weight gradient value on each GPU is the sum of the gradient values on all GPUs.
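For illustration, the Allreduce operation described above can be sketched as follows. This is a minimal sketch assuming a PyTorch torch.distributed process group in which each rank drives one GPU; the helper name average_gradients and the division by the world size are illustrative choices, not details given in the patent.

```python
import torch
import torch.distributed as dist

def average_gradients(model):
    """Reduce every weight gradient across all GPUs so that each GPU ends up
    with the same (summed, then averaged) gradient for this iteration."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # After all_reduce, every rank holds the sum of the gradients
            # produced by all GPUs for this parameter.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```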
Therefore, the communication between GPUs and between machines introduced by this reduction operation adds communication time overhead to the process of applying the weight gradients to the weights, which slows down the weight update.
Disclosure of Invention
To solve the above problems, the present invention provides a method and an apparatus for updating deep learning weights that can effectively hide the weight update computation time within the network communication time, so that the final completion time is approximately the communication time alone, thereby saving weight update time.
An embodiment of the invention provides a deep learning weight updating method, comprising the following steps:
decomposing the weight gradients generated by the forward and backward calculations into a plurality of segments;
performing collective communication between each segment and its corresponding segment;
performing the weight update corresponding to a segment when the collective communication of that segment ends;
until the weight updates of all of the plurality of segments are completed.
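The four steps above can be read as the following minimal sketch, assuming PyTorch with asynchronous torch.distributed collectives; the segment count of 6, the flat one-dimensional layout of the weights and gradients, and the plain SGD update rule are assumptions made for illustration only.

```python
import torch
import torch.distributed as dist

def pipelined_segment_update(flat_weight, flat_grad, lr=0.01, num_segments=6):
    """flat_weight and flat_grad are assumed to be contiguous 1-D tensors."""
    w_segments = flat_weight.chunk(num_segments)   # views into flat_weight
    g_segments = flat_grad.chunk(num_segments)

    # Launch collective communication (Allreduce) for every segment;
    # async_op=True returns a handle instead of blocking.
    handles = [dist.all_reduce(g, op=dist.ReduceOp.SUM, async_op=True)
               for g in g_segments]

    # As soon as a segment's communication finishes, update that segment's
    # weights while the remaining segments keep communicating on the link.
    world_size = dist.get_world_size()
    for w_seg, g_seg, handle in zip(w_segments, g_segments, handles):
        handle.wait()
        w_seg -= lr * (g_seg / world_size)
```

The point of this arrangement is that handle.wait() only blocks on one segment at a time, so the update of one segment overlaps with the communication of the segments that follow it.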
Optionally, before decomposing the weight gradient generated by the forward and backward calculation into a plurality of segments, the method further comprises:
converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
Optionally, each segment and its corresponding segment are the same segment distributed on different machines, and each machine has multiple collective-communication threads, so that each segment performs collective communication with the corresponding segment on another machine in parallel.
Optionally, each segment corresponds to the half-precision weight gradient of that segment on the graphics processor;
after each segment performs collective communication with its corresponding segment, the method further comprises:
converting the half-precision weight gradient of the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
Optionally, performing the weight update corresponding to a segment when the collective communication of that segment ends comprises:
performing the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, to generate a new single-precision master weight parameter of the segment on the central processing unit.
The present application further provides a deep learning weight updating apparatus, comprising:
a decomposition module for decomposing the weight gradient generated by the forward and backward calculations into a plurality of segments;
a communication module for performing collective communication between each segment and its corresponding segment;
and an update module for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
Optionally, the apparatus further comprises:
a conversion module for converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and a generating module for performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
Optionally, each segment and its corresponding segment are the same segment distributed on different machines, and each machine has multiple collective-communication threads, so that each segment performs collective communication with the corresponding segment on another machine in parallel.
Optionally, each segment corresponds to the half-precision weight gradient of that segment on the graphics processor;
the conversion module is further configured to convert the half-precision weight gradient of the segment on the graphics processor into a single-precision weight gradient on the central processing unit.
Optionally, the update module is specifically configured to perform the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, so as to generate a new single-precision master weight parameter of the segment on the central processing unit.
The present application provides a server comprising: a memory, a processor, and a communication component;
the memory for storing a computer program;
the processor, coupled to the memory and the communication component, is configured to execute a computer program for performing the steps or operations of the above-mentioned deep learning weight updating method.
The present application provides a computer-readable storage medium storing a computer program, which when executed by a computer, can implement the steps or operations of the weight updating method for deep learning described above.
In the invention, the weight gradient is divided into a plurality of segments, and the collective communication and the weight update of each segment are directly pipelined: once the collective communication of the current segment is completed, the weight update corresponding to that segment is performed immediately, while the next segment performs its collective communication in parallel on the communication link.
Meanwhile, to save GPU memory, the invention stores the single-precision master weight parameters (master weights), which conventionally reside in GPU memory, on the CPU instead, and the update (update) operation on the weight parameters is also carried out on the CPU. This reduces the parameter memory consumption on the GPU by 2/3 and greatly improves GPU memory utilization.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
Fig. 1 is a flowchart of a prior-art deep learning weight gradient update method;
Fig. 2 is a flowchart of a deep learning weight updating method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the principle of a segment-granularity weight update method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the principle of updating weights using CPU-side master weight parameters and a CPU-side weight update according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a deep learning weight updating apparatus according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and "a" and "an" generally include at least two, but do not exclude at least one, unless the context clearly dictates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a product or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such product or system. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in a product or system that includes the element.
Fig. 1 is a flowchart of a prior-art method for updating deep learning weights from weight gradients. As shown in Fig. 1:
1. The weight gradients (weight gradients) generated by a GPU after it completes its forward and backward computation undergo cyclic (ring) reduction communication with the corresponding weight gradients generated by the other GPUs after they complete their forward and backward computation, producing average weight gradients (average weight gradients);
2. The average weight gradients are used in an update (update) calculation on the weights, resulting in updated weights (updated weights).
Therefore, each weight update calculation can only be performed after the collective communication of the corresponding weight gradient has completed, so the time to finally complete the weight update is the sum of the collective communication time of the weight gradients and the update calculation time, which greatly reduces the speed of weight updating. Moreover, during the entire collective communication (Allreduce), the computing resources are idle while the communication link is not fully occupied, so computing resources are largely wasted.
In conventional mixed-precision deep learning training on a GPU, a single set of single-precision (F32) master weight parameters (master weights) is maintained on the GPU. First, the master weight parameters are converted into half-precision (F16) weight parameters (weights) on the GPU.
Then, the forward (Forward) and backward (Backward) operations of deep learning are performed with the half-precision (F16) weight parameters (weights) on the GPU, producing a half-precision (F16) weight gradient (weight gradient) on the GPU.
The half-precision weight gradient on the GPU is then converted into a single-precision (F32) weight gradient on the GPU.
Thereafter, a weight update (update) calculation is performed on the GPU with the single-precision (F32) weight gradient on the GPU and the single-precision (F32) master weight parameters on the GPU as inputs, generating updated single-precision (F32) master weight parameters (updated master weights) on the GPU.
Then, the next iteration is computed with the updated single-precision (F32) master weight parameters on the GPU (the above steps are repeated) until all iterations are finished.
Thus the master weight parameters (master weights) and the updated master weight parameters (updated master weights) share the same memory space. In the whole process, the master weight parameters (master weights) used in the update stage and the weight gradients (weight gradients) involved in the forward and backward (Forward + Backward) calculation are both stored in GPU memory, where the master weight parameters are single-precision and the weight gradients are half-precision. The memory footprint of the single-precision master weight parameters (master weights) is twice that of the half-precision weight gradient (weight gradient), so conventional mixed-precision deep learning training on the GPU adds twice that amount of parameter memory consumption.
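For reference, this conventional flow can be sketched as follows, assuming PyTorch; forward_backward, lr, num_iterations, and the tensor shape are placeholders, and both precision copies deliberately remain on the GPU, which is where the extra parameter memory cost arises.

```python
import torch

def forward_backward(weights_f16):
    """Placeholder for the model's forward and backward pass; returns an F16
    weight gradient of the same shape (here just a random stand-in)."""
    return torch.randn_like(weights_f16)

# Single-precision (F32) master weights kept in GPU memory (conventional scheme).
master_weights = torch.randn(1024, 1024, device='cuda', dtype=torch.float32)

lr = 0.01            # assumed learning rate
num_iterations = 10  # assumed iteration count

for _ in range(num_iterations):
    weights_f16 = master_weights.half()        # F32 -> F16, still on the GPU
    grad_f16 = forward_backward(weights_f16)   # forward + backward (placeholder)
    grad_f32 = grad_f16.float()                # F16 -> F32, still on the GPU
    master_weights -= lr * grad_f32            # update also runs on the GPU
```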
To optimize this inefficient weight update communication and calculation process in the prior art, the invention divides the weight gradient into a plurality of segments and directly pipelines the collective communication and the weight update of each segment: once the collective communication of the current segment is completed, the weight update corresponding to that segment is performed immediately, while the other segments perform their collective communication in parallel on the communication link.
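As a rough timing model for this pipelining, stated here as an editorial assumption rather than a bound given in the patent, let T_comm be the total collective-communication time, T_update the total update-calculation time, and n the number of segments:

```latex
% Sequential scheme (Fig. 1): every update waits for the whole communication.
T_{\mathrm{sequential}} = T_{\mathrm{comm}} + T_{\mathrm{update}}

% Per-segment pipelining over n segments: only the update of the last segment
% remains exposed outside the communication.
T_{\mathrm{pipelined}} \approx T_{\mathrm{comm}} + \frac{T_{\mathrm{update}}}{n}
```

Under this assumption, only the update of the last segment is left outside the communication window, which matches the statement that the final completion time is approximately the communication time alone.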
Meanwhile, to save GPU memory, the invention stores the master weight parameters (master weights) on the CPU, and the update (update) operation on the weight parameters is also carried out on the CPU, saving two-thirds of the parameter memory on the GPU and greatly improving GPU memory utilization.
The application scenario of the invention involves multiple machines (multiple graphics processors) used to accelerate the training of a deep learning neural network model, where each machine has multiple collective-communication threads for performing collective communication with the other machines in parallel. The implementation of the invention is described in detail below through specific embodiments.
Fig. 2 is a flowchart of a deep learning weight updating method according to an embodiment of the present invention. As shown in Fig. 2, the method includes:
201. Decomposing the weight gradients generated by the forward and backward calculations into a plurality of segments.
In an optional embodiment, before decomposing the weight gradient generated by the forward and backward calculation into a plurality of segments, the method further includes:
converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
The generated half-precision weight gradient on the graphics processor is then decomposed into a plurality of segments, where each segment corresponds to the half-precision weight gradient of that segment on the graphics processor.
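A minimal sketch of this preparation step, assuming PyTorch: the single-precision master weights live in host (CPU) memory, only the half-precision copy and the half-precision gradient occupy GPU memory, and forward_backward, the tensor shape, and the segment count of 6 are placeholders.

```python
import torch

def forward_backward(weights_f16):
    """Placeholder for the forward and backward pass; returns an F16 gradient."""
    return torch.randn_like(weights_f16)

# Single-precision (F32) master weights are kept on the CPU, not on the GPU.
master_weights_cpu = torch.randn(1024, 1024, dtype=torch.float32)

# F32 on the CPU -> F16 on the GPU, then forward/backward produces an F16 gradient.
weights_f16 = master_weights_cpu.to('cuda', dtype=torch.float16)
grad_f16 = forward_backward(weights_f16)

# Decompose the F16 gradient into segments; each chunk is one segment (0..5).
segments = grad_f16.flatten().chunk(6)
```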
202. Performing collective communication between each segment and its corresponding segment.
Each segment and its corresponding segment are the same segment distributed on different machines.
In an optional embodiment, after a weight gradient is decomposed into a plurality of segments, each segment carries its own identifier. For example, if the weight gradient is divided into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4, and segment 5, the sequence number of a segment can be used as its identifier.
In deep learning training on multiple machines (multiple graphics processors), the decomposed segments are present on each of the machines. For example, in deep learning training on 3 machines (machine 1, machine 2, and machine 3), each of machine 1, machine 2, and machine 3 holds the 6 segments, and each machine has multiple collective-communication threads for performing collective communication with the other machines in parallel. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines; for example, segment 0 on machine 1 performs collective communication with segment 0 on machine 2, and segment 0 on machine 2 performs collective communication with segment 0 on machine 3.
In an optional embodiment, after each segment performs collective communication with its corresponding segments, the method further includes:
converting the half-precision weight gradient of the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
203. Performing the weight update corresponding to a segment when the collective communication of that segment ends.
In an optional embodiment, this step of performing the weight update corresponding to a segment when its collective communication ends includes:
performing the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, to generate a new single-precision master weight parameter of the segment on the central processing unit.
At the same time, the other segments perform their collective communication in parallel.
204. Repeating until the weight updates of all of the plurality of segments are completed.
The next iterative computation is then performed with the updated master weight parameters, that is, steps 201 to 204 are repeated, until all weight update iterations are finished.
Fig. 3 is a schematic diagram of the principle of the segment-granularity weight update method according to an embodiment of the present invention, and Fig. 4 is a schematic diagram of the principle of updating weights with CPU-side master weight parameters and a CPU-side weight update according to an embodiment of the present invention. The weight update of one layer in multi-machine, multi-card training is taken as an example in the following description.
First, as shown in Fig. 4, a single-precision (F32) copy of the master weight parameters (master weights) is maintained on the CPU. The master weight parameters are converted into half-precision (F16) weight parameters (weights) on the GPU, and the forward (Forward) and backward (Backward) operations of deep learning are performed with the half-precision weight parameters on the GPU to generate a half-precision (F16) weight gradient (weight gradient) on the GPU.
Thereafter, as shown in Fig. 3, the half-precision weight gradient on the GPU is decomposed into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4, and segment 5.
Each segment is the half-precision weight gradient corresponding to that segment on the GPU.
Each machine holds these 6 segments, and each machine has multiple collective-communication threads for performing, in parallel, the collective communication of each segment with the same segment on the other machines, so the collective communication of each segment proceeds concurrently and independently. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines: segment 0 on machine 1 with segment 0 on machine 2, and segment 0 on machine 2 with segment 0 on machine 3; at the same time, segment 1 on machine 1 with segment 1 on machine 2, and segment 1 on machine 2 with segment 1 on machine 3; and likewise for segments 2, 3, 4, and 5.
It should be noted that the invention decomposes the weight gradient into a plurality of segments and is not limited to the 6 segments shown in Fig. 3; n segments may be launched concurrently for collective communication.
After the collective communication of each segment is completed, the half-precision weight gradient corresponding to that segment on the GPU is converted into a single-precision weight gradient on the CPU.
As shown in Fig. 3, after segment 0 completes its collective communication, the half-precision weight gradient corresponding to segment 0 on the GPU is converted into a single-precision weight gradient on the CPU; the single-precision weight gradient of segment 0 on the CPU and the single-precision master weight parameters maintained on the CPU are then taken as inputs to the weight update operation (update) of segment 0 on the CPU, generating new single-precision master weight parameters (updated master weights) of segment 0 on the CPU.
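What happens once segment 0 finishes its collective communication can be sketched as follows, assuming PyTorch; seg0_grad_f16 (the reduced half-precision gradient of segment 0 on the GPU), master_seg0_cpu (the slice of the single-precision master weights on the CPU corresponding to segment 0), and lr are stand-ins for values produced by the surrounding steps.

```python
import torch

# Stand-ins for values produced by the preceding steps (assumptions).
seg0_grad_f16 = torch.randn(1024, device='cuda', dtype=torch.float16)  # reduced F16 gradient of segment 0
master_seg0_cpu = torch.randn(1024, dtype=torch.float32)               # F32 master slice on the CPU
lr = 0.01                                                              # assumed learning rate

# F16 on the GPU -> F32 on the CPU for this segment only.
seg0_grad_f32_cpu = seg0_grad_f16.to('cpu', dtype=torch.float32)

# The update runs on the CPU, so the F32 master copy never occupies GPU memory.
master_seg0_cpu -= lr * seg0_grad_f32_cpu
```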
At the same time, the other segments (segment 1, segment 2, segment 3, segment 4, and segment 5) perform collective communication with the corresponding segments on the other machines in parallel, so the weight update calculation time can be hidden within the network communication time; the final completion time is approximately the communication time alone, which saves weight update time.
The next weight update iteration is then computed with the updated master weight parameters of the segments, until all weight update iterations are completed.
In summary, the weight gradient is divided into a plurality of segments, and the collective communication and the weight update of each segment are directly pipelined: once the collective communication of the current segment is completed, the weight update corresponding to that segment is performed immediately, while the other segments perform their collective communication in parallel on the communication link.
Meanwhile, to save GPU memory, the invention stores the single-precision master weight parameters (master weights), which conventionally reside in GPU memory, on the CPU, and the update (update) operation on the weight parameters is also carried out on the CPU, thereby reducing the parameter memory consumption on the GPU by 2/3 and greatly improving GPU memory utilization.
Fig. 5 is a schematic structural diagram of a deep learning weight updating apparatus according to an embodiment of the present invention. As shown in Fig. 5, the apparatus includes:
a decomposition module for decomposing the weight gradient generated by the forward and backward calculations into a plurality of segments;
a communication module for performing collective communication between each segment and its corresponding segment;
and an update module for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
Optionally, the apparatus further comprises:
a conversion module for converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and a generating module for performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
Each segment corresponds to the half-precision weight gradient of that segment on the graphics processor.
Optionally, the conversion module is further configured to convert the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
Optionally, the update module is specifically configured to perform the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, so as to generate a new single-precision master weight parameter of the segment on the central processing unit.
The apparatus shown in this embodiment can perform the method embodiment shown in Fig. 2; its implementation principle and technical effects are not described again here.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention. As shown in Fig. 6, the server includes:
a memory, a central processing unit, a graphics processor, and a communication component;
the memory for storing a computer program;
the central processing unit and the graphics processor, each coupled to the memory and the communication component, for executing computer programs.
The central processing unit is configured to convert the single-precision master weight parameters into half-precision weight parameters and send the half-precision weight parameters to the graphics processor through the communication component;
the graphics processor is configured to perform the forward and backward operations of deep learning according to the half-precision weight parameters to generate a half-precision weight gradient;
the graphics processor is further configured to decompose the half-precision weight gradient into a plurality of segments;
the graphics processor is further configured to perform, through the communication component, collective communication between each segment and the corresponding segments on other machines;
each segment and its corresponding segment are the same segment distributed on different machines.
In an optional embodiment, after a weight gradient is decomposed into a plurality of segments, each segment carries its own identifier. For example, if the weight gradient is divided into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4, and segment 5, the sequence number of a segment can be used as its identifier.
In deep learning training on multiple machines (multiple graphics processors), the decomposed segments are present on each of the machines. For example, in deep learning training on 3 machines (machine 1, machine 2, and machine 3), each of machine 1, machine 2, and machine 3 holds the 6 segments, and each machine has multiple collective-communication threads for performing collective communication with the other machines in parallel. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines; for example, segment 0 on machine 1 performs collective communication with segment 0 on machine 2, and segment 0 on machine 2 performs collective communication with segment 0 on machine 3.
The graphics processor is further configured to convert the half-precision weight gradient corresponding to each segment into a single-precision weight gradient and send it to the central processing unit through the communication component.
Correspondingly, the central processing unit is further configured to perform, for each segment, the weight update operation according to the single-precision weight gradient corresponding to the segment and the single-precision master weight parameter corresponding to the segment, so as to generate a new single-precision master weight parameter of the segment.
It should be noted that the other segments perform their collective communication in parallel at the same time. The next iterative computation is performed with the updated master weight parameters until all weight update iterations are finished.
Further, as shown in Fig. 6, the server also includes a display, a power component, an audio component, and so on. Only some of the components are shown schematically in Fig. 6; this does not mean that the server includes only the components shown in Fig. 6.
The server shown in this embodiment can execute the method embodiment shown in Fig. 2; its implementation principle and technical effects are not described again here.
Accordingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program. When the computer program is executed by a computer, the steps or operations related to the server in the method embodiment shown in Fig. 2 can be implemented, which are not described again here.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A deep learning weight updating method, characterized by comprising the following steps:
decomposing the weight gradients generated by forward and backward calculations into a plurality of segments;
performing collective communication between each segment and its corresponding segment;
performing the weight update corresponding to a segment when the collective communication of that segment ends;
until the weight updates of all of the plurality of segments are completed.
2. The method of claim 1, wherein before decomposing the weight gradient resulting from the forward and backward computations into a plurality of segments, the method further comprises:
converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
3. The method of claim 1, wherein each segment and its corresponding segment are the same segment distributed on different machines, and each machine has multiple collective-communication threads so that each segment performs collective communication with the corresponding segments on other machines in parallel.
4. The method of claim 3, wherein each segment corresponds to the half-precision weight gradient of that segment on the graphics processor;
after each segment performs collective communication with its corresponding segment, the method further comprises:
converting the half-precision weight gradient of the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
5. The method of claim 4, wherein performing the weight update corresponding to a segment when the collective communication of that segment ends comprises:
performing the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, to generate a new single-precision master weight parameter of the segment on the central processing unit.
6. A deep learning weight updating apparatus, characterized by comprising:
a decomposition module for decomposing the weight gradient generated by the forward and backward calculations into a plurality of segments;
a communication module for performing collective communication between each segment and its corresponding segment;
and an update module for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
7. The apparatus of claim 6, further comprising:
a conversion module for converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and a generating module for performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
8. The apparatus of claim 6, wherein each segment and its corresponding segment are the same segment distributed on different machines, and each machine has multiple collective-communication threads so that each segment performs collective communication with the corresponding segments on other machines in parallel.
9. The apparatus of claim 8, wherein each segment corresponds to the half-precision weight gradient of that segment on the graphics processor;
the conversion module is further configured to convert the half-precision weight gradient of the segment on the graphics processor into a single-precision weight gradient on the central processing unit.
10. The apparatus of claim 9, wherein the update module is specifically configured to perform the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, so as to generate a new single-precision master weight parameter of the segment on the central processing unit.
11. A server, comprising: a memory, a processor, and a communication component;
the memory for storing a computer program;
the processor, coupled with the memory and the communication component, to execute a computer program for performing the steps or operations of the method of any of claims 1-5.
12. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a computer, is capable of performing the steps or operations of any one of the methods of claims 1-5.
CN201910217885.3A (priority date 2019-03-21, filing date 2019-03-21): Deep learning weight updating method and device. Status: Pending. Publication: CN111722937A (en).

Priority Applications (1)

Application Number: CN201910217885.3A
Priority Date: 2019-03-21
Filing Date: 2019-03-21
Title: Deep learning weight updating method and device

Applications Claiming Priority (1)

Application Number: CN201910217885.3A
Priority Date: 2019-03-21
Filing Date: 2019-03-21
Title: Deep learning weight updating method and device

Publications (1)

Publication Number: CN111722937A
Publication Date: 2020-09-29

Family ID: 72562202

Family Applications (1)

Application Number: CN201910217885.3A
Title: Deep learning weight updating method and device
Status: Pending

Country Status (1)

Country: CN
Link: CN111722937A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106297774A (en) * 2015-05-29 2017-01-04 中国科学院声学研究所 The distributed parallel training method of a kind of neutral net acoustic model and system
CN105224502A (en) * 2015-09-28 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of degree of depth learning method based on GPU and system
CN107341541A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing full articulamentum neural metwork training
US20180121806A1 (en) * 2016-10-27 2018-05-03 International Business Machines Corporation Efficient parallel training of a network model on multiple graphics processing units
CN108805798A (en) * 2017-05-05 2018-11-13 英特尔公司 Fine granularity for deep learning frame calculates communication and executes
US20180349313A1 (en) * 2017-06-01 2018-12-06 Electronics And Telecommunications Research Institute Parameter server and method for sharing distributed deep learning parameter using the same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王裕民; 顾乃杰; 张孝慈: "多GPU环境下的卷积神经网络并行算法" (Parallel algorithms for convolutional neural networks in multi-GPU environments), 小型微型计算机系统 (Journal of Chinese Computer Systems), no. 03, 15 March 2017 (2017-03-15) *

Similar Documents

Publication Publication Date Title
CN109993299B (en) Data training method and device, storage medium and electronic device
CN110533183B (en) Task placement method for heterogeneous network perception in pipeline distributed deep learning
CN108460457A (en) A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks
CN109597965B (en) Data processing method, system, terminal and medium based on deep neural network
CN111488211A (en) Task processing method, device, equipment and medium based on deep learning framework
EP4242844A2 (en) Distributing tensor computations across computing devices
CN110796242A (en) Neural network model reasoning method and device, electronic equipment and readable medium
CN107729138B (en) Method and device for analyzing high-performance distributed vector space data
CN109885406B (en) Operator calculation optimization method, device, equipment and storage medium
CN114356578B (en) Parallel computing method, device, equipment and medium for natural language processing model
CN108304925B (en) Pooling computing device and method
CN114237869B (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN115880132A (en) Graphics processor, matrix multiplication task processing method, device and storage medium
WO2020015087A1 (en) Method and system for large-scale processing of images, computer device, and computer storage medium
CN115934275A (en) Task processing method and dialogue task processing method
CN115437760A (en) Computing resource allocation method, electronic device, storage medium, and program product
CN109753682B (en) Finite element stiffness matrix simulation method based on GPU (graphics processing Unit) end
CN113885871A (en) Special back-end code generation method and device for supporting machine learning training
CN111722937A (en) Deep learning weight updating method and device
WO2022267854A1 (en) Method, system and apparatus for processing quantum computing task, and operating system
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
CN115759260B (en) Reasoning method and device of deep learning model, electronic equipment and storage medium
CN107025099B (en) Asynchronous graph calculation implementation method and system based on double-queue model
Chen et al. Edge FPGA-based Onsite Neural Network Training
CN111208980B (en) Data analysis processing method and system

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination