CN111722937B - Deep learning weight updating method and device - Google Patents
- Publication number
- CN111722937B CN201910217885.3A
- Authority
- CN
- China
- Prior art keywords
- segment
- weight
- precision
- communication
- gradient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/545—Interprogram communication where tasks reside in different layers, e.g. user- and kernel-space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application provides a deep learning weight updating method and device. The weight gradients are divided into a plurality of segments, and the collective communication and weight update of the segments are pipelined directly: as soon as the collective communication of the current segment completes, the weights corresponding to that segment are updated immediately, while the next segment performs its collective communication in parallel on the communication link.
Description
Technical Field
The invention relates to the technical field of neural network deep learning, in particular to a deep learning weight updating method and device.
Background
Today, deep learning is a new field of machine learning research. Its motivation is to build and simulate neural networks that perform analytical learning in the manner of the human brain, mimicking the mechanisms by which the brain interprets data. Deep learning applications include speech recognition, image recognition, natural language processing and the like; their computational workload is enormous and requires extensive deep learning computation.
In order to accelerate deep learning applications and improve computing efficiency, a high-density computing mode in which a central processing unit (CPU) cooperates with multiple graphics processing units (GPUs) is generally adopted: the GPUs, with their powerful parallel computing capability, execute the time-consuming forward and backward computations, while the CPU completes the remaining work, such as parameter update computation, data reading and distribution, and neural network model updates, according to the characteristics of the deep learning algorithm.
In GPU-based deep learning, however, in one iteration of multi-machine multi-card (multi-GPU) training, each GPU performs forward and backward computation on different data and produces its own weight gradients. Before the weights can be updated, the weight gradients on each GPU must be exchanged with the corresponding weight gradients on the other GPUs so as to obtain the actual weight gradients produced by all of the data in this iteration. Collective communication (Allreduce) is a common reduction operation for weight gradients: it cyclically reduces and transmits the weight gradients on each GPU until the weight gradient value on every GPU equals the sum of the gradient values on all GPUs.
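For reference, a minimal sketch of such a gradient Allreduce, written against PyTorch's torch.distributed API (the function name and the averaging step are illustrative assumptions, not taken from the patent):

```python
import torch
import torch.distributed as dist

def allreduce_weight_gradients(model):
    """Sum each weight gradient across all GPUs, then divide by the number
    of workers so that every GPU holds the averaged gradient.
    Assumes dist.init_process_group(...) has been called and each process
    drives one GPU."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # After this call, param.grad on every GPU is the sum of the
        # gradients produced on all GPUs for this parameter.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
```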
The inter-GPU and inter-machine communication caused by this reduction operation therefore adds communication overhead to the path from weight gradient to weight update and slows down the weight update.
Disclosure of Invention
In order to solve these problems, the invention provides a deep learning weight updating method and device that can effectively hide the weight update computation time within the network communication time, so that the final completion time is approximately the communication time alone, thereby saving weight update time.
The embodiment of the invention provides a deep learning weight updating method, which comprises the following steps:
decomposing the weight gradients generated by the forward and backward computation into a plurality of segments;
performing collective communication between each segment and its corresponding segment;
performing the weight update corresponding to a segment as soon as the collective communication of that segment ends;
until the weight updates of all of the plurality of segments are completed.
Optionally, before decomposing the weight gradients generated by the forward and backward computation into a plurality of segments, the method further comprises:
converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and performing forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor to generate half-precision weight gradients on the graphics processor.
Optionally, each segment and its corresponding segments are the same segment distributed on different machines, and each machine has a plurality of collective communication threads so that each segment performs collective communication in parallel with the corresponding segments on the other machines.
Optionally, each segment is the half-precision weight gradient corresponding to that segment on the graphics processor;
after each segment performs collective communication with its corresponding segment, the method further includes:
converting the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
Optionally, performing the weight update corresponding to the segment at the end of the collective communication of the segment includes:
performing, on the central processing unit, the weight update operation corresponding to the segment according to the single-precision weight gradient corresponding to the segment on the central processing unit and the single-precision main weight parameter corresponding to the segment on the central processing unit, and generating a new single-precision main weight parameter of the segment on the central processing unit.
The application also provides a deep learning weight updating device, which comprises:
the decomposition module is used for decomposing the weight gradients generated by the forward and backward computation into a plurality of segments;
the communication module is used for performing collective communication between each segment and its corresponding segment;
the updating module is used for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
Optionally, the apparatus further comprises:
the conversion module is used for converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and the generation module is used for performing forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor and generating half-precision weight gradients on the graphics processor.
Optionally, each segment and its corresponding segments are the same segment distributed on different machines, and each machine has a plurality of collective communication threads so that each segment performs collective communication in parallel with the corresponding segments on the other machines.
Optionally, each segment is the half-precision weight gradient corresponding to that segment on the graphics processor;
the conversion module is further configured to convert the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
Optionally, the updating module is specifically configured to perform, on the central processing unit, the weight update operation corresponding to the segment according to the single-precision weight gradient corresponding to the segment on the central processing unit and the single-precision main weight parameter corresponding to the segment on the central processing unit, so as to generate a new single-precision main weight parameter of the segment on the central processing unit.
The present application provides a server, comprising: a memory, a processor, and a communication component;
The memory is used for storing a computer program;
The processor, coupled to the memory and the communication component, is configured to execute the computer program so as to perform the steps or operations of the deep learning weight updating method described above.
The present application further provides a computer-readable storage medium storing a computer program which, when executed by a computer, implements the steps or operations of the deep learning weight updating method described above.
According to the embodiments of the invention, the weight gradients are divided into a plurality of segments, and the collective communication and weight update of the segments are pipelined directly: as soon as the collective communication of the current segment completes, the weight update corresponding to that segment is carried out immediately, while the next segment performs its collective communication in parallel on the communication link.
Meanwhile, in order to save GPU memory, the invention stores the single-precision main weight parameters (master weights), which would otherwise reside in GPU memory, on the CPU, and the weight parameter update is also performed on the CPU. This can reduce parameter memory consumption on the GPU by two thirds and greatly improves GPU memory utilization.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the prior art descriptions, and it is obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a prior art deep learning weight gradient update method;
FIG. 2 is a flowchart illustrating a method for updating deep learning weights according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a weight updating method with fragments as granularity according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of weight updating using CPU-resident master weight parameters and a CPU-side update according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a deep learning weight updating device according to an embodiment of the invention;
FIG. 6 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "plurality" generally includes at least two, but does not exclude the case of at least one.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a product or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such product or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a commodity or system comprising such elements.
Fig. 1 is a flow chart of a deep learning weight gradient updating method in the prior art. As shown in fig. 1, the method comprises the following steps:
1. The weight gradients (weight gradients) generated by the forward and backward computation completed on one GPU are combined, through cyclic reduction communication, with the corresponding weight gradients generated by the forward and backward computation completed on the other GPUs, producing the average weight gradients (average weight gradients);
2. The average weight gradients are used to update (update) the weights, producing the updated weights (updated weights).
The time to finally complete the weight update is therefore the sum of the collective communication time and the update computation time of the weight gradients, which greatly slows down the weight update. Moreover, during the entire collective communication (Allreduce) the computing resources are idle and the communication links are not fully occupied, so computing resources are largely wasted.
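To make this cost concrete, the completion time of the prior-art flow can be modelled, under the simplifying assumption that communication and update computation do not overlap at all, as roughly

    T_prior ≈ T_comm + T_update,

where T_comm is the total Allreduce time of the full weight gradient and T_update is the total weight update computation time. If the gradient is instead split into n segments whose communication and updates are pipelined, as described below, the update of each segment can be hidden behind the communication of the following segments, and the completion time drops to approximately

    T_pipelined ≈ T_comm + T_update / n,

which approaches the communication time alone as the number of segments grows. This is only an idealized model (the symbols T_comm, T_update and n are introduced here for illustration); actual timings depend on per-segment overheads and on whether each segment's update really finishes before the remaining communication does.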
In addition, in conventional mixed-precision deep learning training on GPUs, single-precision (F32) main weight parameters (master weights) are maintained on the GPU. First, these main weight parameters need to be converted into half-precision (F16) weight parameters (weights) on the GPU.
Forward (Forward) and Backward (Backward) operations of deep learning are then performed with the half-precision (F16) weight parameters (weights) on the GPU, producing half-precision (F16) weight gradients (weights gradient) on the GPU.
The half-precision weight gradients on the GPU are then converted into single-precision (F32) weight gradients on the GPU.
Then, the weight update (update) calculation on the GPU is performed using the single precision (F32) weight gradient on the GPU and the single precision (F32) main weight parameter on the GPU as inputs, and an updated single precision (F32) main weight parameter (updated MASTER WEIGHTS) on the GPU is generated.
Then, the next iteration calculation is performed (repeating the above steps) with the updated single precision (F32) main weight parameters on the GPU until all iterations are completed.
It can be seen that the main weight parameters (master weights) and the updated main weight parameters (updated master weights) share the same memory space. Throughout the flow, both the main weight parameters (master weights) used in the update stage and the weight gradients (weights gradient) involved in the forward and backward (Forward + Backward) computation are stored in GPU memory, the main weight parameters in single precision and the weight gradients in half precision. The single-precision main weight parameters are twice as large as the half-precision weight gradients, so existing mixed-precision deep learning training on the GPU incurs this doubled parameter memory consumption.
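A compact sketch of this conventional all-on-GPU mixed-precision iteration may help; it is written in PyTorch-style Python, and the names master_weights_fp32 and compute_grads_fp16 are illustrative assumptions rather than terms from the patent:

```python
import torch

def gpu_only_mixed_precision_step(master_weights_fp32, compute_grads_fp16, lr=0.01):
    """One prior-art iteration: everything, including the single-precision
    master weights, lives in GPU memory."""
    # 1. Convert the F32 master weights into F16 working weights on the GPU.
    weights_fp16 = [w.half() for w in master_weights_fp32]
    # 2. Forward + backward with the F16 weights -> F16 weight gradients.
    grads_fp16 = compute_grads_fp16(weights_fp16)
    # 3. Convert the F16 gradients back to F32, still on the GPU.
    grads_fp32 = [g.float() for g in grads_fp16]
    # 4. Update the F32 master weights in place on the GPU.
    with torch.no_grad():
        for w, g in zip(master_weights_fp32, grads_fp32):
            w -= lr * g
    return master_weights_fp32
```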
In order to optimize this inefficient weight update communication and computation flow of the prior art, the invention divides the weight gradients into a plurality of segments and directly pipelines the collective communication and weight update of the segments: once the collective communication of the current segment is completed, the weight update corresponding to that segment is carried out immediately, while the other segments perform their collective communication in parallel on the communication link.
Meanwhile, in order to save GPU memory, the main weight parameters (master weights) are stored on the CPU, and the weight parameter update is also performed on the CPU, which saves two thirds of the parameter memory on the GPU and greatly improves GPU memory utilization.
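The two-thirds figure can be understood with a simple per-parameter accounting (an assumption made here for illustration rather than an explicit derivation in the patent): keeping the single-precision master copy on the GPU costs 4 bytes per parameter and the half-precision working copy costs 2 bytes, i.e. 6 bytes of parameter memory in total; moving the 4-byte master copy, together with its update, to the CPU leaves only the 2-byte half-precision copy on the GPU, a saving of 4/6 = 2/3 of the parameter memory. Gradient and optimizer-state memory are not counted in this rough estimate.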
The application scenario of the present invention involves a plurality of machines (multiple graphics processors) used to accelerate the training of a deep learning neural network model, where each machine has a plurality of collective communication threads for performing collective communication with the other machines in parallel. The implementation of the invention is described in detail below through specific embodiments.
Fig. 2 is a flowchart of a deep learning weight updating method according to an embodiment of the present invention. As shown in fig. 2, the method includes:
201. Decomposing the weight gradients generated by the forward and backward computation into a plurality of segments;
In an alternative embodiment, before the step of decomposing the weight gradients generated by the forward and backward computation into a plurality of segments, the method further comprises:
converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and performing forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor to generate half-precision weight gradients on the graphics processor.
The generated half-precision weight gradients on the graphics processor are then decomposed into a plurality of segments. It should be noted that each segment is the half-precision weight gradient corresponding to that segment on the graphics processor.
202. Performing collective communication between each segment and its corresponding segment;
Each segment and the corresponding segment are the same segment and distributed on different machines.
In an alternative embodiment, after the weight gradients are decomposed into a plurality of segments, each segment carries its own identifier. Suppose the gradients are divided into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4 and segment 5; the sequence number of each segment can be used as its identifier.
In multi-machine (multi-GPU) deep learning training, take training on 3 machines (machine 1, machine 2 and machine 3) as an example: each of machine 1, machine 2 and machine 3 holds all 6 segments, and each machine has a plurality of collective communication threads for performing collective communication with the other machines in parallel. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines; for example, segment 0 on machine 1 performs collective communication with segment 0 on machine 2, and segment 0 on machine 2 performs collective communication with segment 0 on machine 3.
In an alternative embodiment, after each segment performs collective communication with its corresponding segment, the method further includes:
converting the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
203. Performing the weight update corresponding to a segment when the collective communication of that segment ends;
In an alternative embodiment, the step of performing the weight update corresponding to the segment at the end of the collective communication of the segment includes:
performing, on the central processing unit, the weight update operation corresponding to the segment according to the single-precision weight gradient corresponding to the segment on the central processing unit and the single-precision main weight parameter corresponding to the segment on the central processing unit, and generating a new single-precision main weight parameter of the segment on the central processing unit.
Meanwhile, the other segments perform their collective communication in parallel.
204. Repeating until the weight updates of all of the segments are completed.
The updated main weight parameters are then used for the next iteration, i.e. steps 201-204 are repeated, until all weight update iterations are completed.
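A minimal sketch of steps 201-204, using PyTorch's asynchronous all_reduce so that the collective communication of later segments overlaps the CPU-side update of earlier ones. The segment count, learning rate and helper names below are assumptions for illustration; the patent does not prescribe a particular library or optimizer:

```python
import torch
import torch.distributed as dist

def pipelined_segment_update(grads_fp16_gpu, master_weights_fp32_cpu,
                             num_segments=6, lr=0.01):
    """grads_fp16_gpu: flat F16 weight-gradient tensor on the GPU.
    master_weights_fp32_cpu: flat F32 master-weight tensor on the CPU,
    with the same number of elements as the gradient tensor.
    Assumes dist.init_process_group(...) has already been called."""
    world_size = dist.get_world_size()

    # 201: decompose the weight gradients into segments (views, no copy).
    grad_segments = list(torch.chunk(grads_fp16_gpu, num_segments))
    weight_segments = list(torch.chunk(master_weights_fp32_cpu, num_segments))

    # 202: launch collective communication for every segment; each call
    # returns immediately with a work handle, so the segments travel over
    # the communication link while the CPU keeps working.
    handles = [dist.all_reduce(seg, op=dist.ReduceOp.SUM, async_op=True)
               for seg in grad_segments]

    # 203/204: as soon as one segment's communication ends, convert it to
    # F32 on the CPU and update the corresponding master-weight slice,
    # while the remaining segments are still communicating.
    for grad_seg, w_seg, handle in zip(grad_segments, weight_segments, handles):
        handle.wait()                           # this segment's Allreduce is done
        grad_fp32_cpu = grad_seg.float().cpu() / world_size
        with torch.no_grad():
            w_seg -= lr * grad_fp32_cpu         # per-segment weight update on the CPU
```

Because torch.chunk returns views of the flat master-weight tensor, the in-place update of each slice updates the master weights directly. With a single default process group the segment all-reduces are queued in launch order, which is already enough to let the CPU update of one segment overlap the communication of the following segments; the patent's per-segment collective communication threads could be mimicked with one process group (communicator) per segment.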
FIG. 3 is a schematic diagram of the weight updating method with segments as the granularity according to an embodiment of the invention, and FIG. 4 is a schematic diagram of weight updating using CPU-resident master weight parameters and a CPU-side update according to an embodiment of the present invention; together they illustrate the weight update of one layer in multi-machine multi-card training.
First, as shown in fig. 4, a single-precision (F32) copy of the main weight parameters (master weights) is maintained on the CPU. The main weight parameters are converted into half-precision (F16) weight parameters (weights) on the GPU, and forward (Forward) and backward (Backward) operations of deep learning are performed with these half-precision weight parameters, producing half-precision (F16) weight gradients (weights gradient) on the GPU.
Then, as shown in fig. 3, the half-precision weight gradients on the GPU are decomposed into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4 and segment 5.
Each segment is a half-precision weight gradient corresponding to the segment on the GPU.
Each machine holds the 6 segments, and each machine has a plurality of collective communication threads so that each segment communicates collectively, in parallel, with the same segment on the other machines; each segment is thus handled concurrently by an independent thread. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines. For example, segment 0 on machine 1 performs collective communication with segment 0 on machine 2, and segment 0 on machine 2 performs collective communication with segment 0 on machine 3; meanwhile, segment 1 on machine 1 communicates with segment 1 on machine 2, and segment 1 on machine 2 with segment 1 on machine 3; segment 2 on machine 1 with segment 2 on machine 2, and segment 2 on machine 2 with segment 2 on machine 3; segment 3 on machine 1 with segment 3 on machine 2, and segment 3 on machine 2 with segment 3 on machine 3; segment 4 on machine 1 with segment 4 on machine 2, and segment 4 on machine 2 with segment 4 on machine 3; and segment 5 on machine 1 with segment 5 on machine 2, and segment 5 on machine 2 with segment 5 on machine 3.
It should be noted that the present invention decomposes the weight gradients into a plurality of segments; it is not limited to the 6 segments shown in fig. 3, and n segments may be launched to perform collective communication concurrently.
After the collective communication of one segment is completed, the half-precision weight gradient corresponding to that segment on the GPU is converted into a single-precision weight gradient on the CPU.
As shown in fig. 3, after segment 0 completes its collective communication, the half-precision weight gradient corresponding to segment 0 on the GPU is converted into a single-precision weight gradient on the CPU; the single-precision weight gradient of segment 0 on the CPU and the single-precision main weight parameters maintained on the CPU are then taken as inputs to perform the weight update operation (update) of segment 0 on the CPU, generating the new single-precision main weight parameters (updated master weights) of segment 0 on the CPU.
Meanwhile, the other segments (segment 1, segment 2, segment 3, segment 4 and segment 5) perform collective communication in parallel with the corresponding segments on the other machines, so the weight update computation time is hidden within the network communication time; the final completion time is approximately the communication time alone, which saves weight update time.
The updated main weight parameters are then used in the next weight update iteration, until all weight update iterations are completed.
According to the embodiments of the invention, the weight gradients are divided into a plurality of segments, and the collective communication and weight update of the segments are pipelined directly: as soon as the collective communication of the current segment completes, the weight update corresponding to that segment is carried out immediately, while the other segments perform their collective communication in parallel on the communication link.
Meanwhile, in order to save GPU memory, the invention stores the single-precision main weight parameters (master weights), which would otherwise reside in GPU memory, on the CPU, and the weight parameter update is also performed on the CPU. This can reduce parameter memory consumption on the GPU by two thirds and greatly improves GPU memory utilization.
Fig. 5 is a schematic structural diagram of a deep learning weight updating device according to an embodiment of the present invention, as shown in fig. 5, including:
the decomposition module is used for decomposing the weight gradients generated by the forward and backward computation into a plurality of segments;
the communication module is used for performing collective communication between each segment and its corresponding segment;
the updating module is used for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
Optionally, the apparatus further comprises:
the conversion module is used for converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and the generation module is used for performing forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor and generating half-precision weight gradients on the graphics processor.
Each segment is the half-precision weight gradient corresponding to that segment on the graphics processor;
optionally, the conversion module is further configured to convert the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
Optionally, the updating module is specifically configured to perform, on the central processing unit, the weight update operation corresponding to the segment according to the single-precision weight gradient corresponding to the segment on the central processing unit and the single-precision main weight parameters on the central processing unit, so as to generate new single-precision main weight parameters on the central processing unit.
The apparatus shown in this embodiment may perform the method embodiment shown in fig. 2, and its implementation principle and technical effects are not repeated.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention, as shown in fig. 6, including:
Memory, central processing unit, graphic processor and communication assembly;
The memory is used for storing a computer program;
the central processor and the graphics processor are coupled to the memory and the communication component, respectively, for executing computer programs.
The central processing unit is used for converting the single-precision main weight parameters into half-precision weight parameters and transmitting the half-precision weight parameters to the graphics processor through the communication component;
the graphics processor is used for performing forward and backward operations of deep learning according to the half-precision weight parameters to generate half-precision weight gradients;
the graphics processor is further configured to decompose the half-precision weight gradients into a plurality of segments;
the graphics processor is further configured to perform, through the communication component, collective communication between each segment and its corresponding segment on the other machines.
Each segment and the corresponding segment are the same segment and distributed on different machines.
In an alternative embodiment, after the weight gradients are decomposed into a plurality of segments, each segment carries its own identifier. Suppose the gradients are divided into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4 and segment 5; the sequence number of each segment can be used as its identifier.
In multi-machine (multi-GPU) deep learning training, take training on 3 machines (machine 1, machine 2 and machine 3) as an example: each of machine 1, machine 2 and machine 3 holds all 6 segments, and each machine has a plurality of collective communication threads for performing collective communication with the other machines in parallel. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines; for example, segment 0 on machine 1 performs collective communication with segment 0 on machine 2, and segment 0 on machine 2 performs collective communication with segment 0 on machine 3.
The graphics processor is further configured to convert the half-precision weight gradient corresponding to each segment into a single-precision weight gradient and transmit the single-precision weight gradient to the central processing unit through the communication component.
Correspondingly, the central processing unit is further configured to perform the corresponding weight update operation on each segment according to the single-precision weight gradient corresponding to the segment and the single-precision main weight parameter corresponding to the segment, so as to generate a new single-precision main weight parameter of the segment.
The other segments perform their collective communication in parallel, and the updated main weight parameters are used for the next iteration until all weight update iterations are completed.
Further, as shown in fig. 6, the server further includes: a display, a power supply component, an audio component, and the like. Only some of the components are schematically shown in fig. 6, which does not mean that the server only comprises the components shown in fig. 6.
The server in this embodiment may execute the method embodiment shown in fig. 2, and its implementation principle and technical effects are not repeated.
Accordingly, an embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program when executed by a computer can implement steps or operations related to a server in the embodiment of the method shown in fig. 2, which are not described herein.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A deep learning weight updating method, comprising:
Performing forward and backward operations of deep learning on weight parameters on a graphics processor to generate weight gradients on the graphics processor;
Decomposing the weight gradient into a plurality of segments; each segment and the corresponding segment are the same segment and are distributed on different machines, and each machine is provided with a plurality of collective communication threads for carrying out collective communication on each segment and the corresponding segment on other machines in parallel;
Each segment is in collective communication with its corresponding segment;
performing the weight update corresponding to a segment when the collective communication of that segment ends;
until the weight updates of all of the plurality of segments are completed.
2. The method according to claim 1, wherein the method further comprises:
converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
wherein the performing forward and backward operations of deep learning on weight parameters on a graphics processor to generate weight gradients on the graphics processor comprises:
performing forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor to generate half-precision weight gradients on the graphics processor.
3. The method of claim 2, wherein each segment corresponds to a half-precision weight gradient on the graphics processor for the segment;
after each segment performs collective communication with its corresponding segment, the method further includes:
converting the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
4. A method according to claim 3, wherein performing the weight update for the segment at the end of the collective communication of the segment comprises:
performing, on the central processing unit, the weight update operation corresponding to the segment according to the single-precision weight gradient corresponding to the segment on the central processing unit and the single-precision main weight parameter corresponding to the segment on the central processing unit, and generating a new single-precision main weight parameter of the segment on the central processing unit.
5. A deep learning weight updating apparatus, characterized by comprising:
the generation module is used for performing forward and backward operations of deep learning on weight parameters on the graphics processor and generating weight gradients on the graphics processor;
A decomposition module for decomposing the weight gradient into a plurality of segments; each segment and the corresponding segment are the same segment and are distributed on different machines, and each machine is provided with a plurality of collective communication threads for carrying out collective communication on each segment and the corresponding segment on other machines in parallel;
the communication module is used for performing collective communication between each segment and its corresponding segment;
the updating module is used for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
6. The apparatus as recited in claim 5, further comprising:
the conversion module is used for converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
the generation module is specifically configured to perform forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor to generate half-precision weight gradients on the graphics processor.
7. The apparatus of claim 6, wherein each segment corresponds to a half-precision weight gradient on the graphics processor for the segment;
the conversion module is further configured to convert the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
8. The apparatus of claim 7, wherein the updating module is specifically configured to perform a weight updating operation corresponding to the segment on the central processor according to a single-precision weight gradient corresponding to the segment on the central processor and a single-precision main weight parameter corresponding to the segment on the central processor, and generate a new single-precision main weight parameter of the segment on the central processor.
9. A server, comprising: a memory, a processor, and a communication component;
The memory is used for storing a computer program;
The processor, coupled with the memory and the communication component, is configured to execute a computer program for performing the steps or operations of the method of any of claims 1-4.
10. A computer readable storage medium storing a computer program, wherein the computer program is capable of implementing the steps or operations of the method according to any one of claims 1-4 when executed by a computer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910217885.3A CN111722937B (en) | 2019-03-21 | 2019-03-21 | Deep learning weight updating method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910217885.3A CN111722937B (en) | 2019-03-21 | 2019-03-21 | Deep learning weight updating method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111722937A CN111722937A (en) | 2020-09-29 |
CN111722937B true CN111722937B (en) | 2024-05-10 |
Family
ID=72562202
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910217885.3A Active CN111722937B (en) | 2019-03-21 | 2019-03-21 | Deep learning weight updating method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111722937B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105224502A (en) * | 2015-09-28 | 2016-01-06 | 浪潮(北京)电子信息产业有限公司 | A GPU-based deep learning method and system |
CN106297774A (en) * | 2015-05-29 | 2017-01-04 | 中国科学院声学研究所 | Distributed parallel training method and system for a neural network acoustic model |
CN107341541A (en) * | 2016-04-29 | 2017-11-10 | 北京中科寒武纪科技有限公司 | Apparatus and method for performing fully connected layer neural network training |
CN108805798A (en) * | 2017-05-05 | 2018-11-13 | 英特尔公司 | Fine-grained compute communication execution for deep learning frameworks |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10949746B2 (en) * | 2016-10-27 | 2021-03-16 | International Business Machines Corporation | Efficient parallel training of a network model on multiple graphics processing units |
KR102197247B1 (en) * | 2017-06-01 | 2020-12-31 | 한국전자통신연구원 | Parameter server and method for sharing distributed deep learning parameter using the same |
- 2019-03-21 CN CN201910217885.3A patent/CN111722937B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106297774A (en) * | 2015-05-29 | 2017-01-04 | 中国科学院声学研究所 | Distributed parallel training method and system for a neural network acoustic model |
CN105224502A (en) * | 2015-09-28 | 2016-01-06 | 浪潮(北京)电子信息产业有限公司 | A GPU-based deep learning method and system |
CN107341541A (en) * | 2016-04-29 | 2017-11-10 | 北京中科寒武纪科技有限公司 | Apparatus and method for performing fully connected layer neural network training |
CN108805798A (en) * | 2017-05-05 | 2018-11-13 | 英特尔公司 | Fine-grained compute communication execution for deep learning frameworks |
Non-Patent Citations (1)
Title |
---|
Parallel algorithms for convolutional neural networks in a multi-GPU environment; Wang Yumin; Gu Naijie; Zhang Xiaoci; Journal of Chinese Computer Systems (小型微型计算机系统); 2017-03-15 (03); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111722937A (en) | 2020-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110533183B (en) | Task placement method for heterogeneous network perception in pipeline distributed deep learning | |
CN109993299A (en) | Data training method and device, storage medium, electronic device | |
CN108460457A (en) | A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks | |
EP4242844A2 (en) | Distributing tensor computations across computing devices | |
CN114356578B (en) | Parallel computing method, device, equipment and medium for natural language processing model | |
CN114418129B (en) | Deep learning model training method and related device | |
CN105373517A (en) | Spark-based distributed matrix inversion parallel operation method | |
WO2022267854A1 (en) | Method, system and apparatus for processing quantum computing task, and operating system | |
Verma et al. | A new accelerated proximal gradient technique for regularized multitask learning framework | |
CN109657794B (en) | Instruction queue-based distributed deep neural network performance modeling method | |
US20220172044A1 (en) | Method, electronic device, and computer program product for deploying machine learning model | |
CN113885871A (en) | Special back-end code generation method and device for supporting machine learning training | |
CN111722937B (en) | Deep learning weight updating method and device | |
Mei et al. | ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation | |
Zeng et al. | Training acceleration for deep neural networks: A hybrid parallelization strategy | |
CN111160535A (en) | DGCNN model acceleration method based on Hadoop | |
CN115292044A (en) | Data processing method and device, electronic equipment and storage medium | |
CN110021339A (en) | Cluster parallel computing accelerated method based on protein folding measuring and calculating protein structure | |
Sreedhar et al. | Efficient training of convolutional neural nets on large distributed systems | |
Luo et al. | RTP: Rethinking Tensor Parallelism with Memory Deduplication | |
CN111291893A (en) | Scheduling method, scheduling system, storage medium, and electronic apparatus | |
Chen et al. | Edge FPGA-based onsite neural network training | |
CN115759260B (en) | Reasoning method and device of deep learning model, electronic equipment and storage medium | |
Huang et al. | Adaptive partitioning and efficient scheduling for distributed DNN training in heterogeneous IoT environment | |
Kumar et al. | Efficient training of convolutional neural nets on large distributed systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |