CN111722937A - Deep learning weight updating method and device - Google Patents


Info

Publication number: CN111722937A
Application number: CN201910217885.3A
Authority: CN (China)
Prior art keywords: segment, weight, precision, processing unit, central processing
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 林立翔, 龚志刚
Current Assignee: Alibaba Group Holding Ltd
Original Assignee: Alibaba Group Holding Ltd
Application filed by: Alibaba Group Holding Ltd
Priority date / Filing date: 2019-03-21 (priority to CN201910217885.3A)
Publication date: 2020-09-29 (publication of CN111722937A)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/545 Interprogram communication where tasks reside in different layers, e.g. user- and kernel-space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention provides a method and a device for updating deep learning weights. The weight gradient is divided into a plurality of segments, and the collective communication and the weight update of each segment are directly pipelined: once the collective communication of the current segment is completed, the weight update corresponding to that segment is performed immediately, while the next segment performs its collective communication in parallel on the communication link.

Description

Deep learning weight updating method and device
Technical Field
The invention relates to the technical field of neural network deep learning, and in particular to a deep learning weight updating method and device.
Background
Deep learning is currently a new field in machine learning research. Its motivation is to establish and simulate a neural network that analyzes and learns like the human brain, interpreting data by imitating the mechanisms of the brain. Deep learning applications include speech recognition, image recognition, natural language processing, and so on; the amount of computation in these applications is enormous and requires large-scale deep learning computation.
To shorten the processing time of deep learning applications and improve computing efficiency, a high-density computing mode in which a Central Processing Unit (CPU) cooperates with multiple Graphics Processing Units (GPUs) is usually adopted: the GPUs, with their powerful parallel computing capability, execute the time-consuming forward and backward computation, while the remaining work dictated by the algorithmic characteristics of the deep learning application, namely parameter update computation, data reading and distribution, and neural network model update computation, is completed by the CPU.
However, in GPU-based deep learning, within one iteration of multi-machine, multi-card (multi-GPU) training, each GPU performs forward and backward computation on different data at the same time and generates its own weight gradients. Before the weight update can be performed, the weight gradient on each GPU must therefore be exchanged with the corresponding weight gradients on the other GPUs to obtain the actual weight gradient produced by all of the data in this iteration. Collective communication (Allreduce) is a commonly used weight-gradient reduction operation that cyclically transmits the weight gradients among the GPUs until the weight gradient value on each GPU is the sum of the gradient values on all GPUs.
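For illustration, the Allreduce operation described above can be sketched as follows. This is a minimal sketch assuming a PyTorch torch.distributed process group in which each rank drives one GPU; the helper name average_gradients and the division by the world size are illustrative choices, not details given in the patent.

```python
import torch
import torch.distributed as dist

def average_gradients(model):
    """Reduce every weight gradient across all GPUs so that each GPU ends up
    with the same (summed, then averaged) gradient for this iteration."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # After all_reduce, every rank holds the sum of the gradients
            # produced by all GPUs for this parameter.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```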
Therefore, the communication between GPUs and between machines introduced by this reduction operation adds communication time overhead to the process of applying the weight gradients to the weights, which slows down the weight update.
Disclosure of Invention
To solve the above problems, the present invention provides a method and an apparatus for updating deep learning weights that can effectively hide the weight update computation time within the network communication time, so that the final completion time is approximately the communication time alone, thereby saving weight update time.
An embodiment of the invention provides a deep learning weight updating method, comprising the following steps:
decomposing the weight gradients generated by the forward and backward calculations into a plurality of segments;
performing collective communication between each segment and its corresponding segment;
performing the weight update corresponding to a segment when the collective communication of that segment ends;
until the weight updates of all of the plurality of segments are completed.
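The four steps above can be read as the following minimal sketch, assuming PyTorch with asynchronous torch.distributed collectives; the segment count of 6, the flat one-dimensional layout of the weights and gradients, and the plain SGD update rule are assumptions made for illustration only.

```python
import torch
import torch.distributed as dist

def pipelined_segment_update(flat_weight, flat_grad, lr=0.01, num_segments=6):
    """flat_weight and flat_grad are assumed to be contiguous 1-D tensors."""
    w_segments = flat_weight.chunk(num_segments)   # views into flat_weight
    g_segments = flat_grad.chunk(num_segments)

    # Launch collective communication (Allreduce) for every segment;
    # async_op=True returns a handle instead of blocking.
    handles = [dist.all_reduce(g, op=dist.ReduceOp.SUM, async_op=True)
               for g in g_segments]

    # As soon as a segment's communication finishes, update that segment's
    # weights while the remaining segments keep communicating on the link.
    world_size = dist.get_world_size()
    for w_seg, g_seg, handle in zip(w_segments, g_segments, handles):
        handle.wait()
        w_seg -= lr * (g_seg / world_size)
```

The point of this arrangement is that handle.wait() only blocks on one segment at a time, so the update of one segment overlaps with the communication of the segments that follow it.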
Optionally, before decomposing the weight gradient generated by the forward and backward calculation into a plurality of segments, the method further comprises:
converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
Optionally, each segment and its corresponding segment are the same segment distributed on different machines, and each machine has multiple collective-communication threads, so that each segment performs collective communication with the corresponding segment on another machine in parallel.
Optionally, each segment corresponds to the half-precision weight gradient of that segment on the graphics processor;
after each segment performs collective communication with its corresponding segment, the method further comprises:
converting the half-precision weight gradient of the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
Optionally, performing the weight update corresponding to a segment when the collective communication of that segment ends comprises:
performing the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, to generate a new single-precision master weight parameter of the segment on the central processing unit.
The present application further provides a deep learning weight updating apparatus, comprising:
a decomposition module for decomposing the weight gradient generated by the forward and backward calculations into a plurality of segments;
a communication module for performing collective communication between each segment and its corresponding segment;
and an update module for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
Optionally, the apparatus further comprises:
a conversion module for converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and a generating module for performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
Optionally, each segment and its corresponding segment are the same segment distributed on different machines, and each machine has multiple collective-communication threads, so that each segment performs collective communication with the corresponding segment on another machine in parallel.
Optionally, each segment corresponds to the half-precision weight gradient of that segment on the graphics processor;
the conversion module is further configured to convert the half-precision weight gradient of the segment on the graphics processor into a single-precision weight gradient on the central processing unit.
Optionally, the update module is specifically configured to perform the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, so as to generate a new single-precision master weight parameter of the segment on the central processing unit.
The present application provides a server comprising: a memory, a processor, and a communication component;
the memory for storing a computer program;
the processor, coupled to the memory and the communication component, is configured to execute a computer program for performing the steps or operations of the above-mentioned deep learning weight updating method.
The present application provides a computer-readable storage medium storing a computer program, which when executed by a computer, can implement the steps or operations of the weight updating method for deep learning described above.
In the invention, the weight gradient is divided into a plurality of segments, and the collective communication and the weight update of each segment are directly pipelined: once the collective communication of the current segment is completed, the weight update corresponding to that segment is performed immediately, while the next segment performs its collective communication in parallel on the communication link.
Meanwhile, to save GPU memory, the invention stores the single-precision master weight parameters (master weights), which conventionally reside in GPU memory, on the CPU instead, and the update (update) operation on the weight parameters is also carried out on the CPU. This reduces the parameter memory consumption on the GPU by 2/3 and greatly improves GPU memory utilization.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
Fig. 1 is a flowchart of a prior-art deep learning weight gradient update method;
Fig. 2 is a flowchart of a deep learning weight updating method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the principle of a segment-granularity weight update method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the principle of updating weights using CPU-side master weight parameters and a CPU-side weight update according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a deep learning weight updating apparatus according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and "a" and "an" generally include at least two, but do not exclude at least one, unless the context clearly dictates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a product or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such product or system. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in a product or system that includes the element.
Fig. 1 is a flowchart of a prior-art method for updating deep learning weights from weight gradients. As shown in Fig. 1:
1. The weight gradients (weight gradients) generated by a GPU after it completes its forward and backward computation undergo cyclic (ring) reduction communication with the corresponding weight gradients generated by the other GPUs after they complete their forward and backward computation, producing average weight gradients (average weight gradients);
2. The average weight gradients are used in an update (update) calculation on the weights, resulting in updated weights (updated weights).
Therefore, each weight update calculation can only be performed after the collective communication of the corresponding weight gradient has completed, so the time to finally complete the weight update is the sum of the collective communication time of the weight gradients and the update calculation time, which greatly reduces the speed of weight updating. Moreover, during the entire collective communication (Allreduce), the computing resources are idle while the communication link is not fully occupied, so computing resources are largely wasted.
In conventional mixed-precision deep learning training on a GPU, a single set of single-precision (F32) master weight parameters (master weights) is maintained on the GPU. First, the master weight parameters are converted into half-precision (F16) weight parameters (weights) on the GPU.
Then, the forward (Forward) and backward (Backward) operations of deep learning are performed with the half-precision (F16) weight parameters (weights) on the GPU, producing a half-precision (F16) weight gradient (weight gradient) on the GPU.
The half-precision weight gradient on the GPU is then converted into a single-precision (F32) weight gradient on the GPU.
Thereafter, a weight update (update) calculation is performed on the GPU with the single-precision (F32) weight gradient on the GPU and the single-precision (F32) master weight parameters on the GPU as inputs, generating updated single-precision (F32) master weight parameters (updated master weights) on the GPU.
Then, the next iteration is computed with the updated single-precision (F32) master weight parameters on the GPU (the above steps are repeated) until all iterations are finished.
Thus the master weight parameters (master weights) and the updated master weight parameters (updated master weights) share the same memory space. In the whole process, the master weight parameters (master weights) used in the update stage and the weight gradients (weight gradients) involved in the forward and backward (Forward + Backward) calculation are both stored in GPU memory, where the master weight parameters are single-precision and the weight gradients are half-precision. The memory footprint of the single-precision master weight parameters (master weights) is twice that of the half-precision weight gradient (weight gradient), so conventional mixed-precision deep learning training on the GPU adds twice that amount of parameter memory consumption.
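For reference, this conventional flow can be sketched as follows, assuming PyTorch; forward_backward, lr, num_iterations, and the tensor shape are placeholders, and both precision copies deliberately remain on the GPU, which is where the extra parameter memory cost arises.

```python
import torch

def forward_backward(weights_f16):
    """Placeholder for the model's forward and backward pass; returns an F16
    weight gradient of the same shape (here just a random stand-in)."""
    return torch.randn_like(weights_f16)

# Single-precision (F32) master weights kept in GPU memory (conventional scheme).
master_weights = torch.randn(1024, 1024, device='cuda', dtype=torch.float32)

lr = 0.01            # assumed learning rate
num_iterations = 10  # assumed iteration count

for _ in range(num_iterations):
    weights_f16 = master_weights.half()        # F32 -> F16, still on the GPU
    grad_f16 = forward_backward(weights_f16)   # forward + backward (placeholder)
    grad_f32 = grad_f16.float()                # F16 -> F32, still on the GPU
    master_weights -= lr * grad_f32            # update also runs on the GPU
```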
To optimize this inefficient weight update communication and calculation process in the prior art, the invention divides the weight gradient into a plurality of segments and directly pipelines the collective communication and the weight update of each segment: once the collective communication of the current segment is completed, the weight update corresponding to that segment is performed immediately, while the other segments perform their collective communication in parallel on the communication link.
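As a rough timing model for this pipelining, stated here as an editorial assumption rather than a bound given in the patent, let T_comm be the total collective-communication time, T_update the total update-calculation time, and n the number of segments:

```latex
% Sequential scheme (Fig. 1): every update waits for the whole communication.
T_{\mathrm{sequential}} = T_{\mathrm{comm}} + T_{\mathrm{update}}

% Per-segment pipelining over n segments: only the update of the last segment
% remains exposed outside the communication.
T_{\mathrm{pipelined}} \approx T_{\mathrm{comm}} + \frac{T_{\mathrm{update}}}{n}
```

Under this assumption, only the update of the last segment is left outside the communication window, which matches the statement that the final completion time is approximately the communication time alone.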
Meanwhile, to save GPU memory, the invention stores the master weight parameters (master weights) on the CPU, and the update (update) operation on the weight parameters is also carried out on the CPU, saving two-thirds of the parameter memory on the GPU and greatly improving GPU memory utilization.
The application scenario of the invention involves multiple machines (multiple graphics processors) used to accelerate the training of a deep learning neural network model, where each machine has multiple collective-communication threads for performing collective communication with the other machines in parallel. The implementation of the invention is described in detail below through specific embodiments.
Fig. 2 is a flowchart of a deep learning weight updating method according to an embodiment of the present invention. As shown in Fig. 2, the method includes:
201. Decomposing the weight gradients generated by the forward and backward calculations into a plurality of segments.
In an optional embodiment, before decomposing the weight gradient generated by the forward and backward calculation into a plurality of segments, the method further includes:
converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
The generated half-precision weight gradient on the graphics processor is then decomposed into a plurality of segments, where each segment corresponds to the half-precision weight gradient of that segment on the graphics processor.
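A minimal sketch of this preparation step, assuming PyTorch: the single-precision master weights live in host (CPU) memory, only the half-precision copy and the half-precision gradient occupy GPU memory, and forward_backward, the tensor shape, and the segment count of 6 are placeholders.

```python
import torch

def forward_backward(weights_f16):
    """Placeholder for the forward and backward pass; returns an F16 gradient."""
    return torch.randn_like(weights_f16)

# Single-precision (F32) master weights are kept on the CPU, not on the GPU.
master_weights_cpu = torch.randn(1024, 1024, dtype=torch.float32)

# F32 on the CPU -> F16 on the GPU, then forward/backward produces an F16 gradient.
weights_f16 = master_weights_cpu.to('cuda', dtype=torch.float16)
grad_f16 = forward_backward(weights_f16)

# Decompose the F16 gradient into segments; each chunk is one segment (0..5).
segments = grad_f16.flatten().chunk(6)
```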
202. Performing collective communication between each segment and its corresponding segment.
Each segment and its corresponding segment are the same segment distributed on different machines.
In an optional embodiment, after a weight gradient is decomposed into a plurality of segments, each segment carries its own identifier. For example, if the weight gradient is divided into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4, and segment 5, the sequence number of a segment can be used as its identifier.
In deep learning training on multiple machines (multiple graphics processors), the decomposed segments are present on each of the machines. For example, in deep learning training on 3 machines (machine 1, machine 2, and machine 3), each of machine 1, machine 2, and machine 3 holds the 6 segments, and each machine has multiple collective-communication threads for performing collective communication with the other machines in parallel. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines; for example, segment 0 on machine 1 performs collective communication with segment 0 on machine 2, and segment 0 on machine 2 performs collective communication with segment 0 on machine 3.
In an optional embodiment, after each segment performs collective communication with its corresponding segments, the method further includes:
converting the half-precision weight gradient of the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
203. Performing the weight update corresponding to a segment when the collective communication of that segment ends.
In an optional embodiment, this step of performing the weight update corresponding to a segment when its collective communication ends includes:
performing the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, to generate a new single-precision master weight parameter of the segment on the central processing unit.
At the same time, the other segments perform their collective communication in parallel.
204. Repeating until the weight updates of all of the plurality of segments are completed.
The next iterative computation is then performed with the updated master weight parameters, that is, steps 201 to 204 are repeated, until all weight update iterations are finished.
Fig. 3 is a schematic diagram of the principle of the segment-granularity weight update method according to an embodiment of the present invention, and Fig. 4 is a schematic diagram of the principle of updating weights with CPU-side master weight parameters and a CPU-side weight update according to an embodiment of the present invention. The weight update of one layer in multi-machine, multi-card training is taken as an example in the following description.
First, as shown in Fig. 4, a single-precision (F32) copy of the master weight parameters (master weights) is maintained on the CPU. The master weight parameters are converted into half-precision (F16) weight parameters (weights) on the GPU, and the forward (Forward) and backward (Backward) operations of deep learning are performed with the half-precision weight parameters on the GPU to generate a half-precision (F16) weight gradient (weight gradient) on the GPU.
Thereafter, as shown in Fig. 3, the half-precision weight gradient on the GPU is decomposed into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4, and segment 5.
Each segment is the half-precision weight gradient corresponding to that segment on the GPU.
Each machine holds these 6 segments, and each machine has multiple collective-communication threads for performing, in parallel, the collective communication of each segment with the same segment on the other machines, so the collective communication of each segment proceeds concurrently and independently. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines: segment 0 on machine 1 with segment 0 on machine 2, and segment 0 on machine 2 with segment 0 on machine 3; at the same time, segment 1 on machine 1 with segment 1 on machine 2, and segment 1 on machine 2 with segment 1 on machine 3; and likewise for segments 2, 3, 4, and 5.
It should be noted that the invention decomposes the weight gradient into a plurality of segments and is not limited to the 6 segments shown in Fig. 3; n segments may be launched concurrently for collective communication.
After the collective communication of each segment is completed, the half-precision weight gradient corresponding to that segment on the GPU is converted into a single-precision weight gradient on the CPU.
As shown in Fig. 3, after segment 0 completes its collective communication, the half-precision weight gradient corresponding to segment 0 on the GPU is converted into a single-precision weight gradient on the CPU; the single-precision weight gradient of segment 0 on the CPU and the single-precision master weight parameters maintained on the CPU are then taken as inputs to the weight update operation (update) of segment 0 on the CPU, generating new single-precision master weight parameters (updated master weights) of segment 0 on the CPU.
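What happens once segment 0 finishes its collective communication can be sketched as follows, assuming PyTorch; seg0_grad_f16 (the reduced half-precision gradient of segment 0 on the GPU), master_seg0_cpu (the slice of the single-precision master weights on the CPU corresponding to segment 0), and lr are stand-ins for values produced by the surrounding steps.

```python
import torch

# Stand-ins for values produced by the preceding steps (assumptions).
seg0_grad_f16 = torch.randn(1024, device='cuda', dtype=torch.float16)  # reduced F16 gradient of segment 0
master_seg0_cpu = torch.randn(1024, dtype=torch.float32)               # F32 master slice on the CPU
lr = 0.01                                                              # assumed learning rate

# F16 on the GPU -> F32 on the CPU for this segment only.
seg0_grad_f32_cpu = seg0_grad_f16.to('cpu', dtype=torch.float32)

# The update runs on the CPU, so the F32 master copy never occupies GPU memory.
master_seg0_cpu -= lr * seg0_grad_f32_cpu
```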
At the same time, the other segments (segment 1, segment 2, segment 3, segment 4, and segment 5) perform collective communication with the corresponding segments on the other machines in parallel, so the weight update calculation time can be hidden within the network communication time; the final completion time is approximately the communication time alone, which saves weight update time.
The next weight update iteration is then computed with the updated master weight parameters of the segments, until all weight update iterations are completed.
In summary, the weight gradient is divided into a plurality of segments, and the collective communication and the weight update of each segment are directly pipelined: once the collective communication of the current segment is completed, the weight update corresponding to that segment is performed immediately, while the other segments perform their collective communication in parallel on the communication link.
Meanwhile, to save GPU memory, the invention stores the single-precision master weight parameters (master weights), which conventionally reside in GPU memory, on the CPU, and the update (update) operation on the weight parameters is also carried out on the CPU, thereby reducing the parameter memory consumption on the GPU by 2/3 and greatly improving GPU memory utilization.
Fig. 5 is a schematic structural diagram of a deep learning weight updating apparatus according to an embodiment of the present invention. As shown in Fig. 5, the apparatus includes:
a decomposition module for decomposing the weight gradient generated by the forward and backward calculations into a plurality of segments;
a communication module for performing collective communication between each segment and its corresponding segment;
and an update module for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
Optionally, the apparatus further comprises:
a conversion module for converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and a generating module for performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
Each segment corresponds to the half-precision weight gradient of that segment on the graphics processor.
Optionally, the conversion module is further configured to convert the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
Optionally, the update module is specifically configured to perform the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, so as to generate a new single-precision master weight parameter of the segment on the central processing unit.
The apparatus shown in this embodiment can perform the method embodiment shown in Fig. 2; its implementation principle and technical effects are not described again here.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention. As shown in Fig. 6, the server includes:
a memory, a central processing unit, a graphics processor, and a communication component;
the memory for storing a computer program;
the central processing unit and the graphics processor, each coupled to the memory and the communication component, for executing computer programs.
The central processing unit is configured to convert the single-precision master weight parameters into half-precision weight parameters and send the half-precision weight parameters to the graphics processor through the communication component;
the graphics processor is configured to perform the forward and backward operations of deep learning according to the half-precision weight parameters to generate a half-precision weight gradient;
the graphics processor is further configured to decompose the half-precision weight gradient into a plurality of segments;
the graphics processor is further configured to perform, through the communication component, collective communication between each segment and the corresponding segments on other machines;
each segment and its corresponding segment are the same segment distributed on different machines.
In an optional embodiment, after a weight gradient is decomposed into a plurality of segments, each segment carries its own identifier. For example, if the weight gradient is divided into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4, and segment 5, the sequence number of a segment can be used as its identifier.
In deep learning training on multiple machines (multiple graphics processors), the decomposed segments are present on each of the machines. For example, in deep learning training on 3 machines (machine 1, machine 2, and machine 3), each of machine 1, machine 2, and machine 3 holds the 6 segments, and each machine has multiple collective-communication threads for performing collective communication with the other machines in parallel. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines; for example, segment 0 on machine 1 performs collective communication with segment 0 on machine 2, and segment 0 on machine 2 performs collective communication with segment 0 on machine 3.
The graphics processor is further configured to convert the half-precision weight gradient corresponding to each segment into a single-precision weight gradient and send it to the central processing unit through the communication component.
Correspondingly, the central processing unit is further configured to perform, for each segment, the weight update operation according to the single-precision weight gradient corresponding to the segment and the single-precision master weight parameter corresponding to the segment, so as to generate a new single-precision master weight parameter of the segment.
It should be noted that the other segments perform their collective communication in parallel at the same time. The next iterative computation is performed with the updated master weight parameters until all weight update iterations are finished.
Further, as shown in Fig. 6, the server also includes a display, a power component, an audio component, and so on. Only some of the components are shown schematically in Fig. 6; this does not mean that the server includes only the components shown in Fig. 6.
The server shown in this embodiment can execute the method embodiment shown in Fig. 2; its implementation principle and technical effects are not described again here.
Accordingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program. When the computer program is executed by a computer, the steps or operations related to the server in the method embodiment shown in Fig. 2 can be implemented, which are not described again here.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A deep learning weight updating method, characterized by comprising the following steps:
decomposing the weight gradients generated by forward and backward calculations into a plurality of segments;
performing collective communication between each segment and its corresponding segment;
performing the weight update corresponding to a segment when the collective communication of that segment ends;
until the weight updates of all of the plurality of segments are completed.
2. The method of claim 1, wherein before decomposing the weight gradient resulting from the forward and backward computations into a plurality of segments, the method further comprises:
converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
3. The method of claim 1, wherein each segment and its corresponding segment are the same segment distributed on different machines, and each machine has multiple collective-communication threads so that each segment performs collective communication with the corresponding segments on other machines in parallel.
4. The method of claim 3, wherein each segment corresponds to the half-precision weight gradient of that segment on the graphics processor;
after each segment performs collective communication with its corresponding segment, the method further comprises:
converting the half-precision weight gradient of the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
5. The method of claim 4, wherein performing the weight update corresponding to a segment when the collective communication of that segment ends comprises:
performing the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, to generate a new single-precision master weight parameter of the segment on the central processing unit.
6. A deep learning weight updating apparatus, characterized by comprising:
a decomposition module for decomposing the weight gradient generated by the forward and backward calculations into a plurality of segments;
a communication module for performing collective communication between each segment and its corresponding segment;
and an update module for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
7. The apparatus of claim 6, further comprising:
a conversion module for converting the single-precision master weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and a generating module for performing the forward and backward operations of deep learning with the half-precision weight parameters on the graphics processor to generate a half-precision weight gradient on the graphics processor.
8. The apparatus of claim 6, wherein each segment and its corresponding segment are the same segment distributed on different machines, and each machine has multiple collective-communication threads so that each segment performs collective communication with the corresponding segments on other machines in parallel.
9. The apparatus of claim 8, wherein each segment corresponds to the half-precision weight gradient of that segment on the graphics processor;
the conversion module is further configured to convert the half-precision weight gradient of the segment on the graphics processor into a single-precision weight gradient on the central processing unit.
10. The apparatus of claim 9, wherein the update module is specifically configured to perform the weight update operation of the segment on the central processing unit according to the single-precision weight gradient of the segment on the central processing unit and the single-precision master weight parameter of the segment on the central processing unit, so as to generate a new single-precision master weight parameter of the segment on the central processing unit.
11. A server, comprising: a memory, a processor, and a communication component;
the memory for storing a computer program;
the processor, coupled with the memory and the communication component, to execute a computer program for performing the steps or operations of the method of any of claims 1-5.
12. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a computer, is capable of performing the steps or operations of any one of the methods of claims 1-5.
CN201910217885.3A (priority date 2019-03-21, filing date 2019-03-21): Deep learning weight updating method and device. Status: Pending. Publication: CN111722937A (en).

Priority Applications (1)

Application Number: CN201910217885.3A
Priority Date: 2019-03-21
Filing Date: 2019-03-21
Title: Deep learning weight updating method and device

Applications Claiming Priority (1)

Application Number: CN201910217885.3A
Priority Date: 2019-03-21
Filing Date: 2019-03-21
Title: Deep learning weight updating method and device

Publications (1)

Publication Number: CN111722937A
Publication Date: 2020-09-29

Family ID: 72562202

Family Applications (1)

Application Number: CN201910217885.3A
Title: Deep learning weight updating method and device
Status: Pending

Country Status (1)

Country: CN
Link: CN111722937A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106297774A (en) * 2015-05-29 2017-01-04 中国科学院声学研究所 The distributed parallel training method of a kind of neutral net acoustic model and system
CN105224502A (en) * 2015-09-28 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of degree of depth learning method based on GPU and system
CN107341541A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing full articulamentum neural metwork training
US20180121806A1 (en) * 2016-10-27 2018-05-03 International Business Machines Corporation Efficient parallel training of a network model on multiple graphics processing units
CN108805798A (en) * 2017-05-05 2018-11-13 英特尔公司 Fine granularity for deep learning frame calculates communication and executes
US20180349313A1 (en) * 2017-06-01 2018-12-06 Electronics And Telecommunications Research Institute Parameter server and method for sharing distributed deep learning parameter using the same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王裕民; 顾乃杰; 张孝慈: "多GPU环境下的卷积神经网络并行算法" (Parallel algorithms for convolutional neural networks in multi-GPU environments), 小型微型计算机系统 (Journal of Chinese Computer Systems), no. 03, 15 March 2017 (2017-03-15) *

Similar Documents

Publication Publication Date Title
CN109993299B (en) Data training method and device, storage medium and electronic device
CN110533183B (en) Task placement method for heterogeneous network perception in pipeline distributed deep learning
CN108460457A (en) A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks
CN109597965B (en) Data processing method, system, terminal and medium based on deep neural network
CN111488211A (en) Task processing method, device, equipment and medium based on deep learning framework
EP4242844A2 (en) Distributing tensor computations across computing devices
CN110796242A (en) Neural network model reasoning method and device, electronic equipment and readable medium
CN107729138B (en) Method and device for analyzing high-performance distributed vector space data
CN109885406B (en) Operator calculation optimization method, device, equipment and storage medium
CN114356578B (en) Parallel computing method, device, equipment and medium for natural language processing model
CN108304925B (en) Pooling computing device and method
CN114237869B (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN115880132A (en) Graphics processor, matrix multiplication task processing method, device and storage medium
WO2020015087A1 (en) Method and system for large-scale processing of images, computer device, and computer storage medium
CN115934275A (en) Task processing method and dialogue task processing method
CN115437760A (en) Computing resource allocation method, electronic device, storage medium, and program product
CN109753682B (en) Finite element stiffness matrix simulation method based on GPU (graphics processing Unit) end
CN113885871A (en) Special back-end code generation method and device for supporting machine learning training
CN111722937A (en) Deep learning weight updating method and device
WO2022267854A1 (en) Method, system and apparatus for processing quantum computing task, and operating system
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
CN115759260B (en) Reasoning method and device of deep learning model, electronic equipment and storage medium
CN107025099B (en) Asynchronous graph calculation implementation method and system based on double-queue model
Chen et al. Edge FPGA-based Onsite Neural Network Training
CN111208980B (en) Data analysis processing method and system

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination