CN111722937B - Deep learning weight updating method and device - Google Patents
- Publication number
- CN111722937B CN201910217885.3A
- Authority
- CN
- China
- Prior art keywords
- segment
- weight
- precision
- communication
- gradient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/545—Interprogram communication where tasks reside in different layers, e.g. user- and kernel-space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application provides a deep learning weight updating method and device. The weight gradients are divided into a plurality of segments, and the collective communication and weight update of the segments are pipelined directly: as soon as the collective communication of the current segment completes, the weights corresponding to that segment are updated immediately, while the next segment performs its collective communication in parallel on the communication link.
Description
Technical Field
The invention relates to the technical field of neural network deep learning, in particular to a deep learning weight updating method and device.
Background
Today, deep learning is a new field of machine learning research. Its motivation is to build and simulate neural networks that perform analytical learning in the manner of the human brain, mimicking the mechanisms by which the brain interprets data. Deep learning applications include speech recognition, image recognition, natural language processing and the like; their computational workload is enormous and requires extensive deep learning computation.
In order to accelerate deep learning applications and improve computing efficiency, a high-density computing mode in which a central processing unit (CPU) cooperates with multiple graphics processing units (GPUs) is generally adopted: the GPUs, with their powerful parallel computing capability, execute the time-consuming forward and backward computations, while the CPU completes the remaining work, such as parameter update computation, data reading and distribution, and neural network model updates, according to the characteristics of the deep learning algorithm.
In GPU-based deep learning, however, in one iteration of multi-machine multi-card (multi-GPU) training, each GPU performs forward and backward computation on different data and produces its own weight gradients. Before the weights can be updated, the weight gradients on each GPU must be exchanged with the corresponding weight gradients on the other GPUs so as to obtain the actual weight gradients produced by all of the data in this iteration. Collective communication (Allreduce) is a common reduction operation for weight gradients: it cyclically reduces and transmits the weight gradients on each GPU until the weight gradient value on every GPU equals the sum of the gradient values on all GPUs.
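For reference, a minimal sketch of such a gradient Allreduce, written against PyTorch's torch.distributed API (the function name and the averaging step are illustrative assumptions, not taken from the patent):

```python
import torch
import torch.distributed as dist

def allreduce_weight_gradients(model):
    """Sum each weight gradient across all GPUs, then divide by the number
    of workers so that every GPU holds the averaged gradient.
    Assumes dist.init_process_group(...) has been called and each process
    drives one GPU."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # After this call, param.grad on every GPU is the sum of the
        # gradients produced on all GPUs for this parameter.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
```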
The inter-GPU and inter-machine communication caused by this reduction operation therefore adds communication overhead to the path from weight gradient to weight update and slows down the weight update.
Disclosure of Invention
In order to solve these problems, the invention provides a deep learning weight updating method and device that can effectively hide the weight update computation time within the network communication time, so that the final completion time is approximately the communication time alone, thereby saving weight update time.
The embodiment of the invention provides a deep learning weight updating method, which comprises the following steps:
decomposing the weight gradients generated by the forward and backward computation into a plurality of segments;
performing collective communication between each segment and its corresponding segment;
performing the weight update corresponding to a segment as soon as the collective communication of that segment ends;
until the weight updates of all of the plurality of segments are completed.
Optionally, before decomposing the weight gradients generated by the forward and backward computation into a plurality of segments, the method further comprises:
converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and performing forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor to generate half-precision weight gradients on the graphics processor.
Optionally, each segment and its corresponding segments are the same segment distributed on different machines, and each machine has a plurality of collective communication threads so that each segment performs collective communication in parallel with the corresponding segments on the other machines.
Optionally, each segment is the half-precision weight gradient corresponding to that segment on the graphics processor;
after each segment performs collective communication with its corresponding segment, the method further includes:
converting the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
Optionally, performing the weight update corresponding to the segment at the end of the collective communication of the segment includes:
performing, on the central processing unit, the weight update operation corresponding to the segment according to the single-precision weight gradient corresponding to the segment on the central processing unit and the single-precision main weight parameter corresponding to the segment on the central processing unit, and generating a new single-precision main weight parameter of the segment on the central processing unit.
The application also provides a deep learning weight updating device, which comprises:
the decomposition module is used for decomposing the weight gradients generated by the forward and backward computation into a plurality of segments;
the communication module is used for performing collective communication between each segment and its corresponding segment;
the updating module is used for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
Optionally, the apparatus further comprises:
the conversion module is used for converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and the generation module is used for performing forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor and generating half-precision weight gradients on the graphics processor.
Optionally, each segment and its corresponding segments are the same segment distributed on different machines, and each machine has a plurality of collective communication threads so that each segment performs collective communication in parallel with the corresponding segments on the other machines.
Optionally, each segment is the half-precision weight gradient corresponding to that segment on the graphics processor;
the conversion module is further configured to convert the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
Optionally, the updating module is specifically configured to perform, on the central processing unit, the weight update operation corresponding to the segment according to the single-precision weight gradient corresponding to the segment on the central processing unit and the single-precision main weight parameter corresponding to the segment on the central processing unit, so as to generate a new single-precision main weight parameter of the segment on the central processing unit.
The present application provides a server, comprising: a memory, a processor, and a communication component;
The memory is used for storing a computer program;
The processor, coupled to the memory and the communication component, is configured to execute the computer program so as to perform the steps or operations of the deep learning weight updating method described above.
The present application further provides a computer-readable storage medium storing a computer program which, when executed by a computer, implements the steps or operations of the deep learning weight updating method described above.
According to the embodiments of the invention, the weight gradients are divided into a plurality of segments, and the collective communication and weight update of the segments are pipelined directly: as soon as the collective communication of the current segment completes, the weight update corresponding to that segment is carried out immediately, while the next segment performs its collective communication in parallel on the communication link.
Meanwhile, in order to save GPU memory, the invention stores the single-precision main weight parameters (master weights), which would otherwise reside in GPU memory, on the CPU, and the weight parameter update is also performed on the CPU. This can reduce parameter memory consumption on the GPU by two thirds and greatly improves GPU memory utilization.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the prior art descriptions, and it is obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a prior art deep learning weight gradient update method;
FIG. 2 is a flowchart illustrating a method for updating deep learning weights according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a weight updating method with fragments as granularity according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of weight updating using CPU-resident master weight parameters and a CPU-side update according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a deep learning weight updating device according to an embodiment of the invention;
FIG. 6 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "plurality" generally includes at least two, but does not exclude the case of at least one.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a product or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such product or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a commodity or system comprising such elements.
Fig. 1 is a flow chart of a deep learning weight gradient updating method in the prior art. As shown in fig. 1, the method comprises the following steps:
1. The weight gradients (weight gradients) generated by the forward and backward computation completed on one GPU are combined, through cyclic reduction communication, with the corresponding weight gradients generated by the forward and backward computation completed on the other GPUs, producing the average weight gradients (average weight gradients);
2. The average weight gradients are used to update (update) the weights, producing the updated weights (updated weights).
The time to finally complete the weight update is therefore the sum of the collective communication time and the update computation time of the weight gradients, which greatly slows down the weight update. Moreover, during the entire collective communication (Allreduce) the computing resources are idle and the communication links are not fully occupied, so computing resources are largely wasted.
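To make this cost concrete, the completion time of the prior-art flow can be modelled, under the simplifying assumption that communication and update computation do not overlap at all, as roughly

    T_prior ≈ T_comm + T_update,

where T_comm is the total Allreduce time of the full weight gradient and T_update is the total weight update computation time. If the gradient is instead split into n segments whose communication and updates are pipelined, as described below, the update of each segment can be hidden behind the communication of the following segments, and the completion time drops to approximately

    T_pipelined ≈ T_comm + T_update / n,

which approaches the communication time alone as the number of segments grows. This is only an idealized model (the symbols T_comm, T_update and n are introduced here for illustration); actual timings depend on per-segment overheads and on whether each segment's update really finishes before the remaining communication does.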
In addition, in conventional mixed-precision deep learning training on GPUs, single-precision (F32) main weight parameters (master weights) are maintained on the GPU. First, these main weight parameters need to be converted into half-precision (F16) weight parameters (weights) on the GPU.
Forward (Forward) and Backward (Backward) operations of deep learning are then performed with the half-precision (F16) weight parameters (weights) on the GPU, producing half-precision (F16) weight gradients (weights gradient) on the GPU.
The half-precision weight gradients on the GPU are then converted into single-precision (F32) weight gradients on the GPU.
Then, the weight update (update) calculation on the GPU is performed using the single precision (F32) weight gradient on the GPU and the single precision (F32) main weight parameter on the GPU as inputs, and an updated single precision (F32) main weight parameter (updated MASTER WEIGHTS) on the GPU is generated.
Then, the next iteration calculation is performed (repeating the above steps) with the updated single precision (F32) main weight parameters on the GPU until all iterations are completed.
It can be seen that the main weight parameters (master weights) and the updated main weight parameters (updated master weights) share the same memory space. Throughout the flow, both the main weight parameters (master weights) used in the update stage and the weight gradients (weights gradient) involved in the forward and backward (Forward + Backward) computation are stored in GPU memory, the main weight parameters in single precision and the weight gradients in half precision. The single-precision main weight parameters are twice as large as the half-precision weight gradients, so existing mixed-precision deep learning training on the GPU incurs this doubled parameter memory consumption.
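A compact sketch of this conventional all-on-GPU mixed-precision iteration may help; it is written in PyTorch-style Python, and the names master_weights_fp32 and compute_grads_fp16 are illustrative assumptions rather than terms from the patent:

```python
import torch

def gpu_only_mixed_precision_step(master_weights_fp32, compute_grads_fp16, lr=0.01):
    """One prior-art iteration: everything, including the single-precision
    master weights, lives in GPU memory."""
    # 1. Convert the F32 master weights into F16 working weights on the GPU.
    weights_fp16 = [w.half() for w in master_weights_fp32]
    # 2. Forward + backward with the F16 weights -> F16 weight gradients.
    grads_fp16 = compute_grads_fp16(weights_fp16)
    # 3. Convert the F16 gradients back to F32, still on the GPU.
    grads_fp32 = [g.float() for g in grads_fp16]
    # 4. Update the F32 master weights in place on the GPU.
    with torch.no_grad():
        for w, g in zip(master_weights_fp32, grads_fp32):
            w -= lr * g
    return master_weights_fp32
```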
In order to optimize this inefficient weight update communication and computation flow of the prior art, the invention divides the weight gradients into a plurality of segments and directly pipelines the collective communication and weight update of the segments: once the collective communication of the current segment is completed, the weight update corresponding to that segment is carried out immediately, while the other segments perform their collective communication in parallel on the communication link.
Meanwhile, in order to save GPU memory, the main weight parameters (master weights) are stored on the CPU, and the weight parameter update is also performed on the CPU, which saves two thirds of the parameter memory on the GPU and greatly improves GPU memory utilization.
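The two-thirds figure can be understood with a simple per-parameter accounting (an assumption made here for illustration rather than an explicit derivation in the patent): keeping the single-precision master copy on the GPU costs 4 bytes per parameter and the half-precision working copy costs 2 bytes, i.e. 6 bytes of parameter memory in total; moving the 4-byte master copy, together with its update, to the CPU leaves only the 2-byte half-precision copy on the GPU, a saving of 4/6 = 2/3 of the parameter memory. Gradient and optimizer-state memory are not counted in this rough estimate.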
The application scenario of the present invention involves a plurality of machines (multiple graphics processors) used to accelerate the training of a deep learning neural network model, where each machine has a plurality of collective communication threads for performing collective communication with the other machines in parallel. The implementation of the invention is described in detail below through specific embodiments.
Fig. 2 is a flowchart of a deep learning weight updating method according to an embodiment of the present invention. As shown in fig. 2, the method includes:
201. Decomposing the weight gradients generated by the forward and backward computation into a plurality of segments;
In an alternative embodiment, before the step of decomposing the weight gradients generated by the forward and backward computation into a plurality of segments, the method further comprises:
converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and performing forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor to generate half-precision weight gradients on the graphics processor.
The generated half-precision weight gradients on the graphics processor are then decomposed into a plurality of segments. It should be noted that each segment is the half-precision weight gradient corresponding to that segment on the graphics processor.
202. Performing collective communication between each segment and its corresponding segment;
Each segment and the corresponding segment are the same segment and distributed on different machines.
In an alternative embodiment, after the weight gradients are decomposed into a plurality of segments, each segment carries its own identifier. Suppose the gradients are divided into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4 and segment 5; the sequence number of each segment can be used as its identifier.
In multi-machine (multi-GPU) deep learning training, take training on 3 machines (machine 1, machine 2 and machine 3) as an example: each of machine 1, machine 2 and machine 3 holds all 6 segments, and each machine has a plurality of collective communication threads for performing collective communication with the other machines in parallel. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines; for example, segment 0 on machine 1 performs collective communication with segment 0 on machine 2, and segment 0 on machine 2 performs collective communication with segment 0 on machine 3.
In an alternative embodiment, after each segment performs collective communication with its corresponding segment, the method further includes:
converting the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
203. Performing the weight update corresponding to a segment when the collective communication of that segment ends;
In an alternative embodiment, the step of performing the weight update corresponding to the segment at the end of the collective communication of the segment includes:
performing, on the central processing unit, the weight update operation corresponding to the segment according to the single-precision weight gradient corresponding to the segment on the central processing unit and the single-precision main weight parameter corresponding to the segment on the central processing unit, and generating a new single-precision main weight parameter of the segment on the central processing unit.
Meanwhile, the other segments perform their collective communication in parallel.
204. Repeating until the weight updates of all of the segments are completed.
The updated main weight parameters are then used for the next iteration, i.e. steps 201-204 are repeated, until all weight update iterations are completed.
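A minimal sketch of steps 201-204, using PyTorch's asynchronous all_reduce so that the collective communication of later segments overlaps the CPU-side update of earlier ones. The segment count, learning rate and helper names below are assumptions for illustration; the patent does not prescribe a particular library or optimizer:

```python
import torch
import torch.distributed as dist

def pipelined_segment_update(grads_fp16_gpu, master_weights_fp32_cpu,
                             num_segments=6, lr=0.01):
    """grads_fp16_gpu: flat F16 weight-gradient tensor on the GPU.
    master_weights_fp32_cpu: flat F32 master-weight tensor on the CPU,
    with the same number of elements as the gradient tensor.
    Assumes dist.init_process_group(...) has already been called."""
    world_size = dist.get_world_size()

    # 201: decompose the weight gradients into segments (views, no copy).
    grad_segments = list(torch.chunk(grads_fp16_gpu, num_segments))
    weight_segments = list(torch.chunk(master_weights_fp32_cpu, num_segments))

    # 202: launch collective communication for every segment; each call
    # returns immediately with a work handle, so the segments travel over
    # the communication link while the CPU keeps working.
    handles = [dist.all_reduce(seg, op=dist.ReduceOp.SUM, async_op=True)
               for seg in grad_segments]

    # 203/204: as soon as one segment's communication ends, convert it to
    # F32 on the CPU and update the corresponding master-weight slice,
    # while the remaining segments are still communicating.
    for grad_seg, w_seg, handle in zip(grad_segments, weight_segments, handles):
        handle.wait()                           # this segment's Allreduce is done
        grad_fp32_cpu = grad_seg.float().cpu() / world_size
        with torch.no_grad():
            w_seg -= lr * grad_fp32_cpu         # per-segment weight update on the CPU
```

Because torch.chunk returns views of the flat master-weight tensor, the in-place update of each slice updates the master weights directly. With a single default process group the segment all-reduces are queued in launch order, which is already enough to let the CPU update of one segment overlap the communication of the following segments; the patent's per-segment collective communication threads could be mimicked with one process group (communicator) per segment.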
FIG. 3 is a schematic diagram of the weight updating method with segments as the granularity according to an embodiment of the invention, and FIG. 4 is a schematic diagram of weight updating using CPU-resident master weight parameters and a CPU-side update according to an embodiment of the present invention; together they illustrate the weight update of one layer in multi-machine multi-card training.
First, as shown in fig. 4, a single-precision (F32) copy of the main weight parameters (master weights) is maintained on the CPU. The main weight parameters are converted into half-precision (F16) weight parameters (weights) on the GPU, and forward (Forward) and backward (Backward) operations of deep learning are performed with these half-precision weight parameters, producing half-precision (F16) weight gradients (weights gradient) on the GPU.
Then, as shown in fig. 3, the half-precision weight gradients on the GPU are decomposed into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4 and segment 5.
Each segment is a half-precision weight gradient corresponding to the segment on the GPU.
Each machine holds the 6 segments, and each machine has a plurality of collective communication threads so that each segment communicates collectively, in parallel, with the same segment on the other machines; each segment is thus handled concurrently by an independent thread. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines. For example, segment 0 on machine 1 performs collective communication with segment 0 on machine 2, and segment 0 on machine 2 performs collective communication with segment 0 on machine 3; meanwhile, segment 1 on machine 1 communicates with segment 1 on machine 2, and segment 1 on machine 2 with segment 1 on machine 3; segment 2 on machine 1 with segment 2 on machine 2, and segment 2 on machine 2 with segment 2 on machine 3; segment 3 on machine 1 with segment 3 on machine 2, and segment 3 on machine 2 with segment 3 on machine 3; segment 4 on machine 1 with segment 4 on machine 2, and segment 4 on machine 2 with segment 4 on machine 3; and segment 5 on machine 1 with segment 5 on machine 2, and segment 5 on machine 2 with segment 5 on machine 3.
It should be noted that the present invention decomposes the weight gradients into a plurality of segments; it is not limited to the 6 segments shown in fig. 3, and n segments may be launched to perform collective communication concurrently.
After the collective communication of one segment is completed, the half-precision weight gradient corresponding to that segment on the GPU is converted into a single-precision weight gradient on the CPU.
As shown in fig. 3, after segment 0 completes its collective communication, the half-precision weight gradient corresponding to segment 0 on the GPU is converted into a single-precision weight gradient on the CPU; the single-precision weight gradient of segment 0 on the CPU and the single-precision main weight parameters maintained on the CPU are then taken as inputs to perform the weight update operation (update) of segment 0 on the CPU, generating the new single-precision main weight parameters (updated master weights) of segment 0 on the CPU.
Meanwhile, the other segments (segment 1, segment 2, segment 3, segment 4 and segment 5) perform collective communication in parallel with the corresponding segments on the other machines, so the weight update computation time is hidden within the network communication time; the final completion time is approximately the communication time alone, which saves weight update time.
The updated main weight parameters are then used in the next weight update iteration, until all weight update iterations are completed.
According to the embodiments of the invention, the weight gradients are divided into a plurality of segments, and the collective communication and weight update of the segments are pipelined directly: as soon as the collective communication of the current segment completes, the weight update corresponding to that segment is carried out immediately, while the other segments perform their collective communication in parallel on the communication link.
Meanwhile, in order to save GPU memory, the invention stores the single-precision main weight parameters (master weights), which would otherwise reside in GPU memory, on the CPU, and the weight parameter update is also performed on the CPU. This can reduce parameter memory consumption on the GPU by two thirds and greatly improves GPU memory utilization.
Fig. 5 is a schematic structural diagram of a deep learning weight updating device according to an embodiment of the present invention, as shown in fig. 5, including:
the decomposition module is used for decomposing the weight gradients generated by the forward and backward computation into a plurality of segments;
the communication module is used for performing collective communication between each segment and its corresponding segment;
the updating module is used for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
Optionally, the apparatus further comprises:
the conversion module is used for converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
and the generation module is used for performing forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor and generating half-precision weight gradients on the graphics processor.
Each segment is the half-precision weight gradient corresponding to that segment on the graphics processor;
optionally, the conversion module is further configured to convert the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
Optionally, the updating module is specifically configured to perform, on the central processing unit, the weight update operation corresponding to the segment according to the single-precision weight gradient corresponding to the segment on the central processing unit and the single-precision main weight parameters on the central processing unit, so as to generate new single-precision main weight parameters on the central processing unit.
The apparatus shown in this embodiment may perform the method embodiment shown in fig. 2, and its implementation principle and technical effects are not repeated.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention, as shown in fig. 6, including:
Memory, central processing unit, graphic processor and communication assembly;
The memory is used for storing a computer program;
the central processor and the graphics processor are coupled to the memory and the communication component, respectively, for executing computer programs.
The central processing unit is used for converting the single-precision main weight parameters into half-precision weight parameters and transmitting the half-precision weight parameters to the graphics processor through the communication component;
the graphics processor is used for performing forward and backward operations of deep learning according to the half-precision weight parameters to generate half-precision weight gradients;
the graphics processor is further configured to decompose the half-precision weight gradients into a plurality of segments;
the graphics processor is further configured to perform, through the communication component, collective communication between each segment and its corresponding segment on the other machines.
Each segment and the corresponding segment are the same segment and distributed on different machines.
In an alternative embodiment, after the weight gradients are decomposed into a plurality of segments, each segment carries its own identifier. Suppose the gradients are divided into 6 segments, namely segment 0, segment 1, segment 2, segment 3, segment 4 and segment 5; the sequence number of each segment can be used as its identifier.
In multi-machine (multi-GPU) deep learning training, take training on 3 machines (machine 1, machine 2 and machine 3) as an example: each of machine 1, machine 2 and machine 3 holds all 6 segments, and each machine has a plurality of collective communication threads for performing collective communication with the other machines in parallel. Specifically, each segment performs collective communication, according to its sequence number, with the segment of the same sequence number on the other machines; for example, segment 0 on machine 1 performs collective communication with segment 0 on machine 2, and segment 0 on machine 2 performs collective communication with segment 0 on machine 3.
The graphics processor is further configured to convert the half-precision weight gradient corresponding to each segment into a single-precision weight gradient and transmit the single-precision weight gradient to the central processing unit through the communication component.
Correspondingly, the central processing unit is further configured to perform the corresponding weight update operation on each segment according to the single-precision weight gradient corresponding to the segment and the single-precision main weight parameter corresponding to the segment, so as to generate a new single-precision main weight parameter of the segment.
The other segments perform their collective communication in parallel, and the updated main weight parameters are used for the next iteration until all weight update iterations are completed.
Further, as shown in fig. 6, the server further includes: a display, a power supply component, an audio component, and the like. Only some of the components are schematically shown in fig. 6, which does not mean that the server only comprises the components shown in fig. 6.
The server in this embodiment may execute the method embodiment shown in fig. 2, and its implementation principle and technical effects are not repeated.
Accordingly, an embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program when executed by a computer can implement steps or operations related to a server in the embodiment of the method shown in fig. 2, which are not described herein.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A deep learning weight updating method, comprising:
Performing forward and backward operations of deep learning on weight parameters on a graphics processor to generate weight gradients on the graphics processor;
Decomposing the weight gradient into a plurality of segments; each segment and the corresponding segment are the same segment and are distributed on different machines, and each machine is provided with a plurality of collective communication threads for carrying out collective communication on each segment and the corresponding segment on other machines in parallel;
Each segment is in collective communication with its corresponding segment;
performing the weight update corresponding to a segment when the collective communication of that segment ends;
until the weight updates of all of the plurality of segments are completed.
2. The method according to claim 1, wherein the method further comprises:
converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
wherein the performing forward and backward operations of deep learning on weight parameters on a graphics processor to generate weight gradients on the graphics processor comprises:
performing forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor to generate half-precision weight gradients on the graphics processor.
3. The method of claim 2, wherein each segment corresponds to a half-precision weight gradient on the graphics processor for the segment;
after each segment performs collective communication with its corresponding segment, the method further includes:
converting the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
4. A method according to claim 3, wherein performing the weight update for the segment at the end of the collective communication of the segment comprises:
performing, on the central processing unit, the weight update operation corresponding to the segment according to the single-precision weight gradient corresponding to the segment on the central processing unit and the single-precision main weight parameter corresponding to the segment on the central processing unit, and generating a new single-precision main weight parameter of the segment on the central processing unit.
5. A deep learning weight updating apparatus, characterized by comprising:
the generation module is used for performing forward and backward operations of deep learning on weight parameters on the graphics processor and generating weight gradients on the graphics processor;
A decomposition module for decomposing the weight gradient into a plurality of segments; each segment and the corresponding segment are the same segment and are distributed on different machines, and each machine is provided with a plurality of collective communication threads for carrying out collective communication on each segment and the corresponding segment on other machines in parallel;
the communication module is used for performing collective communication between each segment and its corresponding segment;
the updating module is used for performing the weight update corresponding to a segment when the collective communication of that segment ends, until the weight updates of all of the plurality of segments are completed.
6. The apparatus as recited in claim 5, further comprising:
the conversion module is used for converting the single-precision main weight parameters on the central processing unit into half-precision weight parameters on the graphics processor;
the generation module is specifically configured to perform forward and backward operations of deep learning on the half-precision weight parameters on the graphics processor to generate half-precision weight gradients on the graphics processor.
7. The apparatus of claim 6, wherein each segment corresponds to a half-precision weight gradient on the graphics processor for the segment;
the conversion module is further configured to convert the half-precision weight gradient corresponding to the segment on the graphics processor into the corresponding single-precision weight gradient on the central processing unit.
8. The apparatus of claim 7, wherein the updating module is specifically configured to perform a weight updating operation corresponding to the segment on the central processor according to a single-precision weight gradient corresponding to the segment on the central processor and a single-precision main weight parameter corresponding to the segment on the central processor, and generate a new single-precision main weight parameter of the segment on the central processor.
9. A server, comprising: a memory, a processor, and a communication component;
The memory is used for storing a computer program;
The processor, coupled with the memory and the communication component, is configured to execute a computer program for performing the steps or operations of the method of any of claims 1-4.
10. A computer readable storage medium storing a computer program, wherein the computer program is capable of implementing the steps or operations of the method according to any one of claims 1-4 when executed by a computer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910217885.3A CN111722937B (en) | 2019-03-21 | 2019-03-21 | Deep learning weight updating method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910217885.3A CN111722937B (en) | 2019-03-21 | 2019-03-21 | Deep learning weight updating method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111722937A CN111722937A (en) | 2020-09-29 |
CN111722937B true CN111722937B (en) | 2024-05-10 |
Family
ID=72562202
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910217885.3A Active CN111722937B (en) | 2019-03-21 | 2019-03-21 | Deep learning weight updating method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111722937B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105224502A (en) * | 2015-09-28 | 2016-01-06 | 浪潮(北京)电子信息产业有限公司 | A GPU-based deep learning method and system |
CN106297774A (en) * | 2015-05-29 | 2017-01-04 | 中国科学院声学研究所 | Distributed parallel training method and system for a neural network acoustic model |
CN107341541A (en) * | 2016-04-29 | 2017-11-10 | 北京中科寒武纪科技有限公司 | Apparatus and method for performing fully connected layer neural network training |
CN108805798A (en) * | 2017-05-05 | 2018-11-13 | 英特尔公司 | Fine-grained compute communication execution for deep learning frameworks |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10949746B2 (en) * | 2016-10-27 | 2021-03-16 | International Business Machines Corporation | Efficient parallel training of a network model on multiple graphics processing units |
KR102197247B1 (en) * | 2017-06-01 | 2020-12-31 | 한국전자통신연구원 | Parameter server and method for sharing distributed deep learning parameter using the same |
- 2019-03-21 CN CN201910217885.3A patent/CN111722937B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106297774A (en) * | 2015-05-29 | 2017-01-04 | 中国科学院声学研究所 | Distributed parallel training method and system for a neural network acoustic model |
CN105224502A (en) * | 2015-09-28 | 2016-01-06 | 浪潮(北京)电子信息产业有限公司 | A GPU-based deep learning method and system |
CN107341541A (en) * | 2016-04-29 | 2017-11-10 | 北京中科寒武纪科技有限公司 | Apparatus and method for performing fully connected layer neural network training |
CN108805798A (en) * | 2017-05-05 | 2018-11-13 | 英特尔公司 | Fine-grained compute communication execution for deep learning frameworks |
Non-Patent Citations (1)
Title |
---|
Parallel algorithms for convolutional neural networks in a multi-GPU environment; Wang Yumin; Gu Naijie; Zhang Xiaoci; Journal of Chinese Computer Systems (小型微型计算机系统); 2017-03-15 (03); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111722937A (en) | 2020-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110533183B (en) | Task placement method for heterogeneous network perception in pipeline distributed deep learning | |
CN109993299A (en) | Data training method and device, storage medium, electronic device | |
CN108460457A (en) | A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks | |
EP4242844A2 (en) | Distributing tensor computations across computing devices | |
CN114356578B (en) | Parallel computing method, device, equipment and medium for natural language processing model | |
CN114418129B (en) | Deep learning model training method and related device | |
CN105373517A (en) | Spark-based distributed matrix inversion parallel operation method | |
WO2022267854A1 (en) | Method, system and apparatus for processing quantum computing task, and operating system | |
Verma et al. | A new accelerated proximal gradient technique for regularized multitask learning framework | |
CN109657794B (en) | Instruction queue-based distributed deep neural network performance modeling method | |
US20220172044A1 (en) | Method, electronic device, and computer program product for deploying machine learning model | |
CN113885871A (en) | Special back-end code generation method and device for supporting machine learning training | |
CN111722937B (en) | Deep learning weight updating method and device | |
Mei et al. | ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation | |
Zeng et al. | Training acceleration for deep neural networks: A hybrid parallelization strategy | |
CN111160535A (en) | DGCNN model acceleration method based on Hadoop | |
CN115292044A (en) | Data processing method and device, electronic equipment and storage medium | |
CN110021339A (en) | Cluster parallel computing accelerated method based on protein folding measuring and calculating protein structure | |
Sreedhar et al. | Efficient training of convolutional neural nets on large distributed systems | |
Luo et al. | RTP: Rethinking Tensor Parallelism with Memory Deduplication | |
CN111291893A (en) | Scheduling method, scheduling system, storage medium, and electronic apparatus | |
Chen et al. | Edge FPGA-based onsite neural network training | |
CN115759260B (en) | Reasoning method and device of deep learning model, electronic equipment and storage medium | |
Huang et al. | Adaptive partitioning and efficient scheduling for distributed DNN training in heterogeneous IoT environment | |
Kumar et al. | Efficient training of convolutional neural nets on large distributed systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |