WO2021244354A1 - Training method for artificial neural network model and related product


Publication number: WO2021244354A1 (PCT/CN2021/095844)
Authority: WIPO (PCT)
Prior art keywords: gradient information, network layer, working node, neural network, network model
Application number: PCT/CN2021/095844
Other languages: English (en), Chinese (zh)
Inventors: 王迎瑞, 李周洋, 王元波, 张行程
Original Assignee: 上海商汤智能科技有限公司
Application filed by 上海商汤智能科技有限公司
Priority claimed to KR1020227010791A (published as KR20220054861A)
Publication of WO2021244354A1

Classifications

    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/08 Learning methods
    • G06F18/25 Fusion techniques
    • G06N3/045 Combinations of networks

Definitions

  • This application relates to the field of model training, and in particular to a training method for a neural network model and related products.
  • Deep learning is bringing tremendous development and progress to many social fields, and model training is its key link.
  • During the model training process, a large amount of sample data is read and a large number of mathematical operations are performed, which is very time-consuming.
  • Although the industry has continuously made breakthroughs in benchmark tests on the ImageNet data set, an efficient distributed model training scheme is still a tricky practical problem. Therefore, it is necessary to study more efficient distributed model training schemes.
  • the embodiment of the application discloses a neural network model training method and related products.
  • an embodiment of the present application provides a method for training a neural network model.
  • The method includes: a first working node obtains local gradient information of at least one network layer of the neural network model based on the current iteration of the neural network model; in the process of transmitting the local gradient information of a first network layer in the neural network model with at least one second working node, the first working node updates the parameters of a second network layer in the neural network model in parallel.
  • The neural network model can include several layers (Layer), and its distributed parallel training process can be divided into forward calculation (Forward Pass), backward calculation (Backward Pass), gradient data synchronization (for example, Allreduce Gradients), and parameter update.
  • the forward calculation is a forward-order layer-by-layer operation
  • the reverse calculation is a reverse-order layer-by-layer operation
  • gradient data synchronization mainly occupies network bandwidth resources, while the other operations occupy computing resources of the processor.
  • the first working node performs parameter update and gradient data synchronization in parallel, so as to hide communication overhead, and can fully explore overlapping parts in the model training process, reduce the delay caused by communication, and improve model training efficiency.
  • While transmitting the local gradient information of the first network layer in the neural network model with at least one second working node, the first working node updates the parameters of the second network layer in parallel; overlapping the process of updating the parameters of the neural network model with the process of transmitting local gradient information can improve the efficiency of model training.
  • The method further includes: the first working node determines the dependencies between the multiple operations of the current iteration based on the connection relationship of the multiple network layers of the neural network model, where the multiple operations include at least the local gradient information transmission operation and the parameter update operation of at least one network layer in the neural network model; the first working node performs the multiple operations based on the dependencies between them.
  • In this way, the dependencies between the multiple operations of the current iteration can be accurately determined, and each of the operations can be executed in order based on the dependencies between the multiple operations.
  • Optionally, the first working node updates the parameters of multiple network layers in the neural network model layer by layer in reverse order; and/or, the network depth of the second network layer is greater than the network depth of the first network layer.
  • The first working node and the at least one second working node transmit the local gradient information of the multiple network layers in the neural network model layer by layer in reverse order; the first working node calculates the local gradient information of the multiple network layers in the neural network model layer by layer in reverse order (corresponding to the reverse calculation being a reverse-order layer-by-layer operation).
  • The first working node updating the parameters of the second network layer in the neural network model in parallel includes: when the first working node determines that the operations on which the parameter update of the second network layer depends are completed, updating the parameters of the second network layer in parallel, where the operations on which the parameter update depends include transmitting the local gradient information of the second network layer with the at least one second working node.
  • The method further includes: the first working node calculates the local gradient information of a third network layer in the neural network model in the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node.
  • The first working node calculates the local gradient information of the third network layer in the neural network model during the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node; the process of calculating the local gradient information of a network layer in the neural network model overlaps with the process of transmitting local gradient information (that is, communication and calculation overlap), which can improve the efficiency of model training.
  • Before the first working node performs the current iteration of the neural network model, the method further includes: the first working node performs at least one inner-layer iteration of the neural network model to obtain intermediate fusion gradient information corresponding to the at least one inner-layer iteration. The first working node obtaining the local gradient information of at least one network layer of the neural network model based on the current iteration includes: the first working node obtains target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration; the local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.
  • The first working node performs an inner-layer iteration of the neural network model to obtain one set of local gradient information; a set of local gradient information can be understood as all the local gradient information obtained by the first working node by completing the forward calculation and reverse calculation of each network layer in the neural network model.
  • the target fusion gradient information of a network layer of the neural network model can be understood as the gradient information obtained by fusion of multiple sets of local gradient information of the network layer obtained by multiple inner layer iterations.
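  • Expressed as a formula (an illustrative restatement of the above, assuming K inner-layer iterations and writing g_l^(k) for the local gradient information of network layer l obtained in the k-th inner-layer iteration), the target fusion gradient information of layer l is:

```latex
g_l^{\text{target}} = \sum_{k=1}^{K} g_l^{(k)}
```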
  • In this solution, the first working node and the at least one second working node transmit the target fusion gradient information of the network layer, which can reduce the number of gradient transmissions and the total communication volume.
  • The first working node obtaining target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration includes: the first working node accumulates the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model.
  • The method further includes: in the process of obtaining the target fusion gradient information of the third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration, the first working node performs the transmission of the target fusion gradient information of the fourth network layer of the neural network model with the at least one second working node in parallel.
  • the network depth of the fourth network layer is greater than the network depth of the third network layer.
  • the process of calculating the target fusion gradient information of the network layer in the neural network model and the process of transmitting the target fusion gradient information of the network layer are overlapped (that is, calculation and communication overlap), which can improve the efficiency of model training.
  • Before performing the local gradient information transmission of the first network layer in the neural network model with at least one second working node, the method further includes: the first working node amplifies each value in the local gradient information of the first network layer by a factor of M, and converts each amplified value into half precision, where M is a real number greater than 1.
  • Before the first working node updates the parameters of the second network layer in the neural network model in parallel, the method further includes: the first working node converts each value included in the obtained local gradient information of the second network layer into single precision, and reduces each converted value by a factor of M to obtain processed gradient information, where M is a real number greater than 1. The first working node updating the parameters of the second network layer in the neural network model in parallel includes: the first working node uses the processed gradient information to update the parameters of the second network layer in the neural network model.
  • Before performing the local gradient information transmission of the first network layer in the neural network model with at least one second working node, the method further includes: the first working node stores the calculated local gradient information of the first network layer into a pre-allocated target storage space based on the offset corresponding to the first network layer, where the target storage space is used to store the local gradient information of multiple network layers of the neural network model; the local gradient information of the first network layer sent by the first working node is obtained from the target storage space based on the offset corresponding to the first network layer, and/or, the first working node updates the local gradient information of the first network layer stored in the target storage space based on the received local gradient information of the first network layer from the at least one second working node.
  • In this way, based on the offset corresponding to the first network layer, the local gradient information of the first network layer can be quickly and accurately obtained from the target storage space and/or updated in the target storage space.
  • Before performing the local gradient information transmission of the first network layer in the neural network model with at least one second working node, the method further includes: the first working node stores the calculated local gradient information of the multiple network layers of the neural network model into a pre-allocated target storage space, where the offset corresponding to each of the multiple network layers is determined by a memory manager and the target storage space is a continuous storage space; the first working node obtains, from the target storage space, the local gradient information of at least two of the multiple network layers based on the offsets corresponding to the network layers, where the at least two network layers include the first network layer. Performing the local gradient information transmission of the first network layer in the neural network model with at least one second working node then includes: performing the local gradient information transmission of the at least two network layers in the neural network model with the at least one second working node.
  • The main principle of this implementation is to merge the local gradient information of several network layers into one larger array and then initiate one global communication; this can improve global communication efficiency and reduce the number of global communications.
  • In a second aspect, an embodiment of the present application provides an image prediction method. The method includes: acquiring an image to be processed; and performing prediction processing on the image to be processed by using a neural network model to obtain a prediction result.
  • An embodiment of the present application provides a data processing device, including: a processing module configured to obtain local gradient information of at least one network layer of the neural network model based on the current iteration of the neural network model; and a transceiver module configured to transmit the local gradient information of the first network layer in the neural network model with at least one second working node. The processing module is also configured to update, in parallel, the parameters of the second network layer in the neural network model while the transceiver module transmits the local gradient information of the first network layer in the neural network model with at least one second working node.
  • An embodiment of the present application provides a data processing device, including: an acquisition module configured to acquire an image to be processed; and a processing module configured to perform prediction processing on the image to be processed by using a neural network model to obtain a prediction result.
  • An embodiment of the present application provides an electronic device that includes a processor and a memory, where the memory is used to store instructions, and the processor is used to execute the instructions stored in the memory, so that the processor executes the method of the first aspect and any possible implementation manner thereof.
  • An embodiment of the present application provides an electronic device that includes a processor and a memory, where the memory is used to store instructions, and the processor is used to execute the instructions stored in the memory, so that the processor executes the method of the second aspect and any possible implementation manner thereof.
  • an embodiment of the present application provides a chip including a data interface and a processor, where the processor is configured to execute the first aspect or the method in any possible implementation of the first aspect.
  • an embodiment of the present application provides a chip, which includes a data interface and a processor, where the processor is configured to execute the second aspect or a method in any possible implementation manner of the second aspect.
  • An embodiment of the present application provides a computer-readable storage medium; the computer storage medium stores a computer program, the computer program includes program instructions, and when the program instructions are executed by a processor, the processor executes the method of the above first aspect and any possible implementation manner thereof.
  • An embodiment of the present application provides a computer-readable storage medium that stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to execute the method of the above second aspect and any possible implementation manner thereof.
  • An embodiment of the present application provides a computer program product. The computer program product includes program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect and any possible implementation manner thereof.
  • An embodiment of the present application provides a computer program product. The computer program product includes program instructions, and when the program instructions are executed by a processor, the processor executes the method of the second aspect and any possible implementation manner thereof.
  • FIG. 1 is an example of a distributed training flowchart provided by an embodiment of the application
  • Figure 2 is a flowchart of a neural network model training method provided by an embodiment of the application
  • FIG. 3 is a schematic diagram of an example of calculation and communication overlap provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of another example of calculation and communication overlap provided by an embodiment of the application.
  • FIG. 5 is a flowchart of an inner layer iteration method provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of an example of a communication convergence strategy provided by an embodiment of this application.
  • FIG. 7 is a flowchart of an image prediction method provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of this application.
  • FIG. 9 is a schematic structural diagram of another data processing device provided by an embodiment of the application.
  • FIG. 10 is a schematic structural diagram of a server provided by an embodiment of this application.
  • FIG. 11 is a schematic structural diagram of a terminal device provided by an embodiment of this application.
  • This application provides a training method of a neural network model suitable for a distributed model training scenario, which can improve the efficiency of model training.
  • the distributed training system includes multiple working nodes, and each working node has basically the same function.
  • Each working node obtains a trained neural network model through multiple iterative training of the neural network model.
  • In each iteration, each working node uses its own training samples to train the neural network model to obtain its own local gradient information; then, data synchronization is performed between the multiple working nodes, so that each of the multiple working nodes obtains the local gradient information of all working nodes and fuses the obtained local gradient information of all working nodes to obtain global gradient information; or, each of the multiple working nodes fuses the local gradient information of all other working nodes to obtain fused gradient information, and then fuses its own local gradient information with the fused gradient information to obtain the global gradient information.
  • For example, each working node sends to other working nodes (for example, the working node to its left or right) the local gradient information calculated by itself and/or the local gradient information received from at least one other working node, or fused gradient information obtained by fusing the local gradient information calculated by itself with the local gradient information received from at least one other working node, until each working node obtains the local gradient information calculated by all working nodes, the fused gradient information, or the global gradient information; then, each working node uses the global gradient information, obtained by fusing the local gradient information calculated by all working nodes, to update the parameters of the neural network model.
  • Such iterations are performed multiple times, and each working node repeats the foregoing operations in each iteration until a training cut-off condition is reached, for example, the neural network model converges or the number of training iterations reaches a preset number.
  • Each working node uses the same neural network model, and each working node updates the parameters of the neural network model synchronously; different working nodes use different training samples to train the neural network model.
  • the neural network model adopted by each working node is always the same.
  • multiple working nodes may be multiple processors on the same terminal device or server.
  • 8 GPUs on a server serve as 8 working nodes, that is, one GPU corresponds to one working node.
  • Optionally, one hardware entity, such as a terminal device or a server, corresponds to one working node or to at least two working nodes.
  • 8 laptops serve as 8 working nodes, that is, one laptop serves as one working node.
  • 256 GPUs on 32 servers serve as 256 working nodes.
  • the multiple working nodes included in the distributed training system are multiple virtual machines running in one or more devices (for example, servers).
  • the process of updating the parameters of the neural network model of the working node and the gradient data synchronization process of the working node are executed in parallel, which can improve training efficiency.
  • Fig. 1 is an example of a distributed training flowchart provided by an embodiment of the application.
  • GPU 0, GPU 1, GPU 2, and GPU 3 are respectively a working node in the distributed training system.
  • The neural network model includes several layers, and the parallel training process of GPU 0, GPU 1, GPU 2, and GPU 3 may include: forward calculation (Forward Pass) of each layer, backward propagation (Backward Pass), gradient data synchronization (such as gradient reduction communication), and parameter update.
  • Through the forward calculation, the gradient of the last layer of the neural network model can be obtained; in backpropagation, the gradient of the last layer is propagated backward in reverse order, and the gradients of each layer of the neural network model are calculated in turn.
  • In gradient data synchronization, gradient data can be synchronized between multiple working nodes.
  • the purpose of gradient data synchronization is to enable each working node to obtain global gradient information obtained by fusion of local gradient information calculated by all working nodes, and this application does not limit the way to achieve this goal.
  • each working node uses the global gradient information obtained by synchronization of the gradient data to update the network parameters (such as weights, etc.) of the neural network model.
  • different working nodes input different training samples into the neural network model to perform forward calculation and reverse calculation (ie, back propagation) to obtain their respective local gradient information.
  • After each working node completes one global gradient data synchronization, it can obtain the global gradient information obtained by fusing the local gradient information calculated by all working nodes, or the local gradient information calculated by all working nodes; each working node then uses the global gradient information obtained by fusing the local gradient information calculated by all working nodes to update the parameters of its own neural network model.
  • each working node can use the same method to update the parameters of the neural network model.
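  • To make the flow of FIG. 1 concrete, the following is a minimal per-iteration sketch. It is not the implementation claimed in this application but a baseline in which the four stages run sequentially; it assumes PyTorch with torch.distributed already initialized, and the function name train_step and the plain-SGD update are illustrative.

```python
import torch
import torch.distributed as dist

def train_step(model, loss_fn, inputs, targets, lr=0.01):
    # Forward calculation (forward-order, layer by layer)
    loss = loss_fn(model(inputs), targets)
    # Backward propagation (reverse-order) yields each layer's local gradient
    model.zero_grad()
    loss.backward()
    # Gradient data synchronization: after the all-reduce, every working node
    # holds the same fused (global) gradient information
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
    # Parameter update using the global gradient information
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad
```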
  • gradient data synchronization mainly occupies network bandwidth resources, and other operations occupy GPU computing resources.
  • the embodiment of the present application provides a method for training a neural network model that makes the gradient data synchronization and the parameter update overlap (ie, parallel). The following describes the training method of the neural network model provided by the embodiments of the present application with reference to the accompanying drawings.
  • Fig. 2 is a flowchart of a method for training a neural network model provided by an embodiment of the application. As shown in Figure 2, the method includes:
  • 201: The first working node obtains local gradient information of at least one network layer of the neural network model based on the current iteration of the neural network model.
  • The above-mentioned first working node can be a terminal device such as a notebook computer, a desktop computer, a tablet computer, or a mobile phone; it can also be a server; it can also be a virtual machine running on a server or a terminal device; it can also be a processor on a terminal device or a server, such as a graphics processing unit (GPU), a central processing unit (CPU), or a neural-network processing unit (NPU).
  • each GPU can obtain the local gradient information of each network layer through reverse calculation.
  • the reverse calculation is a reverse order layer-by-layer operation, and the first working node can calculate the local gradient information of each network layer in the neural network model layer by layer in the reverse order, see FIG. 1.
  • Optionally, before performing the local gradient information transmission of the first network layer in the neural network model with at least one second working node (step 202), the first working node may also perform the following operation: the first working node amplifies each value in the local gradient information of the first network layer by a factor of M, and converts each amplified value into half precision, where M is a real number greater than 1.
  • That is, before transmitting the local gradient information of the first network layer in the neural network model with at least one second working node, the first working node first converts the local gradient information of the first network layer into half-precision floating-point (half-precision float) data, so that the storage space it occupies is reduced by half compared with single-precision float data; then the gradient reduction communication is performed; after the reduction communication ends, the half-precision gradient obtained by the reduction communication is first converted back to single precision, and then the parameters are updated. In this way, communication overhead can be reduced by half.
  • the first working node first amplifies the local gradient information before communication, and then shrinks it after the communication ends, so as to reduce the accuracy loss in the transmission of local gradient information.
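  • An illustrative sketch of this scale-then-cast scheme (assuming PyTorch with torch.distributed initialized; the value of M and the function name are hypothetical):

```python
import torch
import torch.distributed as dist

M = 1024.0  # amplification factor, M > 1; illustrative value

def allreduce_half_precision(grad: torch.Tensor) -> torch.Tensor:
    # Amplify by M, then convert to half precision: communication volume is
    # halved, and the amplification reduces precision loss for small values.
    buf = (grad * M).half()
    dist.all_reduce(buf, op=dist.ReduceOp.SUM)
    # After the reduction communication: convert back to single precision,
    # then shrink by M (and average over the working nodes).
    return buf.float() / (M * dist.get_world_size())
```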
  • 202: The first working node updates the parameters of the second network layer in the neural network model in parallel during the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node.
  • the first network layer and the second network layer are different.
  • each of the at least one second working node described above performs similar operations to the first working node.
  • Optionally, the first working node updates the parameters of multiple network layers in the neural network model layer by layer in reverse order; and/or, the network depth of the second network layer is greater than the network depth of the first network layer.
  • That is, the first working node implements gradient data synchronization through reverse-order layer-by-layer operation, and likewise performs parameter update through reverse-order layer-by-layer operation.
  • the neural network model includes N layers, and the first working node and at least one second working node successively transmit local gradient information from the Nth network layer to the first network layer (corresponding to reverse-order layer-by-layer operation to achieve gradient data synchronization).
  • “transmit” means “send” and "receive”.
  • That is, while the first working node sends to at least one second working node the local gradient information of the Nth network layer calculated by itself, it also receives the local gradient information of the Nth network layer from at least one second working node; then, the first working node successively updates the parameters from the Nth network layer to the first network layer (corresponding to reverse-order layer-by-layer parameter update).
  • FIG. 3 is a schematic diagram of an example of calculation and communication overlap provided by an embodiment of the application.
  • 301 represents a data stream (stream) 1 that implements gradient data synchronization by reverse-order layer-by-layer operation
  • 302 represents a data stream (stream) 2 that implements parameter update through reverse-order layer-by-layer operation, and data stream 1 and data stream 2 are parallel
  • Each rectangular box in 301 represents the operation of transmitting (or communicating or synchronizing) the local gradient information of a network layer between the first working node and other working nodes.
  • the nth network layer in 301 indicates the operation in which the first working node and other working nodes transmit the local gradient information of the nth network layer.
  • each rectangular box in 302 represents the operation of the first working node to update the parameters of a network layer
  • the nth network layer represents the operation of the first working node to update the parameters of the nth network layer
  • the arrows indicate the timeline.
  • n is an integer greater than 1.
  • the first working node and other working nodes transmit the local gradient information of the nth network layer, the local gradient information of the (n-1)th network layer, ..., the local gradient information of the first network layer in sequence;
  • the first working node updates the parameters of the nth network layer, the parameters of the (n-1)th network layer, ..., the parameters of the first network layer in sequence;
  • while the first working node and other working nodes transmit the local gradient information of the (n-i)th network layer, the parameters of the (n-i+1)th network layer are updated in parallel.
  • i is an integer less than n.
  • Since the first working node realizes gradient data synchronization through reverse-order layer-by-layer operation and performs parameter update through reverse-order layer-by-layer operation, during gradient data synchronization the first working node can, in parallel, use the already obtained local gradient information of a network layer to perform part of the parameter update operations.
  • For example, because the first working node has already received the local gradient information of the nth network layer before performing the operation of receiving the local gradient information of the (n-1)th network layer, the first working node can perform the operation of updating the parameters of the nth network layer in parallel while receiving the local gradient information of the (n-1)th network layer.
  • In some embodiments, the first working node determines the dependencies between the multiple operations of the current iteration based on the connection relationship of the multiple network layers of the neural network model, where the multiple operations include at least the local gradient information transmission operation and the parameter update operation of at least one network layer in the neural network model; the first working node executes the multiple operations based on the dependencies between them. That is to say, the first working node can establish the dependencies between the multiple operations of the current iteration based on the sequential relationship of the network layers to which the operations belong; the specific execution timing of each operation is then driven by these dependencies.
  • In some embodiments, the first working node realizes gradient data synchronization through reverse-order layer-by-layer operation, and performs parameter update through reverse-order layer-by-layer operation.
  • In this case, the operation on which the local gradient information transmission operation of any network layer in the neural network model depends is the completion of the transmission operations of the local gradient information of each network layer after that network layer; and the operation on which the parameter update operation of any network layer in the neural network model depends is the completion of the transmission operation of the local gradient information of that network layer. For example, after the first working node completes the transmission operation of the local gradient information of the nth network layer in the neural network model, it can perform the transmission operation of the local gradient information of the (n-1)th network layer and the parameter update operation of the nth network layer.
  • step 202 is as follows: in the process of transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node, the first working node determines When the operation on which the parameter update operation of the second network layer relies is completed, the parameters of the second network layer are updated in parallel with the transmission of the local gradient information of the first network layer, wherein the parameter update operation relies on The operation includes transmitting the local gradient information of the second network layer with the at least one second working node.
  • In some embodiments, each operation to be executed by the first working node is bound to an event, and the events that each operation needs to wait for are established according to the dependencies between the operations; each data stream waits, through a lightweight blocking interface (such as cudaStreamWaitEvent), for the events associated with the current operation to complete before starting the current operation.
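  • The event-driven execution described above can be sketched as follows (a conceptual sketch only: it uses PyTorch's stream/event API, which wraps primitives such as cudaStreamWaitEvent; in practice NCCL communication runs on its own internal streams, and the per-layer granularity and plain-SGD update are illustrative):

```python
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()    # data stream 1: gradient synchronization
update_stream = torch.cuda.Stream()  # data stream 2: parameter update

def sync_and_update(layers, lr=0.01):
    # Visit layers in reverse order, matching the reverse-order backward pass.
    for layer in reversed(layers):
        comm_done = torch.cuda.Event()
        with torch.cuda.stream(comm_stream):
            for p in layer.parameters():
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= dist.get_world_size()
            comm_done.record(comm_stream)
        # The update of this layer depends only on this layer's communication,
        # so it overlaps with the communication of shallower layers.
        update_stream.wait_event(comm_done)
        with torch.cuda.stream(update_stream):
            with torch.no_grad():
                for p in layer.parameters():
                    p -= lr * p.grad
```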
  • Optionally, the first working node may perform the following operations: the first working node converts each value in the obtained local gradient information of the second network layer into single precision, and reduces each converted value by a factor of M to obtain processed gradient information, where M is a real number greater than 1; the first working node updating the parameters of the second network layer in the neural network model in parallel may then be: the first working node uses the processed gradient information to update the parameters of the second network layer in the neural network model.
  • In the embodiments of this application, the first working node updates the parameters of the second network layer in parallel while transmitting the local gradient information of the first network layer in the neural network model with at least one second working node; the process of updating the parameters of the neural network model overlaps with the process of transmitting local gradient information (that is, parameter update and communication overlap), which can improve the efficiency of model training.
  • the first working node may further overlap the gradient data synchronization and reverse calculation.
  • The following introduces the overlap of gradient data synchronization and reverse calculation in conjunction with the accompanying drawings.
  • Optionally, on the basis of executing the method flow in FIG. 2, the first working node may further perform the following operation: in the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node, calculating the local gradient information of the third network layer in the neural network model.
  • the network depth of the third network layer is smaller than the network depth of the first network layer.
  • Since the reverse calculation is a reverse-order layer-by-layer operation and the first working node realizes gradient data synchronization through reverse-order layer-by-layer operation, the process in which the first working node performs reverse calculation can overlap with the process of gradient data synchronization, that is, the reverse calculation and the gradient data synchronization are performed in parallel.
  • FIG. 4 is a schematic diagram of another example of calculation and communication overlap provided by an embodiment of the application.
  • 401 represents a data stream 3 that implements reverse calculation by layer-by-layer operation in reverse order
  • 301 represents a data stream 1 that implements gradient data synchronization through reverse-order layer-by-layer operation
  • 302 represents a data stream 2 that implements parameter update through reverse-order layer-by-layer operation.
  • Data stream 1, data stream 2, and data stream 3 are parallel. Each rectangular box in 401 represents the operation of the first working node calculating the local gradient information of a network layer (corresponding to reverse calculation); for example, the nth network layer represents the operation of the first working node calculating the local gradient information of the nth network layer. Each rectangular box in 301 represents the operation of transmitting the local gradient information of a network layer between the first working node and other working nodes; for example, the nth network layer represents the operation of the first working node transmitting the local gradient information of the nth network layer with other working nodes. Each rectangular box in 302 represents the operation of the first working node updating the parameters of a network layer; for example, the nth network layer represents the operation of the first working node updating the parameters of the nth network layer.
  • n is an integer greater than 1.
  • The first working node calculates, in sequence, the local gradient information of the nth network layer, the local gradient information of the (n-1)th network layer, ..., the local gradient information of the first network layer; the first working node and other working nodes transmit, in sequence, the local gradient information of the nth network layer, the local gradient information of the (n-1)th network layer, ..., the local gradient information of the first network layer; the first working node updates, in sequence, the parameters of the nth network layer, the parameters of the (n-1)th network layer, ..., the parameters of the first network layer. While transmitting the local gradient information of the (n-i)th network layer with other working nodes, the first working node updates the parameters of the (n-i+1)th network layer and calculates the local gradient information of the (n-i-1)th network layer in parallel, where i is an integer smaller than (n-1).
  • In the embodiments of this application, the first working node calculates the local gradient information of the third network layer in the neural network model during the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node; overlapping the process of calculating the local gradient information of a network layer in the neural network model with the process of transmitting local gradient information can improve the efficiency of model training.
  • the foregoing embodiment describes a scheme in which calculation and communication overlap.
  • The essence of the above calculation-communication overlap scheme is to hide the communication time behind the parameter update time and/or the reverse calculation time; however, when the calculation time of the neural network model is less than the communication time, the communication overhead cannot be fully hidden. Therefore, it is necessary to study communication reduction schemes to further compress communication overhead.
  • the embodiment of the present application introduces an inner-layer iteration strategy.
  • Each inner-layer iteration performs a complete forward calculation (Forward) and reverse calculation (Backward) and accumulates local gradient information, but performs neither gradient data synchronization nor parameter update; that is, the gradient data of the working nodes is not synchronized and the parameters of the neural network model are not updated.
  • Multiple inner layer iterations correspond to one global communication, in which the local gradient information is communicated and the parameter values are updated in the last inner layer iteration.
  • the global communication operation may overlap with the reverse calculation of the last inner iteration.
  • the inner iteration strategy is essentially to increase the batch size of each iteration, which is equivalent to reducing the total communication volume in the overall training process.
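  • An illustrative sketch of the inner-layer iteration strategy (assuming PyTorch with torch.distributed initialized; K, the function name, and the plain-SGD update are illustrative). It exploits the fact that loss.backward() accumulates into p.grad, which here plays the role of the intermediate fusion gradient information:

```python
import torch
import torch.distributed as dist

K = 4  # number of inner-layer iterations per global communication; illustrative

def outer_step(model, loss_fn, batches, lr=0.01):
    model.zero_grad()
    for inputs, targets in batches:  # len(batches) == K
        # Each inner iteration: one complete forward and backward pass;
        # gradients accumulate locally, no communication, no parameter update.
        loss = loss_fn(model(inputs), targets)
        loss.backward()
    # Only once per K inner iterations: global communication of the
    # accumulated (target fusion) gradient information, then one update.
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad
```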
  • FIG. 5 is a flowchart of an inner layer iteration method provided by an embodiment of the application. As shown in Figure 5, the inner iteration method includes:
  • 501: The first working node inputs training samples into the neural network model for forward calculation to obtain a processing result.
  • 502: The first working node uses the foregoing processing result and the foregoing neural network model to perform reverse calculation to obtain local gradient information of at least one network layer of the neural network model.
  • Steps 501 and 502 can be understood as an implementation manner in which the above-mentioned first working node performs one inner-layer iteration on the above-mentioned neural network model to obtain the local gradient information of at least one network layer of the above-mentioned neural network model.
  • step 502 can be replaced by: the first working node uses the above processing result and the above neural network model to perform reverse calculations to obtain local gradient information of each network layer of the neural network model.
  • the first working node implements reverse calculation by layer-by-layer operation in reverse order, and obtains the local gradient information of each network layer of the neural network model.
  • 503: The first working node obtains target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration (that is, this inner-layer iteration).
  • the aforementioned intermediate fusion gradient information may be intermediate fusion gradient information corresponding to the aforementioned at least one inner iteration obtained by the first working node performing at least one inner layer iteration on the aforementioned neural network model.
  • The aforementioned intermediate fusion gradient information may be the local gradient information of each network layer of the neural network model obtained by the first working node performing one inner-layer iteration; it may also be obtained by fusing at least two sets of local gradient information obtained by the first working node performing at least two inner-layer iterations.
  • When the first working node executes step 503 for the first time, the intermediate fusion gradient information does not yet exist, and the implementation of step 503 may be to store the local gradient information of at least one network layer of the neural network model obtained in step 502 as the intermediate fusion gradient information; when the first working node executes step 503 for the second time, the implementation of step 503 may be to obtain new intermediate fusion gradient information based on the current intermediate fusion gradient information and the local gradient information corresponding to this inner-layer iteration (that is, from the second execution of step 502), which corresponds to updating the intermediate fusion gradient; and so on, until the first working node executes step 503 for the Kth time and obtains the target fusion gradient information of at least one network layer of the neural network model, where K is an integer greater than 1. It can be understood that the first execution of step 503 yields the initial intermediate fusion gradient (corresponding to the gradient information obtained from the first execution of step 502), and each subsequent execution of step 503 uses the current intermediate fusion gradient information and the local gradient information corresponding to the current (inner-layer) iteration to obtain new intermediate fusion gradient information.
  • The first working node performs one inner-layer iteration to obtain one set of local gradient parameters, and each set of local gradient parameters includes the local gradient information of each network layer of the neural network model. The first working node fusing the at least two sets of local gradient information obtained in at least two inner-layer iterations may be: separately fusing the local gradient information of each network layer included in the at least two sets of local gradient information, to obtain the intermediate fusion gradient of each network layer.
  • the first working node merges the local gradient information of the first network layer included in the at least two sets of local gradient information to obtain the intermediate fusion gradient of the first network layer.
  • The first working node fusing the local gradient information of the first network layer included in the at least two sets of local gradient information may be successively fusing the corresponding parameters of the first network layer included in each set of local gradient information.
  • For example, the value of a certain parameter of the first network layer included in the first set of local gradient information is a, the value of that parameter included in the second set of local gradient information is b, and the value of that parameter included in the third set of local gradient information is c; taking this parameter as an example, the first working node fusing the local gradient information of the first network layer included in the three sets of local gradient information can be: first calculate (a+b), then calculate ((a+b)+c).
  • the corresponding value of this parameter in the intermediate fusion gradient information of the first network layer is ((a+b)+c).
  • the implementation of step 503 may be: the first working node accumulates the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target of at least one network layer of the neural network model. Fusion of gradient information.
  • In some embodiments, the gradients in the intermediate fusion gradient information and the gradients in the local gradient information obtained in the current iteration correspond one to one; the first working node accumulating the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model may be: accumulating the one-to-one corresponding parameters in the intermediate fusion gradient information and the local gradient information obtained in the current iteration.
  • For example, if the value of a certain parameter in the intermediate fusion gradient information is d, and the corresponding value of this parameter in the local gradient information obtained in the current iteration is e, then d and e are accumulated to obtain (d+e).
  • the target fusion gradient information of any network layer of the aforementioned neural network model may be obtained by fusion of multiple sets of local gradient information of any network layer obtained by multiple inner layer iterations of the first working node.
  • 504: The first working node judges whether the inner iteration threshold has been reached; if not, the first working node performs the next inner-layer iteration (step 501); if so, step 505 is performed.
  • the foregoing inner iteration threshold may be 3, 5, 10, 20, etc., which is not limited in this application. In practical applications, the first working node can set the inner iteration threshold according to actual needs. The larger the inner iteration threshold, the fewer the number of times the first working node performs global communication.
  • 505: The first working node performs a global communication operation to obtain global gradient information.
  • the above-mentioned global gradient information may be gradient information obtained by fusion of local gradient information calculated by all working nodes.
  • the above-mentioned global gradient information may be gradient information obtained by accumulating corresponding gradients in the local gradient information calculated by all working nodes.
  • For example, the local gradient information calculated by each working node corresponds to a vector; the vector corresponding to the global gradient information obtained by fusing the local gradient information calculated by all working nodes may be obtained by accumulating the elements at the same positions in the vectors corresponding to the local gradient information calculated by the working nodes.
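  • In symbols (an illustrative restatement, writing W for the number of working nodes and g_w for the vector corresponding to the local gradient information calculated by working node w), the global gradient information is the element-wise sum:

```latex
G = \sum_{w=1}^{W} g_w
```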
  • Similarly to the first working node, each working node in the distributed training system obtains the global gradient information.
  • 506: The first working node uses the global gradient information to update the neural network model.
  • each working node in the distributed training system uses global gradient information to update the neural network model, so that each working node will get the same updated neural network model.
  • Steps 501 to 506 describe the process by which the first working node implements one parameter update operation.
  • the first working node may execute the method flow in FIG. 5 multiple times to obtain a convergent neural network model.
  • Optionally, the first working node may also perform the following operation: in the process of obtaining the target fusion gradient information of the third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration, the first working node performs the transmission of the target fusion gradient information of the fourth network layer of the neural network model with the at least one second working node in parallel. Optionally, the network depth of the fourth network layer is greater than the network depth of the third network layer.
  • The first working node can perform the reverse calculation of the last inner-layer iteration layer by layer in reverse order; in this way, the first working node successively obtains the target fusion gradient information from the highest network layer (with the largest network depth) to the lowest network layer (with the smallest network depth) of the neural network model. It should be understood that, in the process of calculating the target fusion gradient information of a certain network layer, the first working node may transmit the already calculated target fusion gradient information of some network layers to other working nodes. In other words, the global communication operation can overlap with the reverse calculation of the last inner-layer iteration.
  • the process of calculating the target fusion gradient information of the network layer in the neural network model and the process of transmitting the target fusion gradient information of the network layer are overlapped (that is, calculation and communication overlap), which can improve the efficiency of model training.
  • The first working node and at least one second working node transmit the target fusion gradient information of the network layer, which can reduce the number of gradient transmissions and the total communication volume.
  • The embodiment of the present application also provides a communication fusion strategy, that is, the gradients of several network layers are merged into one larger array before a global communication is initiated.
  • the communication fusion strategy can be applied to the foregoing embodiments to improve communication efficiency.
  • The number of gradient parameters of a single network layer is quite small, usually a small constant multiple of the number of feature maps, and the corresponding communication volume is on the order of kilobytes or even bytes.
  • the small amount of transmitted data cannot make full use of network bandwidth.
  • If the fusion scale (the granularity of gradient fusion) is too small, the communication efficiency will not be high; if the fusion scale is too large, it will delay the start of communication operations. Therefore, when implementing the communication fusion strategy, the fusion size can be made configurable, for example, by using a dry run to find the most suitable fusion scale for each neural network model and platform (such as a distributed training system).
  • Ordinarily, multiple discretely stored small arrays must be merged into one large contiguously stored array before communication and disassembled and written back after communication; this introduces two memory copies and incurs additional overhead.
  • To avoid this, the first working node may perform the following operations: the first working node stores the calculated local gradient information of the first network layer into a pre-allocated target storage space based on the offset corresponding to the first network layer, where the target storage space is used to store the local gradient information of multiple network layers of the neural network model; the local gradient information of the first network layer sent by the first working node is obtained from the target storage space based on the offset corresponding to the first network layer, and/or, the first working node updates the local gradient information of the first network layer stored in the target storage space based on the received local gradient information of the first network layer from the at least one second working node.
  • That is, the first working node opens up a unified contiguous memory space (corresponding to the target storage space) for all parameter gradients (corresponding to the gradient information) of the neural network model in advance, and then uses the memory manager to point the parameter gradient of each network layer to the corresponding offset, thereby avoiding additional memory copies during communication.
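  • A minimal sketch of this pre-allocated contiguous buffer (assuming PyTorch; the function name and the simple sequential offset assignment stand in for the memory manager described above):

```python
import torch

def build_flat_grads(model):
    # One contiguous storage space for all parameter gradients of the model.
    params = list(model.parameters())
    numels = [p.numel() for p in params]
    flat = torch.zeros(sum(numels), dtype=params[0].dtype, device=params[0].device)
    offset = 0
    for p, n in zip(params, numels):
        # Point each layer's gradient at its offset inside the flat buffer,
        # so communication can read/write it without extra memory copies.
        p.grad = flat.narrow(0, offset, n).view_as(p)
        offset += n
    return flat
```

  • With this layout, a single call such as dist.all_reduce(flat), or an all-reduce over a slice of flat, synchronizes the gradients of many network layers at once, with no merge/disassemble copies before or after communication.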
  • In some embodiments, the first working node may perform the following operations: the first working node stores the calculated local gradient information of the multiple network layers of the neural network model into a pre-allocated target storage space, where the memory manager determines the offset corresponding to each of the multiple network layers and the target storage space is a continuous storage space; the first working node obtains, from the target storage space, the local gradient information of at least two of the multiple network layers based on the offsets corresponding to the network layers, where the at least two network layers include the first network layer; in this case, the local gradient information transmission performed with the at least one second working node can be replaced with: performing the local gradient information transmission of the at least two network layers in the neural network model with the at least one second working node.
• FIG. 6 is a schematic diagram of an example of a communication fusion strategy provided by an embodiment of the application.
• 601 represents the network layers of the neural network model, where L1 represents the first network layer and Ln represents the nth network layer;
• 602 represents the local gradient information of each network layer, where gradient m, gradient (m-1), ..., gradient 1 each represent one gradient or the gradient of one network layer;
• 603 represents the local gradient information of the network layers after merging, where gradient group k, gradient group (k-1), ..., gradient group 1 each include at least two gradients or the gradients of at least two network layers.
• the network layers and the gradients in the neural network model are not necessarily in one-to-one correspondence; one network layer may correspond to more than one gradient.
• without fusion, each rectangular box of 602 (for example, gradient m) represents the gradient of one network layer or one parameter vector, so transmitting the gradients to the other working nodes one by one requires m transmissions;
• with fusion, each rectangular box of 603 (for example, gradient group k) represents one gradient group, so transmitting the gradient groups to the other working nodes requires only k transmissions, with k smaller than m. It should be understood that the first working node can merge the local gradient information of several network layers into one larger array and then initiate a single global communication, which reduces the number of global communications.
• the foregoing embodiments describe the method flow of training the neural network model; the following introduces an example of applying the trained neural network model to a prediction task.
  • FIG. 7 is a flowchart of an image prediction method provided by an embodiment of the application. As shown in Figure 7, the method includes:
• 701: the image processing device acquires an image to be processed;
• 702: the image processing device uses the neural network model obtained by training to perform prediction processing on the image to be processed to obtain a prediction result.
  • the aforementioned image processing device may be the aforementioned first working node, or another working node, or a device that does not participate in neural network model training, such as a terminal device or a server.
• in the case where the image processing apparatus is a server, acquiring the image to be processed may be the server receiving the image to be processed from a terminal device, acquiring the image to be processed uploaded by a user, or acquiring the image to be processed from other devices according to an instruction input by the user.
  • the foregoing neural network model may be obtained by training using the method in the foregoing embodiment. It should be understood that Fig. 7 is an example of applying a neural network model.
  • the neural network model trained by the training method in the foregoing embodiment can handle different prediction tasks, such as text recognition, image recognition, and image classification.
• in the case where the image processing device is a server, after performing step 702 the image processing device may also send the prediction result to a terminal device, such as a mobile phone or a personal computer.
  • the image processing apparatus is a terminal device, and after performing step 702, the image processing apparatus may also output a prediction result, for example, display the prediction result on a display screen.
• the neural network model obtained by training is used to perform prediction processing on the image to be processed to obtain the prediction result, so different image prediction tasks can be realized efficiently; a toy version of this flow follows.
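Purely as an illustration (the patent does not tie the prediction flow to any particular framework), the flow can be as small as: obtain the image, run the trained model, return the result. trained_model below is a placeholder callable, not the patent's model:

```python
# Toy prediction flow; `trained_model` is a placeholder for the network
# produced by the training method (any callable mapping image -> class scores).
import numpy as np

def trained_model(image: np.ndarray) -> np.ndarray:
    return np.array([0.1, 0.7, 0.2])               # dummy class scores

def predict(image: np.ndarray) -> int:
    scores = trained_model(image)                   # forward pass
    return int(np.argmax(scores))                   # predicted class index

image = np.zeros((224, 224, 3), dtype=np.float32)   # the image to be processed
print(predict(image))                               # -> 1
```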
  • FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of the application.
  • the data processing device in FIG. 8 may be the first working node in the foregoing embodiment.
  • the data processing device may include:
  • the processing module 801 is configured to obtain local gradient information of at least one network layer of the neural network model based on the current iteration of the neural network model;
  • the transceiver module 802 is configured to transmit the local gradient information of the first network layer in the neural network model with at least one second working node;
• the processing module 801 is further configured to update the parameters of the second network layer in the neural network model in parallel while the transceiver module 802 transmits the local gradient information of the first network layer in the neural network model with the at least one second working node.
  • the processing module 801 may be a processor such as a CPU, GPU, or NPU, and the transceiver module 802 may be a transceiver with specific data transceiver functions.
• the processing module 801 is further configured to determine the dependency relationships among the multiple operations of the current iteration based on the connection relationships of the multiple network layers of the above neural network model, where the multiple operations include at least the transmission operation and the parameter update operation of the local gradient information of at least one network layer in the neural network model, and to perform the multiple operations based on the dependencies among them; a toy scheduler is sketched below.
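One way to picture that dependency bookkeeping is the invented mini-scheduler below (names and structure are assumptions for illustration): comm_L stands for the transmission of layer L's local gradient information and update_L for layer L's parameter update, and an operation runs only once everything it depends on has finished.

```python
# Illustrative dependency-driven execution of transmission/update operations.
deps = {
    "comm_2":   [],           # gradient transmissions have no prerequisites here
    "comm_1":   [],
    "update_2": ["comm_2"],   # update layer 2 only after its gradient is reduced
    "update_1": ["comm_1"],
}

done = set()
pending = dict(deps)
while pending:
    # Ops whose dependencies are all satisfied; ops in the same wave are
    # mutually independent and could run in parallel (e.g. comm_1 + update_2).
    ready = [op for op, d in pending.items() if all(x in done for x in d)]
    for op in ready:
        print("run", op)
        done.add(op)
        del pending[op]
```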
  • the first working node updates the parameters of multiple network layers in the neural network model layer by layer in a reverse order; and/or, the network depth of the second network layer is greater than that of the first network layer Network depth.
• the processing module 801 is specifically configured to, in the process of transmitting the local gradient information of the first network layer in the neural network model between the transceiver module and the at least one second working node, update the parameters of the second network layer in parallel upon determining that the operations on which the parameter update operation of the second network layer depends have been completed, where the operations on which the parameter update operation depends include the transmission of the local gradient information of the second network layer to the at least one second working node.
• the processing module 801 is further configured to calculate the local gradient information of the third network layer in the neural network model during the transmission of the local gradient information of the first network layer in the neural network model between the transceiver module and the at least one second working node.
  • the processing module 801 is further configured to perform at least one inner layer iteration on the above-mentioned neural network model to obtain intermediate fusion gradient information corresponding to the above-mentioned at least one inner layer iteration;
• the processing module 801 is specifically configured to obtain the target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration; the local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.
  • the processing module 801 is specifically configured to accumulate the aforementioned intermediate fusion gradient information and the aforementioned local gradient information obtained in the current iteration to obtain target fusion gradient information of at least one network layer of the aforementioned neural network model.
• the transceiver module 802 is further configured to perform target fusion gradient information transmission of the fourth network layer of the neural network model with the at least one second working node while the processing module 801 obtains the target fusion gradient information of the third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration; a minimal accumulation sketch follows.
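A minimal numpy sketch of this accumulation, under stated assumptions (local_gradients stands in for backpropagation, all_reduce for the collective): the first T - 1 inner iterations are summed locally into the intermediate fusion gradient, the current iteration's local gradient is added to form the target fusion gradient, and only that result is transmitted.

```python
# Inner-loop accumulation: one transmission per T iterations instead of T.
import numpy as np

def local_gradients(step: int) -> np.ndarray:
    return np.full(8, 0.1 * (step + 1), dtype=np.float32)  # fake backprop

def all_reduce(g: np.ndarray) -> np.ndarray:
    return g                                               # stand-in collective

T = 4
intermediate = np.zeros(8, dtype=np.float32)
for step in range(T - 1):
    intermediate += local_gradients(step)    # accumulate, no communication yet

target = intermediate + local_gradients(T - 1)  # add the current iteration
target = all_reduce(target)                     # single fused transmission
```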
• the processing module 801 is also used to amplify each value in the local gradient information of the first network layer by M times and convert the amplified values to half precision, where M is a real number greater than 1.
• the processing module 801 is also used to convert each value included in the obtained local gradient information of the second network layer into single precision and reduce each converted value by M times to obtain processed gradient information, where M is a real number greater than 1;
• the processing module 801 is specifically configured to use the processed gradient information to update the parameters of the second network layer in the neural network model.
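The scaling round trip itself is mechanical; a numpy sketch, with M = 1024 chosen purely for illustration. Without the amplification, values around 1e-6 would land in float16's subnormal range and lose most of their precision.

```python
# Half-precision transmission with gradient scaling.
import numpy as np

M = 1024.0                                     # scale factor, a real number > 1
grad = np.array([1e-6, 3e-5, -2e-6], dtype=np.float32)

wire = (grad * M).astype(np.float16)           # sender: amplify, cast to half
received = wire.astype(np.float32) / M         # receiver: widen, rescale back

print(np.allclose(grad, received, rtol=1e-3))  # True: small values survived
```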
• the processing module 801 is further configured to store the calculated local gradient information of the first network layer in a pre-allocated target storage space based on the offset corresponding to the first network layer, where the target storage space is used to store the local gradient information of multiple network layers of the neural network model;
• the local gradient information of the first network layer sent by the transceiver module 802 is obtained from the target storage space based on the offset corresponding to the first network layer, and/or the processing module 801 is further configured to update the local gradient information of the first network layer stored in the target storage space based on the local gradient information of the first network layer received from the at least one second working node.
• the processing module 801 is also used to store the calculated local gradient information of the multiple network layers of the neural network model in a pre-allocated target storage space and to determine the offset corresponding to each of the multiple network layers, the target storage space being a continuous storage space; the first working node obtains, based on the offset corresponding to each of the multiple network layers, the local gradient information of at least two of the multiple network layers from the target storage space, where the at least two network layers include the first network layer; the transceiver module is specifically configured to perform local gradient information transmission of the at least two network layers in the neural network model with the at least one second working node.
  • FIG. 9 is a schematic structural diagram of another data processing device provided by an embodiment of this application. As shown in Figure 9, the data processing device includes:
  • the obtaining module 901 is used to obtain an image to be processed
  • the processing module 902 is configured to use the neural network model obtained by training to perform prediction processing on the image to be processed to obtain a prediction result.
  • the division of the various units of the above data processing device is only a division of logical functions, and may be fully or partially integrated into one physical entity during actual implementation, or may be physically separated.
  • the above units can be separately established processing elements, or they can be integrated into the same chip for implementation.
• they can also be stored in the storage element of the controller in the form of program code, which is invoked and executed by a certain processing element of the processor.
  • each unit can be integrated together or implemented independently.
  • the processing element here can be an integrated circuit chip with signal processing capabilities.
  • each step of the above method or each of the above units can be completed by an integrated logic circuit of hardware in the processor element or instructions in the form of software.
• the processing element may be a general-purpose processor, such as a central processing unit (CPU), or one or more integrated circuits configured to implement the above methods, for example: one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA), etc.
  • FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the present application.
• the server 1000 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 1022 (for example, one or more processors), memory 1032, one or more storage media 1030 (such as one or more mass storage devices) for storing application programs 1042 or data 1044, and one or more acceleration devices (such as a GPU or NPU) 1024.
  • the memory 1032 and the storage medium 1030 may be short-term storage or permanent storage.
  • the program stored in the storage medium 1030 may include one or more modules (not shown in the figure), and each module may include a series of command operations on the server.
  • the central processing unit 1022 may be configured to communicate with the storage medium 1030, and execute a series of instruction operations in the storage medium 1030 on the server 1000.
  • the acceleration device 1024 can perform tasks assigned by the central processing unit 1022, such as image processing tasks.
  • the server 1000 may be a data processing device provided in an embodiment of the application.
  • the server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input and output interfaces 1058, and/or one or more operating systems 1041, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the steps performed by the data processing device in the above embodiment may be based on the server structure shown in FIG. 10.
  • the acceleration device 1024 may implement the function of the processing module 801 in FIG. 8, and the wired or wireless network interface 1050 may implement the function of the transceiver module 802 in FIG. 8.
  • the acceleration device 1024 can implement the function of the processing module 902 in FIG. 9, and the wired or wireless network interface 1050 or the input/output interface 1058 can implement the function of the acquisition module in FIG. 9.
  • FIG. 11 is a schematic structural diagram of a terminal device provided by an embodiment of this application.
  • the terminal device 110 includes a processor 1101, a memory 1102, and a communication interface 1103; the processor 1101, the memory 1102, and the communication interface 1103 are connected to each other through a bus 1104.
  • the terminal device in FIG. 11 may be the data processing apparatus in the foregoing embodiment.
• the memory 1102 includes but is not limited to random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM); the memory 1102 is used to store related instructions and data.
  • the communication interface 1103 is used to receive and send data.
• the processor 1101 may include one or more CPUs and one or more GPUs; when the processor 1101 includes one CPU, the CPU may be a single-core CPU or a multi-core CPU.
• the steps performed by the data processing apparatus in the foregoing embodiment may be based on the structure of the terminal device shown in FIG. 11. Specifically, the processor 1101 may implement the function of the processing module 801 in FIG. 8 and the communication interface 1103 may implement the function of the transceiver module in FIG. 8; likewise, the processor 1101 may implement the function of the processing module 902 in FIG. 9 and the communication interface 1103 may implement the function of the acquisition module in FIG. 9.
• in the embodiments of the present application, a computer-readable storage medium is provided; the computer-readable storage medium stores a computer program which, when executed by a processor, implements the neural network model training method provided in the foregoing embodiments.
  • a computer-readable storage medium stores a computer program, and when the above-mentioned computer program is executed by a processor, the image prediction method provided in the foregoing embodiment is implemented.
  • the embodiments of the present application provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the neural network model training method provided in the foregoing embodiments.
  • the embodiments of the present application provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the image prediction method provided in the foregoing embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Training method for a neural network model, and related product. The method comprises the following steps: based on the current iteration performed on a neural network model, a first working node obtains local gradient information of at least one network layer of the neural network model; and during the process of transmitting local gradient information of a first network layer in the neural network model to at least one second working node, the first working node updates in parallel the parameters of a second network layer in the neural network model. In the present method, during the process of transmitting the local gradient information of the first network layer in the neural network model to said second working node, the first working node updates in parallel the parameters of the second network layer in the neural network model.
PCT/CN2021/095844 2020-06-03 2021-05-25 Training method for a neural network model and related product WO2021244354A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020227010791A KR20220054861A (ko) 2020-06-03 2021-05-25 Training method of neural network model and related product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010496921.7A CN111723933B (zh) 2020-06-03 2020-06-03 Training method of neural network model and related product
CN202010496921.7 2020-06-03

Publications (1)

Publication Number Publication Date
WO2021244354A1 true WO2021244354A1 (fr) 2021-12-09

Family

ID=72565896

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095844 WO2021244354A1 (fr) 2020-06-03 2021-05-25 Training method for a neural network model and related product

Country Status (4)

Country Link
KR (1) KR20220054861A (fr)
CN (1) CN111723933B (fr)
TW (1) TW202147188A (fr)
WO (1) WO2021244354A1 (fr)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723933B (zh) 2020-06-03 2024-04-16 上海商汤智能科技有限公司 Training method of neural network model and related product
CN112288083A (zh) * 2020-10-21 2021-01-29 周宇浩 Distributed neural network training method, apparatus, device and storage medium
CN115222038A (zh) * 2021-04-16 2022-10-21 华为技术有限公司 Gradient transmission method and related apparatus
CN112866041B (zh) * 2021-04-23 2022-04-19 南京蓝洋智能科技有限公司 Training method for an adaptive network system
CN113626652B (zh) * 2021-10-11 2021-12-17 北京一流科技有限公司 Data processing network system, data processing network deployment system and method therefor
CN115328579B (zh) * 2022-10-11 2023-02-24 山东海量信息技术研究院 Scheduling method and system for neural network training, and computer-readable storage medium
CN115688867A (zh) * 2022-11-15 2023-02-03 抖音视界有限公司 Method, apparatus, device and storage medium for training a neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021395A (zh) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Data parallel processing method and system for neural networks
CN109919313A (zh) * 2019-01-31 2019-06-21 华为技术有限公司 Gradient transmission method and distributed training system
CN110379416A (zh) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 Neural network language model training method, apparatus, device and storage medium
CN110378472A (zh) * 2019-07-24 2019-10-25 苏州浪潮智能科技有限公司 Data parallel training method, apparatus and device for a deep neural network model
CN111723933A (zh) 2020-06-03 2020-09-29 上海商汤智能科技有限公司 Training method of neural network model and related product

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2881862B1 (fr) * 2012-07-30 2018-09-26 Nec Corporation Distributed processing device and distributed processing system, and distributed processing method
US10152676B1 (en) * 2013-11-22 2018-12-11 Amazon Technologies, Inc. Distributed training of models using stochastic gradient descent
US10949746B2 (en) * 2016-10-27 2021-03-16 International Business Machines Corporation Efficient parallel training of a network model on multiple graphics processing units
US11288575B2 (en) * 2017-05-18 2022-03-29 Microsoft Technology Licensing, Llc Asynchronous neural network training
US11093827B2 (en) * 2017-09-20 2021-08-17 International Business Machines Corporation Variable ISA vector-based compaction in distributed training of neural networks
CN107578094A (zh) * 2017-10-25 2018-01-12 济南浪潮高新科技投资发展有限公司 Method for implementing distributed neural network training based on a parameter server and FPGA
US11630994B2 (en) * 2018-02-17 2023-04-18 Advanced Micro Devices, Inc. Optimized asynchronous training of neural networks using a distributed parameter server with eager updates
CN109600255A (zh) * 2018-12-04 2019-04-09 中山大学 Decentralized parameter server optimization algorithm
CN109871942B (zh) * 2019-02-19 2021-06-11 上海商汤智能科技有限公司 Neural network training method and apparatus, system, and storage medium
CN110600020B (zh) * 2019-09-12 2022-05-17 上海依图信息技术有限公司 Gradient transmission method and apparatus

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114792125A (zh) * 2022-04-15 2022-07-26 北京百度网讯科技有限公司 Data processing method and apparatus based on distributed training, electronic device and medium
CN116955365A (zh) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Gradient data synchronization method, model training method, system, device and medium
CN116955365B (zh) 2023-09-21 2024-02-09 浪潮电子信息产业股份有限公司 Gradient data synchronization method, model training method, system, device and medium

Also Published As

Publication number Publication date
KR20220054861A (ko) 2022-05-03
CN111723933B (zh) 2024-04-16
TW202147188A (zh) 2021-12-16
CN111723933A (zh) 2020-09-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21818540

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20227010791

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2022530257

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 522431767

Country of ref document: SA

122 Ep: pct application non-entry in european phase

Ref document number: 21818540

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 19/05/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21818540

Country of ref document: EP

Kind code of ref document: A1