WO2021244354A1 - Training method for neural network model, and related product


Info

Publication number
WO2021244354A1
Authority
WO
WIPO (PCT)
Prior art keywords
gradient information
network layer
working node
neural network
network model
Prior art date
Application number
PCT/CN2021/095844
Other languages
French (fr)
Chinese (zh)
Inventor
王迎瑞
李周洋
王元波
张行程
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Priority to KR1020227010791A (published as KR20220054861A)
Publication of WO2021244354A1


Classifications

    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/08 Learning methods
    • G06F18/25 Fusion techniques
    • G06N3/045 Combinations of networks

Definitions

  • This application relates to the field of model training, and in particular to a training method for a neural network model and related products.
  • Deep learning is bringing tremendous development and progress to many fields of society, and model training is a key link in it.
  • In the model training process, a large amount of sample data is read and a large number of mathematical operations are performed, which is very time-consuming.
  • Although the industry has continuously made breakthroughs in benchmark tests on the ImageNet data set, an efficient distributed model training scheme is still a tricky practical problem. Therefore, it is necessary to study more efficient distributed model training schemes.
  • The embodiments of this application disclose a neural network model training method and related products.
  • In a first aspect, an embodiment of the present application provides a method for training a neural network model.
  • The method includes: a first working node obtains local gradient information of at least one network layer of the neural network model based on the current iteration of the neural network model; and in the process of transmitting the local gradient information of a first network layer in the neural network model with at least one second working node, the first working node updates, in parallel, the parameters of a second network layer in the neural network model.
  • The neural network model can include several layers (Layer), and its distributed parallel training process can be divided into forward calculation (Forward Pass), backward calculation (Backward Pass), gradient data synchronization (for example, Allreduce Gradients), and parameter update.
  • The forward calculation is a forward-order layer-by-layer operation, and the reverse calculation is a reverse-order layer-by-layer operation.
  • Gradient data synchronization mainly occupies network bandwidth resources, while the other operations occupy the computing resources of the processor.
  • In the embodiment of this application, the first working node performs parameter update and gradient data synchronization in parallel so as to hide communication overhead; this fully exploits the overlappable parts of the model training process, reduces the delay caused by communication, and improves model training efficiency.
  • In the embodiment of this application, in the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node, the first working node updates the parameters of the second network layer in parallel; overlapping the process of updating the parameters of the neural network model with the process of transmitting local gradient information can improve the efficiency of model training.
  • In some embodiments, the method further includes: the first working node determines the dependency relationships among the multiple operations of the current iteration based on the connection relationships of the multiple network layers of the neural network model, where the multiple operations include at least a transmission operation of local gradient information and a parameter update operation of at least one network layer in the neural network model; the first working node then performs the multiple operations based on the dependencies among them.
  • In this way, the dependencies among the multiple operations of the current iteration can be accurately determined, and each of the operations can be executed in order based on those dependencies.
  • In some embodiments, the first working node updates the parameters of multiple network layers in the neural network model layer by layer in reverse order; and/or, the network depth of the second network layer is greater than the network depth of the first network layer.
  • In some embodiments, the first working node and the at least one second working node transmit the local gradient information of the multiple network layers in the neural network model layer by layer in reverse order; and the first working node calculates the local gradient information of the multiple network layers in the neural network model layer by layer in reverse order (corresponding to the reverse calculation being a reverse-order layer-by-layer operation).
  • In some embodiments, the first working node updating the parameters of the second network layer in the neural network model in parallel includes: when the first working node determines that the operations on which the parameter update operation of the second network layer depends have been completed, updating the parameters of the second network layer in parallel, where the operations depended on include transmitting the local gradient information of the second network layer with the at least one second working node.
  • In some embodiments, the method further includes: in the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node, the first working node calculates the local gradient information of a third network layer in the neural network model.
  • In this way, the process of calculating the local gradient information of a network layer in the neural network model and the process of transmitting local gradient information overlap (that is, communication and calculation overlap), which can improve the efficiency of model training.
  • In some embodiments, before the first working node performs the current iteration of the neural network model, the method further includes: the first working node performs at least one inner-layer iteration of the neural network model to obtain the intermediate fusion gradient information corresponding to the at least one inner-layer iteration. The first working node obtaining the local gradient information of at least one network layer of the neural network model based on the current iteration of the neural network model includes: the first working node obtains the target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration; the local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.
  • In some embodiments, the first working node performs one inner-layer iteration of the neural network model to obtain one set of local gradient information.
  • A set of local gradient information can be understood as all the local gradient information obtained when the first working node completes the forward calculation and reverse calculation of each network layer in the neural network model.
  • The target fusion gradient information of a network layer of the neural network model can be understood as the gradient information obtained by fusing multiple sets of local gradient information of that network layer obtained in multiple inner-layer iterations.
  • In this way, the first working node and the at least one second working node transmit the target fusion gradient information of a network layer, which can reduce the number of transmissions of gradient information and the total communication volume.
  • In some embodiments, the first working node obtaining the target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration includes: the first working node accumulates the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model.
  • In some embodiments, the method further includes: in the process of obtaining the target fusion gradient information of a third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration, the first working node transmits the target fusion gradient information of a fourth network layer of the neural network model with the at least one second working node.
  • Optionally, the network depth of the fourth network layer is greater than the network depth of the third network layer.
  • In this way, the process of calculating the target fusion gradient information of a network layer in the neural network model and the process of transmitting the target fusion gradient information of a network layer overlap (that is, calculation and communication overlap), which can improve the efficiency of model training.
  • In some embodiments, before the local gradient information transmission of the first network layer in the neural network model is performed with at least one second working node, the method further includes: the first working node amplifies each value in the local gradient information of the first network layer by a factor of M, and converts each amplified value into half precision; M is a real number greater than 1.
  • In some embodiments, before the first working node updates the parameters of the second network layer in the neural network model in parallel, the method further includes: the first working node converts each value included in the obtained local gradient information of the second network layer into single precision, and reduces each converted value by a factor of M to obtain processing gradient information, where M is a real number greater than 1; the first working node updating the parameters of the second network layer in the neural network model in parallel includes: the first working node uses the processing gradient information to update the parameters of the second network layer in the neural network model.
  • In some embodiments, before the local gradient information transmission of the first network layer in the neural network model is performed with at least one second working node, the method further includes: the first working node stores the calculated local gradient information of the first network layer into a pre-allocated target storage space based on the offset corresponding to the first network layer, where the target storage space is used to store the local gradient information of multiple network layers of the neural network model; the local gradient information of the first network layer sent by the first working node is obtained from the target storage space based on the offset corresponding to the first network layer, and/or, the first working node updates the local gradient information of the first network layer stored in the target storage space based on the received local gradient information of the first network layer from the at least one second working node.
  • In this way, the local gradient information of the first network layer can be quickly and accurately obtained from the target storage space based on the offset corresponding to the first network layer, and/or the local gradient information of the first network layer stored in the target storage space can be quickly and accurately updated.
  • In some embodiments, before the local gradient information transmission of the first network layer in the neural network model is performed with at least one second working node, the method further includes: the first working node stores the calculated local gradient information of the multiple network layers of the neural network model into a pre-allocated target storage space, and a memory manager determines the offset corresponding to each of the multiple network layers; the target storage space is a continuous storage space; the first working node obtains, from the target storage space, the local gradient information of at least two of the multiple network layers based on the offset corresponding to each of the multiple network layers; the at least two network layers include the first network layer; and the local gradient information transmission of the first network layer in the neural network model with at least one second working node includes: performing the local gradient information transmission of the at least two network layers in the neural network model with the at least one second working node.
  • The main principle of this implementation is to merge the local gradient information of several network layers into one larger array and then initiate one global communication; this can improve global communication efficiency and reduce the number of global communications.
  • In a second aspect, an embodiment of the present application provides an image prediction method.
  • The method includes: acquiring an image to be processed; and performing prediction processing on the image to be processed by using a neural network model obtained through training, to obtain a prediction result.
  • An embodiment of the present application provides a data processing device, including: a processing module configured to obtain the local gradient information of at least one network layer of the neural network model based on the current iteration of the neural network model; and a transceiver module configured to transmit the local gradient information of the first network layer in the neural network model with at least one second working node; the processing module is further configured to, in the process in which the transceiver module transmits the local gradient information of the first network layer in the neural network model with at least one second working node, update the parameters of the second network layer in the neural network model in parallel.
  • An embodiment of the present application provides a data processing device, including: an acquisition module configured to acquire an image to be processed; and a processing module configured to perform prediction processing on the image to be processed by using a neural network model obtained through training, to obtain a prediction result.
  • An embodiment of the present application provides an electronic device that includes a processor and a memory, where the memory is used to store instructions and the processor is used to execute the instructions stored in the memory, so that the processor executes the method described in the first aspect and any possible implementation manner thereof.
  • An embodiment of the present application provides an electronic device that includes a processor and a memory, where the memory is used to store instructions and the processor is used to execute the instructions stored in the memory, so that the processor executes the method described in the second aspect and any possible implementation manner thereof.
  • An embodiment of the present application provides a chip that includes a data interface and a processor, where the processor is configured to execute the method of the first aspect or of any possible implementation manner of the first aspect.
  • An embodiment of the present application provides a chip that includes a data interface and a processor, where the processor is configured to execute the method of the second aspect or of any possible implementation manner of the second aspect.
  • An embodiment of the present application provides a computer-readable storage medium that stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect and any possible implementation manner thereof.
  • An embodiment of the present application provides a computer-readable storage medium that stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to execute the method of the second aspect and any possible implementation manner thereof.
  • An embodiment of the present application provides a computer program product, the computer program product including program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect and any possible implementation manner thereof.
  • An embodiment of the present application provides a computer program product, the computer program product including program instructions that, when executed by a processor, cause the processor to execute the method of the second aspect and any possible implementation manner thereof.
  • FIG. 1 is an example of a distributed training flowchart provided by an embodiment of the application.
  • FIG. 2 is a flowchart of a neural network model training method provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of an example of calculation and communication overlap provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of another example of calculation and communication overlap provided by an embodiment of the application.
  • FIG. 5 is a flowchart of an inner-layer iteration method provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of an example of a communication fusion strategy provided by an embodiment of the application.
  • FIG. 7 is a flowchart of an image prediction method provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of the application.
  • FIG. 9 is a schematic structural diagram of another data processing device provided by an embodiment of the application.
  • FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the application.
  • FIG. 11 is a schematic structural diagram of a terminal device provided by an embodiment of the application.
  • This application provides a neural network model training method suitable for distributed model training scenarios, which can improve the efficiency of model training.
  • In some embodiments, the distributed training system includes multiple working nodes, and each working node has basically the same function.
  • Each working node obtains a trained neural network model through multiple rounds of iterative training of the neural network model.
  • In each iteration, each working node uses its own training samples to train the neural network model to obtain its own local gradient information; then data synchronization is performed among the multiple working nodes, so that each of the multiple working nodes obtains the local gradient information of all working nodes and fuses the obtained local gradient information of all working nodes to obtain the global gradient information; or, each of the multiple working nodes fuses the local gradient information of all other working nodes to obtain fused gradient information, and then fuses its own local gradient information with the fused gradient information to obtain the global gradient information.
  • In the data synchronization process, each working node sends to other working nodes the local gradient information it calculated and/or the local gradient information it received from at least one other working node, or sends fused gradient information obtained by fusing its own calculated local gradient information with the local gradient information received from at least one other working node, for example, to the working node on its left or right, until each working node obtains the local gradient information calculated by all working nodes, the fused gradient information, or the global gradient information; then each working node uses the global gradient information, obtained by fusing the local gradient information calculated by all working nodes, to update the parameters of the neural network model.
  • Such iterations are performed multiple times, and each working node repeats the foregoing operations in each iteration until a training stop condition is reached, for example, the neural network model converges or the number of training iterations reaches a preset number.
  • In the distributed training system, each working node uses the same neural network model, each working node updates the parameters of the neural network model synchronously, and different working nodes use different training samples to train the neural network model.
  • That is, the neural network model adopted by each working node is always the same.
  • In some embodiments, the multiple working nodes may be multiple processors on the same terminal device or server.
  • For example, 8 GPUs on one server serve as 8 working nodes, that is, one GPU corresponds to one working node.
  • In some embodiments, one working node or at least two working nodes correspond to one hardware entity, such as a terminal device or a server.
  • For example, 8 laptops serve as 8 working nodes, that is, one laptop serves as one working node.
  • For another example, 256 GPUs on 32 servers serve as 256 working nodes.
  • In some embodiments, the multiple working nodes included in the distributed training system are multiple virtual machines running on one or more devices (for example, servers).
  • In the embodiments of this application, the process in which a working node updates the parameters of the neural network model and the gradient data synchronization process of the working node are executed in parallel, which can improve training efficiency.
  • FIG. 1 is an example of a distributed training flowchart provided by an embodiment of the application.
  • In FIG. 1, GPU 0, GPU 1, GPU 2, and GPU 3 are each a working node in the distributed training system.
  • The neural network model includes several layers, and the parallel training process of GPU 0, GPU 1, GPU 2, and GPU 3 may include: forward calculation (Forward Pass) of each layer, back propagation (Backward Pass), gradient data synchronization (such as gradient reduction communication), and parameter update.
  • Through forward calculation, the gradient of the last layer of the neural network model can be obtained; in back propagation, the gradient of the last layer is propagated backward in reverse order, and the gradients of each layer of the neural network model are calculated in turn.
  • Through gradient data synchronization, gradient data can be synchronized among multiple working nodes.
  • The purpose of gradient data synchronization is to enable each working node to obtain the global gradient information obtained by fusing the local gradient information calculated by all working nodes; this application does not limit the way this goal is achieved.
  • In parameter update, each working node uses the global gradient information obtained through gradient data synchronization to update the network parameters (such as weights) of the neural network model.
  • In each iteration, different working nodes input different training samples into the neural network model to perform forward calculation and reverse calculation (i.e., back propagation) to obtain their respective local gradient information.
  • After each working node completes one global gradient data synchronization, it can obtain the global gradient information obtained by fusing the local gradient information calculated by all working nodes, or the local gradient information calculated by all working nodes; each working node then uses the global gradient information obtained by fusing the local gradient information calculated by all working nodes to update the parameters of its own neural network model.
  • In each iteration, each working node can use the same method to update the parameters of the neural network model.
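  • For illustration only, one such sequential iteration can be sketched as follows (a PyTorch-style sketch; model, criterion, optimizer, batch, labels, and world_size are illustrative names, not from this application):

        import torch
        import torch.distributed as dist

        def train_iteration(model, criterion, optimizer, batch, labels, world_size):
            """One sequential training iteration: forward calculation, reverse
            calculation, gradient data synchronization, then parameter update."""
            loss = criterion(model(batch), labels)  # forward calculation
            loss.backward()                         # reverse calculation (reverse order, layer by layer)
            for p in model.parameters():            # gradient data synchronization
                dist.all_reduce(p.grad)             # sum local gradient information across working nodes
                p.grad /= world_size                # average to obtain global gradient information
            optimizer.step()                        # parameter update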
  • Gradient data synchronization mainly occupies network bandwidth resources, while the other operations occupy GPU computing resources.
  • To improve training efficiency, the embodiments of the present application provide a neural network model training method that makes gradient data synchronization and parameter update overlap (i.e., run in parallel). The following describes the training method of the neural network model provided by the embodiments of the present application with reference to the accompanying drawings.
  • FIG. 2 is a flowchart of a method for training a neural network model provided by an embodiment of the application. As shown in FIG. 2, the method includes:
  • 201: The first working node obtains local gradient information of at least one network layer of the neural network model based on the current iteration of the neural network model.
  • The first working node can be a terminal device such as a notebook computer, desktop computer, tablet computer, or mobile phone; it can also be a server; it can also be a virtual machine running on a server or terminal device; or it can be a processor on a terminal device or server, such as a graphics processing unit (GPU), central processing unit (CPU), or neural-network processing unit (NPU).
  • For example, each GPU can obtain the local gradient information of each network layer through reverse calculation.
  • The reverse calculation is a reverse-order layer-by-layer operation, and the first working node can calculate the local gradient information of each network layer in the neural network model layer by layer in reverse order; see FIG. 1.
  • In some embodiments, before performing the local gradient information transmission of the first network layer in the neural network model with at least one second working node (step 202), the first working node may also perform the following operations: the first working node amplifies each value in the local gradient information of the first network layer by a factor of M, and converts each amplified value into half precision; M is a real number greater than 1.
  • That is, before transmitting the local gradient information of the first network layer in the neural network model with at least one second working node, the first working node first converts the local gradient information of the first network layer into half-precision floating point (half-precision float) data, so that the storage space it occupies is reduced by half compared with single-precision float data; then the gradient reduction communication is performed; after the reduction communication ends, the half-precision gradient obtained from the reduction communication is first converted back to single precision, and then the parameters are updated. In this way, communication overhead can be reduced by half.
  • Since half precision has a smaller representable range than single precision, the first working node first amplifies the local gradient information before communication and then shrinks it back after the communication ends, so as to reduce the accuracy loss in the transmission of local gradient information.
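  • A minimal sketch of this amplify-cast-communicate-restore sequence (PyTorch-style; the value of M and the function name are assumptions for illustration):

        import torch
        import torch.distributed as dist

        M = 1024.0  # amplification factor, a real number greater than 1 (value assumed)

        def sync_gradient_half_precision(grad_fp32):
            """Amplify each gradient value by M and cast to half precision before
            the reduction communication (halving the traffic); afterwards convert
            back to single precision and shrink by M to obtain the processing
            gradient information used for the parameter update."""
            buf = (grad_fp32 * M).half()  # amplify, then convert to half precision
            dist.all_reduce(buf)          # gradient reduction communication in fp16
            return buf.float() / M        # back to single precision, undo the scaling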
  • 202: In the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node, the first working node updates the parameters of the second network layer in the neural network model in parallel.
  • The first network layer and the second network layer are different layers.
  • In some embodiments, each of the at least one second working node performs operations similar to those of the first working node.
  • In some embodiments, the first working node updates the parameters of multiple network layers in the neural network model layer by layer in reverse order; and/or, the network depth of the second network layer is greater than the network depth of the first network layer.
  • That is, the first working node implements gradient data synchronization through reverse-order layer-by-layer operation, and the parameter update is likewise a reverse-order layer-by-layer operation.
  • For example, the neural network model includes N layers, and the first working node and at least one second working node successively transmit the local gradient information from the Nth network layer down to the first network layer (corresponding to implementing gradient data synchronization through reverse-order layer-by-layer operation).
  • Here, "transmit" means both "send" and "receive".
  • That is, while the first working node sends to at least one second working node the local gradient information of the Nth network layer calculated by itself, it also receives the local gradient information of the Nth network layer from the at least one second working node. Then, the first working node successively updates the parameters from the Nth network layer down to the first network layer (corresponding to implementing parameter update through reverse-order layer-by-layer operation).
  • FIG. 3 is a schematic diagram of an example of calculation and communication overlap provided by an embodiment of the application.
  • In FIG. 3, 301 represents a data stream (stream) 1 that implements gradient data synchronization through reverse-order layer-by-layer operation, and 302 represents a data stream (stream) 2 that implements parameter update through reverse-order layer-by-layer operation; data stream 1 and data stream 2 are parallel.
  • Each rectangular box in 301 represents the operation of transmitting (or communicating, or synchronizing) the local gradient information of one network layer between the first working node and other working nodes; for example, "the nth network layer" indicates that the first working node and other working nodes transmit the local gradient information of the nth network layer.
  • Each rectangular box in 302 represents the operation of the first working node updating the parameters of one network layer; for example, "the nth network layer" represents the operation of the first working node updating the parameters of the nth network layer.
  • The arrows indicate the timeline.
  • n is an integer greater than 1.
  • As shown in FIG. 3, the first working node and other working nodes transmit, in sequence, the local gradient information of the nth network layer, the local gradient information of the (n-1)th network layer, ..., the local gradient information of the first network layer; the first working node updates, in sequence, the parameters of the nth network layer, the parameters of the (n-1)th network layer, ..., the parameters of the first network layer.
  • While the first working node and other working nodes transmit the local gradient information of the (n-i)th network layer, the parameters of the (n-i+1)th network layer are updated in parallel, where i is an integer less than n.
  • Since the first working node realizes gradient data synchronization through reverse-order layer-by-layer operation, and the parameter update is likewise a reverse-order layer-by-layer operation, during gradient data synchronization the first working node can, in parallel, use the already-obtained local gradient information of a network layer to perform part of the parameter update operations.
  • For example, because the first working node has already received the local gradient information of the nth network layer before performing the operation of receiving the local gradient information of the (n-1)th network layer, the first working node can, while receiving the local gradient information of the (n-1)th network layer, perform in parallel the operation of updating the parameters of the nth network layer.
  • In some embodiments, the first working node determines the dependency relationships among the multiple operations of the current iteration based on the connection relationships of the multiple network layers of the neural network model, where the multiple operations include at least the local gradient information transmission operation and the parameter update operation of at least one network layer in the neural network model; the first working node executes the multiple operations based on the dependencies among them. That is to say, the first working node can establish the dependency relationships among the multiple operations of the current iteration based on the sequential relationships of the network layers to which the operations belong; the specific execution timing of each operation is then driven by the dependency relationships.
  • For example, the first working node realizes gradient data synchronization through reverse-order layer-by-layer operation, and the parameter update is a reverse-order layer-by-layer operation.
  • In this case, the local gradient information transmission operation of any network layer in the neural network model depends on the completion of the transmission operations of the local gradient information of each network layer after that network layer, and the parameter update operation of any network layer in the neural network model depends on the completion of the transmission operation of the local gradient information of that network layer.
  • For example, after the first working node completes the transmission operation of the local gradient information of the nth network layer in the neural network model, it can perform the transmission operation of the local gradient information of the (n-1)th network layer and the parameter update operation of the nth network layer.
  • In some embodiments, step 202 is as follows: in the process of transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node, when the first working node determines that the operations on which the parameter update operation of the second network layer depends have been completed, it updates the parameters of the second network layer in parallel with the transmission of the local gradient information of the first network layer, where the operations on which the parameter update depends include transmitting the local gradient information of the second network layer with the at least one second working node.
  • In one implementation, each operation to be executed by the first working node is bound to an event, and the events that each operation needs to wait for are established according to the dependencies among the operations; each data stream waits, through a lightweight blocking interface (such as cudaStreamWaitEvent), for the completion of the events associated with the current operation before starting the current operation.
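  • A minimal sketch of this event-driven overlap with two CUDA streams follows (PyTorch-style; wait_event plays the role of cudaStreamWaitEvent, and the plain-SGD update and the layers list are illustrative assumptions):

        import torch
        import torch.distributed as dist

        comm_stream = torch.cuda.Stream()    # data stream 1: gradient data synchronization
        update_stream = torch.cuda.Stream()  # data stream 2: parameter update

        def sync_and_update(layers, lr):
            """Transmit gradients layer by layer in reverse order on one stream,
            while the other stream updates each layer as soon as the event bound
            to that layer's transmission operation has completed."""
            for layer in reversed(layers):              # reverse order, layer by layer
                with torch.cuda.stream(comm_stream):
                    dist.all_reduce(layer.weight.grad)  # transmit this layer's local gradient information
                    ev = torch.cuda.Event()
                    ev.record()                         # event bound to the transmission operation
                with torch.cuda.stream(update_stream):
                    update_stream.wait_event(ev)        # lightweight blocking wait on the dependency
                    # update layer k while layer k-1 is still being communicated
                    layer.weight.data.add_(layer.weight.grad, alpha=-lr)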
  • In some embodiments, before updating the parameters of the second network layer in parallel, the first working node may perform the following operations: the first working node converts each value in the obtained local gradient information of the second network layer into single precision, and reduces each converted value by a factor of M to obtain the processing gradient information, where M is a real number greater than 1.
  • The first working node updating the parameters of the second network layer in the neural network model in parallel may then be: the first working node uses the processing gradient information to update the parameters of the second network layer in the neural network model.
  • In the embodiment of this application, in the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node, the first working node updates the parameters of the second network layer in parallel; the process of updating the parameters of the neural network model and the process of transmitting local gradient information overlap (that is, parameter update and communication overlap), which can improve the efficiency of model training.
  • In some embodiments, the first working node may further overlap gradient data synchronization and reverse calculation.
  • The following introduces, with reference to the accompanying drawings, how gradient data synchronization and reverse calculation overlap.
  • In some embodiments, on the basis of executing the method flow in FIG. 1, the first working node may further perform the following operation: the first working node calculates the local gradient information of the third network layer in the neural network model.
  • Optionally, the network depth of the third network layer is smaller than the network depth of the first network layer.
  • Since the reverse calculation is a reverse-order layer-by-layer operation, and the first working node realizes gradient data synchronization through reverse-order layer-by-layer operation, the process in which the first working node performs the reverse calculation can overlap with the process of achieving gradient data synchronization, that is, the reverse calculation and gradient data synchronization are realized in parallel.
  • FIG. 4 is a schematic diagram of another example of calculation and communication overlap provided by an embodiment of the application.
  • In FIG. 4, 401 represents a data stream 3 that implements reverse calculation through reverse-order layer-by-layer operation, 301 represents a data stream 1 that implements gradient data synchronization through reverse-order layer-by-layer operation, and 302 represents a data stream 2 that implements parameter update through reverse-order layer-by-layer operation.
  • Data stream 1, data stream 2, and data stream 3 are parallel. Each rectangular box in 401 represents the operation of the first working node calculating the local gradient information of one network layer (corresponding to the reverse calculation); for example, "the nth network layer" represents the operation of the first working node calculating the local gradient information of the nth network layer. Each rectangular box in 301 represents the operation of transmitting the local gradient information of one network layer between the first working node and other working nodes; for example, "the nth network layer" represents the operation of the first working node and other working nodes transmitting the local gradient information of the nth network layer. Each rectangular box in 302 represents the operation of the first working node updating the parameters of one network layer; for example, "the nth network layer" represents the operation of the first working node updating the parameters of the nth network layer.
  • n is an integer greater than 1.
  • As shown in FIG. 4, the first working node calculates, in sequence, the local gradient information of the nth network layer, the local gradient information of the (n-1)th network layer, ..., the local gradient information of the first network layer; the first working node and other working nodes transmit, in sequence, the local gradient information of the nth network layer, the local gradient information of the (n-1)th network layer, ..., the local gradient information of the first network layer; the first working node updates, in sequence, the parameters of the nth network layer down to the first network layer. While the first working node and other working nodes transmit the local gradient information of the (n-i)th network layer, the first working node updates in parallel the parameters of the (n-i+1)th network layer and calculates the local gradient information of the (n-i-1)th network layer, where i is an integer smaller than (n-1).
  • In the embodiment of this application, in the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node, the first working node calculates the local gradient information of the third network layer in the neural network model; overlapping the process of calculating the local gradient information of network layers with the process of transmitting local gradient information can improve the efficiency of model training.
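  • One way to realize this backward/communication overlap is to launch a non-blocking reduction for each layer's gradient as soon as the reverse calculation produces it (a PyTorch-style sketch under simplifying assumptions; it glosses over stream-synchronization details a production implementation must handle and is not the prescribed implementation of this application):

        import torch
        import torch.distributed as dist

        handles = []  # outstanding communication operations of the current backward pass

        def _comm_hook(grad):
            # fires during backward as soon as this layer's gradient is produced;
            # async_op=True returns immediately, so communication of deeper layers
            # overlaps the reverse calculation of shallower layers
            handles.append(dist.all_reduce(grad, async_op=True))
            return grad

        def install_overlap_hooks(model):
            for p in model.parameters():
                p.register_hook(_comm_hook)

        def backward_with_overlap(loss):
            loss.backward()   # reverse calculation and gradient synchronization overlap
            for h in handles:
                h.wait()      # every layer's synchronization must finish before the update
            handles.clear()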
  • The foregoing embodiments describe schemes in which calculation and communication overlap.
  • The essence of the above calculation-communication overlap schemes is to hide the communication time behind the parameter update time and/or the reverse calculation time; but when the calculation time of the neural network model is less than the communication time, the communication overhead cannot be fully hidden. Therefore, it is necessary to study communication reduction schemes to further compress communication overhead.
  • To this end, the embodiment of the present application introduces an inner-layer iteration strategy.
  • Each inner-layer iteration performs a complete forward calculation (Forward) and reverse calculation (Backward) and accumulates the local gradient information, but performs neither gradient data synchronization nor parameter update; that is, the gradient data of the working nodes is not synchronized and the parameters of the neural network model are not updated.
  • Multiple inner-layer iterations correspond to one global communication: the local gradient information is communicated and the parameter values are updated only in the last inner-layer iteration.
  • Optionally, the global communication operation may overlap with the reverse calculation of the last inner-layer iteration.
  • The inner-layer iteration strategy essentially increases the batch size of each iteration, which is equivalent to reducing the total communication volume of the overall training process.
  • FIG. 5 is a flowchart of an inner-layer iteration method provided by an embodiment of the application. As shown in FIG. 5, the inner-layer iteration method includes:
  • 501: The first working node inputs training samples into the neural network model for forward calculation and obtains a processing result.
  • 502: The first working node uses the foregoing processing result and the neural network model to perform reverse calculation to obtain the local gradient information of at least one network layer of the neural network model.
  • Steps 501 and 502 can be understood as an implementation in which the first working node performs one inner-layer iteration of the neural network model to obtain the local gradient information of at least one network layer of the neural network model.
  • In some embodiments, step 502 can be replaced with: the first working node uses the foregoing processing result and the neural network model to perform reverse calculation to obtain the local gradient information of each network layer of the neural network model.
  • For example, the first working node implements the reverse calculation through reverse-order layer-by-layer operation and obtains the local gradient information of each network layer of the neural network model.
  • 503: The first working node obtains the target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration (that is, this inner-layer iteration).
  • The intermediate fusion gradient information may be the intermediate fusion gradient information corresponding to at least one inner-layer iteration, obtained by the first working node performing at least one inner-layer iteration of the neural network model.
  • For example, the intermediate fusion gradient information may be the local gradient information of each network layer of the neural network model obtained by the first working node performing one inner-layer iteration; it may also be obtained by fusing at least two sets of local gradient information obtained by the first working node performing at least two inner-layer iterations.
  • When the first working node executes step 503 for the first time, the intermediate fusion gradient information does not yet exist, and step 503 may be implemented by storing the local gradient information of at least one network layer of the neural network model obtained in step 502 as the intermediate fusion gradient information; when the first working node executes step 503 for the second time, step 503 may be implemented by obtaining new intermediate fusion gradient information based on the current intermediate fusion gradient information and the local gradient information corresponding to this inner-layer iteration (that is, the second execution of step 502) (corresponding to updating the intermediate fusion gradient); and so on, until the first working node executes step 503 for the Kth time and obtains the target fusion gradient information of at least one network layer of the neural network model.
  • K is an integer greater than 1. It can be understood that the first execution of step 503 obtains the initial intermediate fusion gradient (corresponding to the gradient information obtained from the first execution of step 502), and each subsequent execution of step 503 uses the current intermediate fusion gradient information and the local gradient information corresponding to the current iteration (that is, this inner-layer iteration) to obtain new intermediate fusion gradient information.
  • In some embodiments, the first working node performs one inner-layer iteration to obtain one set of local gradient information, and each set of local gradient information includes the local gradient information of each network layer of the neural network model.
  • Fusing the at least two sets of local gradient information obtained by the first working node in at least two inner-layer iterations may be: fusing, separately for each network layer, the local gradient information of that network layer included in the at least two sets of local gradient information, to obtain the intermediate fusion gradient of each network layer.
  • For example, the first working node fuses the local gradient information of the first network layer included in the at least two sets of local gradient information to obtain the intermediate fusion gradient of the first network layer.
  • The first working node fusing the local gradient information of the first network layer included in the at least two sets of local gradient information may be successively fusing the values of corresponding parameters of the first network layer included in each set of local gradient information.
  • For example, the value of a certain parameter's gradient of the first network layer included in the first set of local gradient information is a, the value of that parameter included in the second set of local gradient information is b, and the value of that parameter included in the third set of local gradient information is c; taking this parameter as an example, the first working node fusing the local gradient information of the first network layer included in the three sets of local gradient information may be: first calculating (a+b), then calculating ((a+b)+c).
  • The corresponding value of this parameter in the intermediate fusion gradient information of the first network layer is then ((a+b)+c).
  • In some embodiments, step 503 may be implemented as: the first working node accumulates the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model.
  • The gradients in the intermediate fusion gradient information and the gradients in the local gradient information obtained in the current iteration correspond one to one; accumulating the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model may be: accumulating the one-to-one corresponding values in the intermediate fusion gradient information and the local gradient information obtained in the current iteration.
  • For example, if the value of a certain parameter's gradient in the intermediate fusion gradient information is d, and the corresponding value of that parameter in the local gradient information obtained in the current iteration is e, then d and e are accumulated to obtain (d+e).
  • The target fusion gradient information of any network layer of the neural network model may be obtained by fusing multiple sets of local gradient information of that network layer obtained by the first working node in multiple inner-layer iterations.
  • 504: The first working node judges whether the inner-layer iteration threshold has been reached.
  • The inner-layer iteration threshold may be 3, 5, 10, 20, etc., which is not limited in this application. In practical applications, the first working node can set the inner-layer iteration threshold according to actual needs; the larger the inner-layer iteration threshold, the fewer times the first working node performs global communication.
  • 505: The first working node performs a global communication operation to obtain the global gradient information.
  • The global gradient information may be gradient information obtained by fusing the local gradient information calculated by all working nodes.
  • For example, the global gradient information may be gradient information obtained by accumulating the corresponding gradients in the local gradient information calculated by all working nodes.
  • For example, the local gradient information calculated by each working node corresponds to a vector; the vector corresponding to the global gradient information obtained by fusing the local gradient information calculated by all working nodes may be obtained by accumulating the elements at the same positions of the vectors corresponding to the local gradient information calculated by the working nodes.
  • When the first working node obtains the global gradient information, each working node in the distributed training system likewise obtains the global gradient information.
  • 506: The first working node uses the global gradient information to update the neural network model.
  • Each working node in the distributed training system uses the global gradient information to update the neural network model, so that each working node obtains the same updated neural network model.
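  • The FIG. 5 flow can be sketched as follows (PyTorch-style; the plain-SGD update and all names are illustrative assumptions):

        import torch
        import torch.distributed as dist

        def outer_step(model, criterion, inner_batches, lr, world_size):
            """One parameter update built from K inner-layer iterations: each inner
            iteration runs a full forward and reverse calculation and accumulates
            local gradients; only after the last one are the single global
            communication (step 505) and the parameter update (step 506) performed."""
            model.zero_grad()                     # clear previous fusion gradient information
            for batch, labels in inner_batches:   # K inner-layer iterations (steps 501 to 503)
                loss = criterion(model(batch), labels)
                loss.backward()                   # .grad accumulates across calls, acting as the
                                                  # intermediate fusion gradient information
            for p in model.parameters():          # step 505: one global communication
                dist.all_reduce(p.grad)           # target fusion -> global gradient information
                p.grad /= world_size
                p.data.add_(p.grad, alpha=-lr)    # step 506: parameter update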
  • Steps 501 to 506 describe the process by which the first working node implements one parameter update operation.
  • In practical applications, the first working node may execute the method flow in FIG. 5 multiple times until a convergent neural network model is obtained.
  • In some embodiments, the first working node may also perform the following operation: in the process of obtaining the target fusion gradient information of the third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration, the first working node transmits, in parallel, the target fusion gradient information of the fourth network layer of the neural network model with the at least one second working node. Optionally, the network depth of the fourth network layer is greater than the network depth of the third network layer.
  • The first working node can perform the last inner-layer iteration in reverse order, layer by layer; that is, the first working node obtains, in turn, the target fusion gradient information from the highest network layer (with the largest network depth) of the neural network model down to the lowest network layer (with the smallest network depth). It should be understood that, in the process of calculating the target fusion gradient information of a certain network layer, the first working node may transmit the already-calculated target fusion gradient information of some network layers to other working nodes. In other words, the global communication operation can overlap with the reverse calculation of the last inner-layer iteration.
  • In this way, the process of calculating the target fusion gradient information of network layers in the neural network model and the process of transmitting the target fusion gradient information of network layers overlap (that is, calculation and communication overlap), which can improve the efficiency of model training.
  • In addition, the first working node and at least one second working node transmit the target fusion gradient information of network layers, which can reduce the number of transmissions of gradient information and the total communication volume.
  • The embodiment of the present application also provides a communication fusion strategy, that is, the gradients of several network layers are merged into one larger array before a global communication is initiated.
  • The communication fusion strategy can be applied to the foregoing embodiments to improve communication efficiency.
  • For many network layers, the number of gradient parameters is quite small, usually a small constant multiple of the number of feature maps, and the communication volume is on the order of KBytes or even Bytes.
  • Such a small amount of transmitted data cannot make full use of the network bandwidth.
  • If the fusion scale of the gradient fusion is too small, the communication efficiency will not be high; if the fusion scale is too large, it will delay the start of communication operations. Therefore, when implementing the communication fusion strategy, the fusion size can be made configurable, for example, by using a dry run to determine the most suitable fusion scale for each neural network model and platform (such as a distributed training system).
  • In a naive implementation, multiple discretely stored small arrays must be merged into one large contiguously stored array before communication and split back apart after communication; this introduces two memory copies and incurs additional overhead.
  • To reduce this overhead, the first working node may perform the following operations: the first working node stores the calculated local gradient information of the first network layer into a pre-allocated target storage space based on the offset corresponding to the first network layer, where the target storage space is used to store the local gradient information of multiple network layers of the neural network model; the local gradient information of the first network layer sent by the first working node is obtained from the target storage space based on the offset corresponding to the first network layer, and/or, the first working node updates, based on the received local gradient information of the first network layer from the at least one second working node, the local gradient information of the first network layer stored in the target storage space.
  • That is, the first working node opens up in advance a unified continuous memory space (corresponding to the target storage space) for all parameter gradients (corresponding to the gradient information) of the neural network model, and then uses a memory manager to point the parameter gradient of each network layer to the corresponding offset, thereby avoiding additional memory copies during communication.
  • In some embodiments, the first working node may perform the following operations: the first working node stores the calculated local gradient information of the multiple network layers of the neural network model into the pre-allocated target storage space, and the memory manager determines the offset corresponding to each of the multiple network layers; the target storage space is a continuous storage space; the first working node obtains, from the target storage space, the local gradient information of at least two of the multiple network layers based on the offsets corresponding to the multiple network layers; the at least two network layers include the first network layer; step 201 can then be replaced with: performing the local gradient information transmission of the at least two network layers in the neural network model with the at least one second working node.
FIG. 6 is a schematic diagram of an example of the communication fusion strategy provided by an embodiment of the application. In FIG. 6, 601 represents the network layers of the neural network model, where L1 represents the first network layer and Ln represents the n-th network layer; 602 represents the local gradient information of each network layer, where gradient m, gradient (m-1), ..., gradient 1 each represent one gradient or the gradient of one network layer; 603 represents the local gradient information of the network layers after merging, where gradient group k, gradient group (k-1), ..., gradient group 1 each include at least two gradients or the gradients of at least two network layers. Note that the network layers and the gradients in the neural network model are not necessarily in one-to-one correspondence.

If each rectangular box of 602 (for example, gradient m) represents the gradient of one network layer and the first working node transmits the gradient of one network layer to the other working nodes at a time, m transmissions are needed. If each rectangular box of 602 (for example, gradient m) represents the gradient of one parameter vector and the first working node transmits one gradient group (for example, gradient group k) to the other working nodes at a time, only k transmissions are needed. It should be understood that the first working node can merge the local gradient information of several network layers into one larger array and then initiate a single global communication; this can reduce the number of global communications.
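The merging step itself can be sketched as follows (a hedged example rather than the patent's code; `bucket_cap_bytes` is a hypothetical knob corresponding to the configurable fusion size, which could be tuned by a dry run on the target model and platform):

```python
import torch
import torch.distributed as dist

def allreduce_in_buckets(grads, bucket_cap_bytes=4 * 1024 * 1024):
    """Fuses per-layer gradients into buckets of roughly bucket_cap_bytes
    and issues one all-reduce per bucket instead of one per layer."""
    def flush(bucket):
        if not bucket:
            return
        flat = torch.cat([g.reshape(-1) for g in bucket])  # pack into one array
        dist.all_reduce(flat)                              # one fused global communication
        flat /= dist.get_world_size()                      # average across workers
        offset = 0
        for g in bucket:                                   # unpack back to each layer
            n = g.numel()
            g.copy_(flat[offset:offset + n].view_as(g))
            offset += n

    bucket, size = [], 0
    for g in grads:
        bucket.append(g)
        size += g.numel() * g.element_size()
        if size >= bucket_cap_bytes:
            flush(bucket)
            bucket, size = [], 0
    flush(bucket)  # remaining tail bucket
```

Note that this naive version still pays the two memory copies (the `torch.cat` pack and the `copy_` unpack); the pre-allocated flat buffer described above removes them.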
The foregoing embodiments describe the method flow of training the neural network model. The following introduces an example of applying the neural network model obtained by training to a prediction task. FIG. 7 is a flowchart of an image prediction method provided by an embodiment of the application. As shown in FIG. 7, the method includes:
701. The image processing apparatus acquires an image to be processed.

The image processing apparatus may be the aforementioned first working node, another working node, or a device that does not participate in the neural network model training, such as a terminal device or a server. For example, when the image processing apparatus is a server, acquiring the image to be processed may mean that the server receives the image to be processed from a terminal device, acquires an image to be processed uploaded by a user, or acquires the image to be processed from another device according to an instruction input by the user.
The aforementioned neural network model may be obtained by training using the method in the foregoing embodiments. It should be understood that FIG. 7 is only one example of applying the neural network model; the neural network model trained by the training method in the foregoing embodiments can handle different prediction tasks, such as text recognition, image recognition, and image classification.

702. The image processing apparatus performs prediction processing on the image to be processed by using the neural network model obtained by training, to obtain a prediction result.

In some embodiments, the image processing apparatus is a server, and after performing step 702, the image processing apparatus may further send the prediction result to a terminal device, such as a mobile phone or a personal computer. In some embodiments, the image processing apparatus is a terminal device, and after performing step 702, the image processing apparatus may output the prediction result, for example, display the prediction result on a display screen.

In this method, the neural network model obtained by training is used to perform prediction processing on the image to be processed to obtain the prediction result, so that different image prediction tasks can be realized efficiently.
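As an illustration of step 702, a minimal sketch follows (hedged: it assumes the image has already been decoded and preprocessed into a tensor; `predict` is an illustrative helper, not part of the patent):

```python
import torch

def predict(model, image):
    """Runs the trained neural network model on one preprocessed image
    tensor of shape (C, H, W) and returns the prediction result."""
    model.eval()                            # inference mode: no dropout, fixed BN stats
    with torch.no_grad():                   # no gradients needed for prediction
        return model(image.unsqueeze(0))    # add the batch dimension
```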
FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of the application. The data processing device in FIG. 8 may be the first working node in the foregoing embodiments. The data processing device may include:

a processing module 801, configured to obtain local gradient information of at least one network layer of the neural network model based on the current iteration performed on the neural network model; and

a transceiver module 802, configured to transmit the local gradient information of the first network layer in the neural network model with at least one second working node;

where the processing module 801 is further configured to update, in parallel, the parameters of the second network layer in the neural network model during the process in which the transceiver module 802 transmits the local gradient information of the first network layer in the neural network model with the at least one second working node.

The processing module 801 may be a processor such as a CPU, a GPU, or an NPU, and the transceiver module 802 may be a transceiver with data transmitting and receiving functions.
In a possible implementation, the processing module 801 is further configured to determine the dependency relationships among the multiple operations of the current iteration based on the connection relationships of the multiple network layers of the neural network model, where the multiple operations include at least the transmission operation and the parameter update operation of the local gradient information of at least one network layer in the neural network model, and the multiple operations are performed based on the dependency relationships among them.

In a possible implementation, the first working node updates the parameters of the multiple network layers in the neural network model layer by layer in reverse order; and/or the network depth of the second network layer is greater than the network depth of the first network layer.

In a possible implementation, the processing module 801 is specifically configured to, during the process in which the transceiver module transmits the local gradient information of the first network layer in the neural network model with the at least one second working node, update the parameters of the second network layer in parallel when it is determined that the operations on which the parameter update operation of the second network layer depends have been completed, where the operations on which the parameter update operation depends include transmitting the local gradient information of the second network layer with the at least one second working node.

In a possible implementation, the processing module 801 is further configured to compute the local gradient information of the third network layer in the neural network model during the process in which the transceiver module transmits the local gradient information of the first network layer in the neural network model with the at least one second working node.
In a possible implementation, the processing module 801 is further configured to perform at least one inner-layer iteration on the neural network model to obtain intermediate fusion gradient information corresponding to the at least one inner-layer iteration; the processing module 801 is specifically configured to obtain target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration; and the local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.

In a possible implementation, the processing module 801 is specifically configured to accumulate the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model.

In a possible implementation, the transceiver module 802 is further configured to transmit the target fusion gradient information of the fourth network layer of the neural network model with the at least one second working node during the process in which the processing module 801 obtains the target fusion gradient information of the third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration.
In a possible implementation, the processing module 801 is further configured to amplify each value in the local gradient information of the first network layer by a factor of M and convert each amplified value to half precision, where M is a real number greater than 1.

In a possible implementation, the processing module 801 is further configured to convert each value included in the obtained local gradient information of the second network layer to single precision and reduce each converted value by a factor of M to obtain processing gradient information, where M is a real number greater than 1; the processing module 801 is specifically configured to update the parameters of the second network layer in the neural network model by using the processing gradient information.
In a possible implementation, the processing module 801 is further configured to store the calculated local gradient information of the first network layer into a pre-allocated target storage space based on the offset corresponding to the first network layer, where the target storage space is used to store the local gradient information of multiple network layers of the neural network model. The local gradient information of the first network layer sent by the transceiver module 802 is obtained from the target storage space based on the offset corresponding to the first network layer, and/or the processing module 801 is further configured to update the local gradient information of the first network layer stored in the target storage space based on the received local gradient information of the first network layer from the at least one second working node.

In a possible implementation, the processing module 801 is further configured to store the calculated local gradient information of the multiple network layers of the neural network model into a pre-allocated target storage space and determine the offset corresponding to each of the multiple network layers, where the target storage space is one contiguous storage space; the first working node obtains, based on the offset corresponding to each of the multiple network layers, the local gradient information of at least two of the multiple network layers from the target storage space, where the at least two network layers include the first network layer; and the transceiver module is specifically configured to transmit the local gradient information of the at least two network layers in the neural network model with the at least one second working node.
FIG. 9 is a schematic structural diagram of another data processing device provided by an embodiment of this application. As shown in FIG. 9, the data processing device includes: an obtaining module 901, configured to obtain an image to be processed; and a processing module 902, configured to perform prediction processing on the image to be processed by using the neural network model obtained by training, to obtain a prediction result.
It should be noted that the division of the above data processing device into units is only a division of logical functions; in actual implementation, the units may be fully or partially integrated into one physical entity or may be physically separated. The above units can be separately established processing elements, or they can be integrated into the same chip for implementation; they can also be stored in the storage element of the controller in the form of program code, and a certain processing element of the processor calls the program code and executes the functions of the above units. In addition, the units can be integrated together or implemented independently. The processing element here can be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method or each of the above units can be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software. The processing element may be a general-purpose processor, such as a central processing unit (CPU), or one or more integrated circuits configured to implement the above methods, for example, one or more application-specific integrated circuits (ASIC), one or more microprocessors (digital signal processors, DSP), or one or more field-programmable gate arrays (FPGA), etc.
FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the present application. The server 1000 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPU) 1022 (for example, one or more processors), memory 1032, one or more storage media 1030 (for example, one or more mass storage devices) for storing application programs 1042 or data 1044, and one or more acceleration devices (such as a GPU or an NPU) 1024. The memory 1032 and the storage medium 1030 may be short-term storage or persistent storage. The programs stored in the storage medium 1030 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. The central processing unit 1022 may be configured to communicate with the storage medium 1030 and execute, on the server 1000, the series of instruction operations in the storage medium 1030. The acceleration device 1024 can perform tasks assigned by the central processing unit 1022, such as image processing tasks. The server 1000 may be a data processing device provided in an embodiment of the application.
The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

The steps performed by the data processing device in the above embodiments may be based on the server structure shown in FIG. 10. Specifically, the acceleration device 1024 may implement the functions of the processing module 801 in FIG. 8, and the wired or wireless network interface 1050 may implement the functions of the transceiver module 802 in FIG. 8. Alternatively, the acceleration device 1024 can implement the functions of the processing module 902 in FIG. 9, and the wired or wireless network interface 1050 or the input/output interface 1058 can implement the functions of the obtaining module 901 in FIG. 9.
FIG. 11 is a schematic structural diagram of a terminal device provided by an embodiment of this application. As shown in FIG. 11, the terminal device 110 includes a processor 1101, a memory 1102, and a communication interface 1103; the processor 1101, the memory 1102, and the communication interface 1103 are connected to each other through a bus 1104. The terminal device in FIG. 11 may be the data processing apparatus in the foregoing embodiments.

The memory 1102 includes but is not limited to random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), and the memory 1102 is used to store related instructions and data. The communication interface 1103 is used to receive and send data.

The processor 1101 may include one or more CPUs and one or more GPUs. When the processor 1101 includes one CPU, the CPU may be a single-core CPU or a multi-core CPU. The steps performed by the data processing apparatus in the foregoing embodiments may be based on the structure of the terminal device shown in FIG. 11. Specifically, the processor 1101 may implement the functions of the processing module 801 in FIG. 8, and the communication interface 1103 may implement the functions of the transceiver module 802 in FIG. 8; or the processor 1101 may implement the functions of the processing module 902 in FIG. 9, and the communication interface 1103 may implement the functions of the obtaining module 901 in FIG. 9.
An embodiment of the present application provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the neural network model training method provided in the foregoing embodiments is implemented. An embodiment of the present application further provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the image prediction method provided in the foregoing embodiments is implemented.

The embodiments of the present application further provide a computer program product containing instructions which, when run on a computer, cause the computer to execute the neural network model training method provided in the foregoing embodiments, and a computer program product containing instructions which, when run on a computer, cause the computer to execute the image prediction method provided in the foregoing embodiments.

Abstract

A training method for a neural network model, and a related product. The method comprises: on the basis of a current iteration performed on a neural network model, a first working node obtains local gradient information of at least one network layer of the neural network model; and during the process of transmitting the local gradient information of a first network layer in the neural network model to at least one second working node, the first working node concurrently updates parameters of a second network layer in the neural network model. Because the parameter updating overlaps with the transmission of the local gradient information, communication overhead is hidden and model training efficiency is improved.

Description

Training method for neural network model, and related product

Cross-reference to related applications

This patent application claims priority to the Chinese patent application No. 202010496921.7, filed on June 3, 2020 and entitled "Training method for neural network model, and related product", the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the field of model training, and in particular to a training method for a neural network model and related products.

Background

Deep learning is bringing tremendous development and progress to many social fields, and model training is its key link. During model training, a large amount of sample data is read and a large number of mathematical operations are performed, which is very time-consuming. Although the industry has continuously made breakthroughs in benchmark tests on the ImageNet dataset, an efficient distributed model training scheme remains a tricky practical problem on general-purpose training platforms. Therefore, it is necessary to study more efficient distributed model training schemes.

Summary of the invention

The embodiments of the present application disclose a training method for a neural network model and related products.

In the first aspect, an embodiment of the present application provides a method for training a neural network model. The method includes: a first working node obtains local gradient information of at least one network layer of the neural network model based on a current iteration performed on the neural network model; and in the process of transmitting the local gradient information of a first network layer in the neural network model with at least one second working node, the first working node updates the parameters of a second network layer in the neural network model in parallel.
A neural network model may include several layers, and its distributed parallel training process can be divided into, for each layer, forward computation (forward pass), backward computation (backward pass), gradient data synchronization (for example, allreduce of gradients), and parameter update. In some embodiments, the forward computation operates layer by layer in forward order, and the backward computation operates layer by layer in reverse order; the gradient data synchronization mainly occupies network bandwidth resources, while the other operations occupy the computing resources of the processor. In the embodiments of the present application, the first working node performs the parameter update and the gradient data synchronization in parallel so as to hide the communication overhead, which can fully exploit the overlappable parts of the model training process, reduce the delay caused by communication, and improve model training efficiency.
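As an illustration of this overlap, the sketch below uses PyTorch's asynchronous collectives (a hedged example, not the patented mechanism; `lr` is an illustrative learning rate and each layer is assumed to expose a `weight` parameter):

```python
import torch
import torch.distributed as dist

def sync_and_update(layers, lr=0.01):
    """Launches a non-blocking all-reduce per layer in reverse order; while a
    shallower layer's gradients are still in flight, a deeper layer whose
    communication has already finished can be updated in parallel."""
    pending = []
    for layer in reversed(layers):  # reverse order, matching backpropagation
        work = dist.all_reduce(layer.weight.grad, async_op=True)
        pending.append((work, layer))
    for work, layer in pending:
        work.wait()                                   # this layer's gradients have arrived
        layer.weight.grad /= dist.get_world_size()    # global (averaged) gradient
        with torch.no_grad():
            layer.weight -= lr * layer.weight.grad    # update overlaps remaining transfers
```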
In the embodiments of the present application, the first working node updates the parameters of the second network layer in the neural network model in parallel while transmitting the local gradient information of the first network layer in the neural network model with at least one second working node; overlapping the process of updating the parameters of the neural network model with the process of transmitting the local gradient information can improve model training efficiency.
In a possible implementation, the method further includes: the first working node determines the dependency relationships among the multiple operations of the current iteration based on the connection relationships of the multiple network layers of the neural network model, where the multiple operations include at least the transmission operation and the parameter update operation of the local gradient information of at least one network layer in the neural network model; and the first working node performs the multiple operations based on the dependency relationships among the multiple operations.

In this implementation, the dependency relationships among the multiple operations of the current iteration can be determined accurately based on the connection relationships of the multiple network layers of the neural network model, and the operations are then performed one after another based on these dependency relationships.
In a possible implementation, the first working node updates the parameters of the multiple network layers in the neural network model layer by layer in reverse order; and/or the network depth of the second network layer is greater than the network depth of the first network layer. Optionally, the first working node and the at least one second working node transmit the local gradient information of the multiple network layers in the neural network model layer by layer in reverse order, and the first working node computes the local gradient information of the multiple network layers in the neural network model layer by layer in reverse order (corresponding to the backward computation operating layer by layer in reverse order).
In a possible implementation, updating, by the first working node in parallel, the parameters of the second network layer in the neural network model during the process of transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node includes: during the process of transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node, the first working node updates the parameters of the second network layer in parallel when it determines that the operations on which the parameter update operation of the second network layer depends have been completed, where the operations on which the parameter update operation depends include transmitting the local gradient information of the second network layer with the at least one second working node.

In this implementation, it can be ensured that the operation of updating the parameters of the second network layer is performed successfully.
In a possible implementation, the method further includes: the first working node computes the local gradient information of a third network layer in the neural network model during the process of transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node.

In this implementation, the process of computing the local gradient information of a network layer in the neural network model overlaps with the process of transmitting local gradient information (that is, communication and computation overlap), which can improve model training efficiency.
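The idea can be illustrated with the following hedged fragment, where `compute_prev_gradient` is a hypothetical callable standing in for the next step of backpropagation:

```python
import torch.distributed as dist

def overlap_comm_with_backward(grad_i, compute_prev_gradient):
    """Starts a non-blocking all-reduce of layer i's freshly computed local
    gradient, computes layer i-1's local gradient while the transfer is in
    flight, then waits for the communication before grad_i is used."""
    work = dist.all_reduce(grad_i, async_op=True)  # communication in flight
    grad_prev = compute_prev_gradient()            # computation overlaps communication
    work.wait()                                    # grad_i is now globally reduced
    return grad_i, grad_prev
```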
In a possible implementation, before the first working node performs the current iteration on the neural network model, the method further includes: the first working node performs at least one inner-layer iteration on the neural network model to obtain intermediate fusion gradient information corresponding to the at least one inner-layer iteration. The first working node obtaining the local gradient information of at least one network layer of the neural network model based on the current iteration performed on the neural network model includes: the first working node obtains target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration; and the local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.

The first working node performs at least one inner-layer iteration on the neural network model to obtain one set of local gradient information. One set of local gradient information can be understood as all the local gradient information obtained when the first working node completes the forward computation and backward computation of each network layer in the neural network model. The target fusion gradient information of a network layer of the neural network model can be understood as the gradient information obtained by fusing multiple sets of local gradient information of that network layer obtained in multiple inner-layer iterations.

In this implementation, the first working node transmits the target fusion gradient information of the network layer to the at least one second working node, which can reduce the number of gradient transmissions and the total communication volume.
In a possible implementation, the first working node obtaining the target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration includes: the first working node accumulates the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model.
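A minimal sketch of such inner-layer iterations (hedged: it relies on PyTorch's default behavior of accumulating gradients into `.grad` across successive backward calls; the function name and arguments are illustrative):

```python
def run_inner_iterations(model, loss_fn, batches):
    """Performs several inner-layer iterations without any communication;
    gradients accumulate (sum) in each parameter's .grad, yielding the
    fused gradient that is all-reduced once afterwards."""
    model.zero_grad()
    for x, y in batches:            # each (x, y) pair is one inner iteration
        loss = loss_fn(model(x), y)
        loss.backward()             # adds this iteration's gradient into .grad
    # each parameter's .grad now holds the target fusion gradient information,
    # ready for a single transmission instead of one per iteration
```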
In a possible implementation, the method further includes: during the process of obtaining the target fusion gradient information of the third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration, the first working node transmits the target fusion gradient information of a fourth network layer of the neural network model with the at least one second working node. Optionally, the network depth of the fourth network layer is greater than the network depth of the third network layer.

In this implementation, the process of computing the target fusion gradient information of a network layer in the neural network model overlaps with the process of transmitting the target fusion gradient information of a network layer (that is, computation and communication overlap), which can improve model training efficiency.
In a possible implementation, before transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node, the method further includes: the first working node amplifies each value in the local gradient information of the first network layer by a factor of M and converts each amplified value to half precision, where M is a real number greater than 1.

In this implementation, storing each value in the local gradient information at a lower precision can reduce the data volume of the local gradient information.
In a possible implementation, before the first working node updates the parameters of the second network layer in the neural network model in parallel, the method further includes: the first working node converts each value included in the obtained local gradient information of the second network layer to single precision and reduces each converted value by a factor of M to obtain processing gradient information, where M is a real number greater than 1. The first working node updating the parameters of the second network layer in the neural network model in parallel includes: the first working node updates the parameters of the second network layer in the neural network model by using the processing gradient information.
In a possible implementation, before transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node, the method further includes: the first working node stores the calculated local gradient information of the first network layer into a pre-allocated target storage space based on the offset corresponding to the first network layer, where the target storage space is used to store the local gradient information of multiple network layers of the neural network model; the local gradient information of the first network layer sent by the first working node is obtained from the target storage space based on the offset corresponding to the first network layer, and/or the first working node updates the local gradient information of the first network layer stored in the target storage space based on the received local gradient information of the first network layer from the at least one second working node.

In this implementation, based on the offset corresponding to the first network layer, the local gradient information of the first network layer can be obtained from the target storage space and/or the local gradient information of the first network layer stored in the target storage space can be updated quickly and accurately.
In a possible implementation, before transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node, the method further includes: the first working node stores the calculated local gradient information of the multiple network layers of the neural network model into a pre-allocated target storage space and determines, through a memory manager, the offset corresponding to each of the multiple network layers, where the target storage space is one contiguous storage space; the first working node obtains, based on the offset corresponding to each of the multiple network layers, the local gradient information of at least two of the multiple network layers from the target storage space, where the at least two network layers include the first network layer; and transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node includes: transmitting the local gradient information of the at least two network layers in the neural network model with the at least one second working node.

It should be understood that the main principle of this implementation is to merge the local gradient information of several network layers into one larger array and then initiate a single global communication, which can improve global communication efficiency and reduce the number of global communications.
In a second aspect, an embodiment of the present application provides an image prediction method. The method includes: acquiring an image to be processed; and performing prediction processing on the image to be processed by using the neural network model trained by the method of the first aspect or any possible implementation thereof, to obtain a prediction result.

In a third aspect, an embodiment of the present application provides a data processing device, including: a processing module, configured to obtain local gradient information of at least one network layer of the neural network model based on a current iteration performed on the neural network model; and a transceiver module, configured to transmit the local gradient information of the first network layer in the neural network model with at least one second working node; where the processing module is further configured to update, in parallel, the parameters of the second network layer in the neural network model during the process in which the transceiver module transmits the local gradient information of the first network layer in the neural network model with the at least one second working node.

Regarding the technical effects of the third aspect or its various possible implementations, reference may be made to the introduction of the technical effects of the first aspect or the corresponding implementations.

In a fourth aspect, an embodiment of the present application provides a data processing device, including: an acquisition module, configured to acquire an image to be processed; and a processing module, configured to perform prediction processing on the image to be processed by using the neural network model trained by the method of the first aspect or any possible implementation thereof, to obtain a prediction result.

In a fifth aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory is used to store instructions and the processor is used to execute the instructions stored in the memory, so that the processor performs the method of the first aspect or any possible implementation thereof.

In a sixth aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory is used to store instructions and the processor is used to execute the instructions stored in the memory, so that the processor performs the method of the second aspect or any possible implementation thereof.

In a seventh aspect, an embodiment of the present application provides a chip, including a data interface and a processor, where the processor is configured to perform the method of the first aspect or any possible implementation thereof.

In an eighth aspect, an embodiment of the present application provides a chip, including a data interface and a processor, where the processor is configured to perform the method of the second aspect or any possible implementation thereof.

In a ninth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program including program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect or any possible implementation thereof.

In a tenth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program including program instructions which, when executed by a processor, cause the processor to perform the method of the second aspect or any possible implementation thereof.

In an eleventh aspect, an embodiment of the present application provides a computer program product including program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect or any possible implementation thereof.

In a twelfth aspect, an embodiment of the present application provides a computer program product including program instructions which, when executed by a processor, cause the processor to perform the method of the second aspect or any possible implementation thereof.
Description of the drawings

In order to describe the technical solutions in the embodiments of the present application or the background art more clearly, the drawings used in the embodiments of the present application or the background art are described below.

FIG. 1 is an example of a distributed training flowchart provided by an embodiment of the application;

FIG. 2 is a flowchart of a neural network model training method provided by an embodiment of the application;

FIG. 3 is a schematic diagram of an example of computation and communication overlap provided by an embodiment of the application;

FIG. 4 is a schematic diagram of another example of computation and communication overlap provided by an embodiment of the application;

FIG. 5 is a flowchart of an inner-layer iteration method provided by an embodiment of the application;

FIG. 6 is a schematic diagram of an example of a communication fusion strategy provided by an embodiment of the application;

FIG. 7 is a flowchart of an image prediction method provided by an embodiment of the application;

FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of the application;

FIG. 9 is a schematic structural diagram of another data processing device provided by an embodiment of the application;

FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the application;

FIG. 11 is a schematic structural diagram of a terminal device provided by an embodiment of the application.
Detailed description

The terms "first", "second", and "third" in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion, for example, the inclusion of a series of steps or units. A method, system, product, or device need not be limited to the clearly listed steps or units, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.
An efficient distributed model training scheme is a tricky practical problem. This application provides a training method for a neural network model suitable for distributed model training scenarios, which can improve model training efficiency. The scenarios to which the training method provided in the embodiments of the present application applies are briefly introduced below.
Distributed model training scenario: a distributed training system includes multiple working nodes, each with basically the same function, and the working nodes obtain a trained neural network model by iteratively training the neural network model many times. In one iteration, each working node uses its own training samples to train the neural network model and obtains its own local gradient information; then, data synchronization is performed among the working nodes so that each of the working nodes obtains the local gradient information of all working nodes and fuses it to obtain global gradient information, or each working node fuses the local gradient information of all the other working nodes to obtain fusion gradient information and then fuses its own local gradient information with the fusion gradient information to obtain the global gradient information. As an example, each working node sends the local gradient information it computed and/or the local gradient information received from at least one other working node to other working nodes, or sends the fusion gradient information obtained by fusing the local gradient information it computed with the local gradient information received from at least one other working node, for example, to the working node on its left or right, until every working node has obtained the local gradient information computed by all working nodes, the fusion gradient information, or the global gradient information. Then, each working node updates the parameters of the neural network model using the global gradient information obtained by fusing the local gradient information computed by all the working nodes. Such iterations are performed many times, and each working node repeats the previous operations in each iteration until a training cut-off condition is reached, for example, the neural network model converges or the number of training iterations reaches a preset number. In this distributed model training scenario, in some embodiments, each working node uses the same neural network model, the working nodes update the parameters of the neural network model synchronously, and different working nodes use different training samples to train the neural network model; in other words, the neural network models used by the working nodes remain the same throughout.

In some embodiments, the multiple working nodes may be multiple processors on the same terminal device or server. For example, 8 GPUs on a server serve as 8 working nodes, that is, one GPU corresponds to one working node. In some embodiments, one working node or at least two working nodes correspond to one hardware entity, such as a terminal device or a server. For example, 8 laptops serve as 8 working nodes, that is, one laptop serves as one working node. As another example, 256 GPUs on 32 servers serve as 256 working nodes. As yet another example, the multiple working nodes included in the distributed training system are multiple virtual machines running in one or more devices (for example, servers).
In the above scenario, with the neural network model training method provided in the embodiments of the present application, the process in which a working node updates the parameters of the neural network model and the gradient data synchronization process of the working node are executed in parallel, which can improve training efficiency.

The following describes the training method of the neural network model provided by the embodiments of the present application in conjunction with an example of a distributed training flowchart.
FIG. 1 is an example of a distributed training flowchart provided by an embodiment of the application. As shown in FIG. 1, GPU 0, GPU 1, GPU 2, and GPU 3 are each a working node in the distributed training system, and the neural network model includes several layers. The parallel training process of GPU 0, GPU 1, GPU 2, and GPU 3 may include, for each layer: forward computation (forward pass), backward propagation (backward pass), gradient data synchronization (such as gradient reduction communication), and parameter update. In the forward computation, the layers of the neural network model sequentially process the image input to the neural network model to obtain a processing result for the image. Then, the gradient of the last layer of the neural network model can be obtained based on the processing result and a specific computation rule; in backpropagation, the gradient of the last layer can be propagated backwards in reverse order, and the gradients of the layers of the neural network model are computed in turn. In gradient data synchronization, gradient data can be synchronized among multiple working nodes. In the embodiments of this application, the purpose of gradient data synchronization is to enable each working node to obtain the global gradient information obtained by fusing the local gradient information computed by all the working nodes; this application does not limit the way this goal is achieved. In the parameter update, each working node uses the global gradient information obtained through gradient data synchronization to update the network parameters (for example, weights) of the neural network model.

In the example shown in FIG. 1, different working nodes input different training samples into the neural network model to perform forward computation and backward computation (that is, backpropagation) and obtain their respective local gradient information. After each working node completes one global gradient data synchronization, it can obtain the global gradient information obtained by fusing the local gradient information computed by all working nodes, or the local gradient information computed by all working nodes; each working node then uses the global gradient information obtained by fusing the local gradient information computed by all the working nodes to update the parameters of its own neural network model. The working nodes may update the parameters of the neural network model in the same way.
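Putting the four stages together, one iteration of the flow in FIG. 1 can be sketched as follows (a hedged, synchronous reference version without the overlap optimizations described later; it assumes `torch.distributed` has already been initialized):

```python
import torch
import torch.distributed as dist

def train_step(model, batch, target, loss_fn, lr=0.01):
    """One iteration as in FIG. 1: forward pass, backward pass,
    gradient data synchronization, then parameter update."""
    model.zero_grad()
    loss = loss_fn(model(batch), target)    # forward computation
    loss.backward()                         # backward propagation: local gradients
    for p in model.parameters():
        dist.all_reduce(p.grad)             # gradient data synchronization
        p.grad /= dist.get_world_size()     # fuse into the global (averaged) gradient
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad                # parameter update, same rule on every node
```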
在一些实施例中,梯度数据同步主要占用网络带宽资源,其他操作占用GPU计算资源。为了隐藏通信开销,本申请实施例提供了使得梯度数据同步和参数更新重叠(即并行)的神经网络模型的训练方法。下面结合附图来介绍本申请实施例提供的神经网络模型的训练方法。In some embodiments, gradient data synchronization mainly occupies network bandwidth resources, and other operations occupy GPU computing resources. In order to hide the communication overhead, the embodiment of the present application provides a method for training a neural network model that makes the gradient data synchronization and the parameter update overlap (ie, parallel). The following describes the training method of the neural network model provided by the embodiments of the present application with reference to the accompanying drawings.
Fig. 2 is a flowchart of a training method for a neural network model provided by an embodiment of the present application. As shown in Fig. 2, the method includes:
201. The first working node obtains local gradient information of at least one network layer of the neural network model based on the current iteration performed on the neural network model.
The first working node may be a terminal device such as a laptop, desktop computer, tablet computer, or mobile phone; it may also be a server, or a virtual machine running on a server or terminal device; it may also be a processor on a terminal device or server, such as a graphics processing unit (GPU), a central processing unit (CPU), or a neural-network processing unit (NPU). As shown in Fig. 1, each GPU can obtain the local gradient information of each network layer through backward computation. In some embodiments, the backward computation is a layer-by-layer operation in reverse order, and the first working node can compute the local gradient information of each network layer in the neural network model layer by layer in reverse order; see Fig. 1.
In some embodiments, before transmitting the local gradient information of the first network layer in the neural network model with at least one second working node (step 202), the first working node may further perform the following operations: the first working node amplifies each value in the local gradient information of the first network layer by a factor of M, and converts each amplified value into half precision, where M is a real number greater than 1. In this embodiment, before transmitting the local gradient information of the first network layer with at least one second working node, the first working node first converts the local gradient information of the first network layer into half-precision floating-point data, which occupies half the storage space of single-precision floating-point data; it then performs gradient reduction communication; after the reduction communication ends, the half-precision gradient obtained by the reduction communication is first converted back to single precision, and then the parameters are updated. In this way, the communication overhead can be cut in half.
Note, however, that the range of positive numbers representable by the half-precision floating-point format is about 6.1e-5 to 65504, much smaller than the range of single-precision floating-point data, while the gradients of a neural network model are often very small values. Therefore, the first working node amplifies the local gradient information before communication and shrinks it back after communication, so as to reduce the precision loss during the transmission of the local gradient information.
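The following is a minimal sketch of the amplify-then-shrink scheme described above, assuming PyTorch with a distributed backend that supports half-precision reduction (e.g., NCCL); the scale factor M (here 1024.0) and the averaging by world size are illustrative assumptions, not part of the original disclosure.

```python
import torch
import torch.distributed as dist

def sync_layer_gradient_fp16(grad: torch.Tensor, scale: float = 1024.0) -> torch.Tensor:
    """Synchronize one layer's local gradient in half precision."""
    # Amplify by M so small gradients stay above the fp16 minimum (~6.1e-5),
    # then cast to half precision: this halves the bytes on the wire.
    buf = (grad * scale).half()

    # Gradient reduction communication: sum the local gradients of all
    # working nodes (every node ends up with the same fused gradient).
    dist.all_reduce(buf, op=dist.ReduceOp.SUM)

    # Convert back to single precision, then undo the amplification
    # (also dividing by world_size here to average across nodes).
    return buf.float() / (scale * dist.get_world_size())
```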
202. In the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node, the first working node updates the parameters of the second network layer in the neural network model in parallel.
The first network layer and the second network layer are different. In some embodiments, each of the at least one second working node performs operations similar to those of the first working node. In some embodiments, the first working node updates the parameters of multiple network layers in the neural network model layer by layer in reverse order; and/or the network depth of the second network layer is greater than the network depth of the first network layer. In some embodiments, the first working node implements gradient data synchronization as a layer-by-layer operation in reverse order, and implements parameter update as a layer-by-layer operation in reverse order. For example, suppose the neural network model includes N layers; the first working node and at least one second working node successively transmit the local gradient information of the Nth network layer through the 1st network layer (corresponding to implementing gradient data synchronization layer by layer in reverse order). Here, "transmission" covers both "sending" and "receiving": for example, while the first working node sends the local gradient information of the Nth network layer computed by the first working node to at least one second working node, it also receives the local gradient information of the Nth network layer from the at least one second working node. The first working node then successively updates the parameters of the Nth network layer through the 1st network layer (corresponding to implementing parameter update layer by layer in reverse order). Fig. 3 is a schematic diagram of an example of overlapping computation and communication provided by an embodiment of the present application. As shown in Fig. 3, 301 denotes stream 1, which implements gradient data synchronization layer by layer in reverse order, and 302 denotes stream 2, which implements parameter update layer by layer in reverse order; stream 1 and stream 2 run in parallel. Each rectangular box in 301 represents an operation in which the first working node transmits (or communicates, synchronizes) the local gradient information of one network layer with the other working nodes; for example, the box labeled "nth network layer" represents the operation of transmitting the local gradient information of the nth network layer. Each rectangular box in 302 represents an operation in which the first working node updates the parameters of one network layer; for example, the box labeled "nth network layer" represents the operation of updating the parameters of the nth network layer. The arrow represents the time axis, and n is an integer greater than 1. In Fig. 3, the first working node and the other working nodes transmit, in sequence, the local gradient information of the nth network layer, the (n-1)th network layer, ..., the 1st network layer; the first working node updates, in sequence, the parameters of the nth network layer, the (n-1)th network layer, ..., the 1st network layer; and while transmitting the local gradient information of the (n-i)th network layer with the other working nodes, the first working node updates the parameters of the (n-i+1)th network layer in parallel, where i is an integer less than n. Since the first working node implements both gradient data synchronization and parameter update layer by layer in reverse order, it can, during gradient data synchronization, use the already obtained local gradient information of network layers in parallel to carry out part of the parameter update operations. Referring to Fig. 3, since the first working node has already received the local gradient information of the nth network layer before performing the operation of receiving the local gradient information of the (n-1)th network layer, the first working node can perform the operation of updating the parameters of the nth network layer in parallel with the operation of receiving the local gradient information of the (n-1)th network layer.
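As a concrete illustration of the two-stream overlap of Fig. 3, the following is a minimal sketch assuming PyTorch with a CUDA/NCCL backend; `model` and `lr` are illustrative placeholders, not part of the original disclosure.

```python
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()    # stream 1: gradient data synchronization
update_stream = torch.cuda.Stream()  # stream 2: parameter update

layers = list(model.children())      # assume index 0 is layer 1, index -1 is layer n
for layer in reversed(layers):       # reverse order, layer by layer
    with torch.cuda.stream(comm_stream):
        for p in layer.parameters():
            dist.all_reduce(p.grad)  # reduce this layer's local gradient
    # Stream 2 waits only for the work enqueued on stream 1 so far, i.e.
    # this layer's reduction; the next layer's reduction then overlaps
    # with this layer's parameter update.
    update_stream.wait_stream(comm_stream)
    with torch.cuda.stream(update_stream):
        for p in layer.parameters():
            p.data.add_(p.grad, alpha=-lr / dist.get_world_size())
```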
In some embodiments, the first working node determines the dependencies between the multiple operations of the current iteration based on the connection relationships of the multiple network layers of the neural network model, where the multiple operations include at least the transmission operation and the parameter update operation of the local gradient information of at least one network layer in the neural network model; the first working node then performs the multiple operations based on the dependencies between them. That is, the first working node can establish the dependencies between the multiple operations of the current iteration according to the order of the network layers to which those operations belong, so that the specific execution timing of each operation is driven by the dependencies. Exemplarily, the first working node implements gradient data synchronization layer by layer in reverse order and implements parameter update layer by layer in reverse order; the transmission operation of the local gradient information of any network layer in the neural network model depends on the completion of the transmission operations of the local gradient information of all network layers after that layer, and the parameter update operation of any network layer depends on the completion of the transmission operation of the local gradient information of that layer. For example, after the first working node completes the transmission operation of the local gradient information of the nth network layer in the neural network model, it can perform the transmission operation of the local gradient information of the (n-1)th network layer as well as the parameter update operation of the nth network layer, as sketched below.
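The following is a minimal sketch of deriving the per-iteration dependency relation from the layer order, assuming layers numbered 1..n with both synchronization and update proceeding layer by layer in reverse order; the data structures are purely illustrative.

```python
# Build a dependency table: each operation maps to its prerequisites.
n = 5  # number of network layers (illustrative)

deps = {}
for l in range(1, n + 1):
    # Transmitting layer l's gradient depends on the transmissions of all
    # deeper layers (l+1 .. n) having completed.
    deps[("sync", l)] = [("sync", k) for k in range(l + 1, n + 1)]
    # Updating layer l's parameters depends on layer l's transmission.
    deps[("update", l)] = [("sync", l)]

# E.g., once ("sync", n) completes, both ("sync", n - 1) and
# ("update", n) become ready and can run in parallel.
```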
In some embodiments, step 202 is implemented as follows: in the process of transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node, upon determining that the operations on which the parameter update operation of the second network layer depends have been completed, the first working node updates the parameters of the second network layer in parallel with the transmission of the local gradient information of the first network layer, where the operations on which the parameter update operation depends include transmitting the local gradient information of the second network layer with the at least one second working node. In some embodiments, each operation to be performed by the first working node is bound to an event, and the event that each operation needs to wait for is established according to the dependencies between operations; each stream waits, through a lightweight blocking interface (e.g., cudaStreamWaitEvent), for the event associated with the current operation to complete before launching the current operation.
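The following is a minimal sketch of this event-driven scheme, assuming PyTorch CUDA events (which wrap cudaEventRecord/cudaStreamWaitEvent underneath); `layers` and `lr` are illustrative placeholders as in the previous sketch.

```python
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()
update_stream = torch.cuda.Stream()

# Bind one event to each layer's transmission operation.
sync_done = {i: torch.cuda.Event() for i in range(len(layers))}

for i in reversed(range(len(layers))):
    with torch.cuda.stream(comm_stream):
        for p in layers[i].parameters():
            dist.all_reduce(p.grad)
        sync_done[i].record(comm_stream)    # mark layer i's sync complete

    # The update of layer i waits only on its own event (a lightweight,
    # non-blocking dependency), so it runs while the communication stream
    # already reduces the next, shallower layer.
    update_stream.wait_event(sync_done[i])  # cf. cudaStreamWaitEvent
    with torch.cuda.stream(update_stream):
        for p in layers[i].parameters():
            p.data.add_(p.grad, alpha=-lr / dist.get_world_size())
```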
In one embodiment, before updating the parameters of the second network layer in the neural network model, the first working node may perform the following operations: the first working node converts each value included in the obtained local gradient information of the second network layer into single precision, and shrinks each converted value by a factor of M to obtain processed gradient information, where M is a real number greater than 1. The first working node updating the parameters of the second network layer in the neural network model in parallel may then be: the first working node updates the parameters of the second network layer in the neural network model using the processed gradient information.
In the embodiments of the present application, the first working node updates the parameters of the second network layer in the neural network model in parallel while transmitting the local gradient information of the first network layer with at least one second working node; overlapping the process of updating the parameters of the neural network model with the process of transmitting local gradient information (i.e., overlapping parameter update and communication) can improve model training efficiency.
To further hide the communication overhead, the first working node may additionally overlap gradient data synchronization with backward computation. A possible implementation in which gradient data synchronization overlaps with backward computation is described below with reference to the accompanying drawings.
In one embodiment, on the basis of the method flow of Fig. 2, the first working node may further perform the following operation: in the process of transmitting the local gradient information of the first network layer in the neural network model with at least one second working node, the first working node computes the local gradient information of the third network layer in the neural network model. The network depth of the third network layer is smaller than the network depth of the first network layer. In some embodiments, the backward computation is a layer-by-layer operation in reverse order, the first working node implements gradient data synchronization layer by layer in reverse order, and the process in which the first working node performs backward computation can overlap with the process of gradient data synchronization; that is, backward computation and gradient data synchronization are carried out in parallel.
Fig. 4 is a schematic diagram of another example of overlapping computation and communication provided by an embodiment of the present application. As shown in Fig. 4, 401 denotes stream 3, which implements backward computation layer by layer in reverse order; 301 denotes stream 1, which implements gradient data synchronization layer by layer in reverse order; and 302 denotes stream 2, which implements parameter update layer by layer in reverse order; streams 1, 2, and 3 run in parallel. Each rectangular box in 401 represents an operation in which the first working node computes the local gradient information of one network layer (corresponding to the backward operation); for example, the box labeled "nth network layer" represents the operation of computing the local gradient information of the nth network layer. Each rectangular box in 301 represents an operation in which the first working node transmits the local gradient information of one network layer with the other working nodes. Each rectangular box in 302 represents an operation in which the first working node updates the parameters of one network layer. n is an integer greater than 1. In Fig. 4, the first working node computes, in sequence, the local gradient information of the nth network layer, the (n-1)th network layer, ..., the 1st network layer; the first working node and the other working nodes transmit, in sequence, the local gradient information of the nth network layer, the (n-1)th network layer, ..., the 1st network layer; the first working node updates, in sequence, the parameters of the nth network layer, the (n-1)th network layer, ..., the 1st network layer; and while receiving the local gradient information of the (n-i)th network layer, the first working node updates the parameters of the (n-i+1)th network layer and computes the local gradient information of the (n-i-1)th network layer in parallel, where i is an integer less than (n-1).
In this embodiment, the first working node computes the local gradient information of the third network layer in the neural network model while transmitting the local gradient information of the first network layer with at least one second working node; overlapping the process of computing the local gradient information of the network layers with the process of transmitting local gradient information can improve model training efficiency.
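The following is a minimal sketch of overlapping backward computation with gradient data synchronization, assuming a recent PyTorch (post-accumulate gradient hooks) with a distributed backend; `model` and `inputs` are illustrative placeholders. Launching the reduction asynchronously lets the synchronization of a deep layer proceed while autograd still computes the gradients of shallower layers.

```python
import torch
import torch.distributed as dist

pending = []

def sync_when_ready(param):
    # Called as soon as this parameter's gradient has been accumulated
    # during backpropagation (deep layers finish first).
    work = dist.all_reduce(param.grad, async_op=True)
    pending.append((param, work))

for p in model.parameters():
    p.register_post_accumulate_grad_hook(sync_when_ready)

loss = model(inputs).sum()  # placeholder forward pass and loss
loss.backward()             # deep-layer reductions overlap this call

for p, work in pending:
    work.wait()             # make sure every reduction has finished
    p.grad.div_(dist.get_world_size())
```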
The foregoing embodiments describe schemes in which computation and communication overlap. The essence of these schemes is to hide the communication time behind the parameter update time and/or the backward computation time; however, when the computation time of the neural network model is less than the communication time, the communication overhead cannot be fully hidden. It is therefore necessary to study communication reduction schemes to further compress the communication overhead.
The embodiments of the present application introduce an inner-iteration strategy. Each inner iteration performs a complete forward computation (Forward) and backward computation (Backward) and accumulates the local gradient information, but performs neither gradient data synchronization nor parameter update; that is, the gradient data of the working nodes is not synchronized and the parameters of the neural network model are not updated. Multiple inner iterations correspond to one global communication, where in the last inner iteration the local gradient information is reduced through communication and the parameter values are updated. In some embodiments, the global communication operation can overlap with the backward computation of the last inner iteration. The inner-iteration strategy essentially increases the batch size of each iteration, which is equivalent to reducing the total communication volume of the overall training process. The inner-iteration method provided by the embodiments of the present application is described below with reference to the accompanying drawings.
Fig. 5 is a flowchart of an inner-iteration method provided by an embodiment of the present application. As shown in Fig. 5, the inner-iteration method includes:
501. The first working node inputs training samples into the neural network model for forward computation to obtain a processing result.
502. The first working node performs backward computation using the processing result and the neural network model to obtain local gradient information of at least one network layer of the neural network model.
Steps 501 and 502 can be understood as an implementation in which the first working node performs one inner iteration on the neural network model to obtain the local gradient information of at least one network layer of the neural network model. In some embodiments, step 502 may be replaced by: the first working node performs backward computation using the processing result and the neural network model to obtain the local gradient information of each network layer of the neural network model. For example, the first working node implements the backward computation layer by layer in reverse order to obtain the local gradient information of each network layer of the neural network model.
503. The first working node obtains target fusion gradient information of at least one network layer of the neural network model based on intermediate fusion gradient information and the local gradient information corresponding to the current iteration (i.e., this inner iteration).
In some embodiments, the intermediate fusion gradient information may be the intermediate fusion gradient information corresponding to at least one inner iteration performed by the first working node on the neural network model. Exemplarily, the intermediate fusion gradient information may be the local gradient information of each network layer of the neural network model obtained by the first working node through one inner iteration, or may be obtained by fusing at least two groups of local gradient information obtained by the first working node through at least two inner iterations. It should be understood that when the first working node performs step 503 for the first time, the intermediate fusion gradient information does not yet exist; step 503 may then be implemented by taking the local gradient information of at least one network layer of the neural network model obtained in step 502 as the intermediate fusion gradient information and storing it. When the first working node performs step 503 for the second time, step 503 may be implemented by obtaining new intermediate fusion gradient information based on the current intermediate fusion gradient information and the local gradient information corresponding to this inner iteration (i.e., the gradient information obtained by performing step 502 for the second time), which corresponds to updating the intermediate fusion gradient; and so on. After the first working node performs step 503 for the Kth time, the target fusion gradient information of at least one network layer of the neural network model is obtained, where K is an integer greater than 1. It can be understood that the first working node obtains the initial intermediate fusion gradient when performing step 503 for the first time (corresponding to the gradient information obtained by performing step 502 for the first time), and each subsequent execution of step 503 uses the current intermediate fusion gradient information and the local gradient information corresponding to the current iteration (i.e., this inner iteration) to obtain new intermediate fusion gradient information.
In some embodiments, the first working node performs one inner iteration to obtain one group of local gradient parameters, each group including the local gradient information of each network layer of the neural network model. The first working node fusing the at least two groups of local gradient information obtained from at least two inner iterations may be: fusing, for each network layer, the local gradient information of that layer included in the at least two groups, to obtain the intermediate fusion gradient of each network layer. For example, the first working node fuses the local gradient information of the first network layer included in the at least two groups of local gradient information to obtain the intermediate fusion gradient of the first network layer. Exemplarily, this fusion may be performed by successively fusing the corresponding parameters of the first network layer included in the groups of local gradient information. For example, suppose the value of a certain parameter of the first network layer is a in the first group of local gradient information, b in the second group, and c in the third group; taking this parameter as an example, the first working node fusing the local gradient information of the first network layer included in these three groups may be: first compute (a+b), then compute ((a+b)+c). In this example, the value corresponding to this parameter in the intermediate fusion gradient information of the first network layer is ((a+b)+c).
In some embodiments, step 503 may be implemented as follows: the first working node accumulates the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model. The gradients in the intermediate fusion gradient information correspond one-to-one to the gradients in the local gradient information obtained in the current iteration; the accumulation may thus be performed over the one-to-one corresponding parameters in the intermediate fusion gradient information and the local gradient information obtained in the current iteration. For example, if the value of a certain parameter in the intermediate fusion gradient information is d, and the corresponding value of this parameter in the local gradient information obtained in the current iteration is e, accumulating d and e yields (d+e). It should be understood that the target fusion gradient information of any network layer of the neural network model may be obtained by fusing multiple groups of local gradient information of that network layer obtained by the first working node through multiple inner iterations.
504. The first working node determines whether the inner-iteration threshold has been reached.
If so, step 505 is performed; if not, step 501 is performed. The inner-iteration threshold may be 3, 5, 10, 20, etc., which is not limited in this application. In practical applications, the first working node can set the inner-iteration threshold according to actual needs. The larger the inner-iteration threshold, the fewer global communications the first working node performs.
505. The first working node performs a global communication operation to obtain global gradient information.
In some embodiments, the global gradient information may be gradient information obtained by fusing the local gradient information computed by all working nodes. Exemplarily, the global gradient information may be gradient information obtained by accumulating the corresponding gradients in the local gradient information computed by all working nodes. For example, if the local gradient information computed by each working node corresponds to a vector, the vector corresponding to the global gradient information may be obtained by accumulating the elements at the same positions of the vectors corresponding to the local gradient information computed by the respective working nodes. In some embodiments, after the first working node obtains the global gradient information, every working node in the distributed training system obtains the global gradient information.
506. The first working node updates the neural network model using the global gradient information.
It should be understood that every working node in the distributed training system updates the neural network model using the global gradient information, so that every working node obtains the same updated neural network model. Steps 501 to 506 describe the process by which the first working node performs one parameter update operation; in practical applications, the first working node may perform the method flow of Fig. 5 multiple times to obtain a converged neural network model.
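The following is a minimal sketch of the inner-iteration flow of Fig. 5 (steps 501 to 506), assuming PyTorch, where autograd's gradient accumulation across backward calls plays the role of fusing local gradient information; `model`, `loss_fn`, `data_loader`, `lr`, and the threshold K are illustrative assumptions.

```python
import torch
import torch.distributed as dist

K = 4      # inner-iteration threshold (step 504, illustrative value)
inner = 0

for inputs, labels in data_loader:
    loss = loss_fn(model(inputs), labels)  # 501: forward computation
    loss.backward()                        # 502: backward computation;
                                           # 503: .grad accumulates the new
                                           # local gradients onto the stored
                                           # intermediate fusion gradient
    inner += 1
    if inner == K:                         # 504: threshold reached
        for p in model.parameters():
            dist.all_reduce(p.grad)        # 505: global communication
            p.grad.div_(K * dist.get_world_size())
            p.data.add_(p.grad, alpha=-lr) # 506: parameter update
            p.grad.zero_()                 # reset for the next round
        inner = 0
```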
In some embodiments, the first working node may further perform the following operation: in the process of obtaining the target fusion gradient information of the third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration, the first working node transmits, in parallel, the target fusion gradient information of the fourth network layer of the neural network model with the at least one second working node. Optionally, the network depth of the fourth network layer is greater than the network depth of the third network layer. The first working node may perform the last inner iteration layer by layer in reverse order, and can therefore successively obtain the target fusion gradient information from the highest network layer (with the largest network depth) to the lowest network layer (with the smallest network depth) of the neural network model. It should be understood that, while computing the target fusion gradient information of a certain network layer, the first working node can transmit the already computed target fusion gradient information of other network layers to the other working nodes. In other words, the global communication operation can overlap with the backward computation of the last inner iteration.
In this implementation, the process of computing the target fusion gradient information of the network layers in the neural network model overlaps with the process of transmitting the target fusion gradient information of the network layers (i.e., computation and communication overlap), which can improve model training efficiency.
In the embodiments of the present application, the first working node and the at least one second working node transmit the target fusion gradient information of the network layers, which can reduce the number of gradient information transmissions and the total communication volume.
To further improve communication efficiency, the embodiments of the present application also provide a communication fusion strategy, in which the gradients of several network layers are merged into one larger array before a global communication is initiated. The communication fusion strategy can be applied to the foregoing embodiments to improve communication efficiency.
For most operators in common neural network models, the number of gradient parameters is quite small, usually a small constant multiple of the number of feature maps, and the communication volume is on the order of kilobytes or even bytes. According to related research on underlying communication, network bandwidth cannot be fully utilized when the amount of transmitted data is too small. To obtain a larger communication volume and thus improve communication efficiency, we introduce a communication fusion strategy.
There are a few points to note about this strategy. On the one hand, the scale of communication fusion (also called gradient fusion) needs to be configured reasonably. If the fusion scale is too small, communication efficiency is low; if the fusion scale is too large, the launch of communication operations is delayed. Therefore, when implementing the communication fusion strategy, we make the fusion size configurable, for example by using a dry run to tune the most suitable fusion scale for each neural network model and platform (e.g., distributed training system). On the other hand, under the original communication fusion scheme, multiple discretely stored small arrays must be merged into one contiguously stored large array before communication and split apart again afterwards, which introduces two rounds of memory copies and incurs additional overhead.
In some embodiments, before performing step 201, the first working node may perform the following operation: the first working node stores, based on the offset corresponding to the first network layer, the computed local gradient information of the first network layer into a pre-allocated target storage space, where the target storage space is used to store the local gradient information of multiple network layers of the neural network model;
where the local gradient information of the first network layer sent by the first working node is obtained from the target storage space based on the offset corresponding to the first network layer, and/or the first working node updates the local gradient information of the first network layer stored in the target storage space based on the received local gradient information of the first network layer from the at least one second working node.
In this embodiment, the first working node pre-allocates a unified contiguous memory space (corresponding to the target storage space) for all parameter gradients (corresponding to the gradient information) of the neural network model, and then points the parameter gradients of each network layer to the corresponding offsets through a memory manager, thereby avoiding additional memory copies during communication.
In some embodiments, before performing step 201, the first working node may perform the following operations: the first working node stores the computed local gradient information of multiple network layers of the neural network model into a pre-allocated target storage space, and determines the offset corresponding to each of the multiple network layers through a memory manager, the target storage space being one contiguous storage space; the first working node obtains, based on the offset corresponding to each of the multiple network layers, the local gradient information of at least two of the multiple network layers from the target storage space, the at least two network layers including the first network layer. Step 201 may then be replaced by: transmitting the local gradient information of the at least two network layers in the neural network model with the at least one second working node.
Fig. 6 is a schematic diagram of an example of the communication fusion strategy provided by an embodiment of the present application. As shown in Fig. 6, 601 denotes the network layers of the neural network model, where L1 denotes the 1st network layer and Ln denotes the nth network layer; 602 denotes the local gradient information of each network layer, where gradient m, gradient (m-1), ..., gradient 1 each denote one gradient or the gradient of one network layer; 603 denotes the merged local gradient information of the network layers, where gradient group k, gradient group (k-1), ..., gradient group 1 each include at least two gradients or the gradients of at least two network layers. In the embodiments of the present application, the network layers and the gradients in the neural network model do not correspond one-to-one: some network layers may have multiple gradients, and some network layers may have no gradient. In some embodiments, each rectangular box of 602 (e.g., gradient m) denotes the gradient of one network layer; the first working node then needs m transmissions to send the gradients to the other working nodes one layer at a time, whereas it needs only k transmissions to send one gradient group (e.g., gradient group k) at a time, with k less than m. In some embodiments, each rectangular box of 602 (e.g., gradient m) denotes the gradient of one parameter vector; the first working node then needs k transmissions to send the gradient groups to the other working nodes one group at a time. It should be understood that the first working node can merge the local gradient information of several network layers into one larger array before initiating a global communication, which can reduce the number of global communication operations.
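The following is a minimal sketch of the pre-allocated contiguous gradient buffer with per-layer offsets and bucketed communication, assuming PyTorch on a CUDA device; the bucket size (fusion scale) is an illustrative assumption to be tuned per model and platform, e.g., by a dry run.

```python
import torch
import torch.distributed as dist

# Pre-allocate one unified contiguous space for all parameter gradients.
numels = [p.numel() for p in model.parameters()]
flat = torch.zeros(sum(numels), device='cuda')

# Memory manager: point each parameter's .grad at its offset in the flat
# buffer, so backward accumulates gradients in place and no extra memory
# copy is needed before or after communication.
offset = 0
for p, n in zip(model.parameters(), numels):
    p.grad = flat[offset:offset + n].view_as(p)
    offset += n

# Communication fusion: reduce the buffer in a few large slices (gradient
# groups) instead of one small message per layer.
bucket = 16 * 1024 * 1024  # fusion scale in elements (illustrative)
for start in range(0, flat.numel(), bucket):
    dist.all_reduce(flat[start:start + bucket])
```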
The foregoing embodiments describe the method flows for training a neural network model. The following presents an example of applying the trained neural network model to a prediction task.
Fig. 7 is a flowchart of an image prediction method provided by an embodiment of the present application. As shown in Fig. 7, the method includes:
701. An image processing apparatus acquires an image to be processed.
The image processing apparatus may be the first working node, another working node, or a device that did not participate in the training of the neural network model, such as a terminal device or a server.
In some embodiments, the image processing apparatus is a server, and acquiring the image to be processed may be the server receiving the image to be processed from a terminal device, or acquiring the image to be processed from another device according to an instruction input by a user.
In some embodiments, the image processing apparatus is a server, and acquiring the image to be processed may be the server acquiring an image to be processed uploaded by a user, or acquiring the image to be processed from another device according to an instruction input by the user.
702. The image processing apparatus performs prediction processing on the image to be processed using the trained neural network model to obtain a prediction result.
The neural network model may be obtained through training using the methods of the foregoing embodiments. It should be understood that Fig. 7 is one example of applying the neural network model. A neural network model trained using the training methods of the foregoing embodiments can handle different prediction tasks, such as text recognition, image recognition, and image classification.
In some embodiments, the image processing apparatus is a server, and after performing step 702 it may further send the prediction result to a terminal device, such as a mobile phone or a personal computer.
In some embodiments, the image processing apparatus is a terminal device, and after performing step 702 it may further output the prediction result, for example by displaying the prediction result on a display screen.
In the embodiments of the present application, the trained neural network model is used to perform prediction processing on the image to be processed to obtain a prediction result; different image prediction tasks can thus be implemented efficiently.
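The following is a minimal sketch of steps 701 and 702, assuming PyTorch and an illustrative image-classification model; the preprocessing convention (a (C, H, W) tensor) and the returned class index are assumptions, not part of the original disclosure.

```python
import torch

def predict(model: torch.nn.Module, image: torch.Tensor) -> int:
    """Run prediction on a preprocessed image tensor of shape (C, H, W)."""
    model.eval()
    with torch.no_grad():                   # inference only, no gradients
        logits = model(image.unsqueeze(0))  # add a batch dimension
    return int(logits.argmax(dim=1))        # predicted class index
```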
The foregoing embodiments describe the training method of the neural network model implemented by the first working node. The functions of the modules of the first working node are described below with reference to the accompanying drawings.
Fig. 8 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application. The data processing apparatus in Fig. 8 may be the first working node in the foregoing embodiments. As shown in Fig. 8, the data processing apparatus may include:
a processing module 801, configured to obtain local gradient information of at least one network layer of the neural network model based on the current iteration performed on the neural network model;
a transceiver module 802, configured to transmit the local gradient information of the first network layer in the neural network model with at least one second working node;
the processing module 801 being further configured to update the parameters of the second network layer in the neural network model in parallel while the transceiver module 802 transmits the local gradient information of the first network layer in the neural network model with at least one second working node.
In some embodiments, the processing module 801 may be a processor such as a CPU, GPU, or NPU, and the transceiver module 802 may be a transceiver with data transmitting and receiving functions.
In a possible implementation, the processing module 801 is further configured to determine the dependencies between the multiple operations of the current iteration based on the connection relationships of the multiple network layers of the neural network model, the multiple operations including at least the transmission operation and the parameter update operation of the local gradient information of at least one network layer in the neural network model, and to perform the multiple operations based on the dependencies between them.
In a possible implementation, the first working node updates the parameters of multiple network layers in the neural network model layer by layer in reverse order; and/or the network depth of the second network layer is greater than the network depth of the first network layer.
In a possible implementation, the processing module 801 is specifically configured to: in the process in which the transceiver module transmits the local gradient information of the first network layer in the neural network model with the at least one second working node, upon determining that the operations on which the parameter update operation of the second network layer depends have been completed, update the parameters of the second network layer in parallel, where the operations on which the parameter update operation depends include transmitting the local gradient information of the second network layer with the at least one second working node.
In a possible implementation, the processing module 801 is further configured to compute the local gradient information of the third network layer in the neural network model while the transceiver module transmits the local gradient information of the first network layer in the neural network model with at least one second working node.
In a possible implementation, the processing module 801 is further configured to perform at least one inner iteration on the neural network model to obtain the intermediate fusion gradient information corresponding to the at least one inner iteration;
the processing module 801 being specifically configured to obtain the target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration; the local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.
In a possible implementation, the processing module 801 is specifically configured to accumulate the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model.
In a possible implementation, the transceiver module 802 is further configured to transmit the target fusion gradient information of the fourth network layer of the neural network model with the at least one second working node while the processing module 801 obtains the target fusion gradient information of the third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration.
In a possible implementation, the processing module 801 is further configured to amplify each value in the local gradient information of the first network layer by a factor of M and convert each amplified value into half precision, where M is a real number greater than 1.
In a possible implementation, the processing module 801 is further configured to convert each value included in the obtained local gradient information of the second network layer into single precision and shrink each converted value by a factor of M to obtain processed gradient information, where M is a real number greater than 1;
the processing module 801 being specifically configured to update the parameters of the second network layer in the neural network model using the processed gradient information.
In a possible implementation, the processing module 801 is further configured to store, based on the offset corresponding to the first network layer, the computed local gradient information of the first network layer into a pre-allocated target storage space, where the target storage space is used to store the local gradient information of multiple network layers of the neural network model;
where the local gradient information of the first network layer sent by the transceiver module 802 is obtained from the target storage space based on the offset corresponding to the first network layer, and/or the processing module 801 is further configured to update the local gradient information of the first network layer stored in the target storage space based on the received local gradient information of the first network layer from the at least one second working node.
In a possible implementation, the processing module 801 is further configured to store the computed local gradient information of multiple network layers of the neural network model into a pre-allocated target storage space and determine the offset corresponding to each of the multiple network layers through a memory manager, the target storage space being one contiguous storage space; the first working node obtains, based on the offset corresponding to each of the multiple network layers, the local gradient information of at least two of the multiple network layers from the target storage space, the at least two network layers including the first network layer; the transceiver module is specifically configured to transmit the local gradient information of the at least two network layers in the neural network model with the at least one second working node.
FIG. 9 is a schematic structural diagram of another data processing apparatus provided by an embodiment of this application. As shown in FIG. 9, the data processing apparatus includes:
an obtaining module 901, configured to obtain an image to be processed; and
a processing module 902, configured to perform prediction processing on the image to be processed using the trained neural network model to obtain a prediction result.
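As a minimal illustration of this acquire-then-predict flow (the module stand-ins and the dummy model below are assumptions made for the sketch, not part of the embodiment):

```python
import numpy as np

def obtain_image():
    # Stand-in for the obtaining module 901: in practice the image
    # would come from disk, a camera, or a network request.
    return np.random.rand(224, 224, 3).astype(np.float32)

def predict(image, model):
    # Stand-in for the processing module 902: run the trained neural
    # network model on the image to obtain a prediction result.
    return model(image)

result = predict(obtain_image(), model=lambda x: float(x.mean()))
```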
It should be understood that the division of the above data processing apparatus into units is merely a division of logical functions; in actual implementation, the units may be fully or partially integrated into one physical entity, or may be physically separate. For example, each of the above units may be a separately established processing element, or the units may be integrated into the same chip; alternatively, they may be stored in a storage element of a controller in the form of program code, and a processing element of a processor calls the code and performs the functions of the above units. The units may be integrated together or implemented independently. The processing element here may be an integrated circuit chip with signal processing capability. In implementation, each step of the above method, or each of the above units, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software. The processing element may be a general-purpose processor, such as a central processing unit (CPU), or may be one or more integrated circuits configured to implement the above methods, for example, one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs).
FIG. 10 is a schematic structural diagram of a server provided by an embodiment of this application. The server 1000 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1022 (for example, one or more processors), a memory 1032, one or more storage media 1030 (for example, one or more mass storage devices) storing application programs 1042 or data 1044, and one or more acceleration devices (for example, GPUs or NPUs) 1024. The memory 1032 and the storage medium 1030 may provide transient or persistent storage. The program stored in the storage medium 1030 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and execute, on the server 1000, the series of instruction operations in the storage medium 1030. The acceleration device 1024 may perform tasks assigned by the central processing unit 1022, such as image processing tasks. The server 1000 may be the data processing apparatus provided in the embodiments of this application.
The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the data processing apparatus in the foregoing embodiments may be based on the server structure shown in FIG. 10. Specifically, the acceleration device 1024 may implement the functions of the processing module 801 in FIG. 8, and the wired or wireless network interface 1050 may implement the functions of the transceiver module 802 in FIG. 8. Likewise, the acceleration device 1024 may implement the functions of the processing module 902 in FIG. 9, and the wired or wireless network interface 1050 or the input/output interface 1058 may implement the functions of the obtaining module in FIG. 9.
FIG. 11 is a schematic structural diagram of a terminal device provided by an embodiment of this application. As shown in FIG. 11, the terminal device 110 includes a processor 1101, a memory 1102, and a communication interface 1103, which are connected to one another through a bus 1104. The terminal device in FIG. 11 may be the data processing apparatus in the foregoing embodiments.
The memory 1102 includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM), and stores related instructions and data. The communication interface 1103 is used to receive and send data.
The processor 1101 may include one or more CPUs and one or more GPUs; where the processor 1101 includes one CPU, that CPU may be single-core or multi-core. The steps performed by the data processing apparatus in the foregoing embodiments may be based on the terminal device structure shown in FIG. 11. Specifically, the processor 1101 may implement the functions of the processing module 801 in FIG. 8, and the communication interface 1103 may implement the functions of the transceiver module in FIG. 8. Likewise, the processor 1101 may implement the functions of the processing module 902 in FIG. 9, and the communication interface 1103 may implement the functions of the obtaining module in FIG. 9.
An embodiment of this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the neural network model training method provided in the foregoing embodiments.
An embodiment of this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the image prediction method provided in the foregoing embodiments.
An embodiment of this application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the neural network model training method provided in the foregoing embodiments.
An embodiment of this application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the image prediction method provided in the foregoing embodiments.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall all fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (27)

1. A method for training a neural network model, characterized in that the method comprises:
    a first working node obtaining local gradient information of at least one network layer of a neural network model based on a current iteration performed on the neural network model;
    in the process of transmitting local gradient information of a first network layer in the neural network model with at least one second working node, the first working node updating, in parallel, parameters of a second network layer in the neural network model.
2. The method according to claim 1, characterized in that the method further comprises:
    the first working node determining, based on connection relationships of multiple network layers of the neural network model, dependency relationships between multiple operations of the current iteration, the multiple operations including at least a transmission operation and a parameter update operation for the local gradient information of at least one network layer in the neural network model;
    wherein the first working node performs the multiple operations based on the dependency relationships between the multiple operations.
3. The method according to claim 1 or 2, characterized in that the first working node updates the parameters of multiple network layers in the neural network model layer by layer in reverse order; and/or
    a network depth of the second network layer is greater than a network depth of the first network layer.
4. The method according to any one of claims 1 to 3, characterized in that the first working node updating, in parallel, the parameters of the second network layer in the neural network model in the process of transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node comprises:
    in the process of transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node, the first working node updating the parameters of the second network layer in parallel upon determining that the operations on which the parameter update operation of the second network layer depends have been completed, wherein the operations on which the parameter update operation depends include transmitting the local gradient information of the second network layer with the at least one second working node.
5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
    the first working node computing local gradient information of a third network layer in the neural network model in the process of transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node.
6. The method according to any one of claims 1 to 5, characterized in that before the first working node performs the current iteration on the neural network model, the method further comprises:
    the first working node performing at least one inner-layer iteration on the neural network model to obtain intermediate fusion gradient information corresponding to the at least one inner-layer iteration;
    the first working node obtaining the local gradient information of at least one network layer of the neural network model based on the current iteration performed on the neural network model comprises: the first working node obtaining target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and local gradient information corresponding to the current iteration; the local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.
7. The method according to claim 6, characterized in that the first working node obtaining the target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration comprises:
    the first working node accumulating the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model.
8. The method according to claim 6 or 7, characterized in that the method further comprises:
    in the process of obtaining target fusion gradient information of a third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration, the first working node transmitting, in parallel, target fusion gradient information of a fourth network layer of the neural network model with the at least one second working node.
9. The method according to any one of claims 1 to 8, characterized in that before transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node, the method further comprises:
    the first working node amplifying each value in the local gradient information of the first network layer by a factor of M and converting each amplified value to half precision, M being a real number greater than 1.
10. The method according to any one of claims 1 to 9, characterized in that before the first working node updates, in parallel, the parameters of the second network layer in the neural network model, the method further comprises:
    the first working node converting each value included in the obtained local gradient information of the second network layer to single precision, and reducing each converted value by a factor of M to obtain processed gradient information, M being a real number greater than 1;
    the first working node updating, in parallel, the parameters of the second network layer in the neural network model comprises:
    the first working node updating the parameters of the second network layer in the neural network model using the processed gradient information.
11. The method according to any one of claims 1 to 10, characterized in that before transmitting the local gradient information of the first network layer in the neural network model with the at least one second working node, the method further comprises:
    the first working node storing the computed local gradient information of the first network layer in a pre-allocated target storage space based on an offset corresponding to the first network layer, wherein the target storage space is used to store local gradient information of multiple network layers of the neural network model;
    wherein the local gradient information of the first network layer sent by the first working node is obtained from the target storage space based on the offset corresponding to the first network layer, and/or the first working node updates the local gradient information of the first network layer stored in the target storage space based on the local gradient information of the first network layer received from the at least one second working node.
12. An image prediction method, characterized in that the method comprises:
    obtaining an image to be processed;
    performing prediction processing on the image to be processed using a neural network model trained according to any one of claims 1 to 11 to obtain a prediction result.
13. A data processing apparatus, characterized in that the apparatus comprises:
    a processing module, configured to obtain local gradient information of at least one network layer of a neural network model based on a current iteration performed on the neural network model;
    a transceiver module, configured to transmit local gradient information of a first network layer in the neural network model with at least one second working node;
    the processing module being further configured to update, in parallel, parameters of a second network layer in the neural network model in the process in which the transceiver module transmits the local gradient information of the first network layer in the neural network model with the at least one second working node.
14. The data processing apparatus according to claim 13, characterized in that
    the processing module is further configured to determine, based on connection relationships of multiple network layers of the neural network model, dependency relationships between multiple operations of the current iteration, the multiple operations including at least a transmission operation and a parameter update operation for the local gradient information of at least one network layer in the neural network model, and to perform the multiple operations based on the dependency relationships between the multiple operations.
15. The data processing apparatus according to claim 13 or 14, characterized in that the processing module updates the parameters of multiple network layers in the neural network model layer by layer in reverse order; and/or
    a network depth of the second network layer is greater than a network depth of the first network layer.
16. The data processing apparatus according to any one of claims 13 to 15, characterized in that
    the processing module is specifically configured to, in the process in which the transceiver module transmits the local gradient information of the first network layer in the neural network model with the at least one second working node, update the parameters of the second network layer in parallel upon determining that the operations on which the parameter update operation of the second network layer depends have been completed, wherein the operations on which the parameter update operation depends include transmitting the local gradient information of the second network layer with the at least one second working node.
17. The data processing apparatus according to any one of claims 13 to 16, characterized in that
    the processing module is further configured to compute local gradient information of a third network layer in the neural network model in the process in which the transceiver module transmits the local gradient information of the first network layer in the neural network model with the at least one second working node.
18. The data processing apparatus according to any one of claims 13 to 17, characterized in that
    the processing module is further configured to perform at least one inner-layer iteration on the neural network model to obtain intermediate fusion gradient information corresponding to the at least one inner-layer iteration;
    the processing module is specifically configured to obtain target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and local gradient information corresponding to the current iteration; the local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.
19. The data processing apparatus according to claim 18, characterized in that
    the processing module is specifically configured to accumulate the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model.
20. The data processing apparatus according to claim 18 or 19, characterized in that
    the transceiver module is further configured to transmit, in parallel with the at least one second working node, target fusion gradient information of a fourth network layer of the neural network model in the process in which the processing module obtains target fusion gradient information of a third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration.
21. The data processing apparatus according to any one of claims 13 to 20, characterized in that
    the processing module is further configured to amplify each value in the local gradient information of the first network layer by a factor of M and convert each amplified value to half precision, M being a real number greater than 1.
22. The data processing apparatus according to any one of claims 13 to 21, characterized in that
    the processing module is further configured to convert each value included in the obtained local gradient information of the second network layer to single precision, and to reduce each converted value by a factor of M to obtain processed gradient information, M being a real number greater than 1;
    the processing module is specifically configured to update the parameters of the second network layer in the neural network model using the processed gradient information.
23. The data processing apparatus according to any one of claims 13 to 22, characterized in that
    the processing module is further configured to store the computed local gradient information of the first network layer in a pre-allocated target storage space based on an offset corresponding to the first network layer, wherein the target storage space is used to store local gradient information of multiple network layers of the neural network model;
    wherein the local gradient information of the first network layer sent by the transceiver module is obtained from the target storage space based on the offset corresponding to the first network layer, and/or the processing module is further configured to update the local gradient information of the first network layer stored in the target storage space based on the local gradient information of the first network layer received from the at least one second working node.
24. A data processing apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain an image to be processed;
    a processing module, configured to perform prediction processing on the image to be processed using a neural network model trained according to any one of claims 1 to 11 to obtain a prediction result.
25. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprising program instructions that, when executed by a processor of a mobile device, cause the processor to perform the method according to any one of claims 1 to 12.
26. An electronic device, characterized in that the electronic device comprises a memory and a processor, wherein the memory is configured to store instructions and the processor is configured to execute the instructions stored in the memory, so that the processor performs the method according to any one of claims 1 to 12.
27. A computer program which, when executed by a processor, causes the processor to perform the method according to any one of claims 1 to 12.
PCT/CN2021/095844 2020-06-03 2021-05-25 Training method for neural network model, and related product WO2021244354A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020227010791A KR20220054861A (en) 2020-06-03 2021-05-25 Training methods for neural network models and related products

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010496921.7A CN111723933B (en) 2020-06-03 2020-06-03 Training method of neural network model and related products
CN202010496921.7 2020-06-03

Publications (1)

Publication Number Publication Date
WO2021244354A1 2021-12-09

Family

ID=72565896

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095844 WO2021244354A1 (en) 2020-06-03 2021-05-25 Training method for neural network model, and related product

Country Status (4)

Country Link
KR (1) KR20220054861A (en)
CN (1) CN111723933B (en)
TW (1) TW202147188A (en)
WO (1) WO2021244354A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723933B (en) * 2020-06-03 2024-04-16 上海商汤智能科技有限公司 Training method of neural network model and related products
CN112288083A (en) * 2020-10-21 2021-01-29 周宇浩 Neural network distributed training method, device, equipment and storage medium
CN115222038A (en) * 2021-04-16 2022-10-21 华为技术有限公司 Gradient transmission method and related device
CN112866041B (en) * 2021-04-23 2022-04-19 南京蓝洋智能科技有限公司 Adaptive network system training method
CN113626652B (en) * 2021-10-11 2021-12-17 北京一流科技有限公司 Data processing network system, data processing network deployment system and method thereof
CN115328579B (en) * 2022-10-11 2023-02-24 山东海量信息技术研究院 Scheduling method and system for neural network training and computer readable storage medium
CN115688867A (en) * 2022-11-15 2023-02-03 抖音视界有限公司 Method, apparatus, device and storage medium for training neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021395A (en) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Data parallel processing method and system for neural network
CN109919313A (en) * 2019-01-31 2019-06-21 华为技术有限公司 A kind of method and distribution training system of gradient transmission
CN110379416A (en) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 A kind of neural network language model training method, device, equipment and storage medium
CN110378472A (en) * 2019-07-24 2019-10-25 苏州浪潮智能科技有限公司 A kind of data parallel training method, device and the equipment of deep neural network model
CN111723933A (en) * 2020-06-03 2020-09-29 上海商汤智能科技有限公司 Training method of neural network model and related product

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6197791B2 (en) * 2012-07-30 2017-09-20 日本電気株式会社 Distributed processing apparatus, distributed processing system, and distributed processing method
US10152676B1 (en) * 2013-11-22 2018-12-11 Amazon Technologies, Inc. Distributed training of models using stochastic gradient descent
US10949746B2 (en) * 2016-10-27 2021-03-16 International Business Machines Corporation Efficient parallel training of a network model on multiple graphics processing units
US11288575B2 (en) * 2017-05-18 2022-03-29 Microsoft Technology Licensing, Llc Asynchronous neural network training
US11093827B2 (en) * 2017-09-20 2021-08-17 International Business Machines Corporation Variable ISA vector-based compaction in distributed training of neural networks
CN107578094A (en) * 2017-10-25 2018-01-12 济南浪潮高新科技投资发展有限公司 The method that the distributed training of neutral net is realized based on parameter server and FPGA
US11630994B2 (en) * 2018-02-17 2023-04-18 Advanced Micro Devices, Inc. Optimized asynchronous training of neural networks using a distributed parameter server with eager updates
CN109600255A (en) * 2018-12-04 2019-04-09 中山大学 A kind of parameter server optimization algorithm of decentralization
CN109871942B (en) * 2019-02-19 2021-06-11 上海商汤智能科技有限公司 Neural network training method, device, system and storage medium
CN110600020B (en) * 2019-09-12 2022-05-17 上海依图信息技术有限公司 Gradient transmission method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114792125A (en) * 2022-04-15 2022-07-26 北京百度网讯科技有限公司 Data processing method and device based on distributed training, electronic equipment and medium
CN116955365A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Gradient data synchronization method, model training method, system, equipment and medium
CN116955365B (en) * 2023-09-21 2024-02-09 浪潮电子信息产业股份有限公司 Gradient data synchronization method, model training method, system, equipment and medium

Also Published As

Publication number Publication date
CN111723933A (en) 2020-09-29
KR20220054861A (en) 2022-05-03
CN111723933B (en) 2024-04-16
TW202147188A (en) 2021-12-16

Similar Documents

Publication Publication Date Title
WO2021244354A1 (en) Training method for neural network model, and related product
WO2022027937A1 (en) Neural network compression method, apparatus and device, and storage medium
US20210295168A1 (en) Gradient compression for distributed training
US11948352B2 (en) Speculative training using partial gradients update
Zhou et al. Accelerating deep learning inference via model parallelism and partial computation offloading
WO2022001550A1 (en) Address generation method, related device and storage medium
CN111723932A (en) Training method of neural network model and related product
CN114428907B (en) Information searching method, device, electronic equipment and storage medium
CN111343602B (en) Joint layout and task scheduling optimization method based on evolutionary algorithm
JP2017157215A (en) Neural network analysis
US20200311511A1 (en) Accelerating neuron computations in artificial neural networks by skipping bits
CN116762080A (en) Neural network generation device, neural network operation device, edge device, neural network control method, and software generation program
WO2023185896A1 (en) Text generation method and apparatus, and computer device and storage medium
CN116820577A (en) Parallel processing method and device for model, first computing equipment and electronic equipment
CN116668351A (en) Quality of service prediction method, device, computer equipment and storage medium
CN115907041A (en) Model training method and device
CN115688917A (en) Neural network model training method and device, electronic equipment and storage medium
US20210312325A1 (en) Mixed-precision neural processing unit (npu) using spatial fusion with load balancing
US11531578B1 (en) Profiling and debugging for remote neural network execution
US11263517B1 (en) Flexible weight expansion
US20160171419A1 (en) Assistance service facilitation
EP3948685A1 (en) Accelerating neuron computations in artificial neural networks by skipping bits
US20240211758A1 (en) Method for Training Artificial Intelligence Model and Related Device
US20240020510A1 (en) System and method for execution of inference models across multiple data processing systems
WO2024007938A1 (en) Multi-task prediction method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21818540

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20227010791

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2022530257

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 522431767

Country of ref document: SA

122 Ep: pct application non-entry in european phase

Ref document number: 21818540

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 19/05/2023)
