CN109919313B - Gradient transmission method and distributed training system - Google Patents

Gradient transmission method and distributed training system

Info

Publication number: CN109919313B
Authority: CN (China)
Prior art keywords: gradient, neural network, network model, transmission, layer
Legal status: Active
Application number: CN201910101338.9A
Other languages: Chinese (zh)
Other versions: CN109919313A (en)
Inventors: 徐华, 徐宇啸, 吕跃强
Current Assignee: Huawei Cloud Computing Technologies Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN201910101338.9A
Publication of CN109919313A
Application granted
Publication of CN109919313B

Abstract

The application discloses a gradient transmission method and a distributed training system, which are used for improving the transmission efficiency of gradients generated in a training process and thereby improving the training efficiency of distributed training. The method comprises the following steps: acquiring the gradients of the weights corresponding to the i-th layer neurons of the neural network model according to the input training data; sending the gradients of the weights corresponding to the i-th layer neurons to a gradient cache region; judging whether the number of gradients stored in the gradient cache region exceeds a transmission threshold value; sending the gradients stored in the gradient cache region to a gradient collection module according to the judgment result; acquiring a gradient mean value of the weights corresponding to the i-th layer neurons of the neural network model according to the gradients sent by the plurality of neural network models and stored in the gradient collection module; and updating the corresponding weights according to the gradient mean value of the weights corresponding to the i-th layer neurons so as to execute the next iteration of the neural network model.

Description

Gradient transmission method and distributed training system
Technical Field
The embodiment of the application relates to the technical field of information, in particular to a gradient transmission method and a distributed training system.
Background
At present, Artificial Intelligence (AI) is receiving wide attention, and its core technologies have made major breakthroughs in various fields. One of the core technologies of AI is deep learning, which is a machine learning technology based on a neural network model. The neural network model includes multiple layers of neurons, and each layer of neurons corresponds to at least one weight. A neural network model can be put into normal use only after multiple iterations; an iteration of the neural network model can be understood as follows: determining the optimal weights from massive training data so as to minimize the difference between the prediction result of the neural network model and the prior knowledge.
In the iteration of the neural network model, multiple training devices can be adopted for distributed training in order to improve training efficiency. In one iteration performed by multiple training devices, the gradients that the devices calculate for any given weight may differ, so the devices need to transmit the calculated gradient of each weight in order to determine its gradient mean. The devices then update the weights using the gradient mean values, so that for any weight the updated value is the same on every training device. After updating the weights corresponding to each layer of neurons, each training device uses the updated weights corresponding to each layer of neurons to perform the next iteration of the neural network model.
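For illustration only, the synchronization described above can be traced for a single weight shared by two training devices; the numbers and the learning rate below are assumptions for the example, not values from the application:

```python
# Two training devices compute different gradients for the same weight in one iteration.
gradient_device_1 = 0.2
gradient_device_2 = 0.4
gradient_mean = (gradient_device_1 + gradient_device_2) / 2   # 0.3

# Both devices apply the same mean, so the updated weight stays identical on both.
weight, learning_rate = 1.0, 0.1
weight = weight - learning_rate * gradient_mean               # 0.97 on every device
```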
In the process of one iteration, how to synchronize the gradients obtained by the training devices has a large influence on the training efficiency.
Disclosure of Invention
The embodiment of the application provides a gradient transmission method, which is used for improving the efficiency of synchronization of gradients generated in a training process among neural network models.
In a first aspect, the present invention provides a gradient transmission method applied to a distributed training system for a neural network model. The distributed training system includes a plurality of neural network models, each neural network model includes n layers of neurons, each layer of neurons corresponds to at least one weight, and n is a positive integer. The plurality of neural network models perform an iteration at the same time, and each of the plurality of neural network models follows a similar process when performing an iteration; any one neural network model in the distributed training system is taken as an example for explanation.
In one iteration of the neural network model, training data is first input, and the gradients of the weights corresponding to the i-th layer neurons of the neural network model are acquired according to the input training data, where i is a positive integer not greater than n; the gradients of the weights corresponding to the i-th layer neurons are then sent to a gradient cache region of the neural network model. After the gradients are sent to the gradient cache region, it can be judged whether the number of gradients stored in the gradient cache region of the neural network model exceeds the determined transmission threshold value; according to the judgment result, the gradients stored in the gradient cache region of the neural network model are sent to a gradient collection module, after which the gradient cache region holds no gradients. Generally, when the number of gradients stored in the gradient cache region exceeds the determined transmission threshold value, the gradients stored in the gradient cache region are sent to the gradient collection module. The gradient collection module stores the gradients sent by the plurality of neural network models, and a gradient mean value of the weights corresponding to the i-th layer neurons of the neural network model is further obtained according to the gradients sent by the plurality of neural network models and stored in the gradient collection module. Finally, the weights corresponding to the i-th layer neurons of the neural network model are updated according to this gradient mean value, so as to execute the next iteration of the neural network model.
By setting a transmission threshold value and comparing the number of stored gradients with the transmission threshold value to determine whether to transmit the gradients, a gradient mean value is determined according to the gradients of the plurality of neural network models and the weight values are updated using the gradient mean value, so that gradient transmission is realized and the training efficiency of distributed training is improved. In addition, the transmission threshold value is determined according to the neural network model, so different neural network models adopt different transmission threshold values, which avoids the loss of transmission efficiency caused by a mismatch between the transmission threshold value and the neural network model and further improves the training efficiency of distributed training.
With reference to the first aspect, in a first possible implementation manner of the first aspect, before determining whether the number of gradients stored in a gradient cache of the neural network model exceeds a determined transmission threshold, a transmission threshold set may be determined, where there are m candidate transmission thresholds in the transmission threshold set, and a transmission duration corresponding to each candidate transmission threshold in the m candidate transmission thresholds in the transmission threshold set is obtained through m iterations; and then determining the transmission threshold according to the m transmission durations.
Further, a set of thresholds comprising at least two alternative transmission thresholds may be predetermined, and the transmission threshold is determined among the at least two alternative transmission thresholds during an iteration of the distributed training system to achieve a maximization of the transmission efficiency.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, when the transmission threshold is determined according to the m transmission durations, the alternative transmission threshold corresponding to the shortest transmission duration of the m transmission durations may be selected as the transmission threshold.
Further, the alternative transmission threshold corresponding to the shortest transmission duration may be used as the final transmission threshold, so as to implement fast transmission and improve the efficiency of distributed training.
With reference to the first or second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, each candidate transmission threshold is not less than the number of weights corresponding to any layer of neurons of the neural network model.
In a second aspect, the present invention provides a distributed training system having functional modules for implementing the method of the first aspect and any one of its possible implementations. The functional module can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above.
The distributed training system comprises a plurality of neural network models and a plurality of transmission modules, wherein the neural network models correspond to the transmission modules one to one, each neural network model comprises n layers of neurons, each layer of neurons corresponds to at least one weight, and n is a positive integer.
The neural network model is used for acquiring the gradient of the weight corresponding to the neuron at the ith layer of the neural network model according to input training data in one iteration, wherein i is a positive integer not greater than n; sending the gradient of the weight corresponding to the i-th layer neuron of the neural network model to a gradient cache region of the neural network model;
the transmission module is used for judging whether the number of the gradients stored in the gradient cache area of the neural network model exceeds the determined transmission threshold value; sending the gradient stored in the gradient cache region of the neural network model to a gradient collection module according to the judgment result;
the neural network model is also used for obtaining a gradient mean value of weights corresponding to the neurons in the layer i of the neural network model, which is obtained according to the gradients sent by the plurality of neural network models and stored in the gradient collection module; and updating the weight corresponding to the i-th layer neuron of the neural network model according to the gradient mean value of the weight corresponding to the i-th layer neuron of the neural network model so as to execute the next iteration of the neural network model.
By setting a transmission threshold value and comparing the number of stored gradients with the transmission threshold value to determine whether to transmit the gradients, a gradient mean value is determined according to the gradients of the plurality of neural network models and the weight values are updated using the gradient mean value, so that gradient transmission is realized and the training efficiency of distributed training is improved. In addition, the transmission threshold value is determined according to the neural network model, so different neural network models adopt different transmission threshold values, which avoids the loss of transmission efficiency caused by a mismatch between the transmission threshold value and the neural network model and further improves the training efficiency of distributed training.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the transmission module is further configured to obtain, through m iterations, a transmission duration corresponding to each alternative transmission threshold in m alternative transmission thresholds in the transmission threshold set; and determining the transmission threshold according to the m transmission durations, and triggering the transmission module to execute a step of judging whether the number of gradients stored in a gradient cache region of the neural network model exceeds the determined transmission threshold.
Further, a set of thresholds comprising at least two alternative transmission thresholds may be predetermined, and the transmission threshold is determined among the at least two alternative transmission thresholds during an iteration of the distributed training system to achieve a maximization of the transmission efficiency.
With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the transmission module is configured to, when determining the transmission threshold according to the m transmission durations, specifically: and selecting the alternative transmission threshold corresponding to the shortest transmission time length in the m transmission time lengths as the transmission threshold.
Further, the alternative transmission threshold corresponding to the shortest transmission duration may be used as the transmission threshold, so as to implement fast transmission and improve the efficiency of distributed training.
With reference to the first or second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, each candidate transmission threshold is not less than the number of weights corresponding to any layer of neurons of the neural network model.
In a third aspect, the present invention provides a distributed training system comprising at least one training device, each training device comprising: a processor and a memory. The memory for storing computer instructions; the processor is configured to execute the computer instructions in the memory to implement the method of any one of the possible implementations of the first aspect and the first aspect.
In a fourth aspect, the present invention provides a non-transitory computer readable storage medium having stored therein computer instructions for execution by a processor to implement the method of the first aspect described above and any of the possible implementations of the first aspect.
In a fifth aspect, the present invention provides a computer program product which, when read and executed by a computer, causes the computer to perform the method of any of the possible implementations of the first aspect and the first aspect described above.
In a sixth aspect, the present invention provides a chip, coupled to a memory, for reading and executing a software program stored in the memory to implement the method in any one of the possible implementations of the first aspect and the first aspect.
Drawings
FIG. 1A is a schematic diagram of a neural network model provided in an embodiment of the present application;
fig. 1B is a schematic diagram of a distributed training system provided in an embodiment of the present application;
FIG. 1C is a schematic diagram of a decentralized distributed training system provided in an embodiment of the present application;
FIG. 2A is a schematic diagram of an iterative process provided in an embodiment of the present application;
FIG. 2B is a schematic diagram of a gradient transmission process provided in an embodiment of the present application;
fig. 3 is a schematic diagram of a process for determining a transmission threshold according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a process for determining a set of thresholds provided in an embodiment of the present application;
fig. 5 is a schematic diagram of a distributed training system provided in an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
To facilitate understanding of the present solution, a neural network model is first introduced. It should be understood that the neural network model is a network model simulating the behavior characteristics of the animal neural network, and the network model achieves the aim of processing information by adjusting the interconnection relationship among a large number of internal nodes depending on the complexity degree of the network model.
The process of training the neural network is the process of learning the weights corresponding to the neurons, and the final purpose is to obtain the weights corresponding to each layer of neurons of the trained neural network model.
A detailed description of a training process applied to a possible neural network model according to an embodiment of the present application is provided below with reference to fig. 1A.
Fig. 1A is a schematic block diagram of a neural network model 100 provided in an embodiment of the present application. The neural network model 100 includes n layers of neurons, each of which includes one or more neurons, all neurons of each layer being connected to all neurons of the next layer. Taking the neural network model 100 in fig. 1A as an example, referring to fig. 1A, the 1 st layer includes two neurons, each of the 2 nd to n-1 th layers includes three neurons, and the nth layer includes one neuron, where n is a positive integer not less than 2, and i in fig. 1A is a positive integer not greater than n and not less than 1. Each neuron has a corresponding weight.
One iteration in the training process of the neural network model 100 is described in detail below.
Training data is obtained from a training data set and used as the input of layer 1 of the neural network model 100; after this input passes through the neurons of the n layers, a prediction result is output from layer n. Specifically, each layer of neurons has corresponding weights. The training data is input to the first layer of neurons, which output the values of the first layer of neurons based on the corresponding weights. The output values of the first layer of neurons are used as the inputs of the second layer of neurons, which output the values of the second layer of neurons based on the corresponding weights, and so on, until the prediction result is finally output from the n-th layer.
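For illustration only, the layer-by-layer forward propagation described above might be sketched as follows; the layer widths (loosely following the shape in fig. 1A), the use of NumPy, and the sigmoid activation are assumptions made for this sketch rather than details of the application:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed layer widths: 2 inputs, two hidden layers of 3 neurons, 1 output neuron.
layer_sizes = [2, 3, 3, 1]
rng = np.random.default_rng(0)

# One weight matrix per layer of neurons (the initialization step mentioned in the text).
weights = [rng.standard_normal((layer_sizes[i], layer_sizes[i + 1]))
           for i in range(len(layer_sizes) - 1)]

def forward(x, weights):
    """Feed training data into layer 1; each layer's output is the next layer's input."""
    activation = x
    for w in weights:
        activation = sigmoid(activation @ w)
    return activation  # prediction output by the n-th layer

prediction = forward(np.array([0.5, -1.2]), weights)
```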
In the process of training the neural network model 100, it is desirable that the prediction result output by the n-th layer of the neural network model 100 be as close as possible to the prior knowledge of the training data, which is also called the ground truth and generally includes the true result, provided by a human, corresponding to the training data. Therefore, by comparing the current prediction result with the prior knowledge, the weights corresponding to each layer of neurons in the neural network model 100 can be updated according to the difference between them (of course, before the first update, an initialization process is usually performed to initialize the weights corresponding to each layer of neurons in the neural network model 100). Thus, after the prediction result output by the n-th layer is obtained, an error algorithm is adopted to correct the weights corresponding to the neurons according to the prediction result and the prior knowledge, which is specifically as follows.
A loss function is calculated according to the prediction result and the prior knowledge, and the weights corresponding to each layer of neurons in the neural network model 100 are corrected along the direction from the n-th layer to the 1st layer according to the loss function. The weights may be corrected by calculating a gradient for each weight; the gradient is derived from the loss function by taking the derivative of the loss function with respect to the weight.
Correcting the weights according to the prediction result and the prior knowledge comprises calculating a loss function from the prediction result and the prior knowledge, and calculating the gradients of the weights corresponding to each layer of neurons along the direction from the n-th layer to the 1st layer according to the loss function. In other words, the computation of the gradients of the weights corresponding to each layer of neurons is performed layer by layer in the order from the n-th layer to the 1st layer, and after the computation of the gradients of the weights corresponding to the i-th layer of neurons is completed, the computation of the gradients of the weights corresponding to the (i-1)-th layer of neurons is started. After the gradients of the weights corresponding to each layer of neurons are obtained, the weights corresponding to each layer of neurons are corrected according to the respective gradients, which completes one iteration.
In the process of multiple iterations, the weights corresponding to the neurons in each layer are continuously corrected, so that the prediction result output by the neural network model 100 is as close to the prior knowledge of the training data as possible.
The gradient transmission method is applied to a distributed training system of a neural network model. A distributed training system includes a plurality of training devices. The same neural network model is deployed on each training device, and each training device respectively acquires different training data in the training data set for training. After a plurality of iterations, the plurality of training devices obtain a plurality of same trained neural network models, and the trained neural network model on any one training device is the neural network model for completing the distributed training.
In addition to this, each training device comprises a transmission module. The process of distributed training is illustrated by example of distributed training system 200 in FIG. 1B. Illustratively, only the training devices 210 and 220 in the distributed training system 200 are drawn in the embodiment of the present invention, and the number of actual training devices may be larger.
In one iteration of distributed training, each training device acquires training data from a training data set, and the training data corresponding to each training device is different. Each training device inputs training data into the neural network model on the corresponding training device, for example, training device 210 inputs acquired training data into neural network model 211 on training device 210, and training device 220 inputs acquired training data into neural network model 221 on training device 220. And each neural network model respectively obtains a prediction result aiming at the input training data, and calculates the gradient of the weight corresponding to each layer of neuron based on the prediction result and the prior knowledge. Because the training data input into each neural network model is different, the gradient calculated by each neural network model in one iteration is different. That is, in the distributed training system 200 shown in fig. 1B, the training data input to the neural network model 211 is different from the training data input to the neural network model 221 in one iteration, and then the gradient obtained by the neural network model 211 is different from the gradient obtained by the neural network model 221 based on the different training data. If the weights corresponding to the neurons in each layer in the neural network model 211 are adjusted according to the gradient obtained by the neural network model 211, the weights corresponding to the neurons in each layer in the neural network model 221 are adjusted according to the gradient obtained by the neural network model 221, after several iterations, the weights corresponding to the neurons in each layer in the trained neural network model 211 are different from the weights corresponding to the neurons in each layer in the neural network model 221, and the trained neural network model 211 is different from the neural network model 221.
Therefore, in each iteration, after a neural network model obtains the gradients of the weights, it sends them to the corresponding gradient cache region for storage; when the transmission module in a training device determines that the number of gradients stored in the gradient cache region is greater than the transmission threshold value, that transmission module needs to transmit the gradients of the weights. Specifically, transmission module 212 in training device 210 sends the gradients of the weights to gradient collection module 230, and transmission module 222 in training device 220 sends the gradients of the weights to gradient collection module 230. The gradient collection module stores the gradients from each gradient cache region, and the gradient mean of a weight is calculated according to the gradients, stored in the gradient collection module, from the gradient cache region of each neural network model. Each neural network model then updates the corresponding weight according to the calculated gradient mean. Optionally, the gradient mean of a weight may be obtained by averaging the gradients of that weight obtained by the neural network models, by taking their weighted average, or by applying other processing to those gradients.
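For illustration only, a gradient collection module's averaging step might look like the following sketch; the function and variable names are assumptions, and both the plain average and the weighted average mentioned above are shown:

```python
import numpy as np

def gradient_mean(per_model_gradients, model_weights=None):
    """per_model_gradients: list with one gradient array per neural network model,
    all for the same weights. Returns the (optionally weighted) gradient mean."""
    grads = np.stack(per_model_gradients)
    if model_weights is None:
        return grads.mean(axis=0)                      # plain average
    w = np.asarray(model_weights, dtype=float)
    return (grads * w[:, None]).sum(axis=0) / w.sum()  # weighted average

# Two training devices reporting gradients for the same layer of weights.
g_device_210 = np.array([0.20, -0.10, 0.05])
g_device_220 = np.array([0.40,  0.10, 0.15])
mean_grad = gradient_mean([g_device_210, g_device_220])   # -> [0.3, 0.0, 0.1]
```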
Each training device in distributed training system 200 may be deployed in a centralized or decentralized manner. If the training devices are centrally deployed, the gradient collection module is either independent of the plurality of training devices or deployed on one of them, and each training device transmits the gradients it obtains to the gradient collection module. If the training devices are deployed in a decentralized manner, each training device is provided with its own gradient collection module. Fig. 1B depicts distributed training system 200 as a centralized deployment in which gradient collection module 230 is independent of training devices 210 and 220. In the decentralized deployment of distributed training system 200 shown in fig. 1C, a gradient collection module 2301 is deployed on training device 210 and a gradient collection module 2302 is deployed on training device 220.
When the distributed training system is adopted to train the neural network model, a plurality of training devices in the distributed training system execute the step of training the neural network model in parallel. Referring to fig. 2A, a training process of the distributed training system on the neural network model is described as an example of a training device in the distributed training system.
Step 21: acquiring training data from a training data set, and inputting the acquired training data into a neural network model; and obtaining a prediction result aiming at the input training data according to the weight corresponding to each layer of neuron in the neural network model.
In the process of the first iteration, the weight corresponding to each layer of neurons is determined by an initialization process, and the initialization process is a process of setting the weight corresponding to each layer of neurons for each neural network model.
Step 22: and acquiring the prior knowledge of the training data, and calculating a loss function according to the prediction result and the prior knowledge of the training data.
Step 23: and acquiring a transmission threshold, calculating the gradient of the weight corresponding to each layer of neuron according to the loss function, and transmitting the gradient of the weight corresponding to each layer of neuron according to the transmission threshold.
When calculating the gradient of the weight corresponding to each layer of neurons according to the loss function, it is necessary to sequentially calculate the gradient of the weight corresponding to each layer of neurons in the order from the last layer of neurons (nth layer of neurons) to the first layer of neurons.
Since the gradient mean of the weights needs to be calculated, the gradients of the weights corresponding to each layer of neurons need to be transmitted after they are calculated. Transmitting the gradients of the weights corresponding to each layer of neurons means sending them to the gradient collection module, so that the gradient collection module can calculate the gradient mean of the weights. When transmitting the gradients, one option is to transmit the gradients of the weights corresponding to all layers of neurons only after all of them have been calculated. In order to improve efficiency, however, the already calculated gradients may be transmitted once the gradients of the weights corresponding to several layers of neurons have been calculated, and the gradients of the weights corresponding to the preceding layers (toward layer 1, the first layer) can continue to be calculated during the transmission, so as to reduce the time delay caused by waiting for gradient transmission before the next iteration starts. Specifically, after the gradients of the weights corresponding to one layer of neurons are calculated, they are cached in a gradient buffer on the training device; a transmission threshold is set, the number of gradients cached in the gradient buffer is compared with the transmission threshold, and one transmission is started once the number of cached gradients exceeds the transmission threshold.
Setting the transmission threshold is key to improving training efficiency, and different neural network models should have different transmission thresholds. Because starting a transmission takes time, an excessively small transmission threshold causes transmissions to be started frequently and therefore cannot effectively shorten the delay caused by transmission; conversely, when the total number of weights corresponding to each layer of neurons in the neural network model is small, an excessively large transmission threshold cannot effectively improve transmission efficiency. Therefore, the transmission threshold adopted by a neural network model needs to be determined according to the neural network model, and the transmission thresholds determined for different neural network models are different. Optionally, the transmission threshold is determined according to the total number of weights corresponding to each layer of neurons of the neural network model. The specific process of determining the transmission threshold according to the neural network model is described later.
The specific process of calculating the gradient of the weight corresponding to each layer of neurons according to the loss function in step 23 and transmitting the gradient of the weight corresponding to each layer of neurons according to the transmission threshold can be seen from the description of steps 231 to 236.
Step 24: for each weight in the weights corresponding to each layer of neurons, the gradient collection module calculates a gradient mean of the weight.
Step 25: and obtaining a gradient mean value corresponding to each weight, and updating the corresponding weight by adopting the gradient mean value.
After the above steps 21 to 25, the training device completes one iteration, obtains new training data in the training data set, and performs the next iteration based on the updated weights. Each training device completes an iteration using different training data according to steps 21-25. After the weights are updated, the updated weights of each training device are the same, and the plurality of training devices can perform the next iteration on the neural network model until the loss function calculated in step 22 meets the set condition, and the neural network model is considered to be trained.
The following describes in detail a specific process of calculating the gradient of the weight corresponding to each layer of neurons according to the loss function in step 23 in one iteration and transmitting the gradient of the weight corresponding to each layer of neurons according to the transmission threshold, and refer to fig. 2B specifically:
step 231: and calculating the gradient of each weight corresponding to the neuron in the current layer, and after calculating the gradient of each weight, sending the calculated gradient to a gradient cache region for caching.
The neural network model comprises n layers of neurons, the i-th layer of neurons is used as the current layer of neurons, and i is a positive integer not larger than n. And sequentially calculating gradients corresponding to the weights corresponding to the neurons of each layer according to the sequence from the neurons of the nth layer to the neurons of the 1 st layer in one iteration, namely the initial value of i is n.
After each weighted gradient is computed, the gradient may be buffered.
Step 232: it is determined whether computation of the gradient of weights corresponding to all layer neurons of the neural network model is complete. Particular embodiments of determining whether to complete the computation of the gradient of weights for all layer neurons include determining whether a current layer neuron is a first layer neuron of the neural network model. If the current layer neuron is a layer i neuron, it is determined whether i is 1. If the computation of the gradients of weights corresponding to all layer neurons has not been completed, go to step 233; if the computation of the gradients of weights corresponding to all layer neurons is completed, step 236 is performed directly, and the transmission of the gradients of weights corresponding to all neurons is completed in step 236.
Step 233: determining whether the number of gradients stored in the current gradient cache region is not less than the transmission threshold value; if so, executing step 234 to transmit the cached gradients; if not, executing step 235 to continue calculating the gradients of the weights corresponding to the previous layer of neurons. In step 233, the size (storage capacity) of the gradients stored in the current gradient cache region may also be compared with the transmission threshold value: if the size of the gradients stored in the current gradient cache region is greater than the transmission threshold value, step 234 is executed; otherwise, step 235 is executed.
Step 234: and sending the current cached gradient to a gradient collecting module.
Step 234 further comprises sending the currently cached gradient to the gradient collection module, and then deleting the gradient that has been transmitted to the gradient collection module, and after step 234 is completed, there is no transmitted gradient in the gradient buffer.
Step 235: with layer i-1 as the current layer of neurons, step 231 is performed to calculate the gradient of the weights corresponding to the neurons of the previous layer. At this time, the computation of the gradients of the weights corresponding to all layer neurons is not completed, that is, the layer neuron which is currently computed is not the first layer neuron, and therefore, the computation of the gradient of the weight corresponding to the previous layer neuron of the layer neuron which is currently computed is also required.
Step 236: and sending the current cached gradient to a gradient collecting module.
At this time, the gradient of the weight corresponding to the layer 1 neuron is calculated, that is, the gradient calculation of the weight corresponding to all the layer neurons is completed, and the transmission of the gradient of the weight corresponding to all the neurons needs to be completed. At this time, the gradient cache region still stores the currently cached gradient (the currently cached gradient includes the gradient of the weight corresponding to the layer 1 neuron calculated in step 231), and the currently cached gradient is not transmitted yet, so that the currently cached gradient needs to be transmitted, that is, the currently cached gradient is sent to the gradient collection module. At this time, before sending the currently cached gradient to the gradient collection module, it is not necessary to determine whether the number of the cached gradients in the cache region is not less than the transmission threshold. After step 236, the gradient already transmitted in the buffer may also be deleted, and the computation and transmission of the gradient of the weight corresponding to all layer neurons in one iteration is completed.
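For illustration only, the flow of steps 231 to 236 might be sketched as follows; the gradient computation and the send operation are stubbed out, and all names are assumptions rather than the application's implementation:

```python
def backward_and_transmit(model_layers, transmission_threshold, send_to_collector,
                          compute_layer_gradients):
    """Walk the layers from n down to 1, buffering gradients and flushing the buffer
    whenever it holds at least `transmission_threshold` gradients (steps 231-236)."""
    gradient_buffer = []
    for layer in reversed(range(len(model_layers))):           # layer n ... layer 1
        # Step 231: compute the gradients of this layer's weights and cache them.
        gradient_buffer.extend(compute_layer_gradients(model_layers[layer]))
        if layer == 0:
            # Steps 232/236: first layer reached -> send whatever is still cached.
            send_to_collector(gradient_buffer)
            gradient_buffer.clear()
        elif len(gradient_buffer) >= transmission_threshold:
            # Steps 233/234: buffer reaches the threshold -> transmit and empty it.
            send_to_collector(gradient_buffer)
            gradient_buffer.clear()
        # Step 235: otherwise continue with the previous layer.

# Toy usage: each "layer" is represented only by its number of weights here.
backward_and_transmit(
    model_layers=[6, 9, 9, 3],
    transmission_threshold=9,
    send_to_collector=lambda grads: print(f"sent {len(grads)} gradients"),
    compute_layer_gradients=lambda n_weights: [0.0] * n_weights,
)
```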
In the present application, the transmission threshold value is determined according to the neural network model, specifically according to the number of weights corresponding to each layer of neurons in the neural network model. For example, each layer of neurons in the neural network model is traversed to obtain the total number of weights corresponding to each layer of neurons, where the total number of weights corresponding to the i-th layer of neurons is qi; the neural network model includes n layers of neurons, and i is a positive integer not greater than n. The maximum value, the minimum value, the median value, the average value, or the like of the total numbers of weights corresponding to the layers of neurons can be used as the transmission threshold value.
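For illustration only, deriving a single transmission threshold value from the per-layer weight totals, as described above, might look like this minimal sketch; choosing the median here is just one of the listed options, and the names are assumptions:

```python
import statistics

def transmission_threshold_from_model(per_layer_weight_totals, statistic=statistics.median):
    """per_layer_weight_totals[i] is q_(i+1), the number of weights of one layer.
    Any of max, min, median or mean of these totals may serve as the threshold."""
    return int(statistic(per_layer_weight_totals))

q = [6, 9, 9, 3]                                    # q1 ... qn for a small model
threshold = transmission_threshold_from_model(q)    # median 7.5 -> 7 after truncation
```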
Further, a set of thresholds comprising at least two alternative transmission thresholds may be predetermined, and the transmission threshold is determined among the at least two alternative transmission thresholds during an iteration of the distributed training system to achieve a maximization of the transmission efficiency. In particular, the specific process of determining each alternative transmission threshold value in the set of threshold values according to the neural network model can be seen in steps 41-44 in fig. 4.
The process of selecting an alternative transmission threshold among the set of thresholds is described below with reference to fig. 3. The threshold set comprises m alternative transmission thresholds, and m is a positive integer not less than 2. FIG. 3 provides an embodiment of determining the transmission threshold through m iterations in the neural network model training process.
Step 31: in the a-th iteration, inputting training data into a neural network model; and obtaining a prediction result aiming at the input training data according to the weight corresponding to each layer of neuron in the neural network model, wherein the initial value of a is 1, and the termination value of a is m.
Step 31 mentioned here is step 21 mentioned above, and the detailed description is omitted.
Step 32: and calculating a loss function according to the prediction result and the prior knowledge.
Step 32 mentioned here is step 22 mentioned above, and the detailed description is omitted.
Step 33: and selecting an alternative transmission threshold value without acquiring the corresponding transmission duration from the threshold value set.
In the present application, the gradients of the weights corresponding to each layer of neurons are transmitted according to the transmission threshold value. In order to determine the final transmission threshold value, the transmission duration corresponding to each alternative transmission threshold value in the threshold value set is determined; the transmission duration corresponding to an alternative transmission threshold value is the duration, in one iteration using that alternative transmission threshold value, from the start of calculating the gradients of the weights corresponding to the n-th layer of neurons to the completion of the transmission of the gradients of the weights corresponding to all n layers of neurons. In the a-th iteration, an alternative transmission threshold value for which the corresponding transmission duration has not yet been obtained may be selected from the threshold value set.
Step 34: and calculating the gradient of the weight corresponding to each layer of neurons according to the loss function, and transmitting the gradient of the weight corresponding to each layer of neurons according to the alternative transmission threshold value.
Step 34 is similar to step 23, and the detailed process is not described herein.
Step 35: and acquiring the transmission duration corresponding to the alternative transmission threshold.
The transmission duration represents the time required for the training device to transmit the gradients of the weights corresponding to each layer of neurons according to the alternative transmission threshold value. Specifically, the transmission duration of the alternative transmission threshold value is the duration used by the training device from the start of calculating the gradients of the weights corresponding to the n-th layer of neurons to the completion of the transmission of the gradients of all layers of neurons.
Step 36: and determining the transmission time length corresponding to each alternative transmission threshold value in the acquired threshold value set.
Step 37: according to the transmission duration corresponding to each alternative transmission threshold, one transmission threshold is selected, the selected transmission threshold is used as the transmission threshold to perform subsequent iteration, and the process of performing iteration according to the transmission threshold can be referred to the above description.
Specifically, the candidate transmission threshold with the shortest transmission duration may be selected as the transmission threshold to be used in the subsequent training.
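For illustration only, the selection over the candidate thresholds described in steps 31 to 37 might be sketched as follows; `run_one_iteration` stands in for a full training iteration (forward pass, gradient computation and threshold-driven transmission), and timing it with `time.perf_counter` is an assumption about how the transmission duration could be measured:

```python
import time

def tune_transmission_threshold(candidate_thresholds, run_one_iteration):
    """Run one training iteration per candidate threshold (steps 31-36), record the
    duration of each, and keep the candidate with the shortest one (step 37)."""
    durations = {}
    for threshold in candidate_thresholds:
        start = time.perf_counter()
        run_one_iteration(threshold)      # compute + transmit gradients with this threshold
        durations[threshold] = time.perf_counter() - start
    return min(durations, key=durations.get)

# Toy usage: pretend larger thresholds finish their transmission slightly faster.
best = tune_transmission_threshold(
    candidate_thresholds=[3, 9, 27],
    run_one_iteration=lambda t: time.sleep(0.01 / t),
)
```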
It should be noted that, when the plurality of alternative transmission threshold values in the threshold value set are determined, each training device may determine the alternative transmission threshold values according to the neural network model; or one or more training devices may determine the alternative transmission threshold values according to the neural network model and distribute them to the other training devices; or a management device may determine the alternative transmission threshold values according to the neural network model and distribute them to each training device.
The sequence of step 33 and step 31 is not limited in this application, and the sequence of step 33 and step 32 is not limited in this application.
The following takes fig. 4 as an example to describe in detail a specific process of determining each alternative transmission threshold in the set of thresholds:
Step 41, traversing each layer of neurons in the neural network model to obtain the total number of weights corresponding to each layer of neurons, where the total number of weights corresponding to the i-th layer of neurons is qi, and calculating the total number s of weights in the neural network model; the neural network model includes n layers of neurons, and i is a positive integer not greater than n.
Step 42, performing de-duplication processing on the total numbers q1 to qn of weights corresponding to the layers of neurons, and determining each total number obtained after de-duplication as an alternative transmission threshold value in the threshold value set.
Step 43, add k candidate transmission thresholds to the set of thresholds.
In order to find out the transmission threshold that can reflect the performance of distributed training, some alternative transmission thresholds may be added to the threshold set, and the number k of the added alternative transmission thresholds may be preset, for example, k is set to 5, 10, 15, and the like.
The sequence of the steps 42 and 43 is not limited.
In an implementable manner, the number k of the added alternative transmission thresholds may be determined according to the following formula:
(formula for k, shown as an image in the original publication, expressing k in terms of the total number of weights s, the maximum per-layer total qmax, and the constant x)
where s is the total number of weights in the neural network model, qmax is the maximum value among the total numbers of weights corresponding to the layers of neurons in the neural network model, and x is a constant that may be 8, 10, 15, or the like; the formula is intended to mean that at least x alternative transmission threshold values are added to the threshold value set determined in step 42.
The k alternative transmission threshold values added to the threshold value set may be randomly generated within a set value range. Generally, each alternative transmission threshold value in the threshold value set should conform to a certain value range in order to meet the needs of the scheme. In order to avoid the overhead of repeatedly establishing communication that would be caused by transmitting the gradients of the weights corresponding to a layer of neurons every time they are calculated, the value of an alternative transmission threshold value may be not less than qmin, where qmin is the minimum value among the total numbers of weights corresponding to the layers of neurons in the neural network model; in order to avoid the long training time that would be caused by transmitting the gradients of the weights corresponding to all layers of neurons only after all of them have been calculated, the value of an alternative transmission threshold value may be not greater than the total number s of weights in the neural network model. Thus the k added threshold values may be randomly selected within the value range [qmin, s] and added to the threshold value set.
In order to make the alternative transmission threshold values in the threshold value set more uniform and to fill the search gap between qmax and the total number s, in an implementable manner, the k additional alternative transmission threshold values may be determined according to the following formula:
pj = qmax + (j + 1) * (s - qmax) / k, where j ∈ [0, k), and each pj is an alternative transmission threshold value added to the threshold value set.
The process of determining the k alternative transmission threshold values according to the above formula may be as follows: the quotient of (a) the difference between the total number s of weights of the neural network model and the maximum value qmax among the total numbers of weights corresponding to the layers of neurons and (b) k is determined as a target step value; the sum of qmax and the target step value is taken as the base; the base is used as the first added alternative transmission threshold value, and each newly determined alternative transmission threshold value is obtained by adding the target step value to the most recently determined one, thereby determining the k alternative transmission threshold values added to the threshold value set.
Step 44, determining whether the alternative transmission threshold is smaller than a preset value for each alternative transmission threshold in the threshold set, and if so, filtering the alternative transmission threshold from the threshold set.
In order to reasonably utilize transmission resources and narrow the search space, the smaller alternative transmission threshold values may be eliminated, that is, alternative transmission threshold values smaller than a preset value are filtered out of the threshold value set, where the preset value may be s/1000.
Step 44 may be performed before or after step 43 and/or step 42, or may be performed simultaneously with step 42 and/or step 43. In general, the step of performing filtering judgment on the alternative transmission threshold may be performed once after one alternative transmission threshold is determined, or may be performed after a part of or all of the alternative transmission thresholds are determined, and the sequence of step 44 and other steps for determining the alternative transmission thresholds is not specifically limited in this application.
Each alternative transmission threshold value of the set of threshold values is determined by steps 41-44 described above.
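For illustration only, steps 41 to 44 might be sketched as follows; the parameter names and the choice of s/1000 as the preset filtering value follow the description above, while everything else is an assumption of this sketch. The k additional values are generated with the step formula given above; randomly drawing them from [qmin, s] would be the alternative also described above.

```python
def build_threshold_set(per_layer_weight_totals, k=10, min_fraction=1000):
    """Sketch of steps 41-44: collect the de-duplicated per-layer weight totals,
    append k evenly spaced values between qmax and the total s, and filter out
    candidates smaller than s / min_fraction."""
    s = sum(per_layer_weight_totals)              # step 41: total number of weights
    candidates = set(per_layer_weight_totals)     # step 42: de-duplicated q1 ... qn
    q_max = max(per_layer_weight_totals)
    step = (s - q_max) / k                        # step 43: target step value
    for j in range(k):                            # p_j = q_max + (j + 1) * (s - q_max) / k
        candidates.add(int(q_max + (j + 1) * step))
    preset = s / min_fraction                     # step 44: drop very small candidates
    return sorted(c for c in candidates if c >= preset)

thresholds = build_threshold_set([6, 9, 9, 3], k=4)
# -> per-layer totals {3, 6, 9} plus 4 step-spaced values between 9 and s = 27
```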
Based on the same inventive concept as the gradient transmission method, the embodiment of the application also provides a distributed training system, wherein the distributed training system comprises a plurality of neural network models and a plurality of transmission modules; the neural network model is embodied as the neural network model 211 or 221 in fig. 1B, and the transmission module is embodied as the transmission module 212 or 222 in fig. 1B.
The neural network model is used for acquiring the gradient of the weight corresponding to the neuron on the ith layer of the neural network model according to input training data in one iteration, wherein i is a positive integer not greater than n;
the transmission module is used for sending the gradient of the weight corresponding to the i-th layer neuron of the neural network model to a gradient cache region of the neural network model; judging whether the number of the gradients stored in a gradient cache region of the neural network model exceeds a determined transmission threshold value or not; sending the gradient stored in the gradient cache region of the neural network model to a gradient collection module according to a judgment result;
the neural network model is further configured to obtain a gradient mean value of weights corresponding to layer i neurons of the neural network model, which is obtained according to gradients sent by the plurality of neural network models and stored in the gradient collection module; and updating the weight corresponding to the i-th layer neuron of the neural network model according to the gradient mean value of the weight corresponding to the i-th layer neuron of the neural network model so as to execute the next iteration of the neural network model.
In an example, the neural network model is further configured to obtain, through m iterations, a transmission duration corresponding to each of m candidate transmission thresholds in a transmission threshold set; and determining the transmission threshold according to the m transmission durations, and triggering the transmission module to execute a step of judging whether the number of gradients stored in a gradient cache region of the neural network model exceeds the determined transmission threshold.
For example, the neural network model is specifically configured to, when determining the transmission threshold according to the m transmission durations: and selecting the alternative transmission threshold corresponding to the shortest transmission time length in the m transmission time lengths as the transmission threshold.
For example, none of the m candidate transmission thresholds is smaller than a minimum value of n total numbers, and one of the n total numbers is a total number of weights corresponding to one layer of neurons in n layers of neurons of the neural network model.
Based on the same inventive concept as the gradient transmission method described above, as shown in fig. 5, an embodiment of the present application further provides a distributed training system 500. The distributed training system includes at least one training device; the neural network model on each training device includes n layers of neurons, each layer of neurons corresponds to at least one weight, n is a positive integer, and each training device includes a processor and a memory. Illustratively, only three training devices in distributed training system 500 are depicted in fig. 5. As shown in fig. 5, the distributed training system includes a training device 50, a training device 51, and a training device 52, each of which includes a processor and a memory: training device 50 includes a processor 501 and a memory 502, training device 51 includes a processor 511 and a memory 512, and training device 52 includes a processor 521 and a memory 522. The memory on a training device in distributed training system 500 is used to store computer instructions, and the processor on the training device executes the computer instructions in the memory to implement the devices and modules of the centralized distributed training system 200 or of the decentralized distributed training system 200. Specifically, the processor 501 and the processor 511 are used to implement the neural network model 211 and the transmission module 212 in the distributed training system 200 in fig. 1B, and the gradient cache region of the neural network model 211 is implemented by the memory 502 and the memory 512; the gradient collection module 230 in the distributed training system 200 of fig. 1B is implemented by the processor 521, in which case the distributed training system 500 is centralized. In the decentralized distributed training system 500, the processor 501, the processor 511, and the processor 521 are all used to implement the neural network model 211, the transmission module 212, and the gradient collection module 2301 as in the distributed training system 200 in fig. 1C, and the gradient cache regions of the neural network model 211 are implemented by the memory 502, the memory 512, and the memory 522.
The training devices in the distributed training system 500 may also include a communication interface. For example, training device 50 includes communication interface 503 and training device 51 includes communication interface 513. A training device communicates through the communication interface on it.
The processor may be a Central Processing Unit (CPU), a Network Processor (NP), or a Graphics Processing Unit (GPU), or any combination of the three.
The processor may further include a hardware chip or other general purpose processor. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The aforementioned PLDs may be Complex Programmable Logic Devices (CPLDs), field-programmable gate arrays (FPGAs), General Array Logic (GAL) and other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., or any combination thereof. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will also be appreciated that the memory referred to in the embodiments of the application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, but not limitation, many forms of RAM are available, such as Static random access memory (Static RAM, SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic random access memory (Synchronous DRAM, SDRAM), Double Data Rate Synchronous Dynamic random access memory (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous link SDRAM (SLDRAM), and Direct Rambus RAM (DR RAM). It should be noted that the memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
"and/or" in the present application, describing an association relationship of associated objects, means that there may be three relationships, for example, a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The plural in the present application means two or more.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include such modifications and variations.

Claims (10)

1. A method of gradient transmission, the method being applied to a distributed training system of neural network models, the distributed training system comprising a plurality of neural network models, each neural network model comprising n layers of neurons, each layer of neurons corresponding to at least one weight, wherein n is a positive integer, the method comprising:
in one iteration of each neural network model, acquiring the gradient of the weight corresponding to the i-th layer neurons of the neural network model according to input training data, wherein i is a positive integer not greater than n;
sending the gradient of the weight corresponding to the i-th layer neurons of the neural network model to a gradient buffer of the neural network model;
determining whether the number of gradients stored in the gradient buffer of the neural network model exceeds a determined transmission threshold;
sending, according to the determination result, the gradients stored in the gradient buffer of the neural network model to a gradient collection module;
acquiring a gradient mean value of the weight corresponding to the i-th layer neurons of the neural network model, the gradient mean value being obtained according to the gradients that are sent by the plurality of neural network models and stored in the gradient collection module; and
updating the weight corresponding to the i-th layer neurons of the neural network model according to the gradient mean value of the weight corresponding to the i-th layer neurons of the neural network model, so as to execute the next iteration of the neural network model.
2. The method of claim 1, wherein before determining whether the number of gradients stored in the gradient buffer of the neural network model exceeds the determined transmission threshold, the method further comprises:
acquiring, through m iterations, a transmission duration corresponding to each of m candidate transmission thresholds in a transmission threshold set; and
determining the transmission threshold according to the m transmission durations.
3. The method of claim 2, wherein the determining the transmission threshold according to the m transmission durations comprises:
selecting, as the transmission threshold, the candidate transmission threshold corresponding to the shortest transmission duration among the m transmission durations.
4. The method of any one of claims 2-3, wherein each of the candidate transmission thresholds is not less than the number of weights corresponding to any layer of neurons of the neural network model.
5. A distributed training system, comprising a plurality of neural network models and a transmission module, wherein each neural network model comprises n layers of neurons, each layer of neurons corresponds to at least one weight, and n is a positive integer;
the neural network model is configured to: in one iteration, acquire the gradient of the weight corresponding to the i-th layer neurons of the neural network model according to input training data, wherein i is a positive integer not greater than n; and send the gradient of the weight corresponding to the i-th layer neurons of the neural network model to a gradient buffer of the neural network model;
the transmission module is configured to: determine whether the number of gradients stored in the gradient buffer of the neural network model exceeds a determined transmission threshold; and send, according to the determination result, the gradients stored in the gradient buffer of the neural network model to a gradient collection module;
the neural network model is further configured to: acquire a gradient mean value of the weight corresponding to the i-th layer neurons of the neural network model, the gradient mean value being obtained according to the gradients that are sent by the plurality of neural network models and stored in the gradient collection module; and update the weight corresponding to the i-th layer neurons of the neural network model according to the gradient mean value of the weight corresponding to the i-th layer neurons of the neural network model, so as to execute the next iteration of the neural network model.
6. The distributed training system of claim 5, wherein the transmission module is further configured to: obtain, through m iterations, a transmission duration corresponding to each of m candidate transmission thresholds in a transmission threshold set; and determine the transmission threshold according to the m transmission durations.
7. The distributed training system of claim 6, wherein the transmission module, when determining the transmission threshold according to the m transmission durations, is specifically configured to:
select, as the transmission threshold, the candidate transmission threshold corresponding to the shortest transmission duration among the m transmission durations.
8. The distributed training system of any one of claims 6-7, wherein each candidate transmission threshold is not less than the number of weights corresponding to any layer of neurons of the neural network model.
9. A distributed training system, comprising at least one training device, the training device comprising: a processor and a memory;
the memory to store computer instructions;
the processor configured to execute the computer instructions in the memory to implement the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium having stored therein computer instructions for execution by a processor to implement the method of any one of claims 1-4.
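The selection of the transmission threshold recited in claims 2 to 4 can likewise be illustrated with a short, hedged Python sketch: one profiling iteration is run per candidate transmission threshold, a transmission duration is recorded for each, and the candidate with the shortest duration is kept. The helper run_iteration_with_threshold is a hypothetical profiling hook, and timing the whole iteration is only an approximation of the transmission duration; neither is taken from the patent.

# Illustrative sketch of threshold selection; helper names and timing approach are assumptions.
import time
from typing import Callable, Sequence


def select_transmission_threshold(
    candidates: Sequence[int],
    max_layer_weights: int,
    run_iteration_with_threshold: Callable[[int], None],
) -> int:
    """Pick the candidate transmission threshold with the shortest measured duration."""
    # Claim 4: every candidate threshold is at least the number of weights of the
    # largest layer, so any single layer's gradients fit in the gradient buffer.
    valid = [c for c in candidates if c >= max_layer_weights]
    durations = []
    for threshold in valid:                      # claim 2: one profiling iteration per candidate
        start = time.perf_counter()
        run_iteration_with_threshold(threshold)  # transmit gradients using this threshold
        durations.append(time.perf_counter() - start)
    # Claim 3: the candidate with the shortest transmission duration wins.
    return valid[durations.index(min(durations))]


# Example use with a dummy workload standing in for one training iteration:
chosen = select_transmission_threshold(
    candidates=[1024, 4096, 16384],
    max_layer_weights=1024,
    run_iteration_with_threshold=lambda t: time.sleep(0.001 * (16384 // t)),
)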
CN201910101338.9A 2019-01-31 2019-01-31 Gradient transmission method and distributed training system Active CN109919313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910101338.9A CN109919313B (en) 2019-01-31 2019-01-31 Gradient transmission method and distributed training system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910101338.9A CN109919313B (en) 2019-01-31 2019-01-31 Gradient transmission method and distributed training system

Publications (2)

Publication Number Publication Date
CN109919313A CN109919313A (en) 2019-06-21
CN109919313B true CN109919313B (en) 2021-06-08

Family

ID=66961321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910101338.9A Active CN109919313B (en) 2019-01-31 2019-01-31 Gradient transmission method and distributed training system

Country Status (1)

Country Link
CN (1) CN109919313B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619388B (en) * 2019-09-20 2024-04-02 北京金山数字娱乐科技有限公司 Gradient synchronization method and device in distributed training
US11468325B2 (en) * 2020-03-30 2022-10-11 Amazon Technologies, Inc. Multi-model training pipeline in distributed systems
CN113657617A (en) * 2020-04-23 2021-11-16 支付宝(杭州)信息技术有限公司 Method and system for model joint training
CN113705801A (en) * 2020-05-22 2021-11-26 华为技术有限公司 Training device and method of neural network model and related equipment
CN111723933B (en) * 2020-06-03 2024-04-16 上海商汤智能科技有限公司 Training method of neural network model and related products
CN111756602B (en) * 2020-06-29 2022-09-27 上海商汤智能科技有限公司 Communication timeout detection method in neural network model training and related product
CN113282933B (en) * 2020-07-17 2022-03-01 中兴通讯股份有限公司 Federal learning method, device and system, electronic equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105825269A (en) * 2016-03-15 2016-08-03 中国科学院计算技术研究所 Parallel autoencoder based feature learning method and system
CN106297774A (en) * 2015-05-29 2017-01-04 中国科学院声学研究所 Distributed parallel training method and system for a neural network acoustic model
CN106372402A (en) * 2016-08-30 2017-02-01 中国石油大学(华东) Parallelization method of convolutional neural networks in fuzzy region under big-data environment
CN107018184A (en) * 2017-03-28 2017-08-04 华中科技大学 Distributed deep neural network cluster packet synchronization optimization method and system
CN107301454A (en) * 2016-04-15 2017-10-27 北京中科寒武纪科技有限公司 Artificial neural network reverse training apparatus and method supporting discrete data representation
JP2018036779A (en) * 2016-08-30 2018-03-08 株式会社東芝 Electronic device, method, and information processing system
CN108304918A (en) * 2018-01-18 2018-07-20 中兴飞流信息科技有限公司 Parameter exchange method and system for data-parallel deep learning
WO2018155232A1 (en) * 2017-02-23 2018-08-30 ソニー株式会社 Information processing apparatus, information processing method, and program
CN109102075A (en) * 2018-07-26 2018-12-28 联想(北京)有限公司 Gradient update method and related device in distributed training

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160342887A1 (en) * 2015-05-21 2016-11-24 minds.ai inc. Scalable neural network system
US9704097B2 (en) * 2015-05-29 2017-07-11 Sas Institute Inc. Automatically constructing training sets for electronic sentiment analysis
CN111353588B (en) * 2016-01-20 2024-03-05 中科寒武纪科技股份有限公司 Apparatus and method for performing artificial neural network reverse training
US10831444B2 (en) * 2016-04-04 2020-11-10 Technion Research & Development Foundation Limited Quantized neural network training and inference
CN110188870B (en) * 2016-04-27 2021-10-12 中科寒武纪科技股份有限公司 Apparatus and method for performing artificial neural network self-learning operation
CN108122032B (en) * 2016-11-29 2020-02-14 华为技术有限公司 Neural network model training method, device, chip and system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106297774A (en) * 2015-05-29 2017-01-04 中国科学院声学研究所 Distributed parallel training method and system for a neural network acoustic model
CN105825269A (en) * 2016-03-15 2016-08-03 中国科学院计算技术研究所 Parallel autoencoder based feature learning method and system
CN107301454A (en) * 2016-04-15 2017-10-27 北京中科寒武纪科技有限公司 Artificial neural network reverse training apparatus and method supporting discrete data representation
CN106372402A (en) * 2016-08-30 2017-02-01 中国石油大学(华东) Parallelization method of convolutional neural networks in fuzzy region under big-data environment
JP2018036779A (en) * 2016-08-30 2018-03-08 株式会社東芝 Electronic device, method, and information processing system
WO2018155232A1 (en) * 2017-02-23 2018-08-30 ソニー株式会社 Information processing apparatus, information processing method, and program
CN107018184A (en) * 2017-03-28 2017-08-04 华中科技大学 Distributed deep neural network cluster packet synchronization optimization method and system
CN108304918A (en) * 2018-01-18 2018-07-20 中兴飞流信息科技有限公司 Parameter exchange method and system for data-parallel deep learning
CN109102075A (en) * 2018-07-26 2018-12-28 联想(北京)有限公司 Gradient update method and related device in distributed training

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on parallel face recognition methods based on open-source deep learning frameworks; Liu Hongyi; China Master's Theses Full-text Database, Information Science and Technology; 2018-10-15; pp. I138-823 *

Also Published As

Publication number Publication date
CN109919313A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
CN109919313B (en) Gradient transmission method and distributed training system
US20180096249A1 (en) Convolutional neural network system using adaptive pruning and weight sharing and operation method thereof
KR102415506B1 (en) Device and method to reduce neural network
US20190332944A1 (en) Training Method, Apparatus, and Chip for Neural Network Model
CN110348571B (en) Neural network model training method, device, chip and system
KR102336295B1 (en) Convolutional neural network system using adaptive pruning and weight sharing and operation method thererof
US11521066B2 (en) Method and apparatus for partitioning deep neural networks
US20170193368A1 (en) Conditional parallel processing in fully-connected neural networks
JP6990813B2 (en) Learning and application methods, devices, and storage media for multi-layer neural network models
WO2015157718A2 (en) Parallelizing the training of convolutional neural networks
JP7009020B2 (en) Learning methods, learning systems, learning devices, methods, applicable devices, and computer programs
KR20160037022A (en) Apparatus for data classification based on boost pooling neural network, and method for training the appatratus
EP3370191A1 (en) Apparatus and method implementing an artificial neural network training algorithm using weight tying
CN111324630B (en) MPI-based neural network architecture search parallelization method and equipment
CN110600020B (en) Gradient transmission method and device
CN110263628A (en) Obstacle detection method, device, electronic equipment and storage medium
WO2021096590A1 (en) Threshold triggered back propagation of an artificial neural network
CN112381208A (en) Neural network architecture searching method and system with gradual depth optimization
US11704555B2 (en) Batch normalization layer fusion and quantization method for model inference in AI neural network engine
CN112001491A (en) Search method and device for determining neural network architecture for processor
CN114175069A (en) Distributed machine learning with privacy protection
CN116680565A (en) Combined learning model training method, device, equipment and storage medium
CN113469206A (en) Method, device, equipment and storage medium for obtaining artificial intelligence model
JP7073171B2 (en) Learning equipment, learning methods and programs
GB2589478A (en) Segmenting irregular shapes in images using deep region growing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220214

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Patentee after: Huawei Cloud Computing Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.
