WO2022105348A1 - Training method and apparatus for a neural network - Google Patents
Training method and apparatus for a neural network
- Publication number: WO2022105348A1 (PCT/CN2021/115204)
- Authority: WO (WIPO PCT)
- Prior art keywords: parameters, training, group, iteration step, neural network
- Prior art date
Classifications (all under G06N—Computing arrangements based on specific computational models)
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06N3/045—Combinations of networks
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
Definitions
- the present application relates to the field of artificial intelligence, and in particular, to a method and apparatus for training a neural network.
- Deep learning technology has made great progress in computer vision.
- Deep neural network models have led the traditional ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a large margin since 2012.
- The ImageNet (ILSVRC 2012) dataset contains about 1.28 million images, and training a ResNet50 neural network on it for 90 epochs takes about 8 hours on 8 V100 computing cards.
- The GPT-3 model released by OpenAI has about 175 billion parameters and is trained using 45 TB of data, at a cost of $13 million for a single training run.
- As network models contain more and more parameters, models with higher accuracy are obtained, but more and more time and money are spent on training them. How to accelerate the training of neural networks has therefore become an urgent problem to be solved.
- The present application provides a method and device for training a neural network, which can implement fine-grained control over the parameter groups of the neural network in the dimension of individual iteration steps, and improve training accuracy while accelerating training.
- A first aspect provides a method for training a neural network, the method comprising: obtaining a neural network to be trained; grouping parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1; and obtaining a sampling probability distribution and a training iteration step arrangement.
- The sampling probability distribution is used to represent the probability that each group of parameters in the M groups of parameters is sampled in each training iteration step.
- The training iteration step arrangement includes an interval arrangement and a periodic arrangement; according to the sampling probability distribution and the training iteration step arrangement, the sampled parameter group is frozen or stop-updated; and the neural network to be trained is trained according to the frozen parameter group or the stop-updated parameter group.
- the neural network training method of the embodiment of the present application processes the parameter group of the neural network in the dimension of the iterative step, realizes fine-grained control of the acceleration process, and improves the training accuracy while accelerating the training.
- the parameter groups are sampled and processed through the training iteration step arrangement and the sampling probability distribution, so that the training cost and the training accuracy can be selected more flexibly.
- For example, the corresponding sampling probability can be determined according to the specific cost ratio of each group of parameters.
- Freezing or stop-updating the sampled parameter group according to the sampling probability distribution and the training iteration step arrangement includes: determining a first iteration step according to the training iteration step arrangement, where the first iteration step is the iteration step to be sampled; determining, according to the sampling probability distribution, the mth group of parameters sampled in the first iteration step, where m is less than or equal to M-1; and freezing the mth group of parameters to the first group of parameters in the first iteration step, where freezing the mth group of parameters to the first group of parameters in the first iteration step means that no gradient calculation and no parameter update are performed for the mth group of parameters to the first group of parameters.
- According to the sampling probability distribution, it is determined that some parameter groups are frozen, with no gradient calculation and no parameter update, so that the training of the neural network can be accelerated.
- the parameter group of the subsequent iteration step does not need to use the parameters of the previously frozen parameter group, thus avoiding the problem of momentum offset.
- Alternatively, freezing or stop-updating the sampled parameter group according to the sampling probability distribution and the training iteration step arrangement includes: determining a first iteration step according to the training iteration step arrangement, where the first iteration step is the iteration step to be sampled; determining, according to the sampling probability distribution, the mth group of parameters sampled in the first iteration step, where m is less than or equal to M-1; and stop-updating the mth group of parameters to the first group of parameters in the first iteration step, where stop-updating the mth group of parameters to the first group of parameters in the first iteration step means that the gradient calculation is performed for the mth group of parameters to the first group of parameters, but no parameter update is performed.
- According to the sampling probability distribution, some parameter groups are determined to be stop-updated: the gradient calculation is performed, but the parameters are not updated, so that the accuracy of neural network training can be improved.
- Because the gradient calculation is still performed, the momentum of the corresponding parameter groups can be kept up to date in subsequent iteration steps, thereby avoiding the momentum offset problem.
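- As an illustration of the procedure described above, the following sketch shows how, for one iteration step, the sampled index can be drawn and the affected groups marked for freezing or stop-updating (an illustrative sketch only; the function and variable names are hypothetical and not taken from the present application).

```python
import random

def sample_index(sampling_probs):
    """Draw an index from the sampling probability distribution.
    Index 0 stands for whole-network training (nothing frozen); index k >= 1
    means the first k parameter groups are frozen or stop-updated."""
    r, acc = random.random(), 0.0
    for idx, p in enumerate(sampling_probs):
        acc += p
        if r < acc:
            return idx
    return len(sampling_probs) - 1

def plan_iteration(step, sampled_steps, sampling_probs, num_groups, mode="freeze"):
    """Return a per-group action ('train', 'freeze' or 'stop') for one iteration step.
    'freeze': no gradient calculation and no update; 'stop': gradient, but no update."""
    actions = ["train"] * num_groups
    if step in sampled_steps:
        k = sample_index(sampling_probs)
        for g in range(k):          # the sampled group back down to the first group
            actions[g] = mode
    return actions
```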
- Determining the first iteration step according to the training iteration step arrangement includes: determining a first interval; and, among the multiple training iteration steps, determining one or more first iteration steps at every first interval.
- Alternatively, determining the first iteration step according to the training iteration step arrangement includes: determining the number of first iteration steps to be M-1; and determining a first cycle according to the number of first iteration steps and a first ratio, where the first cycle includes the first iteration steps and the iteration steps of whole-network training, and the first ratio is the proportion of the first iteration steps in the first cycle.
- The first iteration steps are the last M-1 iteration steps of the first cycle.
- The iteration steps to be sampled can be determined in either of the above two ways: the periodic arrangement can effectively improve the speed of neural network training, and the interval arrangement can effectively improve the accuracy of neural network training.
- A data processing method is provided, including: acquiring data to be processed; and processing the data to be processed according to a target neural network, where the target neural network is obtained through training, and the training of the target neural network includes: obtaining the neural network to be trained; grouping the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1; and obtaining a sampling probability distribution and a training iteration step arrangement, where the sampling probability distribution is used to represent the probability that each group of parameters in the M groups of parameters is sampled in each training iteration step.
- The training iteration step arrangement includes an interval arrangement and a periodic arrangement; according to the sampling probability distribution and the training iteration step arrangement, the sampled parameter group is frozen or stop-updated; and the neural network to be trained is trained according to the frozen parameter group or the stop-updated parameter group.
- The data processing method provided by the present application uses a neural network trained by the neural network training method of any one of the implementations of the first aspect to process data, which can effectively improve the data processing capability of the neural network.
- A third aspect provides a training device for a neural network, the device comprising: an acquisition module for acquiring a neural network to be trained; and a processing module for grouping parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1. The acquisition module is also used to acquire a sampling probability distribution and a training iteration step arrangement, and the sampling probability distribution is used to represent the probability that each group of parameters in the M groups of parameters is sampled in each training iteration step.
- The training iteration step arrangement includes an interval arrangement and a periodic arrangement; the processing module is also used to freeze or stop-update the sampled parameter group according to the sampling probability distribution and the training iteration step arrangement;
- and the neural network to be trained is trained according to the frozen parameter group or the stop-updated parameter group.
- the embodiment of the present application further provides a training apparatus for a neural network, and the apparatus can be used to implement the method in any one of the implementation manners of the first aspect.
- The processing module freezes or stop-updates the sampled parameter group according to the sampling probability distribution and the training iteration step arrangement by: determining a first iteration step according to the training iteration step arrangement, where the first iteration step is the iteration step to be sampled; determining, according to the sampling probability distribution, the mth group of parameters sampled in the first iteration step, where m is less than or equal to M-1; and freezing the mth group of parameters to the first group of parameters in the first iteration step, where freezing the mth group of parameters to the first group of parameters in the first iteration step means that no gradient calculation and no parameter update are performed for the mth group of parameters to the first group of parameters.
- Alternatively, the processing module freezes or stop-updates the sampled parameter group according to the sampling probability distribution and the training iteration step arrangement by: determining a first iteration step according to the training iteration step arrangement, where the first iteration step is the iteration step to be sampled; determining, according to the sampling probability distribution, the mth group of parameters sampled in the first iteration step, where m is less than or equal to M-1; and stop-updating the mth group of parameters to the first group of parameters in the first iteration step, where stop-updating the mth group of parameters to the first group of parameters in the first iteration step means that the gradient calculation is performed for the mth group of parameters to the first group of parameters, but no parameter update is performed.
- The processing module determines the sampled first iteration step according to the training iteration step arrangement by: determining a first interval; and, among multiple training iteration steps, determining one or more first iteration steps at every first interval.
- Alternatively, the processing module determines the sampled first iteration step according to the training iteration step arrangement by: determining the number of first iteration steps to be M-1; and determining a first cycle according to the number of first iteration steps and a first ratio.
- The first cycle includes the first iteration steps and the iteration steps of whole-network training.
- The first ratio is the proportion of the first iteration steps in the first cycle.
- The first iteration steps are the last M-1 iteration steps of the first cycle.
- A data processing device is provided, comprising: an acquisition module for acquiring data to be processed; and a processing module for processing the data to be processed according to a target neural network, where the target neural network is obtained through training.
- The training of the target neural network includes: obtaining the neural network to be trained; grouping the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1; and obtaining a sampling probability distribution and a training iteration step arrangement, where the sampling probability distribution is used to represent the probability that each group of parameters in the M groups of parameters is sampled in each training iteration step.
- The training iteration step arrangement includes an interval arrangement and a periodic arrangement; according to the sampling probability distribution and the training iteration step arrangement, the sampled parameter group is frozen or stop-updated; and the neural network to be trained is trained according to the frozen parameter group or the stop-updated parameter group.
- An electronic device is provided, including a memory and a processor, where the memory is used to store program instructions; when the program instructions are executed by the processor, the processor is used to execute any one of the implementations of the first aspect and the method described in the second aspect.
- the processor in the fifth aspect above may be either a central processing unit (CPU), or a combination of a CPU and a neural network computing processor.
- A computer-readable medium is provided, which stores program code for execution by a device, the program code comprising instructions for executing any one of the implementations of the first aspect and the method of the second aspect.
- a computer program product comprising instructions, which, when the computer program product is run on a computer, causes the computer to execute any one of the implementations of the first aspect and the method of the second aspect.
- An eighth aspect provides a chip, which includes a processor and a data interface, where the processor reads instructions stored in a memory through the data interface, and executes any one of the implementations of the first aspect and the method of the second aspect.
- the chip may further include a memory, in which instructions are stored, the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the The processor is configured to execute any one of the implementations of the first aspect and the method of the second aspect.
- the above chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
- FIG. 1 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application.
- FIG. 2 is a schematic block diagram of a system architecture to which the neural network training method according to the embodiment of the present application is applied;
- FIG. 3 is a schematic diagram of the arrangement of training iteration steps in an embodiment of the present application.
- FIG. 4 is a schematic explanatory diagram of a momentum offset according to an embodiment of the present application.
- FIG. 5 is a schematic diagram of a training iteration step cycle arrangement according to an embodiment of the present application.
- FIG. 6 is a schematic flowchart of a training method of a neural network according to an embodiment of the present application.
- FIG. 7 is a schematic block diagram of a neural network parameter grouping according to an embodiment of the present application.
- FIG. 8 is a schematic block diagram of a training method of a neural network according to an embodiment of the present application.
- FIG. 9 is a schematic diagram of training iteration step sampling of a periodic arrangement according to an embodiment of the present application.
- FIG. 10 is a schematic block diagram of a computation graph of a static graph deep learning framework according to an embodiment of the present application.
- FIG. 11 is a schematic diagram of training iteration step sampling of the interval arrangement according to the embodiment of the present application.
- FIG. 12 is a schematic flowchart of a data processing method according to an embodiment of the present application.
- FIG. 13 is a schematic block diagram of a training apparatus for a neural network according to an embodiment of the present application.
- FIG. 14 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application.
- FIG. 15 is a schematic diagram of a hardware structure of a training device for a neural network according to an embodiment of the present application.
- FIG. 16 is a schematic diagram of a hardware structure of a data processing apparatus according to an embodiment of the present application.
- references in this specification to "one embodiment” or “some embodiments” and the like mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
- Appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise.
- The terms "including", "comprising", "having" and their variants mean "including but not limited to" unless specifically emphasized otherwise.
- Deep learning: a machine learning technology based on deep neural network algorithms, whose main feature is the use of multiple nonlinear transformations to process and analyze data. It is mainly used in perception, decision-making and other scenarios in the field of artificial intelligence, such as image recognition, speech recognition, natural language translation, and computer games.
- the training in this embodiment of the present application refers specifically to the training of the neural network, which generally includes forward computing the model output, computing the loss according to the model output and labels, back-propagating gradients, and parameter updating.
- the model is optimized to reduce the loss value as much as possible.
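- For reference, one such generic training iteration can be sketched as follows (a minimal PyTorch-style sketch under the assumption of a standard supervised setup; it is not code from the present application).

```python
import torch

def train_one_step(model, batch, labels, loss_fn, optimizer):
    """One generic training iteration: forward pass, loss, backward pass, update."""
    outputs = model(batch)            # forward-compute the model output
    loss = loss_fn(outputs, labels)   # compute the loss from the output and the labels
    optimizer.zero_grad()
    loss.backward()                   # back-propagate gradients
    optimizer.step()                  # update the parameters to reduce the loss value
    return loss.item()
```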
- Stop updating: in the backward pass of the neural network training process, the gradients of certain parameters continue to be calculated, but the updating of these parameters is stopped; this is referred to as stop-updating these parameters.
- Cost: in the embodiments of the present application, cost refers to the resources consumed in the neural network training process, and can generally be calculated from the computation amount of the neural network.
- the training acceleration of neural network mainly focuses on hardware upgrade and algorithm optimization.
- Newer GPUs offer higher performance, but at the same time they are also more expensive;
- the use of multi-card parallelism, multi-machine parallelism and large-scale clusters is also a commonly used training acceleration method.
- mixed-precision training can reduce the computational complexity of neural networks and effectively accelerate the training of neural networks in some scenarios.
- The training method of the neural network in the embodiments of the present application mainly involves algorithmic improvements: training is further accelerated while the hardware conditions remain unchanged, so as to reduce the actual cost.
- a convolutional neural network (CNN) 100 may include an input layer 110 , a convolutional/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130 .
- the input layer 110 can obtain the data to be processed, and pass the obtained data to be processed by the convolutional layer/pooling layer 120 and the subsequent neural network layer 130 for processing, and the processing result of the data can be obtained.
- the internal layer structure in the CNN 100 in Figure 1 is described in detail below.
- The convolutional/pooling layer 120 may include layers 121-126. For example, in one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
- the following will take the convolutional layer 121 as an example to introduce the inner working principle of a convolutional layer.
- the convolution layer 121 may include many convolution operators.
- the convolution operator is also called a kernel, and its role in data processing is equivalent to a filter that extracts specific information from the input data matrix.
- The convolution operator can essentially be a weight matrix, which is usually predefined.
- The weight values in these weight matrices need to be obtained through a large amount of training in practical applications, and each weight matrix formed by the trained weight values can be used to extract information from the input data, so that the convolutional neural network 100 can make correct predictions.
- The features extracted by the initial convolutional layer (for example, layer 121) are usually relatively simple, while the features extracted by the later convolutional layers become more and more complex, such as high-level semantic features.
- The convolutional/pooling layer 120 can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. In the process of data processing, the only purpose of the pooling layer is to reduce the spatial size of the data.
- After being processed by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input data. In order to generate the final output information (the required class information or other relevant information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a set of outputs of the desired number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 1) and the output layer 140, and the parameters contained in the multiple hidden layers may be obtained by pre-training based on the relevant training data of specific task types; for example, the task types may include recognition, classification, and so on.
- the neural network can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller.
- the input signal is passed forward until the output will generate error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
- the back-propagation algorithm is a back-propagation movement dominated by error loss, aiming to obtain the parameters of the optimal neural network model, such as the weight matrix.
- After the multiple hidden layers in the neural network layer 130, the last layer of the entire convolutional neural network 100 is the output layer 140; the output layer 140 has a loss function similar to the categorical cross-entropy, which is specifically used to calculate the prediction error.
- Once the forward propagation of the entire convolutional neural network 100 (as shown in FIG. 1, the propagation in the direction from 110 to 140) is completed, the back propagation (as shown in FIG. 1, the propagation in the direction from 140 to 110) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
- each layer of the network is frozen from front to back, and when a layer is frozen, the layer is not trained until the end of the network training.
- the parameters of a neural network are divided into 6 parts, and a total of 90 epochs (epochs) are trained.
- The first part is epoch 1-46, the second part is epoch 47-53, and the third part is epoch 54-61.
- the fourth part is epoch62-70, the fifth part is epoch71-79, and the sixth part is epoch80-90.
- The training method of this neural network adopts certain calculation rules so that, in the first part, every parameter group is trained:
- every parameter in the network has its gradient calculated and is updated during the backpropagation process. In the second part, the first group of parameters of the neural network is frozen, that is, it only participates in the forward calculation, and during backpropagation its gradient is not calculated and its parameters are not updated. In the third part, the first two groups of parameters of the neural network are frozen; in the fourth part, the first three groups are frozen; in the fifth part, the first four groups are frozen; and in the sixth part, the first five groups of parameters of the neural network are frozen.
- the learning rate of each set of parameters is scaled according to the ratio of the training duration of the set of parameters to the total training duration. In the training of the above neural network, the learning rate increases from the last set of parameters to the first set of parameters. This method is only suitable for the training of some neural networks.
- Another existing neural network training method calculates a certain standard according to the state of the network gradient during the neural network training process, and then judges whether freezing is required according to the standard. This method introduces extra computation, and in some scenarios, the training speed does not increase but decreases.
- the existing neural network training methods also add random factors in the network training process, and randomly skip forward and reverse calculations of some residual branches to achieve the purpose of accelerating training.
- this method can only be used on some specific structures of the network, and the limitations are large and the scope of use is narrow.
- the neural network training method of the embodiment of the present application automatically groups the parameters of the user network, and introduces a sampling probability distribution to perform fine-grained freezing control on the training process.
- Fine-grained interval or periodic arrangement is performed to correct the momentum offset, so that each group of parameters of the network has a certain probability of being updated throughout the training process, and there will be no group of parameters that is trained only in the early stage and then completely frozen and no longer updated in the later stage, so as to ensure the training accuracy.
- The neural network training method of the embodiments of the present application is suitable for deep learning frameworks such as MindSpore, TensorFlow, and PyTorch, and can be used in conjunction with hardware platforms such as Ascend chips and GPUs to accelerate neural network training in various computer vision tasks.
- computer vision tasks can be object recognition, object detection, semantic segmentation, etc.
- FIG. 2 shows a schematic block diagram of a system architecture to which the neural network training method according to the embodiment of the present application is applied, and the system architecture can realize the acceleration of the training of the neural network shown in FIG. 1 .
- the system architecture includes a probability distribution module and a training iteration step arrangement module, which will be introduced separately below.
- the probability distribution module is used to introduce sampling probability distribution for fine-grained freezing control of the training process, where the probability distribution includes sampling probability and freezing probability.
- the sampling probability controls the probability that each group of the neural network is sampled. For example, if a certain group of parameters is sampled in a certain training iteration step, then the parameters from this group to the first group of parameters in the network will be frozen in reverse. Once the sampling probability distribution is determined, the frozen probability distribution of the parameters is also determined.
- sampling probability distribution formula is:
- the frozen probability distribution formula is:
- p 0 represents the probability of not freezing any parameters during the neural network training process, that is, the probability of training the entire network
- n represents the number of groups of network parameters
- i represents the index of the parameter group, ranging from 0 to n-1
- formula f (x) can be either a continuous function or a discrete function.
- the corresponding sampling probability is determined according to the actual test overhead ratio of each group of parameters.
- The probability that the i-th group of parameters is frozen is the sum of the sampling probabilities from the i-th group to the (n-1)-th group, and the sum of the sampling probabilities of all parameter groups is 1. The larger the area covered by the freezing probability distribution curve, the more the training overhead is reduced.
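- Based on the definitions above, the relationship between the sampling probabilities and the freezing probabilities can be summarized as follows (a reconstruction from the surrounding text, since the original formula images are not reproduced here; the concrete form of f(x) is left open).

```latex
p_i = f(i), \qquad \sum_{i=0}^{n-1} p_i = 1, \qquad
P_{\mathrm{freeze}}(i) = \sum_{j=i}^{n-1} p_j \quad (1 \le i \le n-1)
```

- Here p 0 is the whole-network training probability (no group is frozen), and p j (j ≥ 1) is the probability that the j-th group is sampled, in which case the first j groups are frozen in the backward pass.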
- the training iteration step arranging module is used for fine-grained arranging the training iteration steps according to the probability distribution formula selected by the probability distribution module, so as to determine which groups of parameters are frozen in each iteration in the training process.
- Figure 3 shows a schematic diagram of the training iteration step arrangement with the whole-network training probability p 0 equal to 0.5. Since p 0 is 0.5, whole-network training can be performed every other step, that is, the entire network is trained at iteration step 0, iteration step 2, and iteration step 4, while at iteration step 1, iteration step 3, and iteration step 5 the third, fifth, and first groups of parameters are sampled according to the sampling probability distribution, and the first three, the first five, and the first group of parameters are frozen, respectively. This arrangement is called the interval arrangement.
- The active layer in Figure 3 indicates that both the forward calculation and the reverse calculation are performed, while the frozen layer indicates that only the forward calculation is performed and the reverse calculation is frozen.
- An active layer or a frozen layer represents a set of parameters.
- In the equal-interval whole-network training arrangement in Fig. 3, there are cases where a frozen layer is thawed at a later iteration step, which will cause a momentum offset.
- For example, for a layer frozen at iteration step 1, the momentum is not calculated and the parameters are not updated at that step, so the momentum used at iteration step 2 is still based on the momentum of iteration step 0.
- To this end, a periodic, gradually freezing arrangement scheme (periodic mode) as shown in FIG. 5 is designed, in which index 0 represents whole-network training, and indices 1, 2, 3, 4, and 5 respectively indicate that the first 1, 2, 3, 4, and 5 groups of parameters are frozen in the reverse calculation process; the last group of parameters of the network does not participate in the freezing.
- The iteration step arrangement in Figure 5 is periodic, and the sampling probability is linearly decreasing, that is, the probability of being sampled decreases as the index increases; according to the probability with which each index is sampled and the whole-network training probability, the training iteration step arrangement on the left side of Figure 5 can be obtained.
- For any sampling probability distribution function, according to the probability of each group being sampled at each step and the probability of training the entire network, the above-mentioned interval arrangement and periodic arrangement of the training iteration steps can be implemented.
- Interval arrangement: among consecutive training iteration steps (for example, 10 consecutive steps), one or more iteration steps are selected for freeze sampling every n iteration steps, where n can be a preset value or be determined according to the whole-network training probability. For example, in Figure 3, when p 0 is 0.5, n is 1, so freeze sampling is performed every other iteration step.
- Periodic arrangement: it is only necessary to adjust the number of whole-network training iteration steps according to the value of p 0 , that is, the ratio of p 0 to 1-p 0 .
- FIG. 6 shows a schematic flowchart of the neural network training method according to the embodiment of the present application. As shown in FIG. 6 , the method includes steps 601 to 604 , which will be introduced separately below.
- the training method of the neural network in the embodiment of the present application can be applied to tasks such as target detection, image segmentation, natural language processing, speech recognition, etc.
- The neural network to be trained can be a convolutional neural network as shown in FIG. 1, a neural network of the MobileNet or other series, or another neural network, which is not specifically limited in the embodiments of the present application.
- S602 Group the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1.
- the neural network parameters in the embodiments of the present application include operators of each layer of the neural network, and the training of the neural network is the determination of the weights of the operators of each layer in the neural network.
- the neural network training method of the embodiment of the present application can realize automatic grouping of the parameters of the neural network to be trained, and the grouping standard can be set in advance.
- The parameter grouping follows the principle of ordering from input to output. As shown in Figure 7, the group of parameters closest to the input is determined as the first group of parameters, that is, group 0 in Figure 7, and so on; the group of parameters farthest from the input is determined as the last group of parameters, namely group 5 in Figure 7.
- FIG. 8 shows a schematic block diagram of a neural network training method according to an embodiment of the present application. As shown in FIG. 8 , in the parameter automatic grouping step, the input data can be automatically divided into groups 0 to 5.
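- The automatic grouping described above can be sketched as follows (a hypothetical helper for illustration: it simply splits the parameters, which appear in definition order from input to output, into roughly equal groups; a later example in this application also mentions grouping each convolution operator together with its batch normalization operator).

```python
import torch.nn as nn

def group_parameters(model: nn.Module, num_groups: int):
    """Split a model's parameters into num_groups groups, ordered from the input
    side to the output side (group 0 is closest to the input)."""
    params = list(model.parameters())            # registration order ~ input-to-output order
    group_size = max(1, len(params) // num_groups)
    groups = [params[i * group_size:(i + 1) * group_size] for i in range(num_groups - 1)]
    groups.append(params[(num_groups - 1) * group_size:])   # last group takes the remainder
    return groups

# usage with a small hypothetical network
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
                    nn.Conv2d(8, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
groups = group_parameters(net, num_groups=3)     # three groups, from input to output
```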
- S603 Obtain the sampling probability distribution and the training iteration step arrangement, where the sampling probability distribution is used to represent the probability of each group of parameters in the M groups of parameters being sampled in each training iteration step, and the training iteration step arrangement includes the interval Arrangements and Periodic Arrangements.
- the arrangement of training iteration steps is first determined.
- Whole-network training is performed at some iteration steps, and whole-network training means that the gradient calculation and parameter update are performed for all parameter groups. In the training of the neural network, the training data needs to be used to minimize the loss function so as to determine the values of the neural network parameters; to minimize the loss function, the extremum of the loss function needs to be found, and the gradient points in
- the direction in which the function value rises fastest, which means that the opposite direction is the direction in which the function value drops fastest. Therefore, by calculating the gradient of the loss function (that is, calculating the partial derivatives with respect to all parameters) and updating the parameters in the opposite direction, the loss function can quickly reach a minimum value after a number of iterations.
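- In symbols, the gradient-descent update described here can be written as follows (a standard formula for reference; the learning rate η is a generic hyperparameter, not a value specified in the present application).

```latex
\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta)
```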
- the training iteration step arrangement includes interval arrangement and periodic arrangement, where interval arrangement means that multiple training iteration steps are determined at regular intervals.
- One or more sampled training iteration steps can be determined according to the following method: determine the training probability p 0 of the entire network, multiply the number of multiple iteration steps to be trained by p 0 , and obtain the iterative steps trained by the entire network
- the number of iterative steps to be trained by the whole network is evenly distributed among the multiple iterative steps to be trained, where p 0 is an artificially preset value in the range of (0, 1).
- Table 1 shows an example of how the whole-network training probability determines the arrangement of the training iteration steps:
- Table 1 takes 10 iteration steps, step0 to step9, as an example; one mark indicates that the iteration step will be sampled, and the other mark indicates that the iteration step will not be sampled, that is, whole-network training is performed at that iteration step.
- For example, if the whole-network training probability is 0.3, three of the 10 iteration steps are trained on the whole network; when these whole-network training steps are evenly distributed among the 10 iteration steps, step2, step5 and step8 are the whole-network training steps, and step0, step1, step3, step4, step6, step7 and step9 are the sampled iteration steps.
- the determination of the training iteration step arrangement according to the whole network training probability in Table 1 is only an example of the interval arrangement in the embodiment of the present application, and does not constitute a limitation to the embodiment of the present application.
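- The interval arrangement just described can be sketched as follows (one possible reading, in which one whole-network training step is placed at the end of every period of steps derived from p0; the function name is hypothetical).

```python
def interval_arrangement(num_steps, p0):
    """Interval arrangement sketch: a whole-network training step at the end of every
    `period` consecutive steps, where the period is derived from the whole-network
    training probability p0; all remaining steps are sampled for freezing/stop-updating."""
    period = max(1, round(1 / p0))
    whole_net = [s for s in range(num_steps) if s % period == period - 1]
    sampled = [s for s in range(num_steps) if s % period != period - 1]
    return whole_net, sampled

# 10 iteration steps with p0 = 0.3, as in the example above
print(interval_arrangement(10, 0.3))   # ([2, 5, 8], [0, 1, 3, 4, 6, 7, 9])
```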
- The periodic arrangement means that multiple training iteration steps are regarded as one cycle. The number of iteration steps to be sampled is first determined as M-1; the cycle is then determined according to the number of iteration steps to be sampled and a certain ratio, where one cycle includes the iteration steps to be sampled and the iteration steps of whole-network training, the certain ratio is the proportion of the iteration steps to be sampled in the cycle, and the iteration steps to be sampled are the last M-1 iteration steps of the cycle.
- The whole-network training probability p 0 may be a preset value, and the above-mentioned certain ratio is 1-p 0 .
- iterative steps 0, 2, and 4 are the whole network training, and iteration steps 1, 3, and 5 are the sampled iteration steps.
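- The periodic (cycle) arrangement described above can be sketched as follows (an illustrative reading of the text; the function name is hypothetical, and p0 is assumed to lie strictly between 0 and 1).

```python
def periodic_arrangement(num_groups, p0):
    """Periodic arrangement sketch: one cycle consists of whole-network training steps
    followed by (num_groups - 1) sampled steps, where the sampled steps take up a
    fraction (1 - p0) of the cycle and sit at the end of the cycle."""
    num_sampled = num_groups - 1
    cycle_len = max(num_sampled, round(num_sampled / (1 - p0)))
    whole_net = list(range(cycle_len - num_sampled))              # whole-network steps
    sampled = list(range(cycle_len - num_sampled, cycle_len))     # last M-1 steps
    return cycle_len, whole_net, sampled

# 6 parameter groups with p0 = 0.5: a 10-step cycle, whole-network training at
# steps 0-4 and gradual freezing at steps 5-9, matching the later example of FIG. 9
print(periodic_arrangement(6, 0.5))    # (10, [0, 1, 2, 3, 4], [5, 6, 7, 8, 9])
```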
- the parameter group to be sampled is determined.
- The sampling probability distribution is used to determine the probability that each group of parameters in the M groups of parameters is sampled in each training iteration step, that is, it is used to determine which group of parameters is sampled in a given iteration step.
- The sampled group is the mth group of parameters, where m is less than or equal to M-1. After the mth group of parameters to be sampled is determined in a certain iteration step, the mth group of parameters down to the first group of parameters are all processed in the same way.
- The processing includes freezing and stop-updating: freezing means that no gradient calculation and no parameter update are performed for the mth group of parameters to the first group of parameters, while stop-updating means that the gradient calculation is performed for the mth group of parameters to the first group of parameters, but no parameter update is performed.
- For the corresponding sampling probability distribution formula and the freezing/stop-updating probability distribution formula, reference may be made to the above description of FIG. 2.
- For example, in Figure 8, when p 0 is 0.5, multiple sampling probability distribution curves can be obtained according to the sampling probability distribution formula.
- the abscissa represents the parameter group, and the ordinate represents the probability of being sampled.
- Accordingly, the sampling distribution diagram of the training iteration steps as shown in Figure 8 can be obtained, in which the parameter group sampled at iteration step 1 is group 2, so groups 0 to 2 are frozen; the parameter group sampled at iteration step 3 is group 0, so group 0 is frozen; and the parameter group sampled at iteration step 5 is group 4, so groups 0 to 4 are frozen.
- the neural network to be trained is trained according to the frozen parameter group or the stopped parameter group.
- According to the above method, the sampling distribution of the training iteration steps can be obtained; the gradient calculation and parameter update are performed for the parameter groups that are neither frozen nor stop-updated; no gradient calculation and no parameter update are performed for the frozen parameter groups; and only the gradient calculation is performed for the stop-updated parameter groups, without parameter update.
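- One simple way to apply these three cases in a single iteration step is sketched below (a PyTorch-style illustration only, not the claimed implementation; the freeze case disables gradient computation for the affected groups, while the stop-update case lets the optimizer run, so that its internal statistics such as momentum stay current, and then restores the old parameter values).

```python
import torch

def train_step_with_schedule(model, batch, labels, loss_fn, optimizer, groups, actions):
    """groups: list of parameter lists (group 0 closest to the input);
    actions: per-group 'train', 'freeze' (no gradient, no update) or 'stop' (gradient, no update)."""
    for params, action in zip(groups, actions):
        for p in params:
            p.requires_grad_(action != "freeze")   # frozen groups skip gradient computation

    outputs = model(batch)
    loss = loss_fn(outputs, labels)
    optimizer.zero_grad()
    loss.backward()

    # remember the current values of stop-updated parameters, run the optimizer,
    # then restore the values so only the optimizer state (e.g. momentum) advances
    saved = [(p, p.detach().clone())
             for params, action in zip(groups, actions) if action == "stop"
             for p in params]
    optimizer.step()
    with torch.no_grad():
        for p, old in saved:
            p.copy_(old)
    return loss.item()
```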
- the neural network to be trained is iteratively trained.
- the neural network training method in the embodiment of the present application can be used for training a corresponding neural network, and the neural network can be the neural network shown in FIG. 1 .
- the neural network training method in the embodiment of the present application can be applied to visual tasks such as target detection and image segmentation, and can also be applied to non-visual tasks such as natural language processing and speech recognition.
- An epoch represents one complete pass of the neural network model over all the data in the training set, while a training iteration step represents one update of the parameters of the neural network model.
- An epoch may include on the order of 10,000 training iteration steps, so controlling the training of the neural network in the dimension of the training iteration step is more precise than controlling it in the dimension of the epoch.
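- As a rough worked example (the batch size of 128 is an assumption for illustration and is not given in the present application): with the ImageNet training set of about 1.28 million images mentioned above, one epoch corresponds to roughly

```latex
\frac{1{,}280{,}000 \text{ images}}{128 \text{ images per iteration step}} = 10{,}000 \text{ training iteration steps}
```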
- the neural network training method of the embodiment of the present application processes the parameter group of the neural network in the dimension of the iterative step, realizes fine-grained control of the acceleration process, and improves the training accuracy while accelerating the training.
- The parameter groups are sampled and processed through the training iteration step arrangement and the sampling probability distribution, so that the training cost and training accuracy can be traded off more flexibly; for example, the corresponding sampling probability can be determined according to the specific cost ratio of each group of parameters. In addition, the problem of momentum offset is corrected.
- For the frozen parameter groups, the parameter groups of subsequent iteration steps do not need to use the parameters of the previously frozen parameter groups.
- For the stop-updated parameter groups, the gradient calculation is still performed, so that the parameters of the corresponding parameter groups in subsequent iteration steps can be kept up to date.
- the training method of the neural network in the embodiment of the present application can be applied to perform accuracy verification of the target recognition task on multiple networks respectively according to the ImageNet data set.
- When a static graph deep learning framework such as TensorFlow or MindSpore is used, the computation graph can be built with multiple reverse paths so as to cooperate with the neural network training method of the embodiments of the present application for training acceleration; when a
- dynamic graph deep learning framework such as PyTorch is used, reverse truncation can be performed in each backward pass.
- Step 1 Input neural networks such as ResNet50, ResNet18, MobileNetV2;
- Step 2 The parameters are automatically grouped, and each convolution operator and batch normalization (BN) operator are grouped into one group;
- Step 3 Select the uniform sampling probability distribution with the most cost reduction, indicating that the probability of each group of parameters being sampled is the same; and select a periodic arrangement.
- FIG. 9 shows a schematic diagram of the sampling of training iteration steps using a periodic arrangement according to an embodiment of the present application.
- As shown, whole-network training is performed at iteration steps 0 to 4, and the first group of parameters is frozen at iteration step 5;
- iteration step 6 freezes the first two groups of parameters;
- iteration step 7 freezes the first three groups of parameters;
- iteration step 8 freezes the first four groups of parameters;
- and iteration step 9 freezes the first five groups of parameters.
- Step 4 Use the tensorflow deep learning framework to construct multiple paths in the computational graph and start iterative training.
- The computation graph is shown in Figure 10.
- The training method of the neural network in the embodiments of the present application can achieve a certain regularization effect in the recognition task; while reducing the overhead to a small extent, it can also improve the accuracy of the model to a certain extent.
- the following introduces another process of performing network training using the neural network training method according to the embodiment of the present application.
- Step 1 Input neural networks such as ResNet50, ResNet18, MobileNetV2;
- Step 2 The parameters are automatically grouped, and each convolution operator and batch normalization (BN) operator are grouped into one group;
- Step 3 Select a linearly decreasing sampling probability distribution.
- The sampled (stop-updated) parameters still have their gradient and momentum calculated, which avoids a momentum shift for the corresponding parameter groups in the next iteration step, but no parameter update is performed. Since the randomness of stochastic gradient descent (SGD) introduces a certain amount of noise, during normal training the useful signal carried by the gradient propagated from the loss back to the earlier layers is already very weak for the front layers of the network.
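- The behaviour just described, where gradients and momentum are still computed for stop-updated groups but the parameter values are left unchanged, can be made concrete with an explicit SGD-with-momentum step (a sketch of the idea, not the authors' implementation; the hyperparameter values are placeholders).

```python
import torch

def sgd_momentum_step(params, momentum_buffers, lr=0.1, momentum=0.9, stopped=False):
    """Explicit SGD-with-momentum update for one group of parameters.
    With stopped=True the momentum buffers are still refreshed from the freshly
    computed gradients, so later steps see up-to-date momentum, but the parameter
    values themselves are not changed (stop-updating)."""
    with torch.no_grad():
        for p, buf in zip(params, momentum_buffers):
            if p.grad is None:
                continue                        # e.g. frozen parameters have no gradient
            buf.mul_(momentum).add_(p.grad)     # momentum is always kept up to date
            if not stopped:
                p.add_(buf, alpha=-lr)          # only non-stopped groups are updated

# momentum buffers are initialised once, e.g. buffers = [torch.zeros_like(p) for p in params]
```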
- FIG. 11 shows a schematic diagram of training iteration step sampling using an interval arrangement according to an embodiment of the present application.
- On the premise of not changing the user's network structure, the training method of the neural network according to the embodiments of the present application can reduce the overhead to a small extent, and can improve accuracy through an appropriate sampling probability distribution and parameter stop-updating.
- FIG. 12 shows a schematic flowchart representing a data processing method provided by an embodiment, including steps 1201 to 1202 .
- The training iteration step arrangement includes an interval arrangement and a periodic arrangement; according to the sampling probability distribution and the training iteration step arrangement, the sampled parameter group is frozen or stop-updated; and the neural network to be trained is trained according to the frozen parameter group or the stop-updated parameter group.
- the neural network used in the data processing in FIG. 12 is obtained by training according to the training method of the neural network in FIG. 6 .
- For the training of the neural network, reference may be made to the above description of FIG. 6.
- Details are not repeated here in the embodiments of the present application.
- FIG. 13 shows a schematic block diagram of a neural network training apparatus according to an embodiment of the present application, including a storage module 1310 , an acquisition module 1320 , and a processing module 1330 , which will be introduced separately below.
- the storage module 1310 is used to store programs.
- the obtaining module 1320 is used to obtain the neural network to be trained.
- the processing module 1330 is configured to group the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1.
- the obtaining module 1320 is further configured to obtain the sampling probability distribution and the training iteration step arrangement, the sampling probability distribution is used to represent the probability that each group of parameters in the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement Including interval arrangement and periodic arrangement.
- the processing module 1330 is further configured to freeze or stop updating the sampled parameter group according to the sampling probability distribution and the training iteration step arrangement; train the neural network to be trained according to the frozen parameter group or the stopped parameter group.
- The processing module 1330 freezes or stop-updates the sampled parameter group according to the sampling probability distribution and the training iteration step arrangement, and is specifically used for: determining the first iteration step according to the training iteration step arrangement; determining, according to the sampling probability distribution, the mth group of parameters sampled in the first iteration step, where m is less than or equal to M-1; and freezing the mth group of parameters to the first group of parameters in the first iteration step, where freezing the mth group of parameters to the first group of parameters in the first iteration step means that no gradient calculation and no parameter update are performed for the mth group of parameters to the first group of parameters.
- Alternatively, the processing module 1330 freezes or stop-updates the sampled parameter group according to the sampling probability distribution and the training iteration step arrangement, and is specifically used for: determining the sampled first iteration step according to the training iteration step arrangement; determining, according to the sampling probability distribution, the mth group of parameters sampled in the first iteration step, where m is less than or equal to M-1; and stop-updating the mth group of parameters to the first group of parameters in the first iteration step, where stop-updating the mth group of parameters to the first group of parameters in the first iteration step means that the gradient calculation is performed for the mth group of parameters to the first group of parameters, but no parameter update is performed.
- the processing module 1130 determines the sampled first iteration step according to the training iteration step arrangement, and is specifically used for: determining the first interval; step, one or more first iteration steps are determined at every first interval.
- When the training iteration step arrangement is the periodic arrangement, the processing module 1330 determines the sampled first iteration step according to the training iteration step arrangement by: determining that the number of first iteration steps is M-1; and determining a first cycle according to the number of first iteration steps and a first ratio, where the first cycle includes the first iteration steps and full-network training iteration steps, the first ratio is the proportion of the first iteration steps in the first cycle, and the first iteration steps are the last M-1 iteration steps of the first cycle.
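As an illustration only (not part of the claimed apparatus), the periodic arrangement described above can be sketched as follows; the helper name and the use of the first ratio as the share of sampled steps are assumptions made for the example.

```python
def periodic_cycle(M: int, first_ratio: float):
    """Sketch: build one cycle of the periodic arrangement.

    The cycle ends with M-1 sampled ("first") iteration steps; the remaining
    steps at the front of the cycle train the whole network.
    """
    num_first = M - 1                            # number of sampled steps per cycle
    cycle_len = round(num_first / first_ratio)   # first_ratio = share of sampled steps in the cycle
    full_net = cycle_len - num_first             # leading full-network training steps
    # e.g. M=6, first_ratio=0.5 -> a cycle of 10: 5 full-network steps, then 5 sampled steps
    return ["full"] * full_net + ["sampled"] * num_first
```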
- The neural network training apparatus 1300 in this embodiment of the present application may be used to implement each step of the method in FIG. 6; for the specific implementation, reference may be made to the above description of the method in FIG. 6, and for brevity, details are not repeated here.
- FIG. 14 shows a schematic block diagram of a data processing apparatus according to an embodiment of the present application, including a storage module 1410 , an acquisition module 1420 , and a processing module 1430 , which will be introduced separately below.
- the storage module 1410 is used to store programs.
- the obtaining module 1420 is used to obtain the data to be processed.
- the processing module 1430 is used to process the data to be processed according to the target neural network.
- the target neural network is obtained through training.
- The training of the target neural network includes: obtaining a neural network to be trained; grouping the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1; obtaining a sampling probability distribution and a training iteration step arrangement, where the sampling probability distribution is used to represent the probability that each group of parameters in the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement includes an interval arrangement and a periodic arrangement; freezing or stopping the update of the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement; and training the neural network to be trained according to the frozen parameter groups or the stop-updated parameter groups.
- The data processing apparatus 1400 in this embodiment of the present application may be used to implement each step of the method in FIG. 12; for the specific implementation, reference may be made to the above description of the method in FIG. 12, and for brevity, details are not repeated here.
- FIG. 15 is a schematic diagram of the hardware structure of a neural network training apparatus 1500 according to an embodiment of the present application. As shown in FIG. 15 , it includes a memory 1501 , a processor 1502 , a communication interface 1503 and a bus 1504 . The memory 1501 , the processor 1502 , and the communication interface 1503 are connected to each other through the bus 1504 for communication.
- The memory 1501 may be a ROM, a static storage device, or a RAM.
- the memory 1501 may store programs. When the programs stored in the memory 1501 are executed by the processor 1502, the processor 1502 and the communication interface 1503 are used to execute various steps of the neural network training method of the embodiment of the present application.
- The processor 1502 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to implement the functions that need to be performed by the units in the neural network training apparatus of the embodiments of the present application, or to perform the neural network training method of the method embodiments of the present application.
- The processor 1502 may also be an integrated circuit chip with signal processing capability.
- each step of the neural network training method in the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 1502 or instructions in the form of software.
- the above-mentioned processor 1502 may also be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
- the methods, steps, and logic block diagrams disclosed in the embodiments of this application can be implemented or executed.
- a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
- The storage medium is located in the memory 1501. The processor 1502 reads the information in the memory 1501 and, in combination with its hardware, completes the functions that need to be performed by the units included in the neural network training apparatus of the embodiments of the present application, or performs the neural network training method of the method embodiments of the present application.
- the communication interface 1503 implements communication between the apparatus 1500 and other devices or a communication network using a transceiving device such as, but not limited to, a transceiver.
- the neural network to be trained can be acquired through the communication interface 1503 .
- The bus 1504 may include a path for transferring information between the components of the apparatus 1500 (for example, the memory 1501, the processor 1502, and the communication interface 1503).
- FIG. 16 shows a schematic diagram of the hardware structure of a data processing apparatus 1600 according to an embodiment of the present application, including a memory 1601 , a processor 1602 , a communication interface 1603 , and a bus 1604 .
- the memory 1601 , the processor 1602 , and the communication interface 1603 are connected to each other through the bus 1604 for communication.
- The memory 1601 may be a ROM, a static storage device, or a RAM.
- the memory 1601 may store a program. When the program stored in the memory 1601 is executed by the processor 1602, the processor 1602 and the communication interface 1603 are used to execute each step of the data processing method of the embodiment of the present application.
- The processor 1602 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to implement the functions that need to be performed by the units in the data processing apparatus of the embodiments of the present application, or to perform the data processing method of the method embodiments of the present application.
- the processor 1602 may also be an integrated circuit chip with signal processing capability.
- each step of the data processing method of the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 1602 or an instruction in the form of software.
- the above-mentioned processor 1602 may also be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
- the methods, steps, and logic block diagrams disclosed in the embodiments of this application can be implemented or executed.
- a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
- The storage medium is located in the memory 1601. The processor 1602 reads the information in the memory 1601 and, in combination with its hardware, completes the functions that need to be performed by the units included in the data processing apparatus of the embodiments of the present application, or performs the data processing method of the method embodiments of the present application.
- the communication interface 1603 implements communication between the apparatus 1600 and other devices or a communication network using a transceiving device such as, but not limited to, a transceiver.
- the data to be processed can be acquired through the communication interface 1603 .
- The bus 1604 may include a path for transferring information between the components of the apparatus 1600 (for example, the memory 1601, the processor 1602, and the communication interface 1603).
- It should be noted that although the apparatuses 1500 and 1600 show only the memory, the processor, and the communication interface, in a specific implementation process a person skilled in the art should understand that the apparatuses 1500 and 1600 may further include other devices necessary for normal operation. In addition, according to specific needs, a person skilled in the art should understand that the apparatuses 1500 and 1600 may further include hardware devices for implementing other additional functions. Moreover, a person skilled in the art should understand that the apparatuses 1500 and 1600 may alternatively include only the devices necessary for implementing the embodiments of the present application, and do not necessarily include all the devices shown in FIG. 15 and FIG. 16.
- The processor in the embodiments of the present application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
- a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- The memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM), and a direct rambus random access memory (DR RAM).
- the embodiments of the present application also provide a computer program product, which implements the method of any of the method embodiments in the present application when the computer program product is executed by the processors 1502 and 1602 .
- The computer program product may be stored in the memories 1501 and 1601, and the program is finally converted, through processing such as preprocessing, compiling, assembling, and linking, into an executable object file that can be executed by the processors 1502 and 1602.
- the embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a computer, implements the method of any of the method embodiments in the present application.
- the computer program can be a high-level language program or an executable object program.
- the computer-readable storage medium is, for example, the memories 1501 and 1601 .
- An embodiment of the present application further provides a chip, the chip includes a processor and a data interface, and the processor reads the instructions stored in the memory through the data interface, and executes the method of any method embodiment of the present application.
- the above embodiments may be implemented in whole or in part by software, hardware, firmware or any other combination.
- the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
- the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
- The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner (for example, infrared, radio, or microwave).
- The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center that contains one or more sets of available media.
- The available media may be magnetic media (for example, a floppy disk, a hard disk, or a magnetic tape), optical media (for example, a DVD), or semiconductor media.
- the semiconductor medium may be a solid state drive.
- In the present application, "at least one" means one or more, and "a plurality of" means two or more. "At least one of the following items" or a similar expression refers to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may be singular or plural.
- It should be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation processes of the embodiments of the present application.
- the disclosed system, apparatus and method may be implemented in other manners.
- The apparatus embodiments described above are merely illustrative. The division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
- The mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces; the indirect couplings or communication connections between apparatuses or units may be in electrical, mechanical, or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- When the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product.
- The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application.
- The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Abstract
This application relates to the field of artificial intelligence and provides a neural network training method and apparatus that enable fine-grained control over the parameter groups of a neural network at the training-iteration-step level, improving training accuracy while accelerating training. The method includes: obtaining a neural network to be trained; grouping the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1; obtaining a sampling probability distribution and a training iteration step arrangement, where the sampling probability distribution is used to represent the probability that each of the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement includes an interval arrangement and a periodic arrangement; freezing or stop-updating the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement; and training the neural network to be trained according to the frozen or stop-updated parameter groups.
Description
This application claims priority to Chinese Patent Application No. 202011322834.6, entitled "Neural network training method and apparatus" and filed with the China National Intellectual Property Administration on November 23, 2020, which is incorporated herein by reference in its entirety.
This application relates to the field of artificial intelligence, and in particular, to a neural network training method and apparatus.
Deep learning has made great progress in computer vision. Taking image recognition as an example, deep neural network models have led traditional computer vision methods by a large margin in the ImageNet large scale visual recognition challenge (ILSVRC) since 2012. The ImageNet (ILSVRC 2012) dataset contains roughly 1.28 million images, and training ResNet50 on it for 90 epochs on eight V100 compute cards takes about 8 hours. The GPT-3 model released by OpenAI has about 175 billion parameters and is trained on 45 TB of data, with a single training run costing about 13 million US dollars. As datasets grow larger and network models carry more parameters, higher-accuracy models come at the price of ever more training time and money. How to accelerate neural network training has therefore become an urgent problem.
SUMMARY
This application provides a neural network training method and apparatus that enable fine-grained control over the parameter groups of a neural network at the iteration-step level, improving training accuracy while accelerating training.
According to a first aspect, a neural network training method is provided, including: obtaining a neural network to be trained; grouping the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1; obtaining a sampling probability distribution and a training iteration step arrangement, where the sampling probability distribution is used to represent the probability that each of the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement includes an interval arrangement and a periodic arrangement; freezing or stop-updating the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement; and training the neural network to be trained according to the frozen or stop-updated parameter groups.
The neural network training method of the embodiments of this application handles the parameter groups of the neural network at the iteration-step level, achieving fine-grained control of the acceleration process and improving training accuracy while accelerating training. Sampling and processing the parameter groups through the training iteration step arrangement and the sampling probability distribution allows a more flexible trade-off between training cost and training accuracy; for example, the sampling probability of each group can be chosen according to that group's share of the training cost.
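For instance, one way to realize the cost-based choice mentioned above is to turn each group's measured share of the training cost into a probability; the function below is only an illustrative sketch, and the exact mapping from cost share to probability is left as a design choice.

```python
def cost_proportional_sampling(group_costs):
    """Sketch: turn measured per-group costs (e.g. FLOPs or time) into a
    sampling distribution over the parameter groups. The method only requires
    a valid distribution, so here the shares are simply normalized to sum to 1."""
    total = sum(group_costs)
    return [c / total for c in group_costs]
```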
With reference to the first aspect, in some possible implementations, the freezing or stop-updating of the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement includes: determining a first iteration step according to the training iteration step arrangement, where the first iteration step is an iteration step to be sampled; determining, according to the sampling probability distribution, an m-th group of parameters sampled in the first iteration step, where m is less than or equal to M-1; and freezing the m-th group of parameters through the first group of parameters in the first iteration step, where freezing the m-th group through the first group in the first iteration step means that no gradient calculation and no parameter update are performed for the m-th group through the first group of parameters.
Freezing some parameter groups selected according to the sampling probability distribution, with neither gradient calculation nor parameter updates, accelerates neural network training. For the frozen parameter groups, within one cycle the parameter groups of subsequent iteration steps do not rely on the parameters of previously frozen groups, which avoids the momentum shift problem.
With reference to the first aspect, in some possible implementations, the freezing or stop-updating of the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement includes: determining a first iteration step according to the training iteration step arrangement, where the first iteration step is an iteration step to be sampled; determining, according to the sampling probability distribution, an m-th group of parameters sampled in the first iteration step, where m is less than or equal to M-1; and stop-updating the m-th group of parameters through the first group of parameters in the first iteration step, where stop-updating the m-th group through the first group in the first iteration step means that gradient calculation is performed for the m-th group through the first group of parameters but no parameter update is performed.
Stop-updating some parameter groups selected according to the sampling probability distribution, computing gradients without updating the parameters, improves the accuracy of neural network training. For stop-updated parameter groups, gradients are still computed, so the corresponding parameter groups in subsequent iteration steps stay up to date, which avoids the momentum shift problem.
With reference to the first aspect, in some possible implementations, when the training iteration step arrangement is the interval arrangement, determining the first iteration step according to the training iteration step arrangement includes: determining a first interval; and, among multiple training iteration steps, determining one or more first iteration steps at every first interval.
With reference to the first aspect, in some possible implementations, when the training iteration step arrangement is the periodic arrangement, determining the first iteration step according to the training iteration step arrangement includes: determining that the number of first iteration steps is M-1; and determining a first cycle according to the number of first iteration steps and a first ratio, where the first cycle includes the first iteration steps and full-network training iteration steps, the first ratio is the proportion of the first iteration steps in the first cycle, and the first iteration steps are the last M-1 iteration steps of the first cycle.
The neural network training method of the embodiments of this application can determine the iteration steps to be sampled in either of the two ways above: the periodic arrangement effectively increases training speed, while the interval arrangement effectively improves training accuracy.
According to a second aspect, a data processing method is provided, including: obtaining data to be processed; and processing the data to be processed with a target neural network, where the target neural network is obtained through training, and the training of the target neural network includes: obtaining a neural network to be trained; grouping the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1; obtaining a sampling probability distribution and a training iteration step arrangement, where the sampling probability distribution is used to represent the probability that each of the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement includes an interval arrangement and a periodic arrangement; freezing or stop-updating the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement; and training the neural network to be trained according to the frozen or stop-updated parameter groups.
The data processing method provided in this application processes data with a neural network trained by the training method of the first aspect or any implementation of the first aspect, which can effectively improve the data processing capability of the neural network.
According to a third aspect, a neural network training apparatus is provided, including: an obtaining module, configured to obtain a neural network to be trained; and a processing module, configured to group the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1. The obtaining module is further configured to obtain a sampling probability distribution and a training iteration step arrangement, where the sampling probability distribution is used to represent the probability that each of the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement includes an interval arrangement and a periodic arrangement; and the processing module is further configured to freeze or stop-update the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement, and to train the neural network to be trained according to the frozen or stop-updated parameter groups.
The embodiments of this application further provide a neural network training apparatus, which can be used to implement the method in any implementation of the first aspect.
With reference to the third aspect, in some possible implementations, the processing module freezes or stop-updates the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement by: determining a first iteration step according to the training iteration step arrangement, where the first iteration step is an iteration step to be sampled; determining, according to the sampling probability distribution, an m-th group of parameters sampled in the first iteration step, where m is less than or equal to M-1; and freezing the m-th group of parameters through the first group of parameters in the first iteration step, where freezing the m-th group through the first group in the first iteration step means that no gradient calculation and no parameter update are performed for the m-th group through the first group of parameters.
With reference to the third aspect, in some possible implementations, the processing module freezes or stop-updates the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement by: determining a first iteration step according to the training iteration step arrangement, where the first iteration step is an iteration step to be sampled; determining, according to the sampling probability distribution, an m-th group of parameters sampled in the first iteration step, where m is less than or equal to M-1; and stop-updating the m-th group of parameters through the first group of parameters in the first iteration step, where stop-updating the m-th group through the first group in the first iteration step means that gradient calculation is performed for the m-th group through the first group of parameters but no parameter update is performed.
With reference to the third aspect, in some possible implementations, when the training iteration step arrangement is the interval arrangement, the processing module determines the sampled first iteration step according to the training iteration step arrangement by: determining a first interval; and, among multiple training iteration steps, determining one or more first iteration steps at every first interval.
With reference to the third aspect, in some possible implementations, when the training iteration step arrangement is the periodic arrangement, the processing module determines the sampled first iteration step according to the training iteration step arrangement by: determining that the number of first iteration steps is M-1; and determining a first cycle according to the number of first iteration steps and a first ratio, where the first cycle includes the first iteration steps and full-network training iteration steps, the first ratio is the proportion of the first iteration steps in the first cycle, and the first iteration steps are the last M-1 iteration steps of the first cycle.
According to a fourth aspect, a data processing apparatus is provided, including: an obtaining module, configured to obtain data to be processed; and a processing module, configured to process the data to be processed with a target neural network, where the target neural network is obtained through training, and the training of the target neural network includes: obtaining a neural network to be trained; grouping the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1; obtaining a sampling probability distribution and a training iteration step arrangement, where the sampling probability distribution is used to represent the probability that each of the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement includes an interval arrangement and a periodic arrangement; freezing or stop-updating the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement; and training the neural network to be trained according to the frozen or stop-updated parameter groups.
According to a fifth aspect, an electronic device is provided, including a memory and a processor, where the memory is configured to store program instructions, and when the program instructions are executed by the processor, the processor is configured to perform the method in any implementation of the first aspect or the method in the second aspect.
The processor in the fifth aspect may be a central processing unit (CPU), or may be a combination of a CPU and a neural network computation processor.
According to a sixth aspect, a computer-readable medium is provided, where the computer-readable medium stores program code to be executed by a device, and the program code includes instructions for performing the method in any implementation of the first aspect or the method in the second aspect.
According to a seventh aspect, a computer program product including instructions is provided, where when the computer program product runs on a computer, the computer is caused to perform the method in any implementation of the first aspect or the method in the second aspect.
According to an eighth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method in any implementation of the first aspect or the method in the second aspect.
Optionally, in an implementation, the chip may further include a memory, where the memory stores instructions and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to perform the method in any implementation of the first aspect or the method in the second aspect.
The chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
FIG. 1 is a schematic structural diagram of a convolutional neural network according to an embodiment of this application;
FIG. 2 is a schematic block diagram of a system architecture to which the neural network training method of an embodiment of this application is applied;
FIG. 3 is a schematic diagram of an interval arrangement of training iteration steps according to an embodiment of this application;
FIG. 4 is a schematic illustration of momentum shift according to an embodiment of this application;
FIG. 5 is a schematic diagram of a periodic arrangement of training iteration steps according to an embodiment of this application;
FIG. 6 is a schematic flowchart of a neural network training method according to an embodiment of this application;
FIG. 7 is a schematic block diagram of neural network parameter grouping according to an embodiment of this application;
FIG. 8 is a schematic block diagram of a neural network training method according to an embodiment of this application;
FIG. 9 is a schematic diagram of training-iteration-step sampling in the periodic arrangement according to an embodiment of this application;
FIG. 10 is a schematic block diagram of a computation graph of a static-graph deep learning framework according to an embodiment of this application;
FIG. 11 is a schematic diagram of training-iteration-step sampling in the interval arrangement according to an embodiment of this application;
FIG. 12 is a schematic flowchart of a data processing method according to an embodiment of this application;
FIG. 13 is a schematic block diagram of a neural network training apparatus according to an embodiment of this application;
FIG. 14 is a schematic block diagram of a data processing apparatus according to an embodiment of this application;
FIG. 15 is a schematic diagram of the hardware structure of a neural network training apparatus according to an embodiment of this application;
FIG. 16 is a schematic diagram of the hardware structure of a data processing apparatus according to an embodiment of this application.
The terms used in the following embodiments are only intended to describe particular embodiments and are not intended to limit this application. As used in the specification and the appended claims of this application, the singular expressions "a", "an", "the", "the foregoing", "this", and "the one" are intended to also include expressions such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the following embodiments of this application, "at least one" and "one or more" mean one, two, or more than two. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the following cases: only A exists, both A and B exist, and only B exists, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects.
Reference to "an embodiment", "some embodiments", or the like in this specification means that one or more embodiments of this application include a particular feature, structure, or characteristic described with reference to that embodiment. Therefore, statements such as "in an embodiment", "in some embodiments", "in some other embodiments", and "in still other embodiments" appearing in different places in this specification do not necessarily refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "include", "comprise", "have", and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
To facilitate understanding of the technical solutions of this application, the concepts involved are briefly introduced first.
Deep learning: a machine learning technique based on deep neural network algorithms, characterized by processing and analyzing data with multiple non-linear transformations. It is mainly applied to perception and decision-making scenarios in the field of artificial intelligence, such as image recognition, speech recognition, natural language translation, and computer game playing.
Training: in the embodiments of this application, training specifically refers to the training of a neural network, which generally includes computing the model output in a forward pass, computing the loss from the model output and the labels, back-propagating to obtain gradients, and updating the parameters. Using an existing dataset and its labels, the model is optimized with the back-propagation algorithm and some parameter update method so that the loss value becomes as small as possible.
Freezing: in the back-propagation step of neural network training, not computing the gradients of certain parameters and not updating those parameters is referred to as freezing those parameters in the backward pass.
Stop-updating: in the backward step of neural network training, continuing to compute the gradients of certain parameters while no longer updating them is referred to as stop-updating those parameters.
Cost: in the embodiments of this application, cost refers to the resources consumed during neural network training, which can generally be estimated from the amount of computation of the neural network.
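To make the two notions concrete, the following is a minimal PyTorch-style sketch of how freezing and stop-updating could be realized; the helper names are illustrative and this is not presented as the implementation prescribed by this application.

```python
def freeze(params):
    """Freeze: no gradient computation and no update for these torch tensors."""
    for p in params:
        p.requires_grad_(False)   # autograd skips these parameters in the backward pass

def stop_update(param_group):
    """Stop-update: gradients are still computed, but the parameters are not changed.
    One simple realization with an optimizer param group: keep requires_grad=True and
    temporarily set this group's learning rate to zero for the current iteration."""
    param_group["saved_lr"] = param_group["lr"]
    param_group["lr"] = 0.0

def resume_update(param_group):
    param_group["lr"] = param_group.pop("saved_lr")
```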
The technical solutions of this application are described below with reference to the accompanying drawings.
Current approaches to accelerating neural network training mainly focus on hardware upgrades and algorithm optimization. On the hardware side, higher-performance GPUs are also more expensive; multi-card parallelism, multi-machine parallelism, and large-scale clusters are also common acceleration methods. On the algorithm side, mixed-precision training reduces the amount of computation of the neural network and effectively accelerates training in some scenarios. The neural network training method of the embodiments of this application mainly involves algorithmic improvements, further accelerating training and reducing actual cost without changing the hardware.
The neural network training method of the embodiments of this application can be applied to the convolutional neural network structure shown in FIG. 1. In FIG. 1, the convolutional neural network (CNN) 100 may include an input layer 110, convolutional/pooling layers 120 (where the pooling layers are optional), and neural network layers 130. The input layer 110 obtains the data to be processed and passes it to the convolutional/pooling layers 120 and the subsequent neural network layers 130 for processing, yielding the processing result of the data. The internal layer structure of the CNN 100 in FIG. 1 is described in detail below.
Convolutional/pooling layers 120:
Convolutional layers:
As shown in FIG. 1, the convolutional/pooling layers 120 may include, for example, layers 121 to 126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer may serve as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, the internal working principle of a convolutional layer is introduced below.
Convolutional layer 121 may include many convolution operators, also called kernels, whose role in data processing is that of a filter extracting specific information from the input data matrix. A convolution operator is essentially a weight matrix, which is usually predefined.
In practice, the weight values in these weight matrices need to be obtained through extensive training. The weight matrices formed by the trained weight values can be used to extract information from the input data, enabling the convolutional neural network 100 to make correct predictions.
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layers (for example, 121) often extract more general features, which may also be called low-level features. As the depth of the convolutional neural network 100 increases, the features extracted by later convolutional layers (for example, 126) become more and more complex, such as high-level semantic features, and features with higher semantics are more suitable for the problem to be solved.
Pooling layers:
Since it is often necessary to reduce the number of training parameters, pooling layers are often periodically introduced after convolutional layers. In layers 121 to 126 shown as 120 in FIG. 1, a single convolutional layer may be followed by a single pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In data processing, the sole purpose of the pooling layer is to reduce the spatial size of the data.
Neural network layers 130:
After processing by the convolutional/pooling layers 120, the convolutional neural network 100 is not yet able to output the required output information, because as described above, the convolutional/pooling layers 120 only extract features and reduce the parameters brought by the input data. To generate the final output information (the required class information or other related information), the convolutional neural network 100 uses the neural network layers 130 to generate one output or a set of outputs whose number equals the number of required classes. Therefore, the neural network layers 130 may include multiple hidden layers (131, 132 to 13n shown in FIG. 1) and an output layer 140. The parameters contained in these hidden layers may be pre-trained on relevant training data for a specific task type, such as recognition or classification.
The neural network may use the error back propagation (BP) algorithm to correct the parameter values of the initial neural network model during training, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, forwarding the input signal to the output produces an error loss, and the parameters of the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back-propagation algorithm is a backward movement dominated by the error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrices.
After the multiple hidden layers in the neural network layers 130, the last layer of the whole convolutional neural network 100 is the output layer 140, which has a loss function similar to categorical cross-entropy and is used to compute the prediction error. Once the forward propagation of the whole convolutional neural network 100 (propagation from 110 to 140 in FIG. 1) is completed, the backward propagation (propagation from 140 to 110 in FIG. 1) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100, that is, the error between the result output by the output layer and the ideal result.
In one existing neural network training method, the layers of the network are frozen from front to back during training; once a layer is frozen, it is not trained again until training ends. For example, the parameters of a network are divided into 6 parts and trained for 90 epochs in total, where the first part covers epochs 1-46, the second part epochs 47-53, the third part epochs 54-61, the fourth part epochs 62-70, the fifth part epochs 71-79, and the sixth part epochs 80-90. Through certain computation rules, every group of parameters receives full-network training in the first part, that is, every parameter in the network has its gradient computed and is updated during back propagation; in the second part the first group of parameters is frozen, that is, it only participates in the forward computation, with no gradient computed and no update performed during back propagation; in the third part the first two groups are frozen; in the fourth part the first three groups; in the fifth part the first four groups; and in the sixth part the first five groups. The learning rate of each group is scaled according to the proportion of that group's training duration to the total training duration, so in this training scheme the learning rate increases from the last group to the first group. This method is only applicable to the training of some neural networks; it freezes a group of parameters every several epochs and controls parameter freezing at a coarse granularity, leading to a large accuracy loss, and it requires learning-rate scaling according to each group's frozen-duration ratio, which adds extra work.
Another existing training method computes a certain criterion from the state of the network gradients during training and then decides whether freezing is needed according to that criterion. This method introduces extra computation, and in some scenarios the training speed decreases instead of increasing. Yet another existing method accelerates training by adding random factors during training and randomly skipping the forward and backward computation of some residual branches. However, that method can only be used on networks with certain specific structures, so it is rather limited and has a narrow scope of application.
The neural network training method of the embodiments of this application automatically groups the parameters of the user's network and introduces a sampling probability distribution for fine-grained freezing control of the training process. Fine interval or periodic arrangement is performed at the training-step dimension and momentum shift is corrected, so that every parameter group of the network has a certain probability of being updated throughout the whole training process; no group is trained only in the early stage and then completely frozen and never updated later, thereby guaranteeing training accuracy.
The neural network training method of the embodiments of this application is applicable to deep learning frameworks such as MindSpore, TensorFlow, and PyTorch, and to scenarios in which, together with hardware platforms such as Ascend chips and GPUs, the training of neural networks for various computer vision tasks needs to be accelerated, where the computer vision tasks may be object recognition, object detection, semantic segmentation, and the like.
FIG. 2 shows a schematic block diagram of the system architecture to which the neural network training method of an embodiment of this application is applied; this system architecture can accelerate the training of the neural network shown in FIG. 1. As shown in FIG. 2, the system architecture includes a probability distribution module and a training-iteration-step arrangement module, which are introduced separately below.
The probability distribution module introduces a sampling probability distribution for fine-grained freezing control of the training process, where the probability distribution includes sampling probabilities and freezing probabilities. The sampling probability controls the probability of each group of the neural network being sampled; for example, if a group of parameters is sampled in a training iteration step, then that group and every group up to the first group of the network will be frozen in the backward pass. Once the sampling probability distribution is determined, the freezing probability distribution of the parameters is determined accordingly.
The sampling probability distribution is: p_s(i) = f(i).
The freezing probability distribution is: p_f(i) = p_s(i) + p_s(i+1) + ... + p_s(n-1), where p_0 denotes the probability of not freezing any parameter during neural network training, that is, the probability of full-network training; n denotes the number of parameter groups of the network; and i denotes the index of a parameter group, ranging from 0 to n-1. The function f(x) can be either a continuous function or a discrete function; for example, the sampling probability of each group can be determined according to that group's measured share of the training cost. The probability that the i-th group is frozen equals the sum of the sampling probabilities of the i-th to the (n-1)-th groups, and the sampling probabilities of all groups sum to 1. The larger the area under the freezing probability distribution curve, the more the training cost is reduced.
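As a numerical sketch of the two distributions just defined (the function f and the group count are placeholders chosen for the example, not values taken from this application):

```python
def sampling_and_freeze_probs(f, n):
    """Evaluate the sampling distribution p_s(i) = f(i) for the n groups and the
    derived freezing distribution p_f(i) = sum of p_s(j) for j = i .. n-1."""
    p_s = [f(i) for i in range(n)]
    total = sum(p_s)
    p_s = [p / total for p in p_s]           # sampling probabilities sum to 1
    p_f = [sum(p_s[i:]) for i in range(n)]   # group i is frozen whenever any group j >= i is sampled
    return p_s, p_f

# Example: a linearly decreasing f, so earlier groups are sampled (and frozen) more often
p_s, p_f = sampling_and_freeze_probs(lambda i: 6 - i, n=6)
```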
The training-iteration-step arrangement module arranges the training iteration steps at a fine granularity according to the probability distribution selected by the probability distribution module, so as to determine which groups of parameters are frozen at each iteration during training. FIG. 3 shows a training-iteration-step arrangement with a full-network training probability p_0 of 0.5: since p_0 is 0.5, full-network training can be performed every other step, that is, full-network training is performed at iteration steps 0, 2, and 4, while at iteration steps 1, 3, and 5 the 3rd, 5th, and 1st groups of parameters are sampled according to the sampling probability distribution function, so the first 3, first 5, and first 1 groups are frozen respectively. This arrangement is called the interval arrangement.
In FIG. 3, an active layer indicates that both forward and backward computation are performed, a frozen layer indicates that only forward computation is performed while backward computation is frozen, and each active or frozen layer represents one group of parameters. In the evenly spaced full-network-training arrangement of FIG. 3, frozen layers are unfrozen at every iteration step, which causes momentum shift. For example, in FIG. 4, because the gradients of the first 3 groups of parameters are missing at iteration step 1, no momentum is computed and no parameter update is performed for them, so the momentum used at iteration step 2 is still the momentum based on iteration step 0.
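As a brief illustration of the momentum shift (written in standard SGD-with-momentum notation, which is an assumption of this illustration rather than notation taken from this application): the usual update for one parameter group is

```latex
v_t = \mu\, v_{t-1} + g_t, \qquad \theta_{t+1} = \theta_t - \eta\, v_t
```

If the group is frozen at step t, the gradient g_t is never computed and v_t is not refreshed, so the update performed at the next step still relies on the momentum accumulated at step t-1, which is the shift depicted in FIG. 4.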
Therefore, the embodiments of this application design the periodic step-by-step freezing arrangement (periodic mode) shown in FIG. 5, where an index of 0 denotes full-network training and indices 1, 2, 3, 4, and 5 denote freezing the first 1, 2, 3, 4, and 5 groups of parameters in the backward computation respectively; the last group of parameters of the network never participates in freezing. The iteration-step arrangement and the index curve in FIG. 5 are both periodic, and the sampling probability decreases linearly, meaning that the earlier a parameter group is, the more likely it is to be sampled; this yields the training-iteration-step arrangement shown on the left of FIG. 5.
For any sampling probability distribution function, both the interval and the periodic arrangement of training iteration steps described above can be realized from the probability of each step being sampled and the probability of full-network training. For the interval arrangement, within 10 consecutive training iterations, one or more iteration steps are selected for freeze-sampling every n iteration steps, where n can be a preset value or determined from the full-network training probability. For example, in FIG. 3, when p_0 is 0.5, n is 1, so freeze-sampling is performed every other iteration step. For the periodic arrangement, it is only necessary to adjust the full-network training iteration steps according to the value of p_0 and the ratio of p_0 to 1-p_0. For example, in FIG. 5, when p_0 is 0.5, the ratio of the number of full-network training steps to the number of freeze-sampled steps is 1:1, so iteration steps 0 to 14 are full-network training and iteration steps 15 to 29 are freeze-sampled; then, according to the sampling probability curve, the earlier a parameter group is, the more likely it is to be sampled.
FIG. 6 shows a schematic flowchart of the neural network training method of an embodiment of this application. As shown in FIG. 6, the method includes steps 601 to 605, which are introduced separately below.
S601: Obtain a neural network to be trained.
The neural network training method of the embodiments of this application can be applied to tasks such as object detection, image segmentation, natural language processing, and speech recognition. The neural network to be trained may be a convolutional neural network as shown in FIG. 1, for example a network from the ResNet or MobileNet families, or another neural network; the embodiments of this application do not specifically limit this.
S602: Group the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1.
The neural network parameters in the embodiments of this application include the operators of each layer of the neural network, and training the neural network means determining the weights of the operators in each layer. The neural network training method of the embodiments of this application can automatically group the parameters of the neural network to be trained, and the grouping criterion can be preset. Parameter grouping follows the order from input to output: as shown in FIG. 7, the group of parameters closest to the input is determined as the first group, i.e., group 0 in FIG. 7, and so on, with the group farthest from the input determined as the last group, i.e., group 5 in FIG. 7. When grouping, a single operator may form one group, or several consecutive operators may form one group; generally a convolution operator and the batch normalization (BN) operator that follows it are placed in one group. For example, for the ResNet or MobileNet networks mentioned in S601, it can be preset that each convolution operator and its BN operator form one group, so that after the ResNet or MobileNet network to be trained is obtained, the training method automatically groups its parameters according to the preset rule. FIG. 8 shows a schematic block diagram of the neural network training method of an embodiment of this application; as shown in FIG. 8, in the automatic parameter grouping step, the input can be automatically divided into groups 0 to 5.
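A minimal PyTorch-style sketch of such conv+BN grouping is shown below; it is only one possible grouping rule, assumes a model whose modules are registered in input-to-output order, and is not presented as the grouping procedure prescribed by this application.

```python
import torch.nn as nn

def auto_group_parameters(model: nn.Module):
    """Sketch: group parameters from input to output, attaching each batch-normalization
    layer to the group of the layer that precedes it (typically a convolution); any other
    parameterized layer forms a group of its own."""
    groups = []
    for module in model.modules():
        params = list(module.parameters(recurse=False))   # only this module's own parameters
        if not params:
            continue
        if isinstance(module, nn.BatchNorm2d) and groups:
            groups[-1] += params          # conv + following BN share one group
        else:
            groups.append(params)         # conv layers, linear layers, ... start a new group
    return groups   # groups[0] is closest to the input, groups[-1] closest to the output
```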
S603: Obtain a sampling probability distribution and a training iteration step arrangement, where the sampling probability distribution is used to represent the probability that each of the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement includes the interval arrangement and the periodic arrangement.
Specifically, the training iteration step arrangement is determined first. In the neural network training method of the embodiments of this application, full-network training is performed on some iteration steps; full-network training means that gradients are computed and parameters are updated for every group of parameters in that iteration step. In neural network training, the training data are used to minimize the loss function so as to determine the values of the neural network parameters. Minimizing the loss function requires finding its extremum; the gradient of a vector field points in the direction in which the function value rises fastest, so the opposite direction is the direction in which it falls fastest. Therefore, by computing the gradient of the loss function (that is, the partial derivatives with respect to all parameters) and updating the parameters in the opposite direction, the loss function quickly reaches a minimum after iterations. Some iteration steps are sampled, so it is necessary to determine which iteration steps are sampled. The training iteration step arrangement includes the interval arrangement and the periodic arrangement. The interval arrangement means that, among multiple training iteration steps, one or more sampled training iteration steps are determined at every certain interval; for example, they can be determined as follows: determine the full-network training probability p_0, multiply the number of iteration steps to be trained by p_0 to obtain the number of full-network training steps, and then distribute the full-network training steps evenly among the iteration steps to be trained, where p_0 is a preset value in the range (0, 1). Table 1 shows some examples of determining the training iteration step arrangement from the full-network training probability:
Table 1

|       | P0=0 | P0=0.1 | P0=0.2 | P0=0.3 | P0=0.4 | P0=0.5 | P0=0.6 | P0=0.7 | P0=0.8 | P0=0.9 | P0=1 |
|-------|------|--------|--------|--------|--------|--------|--------|--------|--------|--------|------|
| Step0 | ◆ | ◆ | ◆ | ◆ | ◆ | ◆ | ◇ | ◇ | ◇ | ◇ | ◇ |
| Step1 | ◆ | ◆ | ◆ | ◆ | ◇ | ◇ | ◆ | ◇ | ◇ | ◇ | ◇ |
| Step2 | ◆ | ◆ | ◆ | ◇ | ◆ | ◆ | ◇ | ◆ | ◇ | ◇ | ◇ |
| Step3 | ◆ | ◆ | ◆ | ◆ | ◇ | ◇ | ◆ | ◇ | ◇ | ◇ | ◇ |
| Step4 | ◆ | ◆ | ◇ | ◆ | ◆ | ◆ | ◇ | ◇ | ◆ | ◇ | ◇ |
| Step5 | ◆ | ◆ | ◆ | ◇ | ◇ | ◇ | ◆ | ◆ | ◇ | ◇ | ◇ |
| Step6 | ◆ | ◆ | ◆ | ◆ | ◆ | ◆ | ◇ | ◇ | ◇ | ◇ | ◇ |
| Step7 | ◆ | ◆ | ◆ | ◆ | ◇ | ◇ | ◆ | ◇ | ◇ | ◇ | ◇ |
| Step8 | ◆ | ◆ | ◆ | ◇ | ◆ | ◆ | ◇ | ◆ | ◇ | ◇ | ◇ |
| Step9 | ◆ | ◇ | ◇ | ◆ | ◆ | ◇ | ◇ | ◇ | ◆ | ◆ | ◇ |
Table 1 takes 10 iteration steps, step0 to step9, as an example, where ◆ indicates that the iteration step will be sampled and ◇ indicates that it will not be sampled, i.e., full-network training is performed on that step. For example, when p_0 = 0, the number of full-network training steps is 0, so step0 to step9 are all sampled; when p_0 = 0.3, the number of full-network training steps is 3 and these three steps are distributed evenly among the 10 steps, so step2, step5, and step8 are full-network training steps, while step0, step1, step3, step4, step6, step7, and step9 are the steps to be sampled. It should be understood that determining the training iteration step arrangement from the full-network training probability as in Table 1 is only an example of the interval arrangement in the embodiments of this application and does not constitute a limitation on the embodiments of this application.
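The even spreading just described can be sketched as follows; the exact positions chosen in Table 1 may differ slightly from this sketch, which only illustrates the idea of distributing round(num_steps * p0) full-network steps evenly.

```python
def interval_schedule(num_steps: int = 10, p0: float = 0.3):
    """Sketch: mark round(num_steps * p0) steps as full-network training and
    leave every other step as a sampled (to-be-frozen) step, as in Table 1."""
    num_full = round(num_steps * p0)
    if num_full == 0:
        return ["sampled"] * num_steps
    stride = num_steps / num_full
    full_idx = {int(i * stride) for i in range(num_full)}
    return ["full" if t in full_idx else "sampled" for t in range(num_steps)]
```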
The periodic arrangement treats multiple training iteration steps as one cycle: first, the number of iteration steps to be sampled is determined to be M-1; then the cycle is determined from the number of steps to be sampled and a certain ratio, where one cycle consists of the steps to be sampled and the full-network training steps, the ratio is the proportion of the steps to be sampled in the cycle, and the steps to be sampled are the last M-1 steps of the cycle. For example, the full-network training probability p_0 can be determined, where p_0 may be a preset value, and the above ratio is then 1-p_0. FIG. 8 uses the interval arrangement: iteration steps 0, 2, and 4 are full-network training, and iteration steps 1, 3, and 5 are the sampled iteration steps.
After the iteration steps to be sampled are determined according to the training iteration step arrangement, the parameter group to be sampled is determined for each of those steps. The neural network training method of the embodiments of this application uses the sampling probability distribution to represent the probability that each of the M groups of parameters is sampled in each training iteration step, that is, the m-th group sampled in a given iteration step is determined according to the sampling probability distribution, where m is less than or equal to M-1; once the sampled m-th group is determined for an iteration step, the m-th group through the first group of parameters are all treated in the same way.
S604: Freeze or stop-update the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement.
Here, the processing includes freezing and stop-updating, where freezing means that no gradient is computed and no parameter update is performed for the m-th group through the first group of parameters, and stop-updating means that gradients are computed for the m-th group through the first group of parameters but no parameter update is performed. For the corresponding sampling probability distribution and freezing/stop-updating probability distribution formulas, reference may be made to the description of FIG. 2 above. For example, in FIG. 8, when p_0 is 0.5, multiple sampling probability distribution curves can be obtained from the sampling probability distribution formula, with the horizontal axis denoting the parameter group and the vertical axis the probability of being sampled. One of the curves is selected to sample iteration steps 1, 3, and 5, yielding the training-iteration-step sampling distribution shown in FIG. 8: the parameter group sampled at iteration step 1 is group 2, so groups 0 to 2 are frozen; the group sampled at iteration step 3 is group 0, so group 0 is frozen; and the group sampled at iteration step 5 is group 4, so groups 0 to 4 are frozen.
S605: Train the neural network to be trained according to the frozen or stop-updated parameter groups.
The training-iteration-step sampling distribution can be obtained from the sampling probability distribution and the training iteration step arrangement described above. Gradients are computed and parameters are updated for the parameter groups that are neither frozen nor stop-updated; no gradients are computed and no updates are performed for the frozen parameter groups; and only gradients are computed, without parameter updates, for the stop-updated parameter groups. The neural network to be trained is trained iteratively in this way.
It should be understood that the neural network training method of the embodiments of this application can be used for training a corresponding neural network, which may be the neural network shown in FIG. 1. The method can be applied to visual tasks such as object detection and image segmentation, as well as to non-visual tasks such as natural language processing and speech recognition.
Since one epoch means training the neural network model once over all the data of the training set, while one training iteration step means updating the parameters of the neural network model once, in some cases one epoch may contain ten thousand training iteration steps; controlling training at the iteration-step dimension therefore provides finer granularity than controlling it at the epoch dimension.
The neural network training method of the embodiments of this application handles the parameter groups of the neural network at the iteration-step dimension, achieving fine-grained control of the acceleration process and improving training accuracy while accelerating training. Sampling and processing the parameter groups through the training iteration step arrangement and the sampling probability distribution allows a more flexible trade-off between training cost and training accuracy; for example, the sampling probability can be determined from each group's share of the cost. The momentum shift problem is corrected: for frozen parameter groups, within one cycle the parameter groups of subsequent iteration steps do not rely on the parameters of previously frozen groups; for stop-updated parameter groups, gradients are still computed, so the parameters of the corresponding groups in subsequent iteration steps stay up to date.
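Putting S601 to S605 together, the following is a condensed PyTorch-style training-loop sketch; the optimizer and loss choices, the helper names, and the fact that the sampling distribution p_s covers only the first M-1 groups (the last group is never frozen) are assumptions of this sketch rather than requirements stated verbatim above.

```python
import random
import torch

def train(model, data_loader, groups, p_s, schedule, lr=0.1, momentum=0.9):
    """Sketch: on each sampled step draw a group index m from p_s and freeze
    groups 0..m for that step only; on full-network steps train every group."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    loss_fn = torch.nn.CrossEntropyLoss()
    for step, (x, y) in enumerate(data_loader):
        mode = schedule[step % len(schedule)]
        frozen = []
        if mode == "sampled":
            m = random.choices(range(len(p_s)), weights=p_s)[0]   # sampled group index, m <= M-2
            frozen = [p for g in groups[: m + 1] for p in g]      # groups 0..m
            for p in frozen:
                p.requires_grad_(False)                           # freeze: no gradient, no update
        opt.zero_grad(set_to_none=True)
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()                                                # params without gradients are skipped
        for p in frozen:                                          # unfreeze before the next step
            p.requires_grad_(True)
```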
The neural network training method of the embodiments of this application is described in detail below with reference to specific examples.
The method can be applied to accuracy verification of object recognition tasks on multiple networks using the ImageNet dataset. When a static-graph deep learning framework is used, the computation graph can be built with multiple backward paths to cooperate with the training method of the embodiments of this application for training acceleration, for example static-graph frameworks such as TensorFlow and MindSpore; when a dynamic-graph deep learning framework such as PyTorch is used, the backward pass can be truncated in each backward computation.
Step 1: input a neural network such as ResNet50, ResNet18, or MobileNetV2;
Step 2: group the parameters automatically, putting each convolution operator and its batch normalization (BN) operator into one group;
Step 3: select the uniform sampling probability distribution, which reduces the cost the most and means that every parameter group has the same probability of being sampled, and select the periodic arrangement. FIG. 9 shows training-iteration-step sampling with the periodic arrangement according to an embodiment of this application: within one cycle, full-network training is performed at iteration steps 0 to 4, iteration step 5 freezes the first group of parameters, iteration step 6 freezes the first two groups, iteration step 7 the first three groups, iteration step 8 the first four groups, and iteration step 9 the first five groups.
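The cycle of FIG. 9 can be generated as in the sketch below (the helper name and the use of p_0 to derive the number of full-network steps are assumptions of the sketch):

```python
def periodic_cycle_fig9(num_groups: int = 6, p0: float = 0.5):
    """Sketch of the cycle in FIG. 9: with uniform sampling and p0 = 0.5, a cycle of
    10 steps trains the whole network on the first 5 steps and then freezes the first
    1, 2, 3, 4, and 5 parameter groups on the last 5 steps (the 6th group is never frozen)."""
    sampled = num_groups - 1                  # 5 sampled steps per cycle
    full = round(sampled * p0 / (1 - p0))     # 5 full-network steps when p0 = 0.5
    # 0 means full-network training; k > 0 means "freeze the first k groups"
    return [0] * full + list(range(1, sampled + 1))
```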
Step 4: use the TensorFlow deep learning framework, build multiple paths in the computation graph, and start iterative training. The computation graph is shown in FIG. 10.
Training and evaluating different networks on the ImageNet dataset with the above training method gives the accuracy test results shown in the following table:
Table 2

| Network | Baseline (%) | Accuracy (%) |
|---|---|---|
| ResNet50 | 76.81 | 76.92 (+0.11) |
| ResNet34 | 74.43 | 74.38 (-0.05) |
| ResNet18 | 70.7 | 70.98 (+0.28) |
| ResNet101 | 78.84 | 78.85 (+0.01) |
| MobileNetV2 | 71.96 | 72.04 (+0.08) |
| Vgg16_bn | 73.82 | 73.55 (-0.27) |
| ResNeXt50 | 77.68 | 77.64 (-0.04) |
| DenseNet121 | 75.84 | 75.82 (-0.02) |
| AlexNet | 57.02 | 56.98 (-0.04) |
| InceptionV3 | 76.20 | 76.15 (-0.05) |
As Table 2 shows, the accuracy obtained by training different networks on the ImageNet dataset with the above training method is essentially on par with the baseline, while the training speed of the neural networks is increased by 20%.
The neural network training method of the embodiments of this application can also provide a certain regularization effect in recognition tasks, improving model accuracy to some extent while slightly reducing cost. Another training procedure using the neural network training method of the embodiments of this application is described below.
Step 1: input a neural network such as ResNet50, ResNet18, or MobileNetV2;
Step 2: group the parameters automatically, putting each convolution operator and its batch normalization (BN) operator into one group;
Step 3: select a linearly decreasing sampling probability distribution, so that the earlier a parameter group is, the more likely it is to be sampled, and select the interval arrangement for the training iteration steps. Gradients and momentum are still computed for the sampled parameters, which avoids momentum shift of the corresponding parameter groups in the next iteration step, but no parameter update is performed. Because the randomness of stochastic gradient descent (SGD) introduces a certain amount of noise, during normal training the useful signal carried by the gradients propagated from the loss to the front layers is already very weak, so the gradient signal-to-noise ratio in the front layers of the network is poor; randomly stopping the gradient updates of the front layers with the interval arrangement helps reduce the negative effect of this noise and provides a certain optimization benefit, thereby improving the accuracy of the trained neural network. FIG. 11 shows training-iteration-step sampling with the interval arrangement according to an embodiment of this application.
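One possible way to realize the "gradients and momentum are still computed, but no update is performed" behaviour in a PyTorch SGD setup is sketched below; it assumes the optimizer was built with one param group per parameter group in input-to-output order, and it is offered as an illustration rather than the implementation required by this application.

```python
import torch

def stop_update_step(model, x, y, optimizer, loss_fn, stop_group_idx):
    """Sketch of a stop-update iteration: gradients (and momentum) are computed for
    every group, but groups 0..stop_group_idx are not changed. Setting a param group's
    lr to 0 keeps torch.optim.SGD's momentum buffer up to date while leaving the
    weights untouched for this step."""
    saved = []
    for group in optimizer.param_groups[: stop_group_idx + 1]:
        saved.append(group["lr"])
        group["lr"] = 0.0
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(x), y)
    loss.backward()                      # gradients are computed for all groups
    optimizer.step()                     # lr=0 groups: momentum refreshed, weights unchanged
    for group, lr in zip(optimizer.param_groups, saved):
        group["lr"] = lr
    return loss.item()
```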
Training and evaluating different networks on the ImageNet dataset with the above training method gives the accuracy test results shown in the following table:
Table 3

| Network | Baseline (%) | Accuracy (%) |
|---|---|---|
| ResNet50 | 76.81 | 77.05 (+0.24) |
| ResNet34 | 74.43 | 76.60 (+0.07) |
| ResNet18 | 70.7 | 71.02 (+0.32) |
| ResNet101 | 78.84 | 79.10 (+0.26) |
| MobileNetV2 | 71.96 | 72.14 (+0.18) |
| Vgg16_bn | 73.82 | 74.05 (+0.23) |
| ResNeXt50 | 77.68 | 78.01 (+0.33) |
| DenseNet121 | 75.84 | 75.78 (-0.06) |
| AlexNet | 57.02 | 56.69 (-0.33) |
| InceptionV3 | 76.20 | 76.41 (-0.21) |
As Table 3 shows, the accuracy obtained by training different networks on the ImageNet dataset with the above training method is in most cases somewhat higher than the baseline.
Unlike existing regularization methods, the neural network training method of the embodiments of this application can improve accuracy through a suitable sampling probability distribution and parameter stop-updating while slightly reducing cost and without changing the structure of the user's network.
FIG. 12 shows a schematic flowchart of a data processing method provided by an embodiment of this application, including steps 1201 and 1202.
S1201: Obtain data to be processed.
S1202: Process the data to be processed with a target neural network, where the target neural network is obtained through training, and the training of the target neural network includes: obtaining a neural network to be trained; grouping the parameters of the neural network to be trained to obtain M groups of parameters, where M is a positive integer greater than or equal to 1; obtaining a sampling probability distribution and a training iteration step arrangement, where the sampling probability distribution is used to represent the probability that each of the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement includes an interval arrangement and a periodic arrangement; freezing or stop-updating the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement; and training the neural network to be trained according to the frozen or stop-updated parameter groups.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations shall not be considered beyond the scope of this application.
A person skilled in the art may clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
Claims (15)
- A neural network training method, comprising: obtaining a neural network to be trained; grouping the parameters of the neural network to be trained to obtain M groups of parameters, wherein M is a positive integer greater than or equal to 1; obtaining a sampling probability distribution and a training iteration step arrangement, wherein the sampling probability distribution is used to represent the probability that each group of parameters in the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement comprises an interval arrangement and a periodic arrangement; freezing or stop-updating the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement; and training the neural network to be trained according to the frozen parameter groups or the stop-updated parameter groups.
- The method according to claim 1, wherein the freezing or stop-updating the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement comprises: determining a first iteration step according to the training iteration step arrangement, wherein the first iteration step is an iteration step to be sampled; determining, according to the sampling probability distribution, an m-th group of parameters sampled in the first iteration step, wherein m is less than or equal to M-1; and freezing the m-th group of parameters through the first group of parameters in the first iteration step, wherein freezing the m-th group of parameters through the first group of parameters in the first iteration step means that no gradient calculation and no parameter update are performed for the m-th group of parameters through the first group of parameters.
- The method according to claim 1, wherein the freezing or stop-updating the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement comprises: determining a first iteration step according to the training iteration step arrangement, wherein the first iteration step is an iteration step to be sampled; determining, according to the sampling probability distribution, an m-th group of parameters sampled in the first iteration step, wherein m is less than or equal to M-1; and stop-updating the m-th group of parameters through the first group of parameters in the first iteration step, wherein stop-updating the m-th group of parameters through the first group of parameters in the first iteration step means that gradient calculation is performed for the m-th group of parameters through the first group of parameters but no parameter update is performed.
- The method according to claim 2 or 3, wherein when the training iteration step arrangement is the interval arrangement, the determining a first iteration step according to the training iteration step arrangement comprises: determining a first interval; and, among multiple training iteration steps, determining one or more first iteration steps at every first interval.
- The method according to claim 2 or 3, wherein when the training iteration step arrangement is the periodic arrangement, the determining a first iteration step according to the training iteration step arrangement comprises: determining that the number of first iteration steps is M-1; and determining a first cycle according to the number of first iteration steps and a first ratio, wherein the first cycle comprises the first iteration steps and full-network training iteration steps, the first ratio is the proportion of the first iteration steps in the first cycle, and the first iteration steps are the last M-1 iteration steps of the first cycle.
- A data processing method, comprising: obtaining data to be processed; and processing the data to be processed according to a target neural network, wherein the target neural network is obtained through training, and the training of the target neural network comprises: obtaining a neural network to be trained; grouping the parameters of the neural network to be trained to obtain M groups of parameters, wherein M is a positive integer greater than or equal to 1; obtaining a sampling probability distribution and a training iteration step arrangement, wherein the sampling probability distribution is used to represent the probability that each group of parameters in the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement comprises an interval arrangement and a periodic arrangement; freezing or stop-updating the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement; and training the neural network to be trained according to the frozen parameter groups or the stop-updated parameter groups.
- A neural network training apparatus, comprising: an obtaining module, configured to obtain a neural network to be trained; and a processing module, configured to group the parameters of the neural network to be trained to obtain M groups of parameters, wherein M is a positive integer greater than or equal to 1; wherein the obtaining module is further configured to obtain a sampling probability distribution and a training iteration step arrangement, the sampling probability distribution is used to represent the probability that each group of parameters in the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement comprises an interval arrangement and a periodic arrangement; and the processing module is further configured to freeze or stop-update the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement, and to train the neural network to be trained according to the frozen parameter groups or the stop-updated parameter groups.
- The apparatus according to claim 7, wherein the processing module freezes or stop-updates the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement by: determining a first iteration step according to the training iteration step arrangement, wherein the first iteration step is an iteration step to be sampled; determining, according to the sampling probability distribution, an m-th group of parameters sampled in the first iteration step, wherein m is less than or equal to M-1; and freezing the m-th group of parameters through the first group of parameters in the first iteration step, wherein freezing the m-th group of parameters through the first group of parameters in the first iteration step means that no gradient calculation and no parameter update are performed for the m-th group of parameters through the first group of parameters.
- The apparatus according to claim 7, wherein the processing module freezes or stop-updates the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement by: determining a first iteration step according to the training iteration step arrangement, wherein the first iteration step is an iteration step to be sampled; determining, according to the sampling probability distribution, an m-th group of parameters sampled in the first iteration step, wherein m is less than or equal to M-1; and stop-updating the m-th group of parameters through the first group of parameters in the first iteration step, wherein stop-updating the m-th group of parameters through the first group of parameters in the first iteration step means that gradient calculation is performed for the m-th group of parameters through the first group of parameters but no parameter update is performed.
- The apparatus according to claim 8 or 9, wherein when the training iteration step arrangement is the interval arrangement, the processing module determines the first iteration step according to the training iteration step arrangement by: determining a first interval; and, among multiple training iteration steps, determining one or more first iteration steps at every first interval.
- The apparatus according to claim 8 or 9, wherein when the training iteration step arrangement is the periodic arrangement, the processing module determines the first iteration step according to the training iteration step arrangement by: determining that the number of first iteration steps is M-1; and determining a first cycle according to the number of first iteration steps and a first ratio, wherein the first cycle comprises the first iteration steps and full-network training iteration steps, the first ratio is the proportion of the first iteration steps in the first cycle, and the first iteration steps are the last M-1 iteration steps of the first cycle.
- A data processing apparatus, comprising: an obtaining module, configured to obtain data to be processed; and a processing module, configured to process the data to be processed according to a target neural network, wherein the target neural network is obtained through training, and the training of the target neural network comprises: obtaining a neural network to be trained; grouping the parameters of the neural network to be trained to obtain M groups of parameters, wherein M is a positive integer greater than or equal to 1; obtaining a sampling probability distribution and a training iteration step arrangement, wherein the sampling probability distribution is used to represent the probability that each group of parameters in the M groups of parameters is sampled in each training iteration step, and the training iteration step arrangement comprises an interval arrangement and a periodic arrangement; freezing or stop-updating the sampled parameter groups according to the sampling probability distribution and the training iteration step arrangement; and training the neural network to be trained according to the frozen parameter groups or the stop-updated parameter groups.
- A chip, wherein the chip comprises a processor and a memory, the processor is coupled to the memory, the memory stores instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to perform the method according to any one of claims 1 to 5 or claim 6.
- A computer-readable storage medium, wherein the computer-readable medium stores program code to be executed by a device, and when the program code is executed by the device, the device performs the method according to any one of claims 1 to 5 or claim 6.
- A computer program product comprising instructions, wherein when the computer program product runs on a computer, the computer is caused to perform the method according to any one of claims 1 to 5 or claim 6.