WO2024060727A1

WO2024060727A1 - Method and apparatus for training neural network model, and device and system

Info

Publication number: WO2024060727A1
Application number: PCT/CN2023/101170
Authority: WO
Inventors: 潘一荣; 姚益武; 王兵
Original assignee: 华为技术有限公司
Priority date: 2022-09-20
Filing date: 2023-06-19
Publication date: 2024-03-28
Also published as: CN117787375A

Abstract

A method and apparatus for training a neural network model, and a device and a system, which are applied to a computing device for training a neural network model. The method comprises: during the process of performing quantization training on a neural network model, for the problem of an inaccurate gradient caused by quantization, a computing device changing a gradient compensation strategy according to a fluctuation value of a quantization error of a parameter, using an applicable gradient compensation strategy to correct the gradient, and updating a parameter of the neural network model on the basis of the gradient determined by the gradient compensation strategy, so as to obtain an optimized neural network model. Therefore, the accuracy of a gradient of a parameter of a neural network model is improved, and the precision of model training is ensured according to the precision of a parameter determined by the gradient.

Description

Training method, device, equipment and system for neural network model

This application requires the priority of the Chinese patent application submitted to the State Intellectual Property Office on September 20, 2022, with the application number 202211145916.7 and the application name "Neural Network Model Training Method, Device, Equipment and System", and its entire content has been approved This reference is incorporated into this application.

Technical field

The present application relates to the field of artificial intelligence technology, and in particular to a training method, device, equipment and system for a neural network model.

Background technique

Introducing quantize processing in the training process of the neural network model can reduce the consumption of storage and processing resources of the neural network model. However, the neural network model updates the model based on the quantized parameters, and the model convergence is poor, which causes a large loss of accuracy of the neural network model.

Contents of the invention

This application provides a training method, device, equipment and system for a neural network model, thereby solving the problems of poor model convergence and large accuracy loss of the neural network model caused by the quantification processing of the neural network model.

In a first aspect, a training method of a neural network model is provided, which is executed by a computing device that trains a neural network model. The method includes: the computing device trains the neural network model, and the parameters of the neural network model are quantified. Since the parameters The quantification leads to errors in model parameters. In the initial stage of neural network model training, the computing device adopts the first gradient compensation strategy to compensate for the gradient obtained after training, and counts the fluctuation value of the quantized error (quantize error) of the parameters of the neural network model. When the fluctuation value of the quantized error is less than When equal to the preset value, the second gradient compensation strategy is used to compensate the gradient obtained after training.

Therefore, the computing device uses the first gradient compensation strategy to update the parameters of the neural network model in the initial stage of model training when the parameters of the neural network model are unstable, and determines that the neural network model is in a state where the parameters are relatively stable based on the fluctuation value of the quantified error of the parameters. After the training stage, the first gradient compensation strategy is changed to the second gradient compensation strategy, so that for the neural network model in the training stage with different degrees of stability, the computing device can use the applicable gradient compensation strategy to determine the gradient of the neural network model, and the gradient Optimization is performed to improve the accuracy of the gradient of the parameters of the neural network model and the accuracy of the parameters determined based on the gradient, thereby ensuring the accuracy of model training.

Among them, quantization refers to using a quantization function to convert parameters from floating point values to integer values during the forward training process of the neural network model. The parameters of the neural network model can include the weight parameters output by each network layer included in the neural network model. , and/or the value of convolution or matrix multiplication with the weight parameter is the activation value. The quantization error refers to the difference between the floating point value of the parameters of the neural network model before quantization and the inverse quantization value. The inverse quantization value uses the inverse function of the quantization function to inverse quantize the integer value after parameter quantization (de- quantize) to get the floating point value.

In a possible implementation, the computing device may determine the gradient compensation strategy based on a comparison result between the fluctuation value of the quantization error of the parameter and the preset value.

For example, when the fluctuation value of the quantization error is greater than a preset value, the computing device changes the currently used gradient compensation strategy to the first gradient compensation strategy.

For another example, when the fluctuation value of the quantization error is less than or equal to the preset value, the computing device changes the currently used gradient compensation strategy to the second gradient compensation strategy.

Wherein, the first gradient compensation strategy may be an Element-wise Gradient Scaling (EWGS) strategy, the second gradient compensation strategy may be a multi-dimensional weight hybrid training strategy, and the multi-dimensional weight hybrid training strategy is used based on the parameters The quantized value and the inverse quantized value are used for gradient compensation.

During the training process of the neural network model, the quantization error fluctuates greatly when the model first starts training. The fluctuation of the quantization error will gradually decrease and become stable as the training progresses. In this embodiment, the quantization error of the parameter can be The training phase in which the fluctuation value is greater than the preset value is called the first training phase of the neural network model. The training phase in which the fluctuation value of the parameter quantification error is less than or equal to the preset value is called the first training phase of the neural network model. The training stage is called the second training stage of the neural network model.

In this embodiment, the computing device uses a multi-dimensional weight hybrid training strategy for gradient compensation in the second training stage when the quantization error fluctuation value is small, and uses parameters of a single-precision data type or parameter integer values of a half-precision data type for model training. It improves the transmission and calculation efficiency of parameters, and iterates faster than the element-level gradient scaling strategy. It improves the training efficiency while ensuring the training accuracy of the neural network model. In the first training stage when the quantization error fluctuates greatly, the computing device uses an element-level gradient scaling strategy for gradient compensation, adaptively enlarges or shrinks each gradient element, and uses the scaled gradient as the gradient output by the quantization function, through Back propagation is used to train the quantized network, which achieves higher precision gradient compensation compared to the multi-dimensional weight hybrid training strategy, thereby enabling the parameter to change when the quantization error fluctuates greatly, that is, the discrete error between the input and output of the quantization function. When the gradient mismatch is severe, the gradient accuracy of the parameters of the neural network model is guaranteed, thereby ensuring the accuracy of model training.

As a possible implementation, the quantization of the neural network model by the computing device is performed in the forward training of the model training. For example, the computing device quantifies the weight parameters and activation values in forward training, that is, the computing device obtains the integral value of each network layer based on the quantized values of the weight parameters and the quantified activation values of each network layer in the neural network model. type output result, and perform forward calculation after dequantizing the reshaped output result.

Optionally, the computing device can adopt different quantization methods for data distribution with different parameters.

For example, activation values are quantized using a sample-by-sample asymmetric uniform quantization method, and weight parameters are quantized using a channel-by-channel symmetric uniform quantization method. Therefore, the computing device quantizes the activation value using a sample-by-sample asymmetric uniform quantization method that has higher quantization accuracy than a channel-by-channel symmetric uniform quantization method. Since the sample-by-sample asymmetric uniform quantization method has no obvious accuracy advantage over the channel-by-channel symmetric uniform quantization method when quantizing weight parameters, the computing device uses a sample-by-sample asymmetric uniform quantization method that is more computationally efficient than the sample-by-sample asymmetric uniform quantization method. The channel symmetric and uniform quantization method quantizes the weight parameters. Therefore, the quantization method is adaptively adopted according to the data distribution of the parameters, which improves the quantification accuracy while ensuring the quantification efficiency.

As a possible implementation method, the computing device can periodically count the fluctuation values of the quantization error, and periodically use the gradient compensation strategy to compensate the gradient. That is, the quantization error of the last training refers to the statistical training of the previous cycle. quantization error, thereby reducing the computational overhead of quantization processing.

Optionally, the computing device calculates the fluctuation value of the quantization error every m training steps in the first training stage, and uses the first gradient compensation strategy to compensate the gradient of the neural network model. Among them, each time the neural network model completes one forward propagation (ie, forward training) and one back propagation (ie, reverse training), it is called a training step, and m is a positive integer.

Optionally, the computing device calculates the fluctuation value of the quantization error every M2/m training steps in the second training stage, and uses the second gradient compensation strategy to compensate the gradient of the neural network model. Among them, M2 is the total number of training steps in the second training stage. Therefore, the period of the fluctuation value of the statistical quantization error of the first gradient compensation strategy is smaller than the period of the fluctuation value of the statistical quantization error of the second gradient compensation strategy, thereby reducing the fluctuation value of the quantization error of the neural network model in the second training stage. The gradient compensation frequency of the neural network model not only ensures the training accuracy of the neural network model, but also improves the overall training efficiency of the neural network model.

A second aspect provides a training device for a neural network model. The device includes various modules for executing the training method for a neural network model in the first aspect or any possible implementation of the first aspect.

It should be noted that the training device for the neural network model described in the second aspect can be a terminal device or a network device, or a chip (system) or other parts or components that can be set in the terminal device or the network device, or a device that includes a terminal device or a network device, and this application does not limit this.

In addition, the technical effects of the neural network model training device described in the second aspect can be referred to the technical effects of the neural network model training method described in the first aspect, and will not be described again here.

In a third aspect, a computing device is provided, comprising a memory and a processor, wherein the memory is used to store a set of computer instructions, and when the processor executes the set of computer instructions, it is used to execute the operating steps of the training method of the neural network model in any possible design in the first aspect.

In addition, the technical effects of the computing device described in the third aspect can be referred to the technical effects of the training method of the neural network model described in the first aspect, which will not be described again here.

A fourth aspect provides a training system for a neural network model, including an execution device and the computing device described in the third aspect. The computing device is used to perform training of the neural network model in any possible design of the first aspect. The steps of the method are as follows: An optimized neural network model is obtained, and the execution device is used to apply the optimized neural network model.

In a fifth aspect, a computer-readable storage medium is provided, including: computer software instructions; when the computer software instructions are run in a data processing system, the computing device is caused to execute as described in any possible implementation manner in the first aspect. The steps of the method.

In a sixth aspect, a computer program product is provided. When the computer program product is run on a computer, it causes the computing device to perform the operation steps of the method described in any possible implementation manner in the first aspect.

Based on the implementation methods provided in the above aspects, this application can also be further combined to provide more implementation methods.

Description of the drawings

Figure 1 is a schematic structural diagram of a neural network provided by an embodiment of the present application;

Figure 2 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application;

Figure 3 is a schematic architectural diagram of a neural network model training system provided by an embodiment of the present application;

FIG4 is a schematic diagram of a training method for a neural network model provided in an embodiment of the present application;

Figure 5 is a schematic diagram of forward propagation provided by an embodiment of the present application;

Figure 6 is a schematic diagram of an element-level gradient scaling strategy provided by an embodiment of the present application;

Figure 7 is a schematic diagram of a multi-dimensional weight hybrid training strategy provided by an embodiment of the present application;

Figure 8 is a schematic diagram of a training device for a neural network model provided by an embodiment of the present application;

Figure 9 is a schematic structural diagram of a computing device provided by an embodiment of the present application.

Detailed ways

In order to facilitate understanding, the relevant terms involved in the embodiments of this application are first introduced below.

(1)Neural network

A neural network may be composed of neurons, and a neuron may refer to an operation unit with xs and intercept 1 as input. The output of the operation unit satisfies the following formula:

Among them, s=1, 2,...n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neuron. f is the activation function of the neuron, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neuron into an output signal. The output signal of the activation function can be used as the input of the next layer, and the activation function can be a sigmoid function. A neural network is a network formed by connecting multiple single neurons mentioned above, that is, the output of one neuron can be the input of another neuron. The input of each neuron can be connected to the local receptive field of the previous layer to extract the features of the local receptive field. The local receptive field can be an area composed of several neurons. Weights represent the strength of connections between different neurons. Weight determines the influence of input on output. A weight close to 0 means that changing the input does not change the output. Negative weights mean that increasing input decreases output.

As shown in Figure 1, it is a schematic structural diagram of a neural network provided by an embodiment of the present application. The neural network 100 includes N processing layers, where N is an integer greater than or equal to 3. The first layer of the neural network 100 is the input layer 110, which is responsible for receiving input signals. The last layer of the neural network 100 is the output layer 130, which is responsible for outputting the processing results of the neural network. The other layers except the first layer and the last layer are intermediate layers 140. These intermediate layers 140 together form a hidden layer 120. Each intermediate layer 140 in the hidden layer 120 can both receive input signals and output signals. The hidden layer 120 is responsible for the processing of the input signal. Each layer represents a logical level of signal processing. Through multiple layers, the data signal can be processed by multi-level logic.

In some feasible embodiments, the input signal of the neural network may be a video signal, a voice signal, a text signal, an image signal or a temperature signal, etc. in various forms. The voice signal can be various sensor signals such as human voice audio signals recorded by a microphone (sound sensor) such as speaking and singing. The input signals of the neural network also include various other computer-processable engineering signals, which will not be listed here. If a neural network is used to perform deep learning on image signals, the quality of images processed by the neural network can be improved.

(2) Convolutional neural network

Convolutional Neuron Network (CNN) is a deep neural network with a convolutional structure network. The convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers. The feature extractor can be regarded as a filter, and the convolution process can be regarded as using a trainable filter to convolve with an input image or feature map. The convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal. In the convolutional layer of a convolutional neural network, a neuron can be connected to only some of the neighboring layer neurons. A convolutional layer can output several feature maps, and the feature map can refer to the intermediate result during the operation of the convolutional neural network. Neurons in the same feature map share weights, and the shared weights here are convolution kernels. Shared weights can be understood as a way to extract image information independent of position. That is, the statistics of one part of the image are the same as those of other parts. This means that the image information learned in one part can also be used in another part. Therefore, the same learned image information can be used for all positions on the image. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.

The convolution kernel can be initialized in the form of a random-sized matrix. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.

For example, as shown in Figure 2, it is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application. The convolutional neural network 200 may include an input layer 210, a convolutional/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230.

The convolution layer/pooling layer 220 may include, for example, layers 221 to 226. In one example, layer 221 may be, for example, a convolution layer, layer 222 may be, for example, a pooling layer, layer 223 may be, for example, a convolution layer, layer 224 may be, for example, a pooling layer, layer 225 may be, for example, a convolution layer, and layer 226 may be, for example, a pooling layer. In another example, layers 221 and 222 may be, for example, convolution layers, layer 223 may be, for example, a pooling layer, layers 224 and 225 may be, for example, convolution layers, and layer 226 may be, for example, a pooling layer. The output of a convolution layer may be used as the input of a subsequent pooling layer, or as the input of another convolution layer to continue the convolution operation.

Taking convolutional layer 221 as an example, we will introduce the inner working principle of a convolutional layer.

The convolution layer 221 may include many convolution operators, and the convolution operators may also be called kernels. The role of the convolution operator in image processing is equivalent to a filter that extracts specific information from the input image matrix. The convolution operator can essentially be a weight matrix, which is usually predefined. The size of this weight matrix is related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image. During the convolution operation, the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix will produce a convolved output with a single depth dimension, but in most cases instead of using a single weight matrix, multiple weight matrices of the same size (rows × columns) are applied, That is, multiple matrices of the same type. The output of each weight matrix is stacked to form the depth dimension of the convolutional image. Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to remove unnecessary noise in the image. Perform blurring, etc. The multiple weight matrices have the same size (row × column), and the feature maps extracted by the multiple weight matrices with the same size are also the same size. The extracted multiple feature maps with the same size are then merged to form a convolution operation. output.

The weight values in these weight matrices require a large amount of training in practical applications. Each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, thereby allowing the convolutional neural network 200 to make correct predictions. .

When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (eg layer 221) often extracts more general features, which can also be called low-level features. As the depth of the convolutional neural network 200 deepens, the features extracted by the later convolutional layers (for example, layer 226) become more and more complex, such as high-level semantic features. Features with higher semantics are more suitable for Problems to be solved.

Since it is often necessary to reduce the number of training parameters, pooling layers are often introduced periodically after the convolutional layer. Each layer from layer 221 to layer 226 as shown in the convolutional layer/pooling layer 220 in Figure 2 can be a convolutional layer followed by a pooling layer, or multiple convolutional layers followed by a pooling layer. layer or multi-layer pooling layer. During image or audio processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image. The average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling. The max pooling operator can take the pixel with the largest value in a specific range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operation in the pooling layer Symbols should also be related to the size of the image. The size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer. Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.

After being processed by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not enough to output the required output information. Because as mentioned above, the convolutional layer/pooling layer 220 extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate an output or a set of required number of classes. Therefore, the neural network layer 230 may include multiple hidden layers (layer 231, layer 232 to layer 23n as shown in FIG. 2) and an output layer 240. The parameters included in the multiple hidden layers may be based on specific tasks. The relevant training data of the type can be pre-trained. For example, the task type can include image recognition, image classification, target recognition, etc.

After the multi-layer hidden layer in the neural network layer 230, that is, the last layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to classification cross entropy, specifically used to calculate the prediction error. The forward propagation of the entire convolutional neural network 200 (the propagation in the direction from layer 210 to layer 240 in Figure 2 is forward propagation) is completed, and the reverse propagation (the propagation in the direction from layer 240 to layer 210 in Figure 2 is back propagation) ) will start to update the weight values and deviations of each layer mentioned above to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.

It should be noted that the convolutional neural network 200 shown in Figure 2 is only an example of a convolutional neural network. In specific applications, the convolutional neural network can also exist in the form of other network models, such as U- Net, 3D Morphable Face Model (3DMM) and Residual Network (ResNet), etc. In addition, the methods provided by the embodiments of this application can also be applied to neural networks other than convolutional neural networks, such as Transformer models, Transformer-based bidirectional encoding (Bidirectional Encoder Representations from Transformer, BERT) models, etc.

(3)Loss function

In the process of training a deep neural network, because we hope that the output of the deep neural network is as close as possible to the value that we really want to predict, we can compare the predicted value of the current network with the really desired target value, and then based on the difference between the two to update the weight vector of each layer of the neural network according to the difference (of course, there is usually an initialization process before the first update, that is, preconfiguring parameters for each layer in the deep neural network). For example, if the predicted value of the network If it is too high, adjust the weight vector to make it predict lower, and continue to adjust until the deep neural network can predict the really desired target value or a value that is very close to the really desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function (loss function) or objective function (objective function), which is used to measure the difference between the predicted value and the target value. Important equations. Among them, taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference. Then the training of the deep neural network becomes a process of reducing this loss as much as possible.

The original meaning of gradient is a vector (vector), which means that the directional derivative of a certain function at this point reaches the maximum value along this direction, that is, the function changes fastest along this direction (the direction of this gradient) at this point. rate is the largest. When looking for the optimal parameters of each network layer during the training process of a deep neural network, it is necessary to determine the parameters that minimize the value of the loss function as much as possible. In order to find the place where the value of the loss function is as small as possible, it is necessary to calculate the gradient of the loss function relative to the parameters. That is, when the gradient vector is closer to 0, it means that the loss function reaches a minimum value point and the model accuracy reaches a maximum value point. .

(4)Back propagation algorithm

The convolutional neural network can use the error back propagation (BP) algorithm to modify the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forward propagation of the input signal until the output will produce an error loss, and parameter-based gradients update the parameters in the initial neural network model by backpropagating the error loss information, thereby making the error loss converge. The backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as weight parameters. The backpropagation method is a specific implementation of the gradient descent method on deep networks.

(5)Quantification

In mathematics and digital signal processing, quantization refers to the process of mapping input values from a large set (usually a continuous set) into a smaller set (usually with a finite number of elements). In the field of neural network models, model quantization means converting continuous values (or a large number of possible discrete values) into floating-point model weights or fixed-point approximations (usually int8) of tensor data flowing through the model at the expense of lower inference accuracy. ) is a process of finitely many (or fewer) discrete values, which is a data type with fewer digits. The model is used to approximate the process of representing 32-bit limited-range floating-point data, while the input and output of the model are still floating-point, thereby achieving the goals of reducing model size, reducing model memory consumption, and accelerating model inference speed.

However, since the backpropagation of model training requires the quantized integer value of the parameter to be dequantized into a floating point value for calculation, there is usually a certain error between the floating point value obtained by dequantization and the original floating point value of the parameter. , that is, the quantization error. This quantization error will cause a gradient mismatch problem in the process of determining the optimal parameters based on the gradient of the loss function, reducing the accuracy of the parameters of the neural network model determined by backpropagation, and bringing accuracy to the neural network model. loss.

Among them, dequantization is the process of dividing floating-point data by a scaling factor and mapping it to an integer value through a discretization operation, and then multiplying the integer value by the same scaling factor to convert it into a floating-point value.

In the quantitative training process of the existing neural network model, since the gradient value distribution in different network layers of the neural network model is inconsistent and the gradient values with smaller values occupy the majority, these gradient values are amplified and then quantized into discrete values. It will cause gradient mismatch problems. For example, a small gradient value is directly mapped to a 0 value, thereby losing the original information. Therefore, quantitative training of neural network models will cause serious loss of model accuracy.

Embodiments of the present application provide a training method for a neural network model, particularly a model training method that selects different gradient compensation strategies to update parameters according to the fluctuation value of the quantization error of the parameters of the neural network model, that is, the computing device quantifies the parameters. When the neural network model is trained, the first gradient compensation strategy is used to compensate the gradient obtained by model training in the initial stage when the quantization error of the parameters of the model training fluctuates greatly. When the quantization error of the parameters of the neural network model fluctuates The value is less than or equal to the preset value, and it is determined that when the model training enters the training stage where the fluctuation value of the parameter quantification error is small, the second gradient compensation strategy is used to compensate the gradient obtained by the model training. Therefore, for neural network models in training stages with different degrees of stability, the computing device adopts applicable gradient compensation strategies to optimize the neural network model, alleviate the gradient mismatch problem caused by quantization errors, and improve the accuracy of the gradient of the parameters of the neural network model. It uses more accurate gradients to update the parameters of the neural network model, ensuring the accuracy of model training.

The implementation of the embodiments of the present application will be described in detail below with reference to the accompanying drawings.

Figure 3 is a schematic architectural diagram of a neural network model training system provided by an embodiment of the present application. As shown in FIG. 3 , the training system 300 includes an execution device 310 , a training device 320 , a database 330 , a terminal device 340 , a data storage system 350 and a data collection device 360 .

The execution device 310 may be a terminal, such as a mobile phone terminal, a tablet computer, a laptop, a virtual reality (VR) device, an augmented reality (AR) device, a mixed reality (Mixed Reality, MR) device, an extended reality ( Extended Reality (ER) devices, cameras or vehicle-mounted terminals, etc., or edge devices (for example, boxes carrying chips with processing capabilities), etc.

The training device 320 may be a terminal or other computing device that supports integer calculation, such as a server or a cloud device.

As a possible embodiment, the execution device 310 and the training device 320 are different processors deployed on different physical devices (such as a server or a server in a cluster). For example, the execution device 310 can be a graphics processing unit (GPU), a central processing unit (CPU), other general-purpose processors, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor, etc. The training device 320 can be a graphics processing unit (GPU), a neural network processing unit (NPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the present application.

In another possible embodiment, the execution device 310 and the training device 320 are deployed on the same physical device, or the execution device 310 and the training device 320 are the same physical device.

The data collection device 360 is used to collect training data and store the training data in the database 330. The data collection device 360, the execution device 310 and the training device 320 may be the same or different devices. The training data includes data in at least one form of images, speech, text, etc.

The training device 320 is used to train the neural network using the training data until the loss function in the neural network converges and the neural network training is completed when the loss function value is less than a specific threshold, so that the neural network reaches a certain accuracy. The equipment 320 performs quantization training on the neural network model, quantizes the weight parameters and/or activation values during the forward propagation process of the neural network model, and then selects a gradient compensation strategy to determine the gradient of the parameters according to the fluctuation value of the quantization error of the parameters for the neural network model in the back propagation process, and updates the parameters of the neural network model based on the gradient to obtain the optimized neural network model. Alternatively, all the training data in the database 330 are used for training, and the neural network training is completed, so that the trained neural network has the target functions such as image recognition, image classification, and speech recognition. Furthermore, the training device 320 configures the trained neural network 301 to the execution device 310. The execution device 310 is used to realize the function of processing application data according to the trained neural network 301.

In some embodiments, the execution device 310 and the training device 320 are the same computing device. The computing device can configure the trained neural network 301 to itself, and use the trained neural network 301 to achieve target functions such as image recognition and speech recognition.

In other embodiments, the training device 320 may configure the trained neural network 301 to multiple execution devices 310 . Each execution device 310 uses the trained neural network 301 to implement the target function of the neural network model.

Combined with the training system 300, the image processing method provided in this embodiment can be applied to the training scenario of the neural network model. Specifically, the model training method of the embodiment of the present application can be applied in scenarios such as neural network model accelerated training scenarios and model low-bit quantization scenarios.

For example, for a neural network accelerated training scenario: when the training device 320 trains a neural network model for face recognition, due to the large amount of face photos contained in the training data, all full-scale face photos are used in the model training process. High-precision floating-point data consumes a lot of computing resources and time, and model training efficiency is low. Therefore, the training device 320 performs quantitative training on the neural network model, causing the neural network model to convert parameters into integer values for forward propagation calculations. During the back propagation process of the neural network model, the training device 320 uses the first gradient compensation strategy to perform gradient compensation in the initial stage of model training when the fluctuation value of the quantization error is large according to the fluctuation value of the quantization error before and after parameter quantization. In the stable phase of model training with smaller error fluctuations, the second gradient compensation strategy is used for gradient compensation, and then the parameters of the neural network model are updated based on the compensated gradient to obtain an optimized neural network model. Therefore, the training device 320 accelerates the training speed of the neural network model through quantitative training when the training data contains a large number of face photos, and selects applicable gradient compensation strategies at different stages of model training based on the fluctuation value of the quantization error. The parameters of the neural network model are updated, which alleviates the problem of reduced model accuracy caused by quantization errors and ensures the accuracy of the neural network model in the face recognition function.

It should be noted that in actual applications, the training data maintained in the database 330 may not necessarily come from the data collection device 360, but may also be received from other devices. In addition, the training device 320 does not necessarily train the neural network entirely based on the training data maintained by the database 330. It is also possible to obtain training data from the cloud or other places to train the neural network. The above description should not be used as a limitation on the embodiments of the present application.

Further, according to the functions performed by the execution device 310, the execution device 310 can be further subdivided into an architecture as shown in Figure 7. As shown in the figure, the execution device 310 is configured with a computing module 311, an I/O interface 312 and Preprocessing module 313.

The I/O interface 312 is used for data interaction with external devices. The user can input data to the I/O interface 312 through the terminal device 740. Additionally, input data may also come from database 330.

The preprocessing module 313 is used to perform preprocessing according to the input data received by the I/O interface 312 . In this embodiment of the present application, the preprocessing module 313 may be used to generate training data, such as a training set, a verification set, and a test set according to the input data received from the I/O interface 312.

When the execution device 310 preprocesses input data, or when the calculation module 311 of the execution device 310 performs calculations and other related processes, the execution device 310 can call data, codes, etc. in the data storage system 350 for corresponding processing. , the data and instructions obtained by corresponding processing can also be stored in the data storage system 350 .

Finally, the I/O interface 312 returns the processing result to the terminal device 340, thereby providing it to the user so that the user can view the processing result.

The terminal device 340 can also be used as a data collection terminal to collect the input data input to the I/O interface 312 and the processing results output from the I/O interface 312 as new sample data, and store them in the database 330. Of course, it is also possible to collect without going through the terminal device 340. Instead, the I/O interface 312 uses the input data input to the I/O interface 312 and the processing results output from the I/O interface 312 as new sample data as shown in the figure. Store in database 330.

Figure 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application. The devices, devices, modules, etc. shown in Figure 3 The positional relationship does not constitute any limitation. For example, in Figure 3, the data storage system 350 is an external memory relative to the execution device 310. In other cases, the data storage system 350 can also be placed in the execution device 310.

Next, please refer to Figure 4 for a detailed explanation of the training method of the neural network model. Here, the training device 320 in FIG. 3 is taken as an example for explanation.

Step 410: The training device 320 trains the neural network model, and compensates the gradient obtained after training using the first gradient compensation strategy.

The training device 320 performs forward propagation training on the neural network model, quantifies the parameters of the neural network model during the forward propagation process, and uses the first gradient compensation strategy to obtain the gradient of the neural network model after completing the forward propagation training. Make compensation.

For example, the first gradient compensation strategy is an element-level gradient scaling strategy, and the element-level gradient scaling strategy includes determining the accuracy of parameters using a backpropagation process of element-level gradient scaling quantification. The element-level gradient scaling strategy adaptively enlarges or shrinks each gradient element for the gradient output by the quantized neural network model, and uses the scaled gradient as the gradient output by the quantized function to train the quantized network through backpropagation. Scaling is performed based on the sign of each gradient element and the error between the continuous input and discrete output of the quantized function. For specific steps of updating the gradient using the element gradient scaling strategy by the training device 320, please refer to Figure 6 and related descriptions, which will not be described again here.

Optionally, the quantized parameters in the neural network model include activation values and/or weight parameters. Among them, the activation value refers to the value passed from the network layer to the next layer in the neural network model, which often appears in pairs with the weight parameters and performs convolution operations or matrix multiplication operations together with the weight parameters. For example, the activation value is the output value of the network layer after being processed by the activation function. For another example, the activation value is the value in the network layer that is not processed by the activation function and is input to the next network layer for convolution operations or matrix multiplication operations.

As a possible implementation, the training device 320 selects different gradient compensation strategies to correct the gradient value when the fluctuation values of the quantization error of the parameter belong to different numerical ranges. In step 410, the training device 320 uses the first gradient compensation strategy to compensate the gradient obtained after training in the initial stage when the quantization error of the parameters trained by the neural network model is large. Optionally, a large quantification error of the parameter means that the quantification error of the parameter is greater than the preset value. The specific value of the preset value can be flexibly adjusted according to the accuracy requirements of the neural network model, such as 0.5%, 0.8%, 1%. , 1.6%, etc.

In this embodiment, the training device 320 uses a sample-by-sample asymmetric uniform quantization method to quantize the activation values, and a channel-by-channel symmetric uniform quantization method is used to quantize the weight parameters. Among them, per-sample refers to operating each sample data separately in the same batch of training data, and per-channel (per-channel) refers to grouping parameters by channels, and operating on the entire data in each channel. Perform operations. For the specific steps of the sample-by-sample asymmetric uniform quantization method and the channel-by-channel symmetric uniform quantization method, please refer to the relevant description of parameter quantization in Figure 5 below, and will not be repeated here.

The above-mentioned sample-by-sample asymmetric uniform quantization method and channel-by-channel symmetric uniform quantization method are examples provided by the embodiments of the present application. The embodiments of the present application do not limit the quantization methods of activation values or weight parameters. For example, the quantization methods of activation values or weight parameters are also It can be one of the sample-by-sample symmetric uniform quantization method, the channel-by-channel asymmetric uniform quantization method, etc.

Based on the above quantization method of activation values and weight parameters, on the one hand, the training device 320 uses a sample-by-sample asymmetric uniform quantization method that has higher quantization accuracy than the channel-by-channel symmetric and uniform quantization method to quantize the activation value, ensuring that the activation value The quantization accuracy reduces the quantization error during the quantization process. On the other hand, since the sample-by-sample asymmetric uniform quantization method has no obvious accuracy advantage over the channel-by-channel symmetric uniform quantization method when quantizing weight parameters, the training device 320 adopts the channel-by-channel symmetric uniform quantization method with higher computational efficiency. The weight parameters are quantified, which improves the efficiency of parameter quantization. Therefore, the training device 320 in the embodiment of the present application adaptively adopts the quantization method according to the data distribution of the parameters, thereby improving the quantization accuracy while ensuring the quantization efficiency.

Step 420: The training device 320 determines the fluctuation value of the quantization error of the parameters of the neural network model.

The training device 320 first dequantizes the quantized integer value of the parameter to obtain a floating-point inverse quantization value, then calculates the quantization error of the parameter based on the difference between the inverse quantization value and the floating point value of the unquantized parameter, and finally The difference between the quantization errors of parameters with different training steps is used as the fluctuation value of the quantization error.

The specific steps of the specific calculation method of the above quantization error can be expressed by the following formula (2) to formula (4).

A _QE =A _Q *A _scale -A _{zero_point} *A _scale formula (3);
W _QE =W _Q *W _scale formula (4);

Among them, MSE (X _N ,X _QE ) represents the quantization error of the parameters, M represents the number of parameters to be quantified, and X _N represents the unquantified parameters. ized full-precision floating point value, X _QE represents the floating point value obtained by inverse quantization of the quantized value of the parameter, Represents the unquantized full-precision floating point value of the parameter corresponding to the i-th sample, Represents the floating point value obtained by inverse quantization of the quantized value of the parameter corresponding to the i-th sample. X represents the activation value or weight parameter, A _QE represents the floating point value obtained by inverse quantization of the quantized value of the activation value, A _Q represents the quantized value after quantization of the activation value, A _{zero_point} represents the integer zero point value of the overall activation value, A _scale Represents the scaling factor for the overall activation value. W _QE represents the floating point value obtained by inverse quantizing the quantized value of the weight parameter, W _Q represents the quantized value after quantizing the weight parameter, and W _scale represents the overall scaling factor of the weight parameter.

Optionally, in this embodiment, the training phase in which the fluctuation value of the quantization error of the parameter is greater than the preset value can be called the first training phase of the neural network model, and the training phase in which the fluctuation value of the quantization error of the parameter is less than or equal to the preset value can be called The training phase is called the second training phase of the neural network model.

The quantified error of the parameters obtained by the two trainings in step 420 may refer to the quantified error of the parameters after this training and the quantified error of the parameters after the last training. This training and the last training may be separated by one or more training steps. number. For example, the training device 320 calculates the fluctuation value of the quantization error every m training steps in the first training phase, and determines to maintain the first gradient compensation strategy or change the first gradient compensation strategy to the second one according to the fluctuation value of the quantization error. Gradient compensation strategy. Among them, each time the neural network model completes one forward propagation and one back propagation is called a training step, and m is a positive integer. As a result, the training device 320 can intermittently determine whether to start the gradient compensation strategy, avoid frequently determining whether to start or change the gradient compensation strategy during a period when the parameters of the neural network model are relatively stable, and reduce the consumption of computing resources of the training device 320 .

Step 430: When the fluctuation value of the quantization error is less than or equal to the preset value, the training device 320 changes the first gradient compensation strategy to the second gradient compensation strategy, and in subsequent training, uses the second gradient compensation strategy to compensate for the gradient obtained in the subsequent training.

When the fluctuation value of the quantization error is less than or equal to the preset value, the training device 320 determines that the training of the neural network model is in a relatively stable second training stage, changes the first gradient compensation strategy to the second gradient compensation strategy, and uses the second gradient compensation The policy compensates for gradients obtained in subsequent training.

For example, the second gradient compensation strategy is a multi-dimensional weight hybrid training strategy. The multi-dimensional weight hybrid training strategy refers to using quantized FP16 type or FP32 type parameters to perform matrix multiplication during the training process of the neural network model, and then converting the FP16 type or FP32 type parameters into inverse quantization values. According to The floating point value and inverse quantization value of the parameters before quantization are used to optimize the loss function, and then the gradient is reconstructed based on the loss function. To put it simply, the parameter values before and after quantization are used to optimize the loss function to make up for the lost accuracy. This can effectively reduce rounding errors in the calculation process and minimize the problem of precision loss. For specific steps of updating the gradient using the multi-dimensional weight hybrid training strategy by the training device 320, please refer to Figure 7 and related descriptions, which will not be described again here.

After determining the gradient using the second gradient compensation strategy, the training device 320 uses the gradient descent method to update the parameters of each network layer of the neural network model to obtain an optimized neural network model. The gradient descent method is a commonly used algorithm in the training process of neural network models and will not be described in detail here.

As a possible implementation, the training device 320 periodically counts the fluctuation values of the quantization period of the parameters during the entire process of model training, determines to start the gradient compensation strategy when the fluctuation value of the quantization error is greater than the startup threshold, and then determines to start the gradient compensation strategy according to the quantization The fluctuation value of the error selects to execute the first gradient compensation strategy or the second gradient compensation strategy, that is, the gradient compensation strategy is executed periodically. Therefore, the gradient compensation strategy of the neural network model may be changed from the first gradient compensation strategy to the second gradient compensation strategy, or from the second gradient compensation strategy to the first gradient compensation strategy. For example, the training device 320 calculates the fluctuation value of the quantization error every m training steps in the first training phase, and determines whether to maintain or change the gradient compensation strategy. The training device 320 calculates the fluctuation value of the quantization error every M2/m training steps in the second training phase. Calculate the fluctuation value of the quantization error once, and determine whether to maintain or change the gradient compensation strategy, where M2 is the total number of training steps in the second training stage. Therefore, compared with the first training stage, the training device 320 reduces the judgment and execution frequency of gradient compensation in the second training stage when the neural network model is relatively stable, and avoids frequently judging whether to start or not during the time period when the parameters of the neural network model are relatively stable. Or the gradient compensation strategy is changed to reduce the consumption of computing resources of the training device 320 .

Based on the above step 430, the training device 320 changes the gradient compensation strategy when the fluctuation values of the quantization error are different, and selects different gradient compensation strategies to update the parameters of the neural network model. Therefore, for neural network models in training stages with different degrees of stability, the training device 320 can use an applicable gradient compensation strategy to determine the gradient of the neural network model, and optimize the gradient value of the lack of accuracy caused by the quantization error, thereby improving the neural network model. The accuracy of the gradient of the parameters of the network model and the accuracy of the parameters determined based on the gradient ensure the accuracy of model training. In addition, the training device 320 does not need to introduce operators and The algorithm is adapted to low-precision integer operations, and there is no need to introduce learnable quantization parameters to minimize quantization errors, thereby reducing resource occupation of the training device 320 and improving model training efficiency.

Next, with reference to Figure 5, taking one of the multiple network layers included in the neural network model, that is, the second network layer, as an example, the parameter quantification and operation steps of the neural network model in the forward propagation in step 410 will be explained. , the quantizer, inverse quantizer, and low-precision integer calculation unit in Figure 5 are functional modules implemented by hardware or software in the training device 320.

Step 510: The quantizer quantizes the parameters and obtains the quantized integer value.

The quantizer receives the activation value from the previous network layer of the neural network model, that is, the first network layer, and quantizes the activation value and weight parameters to obtain the quantized integer value of the activation value and weight parameters. The parameters input by the first network layer to the second network layer are floating-point values, such as FP16 or FP32, and the quantized integer values may be low-precision integer values of the INT8 type.

Optionally, in addition to being obtained from the previous network layer, the activation value can also be the parameter value obtained by the quantizer from this network layer without being processed by the activation function.

In this embodiment, for the sake of simplicity of Figure 5 and text description, only one quantizer is shown. In the example, since the parameters include activation values and weight parameters, the training device 320 can use two quantizers to perform activation values and weights respectively. Quantification of parameters. For example, one quantizer uses sample-by-sample asymmetric uniform quantization to quantize activation values, and another quantizer uses channel-by-channel symmetric uniform quantization to quantize weight parameters. The steps of the sample-by-sample asymmetric uniform quantization method and the channel-by-channel symmetric uniform quantization method are described in detail below.

For the sample-by-sample asymmetric uniform quantization method, the training device 320 first inputs the training samples into the neural network model, performs operations on each training sample, obtains the floating-point activation value output by each network layer in the neural network model, and then counts each separately. The maximum and minimum values of the floating-point activation values are calculated. The scaling factor and integer zero value of each sample are calculated based on the statistical results. Finally, the activation value corresponding to each sample is calculated based on the scaling factor and integer zero value of each sample. Quantized integer value.

For the specific steps of the above-mentioned sample-by-sample asymmetric uniform quantization method, please refer to the following formula (5) to formula (7).

in, Indicates the activation value of the i-th sample ( _AN indicates the unquantized activation value, and _AQ indicates the quantized activation value). n indicates the number of quantization bits, for example, the value of n is 8 in the INT8 quantization scenario. and Represent the quantization scaling factor and integer zero value of the activation value of the i-th sample respectively. The Round function represents the quantization operation, and the Clip function represents the data truncation operation. An integer value representing the quantized activation value of the i-th sample.

For the channel-by-channel symmetric and uniform quantization method, the training device 320 first inputs the training samples into the neural network model, processes the training samples of each channel in batches, and obtains the floating-point weight parameters output by each network layer in the neural network model, and then respectively Count the maximum absolute value of the floating-point weight parameter of the training samples of each channel, calculate the scaling factor of the weight parameter of each channel based on the statistical results, and finally determine the quantized integer value of the weight parameter based on the scaling factor of the weight parameter. type value.

The specific steps of the above-mentioned channel-by-channel symmetric and uniform quantization method can refer to the following formula (8)-formula (9).

in, Represents the weight parameter corresponding to the j-th output dimension (W _N represents the unquantized weight parameter, W _Q represents the quantized weight parameter). n represents the number of quantization bits. For example, the n value is 8 in the INT8 quantization scenario. Represents the quantified scaling factor of the weight parameter corresponding to the j-th output dimension. The Round function represents the quantization operation, and the Clip function represents the data truncation operation. An integer value representing the quantized weight parameter of the j-th output dimension.

Step 520: The low-precision integer calculation unit performs operations on the integer values to obtain operation results.

The low-precision integer calculation unit performs matrix multiplication or convolution on the integer value of the activation value and the integer value of the weight parameter to obtain the operation result.

The specific methods of the above matrix multiplication operation or convolution operation are as follows:

Among them, Output _INT represents the result of integer convolution operation or matrix multiplication operation, W _Q represents the quantized weight parameter, A _Q represents the quantized activation value, Represents convolution calculation or matrix multiplication calculation.

Step 530: The dequantizer will dequantize the operation result to obtain the dequantized floating point value.

The dequantizer performs a dequantization operation on the output result of the integer convolution operation or the matrix multiplication operation to obtain a dequantized value to approximately represent the original floating-point calculation result. The dequantized value output by the dequantizer is the activation value of the second network layer input into the next network layer, namely the third network layer, or the value to continue calculation in the second network layer.

The specific operation method of the above inverse quantization is as follows:

Among them, Output _FP represents the convolution or matrix multiplication calculation result after inverse quantization, A _scale represents the overall scaling factor of the activation value, W _scale represents the overall scaling factor of the weight parameter, A _{zero_point} represents the integer zero point value of the overall activation value, W _N Represents the weight parameter of unquantized floating point type, A _N represents the activation value of unquantized floating point type.

The forward propagation of model training is explained above in conjunction with the data transmission direction of the first network layer -> second network layer -> third network layer in Figure 5. Next, the element-level gradient scaling is combined with Figures 6 and 7. The specific steps of the strategy or multi-dimensional weight hybrid training strategy are explained. Since the data propagation direction of back propagation in model training is opposite to that of forward propagation, the difference is that the training device 320 uses a gradient estimator to perform a gradient compensation strategy to determine the gradient at the quantization step of forward propagation, and no further calculation of the back propagation is performed here. The specific steps of the training device 320 to update the model parameters according to the gradient will be described in detail.

Please refer to Figure 6, which is a schematic diagram of an element-level gradient scaling strategy provided by an embodiment of the present application. In the back propagation of the neural network model, taking the training device 320 using the element-level gradient scaling strategy to determine the parameters in the second network layer as an example, the specific steps of the element-level gradient scaling strategy are as follows:

Step 610: The training device 320 obtains parameters.

The second network layer of the training device 320 may obtain parameters from the third network layer, and obtain parameters from the second network layer. Among them, the parameters obtained by the second network layer from the third network layer include the gradient value of the quantization function, and the parameters obtained by the second network layer from itself include activation values and weight parameters.

Step 620: The training device 320 scales the gradient of the quantization function according to the parameters to obtain the reconstructed gradient.

The training device 320 scales the gradient of the quantization function according to the activation value and weight parameter to obtain the reconstructed gradient of the activation value and weight parameter.

The specific algorithm for scaling the gradient of the quantized function by the training device 320 can refer to the following formula:

in, and Represent the gradient matrix respectively and The elements in , μ is the gradient scaling factor, μ ≥ 0, which can be set to a small constant (such as 10e-3), or an adaptive coefficient based on the second-order gradient estimation, x _n and x _q represent the parameters respectively. The elements in the quantized full-precision value X _N and the dequantized value X _Q.

Step 630: The training device 320 inputs the reconstruction gradient to the first network layer.

The training device 320 updates the weight parameter of the second network layer according to the reconstruction gradient of the weight parameter, and inputs the reconstruction gradient of the activation value into the first network layer. The training device 320 performs operations based on the same principle as step 610 and step 620 based on the reconstruction gradient of the activation value transmitted by the second network layer to obtain the reconstruction gradient of the first network layer. By analogy, each network of the entire neural network model is obtained. The reconstruction gradient of the layer.

Please refer to FIG. 7 , which is a schematic diagram of a multi-dimensional weight hybrid training strategy provided by an embodiment of the present application. In the back propagation of the neural network model, taking the training device 320 using the multi-dimensional weight hybrid training strategy to determine the parameters in the second network layer as an example, the specific steps of the multi-dimensional weight hybrid training strategy are as follows:

Step 710: The training device 320 determines the floating point value of the unquantized parameter of the second network layer and the inverse quantized value of the parameter.

Step 720: The training device 320 determines the optimized loss function based on the floating point value and the inverse quantization value.

Optionally, the specific steps for the training device 320 to determine the optimized loss function based on the floating point value and the inverse quantization value may refer to Formula (13)-Formula (14):
Loss(W _N ,ρ)=Loss((1-ρ)·W _N +ρ·W _QE ) Formula (13);

Among them, Loss(W _N ,ρ) represents the optimized loss function, ρ≥0, and its value gradually increases from 0 to 1 during the training process, W _N Represents the unquantized floating point value of the weight parameter, W _QE represents the inverse quantization value of the weight parameter, W _scale represents the overall scaling factor of the mass parameter, and the Round function represents the quantization operation.

Step 730: The training device 320 determines the reconstruction gradient according to the optimized loss function.

The specific steps of the training device 320 to determine the reconstruction gradient can refer to the following formula:

in, Represents the gradient of the optimized loss function for W _N , Indicates the gradient of the loss function before optimization relative to W _N. Since W _QE involves the quantization function Round, therefore It is always 0 and does not participate in the parameter update process, which can effectively avoid gradient mismatch problems.

Step 740: The training device 320 inputs the reconstruction gradient to the first network layer.

The training device 320 updates the weight parameters of the second network layer according to the reconstruction gradient of the weight parameters, and inputs the reconstruction gradient into the first network layer. The training device 320 performs operations based on the same principles as steps 710 to 730 based on the reconstruction gradient transmitted by the second network layer to obtain the reconstruction gradient of the first network layer. By analogy, the weight of each network layer of the entire neural network model is obtained. structural gradient.

The training method of the neural network model provided by this embodiment is described in detail above with reference to FIGS. 3 to 7 . Next, the training device of the neural network model provided by this embodiment will be described with reference to FIG. 8 .

FIG8 is a schematic diagram of a possible training device for a neural network model provided in this embodiment. The training device for a neural network model can be used to implement the functions of the execution device in the above method embodiment, and thus can also achieve the beneficial effects possessed by the above method embodiment. In this embodiment, the training device for the neural network model can be the training device 320 shown in FIG3 , or can be a module (such as a chip) applied to a server.

The neural network model training device 800 includes a compensation module 810 and a processing module 820 . The neural network model training device 800 is used to implement the functions of the training device 320 in the method embodiment shown in FIG. 4 .

The compensation module 810 is used to change the gradient compensation strategy according to the fluctuation value of the quantization error of the parameter, and use the gradient compensation strategy to compensate the gradient obtained by the neural network model training. For example, the compensation module 810 is used to perform steps 410 and 430 in FIG. 4 .

The calculation module 820 is used to determine the fluctuation value of the quantization error of the parameters of the neural network model. For example, the calculation module 820 is used to perform step 420 in FIG. 4 .

As a possible implementation, the first gradient compensation strategy includes an element-level gradient scaling strategy, and the second gradient compensation strategy includes a multi-dimensional weight hybrid training strategy.

As a possible implementation, the parameters include weight parameters or activation values.

As a possible implementation manner, the calculation module 820 is specifically configured to periodically count the fluctuation values of the quantization errors of the parameters of the neural network model.

As a possible implementation manner, the first period of the fluctuation value of the statistical quantization error of the first gradient compensation strategy is smaller than the second period of the fluctuation value of the statistical quantization error of the second gradient compensation strategy.

As a possible implementation, the number of training steps included in the second cycle is equal to the total number of training steps used to compensate the gradient obtained after training using the second gradient compensation strategy and the number of training steps included in the first cycle. business.

It should be understood that the training device 800 of the neural network model in the embodiment of the present application can be implemented by GPU, NPU, ASIC, or programmable logic device (PLD). The above PLD can be a complex program logic device (complex). programmable logical device (CPLD), field-programmable gate array (FPGA), general array logic (GAL) or any combination thereof. When the method shown in FIG. 4 can also be implemented through software, the neural network model training device 800 and its respective modules can also be software modules.

The neural network model training device 800 in the embodiment of the present application may correspond to executing the method described in the embodiment of the present application, and the above and other operations and/or functions of each unit in the neural network model training device 800 are respectively to implement the figure. The corresponding processes of each method in 4 will not be repeated here for the sake of brevity.

An embodiment of the present application also provides a computing device. Please refer to FIG. 9 . FIG. 9 is a schematic structural diagram of a computing device provided by an embodiment of the present application. Computing device 900 includes memory 901, processor 902, communication interface 903, and bus 904. Among them, the memory 901, the processor 902, and the communication interface 903 implement communication connections between each other through the bus 904.

Memory 901 may be a read-only memory, a static storage device, a dynamic storage device, or a random access memory. memory 901 may store computer instructions. When the computer instructions stored in the memory 901 are executed by the processor 902, the processor 902 and the communication interface 903 are used to execute steps in the image processing method of the software system. For example, the communication interface 903 is used to execute step 410 in the training method of the neural network model shown in Figure 4, and the function of the compensation module 810 in the training device 800 of the neural network model shown in Figure 8. The processor 902 uses In executing steps 420 and 430 in the training method of the neural network model shown in FIG. 4 , as well as the functions of the processing module 820 in the training device 800 of the neural network model shown in FIG. 8 . The memory can also store data sets. For example, a part of the storage resources in the memory 901 is divided into an area for storing programs that implement the functions of the neural network model in the embodiment of the present application.

The processor 902 can be a general CPU, an application specific integrated circuit (ASIC), a GPU or any combination thereof. Processor 902 may include one or more chips. Processor 902 may include an AI accelerator, such as an NPU.

The communication interface 903 uses a transceiver module such as but not limited to a transceiver to implement communication between the computing device 900 and other devices or communication networks. For example, the iterative training request, training data, and feedback of the iteratively trained neural network can be obtained through the communication interface 903.

Bus 904 may include a path that carries information between various components of computing device 900 (eg, memory 901, processor 902, communications interface 903).

The computing device 900 may be a computer (for example, a server) in a cloud data center, a computer in an edge data center, or a terminal.

The functionality of training device 320 may be deployed on each computing device 900. For example, a GPU is used to implement the functions of the training device 320.

For the functions of the training device 320 and the functions of the execution device 310 deployed within the same computing device 900, the training device 320 can communicate with the execution device 310 through the bus 904.

For the functions of the training device 320 and the functions of the execution device 310 deployed within different computing devices 900, the training device 320 may communicate with the execution device 310 through a communication network.

The method steps in this embodiment can be implemented by hardware or by a processor executing software instructions. Software instructions can be composed of corresponding software modules, and software modules can be stored in random access memory (random access memory, RAM), flash memory, read-only memory (read-only memory, ROM), programmable read-only memory (programmable ROM) , PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), register, hard disk, mobile hard disk, CD-ROM or other well-known in the art any other form of storage media. An exemplary storage medium is coupled to the processor such that the processor can read information from the storage medium and write information to the storage medium. Of course, the storage medium can also be an integral part of the processor. The processor and storage media may be located in an ASIC. Additionally, the ASIC can be located in the terminal device. Of course, the processor and the storage medium can also exist as discrete components in network equipment or terminal equipment.

In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on the computer, the processes or functions described in the embodiments of the present application are executed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, a network device, a user equipment, or other programmable device. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer program or instructions may be transmitted from a website, computer, A server or data center transmits via wired or wireless means to another website site, computer, server, or data center. The computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media, such as floppy disks, hard disks, and magnetic tapes; they may also be optical media, such as digital video discs (DVDs); they may also be semiconductor media, such as solid state drives (solid state drives). ,SSD). The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person familiar with the technical field can easily think of various equivalent methods within the technical scope disclosed in the present application. Modification or replacement, these modifications or replacements shall be covered by the protection scope of this application. Therefore, the protection scope of this application should be subject to the protection scope of the claims.

Claims

A training method for a neural network model, which is characterized by including:

Train a neural network model, quantify parameters of the neural network model, and compensate the gradient obtained after training with a first gradient compensation strategy;

Determine the fluctuation value of the quantization error of the parameters of the neural network model, where the fluctuation value is the difference between the quantization error of this training and the quantization error of the last training;

When the fluctuation value of the quantization error is less than or equal to the preset value, the first gradient compensation strategy is changed to the second gradient compensation strategy, and in subsequent training, the second gradient compensation strategy is used to The resulting gradient is compensated.
The method of claim 1, wherein the first gradient compensation strategy includes an element-level gradient scaling strategy, the second gradient compensation strategy includes a multi-dimensional weight hybrid training strategy, and the multi-dimensional weight hybrid training strategy Used to perform gradient compensation based on the quantized value and inverse quantized value of the parameter.
The method according to claim 1 or 2, characterized in that the parameters include weight parameters or activation values.
The method according to any one of claims 1 to 3, characterized in that determining the fluctuation value of the quantization error of the parameters of the neural network model comprises:

The fluctuation value of the quantization error of the parameters of the neural network model is periodically counted.
The method according to claim 4, characterized in that the first period of the fluctuation value of the statistical quantization error of the first gradient compensation strategy is smaller than the second period of the fluctuation value of the statistical quantization error of the second gradient compensation strategy.
The method of claim 5, wherein the number of training steps included in the second period is equal to the total number of training steps used to compensate the gradient obtained after training using the second gradient compensation strategy and the total number of training steps. is the quotient of the number of training steps included in the first cycle.
A training device for a neural network model, which is characterized by including:

The compensation module is used to train the neural network model, the parameters of the neural network model are quantified, and the gradient obtained after training is compensated with the first gradient compensation strategy;

A calculation module, used to determine the fluctuation value of the quantization error of the parameters of the neural network model, where the fluctuation value is the difference between the quantization error of this training and the quantization error of the last training;

The compensation module is also configured to change the first gradient compensation strategy to a second gradient compensation strategy when the fluctuation value of the quantization error is less than or equal to a preset value, and use the said gradient compensation strategy in subsequent training. The second gradient compensation strategy compensates for the gradient obtained after training.
The device according to claim 7, wherein the first gradient compensation strategy includes an element-level gradient scaling strategy, the second gradient compensation strategy includes a multi-dimensional weight hybrid training strategy, and the multi-dimensional weight hybrid training strategy Used to perform gradient compensation based on the quantized value and inverse quantized value of the parameter.
The device according to claim 7 or 8, characterized in that the parameters include weight parameters or activation values.
The device according to any one of claims 7-9, characterized in that the computing module is specifically used for:

The fluctuation value of the quantization error of the parameters of the neural network model is periodically counted.
The device according to claim 10, wherein the first period of the fluctuation value of the statistical quantization error of the first gradient compensation strategy is smaller than the second period of the fluctuation value of the statistical quantization error of the second gradient compensation strategy.
The device according to claim 11, wherein the number of training steps included in the second period is equal to the total number of training steps used to compensate the gradient obtained after training using the second gradient compensation strategy and the total number of training steps. is the quotient of the number of training steps included in the first cycle.
A computing device, characterized in that the computing device includes a memory and at least one processor, the memory is used to store a set of computer instructions; when the processor executes the set of computer instructions, the above claims are executed The steps of the method described in any one of 1-6.
A training system for neural network models, characterized in that the system includes an execution device and a computing device as claimed in claim 13, the computing device being used to execute the method as described in any one of claims 1-6. The operation steps are to train the neural network model to obtain the optimized neural network model, and the execution device is used to apply the optimized neural network model.