WO2024060727A1 - Method and apparatus for training neural network model, and device and system


Info

Publication number
WO2024060727A1
WO2024060727A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
neural network
gradient
network model
value
Prior art date
Application number
PCT/CN2023/101170
Other languages
French (fr)
Chinese (zh)
Inventor
潘一荣
姚益武
王兵
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2024060727A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular to a training method, device, equipment and system for a neural network model.
  • Introducing quantization processing into the training of a neural network model can reduce the model's consumption of storage and processing resources.
  • However, when the neural network model is updated based on quantized parameters, model convergence is poor, which causes a large loss of accuracy of the neural network model.
  • This application provides a training method, apparatus, device and system for a neural network model, thereby solving the problems of poor model convergence and large accuracy loss caused by quantization processing of the neural network model.
  • a training method of a neural network model is provided, which is executed by a computing device that trains a neural network model.
  • the method includes: the computing device trains the neural network model, and the parameters of the neural network model are quantized. Because quantization introduces errors into the model parameters,
  • the computing device adopts a first gradient compensation strategy to compensate the gradient obtained after training, and counts the fluctuation value of the quantization error (quantize error) of the parameters of the neural network model. When the fluctuation value of the quantization error is less than or equal to a preset value, a second gradient compensation strategy is used to compensate the gradient obtained after training.
  • In the initial stage of model training, when the parameters of the neural network model are unstable, the computing device uses the first gradient compensation strategy to update the parameters of the neural network model, and it determines from the fluctuation value of the quantization error of the parameters when the neural network model has reached a state in which the parameters are relatively stable.
  • At that point the first gradient compensation strategy is changed to the second gradient compensation strategy, so that for training stages with different degrees of stability, the computing device can use the applicable gradient compensation strategy to determine and optimize the gradient of the neural network model. This improves the accuracy of the gradients of the parameters of the neural network model and the accuracy of the parameters determined based on those gradients, thereby ensuring the accuracy of model training.
  • quantization refers to using a quantization function to convert parameters from floating-point values to integer values during the forward training process of the neural network model.
  • the parameters of the neural network model may include the weight parameters of each network layer included in the neural network model, and/or the activation values, that is, the values on which convolution or matrix multiplication is performed together with the weight parameters.
  • the quantization error refers to the difference between the floating-point value of a parameter before quantization and its dequantized value.
  • the dequantized value is the floating-point value obtained by applying the inverse of the quantization function to the quantized integer value of the parameter (de-quantize).
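  • As a minimal illustration of this quantization round trip and the resulting quantization error, the following Python sketch uses a hypothetical 8-bit uniform quantizer; the function names and the scale/zero-point derivation are illustrative assumptions, not the patent's specified implementation.

```python
import numpy as np

def quantize(x, scale, zero_point, n_bits=8):
    # Quantization function: divide by the scaling factor, round,
    # shift by the integer zero point, and clip to the INT range.
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2 ** n_bits - 1).astype(np.int32)

def dequantize(q, scale, zero_point):
    # Inverse of the quantization function: map integers back to floats.
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 8).astype(np.float32)   # unquantized floating-point parameters
scale = (x.max() - x.min()) / 255.0
zero_point = int(round(-x.min() / scale))

x_qe = dequantize(quantize(x, scale, zero_point), scale, zero_point)
quantization_error = float(np.mean((x - x_qe) ** 2))   # float value vs. dequantized value
```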
  • the computing device may determine the gradient compensation strategy based on a comparison result between the fluctuation value of the quantization error of the parameter and the preset value.
  • When the fluctuation value is greater than the preset value, the computing device changes the currently used gradient compensation strategy to the first gradient compensation strategy.
  • When the fluctuation value is less than or equal to the preset value, the computing device changes the currently used gradient compensation strategy to the second gradient compensation strategy.
  • the first gradient compensation strategy may be an element-wise gradient scaling (EWGS) strategy, and the second gradient compensation strategy may be a multi-dimensional weight hybrid training strategy. The multi-dimensional weight hybrid training strategy performs gradient compensation based on both the quantized values and the dequantized values of the parameters; a decision sketch follows below.
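  • The strategy selection described above reduces to a threshold comparison. A minimal sketch, assuming a hypothetical preset value and string labels for the two strategies:

```python
PRESET_VALUE = 0.01  # hypothetical threshold; the patent leaves the preset value configurable

def select_strategy(error_fluctuation, preset_value=PRESET_VALUE):
    # First training phase (unstable parameters): element-wise gradient scaling.
    # Second training phase (stable parameters): multi-dimensional weight hybrid training.
    if error_fluctuation > preset_value:
        return "ewgs"                       # first gradient compensation strategy
    return "multi_dim_weight_hybrid"        # second gradient compensation strategy
```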
  • the quantization error fluctuates greatly when the model first starts training.
  • the fluctuation of the quantization error will gradually decrease and become stable as the training progresses.
  • the training phase in which the fluctuation value of the quantization error of the parameters is greater than the preset value may be called the first training phase of the neural network model, and the training phase in which the fluctuation value of the quantization error of the parameters is less than or equal to the preset value may be called the second training phase of the neural network model.
  • In the second training stage, when the fluctuation value of the quantization error is small, the computing device uses the multi-dimensional weight hybrid training strategy for gradient compensation and uses parameters of a single-precision data type or integer parameter values of a half-precision data type for model training. This improves the transmission and calculation efficiency of the parameters, iterates faster than the element-wise gradient scaling strategy, and improves training efficiency while maintaining the training accuracy of the neural network model.
  • In the first training stage, the computing device uses the element-wise gradient scaling strategy for gradient compensation: it adaptively enlarges or shrinks each gradient element and uses the scaled gradient as the gradient output by the quantization function, training the quantized network through back propagation. This achieves higher-precision gradient compensation than the multi-dimensional weight hybrid training strategy, so that when the quantization error fluctuates greatly, that is, when the discretization error between the input and output of the quantization function causes severe gradient mismatch, the accuracy of the gradients of the parameters of the neural network model is still guaranteed, thereby ensuring the accuracy of model training.
  • the quantization of the neural network model by the computing device is performed in the forward training of the model training.
  • the computing device quantizes the weight parameters and activation values in forward training; that is, the computing device obtains the integer output result of each network layer based on the quantized weight parameters and quantized activation values of each network layer in the neural network model, and performs forward calculation after dequantizing the integer output result.
  • the computing device can adopt different quantization methods for parameters with different data distributions.
  • activation values are quantized using a sample-by-sample asymmetric uniform quantization method,
  • and weight parameters are quantized using a channel-by-channel symmetric uniform quantization method. The computing device quantizes the activation values with the sample-by-sample asymmetric uniform quantization method because it has higher quantization accuracy than the channel-by-channel symmetric uniform quantization method. Since the sample-by-sample asymmetric uniform quantization method has no obvious accuracy advantage over the channel-by-channel symmetric uniform quantization method when quantizing weight parameters, the computing device quantizes the weight parameters with the channel-by-channel symmetric uniform quantization method, which is more computationally efficient. The quantization method is thus adopted adaptively according to the data distribution of the parameters, improving quantization accuracy while ensuring quantization efficiency.
  • the computing device may periodically count the fluctuation values of the quantization error and periodically use the gradient compensation strategy to compensate the gradient; that is, the quantization error of the last training refers to the quantization error counted in the previous period, thereby reducing the computational overhead of the quantization processing.
  • the computing device calculates the fluctuation value of the quantization error every m training steps in the first training stage, and uses the first gradient compensation strategy to compensate the gradient of the neural network model.
  • Each time the neural network model completes one forward propagation (i.e., forward training) and one back propagation (i.e., reverse training) is called a training step, and m is a positive integer.
  • the computing device calculates the fluctuation value of the quantization error every M2/m training steps in the second training stage, and uses the second gradient compensation strategy to compensate the gradient of the neural network model.
  • M2 is the total number of training steps in the second training stage. The period for counting the fluctuation value of the quantization error under the first gradient compensation strategy is therefore smaller than under the second gradient compensation strategy, which reduces the frequency of counting the fluctuation value of the quantization error and of gradient compensation of the neural network model in the second training stage; this not only ensures the training accuracy of the neural network model but also improves its overall training efficiency.
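  • A minimal scheduling sketch of this periodic check, assuming hypothetical values for m and for the phase-2 total step count M2:

```python
def should_check_fluctuation(step, phase, m=100, M2=10_000):
    # Phase 1: count the quantization-error fluctuation every m training steps.
    # Phase 2: count it every M2/m steps (a longer period), reducing the
    # gradient-compensation bookkeeping once training has stabilised.
    period = m if phase == 1 else max(M2 // m, 1)
    return step % period == 0
```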
  • a second aspect provides a training device for a neural network model.
  • the device includes various modules for executing the training method for a neural network model in the first aspect or any possible implementation of the first aspect.
  • the training device for the neural network model described in the second aspect can be a terminal device or a network device, a chip (system) or other part or component that can be disposed in the terminal device or network device, or a device that includes a terminal device or network device; this application does not limit this.
  • the technical effects of the neural network model training device described in the second aspect can be referred to the technical effects of the neural network model training method described in the first aspect, and will not be described again here.
  • a third aspect provides a computing device, comprising a memory and a processor, wherein the memory is used to store a set of computer instructions, and the processor, when executing the set of computer instructions, performs the operating steps of the training method of the neural network model in any possible design of the first aspect.
  • a fourth aspect provides a training system for a neural network model, including an execution device and the computing device described in the third aspect.
  • the computing device is used to perform the steps of the training method of the neural network model in any possible design of the first aspect to obtain an optimized neural network model, and the execution device is used to apply the optimized neural network model.
  • a fifth aspect provides a computer-readable storage medium, including computer software instructions; when the computer software instructions are run in a data processing system, the computing device is caused to execute the steps of the method described in any possible implementation of the first aspect.
  • a sixth aspect provides a computer program product.
  • When the computer program product is run on a computer, it causes the computing device to perform the operation steps of the method described in any possible implementation of the first aspect.
  • Figure 1 is a schematic structural diagram of a neural network provided by an embodiment of the present application.
  • Figure 2 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application.
  • Figure 3 is a schematic architectural diagram of a neural network model training system provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of a training method for a neural network model provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of forward propagation provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of an element-level gradient scaling strategy provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of a multi-dimensional weight hybrid training strategy provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of a training device for a neural network model provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • a neural network may be composed of neurons, where a neuron may be an operation unit that takes x_s and an intercept of 1 as input.
  • the output of the operation unit satisfies the following formula: $h_{W,b}(x) = f(W^\top x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
  • where s = 1, 2, ..., n, and n is a natural number greater than 1
  • W_s is the weight of x_s
  • b is the bias of the neuron.
  • f is the activation function of the neuron, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neuron into an output signal.
  • the output signal of the activation function can be used as the input of the next layer, and the activation function can be a sigmoid function.
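  • A minimal sketch of this single-neuron computation with a sigmoid activation, using illustrative input and weight values:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation function f, converting the input signal to an output signal.
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Weighted sum of the inputs plus the bias, passed through the
    # activation f, as in the formula above.
    return sigmoid(np.dot(w, x) + b)

y = neuron(np.array([0.5, -1.2, 3.0]), np.array([0.8, 0.1, -0.4]), b=0.2)
```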
  • a neural network is a network formed by connecting multiple single neurons mentioned above, that is, the output of one neuron can be the input of another neuron.
  • the input of each neuron can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neurons. Weights represent the strength of the connections between different neurons and determine the influence of an input on the output: a weight close to 0 means that changing the input does not change the output, while a negative weight means that increasing the input decreases the output.
  • the neural network 100 includes N processing layers, where N is an integer greater than or equal to 3.
  • the first layer of the neural network 100 is the input layer 110, which is responsible for receiving input signals.
  • the last layer of the neural network 100 is the output layer 130, which is responsible for outputting the processing results of the neural network.
  • the other layers except the first layer and the last layer are intermediate layers 140. These intermediate layers 140 together form a hidden layer 120.
  • Each intermediate layer 140 in the hidden layer 120 can both receive input signals and output signals.
  • the hidden layer 120 is responsible for the processing of the input signal.
  • Each layer represents a logical level of signal processing. Through multiple layers, the data signal can be processed by multi-level logic.
  • the input signal of the neural network may be a video signal, a voice signal, a text signal, an image signal or a temperature signal, etc. in various forms.
  • the voice signal can be, for example, a human voice audio signal such as speech or singing recorded by a microphone (sound sensor), or various other sensor signals.
  • the input signals of the neural network also include various other computer-processable engineering signals, which will not be listed here. If a neural network is used to perform deep learning on image signals, the quality of images processed by the neural network can be improved.
  • A convolutional neural network (CNN) is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers.
  • the feature extractor can be regarded as a filter, and the convolution process can be regarded as using a trainable filter to convolve with an input image or feature map.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can be connected to only some of the neighboring layer neurons.
  • a convolutional layer can output several feature maps, and the feature map can refer to the intermediate result during the operation of the convolutional neural network.
  • Neurons in the same feature map share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as a way of extracting image information that is independent of position: the statistics of one part of an image are the same as those of other parts, so image information learned in one part can also be used in another part, and the same learned image information can be used at all positions on the image.
  • multiple convolution kernels can be used to extract different image information. Generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a random-sized matrix. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the convolutional neural network 200 may include an input layer 210, a convolutional/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230.
  • the convolution layer/pooling layer 220 may include, for example, layers 221 to 226.
  • layer 221 may be, for example, a convolution layer
  • layer 222 may be, for example, a pooling layer
  • layer 223 may be, for example, a convolution layer
  • layer 224 may be, for example, a pooling layer
  • layer 225 may be, for example, a convolution layer
  • layer 226 may be, for example, a pooling layer.
  • Alternatively, layers 221 and 222 may be, for example, convolution layers
  • layer 223 may be, for example, a pooling layer
  • layers 224 and 225 may be, for example, convolution layers
  • layer 226 may be, for example, a pooling layer.
  • the output of a convolution layer may be used as the input of a subsequent pooling layer, or as the input of another convolution layer to continue the convolution operation.
  • the convolution layer 221 may include many convolution operators, and the convolution operators may also be called kernels.
  • the role of the convolution operator in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined; the size of this weight matrix is related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends to the entire depth of the input image.
  • convolution with a single weight matrix produces a convolved output with a single depth dimension, but in most cases, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same type, are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image.
  • Different weight matrices can be used to extract different features of the image. For example, one weight matrix is used to extract edge information of the image, another is used to extract a specific color of the image, and yet another is used to blur unwanted noise in the image.
  • the multiple weight matrices have the same size (rows × columns), so the feature maps they extract also have the same size; the extracted feature maps of the same size are then merged to form the output of the convolution operation.
  • the weight values in these weight matrices need to be obtained through a large amount of training in practical applications.
  • Each weight matrix formed by the trained weight values can be used to extract information from the input image, thereby allowing the convolutional neural network 200 to make correct predictions.
  • the features extracted by the initial convolution layer (e.g., layer 221) are relatively simple, while the features extracted by the later convolution layers become more and more complex, such as high-level semantic features; features with higher-level semantics are more applicable to the problem to be solved.
  • pooling layers are often introduced periodically after the convolutional layer.
  • each layer from layer 221 to layer 226 in the convolutional layer/pooling layer 220 shown in Figure 2 can be one convolution layer followed by one pooling layer, or multiple convolution layers followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value in a specific range as the result of max pooling.
  • just as the size of the weight matrix used in a convolution layer should be related to the image size, the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer.
  • Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
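  • A minimal sketch of the average and max pooling operators over non-overlapping windows, assuming a single-channel image and an illustrative window size:

```python
import numpy as np

def pool2d(img, k=2, mode="max"):
    # Downsample an HxW image with non-overlapping k x k windows; each output
    # pixel is the maximum or average of the corresponding sub-region.
    h, w = img.shape[0] // k, img.shape[1] // k
    windows = img[: h * k, : w * k].reshape(h, k, w, k)
    return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

smaller = pool2d(np.arange(16.0).reshape(4, 4), k=2, mode="max")  # 4x4 -> 2x2
```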
  • After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 220 only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 200 uses the neural network layer 230 to generate the output of one class or of a group of required classes. Therefore, the neural network layer 230 may include multiple hidden layers (layer 231, layer 232 to layer 23n as shown in Figure 2) and an output layer 240; the parameters included in the multiple hidden layers may be pre-trained based on training data relevant to a specific task type, and the task type may include, for example, image recognition, image classification and target recognition.
  • the output layer 240 has a loss function similar to categorical cross-entropy, specifically used to calculate the prediction error.
  • once the forward propagation of the entire convolutional neural network 200 (propagation in the direction from layer 210 to layer 240 in Figure 2) is completed, back propagation (propagation in the direction from layer 240 to layer 210 in Figure 2) starts to update the weight values and biases of the layers mentioned above, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the network through the output layer and the ideal result.
  • the convolutional neural network 200 shown in Figure 2 is only an example of a convolutional neural network.
  • the convolutional neural network can also exist in the form of other network models, such as U-Net, the 3D Morphable Face Model (3DMM) and the Residual Network (ResNet).
  • the methods provided by the embodiments of this application can also be applied to neural networks other than convolutional neural networks, such as Transformer models, Transformer-based bidirectional encoding (Bidirectional Encoder Representations from Transformer, BERT) models, etc.
  • the original meaning of gradient is a vector, indicating that the directional derivative of a function at a given point reaches its maximum along that direction; that is, at that point the function changes fastest, and its rate of change is largest, along the direction of the gradient.
  • the convolutional neural network can use the error back propagation (BP) algorithm to modify the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller.
  • the input signal is propagated forward until the output produces an error loss,
  • and the parameters in the initial neural network model are updated based on parameter gradients by back-propagating the error-loss information, so that the error loss converges.
  • the back propagation algorithm is a back propagation process dominated by the error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight parameters.
  • the backpropagation method is a specific implementation of the gradient descent method on deep networks.
  • quantization refers to the process of mapping input values from a large set (usually a continuous set) into a smaller set (usually with a finite number of elements).
  • model quantization is the process of converting the floating-point model weights, or the tensor data flowing through the model, which take continuous values (or a large number of possible discrete values), into a finite number of (or fewer) discrete values, at the cost of some inference accuracy; that is, a data type with fewer bits (usually INT8) is used to approximately represent 32-bit limited-range floating-point data, while the input and output of the model remain floating-point, thereby reducing the model size, reducing the model's memory consumption, and accelerating model inference.
  • in this process, quantization divides the floating-point data by a scaling factor and maps it to an integer value through a discretization operation, and dequantization multiplies the integer value by the same scaling factor to convert it back into a floating-point value.
  • Embodiments of the present application provide a training method for a neural network model, in particular a model training method that selects different gradient compensation strategies to update the parameters according to the fluctuation value of the quantization error of the parameters of the neural network model. That is, the computing device quantizes the parameters,
  • uses the first gradient compensation strategy to compensate the gradient obtained by model training in the initial stage, when the quantization error of the parameters fluctuates greatly,
  • and, when the fluctuation value of the quantization error of the parameters of the neural network model is less than or equal to the preset value and it is determined that model training has entered the training stage in which the fluctuation value of the parameter quantization error is small, uses the second gradient compensation strategy to compensate the gradient obtained by model training.
  • In this way, the computing device adopts applicable gradient compensation strategies to optimize the neural network model, alleviating the gradient mismatch problem caused by quantization errors and improving the accuracy of the gradients of the parameters of the neural network model; it uses more accurate gradients to update the parameters of the neural network model, ensuring the accuracy of model training.
  • Figure 3 is a schematic architectural diagram of a neural network model training system provided by an embodiment of the present application.
  • the training system 300 includes an execution device 310 , a training device 320 , a database 330 , a terminal device 340 , a data storage system 350 and a data collection device 360 .
  • the execution device 310 may be a terminal, such as a mobile phone terminal, a tablet computer, a laptop, a virtual reality (VR) device, an augmented reality (AR) device, a mixed reality (MR) device, an extended reality (ER) device, a camera or a vehicle-mounted terminal, or an edge device (for example, a box carrying a chip with processing capability).
  • the training device 320 may be a terminal or other computing device that supports integer calculation, such as a server or a cloud device.
  • the execution device 310 and the training device 320 are different processors deployed on different physical devices (such as a server or a server in a cluster).
  • the execution device 310 can be a graphics processing unit (GPU), a central processing unit (CPU), other general-purpose processors, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or any conventional processor, etc.
  • the training device 320 can be a graphics processing unit (GPU), a neural network processing unit (NPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the present application.
  • the execution device 310 and the training device 320 are deployed on the same physical device, or the execution device 310 and the training device 320 are the same physical device.
  • the data collection device 360 is used to collect training data and store the training data in the database 330.
  • the data collection device 360, the execution device 310 and the training device 320 may be the same or different devices.
  • the training data includes data in at least one form of images, speech, text, etc.
  • the training device 320 is used to train the neural network using the training data until the loss function in the neural network converges; when the value of the loss function is less than a specific threshold, the neural network training is completed, so that the neural network reaches a certain accuracy.
  • the training device 320 performs quantization training on the neural network model: it quantizes the weight parameters and/or activation values during the forward propagation of the neural network model, then, during back propagation, selects a gradient compensation strategy according to the fluctuation value of the quantization error of the parameters to determine the gradients of the parameters, and updates the parameters of the neural network model based on the gradients to obtain the optimized neural network model.
  • the training device 320 configures the trained neural network 301 to the execution device 310.
  • the execution device 310 is used to realize the function of processing application data according to the trained neural network 301.
  • the execution device 310 and the training device 320 are the same computing device.
  • the computing device can configure the trained neural network 301 to itself, and use the trained neural network 301 to achieve target functions such as image recognition and speech recognition.
  • the training device 320 may configure the trained neural network 301 to multiple execution devices 310 .
  • Each execution device 310 uses the trained neural network 301 to implement the target function of the neural network model.
  • the method provided in this embodiment can be applied to the training scenario of the neural network model.
  • the model training method of the embodiments of the present application can be applied in scenarios such as accelerated training of neural network models and low-bit model quantization.
  • For example, when the training device 320 trains a neural network model for face recognition, the training data contains a large number of face photos; using full high-precision floating-point data throughout model training consumes a lot of computing resources and time, and model training efficiency is low. The training device 320 therefore performs quantization training on the neural network model, causing it to convert parameters into integer values for the forward-propagation calculations. During back propagation, based on the fluctuation value of the quantization error before and after parameter quantization, the training device 320 uses the first gradient compensation strategy for gradient compensation in the initial stage of model training, when the fluctuation value of the quantization error is large.
  • When the fluctuation value is small, the second gradient compensation strategy is used for gradient compensation, and the parameters of the neural network model are then updated based on the compensated gradients to obtain an optimized neural network model. In this way, when the training data contains a large number of face photos, the training device 320 accelerates training through quantization training and selects the applicable gradient compensation strategy at different stages of model training based on the fluctuation value of the quantization error; this alleviates the loss of model accuracy caused by quantization errors and ensures the accuracy of the neural network model for face recognition.
  • the training data maintained in the database 330 may not necessarily come from the data collection device 360, but may also be received from other devices.
  • the training device 320 does not necessarily train the neural network entirely based on the training data maintained in the database 330; it may also obtain training data from the cloud or elsewhere. The above description should not be taken as a limitation on the embodiments of the present application.
  • the execution device 310 can be further subdivided into the architecture shown in Figure 3: the execution device 310 is configured with a computing module 311, an I/O interface 312 and a preprocessing module 313.
  • the I/O interface 312 is used for data interaction with external devices.
  • the user can input data to the I/O interface 312 through the terminal device 340. Additionally, the input data may also come from the database 330.
  • the preprocessing module 313 is used to perform preprocessing according to the input data received by the I/O interface 312 .
  • the preprocessing module 313 may be used to generate training data, such as a training set, a verification set, and a test set according to the input data received from the I/O interface 312.
  • when the execution device 310 preprocesses the input data, or when the computing module 311 of the execution device 310 performs calculation and other related processing, the execution device 310 can call data, code, etc. in the data storage system 350 for the corresponding processing, and the data and instructions obtained by that processing can also be stored in the data storage system 350.
  • the I/O interface 312 returns the processing result to the terminal device 340, thereby providing it to the user so that the user can view the processing result.
  • the terminal device 340 can also be used as a data collection terminal to collect the input data input to the I/O interface 312 and the processing results output from the I/O interface 312 as new sample data, and store them in the database 330.
  • Alternatively, the I/O interface 312 directly stores the input data input to the I/O interface 312 and the processing results output from the I/O interface 312 in the database 330 as new sample data, as shown in the figure.
  • Figure 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the positional relationship between the devices, components, modules, etc. shown in Figure 3 does not constitute any limitation.
  • the data storage system 350 is an external memory relative to the execution device 310. In other cases, the data storage system 350 can also be placed in the execution device 310.
  • Step 410 The training device 320 trains the neural network model, and compensates the gradient obtained after training using the first gradient compensation strategy.
  • the training device 320 performs forward propagation training on the neural network model, quantizes the parameters of the neural network model during forward propagation, and, after completing the forward propagation training, uses the first gradient compensation strategy to compensate the obtained gradient of the neural network model.
  • the first gradient compensation strategy is an element-level gradient scaling strategy
  • the element-level gradient scaling strategy determines the gradients of the parameters through a back-propagation process that applies element-level gradient scaling to the quantized network.
  • the element-level gradient scaling strategy adaptively enlarges or shrinks each gradient element of the gradient output by the quantized neural network model and uses the scaled gradient as the gradient output by the quantization function, training the quantized network through back propagation; the scaling is performed based on the sign of each gradient element and the error between the continuous input and the discrete output of the quantization function.
  • the quantized parameters in the neural network model include activation values and/or weight parameters.
  • the activation value refers to the value passed from one network layer to the next in the neural network model; it often appears in pairs with the weight parameters and participates in convolution or matrix multiplication operations together with them.
  • the activation value is the output value of the network layer after being processed by the activation function.
  • Alternatively, the activation value is the value in the network layer that has not been processed by the activation function and is input to the next network layer for convolution or matrix multiplication operations.
  • the training device 320 selects different gradient compensation strategies to correct the gradient value when the fluctuation values of the quantization error of the parameter belong to different numerical ranges.
  • the training device 320 uses the first gradient compensation strategy to compensate the gradient obtained after training in the initial stage when the quantization error of the parameters trained by the neural network model is large.
  • a large quantization error of a parameter means that the quantization error of the parameter is greater than the preset value.
  • the specific preset value can be flexibly adjusted according to the accuracy requirements of the neural network model, for example 0.5%, 0.8%, 1% or 1.6%.
  • the training device 320 uses a sample-by-sample asymmetric uniform quantization method to quantize the activation values, and a channel-by-channel symmetric uniform quantization method is used to quantize the weight parameters.
  • per-sample means operating on each sample separately within the same batch of training data,
  • and per-channel means grouping the parameters by channel and operating on all the data in each channel.
  • the sample-by-sample asymmetric uniform quantization method and the channel-by-channel symmetric uniform quantization method are examples provided by the embodiments of the present application.
  • the embodiments of the present application do not limit the quantization methods for activation values or weight parameters;
  • the quantization method for activation values or weight parameters may also be, for example, a sample-by-sample symmetric uniform quantization method or a channel-by-channel asymmetric uniform quantization method.
  • the training device 320 quantizes the activation values with the sample-by-sample asymmetric uniform quantization method, which has higher quantization accuracy than the channel-by-channel symmetric uniform quantization method, ensuring the quantization accuracy of the activation values and reducing the quantization error introduced during quantization.
  • the training device 320 adopts the channel-by-channel symmetric uniform quantization method, which has higher computational efficiency,
  • to quantize the weight parameters, which improves the efficiency of parameter quantization. The training device 320 in the embodiments of the present application thus adapts the quantization method to the data distribution of the parameters, improving quantization accuracy while ensuring quantization efficiency.
  • Step 420 The training device 320 determines the fluctuation value of the quantization error of the parameters of the neural network model.
  • the training device 320 first dequantizes the quantized integer values of the parameters to obtain floating-point dequantized values, then calculates the quantization error of the parameters from the difference between the dequantized values and the floating-point values of the unquantized parameters, and finally uses the difference between the quantization errors at different training steps as the fluctuation value of the quantization error.
  • the quantization error can be calculated as a mean squared error: $\mathrm{MSE}(X_N, X_{QE}) = \frac{1}{M}\sum_{i=1}^{M}\left(X_N^{(i)} - X_{QE}^{(i)}\right)^2$, where $\mathrm{MSE}(X_N, X_{QE})$ represents the quantization error of the parameters, $M$ represents the number of parameters to be quantized, $X_N$ represents the unquantized parameters, $X_{QE}$ represents the floating-point values obtained by dequantizing the quantized values of the parameters, and $X$ represents the activation values or weight parameters.
  • for activation values, the dequantized value is $a_{QE} = (a_Q - a_{zero\_point}) \times a_{scale}$, where $a_{QE}$ represents the floating-point value obtained by dequantizing the quantized activation value, $a_Q$ represents the quantized value of the activation value, $a_{zero\_point}$ represents the integer zero-point value of the overall activation values, and $a_{scale}$ represents the scaling factor of the overall activation values.
  • for weight parameters, the dequantized value is $W_{QE} = W_Q \times W_{scale}$, where $W_{QE}$ represents the floating-point value obtained by dequantizing the quantized value of the weight parameter, $W_Q$ represents the quantized value of the weight parameter, and $W_{scale}$ represents the overall scaling factor of the weight parameters.
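  • A minimal sketch of these two quantities, mirroring the MSE definition and the step-to-step fluctuation above:

```python
import numpy as np

def quantization_error(x_n, x_qe):
    # MSE(X_N, X_QE): mean squared difference between the unquantized
    # floating-point parameters X_N and their dequantized values X_QE.
    return float(np.mean((x_n - x_qe) ** 2))

def fluctuation_value(err_current, err_previous):
    # Fluctuation value: difference between the quantization errors of two
    # trainings, which may be separated by one or more training steps.
    return abs(err_current - err_previous)
```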
  • the training phase in which the fluctuation value of the quantization error of the parameters is greater than the preset value can be called the first training phase of the neural network model, and the training phase in which the fluctuation value of the quantization error of the parameters is less than or equal to the preset value can be called the second training phase of the neural network model.
  • the quantization errors of the parameters obtained from two trainings in step 420 may refer to the quantization error of the parameters after the current training and the quantization error of the parameters after the previous training.
  • the current training and the previous training may be separated by one or more training steps.
  • the training device 320 calculates the fluctuation value of the quantization error every m training steps in the first training phase and, according to the fluctuation value of the quantization error, determines whether to maintain the first gradient compensation strategy or change the first gradient compensation strategy to the second gradient compensation strategy.
  • Each time the neural network model completes one forward propagation and one back propagation is called a training step, and m is a positive integer.
  • the training device 320 can intermittently determine whether to start the gradient compensation strategy, avoid frequently determining whether to start or change the gradient compensation strategy during a period when the parameters of the neural network model are relatively stable, and reduce the consumption of computing resources of the training device 320 .
  • Step 430 When the fluctuation value of the quantization error is less than or equal to the preset value, the training device 320 changes the first gradient compensation strategy to the second gradient compensation strategy, and in subsequent training, uses the second gradient compensation strategy to compensate for the gradient obtained in the subsequent training.
  • the training device 320 determines that the training of the neural network model is in the relatively stable second training stage, changes the first gradient compensation strategy to the second gradient compensation strategy, and uses the second gradient compensation strategy to compensate the gradients obtained in subsequent training.
  • the second gradient compensation strategy is a multi-dimensional weight hybrid training strategy.
  • the multi-dimensional weight hybrid training strategy refers to using quantized FP16-type or FP32-type parameters to perform matrix multiplication during the training of the neural network model, and then converting the FP16-type or FP32-type parameters into dequantized values.
  • the floating-point values of the parameters before quantization and the dequantized values are used to optimize the loss function, and the gradient is then reconstructed based on the loss function.
  • using the parameter values from before and after quantization to optimize the loss function makes up for the lost accuracy, which can effectively reduce rounding errors in the calculation and minimize the problem of precision loss; a hedged sketch follows below.
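  • The patent does not specify the exact form in which the floating-point and dequantized parameter values are combined in the loss; the following sketch assumes, purely for illustration, a mean-squared penalty between the two added to the task loss, with a hypothetical weighting factor lam:

```python
import numpy as np

def hybrid_loss(task_loss, w_float, w_dequant, lam=0.1):
    # Hypothetical combination: task loss plus a penalty that pulls the
    # floating-point weights and their dequantized values together,
    # compensating the accuracy lost to rounding during quantization.
    return task_loss + lam * float(np.mean((w_float - w_dequant) ** 2))
```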
  • the training device 320 uses the gradient descent method to update the parameters of each network layer of the neural network model to obtain an optimized neural network model.
  • the gradient descent method is a commonly used algorithm in the training process of neural network models and will not be described in detail here.
  • the training device 320 periodically counts the fluctuation values of the quantization error of the parameters throughout model training, determines to start the gradient compensation strategy when the fluctuation value of the quantization error is greater than a startup threshold, and then, according to the fluctuation value
  • of the quantization error, selects whether to execute the first gradient compensation strategy or the second gradient compensation strategy; that is, the gradient compensation strategy is executed periodically. The gradient compensation strategy of the neural network model may therefore be changed from the first gradient compensation strategy to the second gradient compensation strategy, or from the second gradient compensation strategy to the first gradient compensation strategy.
  • the training device 320 calculates the fluctuation value of the quantization error every m training steps in the first training phase, and determines whether to maintain or change the gradient compensation strategy.
  • the training device 320 calculates the fluctuation value of the quantization error every M2/m training steps in the second training phase, and determines whether to maintain or change the gradient compensation strategy, where M2 is the total number of training steps in the second training stage. Compared with the first training stage, the training device 320 thus reduces the frequency of gradient compensation judgments and executions in the second training stage, when the neural network model is relatively stable, avoiding frequent judgments of whether to start or change the gradient compensation strategy during periods when the parameters of the neural network model are relatively stable, and reducing the consumption of computing resources of the training device 320.
  • the training device 320 changes the gradient compensation strategy when the fluctuation values of the quantization error differ, selecting different gradient compensation strategies to update the parameters of the neural network model. For neural network models in training stages with different degrees of stability, the training device 320 can therefore use the applicable gradient compensation strategy to determine the gradient of the neural network model and optimize gradient values whose accuracy is degraded by the quantization error, improving
  • the accuracy of the gradients of the parameters of the neural network model and the accuracy of the parameters determined based on those gradients, ensuring the accuracy of model training.
  • In addition, the training device 320 does not need to introduce operators or
  • adapt the algorithm to low-precision integer operations, and there is no need to introduce learnable quantization parameters to minimize quantization errors, thereby reducing the resource occupation of the training device 320 and improving model training efficiency.
  • the quantizer, inverse quantizer, and low-precision integer calculation unit in Figure 5 are functional modules implemented by hardware or software in the training device 320.
  • Step 510 The quantizer quantizes the parameters and obtains the quantized integer value.
  • the quantizer receives the activation value from the previous network layer of the neural network model, that is, the first network layer, and quantizes the activation value and weight parameters to obtain the quantized integer value of the activation value and weight parameters.
  • the parameters input by the first network layer to the second network layer are floating-point values, such as FP16 or FP32, and the quantized integer values may be low-precision integer values of the INT8 type.
  • the activation value can also be the parameter value obtained by the quantizer from this network layer without being processed by the activation function.
  • the training device 320 can use two quantizers to quantize the activation values and the weight parameters respectively. For example, one quantizer uses sample-by-sample asymmetric uniform quantization to quantize the activation values, and the other quantizer uses channel-by-channel symmetric uniform quantization to quantize the weight parameters. The steps of the sample-by-sample asymmetric uniform quantization method and the channel-by-channel symmetric uniform quantization method are described in detail below.
  • For the sample-by-sample asymmetric uniform quantization method, the training device 320 first inputs the training samples into the neural network model and performs operations on each training sample to obtain the floating-point activation values output by each network layer in the neural network model; it then separately counts the maximum and minimum of the floating-point activation values of each sample, calculates the scaling factor and integer zero-point value of each sample from the statistical results, and finally calculates the quantized integer value of the activation value corresponding to each sample based on that sample's scaling factor and integer zero-point value.
  • the quantized integer value of the activation value of the i-th sample can be expressed as $A_Q^{(i)} = \mathrm{Clip}\left(\mathrm{Round}\left(\frac{A_N^{(i)}}{a_{scale}^{(i)}}\right) + a_{zero\_point}^{(i)},\ 0,\ 2^{n}-1\right)$, where $A_N$ indicates the unquantized activation value, $A_Q$ indicates the quantized activation value, $n$ indicates the number of quantization bits (for example, n is 8 in the INT8 quantization scenario), $a_{scale}^{(i)}$ and $a_{zero\_point}^{(i)}$ represent the quantization scaling factor and the integer zero-point value of the activation value of the i-th sample respectively, the Round function represents the rounding (quantization) operation, the Clip function represents the data truncation operation, and $A_Q^{(i)}$ represents the quantized integer value of the activation value of the i-th sample.
  • For the channel-by-channel symmetric uniform quantization method, the training device 320 first inputs the training samples into the neural network model and processes the training samples of each channel in batches to obtain the floating-point weight parameters output by each network layer in the neural network model; it then separately counts the maximum absolute value of the floating-point weight parameters of the training samples of each channel, calculates the scaling factor of the weight parameters of each channel from the statistical results, and finally determines the quantized integer value of the weight parameters based on the scaling factor of the weight parameters.
  • the quantized integer value of the weight parameters can be expressed as $W_Q = \mathrm{Clip}\left(\mathrm{Round}\left(\frac{W_N}{W_{scale}}\right),\ -2^{n-1},\ 2^{n-1}-1\right)$, where $n$ represents the number of quantization bits (for example, n is 8 in the INT8 quantization scenario), the Round function represents the rounding (quantization) operation, and the Clip function represents the data truncation operation.
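  • A minimal sketch of the two quantizers, assuming NumPy arrays with the batch dimension first for activations and the output-channel dimension first for weights; the function names are illustrative:

```python
import numpy as np

def quantize_activations_per_sample(a_n, n_bits=8):
    # Sample-by-sample asymmetric uniform quantization: each sample gets its
    # own scaling factor and integer zero point from its min/max statistics.
    a_q = np.empty(a_n.shape, dtype=np.int32)
    scales, zero_points = [], []
    for i, a in enumerate(a_n):                       # a_n: [batch, ...]
        scale = (a.max() - a.min()) / (2 ** n_bits - 1)
        zp = int(np.round(-a.min() / scale))
        scales.append(scale)
        zero_points.append(zp)
        a_q[i] = np.clip(np.round(a / scale) + zp, 0, 2 ** n_bits - 1)
    return a_q, np.array(scales), np.array(zero_points)

def quantize_weights_per_channel(w_n, n_bits=8):
    # Channel-by-channel symmetric uniform quantization: one scaling factor per
    # output channel from the channel's maximum absolute value; no zero point.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w_n).max(axis=tuple(range(1, w_n.ndim)), keepdims=True) / qmax
    w_q = np.clip(np.round(w_n / scale), -qmax - 1, qmax).astype(np.int32)
    return w_q, scale
```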
  • Step 520 The low-precision integer calculation unit performs operations on the integer values to obtain operation results.
  • the low-precision integer calculation unit performs matrix multiplication or convolution on the integer value of the activation value and the integer value of the weight parameter to obtain the operation result.
  • the integer operation result can be expressed as $\mathrm{Output}_{INT} = a_Q \ast W_Q$, where $\ast$ denotes the convolution or matrix multiplication operation, $\mathrm{Output}_{INT}$ represents the result of the integer convolution or matrix multiplication operation, $W_Q$ represents the quantized weight parameters, and $a_Q$ represents the quantized activation values.
  • Step 530: The dequantizer dequantizes the operation result to obtain the dequantized floating-point value.
  • the dequantizer performs a dequantization operation on the output result of the integer convolution operation or the matrix multiplication operation to obtain a dequantized value to approximately represent the original floating-point calculation result.
  • the dequantized value output by the dequantizer is the activation value of the second network layer input into the next network layer, namely the third network layer, or the value to continue calculation in the second network layer.
  • combining the two scaling factors, the dequantization can be written as $Output_{FP} = a_{scale} \cdot W_{scale} \cdot \big(Output_{INT} - a_{zero\_point} \cdot \textstyle\sum W_Q\big) \approx W_N \ast a_N$, where the sum over $W_Q$ runs along the reduction dimension of the convolution or matrix multiplication, as sketched below
  • Output_FP represents the convolution or matrix multiplication result after dequantization
  • a_scale represents the overall scaling factor of the activation value
  • W_scale represents the overall scaling factor of the weight parameter
  • a_zero_point represents the integer zero point of the overall activation value
  • W_N represents the unquantized floating-point weight parameter
  • a_N represents the unquantized floating-point activation value
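A hedged sketch of the dequantization step for the matrix multiplication case, with scalar overall scales as in the variable list above; the zero-point correction term is a standard algebraic identity, assumed here rather than quoted from the application:

```python
import numpy as np

def dequantize_output(out_int: np.ndarray, w_q: np.ndarray,
                      a_scale: float, w_scale: float,
                      a_zero_point: float) -> np.ndarray:
    """Map the integer result back to an approximate float result.

    out_int = w_q @ a_q, with w_q of shape (out, in) and a_q of shape
    (in, batch). The asymmetric activation zero point contributes
    a_zero_point * row-sums of w_q, subtracted before rescaling.
    """
    correction = a_zero_point * w_q.astype(np.int64).sum(axis=1, keepdims=True)
    return a_scale * w_scale * (out_int - correction)
```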
  • the forward propagation of model training is explained above in conjunction with the data transmission direction of the first network layer -> second network layer -> third network layer in Figure 5.
  • the specific steps of the element-level gradient scaling strategy and of the multi-dimensional weight hybrid training strategy are explained below in conjunction with Figures 6 and 7. The data propagation direction of back propagation in model training is opposite to that of forward propagation; the difference is that the training device 320 uses a gradient estimator to apply a gradient compensation strategy to determine the gradient at the quantization step of forward propagation, and the remainder of the back propagation calculation is not elaborated here.
  • the specific steps of the training device 320 to update the model parameters according to the gradient will be described in detail.
  • Figure 6 is a schematic diagram of an element-level gradient scaling strategy provided by an embodiment of the present application.
  • the specific steps of the element-level gradient scaling strategy are as follows:
  • Step 610 The training device 320 obtains parameters.
  • the second network layer of the training device 320 may obtain parameters from the third network layer and from the second network layer itself.
  • the parameters obtained by the second network layer from the third network layer include the gradient value of the quantization function, and the parameters obtained by the second network layer from itself include activation values and weight parameters.
  • Step 620 The training device 320 scales the gradient of the quantization function according to the parameters to obtain the reconstructed gradient.
  • the training device 320 scales the gradient of the quantization function according to the activation value and weight parameter to obtain the reconstructed gradient of the activation value and weight parameter.
  • the specific algorithm used by the training device 320 to scale the gradient of the quantization function can refer to the following formula: $g_{x_n} = g_{x_q} \cdot \big(1 + \delta \cdot \mathrm{sign}(g_{x_q}) \cdot (x_n - x_q)\big)$
  • δ is the gradient scaling factor, δ ≥ 0, which can be set to a small constant (such as 10e-3) or to an adaptive coefficient based on second-order gradient estimation; x_n and x_q represent the parameter before and after quantization, respectively, and $g_{x_q}$ is the gradient propagated back to the output of the quantization function
  • Step 630 The training device 320 inputs the reconstruction gradient to the first network layer.
  • the training device 320 updates the weight parameter of the second network layer according to the reconstruction gradient of the weight parameter, and inputs the reconstruction gradient of the activation value into the first network layer.
  • the training device 320 performs operations on the reconstruction gradient of the activation value transmitted by the second network layer, following the same principle as steps 610 and 620, to obtain the reconstruction gradient of the first network layer.
  • by analogy, the reconstruction gradient of each network layer of the entire neural network model is obtained, as illustrated in the sketch below.
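The following is a minimal sketch of the element-wise gradient scaling rule given above; the adaptive second-order variant of δ is omitted, and the function name is an assumption for illustration.

```python
import numpy as np

def ewgs_backward(grad_xq: np.ndarray, x_n: np.ndarray,
                  x_q: np.ndarray,
                  delta: float = 10e-3) -> np.ndarray:
    """Element-wise gradient scaling: each gradient element is adaptively
    enlarged or shrunk according to the sign of the outgoing gradient and
    the quantization residual (x_n - x_q). delta defaults to the small
    constant suggested in the text."""
    return grad_xq * (1.0 + delta * np.sign(grad_xq) * (x_n - x_q))
```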
  • FIG. 7 is a schematic diagram of a multi-dimensional weight hybrid training strategy provided by an embodiment of the present application.
  • the specific steps of the multi-dimensional weight hybrid training strategy are as follows:
  • Step 710 The training device 320 determines the floating point value of the unquantized parameter of the second network layer and the inverse quantized value of the parameter.
  • Step 720 The training device 320 determines the optimized loss function based on the floating point value and the inverse quantization value.
  • the optimized loss function can be written as $\mathrm{Loss}(W_N, \gamma) = \mathrm{Loss}\big((1-\gamma)\cdot W_N + \gamma \cdot W_{QE}\big)$, with $W_{QE} = W_{scale} \cdot \mathrm{Round}(W_N / W_{scale})$
  • Loss(W_N, γ) represents the optimized loss function; γ ≥ 0, and its value gradually increases from 0 to 1 during the training process
  • W_N represents the unquantized floating-point value of the weight parameter
  • W_QE represents the inverse quantization value of the weight parameter
  • W_scale represents the overall scaling factor of the weight parameter
  • the Round function represents the quantization (rounding) operation.
  • Step 730 The training device 320 determines the reconstruction gradient according to the optimized loss function.
  • the specific steps of the training device 320 to determine the reconstruction gradient can refer to the following formula, obtained by differentiating the mixed loss with the Round operation handled by a straight-through estimator ($\partial W_{QE}/\partial W_N \approx 1$): $\partial \mathrm{Loss}/\partial W_N = \big((1-\gamma) + \gamma \cdot \partial W_{QE}/\partial W_N\big) \cdot \partial \mathrm{Loss}/\partial W_{mix}$, where $W_{mix} = (1-\gamma)W_N + \gamma W_{QE}$
  • Step 740 The training device 320 inputs the reconstruction gradient to the first network layer.
  • the training device 320 updates the weight parameters of the second network layer according to the reconstruction gradient of the weight parameters, and inputs the reconstruction gradient into the first network layer.
  • the training device 320 performs operations based on the same principles as steps 710 to 730 based on the reconstruction gradient transmitted by the second network layer to obtain the reconstruction gradient of the first network layer.
  • by analogy, the reconstruction gradient of the weights of each network layer of the entire neural network model is obtained, as illustrated in the sketch below.
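A hedged sketch of the multi-dimensional weight hybrid step follows; the linear mixing form, the straight-through treatment of Round, and the function names are assumptions consistent with the description above, not code from this application.

```python
import numpy as np

def mixed_weight(w_n: np.ndarray, w_scale: float, gamma: float) -> np.ndarray:
    """Blend float weights with their dequantized counterpart; gamma grows
    from 0 (pure floating-point weights) to 1 (pure quantized weights)
    over the course of training."""
    w_qe = w_scale * np.round(w_n / w_scale)   # inverse quantization value
    return (1.0 - gamma) * w_n + gamma * w_qe

def mixed_weight_grad(grad_mix: np.ndarray, gamma: float) -> np.ndarray:
    """Reconstruction gradient w.r.t. w_n: with a straight-through
    estimator, d(w_qe)/d(w_n) ~= 1, so the factor (1 - gamma) + gamma * 1
    reduces to 1 and the mixed gradient flows back unchanged."""
    return grad_mix * ((1.0 - gamma) + gamma * 1.0)
```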
  • the training method of the neural network model provided by this embodiment is described in detail above with reference to FIGS. 3 to 7 .
  • the training device of the neural network model provided by this embodiment will be described with reference to FIG. 8 .
  • FIG. 8 is a schematic diagram of a possible training device for a neural network model provided in this embodiment.
  • the training device for a neural network model can be used to implement the functions of the execution device in the above method embodiment, and thus can also achieve the beneficial effects possessed by the above method embodiment.
  • the training device for the neural network model can be the training device 320 shown in FIG. 3, or can be a module (such as a chip) applied to a server.
  • the neural network model training device 800 includes a compensation module 810 and a processing module 820 .
  • the neural network model training device 800 is used to implement the functions of the training device 320 in the method embodiment shown in FIG. 4 .
  • the compensation module 810 is used to change the gradient compensation strategy according to the fluctuation value of the quantization error of the parameter, and use the gradient compensation strategy to compensate the gradient obtained by the neural network model training.
  • the compensation module 810 is used to perform steps 410 and 430 in FIG. 4 .
  • the processing module 820 is used to determine the fluctuation value of the quantization error of the parameters of the neural network model. For example, the processing module 820 is used to perform step 420 in FIG. 4.
  • the first gradient compensation strategy includes an element-level gradient scaling strategy
  • the second gradient compensation strategy includes a multi-dimensional weight hybrid training strategy
  • the parameters include weight parameters or activation values.
  • the processing module 820 is specifically configured to periodically count the fluctuation values of the quantization errors of the parameters of the neural network model.
  • the first period, with which the fluctuation value of the quantization error is counted under the first gradient compensation strategy, is smaller than the second period, with which it is counted under the second gradient compensation strategy.
  • the number of training steps included in the second period is equal to the quotient of the total number of training steps in which the second gradient compensation strategy is used to compensate the gradient obtained after training and the number of training steps included in the first period, as sketched below.
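For illustration, a minimal sketch of this two-period schedule; the variable names are assumptions, with m the number of steps in the first period and m2 the total number of training steps in the second training stage, matching the M2/m interval described elsewhere in this application.

```python
def statistics_interval(stage: int, m: int, m2: int) -> int:
    """Number of training steps between successive counts of the
    quantization-error fluctuation value."""
    if stage == 1:
        return m          # first stage: count every m steps
    return m2 // m        # second stage: count every M2/m steps
```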
  • the training device 800 of the neural network model in the embodiment of the present application can be implemented by GPU, NPU, ASIC, or programmable logic device (PLD).
  • the above PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the neural network model training device 800 and its respective modules can also be software modules.
  • the neural network model training device 800 in the embodiment of the present application may correspond to executing the method described in the embodiment of the present application, and the above and other operations and/or functions of each unit in the neural network model training device 800 are respectively intended to implement the corresponding processes of each method in FIG. 4, which will not be repeated here for the sake of brevity.
  • FIG. 9 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Computing device 900 includes memory 901, processor 902, communication interface 903, and bus 904. The memory 901, the processor 902, and the communication interface 903 are communicatively connected to each other through the bus 904.
  • Memory 901 may be a read-only memory, a static storage device, a dynamic storage device, or a random access memory.
  • memory 901 may store computer instructions.
  • the processor 902 and the communication interface 903 are used to execute steps in the training method of the neural network model of the software system.
  • the communication interface 903 is used to execute step 410 in the training method of the neural network model shown in Figure 4, and the function of the compensation module 810 in the training device 800 of the neural network model shown in Figure 8.
  • the processor 902 is used to execute steps 420 and 430 in the training method of the neural network model shown in Figure 4, as well as the functions of the processing module 820 in the training device 800 of the neural network model shown in Figure 8.
  • the memory can also store data sets. For example, a part of the storage resources in the memory 901 is divided into an area for storing programs that implement the functions of the neural network model in the embodiment of the present application.
  • the processor 902 can be a general CPU, an application specific integrated circuit (ASIC), a GPU or any combination thereof.
  • Processor 902 may include one or more chips.
  • Processor 902 may include an AI accelerator, such as an NPU.
  • the communication interface 903 uses a transceiver module such as but not limited to a transceiver to implement communication between the computing device 900 and other devices or communication networks. For example, the iterative training request, training data, and feedback of the iteratively trained neural network can be obtained through the communication interface 903.
  • Bus 904 may include a path that carries information between various components of computing device 900 (eg, memory 901, processor 902, communications interface 903).
  • the computing device 900 may be a computer (for example, a server) in a cloud data center, a computer in an edge data center, or a terminal.
  • training device 320 may be deployed on each computing device 900.
  • a GPU is used to implement the functions of the training device 320.
  • the training device 320 can communicate with the execution device 310 through the bus 904.
  • the training device 320 may communicate with the execution device 310 through a communication network.
  • the method steps in this embodiment can be implemented by hardware or by a processor executing software instructions.
  • Software instructions can be composed of corresponding software modules, and software modules can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from the storage medium and write information to the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage media may be located in an ASIC.
  • the ASIC can be located in the terminal device.
  • the processor and the storage medium can also exist as discrete components in network equipment or terminal equipment.
  • the computer program product includes one or more computer programs or instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, a network device, a user equipment, or other programmable device.
  • the computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
  • the computer program or instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center via wired or wireless means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center that integrates one or more available media.
  • the available media may be magnetic media, such as floppy disks, hard disks, and magnetic tapes; optical media, such as digital video discs (DVDs); or semiconductor media, such as solid state drives (SSDs).


Abstract

A method and apparatus for training a neural network model, and a device and a system, applied to a computing device for training a neural network model. The method comprises: during quantization training of a neural network model, to address the problem of an inaccurate gradient caused by quantization, a computing device changes a gradient compensation strategy according to a fluctuation value of a quantization error of a parameter, uses the applicable gradient compensation strategy to correct the gradient, and updates a parameter of the neural network model on the basis of the gradient determined by the gradient compensation strategy, so as to obtain an optimized neural network model. The accuracy of the gradient of a parameter of the neural network model is thereby improved, and the precision of model training is ensured through the precision of the parameter determined from the gradient.

Description

Training method, device, equipment and system for neural network model

This application claims priority to Chinese patent application No. 202211145916.7, entitled "Neural Network Model Training Method, Device, Equipment and System", filed with the State Intellectual Property Office on September 20, 2022, the entire content of which is incorporated into this application by reference.
Technical field

The present application relates to the field of artificial intelligence technology, and in particular to a training method, device, equipment and system for a neural network model.
Background

Introducing quantization processing into the training process of a neural network model can reduce the consumption of storage and processing resources by the neural network model. However, when the neural network model is updated based on quantized parameters, model convergence is poor, which causes a large loss of accuracy of the neural network model.
Summary of the invention

This application provides a training method, device, equipment and system for a neural network model, thereby solving the problems of poor model convergence and large accuracy loss of the neural network model caused by the quantization processing of the neural network model.
In a first aspect, a training method for a neural network model is provided, executed by a computing device that trains the neural network model. The method includes: the computing device trains the neural network model, the parameters of which are quantized; the quantization of the parameters introduces errors into the model parameters. In the initial stage of neural network model training, the computing device adopts a first gradient compensation strategy to compensate the gradient obtained after training, and counts the fluctuation value of the quantization error of the parameters of the neural network model. When the fluctuation value of the quantization error is less than or equal to a preset value, a second gradient compensation strategy is used to compensate the gradient obtained after training.

Thus, the computing device uses the first gradient compensation strategy to update the parameters of the neural network model in the initial stage of model training, when the parameters of the neural network model are unstable, and determines, based on the fluctuation value of the quantization error of the parameters, that the neural network model has entered a training stage in which the parameters are relatively stable, after which the first gradient compensation strategy is changed to the second gradient compensation strategy. Hence, for a neural network model in training stages with different degrees of stability, the computing device can use the applicable gradient compensation strategy to determine the gradient of the neural network model and optimize that gradient, improving the accuracy of the gradient of the parameters of the neural network model and the precision of the parameters determined based on the gradient, thereby ensuring the accuracy of model training.
Here, quantization refers to using a quantization function during the forward training of the neural network model to convert parameters from floating-point values to integer values. The parameters of the neural network model can include the weight parameters output by each network layer included in the neural network model, and/or the values that undergo convolution or matrix multiplication with the weight parameters, namely the activation values. The quantization error refers to the difference between the floating-point value of a parameter of the neural network model before quantization and its dequantized value, where the dequantized value is the floating-point value obtained by dequantizing the quantized integer value of the parameter using the inverse of the quantization function.
In a possible implementation, the computing device may determine the gradient compensation strategy based on the result of comparing the fluctuation value of the quantization error of the parameters with the preset value.

For example, when the fluctuation value of the quantization error is greater than the preset value, the computing device changes the currently used gradient compensation strategy to the first gradient compensation strategy.

For another example, when the fluctuation value of the quantization error is less than or equal to the preset value, the computing device changes the currently used gradient compensation strategy to the second gradient compensation strategy.
The first gradient compensation strategy may be an element-wise gradient scaling (EWGS) strategy, and the second gradient compensation strategy may be a multi-dimensional weight hybrid training strategy, where the multi-dimensional weight hybrid training strategy is used to perform gradient compensation based on the quantized value and the dequantized value of the parameters.
During the training process of the neural network model, the quantization error fluctuates greatly when the model first starts training, and the fluctuation gradually decreases and stabilizes as training progresses. In this embodiment, the training stage in which the fluctuation value of the quantization error of the parameters is greater than the preset value is called the first training stage of the neural network model, and the training stage in which the fluctuation value of the quantization error of the parameters is less than or equal to the preset value is called the second training stage of the neural network model.
In this embodiment, the computing device uses the multi-dimensional weight hybrid training strategy for gradient compensation in the second training stage, in which the fluctuation value of the quantization error is small, performing model training with integer values of single-precision or half-precision parameters, which improves parameter transmission and calculation efficiency and iterates faster than the element-wise gradient scaling strategy, improving training efficiency while ensuring the training accuracy of the neural network model. In the first training stage, in which the fluctuation value of the quantization error is large, the computing device uses the element-wise gradient scaling strategy for gradient compensation, adaptively enlarging or shrinking each gradient element and using the scaled gradient as the gradient output by the quantization function to train the quantized network through back propagation. Compared with the multi-dimensional weight hybrid training strategy, this achieves higher-precision gradient compensation, so that when the quantization error fluctuates greatly, that is, when the discretization error between the input and output of the quantization function causes severe gradient mismatch of the parameters, the gradient accuracy of the parameters of the neural network model is guaranteed, thereby ensuring model training accuracy.
As a possible implementation, the quantization of the neural network model by the computing device is performed in the forward training of model training. For example, the computing device quantizes the weight parameters and activation values in forward training; that is, the computing device obtains the integer output result of each network layer in the neural network model based on the quantized values of the weight parameters and activation values of that network layer, dequantizes the integer output result, and then performs forward calculation.

Optionally, the computing device can adopt different quantization methods for parameters with different data distributions.

For example, the activation values are quantized using a sample-by-sample asymmetric uniform quantization method, and the weight parameters are quantized using a channel-by-channel symmetric uniform quantization method. The computing device thus quantizes the activation values with the sample-by-sample asymmetric uniform quantization method, whose quantization accuracy is higher than that of the channel-by-channel symmetric uniform quantization method. Since the sample-by-sample asymmetric uniform quantization method has no obvious accuracy advantage over the channel-by-channel symmetric uniform quantization method when quantizing weight parameters, the computing device quantizes the weight parameters with the channel-by-channel symmetric uniform quantization method, which is more computationally efficient. The quantization method is thus adopted adaptively according to the data distribution of the parameters, improving quantization accuracy while ensuring quantization efficiency.
As a possible implementation, the computing device can periodically count the fluctuation value of the quantization error and periodically use a gradient compensation strategy to compensate the gradient; that is, the quantization error of the last training refers to the quantization error of the training counted in the previous period, thereby reducing the computational overhead of quantization processing.

Optionally, in the first training stage, the computing device calculates the fluctuation value of the quantization error once every m training steps, and uses the first gradient compensation strategy to compensate the gradient of the neural network model. Each completion by the neural network model of one forward propagation (i.e., forward training) and one back propagation (i.e., reverse training) is called a training step, and m is a positive integer.

Optionally, in the second training stage, the computing device calculates the fluctuation value of the quantization error once every M2/m training steps, and uses the second gradient compensation strategy to compensate the gradient of the neural network model, where M2 is the total number of training steps in the second training stage. Thus, the period with which the fluctuation value of the quantization error is counted under the first gradient compensation strategy is smaller than that under the second gradient compensation strategy, so that in the second training stage, in which the fluctuation value of the quantization error of the neural network model is small, the frequency of gradient compensation of the neural network model is reduced, improving the overall training efficiency of the neural network model while ensuring its training accuracy.
In a second aspect, a training device for a neural network model is provided. The device includes modules for executing the training method for a neural network model in the first aspect or any possible implementation of the first aspect.

It should be noted that the training device for the neural network model described in the second aspect can be a terminal device or a network device, a chip (system) or other part or component that can be provided in a terminal device or network device, or a device that includes a terminal device or network device; this application does not limit this.

In addition, for the technical effects of the training device for the neural network model described in the second aspect, reference may be made to the technical effects of the training method for the neural network model described in the first aspect, which will not be repeated here.
In a third aspect, a computing device is provided, including a memory and a processor. The memory is used to store a set of computer instructions, and when the processor executes the set of computer instructions, the processor performs the operation steps of the training method for a neural network model in any possible design of the first aspect.

In addition, for the technical effects of the computing device described in the third aspect, reference may be made to the technical effects of the training method for the neural network model described in the first aspect, which will not be repeated here.
In a fourth aspect, a training system for a neural network model is provided, including an execution device and the computing device described in the third aspect. The computing device is used to perform the operation steps of the training method for a neural network model in any possible design of the first aspect to obtain an optimized neural network model, and the execution device is used to apply the optimized neural network model.
In a fifth aspect, a computer-readable storage medium is provided, including computer software instructions; when the computer software instructions are run in a data processing system, the computing device is caused to perform the operation steps of the method described in any possible implementation of the first aspect.

In a sixth aspect, a computer program product is provided; when the computer program product is run on a computer, the computing device is caused to perform the operation steps of the method described in any possible implementation of the first aspect.

On the basis of the implementations provided in the above aspects, this application can be further combined to provide more implementations.
Description of the drawings

Figure 1 is a schematic structural diagram of a neural network provided by an embodiment of the present application;

Figure 2 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application;

Figure 3 is a schematic architectural diagram of a neural network model training system provided by an embodiment of the present application;

Figure 4 is a schematic diagram of a training method for a neural network model provided by an embodiment of the present application;

Figure 5 is a schematic diagram of forward propagation provided by an embodiment of the present application;

Figure 6 is a schematic diagram of an element-level gradient scaling strategy provided by an embodiment of the present application;

Figure 7 is a schematic diagram of a multi-dimensional weight hybrid training strategy provided by an embodiment of the present application;

Figure 8 is a schematic diagram of a training device for a neural network model provided by an embodiment of the present application;

Figure 9 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
Detailed description

To facilitate understanding, the relevant terms involved in the embodiments of this application are first introduced below.
(1) Neural network
A neural network may be composed of neurons, where a neuron may refer to an operation unit that takes $x_s$ and an intercept of 1 as inputs. The output of the operation unit satisfies the following formula:

$$h_{W,b}(x) = f(W^{T}x) = f\Big(\sum_{s=1}^{n} W_s x_s + b\Big)$$
Here, s = 1, 2, ..., n, where n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neuron. f is the activation function of the neuron, used to introduce nonlinear characteristics into the neural network and convert the input signal of the neuron into an output signal. The output signal of the activation function can serve as the input of the next layer; the activation function can be a sigmoid function. A neural network is a network formed by connecting multiple such single neurons, that is, the output of one neuron can be the input of another neuron. The input of each neuron can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, where the local receptive field can be an area composed of several neurons. Weights represent the strength of the connections between different neurons and determine the influence of the input on the output: a weight close to 0 means that changing the input does not change the output, and a negative weight means that increasing the input decreases the output.
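For instance, a single neuron under this formula can be sketched as follows; the sigmoid activation is chosen only because the text names it as an example.

```python
import math

def neuron(xs: list[float], ws: list[float], b: float) -> float:
    """Single-neuron forward pass: f(sum(W_s * x_s) + b) with sigmoid f."""
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return 1.0 / (1.0 + math.exp(-z))
```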
As shown in Figure 1, a schematic structural diagram of a neural network provided by an embodiment of the present application, the neural network 100 includes N processing layers, where N is an integer greater than or equal to 3. The first layer of the neural network 100 is the input layer 110, responsible for receiving input signals, and the last layer is the output layer 130, responsible for outputting the processing results of the neural network. The layers other than the first and last layers are intermediate layers 140, which together form the hidden layer 120; each intermediate layer 140 in the hidden layer 120 can both receive and output signals. The hidden layer 120 is responsible for processing the input signal. Each layer represents one logical level of signal processing; through multiple layers, a data signal can be processed by multiple levels of logic.
In some feasible embodiments, the input signal of the neural network may take various forms, such as a video signal, a voice signal, a text signal, an image signal, or a temperature signal. The voice signal can be any of various sensor signals, such as a human voice audio signal of speaking or singing recorded by a microphone (sound sensor). The input signals of the neural network also include various other computer-processable engineering signals, which will not be listed one by one here. If a neural network is used to perform deep learning on image signals, the quality of the images processed by the neural network can be improved.
(2) Convolutional neural network

A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers. The feature extractor can be regarded as a filter, and the convolution process can be regarded as convolving a trainable filter with an input image or feature map. A convolutional layer refers to a neuron layer in the convolutional neural network that performs convolution on the input signal. In a convolutional layer of a convolutional neural network, a neuron can be connected to only some of the neurons in neighboring layers. A convolutional layer can output several feature maps, where a feature map can refer to an intermediate result in the operation of the convolutional neural network. Neurons in the same feature map share weights, and the shared weights here are the convolution kernel. Sharing weights can be understood as extracting image information in a position-independent way; that is, the statistics of one part of the image are the same as those of other parts, meaning that image information learned in one part can also be used in another part, so the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and during the training of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network while reducing the risk of overfitting.
For example, as shown in Figure 2, a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application, the convolutional neural network 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230.

The convolutional layer/pooling layer 220 may include, for example, layers 221 to 226. In one example, layer 221 may be a convolutional layer, layer 222 a pooling layer, layer 223 a convolutional layer, layer 224 a pooling layer, layer 225 a convolutional layer, and layer 226 a pooling layer. In another example, layers 221 and 222 may be convolutional layers, layer 223 a pooling layer, layers 224 and 225 convolutional layers, and layer 226 a pooling layer. The output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
Taking the convolutional layer 221 as an example, the internal working principle of a convolutional layer is introduced below.

The convolutional layer 221 may include many convolution operators, also called kernels. The role of a convolution operator in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator can essentially be a weight matrix, which is usually predefined, and whose size is related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolved output with a single depth dimension, but in most cases, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), i.e., multiple matrices of the same type, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract image edge information, another to extract a specific color of the image, and yet another to blur unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), so the feature maps extracted by them also have the same size, and the extracted feature maps of the same size are then merged to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical applications, and the weight matrices formed by the weight values obtained through training can be used to extract information from the input image, enabling the convolutional neural network 200 to make correct predictions.
When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (for example, layer 221) often extracts more general features, which can also be called low-level features. As the depth of the convolutional neural network 200 increases, the features extracted by later convolutional layers (for example, layer 226) become more and more complex, such as high-level semantic features; features with higher-level semantics are more suitable for the problem to be solved.
Since it is often necessary to reduce the number of training parameters, pooling layers often need to be introduced periodically after convolutional layers. In layers 221 to 226, as exemplified by the convolutional layer/pooling layer 220 in Figure 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image or audio processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator computes the average of the pixel values in the image within a specific range as the result of average pooling, and the max pooling operator takes the pixel with the largest value within a specific range as the result of max pooling. In addition, just as the size of the weight matrix in a convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the size of the image. The size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the input image.
After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet sufficient to output the required output information, because, as described above, the convolutional layer/pooling layer 220 extracts features and reduces the parameters brought by the input image. However, to generate the final output information (the required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate the output of one or a set of the required number of classes. Therefore, the neural network layer 230 may include multiple hidden layers (layers 231, 232 to 23n shown in Figure 2) and an output layer 240, where the parameters contained in the multiple hidden layers can be obtained by pre-training on training data relevant to a specific task type; for example, the task type can include image recognition, image classification, and target recognition.
After the multiple hidden layers in the neural network layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240, which has a loss function similar to classification cross-entropy and is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (propagation in the direction from layer 210 to layer 240 in Figure 2) is completed, back propagation (propagation in the direction from layer 240 to layer 210 in Figure 2) starts to update the weight values and biases of the layers mentioned above, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in Figure 2 is only an example of a convolutional neural network; in specific applications, the convolutional neural network can also exist in the form of other network models, such as U-Net, the 3D Morphable Face Model (3DMM), and the Residual Network (ResNet). In addition, the methods provided by the embodiments of this application can also be applied to neural networks other than convolutional neural networks, such as the Transformer model and the Bidirectional Encoder Representations from Transformers (BERT) model.
(3) Loss function

In the process of training a deep neural network, because it is hoped that the output of the deep neural network is as close as possible to the value one really wants to predict, the predicted value of the current network can be compared with the really desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, in which parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and adjustments continue until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance how to compare the difference between the predicted value and the target value; this is the role of the loss function or objective function, important equations used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a greater difference, so the training of the deep neural network becomes a process of reducing this loss as much as possible.
The original meaning of gradient is a vector, indicating that the directional derivative of a function at a point attains its maximum value along that direction; that is, the function changes fastest along this direction (the direction of the gradient) at this point, and the rate of change is the largest. When looking for the optimal parameters of each network layer during the training of a deep neural network, the parameters that make the value of the loss function as small as possible must be determined. To find where the value of the loss function is as small as possible, the gradient of the loss function with respect to the parameters must be calculated; when the gradient vector approaches 0, the loss function reaches a local minimum and the model accuracy reaches a local maximum.
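As a concrete illustration of moving the parameters toward a smaller loss, the standard gradient descent update rule (a textbook formula, stated here for reference rather than quoted from this application) is:

$$w \leftarrow w - \eta \,\nabla_{w}\,\mathrm{Loss}(w)$$

where $\eta$ is the learning rate that controls the step size along the negative gradient direction.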
(4) Back propagation algorithm

A convolutional neural network can use the error back propagation (BP) algorithm to correct the values of the parameters in the initial neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forward propagation of the input signal to the output produces an error loss, and the parameters in the initial neural network model are updated based on the gradients of the parameters by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight parameters. Back propagation is the concrete implementation of gradient descent on deep networks.
(5) Quantization

In mathematics and digital signal processing, quantization refers to the process of mapping input values from a large set (usually a continuous set) into a smaller set (usually with a finite number of elements). In the field of neural network models, model quantization is the process of approximating, at a small cost in inference accuracy, the floating-point model weights that take continuous values (or a large number of possible discrete values), or the tensor data flowing through the model, by a fixed-point representation (usually int8) with finitely many (or fewer) discrete values. It is the process of approximately representing 32-bit limited-range floating-point data with a data type of fewer bits, while the input and output of the model remain floating-point, thereby achieving goals such as reducing the model size, reducing the memory consumption of the model, and accelerating model inference.
However, since back propagation in model training requires the quantized integer values of the parameters to be dequantized into floating-point values for calculation, and there is usually a certain error between the floating-point value obtained by dequantization and the original floating-point value of the parameter, namely the quantization error, this quantization error causes a gradient mismatch problem in the process of determining the optimal parameters based on the gradient of the loss function, reduces the precision of the parameters of the neural network model determined by back propagation, and brings an accuracy loss to the neural network model.

Here, dequantization is the process of dividing floating-point data by a scaling factor and mapping it to an integer value through a discretization operation, and then multiplying that integer value by the same scaling factor to convert it back into a floating-point value.
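A minimal NumPy sketch of this round trip and of the resulting quantization error (the function name and the scalar scale are illustrative assumptions):

```python
import numpy as np

def quantization_error(x: np.ndarray, scale: float) -> np.ndarray:
    """Round-trip a float tensor through quantization and dequantization
    and return the per-element quantization error."""
    x_int = np.round(x / scale)   # divide by the scale and discretize
    x_deq = x_int * scale         # multiply back by the same scale
    return x - x_deq              # quantization error
```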
In the quantization training process of existing neural network models, because the distributions of gradient values in different network layers of the neural network model are inconsistent and gradient values with small magnitudes account for the majority, amplifying these gradient values and then quantizing them into discrete values causes the gradient mismatch problem; for example, gradient values with small magnitudes are mapped directly to the value 0, losing the original information. Therefore, quantization training of a neural network model can cause a serious loss of model accuracy.
本申请实施例提供了一种神经网络模型的训练方法,尤其是一种根据神经网络模型的参数的量化误差的波动值选择不同梯度补偿策略更新参数的模型训练方法,即计算设备在对参数量化的神经网络模型进行训练时,在模型训练的参数的量化误差的波动值较大的初始阶段使用第一梯度补偿策略对模型训练得到的梯度进行补偿,当神经网络模型的参数的量化误差的波动值小于等于预设值,确定模型训练进入参数的量化误差的波动值较小的训练阶段时,采用第二梯度补偿策略对模型训练得到的梯度进行补偿。从而针对处于不同稳定程度的训练阶段的神经网络模型,计算设备采用适用的梯度补偿策略对神经网络模型进行优化,缓解量化误差导致的梯度失配问题,提高了神经网络模型的参数的梯度的准确性,使用更准确的梯度进行神经网络模型的参数更新,保证了模型训练的精度。Embodiments of the present application provide a training method for a neural network model, particularly a model training method that selects different gradient compensation strategies to update parameters according to the fluctuation value of the quantization error of the parameters of the neural network model, that is, the computing device quantifies the parameters. When the neural network model is trained, the first gradient compensation strategy is used to compensate the gradient obtained by model training in the initial stage when the quantization error of the parameters of the model training fluctuates greatly. When the quantization error of the parameters of the neural network model fluctuates The value is less than or equal to the preset value, and it is determined that when the model training enters the training stage where the fluctuation value of the parameter quantification error is small, the second gradient compensation strategy is used to compensate the gradient obtained by the model training. Therefore, for neural network models in training stages with different degrees of stability, the computing device adopts applicable gradient compensation strategies to optimize the neural network model, alleviate the gradient mismatch problem caused by quantization errors, and improve the accuracy of the gradient of the parameters of the neural network model. It uses more accurate gradients to update the parameters of the neural network model, ensuring the accuracy of model training.
下面将结合附图对本申请实施例的实施方式进行详细描述。The implementation of the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Figure 3 is a schematic architectural diagram of a neural network model training system provided by an embodiment of the present application. As shown in Figure 3, the training system 300 includes an execution device 310, a training device 320, a database 330, a terminal device 340, a data storage system 350, and a data collection device 360.
The execution device 310 may be a terminal, such as a mobile phone, a tablet computer, a laptop, a virtual reality (VR) device, an augmented reality (AR) device, a mixed reality (MR) device, an extended reality (ER) device, a camera, or a vehicle-mounted terminal, or it may be an edge device (for example, a box carrying a chip with processing capability).
The training device 320 may be a terminal, or another computing device that supports integer computation, such as a server or a cloud device.
As a possible embodiment, the execution device 310 and the training device 320 are different processors deployed on different physical devices (such as servers, or servers in a cluster). For example, the execution device 310 may be a graphics processing unit (GPU), a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. The training device 320 may be a GPU, a neural network processing unit (NPU), a microprocessor, an ASIC, or one or more integrated circuits used to control execution of the programs of the solution of the present application.
In another possible embodiment, the execution device 310 and the training device 320 are deployed on the same physical device, or the execution device 310 and the training device 320 are the same physical device.
The data collection device 360 is used to collect training data and store it in the database 330; the data collection device 360 may be the same device as, or a different device from, the execution device 310 and the training device 320. The training data includes data in at least one of the forms of images, speech, and text.
The training device 320 is used to train the neural network with the training data until the loss function of the neural network converges; when the loss function value is less than a specific threshold, training of the neural network is complete, so that the neural network reaches a certain accuracy. For example, the training device 320 performs quantization training on the neural network model, quantizing the weight parameters and/or activation values during forward propagation; then, for the neural network model in back propagation, it selects a gradient compensation strategy according to the fluctuation value of the quantization error of the parameters to determine the gradients of the parameters, and updates the parameters of the model based on the gradients to obtain an optimized neural network model. Alternatively, training is complete once all the training data in the database 330 have been used, so that the trained neural network has a target function such as image recognition, image classification, or speech recognition. The training device 320 then deploys the trained neural network 301 to the execution device 310. The execution device 310 is used to process application data with the trained neural network 301.
In some embodiments, the execution device 310 and the training device 320 are the same computing device. The computing device may deploy the trained neural network 301 to itself and use it to implement target functions such as image recognition and speech recognition.
In other embodiments, the training device 320 may deploy the trained neural network 301 to multiple execution devices 310, each of which uses the trained neural network 301 to implement the target function of the model.
In combination with the training system 300, the training method provided by this embodiment can be applied to neural network model training scenarios. Specifically, the model training method of the embodiments of the present application can be applied in scenarios such as accelerated training of neural network models and low-bit model quantization.
For example, in an accelerated neural network training scenario: when the training device 320 trains a neural network model for face recognition, the volume of face photographs contained in the training data is very large, and using full-precision floating-point data throughout model training would consume a large amount of computing resources and time, making training inefficient. The training device 320 therefore performs quantization training, so that the model converts its parameters into integer values for the forward-propagation computation. During back propagation, according to the fluctuation value of the quantization error before and after parameter quantization, the training device 320 applies the first gradient compensation strategy in the initial stage of training, when the fluctuation value of the quantization error is large, and the second gradient compensation strategy in the stable stage of training, when the fluctuation value is small, and then updates the parameters of the model with the compensated gradients to obtain the optimized neural network model. In this way, quantization training lets the training device 320 accelerate model training when the training data contain a large number of face photographs, while selecting the applicable gradient compensation strategy at different stages of training according to the fluctuation value of the quantization error, alleviating the drop in model accuracy caused by quantization errors and ensuring the accuracy of the model's face recognition function.
It should be noted that, in practical applications, the training data maintained in the database 330 do not necessarily all come from the data collection device 360; they may also be received from other devices. In addition, the training device 320 does not necessarily train the neural network entirely from the training data maintained in the database 330; it may also obtain training data from the cloud or elsewhere. The above description should not be taken as limiting the embodiments of the present application.
Further, according to the functions performed by the execution device 310, the execution device 310 may be subdivided into the architecture shown in Figure 7: the execution device 310 is configured with a computing module 311, an I/O interface 312, and a preprocessing module 313.
The I/O interface 312 is used for data interaction with external devices. A user may input data to the I/O interface 312 through the terminal device 340. Input data may also come from the database 330.
The preprocessing module 313 is used to perform preprocessing on the input data received by the I/O interface 312. In the embodiments of the present application, the preprocessing module 313 may be used to generate training data, such as a training set, a validation set, and a test set, from the input data received from the I/O interface 312.
When the execution device 310 preprocesses the input data, or when the computing module 311 of the execution device 310 performs computation or other related processing, the execution device 310 may call data, code, and the like in the data storage system 350 for the corresponding processing, and may also store the data, instructions, and the like obtained by that processing into the data storage system 350.
Finally, the I/O interface 312 returns the processing result to the terminal device 340, providing it to the user so that the user can view it.
The terminal device 340 may also act as a data collection end, collecting the input data fed into the I/O interface 312 and the processing results output from it, as shown in the figure, as new sample data, and storing them in the database 330. Of course, collection may also bypass the terminal device 340, with the I/O interface 312 itself storing the input data and the output processing results into the database 330 as new sample data.
Figure 3 is merely a schematic diagram of a system architecture provided by an embodiment of the present application; the positional relationships between the devices, components, modules, and so on shown in Figure 3 do not constitute any limitation. For example, in Figure 3 the data storage system 350 is external memory relative to the execution device 310, while in other cases the data storage system 350 may instead be placed within the execution device 310.
Next, with reference to Figure 4, the training method of the neural network model is described in detail, taking the training device 320 in Figure 3 as an example.
Step 410: the training device 320 trains the neural network model and compensates the gradients obtained after training with the first gradient compensation strategy.
The training device 320 performs forward-propagation training on the neural network model, quantizes the parameters of the model during forward propagation, and compensates the gradients of the model obtained after the forward-propagation training with the first gradient compensation strategy.
As an example, the first gradient compensation strategy is an element-wise gradient scaling strategy, which determines the precision of the parameters through a back propagation pass with element-wise gradient scaling. For the gradients output by the quantized neural network model, the element-wise gradient scaling strategy adaptively scales each gradient element up or down and uses the scaled gradient as the gradient of the quantization function's output, training the quantized network through back propagation. The scaling is performed according to the sign of each gradient element and the error between the continuous input and the discrete output of the quantization function. For the specific steps by which the training device 320 updates gradients with the element-wise gradient scaling strategy, refer to Figure 6 and the related description; details are not repeated here.
Optionally, the quantized parameters of the neural network model include activation values and/or weight parameters. An activation value is a value that a network layer of the neural network model passes to the next layer; it commonly appears paired with weight parameters and enters convolution or matrix multiplication operations together with them. For example, an activation value may be an output value of a network layer after processing by an activation function; alternatively, it may be a value of a network layer that is input, without activation-function processing, into the following layer for a convolution or matrix multiplication operation.
As a possible implementation, the training device 320 selects different gradient compensation strategies to correct the gradient values when the fluctuation value of the quantization error of the parameters falls within different numerical ranges. In step 410, the training device 320 applies the first gradient compensation strategy to the gradients obtained after training in the initial stage, in which the quantization error of the parameters of the neural network model is large. Optionally, a large quantization error means a quantization error greater than a preset value, where the specific preset value can be flexibly adjusted according to the accuracy requirements of the neural network model, for example 0.5%, 0.8%, 1%, or 1.6%.
In this embodiment, the training device 320 quantizes the activation values with per-sample asymmetric uniform quantization, and quantizes the weight parameters with per-channel symmetric uniform quantization. Per-sample means that within a batch of training data each sample is operated on separately; per-channel means that the parameters are grouped by channel and the data in each channel are operated on as a whole. For the specific steps of the two quantization schemes, refer to the description of parameter quantization in Figure 5 below; they are not repeated here.
The per-sample asymmetric uniform quantization and per-channel symmetric uniform quantization above are examples provided by the embodiments of the present application; the embodiments do not limit the quantization scheme for the activation values or weight parameters. For example, the quantization scheme may also be per-sample symmetric uniform quantization, per-channel asymmetric uniform quantization, or the like.
Based on the above quantization of activation values and weight parameters: on the one hand, the training device 320 quantizes the activation values with per-sample asymmetric uniform quantization, whose quantization accuracy is higher than that of per-channel symmetric uniform quantization, guaranteeing the quantization accuracy of the activation values and reducing the quantization error. On the other hand, since per-sample asymmetric uniform quantization offers no obvious accuracy advantage over per-channel symmetric uniform quantization when quantizing weight parameters, the training device 320 quantizes the weight parameters with the computationally more efficient per-channel symmetric uniform quantization, improving the efficiency of parameter quantization. The training device 320 of the embodiments of the present application thus adopts quantization schemes adaptively according to the data distribution of the parameters, improving quantization accuracy while guaranteeing quantization efficiency.
Step 420: the training device 320 determines the fluctuation value of the quantization error of the parameters of the neural network model.
The training device 320 first dequantizes the quantized integer values of the parameters to obtain floating-point dequantized values, then computes the quantization error of the parameters from the difference between the dequantized values and the floating-point values of the unquantized parameters, and finally takes the difference between the quantization errors of the parameters at different training steps as the fluctuation value of the quantization error.
The computation of the quantization error can be expressed by the following formulas (2)-(4):

$$\mathrm{MSE}(X_N, X_{QE}) = \frac{1}{M}\sum_{i=1}^{M}\left(X_N^i - X_{QE}^i\right)^2 \quad \text{formula (2)}$$

$$A_{QE} = A_Q \cdot A_{scale} - A_{zero\_point} \cdot A_{scale} \quad \text{formula (3)}$$

$$W_{QE} = W_Q \cdot W_{scale} \quad \text{formula (4)}$$

Here, $\mathrm{MSE}(X_N, X_{QE})$ denotes the quantization error of the parameters; $M$ denotes the number of parameters to be quantized; $X_N$ denotes the unquantized full-precision floating-point values of the parameters; $X_{QE}$ denotes the floating-point values obtained by dequantizing the quantized values of the parameters; $X_N^i$ and $X_{QE}^i$ denote, respectively, the unquantized full-precision floating-point value and the dequantized value of the parameter corresponding to the $i$-th sample. $X$ stands for an activation value or a weight parameter. $A_{QE}$ denotes the floating-point value obtained by dequantizing the quantized activation value, $A_Q$ the quantized activation value, $A_{zero\_point}$ the integer zero point of the activation values as a whole, and $A_{scale}$ the scaling factor of the activation values as a whole. $W_{QE}$ denotes the floating-point value obtained by dequantizing the quantized weight parameter, $W_Q$ the quantized weight parameter, and $W_{scale}$ the scaling factor of the weight parameters as a whole.
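As a small NumPy sketch of this computation (the array values and the threshold below are illustrative assumptions):

```python
import numpy as np

def quantization_error(x_n, x_qe):
    """Formula (2): mean squared error between the unquantized full-precision
    values and the values recovered by dequantization."""
    return float(np.mean((x_n - x_qe) ** 2))

# Hypothetical errors measured at two training steps:
err_prev = quantization_error(np.array([0.12, -0.53]), np.array([0.12, -0.52]))
err_curr = quantization_error(np.array([0.11, -0.50]), np.array([0.11, -0.50]))

fluctuation = abs(err_curr - err_prev)  # fluctuation value of the quantization error
preset = 1e-4                           # the preset value (assumed)
print(fluctuation <= preset)            # True -> switch to the second strategy
```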
Optionally, in this embodiment the training stage in which the fluctuation value of the quantization error of the parameters is greater than the preset value may be called the first training stage of the neural network model, and the training stage in which the fluctuation value is less than or equal to the preset value may be called the second training stage.
The quantization errors of the parameters obtained in two trainings in step 420 may refer to the quantization error of the parameters after the current training and after the previous training; the current training and the previous training may be separated by one or more training steps. For example, in the first training stage the training device 320 computes the fluctuation value of the quantization error once every m training steps, and according to that fluctuation value decides to keep the first gradient compensation strategy or to change it to the second gradient compensation strategy. Each completion of one forward propagation and one back propagation by the neural network model is called one training step, and m is a positive integer. In this way the training device 320 can judge intermittently whether to start a gradient compensation strategy, avoiding frequent judgments about starting or changing strategies during periods in which the parameters of the model are relatively stable, and reducing the consumption of the training device 320's computing resources.
Step 430: when the fluctuation value of the quantization error is less than or equal to the preset value, the training device 320 changes the first gradient compensation strategy to the second gradient compensation strategy and, in subsequent training, uses the second gradient compensation strategy to compensate the gradients obtained.
When the fluctuation value of the quantization error is less than or equal to the preset value, the training device 320 determines that training of the neural network model is in the relatively stable second training stage, changes the first gradient compensation strategy to the second, and compensates the gradients obtained in subsequent training with the second gradient compensation strategy.
As an example, the second gradient compensation strategy is a multi-dimensional weight hybrid training strategy. In this strategy, during training of the neural network model, matrix multiplication is performed with quantized parameters of type FP16 or FP32, which are then converted into dequantized values; the loss function is optimized according to the pre-quantization floating-point values and the dequantized values of the parameters, and the gradient is then reconstructed from the loss function. Put simply, the parameter values before and after quantization are both used in optimizing the loss function to make up for the lost precision. This effectively reduces rounding errors in computation and mitigates the loss of accuracy. For the specific steps by which the training device 320 updates gradients with the multi-dimensional weight hybrid training strategy, refer to Figure 7 and the related description; details are not repeated here.
After determining the gradient with the second gradient compensation strategy, the training device 320 updates the parameters of each network layer of the neural network model by gradient descent, obtaining the optimized model. Gradient descent is a commonly used algorithm in neural network training and is not described further here.
As a possible implementation, the training device 320 periodically gathers statistics on the fluctuation value of the quantization error of the parameters throughout model training, determines to start gradient compensation when the fluctuation value exceeds a start-up threshold, and then selects the first or second gradient compensation strategy according to the fluctuation value; that is, it executes the gradient compensation strategy periodically. The gradient compensation strategy of the neural network model may therefore change from the first strategy to the second, or from the second back to the first. For example, in the first training stage the training device 320 computes the fluctuation value of the quantization error once every m training steps and decides whether to keep or change the strategy, while in the second training stage it does so once every M2/m training steps, where M2 is the total number of training steps in the second training stage. Thus, relative to the first training stage, the training device 320 reduces the frequency of gradient compensation judgments and executions in the more stable second training stage, avoiding frequent decisions about starting or changing the gradient compensation strategy while the parameters of the model are relatively stable, and reducing the consumption of the training device 320's computing resources.
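A minimal sketch of this selection logic is given below; the helper names `train_one_step` and `apply_compensation` are illustrative stubs, not functions defined by the embodiments:

```python
import random

def train_one_step(model):
    # Stand-in for one forward/backward pass; returns (gradients, quantization error).
    return [0.0], random.uniform(0.0, 0.01)

def apply_compensation(model, grads, strategy):
    # Stand-in for compensating the gradients with the selected strategy.
    pass

def quantization_training(model, total_steps, m, preset):
    """Every m training steps, compare the change in the quantization error
    against the preset value; once the fluctuation value is small enough,
    change the first gradient compensation strategy to the second."""
    strategy = "element_wise_gradient_scaling"        # first strategy
    prev_err = None
    for step in range(total_steps):
        grads, err = train_one_step(model)
        if step % m == 0:
            if prev_err is not None and abs(err - prev_err) <= preset:
                strategy = "multi_dim_weight_hybrid"  # second strategy
            prev_err = err
        apply_compensation(model, grads, strategy)
    return strategy

quantization_training(model=None, total_steps=100, m=10, preset=1e-4)
```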
Based on step 430 above, the training device 320 changes the gradient compensation strategy when the fluctuation value of the quantization error differs, selecting different strategies to update the parameters of the neural network model. For neural network models at training stages of different degrees of stability, the training device 320 can therefore determine the gradients with an applicable gradient compensation strategy and optimize the gradient values whose precision was lost to quantization error, improving the accuracy of the gradients of the model's parameters and of the parameters determined from those gradients, and thereby ensuring the accuracy of model training. In addition, the training device 320 does not need to introduce operators or algorithms adapted to low-precision integer arithmetic, nor learnable quantization parameters for minimizing the quantization error, which reduces the resource occupation of the training device 320 and improves model training efficiency.
Next, with reference to Figure 5, the parameter quantization and computation steps of the neural network model during forward propagation in step 410 are described, taking one of the model's network layers, the second network layer, as an example. The quantizer, dequantizer, and low-precision integer computation unit in Figure 5 are functional modules implemented in hardware or software in the training device 320.
Step 510: the quantizer quantizes the parameters to obtain quantized integer values.
The quantizer receives activation values from the preceding network layer of the neural network model, the first network layer, and quantizes the activation values and the weight parameters to obtain their quantized integer values. The parameters the first network layer feeds into the second are floating-point values, for example FP16 or FP32, and the quantized integer values may be low-precision integer values of type INT8.
Optionally, in addition to being obtained from the preceding network layer, an activation value may also be a parameter value the quantizer obtains from the current network layer that has not been processed by an activation function.
In this embodiment, for brevity of Figure 5 and its description, only one quantizer is shown. In practice, since the parameters include activation values and weight parameters, the training device 320 may use two quantizers, one for each: for example, one quantizer quantizes the activation values with per-sample asymmetric uniform quantization, and the other quantizes the weight parameters with per-channel symmetric uniform quantization. The steps of both schemes are detailed below.
For per-sample asymmetric uniform quantization, the training device 320 first feeds the training samples into the neural network model and computes on each sample to obtain the floating-point activation values output by each network layer; it then gathers the maximum and minimum of the floating-point activation values of each sample, computes each sample's scaling factor and integer zero point from these statistics, and finally computes each sample's quantized integer activation values from that sample's scaling factor and integer zero point.
The per-sample asymmetric uniform quantization can be expressed by the following formulas (5)-(7):

$$A_{scale}^i = \frac{\max\left(A_N^i\right) - \min\left(A_N^i\right)}{2^n - 1} \quad \text{formula (5)}$$

$$A_{zero\_point}^i = \mathrm{Round}\left(-\frac{\min\left(A_N^i\right)}{A_{scale}^i}\right) \quad \text{formula (6)}$$

$$A_Q^i = \mathrm{Clip}\left(\mathrm{Round}\left(\frac{A_N^i}{A_{scale}^i}\right) + A_{zero\_point}^i,\ 0,\ 2^n - 1\right) \quad \text{formula (7)}$$

Here $A_N^i$ denotes the activation values of the $i$-th sample ($A_N$ denotes unquantized activation values and $A_Q$ quantized activation values); $n$ denotes the number of quantization bits, for example $n=8$ in an INT8 quantization scenario; $A_{scale}^i$ and $A_{zero\_point}^i$ denote, respectively, the quantization scaling factor and the integer zero point of the activation values of the $i$-th sample; the Round function denotes the quantization operation and the Clip function the data truncation operation; $A_Q^i$ denotes the integer values of the quantized activation values of the $i$-th sample.
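The following sketch applies these formulas with NumPy; the function name and the reshaping convention (one row per sample) are illustrative assumptions:

```python
import numpy as np

def quantize_activations_per_sample(a_n, n_bits=8):
    """Per-sample asymmetric uniform quantization, following formulas (5)-(7):
    each sample gets its own scaling factor and integer zero point from its
    min/max statistics."""
    qmax = 2 ** n_bits - 1
    flat = a_n.reshape(a_n.shape[0], -1)                         # one row per sample
    a_min = flat.min(axis=1, keepdims=True)
    a_max = flat.max(axis=1, keepdims=True)
    scale = (a_max - a_min) / qmax                               # formula (5)
    zero_point = np.round(-a_min / scale)                        # formula (6)
    a_q = np.clip(np.round(flat / scale) + zero_point, 0, qmax)  # formula (7)
    return a_q.reshape(a_n.shape).astype(np.uint8), scale, zero_point

a_q, a_scale, a_zp = quantize_activations_per_sample(
    np.random.randn(4, 16).astype(np.float32))
```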
For per-channel symmetric uniform quantization, the training device 320 first feeds the training samples into the neural network model and processes them in batches per channel to obtain the floating-point weight parameters output by each network layer; it then gathers the maximum absolute value of the floating-point weight parameters of each channel, computes each channel's weight-parameter scaling factor from these statistics, and finally determines the quantized integer values of the weight parameters from the scaling factors.
The per-channel symmetric uniform quantization can be expressed by the following formulas (8)-(9):

$$W_{scale}^j = \frac{\max\left(\left|W_N^j\right|\right)}{2^{n-1} - 1} \quad \text{formula (8)}$$

$$W_Q^j = \mathrm{Clip}\left(\mathrm{Round}\left(\frac{W_N^j}{W_{scale}^j}\right),\ -\left(2^{n-1} - 1\right),\ 2^{n-1} - 1\right) \quad \text{formula (9)}$$

Here $W_N^j$ denotes the weight parameters corresponding to the $j$-th output dimension ($W_N$ denotes unquantized weight parameters and $W_Q$ quantized weight parameters); $n$ denotes the number of quantization bits, for example $n=8$ in an INT8 quantization scenario; $W_{scale}^j$ denotes the quantization scaling factor of the weight parameters corresponding to the $j$-th output dimension; the Round function denotes the quantization operation and the Clip function the data truncation operation; $W_Q^j$ denotes the integer values of the quantized weight parameters of the $j$-th output dimension.
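A corresponding NumPy sketch (again with illustrative names; one row per output channel):

```python
import numpy as np

def quantize_weights_per_channel(w_n, n_bits=8):
    """Per-channel symmetric uniform quantization, following formulas (8)-(9):
    one scaling factor per output channel, from that channel's maximum
    absolute value."""
    qmax = 2 ** (n_bits - 1) - 1                             # 127 for int8
    flat = w_n.reshape(w_n.shape[0], -1)                     # one row per output channel
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax   # formula (8)
    w_q = np.clip(np.round(flat / scale), -qmax, qmax)       # formula (9)
    return w_q.reshape(w_n.shape).astype(np.int8), scale

w_q, w_scale = quantize_weights_per_channel(
    np.random.randn(8, 16).astype(np.float32))
```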
Step 520: the low-precision integer computation unit operates on the integer values to obtain an operation result.
The low-precision integer computation unit performs a matrix multiplication or convolution operation on the integer activation values and integer weight parameters to obtain the operation result.
The matrix multiplication or convolution operation is as follows:

$$\mathrm{Output}_{INT} = W_Q \otimes A_Q \quad \text{formula (10)}$$

Here $\mathrm{Output}_{INT}$ denotes the result of the integer convolution or matrix multiplication operation, $W_Q$ the quantized weight parameters, $A_Q$ the quantized activation values, and $\otimes$ a convolution or matrix multiplication computation.
Step 530: the dequantizer dequantizes the operation result to obtain dequantized floating-point values.
The dequantizer performs a dequantization operation on the output of the integer convolution or matrix multiplication to obtain dequantized values that approximate the original floating-point computation result. The dequantized values output by the dequantizer are the activation values the second network layer feeds into the next layer, the third network layer, or values on which computation continues within the second network layer.
The dequantization is computed as follows:

$$\mathrm{Output}_{FP} = W_N \otimes A_N \approx A_{scale} \cdot W_{scale} \cdot \left(\mathrm{Output}_{INT} - W_Q \otimes A_{zero\_point}\right) \quad \text{formula (11)}$$

Here $\mathrm{Output}_{FP}$ denotes the dequantized convolution or matrix multiplication result, $A_{scale}$ the overall scaling factor of the activation values, $W_{scale}$ the overall scaling factor of the weight parameters, $A_{zero\_point}$ the overall integer zero point of the activation values, $W_N$ the unquantized floating-point weight parameters, and $A_N$ the unquantized floating-point activation values.
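Putting steps 510-530 together for a single linear layer, reusing the two quantization helpers sketched above (the layer shapes and data are toy values):

```python
import numpy as np

a_n = np.random.randn(4, 16).astype(np.float32)   # activations: 4 samples
w_n = np.random.randn(8, 16).astype(np.float32)   # weights: 8 output channels

a_q, a_scale, a_zp = quantize_activations_per_sample(a_n)
w_q, w_scale = quantize_weights_per_channel(w_n)

# Step 520: integer matrix multiplication (formula (10)), accumulated in int32.
out_int = a_q.astype(np.int32) @ w_q.astype(np.int32).T

# Step 530: dequantization in the shape of formula (11): subtract the
# zero-point contribution, then rescale by the product of the two scaling factors.
zp_term = a_zp * w_q.astype(np.int64).sum(axis=1)  # per-sample zero point x per-channel weight sum
out_fp = a_scale * w_scale.T * (out_int - zp_term)

print(np.abs(out_fp - a_n @ w_n.T).max())          # approximation error of the quantized path
```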
The forward propagation of model training has been described above along the data transmission direction first network layer -> second network layer -> third network layer in Figure 5. Next, the specific steps of the element-wise gradient scaling strategy and the multi-dimensional weight hybrid training strategy are described with reference to Figures 6 and 7. The data propagation direction of back propagation is the reverse of forward propagation, the difference being that the training device 320 applies a gradient estimator at the quantization step of forward propagation to execute the gradient compensation strategy and determine the gradients; the specific steps by which the training device 320 updates the model parameters from the gradients in back propagation are not repeated here.
Refer to Figure 6, a schematic diagram of an element-wise gradient scaling strategy provided by an embodiment of the present application. In the back propagation of the neural network model, taking the training device 320 determining the parameters of the second network layer with the element-wise gradient scaling strategy as an example, the specific steps are as follows:
Step 610: the training device 320 obtains parameters.
The second network layer of the training device 320 may obtain parameters from the third network layer and from the second network layer itself. The parameters the second network layer obtains from the third network layer include the gradient values of the quantization function; the parameters it obtains from itself include the activation values and the weight parameters.
Step 620: the training device 320 scales the gradient of the quantization function according to the parameters to obtain the reconstructed gradient.
The training device 320 scales the gradient of the quantization function according to the activation values and weight parameters, obtaining the reconstructed gradients of the activation values and weight parameters.
The training device 320 may scale the gradient of the quantization function according to the following formula:

$$\frac{\partial L}{\partial x_n} = \frac{\partial L}{\partial x_q}\left(1 + \mu\,\mathrm{sign}\!\left(\frac{\partial L}{\partial x_q}\right)\left(x_n - x_q\right)\right) \quad \text{formula (12)}$$

Here $\frac{\partial L}{\partial x_n}$ and $\frac{\partial L}{\partial x_q}$ denote elements of the gradient matrices with respect to the continuous input and the discrete output of the quantization function; $\mu$ is the gradient scaling factor, $\mu \ge 0$, which can be set to a small constant (for example 10e-3) or to an adaptive coefficient based on a second-order gradient estimate; and $x_n$ and $x_q$ denote elements of the parameter's unquantized full-precision values $X_N$ and of its dequantized values $X_Q$, respectively.
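A sketch of this rule as a PyTorch autograd function is given below, assuming (for illustration) that the quantization function is plain rounding; formula (12) appears in the backward pass:

```python
import torch

class ElementWiseGradScaling(torch.autograd.Function):
    """Forward: quantize (rounding stands in for the quantization function).
    Backward: scale each gradient element by 1 + mu * sign(g) * (x_n - x_q),
    as in formula (12)."""

    @staticmethod
    def forward(ctx, x_n, mu):
        x_q = torch.round(x_n)
        ctx.save_for_backward(x_n - x_q)
        ctx.mu = mu
        return x_q

    @staticmethod
    def backward(ctx, grad_q):
        (delta,) = ctx.saved_tensors          # x_n - x_q, element-wise
        grad_n = grad_q * (1.0 + ctx.mu * torch.sign(grad_q) * delta)
        return grad_n, None                   # no gradient for mu

x = torch.randn(5, requires_grad=True)
y = ElementWiseGradScaling.apply(x, 1e-3)
y.sum().backward()
print(x.grad)
```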
Step 630: the training device 320 inputs the reconstructed gradient to the first network layer.
The training device 320 updates the weight parameters of the second network layer according to the reconstructed gradient of the weight parameters, and inputs the reconstructed gradient of the activation values into the first network layer. It then performs operations on the reconstructed gradient transmitted by the second network layer on the same principle as steps 610 and 620 to obtain the reconstructed gradient of the first network layer, and so on, obtaining the reconstructed gradient of every network layer of the neural network model.
Refer to Figure 7, a schematic diagram of a multi-dimensional weight hybrid training strategy provided by an embodiment of the present application. In the back propagation of the neural network model, taking the training device 320 determining the parameters of the second network layer with the multi-dimensional weight hybrid training strategy as an example, the specific steps are as follows:
Step 710: the training device 320 determines the floating-point values of the unquantized parameters of the second network layer, and the dequantized values of the parameters.
Step 720: the training device 320 determines the optimized loss function according to the floating-point values and the dequantized values.
Optionally, the training device 320 may determine the optimized loss function according to formulas (13)-(14):

$$\mathrm{Loss}(W_N, \rho) = \mathrm{Loss}\left((1-\rho)\cdot W_N + \rho\cdot W_{QE}\right) \quad \text{formula (13)}$$

$$W_{QE} = \mathrm{Round}\left(\frac{W_N}{W_{scale}}\right)\cdot W_{scale} \quad \text{formula (14)}$$

Here $\mathrm{Loss}(W_N, \rho)$ denotes the optimized loss function; $\rho \ge 0$, and its value increases gradually from 0 to 1 during training; $W_N$ denotes the unquantized floating-point values of the weight parameters; $W_{QE}$ denotes the dequantized values of the weight parameters; $W_{scale}$ denotes the overall scaling factor of the weight parameters; and the Round function denotes the quantization operation.
Step 730: the training device 320 determines the reconstructed gradient according to the optimized loss function.
The training device 320 may determine the reconstructed gradient according to the following formula:

$$\frac{\partial\,\mathrm{Loss}(W_N,\rho)}{\partial W_N} = (1-\rho)\cdot\frac{\partial\,\mathrm{Loss}}{\partial W_N} + \rho\cdot\frac{\partial\,\mathrm{Loss}}{\partial W_{QE}}\cdot\frac{\partial W_{QE}}{\partial W_N} \quad \text{formula (15)}$$

Here $\frac{\partial\,\mathrm{Loss}(W_N,\rho)}{\partial W_N}$ denotes the gradient of the optimized loss function with respect to $W_N$, and $\frac{\partial\,\mathrm{Loss}}{\partial W_N}$ the gradient of the loss function before optimization with respect to $W_N$. Since $W_{QE}$ involves the quantization function Round, $\frac{\partial W_{QE}}{\partial W_N}$ is always 0 and does not participate in the parameter update process, which effectively avoids the gradient mismatch problem.
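As a sketch in PyTorch (illustrative names; `detach()` plays the role of $\partial W_{QE}/\partial W_N = 0$, so the gradient with respect to $W_N$ carries the $(1-\rho)$ factor of formula (15)):

```python
import torch

def blended_weight(w_n, w_scale, rho):
    """Formulas (13)-(14): blend the full-precision weight with its dequantized
    counterpart; detach() stops gradients through the Round path."""
    w_qe = (torch.round(w_n / w_scale) * w_scale).detach()  # formula (14)
    return (1.0 - rho) * w_n + rho * w_qe                   # argument of formula (13)

w_n = torch.randn(4, 4, requires_grad=True)
loss = (blended_weight(w_n, w_scale=0.05, rho=0.3) ** 2).sum()  # toy loss
loss.backward()   # w_n.grad is scaled by (1 - rho), as in formula (15)
```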
Step 740: the training device 320 inputs the reconstructed gradient to the first network layer.
The training device 320 updates the weight parameters of the second network layer according to the reconstructed gradient of the weight parameters and inputs the reconstructed gradient into the first network layer. It then performs operations on the reconstructed gradient transmitted by the second network layer on the same principle as steps 710 to 730 to obtain the reconstructed gradient of the first network layer, and so on, obtaining the reconstructed gradient of every network layer of the neural network model.
The training method of the neural network model provided by this embodiment has been described in detail above with reference to Figures 3-7; the training apparatus of the neural network model provided by this embodiment is described below with reference to Figure 8.
Figure 8 is a schematic diagram of a possible training apparatus for a neural network model provided by this embodiment. The training apparatus can be used to implement the functions of the device that executes the above method embodiments and can therefore also achieve their beneficial effects. In this embodiment, the training apparatus may be the training device 320 shown in Figure 3, or a module (such as a chip) applied to a server.
The training apparatus 800 for a neural network model includes a compensation module 810 and a computation module 820. The training apparatus 800 is used to implement the functions of the training device 320 in the method embodiment shown in Figure 4.
The compensation module 810 is used to change the gradient compensation strategy according to the fluctuation value of the quantization error of the parameters, and to compensate the gradients obtained from training the neural network model with the gradient compensation strategy. For example, the compensation module 810 is used to perform steps 410 and 430 in Figure 4.
The computation module 820 is used to determine the fluctuation value of the quantization error of the parameters of the neural network model. For example, the computation module 820 is used to perform step 420 in Figure 4.
As a possible implementation, the first gradient compensation strategy includes an element-wise gradient scaling strategy, and the second gradient compensation strategy includes a multi-dimensional weight hybrid training strategy.
As a possible implementation, the parameters include weight parameters or activation values.
As a possible implementation, the computation module 820 is specifically used to periodically gather statistics on the fluctuation value of the quantization error of the parameters of the neural network model.
As a possible implementation, the first period with which the first gradient compensation strategy gathers statistics on the fluctuation value of the quantization error is shorter than the second period with which the second gradient compensation strategy does so.
As a possible implementation, the number of training steps contained in the second period equals the quotient of the total number of training steps in which the second gradient compensation strategy compensates the gradients obtained after training and the number of training steps contained in the first period.
It should be understood that the training apparatus 800 of the embodiments of the present application may be implemented by a GPU, an NPU, an ASIC, or a programmable logic device (PLD); the PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the method shown in Figure 4 is implemented in software, the training apparatus 800 and its modules may also be software modules.
The training apparatus 800 of the embodiments of the present application may correspondingly perform the methods described in the embodiments of the present application, and the above and other operations and/or functions of the units in the training apparatus 800 respectively implement the corresponding flows of the methods in Figure 4; for brevity, they are not repeated here.
本申请实施例还提供了一种计算设备,请参考图9,图9为本申请实施例提供的一种计算设备的结构示意图。计算设备900包括存储器901、处理器902、通信接口903以及总线904。其中,存储器901、处理器902、通信接口903通过总线904实现彼此之间的通信连接。An embodiment of the present application also provides a computing device. Please refer to FIG. 9 . FIG. 9 is a schematic structural diagram of a computing device provided by an embodiment of the present application. Computing device 900 includes memory 901, processor 902, communication interface 903, and bus 904. Among them, the memory 901, the processor 902, and the communication interface 903 implement communication connections between each other through the bus 904.
存储器901可以是只读存储器,静态存储设备,动态存储设备或者随机存取存储器。存储器 901可以存储计算机指令,当存储器901中存储的计算机指令被处理器902执行时,处理器902和通信接口903用于执行软件系统的图像处理方法中的步骤。例如,通信接口903用于执行上述图4所示的神经网络模型的训练方法中的步骤410,以及上述图8所述的神经网络模型的训练装置800中补偿模块810的功能,处理器902用于执行上述图4所示的神经网络模型的训练方法中的步骤420、步骤430,以及上述图8所述的神经网络模型的训练装置800中处理模块820的功能。存储器还可以存储数据集合,例如:存储器901中的一部分存储资源被划分成一个区域,用于存储实现本申请实施例的神经网络模型的功能的程序。Memory 901 may be a read-only memory, a static storage device, a dynamic storage device, or a random access memory. memory 901 may store computer instructions. When the computer instructions stored in the memory 901 are executed by the processor 902, the processor 902 and the communication interface 903 are used to execute steps in the image processing method of the software system. For example, the communication interface 903 is used to execute step 410 in the training method of the neural network model shown in Figure 4, and the function of the compensation module 810 in the training device 800 of the neural network model shown in Figure 8. The processor 902 uses In executing steps 420 and 430 in the training method of the neural network model shown in FIG. 4 , as well as the functions of the processing module 820 in the training device 800 of the neural network model shown in FIG. 8 . The memory can also store data sets. For example, a part of the storage resources in the memory 901 is divided into an area for storing programs that implement the functions of the neural network model in the embodiment of the present application.
处理器902可以采用通用的CPU,应用专用集成电路(application specific integrated circuit,ASIC),GPU或其任意组合。处理器902可以包括一个或多个芯片。处理器902可以包括AI加速器,例如NPU。The processor 902 can be a general CPU, an application specific integrated circuit (ASIC), a GPU or any combination thereof. Processor 902 may include one or more chips. Processor 902 may include an AI accelerator, such as an NPU.
通信接口903使用例如但不限于收发器一类的收发模块,来实现计算设备900与其他设备或通信网络之间的通信。例如,可以通过通信接口903获取迭代训练请求、训练数据,以及反馈迭代训练后神经网络。The communication interface 903 uses a transceiver module such as but not limited to a transceiver to implement communication between the computing device 900 and other devices or communication networks. For example, the iterative training request, training data, and feedback of the iteratively trained neural network can be obtained through the communication interface 903.
总线904可包括在计算设备900各个部件(例如,存储器901、处理器902、通信接口903)之间传送信息的通路。Bus 904 may include a path that carries information between various components of computing device 900 (eg, memory 901, processor 902, communications interface 903).
The computing device 900 may be a computer (for example, a server) in a cloud data center, a computer in an edge data center, or a terminal.
The functions of the training device 320 may be deployed on each computing device 900. For example, a GPU is used to implement the functions of the training device 320.
When the functions of the training device 320 and the functions of the execution device 310 are deployed in the same computing device 900, the training device 320 may communicate with the execution device 310 through the bus 904.
When the functions of the training device 320 and the functions of the execution device 310 are deployed in different computing devices 900, the training device 320 may communicate with the execution device 310 through a communication network.
The method steps in this embodiment may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from, and write information to, the storage medium. Certainly, the storage medium may alternatively be a component of the processor. The processor and the storage medium may be located in an ASIC. In addition, the ASIC may be located in a terminal device. Certainly, the processor and the storage medium may alternatively exist as discrete components in a network device or a terminal device.
The foregoing embodiments may be implemented entirely or partially by software, hardware, firmware, or any combination thereof. When software is used for implementation, the embodiments may be implemented entirely or partially in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer programs or instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present application are executed entirely or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer programs or instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer programs or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium, for example, a floppy disk, a hard disk, or a magnetic tape; an optical medium, for example, a digital video disc (DVD); or a semiconductor medium, for example, a solid state drive (SSD). The foregoing descriptions are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

  1. A training method for a neural network model, the method comprising:
    training a neural network model, wherein parameters of the neural network model have been quantized, and compensating a gradient obtained after the training by using a first gradient compensation strategy;
    determining a fluctuation value of a quantization error of the parameters of the neural network model, wherein the fluctuation value is a difference between the quantization error of a current training iteration and the quantization error of a previous training iteration; and
    when the fluctuation value of the quantization error is less than or equal to a preset value, changing the first gradient compensation strategy to a second gradient compensation strategy, and in subsequent training, compensating the gradient obtained after training by using the second gradient compensation strategy.
  2. The method according to claim 1, wherein the first gradient compensation strategy comprises an element-level gradient scaling strategy, the second gradient compensation strategy comprises a multi-dimensional weight hybrid training strategy, and the multi-dimensional weight hybrid training strategy is used to perform gradient compensation based on a quantized value and a dequantized value of the parameters.
  3. The method according to claim 1 or 2, wherein the parameters comprise weight parameters or activation values.
  4. The method according to any one of claims 1 to 3, wherein the determining a fluctuation value of a quantization error of the parameters of the neural network model comprises:
    periodically collecting statistics on the fluctuation value of the quantization error of the parameters of the neural network model.
  5. The method according to claim 4, wherein a first period in which the fluctuation value of the quantization error is collected under the first gradient compensation strategy is shorter than a second period in which the fluctuation value of the quantization error is collected under the second gradient compensation strategy.
  6. The method according to claim 5, wherein the number of training steps included in the second period is equal to the quotient of the total number of training steps in which the second gradient compensation strategy is used to compensate the gradient obtained after training, divided by the number of training steps included in the first period (a numeric illustration follows the claims).
  7. A training apparatus for a neural network model, comprising:
    a compensation module, configured to train a neural network model, wherein parameters of the neural network model have been quantized, and compensate a gradient obtained after the training by using a first gradient compensation strategy; and
    a calculation module, configured to determine a fluctuation value of a quantization error of the parameters of the neural network model, wherein the fluctuation value is a difference between the quantization error of a current training iteration and the quantization error of a previous training iteration;
    wherein the compensation module is further configured to: when the fluctuation value of the quantization error is less than or equal to a preset value, change the first gradient compensation strategy to a second gradient compensation strategy, and in subsequent training, compensate the gradient obtained after training by using the second gradient compensation strategy.
  8. The apparatus according to claim 7, wherein the first gradient compensation strategy comprises an element-level gradient scaling strategy, the second gradient compensation strategy comprises a multi-dimensional weight hybrid training strategy, and the multi-dimensional weight hybrid training strategy is used to perform gradient compensation based on a quantized value and a dequantized value of the parameters.
  9. The apparatus according to claim 7 or 8, wherein the parameters comprise weight parameters or activation values.
  10. The apparatus according to any one of claims 7 to 9, wherein the calculation module is specifically configured to:
    periodically collect statistics on the fluctuation value of the quantization error of the parameters of the neural network model.
  11. The apparatus according to claim 10, wherein a first period in which the fluctuation value of the quantization error is collected under the first gradient compensation strategy is shorter than a second period in which the fluctuation value of the quantization error is collected under the second gradient compensation strategy.
  12. The apparatus according to claim 11, wherein the number of training steps included in the second period is equal to the quotient of the total number of training steps in which the second gradient compensation strategy is used to compensate the gradient obtained after training, divided by the number of training steps included in the first period.
  13. A computing device, comprising a memory and at least one processor, wherein the memory is configured to store a set of computer instructions, and when the processor executes the set of computer instructions, the operation steps of the method according to any one of claims 1 to 6 are performed.
  14. A training system for a neural network model, wherein the system comprises an execution device and the computing device according to claim 13, the computing device is configured to perform the operation steps of the method according to any one of claims 1 to 6 to train a neural network model to obtain an optimized neural network model, and the execution device is configured to apply the optimized neural network model.
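Illustration (not part of the claims): the period relation in claims 6 and 12 can be made concrete with a small Python sketch. The step counts below are invented for the example and do not come from the application.

# Claims 6/12: steps in second period = total steps under the second
# strategy / steps in the first period. All numbers here are assumed.
total_steps_second_strategy = 8000   # assumed total training steps using the second strategy
first_period_steps = 40              # assumed training steps per first (shorter) period

second_period_steps = total_steps_second_strategy // first_period_steps
print(second_period_steps)           # 200 training steps per second period
assert first_period_steps < second_period_steps  # consistent with claims 5 and 11

With these assumed figures, the quantization-error statistics are collected every 40 steps under the first strategy and every 200 steps under the second, reflecting that a model with stable parameters needs less frequent monitoring.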
PCT/CN2023/101170 2022-09-20 2023-06-19 Method and apparatus for training neural network model, and device and system WO2024060727A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211145916.7A CN117787375A (en) 2022-09-20 2022-09-20 Training method, device, equipment and system of neural network model
CN202211145916.7 2022-09-20

Publications (1)

Publication Number Publication Date
WO2024060727A1 2024-03-28

Family

ID=90387802

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101170 WO2024060727A1 (en) 2022-09-20 2023-06-19 Method and apparatus for training neural network model, and device and system

Country Status (2)

Country Link
CN (1) CN117787375A (en)
WO (1) WO2024060727A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200202213A1 (en) * 2018-12-19 2020-06-25 Microsoft Technology Licensing, Llc Scaled learning for training dnn
CN111429142A (en) * 2020-06-10 2020-07-17 腾讯科技(深圳)有限公司 Data processing method and device and computer readable storage medium
CN112085074A (en) * 2020-08-25 2020-12-15 腾讯科技(深圳)有限公司 Model parameter updating system, method and device
CN112884146A (en) * 2021-02-25 2021-06-01 香港理工大学深圳研究院 Method and system for training model based on data quantization and hardware acceleration
KR102389910B1 (en) * 2021-12-30 2022-04-22 주식회사 모빌린트 Quantization aware training method for neural networks that supplements limitations of gradient-based learning by adding gradient-independent updates

Also Published As

Publication number Publication date
CN117787375A (en) 2024-03-29


Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 23867022
Country of ref document: EP
Kind code of ref document: A1