CN115860062A - Neural network quantization method and device suitable for FPGA - Google Patents

Neural network quantization method and device suitable for FPGA

Info

Publication number
CN115860062A
CN115860062A CN202211456706.XA
Authority
CN
China
Prior art keywords
quantization
weight
parameters
shift
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211456706.XA
Other languages
Chinese (zh)
Inventor
吕文浩
支小莉
童维勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202211456706.XA priority Critical patent/CN115860062A/en
Publication of CN115860062A publication Critical patent/CN115860062A/en
Pending legal-status Critical Current

Abstract

The invention discloses a neural network quantization method and device suitable for an FPGA. The method comprises the following steps: in the forward propagation process, a first shift quantization is performed on the weight parameters of each layer of the neural network model; the weight parameters are grouped according to the quantization error of the first shift quantization to obtain quantized weight parameter values; the convolution result of each layer is calculated from the quantized weight parameter values; and the hyper-parameters are updated in the back propagation process until training reaches a predetermined precision or Epoch. The device comprises a quantization module, a weight grouping module and a retraining module. The invention converts multiplication operations into shift operations, which reduces the computation cost and the deployment difficulty on hardware devices such as FPGAs, effectively improves the quantization precision, and increases the flexibility of the weight representation.

Description

Neural network quantization method and device suitable for FPGA
Technical Field
The invention relates to the technical field of neural networks, in particular to a neural network quantization method and device suitable for an FPGA.
Background
In recent years, Convolutional Neural Network (CNN) models have been widely used in object detection, image segmentation, image classification, human pose recognition, and other fields, and have made remarkable progress. Meanwhile, the application range of CNN models has gradually expanded from image-processing services on server clusters to latency-sensitive real-time applications at the edge. Such applications tend to be deployed on low-power, computationally limited edge devices, which conflicts with the compute- and memory-intensive nature of CNN models.
To adapt CNN models to resource-constrained computing environments, model quantization has begun to attract researchers' attention. Quantization techniques reduce the storage and computation costs of a model by reducing the precision of the weights and intermediate computation results. Existing quantization techniques mainly include uniform quantization and non-uniform quantization. Fixed-point quantization is the most common uniform quantization technique; it converts full-precision floating-point numbers into fixed-point numbers. Although this saves computation cost, the arithmetic is still dominated by multiplication, which on hardware devices such as FPGAs (field-programmable gate arrays) is usually implemented with DSPs. DSP resources on an FPGA are relatively scarce, and developers must specifically optimize them to support the large-scale multiplications of a CNN, so the amount of DSP resources is often an important factor limiting performance. In contrast to fixed-point quantization, shift quantization techniques in non-uniform quantization convert multiplications into shift operations by converting full-precision floating-point numbers to power-of-2 form, which is typically implemented on FPGAs with the more plentiful look-up tables.
However, in the implementation of shift quantization, the non-uniform distribution of power-of-2 values causes problems such as rapid saturation of model accuracy and limited flexibility of the weight representation, so the performance of shift quantization is worse than that of fixed-point quantization. How to improve the performance of shift quantization more effectively is therefore a difficult point of current research.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a neural network quantization method and device suitable for an FPGA (field-programmable gate array), so as to improve model accuracy and increase the flexibility of the weight representation.
In order to achieve the purpose, the invention adopts the following technical scheme:
a neural network quantization method suitable for an FPGA comprises the following steps:
s1, carrying out a first shift quantization on the weight parameters of each layer of the neural network model in the forward propagation process;
s2, grouping the parameters according to the quantization error of the first shift quantization of the weight parameters to obtain quantized weight parameter values;
s3, calculating a convolution result of each layer according to the quantized weight parameter values;
and S4, updating the hyper-parameters in the back propagation process until the training reaches the preset precision or Epoch.
Further, the grouping processing of the parameters according to the quantization error of the first shift quantization of the weight parameters in step S2 specifically includes:
configuring a threshold parameter capable of being trained for the weight parameter of each layer of the network, and determining the boundary of weight parameter division; and grouping the weight parameters according to the magnitude relation between the quantization error of the first shift quantization and the threshold parameter.
Further, the grouping policy in step S2 specifically includes:
if the quantization error of the first shift quantization is larger than the threshold parameter, the weight parameter is assigned to a first group, a second shift quantization is performed on the quantization error, and the weight parameter is finally expressed as the sum of two terms;
if the quantization error of the first shift quantization is smaller than the threshold parameter, the weight parameter is assigned to a second group, the result of the first shift quantization is kept, and the weight parameter is finally expressed as a single term;
if the weight parameter obtained by the first shift quantization is zero, the weight parameter is assigned to a third group, and no additional processing is performed.
Further, updating the hyper-parameters in the back propagation process in step S4 specifically comprises: calculating the gradients of the hyper-parameters, and back-propagating the gradients to update the hyper-parameters. The hyper-parameters in step S4 specifically comprise: the gradients of the weight parameters, the gradients of the threshold parameters, and the trainable parameters of other layers.
An apparatus for quantizing a neural network suitable for an FPGA, comprising:
a quantization module: used for carrying out a first shift quantization on the weight parameters of each layer in the neural network model according to the required target bit width;
a weight grouping module: used for grouping the weight parameters according to the weight parameters and the quantization error of the first shift quantization calculated by the quantization module, and for calculating the complete quantized weight parameters;
a retraining module: used for retraining and iterating, according to the results calculated by the quantization module and the weight grouping module, until the training reaches a preset precision or Epoch.
Compared with the prior art, the invention has the beneficial effects that:
according to the neural network quantization method and device suitable for the FPGA, the full-precision weight is converted into the sum of one or two power square values of 2 through weight grouping and retraining, on one hand, the multiplication operation is converted into the shift operation, so that the calculation cost and the deployment difficulty on hardware equipment such as the FPGA are reduced, on the other hand, the problem of rapid precision saturation in the shift quantization is improved, the quantization precision is improved, and the flexibility of weight representation is increased.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a neural network quantization method suitable for an FPGA according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a neural network quantization apparatus suitable for an FPGA according to an embodiment of the present invention.
Detailed Description
The advantages and features of the present invention will be readily understood by those skilled in the art, and the scope of the present invention will be more clearly defined by the following detailed description of the preferred embodiments of the present invention taken in conjunction with the accompanying drawings.
As shown in fig. 1, a neural network quantization method suitable for an FPGA includes the following steps:
s1, carrying out the first shift quantization on the weight parameters of each layer of the neural network model in the forward propagation process.
In the present embodiment, the weight parameters are shift quantized according to equations (1) to (3).
P=round(log2(|W(i)|)); (1)
P=clip(P, P_min, P_max); (2)
W_q(i)=sign(W(i))·2^P; (3)
Wherein b represents the bit width of the weight quantization, W(i) represents the original weight parameter values of the i-th layer, P represents the shift value of the weight parameters, P_min and P_max denote the smallest and largest shift values representable with the bit width b, W_q(i) represents the quantized weight parameter values of the i-th layer, and clip(·) represents the clipping function.
In one possible implementation, the target bit width of weight quantization is 4 bits, the original weight parameter values of a layer are (0.4, 0.3, 0.2), and the weight parameter values after the first shift quantization are (0.5, 0.25, 0.25), so that the quantization error is (-0.1, 0.05, -0.05).
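By way of illustration only, a minimal Python/NumPy sketch of this first shift quantization is given below; the function name shift_quantize and the clipping range for the shift value are assumptions for illustration and are not specified by the patent.

```python
import numpy as np

def shift_quantize(w, bits=4):
    """Round each weight to the nearest signed power of 2 (first shift quantization).

    The shift value is clipped to a range representable with the target bit width;
    the exact range used by the patent is not stated, so the bounds below are
    illustrative assumptions.
    """
    sign = np.sign(w)
    mag = np.abs(w)
    # Avoid log2(0); weights that are exactly zero stay zero.
    shift = np.round(np.log2(np.where(mag > 0, mag, 1.0)))
    shift = np.clip(shift, -(2 ** (bits - 1)) + 1, 0)   # assumed clipping range
    return np.where(mag > 0, sign * 2.0 ** shift, 0.0)

w = np.array([0.4, 0.3, 0.2])
wq = shift_quantize(w, bits=4)
print(wq)        # [0.5  0.25 0.25]
print(w - wq)    # [-0.1   0.05 -0.05]  -> quantization error of the worked example
```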
And S2, grouping the parameters according to the quantization error of the first shift quantization of the weight parameters to obtain quantized weight parameter values.
In the present embodiment, a trainable threshold parameter is set as the criterion for the weight grouping, and the initial value of the threshold parameter is 0. The weights are grouped according to the magnitude relation between the quantization error of the first shift quantization of the weight parameters and the threshold parameter.
The grouping strategy comprises the following steps:
If the first shift quantization error of a weight parameter is larger than the threshold parameter, the weight parameter is assigned to the first group, the quantization error undergoes a second shift quantization, and the final quantized weight parameter value is expressed as the sum of the two shift quantization results.
If the first shift quantization error of a weight parameter is smaller than the threshold parameter, the weight parameter is assigned to the second group, no second shift quantization is performed on the quantization error, and the first shift quantization result is used as the quantized weight parameter value.
If the weight parameter obtained by the first shift quantization is zero, it is assigned to the third group and, without any further processing, zero is used as the quantized weight parameter value.
In this embodiment, the weight parameter values after the grouping processing can be represented by the following formulas:
R(i)=W(i)-Quant(W(i)); (4)
T=1 if |R(i)|>t, and T=0 otherwise (applied element by element); (5)
W_q(i)=Quant(W(i))+Quant(R(i)⊙T). (6)
wherein R(i) represents the quantization error of the i-th layer weight parameters, t represents the threshold parameter of the shift quantization, T is the binary mask matrix determined by the relationship between the quantization error and the threshold, and ⊙ denotes element-by-element multiplication.
In one possible implementation, the target bit width of the weight quantization is 4 bits, the threshold parameter is 0.1, the full-precision weight parameters are (0.4, 0.3, 0.2), the weight parameters after the first shift quantization are (0.5, 0.25, 0.25), and the quantization error is (-0.1, 0.05, -0.05); after the grouping processing, the quantized weight parameter values are (0.375, 0.25, 0.25), and the final quantization error is reduced to (0.025, 0.05, -0.05).
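Continuing the example, a minimal sketch of the grouping and second shift quantization is given below; it reuses the shift_quantize helper from the previous sketch, and the comparison includes a small tolerance so that the boundary case |-0.1| = 0.1 falls into the first group as in the worked example (the patent text itself says "larger than the threshold", so this inclusive reading is an interpretation).

```python
import numpy as np

def group_and_requantize(w, bits=4, threshold=0.1):
    """Group weights by the first-pass quantization error and re-quantize large errors.

    Uses the shift_quantize() helper defined in the earlier sketch.
    """
    wq1 = shift_quantize(w, bits)                           # first shift quantization
    err = w - wq1                                           # R(i) = W(i) - Quant(W(i))
    # Binary mask T: first group = large error and non-zero first-pass result.
    # Small tolerance so the boundary case |-0.1| = 0.1 survives floating-point rounding.
    mask = (np.abs(err) >= threshold - 1e-9) & (wq1 != 0)
    wq2 = shift_quantize(err * mask, bits)                  # second shift quantization
    return wq1 + wq2                                        # W_q(i) = Quant(W(i)) + Quant(R(i) ⊙ T)

w = np.array([0.4, 0.3, 0.2])
wq = group_and_requantize(w)
print(wq)        # [0.375 0.25  0.25 ]
print(w - wq)    # [ 0.025  0.05  -0.05]
```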
S3, calculating a convolution result of each layer according to the quantized weight parameter values;
the quantized weight parameter value can participate in calculation through a formula (7), and the hardware equipment can complete convolution calculation only by carrying out shift operation through a lookup table, so that the calculation cost of the neural network model is effectively reduced, and the storage cost of the model is also reduced by converting floating point numbers into shift values for storage, so that the method is a hardware-friendly mode.
n·2^p=n<<p when p≥0, and n·2^p=n>>(-p) when p<0; (7)
Where n and p represent any two operands involved in the operation.
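As a small illustration of equation (7) (not taken from the patent), the following check shows the multiplication-to-shift correspondence for integer operands:

```python
# For integer operands, multiplying by a power of 2 is just a bit shift.
n, p = 13, 3
assert n * (1 << p) == n << p    # n * 2^p   -> left shift by p bits
assert n // (1 << p) == n >> p   # n * 2^-p  -> right shift by p bits (integer part)
print(n << p, n >> p)            # 104 1
```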
And S4, updating the hyper-parameters in the back propagation process until the training reaches the preset precision or Epoch.
In this embodiment, SGD is used as the optimizer to iteratively optimize the hyper-parameters. The hyper-parameters involved include the weight parameters, the threshold parameters, and the trainable parameters of other layers. The initial learning rate is 0.02, the weight decay is 1×10^-4, and the momentum is 0.9. The target epoch of the iteration is 200, and the learning rate is reduced to 10% of its previous value at the 80th and 120th epochs.
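A minimal PyTorch-style sketch of this retraining configuration is given below; the tiny model and the dummy batch are placeholders, and MultiStepLR is used here as one possible way to realize the schedule described above (learning rate multiplied by 0.1 at epochs 80 and 120):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model and data; in practice this is the quantization-aware CNN and its dataset.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
train_loader = [(torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,)))]  # dummy batch

optimizer = torch.optim.SGD(model.parameters(), lr=0.02,
                            momentum=0.9, weight_decay=1e-4)
# Learning rate drops to 10% of its previous value at epochs 80 and 120.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(200):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()          # gradients flow through the estimator described below
        optimizer.step()
    scheduler.step()
```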
The specific steps of back propagation comprise calculating the gradient of the hyper-parameter and updating the hyper-parameter by back propagation of the gradient.
The gradient of the weight parameter is calculated by equation (8).
∂L/∂W=(∂L/∂Y)·(∂Y/∂W_q)·(∂W_q/∂W); (8)
Wherein L represents the loss function, Y represents the actual output value of the model, W represents the original weight matrix of the model, and W_q represents the quantized weight matrix.
As can be seen from equation (1), the calculation of the quantized weight parameters involves a rounding operation on the full-precision weights, so that ∂W_q/∂W is 0 everywhere except at the discrete points where W is exactly a power of 2, and the gradient of the weights therefore cannot be back-propagated normally.
In one possible implementation, an approximation of the gradient is obtained by using a straight-through estimator (STE): let ∂round(x)/∂x=1, i.e., y=round(x) is approximated by y=x when taking the derivative.
Then, the gradient of the weight parameter is reduced to the form of equation (9).
∂L/∂W=(∂L/∂Y)·(∂Y/∂W_q)=∂L/∂W_q; (9)
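One common way to realize such a straight-through estimator in PyTorch is to add the quantization residual inside detach(), so that the forward pass uses the quantized weights while the backward pass sees the identity; the sketch below is a generic illustration and its helper names are not from the patent:

```python
import torch

def ste_quantize(w, quantize_fn):
    """Forward: quantized weights; backward: gradient passes straight through to w."""
    wq = quantize_fn(w)
    return w + (wq - w).detach()

def pow2_round(w):
    """Round to the nearest signed power of 2 (illustrative quantizer)."""
    shift = torch.round(torch.log2(w.abs().clamp_min(1e-8)))
    return torch.sign(w) * torch.pow(2.0, shift)

w = torch.tensor([0.4, 0.3, 0.2], requires_grad=True)
wq = ste_quantize(w, pow2_round)
wq.sum().backward()
print(wq)       # tensor([0.5000, 0.2500, 0.2500], grad_fn=...)
print(w.grad)   # tensor([1., 1., 1.])  -> d(W_q)/d(W) ≈ 1 under the STE
```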
In one possible implementation, each gradient element is adaptively scaled up or down by combining the gradient of the weights with the quantization error and its direction of change by equation (10).
g_xn=g_xq·(1+δ·sign(g_xq)·(x_n-x_q)); (10)
Wherein, g xn And g xq Is the loss function pair x n And x q And δ is a scaling factor greater than or equal to 0.
It should be noted that, in the present embodiment, the influence of the second shift quantization on the gradient is negligible. Then, the gradient of the weight parameter can be calculated by equation (11).
∂L/∂W=(∂L/∂W_q)·(1+δ·sign(∂L/∂W_q)·(W-W_q)); (11)
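By way of illustration, a minimal element-wise sketch of this scaling rule is given below, assuming the form reconstructed above for equations (10) and (11); the function name and the value of δ are illustrative assumptions:

```python
import torch

def scale_gradient(grad_wq, w, wq, delta=0.1):
    """Scale each gradient element up or down using the quantization error (W - W_q)
    and the sign of the gradient, following the assumed form of equations (10)-(11)."""
    return grad_wq * (1.0 + delta * torch.sign(grad_wq) * (w - wq))

w  = torch.tensor([0.4, 0.3, 0.2])
wq = torch.tensor([0.5, 0.25, 0.25])
g  = torch.tensor([0.2, -0.1, 0.3])     # gradient with respect to the quantized weights
print(scale_gradient(g, w, wq))         # element-wise scaled gradient with respect to W
```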
The gradient of the threshold parameter is calculated by equation (12).
∂L/∂t=(∂L/∂W_q)·(∂W_q/∂t); (12)
As can be seen from equation (5), T is an indicator function of t and is not differentiable everywhere. In the present embodiment, equation (12) is computed by equation (13).
[Equation (13), given as an image in the original document, specifies the approximation used to evaluate equation (12).]
As shown in fig. 2, an embodiment of the present invention further provides a neural network quantization apparatus suitable for an FPGA, including:
the quantization module 21: and the method is used for carrying out first shift quantization on the weight parameter of each layer in the neural network model according to the required target bit width.
The weight grouping module 22: used for grouping the weight parameters according to the weight parameters and the quantization error of the first shift quantization calculated by the quantization module, and for calculating the complete quantized weight parameters.
The retraining module 23: used for retraining and iterating, according to the results calculated by the quantization module and the weight grouping module, until the training reaches a preset precision or Epoch.
According to the embodiment of the invention, through weight grouping and retraining, the full-precision weights are converted into the form of one or two power-of-2 values. On one hand, multiplication operations are converted into shift operations, which reduces the computation cost and deployment difficulty on hardware devices such as FPGAs (field-programmable gate arrays); on the other hand, the problem of rapid precision saturation in shift quantization is alleviated, the quantization precision is improved, and the flexibility of the weight representation is increased.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (6)

1. A neural network quantization method suitable for an FPGA, characterized by comprising the following steps:
s1, carrying out a first shift quantization on the weight parameters of each layer of the neural network model in the forward propagation process;
s2, grouping the parameters according to the quantization error of the first shift quantization of the weight parameters to obtain quantized weight parameter values;
s3, calculating a convolution result of each layer according to the quantized weight parameter values;
and S4, updating the hyper-parameters in the back propagation process until the training reaches the preset precision or Epoch.
2. The method according to claim 1, wherein the grouping processing of the weight parameters according to the quantization error of the first shift quantization of the weight parameters in step S2 specifically comprises:
configuring a threshold parameter capable of being trained for the weight parameter of each layer of the network, and determining the boundary of weight parameter division; and grouping the weight parameters according to the magnitude relation between the quantization error of the first shift quantization and the threshold parameter.
3. The method for quantizing a neural network suitable for an FPGA according to claim 2, wherein the grouping policy in step S2 specifically includes:
if the quantization error of the first shift quantization is larger than the threshold parameter, the weight parameter is assigned to a first group, a second shift quantization is performed on the quantization error, and the weight parameter is finally expressed as the sum of two terms;
if the quantization error of the first shift quantization is smaller than the threshold parameter, the weight parameter is assigned to a second group, the result of the first shift quantization is kept, and the weight parameter is finally expressed as a single term;
if the weight parameter obtained by the first shift quantization is zero, the weight parameter is assigned to a third group, and no additional processing is performed.
4. The method for quantizing a neural network suitable for an FPGA according to claim 1, wherein the updating of the hyper-parameters in the back propagation process in step S4 specifically includes: calculating the gradient of the hyper-parameter, and reversely propagating the gradient to update the hyper-parameter.
5. The method for quantizing a neural network suitable for an FPGA according to claim 1, wherein the hyper-parameters in step S4 specifically comprise: the gradients of the weight parameters, the gradients of the threshold parameters, and the trainable parameters of other layers.
6. An apparatus for quantizing a neural network suitable for an FPGA, comprising:
a quantization module: used for carrying out a first shift quantization on the weight parameters of each layer in the neural network model according to the required target bit width;
a weight grouping module: used for grouping the weight parameters according to the weight parameters and the quantization error of the first shift quantization calculated by the quantization module, and for calculating the complete quantized weight parameters;
a retraining module: used for retraining and iterating, according to the results calculated by the quantization module and the weight grouping module, until the training reaches a preset precision or Epoch.
CN202211456706.XA 2022-11-21 2022-11-21 Neural network quantization method and device suitable for FPGA Pending CN115860062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211456706.XA CN115860062A (en) 2022-11-21 2022-11-21 Neural network quantization method and device suitable for FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211456706.XA CN115860062A (en) 2022-11-21 2022-11-21 Neural network quantization method and device suitable for FPGA

Publications (1)

Publication Number Publication Date
CN115860062A true CN115860062A (en) 2023-03-28

Family

ID=85664389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211456706.XA Pending CN115860062A (en) 2022-11-21 2022-11-21 Neural network quantization method and device suitable for FPGA

Country Status (1)

Country Link
CN (1) CN115860062A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468079A (en) * 2023-04-13 2023-07-21 上海处理器技术创新中心 Method for training deep neural network model and related product


Similar Documents

Publication Publication Date Title
US11270187B2 (en) Method and apparatus for learning low-precision neural network that combines weight quantization and activation quantization
CN107688849B (en) Dynamic strategy fixed-point training method and device
CN107679618B (en) Static strategy fixed-point training method and device
US11449729B2 (en) Efficient convolutional neural networks
CN108491926B (en) Low-bit efficient depth convolution neural network hardware accelerated design method, module and system based on logarithmic quantization
WO2021208186A1 (en) Block floating point-based fpga implementation apparatus and method for fblms algorithm
CN111985523A (en) Knowledge distillation training-based 2-exponential power deep neural network quantification method
CN113011571B (en) INT8 offline quantization and integer inference method based on Transformer model
TWI744724B (en) Method of processing convolution neural network
CN112508125A (en) Efficient full-integer quantization method of image detection model
US11544526B2 (en) Computing device and method
WO2023011002A1 (en) Overflow-aware quantization model training method and apparatus, medium and terminal device
US11341400B1 (en) Systems and methods for high-throughput computations in a deep neural network
US20210294874A1 (en) Quantization method based on hardware of in-memory computing and system thereof
Choi et al. Retrain-less weight quantization for multiplier-less convolutional neural networks
CN115860062A (en) Neural network quantization method and device suitable for FPGA
Bao et al. LSFQ: A low precision full integer quantization for high-performance FPGA-based CNN acceleration
CN115238893A (en) Neural network model quantification method and device for natural language processing
CN114756517A (en) Visual Transformer compression method and system based on micro-quantization training
Jiang et al. A low-latency LSTM accelerator using balanced sparsity based on FPGA
CN111882050B (en) Design method for improving BCPNN speed based on FPGA
CN116187416A (en) Iterative retraining method based on layer pruning sensitivity and image processor
CN115965062A (en) FPGA (field programmable Gate array) acceleration method for BERT (binary offset Transmission) middle-layer normalized nonlinear function
CN113918882A (en) Data processing acceleration method of dynamic sparse attention mechanism capable of being realized by hardware
CN112561036A (en) HE-LSTM network structure and corresponding FPGA hardware accelerator thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination