CN115860062A - Neural network quantization method and device suitable for FPGA - Google Patents

Neural network quantization method and device suitable for FPGA

Info

Publication number
CN115860062A
CN115860062A CN202211456706.XA
Authority
CN
China
Prior art keywords
quantization
weight
parameters
shift
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211456706.XA
Other languages
Chinese (zh)
Inventor
吕文浩
支小莉
童维勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202211456706.XA priority Critical patent/CN115860062A/en
Publication of CN115860062A publication Critical patent/CN115860062A/en
Pending legal-status Critical Current

Abstract

The invention discloses a neural network quantization method and device suitable for an FPGA. The method comprises the following steps: in the forward propagation process, a first shift quantization is performed on the weight parameters of each layer of the neural network model; the weight parameters are grouped according to the quantization error of the first shift quantization to obtain quantized weight parameter values; the convolution result of each layer is calculated from the quantized weight parameter values; and the hyper-parameters are updated in the back propagation process until training reaches a predetermined precision or Epoch. The device comprises a quantization module, a weight grouping module and a retraining module. The invention converts multiplication operations into shift operations, which reduces the computation cost and the deployment difficulty on hardware devices such as FPGAs, effectively improves the quantization precision, and increases the flexibility of the weight representation.

Description

Neural network quantization method and device suitable for FPGA
Technical Field
The invention relates to the technical field of neural networks, in particular to a neural network quantization method and device suitable for an FPGA.
Background
In recent years, Convolutional Neural Network (CNN) models have been widely used in object detection, image segmentation, image classification, human pose recognition, and other fields, and have made remarkable progress. Meanwhile, the application range of CNN models has gradually expanded from image-processing services on server clusters to latency-sensitive real-time applications at the edge. Such applications tend to be deployed on low-power, computationally limited edge devices, which conflicts with the compute- and memory-intensive nature of CNN models.
To adapt CNN models to resource-constrained computing environments, model quantization has begun to attract researchers' attention. Quantization techniques reduce the storage and computation costs of a model by reducing the precision of the weights and intermediate computation results. Existing quantization techniques mainly include uniform quantization and non-uniform quantization. Fixed-point quantization is the most common uniform quantization technique; it converts full-precision floating-point numbers into fixed-point numbers. Although this saves computation cost, the arithmetic is still dominated by multiplication, which on hardware devices such as FPGAs (field-programmable gate arrays) is usually implemented with DSPs. DSP resources on an FPGA are relatively scarce, and developers must specifically optimize them to support the large-scale multiplications of a CNN, so the amount of DSP resources is often an important factor limiting performance. In contrast to fixed-point quantization, shift quantization techniques in non-uniform quantization convert multiplications into shift operations by converting full-precision floating-point numbers to power-of-2 form, which is typically implemented on FPGAs with the more plentiful look-up tables.
However, in the implementation of shift quantization, the non-uniform distribution of power-of-2 values causes problems such as rapid saturation of model accuracy and limited flexibility of the weight representation, so the performance of shift quantization is worse than that of fixed-point quantization. How to improve the performance of shift quantization more effectively is therefore a difficult point of current research.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a neural network quantization method and device suitable for an FPGA (field-programmable gate array), so as to improve model accuracy and increase the flexibility of the weight representation.
In order to achieve the purpose, the invention adopts the following technical scheme:
a neural network quantization method suitable for an FPGA comprises the following steps:
s1, carrying out a first shift quantization on the weight parameters of each layer of the neural network model in the forward propagation process;
s2, grouping the parameters according to the quantization error of the first shift quantization of the weight parameters to obtain quantized weight parameter values;
s3, calculating a convolution result of each layer according to the quantized weight parameter values;
and S4, updating the hyper-parameters in the back propagation process until the training reaches the preset precision or Epoch.
Further, the grouping processing of the parameters according to the quantization error of the first shift quantization of the weight parameters in step S2 specifically includes:
configuring a threshold parameter capable of being trained for the weight parameter of each layer of the network, and determining the boundary of weight parameter division; and grouping the weight parameters according to the magnitude relation between the quantization error of the first shift quantization and the threshold parameter.
Further, the grouping policy in step S2 specifically includes:
if the quantization error of the first shift quantization is larger than the threshold parameter, the weight parameter is assigned to a first group, a second shift quantization is performed on the quantization error, and the weight parameter is finally expressed as the sum of two terms;
if the quantization error of the first shift quantization is smaller than the threshold parameter, the weight parameter is assigned to a second group, the result of the first shift quantization is kept, and the weight parameter is finally expressed as a single term;
if the weight parameter obtained by the first shift quantization is zero, the weight parameter is assigned to a third group, and no additional processing is performed.
Further, updating the hyper-parameters in the back propagation process in step S4 specifically comprises: calculating the gradients of the hyper-parameters, and back-propagating the gradients to update the hyper-parameters. The hyper-parameters in step S4 specifically comprise: the gradients of the weight parameters, the gradients of the threshold parameters, and the trainable parameters of other layers.
An apparatus for quantizing a neural network suitable for an FPGA, comprising:
a quantization module: used for carrying out a first shift quantization on the weight parameters of each layer in the neural network model according to the required target bit width;
a weight grouping module: used for grouping the weight parameters according to the weight parameters and the quantization error of the first shift quantization calculated by the quantization module, and for calculating the complete quantized weight parameters;
a retraining module: used for retraining and iterating, according to the results calculated by the quantization module and the weight grouping module, until the training reaches a preset precision or Epoch.
Compared with the prior art, the invention has the beneficial effects that:
according to the neural network quantization method and device suitable for the FPGA, the full-precision weight is converted into the sum of one or two power square values of 2 through weight grouping and retraining, on one hand, the multiplication operation is converted into the shift operation, so that the calculation cost and the deployment difficulty on hardware equipment such as the FPGA are reduced, on the other hand, the problem of rapid precision saturation in the shift quantization is improved, the quantization precision is improved, and the flexibility of weight representation is increased.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a neural network quantization method suitable for an FPGA according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a neural network quantization apparatus suitable for an FPGA according to an embodiment of the present invention.
Detailed Description
The advantages and features of the present invention will be readily understood by those skilled in the art, and the scope of the present invention will be more clearly defined by the following detailed description of the preferred embodiments of the present invention taken in conjunction with the accompanying drawings.
As shown in fig. 1, a neural network quantization method suitable for an FPGA includes the following steps:
s1, carrying out the first shift quantization on the weight parameters of each layer of the neural network model in the forward propagation process.
In the present embodiment, the weight parameters are shift quantized according to equations (1) to (3).
P=round(log2(|W(i)|)); (1)
P=clip(P, P_min, P_max); (2)
W_q(i)=sign(W(i))·2^P; (3)
Wherein b represents the bit width of the weight quantization, W(i) represents the original weight parameter values of the i-th layer, P represents the shift value of the weight parameters, P_min and P_max denote the smallest and largest shift values representable with the bit width b, W_q(i) represents the quantized weight parameter values of the i-th layer, and clip(·) represents the clipping function.
In one possible implementation, the target bit width of weight quantization is 4 bits, the original weight parameter values of a layer are (0.4, 0.3, 0.2), and the weight parameter values after the first shift quantization are (0.5, 0.25, 0.25), so that the quantization error is (-0.1, 0.05, -0.05).
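By way of illustration only, a minimal Python/NumPy sketch of this first shift quantization is given below; the function name shift_quantize and the clipping range for the shift value are assumptions for illustration and are not specified by the patent.

```python
import numpy as np

def shift_quantize(w, bits=4):
    """Round each weight to the nearest signed power of 2 (first shift quantization).

    The shift value is clipped to a range representable with the target bit width;
    the exact range used by the patent is not stated, so the bounds below are
    illustrative assumptions.
    """
    sign = np.sign(w)
    mag = np.abs(w)
    # Avoid log2(0); weights that are exactly zero stay zero.
    shift = np.round(np.log2(np.where(mag > 0, mag, 1.0)))
    shift = np.clip(shift, -(2 ** (bits - 1)) + 1, 0)   # assumed clipping range
    return np.where(mag > 0, sign * 2.0 ** shift, 0.0)

w = np.array([0.4, 0.3, 0.2])
wq = shift_quantize(w, bits=4)
print(wq)        # [0.5  0.25 0.25]
print(w - wq)    # [-0.1   0.05 -0.05]  -> quantization error of the worked example
```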
And S2, grouping the parameters according to the quantization error of the first shift quantization of the weight parameters to obtain quantized weight parameter values.
In the present embodiment, a trainable threshold parameter is set as the criterion for the weight grouping, and the initial value of the threshold parameter is 0. The weights are grouped according to the magnitude relation between the quantization error of the first shift quantization of the weight parameters and the threshold parameter.
The grouping strategy comprises the following steps:
If the first shift quantization error of a weight parameter is larger than the threshold parameter, the weight parameter is assigned to the first group, the quantization error undergoes a second shift quantization, and the final quantized weight parameter value is expressed as the sum of the two shift quantization results.
If the first shift quantization error of a weight parameter is smaller than the threshold parameter, the weight parameter is assigned to the second group, no second shift quantization is performed on the quantization error, and the first shift quantization result is used as the quantized weight parameter value.
If the weight parameter obtained by the first shift quantization is zero, it is assigned to the third group and, without any further processing, zero is used as the quantized weight parameter value.
In this embodiment, the weight parameter values after the grouping processing can be represented by the following formulas:
R(i)=W(i)-Quant(W(i)); (4)
T=1 if |R(i)|>t, and T=0 otherwise (applied element by element); (5)
W_q(i)=Quant(W(i))+Quant(R(i)⊙T). (6)
wherein R(i) represents the quantization error of the i-th layer weight parameters, t represents the threshold parameter of the shift quantization, T is the binary mask matrix determined by the relationship between the quantization error and the threshold, and ⊙ denotes element-by-element multiplication.
In one possible implementation, the target bit width of the weight quantization is 4 bits, the threshold parameter is 0.1, the full-precision weight parameters are (0.4, 0.3, 0.2), the weight parameters after the first shift quantization are (0.5, 0.25, 0.25), and the quantization error is (-0.1, 0.05, -0.05); after the grouping processing, the quantized weight parameter values are (0.375, 0.25, 0.25), and the final quantization error is reduced to (0.025, 0.05, -0.05).
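Continuing the example, a minimal sketch of the grouping and second shift quantization is given below; it reuses the shift_quantize helper from the previous sketch, and the comparison includes a small tolerance so that the boundary case |-0.1| = 0.1 falls into the first group as in the worked example (the patent text itself says "larger than the threshold", so this inclusive reading is an interpretation).

```python
import numpy as np

def group_and_requantize(w, bits=4, threshold=0.1):
    """Group weights by the first-pass quantization error and re-quantize large errors.

    Uses the shift_quantize() helper defined in the earlier sketch.
    """
    wq1 = shift_quantize(w, bits)                           # first shift quantization
    err = w - wq1                                           # R(i) = W(i) - Quant(W(i))
    # Binary mask T: first group = large error and non-zero first-pass result.
    # Small tolerance so the boundary case |-0.1| = 0.1 survives floating-point rounding.
    mask = (np.abs(err) >= threshold - 1e-9) & (wq1 != 0)
    wq2 = shift_quantize(err * mask, bits)                  # second shift quantization
    return wq1 + wq2                                        # W_q(i) = Quant(W(i)) + Quant(R(i) ⊙ T)

w = np.array([0.4, 0.3, 0.2])
wq = group_and_requantize(w)
print(wq)        # [0.375 0.25  0.25 ]
print(w - wq)    # [ 0.025  0.05  -0.05]
```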
S3, calculating a convolution result of each layer according to the quantized weight parameter values;
the quantized weight parameter value can participate in calculation through a formula (7), and the hardware equipment can complete convolution calculation only by carrying out shift operation through a lookup table, so that the calculation cost of the neural network model is effectively reduced, and the storage cost of the model is also reduced by converting floating point numbers into shift values for storage, so that the method is a hardware-friendly mode.
n·2^p=n<<p when p≥0, and n·2^p=n>>(-p) when p<0; (7)
Where n and p represent any two operands involved in the operation.
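As a small illustration of equation (7) (not taken from the patent), the following check shows the multiplication-to-shift correspondence for integer operands:

```python
# For integer operands, multiplying by a power of 2 is just a bit shift.
n, p = 13, 3
assert n * (1 << p) == n << p    # n * 2^p   -> left shift by p bits
assert n // (1 << p) == n >> p   # n * 2^-p  -> right shift by p bits (integer part)
print(n << p, n >> p)            # 104 1
```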
And S4, updating the hyper-parameters in the back propagation process until the training reaches the preset precision or Epoch.
In this embodiment, SGD is used as the optimizer to iteratively optimize the hyper-parameters. The hyper-parameters involved include the weight parameters, the threshold parameters, and the trainable parameters of other layers. The initial learning rate is 0.02, the weight decay is 1×10^-4, and the momentum is 0.9. The target epoch of the iteration is 200, and the learning rate is reduced to 10% of its previous value at the 80th and 120th epochs.
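A minimal PyTorch-style sketch of this retraining configuration is given below; the tiny model and the dummy batch are placeholders, and MultiStepLR is used here as one possible way to realize the schedule described above (learning rate multiplied by 0.1 at epochs 80 and 120):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model and data; in practice this is the quantization-aware CNN and its dataset.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
train_loader = [(torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,)))]  # dummy batch

optimizer = torch.optim.SGD(model.parameters(), lr=0.02,
                            momentum=0.9, weight_decay=1e-4)
# Learning rate drops to 10% of its previous value at epochs 80 and 120.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(200):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()          # gradients flow through the estimator described below
        optimizer.step()
    scheduler.step()
```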
The specific steps of back propagation comprise calculating the gradient of the hyper-parameter and updating the hyper-parameter by back propagation of the gradient.
The gradient of the weight parameter is calculated by equation (8).
∂L/∂W=(∂L/∂Y)·(∂Y/∂W_q)·(∂W_q/∂W); (8)
Wherein L represents the loss function, Y represents the actual output value of the model, W represents the original weight matrix of the model, and W_q represents the quantized weight matrix.
As can be seen from equation (1), the calculation of the quantized weight parameters involves a rounding operation on the full-precision weights, so that ∂W_q/∂W is 0 everywhere except at the discrete points where W is exactly a power of 2, and the gradient of the weights therefore cannot be back-propagated normally.
In one possible implementation, an approximation of the gradient is obtained by using a straight-through estimator (STE): let ∂round(x)/∂x=1, i.e., y=round(x) is approximated by y=x when taking the derivative.
Then, the gradient of the weight parameter is reduced to the form of equation (9).
∂L/∂W=(∂L/∂Y)·(∂Y/∂W_q)=∂L/∂W_q; (9)
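One common way to realize such a straight-through estimator in PyTorch is to add the quantization residual inside detach(), so that the forward pass uses the quantized weights while the backward pass sees the identity; the sketch below is a generic illustration and its helper names are not from the patent:

```python
import torch

def ste_quantize(w, quantize_fn):
    """Forward: quantized weights; backward: gradient passes straight through to w."""
    wq = quantize_fn(w)
    return w + (wq - w).detach()

def pow2_round(w):
    """Round to the nearest signed power of 2 (illustrative quantizer)."""
    shift = torch.round(torch.log2(w.abs().clamp_min(1e-8)))
    return torch.sign(w) * torch.pow(2.0, shift)

w = torch.tensor([0.4, 0.3, 0.2], requires_grad=True)
wq = ste_quantize(w, pow2_round)
wq.sum().backward()
print(wq)       # tensor([0.5000, 0.2500, 0.2500], grad_fn=...)
print(w.grad)   # tensor([1., 1., 1.])  -> d(W_q)/d(W) ≈ 1 under the STE
```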
In one possible implementation, each gradient element is adaptively scaled up or down by combining the gradient of the weights with the quantization error and its direction of change by equation (10).
g_xn=g_xq·(1+δ·sign(g_xq)·(x_n-x_q)); (10)
Wherein, g xn And g xq Is the loss function pair x n And x q And δ is a scaling factor greater than or equal to 0.
It should be noted that, in the present embodiment, the influence of the second shift quantization on the gradient is negligible. Then, the gradient of the weight parameter can be calculated by equation (11).
∂L/∂W=(∂L/∂W_q)·(1+δ·sign(∂L/∂W_q)·(W-W_q)); (11)
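By way of illustration, a minimal element-wise sketch of this scaling rule is given below, assuming the form reconstructed above for equations (10) and (11); the function name and the value of δ are illustrative assumptions:

```python
import torch

def scale_gradient(grad_wq, w, wq, delta=0.1):
    """Scale each gradient element up or down using the quantization error (W - W_q)
    and the sign of the gradient, following the assumed form of equations (10)-(11)."""
    return grad_wq * (1.0 + delta * torch.sign(grad_wq) * (w - wq))

w  = torch.tensor([0.4, 0.3, 0.2])
wq = torch.tensor([0.5, 0.25, 0.25])
g  = torch.tensor([0.2, -0.1, 0.3])     # gradient with respect to the quantized weights
print(scale_gradient(g, w, wq))         # element-wise scaled gradient with respect to W
```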
The gradient of the threshold parameter is calculated by equation (12).
∂L/∂t=(∂L/∂W_q)·(∂W_q/∂t); (12)
As can be seen from equation (5), T is an indicator function of t and is not differentiable everywhere. In the present embodiment, equation (12) is computed by equation (13).
[Equation (13), given as an image in the original document, specifies the approximation used to evaluate equation (12).]
As shown in fig. 2, an embodiment of the present invention further provides a neural network quantization apparatus suitable for an FPGA, including:
the quantization module 21: and the method is used for carrying out first shift quantization on the weight parameter of each layer in the neural network model according to the required target bit width.
The weight grouping module 22: used for grouping the weight parameters according to the weight parameters and the quantization error of the first shift quantization calculated by the quantization module, and for calculating the complete quantized weight parameters.
The retraining module 23: used for retraining and iterating, according to the results calculated by the quantization module and the weight grouping module, until the training reaches a preset precision or Epoch.
According to the embodiment of the invention, through weight grouping and retraining, the full-precision weights are converted into the form of one or two power-of-2 values. On one hand, multiplication operations are converted into shift operations, which reduces the computation cost and deployment difficulty on hardware devices such as FPGAs (field-programmable gate arrays); on the other hand, the problem of rapid precision saturation in shift quantization is alleviated, the quantization precision is improved, and the flexibility of the weight representation is increased.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (6)

1. A neural network quantization method suitable for an FPGA, characterized by comprising the following steps:
s1, carrying out a first shift quantization on the weight parameters of each layer of the neural network model in the forward propagation process;
s2, grouping the parameters according to the quantization error of the first shift quantization of the weight parameters to obtain quantized weight parameter values;
s3, calculating a convolution result of each layer according to the quantized weight parameter values;
and S4, updating the hyper-parameters in the back propagation process until the training reaches the preset precision or Epoch.
2. The method according to claim 1, wherein the grouping processing of the weight parameters according to the quantization error of the first shift quantization of the weight parameters in step S2 specifically comprises:
configuring a threshold parameter capable of being trained for the weight parameter of each layer of the network, and determining the boundary of weight parameter division; and grouping the weight parameters according to the magnitude relation between the quantization error of the first shift quantization and the threshold parameter.
3. The method for quantizing a neural network suitable for an FPGA according to claim 2, wherein the grouping policy in step S2 specifically includes:
if the quantization error of the first shift quantization is larger than the threshold parameter, the weight parameter is assigned to a first group, a second shift quantization is performed on the quantization error, and the weight parameter is finally expressed as the sum of two terms;
if the quantization error of the first shift quantization is smaller than the threshold parameter, the weight parameter is assigned to a second group, the result of the first shift quantization is kept, and the weight parameter is finally expressed as a single term;
if the weight parameter obtained by the first shift quantization is zero, the weight parameter is assigned to a third group, and no additional processing is performed.
4. The method for quantizing a neural network suitable for an FPGA according to claim 1, wherein the updating of the hyper-parameters in the back propagation process in step S4 specifically includes: calculating the gradient of the hyper-parameter, and reversely propagating the gradient to update the hyper-parameter.
5. The method for quantizing a neural network suitable for an FPGA according to claim 1, wherein the hyper-parameters in step S4 specifically comprise: the gradients of the weight parameters, the gradients of the threshold parameters, and the trainable parameters of other layers.
6. An apparatus for quantizing a neural network suitable for an FPGA, comprising:
a quantization module: used for carrying out a first shift quantization on the weight parameters of each layer in the neural network model according to the required target bit width;
a weight grouping module: used for grouping the weight parameters according to the weight parameters and the quantization error of the first shift quantization calculated by the quantization module, and for calculating the complete quantized weight parameters;
a retraining module: used for retraining and iterating, according to the results calculated by the quantization module and the weight grouping module, until the training reaches a preset precision or Epoch.
CN202211456706.XA 2022-11-21 2022-11-21 Neural network quantization method and device suitable for FPGA Pending CN115860062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211456706.XA CN115860062A (en) 2022-11-21 2022-11-21 Neural network quantization method and device suitable for FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211456706.XA CN115860062A (en) 2022-11-21 2022-11-21 Neural network quantization method and device suitable for FPGA

Publications (1)

Publication Number Publication Date
CN115860062A true CN115860062A (en) 2023-03-28

Family

ID=85664389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211456706.XA Pending CN115860062A (en) 2022-11-21 2022-11-21 Neural network quantization method and device suitable for FPGA

Country Status (1)

Country Link
CN (1) CN115860062A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468079A (en) * 2023-04-13 2023-07-21 上海处理器技术创新中心 Method for training deep neural network model and related product


Similar Documents

Publication Publication Date Title
US11270187B2 (en) Method and apparatus for learning low-precision neural network that combines weight quantization and activation quantization
CN107688849B (en) Dynamic strategy fixed-point training method and device
CN107679618B (en) Static strategy fixed-point training method and device
US11449729B2 (en) Efficient convolutional neural networks
CN108491926B (en) Low-bit efficient depth convolution neural network hardware accelerated design method, module and system based on logarithmic quantization
WO2021208186A1 (en) Block floating point-based fpga implementation apparatus and method for fblms algorithm
CN111985523A (en) Knowledge distillation training-based 2-exponential power deep neural network quantification method
CN113011571B (en) INT8 offline quantization and integer inference method based on Transformer model
TWI744724B (en) Method of processing convolution neural network
CN112508125A (en) Efficient full-integer quantization method of image detection model
US11544526B2 (en) Computing device and method
WO2023011002A1 (en) Overflow-aware quantization model training method and apparatus, medium and terminal device
US11341400B1 (en) Systems and methods for high-throughput computations in a deep neural network
US20210294874A1 (en) Quantization method based on hardware of in-memory computing and system thereof
Choi et al. Retrain-less weight quantization for multiplier-less convolutional neural networks
CN115860062A (en) Neural network quantization method and device suitable for FPGA
Bao et al. LSFQ: A low precision full integer quantization for high-performance FPGA-based CNN acceleration
CN115238893A (en) Neural network model quantification method and device for natural language processing
CN114756517A (en) Visual Transformer compression method and system based on micro-quantization training
Jiang et al. A low-latency LSTM accelerator using balanced sparsity based on FPGA
CN111882050B (en) Design method for improving BCPNN speed based on FPGA
CN116187416A (en) Iterative retraining method based on layer pruning sensitivity and image processor
CN115965062A (en) FPGA (field programmable Gate array) acceleration method for BERT (binary offset Transmission) middle-layer normalized nonlinear function
CN113918882A (en) Data processing acceleration method of dynamic sparse attention mechanism capable of being realized by hardware
CN112561036A (en) HE-LSTM network structure and corresponding FPGA hardware accelerator thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination