CN112381205A - Neural network low bit quantization method - Google Patents

Neural network low bit quantization method

Info

Publication number
CN112381205A
CN112381205A (Application CN202011057930.2A)
Authority
CN
China
Prior art keywords
quantization
input
neural network
value
scaling factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011057930.2A
Other languages
Chinese (zh)
Inventor
Zhang Shurui
OuYang Peng
Yin Shouyi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingwei Intelligent Technology Co ltd
Original Assignee
Beijing Qingwei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingwei Intelligent Technology Co ltd filed Critical Beijing Qingwei Intelligent Technology Co ltd
Priority to CN202011057930.2A priority Critical patent/CN112381205A/en
Publication of CN112381205A publication Critical patent/CN112381205A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a neural network low-bit quantization method, which quantizes the weight values of each channel of the neural network to low-bit fixed-point weights. The input quantization coefficient of the current convolutional layer is obtained from the target quantization threshold and the floating-point quantization interval. The input quantization coefficient of the next convolutional layer is taken as the output quantization coefficient of the current convolutional layer. Input floating-point data is quantized to obtain input fixed-point data, and output floating-point data is quantized to obtain output fixed-point data. The scaling factor and the bias are converted into a scaling-factor fixed-point value and a bias fixed-point value, respectively. The quantized neural network is obtained from the low-bit fixed-point weights, the input fixed-point data, the output fixed-point data, the scaling-factor fixed-point value and the bias fixed-point value, and the quantized neural network model can be deployed on embedded devices.

Description

Neural network low bit quantization method
Technical Field
The invention relates to the field of data compression, in particular to a low bit quantization method of a neural network.
Background
Neural network technology performs well on tasks including image classification, object detection and natural language processing. To improve recognition accuracy, neural network models keep growing in scale and complexity.
This places ever higher demands on the computing performance of devices, and growing network size and power consumption have gradually become the main obstacles to applying neural networks. Ever-larger networks require more and more memory to run, and larger models also demand greater bandwidth, which greatly limits the application of neural networks in embedded devices.
Disclosure of Invention
The invention aims to provide a neural network low-bit quantization method so that neural network models can be applied in embedded devices.
To achieve this purpose, the technical solution is as follows: a neural network low-bit quantization method comprises the following steps.
S101: obtaining an initial neural network, and obtaining weight values and offsets of C channels in the initial neural network, wherein each channel comprises a convolutional layer.
S102: and counting the maximum value of the input of the C channels in the initial neural network.
S103: and obtaining the scaling factor according to the input maximum value in the C channels.
S104: and quantizing the weight value of each channel to be lower than the specific point weight according to the scaling factor and the weight value of each channel.
S105: and inputting set data to the initial neural network for forward calculation.
S106: and counting the maximum value of the input absolute value of the current convolution layer in the initial neural network, and acquiring a floating point quantization interval according to the maximum value of the input absolute value of the current convolution layer.
S107: repeating S105, and constructing a histogram with a first set length by using the input floating point value of the current convolutional layer based on the floating point quantization interval.
S108: and circularly traversing the set quantization threshold values in the floating point quantization interval, converting the histogram with the first set length into the histogram with the second set length for each quantization threshold value, calculating KL divergence, and taking the quantization threshold value with the minimum KL divergence as a target quantization threshold value.
S109: and acquiring the input quantization coefficient of the current convolutional layer according to the target quantization threshold and the floating point quantization interval.
S110: and taking the next convolutional layer connected after the current convolutional layer as the current convolutional layer, repeating S105 to 109 to obtain the input quantized coefficient of the next convolutional layer, and taking the input quantized coefficient of the next convolutional layer as the output quantized coefficient of the current convolutional layer.
S111: and quantizing the input floating point data to obtain input fixed point data according to the input quantization coefficient of the current convolution layer and the input floating point data of the current convolution layer.
S112: and quantizing the output floating point data to obtain output fixed point data according to the scaling factor, the input quantization coefficient of the current convolutional layer, the output quantization coefficient of the current convolutional layer and the output floating point data of the current convolutional layer.
S113: the scaling factor and the bias are converted into a scaling factor fixed point value and a bias fixed point value, respectively.
S114: and obtaining the quantized neural network according to the low specific point weight, the input fixed point data, the output fixed point data, the scaling factor fixed point value and the bias fixed point value.
Compared with the prior art, the technical effect of the invention is as follows: the method is highly practical and greatly improves both quantization efficiency and quantization precision, and a quantized neural network built with this quantization method can be directly deployed on dedicated-chip FPGA/CGRA hardware platforms.
Drawings
Fig. 1 is a flow chart illustrating a neural network low bit quantization method according to the present invention.
FIG. 2 is a schematic flow chart of the present invention for converting the scaling factor and the bias into the scaling factor fixed-point value and the bias fixed-point value, respectively.
Detailed Description
The following describes embodiments of the present invention with reference to the drawings.
As shown in fig. 1, an embodiment of the invention is a neural network low bit quantization method, which includes S101 to S114.
S101: obtaining an initial neural network, and obtaining weight values and offsets of C channels in the initial neural network, wherein each channel comprises a convolutional layer.
The initial neural network can be applied to tasks such as image classification, object detection and natural language processing, and has already been trained. The initial neural network stores and computes in floating point, i.e., each weight is originally represented in float32.
Quantizing the initial neural network means converting floating-point computation into integer storage and computation, which compresses the model. In short, the initial neural network is quantized so that it can be represented using int8.
The initial neural network comprises convolutional layers, batch normalization layers and activation function layers; common neural network models consist mainly of these three structures. The following steps can likewise quantize eltwise layers on the same principle, which makes the method more general. The invention is applicable to all neural network models.
The step S101 comprises the following: eliminating the batch normalization layers so that the network structure of the initial neural network consists of convolutional layers and activation function layers; and acquiring the weight values and biases of the C channels in the initial neural network after the batch normalization layers are eliminated.
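The patent does not detail how the batch normalization layer is eliminated. A minimal sketch of the standard approach, folding the BN parameters into the preceding convolution so that only convolution and activation layers remain, is given below; the function name and parameter layout are illustrative assumptions, not the patent's own notation:

```python
import numpy as np

def fold_batch_norm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a batch-normalization layer into the preceding convolution.

    W: conv weights of shape (C, C_in, kH, kW); b: conv bias of shape (C,).
    gamma, beta, mean, var: per-channel BN parameters, each of shape (C,).
    Returns weights and bias equivalent to convolution followed by BN.
    """
    s = gamma / np.sqrt(var + eps)         # per-channel BN scale
    W_folded = W * s[:, None, None, None]  # scale each output channel's filters
    b_folded = (b - mean) * s + beta       # absorb the BN shift into the bias
    return W_folded, b_folded
```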
S102: and counting the maximum value of the input of the C channels in the initial neural network.
S103: and obtaining the scaling factor according to the input maximum value in the C channels.
A vector of length C is obtained from the input maxima of the C channels, and the scaling factor is calculated by equation (1):

S^(C) = th_C / 127.0    equation (1)

wherein S^(C) represents the scaling factor and th_C represents the vector of length C.
S104: quantizing the weight value of each channel to a lower specific point weight according to the scaling factor and the weight value of each channel.
The weight value of each channel is quantized to a low specific point weight by formula (2).
Figure BDA0002711387940000041
Wherein C is [0, C ]];Wint8Int8 represents the weight value of the channel after lower quantization; roundtrip represents rounding the calculation; w(C)A weight value representing a channel; s(C)Representing a scaling factor.
S101 to S104 complete the quantization of each channel's weight values in the initial neural network; only the weights need this quantization step. Since the weights of a trained network are fixed, they can be quantized in advance.
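A worked sketch of S102 to S104 follows, assuming th_C of S102 is the per-channel maximum absolute value feeding equation (1), and with the clipping to [-127, 127] added as a safety assumption not stated in the patent:

```python
import numpy as np

def quantize_weights_per_channel(W):
    """Per-channel int8 weight quantization (S102-S104).

    W: floating-point weights of shape (C, C_in, kH, kW).
    Returns the int8 fixed-point weights and per-channel scaling factors S^(C).
    """
    C = W.shape[0]
    th = np.abs(W.reshape(C, -1)).max(axis=1)  # length-C vector of per-channel maxima
    th = np.maximum(th, 1e-12)                 # guard against all-zero channels
    S = th / 127.0                             # equation (1): S^(C) = th_C / 127.0
    W_int8 = np.round(W / S[:, None, None, None])        # equation (2): round(W^(C) / S^(C))
    W_int8 = np.clip(W_int8, -127, 127).astype(np.int8)  # keep within the int8 range
    return W_int8, S
```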
S105 to S112 below perform the excitation (activation) quantization of the neural network, that is, the input floating-point data and output floating-point data of the network are quantized.
S105: and inputting setting data to the initial neural network for forward calculation.
For example, the set data is 1000 images, which are input into the initial neural network for forward calculation.
S106: and counting the maximum value of the input absolute value of the current convolution layer in the initial neural network, and acquiring a floating point quantization interval according to the maximum value of the input absolute value of the current convolution layer.
The maximum absolute value of the input of the current convolutional layer in the initial neural network is counted and recorded as max, and the floating-point quantization interval is calculated by the formula dist_scale = max / 2048.0, where dist_scale is the floating-point quantization interval. The quantization thresholds are subsequently searched over the interval [128, 2048].
S107: repeating the step S105, and constructing a histogram with a first set length by using the input floating point value of the current convolutional layer based on the floating point quantization interval.
The 1000 images are input into the initial neural network again for forward calculation, and a histogram with a first set length of 2048 is constructed from the input floating-point values of the current convolutional layer based on the floating-point quantization interval.
S108: and circularly traversing the set quantization threshold values in the floating point quantization interval, converting the histogram with the first set length into a histogram for each quantization threshold value, calculating KL divergence, and taking the quantization threshold value when the KL divergence is minimum as a target quantization threshold value.
In the quantization threshold th set for the round-robin vector in the floating-point quantization interval [128,2048], the histogram with the first set length of 2048 is changed to the histogram with the second set length of 128 for each quantization threshold th, KL divergence is calculated, and the quantization threshold with the smallest KL divergence is set as the target quantization threshold. In other words, the quantization threshold value when the KL divergence is the smallest is taken as the optimal threshold value target _ th.
The KL divergence (Kullback-Leibler divergence) is an asymmetric measure of the difference between two probability distributions.
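For reference (this formula is standard and not spelled out in the patent): for two distributions P and Q over the same bins, D_KL(P‖Q) = Σ_i P(i) · log(P(i) / Q(i)); here P is the reference histogram distribution and Q its quantized approximation.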
S109: and acquiring the input quantization coefficient of the current convolutional layer according to the target quantization threshold and the floating point quantization interval.
The input quantization coefficient is calculated by equation (3):

scale = (target_th + 0.5) × dist_scale / 127.0    equation (3)

wherein scale, denoted S_pre, represents the input quantization coefficient; target_th represents the target quantization threshold; dist_scale represents the floating-point quantization interval.
Through the above-mentioned S105 to S109, the input quantized coefficients of the current convolutional layer are obtained.
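A simplified sketch of S106 to S109 is given below, in the spirit of TensorRT-style entropy calibration; the exact redistribution of histogram bins is an assumption, since the patent only names the steps, and `hist` is assumed to have been populated by the repeated forward passes of S105 and S107:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes D_KL(p || q)

def find_input_quant_coeff(hist, dist_scale, target_bins=128):
    """Search the target quantization threshold by KL divergence (S107-S109).

    hist: length-2048 histogram of the layer's input floating-point values,
    built with bin width dist_scale. Returns the input quantization
    coefficient S_pre of equation (3).
    """
    best_th, best_kl = target_bins, float("inf")
    for th in range(target_bins, len(hist) + 1):  # traverse thresholds in [128, 2048]
        p = hist[:th].astype(np.float64).copy()
        p[-1] += hist[th:].sum()                  # fold the clipped tail into the last bin
        # Requantize th bins down to target_bins bins, then expand back to length th
        idx = np.arange(th) * target_bins // th
        coarse = np.bincount(idx, weights=p, minlength=target_bins)
        counts = np.bincount(idx, minlength=target_bins)
        q = coarse[idx] / counts[idx]             # spread each coarse bin uniformly
        kl = entropy(p, q)                        # D_KL(P || Q); entropy() normalizes both
        if kl < best_kl:
            best_kl, best_th = kl, th
    return (best_th + 0.5) * dist_scale / 127.0   # equation (3)
```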
Further, when the inputs and outputs of an activation function layer in the initial neural network are all positive, the signed 8-bit quantization range still includes negative numbers. This wastes quantization space and adds overhead for storing parameters.
Therefore, the activation function layer output is optimized to uint8 asymmetric quantization; that is, in S108 the floating-point quantization threshold interval becomes [256, 2048], histograms of length 256 are generated for the different set quantization thresholds, and the input quantization coefficient is finally divided by 255.0 instead. This avoids wasted quantization space and extra parameter storage overhead, and yields a further precision improvement.
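Adapting the calibration sketch after S109 to this variant would amount to searching thresholds in [256, 2048], building 256-bin candidate histograms, and dividing by 255.0 instead of 127.0 in equation (3); this is an inferred reading, as the patent gives no further detail on the asymmetric case.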
S110: and taking the next convolutional layer connected after the current convolutional layer as the current convolutional layer, repeating the steps from S105 to S109 to obtain the input quantization coefficient of the next convolutional layer, and taking the input quantization coefficient of the next convolutional layer as the output quantization coefficient of the current convolutional layer.
Wherein the input quantization coefficient of the next convolutional layer is the output quantization coefficient of the current convolutional layer, recorded as S_cur.
S111: and quantizing the input floating point data to obtain input fixed point data according to the input quantization coefficient of the current convolutional layer and the input floating point data of the current convolutional layer.
When the input I_f is not entirely non-negative, int8 quantization is carried out, and the input floating-point data is quantized by equation (4):

I_8bit = round(I_f / S_pre)    equation (4)

wherein I_8bit represents the input fixed-point data; I_f represents the input floating-point data; S_pre represents the input quantization coefficient.
That is, I_f is quantized to 8 bits based on S_pre. When the input I_f is a ReLU output, uint8 quantization is carried out instead, i.e., the same rounding formula is applied with the result represented as an unsigned 8-bit value in [0, 255].
S112: and quantizing the output floating point data to obtain output fixed point data according to the scaling factor, the input quantization coefficient of the current convolutional layer, the output quantization coefficient of the current convolutional layer and the output floating point data of the current convolutional layer.
The output floating-point data is quantized by equation (5):

O_8bit = round((S^(C) · S_pre · Σ{W_int8 · I_8bit} + Bias^(C)) / S_cur)    equation (5)

wherein O_8bit represents the output fixed-point data; S^(C) represents the scaling factor; W_int8 represents the 8-bit weight after quantization of the channel weight value; I_8bit represents the input fixed-point data; Bias^(C) represents the bias; S_pre represents the input quantization coefficient; S_cur represents the output quantization coefficient.
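A minimal sketch of equations (4) and (5) follows. The clipping to [-127, 127] and the int32 accumulator are assumptions beyond what the patent states, and the plain dot product stands in for the per-channel convolution accumulation Σ{W_int8 · I_8bit}:

```python
import numpy as np

def quantize_input(I_f, S_pre):
    """Equation (4): I_8bit = round(I_f / S_pre), stored as int8."""
    return np.clip(np.round(I_f / S_pre), -127, 127).astype(np.int8)

def quantize_output(W_int8, I_8bit, S_c, S_pre, bias_c, S_cur):
    """Equation (5) for one output element of one channel:
    O_8bit = round((S^(C) * S_pre * sum(W_int8 * I_8bit) + Bias^(C)) / S_cur).
    """
    acc = np.sum(W_int8.astype(np.int32) * I_8bit.astype(np.int32))  # integer accumulation
    O_f = S_c * S_pre * acc + bias_c        # recover the floating-point output
    return np.int8(np.clip(np.round(O_f / S_cur), -127, 127))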
S113 described below is the coefficient fixed-pointing process.
S113: converting the scaling factor S^(C) and the bias Bias^(C) into a scaling-factor fixed-point value and a bias fixed-point value, respectively.
Step S113 includes steps S201 to S206 described below.
S201: calculating the target scaling factor from the scaling factor, the input quantization coefficient and the output quantization coefficient; i.e., the target scaling factor is

S_target^(C) = S^(C) · S_pre / S_cur

S202: calculating the target bias from the bias and the output quantization coefficient; i.e., the target bias is

Bias_target^(C) = Bias^(C) / S_cur
The target scaling factor and the target bias are floating-point values; when they are deployed on a chip for calculation, they must be converted to fixed-point values. Below, the target scaling factor and the target bias are converted to 16-bit fixed-point, as follows.
S203: inputting the set data to the initial neural network for forward calculation.
S204: statistics (S)(c)*∑{Wint8*I8bit}+Bias(c)) Maximum absolute value of (D) is recorded as Maxrst
S205: statistical target scaling factor SCMax ofscale(ii) a For MaxscaleMaking a left shift fixed point of 16 bits to obtain a fixed point value q of a scaling factorscale
S206: statistical target BiasCMaximum value, denoted Maxbias(ii) a For Max (Max)rst,Maxbias) Making a 16-bit left shift fixed point to obtain a bias fixed point value qrst
Fixed-point and floating-point are both numeric representations: a fixed-point value stores a fixed number of integer and fractional digits, whereas a floating-point value stores a significand and an exponent. Fixed-point differs from floating-point in that the integer and fractional parts are separated at a fixed position.
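The patent names a "16-bit left-shift fixed-pointing" without defining it. The sketch below assumes the common reading: choose the largest left shift under which the governing maximum still fits in a signed 16-bit value, then round. The function name and return convention are assumptions:

```python
import numpy as np

def left_shift_fixed_point_16(x, max_abs):
    """Assumed form of the 16-bit left-shift fixed-pointing of S205/S206.

    max_abs: the statistic bounding |x| (Max_scale for q_scale,
    max(Max_rst, Max_bias) for q_rst). Returns the fixed-point integer
    and the shift needed to undo it on the hardware side.
    """
    shift = 15 - int(np.ceil(np.log2(max_abs)))  # headroom so max_abs * 2^shift fits in 16 bits
    q = int(np.round(x * (2.0 ** shift)))        # 2.0**shift also handles negative shifts
    return q, shift
```

Under these assumptions, q_scale would be obtained as left_shift_fixed_point_16(S_target, Max_scale)[0] and q_rst as left_shift_fixed_point_16(Bias_target, max(Max_rst, Max_bias))[0].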
After the weight quantization, excitation quantization and coefficient fixed-pointing above, the quantization of the neural network is complete.
S114: and acquiring a quantized neural network according to the low specific point weight, the input fixed point data, the output fixed point data, the scaling factor fixed point value and the bias fixed point value.
A quantized neural network structure built with this quantization method can be directly deployed on dedicated-chip hardware platforms such as field-programmable gate arrays (FPGA) or coarse-grained reconfigurable architectures (CGRA).
Table 1 below shows the quantization accuracy comparison of the neural network before and after quantization.
TABLE 1

Model     Accuracy before quantization    Accuracy after quantization
GRU-L     95.09%                          94.87%
LSTM-L    91.37%                          90.35%
The accuracy of GRU-L is 95.09% before quantization and 94.87% after quantization; the accuracy of LSTM-L is 91.37% before quantization and 90.35% after quantization. The quantization method therefore incurs only a small precision loss, ensuring that the quantized neural network retains high precision and accuracy.
The invention can deploy the quantized neural network directly on a chip. First, for any given floating-point neural network model, the quantization method of the invention can quantize the model directly; the quantized network model requires no retraining or similar operations, so the quantization efficiency is high.
Second, the quantization method quantizes the whole neural network model layer by layer over its common layer types. It has been verified that network structures such as GRU, LSTM and RNN retain high recognition accuracy, and the method can be extended to other network structures, giving it strong generality.
Finally, the neural network low-bit quantization method of the invention is intended for deploying the quantized model directly on a chip and has already been applied in a practical speech recognition product. The quantization results maintain high precision and accuracy, meet the recognition-rate requirements of real scenarios, and are very friendly to hardware platform development and deployment, so the method is of practical significance for broadening the application of neural network models in embedded devices.

Claims (8)

1. A neural network low bit quantization method, comprising:
s101: acquiring an initial neural network, and acquiring weight values and offsets of C channels in the initial neural network, wherein each channel comprises a convolutional layer;
s102: counting the maximum value of the input of C channels in the initial neural network;
s103: obtaining a scaling factor according to the input maximum value in the C channels;
s104: quantizing the weight value of each channel to a low-bit fixed-point weight according to the scaling factor and the weight value of each channel;
s105: inputting set data to the initial neural network for forward calculation;
s106: counting the maximum value of the input absolute value of the current convolution layer in the initial neural network, and acquiring a floating point quantization interval according to the maximum value of the input absolute value of the current convolution layer;
s107: repeating the step S105, and constructing a histogram with a first set length by using the input floating point value of the current convolutional layer based on the floating point quantization interval;
s108: circularly traversing the set quantization threshold values in the floating point quantization interval, converting the histogram with a first set length into a histogram with a second set length for each quantization threshold value, calculating KL divergence, and taking the quantization threshold value with the smallest KL divergence as a target quantization threshold value;
s109: acquiring an input quantization coefficient of the current convolution layer according to the target quantization threshold and the floating point quantization interval;
s110: taking the next convolutional layer connected after the current convolutional layer as the current convolutional layer, repeating the steps from S105 to S109 to obtain the input quantization coefficient of the next convolutional layer, and taking the input quantization coefficient of the next convolutional layer as the output quantization coefficient of the current convolutional layer;
s111: quantizing the input floating point data to obtain input fixed point data according to the input quantization coefficient of the current convolution layer and the input floating point data of the current convolution layer;
s112: quantizing the output floating point data to obtain output fixed point data according to the scaling factor, the input quantization coefficient of the current convolutional layer, the output quantization coefficient of the current convolutional layer and the output floating point data of the current convolutional layer;
s113: converting the scaling factor and the bias into a scaling factor fixed-point value and a bias fixed-point value, respectively;
s114: and acquiring a quantized neural network according to the low-bit fixed-point weights, the input fixed-point data, the output fixed-point data, the scaling-factor fixed-point value and the bias fixed-point value.
2. The neural network low bit quantization method of claim 1, wherein the initial neural network comprises a convolutional layer, a batch normalization layer, and an activation function layer;
the S101 includes:
eliminating the batch normalization layer to enable the network structure of the initial neural network to be a convolution layer and an activation function layer; and acquiring the weight values and the offsets of the C channels in the initial neural network after the batch normalization layer is eliminated.
3. The neural network low bit quantization method of claim 2, wherein obtaining the scaling factor according to the maximum value input in the C channels in S103 comprises:
obtaining a vector of length C from the input maxima of the C channels; calculating the scaling factor by equation (1);

S^(C) = th_C / 127.0    equation (1)

wherein S^(C) represents the scaling factor and th_C represents the vector of length C.
4. The neural network low bit quantization method of claim 3, wherein said S104 comprises:
quantizing the weight value of each channel to a low-bit fixed-point weight by equation (2);

W_int8 = round(W^(C) / S^(C))    equation (2)

wherein c ∈ [0, C); W_int8 represents the int8 weight value of the channel after quantization; round represents the rounding operation; W^(C) represents the weight value of the channel; S^(C) represents the scaling factor.
5. The neural network low bit quantization method of claim 4, wherein said S109 comprises:
calculating the input quantization coefficient by equation (3);

scale = (target_th + 0.5) × dist_scale / 127.0    equation (3)

wherein scale, denoted S_pre, represents the input quantization coefficient; target_th represents the target quantization threshold; dist_scale represents the floating-point quantization interval.
6. The neural network low bit quantization method of claim 5, wherein the input floating-point data in S111 is quantized by equation (4);

I_8bit = round(I_f / S_pre)    equation (4)

wherein I_8bit represents the input fixed-point data; I_f represents the input floating-point data; S_pre represents the input quantization coefficient;
and the output floating-point data in S112 is quantized by equation (5);

O_8bit = round((S^(C) · S_pre · Σ{W_int8 · I_8bit} + Bias^(C)) / S_cur)    equation (5)

wherein O_8bit represents the output fixed-point data; S^(C) represents the scaling factor; W_int8 represents the 8-bit weight after quantization of the channel weight value; I_8bit represents the input fixed-point data; Bias^(C) represents the bias; S_cur represents the output quantization coefficient.
7. The neural network low bit quantization method of claim 6, wherein said converting said scaling factor and said bias into a scaling factor fixed point value and a bias fixed point value in said S113 comprises:
calculating to obtain a target scaling factor according to the scaling factor, the input quantization coefficient and the output quantization coefficient;
calculating to obtain target bias according to the bias and the output quantization coefficient;
inputting set data to the initial neural network for forward calculation;
counting the maximum absolute value of (S^(C) · Σ{W_int8 · I_8bit} + Bias^(C)), recorded as Max_rst;
counting the maximum value of the target scaling factor, Max_scale; applying a 16-bit left-shift fixed-pointing to Max_scale to obtain the scaling-factor fixed-point value q_scale;
counting the maximum value of the target bias, recorded as Max_bias; applying a 16-bit left-shift fixed-pointing to max(Max_rst, Max_bias) to obtain the bias fixed-point value q_rst.
8. The neural network low bit quantization method of claim 5, wherein, in the case that the inputs and outputs of the activation function layer in the initial neural network are all positive values, the activation function layer output is optimized to uint8 asymmetric quantization; the floating-point quantization threshold interval is changed to [256, 2048], histograms of length 256 are generated for the different set quantization thresholds, and the input quantization coefficient is finally divided by 255.0.
CN202011057930.2A 2020-09-29 2020-09-29 Neural network low bit quantization method Pending CN112381205A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011057930.2A CN112381205A (en) 2020-09-29 2020-09-29 Neural network low bit quantization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011057930.2A CN112381205A (en) 2020-09-29 2020-09-29 Neural network low bit quantization method

Publications (1)

Publication Number Publication Date
CN112381205A (en) 2021-02-19

Family

ID=74580892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011057930.2A Pending CN112381205A (en) 2020-09-29 2020-09-29 Neural network low bit quantization method

Country Status (1)

Country Link
CN (1) CN112381205A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011571A (en) * 2021-03-03 2021-06-22 华南理工大学 INT8 offline quantization and integer inference method based on Transformer model
CN113487014A (en) * 2021-07-05 2021-10-08 上海西井信息科技有限公司 Method and equipment for quantizing any bit based on semantic segmentation neural network model
CN113255901B (en) * 2021-07-06 2021-10-08 上海齐感电子信息科技有限公司 Real-time quantization method and real-time quantization system
CN113554149A (en) * 2021-06-18 2021-10-26 北京百度网讯科技有限公司 Neural network processing unit NPU, neural network processing method and device
CN113747155A (en) * 2021-09-06 2021-12-03 中国电信股份有限公司 Feature quantization method and device, encoder and communication system
CN113780523A (en) * 2021-08-27 2021-12-10 深圳云天励飞技术股份有限公司 Image processing method, image processing device, terminal equipment and storage medium
CN114781604A (en) * 2022-04-13 2022-07-22 广州安凯微电子股份有限公司 Coding method of neural network weight parameter, coder and neural network processor
CN116992965A (en) * 2023-09-27 2023-11-03 之江实验室 Reasoning method, device, computer equipment and storage medium of transducer large model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106575379A (en) * 2014-09-09 2017-04-19 英特尔公司 Improved fixed point integer implementations for neural networks
JP2019160319A (en) * 2018-03-09 2019-09-19 キヤノン株式会社 Method and device for optimizing and applying multi-layer neural network model, and storage medium
CN110363281A (en) * 2019-06-06 2019-10-22 上海交通大学 A kind of convolutional neural networks quantization method, device, computer and storage medium
CN110610237A (en) * 2019-09-17 2019-12-24 普联技术有限公司 Quantitative training method and device of model and storage medium
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 Method for fixed-point quantization of complete INT8 of convolutional neural network
CN111461302A (en) * 2020-03-30 2020-07-28 杭州嘉楠耘智信息科技有限公司 Data processing method, device and storage medium based on convolutional neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106575379A (en) * 2014-09-09 2017-04-19 英特尔公司 Improved fixed point integer implementations for neural networks
JP2019160319A (en) * 2018-03-09 2019-09-19 キヤノン株式会社 Method and device for optimizing and applying multi-layer neural network model, and storage medium
CN110363281A (en) * 2019-06-06 2019-10-22 上海交通大学 A kind of convolutional neural networks quantization method, device, computer and storage medium
CN110610237A (en) * 2019-09-17 2019-12-24 普联技术有限公司 Quantitative training method and device of model and storage medium
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 Method for fixed-point quantization of complete INT8 of convolutional neural network
CN111461302A (en) * 2020-03-30 2020-07-28 杭州嘉楠耘智信息科技有限公司 Data processing method, device and storage medium based on convolutional neural network

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011571A (en) * 2021-03-03 2021-06-22 华南理工大学 INT8 offline quantization and integer inference method based on Transformer model
CN113554149A (en) * 2021-06-18 2021-10-26 北京百度网讯科技有限公司 Neural network processing unit NPU, neural network processing method and device
CN113487014A (en) * 2021-07-05 2021-10-08 上海西井信息科技有限公司 Method and equipment for quantizing any bit based on semantic segmentation neural network model
CN113255901B (en) * 2021-07-06 2021-10-08 上海齐感电子信息科技有限公司 Real-time quantization method and real-time quantization system
CN113780523A (en) * 2021-08-27 2021-12-10 深圳云天励飞技术股份有限公司 Image processing method, image processing device, terminal equipment and storage medium
CN113780523B (en) * 2021-08-27 2024-03-29 深圳云天励飞技术股份有限公司 Image processing method, device, terminal equipment and storage medium
CN113747155A (en) * 2021-09-06 2021-12-03 中国电信股份有限公司 Feature quantization method and device, encoder and communication system
CN113747155B (en) * 2021-09-06 2022-08-19 中国电信股份有限公司 Characteristic quantization method and device, encoder and communication system
CN114781604A (en) * 2022-04-13 2022-07-22 广州安凯微电子股份有限公司 Coding method of neural network weight parameter, coder and neural network processor
CN114781604B (en) * 2022-04-13 2024-02-20 广州安凯微电子股份有限公司 Coding method of neural network weight parameters, coder and neural network processor
CN116992965A (en) * 2023-09-27 2023-11-03 之江实验室 Reasoning method, device, computer equipment and storage medium of transducer large model
CN116992965B (en) * 2023-09-27 2024-01-09 之江实验室 Reasoning method, device, computer equipment and storage medium of transducer large model

Similar Documents

Publication Publication Date Title
CN112381205A (en) Neural network low bit quantization method
Lee et al. Lognet: Energy-efficient neural networks using logarithmic computation
EP3276540B1 (en) Neural network method and apparatus
US10096134B2 (en) Data compaction and memory bandwidth reduction for sparse neural networks
CN109102064B (en) High-precision neural network quantization compression method
CN111612147A (en) Quantization method of deep convolutional network
EP3651069A1 (en) Data processing device, data processing method, and compressed data
US10491239B1 (en) Large-scale computations using an adaptive numerical format
CN112329922A (en) Neural network model compression method and system based on mass spectrum data set
WO2021135715A1 (en) Image compression method and apparatus
US20230300354A1 (en) Method and System for Image Compressing and Coding with Deep Learning
Wang et al. QGAN: Quantized generative adversarial networks
CN111240746B (en) Floating point data inverse quantization and quantization method and equipment
CN110874625A (en) Deep neural network quantification method and device
CN109978144B (en) Model compression method and system
CN112766484A (en) Floating point neural network model quantization system and method
US11531884B2 (en) Separate quantization method of forming combination of 4-bit and 8-bit data of neural network
CN111178427A (en) Depth self-coding embedded clustering method based on Sliced-Wasserstein distance
CN112652299B (en) Quantification method and device of time series speech recognition deep learning model
Ullah et al. L2L: A highly accurate Log_2_Lead quantization of pre-trained neural networks
CN101467459A (en) Restrained vector quantization
CN112613604A (en) Neural network quantification method and device
Enderich et al. Learning multimodal fixed-point weights using gradient descent
US20210132866A1 (en) Data processing device, method of operating the same, and program
JPWO2020049681A1 (en) Information processing equipment, methods and programs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Zhang Shurui

Inventor after: OuYang Peng

Inventor before: Zhang Shurui

Inventor before: OuYang Peng

Inventor before: Yin Shouyi