CN111768002B - Deep neural network quantization method based on elastic significance - Google Patents

Deep neural network quantization method based on elastic significance

Info

Publication number
CN111768002B
CN111768002B CN202010661226.1A CN202010661226A
Authority
CN
China
Prior art keywords
quantization
elastic
significance
neural network
deep neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010661226.1A
Other languages
Chinese (zh)
Other versions
CN111768002A (en)
Inventor
龚成
卢冶
李涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN202010661226.1A
Publication of CN111768002A
Application granted
Publication of CN111768002B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention provides a deep neural network quantization method based on elastic significant bits, which quantizes fixed-point or floating-point numbers into quantized values with an elastic number of significant bits, discards the redundant mantissa, and quantitatively evaluates the distribution difference between the quantized values and the original data through a tractable solving procedure. Because the quantized values have an elastic number of significant bits, their distribution can cover a family of bell-shaped distributions ranging from long-tailed to uniform, adapting to the weight/activation distributions of DNNs and thereby keeping the precision loss low. Multiplication can be realized in hardware as a multi-stage shift-and-add, which improves the overall efficiency of the quantized model. The distribution difference function quantitatively estimates the quantization loss caused by different quantization schemes, so the optimal scheme can be selected under different conditions, achieving lower quantization loss and improving the accuracy of the quantized model.

Description

Deep neural network quantization method based on elastic significance
Technical Field
The invention belongs to the technical field of deep neural network compression, and particularly relates to a deep neural network quantization method based on elastic significant bits.
Background
Deep neural network (DNN) quantization is an effective method for compressing DNNs and can markedly improve their computational efficiency, so that networks can be deployed on resource-constrained edge computing platforms. One of the most common ways to quantize DNNs is to project high-precision floating-point values onto low-precision quantized values. For example, a DNN with 32-bit floating-point weights can achieve 32x model-size compression by replacing each weight with a single bit, and complex multiplications can even be reduced to simple bit operations in hardware. Therefore, with fewer bits, DNN quantization can significantly reduce the computation scale or memory footprint and thereby improve computational efficiency.
However, quantization also causes a substantial loss of accuracy, mainly because replacing values with low-precision ones destroys the original distribution of the weights/activations in DNNs. For example, binarization converts values following an arbitrary distribution into a Bernoulli distribution, causing severe accuracy degradation. It therefore remains a challenge to make the distributions before and after quantization as close as possible while using as few bits as possible. Low-bit schemes such as binary/ternary quantization can achieve high computational efficiency, but their model accuracy is too low. Hence, researchers in recent years have focused more on multi-bit quantization, which can be roughly divided into two categories: linear quantization and nonlinear quantization.
Linear quantization: linear quantization maps data onto a set of uniformly spaced fixed-point values. DoReFa-Net first finds a scaling factor to scale the data into the target range and then applies a rounding operation to implement the projection. TSQ quantizes activations and weights in two steps: it first sets activations below a threshold to 0 and quantizes the remaining activations linearly, and then quantizes the weights to three values with a kernel-wise quantization. QIL introduces a trainable linear quantization process that is re-parameterized by optimizing the task loss. PACT attributes the difficulty of activation quantization to the lack of trainable parameters; it re-parameterizes the clipping parameter used for activation quantization to limit the value range, over which linear quantization is then applied. BCGD uses a simple step function with a single scaling factor to linearly quantize all weights and activations. DSQ proposes a soft, differentiable linear quantization process that addresses the non-differentiability of quantization by using a tanh function to progressively approximate the step function. The more recent μL2Q introduces scaling and shifting factors for weight transformation and then truncates and rounds the transformed data to discrete values. VecQ scales the weights with only a scaling factor into the target range and then truncates and rounds them to fixed-point values.
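The scale-then-round pattern shared by these linear methods can be illustrated with a short sketch (a generic Python illustration written for this description, not the implementation of any particular cited method; the clipping range and bit width are assumed parameters):

import numpy as np

def linear_quantize(x, bits=4, clip=1.0):
    # Generic linear quantization: scale into the target range, round to the
    # uniform grid, and clip to the representable signed fixed-point levels.
    levels = 2 ** (bits - 1) - 1        # largest signed level, e.g. 7 for 4 bits
    step = clip / levels                # uniform step size of the grid
    q = np.clip(np.round(x / step), -levels, levels)
    return q * step                     # dequantized, uniformly spaced values

For example, linear_quantize(np.array([0.03, -0.61, 0.95]), bits=4) maps each value to the nearest of the 15 uniformly spaced levels in [-1, 1].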
Nonlinear quantization: nonlinear quantization projects floating-point numbers onto low-precision values with non-uniform spacing. TTQ attempts to correct the accuracy degradation caused by ternarization and quantizes the weights to three values that are not centrally symmetric; to achieve this, it introduces two trainable scaling factors. To exploit the bit operations of binary quantization while preventing severe accuracy degradation, LQ-Nets and ABC-Nets quantize weights and activations as the sum of multiple binary quantization results, achieving a new trade-off between accuracy and efficiency. Logarithmic quantization attempts to map floating-point numbers to exponential values with different bases, for example 1.2 or 2.1. Power-of-two (PoT) quantization was found to be computationally efficient but performs poorly. INQ addresses the fast convergence of quantized networks by quantizing the weights into PoT values incrementally. ENN introduces the Alternating Direction Method of Multipliers (ADMM) into quantization to handle gradient propagation through the non-differentiable quantization; it also quantizes the weights to PoT values.
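The power-of-two idea behind these logarithmic schemes can be sketched as follows (a generic Python illustration, not any cited method; the exponent bounds are assumed parameters):

import numpy as np

def pot_quantize(x, min_exp=-6, max_exp=0):
    # Map each value to the nearest signed power of two within an assumed exponent range,
    # by rounding the exponent in the log2 domain.
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 2.0 ** min_exp, 2.0 ** max_exp)
    exp = np.clip(np.round(np.log2(mag)), min_exp, max_exp)
    return sign * 2.0 ** exp

Such values make multiplication cheap (a single shift per operand), which is the efficiency advantage noted above, but the resulting long-tailed value distribution is also the source of the accuracy problems discussed next.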
The recent APoT improves accuracy by adding more complex quantization values to PoT, but the resulting quantization values actually follow a multi-peaked distribution, which still causes some accuracy degradation. Moreover, its complex quantization intervals introduce a complicated projection operation that becomes a new performance bottleneck, leading to very low computational efficiency. APoT quantizes a floating-point number as the sum of several PoT terms. For example, given b quantization bits, APoT first forms n groups; after 1 bit is reserved as the sign bit, (b-1) must be divisible by n, and each group contains 2^((b-1)/n) PoT values. The values in the groups are then combined to generate the final 2^b - 1 quantization values (the value 0 recurs). Fig. 1 shows the quantized values of different groupings for APoT with b = 5. APoT is equivalent to PoT quantization when n = 1 and to linear quantization when n = 4; when n = 2, the quantized values of APoT are multimodal. None of these distributions fits a bell-shaped distribution, yet in most cases DNN weights/activations are bell-shaped, which indicates that APoT cannot adapt to the vast majority of DNN quantization tasks and suffers a large accuracy drop. As shown in Fig. 2, the complex step function that APoT uses to project values for quantization is difficult to implement with simple scaling and rounding operations, so its time and space complexity reach O(n), which greatly reduces the computational efficiency of the quantized model.
Disclosure of Invention
To address the above problems in the prior art, the invention provides a deep neural network quantization method based on elastic significant bits. By varying the number of significant bits, the distribution of the quantization values can cover a family of bell-shaped distributions ranging from long-tailed to uniform, adapting to the weight/activation distributions of DNNs, ensuring low precision loss and improving the overall efficiency of the quantized model.
The technical solution adopted by the invention is as follows: a deep neural network quantization method based on elastic significant bits quantizes fixed-point or floating-point numbers into quantized values with an elastic number of significant bits and discards the redundant mantissa.
The elastic significant bits are a finite number of significant bits retained starting from the most significant bit; which bits are retained is determined by the fixed-point or floating-point number itself.
For a given value v whose most significant bit is at position n, k+1 significant bits are retained starting from that most significant bit, as follows:
v = a_n·2^n + a_(n-1)·2^(n-1) + … + a_(n-k)·2^(n-k) + a_(n-k-1)·2^(n-k-1) + …, where a_n = 1 and a_i ∈ {0, 1}
where bit positions (n-k) to n form the retained significant part, and positions from 0 (or -∞ for floating-point numbers) to (n-k-1) form the mantissa part that must be rounded off. A fixed-point or floating-point number is then quantized as:
P(v) = R(v >> (n-k)) << (n-k)
where >> and << are shift operations and R(·) is a rounding operation.
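As a minimal Python sketch of this projection (the helper name and the round-half-up convention are illustrative assumptions, not the patented implementation), the shift, round, and shift-back steps for a non-negative integer can be written as:

def esb_project(v, k):
    # Keep k+1 significant bits of a non-negative integer v and round off the mantissa,
    # i.e. P(v) = R(v >> (n-k)) << (n-k), with n the position of the most significant bit.
    if v == 0:
        return 0
    n = v.bit_length() - 1                       # position of the most significant bit
    shift = n - k                                # width of the mantissa to be rounded off
    if shift <= 0:
        return v                                 # v already has at most k+1 significant bits
    rounded = (v + (1 << (shift - 1))) >> shift  # R(v >> (n-k)) with round-half-up
    return rounded << shift                      # shift back to the original magnitude

With k = 3 (four retained significant bits), esb_project(91, 3) returns 88 and esb_project(92, 3) returns 96, matching the embodiments described below.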
Multiplication of values with elastic significant bits is implemented as a multi-stage shift-and-accumulate over the significant bits.
The number of significant bits determines the number of shift-and-add stages.
The distribution difference between the quantized values and the original data is evaluated quantitatively through a tractable solving procedure:
Let the quantized weights be W, sampled from a random variable t ~ p(t), and let Q be the set of all quantization values. The distribution difference function is defined as follows:
J(Q) = Σ_{q∈Q} ∫_S p(t)·(t - q)^2 dt, s.t. S = (q - q_l, q + q_u]
where (q - q_l, q + q_u] is the range of continuous data that is projected onto the value q; the range is centered at q, and q_l and q_u indicate its lower and upper extents.
The distribution difference is used to evaluate the optimal quantization under different numbers of elastic significant bits.
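Under this definition, the distribution difference of a candidate quantization set can be evaluated numerically, as in the following sketch (p(t) is taken as the standard normal density and the projection boundaries are placed midway between adjacent quantization values; both are illustrative assumptions):

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def distribution_difference(q_values):
    # J(Q) = sum over q of the integral of p(t) * (t - q)^2 over the interval projected onto q.
    q = np.sort(np.asarray(q_values, dtype=float))
    # Assumed projection boundaries: midpoints between neighbouring quantization values.
    edges = np.concatenate(([-np.inf], (q[:-1] + q[1:]) / 2.0, [np.inf]))
    total = 0.0
    for qi, lo, hi in zip(q, edges[:-1], edges[1:]):
        total += quad(lambda t, qi=qi: norm.pdf(t) * (t - qi) ** 2, lo, hi)[0]
    return total

A smaller value indicates that the candidate set better matches the original bell-shaped distribution, which is exactly the criterion used in the embodiments below to pick the number of significant bits.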
The working principle is as follows: the elastic significant bits produce quantization values with different distributions so as to improve model accuracy. What actually affects accuracy is how many of the given bits can be used to control the distribution of the quantization values so that it approximates the original distribution. In the present invention these bits are called significant bits, and the number of significant bits, i.e. the number of bits between the most significant bit and the least significant bit, determines the distribution of the quantization values. Thanks to the elastic design of the number of significant bits, the quantization values of this method can cover all possible bell-shaped data distributions from long-tailed to uniform, adapting to the bell-shaped weight/activation distributions of DNNs and thus effectively guaranteeing quantization accuracy. In addition, the invention designs a distribution difference function to estimate the quantization error, so as to find the most suitable quantization scheme, i.e. the most suitable number of significant bits.
An efficient quantization value set ensures the computational efficiency of the quantized model: because the quantization values have a limited number of significant bits, they contain fewer significant bits than contiguous fixed-point values with the same number of quantization bits, so multiplication can be realized with fewer shift-and-add stages, improving the computational efficiency of multiplications in the quantized model.
Fast projection function: compared with a complex step function, the method realizes fast projection from a floating-point value to a quantized value using only simple shift and rounding operations; the algorithm needs only O(1) time and space complexity, and a hardware implementation needs only one CPU clock cycle, so the projection efficiency is greatly improved.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses quantization values with elastic significant bits, whose distribution can cover a family of bell-shaped distributions from long-tailed to uniform by varying the significant bits, adapting to the weight/activation distributions of DNNs and ensuring low precision loss. Experiments show that when ResNet18 is quantized to 2 bits on ImageNet, the accuracy of the proposed elastic significant bit (ESB) quantization reaches 68.13%, an improvement of 1.03% over the 67.1% of APoT;
2. Multiplication of elastic-significant-bit values can be realized in hardware as multi-stage shift-and-add, where the number of stages depends on the number of significant bits;
3. The shift and rounding operations can be completed within one CPU clock cycle, so the projection algorithm achieves highly efficient projection;
4. The distribution difference function quantitatively estimates the quantization loss caused by different quantization schemes and can select the optimal scheme under different conditions, achieving lower quantization loss and improving the accuracy of the quantized model.
Drawings
FIG. 1 shows the multimodal distribution formed by prior-art APoT quantization;
FIG. 2 shows the complex step function caused by prior-art APoT;
FIG. 3 is a schematic diagram of a shift and rounding operation according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating multiplication by multi-stage shift-accumulate operations according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a shift and rounding operation according to another embodiment of the present invention;
FIG. 6 is a diagram illustrating multiplication by multi-stage shift-accumulate operations according to another embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating a distribution variance solution according to an embodiment of the present invention;
FIG. 8 shows the quantization process of quantizing ResNet18 according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Example one
The embodiment of the invention provides a deep neural network quantization method based on elastic significant bits, which quantizes fixed-point or floating-point numbers into quantized values with an elastic number of significant bits and discards the redundant mantissa. The elastic significant bits are a finite number of significant bits retained starting from the most significant bit.
For a given value v whose most significant bit is at position n, k+1 significant bits are retained starting from that most significant bit, as follows:
v = a_n·2^n + a_(n-1)·2^(n-1) + … + a_(n-k)·2^(n-k) + a_(n-k-1)·2^(n-k-1) + …, where a_n = 1 and a_i ∈ {0, 1}
where bit positions (n-k) to n form the retained significant part, and positions from 0 (or -∞ for floating-point numbers) to (n-k-1) form the mantissa part that must be rounded off. A fixed-point or floating-point number is then quantized as:
P(v) = R(v >> (n-k)) << (n-k)
where >> and << are shift operations and R(·) is a rounding operation.
As shown in FIG. 3, in this embodiment the value 91 is quantized with the elastic significant bits set to retain 4 significant bits; after two shift operations and one rounding operation, 91 is quantized to 88.
For the multiplication of 91 by 91, the operands become 88 and 88 after the shift and rounding operations, and the product is then computed by multi-stage shift accumulation; as shown in FIG. 4, the operation reduces to a shift-accumulation over four terms, which greatly improves computational efficiency. The embodiment can further improve efficiency by skipping the terms whose significant bit is 0, as illustrated in the sketch below.
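A minimal Python sketch of the shift-accumulate multiplication (illustrative only; it operates on the already quantized integers and skips zero bits, as the embodiment describes):

def shift_add_multiply(a, b):
    # Multiply two quantized integers using only shifts and additions.
    # Each nonzero bit of b contributes one shift-and-add stage.
    acc = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:          # skip positions whose bit is 0
            acc += a << i         # accumulate a shifted copy of a
    return acc

For the quantized operands of this embodiment, shift_add_multiply(88, 88) returns 7744 using three shift-and-add stages, since 88 = 0b1011000 has three nonzero bits among its four retained significant-bit positions.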
Example two
The embodiment of the invention provides a deep neural network quantization method based on elastic significant bits, which quantizes fixed-point or floating-point numbers into quantized values with an elastic number of significant bits and discards the redundant mantissa. The elastic significant bits are a finite number of significant bits retained starting from the most significant bit.
For a given value v whose most significant bit is at position n, k+1 significant bits are retained starting from that most significant bit, as follows:
v = a_n·2^n + a_(n-1)·2^(n-1) + … + a_(n-k)·2^(n-k) + a_(n-k-1)·2^(n-k-1) + …, where a_n = 1 and a_i ∈ {0, 1}
where bit positions (n-k) to n form the retained significant part, and positions from 0 (or -∞ for floating-point numbers) to (n-k-1) form the mantissa part that must be rounded off. A fixed-point or floating-point number is then quantized as:
P(v) = R(v >> (n-k)) << (n-k)
where >> and << are shift operations and R(·) is a rounding operation.
As shown in FIG. 5, in this embodiment the value 92 is quantized with the elastic significant bits set to retain 4 significant bits; after two shift operations and one rounding operation, 92 is quantized to 96.
For the multiplication of 92 by 92, the operands become 96 and 96 after the shift and rounding operations, and the product is then computed by multi-stage shift accumulation; as shown in FIG. 6, only two shift-accumulation terms are needed, which greatly improves computational efficiency.
Example three
The distribution difference between the quantized values and the original data is evaluated quantitatively through a tractable solving procedure:
Let the quantized weights be W, sampled from a random variable t ~ p(t), and let Q be the set of all quantization values. The distribution difference function is defined as follows:
J(Q) = Σ_{q∈Q} ∫_S p(t)·(t - q)^2 dt, s.t. S = (q - q_l, q + q_u]
where (q - q_l, q + q_u] is the range of continuous data that is projected onto the value q; the range is centered at q, and q_l and q_u indicate its lower and upper extents. A schematic of the solving process is shown in FIG. 7. The distribution difference can be used to evaluate the optimal quantization under different numbers of elastic significant bits.
Input: a DNN weight w_f sampled from a standard normal distribution N(0, 1), to be quantized to a low-precision 4-bit value;
Output: the optimal number of significant bits and the quantized weight w_q.
A. For a 4-bit quantized value, the possible numbers of significant bits are 1, 2 and 3; the corresponding quantization value sets are denoted Q_i, i ∈ {1, 2, 3}.
B. To align the distribution of the quantization values and thereby preserve accuracy, a scaling parameter λ is introduced. Randomly initialize λ_i ∈ [0, 3) and scale the weight w_f with λ_i into different distribution ranges so that it aligns with the distribution of the quantization values Q_i.
C. For each quantization value set Q_i, quantize the scaled weight with the fast quantization technique into a low-precision quantized weight w_qi represented only by values in Q_i.
D. Using the distribution difference function, compute the distribution difference J_i between the quantized weight based on Q_i and the original weight.
E. Compute the gradient of J_i with respect to λ_i, i.e. ∂J_i/∂λ_i, and search for the optimal solution λ_i* and the corresponding J_i* with a gradient descent algorithm; alternatively, solve ∂J_i/∂λ_i = 0 directly for the extremum to obtain a locally optimal solution λ_i* and the corresponding J_i*.
F. Obtain the optimal number of significant bits: i* = argmin_i J_i*.
G. Based on the quantization value set Q_i*, quantize the weight with the elastic significant bit quantization technique into the final low-precision quantized weight w_q.
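A compact Python sketch of steps A-G follows (the construction of the candidate sets Q_i, the grid search over λ in place of gradient descent, and the sampled form of the distribution difference are all simplifying assumptions made for illustration):

import numpy as np

def keep_significant_bits(v, sig_bits):
    # Same shift-and-round projection as above: keep at most sig_bits significant bits.
    if v == 0:
        return 0
    shift = v.bit_length() - sig_bits
    if shift <= 0:
        return v
    return ((v + (1 << (shift - 1))) >> shift) << shift

def candidate_set(sig_bits, total_bits=4):
    # Assumed construction of Q_i: signed magnitudes on a total_bits grid with at most
    # sig_bits significant bits, normalized into [-1, 1].
    mags = {keep_significant_bits(v, sig_bits) for v in range(1 << (total_bits - 1))}
    vals = sorted({s * m for m in mags for s in (1, -1)})
    top = max(abs(v) for v in vals)
    return np.array(vals, dtype=float) / top

def distribution_difference(w, q_set, lam):
    # Sampled stand-in for J: mean squared gap between the scaled weights and their
    # nearest quantization value, rescaled back to the original range.
    scaled = w / lam
    nearest = q_set[np.abs(scaled[:, None] - q_set[None, :]).argmin(axis=1)]
    return np.mean((scaled - nearest) ** 2) * lam ** 2

def best_significance(w, total_bits=4, lams=np.linspace(0.1, 3.0, 30)):
    # Steps A-F: evaluate each candidate number of significant bits i with its best lambda_i.
    best = None
    for i in range(1, total_bits):                     # possible significant bits: 1, 2, 3
        q_set = candidate_set(i, total_bits)
        j_i, lam_i = min((distribution_difference(w, q_set, l), l) for l in lams)
        if best is None or j_i < best[0]:
            best = (j_i, i, lam_i)
    return best[1], best[2]                            # optimal significance i* and its scale

# Step G (usage sketch): quantize weights drawn from N(0, 1) with the selected set.
w_f = np.random.randn(10000)
i_star, lam_star = best_significance(w_f)
q_set = candidate_set(i_star)
w_q = lam_star * q_set[np.abs(w_f[:, None] / lam_star - q_set[None, :]).argmin(axis=1)]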
Example four
Taking ResNet18 as an example, this embodiment quantizes all convolutional layers and fully connected layers of ResNet18 except the first and last layers; the quantization process is shown in FIG. 8. In the forward pass, for each layer to be quantized, the layer's weights and the activations from the previous layer are first quantized, through the efficient Elastic Significant Bit (ESB) projection process, into quantized values with a finite number of significant bits. The quantized values are then fed into a multiply-accumulator, and the multiplications over the finite significant bits are converted into shift-and-add operations based on those significant bits to improve the computational efficiency of the neural network. Finally, the layer's output is obtained by the accumulation operation. In the backward pass, this embodiment uses the straight-through estimator (STE) to propagate gradients through the quantization: in brief, the gradient of the network loss with respect to the quantized weights or activations is taken directly as the gradient with respect to the full-precision weights or activations before quantization, so that the network can be updated. Finally, experiments show that with the weights and activations of ResNet18 quantized to 2 bits on ImageNet, the accuracy reaches 68.13%.
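A minimal PyTorch sketch of this forward/backward scheme (the projection here is a simplified re-implementation following the formula above, the layer wiring is only indicated in comments, and the parameter names are assumptions rather than the patented implementation):

import torch

class ESBQuantSTE(torch.autograd.Function):
    # Forward: project each value onto a quantized value with a limited number of
    # significant bits. Backward: straight-through estimator (STE) - pass the
    # incoming gradient through unchanged, as described for the backward pass.
    @staticmethod
    def forward(ctx, x, sig_bits, frac_bits):
        mag = torch.round(x.abs() * (1 << frac_bits))               # fixed-point magnitude
        n = torch.floor(torch.log2(mag.clamp(min=1.0)))             # most-significant-bit position
        step = torch.pow(2.0, (n - (sig_bits - 1)).clamp(min=0.0))  # mantissa width to round off
        q = torch.round(mag / step) * step                          # keep sig_bits significant bits
        return torch.sign(x) * q / (1 << frac_bits)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None, None                              # STE: identity gradient for x

# Usage sketch for one layer of the forward pass:
# w_q = ESBQuantSTE.apply(layer.weight, 4, 8)
# a_q = ESBQuantSTE.apply(activation, 4, 8)
# out = torch.nn.functional.conv2d(a_q, w_q)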
The present invention has been described in detail with reference to the embodiments, but the description is only illustrative and should not be construed as limiting the scope of the invention, which is defined by the claims. Equivalent changes and modifications made by those skilled in the art based on the teaching of the technical solutions of the present invention, as well as equivalent technical solutions designed to achieve the above technical effects, also fall within the scope of the present invention.

Claims (6)

1. A deep neural network quantization method based on elastic significant bits, characterized in that: fixed-point or floating-point numbers are quantized into quantized values with an elastic number of significant bits, and the redundant mantissa is discarded;
the quantized values are input into a multiply-accumulator, the multiplications over the finite significant bits are converted into shift-and-add operations based on those significant bits, and the output of the layer is finally obtained by an accumulation operation;
and the distribution difference between the quantized values and the original data is evaluated quantitatively through a tractable solving procedure:
with the quantized weights W sampled from a random variable t ~ p(t) and Q the set of all quantization values, the distribution difference function is defined as follows:
J(Q) = Σ_{q∈Q} ∫_S p(t)·(t - q)^2 dt, s.t. S = (q - q_l, q + q_u]
where (q - q_l, q + q_u] is the range of continuous data projected onto the value q; the range is centered at q, and q_l and q_u indicate its lower and upper extents.
2. The elastic significance-based deep neural network quantization method of claim 1, wherein: the elastic significant bits are a finite number of significant bits retained starting from the most significant bit.
3. The elastic significance-based deep neural network quantization method of claim 1 or 2, wherein: for a given value v whose most significant bit is at position n, k+1 significant bits are retained starting from that most significant bit, as follows:
v = a_n·2^n + a_(n-1)·2^(n-1) + … + a_(n-k)·2^(n-k) + a_(n-k-1)·2^(n-k-1) + …, where a_n = 1 and a_i ∈ {0, 1}
where bit positions (n-k) to n form the retained significant part, and positions from 0 (or -∞ for floating-point numbers) to (n-k-1) form the mantissa part that must be rounded off; the fixed-point or floating-point number is quantized as:
P(v) = R(v >> (n-k)) << (n-k)
where >> and << are shift operations and R(·) is a rounding operation.
4. The elastic significance-based deep neural network quantization method of claim 3, wherein: multiplication of values with elastic significant bits is implemented as a multi-stage shift-and-accumulate over the significant bits.
5. The elastic significance-based deep neural network quantization method of claim 4, wherein: the number of significant bits determines the number of shift-and-add stages.
6. The elastic significance-based deep neural network quantization method of claim 3, wherein: the distribution difference is used to evaluate the optimal quantization under different numbers of elastic significant bits.
CN202010661226.1A 2020-07-10 2020-07-10 Deep neural network quantization method based on elastic significance Active CN111768002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010661226.1A CN111768002B (en) 2020-07-10 2020-07-10 Deep neural network quantization method based on elastic significance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010661226.1A CN111768002B (en) 2020-07-10 2020-07-10 Deep neural network quantization method based on elastic significance

Publications (2)

Publication Number Publication Date
CN111768002A CN111768002A (en) 2020-10-13
CN111768002B true CN111768002B (en) 2021-06-22

Family

ID=72726684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010661226.1A Active CN111768002B (en) 2020-07-10 2020-07-10 Deep neural network quantization method based on elastic significance

Country Status (1)

Country Link
CN (1) CN111768002B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176000A1 (en) 2017-03-23 2018-09-27 DeepScale, Inc. Data synthesis for autonomous control systems
US11157441B2 (en) 2017-07-24 2021-10-26 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US10671349B2 (en) 2017-07-24 2020-06-02 Tesla, Inc. Accelerated mathematical engine
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11215999B2 (en) 2018-06-20 2022-01-04 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11361457B2 (en) 2018-07-20 2022-06-14 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
IL305330A (en) 2018-10-11 2023-10-01 Tesla Inc Systems and methods for training machine models with augmented data
US11196678B2 (en) 2018-10-25 2021-12-07 Tesla, Inc. QOS manager for system on a chip communications
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US10997461B2 (en) 2019-02-01 2021-05-04 Tesla, Inc. Generating ground truth for machine learning from time series elements
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US10956755B2 (en) 2019-02-19 2021-03-23 Tesla, Inc. Estimating object properties using visual image data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9224089B2 (en) * 2012-08-07 2015-12-29 Qualcomm Incorporated Method and apparatus for adaptive bit-allocation in neural systems
CN108616927B (en) * 2018-04-04 2021-08-27 北京锐安科技有限公司 Data sending and receiving method and device
CN109711542B (en) * 2018-12-29 2020-08-18 西安交通大学 DNN accelerator supporting dynamic precision and implementation method thereof
CN110222821B (en) * 2019-05-30 2022-03-25 浙江大学 Weight distribution-based convolutional neural network low bit width quantization method
CN110363281A (en) * 2019-06-06 2019-10-22 上海交通大学 A kind of convolutional neural networks quantization method, device, computer and storage medium

Also Published As

Publication number Publication date
CN111768002A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
CN111768002B (en) Deep neural network quantization method based on elastic significance
CN109472353B (en) Convolutional neural network quantization circuit and method
US11403528B2 (en) Self-tuning incremental model compression solution in deep neural network with guaranteed accuracy performance
EP3451164A1 (en) Neural network operation device and method supporting few-bit fixed-point number
KR20190051755A (en) Method and apparatus for learning low-precision neural network
KR20180043172A (en) Method and apparatus for neural network quantization
Zhao et al. Focused quantization for sparse CNNs
CN111985523A (en) Knowledge distillation training-based 2-exponential power deep neural network quantification method
CN110309904B (en) Neural network compression method
CN111310890B (en) Optimization method and device of deep learning model and terminal equipment
CN111178516A (en) Softmax function calculation method based on segmented lookup table and hardware system
CN109165006B (en) Design optimization and hardware implementation method and system of Softmax function
CN111382860A (en) Compression acceleration method of LSTM network and FPGA accelerator
US20210294874A1 (en) Quantization method based on hardware of in-memory computing and system thereof
CN110545162B (en) Multivariate LDPC decoding method and device based on code element reliability dominance degree node subset partition criterion
CN113902109A (en) Compression method and device for regular bit serial computation of neural network
KR102092634B1 (en) Low density parity check code decoder and method for decoding ldpc code
CN112686384A (en) Bit-width-adaptive neural network quantization method and device
CN116934487A (en) Financial clearing data optimal storage method and system
CN110955405A (en) Input data processing and index value obtaining method and device and electronic equipment
Kalali et al. A power-efficient parameter quantization technique for CNN accelerators
CN110619392A (en) Deep neural network compression method for embedded mobile equipment
JP7016559B1 (en) Error-biased approximation multiplier for normalized floating-point numbers and how to implement it
CN113779861B (en) Photovoltaic Power Prediction Method and Terminal Equipment
JP2012060210A (en) Method, apparatus and program for adaptive quantization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant