CN113762496A

CN113762496A - Method for reducing inference operation complexity of low-bit convolutional neural network

Info

Publication number: CN113762496A
Application number: CN202010497777.9A
Authority: CN
Inventors: 张东
Original assignee: Hefei Ingenic Technology Co ltd
Current assignee: Hefei Ingenic Technology Co ltd
Priority date: 2020-06-04
Filing date: 2020-06-04
Publication date: 2021-12-07
Anticipated expiration: 2040-06-04
Also published as: CN113762496B

Abstract

The invention provides a method for reducing the inference operation complexity of a low-bit convolutional neural network, which comprises the following steps of quantizing by using stored data after training of an S1 neural network is finished, and assuming that the quantization of an ith layer is as follows:

wherein delta_iTo activate a function, Q_AFor the quantization formula of feature map, Q_wA quantization formula for the weight; s2 quantifying when the parameters of the formula in S1 meet the conditions

Obtaining through the operation of fixed point number:

s3 determines a threshold from the quantization of feature map: quantization of feature map:

the direct derivation of the threshold value from the quantization formula of feature map is (0.5, 1.5 … (2)^k-0.5)), where k is the quantized bit-width; since the distance between the thresholds is all 1.0, only the hold is needed at the final quantization

Wherein

Then the threshold value

n∈{0,1…(2^k‑1) Where k is the quantized bit-width; s4 since quantization is low, the value of the feature map after quantization is determined, and Q_AFor uniform quantization, so in S2

Pass and a series of thresholds (T)₁，T₂…T_n) And comparing to obtain a final quantification result. The method and the device solve the problems of high computational complexity and high computational resource requirements in the low-bit model reasoning process.

Description

Method for reducing inference operation complexity of low-bit convolutional neural network

Technical Field

The invention relates to the technical field of neural network acceleration, in particular to a method for reducing the reasoning and operation complexity of a low-bit convolutional neural network.

Background

In recent years, with the rapid development of science and technology, a big data age has come. Deep learning takes a Deep Neural Network (DNN) as a model, and achieves remarkable results in key fields of many human intelligence, such as image recognition, reinforcement learning, semantic analysis and the like. The Convolutional Neural Network (CNN) is a typical DNN structure, can effectively extract hidden layer features of an image and accurately classify the image, and is widely applied to the field of image identification and detection in recent years.

In particular, the multiply-shift implements a 32-bit quantization to a low bit: the result of the quantization convolution operation is stored as 32-bit shaping, and then multiplication and shift operation are carried out according to the pre-calculated parameters to realize the conversion from 32-bit to low bit.

However, when the 32bit quantization is low in the prior art, since the precision after quantization needs to be ensured, a series of addition and comparison operations need to be performed in the quantization process, which greatly increases the computational complexity and the computational resource, and especially when the quantization is 2bit, the cost is often too large.

Furthermore, the common terminology in the prior art is as follows:

convolutional Neural Networks (CNN): is a type of feedforward neural network that contains convolution calculations and has a depth structure.

And (3) quantification: quantization refers to the process of approximating a continuous value (or a large number of possible discrete values) of a signal to a finite number (or fewer) of discrete values.

Low bit rate: and quantizing the data into data with bit width of 8bit, 4bit or 2 bit.

Reasoning: and after the neural network training is finished, the stored data is used for carrying out the operation process.

Disclosure of Invention

The application provides a method for reducing the inference operation complexity of a low-bit convolutional neural network, aims to overcome the defects in the prior art and solves the problems of high computation complexity and high computation resource requirements in the inference process of the existing low-bit model.

Specifically, the invention provides a method for reducing the inference operation complexity of a low bit convolution neural network, which comprises the following steps:

s1, after the training of the neural network is finished, the stored data is used for quantization,

assume that quantization of the ith layer is as follows:

wherein delta_iTo activate a function, Q_AFor the quantization formula of feature map, Q_wA quantization formula for the weight;

s2, when the parameters of the formula in S1 meet the following conditions:

1)、

expressed in fixed point numbers scaled by floating point scalars

w_intIs a fixed point number expressed in an integer;

2)、

expressed in fixed point numbers scaled by floating point scalars

x_intIs a fixed point number expressed in an integer;

3)、δ_iis a monotonic function;

then, quantize

Obtained by the operation of fixed point number, namely:

s3, determining a threshold from the quantization of feature map:

the quantization formula of feature map is:

the quantization formula of the above feature map can directly deduce that the threshold value is (0.5, 1.5 … (2)^k-0.5)), where k is the quantized bit-width;

since the distance between the thresholds is all 1.0, only the hold is needed at the final quantization

Wherein

Then the threshold value

n∈{0,1…(2^k-1) Where k is the quantized bit-width;

s4, since the quantization is low bit, the value of the feature map after quantization is determined, and Q_ATo quantize uniformly, δ in S2_i(s_ws_xs_BN(w_int·x_int+b_i/(s_ws_xs_BN) ) pass and the series of said threshold values (T) in step S3₁，T₂…T_n) And comparing to obtain a final quantification result.

In step S2, when the quantization is a low-bit 2bit, the value of the feature map after quantization is 0,1,2, and 3.

The step S2 is due to delta_iIs a monotonic function, s_ws_x> 0, so can also pass (w)_int·x_int+b_i/(s_ws_xs_BN) ) and

to obtain a quantized result.

In the step S4, S is_BNEach channel is not the same, so saving the threshold requires that each channel needs to be saved one.

Thus, the present application has the advantages that:

1. the quantization of 32 bits into low bits is realized directly through threshold comparison, so that the complexity of operation is reduced;

2. the overall running time of the quantization model is reduced;

3. the demand of computing resources is reduced;

the 64bit by 64bit operation is avoided.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.

FIG. 1 is a schematic flow diagram of the process of the present invention.

Detailed Description

In order that the technical contents and advantages of the present invention can be more clearly understood, the present invention will now be described in further detail with reference to the accompanying drawings.

As shown in fig. 1, a method for reducing the complexity of inference operation of low bit convolution neural network of the present invention includes the following steps:

assume that quantization of the ith layer is as follows:

s2, when the parameters of the formula in S1 meet the following conditions:

1)、

expressed in fixed point numbers scaled by floating point scalars

w_intIs a fixed point number expressed in an integer;

2)、

expressed in fixed point numbers scaled by floating point scalars

x_intIs a fixed point number expressed in an integer;

3)、δ_iis a monotonic function;

then, quantize

Obtained by the operation of fixed point number, namely:

s3, determining a threshold from the quantization of feature map:

the quantization formula of feature map is:

Wherein

Then the threshold value

n∈{1，2…(2^k-1) Where k is the quantized bit-width;

s4, since the quantization is low bit, the value of the feature map after quantization is determined, and Q_ATo uniformly quantizeSo δ in S2_i(s_ws_xs_BN(w_int·x_int+b_i/(s_ws_xs_BN) ) pass and the series of said threshold values (T) in step S3₁，T₂…T_n) And comparing to obtain a final quantification result.

In particular, the method of the present application can also be expressed as follows:

assume that the quantization calculation for the ith layer is as follows:

wherein delta_iTo activate a function, Q_AFor the quantization formula of feature map, Q_wQuantization formula for weight

The parameters in the above formula meet the following conditions:

1、

can be represented by fixed-point numbers scaled by floating-point scalars

w_intIs a fixed point number expressed by an integer

2、

Can be represented by fixed-point numbers scaled by floating-point scalars

x_intIs a fixed point number expressed by an integer

3、δ_iIs a monotonic function

So that the final is calculated

The following can be obtained by the fixed point number operation:

since the quantization is low, the value of the quantized feature map is actually determined (taking 2 bits as an example, the value of the feature map is 0,1,2,3), and Q is_ATo uniform quantization, so_i(s_ws_xs_BN(w_int·x_int+b_i/(s_ws_xs_BN) A passable and a series of threshold values (T)₁，T₂…T_n) Compared to obtain a quantized result due to delta_iIs a monotonic function, s_ws_x> 0, so can also pass (w)_int·x_int+b_i/(s_ws_xs_BN) ) and

to obtain a quantized result.

The determination of the threshold needs to be started from the quantization formula of feature map.

The quantization formula of feature map is:

the threshold value of (0.5, 1.5 … (2k-0.5)) can be directly derived from the above equation, where k is the quantized bit width. Since the distance between the thresholds is 1.0, we only need to save it in the final quantization

Wherein

Then the threshold value is set

n∈{0,1…(2^k-1) Where k is the quantized bit-width; due to S_BNEach channel is not the same, so saving the threshold requires that each channel needs to be saved one.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for reducing the complexity of low bit convolutional neural network inferential computation, comprising the steps of:

assume that quantization of the ith layer is as follows:

s2, when the parameters of the formula in S1 meet the following conditions:

1)、

expressed in fixed point numbers scaled by floating point scalars

w_intIs a fixed point number expressed in an integer;

2)、

expressed in fixed point numbers scaled by floating point scalars

x_intIs a fixed point number expressed in an integer;

3)、δ_iis a monotonic function;

then, quantize

Obtained by the operation of fixed point number, namely:

s3, determining a threshold from the quantization of feature map:

the quantization formula of feature map is:

the threshold value is directly deduced from the quantization formula of the above feature map as (0.5, 1.5 … (2)^k-0.5)), where k is the quantized bit-width;

Wherein

Then the threshold value

n∈{0,1…(2^k-1) Where k is the quantized bit-width;

2. The method according to claim 1, wherein when the quantization is a low bit 2bit in step S2, the value of the feature map after quantization is 0,1,2, 3.

3. The method for reducing the complexity of inference operations of low bit convolution neural network as claimed in claim 1, wherein said step S2 is performed due to delta_iIs a monotonic function, s_ws_x> 0, so can also pass (w)_int·x_int+b_i/(s_ws_xs_BN) ) and

to obtain a quantized result.

4. The method for reducing the complexity of inference operations of low bit convolution neural network as claimed in claim 1, wherein S is the factor of S in step S4_BNEach channel is not the same, so saving the threshold requires that each channel needs to be saved one.