CN114372553A - Neural network quantization method and device - Google Patents

Neural network quantization method and device

Info

Publication number
CN114372553A
CN114372553A (application CN202111418068.8A)
Authority
CN
China
Prior art keywords
quantization
neural network
input data
network model
layer
Prior art date
Legal status
Pending
Application number
CN202111418068.8A
Other languages
Chinese (zh)
Inventor
张书瑞 (Zhang Shurui)
欧阳鹏 (Ouyang Peng)
Current Assignee
Beijing Qingwei Intelligent Information Technology Co., Ltd.
Original Assignee
Beijing Qingwei Intelligent Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Qingwei Intelligent Information Technology Co., Ltd.
Priority to CN202111418068.8A
Publication of CN114372553A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a neural network quantization method and device. The method comprises the following steps: quantizing the weight parameters of each layer of a neural network model to obtain quantization coefficients of the weight parameters; quantizing the input data and the output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data; and, for each layer of the network, determining the fixed-point value corresponding to each quantization coefficient, completing the quantization of the neural network model. The invention adopts a direct quantization method: the weight parameters, the input data and the output data of each network layer are quantized directly, and the quantized network model is obtained without retraining or other operations during quantization, which effectively improves quantization efficiency.

Description

Neural network quantization method and device
Technical Field
The invention relates to the technical field of deep learning, and in particular to a neural network quantization method and device.
Background
The rapid development of neural network technology has brought great convenience to daily life, and many applications built on neural networks are now part of everyday use. However, the ever-growing size of neural network models limits the scenarios in which they can be deployed. Quantizing a pre-trained neural network model is therefore of great significance: it reduces the scale of the model, improves the forward inference speed of the neural network, and widens its range of application scenarios. However, existing quantization methods suffer from problems such as insufficient model precision and low quantization efficiency.
Disclosure of Invention
In order to solve the problems of insufficient precision and low quantization efficiency in existing quantization methods, embodiments of the invention provide a neural network quantization method and device. The technical scheme is as follows:
in a first aspect, a quantization method of a neural network is provided, the method including:
quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters;
quantizing the input data and the output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
and for each layer of network, determining a fixed point value corresponding to each quantization coefficient, and completing the quantization of the neural network model.
Optionally, before the step of quantizing the input data and the output data of each layer of the neural network model, the method includes:
and selecting partial data from the training set or the test set as the input data for quantizing the neural network model.
Optionally, the step of quantizing the weight parameter of each layer of the neural network model includes:
and quantizing the weight parameters of each layer of the neural network model by adopting a linear quantization mode.
Optionally, the step of quantizing input data of each layer of the neural network model includes:
running the network forward and counting the maximum absolute value of the input data;
determining a quantization interval of the input data based on the maximum value;
constructing a histogram of length n1 from the floating-point values of the input data, based on the quantization interval;
cyclically traversing candidate quantization thresholds, reducing the histogram of length n1 to a histogram of length n2 for each threshold, calculating a divergence value, and taking the threshold with the smallest divergence value as the optimal threshold, where n1 is larger than n2;
and determining the quantization coefficient of the input data based on the optimal threshold and the quantization interval.
Optionally, when the neural network model includes a saturating activation function, the input data of the saturating activation function is quantized using a saturation-truncation quantization method.
Optionally, when the neural network model includes a classification function, the input data of the classification function is quantized using a translation-based quantization method.
In a second aspect, an apparatus for quantizing a neural network is provided, the apparatus comprising:
the quantization module is used for quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters;
the quantization module is further configured to quantize input data and output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
and the fixed point module is used for determining a fixed point value corresponding to each quantization coefficient for each layer of network to finish the quantization of the neural network model.
Optionally, the quantization module is further configured to select a part of data from a training set or a test set as input data for quantizing the neural network model.
Optionally, the quantization module is specifically configured to quantize a weight parameter of each layer of the neural network model in a linear quantization manner.
Optionally, the quantization module is specifically configured to:
running the network forward and counting the maximum absolute value of the input data;
determining a quantization interval of the input data based on the maximum value;
constructing a histogram of length n1 from the floating-point values of the input data, based on the quantization interval;
cyclically traversing candidate quantization thresholds, reducing the histogram of length n1 to a histogram of length n2 for each threshold, calculating a divergence value, and taking the threshold with the smallest divergence value as the optimal threshold, where n1 is larger than n2;
and determining the quantization coefficient of the input data based on the optimal threshold and the quantization interval.
Optionally, the quantization module is specifically configured to, when the neural network model includes a saturating activation function, quantize the input data of the saturating activation function using a saturation-truncation quantization method.
Optionally, the quantization module is specifically configured to, when the neural network model includes a classification function, quantize the input data of the classification function using a translation-based quantization method.
In a third aspect, an electronic device is provided, which includes a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the method for quantizing a neural network according to the first aspect when executing a program stored in the memory.
The embodiment of the invention adopts a direct quantization method: the weight parameters, the input data and the output data of each layer of the neural network model are quantized directly, and the quantized network model is obtained without retraining or other operations during quantization, which effectively improves quantization efficiency. In addition, a saturation-truncation quantization method is adopted for activation functions with a saturation threshold, which reduces the amount of computation while also improving quantization precision.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of a neural network quantization method according to an embodiment of the present invention;
fig. 2 is a block diagram of a quantization apparatus of a neural network according to an embodiment of the present invention;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Referring to fig. 1, the quantization method of a neural network according to an embodiment of the present invention may specifically include the following steps.
Step 101, quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters.
An unquantized neural network model is usually of floating-point type; weight quantization quantizes the weight parameters of the floating-point neural network model to obtain quantization coefficients of the weight parameters. Since the distribution range of the weight parameters in a floating-point neural network model is relatively fixed, the weight parameters can be quantized in a linear quantization manner.
The model quantization calculation process is illustrated below using a convolutional layer as an example. The convolutional layer before quantization is represented as:

O_f = ∑{W_f * I_f} + Bias^(c)    (1)

where O_f represents the output data, W_f the weight parameters, I_f the input data, and Bias^(c) the bias parameter.

After weight quantization, the convolutional layer can be represented as:

O_f = S^(c) * ∑{W_int8 * I_f} + Bias^(c)    (2)

where S^(c) represents the quantization coefficient of the weight parameters and W_int8 the quantized weight parameters.
The step of quantizing the weight parameters may include: first, calculating the maximum absolute value of the weight parameters; second, mapping that maximum to a preset value, where the preset value is the maximum of the quantized data range (for example, 127 when quantizing to 8 bits); and then performing linear quantization to calculate the quantization coefficient of the weight parameters.
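As an illustration, the following is a minimal Python/NumPy sketch of this linear weight quantization; the function and variable names are ours, not the patent's:

```python
import numpy as np

def quantize_weights(w_float, num_bits=8):
    """Minimal sketch of linear (symmetric) weight quantization: the
    maximum absolute weight is mapped onto the preset value (127 for
    8 bits), and the resulting scale is the quantization coefficient,
    so that w_float is approximately scale * w_q."""
    q_max = 2 ** (num_bits - 1) - 1                   # preset value, e.g. 127
    max_abs = float(np.max(np.abs(w_float)))          # step 1: max |w|
    scale = max_abs / q_max if max_abs > 0 else 1.0   # quantization coefficient S
    w_q = np.clip(np.round(w_float / scale), -q_max, q_max).astype(np.int32)
    return w_q, scale

# usage on a hypothetical convolution kernel
w = np.random.randn(64, 3, 3, 3).astype(np.float32)
w_int8, s_c = quantize_weights(w)   # s_c plays the role of S^(c) above
```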
The quantization method in the embodiment of the invention supports quantization to any bit width. For the case where the input and output values of the neural network model may be positive or negative, the quantized data range is [-64, 63] at 4 bits, [-128, 127] at 8 bits, and [-256, 255] at 16 bits; for the case where the inputs and outputs are all positive, the quantized data range is [0, 127] at 4 bits, [0, 255] at 8 bits, and [0, 511] at 16 bits. This flexible bit-width configuration allows an appropriate quantization bit width to be selected according to the storage capacity of the device and the precision requirements of the usage scenario, giving good adaptability and flexibility.
Step 102, quantizing the input data and the output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data.
In practice, a small amount of data may be selected from the training set or the test set as the input data for quantizing the neural network model. For example, if the training set contains 100,000 pictures, 1,000 of them can be selected and fed into the model, and a forward run over this data is used for quantization. The embodiment of the invention places no strict requirement on the size of the data set: selecting only part of the data is enough to complete quantization, which reduces data dependence and greatly improves the computational efficiency of the quantization process.
The model quantization calculation process continues with the convolutional layer example. When the input data I_f are not all positive, int8 quantization gives:

I_f = S_pre * I_8bit    (3)

where I_8bit represents the quantized input data and S_pre the quantization coefficient of the input data.
Substituting the quantized input data, formula (3), into the weight-quantized convolutional layer, formula (2), yields formula (4):

O_f = S^(c) * S_pre * ∑{W_int8 * I_8bit} + Bias^(c)    (4)
for output data
Figure BDA0003375847480000053
Carrying out quantization processing to obtain formula (5):
Figure BDA0003375847480000054
wherein, O8bitRepresenting the output data after quantization, ScurRepresenting the quantized coefficients of the output data.
Formula (5) is further simplified by absorbing the input and output coefficients into the layer coefficients, i.e., letting S^(c) ← S^(c) * S_pre / S_cur and Bias^(c) ← Bias^(c) / S_cur, which gives the quantized convolutional layer:

O_8bit = S^(c) * ∑{W_int8 * I_8bit} + Bias^(c)    (6)
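As a toy numerical check of the derivation from formula (1) to formula (6), the following sketch may help; all values are synthetic, and choosing S_cur from a single output value is purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
w_f = rng.standard_normal(16).astype(np.float32)   # W_f: float weights
i_f = rng.standard_normal(16).astype(np.float32)   # I_f: float inputs
bias = 0.3                                         # Bias^(c)

s_c = np.abs(w_f).max() / 127                      # S^(c), weight coefficient
s_pre = np.abs(i_f).max() / 127                    # S_pre, input coefficient
w_int8 = np.round(w_f / s_c)                       # W_int8
i_8bit = np.round(i_f / s_pre)                     # I_8bit

o_f = float(np.dot(w_f, i_f) + bias)               # formula (1)
s_cur = abs(o_f) / 127                             # S_cur (toy choice, one value)

# formula (6): absorb S_pre and S_cur into the layer coefficients
s_fold = s_c * s_pre / s_cur
bias_fold = bias / s_cur
o_8bit = s_fold * np.dot(w_int8, i_8bit) + bias_fold

print(o_8bit, o_f / s_cur)   # nearly equal, up to rounding error
```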
in an implementation, the step of quantizing the input data of each layer of the neural network model may include: calculating before running, and counting the maximum value of the absolute value of input data; determining a quantization interval of the input data based on the maximum value; constructing a histogram of length n1 based on the quantization intervals for the floating point values of the input data; circularly traversing the quantization threshold values, changing the histogram with the length of n1 into the histogram with the length of n2 for each threshold value, calculating a divergence value, and taking the threshold value with the smallest divergence value as an optimal threshold value, wherein n1 is larger than n 2; and determining a quantization coefficient of the input data based on the optimal threshold and the quantization interval.
In implementation, the Kullback-Leibler (KL) divergence can be used to determine the quantization coefficients of the input data and the output data; accordingly, the divergence value calculated in the input-quantization step above is the KL divergence.
For example, suppose the input data of the convolutional layer is quantized to 8 bits. The step of quantizing the input data of the convolutional layer then comprises: running the network forward, for example over the 1,000 selected pictures, and counting the maximum absolute value max of the input data; determining the quantization interval dist_scale of the input data based on max, where dist_scale equals max divided by 2048; constructing a histogram of length 2048 from the floating-point values of the input data, based on the quantization interval; cyclically traversing the quantization threshold th over the interval [128, 2048], reducing the length-2048 histogram to a length-128 histogram for each th, calculating the divergence value, and taking the threshold with the smallest divergence as the optimal threshold target_th; and determining the quantization coefficient scale of the input data based on target_th and dist_scale, where

scale = (target_th + 0.5) * (dist_scale / 127)
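For reference, here is a simplified Python sketch of this divergence-based threshold search. All names are illustrative, and the re-binning spreads each collapsed bin back uniformly rather than performing the zero-bin bookkeeping of a full KL calibration:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between histogram counts; both are normalized first."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + 1e-12))))

def kl_threshold_search(hist, n2=128):
    """For each candidate threshold th: saturate the tail into bin th-1,
    collapse the first th bins down to n2 bins, spread them back
    uniformly, and score against the clipped reference with the KL
    divergence; the threshold with the smallest divergence wins."""
    n1 = len(hist)
    best_th, best_kl = n1, float("inf")
    for th in range(n2, n1 + 1):
        p = hist[:th].astype(np.float64)
        p[-1] += hist[th:].sum()                 # saturate the clipped tail
        edges = np.linspace(0, th, n2 + 1).astype(int)
        q = np.zeros(th)
        for i in range(n2):
            lo, hi = edges[i], edges[i + 1]
            q[lo:hi] = hist[lo:hi].sum() / max(hi - lo, 1)
        if q.sum() == 0:
            continue
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_th = kl, th
    return best_th

# usage, following the 8-bit example above (stand-in activations)
samples = np.abs(np.random.randn(100_000)).astype(np.float32)
max_abs = samples.max()
hist, _ = np.histogram(samples, bins=2048, range=(0, max_abs))
target_th = kl_threshold_search(hist)
scale = (target_th + 0.5) * (max_abs / 2048) / 127
```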
The quantization method provided by the embodiment of the invention can be applied to a neural network voice wake-up model. A voice wake-up network model mainly consists of a recurrent neural network and a classification function, where the recurrent network is typically one of the three common variants RNN, GRU, or LSTM. If the activation function in the recurrent neural network is a saturating activation function, such as tanh or sigmoid, its calculation result has a saturation threshold; therefore, when the neural network model includes a saturating activation function, the input data of that function can be quantized using a saturation-truncation quantization method.
For a saturating activation function, when the input data exceeds the right convergence value, the function saturates to the right, i.e., its derivative is zero or tends to zero; when the input data falls below the left convergence value, the function saturates to the left in the same sense. Saturation-truncation quantization of such a function may proceed as follows: determine the left and right convergence values of the activation function; determine the maximum absolute value of the input data within the interval between them; map this maximum to a preset value, for example 127; and compute the quantization coefficient of the input data. Saturation truncation does not need to quantize all of the input data, and for the same quantization interval, quantizing a narrower range of data improves its expressive capacity; the method therefore both reduces the amount of computation and maintains higher model precision.
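A minimal sketch of saturation truncation for the input of a tanh layer follows; the convergence bounds of ±8 are an assumption for illustration (tanh is numerically saturated well inside |x| = 8):

```python
import numpy as np

def saturating_input_scale(x, left=-8.0, right=8.0, q_max=127):
    """Saturation-truncation sketch: only the part of the input that
    falls inside the activation's convergence interval matters, so the
    quantization coefficient is computed from the truncated data."""
    clipped = np.clip(x, left, right)        # truncate at the convergence values
    max_abs = float(np.abs(clipped).max())   # max |x| inside the interval
    return max_abs / q_max                   # map it onto the preset value

# usage: inputs of a tanh layer
x = (np.random.randn(4096) * 6.0).astype(np.float32)
scale = saturating_input_scale(x)
x_int8 = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
```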
For a common classification function such as softmax, the absolute values of the calculation results are not meaningful in themselves; what matters is their relative values. Therefore, when the neural network model includes a classification function, the translation invariance of the classification function can be exploited: the input data of the classification function is quantized using a translation-based quantization method, which preserves the quantization precision of the classification function.
The process of quantizing the input data of the classification function with the translation-based method may include: counting the maximum absolute value of each row of the input data matrix, mapping each row's maximum absolute value to a preset value such as 127, and calculating the quantization coefficient of each row; counting the maximum value of each row of input data; setting the offset of each row to the preset value minus that row's maximum; and translating each row, i.e., replacing each row of input data with the current data plus its corresponding offset.
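A sketch of this translation-based quantization follows; the direction of the offset (preset value minus row maximum, so that each quantized row tops out at 127) is our reading of the steps above, and all names are illustrative:

```python
import numpy as np

def quantize_softmax_input(x, q_max=127):
    """Translation-based quantization sketch for a softmax input matrix:
    each row gets its own linear quantization coefficient, then is
    shifted so that its quantized row maximum lands on q_max."""
    scales = np.abs(x).max(axis=1, keepdims=True) / q_max   # per-row coefficient
    x_q = np.round(x / np.maximum(scales, 1e-12))           # per-row quantization
    offsets = q_max - x_q.max(axis=1, keepdims=True)        # preset minus row max
    return (x_q + offsets).astype(np.int32), scales

# shift invariance: softmax(x + c) == softmax(x) row-wise, so the offsets
# leave the classification result unchanged (up to quantization error)
logits = np.random.randn(4, 10).astype(np.float32)
q_logits, s = quantize_softmax_input(logits)
```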
The process of quantizing the output data of each layer of network is similar to the process of quantizing the input data, and is not described herein again.
Step 103, for each layer of the network, determining the fixed-point value corresponding to each quantization coefficient, completing the quantization of the neural network model.
In the quantized convolutional layer of formula (6), S^(c) and Bias^(c) are floating-point values and must be converted to fixed-point values for computation on a chip; for example, quantization is completed by converting the quantization coefficients to 8-bit or 16-bit fixed point.
Taking 16-bit fixed-point conversion of the quantization coefficients in formula (6) as an example, the fixed-point procedure is as follows.
The result of ∑{W_int8 * I_8bit} is stored with saturation as Sum_int32. A 16-bit fixed-point conversion is applied to S^(c), recording the fixed-point shift as q_scale, so that the fixed-point value is

S_int16 = round(S^(c) * 2^q_scale)

In hardware, the 16-bit by 32-bit product S_int16 * Sum_int32 is too expensive, so Sum_int32 can be further reduced to Sum_int16 by recording a shift q_sum, after which the product S_int16 * Sum_int16 is computed instead.
A 16-bit fixed-point conversion is applied to (S^(c) * ∑{W_int8 * I_8bit} + Bias^(c)) and to Bias^(c), recording the shift as q_rst.
Running the network forward, the maximum absolute value of ∑{W_int8 * I_8bit} is counted and recorded as Max_sum; the maximum absolute value of (S^(c) * ∑{W_int8 * I_8bit} + Bias^(c)) is recorded as Max_rst; the maximum of S^(c) is recorded as Max_scale; and the maximum of Bias^(c) is recorded as Max_bias.

A 16-bit left-shift fixed point is applied to Max_scale to obtain q_scale; a 16-bit left-shift fixed point is applied to max(Max_rst, Max_bias) to obtain q_rst; and a 16-bit right-shift fixed point is applied to Max_sum to obtain q_sum. If Max_sum already lies within the int16 range, q_sum is 0.
Parameter preprocessing is then carried out:

S_int16 = round(S^(c) * 2^q_scale)
Bias_int16 = round(Bias^(c) * 2^q_rst)
q1 = q_sum;  q2 = q_scale - q_sum - q_rst;  q3 = q_rst
Fixed-point forward calculation:

Sum_int32 = Clip_int32(∑{W_int8 * I_8bit})
Sum_int16 = Clip_int16(Sum_int32 >> q1)
Rst_int32 = S_int16 * Sum_int16
O_8bit = Clip_int8(((Rst_int32 >> q2) + Bias_int16) >> q3)

or, equivalently,

O_8bit = Clip_int8((Rst_int32 + (Bias_int16 << q2)) >> (q2 + q3))
After the coefficients are converted to fixed point, the quantized fixed-point neural network model is obtained.
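A hedged Python/NumPy sketch of this fixed-point forward computation follows; the fixed-point parameters in the usage lines are illustrative assumptions, not values from the patent:

```python
import numpy as np

def clip_int(x, bits):
    """Saturate to a signed `bits`-wide integer range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return np.clip(x, lo, hi).astype(np.int64)

def fixed_point_forward(w_int8, i_int8, s_int16, bias_int16, q1, q2, q3):
    """One output value of the fixed-point forward pass above, assuming
    s_int16 ~ S^(c) * 2^q_scale, bias_int16 ~ Bias^(c) * 2^q_rst, and
    non-negative shifts q1 = q_sum, q2 = q_scale - q_sum - q_rst, q3 = q_rst."""
    acc = w_int8.astype(np.int64) @ i_int8.astype(np.int64)  # sum{W_int8 * I_8bit}
    sum32 = clip_int(acc, 32)                                # Sum_int32
    sum16 = clip_int(sum32 >> q1, 16)                        # Sum_int16
    rst = (s_int16 * sum16) >> q2    # product brought to the 2^q_rst scale
    out = (rst + bias_int16) >> q3   # remove the remaining 2^q_rst factor
    return clip_int(out, 8)                                  # O_8bit

# illustrative parameters (not taken from the patent)
w = np.random.randint(-128, 128, size=16).astype(np.int8)
x = np.random.randint(-128, 128, size=16).astype(np.int8)
o = fixed_point_forward(w, x, s_int16=23170, bias_int16=512, q1=0, q2=7, q3=8)
```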
The embodiment of the invention adopts a direct quantization method: the weight parameters, the input data and the output data of each layer of the neural network model are quantized directly, and the quantized network model is obtained without retraining or other operations during quantization, which effectively improves quantization efficiency. In addition, a saturation-truncation quantization method is adopted for activation functions with a saturation threshold, which reduces the amount of computation while also improving quantization precision.
Referring to fig. 2, a block diagram of a neural network quantization apparatus according to an embodiment of the present invention is shown, where the apparatus includes:
a quantization module 201, configured to quantize a weight parameter of each layer in the neural network model to obtain a quantization coefficient of the weight parameter;
the quantization module 201 is further configured to quantize input data and output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
the fixed point module 202 is configured to determine, for each layer of the network, a fixed point value corresponding to each quantized coefficient, and complete quantization of the neural network model.
Preferably, the quantization module 201 is further configured to select a part of the data from a training set or a test set as the input data for quantizing the neural network model.
Preferably, the quantization module 201 is specifically configured to quantize the weight parameter of each layer in the neural network model in a linear quantization manner.
Preferably, the quantization module 201 is specifically configured to:
running the network forward and counting the maximum absolute value of the input data;
determining a quantization interval of the input data based on the maximum value;
constructing a histogram of length n1 from the floating-point values of the input data, based on the quantization interval;
cyclically traversing candidate quantization thresholds, reducing the histogram of length n1 to a histogram of length n2 for each threshold, calculating a divergence value, and taking the threshold with the smallest divergence value as the optimal threshold, where n1 is larger than n2;
and determining the quantization coefficient of the input data based on the optimal threshold and the quantization interval.
Preferably, the quantization module 201 is specifically configured to, when the neural network model includes a saturating activation function, quantize the input data of the saturating activation function using a saturation-truncation quantization method.
Preferably, the quantization module 201 is specifically configured to, when the neural network model includes a classification function, quantize the input data of the classification function using a translation-based quantization method.
The embodiment of the invention adopts a direct quantization method: the weight parameters, the input data and the output data of each layer of the neural network model are quantized directly, and the quantized network model is obtained without retraining or other operations during quantization, which effectively improves quantization efficiency. In addition, a saturation-truncation quantization method is adopted for activation functions with a saturation threshold, which reduces the amount of computation while also improving quantization precision.
An embodiment of the present invention further provides an electronic device, as shown in fig. 3, including a processor 001, a communication interface 002, a memory 003 and a communication bus 004, where the processor 001, the communication interface 002 and the memory 003 complete mutual communication through the communication bus 004,
a memory 003 for storing a computer program;
the processor 001 is configured to implement the method for quantizing a neural network when executing the program stored in the memory 003, and the method includes:
quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters;
quantizing the input data and the output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
and for each layer of network, determining a fixed point value corresponding to each quantization coefficient, and completing the quantization of the neural network model.
The embodiment of the invention adopts a direct quantization method: the weight parameters, the input data and the output data of each layer of the neural network model are quantized directly, and the quantized network model is obtained without retraining or other operations during quantization, which effectively improves quantization efficiency. In addition, a saturation-truncation quantization method is adopted for activation functions with a saturation threshold, which reduces the amount of computation while also improving quantization precision.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions; when the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)).
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the embodiments of the apparatus and the electronic device, since they are substantially similar to the embodiments of the method, the description is simple, and the relevant points can be referred to only in the partial description of the embodiments of the method.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A quantization method of a neural network, the method comprising:
quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters;
quantizing the input data and the output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
and for each layer of network, determining a fixed point value corresponding to each quantization coefficient, and completing the quantization of the neural network model.
2. The method of claim 1, wherein before the step of quantizing the input data and the output data of each layer of the neural network model, the method further comprises:
and selecting partial data from the training set or the test set as input data for quantifying the neural network model.
3. The method of claim 1, wherein the step of quantizing the weight parameters of each layer of the neural network model comprises:
and quantizing the weight parameters of each layer of the neural network model by adopting a linear quantization mode.
4. The method of claim 1, wherein the step of quantizing the input data of each layer of the neural network model comprises:
running the network forward and counting the maximum absolute value of the input data;
determining a quantization interval of the input data based on the maximum value;
constructing a histogram of length n1 from the floating-point values of the input data, based on the quantization interval;
cyclically traversing candidate quantization thresholds, reducing the histogram of length n1 to a histogram of length n2 for each threshold, calculating a divergence value, and taking the threshold with the smallest divergence value as the optimal threshold, where n1 is larger than n2;
and determining the quantization coefficient of the input data based on the optimal threshold and the quantization interval.
5. The method of claim 1, wherein when the neural network model includes a saturating activation function, the input data of the saturating activation function is quantized using a saturation-truncation quantization method.
6. The method of claim 1, wherein when the neural network model includes a classification function, the input data of the classification function is quantized using a translation-based quantization method.
7. An apparatus for quantization of a neural network, the apparatus comprising:
the quantization module is used for quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters;
the quantization module is further configured to quantize input data and output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
and the fixed point module is used for determining a fixed point value corresponding to each quantization coefficient for each layer of network to finish the quantization of the neural network model.
8. The apparatus of claim 7, wherein the quantization module is further configured to select a portion of data from a training set or a test set as input data for quantizing the neural network model.
9. The apparatus according to claim 7, wherein the quantization module is specifically configured to quantize the weight parameter of each layer in the neural network model by using linear quantization.
10. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.
CN202111418068.8A (priority date 2021-11-25; filing date 2021-11-25) Neural network quantization method and device; published as CN114372553A. Status: Pending.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111418068.8A CN114372553A (en) 2021-11-25 2021-11-25 Neural network quantization method and device


Publications (1)

Publication Number Publication Date
CN114372553A 2022-04-19

Family

ID=81139075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111418068.8A (pending) 2021-11-25 2021-11-25 Neural network quantization method and device

Country Status (1)

Country Link
CN (1) CN114372553A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108896A (en) * 2023-04-11 2023-05-12 上海登临科技有限公司 Model quantization method, device, medium and electronic equipment
CN116108896B (en) * 2023-04-11 2023-07-07 上海登临科技有限公司 Model quantization method, device, medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination