CN111882058A

CN111882058A - 4-bit quantization method and system of neural network

Info

Publication number: CN111882058A
Application number: CN202010589233.5A
Authority: CN
Inventors: 王曦辉
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2020-11-03
Also published as: WO2021258752A1

Abstract

The application discloses a 4-bit quantization method and a system of a neural network, wherein the method comprises the following steps: loading a pre-training model of the neural network; counting initial values of satRelu of each saturation activation layer in a pre-training model; adding a pseudo-quantization node in the neural network, and retraining the neural network by using an initial value of satRelu to obtain a pseudo-quantization model; judging whether the precision of the pseudo-quantization model converges to a set precision or not; if yes, carrying out reasoning pretreatment on the pseudo-quantization model, and converting the pseudo-quantization model into a 4-bit reasoning model for reasoning operation; otherwise, returning to retraining the neural network. The system mainly comprises: the device comprises a loading module, a statistic module, a retraining module, a judging module and a converting module. Through the method and the device, the training efficiency can be effectively improved on the basis of ensuring the accuracy of the training result.

Description

4-bit quantization method and system of neural network

Technical Field

The present application relates to the field of neural network model compression technologies, and in particular, to a 4-bit quantization method and system for a neural network.

Background

In a Neural Network, a Neural Network model generally occupies a large disk space, for example, the model file of AlexNet exceeds 200 MB. The model contains millions of parameters and a significant portion of disk space is used to store the model parameters. Because the model parameters are of a floating point type, the space of the model parameters is difficult to compress by a common compression algorithm, so that the model quantization is introduced, the original network is compressed by reducing the bit number required by representing each weight, and the running speed of the network can be greatly improved. Therefore, how to quantify the neural network is an important technical problem.

At present, the mainstream method of neural network quantization is 8-bit quantization, and most training and reasoning frameworks support 8-bit quantization. However, compared with 8-bit quantization, 4-bit quantization can compress the volume of the model by 1 time on the basis of 8 bits, and the running speed is improved by 50%. Therefore, 4-bit quantization is also gradually gaining attention.

The current 4-bit quantization algorithm usually trains a network from the beginning until the whole network is trained, and a large data set such as imagenet generally needs to be trained for more than 100 cycles. And the model precision is improved by adopting a nonlinear quantization mode.

However, in the current 4-bit quantization algorithm, the quantization efficiency is low because the number of quantization cycles is large and the training time is long.

Disclosure of Invention

The application provides a 4-bit quantization method and a system of a neural network, which aim to solve the problem of low quantization efficiency of the neural network quantization method in the prior art.

In order to solve the technical problem, the embodiment of the application discloses the following technical scheme:

a 4-bit quantization method of a neural network, the method comprising:

loading a pre-training model of the neural network;

counting initial values of satRelu of each saturation activation layer in the pre-training model;

adding a pseudo-quantization node in the neural network, and retraining the neural network by using an initial value of satRelu to obtain a pseudo-quantization model;

judging whether the precision of the pseudo quantization model converges to a set precision or not;

if yes, carrying out pre-inference processing on the pseudo quantization model, and converting the pseudo quantization model into a 4-bit inference model for inference operation, wherein the pre-inference processing process comprises the following steps: constant folding, secondary quantization and activating equivalent transformation;

if not, retraining the neural network is continued.

Optionally, in the pre-training model, the method for counting the initial value of each saturation activation layer satRelu includes:

replacing all active layers relu in the neural network with saturated active layers satRelu;

acquiring an activation value of each saturation activation layer satRelu according to the acquired command;

according to the activation value, counting distribution data by utilizing a histogram;

selecting an activation value located at 99.999% of a point in the histogram as an initial value of a parameter max in a saturation activation layer satRelu, wherein the satRelu is defined as:

and in back propagation, the gradient of satRelu for parameter max is:

the gradient of satRelu for input x is

max is the maximum value of the saturation active layer satRelu.

Optionally, during the retraining process of the neural network, the parameter max is compressed by using a regularization method of L2.

Optionally, the retraining period is less than or equal to 10, and the value of the parameter max is less than or equal to 1.

Optionally, the method for adding a pseudo quantization node in the neural network, and performing retraining on the neural network by using an initial value of satRelu to obtain a pseudo quantization model includes:

inserting a weight pseudo-quantization layer before a weight layer of the neural network and an activation pseudo-quantization layer before an activation layer;

retraining the neural network on the weight pseudo-quantization layer by using a formula y equal to quant (w equal to clip)/scale, wherein w is a weight value, n is 4, and a scale coefficient scale is:

retraining the neural network on the activated pseudo quantization layer by using the formula y equal to quant (x) clip (round (x scale))/scale, wherein x is the activation value of each layer, max is the maximum value of satRelu, n is 4, and the scaling coefficient scale is:

optionally, in retraining the neural network, the back propagation process uses a straight-through estimator to calculate the gradient.

A 4-bit quantization system for a neural network, the system comprising:

the loading module is used for loading the pre-training model of the neural network;

the statistical module is used for counting the initial value of each saturation activation layer satRelu in the pre-training model;

the retraining module is used for adding a pseudo quantization node in the neural network, retraining the neural network by using an initial value of the satRelu, and acquiring a pseudo quantization model;

the judging module is used for judging whether the precision of the pseudo quantization model converges to the set precision or not;

a conversion module, configured to perform pre-inference processing on the pseudo quantization model when the accuracy of the pseudo quantization model converges to a set accuracy, and convert the pseudo quantization model into a 4-bit inference model that can be used for inference operation, where the pre-inference processing includes: constant folding, quadratic quantization and activating equivalent transformation.

Optionally, the statistics module includes:

a replacement unit, configured to replace all active layers relu in the neural network with saturated active layers satRelu;

an activation value acquisition unit, configured to acquire an activation value of each saturation activation layer satRelu according to the acquired command;

a statistical unit for counting distribution data by using a histogram according to the activation value;

an initial value selecting unit, configured to select an activation value located at 99.999% of a point in the histogram as an initial value of a parameter max in a saturation activation layer satRelu, where satRelu is defined as:

and in back propagation, the gradient of satRelu for parameter max is:

the gradient of satRelu for input x is

max is the maximum value of the saturation active layer satRelu.

Optionally, the retraining module comprises:

a pseudo quantization layer insertion unit for inserting a weight pseudo quantization layer before a weight layer of the neural network and inserting an activation pseudo quantization layer before an activation layer;

a first retraining unit, configured to retrain the neural network for the weight pseudo quantization layer by using a formula y equal to quant (w) clip (w equal scale)/scale, where w is a weight value, n is 4, and a scaling coefficient scale is:

and a second retraining unit, configured to retrain the neural network for the activated pseudo quantization layer by using a formula y equal to quant (x) clip (round (x)/scale), where x is an activation value of each layer, max is a maximum value of satRelu, n is 4, and a scaling factor scale is:

optionally, the pre-inference processing module includes:

a constant folding unit for fusing the batch norm layer into a convolution, whichThe convolution formula is: z is w x + b, and the formula of batchNorm is:

the calculation formula of the new convolution after combination is as follows:

the secondary quantization unit is used for carrying out secondary quantization on the weight to obtain a quantization scale coefficient scale of the weight after the secondary quantization;

activating equivalent transformation units for using formulas

And performing equivalent transformation on the activation.

The technical scheme provided by the embodiment of the application can have the following beneficial effects:

the application provides a 4-bit quantification method of a neural network, the quantification method is characterized by firstly loading a pre-training model of the neural network, counting initial values of various satreu layers in the pre-training model, adding a pseudo quantification node in the neural network, retraining the neural network by using the initial values of the satreu, acquiring the pseudo quantification model through retraining for a plurality of periods, converting the pseudo quantification model into a 4-bit reasoning model by using a reasoning method, completing a full 4-bit reasoning process before reasoning by using a reasoning algorithm, being beneficial to improving the calculation speed, ensuring that a final reasoning model can be directly applied to a 4-bit GPU, supporting the 4-bit GPU operation and being beneficial to improving the practicability of the invention. The method is carried out in a pseudo-quantization mode during retraining, a corresponding pseudo-quantization layer is inserted before a weight layer and an activation layer and is used for simulating the influence of model quantization on the whole neural network, the model is learned and adapted to the influence on the neural network by means of training, so that the accuracy of the quantization model can be greatly improved, the precision loss of the obtained model can be controlled within 1%, and the 4-bit quantization result is ensured to meet the precision requirement. Moreover, the period of retraining in this embodiment is 10, i.e.: the neural network is retrained for 10 times, so that the training time is greatly saved, the training efficiency can be effectively improved on the basis of ensuring the precision, and further the 4-bit quantization efficiency of the neural network is improved.

In the embodiment, the neural network is retrained, and the weight layer and the activation layer are subjected to linear pseudo-quantization, so that the calculation speed is increased on the basis of ensuring the training precision, and the quantization efficiency of the neural network is effectively improved.

The application also provides a 4-bit quantization system of the neural network. The system mainly comprises: the device comprises a loading module, a statistic module, a retraining module, a judging module and a converting module. Through the loading module and the retraining module, a pre-training model can be loaded in a retraining mode, retraining is carried out on the pre-training model for several cycles, a pseudo-quantization model is obtained, and when the judging module judges that the precision of the pseudo-quantization model converges to the set precision, the conversion module is started to carry out pre-inference processing on the pseudo-quantization model and convert the pseudo-quantization model into a 4-bit inference model for inference operation. In the embodiment, the training time can be effectively saved by adopting less retraining times, generally 10 periods, so that the quantization efficiency of the neural network is greatly improved. And the pseudo-quantization model is processed in the inference state by utilizing the pre-inference processing module, so that all 4-bit inference can be completed before inference, the calculation speed can be further improved, and the quantization efficiency can be improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flowchart of a 4-bit quantization method of a neural network according to an embodiment of the present disclosure;

fig. 2 is a schematic structural diagram of a 4-bit quantization system of a neural network according to an embodiment of the present disclosure.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For a better understanding of the present application, embodiments of the present application are explained in detail below with reference to the accompanying drawings.

Example one

Referring to fig. 1, fig. 1 is a schematic flowchart of a 4-bit quantization method of a neural network according to an embodiment of the present disclosure. As can be seen from fig. 1, the 4-bit quantization method of the neural network in this embodiment mainly includes the following steps:

s1: a pre-trained model of the neural network is loaded.

S2: in the pre-training model, the initial value of each saturation activation layer satRelu is counted.

Specifically, step S2 further includes:

s21: all the active layers relu in the neural network are replaced with saturated active layers satRelu.

S22: and acquiring the activation value of each saturated activation layer satRelu according to the acquired command.

S23: the distribution data is counted using the histogram according to the activation value.

S24: the activation value at 99.999% of the point in the histogram is selected as the initial value of the parameter max in the saturation activation layer satRelu. Wherein satRelu is defined as:

and in back propagation, the gradient of satRelu for parameter max is:

the gradient of satRelu for input x is

As shown in steps S1 and S2, the pre-training model of the neural network is loaded before training, and the pre-training model can be implemented by running a network script. The distribution condition of the activation values of each layer is counted by running a network script, 4096 sampling points can be adopted to count the histogram distribution of the activation values of each layer, and the activation value of the point position which is 99.999% of the histogram distribution is selected as the initial value of the unique parameter max in the satralu for subsequent iterative training.

In this embodiment, the lower limit of satRelu is 0, the lower limit of satRelu is max, that is, the maximum value of satRelu is max, and max is a variable, and the maximum value gradually decreases in the neural network training process, where an initial value of max needs to be obtained.

In this embodiment, during training, all relu layers in the original network need to be replaced by satRelu layers, where satRelu is defined as follows:

wherein max is the initial value of each layer counted by the histogram. In backpropagation, the gradient of satRelu for max is:

the gradient for input x is:

further, in the retraining process of the neural network, for the max parameter, the L2 regularization is adopted to compress the max parameter, so that the quantization error can be effectively reduced, and the network precision can be increased.

In this embodiment, the retraining period is not greater than 10, and the value of the parameter max is not greater than 1, wherein the preferable value is: the retraining period is 10 and the parameter max is 1. The value setting of the parameter max in the embodiment can ensure that each layer outputs properly, and is beneficial to improving the network precision. In the training process, the max value needs to be observed, and a proper regularization parameter is adopted to ensure that the max value is close to 1 when the network converges.

With continued reference to fig. 1, after counting the initial values of the saturation activation layers satRelu, step S3 is executed: and adding a pseudo quantization node in the neural network, retraining the neural network by using the initial value of the satRelu, and acquiring a pseudo quantization model.

In this embodiment, a pseudo quantization node is added to a neural network, specifically: pseudo-quantization nodes are added to the convolved inputs and weights of the neural network, and to the fully connected inputs and weights of the neural network.

S31: a weight pseudo-quantization layer is inserted before a weight layer of the neural network and an activation pseudo-quantization layer is inserted before an activation layer.

In the model training of this embodiment, the whole neural network is trained in a pseudo-quantization manner, that is: in the conventional neural network, a pseudo-quantization layer is inserted before a weight layer and an activation layer, the influence of model quantization on the whole neural network can be simulated through the weight pseudo-quantization layer and the activation pseudo-quantization layer, and the model is learned and adapted to the influence through training, so that the accuracy of the most quantized model can be effectively improved.

S32: the weight pseudo quantization layer is retrained by the neural network using the formula y equal quant (w equal clip)/scale.

Wherein w is a weighted value, n is 4, and the scale coefficient scale is:

in this embodiment, after retraining the neural network on the weight pseudo-quantization layer, a pseudo-quantization model is obtained, and a weight value in the pseudo-quantization model changes due to the retraining. In addition, in the embodiment, 4-bit quantization is performed on the neural network, and the value range after quantization is-8 to 7, so that the upper limit of clip is 7, the lower limit is-8, a round function is used for rounding the input, and the value of scale is obtained in a manner that each output channel shares one scale.

S33: retraining the neural network for the activated pseudo quantization layer using the formula (equal (x)) clip/scale.

Wherein x is the activation value of each layer, max is the maximum value of satRelu, n takes the value of 4, and the proportionality coefficient scale is as follows:

since a relu layer is arranged after each layer of activation, the activation value cannot have a negative value, the value range after quantization is between 0 and 15, the upper limit of clip is 15, the lower limit of clip is 0, and the activation needs to be mapped to between-8 and 7 during reasoning subsequently, so that GPU calculation is facilitated, and the value of scale is a mode that each layer shares one scale.

In this embodiment, when retraining the neural network, the back propagation process uses a straight-through estimator to calculate a gradient, i.e., STE. Specifically, because the quantization function is a discrete non-derivable function, a straight-through estimator is adopted to calculate the gradient, and a network quantized bit number is gradually reduced to 4 bits, and the quantized bit number n is 8, 6, 5, 4, namely, training is started from 8bit quantization, and the bit number is gradually reduced to 4 bits in the training process. By using the method, the problem of gradient mismatching can be reduced, and the network can gradually adapt to errors caused by quantization, so that the operation precision and the quantization efficiency are effectively improved.

As can be seen from fig. 1, in this embodiment, after retraining the neural network and obtaining the pseudo quantization model, step S4 is executed: and judging whether the precision of the pseudo quantization model converges to the set precision.

If the accuracy of the pseudo-quantization model converges to the set accuracy, step S5 is performed: and (4) carrying out reasoning pretreatment on the pseudo quantization model, and converting into a 4-bit reasoning model for reasoning operation.

In this embodiment, the training model is reloaded first, and the satRelu layer is changed to the normal Relu layer again, so as to ensure that the network structure is not changed. The process of the pre-inference treatment comprises the following steps: constant folding, quadratic quantization and activating equivalent transformation.

Specifically, during the training process, the batchNorm layer does not perform any processing, and the accuracy of the network is not affected. After training, the batch norm layer needs to be fused into convolution, and the calculation formula of the constant folding process comprises the following steps:

1) and (3) convolution calculation: z ═ w x + b

2) Calculation process of batchNorm:

3) the calculation process of the new convolution after combination adopts a formula:

and

because the weight parameter changes after the batchnorm layer is folded, the calculation precision can be improved by carrying out secondary quantization on the weight in the embodiment. The quantization method is consistent with the quantization mode of the weights during training, the quantization scales of the weights are obtained again, and the scales obtained by training are still adopted when the activated scales are not changed.

In the embodiment, the TenscorCore can support 4-bit operation on the GPU, so that inference calculation can be carried out by using a TenscorCore code, but the TenscorCore only supports the operation of signed 4bit and 4bi t or unsigned 4bit and unsigned 4bit, but the network activation after training is unsigned 4bit number, the weight is signed 4bi t number, and the activation is equivalently transformed, so that the GPU operation can be supported, and the practicability of the method is improved. Specific activation equivalence transformations: the activation is subtracted by the median value 8, transformed into signed 4bi t data and then reasoned. Equivalent transformation of convolution

According to the convolution equivalent transformation formula, the inference calculation process of convolution is divided into the following steps:

1) the pad operation is first performed on the convolved inputs.

2) The input and weights are multiplied by respective scales to obtain a 4-bit quantization value. I.e. W_q＝scale_wW，x_q＝scale_a(x-8)。

3) The weight and the input are in an int32 format, but the data range is int4, namely the data range is between-8 and 7, and int4, which is a data type, is not present in the cpu, so that the weight and the activation of the data with the lower 4 bits need to be respectively performed, 8 data are spliced into an int32 through shift operation, the space occupied by data storage is reduced, and the GPU can conveniently fetch data for operation.

4) Performing a 4-bit convolution operation, i.e. W_q*x_qThen applying an offset W_q*scale_a*8+b_q。

5) Inverse quantization, i.e. division by scale_w*scale_aSince the inverse quantization operation can be folded and combined with the convolution input of the next layer, the operation can be hidden.

Finally, some overall checking of the network is required: if the constants are folded, some constants are combined for calculation, after redundant operators are removed, the model is stored, the model is directly stored, namely, the model is the whole int4 inference model, and the model can be used for inference calculation.

If the accuracy of the pseudo-quantization model does not converge to the set accuracy, the process returns to step S3, and retraining the neural network is performed again, and a new pseudo-quantization model is obtained until the accuracy of the new pseudo-quantization model converges to the set accuracy.

Example two

Referring to fig. 2 based on the embodiment shown in fig. 1, fig. 2 is a schematic structural diagram of a 4-bit quantization system of a neural network according to an embodiment of the present disclosure. As can be seen from fig. 2, the 4-bit quantization system of the neural network in this embodiment mainly includes: the device comprises a loading module, a statistic module, a retraining module, a judging module and a converting module.

The loading module is used for loading a pre-training model of the neural network; the statistical module is used for counting the initial value of each saturation activation layer satRelu in the pre-training model; the retraining module is used for adding a pseudo quantization node in the neural network, retraining the neural network by using an initial value of the satRelu, and acquiring a pseudo quantization model; the judging module is used for judging whether the precision of the pseudo quantization model converges to the set precision; a conversion module, configured to perform pre-inference processing on the pseudo quantization model when the accuracy of the pseudo quantization model converges to a set accuracy, and convert the pseudo quantization model into a 4-bit inference model for inference operation, where the pre-inference processing process includes: constant folding, quadratic quantization and activating equivalent transformation.

Further, the statistics module includes: the device comprises a replacing unit, an activation value acquiring unit, a counting unit and an initial value selecting unit. Wherein, the replacing unit is used for replacing all the activation layers relu in the neural network with saturated activation layers satRelu; an activation value acquisition unit, configured to acquire an activation value of each saturation activation layer satRelu according to the acquired command; a statistical unit for counting distribution data by using a histogram according to the activation value; an initial value selecting unit, configured to select an activation value located at 99.999% of a point in the histogram as an initial value of a parameter max in a saturation activation layer satRelu, where satRelu is defined as:

and in back propagation, the gradient of satRelu for parameter max is:

the gradient of satRelu for input x is

The retraining module comprises: a pseudo quantization layer insertion unit, a first retraining unit and a second retraining unit. The pseudo quantization layer inserting unit is used for inserting a weight pseudo quantization layer before a weight layer of the neural network and inserting an activation pseudo quantization layer before an activation layer; a first retraining unit, configured to retrain the neural network for the weight pseudo quantization layer by using a formula y equal to quant (w) clip (w equal scale)/scale, where w is a weight value, n is 4, and a scaling coefficient scale is:

the pre-inference processing module comprises: the device comprises a constant folding unit, a quadratic quantization unit and an active equivalent transformation unit. The constant folding unit is used for fusing the batch norm layer into convolution, wherein the calculation formula of the convolution is as follows: z is w x + b, and the formula of batchNorm is:

the calculation formula of the new convolution after combination is as follows:

the secondary quantization unit is used for carrying out secondary quantization on the weight to obtain a quantization scale coefficient scale of the weight after the secondary quantization; activating equivalent transformation units for using formulas

And performing equivalent transformation on the activation.

The working principle and working method of the 4-bit quantization system of the neural network in this embodiment have already been described in detail in the embodiment shown in fig. 1, and are not repeated here.

The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of 4-bit quantization for a neural network, the method comprising:

loading a pre-training model of the neural network;

if not, retraining the neural network is continued.

2. The method of claim 1, wherein the method for counting initial values of satRelu in each saturation activation layer in the pre-training model comprises:

and in back propagation, the gradient of satRelu for parameter max is:

the gradient of satRelu for input x is

max is the maximum value of the saturation active layer satRelu.

3. The 4-bit quantization method of the neural network as claimed in claim 2, wherein the parameter max is compressed by a regularization method of L2 during retraining of the neural network.

4. The 4-bit quantization method of the neural network as claimed in claim 2, wherein the retraining period is less than or equal to 10, and the value of the parameter max is less than or equal to 1.

5. The method according to claim 1, wherein the method for obtaining the pseudo quantization model by adding the pseudo quantization node in the neural network and retraining the neural network with an initial value of satRelu comprises:

6. a4-bit quantization method for neural networks according to any of claims 2-4, characterized in that the back-propagation process uses a pass-through estimator to calculate the gradient when retraining the neural network.

7. A 4-bit quantization system for a neural network, the system comprising:

8. The 4-bit quantization system of a neural network of claim 7, wherein the statistics module comprises:

and in back propagation, the gradient of satRelu for parameter max is:

the gradient of satRelu for input x is

max is the maximum value of the saturation active layer satRelu.

9. The 4-bit quantization system of a neural network of claim 7, wherein the retraining module comprises:

10. the 4-bit quantization system of a neural network of claim 7, wherein the pre-inference processing module comprises:

and the constant folding unit is used for fusing the batch norm layer into convolution, wherein the calculation formula of the convolution is as follows: z is w x + b, and the formula of batchNorm is:

the calculation formula of the new convolution after combination is as follows:

activating equivalent transformation units for using formulas

And performing equivalent transformation on the activation.