CN111882058A - 4-bit quantization method and system of neural network - Google Patents

4-bit quantization method and system of neural network

Info

Publication number
CN111882058A
CN111882058A
Authority
CN
China
Prior art keywords
quantization
neural network
pseudo
satrelu
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010589233.5A
Other languages
Chinese (zh)
Inventor
王曦辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010589233.5A priority Critical patent/CN111882058A/en
Publication of CN111882058A publication Critical patent/CN111882058A/en
Priority to PCT/CN2021/076982 priority patent/WO2021258752A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application discloses a 4-bit quantization method and system for a neural network. The method comprises the following steps: loading a pre-trained model of the neural network; counting the initial values of satRelu for each saturation activation layer in the pre-trained model; adding pseudo-quantization nodes to the neural network and retraining the network with the initial values of satRelu to obtain a pseudo-quantization model; judging whether the accuracy of the pseudo-quantization model converges to a set accuracy; if yes, performing pre-inference processing on the pseudo-quantization model and converting it into a 4-bit inference model for inference operations; otherwise, returning to retrain the neural network. The system mainly comprises: a loading module, a statistics module, a retraining module, a judging module and a conversion module. With the method and system, training efficiency can be effectively improved while the accuracy of the training result is guaranteed.

Description

4-bit quantization method and system of neural network
Technical Field
The present application relates to the field of neural network model compression technologies, and in particular, to a 4-bit quantization method and system for a neural network.
Background
A neural network model generally occupies a large amount of disk space; for example, the model file of AlexNet exceeds 200 MB. A model contains millions of parameters, and a significant portion of the disk space is used to store them. Because the model parameters are floating-point values, ordinary compression algorithms can hardly compress them, which is why model quantization is introduced: the original network is compressed by reducing the number of bits required to represent each weight, which can also greatly increase the running speed of the network. How to quantize a neural network is therefore an important technical problem.
At present, the mainstream method of neural network quantization is 8-bit quantization, and most training and inference frameworks support it. Compared with 8-bit quantization, however, 4-bit quantization halves the model size and improves the running speed by about 50%. For this reason, 4-bit quantization is gradually gaining attention.
Current 4-bit quantization algorithms usually train a network from scratch until the whole network converges; for a large dataset such as ImageNet this generally requires more than 100 training epochs. Nonlinear quantization is typically adopted to improve model accuracy.
However, current 4-bit quantization algorithms require many training epochs and long training times, so the quantization efficiency is low.
Disclosure of Invention
The application provides a 4-bit quantization method and a system of a neural network, which aim to solve the problem of low quantization efficiency of the neural network quantization method in the prior art.
In order to solve the technical problem, the embodiment of the application discloses the following technical scheme:
a 4-bit quantization method of a neural network, the method comprising:
loading a pre-training model of the neural network;
counting initial values of satRelu of each saturation activation layer in the pre-training model;
adding a pseudo-quantization node in the neural network, and retraining the neural network by using an initial value of satRelu to obtain a pseudo-quantization model;
judging whether the precision of the pseudo quantization model converges to a set precision or not;
if yes, performing pre-inference processing on the pseudo-quantization model and converting it into a 4-bit inference model for inference operations, wherein the pre-inference processing comprises: constant folding, secondary quantization and activation equivalent transformation;
if not, retraining the neural network is continued.
Optionally, in the pre-training model, the method for counting the initial value of each saturation activation layer satRelu includes:
replacing all active layers relu in the neural network with saturated active layers satRelu;
acquiring an activation value of each saturation activation layer satRelu according to the acquired command;
according to the activation value, counting distribution data by utilizing a histogram;
selecting the activation value at the 99.999th percentile of the histogram as the initial value of the parameter max in the saturation activation layer satRelu, wherein satRelu is defined as:
satRelu(x) = 0 for x < 0; x for 0 ≤ x ≤ max; max for x > max
and in back propagation, the gradient of satRelu with respect to the parameter max is:
∂satRelu/∂max = 1 for x > max, and 0 otherwise;
and the gradient of satRelu with respect to the input x is:
∂satRelu/∂x = 1 for 0 ≤ x ≤ max, and 0 otherwise;
where max is the maximum value of the saturation activation layer satRelu.
Optionally, during the retraining of the neural network, the parameter max is compressed using L2 regularization.
Optionally, the number of retraining epochs is less than or equal to 10, and the value of the parameter max is less than or equal to 1.
Optionally, the method for adding a pseudo quantization node in the neural network, and performing retraining on the neural network by using an initial value of satRelu to obtain a pseudo quantization model includes:
inserting a weight pseudo-quantization layer before a weight layer of the neural network and an activation pseudo-quantization layer before an activation layer;
retraining the neural network on the weight pseudo-quantization layer using the formula y = quant(w) = clip(round(w·scale), −8, 7)/scale, wherein w is the weight value, n is 4, and the scale coefficient scale is:
scale = (2^(n−1) − 1) / max(|w|)
retraining the neural network on the activation pseudo-quantization layer using the formula y = quant(x) = clip(round(x·scale), 0, 15)/scale, wherein x is the activation value of each layer, max is the maximum value of satRelu, n is 4, and the scale coefficient scale is:
scale = (2^n − 1) / max
optionally, in retraining the neural network, the back propagation process uses a straight-through estimator to calculate the gradient.
A 4-bit quantization system for a neural network, the system comprising:
the loading module is used for loading the pre-training model of the neural network;
the statistical module is used for counting the initial value of each saturation activation layer satRelu in the pre-training model;
the retraining module is used for adding a pseudo quantization node in the neural network, retraining the neural network by using an initial value of the satRelu, and acquiring a pseudo quantization model;
the judging module is used for judging whether the precision of the pseudo quantization model converges to the set precision or not;
a conversion module, configured to perform pre-inference processing on the pseudo-quantization model when its accuracy converges to the set accuracy, and to convert it into a 4-bit inference model that can be used for inference operations, where the pre-inference processing comprises: constant folding, secondary quantization and activation equivalent transformation.
Optionally, the statistics module includes:
a replacement unit, configured to replace all active layers relu in the neural network with saturated active layers satRelu;
an activation value acquisition unit, configured to acquire an activation value of each saturation activation layer satRelu according to the acquired command;
a statistical unit for counting distribution data by using a histogram according to the activation value;
an initial value selection unit, configured to select the activation value at the 99.999th percentile of the histogram as the initial value of the parameter max in the saturation activation layer satRelu, where satRelu is defined as:
satRelu(x) = 0 for x < 0; x for 0 ≤ x ≤ max; max for x > max
and in back propagation, the gradient of satRelu with respect to the parameter max is:
∂satRelu/∂max = 1 for x > max, and 0 otherwise;
and the gradient of satRelu with respect to the input x is:
∂satRelu/∂x = 1 for 0 ≤ x ≤ max, and 0 otherwise;
where max is the maximum value of the saturation activation layer satRelu.
Optionally, the retraining module comprises:
a pseudo quantization layer insertion unit for inserting a weight pseudo quantization layer before a weight layer of the neural network and inserting an activation pseudo quantization layer before an activation layer;
a first retraining unit, configured to retrain the neural network on the weight pseudo-quantization layer using the formula y = quant(w) = clip(round(w·scale), −8, 7)/scale, where w is the weight value, n is 4, and the scale coefficient scale is:
scale = (2^(n−1) − 1) / max(|w|)
and a second retraining unit, configured to retrain the neural network on the activation pseudo-quantization layer using the formula y = quant(x) = clip(round(x·scale), 0, 15)/scale, where x is the activation value of each layer, max is the maximum value of satRelu, n is 4, and the scale coefficient scale is:
scale = (2^n − 1) / max
optionally, the pre-inference processing module includes:
a constant folding unit for fusing the batchNorm layer into the convolution, where the convolution formula is z = w·x + b and the batchNorm formula is:
y = γ · (z − μ) / sqrt(σ² + ε) + β
and the new convolution after merging is calculated as:
z = w_new · x + b_new, with w_new = γ · w / sqrt(σ² + ε) and b_new = γ · (b − μ) / sqrt(σ² + ε) + β;
the secondary quantization unit is used for carrying out secondary quantization on the weight to obtain a quantization scale coefficient scale of the weight after the secondary quantization;
an activation equivalent transformation unit for performing an equivalent transformation on the activation using the formula:
w ⊛ x + b = ( W_q ⊛ X_q + 8·ΣW_q + b_q ) / (scale_w · scale_a), where X_q = round(scale_a · x) − 8.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the application provides a 4-bit quantification method of a neural network, the quantification method is characterized by firstly loading a pre-training model of the neural network, counting initial values of various satreu layers in the pre-training model, adding a pseudo quantification node in the neural network, retraining the neural network by using the initial values of the satreu, acquiring the pseudo quantification model through retraining for a plurality of periods, converting the pseudo quantification model into a 4-bit reasoning model by using a reasoning method, completing a full 4-bit reasoning process before reasoning by using a reasoning algorithm, being beneficial to improving the calculation speed, ensuring that a final reasoning model can be directly applied to a 4-bit GPU, supporting the 4-bit GPU operation and being beneficial to improving the practicability of the invention. The method is carried out in a pseudo-quantization mode during retraining, a corresponding pseudo-quantization layer is inserted before a weight layer and an activation layer and is used for simulating the influence of model quantization on the whole neural network, the model is learned and adapted to the influence on the neural network by means of training, so that the accuracy of the quantization model can be greatly improved, the precision loss of the obtained model can be controlled within 1%, and the 4-bit quantization result is ensured to meet the precision requirement. Moreover, the period of retraining in this embodiment is 10, i.e.: the neural network is retrained for 10 times, so that the training time is greatly saved, the training efficiency can be effectively improved on the basis of ensuring the precision, and further the 4-bit quantization efficiency of the neural network is improved.
In the embodiment, the neural network is retrained, and the weight layer and the activation layer are subjected to linear pseudo-quantization, so that the calculation speed is increased on the basis of ensuring the training precision, and the quantization efficiency of the neural network is effectively improved.
The application also provides a 4-bit quantization system for a neural network. The system mainly comprises: a loading module, a statistics module, a retraining module, a judging module and a conversion module. With the loading module and the retraining module, a pre-trained model can be loaded and retrained for a few epochs to obtain a pseudo-quantization model; when the judging module determines that the accuracy of the pseudo-quantization model has converged to the set accuracy, the conversion module performs pre-inference processing on the pseudo-quantization model and converts it into a 4-bit inference model for inference operations. Using a small number of retraining epochs, generally 10, effectively saves training time and greatly improves the quantization efficiency of the neural network. Moreover, processing the pseudo-quantization model for the inference state with the pre-inference processing steps completes all 4-bit inference preparation before inference, further increasing calculation speed and quantization efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a 4-bit quantization method of a neural network according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a 4-bit quantization system of a neural network according to an embodiment of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For a better understanding of the present application, embodiments of the present application are explained in detail below with reference to the accompanying drawings.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart of a 4-bit quantization method of a neural network according to an embodiment of the present disclosure. As can be seen from fig. 1, the 4-bit quantization method of the neural network in this embodiment mainly includes the following steps:
s1: a pre-trained model of the neural network is loaded.
S2: in the pre-training model, the initial value of each saturation activation layer satRelu is counted.
Specifically, step S2 further includes:
s21: all the active layers relu in the neural network are replaced with saturated active layers satRelu.
S22: and acquiring the activation value of each saturated activation layer satRelu according to the acquired command.
S23: the distribution data is counted using the histogram according to the activation value.
S24: the activation value at 99.999% of the point in the histogram is selected as the initial value of the parameter max in the saturation activation layer satRelu. Wherein satRelu is defined as:
Figure BDA0002555762740000061
and in back propagation, the gradient of satRelu with respect to the parameter max is:
∂satRelu/∂max = 1 for x > max, and 0 otherwise;
and the gradient of satRelu with respect to the input x is:
∂satRelu/∂x = 1 for 0 ≤ x ≤ max, and 0 otherwise.
As shown in steps S1 and S2, the pre-trained model of the neural network is loaded before training, which can be implemented by running a network script. The distribution of each layer's activation values is then collected by running the network script: 4096 sampling points can be used to build the histogram distribution of each layer's activations, and the activation value at the 99.999th percentile of the histogram distribution is selected as the initial value of the sole parameter max in satRelu for subsequent iterative training.
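As an illustration, this percentile statistic can be sketched in a few lines of Python. This is a minimal sketch that assumes the activation values of a layer have already been recorded (for example with forward hooks); apart from the 4096 bins and the 99.999% point, all names and defaults are assumptions:

import numpy as np

def init_satrelu_max(activations, num_bins=4096, percentile=99.999):
    """Estimate the initial satRelu bound for one layer: histogram the
    recorded activations and return the value at the given percentile."""
    acts = np.asarray(activations, dtype=np.float64).ravel()
    counts, edges = np.histogram(acts, bins=num_bins)
    cdf = np.cumsum(counts) / counts.sum()
    idx = int(np.searchsorted(cdf, percentile / 100.0))
    return float(edges[min(idx + 1, num_bins)])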
In this embodiment, the lower limit of satRelu is 0 and the upper limit is max; that is, the maximum value of satRelu is max. max is a variable whose value gradually decreases during neural network training, which is why an initial value of max needs to be obtained first.
In this embodiment, during training, all relu layers in the original network need to be replaced by satRelu layers, where satRelu is defined as follows:
satRelu(x) = 0 for x < 0; x for 0 ≤ x ≤ max; max for x > max
where max is initialized per layer from the histogram statistics. In backpropagation, the gradient of satRelu with respect to max is:
∂satRelu/∂max = 1 for x > max, and 0 otherwise;
and the gradient with respect to the input x is:
∂satRelu/∂x = 1 for 0 ≤ x ≤ max, and 0 otherwise.
further, in the retraining process of the neural network, for the max parameter, the L2 regularization is adopted to compress the max parameter, so that the quantization error can be effectively reduced, and the network precision can be increased.
In this embodiment, the number of retraining epochs is no more than 10 and the value of the parameter max is no more than 1; the preferred values are 10 retraining epochs and max = 1. This setting of max keeps each layer's output in a proper range and helps improve network accuracy. During training, the max values should be monitored, and a suitable regularization coefficient should be chosen so that max is close to 1 when the network converges.
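The satRelu layer with its learnable max and the L2 penalty on max can be sketched in PyTorch-style Python as follows; this is a minimal sketch under the definitions above, and the class name, function name and regularization coefficient are illustrative assumptions, not the patent's code:

import torch
import torch.nn as nn

class SatRelu(nn.Module):
    """Saturating ReLU: output clamped to [0, max] with a learnable max."""
    def __init__(self, init_max):
        super().__init__()
        self.max = nn.Parameter(torch.tensor(float(init_max)))

    def forward(self, x):
        # min(relu(x), max) reproduces the gradients above:
        # d/dx = 1 for 0 <= x <= max, d/dmax = 1 for x > max, 0 otherwise.
        return torch.minimum(torch.relu(x), self.max)

def satrelu_l2_penalty(model, reg_lambda=1e-4):
    # L2 regularization that compresses max; reg_lambda is an assumed value
    # and should be tuned so that max is close to 1 at convergence.
    return reg_lambda * sum(m.max ** 2 for m in model.modules()
                            if isinstance(m, SatRelu))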
With continued reference to fig. 1, after counting the initial values of the saturation activation layers satRelu, step S3 is executed: and adding a pseudo quantization node in the neural network, retraining the neural network by using the initial value of the satRelu, and acquiring a pseudo quantization model.
In this embodiment, a pseudo quantization node is added to a neural network, specifically: pseudo-quantization nodes are added to the convolved inputs and weights of the neural network, and to the fully connected inputs and weights of the neural network.
S31: a weight pseudo-quantization layer is inserted before a weight layer of the neural network and an activation pseudo-quantization layer is inserted before an activation layer.
In the model training of this embodiment, the whole neural network is trained in a pseudo-quantization manner, that is: a pseudo-quantization layer is inserted before each weight layer and activation layer of the conventional neural network. The weight pseudo-quantization layer and the activation pseudo-quantization layer simulate the influence of model quantization on the whole neural network, and the model learns and adapts to this influence through training, so the accuracy of the final quantized model can be effectively improved.
S32: the weight pseudo quantization layer is retrained by the neural network using the formula y equal quant (w equal clip)/scale.
Wherein w is a weighted value, n is 4, and the scale coefficient scale is:
Figure BDA0002555762740000071
In this embodiment, after retraining the neural network on the weight pseudo-quantization layer, a pseudo-quantization model is obtained whose weight values have changed as a result of the retraining. In addition, since the neural network is quantized to 4 bits, the quantized value range is −8 to 7, so the upper limit of clip is 7 and the lower limit is −8; the round function rounds the input, and the scale values are obtained per output channel (each output channel shares one scale).
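A minimal sketch of this weight pseudo-quantization in Python, under the assumption (as reconstructed above) that the scale is (2^(n−1) − 1)/max(|w|) with one scale per output channel:

import torch

def fake_quant_weight(w, n_bits=4):
    """Quantize-dequantize a weight tensor: round, clip to [-8, 7] for
    4 bits, and divide by the per-output-channel scale again."""
    qmax = 2 ** (n_bits - 1) - 1                      # 7 for n_bits = 4
    qmin = -(2 ** (n_bits - 1))                       # -8
    w_absmax = w.abs().flatten(1).max(dim=1).values.clamp(min=1e-8)
    scale = (qmax / w_absmax).view(-1, *([1] * (w.dim() - 1)))
    return torch.clamp(torch.round(w * scale), qmin, qmax) / scale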
S33: retraining the neural network for the activated pseudo quantization layer using the formula (equal (x)) clip/scale.
Wherein x is the activation value of each layer, max is the maximum value of satRelu, n takes the value of 4, and the proportionality coefficient scale is as follows:
Figure BDA0002555762740000072
since a relu layer is arranged after each layer of activation, the activation value cannot have a negative value, the value range after quantization is between 0 and 15, the upper limit of clip is 15, the lower limit of clip is 0, and the activation needs to be mapped to between-8 and 7 during reasoning subsequently, so that GPU calculation is facilitated, and the value of scale is a mode that each layer shares one scale.
In this embodiment, when retraining the neural network, the back-propagation process uses a straight-through estimator (STE) to calculate gradients. Specifically, because the quantization function is a discrete, non-differentiable function, the straight-through estimator is adopted, and the quantization bit width of the network is gradually reduced to 4 bits, with the quantized bit number n taking the values 8, 6, 5, 4 in turn; that is, training starts from 8-bit quantization, and the bit width is gradually reduced to 4 bits during training. This reduces the gradient-mismatch problem and lets the network gradually adapt to the error introduced by quantization, thereby effectively improving operation accuracy and quantization efficiency.
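The straight-through estimator and the progressive bit-width schedule can be sketched as follows; only the bit sequence 8, 6, 5, 4 comes from the text, while the epoch boundaries and names are illustrative assumptions:

def ste(x, x_quantized):
    # Forward pass uses the quantized value; backward passes the gradient
    # through unchanged, as if quantization were the identity.
    return x + (x_quantized - x).detach()

def bits_for_epoch(epoch, schedule=((0, 8), (2, 6), (4, 5), (6, 4))):
    # Progressive reduction of the quantization bit width during training.
    bits = schedule[0][1]
    for start_epoch, b in schedule:
        if epoch >= start_epoch:
            bits = b
    return bits

# usage inside the training loop (with the fake-quant sketches above):
# w_hat = ste(w, fake_quant_weight(w, n_bits=bits_for_epoch(epoch)))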
As can be seen from fig. 1, in this embodiment, after retraining the neural network and obtaining the pseudo quantization model, step S4 is executed: and judging whether the precision of the pseudo quantization model converges to the set precision.
If the accuracy of the pseudo-quantization model converges to the set accuracy, step S5 is performed: pre-inference processing is carried out on the pseudo-quantization model, which is then converted into a 4-bit inference model for inference operations.
In this embodiment, the trained model is first reloaded and the satRelu layers are changed back to ordinary Relu layers, so as to ensure that the network structure is unchanged. The pre-inference processing comprises: constant folding, secondary quantization and activation equivalent transformation.
Specifically, during training the batchNorm layer is left untouched, so the accuracy of the network is not affected. After training, the batchNorm layer needs to be fused into the convolution; the calculation formulas of the constant folding process are as follows:
1) convolution calculation: z = w·x + b
2) batchNorm calculation:
y = γ · (z − μ) / sqrt(σ² + ε) + β
3) the new convolution after merging adopts the formulas:
w_new = γ · w / sqrt(σ² + ε)
and
b_new = γ · (b − μ) / sqrt(σ² + ε) + β,
so that the merged convolution is z = w_new · x + b_new.
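This constant folding can be sketched for a conv2d + batchNorm pair as follows; the parameter names follow the formulas above, and the function is an illustration rather than the patent's implementation:

import torch

def fold_batchnorm(conv_w, conv_b, gamma, beta, mean, var, eps=1e-5):
    """Fold a batchNorm layer into the preceding convolution, returning
    the merged weight and bias: w_new = gamma*w/sqrt(var+eps),
    b_new = gamma*(b-mean)/sqrt(var+eps) + beta."""
    std = torch.sqrt(var + eps)
    w_new = conv_w * (gamma / std).view(-1, 1, 1, 1)  # scale per output channel
    b_new = gamma * (conv_b - mean) / std + beta
    return w_new, b_new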
Because the weight parameters change after the batchNorm layers are folded, this embodiment performs a secondary quantization of the weights to improve calculation accuracy. The quantization method is the same as the weight quantization used during training: the weight quantization scales are computed again, while the activation scales are unchanged and the scales obtained in training are still used.
In this embodiment, Tensor Cores can support 4-bit operations on the GPU, so inference calculation can be carried out with Tensor Core kernels. However, Tensor Cores only support signed 4-bit by signed 4-bit or unsigned 4-bit by unsigned 4-bit operations, while after training the network activations are unsigned 4-bit numbers and the weights are signed 4-bit numbers; the activations are therefore transformed equivalently so that GPU operation is supported, which improves the practicality of the method. The specific activation equivalent transformation is: subtract the median value 8 from the activation to transform it into signed 4-bit data, then perform inference. The equivalent transformation of the convolution is:
w ⊛ x + b = ( W_q ⊛ X_q + 8·ΣW_q + b_q ) / (scale_w · scale_a)
where W_q = round(scale_w · w), X_q = round(scale_a · x) − 8, b_q = round(b · scale_w · scale_a), and ΣW_q denotes the sum of the quantized weights contributing to each output element.
According to the convolution equivalent transformation formula, the inference calculation process of convolution is divided into the following steps:
1) the pad operation is first performed on the convolved inputs.
2) The input and weights are multiplied by their respective scales to obtain 4-bit quantized values, i.e. W_q = round(scale_w · W) and X_q = round(scale_a · x) − 8, mapping the unsigned activation levels 0 to 15 onto the signed range −8 to 7.
3) The weights and inputs are held in int32 format, but their value range is int4, i.e. between −8 and 7; since no int4 data type exists on the CPU, the low 4 bits of the weights and activations are taken and eight values are packed into one int32 by shift operations, which reduces the storage space and makes it convenient for the GPU to fetch data for computation (see the packing sketch after this list).
4) Perform the 4-bit convolution, i.e. W_q ⊛ X_q, then add the compensation offset 8·ΣW_q + b_q.
5) Dequantize, i.e. divide by scale_w · scale_a. Since this dequantization can be folded and merged with the quantization of the next layer's convolution input, the operation can be hidden.
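As referenced in step 3, the shift-based packing of eight 4-bit values into one int32 (and its inverse) can be sketched as follows; the function names are illustrative:

import numpy as np

def pack_int4(values):
    """Pack eight signed 4-bit values (each in [-8, 7]) into one int32
    by keeping the low 4 bits of each value and shifting it into place."""
    assert len(values) == 8
    packed = 0
    for i, v in enumerate(values):
        packed |= (int(v) & 0xF) << (4 * i)
    return np.uint32(packed).astype(np.int32)         # reinterpret as int32

def unpack_int4(packed):
    """Recover the eight signed 4-bit values from a packed int32."""
    p = int(packed) & 0xFFFFFFFF
    out = []
    for i in range(8):
        nib = (p >> (4 * i)) & 0xF
        out.append(nib - 16 if nib >= 8 else nib)     # sign-extend the nibble
    return out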
Finally, an overall check of the network is required: after constant folding, some constants are merged for calculation and redundant operators are removed; the model is then saved. The saved model is the complete int4 inference model and can be used directly for inference calculation.
If the accuracy of the pseudo-quantization model does not converge to the set accuracy, the process returns to step S3, and retraining the neural network is performed again, and a new pseudo-quantization model is obtained until the accuracy of the new pseudo-quantization model converges to the set accuracy.
Example two
Referring to fig. 2 based on the embodiment shown in fig. 1, fig. 2 is a schematic structural diagram of a 4-bit quantization system of a neural network according to an embodiment of the present disclosure. As can be seen from fig. 2, the 4-bit quantization system of the neural network in this embodiment mainly includes: the device comprises a loading module, a statistic module, a retraining module, a judging module and a converting module.
The loading module is used for loading a pre-trained model of the neural network; the statistics module is used for counting the initial value of each saturation activation layer satRelu in the pre-trained model; the retraining module is used for adding pseudo-quantization nodes to the neural network and retraining it with the initial values of satRelu to obtain a pseudo-quantization model; the judging module is used for judging whether the accuracy of the pseudo-quantization model converges to the set accuracy; and the conversion module is configured to perform pre-inference processing on the pseudo-quantization model when its accuracy converges to the set accuracy, and to convert it into a 4-bit inference model for inference operations, where the pre-inference processing comprises: constant folding, secondary quantization and activation equivalent transformation.
Further, the statistics module includes a replacement unit, an activation value acquisition unit, a statistical unit and an initial value selection unit. The replacement unit is used to replace all activation layers relu in the neural network with saturated activation layers satRelu; the activation value acquisition unit acquires the activation value of each saturation activation layer satRelu according to the acquired command; the statistical unit counts the distribution data using a histogram according to the activation values; and the initial value selection unit selects the activation value at the 99.999th percentile of the histogram as the initial value of the parameter max in the saturation activation layer satRelu, where satRelu is defined as:
satRelu(x) = 0 for x < 0; x for 0 ≤ x ≤ max; max for x > max
and in back propagation, the gradient of satRelu with respect to the parameter max is:
∂satRelu/∂max = 1 for x > max, and 0 otherwise;
and the gradient of satRelu with respect to the input x is:
∂satRelu/∂x = 1 for 0 ≤ x ≤ max, and 0 otherwise.
The retraining module comprises: a pseudo-quantization layer insertion unit, a first retraining unit and a second retraining unit. The pseudo-quantization layer insertion unit is used to insert a weight pseudo-quantization layer before each weight layer of the neural network and an activation pseudo-quantization layer before each activation layer; the first retraining unit retrains the neural network on the weight pseudo-quantization layer using the formula y = quant(w) = clip(round(w·scale), −8, 7)/scale, where w is the weight value, n is 4, and the scale coefficient scale is:
scale = (2^(n−1) − 1) / max(|w|)
and the second retraining unit retrains the neural network on the activation pseudo-quantization layer using the formula y = quant(x) = clip(round(x·scale), 0, 15)/scale, where x is the activation value of each layer, max is the maximum value of satRelu, n is 4, and the scale coefficient scale is:
scale = (2^n − 1) / max
The pre-inference processing module comprises: a constant folding unit, a secondary quantization unit and an activation equivalent transformation unit. The constant folding unit is used to fuse the batchNorm layer into the convolution, where the convolution is calculated as z = w·x + b and batchNorm as:
y = γ · (z − μ) / sqrt(σ² + ε) + β
and the new convolution after merging is calculated as:
z = w_new · x + b_new, with w_new = γ · w / sqrt(σ² + ε) and b_new = γ · (b − μ) / sqrt(σ² + ε) + β.
the secondary quantization unit is used for carrying out secondary quantization on the weight to obtain a quantization scale coefficient scale of the weight after the secondary quantization; activating equivalent transformation units for using formulas
Figure BDA0002555762740000105
And performing equivalent transformation on the activation.
The working principle and working method of the 4-bit quantization system of the neural network in this embodiment have already been described in detail in the embodiment shown in fig. 1, and are not repeated here.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of 4-bit quantization for a neural network, the method comprising:
loading a pre-training model of the neural network;
counting initial values of satRelu of each saturation activation layer in the pre-training model;
adding a pseudo-quantization node in the neural network, and retraining the neural network by using an initial value of satRelu to obtain a pseudo-quantization model;
judging whether the precision of the pseudo quantization model converges to a set precision or not;
if yes, performing pre-inference processing on the pseudo-quantization model and converting it into a 4-bit inference model for inference operations, wherein the pre-inference processing comprises: constant folding, secondary quantization and activation equivalent transformation;
if not, retraining the neural network is continued.
2. The method of claim 1, wherein the method for counting initial values of satRelu in each saturation activation layer in the pre-training model comprises:
replacing all active layers relu in the neural network with saturated active layers satRelu;
acquiring an activation value of each saturation activation layer satRelu according to the acquired command;
according to the activation value, counting distribution data by utilizing a histogram;
selecting the activation value at the 99.999th percentile of the histogram as the initial value of the parameter max in the saturation activation layer satRelu, wherein satRelu is defined as:
satRelu(x) = 0 for x < 0; x for 0 ≤ x ≤ max; max for x > max
and in back propagation, the gradient of satRelu with respect to the parameter max is:
∂satRelu/∂max = 1 for x > max, and 0 otherwise;
and the gradient of satRelu with respect to the input x is:
∂satRelu/∂x = 1 for 0 ≤ x ≤ max, and 0 otherwise;
where max is the maximum value of the saturation activation layer satRelu.
3. The 4-bit quantization method of the neural network as claimed in claim 2, wherein the parameter max is compressed using L2 regularization during retraining of the neural network.
4. The 4-bit quantization method of the neural network as claimed in claim 2, wherein the number of retraining epochs is less than or equal to 10, and the value of the parameter max is less than or equal to 1.
5. The method according to claim 1, wherein the method for obtaining the pseudo quantization model by adding the pseudo quantization node in the neural network and retraining the neural network with an initial value of satRelu comprises:
inserting a weight pseudo-quantization layer before a weight layer of the neural network and an activation pseudo-quantization layer before an activation layer;
retraining the neural network on the weight pseudo-quantization layer using the formula y = quant(w) = clip(round(w·scale), −8, 7)/scale, wherein w is the weight value, n is 4, and the scale coefficient scale is:
scale = (2^(n−1) − 1) / max(|w|)
retraining the neural network on the activation pseudo-quantization layer using the formula y = quant(x) = clip(round(x·scale), 0, 15)/scale, wherein x is the activation value of each layer, max is the maximum value of satRelu, n is 4, and the scale coefficient scale is:
scale = (2^n − 1) / max
6. a4-bit quantization method for neural networks according to any of claims 2-4, characterized in that the back-propagation process uses a pass-through estimator to calculate the gradient when retraining the neural network.
7. A 4-bit quantization system for a neural network, the system comprising:
the loading module is used for loading the pre-training model of the neural network;
the statistical module is used for counting the initial value of each saturation activation layer satRelu in the pre-training model;
the retraining module is used for adding a pseudo quantization node in the neural network, retraining the neural network by using an initial value of the satRelu, and acquiring a pseudo quantization model;
the judging module is used for judging whether the precision of the pseudo quantization model converges to the set precision or not;
a conversion module, configured to perform pre-inference processing on the pseudo-quantization model when its accuracy converges to the set accuracy, and to convert it into a 4-bit inference model that can be used for inference operations, where the pre-inference processing comprises: constant folding, secondary quantization and activation equivalent transformation.
8. The 4-bit quantization system of a neural network of claim 7, wherein the statistics module comprises:
a replacement unit, configured to replace all active layers relu in the neural network with saturated active layers satRelu;
an activation value acquisition unit, configured to acquire an activation value of each saturation activation layer satRelu according to the acquired command;
a statistical unit for counting distribution data by using a histogram according to the activation value;
an initial value selection unit, configured to select the activation value at the 99.999th percentile of the histogram as the initial value of the parameter max in the saturation activation layer satRelu, where satRelu is defined as:
satRelu(x) = 0 for x < 0; x for 0 ≤ x ≤ max; max for x > max
and in back propagation, the gradient of satRelu with respect to the parameter max is:
∂satRelu/∂max = 1 for x > max, and 0 otherwise;
and the gradient of satRelu with respect to the input x is:
∂satRelu/∂x = 1 for 0 ≤ x ≤ max, and 0 otherwise;
where max is the maximum value of the saturation activation layer satRelu.
9. The 4-bit quantization system of a neural network of claim 7, wherein the retraining module comprises:
a pseudo quantization layer insertion unit for inserting a weight pseudo quantization layer before a weight layer of the neural network and inserting an activation pseudo quantization layer before an activation layer;
a first retraining unit, configured to retrain the neural network on the weight pseudo-quantization layer using the formula y = quant(w) = clip(round(w·scale), −8, 7)/scale, where w is the weight value, n is 4, and the scale coefficient scale is:
scale = (2^(n−1) − 1) / max(|w|)
and a second retraining unit, configured to retrain the neural network on the activation pseudo-quantization layer using the formula y = quant(x) = clip(round(x·scale), 0, 15)/scale, where x is the activation value of each layer, max is the maximum value of satRelu, n is 4, and the scale coefficient scale is:
scale = (2^n − 1) / max
10. the 4-bit quantization system of a neural network of claim 7, wherein the pre-inference processing module comprises:
a constant folding unit for fusing the batchNorm layer into the convolution, wherein the convolution is calculated as z = w·x + b and the batchNorm formula is:
y = γ · (z − μ) / sqrt(σ² + ε) + β
and the new convolution after merging is calculated as:
z = w_new · x + b_new, with w_new = γ · w / sqrt(σ² + ε) and b_new = γ · (b − μ) / sqrt(σ² + ε) + β;
a secondary quantization unit for performing secondary quantization on the weights to obtain the weight quantization scale coefficients after secondary quantization; and
an activation equivalent transformation unit for performing an equivalent transformation on the activation using the formula:
w ⊛ x + b = ( W_q ⊛ X_q + 8·ΣW_q + b_q ) / (scale_w · scale_a), where X_q = round(scale_a · x) − 8.
CN202010589233.5A 2020-06-24 2020-06-24 4-bit quantization method and system of neural network Pending CN111882058A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010589233.5A CN111882058A (en) 2020-06-24 2020-06-24 4-bit quantization method and system of neural network
PCT/CN2021/076982 WO2021258752A1 (en) 2020-06-24 2021-02-20 4-bit quantization method and system for neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010589233.5A CN111882058A (en) 2020-06-24 2020-06-24 4-bit quantization method and system of neural network

Publications (1)

Publication Number Publication Date
CN111882058A true CN111882058A (en) 2020-11-03

Family

ID=73156945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010589233.5A Pending CN111882058A (en) 2020-06-24 2020-06-24 4-bit quantization method and system of neural network

Country Status (2)

Country Link
CN (1) CN111882058A (en)
WO (1) WO2021258752A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488291A (en) * 2020-11-03 2021-03-12 珠海亿智电子科技有限公司 Neural network 8-bit quantization compression method
CN112884144A (en) * 2021-02-01 2021-06-01 上海商汤智能科技有限公司 Network quantization method and device, electronic equipment and storage medium
WO2021258752A1 (en) * 2020-06-24 2021-12-30 苏州浪潮智能科技有限公司 4-bit quantization method and system for neural network
CN113887706A (en) * 2021-09-30 2022-01-04 苏州浪潮智能科技有限公司 Method and device for low bit quantization aiming at one-stage target detection network
CN113971457A (en) * 2021-10-29 2022-01-25 苏州浪潮智能科技有限公司 Method and system for optimizing calculation performance of neural network
CN114611697A (en) * 2022-05-11 2022-06-10 上海登临科技有限公司 Neural network quantification and deployment method, system, electronic device and storage medium
CN114676760A (en) * 2022-03-10 2022-06-28 北京智源人工智能研究院 Pre-training model inference processing method and device, electronic equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230298569A1 (en) * 2022-03-21 2023-09-21 Google Llc 4-bit Conformer with Accurate Quantization Training for Speech Recognition

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11348009B2 (en) * 2018-09-24 2022-05-31 Samsung Electronics Co., Ltd. Non-uniform quantization of pre-trained deep neural network
CN110334802A (en) * 2019-05-23 2019-10-15 腾讯科技(深圳)有限公司 A kind of construction method of neural network model, device, equipment and storage medium
CN110837890A (en) * 2019-10-22 2020-02-25 西安交通大学 Weight value fixed-point quantization method for lightweight convolutional neural network
CN111882058A (en) * 2020-06-24 2020-11-03 苏州浪潮智能科技有限公司 4-bit quantization method and system of neural network

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021258752A1 (en) * 2020-06-24 2021-12-30 苏州浪潮智能科技有限公司 4-bit quantization method and system for neural network
CN112488291A (en) * 2020-11-03 2021-03-12 珠海亿智电子科技有限公司 Neural network 8-bit quantization compression method
CN112488291B (en) * 2020-11-03 2024-06-04 珠海亿智电子科技有限公司 8-Bit quantization compression method for neural network
CN112884144A (en) * 2021-02-01 2021-06-01 上海商汤智能科技有限公司 Network quantization method and device, electronic equipment and storage medium
CN113887706A (en) * 2021-09-30 2022-01-04 苏州浪潮智能科技有限公司 Method and device for low bit quantization aiming at one-stage target detection network
CN113887706B (en) * 2021-09-30 2024-02-06 苏州浪潮智能科技有限公司 Method and device for low-bit quantization of one-stage target detection network
CN113971457A (en) * 2021-10-29 2022-01-25 苏州浪潮智能科技有限公司 Method and system for optimizing calculation performance of neural network
CN113971457B (en) * 2021-10-29 2024-02-02 苏州浪潮智能科技有限公司 Computing performance optimization method and system for neural network
CN114676760A (en) * 2022-03-10 2022-06-28 北京智源人工智能研究院 Pre-training model inference processing method and device, electronic equipment and storage medium
CN114611697A (en) * 2022-05-11 2022-06-10 上海登临科技有限公司 Neural network quantification and deployment method, system, electronic device and storage medium

Also Published As

Publication number Publication date
WO2021258752A1 (en) 2021-12-30

Similar Documents

Publication Publication Date Title
CN111882058A (en) 4-bit quantization method and system of neural network
CN113011581B (en) Neural network model compression method and device, electronic equipment and readable storage medium
CN111814973B (en) Memory computing system suitable for neural ordinary differential equation network computing
CN112906294A (en) Quantization method and quantization device for deep learning model
CN109726799A (en) A kind of compression method of deep neural network
CN113850389B (en) Quantum circuit construction method and device
CN112733863B (en) Image feature extraction method, device, equipment and storage medium
CN110245753A (en) A kind of neural network compression method based on power exponent quantization
US5621861A (en) Method of reducing amount of data required to achieve neural network learning
CN110276451A (en) One kind being based on the normalized deep neural network compression method of weight
CN114139683A (en) Neural network accelerator model quantization method
CN112200311A (en) 4-bit quantitative reasoning method, device, equipment and readable medium
CN112884146A (en) Method and system for training model based on data quantization and hardware acceleration
CN112581397A (en) Degraded image restoration method based on image prior information and application thereof
Verma et al. A" Network Pruning Network''Approach to Deep Model Compression
CN112927159B (en) True image denoising method based on multi-scale selection feedback network
CN109800859B (en) Neural network batch normalization optimization method and device
CN114707636A (en) Neural network architecture searching method and device, electronic equipment and storage medium
CN114595802A (en) Data compression-based impulse neural network acceleration method and device
CN114372539A (en) Machine learning framework-based classification method and related equipment
CN113887706B (en) Method and device for low-bit quantization of one-stage target detection network
CN116760724A (en) Endophytic artificial intelligence evaluation method, system and storage medium
CN116776926B (en) Optimized deployment method, device, equipment and medium for dialogue model
CN111985639A (en) Neural network quantification method
CN115375953A (en) Training method and device for image classification model, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201103