CN111882058A - 4-bit quantization method and system of neural network - Google Patents
- Publication number
- CN111882058A CN111882058A CN202010589233.5A CN202010589233A CN111882058A CN 111882058 A CN111882058 A CN 111882058A CN 202010589233 A CN202010589233 A CN 202010589233A CN 111882058 A CN111882058 A CN 111882058A
- Authority
- CN
- China
- Prior art keywords
- quantization
- neural network
- pseudo
- satrelu
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The application discloses a 4-bit quantization method and system for a neural network. The method comprises the following steps: loading a pre-trained model of the neural network; counting the initial value of satRelu for each saturated activation layer in the pre-trained model; adding pseudo-quantization nodes to the neural network and retraining it with the initial satRelu values to obtain a pseudo-quantization model; judging whether the precision of the pseudo-quantization model converges to a set precision; if so, performing pre-inference processing on the pseudo-quantization model and converting it into a 4-bit inference model for inference operation; otherwise, returning to retrain the neural network. The system mainly comprises a loading module, a statistics module, a retraining module, a judging module and a conversion module. Through the method and system, training efficiency can be effectively improved while the accuracy of the training result is guaranteed.
Description
Technical Field
The present application relates to the field of neural network model compression technologies, and in particular, to a 4-bit quantization method and system for a neural network.
Background
A neural network model generally occupies a large amount of disk space; for example, the model file of AlexNet exceeds 200 MB. A model contains millions of parameters, and a significant portion of that disk space is used to store them. Because the parameters are floating-point values, they are difficult to compress with common compression algorithms. Model quantization was therefore introduced: it compresses the original network by reducing the number of bits required to represent each weight, and can also greatly improve the running speed of the network. How to quantize a neural network is thus an important technical problem.
At present, the mainstream method of neural network quantization is 8-bit quantization, and most training and inference frameworks support it. Compared with 8-bit quantization, however, 4-bit quantization halves the model size and improves the running speed by about 50%. Therefore, 4-bit quantization is also gradually gaining attention.
Current 4-bit quantization algorithms usually train a network from scratch until the whole network converges; on a large dataset such as ImageNet this generally requires more than 100 training epochs, and model precision is improved by adopting a nonlinear quantization scheme.
However, because the number of training epochs is large and the training time is long, the quantization efficiency of current 4-bit quantization algorithms is low.
Disclosure of Invention
The application provides a 4-bit quantization method and a system of a neural network, which aim to solve the problem of low quantization efficiency of the neural network quantization method in the prior art.
In order to solve the technical problem, the embodiment of the application discloses the following technical scheme:
a 4-bit quantization method of a neural network, the method comprising:
loading a pre-training model of the neural network;
counting initial values of satRelu of each saturation activation layer in the pre-training model;
adding a pseudo-quantization node in the neural network, and retraining the neural network by using an initial value of satRelu to obtain a pseudo-quantization model;
judging whether the precision of the pseudo quantization model converges to a set precision or not;
if yes, performing pre-inference processing on the pseudo-quantization model and converting it into a 4-bit inference model for inference operation, wherein the pre-inference processing comprises: constant folding, secondary quantization, and activation equivalent transformation;
if not, retraining the neural network is continued.
Optionally, in the pre-training model, the method for counting the initial value of each saturation activation layer satRelu includes:
replacing all active layers relu in the neural network with saturated active layers satRelu;
acquiring an activation value of each saturation activation layer satRelu according to the acquired command;
according to the activation value, counting distribution data by utilizing a histogram;
selecting the activation value located at the 99.999% point of the histogram as the initial value of the parameter max in the saturated activation layer satRelu, wherein satRelu is defined as:

satRelu(x) = 0 if x < 0; x if 0 ≤ x ≤ max; max if x > max

and in back propagation, the gradient of satRelu with respect to the parameter max is 1 where x > max and 0 elsewhere; the gradient of satRelu with respect to the input x is 1 where 0 ≤ x ≤ max and 0 elsewhere; max is the maximum value of the saturated activation layer satRelu.
Optionally, during the retraining process of the neural network, the parameter max is compressed by using a regularization method of L2.
Optionally, the retraining period is less than or equal to 10, and the value of the parameter max is less than or equal to 1.
Optionally, the method for adding a pseudo quantization node in the neural network, and performing retraining on the neural network by using an initial value of satRelu to obtain a pseudo quantization model includes:
inserting a weight pseudo-quantization layer before a weight layer of the neural network and an activation pseudo-quantization layer before an activation layer;
retraining the neural network on the weight pseudo-quantization layer using the formula y = quant(w) = clip(round(w·scale), −8, 7)/scale, wherein w is the weight value, n = 4, and the scale coefficient (one per output channel) is scale = (2^(n−1) − 1)/max(|w|);
retraining the neural network on the activation pseudo-quantization layer using the formula y = quant(x) = clip(round(x·scale), 0, 15)/scale, wherein x is the activation value of each layer, max is the maximum value of satRelu, n = 4, and the scale coefficient is scale = (2^n − 1)/max.
optionally, in retraining the neural network, the back propagation process uses a straight-through estimator to calculate the gradient.
A 4-bit quantization system for a neural network, the system comprising:
the loading module is used for loading the pre-training model of the neural network;
the statistical module is used for counting the initial value of each saturation activation layer satRelu in the pre-training model;
the retraining module is used for adding a pseudo quantization node in the neural network, retraining the neural network by using an initial value of the satRelu, and acquiring a pseudo quantization model;
the judging module is used for judging whether the precision of the pseudo quantization model converges to the set precision or not;
a conversion module, configured to perform pre-inference processing on the pseudo-quantization model when its precision converges to the set precision, and to convert the pseudo-quantization model into a 4-bit inference model that can be used for inference operation, wherein the pre-inference processing comprises: constant folding, secondary quantization, and activation equivalent transformation.
Optionally, the statistics module includes:
a replacement unit, configured to replace all active layers relu in the neural network with saturated active layers satRelu;
an activation value acquisition unit, configured to acquire an activation value of each saturation activation layer satRelu according to the acquired command;
a statistical unit for counting distribution data by using a histogram according to the activation value;
an initial value selecting unit, configured to select the activation value located at the 99.999% point of the histogram as the initial value of the parameter max in the saturated activation layer satRelu, wherein satRelu is defined as:

satRelu(x) = 0 if x < 0; x if 0 ≤ x ≤ max; max if x > max

and in back propagation, the gradient of satRelu with respect to the parameter max is 1 where x > max and 0 elsewhere; the gradient of satRelu with respect to the input x is 1 where 0 ≤ x ≤ max and 0 elsewhere; max is the maximum value of the saturated activation layer satRelu.
Optionally, the retraining module comprises:
a pseudo quantization layer insertion unit for inserting a weight pseudo quantization layer before a weight layer of the neural network and inserting an activation pseudo quantization layer before an activation layer;
a first retraining unit, configured to retrain the neural network on the weight pseudo-quantization layer using the formula y = quant(w) = clip(round(w·scale), −8, 7)/scale, wherein w is the weight value, n = 4, and the scale coefficient (one per output channel) is scale = (2^(n−1) − 1)/max(|w|);
and a second retraining unit, configured to retrain the neural network on the activation pseudo-quantization layer using the formula y = quant(x) = clip(round(x·scale), 0, 15)/scale, wherein x is the activation value of each layer, max is the maximum value of satRelu, n = 4, and the scale coefficient is scale = (2^n − 1)/max.
optionally, the pre-inference processing module includes:
a constant folding unit, configured to fuse the batchNorm layer into the convolution, wherein the convolution formula is z = w·x + b, the batchNorm formula is y = γ·(z − μ)/√(σ² + ε) + β, and the new convolution after merging is computed as y = w′·x + b′ with w′ = γ·w/√(σ² + ε) and b′ = γ·(b − μ)/√(σ² + ε) + β;
the secondary quantization unit is used for carrying out secondary quantization on the weight to obtain a quantization scale coefficient scale of the weight after the secondary quantization;
an activation equivalent transformation unit, configured to perform an equivalent transformation on the activation by subtracting the median value 8, converting the unsigned 4-bit activation into signed 4-bit data.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the application provides a 4-bit quantification method of a neural network, the quantification method is characterized by firstly loading a pre-training model of the neural network, counting initial values of various satreu layers in the pre-training model, adding a pseudo quantification node in the neural network, retraining the neural network by using the initial values of the satreu, acquiring the pseudo quantification model through retraining for a plurality of periods, converting the pseudo quantification model into a 4-bit reasoning model by using a reasoning method, completing a full 4-bit reasoning process before reasoning by using a reasoning algorithm, being beneficial to improving the calculation speed, ensuring that a final reasoning model can be directly applied to a 4-bit GPU, supporting the 4-bit GPU operation and being beneficial to improving the practicability of the invention. The method is carried out in a pseudo-quantization mode during retraining, a corresponding pseudo-quantization layer is inserted before a weight layer and an activation layer and is used for simulating the influence of model quantization on the whole neural network, the model is learned and adapted to the influence on the neural network by means of training, so that the accuracy of the quantization model can be greatly improved, the precision loss of the obtained model can be controlled within 1%, and the 4-bit quantization result is ensured to meet the precision requirement. Moreover, the period of retraining in this embodiment is 10, i.e.: the neural network is retrained for 10 times, so that the training time is greatly saved, the training efficiency can be effectively improved on the basis of ensuring the precision, and further the 4-bit quantization efficiency of the neural network is improved.
In the embodiment, the neural network is retrained, and the weight layer and the activation layer are subjected to linear pseudo-quantization, so that the calculation speed is increased on the basis of ensuring the training precision, and the quantization efficiency of the neural network is effectively improved.
The application also provides a 4-bit quantization system for a neural network, mainly comprising a loading module, a statistics module, a retraining module, a judging module and a conversion module. Through the loading module and the retraining module, a pre-trained model is loaded and retrained for several epochs to obtain a pseudo-quantization model; when the judging module determines that the precision of the pseudo-quantization model has converged to the set precision, the conversion module performs pre-inference processing on the pseudo-quantization model and converts it into a 4-bit inference model for inference operation. Using a small number of retraining epochs, generally 10, effectively saves training time and greatly improves the quantization efficiency of the neural network. Furthermore, processing the pseudo-quantization model for the inference state through pre-inference processing allows all 4-bit inference preparation to be completed before inference, further improving computation speed and quantization efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a 4-bit quantization method of a neural network according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a 4-bit quantization system of a neural network according to an embodiment of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For a better understanding of the present application, embodiments of the present application are explained in detail below with reference to the accompanying drawings.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart of a 4-bit quantization method of a neural network according to an embodiment of the present disclosure. As can be seen from fig. 1, the 4-bit quantization method of the neural network in this embodiment mainly includes the following steps:
s1: a pre-trained model of the neural network is loaded.
S2: in the pre-training model, the initial value of each saturation activation layer satRelu is counted.
Specifically, step S2 further includes:
s21: all the active layers relu in the neural network are replaced with saturated active layers satRelu.
S22: and acquiring the activation value of each saturated activation layer satRelu according to the acquired command.
S23: the distribution data is counted using the histogram according to the activation value.
S24: the activation value at 99.999% of the point in the histogram is selected as the initial value of the parameter max in the saturation activation layer satRelu. Wherein satRelu is defined as:
and in back propagation, the gradient of satRelu for parameter max is:the gradient of satRelu for input x is
As shown in steps S1 and S2, the pre-trained model of the neural network is loaded before training, which can be done by running a network script. The script also counts the distribution of each layer's activation values: 4096 sampling points (histogram bins) can be used to build the histogram of each layer's activations, and the activation value at the 99.999% point of the distribution is selected as the initial value of the single parameter max in satRelu for subsequent iterative training.
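The statistics step above can be illustrated with a short sketch. The function name, signature, and use of NumPy are illustrative assumptions, not part of the claimed method:

```python
import numpy as np

def init_satrelu_max(activations, num_bins=4096, percentile=99.999):
    """Estimate the initial value of the satRelu parameter max for one layer.

    Builds a histogram of the observed activation values (4096 bins, as in
    the text) and returns the value at the 99.999% point of the distribution.
    """
    acts = np.asarray(activations, dtype=np.float64).ravel()
    counts, edges = np.histogram(acts, bins=num_bins)
    cumulative = np.cumsum(counts) / counts.sum()
    # index of the first bin whose cumulative mass reaches the percentile
    idx = int(np.searchsorted(cumulative, percentile / 100.0))
    idx = min(idx, num_bins - 1)
    return float(edges[idx + 1])  # upper edge of that bin
```

In practice this would be run once per satRelu layer on activations collected from a calibration pass over the network.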
In this embodiment, the lower limit of satRelu is 0 and the upper limit of satRelu is max; that is, the maximum value of satRelu is max. Since max is a variable that gradually decreases during neural network training, an initial value of max needs to be obtained.
In this embodiment, during training, all relu layers in the original network are replaced by satRelu layers, where satRelu is defined as follows:

satRelu(x) = 0 if x < 0; x if 0 ≤ x ≤ max; max if x > max

wherein max is the per-layer initial value counted from the histogram. In back propagation, the gradient of satRelu with respect to max is 1 where x > max and 0 elsewhere, and the gradient with respect to the input x is 1 where 0 ≤ x ≤ max and 0 elsewhere.
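A minimal sketch of the satRelu forward and backward passes stated above (NumPy-based; function names are illustrative):

```python
import numpy as np

def satrelu(x, max_val):
    """Forward pass: 0 for x < 0, x on [0, max], max above max."""
    return np.clip(x, 0.0, max_val)

def satrelu_grads(x, max_val, grad_out):
    """Backward pass matching the gradients stated in the text:
    d satRelu / d max = 1 where x > max (0 elsewhere),
    d satRelu / d x   = 1 where 0 <= x <= max (0 elsewhere)."""
    grad_x = grad_out * ((x >= 0.0) & (x <= max_val))
    grad_max = float(np.sum(grad_out * (x > max_val)))
    return grad_x, grad_max
```

Because max receives a gradient only from inputs that exceed it, training (together with the L2 penalty mentioned below) pulls max down toward the bulk of the activation distribution.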
further, in the retraining process of the neural network, for the max parameter, the L2 regularization is adopted to compress the max parameter, so that the quantization error can be effectively reduced, and the network precision can be increased.
In this embodiment, the retraining period is not greater than 10 epochs and the value of the parameter max is not greater than 1; the preferred values are a retraining period of 10 epochs and max = 1. This setting of max keeps each layer's output range appropriate and helps improve network precision. During training, the max values should be monitored, and a suitable regularization parameter should be chosen so that max is close to 1 when the network converges.
With continued reference to fig. 1, after counting the initial values of the saturation activation layers satRelu, step S3 is executed: and adding a pseudo quantization node in the neural network, retraining the neural network by using the initial value of the satRelu, and acquiring a pseudo quantization model.
In this embodiment, a pseudo quantization node is added to a neural network, specifically: pseudo-quantization nodes are added to the convolved inputs and weights of the neural network, and to the fully connected inputs and weights of the neural network.
S31: a weight pseudo-quantization layer is inserted before a weight layer of the neural network and an activation pseudo-quantization layer is inserted before an activation layer.
In the model training of this embodiment, the whole neural network is trained in a pseudo-quantization manner: a pseudo-quantization layer is inserted before each weight layer and each activation layer of the conventional neural network. The weight and activation pseudo-quantization layers simulate the influence of model quantization on the whole neural network, and the model learns to adapt to this influence through training, which effectively improves the accuracy of the final quantized model.
S32: the weight pseudo quantization layer is retrained by the neural network using the formula y equal quant (w equal clip)/scale.
in this embodiment, after retraining the neural network on the weight pseudo-quantization layer, a pseudo-quantization model is obtained, and a weight value in the pseudo-quantization model changes due to the retraining. In addition, in the embodiment, 4-bit quantization is performed on the neural network, and the value range after quantization is-8 to 7, so that the upper limit of clip is 7, the lower limit is-8, a round function is used for rounding the input, and the value of scale is obtained in a manner that each output channel shares one scale.
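The weight pseudo-quantization step can be sketched as follows. The per-channel scale formula scale = (2^(n−1) − 1)/max(|w|) is an assumption for illustration; the text only states that clip runs over −8..7 and that each output channel shares one scale:

```python
import numpy as np

def fake_quant_weight(w, n=4):
    """Weight pseudo-quantization: y = clip(round(w*scale), -8, 7)/scale
    with one scale per output channel (axis 0 assumed to be the output
    channel dimension)."""
    qmax = 2 ** (n - 1) - 1                      # 7 for n = 4
    qmin = -(2 ** (n - 1))                       # -8 for n = 4
    absmax = np.abs(w).reshape(w.shape[0], -1).max(axis=1)
    scale = qmax / np.maximum(absmax, 1e-12)     # assumed scale formula
    scale = scale.reshape((-1,) + (1,) * (w.ndim - 1))
    return np.clip(np.round(w * scale), qmin, qmax) / scale
```

The output stays in floating point (quantize-then-dequantize), so the rest of the network trains normally while experiencing 4-bit rounding error.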
S33: retraining the neural network for the activated pseudo quantization layer using the formula (equal (x)) clip/scale.
Wherein x is the activation value of each layer, max is the maximum value of satRelu, n takes the value of 4, and the proportionality coefficient scale is as follows:
since a relu layer is arranged after each layer of activation, the activation value cannot have a negative value, the value range after quantization is between 0 and 15, the upper limit of clip is 15, the lower limit of clip is 0, and the activation needs to be mapped to between-8 and 7 during reasoning subsequently, so that GPU calculation is facilitated, and the value of scale is a mode that each layer shares one scale.
In this embodiment, when retraining the neural network, the back propagation process uses a straight-through estimator (STE) to calculate the gradient. Specifically, because the quantization function is a discrete, non-differentiable function, the straight-through estimator is adopted to calculate the gradient, and the network's quantized bit number is gradually reduced to 4 bits: the quantized bit number n follows the schedule 8, 6, 5, 4, i.e. training starts from 8-bit quantization and the bit number is gradually reduced to 4 during training. This reduces the gradient mismatch problem and lets the network gradually adapt to the errors caused by quantization, effectively improving both accuracy and quantization efficiency.
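A sketch of the straight-through estimator: in the forward pass the quantizer rounds and clips, while in the backward pass round() is treated as the identity, so the gradient passes through unchanged except where the value was clipped. The masking of clipped values is a common STE variant and an assumption here:

```python
import numpy as np

def quantize(x, scale, qmin, qmax):
    """Forward quantizer: round() and clip() have zero gradient a.e."""
    return np.clip(np.round(x * scale), qmin, qmax) / scale

def ste_backward(x, scale, qmin, qmax, grad_out):
    """Straight-through estimator: pass the gradient through the rounding
    step, zeroing it only where the input fell outside the clip range."""
    inside = (x * scale >= qmin) & (x * scale <= qmax)
    return grad_out * inside

# Progressive bit-width schedule described in the text: start at 8 bits
# and step the quantized bit number n down to 4 over retraining.
bit_schedule = [8, 6, 5, 4]
```

With this estimator the network receives useful gradients despite the discrete quantizer, and the 8 → 6 → 5 → 4 schedule lets it adapt to increasingly coarse rounding.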
As can be seen from fig. 1, in this embodiment, after retraining the neural network and obtaining the pseudo quantization model, step S4 is executed: and judging whether the precision of the pseudo quantization model converges to the set precision.
If the precision of the pseudo-quantization model converges to the set precision, step S5 is performed: pre-inference processing is carried out on the pseudo-quantization model, which is converted into a 4-bit inference model for inference operation.
In this embodiment, the trained model is first reloaded and the satRelu layers are changed back to normal relu layers, ensuring that the network structure is unchanged. The pre-inference processing comprises: constant folding, secondary quantization, and activation equivalent transformation.
Specifically, during training the batchNorm layer is left untouched, so the accuracy of the network is not affected. After training, the batchNorm layer needs to be fused into the convolution; the constant folding process uses the following formulas:
1) convolution calculation: z = w·x + b
2) batchNorm calculation: y = γ·(z − μ)/√(σ² + ε) + β
3) the new convolution after merging: y = w′·x + b′, with w′ = γ·w/√(σ² + ε) and b′ = γ·(b − μ)/√(σ² + ε) + β
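The constant-folding step can be sketched directly from the standard batchNorm fusion identities (the layout with output channels on axis 0 of w is an assumption):

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a batchNorm layer into the preceding convolution.

    From z = w*x + b and y = gamma*(z - mean)/sqrt(var + eps) + beta,
    the merged convolution uses:
        w' = gamma * w / sqrt(var + eps)
        b' = gamma * (b - mean) / sqrt(var + eps) + beta
    """
    std = np.sqrt(var + eps)
    shape = (-1,) + (1,) * (w.ndim - 1)          # broadcast over output channels
    w_folded = w * (gamma / std).reshape(shape)
    b_folded = gamma * (b - mean) / std + beta
    return w_folded, b_folded
```

After folding, conv + batchNorm collapses into a single convolution with new weights, which is why the weights must then be re-quantized (the secondary quantization below).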
because the weight parameter changes after the batchnorm layer is folded, the calculation precision can be improved by carrying out secondary quantization on the weight in the embodiment. The quantization method is consistent with the quantization mode of the weights during training, the quantization scales of the weights are obtained again, and the scales obtained by training are still adopted when the activated scales are not changed.
In this embodiment, TensorCore supports 4-bit operations on the GPU, so inference can be computed with TensorCore code. However, TensorCore only supports signed 4-bit × signed 4-bit or unsigned 4-bit × unsigned 4-bit operations, whereas after training the network activations are unsigned 4-bit numbers and the weights are signed 4-bit numbers. The activations are therefore equivalently transformed so that GPU operation is supported, which improves the practicability of the method. The specific activation equivalent transformation: the median value 8 is subtracted from the activation, transforming it into signed 4-bit data before inference, and the convolution is transformed equivalently to compensate.
According to the convolution equivalent transformation formula, the inference calculation process of convolution is divided into the following steps:
1) the pad operation is first performed on the convolved inputs.
2) The input and the weights are multiplied by their respective scales to obtain 4-bit quantized values, i.e. W_q = scale_w·W and x_q = scale_a·x − 8 (the activation is quantized and then shifted by 8 into the signed range).
3) The weights and inputs are stored in int32 format, but the data range is int4, i.e. between −8 and 7; since there is no native int4 data type, the low 4 bits of eight weight or activation values are spliced into one int32 through shift operations. This reduces the space occupied by data storage and makes it convenient for the GPU to fetch data for computation.
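The splicing of eight signed 4-bit values into one int32 can be sketched as below; the nibble ordering (value i in bits 4i..4i+3) is an assumption, as the text does not specify the layout:

```python
def pack_int4(values):
    """Pack eight signed 4-bit values (each in -8..7) into one 32-bit word
    via shift operations, reinterpreted as a signed 32-bit integer."""
    assert len(values) == 8
    packed = 0
    for i, v in enumerate(values):
        packed |= (v & 0xF) << (4 * i)    # keep the low 4 bits of each value
    if packed >= 1 << 31:                 # reinterpret as signed int32
        packed -= 1 << 32
    return packed

def unpack_int4(packed):
    """Recover the eight signed 4-bit values from a packed word."""
    u = packed & 0xFFFFFFFF
    out = []
    for i in range(8):
        nib = (u >> (4 * i)) & 0xF
        out.append(nib - 16 if nib >= 8 else nib)  # sign-extend the nibble
    return out
```

Packing is lossless: unpacking the packed word recovers the original eight values exactly.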
4) The 4-bit convolution operation W_q * x_q is performed, and then the offset W_q * 8 + b_q is applied (convolving W_q with the constant 8 compensates for the shift of the activation by −8).
5) Dequantization, i.e. division by scale_w·scale_a. Since the dequantization operation can be folded and combined with the quantization of the next layer's convolution input, it can be hidden.
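The integer pipeline of steps 2) to 5) can be checked end to end on a single dot product. This sketch uses the assumed scale conventions from the training section (scale_w maps weights into −8..7, scale_a maps activations into 0..15), and the offset term is this sketch's reconstruction of the shift-by-8 compensation:

```python
import numpy as np

def int4_dot(w, x, b, scale_w, scale_a):
    """Simulate the int4 inference steps for one dot product:
    quantize, integer multiply-accumulate, offset correction, dequantize.
    x is assumed non-negative (post-relu) and within the activation range."""
    wq = np.clip(np.round(w * scale_w), -8, 7)       # signed 4-bit weights
    xq = np.clip(np.round(x * scale_a), 0, 15) - 8   # shift to signed 4-bit
    bq = np.round(b * scale_w * scale_a)             # quantized bias
    acc = wq @ xq + 8 * wq.sum() + bq                # integer MAC + offset
    return acc / (scale_w * scale_a)                 # dequantize
```

The result approximates the floating-point w·x + b up to 4-bit rounding error, confirming that the −8 shift and its offset correction cancel exactly.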
Finally, an overall check of the network is required: constants are folded, some constant computations are merged, and redundant operators are removed. The model is then saved; the saved model is the complete int4 inference model and can be used directly for inference calculation.
If the accuracy of the pseudo-quantization model does not converge to the set accuracy, the process returns to step S3, and retraining the neural network is performed again, and a new pseudo-quantization model is obtained until the accuracy of the new pseudo-quantization model converges to the set accuracy.
Example two
Referring to fig. 2 based on the embodiment shown in fig. 1, fig. 2 is a schematic structural diagram of a 4-bit quantization system of a neural network according to an embodiment of the present disclosure. As can be seen from fig. 2, the 4-bit quantization system of the neural network in this embodiment mainly includes: the device comprises a loading module, a statistic module, a retraining module, a judging module and a converting module.
The loading module is used for loading a pre-trained model of the neural network; the statistics module is used for counting the initial value of each saturated activation layer satRelu in the pre-trained model; the retraining module is used for adding pseudo-quantization nodes to the neural network and retraining it with the initial satRelu values to obtain a pseudo-quantization model; the judging module is used for judging whether the precision of the pseudo-quantization model converges to the set precision; and the conversion module is configured to perform pre-inference processing on the pseudo-quantization model when its precision converges to the set precision and to convert it into a 4-bit inference model for inference operation, wherein the pre-inference processing comprises: constant folding, secondary quantization, and activation equivalent transformation.
Further, the statistics module includes: the device comprises a replacing unit, an activation value acquiring unit, a counting unit and an initial value selecting unit. Wherein, the replacing unit is used for replacing all the activation layers relu in the neural network with saturated activation layers satRelu; an activation value acquisition unit, configured to acquire an activation value of each saturation activation layer satRelu according to the acquired command; a statistical unit for counting distribution data by using a histogram according to the activation value; an initial value selecting unit, configured to select an activation value located at 99.999% of a point in the histogram as an initial value of a parameter max in a saturation activation layer satRelu, where satRelu is defined as:
In back propagation, the gradient of satRelu with respect to the parameter max is 1 where x ≥ max and 0 otherwise, and the gradient of satRelu with respect to the input x is 1 where 0 < x < max and 0 otherwise.
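As a minimal numpy sketch of the saturated activation and its two gradients (the function and variable names are illustrative, not from the patent, and assume a clipped-ReLU form of satRelu with a scalar learnable threshold):

```python
import numpy as np

def satrelu_forward(x, max_val):
    # satRelu(x) = min(max(x, 0), max_val): a ReLU that saturates at the
    # learnable threshold max_val.
    return np.minimum(np.maximum(x, 0.0), max_val)

def satrelu_backward(x, max_val, grad_out):
    # Gradient w.r.t. the input: 1 only inside the linear region 0 < x < max_val.
    grad_x = grad_out * ((x > 0) & (x < max_val)).astype(x.dtype)
    # Gradient w.r.t. max_val: 1 where the input saturated (x >= max_val),
    # summed because max_val is a single scalar shared by the whole layer.
    grad_max = float(np.sum(grad_out * (x >= max_val)))
    return grad_x, grad_max
```

Because max_val receives a nonzero gradient whenever an input saturates, it can be trained jointly with the weights and compressed by L2 regularization, as the retraining step describes.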
The retraining module comprises a pseudo-quantization layer insertion unit, a first retraining unit, and a second retraining unit. The pseudo-quantization layer insertion unit is configured to insert a weight pseudo-quantization layer before each weight layer of the neural network and an activation pseudo-quantization layer before each activation layer. The first retraining unit is configured to retrain the neural network on the weight pseudo-quantization layer using the formula y = quant(w) = clip(round(w·scale))/scale, where w is the weight value, n is 4, and the scaling coefficient is scale = (2^(n−1) − 1)/max(|w|). The second retraining unit is configured to retrain the neural network on the activation pseudo-quantization layer using the formula y = quant(x) = clip(round(x·scale))/scale, where x is the activation value of each layer, max is the maximum value (saturation threshold) of satRelu, n is 4, and the scaling coefficient is scale = (2^(n−1) − 1)/max.
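As a sketch of the two pseudo-quantization formulas, assuming a symmetric signed 4-bit range for weights and the non-negative half of that range for satRelu outputs (the exact clipping bounds are not legible in this text and are an assumption here):

```python
import numpy as np

N = 4                        # bit width, n = 4 in the patent
QMAX = 2 ** (N - 1) - 1      # 7

def fake_quant_weight(w):
    # y = quant(w) = clip(round(w * scale)) / scale, scale = (2^(n-1)-1) / max|w|
    scale = QMAX / np.max(np.abs(w))
    return np.clip(np.round(w * scale), -QMAX, QMAX) / scale

def fake_quant_act(x, max_val):
    # y = quant(x) = clip(round(x * scale)) / scale, scale = (2^(n-1)-1) / max,
    # where max_val is the satRelu saturation threshold, so x lies in [0, max_val].
    scale = QMAX / max_val
    return np.clip(np.round(x * scale), 0, QMAX) / scale
```

During retraining, these layers quantize in the forward pass while the backward pass treats them as identity (the straight-through estimator), so the quantization error never exceeds half a quantization step, 0.5/scale.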
The pre-inference processing module comprises a constant folding unit, a secondary quantization unit, and an activation equivalent transformation unit. The constant folding unit is configured to fuse the batchNorm layer into the convolution: the convolution computes z = w·x + b, the batchNorm layer computes y = γ·(z − μ)/√(σ² + ε) + β, and the merged convolution computes y = w_new·x + b_new with w_new = γ·w/√(σ² + ε) and b_new = γ·(b − μ)/√(σ² + ε) + β. The secondary quantization unit is configured to quantize the weights a second time to obtain the quantization scale coefficient of the twice-quantized weights. The activation equivalent transformation unit is configured to perform an equivalent transformation on the activation.
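A small sketch of the constant-folding step, assuming the standard batchNorm parameterization (gamma, beta, mean, var for scale, shift, running mean, and running variance; the original formula images are lost, and per-channel broadcasting is omitted for brevity):

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    # conv:       z = w * x + b
    # batchNorm:  y = gamma * (z - mean) / sqrt(var + eps) + beta
    # folded:     y = w_new * x + b_new  (one affine map, no batchNorm at inference)
    std = np.sqrt(var + eps)
    w_new = w * gamma / std
    b_new = gamma * (b - mean) / std + beta
    return w_new, b_new
```

Folding is exact at inference time because batchNorm with frozen statistics is itself affine, so the composition of two affine maps collapses into one.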
The working principle and working method of the 4-bit quantization system of the neural network in this embodiment have already been described in detail in the embodiment shown in fig. 1, and are not repeated here.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method of 4-bit quantization for a neural network, the method comprising:
loading a pre-training model of the neural network;
counting initial values of satRelu of each saturation activation layer in the pre-training model;
adding a pseudo-quantization node in the neural network, and retraining the neural network by using an initial value of satRelu to obtain a pseudo-quantization model;
judging whether the precision of the pseudo quantization model converges to a set precision or not;
if yes, carrying out pre-inference processing on the pseudo quantization model, and converting the pseudo quantization model into a 4-bit inference model for inference operation, wherein the pre-inference processing comprises: constant folding, secondary quantization, and activation equivalent transformation;
if not, continuing to retrain the neural network.
2. The method of claim 1, wherein the method for counting initial values of satRelu in each saturation activation layer in the pre-training model comprises:
replacing all active layers relu in the neural network with saturated active layers satRelu;
acquiring an activation value of each saturation activation layer satRelu according to an acquisition command;
according to the activation value, counting distribution data by utilizing a histogram;
selecting the activation value at the 99.999th percentile of the histogram as the initial value of a parameter max in the saturation activation layer satRelu, wherein satRelu is defined as: satRelu(x) = min(max(x, 0), max).
3. The 4-bit quantization method of the neural network as claimed in claim 2, wherein the parameter max is compressed by L2 regularization during retraining of the neural network.
4. The 4-bit quantization method of the neural network as claimed in claim 2, wherein the number of retraining epochs is less than or equal to 10, and the value of the parameter max is less than or equal to 1.
5. The method according to claim 1, wherein the method for obtaining the pseudo quantization model by adding the pseudo quantization node in the neural network and retraining the neural network with an initial value of satRelu comprises:
inserting a weight pseudo-quantization layer before a weight layer of the neural network and an activation pseudo-quantization layer before an activation layer;
retraining the neural network on the weight pseudo-quantization layer by using the formula y = quant(w) = clip(round(w·scale))/scale, wherein w is the weight value, n is 4, and the scaling coefficient is scale = (2^(n−1) − 1)/max(|w|); and retraining the neural network on the activation pseudo-quantization layer by using the formula y = quant(x) = clip(round(x·scale))/scale, wherein x is the activation value of each layer, max is the maximum value of satRelu, n is 4, and the scaling coefficient is scale = (2^(n−1) − 1)/max.
6. The 4-bit quantization method of the neural network according to any one of claims 2-4, wherein a straight-through estimator is used to calculate the gradient in the back-propagation process when retraining the neural network.
7. A 4-bit quantization system for a neural network, the system comprising:
the loading module is used for loading the pre-training model of the neural network;
the statistical module is used for counting the initial value of each saturation activation layer satRelu in the pre-training model;
the retraining module is used for adding a pseudo quantization node in the neural network, retraining the neural network by using an initial value of the satRelu, and acquiring a pseudo quantization model;
the judging module is used for judging whether the precision of the pseudo quantization model converges to the set precision or not;
a conversion module, configured to perform pre-inference processing on the pseudo quantization model when the accuracy of the pseudo quantization model converges to the set accuracy, and convert the pseudo quantization model into a 4-bit inference model for inference operation, wherein the pre-inference processing comprises: constant folding, secondary quantization, and activation equivalent transformation.
8. The 4-bit quantization system of a neural network of claim 7, wherein the statistics module comprises:
a replacement unit, configured to replace all active layers relu in the neural network with saturated active layers satRelu;
an activation value acquisition unit, configured to acquire an activation value of each saturation activation layer satRelu according to an acquisition command;
a statistical unit for counting distribution data by using a histogram according to the activation value;
an initial value selection unit, configured to select the activation value at the 99.999th percentile of the histogram as the initial value of a parameter max in the saturation activation layer satRelu, wherein satRelu is defined as: satRelu(x) = min(max(x, 0), max).
9. The 4-bit quantization system of a neural network of claim 7, wherein the retraining module comprises:
a pseudo quantization layer insertion unit for inserting a weight pseudo quantization layer before a weight layer of the neural network and inserting an activation pseudo quantization layer before an activation layer;
a first retraining unit, configured to retrain the neural network for the weight pseudo-quantization layer by using the formula y = quant(w) = clip(round(w·scale))/scale, wherein w is the weight value, n is 4, and the scaling coefficient is scale = (2^(n−1) − 1)/max(|w|); and a second retraining unit, configured to retrain the neural network for the activation pseudo-quantization layer by using the formula y = quant(x) = clip(round(x·scale))/scale, wherein x is the activation value of each layer, max is the maximum value of satRelu, n is 4, and the scaling coefficient is scale = (2^(n−1) − 1)/max.
10. the 4-bit quantization system of a neural network of claim 7, wherein the pre-inference processing module comprises:
a constant folding unit, configured to fuse the batchNorm layer into the convolution, wherein the convolution computes z = w·x + b, the batchNorm layer computes y = γ·(z − μ)/√(σ² + ε) + β, and the merged convolution computes y = w_new·x + b_new with w_new = γ·w/√(σ² + ε) and b_new = γ·(b − μ)/√(σ² + ε) + β;
a secondary quantization unit, configured to quantize the weights a second time to obtain the quantization scale coefficient of the twice-quantized weights; and an activation equivalent transformation unit, configured to perform an equivalent transformation on the activation.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010589233.5A CN111882058A (en) | 2020-06-24 | 2020-06-24 | 4-bit quantization method and system of neural network |
PCT/CN2021/076982 WO2021258752A1 (en) | 2020-06-24 | 2021-02-20 | 4-bit quantization method and system for neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111882058A true CN111882058A (en) | 2020-11-03 |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112488291A (en) * | 2020-11-03 | 2021-03-12 | 珠海亿智电子科技有限公司 | Neural network 8-bit quantization compression method |
CN112884144A (en) * | 2021-02-01 | 2021-06-01 | 上海商汤智能科技有限公司 | Network quantization method and device, electronic equipment and storage medium |
WO2021258752A1 (en) * | 2020-06-24 | 2021-12-30 | 苏州浪潮智能科技有限公司 | 4-bit quantization method and system for neural network |
CN113887706A (en) * | 2021-09-30 | 2022-01-04 | 苏州浪潮智能科技有限公司 | Method and device for low bit quantization aiming at one-stage target detection network |
CN113971457A (en) * | 2021-10-29 | 2022-01-25 | 苏州浪潮智能科技有限公司 | Method and system for optimizing calculation performance of neural network |
CN114611697A (en) * | 2022-05-11 | 2022-06-10 | 上海登临科技有限公司 | Neural network quantification and deployment method, system, electronic device and storage medium |
CN114676760A (en) * | 2022-03-10 | 2022-06-28 | 北京智源人工智能研究院 | Pre-training model inference processing method and device, electronic equipment and storage medium |
Legal Events

Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20201103 |