CN112101524A - Method and system for on-line switching bit width quantization neural network - Google Patents
- Publication number: CN112101524A (application CN202010929604.XA)
- Authority: CN (China)
- Prior art keywords: network, bit, sub, neural network, super
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
Abstract
The invention provides a method and a system for a quantized neural network whose bit width can be switched online, comprising the following steps: integrating deep neural networks with different bit widths into a super network, where the networks of all bit widths share the same network architecture; running the super network at different bit widths, obtaining the corresponding network intermediate-layer features for each bit width, and processing each intermediate-layer feature with its own batch normalization layer; training the super network through supervised learning, simulating quantization noise in the super network training stage until the consistency loss function between the low-bit mode and the high-bit mode converges, to obtain the trained super network; and extracting a quantized neural network of the target bit width from the trained super network with a preset quantizer to perform low-bit inference. The invention enables a neural network to switch its bit width at will without retraining, so as to adapt to different hardware deployment environments.
Description
Technical Field
The invention relates to the fields of computer vision and image processing, and in particular to a method and a system for a quantized neural network whose bit width can be switched online.
Background
As Deep Neural Networks (DNNs) grow increasingly complex, deploying them often poses great challenges. Model compression and model acceleration have therefore received increasing attention in machine learning. One important research direction is to quantize the deep neural network, i.e., to quantize model weights and intermediate-layer activations to smaller bit widths. Owing to the reduced bit width, a quantized deep neural network has a smaller model size and can perform fast inference using efficient fixed-point computation. However, when the bit width is reduced to 4 bits or fewer, a significant loss of accuracy is incurred. To alleviate this problem, quantization-aware training is commonly employed to restore model accuracy: it simulates the quantization effect of a given bit width during training so that the model can adapt to quantization noise in the training stage. In real-world scenarios, different devices may support different bit widths. For example, the Tesla T4 supports 4, 8, 16 and 32 bits, while the Watt A1 supports 1, 2, 3 and 4 bits. Deploying a deep neural network at each bit width via separate quantization-aware training would require a large amount of training time and resources, making deployment inconvenient. The invention provides a quantized neural network whose bit width can be switched online, so that a deep neural network can be deployed at different bit widths without extra training.
At the present stage, quantization algorithms that require no additional training can be divided, according to whether original data are required, into data-free quantization and calibration-based quantization. In data-free quantization, the quantization process does not require any original dataset; algorithms of this type adjust the quantization scheme according to the characteristics of the model itself and modify the batch normalization layers of the neural network. In calibration-based quantization, a portion of the images is collected as a calibration dataset. By feeding the calibration dataset into the deep neural network to be quantized, the activation distribution of each layer in the network can be collected, and by computing the KL divergence between the activation distributions before and after quantization, better network quantization can be achieved.
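The calibration-based procedure above can be illustrated with a small numerical sketch (not the patent's implementation): build an activation histogram from calibration data, simulate the information loss of quantizing to fewer levels by coarsening the histogram, and compare the KL divergence at different bit widths. The function names and the coarsening scheme are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two histograms, normalized to distributions."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def coarsen_histogram(hist, num_levels):
    """Simulate quantization to `num_levels` values: merge fine bins into
    coarse bins, then spread each coarse bin back uniformly."""
    per = len(hist) // num_levels
    coarse = hist[:per * num_levels].reshape(num_levels, per).sum(axis=1)
    return np.repeat(coarse / per, per)

# Activation statistics gathered from (synthetic) calibration data
rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=100_000))
hist, _ = np.histogram(acts, bins=2048)
hist = hist.astype(np.float64)

d8 = kl_divergence(hist, coarsen_histogram(hist, 256))  # 8-bit: 256 levels
d4 = kl_divergence(hist, coarsen_histogram(hist, 16))   # 4-bit: 16 levels
assert d8 > 0 and d4 > d8  # fewer levels distort the distribution more
```

In practice the same divergence measurement is used per layer to pick clipping thresholds, but the core comparison is the one shown above.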
Patent document CN109961141A (application number: 201910288941.2) discloses a method and apparatus for generating a quantized neural network. One embodiment of the method comprises: acquiring a training sample set and an initial neural network; converting the original floating-point network parameters in the initial neural network into integer network parameters; generating a quantized initial neural network based on the converted integer network parameters; selecting training samples from the training sample set, and executing the following training steps: taking the sample information in a training sample as the input of the quantized initial neural network, taking the sample result in the training sample as the expected output of the quantized initial neural network, and training the quantized initial neural network; in response to determining that training of the quantized initial neural network is complete, generating a quantized neural network based on the trained quantized initial neural network. However, this method can only quantize the neural network to 8 bits, and it cannot change the bit width flexibly while the network is running.
Disclosure of Invention
In view of the defects in the prior art, an object of the present invention is to provide a method and a system for a quantized neural network whose bit width can be switched online.
The method provided by the invention for a quantized neural network with an online-switchable bit width comprises the following steps:
step M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
step M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
step M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
step M4: extracting a quantized neural network of the target bit width from the trained super network with a preset quantizer to perform low-bit inference;
the low-bit mode is the mode of a sub-neural network whose bit width k is smaller than a preset value;
the high-bit mode is the mode of the sub-neural network whose bit width k equals the preset value.
Preferably, the step M1 includes: obtaining a low-bit neural network by quantizing a high-bit neural network, and integrating sub-neural networks of different bit widths into a super network; the super network and each sub-neural network of a different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width can be obtained by quantizing the sub-neural network of high bit width.
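The shared-parameter idea in step M1 can be sketched as follows: the super network keeps a single set of floating-point weights, and each k-bit sub-network is the same tensor passed through a k-bit quantizer. The symmetric uniform quantizer below is an illustrative assumption, not necessarily the patent's exact formula.

```python
import numpy as np

def quantize_weights(w, k):
    """Symmetric uniform k-bit quantizer (illustrative): map floats onto
    2^(k-1)-1 integer levels per sign, then de-quantize back to floats."""
    levels = 2 ** (k - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(np.clip(w / scale, -levels, levels)) * scale

rng = np.random.default_rng(1)
shared_w = rng.normal(size=(64, 64))  # the one shared float copy of the weights

# Each sub-network reuses the same tensor; only the bit width differs.
w8, w4, w2 = (quantize_weights(shared_w, k) for k in (8, 4, 2))

def mean_err(w):
    return float(np.abs(w - shared_w).mean())

assert mean_err(w8) < mean_err(w4) < mean_err(w2)  # lower bits, larger error
```

Because every bit width reads from the same master tensor, no per-bit-width copies of the convolutional or fully-connected parameters are stored.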
Preferably, processing each network intermediate-layer feature in the step M2 with its corresponding batch normalization layer is performed as:

f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k    (1)

where k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote the learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; and ε denotes a small constant.
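The bit-width-exclusive batch normalization can be sketched as one (γ_k, β_k) pair per supported bit width, selected at call time. This is a minimal numpy sketch under that assumption; a real implementation would also track running statistics for inference.

```python
import numpy as np

class SwitchableBatchNorm:
    """One (gamma_k, beta_k) pair per supported bit width; the k-bit
    sub-network's features are normalized with its own parameters."""
    def __init__(self, num_features, bit_widths, eps=1e-5):
        self.eps = eps
        self.params = {k: {"gamma": np.ones(num_features),
                           "beta": np.zeros(num_features)}
                       for k in bit_widths}

    def __call__(self, f_k, k):
        p = self.params[k]  # parameters exclusive to bit width k
        mu, var = f_k.mean(axis=0), f_k.var(axis=0)
        return p["gamma"] * (f_k - mu) / np.sqrt(var + self.eps) + p["beta"]

bn = SwitchableBatchNorm(num_features=4, bit_widths=(2, 4, 8, 32))
x = np.random.default_rng(2).normal(loc=5.0, scale=3.0, size=(256, 4))
y = bn(x, k=4)
assert np.allclose(y.mean(axis=0), 0.0, atol=1e-7)  # normalized per feature
assert np.allclose(y.std(axis=0), 1.0, atol=1e-2)
```

Keeping statistics separate per bit width matters because quantization shifts the feature distribution, so a single shared normalization layer would mis-normalize some bit widths.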
Preferably, the step M3 includes:
the optimization task of the super network is based on multi-task learning and an optimized super network overall objective function, and the formula is as follows:
Lall=∑k∈mode listαkLk+ω||Q||2 (2)
wherein, mode list represents the set of all bit-wide sub-neural networks contained in the super network, and k represents a k-bit-wide sub-neural network; l iskRepresenting a consistency loss function corresponding to the k bit wide sub-neural network; alpha is alphakRepresenting a sub-optimization goal LkThe proportion occupied in the overall objective function; q represents a parameter of the hyper-network; | | non-woven hair2To representAn L2 norm; | Q | non-conducting phosphor2Representing parameter attenuation, and regularizing the super network; omega is the weight of parameter attenuation;
For each sub-optimization objective L_k, when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and the specific form of L_a is:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)    (3)

where (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its category label; f_a denotes the single-precision floating-point sub-neural network; and H denotes the cross-entropy function;
For each sub-optimization objective L_k, when k < a, the super network runs with k-bit integers, and L_k is expressed as:

L_k = Σ_{x_i ∈ X} KL( σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T) )    (4)

where KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a hyper-parameter used in computing the probability values, thereby achieving consistency alignment between the sub-neural networks of different bit widths.
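The consistency loss can be sketched as a temperature-softened KL divergence between the low-bit and full-precision output distributions. The KL direction and the temperature handling here are assumptions made for illustration.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_low, logits_high, T=2.0):
    """Mean KL divergence between the temperature-softened output
    distributions of the low-bit and high-bit sub-networks."""
    p = softmax(logits_low, T)
    q = softmax(logits_high, T)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

rng = np.random.default_rng(3)
logits_fp = rng.normal(size=(8, 10))              # full-precision outputs
near = logits_fp + 0.1 * rng.normal(size=(8, 10)) # mild quantization noise
far = logits_fp + 2.0 * rng.normal(size=(8, 10))  # heavy quantization noise

assert consistency_loss(logits_fp, logits_fp) < 1e-12  # identical -> zero loss
assert consistency_loss(near, logits_fp) < consistency_loss(far, logits_fp)
```

Minimizing this term pulls each low-bit sub-network's predictions toward the full-precision sub-network's predictions, which is what keeps accuracy usable after a bit-width switch.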
Preferably, the step M4 includes:
in the forward computation stage of the neural network, the super network keeps one copy of the single-precision floating-point network parameters and quantizes them into low-bit-width integers; the quantizer for bit width k is:

Q_k(r) = round(clamp(r, 0, 1) · (2^k − 1)) / (2^k − 1)    (5)

where r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; and round() denotes the rounding function;
in the training stage, the gradient of the quantizer is estimated with the straight-through estimator, which treats ∂Q_k/∂r as 1:

∂L/∂r = ∂L/∂Q_k    (6)

where ∂ denotes the partial differential, Q_k denotes the quantized value, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated with the gradients obtained by estimation; the updated parameters are the single-precision floating-point parameters stored in the super network.
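Step M4's forward quantization and straight-through backward pass can be sketched together. A DoReFa-style uniform quantizer on [0, 1] is assumed here; the patent's exact clamp range and scaling may differ.

```python
import numpy as np

def quantize_k(r, k):
    """Forward pass: clamp to [0, 1], scale to 2^k - 1 integer levels,
    round, and rescale back to floats (assumed DoReFa-style form)."""
    n = 2 ** k - 1
    return np.round(np.clip(r, 0.0, 1.0) * n) / n

def ste_grad(grad_output, r):
    """Backward pass: the straight-through estimator treats dQ_k/dr as 1
    inside the clamp range and 0 outside it."""
    return grad_output * ((r >= 0.0) & (r <= 1.0))

r = np.linspace(-0.2, 1.2, 8)       # float master weights to be quantized
q = quantize_k(r, k=2)              # 4 levels: 0, 1/3, 2/3, 1
assert set(np.round(q * 3).astype(int)) <= {0, 1, 2, 3}

g = ste_grad(np.ones_like(r), r)    # gradient of an all-ones upstream signal
assert g[0] == 0.0 and g[-1] == 0.0  # no gradient outside the clamp range
assert g[3] == 1.0                   # identity gradient inside it
```

The rounding step has zero gradient almost everywhere, which is why training needs the straight-through estimator: gradients flow to the float master copy as if the quantizer were the identity.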
The system provided by the invention for a quantized neural network with an online-switchable bit width comprises the following modules:
module M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
module M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
module M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
module M4: extracting a quantized neural network of the target bit width from the trained super network with a preset quantizer to perform low-bit inference;
the low-bit mode is the mode of a sub-neural network whose bit width k is smaller than a preset value;
the high-bit mode is the mode of the sub-neural network whose bit width k equals the preset value.
Preferably, said module M1 comprises: obtaining a low-bit neural network by quantizing a high-bit neural network, and integrating sub-neural networks of different bit widths into a super network; the super network and each sub-neural network of a different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width can be obtained by quantizing the sub-neural network of high bit width.
Preferably, processing each network intermediate-layer feature in the module M2 with its corresponding batch normalization layer is performed as:

f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k    (1)

where k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote the learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; and ε denotes a small constant.
Preferably, said module M3 comprises:
The optimization task of the super network is based on multi-task learning; the overall objective function of the super network to be optimized is:

L_all = Σ_{k ∈ mode list} α_k · L_k + ω‖Q‖²    (2)

where mode list denotes the set of sub-neural networks of all bit widths contained in the super network, and k denotes the k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the proportion of the sub-optimization objective L_k in the overall objective function; Q denotes the parameters of the super network; ‖·‖² denotes the L2 norm; ‖Q‖² denotes the parameter decay that regularizes the super network; and ω is the weight of the parameter decay;
For each sub-optimization objective L_k, when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and the specific form of L_a is:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)    (3)

where (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its category label; f_a denotes the single-precision floating-point sub-neural network; and H denotes the cross-entropy function;
For each sub-optimization objective L_k, when k < a, the super network runs with k-bit integers, and L_k is expressed as:

L_k = Σ_{x_i ∈ X} KL( σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T) )    (4)

where KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a hyper-parameter used in computing the probability values, thereby achieving consistency alignment between the sub-neural networks of different bit widths.
Preferably, said module M4 comprises:
in the forward computation stage of the neural network, the super network keeps one copy of the single-precision floating-point network parameters and quantizes them into low-bit-width integers; the quantizer for bit width k is:

Q_k(r) = round(clamp(r, 0, 1) · (2^k − 1)) / (2^k − 1)    (5)

where r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; and round() denotes the rounding function;
in the training stage, the gradient of the quantizer is estimated with the straight-through estimator, which treats ∂Q_k/∂r as 1:

∂L/∂r = ∂L/∂Q_k    (6)

where ∂ denotes the partial differential, Q_k denotes the quantized value, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated with the gradients obtained by estimation; the updated parameters are the single-precision floating-point parameters stored in the super network.
Compared with the prior art, the invention has the following beneficial effects:
1. By designing a quantized neural network whose bit width can be switched online, the invention allows a deep neural network to be deployed directly at different bit widths without any additional training, which greatly facilitates the use and deployment of the model;
2. In terms of prediction accuracy, owing to the mutual promotion among the sub-networks within the super network, the invention can achieve higher prediction accuracy than a conventional quantized neural network;
3. On the large-scale ImageNet dataset, this design improves classification accuracy by more than 1%.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flowchart of the method for a quantized neural network capable of switching bit widths online;
FIG. 2 is a system schematic diagram of the quantized neural network capable of switching bit widths online.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example 1
The method provided by the invention for a quantized neural network with an online-switchable bit width comprises the following steps:
step M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
step M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
step M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
step M4: extracting a quantized neural network of the target bit width from the trained super network with a preset quantizer to perform low-bit inference;
the low bit mode is a sub-neural network mode of bit width when k is smaller than a preset value;
the high bit pattern is a sub-neural network pattern of bit width when k is a preset value.
Specifically, the step M1 includes: obtaining a low-bit neural network by quantizing a high-bit neural network, and integrating sub-neural networks of different bit widths into a super network; the super network and each sub-neural network of a different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width can be obtained by quantizing the sub-neural network of high bit width.
Specifically, processing each network intermediate-layer feature in the step M2 with its corresponding batch normalization layer is performed as:

f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k    (1)

where k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote the learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; and ε denotes a small constant.
Specifically, the step M3 includes:
The optimization task of the super network is based on multi-task learning; the overall objective function of the super network to be optimized is:

L_all = Σ_{k ∈ mode list} α_k · L_k + ω‖Q‖²    (2)

where mode list denotes the set of sub-neural networks of all bit widths contained in the super network, and k denotes the k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the proportion of the sub-optimization objective L_k in the overall objective function; Q denotes the parameters of the super network; ‖·‖² denotes the L2 norm; ‖Q‖² denotes the parameter decay that regularizes the super network; and ω is the weight of the parameter decay;
For each sub-optimization objective L_k, when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and the specific form of L_a is:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)    (3)

where (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its category label; f_a denotes the single-precision floating-point sub-neural network; and H denotes the cross-entropy function;
For each sub-optimization objective L_k, when k < a, the super network runs with k-bit integers, and L_k is expressed as:

L_k = Σ_{x_i ∈ X} KL( σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T) )    (4)

where KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a hyper-parameter used in computing the probability values, thereby achieving consistency alignment between the sub-neural networks of different bit widths.
Specifically, the step M4 includes:
in the forward computation stage of the neural network, the super network keeps one copy of the single-precision floating-point network parameters and quantizes them into low-bit-width integers; the quantizer for bit width k is:

Q_k(r) = round(clamp(r, 0, 1) · (2^k − 1)) / (2^k − 1)    (5)

where r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; and round() denotes the rounding function;
in the training stage, the gradient of the quantizer is estimated with the straight-through estimator, which treats ∂Q_k/∂r as 1:

∂L/∂r = ∂L/∂Q_k    (6)

where ∂ denotes the partial differential, Q_k denotes the quantized value, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated with the gradients obtained by estimation; the updated parameters are the single-precision floating-point parameters stored in the super network.
The system provided by the invention for a quantized neural network with an online-switchable bit width comprises the following modules:
module M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
module M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
module M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
module M4: extracting a quantized neural network of the target bit width from the trained super network with a preset quantizer to perform low-bit inference;
the low-bit mode is the mode of a sub-neural network whose bit width k is smaller than a preset value;
the high-bit mode is the mode of the sub-neural network whose bit width k equals the preset value.
Specifically, the module M1 includes: obtaining a low-bit neural network by quantizing a high-bit neural network, and integrating sub-neural networks of different bit widths into a super network; the super network and each sub-neural network of a different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width can be obtained by quantizing the sub-neural network of high bit width.
Specifically, processing each network intermediate-layer feature in the module M2 with its corresponding batch normalization layer is performed as:

f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k    (1)

where k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote the learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; and ε denotes a small constant.
Specifically, the module M3 includes:
The optimization task of the super network is based on multi-task learning; the overall objective function of the super network to be optimized is:

L_all = Σ_{k ∈ mode list} α_k · L_k + ω‖Q‖²    (2)

where mode list denotes the set of sub-neural networks of all bit widths contained in the super network, and k denotes the k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the proportion of the sub-optimization objective L_k in the overall objective function; Q denotes the parameters of the super network; ‖·‖² denotes the L2 norm; ‖Q‖² denotes the parameter decay that regularizes the super network; and ω is the weight of the parameter decay;
For each sub-optimization objective L_k, when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and the specific form of L_a is:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)    (3)

where (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its category label; f_a denotes the single-precision floating-point sub-neural network; and H denotes the cross-entropy function;
For each sub-optimization objective L_k, when k < a, the super network runs with k-bit integers, and L_k is expressed as:

L_k = Σ_{x_i ∈ X} KL( σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T) )    (4)

where KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a hyper-parameter used in computing the probability values, thereby achieving consistency alignment between the sub-neural networks of different bit widths.
Specifically, the module M4 includes:
in the forward computation stage of the neural network, the super network keeps one copy of the single-precision floating-point network parameters and quantizes them into low-bit-width integers; the quantizer for bit width k is:

Q_k(r) = round(clamp(r, 0, 1) · (2^k − 1)) / (2^k − 1)    (5)

where r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; and round() denotes the rounding function;
in the training stage, the gradient of the quantizer is estimated with the straight-through estimator, which treats ∂Q_k/∂r as 1:

∂L/∂r = ∂L/∂Q_k    (6)

where ∂ denotes the partial differential, Q_k denotes the quantized value, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated with the gradients obtained by estimation; the updated parameters are the single-precision floating-point parameters stored in the super network.
Example 2
Example 2 is a modification of Example 1.
In view of the defects in the prior art, an object of the present invention is to provide a quantization neural network capable of switching bit widths online, so that the deep neural network can be deployed with different bit widths without any additional training.
As shown in fig. 1, a flowchart of the present invention for switching bit-width quantized neural networks online: the method integrates neural networks with different bit widths into the same network structure by constructing a super network, and processes the features of different bit widths with separate batch normalization layers to ensure network convergence. Through the consistency-loss-function step, the consistency between the low-bit mode and the high-bit mode is constrained in the training stage, reducing the error caused by quantization. The super network is optimized with quantization-aware training; after the super network converges, fast inference and model compression are achieved at any bit width.
A quantized neural network capable of switching bit widths online, comprising:
building a hyper network: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
bit width exclusive batch normalization step: processing the features obtained under different bit widths by adopting an independent batch normalization layer;
consistency-loss-function step: in the training stage, the consistency between the low-bit mode and the high-bit mode is constrained, reducing the error caused by quantization;
quantization-aware training step: quantization noise is simulated in the network training stage, so that the model attains higher classification accuracy at low bit widths;
low-bit inference step: after training is finished, a low-bit network model is derived using the predefined quantizer, achieving fast inference and model compression.
The super-network building step, wherein: deep neural networks with different bit widths are integrated into the same network structure using a super network.
By utilizing the characteristic that the high-bit neural network can be quantized into a lower-bit neural network, the sub-neural networks with different bit widths can be integrated into a super network. The smaller the bit width is, the faster the inference speed of the model is. The super-network model and the sub-networks thereof have the same network topology structure, and simultaneously, all networks share all convolution kernels and full-connection layer parameters, so that the parameter quantity of the super-network is reduced. By means of a predefined quantizer, sub-networks of lower bit-widths can be directly quantized by the network of higher bit-widths.
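To make this parameter sharing concrete, the sketch below stores a single fp32 weight list and derives each bit width's integer weights on demand through a quantizer. The symmetric [-1, 1] clipping range and the (2^(k−1) − 1) scale are hypothetical choices, since this excerpt does not fix the quantizer's exact form.

```python
def derive_subnet_weights(fp32_weights, bit_widths):
    """Derive each bit width's integer weights from one shared fp32 copy.

    A minimal sketch: the super network stores fp32_weights once; every
    k-bit sub-network is just a quantized view of the same parameters,
    so no extra weight storage is needed per bit width. The symmetric
    [-1, 1] clipping range is an assumption.
    """
    def quantize(r, k):
        levels = (1 << (k - 1)) - 1                    # e.g. 127 for k = 8
        return round(max(-1.0, min(1.0, r)) * levels)  # clamp, scale, round

    # views are computed on demand from the single shared copy
    return {k: [quantize(w, k) for w in fp32_weights] for k in bit_widths}
```

Because every sub-network reads the same underlying fp32 tensor, switching bit width is a re-quantization, not a retraining.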
The bit width exclusive batch normalization step, wherein: for the intermediate layer characteristics generated under different bit widths, an independent batch normalization layer is used for processing, so that the characteristics in different distributions are not influenced by each other.
The bit width exclusive batch normalization step specifically comprises the following steps:
the batch normalization layer is as follows:

f̂ = γ · (f − μ) / √(σ² + ε) + β

where f is the raw feature to be processed, γ and β are learnable parameters, and ε is a small constant, typically set to 1e-5, to prevent the denominator √(σ² + ε) from being too small; μ and σ² are the mean and variance of the original feature f.
In the training phase, the mean and variance are statistically derived for the current batch of training samples.
In the test phase, the mean and variance are not affected by the test samples, but are determined by the running mean of the training phase.
The bit-width-dedicated batch normalization layer is as follows:

f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k

where k is the bit width, f_k is the raw feature produced by the k-bit sub-network to be processed, and γ_k and β_k are learnable parameters specific to the k-bit sub-network, allowing the channel features of sub-networks with different bit widths to be affine-transformed flexibly; μ_k and σ_k² are the mean and variance of the original feature f_k, kept separate so that sub-networks with different feature distributions do not interfere with each other.
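A minimal 1-D sketch of this bit-width-exclusive normalization: one parameter and statistics set per bit width, so a 4-bit forward pass never touches the 8-bit statistics. Only training-mode batch statistics are shown; the running averages used at test time are omitted.

```python
class BitWidthBatchNorm:
    """Per-bit-width batch normalization (training-mode sketch).

    Each bit width k owns its own (gamma_k, beta_k) so that features
    from differently quantized sub-networks do not contaminate each
    other's normalization.
    """
    def __init__(self, bit_widths, eps=1e-5):
        self.eps = eps
        # one independent parameter set per bit width
        self.params = {k: {"gamma": 1.0, "beta": 0.0} for k in bit_widths}

    def __call__(self, features, k):
        # statistics computed only from this bit width's features
        n = len(features)
        mu = sum(features) / n
        var = sum((f - mu) ** 2 for f in features) / n
        g, b = self.params[k]["gamma"], self.params[k]["beta"]
        return [g * (f - mu) / (var + self.eps) ** 0.5 + b for f in features]
```

In a real network this would be a per-channel layer with running statistics; the dictionary of parameter sets is the essential idea.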
The consistency-loss-function-based step, wherein: in the model training phase, the sub-network of each bit width is optimized, and the prediction consistency between low-bit-width and high-bit-width sub-networks is constrained during the optimization process.
The consistency loss function-based steps are as follows:
the optimization task of the super network is based on multi-task learning, and the optimized overall objective function is:

L_all = Σ_{k ∈ mode list} α_k L_k + ω||Q||₂

where mode list is the set of all bit widths contained in the super network and k denotes a k-bit-wide sub-network; L_k is the sub-optimization objective corresponding to the k-bit-wide sub-network, and α_k is the weight of L_k in the overall objective function; Q denotes the parameters of the super network, ||·||₂ is the L2 norm, and ||Q||₂ is a parameter-decay term whose purpose is to regularize the super network; ω is the weight of the parameter decay.
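The overall objective can be sketched as a weighted sum of per-bit-width losses plus a decay term. Whether ||Q||₂ denotes the norm or its square is not fixed by the text, so the squared form usual for weight decay is assumed here.

```python
def overall_objective(sub_losses, alphas, params, omega):
    """L_all = sum over k in mode list of alpha_k * L_k + omega * ||Q||^2.

    sub_losses and alphas map each bit width k in the mode list to its
    sub-objective L_k and weight alpha_k; params is the flat list of
    shared super-network parameters Q. The squared L2 decay is an
    assumed reading of the ||Q||_2 term.
    """
    task_term = sum(alphas[k] * sub_losses[k] for k in sub_losses)
    decay_term = omega * sum(q * q for q in params)  # omega * ||Q||^2
    return task_term + decay_term
```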
For each sub-optimization target L_k: when k = 32, the super network runs with single-precision floating-point numbers, and L_32 takes the following specific form:

L_32 = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_32(x_i), y_i)

where (X, Y) are the images and labels of the dataset, x_i is an image and y_i its class label, f_32 is the single-precision floating-point sub-neural network, and H is the cross-entropy function.
When k < 32, the super network runs with k-bit integers, and L_k takes the following specific form:

L_k = KL(σ(f_k(x_i)/T) ‖ σ(f_32(x_i)/T))

where KL is the KL divergence between the two distributions and σ is the softmax function, which maps the network output to probability values. T is a temperature hyper-parameter for computing the probability values; it softens the output probabilities so that consistency between different sub-networks can be aligned more effectively.
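A sketch of this temperature-softened consistency term for one sample follows. The argument order (low-bit distribution first) and the default T are assumptions; the excerpt names the two distributions but not the exact KL direction or temperature value.

```python
import math

def softened_softmax(logits, T):
    """Softmax with temperature T; larger T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def consistency_loss(low_bit_logits, high_bit_logits, T=2.0):
    """KL divergence between the softened low-bit and high-bit
    predictions for one sample: zero when the two sub-networks agree,
    positive when the low-bit mode drifts from the high-bit mode."""
    p = softened_softmax(low_bit_logits, T)   # low-bit mode sigma(f_k / T)
    q = softened_softmax(high_bit_logits, T)  # high-bit mode sigma(f_32 / T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```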
The quantization-aware training step, wherein: quantization noise is simulated in the network training stage, so that the model attains higher classification accuracy at low bit widths.
The quantization-aware training step specifically comprises the following:
in the network forward-propagation stage, the super network keeps a copy of the single-precision floating-point network parameters. To enable efficient matrix operations, the super network quantizes the single-precision floating-point parameters into low-bit-width integers; the quantizer corresponding to bit width k is as follows:

where r is the single-precision floating-point parameter to be quantized, k is the bit width, clamp() is the truncation function, and round() is the rounding function.
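The clamp-then-round quantizer can be sketched as below. The symmetric [-1, 1] clipping range and the (2^(k−1) − 1) scale are assumptions, since the excerpt names only the clamp() and round() steps.

```python
def quantize_k_bit(r, k):
    """Map a single-precision value r to a signed k-bit integer level.

    clamp() truncates r to an assumed [-1, 1] range; round() snaps the
    scaled value to the nearest of the symmetric integer levels.
    """
    levels = (1 << (k - 1)) - 1          # e.g. 127 for k = 8
    clamped = max(-1.0, min(1.0, r))     # the clamp() truncation step
    return round(clamped * levels)       # the round() step
```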
Since the quantizer is not differentiable, its gradient is estimated during the training phase using the straight-through estimator:

∂L/∂r = ∂L/∂Q_k

where ∂ denotes the partial derivative, Q_k is the quantized value after quantization, and r is the single-precision floating-point parameter to be quantized.
In the network back-propagation stage, the model is updated with the estimated gradients. The parameters updated are the single-precision floating-point parameters stored in the super network.
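The straight-through estimator and the resulting parameter update can be sketched as follows; the plain-SGD step and learning rate are illustrative, not part of the described method.

```python
def ste_grad(grad_wrt_quantized):
    """Straight-through estimator: treat the non-differentiable
    round() as identity in the backward pass, so the gradient with
    respect to r is taken to equal the gradient with respect to Q_k."""
    return grad_wrt_quantized

def update_fp32_param(r, grad_wrt_quantized, lr=0.1):
    """One illustrative SGD step: the forward pass used the quantized
    value Q_k(r), but the stored single-precision parameter r is what
    gets updated with the straight-through gradient."""
    return r - lr * ste_grad(grad_wrt_quantized)
```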
The low-bit inference step, wherein: after training is finished, a quantized neural network of the target bit width is extracted from the super network using the predefined quantizer so as to carry out low-bit inference.
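Exporting a deployable model then amounts to quantizing the shared fp32 weights once and recording a dequantization scale; the symmetric [-1, 1] quantizer here is again an assumed placeholder for the predefined quantizer.

```python
def export_low_bit_model(fp32_weights, target_bits):
    """Derive a target-bit-width integer model from the trained super
    network's fp32 weights, with no retraining. The recorded scale maps
    the integer weights back to real values at inference time."""
    levels = (1 << (target_bits - 1)) - 1
    return {
        "bits": target_bits,
        "scale": 1.0 / levels,   # dequantized value = integer * scale
        "weights": [round(max(-1.0, min(1.0, w)) * levels)
                    for w in fp32_weights],
    }
```

Switching deployment targets is then a matter of calling this once per desired bit width on the same trained weights.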
In summary, the present invention provides a quantized neural network capable of switching bit widths online. By building a super network, neural networks with different bit widths are integrated into the same network structure. With bit-width-exclusive batch normalization, a separate batch normalization layer processes the features of each bit width. A consistency loss function constrains the agreement between the low-bit and high-bit modes during training, reducing the error caused by quantization. Quantization-aware training simulates quantization noise during the network training stage, so that the model attains higher classification accuracy at low bit widths. After training is finished, a network model of a specific low bit width is derived with the predefined quantizer, achieving fast inference and model compression. The invention allows the bit width of the neural network to be switched at will, without retraining, to suit different hardware deployment environments.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (10)
1. A method for on-line switching of bit-width quantized neural networks, comprising:
step M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
step M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
step M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
step M4: extracting a quantization neural network of a target bit from the trained hyper-network by using a preset quantizer to carry out low bit reasoning;
the low bit mode is a sub-neural network mode of bit width when k is smaller than a preset value;
the high bit pattern is a sub-neural network pattern of bit width when k is a preset value.
2. The method for on-line switching of a bit-width quantized neural network according to claim 1, wherein said step M1 comprises: exploiting the property that a high-bit neural network can be quantized into a low-bit neural network, integrating sub-neural networks with different bit widths into one super network; the super network and each sub-neural network of different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a low-bit-width sub-neural network can be obtained by quantizing a high-bit-width sub-neural network.
3. The method of claim 1, wherein the processing of each network middle layer feature in the step M2 by using a corresponding batch normalization layer comprises:
f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k

wherein k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; ε denotes a small constant.
4. The method for on-line switching of a bit-width quantized neural network according to claim 1, wherein said step M3 comprises:
the optimization task of the super network is based on multi-task learning and an optimized overall objective function of the super network, with the formula:

L_all = Σ_{k ∈ mode list} α_k L_k + ω||Q||₂  (2)

wherein mode list denotes the set of all bit-width sub-neural networks contained in the super network, and k denotes a k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the weight of the sub-optimization target L_k in the overall objective function; Q denotes the parameters of the super network; ||·||₂ denotes the L2 norm; ||Q||₂ denotes a parameter-decay term that regularizes the super network; ω is the weight of the parameter decay;
for each sub-optimization target L_k: when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and L_a takes the following specific form:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)

wherein (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i its category label; f_a denotes the single-precision floating-point sub-neural network; H denotes the cross-entropy function;
for each sub-optimization target L_k: when k < a, the super network runs with k-bit integers, and L_k takes the following form:

L_k = KL(σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T))

wherein KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a temperature hyper-parameter used in computing the probability values, so that consistency alignment between sub-neural networks of different bit widths is achieved.
5. The method for on-line switching of a bit-width quantized neural network according to claim 4, wherein said step M4 comprises:
in the forward computation stage of the neural network, the super network keeps a copy of the single-precision floating-point network parameters; the super network quantizes these single-precision floating-point parameters into low-bit-width integers, and the quantizer for bit width k is as follows:

wherein r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; round() denotes the rounding function;
during the training phase, the gradient of the quantizer is estimated using the straight-through estimator:

∂L/∂r = ∂L/∂Q_k

wherein ∂ denotes the partial derivative, Q_k denotes the quantized value after quantization, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated using the estimated gradients; the updated parameters are the single-precision floating-point parameters stored in the super network.
6. A system for on-line switching of bit-width quantized neural networks, comprising:
module M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
module M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
module M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
module M4: extracting a quantization neural network of a target bit from the trained hyper-network by using a preset quantizer to carry out low bit reasoning;
the low bit mode is a sub-neural network mode of bit width when k is smaller than a preset value;
the high bit pattern is a sub-neural network pattern of bit width when k is a preset value.
7. The system of on-line switchable bit-width quantized neural network of claim 6, wherein said module M1 comprises: exploiting the property that a high-bit neural network can be quantized into a low-bit neural network, integrating sub-neural networks with different bit widths into one super network; the super network and each sub-neural network of different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a low-bit-width sub-neural network can be obtained by quantizing a high-bit-width sub-neural network.
8. The system of claim 6, wherein each network middle layer feature in the module M2 is processed by a corresponding batch normalization layer, and comprises:
f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k

wherein k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; ε denotes a small constant.
9. The system of on-line switchable bit-width quantized neural network of claim 6, wherein said module M3 comprises:
the optimization task of the super network is based on multi-task learning and an optimized overall objective function of the super network, with the formula:

L_all = Σ_{k ∈ mode list} α_k L_k + ω||Q||₂  (2)

wherein mode list denotes the set of all bit-width sub-neural networks contained in the super network, and k denotes a k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the weight of the sub-optimization target L_k in the overall objective function; Q denotes the parameters of the super network; ||·||₂ denotes the L2 norm; ||Q||₂ denotes a parameter-decay term that regularizes the super network; ω is the weight of the parameter decay;
for each sub-optimization target L_k: when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and L_a takes the following specific form:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)

wherein (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i its category label; f_a denotes the single-precision floating-point sub-neural network; H denotes the cross-entropy function;
for each sub-optimization target L_k: when k < a, the super network runs with k-bit integers, and L_k takes the following form:

L_k = KL(σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T))

wherein KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a temperature hyper-parameter used in computing the probability values, so that consistency alignment between sub-neural networks of different bit widths is achieved.
10. The system of on-line switchable bit-width quantized neural network of claim 9, wherein the module M4 comprises:
in the forward computation stage of the neural network, the super network keeps a copy of the single-precision floating-point network parameters; the super network quantizes these single-precision floating-point parameters into low-bit-width integers, and the quantizer for bit width k is as follows:

wherein r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; round() denotes the rounding function;
during the training phase, the gradient of the quantizer is estimated using the straight-through estimator:

∂L/∂r = ∂L/∂Q_k

wherein ∂ denotes the partial derivative, Q_k denotes the quantized value after quantization, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated using the estimated gradients; the updated parameters are the single-precision floating-point parameters stored in the super network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010929604.XA CN112101524A (en) | 2020-09-07 | 2020-09-07 | Method and system for on-line switching bit width quantization neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112101524A true CN112101524A (en) | 2020-12-18 |
Family
ID=73750792
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112926570A (en) * | 2021-03-26 | 2021-06-08 | 上海交通大学 | Adaptive bit network quantization method, system and image processing method |
CN113434750A (en) * | 2021-06-30 | 2021-09-24 | 北京市商汤科技开发有限公司 | Neural network search method, apparatus, device, storage medium, and program product |
WO2022222649A1 (en) * | 2021-04-23 | 2022-10-27 | Oppo广东移动通信有限公司 | Neural network model training method and apparatus, device, and storage medium |
WO2023015674A1 (en) * | 2021-08-12 | 2023-02-16 | 北京交通大学 | Multi-bit-width quantization method for deep convolutional neural network |
CN117709409A (en) * | 2023-05-09 | 2024-03-15 | 荣耀终端有限公司 | Neural network training method applied to image processing and related equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190050710A1 (en) * | 2017-08-14 | 2019-02-14 | Midea Group Co., Ltd. | Adaptive bit-width reduction for neural networks |
CN110363281A (en) * | 2019-06-06 | 2019-10-22 | 上海交通大学 | A kind of convolutional neural networks quantization method, device, computer and storage medium |
CN110555450A (en) * | 2018-05-31 | 2019-12-10 | 北京深鉴智能科技有限公司 | Face recognition neural network adjusting method and device |
US20200202213A1 (en) * | 2018-12-19 | 2020-06-25 | Microsoft Technology Licensing, Llc | Scaled learning for training dnn |
Non-Patent Citations (1)
Title |
---|
Kunyuan Du et al., "From Quantized DNNs to Quantizable DNNs", arXiv, pages 2-3 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||