CN112101524A - Method and system for on-line switching bit width quantization neural network - Google Patents

Method and system for on-line switching bit width quantization neural network

Info

Publication number
CN112101524A
CN112101524A
Authority
CN
China
Prior art keywords
network
bit
sub
neural network
super
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010929604.XA
Other languages
Chinese (zh)
Inventor
张娅
杜昆原
王延峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202010929604.XA priority Critical patent/CN112101524A/en
Publication of CN112101524A publication Critical patent/CN112101524A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods


Abstract

The invention provides a method and a system for a quantized neural network whose bit width can be switched online, comprising the following steps: integrating deep neural networks with different bit widths into a super network, wherein the networks of all bit widths share the same network architecture; running the super network with different bit widths, obtaining the corresponding network intermediate-layer features for any bit width, and processing each network intermediate-layer feature with its own batch normalization layer; training the super network through supervised learning and simulating quantization noise during the super-network training stage, until the consistency loss function between the low-bit mode and the high-bit mode converges, to obtain the trained super network; and extracting a quantized neural network of the target bit width from the trained super network with a preset quantizer to carry out low-bit inference. The invention enables the neural network to switch its bit width at will without retraining, so as to adapt to different hardware deployment environments.

Description

Method and system for on-line switching bit width quantization neural network
Technical Field
The invention relates to the field of computer vision and image processing, and in particular to a method and a system for a quantized neural network whose bit width can be switched online.
Background
With the increasing complexity of deep neural networks (DNNs), deploying them often poses great challenges. Model compression and model acceleration are therefore receiving increasing attention in machine learning. One important research direction is to quantize the deep neural network, i.e., to quantize model weights and intermediate-layer activations to smaller bit widths. Owing to the reduced bit width, a quantized deep neural network has a smaller model size and can perform fast inference using efficient fixed-point computation. However, when the bit width is reduced to 4 bits or fewer, a significant loss of accuracy is incurred. To alleviate this problem, quantization-aware training is commonly employed to recover model accuracy. Quantization-aware training simulates the quantization effect of a given bit width during training, so that the model can adapt to quantization noise in the training stage. In real-world scenarios, different devices may support different bit widths. For example, Tesla T4 supports 4, 8, 16 and 32 bits, while Watt A1 supports 1, 2, 3 and 4 bits. Deploying the deep neural network at different bit widths with quantization-aware training would require a large amount of training time and resources, which makes deployment inconvenient. The invention provides a quantized neural network whose bit width can be switched online, so that a deep neural network can be deployed at different bit widths without extra training.
At present, quantization algorithms that require no additional training can be divided, according to whether original data are required, into data-free quantization and calibration-based quantization. For data-free quantization, the quantization process does not require any original dataset. This type of quantization algorithm adjusts the quantization scheme according to the characteristics of the model itself and modifies the batch normalization layers of the neural network. For calibration-based quantization, a portion of the images is collected as a calibration dataset. By feeding the calibration dataset into the deep neural network to be quantized, the activation distribution of each layer in the network can be collected. By computing the KL divergence between the activation distributions before and after quantization, network quantization can be realized more accurately.
Patent document CN109961141A (application number: 201910288941.2) discloses a method and apparatus for generating a quantized neural network. One embodiment of the method comprises: acquiring a training sample set and an initial neural network; converting the original floating-point network parameters in the initial neural network into integer network parameters; generating a quantized initial neural network based on the converted integer network parameters; selecting training samples from the training sample set and executing the following training steps: taking the sample information in the training sample as the input of the quantized initial neural network, taking the sample result in the training sample as the expected output of the quantized initial neural network, and training the quantized initial neural network; and, in response to determining that training of the quantized initial neural network is complete, generating a quantized neural network based on the trained quantized initial neural network. However, this method can only quantize the neural network to 8 bits and cannot flexibly change the bit width while the network is running.
Disclosure of Invention
In view of the defects in the prior art, an object of the present invention is to provide a method and a system for a quantized neural network whose bit width can be switched online.
The method for the quantization neural network capable of switching the bit width on line provided by the invention comprises the following steps:
step M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
step M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
step M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
step M4: extracting a quantization neural network of a target bit from the trained hyper-network by using a preset quantizer to carry out low bit reasoning;
the low-bit mode is the sub-neural-network mode whose bit width k is smaller than a preset value;
the high-bit mode is the sub-neural-network mode whose bit width k equals the preset value.
Preferably, the step M1 includes: based on the characteristic that a high-bit neural network can be quantized into a low-bit neural network, integrating sub-neural networks with different bit widths into one super network; the super network and each sub-neural network of different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width is obtained by quantizing the sub-neural network of high bit width.
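For illustration only (this sketch is not part of the original disclosure), one way to realize the shared-parameter super network of step M1 in PyTorch is a convolution layer that keeps a single full-precision weight copy and fake-quantizes it to the currently selected bit width. The class name SharedQuantConv2d, the set_bit_width hook, and the clamp-then-round quantizer (in the style described later in step M4) are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w, k):
    # Simple k-bit quantizer in the clamp-then-round style described in step M4;
    # a version with a straight-through gradient is sketched further below.
    w = torch.clamp(w, 0.0, 1.0)
    scale = 2 ** k - 1
    return torch.round(w * scale) / scale

class SharedQuantConv2d(nn.Conv2d):
    """Convolution whose single fp32 weight copy serves every bit width.

    All bit-width sub-networks share this one parameter set; switching the
    active bit width only changes how the weights are quantized on the fly.
    """
    def __init__(self, *args, bit_list=(2, 4, 8, 32), **kwargs):
        super().__init__(*args, **kwargs)
        self.bit_list = bit_list
        self.active_bit = max(bit_list)   # default: full precision

    def set_bit_width(self, k):
        assert k in self.bit_list
        self.active_bit = k

    def forward(self, x):
        w = self.weight
        if self.active_bit < 32:          # low-bit modes fake-quantize the shared weights
            w = fake_quantize(w, self.active_bit)
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

A fully-connected layer could be wrapped in the same way, so that all convolutional-layer/fully-connected-layer parameters are shared as step M1 requires.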
Preferably, the processing of each network middle layer feature in the step M2 by using the corresponding batch normalization layer includes:
\hat{f}_k = \gamma_k \frac{f_k - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k    (1)

wherein k denotes the bit width; f_k denotes the original feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the original feature f_k; ε denotes a small constant.
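For illustration only, a minimal sketch of the bit-width-exclusive batch normalization of Eq. (1): one independent nn.BatchNorm2d per bit width, so that each bit mode keeps its own γ_k, β_k and running statistics (μ_k, σ_k²). The class name and the set_bit_width hook are assumptions consistent with the previous sketch.

```python
import torch.nn as nn

class BitWidthBatchNorm2d(nn.Module):
    """Bit-width-exclusive batch normalization (Eq. (1)).

    Each bit width k owns a separate BatchNorm2d, i.e. its own learnable
    gamma_k / beta_k and its own running mean / variance, so feature
    distributions of different bit modes do not interfere with each other.
    """
    def __init__(self, num_features, bit_list=(2, 4, 8, 32), eps=1e-5):
        super().__init__()
        self.bns = nn.ModuleDict(
            {str(k): nn.BatchNorm2d(num_features, eps=eps) for k in bit_list})
        self.active_bit = max(bit_list)

    def set_bit_width(self, k):
        self.active_bit = k

    def forward(self, x):
        # Route the feature through the BN layer dedicated to the active bit width.
        return self.bns[str(self.active_bit)](x)
```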
Preferably, the step M3 includes:
the optimization task of the super network is based on multi-task learning, and the overall objective function optimized for the super network is:

L_{all} = \sum_{k \in \text{mode list}} \alpha_k L_k + \omega \|Q\|_2    (2)

wherein mode list denotes the set of all bit-width sub-neural networks contained in the super network, and k denotes a k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the weight of the sub-objective L_k in the overall objective function; Q denotes the parameters of the super network; \|\cdot\|_2 denotes the L2 norm; \|Q\|_2 denotes weight decay, which regularizes the super network; ω is the weight of the weight-decay term;

for each sub-optimization objective L_k: when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and L_a takes the following specific form:

L_a = \sum_{(x_i, y_i) \in (X, Y)} H(f_a(x_i), y_i)    (3)

wherein (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its image category label; f_a denotes the single-precision floating-point sub-neural network; H denotes the cross-entropy function;

when k < a, the super network runs with k-bit integers, and L_k takes the following form:

L_k = \sum_{x_i \in X} \mathrm{KL}\!\left( \sigma(f_k(x_i)/T) \,\|\, \sigma(f_a(x_i)/T) \right)    (4)

wherein KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output into probability values; T denotes a hyper-parameter used in computing the probability values, so as to achieve consistency alignment between the sub-neural networks of different bit widths.
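For illustration only, a condensed training-step sketch of the objective in Eqs. (2)–(4): cross entropy for the highest-precision mode k = a and a temperature-softened KL consistency term for every lower-bit mode; the weight-decay term ω‖Q‖₂ is assumed to be handled by the optimizer. The set_bit_width call and the choice of detaching the high-bit output are assumptions, not details specified by the disclosure.

```python
import torch
import torch.nn.functional as F

def supernet_loss(supernet, images, labels, bit_list=(2, 4, 8, 32),
                  alphas=None, T=4.0):
    """L_all = sum_k alpha_k * L_k; weight decay is left to the optimizer."""
    alphas = alphas or {k: 1.0 for k in bit_list}
    a = max(bit_list)                                   # highest-precision mode

    supernet.set_bit_width(a)
    logits_a = supernet(images)
    loss = alphas[a] * F.cross_entropy(logits_a, labels)           # Eq. (3)

    log_pa = F.log_softmax(logits_a.detach() / T, dim=1)           # high-bit reference
    for k in bit_list:
        if k == a:
            continue
        supernet.set_bit_width(k)
        log_pk = F.log_softmax(supernet(images) / T, dim=1)
        # Eq. (4): KL( sigma(f_k/T) || sigma(f_a/T) ), averaged over the batch.
        kl = (log_pk.exp() * (log_pk - log_pa)).sum(dim=1).mean()
        loss = loss + alphas[k] * kl
    return loss
```

Calling loss.backward() then propagates the gradients of every bit mode back into the shared fp32 parameters, matching the multi-task formulation of Eq. (2).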
Preferably, the step M4 includes:
in a forward computation stage of the neural network, the super network keeps a copy of the network parameters in single-precision floating point; the super network quantizes the single-precision floating-point parameters into low-bit-width integers, and the quantizer for bit width k is:

Q_k(r) = \mathrm{round}\!\left( \mathrm{clamp}(r, 0, 1) \cdot (2^k - 1) \right) / (2^k - 1)    (5)

wherein r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; round() denotes the rounding function;

during the training phase, the gradient of the quantizer is estimated with the straight-through estimator:

\partial Q_k(r) / \partial r \approx 1    (6)

wherein ∂ denotes the partial differential, Q_k denotes the quantized value after quantization, and r denotes the single-precision floating-point parameter to be quantized;

in the backward computation stage of the neural network, the super network is updated with the estimated gradient; the updated parameters are the single-precision floating-point parameters stored in the super network.
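For illustration only, a sketch of the preset quantizer of Eq. (5) together with the straight-through estimator of Eq. (6) as a torch.autograd.Function; the clamp range [0, 1] and the 2^k − 1 scaling are assumptions, since the text only names the clamp() and round() functions.

```python
import torch

class FakeQuantizeSTE(torch.autograd.Function):
    """k-bit quantizer (Eq. (5)) with a straight-through gradient (Eq. (6))."""

    @staticmethod
    def forward(ctx, r, k):
        r = torch.clamp(r, 0.0, 1.0)             # clamp(): truncation function
        scale = 2 ** k - 1
        return torch.round(r * scale) / scale     # round(): rounding function

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat dQ_k/dr as 1 and pass the gradient
        # through unchanged; the fp32 copy of the parameters is then updated.
        return grad_output, None

def quantize_ste(r, k):
    return FakeQuantizeSTE.apply(r, k)
```

During training the fp32 parameters are quantized in the forward pass and updated with the straight-through gradient in the backward pass, as described above.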
The system for the quantization neural network capable of switching the bit width on line provided by the invention comprises the following components:
module M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
module M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
module M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
module M4: extracting a quantization neural network of a target bit from the trained hyper-network by using a preset quantizer to carry out low bit reasoning;
the low-bit mode is the sub-neural-network mode whose bit width k is smaller than a preset value;
the high-bit mode is the sub-neural-network mode whose bit width k equals the preset value.
Preferably, said module M1 comprises: based on the characteristic that a high-bit neural network can be quantized into a low-bit neural network, integrating sub-neural networks with different bit widths into one super network; the super network and each sub-neural network of different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width is obtained by quantizing the sub-neural network of high bit width.
Preferably, the processing of each network middle layer feature in the module M2 by using the corresponding batch normalization layer includes:
\hat{f}_k = \gamma_k \frac{f_k - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k    (1)

wherein k denotes the bit width; f_k denotes the original feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the original feature f_k; ε denotes a small constant.
Preferably, said module M3 comprises:
the optimization task of the super network is based on multi-task learning, and the overall objective function optimized for the super network is:

L_{all} = \sum_{k \in \text{mode list}} \alpha_k L_k + \omega \|Q\|_2    (2)

wherein mode list denotes the set of all bit-width sub-neural networks contained in the super network, and k denotes a k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the weight of the sub-objective L_k in the overall objective function; Q denotes the parameters of the super network; \|\cdot\|_2 denotes the L2 norm; \|Q\|_2 denotes weight decay, which regularizes the super network; ω is the weight of the weight-decay term;

for each sub-optimization objective L_k: when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and L_a takes the following specific form:

L_a = \sum_{(x_i, y_i) \in (X, Y)} H(f_a(x_i), y_i)    (3)

wherein (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its image category label; f_a denotes the single-precision floating-point sub-neural network; H denotes the cross-entropy function;

when k < a, the super network runs with k-bit integers, and L_k takes the following form:

L_k = \sum_{x_i \in X} \mathrm{KL}\!\left( \sigma(f_k(x_i)/T) \,\|\, \sigma(f_a(x_i)/T) \right)    (4)

wherein KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output into probability values; T denotes a hyper-parameter used in computing the probability values, so as to achieve consistency alignment between the sub-neural networks of different bit widths.
Preferably, said module M4 comprises:
in a forward computation stage of the neural network, the super network keeps a copy of the network parameters in single-precision floating point; the super network quantizes the single-precision floating-point parameters into low-bit-width integers, and the quantizer for bit width k is:

Q_k(r) = \mathrm{round}\!\left( \mathrm{clamp}(r, 0, 1) \cdot (2^k - 1) \right) / (2^k - 1)    (5)

wherein r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; round() denotes the rounding function;

during the training phase, the gradient of the quantizer is estimated with the straight-through estimator:

\partial Q_k(r) / \partial r \approx 1    (6)

wherein ∂ denotes the partial differential, Q_k denotes the quantized value after quantization, and r denotes the single-precision floating-point parameter to be quantized;

in the backward computation stage of the neural network, the super network is updated with the estimated gradient; the updated parameters are the single-precision floating-point parameters stored in the super network.
Compared with the prior art, the invention has the following beneficial effects:
1. by designing a quantized neural network whose bit width can be switched online, the invention allows the deep neural network to be deployed directly at different bit widths without any additional training, which greatly facilitates the use and practical deployment of the model;
2. in terms of prediction accuracy, owing to the mutual promotion among the sub-networks inside the super network, the method can achieve higher prediction accuracy than a conventional quantized neural network;
3. on the large-scale ImageNet dataset, the design achieves an improvement of more than 1% in classification accuracy.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of a method for on-line switching of bit-width quantized neural networks;
FIG. 2 is a system schematic diagram of a quantized neural network capable of switching bit widths online.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications, which would be obvious to those skilled in the art, can be made without departing from the spirit of the invention, and all of them fall within the scope of the present invention.
Example 1
The method for the quantization neural network capable of switching the bit width on line provided by the invention comprises the following steps:
step M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
step M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
step M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
step M4: extracting a quantization neural network of a target bit from the trained hyper-network by using a preset quantizer to carry out low bit reasoning;
the low-bit mode is the sub-neural-network mode whose bit width k is smaller than a preset value;
the high-bit mode is the sub-neural-network mode whose bit width k equals the preset value.
Specifically, the step M1 includes: based on the characteristic that a high-bit neural network can be quantized into a low-bit neural network, integrating sub-neural networks with different bit widths into one super network; the super network and each sub-neural network of different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width is obtained by quantizing the sub-neural network of high bit width.
Specifically, the processing of each network middle layer feature in the step M2 by using the corresponding batch normalization layer includes:
\hat{f}_k = \gamma_k \frac{f_k - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k    (1)

wherein k denotes the bit width; f_k denotes the original feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the original feature f_k; ε denotes a small constant.
Specifically, the step M3 includes:
the optimization task of the super network is based on multi-task learning, and the overall objective function optimized for the super network is:

L_{all} = \sum_{k \in \text{mode list}} \alpha_k L_k + \omega \|Q\|_2    (2)

wherein mode list denotes the set of all bit-width sub-neural networks contained in the super network, and k denotes a k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the weight of the sub-objective L_k in the overall objective function; Q denotes the parameters of the super network; \|\cdot\|_2 denotes the L2 norm; \|Q\|_2 denotes weight decay, which regularizes the super network; ω is the weight of the weight-decay term;

for each sub-optimization objective L_k: when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and L_a takes the following specific form:

L_a = \sum_{(x_i, y_i) \in (X, Y)} H(f_a(x_i), y_i)    (3)

wherein (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its image category label; f_a denotes the single-precision floating-point sub-neural network; H denotes the cross-entropy function;

when k < a, the super network runs with k-bit integers, and L_k takes the following form:

L_k = \sum_{x_i \in X} \mathrm{KL}\!\left( \sigma(f_k(x_i)/T) \,\|\, \sigma(f_a(x_i)/T) \right)    (4)

wherein KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output into probability values; T denotes a hyper-parameter used in computing the probability values, so as to achieve consistency alignment between the sub-neural networks of different bit widths.
Specifically, the step M4 includes:
in a forward computation stage of the neural network, the super network keeps a copy of the network parameters in single-precision floating point; the super network quantizes the single-precision floating-point parameters into low-bit-width integers, and the quantizer for bit width k is:

Q_k(r) = \mathrm{round}\!\left( \mathrm{clamp}(r, 0, 1) \cdot (2^k - 1) \right) / (2^k - 1)    (5)

wherein r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; round() denotes the rounding function;

during the training phase, the gradient of the quantizer is estimated with the straight-through estimator:

\partial Q_k(r) / \partial r \approx 1    (6)

wherein ∂ denotes the partial differential, Q_k denotes the quantized value after quantization, and r denotes the single-precision floating-point parameter to be quantized;

in the backward computation stage of the neural network, the super network is updated with the estimated gradient; the updated parameters are the single-precision floating-point parameters stored in the super network.
The system for the quantization neural network capable of switching the bit width on line provided by the invention comprises the following components:
module M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
module M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
module M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
module M4: extracting a quantization neural network of a target bit from the trained hyper-network by using a preset quantizer to carry out low bit reasoning;
the low-bit mode is the sub-neural-network mode whose bit width k is smaller than a preset value;
the high-bit mode is the sub-neural-network mode whose bit width k equals the preset value.
Specifically, the module M1 includes: based on the characteristic that a high-bit neural network can be quantized into a low-bit neural network, integrating sub-neural networks with different bit widths into one super network; the super network and each sub-neural network of different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width is obtained by quantizing the sub-neural network of high bit width.
Specifically, the processing of each network middle layer feature in the module M2 by using a corresponding batch normalization layer includes:
\hat{f}_k = \gamma_k \frac{f_k - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k    (1)

wherein k denotes the bit width; f_k denotes the original feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the original feature f_k; ε denotes a small constant.
Specifically, the module M3 includes:
the optimization task of the super network is based on multi-task learning, and the overall objective function optimized for the super network is:

L_{all} = \sum_{k \in \text{mode list}} \alpha_k L_k + \omega \|Q\|_2    (2)

wherein mode list denotes the set of all bit-width sub-neural networks contained in the super network, and k denotes a k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the weight of the sub-objective L_k in the overall objective function; Q denotes the parameters of the super network; \|\cdot\|_2 denotes the L2 norm; \|Q\|_2 denotes weight decay, which regularizes the super network; ω is the weight of the weight-decay term;

for each sub-optimization objective L_k: when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and L_a takes the following specific form:

L_a = \sum_{(x_i, y_i) \in (X, Y)} H(f_a(x_i), y_i)    (3)

wherein (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its image category label; f_a denotes the single-precision floating-point sub-neural network; H denotes the cross-entropy function;

when k < a, the super network runs with k-bit integers, and L_k takes the following form:

L_k = \sum_{x_i \in X} \mathrm{KL}\!\left( \sigma(f_k(x_i)/T) \,\|\, \sigma(f_a(x_i)/T) \right)    (4)

wherein KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output into probability values; T denotes a hyper-parameter used in computing the probability values, so as to achieve consistency alignment between the sub-neural networks of different bit widths.
Specifically, the module M4 includes:
in a forward computation stage of the neural network, the super network keeps a copy of the network parameters in single-precision floating point; the super network quantizes the single-precision floating-point parameters into low-bit-width integers, and the quantizer for bit width k is:

Q_k(r) = \mathrm{round}\!\left( \mathrm{clamp}(r, 0, 1) \cdot (2^k - 1) \right) / (2^k - 1)    (5)

wherein r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; round() denotes the rounding function;

during the training phase, the gradient of the quantizer is estimated with the straight-through estimator:

\partial Q_k(r) / \partial r \approx 1    (6)

wherein ∂ denotes the partial differential, Q_k denotes the quantized value after quantization, and r denotes the single-precision floating-point parameter to be quantized;

in the backward computation stage of the neural network, the super network is updated with the estimated gradient; the updated parameters are the single-precision floating-point parameters stored in the super network.
Example 2
Example 2 is a modification of example 1
In view of the defects in the prior art, an object of the present invention is to provide a quantization neural network capable of switching bit widths online, so that the deep neural network can be deployed with different bit widths without any additional training.
FIG. 1 is a flowchart of the present invention for online switching of bit-width quantized neural networks. The method integrates neural networks with different bit widths into the same network structure by constructing a super network, and processes the features with separate batch normalization layers for different bit widths to ensure network convergence. Through the consistency loss function step, the consistency between the low-bit mode and the high-bit mode is constrained in the training stage, reducing the error caused by quantization. The super network is optimized with quantization-aware training, and after the super network converges, fast inference and model compression are realized at any bit width.
A quantized neural network capable of switching bit widths online, comprising:
building a hyper network: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
bit width exclusive batch normalization step: processing the features obtained under different bit widths by adopting an independent batch normalization layer;
consistency loss function step: in the training stage, the consistency between the low-bit mode and the high-bit mode is constrained, reducing the error caused by quantization;
quantization-aware training step: quantization noise is simulated in the network training stage, so that the model achieves higher classification accuracy at low bit widths;
low-bit inference step: after training is finished, a low-bit network model is derived with a predefined quantizer, realizing fast inference and model compression.
The super-network building step, wherein deep neural networks with different bit widths are integrated into the same network structure by means of a super network.
By exploiting the characteristic that a high-bit neural network can be quantized into a lower-bit neural network, sub-neural networks with different bit widths can be integrated into one super network. The smaller the bit width, the faster the inference speed of the model. The super-network model and its sub-networks have the same network topology, and all networks share all convolution kernels and fully-connected-layer parameters, which reduces the parameter count of the super network. By means of a predefined quantizer, sub-networks of lower bit width can be obtained directly by quantizing the network of higher bit width.
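For illustration only, the online switching itself can be as simple as walking the trained super network and selecting a bit width; the set_bit_width hook is the one assumed in the sketches of the summary section, and no retraining is involved.

```python
import torch.nn as nn

def switch_bit_width(supernet: nn.Module, k: int):
    """Switch every bit-width-aware module of the super network to bit width k.

    The shared fp32 weights are simply re-quantized on the next forward pass
    and the k-bit batch-normalization statistics are selected; nothing is
    retrained.
    """
    for m in supernet.modules():
        if hasattr(m, "set_bit_width"):
            m.set_bit_width(k)

# Example use on an already trained super network:
#   switch_bit_width(supernet, 8); logits_8bit = supernet(images)
#   switch_bit_width(supernet, 4); logits_4bit = supernet(images)
```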
The bit width exclusive batch normalization step, wherein: for the intermediate layer characteristics generated under different bit widths, an independent batch normalization layer is used for processing, so that the characteristics in different distributions are not influenced by each other.
The bit width exclusive batch normalization step specifically comprises the following steps:
the batch normalization layer is as follows:
\hat{f} = \gamma \frac{f - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

where f is the raw feature to be processed, γ and β are learnable parameters, and ε is a small constant, typically set to 1e-5, to keep the denominator from becoming too small when σ is small. μ and σ² are the mean and variance of the original feature f.
In the training phase, the mean and variance are statistically derived for the current batch of training samples.
In the test phase, the mean and variance are not affected by the test samples, but are determined by the running mean of the training phase.
The bit width dedicated batch normalization layer comprises:
\hat{f}_k = \gamma_k \frac{f_k - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k

where k is the bit width, f_k is the raw feature produced by the k-bit sub-network to be processed, and γ_k and β_k are learnable parameters specific to the k-bit sub-network, allowing the channel features of sub-networks of different bit widths to be affine-transformed flexibly. μ_k and σ_k² are the mean and variance of the original feature f_k, kept separately so that sub-networks with different feature distributions do not interfere with each other.
The consistency-loss-function-based step, wherein: in the model training phase, each bit-width sub-network is optimized, and the prediction consistency between the low-bit-width and high-bit-width sub-networks is constrained during optimization.
The consistency loss function-based steps are as follows:
the optimization task of the hyper-network is based on multi-task learning, and the optimized overall objective function is as follows:
L_{all} = \sum_{k \in \text{mode list}} \alpha_k L_k + \omega \|Q\|_2

where mode list is the set of all bit widths contained in the super network and k is a k-bit-wide sub-network. L_k is the sub-optimization objective corresponding to the k-bit-wide sub-network, and α_k is the weight of L_k in the overall objective function. Q denotes the parameters of the super network, \|\cdot\|_2 is the L2 norm, and \|Q\|_2 represents weight decay, whose purpose is to regularize the super network. ω is the weight of the weight-decay term.

For each sub-optimization objective L_k: when k = 32, the super network runs with single-precision floating-point numbers, and L_32 takes the following specific form:

L_{32} = \sum_{(x_i, y_i) \in (X, Y)} H(f_{32}(x_i), y_i)

where (X, Y) represents the images and labels of the dataset, x_i represents an image and y_i its class label, f_32 represents the single-precision floating-point sub-neural network, and H represents the cross-entropy function.

When k < 32, the super network runs with k-bit integers, and L_k takes the following specific form:

L_k = \sum_{x_i \in X} \mathrm{KL}\!\left( \sigma(f_k(x_i)/T) \,\|\, \sigma(f_{32}(x_i)/T) \right)
where KL is the KL divergence between the two distributions and σ is the softmax function, with the aim of mapping the output of the network to probability values. T is a hyper-parameter for calculating probability values, and aims to play a role in softening the output probability values so as to better realize the consistency alignment among different sub-networks.
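For illustration only, a tiny numerical example of the softening role of T (the logits below are made-up values, not data from the disclosure): a larger temperature flattens the softmax output, so small logit differences between the low-bit and high-bit sub-networks contribute less sharply to the KL term.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, 0.5])        # made-up network output
for T in (1.0, 4.0):
    print(T, F.softmax(logits / T, dim=0).tolist())
# T=1.0 -> roughly [0.93, 0.05, 0.03]: a sharp, almost one-hot distribution.
# T=4.0 -> roughly [0.53, 0.25, 0.22]: a much softer distribution, which is
# easier to align between sub-networks of different bit widths.
```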
The quantization-aware training step, wherein quantization noise is simulated in the network training stage so that the model achieves higher classification accuracy at low bit widths.
The quantization-aware training step specifically comprises the following steps:
in the network forward propagation stage, the super network can keep a copy of network parameters of single-precision floating point numbers. In order to realize efficient matrix operation, the super network quantizes parameters of single-precision floating point numbers into integer numbers with low bit width, and a quantizer corresponding to the k bit width is as follows:
Q_k(r) = \mathrm{round}\!\left( \mathrm{clamp}(r, 0, 1) \cdot (2^k - 1) \right) / (2^k - 1)

where r represents the single-precision floating-point parameter to be quantized, k represents the bit width, clamp() represents the truncation function, and round() represents the rounding function.
Since the quantizer is not differentiable, the gradient of the quantizer is estimated using the straight-through-estimator during the training phase:
\partial Q_k(r) / \partial r \approx 1

where ∂ represents the partial differential, Q_k represents the quantized value after quantization, and r represents the single-precision floating-point parameter to be quantized.
In the network back propagation stage, the model is updated by the estimated gradient. The updated parameters are parameters of single-precision floating point numbers stored in the hyper-network.
The low-bit inference step, wherein, after training is finished, a quantized neural network of the target bit width is extracted from the super network with a predefined quantizer so as to carry out low-bit inference.
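For illustration only, one possible way (an assumption, not a procedure stated in the disclosure) to extract a fixed target-bit network from the trained super network for deployment, reusing the fake_quantize and switch_bit_width helpers assumed in the earlier sketches:

```python
import copy
import torch

@torch.no_grad()
def extract_k_bit_network(supernet, k):
    """Derive a k-bit inference copy of the trained super network.

    The shared fp32 parameters are quantized once with the predefined
    quantizer and the k-bit batch-norm branch is selected, so the copy can
    run low-bit inference without any retraining.
    """
    model_k = copy.deepcopy(supernet)
    switch_bit_width(model_k, k)                        # select k-bit BN statistics
    for m in model_k.modules():
        if getattr(m, "active_bit", 32) < 32 and hasattr(m, "weight"):
            m.weight.copy_(fake_quantize(m.weight, k))  # freeze quantized weights
    return model_k
```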
In summary, the invention provides a quantized neural network whose bit width can be switched online. By building a super network, neural networks with different bit widths are integrated into the same network structure. With bit-width-exclusive batch normalization, features produced under different bit widths are processed by separate batch normalization layers. A consistency loss function constrains the consistency between the low-bit mode and the high-bit mode in the training stage and reduces the error caused by quantization. Quantization-aware training simulates quantization noise in the network training stage, so that the model achieves higher classification accuracy at low bit widths. After training is finished, a network model of a specific low bit width is derived with the predefined quantizer, realizing fast inference and model compression. The invention enables the bit width of the neural network to be switched at will without retraining, so as to adapt to different hardware deployment environments.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A method for on-line switching of bit-width quantized neural networks, comprising:
step M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
step M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
step M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
step M4: extracting a quantization neural network of a target bit from the trained hyper-network by using a preset quantizer to carry out low bit reasoning;
the low-bit mode is the sub-neural-network mode whose bit width k is smaller than a preset value;
the high-bit mode is the sub-neural-network mode whose bit width k equals the preset value.
2. The method for on-line switching of a bit-width quantized neural network according to claim 1, wherein said step M1 comprises: based on the characteristic that a high-bit neural network can be quantized into a low-bit neural network, integrating sub-neural networks with different bit widths into one super network; the super network and each sub-neural network of different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width is obtained by quantizing the sub-neural network of high bit width.
3. The method of claim 1, wherein the processing of each network middle layer feature in the step M2 by using a corresponding batch normalization layer comprises:
\hat{f}_k = \gamma_k \frac{f_k - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k    (1)

wherein k denotes the bit width; f_k denotes the original feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the original feature f_k; ε denotes a small constant.
4. The method for on-line switching of a bit-width quantized neural network according to claim 1, wherein said step M3 comprises:
the optimization task of the super network is based on multi-task learning, and the overall objective function optimized for the super network is:

L_{all} = \sum_{k \in \text{mode list}} \alpha_k L_k + \omega \|Q\|_2    (2)

wherein mode list denotes the set of all bit-width sub-neural networks contained in the super network, and k denotes a k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the weight of the sub-objective L_k in the overall objective function; Q denotes the parameters of the super network; \|\cdot\|_2 denotes the L2 norm; \|Q\|_2 denotes weight decay, which regularizes the super network; ω is the weight of the weight-decay term;

for each sub-optimization objective L_k: when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and L_a takes the following specific form:

L_a = \sum_{(x_i, y_i) \in (X, Y)} H(f_a(x_i), y_i)    (3)

wherein (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its image category label; f_a denotes the single-precision floating-point sub-neural network; H denotes the cross-entropy function;

when k < a, the super network runs with k-bit integers, and L_k takes the following form:

L_k = \sum_{x_i \in X} \mathrm{KL}\!\left( \sigma(f_k(x_i)/T) \,\|\, \sigma(f_a(x_i)/T) \right)    (4)

wherein KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output into probability values; T denotes a hyper-parameter used in computing the probability values, so as to achieve consistency alignment between the sub-neural networks of different bit widths.
5. The method for on-line switching of a bit-width quantized neural network according to claim 4, wherein said step M4 comprises:
in a forward computation stage of the neural network, the super network keeps a copy of the network parameters in single-precision floating point; the super network quantizes the single-precision floating-point parameters into low-bit-width integers, and the quantizer for bit width k is:

Q_k(r) = \mathrm{round}\!\left( \mathrm{clamp}(r, 0, 1) \cdot (2^k - 1) \right) / (2^k - 1)    (5)

wherein r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; round() denotes the rounding function;

during the training phase, the gradient of the quantizer is estimated with the straight-through estimator:

\partial Q_k(r) / \partial r \approx 1    (6)

wherein ∂ denotes the partial differential, Q_k denotes the quantized value after quantization, and r denotes the single-precision floating-point parameter to be quantized;

in the backward computation stage of the neural network, the super network is updated with the estimated gradient; the updated parameters are the single-precision floating-point parameters stored in the super network.
6. A system for on-line switching of bit-width quantized neural networks, comprising:
module M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
module M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
module M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
module M4: extracting a quantization neural network of a target bit from the trained hyper-network by using a preset quantizer to carry out low bit reasoning;
the low-bit mode is the sub-neural-network mode whose bit width k is smaller than a preset value;
the high-bit mode is the sub-neural-network mode whose bit width k equals the preset value.
7. The system for on-line switching of a bit-width quantized neural network according to claim 6, wherein said module M1 comprises: based on the characteristic that a high-bit neural network can be quantized into a low-bit neural network, integrating sub-neural networks with different bit widths into one super network; the super network and each sub-neural network of different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width is obtained by quantizing the sub-neural network of high bit width.
8. The system of claim 6, wherein each network middle layer feature in the module M2 is processed by a corresponding batch normalization layer, and comprises:
\hat{f}_k = \gamma_k \frac{f_k - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k    (1)

wherein k denotes the bit width; f_k denotes the original feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the original feature f_k; ε denotes a small constant.
9. The system of on-line switchable bit-width quantized neural network of claim 6, wherein said module M3 comprises:
the optimization task of the super network is based on multi-task learning, and the overall objective function optimized for the super network is:

L_{all} = \sum_{k \in \text{mode list}} \alpha_k L_k + \omega \|Q\|_2    (2)

wherein mode list denotes the set of all bit-width sub-neural networks contained in the super network, and k denotes a k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the weight of the sub-objective L_k in the overall objective function; Q denotes the parameters of the super network; \|\cdot\|_2 denotes the L2 norm; \|Q\|_2 denotes weight decay, which regularizes the super network; ω is the weight of the weight-decay term;

for each sub-optimization objective L_k: when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and L_a takes the following specific form:

L_a = \sum_{(x_i, y_i) \in (X, Y)} H(f_a(x_i), y_i)    (3)

wherein (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its image category label; f_a denotes the single-precision floating-point sub-neural network; H denotes the cross-entropy function;

when k < a, the super network runs with k-bit integers, and L_k takes the following form:

L_k = \sum_{x_i \in X} \mathrm{KL}\!\left( \sigma(f_k(x_i)/T) \,\|\, \sigma(f_a(x_i)/T) \right)    (4)

wherein KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output into probability values; T denotes a hyper-parameter used in computing the probability values, so as to achieve consistency alignment between the sub-neural networks of different bit widths.
10. The system of on-line switchable bit-width quantized neural network of claim 9, wherein the module M4 comprises:
in a forward computation stage of the neural network, the super network keeps a copy of the network parameters in single-precision floating point; the super network quantizes the single-precision floating-point parameters into low-bit-width integers, and the quantizer for bit width k is:

Q_k(r) = \mathrm{round}\!\left( \mathrm{clamp}(r, 0, 1) \cdot (2^k - 1) \right) / (2^k - 1)    (5)

wherein r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; round() denotes the rounding function;

during the training phase, the gradient of the quantizer is estimated with the straight-through estimator:

\partial Q_k(r) / \partial r \approx 1    (6)

wherein ∂ denotes the partial differential, Q_k denotes the quantized value after quantization, and r denotes the single-precision floating-point parameter to be quantized;

in the backward computation stage of the neural network, the super network is updated with the estimated gradient; the updated parameters are the single-precision floating-point parameters stored in the super network.
CN202010929604.XA 2020-09-07 2020-09-07 Method and system for on-line switching bit width quantization neural network Pending CN112101524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010929604.XA CN112101524A (en) 2020-09-07 2020-09-07 Method and system for on-line switching bit width quantization neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010929604.XA CN112101524A (en) 2020-09-07 2020-09-07 Method and system for on-line switching bit width quantization neural network

Publications (1)

Publication Number Publication Date
CN112101524A true CN112101524A (en) 2020-12-18

Family

ID=73750792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010929604.XA Pending CN112101524A (en) 2020-09-07 2020-09-07 Method and system for on-line switching bit width quantization neural network

Country Status (1)

Country Link
CN (1) CN112101524A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926570A (en) * 2021-03-26 2021-06-08 上海交通大学 Adaptive bit network quantization method, system and image processing method
CN113434750A (en) * 2021-06-30 2021-09-24 北京市商汤科技开发有限公司 Neural network search method, apparatus, device, storage medium, and program product
WO2022222649A1 (en) * 2021-04-23 2022-10-27 Oppo广东移动通信有限公司 Neural network model training method and apparatus, device, and storage medium
WO2023015674A1 (en) * 2021-08-12 2023-02-16 北京交通大学 Multi-bit-width quantization method for deep convolutional neural network
CN117709409A (en) * 2023-05-09 2024-03-15 荣耀终端有限公司 Neural network training method applied to image processing and related equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190050710A1 (en) * 2017-08-14 2019-02-14 Midea Group Co., Ltd. Adaptive bit-width reduction for neural networks
CN110555450A (en) * 2018-05-31 2019-12-10 北京深鉴智能科技有限公司 Face recognition neural network adjusting method and device
US20200202213A1 (en) * 2018-12-19 2020-06-25 Microsoft Technology Licensing, Llc Scaled learning for training dnn
CN110363281A (en) * 2019-06-06 2019-10-22 上海交通大学 A kind of convolutional neural networks quantization method, device, computer and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KUNYUAN DU ET AL.: "From Quantized DNNs to Quantizable DNNs", arXiv, pages 2-3 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination