CN112101524A - Method and system for on-line switching bit width quantization neural network - Google Patents
- Publication number: CN112101524A (application CN202010929604.XA)
- Authority: CN (China)
- Prior art keywords: network, bit, sub, neural network, super
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
Abstract
The invention provides a method and a system for a quantized neural network whose bit width can be switched online, comprising the following steps: integrating deep neural networks with different bit widths into a super network, where the networks of all bit widths share the same network architecture; running the super network at different bit widths, obtaining the corresponding network intermediate-layer features for each bit width, and processing each intermediate-layer feature with its own batch normalization layer; training the super network through supervised learning, simulating quantization noise in the super network training stage until the consistency loss function between the low-bit mode and the high-bit mode converges, to obtain the trained super network; and extracting a quantized neural network of the target bit width from the trained super network with a preset quantizer to perform low-bit inference. The invention enables a neural network to switch its bit width at will without retraining, so as to adapt to different hardware deployment environments.
Description
Technical Field
The invention relates to the fields of computer vision and image processing, and in particular to a method and a system for a quantized neural network whose bit width can be switched online.
Background
As Deep Neural Networks (DNNs) grow increasingly complex, deploying them often poses great challenges. Model compression and model acceleration have therefore received increasing attention in machine learning. One important research direction is to quantize the deep neural network, i.e., to quantize model weights and intermediate-layer activations to smaller bit widths. Owing to the reduced bit width, a quantized deep neural network has a smaller model size and can perform fast inference using efficient fixed-point computation. However, when the bit width is reduced to 4 bits or fewer, a significant loss of accuracy is incurred. To alleviate this problem, quantization-aware training is commonly employed to restore model accuracy: it simulates the quantization effect of a given bit width during training so that the model can adapt to quantization noise in the training stage. In real-world scenarios, different devices may support different bit widths. For example, the Tesla T4 supports 4, 8, 16 and 32 bits, while the Watt A1 supports 1, 2, 3 and 4 bits. Deploying a deep neural network at each bit width via separate quantization-aware training would require a large amount of training time and resources, making deployment inconvenient. The invention provides a quantized neural network whose bit width can be switched online, so that a deep neural network can be deployed at different bit widths without extra training.
At the present stage, quantization algorithms that require no additional training can be divided, according to whether original data are required, into data-free quantization and calibration-based quantization. In data-free quantization, the quantization process does not require any original dataset; algorithms of this type adjust the quantization scheme according to the characteristics of the model itself and modify the batch normalization layers of the neural network. In calibration-based quantization, a portion of the images is collected as a calibration dataset. By feeding the calibration dataset into the deep neural network to be quantized, the activation distribution of each layer in the network can be collected, and by computing the KL divergence between the activation distributions before and after quantization, better network quantization can be achieved.
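The calibration-based procedure above can be illustrated with a small numerical sketch (not the patent's implementation): build an activation histogram from calibration data, simulate the information loss of quantizing to fewer levels by coarsening the histogram, and compare the KL divergence at different bit widths. The function names and the coarsening scheme are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two histograms, normalized to distributions."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def coarsen_histogram(hist, num_levels):
    """Simulate quantization to `num_levels` values: merge fine bins into
    coarse bins, then spread each coarse bin back uniformly."""
    per = len(hist) // num_levels
    coarse = hist[:per * num_levels].reshape(num_levels, per).sum(axis=1)
    return np.repeat(coarse / per, per)

# Activation statistics gathered from (synthetic) calibration data
rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=100_000))
hist, _ = np.histogram(acts, bins=2048)
hist = hist.astype(np.float64)

d8 = kl_divergence(hist, coarsen_histogram(hist, 256))  # 8-bit: 256 levels
d4 = kl_divergence(hist, coarsen_histogram(hist, 16))   # 4-bit: 16 levels
assert d8 > 0 and d4 > d8  # fewer levels distort the distribution more
```

In practice the same divergence measurement is used per layer to pick clipping thresholds, but the core comparison is the one shown above.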
Patent document CN109961141A (application number: 201910288941.2) discloses a method and apparatus for generating a quantized neural network. One embodiment of the method comprises: acquiring a training sample set and an initial neural network; converting the original floating-point network parameters in the initial neural network into integer network parameters; generating a quantized initial neural network based on the converted integer network parameters; selecting training samples from the training sample set, and executing the following training steps: taking the sample information in a training sample as the input of the quantized initial neural network, taking the sample result in the training sample as the expected output of the quantized initial neural network, and training the quantized initial neural network; in response to determining that training of the quantized initial neural network is complete, generating a quantized neural network based on the trained quantized initial neural network. However, this method can only quantize the neural network to 8 bits, and it cannot change the bit width flexibly while the network is running.
Disclosure of Invention
In view of the defects in the prior art, an object of the present invention is to provide a method and a system for a quantized neural network whose bit width can be switched online.
The method provided by the invention for a quantized neural network with an online-switchable bit width comprises the following steps:
step M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
step M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
step M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
step M4: extracting a quantized neural network of the target bit width from the trained super network with a preset quantizer to perform low-bit inference;
the low-bit mode is the mode of a sub-neural network whose bit width k is smaller than a preset value;
the high-bit mode is the mode of the sub-neural network whose bit width k equals the preset value.
Preferably, the step M1 includes: obtaining a low-bit neural network by quantizing a high-bit neural network, and integrating sub-neural networks of different bit widths into a super network; the super network and each sub-neural network of a different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width can be obtained by quantizing the sub-neural network of high bit width.
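The shared-parameter idea in step M1 can be sketched as follows: the super network keeps a single set of floating-point weights, and each k-bit sub-network is the same tensor passed through a k-bit quantizer. The symmetric uniform quantizer below is an illustrative assumption, not necessarily the patent's exact formula.

```python
import numpy as np

def quantize_weights(w, k):
    """Symmetric uniform k-bit quantizer (illustrative): map floats onto
    2^(k-1)-1 integer levels per sign, then de-quantize back to floats."""
    levels = 2 ** (k - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(np.clip(w / scale, -levels, levels)) * scale

rng = np.random.default_rng(1)
shared_w = rng.normal(size=(64, 64))  # the one shared float copy of the weights

# Each sub-network reuses the same tensor; only the bit width differs.
w8, w4, w2 = (quantize_weights(shared_w, k) for k in (8, 4, 2))

def mean_err(w):
    return float(np.abs(w - shared_w).mean())

assert mean_err(w8) < mean_err(w4) < mean_err(w2)  # lower bits, larger error
```

Because every bit width reads from the same master tensor, no per-bit-width copies of the convolutional or fully-connected parameters are stored.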
Preferably, processing each network intermediate-layer feature in the step M2 with its corresponding batch normalization layer is performed as:

f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k    (1)

where k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote the learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; and ε denotes a small constant.
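The bit-width-exclusive batch normalization can be sketched as one (γ_k, β_k) pair per supported bit width, selected at call time. This is a minimal numpy sketch under that assumption; a real implementation would also track running statistics for inference.

```python
import numpy as np

class SwitchableBatchNorm:
    """One (gamma_k, beta_k) pair per supported bit width; the k-bit
    sub-network's features are normalized with its own parameters."""
    def __init__(self, num_features, bit_widths, eps=1e-5):
        self.eps = eps
        self.params = {k: {"gamma": np.ones(num_features),
                           "beta": np.zeros(num_features)}
                       for k in bit_widths}

    def __call__(self, f_k, k):
        p = self.params[k]  # parameters exclusive to bit width k
        mu, var = f_k.mean(axis=0), f_k.var(axis=0)
        return p["gamma"] * (f_k - mu) / np.sqrt(var + self.eps) + p["beta"]

bn = SwitchableBatchNorm(num_features=4, bit_widths=(2, 4, 8, 32))
x = np.random.default_rng(2).normal(loc=5.0, scale=3.0, size=(256, 4))
y = bn(x, k=4)
assert np.allclose(y.mean(axis=0), 0.0, atol=1e-7)  # normalized per feature
assert np.allclose(y.std(axis=0), 1.0, atol=1e-2)
```

Keeping statistics separate per bit width matters because quantization shifts the feature distribution, so a single shared normalization layer would mis-normalize some bit widths.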
Preferably, the step M3 includes:
the optimization task of the super network is based on multi-task learning and an optimized super network overall objective function, and the formula is as follows:
Lall=∑k∈mode listαkLk+ω||Q||2 (2)
wherein, mode list represents the set of all bit-wide sub-neural networks contained in the super network, and k represents a k-bit-wide sub-neural network; l iskRepresenting a consistency loss function corresponding to the k bit wide sub-neural network; alpha is alphakRepresenting a sub-optimization goal LkThe proportion occupied in the overall objective function; q represents a parameter of the hyper-network; | | non-woven hair2To representAn L2 norm; | Q | non-conducting phosphor2Representing parameter attenuation, and regularizing the super network; omega is the weight of parameter attenuation;
For each sub-optimization objective L_k, when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and the specific form of L_a is:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)    (3)

where (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its category label; f_a denotes the single-precision floating-point sub-neural network; and H denotes the cross-entropy function;
For each sub-optimization objective L_k, when k < a, the super network runs with k-bit integers, and L_k is expressed as:

L_k = Σ_{x_i ∈ X} KL( σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T) )    (4)

where KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a hyper-parameter used in computing the probability values, thereby achieving consistency alignment between the sub-neural networks of different bit widths.
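The consistency loss can be sketched as a temperature-softened KL divergence between the low-bit and full-precision output distributions. The KL direction and the temperature handling here are assumptions made for illustration.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_low, logits_high, T=2.0):
    """Mean KL divergence between the temperature-softened output
    distributions of the low-bit and high-bit sub-networks."""
    p = softmax(logits_low, T)
    q = softmax(logits_high, T)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

rng = np.random.default_rng(3)
logits_fp = rng.normal(size=(8, 10))              # full-precision outputs
near = logits_fp + 0.1 * rng.normal(size=(8, 10)) # mild quantization noise
far = logits_fp + 2.0 * rng.normal(size=(8, 10))  # heavy quantization noise

assert consistency_loss(logits_fp, logits_fp) < 1e-12  # identical -> zero loss
assert consistency_loss(near, logits_fp) < consistency_loss(far, logits_fp)
```

Minimizing this term pulls each low-bit sub-network's predictions toward the full-precision sub-network's predictions, which is what keeps accuracy usable after a bit-width switch.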
Preferably, the step M4 includes:
in the forward computation stage of the neural network, the super network keeps one copy of the single-precision floating-point network parameters and quantizes them into low-bit-width integers; the quantizer for bit width k is:

Q_k(r) = round(clamp(r, 0, 1) · (2^k − 1)) / (2^k − 1)    (5)

where r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; and round() denotes the rounding function;
in the training stage, the gradient of the quantizer is estimated with the straight-through estimator, which treats ∂Q_k/∂r as 1:

∂L/∂r = ∂L/∂Q_k    (6)

where ∂ denotes the partial differential, Q_k denotes the quantized value, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated with the gradients obtained by estimation; the updated parameters are the single-precision floating-point parameters stored in the super network.
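Step M4's forward quantization and straight-through backward pass can be sketched together. A DoReFa-style uniform quantizer on [0, 1] is assumed here; the patent's exact clamp range and scaling may differ.

```python
import numpy as np

def quantize_k(r, k):
    """Forward pass: clamp to [0, 1], scale to 2^k - 1 integer levels,
    round, and rescale back to floats (assumed DoReFa-style form)."""
    n = 2 ** k - 1
    return np.round(np.clip(r, 0.0, 1.0) * n) / n

def ste_grad(grad_output, r):
    """Backward pass: the straight-through estimator treats dQ_k/dr as 1
    inside the clamp range and 0 outside it."""
    return grad_output * ((r >= 0.0) & (r <= 1.0))

r = np.linspace(-0.2, 1.2, 8)       # float master weights to be quantized
q = quantize_k(r, k=2)              # 4 levels: 0, 1/3, 2/3, 1
assert set(np.round(q * 3).astype(int)) <= {0, 1, 2, 3}

g = ste_grad(np.ones_like(r), r)    # gradient of an all-ones upstream signal
assert g[0] == 0.0 and g[-1] == 0.0  # no gradient outside the clamp range
assert g[3] == 1.0                   # identity gradient inside it
```

The rounding step has zero gradient almost everywhere, which is why training needs the straight-through estimator: gradients flow to the float master copy as if the quantizer were the identity.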
The system provided by the invention for a quantized neural network with an online-switchable bit width comprises the following modules:
module M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
module M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
module M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
module M4: extracting a quantized neural network of the target bit width from the trained super network with a preset quantizer to perform low-bit inference;
the low-bit mode is the mode of a sub-neural network whose bit width k is smaller than a preset value;
the high-bit mode is the mode of the sub-neural network whose bit width k equals the preset value.
Preferably, said module M1 comprises: obtaining a low-bit neural network by quantizing a high-bit neural network, and integrating sub-neural networks of different bit widths into a super network; the super network and each sub-neural network of a different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width can be obtained by quantizing the sub-neural network of high bit width.
Preferably, processing each network intermediate-layer feature in the module M2 with its corresponding batch normalization layer is performed as:

f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k    (1)

where k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote the learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; and ε denotes a small constant.
Preferably, said module M3 comprises:
The optimization task of the super network is based on multi-task learning; the overall objective function of the super network to be optimized is:

L_all = Σ_{k ∈ mode list} α_k · L_k + ω‖Q‖²    (2)

where mode list denotes the set of sub-neural networks of all bit widths contained in the super network, and k denotes the k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the proportion of the sub-optimization objective L_k in the overall objective function; Q denotes the parameters of the super network; ‖·‖² denotes the L2 norm; ‖Q‖² denotes the parameter decay that regularizes the super network; and ω is the weight of the parameter decay;
For each sub-optimization objective L_k, when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and the specific form of L_a is:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)    (3)

where (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its category label; f_a denotes the single-precision floating-point sub-neural network; and H denotes the cross-entropy function;
For each sub-optimization objective L_k, when k < a, the super network runs with k-bit integers, and L_k is expressed as:

L_k = Σ_{x_i ∈ X} KL( σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T) )    (4)

where KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a hyper-parameter used in computing the probability values, thereby achieving consistency alignment between the sub-neural networks of different bit widths.
Preferably, said module M4 comprises:
in the forward computation stage of the neural network, the super network keeps one copy of the single-precision floating-point network parameters and quantizes them into low-bit-width integers; the quantizer for bit width k is:

Q_k(r) = round(clamp(r, 0, 1) · (2^k − 1)) / (2^k − 1)    (5)

where r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; and round() denotes the rounding function;
in the training stage, the gradient of the quantizer is estimated with the straight-through estimator, which treats ∂Q_k/∂r as 1:

∂L/∂r = ∂L/∂Q_k    (6)

where ∂ denotes the partial differential, Q_k denotes the quantized value, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated with the gradients obtained by estimation; the updated parameters are the single-precision floating-point parameters stored in the super network.
Compared with the prior art, the invention has the following beneficial effects:
1. By designing a quantized neural network whose bit width can be switched online, the invention allows a deep neural network to be deployed directly at different bit widths without any additional training, which greatly facilitates the use and deployment of the model;
2. In terms of prediction accuracy, owing to the mutual promotion among the sub-networks within the super network, the invention can achieve higher prediction accuracy than a conventional quantized neural network;
3. On the large-scale ImageNet dataset, this design improves classification accuracy by more than 1%.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flowchart of the method for a quantized neural network capable of switching bit widths online;
FIG. 2 is a system schematic diagram of the quantized neural network capable of switching bit widths online.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example 1
The method provided by the invention for a quantized neural network with an online-switchable bit width comprises the following steps:
step M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
step M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
step M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
step M4: extracting a quantized neural network of the target bit width from the trained super network with a preset quantizer to perform low-bit inference;
the low bit mode is a sub-neural network mode of bit width when k is smaller than a preset value;
the high bit pattern is a sub-neural network pattern of bit width when k is a preset value.
Specifically, the step M1 includes: obtaining a low-bit neural network by quantizing a high-bit neural network, and integrating sub-neural networks of different bit widths into a super network; the super network and each sub-neural network of a different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width can be obtained by quantizing the sub-neural network of high bit width.
Specifically, processing each network intermediate-layer feature in the step M2 with its corresponding batch normalization layer is performed as:

f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k    (1)

where k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote the learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; and ε denotes a small constant.
Specifically, the step M3 includes:
The optimization task of the super network is based on multi-task learning; the overall objective function of the super network to be optimized is:

L_all = Σ_{k ∈ mode list} α_k · L_k + ω‖Q‖²    (2)

where mode list denotes the set of sub-neural networks of all bit widths contained in the super network, and k denotes the k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the proportion of the sub-optimization objective L_k in the overall objective function; Q denotes the parameters of the super network; ‖·‖² denotes the L2 norm; ‖Q‖² denotes the parameter decay that regularizes the super network; and ω is the weight of the parameter decay;
For each sub-optimization objective L_k, when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and the specific form of L_a is:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)    (3)

where (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its category label; f_a denotes the single-precision floating-point sub-neural network; and H denotes the cross-entropy function;
For each sub-optimization objective L_k, when k < a, the super network runs with k-bit integers, and L_k is expressed as:

L_k = Σ_{x_i ∈ X} KL( σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T) )    (4)

where KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a hyper-parameter used in computing the probability values, thereby achieving consistency alignment between the sub-neural networks of different bit widths.
Specifically, the step M4 includes:
in the forward computation stage of the neural network, the super network keeps one copy of the single-precision floating-point network parameters and quantizes them into low-bit-width integers; the quantizer for bit width k is:

Q_k(r) = round(clamp(r, 0, 1) · (2^k − 1)) / (2^k − 1)    (5)

where r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; and round() denotes the rounding function;
in the training stage, the gradient of the quantizer is estimated with the straight-through estimator, which treats ∂Q_k/∂r as 1:

∂L/∂r = ∂L/∂Q_k    (6)

where ∂ denotes the partial differential, Q_k denotes the quantized value, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated with the gradients obtained by estimation; the updated parameters are the single-precision floating-point parameters stored in the super network.
The system provided by the invention for a quantized neural network with an online-switchable bit width comprises the following modules:
module M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
module M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
module M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
module M4: extracting a quantized neural network of the target bit width from the trained super network with a preset quantizer to perform low-bit inference;
the low-bit mode is the mode of a sub-neural network whose bit width k is smaller than a preset value;
the high-bit mode is the mode of the sub-neural network whose bit width k equals the preset value.
Specifically, the module M1 includes: obtaining a low-bit neural network by quantizing a high-bit neural network, and integrating sub-neural networks of different bit widths into a super network; the super network and each sub-neural network of a different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a sub-neural network of low bit width can be obtained by quantizing the sub-neural network of high bit width.
Specifically, processing each network intermediate-layer feature in the module M2 with its corresponding batch normalization layer is performed as:

f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k    (1)

where k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote the learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; and ε denotes a small constant.
Specifically, the module M3 includes:
The optimization task of the super network is based on multi-task learning; the overall objective function of the super network to be optimized is:

L_all = Σ_{k ∈ mode list} α_k · L_k + ω‖Q‖²    (2)

where mode list denotes the set of sub-neural networks of all bit widths contained in the super network, and k denotes the k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the proportion of the sub-optimization objective L_k in the overall objective function; Q denotes the parameters of the super network; ‖·‖² denotes the L2 norm; ‖Q‖² denotes the parameter decay that regularizes the super network; and ω is the weight of the parameter decay;
For each sub-optimization objective L_k, when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and the specific form of L_a is:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)    (3)

where (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i denotes its category label; f_a denotes the single-precision floating-point sub-neural network; and H denotes the cross-entropy function;
For each sub-optimization objective L_k, when k < a, the super network runs with k-bit integers, and L_k is expressed as:

L_k = Σ_{x_i ∈ X} KL( σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T) )    (4)

where KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a hyper-parameter used in computing the probability values, thereby achieving consistency alignment between the sub-neural networks of different bit widths.
Specifically, the module M4 includes:
in the forward computation stage of the neural network, the super network keeps one copy of the single-precision floating-point network parameters and quantizes them into low-bit-width integers; the quantizer for bit width k is:

Q_k(r) = round(clamp(r, 0, 1) · (2^k − 1)) / (2^k − 1)    (5)

where r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; and round() denotes the rounding function;
in the training stage, the gradient of the quantizer is estimated with the straight-through estimator, which treats ∂Q_k/∂r as 1:

∂L/∂r = ∂L/∂Q_k    (6)

where ∂ denotes the partial differential, Q_k denotes the quantized value, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated with the gradients obtained by estimation; the updated parameters are the single-precision floating-point parameters stored in the super network.
Example 2
Example 2 is a modification of Example 1.
In view of the defects in the prior art, an object of the present invention is to provide a quantization neural network capable of switching bit widths online, so that the deep neural network can be deployed with different bit widths without any additional training.
As shown in fig. 1, a flowchart of the present invention for switching bit-width quantized neural networks online: the method integrates neural networks with different bit widths into the same network structure by constructing a super network, and processes the features of different bit widths with separate batch normalization layers to ensure network convergence. Through the consistency-loss-function step, the consistency between the low-bit mode and the high-bit mode is constrained in the training stage, reducing the error caused by quantization. The super network is optimized with quantization-aware training; after the super network converges, fast inference and model compression are achieved at any bit width.
A quantized neural network capable of switching bit widths online, comprising:
building a hyper network: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
bit width exclusive batch normalization step: processing the features obtained under different bit widths by adopting an independent batch normalization layer;
consistency-loss-function step: in the training stage, the consistency between the low-bit mode and the high-bit mode is constrained, reducing the error caused by quantization;
quantization-aware training step: quantization noise is simulated in the network training stage, so that the model attains higher classification accuracy at low bit widths;
low-bit inference step: after training is finished, a low-bit network model is derived using the predefined quantizer, achieving fast inference and model compression.
The super-network building step, wherein: deep neural networks with different bit widths are integrated into the same network structure using a super network.
By utilizing the characteristic that the high-bit neural network can be quantized into a lower-bit neural network, the sub-neural networks with different bit widths can be integrated into a super network. The smaller the bit width is, the faster the inference speed of the model is. The super-network model and the sub-networks thereof have the same network topology structure, and simultaneously, all networks share all convolution kernels and full-connection layer parameters, so that the parameter quantity of the super-network is reduced. By means of a predefined quantizer, sub-networks of lower bit-widths can be directly quantized by the network of higher bit-widths.
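To make this parameter sharing concrete, the sketch below stores a single fp32 weight list and derives each bit width's integer weights on demand through a quantizer. The symmetric [-1, 1] clipping range and the (2^(k−1) − 1) scale are hypothetical choices, since this excerpt does not fix the quantizer's exact form.

```python
def derive_subnet_weights(fp32_weights, bit_widths):
    """Derive each bit width's integer weights from one shared fp32 copy.

    A minimal sketch: the super network stores fp32_weights once; every
    k-bit sub-network is just a quantized view of the same parameters,
    so no extra weight storage is needed per bit width. The symmetric
    [-1, 1] clipping range is an assumption.
    """
    def quantize(r, k):
        levels = (1 << (k - 1)) - 1                    # e.g. 127 for k = 8
        return round(max(-1.0, min(1.0, r)) * levels)  # clamp, scale, round

    # views are computed on demand from the single shared copy
    return {k: [quantize(w, k) for w in fp32_weights] for k in bit_widths}
```

Because every sub-network reads the same underlying fp32 tensor, switching bit width is a re-quantization, not a retraining.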
The bit width exclusive batch normalization step, wherein: for the intermediate layer characteristics generated under different bit widths, an independent batch normalization layer is used for processing, so that the characteristics in different distributions are not influenced by each other.
The bit width exclusive batch normalization step specifically comprises the following steps:
the batch normalization layer is as follows:

f̂ = γ · (f − μ) / √(σ² + ε) + β

where f is the raw feature to be processed, γ and β are learnable parameters, and ε is a small constant, typically set to 1e-5, to prevent the denominator √(σ² + ε) from being too small; μ and σ² are the mean and variance of the original feature f.
In the training phase, the mean and variance are statistically derived for the current batch of training samples.
In the test phase, the mean and variance are not affected by the test samples, but are determined by the running mean of the training phase.
The bit-width-dedicated batch normalization layer is as follows:

f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k

where k is the bit width, f_k is the raw feature produced by the k-bit sub-network to be processed, and γ_k and β_k are learnable parameters specific to the k-bit sub-network, allowing the channel features of sub-networks with different bit widths to be affine-transformed flexibly; μ_k and σ_k² are the mean and variance of the original feature f_k, kept separate so that sub-networks with different feature distributions do not interfere with each other.
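A minimal 1-D sketch of this bit-width-exclusive normalization: one parameter and statistics set per bit width, so a 4-bit forward pass never touches the 8-bit statistics. Only training-mode batch statistics are shown; the running averages used at test time are omitted.

```python
class BitWidthBatchNorm:
    """Per-bit-width batch normalization (training-mode sketch).

    Each bit width k owns its own (gamma_k, beta_k) so that features
    from differently quantized sub-networks do not contaminate each
    other's normalization.
    """
    def __init__(self, bit_widths, eps=1e-5):
        self.eps = eps
        # one independent parameter set per bit width
        self.params = {k: {"gamma": 1.0, "beta": 0.0} for k in bit_widths}

    def __call__(self, features, k):
        # statistics computed only from this bit width's features
        n = len(features)
        mu = sum(features) / n
        var = sum((f - mu) ** 2 for f in features) / n
        g, b = self.params[k]["gamma"], self.params[k]["beta"]
        return [g * (f - mu) / (var + self.eps) ** 0.5 + b for f in features]
```

In a real network this would be a per-channel layer with running statistics; the dictionary of parameter sets is the essential idea.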
The consistency-loss-function-based step, wherein: in the model training phase, the sub-network of each bit width is optimized, and the prediction consistency between low-bit-width and high-bit-width sub-networks is constrained during the optimization process.
The consistency loss function-based steps are as follows:
the optimization task of the super network is based on multi-task learning, and the optimized overall objective function is:

L_all = Σ_{k ∈ mode list} α_k L_k + ω||Q||₂

where mode list is the set of all bit widths contained in the super network and k denotes a k-bit-wide sub-network; L_k is the sub-optimization objective corresponding to the k-bit-wide sub-network, and α_k is the weight of L_k in the overall objective function; Q denotes the parameters of the super network, ||·||₂ is the L2 norm, and ||Q||₂ is a parameter-decay term whose purpose is to regularize the super network; ω is the weight of the parameter decay.
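The overall objective can be sketched as a weighted sum of per-bit-width losses plus a decay term. Whether ||Q||₂ denotes the norm or its square is not fixed by the text, so the squared form usual for weight decay is assumed here.

```python
def overall_objective(sub_losses, alphas, params, omega):
    """L_all = sum over k in mode list of alpha_k * L_k + omega * ||Q||^2.

    sub_losses and alphas map each bit width k in the mode list to its
    sub-objective L_k and weight alpha_k; params is the flat list of
    shared super-network parameters Q. The squared L2 decay is an
    assumed reading of the ||Q||_2 term.
    """
    task_term = sum(alphas[k] * sub_losses[k] for k in sub_losses)
    decay_term = omega * sum(q * q for q in params)  # omega * ||Q||^2
    return task_term + decay_term
```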
For each sub-optimization target L_k: when k = 32, the super network runs with single-precision floating-point numbers, and L_32 takes the following specific form:

L_32 = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_32(x_i), y_i)

where (X, Y) are the images and labels of the dataset, x_i is an image and y_i its class label, f_32 is the single-precision floating-point sub-neural network, and H is the cross-entropy function.
When k < 32, the super network runs with k-bit integers, and L_k takes the following specific form:

L_k = KL(σ(f_k(x_i)/T) ‖ σ(f_32(x_i)/T))

where KL is the KL divergence between the two distributions and σ is the softmax function, which maps the network output to probability values. T is a temperature hyper-parameter for computing the probability values; it softens the output probabilities so that consistency between different sub-networks can be aligned more effectively.
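A sketch of this temperature-softened consistency term for one sample follows. The argument order (low-bit distribution first) and the default T are assumptions; the excerpt names the two distributions but not the exact KL direction or temperature value.

```python
import math

def softened_softmax(logits, T):
    """Softmax with temperature T; larger T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def consistency_loss(low_bit_logits, high_bit_logits, T=2.0):
    """KL divergence between the softened low-bit and high-bit
    predictions for one sample: zero when the two sub-networks agree,
    positive when the low-bit mode drifts from the high-bit mode."""
    p = softened_softmax(low_bit_logits, T)   # low-bit mode sigma(f_k / T)
    q = softened_softmax(high_bit_logits, T)  # high-bit mode sigma(f_32 / T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```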
The quantization-aware training step, wherein: quantization noise is simulated in the network training stage, so that the model attains higher classification accuracy at low bit widths.
The quantization-aware training step specifically comprises the following:
in the network forward-propagation stage, the super network keeps a copy of the single-precision floating-point network parameters. To enable efficient matrix operations, the super network quantizes the single-precision floating-point parameters into low-bit-width integers; the quantizer corresponding to bit width k is as follows:

where r is the single-precision floating-point parameter to be quantized, k is the bit width, clamp() is the truncation function, and round() is the rounding function.
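The clamp-then-round quantizer can be sketched as below. The symmetric [-1, 1] clipping range and the (2^(k−1) − 1) scale are assumptions, since the excerpt names only the clamp() and round() steps.

```python
def quantize_k_bit(r, k):
    """Map a single-precision value r to a signed k-bit integer level.

    clamp() truncates r to an assumed [-1, 1] range; round() snaps the
    scaled value to the nearest of the symmetric integer levels.
    """
    levels = (1 << (k - 1)) - 1          # e.g. 127 for k = 8
    clamped = max(-1.0, min(1.0, r))     # the clamp() truncation step
    return round(clamped * levels)       # the round() step
```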
Since the quantizer is not differentiable, its gradient is estimated during the training phase using the straight-through estimator:

∂L/∂r = ∂L/∂Q_k

where ∂ denotes the partial derivative, Q_k is the quantized value after quantization, and r is the single-precision floating-point parameter to be quantized.
In the network back-propagation stage, the model is updated with the estimated gradients. The parameters updated are the single-precision floating-point parameters stored in the super network.
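The straight-through estimator and the resulting parameter update can be sketched as follows; the plain-SGD step and learning rate are illustrative, not part of the described method.

```python
def ste_grad(grad_wrt_quantized):
    """Straight-through estimator: treat the non-differentiable
    round() as identity in the backward pass, so the gradient with
    respect to r is taken to equal the gradient with respect to Q_k."""
    return grad_wrt_quantized

def update_fp32_param(r, grad_wrt_quantized, lr=0.1):
    """One illustrative SGD step: the forward pass used the quantized
    value Q_k(r), but the stored single-precision parameter r is what
    gets updated with the straight-through gradient."""
    return r - lr * ste_grad(grad_wrt_quantized)
```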
The low-bit inference step, wherein: after training is finished, a quantized neural network of the target bit width is extracted from the super network using the predefined quantizer so as to carry out low-bit inference.
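Exporting a deployable model then amounts to quantizing the shared fp32 weights once and recording a dequantization scale; the symmetric [-1, 1] quantizer here is again an assumed placeholder for the predefined quantizer.

```python
def export_low_bit_model(fp32_weights, target_bits):
    """Derive a target-bit-width integer model from the trained super
    network's fp32 weights, with no retraining. The recorded scale maps
    the integer weights back to real values at inference time."""
    levels = (1 << (target_bits - 1)) - 1
    return {
        "bits": target_bits,
        "scale": 1.0 / levels,   # dequantized value = integer * scale
        "weights": [round(max(-1.0, min(1.0, w)) * levels)
                    for w in fp32_weights],
    }
```

Switching deployment targets is then a matter of calling this once per desired bit width on the same trained weights.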
In summary, the present invention provides a quantized neural network capable of switching bit widths online. By building a super network, neural networks with different bit widths are integrated into the same network structure. With bit-width-exclusive batch normalization, a separate batch normalization layer processes the features of each bit width. A consistency loss function constrains the agreement between the low-bit and high-bit modes during training, reducing the error caused by quantization. Quantization-aware training simulates quantization noise during the network training stage, so that the model attains higher classification accuracy at low bit widths. After training is finished, a network model of a specific low bit width is derived with the predefined quantizer, achieving fast inference and model compression. The invention allows the bit width of the neural network to be switched at will, without retraining, to suit different hardware deployment environments.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (10)
1. A method for on-line switching of bit-width quantized neural networks, comprising:
step M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
step M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
step M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
step M4: extracting a quantization neural network of a target bit from the trained hyper-network by using a preset quantizer to carry out low bit reasoning;
the low bit mode is a sub-neural network mode of bit width when k is smaller than a preset value;
the high bit pattern is a sub-neural network pattern of bit width when k is a preset value.
2. The method for on-line switching of a bit-width quantized neural network according to claim 1, wherein said step M1 comprises: exploiting the property that a high-bit neural network can be quantized into a low-bit neural network, integrating sub-neural networks with different bit widths into one super network; the super network and each sub-neural network of different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a low-bit-width sub-neural network can be obtained by quantizing a high-bit-width sub-neural network.
3. The method of claim 1, wherein the processing of each network middle layer feature in the step M2 by using a corresponding batch normalization layer comprises:
f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k

wherein k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; ε denotes a small constant.
4. The method for on-line switching of a bit-width quantized neural network according to claim 1, wherein said step M3 comprises:
the optimization task of the super network is based on multi-task learning and an optimized overall objective function of the super network, with the formula:

L_all = Σ_{k ∈ mode list} α_k L_k + ω||Q||₂  (2)

wherein mode list denotes the set of all bit-width sub-neural networks contained in the super network, and k denotes a k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the weight of the sub-optimization target L_k in the overall objective function; Q denotes the parameters of the super network; ||·||₂ denotes the L2 norm; ||Q||₂ denotes a parameter-decay term that regularizes the super network; ω is the weight of the parameter decay;
for each sub-optimization target L_k: when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and L_a takes the following specific form:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)

wherein (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i its category label; f_a denotes the single-precision floating-point sub-neural network; H denotes the cross-entropy function;
for each sub-optimization target L_k: when k < a, the super network runs with k-bit integers, and L_k takes the following form:

L_k = KL(σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T))

wherein KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a temperature hyper-parameter used in computing the probability values, so that consistency alignment between sub-neural networks of different bit widths is achieved.
5. The method for on-line switching of a bit-width quantized neural network according to claim 4, wherein said step M4 comprises:
in the forward computation stage of the neural network, the super network keeps a copy of the single-precision floating-point network parameters; the super network quantizes these single-precision floating-point parameters into low-bit-width integers, and the quantizer for bit width k is as follows:

wherein r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; round() denotes the rounding function;
during the training phase, the gradient of the quantizer is estimated using the straight-through estimator:

∂L/∂r = ∂L/∂Q_k

wherein ∂ denotes the partial derivative, Q_k denotes the quantized value after quantization, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated using the estimated gradients; the updated parameters are the single-precision floating-point parameters stored in the super network.
6. A system for on-line switching of bit-width quantized neural networks, comprising:
module M1: integrating deep neural networks with different bit widths into a super network, wherein the networks with all bit widths share the same network architecture;
module M2: the super network operates with different bit widths, corresponding network intermediate layer characteristics are obtained for any bit width, and each network intermediate layer characteristic is processed by adopting a corresponding batch normalization layer;
module M3: training the super network through supervised learning, and simulating quantization noise in a super network training stage until a consistency loss function between a low bit mode and a high bit mode converges to obtain the trained super network;
module M4: extracting a quantization neural network of a target bit from the trained hyper-network by using a preset quantizer to carry out low bit reasoning;
the low bit mode is a sub-neural network mode of bit width when k is smaller than a preset value;
the high bit pattern is a sub-neural network pattern of bit width when k is a preset value.
7. The system of on-line switchable bit-width quantized neural network of claim 6, wherein said module M1 comprises: exploiting the property that a high-bit neural network can be quantized into a low-bit neural network, integrating sub-neural networks with different bit widths into one super network; the super network and each sub-neural network of different bit width have the same network topology and share all convolutional-layer/fully-connected-layer parameters; a low-bit-width sub-neural network can be obtained by quantizing a high-bit-width sub-neural network.
8. The system of claim 6, wherein each network middle layer feature in the module M2 is processed by a corresponding batch normalization layer, and comprises:
f̂_k = γ_k · (f_k − μ_k) / √(σ_k² + ε) + β_k

wherein k denotes the bit width; f_k denotes the raw feature produced by the k-bit-width sub-neural network to be processed; γ_k and β_k denote learnable parameters exclusive to the k-bit-width sub-neural network; μ_k and σ_k² are the mean and variance of the raw feature f_k; ε denotes a small constant.
9. The system of on-line switchable bit-width quantized neural network of claim 6, wherein said module M3 comprises:
the optimization task of the super network is based on multi-task learning and an optimized overall objective function of the super network, with the formula:

L_all = Σ_{k ∈ mode list} α_k L_k + ω||Q||₂  (2)

wherein mode list denotes the set of all bit-width sub-neural networks contained in the super network, and k denotes a k-bit-width sub-neural network; L_k denotes the consistency loss function corresponding to the k-bit-width sub-neural network; α_k denotes the weight of the sub-optimization target L_k in the overall objective function; Q denotes the parameters of the super network; ||·||₂ denotes the L2 norm; ||Q||₂ denotes a parameter-decay term that regularizes the super network; ω is the weight of the parameter decay;
for each sub-optimization target L_k: when k = a, where a denotes the data type matched to the hardware, the super network runs with single-precision floating-point numbers, and L_a takes the following specific form:

L_a = Σ_{(x_i, y_i) ∈ (X, Y)} H(f_a(x_i), y_i)

wherein (X, Y) denotes the images and labels of the target dataset; x_i denotes an image and y_i its category label; f_a denotes the single-precision floating-point sub-neural network; H denotes the cross-entropy function;
for each sub-optimization target L_k: when k < a, the super network runs with k-bit integers, and L_k takes the following form:

L_k = KL(σ(f_k(x_i)/T) ‖ σ(f_a(x_i)/T))

wherein KL denotes the KL divergence between the distribution corresponding to the low-bit mode and the distribution corresponding to the high-bit mode; σ denotes the softmax function, which maps the network output to probability values; and T denotes a temperature hyper-parameter used in computing the probability values, so that consistency alignment between sub-neural networks of different bit widths is achieved.
10. The system of on-line switchable bit-width quantized neural network of claim 9, wherein the module M4 comprises:
in the forward computation stage of the neural network, the super network keeps a copy of the single-precision floating-point network parameters; the super network quantizes these single-precision floating-point parameters into low-bit-width integers, and the quantizer for bit width k is as follows:

wherein r denotes the single-precision floating-point parameter to be quantized; k denotes the bit width; clamp() denotes the truncation function; round() denotes the rounding function;
during the training phase, the gradient of the quantizer is estimated using the straight-through estimator:

∂L/∂r = ∂L/∂Q_k

wherein ∂ denotes the partial derivative, Q_k denotes the quantized value after quantization, and r denotes the single-precision floating-point parameter to be quantized;
in the backward computation stage of the neural network, the super network is updated using the estimated gradients; the updated parameters are the single-precision floating-point parameters stored in the super network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010929604.XA CN112101524A (en) | 2020-09-07 | 2020-09-07 | Method and system for on-line switching bit width quantization neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112101524A true CN112101524A (en) | 2020-12-18 |
Family
ID=73750792
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112926570A (en) * | 2021-03-26 | 2021-06-08 | 上海交通大学 | Adaptive bit network quantization method, system and image processing method |
CN113434750A (en) * | 2021-06-30 | 2021-09-24 | 北京市商汤科技开发有限公司 | Neural network search method, apparatus, device, storage medium, and program product |
WO2022222649A1 (en) * | 2021-04-23 | 2022-10-27 | Oppo广东移动通信有限公司 | Neural network model training method and apparatus, device, and storage medium |
WO2023015674A1 (en) * | 2021-08-12 | 2023-02-16 | 北京交通大学 | Multi-bit-width quantization method for deep convolutional neural network |
CN117709409A (en) * | 2023-05-09 | 2024-03-15 | 荣耀终端有限公司 | Neural network training method applied to image processing and related equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190050710A1 (en) * | 2017-08-14 | 2019-02-14 | Midea Group Co., Ltd. | Adaptive bit-width reduction for neural networks |
CN110363281A (en) * | 2019-06-06 | 2019-10-22 | 上海交通大学 | A kind of convolutional neural networks quantization method, device, computer and storage medium |
CN110555450A (en) * | 2018-05-31 | 2019-12-10 | 北京深鉴智能科技有限公司 | Face recognition neural network adjusting method and device |
US20200202213A1 (en) * | 2018-12-19 | 2020-06-25 | Microsoft Technology Licensing, Llc | Scaled learning for training dnn |
Non-Patent Citations (1)
Title |
---|
Kunyuan Du et al., "From Quantized DNNs to Quantizable DNNs", arXiv, pages 2-3 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||