CN115438784A - Sufficient training method for hybrid bit width hyper-network - Google Patents


Info

Publication number: CN115438784A
Application number: CN202210965207.7A
Authority: CN (China)
Prior art keywords: network, bit width, training, layer, hyper
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王玉峰, 张泽豪, 方双康, 丁文锐
Assignee (current and original): Beihang University
Filing / priority date: 2022-08-12
Publication date: 2022-12-06

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses a sufficient training method for a hybrid bit-width hyper-network and belongs to the field of machine learning. The method proceeds as follows. First, for a search space containing specific bit widths, a single-precision network is trained to convergence at each bit width, and the quantization error of every layer of each network is computed and recorded, forming the quantization error of the single-precision networks. Next, a hybrid bit-width hyper-network containing all the bit widths is constructed for the search space, and in every training round the quantization error of each layer of the hyper-network is computed at each bit width; the sampling probability of each bit width in the hyper-network for the next round is then adjusted by comparing these per-layer, per-bit-width quantization errors with those of the single-precision networks. Finally, the hyper-network is searched with a reinforcement learning algorithm to obtain the optimal bit-width configuration. The invention allows each sub-network to be evaluated accurately during the search, and effectively improves both the accuracy of the sub-networks and the quality of the optimal solution obtained by the search.

Description

Sufficient training method for hybrid bit width hyper-network
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a sufficient training method for a hybrid bit-width hyper-network.
Background
In the past decade, deep learning has attracted increasing attention from researchers because of its great advantages in feature extraction and model construction over shallow models, and it has developed rapidly in fields such as computer vision and character recognition.
Deep learning mainly takes the form of deep neural networks, and the Convolutional Neural Network (CNN), inspired by biological neurology, is one of the pioneering lines of research among them. Compared with traditional methods, convolutional neural networks feature weight sharing, local connections and pooling operations, which effectively reduce the number of globally optimized training parameters, lower model complexity, and give the network model a certain invariance to scaling, translation and distortion of the input. With these advantages, convolutional neural networks perform excellently in many computer vision tasks, including image classification, target detection and recognition, and semantic segmentation.
Although convolutional neural networks deliver reliable results in many visual tasks, their huge storage and computation overhead limits their application on today's widely popular portable devices. To broaden the application of convolutional neural networks, model compression and acceleration have therefore become hot topics in the field of computer vision.
Current compression methods for convolutional neural networks fall mainly into three categories:
The first is network pruning. The basic idea is that a convolutional neural network with better performance usually has a more complex structure, but some of its parameters contribute little to the final output and are redundant. An effective means of judging the importance of convolution kernel channels can therefore be found for an existing convolutional neural network, the corresponding redundant convolution kernel parameters are pruned away, and the efficiency of the neural network is improved. In this method, the evaluation criterion has a decisive influence on model performance.
The second is neural architecture search. With a good search algorithm, a machine can automatically find a fast and accurate network within a search space of a certain range, achieving network compression. The key is to build a huge space of network architectures, explore that space with an effective search algorithm, and find the optimal convolutional neural network architecture under a specific combination of training data and computational constraints (such as network size and latency).
The third is network quantization. This method quantizes the weight parameters of a 32-bit full-precision network into lower-bit parameters (such as 8-bit, 4-bit or 1-bit) to obtain a low-bit network. It effectively reduces parameter redundancy, thereby lowering storage occupation, communication bandwidth and computational complexity, and makes it easier to deploy deep networks in lightweight scenarios such as artificial-intelligence chips.
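To make the quantization step concrete, the sketch below shows a symmetric uniform fake-quantizer of the kind such methods typically rely on; the per-tensor scaling and rounding scheme is an illustrative assumption, not the specific quantizer prescribed by this disclosure.

```python
import torch

def uniform_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform fake-quantization of a full-precision tensor to `bits` bits.

    Illustrative sketch: the scale and rounding follow a common max-abs scheme,
    not necessarily the quantizer used in this disclosure.
    """
    qmax = 2 ** (bits - 1) - 1                     # e.g. 127 for 8-bit signed values
    scale = x.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale
    x_int = torch.round(x / scale).clamp(-qmax, qmax)
    return x_int * scale                           # de-quantized ("fake-quantized") output

# Example: quantize a 32-bit convolution weight tensor to 4 bits
w = torch.randn(64, 3, 3, 3)
w_4bit = uniform_quantize(w, bits=4)
```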
Within network quantization, mixed bit-width quantization studies the quantization sensitivity of each layer of the network and assigns each layer the most appropriate bit width according to that sensitivity. Hyper-network search is one of the important approaches to mixed bit-width quantization: a search space containing several bit widths is first defined, a hyper-network containing all selectable bit widths is constructed, and the hyper-network is then trained by sampling sub-networks with different bit widths during training. The trained hyper-network can be searched with algorithms such as evolutionary learning to obtain the mixed bit-width sub-network with the highest test accuracy under limited resource constraints. However, during the training of the hybrid bit-width hyper-network, insufficient training of the sub-networks often degrades their accuracy, which in turn harms the accuracy with which each sub-network is evaluated during the search, and finally causes the search to become trapped in a sub-optimal solution.
How to solve the accuracy degradation caused by insufficient sub-network training during hyper-network training, and thereby improve sub-network accuracy, has become a problem that deserves in-depth research.
Disclosure of Invention
Considering the accuracy degradation caused by insufficient sub-network training in current hyper-network search methods, the invention provides a sufficient training method for a hybrid bit-width hyper-network. The method first computes and records the quantization-error information of single-precision networks, and then uses this information to adjust the bit-width sampling probabilities during hybrid bit-width hyper-network training, guiding the training process so that the sub-networks at every bit width are trained more fully, which improves both the training efficiency and the final performance of the network.
The sufficient training method for a hybrid bit-width hyper-network specifically comprises the following steps:
Step one: for a search space containing n bit widths, train a single-precision network at each bit width, forming the corresponding quantized networks.
For a given network architecture arch, the search space containing n bit widths is B = {b_1, b_2, ..., b_n}. The inputs and weights of every convolution layer in arch are quantized to bit widths b_1 through b_n respectively, forming the quantized networks arch_1, arch_2, ..., arch_n.
Step two: train each quantized network until it converges, then compute the quantization error of every layer of the quantized networks at the different bit widths, forming the quantization error Q of the single-precision networks.
For the quantized network arch_n, the quantization error q_{ln} of its l-th layer is computed as

\[ q_{ln} = \frac{1}{M}\sum_{i=1}^{M}\left(x_{ln}^{(i)} - x_{lnq}^{(i)}\right)^{2} \]

where x_{ln} denotes the full-precision data in the l-th layer of arch_n, x_{lnq} denotes x_{ln} quantized to bit width b_n, and M is the number of data elements in the current layer. With L the total number of layers of the quantized network, Q = {q_l}_{l=1:L} denotes the quantization error of the single-precision networks, where q_l = [q_{l1}, q_{l2}, ..., q_{ln}].
Step three: insert quantizers containing the candidate bit widths into every convolution layer of the network architecture to construct the hybrid bit-width hyper-network.
The specific insertion process is as follows:
An input quantizer is inserted at the input of each convolution layer and a weight quantizer is inserted at each convolution kernel; each quantizer has n candidate bit widths, corresponding to the bit widths in the search space B. During training, each bit width is sampled, and after sampling, every quantizer is switched to its sampled bit width to complete the forward propagation of the input data.
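A minimal sketch of what such a mixed bit-width convolution layer could look like, using a symmetric uniform fake-quantizer with a straight-through estimator; the class name MixedBitConv2d and its interface are illustrative assumptions, not part of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric uniform fake-quantization with a straight-through estimator so
    # gradients pass through the rounding (illustrative choice of quantizer).
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    x_q = torch.round(x / scale).clamp(-qmax, qmax) * scale
    return x + (x_q - x).detach()

class MixedBitConv2d(nn.Module):
    """Convolution layer with an input quantizer and a weight quantizer that can
    be switched to any candidate bit width of the search space B."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int,
                 candidate_bits=(2, 4, 8)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        self.candidate_bits = tuple(candidate_bits)
        self.active_bit = self.candidate_bits[-1]      # currently sampled bit width

    def set_bit(self, bits: int) -> None:
        assert bits in self.candidate_bits
        self.active_bit = bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_q = fake_quant(x, self.active_bit)                  # input quantizer
        w_q = fake_quant(self.conv.weight, self.active_bit)   # weight quantizer
        return F.conv2d(x_q, w_q, self.conv.bias,
                        stride=self.conv.stride, padding=self.conv.padding)
```

During hyper-network training, the sampling procedure of step five would call set_bit on every such layer before each forward pass.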
Step four: before each training round of the hybrid bit-width hyper-network begins, compute the quantization error of the current hyper-network at every bit width, forming the hyper-network quantization error \hat{Q}.

The concrete procedure is as follows:

First, switch all quantizer bit widths of the hyper-network to b_1 and compute the quantization errors \hat{q}_{l1} of all layers at this setting:

\[ \hat{q}_{l1} = \frac{1}{M}\sum_{i=1}^{M}\left(\hat{x}_{l}^{(i)} - \hat{x}_{l1q}^{(i)}\right)^{2} \]

where \hat{x}_{l} denotes the full-precision data in the l-th layer of the hyper-network and \hat{x}_{l1q} denotes \hat{x}_{l} quantized to bit width b_1.

Then switch the quantizer bit width of the hyper-network to b_2, b_3, ..., b_n in turn and compute the corresponding quantization errors \hat{q}_{l2}, ..., \hat{q}_{ln}. This yields the hyper-network quantization error matrix \hat{Q} = {\hat{q}_l}_{l=1:L}, in which the quantization error of each layer, \hat{q}_l = [\hat{q}_{l1}, \hat{q}_{l2}, ..., \hat{q}_{ln}], collects the quantization error values of that layer at the n candidate bit widths.
Step five: according to the difference between the hyper-network quantization error \hat{Q} and the single-precision quantization error Q, dynamically adjust the per-layer bit-width sampling probabilities of the hyper-network for this training round.

The difference for the n-th candidate bit width of the l-th layer of the hyper-network is computed as

\[ \Delta_{ln} = \frac{\hat{q}_{ln} - q_{ln}}{q_{ln}} \]

If Δ_{ln} is positive, the quantization error of the l-th layer of the hyper-network at bit width b_n is larger than that of the single-precision network; the layer is not yet sufficiently trained at this bit width and needs further training. If Δ_{ln} is negative, the training error of this layer at this bit width is already small enough, and the number of times this bit width is sampled can be reduced in subsequent training.

After the error differences are computed, the positive part of Δ_{ln} is therefore retained and negative values are set to 0:

\[ \Delta_{ln}^{+} = \max(\Delta_{ln}, 0) \]

Then the quantization error differences across the bit widths of each convolution layer are normalized to obtain the sampling probability of each bit width for the current round:

\[ p_{ln} = \frac{\Delta_{ln}^{+}}{\sum_{k=1}^{n}\Delta_{lk}^{+}} \]

This gives the sampling probability of every bit width of the hyper-network in the current training round. The larger Δ_{ln} is, the less fully the current bit width has been trained, and hence the larger the sampling probability p_{ln} assigned to it in the current round; the sub-network at that bit width is then trained more easily, which reduces its quantization error and improves the quantization accuracy of the network.
Step six: after the hyper-network training is finished, search over the hyper-network to find, among all candidate networks, the optimal candidate network that best balances accuracy and computational cost.
The invention has the following advantages:
(1) The sufficient training method for a hybrid bit-width hyper-network addresses the problem of insufficient training and thereby provides a direction for improving hyper-network search algorithms.
(2) The method uses quantization errors to adjust the bit-width sampling probabilities during hyper-network training in a targeted way, effectively improving the accuracy of the sub-networks.
(3) The method improves the performance of every sub-network in the hyper-network, so that each sub-network can be evaluated accurately during the search, which effectively improves the quality of the optimal solution obtained by the search.
Drawings
FIG. 1 is a flow chart of the sufficient training method for a hybrid bit-width hyper-network of the present invention;
FIG. 2 is a diagram illustrating the quantizer insertion method of the present invention.
Detailed Description
A specific embodiment of the present invention is described in detail below with reference to the drawings.
The invention discloses a sufficient training method for a hybrid bit-width hyper-network. For a search space containing specific bit widths, single-precision networks are first trained at each bit width until convergence, and the quantization errors of every layer at the different bit widths are computed and recorded. A hybrid bit-width hyper-network containing all the bit widths is then constructed for the search space. In every round of hyper-network training, the quantization error of each layer of the hyper-network is computed at each bit width, and the sampling probability of each bit width for the next round is adjusted according to how these errors compare with those of the single-precision networks. Finally, after the hybrid bit-width hyper-network has been trained, it is searched with a reinforcement learning or evolutionary learning algorithm to obtain the optimal bit-width configuration. By using quantization errors to adjust the bit-width sampling probabilities during hyper-network training in a targeted way, the invention effectively improves sub-network accuracy, allows the search algorithm to evaluate each sub-network in the hyper-network accurately, and finally brings the search result close to the optimal solution.
As shown in FIG. 1, the sufficient training method for a hybrid bit-width hyper-network specifically comprises the following steps:
Step one: for a search space containing n bit widths, train a single-precision network at each bit width, forming the corresponding quantized networks.

For a given network architecture arch, the search space containing n bit widths is B = {b_1, b_2, ..., b_n}. The inputs and weights of every convolution layer in arch are quantized to bit widths b_1 through b_n respectively, forming the quantized networks arch_1, arch_2, ..., arch_n.

Step two: train each quantized network until it converges, then compute the quantization error of every layer of each trained quantized network. Q = {q_l}_{l=1:L} denotes the quantization error of the single-precision networks, where L is the total number of layers of the quantized network and q_l = [q_{l1}, q_{l2}, ..., q_{ln}] collects the quantization errors of the l-th layer of each network.

For the quantized network arch_n, the quantization error q_{ln} of its l-th layer is computed as

\[ q_{ln} = \frac{1}{M}\sum_{i=1}^{M}\left(x_{ln}^{(i)} - x_{lnq}^{(i)}\right)^{2} \]

where x_{ln} denotes the full-precision data in the l-th layer of arch_n, such as the weights and inputs; x_{lnq} denotes x_{ln} quantized to bit width b_n; and M is the number of data elements in the current layer.
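As a sketch of how the per-layer error q_{ln} could be computed in practice, the helper below measures the mean squared quantization error of one layer's full-precision data (weights and inputs) at a given bit width; the uniform quantizer and the (1/M)Σ(·)² error form follow the reconstruction above and are assumptions about the exact implementation.

```python
import torch

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric uniform fake-quantization (illustrative assumption).
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale

def layer_quant_error(layer_data: torch.Tensor, bits: int) -> float:
    """q_{ln}: mean squared difference between a layer's full-precision data
    and its `bits`-bit quantization, averaged over the M data elements."""
    data_q = fake_quant(layer_data, bits)
    return torch.mean((layer_data - data_q) ** 2).item()

# Example: error of one converged layer under 4-bit quantization, pooling its
# weights and a cached input activation (names are illustrative).
# q_l4 = layer_quant_error(torch.cat([weight.flatten(), cached_input.flatten()]), bits=4)
```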
Step three: insert quantizers containing the candidate bit widths into every convolution layer of the network architecture to construct the hybrid bit-width hyper-network.
The specific insertion process is as follows:
the quantizer is inserted in the manner shown in fig. 2, where an input quantizer is inserted into an input end of each convolution layer of the network structure arch, and a weight quantizer is inserted into each convolution core, where each quantizer has n candidate bit widths, and corresponds to the bit width in the search space B. And sampling each bit width with a certain probability in the training process, and converting the bit width of each quantizer to the sampled bit width after sampling is finished so as to finish the forward propagation process of the input data.
Step four: before each training round of the hybrid bit-width hyper-network begins, compute the quantization error of the current hyper-network at every bit width, forming the hyper-network quantization error \hat{Q}.

The concrete procedure is as follows:

First, switch all quantizer bit widths of the hyper-network to b_1 and compute the quantization errors \hat{q}_{l1} of all layers at this setting:

\[ \hat{q}_{l1} = \frac{1}{M}\sum_{i=1}^{M}\left(\hat{x}_{l}^{(i)} - \hat{x}_{l1q}^{(i)}\right)^{2} \]

where \hat{x}_{l} denotes the full-precision data in the l-th layer of the hyper-network, such as the weights and inputs, and \hat{x}_{l1q} denotes \hat{x}_{l} quantized to bit width b_1.

Then switch the quantizer bit width of the hyper-network to b_2, b_3, ..., b_n in turn and compute the corresponding quantization errors \hat{q}_{l2}, ..., \hat{q}_{ln}. This yields the hyper-network quantization error matrix \hat{Q} = {\hat{q}_l}_{l=1:L}, in which the quantization error of each layer, \hat{q}_l = [\hat{q}_{l1}, \hat{q}_{l2}, ..., \hat{q}_{ln}], collects the quantization error values of that layer at the n candidate bit widths.
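The following sketch assembles the L x n error matrix \hat{Q} described above by trying every candidate bit width in turn and measuring the per-layer error on the hyper-network's current weights and a batch of cached inputs; the helper names and the MSE error form follow the earlier sketches and are assumptions about the exact implementation.

```python
import torch

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric uniform fake-quantization (illustrative assumption).
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale

def supernet_error_matrix(quant_layers, cached_inputs, candidate_bits):
    """Build the hyper-network quantization error matrix Q_hat of shape (L, n).

    quant_layers  : the hyper-network's quantized conv layers (exposing .conv.weight)
    cached_inputs : cached full-precision input tensors, one per layer
    candidate_bits: the search space B = (b_1, ..., b_n)
    """
    L, n = len(quant_layers), len(candidate_bits)
    q_hat = torch.zeros(L, n)
    for l, (layer, x_l) in enumerate(zip(quant_layers, cached_inputs)):
        # Full-precision data of layer l: its weights and inputs, pooled together.
        data = torch.cat([layer.conv.weight.detach().flatten(), x_l.detach().flatten()])
        for k, bits in enumerate(candidate_bits):
            q_hat[l, k] = torch.mean((data - fake_quant(data, bits)) ** 2)
    return q_hat
```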
Step five: according to the difference between the hyper-network quantization error \hat{Q} and the single-precision quantization error Q, dynamically adjust the per-layer bit-width sampling probabilities of the hyper-network for this training round.

First, compute the normalized difference between \hat{Q} and Q:

\[ \Delta = \frac{\hat{Q} - Q}{Q} \]

Note that this is an element-wise operation; for example, the difference for the n-th candidate bit width of the l-th layer of the hyper-network is computed as

\[ \Delta_{ln} = \frac{\hat{q}_{ln} - q_{ln}}{q_{ln}} \]

In Δ, if Δ_{ln} is positive, the quantization error of the l-th layer of the hyper-network at bit width b_n is larger than that of the single-precision network; the layer is not yet sufficiently trained at this bit width and needs further training. If Δ_{ln} is negative, the training error of this layer at this bit width is already small enough, and the number of times this bit width is sampled can be reduced in subsequent training.

After the error differences are computed, the positive part of Δ_{ln} is therefore retained and negative values are set to 0:

\[ \Delta_{ln}^{+} = \max(\Delta_{ln}, 0) \]

Then the quantization error differences across the bit widths of each convolution layer are normalized to obtain the sampling probability of each bit width for the current round:

\[ p_{ln} = \frac{\Delta_{ln}^{+}}{\sum_{k=1}^{n}\Delta_{lk}^{+}} \]

This gives the sampling probability of every bit width of the hyper-network in the current training round. The larger Δ_{ln} is, the less fully the current bit width has been trained, and hence the larger the sampling probability p_{ln} assigned to it in the current round; the sub-network at that bit width is then trained more easily, which reduces its quantization error and improves the quantization accuracy of the network.
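A sketch of step five as it could be implemented, following the reconstructed formulas above: the element-wise relative gap Δ, its positive part, and a per-layer normalization into sampling probabilities. The uniform fallback for a layer whose gaps are all zero is an added assumption so that the probabilities stay well defined.

```python
import torch

def bitwidth_sampling_probs(q_hat: torch.Tensor, q_ref: torch.Tensor) -> torch.Tensor:
    """Turn the gap between hyper-network errors (q_hat, L x n) and single-precision
    errors (q_ref, L x n) into per-layer bit-width sampling probabilities (L x n)."""
    delta = (q_hat - q_ref) / q_ref            # element-wise relative error gap
    delta = delta.clamp(min=0.0)               # keep only under-trained bit widths
    row_sum = delta.sum(dim=1, keepdim=True)
    uniform = torch.full_like(delta, 1.0 / delta.shape[1])
    # Normalize per layer; fall back to uniform sampling when every gap is zero.
    return torch.where(row_sum > 0, delta / row_sum.clamp(min=1e-12), uniform)

# Usage in a training round (illustrative):
# probs = bitwidth_sampling_probs(q_hat, q_single)        # step five
# sampled = torch.multinomial(probs, num_samples=1)       # one bit width per layer
```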
Step six: after the hyper-network training is finished, search over the hyper-network to find, among all candidate networks, the optimal candidate network that best balances accuracy and computational cost.
This step can be implemented with an evolutionary search algorithm. First, P candidate networks are randomly sampled from the hyper-network as the initial population, each with a distinct bit-width encoding; after each sampling, the performance of the candidate networks is evaluated, recorded and ranked. The better-performing networks are then used as parents, and crossover and mutation operations are applied to them to generate new offspring. After the evolutionary search has run for a certain number of rounds, the candidate network with the best performance is selected as the search result; this is the optimal mixed bit-width network obtained by the final search.
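A compact sketch of the evolutionary search described above; the fitness handling, resource check, and mutation/crossover details are illustrative assumptions, and evaluate / within_budget stand for whatever accuracy and cost measurements the embodiment uses.

```python
import random

def evolutionary_search(num_layers, candidate_bits, evaluate, within_budget,
                        population_size=50, generations=20, num_parents=10,
                        mutate_prob=0.1):
    """Search the trained hyper-network for the best mixed bit-width encoding.

    evaluate(code)      -> validation accuracy of the sub-network with this encoding
    within_budget(code) -> True if the encoding satisfies the resource constraint
    """
    def random_code():
        return [random.choice(candidate_bits) for _ in range(num_layers)]

    def mutate(code):
        return [random.choice(candidate_bits) if random.random() < mutate_prob else b
                for b in code]

    def crossover(a, b):
        return [random.choice(pair) for pair in zip(a, b)]

    def fitness(code):
        # Encodings that violate the resource constraint are never selected.
        return evaluate(code) if within_budget(code) else float("-inf")

    population = [random_code() for _ in range(population_size)]
    best = (float("-inf"), None)
    for _ in range(generations):
        scored = sorted(((fitness(c), c) for c in population), reverse=True)
        if scored[0][0] > best[0]:
            best = scored[0]
        parents = [c for _, c in scored[:num_parents]]          # better-performing networks
        population = [mutate(crossover(*random.sample(parents, 2)))
                      for _ in range(population_size)]          # new offspring
    return best  # (accuracy, optimal mixed bit-width encoding)
```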

Claims (4)

1. A sufficient training method for a hybrid bit-width hyper-network, characterized by the following specific steps:

first, for a search space containing n bit widths, train a single-precision network at each bit width, forming the corresponding quantized networks; train each quantized network until it converges, then compute the quantization error of every layer of the quantized networks at the different bit widths, forming the quantization error Q of the single-precision networks;

for the quantized network arch_n, the quantization error q_{ln} of its l-th layer is computed as

\[ q_{ln} = \frac{1}{M}\sum_{i=1}^{M}\left(x_{ln}^{(i)} - x_{lnq}^{(i)}\right)^{2} \]

where x_{ln} denotes the full-precision data in the l-th layer of arch_n, x_{lnq} denotes x_{ln} quantized to bit width b_n, M is the number of data elements in the current layer, and L is the total number of layers of the quantized network; Q = {q_l}_{l=1:L} denotes the quantization error of the single-precision networks, where q_l = [q_{l1}, q_{l2}, ..., q_{ln}];

then, insert quantizers containing the candidate bit widths into every convolution layer of the network architecture to construct the hybrid bit-width hyper-network; before each training round of the hybrid bit-width hyper-network begins, compute the quantization error of the current hyper-network at every bit width, forming the hyper-network quantization error \hat{Q}; the specific procedure is as follows:

first, switch all quantizer bit widths of the hyper-network to b_1 and compute the quantization errors \hat{q}_{l1} of all layers at this setting:

\[ \hat{q}_{l1} = \frac{1}{M}\sum_{i=1}^{M}\left(\hat{x}_{l}^{(i)} - \hat{x}_{l1q}^{(i)}\right)^{2} \]

where \hat{x}_{l} denotes the full-precision data in the l-th layer of the hyper-network and \hat{x}_{l1q} denotes \hat{x}_{l} quantized to bit width b_1;

second, switch the quantizer bit width of the hyper-network to b_2, b_3, ..., b_n in turn and compute the corresponding quantization errors \hat{q}_{l2}, ..., \hat{q}_{ln}, thereby obtaining the hyper-network quantization error matrix \hat{Q} = {\hat{q}_l}_{l=1:L}, in which the quantization error of each layer, \hat{q}_l = [\hat{q}_{l1}, \hat{q}_{l2}, ..., \hat{q}_{ln}], contains the quantization error values of that layer at the n candidate bit widths;

then, according to the difference between the hyper-network quantization error \hat{Q} and the single-precision quantization error Q, dynamically adjust the per-layer bit-width sampling probabilities of the hyper-network for the training round;

finally, after the hyper-network training is finished, search over the hyper-network to find, among all candidate networks, the optimal candidate network that best balances accuracy and computational cost.
2. The sufficient training method for a hybrid bit-width hyper-network according to claim 1, characterized in that, for a given network architecture arch, the search space containing n bit widths is B = {b_1, b_2, ..., b_n}; the inputs and weights of every convolution layer in arch are quantized to bit widths b_1 through b_n respectively, forming the quantized networks arch_1, arch_2, ..., arch_n.
3. The sufficient training method for a hybrid bit-width hyper-network according to claim 1, characterized in that the specific insertion process for constructing the hybrid bit-width hyper-network is as follows:
an input quantizer is inserted at the input of each convolution layer and a weight quantizer is inserted at each convolution kernel, where each quantizer has n candidate bit widths corresponding to the bit widths in the search space; during training, each bit width is sampled, and after sampling, every quantizer is switched to its sampled bit width to complete the forward propagation of the input data.
4. The sufficient training method for a hybrid bit-width hyper-network according to claim 1, characterized in that the specific process for dynamically adjusting the sampling probabilities is as follows:

first, the difference for the n-th candidate bit width of the l-th layer of the hyper-network is computed as

\[ \Delta_{ln} = \frac{\hat{q}_{ln} - q_{ln}}{q_{ln}} \]

if Δ_{ln} is positive, the quantization error of the l-th layer of the hyper-network at bit width b_n is larger than that of the single-precision network, the layer is not yet sufficiently trained at this bit width and needs further training; if Δ_{ln} is negative, the training error of this layer at this bit width is already small enough, and the number of times this bit width is sampled can be reduced in subsequent training;

then, after the error differences are computed, the positive part of Δ_{ln} is retained and negative values are set to 0:

\[ \Delta_{ln}^{+} = \max(\Delta_{ln}, 0) \]

finally, the quantization error differences across the bit widths of each convolution layer are normalized to obtain the sampling probability of each bit width for the current round:

\[ p_{ln} = \frac{\Delta_{ln}^{+}}{\sum_{k=1}^{n}\Delta_{lk}^{+}} \]

This gives the sampling probability of every bit width of the hyper-network in the current training round. The larger Δ_{ln} is, the less fully the current bit width has been trained, and hence the larger the sampling probability p_{ln} assigned to it in the current round; the sub-network at that bit width is then trained more easily, which reduces its quantization error and improves the quantization accuracy of the network.

Priority Applications (1)

Application Number: CN202210965207.7A
Priority / Filing Date: 2022-08-12
Title: Sufficient training method for hybrid bit width hyper-network

Publications (1)

Publication Number: CN115438784A
Publication Date: 2022-12-06

Family ID: 84241788

Family Applications (1)

Application Number: CN202210965207.7A
Title: Sufficient training method for hybrid bit width hyper-network
Priority / Filing Date: 2022-08-12

Country Status (1)

Country: CN; Publication: CN115438784A (en)

Cited By (1)

* Cited by examiner, † Cited by third party

Publication number: CN117709409A *
Priority date: 2023-05-09
Publication date: 2024-03-15
Assignee: 荣耀终端有限公司 (Honor Device Co., Ltd.)
Title: Neural network training method applied to image processing and related equipment



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination