CN116524173A - Deep learning network model optimization method based on parameter quantization

Deep learning network model optimization method based on parameter quantization

Info

Publication number
CN116524173A
CN116524173A (application CN202310162619.1A)
Authority
CN
China
Prior art keywords
quantization
layer
network model
parameter
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310162619.1A
Other languages
Chinese (zh)
Inventor
钮赛赛
邵艳明
蔡彬
史庆杰
张晗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Aerospace Control Technology Institute
Original Assignee
Shanghai Aerospace Control Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Aerospace Control Technology Institute filed Critical Shanghai Aerospace Control Technology Institute
Priority to CN202310162619.1A priority Critical patent/CN116524173A/en
Publication of CN116524173A publication Critical patent/CN116524173A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 — Arrangements for image or video recognition or understanding
    • G06V 10/20 — Image preprocessing
    • G06V 10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/40 — Extraction of image or video features
    • G06V 10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 — Local feature extraction by matching or filtering
    • G06V 10/449 — Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 — Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 — Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/70 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 — Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 — Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 — Arrangements using pattern recognition or machine learning using neural networks
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 — Road transport of goods or passengers
    • Y02T 10/10 — Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 — Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

A deep learning network model optimization method based on parameter quantization addresses the hardware resource limitations, such as memory and power consumption, faced when a deep-learning-based intelligent information processing platform is built on a missile-borne platform, and meets the real-time and lightweight requirements of infrared image target detection and recognition. A lightweight network model based on YOLOv3-tiny is combined with low-bit quantization and channel-level quantization, and the weight parameters are quantized step by step during model retraining.

Description

Deep learning network model optimization method based on parameter quantization
Technical Field
The invention relates to a deep learning network model optimization method based on parameter quantization, and belongs to the technical field of computer vision.
Background
Most current infrared target detection methods based on deep learning achieve high recognition accuracy by building high-performance network models. In a missile-borne environment, however, embedded hardware platforms with limited space, power and other resources struggle to meet the heavy compute and redundant storage demands of deep neural networks, so software and hardware systems suitable for deep learning on missile-borne platforms are studied from two directions: low-power, miniaturized intelligent hardware, and low-complexity optimization of the deep learning network model. For a given, fixed intelligent hardware platform, starting from low-complexity network model optimization can effectively save storage space on the intelligent processor, reduce its computational load, improve its operating efficiency and lower its power consumption.
Low-bit numbers occupy less memory in a computer than higher-bit floating-point numbers. Quantization is a model compression method that replaces the high-precision floating-point representation of convolutional neural network parameters with low-precision numbers. For example, replacing the original 32-bit single-precision floating-point weights with 8-bit integers reduces the storage occupied by the network model to one quarter of the original. This low-precision representation removes, to some extent, the representational redundancy present in the network: the network's features can be expressed well with quantized parameters and do not require excessive precision. In some cases, however, the quantized parameters cannot reach the precision required by the target task, which degrades network accuracy. The task of network quantization is therefore to express the network with as few bits as possible while losing as little accuracy as possible, striking a balance between quantization bit width and accuracy loss.
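For illustration only, a minimal sketch of such 8-bit quantization of a floating-point weight tensor is given below; the function names and the symmetric per-tensor scaling are assumptions, not part of the claimed method:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor 8-bit quantization: float32 weights -> int8 plus one scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # example convolution weights
q, s = quantize_int8(w)
print(q.nbytes / w.nbytes)                             # 0.25: one quarter of the storage
print(np.abs(w - dequantize(q, s)).max())              # worst-case quantization error
```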
In reference [CN 114170512A], to address the high complexity and slow inference of existing remote-sensing SAR target detection methods, training and test sets split from a public remote-sensing SAR target detection dataset are expanded and augmented; an existing lightweight network is adapted, pruned and quantized with mixed precision, yielding a remote-sensing SAR target detection model that combines network pruning with parameter quantization and improves detection accuracy while saving training cost.
In reference [CN 111767993A], a convolutional neural network INT8 quantization method, system, device and storage medium are provided; full-integer inference of the whole model is achieved by offline nonlinear INT8 quantization of the convolution-layer parameters, inputs and outputs, while the quantization accuracy is also improved.
In reference [Li Gushi. Research on deep neural network compression and acceleration [D]. Beijing University of Posts and Telecommunications, 2020], to address the severe network oscillation caused by quantizing floating-point parameters and activation values to low-bit values in one shot (one-shot quantization), which makes the quantized network hard to converge and low in accuracy, an incremental quantization algorithm along the output-channel dimension is proposed; network fluctuation during quantization is reduced by iteratively quantizing the network parameters and activation values. In each quantization iteration, only the network parameters and activation values corresponding to a subset of output channels are selected and quantized according to a rule, and, to further mitigate fluctuation, the weights and activations quantized in each iteration should be disjoint in the output-channel dimension.
Most existing network model optimization methods adopt a single compression mode (pruning alone or quantization alone), and existing network quantization methods are rarely applied to lightweight infrared target recognition networks. Aiming at the requirements and characteristics of infrared target recognition, the invention combines INT8 quantization, channel-level quantization and other techniques to realize, step by step, a low-complexity optimization method for an infrared target recognition network model based on the YOLOv3-tiny network.
Disclosure of Invention
The technical problem solved by the invention is as follows: aiming at the limitations of hardware resources such as memory and power consumption faced in the prior art when a deep-learning-based intelligent information processing platform is built on a missile-borne platform, a deep learning network model optimization method based on parameter quantization is provided.
The invention solves the above technical problem through the following technical solution:
a deep learning network model optimization method based on parameter quantization comprises the following steps:
constructing a lightweight network model based on YOLOv3-tiny and training it to obtain preliminary floating-point network weight parameters, the lightweight network model comprising convolution layers, batch normalization layers, activation functions, max-pooling layers, an upsampling layer and a routing layer;
carrying out channel-level quantization on the designed lightweight network model;
retraining the obtained preliminary network, and quantizing the network weights in a stepwise manner during retraining.
In the lightweight network model, the convolution layer is used to extract high-dimensional features from the input image; in its operation, w_n denotes the weights of the n-th layer, x_{n-1} the input feature values of the n-th layer, o_n the output of the n-th layer, K the width of the convolution kernel, and C_n the number of channels of the n-th layer's output feature values.
In the batch normalization layer, x and y respectively denote the input and output, μ^(i) and σ^(i) are the mean and variance of the feature map of the i-th channel over a batch, γ^(i) and β^(i) are learnable channel-level parameters of the normalization layer, and ε is used to avoid data overflow.
The ReLU function is employed as the activation function between two convolution layers:
ReLU(x) = max(0, x).
the max-pooling layer reduces the data dimension, reduces the amount of computation, enhances the invariance of image features and enlarges the receptive field; the upsampling layer restores image features to the input dimension to enable target position output; and the routing layer obtains multi-scale fused feature values from the output feature values of two cascaded convolution layers.
The channel-level quantization of the designed lightweight network model is specifically as follows:
different quantization intervals are used to match quantization parameters to the different channels of each layer so as to improve the model accuracy of the lightweight network model; the channel-level quantization of channel j in layer i is specifically:
the distribution interval of the channel's parameters in that layer is recorded by max_ij and min_ij, and the long-tail weights are clipped to obtain the quantization range d_ij and mean m_ij of the parameters; the weights of channel j in the current layer i are recorded as w_ij, and the quantized weights wq_ij and the recovered weights wr_ij are computed according to d_ij, m_ij and the quantization bit number b;
the quantization parameters of every channel of every layer are traversed and matched, completing channel-level quantization.
The stepwise quantization adopted in the retraining process is specifically as follows:
the lightweight network model after channel-level quantization is retrained for a preset number of iteration steps, and the weights of each layer are randomly quantized during retraining so as to eliminate the model's dependence on fixed features, until the lightweight network model converges, wherein:
the model retraining steps are:
during forward inference, a quantization range is selected from the parameter distribution; parameters exceeding the quantization range are clamped to the quantization range, and the full-precision weights are recorded as the basis for updating;
the scaling factor, the quantization range d_ij and the mean m_ij are updated according to the mean absolute error between the full-precision parameters and the quantized parameters;
during error back-propagation, each weight parameter is updated step by step in reverse according to the loss, under the loss function, between the target obtained by forward inference with the quantized parameters and the actual target;
through multiple rounds of forward inference and back-propagation over multi-step iterations, the network model's weight parameters are quantized step by step and the network's inference re-converges.
During model retraining, each convolution layer is quantized; within each convolution layer the order in which weights are quantized is selected at random, achieving stepwise quantization of the weights, and the coexistence of quantized and full-precision parameters improves the learning capacity of the convolution layer.
During model retraining, the batch normalization layer is fused into the convolution layer through a progressive fusion strategy: the batch normalization parameters are transferred to the preceding convolution layer while the mean and variance continue to be updated, and the convolution layer learns from the incoming mean and variance; in the fusion stage, updating of the mean and variance is stopped so as to eliminate the independent batch normalization parameters, completing the fusion of the batch normalization layer into the convolution layer and reducing the difficulty of deploying the deep learning network model on hardware.
During model retraining, the quantization of a convolution layer proceeds specifically as follows:
the input feature A_in, the weights W_conv and the bias B_conv are quantized; the quantized input is convolved with the quantized weights to obtain the quantized convolution output M_q; because M_q and the quantized bias have different quantization ranges, the bias and M_q are added after dequantization, and the sum is passed through the activation function to obtain the final output feature value A_out.
The parameter quantization is specifically as follows:
uniform quantization with a preset quantization bit width k is adopted, in which neighbouring quantization points are equally spaced:
x_q = Q_k(x_r, α)
where x_r is the tensor to be quantized (weights, biases or activation values), α is the scaling factor, q is the integer tensor that takes part in computation in the integer arithmetic unit, x_q is the quantized parameter, Q denotes the quantization function, clip is a truncation function, and round is a rounding function returning the rounded value of a floating-point number;
the scaling factor is used to overcome the long-tail phenomenon in the weight distribution of the convolution layer and to realize quantization correction within the interval.
in the forward inference stage, the parameters of the batch normalization layer are fixed, giving:
y = ξ^(i) · o + η^(i)
where o is the output of the preceding convolution layer, and the quantized convolution operation is:
o = α_a q_a · α_w q_w
where α_a q_a and α_w q_w respectively denote the quantized activation values and weights;
the quantized convolution process after the batch normalization layer and the convolution layer are merged is as follows:
the batch normalization layer is merged with the previously quantized convolution layer, and the merged output is quantized again, where α_β is the channel-level scaling-factor tensor of β, its initial value being the absolute maximum on each channel of β; at this point the scaling factors α of the weights, biases and activation values are floating-point numbers, so fully integer operation cannot yet be realized;
the scaling factors are therefore shift-quantized:
a shift-quantized scaling factor can be applied with bit shifts to the left or right in place of floating-point operations, and the quantized convolution is then computed accordingly.
the gradient calculation in the weight parameter error back propagation process in the retraining process is specifically as follows:
the step-by-step quantization process of the network weight parameters is completed by the parameter quantization calculation method in the retraining process, so that the optimization of the network model is realized.
Compared with the prior art, the invention has the advantages that:
according to the deep learning network model optimization method based on parameter quantization, the number of network model weight parameters required to be directly stored on an AI processor is greatly reduced, the computational power requirements of an algorithm on the processor are reduced, and the realization of a deep network model of an intelligent missile-borne information processing platform can be completed, so that the storage space and the calculation power consumption of the intelligent algorithm in the processor are effectively saved, and the calculation efficiency of a hardware platform is improved. And the power consumption of the hardware platform is reduced.
Drawings
FIG. 1 is a schematic diagram of an optimization flow of a deep learning network model provided by the invention;
FIG. 2 is a diagram of a YOLOv3-tiny data flow provided by the invention;
FIG. 3 is a step-by-step retraining flowchart provided by the invention;
Detailed Description
A deep learning network model optimization method based on parameter quantization addresses the limitations of hardware resources such as memory and power consumption faced when a deep-learning-based intelligent information processing platform is built on a missile-borne platform; a lightweight network model based on YOLOv3-tiny is combined with a channel-level quantization method, and the parameter quantization of the network model is realized in a multi-step quantization manner.
The YOLOv3-tiny lightweight network model is a simplified version of YOLOv3 that requires less memory and computational overhead, making it suitable for deployment on embedded devices. The network model comprises convolution layers, batch normalization layers, activation functions, max-pooling layers, an upsampling layer and a routing layer. The specific flow is as follows:
constructing a lightweight network model based on YOLOv3-tiny;
carrying out channel-level quantization on the lightweight network model;
retraining and quantizing the lightweight network model after channel-level quantization;
carrying out parameter quantization of the trained lightweight network model.
The convolution layer is used to extract high-dimensional features from the input image; in its operation, w_n denotes the weights of the n-th layer, x_{n-1} the input feature values of the n-th layer, o_n the output of the n-th layer, K the width of the convolution kernel, and C_n the number of channels of the n-th layer's output feature values;
in the batch normalization layer, x and y respectively denote the input and output, μ^(i) and σ^(i) are the mean and variance of the feature map of the i-th channel over a batch, γ^(i) and β^(i) are learnable channel-level parameters of the normalization layer, and ε is used to avoid data overflow;
the ReLU function is employed as the activation function between two convolution layers:
ReLU(x) = max(0, x);
the max-pooling layer reduces the data dimension, reduces the amount of computation, enhances the invariance of image features and enlarges the receptive field; the upsampling layer restores image features to the input dimension to enable target position output; and the routing layer obtains multi-scale fused feature values from the output feature values of two cascaded convolution layers;
the channel-level quantization is specifically:
different quantization intervals are used to match quantization parameters to the different channels of each layer so as to improve the model accuracy of the lightweight network model; the channel-level quantization of channel j in layer i is specifically:
the distribution interval of the channel's parameters in that layer is recorded by max_ij and min_ij, and the long-tail weights are clipped to obtain the quantization range d_ij and mean m_ij of the parameters; the weights of channel j in the current layer i are recorded as w_ij, and the quantized weights wq_ij and the recovered weights wr_ij are computed according to d_ij, m_ij and the quantization bit number b;
after the quantization parameters of every channel of every layer have been traversed, channel-level quantization is complete;
the retraining quantization is specifically as follows:
the lightweight network model after channel-level quantization is retrained for a preset number of steps, and the weights of each layer are randomly quantized during retraining so as to eliminate the model's dependence on fixed features, until the lightweight network model converges, wherein:
the model retraining steps are:
a quantization range is selected from the parameter distribution; parameters exceeding the quantization range are clamped to the quantization range, and the full-precision weights are recorded as the basis for updating;
the scaling factor, d_ij and m_ij are updated according to the mean absolute error between the full-precision parameters and the quantized parameters;
during error back-propagation, each weight parameter is updated step by step in reverse according to the loss, under the loss function, between the target obtained by forward inference with the quantized parameters and the actual target;
through multiple rounds of forward inference and back-propagation over multi-step iterations, the network model's weight parameters are quantized step by step and the network's inference re-converges.
During model retraining, each convolution layer is quantized; within each convolution layer the order in which weights are quantized is selected at random, achieving stepwise quantization of the weights, and the coexistence of quantized and full-precision parameters improves the learning capacity of the convolution layer;
the batch normalization layer is fused into the convolution layer through a progressive fusion strategy: the batch normalization parameters are transferred to the preceding convolution layer while the mean and variance continue to be updated, and the convolution layer learns from the incoming mean and variance; in the fusion stage, updating of the mean and variance is stopped so as to eliminate the independent batch normalization parameters, completing the fusion of the batch normalization layer into the convolution layer and reducing the difficulty of deploying the deep learning network model on hardware;
the quantization of a convolution layer during model retraining is specifically as follows:
the input feature A_in, the weights W_conv and the bias B_conv are quantized; the quantized input is convolved with the quantized weights to obtain the quantized convolution output M_q; because M_q and the quantized bias have different quantization ranges, the bias and M_q are added after dequantization, and the sum is passed through the activation function to obtain the final output feature value A_out;
The parameter quantization is specifically as follows:
uniform quantization with a preset quantization bit width k is adopted, in which neighbouring quantization points are equally spaced:
x_q = Q_k(x_r, α)
where x_r is the tensor to be quantized (weights, biases or activation values), α is the scaling factor, q is the integer tensor that takes part in computation in the integer arithmetic unit, and x_q is the quantized parameter used for network training; Q denotes the quantization function, clip is a truncation function, and round is a rounding function returning the rounded value of a floating-point number;
the scaling factor is used to overcome the long-tail phenomenon in the weight distribution of the convolution layer and to realize quantization correction within the interval;
in the forward inference stage, the parameters of the batch normalization layer are fixed, giving:
y = ξ^(i) · o + η^(i)
where o is the output of the preceding convolution layer, and the quantized convolution operation is:
o = α_a q_a · α_w q_w
where α_a q_a and α_w q_w respectively denote the quantized activation values and weights;
the quantized convolution process after the batch normalization layer and the convolution layer are merged is as follows:
the batch normalization layer is merged with the previously quantized convolution layer, and the merged output is quantized again, where α_β is the channel-level scaling-factor tensor of β, its initial value being the absolute maximum on each channel of β; at this point the scaling factors α of the weights, biases and activation values are floating-point numbers, so fully integer operation cannot yet be realized;
the scaling factors are therefore shift-quantized:
a shift-quantized scaling factor can be applied with bit shifts to the left or right in place of floating-point operations, and the quantized convolution is then computed accordingly;
the gradient in the back-propagation of the weight-parameter error during retraining is computed accordingly;
the stepwise quantization of the network weight parameters is completed by the above parameter quantization calculations during retraining, thereby realizing the optimization of the network model.
The following further description of the preferred embodiments is provided in connection with the accompanying drawings of the specification:
In the present embodiment, the overall implementation flow is shown in fig. 1. The low-complexity optimization method of the deep learning network model based on parameter quantization builds on a lightweight network model based on YOLOv3-tiny, and realizes the parameter quantization of the network model by combining a channel-level quantization method with a multi-step quantization scheme.
As shown in fig. 2, the YOLOv3-tiny lightweight network model is a simplified version of YOLOv3 that requires less memory and computational overhead, making it suitable for deployment on embedded devices. The network model comprises convolution layers, batch normalization layers, activation functions, max-pooling layers, an upsampling layer and a routing layer.
The convolution layer is used to extract high-dimensional features from the input image; in its operation, w_n denotes the weights of the n-th layer, x_{n-1} the input feature values of the n-th layer, o_n the output of the n-th layer, K the width of the convolution kernel, and C_n the number of channels of the n-th layer's output feature values.
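The convolution formula itself is not reproduced in this text; a conventional form of the operation, written with the symbols defined above and offered only as a hedged reconstruction, is:

```latex
% Hedged reconstruction (not copied from the patent): output channel c of layer n
o_n^{(c)}(i,j) \;=\; \sum_{c'=1}^{C_{n-1}} \sum_{u=1}^{K} \sum_{v=1}^{K}
    w_n^{(c,\,c',\,u,\,v)}\; x_{n-1}^{(c')}(i+u-1,\; j+v-1)
```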
In the batch normalization layer, x and y respectively denote the input and output, μ^(i) and σ^(i) are the mean and variance of the feature map of the i-th channel over a batch, γ^(i) and β^(i) are two learnable channel-level parameters of the normalization layer, and ε is used to avoid data overflow.
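The batch-normalization formula is likewise omitted; the standard form it presumably takes, written with the symbols just defined and treating σ^(i) as the per-channel variance, is given here as a hedged reconstruction:

```latex
% Hedged reconstruction of the batch normalization transform (not copied from the patent)
y^{(i)} \;=\; \gamma^{(i)}\,\frac{x^{(i)} - \mu^{(i)}}{\sqrt{\sigma^{(i)} + \epsilon}} \;+\; \beta^{(i)}
```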
The ReLU function is used as an activation function between two convolution layers:
ReLU(x) = max(0, x)
the routing layer in YOLO acquires features extracted from the first half of the neural network by concatenating two output feature values of the same size from different convolutional layers.
Channel-level quantization can use a different quantization interval for each channel of each layer, so that the quantization interval better matches the distribution of that channel's parameters. Channel-level quantization thus better preserves the differences between channels, which helps improve model accuracy. The quantization of channel j in convolution layer i can be described as follows: first, the distribution interval of the channel's parameters in that layer is recorded by max_ij and min_ij; then the long-tail weights are clipped according to the parameter distribution, giving the quantization range d_ij and mean m_ij of the parameters; the weights of channel j in convolution layer i are recorded as w_ij, and the quantized weights wq_ij and the recovered weights wr_ij are computed according to d_ij, m_ij and the quantization bit number b.
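A minimal sketch of this per-channel procedure is shown below; the percentile rule used to clip the long tail and the exact quantize/recover formulas are assumptions made for illustration, since the patent text does not spell them out:

```python
import numpy as np

def quantize_channel(w_ij: np.ndarray, b: int = 8, clip_pct: float = 0.999):
    """Channel-level quantization of the weights w_ij of channel j in layer i (sketch)."""
    max_ij, min_ij = float(w_ij.max()), float(w_ij.min())   # recorded distribution interval
    m_ij = float(w_ij.mean())                               # average value m_ij
    lo, hi = np.quantile(w_ij, 1.0 - clip_pct), np.quantile(w_ij, clip_pct)  # clip long tail
    d_ij = float(max(hi - m_ij, m_ij - lo))                 # quantization range d_ij
    levels = 2 ** (b - 1) - 1                               # quantization bit number b
    wq_ij = np.clip(np.round((w_ij - m_ij) / d_ij * levels), -levels, levels).astype(np.int32)
    wr_ij = wq_ij / levels * d_ij + m_ij                    # recovered weights wr_ij
    return wq_ij, wr_ij, d_ij, m_ij, (min_ij, max_ij)

# traverse every output channel of one layer's weight tensor
layer_w = np.random.randn(16, 3, 3, 3)
per_channel = [quantize_channel(layer_w[j].ravel()) for j in range(layer_w.shape[0])]
```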
As shown in FIG. 3, the adopted multi-step quantization scheme decomposes the one-step quantization in model retraining into multiple steps, which ensures stability during training; at the same time, the weights are randomly quantized during training, which strengthens the model's robustness, removes its dependence on fixed features and helps it converge better.
The model retraining process adopted divides the training phase into two main steps. In the first step, the quantization range is selected from the parameter distribution; parameters beyond the quantization range are clamped to it, and the full-precision weights are recorded as the basis for updating. In the second step, d_ij and m_ij are updated according to the mean absolute error between the full-precision parameters and the quantized parameters.
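One such retraining iteration might be sketched as follows; the local grid search that minimises the mean absolute error is an assumption, since the patent does not state the exact update rule:

```python
import numpy as np

def retrain_step(w_full: np.ndarray, d_ij: float, m_ij: float, b: int = 8):
    """Step 1: clamp out-of-range parameters, keep the full-precision copy.
    Step 2: refine (d_ij, m_ij) by minimising the mean absolute error between
    full-precision and quantized parameters (assumed search rule)."""
    levels = 2 ** (b - 1) - 1
    w_clamped = np.clip(w_full, m_ij - d_ij, m_ij + d_ij)        # step 1

    def mae(d, m):
        q = np.clip(np.round((w_clamped - m) / d * levels), -levels, levels)
        return np.abs(w_full - (q / levels * d + m)).mean()

    cand_d = d_ij * np.linspace(0.8, 1.2, 9)                     # step 2 (illustrative search)
    cand_m = m_ij + d_ij * np.linspace(-0.1, 0.1, 9)
    d_ij, m_ij = min(((d, m) for d in cand_d for m in cand_m), key=lambda p: mae(*p))
    return w_clamped, d_ij, m_ij
```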
In the retraining of the convolution layers, the order in which weights are quantized is selected at random within each convolution layer, achieving stepwise quantization of the weights. This markedly reduces the disturbance to the model during training and prevents the model from being driven out of the global minimum. Meanwhile, the coexistence of quantized and full-precision parameters lets the full-precision parameters continue to exercise their learning capacity.
Further, to reduce the difficulty of deploying the model on hardware, the batch normalization structures in the network model are fused into the convolution layers that precede them. With a progressive fusion strategy, the operation is decomposed into two phases. In the learning phase, the batch normalization parameters are transferred to the preceding convolution layer and the mean and variance continue to be updated; the convolution layer learns the distribution of activations from the incoming mean and variance. In the fusion phase, updating of the mean and variance is stopped, thereby eliminating the independent batch normalization parameters.
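For reference, the standard way of folding frozen batch-normalization parameters γ, β, μ and the variance into the preceding convolution's weights and bias (a generic sketch rather than the patent's exact procedure) is:

```python
import numpy as np

def fold_batchnorm(w_conv, b_conv, gamma, beta, mean, var, eps=1e-5):
    """Fold frozen BN statistics into the preceding convolution, per output channel.
    w_conv: (C_out, C_in, K, K); b_conv, gamma, beta, mean, var: (C_out,)."""
    scale = gamma / np.sqrt(var + eps)              # per-channel scale
    w_folded = w_conv * scale[:, None, None, None]  # scale each output channel's weights
    b_folded = (b_conv - mean) * scale + beta       # shift the bias accordingly
    return w_folded, b_folded
```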
Further, in the quantization of a convolution layer, the input feature A_in, the weights W_conv and the bias B_conv are first quantized; the quantized input is convolved with the quantized weights to obtain the quantized convolution output M_q. Because M_q and the quantized bias have different quantization ranges, the bias must be added to M_q after dequantization, and the sum is passed through the activation function to obtain the final output feature value A_out.
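A simplified forward pass of such a quantized layer is sketched below; a dense matrix product stands in for the convolution, and the bit widths and per-tensor scales are illustrative assumptions:

```python
import numpy as np

def quantize(x, bits):
    """Symmetric uniform quantization: returns the integer tensor and its scale."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.clip(np.round(x / scale), -levels, levels).astype(np.int32), scale

def quantized_layer_forward(a_in, w_conv, b_conv, a_bits=8, w_bits=8):
    q_a, s_a = quantize(a_in, a_bits)       # quantize input feature A_in
    q_w, s_w = quantize(w_conv, w_bits)     # quantize weights W_conv
    q_b, s_b = quantize(b_conv, a_bits)     # quantize bias B_conv
    m_q = q_a @ q_w.T                       # integer accumulation -> M_q
    # M_q and the quantized bias have different quantization ranges:
    # dequantize both before adding, then apply the activation function
    a_out = np.maximum(m_q * (s_a * s_w) + q_b * s_b, 0.0)
    return a_out                            # final output feature value A_out

a = np.random.randn(4, 16); w = np.random.randn(32, 16); b = np.random.randn(32)
print(quantized_layer_forward(a, w, b).shape)   # (4, 32)
```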
Further, the parameter quantization of the network model adopts uniform quantization, in which neighbouring quantization points are equally spaced. Given a quantization bit width k, the quantization process can be expressed as:
x_q = Q_k(x_r, α)
where x_r is the tensor to be quantized, which may be a weight, bias or activation value; α is the scaling factor; q is the integer tensor that takes part in computation in the integer arithmetic unit; and x_q is the quantized parameter used for network training. The scaling factor α is critical for low-bit quantization. Q denotes the quantization function, clip is a truncation function, and round is a rounding function that returns the rounded value of a floating-point number.
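The definition of Q_k is not reproduced above; a common form of a uniform symmetric k-bit quantizer that matches these symbols, given purely as a hedged reconstruction, is:

```latex
% Hedged reconstruction (not copied from the patent)
q   \;=\; \mathrm{round}\!\Big(\mathrm{clip}(x_r,\,-\alpha,\,\alpha)\,\frac{2^{k-1}-1}{\alpha}\Big),
\qquad
x_q \;=\; Q_k(x_r,\alpha) \;=\; \frac{\alpha}{2^{k-1}-1}\, q
```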
Further, to address the difficulty of choosing the scaling factor α caused by the long-tail phenomenon in the weight distribution of convolutional neural networks, a learnable scaling factor α is introduced to realize a clamping function with a variable interval, and the quantization process is modified accordingly.
So that the scaling factor α can be updated during neural network training, the gradient of α is computed in the back-propagation process.
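Neither the modified quantizer nor the gradient formula is reproduced in this text. The form they commonly take in learnable-clipping schemes such as LSQ/PACT-style quantization-aware training, offered only as a hedged reconstruction, is:

```latex
% Hedged reconstruction (not copied from the patent); s = \alpha/(2^{k-1}-1) is the step size
x_q \;=\; s\,\mathrm{round}\!\big(\mathrm{clip}(x_r/s,\; -(2^{k-1}-1),\; 2^{k-1}-1)\big),
\qquad
\frac{\partial x_q}{\partial \alpha} \;\approx\;
\begin{cases}
(x_q - x_r)/\alpha, & |x_r| < \alpha\\[2pt]
\mathrm{sign}(x_r), & |x_r| \ge \alpha
\end{cases}
```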
further, in the forward reasoning stage, parameters of the batch normalization layer are fixed, and the following formula is obtained:
y=ξ (i) o+η (i)
where o is the output of the previous layer of convolution layer, the quantized convolution operation can be represented by:
o=α a q a α w q w
wherein alpha is a q a And alpha w q w Representing the quantized activation value and weight, respectively.
Further, the batch normalization layer and the convolution layer are merged, and the merged quantized convolution proceeds as follows: after merging the batch normalization layer with the previously quantized convolution layer, the merged output is quantized again. Low-bit integer arithmetic is adopted to simplify the subsequent convolution computation. In this operation, α_β is the channel-level scaling-factor tensor of β, and its initial value is the absolute maximum on each channel of β. At this point the scaling factors α of the weights, biases and activation values are still floating-point numbers, so fully integer operation cannot yet be realized.
Further, the scaling factors are shift-quantized:
a shift-quantized scaling factor can be applied with bit shift left or bit shift right operations in place of floating-point operations, and the quantized convolution is then computed accordingly.
The final convolution operation contains only integer multiply-add operations on the weight tensor and the activation tensor plus bit-shift operations for the scaling factors, with no floating-point operations at all. In this embodiment, two bits are used for the quantized weights, and eight bits are used for both the biases and the activation values. The quantized convolution operation reduces the required memory usage and bandwidth overhead and improves resource utilization.
Although the present invention has been described in terms of the preferred embodiments, it is not limited to those embodiments. Any person skilled in the art may make possible variations and modifications to the technical solution of the invention using the methods and technical content disclosed above without departing from the spirit and scope of the invention; therefore, any simple modification, equivalent variation or modification of the above embodiments made according to the technical substance of the invention falls within the protection scope of the technical solution of the invention.
What is not described in detail in the present specification belongs to the known technology of those skilled in the art.

Claims (13)

1. The deep learning network model optimization method based on parameter quantization is characterized by comprising the following steps of:
constructing a lightweight network model based on YOLOv3-tiny and training it to obtain preliminary floating-point network weight parameters, the lightweight network model comprising convolution layers, batch normalization layers, activation functions, max-pooling layers, an upsampling layer and a routing layer;
carrying out channel-level quantization on the designed lightweight network model;
retraining the obtained preliminary network, and quantizing the network weights in a stepwise manner during retraining.
2. The deep learning network model optimization method based on parameter quantization according to claim 1, wherein:
in the lightweight network model, the convolution layer is used to extract high-dimensional features from the input image; in its operation, w_n denotes the weights of the n-th layer, x_{n-1} the input feature values of the n-th layer, o_n the output of the n-th layer, K the width of the convolution kernel, and C_n the number of channels of the n-th layer's output feature values.
3. The deep learning network model optimization method based on parameter quantization according to claim 2, wherein:
in the batch normalization layer, x and y respectively denote the input and output, μ^(i) and σ^(i) are the mean and variance of the feature map of the i-th channel over a batch, γ^(i) and β^(i) are learnable channel-level parameters of the normalization layer, and ε is used to avoid data overflow.
4. A method for optimizing a deep learning network model based on parameter quantization according to claim 3, wherein:
the ReLU function is employed as the activation function between two convolution layers:
ReLU(x) = max(0, x).
5. the deep learning network model optimization method based on parameter quantization according to claim 4, wherein:
the max-pooling layer reduces the data dimension, reduces the amount of computation, enhances the invariance of image features and enlarges the receptive field; the upsampling layer restores image features to the input dimension to enable target position output; and the routing layer obtains multi-scale fused feature values from the output feature values of two cascaded convolution layers.
6. The deep learning network model optimization method based on parameter quantization according to claim 5, wherein:
the channel-level quantization of the designed lightweight network model is specifically as follows:
different quantization intervals are used to match quantization parameters to the different channels of each layer so as to improve the model accuracy of the lightweight network model; the channel-level quantization of channel j in layer i is specifically:
the distribution interval of the channel's parameters in that layer is recorded by max_ij and min_ij, and the long-tail weights are clipped to obtain the quantization range d_ij and mean m_ij of the parameters; the weights of channel j in the current layer i are recorded as w_ij, and the quantized weights wq_ij and the recovered weights wr_ij are computed according to d_ij, m_ij and the quantization bit number b;
the quantization parameters of every channel of every layer are traversed and matched, completing channel-level quantization.
7. The deep learning network model optimization method based on parameter quantization according to claim 6, wherein:
the stepwise quantization adopted in the retraining process is specifically as follows:
the lightweight network model after channel-level quantization is retrained for a preset number of iteration steps, and the weights of each layer are randomly quantized during retraining so as to eliminate the model's dependence on fixed features, until the lightweight network model converges, wherein:
the model retraining steps are:
during forward inference, a quantization range is selected from the parameter distribution; parameters exceeding the quantization range are clamped to the quantization range, and the full-precision weights are recorded as the basis for updating;
the scaling factor, the quantization range d_ij and the mean m_ij are updated according to the mean absolute error between the full-precision parameters and the quantized parameters;
during error back-propagation, each weight parameter is updated step by step in reverse according to the loss, under the loss function, between the target obtained by forward inference with the quantized parameters and the actual target;
through multiple rounds of forward inference and back-propagation over multi-step iterations, the network model's weight parameters are quantized step by step and the network's inference re-converges.
8. The method for optimizing the deep learning network model based on parameter quantization according to claim 7, wherein the method comprises the following steps:
during model retraining, each convolution layer is quantized; within each convolution layer the order in which weights are quantized is selected at random, achieving stepwise quantization of the weights, and the coexistence of quantized and full-precision parameters improves the learning capacity of the convolution layer.
9. The deep learning network model optimization method based on parameter quantization according to claim 8, wherein:
during model retraining, the batch normalization layer is fused into the convolution layer through a progressive fusion strategy: the batch normalization parameters are transferred to the preceding convolution layer while the mean and variance continue to be updated, and the convolution layer learns from the incoming mean and variance; in the fusion stage, updating of the mean and variance is stopped so as to eliminate the independent batch normalization parameters, completing the fusion of the batch normalization layer into the convolution layer and reducing the difficulty of deploying the deep learning network model on hardware.
10. The deep learning network model optimization method based on parameter quantization according to claim 8, wherein:
during model retraining, the quantization of a convolution layer is specifically as follows:
the input feature A_in, the weights W_conv and the bias B_conv are quantized; the quantized input is convolved with the quantized weights to obtain the quantized convolution output M_q; because M_q and the quantized bias have different quantization ranges, the bias and M_q are added after dequantization, and the sum is passed through the activation function to obtain the final output feature value A_out.
11. The deep learning network model optimization method based on parameter quantization according to claim 10, wherein:
the parameter quantization is specifically as follows:
uniform quantization with a preset quantization bit width k is adopted, in which neighbouring quantization points are equally spaced:
x_q = Q_k(x_r, α)
where x_r is the tensor to be quantized (weights, biases or activation values), α is the scaling factor, q is the integer tensor that takes part in computation in the integer arithmetic unit, x_q is the quantized parameter, Q denotes the quantization function, clip is a truncation function, and round is a rounding function returning the rounded value of a floating-point number;
the scaling factor is used to overcome the long-tail phenomenon in the weight distribution of the convolution layer and to realize quantization correction within the interval.
12. The method for optimizing a deep learning network model based on parameter quantization of claim 11, wherein the method comprises the steps of:
in the forward inference stage, the parameters of the batch normalization layer are fixed, giving:
y = ξ^(i) · o + η^(i)
where o is the output of the preceding convolution layer, and the quantized convolution operation is:
o = α_a q_a · α_w q_w
where α_a q_a and α_w q_w respectively denote the quantized activation values and weights;
the quantized convolution process after the batch normalization layer and the convolution layer are merged is as follows:
the batch normalization layer is merged with the previously quantized convolution layer, and the merged output is quantized again, where α_β is the channel-level scaling-factor tensor of β, its initial value being the absolute maximum on each channel of β; at this point the scaling factors α of the weights, biases and activation values are floating-point numbers, so fully integer operation cannot yet be realized;
the scaling factors are therefore shift-quantized:
a shift-quantized scaling factor can be applied with bit shifts to the left or right in place of floating-point operations, and the quantized convolution is then computed accordingly.
13. The method for optimizing the deep learning network model based on parameter quantization according to claim 7, wherein the method comprises the following steps:
the gradient in the back-propagation of the weight-parameter error during retraining is computed accordingly;
the stepwise quantization of the network weight parameters is completed by the above parameter quantization calculations during retraining, thereby realizing the optimization of the network model.
CN202310162619.1A 2023-02-24 2023-02-24 Deep learning network model optimization method based on parameter quantization Pending CN116524173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310162619.1A CN116524173A (en) 2023-02-24 2023-02-24 Deep learning network model optimization method based on parameter quantization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310162619.1A CN116524173A (en) 2023-02-24 2023-02-24 Deep learning network model optimization method based on parameter quantization

Publications (1)

Publication Number Publication Date
CN116524173A true CN116524173A (en) 2023-08-01

Family

ID=87390996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310162619.1A Pending CN116524173A (en) 2023-02-24 2023-02-24 Deep learning network model optimization method based on parameter quantization

Country Status (1)

Country Link
CN (1) CN116524173A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392613A (en) * 2023-12-07 2024-01-12 武汉纺织大学 Power operation safety monitoring method based on lightweight network
CN117392613B (en) * 2023-12-07 2024-03-08 武汉纺织大学 Power operation safety monitoring method based on lightweight network

Similar Documents

Publication Publication Date Title
CN111079781B (en) Lightweight convolutional neural network image recognition method based on low rank and sparse decomposition
CN108337000B (en) Automatic method for conversion to lower precision data formats
CN110555450B (en) Face recognition neural network adjusting method and device
CN110555508B (en) Artificial neural network adjusting method and device
CN110288086B (en) Winograd-based configurable convolution array accelerator structure
TW201918939A (en) Method and apparatus for learning low-precision neural network
CN110659725B (en) Neural network model compression and acceleration method, data processing method and device
CN111612147A (en) Quantization method of deep convolutional network
Meng et al. Two-bit networks for deep learning on resource-constrained embedded devices
CN109978135B (en) Quantization-based neural network compression method and system
Roy et al. Pruning filters while training for efficiently optimizing deep learning networks
Nazari et al. Tot-net: An endeavor toward optimizing ternary neural networks
CN111696149A (en) Quantization method for stereo matching algorithm based on CNN
CN112633477A (en) Quantitative neural network acceleration method based on field programmable array
CN116524173A (en) Deep learning network model optimization method based on parameter quantization
CN110874627B (en) Data processing method, data processing device and computer readable medium
KR20190130443A (en) Method and apparatus for quantization of neural network
CN113595993A (en) Vehicle-mounted sensing equipment joint learning method for model structure optimization under edge calculation
Liu et al. Computation-performance optimization of convolutional neural networks with redundant kernel removal
CN112686384A (en) Bit-width-adaptive neural network quantization method and device
Nazari et al. Multi-level binarized lstm in eeg classification for wearable devices
CN114970853A (en) Cross-range quantization convolutional neural network compression method
CN114943335A (en) Layer-by-layer optimization method of ternary neural network
Qi et al. Learning low resource consumption cnn through pruning and quantization
CN117151178A (en) FPGA-oriented CNN customized network quantification acceleration method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination