CN113269312B - Model compression method and system combining quantization and pruning search - Google Patents


Info

Publication number
CN113269312B
Authority
CN
China
Prior art keywords
pruning
quantization
model
search
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110620864.3A
Other languages
Chinese (zh)
Other versions
CN113269312A (en)
Inventor
郭锴凌
周欣欣
徐向民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110620864.3A priority Critical patent/CN113269312B/en
Publication of CN113269312A publication Critical patent/CN113269312A/en
Application granted
Publication of CN113269312B publication Critical patent/CN113269312B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The invention discloses a model compression method and system combining quantization and pruning search, relates to the field of deep learning, and addresses the problem of maintaining model accuracy during compression in the prior art. After the objects and ranges of the quantization and pruning search are set, search training is carried out on a convolutional neural network model, the weights and structural parameters of the model are optimized, and the optimized model is finally retrained. Quantization and pruning are combined simultaneously to compress the model effectively, improving the accuracy of the compressed model and retaining the advantages of both compression techniques.

Description

Model compression method and system combining quantization and pruning search
Technical Field
The invention relates to the field of deep learning, in particular to a model compression method and a model compression system for joint quantization and pruning search.
Background
Deep learning is widely adopted in many real-world applications, such as autonomous driving, robotics, and virtual reality. Under resource constraints (e.g., latency, model size, and energy consumption), carefully designing the network architecture to achieve optimal performance on the target hardware is critical to deep neural network research and deployment.
Network quantization and pruning play a crucial role on resource-limited platforms: low-bit quantization or reduction of the channel number can greatly reduce a network's computation and storage. However, designing an effective quantization or pruning scheme while maintaining relatively high model accuracy remains a practical difficulty.
Disclosure of Invention
The present invention aims to provide a model compression method and system for joint quantization and pruning search, so as to solve the problems existing in the prior art.
The invention relates to a model compression method combining quantization and pruning search, which comprises the following steps:
S1, inputting image data and hardware constraints;
S2, establishing a convolutional neural network model, and setting the objects and ranges of the quantization and pruning search;
S3, carrying out search training on the convolutional neural network model, and optimizing the weights and structural parameters of the model; the objects of the quantization search are the weight bit width of the convolutional layer and the activation-value bit width of the activation layer, and the object of the pruning search is the channel number of the convolutional layer;
S4, selecting the channel number with the maximum probability in the pruning search range and the bit width with the maximum probability in the quantization search range, and reconstructing the lightweight network; the model weights of the last search iteration are stored as initialization to retrain the optimized convolutional neural network model.
In step S1, the input image data is split into a training set, a validation set and a test set; the training set and the validation set are used for the alternating optimization of the convolutional neural network model in step S3.
In step S3, a computational-cost loss is added to the classification verification loss function during the model search;
the loss function is $L = l_c + \lambda_{cost}\, l_{cost}$, wherein $l_c$ is the cross-entropy loss, $l_{cost}$ is the total computational cost of the network, and $\lambda_{cost}$ is the weight of the computational cost.
Step S3 searches the selection space for both quantization and pruning.
Gumbel-softmax normalization is performed on the quantization and pruning selection weights, and an exponential decay of the normalization temperature is set, so that the selection probability matrix generated after the search finishes approximates one-hot.
The weights are normalized using the following equation:

$$\hat{p}_i = \frac{\exp\big((\log p_i + o_i)/\tau\big)}{\sum_{k=1}^{K}\exp\big((\log p_k + o_k)/\tau\big)}, \quad \text{s.t. } o_i = -\log(-\log u),\; u \sim U(0,1)$$

wherein $\hat{p}$ is the normalized selection weight vector, $K$ is the number of selections, $\tau$ is the decay temperature, $p$ is the original selection weight vector, $U(0,1)$ denotes the uniform distribution between 0 and 1, $o_i$ are the generated random numbers, and s.t. introduces the constraint that follows. The output function of the joint quantization and pruning optimization in step S3 is:
$$\text{out}(x) = \sum_{j=1}^{n_a} g_{aj}\,\alpha_j\!\left(\sum_{i=1}^{n_w} g_{wi}\sum_{k=1}^{n_c} g_{ck}\,\big(f_i(x)\odot t_k\big)\right)$$

wherein f is the convolutional layer ($f_i$ denoting its output with weights quantized to the i-th candidate bit width), $n_c$ is the number of selectable channel counts in the convolutional layer, $g_{ck}$ is the selection weight of the k-th channel count, $n_w$ is the number of selectable weight bit widths in the convolutional layer, $g_{wi}$ is the selection weight of the i-th convolutional-layer bit width, $n_a$ is the number of selectable activation-value bit widths in the activation layer, $g_{aj}$ is the selection weight of the j-th activation-layer bit width, $\alpha$ is the activation layer ($\alpha_j$ denoting the activation quantized to the j-th candidate bit width), and $t_k$ is a column vector whose length is the maximum channel number N: when the selected channel number is k, the first k elements are 1 and the remaining N−k elements are 0.
The model compression system combining quantization and pruning search provided by the invention optimizes the convolutional neural network by using the model compression method.
In the model compression method and system combining quantization and pruning search, quantization and pruning act simultaneously and jointly to compress the model effectively, improving the accuracy of the compressed model and combining the advantages of both compression techniques.
According to the hardware constraints, neural architecture search is used to search the pruned channel numbers and the quantization bit widths, yielding a lightweight convolutional neural network that meets the hardware requirements. The weights and structural parameters of the model are optimized alternately with a gradient strategy, saving a large amount of time and resources. By means of the Gumbel-softmax method with a suitable temperature schedule, the selection probability matrix generated after the search approximates one-hot, i.e. the maximum selection probability is close to 1, so the error of the search result selected by probability is small.
Drawings
FIG. 1 is a schematic flow chart of a model compression method according to the present invention.
FIG. 2 is a schematic diagram of the model compression method in the channel number search process.
FIG. 3 is a schematic diagram of a quantized bit width search process of the model compression method according to the present invention.
Detailed Description
Quantization and pruning can each compress a model effectively, but applying them sequentially and independently yields a sub-optimal solution. The invention therefore performs pruning and quantization simultaneously in a joint manner, which improves the accuracy of the compressed model and combines the advantages of both compression techniques. A lightweight network meeting the resource requirements of the hardware platform is thus searched automatically. As shown in FIGS. 1-3, the model compression method combining quantization and pruning search comprises the following specific steps:
S1, the original data set is split into a training set, a validation set and a test set; the images are preprocessed by padding, cropping, flipping and normalization, and the model weights and the structural parameters of the model are trained alternately on the training set and the validation set.
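For illustration, the sketch below (not part of the claimed method) builds such a preprocessing pipeline in PyTorch and produces the three subsets; the dataset, split sizes and normalization statistics are assumptions chosen for the example.

```python
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import random_split

# Padding, random cropping, flipping and normalization, as listed in S1.
# CIFAR-10 and its statistics are illustrative choices.
train_tf = T.Compose([
    T.Pad(4),
    T.RandomCrop(32),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

full_train = torchvision.datasets.CIFAR10("data", train=True, download=True,
                                          transform=train_tf)
# Training set (used to update the model weights) and validation set
# (used to update the structural parameters) for the alternating steps.
train_set, val_set = random_split(full_train, [40_000, 10_000],
                                  generator=torch.Generator().manual_seed(0))
test_set = torchvision.datasets.CIFAR10("data", train=False,
                                        transform=T.ToTensor())
```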
S2, a convolutional neural network model is established; the bit widths of the convolutional-layer weights and of the activation values are the objects of the quantization search, and the channel number of the convolutional layers is the object of the pruning search.
S3, search training is carried out on the convolutional neural network using a neural architecture search with a gradient-based search strategy, and the weights and structural parameters of the network are optimized.
Specifically, the method comprises the following steps:
A Gumbel-softmax normalization operation is performed on the selection weights of quantization and pruning respectively, so that the probabilities within each search range sum to 1. The temperature τ is set to decay from a large value to a value close to 0, for example exponentially from 10 to 0.01, so that a matrix close to one-hot is obtained when the search finishes. Let the original selection weight vector be p and the number of selections be K; the normalized output is:

$$\hat{p}_i = \frac{\exp\big((\log p_i + o_i)/\tau\big)}{\sum_{k=1}^{K}\exp\big((\log p_k + o_k)/\tau\big)}, \quad \text{s.t. } o_i = -\log(-\log u),\; u \sim U(0,1)$$
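A minimal PyTorch sketch of this normalization and the temperature schedule; only the formula above and the 10-to-0.01 decay endpoints come from the text, the function name and candidate values are illustrative.

```python
import torch

def gumbel_softmax_weights(p: torch.Tensor, tau: float) -> torch.Tensor:
    """Normalize an original selection weight vector p (positive entries,
    length K) with Gumbel noise at temperature tau; the result sums to 1
    and approaches a one-hot vector as tau decays toward 0."""
    u = torch.rand_like(p).clamp(1e-9, 1 - 1e-9)  # u ~ U(0, 1)
    o = -torch.log(-torch.log(u))                 # o_i = -log(-log u)
    return torch.softmax((torch.log(p) + o) / tau, dim=-1)

# Exponential decay of the temperature from 10 to 0.01 over the search.
num_epochs = 100
for epoch in range(num_epochs):
    tau = 10.0 * (0.01 / 10.0) ** (epoch / (num_epochs - 1))
    g = gumbel_softmax_weights(torch.tensor([1.0, 2.0, 3.0, 4.0]), tau)
```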
the purpose of network channel pruning is to reduce the number of channels per layer in the network. Let the convolution layer be f: has ncSelecting the number of channels for searching; defining t as a column vector, the length of which is the maximum number of channels N, and the number of the selected channels is k, then k length elements at the front end of the column vector are 1, and N-k length elements at the rear end are 0. Through weight sharing, given input x, normalization is carried out on different selected weights by using the gumbel-softmax, and the selection weight of the number of channels is gckThe output is as follows:
Figure BDA0003099818440000032
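A sketch of this weight-shared channel mixing, under the assumption that the layer is evaluated once at the maximum width and each candidate k merely masks the trailing channels; names and candidate lists are illustrative.

```python
import torch

def pruning_search_output(feat: torch.Tensor, g_c: torch.Tensor,
                          channel_options: list[int]) -> torch.Tensor:
    """Mix one shared feature map f(x) of shape (N, C_max, H, W) over the
    candidate channel counts: candidate k keeps the first k channels via
    the mask t_k, and the masked copies are weighted by g_c (formula 1)."""
    c_max = feat.shape[1]
    out = torch.zeros_like(feat)
    for g, k in zip(g_c, channel_options):
        t = feat.new_zeros(c_max)
        t[:k] = 1.0                              # t_k: k ones, C_max - k zeros
        out = out + g * feat * t.view(1, -1, 1, 1)
    return out

feat = torch.randn(2, 32, 8, 8)                  # shared conv output f(x)
g_c = torch.softmax(torch.randn(3), -1)          # selection weights g_ck
y = pruning_search_output(feat, g_c, [8, 16, 32])
```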
the purpose of the quantized bit width search is to replace the original 32-bit sized parameters with low bit width parameters. Let the convolutional layer be f: has nwSelecting bit width of each activation value, setting any activation layer to be alpha, and having naSelecting bit width of each activation value, giving input x, normalizing weights of different selections by using a Gumbel-softmax, and selecting the bit width of the convolutional layer with the weight gwiThe selection weight of the bit width of the active layer is gajThe output is as follows:
Figure BDA0003099818440000033
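A sketch of formula (2), with an illustrative symmetric uniform quantizer standing in for the quantization function, which the text does not specify.

```python
import torch
import torch.nn.functional as F

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Illustrative symmetric uniform quantizer; an assumption, not the
    particular quantization function of the invention."""
    q = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / q
    return torch.clamp(torch.round(x / scale), -q, q) * scale

def quant_search_output(x, weight, w_bits, a_bits, g_w, g_a):
    """Mix conv outputs over candidate weight bit widths (weights g_w),
    apply the activation, then mix over candidate activation bit widths
    (weights g_a), as in formula (2)."""
    y = sum(g * F.conv2d(x, fake_quant(weight, b), padding=1)
            for g, b in zip(g_w, w_bits))
    a = F.relu(y)                                # activation layer alpha
    return sum(g * fake_quant(a, b) for g, b in zip(g_a, a_bits))

x = torch.randn(2, 16, 8, 8)
w = torch.randn(32, 16, 3, 3)
g_w = torch.softmax(torch.randn(3), -1)          # g_wi
g_a = torch.softmax(torch.randn(3), -1)          # g_aj
y = quant_search_output(x, w, [2, 4, 8], [2, 4, 8], g_w, g_a)
```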
combining the formula (1) and the formula (2), performing combined optimization on quantization and pruning, searching selection spaces of quantization and pruning simultaneously, and outputting the following results:
Figure BDA0003099818440000041
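Combining the two sketches above gives an illustrative implementation of formula (3): channel masking happens inside the weight-bit mixing, and the activation-bit mixing wraps the result.

```python
import torch
import torch.nn.functional as F

def fake_quant(x, bits):                         # same illustrative quantizer
    q = 2 ** (bits - 1) - 1
    s = x.detach().abs().max().clamp(min=1e-8) / q
    return torch.clamp(torch.round(x / s), -q, q) * s

def joint_search_output(x, weight, w_bits, a_bits, channels, g_w, g_a, g_c):
    """Formula (3): inner double sum over weight bit widths and channel
    counts, outer sum over activation bit widths."""
    inner = 0.0
    for gw, wb in zip(g_w, w_bits):
        f_i = F.conv2d(x, fake_quant(weight, wb), padding=1)   # f_i(x)
        for gc, k in zip(g_c, channels):
            t = f_i.new_zeros(f_i.shape[1])
            t[:k] = 1.0                                        # mask t_k
            inner = inner + gw * gc * (f_i * t.view(1, -1, 1, 1))
    a = F.relu(inner)                                          # activation
    return sum(ga * fake_quant(a, ab) for ga, ab in zip(g_a, a_bits))

x = torch.randn(2, 16, 8, 8)
w = torch.randn(32, 16, 3, 3)
g_w, g_a, g_c = (torch.softmax(torch.randn(3), -1) for _ in range(3))
y = joint_search_output(x, w, [2, 4, 8], [2, 4, 8], [8, 16, 32], g_w, g_a, g_c)
```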
determining a loss function: due to the fact thatThe searched model is adapted to the resource constraints of different hardware platforms, so the loss of computational cost is added in the classification verification loss function. Describing the calculation cost of a single network according to the number of floating point operations of the filter, and calculating the weighted sum of all candidate network costs to obtain the total calculation cost l of the networkcost,λcostIs a weight of the computational cost. The loss function is as follows: l ═ LccostlcostWherein l iscRepresenting the cross-entropy loss of the search network structure.
S4, after the model search finishes, the channel number with the maximum probability in the pruning search range and the bit width with the maximum probability in the quantization search range are selected, and the lightweight network is reconstructed. The model weights of the last search iteration are stored as initialization for retraining, finally yielding a compressed model that meets the hardware constraints.
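An illustrative selection step: take the arg-max candidate of every search range and rebuild the network from those choices (keys and candidate lists are assumptions).

```python
import torch

def pick_architecture(selection_probs: dict) -> dict:
    """Keep the maximum-probability candidate of every search range:
    a channel count per convolutional layer, a bit width per quantized
    weight or activation tensor."""
    return {name: opts[int(torch.argmax(p))]
            for name, (p, opts) in selection_probs.items()}

selection_probs = {
    "conv1.channels": (torch.tensor([0.05, 0.15, 0.80]), [8, 16, 32]),
    "conv1.w_bits":   (torch.tensor([0.10, 0.85, 0.05]), [2, 4, 8]),
    "act1.a_bits":    (torch.tensor([0.07, 0.13, 0.80]), [2, 4, 8]),
}
arch = pick_architecture(selection_probs)
# e.g. {'conv1.channels': 32, 'conv1.w_bits': 4, 'act1.a_bits': 8}
# The lightweight network is rebuilt with these choices and retrained from
# the weights saved at the last search iteration.
```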
The model compression system combining quantization and pruning search provided by the invention optimizes the convolutional neural network by using the model compression method.
It will be apparent to those skilled in the art that various other changes and modifications may be made in the above-described embodiments and concepts and all such changes and modifications are intended to be within the scope of the appended claims.

Claims (7)

1. A model compression method for joint quantization and pruning search, characterized by comprising the following steps:
S1, inputting image data and hardware constraints;
S2, establishing a convolutional neural network model, and setting the objects and ranges of the quantization and pruning search;
S3, carrying out search training on the convolutional neural network model, and optimizing the weights and structural parameters of the model; the objects of the quantization search are the weight bit width of the convolutional layer and the activation-value bit width of the activation layer, and the object of the pruning search is the channel number of the convolutional layer;
S4, selecting the channel number with the maximum probability in the pruning search range and the bit width with the maximum probability in the quantization search range, and reconstructing the lightweight network; storing the model weights of the last search iteration as initialization to retrain the optimized convolutional neural network model;
the output function of the joint quantization and pruning optimization in step S3 being:

$$\text{out}(x) = \sum_{j=1}^{n_a} g_{aj}\,\alpha_j\!\left(\sum_{i=1}^{n_w} g_{wi}\sum_{k=1}^{n_c} g_{ck}\,\big(f_i(x)\odot t_k\big)\right)$$

wherein f is the convolutional layer, $n_c$ is the number of selectable channel counts in the convolutional layer during pruning, $g_{ck}$ is the selection weight of the k-th channel count, $n_w$ is the number of selectable weight bit widths in the convolutional layer, $g_{wi}$ is the selection weight of the i-th convolutional-layer bit width, $n_a$ is the number of selectable activation-value bit widths in the activation layer, $g_{aj}$ is the selection weight of the j-th activation-layer bit width, $\alpha$ is the activation layer, $t_k$ is a column vector whose length is the maximum channel number N, the first k elements of which are 1 and the remaining N−k elements of which are 0 when the selected channel number is k, and x is the given input.
2. The model compression method for joint quantization and pruning search according to claim 1, wherein the input image data is segmented into a training set, a validation set and a test set in step S1; wherein the training set and the validation set are used for the alternating optimization of the convolutional neural network model in step S3.
3. The model compression method for joint quantization and pruning search according to claim 2, wherein in step S3, a loss of computational cost is added to the classification verification loss function during model search;
the loss function is $L = l_c + \lambda_{cost}\, l_{cost}$, wherein $l_c$ is the cross-entropy loss, $l_{cost}$ is the total computational cost of the network, and $\lambda_{cost}$ is the weight of the computational cost.
4. The model compression method for joint quantization and pruning search according to claim 1, wherein the step S3 searches the selection spaces for quantization and pruning simultaneously.
5. The model compression method for joint quantization and pruning search according to claim 1, wherein the quantization and pruning selected weights are normalized by a gumbel-softmax, and a normalized temperature exponential decay is set such that the compressed selection probability matrix generated after the search is completed approximates to one-hot.
6. The method of model compression for joint quantization and pruning search of claim 5, wherein the selection weights are normalized using the following equation:
$$\hat{p}_i = \frac{\exp\big((\log p_i + o_i)/\tau\big)}{\sum_{k=1}^{K}\exp\big((\log p_k + o_k)/\tau\big)}, \quad \text{s.t. } o_i = -\log(-\log u),\; u \sim U(0,1)$$

wherein $\hat{p}$ is the normalized selection weight vector, $K$ is the number of selections, $\tau$ is the decay temperature, $p$ is the original selection weight vector, $U(0,1)$ denotes the uniform distribution between 0 and 1, $o_i$ are the generated random numbers, and s.t. introduces the constraint that follows.
7. A model compression system combining quantization and pruning search, characterized in that the optimization of the convolutional neural network is performed using the model compression method as claimed in any one of claims 1 to 6.
CN202110620864.3A 2021-06-03 2021-06-03 Model compression method and system combining quantization and pruning search Active CN113269312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110620864.3A CN113269312B (en) 2021-06-03 2021-06-03 Model compression method and system combining quantization and pruning search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110620864.3A CN113269312B (en) 2021-06-03 2021-06-03 Model compression method and system combining quantization and pruning search

Publications (2)

Publication Number Publication Date
CN113269312A CN113269312A (en) 2021-08-17
CN113269312B true CN113269312B (en) 2021-11-09

Family

ID=77234203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110620864.3A Active CN113269312B (en) 2021-06-03 2021-06-03 Model compression method and system combining quantization and pruning search

Country Status (1)

Country Link
CN (1) CN113269312B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947185B (en) * 2021-09-30 2022-11-18 北京达佳互联信息技术有限公司 Task processing network generation method, task processing device, electronic equipment and storage medium
CN114418086B (en) 2021-12-02 2023-02-28 北京百度网讯科技有限公司 Method and device for compressing neural network model
CN117036911A (en) * 2023-10-10 2023-11-10 华侨大学 Vehicle re-identification light-weight method and system based on neural architecture search

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222820A (en) * 2019-05-28 2019-09-10 东南大学 Convolutional neural networks compression method based on weight beta pruning and quantization
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN111275190A (en) * 2020-02-25 2020-06-12 北京百度网讯科技有限公司 Neural network model compression method and device, image processing method and processor
CN111652366A (en) * 2020-05-09 2020-09-11 哈尔滨工业大学 Combined neural network model compression method based on channel pruning and quantitative training
CN111931906A (en) * 2020-07-14 2020-11-13 北京理工大学 Deep neural network mixing precision quantification method based on structure search
CN112686382A (en) * 2020-12-30 2021-04-20 中山大学 Convolution model lightweight method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256634A (en) * 2018-02-08 2018-07-06 杭州电子科技大学 A kind of ship target detection method based on lightweight deep neural network
US11556796B2 (en) * 2019-03-25 2023-01-17 Nokia Technologies Oy Compressing weight updates for decoder-side neural networks
CN110210618A (en) * 2019-05-22 2019-09-06 东南大学 The compression method that dynamic trimming deep neural network weight and weight are shared
US20210089922A1 (en) * 2019-09-24 2021-03-25 Qualcomm Incorporated Joint pruning and quantization scheme for deep neural networks
CN111667054B (en) * 2020-06-05 2023-09-01 北京百度网讯科技有限公司 Method, device, electronic equipment and storage medium for generating neural network model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222820A (en) * 2019-05-28 2019-09-10 东南大学 Convolutional neural networks compression method based on weight beta pruning and quantization
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN111275190A (en) * 2020-02-25 2020-06-12 北京百度网讯科技有限公司 Neural network model compression method and device, image processing method and processor
CN111652366A (en) * 2020-05-09 2020-09-11 哈尔滨工业大学 Combined neural network model compression method based on channel pruning and quantitative training
CN111931906A (en) * 2020-07-14 2020-11-13 北京理工大学 Deep neural network mixing precision quantification method based on structure search
CN112686382A (en) * 2020-12-30 2021-04-20 中山大学 Convolution model lightweight method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Automated Model Compression by Jointly Applied Pruning and Quantization; Wenting Tang et al.; arXiv; 2020-11-12; Introduction (pp. 1-2), Methodology (p. 3), Layer Controller (p. 4), Experimental Results (p. 5), Algorithm 1, FIGS. 2-3 *
Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities;Lizeth Gonzalez-Carabarin 等;《arXiv》;20210526;1-9 *
A DNN model compression algorithm fusing model pruning and low-precision quantization; Wu Jin et al.; Telecommunication Engineering; 2020-06-30; Vol. 60, No. 6; 617-624 *

Also Published As

Publication number Publication date
CN113269312A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN113269312B (en) Model compression method and system combining quantization and pruning search
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
CN107239825B (en) Deep neural network compression method considering load balance
CN109948029B (en) Neural network self-adaptive depth Hash image searching method
CN107239829B (en) Method for optimizing artificial neural network
Sun et al. VAQF: fully automatic software-hardware co-design framework for low-bit vision transformer
CN107688855B (en) Hierarchical quantization method and device for complex neural network
CN110909667B (en) Lightweight design method for multi-angle SAR target recognition network
US20180260711A1 (en) Calculating device and method for a sparsely connected artificial neural network
CN109002889B (en) Adaptive iterative convolution neural network model compression method
CN110413255B (en) Artificial neural network adjusting method and device
CN111507521A (en) Method and device for predicting power load of transformer area
CN111898689A (en) Image classification method based on neural network architecture search
CN110414630A (en) The training method of neural network, the accelerated method of convolutional calculation, device and equipment
CN111723914A (en) Neural network architecture searching method based on convolution kernel prediction
CN113222138A (en) Convolutional neural network compression method combining layer pruning and channel pruning
CN111696149A (en) Quantization method for stereo matching algorithm based on CNN
CN114970853A (en) Cross-range quantization convolutional neural network compression method
CN114943335A (en) Layer-by-layer optimization method of ternary neural network
Qi et al. Learning low resource consumption cnn through pruning and quantization
CN113792621A (en) Target detection accelerator design method based on FPGA
CN116777842A (en) Light texture surface defect detection method and system based on deep learning
Wang et al. Structured feature sparsity training for convolutional neural network compression
CN116363423A (en) Knowledge distillation method, device and storage medium for small sample learning
CN114419361A (en) Neural network image classification method based on gated local channel attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant