CN113269312B - Model compression method and system combining quantization and pruning search - Google Patents


Info

Publication number
CN113269312B
Authority
CN
China
Prior art keywords
pruning
quantization
model
search
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110620864.3A
Other languages
Chinese (zh)
Other versions
CN113269312A (en)
Inventor
郭锴凌
周欣欣
徐向民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110620864.3A priority Critical patent/CN113269312B/en
Publication of CN113269312A publication Critical patent/CN113269312A/en
Application granted
Publication of CN113269312B publication Critical patent/CN113269312B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The invention discloses a model compression method and system combining quantization and pruning search, relates to the field of deep learning, and addresses the problem of maintaining model accuracy during compression in the prior art. After the objects and ranges of the quantization and pruning search are set, search training is carried out on a convolutional neural network model, the weights and structural parameters of the model are optimized, and the optimized model is finally retrained. Quantization and pruning are combined simultaneously to compress the model effectively, improving the accuracy of the compressed model and retaining the advantages of both compression techniques.

Description

Model compression method and system combining quantization and pruning search
Technical Field
The invention relates to the field of deep learning, in particular to a model compression method and a model compression system for joint quantization and pruning search.
Background
Deep learning is widely adopted in many real-world applications, such as autonomous driving, robotics, and virtual reality. Under resource constraints (e.g., latency, model size, and energy consumption), carefully designing the network architecture to achieve optimal performance on the target hardware is critical to deep neural network research and deployment.
Network quantization and pruning play a crucial role on resource-limited platforms: low-bit quantization or reduction of the channel number can greatly reduce a network's computation and storage. However, designing an effective quantization or pruning scheme while maintaining relatively high model accuracy remains a practical difficulty.
Disclosure of Invention
The present invention aims to provide a model compression method and system for joint quantization and pruning search, so as to solve the problems existing in the prior art.
The invention relates to a model compression method combining quantization and pruning search, which comprises the following steps:
S1, inputting image data and hardware constraints;
S2, establishing a convolutional neural network model, and setting the objects and ranges of the quantization and pruning search;
S3, carrying out search training on the convolutional neural network model, and optimizing the weights and structural parameters of the model; the objects of the quantization search are the weight bit width of the convolutional layer and the activation-value bit width of the activation layer, and the object of the pruning search is the channel number of the convolutional layer;
S4, selecting the channel number with the maximum probability in the pruning search range and the bit width with the maximum probability in the quantization search range, and reconstructing the lightweight network; the model weights of the last search iteration are stored as initialization to retrain the optimized convolutional neural network model.
In step S1, the input image data is split into a training set, a validation set and a test set; the training set and the validation set are used for the alternating optimization of the convolutional neural network model in step S3.
In step S3, a computational-cost loss is added to the classification verification loss function during the model search;
the loss function is $L = l_c + \lambda_{cost}\, l_{cost}$, wherein $l_c$ is the cross-entropy loss, $l_{cost}$ is the total computational cost of the network, and $\lambda_{cost}$ is the weight of the computational cost.
Step S3 searches the selection space for both quantization and pruning.
Gumbel-softmax normalization is performed on the quantization and pruning selection weights, and an exponential decay of the normalization temperature is set, so that the selection probability matrix generated after the search finishes approximates one-hot.
The weights are normalized using the following equation:

$$\hat{p}_i = \frac{\exp\big((\log p_i + o_i)/\tau\big)}{\sum_{k=1}^{K}\exp\big((\log p_k + o_k)/\tau\big)}, \quad \text{s.t. } o_i = -\log(-\log u),\; u \sim U(0,1)$$

wherein $\hat{p}$ is the normalized selection weight vector, $K$ is the number of selections, $\tau$ is the decay temperature, $p$ is the original selection weight vector, $U(0,1)$ denotes the uniform distribution between 0 and 1, $o_i$ are the generated random numbers, and s.t. introduces the constraint that follows. The output function of the joint quantization and pruning optimization in step S3 is:
$$\text{out}(x) = \sum_{j=1}^{n_a} g_{aj}\,\alpha_j\!\left(\sum_{i=1}^{n_w} g_{wi}\sum_{k=1}^{n_c} g_{ck}\,\big(f_i(x)\odot t_k\big)\right)$$

wherein f is the convolutional layer ($f_i$ denoting its output with weights quantized to the i-th candidate bit width), $n_c$ is the number of selectable channel counts in the convolutional layer, $g_{ck}$ is the selection weight of the k-th channel count, $n_w$ is the number of selectable weight bit widths in the convolutional layer, $g_{wi}$ is the selection weight of the i-th convolutional-layer bit width, $n_a$ is the number of selectable activation-value bit widths in the activation layer, $g_{aj}$ is the selection weight of the j-th activation-layer bit width, $\alpha$ is the activation layer ($\alpha_j$ denoting the activation quantized to the j-th candidate bit width), and $t_k$ is a column vector whose length is the maximum channel number N: when the selected channel number is k, the first k elements are 1 and the remaining N−k elements are 0.
The model compression system combining quantization and pruning search provided by the invention optimizes the convolutional neural network by using the model compression method.
In the model compression method and system combining quantization and pruning search, quantization and pruning act simultaneously and jointly to compress the model effectively, improving the accuracy of the compressed model and combining the advantages of both compression techniques.
According to the hardware constraints, neural architecture search is used to search the pruned channel numbers and the quantization bit widths, yielding a lightweight convolutional neural network that meets the hardware requirements. The weights and structural parameters of the model are optimized alternately with a gradient strategy, saving a large amount of time and resources. By means of the Gumbel-softmax method with a suitable temperature schedule, the selection probability matrix generated after the search approximates one-hot, i.e. the maximum selection probability is close to 1, so the error of the search result selected by probability is small.
Drawings
FIG. 1 is a schematic flow chart of a model compression method according to the present invention.
FIG. 2 is a schematic diagram of the model compression method in the channel number search process.
FIG. 3 is a schematic diagram of a quantized bit width search process of the model compression method according to the present invention.
Detailed Description
Quantization and pruning can each compress a model effectively, but applying them sequentially and independently yields a sub-optimal solution. The invention therefore performs pruning and quantization simultaneously in a joint manner, which improves the accuracy of the compressed model and combines the advantages of both compression techniques. A lightweight network meeting the resource requirements of the hardware platform is thus searched automatically. As shown in FIGS. 1-3, the model compression method combining quantization and pruning search comprises the following specific steps:
S1, the original data set is split into a training set, a validation set and a test set; the images are preprocessed by padding, cropping, flipping and normalization, and the model weights and the structural parameters of the model are trained alternately on the training set and the validation set.
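For illustration, the sketch below (not part of the claimed method) builds such a preprocessing pipeline in PyTorch and produces the three subsets; the dataset, split sizes and normalization statistics are assumptions chosen for the example.

```python
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import random_split

# Padding, random cropping, flipping and normalization, as listed in S1.
# CIFAR-10 and its statistics are illustrative choices.
train_tf = T.Compose([
    T.Pad(4),
    T.RandomCrop(32),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

full_train = torchvision.datasets.CIFAR10("data", train=True, download=True,
                                          transform=train_tf)
# Training set (used to update the model weights) and validation set
# (used to update the structural parameters) for the alternating steps.
train_set, val_set = random_split(full_train, [40_000, 10_000],
                                  generator=torch.Generator().manual_seed(0))
test_set = torchvision.datasets.CIFAR10("data", train=False,
                                        transform=T.ToTensor())
```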
S2, a convolutional neural network model is established; the bit widths of the convolutional-layer weights and of the activation values are the objects of the quantization search, and the channel number of the convolutional layers is the object of the pruning search.
S3, search training is carried out on the convolutional neural network using a neural architecture search with a gradient-based search strategy, and the weights and structural parameters of the network are optimized.
Specifically, the method comprises the following steps:
A Gumbel-softmax normalization operation is performed on the selection weights of quantization and pruning respectively, so that the probabilities within each search range sum to 1. The temperature τ is set to decay from a large value to a value close to 0, for example exponentially from 10 to 0.01, so that a matrix close to one-hot is obtained when the search finishes. Let the original selection weight vector be p and the number of selections be K; the normalized output is:

$$\hat{p}_i = \frac{\exp\big((\log p_i + o_i)/\tau\big)}{\sum_{k=1}^{K}\exp\big((\log p_k + o_k)/\tau\big)}, \quad \text{s.t. } o_i = -\log(-\log u),\; u \sim U(0,1)$$
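A minimal PyTorch sketch of this normalization and the temperature schedule; only the formula above and the 10-to-0.01 decay endpoints come from the text, the function name and candidate values are illustrative.

```python
import torch

def gumbel_softmax_weights(p: torch.Tensor, tau: float) -> torch.Tensor:
    """Normalize an original selection weight vector p (positive entries,
    length K) with Gumbel noise at temperature tau; the result sums to 1
    and approaches a one-hot vector as tau decays toward 0."""
    u = torch.rand_like(p).clamp(1e-9, 1 - 1e-9)  # u ~ U(0, 1)
    o = -torch.log(-torch.log(u))                 # o_i = -log(-log u)
    return torch.softmax((torch.log(p) + o) / tau, dim=-1)

# Exponential decay of the temperature from 10 to 0.01 over the search.
num_epochs = 100
for epoch in range(num_epochs):
    tau = 10.0 * (0.01 / 10.0) ** (epoch / (num_epochs - 1))
    g = gumbel_softmax_weights(torch.tensor([1.0, 2.0, 3.0, 4.0]), tau)
```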
the purpose of network channel pruning is to reduce the number of channels per layer in the network. Let the convolution layer be f: has ncSelecting the number of channels for searching; defining t as a column vector, the length of which is the maximum number of channels N, and the number of the selected channels is k, then k length elements at the front end of the column vector are 1, and N-k length elements at the rear end are 0. Through weight sharing, given input x, normalization is carried out on different selected weights by using the gumbel-softmax, and the selection weight of the number of channels is gckThe output is as follows:
Figure BDA0003099818440000032
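A sketch of this weight-shared channel mixing, under the assumption that the layer is evaluated once at the maximum width and each candidate k merely masks the trailing channels; names and candidate lists are illustrative.

```python
import torch

def pruning_search_output(feat: torch.Tensor, g_c: torch.Tensor,
                          channel_options: list[int]) -> torch.Tensor:
    """Mix one shared feature map f(x) of shape (N, C_max, H, W) over the
    candidate channel counts: candidate k keeps the first k channels via
    the mask t_k, and the masked copies are weighted by g_c (formula 1)."""
    c_max = feat.shape[1]
    out = torch.zeros_like(feat)
    for g, k in zip(g_c, channel_options):
        t = feat.new_zeros(c_max)
        t[:k] = 1.0                              # t_k: k ones, C_max - k zeros
        out = out + g * feat * t.view(1, -1, 1, 1)
    return out

feat = torch.randn(2, 32, 8, 8)                  # shared conv output f(x)
g_c = torch.softmax(torch.randn(3), -1)          # selection weights g_ck
y = pruning_search_output(feat, g_c, [8, 16, 32])
```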
the purpose of the quantized bit width search is to replace the original 32-bit sized parameters with low bit width parameters. Let the convolutional layer be f: has nwSelecting bit width of each activation value, setting any activation layer to be alpha, and having naSelecting bit width of each activation value, giving input x, normalizing weights of different selections by using a Gumbel-softmax, and selecting the bit width of the convolutional layer with the weight gwiThe selection weight of the bit width of the active layer is gajThe output is as follows:
Figure BDA0003099818440000033
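A sketch of formula (2), with an illustrative symmetric uniform quantizer standing in for the quantization function, which the text does not specify.

```python
import torch
import torch.nn.functional as F

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Illustrative symmetric uniform quantizer; an assumption, not the
    particular quantization function of the invention."""
    q = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / q
    return torch.clamp(torch.round(x / scale), -q, q) * scale

def quant_search_output(x, weight, w_bits, a_bits, g_w, g_a):
    """Mix conv outputs over candidate weight bit widths (weights g_w),
    apply the activation, then mix over candidate activation bit widths
    (weights g_a), as in formula (2)."""
    y = sum(g * F.conv2d(x, fake_quant(weight, b), padding=1)
            for g, b in zip(g_w, w_bits))
    a = F.relu(y)                                # activation layer alpha
    return sum(g * fake_quant(a, b) for g, b in zip(g_a, a_bits))

x = torch.randn(2, 16, 8, 8)
w = torch.randn(32, 16, 3, 3)
g_w = torch.softmax(torch.randn(3), -1)          # g_wi
g_a = torch.softmax(torch.randn(3), -1)          # g_aj
y = quant_search_output(x, w, [2, 4, 8], [2, 4, 8], g_w, g_a)
```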
combining the formula (1) and the formula (2), performing combined optimization on quantization and pruning, searching selection spaces of quantization and pruning simultaneously, and outputting the following results:
Figure BDA0003099818440000041
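Combining the two sketches above gives an illustrative implementation of formula (3): channel masking happens inside the weight-bit mixing, and the activation-bit mixing wraps the result.

```python
import torch
import torch.nn.functional as F

def fake_quant(x, bits):                         # same illustrative quantizer
    q = 2 ** (bits - 1) - 1
    s = x.detach().abs().max().clamp(min=1e-8) / q
    return torch.clamp(torch.round(x / s), -q, q) * s

def joint_search_output(x, weight, w_bits, a_bits, channels, g_w, g_a, g_c):
    """Formula (3): inner double sum over weight bit widths and channel
    counts, outer sum over activation bit widths."""
    inner = 0.0
    for gw, wb in zip(g_w, w_bits):
        f_i = F.conv2d(x, fake_quant(weight, wb), padding=1)   # f_i(x)
        for gc, k in zip(g_c, channels):
            t = f_i.new_zeros(f_i.shape[1])
            t[:k] = 1.0                                        # mask t_k
            inner = inner + gw * gc * (f_i * t.view(1, -1, 1, 1))
    a = F.relu(inner)                                          # activation
    return sum(ga * fake_quant(a, ab) for ga, ab in zip(g_a, a_bits))

x = torch.randn(2, 16, 8, 8)
w = torch.randn(32, 16, 3, 3)
g_w, g_a, g_c = (torch.softmax(torch.randn(3), -1) for _ in range(3))
y = joint_search_output(x, w, [2, 4, 8], [2, 4, 8], [8, 16, 32], g_w, g_a, g_c)
```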
determining a loss function: due to the fact thatThe searched model is adapted to the resource constraints of different hardware platforms, so the loss of computational cost is added in the classification verification loss function. Describing the calculation cost of a single network according to the number of floating point operations of the filter, and calculating the weighted sum of all candidate network costs to obtain the total calculation cost l of the networkcost,λcostIs a weight of the computational cost. The loss function is as follows: l ═ LccostlcostWherein l iscRepresenting the cross-entropy loss of the search network structure.
S4, after the model search finishes, the channel number with the maximum probability in the pruning search range and the bit width with the maximum probability in the quantization search range are selected, and the lightweight network is reconstructed. The model weights of the last search iteration are stored as initialization for retraining, finally yielding a compressed model that meets the hardware constraints.
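An illustrative selection step: take the arg-max candidate of every search range and rebuild the network from those choices (keys and candidate lists are assumptions).

```python
import torch

def pick_architecture(selection_probs: dict) -> dict:
    """Keep the maximum-probability candidate of every search range:
    a channel count per convolutional layer, a bit width per quantized
    weight or activation tensor."""
    return {name: opts[int(torch.argmax(p))]
            for name, (p, opts) in selection_probs.items()}

selection_probs = {
    "conv1.channels": (torch.tensor([0.05, 0.15, 0.80]), [8, 16, 32]),
    "conv1.w_bits":   (torch.tensor([0.10, 0.85, 0.05]), [2, 4, 8]),
    "act1.a_bits":    (torch.tensor([0.07, 0.13, 0.80]), [2, 4, 8]),
}
arch = pick_architecture(selection_probs)
# e.g. {'conv1.channels': 32, 'conv1.w_bits': 4, 'act1.a_bits': 8}
# The lightweight network is rebuilt with these choices and retrained from
# the weights saved at the last search iteration.
```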
The model compression system combining quantization and pruning search provided by the invention optimizes the convolutional neural network by using the model compression method.
It will be apparent to those skilled in the art that various other changes and modifications may be made in the above-described embodiments and concepts and all such changes and modifications are intended to be within the scope of the appended claims.

Claims (7)

1. A model compression method for joint quantization and pruning search, characterized by comprising the following steps:
S1, inputting image data and hardware constraints;
S2, establishing a convolutional neural network model, and setting the objects and ranges of the quantization and pruning search;
S3, carrying out search training on the convolutional neural network model, and optimizing the weights and structural parameters of the model; the objects of the quantization search are the weight bit width of the convolutional layer and the activation-value bit width of the activation layer, and the object of the pruning search is the channel number of the convolutional layer;
S4, selecting the channel number with the maximum probability in the pruning search range and the bit width with the maximum probability in the quantization search range, and reconstructing the lightweight network; storing the model weights of the last search iteration as initialization to retrain the optimized convolutional neural network model;
the output function of the joint quantization and pruning optimization in step S3 being:

$$\text{out}(x) = \sum_{j=1}^{n_a} g_{aj}\,\alpha_j\!\left(\sum_{i=1}^{n_w} g_{wi}\sum_{k=1}^{n_c} g_{ck}\,\big(f_i(x)\odot t_k\big)\right)$$

wherein f is the convolutional layer, $n_c$ is the number of selectable channel counts in the convolutional layer during pruning, $g_{ck}$ is the selection weight of the k-th channel count, $n_w$ is the number of selectable weight bit widths in the convolutional layer, $g_{wi}$ is the selection weight of the i-th convolutional-layer bit width, $n_a$ is the number of selectable activation-value bit widths in the activation layer, $g_{aj}$ is the selection weight of the j-th activation-layer bit width, $\alpha$ is the activation layer, $t_k$ is a column vector whose length is the maximum channel number N, the first k elements of which are 1 and the remaining N−k elements of which are 0 when the selected channel number is k, and x is the given input.
2. The model compression method for joint quantization and pruning search according to claim 1, wherein the input image data is segmented into a training set, a validation set and a test set in step S1; wherein the training set and the validation set are used for the alternating optimization of the convolutional neural network model in step S3.
3. The model compression method for joint quantization and pruning search according to claim 2, wherein in step S3, a loss of computational cost is added to the classification verification loss function during model search;
the loss function is $L = l_c + \lambda_{cost}\, l_{cost}$, wherein $l_c$ is the cross-entropy loss, $l_{cost}$ is the total computational cost of the network, and $\lambda_{cost}$ is the weight of the computational cost.
4. The model compression method for joint quantization and pruning search according to claim 1, wherein the step S3 searches the selection spaces for quantization and pruning simultaneously.
5. The model compression method for joint quantization and pruning search according to claim 1, wherein the quantization and pruning selected weights are normalized by a gumbel-softmax, and a normalized temperature exponential decay is set such that the compressed selection probability matrix generated after the search is completed approximates to one-hot.
6. The method of model compression for joint quantization and pruning search of claim 5, wherein the selection weights are normalized using the following equation:
$$\hat{p}_i = \frac{\exp\big((\log p_i + o_i)/\tau\big)}{\sum_{k=1}^{K}\exp\big((\log p_k + o_k)/\tau\big)}, \quad \text{s.t. } o_i = -\log(-\log u),\; u \sim U(0,1)$$

wherein $\hat{p}$ is the normalized selection weight vector, $K$ is the number of selections, $\tau$ is the decay temperature, $p$ is the original selection weight vector, $U(0,1)$ denotes the uniform distribution between 0 and 1, $o_i$ are the generated random numbers, and s.t. introduces the constraint that follows.
7. A model compression system combining quantization and pruning search, characterized in that the optimization of the convolutional neural network is performed using the model compression method as claimed in any one of claims 1 to 6.
CN202110620864.3A 2021-06-03 2021-06-03 Model compression method and system combining quantization and pruning search Active CN113269312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110620864.3A CN113269312B (en) 2021-06-03 2021-06-03 Model compression method and system combining quantization and pruning search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110620864.3A CN113269312B (en) 2021-06-03 2021-06-03 Model compression method and system combining quantization and pruning search

Publications (2)

Publication Number Publication Date
CN113269312A CN113269312A (en) 2021-08-17
CN113269312B true CN113269312B (en) 2021-11-09

Family

ID=77234203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110620864.3A Active CN113269312B (en) 2021-06-03 2021-06-03 Model compression method and system combining quantization and pruning search

Country Status (1)

Country Link
CN (1) CN113269312B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947185B (en) * 2021-09-30 2022-11-18 北京达佳互联信息技术有限公司 Task processing network generation method, task processing device, electronic equipment and storage medium
CN114418086B (en) 2021-12-02 2023-02-28 北京百度网讯科技有限公司 Method and device for compressing neural network model
CN117036911A (en) * 2023-10-10 2023-11-10 华侨大学 Vehicle re-identification light-weight method and system based on neural architecture search

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222820A (en) * 2019-05-28 2019-09-10 东南大学 Convolutional neural networks compression method based on weight beta pruning and quantization
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN111275190A (en) * 2020-02-25 2020-06-12 北京百度网讯科技有限公司 Neural network model compression method and device, image processing method and processor
CN111652366A (en) * 2020-05-09 2020-09-11 哈尔滨工业大学 Combined neural network model compression method based on channel pruning and quantitative training
CN111931906A (en) * 2020-07-14 2020-11-13 北京理工大学 Deep neural network mixing precision quantification method based on structure search
CN112686382A (en) * 2020-12-30 2021-04-20 中山大学 Convolution model lightweight method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256634A (en) * 2018-02-08 2018-07-06 杭州电子科技大学 A kind of ship target detection method based on lightweight deep neural network
US11556796B2 (en) * 2019-03-25 2023-01-17 Nokia Technologies Oy Compressing weight updates for decoder-side neural networks
CN110210618A (en) * 2019-05-22 2019-09-06 东南大学 The compression method that dynamic trimming deep neural network weight and weight are shared
US20210089922A1 (en) * 2019-09-24 2021-03-25 Qualcomm Incorporated Joint pruning and quantization scheme for deep neural networks
CN111667054B (en) * 2020-06-05 2023-09-01 北京百度网讯科技有限公司 Method, device, electronic equipment and storage medium for generating neural network model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222820A (en) * 2019-05-28 2019-09-10 东南大学 Convolutional neural networks compression method based on weight beta pruning and quantization
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN111275190A (en) * 2020-02-25 2020-06-12 北京百度网讯科技有限公司 Neural network model compression method and device, image processing method and processor
CN111652366A (en) * 2020-05-09 2020-09-11 哈尔滨工业大学 Combined neural network model compression method based on channel pruning and quantitative training
CN111931906A (en) * 2020-07-14 2020-11-13 北京理工大学 Deep neural network mixing precision quantification method based on structure search
CN112686382A (en) * 2020-12-30 2021-04-20 中山大学 Convolution model lightweight method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Automated Model Compression by Jointly Applied Pruning and Quantization; Wenting Tang et al.; arXiv; 2020-11-12; Introduction (pp. 1-2), Methodology (p. 3), Layer Controller (p. 4), Experimental Results (p. 5), Algorithm 1, FIGS. 2-3 *
Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities;Lizeth Gonzalez-Carabarin 等;《arXiv》;20210526;1-9 *
A DNN model compression algorithm fusing model pruning and low-precision quantization; Wu Jin et al.; Telecommunication Engineering; 2020-06-30; Vol. 60, No. 6; 617-624 *

Also Published As

Publication number Publication date
CN113269312A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN113269312B (en) Model compression method and system combining quantization and pruning search
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
CN107239825B (en) Deep neural network compression method considering load balance
CN109948029B (en) Neural network self-adaptive depth Hash image searching method
CN107239829B (en) Method for optimizing artificial neural network
Sun et al. VAQF: fully automatic software-hardware co-design framework for low-bit vision transformer
CN107688855B (en) Hierarchical quantization method and device for complex neural network
CN110909667B (en) Lightweight design method for multi-angle SAR target recognition network
US20180260711A1 (en) Calculating device and method for a sparsely connected artificial neural network
CN109002889B (en) Adaptive iterative convolution neural network model compression method
CN110413255B (en) Artificial neural network adjusting method and device
CN111507521A (en) Method and device for predicting power load of transformer area
CN111898689A (en) Image classification method based on neural network architecture search
CN110414630A (en) The training method of neural network, the accelerated method of convolutional calculation, device and equipment
CN111723914A (en) Neural network architecture searching method based on convolution kernel prediction
CN113222138A (en) Convolutional neural network compression method combining layer pruning and channel pruning
CN111696149A (en) Quantization method for stereo matching algorithm based on CNN
CN114970853A (en) Cross-range quantization convolutional neural network compression method
CN114943335A (en) Layer-by-layer optimization method of ternary neural network
Qi et al. Learning low resource consumption cnn through pruning and quantization
CN113792621A (en) Target detection accelerator design method based on FPGA
CN116777842A (en) Light texture surface defect detection method and system based on deep learning
Wang et al. Structured feature sparsity training for convolutional neural network compression
CN116363423A (en) Knowledge distillation method, device and storage medium for small sample learning
CN114419361A (en) Neural network image classification method based on gated local channel attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant