CN116542311A - Neural network model compression method and system - Google Patents

Neural network model compression method and system

Info

Publication number
CN116542311A
CN116542311A (application CN202210086928.0A)
Authority
CN
China
Prior art keywords
channel
neural network
weight
network model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210086928.0A
Other languages
Chinese (zh)
Inventor
曹俊峰
黄文辉
尹路
王斌
邓超
冯俊兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Priority to CN202210086928.0A
Publication of CN116542311A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The invention provides a neural network model compression method and system. The neural network model compression method comprises: for each training sample in a training set, inputting the training sample into a neural network model to be quantized, and extracting the excitation factor of each channel in a convolution layer of the neural network model; for each channel in the convolution layer, determining the weight of the channel according to the excitation factors of the channel corresponding to all training samples in the training set; and quantizing each channel in the convolution layer according to the weight of each channel in the convolution layer, wherein the quantization degree of each channel is inversely related to the weight of the channel. In the invention, the quantization degree of a channel is determined according to the weight (importance) of that channel of the convolution layer: important channels are quantized to a low degree, unimportant channels to a high degree. The number of redundant, unimportant parameters can thus be reduced while the performance of the neural network model is preserved, reducing the storage space and running memory of the neural network model.

Description

Neural network model compression method and system
Technical Field
The embodiment of the invention relates to the technical field of business support, in particular to a neural network model compression method and system.
Background
Neural networks are commonly used in various fields such as computer vision, speech recognition, and natural language processing. Neural network models can achieve very high accuracy; however, accuracy is not the only metric by which a model is measured: storage space, running speed, and energy consumption are also important factors.
Neural network models typically contain a significant amount of redundancy, which means the model can be optimized to reduce parameters and computation, achieving comparable accuracy with a smaller, more compact network. The deep learning field has therefore devoted considerable research effort to this problem, mainly along two lines: first, designing more efficient network architectures that achieve acceptable accuracy at relatively small model sizes; second, reducing network size by means such as compression and coding. Quantization is one of the most widely used compression methods: it converts the floating-point arithmetic of the neural network into fixed-point arithmetic.
Quantization of a neural network model brings many benefits, but it also poses great challenges to the performance of the model. How to reduce the computation and size of a neural network model while guaranteeing its accuracy is a problem in urgent need of a solution.
Disclosure of Invention
The embodiment of the invention provides a neural network model compression method and system, which are used to solve the problem of how to reduce the computation and size of a neural network model while ensuring the accuracy of the neural network.
In order to solve the technical problems, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a neural network model compression method, including:
for each training sample in a training set, inputting the training sample into a neural network model to be quantized, and extracting the excitation factor of each channel in a convolution layer of the neural network model;
for each channel in the convolution layer, determining the weight of the channel according to the excitation factors of the channel corresponding to all training samples in the training set;
and quantizing each channel in the convolution layer according to the weight of each channel in the convolution layer, wherein the quantization degree of each channel is inversely related to the weight of the channel.
Optionally, for each channel in the convolutional layer, determining the weight of the channel according to the excitation factors of the channels corresponding to all training samples in the training set, including:
and for each channel in the convolution layer, acquiring an average value of excitation factors of the channels corresponding to all training samples in the training set, and determining the average value as the weight of the channel.
Optionally, quantizing each channel in the convolution layer according to the weight of each channel in the convolution layer includes:
clustering, for each channel, respective weights in a convolution kernel of the convolution layer into k categories, wherein a value of k for each channel is positively correlated with a weight of the channel;
and taking the clustering center of each category as the sharing weight of the category.
Optionally, the number of clusters k for each channel i is determined as

k = (M × M) / Q_step(w_i)

where M is the number of pixels in each row and column of the convolution kernel, Q_step is a piecewise function, w_i is the weight of channel i, and T_0 and T_1 are preset thresholds, T_1 being greater than T_0.
Optionally, after taking the cluster center of each category as the sharing weight of the category, the method further includes:
the sharing weight of each category is recorded through an index.
Optionally, before inputting the training samples to the neural network model to be quantized for each training sample in the training set, the method further comprises:
and pre-training the neural network model by adopting the training samples in the training set to obtain the neural network to be quantized.
Optionally, after quantizing each channel in the convolution layer according to the weight of each channel in the convolution layer, the method further includes:
and retraining the quantized neural network model by adopting the training samples in the training set.
In a second aspect, an embodiment of the present invention provides a neural network model compression system, including:
the extraction module is used for inputting, for each training sample in a training set, the training sample into a neural network model to be quantized, and extracting the excitation factor of each channel in a convolution layer of the neural network model;
the weight determining module is used for determining the weight of each channel in the convolution layer according to the excitation factors of the channels corresponding to all training samples in the training set;
and the quantization module is used for quantizing each channel in the convolution layer according to the weight of each channel in the convolution layer, wherein the quantization degree of each channel is inversely related to the weight of the channel.
Optionally, the weight determining module is configured to obtain, for each channel in the convolutional layer, an average value of the excitation factors of the channel corresponding to all training samples in the training set, and determine the average value as the weight of the channel.
Optionally, the quantization module is configured to cluster, for each channel, respective weights in a convolution kernel of the convolution layer into k categories, where a value of k of each channel is positively correlated with a weight of the channel; and taking the clustering center of each category as the sharing weight of the category.
Optionally, the quantization module determines the number of clusters k for each channel i as

k = (M × M) / Q_step(w_i)

wherein M is the number of pixels in each row and column of the convolution kernel, Q_step is a piecewise function, w_i is the weight of channel i, and T_0 and T_1 are preset thresholds, T_1 being greater than T_0.
Optionally, the quantization module is further configured to record the sharing weight of each category through an index.
Optionally, the neural network model compression system further includes:
and the pre-training module is used for pre-training the neural network model by adopting the training samples in the training set to obtain the neural network to be quantized.
Optionally, the neural network model compression system further includes:
and the retraining module is used for retraining the quantized neural network model by adopting the training samples in the training set.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, and a program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the neural network model compression method described in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the steps of the neural network model compression method described in the first aspect.
In the embodiment of the invention, the quantization degree of a channel is determined according to the weight (importance) of that channel of the convolution layer: important channels are quantized to a low degree, unimportant channels to a high degree. The number of redundant, unimportant parameters can thus be reduced while the performance of the neural network model is ensured, so that the storage space and running memory of the neural network model are reduced.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic flow chart of a neural network model compression method according to an embodiment of the invention;
FIG. 2 is a schematic diagram illustrating the operation of an SE module according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a neural network model compression system according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following clearly and fully describes the embodiments of the present invention with reference to the accompanying drawings. It is evident that the described embodiments are some, rather than all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without creative effort shall fall within the scope of the invention.
Referring to fig. 1, an embodiment of the present invention provides a neural network model compression method, including:
step 11: for each training sample in a training set, inputting the training sample into a neural network model to be quantized, and extracting the excitation factor of each channel in a convolution layer of the neural network model;
step 12: for each channel in the convolution layer, determining the weight of the channel according to the excitation factors of the channel corresponding to all training samples in the training set;
step 13: and quantizing each channel in the convolution layer according to the weight of each channel in the convolution layer, wherein the quantization degree of each channel is inversely related to the weight of the channel.
It should be noted that the neural network model in the embodiment of the present invention may include a plurality of convolution layers, and the above steps may be performed for all or some of the convolution layers of the neural network model, with each convolution layer independently quantizing each of its channels according to the steps described above.
In the embodiment of the invention, the weight of the channel can represent the importance degree of the channel, and the weight of the channel is positively correlated with the importance of the channel, namely, the higher the weight is, the more important the channel is, and the lower the weight is, the less important the channel is.
The quantization degree of each channel is inversely related to the weight of the channel, which means that: the greater the weight of the channel, the lower the quantization level, and the smaller the weight, the higher the quantization level. That is, when quantization is performed, the degree of quantization of important channels is low, and the degree of quantization of unimportant channels is high.
In the embodiment of the invention, the quantization degree of a channel is determined according to the weight (importance) of that channel of the convolution layer: important channels are quantized to a low degree, unimportant channels to a high degree. The number of redundant, unimportant parameters can thus be reduced while the performance of the neural network model is ensured, so that the storage space and running memory of the neural network model are reduced.
In the embodiment of the invention, a Squeeze-and-Excitation (SE) module can be added to the neural network model to extract the excitation factors of the channels in a convolution layer of the neural network model; the SE module can model the correlation among feature channels to extract the excitation factor of each channel. In the embodiment of the invention, the SE module can extract the excitation factors of the channels in the convolution layer of the neural network model by an attention mechanism. Referring to FIG. 2, FIG. 2 is a schematic diagram illustrating the operation of the SE module in an embodiment of the present invention. The feature map output by the convolution layer is denoted X ∈ R^(W×H×C), where W is the width, H is the height, and C is the number of channels. First, the W×H feature map of each channel is compressed by a global pooling layer into a single number representing the information of that channel (the Squeeze operation, yielding a 1×1×C feature map); then two sets of fully connected (FC) and activation layers (ReLU and Sigmoid) learn the excitation factor s_i (i = 1, 2, …, C) of each channel (the Excitation operation).
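For illustration only, the following is a minimal PyTorch-style sketch of such an SE block; the reduction ratio of 16 and the exact layer shapes are assumptions taken from common SE-module practice, not values specified in this application.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation sketch: global pooling compresses each
    W x H channel map to one number (Squeeze); two FC layers with ReLU
    and Sigmoid learn per-channel excitation factors s_i in (0, 1)
    (Excitation)."""
    def __init__(self, channels: int, reduction: int = 16):  # reduction is an assumption
        super().__init__()
        hidden = max(1, channels // reduction)
        self.pool = nn.AdaptiveAvgPool2d(1)   # (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c))  # excitation factors, shape (B, C)
        # A standard SE block would return x * s.view(b, c, 1, 1); here the
        # factors s themselves are what the compression method collects.
        return s
```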
In an embodiment of the present invention, optionally, for each channel in the convolutional layer, determining the weight of the channel according to the excitation factors of the channel corresponding to all training samples in the training set includes: for each channel in the convolution layer, obtaining the average value of the excitation factors of the channel corresponding to all training samples in the training set, and determining the average value as the weight of the channel. Using the average of the excitation factors over all training samples as the channel weight makes the obtained weight more accurate.
Assume the training set contains N training samples and that, for training sample n, s_i^(n) is the excitation factor of channel i of a given convolution layer. The weight vector w = [w_1, w_2, …, w_C] of the channels of that convolution layer is then

w_i = (1/N) · Σ_{n=1}^{N} s_i^(n), where i = 1, 2, …, C.
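As a sketch of this averaging step, assuming the excitation factors have already been collected into an N × C array (the function name is illustrative):

```python
import numpy as np

def channel_weights(excitations: np.ndarray) -> np.ndarray:
    """excitations has shape (N, C): the excitation factor s_i of each of
    the C channels, collected for each of the N training samples.
    Returns w of shape (C,) with w_i = (1/N) * sum_n s_i^(n)."""
    return excitations.mean(axis=0)

# e.g. w = channel_weights(np.stack(per_sample_factors))
# where per_sample_factors is a list of N arrays of shape (C,)
```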
In an embodiment of the present invention, optionally, quantizing each channel in the convolutional layer according to a weight of each channel in the convolutional layer includes:
step 131: clustering, for each channel, respective weights in a convolution kernel of the convolution layer into k categories, wherein a value of k for each channel is positively correlated with a weight of the channel;
in the embodiment of the invention, K-means clustering can be adopted to cluster M multiplied by M weights in the convolution kernel into K categories.
Step 132: and taking the clustering center of each category as the sharing weight of the category.
That is, the connections with similar weights in the same channel share the same shared weight, and meanwhile, different numbers of shared weights can be selected for different channels, so that the effect of different quantization degrees for each channel is achieved.
The value of k for each channel is positively correlated with the weight of the channel, meaning: the larger the weight of the channel, the larger the value of k; the smaller the weight, the smaller the value of k. A larger k corresponds to a low quantization degree, and a smaller k corresponds to a high quantization degree.
In the embodiment of the invention, optionally, the number of clusters k for channel i is given by

k = (M × M) / Q_step(w_i)

where M is the number of pixels in each row and column of the convolution kernel, Q_step is a piecewise function of the channel weight, w_i is the weight of channel i, and T_0 and T_1 are preset thresholds with T_1 greater than T_0: Q_step takes its largest value when w_i is not more than T_0, an intermediate value when w_i lies between T_0 and T_1, and its smallest value when w_i exceeds T_1. In the embodiment of the invention, T_0 and T_1 can be set according to different requirements.
In the embodiment of the invention, a monotonically decreasing piecewise function maps the weight of a channel to the parameter Q_step, from which the number of clusters required for quantization compression of that channel is calculated, yielding different degrees of cluster-based quantization compression. This allows a low degree of quantization (or even no quantization) for channels with high weight values (high importance), maintaining high model accuracy, and a high degree of quantization for channels with low weight values (low importance), thereby obtaining a larger overall compression rate.
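The application does not give concrete values for Q_step, so the sketch below uses placeholder step values (9, 3, 1) chosen only to match the stated shape of the function: monotonically decreasing in the channel weight, with preset thresholds T_1 greater than T_0, and k derived from M × M.

```python
def q_step(w_i: float, t0: float, t1: float,
           q_high: int = 9, q_mid: int = 3, q_low: int = 1) -> int:
    """Monotonically decreasing piecewise function of the channel weight.
    The step values 9/3/1 are illustrative placeholders, not values from
    the application."""
    if w_i > t1:
        return q_low        # important channel: small step
    if w_i > t0:
        return q_mid
    return q_high           # unimportant channel: large step

def num_clusters(m: int, w_i: float, t0: float, t1: float) -> int:
    """k = (M x M) / Q_step(w_i): a higher channel weight gives a larger
    k and hence a lower quantization degree."""
    return max(1, (m * m) // q_step(w_i, t0, t1))
```

With a 3×3 kernel these placeholders yield k = 9 for a high-weight channel (effectively no quantization), k = 3 for an intermediate one, and k = 1 for the least important channels.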
In this embodiment of the present invention, optionally, after taking the cluster center of each category as the shared weight of the category, the method further includes: recording the shared weight of each category through an index, where each index is an integer of ⌈log2(k)⌉ bits. That is, the original floating-point weights are replaced by integer indices, which require less storage space. For example, if the 3×3 convolution kernel (containing 9 weights) corresponding to one channel is clustered into 4 categories with cluster centers (-1, 0, 1.5, 2.0), the indices (0, 1, 2, 3) may be used to represent (-1, 0, 1.5, 2.0), respectively, reducing the storage space.
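A sketch of the clustering-plus-index representation follows, using scikit-learn's KMeans as one possible clustering implementation; the function name and the sample kernel values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_kernel(kernel: np.ndarray, k: int):
    """Cluster the M x M weights of one channel's convolution kernel into
    k categories; each weight becomes an integer index into a codebook of
    k shared weights (the cluster centers)."""
    flat = kernel.reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10).fit(flat)
    codebook = km.cluster_centers_.ravel()    # k shared float weights
    indices = km.labels_.astype(np.uint8)     # ceil(log2(k)) bits each
    return indices.reshape(kernel.shape), codebook

# The worked example from the text: a 3x3 kernel clustered into k = 4
# categories stores nine 2-bit indices plus a 4-entry codebook instead
# of nine 32-bit floats.
kernel = np.array([[-1.1, 0.1, 1.4],
                   [2.1, -0.9, 0.0],
                   [1.6, 1.9, -1.0]])
idx, centers = quantize_kernel(kernel, k=4)
restored = centers[idx]                       # dequantized kernel
```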
In the embodiment of the invention, a channel can be quantized once or multiple times, with the number of quantization passes set according to need.
In an embodiment of the present invention, optionally, before inputting the training samples to the neural network model to be quantized for each training sample in the training set, the method further includes:
step 10: and pre-training the neural network model by adopting the training samples in the training set to obtain the neural network to be quantized.
In an embodiment of the present invention, optionally, after quantizing each channel in the convolutional layer according to the weight of each channel in the convolutional layer, the method further includes:
step 14: and retraining the quantized neural network model by adopting the training samples in the training set.
The neural network model in the embodiment of the invention can be a neural network model applied to various services, for example, video monitoring services. Video monitoring is a very widely applied service, including machine room monitoring, kitchen monitoring, community monitoring, and the like. Because video bandwidth is limited and the network traffic borne by a cloud server would otherwise be too large, edge devices are used to collect the video streams sent by cameras, run inference with the neural network model, and send the inference results to the client or upload them to the cloud server. A huge neural network model on the edge device affects latency: a neural network model for video-monitoring tasks such as behavior recognition requires at least 30.9 billion floating-point operations. The huge number of parameters means the model needs large storage space and occupies substantial memory at run time, and the huge amount of computation increases energy consumption and lengthens running time. The neural network model therefore needs to be compressed so that an edge device can process more camera video streams, reducing cost.
Referring to fig. 3, an embodiment of the present invention provides a neural network model compression system 30, including:
the extracting module 31 is configured to input, for each training sample in a training set, the training sample to a neural network model to be quantized, and extract excitation factors of each channel in a convolutional layer of the neural network model;
a weight determining module 32, configured to determine, for each channel in the convolutional layer, a weight of the channel according to excitation factors of the channels corresponding to all training samples in the training set;
and a quantization module 33, configured to quantize each channel in the convolutional layer according to the weight of each channel in the convolutional layer, where the quantization degree of each channel is inversely related to the weight of the channel.
In the embodiment of the invention, the quantization degree of a channel is determined according to the weight (importance) of that channel of the convolution layer: important channels are quantized to a low degree, unimportant channels to a high degree. The number of redundant, unimportant parameters can thus be reduced while the performance of the neural network model is ensured, so that the storage space and running memory of the neural network model are reduced.
Optionally, the weight determining module 32 is configured to obtain, for each channel in the convolutional layer, an average value of the excitation factors of the channel corresponding to all training samples in the training set, and determine the average value as the weight of the channel.
Optionally, the quantization module 33 is configured to cluster, for each channel, respective weights in the convolution kernel of the convolution layer into k categories, where a value of k of each channel is positively correlated with a weight of the channel; and taking the clustering center of each category as the sharing weight of the category.
Optionally, the quantization module 33 determines the number of clusters k for each channel i as

k = (M × M) / Q_step(w_i)

wherein M is the number of pixels in each row and column of the convolution kernel, Q_step is a piecewise function, w_i is the weight of channel i, and T_0 and T_1 are preset thresholds, T_1 being greater than T_0.
Optionally, the quantization module 33 is further configured to record the sharing weight of each category through an index.
Optionally, the neural network model compression system 30 further includes:
and the pre-training module (not shown) is used for pre-training the neural network model by adopting training samples in the training set to obtain the neural network to be quantized.
Optionally, the neural network model compression system 30 further includes:
and the retraining module (not shown) is used for retraining the quantized neural network model by adopting training samples in the training set.
Referring to fig. 4, the embodiment of the present invention further provides an electronic device 40, which includes a processor 41, a memory 42, and a computer program stored in the memory 42 and executable on the processor 41. When executed by the processor 41, the computer program implements the processes of the neural network model compression method embodiment described above and can achieve the same technical effects; to avoid repetition, a detailed description is omitted here.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored. When executed by a processor, the computer program implements the processes of the neural network model compression method embodiment described above and can achieve the same technical effects; to avoid repetition, no further description is given here. The computer readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general hardware platform, or by hardware alone, although in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) and comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods according to the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many variations may be made by those having ordinary skill in the art, in light of the present invention, without departing from the spirit of the invention and the scope of the claims, and all such variations fall within the protection of the present invention.

Claims (10)

1. A neural network model compression method, comprising:
for each training sample in a training set, inputting the training sample into a neural network model to be quantized, and extracting the excitation factor of each channel in a convolution layer of the neural network model;
for each channel in the convolution layer, determining the weight of the channel according to the excitation factors of the channel corresponding to all training samples in the training set;
and quantizing each channel in the convolution layer according to the weight of each channel in the convolution layer, wherein the quantization degree of each channel is inversely related to the weight of the channel.
2. The method of claim 1, wherein for each channel in the convolutional layer, determining the weights of the channel from the excitation factors of the channel for all training samples in the training set comprises:
and for each channel in the convolution layer, acquiring an average value of excitation factors of the channels corresponding to all training samples in the training set, and determining the average value as the weight of the channel.
3. The method of claim 1, wherein quantizing each channel in the convolutional layer according to the weight of each channel in the convolutional layer comprises:
clustering, for each channel, respective weights in a convolution kernel of the convolution layer into k categories, wherein a value of k for each channel is positively correlated with a weight of the channel;
and taking the clustering center of each category as the sharing weight of the category.
4. The method of claim 3, wherein the number of clusters k for channel i is determined as

k = (M × M) / Q_step(w_i)

wherein M is the number of pixels in each row and column of the convolution kernel, Q_step is a piecewise function, w_i is the weight of channel i, and T_0 and T_1 are preset thresholds, T_1 being greater than T_0.
5. A method according to claim 3, wherein after taking the cluster center of each category as the sharing weight of the category, the method further comprises:
the sharing weight of each category is recorded through an index.
6. The method of claim 1, wherein prior to inputting the training samples into the neural network model to be quantized for each training sample in a training set, the method further comprises:
and pre-training the neural network model by adopting the training samples in the training set to obtain the neural network to be quantized.
7. The method of claim 1, wherein after quantizing each channel in the convolutional layer according to the weight of each channel in the convolutional layer, the method further comprises:
and retraining the quantized neural network model by adopting the training samples in the training set.
8. A neural network model compression system, comprising:
the extraction module is used for inputting, for each training sample in a training set, the training sample into a neural network model to be quantized, and extracting the excitation factor of each channel in a convolution layer of the neural network model;
the weight determining module is used for determining the weight of each channel in the convolution layer according to the excitation factors of the channels corresponding to all training samples in the training set;
and the quantization module is used for quantizing each channel in the convolution layer according to the weight of each channel in the convolution layer, wherein the quantization degree of each channel is inversely related to the weight of the channel.
9. An electronic device, comprising: a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the neural network model compression method of any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the neural network model compression method according to any of claims 1 to 7.
CN202210086928.0A 2022-01-25 2022-01-25 Neural network model compression method and system Pending CN116542311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210086928.0A CN116542311A (en) 2022-01-25 2022-01-25 Neural network model compression method and system

Publications (1)

Publication Number Publication Date
CN116542311A 2023-08-04

Family

ID=87442298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210086928.0A Pending CN116542311A (en) 2022-01-25 2022-01-25 Neural network model compression method and system

Country Status (1)

Country Link
CN (1) CN116542311A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116959489A (en) * 2023-09-19 2023-10-27 腾讯科技(深圳)有限公司 Quantization method and device for voice model, server and storage medium
CN116959489B (en) * 2023-09-19 2023-12-22 腾讯科技(深圳)有限公司 Quantization method and device for voice model, server and storage medium

Similar Documents

Publication Publication Date Title
CN109978142B (en) Neural network model compression method and device
CN110046550B (en) Pedestrian attribute identification system and method based on multilayer feature learning
CN111968150B (en) Weak surveillance video target segmentation method based on full convolution neural network
US20230401833A1 (en) Method, computer device, and storage medium, for feature fusion model training and sample retrieval
CN111242180A (en) Image identification method and system based on lightweight convolutional neural network
CN111079767B (en) Neural network model for segmenting image and image segmentation method thereof
CN114332500A (en) Image processing model training method and device, computer equipment and storage medium
CN111898703A (en) Multi-label video classification method, model training method, device and medium
CN116542311A (en) Neural network model compression method and system
CN112132279A (en) Convolutional neural network model compression method, device, equipment and storage medium
CN113424200A (en) Methods, apparatuses and computer program products for video encoding and video decoding
CN117033961A (en) Multi-mode image-text classification method for context awareness
CN112070211B (en) Image recognition method based on computing unloading mechanism
CN112200275B (en) Artificial neural network quantification method and device
CN113518229B (en) Method and device for training loop filter network, computer equipment and storage medium
CN115982634A (en) Application program classification method and device, electronic equipment and computer program product
CN114677535A (en) Training method of domain-adaptive image classification network, image classification method and device
CN111143641A (en) Deep learning model training method and device and electronic equipment
CN117349027B (en) Multi-mode large model construction system and method for reducing calculation force demand
CN115329118B (en) Image similarity retrieval method and system for garbage image
CN116402116B (en) Pruning method, system, equipment, medium and image processing method of neural network
CN114139656B (en) Image classification method based on deep convolution analysis and broadcast control platform
CN113688989B (en) Deep learning network acceleration method, device, equipment and storage medium
CN109871487B (en) News recall method and system
CN116109556A (en) Target detection method, device and storage medium based on Sv2-v3 model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination