CN116992946B - Model compression method, apparatus, storage medium, and program product - Google Patents

Model compression method, apparatus, storage medium, and program product

Info

Publication number
CN116992946B
CN116992946B (Application CN202311257748.5A)
Authority
CN
China
Prior art keywords
weight parameters
convolution
model
neural network
interval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311257748.5A
Other languages
Chinese (zh)
Other versions
CN116992946A (en)
Inventor
姚万欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Application filed by Honor Device Co Ltd
Priority to CN202311257748.5A
Publication of CN116992946A
Application granted
Publication of CN116992946B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/18 - Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computational Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application provides a model compression method, an apparatus, a storage medium, and a program product. In the method, a plurality of weight parameters corresponding to a convolution layer to be pruned in a neural network model to be compressed are first obtained, the plurality of weight parameters including the weight parameters of each convolution kernel in the convolution layer to be pruned. A first interval is then determined according to the plurality of weight parameters, where the average value of the plurality of weight parameters falls within the first interval, and the fewer weight parameters fall within the first interval, the more uniform the distribution of the weight parameters of the convolution layer to be pruned. Finally, channel pruning is performed on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the magnitude of the weight parameters of each convolution kernel, and the pruned neural network model is quantized. This improves the inference speed of the neural network model while reducing the loss of model accuracy after the pruned neural network model is quantized.

Description

Model compression method, apparatus, storage medium, and program product
Technical Field
The present application relates to the field of neural network technologies, and in particular, to a model compression method, a device, a storage medium, and a program product.
Background
As the functional requirements placed on neural network models become more and more complex, the number of weight parameters of neural network models keeps growing, making the models increasingly complex, and complex models take longer to perform inference.
In some scenarios where an inference result must be produced quickly, the inference time of a complex model cannot meet the requirement; in such scenarios, the inference speed of the model can be improved by compressing the complex model.
A common method for compressing a model is to prune it and then quantize it. In the pruning process, the pruning operation may be performed according to the importance of the convolution kernels of the convolution layers of the neural network model; for example, convolution kernels of low importance are pruned and convolution kernels of high importance are retained.
However, if only the importance of the convolution kernels is considered when pruning the neural network model, the accuracy loss of the model may be severe when the pruned neural network model is subsequently quantized.
Disclosure of Invention
The application provides a model compression method, an apparatus, a storage medium, and a program product, which can improve the inference speed of a neural network model while reducing the loss of model accuracy after the pruned neural network model is quantized.
To achieve the above purpose, the application adopts the following technical solutions:
In a first aspect, a model compression method is provided. First, a plurality of weight parameters corresponding to a convolution layer to be pruned in a neural network model to be compressed are obtained, the plurality of weight parameters including the weight parameters of each convolution kernel in the convolution layer to be pruned. Second, a first interval is determined according to the plurality of weight parameters, where the average value of the plurality of weight parameters falls within the first interval, and the fewer weight parameters fall within the first interval, the more uniform the distribution of the weight parameters of the convolution layer to be pruned. Finally, channel pruning is performed on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the magnitude of the weight parameters of each convolution kernel, and the pruned neural network model is quantized.
Based on the above technical solution, the model compression method of the embodiments of the application first performs channel pruning on the neural network model to be compressed, and then quantizes the pruned neural network model, so as to improve the inference speed of the model. In the pruning process, channel pruning is performed on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the magnitude of the weight parameters of each convolution kernel. The magnitude of the weight parameters of a convolution kernel represents the importance of that convolution kernel. The fewer weight parameters of the convolution kernels fall within the first interval, the more uniform the distribution of the weight parameters of the convolution layer to be pruned, and the smaller the accuracy loss of the quantized model; conversely, the more weight parameters of the convolution kernels fall within the first interval, the more uneven the distribution of the weight parameters of the convolution layer to be pruned, and the greater the accuracy loss of the quantized model. Therefore, in the embodiments of the application, both the magnitude of the weight parameters of each convolution kernel and the number of weight parameters of each convolution kernel that fall within the first interval are considered during pruning, so that the loss of model accuracy after the pruned neural network model is quantized can be reduced while the inference speed of the neural network model is improved.
In a possible implementation manner of the first aspect, the determining a first interval according to the plurality of weight parameters includes: determining a first average value and a first standard deviation of the plurality of weight parameters; and calculating the first interval according to the first average value, the first standard deviation and a first preset threshold value, wherein the first preset threshold value is smaller than 1.
In a possible implementation manner of the first aspect, the calculating the first interval according to the first average value, the first standard deviation, and a first preset threshold includes: adding the first average value to the product of the first preset threshold value and the first standard deviation to obtain a first upper threshold value of the first interval; subtracting the product of the first preset threshold value and the first standard deviation from the first average value to obtain a first lower threshold value of the first interval.
In a possible implementation manner of the first aspect, the obtaining a plurality of weight parameters corresponding to a convolution layer to be pruned in the neural network model to be compressed includes: obtaining a plurality of initial weight parameters corresponding to the convolution layer to be pruned in the neural network model to be compressed; determining a second upper threshold and a second lower threshold according to the plurality of initial weight parameters, wherein the first interval lies between the second upper threshold and the second lower threshold; and determining the plurality of weight parameters corresponding to the convolution layer to be pruned according to the second upper threshold, the second lower threshold, and the plurality of initial weight parameters, wherein the plurality of weight parameters include the weight parameters obtained after replacing, among the plurality of initial weight parameters, the initial weight parameters greater than the second upper threshold with the second upper threshold and the initial weight parameters smaller than the second lower threshold with the second lower threshold.
Based on the above technical solution, before pruning, after the second upper threshold and the second lower threshold corresponding to each convolution layer to be pruned are determined, the initial weight parameters greater than the second upper threshold and the initial weight parameters smaller than the second lower threshold (i.e., outliers) can be found. The outliers are then processed: initial weight parameters greater than the second upper threshold are replaced with the second upper threshold, and initial weight parameters smaller than the second lower threshold are replaced with the second lower threshold. This prevents outliers with excessively large absolute values from being unfriendly to quantization training, and also prevents excessively large outliers from distorting the evaluation of convolution kernel importance during pruning.
In a possible implementation manner of the first aspect, the determining a second upper threshold and a second lower threshold according to the plurality of initial weight parameters includes: determining a second average value and a second standard deviation of the plurality of initial weight parameters; and calculating the second upper threshold and the second lower threshold according to the second average value, the second standard deviation, and a second preset threshold, wherein the second preset threshold is greater than or equal to 3.
In a possible implementation manner of the first aspect, the calculating the second upper threshold and the second lower threshold according to the second average value, the second standard deviation, and a second preset threshold includes: adding the second average value to the product of the second preset threshold value and the second standard deviation to obtain a second upper limit threshold value; and subtracting the product of the second preset threshold value and the second standard deviation from the second average value to obtain the second lower threshold value.
In a possible implementation manner of the first aspect, the performing channel pruning on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the magnitude of the weight parameters of each convolution kernel includes: performing channel pruning on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the sum of the absolute values of the weight parameters of each convolution kernel.
In a possible implementation manner of the first aspect, the performing channel pruning on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the sum of the absolute values of the weight parameters of each convolution kernel includes: calculating, for each convolution kernel, the ratio of the number of its weight parameters that fall within the first interval to the total number of its weight parameters; and performing channel pruning on the neural network model to be compressed according to the ratio corresponding to each convolution kernel and the sum of the absolute values of the weight parameters of each convolution kernel.
In a possible implementation manner of the first aspect, the performing channel pruning on the neural network model to be compressed according to the ratio corresponding to each convolution kernel and the sum of the absolute values of the weight parameters of each convolution kernel includes: calculating, for each convolution kernel, the product of the value obtained by subtracting the ratio corresponding to that convolution kernel from a preset value and the sum of the absolute values of the weight parameters of that convolution kernel; and pruning, in the neural network model to be compressed, the convolution kernels whose product is smaller than a third preset threshold.
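For illustration only, the following sketch computes such a per-kernel pruning index in PyTorch; the preset value of 1, the threshold names, and the tensor layout are assumptions made for the example rather than values fixed by the application.

```python
import torch

def pruning_scores(layer_weights: torch.Tensor, t_low: float, t_up: float,
                   preset_value: float = 1.0) -> torch.Tensor:
    """Score each convolution kernel: (preset_value - ratio_in_first_interval) * sum(|w|)."""
    w = layer_weights.flatten(start_dim=1)            # one row per convolution kernel
    in_interval = (w >= t_low) & (w <= t_up)
    ratio = in_interval.float().mean(dim=1)           # fraction of weights in the first interval
    abs_sum = w.abs().sum(dim=1)                      # importance of each convolution kernel
    return (preset_value - ratio) * abs_sum           # smaller score -> pruned earlier

# Kernels whose score falls below a third preset threshold would be channel-pruned.
```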
In a second aspect, a model compression apparatus is provided, comprising: an acquisition module, configured to acquire a plurality of weight parameters corresponding to the convolution layer to be pruned in the neural network model to be compressed, wherein the plurality of weight parameters include the weight parameters of each convolution kernel in the convolution layer to be pruned; and a processing module, configured to determine a first interval according to the plurality of weight parameters, wherein the average value of the plurality of weight parameters falls within the first interval, and the fewer weight parameters fall within the first interval, the more uniform the distribution of the weight parameters of the convolution layer to be pruned; perform channel pruning on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the magnitude of the weight parameters of each convolution kernel; and quantize the pruned neural network model.
Based on the above technical solution, the model compression apparatus of the embodiments of the application first performs channel pruning on the neural network model to be compressed, and then quantizes the pruned neural network model, so as to improve the inference speed of the model. In the pruning process, the processing module performs channel pruning on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the magnitude of the weight parameters of each convolution kernel. The magnitude of the weight parameters of a convolution kernel represents the importance of that convolution kernel; the fewer weight parameters of the convolution kernels fall within the first interval, the more uniform the distribution of the weight parameters of the convolution layer to be pruned, and the smaller the accuracy loss of the quantized model; conversely, the more weight parameters of the convolution kernels fall within the first interval, the more uneven the distribution of the weight parameters in the convolution layer to be pruned, and the greater the accuracy loss of the quantized model. Therefore, in the embodiments of the application, both the magnitude of the weight parameters of each convolution kernel and the number of weight parameters of each convolution kernel that fall within the first interval are considered during pruning, so that the loss of model accuracy after the pruned neural network model is quantized can be reduced while the inference speed of the neural network model is improved.
In a possible implementation manner of the second aspect, the processing module is specifically configured to determine a first average value and a first standard deviation of the plurality of weight parameters; and calculating the first interval according to the first average value, the first standard deviation and a first preset threshold value, wherein the first preset threshold value is smaller than 1.
In a possible implementation manner of the second aspect, the processing module is specifically configured to add the first average value to a product of the first preset threshold value and the first standard deviation to obtain a first upper threshold value of the first interval; and subtracting the product of the first preset threshold value and the first standard deviation from the first average value to obtain a first lower threshold value of the first interval.
In a possible implementation manner of the second aspect, the acquisition module is specifically configured to obtain a plurality of initial weight parameters corresponding to the convolution layer to be pruned in the neural network model to be compressed; determine a second upper threshold and a second lower threshold according to the plurality of initial weight parameters, wherein the first interval lies between the second upper threshold and the second lower threshold; and determine the plurality of weight parameters corresponding to the convolution layer to be pruned according to the second upper threshold, the second lower threshold, and the plurality of initial weight parameters, wherein the plurality of weight parameters include the weight parameters obtained after replacing, among the plurality of initial weight parameters, the initial weight parameters greater than the second upper threshold with the second upper threshold and the initial weight parameters smaller than the second lower threshold with the second lower threshold.
Based on the above technical solution, before pruning, after the second upper threshold and the second lower threshold corresponding to each convolution layer to be pruned are determined, the initial weight parameters greater than the second upper threshold and the initial weight parameters smaller than the second lower threshold (i.e., outliers) can be found. The outliers are then processed: initial weight parameters greater than the second upper threshold are replaced with the second upper threshold, and initial weight parameters smaller than the second lower threshold are replaced with the second lower threshold. This prevents outliers with excessively large absolute values from being unfriendly to quantization training, and also prevents excessively large outliers from distorting the evaluation of convolution kernel importance during pruning.
In a possible implementation manner of the second aspect, the processing module is specifically configured to determine a second average value and a second standard deviation of the plurality of initial weight parameters; and calculating the second upper threshold and the second lower threshold according to the second average value, the second standard deviation and a second preset threshold, wherein the second preset threshold is greater than or equal to 3.
In a possible implementation manner of the second aspect, the processing module is specifically configured to add the second average value to a product of the second preset threshold value and the second standard deviation to obtain the second upper threshold value; and subtracting the product of the second preset threshold value and the second standard deviation from the second average value to obtain the second lower threshold value.
In a possible implementation manner of the second aspect, the processing module is specifically configured to perform channel pruning on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the sum of the absolute values of the weight parameters of each convolution kernel.
In a possible implementation manner of the second aspect, the processing module is specifically configured to calculate a ratio of a number of weight parameters of the respective convolution kernels falling into the first interval to a total number of weight parameters of the respective convolution kernels; and performing channel pruning on the neural network model to be compressed according to the sum of the ratio corresponding to each convolution kernel and the absolute value of the weight parameter of each convolution kernel.
In a possible implementation manner of the second aspect, the processing module is specifically configured to calculate a product of a value obtained by subtracting the ratio corresponding to each convolution kernel from a preset value and a sum of absolute values of weight parameters of each convolution kernel; and performing channel pruning on the convolution kernel of which the product is smaller than a third preset threshold value in the neural network model to be compressed.
In a third aspect, there is provided a model compression device comprising a memory and a processor, the memory for storing instructions which, when executed by the processor, cause the model compression device to perform the model compression method of the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, a computer readable storage medium is provided, the computer readable storage medium storing a computer program comprising program instructions which, when executed, implement the method of model compression in the first aspect or any one of the possible implementations of the first aspect.
In a fifth aspect, there is provided a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the method of model compression in the first aspect or any one of the possible implementations of the first aspect.
Further combinations of the present application may be made to provide further implementations based on the implementations provided in the above aspects.
Drawings
FIG. 1 is a schematic diagram of a relatively uniform distribution of weight parameters of a convolutional layer provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of non-uniform distribution of weight parameters of a convolutional layer provided by an embodiment of the present application;
FIG. 3 is an exemplary flow chart of a model compression method provided by an embodiment of the application;
FIG. 4 is a schematic diagram of a long tail distribution of weight parameters of a certain convolution layer according to an embodiment of the present application;
FIG. 5 is an exemplary flow chart of another example model compression method provided by an embodiment of the application;
FIG. 6 is an exemplary flow chart of yet another example model compression method provided by an embodiment of the application;
FIG. 7 is a schematic structural diagram of an exemplary model compression apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another model compression apparatus according to an embodiment of the present application.
Detailed Description
The technical scheme of the application will be described below with reference to the accompanying drawings.
In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. "And/or" herein merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone.
In the following, the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present embodiment, unless otherwise specified, the meaning of "plurality" is two or more.
Each set of training data in the training data set of a neural network model includes an input to the neural network model and a corresponding label, the label representing the true output value corresponding to the input data. The training process of a neural network model continuously updates the weight parameters of the model so that, after the input data of each set of training data is fed in, the output of the neural network model gradually fits the label in that training data. The fit is typically evaluated with a loss function, whose value measures the gap between the output of the neural network model and the true output; that is, the smaller the value of the loss function, the better the neural network model is considered to fit. In general, a neural network model with a relatively complex structure (many weight parameters and high structural complexity, referred to as a large model for short) has stronger fitting capability, that is, stronger learning capability, while a neural network model with a relatively simple structure (few weight parameters and low structural complexity, referred to as a small model for short) has relatively weaker fitting capability. Because a large model has many weight parameters, its inference time is longer. The inference time of a neural network model refers to the time from when data is input into the model until the output result is computed.
Although a large model has strong expressive ability, its long inference time makes it difficult to deploy in practice. In actual application scenarios, an electronic device needs to obtain the output of the model within a short time. For example, speech recognition on a mobile phone, camera shooting or burst shooting programs on a mobile phone, and two-dimensional code detection and recognition all need to obtain the output of a neural network model within a short time. To improve the inference speed of a neural network model on an electronic device, besides designing a lightweight neural network structure, compression can be applied to the neural network model after training is completed. The model compression techniques involved in the present application include structured pruning and quantization.
Structured pruning and quantization are described below.
(1) Structured pruning improves the inference speed of a neural network model by pruning the structure of the model and reducing the number of its weight parameters; structured pruning therefore changes the structure of the neural network model.
Structured pruning includes three pruning methods.
Mode one: structured pruning of convolution layer dimensions. The neural network model (e.g., convolutional neural network model) is assumed to comprise 5 convolutional layers, respectively [ L1, L2, L3, L4, L5]. Wherein the second layer L2 includes 8 convolution kernels of 3*3 shapes, and each of the 8 convolution kernels of L2 has 3*3 =9 weight parameters, and the total number of weight parameters of the second layer L2 is 8×3×3=72. Structured pruning of convolutional layer dimensions may directly prune one convolutional layer of the neural network model, such as pruning the fourth convolutional layer L4, leaving only 4 convolutional layers of the pruned neural network model, such as [ L1, L2, L3, L5].
Mode two: structured pruning in the convolution kernel dimension (also referred to as structured pruning in the channel dimension, simply channel pruning). For example, the convolution kernels of one or more convolution layers of the neural network model may be pruned to reduce the number of channels of the convolution layer (the number of channels of the convolution layer is equal to the number of convolution kernels of the layer). For example, 8 convolution kernels in the second convolution layer L2 are pruned into 5 convolution kernels, and the pruned L2 layer of the neural network model has 5 convolution kernels with a shape of 3*3.
Mode three: structured pruning of convolution kernel shape dimensions. For example, the shape of the convolution kernel of one or more convolution layers in the neural network model may be pruned. For example, the shape of 8 convolution kernels in the second layer L2 is clipped from 3*3 to 2×2, and each of the 8 convolution kernels in the clipped L2 layer has 2×2=4 weight parameters.
The pruning in the embodiment of the application refers to channel pruning, namely pruning the convolution kernels of certain convolution layers in the model to reduce the number of the convolution kernels of the convolution layers, and the model reasoning speed is improved by simplifying the model structure and reducing the number of weight parameters (namely the complexity of the model).
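As a minimal illustration of channel pruning in isolation (a sketch under assumed shapes, not the pruning criterion of the application), the following PyTorch snippet removes 3 of the 8 convolution kernels of a layer shaped like L2; keep_idx is an arbitrarily chosen set of kernels to retain.

```python
import torch
import torch.nn as nn

# Hypothetical layer standing in for L2: 8 convolution kernels of shape 3x3.
conv_l2 = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, bias=True)

# Assume kernels 0, 2, 3, 5, 7 are kept and the other three are pruned (channel pruning).
keep_idx = torch.tensor([0, 2, 3, 5, 7])

pruned_l2 = nn.Conv2d(in_channels=4, out_channels=len(keep_idx), kernel_size=3, bias=True)
with torch.no_grad():
    pruned_l2.weight.copy_(conv_l2.weight[keep_idx])  # keep only the selected kernels
    pruned_l2.bias.copy_(conv_l2.bias[keep_idx])

# The next layer's in_channels would also have to shrink from 8 to 5 accordingly.
```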
In the embodiments of the application, the structured pruning process first requires evaluating the importance of each convolution kernel in the neural network model, that is, judging the importance of each convolution kernel in a convolution layer according to certain criteria; channel pruning is then performed in order of increasing importance, pruning convolution kernels of low importance and retaining convolution kernels of high importance, so as to preserve the accuracy of the pruned model.
(2) Quantization refers to converting the weights and activation parameters of a model from floating point (e.g., float32) to lower-precision integer types (e.g., int16 or int8) to save storage and computation overhead. The quantization involved in the embodiments of the application specifically refers to quantization-aware training, in which the quantized model learns the rounding error introduced by quantization, so as to improve the accuracy and effect of the final quantized model.
During training of the neural network model, pseudo-quantization (fake-quantization) nodes are inserted into all or some of the layers to be quantized, and the rounding errors introduced by quantization are simulated through quantization followed by dequantization, so that the neural network model can better adapt to the errors caused by quantization. Take quantization-aware training of all convolution layers of a neural network model (e.g., a convolutional neural network model) as an example, that is, pseudo-quantization nodes are inserted into all convolution layers. During quantization-aware training of all convolution layers of the convolutional neural network model, before the convolution computation between the weights of a convolution layer and the input feature map is performed, the weights of the convolution layer and the input feature map are each quantized and then dequantized. The quantization of the convolution layer weights and of the feature map may convert floating-point numbers into integer types of different bit widths; for example, the quantization of the weights may convert 32-bit floating-point numbers (float32) into 8-bit integers (int8), and the quantization of the feature map may convert 32-bit floating-point numbers (float32) into 16-bit integers (int16).
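The following sketch illustrates what such a pseudo-quantization (quantize-then-dequantize) node computes for the weights, assuming symmetric int8 quantization; the scale formula and function name are illustrative and not taken from the application.

```python
import torch

def fake_quantize_int8(w: torch.Tensor) -> torch.Tensor:
    """Simulate int8 rounding error: quantize then immediately dequantize."""
    # Symmetric quantization: scale chosen from the observed maximum absolute value.
    scale = w.abs().max() / 127.0
    w_int8 = torch.clamp(torch.round(w / scale), -128, 127)  # quantize to the int8 range
    return w_int8 * scale                                     # dequantize back to float

# During quantization-aware training, the convolution would use
# fake_quantize_int8(conv.weight) instead of conv.weight, so the model
# learns to tolerate the rounding error introduced by quantization.
```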
When the distribution of the weight parameters in a convolution layer is uniform, as shown in FIG. 1, in a coordinate system with the weight parameter value on the X axis (horizontal axis) and the count on the Y axis (vertical axis), the counts of the different weight parameter values differ little, the distribution curve of the weight parameters of the convolution layer is relatively smooth, and the distribution of the weight parameters of the convolution layer can be considered uniform. In an actual neural network model, however, a considerable number of the weight parameters of a convolution layer are concentrated near the average value of the weight parameters, and the distribution of the weight parameters of the convolution layer is uneven: as shown in FIG. 2, in a coordinate system with the weight parameter value on the X axis (horizontal axis) and the count on the Y axis (vertical axis), the counts of the different weight parameter values differ greatly, the distribution curve of the weight parameters of the convolution layer is steep, and the distribution of the weight parameters of the convolution layer can be considered uneven. The more uneven the distribution of the weight parameters of a convolution layer, the less friendly it is to quantization training, and the more severe the accuracy loss of the quantized model.
In view of this, an embodiment of the present application provides a model compression method. First, a plurality of weight parameters corresponding to a convolution layer to be pruned in a neural network model to be compressed are obtained, the plurality of weight parameters including the weight parameters of each convolution kernel in the convolution layer to be pruned. Second, a first interval is determined according to the plurality of weight parameters, where the fewer weight parameters fall within the first interval, the more uniform the distribution of the weight parameters of the convolution layer to be pruned. Finally, channel pruning is performed on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the magnitude of the weight parameters of each convolution kernel, and the pruned neural network model is quantized.
Based on the above technical solution, the model compression method of the embodiments of the application first performs channel pruning on the neural network model to be compressed, and then quantizes the pruned neural network model, so as to improve the inference speed of the model. In the pruning process, channel pruning is performed on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the magnitude of the weight parameters of each convolution kernel. The magnitude of the weight parameters of a convolution kernel represents its importance; the fewer weight parameters of the convolution kernels fall within the first interval, the more uniform the distribution of the weight parameters in the convolution layer to be pruned, and the smaller the accuracy loss of the quantized model; conversely, the more weight parameters of the convolution kernels fall within the first interval, the more uneven the distribution of the weight parameters in the convolution layer to be pruned, and the greater the accuracy loss of the quantized model. Therefore, in the embodiments of the application, both the magnitude of the weight parameters of each convolution kernel and the number of weight parameters of each convolution kernel that fall within the first interval are considered during pruning, so that the loss of model accuracy after the pruned neural network model is quantized can be reduced while the inference speed of the neural network model is improved.
In summary, the embodiments of the application take quantization friendliness into account: in the channel pruning performed before quantization, the order in which convolution kernels are pruned is based not only on the importance of the convolution kernels but also on the distribution of their weight parameters, so that the model compression method of the embodiments of the application can balance the accuracy of the pruned model and the accuracy of the quantized model obtained from the pruned neural network model.
A model compression method according to an embodiment of the present application will be described in detail with reference to the accompanying drawings, and fig. 3 is an exemplary flowchart of the model compression method.
Step S110: acquire a plurality of weight parameters corresponding to the convolution layer to be pruned in the neural network model to be compressed.
Specifically, the neural network model to be compressed is a trained floating-point model, that is, the weights of each convolution layer in the neural network model are floating-point numbers.
In the embodiments of the application, the plurality of weight parameters corresponding to a convolution layer include the weight parameters of each convolution kernel in the convolution layer to be pruned. For example, the plurality of weight parameters corresponding to the convolution layer may include all weight parameters of all convolution kernels in the convolution layer to be pruned; or they may include only part of the weight parameters of each convolution kernel in the convolution layer to be pruned; or they may include all weight parameters of some convolution kernels in the convolution layer to be pruned and part of the weight parameters of the remaining convolution kernels.
In this embodiment, the convolution layer to be pruned in the neural network model to be compressed is determined first. The convolution layer to be pruned may be a convolution layer whose pruning yields a larger improvement in model performance; it may be a convolution layer whose pruning has little influence on model accuracy; or it may be a convolution layer with a larger number of convolution kernels in the neural network model.
Assume that the neural network model to be compressed is denoted M and has n convolution layers, which can be represented as L = [l(1), l(2), l(3), ..., l(n-1), l(n)]. The numbers of convolution kernels in the n convolution layers can be denoted N = [N(1), N(2), N(3), ..., N(n-1), N(n)], where the first convolution layer l(1) has N(1) convolution kernels, the second convolution layer l(2) has N(2) convolution kernels, the third convolution layer l(3) has N(3) convolution kernels, ..., and the n-th convolution layer l(n) has N(n) convolution kernels. In the i-th convolution layer l(i), the first convolution kernel is denoted F(i,1), the second convolution kernel is denoted F(i,2), ..., and the N(i)-th convolution kernel is denoted F(i,N(i)); the convolution kernels of the i-th convolution layer l(i) can therefore be written as F(i) = [F(i,1), F(i,2), ..., F(i,N(i)-1), F(i,N(i))].
In the embodiments of the application, some of the n convolution layers of the neural network model M to be compressed may be determined as convolution layers to be pruned, or all of the n convolution layers may be determined as convolution layers to be pruned, which is not limited herein.
Assume that, in a coordinate system with the weight parameter value as the abscissa and the count as the ordinate, the distribution of the weight parameters of a certain convolution layer follows a long-tail distribution as shown in FIG. 4: a considerable number of the weight parameters of the convolution layer are concentrated near the average value of the weights. The more weight parameters are concentrated near the average value of the weights, the more uneven the distribution of the weight parameters and the more severe the model accuracy loss after quantization. At the tails of the long-tail distribution there are a small number of weight parameters with large absolute values, which are called outliers. These outliers are not only unfriendly to quantization training, but also distort the evaluation of convolution kernel importance during pruning. Therefore, it is desirable that the number of weight parameters concentrated near the average value of the weights differ little from the number of weight parameters at other positions, that is, that the distribution curve of the weight parameters of the convolution layer be smooth and the distribution of the weight parameters be uniform, and also to eliminate or reduce the effect of outliers on quantization training and on the evaluation of convolution kernel importance.
In one embodiment, the plurality of weight parameters corresponding to the convolution layer to be pruned in the step S110 may be a plurality of initial weight parameters corresponding to the convolution layer to be pruned in the neural network model to be compressed (i.e. the weight parameters in the convolution layer that are not processed in any way).
In another embodiment, the plurality of weight parameters corresponding to the convolution layer to be pruned in step S110 may be the weight parameters obtained after outlier processing is applied to the plurality of initial weight parameters corresponding to the convolution layer to be pruned in the neural network model to be compressed, so as to eliminate or reduce the influence of outliers on quantization training and on the evaluation of convolution kernel importance.
Step S120: determine the first interval according to the plurality of weight parameters.
Specifically, the first interval is the interval corresponding to the region in which weight parameters tend to concentrate. The fewer weight parameters fall within the first interval, the more uniform the distribution of the weight parameters of the convolution layer to be pruned and the smaller the accuracy loss of the quantized model; conversely, the more weight parameters fall within the first interval, the more uneven the distribution of the weight parameters of the convolution layer to be pruned and the greater the accuracy loss of the quantized model. In a coordinate system with the weight parameter value on the horizontal axis and the count on the vertical axis, the smoother the resulting distribution curve of a convolution layer, the more uniform the distribution of its weight parameters; the steeper the distribution curve formed by the weight parameters of a convolution layer, the more uneven their distribution. For example, FIG. 4 shows the weight parameters of a certain convolution layer in a long-tail distribution, where the first upper threshold of the first interval is denoted T_up and the first lower threshold of the first interval is denoted T_low; it can be seen that a large number of the weight parameters of the convolution layer fall between T_low and T_up, the distribution curve of the weight parameters is steep, and the distribution of the weight parameters of the convolution layer is uneven, which is unfriendly to quantization training and leads to a large accuracy loss of the quantized model.
Similar to step S110 above, the weight parameters mentioned in step S120 may be the initial weight parameters corresponding to the convolution layer to be pruned in the neural network model to be compressed (i.e., weight parameters of the convolution layer without any processing), or they may be the weight parameters obtained after outliers among the plurality of initial weight parameters corresponding to the convolution layer to be pruned have been processed.
In one possible implementation, determining the first interval according to the plurality of weight parameters includes: determining a first average value and a first standard deviation of the plurality of weight parameters; and calculating a first interval according to the first average value, the first standard deviation and a first preset threshold value, wherein the first preset threshold value is smaller than 1.
Specifically, take as an example the case where the plurality of weight parameters corresponding to a convolution layer are the initial weight parameters corresponding to the convolution layer to be pruned in the neural network model to be compressed (i.e., the weight parameters of the convolution layer without any processing). Assume that the n first average values corresponding to the n convolution layers are [u(1), u(2), u(3), ..., u(n-1), u(n)] and the n first standard deviations corresponding to the n convolution layers are [s(1), s(2), s(3), ..., s(n-1), s(n)]. The n first upper thresholds corresponding to the n convolution layers are [t_up(1), t_up(2), ..., t_up(n-1), t_up(n)], and the n first lower thresholds corresponding to the n convolution layers are [t_low(1), t_low(2), ..., t_low(n-1), t_low(n)], where the first convolution layer l(1) corresponds to the first upper threshold t_up(1) and the first lower threshold t_low(1); the second convolution layer l(2) corresponds to the first upper threshold t_up(2) and the first lower threshold t_low(2); ...; and the n-th convolution layer l(n) corresponds to the first upper threshold t_up(n) and the first lower threshold t_low(n).
In this embodiment, the first interval may be calculated by adding to and subtracting from the average value the product of the first preset threshold and the standard deviation, where the first preset threshold is smaller than 1; for example, the first preset threshold may be 0.5, 0.4, 0.35, or 0.3. The specific value of the first preset threshold may be determined empirically and is not limited herein.
In practice, calculating the first interval according to the first average value, the first standard deviation, and the first preset threshold includes: adding the product of the first preset threshold and the first standard deviation to the first average value to obtain the first upper threshold of the first interval; and subtracting the product of the first preset threshold and the first standard deviation from the first average value to obtain the first lower threshold of the first interval.
Specifically, assume that the first preset threshold is denoted t. Then the first upper threshold of the i-th convolution layer l(i) is t_up(i) = u(i) + t × s(i), and the first lower threshold of the i-th convolution layer l(i) is t_low(i) = u(i) − t × s(i), where u(i) and s(i) are respectively the first average value and the first standard deviation corresponding to the i-th convolution layer, and t is smaller than 1.
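A minimal sketch of this per-layer computation, assuming the weights of one convolution layer are given as a tensor and using 0.35 purely as an illustrative value of the first preset threshold t:

```python
import torch

def first_interval(layer_weights: torch.Tensor, t: float = 0.35):
    """Compute the first interval [t_low, t_up] = [mean - t*std, mean + t*std]."""
    w = layer_weights.flatten()
    mean = w.mean()
    std = w.std()
    t_up = mean + t * std    # first upper threshold
    t_low = mean - t * std   # first lower threshold
    return t_low.item(), t_up.item()

# Example: weights of a convolution layer with 8 kernels of shape 3x3 and 4 input channels.
weights = torch.randn(8, 4, 3, 3)
t_low, t_up = first_interval(weights)
```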
Step S130: and carrying out channel pruning on the neural network model to be compressed according to the number of weight parameters falling into the first interval in the weight parameters of each convolution kernel and the size of the weight parameters of each convolution kernel.
Specifically, the channel pruning can be performed on the convolution kernels according to the sequence that the number of weight parameters falling into the first interval in the weight parameters of each convolution kernel is from more to less; and channel pruning is carried out on the convolution kernels according to the sequence from small to large of the sum of the absolute values of the weight parameters of each convolution kernel. Or a pruning index can be calculated according to the sum of the number of the weight parameters falling into the first interval in the weight parameters of each convolution kernel and the absolute value of the weight parameters of each convolution kernel, and channel pruning is carried out according to the pruning index.
In one possible implementation manner, performing channel pruning on the neural network model to be compressed according to the number of weight parameters falling into the first interval in the weight parameters of each convolution kernel and the size of the weight parameters of each convolution kernel includes: and performing channel pruning on the neural network model to be compressed according to the sum of the number of weight parameters falling into the first interval in the weight parameters of each convolution kernel and the absolute value of the weight parameters of each convolution kernel.
Specifically, the magnitude of the weight parameter of each convolution kernel may be represented by a sum E of absolute values of the weight parameters in each convolution kernel, e.g., a sum of absolute values of weight parameters of N convolution kernels in an i-th layer convolution layer=[/>,/>, ……,/>,/>]. Wherein the sum of absolute values of weight parameters of the 1 st convolution kernel in the ith convolution layer is/>The sum of the absolute values of the weight parameters of the 2 nd convolution kernel in the i-th convolution layer is/>… … The sum of the absolute values of the weight parameters of the nth convolution kernel in the ith convolution layer is/>
In another possible implementation, the magnitude of the weight parameter of the convolution kernel may also be represented by an average of all weight parameters in the convolution kernel.
In yet another possible implementation, the magnitude of the weight parameter of the convolution kernel may also be represented by the average of the maximum and minimum values of all weight parameters in the convolution kernel.
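For illustration, the following sketch computes the three per-kernel magnitude measures mentioned above (sum of absolute values, average, and the midpoint of the maximum and minimum); the tensor layout [num_kernels, in_channels, k, k] is an assumed PyTorch convention.

```python
import torch

def kernel_magnitudes(layer_weights: torch.Tensor):
    """layer_weights: [num_kernels, in_channels, k, k] -> per-kernel magnitude measures."""
    w = layer_weights.flatten(start_dim=1)                   # one row per convolution kernel
    abs_sum = w.abs().sum(dim=1)                             # sum of absolute values (E)
    mean = w.mean(dim=1)                                     # average of all weight parameters
    mid = (w.max(dim=1).values + w.min(dim=1).values) / 2    # midpoint of maximum and minimum
    return abs_sum, mean, mid

abs_sum, mean, mid = kernel_magnitudes(torch.randn(8, 4, 3, 3))
```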
Step S140: and quantifying the neural network model after pruning.
Specifically, structured pruned neural network modelsPerforming quantized perception training to enable the quantized perception model to be trained and converged on a data set to obtain a final quantized model/>
Since the quantized perceptual training involves inserting pseudo-quantization nodes into certain layers (convolution layer and activation layer and other layers) of the original floating point model, the pseudo-quantization nodes make maximum and minimum statistics on the weight parameters of the layers, namely in the quantized training processAnd/>For subsequent dequantization calculations.
When the convolution layer of the model is quantized, the quantization can be simply divided into two types of tensor quantization and channel quantization according to the statistical mode of the maximum value and the minimum value. A plurality of convolution kernels (i.e. a plurality of channels) are arranged in a certain layer of convolution layer of the model, and tensor quantization refers to statistics of a maximum value and a minimum value of all weight parameters of the whole convolution layer, namely the weight parameters of all channels share the maximum value and the minimum value; channel level quantization refers to counting a maximum value and a minimum value for each channel weight parameter of the convolution layer.
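The difference between the two statistics modes can be sketched as follows, assuming asymmetric min/max quantization to int8 purely for illustration:

```python
import torch

def minmax_scale_zero_point(w_min, w_max, qmin=-128, qmax=127):
    """Derive an affine quantization scale and zero-point from observed min/max values."""
    scale = (w_max - w_min) / (qmax - qmin)
    zero_point = qmin - torch.round(w_min / scale)
    return scale, zero_point

weights = torch.randn(8, 4, 3, 3)  # one convolution layer with 8 kernels (channels)

# Per-tensor: one min/max pair shared by all channels of the layer.
scale_t, zp_t = minmax_scale_zero_point(weights.min(), weights.max())

# Per-channel: one min/max pair per convolution kernel (per output channel).
w = weights.flatten(start_dim=1)
scale_c, zp_c = minmax_scale_zero_point(w.min(dim=1).values, w.max(dim=1).values)
```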
According to the model compression method of the embodiments of the application, channel pruning is first performed on the neural network model to be compressed, and the pruned neural network model is then quantized, so as to improve the inference speed of the model. In the pruning process, channel pruning is performed on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall within the first interval and the magnitude of the weight parameters of each convolution kernel. The magnitude of the weight parameters of a convolution kernel represents the importance of that convolution kernel; the fewer weight parameters of the convolution kernels fall within the first interval, the more uniform the distribution of the weight parameters of the convolution layer to be pruned, and the smaller the accuracy loss of the quantized model; conversely, the more weight parameters of the convolution kernels fall within the first interval, the more uneven the distribution of the weight parameters of the convolution layer to be pruned, and the greater the accuracy loss of the quantized model. Therefore, in the embodiments of the application, both the magnitude of the weight parameters of each convolution kernel and the number of weight parameters of each convolution kernel that fall within the first interval are considered during pruning, so that the loss of model accuracy after the pruned neural network model is quantized can be reduced while the inference speed of the neural network model is improved.
The embodiment of the application also provides another model compression method, as shown in fig. 5, comprising the following steps.
Step S310: and acquiring a plurality of initial weight parameters corresponding to the convolution layer to be pruned in the neural network model to be compressed.
Step S320: a second upper threshold and a second lower threshold are determined based on the plurality of initial weight parameters.
Specifically, in this embodiment, the weight parameter greater than the second upper threshold and the weight parameter less than the second lower threshold are outliers, the second upper threshold is greater than the first upper threshold of the first section, and the second lower threshold is less than the first lower threshold of the first section. The multiple initial weight parameters of the convolution layer to be pruned may be displayed in a coordinate system with the initial weight parameters as abscissa and the number as ordinate, and a second upper threshold and a second lower threshold may be determined in the coordinate system according to the display result, for example, as shown in fig. 2Is determined as a second upper threshold,/>A second lower threshold is determined. Because the initial weight parameters of the convolution layers to be pruned are different, the second upper threshold and the second lower threshold of the convolution layers to be pruned are different.
Step S330: and determining a plurality of weight parameters corresponding to the convolution layers to be pruned according to the second upper limit threshold value, the second lower limit threshold value and the plurality of initial weight parameters.
Specifically, after the second upper threshold and the second lower threshold corresponding to each convolution layer to be pruned are determined, the initial weight parameters greater than the second upper threshold and the initial weight parameters less than the second lower threshold (i.e., the outliers) among the plurality of initial weight parameters corresponding to the convolution layer to be pruned can be found. The outliers are then processed: an initial weight parameter greater than the second upper threshold is replaced with the second upper threshold, and an initial weight parameter less than the second lower threshold is replaced with the second lower threshold. This avoids outliers with excessively large absolute values being unfriendly to quantization training, and at the same time prevents such outliers from distorting the evaluation of the importance of the convolution kernels during pruning.
The plurality of weight parameters corresponding to the convolution layer to be pruned in the embodiment of the application include: the weight parameters obtained after replacing, among the plurality of initial weight parameters, the initial weight parameters greater than the second upper threshold with the second upper threshold and replacing the initial weight parameters less than the second lower threshold with the second lower threshold.
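As a minimal sketch of this outlier replacement (not part of the original disclosure), assuming the second upper threshold and second lower threshold of a layer are already known and the weights are held in a NumPy array; the names t_up2 and t_low2 are illustrative only.

import numpy as np

def clip_outliers(initial_weights: np.ndarray, t_low2: float, t_up2: float) -> np.ndarray:
    # Replace initial weights above the second upper threshold with that threshold,
    # and initial weights below the second lower threshold with that threshold.
    return np.clip(initial_weights, t_low2, t_up2)

layer_weights = np.random.randn(8, 3, 3, 3)
pruning_ready_weights = clip_outliers(layer_weights, t_low2=-0.9, t_up2=0.9)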
In one possible implementation, determining the second upper threshold and the second lower threshold from the plurality of initial weight parameters includes: determining a second average value and a second standard deviation of the plurality of initial weight parameters; and calculating a second upper limit threshold and a second lower limit threshold according to the second average value, the second standard deviation and a second preset threshold, wherein the second preset threshold is more than or equal to 3.
Specifically, an example is described in which n convolution layers are each determined as a convolution layer to be pruned. For the n convolution layers L = [l_1, l_2, l_3, …, l_(n-1), l_n] of the neural network model M to be compressed, a second average value and a second standard deviation of the plurality of initial weight parameters corresponding to each convolution layer are calculated respectively. Assume that the n second average values corresponding to the n convolution layers of the neural network model M to be compressed are μ = [μ_1, μ_2, μ_3, …, μ_(n-1), μ_n], and that the n second standard deviations corresponding to the n convolution layers are σ = [σ_1, σ_2, σ_3, …, σ_(n-1), σ_n], where the second average value of the plurality of initial weight parameters corresponding to the first convolution layer l_1 is μ_1 and the second standard deviation of those parameters is σ_1; the second average value of the plurality of initial weight parameters corresponding to the second convolution layer l_2 is μ_2 and the second standard deviation is σ_2; …; the second average value of the plurality of initial weight parameters corresponding to the n-th convolution layer l_n is μ_n and the second standard deviation is σ_n.
In this embodiment, a preset multiple of the standard deviation is subtracted from or added to the average value to characterize how far a weight parameter deviates from the average value. Data lying more than 3 standard deviations below or above the average value are generally considered abnormal, i.e., they are treated as outliers in the embodiments of the present application. In the embodiment of the present application, the preset multiple is the second preset threshold, where the second preset threshold is greater than or equal to 3; for example, the second preset threshold may be 3 or 3.5.
In one possible implementation, calculating the second upper threshold and the second lower threshold according to the second average value, the second standard deviation and the second preset threshold includes: adding the second average value to the product of the second preset threshold and the second standard deviation to obtain the second upper threshold; and subtracting the product of the second preset threshold and the second standard deviation from the second average value to obtain the second lower threshold.
Specifically, the n second upper thresholds corresponding to the n convolution layers are T_up = [T_up1, T_up2, T_up3, …, T_up(n-1), T_upn], and the n second lower thresholds corresponding to the n convolution layers are T_low = [T_low1, T_low2, T_low3, …, T_low(n-1), T_lown], where the second upper threshold corresponding to the first convolution layer l_1 is T_up1 and the second lower threshold is T_low1; the second upper threshold corresponding to the second convolution layer l_2 is T_up2 and the second lower threshold is T_low2; …; the second upper threshold corresponding to the n-th convolution layer l_n is T_upn and the second lower threshold is T_lown. Assuming that the second preset threshold is denoted by λ, the second upper threshold of the i-th convolution layer l_i is T_upi = μ_i + λ·σ_i, and the second lower threshold of the i-th convolution layer l_i is T_lowi = μ_i − λ·σ_i, where λ ≥ 3.
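A rough sketch of this per-layer threshold calculation (not part of the original disclosure) could look as follows; the list of layer weight tensors and the value λ = 3 are assumptions for illustration.

import numpy as np

def second_thresholds(layer_weight_list, lam=3.0):
    # For each convolution layer, return (t_low2, t_up2) = (mu - lam*sigma, mu + lam*sigma),
    # where mu and sigma are the second average value and second standard deviation.
    thresholds = []
    for w in layer_weight_list:
        mu, sigma = float(w.mean()), float(w.std())
        thresholds.append((mu - lam * sigma, mu + lam * sigma))
    return thresholds

layers = [np.random.randn(8, 3, 3, 3), np.random.randn(16, 8, 3, 3)]
print(second_thresholds(layers, lam=3.0))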
Step S340: the first interval is determined according to the plurality of weight parameters.
It should be noted that, if the first upper threshold and the first lower threshold of the first interval are determined directly according to the initial weight parameter in the convolutional layer to be pruned, the second average value in the embodiment of the present application is the same as the first average value, and the second standard deviation is the same as the first standard deviation. If a first upper threshold and a first lower threshold of a first interval are determined according to weight parameters obtained after outlier processing is performed on initial weight parameters in a convolution layer to be pruned, a second average value in the embodiment of the application is different from the first average value, and a second standard deviation is different from the first standard deviation.
Step S350: and carrying out channel pruning on the neural network model to be compressed according to the number of weight parameters falling into the first interval in the weight parameters of each convolution kernel and the size of the weight parameters of each convolution kernel.
Step S360: and quantizing the pruned neural network model.
In this embodiment, step S340 is substantially the same as step S120 in the above embodiment, step S350 is substantially the same as step S130 in the above embodiment, and step S360 is substantially the same as step S140 in the above embodiment, so that repetition is avoided and detailed description is omitted.
For tensor-level quantization, all weight parameters of a single convolution layer in the model to be quantized share one maximum value and one minimum value for the quantization and inverse quantization calculations, so the post-quantization accuracy of the layer is affected by the distribution of the weight parameters. Assuming that the maximum value of the weight parameters in the convolution layer is 100, the minimum value is 0, and the remaining weight parameters are all around 50, then quantization and inverse quantization are performed with 100 as the maximum value and 0 as the minimum value, and the accuracy is affected. Similarly, in channel-level quantization, all weight parameters of each channel in a single convolution layer share one maximum value and one minimum value for the quantization and inverse quantization calculations; assuming that the maximum value of the weight parameters of a certain channel is 100, the minimum value is 0, and the remaining weight parameters are all around 50, the post-quantization accuracy is likewise affected.
It can be seen that, in both tensor-level quantization and channel-level quantization, in order to reduce the precision loss of the quantized model, the difference between the maximum value and the minimum value of the weight parameters in the model needs to be kept small. The points that determine the maximum and minimum of the weight parameters are the outliers distributed at the two ends of the distribution curve of the weight parameters. Therefore, in order to reduce the difference between the maximum value and the minimum value of the weight parameters in the convolution layers, in the embodiment of the application, before pruning, the initial weight parameters of each convolution layer are acquired, the outliers in each convolution layer are found, and the outliers are processed to reduce the difference between the maximum value and the minimum value of the weight parameters in the convolution layers, thereby further reducing the precision loss of the quantized model. Finding and processing the outliers in each convolution layer before pruning can also prevent the outliers from interfering with the calculation based on the convolution kernel weight parameters.
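To make the effect of outliers on min/max quantization concrete, a rough 8-bit asymmetric quantize/dequantize sketch is given below; it is an illustrative assumption, not a quantization scheme mandated by the embodiment.

import numpy as np

def quant_dequant_uint8(w):
    # Asymmetric min/max quantization to 8 bits followed by inverse quantization.
    w_min, w_max = float(w.min()), float(w.max())
    scale = max((w_max - w_min) / 255.0, 1e-12)   # guard against a constant tensor
    q = np.round((w - w_min) / scale)             # integer levels 0..255
    return q * scale + w_min

# Mostly ~50, plus two outliers (0 and 100) that stretch the shared [min, max] range.
w = np.array([0.0, 100.0] + [50.0 + 0.01 * i for i in range(100)])
w_clipped = np.clip(w, 49.0, 52.0)

err_outliers = np.abs(quant_dequant_uint8(w) - w).mean()
err_clipped = np.abs(quant_dequant_uint8(w_clipped) - w_clipped).mean()
print(err_outliers, err_clipped)   # the reconstruction error is larger when outliers widen the range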
The embodiment of the application also provides a model compression method, as shown in fig. 6, comprising the following steps.
Step S410: and acquiring a plurality of weight parameters corresponding to the convolution layer to be pruned in the neural network model to be compressed.
Step S420: the first interval is determined according to the plurality of weight parameters.
The step S410 is substantially the same as the step S110 in the above embodiment, and the step S420 is substantially the same as the step S120 in the above embodiment, and for avoiding repetition, the description is omitted in this embodiment.
Step S430: calculating the ratio of the number of weight parameters of each convolution kernel falling within the first interval to the total number of weight parameters of that convolution kernel.
Specifically, assume that the total number of weight parameters in the j-th convolution kernel F_ij of the i-th convolution layer is n_ij, and that the number of weight parameters of the j-th convolution kernel F_ij of the i-th convolution layer that fall within the first interval is k_ij. Then the ratio corresponding to the j-th convolution kernel F_ij is r_ij = k_ij / n_ij.
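Under the notation above, step S430 can be sketched as follows (not part of the original disclosure); the interval bounds are assumed to be available as t_low1 and t_up1.

import numpy as np

def in_interval_ratio(kernel: np.ndarray, t_low1: float, t_up1: float) -> float:
    # r_ij = k_ij / n_ij: weights of one convolution kernel falling in the first interval,
    # divided by the total number of weights in that kernel.
    k_ij = np.count_nonzero((kernel >= t_low1) & (kernel <= t_up1))
    n_ij = kernel.size
    return k_ij / n_ij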
Step S440: and performing channel pruning on the neural network model to be compressed according to the ratio corresponding to each convolution kernel and the sum of the absolute values of the weight parameters of each convolution kernel.
Specifically, the larger the number of weight parameters of a convolution kernel that fall within the first interval, the larger the ratio corresponding to that convolution kernel, and the larger its influence on the accuracy of the quantized model. Thus, channel pruning may be performed on the convolution kernels in descending order of the ratio.
In one possible implementation, performing channel pruning on the neural network model to be compressed according to the ratio corresponding to each convolution kernel and the sum of the absolute values of the weight parameters of each convolution kernel includes: calculating the product of the value obtained by subtracting the ratio corresponding to each convolution kernel from a preset value and the sum of the absolute values of the weight parameters of each convolution kernel; and performing channel pruning on convolution kernels whose product is smaller than a third preset threshold in the neural network model to be compressed.
Specifically, the pruning index in the embodiment of the present application is represented by the product of the value obtained by subtracting the ratio corresponding to each convolution kernel from the preset value and the sum of the absolute values of the weight parameters of that convolution kernel, that is, pruning index = (preset value − ratio corresponding to the convolution kernel) × sum of absolute values of the weight parameters of the convolution kernel. Assume that the pruning index corresponding to the j-th convolution kernel F_ij of the i-th convolution layer is denoted by P_ij and that the preset value is denoted by a. The ratio corresponding to the j-th convolution kernel F_ij of the i-th convolution layer is r_ij, and the sum of the absolute values of its weight parameters is S_ij. Then the pruning index P_ij = (a − r_ij) × S_ij, where the preset value a may be 1.
The larger S_ij is, the larger the weight parameters of the convolution kernel are and the greater its importance, so the convolution kernel needs to be retained; the larger (a − r_ij) is, the smaller the number of weight parameters of the convolution kernel falling into the first interval, the more uniform the distribution of its weight parameters, and the smaller the influence on the accuracy of the quantized model, so the convolution kernel likewise needs to be retained. Therefore, the convolution kernels with larger pruning indexes P_ij need to be retained, while the convolution kernels with smaller pruning indexes P_ij are pruned. After the pruning indexes of all the convolution kernels in the convolution layer to be pruned are calculated, the convolution kernels with small pruning indexes are preferentially pruned in order of the pruning index from small to large.
In the embodiment of the application, a third preset threshold may be set, and channel pruning is performed on the convolution kernels whose product (i.e., the pruning index) is smaller than the third preset threshold. The third preset threshold may be set according to an empirical value, which is not limited herein. Alternatively, the number of kernels to prune may be preset, channel pruning is performed in order of the pruning index from small to large, and the number of pruned convolution kernels is the same as the preset pruning number.
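A sketch of the pruning index P_ij = (a − r_ij) × S_ij and of the selection of kernels to prune is given below (not part of the original disclosure); the preset value a = 1 follows the example in the text, while the function names and the fixed pruning number are illustrative assumptions.

import numpy as np

def pruning_indexes(layer_weights: np.ndarray, t_low1: float, t_up1: float, a: float = 1.0) -> np.ndarray:
    # layer_weights: (out_channels, in_channels, kH, kW); one pruning index per convolution kernel.
    flat = layer_weights.reshape(layer_weights.shape[0], -1)
    r = ((flat >= t_low1) & (flat <= t_up1)).mean(axis=1)   # ratio r_ij per kernel
    s = np.abs(flat).sum(axis=1)                            # sum of absolute weights S_ij
    return (a - r) * s

def kernels_to_prune(indexes: np.ndarray, num_prune: int) -> np.ndarray:
    # Prune the kernels with the smallest pruning index, in ascending order.
    return np.argsort(indexes)[:num_prune]

w = np.random.randn(16, 8, 3, 3)
idx = pruning_indexes(w, t_low1=-0.2, t_up1=0.2)
print(kernels_to_prune(idx, num_prune=4))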
Step S450: and quantizing the pruned neural network model.
In this embodiment, the step S450 is substantially the same as the step S140 in the above embodiment, and in order to avoid repetition, the description is omitted.
It should be noted that the model compression method in the embodiment of the application is suited to application scenarios that have a low requirement on model accuracy and lean more toward improving the performance (i.e., the inference speed) of the model, or scenarios in which the currently used convolutional neural network has a great deal of parameter redundancy so that the model can be further compressed. For example, in a mobile phone snapshot or continuous shooting scene, the processing speed requirement on the model is high; in scenes such as two-dimensional code detection and face recognition at a gate, although the accuracy requirement on the model is high, the currently used model may have parameter redundancy, so the model can be further compressed and its performance improved without losing accuracy.
It should be understood that the above description is intended to aid those skilled in the art in understanding the embodiments of the present application, and is not intended to limit the embodiments of the present application to the specific values or particular scenarios illustrated. It will be apparent to those skilled in the art from the foregoing description that various equivalent modifications or variations can be made, and such modifications or variations are intended to be within the scope of the embodiments of the present application.
Fig. 7 is a schematic diagram of a model compressing apparatus 500 according to an embodiment of the present application, including an obtaining module 510 and a processing module 520.
The obtaining module 510 is configured to obtain a plurality of weight parameters corresponding to a convolution layer to be pruned in the neural network model to be compressed, where the plurality of weight parameters include weight parameters of each convolution kernel in the convolution layer to be pruned.
The processing module 520 is configured to determine a first interval according to the plurality of weight parameters, where the fewer the number of weight parameters falling into the first interval, the more uniform the distribution of the weight parameters of the convolution layer to be pruned; perform channel pruning on the neural network model to be compressed according to the number of weight parameters of each convolution kernel falling into the first interval and the magnitude of the weight parameters of each convolution kernel; and quantize the pruned neural network model.
The model compression device 500 in the embodiment of the application is used for performing channel pruning on the neural network model to be compressed and then quantizing the pruned neural network model, so as to improve the inference speed of the model. In the pruning process, the processing module 520 performs channel pruning on the neural network model to be compressed according to the number of weight parameters of each convolution kernel that fall into the first interval and the magnitude of the weight parameters of each convolution kernel. The weight parameters of a convolution kernel represent the importance of the convolution kernel; the smaller the number of weight parameters of each convolution kernel falling into the first interval, the more uniform the distribution of the weight parameters of the convolution layer to be pruned, and the smaller the precision loss of the quantized model; conversely, the larger the number of weight parameters of each convolution kernel falling into the first interval, the less uniform the distribution of the weight parameters of the convolution layer to be pruned, and the greater the precision loss of the quantized model. Therefore, in the embodiment of the application, both the magnitude of the weight parameters of each convolution kernel and the number of weight parameters of each convolution kernel falling into the first interval are considered during pruning, so that the model precision loss after the pruned neural network model is quantized can be reduced while the inference speed of the neural network model is improved.
Optionally, in some embodiments, the processing module 520 is specifically configured to determine a first average value and a first standard deviation of the plurality of weight parameters; and calculating a first interval according to the first average value, the first standard deviation and a first preset threshold value, wherein the first preset threshold value is smaller than 1.
Optionally, in some embodiments, the processing module 520 is specifically configured to add the first average value to a product of the first preset threshold value and the first standard deviation to obtain a first upper threshold value of the first interval; and subtracting the product of the first preset threshold value and the first standard deviation from the first average value to obtain a first lower threshold value of the first interval.
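As a brief illustrative sketch (not part of the original disclosure), the first interval described above could be computed as follows; the value t = 0.5 for the first preset threshold is only an assumed example satisfying t < 1.

import numpy as np

def first_interval(weights: np.ndarray, t: float = 0.5):
    # First interval [mu - t*sigma, mu + t*sigma], with the first preset threshold t < 1.
    mu, sigma = float(weights.mean()), float(weights.std())
    return mu - t * sigma, mu + t * sigma

layer_weights = np.random.randn(8, 3, 3, 3)
t_low1, t_up1 = first_interval(layer_weights, t=0.5)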
Optionally, in some embodiments, the obtaining module 510 is specifically configured to obtain a plurality of initial weight parameters corresponding to a convolution layer to be pruned in the neural network model to be compressed; determine a second upper threshold and a second lower threshold according to the plurality of initial weight parameters, where the first interval is located between the second upper threshold and the second lower threshold; and determine a plurality of weight parameters corresponding to the convolution layer to be pruned according to the second upper threshold, the second lower threshold and the plurality of initial weight parameters, where the plurality of weight parameters include: the weight parameters obtained after replacing, among the plurality of initial weight parameters, the initial weight parameters greater than the second upper threshold with the second upper threshold and replacing the initial weight parameters less than the second lower threshold with the second lower threshold.
Before pruning, after determining the second upper threshold and the second lower threshold corresponding to each convolution layer to be pruned, the embodiment of the application can find the initial weight parameters greater than the second upper threshold and the initial weight parameters less than the second lower threshold (i.e., the outliers). The outliers are then processed: an initial weight parameter greater than the second upper threshold is replaced with the second upper threshold, and an initial weight parameter less than the second lower threshold is replaced with the second lower threshold. This avoids outliers with excessively large absolute values being unfriendly to quantization training, and at the same time prevents such outliers from distorting the evaluation of the importance of the convolution kernels during pruning.
Optionally, in some embodiments, the processing module 520 is specifically configured to determine a second average value and a second standard deviation of the plurality of initial weight parameters; and calculating the second upper threshold and the second lower threshold according to the second average value, the second standard deviation and a second preset threshold, wherein the second preset threshold is greater than or equal to 3.
Optionally, in some embodiments, the processing module 520 is specifically configured to add the second average value to a product of the second preset threshold value and the second standard deviation to obtain the second upper threshold value; and subtracting the product of the second preset threshold value and the second standard deviation from the second average value to obtain the second lower threshold value.
Optionally, in some embodiments, the processing module 520 is specifically configured to perform channel pruning on the neural network model to be compressed according to the number of weight parameters of each convolution kernel falling into the first interval and the sum of the absolute values of the weight parameters of each convolution kernel.
Optionally, in some embodiments, the processing module 520 is specifically configured to calculate a ratio of the number of weight parameters of each convolution kernel falling into the first interval to the total number of weight parameters of that convolution kernel; and perform channel pruning on the neural network model to be compressed according to the ratio of each convolution kernel and the weight parameters of each convolution kernel.
Optionally, in some embodiments, the processing module 520 is specifically configured to calculate a product of the value obtained by subtracting the ratio corresponding to each convolution kernel from a preset value and the sum of the absolute values of the weight parameters of each convolution kernel; and perform channel pruning on the convolution kernels whose product is smaller than a third preset threshold in the neural network model to be compressed.
The model compression apparatus 500 of the embodiment of the present application may correspond to performing the model compression method described in the embodiment of the present application, and the above and other operations and/or functions of each unit in the model compression apparatus 500 are respectively for implementing the corresponding flows of the methods in fig. 3, 5 and 6, which are not repeated herein for brevity.
Fig. 8 is a schematic block diagram of a model compressing apparatus 600 according to an embodiment of the present application. The model compression apparatus 600 includes: processor 610, memory 620, communication interface 630, bus 640.
It should be appreciated that the processor 610 in the model compression device 600 shown in fig. 8 may correspond to the processing module 520 in the model compression device 500 of fig. 7, and the communication interface 630 in the model compression device 600 may correspond to the obtaining module 510 in the model compression device 500.
Wherein the processor 610 may be coupled to a memory 620. The memory 620 may be used to store the program codes and data. Accordingly, the memory 620 may be a storage unit internal to the processor 610, an external storage unit independent of the processor 610, or a component including a storage unit internal to the processor 610 and an external storage unit independent of the processor 610.
Optionally, the model compression device 600 may also include a bus 640. The memory 620 and the communication interface 630 may be connected to the processor 610 through the bus 640. The bus 640 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus 640 may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one line is shown in fig. 8, but this does not mean that there is only one bus or only one type of bus.
It should be appreciated that, in embodiments of the present application, the processor 610 may be a central processing unit (CPU). The processor may also be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like. Alternatively, the processor 610 may employ one or more integrated circuits for executing related programs, so as to implement the technical solutions provided by the embodiments of the present application.
The memory 620 may include read only memory and random access memory, and provides instructions and data to the processor 610. A portion of the processor 610 may also include non-volatile random access memory. For example, the processor 610 may also store information of the device type.
The processor 610 executes computer-executable instructions in the memory 620 to perform the operational steps of the model compression method described above using hardware resources in the model compression device when the model compression device is running.
It should be understood that the model compressing apparatus 600 according to an embodiment of the present application may correspond to the model compressing apparatus 500 according to an embodiment of the present application and may correspond to the respective bodies performing the methods shown in fig. 3, 5 and 6 according to an embodiment of the present application, and that the above and other operations and/or functions of the respective modules in the model compressing apparatus 600 are respectively for implementing the respective flows of the methods in fig. 3, 5 and 6, and are not repeated herein for brevity.
The application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, the computer program comprises program instructions, and when the program instructions are executed, the model compression method provided by the embodiment of the application is realized.
The present application also provides a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the model compression method provided by the embodiments of the present application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that contains one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive (SSD).
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the embodiments of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing is merely a specific implementation of the embodiment of the present application, but the protection scope of the embodiment of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the embodiment of the present application, and the changes or substitutions are covered by the protection scope of the embodiment of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A model compression method, characterized in that the model is deployed in an electronic device, the model is used for processing image data acquired by the electronic device in a photographing mode and outputting an image corresponding to the image data, and the method comprises the following steps:
Acquiring a plurality of weight parameters corresponding to a convolution layer to be pruned in a neural network model to be compressed, wherein the plurality of weight parameters comprise weight parameters of each convolution kernel in the convolution layer to be pruned;
Determining a first interval according to the weight parameters, wherein the distribution of the weight parameters of the convolution layer to be pruned is more uniform as the number of the weight parameters falling into the first interval is smaller;
Calculating the ratio of the number of weight parameters of each convolution kernel that fall into the first interval to the total number of weight parameters of that convolution kernel;
Calculating the product of the value obtained by subtracting the ratio corresponding to each convolution kernel from the preset value and the sum of the absolute values of the weight parameters of each convolution kernel;
performing channel pruning on convolution kernels, of which the product is smaller than a third preset threshold value, in the neural network model to be compressed;
and quantizing the pruned neural network model.
2. The method of model compression of claim 1, wherein the determining a first interval from the plurality of weight parameters comprises:
Determining a first average value and a first standard deviation of the plurality of weight parameters;
And calculating the first interval according to the first average value, the first standard deviation and a first preset threshold value, wherein the first preset threshold value is smaller than 1.
3. The method of model compression according to claim 2, wherein calculating the first interval according to the first average value, the first standard deviation, and a first preset threshold value comprises:
Adding the first average value to the product of the first preset threshold value and the first standard deviation to obtain a first upper threshold value of the first interval;
Subtracting the product of the first preset threshold value and the first standard deviation from the first average value to obtain a first lower threshold value of the first interval.
4. A method of compressing a model according to any one of claims 1 to 3, wherein the obtaining a plurality of weight parameters corresponding to a convolutional layer to be pruned in the neural network model to be compressed includes:
Acquiring a plurality of initial weight parameters corresponding to a convolution layer to be pruned in the neural network model to be compressed;
Determining a second upper threshold and a second lower threshold according to the initial weight parameters, wherein the first interval is positioned between the second upper threshold and the second lower threshold;
Determining a plurality of weight parameters corresponding to the convolution layer to be pruned according to the second upper threshold, the second lower threshold and the plurality of initial weight parameters, wherein the plurality of weight parameters comprise: the weight parameters obtained after replacing, among the plurality of initial weight parameters, the initial weight parameters greater than the second upper threshold with the second upper threshold and replacing the initial weight parameters less than the second lower threshold with the second lower threshold.
5. The method of model compression of claim 4, wherein determining a second upper threshold and a second lower threshold from the plurality of initial weight parameters comprises:
determining a second average value and a second standard deviation of the plurality of initial weight parameters;
And calculating the second upper limit threshold and the second lower limit threshold according to the second average value, the second standard deviation and a second preset threshold, wherein the second preset threshold is more than or equal to 3.
6. The method of model compression according to claim 5, wherein calculating the second upper threshold and the second lower threshold according to the second average value, the second standard deviation, and a second preset threshold comprises:
adding the second average value to the product of the second preset threshold value and the second standard deviation to obtain a second upper limit threshold value;
And subtracting the product of the second preset threshold value and the second standard deviation from the second average value to obtain the second lower threshold value.
7. A model compression apparatus, characterized in that the model is deployed in an electronic device, the model is used for processing image data acquired by the electronic device in a photographing mode and outputting an image corresponding to the image data, and the apparatus comprises:
The acquisition module is used for acquiring a plurality of weight parameters corresponding to the convolution layers to be pruned in the neural network model to be compressed, wherein the plurality of weight parameters comprise weight parameters of each convolution kernel in the convolution layers to be pruned;
The processing module is used for determining a first interval according to the weight parameters, wherein the distribution of the weight parameters of the convolution layer to be pruned is more uniform as the number of the weight parameters falling into the first interval is smaller; and calculating the ratio of the number of the weight parameters of each convolution kernel falling into the first interval to the total number of the weight parameters of each convolution kernel, calculating the product of the value obtained by subtracting the ratio of each convolution kernel from the preset value and the sum of the absolute values of the weight parameters of each convolution kernel, performing channel pruning on the convolution kernels of which the product is smaller than a third preset threshold in the neural network model to be compressed, and quantizing the pruned neural network model.
8. The model compression apparatus according to claim 7, wherein
The processing module is specifically configured to determine a first average value and a first standard deviation of the plurality of weight parameters; and calculating the first interval according to the first average value, the first standard deviation and a first preset threshold value, wherein the first preset threshold value is smaller than 1.
9. The model compression apparatus according to claim 8, wherein
The processing module is specifically configured to add the first average value to a product of the first preset threshold value and the first standard deviation to obtain a first upper limit threshold value of the first interval;
and subtracting the product of the first preset threshold value and the first standard deviation from the first average value to obtain a first lower threshold value of the first interval.
10. The model compressing apparatus as recited in any one of claims 7 to 9, wherein,
The acquisition module is specifically configured to acquire a plurality of initial weight parameters corresponding to a convolution layer to be pruned in the neural network model to be compressed;
Determining a second upper threshold and a second lower threshold according to the plurality of initial weight parameters, wherein the first interval is located between the second upper threshold and the second lower threshold; and determining a plurality of weight parameters corresponding to the convolution layer to be pruned according to the second upper threshold, the second lower threshold and the plurality of initial weight parameters, wherein the plurality of weight parameters comprise: the weight parameters obtained after replacing, among the plurality of initial weight parameters, the initial weight parameters greater than the second upper threshold with the second upper threshold and replacing the initial weight parameters less than the second lower threshold with the second lower threshold.
11. The model compression apparatus according to claim 10, wherein
The processing module is specifically configured to determine a second average value and a second standard deviation of the plurality of initial weight parameters;
and calculating the second upper threshold and the second lower threshold according to the second average value, the second standard deviation and a second preset threshold, wherein the second preset threshold is greater than or equal to 3.
12. The model compression apparatus according to claim 11, wherein
The processing module is specifically configured to add the second average value to a product of the second preset threshold value and the second standard deviation to obtain the second upper limit threshold value;
and subtracting the product of the second preset threshold value and the second standard deviation from the second average value to obtain the second lower threshold value.
13. A model compression device, characterized in that it comprises a memory and a processor, the memory being adapted to store instructions which, when executed by the processor, cause the model compression device to perform the model compression method according to any one of claims 1 to 6.
14. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed, implement the model compression method according to any one of claims 1 to 6.
CN202311257748.5A 2023-09-27 2023-09-27 Model compression method, apparatus, storage medium, and program product Active CN116992946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311257748.5A CN116992946B (en) 2023-09-27 2023-09-27 Model compression method, apparatus, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311257748.5A CN116992946B (en) 2023-09-27 2023-09-27 Model compression method, apparatus, storage medium, and program product

Publications (2)

Publication Number Publication Date
CN116992946A CN116992946A (en) 2023-11-03
CN116992946B true CN116992946B (en) 2024-05-17

Family

ID=88530659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311257748.5A Active CN116992946B (en) 2023-09-27 2023-09-27 Model compression method, apparatus, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN116992946B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117540780A (en) * 2024-01-09 2024-02-09 腾讯科技(深圳)有限公司 Compression method and related device of neural network model


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126595A (en) * 2019-11-29 2020-05-08 苏州浪潮智能科技有限公司 Method and equipment for model compression of neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222640A (en) * 2020-01-11 2020-06-02 电子科技大学 Signal recognition convolutional neural network convolutional kernel partition pruning method
CN112329910A (en) * 2020-10-09 2021-02-05 东南大学 Deep convolutional neural network compression method for structure pruning combined quantization
CN113205158A (en) * 2021-05-31 2021-08-03 上海眼控科技股份有限公司 Pruning quantification processing method, device, equipment and storage medium of network model
CN114819141A (en) * 2022-04-07 2022-07-29 西安电子科技大学 Intelligent pruning method and system for deep network compression
CN115456169A (en) * 2022-09-06 2022-12-09 云知声智能科技股份有限公司 Model compression method, system, terminal and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Pruning- and Quantization-Based Compression Algorithm for Number of Mixed Signals Identification Network; Weiguo Shen et al.; Electronics; pp. 1-8 *
Research on Lightweight Methods for Deep-Learning-Based Signal Recognition Models; Chen Hao; China Master's Theses Full-text Database (Information Science and Technology); pp. 1-56 *

Also Published As

Publication number Publication date
CN116992946A (en) 2023-11-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant