CN112906889A - Method and system for compressing deep neural network model - Google Patents


Info

Publication number
CN112906889A
CN112906889A (application CN202110234699.8A)
Authority
CN
China
Prior art keywords
model
convolution
neural network
convolution kernel
deep neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110234699.8A
Other languages
Chinese (zh)
Inventor
李超
许建荣
徐勇军
崔碧峰
宫禄齐
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202110234699.8A
Publication of CN112906889A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods
    • G06N 3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An embodiment of the invention provides a method and a system for compressing a deep neural network model. The method comprises: S1, obtaining a baseline model, the baseline model being the deep neural network model to be compressed; S2, calculating the validity of the convolution kernels in at least some convolutional layers of the baseline model based on their weight parameters, and determining the invalid convolution kernels; S3, clipping the determined invalid convolution kernels from the baseline model; and S4, fine-tuning or retraining the clipped model to obtain the compressed model. When pruning a given convolutional layer, there is no need to analyze that layer's sensitivity (the effect of its pruning result on model performance); only the acceptability of the final result needs to be considered. The method is therefore simpler and more efficient, and is particularly suitable for deep convolutional neural networks with many convolutional layers.

Description

Method and system for compressing deep neural network model
Technical Field
The invention relates to the field of artificial intelligence, in particular to the field of neural network model compression, and more particularly to a method and a system for compressing a deep neural network model.
Background
In recent years, Deep Neural Networks (DNNs) have achieved great success in many fields, including image classification, object detection, semantic segmentation, autonomous driving, speech recognition, machine translation, sentiment analysis, and recommendation systems. However, a Convolutional Neural Network (CNN) usually requires high computational overhead and a large memory footprint. Because of limiting factors such as volume and space, mobile and embedded devices cannot supply much computing power; their computation and storage resources are precious, so a CNN cannot be deployed on them directly. Compression and acceleration of CNNs have therefore been widely explored in both academia and industry; the main methods include low-rank approximation, parameter quantization, binarization, and network pruning. Network pruning is an efficient and highly targeted model compression method. According to the granularity of the operation, current pruning methods can be divided into fine-grained pruning, vector-level pruning, kernel-level pruning, group-level pruning, and filter-level pruning. Whatever the method, the basic pruning workflow is similar and can generally be divided into three major steps:
1. training a baseline model;
2. compressing and clipping the baseline model;
3. fine-tuning or retraining the clipped model.
In recent years, most scholars have further divided pruning methods into structured and unstructured pruning according to the structural completeness of the pruning granularity.
Unstructured pruning removes unimportant weights by sparsification and similar means. Some researchers propose pruning weights with small absolute values and storing the sparse structure in compressed sparse row or column format. Other researchers propose an energy-aware pruning method that prunes insignificant weights layer by layer by minimizing the reconstruction error. However, these methods require a special format to store the network, and acceleration is achieved only when special sparse matrix multiplication is available in dedicated software or hardware. Because unstructured methods prune at the level of individual model weights, they can clip weights directly, and their parameter compression is better than that of structured pruning. However, precisely because the compression acts on individual weights, the pruned model is highly sparse and cannot be accelerated directly; to realize the speed-up implied by pruning, Basic Linear Algebra Subprograms (BLAS) libraries are needed to fully exploit the pruning compression and accelerate the model.
In contrast, structured pruning directly removes structured parts (e.g., 2D kernels, 3D filters, or whole layers) to compress and accelerate a CNN simultaneously, and is well supported by various off-the-shelf deep learning libraries. Some researchers propose removing insignificant filters based on the L1 norm. Others compute the Average Percentage of Zeros (APoZ) of each filter, i.e., the average fraction of zero activations in the output feature map corresponding to the filter, to evaluate the filter's redundancy. Recently, some researchers have proposed a group sparsity regularization that exploits the correlation between features in the network. Still others propose a channel selection method based on LASSO regression that prunes filters using least-squares reconstruction. Because structured pruning operates on whole blocks such as kernels and filters, the sparsity of the model is unchanged after clipping while its convolution channels change; compared with the model before pruning, structured pruning directly yields a smaller model without introducing sparsity, so the pruning speed-up can be realized without a BLAS acceleration library.
However, the above structured pruning methods share a problem: during the pruning of each convolutional layer, the sensitivity (i.e., the influence of the pruning ratio on the model performance loss) must be analyzed. This is still workable for a convolutional neural network with few layers, but for a deep convolutional neural network with many convolutional layers, the pruning ratio of each layer must be adjusted manually many times, which is quite cumbersome. There is therefore a need to improve on the prior art.
Disclosure of Invention
It is therefore an object of the present invention to overcome the above-mentioned deficiencies of the prior art and to provide a method and system for compressing a deep neural network model.
The object of the invention is achieved by the following technical solutions:
according to a first aspect of the present invention, there is provided a method for compressing a deep neural network model, comprising: S1, obtaining a baseline model, the baseline model being the deep neural network model to be compressed; S2, calculating the validity of the convolution kernels in at least some convolutional layers of the baseline model based on their weight parameters, and determining the invalid convolution kernels; S3, clipping the determined invalid convolution kernels from the baseline model; and S4, fine-tuning or retraining the clipped model to obtain the compressed model.
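As a rough, non-authoritative sketch of steps S1 to S4, the flow might look as follows; the function name, the array layout (out_channels, in_channels, K1, K2), and the simple normalized-L1 threshold test standing in for the validity discriminator are all our own assumptions:

```python
import numpy as np

def compress_model(conv_layers, validity_threshold=0.8):
    """Hedged sketch of S1-S4: score each 3D filter of each layer by a
    normalized L1 norm, treat low-scoring filters as invalid, and clip
    them in one pass. `conv_layers` is assumed to be a list of weight
    arrays of shape (out_channels, in_channels, K1, K2)."""
    pruned = []
    for W in conv_layers:                       # S2: per-layer analysis
        norms = np.abs(W).sum(axis=(1, 2, 3))   # L1 norm of each filter
        u = (norms - norms.min()) / (norms.max() - norms.min() + 1e-12)
        keep = u > validity_threshold           # simplified validity test
        pruned.append(W[keep])                  # S3: clip invalid filters
    return pruned                               # S4: fine-tune or retrain next
```

The per-step details (norm choice, normalization, and the Hard-Sigmoid-based validity indicator) are elaborated in the embodiments of the description.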
In some embodiments of the present invention, step S2 comprises: S21, obtaining the weight parameters of the convolution kernels of any convolutional layer of the baseline model that needs pruning; S22, calculating the norm of each convolution kernel from its weight parameters; S23, normalizing the norm of each convolution kernel to obtain its value to be analyzed; S24, calculating the validity indicator of each convolution kernel from its value to be analyzed; and S25, determining any convolution kernel whose validity indicator is less than or equal to a preset validity threshold to be an invalid convolution kernel.
In some embodiments of the invention, the norm of the convolution kernel is found according to the following formula:

$$\|F_{i,j}\|_p = \left( \sum_{n=1}^{N_i} \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} \big| F_{i,j}(n, k_1, k_2) \big|^p \right)^{1/p}$$

where $p$ denotes the norm type, $N_i$ the number of input channels of the $i$-th convolutional layer, $n$ the channel currently being computed, $K_1$ the length and $K_2$ the width of the convolution kernel, and $k_1$ and $k_2$ the length and width indices of the parameter currently being computed.
In some embodiments of the invention, the norm type is an L0, L1, or L2 norm.
In some embodiments of the invention, the normalization of the norm of each convolution kernel is based on a maximum and a minimum of the norms of all convolution kernels of the convolution layer in which the convolution kernel is located.
In some embodiments of the invention, the norm of each convolution kernel is normalized according to the following formula:

$$u_{i,j} = \frac{\|F_{i,j}\| - \min\{\|F_i\|\}}{\max\{\|F_i\|\} - \min\{\|F_i\|\}}$$

where $u$ denotes the value to be analyzed, $\|F_{i,j}\|$ the norm of the $j$-th convolution kernel of the $i$-th convolutional layer, and $\min\{\|F_i\|\}$ and $\max\{\|F_i\|\}$ the minimum and maximum of the norms of all convolution kernels of the $i$-th convolutional layer.
In some embodiments of the invention, the validity indicator of each convolution kernel is calculated in any one of three ways:

Way 1:

$$z = \min(1, \max(0, \bar{s}))$$

Way 2:

$$z = \max(0, \bar{s})$$

Way 3:

$$z = \min(1, \bar{s})$$

where $\max(0, \bar{s})$ outputs the maximum of $0$ and $\bar{s}$, $\min(1, \bar{s})$ outputs the minimum of $1$ and $\bar{s}$,

$$\bar{s} = s(\zeta - \gamma) + \gamma,$$

$\gamma$ and $\zeta$ define the interval $(\gamma, \zeta)$ into which $s$ is stretched,

$$s = \mathrm{Sigmoid}\!\left(\frac{\ln u - \ln(1-u) + \ln\alpha}{\beta}\right),$$

and $u$ denotes the value to be analyzed ($u \in U$, $U \subset (0,1)$), $\alpha$ the position parameter, and $\beta$ the temperature coefficient.
In some embodiments of the invention, $\gamma < 0$, $\zeta > 1$, $\alpha$ is set in the interval $(0.5, 1)$, and $\beta$ is set in $(0.7, 1)$.
According to a second aspect of the present invention, there is provided a system for compressing a deep neural network model, comprising: a validity discriminator for calculating the validity of the convolution kernels in at least some convolutional layers of the baseline model and determining the invalid convolution kernels, wherein the baseline model is the deep neural network model to be compressed; a clipping module for clipping the determined invalid convolution kernels directly from the baseline model; and a performance recovery module for fine-tuning or retraining the clipped model to obtain the compressed model.
According to a third aspect of the present invention, there is provided an electronic apparatus comprising: one or more processors; and a memory, wherein the memory is to store one or more executable instructions; the one or more processors are configured to implement the steps of the method of the first aspect via execution of the one or more executable instructions.
Compared with the prior art, the invention has the advantages that:
When pruning a given convolutional layer, there is no need to analyze that layer's sensitivity (the effect of its pruning result on model performance); only the final result (e.g., the compression ratio and performance loss of the final compressed model) and its acceptability need to be considered. The method is therefore simpler and more efficient, and is particularly suitable for deep convolutional neural networks with many convolutional layers.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic flow diagram of a method for compressing a deep neural network model according to an embodiment of the present invention;
FIG. 2 is a block diagram of a system for compressing a deep neural network model according to an embodiment of the present invention;
fig. 3 is a schematic diagram of the system for compressing a deep neural network model according to an embodiment of the present invention before and after compressing a baseline model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As mentioned in the Background section, existing structured pruning methods must analyze the influence of the pruning ratio on the model performance loss during the pruning of each convolutional layer, which is too cumbersome for a deep convolutional neural network with many convolutional layers. Against this background, the invention provides a novel structured model compression and acceleration method: it calculates the validity of the convolution kernels in at least some convolutional layers of the baseline model directly from their weight parameters, determines the invalid convolution kernels, and clips them from the baseline model. When pruning a given convolutional layer, there is no need to analyze that layer's sensitivity (the effect of its pruning result on model performance); only the final result (e.g., the compression ratio and performance loss of the final compressed model) and its acceptability need to be considered. The method is therefore simpler and more efficient, and is particularly suitable for deep convolutional neural networks with many convolutional layers.
The invention provides a method for compressing a deep neural network model, which comprises the following steps: steps S1, S2, S3, S4. For a better understanding of the present invention, each step is described in detail below with reference to specific examples.
S1, acquiring a baseline model, the baseline model being the deep neural network model to be compressed.
According to one embodiment of the present invention, the reference model (also called the baseline model) is a pre-trained model that requires compression pruning. The user first trains a convolutional neural network that has a certain inference capability and needs model compression pruning, generating the reference model to be compressed. The reference model then serves as the original model for the subsequent compression. After the compression is finished, the performance indicators of the baseline model serve as the reference for evaluating the effect of the compressed model. The reference model may be a deep neural network model for image classification, object detection, semantic segmentation, autonomous driving, speech recognition, machine translation, sentiment analysis, or recommendation. For example, the user trains a deep neural network model for image classification to convergence in advance, obtaining the baseline model.
S2, calculating the validity of the convolution kernels in at least some convolutional layers of the baseline model based on their weight parameters, and determining the invalid convolution kernels.
According to an embodiment of the present invention, step S2 includes: s21, S22, S23, S24 and S25. The implementation details of each sub-step are as follows:
S21, obtaining the weight parameters of the convolution kernels of any convolutional layer of the baseline model that needs pruning.
As mentioned previously, the baseline model of the present invention is a deep neural network model with multiple convolutional layers; some models reach hundreds or even thousands of layers, and as technology and computing power develop, the layer count of deep neural network models may grow further. Each convolutional layer contains one or more convolution kernels (Filters), and each convolution kernel contains one or more weight parameters (Weights). The size of a convolution kernel, i.e., the number of weight parameters it contains, equals kernel length × kernel width × number of input channels; kernel length × kernel width is the size of the kernel (Kernel). If the length or width is 1 and the number of input channels is also 1 (e.g., 3 × 1 × 1), the convolution kernel is one-dimensional. If both the length and width are greater than 1 and the number of input channels is 1 (e.g., 3 × 3 × 1), the convolution kernel is two-dimensional. If the length, width, and number of input channels are all greater than 1 (e.g., 3 × 3 × 3), the convolution kernel is three-dimensional. This step extracts the weight parameter at each element position of each convolution kernel from the baseline model for the subsequent calculation of the validity indicator.
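The size rule just described can be checked numerically; the array names and shapes below are our own illustration, not the patent's notation:

```python
import numpy as np

# size of a convolution kernel = length x width x number of input channels
k1d = np.zeros((3, 1, 1))  # 3 x 1 x 1: one-dimensional convolution kernel
k2d = np.zeros((3, 3, 1))  # 3 x 3 x 1: two-dimensional convolution kernel
k3d = np.zeros((3, 3, 3))  # 3 x 3 x 3: three-dimensional convolution kernel

# number of weight parameters held by each kernel
counts = [k.size for k in (k1d, k2d, k3d)]
```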
According to an embodiment of the present invention, obtaining the weight parameter of each convolution kernel of the convolution layer needing pruning in the baseline model comprises: providing a user interface for selecting the convolutional layers needing pruning for a user, acquiring the convolutional layers needing pruning selected by the user on the user interface, and loading the weight parameters of each convolutional core of the convolutional layers needing pruning from the baseline model. The technical scheme of the embodiment can at least realize the following beneficial technical effects: more user selection spaces are provided, and the user can select partial or all the convolutional layers to prune according to the needs.
According to an embodiment of the invention, the method further comprises: providing a user interface for selecting convolution kernels with a specific channel number for a user, acquiring the specific channel number selected by the user on the user interface, and adding weight parameters of the convolution kernels with the specific input channel number from the baseline model when acquiring the weight parameters of each convolution kernel of the convolution layer needing pruning in the baseline model. For example, the user may select a convolution kernel having a number of 512 or 1024 input channels, or a particular range of input channels: such as convolution kernels with 128-1024 input channels. And if the convolution kernel with the input channel number larger than 1 in all the convolution layers is selected, evaluating all the 3D-filters in the convolution layers to identify whether the convolution kernels are invalid or not. The technical scheme of the embodiment can at least realize the following beneficial technical effects: more user selection spaces are provided, and the user can conveniently and individually select the convolution kernels needing to be adjusted.
S22, calculating the norm of each convolution kernel based on its weight parameters.
According to one embodiment of the present invention, assume that the deep convolutional neural network has $L$ convolutional layers ($1 \le i \le L$), $N_i$ denotes the number of input channels of the $i$-th convolutional layer, and $F_{i,j}$ denotes the $j$-th convolution kernel of the $i$-th convolutional layer. Preferably, the norm of the convolution kernel is found according to the following formula:

$$\|F_{i,j}\|_p = \left( \sum_{n=1}^{N_i} \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} \big| F_{i,j}(n, k_1, k_2) \big|^p \right)^{1/p}$$

where $p$ denotes the norm type, $N_i$ the number of input channels of the $i$-th convolutional layer, $n$ the channel currently being computed, $K_1$ the length and $K_2$ the width of the convolution kernel, and $k_1$ and $k_2$ the length and width indices of the parameter currently being computed.

Preferably, the norm type is the L0, L1, or L2 norm. According to an embodiment of the present invention, let $p = 1$, i.e., compute the L1 norm of the convolution kernel; the above formula then reduces to:

$$\|F_{i,j}\|_1 = \sum_{n=1}^{N_i} \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} \big| F_{i,j}(n, k_1, k_2) \big|$$
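The per-kernel norm computation can be sketched in a few lines; the function name and the layer layout (out_channels, N_i, K1, K2) are our assumptions:

```python
import numpy as np

def kernel_norms(W, p=1):
    """Per-filter norm ||F_{i,j}||_p for one convolutional layer.
    W has shape (out_channels, N_i, K1, K2); p=1 gives the L1 case
    used in the text, and p=0 counts the non-zero weights."""
    flat = np.abs(W.reshape(W.shape[0], -1))
    if p == 0:
        return (flat != 0).sum(axis=1).astype(float)
    return (flat ** p).sum(axis=1) ** (1.0 / p)
```

With p = 1 the 1/p root vanishes and the result is just the sum of absolute weight values per kernel, matching the simplified formula above.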
and S23, normalizing the norm of each convolution kernel to obtain the value to be analyzed of each convolution kernel.
According to one embodiment of the invention, the norm of each convolution kernel is normalized based on the maximum value and the minimum value of the norms of all convolution kernels of the convolution layer where the convolution kernel is located.
Preferably, the norm of each convolution kernel is normalized according to the following formula:

$$u_{i,j} = \frac{\|F_{i,j}\| - \min\{\|F_i\|\}}{\max\{\|F_i\|\} - \min\{\|F_i\|\}}$$

where $u$ denotes the value to be analyzed, $\|F_{i,j}\|$ the norm of the $j$-th convolution kernel of the $i$-th convolutional layer, and $\min\{\|F_i\|\}$ and $\max\{\|F_i\|\}$ the minimum and maximum of the norms of all convolution kernels of the $i$-th convolutional layer. The values $u$ of all convolution kernels in the same convolutional layer form a set $U \subset (0,1)$. The technical solution of this embodiment achieves at least the following beneficial technical effects: the norms of different convolution kernels may differ greatly, the norm distribution differs from layer to layer, and so does the information each layer extracts in the neural network; normalizing against the maximum and minimum over all layers' kernels would therefore hurt the accuracy of finding invalid kernels. By instead normalizing each kernel's norm against the maximum and minimum within its own convolutional layer, the normalization has a more meaningful reference, and the invalid kernels of that layer can be found more accurately.
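A minimal sketch of the per-layer min-max normalization; the epsilon guard against a degenerate layer in which all norms are equal is our own addition:

```python
import numpy as np

def normalize_norms(norms, eps=1e-12):
    """Map each kernel norm of one layer to its value-to-analyze u,
    using only that layer's own minimum and maximum norm."""
    lo, hi = norms.min(), norms.max()
    return (norms - lo) / (hi - lo + eps)
```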
S24, calculating the validity indicator of each convolution kernel from its value to be analyzed. According to one embodiment of the invention, the validity discriminator uses a Hard-Sigmoid function, and the validity indicator is constrained by the position parameter and temperature coefficient of the Hard-Sigmoid function.
According to one embodiment of the invention, the validity indicator of each convolution kernel is calculated in any one of three ways:

Way 1:

$$z = \min(1, \max(0, \bar{s}))$$

Way 2:

$$z = \max(0, \bar{s})$$

Way 3:

$$z = \min(1, \bar{s})$$

where $\max(0, \bar{s})$ outputs the maximum of $0$ and $\bar{s}$, $\min(1, \bar{s})$ outputs the minimum of $1$ and $\bar{s}$,

$$\bar{s} = s(\zeta - \gamma) + \gamma,$$

$\gamma$ and $\zeta$ define the interval $(\gamma, \zeta)$ into which $s$ is stretched,

$$s = \mathrm{Sigmoid}\!\left(\frac{\ln u - \ln(1-u) + \ln\alpha}{\beta}\right),$$

and $u$ denotes the value to be analyzed ($u \in U$, $U \subset (0,1)$), $\alpha$ the position parameter, and $\beta$ the temperature coefficient. Here $z$ is the validity indicator; from the expressions above, $z$ lies in the interval $[0, 1]$, and the larger $z$ is, the more valid the corresponding convolution kernel. Furthermore, $\gamma < 0$ and $\zeta > 1$; for example, $\gamma$ may be set to $-0.1$ and $\zeta$ to $1.1$, and in practice the user may set the values of $\gamma$ and $\zeta$ as needed. The position parameter $\alpha$ controls whether the distribution of the validity indicator $z$ concentrates toward the $0$ side or the $1$ side: a smaller $\log \alpha$ encourages the distribution of $z$ to concentrate near $0$, and a larger $\log \alpha$ encourages it to concentrate near $1$. Since the compression pruning process (i.e., the process of compressing the model) aims to keep the more valid convolution kernels, the distribution should lean toward $1$ as a whole, so the preferred range of $\alpha$ is $(0.5, 1)$, which yields a better-performing model after pruning. The temperature coefficient $\beta$ controls how the distribution of $z$ is split between the set $\{0, 1\}$ and the interior of the interval $(0, 1)$: a relatively large $\beta$ encourages a relatively large share of $z$ values to fall inside $(0, 1)$. Since compression pruning should keep the valid kernels, cut the invalid ones, and preserve the accuracy of the original model as much as possible, producing many values close to $1$ without being too extreme, the preferred range of $\beta$ is $(0.7, 1)$, which yields a model with better accuracy after pruning.
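Under our reading of the first way above, z = min(1, max(0, s̄)), the indicator can be sketched as follows; the default hyper-parameters are picked from the preferred ranges in the text and are illustrative only:

```python
import numpy as np

def validity_index(u, alpha=0.8, beta=0.8, gamma=-0.1, zeta=1.1):
    """z = min(1, max(0, s_bar)) with
    s = Sigmoid((ln u - ln(1-u) + ln alpha) / beta) and
    s_bar = s * (zeta - gamma) + gamma (stretch of s into (gamma, zeta))."""
    u = np.asarray(u, dtype=float)
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log1p(-u) + np.log(alpha)) / beta))
    s_bar = s * (zeta - gamma) + gamma
    return np.clip(s_bar, 0.0, 1.0)   # hard clip to [0, 1]
```

Because the logit of u is increasing and beta is positive, z grows monotonically with u, so kernels with larger normalized norms receive larger validity indicators.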
In the actual pruning process, the output can be controlled simply by adjusting the position parameter α and the temperature coefficient β, which in turn adjusts the number of convolution kernels determined to be invalid. That is, the method of the invention can control the compression ratio relative to the baseline model by adjusting the position parameter and/or the temperature coefficient. After all layers to be clipped have been processed, one finally judges whether the compression ratio and performance of the resulting model meet the requirements. If so, the model can be adopted directly; if not, the position parameter α and/or the temperature coefficient β can be adjusted until a model meeting the requirements is obtained. In operation, the larger α is set, the smaller the compression ratio of the compressed model relative to the baseline model; likewise, the larger β is set, the smaller the compression ratio. Further, since α shifts the overall position of the distribution, its adjustment has a larger influence on the model compression ratio, so α can be used for coarse adjustment and β for fine adjustment. That is, for adjustments of the same magnitude, changing α changes the compression ratio more than changing β does. The user can thus quickly obtain the desired compressed model by coarse-tuning with α and fine-tuning with β. The technical solution of this embodiment achieves at least the following beneficial technical effects: the user can complete the pruning of the whole model by adjusting α and β, without weighing, layer by layer, the influence of the pruning amount on the model performance loss.
According to an embodiment of the invention, the method further comprises: and providing a user interface for a user to receive the selection of the user, and setting the final effectiveness index of each convolution kernel as a result of weighted summation of the effectiveness indexes obtained in the first mode, the second mode and the third mode, wherein the weight of each mode is set by the user according to the requirement, and the sum of the weights corresponding to the three modes is 1. For example, the validity indicators obtained in the first, second, and third modes are configured as 0.5, 0.3, and 0.2, respectively, by the user. The technical scheme of the embodiment can at least realize the following beneficial technical effects: different users can configure or adjust the weights of all modes according to experience or specific conditions during compression, so that invalid convolution kernels can be determined through comprehensive evaluation of the effectiveness indexes, and a model with better performance can be obtained.
S25, determining any convolution kernel whose validity indicator is less than or equal to a preset validity threshold to be an invalid convolution kernel.
According to one embodiment of the invention, the user may set a single validity threshold and obtain the corresponding compressed model. However, to observe the influence of different validity thresholds on model performance and reach a satisfactory final model more quickly, multiple validity thresholds can be set; the corresponding pruning processes then complete automatically for each threshold, yielding multiple compressed models from which the user can quickly pick the one that meets the requirements. Preferably, the method further comprises: receiving multiple validity thresholds input by the user, determining a distinct group of invalid convolution kernels for each threshold, and clipping the baseline model separately for each group to obtain multiple clipped models. For example, the user sets four validity thresholds: 0.6, 0.7, 0.8, and 0.9. Four clipped models are generated accordingly and fine-tuned or retrained into four compressed models. The user then weighs the four compressed models by performance loss and compression ratio and selects the one that meets the requirements.
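Producing one keep-mask per candidate threshold, as described above, might look like the following sketch (the function name is ours):

```python
import numpy as np

def masks_for_thresholds(z, thresholds=(0.6, 0.7, 0.8, 0.9)):
    """Return, for each validity threshold, a boolean keep-mask marking
    the kernels whose indicator z exceeds the threshold; kernels with
    z <= threshold are deemed invalid and would be clipped."""
    z = np.asarray(z)
    return {t: z > t for t in thresholds}
```

Each mask drives one clipping pass, so several candidate compressed models can be produced from a single scoring of the baseline model.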
S3, directly cutting the determined invalid convolution kernel from the baseline model.
According to an embodiment of the present invention, clipping the determined invalid convolution kernels directly from the baseline model means deleting, directly from the baseline model, the invalid convolution kernels determined from the convolution kernels' weight parameters. Moreover, all of the determined invalid convolution kernels are removed from the baseline model at once. This differs from the prior art, in which the user must weigh, layer by layer, the number of convolution kernels deleted against the loss of model performance during deletion, which becomes prohibitively cumbersome when the number of convolutional layers is large. The technical scheme of this embodiment can achieve at least the following beneficial technical effects: because the invention adopts structured pruning, the clipped model does not become irregularly sparse, so the model retains its inference capability and a certain generalization capability.
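The one-pass, structured (filter-level) removal can be sketched as follows; this is an illustrative numpy sketch, not the patented implementation, and the (out_channels, in_channels, kH, kW) layout is an assumption:

```python
# Structured pruning: delete whole output filters in a single pass, so the
# remaining weight tensor stays dense rather than becoming sparse.
import numpy as np

def prune_filters(weights, invalid):
    """Remove the output filters (axis 0) whose indices are in `invalid`."""
    keep = [i for i in range(weights.shape[0]) if i not in set(invalid)]
    return weights[keep]

w = np.random.rand(5, 4, 3, 3)     # 5 kernels over 4 input feature maps
pruned = prune_filters(w, [1, 3])  # delete the 2 invalid kernels at once
```

After pruning, `pruned` has shape (3, 4, 3, 3): three valid kernels remain, matching the 5-to-3 example shown in FIG. 3, and the surviving kernels keep their original parameters.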
S4, fine-tuning or retraining the clipped model to obtain a compressed model.
According to one embodiment of the invention, this step may fine-tune the clipped model using the dataset on which the baseline model was trained. Fine-tuning the parameters of the clipped model (the simplified model after clipping) restores its performance to close to the initial state, achieving accuracy similar to that of the baseline model. Before fine-tuning, the parameters of the convolution kernels retained in the clipped model are left unchanged, and fine-tuning proceeds from those parameters. Alternatively, this step may retrain the clipped model with the baseline model's training dataset: before retraining, the retained convolution kernels are initialized (e.g., randomly initialized), and the model is then retrained on the dataset.
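The difference between the two recovery options described above can be sketched as follows (all names are illustrative, and `prepare_for_recovery` is a stand-in for the step preceding the real training loop):

```python
# Fine-tuning continues from the retained kernels' existing parameters;
# retraining first re-initializes them (e.g. randomly) before training on
# the baseline model's dataset.
import random

def prepare_for_recovery(kept_params, mode="finetune", seed=0):
    if mode == "retrain":
        rng = random.Random(seed)
        return [rng.uniform(-0.1, 0.1) for _ in kept_params]  # re-init
    return list(kept_params)  # fine-tune: keep the parameters unchanged

params = [0.52, -0.31, 0.08]               # surviving kernel parameters
finetune_start = prepare_for_recovery(params, "finetune")
retrain_start = prepare_for_recovery(params, "retrain")
```

Fine-tuning typically recovers accuracy with fewer epochs because it starts from trained weights; retraining discards them and learns the reduced architecture from scratch.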
According to an embodiment of the invention, the method further comprises: providing the user with the compression ratio of the one or more compressed models relative to the baseline model, the change in model performance, the performance parameters of the one or more compressed models, the performance parameters of the baseline model, or a combination thereof. The technical scheme of this embodiment can achieve at least the following beneficial technical effects: from these data the user can quickly assess the performance of the compressed models, quickly select the required compressed model, or readjust the position parameter and/or the temperature coefficient and recompress the baseline model.
Referring to fig. 2, the present invention also provides a system for compressing a deep neural network model, comprising: a model obtaining module 100, configured to obtain a baseline model, wherein the baseline model is a deep neural network model to be compressed; a validity discriminator 200, configured to calculate the validity of the convolution kernels in at least some convolutional layers of the baseline model and determine invalid convolution kernels; a clipping module 300, configured to clip the determined invalid convolution kernels from the baseline model; and/or a performance recovery module 400, configured to fine-tune or retrain the clipped model. For the implementation of each module in the system, refer to the description of the foregoing method embodiments, which is not repeated here. The compressed model can be deployed on electronic devices with relatively limited computing and storage resources, such as mobile devices, solving the problems that the uncompressed baseline model is too large, requires too much computation, and is difficult to deploy on mobile devices.
FIG. 3 provides a schematic representation of the baseline model before and after compression by the system of the present invention. Before compression there are 5 convolution kernels (in practice there may be hundreds or thousands; the figure is simplified), which extract features from 4 feature maps to produce 5 feature maps. The validity discriminator then identifies the invalid convolution kernels (drawn with dashed lines). Compression then deletes the 2 invalid convolution kernels, leaving 3 valid convolution kernels for feature extraction and yielding 3 feature maps. Both the size and the computational load of the model are reduced, facilitating deployment of the compressed model on mobile devices.
In general, by identifying convolution kernels in a deep convolutional neural network that are redundant or contribute little to network inference, the invention compresses and optimizes the deep neural network model with no, or only a slight, reduction in performance. The method prunes the convolutional neural network at the level of convolution kernels; the advantage of this pruning method is that the pruned model can be used directly for industrial deployment, and compared with the model before pruning it has fewer convolution kernels, less computation, fewer parameters, and faster inference. Most current structured pruning methods evaluate the convolution kernels during model training, for example by computing the L2 norm, introducing a centripetal SGD algorithm to cluster the convolution kernels, or introducing regularization; these methods must build inductive factors for structured pruning into the training process and are cumbersome to operate during training. In contrast, the method of the invention only needs to discriminate the convolution kernels of each layer after training is finished in order to determine whether a kernel is redundant and to perform compression clipping.
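A hedged sketch of the post-training identification path described above (the norm-based mode of steps S21–S25): compute each kernel's L2 norm, min-max normalize the norms within the layer, and flag kernels whose normalized value falls at or below a threshold. The tensor layout, names, and epsilon are assumptions, not the patent's exact procedure.

```python
import numpy as np

def flag_invalid(layer_weights, threshold=0.2, p=2):
    """layer_weights: (out_channels, in_channels, kH, kW).
    Returns the indices of kernels deemed invalid."""
    # Per-kernel p-norm over all of the kernel's weight parameters (S22)
    norms = np.linalg.norm(layer_weights.reshape(layer_weights.shape[0], -1),
                           ord=p, axis=1)
    # Min-max normalization within the layer (S23); eps guards div-by-zero
    u = (norms - norms.min()) / (norms.max() - norms.min() + 1e-12)
    # Kernels at or below the validity threshold are invalid (S25)
    return np.where(u <= threshold)[0]

w = np.ones((3, 2, 3, 3))   # 3 kernels over 2 input channels
w[1] *= 0.01                # kernel 1 is near zero, hence redundant
flagged = flag_invalid(w, threshold=0.5)
```

Because this runs on the trained weights alone, no pruning-specific terms need to be added to the training objective, which is the operational advantage claimed over training-time structured pruning.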
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A method for compressing a deep neural network model, comprising:
S1, obtaining a baseline model, wherein the baseline model is a deep neural network model to be compressed;
S2, calculating the validity of a plurality of convolution kernels in at least some convolutional layers of the baseline model based on the weight parameters of the convolution kernels, and determining invalid convolution kernels;
S3, clipping the determined invalid convolution kernels from the baseline model; and
S4, fine-tuning or retraining the clipped model to obtain a compressed model.
2. The method for compressing a deep neural network model as claimed in claim 1, wherein the step S2 includes:
S21, obtaining the weight parameters of a plurality of convolution kernels of any convolutional layer to be pruned in the baseline model;
S22, calculating a norm for each convolution kernel based on its weight parameters;
S23, normalizing the norm of each convolution kernel to obtain a value to be analyzed for each convolution kernel;
S24, calculating a validity indicator for each convolution kernel based on its value to be analyzed; and
S25, determining each convolution kernel whose validity indicator is less than or equal to a preset validity threshold to be an invalid convolution kernel.
3. The method for compressing a deep neural network model of claim 2, wherein a norm of a convolution kernel is found according to the following formula:
$$\|F_{i,j}\|_p=\Big(\sum_{n=1}^{N_i}\sum_{k_1=1}^{K_1}\sum_{k_2=1}^{K_2}\big|F_{i,j}(n,k_1,k_2)\big|^p\Big)^{1/p}$$
wherein p represents the norm type, N_i represents the number of input channels of the i-th convolutional layer, n indexes the channel currently being computed, K_1 represents the length of the convolution kernel, K_2 represents the width of the convolution kernel, k_1 is the length index of the parameter currently being computed, and k_2 is the width index of the parameter currently being computed.
4. The method for compressing a deep neural network model of claim 3, wherein the norm type is an L0, L1, or L2 norm.
5. The method of claim 2, wherein the normalization of the norm of each convolution kernel is based on a maximum and a minimum of the norms of all convolution kernels of the convolution layer in which the convolution kernel is located.
6. The method for compressing a deep neural network model according to any one of claims 2 to 5, wherein the norm of each convolution kernel is normalized according to the following formula:
$$u_j=\frac{\|F_{i,j}\|-\min\{\|F_i\|\}}{\max\{\|F_i\|\}-\min\{\|F_i\|\}}$$
wherein u represents the value to be analyzed, \|F_{i,j}\| represents the norm of the j-th convolution kernel of the i-th convolutional layer, min{\|F_i\|} represents the minimum of the norms of all convolution kernels of the i-th convolutional layer, and max{\|F_i\|} represents the maximum of the norms of all convolution kernels of the i-th convolutional layer.
7. The method for compressing a deep neural network model according to any one of claims 2 to 5, wherein the validity indicator of each convolution kernel is calculated in any one of three ways:
the first method is as follows:
Figure FDA0002959485100000022
the second method comprises the following steps:
Figure FDA0002959485100000023
the third method comprises the following steps:
Figure FDA0002959485100000024
wherein the content of the first and second substances,
Figure FDA0002959485100000025
represents the output 0 and
Figure FDA0002959485100000026
the maximum value of (a) is,
Figure FDA0002959485100000027
represents outputs 1 and
Figure FDA0002959485100000028
the minimum value of (a) to (b),
Figure FDA0002959485100000029
γ and ζ are used to define a section (γ, ζ) that s satisfies when expanding and contracting,
Figure FDA00029594851000000210
u represents the value to be analyzed, U-U (0,1), alpha represents the position parameter, beta tableIndicating the temperature coefficient.
8. The method for compressing a deep neural network model according to claim 7, wherein γ &lt; 0, ζ &gt; 1, α takes a value in (0.5, 1), and β takes a value in (0.7, 1).
9. A system for compressing a deep neural network model, comprising:
the effectiveness discriminator is used for respectively calculating the effectiveness of convolution kernels in at least part of convolution layers of a baseline model and determining invalid convolution kernels, wherein the baseline model is a deep neural network model to be compressed;
a clipping module for clipping the determined invalid convolution kernel directly from the baseline model;
and a performance recovery module for fine-tuning or retraining the clipped model to obtain a compressed model.
10. A computer-readable storage medium, having embodied thereon a computer program, the computer program being executable by a processor to perform the steps of the method of any one of claims 1 to 8.
11. An electronic device, comprising:
one or more processors; and
a memory, wherein the memory is to store one or more executable instructions;
the one or more processors are configured to implement the steps of the method of any one of claims 1-8 via execution of the one or more executable instructions.
CN202110234699.8A 2021-03-03 2021-03-03 Method and system for compressing deep neural network model Pending CN112906889A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110234699.8A CN112906889A (en) 2021-03-03 2021-03-03 Method and system for compressing deep neural network model


Publications (1)

Publication Number Publication Date
CN112906889A true CN112906889A (en) 2021-06-04

Family

ID=76107508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110234699.8A Pending CN112906889A (en) 2021-03-03 2021-03-03 Method and system for compressing deep neural network model

Country Status (1)

Country Link
CN (1) CN112906889A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657421A (en) * 2021-06-17 2021-11-16 中国科学院自动化研究所 Convolutional neural network compression method and device and image classification method and device
CN113657421B (en) * 2021-06-17 2024-05-28 中国科学院自动化研究所 Convolutional neural network compression method and device, and image classification method and device
CN113762505A (en) * 2021-08-13 2021-12-07 中国电子科技集团公司第三十八研究所 Clustering pruning method of convolutional neural network according to norm of channel L2
CN113762505B (en) * 2021-08-13 2023-12-01 中国电子科技集团公司第三十八研究所 Method for clustering pruning according to L2 norms of channels of convolutional neural network
CN113837376A (en) * 2021-08-30 2021-12-24 厦门大学 Neural network pruning method based on dynamic coding convolution kernel fusion
CN113837376B (en) * 2021-08-30 2023-09-15 厦门大学 Neural network pruning method based on dynamic coding convolution kernel fusion
WO2024009632A1 (en) * 2022-07-07 2024-01-11 コニカミノルタ株式会社 Model generation apparatus, model generation method, and program

Similar Documents

Publication Publication Date Title
CN112906889A (en) Method and system for compressing deep neural network model
He et al. Amc: Automl for model compression and acceleration on mobile devices
TWI769754B (en) Method and device for determining target business model based on privacy protection
CA2629069C (en) Method for training neural networks
KR20180125905A (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
JP6950756B2 (en) Neural network rank optimizer and optimization method
CN111079899A (en) Neural network model compression method, system, device and medium
EP3570220B1 (en) Information processing method, information processing device, and computer-readable storage medium
JP6811736B2 (en) Information processing equipment, information processing methods, and programs
CN112101547B (en) Pruning method and device for network model, electronic equipment and storage medium
CN111723915A (en) Pruning method of deep convolutional neural network, computer equipment and application method
KR20200089588A (en) Electronic device and method for controlling the electronic device thereof
Pietron et al. Retrain or not retrain?-efficient pruning methods of deep cnn networks
CN113947206A (en) Deep neural network quantification method, system, device and medium
CN115222042A (en) Structured pruning method and system
CN112966818A (en) Directional guide model pruning method, system, equipment and storage medium
CN112560881B (en) Object recognition method and device and data processing method
KR20210111677A (en) Method for clipping neural networks, method for calculating convolution of neural networks and apparatus for performing the methods
Chatterjee et al. Predicting remaining fatigue life of topside piping using deep learning
US11941923B2 (en) Automation method of AI-based diagnostic technology for equipment application
CN111276248B (en) State determination system and electronic device
KR20220052844A (en) Providing neural networks
CN115222012A (en) Method and apparatus for compressing neural networks
Nguyen et al. Fast conditional network compression using bayesian hypernetworks
CN115081580A (en) Method for pruning pre-trained neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210604