CN113850385A - Coarse and fine granularity combined neural network pruning method

Coarse and fine granularity combined neural network pruning method

Info

Publication number
CN113850385A
CN113850385A (application CN202111187212.1A)
Authority
CN
China
Prior art keywords
pruning
convolution kernel
mode
layer
model
Prior art date
Legal status
Pending
Application number
CN202111187212.1A
Other languages
Chinese (zh)
Inventor
姜宏旭
朱雨婷
李波
张永华
东东
胡宗琦
从容子
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202111187212.1A
Publication of CN113850385A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Abstract

The invention discloses a coarse-and-fine-granularity combined neural network pruning method, which comprises the following steps: performing group-sparsity training on the screened candidate filters and pruning the candidate filters falling below a threshold after a set number of rounds; ranking convolution-kernel importance layer by layer and obtaining the convolution kernels to be pruned layer by layer according to a predefined pruning rate; performing weight-level regularized compression on the convolution kernels while dynamically generating a pattern set satisfying a pre-constructed pattern discriminant function; matching each convolution kernel to its optimal pattern in the pattern set; performing convolution-kernel pruning and pattern pruning; setting the parameters to be pruned to zero and hard-pruning the model; and retraining and fine-tuning the hard-pruned model with knowledge distillation to obtain the final pruned model. The invention fully exploits the advantages of structured and unstructured pruning, improves model storage and inference efficiency, and offers high hardware friendliness.

Description

Coarse and fine granularity combined neural network pruning method
Technical Field
The invention relates to the technical field of embedded AI (artificial intelligence), and in particular to a coarse-and-fine-granularity combined neural network pruning method.
Background
With the proliferation of embedded devices, performing deep neural network inference under their tight computation and memory budgets is very challenging. Pruning is currently a widely used model-compression method: it judges parameter importance through an effective criterion and removes unimportant parameters to reduce model redundancy. Existing pruning methods comprise mainly structured and unstructured pruning. Unstructured pruning is fine-grained and preserves accuracy well but is hardware-unfriendly, while structured pruning is coarse-grained and hardware-efficient but incurs larger accuracy loss.
For unstructured pruning, simply setting weights to zero neither shrinks the model nor reduces inference time; the model must be sparsely encoded with indices and paired with a dedicated sparse matrix multiplication strategy. Clipping weights at arbitrary positions has received less attention recently: although it yields a higher net parameter compression ratio, the indexing overhead is not negligible and sparse multiplication is markedly less efficient than dense multiplication, so arbitrary-position weight clipping no longer offers an advantage in either parameter compression ratio or inference time. Recent research proposes pattern pruning: a fixed number of weights is pruned in each convolution kernel and the remaining weights are concentrated in a specific region to form a kernel pattern, yielding a sparsity type intermediate between unstructured and structured pruning with better hardware adaptability. However, current schemes apply research on convolution-kernel parameter importance directly to pattern selection and pick patterns greedily, so not every convolution kernel can be matched with its optimal pattern, and they lack comparative experiments on the effect of convolution-importance strategies (such as forcing the central element not to be pruned). Another approach mathematically derives the best pattern for fixed model parameters but ignores how the parameters vary during training. Moreover, a single round of pattern generation costs far more time than a single round of training, so round-by-round pattern generation is not advisable.
Structured pruning divides, by pruning dimension, mainly into filter pruning and convolution-kernel pruning. These schemes propose a set of importance criteria and clip the insignificant filters or convolution kernels. Because such pruning shrinks the weight matrices, it directly improves inference speed. However, most existing methods combining structured and unstructured pruning simply stack different pruning schemes in series, lacking analysis of how pruning order and different pruning modes affect one another.
Therefore, how to provide a coarse-and-fine-granularity combined pruning method that fully exploits the advantages of structured and unstructured pruning and improves model storage and inference efficiency is a problem urgently needing solution by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a coarse-and-fine-granularity combined neural network pruning method, which fully exploits the advantages of structured and unstructured pruning, improves model storage and inference efficiency, and offers high hardware friendliness.
In order to achieve the purpose, the invention adopts the following technical scheme:
a thickness-granularity combined neural network pruning method comprises the following steps:
selecting candidate filters needing pruning in an original model layer by layer according to the two norms, carrying out group thinning training on the candidate filters, and pruning the candidate filters smaller than a threshold value after a certain number of rounds;
sequencing the importance of the convolution kernels layer by layer according to the two norms, and obtaining the convolution kernels to be pruned layer by layer according to a pre-defined pruning rate;
performing regularization compression on the convolution kernel by taking the weight as a unit, and dynamically generating a mode set meeting a pre-constructed mode discrimination function in the compression process;
matching each convolution kernel to an optimal pattern of the convolution kernel in the pattern set;
carrying out convolution kernel pruning and mode pruning;
setting parameters to be pruned to zero, and carrying out hard pruning on the model;
and (4) retraining and fine-tuning the model after the hard pruning by combining a knowledge distillation method to obtain a final model after the pruning.
Preferably, in the above coarse-and-fine-granularity combined neural network pruning method, the pattern set is generated as follows:
traversing each convolution kernel of the model and selecting the 4 positions with the largest absolute weights to form that convolution kernel's optimal pattern;
globally counting the occurrence frequency of each optimal pattern and selecting the k most frequent patterns as the pattern set generated in the current round;
and judging the weight adjacency, shape count and current model convergence of the pattern set with a pattern discriminant function to evaluate whether the pattern set generated in the current round can serve as the final pattern set.
Preferably, in the above coarse-and-fine-granularity combined neural network pruning method, the condition for judging weight adjacency in the pattern set is that the non-zero elements within a single pattern are all adjacent, determined as follows:
traversing each non-zero element position (i, j) of the pattern and judging whether the four adjacent positions (i+1, j), (i-1, j), (i, j+1) and (i, j-1) lie inside the convolution kernel; if a position lies inside the kernel, judging whether it is non-zero; if some element has no non-zero neighbor, the pattern elements are not all adjacent, and if no such case is found after the traversal completes, all pattern elements are adjacent.
Preferably, in the above coarse-and-fine-granularity combined neural network pruning method, the condition for judging the number of pattern shapes is that the number of shapes in the pattern set is at most k/2, determined as follows:
numbering the convolution-kernel weight indices, where the j-th weight of the i-th row receives the number i × kernel_size + j;
grouping the corresponding pattern shapes by recording positions in a dictionary whose keys concatenate the numbers of the non-zero weights from small to large;
and traversing each pattern, querying the number corresponding to its shape, and using a set to record which numbers have been seen; if a number is absent, incrementing the shape count by 1 and adding the number to the set.
Preferably, in the above coarse-and-fine-granularity combined neural network pruning method, the condition for judging current model convergence is that the current gradient mean is smaller than a threshold q, determined as follows:
obtaining the gradient of each position through the tensor's grad attribute and computing the mean with the mean function.
Preferably, in the above coarse-and-fine-granularity combined neural network pruning method, pruning the candidate filters falling below the threshold comprises:
reshaping the convolution matrix and the mask matrix to the same shape, computing the L2 norm of each column, and recording the columns below the threshold 0.1m as filters to be pruned;
traversing the model layer by layer and storing the input and output channels to be pruned for each layer, where for the l-th layer the output channels to discard are the column indices of that layer's pruned filters and the input channels are the row indices of the (l-1)-th layer's pruned filters;
and discarding the rows and columns to be pruned with the index_select function.
Preferably, in the above coarse-and-fine-granularity combined neural network pruning method, the convolution kernels to be pruned are obtained as follows:
reshaping the convolution matrix and computing the L2 norm of each convolution kernel one by one;
sorting all convolution kernels of every filter of each layer by their L2 norms;
and finding the k least important convolution kernels according to the predefined pruning rate and setting their masks to 0 as the convolution kernels to be pruned.
Preferably, in the above coarse-and-fine-granularity combined neural network pruning method, matching each convolution kernel to its optimal pattern in the pattern set comprises:
traversing the final pattern set one pattern at a time, computing the sum of the convolution kernel's absolute weights on each pattern, and selecting the index with the maximum sum as the kernel's pattern; after assignment completes, the pattern shape of the convolution kernel is not changed.
Preferably, in the above coarse-and-fine-granularity combined neural network pruning method, regularization training is performed on the weights that may be pruned in the filter-pruning, convolution-kernel-pruning and pattern-pruning stages respectively, with cross entropy used to characterize the difference between the model's predicted values and true values.
Preferably, in the above coarse-and-fine-granularity combined neural network pruning method, during the staged regularization training of the weights that may be pruned, stochastic gradient descent (SGD) is adopted as the iterative method to optimize the differentiable objective function, the gradient of the loss function is computed on mini-batches to iteratively update the weights and bias terms, and compression pruning is performed with the MultiStepLR learning-rate decay strategy.
According to the above technical scheme, compared with the prior art, the invention discloses a coarse-and-fine-granularity combined neural network pruning method in which a combined pruning method consisting of filter pruning, convolution-kernel pruning and pattern pruning compresses the model: the positions of the weights to be pruned are determined by the structured and unstructured pruning schemes and a mask layer is generated, while training with a penalty term gradually shrinks the weights to be discarded. Actual pruning is then performed by hard pruning, multiplying the original model parameters by the mask layer and setting the discarded parameters to zero. The pruned model is retrained and fine-tuned to recover accuracy: through knowledge distillation, the pre-pruning model serves as the teacher, and the difference between the predictions before and after pruning is used as the loss function.
Starting from the strengths and weaknesses of different pruning methods, the invention applies structured and unstructured pruning to the model, performing filter, convolution-kernel and pattern pruning in order of granularity; the pruning methods are not independent but are carried out cooperatively. The final model offers higher hardware friendliness while its compression ratio is slightly better than current hybrid-pruning results. Meanwhile, the invention judges through the pattern discriminant function whether the pattern set generated in the current round can be finally adopted, and experiments on multiple pattern sets demonstrate the effectiveness of this evaluation function, ensuring that the model's accuracy after pattern pruning is higher than that of current pattern-pruning methods.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flow chart of a coarse-and-fine-granularity combined neural network pruning method provided by the invention;
FIG. 2 is a schematic diagram of pattern generation and pruning provided by the present invention;
FIG. 3 is a diagram illustrating a comparison of filter pruning, convolution kernel pruning and pattern pruning in accordance with the present invention;
FIG. 4 is a schematic diagram of filter pruning according to the present invention;
FIG. 5 is a schematic diagram of convolution kernel pruning according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the embodiment of the invention discloses a coarse-and-fine-granularity combined neural network pruning method, which comprises the following steps:
s1, filter pruning:
and selecting candidate filters needing pruning in the original model layer by layer according to the two norms, carrying out group thinning training on the candidate filters, and pruning the candidate filters smaller than a threshold value after a certain number of rounds.
S2, convolution kernel pruning:
and sequencing the importance of the convolution kernels layer by layer according to the two norms, and obtaining the convolution kernels to be pruned layer by layer according to a pre-defined pruning rate.
S3, generating a pattern set:
and performing regularization compression on the convolution kernel by taking the weight as a unit, and dynamically generating a mode set meeting a pre-constructed mode discrimination function in the compression process.
S4, assigning a pattern to each convolution kernel:
Each convolution kernel is matched to its optimal pattern in the pattern set.
And S5, performing convolution-kernel pruning and pattern pruning.
S6, hard pruning: setting the parameters to be pruned to zero and hard-pruning the model.
S7, fine tuning and retraining:
and (4) retraining and fine-tuning the model after the hard pruning by combining a knowledge distillation method to obtain a final model after the pruning.
The above steps are further described below:
s1, filter pruning
The process is shown in fig. 4. Because the pruning granularity is large, regularization training of too many parameters may over-prune the model, making accuracy hard to recover; therefore candidate filters that may be pruned are found first, group-sparsity training is performed on them, and the filters falling below the threshold are actually pruned after a set number of rounds. Specifically, the convolution matrix is reshaped to (input_channel × kernel_size × kernel_size, output_channel) and the per-column L2 norms are sorted, as shown in formula (1). The least important p% of filters, whose norms fall below the threshold m, have their columns masked with 0.
$\|W_{:,f}\|_2=\sqrt{\sum_{i}W_{i,f}^{2}} \qquad (1)$, where $W_{:,f}$ is the column of the reshaped convolution matrix corresponding to filter f.
Because the convolution layers remaining after filter pruning are still dense matrices, subsequent steps are unaffected, and filter pruning shrinks the convolution matrices, which speeds up subsequent compression while removing unimportant filters in advance. Specifically, both the convolution matrix and the mask matrix are reshaped to (input_channel × kernel_size × kernel_size, output_channel), the L2 norm of each column is computed as above, and columns below the threshold 0.1m are recorded as filters to be pruned. The neural network is traversed layer by layer, storing the input and output channels to be pruned for each layer: for the l-th layer, the output channels to discard are the column indices of that layer's pruned filters (except for the last convolutional layer), and the input channels are the row indices of the (l-1)-th layer's pruned filters (except for the first convolutional layer). The rows and columns to be pruned are discarded with the index_select function.
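A minimal PyTorch-style sketch of this step is given below; the function names and the handling of the 0.1m threshold are illustrative assumptions, not code from the patent.

```python
import torch

def filters_to_keep(weight: torch.Tensor, m: float) -> torch.Tensor:
    """Return indices of output channels whose column L2 norm >= 0.1*m.

    weight: conv weight of shape (out_ch, in_ch, k, k); reshaping it to
    (in_ch*k*k, out_ch) puts one filter per column, as in the patent.
    """
    cols = weight.reshape(weight.shape[0], -1).t()
    norms = torch.linalg.vector_norm(cols, ord=2, dim=0)  # per-filter L2 norm
    return torch.nonzero(norms >= 0.1 * m).flatten()

def hard_prune_filters(weight: torch.Tensor, keep_out: torch.Tensor,
                       keep_in: torch.Tensor) -> torch.Tensor:
    """Drop pruned rows/columns with index_select, shrinking the matrix."""
    return weight.index_select(0, keep_out).index_select(1, keep_in)
```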
S2 convolution kernel pruning
A schematic diagram of convolution-kernel pruning is shown in fig. 5. Specifically, after the convolution matrix is reshaped, the L2 norm of each convolution kernel is computed one by one; the importance of the k-th convolution kernel of the f-th filter of each layer is given by formula (2). The parameters are sorted within each layer, the least important convolution kernels are found according to the predefined pruning rate, and their masks are set to 0.
$I_{f,k}=\left\|Z_{f,k,:,:}\right\|_2=\sqrt{\sum_{i}\sum_{j}Z_{f,k,i,j}^{2}} \qquad (2)$
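A short sketch of this selection under the same notation; applying a single pruning rate per layer is an illustrative assumption.

```python
import torch

def kernel_prune_mask(weight: torch.Tensor, prune_rate: float) -> torch.Tensor:
    """Zero-mask the least important convolution kernels of one layer.

    weight: (out_ch, in_ch, k, k); importance = per-kernel L2 norm (formula 2).
    Returns a 0-1 mask broadcastable over the weight tensor.
    """
    norms = weight.reshape(weight.shape[0], weight.shape[1], -1).norm(p=2, dim=2)
    n_prune = int(norms.numel() * prune_rate)
    _, idx = torch.topk(norms.flatten(), n_prune, largest=False)
    mask = torch.ones(norms.numel())
    mask[idx] = 0.0
    return mask.reshape(norms.shape)[:, :, None, None]
```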
S3, pattern set generation
The embodiment of the invention sets the model parameters as follows: the neural network has N convolutional layers, K input channels and F output channels, and the weight and bias of the n-th layer are denoted $W_n$ and $b_n$ respectively. For the f-th output channel and the k-th input channel of each layer, the corresponding convolution kernel is $Z_{f,k,:,:}$, and the weight in row i, column j of that kernel is $Z_{f,k,i,j}$.
Pattern pruning, as a form of unstructured pruning, has relatively good regularity and can be optimized by a subsequent compiler, reducing neural network inference time. Pattern-set generation divides into static generation and dynamic generation: static generation produces the patterns before pruning, while dynamic generation produces them during the pruning process.
For static pattern generation, the traditional scheme selects the maximum-weight pattern containing the central element, or selects it by mathematical derivation from convolution gradients. The theoretical derivation is rigorous, but because structured pruning and pattern pruning proceed simultaneously, the model weights must not change much after the redundant filters are removed; the model therefore has to be trained in advance to a well-distributed state, which greatly limits pruning speed.
For dynamic pattern generation, the traditional method builds a dynamic pattern pool by gradually generating candidate patterns while clipping unimportant kernels. Although the related concepts have been proposed, the ideas were never formalized, and since pattern generation is very time consuming, the patterns are in practice pre-generated in the code.
The embodiment of the invention combines the static and dynamic pattern-set generation methods and constructs a pattern discriminant function to judge whether the pattern set generated in the current round can be adopted; the core of the problem is how to judge whether a pattern set is a good one.
Previous studies on the importance of convolution-kernel element values have mainly found the following:
(1) The central weight in a 3 × 3 kernel is crucial.
(2) The important weights within a convolution kernel tend to be adjacent.
(3) The larger the absolute value of an element, the more important it tends to be.
It should be noted that element importance within a convolution kernel and pattern-generation criteria differ somewhat: importance is judged per individual convolution kernel, whereas pattern generation groups convolution kernels into patterns by a clustering-like idea, so some convolution kernels cannot be matched with their own best patterns. The diversity of the convolution kernels is therefore also an important criterion for measuring the quality of a pattern set.
Therefore, as shown in fig. 2, the invention adopts a new pattern-set generation method in which each pattern has 4 non-zero elements and the remaining elements are zero. Pattern generation is performed on all 3 × 3 convolution kernels in the neural network every r rounds, with the number of generated patterns set to k; considering hardware friendliness and practical effect, the hyper-parameter k is limited to 2, 4, 8 or 12.
The method specifically comprises the following steps:
S31, traversing each convolution kernel of the model and selecting the 4 positions with the largest absolute weights to form that convolution kernel's optimal pattern;
S32, globally counting the occurrence frequency of each optimal pattern and selecting the k most frequent patterns as the pattern set generated in the current round (a code sketch follows this list);
and S33, judging the weight adjacency, shape count and current model convergence of the pattern set with the pattern discriminant function to evaluate whether the pattern set generated in the current round can serve as the final pattern set.
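A condensed sketch of steps S31-S32, assuming each pattern is represented as a frozenset of flattened positions 0-8 in a 3 × 3 kernel (the helper name and data layout are illustrative):

```python
import torch
from collections import Counter

def generate_pattern_set(kernels: torch.Tensor, k: int):
    """kernels: (num_kernels, 3, 3). Returns the k most frequent
    4-element patterns across all kernels."""
    counter = Counter()
    top4 = kernels.abs().reshape(-1, 9).topk(4, dim=1).indices
    for row in top4:                                  # S31: per-kernel optimal pattern
        counter[frozenset(row.tolist())] += 1
    return [p for p, _ in counter.most_common(k)]     # S32: k most frequent patterns
```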
The embodiment of the invention provides three judgment conditions; the pattern-set generation process ends if and only if all of them are satisfied. The conditions are as follows:
(1) The non-zero elements within a single pattern are all adjacent.
(2) The number of pattern shapes is at most k/2.
(3) The current gradient mean is less than the threshold q.
The specific judgment process is as follows:
(1) For the judgment of element adjacency, each non-zero element position (i, j) of the pattern is traversed, and it is judged whether the four adjacent positions (i+1, j), (i-1, j), (i, j+1) and (i, j-1) lie inside the convolution kernel; if a position lies inside the kernel, it is judged whether that position is non-zero. If some element has no non-zero neighbor, the pattern elements are not all adjacent; if no such case is found after the traversal completes, all pattern elements are adjacent.
(2) For the judgment of the number of pattern shapes, adjacency is checked first, so non-adjacent pattern combinations need not be considered. There are $C_9^4 = 126$ possible patterns in total, of which only 36 are fully adjacent, so the invention decides by enumeration. The convolution-kernel weight indices are numbered in advance: the j-th weight of the i-th row receives the number i × kernel_size + j. The corresponding pattern shapes are grouped by recording positions in a dictionary whose keys concatenate the numbers of the non-zero weights from small to large. Since the kernels processed by the method are 3 × 3, the largest number is 9, so concatenating the numbers causes no ambiguity; for example, positions 1, 3, 5 and 9 correspond to the key '1359'. If the kernel size grows, another encoding must be considered. The dictionary's value is the pattern-shape number. Each pattern is traversed, the number corresponding to its shape is queried, and a set records which numbers have been seen; if a number is absent, the shape count is incremented by 1 and the number is added to the set.
(3) For the judgment of the gradient mean, the gradient of each position is obtained through the tensor's grad attribute and the mean is computed with the mean function.
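A sketch of the three checks under the representation used above (patterns as frozensets of flat positions in a 3 × 3 kernel); the key-to-shape mapping and reading the gradient criterion as a mean of absolute gradients are assumptions:

```python
import torch

def all_adjacent(pattern) -> bool:
    """Condition (1): every non-zero position has a non-zero 4-neighbor."""
    cells = {(p // 3, p % 3) for p in pattern}
    return all(any((i + di, j + dj) in cells
                   for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)))
               for i, j in cells)

# Precomputed mapping from position key (e.g. '1359') to a shape id,
# enumerated over the 36 fully adjacent 4-element patterns (assumed given).
SHAPE_ID: dict = {}

def shape_count(patterns) -> int:
    """Condition (2): count distinct pattern shapes (must be <= k/2)."""
    seen = set()
    for pat in patterns:
        key = ''.join(str(p) for p in sorted(pat))
        seen.add(SHAPE_ID.get(key, key))  # fall back to the raw key
    return len(seen)

def converged(weight: torch.Tensor, q: float) -> bool:
    """Condition (3): mean absolute gradient below the threshold q."""
    return weight.grad is not None and weight.grad.abs().mean().item() < q
```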
S4, assigning a pattern to each convolution kernel:
Let the pattern set be S; the optimization objective is shown in equation (3). Because the patterns of the convolution kernels are independent of one another, the pattern set is traversed one pattern at a time, the sum of the convolution kernel's absolute weights on each pattern is computed, and the index with the maximum sum is selected as that kernel's pattern. After assignment completes, the pattern shape of the convolution kernel no longer changes.
$s_{f,k}^{*}=\arg\max_{s\in S}\sum_{(i,j)\in s}\left|Z_{f,k,i,j}\right| \qquad (3)$
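A direct sketch of this assignment (equation (3)), reusing the frozenset pattern representation from the sketches above:

```python
import torch

def assign_pattern(kernel: torch.Tensor, patterns):
    """Match a 3x3 kernel to the pattern retaining the largest total
    absolute weight. Returns (pattern index, 0-1 mask of shape (3, 3))."""
    flat = kernel.abs().flatten()
    scores = torch.tensor([flat[list(pat)].sum() for pat in patterns])
    best = int(scores.argmax())
    mask = torch.zeros(9)
    mask[list(patterns[best])] = 1.0
    return best, mask.reshape(3, 3)
```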
S5, convolution-kernel pruning and pattern pruning
The invention performs regularization training, in stages, on the weights that may be pruned. For each pruning method, a 0-1 mask matrix controls whether a weight is retained: a pruned weight's mask is 0 and a retained weight's is 1. The mask matrices of filter, convolution-kernel and pattern pruning for $W_k$ are $I_{k\_f}$, $I_{k\_k}$ and $I_{k\_p}$ respectively, each with the same shape as the convolution matrix. With the regularization coefficients of the filter, convolution-kernel and pattern pruning stages set to $\lambda_f$, $\lambda_k$ and $\lambda_p$ respectively, the loss function is shown in equation (4).
$L = L_{CE} + \lambda_f\sum_k\left\|W_k\odot(1-I_{k\_f})\right\|_2 + \lambda_k\sum_k\left\|W_k\odot(1-I_{k\_k})\right\|_2 + \lambda_p\sum_k\left\|W_k\odot(1-I_{k\_p})\right\|_2 \qquad (4)$
Cross entropy is adopted to characterize the difference between the model's predicted values and true values. Let the number of output classes of the neural network be n, the distribution of true values output by the network be y_true, and the predicted values be y_pred, as shown in formula (5).
$L_{CE}=-\sum_{i=1}^{n} y\_true_{i}\,\log(y\_pred_{i}) \qquad (5)$
For the optimizer, the invention uses stochastic gradient descent (SGD) as the iterative method for optimizing the differentiable objective function, iteratively updating the weights and bias terms from the gradient of the loss function computed on mini-batches, and performs compression pruning with the MultiStepLR learning-rate decay strategy.
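A minimal PyTorch setup matching this description; the learning rate, momentum, milestones and epoch count are illustrative assumptions, not values from the patent.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)   # stand-in for the network being compressed
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# MultiStepLR decays the learning rate by gamma at each milestone epoch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(160):
    # ... forward pass, loss of equation (4), backward(), optimizer.step() ...
    scheduler.step()
```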
When the compression stage starts, the mask matrices are initialized to all ones (so their negations are all zeros), and the loss function is the same as in the ordinary training stage. The values of the three mask matrices are then changed in stages, which controls the compression granularity of the model. The three pruning methods are compared schematically in fig. 3: filter pruning has the largest granularity, convolution-kernel pruning is intermediate, and pattern pruning has the smallest. Since pruning should be a coarse-to-fine process, the model is first sparsified and filter-pruned, and then pruned at convolution-kernel and pattern granularity.
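A sketch of the penalty terms of equation (4) as reconstructed above; the exact norm and the complement-mask convention are assumptions consistent with that reading.

```python
import torch

def staged_penalty(weights, masks_f, masks_k, masks_p,
                   lf: float, lk: float, lp: float) -> torch.Tensor:
    """Push the weights whose mask is 0 (candidates for pruning) toward zero.

    weights and masks_*: lists of per-layer tensors with matching shapes.
    """
    pen = torch.zeros(())
    for w, mf, mk, mp in zip(weights, masks_f, masks_k, masks_p):
        pen = pen + lf * torch.norm(w * (1 - mf), p=2)
        pen = pen + lk * torch.norm(w * (1 - mk), p=2)
        pen = pen + lp * torch.norm(w * (1 - mp), p=2)
    return pen
```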
S6 hard pruning
Hard pruning performs the actual intra-kernel clipping of the model. The parameters to be clipped are by now very close to 0; each layer's weight parameters are traversed and the model parameters are updated by multiplying the convolution matrix by the mask matrices, so that the weights to be clipped become 0. The updated output is $W_k'$, as shown in formula (6).
$W_k' = W_k \odot I_{k\_k} \odot I_{k\_p} \qquad (6)$
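Equation (6) as a one-step sketch:

```python
import torch

@torch.no_grad()
def hard_prune(weight: torch.Tensor, kernel_mask: torch.Tensor,
               pattern_mask: torch.Tensor) -> torch.Tensor:
    """Zero the pruned positions by elementwise masking (equation 6)."""
    return weight.mul_(kernel_mask).mul_(pattern_mask)
```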
S7, fine-tuning and retraining
In the embodiment of the invention, the model is retrained after hard pruning, and the generated model is further fine-tuned with a knowledge-distillation method. Vanilla knowledge distillation is a classic accuracy-recovery method in the field of image classification: a large teacher model is distilled into a small student model in a prescribed manner without large accuracy loss. Here, the distillation loss function forces the original network and the pruned network to produce similar logit outputs, as shown in equation (7). In the invention, the balancing coefficient is fixed at α = 0.4 and the temperature at T = 4.
$L_{KD}=\alpha\,L_{CE}\big(y,\sigma(z_s)\big)+(1-\alpha)\,T^{2}\,KL\big(\sigma(z_t/T)\,\|\,\sigma(z_s/T)\big) \qquad (7)$, where $z_s$ and $z_t$ are the logits of the pruned (student) and original (teacher) networks and $\sigma$ denotes softmax.
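A sketch of this loss in PyTorch, following the standard vanilla-KD formulation assumed in the reconstruction of equation (7) above:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                 labels: torch.Tensor, alpha: float = 0.4, T: float = 4.0):
    """Hard-label cross entropy + T^2-scaled KL between softened logits."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    return alpha * hard + (1 - alpha) * soft
```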
In order to further demonstrate the superiority of the pruning method, the invention carries out the following experimental verification:
The cifar10 and cifar100 datasets are used, vgg16, resnet32 and resnet50 are selected as the networks to be pruned, and the experiments are carried out on a general-purpose GPU platform. The processing of each module and the preliminary results are as follows.
Experiment one: multiple groups of comparison experiments are performed to study whether the conclusions drawn from research on convolution-kernel parameter importance apply to model pruning; experimental and control groups are set up, and pattern-set quality is measured by the accuracy of the pruned model. The specific experimental design and results are shown in Table 1.
Table 1. Comparison of pattern-pruning methods under different pattern-selection strategies
[Table 1 appears as an image in the original publication; its per-group accuracy figures are not recoverable from this text.]
Experiment two: single-pruning tests with pattern counts of 2, 4, 8 and 12 are carried out, and the accuracy under the same pattern count is compared with other existing results; the results show that the invention outperforms the existing work.
Experiment three: for the several indices of the pattern evaluation function, the invention verifies their correctness during compression, i.e., a well-trained model gradually comes to satisfy the evaluation function's judgment conditions. With the number of compression rounds on the horizontal axis, trend curves of the gradient, the number of adjacent-element patterns and the number of pattern shapes are plotted to verify the feasibility and accuracy of the evaluation conditions.
Experiment four: with the pattern count limited to 4, structured and unstructured pruning are applied to the model, the parameter compression ratio and accuracy are compared, and other hybrid-pruning results are taken as comparison objects. Observation of the current results shows that the invention offers higher hardware friendliness while being slightly superior to current research.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A coarse-and-fine-granularity combined neural network pruning method, characterized by comprising the following steps:
selecting, layer by layer according to the L2 norm, the candidate filters to be pruned in the original model, performing group-sparsity training on the candidate filters, and pruning the candidate filters falling below a threshold after a set number of rounds;
ranking convolution-kernel importance layer by layer according to the L2 norm, and obtaining the convolution kernels to be pruned layer by layer according to a predefined pruning rate;
performing weight-level regularized compression on the convolution kernels, and dynamically generating, during compression, a pattern set satisfying a pre-constructed pattern discriminant function;
matching each convolution kernel to its optimal pattern in the pattern set;
performing convolution-kernel pruning and pattern pruning;
setting the parameters to be pruned to zero and hard-pruning the model;
and retraining and fine-tuning the hard-pruned model with knowledge distillation to obtain the final pruned model.
2. The coarse-and-fine-granularity combined neural network pruning method of claim 1, wherein the pattern set is generated as follows:
traversing each convolution kernel of the model and selecting the 4 positions with the largest absolute weights to form that convolution kernel's optimal pattern;
globally counting the occurrence frequency of each optimal pattern and selecting the k most frequent patterns as the pattern set generated in the current round;
and judging the weight adjacency, shape count and current model convergence of the pattern set with a pattern discriminant function to evaluate whether the pattern set generated in the current round can serve as the final pattern set.
3. The coarse-and-fine-granularity combined neural network pruning method of claim 2, wherein the condition for judging weight adjacency in the pattern set is that the non-zero elements within a single pattern are all adjacent, determined as follows:
traversing each non-zero element position (i, j) of the pattern and judging whether the four adjacent positions (i+1, j), (i-1, j), (i, j+1) and (i, j-1) lie inside the convolution kernel; if a position lies inside the kernel, judging whether it is non-zero; if some element has no non-zero neighbor, the pattern elements are not all adjacent, and if no such case is found after the traversal completes, all pattern elements are adjacent.
4. The coarse-and-fine-granularity combined neural network pruning method of claim 2, wherein the condition for judging the number of pattern shapes is that the number of shapes in the pattern set is at most k/2, determined as follows:
numbering the convolution-kernel weight indices, where the j-th weight of the i-th row receives the number i × kernel_size + j;
grouping the corresponding pattern shapes by recording positions in a dictionary whose keys concatenate the numbers of the non-zero weights from small to large;
and traversing each pattern, querying the number corresponding to its shape, and using a set to record which numbers have been seen; if a number is absent, incrementing the shape count by 1 and adding the number to the set.
5. The coarse-and-fine-granularity combined neural network pruning method of claim 2, wherein the condition for judging current model convergence is that the current gradient mean is smaller than a threshold q, determined as follows:
obtaining the gradient of each position through the tensor's grad attribute and computing the mean with the mean function.
6. The coarse-and-fine-granularity combined neural network pruning method of claim 1, wherein pruning the candidate filters falling below the threshold comprises:
reshaping the convolution matrix and the mask matrix to the same shape, computing the L2 norm of each column, and recording the columns below the threshold 0.1m as filters to be pruned;
traversing the model layer by layer and storing the input and output channels to be pruned for each layer, where for the l-th layer the output channels to discard are the column indices of that layer's pruned filters and the input channels are the row indices of the (l-1)-th layer's pruned filters;
and discarding the rows and columns to be pruned with the index_select function.
7. The coarse-and-fine-granularity combined neural network pruning method of claim 1, wherein the convolution kernels to be pruned are obtained as follows:
reshaping the convolution matrix and computing the L2 norm of each convolution kernel one by one;
sorting all convolution kernels of every filter of each layer by their L2 norms;
and finding the k least important convolution kernels according to the predefined pruning rate and setting their masks to 0 as the convolution kernels to be pruned.
8. The coarse-and-fine-granularity combined neural network pruning method of claim 1, wherein matching each convolution kernel to its optimal pattern in the pattern set comprises:
traversing the final pattern set one pattern at a time, computing the sum of the convolution kernel's absolute weights on each pattern, and selecting the index with the maximum sum as the kernel's pattern; after assignment completes, the pattern shape of the convolution kernel is not changed.
9. The coarse-and-fine-granularity combined neural network pruning method of claim 1, wherein regularization training is performed on the weights that may be pruned in the filter-pruning, convolution-kernel-pruning and pattern-pruning stages respectively, with cross entropy used to characterize the difference between the model's predicted values and true values.
10. The coarse-and-fine-granularity combined neural network pruning method of claim 9, wherein during the staged regularization training of the weights that may be pruned, stochastic gradient descent (SGD) is adopted as the iterative method to optimize the differentiable objective function, the gradient of the loss function is computed on mini-batches to iteratively update the weights and bias terms, and compression pruning is performed with the MultiStepLR learning-rate decay strategy.
CN202111187212.1A 2021-10-12 2021-10-12 Coarse and fine granularity combined neural network pruning method Pending CN113850385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111187212.1A CN113850385A (en) 2021-10-12 2021-10-12 Coarse and fine granularity combined neural network pruning method

Publications (1)

Publication Number Publication Date
CN113850385A (en) 2021-12-28

Family

ID=78977981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111187212.1A Pending CN113850385A (en) 2021-10-12 2021-10-12 Coarse and fine granularity combined neural network pruning method

Country Status (1)

Country Link
CN (1) CN113850385A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580636A (en) * 2022-05-06 2022-06-03 江苏省现代企业信息化应用支撑软件工程技术研发中心 Neural network lightweight deployment method based on three-target joint optimization
CN114580636B (en) * 2022-05-06 2022-09-16 江苏省现代企业信息化应用支撑软件工程技术研发中心 Neural network lightweight deployment method based on three-target joint optimization
CN114937186A (en) * 2022-06-14 2022-08-23 厦门大学 Neural network data-free quantification method based on heterogeneous generated data
CN114937186B (en) * 2022-06-14 2024-06-07 厦门大学 Neural network data-free quantization method based on heterogeneous generated data
CN115496207A (en) * 2022-11-08 2022-12-20 荣耀终端有限公司 Neural network model compression method, device and system
CN115496207B (en) * 2022-11-08 2023-09-26 荣耀终端有限公司 Neural network model compression method, device and system
CN117689001A (en) * 2024-02-02 2024-03-12 中科方寸知微(南京)科技有限公司 Neural network multi-granularity pruning compression method and system based on zero data search
CN117689001B (en) * 2024-02-02 2024-05-07 中科方寸知微(南京)科技有限公司 Neural network multi-granularity pruning compression method and system based on zero data search
CN117910536A (en) * 2024-03-19 2024-04-19 浪潮电子信息产业股份有限公司 Text generation method, and model gradient pruning method, device, equipment and medium thereof
CN117910536B (en) * 2024-03-19 2024-06-07 浪潮电子信息产业股份有限公司 Text generation method, and model gradient pruning method, device, equipment and medium thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination