CN113065644B - Method, apparatus, device and medium for compressing neural network model

Info

Publication number: CN113065644B (granted publication of application CN202110456225.8A; earlier publication CN113065644A)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: convolution, channels, channel, subset, neural network
Legal status: Active
Inventor: 鲁超
Current and original assignee: Shanghai Bilibili Technology Co Ltd
Application filed by Shanghai Bilibili Technology Co Ltd; priority to CN202110456225.8A

Classifications

    • G06N3/045 Combinations of networks (G06N3/04 Architecture, e.g. interconnection topology; G06N3/02 Neural networks; G06N3/00 Computing arrangements based on biological models)
    • G06N3/084 Backpropagation, e.g. using gradient descent (G06N3/08 Learning methods)
    • G06N5/04 Inference or reasoning models (G06N5/00 Computing arrangements using knowledge-based models)

Abstract

The present disclosure provides a method, apparatus, device and medium for compressing a neural network model, relating to the technical field of artificial intelligence and, in particular, to the technical field of deep learning. The scheme comprises the following steps: acquiring a first neural network model, wherein the first neural network model comprises a plurality of convolution channels and a batch normalization channel corresponding to each convolution channel, each batch normalization channel having an importance parameter representing the importance degree of the corresponding convolution channel; determining at least one secondary channel from a first subset of the plurality of convolution channels based on the respective importance parameters; determining respective redundancy parameters for the convolution channels in a second subset of the plurality of convolution channels, the redundancy parameters being indicative of the degree of redundancy of the respective convolution channels; determining at least one redundant channel from the second subset based on the respective redundancy parameters; and constructing a compressed second neural network model based on the remaining convolution channels.

Description

Method, apparatus, device and medium for compressing neural network model
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to the field of deep learning, and in particular, to a method, apparatus, electronic device, computer readable storage medium, and computer program product for compressing a neural network model.
Background
Convolutional neural networks are among the representative algorithms of deep learning and are widely used for processing tasks in fields such as speech recognition, image/video processing, and natural language processing. As processing tasks become more complex, convolutional neural networks continue to scale up. Accordingly, the amount of computation and the storage space occupied when running a convolutional neural network grow ever larger, which limits its application and makes it difficult to deploy and run on devices with limited computing capacity and storage space (such as personal computers, mobile phones, tablet computers, smart wearable devices, and the like).
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a method, apparatus, electronic device, computer-readable storage medium, and computer program product for compressing a neural network model.
According to an aspect of the present disclosure, there is provided a method for compressing a neural network model, comprising: acquiring a first neural network model, wherein the first neural network model comprises a plurality of convolution channels and a batch normalization channel corresponding to each convolution channel, each batch normalization channel having an importance parameter for representing the importance degree of the corresponding convolution channel; determining at least one secondary channel from a first subset of the plurality of convolution channels based on the respective importance parameters; determining respective redundancy parameters of the convolution channels in a second subset of the plurality of convolution channels, wherein the redundancy parameters are used for representing the redundancy degrees of the respective convolution channels; determining at least one redundant channel from the second subset based on the respective redundancy parameters; and constructing a compressed second neural network model based on the remaining convolution channels, wherein the remaining convolution channels are the convolution channels of the plurality of convolution channels other than the at least one secondary channel and the at least one redundant channel.
According to another aspect of the present disclosure, there is also provided an apparatus for compressing a neural network model, including: an acquisition module configured to acquire a first neural network model including a plurality of convolution channels and a batch normalization channel corresponding to each convolution channel, the batch normalization channel having an importance parameter for representing an importance level of the respective convolution channel; a first tagging module configured to determine at least one secondary channel from a first subset of the plurality of convolution channels based on a corresponding importance parameter; a determining module configured to determine respective redundancy parameters of the convolution channels in the second subset of the plurality of convolution channels, the redundancy parameters being indicative of redundancy degrees of the respective convolution channels; a second tagging module configured to determine at least one redundant channel from the second subset based on the respective redundancy parameter; and a building module configured to build a compressed second neural network model based on a remaining convolution channel, wherein the remaining convolution channel is a convolution channel of the plurality of convolution channels other than the at least one secondary channel and the at least one redundant channel.
According to another aspect of the present disclosure, there is also provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program which, when executed by the at least one processor, implements a method according to the above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a method according to the above.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a method according to the above.
According to one or more embodiments of the present disclosure, secondary channels in a first neural network model are determined by importance parameters, redundant channels in the first neural network model are determined by redundancy parameters, and the secondary channels and the redundant channels are cut together to generate a compressed second neural network model, so that overall compression of the first neural network model is achieved, the number of convolution channels is greatly reduced compared with the first neural network model, and therefore the calculation amount, occupied storage space and reasoning time consumption when the model is operated are greatly reduced. Meanwhile, as the secondary channels and the redundant channels are cut off, the full compression can be realized while the reasoning accuracy of the model is kept basically unchanged.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a flow chart of a method for compressing a neural network model, according to an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a method of determining respective redundancy parameters for convolutional channels in a second subset in accordance with an embodiment of the disclosure;
FIG. 3 illustrates a flow chart of a method of determining at least one redundant channel from a second subset in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of a compression process of a first neural network model, according to an embodiment of the present disclosure;
FIG. 5 illustrates a block diagram of an apparatus for compressing a neural network model, according to an embodiment of the present disclosure;
Fig. 6 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
For ease of understanding, before describing exemplary embodiments of the present disclosure, several terms in the present disclosure are first explained.
1. Convolutional neural network
The convolutional neural network (Convolutional Neural Network, CNN) is a feed-forward neural network (Feedforward Neural Network) with a deep structure that includes convolution calculations, and is one of the representative algorithms of deep learning. A convolutional neural network includes a plurality of convolutional layers (Convolutional Layer), each having a different number of convolution kernels, each convolution kernel corresponding to one convolution channel.
The convolutional channels have weight (weights) and bias (bias) parameters (in some cases, some convolutional channels may have only weights and no bias), and the weights and bias of each convolutional channel are obtained by training a convolutional neural network.
2. Batch normalization layer
Batch normalization (Batch Normalization, BN), also known as batch standardization, is a technique used to improve the performance and stability of convolutional neural networks. The batch normalization layer (Batch Normalization Layer, BN layer) is a component of the convolutional neural network that provides zero-mean/unit-variance input for the next processing layer so as to accelerate and stabilize training. Typically, a convolutional neural network includes a plurality of batch normalization layers, each having a different number of batch normalization channels. Each batch normalization layer corresponds to one convolution layer, and each batch normalization channel in a batch normalization layer corresponds to one convolution channel in the respective convolution layer. Typically, the batch normalization layer is disposed after the corresponding convolution layer, i.e., the input of the batch normalization channel is the output of the corresponding convolution channel.
The data processing process of the batch normalization channel can be represented by the formula Y = α×X + β, where X and Y are the input data and the output data of the batch normalization channel, respectively, and α and β are parameters of the batch normalization channel whose values are obtained through training.
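By way of illustration only, the following sketch (assuming a PyTorch-style implementation, where the learnable scale `weight` of a BatchNorm2d layer plays the role of α and its `bias` plays the role of β; the layer sizes are arbitrary) shows how the per-channel parameters of a batch normalization layer can be read out:

```python
import torch
import torch.nn as nn

# A convolution layer with 8 output channels followed by its batch normalization layer.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
bn = nn.BatchNorm2d(num_features=8)

x = torch.randn(4, 3, 32, 32)      # a small batch of input data
y = bn(conv(x))                    # each BN channel normalizes its input and applies alpha * x_hat + beta

alpha = bn.weight.detach()         # per-channel scale, the importance parameter alpha
beta = bn.bias.detach()            # per-channel shift beta
print(alpha.shape, beta.shape)     # torch.Size([8]) torch.Size([8])
```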
The number of convolution channels included in a convolutional neural network determines its computational complexity. In view of this, the present disclosure provides a scheme for compressing a neural network model. The scheme first determines the secondary channels and the redundant channels in the neural network model, and then cuts out the determined secondary channels and redundant channels together. In this way, the neural network model can be pruned comprehensively and effectively: both redundant channels and channels of low importance are cut, so that the number of convolution channels of the compressed neural network model is greatly reduced, the amount of computation, the occupied storage space and the inference time when running the model can be greatly reduced, and the inference accuracy of the model is kept substantially unchanged.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. Unless otherwise indicated, references to neural networks hereinafter refer to convolutional neural networks.
Fig. 1 illustrates a flowchart of a method 100 for compressing a neural network model, according to an embodiment of the present disclosure. The method 100 may be performed in an electronic device, i.e. the subject of the execution of the method 100 is an electronic device. Electronic devices include, but are not limited to, servers, desktop computers, laptop computers, smart phones, and the like. Embodiments of an electronic device for performing the method 100 will be described in detail below.
As shown in fig. 1, the method 100 may include: step S110, acquiring a first neural network model, wherein the first neural network model comprises a plurality of convolution channels and a batch normalization channel corresponding to each convolution channel, each batch normalization channel having an importance parameter for representing the importance degree of the corresponding convolution channel; step S120, determining at least one secondary channel from a first subset of the plurality of convolution channels based on the respective importance parameters; step S130, determining respective redundancy parameters of the convolution channels in a second subset of the plurality of convolution channels, wherein the redundancy parameters are used for representing the redundancy degrees of the respective convolution channels; step S140, determining at least one redundant channel from the second subset based on the respective redundancy parameters; and step S150, constructing a compressed second neural network model based on the remaining convolution channels, wherein the remaining convolution channels are the convolution channels of the plurality of convolution channels other than the at least one secondary channel and the at least one redundant channel. In this way, the secondary channels in the first neural network model are determined through the importance parameters, the redundant channels in the first neural network model are determined through the redundancy parameters, and the secondary channels and the redundant channels are cut out together to generate the compressed second neural network model, so that the first neural network model can be compressed comprehensively and effectively, the number of convolution channels of the compressed second neural network model is greatly reduced compared with the first neural network model, and the amount of computation, the occupied storage space and the inference time when running the model are greatly reduced. Meanwhile, since only the secondary channels and the redundant channels are cut out, thorough compression can be achieved while the inference accuracy of the model is kept substantially unchanged.
The first neural network model acquired in step S110 may be a trained convolutional neural network model. The first neural network model may include a plurality of convolution channels and a batch normalization channel corresponding to each convolution channel, the batch normalization channel having an importance parameter for indicating the degree of importance of the respective convolution channel. As described above, the data processing process of the batch normalization channel may be represented by the formula Y = α×X + β, where X and Y are the input data and output data of the batch normalization channel, and α and β are parameters of the batch normalization channel whose values are determined through training. Based on the above formula, the smaller the value of α, the smaller the value of the corresponding output data Y, indicating that the corresponding convolution channel contributes less to the inference of the neural network model. Even if the corresponding convolution channel is removed, the inference accuracy of the neural network model is only weakly influenced (this weak influence can be repaired by the model fine-tuning step described below).
According to some embodiments, the importance parameter used to represent the importance of the corresponding convolution channel may be the alpha parameter of the batch normalization channel. In other words, the greater the value of the α parameter, the greater the importance of the corresponding convolution channel; the smaller the value of the alpha parameter, the lower the importance of the corresponding convolution channel.
According to some embodiments, during the training of the first neural network model, the loss function may be calculated based at least on the importance parameters of the batch normalization channels. Because the loss function in the training process incorporates the importance parameters of the batch normalization channels, a sparsity-training effect is achieved, so that in the trained first neural network model the values of the importance parameters of the batch normalization channels differ markedly from one another and the important channels and the secondary channels become clearly distinguished, which facilitates determining the secondary channels of low importance.
Specifically, the loss function employed in training the first neural network model may include the sum of the absolute values of the importance parameters of the batch normalization channels, i.e., the loss function takes the importance parameters of the batch normalization channels as an L1 regularization term. By adding the importance parameters of the batch normalization channels to the loss function as an L1 regularization term, the values of the importance parameters of the batch normalization channels in the trained first neural network model differ sufficiently to be in a polarized state, with the smaller values approaching 0.
The first neural network model may be trained, for example, by the following steps Step1-Step 5:
Step1, obtain a batch of training samples (A, B), where A is the data and B is the label, and the number of training samples is BatchSize; the data and labels may be determined according to the application scenario of the first neural network model, for example, the data may be text, images, video, audio, and the like, and correspondingly, the label may be the category to which the text, image, video or audio data belongs;
Step2, input the batch of training samples into a neural network (the initial weights of all convolution channels of the neural network are obtained randomly) to obtain the predicted output of the neural network;
Step3, calculate the loss function value L = L0 + λ·Σ|α| according to the predicted output of the network, the label B of the training samples, and the importance parameter of each batch normalization channel, where L0 is the prediction loss function between the predicted output and the label B (e.g., an absolute value loss function, a square loss function, a cross entropy loss function, etc.), λ is a regularization coefficient (a preset constant), and α is the importance parameter of the batch normalization channel (i.e., the α parameter of the BN channel);
Step4, adjust the weight of each convolution channel by back-propagating according to L;
Step5, repeat Step1-Step4 until the loss function value L of the neural network does not exceed a preset range, thereby obtaining the first neural network model.
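A minimal sketch of the sparsity training loop described in Step1-Step5 is given below, assuming PyTorch; `model`, `train_loader`, `num_epochs`, `lam` and `lr` are illustrative placeholder names rather than elements of the original disclosure, and cross-entropy is used only as an example of the prediction loss L0:

```python
import torch
import torch.nn as nn

def l1_of_bn_scales(model: nn.Module) -> torch.Tensor:
    """Sum of |alpha| over all batch normalization channels (the L1 regularization term)."""
    total = torch.tensor(0.0)
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            total = total + m.weight.abs().sum()
    return total

def sparsity_train(model, train_loader, num_epochs=10, lam=1e-4, lr=1e-3):
    criterion = nn.CrossEntropyLoss()            # prediction loss L0 between output and label B
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(num_epochs):
        for data, labels in train_loader:        # Step1: a batch of training samples (A, B)
            optimizer.zero_grad()
            outputs = model(data)                # Step2: forward pass to get the predicted output
            loss = criterion(outputs, labels) + lam * l1_of_bn_scales(model)  # Step3: L = L0 + lambda * sum|alpha|
            loss.backward()                      # Step4: back-propagate and
            optimizer.step()                     #        adjust the convolution channel weights
    return model
```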
After training to obtain the first neural network model, steps S120-S140 may be performed to determine secondary and redundant channels of low importance in the first neural network model. Wherein step S120 is for determining at least one secondary channel from a first subset of the plurality of convolutional channels comprised by the first neural network model, and steps S130 and S140 are for determining at least one redundant channel from a second subset of the plurality of convolutional channels comprised by the first neural network model.
In the embodiment of the present disclosure, the determining process of the secondary channel (step S120) and the determining process of the redundant channel (steps S130, S140) may be performed in any order, for example, they may be performed sequentially in a specified order, or may be performed simultaneously in parallel. The first subset and the second subset may each include all of the plurality of convolution channels, or may each include a part of the plurality of convolution channels. The different execution sequences of the secondary channel determination process (step S120) and the redundant channel determination process (steps S130, S140) correspond to different first and second subsets, which will be described in detail below in connection with several embodiments.
According to some embodiments, step S120 may be performed first to determine at least one secondary channel from the plurality of convolution channels, and then steps S130 and S140 may be performed to determine at least one redundant channel from the remaining convolution channels other than the at least one secondary channel. In this case, the first subset may include the plurality of convolution channels, and the second subset may include the remaining convolution channels of the plurality of convolution channels other than the at least one secondary channel. Illustratively, after determining the at least one secondary channel in step S120, the determined secondary channels may be marked and their weight vectors set to 0 (it may be understood that the weights of a convolution channel may be stored in the form of a matrix or a vector; for ease of describing the geometric median vector hereinafter, in the embodiments of the present disclosure the weights of a convolution channel are referred to as its weight vector), so that the secondary channels determined in step S120 do not affect the process of determining the redundant channels when steps S130 and S140 are subsequently performed.
According to other embodiments, determining the secondary channels (step S120) and determining the redundant channels (steps S130, S140) may be performed in parallel. In this case, the first subset and the second subset may each include the plurality of convolution channels described above. After the secondary channel and the redundant channel are respectively determined, the determined secondary channel and redundant channel can be respectively marked. In some cases, the secondary channel set determined in step S120 may be partially overlapped with the redundant channel set determined in steps S130 and S140, that is, some convolution channels may exist and be marked as secondary channels and redundant channels at the same time, so that the overlapped portions may be de-duplicated to avoid duplication marking of the convolution channels.
According to still other embodiments, steps S130 and S140 may be performed first to determine at least one redundant channel from the plurality of convolution channels, and step S120 may then be performed to determine at least one secondary channel from the remaining convolution channels other than the at least one redundant channel. In this case, the second subset may include the plurality of convolution channels, and the first subset includes the remaining convolution channels other than the at least one redundant channel.
The secondary channel determination process of step S120 and the redundant channel determination processes of steps S130, S140 are described in detail below.
In step S120, at least one secondary channel is determined from a first subset of the plurality of convolved channels based on the respective importance parameter.
According to some embodiments, a convolution channel in the first subset whose corresponding importance parameter is smaller than a first threshold may be taken as a secondary channel. The first threshold may be preset by a person skilled in the art according to the actual situation. According to some embodiments, in the first neural network model obtained through sparsity training, the importance parameters of the batch normalization channels are in a polarized state. The importance parameter of a convolution channel with low importance is very close to 0 (e.g., on the order of 10^-4), while the importance parameter of a convolution channel with high importance is typically greater than 0.5. Thus, the first threshold may be set to 0.1, for example, to effectively distinguish convolution channels of low importance from those of high importance, and the convolution channels whose corresponding batch normalization channels have importance parameters less than 0.1 are taken as secondary channels.
According to further embodiments, a first number of convolution channels in the first subset having the smallest corresponding importance parameter may be used as secondary channels. That is, the convolution channels in the first subset may be ordered in order of decreasing importance parameter, with the first number of convolution channels having the smallest importance parameter being the secondary channels. The first number may be determined by one skilled in the art with reference to factors such as the size of the first neural network model, the model compression effect desired to be achieved, and the like. For example, the first neural network model includes 1000 convolution channels from which it is desired to cut 10% of the secondary channels, and accordingly, the first number may be set to 1000×10% =100.
According to other embodiments, the two embodiments may also be combined, i.e. a convolution channel in the first subset having a corresponding importance parameter smaller than the first threshold value may be taken as a secondary channel and a first number of convolution channels in the first subset having a smallest corresponding importance parameter may be taken as secondary channels.
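The following sketch illustrates one way the secondary-channel selection of step S120 could be implemented, assuming PyTorch and assuming the importance parameters of the first subset have been gathered into a single tensor `alpha`; the function name and arguments are illustrative only:

```python
import torch

def select_secondary_channels(alpha: torch.Tensor, first_threshold: float = 0.1,
                              first_number: int = 0) -> torch.Tensor:
    """Return the indices of the secondary channels in the first subset.

    Channels whose importance parameter alpha is below first_threshold are secondary;
    optionally, the first_number channels with the smallest alpha are also marked secondary.
    """
    below = torch.nonzero(alpha < first_threshold, as_tuple=False).flatten()
    if first_number > 0:
        smallest = torch.argsort(alpha)[:first_number]
        return torch.unique(torch.cat([below, smallest]))
    return below

# Example from the text: 1000 channels, cut the 10% with the smallest importance parameter.
alpha = torch.rand(1000)
secondary = select_secondary_channels(alpha, first_threshold=0.1, first_number=100)
```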
After at least one secondary channel is determined in step S120, the secondary channels may be marked.
In step S130, respective redundancy parameters for the convolution channels in the second subset of the plurality of convolution channels are determined, the redundancy parameters being indicative of the redundancy of the respective convolution channels.
According to some embodiments, as shown in fig. 2, the respective redundancy parameters of the convolution channels in the second subset may be determined by the following steps S132-S136: step S132, calculating geometric median vectors of weight vectors of all convolution channels in the second subset; step S134, respectively calculating the distance from the weight vector of each convolution channel in the second subset to the geometric median vector; and step S136, determining the corresponding distance as the redundancy parameter of each convolution channel in the second subset, wherein the redundancy degree of the convolution channel is inversely proportional to the corresponding redundancy parameter. The geometric median vector is the "center" of each convolution channel in the second subset, i.e., the common property of each convolution channel in the second subset. Therefore, if the weight vector of a certain convolution channel is close to the geometric median vector, the information of the convolution channel can be considered to be overlapped with other convolution channels, the redundancy degree is high, the convolution channel is redundant for the first neural network model, and the convolution channel can be cut out without greatly influencing the reasoning accuracy of the first neural network model.
According to some embodiments, the geometric median vector may be a vector having a smallest sum of distances from the weight vector of each convolution channel in the second subset.
It will be appreciated that since each convolution channel corresponds to one convolution kernel, the size of each convolution channel may be different, e.g., the size of the convolution kernels may be 3*3, 5*5, etc., and thus the dimensions of the weight vectors (i.e., the number of weights included in the weight vectors) of each convolution channel in the second subset may be different. In this case, the weight vector with the smaller dimension may be zero-padded so that the weight vectors of the convolution channels in the second subset have the same dimension. The geometric median vector is then calculated from the weight vectors of the same dimension for each convolution channel in the second subset.
According to some embodiments, the distance from the weight vector to the geometric median vector in step S134 may be, for example, but not limited to, the euclidean distance therebetween.
The degree of redundancy of the convolution channels in step S136 inversely proportional to the corresponding redundancy parameter (i.e., distance) means that: the closer the distance from the weight vector of the convolution channel to the geometric median vector is, the higher the redundancy degree of the convolution channel is; the farther the weight vector of a convolution channel is from the geometric median vector, the lower the degree of redundancy of the convolution channel.
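A sketch of steps S132-S136 is given below, assuming PyTorch and assuming the channel weight vectors of the second subset have already been flattened, zero-padded to a common dimension and stacked into the rows of `weights`; the Weiszfeld iteration used here to approximate the geometric median is one standard choice, not one mandated by the disclosure:

```python
import torch

def geometric_median(weights: torch.Tensor, num_iters: int = 50, eps: float = 1e-8) -> torch.Tensor:
    """Approximate the geometric median of the rows of `weights` (one weight vector per
    convolution channel) using Weiszfeld iterations, starting from the mean vector."""
    median = weights.mean(dim=0)
    for _ in range(num_iters):
        dists = torch.norm(weights - median, dim=1).clamp_min(eps)
        inv = 1.0 / dists
        median = (weights * inv.unsqueeze(1)).sum(dim=0) / inv.sum()
    return median

def redundancy_parameters(weights: torch.Tensor) -> torch.Tensor:
    """Redundancy parameter of each channel: Euclidean distance from its weight vector to
    the geometric median vector (a smaller distance means a higher degree of redundancy)."""
    median = geometric_median(weights)
    return torch.norm(weights - median, dim=1)
```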
The above-described embodiments use geometric median vectors to represent the common properties of the convolutions channels in the second subset, and determine the respective redundancy parameters of the convolutions channels in the second subset based on the geometric median vectors.
It will be appreciated that the determination of the respective redundancy parameters for the convolved channels in the second subset is not limited to being based on geometric median vectors. For example, the average value vector of the weight vector of each convolution channel in the second subset may be calculated, and the distance from the weight vector of each convolution channel in the second subset to the average value vector may be calculated, and the corresponding distance may be determined as the redundancy parameter of each convolution channel in the second subset, where the redundancy degree of the convolution channel is inversely proportional to the corresponding redundancy parameter. That is, in this embodiment, the mean vector is taken as the "center" of each convolution channel in the second subset, representing the common property of each convolution channel in the second subset. If the weight vector of a certain convolution channel is close to the mean value vector, the information of the convolution channel can be considered to be overlapped with other convolution channels, the redundancy degree is high, the convolution channel is redundant for the first neural network model, and the convolution channel can be cut out without greatly influencing the reasoning accuracy of the first neural network model.
With continued reference to fig. 1, after determining respective redundancy parameters for the convolution channels in the second subset at step S130, step S140 is performed to determine at least one redundancy channel from the second subset based on the respective redundancy parameters.
According to some embodiments, as shown in fig. 3, at least one redundant channel may be determined from the second subset by the following steps S142-S146, i.e., step S140 may further include the following steps S142-S146: step S142, determining a target compression ratio; step S144, multiplying the target compression ratio by the number of convolution channels included in the second subset to determine a second number of at least one redundancy channel; step S146, based on the corresponding redundancy parameters, the second number of convolution channels with the largest redundancy degree are used as the at least one redundancy channel.
According to some embodiments, the target compression ratio in step S142 may be determined by the following steps S1422, S1424: step S1422, obtaining a plurality of preset compression ratios, and for each compression ratio, executing the following steps: taking the product of the compression ratio and the number of convolution channels included in the second subset as a compression number corresponding to the compression ratio; based on the corresponding redundancy parameters, taking the convolution channels with the compression quantity with the largest redundancy degree in the second subset as target channels; setting the weight vector of each target channel to 0 to obtain a third neural network model; calculating the reasoning accuracy of the third neural network model; and step S1424, determining a target compression ratio from the compression ratios according to the corresponding reasoning accuracy of the compression ratios.
In this way of determining the target compression ratio, no model retraining or model fine-tuning is needed; the target compression ratio is determined directly from the inference accuracy obtained after cutting out different compression numbers of redundant channels. Compared with a scheme in which the network model obtained after each cut of a certain compression number of redundant channels is fine-tuned and the target compression ratio is then determined from the fine-tuning results, this implementation can determine the target compression ratio more quickly, thereby improving the processing efficiency of the model compression process.
In step S1422, the preset compression ratios may be values between 0 and 1. For example, the preset compression ratios may be values starting from 0 and gradually increasing to 1 at intervals of 0.02, i.e., 0,0.02,0.04,0.06, …,1. For another example, the preset compression ratios may be values from 0.05 to 0.5 at intervals of 0.05, i.e., 0.05,0.1,0.15,0.2, …,0.5.
For each compression ratio, the compression ratio is first multiplied by the number of convolution channels included in the second subset to obtain a compression number corresponding to the compression ratio. And then, based on the redundancy parameters of all the convolution channels in the second subset, taking the convolution channel with the greatest redundancy degree in the second subset and the compression quantity as a target channel, clearing the weight vector of the target channel to obtain a third neural network model, and calculating the reasoning accuracy of the third neural network model.
For example, the compression ratio is 0.1, and the number of convolution channels included in the second subset is 1000, and the compression ratio corresponds to a compression number of 0.1×1000=100. Correspondingly, taking 100 convolution channels with the largest redundancy degree (for example, 100 convolution channels with the smallest distance from the weight vector to the geometric median vector) in the second subset as target channels, and clearing the weight vector of each target channel to obtain a third neural network model. Then, the inference accuracy of the third neural network model is calculated. For example, a certain number of test samples (a, B) may be input into the third neural network model, where a is data and B is a label, so as to obtain an output value of the third neural network model, and the output of the third neural network model is compared with the actual label B of each test sample, and a ratio of the number of test samples with the correct output value (i.e., the output value is the same as the actual label B) to the number of all test samples is used as the reasoning accuracy of the model.
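The candidate-evaluation procedure of step S1422 could be sketched as follows, assuming PyTorch; `channel_index` (a hypothetical mapping from a global channel index to the owning convolution layer and its channel position), `test_loader` and `measure_accuracy` are illustrative placeholders. The weights are zeroed temporarily and restored after each measurement, so no retraining or fine-tuning is required:

```python
import torch

def evaluate_compression_ratios(model, redundancy, channel_index, ratios, test_loader):
    """For each candidate compression ratio, temporarily zero the weight vectors of the most
    redundant channels (smallest distance to the geometric median), measure the inference
    accuracy of the resulting third neural network model, then restore the weights."""
    accuracies = []
    order = torch.argsort(redundancy)                  # most redundant channels first
    for ratio in ratios:
        num_target = int(ratio * len(order))           # compression number for this ratio
        backup = []
        for idx in order[:num_target]:
            layer, channel = channel_index[int(idx)]   # global channel id -> (conv layer, channel index)
            backup.append((layer, channel, layer.weight.data[channel].clone()))
            layer.weight.data[channel].zero_()         # clear the target channel's weight vector
        accuracies.append(measure_accuracy(model, test_loader))
        for layer, channel, saved in backup:           # restore the original weights
            layer.weight.data[channel] = saved
    return accuracies

def measure_accuracy(model, test_loader):
    """Fraction of test samples (A, B) whose predicted label matches the true label B."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for data, labels in test_loader:
            predictions = model(data).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.numel()
    return correct / total
```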
The target compression ratio is a preferred value selected from the above-mentioned plurality of compression ratios, which can be understood as the maximum ratio of the convolution channels that can be cut out while maintaining the inference accuracy of the neural network model substantially unchanged.
According to some embodiments, in step S1424, the plurality of compression ratios may be arranged in order from small to large; the inference accuracy degradation rate corresponding to each compression ratio is then calculated, where the degradation rate corresponding to a compression ratio is the ratio between (i) the difference between the inference accuracy corresponding to that compression ratio and the inference accuracy corresponding to the next compression ratio and (ii) the inference accuracy corresponding to that compression ratio; the compression ratio with the highest inference accuracy degradation rate is taken as the target compression ratio. This embodiment corresponds to plotting a compression ratio-inference accuracy curve, in which the horizontal axis (x-axis) represents the compression ratio and the vertical axis (y-axis) represents the inference accuracy, and taking the compression ratio just before the position where the accuracy drops sharply as the target compression ratio.
For example, n compression ratios are arranged in order from small to large to obtain a compression ratio sequence [x_1, x_2, x_3, …, x_n], where the inference accuracy corresponding to compression ratio x_i (i = 1, 2, …, n) is y_i. A compression ratio-inference accuracy curve can be drawn from the compression ratios x_i and their corresponding inference accuracies y_i, where the x-axis of the curve represents the compression ratio and the y-axis represents the inference accuracy, each compression ratio and its inference accuracy corresponding to a point (x_i, y_i). The inference accuracy degradation rate of compression ratio x_i is d_i = (y_i − y_{i+1}) / y_i, i.e., the difference between the inference accuracy y_i corresponding to that compression ratio and the inference accuracy y_{i+1} corresponding to the next compression ratio x_{i+1}, divided by the inference accuracy y_i corresponding to that compression ratio.
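A minimal sketch of this degradation-rate selection (step S1424), with `ratios` sorted from small to large and `accuracies` the corresponding inference accuracies; the function name is illustrative:

```python
def target_ratio_by_degradation(ratios, accuracies):
    """Pick the compression ratio with the largest inference-accuracy degradation rate
    d_i = (y_i - y_{i+1}) / y_i, the ratios being sorted from small to large."""
    best_i, best_d = 0, float("-inf")
    for i in range(len(ratios) - 1):
        d = (accuracies[i] - accuracies[i + 1]) / accuracies[i]
        if d > best_d:
            best_i, best_d = i, d
    return ratios[best_i]
```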
According to other embodiments, in step S1424, the plurality of compression ratios may be arranged in order from small to large, and the compression ratio immediately preceding the first compression ratio whose inference accuracy is smaller than a third threshold is taken as the target compression ratio. As the compression ratio increases, the inference accuracy of the neural network model generally decreases gradually; when the accuracy has dropped to a certain degree (the third threshold), the preceding compression ratio is taken as the target compression ratio, so as to keep the inference accuracy of the compressed model within an acceptable range. The third threshold may be set by those skilled in the art according to actual needs. In one embodiment, the third threshold may be set to 90%, for example.
For example, n compression ratios are arranged in order from small to large to obtain a compression ratio sequence [x_1, x_2, x_3, …, x_n], and the corresponding inference accuracy sequence of these compression ratios is [97%, 95%, 94%, 91%, 88%, …]. With the third threshold set to 90%, the first inference accuracy in the sequence that is smaller than the third threshold is 88%, whose corresponding compression ratio is x_5; accordingly, the preceding compression ratio, x_4, is taken as the target compression ratio.
After the target compression ratio is determined in step S142, steps S144, S146 are performed. In step S144, multiplying the target compression ratio by the number of convolution channels included in the second subset to determine a second number of at least one redundancy channel; in step S146, the second number of convolution channels with the greatest redundancy is taken as the at least one redundancy channel based on the corresponding redundancy parameter.
For example, if the target compression ratio determined in step S142 is 0.12 and the number of convolution channels included in the second subset is 1000, then in step S144, it is determined that the second number of redundancy channels is 0.12×1000=120. Subsequently, in step S146, the convolution channels in the second subset may be sorted in order of the redundancy level from large to small (for example, the distance from the weight vector to the geometric median vector is from small to large), and the 120 convolution channels with the largest redundancy level may be regarded as the redundancy channels.
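A small sketch of steps S144 and S146, assuming PyTorch and assuming `redundancy` holds the distance of each channel in the second subset to the geometric median vector:

```python
import torch

def select_redundant_channels(redundancy: torch.Tensor, target_ratio: float) -> torch.Tensor:
    """Steps S144/S146: take the second_number channels with the greatest redundancy
    (i.e. the smallest distance to the geometric median) as the redundant channels."""
    second_number = int(target_ratio * redundancy.numel())
    return torch.argsort(redundancy)[:second_number]

# Example from the text: target ratio 0.12 over 1000 channels -> 120 redundant channels.
redundant = select_redundant_channels(torch.rand(1000), target_ratio=0.12)
```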
The above-described embodiments determine at least one redundant channel from the second subset based on the target compression ratio.
According to other embodiments, at least one convolution channel with the greatest redundancy in the second subset may also be used as a redundancy channel based on the second threshold and the corresponding redundancy parameter. The value of the second threshold value may be set by a person skilled in the art according to the actual requirements.
Illustratively, the convolution channels in the second subset may be ordered in order of greater redundancy (e.g., from lesser to greater distances from the weight vector to the geometric median vector), and a second threshold for the redundancy may be set, with the convolution channels in the second subset having a redundancy greater than the second threshold being the redundancy channels.
After at least one redundant channel is determined by step S140, the redundant channels may be marked.
Step S150 may be performed after the at least one secondary channel determined in step S120 and the at least one redundant channel determined in step S140.
In step S150, a compressed second neural network model is constructed based on the remaining convolution channels, wherein the remaining convolution channels are convolution channels of the plurality of convolution channels other than the at least one secondary channel and the at least one redundant channel.
Since the secondary channels and the redundant channels in the first neural network model have been marked in steps S120 and S140, step S150 may obtain the weight vectors of the unmarked convolution channels in the first neural network model and construct the compressed second neural network model accordingly.
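The per-layer rebuilding could be sketched as follows, assuming PyTorch; for brevity the sketch only rebuilds the output channels of one convolution layer and its batch normalization layer from a boolean keep mask, whereas a complete implementation would also reduce the input channels of the following convolution layer accordingly:

```python
import torch
import torch.nn as nn

def build_pruned_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d, keep: torch.Tensor):
    """Copy the weight vectors of the unmarked (kept) convolution channels of one layer
    into a smaller convolution layer and its batch normalization layer."""
    num_kept = int(keep.sum())
    new_conv = nn.Conv2d(conv.in_channels, num_kept, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()
    new_bn = nn.BatchNorm2d(num_kept)
    new_bn.weight.data = bn.weight.data[keep].clone()
    new_bn.bias.data = bn.bias.data[keep].clone()
    new_bn.running_mean.data = bn.running_mean.data[keep].clone()
    new_bn.running_var.data = bn.running_var.data[keep].clone()
    return new_conv, new_bn
```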
Fig. 4 shows a schematic diagram of a compression process of a first neural network model in an embodiment according to the disclosure.
The left side of fig. 4 shows the first neural network model before compression. The i-th convolution layer of the model comprises n convolution channels, namely convolution channels C_i1, C_i2, C_i3, C_i4, …, C_in. The importance parameters of the batch normalization channels corresponding to these n convolution channels are 1.170, 0.001, 0.290, 0.603, … and 0.820, respectively. The j-th convolution layer (j = i + 1) of the model includes 2 convolution channels, namely convolution channels C_j1 and C_j2. For reasons of limited space, only the i-th and j-th convolution layers of the first neural network model are shown in fig. 4; other convolution layers, batch normalization layers, activation layers, pooling layers, etc. in the first neural network model are not shown.
Because the importance parameter 0.001 of the batch normalization channel corresponding to convolution channel C_i2 is smaller than the first threshold 0.1, convolution channel C_i2 is determined to be a secondary channel. Although the importance parameter of convolution channel C_i4 is 0.603, which is greater than the first threshold 0.1, so that it is not a secondary channel, its corresponding redundancy parameter (not shown in fig. 4) indicates that its redundancy is high, and convolution channel C_i4 is therefore determined to be a redundant channel. After determining the secondary channel C_i2 and the redundant channel C_i4, the secondary channel C_i2 and the redundant channel C_i4 are cut out from the first neural network model, and a compressed second neural network model is generated based on the remaining convolution channels C_i1, C_i3, …, C_in, C_j1, C_j2, as shown on the right side of fig. 4.
Step S150 cuts out the secondary channels and the redundant channels in the first neural network model, and generates a compressed second neural network model. Although the secondary channels and redundant channels contribute little to model reasoning, the clipping of the secondary channels and redundant channels may still result in a slight decrease in the reasoning accuracy of the second neural network model compared to the first neural network model.
Thus, according to some embodiments, after the compressed second neural network model is constructed in step S150, the second neural network model is trimmed to further ensure the inference accuracy of the second neural network model, minimizing the impact of clipping the secondary channels and redundant channels.
The second neural network model may be fine-tuned, for example, by the following steps Step1-Step5:
Step1, obtain a batch of training samples (A, B), where A is the data and B is the label, and the number of training samples is BatchSize; the data and labels may be determined according to the application scenario of the first neural network model, for example, the data may be text, images, video, audio, and the like, and correspondingly, the label may be the category to which the text, image, video or audio data belongs;
Step2, input the batch of training samples into the second neural network model (the initial weights of the convolution channels of the second neural network model inherit the training result of the first neural network model) to obtain the predicted output of the model;
Step3, calculate the loss function value L according to the predicted output of the network and the label B of the training samples (note that the loss function here is a loss function between the predicted output and the label B, such as an absolute value loss function, a square loss function, a cross entropy loss function, etc., and need no longer include an L1 regularization term);
Step4, adjust the weight of each convolution channel by back-propagating according to L; since the weights are only being fine-tuned, a smaller learning rate can be used;
Step5, repeat Step1-Step4 until the loss function value L does not exceed a preset range, thereby obtaining a fourth neural network model.
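A minimal sketch of this fine-tuning loop, assuming PyTorch; `model`, `train_loader`, `num_epochs` and `lr` are illustrative placeholders, and cross-entropy is used only as an example of the loss between the predicted output and the label B:

```python
import torch
import torch.nn as nn

def fine_tune(model, train_loader, num_epochs=5, lr=1e-4):
    """Fine-tune the compressed second neural network model: the weights inherited from the
    first model are adjusted with a small learning rate and without the L1 regularization term."""
    criterion = nn.CrossEntropyLoss()                # loss between prediction and label B only
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(num_epochs):
        for data, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(data), labels)    # Step3: no L1 term on the BN scales
            loss.backward()                          # Step4: back-propagate with a small learning rate
            optimizer.step()
    return model
```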
According to the embodiment of the disclosure, the secondary channels in the first neural network model are determined through the importance parameters, the redundant channels in the first neural network model are determined through the redundant parameters, the secondary channels and the redundant channels are cut together, and the compressed second neural network model is generated, so that the first neural network model is comprehensively compressed, the number of convolution channels of the compressed second neural network model is greatly reduced compared with that of the first neural network model, and therefore the calculated amount, occupied storage space and reasoning time consumption during model operation are greatly reduced. Meanwhile, as the secondary channels and the redundant channels are cut off, the full compression can be realized while the reasoning accuracy of the model is kept basically unchanged.
According to another aspect of the present disclosure, there is also provided an apparatus for compressing a neural network model. Fig. 5 shows a schematic diagram of an apparatus 500 for compressing a neural network model, according to an embodiment of the disclosure. As shown in fig. 5, the apparatus 500 includes an acquisition module 510, a first tagging module 520, a determination module 530, a second tagging module 540, and a construction module 550.
The acquisition module 510 may be configured to acquire a first neural network model including a plurality of convolution channels and a batch normalization channel corresponding to each convolution channel, the batch normalization channel having an importance parameter that is indicative of an importance of the respective convolution channel.
The first tagging module 520 may be configured to determine at least one secondary channel from the first subset of the plurality of convolution channels based on the respective importance parameter.
The determination module 530 may be configured to determine respective redundancy parameters for the convolution channels in the second subset of the plurality of convolution channels, the redundancy parameters being indicative of a degree of redundancy of the respective convolution channels.
The second tagging module 540 may be configured to determine at least one redundant channel from the second subset based on the respective redundancy parameter.
The construction module 550 may be configured to construct the compressed second neural network model based on the remaining convolution channels, wherein the remaining convolution channels are convolution channels of the plurality of convolution channels other than the at least one secondary channel and the at least one redundant channel.
According to the embodiment of the disclosure, the secondary channels in the first neural network model are determined through the importance parameters, the redundant channels in the first neural network model are determined through the redundant parameters, the secondary channels and the redundant channels are cut together, and the compressed second neural network model is generated, so that the first neural network model is comprehensively compressed, the number of convolution channels of the compressed second neural network model is greatly reduced compared with that of the first neural network model, and therefore the calculated amount, occupied storage space and reasoning time consumption during model operation are greatly reduced. Meanwhile, as the secondary channels and the redundant channels are cut off, the full compression can be realized while the reasoning accuracy of the model is kept basically unchanged.
It should be appreciated that the various modules of the apparatus 500 shown in fig. 5 may correspond to the various steps in the method 100 described with reference to fig. 1. Thus, the operations, features and advantages described above with respect to method 100 apply equally to apparatus 500 and the modules that it comprises. For brevity, certain operations, features and advantages are not described in detail herein.
It should also be understood that the connections between the various modules in fig. 5 are used to represent data transfer between the modules. The first marking module 520, the determining module 530, and the second marking module 540 correspond to steps S120, S130, and S140, respectively, in the method 100 described above. Referring to the above, since steps S120 to S140 may be performed in different orders, there may be different data transfer situations between the first marking module 520, the determining module 530, and the second marking module 540, respectively.
Specifically, in the case of performing step S120 and then performing steps S130 and S140, there is data transfer between the first marking module 520 and the determining module 530 (the first marking module 520 transfers the determined at least one secondary channel to the determining module 530), and accordingly, a connection needs to be added between the first marking module 520 and the determining module 530 in fig. 5; in the case where step S120 is performed in parallel with steps S130, S140, there may be no data transfer between the first marking module 520 and the determining module 530, as shown in fig. 5; in the case of performing steps S130 and S140 before performing step S120, there is data transfer between the second marking module 540 and the first marking module 520 (the second marking module 540 transfers the determined at least one redundant channel to the first marking module 520), and accordingly, a connection needs to be added between the second marking module 540 and the first marking module 520 in fig. 5.
Although specific functions are discussed above with reference to specific modules, it should be noted that the functions of the various modules discussed herein may be divided into multiple modules and/or at least some of the functions of the multiple modules may be combined into a single module. The particular module performing the actions discussed herein includes the particular module itself performing the actions, or alternatively the particular module invoking or otherwise accessing another component or module that performs the actions (or performs the actions in conjunction with the particular module). Thus, a particular module that performs an action may include that particular module itself that performs the action and/or another module that the particular module invokes or otherwise accesses that performs the action. For example, the determination module 530 and the second tagging module 540 described above may be combined into a single module in some embodiments. For another example, the build module 550 may include a first tagging module 520, a determination module 530, and a second tagging module 540 in some embodiments. As used herein, the phrase "entity a initiates action B" may refer to entity a issuing an instruction to perform action B, but entity a itself does not necessarily perform that action B. For example, the phrase "the building module 550 may be configured to build the compressed second neural network model based on the remaining convolution channels" may refer to the building module 550 directing a processor (not shown in fig. 5) to build the compressed second neural network model based on the remaining convolution channels without the building module 550 itself having to perform the action of "building the compressed second neural network model".
It should also be appreciated that various techniques may be described herein in the general context of software and hardware elements or program modules. The various modules described above with respect to fig. 5 may be implemented in hardware or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the acquisition module 510, the first tagging module 520, the determination module 530, the second tagging module 540, and the build module 550 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip including one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.
According to another aspect of the present disclosure, there is also provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program which, when executed by the at least one processor, implements a method according to the above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a method according to the above.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a method according to the above.
Referring to fig. 6, a block diagram of an electronic device 600 that may serve as a server or a client of the present disclosure will now be described; the electronic device 600 is an example of a hardware device that may be applied to aspects of the present disclosure. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the device 600; the input unit 606 may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 608 may include, but is not limited to, magnetic disks and optical disks. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processing steps described above, for example, steps S110 to S150 in fig. 1. For example, in some embodiments, the method for compressing a neural network model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method for compressing a neural network model described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the method for compressing the neural network model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after this disclosure.

Claims (16)

1. A method of compressing a neural network model for label prediction, comprising:
obtaining a trained first neural network model, wherein the first neural network model comprises a plurality of convolution channels and a batch normalization channel corresponding to each convolution channel, the batch normalization channel has an importance parameter for representing the importance degree of the corresponding convolution channel, the importance parameter is determined through training, the importance parameter is a parameter of the batch normalization channel, and the larger the value of the parameter of the batch normalization channel, the higher the importance degree of the corresponding convolution channel, the first neural network model being obtained by repeatedly performing the following training operations until the loss value of the first neural network model does not exceed a preset value:
acquiring a training sample, wherein the training sample comprises data and a label corresponding to the data, the data comprises at least one of text, images, video, and audio, and the label is a category to which the data belongs;
inputting the data of the training sample into the first neural network model to obtain a predicted output label of the first neural network model;
calculating the loss value based on the predicted output label, the label of the training sample, and the importance parameter of the batch normalization channel; and
adjusting weights of the plurality of convolution channels based on the loss values;
determining at least one secondary channel from a first subset of the plurality of convolutional channels based on a respective importance parameter;
determining respective redundancy parameters for the convolution channels in the second subset of the plurality of convolution channels, the redundancy parameters being used to represent a degree of redundancy of the respective convolution channel, wherein the degree of redundancy is determined based on the proximity of the weight vector of a convolution channel to a mean vector of the convolution channels, the degree of redundancy of the convolution channel being inversely proportional to the respective redundancy parameter;
determining at least one redundant channel from the second subset based on the respective redundancy parameters, comprising:
determining a target compression ratio;
multiplying the target compression ratio by the number of convolution channels included in the second subset to determine a second number of the at least one redundant channel; and
taking the second number of convolution channels with the largest redundancy degree as the at least one redundant channel based on the corresponding redundancy parameters; and
constructing a compressed second neural network model based on remaining convolution channels, wherein the remaining convolution channels are the convolution channels of the plurality of convolution channels other than the at least one secondary channel and the at least one redundant channel,
wherein the first subset and the second subset each comprise the plurality of convolution channels.
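By way of illustration only (this sketch is not part of the claim language), the channel-selection logic of claim 1 could be expressed roughly as follows in Python/NumPy. All names (`bn_gammas`, `channel_weights`, `importance_threshold`, `target_compression_ratio`) are hypothetical; the mean of the weight vectors is used as the reference vector as recited above, and the geometric-median refinement of claims 7-8 is shown in the sketch following claim 8.

```python
import numpy as np

def select_channels_to_keep(bn_gammas, channel_weights,
                            importance_threshold, target_compression_ratio):
    """Sketch of the channel selection in claim 1.

    bn_gammas:        (C,) batch-normalization scale parameters, used as the
                      importance parameters of the corresponding convolution channels.
    channel_weights:  (C, D) each row is the flattened weight vector of one channel.
    Returns the indices of the remaining convolution channels.
    """
    num_channels = len(bn_gammas)

    # Secondary channels: importance parameter below a threshold
    # (the first subset is taken to be all convolution channels, as in claim 1).
    secondary = set(np.flatnonzero(bn_gammas < importance_threshold).tolist())

    # Redundancy parameter: distance of each weight vector to a reference vector
    # of the channels (here the mean vector; claims 7-8 refine this to the
    # geometric median).
    reference = channel_weights.mean(axis=0)
    distances = np.linalg.norm(channel_weights - reference, axis=1)

    # Channels closest to the reference are the most redundant (the redundancy
    # degree is inversely proportional to the redundancy parameter).
    num_redundant = int(target_compression_ratio * num_channels)
    redundant = set(np.argsort(distances)[:num_redundant].tolist())

    # Remaining channels: neither secondary nor redundant; the compressed
    # second neural network model would be built from these.
    return [c for c in range(num_channels) if c not in secondary and c not in redundant]
```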
2. The method of claim 1, wherein during training of the first neural network model, a loss function is calculated based at least on importance parameters of each batch of normalized channels.
3. The method of claim 1, wherein the first subset comprises the plurality of convolution channels;
the second subset includes remaining convolution channels of the plurality of convolution channels other than the at least one secondary channel.
4. The method of claim 3, after said determining at least one secondary channel from the first subset of the plurality of convolution channels, the method further comprising:
the weight vector of the at least one secondary channel is set to 0.
5. The method of claim 1, the second subset comprising the plurality of convolution channels;
the first subset includes remaining convolution channels of the plurality of convolution channels other than the at least one redundancy channel.
6. The method of any of claims 1-5, wherein the determining at least one secondary channel from the first subset of the plurality of convolution channels based on the respective importance parameter comprises:
taking, as a secondary channel, a convolution channel in the first subset whose corresponding importance parameter is smaller than a first threshold; and/or
taking, as secondary channels, a first number of convolution channels in the first subset having the smallest corresponding importance parameters.
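As an illustrative sketch only (the names `importance`, `first_threshold`, and `first_number` are hypothetical), the two alternative criteria of claim 6 could look like this in Python/NumPy:

```python
import numpy as np

def secondary_by_threshold(importance, first_threshold):
    # Secondary channels: convolution channels whose importance parameter
    # (e.g., the batch-normalization scale parameter) is below the first threshold.
    return np.flatnonzero(importance < first_threshold)

def secondary_by_count(importance, first_number):
    # Secondary channels: the first_number channels with the smallest importance parameters.
    return np.argsort(importance)[:first_number]
```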
7. The method of any of claims 1-5, wherein the determining respective redundancy parameters for the convolution channels in the second subset of the plurality of convolution channels comprises:
calculating a geometric median vector of the weight vectors of all convolution channels in the second subset;
respectively calculating the distance from the weight vector of each convolution channel in the second subset to the geometric median vector; and
determining the corresponding distance as the redundancy parameter of each convolution channel in the second subset.
8. The method of claim 7, wherein the geometric median vector is the vector that minimizes the sum of distances to the weight vectors of the convolution channels in the second subset.
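For illustration, the geometric median of claims 7-8 can be approximated with a Weiszfeld-style iteration; the following sketch (function and variable names are hypothetical and not part of the disclosure) computes the redundancy parameter of each channel as the distance from its weight vector to that median:

```python
import numpy as np

def geometric_median(vectors, num_iters=100, eps=1e-8):
    """Weiszfeld-style approximation of the geometric median of a set of vectors,
    i.e. the point minimizing the sum of Euclidean distances to all vectors."""
    median = vectors.mean(axis=0)
    for _ in range(num_iters):
        dists = np.linalg.norm(vectors - median, axis=1)
        dists = np.maximum(dists, eps)            # avoid division by zero
        weights = 1.0 / dists
        new_median = (weights[:, None] * vectors).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_median - median) < eps:
            break
        median = new_median
    return median

def redundancy_parameters(channel_weights):
    # Redundancy parameter of each channel: distance of its weight vector
    # to the geometric median of all weight vectors in the second subset.
    median = geometric_median(channel_weights)
    return np.linalg.norm(channel_weights - median, axis=1)
```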
9. The method of claim 1, wherein the determining a target compression ratio comprises:
acquiring a plurality of preset compression ratios, and for each compression ratio, executing the following steps:
taking the product of the compression ratio and the number of convolution channels included in the second subset as a compression number corresponding to the compression ratio;
based on the corresponding redundancy parameters, taking, as target channels, the compression number of convolution channels with the largest redundancy degree in the second subset;
setting the weight vector of each target channel to 0 to obtain a third neural network model; and
calculating the inference accuracy of the third neural network model; and
determining the target compression ratio from the plurality of compression ratios according to the inference accuracies corresponding to the plurality of compression ratios.
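As an illustrative sketch of the per-ratio evaluation loop in claim 9 (the `evaluate_accuracy` function and the weight-array representation are assumed helpers, not part of the disclosure):

```python
import numpy as np

def evaluate_compression_ratios(ratios, channel_weights, redundancy_params, evaluate_accuracy):
    """For each preset compression ratio, zero out the most redundant channels
    and record the inference accuracy of the resulting third neural network model."""
    accuracies = []
    num_channels = len(redundancy_params)
    # Channels with the smallest redundancy parameter (distance) have the
    # largest redundancy degree and are pruned first.
    order = np.argsort(redundancy_params)
    for ratio in ratios:
        compression_number = int(ratio * num_channels)
        pruned_weights = channel_weights.copy()
        pruned_weights[order[:compression_number]] = 0.0   # zero the target channels
        accuracies.append(evaluate_accuracy(pruned_weights))
    return accuracies
```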
10. The method of claim 9, wherein the determining the target compression ratio from the plurality of compression ratios according to the respective inference accuracies of the plurality of compression ratios comprises:
arranging the plurality of compression ratios in ascending order;
calculating the inference accuracy degradation rate corresponding to each compression ratio, wherein the inference accuracy degradation rate corresponding to a compression ratio is the ratio of the difference between the inference accuracy corresponding to that compression ratio and the inference accuracy corresponding to the next compression ratio to the inference accuracy corresponding to that compression ratio; and
taking the compression ratio with the maximum inference accuracy degradation rate as the target compression ratio.
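Purely as an illustrative sketch of claims 9-10 (the preset ratios and per-ratio accuracies are assumed to be supplied by the caller, with at least two ratios; all names are hypothetical), the target compression ratio could be chosen from the per-ratio inference accuracies as follows:

```python
import numpy as np

def target_ratio_by_degradation(ratios, accuracies):
    """Pick the target compression ratio by the largest inference-accuracy
    degradation rate between consecutive compression ratios (claim 10)."""
    order = np.argsort(ratios)                 # arrange ratios in ascending order
    ratios = np.asarray(ratios, dtype=float)[order]
    accuracies = np.asarray(accuracies, dtype=float)[order]

    best_ratio, best_degradation = ratios[0], -np.inf
    for i in range(len(ratios) - 1):
        # degradation rate of ratio i: (acc_i - acc_{i+1}) / acc_i
        degradation = (accuracies[i] - accuracies[i + 1]) / accuracies[i]
        if degradation > best_degradation:
            best_degradation = degradation
            best_ratio = ratios[i]
    return best_ratio
```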
11. The method of claim 9, wherein the determining the target compression ratio from the plurality of compression ratios according to the respective inference accuracies of the plurality of compression ratios comprises:
arranging the plurality of compression ratios in ascending order; and
taking, as the target compression ratio, the compression ratio immediately preceding the compression ratio whose inference accuracy is smaller than a third threshold.
12. The method of any of claims 1-5, wherein the determining at least one redundant channel from the second subset based on the respective redundancy parameter comprises:
taking, as a redundant channel, at least one convolution channel with the largest redundancy degree in the second subset, based on a second threshold and the corresponding redundancy parameters.
13. The method of any of claims 1-5, further comprising:
fine-tuning the second neural network model.
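A minimal sketch of the fine-tuning referred to in claim 13, assuming a PyTorch classification model and data loader; the model, loader, loss function, and hyperparameters are illustrative assumptions rather than requirements of the claim:

```python
import torch

def fine_tune(second_model, data_loader, num_epochs=1, lr=1e-4):
    # Briefly retrain the compressed second neural network model to recover
    # accuracy lost by removing the secondary and redundant channels.
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(second_model.parameters(), lr=lr, momentum=0.9)
    second_model.train()
    for _ in range(num_epochs):
        for data, labels in data_loader:
            optimizer.zero_grad()
            loss = criterion(second_model(data), labels)
            loss.backward()
            optimizer.step()
    return second_model
```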
14. A compression apparatus for a neural network model for label prediction, comprising:
an acquisition module configured to acquire a trained first neural network model, the first neural network model including a plurality of convolution channels and a batch normalization channel corresponding to each convolution channel, the batch normalization channel having an importance parameter for representing an importance level of the corresponding convolution channel, wherein the importance parameter is determined through training, the importance parameter is a parameter of the batch normalization channel, and the larger the value of the parameter of the batch normalization channel, the higher the importance level of the corresponding convolution channel, the first neural network model being obtained by repeatedly performing the following training operations until a loss value of the first neural network model does not exceed a preset value:
acquiring a training sample, wherein the training sample comprises data and a label corresponding to the data, the data comprises at least one of text, images, video and audio, and the label is a category to which the data belongs;
inputting the data of the training sample into the first neural network model to obtain a predicted output label of the first neural network model;
calculating the loss value based on the predicted output label, the label of the training sample, and the importance parameter of the batch normalization channel; and
adjusting weights of the plurality of convolution channels based on the loss values;
a first tagging module configured to determine at least one secondary channel from a first subset of the plurality of convolutional channels based on a respective importance parameter;
a determination module configured to determine respective redundancy parameters for the convolution channels in the second subset of the plurality of convolution channels, the redundancy parameters being indicative of a degree of redundancy of the respective convolution channels, wherein the degree of redundancy is determined based on the proximity of the weight vector of a convolution channel to a mean vector of the convolution channels, the degree of redundancy of the convolution channel being inversely proportional to the respective redundancy parameter;
a second tagging module configured to determine at least one redundant channel from the second subset based on a corresponding redundancy parameter, wherein the second tagging module comprises:
a determination submodule configured to determine a target compression ratio;
a computation submodule configured to multiply the target compression ratio by the number of convolution channels included in the second subset to determine a second number of the at least one redundant channel; and
a marking submodule configured to take the second number of convolution channels with the greatest redundancy degree as the at least one redundant channel based on the respective redundancy parameters; and
a construction module configured to construct a compressed second neural network model based on remaining convolution channels, wherein the remaining convolution channels are the convolution channels of the plurality of convolution channels other than the at least one secondary channel and the at least one redundant channel,
wherein the first subset and the second subset each comprise the plurality of convolution channels.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program which, when executed by the at least one processor, implements the method according to any one of claims 1-13.
16. A non-transitory computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method according to any one of claims 1-13.
CN202110456225.8A 2021-04-26 2021-04-26 Method, apparatus, device and medium for compressing neural network model Active CN113065644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110456225.8A CN113065644B (en) 2021-04-26 2021-04-26 Method, apparatus, device and medium for compressing neural network model

Publications (2)

Publication Number Publication Date
CN113065644A CN113065644A (en) 2021-07-02
CN113065644B true CN113065644B (en) 2023-09-29

Family

ID=76567525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110456225.8A Active CN113065644B (en) 2021-04-26 2021-04-26 Method, apparatus, device and medium for compressing neural network model

Country Status (1)

Country Link
CN (1) CN113065644B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598340A (en) * 2018-11-15 2019-04-09 北京知道创宇信息技术有限公司 Method of cutting out, device and the storage medium of convolutional neural networks
CN111612144A (en) * 2020-05-22 2020-09-01 深圳金三立视频科技股份有限公司 Pruning method and terminal applied to target detection
CN112052951A (en) * 2020-08-31 2020-12-08 北京中科慧眼科技有限公司 Pruning neural network method, system, equipment and readable storage medium
CN112329922A (en) * 2020-11-24 2021-02-05 北京大学 Neural network model compression method and system based on mass spectrum data set
CN112613610A (en) * 2020-12-25 2021-04-06 国网江苏省电力有限公司信息通信分公司 Deep neural network compression method based on joint dynamic pruning

Also Published As

Publication number Publication date
CN113065644A (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN112560996B (en) User portrait identification model training method, device, readable storage medium and product
JP2022135991A (en) Method for training cross-modal retrieval model, device, apparatus and storage medium
CN113065614B (en) Training method of classification model and method for classifying target object
CN112907552A (en) Robustness detection method, device and program product for image processing model
CN113705628B (en) Determination method and device of pre-training model, electronic equipment and storage medium
CN113255910A (en) Pruning method and device for convolutional neural network, electronic equipment and storage medium
US20220398834A1 (en) Method and apparatus for transfer learning
CN113902010A (en) Training method of classification model, image classification method, device, equipment and medium
CN112818686A (en) Domain phrase mining method and device and electronic equipment
CN114494814A (en) Attention-based model training method and device and electronic equipment
CN112508126B (en) Deep learning model training method and device, electronic equipment and readable storage medium
CN114168318A (en) Training method of storage release model, storage release method and equipment
CN113657249A (en) Training method, prediction method, device, electronic device, and storage medium
CN113642710A (en) Network model quantification method, device, equipment and storage medium
CN113065644B (en) Method, apparatus, device and medium for compressing neural network model
CN113642654B (en) Image feature fusion method and device, electronic equipment and storage medium
WO2020142251A1 (en) Prediction for time series data using a space partitioning data structure
CN113516185B (en) Model training method, device, electronic equipment and storage medium
CN112784967B (en) Information processing method and device and electronic equipment
CN113761379B (en) Commodity recommendation method and device, electronic equipment and medium
CN113361575B (en) Model training method and device and electronic equipment
CN114037057B (en) Pre-training model generation method and device, electronic equipment and storage medium
CN116188875B (en) Image classification method, device, electronic equipment, medium and product
US20240135698A1 (en) Image classification method, model training method, device, storage medium, and computer program
CN115034388B (en) Determination method and device for quantization parameters of ranking model and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant