CN116933858A - Neural network pruning method and related products - Google Patents

Neural network pruning method and related products

Info

Publication number
CN116933858A
CN116933858A (application CN202210364501.2A)
Authority
CN
China
Prior art keywords: model, weight, training, standard, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210364501.2A
Other languages
Chinese (zh)
Inventor
李文进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202210364501.2A priority Critical patent/CN116933858A/en
Publication of CN116933858A publication Critical patent/CN116933858A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a neural network pruning method and related products, wherein the method comprises the following steps: determining a first model, wherein the first model supports fine-grained training; analyzing the first model to cut off the weights that do not reach the standard, so as to obtain a second model; initializing the weights corresponding to the second model to obtain a third model; performing iterative training on the third model until the third model converges or the iteration number reaches the maximum iteration number; in each generation of iterative training, judging whether the compression ratio corresponding to a fourth model obtained by that generation of iterative training reaches the standard; and if the compression ratio reaches the standard, determining that the pruning operation is finished. The embodiment of the application improves training flexibility and lets a developer flexibly customize training for their own application scenario, thereby avoiding the loss of expressive capacity that cutting off some unimportant weights would otherwise cause to the neural network model.

Description

Neural network pruning method and related products
Technical Field
The application relates to the technical field of electronic equipment, in particular to a neural network pruning method and related products.
Background
With the advancement of deep learning technology, neural network models (e.g., the deep neural network (Deep Neural Network, DNN), the convolutional neural network (Convolutional Neural Network, CNN), etc.) are increasingly used in fields such as machine vision, autonomous driving, and natural language processing. However, their complex structure means that significant power and resources (e.g., computing power, memory, storage space, etc.) are consumed even at inference time, which severely limits the deployment of such technologies on power- and resource-constrained mobile and embedded platforms; the models therefore need to be compressed to some extent. Currently, methods for compressing neural network models mainly include quantization, pruning, knowledge distillation, and neural architecture search (NAS).
Pruning refers to systematically removing some unimportant weights from an original neural network model (NN) while losing as little accuracy as possible, so as to reduce the number of parameters of the NN. However, even though the weights that are cut off are mostly unimportant, removing them still has some influence on the expressive capacity of the neural network.
Disclosure of Invention
The embodiment of the application provides a neural network pruning method and related products, which improve training flexibility and let a developer flexibly customize training for their own application scenario, thereby avoiding the loss of expressive capacity that cutting off some unimportant weights would otherwise cause to the neural network model.
In a first aspect, an embodiment of the present application provides a neural network pruning method, where the method includes:
determining a first model, wherein the first model supports fine-grained training;
analyzing the first model to cut off the weight which does not reach the standard to obtain a second model;
initializing the weight corresponding to the second model to obtain a third model;
performing iterative training on the third model until the third model converges or the iteration number reaches the maximum iteration number;
in each generation of iterative training, judging whether the compression ratio corresponding to a fourth model obtained by each generation of iterative training meets the standard or not;
and if the compression ratio reaches the standard, determining that pruning operation is finished.
In a second aspect, an embodiment of the present application provides a neural network pruning device, applied to an electronic device, where the device comprises a determining unit, an analysis unit, an initialization unit, an iterative training unit and a judging unit, wherein,
The determining unit is used for determining a first model, wherein the first model supports fine granularity training;
the analysis unit is used for analyzing the first model to cut off the weight which does not reach the standard so as to obtain a second model;
the initialization unit is used for initializing the weight corresponding to the second model to obtain a third model;
the iterative training unit is used for performing iterative training on the third model until the third model converges or the iteration number reaches the maximum iteration number;
the judging unit is used for judging whether the compression ratio corresponding to the fourth model obtained by each generation of iterative training meets the standard or not in each generation of iterative training;
and the determining unit is further used for determining that pruning operation is finished if the compression ratio reaches the standard.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing steps in any of the methods of the first aspect of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform part or all of the steps as described in any of the methods of the first aspect of the embodiments of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product, wherein the computer program product comprises a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps described in any of the methods of the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
It can be seen that in an embodiment of the present application, an electronic device may determine a first model, where the first model supports fine-grained training; analyze the first model to cut off the weights that do not reach the standard to obtain a second model; initialize the weights corresponding to the second model to obtain a third model; perform iterative training on the third model until the third model converges or the iteration number reaches the maximum iteration number; in each generation of iterative training, judge whether the compression ratio corresponding to a fourth model obtained by that generation of iterative training reaches the standard; and if the compression ratio reaches the standard, determine that the pruning operation is finished. Therefore, by introducing initialization, retraining and similar methods into the post-pruning training process, training flexibility is improved and a developer can flexibly customize the training method for their own application scenario, so that the loss of expressive capacity caused by cutting off some unimportant weights is avoided, finer control is gained over the learning curve along which the expressive capacity is restored after pruning, and the pruning precision is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1A is a schematic flow chart of a network pruning method according to an embodiment of the present application;
fig. 1B is a schematic flow chart of a network pruning method according to an embodiment of the present application;
fig. 1C is a schematic flow chart of a network pruning method according to an embodiment of the present application;
fig. 2A is a schematic flow chart of a neural network pruning method according to an embodiment of the present application;
fig. 2B is a schematic flow chart of a neural network pruning method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 4 is a functional unit composition block diagram of a neural network pruning device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, the technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be obtained by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The electronic device may be a portable electronic device that also includes other functionality such as personal digital assistant and/or music player functionality, for example a cell phone, a tablet computer, a wearable electronic device with wireless communication functionality (e.g., a smart watch or smart glasses), an in-vehicle device, etc. Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running iOS, Android, Microsoft, or other operating systems. The portable electronic device may also be another portable electronic device such as a laptop computer. It should also be appreciated that in other embodiments, the electronic device described above may not be a portable electronic device, but rather a desktop computer.
At present, the pruning method mainly comprises the following steps:
1) Fig. 1A is a schematic flow chart of a network pruning method. An original neural network model (NN) can be determined, and the unimportant weights (or weight groups) in the original NN are identified and cut off to obtain a sparse NN; the sparse NN is then trained. After training is completed, it is determined whether the compression ratio of the sparse NN reaches the standard; if it reaches the standard, the whole network pruning flow ends; if not, the flow returns to the step of identifying and cutting off the unimportant weights (or weight groups) in the original NN. In a specific implementation, after pruning is completed, the pruned weights (or weight groups, such as channels, filters, weight strips, etc. in a CNN) may be zeroed out or removed (e.g., when a whole structure in the weight tensor is pruned away), and training then continues for a period of time, or until convergence, starting from the NN in this state. In the AI field, the above method may also be called fine-tuning, which is not limited herein. A minimal sketch of this masked fine-tuning idea is given below.
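The following is a minimal PyTorch sketch of this fine-tuning strategy under stated assumptions: the magnitude-based mask, the toy layer, and the random data are illustrative choices, not the patent's own code; the point is that pruned positions are zeroed and receive no updates while the retained weights continue training.

```python
import torch
import torch.nn as nn

layer = nn.Linear(128, 64)

# Illustrative importance rule: keep roughly the 70% of weights with the largest magnitude.
k = int(0.3 * layer.weight.numel())
threshold = layer.weight.detach().abs().flatten().kthvalue(k).values
mask = (layer.weight.detach().abs() > threshold).float()

optimizer = torch.optim.SGD(layer.parameters(), lr=1e-2)
x = torch.randn(16, 128)        # placeholder training data
target = torch.randn(16, 64)

for _ in range(10):             # "fine-tuning": keep training the sparse layer
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(layer(x), target)
    loss.backward()
    layer.weight.grad.mul_(mask)    # pruned positions receive no gradient update
    optimizer.step()
    layer.weight.data.mul_(mask)    # and stay exactly zero
```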
2) Fig. 1B is a schematic flow chart of a network pruning method. An original neural network model (NN) can be determined; it is judged whether the original NN has converged and whether enough earlier training epochs are available to rewind to. If the NN has not converged, or not enough rewindable epochs exist, the original NN is trained further. Once the NN has converged and enough rewindable epochs exist, the unimportant weights (or weight groups) in the original NN are identified and cut off, and the retained weights are rewound back by several epochs to obtain a sparse NN; the sparse NN is then trained. After training is completed, it is determined whether the compression ratio of the sparse NN reaches the standard; if it reaches the standard, the whole network pruning flow ends; if not, the flow returns to the step of identifying and cutting off the unimportant weights (or weight groups) in the original NN. In a specific implementation, considering that the NN has already suffered a loss after pruning, fine-tuning started from that state may not reach the optimal precision; on the other hand, the NN has generally reached a stable convergence state before pruning, so each retained weight bit can be returned to its state from several training generations (epochs) earlier (if that state information is not available, for example when pruning a pre-trained network for which only the final weights are kept, the network can first be retrained for several epochs in the un-pruned state, and after pruning the retained weights are returned to the corresponding epoch of that retraining, provided enough epochs have been kept to rewind to). In the AI field, the above method may also be called rewinding (rewind). In this case, the expressive capacity of the NN can be restored from a better starting point.
3) Fig. 1C is a schematic flow chart of a network pruning method. An original neural network model (NN) can be determined, the unimportant weights (or weight groups) in the original NN are identified and cut off, and the retained weights (or weight groups) are randomly re-initialized to obtain a sparse NN; the sparse NN is then trained. After training is completed, it is determined whether the compression ratio of the sparse NN reaches the standard; if it reaches the standard, the whole network pruning flow ends; if not, the flow returns to the step of identifying and cutting off the unimportant weights (or weight groups) in the original NN. In a specific implementation, only the retained weight positions are kept after pruning, each weight is randomly initialized again, and training starts from that state; for tasks such as image classification, an accuracy similar to that of the optimal strategy, or even higher than that of the original NN, can be obtained, which calls into question the necessity of the pre-trained network (or of the specific values of the retained weights) required by the two previous post-processing schemes. In other words, the sparse NN structure produced by pruning is more valuable than the weights inherited from the original NN. In the AI field, the above method may also be called training from scratch (retrain from scratch).
Among the above three methods, method 1) inherits the weights of the retained portion from the original NN before pruning. These weights may already be in a converged state in the original NN, which is valuable, but there is no theoretical basis to ensure that their values are still suitable for the new NN with a sparse structure after pruning. In fact, post-processing is needed after pruning because, although it can be proved or inferred from information theory and other arguments that the parameters of a DNN generally have a certain redundancy (relative to its target task), there is currently no good method to guarantee that the weights (or weight groups) corresponding to the truly redundant part can be identified with 100% accuracy, so the sparse NN formed after pruning necessarily loses part of its expressive capacity; post-processing, including the fine-tuning strategy, aims to re-establish that lost expressive capacity using the limited number of retained weight bits.
In method 2), the specific choice of how many epochs to rewind also depends on empirical judgment, or is found by repeated trial and error, and has no solid theoretical basis, so accuracy and generality across various tasks and NN model structures cannot be guaranteed.
In method 3), the success of randomly re-initializing the retained weight bits shows that, at least under certain conditions, doing so is no worse than inheriting the original weights, which opens up a deep research topic; combined with a suitable method for identifying the importance of weights (or weight groups), it can even avoid the costly training of the original NN. However, the approach still lacks a rigorous theoretical proof, and apart from individual tasks such as image classification, it is not yet known whether the results verified by other work are merely fortuitous. On the other hand, if the weights of the original NN are already available, this scheme simply discards all of them, which is a huge waste. Meanwhile, because of the non-convexity of DNNs, training from scratch without fine-grained control may produce a final weight distribution that differs significantly from the corresponding part of the original NN, and a convergence result similar to that of the original NN cannot be guaranteed, which affects its applicability in scenarios with higher accuracy requirements.
In view of the above problems, the present application provides a neural network pruning method and related products, and the following detailed description is provided.
Referring to fig. 2A, fig. 2A is a flow chart of a neural network pruning method according to an embodiment of the present application, which is applied to an electronic device, and as shown in the drawings, the neural network pruning method includes the following operations.
S201, determining a first model, wherein the first model supports fine-granularity training.
The electronic device may process the raw NN to obtain a first model that supports fine-grained training.
In a specific implementation, the training granularity can be determined according to the current application (for example, NN pruning in any field, such as machine vision, audio and video, natural language, control and the like, and any structure), and the first model can be determined according to the training granularity.
In the present application, supporting fine-grained training means that the trainable variables in the original NN can be controlled individually (e.g., whether their training is enabled). It can therefore be checked whether the current application requirement or model definition allows each trainable variable to be controlled separately, and whether the learning rate of each variable can be set separately, and the model definition can be modified appropriately when necessary.
For example, when modifying the model definition: version 1.x of TensorFlow (an end-to-end open-source machine learning platform) supports setting a Boolean parameter trainable for each variable (Variable), which can only be trained when set to True and otherwise keeps its current value; however, once set, this attribute cannot be altered unless the operation graph (Graph) is redefined. For most neural network models (NN), the most common case is that all weight and bias tensors default to True, so to achieve dynamic control, all variables that should be controlled together need to be gathered into the same group before training begins, a separate optimizer is configured for each group of variables, and different optimizers are then invoked dynamically during training. Other deep learning (DL) frameworks, such as TensorFlow 2.x and PyTorch, differ in the specific implementation but are consistent with the above approach, and are not described herein. A minimal sketch of the grouping approach follows.
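As a minimal sketch (the TensorFlow 1.x style API calls, toy model, and group membership are illustrative assumptions, not the patent's own code), the idea of one optimizer per variable group, invoked dynamically, can look like this:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

w = tf.Variable(tf.random.normal([784, 10]), name="weights")   # variable group A
b = tf.Variable(tf.zeros([10]), name="bias")                    # variable group B

logits = tf.matmul(x, w) + b
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))

# One optimizer per group; each minimize() only updates the variables in its own group.
train_w = tf.train.GradientDescentOptimizer(1e-3).minimize(loss, var_list=[w])
train_b = tf.train.GradientDescentOptimizer(1e-2).minimize(loss, var_list=[b])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        batch_x = np.random.rand(32, 784).astype(np.float32)                   # placeholder data
        batch_y = np.eye(10)[np.random.randint(0, 10, 32)].astype(np.float32)
        ops = [train_b]                 # bias group is always trained
        if step % 2 == 0:               # dynamically decide whether the weight group trains
            ops.append(train_w)
        sess.run(ops, feed_dict={x: batch_x, y: batch_y})
```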
It should be noted that, when the DL framework already supports dynamic training switch control, no modification may be made to the model definition.
The training granularity can be used for the subsequent training of the neural network. When determining or configuring the training granularity, it can be any scale between a single weight and all the weights of the whole network (inclusive of both ends), and the training granularity used by each layer can be the same or different.
In one possible example, the original NN or first model may include at least one of: floating point models, fixed point models, integer models, binarization models, etc., are not limited herein.
A floating-point model refers to a model whose parameters are floating-point numbers, and a fixed-point model may use any bit width (e.g., 16-bit, 12-bit, 11-bit, 10-bit, 8-bit, 7-bit, 6-bit, 4-bit, 2-bit, etc.).
S202, analyzing the first model to cut out the weight which does not reach the standard, and obtaining a second model.
Wherein, after pruning the first model, the second model may be a sparse NN.
The above-mentioned weights not reaching the standard may refer to weights or weight groups that are not important in the first model.
In a particular implementation, the resource occupancy of the original NN (e.g., total or per-layer computation, parameter count, and/or memory occupancy, etc.) may be analyzed, or a rule of thumb may be applied, to identify and clip the unimportant weights (or weight groups) in the first model.
The step of clipping the non-standard weight may be to zero the non-standard weight, or remove the non-standard weight, which is not limited herein.
And S203, initializing the weight corresponding to the second model to obtain a third model.
Wherein the third model may be a sparse NN. The method of initialization may be rewinding or re-initialization. Of course, the weights corresponding to the second model may also be left uninitialized, or other methods may be used.
After pruning operation, that is, after the weights which do not reach the standard are cut off, the method for initializing the weights corresponding to the second model can be determined according to actual conditions.
For example, if the NN size needs to be reduced directly in the model definition, weights are removed as whole structures within the weight tensor; in this case, once all the weights in a structure are removed, the whole structure is removed and the corresponding hyperparameter is reduced accordingly, which helps shrink the NN size directly at the model-definition stage. A sketch of removing a whole structure (e.g., convolution filters) is given below.
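The following is a minimal PyTorch sketch, under illustrative assumptions (filter-level pruning of a single convolution, L1 norm as the ranking rule), of removing whole structures so that the layer's hyperparameter shrinks with them:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)

# Rank filters (whole structures) by L1 norm and keep the strongest 24 of 32.
filter_norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
keep = torch.topk(filter_norms, k=24).indices

# Rebuild the layer with the reduced hyperparameter (out_channels: 32 -> 24).
pruned = nn.Conv2d(in_channels=16, out_channels=24, kernel_size=3, padding=1)
with torch.no_grad():
    pruned.weight.copy_(conv.weight[keep])
    pruned.bias.copy_(conv.bias[keep])
```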
For example, if the first model has reached a stable convergence state before pruning, i.e. before the unqualified weights are cut off, the retained weights may be returned to their state from several training epochs earlier, so that the initialization of the weights of the second model is implemented in a rewinding manner, as sketched below.
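A minimal PyTorch sketch of such rewinding follows; the checkpoint name, the toy layer, and the placeholder mask are assumptions for illustration only:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 64)
torch.save(model.state_dict(), "epoch_k.pt")   # checkpoint saved several epochs before convergence

# ... training continues to convergence, then pruning produces a binary mask ...
mask = (torch.rand_like(model.weight) > 0.5).float()   # placeholder mask for illustration

rewound = torch.load("epoch_k.pt")
with torch.no_grad():
    model.weight.copy_(rewound["weight"] * mask)   # retained positions rewound, pruned ones stay zero
    model.bias.copy_(rewound["bias"])
```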
S204, performing iterative training on the third model until the third model converges or the iteration number reaches the maximum iteration number.
The number of iterations may refer to the number of iterative training, and the maximum number of iterations may be set by a user or default by a system, which is not limited herein.
The training data set, the loss function, etc. used in the iterative training may be the same or different, and are not limited herein.
S205, in each generation of iterative training, judging whether the compression ratio corresponding to the fourth model obtained by each generation of iterative training meets the standard or not.
In each generation of iterative training, the compression ratio of the fourth model relative to the original NN is determined. If the compression ratio is smaller than or equal to a preset compression ratio threshold, the compression ratio corresponding to the fourth model is determined to reach the standard; if it is larger than the preset compression ratio threshold, the compression ratio corresponding to the fourth model is determined not to reach the standard.
In a specific implementation, the compression target of the model, namely the preset compression ratio threshold, can be determined according to the actual application. The compression ratio may be measured by parameter count, amount of computation, or model size; for example, the ratio of the parameter count of the fourth model to the parameter count of the original NN may be calculated as the compression ratio, as in the sketch below.
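A minimal sketch (the helper names and the default threshold are assumptions, not defined by the patent) of judging whether the compression ratio reaches the standard using parameter counts:

```python
import torch.nn as nn

def compression_ratio(pruned_model: nn.Module, original_model: nn.Module) -> float:
    """Ratio of non-zero parameters in the pruned model to all parameters in the original."""
    pruned = sum(int(p.count_nonzero()) for p in pruned_model.parameters())
    original = sum(p.numel() for p in original_model.parameters())
    return pruned / original

def reaches_standard(pruned_model: nn.Module, original_model: nn.Module,
                     threshold: float = 0.5) -> bool:
    # The standard is met when the ratio is no larger than the preset threshold.
    return compression_ratio(pruned_model, original_model) <= threshold
```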
And S206, if the compression ratio reaches the standard, determining that pruning operation is finished.
Optionally, as shown in fig. 2B, a flow chart of a neural network model pruning method is shown, and the method may further include step S207, if the compression ratio does not reach the standard, taking the sparse NN obtained under the condition that the compression ratio does not reach the standard as the first model of the next round of pruning operation, and returning to step S202.
It can be seen that, according to the neural network pruning method provided by the embodiment of the application, a first model can be determined, where the first model supports fine-grained training; the first model is analyzed to cut off the weights that do not reach the standard, obtaining a second model; the weights of the second model are initialized to obtain a third model; iterative training is performed on the third model until it converges or the iteration number reaches the maximum iteration number; in each generation of iterative training, it is judged whether the compression ratio corresponding to the fourth model obtained by that generation reaches the standard; and if the compression ratio reaches the standard, it is determined that the pruning operation is finished. Therefore, by introducing initialization, retraining and similar methods into the post-pruning training process, training flexibility is improved and a developer can flexibly customize the training method for their own application scenario, so that the loss of expressive capacity caused by cutting off some unimportant weights is avoided, finer control is gained over the learning curve along which the expressive capacity is restored after pruning, and the pruning precision is improved.
In one possible example, the determining of the first model may include the following steps: acquiring an original model and the training granularity corresponding to each layer of the network model in the original model; according to the training granularity corresponding to each layer of the network model, determining the weights of each layer that need to be opened and/or closed and/or slowed down in each generation of the iterative training process; and scheduling each layer of the original model according to the weights that need to be opened and/or closed and/or slowed down, to obtain the first model.
The original model may refer to an original NN, i.e., a neural network model without pruning.
The training granularity can be used for the subsequent training of the neural network. When determining or configuring the training granularity, it can be any scale between a single weight and all the weights of the whole network (inclusive of both ends), and the training granularity used by each layer can be the same or different.
For example, in some low-level, pixel-level tasks in the machine vision field (such as denoising and super-resolution), the weights of the original CNN contain much information about the extracted detail features, so it is undesirable for retraining after pruning to shift the weight distribution and thereby cause a series of problems such as blurring and color shift. In that case the CNN weights can be fixed, or their learning slowed down, while the remaining parameters such as biases are trained normally: the weight variables can be grouped separately with their training disabled or given a lower learning rate, and the remaining variables grouped with their training enabled as usual. If the weights of only part of the layers should be updated in each iteration, the variables may similarly be grouped by layer and configured with training switches and/or learning rates, which is not limited herein. A sketch of such per-group configuration is given below.
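A minimal PyTorch sketch of this per-group configuration under illustrative assumptions (a toy two-convolution model and arbitrary learning rates): one group carries the weights with training slowed down or disabled, the other carries the biases trained normally.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))

weight_params = [p for n, p in model.named_parameters() if n.endswith("weight")]
bias_params = [p for n, p in model.named_parameters() if n.endswith("bias")]

optimizer = torch.optim.SGD([
    {"params": weight_params, "lr": 1e-5},   # slowed-down learning for the weight group
    {"params": bias_params,   "lr": 1e-2},   # normal training for the bias group
])

# Disabling the weight group entirely is equivalent to switching off its gradients:
for p in weight_params:
    p.requires_grad = False   # or set its group's lr to 0 for the selected epochs
```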
Wherein, opening, closing, or slowing down refers to controlling whether, and how fast, a weight is updated during the training process; it can be understood as a training "switch" that determines, for each weight or weight group, whether its own training is enabled, and so on.
In order to apply the training granularity in the subsequent step S204, a scheduling rule for the post-pruning training process may first be determined, that is, which weights (or weight groups) need to have their training opened, closed, or slowed down in each generation of the iterative training process. Therefore, the weights or weight groups that need to be opened and/or closed and/or slowed down in each generation of iterative training can first be determined according to the training granularity corresponding to each layer of the network model, and the corresponding weights in each layer are scheduled accordingly, thereby obtaining the first model.
The scheduling rules may be used to keep the training of a certain part of the weights on (and the training of the rest off and/or slowed down) during the whole training process, or may be used to open and/or close and/or slow down different sets of weights in each iteration.
Of course, the specific implementation method of the scheduling of the weights in the whole training process is the same as the scheduling method for each layer of weights, and will not be described herein.
It can be seen that, in this example, dynamic scheduling of the weights during training can be achieved through the scheduling rule, which facilitates dynamic control of the training granularity, improves the flexibility and convergence precision of training, and helps reach a higher compression rate at the same precision level after the iterative training ends, thereby expanding the application range of the sparse NN. The method is also compatible with more pruning schemes, which helps broaden the range of pruning applications.
Alternatively, if Boolean switches are used to schedule the opening and/or closing and/or slowing down of the weights (or weight groups) during pruning of the first model, the Boolean switch of each weight (or weight group) indicating "whether to open its own training" may be expanded into a trainable floating-point variable between 0 and 1 (inclusive) that is multiplied into the gradient updates of the corresponding weights (or weight groups). The physical meaning of these variables then becomes a training intensity: 1 represents full-speed training and 0 represents completely disabled training. In this way, a set of dynamic scheduling strategies can be learned automatically during training, which avoids sub-optimal adjustment of the scheduling strategy by experience or observation; at the same time, compared with fully opening or fully closing training with Boolean variables, or manually specifying the learning rate of each parameter, the gradient updates are regulated more finely, further improving precision.
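A minimal sketch of one possible realization (not the patent's exact formulation): a floating-point training intensity in [0, 1] scales the gradient update of a weight group instead of a hard on/off switch; how the intensity itself receives a learning signal (e.g., by also acting in the forward pass or through a regularizer) is left open here.

```python
import torch
import torch.nn as nn

layer = nn.Linear(128, 64)
intensity = torch.tensor(0.3)   # per-group training intensity in [0, 1]; 1 = full speed, 0 = frozen

x = torch.randn(16, 128)        # placeholder data
target = torch.randn(16, 64)
optimizer = torch.optim.SGD(layer.parameters(), lr=1e-2)

for _ in range(10):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(layer(x), target)
    loss.backward()
    layer.weight.grad.mul_(intensity)   # scale the weight update by the training intensity
    optimizer.step()
```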
In one possible example, in the aspect of analyzing the first model to cut off the weights that do not reach the standard so as to obtain the second model, the method includes the following steps: determining a minimum pruning unit corresponding to the first model; grouping the weights corresponding to the first model according to the minimum pruning unit to obtain a plurality of weight groups; and performing layer-by-layer or global recognition on the first model according to the weight groups, so as to screen out the weights that do not reach the standard and obtain the second model.
The minimum pruning unit may be set by the user or default by the system, which is not limited herein. In the case of unstructured pruning (unstructured pruning), the minimum pruning unit may be a single weight; in the case of structural pruning (structured pruning), the minimum pruning unit may also be a more structured set of weights (such as weight bars, filters in CNN, etc.), and even different pruning units may be used at different positions of NN, which is not limited herein.
The weight which is not up to standard can be identified as an unimportant weight (or weight group), and the weight which is up to standard can be understood as an important weight, and whether the weight is up to standard can be measured by the importance.
In the layer-by-layer or global identification of weight importance in the first model, the importance may be approximated empirically by a preset index (e.g., a norm, a gradient, etc.), or actually measured by the loss in model expressive capacity (e.g., sensitivity) caused by clipping each weight (or weight group).
Alternatively, in identifying the importance of a weight (or weight group), the result may be binary (1 for important, 0 for unimportant) or a continuous value indicating the degree of importance.
In particular, when the importance of a weight (or weight group) is characterized by a continuous importance value, continuous values between 0 and 1 representing the importance of each weight may be stored, in one-to-one correspondence, in another tensor of the same size as the original weight tensor; the larger the continuous value, the more important the weight. This tensor can even directly participate in the NN training, in which case it is generally referred to as a soft mask, so that by the end of one training pass of step S205, the weight bits with the smaller mask values can be cut off. It should be noted that, in this case, step S202 does not actually cut off the unimportant weights directly, and does not even complete the identification of the important weights; the conclusion as to which part of the weights is unimportant, i.e. which weights do not reach the standard, is not reached until the end of the training in step S205. A minimal sketch of the soft-mask idea follows.
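As a minimal PyTorch sketch (the module, its sigmoid parameterization, and the pruning rule are illustrative assumptions), a soft mask of the same shape as the weight participates in the forward pass and, at the end of training, the positions with the smallest mask values are cut off:

```python
import torch
import torch.nn as nn

class SoftMaskedLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Second tensor of the same size holding learnable importance values.
        self.mask_logits = nn.Parameter(torch.zeros(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        soft_mask = torch.sigmoid(self.mask_logits)          # continuous importance in (0, 1)
        return nn.functional.linear(x, self.weight * soft_mask, self.bias)

    def prune(self, keep_ratio: float) -> None:
        # After training, cut off the weight bits whose mask values are smallest.
        soft_mask = torch.sigmoid(self.mask_logits)
        k = soft_mask.numel() - int(keep_ratio * soft_mask.numel())   # number to prune
        threshold = soft_mask.flatten().kthvalue(k).values
        with torch.no_grad():
            self.weight.mul_((soft_mask > threshold).float())
```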
It can be seen that, in this example, the pruning of the NN can be achieved through dynamic regulation and control: the pruning of the weights (or weight groups) can either be performed in advance or be accomplished during the training process itself, which maximizes training flexibility and helps optimize the pruning process.
In one possible example, in the aspect of performing layer-by-layer or global recognition on the first model according to the plurality of weight groups to screen out the weights that do not reach the standard and obtain the second model, the method includes the following steps: determining an L0 norm and/or an L1 norm and/or an L2 norm corresponding to each weight group according to the plurality of weight groups; determining that the weights included in a weight group whose L0 norm and/or L1 norm and/or L2 norm is smaller than a preset threshold are the weights that do not reach the standard; and cutting off the unqualified weights to obtain the second model.
The preset threshold may be set by the user or default by the system, which is not limited herein.
In this example, the pruning proportion of each layer of weights (weight group) may be set, and may be the same or different, and is not limited herein.
The importance of the weight (weight group) can be measured by an index such as an L0 norm and/or an L1 norm and/or an L2 norm.
Alternatively, when the unqualified weights (or weight groups) are clipped, the number of clipped weights (or weight groups) may be determined based on their proportion of the total number of weights (or weight groups), or based on the proportion of the total importance value accounted for by the clipped weights; likewise, the total importance of the weights may be measured by indexes such as norm, gradient, and sensitivity. A sketch of proportion-based group pruning by norm is given below.
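A minimal sketch (the helper name, the choice of L1 norm, and filter-level grouping are assumptions for illustration) of ranking weight groups by norm and zeroing the lowest-ranked groups until a given pruning proportion is reached:

```python
import torch

def prune_groups_by_l1(weight: torch.Tensor, group_dim: int, prune_ratio: float) -> torch.Tensor:
    """Treat each slice along group_dim (e.g. each filter) as one weight group."""
    groups = weight.movedim(group_dim, 0)
    norms = groups.abs().flatten(1).sum(dim=1)          # L1 norm per group
    num_prune = int(prune_ratio * norms.numel())
    prune_idx = torch.argsort(norms)[:num_prune]        # groups that do not reach the standard
    mask = torch.ones_like(norms)
    mask[prune_idx] = 0.0
    shape = [1] * weight.dim()
    shape[group_dim] = -1
    return weight * mask.view(shape)

# Example: prune 25% of the filters (dimension 0) of a convolution weight tensor.
conv_weight = torch.randn(32, 16, 3, 3)
sparse_weight = prune_groups_by_l1(conv_weight, group_dim=0, prune_ratio=0.25)
```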
In one possible example, in the aspect of clipping the unqualified weights, the method includes the following steps: setting the unqualified weights to zero to obtain the second model; or compressing and storing the target weights other than the unqualified weights in a sparse format.
When the unqualified weights (or weight groups), namely the unimportant weights, are cut off, they can be set to zero and stored in an ordinary dense tensor format; alternatively, the tensors of target weights (namely the important weights, or the weights (or weight groups) that reach the standard) other than the unqualified weights can be compressed and stored in a sparse format, with correspondingly optimized sparse operators used in subsequent computation.
It should be noted that, in order to reuse the training code of the original NN, the zero-setting approach may be adopted during the training of the third model (the sparse NN), and the model need not be converted into a sparse format for storage or deployment until step S206 is completed. Both options are sketched below.
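A minimal sketch (the toy tensor and random mask are assumptions) of the two storage options described above:

```python
import torch

dense = torch.randn(64, 64)
mask = (torch.rand_like(dense) > 0.5).float()   # placeholder pruning mask

# Option 1: zero the pruned positions and keep the ordinary dense tensor format,
# which lets the original training code be reused during retraining.
dense_zeroed = dense * mask

# Option 2: once pruning is finished, compress the retained weights into a sparse
# format for storage or deployment, to be executed with sparse-optimized operators.
sparse = dense_zeroed.to_sparse()
```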
In one possible example, the method may further include the steps of: determining a loss function of the iterative training of the third model according to the expression capability between the third model and the original model; and carrying out iterative training on the third model according to the loss function.
Wherein, in order to reduce the accuracy loss during the training of the third model, the loss function may be determined according to the error in expressive capacity between the third model and the original model, and this error may be computed from the output layer and/or intermediate layers of the models, as sketched below.
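A minimal sketch (toy models and an output-layer mean-squared error are illustrative assumptions; an intermediate-layer term could be added the same way) of building the retraining loss from the expressive-capacity error between the pruned third model and the original model:

```python
import torch
import torch.nn as nn

original_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
third_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))   # sparse model after pruning

def expression_loss(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        reference = original_model(x)           # the original model provides the reference outputs
    return nn.functional.mse_loss(third_model(x), reference)

x = torch.randn(32, 128)
loss = expression_loss(x)
loss.backward()
```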
Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application, as shown in the drawing, the electronic device includes a processor, a memory, a communication interface, and one or more programs, and is applied to the electronic device, where the one or more programs are stored in the memory, and the one or more programs are configured by the processor to execute instructions for:
determining a first model, wherein the first model supports fine-grained training;
analyzing the first model to cut off the weight which does not reach the standard to obtain a second model;
Initializing the weight corresponding to the second model to obtain a third model;
performing iterative training on the third model until the third model converges or the iteration number reaches the maximum iteration number;
in each generation of iterative training, judging whether the compression ratio corresponding to a fourth model obtained by each generation of iterative training meets the standard or not;
and if the compression ratio reaches the standard, determining that pruning operation is finished.
It can be seen that the electronic device described in the embodiments of the present application may determine a first model, where the first model supports fine-grained training; analyze the first model to cut off the weights that do not reach the standard to obtain a second model; initialize the weights of the second model to obtain a third model; perform iterative training on the third model until it converges or the iteration number reaches the maximum iteration number; in each generation of iterative training, judge whether the compression ratio corresponding to the fourth model obtained by that generation reaches the standard; and if the compression ratio reaches the standard, determine that the pruning operation is finished. Therefore, by introducing initialization, retraining and similar methods into the post-pruning training process, training flexibility is improved and a developer can flexibly customize the training method for their own application scenario, so that the loss of expressive capacity caused by cutting off some unimportant weights is avoided, finer control is gained over the learning curve along which the expressive capacity is restored after pruning, and the pruning precision is improved.
In one possible example, in said determining the first model, the program comprises instructions for:
acquiring an original model and training fine granularity corresponding to each layer of network model in the original model;
according to the training fine granularity corresponding to each layer of network model, determining the weight of each layer of network model which needs to be opened and/or closed and/or slowed down in each generation of iterative training process;
and scheduling each layer of network model of the original model according to the weight required to be opened and/or closed and/or slowed down to obtain the first model.
In one possible example, in said analyzing said first model to cut out the unqualified weights to obtain a second model, the program comprises instructions for performing the steps of:
determining a minimum pruning unit corresponding to the first model;
grouping weights corresponding to the first model according to the minimum pruning unit to obtain a plurality of weight groups;
and carrying out layer-by-layer or global recognition on the first model according to the weight groups so as to screen out the weights which do not reach the standard and obtain the second model.
In one possible example, in the aspect of performing layer-by-layer or global recognition on the first model according to the plurality of weight groups to screen out the weights that do not reach the standard and obtain the second model, the above program includes instructions for executing the following steps:
according to the weight groups, determining an L0 norm and/or an L1 norm and/or an L2 norm corresponding to each weight group;
determining that the weight included in the weight group of which the L0 norm and/or the L1 norm and/or the L2 norm is smaller than a preset threshold is the weight which does not reach the standard;
and cutting off the unqualified weight to obtain the second model.
In one possible example, in said clipping said non-qualifying weight, the above-mentioned program comprises instructions for:
setting the unqualified weight to zero to obtain the second model; or alternatively
And compressing and storing the target weights except the unqualified weight in a sparse format.
In one possible example, the above-described program further includes instructions for performing the steps of:
determining a loss function of the iterative training of the third model according to the expression capability between the third model and the original model;
And carrying out iterative training on the third model according to the loss function.
In one possible example, the first model includes at least one of: floating point model, fixed point model, integer model, binarization model.
The foregoing description of the embodiments of the present application has been presented primarily in terms of a method-side implementation. It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application can divide the functional units of the electronic device according to the method example, for example, each functional unit can be divided corresponding to each function, and two or more functions can be integrated in one processing unit. The integrated units may be implemented in hardware or in software functional units. It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice.
In the case of dividing each functional module by corresponding each function, fig. 4 shows a schematic diagram of a neural network pruning device, as shown in fig. 4, the device is applied to an electronic apparatus, and the neural network pruning device 400 may include: a determination unit 401, an analysis unit 402, an initialization unit 403, an iterative training unit 404, and a judgment unit 405, wherein,
the determining unit 401 is configured to determine a first model, where the first model supports fine granularity training;
the analysis unit 402 is configured to analyze the first model to cut out a weight that does not reach the standard, so as to obtain a second model;
the initializing unit 403 is configured to initialize a weight corresponding to the second model to obtain a third model;
the iterative training unit 404 is configured to perform iterative training on the third model until the third model converges or the iteration number reaches a maximum iteration number;
the judging unit 405 is configured to judge, in each generation of the iterative training, whether a compression ratio corresponding to a fourth model obtained by each generation of the iterative training meets a standard;
the determining unit 401 is further configured to determine that the pruning operation is ended if the compression ratio meets the standard.
It can be seen that the neural network pruning device provided by the embodiment of the application can determine a first model, where the first model supports fine-grained training; analyze the first model to cut off the weights that do not reach the standard to obtain a second model; initialize the weights of the second model to obtain a third model; perform iterative training on the third model until it converges or the iteration number reaches the maximum iteration number; in each generation of iterative training, judge whether the compression ratio corresponding to the fourth model obtained by that generation reaches the standard; and if the compression ratio reaches the standard, determine that the pruning operation is finished. Therefore, by introducing initialization, retraining and similar methods into the post-pruning training process, training flexibility is improved and a developer can flexibly customize the training method for their own application scenario, so that the loss of expressive capacity caused by cutting off some unimportant weights is avoided, finer control is gained over the learning curve along which the expressive capacity is restored after pruning, and the pruning precision is improved.
In one possible example, in the aspect of determining the first model, the determining unit 401 is specifically configured to:
Acquiring an original model and training fine granularity corresponding to each layer of network model in the original model;
according to the training fine granularity corresponding to each layer of network model, determining the weight of each layer of network model which needs to be opened and/or closed and/or slowed down in each generation of iterative training process;
and scheduling each layer of network model of the original model according to the weight required to be opened and/or closed and/or slowed down to obtain the first model.
In one possible example, in the analyzing the first model to cut out the weights that do not reach the standard, the analyzing unit 402 is specifically configured to:
determining a minimum pruning unit corresponding to the first model;
grouping weights corresponding to the first model according to the minimum pruning unit to obtain a plurality of weight groups;
and carrying out layer-by-layer or global recognition on the first model according to the weight groups so as to screen out the weights which do not reach the standard and obtain the second model.
In one possible example, in the aspect of performing layer-by-layer or global recognition on the first model according to the plurality of weight groups to screen out the weights that do not reach the standard, to obtain the second model, the analyzing unit 402 is specifically further configured to:
According to the weight groups, determining an L0 norm and/or an L1 norm and/or an L2 norm corresponding to each weight group;
determining that the weight included in the weight group of which the L0 norm and/or the L1 norm and/or the L2 norm is smaller than a preset threshold is the weight which does not reach the standard;
and cutting off the unqualified weight to obtain the second model.
In one possible example, in terms of said clipping out said unqualified weights, the above-mentioned analysis unit 402 is specifically further configured to:
setting the unqualified weight to zero to obtain the second model; or alternatively
And compressing and storing the target weights except the unqualified weight in a sparse format.
In one possible example, the iterative training unit 404 is specifically configured to:
determining a loss function of the iterative training of the third model according to the expression capability between the third model and the original model;
and carrying out iterative training on the third model according to the loss function.
It should be noted that, all relevant contents of each step related to the above method embodiment may be cited to the functional description of the corresponding functional module, which is not described herein.
The electronic device provided in this embodiment is configured to execute the neural network pruning method, so that the same effect as that of the implementation method can be achieved.
In case an integrated unit is employed, the electronic device may comprise a processing module, a storage module and a communication module. The processing module may be configured to control and manage actions of the electronic device, for example, may be configured to support the electronic device to perform the steps performed by the determining unit 401, the analyzing unit 402, the initializing unit 403, the iterative training unit 404, and the judging unit 405. The memory module may be used to support the electronic device to execute stored program code, data, etc. And the communication module can be used for supporting the communication between the electronic device and other devices.
Wherein the processing module may be a processor or a controller, which may implement or execute the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor may also be a combination that performs computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor (DSP) and a microprocessor, and the like. The storage module may be a memory. The communication module may be a radio frequency circuit, a Bluetooth chip, a Wi-Fi chip, or other equipment that interacts with other electronic devices.
The embodiment of the application also provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program makes a computer execute part or all of the steps of any one of the above method embodiments, and the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform part or all of the steps of any one of the methods described in the method embodiments above. The computer program product may be a software installation package, said computer comprising an electronic device.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts; however, those skilled in the art should understand that the present application is not limited by the order of the acts described, since some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules involved are not necessarily required by the present application.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units described above is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units described above are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
Those of ordinary skill in the art will appreciate that all or some of the steps in the various methods of the above embodiments may be implemented by a program instructing the associated hardware; the program may be stored in a computer-readable memory, which may include a flash disk, a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like.
The embodiments of the present application have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present application; the above description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and the scope of application in accordance with the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (10)

1. A neural network pruning method, comprising:
determining a first model, wherein the first model supports fine-grained training;
analyzing the first model to cut off the weight which does not reach the standard to obtain a second model;
initializing the weight corresponding to the second model to obtain a third model;
performing iterative training on the third model until the third model converges or the number of iterations reaches the maximum number of iterations;
in each generation of iterative training, judging whether the compression ratio corresponding to a fourth model obtained by each generation of iterative training meets the standard or not;
and if the compression ratio reaches the standard, determining that the pruning operation is finished.
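Read as ordinary pseudocode rather than claim language, the flow of claim 1 can be sketched as follows; every helper function here is a hypothetical placeholder, not an API defined by the application:

```python
def prune_network(first_model, train_loader, max_iters, target_compression):
    # Steps of claim 1, expressed procedurally; helpers are hypothetical placeholders.
    second_model = cut_substandard_weights(first_model)   # analyse and cut sub-standard weights
    third_model = reinitialize_weights(second_model)      # initialise the remaining weights
    for generation in range(max_iters):
        fourth_model = train_one_generation(third_model, train_loader)
        if compression_ratio(fourth_model) >= target_compression:
            return fourth_model                            # compression ratio reaches the standard
        if has_converged(fourth_model):
            break                                          # stop once the model converges
        third_model = fourth_model
    return third_model
```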
2. The method of claim 1, wherein the determining the first model comprises:
acquiring an original model and the training fine granularity corresponding to each network layer of the original model;
determining, according to the training fine granularity corresponding to each network layer, the weights of each network layer that need to be opened and/or closed and/or slowed down during each generation of iterative training; and
scheduling each network layer of the original model according to the weights that need to be opened and/or closed and/or slowed down, to obtain the first model.
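One way to picture this per-layer scheduling is sketched below, assuming a PyTorch module and a caller-supplied schedule; the 0.1 learning-rate scale for slowed-down weights is an assumption:

```python
import torch

def apply_layer_schedule(model: torch.nn.Module, schedule: dict,
                         generation: int, base_lr: float = 1e-3):
    # schedule maps a parameter name to a per-generation action: "open", "close" or "slow".
    param_groups = []
    for name, param in model.named_parameters():
        action = schedule.get(name, {}).get(generation, "open")
        if action == "close":
            param.requires_grad_(False)                  # closed: frozen for this generation
            continue
        param.requires_grad_(True)
        lr_scale = 0.1 if action == "slow" else 1.0      # slowed down: smaller learning rate
        param_groups.append({"params": [param], "lr": base_lr * lr_scale})
    return param_groups  # e.g. passed to torch.optim.SGD(param_groups)
```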
3. The method according to claim 1 or 2, wherein said analyzing the first model to cut out the unqualified weights to obtain a second model comprises:
determining a minimum pruning unit corresponding to the first model;
grouping the weights corresponding to the first model according to the minimum pruning unit to obtain a plurality of weight groups; and
identifying the first model layer by layer or globally according to the plurality of weight groups, so as to screen out the weights which do not reach the standard and obtain the second model.
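A minimal sketch of the grouping step, assuming the minimum pruning unit is a fixed number of consecutive weights (other units, such as whole channels or filters, would be grouped differently):

```python
import torch

def group_weights(weight: torch.Tensor, unit: int):
    # Split the flattened weight tensor into groups of `unit` consecutive elements;
    # the last group may be shorter if the sizes do not divide evenly.
    return list(torch.split(weight.flatten(), unit))

# Illustrative usage: group a 3x4 weight matrix into units of 4 weights each.
w = torch.randn(3, 4)
groups = group_weights(w, unit=4)   # three groups of four weights
```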
4. The method of claim 3, wherein the step of identifying the first model layer by layer or globally according to the plurality of weight groups to screen out the weights that do not reach the standard, to obtain the second model, includes:
determining, according to the plurality of weight groups, an L0 norm and/or an L1 norm and/or an L2 norm corresponding to each weight group;
determining that the weights included in a weight group whose L0 norm and/or L1 norm and/or L2 norm is smaller than a preset threshold are the weights which do not reach the standard; and
cutting off the weights which do not reach the standard to obtain the second model.
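The norm-based screening can be sketched as below, using the groups produced in the previous sketch and treating the L0 norm as the count of non-zero entries; the threshold value is an assumption:

```python
import torch

def find_substandard_groups(groups, threshold: float, p: int = 1):
    # Return the indices of weight groups whose Lp norm (p = 0, 1 or 2) is below the threshold.
    substandard = []
    for idx, group in enumerate(groups):
        if p == 0:
            norm = (group != 0).sum().to(group.dtype)   # L0: number of non-zero weights
        else:
            norm = group.norm(p=float(p))               # L1 or L2 norm
        if norm.item() < threshold:
            substandard.append(idx)
    return substandard

# Illustrative usage with the groups from the previous sketch:
# bad = find_substandard_groups(groups, threshold=0.8, p=1)
```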
5. The method of claim 4, wherein said cutting off the weights which do not reach the standard comprises:
setting the weights which do not reach the standard to zero to obtain the second model; or
compressing and storing the target weights other than the weights which do not reach the standard in a sparse format.
6. The method of claim 5, wherein the method further comprises:
determining a loss function for the iterative training of the third model according to the difference in expressive capability between the third model and the original model; and
performing iterative training on the third model according to the loss function.
7. The method of claim 6, wherein the first model comprises at least one of: a floating-point model, a fixed-point model, an integer model, or a binarized model.
8. A neural network pruning device, the device comprising: the device comprises a determining unit, an analyzing unit, an initializing unit, an iterative training unit and a judging unit, wherein,
the determining unit is used for determining a first model, wherein the first model supports fine granularity training;
the analysis unit is used for analyzing the first model to cut off the weight which does not reach the standard so as to obtain a second model;
the initialization unit is used for initializing the weight corresponding to the second model to obtain a third model;
the iterative training unit is used for performing iterative training on the third model until the third model converges or the number of iterations reaches the maximum number of iterations;
The judging unit is used for judging whether the compression ratio corresponding to the fourth model obtained by each generation of iterative training meets the standard or not in each generation of iterative training;
and the determining unit is further used for determining that the pruning operation is finished if the compression ratio reaches the standard.
9. An electronic device comprising a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that a computer program for electronic data exchange is stored thereon, wherein the computer program causes a computer to perform the method according to any one of claims 1-7.
CN202210364501.2A 2022-04-07 2022-04-07 Neural network pruning method and related products Pending CN116933858A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210364501.2A CN116933858A (en) 2022-04-07 2022-04-07 Neural network pruning method and related products

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210364501.2A CN116933858A (en) 2022-04-07 2022-04-07 Neural network pruning method and related products

Publications (1)

Publication Number Publication Date
CN116933858A true CN116933858A (en) 2023-10-24

Family

ID=88386646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210364501.2A Pending CN116933858A (en) 2022-04-07 2022-04-07 Neural network pruning method and related products

Country Status (1)

Country Link
CN (1) CN116933858A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination