CN116306888A - Neural network pruning method, device, equipment and storage medium - Google Patents

Neural network pruning method, device, equipment and storage medium

Info

Publication number
CN116306888A
Authority
CN
China
Prior art keywords
neural network
layer
network model
pruning
precision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310546665.1A
Other languages
Chinese (zh)
Other versions
CN116306888B (en)
Inventor
刘建伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Aixin Yuanzhi Technology Co ltd
Beijing Aixin Technology Co ltd
Original Assignee
Xi'an Aixin Yuanzhi Technology Co ltd
Beijing Aixin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Aixin Yuanzhi Technology Co ltd, Beijing Aixin Technology Co ltd filed Critical Xi'an Aixin Yuanzhi Technology Co ltd
Priority to CN202310546665.1A priority Critical patent/CN116306888B/en
Publication of CN116306888A publication Critical patent/CN116306888A/en
Application granted granted Critical
Publication of CN116306888B publication Critical patent/CN116306888B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a neural network pruning method, device, equipment and storage medium, relating to the field of neural networks. The method comprises: pre-training a neural network model to be pruned to obtain a full-precision neural network model; performing topology analysis on the convolution layers and/or fully connected layers in the full-precision neural network model to obtain hardware acceleration potential layers; sorting a plurality of output channels in the hardware acceleration potential layers by importance and performing channel pruning according to the sorting result to obtain a pruned neural network model; and retraining the pruned neural network model to obtain a target neural network model. By sorting the output channels of the hardware acceleration potential layers in the full-precision model by importance, pruning channels according to that importance, and retraining the pruned model to recover accuracy, redundancy is reduced while model accuracy and the hardware inference speed-up ratio are preserved, thereby improving the model compression and optimization capability.

Description

Neural network pruning method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of neural networks, and in particular, to a neural network pruning method, device, equipment and storage medium.
Background
Model pruning, also called structured sparsification, is a model compression method that discards weights and activations of a neural network model at channel granularity, reducing the number of weights and the amount of computation and thereby compressing and accelerating the model. To keep model accuracy from being affected, pruning usually has to be performed together with model training: during training, the output channels of certain layers of the model are gradually ranked according to an importance metric, channels of low importance are pruned, and the impact of pruning on the final accuracy is compensated by gradient updates of the training loss function.
At present, existing model pruning methods generally measure the degree of pruning by the reduction in model computation or parameter count and use it as a proxy for hardware inference time in actual deployment. However, because model computation and hardware latency do not correspond exactly, the resulting hardware acceleration is inaccurate and model accuracy is degraded.
Disclosure of Invention
In view of this, an object of the embodiments of the present application is to provide a neural network pruning method, device, equipment and storage medium. A complete full-precision neural network model is trained in advance; topology analysis is performed on the model to select potential layers that can be accelerated by hardware; the weight magnitudes of the output channels of those potential layers are then sorted in the channel dimension, and the corresponding proportion of output channels is pruned under a given sparsity; finally, the pruned model is retrained to recover accuracy. Under the same sparsity, or the same post-pruning computation, model accuracy and the hardware inference speed-up ratio are ensured to the greatest extent, thereby solving the above technical problems.
In a first aspect, an embodiment of the present application provides a neural network pruning method, where the method includes: pre-training the neural network model to be pruned to obtain a full-precision neural network model; performing topology analysis on a convolution layer and/or a full connection layer in the full-precision neural network model to obtain a hardware acceleration potential layer; importance sorting is carried out on a plurality of output channels in the hardware acceleration potential layer, channel pruning is carried out according to sorting results, and a pruned neural network model is obtained; and retraining the pruned neural network model to obtain a target neural network model.
In the implementation process, a complete full-precision neural network model is trained in advance, topology analysis is performed on the model to select potential layers that can be accelerated by hardware, the output channels of those potential layers are sorted by importance, channel pruning is performed according to that importance, and the pruned model is finally retrained to recover accuracy. This reduces redundancy, saves computation, and ensures model accuracy and the hardware inference speed-up ratio to the greatest extent, thereby improving the model compression and optimization capability.
Optionally, performing topology analysis on the convolution layer and/or the full connection layer in the full-precision neural network model to obtain a hardware acceleration potential layer, including: judging the type of an output layer of a convolution layer and/or a full connection layer in the full-precision neural network model based on the local topological structure of the full-precision neural network model; if the type of the output layer is judged to be a convolution layer, an activation function layer or a concat layer, determining the current convolution layer and/or the current full connection layer as a hardware acceleration potential layer; and if the type of the output layer is judged to be an addition layer or multiplication layer, determining the current convolution layer and/or the current full connection layer as a non-hardware acceleration potential layer.
In the implementation process, the topology of the pre-trained neural network model is analyzed so that the convolution layers and/or fully connected layers are divided into hardware acceleration potential layers that can be accelerated and non-potential layers that cannot. This facilitates subsequent targeted pruning, improves pruning efficiency, and ensures model accuracy and the hardware inference speed-up ratio to the greatest extent, thereby improving the model compression and optimization capability.
Optionally, the sorting the multiple output channels in the hardware acceleration potential layer, and performing channel pruning according to the sorting result, to obtain a pruned neural network model, includes: according to the magnitude of the weight amplitude on the output channel, importance sorting is carried out on a plurality of output channels in the hardware acceleration potential layer, and a sorting result is obtained; and setting a global sparsity rate according to the sequencing result, and carrying out channel pruning on the hardware acceleration potential layer to obtain a pruned neural network model.
In the implementation process, the output channels of the hardware acceleration potential layers in the pre-trained neural network model are sorted by importance, with importance evaluated by weight magnitude in the channel dimension; the proportion of channels to keep is determined by the global sparsity and channel pruning is performed accordingly to obtain the pruned model. This avoids pruning channels of high importance, reduces accuracy loss, and ensures model accuracy and the hardware inference speed-up ratio to the greatest extent, thereby improving the model compression and optimization capability.
Optionally, output channels with larger weight magnitudes are ranked earlier in the sorting result.
In the implementation process, the output channels of the hardware acceleration potential layers in the pre-trained neural network model are sorted by importance, in ascending or descending order of weight magnitude in the channel dimension, so that channels of high importance can be identified quickly. This improves pruning efficiency and ensures model accuracy and the hardware inference speed-up ratio to the greatest extent, thereby improving the model compression and optimization capability.
Optionally, the performing channel pruning on the hardware acceleration potential layer by setting a global sparsity according to the sorting result to obtain a pruned neural network model, including: pruning the output channels corresponding to the percentage in the sequencing result according to the percentage of the global sparsity to obtain the pruned residual output channels; mapping the remaining output channels after pruning to the original positions of the hardware acceleration potential layer to obtain a pruned neural network model.
In the implementation process, the output channels of the hardware acceleration potential layers in the pre-trained neural network model are sorted by importance, the sorting result is split according to the sparsity, channels of low importance are discarded, and the remaining channels are mapped back to their original positions to obtain the pruned neural network model. This improves pruning efficiency and ensures model accuracy and the hardware inference speed-up ratio to the greatest extent, thereby improving the model compression and optimization capability.
Optionally, the number of remaining output channels after pruning is an integer multiple of the byte length.
In the implementation process, when channels are pruned, the number of remaining output channels of each layer of the pruned neural network model is required to be an integer multiple of the byte length: under a given sparsity, the corresponding proportion of output channels is set to zero while the number of remaining channels of each layer is kept a multiple of the byte length, which improves the efficiency of hardware read/write processing.
Optionally, the retraining the pruned neural network model to obtain a target neural network model includes: retraining the pruned neural network model to obtain an initial recovery precision model; in the retraining process, comparing and judging the relative error of the initial recovery precision model and the full-precision neural network model with a preset error index value; and if the relative error meets the preset error index value, stopping retraining, and determining the current initial recovery precision model as a target neural network model.
In the implementation process, the output channels of the hardware acceleration potential layers in the pre-trained network model are sorted by importance, channel pruning is performed according to that importance, and the pruned model is finally retrained, with full-data fine-tuning applied according to the accuracy requirement, so that the accuracy loss caused by pruning is reduced and accuracy is recovered. Under the same sparsity, or the same post-pruning computation, model accuracy and the hardware inference speed-up ratio are ensured to the greatest extent.
In a second aspect, an embodiment of the present application provides a neural network pruning device, where the device includes: the pre-training module is used for pre-training the neural network model to be pruned to obtain a full-precision neural network model; the topology analysis module is used for carrying out topology analysis on the convolution layer and/or the full connection layer in the full-precision neural network model to obtain a hardware acceleration potential layer; the pruning module is used for carrying out importance sequencing on a plurality of output channels in the hardware acceleration potential layer, and carrying out channel pruning according to the sequencing result to obtain a pruned neural network model; and the retraining module is used for retraining the pruned neural network model to obtain a target neural network model.
In a third aspect, embodiments of the present application further provide an electronic device, including: a processor, a memory storing machine-readable instructions executable by the processor, which when executed by the processor perform the steps of the method described above when the electronic device is run.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method described above.
In order to make the above objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a neural network pruning method provided in an embodiment of the present application;
fig. 2 is a schematic functional block diagram of a neural network pruning device according to an embodiment of the present application;
fig. 3 is a schematic block diagram of an electronic device provided with a neural network pruning device according to an embodiment of the present application.
Reference numerals: 210 - pre-training module; 220 - topology analysis module; 230 - pruning module; 240 - retraining module; 300 - electronic device; 311 - memory; 312 - storage controller; 313 - processor; 314 - peripheral interface; 315 - input-output unit; 316 - display unit.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. The terms "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Before describing the embodiments of the present application, a brief description will be first made of several technical concepts involved.
Model pruning: a model compression method that introduces sparsity into the dense connections of a deep neural network and reduces the number of non-zero weights by directly zeroing unimportant weights. It generally consists of four steps: pre-training a model, evaluating the importance of its parts, pruning the unimportant parts of the network, and retraining and fine-tuning the model to recover accuracy. That is: 1. analyze the importance of each neuron in the pre-trained model; 2. remove the neurons with low activation during model inference; 3. fine-tune the model to improve the accuracy of the pruned model; 4. test the pruned model to determine whether it meets the requirements.
Pruning taxonomy: common model pruning algorithms mainly fall into two categories, unstructured pruning (pruning each individual weight separately) and structured pruning (pruning regular groups of weights). The special data formats and the additional encoding/decoding required by the earlier unstructured pruning introduce extra hardware overhead. At present, structured pruning is further subdivided by pruning granularity, mainly into filter-wise (convolution kernel) pruning, channel-wise (channel) pruning, shape-wise pruning and block-wise pruning; channel pruning algorithms are the ones commonly used in hardware accelerators.
Channel pruning: C channels of the input to layer B are removed directly, and the corresponding channels of every convolution kernel of that layer are removed with them so that the convolution still matches; this is the effect of channel pruning on the layer itself. It also affects the preceding layer: in convolution, each group of convolution kernels of the preceding layer corresponds to one of its output channels, that is, to one input channel of the next layer, so when C channels are removed from layer B, the corresponding convolution kernels in the preceding layer must also be pruned. Convolution kernel pruning, in contrast, removes convolution kernels directly, which reduces the number of output channels of that layer, i.e., has the same effect as removing C channels from layer B; and since the input to layer B then has fewer channels, the number of channels in layer B's convolution kernels is reduced accordingly. Channel pruning and convolution kernel pruning are thus similar; the two differ only in the dependency, or starting point, of the pruning.
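To make the coupling between adjacent layers concrete, the following is a minimal PyTorch sketch (illustrative only, not the patented method): pruning half of the output channels of one convolution forces the matching input channels of the next convolution to be removed as well. The layer sizes and the kept-channel indices are arbitrary assumptions.

import torch
import torch.nn as nn

conv_b = nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3, padding=1)
conv_next = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)

keep = [0, 1, 2, 3, 4, 5, 6, 7]  # indices of conv_b's output channels to keep

# Prune output channels of conv_b: drop the corresponding rows of its weight and bias.
pruned_b = nn.Conv2d(8, len(keep), kernel_size=3, padding=1)
pruned_b.weight.data = conv_b.weight.data[keep].clone()
pruned_b.bias.data = conv_b.bias.data[keep].clone()

# The next layer must drop the matching *input* channels (dim 1 of its weight).
pruned_next = nn.Conv2d(len(keep), 32, kernel_size=3, padding=1)
pruned_next.weight.data = conv_next.weight.data[:, keep].clone()
pruned_next.bias.data = conv_next.bias.data.clone()

x = torch.randn(1, 8, 32, 32)
y = pruned_next(pruned_b(x))  # shapes stay consistent after pruning
print(y.shape)                # torch.Size([1, 32, 32, 32])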
The inventor of the present application notes that model pruning requires no special hardware support and is therefore widely used in the field of model acceleration. To keep model accuracy from being affected, pruning usually has to be performed together with model training: during training, the output channels of certain layers of the model are gradually ranked according to an importance metric, channels of low importance are pruned, and the impact of pruning on final accuracy is compensated by gradient updates of the training loss function. However, these pruning techniques generally measure the degree of pruning by the reduction in model computation or parameter count as a proxy for hardware inference time in actual deployment. Because computation and hardware latency do not correspond exactly, it often happens that computation decreases but the actual hardware speed-up is less than expected, mainly for two reasons: 1. the pruned model often ends up with channel counts that are unfriendly to hardware reads and writes, for example 15 remaining channels; 2. the pruned layer may belong to a local topology that some specialized hardware cannot optimize. In view of this, the embodiments of the present application provide the neural network pruning method described below.
Referring to fig. 1, fig. 1 is a flowchart of a neural network pruning method according to an embodiment of the present application. The embodiments of the present application are explained in detail below. The method comprises the following steps: step 100, step 120, step 140 and step 160.
Step 100: pre-training the neural network model to be pruned to obtain a full-precision neural network model;
step 120: performing topology analysis on a convolution layer and/or a full connection layer in the full-precision neural network model to obtain a hardware acceleration potential layer;
step 140: carrying out importance sorting on a plurality of output channels in the hardware acceleration potential layer, and carrying out channel pruning according to sorting results to obtain a pruned neural network model;
step 160: retraining the pruned neural network model to obtain a target neural network model.
Illustratively, the full-precision neural network model may be any conventional neural network model such as a deep neural network (DNN), a recurrent neural network (RNN) or a convolutional neural network (CNN), trained on training samples to obtain a high-precision, full-precision pre-trained model. The network model may consist of a series of neural network layers such as an input layer, an output layer and several hidden layers, which can be roughly divided into convolution layers, fully connected layers, batch normalization layers, pooling layers, ReLU (activation function) layers, Softmax layers, concat (concatenation) layers, and so on. A hardware acceleration potential layer is a layer with hardware acceleration potential: based on the characteristics of the hardware, it is analyzed which neural network layers (convolution layers or fully connected layers) would still exhibit hardware acceleration after pruning, typically by considering the hardware's operator specification together with the local topology of the current network, so that subsequent pruning targets only layers with acceleration potential and layers without acceleration potential are not pruned.
Optionally, so that the convolution layers and fully connected layers that need pruning are sufficiently trained, a full-precision neural network model, such as a common convolutional neural network like ResNet (residual network) or MobileNet, is first trained in advance. By feeding in a large amount of training data, the parameters of every layer of the full-precision neural network model can be determined; on the one hand, the trained model parameters (for example, the weight parameters) support the subsequent importance analysis for pruning, and on the other hand, they serve as a baseline for pruning. Topology analysis is then performed on the trained full-precision neural network model, and layers with hardware acceleration potential are selected; for example, in a common ResNet residual structure, the layer before the addition has no acceleration potential. Because channel pruning generally degrades the accuracy of the original full-precision model, and the target neural network model is expected to be obtained with no accuracy loss or with a controllable loss, the output channels within the selected layers must be sorted by importance and the relatively "unimportant" output channels discarded or removed; pruning the trained full-precision neural network model in this way yields the pruned neural network model. The pruned neural network model is then retrained to recover accuracy and reduce the accuracy loss, thereby improving the accuracy of the pruned model.
By training a complete full-precision neural network model in advance, performing topology analysis on the model to select potential layers that can be accelerated by hardware, sorting the output channels of those potential layers by importance, pruning channels according to that importance, and finally retraining the pruned model to recover accuracy, redundancy is reduced and computation is saved, while model accuracy and the hardware inference speed-up ratio are ensured to the greatest extent, thereby improving the model compression and optimization capability.
In one embodiment, step 120 may include: step 121, step 122 and step 123.
Step 121: judging the type of an output layer of a convolution layer and/or a full connection layer in the full-precision neural network model based on the local topological structure of the full-precision neural network model;
step 122: if the type of the output layer is judged to be a convolution layer, an activation function layer or a concat layer, determining the current convolution layer and/or the current full connection layer as a hardware acceleration potential layer;
step 123: if the type of the output layer is judged to be an addition layer or multiplication layer, the current convolution layer and/or the current full connection layer are/is determined to be a non-hardware acceleration potential layer.
Illustratively, the local topology may be described by the number of network layers of the trained full-precision neural network model, the number of neurons in each layer, and the interconnections between the neurons of each layer. At the level of topology, a neural network is divided into an input layer, one or more hidden layers and an output layer, connected in sequence; the hidden layers are responsible for information processing and transformation within the network and are designed as one or more layers according to the required transformation. Optionally, these neural network layers can be further divided into layers with weights, such as convolution layers, fully connected layers and batch normalization layers, and layers without weights, such as pooling layers, ReLU layers, Softmax layers and concat layers.
The topology formed by the convolution layers, fully connected layers, batch normalization layers, pooling layers, ReLU layers, Softmax layers, concat layers and other network layers of the trained full-precision neural network model is analyzed, and the type of the output layer connected after each convolution layer and/or fully connected layer is determined. If the output of a fully connected layer and/or convolution layer is connected to an activation function layer, a convolution layer or a concat layer, the current fully connected layer and/or convolution layer is considered to have hardware acceleration potential and subsequent channel pruning is performed on it; if the output is connected to an addition layer or a multiplication layer, the current fully connected layer and/or convolution layer is considered to have no hardware acceleration potential and is not pruned.
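As a minimal sketch of such a topology check, assuming a PyTorch model traceable with torch.fx, the function below keeps a convolution or fully connected layer only if none of its consumers is an elementwise addition or multiplication (for example, a residual addition). The module types considered, the function name hardware_potential_layers and the example Block are illustrative assumptions, not the patent's implementation.

import operator
import torch
import torch.fx as fx
import torch.nn as nn

def hardware_potential_layers(model: nn.Module):
    # Trace the model into a graph and keep conv/linear layers whose consumers
    # are not elementwise add/mul nodes (e.g. residual additions).
    traced = fx.symbolic_trace(model)
    modules = dict(traced.named_modules())
    blocked_fns = {operator.add, torch.add, operator.mul, torch.mul}
    blocked_methods = {"add", "add_", "mul", "mul_"}

    selected = []
    for node in traced.graph.nodes:
        if node.op != "call_module":
            continue
        if not isinstance(modules[node.target], (nn.Conv2d, nn.Linear)):
            continue
        feeds_add_or_mul = any(
            (user.op == "call_function" and user.target in blocked_fns)
            or (user.op == "call_method" and user.target in blocked_methods)
            for user in node.users
        )
        if not feeds_add_or_mul:
            selected.append(node.target)
    return selected

class Block(nn.Module):
    # Tiny example: conv1 feeds a ReLU (potential layer), while conv2 feeds a
    # residual addition (no acceleration potential under the rule above).
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(16, 16, 3, padding=1)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(16, 16, 3, padding=1)

    def forward(self, x):
        y = self.relu(self.conv1(x))
        return self.conv2(y) + x

print(hardware_potential_layers(Block()))  # ['conv1']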
By analyzing the topology of the pre-trained neural network model, the convolution layers and/or fully connected layers are divided into hardware acceleration potential layers that can be accelerated and non-potential layers that cannot, which facilitates subsequent targeted pruning, improves pruning efficiency, and ensures model accuracy and the hardware inference speed-up ratio to the greatest extent, thereby improving the model compression and optimization capability.
In one embodiment, step 140 may include: step 141 and step 142.
Step 141: according to the magnitude of the weight amplitude on the output channel, importance sorting is carried out on a plurality of output channels in the hardware acceleration potential layer, and a sorting result is obtained;
step 142: and setting a global sparsity rate according to the sequencing result, and performing channel pruning on the hardware acceleration potential layer to obtain a pruned neural network model.
Illustratively, the global sparsity is a manually set global value chosen according to how many channels need to be pruned in the actual scenario. Because a neural network has many layers, it is difficult to hand-pick an optimal sparsity for each layer, and setting the same sparsity for every layer would severely degrade accuracy: the deep layers of a network tend to have many channels and high redundancy, need more pruning, and should therefore be allocated a larger sparsity. The weight magnitude is used to measure importance in the output-channel dimension: to identify weights whose removal does not seriously affect model performance, note that during the optimization some weights are updated to magnitudes (positive or negative) larger than others, and these can be regarded as the "more important" weights.
The weights of the convolution layers and/or fully connected layers among the hardware acceleration potential layers are sorted by magnitude in the output-channel dimension to obtain an importance ranking; the weight magnitudes of each potential layer are examined, and the important and unimportant output channels are identified from the ranking. The ranked output channels are then pruned according to the set global sparsity, where model sparsity can be realized by setting the weights of the pruned output channels to 0, yielding the pruned neural network model.
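A minimal sketch of this step, assuming PyTorch: channels from all potential layers are scored by an L1 weight magnitude, pooled into one global ranking, and the lowest-scoring fraction (the global sparsity) is zeroed out. The L1 score, the in-place zeroing and the helper name global_channel_prune are illustrative choices; the text only specifies ranking by weight magnitude and pruning by a global sparsity.

import torch
import torch.nn as nn

def global_channel_prune(model: nn.Module, layer_names, sparsity: float):
    # Score every output channel of every potential layer by its L1 weight
    # magnitude, pool the scores globally, and zero out the lowest fraction.
    modules = dict(model.named_modules())
    scores = []  # (score, layer_name, channel_index)
    for name in layer_names:
        w = modules[name].weight.data            # conv: [out, in, k, k]; linear: [out, in]
        per_channel = w.abs().flatten(1).sum(dim=1)
        for idx, s in enumerate(per_channel.tolist()):
            scores.append((s, name, idx))

    scores.sort(key=lambda t: t[0])              # ascending: least important first
    n_prune = int(len(scores) * sparsity)
    for _, name, idx in scores[:n_prune]:
        modules[name].weight.data[idx].zero_()   # zero the pruned output channel
        if modules[name].bias is not None:
            modules[name].bias.data[idx] = 0.0
    return model

# Example: prune half of the pooled output channels of two convolution layers.
model = nn.Sequential(nn.Conv2d(3, 32, 3), nn.ReLU(), nn.Conv2d(32, 64, 3))
global_channel_prune(model, ["0", "2"], sparsity=0.5)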
The output channels of the hardware acceleration potential layers in the pre-trained neural network model are sorted by importance, with importance evaluated by weight magnitude in the channel dimension; the proportion of channels to keep is determined by the global sparsity and channel pruning is performed accordingly to obtain the pruned model. This avoids pruning channels of high importance, reduces accuracy loss, and ensures model accuracy and the hardware inference speed-up ratio to the greatest extent, thereby improving the model compression and optimization capability.
In one embodiment, output channels with larger weight magnitudes are ranked earlier in the sorting result.
Illustratively, the weights of the convolution layers and/or fully connected layers in the hardware acceleration potential layers are sorted by magnitude in the output-channel dimension to obtain an importance ranking. The sorting can be in ascending or descending order: in ascending order, output channels with larger weight magnitudes appear later; in descending order, they appear earlier. Optionally, the weight magnitudes of each hardware acceleration potential layer are examined to find the "unimportant" weights, which can be done as follows: (1) sort the weight magnitudes in descending order; (2) depending on what percentage of weights needs to be pruned, take the magnitudes that appear toward the end of the queue (corresponding to weights with smaller magnitudes), which are the relatively "unimportant" ones; (3) equivalently, set a threshold (corresponding to the set global sparsity): weights whose magnitudes are above the threshold can be regarded as important, and weights below the threshold as unimportant.
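A small sketch of the threshold view in item (3), assuming PyTorch: instead of slicing a sorted list, a magnitude cutoff can be derived from the global sparsity with a quantile, and channels below the cutoff are treated as prunable. The scores below are made-up numbers.

import torch

# Per-channel magnitude scores for eight output channels (illustrative values).
channel_scores = torch.tensor([0.12, 3.4, 0.8, 2.1, 0.05, 1.7, 0.9, 2.8])
sparsity = 0.5

threshold = torch.quantile(channel_scores, sparsity)  # cutoff at the sparsity quantile
prune_mask = channel_scores < threshold               # True marks an "unimportant" channel
print(prune_mask.tolist())                            # half of the channels are marked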
The output channels of the hardware acceleration potential layers in the pre-trained neural network model are sorted by importance, in ascending or descending order of weight magnitude in the channel dimension, so that channels of high importance can be identified quickly, pruning efficiency is improved, and model accuracy and the hardware inference speed-up ratio are ensured to the greatest extent, thereby improving the model compression and optimization capability.
In one embodiment, step 142 may include: step 1421 and step 1422.
Step 1421: pruning is carried out on the output channels corresponding to the percentage in the sequencing result according to the percentage of the global sparsity, and the remaining output channels after pruning are obtained;
step 1422: mapping the remaining output channels after pruning to the original positions of the hardware acceleration potential layer to obtain a pruned neural network model.
Illustratively, a global sparsity is set and allocated to each layer according to the channel importance ranking result. Sparsity is allocated only to the hardware acceleration potential layers identified above; the allocation principle is to rank the output channels of all potential layers together in the channel dimension by importance and to make the cut on that ranking according to the set global sparsity. For example, a sparsity of 50% means discarding the least important half of the output channels, and 30% means discarding the least important 30%. If the sparsity is 50% and the ranking is in descending order, only the first half of the output channels are retained and then mapped back to their layers. In other words, the channels of all hardware acceleration potential layers are pooled and ranked together, the channels whose weight magnitudes rank at the back are pruned according to the manually set sparsity, and the remaining unpruned channels are mapped back to their original positions, giving the pruned neural network model.
The output channels of the hardware acceleration potential layers in the pre-trained neural network model are sorted by importance, the sorting result is split according to the sparsity, channels of low importance are discarded, and the remaining channels are mapped back to their original positions to obtain the pruned neural network model, which improves pruning efficiency and ensures model accuracy and the hardware inference speed-up ratio to the greatest extent, thereby improving the model compression and optimization capability.
This channel pruning approach is simple, fast, easy to generalize and compatible with preserving accuracy, and it ensures model accuracy and the hardware inference speed-up ratio to the greatest extent, thereby improving the model compression and optimization capability.
In one embodiment, the number of remaining output channels after pruning is an integer multiple of the byte length.
For example, the byte length of a hardware read/write is typically 16 bytes, so when an actual sparsity is assigned to each layer, the number of channels obtained from the original channel count and the sparsity may be required to be a multiple of 16; if it is not, the set sparsity value can be adjusted so that the number of remaining channels becomes a multiple of 16. Because the amount of data that hardware reads and writes is usually a multiple of 16 bytes, when channels are pruned the number of remaining output channels of the pruned neural network model is required to be a multiple of 16: under a given sparsity, the corresponding proportion of output channels is set to zero while the number of remaining channels of each layer is kept a multiple of 16, which improves the efficiency of hardware read/write processing.
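A minimal sketch of this alignment rule, assuming 16-byte-aligned hardware reads; rounding the kept-channel count down (while keeping at least one aligned group) is one possible convention, and the helper name is illustrative.

def aligned_keep_channels(num_channels: int, sparsity: float, align: int = 16) -> int:
    # Channels that would survive the requested sparsity, snapped down to a
    # multiple of `align` (but never below one aligned group).
    keep = int(num_channels * (1.0 - sparsity))
    return max(align, (keep // align) * align)

# 128 channels at 45% sparsity would keep 70; snapping gives 64, a multiple of 16.
print(aligned_keep_channels(128, 0.45))  # 64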
In one embodiment, step 160 may include: step 161, step 162 and step 163.
Step 161: retraining the pruned neural network model to obtain an initial recovery precision model;
step 162: in the retraining process, comparing and judging the relative error of the initial recovery precision model and the full-precision neural network model with a preset error index value;
step 163: and if the relative error meets the preset error index value, stopping retraining, and determining the current initial recovery precision model as a target neural network model.
Illustratively, the preset error index value may be any value such that the relative error between the initial recovered-precision model and the full-precision neural network model is within 1%. In the pruned neural network model, most channels of the pruned convolution layers and/or fully connected layers are all zeros; the pruned model is fine-tuned on the full data set according to the accuracy requirement, so as to reduce the accuracy loss caused by pruning and recover accuracy. In general, removing the "unimportant" weight channels reduces network accuracy, but it also increases network sparsity and reduces overfitting, so accuracy improves again after fine-tuning. The pruned neural network model can be retrained with the pre-training data; during retraining, fine-tuning is repeated until the required accuracy is recovered, judged by comparison with the error metric of the original full-precision neural network model, and when accuracy is recovered to within 1 point of the original model the requirement is met.
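A minimal sketch of the retraining stop criterion, assuming PyTorch and a user-supplied evaluate() helper that returns accuracy; the optimizer, learning rate and epoch budget are illustrative, and only the stop rule (relative error within 1% of the full-precision baseline) follows the text. If pruned channels are merely zeroed rather than structurally removed, a pruning mask would additionally have to be re-applied after each update (not shown).

import torch

def finetune_until_recovered(pruned_model, baseline_acc, train_loader, evaluate,
                             max_rel_error=0.01, max_epochs=50, lr=1e-3):
    # Standard fine-tuning loop; stop as soon as accuracy is back within
    # max_rel_error of the full-precision baseline.
    optimizer = torch.optim.SGD(pruned_model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        pruned_model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(pruned_model(images), labels)
            loss.backward()
            optimizer.step()
        acc = evaluate(pruned_model)
        rel_error = (baseline_acc - acc) / baseline_acc
        if rel_error <= max_rel_error:   # accuracy recovered to within 1% of the baseline
            break
    return pruned_model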
The output channels of the hardware acceleration potential layers in the pre-trained network model are sorted by importance, channel pruning is performed according to that importance, and the pruned model is finally retrained, with full-data fine-tuning applied according to the accuracy requirement, so that the accuracy loss caused by pruning is reduced and accuracy is recovered; under the same sparsity, or the same post-pruning computation, model accuracy and the hardware inference speed-up ratio are ensured to the greatest extent.
Referring to fig. 2, fig. 2 is a schematic diagram of functional modules of a neural network pruning device according to an embodiment of the present application. The device comprises: a pre-training module 210, a topology analysis module 220, a pruning module 230, and a retraining module 240.
The pre-training module 210 is configured to pre-train the neural network model to be pruned to obtain a full-precision neural network model;
the topology analysis module 220 is configured to perform topology analysis on the convolution layer and/or the full connection layer in the full-precision neural network model, so as to obtain a hardware acceleration potential layer;
the pruning module 230 is configured to sort importance of the multiple output channels in the hardware acceleration potential layer, and prune the channels according to the sorting result to obtain a pruned neural network model;
And the retraining module 240 is configured to retrain the pruned neural network model to obtain a target neural network model.
Alternatively, the topology analysis module 220 can be configured to:
judging the type of an output layer of a convolution layer and/or a full connection layer in the full-precision neural network model based on the local topological structure of the full-precision neural network model;
if the type of the output layer is judged to be a convolution layer, an activation function layer or a concat layer, determining the current convolution layer and/or the current full connection layer as a hardware acceleration potential layer;
and if the type of the output layer is judged to be an addition layer or multiplication layer, determining the current convolution layer and/or the current full connection layer as a non-hardware acceleration potential layer.
Alternatively, pruning module 230 may be configured to:
according to the magnitude of the weight amplitude on the output channel, importance sorting is carried out on a plurality of output channels in the hardware acceleration potential layer, and a sorting result is obtained;
and setting a global sparsity rate according to the sequencing result, and carrying out channel pruning on the hardware acceleration potential layer to obtain a pruned neural network model.
Optionally, output channels with larger weight magnitudes are ranked earlier in the sorting result.
Alternatively, pruning module 230 may be configured to:
pruning the output channels corresponding to the percentage in the sequencing result according to the percentage of the global sparsity to obtain the pruned residual output channels;
mapping the remaining output channels after pruning to the original positions of the hardware acceleration potential layer to obtain a pruned neural network model.
Optionally, the number of remaining output channels after pruning is an integer multiple of the byte length.
Alternatively, the retraining module 240 may be configured to:
retraining the pruned neural network model to obtain an initial recovery precision model;
in the retraining process, comparing and judging the relative error of the initial recovery precision model and the full-precision neural network model with a preset error index value;
and if the relative error meets the preset error index value, stopping retraining, and determining the current initial recovery precision model as a target neural network model.
Referring to fig. 3, fig. 3 is a block schematic diagram of an electronic device. The electronic device 300 may include a memory 311, a memory controller 312, a processor 313, a peripheral interface 314, an input output unit 315, a display unit 316. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 3 is merely illustrative and is not intended to limit the configuration of the electronic device 300. For example, electronic device 300 may also include more or fewer components than shown in FIG. 3, or have a different configuration than shown in FIG. 3.
The above-mentioned memory 311, memory controller 312, processor 313, peripheral interface 314, input/output unit 315, and display unit 316 are electrically connected directly or indirectly to each other to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The processor 313 is used to execute executable modules stored in the memory.
The Memory 311 may be, but is not limited to, a random access Memory (Random Access Memory, RAM), a Read Only Memory (ROM), a programmable Read Only Memory (Programmable Read-Only Memory, PROM), an erasable Read Only Memory (Erasable Programmable Read-Only Memory, EPROM), an electrically erasable Read Only Memory (Electric Erasable Programmable Read-Only Memory, EEPROM), etc. The memory 311 is configured to store a program, and the processor 313 executes the program after receiving an execution instruction, and a method executed by the electronic device 300 defined by the process disclosed in any embodiment of the present application may be applied to the processor 313 or implemented by the processor 313.
The processor 313 may be an integrated circuit chip having signal processing capabilities. The processor 313 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (digital signal processor, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field Programmable Gate Arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The peripheral interface 314 couples various input/output devices to the processor 313 and the memory 311. In some embodiments, the peripheral interface 314, the processor 313, and the memory controller 312 may be implemented in a single chip. In other examples, they may be implemented by separate chips.
The input/output unit 315 is used for a user to provide input data. The input/output unit 315 may be, but is not limited to, a mouse, a keyboard, and the like.
The display unit 316 provides an interactive interface (for example, a user interface) between the electronic device 300 and a user. In this embodiment, the display unit 316 may be a liquid crystal display or a touch display, which may display the process of the program being executed by the processor.
The electronic device 300 in the present embodiment may be used to perform each step in each method provided in the embodiments of the present application.
Furthermore, the embodiments of the present application also provide a computer readable storage medium, on which a computer program is stored, which when being executed by a processor performs the steps in the above-described method embodiments.
The computer program product of the above method provided in the embodiments of the present application includes a computer readable storage medium storing a program code, where instructions included in the program code may be used to perform steps in the above method embodiment, and specifically, reference may be made to the above method embodiment, which is not described herein.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, and the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form. The functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
It should be noted that the functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM) random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of pruning a neural network, the method comprising:
pre-training the neural network model to be pruned to obtain a full-precision neural network model;
performing topology analysis on a convolution layer and/or a full connection layer in the full-precision neural network model to obtain a hardware acceleration potential layer;
importance sorting is carried out on a plurality of output channels in the hardware acceleration potential layer, channel pruning is carried out according to sorting results, and a pruned neural network model is obtained;
and retraining the pruned neural network model to obtain a target neural network model.
2. The method according to claim 1, wherein performing topology analysis on the convolution layer and/or the full connection layer in the full-precision neural network model to obtain a hardware acceleration potential layer comprises:
Judging the type of an output layer of a convolution layer and/or a full connection layer in the full-precision neural network model based on the local topological structure of the full-precision neural network model;
if the type of the output layer is judged to be a convolution layer, an activation function layer or a concat layer, determining the current convolution layer and/or the current full connection layer as a hardware acceleration potential layer;
and if the type of the output layer is judged to be an addition layer or multiplication layer, determining the current convolution layer and/or the current full connection layer as a non-hardware acceleration potential layer.
3. The method of claim 1, wherein ranking the importance of the plurality of output channels in the hardware acceleration potential layer and performing channel pruning according to the ranking result to obtain a pruned neural network model comprises:
according to the magnitude of the weight amplitude on the output channel, importance sorting is carried out on a plurality of output channels in the hardware acceleration potential layer, and a sorting result is obtained;
and setting a global sparsity rate according to the sequencing result, and carrying out channel pruning on the hardware acceleration potential layer to obtain a pruned neural network model.
4. A method according to claim 3, wherein the output channels with greater weight magnitudes in the ranking result are arranged in a more forward position.
5. The method of claim 3, wherein the setting a global sparsity to perform channel pruning on the hardware acceleration potential layer according to the ranking result, to obtain a pruned neural network model, includes:
pruning the output channels corresponding to the percentage in the sequencing result according to the percentage of the global sparsity to obtain the pruned residual output channels;
mapping the remaining output channels after pruning to the original positions of the hardware acceleration potential layer to obtain a pruned neural network model.
6. The method of claim 5, wherein the number of remaining output channels after pruning is proportional to byte length.
7. The method of claim 1, wherein retraining the pruned neural network model to obtain a target neural network model comprises:
retraining the pruned neural network model to obtain an initial recovery precision model;
in the retraining process, comparing and judging the relative error of the initial recovery precision model and the full-precision neural network model with a preset error index value;
And if the relative error meets the preset error index value, stopping retraining, and determining the current initial recovery precision model as a target neural network model.
8. A neural network pruning device, the device comprising:
the pre-training module is used for pre-training the neural network model to be pruned to obtain a full-precision neural network model;
the topology analysis module is used for carrying out topology analysis on the convolution layer and/or the full connection layer in the full-precision neural network model to obtain a hardware acceleration potential layer;
the pruning module is used for carrying out importance sequencing on a plurality of output channels in the hardware acceleration potential layer, and carrying out channel pruning according to the sequencing result to obtain a pruned neural network model;
and the retraining module is used for retraining the pruned neural network model to obtain a target neural network model.
9. An electronic device, comprising: a processor, a memory storing machine-readable instructions executable by the processor, which when executed by the processor perform the steps of the method of any of claims 1 to 7 when the electronic device is run.
10. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of claims 1 to 7.
CN202310546665.1A 2023-05-16 2023-05-16 Neural network pruning method, device, equipment and storage medium Active CN116306888B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310546665.1A CN116306888B (en) 2023-05-16 2023-05-16 Neural network pruning method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310546665.1A CN116306888B (en) 2023-05-16 2023-05-16 Neural network pruning method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116306888A true CN116306888A (en) 2023-06-23
CN116306888B CN116306888B (en) 2023-08-11

Family

ID=86799884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310546665.1A Active CN116306888B (en) 2023-05-16 2023-05-16 Neural network pruning method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116306888B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894189A (en) * 2023-09-11 2023-10-17 中移(苏州)软件技术有限公司 Model training method, device, equipment and readable storage medium
CN117058525A (en) * 2023-10-08 2023-11-14 之江实验室 Model training method and device, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561054A (en) * 2020-12-03 2021-03-26 中国科学院光电技术研究所 Neural network filter pruning method based on batch characteristic heat map
CN113962388A (en) * 2021-11-24 2022-01-21 贵州电网有限责任公司 Neural network channel pruning method based on hardware accelerated sensing
CN114037844A (en) * 2021-11-18 2022-02-11 西安电子科技大学 Global rank perception neural network model compression method based on filter characteristic diagram
US20220058478A1 (en) * 2020-08-24 2022-02-24 City University Of Hong Kong Artificial neural network configuration and deployment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220058478A1 (en) * 2020-08-24 2022-02-24 City University Of Hong Kong Artificial neural network configuration and deployment
CN112561054A (en) * 2020-12-03 2021-03-26 中国科学院光电技术研究所 Neural network filter pruning method based on batch characteristic heat map
CN114037844A (en) * 2021-11-18 2022-02-11 西安电子科技大学 Global rank perception neural network model compression method based on filter characteristic diagram
CN113962388A (en) * 2021-11-24 2022-01-21 贵州电网有限责任公司 Neural network channel pruning method based on hardware accelerated sensing

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894189A (en) * 2023-09-11 2023-10-17 中移(苏州)软件技术有限公司 Model training method, device, equipment and readable storage medium
CN116894189B (en) * 2023-09-11 2024-01-05 中移(苏州)软件技术有限公司 Model training method, device, equipment and readable storage medium
CN117058525A (en) * 2023-10-08 2023-11-14 之江实验室 Model training method and device, storage medium and electronic equipment
CN117058525B (en) * 2023-10-08 2024-02-06 之江实验室 Model training method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN116306888B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
CN116306888B (en) Neural network pruning method, device, equipment and storage medium
CN113792825B (en) Fault classification model training method and device for electricity information acquisition equipment
CN110750640B (en) Text data classification method and device based on neural network model and storage medium
CN111898366B (en) Document subject word aggregation method and device, computer equipment and readable storage medium
WO2023024407A1 (en) Model pruning method and apparatus based on adjacent convolutions, and storage medium
CN112994701B (en) Data compression method, device, electronic equipment and computer readable medium
CN109740660A (en) Image processing method and device
CN111581949A (en) Method and device for disambiguating name of learner, storage medium and terminal
CN110019822B (en) Few-sample relation classification method and system
Taspinar et al. Prediction of computer type using benchmark scores of hardware units
CN115801463A (en) Industrial Internet platform intrusion detection method and device and electronic equipment
CN114492767A (en) Method, apparatus and storage medium for searching neural network
CN111159481A (en) Edge prediction method and device of graph data and terminal equipment
CN111815209A (en) Data dimension reduction method and device applied to wind control model
Marszałek Performance test on triple heap sort algorithm
CN114121296B (en) Data-driven clinical information rule extraction method, storage medium and equipment
WO2023082698A1 (en) Public satisfaction analysis method, storage medium, and electronic device
CN114936204A (en) Feature screening method and device, storage medium and electronic equipment
CN114169731A (en) Scientific research institution rating system, method, equipment and storage medium
CN110265151B (en) Learning method based on heterogeneous temporal data in EHR
CN114861800A (en) Model training method, probability determination method, device, equipment, medium and product
CN115374775A (en) Method, device and equipment for determining text similarity and storage medium
CN111753920B (en) Feature construction method and device, computer equipment and storage medium
Putri et al. The Performance of the Equal-Width and Equal-Frequency Discretization Methods on Data Features in Classification Process
Sinaga et al. Performance of distance-based k-nearest neighbor classification method using local mean vector and harmonic distance

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant