CN113205158A - Pruning and quantization processing method, device, equipment and storage medium for a network model - Google Patents

Pruning and quantization processing method, device, equipment and storage medium for a network model

Info

Publication number
CN113205158A
CN113205158A
Authority
CN
China
Prior art keywords
pruning
neural network
convolutional neural
channel
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110598683.5A
Other languages
Chinese (zh)
Inventor
詹雁
潘柳华
徐麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN202110598683.5A priority Critical patent/CN113205158A/en
Publication of CN113205158A publication Critical patent/CN113205158A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2133Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on naturality criteria, e.g. with non-negative factorisation or negative correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a pruning and quantization processing method, device, equipment and storage medium for a network model, where the method includes the following steps: pruning a convolutional neural network based on a channel attention mechanism and a weight attention mechanism; performing secondary pruning on the pruned convolutional neural network; and performing a quantization operation on the convolutional neural network after the secondary pruning. Through this processing, parameters that contribute little to the convolutional neural network model can be eliminated, ensuring the response speed of the network model while achieving, with a smaller number of parameters, an accuracy close to that of the original model.

Description

Pruning and quantization processing method, device, equipment and storage medium for a network model
Technical Field
The embodiments of the application relate to the field of network model processing, and in particular to a pruning and quantization processing method, device, equipment and storage medium for a network model.
Background
The law enforcement recorder is a portable device integrating real-time recording, photographing, video capture and other functions. It can be widely applied to field law enforcement and other tasks of law enforcement units such as the police (for example, traffic police, public security, fire fighting and criminal investigation), transportation, city management and the judiciary, and plays an important role in real-time recording of law enforcement processes, abnormal event detection, and backtracking and evidence collection for special events. However, how a mobile terminal device with limited computational power can respond to abnormal events quickly and provide rapid feedback is a significant problem to be solved.
Disclosure of Invention
The application provides a pruning and quantization processing method, device, equipment and storage medium for a network model, which can eliminate parameters that contribute little to a convolutional neural network model and ensure the response speed of the network model while achieving, with a smaller number of parameters, an accuracy close to that of the original model.
In a first aspect, an embodiment of the present application provides a pruning and quantization processing method for a network model, where the method includes:
pruning a convolutional neural network based on a channel attention mechanism and a weight attention mechanism;
performing secondary pruning on the pruned convolutional neural network;
and performing a quantization operation on the convolutional neural network after the secondary pruning.
Optionally, pruning the convolutional neural network based on the channel attention mechanism and the weight attention mechanism includes:
pruning the convolutional neural network based on a channel attention mechanism to obtain a first probability matrix;
pruning the convolutional neural network based on a weight attention mechanism and an input image to obtain a second probability matrix;
pruning the convolutional neural network according to the first probability matrix and the second probability matrix.
Optionally, pruning the convolutional neural network based on the channel attention mechanism to obtain a first probability matrix includes:
performing a dimension reduction transformation on the channels of the convolutional neural network to obtain channel weights;
multiplying each channel by its corresponding channel weight to obtain an attention matrix corresponding to the channel;
and acquiring a first probability matrix using a first function and the attention matrix.
Optionally, pruning the convolutional neural network based on the weight attention mechanism and the input image to obtain a second probability matrix includes:
performing linear addition on a feature image output by a current convolutional layer in the convolutional neural network and a feature image output by the previous convolutional layer to obtain a feature image matrix;
performing nonlinear processing on the feature image matrix based on an activation function;
performing a dimension reduction operation on the feature image matrix using a convolution kernel to obtain a dimension-reduced image matrix;
acquiring a weight matrix of the dimension-reduced image matrix by using a second function;
multiplying the weight matrix by the feature image output by the previous convolutional layer to obtain a product matrix;
and acquiring a second probability matrix according to the product matrix and the first function.
Optionally, pruning the convolutional neural network according to the first probability matrix and the second probability matrix includes:
determining the channel values in the first probability matrix that are smaller than a first threshold;
removing the channels corresponding to those channel values from the convolutional neural network, and taking the remaining channels as reserved channels;
determining the same channels in the second probability matrix according to the reserved channels;
and determining, among the weight values corresponding to those same channels in the second probability matrix, the weight values smaller than a second threshold, and removing them from the convolutional neural network.
Optionally, performing secondary pruning on the pruned convolutional neural network includes:
deleting, from the weight parameters of the convolutional neural network model, the weight parameters smaller than a pruning threshold.
Optionally, performing the quantization operation on the convolutional neural network after the secondary pruning includes:
performing quantization processing on the convolutional neural network after the secondary pruning through at least one of a fast convolution algorithm, network layer merging, and multi-threaded execution.
In a second aspect, an embodiment of the present application further provides a pruning and quantization processing device for a network model, where the device includes:
a pruning module, used for pruning the convolutional neural network based on the channel attention mechanism and the weight attention mechanism;
the pruning module is further used for performing secondary pruning on the pruned convolutional neural network;
and a quantization module, used for performing a quantization operation on the convolutional neural network after the secondary pruning.
In a third aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes:
the network model pruning quantization processing method comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, and when the computer program is executed by the processor, the pruning quantization processing method of the network model provided by the embodiment of the application is realized.
In a fourth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the pruning and quantization processing method for a network model provided in the present application is implemented.
The application provides a pruning and quantization processing method, device, equipment and storage medium for a network model, where the method includes: pruning a convolutional neural network based on a channel attention mechanism and a weight attention mechanism; performing secondary pruning on the pruned convolutional neural network; and performing a quantization operation on the convolutional neural network after the secondary pruning. Through this processing, parameters that contribute little to the convolutional neural network model can be eliminated, ensuring the response speed of the network model while achieving, with a smaller number of parameters, an accuracy close to that of the original model.
Drawings
Fig. 1 is a flowchart of a pruning and quantization processing method for a network model in an embodiment of the present application;
FIG. 2 is a flowchart of a method for determining a first probability matrix in an embodiment of the present application;
FIG. 3 is a flowchart of a method for determining a second probability matrix in an embodiment of the present application;
FIG. 4 is a flowchart of a method for pruning according to a first probability matrix and a second probability matrix in an embodiment of the present application;
FIG. 5 is a schematic diagram of a pruning and quantization processing device for a network model in an embodiment of the present application;
FIG. 6 is another schematic diagram of a pruning and quantization processing device for a network model in an embodiment of the present application;
FIG. 7 is another schematic diagram of a pruning and quantization processing device for a network model in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures.
In addition, in the embodiments of the present application, the words "optionally" or "exemplarily" are used to indicate examples, illustrations or explanations. Any embodiment or design described herein as "optional" or "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs; rather, these words are intended to present the relevant concepts in a concrete fashion.
Fig. 1 is a flowchart of a pruning and quantization processing method for a network model according to an embodiment of the present application. The method may be applied to a portable device such as a law enforcement recorder, enabling it to quickly process and respond to the information it acquires. As shown in fig. 1, the method may include, but is not limited to, the following steps:
S101, pruning the convolutional neural network based on the channel attention mechanism and the weight attention mechanism.
The convolutional neural network in the embodiments of the application can be used for processing images or videos. By pruning and optimizing the network model parameters of the convolutional neural network with a combination of a channel attention mechanism and a weight attention mechanism, an accuracy close to that of the original model can be achieved with a smaller number of parameters.
For example, in this embodiment of the present application, pruning the convolutional neural network based on the channel attention mechanism and the weight attention mechanism may be implemented as follows: prune the convolutional neural network based on the channel attention mechanism to obtain a first probability matrix; prune the convolutional neural network based on the weight attention mechanism and the input image to obtain a second probability matrix (when the convolutional neural network processes a video, the input image can be understood as each frame of the input video); and prune the convolutional neural network according to the first probability matrix and the second probability matrix.
S102, performing secondary pruning on the pruned convolutional neural network.
The combination of the channel attention mechanism and the weight attention mechanism in step S101 can be understood as the first pruning of the convolutional neural network; on this basis, the pruned convolutional neural network can be pruned a second time along other dimensions to further reduce the size of the network model.
For example, after pruning the convolutional neural network based on the channel attention mechanism and the weight attention mechanism, the secondary pruning may be performed from the perspective of the weight parameters rather than by convolutional layer, channel and the like: the weight parameters of the convolutional neural network model that are smaller than a pruning threshold are deleted to reduce the computational load of the model.
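As a minimal sketch of such magnitude-based secondary pruning (a PyTorch-style illustration under our own assumptions; the function name and threshold value are not taken from the patent, and weights are zeroed rather than structurally removed):

```python
import torch

def magnitude_prune(model: torch.nn.Module, prune_threshold: float = 1e-3) -> None:
    # Zero out every convolution weight whose magnitude is below the threshold.
    # Zeroing is one common way to realise unstructured weight pruning; the
    # threshold value here is purely illustrative.
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, torch.nn.Conv2d):
                mask = module.weight.abs() >= prune_threshold
                module.weight.mul_(mask)
```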
Optionally, before the second pruning of the model, the network model after the first pruning may be fine-tuned, for example by training with an L1 regularization term so that the weights of the convolutional neural network become sparse. This not only compensates for the model precision lost in the first pruning, but also ensures the effect of the second pruning.
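A sketch of such L1-regularized fine-tuning (the penalty coefficient and the surrounding training step are illustrative assumptions, not details from the patent):

```python
import torch

def l1_penalty(model: torch.nn.Module, lam: float = 1e-4) -> torch.Tensor:
    # Sum of absolute weight values; adding it to the task loss pushes weights
    # toward zero, sparsifying the network before the second pruning.
    return lam * sum(p.abs().sum() for p in model.parameters())

# Inside a fine-tuning step (task_loss, output and target are placeholders):
# loss = task_loss(output, target) + l1_penalty(model)
# loss.backward()
```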
S103, performing a quantization operation on the convolutional neural network after the secondary pruning.
Illustratively, the quantization processing selected in the embodiments of the present application may include at least one of a fast convolution algorithm, network layer merging, and multi-threaded execution.
Among them, the fast convolution algorithm (Winograd) reduces the number of multiplications from the viewpoint of mathematical operations. For the one-dimensional case F(2, 3), let d = (d0, d1, d2, d3) denote the feature matrix (the input tile) and g = (g0, g1, g2) denote the convolution kernel. The two outputs are computed as

y0 = m1 + m2 + m3
y1 = m2 - m3 - m4

where m1 = (d0 - d2)·g0, m2 = (d1 + d2)·(g0 + g1 + g2)/2, m3 = (d2 - d1)·(g0 - g1 + g2)/2 and m4 = (d1 - d3)·g2, so that four multiplications replace the six required by direct computation.
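The following is a minimal numeric sketch of this F(2, 3) computation (NumPy; the function name is ours):

```python
import numpy as np

def winograd_f23(d: np.ndarray, g: np.ndarray) -> np.ndarray:
    # Winograd F(2,3): two outputs of a 1-D convolution with a 3-tap kernel
    # using 4 multiplications instead of the naive 6.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])   # feature matrix (input tile)
g = np.array([0.5, 1.0, 0.25])       # convolution kernel
print(winograd_f23(d, g))            # agrees with np.correlate(d, g, 'valid')
```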
A drawback of the fast convolution algorithm described above is that it provides a quantization benefit only when the channel size is large.
Network layer merging may be performed by merging operators in the convolutional neural network, for example merging a convolutional layer (conv), a batch normalization layer (BatchNorm, bn) and an activation layer (relu) into a single layer, and removing the concatenation layer (concat).
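As an illustration of one common form of such operator merging, the following sketch folds a BatchNorm layer into the preceding convolution (PyTorch; it assumes a plain Conv2d followed by BatchNorm2d in inference mode and is not code from the patent):

```python
import torch

def fuse_conv_bn(conv: torch.nn.Conv2d, bn: torch.nn.BatchNorm2d) -> torch.nn.Conv2d:
    # Fold BatchNorm statistics into the convolution weights and bias so that
    # inference runs a single conv operator instead of two layers.
    fused = torch.nn.Conv2d(conv.in_channels, conv.out_channels,
                            conv.kernel_size, conv.stride, conv.padding, bias=True)
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # gamma / sqrt(var + eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused
```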
Alternatively, the network model may also be quantized by combining hardware and software approaches, such as exploiting hardware architecture characteristics, pipelining (pipeline), caching (cache), memory data rearrangement, NEON assembly instructions, and the like.
It should be noted that the foregoing quantization methods are conventional operations in the prior art. The embodiments of the present application do not describe the specific implementation of each quantization method in detail; those skilled in the art may select one or a combination of the above quantization methods according to actual needs to quantize the pruned network model, which is not limited in the embodiments of the present application.
The embodiment of the application provides a pruning and quantization processing method for a network model, which includes: pruning a convolutional neural network based on a channel attention mechanism and a weight attention mechanism; performing secondary pruning on the pruned convolutional neural network; and performing a quantization operation on the convolutional neural network after the secondary pruning. Through this processing, parameters that contribute little to the convolutional neural network model can be eliminated, ensuring the response speed of the network model while achieving, with a smaller number of parameters, an accuracy close to that of the original model.
As shown in fig. 2, in one example, pruning the convolutional neural network based on the channel attention mechanism in step S101 to obtain the first probability matrix may include, but is not limited to, the following steps:
S201, performing a dimension reduction transformation on the channels of the convolutional neural network to obtain channel weights.
For example, assume the input of the convolutional neural network is H × W × C, where H denotes the image height, W denotes the image width, and C denotes the C channels of the convolutional neural network. Performing the dimension reduction transformation on the channels in this step can be understood as reducing each channel to a scalar, so that the input is reduced to a 1 × C matrix giving the channel weights of the C channels.
S202, multiplying each channel by its corresponding channel weight to obtain the attention matrix corresponding to the channel.
That is, the attention matrix corresponding to each channel is obtained by multiplying each of the C channels by its own channel weight, which enhances the attention paid to important channels.
S203, acquiring a first probability matrix by using the first function and the attention matrix.
For example, the first function may be a normalized exponential function (softmax function); applying the first function to the obtained attention matrix yields the first probability matrix.
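A minimal sketch of steps S201-S203 (PyTorch; using global average pooling as the dimension reduction transform and collapsing the attention matrix to one score per channel before the softmax are our assumptions about one plausible realisation, not details fixed by the patent):

```python
import torch

def first_probability_matrix(feat: torch.Tensor) -> torch.Tensor:
    # feat: (C, H, W) input feature map.
    channel_weight = feat.mean(dim=(1, 2))              # S201: reduce each channel to a 1 x C weight
    attention = feat * channel_weight.view(-1, 1, 1)    # S202: scale channels -> attention matrix
    scores = attention.mean(dim=(1, 2))                 # collapse back to one score per channel
    return torch.softmax(scores, dim=0)                 # S203: first function (softmax) -> 1 x C

probs = first_probability_matrix(torch.rand(16, 8, 8))  # 16 channels, 8 x 8 spatial size
```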
As shown in fig. 3, in one example, pruning the convolutional neural network based on the weight attention mechanism and the input image in step S101 to obtain the second probability matrix may include, but is not limited to, the following steps:
S301, performing linear addition on the feature image output by the current convolutional layer in the convolutional neural network and the feature image output by the previous convolutional layer to obtain a feature image matrix.
S302, performing nonlinear processing on the feature image matrix based on the activation function.
S303, performing a dimension reduction operation on the feature image matrix using a convolution kernel to obtain a dimension-reduced image matrix.
For example, the size of the convolution kernel may be 1 × 1; that is, the nonlinearly processed feature image matrix is computed with a 1 × 1 convolution kernel to obtain the dimension-reduced image matrix.
S304, acquiring a weight matrix of the dimension-reduced image matrix by using a second function.
Optionally, the second function may be a threshold function, for example a sigmoid function, which maps variables to the range 0 to 1. Applying the sigmoid function to the dimension-reduced image matrix therefore yields the corresponding weight matrix.
S305, multiplying the weight matrix by the feature image output by the previous convolutional layer to obtain a product matrix.
S306, acquiring a second probability matrix according to the product matrix and the first function.
Similarly, the first function is the normalized exponential function (softmax function); that is, the second probability matrix is obtained by applying the normalized exponential function to the product matrix.
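A minimal sketch of steps S301-S306 (PyTorch; the choice of ReLU as the activation, a single-output-channel 1 × 1 reduction convolution, and a per-channel softmax are our assumptions):

```python
import torch

def second_probability_matrix(curr_feat: torch.Tensor, prev_feat: torch.Tensor,
                              reduce_conv: torch.nn.Conv2d) -> torch.Tensor:
    # curr_feat, prev_feat: (C, H, W) outputs of the current and previous conv layers.
    feat_matrix = curr_feat + prev_feat                        # S301: linear addition
    feat_matrix = torch.relu(feat_matrix)                      # S302: nonlinear processing
    reduced = reduce_conv(feat_matrix.unsqueeze(0))[0]         # S303: 1x1-conv dimension reduction
    weight_matrix = torch.sigmoid(reduced)                     # S304: second function (sigmoid)
    product = weight_matrix * prev_feat                        # S305: multiply with previous output
    return torch.softmax(product.flatten(1), dim=1).view_as(product)  # S306: softmax -> H x W x C

c, h, w = 16, 8, 8
reduce_conv = torch.nn.Conv2d(c, 1, kernel_size=1)             # reduces C channels to 1
probs = second_probability_matrix(torch.rand(c, h, w), torch.rand(c, h, w), reduce_conv)
```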
As shown in fig. 4, in one example, pruning the convolutional neural network according to the first probability matrix and the second probability matrix in step S101 may include, but is not limited to, the following steps:
S401, determining the channel values in the first probability matrix that are smaller than a first threshold.
The first threshold in this step is used to determine, from the perspective of channels, the pruning targets in the convolutional neural network. If the first probability matrix obtained in the embodiment of fig. 2 has the form 1 × C, the channel values smaller than the first threshold among the C channels in the first probability matrix can be determined based on the first threshold.
S402, removing the channels corresponding to those channel values from the convolutional neural network, and taking the remaining channels as reserved channels.
After the channel values smaller than the first threshold in the first probability matrix are determined in step S401, the channels corresponding to those values are taken as removal targets and removed from the convolutional neural network, and the remaining channels serve as reserved channels.
S403, determining the same channels in the second probability matrix according to the reserved channels.
In the embodiment of the present application, the information input to the network model is processed from the perspective of channels and from the perspective of weights based on the channel attention mechanism and the weight attention mechanism, respectively. Similar to the embodiment of fig. 2, the second probability matrix computed in the embodiment of fig. 3 has the form H × W × C, so the corresponding channels and channel values in the second probability matrix can be obtained from the reserved channels determined in step S402.
S404, determining, among the weight values corresponding to those same channels in the second probability matrix, the weight values smaller than a second threshold, and removing them from the convolutional neural network.
The second threshold in this step is used to determine, from the perspective of weights, the pruning targets in the convolutional neural network. After the corresponding reserved channels in the second probability matrix are determined in step S403, the weight values of the reserved channels are compared with the second threshold; those smaller than the second threshold are determined and removed from the convolutional neural network.
Through the above steps, factors with small contributions can be removed from the network model from the perspectives of both channels and weights, reducing the size of the convolutional neural network model while keeping the precision essentially intact.
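A sketch of the selection logic of steps S401-S404 (PyTorch; the threshold values and the use of a boolean mask to mark removed weights are illustrative):

```python
import torch

def select_pruning(first_probs: torch.Tensor, second_probs: torch.Tensor,
                   t1: float, t2: float):
    # first_probs: (C,) first probability matrix; second_probs: (C, H, W)
    # second probability matrix. Returns the reserved channel indices and,
    # for those channels, a mask marking the weight values to keep.
    reserved = torch.nonzero(first_probs >= t1).flatten()   # S401/S402: reserved channels
    same_channels = second_probs[reserved]                  # S403: same channels in 2nd matrix
    weight_mask = same_channels >= t2                       # S404: weights >= t2 are kept
    return reserved, weight_mask

reserved, mask = select_pruning(torch.rand(16), torch.rand(16, 8, 8), t1=0.02, t2=0.3)
```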
After the network model is pruned and quantized in the above manner, the optimized network model can be ported to the corresponding electronic device for use.
The embodiments of this application provide a preferred implementation, for example using a HiSilicon Hi35XX development board. Taking the HiSilicon Hi3519 chip as an example, after the optimized model is obtained through the above combination of quantization, pruning and low-rank processing, it can be converted via onnx into a wk model supported under the NNIE framework and written into the project for operation. Here, onnx is an open file format designed for machine learning and used to store trained models; it allows different artificial intelligence frameworks to store model information in the same format. NNIE is short for Neural Network Inference Engine, a hardware unit dedicated to accelerated processing of neural networks in HiSilicon media System-on-Chip (SoC) products.
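As a sketch of the model export step (PyTorch; the stand-in model, input shape and file name are illustrative; the subsequent onnx-to-wk conversion relies on HiSilicon's NNIE tooling and is not shown):

```python
import torch

# Stand-in for the pruned and quantized convolutional neural network.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, kernel_size=3), torch.nn.ReLU())
model.eval()

dummy_input = torch.rand(1, 3, 224, 224)                 # illustrative input shape
torch.onnx.export(model, dummy_input, "pruned_model.onnx")
# The .onnx file is then converted to a .wk model with the NNIE toolchain.
```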
Fig. 5 shows a pruning and quantization processing device for a network model according to an embodiment of the present application. As shown in fig. 5, the device includes: a pruning module 501 and a quantization module 502;
the pruning module is used for pruning the convolutional neural network based on the channel attention mechanism and the weight attention mechanism;
the pruning module is further used for performing secondary pruning on the pruned convolutional neural network;
and the quantization module is used for performing a quantization operation on the convolutional neural network after the secondary pruning.
Exemplarily, as shown in fig. 6, the pruning module may further include a first pruning unit 601, a second pruning unit 602, and a third pruning unit 603;
the first pruning unit is used for pruning the convolutional neural network based on a channel attention mechanism to obtain a first probability matrix;
the second pruning unit is used for pruning the convolutional neural network based on the weight attention mechanism and the input image to obtain a second probability matrix;
and the third pruning unit is used for pruning the convolutional neural network according to the first probability matrix and the second probability matrix.
In an example, the first pruning unit is configured to perform a dimension reduction transformation on the channels of the convolutional neural network to obtain channel weights; multiply each channel by its corresponding channel weight to obtain the attention matrix corresponding to the channel; and acquire a first probability matrix using the first function (e.g., a softmax function) and the attention matrix.
In an example, the second pruning unit is configured to perform linear addition on the feature image output by the current convolutional layer in the convolutional neural network and the feature image output by the previous convolutional layer to obtain a feature image matrix; perform nonlinear processing on the feature image matrix based on the activation function; perform a dimension reduction operation on the feature image matrix using a convolution kernel to obtain a dimension-reduced image matrix; acquire a weight matrix of the dimension-reduced image matrix using a second function (e.g., a sigmoid function); multiply the weight matrix by the feature image output by the previous convolutional layer to obtain a product matrix; and acquire a second probability matrix according to the product matrix and the first function.
In one example, the third pruning unit is configured to determine the channel values in the first probability matrix that are smaller than a first threshold; remove the channels corresponding to those channel values from the convolutional neural network and take the remaining channels as reserved channels; determine the same channels in the second probability matrix according to the reserved channels; and determine, among the weight values corresponding to those same channels in the second probability matrix, the weight values smaller than a second threshold and remove them from the convolutional neural network.
Illustratively, as shown in fig. 7, the pruning module may further include a fourth pruning unit 604;
and the fourth pruning unit is used for deleting, from the weight parameters of the convolutional neural network model, the weight parameters smaller than the pruning threshold.
Illustratively, the quantization module may be configured to quantize the convolutional neural network after the secondary pruning through at least one of a fast convolution algorithm, network layer merging, and multi-threaded execution.
The pruning and quantization processing device for a network model provided by the embodiments of the present application can execute the pruning and quantization processing method for a network model provided in the embodiments of figs. 1 to 4 of the present application, and has the corresponding functional modules and the beneficial effects of the executed method.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device includes a processor 801, a memory 802, an input device 803 and an output device 804. The number of processors 801 in the device may be one or more; one processor 801 is taken as an example in fig. 8. The processor 801, the memory 802, the input device 803 and the output device 804 in the device may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 8.
As a computer-readable storage medium, the memory 802 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the pruning and quantization processing method of figs. 1-4 in the embodiments of the present application (for example, the pruning module 501 and the quantization module 502 of the pruning and quantization processing device for a network model). The processor 801 executes the various functional applications and data processing of the electronic device by running the software programs, instructions and modules stored in the memory 802, thereby implementing the pruning and quantization processing method for a network model described above.
The memory 802 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created according to the use of the cloud server, and the like. Further, the memory 802 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 802 may further include memory located remotely from the processor 801, which may be connected to the device/terminal/server via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 803 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output device 804 may include a display device such as a display screen.
Embodiments of the present application also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a pruning and quantization processing method for a network model, the method including:
pruning a convolutional neural network based on a channel attention mechanism and a weight attention mechanism;
performing secondary pruning on the pruned convolutional neural network;
and performing a quantization operation on the convolutional neural network after the secondary pruning.
Of course, the computer-executable instructions contained in the storage medium provided by the embodiments of the present application are not limited to the method operations described above, and may also execute the pruning and quantization processing method for a network model provided in any embodiment of the present application.
From the above description of the embodiments, it will be clear to those skilled in the art that the present application can be implemented by software plus the necessary general-purpose hardware, and certainly also by hardware, although the former is the better embodiment in many cases. Based on this understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a personal computer, a server or a network device) to execute the methods described in the embodiments of the present application.
It should be noted that, in the above embodiment of the pruning and quantization processing device for a network model, the included units and modules are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only used to distinguish them from one another and are not used to limit the protection scope of the application.
It should also be noted that the foregoing describes only the preferred embodiments of the present application and the technical principles employed. Those skilled in the art will understand that the present application is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the application. Therefore, although the present application has been described in detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the spirit of the application; the scope of the application is determined by the scope of the appended claims.

Claims (10)

1. A pruning and quantization processing method for a network model, characterized by comprising:
pruning a convolutional neural network based on a channel attention mechanism and a weight attention mechanism;
performing secondary pruning on the pruned convolutional neural network;
and performing a quantization operation on the convolutional neural network after the secondary pruning.
2. The method of claim 1, wherein pruning the convolutional neural network based on a channel attention mechanism and a weight attention mechanism comprises:
pruning the convolutional neural network based on a channel attention mechanism to obtain a first probability matrix;
pruning the convolutional neural network based on a weight attention mechanism and an input image to obtain a second probability matrix;
pruning the convolutional neural network according to the first probability matrix and the second probability matrix.
3. The method of claim 2, wherein pruning the convolutional neural network based on a channel attention mechanism to obtain a first probability matrix comprises:
performing a dimension reduction transformation on the channels of the convolutional neural network to obtain channel weights;
multiplying each channel by its corresponding channel weight to obtain an attention matrix corresponding to the channel;
and acquiring a first probability matrix using a first function and the attention matrix.
4. The method of claim 2, wherein pruning the convolutional neural network based on a weight attention mechanism and an input image to obtain a second probability matrix comprises:
performing linear addition on a feature image output by a current convolutional layer in the convolutional neural network and a feature image output by the previous convolutional layer to obtain a feature image matrix;
performing nonlinear processing on the feature image matrix based on an activation function;
performing a dimension reduction operation on the feature image matrix using a convolution kernel to obtain a dimension-reduced image matrix;
acquiring a weight matrix of the dimension-reduced image matrix by using a second function;
multiplying the weight matrix by the feature image output by the previous convolutional layer to obtain a product matrix;
and acquiring a second probability matrix according to the product matrix and the first function.
5. The method of claim 2, wherein pruning the convolutional neural network according to the first probability matrix and the second probability matrix comprises:
determining the channel values in the first probability matrix that are smaller than a first threshold;
removing the channels corresponding to those channel values from the convolutional neural network, and taking the remaining channels as reserved channels;
determining the same channels in the second probability matrix according to the reserved channels;
and determining, among the weight values corresponding to those same channels in the second probability matrix, the weight values smaller than a second threshold, and removing them from the convolutional neural network.
6. The method of claim 1, wherein performing secondary pruning on the pruned convolutional neural network comprises:
deleting, from the weight parameters of the convolutional neural network model, the weight parameters smaller than a pruning threshold.
7. The method according to any one of claims 1 to 6, wherein performing the quantization operation on the convolutional neural network after the secondary pruning comprises:
performing quantization processing on the convolutional neural network after the secondary pruning through at least one of a fast convolution algorithm, network layer merging, and multi-threaded execution.
8. A pruning and quantization processing device for a network model, characterized by comprising:
a pruning module, used for pruning the convolutional neural network based on the channel attention mechanism and the weight attention mechanism;
the pruning module is further used for performing secondary pruning on the pruned convolutional neural network;
and a quantization module, used for performing a quantization operation on the convolutional neural network after the secondary pruning.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the pruning and quantization processing method for a network model according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the pruning and quantization processing method for a network model according to any one of claims 1 to 7.
CN202110598683.5A 2021-05-31 2021-05-31 Pruning quantification processing method, device, equipment and storage medium of network model Pending CN113205158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110598683.5A CN113205158A (en) 2021-05-31 2021-05-31 Pruning quantification processing method, device, equipment and storage medium of network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110598683.5A CN113205158A (en) 2021-05-31 2021-05-31 Pruning quantification processing method, device, equipment and storage medium of network model

Publications (1)

Publication Number Publication Date
CN113205158A true CN113205158A (en) 2021-08-03

Family

ID=77023783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110598683.5A Pending CN113205158A (en) 2021-05-31 2021-05-31 Pruning quantification processing method, device, equipment and storage medium of network model

Country Status (1)

Country Link
CN (1) CN113205158A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116739050A (en) * 2022-09-30 2023-09-12 荣耀终端有限公司 Cross-layer equalization optimization method, device and storage medium
CN116992946A (en) * 2023-09-27 2023-11-03 荣耀终端有限公司 Model compression method, apparatus, storage medium, and program product
CN116992946B (en) * 2023-09-27 2024-05-17 荣耀终端有限公司 Model compression method, apparatus, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination