WO2022022625A1

WO2022022625A1 - Acceleration method and device for deep learning model

Info

Publication number: WO2022022625A1
Application number: PCT/CN2021/109187
Authority: WO
Inventors: 付家为; 陈东; 张放; 李晓飞; 张德兆; 王肖; 霍舒豪
Original assignee: 北京智行者科技有限公司
Priority date: 2020-07-29
Filing date: 2021-07-29
Publication date: 2022-02-03
Also published as: CN112101515A

Abstract

The present invention provides an acceleration method for a deep learning model, comprising: acquiring a contribution value for each of a plurality of channels in each convolutional layer of the model; pruning channels in the convolutional layers of the model according to the contribution values of all the channels in all the convolutional layers, to obtain a pruned model; separately training the model and the pruned model; separately evaluating the trained model and the trained pruned model to obtain a first evaluation value and a second evaluation value; and determining, according to the first evaluation value and the second evaluation value, whether to output the trained pruned model as a new model. Inference speed is thereby greatly increased with no loss, or only a very small loss, of inference accuracy.

Description

Acceleration method and device for deep learning model

Technical field

The present invention relates to the field of data processing, and in particular to an acceleration method and device for a deep learning model.

Background

Deep learning algorithms are now applied in more and more fields. When using them, people are usually most concerned with accuracy and running speed. Many applications require deep learning algorithms to run in real time or close to it, but on limited hardware platforms, especially embedded platforms, their inference speed often cannot meet the requirement, so the algorithms need to be accelerated. Acceleration is usually accompanied by a loss of inference accuracy. Finding a suitable acceleration method that minimizes this loss is therefore crucial when applying deep learning algorithms.

Many measures have been taken to improve the running speed of deep learning algorithms. At present, acceleration is pursued mainly in the following ways: 1. acceleration of the convolution operation; 2. network weight quantization; 3. network structure optimization.

For acceleration of the convolution operation, one approach is to accelerate the operation according to the hardware characteristics of a specific platform. This approach is platform-limited and not very versatile. For example, acceleration using the Compute Unified Device Architecture (CUDA) computing platform of the graphics card manufacturer Nvidia and the NVIDIA CUDA Deep Neural Network library (cuDNN) can only be used on Nvidia platforms, and because these acceleration algorithms are already close to optimal, further optimization is difficult. A second approach is to accelerate the convolution operation itself, for example by implementing it with the fast Fourier transform (FFT). The versatility of this approach is still limited: the speed-up is pronounced for large convolution kernels, whereas existing convolutional neural networks (CNNs) mostly use small kernels, or even 1*1 kernels, for which FFT acceleration has little effect.

Network weight quantization binarizes the network weights or reduces their floating-point precision, thereby reducing the cost of the convolution operations. This approach is often accompanied by a large loss of inference accuracy.

There are many ways to optimize the network structure. The first is to reduce the depth of the network: a shallower network requires less computation, but this usually causes a large loss of inference accuracy. The second is to sparsify the network connections, for example by compressing the model through pruning and quantization; however, the sparse-matrix convolution operations that result are not supported equally well on all platforms, and the acceleration achieved on specific platforms is not ideal. The third is tensor decomposition, which decomposes a tensor into multiple smaller tensors without changing the number of output channels; it is therefore difficult to compress 1*1 convolutional layers by tensor decomposition, and many current model structures use a large number of 1*1 convolutions, such as the Residual Neural Network (ResNet), GoogLeNet (Going Deeper with Convolutions), and Xception.

Summary of the invention

The purpose of the embodiments of the present invention is to provide an acceleration method and device for a deep learning model, so as to solve the problems in the prior art that the accuracy loss is large and the acceleration effect on a specific platform is not ideal.

In a first aspect, the present invention provides a method for accelerating a deep learning model, the method comprising:

acquiring a contribution value of each of multiple channels in each convolutional layer of the model;

pruning channels in the convolutional layers of the model according to the contribution value of each channel in all the convolutional layers, to obtain a pruned model;

training the unpruned model and the pruned model separately;

evaluating the trained unpruned model and the trained pruned model separately, to obtain a first evaluation value and a second evaluation value;

determining, according to the first evaluation value and the second evaluation value, whether to output the trained pruned model as a new model.

In a possible implementation, acquiring the contribution value of each of the multiple channels in each convolutional layer of the model specifically includes:

multiplying the output of each channel in the convolutional layer by the back-propagation gradient of that channel, and then multiplying by the reciprocal of the number of samples, to obtain the contribution value of the channel in the convolutional layer.

In a possible implementation, pruning the channels in the convolutional layers of the model according to the contribution values of the channels in all the convolutional layers, to obtain the pruned model, specifically includes:

sorting the contribution values of the channels of all the convolutional layers;

pruning the channels whose contribution value is not greater than a preset contribution value threshold, the retained channels being referred to as the first number of channels and the pruned channels as the second number of channels;

obtaining the number of retained channels of each convolutional layer;

when the number of retained channels of any convolutional layer is less than a preset channel number threshold, taking the value obtained by subtracting the number of retained channels of that convolutional layer from the preset channel number threshold as a third number, selecting the third number of channels from the second number of channels, and restoring them as retained channels of that convolutional layer;

obtaining the pruned model according to the first number of channels and the retained channels of that convolutional layer.

In a possible implementation, before sorting the contribution values of the channels of all the convolutional layers, the method further includes:

obtaining the number of channels of each convolutional layer;

retaining, unchanged, the convolutional layers whose number of channels is less than the preset channel number threshold;

sorting the contribution values of the channels of the convolutional layers whose number of channels is greater than the preset channel number threshold.

In a possible implementation, the first evaluation value includes a first inference accuracy and a first inference speed, and the second evaluation value includes a second inference accuracy and a second inference speed; determining, according to the first evaluation value and the second evaluation value, whether to output the trained pruned model as a new model specifically includes:

when the difference between the first inference accuracy and the second inference accuracy is within a preset inference accuracy threshold range and the second inference speed is less than a preset inference speed, continuing to prune the trained pruned model; or,

when the difference between the first inference accuracy and the second inference accuracy is within the preset inference accuracy threshold range and the second inference speed is equal to the preset inference speed, outputting the trained pruned model.

In a second aspect, the present invention provides an acceleration device for a deep learning model, the device comprising:

an acquisition module, configured to acquire the contribution value of each of multiple channels in each convolutional layer of the model;

a pruning module, configured to prune channels in the convolutional layers of the model according to the contribution value of each channel in all the convolutional layers, to obtain a pruned model;

a training module, configured to train the unpruned model and the pruned model;

an evaluation module, configured to evaluate the trained unpruned model and the trained pruned model separately, to obtain a first evaluation value and a second evaluation value;

a determination module, configured to determine, according to the first evaluation value and the second evaluation value, whether to output the trained pruned model as a new model.

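In a possible implementation, the acquisition module is specifically configured to:

multiply the output of each channel in the convolutional layer by the back-propagation gradient of that channel, and then multiply by the reciprocal of the number of samples, to obtain the contribution value of the channel in the convolutional layer.

In a possible implementation, the pruning module is specifically configured to:

sort the contribution values of the channels of all the convolutional layers;

prune the channels whose contribution value is not greater than a preset contribution value threshold, the retained channels being referred to as the first number of channels and the pruned channels as the second number of channels;

obtain the number of retained channels of each convolutional layer;

when the number of retained channels of any convolutional layer is less than a preset channel number threshold, take the value obtained by subtracting the number of retained channels of that convolutional layer from the preset channel number threshold as a third number, select the third number of channels from the second number of channels, and restore them as retained channels of that convolutional layer;

obtain the pruned model according to the first number of channels and the retained channels of that convolutional layer.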

In a third aspect, the present invention provides a device including a memory and a processor, where the memory is used for storing a program, and the processor is used for executing the method according to any implementation of the first aspect when the program is run.

In a fourth aspect, the present invention provides a computer program product comprising instructions that, when the computer program product is run on a computer, cause the computer to perform the method according to any implementation of the first aspect.

In a fifth aspect, the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method according to any implementation of the first aspect is implemented.

By applying the acceleration method and device for a deep learning model provided by the embodiments of the present invention, there is no dependence on the hardware platform, software, or deep learning framework, so versatility is good. Convolution operations can be accelerated, and the inference speed can be greatly improved with no loss, or only a very small loss, of inference accuracy.

Description of drawings

FIG. 1 is a schematic flowchart of a method for accelerating a deep learning model according to Embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of channel pruning of a convolutional layer provided in Embodiment 1 of the present invention;

FIG. 3 is a schematic structural diagram of an acceleration device for a deep learning model according to Embodiment 2 of the present invention.

Detailed description

The present application will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the related invention, but not to limit the invention. In addition, it should be noted that, for the convenience of description, only the parts related to the related invention are shown in the drawings.

It should be noted that the embodiments in the present application and the features of the embodiments may be combined with each other in the case of no conflict. The present application will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

FIG. 1 is a schematic flowchart of a method for accelerating a deep learning model according to Embodiment 1 of the present invention. The method is executed by a terminal, server, or processor with computing capability. As shown in FIG. 1, the method includes the following steps:

Step 110, obtaining the contribution value of each channel in the multiple channels in each convolutional layer in the model;

Specifically, a deep learning network model contains multiple convolutional layers; each convolutional layer has multiple convolution kernels, and each convolution kernel outputs one feature channel.

The contribution of each channel can be calculated by the following formula:

Here, l denotes the l-th convolutional layer (layer l for short); each convolutional layer has multiple output channels, and k denotes the k-th output channel of the l-th convolutional layer; M is the number of samples, that is, the number of images fed to the deep learning network model during training, such as the batch size; and m denotes the m-th of the M samples.
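The verbal description of the contribution value (the channel output multiplied by its back-propagation gradient, scaled by the reciprocal of the number of samples M) can be sketched as follows. This is a minimal illustration under stated assumptions, not the formula of the embodiment: the function name `channel_contribution`, the flattened-list data layout, and the choice to sum the products over all positions of the feature map are assumptions introduced here.

```python
from typing import List


def channel_contribution(outputs: List[List[float]],
                         grads: List[List[float]]) -> float:
    """Contribution value of one output channel k of one layer l.

    outputs[m] is the flattened feature map of the channel for sample m;
    grads[m] is the back-propagated gradient with respect to that feature map.
    Assumption: products are summed over all positions of the feature map.
    """
    if not outputs or len(outputs) != len(grads):
        raise ValueError("outputs and grads must be non-empty and equally long")
    num_samples = len(outputs)  # M in the description
    total = 0.0
    for z_m, g_m in zip(outputs, grads):
        # Output times gradient, accumulated over the feature map of sample m.
        total += sum(z * g for z, g in zip(z_m, g_m))
    # Multiply by the reciprocal of the number of samples.
    return total / num_samples


# Two samples, each with a two-element feature map.
example = channel_contribution([[1.0, 2.0], [0.5, 0.0]],
                               [[0.5, 0.25], [2.0, 1.0]])
```

Ranking by the magnitude of this value is a common variant in pruning practice, but the text does not state one, so the sketch keeps the signed value.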

Step 120, according to the contribution value of each channel in all convolutional layers, trim the channels in the convolutional layers in the model to obtain a trimmed model;

In one example, a pruned deep learning network model can be obtained by the following steps:

First, the contribution values of the channels of all the convolutional layers are sorted. Second, the channels whose contribution value is not greater than the preset contribution value threshold are pruned; the retained channels are referred to as the first number of channels and the pruned channels as the second number of channels. Third, the number of retained channels of each convolutional layer is obtained. Then, when the number of retained channels of any convolutional layer is less than the preset channel number threshold, a third number of channels is selected from the second number of channels and restored as retained channels of that convolutional layer, such that the number of retained channels of that layer plus the third number equals the preset channel number threshold. Finally, the pruned model is obtained according to the first number of channels and the retained channels of that convolutional layer.

For example, the contribution values of the channels of all the convolutional layers can be calculated and denoted Value 1, Value 2, Value 3, and so on, and the channels whose contribution value is not greater than the contribution value threshold are pruned; for instance, the p channels ranked last by contribution value are pruned. Subsequently, when the number of retained channels of a convolutional layer falls below the preset channel number threshold after pruning, the highest-ranked channels belonging to that layer are taken back from the pruned channels until the number of channels of the layer equals the preset channel number threshold. For example, if fewer than q channels remain in a convolutional layer, the highest-scoring channels of that layer are taken back from the p pruned channels until the layer has q channels. The pruned deep learning network model is thereby obtained.
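The selection procedure described above (global ranking, thresholding, then restoring the best removed channels of any layer that fell below the minimum channel count) can be sketched as follows. The names `select_channels`, `value_threshold`, and `min_channels` are hypothetical, and channels are identified here by an assumed `(layer name, channel index)` pair.

```python
from typing import Dict, List, Set, Tuple

Channel = Tuple[str, int]  # (layer name, channel index): an assumed identifier


def select_channels(contrib: Dict[Channel, float],
                    value_threshold: float,
                    min_channels: int) -> Set[Channel]:
    """Return the channels to keep after thresholding and per-layer restoring."""
    # Rank every channel of every layer together, highest contribution first.
    ranked: List[Channel] = sorted(contrib, key=lambda c: contrib[c], reverse=True)
    kept = [c for c in ranked if contrib[c] > value_threshold]       # "first number"
    removed = [c for c in ranked if contrib[c] <= value_threshold]   # "second number"

    # Count the retained channels of each layer.
    kept_per_layer: Dict[str, int] = {layer: 0 for layer, _ in contrib}
    for layer, _ in kept:
        kept_per_layer[layer] += 1

    # Walk the removed channels from best to worst and restore those whose
    # layer is still below the minimum ("third number" of channels).
    for layer, index in removed:
        if kept_per_layer[layer] < min_channels:
            kept.append((layer, index))
            kept_per_layer[layer] += 1
    return set(kept)


scores = {("conv1", 0): 0.9, ("conv1", 1): 0.1, ("conv1", 2): 0.2,
          ("conv2", 0): 0.8, ("conv2", 1): 0.7, ("conv2", 2): 0.05}
kept_channels = select_channels(scores, value_threshold=0.5, min_channels=2)
```

In this example conv1 keeps only channel 0 after thresholding, so its best removed channel (index 2) is restored to reach the minimum of two channels.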

In another example, to improve processing speed, the convolutional layers whose number of channels is less than the preset channel number threshold are identified first and retained as they are; the channels of the remaining convolutional layers are sorted by contribution value, and the steps of the previous example are then carried out until the pruned model is obtained. Because the layers with fewer channels than the threshold are set aside first and only the remaining layers are processed, the processing speed is improved.
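The pre-filtering step can be sketched as a simple partition of the layers. The function name `split_layers` is hypothetical. The text only covers layers with fewer or more channels than the threshold; treating a layer with exactly the threshold number of channels as untouched is an assumption made here, on the grounds that such a layer could not lose any channel anyway.

```python
from typing import Dict, List, Tuple


def split_layers(channels_per_layer: Dict[str, int],
                 min_channels: int) -> Tuple[List[str], List[str]]:
    """Split layers into those left untouched and those whose channels are ranked."""
    untouched = [name for name, count in channels_per_layer.items()
                 if count <= min_channels]  # assumption: equality counts as untouched
    candidates = [name for name, count in channels_per_layer.items()
                  if count > min_channels]
    return untouched, candidates


untouched, candidates = split_layers({"conv1": 64, "conv2": 8, "conv3": 16},
                                     min_channels=16)
```

Only the channels of the layers in `candidates` then need contribution values sorted.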

Referring to FIG. 2, the dashed channels of Conv1 are the pruned channels. After Conv1's channels are pruned, the dimension of the layer's output, Output1, is reduced. Because the output of one layer is the input of the next convolutional layer, the dimension of the convolution kernels of the following layer, Conv2, is reduced correspondingly. Conv1 and Conv2 are thus pruned in different ways: in the first convolutional layer entire channels are removed, whereas in the second convolutional layer a part of each channel is removed.
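The effect on the two layers' weights can be sketched with nested lists indexed as `weights[output_channel][input_channel]`. The function name `prune_pair`, this layout, and the use of 1*1 kernels (so each entry is a single number) are assumptions for illustration; biases and normalization parameters, which would need the same treatment, are left out.

```python
from typing import List, Tuple

Weights = List[List[float]]  # weights[output_channel][input_channel], 1*1 kernels


def prune_pair(conv1: Weights, conv2: Weights,
               keep: List[int]) -> Tuple[Weights, Weights]:
    """Remove Conv1 output channels and the matching Conv2 input slices."""
    # Conv1: whole filters (entire output channels) are dropped.
    conv1_pruned = [conv1[out_ch] for out_ch in keep]
    # Conv2: every filter survives, but loses the input slices that were fed
    # by the removed Conv1 channels.
    conv2_pruned = [[filt[in_ch] for in_ch in keep] for filt in conv2]
    return conv1_pruned, conv2_pruned


conv1 = [[10.0], [11.0], [12.0]]               # 3 output channels, 1 input channel
conv2 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # 2 output channels, 3 input channels
conv1_small, conv2_small = prune_pair(conv1, conv2, keep=[0, 2])
```

Conv1 goes from three filters to two, while Conv2 keeps both of its filters with two input slices each instead of three.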

Step 130, training the uncropped model and the cropped model respectively;

Specifically, the original deep learning network model is trained, and the trimmed deep learning network model is trained to obtain a trained deep learning network model and a trimmed trained deep learning network model. How to train the deep learning network model is well known to those skilled in the art, and details are not described here.

Step 140, evaluating the model after training that is not trimmed and the model trained after trimming, respectively, to obtain a first evaluation value and a second evaluation value;

Here, the trained original deep learning network model and the trained pruned deep learning network model can be evaluated; for example, the first inference accuracy or the first inference speed of the trained original model is calculated, and the second inference accuracy or the second inference speed of the trained pruned model is calculated. The first and second inference accuracies are then used to determine whether the inference accuracy requirement is met, or the first and second inference speeds are used to determine whether the inference speed requirement is met.
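One way to obtain an evaluation value consisting of an inference accuracy and an inference speed is sketched below. The function name `evaluate`, the `predict` callable standing in for a trained model, and the definition of speed as samples processed per second are assumptions; the text does not prescribe how either quantity is measured.

```python
import time
from typing import Callable, Sequence, Tuple


def evaluate(predict: Callable[[object], object],
             samples: Sequence[object],
             labels: Sequence[object]) -> Tuple[float, float]:
    """Return (inference accuracy, inference speed in Hz) for one model."""
    if not samples or len(samples) != len(labels):
        raise ValueError("samples and labels must be non-empty and equally long")
    start = time.perf_counter()
    predictions = [predict(x) for x in samples]
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, y in zip(predictions, labels))
    accuracy = correct / len(samples)
    # Guard against a zero reading from a very coarse timer.
    speed_hz = len(samples) / elapsed if elapsed > 0 else float("inf")
    return accuracy, speed_hz


# A toy "model" that predicts the parity of its input.
accuracy, speed_hz = evaluate(lambda x: x % 2, [1, 2, 3, 4], [1, 0, 1, 1])
```

Calling the function once for the trained original model and once for the trained pruned model yields the first and second evaluation values.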

Step 150, according to the first evaluation value and the second evaluation value, determining whether to output the trained pruned model as a new model.

Specifically, the trained pruned deep learning network model is evaluated and its results are compared with those of the model before pruning. If the pruned model performs significantly worse than the model before pruning, the accuracy of the network is considered to have been seriously degraded by the pruning. If the pruning fails, step 110 is executed again to redo the pruning of the deep learning network model. If the network accuracy is severely degraded, the pruning has failed and is stopped. When the evaluation value meets the requirements, that is, the pruning succeeds, the more rounds of pruning are performed, the greater the improvement in model size and inference speed, that is, the better the pruning effect.

For example, if the inference speed of the trained pruned deep learning network model reaches 50 Hz and the loss of inference accuracy is small, that is, the difference between the second inference accuracy and the first inference accuracy is smaller than the preset threshold, or the second inference accuracy otherwise satisfies the preset inference accuracy threshold, the pruning is successful. As another example, if the desired inference speed is 60 Hz and the current inference speed is 50 Hz, pruning can be continued on the already pruned model to compress it further and raise the inference speed.
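The decision in step 150 can be sketched as a small function. The name `decide`, its parameters, and the three string results are hypothetical; the sketch also treats "equal to the preset inference speed" as "at least the preset inference speed", which is an assumption.

```python
def decide(acc_original: float, acc_pruned: float, speed_pruned_hz: float,
           max_acc_drop: float, target_speed_hz: float) -> str:
    """Classify one pruning round as 'reject', 'continue', or 'output'."""
    # Accuracy requirement: the loss relative to the original model must stay
    # within the preset threshold.
    if acc_original - acc_pruned > max_acc_drop:
        return "reject"
    # Accuracy is acceptable but the model is still too slow: prune further.
    if speed_pruned_hz < target_speed_hz:
        return "continue"
    # Accuracy is acceptable and the speed target is reached.
    return "output"


# The 50 Hz versus 60 Hz situation from the example above.
round_result = decide(0.92, 0.915, speed_pruned_hz=50.0,
                      max_acc_drop=0.01, target_speed_hz=60.0)
```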

However, if the inference accuracy of the pruned deep learning network model no longer meets the requirements, or the inference speed is greater than the required inference speed, it means that the pruning has failed and the pruning cannot be continued.
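Steps 110 to 150 can be repeated until either the accuracy requirement is violated or the speed target is reached. The loop below is a sketch under stated assumptions: `accelerate`, `prune_once`, `train`, `evaluate`, and `max_rounds` are hypothetical names, the callables stand in for the operations described in the text, and returning the last acceptable model when a round fails is a design choice made here rather than something the text specifies.

```python
from typing import Callable, Tuple, TypeVar

Model = TypeVar("Model")


def accelerate(model: Model,
               prune_once: Callable[[Model], Model],
               train: Callable[[Model], Model],
               evaluate: Callable[[Model], Tuple[float, float]],
               max_acc_drop: float,
               target_speed_hz: float,
               max_rounds: int = 10) -> Model:
    """Prune repeatedly while accuracy holds and the speed target is not met."""
    baseline = train(model)
    acc_reference, _ = evaluate(baseline)   # first evaluation value
    best = baseline
    current = baseline
    for _ in range(max_rounds):
        candidate = train(prune_once(current))
        acc, speed_hz = evaluate(candidate)  # second evaluation value
        if acc_reference - acc > max_acc_drop:
            break                            # pruning failed: discard candidate
        best = candidate
        if speed_hz >= target_speed_hz:
            break                            # requirement met: stop pruning
        current = candidate
    return best


# Toy stand-ins: a "model" is just its channel count.
halve = lambda n: n // 2
identity = lambda n: n
toy_eval = lambda n: (0.9 if n >= 16 else 0.5, 640.0 / n)
fast_enough = accelerate(64, halve, identity, toy_eval, 0.01, 15.0)
last_acceptable = accelerate(64, halve, identity, toy_eval, 0.01, 60.0)
```

In the toy run, a 15 Hz target is met after one round (32 channels), whereas a 60 Hz target is never met before accuracy collapses, so the last acceptable model (16 channels) is returned.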

By applying the acceleration method for a deep learning model provided by Embodiment 1 of the present invention, there is no dependence on the hardware platform, software, or deep learning framework, so the method is highly versatile. There are no restrictions on the size, dimension, or form of the convolution kernel; all of them can be accelerated, and the inference speed can be greatly improved with no loss, or only a very small loss, of inference accuracy.

FIG. 3 is a schematic structural diagram of an acceleration device for a deep learning model provided in Embodiment 2 of the present invention. The acceleration device is applied to the acceleration method for a deep learning model in Embodiment 1. As shown in FIG. 3, the acceleration device of the deep learning model includes: an acquisition module 310, a pruning module 320, a training module 330, an evaluation module 340, and a determination module 350.

The acquisition module 310 is configured to acquire the contribution value of each of multiple channels in each convolutional layer of the model;

The pruning module 320 is configured to prune channels in the convolutional layers of the model according to the contribution value of each channel in all the convolutional layers, to obtain a pruned model;

The training module 330 is configured to train the pruned model;

The evaluation module 340 is configured to evaluate the model before pruning and the model after pruning separately, to obtain a first evaluation value and a second evaluation value;

The determination module 350 is configured to determine, according to the first evaluation value and the second evaluation value, whether to output the pruned model as a new model.

Further, the acquisition module 310 is specifically configured to:

multiply the output of each channel in the convolutional layer by the back-propagation gradient of that channel, and then multiply by the reciprocal of the number of samples, to obtain the contribution value of the channel in the convolutional layer.

Further, the pruning module 320 is specifically configured to:

sort the contribution values of the channels of all the convolutional layers;

prune the channels whose contribution value is not greater than the preset contribution value threshold, the retained channels being referred to as the first number of channels and the pruned channels as the second number of channels;

obtain the number of retained channels of each convolutional layer;

when the number of retained channels of any convolutional layer is less than the preset channel number threshold, select a third number of channels from the second number of channels and restore them as retained channels of that convolutional layer, such that the number of retained channels of that layer plus the third number equals the preset channel number threshold;

obtain the pruned model according to the first number of channels and the retained channels of that convolutional layer.

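Further, the pruning module 320 is also configured to:

obtain the number of channels of each convolutional layer;

retain, unchanged, the convolutional layers whose number of channels is less than the preset channel number threshold;

sort the contribution values of the channels of the convolutional layers whose number of channels is greater than the preset channel number threshold.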

Further, the first evaluation value includes the first inference accuracy and the first inference speed, and the second evaluation value includes the second inference accuracy and the second inference speed; the determination module 350 is specifically configured to:

when the difference between the first inference accuracy and the second inference accuracy is within the preset inference accuracy threshold range and the second inference speed is less than the preset inference speed, continue to prune the trained pruned model; or,

when the difference between the first inference accuracy and the second inference accuracy is within the preset inference accuracy threshold range and the second inference speed is equal to the preset inference speed, output the trained pruned model.

By applying the acceleration device for a deep learning model provided by Embodiment 2 of the present invention, there is no dependence on the hardware platform, software, or deep learning framework, so the device is highly versatile. There are no restrictions on the size, dimension, or form of the convolution kernel; all of them can be accelerated, and the inference speed can be greatly improved with no loss, or only a very small loss, of inference accuracy.

The third embodiment of the invention provides a device including a memory and a processor, the memory is used to store a program, and the memory can be connected to the processor through a bus. The memory may be non-volatile memory, such as hard drives and flash memory, in which software programs and device drivers are stored. The software program can perform various functions of the above methods provided by the embodiments of the present invention; the device driver may be a network and interface driver. The processor is configured to execute a software program, and when the software program is executed, the method provided by Embodiment 1 of the present invention can be implemented.

Embodiment 4 of the present invention provides a computer program product containing instructions that, when the computer program product runs on a computer, cause the computer to execute the method provided by Embodiment 1 of the present invention.

Embodiment 5 of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method provided by Embodiment 1 of the present invention is implemented.

Those skilled in the art should further appreciate that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the above description has generally described the components and steps of each example in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, a software module executed by a processor, or a combination of the two. A software module can be placed in random access memory (RAM), internal memory, read only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the technical field.

The above specific embodiments further describe the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention, and are not intended to limit the protection scope of the present invention. Within the spirit and principle of the present invention, any modifications, equivalent replacements, improvements, etc. made should be included within the protection scope of the present invention.

Claims

1. A method for accelerating a deep learning model, wherein the method comprises:

acquiring a contribution value of each of multiple channels in each convolutional layer of the model;

pruning channels in the convolutional layers of the model according to the contribution value of each channel in all the convolutional layers, to obtain a pruned model;

training the unpruned model and the pruned model separately;

evaluating the trained unpruned model and the trained pruned model separately, to obtain a first evaluation value and a second evaluation value;

determining, according to the first evaluation value and the second evaluation value, whether to output the trained pruned model as a new model.

2. The method according to claim 1, wherein acquiring the contribution value of each of the multiple channels in each convolutional layer of the model specifically comprises:

multiplying the output of each channel in the convolutional layer by the back-propagation gradient of that channel, and then multiplying by the reciprocal of the number of samples, to obtain the contribution value of the channel in the convolutional layer.

3. The method according to claim 1, wherein pruning the channels in the convolutional layers of the model according to the contribution values of the channels in all the convolutional layers, to obtain the pruned model, specifically comprises:

sorting the contribution values of the channels of all the convolutional layers;

pruning the channels whose contribution value is not greater than a preset contribution value threshold, the retained channels being referred to as the first number of channels and the pruned channels as the second number of channels;

obtaining the number of retained channels of each convolutional layer;

when the number of retained channels of any convolutional layer is less than a preset channel number threshold, taking the value obtained by subtracting the number of retained channels of that convolutional layer from the preset channel number threshold as a third number, selecting the third number of channels from the second number of channels, and restoring them as retained channels of that convolutional layer;

obtaining the pruned model according to the first number of channels and the retained channels of that convolutional layer.

4. The method according to claim 1, wherein before sorting the contribution values of the channels of all the convolutional layers, the method further comprises:

obtaining the number of channels of each convolutional layer;

retaining, unchanged, the convolutional layers whose number of channels is less than the preset channel number threshold;

sorting the contribution values of the channels of the convolutional layers whose number of channels is greater than the preset channel number threshold.

5. The method according to claim 1, wherein the first evaluation value includes a first inference accuracy and a first inference speed, and the second evaluation value includes a second inference accuracy and a second inference speed; and determining, according to the first evaluation value and the second evaluation value, whether to output the trained pruned model as a new model specifically comprises:

when the difference between the first inference accuracy and the second inference accuracy is within a preset inference accuracy threshold range and the second inference speed is less than a preset inference speed, continuing to prune the trained pruned model; or,

when the difference between the first inference accuracy and the second inference accuracy is within the preset inference accuracy threshold range and the second inference speed is equal to the preset inference speed, outputting the trained pruned model.

6. A device for accelerating a deep learning model, wherein the device comprises:

an acquisition module, configured to acquire the contribution value of each of multiple channels in each convolutional layer of the model;

a pruning module, configured to prune channels in the convolutional layers of the model according to the contribution value of each channel in all the convolutional layers, to obtain a pruned model;

a training module, configured to train the unpruned model and the pruned model separately;

an evaluation module, configured to evaluate the trained unpruned model and the trained pruned model separately, to obtain a first evaluation value and a second evaluation value;

a determination module, configured to determine, according to the first evaluation value and the second evaluation value, whether to output the trained pruned model as a new model.

7. The device according to claim 6, wherein the acquisition module is specifically configured to:

multiply the output of each channel in the convolutional layer by the back-propagation gradient of that channel, and then multiply by the reciprocal of the number of samples, to obtain the contribution value of the channel in the convolutional layer.

8. The device according to claim 6, wherein the pruning module is specifically configured to:

sort the contribution values of the channels of all the convolutional layers;

prune the channels whose contribution value is not greater than a preset contribution value threshold, the retained channels being referred to as the first number of channels and the pruned channels as the second number of channels;

obtain the number of retained channels of each convolutional layer;

when the number of retained channels of any convolutional layer is less than a preset channel number threshold, take the value obtained by subtracting the number of retained channels of that convolutional layer from the preset channel number threshold as a third number, select the third number of channels from the second number of channels, and restore them as retained channels of that convolutional layer;

obtain the pruned model according to the first number of channels and the retained channels of that convolutional layer.

9. A device, wherein the device comprises a memory and a processor, the memory being configured to store a program, and the processor being configured to execute the method according to any one of claims 1-5 when running the program.

10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method according to any one of claims 1-5 is implemented.