CN109472352A

CN109472352A - A deep neural network model tailoring method based on feature map statistical features

Info

Publication number: CN109472352A
Application number: CN201811440153.2A
Authority: CN
Inventors: 周彦; 刘广毅; 王冬丽
Original assignee: Xiangtan University
Current assignee: Xiangtan University
Priority date: 2018-11-29
Filing date: 2018-11-29
Publication date: 2019-03-15

Abstract

The invention discloses a deep neural network model cutting method based on statistical features of feature maps. The implementation steps are as follows: Step 1. For the feature layers in the deep neural network model, calculate the statistical features of the feature maps corresponding to each output channel; wherein The feature layer consists of two parts: the convolution layer and the activation layer, or consists of three parts: the convolution layer, the normalization layer and the activation layer; step 2, according to the statistical features of the feature maps corresponding to each output channel in the feature layer, calculate The evaluation index of each output channel in the feature layer; Step 3, according to the evaluation index, determine the importance of each output channel in the feature layer, and remove the unimportant output channels and their corresponding parameters. The invention can effectively reduce the dimension of the feature layer of the neural network, improve the operation efficiency of the network model, and at the same time, the network scale can be reduced, and the impact on the accuracy is small.

Description

A deep neural network model tailoring method based on feature map statistical features

技术领域technical field

本发明属于人工智能、模式识别领域，具体涉及深度神经网络模型压缩。The invention belongs to the fields of artificial intelligence and pattern recognition, and in particular relates to deep neural network model compression.

背景技术Background technique

深度学习(deep learning)作在解决高级抽象认知问题上有着显著的成果，使人工智能上了一个新台阶，为高精度、多种类的目标检测、识别与跟踪提供了技术基础。但是复杂的运算，庞大的资源需求，使得神经网络只能在高性能计算平台上部署，限制了在移动设备上的应用。2015年，Han发表的Deep Compression将网络模型裁剪、权值共享和量化、编码等方式运用在模型压缩上，使得模型存储实现了很好的效果，也引起了研究人员对网络压缩方法的研究。目前深度学习模型压缩方法的研究主要可以分为以下几个方向：Deep learning has achieved remarkable results in solving high-level abstract cognitive problems, bringing artificial intelligence to a new level and providing a technical foundation for high-precision, multi-type target detection, recognition and tracking. However, complex computing and huge resource requirements make neural networks only deployed on high-performance computing platforms, limiting their application on mobile devices. In 2015, Deep Compression published by Han applied network model clipping, weight sharing, quantization, and encoding to model compression, which achieved good results in model storage and led researchers to study network compression methods. At present, the research on deep learning model compression methods can be mainly divided into the following directions:

(1)更精细模型的设计，使用更加细致、高效的模型设计，能够很大程度的减少模型尺寸，并且也具有不错的性能。(1) The design of finer models, using more detailed and efficient model design, can greatly reduce the size of the model, and also has good performance.

(2)模型裁剪，结构复杂的网络具有非常好的性能，其参数也存在冗余，因此对于已训练好的模型网络，通常是寻找一种有效的评判手段，来判断参数的重要性，将不重要的连接或者卷积核进行裁剪来减少模型的冗余。(2) Model tailoring. Networks with complex structures have very good performance, and their parameters are also redundant. Therefore, for a trained model network, an effective evaluation method is usually found to judge the importance of parameters. Unimportant connections or convolution kernels are cropped to reduce model redundancy.

(3)核的稀疏化，在训练过程中，对权重的更新进行诱导，使其更加稀疏，对于稀疏矩阵，可以使用更加紧致的存储方式，但是使用稀疏矩阵操作在硬件平台上运算效率不高，容易受到带宽的影响，因此加速并不明显。(3) The sparseness of the kernel. During the training process, the update of the weight is induced to make it more sparse. For the sparse matrix, a more compact storage method can be used, but the operation efficiency of the sparse matrix operation on the hardware platform is not high. High, vulnerable to bandwidth, so the speedup is not noticeable.

对预训练好的网络模型进行裁剪的方法，是目前模型压缩中使用最多的方法，通常是寻找一种有效的评判指标，来判断神经元或是特征图的重要性，将不重要的连接或者卷积核进行裁剪来减少模型的冗余。Li提出了基于量级的裁剪方式，用权重的绝对值之和来评判其重要性，以卷积核中所有权重的绝对值之和来作为该卷积核的评价指标。Hu定义了一个变量APoZ(Average Percentage of Zeros)来衡量每一个卷积核中激活为0的值的数量，作为评价一个卷积核是否重要的标准。Luo提出了一种基于熵值的裁剪方式，利用熵值来判定卷积核的重要性。Anwar采取一种随机裁剪的方式，然后对于每一种随机方式统计模型的性能，来确定局部最优的裁剪方式。Tian经过LDA分析发现对于每一个类别，有很多卷积核之间的激活是高度不相关的，因此可以利用这点来剔除大量的只具有少量信息的filter而不影响模型的性能。The method of trimming the pre-trained network model is currently the most used method in model compression. It is usually to find an effective evaluation index to judge the importance of neurons or feature maps, and connect unimportant or The convolution kernel is pruned to reduce the redundancy of the model. Li proposed a clipping method based on magnitude, using the sum of the absolute values of the weights to judge its importance, and using the sum of the absolute values of all the weights in the convolution kernel as the evaluation index of the convolution kernel. Hu defines a variable APoZ (Average Percentage of Zeros) to measure the number of values activated to 0 in each convolution kernel, as a criterion for evaluating whether a convolution kernel is important. Luo proposed a cropping method based on entropy value, which uses entropy value to determine the importance of convolution kernels. Anwar adopts a random clipping method, and then counts the performance of the model for each random method to determine the locally optimal clipping method. Tian's LDA analysis found that for each category, the activations between many convolution kernels are highly irrelevant, so this can be used to eliminate a large number of filters with only a small amount of information without affecting the performance of the model.

综上所述，现有方案的局限性如下：To sum up, the limitations of existing solutions are as follows:

a.核的稀疏化方法仅仅考虑到网络的压缩存储，在运行过程中的压缩效果不明显，速度也没有明显的提升；a. The sparse method of the kernel only considers the compressed storage of the network, the compression effect during operation is not obvious, and the speed is not significantly improved;

b.用权重的大小作为评判指标仅仅考虑了权重本身的数值特性，没有考虑网络层的数据特征，压缩的效果不高；b. Using the size of the weight as the evaluation index only considers the numerical characteristics of the weight itself, and does not consider the data characteristics of the network layer, and the compression effect is not high;

c.评判指标的计算的复杂程度较高，耗费了较多的计算能力；c. The calculation of the evaluation index is more complicated and consumes a lot of computing power;

d.随机裁剪的方法随机性太强，容易破坏网络本身的参数特征；d. The random cropping method is too random, and it is easy to destroy the parameter characteristics of the network itself;

因此，有必要提供一种计算方法简单，能充分考虑网络中冗余，适用性强，不依赖专用加速库，实现深度神经网络压缩和加速的方法。Therefore, it is necessary to provide a method that is simple in calculation method, can fully consider the redundancy in the network, has strong applicability, and does not rely on a dedicated acceleration library to achieve deep neural network compression and acceleration.

发明内容SUMMARY OF THE INVENTION

为了解决现有技术的缺陷，本发明公开了一种基于特征图统计的深度神经网络模型裁剪方法。与其他压缩方法相比，本发明通过将神经网络特征层的多种统计特征作为评判标准，对网络的参数层进行裁剪，充分考虑了网络的数值特征和统计特性，取得了很好的压缩效率，同时提升了运行速度。In order to solve the defects of the prior art, the present invention discloses a deep neural network model trimming method based on feature map statistics. Compared with other compression methods, the present invention cuts the parameter layer of the network by using various statistical features of the feature layer of the neural network as the evaluation criteria, fully considers the numerical and statistical characteristics of the network, and achieves good compression efficiency. , while improving the running speed.

本发明所采用的技术方案为：The technical scheme adopted in the present invention is:

一种基于特征图统计特征的深度神经网络模型裁剪方法，包括以下步骤：A deep neural network model clipping method based on statistical features of feature maps, comprising the following steps:

步骤1、针对深度神经网络模型中的特征层，计算其各个输出通道对应的特征图的统计特征；其中特征层由卷积层和激活层两部分组成，或是由卷积层、归一化层(BatchNorm层)和激活层三部分组成；本发明只对后面还有储存参数的层(特征层/全连接层)的特征层进行计算；Step 1. For the feature layer in the deep neural network model, calculate the statistical features of the feature map corresponding to each output channel; the feature layer is composed of two parts: the convolution layer and the activation layer, or the convolution layer, normalization layer The layer (BatchNorm layer) and the activation layer are composed of three parts; the present invention only calculates the feature layer of the layer (feature layer/full connection layer) that stores parameters later;

步骤2、根据特征层中各输出通道对应的特征图的统计特征，计算特征层中各输出通道的评判指标；Step 2. Calculate the evaluation index of each output channel in the feature layer according to the statistical features of the feature map corresponding to each output channel in the feature layer;

步骤3、根据评判指标判断特征层中各输出通道的重要性，将不重要的输出通道及其对应的参数进行移除。Step 3: Determine the importance of each output channel in the feature layer according to the evaluation index, and remove the unimportant output channels and their corresponding parameters.

在深度神经网络的一个迭代步(epoch)中，样本分不同的批次送入神经网络进行计算。所述步骤2中，对网络模型中各特征层各个输出通道对应的特征图，分批次进行特征统计；对于第i个特征层，其所有输出通道对应的特征图的统计特征包括一个均值向量和一个标准差向量S_v_i，计算步骤如下：In an iterative step (epoch) of the deep neural network, the samples are sent to the neural network in different batches for calculation. In the step 2, the feature maps corresponding to each output channel of each feature layer in the network model are subjected to feature statistics in batches; for the i-th feature layer, the statistical features of the feature maps corresponding to all output channels include a mean vector. and a standard deviation vector S_v _i , the calculation steps are as follows:

S11：初始化；设置N_sum用来统计已经处理过的样本数量，初始化为0；N_batch为批次数量，N_batch＝ceil(样本总数/N)，N为一个批次中样本的数量，ceil(·)是向上取整函数；n_batch是当前批次的计数，初始化为1；初始化均值向量和标准差向量S_v_i为一个1×C_i的零向量，其中，C_i为第i个特征层的输出通道数量；S11: Initialization; set N _sum to count the number of samples that have been processed, initialized to 0; N _batch is the number of batches, N _batch = ceil (total number of samples/N), N is the number of samples in a batch, ceil ( ) is the round-up function; n _batch is the count of the current batch, initialized to 1; initialized mean vector The sum standard deviation vector S_v _i is a 1×C _i zero vector, where C _i is the number of output channels of the ith feature layer;

S12：将第n_batch个批次样本对应的第i个特征层的输出(features maps)X_i表示为一个大小为N×C_i×H_i×W_i的四维张量，其中，H_i和W_i分别为第i个特征层的输出通道对应的特征图(feature map)的高和宽；该批次中第k个样本对应的该特征层第j个输出通道特征图(feature map)X_ikj为一个大小为H_i×W_i的二维矩阵，k＝1，2，…，N，j＝1，2，…，C_i；S12: Represent the output (features maps) X _i of the i-th feature layer corresponding to the n-th _batch of samples as a four-dimensional tensor of size N×C _i ×H _i ×W _i , where H _i and Wi are the height and width of the feature map (feature map) corresponding to the output channel of the _ith feature layer; the feature map (feature map) X of the jth output channel of the feature layer corresponding to the kth sample in the batch _ikj is a two-dimensional matrix of size H _i ×W _i , k=1, 2, . . . , N, j=1, 2, . . . , C _i ;

S13：将X_i进行维度变换(view或reshape)，得到大小为N×C_i×(H_i×W_i)的三维张量X^* _i，即把X_i中每一个特征图X_ikj由二维矩阵拉伸为一维向量X^* _ikj；S13: Perform dimension transformation (view or reshape) on X _i to obtain a three-dimensional tensor X ^* _i of size N×C _i ×(H _i ×W _i ), that is, transform each feature map X _ikj in X _i by two The dimensional matrix is stretched to a one-dimensional vector X ^* _ikj ;

S14：计算X^* _ikj的统计特征，包括均值和标准差S_ikj(两者均为标量)：S14: Compute statistical features of X ^* _ikj , including mean and standard deviation S _ikj (both scalars):

其中，表示X^* _kj中第m个元素；in, represents the mth element in X ^* _kj ;

S15：由k＝1，2，…，N，j＝1，2，…，C_i构成一个大小N×C_i为均值矩阵由S_ikj，k＝1，2，…，N，j＝1，2，…，C_i构成一个大小N×C_i为标准差矩阵S_m_i；S15: by k=1, 2, ..., N, j=1, 2, ..., C _i forms a matrix of size N × C _i is the mean value A size N×C _i is a standard deviation matrix S_m _i formed by S _ikj , k=1, 2,..., N, j=1, 2,..., C _i ;

S16：对均值矩阵和标准差矩阵S_m_i按通道进行均值化处理(均值滤波)，处理公式如下：S16: For the mean matrix And the standard deviation matrix S_m _i is averaged by channel (average filter), and the processing formula is as follows:

N_sum＝N_sum+N (5)N _sum = N _sum + N (5)

其中，是中第k行对应的行向量；是S_m_i中第k行对应的行向量；in, Yes The row vector corresponding to the kth row in ; is the row vector corresponding to the kth row in S_m _i ;

S17：判断当前批次是否为最后一个批次，即是否有n_batch＝N_batch，若是，则终止批次循环，当前的均值向量和标准差向量S_v_i即为该特征层所有输出通道对应的特征图的统计特征；否则更新当前批次的计数：n_batch＝n_batch+1，跳转到S12继续执行。S17: Determine whether the current batch is the last batch, that is, whether there are n _batch = N _batch , if so, terminate the batch cycle, and the current mean vector The sum standard deviation vector S_v _i is the statistical feature of the feature map corresponding to all output channels of the feature layer; otherwise, update the count of the current batch: n _batch = n _batch +1, and jump to S12 to continue execution.

进一步地，所述步骤2中，第i个特征层中第j个输出通道的评判指标的计算公式如下：Further, in the step 2, the evaluation index of the jth output channel in the ith feature layer The calculation formula is as follows:

其中，和S_v_ij是该特征层对应的均值向量和标准差向量S_v_i中第j个元素，表示该特征层中第j个输出通道对应的特征图的均值和标准差，α和β是两个比例因子(超参数)，α是均值的阈值，当均值随着值变小朝着负无穷的方向移动；反之当均值则随着值变大向零方向移动；β是标准差S_v_ij的阈值，当标准差S_v_ij＜β，随着S_v_ij值变小向零方向移动；反之，则随着S_v_ij值变大向正无穷移动。当均值和标准差S_v_ij的值较小时，α-子项起主导作用；当均值和标准差S_v_ij的值较大时，β-子项起主导作用。α和β的确定有两种方法，其一为根据经验值指定，超参数值的范围由低到高设定，根据每次设定的值计算评判指标，从而对神经网络模型进行裁剪，并对裁剪之后的模型重新训练以恢复精度，逐渐达到一个最优的效果(即在网络精度下降不超过设定阈值的情况下裁剪掉的通道数目最多)；其二为比例缩放，即和β＝η∑S_v_ij/C_i，μ和η是比例因子，取值范围为(0，0.4)，该方法可以根据网络的参数进行动态的调整。ε是一个极小值，防止除数为零的情况。in, and S_v _ij is the mean vector corresponding to the feature layer and the jth element in the standard deviation vector S_v _i , representing the mean and standard deviation of the feature map corresponding to the jth output channel in the feature layer, α and β are two scale factors (hyperparameters), α is the mean threshold, when the mean along with The value becomes smaller and moves in the direction of negative infinity; otherwise, it is the mean value but along with The value becomes larger and moves to the zero direction; β is the threshold of the standard deviation S_v _ij , when the standard deviation S_v _ij <β, As the value of S_v _ij becomes smaller, it moves to the zero direction; otherwise, the Move toward positive infinity as the value of S_v _ij increases. when the mean When the value of and standard deviation S_v _ij is small, the α-subterm plays a leading role; when the mean When the value of the standard deviation S_v _ij is large, the β-subterm plays a leading role. There are two ways to determine α and β. One is to specify according to the empirical value. The range of hyperparameter values is set from low to high, and the evaluation index is calculated according to the value set each time, so as to tailor the neural network model, and Retrain the cropped model to restore the accuracy, and gradually achieve an optimal effect (that is, the maximum number of channels cropped when the network accuracy does not drop beyond the set threshold); the second is scaling, that is and β=η∑S_v _ij /C _i , μ and η are scale factors, the value range is (0, 0.4), the method can be dynamically adjusted according to the parameters of the network. ε is a minimal value that prevents division by zero.

进一步地，所述步骤3中，对于第i个特征层L_i，若则判定该特征层中第j个输出通道不重要，将该通道及其对应的参数进行移除。Further, in the step 3, for the _i -th feature layer Li, if Then it is determined that the jth output channel in the feature layer is not important, and the channel and its corresponding parameters are removed.

进一步地，对于第i个特征层L_i，移除其中不重要的通道及其对应的参数的步骤如下：Further, for the _ith feature layer Li, the steps of removing unimportant channels and their corresponding parameters are as follows:

S31：记录对应的评判标准的通道的集合R_i，集合R_i中的元素个数记为length(R_i)；S31: Record the corresponding evaluation criteria The set R _i of the channels of , the number of elements in the set R _i is recorded as length(R _i );

S32：将特征层L_i的卷积核W_i表示为一个大小为C_i-1×C_i×K_hi×K_wi的四维张量，卷积核W_i对应的偏置B_i为一个大小为1×C_i的向量，其中C_i-1表示上一个特征层L_i-1的输出通道数，若L_i为第一个特征层，则C_i-1表示样本输入的通道数，K_hi和K_wi分别为卷积核的高度和宽度；移除卷积核W_i中属于集合R_i的通道对应的元素，形成一个新的卷积核其大小为用新的卷积核来代替卷积核W_i；移除偏置B_i中属于集合R_i的通道对应的元素，形成一个新的偏置其大小为 S32: Represent the convolution kernel Wi of the feature layer _{Li as a four-dimensional tensor of size C i-1 ×C i ×K hi ×K wi} _, _and _the _bias _B _i _{corresponding} to the convolution kernel Wi is a size is a 1×C _i vector, where C _i-1 represents the number of output channels of the previous feature layer _Li-1 , if Li is the first feature layer, then C _i _-1 represents the number of sample input channels, K _hi and _Kwi are the height and width of the convolution kernel respectively; remove the elements corresponding to the channels belonging to the set _Ri in the _convolution kernel Wi to form a new convolution kernel Its size is with a new convolution kernel to replace the convolution kernel _Wi ; remove the elements corresponding to the channels belonging to the set _Ri in the bias _Bi to form a new bias Its size is

S33：若特征层L_i的下一层L_i+1还是特征层，则将下一层L_i+1的卷积核W_i+1表示为一个大小为C_i×C_i+1×K_h(i+1)×K_w(i+1)的四维张量，其中C_i+1表示下一层L_i+1的输出通道数，K_h(i+1)和K_w(i+1)分别为卷积核W_i+1的高度和宽度；；移除卷积核W_i+1中属于集合R_i的通道对应的元素，形成一个新的卷积核其大小为用新的卷积核来代替卷积核W_i+1；S33: If the next layer _Li+1 of the feature layer Li is still a feature layer, express the convolution kernel Wi ₊₁ of the next layer Li ₊₁ as a size C _i ×C _i ₊₁ ×K A four-dimensional tensor of _h(i+1) ×K _w(i+1) , where C _i+1 represents the number of output channels of the next layer L _i+1 , K _h(i+1) and K _{w(i+ 1)} are the height and width of the convolution kernel Wi ₊₁ respectively; ; Remove the elements corresponding to the channels belonging to the set R _i in the convolution kernel Wi ₊₁ to form a new convolution kernel Its size is with a new convolution kernel to replace the convolution kernel Wi ₊₁ ;

S34：若特征层L_i的下一层L_i+1是全连接层，则将下一层L_i+1的参数V_i+1表示为一个大小(C_i×K_hi×K_wi)×C_i+1的矩阵；移除参数V_i+1中属于集合R_i的通道对应的元素，形成一个新的参数其大小为用新的参数来代替参数V_i+1。S34: If the next layer _Li+1 of the feature layer Li is a fully connected layer, express the parameter Vi ₊₁ of the next layer _{Li+1 as a size (C i} _× _K _hi ×K _wi )× The matrix of C _i+1 ; remove the elements corresponding to the channels belonging to the set R _i in the parameter V _i+1 to form a new parameter Its size is with new parameters to replace the parameter V _i+1 .

深度神经网络在完成了模型裁剪之后，需要经过几次迭代来重新训练使得网络的精度得以恢复，迭代的次数与修剪的特征层、以及评判准则有关。修剪的特征层靠近输入层，迭代的次数较少；修剪的特征层靠近输出层，迭代的次数较多。评判准则中，α和β的值越高，需要更多的迭代次数来恢复网络的精度。After completing the model pruning, the deep neural network needs to undergo several iterations to retrain to restore the accuracy of the network. The number of iterations is related to the pruned feature layer and the evaluation criteria. The pruned feature layer is close to the input layer, and the number of iterations is less; the pruned feature layer is close to the output layer, and the number of iterations is more. In the evaluation criteria, the higher the values of α and β, the more iterations are required to restore the accuracy of the network.

有益效果：Beneficial effects:

与已有技术相比，本发明充分利用了深度神经网络的统计特征，构建了基于均值和标准差的评判指标，能够有效降低神经网络特征层的维度，提升了深度神经网络的训练速度，减少了深度神经网络的框架规模和权重数量，提高了深度神经网络的运行速度/效率,并且对精度产生的影响较小。具体有以下的特点与效果：Compared with the prior art, the present invention makes full use of the statistical features of the deep neural network, and constructs an evaluation index based on the mean and standard deviation, which can effectively reduce the dimension of the feature layer of the neural network, improve the training speed of the deep neural network, and reduce the The frame size and number of weights of the deep neural network are improved, and the running speed/efficiency of the deep neural network is improved, and the impact on the accuracy is small. It has the following features and effects:

第一、本发明在构造卷积核裁剪的评判指标时，考虑了神经网络的统计特征，利用特征层的均值和标准差来兼顾神经网络值的特征和特征层内的特征。通过特征层的数据特征反映卷积核参数的作用效果，进而将表现不良的特征图及相对应的卷积核裁剪掉，实现了网络模型框架的缩小和参数量的压缩。First, the present invention considers the statistical characteristics of the neural network when constructing the evaluation index for convolution kernel clipping, and uses the mean and standard deviation of the feature layer to take into account the features of the neural network value and the features in the feature layer. The data features of the feature layer reflect the effect of the parameters of the convolution kernel, and then the feature maps and corresponding convolution kernels with poor performance are cut out, which realizes the reduction of the network model framework and the compression of the parameter amount.

第二、评判准则公式中，可以灵活的设定超参数α和β，来改变移除的通道数目；当统计特征落在靠近0的地方，α-子项会起到主导作用；相反，当统计特征落在远离0的区域，β-子项会起到主导作用。Second, in the criterion formula, hyperparameters α and β can be flexibly set to change the number of removed channels; when the statistical features fall close to 0, the α-sub-item will play a leading role; on the contrary, when Statistical features fall in the region far from 0, and the β-subterm will play a dominant role.

第三、本发明在充分考虑了神经网络的统计特征的基础上，提出了新的评判准则，算法的复杂度较低，性能较好，可以在实时网络以及嵌入式设备上进行部署。Third, the present invention proposes a new evaluation criterion on the basis of fully considering the statistical characteristics of the neural network. The algorithm has lower complexity and better performance, and can be deployed on real-time networks and embedded devices.

附图说明Description of drawings

图1是本发明的流程示图；Fig. 1 is the flow chart of the present invention;

图2是特征层的结构示意图；FIG. 2 is a schematic structural diagram of a feature layer;

图3是特征层内部结构示意图；3 is a schematic diagram of the internal structure of the feature layer;

图4是神经网络中特征层的存在形式示例；图4(a)是多个连续的特征层，图4(b)是单个特征层；Fig. 4 is an example of the existence form of the feature layer in the neural network; Fig. 4(a) is a plurality of continuous feature layers, and Fig. 4(b) is a single feature layer;

图5是本发明设计的总体框图；Fig. 5 is the overall block diagram designed by the present invention;

图6是本发明模型挑选与裁剪的示意图；Fig. 6 is the schematic diagram of model selection and cutting of the present invention;

具体实施方式Detailed ways

下面结合具体实施例对本发明进行详细说明，以下实施例将有助于本领域的技术人员进一步理解本发明。下面通过参考附图描述的实例是示例性的，旨在用于解释本发明，而不能理解为对本发明的限制。The present invention will be described in detail below with reference to specific embodiments, and the following embodiments will help those skilled in the art to further understand the present invention. The examples described below with reference to the accompanying drawings are exemplary and are intended to explain the present invention, but should not be construed as limiting the present invention.

图1是本实例的一种基于特征图统计的深度神经网络模型裁剪方法流程示意图，通过对特定的特征图及相对应的卷积核进行移除，实现网络模型框架的缩小和参数量的压缩，针对模型的裁剪方法，具体实现步骤为：Figure 1 is a schematic flowchart of a deep neural network model clipping method based on feature map statistics in this example. By removing specific feature maps and corresponding convolution kernels, the network model framework is reduced and the parameter amount is compressed. , for the model cutting method, the specific implementation steps are:

(1)针对深度神经网络中的特征层，依次计算特征层中每个特征图的统计特征；(1) For the feature layer in the deep neural network, calculate the statistical features of each feature map in the feature layer in turn;

(2)根据统计特征构造评判准则；(2) Construct judging criteria according to statistical characteristics;

(3)将不符合评判准则的特征图及其相对应的卷积核进行移除。(3) The feature maps that do not meet the evaluation criteria and their corresponding convolution kernels are removed.

需要说明的是，本发明实例的操作对象是已经训练收敛的深度神经网络的特征层，其中特征层是由卷积层和激活层(或可描述为激活函数、非线性层)两部分组合而成，或是由卷积层、BatchNorm层(归一化层)和激活层三部分组合而成，如图2所示。对于网络的类型及模块，包括但不限于卷积层，批归一化层，激活层，全连接层，Resnet模块。在深度神经网络框架中，特征层的内部结构示意图如图3所示，第i个特征层为L_i，第i个特征层的卷积核为W_i，卷积核W_i对应的偏置为B_i。本发明只对后面还有储存参数的层(特征层/全连接层)的特征层进行计算，比如图4(a)中除最后一个特征层之外的其它特征层，而不对后面没有储存参数的层的特征层进行计算，比如图4(b)中的特征层，即若特征层后面只有池化层、归一化层、激活层或softmax层，则不对这种特征层进行操作。It should be noted that the operation object of the example of the present invention is the feature layer of the deep neural network that has been trained and converged, wherein the feature layer is a combination of the convolution layer and the activation layer (or can be described as an activation function, a nonlinear layer) two parts. It is composed of three parts: convolution layer, BatchNorm layer (normalization layer) and activation layer, as shown in Figure 2. For network types and modules, including but not limited to convolutional layers, batch normalization layers, activation layers, fully connected layers, and Resnet modules. In the deep neural network framework, the schematic diagram of the internal structure of the feature layer is shown in Figure 3. The _ith feature layer is Li, the convolution kernel of the _ith feature layer is Wi _, and the bias corresponding to the convolution kernel Wi is B _i . The present invention only calculates the feature layer of the layer (feature layer/fully connected layer) that has stored parameters in the back, such as other feature layers except the last feature layer in Fig. 4(a), but does not store parameters later. The feature layer of the layer is calculated, such as the feature layer in Figure 4(b), that is, if there is only a pooling layer, a normalization layer, an activation layer or a softmax layer behind the feature layer, this feature layer will not be operated.

在深度神经网络的一个迭代步(epoch)中，样本分不同的批次送入神经网络进行计算。对网络模型中特征图的统计特征，分批次进行特征统计，以下以第i个特征层L_i为例进行说明，实现步骤如下：In an iterative step (epoch) of the deep neural network, the samples are sent to the neural network in different batches for calculation. For the statistical features of the feature map in the network model, feature statistics are performed in batches. The following takes the _i -th feature layer Li as an example to illustrate, and the implementation steps are as follows:

S31：初始化中间变量。N_sum用来统计已经处理过的样本数量，初始化为0；N_batch为批次数量，N_batch＝ceil(样本总数/N)，N是一个批次中样本的数量，ceil(·)是向上取整函数；n_batch是当前批次的计数，初始化为0；均值和标准差S_ikj是标量，k＝1，2，…，N，j＝1，2，…，C_i，初始化为0；均值矩阵和标准差矩阵S_m_i的初始化为大小是(N，C_i)的零矩阵，而均值向量和标准差向量S_v_i的初始化为大小是(1，C_i)的零向量，C_i为特征层L_i的输出通道数量。S31: Initialize intermediate variables. N _sum is used to count the number of samples that have been processed, initialized to 0; N _batch is the number of batches, N _batch = ceil (total number of samples/N), N is the number of samples in a batch, ceil( ) is upward Rounding function; n _batch is the count of the current batch, initialized to 0; mean and standard deviation S _ikj are scalars, k = 1, 2, ..., N, j = 1, 2, ..., C _i , initialized to 0; mean matrix and the standard deviation matrix S_m _i is initialized as a zero matrix of size (N, C _i ), while the mean vector The sum standard deviation vector S_v _i is initialized as a zero vector of size (1, C _i ), where C _i is the number of output channels of the feature layer _Li .

S32：将第n_batch个批次样本对应的第i个特征层的L_i输出进行view或reshape(维度变换)得到大小由(N，C_i，H_i，W_i)变为(N，C_i，H_i*W_i)，相当于把二维的特征图X_ikj拉伸为一维的表示X^* _ikj，k∈[1，N]是第i个特征层L_i中样本的索引，j∈[1，C_i]是第i个特征层L_i中第j个通道的特征图的索引，需要强调的是特征图X_ikj表示第i个特征层L_i中第j个通道的(H_i，W_i)大小的元素集合，特征图X^* _ikj表示特征层中第j个通道的(H_i*W_i)大小的元素集合；S32: Perform view or reshape (dimension transformation) on the Li output of the _i -th feature layer corresponding to the n-th _batch of samples to obtain The size is changed from (N, C _i , H _i , Wi ) to (N, C _i , H _i *W _i ₎ , which is equivalent to stretching the two-dimensional feature map X _ikj into a one-dimensional representation X ^* _ikj , k∈[1,N] is the index of the sample in the ith feature layer Li, j∈[1,C _i ] is the index of the feature map of the jth channel in the _ith feature layer _Li , it needs to be emphasized is the feature map X _ikj represents the element set of size (H _i , Wi ) of the j-th channel in the _i - _th feature layer Li, and the feature map X ^* _ikj represents the feature layer The set of elements of size (H _i *W _i ) of the jth channel in ;

S33：计算特征层中第j个通道特征图X^* _ikj的统计特征：均值和标准差S_ikj：S33: Calculate the feature layer Statistical features of the jth channel feature map X ^* _ikj in: mean and standard deviation S _ikj :

针对特征层其中的任一特征图X^* _ikj，均采用均值和标准差S_ikj作为特征图X^* _ikj的统计特征。在特征层的一个批次中，可产生对应大小为(N，C_i)的均值矩阵和标准差矩阵S_m_i；For feature layer For any feature map X ^* _ikj , the mean value is used and standard deviation S _ikj as statistical features of feature map X ^* _ikj . at the feature level In a batch of , a mean matrix of size (N, C _i ) can be generated and standard deviation matrix S_m _i ;

S34：对均值矩阵和标准差矩阵S_m_i按通道进行均值化处理，其实现如下：S34: For the mean matrix and the standard deviation matrix S_m _i are averaged by channel, which is implemented as follows:

N_sum＝N_sum+N (5)N _sum = N _sum + N (5)

其中，N_sum是前n_batch批样本数量的叠加，用来统计已经处理过的样本数量；N是第n_batch个批次的样本数量；是均值矩阵进行均值滤波后的结果，是第k个样本对应的所有通道的均值；S_v_i是标准差矩阵S_m_i进行均值滤波后的结果，是第k个样本对应的所有通道的标准差。Among them, N _sum is the superposition of the number of samples in the first n _{batches, which is used to count the number of samples that have been processed; N is the number of samples in the nth batch} _; is the mean matrix The result after mean filtering, is the mean of all channels corresponding to the kth sample; S_v _i is the result of mean filtering by the standard deviation matrix S_m _i , is the standard deviation of all channels corresponding to the kth sample.

S35：更新当前批次：n_batch＝n_batch+1，如果n_batch＝N_batch，那么终止批次循环；否则均值和标准差S_ikj置为0；均值矩阵和标准差矩阵S_m_i重置为零矩阵，而均值向量和标准差向量S_v_i也重置为零向量，跳转到S32继续执行。根据上述批次迭代得到的均值向量和标准差向量S_v_i，对特征层L_i的第j个通道的评判指标计算过程如下：S35: Update the current batch: n _batch = n _batch +1, if n _batch = N _batch , then terminate the batch cycle; otherwise, the mean and standard deviation S _ikj set to 0; mean matrix and standard deviation matrix S_m _i reset to zero matrix, while mean vector The sum standard deviation vector S_v _i is also reset to the zero vector, and jumps to S32 to continue the execution. The mean vector obtained by iterating over the above batches and the standard deviation vector S_v _i , the evaluation index for the _jth channel of the feature layer Li The calculation process is as follows:

其中，和S_v_ij是特征层L_i中第j个通道的均值和标准差，α和β是两个超参数，这两个比例因子用来限定均值和标准差S_v_ij的起作用范围，ε是一个极小值，防止除数为零的存在。在超参数较小的情况下，采用比例缩放，即和β＝η∑S_v_ij/C_i，μ和η是比例因子，取值范围为(0，0.4)，当超参数变大之后，采用逐步逼近的方法，超参数值的范围由低到高，逐渐达到一个最优的效果。in, and S_v _ij are the mean and standard deviation of the _jth channel in the feature layer Li, α and β are two hyperparameters, these two scale factors are used to define the mean And the working range of standard deviation S_v _ij , ε is a minimum value, preventing the existence of division by zero. In the case of small hyperparameters, scaling is used, i.e. and β=η∑S_v _ij /C _i , μ and η are scale factors, the value range is (0, 0.4), when the hyperparameter becomes larger, the step-by-step approximation method is adopted, and the range of the hyperparameter value is from low to high , and gradually achieve an optimal effect.

移除不符合评判准则的特征图及其相关联的参数，针对C_i个输出通道的特征层L_i而言，计算得到评判指标记录对应的评判指标的通道的集合R_i，移除相对应的通道的步骤如下：Remove the feature maps that do not meet the evaluation criteria and their associated parameters, and calculate the evaluation index for the feature layer _Li of the C _i output channels Record the corresponding evaluation indicators The set of channels R _i of , the steps of removing the corresponding channel are as follows:

S71：特征层L_i的卷积核W_i的大小为(C_i-1，C_i，K_hi，K_wi)，C_i-1表示上一个特征层L_i-1的输出通道数或样本输入的通道数(如果特征层L_i为第一个特征层)，C_i为当前特征层L_i的输出通道数，K_hi，K_wi为卷积核的尺寸。构建一个新的卷积核其大小为表示从C_i中减去集合R_i包含的元素数量后的通道数(C_i-length(R_i))。将卷积核W_i的第1个维度上切取不属于集合R_i的通道对应的元素，复制到新的卷积核中，然后用新的卷积核来代替卷积核W_i；构建一个新的偏置其大小为将偏置B_i中切取不属于集合R_i的通道对应的元素，复制到新的偏置然后用新的偏置来代替偏置B_i；S71: The size of the convolution kernel Wi of the feature layer Li is (C _i _-1 , C _i , K _hi , K _wi ), and C _i _-1 represents the number of output channels or samples of the previous feature layer _Li-1 The number of input channels (if the feature layer _{Li is the first feature layer), C i is the number of output channels of the current feature layer Li, K hi , K wi} _are _the _size _of the convolution kernel. Build a new convolution kernel Its size is Indicates the number of channels (C _i -length(R _i )) after subtracting the number of elements contained in the set R _i from C _i . Cut out the elements corresponding to the channels that do not _belong to the set _Ri from the first dimension of the convolution kernel Wi, and copy them to the new convolution kernel , and then use the new convolution kernel to replace the convolution kernel _Wi ; construct a new bias Its size is Cut out the elements corresponding to the channels that do not belong to the set _Ri from the offset B _i and copy them to the new offset then use the new bias instead of bias _Bi ;

S72：如果特征层L_i之后还存在特征层的话，对于下一个特征层L_i+1的卷积核W_i+1的大小为(C_i，C_i+1，K_h(i+1)×K_w(i+1))，C_i+1表示下一个特征层L_i+1的输出通道数。构建一个新的卷积核其大小为将卷积核W_i+1的第1个维度上切取不属于集合R_i的通道对应的元素，复制到新的卷积核中，然后用新的卷积核来代替卷积核W_i+1；S72: If there is a feature layer after the feature layer _{Li, the size of the convolution kernel Wi+1 for the next feature layer Li+1 is (C i} _, _C _i ₊₁ , K _h(i+1) ×K _w(i+1) ), C _i+1 represents the number of output channels of the next feature layer _Li+1 . Build a new convolution kernel Its size is Cut out the elements corresponding to the channels that do not belong to the set R _i from the first dimension of the convolution kernel Wi ₊₁ , and copy them to the new convolution kernel , and then use the new convolution kernel to replace the convolution kernel Wi ₊₁ ;

S73：如果特征层L_i之后是全连接层，连接层的输出通道数量为C_i+1，对应的参数大小为(C_i×K_hi×K_wi，C_i+1)。构建一个新的参数其大小为将参数V_i+1的第1个维度上切取不属于集合R_i的通道对应的元素，复制到新的参数中，然后用新的参数来代替参数V_i+1；S73: If the feature layer Li is followed by a fully connected layer, the number of output channels of the connection layer is C _i ₊₁ , and the corresponding parameter size is (C _i ×K _hi ×K _wi , C _i+1 ). build a new parameter Its size is Cut out the elements corresponding to the channels that do not belong to the set R _i from the first dimension of the parameter V _i+1 , and copy them to the new parameter , then use the new parameters to replace the parameter V _i+1 ;

进一步地，深度神经网络在完成了模型裁剪之后，需要经过几次迭代来重新训练使得网络的精度得以恢复，迭代的次数与修剪的特征层、以及评判准则有关。修剪的特征层靠近输入层，迭代的次数较少；修剪的特征层靠近输出层，迭代的次数较多。评判准则中，α和β的值越高，需要更多的迭代次数来恢复网络的精度。Further, after completing the model pruning, the deep neural network needs to undergo several iterations to retrain to restore the accuracy of the network. The number of iterations is related to the pruned feature layer and the evaluation criteria. The pruned feature layer is close to the input layer, and the number of iterations is less; the pruned feature layer is close to the output layer, and the number of iterations is more. In the evaluation criteria, the higher the values of α and β, the more iterations are required to restore the accuracy of the network.

本技术领域的普通技术人员可以理解实现上述实施方法携带的全部或部分步骤是可以通过程序指令来实现，所述的程序可以存储于一种计算机可读存储介质中。以上对本发明的具体实施进行了描述。应当理解的是，本发明对特征层和全连接层进行压缩操作，因此，凡在本发明的精神实质与原理之内所做的任何修改、等同替换、改进等，均应包含在本发明保护的范围之内。Those skilled in the art can understand that all or part of the steps carried by the above implementation method can be implemented by program instructions, and the program can be stored in a computer-readable storage medium. The specific implementation of the present invention has been described above. It should be understood that the present invention compresses the feature layer and the fully connected layer. Therefore, any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection of the present invention. within the range.

Claims

1. a deep neural network model clipping method based on feature map statistical features, is characterized in that, in order to improve the compression efficiency and acceleration performance of the network, for the optimization carried out by deep neural network, the following steps are implemented:

Step 1. For the feature layer in the deep neural network model, calculate the statistical features of the feature map corresponding to each output channel; the feature layer is composed of two parts: the convolution layer and the activation layer, or the convolution layer, normalization layer The layer and the activation layer are composed of three parts;

Step 2. Calculate the evaluation index of each output channel in the feature layer according to the statistical features of the feature map corresponding to each output channel in the feature layer;

Step 3: Determine the importance of each output channel in the feature layer according to the evaluation index, and remove the unimportant output channels and their corresponding parameters.

2. the deep neural network model clipping method based on feature map statistical feature according to claim 1, is characterized in that, in described step 2, to the feature map corresponding to each output channel of each feature layer in the deep neural network model, divide Perform feature statistics in batches; for the i-th feature layer, the statistical features of the feature maps corresponding to all output channels include a mean vector and a standard deviation vector S_v _i , the calculation steps are as follows:

S11: Initialization; set N _sum to count the number of samples that have been processed, initialized to 0; N _batch is the number of batches, N _batch = ceil (total number of samples/N), N is the number of samples in a batch, ceil ( ) is the round-up function; n _batch is the count of the current batch, initialized to 1; initialized mean vector The sum standard deviation vector S_v _i is a 1×C _i zero vector, where C _i is the number of output channels of the ith feature layer;

S12: Represent the output X _i of the i-th feature layer corresponding to the n-th _batch of samples as a four-dimensional tensor of size N×C _i ×H _i ×W _i _, where H _i and Wi are respectively The height and width of the feature map corresponding to the output channel of the i-th feature layer; the feature map X _ikj of the j-th output channel of the feature layer corresponding to the k-th sample in the batch is a two-dimensional H _i ×W _i dimensional matrix, k=1,2,...,N,j=1,2,...,C _i ;

S13: Stretch each feature map X _ikj in _Xi from a two-dimensional matrix into a one-dimensional vector X ^* _ikj ;

S14: Compute statistical features of X ^* _ikj , including mean and standard deviation S _ikj :

in, represents the mth element in X ^* _ikj ;

S15: by Form a matrix of size N × C _i as the mean value A size N×C _i is a standard deviation matrix S_m _i formed by S _ikj , k=1, 2,..., N, j=1, 2,..., C _i ;

S16: For the mean matrix And the standard deviation matrix S_m _i is averaged by channel, and the processing formula is as follows:

N _sum = N _sum + N (5)

in, Yes The row vector corresponding to the kth row in ; is the row vector corresponding to the kth row in S_m _i ;

S17: Determine whether the current batch is the last batch, that is, whether there is n _batch =N _atch , if so, terminate the batch cycle, and the current mean vector The sum standard deviation vector S_v _i is the statistical feature of the feature map corresponding to all output channels of the feature layer; otherwise, update the count of the current batch: n _batch = n _batch +1, and jump to S12 to continue execution.

3. the deep neural network model clipping method based on feature map statistical feature according to claim 2, is characterized in that, in described step 2, the evaluation index of the jth output channel in the ith feature layer The calculation formula is as follows:

in, and S_v _ij is the mean vector corresponding to the feature layer The jth element in the sum standard deviation vector S_v _i represents the mean and standard deviation of the feature map corresponding to the jth output channel in the feature layer, α and β are two scale factors, and ε is a minimum value.

4. the deep neural network model clipping method based on feature map statistical feature according to claim 3, is characterized in that, in described step 3, for the _i -th feature layer Li, if Then it is determined that the jth output channel in the feature layer is not important, and the channel and its corresponding parameters are removed.

5. the deep neural network model clipping method based on feature map statistical feature according to claim 4, is characterized in that, for the _i -th feature layer Li, the step of removing unimportant channel and its corresponding parameter is as follows :

S31: Record the corresponding evaluation criteria The set R _i of the channels of , the number of elements in the set R _i is recorded as length(R _i );

S32: Represent the convolution kernel Wi of the feature layer _{Li as a four-dimensional tensor of size C i-1 ×C i ×K hi ×K wi} _, _and _the _bias _B _i _{corresponding} to the convolution kernel Wi is a size is a 1×C _i vector, where C _i-1 represents the number of output channels of the previous feature layer _Li-1 , if Li is the first feature layer, then C _i _-1 represents the number of sample input channels, K _hi and _Kwi are the height and width of the convolution kernel _Wi , respectively; remove the elements corresponding to the channels belonging to the set _Ri in the _convolution kernel Wi to form a new convolution kernel _Wi ^* , the size of which is Replace the convolution kernel Wi with a new _convolution kernel Wi ^* ; remove the element corresponding to the channel belonging to the set _{Ri in the bias B i} _to _form a new bias Its size is

S33: If the next layer _Li+1 of the feature layer Li is still a feature layer, express the convolution kernel Wi ₊₁ of the next layer Li ₊₁ as a size C _i ×C _i ₊₁ ×K A four-dimensional tensor of _h(i+1) ×K _w(i+1) , where C _i+1 represents the number of output channels of the next layer L _i+1 , K _h(i+1) and K _{w(i+ 1)} are the height and width of the convolution kernel Wi ₊₁ respectively; remove the elements corresponding to the channels belonging to the set R _i in the convolution kernel Wi ₊₁ to form a new convolution kernel Its size is with a new convolution kernel to replace the convolution kernel Wi ₊₁ ;

S34: If the next layer _Li+1 of the feature layer Li is a fully connected layer, express the parameter Vi ₊₁ of the next layer _{Li+1 as a size (C i} _× _K _hi ×K _wi )× The matrix of C _i+1 ; remove the elements corresponding to the channels belonging to the set R _i in the parameter V _i+1 to form a new parameter Its size is with new parameters to replace the parameter V _i+1 .