CN111461144A - Method for accelerating convolutional neural network - Google Patents
- Publication number
- CN111461144A (application CN202010244305.2A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- feature maps
- groups
- group
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
Abstract
The invention provides a method for accelerating a convolutional neural network, comprising the following steps. Step 1: divide an input feature map having N channels into G groups of initial feature maps along the channel direction, where the i-th group of initial feature maps contains S_i feature maps; perform a first group convolution on the G groups of initial feature maps to obtain G groups of first feature maps, where N, G, and S_i are integers greater than or equal to 1. Step 2: re-divide the G groups of first feature maps into F groups of second feature maps, where the j-th group of second feature maps contains T_j feature maps drawn from different first feature map groups; perform a second group convolution on the F groups of second feature maps to obtain an output feature map having M channels, where F, T_j, and M are integers greater than or equal to 1.
Description
Technical Field
The invention relates to a deep learning technology, in particular to a method for accelerating a convolutional neural network.
Background
With the development of deep learning technology, applications based on deep learning have spread to many fields of daily life, and deep learning itself has evolved from the earliest cloud computing to today's terminal (on-device) computing. Because most deep learning applications are large in scale, they place high demands on machine performance during training and prediction, while the storage resources and computing power of terminal devices are very limited; how to accelerate deep learning has therefore become a new technical hotspot. The convolutional neural network is a widely used deep learning model, and methods for accelerating it are a hot topic.
Convolutional Neural Networks (CNNs) are similar to the multilayer perceptron of an artificial neural network: they extract features by convolution, integrate the different features, and finally make predictions. A convolutional neural network mainly comprises a data input layer, convolutional layers, activation layers, pooling layers, and fully-connected layers, where the convolutional layers perform feature extraction on the input data. The convolution operations of the convolutional layers occupy most of the network's resources, so reducing the amount of convolution computation is the key to accelerating a convolutional neural network.
Existing lightweight convolutional neural networks adopt depth-wise separable convolution, which decomposes a conventional convolution into two subtasks: (1) depth-wise convolution, which performs the image-convolution task within each feature layer; and (2) point-wise convolution, which realizes information interaction between different feature layers. Compared with direct convolution by full-size convolution kernels, depth-wise separable convolution greatly reduces model parameters and computation. However, because the complexity of the point-wise convolution is far higher than that of the depth-wise convolution, a large share of the resources is spent on information interaction between feature layers; resource allocation is therefore unbalanced, and the overall efficiency of the convolutional layer is reduced.
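The imbalance described above can be illustrated with a rough multiply-accumulate count; the dimensions below are arbitrary examples, not values from the patent:

```python
def standard_conv_macs(W, H, N, M, K):
    # One full convolution: M kernels of size K x K x N slide over a W x H map.
    return W * H * N * M * K * K

def depthwise_separable_macs(W, H, N, M, K):
    # Depth-wise stage: one K x K kernel per input channel.
    depthwise = W * H * N * K * K
    # Point-wise stage: M kernels of size 1 x 1 x N combine the channels.
    pointwise = W * H * N * M
    return depthwise, pointwise

W, H, N, M, K = 32, 32, 64, 128, 3  # illustrative dimensions
full = standard_conv_macs(W, H, N, M, K)
dw, pw = depthwise_separable_macs(W, H, N, M, K)
print(f"full: {full}, separable: {dw + pw} ({(dw + pw) / full:.1%} of full)")
print(f"point-wise share of separable cost: {pw / (dw + pw):.1%}")
```

With these dimensions the separable form costs only about 12% of the full convolution, but over 90% of that cost sits in the point-wise stage — the imbalance the patent targets.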
Disclosure of Invention
The invention provides a method for accelerating a convolutional neural network, comprising the following steps. Step 1: divide an input feature map having N channels into G groups of initial feature maps along the channel direction, where the i-th group of initial feature maps contains S_i feature maps; perform a first group convolution on the G groups of initial feature maps to obtain G groups of first feature maps, where N, G, and S_i are integers greater than or equal to 1. Step 2: re-divide the G groups of first feature maps into F groups of second feature maps, where the j-th group of second feature maps contains T_j feature maps drawn from different first feature map groups; perform a second group convolution on the F groups of second feature maps to obtain an output feature map having M channels, where F, T_j, and M are integers greater than or equal to 1.
Optionally, in step 1 the input feature map with N channels is evenly divided along the channel direction into G groups of initial feature maps, each group containing S feature maps, with S × G = N; and in step 2 the G groups of first feature maps are evenly divided into F groups of second feature maps, each group containing T feature maps, with F = S and T = G, so that each feature map in the j-th group of second feature maps comes from a different group of first feature maps.
Optionally, wherein S = √M / K, and wherein the size of the convolution kernel in the first group convolution is K × K and the size of the convolution kernel in the second group convolution is 1 × 1.
Optionally, wherein S′ = √M / K, and when S′ is a non-integer, the value of S is the integer closest to S′.
Optionally, wherein when √M / K is a non-integer, it is rounded to the nearest integer S and N / S is rounded up to the integer G′, and when G′ × S = N, G = G′.
Optionally, wherein when √M / K is a non-integer, it is rounded to the nearest integer S and N / S is rounded up to the integer G′; when G′ × S > N, the first G′ × S − N feature maps of the input feature map are copied and merged with the input feature map to obtain an input feature map with G′ × S layers, which is then grouped according to G′ and S.
Optionally, wherein the first group convolution comprises: performing S convolutions on each group of initial feature maps.
Optionally, wherein the second group convolution comprises: when M / S is an integer, performing M / S convolutions on each group of second feature maps.
Optionally, wherein the second group convolution comprises: when M / S is a non-integer, rounding it down to the integer W and letting R = M − W × S; performing W + 1 convolutions on each of R of the S groups of second feature maps and W convolutions on each of the other groups.
A further aspect of the invention also provides a storage medium in which a computer program is stored which, when being executed by a processor, is operable to carry out any of the methods described above.
Another aspect of the present invention also provides an electronic device comprising a processor and a memory, the memory having stored therein a computer program which, when executed by the processor, is operable to carry out any of the methods described above.
Compared with the prior art, the invention has the advantages that:
According to the convolution method of the invention, the input feature map of a convolutional layer is grouped in a determined way and two group convolutions are performed, so that the computation is balanced more evenly across the different convolution operations. This effectively reduces the total computation and complexity of the convolutional neural network, significantly improves the efficiency of the convolution operations, and increases network speed. In addition, in some embodiments, optimizing the grouping of the input feature map increases the generality of the acceleration method.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1A shows a schematic diagram of the depth-wise convolution operation in a depth-wise separable convolution;
FIG. 1B shows a schematic diagram of a point-by-point convolution operation in a depth separable convolution;
FIG. 2 illustrates a method for accelerating a convolutional neural network, in accordance with one embodiment of the present invention;
FIG. 3A shows a schematic diagram of a conventional convolution operation;
FIG. 3B shows a schematic diagram of a group convolution operation;
FIG. 4A is a diagram illustrating a first group convolution of G groups of input feature maps to obtain G groups of first feature maps according to an embodiment of the present invention;
FIG. 4B is a diagram illustrating how the G groups of first feature maps are re-divided into F groups of second feature maps and the second group convolution is performed, according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
In prior-art depth-wise separable convolution, a complete convolution operation is decomposed into two steps: depth-wise convolution and point-wise convolution. The depth-wise convolution filters each channel of the input feature map with a single-channel convolution kernel, i.e., one kernel is responsible for exactly one channel. FIG. 1A shows a schematic diagram of the depth-wise convolution operation in a depth-wise separable convolution. As shown in FIG. 1A, assume the input feature map has size W × H, the number of channels (layers) is N, the convolution kernel has size K × K, and the number of convolution kernels equals the number of channels of the input feature map. With a stride of 1 and padding of 0, the depth-wise convolution yields an output feature map of size W × H with N channels, with computational complexity:
θ_task1 = θ(N*W*H*K*K)    (1)
Since the depth-wise convolution performs an independent convolution on each channel of the input feature map, it does not exploit the feature information of different channels at the same spatial position; the channels of the depth-wise output must therefore still be linearly combined by a point-wise convolution, in which each convolution kernel produces one channel of the output feature map from all channels of the depth-wise output. FIG. 1B shows a schematic diagram of the point-wise convolution operation in a depth-wise separable convolution. As shown in FIG. 1B, the feature map output by the depth-wise convolution serves as the input feature map of the point-wise convolution; each convolution kernel has size 1 × 1 × N, where N is the number of channels of the point-wise input (i.e., the depth-wise output), and there are M kernels in total, where M is the number of channels of the output feature map after the point-wise convolution. The point-wise convolution linearly combines each element of the W × H × N depth-wise output M times, completing the interaction and integration of information between the different channels and finally yielding an output feature map with M channels, with computational complexity:
θ_task2 = θ(W*H*N*M)    (2)
According to the inequality of arithmetic and geometric means, the following holds:

θ_task1 + θ_task2 ≥ 2√(θ_task1 * θ_task2)    (3)
Therefore, when θ_task1 = θ_task2, the sum θ_task1 + θ_task2 is minimal; from equations (1) and (2), this occurs when K*K = M, at which point the total complexity of the entire convolution operation is lowest.
However, in practical applications, the number of output channels M of a convolutional layer is much larger than the kernel size K × K, so the computational complexity of the point-wise convolution is much higher than that of the depth-wise convolution. The information interaction between channels then accounts for more than 90% of the total computation, consumes a large amount of computing resources, and slows the neural network down.
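A small sketch of the complexity terms (1) and (2) shows how the point-wise share grows with the number of output channels M; the dimensions are illustrative:

```python
def theta_task1(N, W, H, K):
    # Depth-wise complexity, equation (1): N*W*H*K*K
    return N * W * H * K * K

def theta_task2(N, W, H, M):
    # Point-wise complexity, equation (2): W*H*N*M
    return N * W * H * M

# The two terms are balanced when K*K == M; real layers have M >> K*K,
# so the point-wise stage dominates the total cost.
N, W, H, K = 64, 32, 32, 3
for M in (K * K, 128, 512):
    t1, t2 = theta_task1(N, W, H, K), theta_task2(N, W, H, M)
    print(f"M={M}: point-wise share = {t2 / (t1 + t2):.1%}")
```

At M = K*K = 9 the split is exactly 50/50; at M = 128 or 512 the point-wise stage already exceeds 90% of the total, matching the observation above.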
To solve the above problems, the present invention provides a method for accelerating a convolutional neural network that equalizes the computational workload among the different steps of the convolution operation, thereby effectively reducing the network's total computation and complexity. For the feature counts common in mainstream networks, the technical scheme of the invention keeps the sizes of the input and output feature maps unchanged while reducing the computation to less than 60% of the original; for networks with a higher feature ratio, the method reduces the actual computation even further, markedly improving the efficiency of the convolution operations.
The method of the present invention is an improvement over prior-art depth-wise separable convolution and, in general, comprises: dividing the input feature map into several groups of initial feature maps and performing a first group convolution on them to obtain several groups of first feature maps; then re-dividing the first feature maps into several groups of second feature maps, each containing feature maps from different first feature map groups, and performing a second group convolution on them to obtain the output feature map. In this way, information interaction both within each feature map group and between different groups is completed, the computation and complexity of the different convolution steps are balanced, and the convolutional neural network is effectively accelerated.
FIG. 2 illustrates a method for accelerating a convolutional neural network, in accordance with one embodiment of the present invention. As shown in fig. 2, the method includes:
and S210, averagely dividing the input feature map with N channels into G groups of initial feature maps along the channel direction, wherein each group of initial feature maps comprises S feature maps, S G2N, and performing primary group convolution on the G groups of initial feature maps to obtain G groups of first feature maps.
Group convolution divides the input feature map into several groups along the channel direction, convolves each group separately, and then concatenates the results. FIG. 3A shows a schematic diagram of a conventional convolution operation. As shown in FIG. 3A, in conventional convolution each kernel convolves the feature maps of all channels of the input; the number of channels of each kernel equals that of the input feature map, and the number of kernels equals the number of channels of the output feature map. FIG. 3B shows a schematic diagram of a group convolution operation. As shown in FIG. 3B, unlike conventional convolution, group convolution first divides the input feature map evenly into several groups along the channel direction; the convolution kernels are divided correspondingly along the channel direction, each group of feature maps is convolved with its own kernels, and the convolved groups are concatenated to form the output feature map. For example, for an input feature map of size W × H × N divided into 2 groups along the channel direction, each group has size W × H × N/2; the kernels are divided the same way (their spatial size unchanged), and the two groups of convolved feature maps are concatenated to form an output feature map of size W × H × M.
Because the grouped input feature maps are convolved in parallel, group convolution not only reduces the number of parameters and the amount of computation and increases speed compared with conventional convolution, but also weakens the dependence between the convolution kernels and the preceding layer, reducing overfitting and improving the generalization ability of the neural network.
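The parameter reduction from grouping can be checked with a simple weight count; `conv_weight_count` is an illustrative helper and the dimensions are arbitrary, not taken from the patent:

```python
def conv_weight_count(N, M, K, groups=1):
    # Each group maps N/groups input channels to M/groups output channels
    # with K x K kernels, so the weight count scales as 1/groups.
    assert N % groups == 0 and M % groups == 0
    return groups * (N // groups) * (M // groups) * K * K

N, M, K = 64, 128, 3
for g in (1, 2, 4, 8):
    print(f"groups={g}: {conv_weight_count(N, M, K, g)} weights")
```

Doubling the number of groups halves the weights, which is the source of both the parameter and computation savings described above.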
FIG. 4A is a diagram illustrating a first group convolution performed on G groups of input feature maps to obtain G groups of first feature maps, according to an embodiment of the present invention. As shown in FIG. 4A, the input feature map of the convolutional layer has N channels (i.e., N layers; e.g., N = 20) and is divided evenly along the channel direction into G groups of initial feature maps (e.g., G = 4), so that each group contains S feature maps (e.g., S = N/G = 5). Performing S convolutions on each of the G groups of initial feature maps yields G groups of first feature maps, each group again containing S feature maps. If the input feature map size is W × H, the kernel size is K × K, the stride is 1, and the padding is 0, then the computational complexity of the first group convolution is:
θ′_task1 = θ′(G*S*W*H*S*K*K) = θ′(N*W*H*S*K*K)    (4)
wherein N, M, W, H, G, S, K are each integers greater than 1.
In another embodiment, the input feature map may also be divided into non-uniform groups of initial feature maps, in which the numbers of feature maps need not be equal. For example, an input feature map with N channels (e.g., N = 20) is divided along the channel direction into G groups of initial feature maps (e.g., G = 4) whose sizes S_i differ: the 1st group contains 4 feature maps, the 2nd group 3, the 3rd group 7, and the 4th group 6, but the total number of feature maps over the G groups is N, i.e., Σ_i S_i = N. Performing Q_i convolutions on the i-th group of initial feature maps then yields the G groups of first feature maps. Just as the group sizes S_i need not be equal, the numbers of convolutions Q_i need not be equal either; for example, 4 convolutions on the 1st group, 3 on the 2nd, 7 on the 3rd, and 6 on the 4th, with the total number of convolutions over the G groups equal to N, i.e., Σ_i Q_i = N. As before, N, G, S_i, and Q_i are integers greater than or equal to 1.
The first group convolution performed on the input feature map not only quickly completes the extraction and integration of information within each feature map, but also realizes information interaction among the feature maps of the same group, so that each feature map in a group of first feature maps can express the overall feature information of that group. However, each feature map in the first feature map groups is associated with only one group of the input feature map, and information of the global channels may be lost. To realize interaction and integration of all the information in the input feature map, the boundaries between the groups must be broken: the first feature maps are re-divided so that each newly formed group of second feature maps contains feature maps from different groups of first feature maps, and a group convolution is performed again on the second feature map groups.
S220: evenly divide the G groups of first feature maps into F groups of second feature maps, each group containing T feature maps, so that each feature map in the j-th group of second feature maps comes from a different group of first feature maps; perform a second group convolution on the F groups of second feature maps to obtain an output feature map with M channels.
FIG. 4B is a diagram illustrating how the G groups of first feature maps are re-divided into F groups of second feature maps and the second group convolution is performed, according to an embodiment of the present invention. As shown in FIG. 4B, the first group convolution produces G groups of first feature maps, each containing S feature maps; the G groups are evenly re-divided into F groups of second feature maps, each containing T feature maps; and the second group convolution on the F groups yields the output feature map. When forming the second feature maps, one feature map can be taken from each group of first feature maps in turn to form a group of second feature maps, repeating until every feature map in every group of first feature maps has been taken and placed in some group of second feature maps; this finally yields F groups of second feature maps. In this case the number T of feature maps in each second group equals the number of first groups G (T = G), and the number of second groups F equals the number of feature maps per first group S (F = S). For example, suppose the first group convolution yields 4 groups of first feature maps G_1, G_2, G_3, G_4, each containing 5 feature maps S_1, …, S_5. Taking the first feature map (S_1) from each of G_1, …, G_4 forms the first group of second feature maps F_1; taking the second feature map (S_2) from each forms F_2; and so on, until the last feature map (S_5) is taken from each group to form the final group F_5. There are then 5 groups of second feature maps F_1, …, F_5, each containing 4 feature maps. Performing M/S convolutions on each group of second feature maps finally yields an output feature map with M channels. If the input feature map size is W × H, the kernel size of the second group convolution is 1 × 1, the stride is 1, and the padding is 0, then the computational complexity of the second group convolution is:

θ′_task2 = θ′(S*(M/S)*W*H*G) = θ′(N*W*H*M/S)    (5)
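The uniform re-division (taking one map from each first group in turn) amounts to a transpose of the (G, S) channel layout; a minimal sketch, using illustrative string labels rather than patent terminology:

```python
def regroup(first_maps, G, S):
    # first_maps lists the G*S first feature maps group-major:
    # [g0s0, g0s1, ..., g0s(S-1), g1s0, ...].  Taking one map from each
    # first group in turn yields F = S groups of T = G maps each.
    assert len(first_maps) == G * S
    return [[first_maps[g * S + f] for g in range(G)] for f in range(S)]

maps = [f"G{g}S{s}" for g in range(4) for s in range(5)]  # G = 4, S = 5
second = regroup(maps, G=4, S=5)
print(second[0])  # the first second-group: one map from each first group
```

Each of the 5 resulting groups contains exactly one map from each of the 4 first groups, which is the property the second group convolution relies on.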
because the feature map of each newly generated second feature map group is from different first feature map groups, the feature information between different first feature map groups can be integrated by performing second group convolution operation on the second feature map group, so that each finally obtained feature map in the output feature map contains the information of all channels in the input feature map, and the complete fusion and extraction of the input features are realized.
In another embodiment, the numbers of feature maps drawn from the different first feature map groups may differ within each second group, and the sizes of the second groups may also differ, so long as the total number of feature maps over all second groups equals that over all first groups. For example, suppose the first group convolution yields 4 groups of first feature maps of unequal sizes: G_1 with 4 feature maps, G_2 with 3, G_3 with 7, and G_4 with 6. One feature map may be drawn at random from G_1, two from G_2, and three from G_3 to form the first group of second feature maps F_1; three more from G_1 and two from G_3 to form F_2; and so on, until all the first feature maps have been drawn and the last group F_j is formed. This yields F groups of second feature maps, the j-th containing T_j feature maps. Although the group sizes T_j may differ, their total is still N, i.e., Σ_j T_j = N, where j = 0, …, F−1. Similarly, the numbers of convolutions P_j applied to the groups need not be equal; for example, 6 convolutions on F_1 and 5 on F_2, with the total number of convolutions over the F groups equal to M, i.e., Σ_j P_j = M. As before, N, F, T_j, and P_j are integers greater than or equal to 1.
Through the two group convolutions, information interaction and integration among the different channels of the input feature map is completed, finally yielding an output feature map with M channels.
Similarly to formula (3) above, to minimize θ′_task1 + θ′_task2, i.e., to make the overall complexity and the amount of computation of the convolution operation minimal, in one embodiment each group of initial feature maps contains S feature maps, where S satisfies:

S = √M / K,
where M is the number of channels (i.e., the number of layers) of the output feature map of the convolutional layer, and K is the size of the convolution kernels in the first group convolution. In this case θ′_task1 = θ′_task2, i.e., the computational complexity of the first group convolution equals that of the second group convolution.
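Under one plausible reading of the cost model (an assumption here, since formula (3) is not reproduced in this excerpt: K × K kernels in the first group convolution and 1 × 1 kernels in the second), the balancing group size works out to √M / K and can be checked numerically:

```python
import math

def balanced_group_size(M, K):
    """S' that equalises the per-pixel cost of the two group convolutions
    under the assumed cost model (K x K first stage, 1 x 1 second stage)."""
    return math.sqrt(M) / K

def stage_costs(N, M, S, K):
    # Stage 1: G = N/S groups, each with S maps, convolved S times with
    # K x K kernels -> (N/S) * S * S * K^2 = N * S * K^2 multiplies per pixel.
    theta1 = (N / S) * S * S * K * K
    # Stage 2: S groups of N/S maps, M/S 1x1 convolutions per group
    # -> S * (N/S) * (M/S) = N * M / S multiplies per pixel.
    theta2 = S * (N / S) * (M / S)
    return theta1, theta2
```

For example, with M = 144 output channels and K = 3, the balance point is S′ = 4, and both stages then cost the same number of multiplies per pixel.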
In practical applications, however, the value S′ obtained from the above calculation may be non-integer, in which case the convolutional network cannot be accelerated directly by the above method; the present invention therefore provides further embodiments to optimize the value of S.
In one embodiment, when S′ = √M / K is non-integer, S may be set to the integer closest to S′ that is also a factor of the number N of channels of the input feature map.
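A minimal sketch of this embodiment (the helper name is illustrative): among the divisors of N, pick the one closest to S′.

```python
def nearest_factor_of_n(N, s_prime):
    """Return the factor of N closest to s_prime (smaller factor on ties)."""
    factors = [d for d in range(1, N + 1) if N % d == 0]
    return min(factors, key=lambda d: (abs(d - s_prime), d))
```

For instance, with N = 64 and S′ ≈ 5.3 the nearest divisor is 4 (since 8 is farther away).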
In another embodiment, when S′ = √M / K is non-integer, S′ is rounded to the nearest integer S, and N/S is rounded up to an integer G′. When G′ · S = N, the input feature maps are grouped according to G′ and S and the first group convolution is performed. When G′ · S > N, the input feature maps must be padded: specifically, the first G′ · S − N feature maps of the input feature maps are copied and merged with the input feature maps to obtain G′ · S input feature maps, which are then grouped according to G′ and S before the first group convolution is performed.
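A sketch of this padding embodiment, assuming the copied feature maps are appended after the original ones (channel indices stand in for feature maps here):

```python
import math

def pad_and_group(N, S):
    """Round the channel count up to a multiple of S by copying the first
    G'*S - N feature maps, then split into G' groups of S channels each."""
    g_prime = math.ceil(N / S)
    deficit = g_prime * S - N
    channels = list(range(N)) + list(range(deficit))  # copies of first maps
    return [channels[i * S:(i + 1) * S] for i in range(g_prime)]
```

With N = 10 channels and S = 4, G′ = 3 and the first two channels are duplicated to fill the last group.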
In some cases, when each of the S groups of second feature maps is to be convolved M/S times, the value M/S may be non-integer, so that the convolutional network cannot be accelerated using the above method directly.
In one embodiment, when M/S is non-integer, M/S is rounded down to an integer W, and R = M − W · S; R of the S groups of second feature maps are then convolved W + 1 times each, and the other groups are convolved W times each.
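The uneven distribution of the M convolutions over the S groups can be sketched as:

```python
def convolutions_per_group(M, S):
    """W = floor(M/S); R = M - W*S groups receive W+1 convolutions each,
    and the remaining S - R groups receive W, so the counts sum to M."""
    W, R = divmod(M, S)
    return [W + 1] * R + [W] * (S - R)
```

For example, M = 10 convolutions over S = 4 groups gives counts [3, 3, 2, 2], which sum back to M.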
Experiments show that with the acceleration method, the convolutional network ResNet18 finally reaches 61.8% Top-1 and 83% Top-5 accuracy on the ImageNet dataset, a drop of 7.5%, with the amount of computation reduced to 15% of the original; the convolutional network MobileNet finally reaches 62.9% Top-1 and 84.6% Top-5 accuracy on the ImageNet dataset, a drop of 4.5%, with the amount of computation reduced to 38% of the original.
Based on the above embodiments, the present invention balances the computational complexity among the steps of the convolution operation, reduces the network parameters and the overall amount of computation, and greatly improves the speed and efficiency of the network.
Although the present invention has been described in detail, those skilled in the art will understand that modifications and equivalents may be made without departing from the spirit and scope of the present invention, and such modifications and equivalents shall be considered to fall within the scope of the claims of the present invention.
Claims (11)
1. A method for accelerating a convolutional neural network, comprising:
step 1: dividing an input feature map with N channels into G groups of initial feature maps along the channel direction, wherein the i-th group G_i of initial feature maps includes S_i feature maps, i = 0, …, G−1, and performing a first group convolution on the G groups of initial feature maps to obtain G groups of first feature maps, wherein N, G and S_i are integers of 1 or more;
step 2: regrouping the G groups of first feature maps into F groups of second feature maps, wherein the j-th group F_j of second feature maps contains T_j feature maps taken from different groups of first feature maps, j = 0, …, F−1, with ∑ T_j = N, and performing a second group convolution on the F groups of second feature maps to obtain an output feature map with M channels, wherein F, T_j and M are integers of 1 or more.
2. The method of claim 1, wherein,
in step 1, the input feature map with N channels is evenly divided along the channel direction into G groups of initial feature maps, each group of initial feature maps including S feature maps, with S × G = N; and
in step 2, the G groups of first feature maps are evenly divided into F groups of second feature maps, each group of second feature maps including T feature maps, where F = S and T = G, so that the feature maps in the j-th group F_j of second feature maps each come from a different group G_i of first feature maps.
6. The method of claim 3, wherein, when √M / K is non-integer, it is rounded to the nearest integer S, and N/S is rounded up to an integer G′; when G′ · S is greater than N, the first G′ · S − N feature maps of the input feature maps are copied and merged with the input feature maps to obtain G′ · S input feature maps, which are grouped according to G′ and S.
7. The method of claim 3, wherein the first group convolution comprises: performing S convolutions on each group of initial feature maps respectively.
9. The method of claim 3, wherein the second group convolution comprises: when M/S is non-integer, rounding M/S down to an integer W, with R = M − W · S, convolving R of the S groups of second feature maps W + 1 times each, and convolving the other groups of second feature maps W times each.
10. A storage medium in which a computer program is stored which, when being executed by a processor, is operative to carry out the method of any one of claims 1-9.
11. An electronic device comprising a processor and a memory, the memory having stored therein a computer program which, when executed by the processor, is operable to carry out the method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010244305.2A CN111461144A (en) | 2020-03-31 | 2020-03-31 | Method for accelerating convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111461144A true CN111461144A (en) | 2020-07-28 |
Family
ID=71680931
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010244305.2A Pending CN111461144A (en) | 2020-03-31 | 2020-03-31 | Method for accelerating convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111461144A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112016639B (en) * | 2020-11-02 | 2021-01-26 | 四川大学 | Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200728 ||