CN110533164B - Winograd convolution splitting method for convolution neural network accelerator

Info

Publication number: CN110533164B
Application number: CN201910717929.9A
Authority: CN
Inventors: 杨晨; 王逸洲; 王小力; 耿莉
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2019-08-05
Filing date: 2019-08-05
Publication date: 2023-04-07
Anticipated expiration: 2039-08-05
Also published as: CN110533164A

Abstract

The invention discloses a Winograd convolution splitting method for a convolutional neural network accelerator, comprising the following steps: 1) reading an input and a convolution kernel of arbitrary size from the cache of the convolutional neural network accelerator; 2) deciding, according to the convolution kernel size and the input size, whether convolution splitting is needed, and proceeding to the next step if it is; 3) splitting the convolution kernel according to its size and the step length, and splitting the input according to its size and the step length; 4) combining and zero-padding the split elements according to the convolution kernel size, and combining and zero-padding the split elements according to the input size; 5) performing Winograd convolution on each pair of split input and convolution kernel; 6) accumulating the Winograd convolution results of all pairs of input and convolution kernel; 7) storing the accumulated result in the cache of the convolutional neural network accelerator. The invention enables a convolutional neural network accelerator to support convolutions of various shapes using one kind of Winograd acceleration unit.

Description

Winograd convolution splitting method for convolution neural network accelerator

Technical Field

The invention belongs to the field of convolutional neural network algorithms, and in particular relates to a Winograd convolution splitting method for a convolutional neural network accelerator.

Background

Convolutional neural networks (CNNs) are widely used in computer vision tasks such as object detection and image classification. As network models continue to develop, recognition accuracy keeps improving, but the amount of computation and data grows enormously. High-performance, low-power hardware is therefore required, and that hardware must also remain flexible enough to serve a variety of network models.

Convolutional neural network accelerators are widely used to accelerate convolutional neural network algorithms on both mobile devices and servers. To improve accelerator performance, the Winograd algorithm is used at the algorithm level to reduce the number of multiplication operations, and hence the hardware multipliers they require, so that accelerator throughput increases for the same number of multipliers. However, current accelerators based on the Winograd algorithm have a serious problem: because a Winograd operation unit has fixed parameters, each unit can only accelerate convolutions with the corresponding parameters. Extending the accelerator's flexibility therefore requires Winograd operation units with several different parameters, which increases resource and power consumption. In addition, the accelerator receives data streams of different shapes from Winograd operation units with different parameters, which lowers the utilization of the operation units and seriously degrades performance.

Disclosure of Invention

The present invention aims to overcome the above defects of the prior art by providing a Winograd convolution splitting method for a convolutional neural network accelerator, so that the accelerator can support convolutions of various shapes using one kind of Winograd acceleration unit. The focus of the invention is to split convolutions of different shapes and convert them into a unified data stream.

The invention is realized by adopting the following technical scheme:

A Winograd convolution splitting method for a convolutional neural network accelerator, comprising the following steps:

1) Reading input and convolution kernels with any size from a cache of a convolution neural network accelerator;

2) Judging whether to carry out convolution splitting or not according to the size of the convolution kernel and the input size, and if the convolution splitting is required, carrying out the next step;

3) Splitting the convolution kernel according to the size and the step length of the convolution kernel, and splitting the input according to the input size and the step length;

4) Combining and zero-filling the split elements according to the size of the convolution kernel, and combining and zero-filling the split elements according to the input size;

5) Carrying out Winograd convolution on each pair of split input and convolution kernel;

6) Accumulating the Winograd convolution results of each pair of input and convolution kernel;

7) Storing the accumulated result in the cache of the convolutional neural network accelerator.

A further improvement of the invention is that the decision in step 2) is made as follows:

If the convolution kernel size is smaller than the set convolution kernel size and the input size is smaller than the set input size, no convolution splitting is performed, and the convolution kernel and the input are directly zero-padded to the set convolution kernel size and the set input size; if the convolution kernel size is larger than the set convolution kernel size and the input size is larger than the set input size, convolution splitting is performed.
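As a minimal sketch of this decision, assuming square kernels and inputs described by their side lengths, the rule might be written as below. The function name `plan_convolution` and the return labels are hypothetical. The text only covers the case where both sizes are smaller and the case where both are larger; treating equal sizes as needing no split, and every other combination as a split, is an assumption made here.

```python
def plan_convolution(kernel_size, input_size, r, w):
    """Return "pad" or "split" for a square kernel and a square input.

    r is the set convolution kernel size and w the set input (tile) size.
    Assumption: sizes equal to the set sizes need no splitting, and any
    case in which either size exceeds its set size is treated as a split.
    """
    if kernel_size <= r and input_size <= w:
        return "pad"    # zero-pad directly up to r x r and w x w
    return "split"      # continue with step 3)
```

For example, with r = 3 and w = 4, a 2 × 2 kernel on a 4 × 4 input is only padded, while a 5 × 5 kernel on a 9 × 9 input is split.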

A further improvement of the invention is that step 3) is implemented as follows:

30) Splitting the original convolution kernel into several convolution kernels, each no larger than the set convolution kernel size, where adjacent elements of each split convolution kernel are one convolution step length apart in the original kernel in both the horizontal and vertical directions;

31) Splitting the original input into several inputs, each no larger than the set input size, where adjacent elements of each split input are one convolution step length apart in the original input in both the horizontal and vertical directions.

A further improvement of the invention is that step 30) is implemented as follows:

301) Taking the element in the upper-left corner of the original convolution kernel as the first element;

302) Taking the next element in the horizontal and vertical directions, using the convolution step length as the stride, until as many elements as the set convolution kernel size have been taken;

303) Combining all taken elements, according to their positions, into a new convolution kernel, which completes the splitting of the first convolution kernel;

304) Taking the upper-left-most element that has not yet been taken as the first element of the second convolution kernel;

305) Taking the remaining elements in the same way until a number of elements less than or equal to the set convolution kernel size has been taken;

306) Repeating the above steps until the rest of the convolution kernel has been split.
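Steps 301) to 306) can be illustrated with a short sketch. It assumes the kernel is given as a list of rows, that the same grouping is applied independently to rows and columns, and that `r` is the set convolution kernel size. The names `index_groups` and `split_kernel` are hypothetical and are not part of the method as claimed.

```python
def index_groups(n, size, stride):
    """Group indices 0..n-1 into runs of at most `size` indices spaced `stride` apart."""
    taken = [False] * n
    groups = []
    for start in range(n):              # upper-left-most index not yet taken
        if taken[start]:
            continue
        group = []
        i = start
        while i < n and len(group) < size:
            group.append(i)             # take the element, then step by the stride
            taken[i] = True
            i += stride
        groups.append(group)
    return groups


def split_kernel(kernel, r, stride):
    """Return a list of ((row, col) of first element, sub-kernel) pairs."""
    row_groups = index_groups(len(kernel), r, stride)
    col_groups = index_groups(len(kernel[0]), r, stride)
    return [((rg[0], cg[0]), [[kernel[i][j] for j in cg] for i in rg])
            for rg in row_groups for cg in col_groups]
```

For a 5 × 5 kernel with r = 3, a step length of 1 yields sub-kernels of 3 × 3, 3 × 2, 2 × 3, and 2 × 2 elements; with a step length of 2, the first sub-kernel gathers rows and columns 0, 2, and 4. Every element of the original kernel ends up in exactly one sub-kernel.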

A further improvement of the invention is that step 31) is implemented as follows:

311) Taking the element in the upper-left corner of the original input as the first element;

312) Taking the next element in the horizontal and vertical directions, using the convolution step length as the stride, until as many elements as the set input size have been taken;

313) Combining all taken elements, according to their positions, into a new input, which completes the splitting of the first input;

314) Taking the upper-left-most element that has not yet been taken as the first element of the second input;

315) Taking the remaining elements in the same way until a number of elements less than or equal to the set input size has been taken;

316) Repeating the above steps until the rest of the input has been split.
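The input is split by the same rule. The sketch below expresses it in a different but equivalent way, offered only as an illustration: it first sub-samples the input at each of the stride × stride starting offsets, then cuts each sub-sampled grid into blocks of at most w × w elements. This produces the same groups of elements as steps 311) to 316), although possibly in a different order. `split_input` is a hypothetical name, and pairing each tile with the matching split convolution kernel is outside the scope of the sketch.

```python
def split_input(x, w, stride):
    """Return a list of ((row, col) of first element, tile) pairs."""
    tiles = []
    for p in range(min(stride, len(x))):            # row offset within one stride
        for q in range(min(stride, len(x[0]))):     # column offset within one stride
            # Elements that are `stride` apart in the original input become adjacent.
            phase = [row[q::stride] for row in x[p::stride]]
            for i in range(0, len(phase), w):
                for j in range(0, len(phase[0]), w):
                    origin = (p + i * stride, q + j * stride)
                    tile = [row[j:j + w] for row in phase[i:i + w]]
                    tiles.append((origin, tile))
    return tiles
```

Tiles at the right and bottom edges may be smaller than w × w; those are the ones that step 4) zero-pads.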

A further improvement of the invention is that step 4) is implemented as follows:

If a split convolution kernel already has the set convolution kernel size, no zero padding is performed; if it is smaller than the set convolution kernel size, it is zero-padded on the upper-left side up to the set convolution kernel size;

If a split input already has the set input size, no zero padding is performed; if it is smaller than the set input size, it is zero-padded on the upper-left side up to the set input size.
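A small sketch of upper-left zero padding follows. It reads "zero-padded on the upper-left side" as adding zero rows on top and zero columns on the left, which is an interpretation of the text; `pad_upper_left` and `correlate_valid` are hypothetical helper names. The second helper is a plain multiply-accumulate reference used only to check one convenient property: when the kernel and the input are padded by the same amount on the upper-left side, the valid convolution result is unchanged.

```python
def pad_upper_left(block, size):
    """Zero-pad a 2-D list to size x size, adding zeros on the top and left."""
    top = size - len(block)
    left = size - len(block[0])
    padded = [[0] * size for _ in range(top)]               # zero rows on top
    padded += [[0] * left + list(row) for row in block]     # zero columns on the left
    return padded


def correlate_valid(x, k):
    """Stride-1 sliding-window multiply-accumulate without border padding."""
    out_h = len(x) - len(k) + 1
    out_w = len(x[0]) - len(k[0]) + 1
    return [[sum(x[i + u][j + v] * k[u][v]
                 for u in range(len(k)) for v in range(len(k[0])))
             for j in range(out_w)] for i in range(out_h)]
```

For instance, padding a 2 × 2 kernel to 3 × 3 and a 3 × 3 input to 4 × 4 adds one zero row and one zero column to each, and `correlate_valid` returns the same 2 × 2 output before and after padding.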

A further improvement of the invention is that step 5) is implemented as follows:

Performing the Winograd convolution kernel transformation on the convolution kernel of the set convolution kernel size;

Performing the Winograd input transformation on the input of the set input size;

Performing the element-wise (dot) multiplication of the transformed input and the transformed convolution kernel, sized according to the set convolution kernel size and input size;

Performing the Winograd output transformation on the dot-product result according to the set output size.
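The four operations of step 5) can be sketched for one concrete parameter set, a 4 × 4 input tile with a 3 × 3 kernel. The transformation matrices below are the widely published ones for this case; the text does not list its own matrices, so using them here is an assumption, and `winograd_4x4_3x3` is a hypothetical name.

```python
# Constant transformation matrices for a 4 x 4 tile and a 3 x 3 kernel (assumed values).
B_T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
A_T = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def winograd_4x4_3x3(tile, kernel):
    """Return the 2 x 2 output for one 4 x 4 tile and one 3 x 3 kernel."""
    u = matmul(matmul(G, kernel), transpose(G))         # kernel transformation, 4 x 4
    v = matmul(matmul(B_T, tile), transpose(B_T))       # input transformation, 4 x 4
    prod = [[u[i][j] * v[i][j] for j in range(4)]       # element-wise product:
            for i in range(4)]                          # 16 multiplications
    return matmul(matmul(A_T, prod), transpose(A_T))    # output transformation, 2 x 2
```

A direct sliding-window computation of the same 2 × 2 output needs 4 × 9 = 36 multiplications, against the 16 in the element-wise product here.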

The invention has the following beneficial technical effects:

With the Winograd convolution splitting method provided by the invention, inputs and convolution kernels of different sizes are read from the accelerator cache; before the Winograd operation, convolution kernels and inputs of different shapes are split and converted into a unified data stream, and the Winograd operation is then performed on that stream. With this algorithm the accelerator needs a Winograd computation unit with only one set of parameters, yet it can accelerate convolutions of any shape. Compared with the traditional convolution algorithm, acceleration performance, power consumption, and flexibility are all improved.

The invention has the main characteristics that:

1. Convolutions of different shapes are split and converted into a unified data stream.

2. Different splitting rules are applied according to the convolution step length.

The main advantages are as follows:

1. Compared with the traditional convolution algorithm, the algorithm saves a large number of multiplications for most convolution shapes, especially for the most widely used convolutions with small kernels and small step lengths.

2. Compared with the traditional Winograd algorithm, the algorithm offers better flexibility for a hardware accelerator.

The dataflow of the traditional Winograd algorithm is shown in Fig. 1, and its calculation is given by formula (1):

$$U = G F G^{T}, \qquad V = B^{T} T B$$

$$\mathrm{Out} = A^{T} \left[ U \odot V \right] A \qquad (1)$$

Here W is the side length of the input tile, H is the input/output side length, C1 is the number of input channels, C2 is the number of output channels, and R is the convolution kernel size. A, B, and G are constant transformation matrices that depend on W and the convolution step size S. The main idea is to transform the input (T) and the convolution kernel (F), multiply the transformed matrices element-wise, and finally transform the product to obtain the result. For example, when the tile size is 5 × 5 and the convolution kernel size is 3 × 3, conventional convolution requires 81 multiplications (ignoring data reuse), whereas the Winograd algorithm requires only 25 multiplications, plus additions, and introduces no error. However, convolutional neural network accelerators that adopt the Winograd algorithm usually support only convolution kernels with sizes less than three; supporting large convolution kernels reduces accuracy, and multiple step sizes are not supported. The following major problems arise when designing a convolutional neural network accelerator based on the Winograd algorithm:

1. How to accelerate convolutions with convolution kernels of different sizes.

2. How to accelerate convolutions with different step sizes.

3. How to keep the data flow uniform while supporting convolutions of different shapes.
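The multiplication counts quoted above for a 5 × 5 tile and a 3 × 3 kernel can be reproduced with a few lines of arithmetic. The sketch assumes a step size of 1 and no border padding, so a W × W tile and an R × R kernel give a (W − R + 1) × (W − R + 1) output; the function names are hypothetical.

```python
def direct_multiplications(tile, r):
    """Multiplications for a sliding-window convolution of one tile (step size 1)."""
    m = tile - r + 1                # output side length
    return (m * m) * (r * r)        # r*r multiplications per output element


def winograd_multiplications(tile):
    """Multiplications in the element-wise product of the transformed tile and kernel."""
    return tile * tile


print(direct_multiplications(5, 3), winograd_multiplications(5))   # prints "81 25"
```

The same functions give 36 against 16 for a 4 × 4 tile with a 3 × 3 kernel. Additions and the constant transformations are not counted.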

Drawings

Fig. 1 is a flowchart of the conventional Winograd algorithm.

Fig. 2 is a calculation flowchart of the Winograd convolution splitting algorithm.

Fig. 3 is a schematic diagram of the calculation process of the Winograd convolution splitting algorithm with a step size of 1.

Fig. 4 is a schematic diagram of the calculation process of the Winograd convolution splitting algorithm with a step size of 3.

Fig. 5 is a schematic diagram of the conversion module for the input and the convolution kernel, applicable to W = 4, r = 3 or 2.

Fig. 6 is a schematic diagram of the output conversion module, applicable to W = 4, r = 3 or 2.

Fig. 7 is a schematic diagram of the unified PE array and conversion module.

Detailed Description

The invention is further described below with reference to the figures and examples.

The flow of the Winograd convolution splitting method for a convolutional neural network accelerator is shown in Fig. 2, where w and r are the set parameters of the Winograd algorithm. The difference from the traditional Winograd algorithm is that, before Winograd is used to accelerate the convolution, the input feature map and the convolution kernel are split and zero-padded according to the step size and the convolution kernel size; Winograd acceleration is then applied to each split result separately, and the outputs are accumulated.

The invention provides a Winograd convolution splitting method and a Winograd addition algorithm for a convolutional neural network accelerator. Specifically, elements at different positions are split, combined, and zero-padded according to the convolution parameters. The implementation steps are as follows:

1. When the convolution kernel size is smaller than r, the convolution kernel is directly zero-padded.

2. When the convolution kernel size is larger than r and the step size is 1, adjacent elements of the input and of the convolution kernel are combined and zero-padded to the set input size w and the set convolution kernel size r.

3. When the convolution kernel size is larger than r and the step size is larger than 1, elements of the input and of the convolution kernel that are one step size apart are combined and zero-padded to the set input size w and the set convolution kernel size r.

4. Winograd convolution and addition are performed on the split, combined, and zero-padded inputs and convolution kernels.

The splitting method provided by the invention is as follows.

When the convolution kernel size is smaller than r, the steps are:

1. Zero-pad the convolution kernel to the set size r × r.

2. Zero-pad the input to a tile of size w × w.

When the convolution kernel size is larger than r and the step size is 1, the steps are:

1. Starting from the upper-left corner of the original convolution kernel, combine adjacent convolution kernel elements into r × r convolution kernels; no element is repeated across the combined convolution kernels, and none may extend beyond the boundary of the original convolution kernel.

2. Zero-pad any split-and-combined convolution kernel smaller than r × r up to r × r.

3. Starting from the upper-left corner of the original input, combine adjacent input elements into tiles of size w × w; no element is repeated across the combined tiles, and none may extend beyond the boundary of the original input.

4. Zero-pad any split-and-combined tile smaller than w × w up to w × w (4 × 4 in this example). The results are shown in Fig. 3.

When the convolution kernel size is larger than the tile size 4 and the step size is larger than 1, the steps are:

1. Starting from the upper-left corner of the original convolution kernel, combine convolution kernel elements that are one set step size apart into convolution kernels of size r × r; no element is repeated across the combined convolution kernels, and none may extend beyond the boundary of the original convolution kernel.

2. Zero-pad any split-and-combined convolution kernel smaller than r × r up to r × r.

3. Starting from the upper-left corner of the original input, combine input elements that are one set step size apart into tiles of size w × w; no element is repeated across the combined tiles, and none may extend beyond the boundary of the original input.

4. Zero-pad any split-and-combined tile smaller than w × w up to w × w. The results are shown in Fig. 4.

The Winograd addition algorithm provided by the invention proceeds as follows:

After splitting, the original input and convolution kernel have been converted into several groups of data streams, each consisting of tiles of the same size w × w and convolution kernels of the same size r × r. Within each group, the tiles and convolution kernels undergo Winograd convolution with unified parameters, and the output results are added together.
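The accumulation idea can be checked numerically with a small sketch. It shows that a strided convolution with a large kernel equals the sum of step-size-1 convolutions between split convolution kernels and correspondingly sub-sampled inputs. Several simplifications are assumptions of the sketch: a plain multiply-accumulate (`conv2d`) stands in for the Winograd unit, zero padding and the cutting of sub-inputs into w × w tiles are left out, and all function names are hypothetical.

```python
def conv2d(x, k, stride):
    """Reference sliding-window convolution (no kernel flip, no border padding)."""
    out_h = (len(x) - len(k)) // stride + 1
    out_w = (len(x[0]) - len(k[0])) // stride + 1
    return [[sum(x[i * stride + u][j * stride + v] * k[u][v]
                 for u in range(len(k)) for v in range(len(k[0])))
             for j in range(out_w)] for i in range(out_h)]


def groups(n, size, stride):
    """Index groups of at most `size` elements spaced `stride` apart."""
    out = []
    for p in range(min(stride, n)):
        idx = list(range(p, n, stride))
        out += [idx[c:c + size] for c in range(0, len(idx), size)]
    return out


def split_conv2d(x, k, stride, r):
    """Strided convolution computed as a sum of step-size-1 partial convolutions."""
    out_h = (len(x) - len(k)) // stride + 1
    out_w = (len(x[0]) - len(k[0])) // stride + 1
    total = [[0] * out_w for _ in range(out_h)]
    for rg in groups(len(k), r, stride):
        for cg in groups(len(k[0]), r, stride):
            sub_k = [[k[u][v] for v in cg] for u in rg]               # split kernel
            sub_x = [row[cg[0]::stride] for row in x[rg[0]::stride]]  # matching sub-input
            part = conv2d(sub_x, sub_k, 1)     # a Winograd unit would be used here
            for i in range(out_h):             # accumulate only the valid region
                for j in range(out_w):
                    total[i][j] += part[i][j]
    return total
```

With integer test data, `split_conv2d(x, k, stride, 3)` matches `conv2d(x, k, stride)` exactly for 5 × 5 and 7 × 7 kernels at step sizes 1, 2, and 3, which is the property the accumulation step relies on.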

Comparison of the Performance of the present invention with existing methods

For the Winograd convolution splitting method provided by the invention, the number of multiplications required under different parameters is given by formula (1), where S is the step size, m is the output matrix size, r is the set convolution kernel size, and W is the input matrix size.

The number of multiplications required by the traditional convolution algorithm under different parameters is given by formula (2):

$$\mathrm{NumMult}_{\mathrm{Convention}} = r^{2} \times (m - W + 1)^{2} \qquad (2)$$

Table 1 shows the multiplication saving rate of this algorithm, relative to the conventional convolution algorithm, for a step size of 1. In most cases, that is, where the convolution kernel size r is larger than S, the saving rate reaches 36% to 55%. Where the convolution kernel size r is smaller than S, the algorithm is inferior to the conventional convolution algorithm, but convolutions with such parameters are not used in CNN models. The benefit of the algorithm is therefore significant.

Table 1. Multiplication saving rate of the Winograd convolution splitting algorithm with a step size of 1

Table 2 shows the multiplication saving rate of this algorithm, relative to the conventional convolution algorithm, for a step size of 3. Compared with the step-size-1 case, the benefit is modest for small convolution kernels, whereas the multiplication saving rate for large convolution kernels is high. So when the step size is 3 and the convolution kernel is small, using the conventional convolution algorithm in the accelerator may be considered.

Table 2. Multiplication saving rate of the Winograd convolution splitting algorithm with a step size of 3

Examples

The invention can be realized in a PE array of a convolutional neural network accelerator.

Most current convolutional neural network accelerators that use the Winograd algorithm would need two optimizations to adopt the invention. First, the conversion module is optimized: a traditional accelerator has to include Winograd conversion modules with several different parameters to support various convolution shapes, whereas the invention supports any convolution shape with a single conversion module. As shown in Figs. 5 and 6, the conversion module supports the W = 4, r = 2 and W = 4, r = 3 transformations through resource multiplexing; the number of multiplications required by the convolution splitting algorithm, compared with conventional convolution (step size S = 1), is shown in Table 3. Second, the PE array is optimized: in this example, because the algorithm and the conversion module convert a convolution of any shape into 4 × 4 dot-product operations, the design of the PE array is also simplified, as shown in Fig. 7.

TABLE 3 Winograd-oriented convolution splitting algorithm multiplication saving rate with step length of 3

Compared with the traditional convolution algorithm, the average multiplication saving rate of the method can reach 51.8% when the step size is 1.

Claims

1. A Winograd convolution splitting method for a convolution neural network accelerator is characterized in that the method can split convolutions of different shapes and convert the convolutions into a unified data stream, and simultaneously supports 2 Winograd parameters of W =4, R =2 and W =4, R =3, and comprises the following steps:

1) Reading input and convolution kernels with any size from a cache of a convolution neural network accelerator; the convolutional neural network accelerator can simultaneously support 2 Winograd parameters of W =4, R =2 and W =4, R =3, and the specific implementation method is as follows:

for the conversion modules of input and convolution kernels, only one conversion module is needed to support any one convolution shape of W =4, R =3 or 2 through circuit resource multiplexing;

for the output conversion module, only one conversion module is needed to support any one convolution shape of W =4, R =3 or 2 through circuit resource multiplexing;

for the PE array for calculating the dot product, the Winograd parameter of W =4, R =3 or 2 is uniformly converted into 4 x 4 dot product operation through convolution splitting, so the design of the PE array is also simplified into 16 unified multipliers;

2) Judging whether to carry out convolution splitting or not according to the size of the convolution kernel and the input size, and if the convolution splitting is required, carrying out the next step; the specific judgment method is as follows:

if the size of the convolution kernel is smaller than the set size of the convolution kernel and the input size is smaller than the set input size, performing no convolution splitting and directly filling zero into the set size of the convolution kernel and the input size; if the size of the convolution kernel is larger than the set convolution kernel size and the input size is larger than the set input size, performing convolution splitting; the multiplication times introduced by the convolution splitting method under different parameters are shown in formula (1): wherein S is the step length, m is the output matrix size, r is the set convolution kernel size, and W is the input matrix size;

the convolution splitting method can simultaneously support the Winograd algorithm parameters W = 4, r = 2 and W = 4, r = 3;

3) Splitting the convolution kernel according to the size and the step length of the convolution kernel, and splitting the input according to the size and the step length of the input; the specific implementation method comprises the following steps:

30) Splitting an original convolution kernel into a plurality of convolution kernels with the size not larger than the set convolution kernel, wherein the adjacent distance of each element in the split convolution kernels in the horizontal direction and the vertical direction is a convolution step length; the specific implementation method comprises the following steps:

303) Combining all elements into a new convolution kernel according to positions until the first convolution kernel is split;

306) Repeating the above steps until the remaining convolution kernels are split;

31) Splitting an original input into a plurality of inputs with the size not larger than the set input size, wherein the adjacent distance of each element in the split inputs in the horizontal and vertical directions is a convolution step length; the specific implementation method comprises the following steps:

312) Taking the next element in the horizontal and vertical directions by taking the convolution step length as the stride until the number of the set input size is obtained;

316) Repeating the above steps until the remaining inputs are split;

5) Carrying out Winograd convolution on each pair of split input and convolution kernel; the specific implementation method comprises the following steps:

carrying out Winograd input conversion on the input of the set input size;

carrying out Winograd output conversion on the dot-product result according to the set output size;

2. The Winograd convolution splitting method for the convolutional neural network accelerator as claimed in claim 1, wherein the specific implementation method of step 4) is as follows: