CN107993186B - 3D CNN acceleration method and system based on Winograd algorithm
Abstract
The invention discloses a 3D CNN acceleration method and system based on the Winograd algorithm. The implementation steps of the method are: reading a feature map sub-block to be transformed from the input feature map and a convolution kernel sub-block from the weight cache; executing the 3D Winograd algorithm on the feature map sub-block Bin and the convolution kernel sub-block and accumulating the output results; judging whether all input feature maps have been read completely; and, when they have, writing the accumulated result back to the output feature map buffer Out. The invention extends the Winograd algorithm, which has been used for 2D CNN acceleration with good results, to 3D CNN computation; it can effectively reduce the computational complexity of the CNN algorithm and improve the computational performance and energy-efficiency ratio of an FPGA-based 3D CNN accelerator.
Description
Technical Field
The invention relates to 3D CNN (three-dimensional convolutional neural network) acceleration technology, and in particular to a 3D CNN acceleration method and system based on the Winograd algorithm for use on embedded platforms.
Background
With the development of the field of artificial intelligence, three-dimensional convolutional neural networks (3D CNNs) have been widely used in many complex computer vision applications, such as video classification, human motion detection and medical image analysis. Unlike a traditional two-dimensional convolutional neural network (2D CNN), a 3D CNN can retain the temporal information of three-dimensional data during processing, and can therefore achieve better results than a 2D CNN in the field of three-dimensional image recognition and classification.
With the improvement of CNN recognition accuracy, CNN network structures have become increasingly complex, and the computation and storage complexity of the networks keeps growing. Since traditional CPU processors can no longer cope with the heavy parallel computing requirements of CNN networks, various types of accelerators such as GPUs, ASICs and FPGAs have been proposed in succession. Among these acceleration platforms, FPGAs have gained favor with researchers due to their reconfigurability and large amount of computational logic resources. Moreover, FPGA vendors such as Intel and Xilinx have successively released High-Level Synthesis (HLS) tools, which effectively reduce the programming difficulty of FPGAs and greatly shorten the development cycle of FPGA accelerators, making the FPGA one of the best choices for accelerating CNNs.
To the knowledge of the inventors, current FPGA-based CNN accelerators are all oriented toward 2D CNNs, and no published work studies FPGA-based 3D CNN acceleration. Compared with 2D CNNs, 3D CNNs have higher computation and storage complexity, so how to efficiently utilize the limited computation and storage resources of an FPGA to construct an accelerator for a complex 3D CNN is a key problem worth researching.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, and considering that the main computational load of a CNN is the convolution computation in its convolutional layers and that the 2D Winograd algorithm has already been used for CNN acceleration with good results, the present invention provides a 3D CNN acceleration method based on the Winograd algorithm that extends the Winograd algorithm to 3D CNN computation, so that the computational complexity of the CNN algorithm can be effectively reduced and the computational performance and energy-efficiency ratio of an FPGA-based 3D CNN accelerator can be improved.
In order to solve the technical problems, the invention adopts the technical scheme that:
the invention provides a 3D CNN acceleration method based on Winograd algorithm, which comprises the following implementation steps:
1) reading a feature map sub-block Bin to be transformed from an input feature map in, and reading a convolution kernel sub-block Bw from the weight cache w;
2) executing the 3D Winograd algorithm on the feature map sub-block Bin and the convolution kernel sub-block Bw to output a result Tp1;
3) accumulating the result Tp1 output by the 3D Winograd algorithm and outputting an accumulation result Sum;
4) judging whether all input feature maps in the input feature map in have been read completely; if not, jumping to step 1); otherwise, jumping to step 5);
5) writing the accumulation result Sum back to the output feature map buffer Out, as sketched below.
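A minimal sketch of this control flow follows (assumptions: the names load_feature_subblock, load_kernel_subblock and winograd_3d are illustrative placeholders for the loading functions and the 3D Winograd algorithm of steps 1)-2), not interfaces defined by the invention):

```python
def accumulate_output_tile(in_maps, w_cache, m0, dep, row, col, num_inputs,
                           load_feature_subblock, load_kernel_subblock, winograd_3d):
    """Steps 1)-5) for one output position: run the 3D Winograd algorithm for every
    input feature map and accumulate the results into the tile that step 5) would
    write back to the output feature map buffer Out."""
    acc = None
    for n in range(num_inputs):                                     # step 4): repeat until all inputs are read
        bin_blk = load_feature_subblock(in_maps, n, dep, row, col)  # step 1): feature map sub-block Bin
        bw_blk = load_kernel_subblock(w_cache, m0, n)               # step 1): convolution kernel sub-block Bw
        tp1 = winograd_3d(bin_blk, bw_blk)                          # step 2): result Tp1
        acc = tp1 if acc is None else acc + tp1                     # step 3): accumulation result Sum
    return acc                                                      # step 5): caller writes Sum back to Out
```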
Preferably, when the feature map sub-block Bin to be transformed is read from the input feature map in in step 1), the feature map sub-block Bin is read in a Z, R, C, M, N five-loop traversal order, where Z, R and C respectively represent the depth, height and width of an output feature map, M represents the number of output feature maps, and N represents the number of input feature maps; the functional expression of the loading function used for reading the feature map sub-block to be transformed is shown in formula (1), and the functional expression of the loading function used for reading the convolution kernel sub-block Bw from the weight cache w is shown in formula (2);
in formula (1), Bin[k][j][i] represents the element of the feature map sub-block Bin with read subscripts k, j, i, the size of the feature map sub-block Bin being n×n×n; dep, row, col respectively represent the index values in the depth, height and width directions of the sub-block to be read in a certain feature map; in represents the input feature map in; S represents the sliding stride of the convolution window; r represents the dimension of the convolution kernel; and W represents the width of the input feature map;
Bw[k][j][i] = w[m0][n][k][j*r+i], 0 ≤ i, j, k < r. (2)
in formula (2), Bw[k][j][i] denotes the element of the convolution kernel sub-block Bw with read subscripts k, j, i; w denotes the weight cache; m0 and n denote the indices of the convolution kernel (the weight cache w holds M groups of weights, each group comprising N convolution kernels); and r denotes the dimension of the convolution kernel.
Preferably, the detailed steps of step 2) include:
2.1) sequentially performing column transformation and row transformation on each plane of the feature map sub-block Bin of size n×n×n to obtain a transformed feature map sub-block Tin; rotating the feature map sub-block Tin clockwise by 90° so that the data positions in the feature map sub-block Tin are rearranged, obtaining the rotated feature map sub-block TinR; and performing column transformation on each plane of the rotated feature map sub-block TinR to obtain a transformed feature map sub-block Tin1 of size n×n×n;
2.2) sequentially performing column transformation and row transformation on each plane of the convolution kernel sub-block Bw of size r×r×r to obtain a transformed convolution kernel sub-block Tw; rotating the convolution kernel sub-block Tw clockwise by 90° so that the data positions in the convolution kernel sub-block Tw are rearranged, obtaining the rotated convolution kernel sub-block TwR; and performing column transformation on each plane of the rotated convolution kernel sub-block TwR to obtain a transformed convolution kernel sub-block Tw1 of size n×n×n;
2.3) performing a dot multiplication operation on the feature map sub-block Tin1 and the convolution kernel sub-block Tw1, which are of exactly the same size, to obtain a dot multiplication result P of size n×n×n;
2.4) sequentially performing column transformation and row transformation on each plane of the dot multiplication result P of size n×n×n to obtain a transformed dot multiplication result Tp; rotating the transformed dot multiplication result Tp clockwise by 90° so that the data positions in the transformed dot multiplication result Tp are rearranged, obtaining the rotated dot multiplication result TpR; and performing column transformation on each plane of the rotated dot multiplication result TpR to obtain a transformed dot multiplication result Tp1 of size m×m×m, which is output as the result of executing the 3D Winograd algorithm.
Preferably, the function expression of the column transformation in the step 2.1) is shown as formula (3), and the function expression of the row transformation is shown as formula (4);
in formula (3), (x0 x1 x2 x3)^T represents a column of the input feature map sub-block to be transformed, and (x0' x1' x2' x3')^T represents the corresponding column of the feature map sub-block after column transformation;
in formula (4), (x0 x1 x2 x3) represents a row of the input feature map sub-block to be transformed, and (x0' x1' x2' x3') represents the corresponding row of the feature map sub-block after row transformation.
Preferably, the functional expression of the column transformation in step 2.2) is as shown in formula (5), and the functional expression of the row transformation is as shown in formula (6);
in formula (5), (w0 w1 w2)^T represents a column of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3')^T represents the corresponding column of the convolution kernel sub-block after column transformation;
in formula (6), (w0 w1 w2) represents a row of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3') represents the corresponding row of the convolution kernel sub-block after row transformation.
Preferably, the functional expression of the column transformation in step 2.4) is as shown in formula (7), and the functional expression of the row transformation is as shown in formula (8);
in formula (7), (m0 m1 m2 m3)^T represents a column of the dot multiplication sub-block to be transformed, and (m0' m1')^T represents the corresponding column of the dot multiplication sub-block after column transformation;
in formula (8), (m0 m1 m2 m3) represents a row of the dot multiplication sub-block to be transformed, and (m0' m1') represents the corresponding row of the dot multiplication sub-block after row transformation.
Preferably, the functional expression of the clockwise 90° rotation is shown in formula (9);
D^R_{i,k,j} ← D_{i,j,k} (9)
in formula (9), D_{i,j,k} denotes an element before the clockwise 90° rotation, D^R denotes the same element after the rotation, and i, j, k are the indices of the element's row, column and depth, respectively.
Preferably, in step 5) the accumulation result Sum is written back to the output feature map buffer Out, and the functional expression of the write-back function is shown in formula (10);
Out[m0][dep+k][(row+i)*C+col+j] = Sum[m0][k][i][j], 0 ≤ i, j, k ≤ m-1. (10)
in formula (10), Out denotes the output feature map buffer; m0 denotes the index of the convolution kernel; dep, row, col respectively denote the index values in the depth, height and width directions of the sub-block in a certain feature map; i, j, k are the row, column and depth indices within the sub-block; and Sum[m0][k][i][j] denotes the pixel at depth k, height i and width j in the accumulation result Sum of the m0-th output feature map.
The invention also provides a 3D CNN accelerating system based on Winograd algorithm, which comprises an IP core and is characterized in that: the IP core is programmed to perform the steps of the aforementioned Winograd algorithm based 3D CNN acceleration method of the present invention.
The invention also provides a 3D CNN acceleration system based on the Winograd algorithm, which comprises off-chip storage and an IP core, and is characterized in that: the IP core comprises on-chip storage and a computation module; the on-chip storage comprises an input buffer, a weight buffer and an output buffer, which are respectively connected with the off-chip storage; the computation module comprises a pooling layer unit POOL, an activation layer unit ReLU and a plurality of processing units PU arranged in parallel; each processing unit PU comprises an accumulation module and a plurality of parallel basic processing units PE, the input ends of the basic processing units PE being simultaneously connected with the input buffer and the weight buffer, and the output ends of the basic processing units PE being connected with the output buffer through the accumulation module, the activation layer unit ReLU and the pooling layer unit POOL; each basic processing unit PE comprises an input feature map transformation array, an input feature map cache, an input weight transformation array, an input weight cache, a dot multiplication module, a dot multiplication result cache and a dot multiplication result transformation array; the input end of the input feature map transformation array is connected with the input buffer, and its output end is connected with one input end of the dot multiplication module through the input feature map cache; the input end of the input weight transformation array is connected with the weight buffer, and its output end is connected with the other input end of the dot multiplication module through the input weight cache; the output end of the dot multiplication module is connected with the input end of the accumulation module through the dot multiplication result cache and the dot multiplication result transformation array; and the input feature map transformation array, the input weight transformation array and the dot multiplication result transformation array each comprise a column transformation module and a row transformation module connected in sequence.
The 3D CNN acceleration method based on the Winograd algorithm of the present invention has the following advantages: it extends the Winograd algorithm, which has been used for 2D CNN acceleration with good results, to 3D CNN computation, so that the computational complexity of the CNN algorithm can be effectively reduced and the computational performance and energy-efficiency ratio of the FPGA-based 3D CNN accelerator can be improved.
Drawings
FIG. 1 is a schematic diagram of a basic process of an embodiment of the present invention.
FIG. 2 is a schematic diagram of the 3D Winograd algorithm in a method according to an embodiment of the present invention.
Fig. 3 is a flowchart of a 3D Winograd algorithm in a method according to an embodiment of the invention.
FIG. 4 shows the pseudocode of a method according to an embodiment of the invention.
Fig. 5 is a schematic structural diagram of a system according to an embodiment of the invention.
Fig. 6 is a schematic structural diagram of a basic processing unit PE of a system according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of a matrix transformation template structure used in row-column transformation according to an embodiment of the present invention.
Fig. 8 is a schematic diagram illustrating the fully connected layer acceleration method of the system according to an embodiment of the present invention.
FIG. 9 is a graph illustrating a comparison of two embodiments of the present invention with different types of accelerators.
Detailed Description
The first embodiment is as follows:
as shown in fig. 1, the implementation steps of the 3D CNN acceleration method based on the Winograd algorithm in this embodiment include:
1) reading a feature map sub-block Bin to be transformed from an input feature map in, and reading a convolution kernel sub-block Bw from the weight cache w;
2) executing the 3D Winograd algorithm on the feature map sub-block Bin and the convolution kernel sub-block Bw to output a result Tp1;
3) accumulating the result Tp1 output by the 3D Winograd algorithm and outputting an accumulation result Sum;
4) judging whether all input feature maps in the input feature map in have been read completely; if not, jumping to step 1); otherwise, jumping to step 5);
5) writing the accumulation result Sum back to the output feature map buffer Out.
In this embodiment, when the feature map sub-block Bin to be transformed is read from the input feature map in in step 1), a Z, R, C, M, N five-loop traversal order is adopted to read the feature map sub-block Bin to be transformed (see fig. 2), where Z, R and C respectively represent the depth, height and width of the output feature map, M represents the number of output feature maps, and N represents the number of input feature maps; the functional expression of the loading function adopted to read the feature map sub-block Bin to be transformed is shown in formula (1), and the functional expression of the loading function adopted to read the convolution kernel sub-block Bw from the weight cache w is shown in formula (2);
in formula (1), Bin[k][j][i] represents the element of the feature map sub-block Bin with read subscripts k, j, i, the size of the feature map sub-block Bin being n×n×n; dep, row, col respectively represent the index values in the depth, height and width directions of the sub-block to be read in a certain feature map; in represents the input feature map in; S represents the sliding stride of the convolution window; r represents the dimension of the convolution kernel; and W represents the width of the input feature map;
Bw[k][j][i] = w[m0][n][k][j*r+i], 0 ≤ i, j, k < r. (2)
in formula (2), Bw[k][j][i] denotes the element of the convolution kernel sub-block Bw with read subscripts k, j, i; w denotes the weight cache; m0 and n denote the indices of the convolution kernel (the weight cache w holds M groups of weights, each group comprising N convolution kernels); and r denotes the dimension of the convolution kernel.
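For reference, a sketch of the two loading functions in plain array terms is given below; the indexing of the weight loader matches formula (2), while the indexing of the feature map loader is an assumption reconstructed from the description above, since formula (1) itself is not reproduced in this text:

```python
import numpy as np

def load_weight(w, m0, n, r=3):
    # Formula (2): Bw[k][j][i] = w[m0][n][k][j*r + i], 0 <= i, j, k < r.
    # w is the weight cache: a 4-D array [M][N][r][r*r] with each kernel plane flattened.
    Bw = np.empty((r, r, r), dtype=w.dtype)
    for k in range(r):
        for j in range(r):
            for i in range(r):
                Bw[k][j][i] = w[m0][n][k][j * r + i]
    return Bw

def load_image(in_maps, n, dep, row, col, W, S=1, tile=4):
    # Assumed form of formula (1): read a tile x tile x tile sub-block Bin (tile = 4 for
    # F(2x2x2, 3x3x3)) whose depth/height/width offsets are dep, row, col (scaled by the
    # stride S) from the n-th input feature map, whose height/width plane is flattened
    # into rows of width W.
    Bin = np.empty((tile, tile, tile), dtype=in_maps.dtype)
    for k in range(tile):
        for j in range(tile):
            for i in range(tile):
                Bin[k][j][i] = in_maps[n][dep * S + k][(row * S + j) * W + col * S + i]
    return Bin
```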
In this embodiment, the detailed steps of step 2) include:
2.1) sequentially performing column transformation and row transformation on each plane of the feature map sub-block Bin of size n×n×n to obtain a transformed feature map sub-block Tin; rotating the feature map sub-block Tin clockwise by 90° so that the data positions in the feature map sub-block Tin are rearranged, obtaining the rotated feature map sub-block TinR; and performing column transformation on each plane of the rotated feature map sub-block TinR to obtain a transformed feature map sub-block Tin1 of size n×n×n;
2.2) sequentially performing column transformation and row transformation on each plane of the convolution kernel sub-block Bw of size r×r×r to obtain a transformed convolution kernel sub-block Tw; rotating the convolution kernel sub-block Tw clockwise by 90° so that the data positions in the convolution kernel sub-block Tw are rearranged, obtaining the rotated convolution kernel sub-block TwR; and performing column transformation on each plane of the rotated convolution kernel sub-block TwR to obtain a transformed convolution kernel sub-block Tw1 of size n×n×n;
2.3) performing a dot multiplication operation on the feature map sub-block Tin1 and the convolution kernel sub-block Tw1, which are of exactly the same size, to obtain a dot multiplication result P of size n×n×n;
2.4) sequentially performing column transformation and row transformation on each plane of the dot multiplication result P of size n×n×n to obtain a transformed dot multiplication result Tp; rotating the transformed dot multiplication result Tp clockwise by 90° so that the data positions in the transformed dot multiplication result Tp are rearranged, obtaining the rotated dot multiplication result TpR; and performing column transformation on each plane of the rotated dot multiplication result TpR to obtain a transformed dot multiplication result Tp1 of size m×m×m, which is output as the result of executing the 3D Winograd algorithm.
In this embodiment, the function expression of the column transformation in step 2.1) is as shown in formula (3), and the function expression of the row transformation is as shown in formula (4);
in formula (3), (x0 x1 x2 x3)^T represents a column of the input feature map sub-block to be transformed, and (x0' x1' x2' x3')^T represents the corresponding column of the feature map sub-block after column transformation;
in formula (4), (x0 x1 x2 x3) represents a row of the input feature map sub-block to be transformed, and (x0' x1' x2' x3') represents the corresponding row of the feature map sub-block after row transformation.
In this embodiment, the function expression for column transformation in step 2.2) is shown in formula (5), and the function expression for row transformation is shown in formula (6);
in formula (5), (w0 w1 w2)^T represents a column of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3')^T represents the corresponding column of the convolution kernel sub-block after column transformation;
in formula (6), (w0 w1 w2) represents a row of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3') represents the corresponding row of the convolution kernel sub-block after row transformation.
In this embodiment, the function expression for column transformation in step 2.4) is shown as formula (7), and the function expression for row transformation is shown as formula (8);
in formula (7), (m0 m1 m2 m3)^T represents a column of the dot multiplication sub-block to be transformed, and (m0' m1')^T represents the corresponding column of the dot multiplication sub-block after column transformation;
in formula (8), (m0 m1 m2 m3) represents a row of the dot multiplication sub-block to be transformed, and (m0' m1') represents the corresponding row of the dot multiplication sub-block after row transformation.
In this embodiment, the functional expression of the clockwise 90° rotation is shown in formula (9);
D^R_{i,k,j} ← D_{i,j,k} (9)
in formula (9), D_{i,j,k} denotes an element before the clockwise 90° rotation, D^R denotes the same element after the rotation, and i, j, k are the indices of the element's row, column and depth, respectively.
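A direct transcription of formula (9) (array axes ordered as row, column, depth, following the index convention stated above):

```python
import numpy as np

def rotate(D):
    # Formula (9): D^R[i, k, j] <- D[i, j, k].  The "clockwise 90-degree rotation"
    # rearranges the data by exchanging the column and depth indices of every element.
    rows, cols, deps = D.shape
    assert rows == cols == deps            # the sub-blocks in the method are cubic
    DR = np.empty_like(D)
    for i in range(rows):
        for j in range(cols):
            for k in range(deps):
                DR[i, k, j] = D[i, j, k]
    return DR                              # equivalent to np.swapaxes(D, 1, 2)
```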
In this embodiment, step 5) writes the accumulation result Sum back to the output feature map buffer Out, and the functional expression of the write-back function is shown in formula (10);
Out[m0][dep+k][(row+i)*C+col+j] = Sum[m0][k][i][j], 0 ≤ i, j, k ≤ m-1. (10)
in formula (10), Out denotes the output feature map buffer; m0 denotes the index of the convolution kernel; dep, row, col respectively denote the index values in the depth, height and width directions of the sub-block in a certain feature map; i, j, k are the row, column and depth indices within the sub-block; and Sum[m0][k][i][j] denotes the pixel at depth k, height i and width j in the accumulation result Sum of the m0-th output feature map.
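A plain transcription of the write-back of formula (10), assuming (as the indexing implies) that Out stores each output feature map with its height/width plane flattened into rows of width C, the output feature map width:

```python
def send(Sum, Out, m0, dep, row, col, C, m=2):
    # Formula (10): Out[m0][dep+k][(row+i)*C + col+j] = Sum[m0][k][i][j], 0 <= i, j, k <= m-1
    for k in range(m):
        for i in range(m):
            for j in range(m):
                Out[m0][dep + k][(row + i) * C + col + j] = Sum[m0][k][i][j]
```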
The computation pattern of a 3D CNN is similar to that of a 2D CNN and mainly comprises the computational loads of the convolutional layers, activation layers, pooling layers and fully connected layers. Among them, the convolutional layers account for more than 90% of the computational load, the remaining part mainly consists of the fully connected layers, and the activation and pooling layers are almost negligible. Winograd is a fast convolution algorithm; in essence, it replaces the multiplications of high computational cost in convolution with additions of low computational cost, so the algorithm can effectively reduce the complexity of convolution computation. Since the main computational load of a CNN is the convolution computation in the convolutional layers, a few prior works have used the 2D Winograd algorithm for CNN acceleration and achieved good results. The 3D CNN acceleration method based on the Winograd algorithm of the present invention extends the algorithm and applies it to 3D CNN computation. In this embodiment, the conventional 3D convolution computation is denoted F(m×m×m, r×r×r), i.e., an image x of size n×n×n is convolved with a convolution kernel w of size r×r×r to obtain a convolution result z of size m×m×m, where n = m + r - 1. In this embodiment, the 3D Winograd algorithm is expressed mathematically as shown in formula (11):
z = (M((XxX^T)^R X^T ⊙ (WwW^T)^R W^T)M^T)^R M^T (11)
in formula (11), z denotes the convolution result of size m×m×m, w denotes the convolution kernel of size r×r×r, x denotes the image of size n×n×n, and M, X, W are constant transformation matrices, with M^T, X^T, W^T denoting the transposes of M, X, W respectively. The concrete values of all matrices are determined by m and r; ⊙ denotes the dot multiplication operation, and R denotes the rotation operation (90° in the clockwise direction). For simplicity, this embodiment mainly takes F(2×2×2, 3×3×3) as an example; when m = 2 and r = 3, the constant matrices can be determined as shown in formula (12):
It should be noted that the 3D CNN acceleration method based on the Winograd algorithm of the present invention is also applicable to other values of m and r. Let x = (x0, x1, x2, x3), where x0, x1, x2, x3 are the four planes of x; XxX^T is defined as shown in formula (13):
XxX^T = X(x0, x1, x2, x3)X^T = (Xx0X^T, Xx1X^T, Xx2X^T, Xx3X^T) (13)
in formula (13), X is a constant transformation matrix and X^T denotes the transpose of X.
Further, assume xi = (xi0, xi1, xi2, xi3), i = 0, 1, 2, 3, where xi0, xi1, xi2, xi3 are the columns of xi; then the functional expression shown in formula (14) holds:
Xxi X^T = X(xi0, xi1, xi2, xi3)X^T = (Xxi0 X^T, Xxi1 X^T, Xxi2 X^T, Xxi3 X^T) (14)
In formula (14), Xxi0 X^T can be carried out in two steps: first computing Xxi0, and then computing (Xxi0)X^T. Both calculations are classical matrix-vector multiplications. It can be seen that the matrix transformations in the 3D Winograd algorithm can be regarded as a series of matrix-vector multiplications (the convolution kernel transformation and the dot multiplication result transformation can be analyzed in the same way).
Further, the dot product operation in the 3D Winograd algorithm is defined by equation (15):
m_{i,j,k} = a_{i,j,k} * b_{i,j,k}, 0 ≤ i, j, k ≤ n-1. (15)
in formula (15), m_{i,j,k}, a_{i,j,k} and b_{i,j,k} respectively denote a point of the dot multiplication result, of (XxX^T)^R X^T and of (WwW^T)^R W^T.
In this embodiment, the graphical representation of the 3D Winograd algorithm flow is shown in fig. 2 and the flowchart is shown in fig. 3; in the figures, the symbols F and R respectively denote the front plane and the right plane of the image or convolution kernel sub-block to be processed, and m is the dot multiplication result of (XxX^T)^R X^T and (WwW^T)^R W^T. As shown in fig. 2 and fig. 3, the 3D Winograd algorithm flow is described as follows:
Algorithm input: a 3D image x (of size n×n×n) and a 3D convolution kernel w (of size r×r×r);
Algorithm output: the convolution result z (of size m×m×m).
The algorithm proceeds as follows:
Step 1: perform column transformation (each column of each plane is left-multiplied by X) and then row transformation (each row of each plane is right-multiplied by X^T) on each plane of the 3D image x (front-view direction, i.e. P1, P2, P3, P4); the output is XxX^T (each plane of size n×n), which serves as the input of step 2.
Step 2: rotate the transformed image clockwise by 90°; the output is (XxX^T)^R, which serves as the input of step 3.
Step 3: perform row transformation (right-multiply by X^T) on each plane of (XxX^T)^R (front view, i.e. P'1, P'2, P'3, P'4); the output is (XxX^T)^R X^T, which serves as the input of step 7.
Step 4: perform column transformation (each column of each plane is left-multiplied by W) and then row transformation (each row of each plane is right-multiplied by W^T) on each plane of the 3D convolution kernel w (front-view direction, i.e. W1, W2, W3, W4); the output is WwW^T, which serves as the input of step 5.
Step 5: rotate the transformed convolution kernel clockwise by 90°; the output is (WwW^T)^R, which serves as the input of step 6.
Step 6: perform row transformation (right-multiply by W^T) on each plane of (WwW^T)^R (front view, i.e. W'1, W'2, W'3, W'4); the output is (WwW^T)^R W^T (each plane of size n×n), which serves as the input of step 7.
Step 7: perform the dot multiplication operation on the transformation results (XxX^T)^R X^T and (WwW^T)^R W^T, i.e. multiply each pixel of the transformed input image by the weight at the corresponding position in the transformed convolution kernel; the output is the dot multiplication result m of size n×n×n, which serves as the input of step 8.
Step 8: apply to the 3D dot multiplication result m the same transformation process as steps 1-3 (or steps 4-6), only replacing the transformation matrices by M and M^T; the output is the final result z (of size m×m×m). Steps 1-3 and steps 4-6 can be executed in parallel.
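The flow above can be checked numerically. The sketch below assumes that the constant matrices X, W, M of formula (12) are the standard Winograd F(2,3) matrices (consistent with the ×1/2 factors mentioned later for the hardware templates), and it uses the mathematically equivalent formulation of applying each 1D transform along all three axes instead of the plane-transform-plus-rotation dataflow:

```python
import numpy as np

# Assumed F(2,3) constant matrices (standard Winograd choice; formula (12) is not
# reproduced in this text).  Bt plays the role of X, G of W, At of M.
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]], float)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def winograd_f222_333(x, w):
    """F(2x2x2, 3x3x3): a 4x4x4 tile x and a 3x3x3 kernel w give a 2x2x2 output tile."""
    tx = np.einsum('ia,jb,kc,abc->ijk', Bt, Bt, Bt, x)    # input transform  (steps 1-3)
    tw = np.einsum('ia,jb,kc,abc->ijk', G, G, G, w)       # kernel transform (steps 4-6)
    p = tx * tw                                           # dot multiplication (step 7)
    return np.einsum('ia,jb,kc,abc->ijk', At, At, At, p)  # output transform (step 8)

# Check against a direct 3D (cross-)correlation of the same tile.
rng = np.random.default_rng(0)
x, w = rng.standard_normal((4, 4, 4)), rng.standard_normal((3, 3, 3))
ref = np.zeros((2, 2, 2))
for i in range(2):
    for j in range(2):
        for k in range(2):
            ref[i, j, k] = np.sum(x[i:i+3, j:j+3, k:k+3] * w)
assert np.allclose(winograd_f222_333(x, w), ref)
```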
In this embodiment, when the 3D Winograd algorithm is applied to the 3D CNN, the 3D Winograd algorithm is mainly applied to the convolution layer calculation in the 3D CNN, and the pseudo code of the algorithm is shown in fig. 4. Referring to fig. 4, the algorithm includes five cycles, where Z, R, C represent the depth, height and width of the output feature maps, respectively, M represents the number of output feature maps (output channels), and N represents the number of input feature maps (input channels). Each function in the code is specifically described below.
Load_Image(in, n, dep, row, col, Bin): feature map loading function. The input parameter in denotes the feature map buffer (the data structure is a three-dimensional array), n denotes the index of the feature map, and dep, row, col respectively denote the index values in the depth, height and width directions of the sub-block to be read in that feature map; the output parameter Bin denotes the read feature map sub-block of size n×n×n. The function reads a 4×4×4 sub-block from the specified location in the n-th feature map buffer. Assuming that the input feature map is of size D×W×H and each convolution kernel is of size r×r×r (Ksize), the function is given by formula (1).
Load_Weight(w, m0, n, Bw): weight loading function. The input parameter w denotes the weight cache (the data structure is a four-dimensional array), the input parameters m0 and n denote the indices of the convolution kernel (there are M groups of weights in total, with N convolution kernels per group), and the output parameter Bw denotes the read weight sub-block. The function reads the n-th convolution kernel of the m0-th group in the weight cache, as given by formula (2).
2D_Trans_X(Bin, Tin): feature map transformation function. The input Bin denotes the feature map sub-block to be transformed, and the output Tin denotes the transformed feature map sub-block. The function performs column and row transformations on each column and each row of Bin; the transformation process is X·Bin·X^T. 2D_Trans_W() and 2D_Trans_M() are matrix transformation functions as well; their transformation processes are identical to that of 2D_Trans_X except that the transformation matrices are W (W^T) and M (M^T), respectively.
Rotate(Tin, TinR): sub-block rotation function, where Tin denotes the sub-block to be rotated and TinR denotes the rotated sub-block. The function rotates Tin clockwise by 90°, which essentially rearranges the positions of the data in the sub-block. Suppose D_{i,j,k} is an element of Tin and D^R is the corresponding element of TinR, with i, j, k being the row, column and depth indices within the sub-block; then the transformation operation of Rotate(Tin, TinR) is given by formula (9).
1D_Trans_X(TinR, Tin1): feature map transformation function. The input TinR denotes the feature map sub-block to be transformed (i.e. the output of 2D_Trans_X(Bin, Tin) after rotation), and the output Tin1 denotes the transformed feature map sub-block. Suppose In is a plane of TinR; the function performs a row transformation or a column transformation on In, the transformation process being In·X^T (row transformation) or X·In (column transformation). 1D_Trans_W() and 1D_Trans_M() are matrix transformation functions as well; their transformation processes are identical to that of 1D_Trans_X except that the transformation matrices are W (W^T) and M (M^T), respectively.
3D_Mul(Tin1, Tw1, P): dot multiplication function. Its inputs are two 3D data blocks Tin1 and Tw1 of identical size, and the output P is their dot product. The function can be described by formula (16):
P_{i,j,k} ← Tin1_{i,j,k} * Tw1_{i,j,k} (16)
in formula (16), P_{i,j,k} is the dot product of Tin1_{i,j,k} and Tw1_{i,j,k}, and i, j, k are the row, column and depth indices of the input and output sub-blocks, respectively.
3D_Accumulate(Tp1, Sum): accumulation function. The input parameter Tp1 is the sub-block to be accumulated and the output Sum is the accumulation result. The accumulation result of each call serves as the input of the next accumulation. The function can be described by formula (17):
Sum_{i,j,k} ← Sum_{i,j,k} + Tp1_{i,j,k} (17)
in formula (17), Sum_{i,j,k} denotes the accumulation result obtained by adding Tp1_{i,j,k} onto Sum_{i,j,k}, and i, j, k are the row, column and depth indices of Tp1 and Sum, respectively.
Send(Sum, m0, dep, row, col, Out): result write-back function, whose input is the accumulation result Sum; according to the index values m0, dep, row and col, it writes the result back to the designated position in the m0-th output feature map in the output buffer Out (the data structure is a three-dimensional array). The function can be described by formula (10).
The embodiment also provides a 3D CNN acceleration system based on the Winograd algorithm, which includes an IP core, and the IP core is programmed to execute the steps of the aforementioned 3D CNN acceleration method based on the Winograd algorithm.
As shown in fig. 5, this embodiment further provides a 3D CNN acceleration system based on the Winograd algorithm, which comprises off-chip storage and an IP core. The IP core comprises on-chip storage and a computation module. The on-chip storage comprises an input buffer, a weight buffer and an output buffer, each connected with the off-chip storage. The computation module comprises a pooling layer unit POOL, an activation layer unit ReLU and a plurality of processing units PU arranged in parallel; each processing unit PU comprises an accumulation module and a plurality of parallel basic processing units PE, whose input ends are simultaneously connected with the input buffer and the weight buffer and whose output ends are connected with the output buffer through the accumulation module, the activation layer unit ReLU and the pooling layer unit POOL. As shown in fig. 6, each basic processing unit PE comprises an input feature map transformation array, an input feature map cache, an input weight transformation array, an input weight cache, a dot multiplication module, a dot multiplication result cache and a dot multiplication result transformation array. The input end of the input feature map transformation array is connected with the input buffer, and its output end is connected with one input end of the dot multiplication module through the input feature map cache; the input end of the input weight transformation array is connected with the weight buffer, and its output end is connected with the other input end of the dot multiplication module through the input weight cache; the output end of the dot multiplication module is connected with the input end of the accumulation module through the dot multiplication result cache and the dot multiplication result transformation array. The input feature map transformation array, the input weight transformation array and the dot multiplication result transformation array each comprise a column transformation module and a row transformation module connected in sequence.
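A structural sketch of this hierarchy (illustrative names only; the real modules are hardware blocks generated through HLS, not software classes):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TransformArray:
    """A column-transformation module followed by a row-transformation module."""
    role: str  # "input feature map", "input weight" or "dot-product result"

@dataclass
class PE:
    """Basic processing unit: three transform arrays and their caches around a dot-multiplication module."""
    feature_array: TransformArray = field(default_factory=lambda: TransformArray("input feature map"))
    weight_array: TransformArray = field(default_factory=lambda: TransformArray("input weight"))
    result_array: TransformArray = field(default_factory=lambda: TransformArray("dot-product result"))

@dataclass
class PU:
    """Processing unit: Ti parallel PEs whose outputs feed one accumulation module."""
    pes: List[PE]

@dataclass
class ComputeModule:
    """To parallel PUs, followed by the ReLU and POOL units, between the on-chip buffers."""
    pus: List[PU]

def build_compute_module(To: int = 32, Ti: int = 4) -> ComputeModule:
    # To = 32, Ti = 4 are the unified design parameters reported in the tests below.
    return ComputeModule(pus=[PU(pes=[PE() for _ in range(Ti)]) for _ in range(To)])
```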
Storage management of data is a difficult point in the construction of the accelerator system. Because the data volumes of the input feature maps and the weights in a 3D CNN network are often large, it is impossible to store all input data on an FPGA chip with limited on-chip storage capacity. Therefore, this embodiment stores the input data and the final output results in the large-capacity off-chip storage. Meanwhile, this embodiment blocks the input and output data along multiple dimensions, as shown in Table 1.
Table 1: blocking coefficient table.
| Parameter | Meaning | Blocking coefficient |
| M | Number of output feature maps | To |
| N | Number of input feature maps | Ti |
| W | Input feature map width | Tw |
| H | Input feature map height | Th |
| D | Input feature map depth | Td |
| C | Output feature map width | Tc |
| R | Output feature map height | Tr |
| Z | Output feature map depth | Tz |
Since a single convolution kernel is relatively small, it is not further blocked in this embodiment. For each computation, the number of input feature values to be read is Ti×Tw×Th×Td, and the number of convolution kernel weights is To×Ti×Ksize (for F(m×m×m, r×r×r), Ksize = r×r×r); assuming that the sliding stride of the convolution kernel is S, the blocking parameters satisfy the relationship shown in formula (18):
in equation (18), S is the sliding step of the convolution kernel, r represents the dimension of the convolution kernel, and the remaining parameters are detailed in table 1.
The number of output feature values written back to the external memory after the computation is completed is To×Tc×Tr×Tz. As shown in fig. 5, in order to improve data reusability, reduce the number of memory accesses and hide the off-chip access latency, this embodiment organizes three caches on chip: the input buffer, the weight buffer and the output buffer, which respectively store the blocked input data, weights and output data. For the input buffer, this embodiment adopts a double-buffering technique so that data prefetching and computation overlap, thereby hiding the data prefetch latency. Similarly, this embodiment also uses a double-buffering technique for the output buffer to hide the latency of data write-back.
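A minimal software analogue of the double-buffering behaviour described above (assumed behaviour; prefetch, compute and write_back are placeholders for the off-chip read, the computation module and the write-back path):

```python
import threading
import queue

def run_blocked_layer(tile_indices, prefetch, compute, write_back):
    """Ping-pong sketch: while the computation works on one tile, the next tile is
    prefetched into the shadow buffer, so the prefetch latency is hidden."""
    buffers = queue.Queue(maxsize=2)          # two buffers per cache -> double buffering

    def producer():
        for t in tile_indices:
            buffers.put(prefetch(t))          # fill the buffer not currently being computed on
        buffers.put(None)                     # end-of-layer marker

    threading.Thread(target=producer, daemon=True).start()
    while (data := buffers.get()) is not None:
        write_back(compute(data))             # overlaps with the prefetch of the next tile
```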
According to the computation process of the 3D CNN combined with the Winograd algorithm, this embodiment constructs a basic processing unit PE to implement the algorithm. Referring to fig. 5 and fig. 6, the PE comprises a dot multiplication module and several groups of transformation arrays. The dot multiplication module performs the dot multiplication operation in the Winograd algorithm, and the transformation arrays are responsible for the various matrix transformation operations in the algorithm. In order to fully exploit the high parallelism of the dot multiplication and improve the computational throughput of the dot multiplication module, this embodiment unrolls the dot multiplication loops (the triple loop over the width, height and depth of the input sub-blocks), i.e. n^3 multipliers are integrated into the dot multiplication module so that n^3 multiplications are performed simultaneously and n^3 results are obtained. The matrix transformation arrays are constructed following a templated approach. Owing to the symmetry of the row and column transformations in the 3D Winograd algorithm, this embodiment finds that both can be implemented with the same template.
Therefore, this embodiment constructs the matrix transformation templates shown in fig. 7(a) to 7(c): the structure of 7(a) corresponds to the column transformation of formula (3) and the row transformation of formula (4) in step 2.1); the structure of 7(b) corresponds to the column transformation of formula (5) and the row transformation of formula (6) in step 2.2); and the structure of 7(c) corresponds to the column transformation of formula (7) and the row transformation of formula (8) in step 2.4). Each transformation template can perform a row transformation or a column transformation on the processed sub-block. Since the transformation matrices contain mostly 1 or -1 entries and many zero entries, this embodiment expands the vector multiply-accumulate operations in the matrix transformation and converts the multiplications into additions or subtractions, while special multiplications such as ×1/2 are replaced by right-shift operations. The templates designed in this embodiment thus use simple adders, subtractors and shifters in place of multipliers with large resource overhead, and can produce a row or column transformation result in one clock cycle, achieving a good balance between resources and performance.
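Assuming the standard F(2,3) matrices for formulas (3)-(8), the three templates reduce to the following add/subtract (and one halving) patterns; in the fixed-point hardware the ×1/2 factor is the right shift mentioned above, shown here as a multiplication for clarity:

```python
def feature_column_transform(x0, x1, x2, x3):
    # assumed content of the fig. 7(a) template (formulas (3)/(4)): adds/subtracts only
    return x0 - x2, x1 + x2, x2 - x1, x1 - x3

def weight_column_transform(w0, w1, w2):
    # assumed content of the fig. 7(b) template (formulas (5)/(6)); the 0.5 factors
    # become right shifts in the fixed-point hardware
    s = w0 + w2
    return w0, 0.5 * (s + w1), 0.5 * (s - w1), w2

def result_column_transform(m0, m1, m2, m3):
    # assumed content of the fig. 7(c) template (formulas (7)/(8))
    return m0 + m1 + m2, m1 - m2 - m3
```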
In order to further improve the parallelism of the matrix transformation process, the present embodiment constructs a matrix transformation array by using the template, and transforms each row and each column of the input image, the convolution kernel and the dot product result.
The computation module is the core of the entire accelerator system. In this embodiment it is described with the High-Level Synthesis (HLS) tool of Xilinx; the "Pragma HLS" keyword denotes the special directives provided by the tool, with which optimizations such as loop unrolling (Unroll) and loop pipelining (Pipeline) can be applied very conveniently. As shown in fig. 5, this embodiment organizes two levels of parallelism in the computation module: To processing units (PUs), and Ti processing elements (PEs) inside each PU. Each PU reads the same input feature map sub-block and a different convolution kernel for computation. The PUs are organized as a systolic array: all input feature map data is read by the leftmost PU in the array and passed to adjacent PUs in sequence. The advantage of this organization is that it effectively relieves the access pressure on the input buffer and changes the interconnection between hardware modules from a centralized to a distributed form, which greatly reduces the complexity of back-end placement and routing and helps raise the accelerator's clock frequency. In addition, the weight cache is organized with a global interconnection, i.e. each PU has an independent channel for accessing the weight cache. By unrolling loop L4, the To PUs can compute simultaneously; by unrolling loop L5, the Ti PE units inside each PU can compute in parallel, thereby improving the throughput of the computation module. In addition, this embodiment applies loop pipelining optimization to L3, so that the processing of adjacent sub-blocks can proceed simultaneously, further reducing the latency of the computation module. Besides the Ti PEs, each PU also integrates an accumulator module for accumulating the Ti temporary results generated simultaneously by its PE units. When all input feature maps have been traversed, the final accumulation result is processed by the activation layer ReLU and the pooling layer POOL computation modules and then sent to the output buffer; the pooling layer computation module can be bypassed according to the specific network configuration. Furthermore, the fully connected layers and the convolutional layers share one hardware acceleration module, so the whole computation module can accelerate the entire 3D CNN network.
After the core computation module is constructed, the next step is to reuse it to support acceleration of the fully connected layers, which improves the utilization efficiency of the computation module and avoids designing a separate set of computation components for the fully connected layers. The computation pattern of a fully connected layer is a matrix-vector multiplication whose inputs are a one-dimensional feature vector and a two-dimensional weight matrix. The one-dimensional feature vector is obtained by flattening the multiple feature maps produced by the preceding computation layer (generally a pooling layer) and concatenating them into one long vector. The result of the fully connected layer is again a one-dimensional feature vector, which serves as the input of the next fully connected layer. Fig. 8 shows the fully connected layer acceleration method proposed in this embodiment. The fully connected layer is computed in a batched manner, i.e. Batchsize input feature vectors are organized into an input feature map, which increases the reuse of weight data and effectively reduces the number of memory accesses. The value of Batchsize equals the size of the sub-block on which F(m×m×m, r×r×r) performs the dot multiplication, i.e. n^3 (n = m + r - 1). The pixel values at the same position in the Batchsize feature vectors are organized into a block of size n^3 and used as the input feature map of each PE (indicated by the dashed box in fig. 8). The computation characteristics of the fully connected layer require the input feature map data of each PE to share the same weight, i.e. they are all multiplied by one and the same weight. Therefore, each weight datum in the weight cache is replicated n^3 times (i.e. 64 times) and then sent to the PEs for computation. The transformation modules in the computation module are bypassed, so the feature map and weight data obtained by each PE directly enter the dot multiplication module for computation. The multiplication results of the PEs inside a PU are accumulated in the local accumulator ACCU. In this way, after N/Ti accumulations, the Ti PEs in each PU can simultaneously calculate 1 final result in the Batchsize output feature maps, and the To PUs can simultaneously obtain To final results in the Batchsize output feature maps. In order to reduce the memory access overhead of the output feature maps, this embodiment caches the computation results of the fully connected layer in on-chip storage, so that input data can be provided directly for the computation of the next fully connected layer.
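A functional sketch of this batched mapping (a software model only; the blocked loops stand in for the To PUs and the Ti PEs, and the arithmetic reproduces the description above):

```python
import numpy as np

def fc_batched(features, weights, Ti=4, To=32):
    """features: (Batchsize, N) with Batchsize = n**3; weights: (M, N).
    Each PE handles one input position for the whole batch against one replicated
    weight; the PU accumulator sums Ti such products, repeated N/Ti times."""
    batch, N = features.shape
    M = weights.shape[0]
    out = np.zeros((batch, M), dtype=features.dtype)
    for m_base in range(0, M, To):                    # groups of To output neurons (one per PU)
        for n_base in range(0, N, Ti):                # Ti input positions per accumulation round
            for pu in range(min(To, M - m_base)):
                acc = np.zeros(batch, dtype=features.dtype)
                for pe in range(min(Ti, N - n_base)):
                    wgt = weights[m_base + pu, n_base + pe]   # one weight, reused by the whole batch
                    acc += features[:, n_base + pe] * wgt
                out[:, m_base + pu] += acc            # local ACCU result added to the output vector
    return out

# Reference: fc_batched(f, w) equals f @ w.T (the plain matrix-vector form of the layer).
```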
Since the accelerator system provided by this embodiment is ultimately implemented in hardware, the overall resource usage of the accelerator needs to be estimated before implementation. Resource estimation reveals in advance whether the hardware platform can accommodate the accelerator logic, so that the accelerator logic can be adjusted appropriately (when the accelerator logic uses less than the platform's resources, logic can be added to improve the computational performance; when the estimated resources exceed the platform's resources, the accelerator logic must be reduced appropriately). The purpose of the performance evaluation is to find an optimal set of design parameters, such as To and Ti, for constructing the accelerator; it can effectively reduce the design space and avoid selecting inefficient design parameters, thereby shortening the accelerator design cycle and maximizing the accelerator performance.
In this embodiment, DSP and BRAM resources are evaluated, and evaluation models thereof are as follows:
DSP_used = To * Ti * f(Iw) * n^3 (19)
Vin = Ti * (S*Tz + K - S) * (S*Tr + K - S) * (S*Tc + K - S) (20)
Vout = To * Tz * Tr * Tc (21)
in formulas (19) to (23), DSP_used denotes the amount of DSP resources used, Vin denotes the buffered input feature map data volume, Vout denotes the buffered output feature map data volume, a further term denotes the buffered weight volume of the fully connected layer, BRAM_used denotes the amount of BRAM resources used, Iw denotes the pipeline processing interval of the computation module, To*Ti denotes the number of basic processing units PE, Bwon denotes the bit width of the on-chip storage units, BRAM_capacity denotes the capacity of a single BRAM, Part denotes the partitioning factor of a cache, with Part_in, Part_out and Part_w denoting the partitioning factors of the input feature map cache, the output feature map cache and the weight cache respectively, N(Bwon) is determined by the corresponding formula, and n^3 is the number of multipliers. This embodiment finds that the DSP usage is related to Iw, and obtains the empirical value f(Iw) ≈ 1/Iw experimentally. The DSPs are all used to implement the dot multiplication modules in the PEs (n^3 multipliers each); the remaining computation modules, such as the matrix transformations, ReLU and POOL, are implemented with adders or comparators and therefore consume no DSPs. Because this embodiment adopts the double-buffering technique, the BRAM usage is doubled. For the convolutional layers the weight data is small, so this embodiment stores it in LUTRAM and consumes no BRAM resources for it; for the fully connected layers the weight data is large, so this embodiment stores it in BRAM. The input and output feature vector caches of the fully connected layer reuse the corresponding caches of the convolutional layer, so no additional storage overhead is introduced.
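A sketch of the DSP and buffered-volume part of this model (formulas (19)-(21); the BRAM expressions of formulas (22)-(23) are not reproduced in this text, and the default tile sizes below are illustrative values rather than parameters taken from the patent):

```python
def estimate_resources(To, Ti, Iw, n=4, S=1, K=3, Tz=8, Tr=8, Tc=8):
    f_Iw = 1.0 / Iw                                    # empirical value f(Iw) ~= 1/Iw
    dsp_used = To * Ti * f_Iw * n ** 3                 # formula (19)
    v_in = Ti * (S * Tz + K - S) * (S * Tr + K - S) * (S * Tc + K - S)   # formula (20)
    v_out = To * Tz * Tr * Tc                          # formula (21)
    return dsp_used, v_in, v_out

# usage: dsp, vin, vout = estimate_resources(To=32, Ti=4, Iw=2)
```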
In order to complete the performance evaluation of the accelerator, this embodiment mainly models the execution time of the accelerator. First, the data transfer times are evaluated as in formulas (24) to (26):
in formulas (24) to (26), the first quantity denotes the weight data volume required by the convolutional layer computation; Data_Width denotes the data bit width; BW_eff denotes the effective bandwidth of the accelerator; Vin and Vout denote the buffered input and output feature map data volumes; and T_trans_i and T_trans_o denote the transfer times of the input and output data, respectively. Next, this embodiment evaluates the total computation time based on formulas (24) to (26), as shown in formulas (27) to (28):
in formulas (27) to (28), T_com denotes the computation time; Tz, Tr and Tc denote the blocking coefficients of the output feature map depth, height and width; Iw denotes the pipeline processing interval of the computation module; m^3 denotes the size of the transformed dot multiplication result (i.e. m×m×m); Freq denotes the operating frequency of the accelerator; T_total denotes the total time, data prefetching included; Z, R and C denote the depth, height and width of the output feature maps; M denotes the number of output feature maps; N denotes the number of input feature maps; To*Ti denotes the number of basic processing units PE; and T_trans_i and T_trans_o denote the transfer times of the input and output data, respectively.
Since this embodiment adopts the double-buffering technique, data prefetching and computation overlap, so the core computation time is determined by the larger of the computation time and the data prefetch time. Therefore, in order to prevent the accelerator from being limited by memory access, this embodiment imposes the constraint on the bandwidth required by the accelerator shown in formula (29);
in formula (29), BW_m denotes the bandwidth required by the accelerator, m^3 denotes the size of the transformed dot multiplication result (i.e. m×m×m), Freq denotes the operating frequency of the accelerator, Vin and Vout denote the buffered input and output feature map data volumes, Tz, Tr and Tc denote the blocking coefficients of the output feature map depth, height and width, and Iw denotes the pipeline processing interval of the computation module. Formula (29) is derived from T_com ≥ T_trans. As can be seen, many factors affect the size of T_total, so searching the design space exhaustively for the optimal parameters would require a large time overhead. Therefore, considering the limited on-chip resources of the FPGA platform, this embodiment imposes the constraint on the design space shown in formula (30);
In the formula (30), DSP_used represents the amount of DSP resources used, BRAM_used represents the amount of BRAM resources used, DSP_total represents the total amount of DSP resources, and BRAM_total represents the total amount of BRAM resources. With this constraint, the size of the design space can be reduced effectively and the time needed to find the optimal solution is shortened.
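Putting the model together, the design-space search that formulas (29) and (30) motivate can be sketched as a brute-force scan over candidate blocking parameters, pruned by the resource budget and scored with T_total = max(T_com, T_trans) from the double-buffered model; the candidate ranges, dictionary keys and the reuse of the helper functions from the sketches above are assumptions for illustration, not the patent's own procedure.

```python
from itertools import product

def explore_design_space(layer, platform, candidates):
    """Scan (To, Ti, Iw) combinations, discard those violating a formula
    (30)-style DSP budget, and keep the configuration with the smallest
    modeled T_total = max(T_com, T_trans).  Relies on dsp_used,
    compute_time and transfer_time from the sketches above."""
    best = None
    for To, Ti, Iw in product(candidates["To"], candidates["Ti"], candidates["Iw"]):
        if dsp_used(To, Ti, layer["n"], Iw) > platform["DSP_total"]:
            continue  # violates the DSP constraint; a BRAM check would go here too
        t_com = compute_time(layer["Z"], layer["R"], layer["C"],
                             layer["M"], layer["N"], To, Ti,
                             layer["m"], Iw, platform["freq"])
        t_trans = transfer_time(layer["Vin"] + layer["Vout"],
                                platform["data_width"], platform["bw_eff"])
        t_total = max(t_com, t_trans)  # double buffering overlaps the two phases
        if best is None or t_total < best[0]:
            best = (t_total, {"To": To, "Ti": Ti, "Iw": Iw})
    return best
```

Applying the resource constraint before the timing evaluation sharply reduces the number of candidates that have to be scored, which is the effect the text attributes to formula (30).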
In this embodiment, the Xilinx HLS tool is used to implement the 3D CNN accelerator, which is encapsulated as an IP core. Around this IP core, the performance of the system on chip is tested with Xilinx Vivado 2016.4. The system on chip mainly comprises one embedded processor core (MicroBlaze), a DDR controller (mig_7series_0) and the CNN accelerator (baseWinograd_0). The processor core configures the accelerator parameters through the M_AXI_DP interface and starts the accelerator. After being started, the accelerator reads data through the DDR controller, performs the calculation, and writes the result back to DDR when the calculation finishes; the whole process requires no CPU intervention. Time statistics are obtained by reading a timer, and the information is printed through a serial port. This embodiment implements the system on chip on the Xilinx VC709 platform. The VC709 platform contains one Virtex-7 690T FPGA chip and two DDR3 chips. The accelerator synthesizes at 150 MHz on the VC709, and DDR3 chips are used on the VC709 platform.
Embodiment two:
the present embodiment is basically the same as the first embodiment; the main difference is that in this embodiment the above system on chip is implemented on the VUS440 platform of S2C. The VUS440 platform is basically the same as the VC709 platform of the first embodiment, the main differences being that the VUS440 platform comprises a Xilinx VCU440 FPGA chip and a DDR4 chip, and that the accelerator synthesizes at 200 MHz on the VUS440 platform.
In order to compare and verify the performance of the first and second embodiments, the C3D network is selected for testing; this 3D CNN model is widely used in the field of video classification. As shown in Table 2, the convolution kernels of the network are all of size 3 × 3 × 3 and all have stride 1, so the network is very well suited to calculation optimization with the Winograd algorithm.
Table 2: C3D network parameters.
In the actual tests, this embodiment does not implement a separate accelerator structure for each convolutional layer size, but tests with a unified accelerator structure whose design parameters are Ti = 4, To = 32. The experimental results are shown in fig. 9: the accelerator reaches a peak performance of 560 GOPS on the VC709 platform and 1112 GOPS on the VUS440 platform. The dotted lines in fig. 8 indicate the theoretical peak performance that an accelerator using the Winograd-based 3D CNN acceleration method can reach on the two platforms, and the solid lines indicate the computational performance predicted by the evaluation model on the two platforms. It can be seen that, on the one hand, the accelerator of this embodiment achieves high computational efficiency (measured performance / peak performance), reaching 80% for the acceleration of CONV-2; on the other hand, the evaluation model of this embodiment predicts the accelerator performance accurately.
In addition, this embodiment is compared against a CPU scheme (Intel E5-2680, optimized with OpenBLAS) and a GPU scheme (NVIDIA K40 accelerator, optimized with cuDNN). The comparison results are shown in Table 3.
Table 3: and comparing the result data table.
Referring to Table 3, with the speed and energy efficiency of the CPU (Intel E5-2680) as the baseline: the GPU accelerator (NVIDIA K40) achieves a speedup of 20 times and an energy efficiency ratio of 9.2 times; on the VC709 platform of the first embodiment, the accelerator using the Winograd-based 3D CNN acceleration method achieves a speedup of 7.3 times and an energy efficiency ratio of 17.1 times; on the VUS440 platform of the second embodiment, it achieves a speedup of 13.4 times and an energy efficiency ratio of 60.3 times. The Winograd-based 3D CNN acceleration method therefore delivers computational performance far higher than the CPU and lower than the GPU, but has a clear advantage over both the CPU and the GPU in power consumption and energy efficiency ratio.
The above description covers only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions falling under the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and refinements made by those skilled in the art without departing from the principle of the present invention are also considered to be within the protection scope of the present invention.
Claims (8)
1. A 3D CNN acceleration method based on the Winograd algorithm, characterized by comprising the following implementation steps:
1) reading a feature map sub-block Bin to be transformed from an input feature map in, and reading a convolution kernel sub-block Bw from a weight cache w;
2) executing the 3D Winograd algorithm on the feature map sub-block Bin and the convolution kernel sub-block Bw to output a result Tp1;
3) accumulating the result Tp1 output by executing the 3D Winograd algorithm and outputting an accumulation result Sum;
4) judging whether all input feature maps in the input feature map in have been read; if not, jumping to execute step 1); otherwise, jumping to execute step 5);
5) writing the accumulation result Sum back to the output feature map buffer Out;
the detailed steps of step 2) comprise:
2.1) sequentially performing column transformation and row transformation on each plane of the feature map sub-block Bin of size n × n × n to obtain the transformed feature map sub-block Tin; rotating the feature map sub-block Tin clockwise by 90° so that the data positions in the feature map sub-block Tin are rearranged, obtaining the rotated feature map sub-block TinR; performing column transformation on each plane of the rotated feature map sub-block TinR to obtain the transformed feature map sub-block Tin1 of size n × n × n;
2.2) sequentially performing column transformation and row transformation on each plane of the convolution kernel sub-block Bw of size r × r × r to obtain the transformed convolution kernel sub-block Tw; rotating the convolution kernel sub-block Tw clockwise by 90° so that the data positions in the convolution kernel sub-block Tw are rearranged, obtaining the rotated convolution kernel sub-block TwR; performing column transformation on each plane of the rotated convolution kernel sub-block TwR to obtain the transformed convolution kernel sub-block Tw1 of size n × n × n;
2.3) performing a dot multiplication operation on the feature map sub-block Tin1 and the convolution kernel sub-block Tw1, which are identical in size, to obtain a dot multiplication result P of size n × n × n;
2.4) sequentially performing column transformation and row transformation on each plane of the dot multiplication result P of size n × n × n to obtain the transformed dot multiplication result Tp; rotating the transformed dot multiplication result Tp clockwise by 90° so that the data positions in Tp are rearranged, obtaining the rotated dot multiplication result TpR; performing column transformation on each plane of the rotated dot multiplication result TpR to obtain the transformed dot multiplication result Tp1 of size m × m × m, which is output as the result Tp1 of executing the 3D Winograd algorithm (a minimal numerical sketch of steps 2.1) to 2.4) is given after the claims).
2. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, characterized in that, when the feature map sub-block Bin to be transformed is read from the input feature map in in step 1), it is read by a five-level nested loop traversal over Z, R, C, M and N, where Z, R, C respectively denote the depth, height and width of the output feature map, M denotes the number of output feature maps, and N denotes the number of input feature maps; the functional expression of the loading function used for reading the feature map sub-block Bin to be transformed is shown in formula (1), and the functional expression of the loading function used for reading the convolution kernel sub-block Bw from the weight cache w is shown in formula (2);
in the formula (1), Bin[k][j][i] denotes the feature map sub-block Bin read with subscripts k, j, i, the size of the feature map sub-block Bin being n × n × n; dep, row, col respectively denote the index values in the depth, height and width directions of the sub-block to be read in a given feature map; in denotes the input feature map in; S denotes the sliding stride of the convolution window; r denotes the dimension of the convolution kernel; and W denotes the width of the input feature map;
Bw[k][j][i]=w[m0][n][k][j*r+i],0≤i,j,k<r. (2)
in the formula (2), Bw[k][j][i] denotes the convolution kernel sub-block Bw read with subscripts k, j, i; w denotes the weight cache; m0 and n denote the indices of the convolution kernel, the weight cache w holding M groups of weights with N convolution kernels in each group; and r denotes the dimension of the convolution kernel.
3. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, wherein a function expression of column transformation in step 2.1) is shown as formula (3), and a function expression of row transformation is shown as formula (4);
in the formula (3), (x0 x1 x2 x3)^T denotes a column of the input feature map sub-block to be transformed, and (x0' x1' x2' x3')^T denotes the corresponding column of the feature map sub-block after column transformation;
in the formula (4), (x0 x1 x2 x3) denotes a row of the input feature map sub-block to be transformed, and (x0' x1' x2' x3') denotes the corresponding row of the feature map sub-block after row transformation.
4. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, wherein the function expression for column transformation in step 2.2) is shown as formula (5), and the function expression for row transformation is shown as formula (6);
in the formula (5), (w0 w1 w2)^T denotes a column of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3')^T denotes the corresponding column of the convolution kernel sub-block after column transformation;
in the formula (6), (w0 w1 w2) denotes a row of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3') denotes the corresponding row of the convolution kernel sub-block after row transformation.
5. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, wherein the function expression for column transformation in step 2.4) is shown as formula (7), and the function expression for row transformation is shown as formula (8);
in the formula (7), (m0 m1 m2 m3)^T denotes a column of the dot multiplication sub-block to be transformed, and (m0' m1')^T denotes the corresponding column of the dot multiplication sub-block after column transformation;
in the formula (8), (m0 m1 m2 m3) denotes a row of the dot multiplication sub-block to be transformed, and (m0' m1') denotes the corresponding row of the dot multiplication sub-block after row transformation.
6. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, characterized in that the functional expression of the clockwise 90° rotation is shown in formula (9);
DR_{i,k,j} ← D_{i,j,k}    (9)
in the formula (9), D_{i,j,k} is an element before the clockwise 90° rotation, DR_{i,k,j} is the corresponding element after the clockwise 90° rotation, and i, j, k are respectively the row, column and depth indices of the element.
7. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, wherein the functional expression of the write-back function employed by step 5) to write the accumulation result Sum back to the output feature map buffer Out is represented by formula (10);
Out[m0][dep+k][(row+i)*C+col+j]=Sum[m0][k][i][j],0≤i,j,k≤m-1. (10)
in the formula (10), Out denotes the output feature map buffer; m0 denotes the index of the convolution kernel; dep, row, col respectively denote the index values in the depth, height and width directions of the sub-block in a given feature map; i, j, k are the row, column and depth indices within the feature map sub-block; Sum[m0][k][i][j] denotes the element with subscripts k, i, j of the m0-th accumulation result Sum; and C denotes the width of the output feature map.
8. A 3D CNN acceleration system based on the Winograd algorithm, comprising an IP core, characterized in that: the IP core is programmed to execute the steps of the Winograd algorithm-based 3D CNN acceleration method according to any one of claims 1 to 7.
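As a concrete illustration of steps 2.1) to 2.4) of claim 1, the sketch below works the F(2×2×2, 3×3×3) case (n = 4, r = 3, m = 2) in NumPy and checks it against a direct valid 3D correlation. The per-plane column/row transformations plus the clockwise 90° rotation of the claims are collapsed here into three mode products with the standard 1D Winograd matrices B^T, G and A^T; the matrices and the NumPy formulation are an illustrative rendering under these assumptions, not code from the patent.

```python
import numpy as np

# Standard 1D Winograd F(2,3) matrices (n=4, r=3, m=2).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def transform3d(M, X):
    """Apply the 1D transform M along all three axes of tensor X; this is what
    the claim's column transform + row transform + rotate-and-column-transform
    sequence achieves for one sub-block."""
    return np.einsum('ia,jb,kc,abc->ijk', M, M, M, X)

def winograd3d_tile(tile, kernel):
    """One 4x4x4 feature map sub-block and one 3x3x3 kernel -> 2x2x2 output."""
    V = transform3d(BT, tile)    # step 2.1): input transform Tin1
    U = transform3d(G, kernel)   # step 2.2): kernel transform Tw1
    P = U * V                    # step 2.3): element-wise dot multiplication
    return transform3d(AT, P)    # step 2.4): output transform Tp1

def direct3d_tile(tile, kernel):
    """Reference: direct valid 3D correlation of the same 4x4x4 tile."""
    out = np.zeros((2, 2, 2))
    for z in range(2):
        for y in range(2):
            for x in range(2):
                out[z, y, x] = np.sum(tile[z:z+3, y:y+3, x:x+3] * kernel)
    return out

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4, 4))
g = rng.standard_normal((3, 3, 3))
assert np.allclose(winograd3d_tile(d, g), direct3d_tile(d, g))
```

The passing assertion confirms that the three separable transforms reproduce the valid correlation of the 4×4×4 sub-block with the 3×3×3 kernel, i.e. the 2×2×2 contribution that step 3) of claim 1 accumulates over the input feature maps.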
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711342538.0A CN107993186B (en) | 2017-12-14 | 2017-12-14 | 3D CNN acceleration method and system based on Winograd algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711342538.0A CN107993186B (en) | 2017-12-14 | 2017-12-14 | 3D CNN acceleration method and system based on Winograd algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107993186A CN107993186A (en) | 2018-05-04 |
CN107993186B true CN107993186B (en) | 2021-05-25 |
Family
ID=62038616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711342538.0A Active CN107993186B (en) | 2017-12-14 | 2017-12-14 | 3D CNN acceleration method and system based on Winograd algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107993186B (en) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108765247B (en) * | 2018-05-15 | 2023-01-10 | 腾讯科技(深圳)有限公司 | Image processing method, device, storage medium and equipment |
CN110766128A (en) * | 2018-07-26 | 2020-02-07 | 北京深鉴智能科技有限公司 | Convolution calculation unit, calculation method and neural network calculation platform |
US11954573B2 (en) * | 2018-09-06 | 2024-04-09 | Black Sesame Technologies Inc. | Convolutional neural network using adaptive 3D array |
CN109447241B (en) * | 2018-09-29 | 2022-02-22 | 西安交通大学 | Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things |
CN109740731B (en) * | 2018-12-15 | 2023-07-18 | 华南理工大学 | Design method of self-adaptive convolution layer hardware accelerator |
CN109919307B (en) * | 2019-01-28 | 2023-04-07 | 广东浪潮大数据研究有限公司 | FPGA (field programmable Gate array) and depth residual error network implementation method, system and computer medium |
CN109885407B (en) * | 2019-03-05 | 2021-09-21 | 上海商汤智能科技有限公司 | Data processing method and device, electronic equipment and storage medium |
CN110188865B (en) * | 2019-05-21 | 2022-04-26 | 深圳市商汤科技有限公司 | Information processing method and device, electronic equipment and storage medium |
JP7251354B2 (en) * | 2019-06-26 | 2023-04-04 | 富士通株式会社 | Information processing device, information processing program, and information processing method |
CN110516334B (en) * | 2019-08-16 | 2021-12-03 | 浪潮电子信息产业股份有限公司 | Convolution calculation simulation test method and device based on hardware environment and related equipment |
CN112686365B (en) * | 2019-10-18 | 2024-03-29 | 华为技术有限公司 | Method, device and computer equipment for operating neural network model |
CN112765538B (en) * | 2019-11-01 | 2024-03-29 | 中科寒武纪科技股份有限公司 | Data processing method, device, computer equipment and storage medium |
CN110930290B (en) * | 2019-11-13 | 2023-07-07 | 东软睿驰汽车技术(沈阳)有限公司 | Data processing method and device |
CN113033813B (en) * | 2019-12-09 | 2024-04-26 | 中科寒武纪科技股份有限公司 | Data processing method, device, computer equipment and storage medium |
CN111459877B (en) * | 2020-04-02 | 2023-03-24 | 北京工商大学 | Winograd YOLOv2 target detection model method based on FPGA acceleration |
CN111626414B (en) * | 2020-07-30 | 2020-10-27 | 电子科技大学 | Dynamic multi-precision neural network acceleration unit |
CN112862091B (en) * | 2021-01-26 | 2022-09-27 | 合肥工业大学 | Resource multiplexing type neural network hardware accelerating circuit based on quick convolution |
CN113269302A (en) * | 2021-05-11 | 2021-08-17 | 中山大学 | Winograd processing method and system for 2D and 3D convolutional neural networks |
CN113407904B (en) * | 2021-06-09 | 2023-04-07 | 中山大学 | Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network |
CN113592088B (en) * | 2021-07-30 | 2024-05-28 | 中科亿海微电子科技(苏州)有限公司 | Parallelism determination method and system based on fine-granularity convolution computing structure |
CN114003201A (en) * | 2021-10-29 | 2022-02-01 | 浙江大华技术股份有限公司 | Matrix transformation method and device and convolutional neural network accelerator |
CN113835758B (en) * | 2021-11-25 | 2022-04-15 | 之江实验室 | Winograd convolution implementation method based on vector instruction accelerated computation |
CN115906948A (en) * | 2023-03-09 | 2023-04-04 | 浙江芯昇电子技术有限公司 | Full-connection-layer hardware acceleration device and method |
CN116167423B (en) * | 2023-04-23 | 2023-08-11 | 南京南瑞信息通信科技有限公司 | Device and accelerator for realizing CNN convolution layer |
CN116248252B (en) * | 2023-05-10 | 2023-07-14 | 蓝象智联(杭州)科技有限公司 | Data dot multiplication processing method for federal learning |
CN116401502B (en) * | 2023-06-09 | 2023-11-03 | 之江实验室 | Method and device for optimizing Winograd convolution based on NUMA system characteristics |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104881877A (en) * | 2015-06-12 | 2015-09-02 | 哈尔滨工业大学 | Method for detecting image key point based on convolution and time sequence optimization of FPGA |
US20170344876A1 (en) * | 2016-05-31 | 2017-11-30 | Samsung Electronics Co., Ltd. | Efficient sparse parallel winograd-based convolution scheme |
CN107169090A (en) * | 2017-05-12 | 2017-09-15 | 深圳市唯特视科技有限公司 | A kind of special object search method of utilization content rings around information extraction characterization image |
CN107392183B (en) * | 2017-08-22 | 2022-01-04 | 深圳Tcl新技术有限公司 | Face classification recognition method and device and readable storage medium |
2017-12-14: CN application CN201711342538.0A granted as patent CN107993186B, status Active
Non-Patent Citations (1)
Title |
---|
Design of a matrix multiplication accelerator supporting an optimized blocking strategy; Shen Junzhong, Xiao Tao, Qiao Yuran, Yang Qianming, Wen Mei; Computer Engineering & Science; 30 September 2016; Vol. 38, No. 9; pp. 1748-1754 *
Also Published As
Publication number | Publication date |
---|---|
CN107993186A (en) | 2018-05-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107993186B (en) | 3D CNN acceleration method and system based on Winograd algorithm | |
Lu et al. | SpWA: An efficient sparse winograd convolutional neural networks accelerator on FPGAs | |
Liang et al. | Evaluating fast algorithms for convolutional neural networks on FPGAs | |
Liu et al. | Throughput-optimized FPGA accelerator for deep convolutional neural networks | |
Zhang et al. | Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system | |
Shen et al. | Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA | |
CN111291859B (en) | Semiconductor circuit for universal matrix-matrix multiplication data stream accelerator | |
CN106940815B (en) | Programmable convolutional neural network coprocessor IP core | |
Podili et al. | Fast and efficient implementation of convolutional neural networks on FPGA | |
Lu et al. | Evaluating fast algorithms for convolutional neural networks on FPGAs | |
CN108241890B (en) | Reconfigurable neural network acceleration method and architecture | |
JP2024096786A (en) | Utilization of sparsity of input data in neural network calculation unit | |
Wang et al. | PipeCNN: An OpenCL-based FPGA accelerator for large-scale convolution neuron networks | |
CN110851779B (en) | Systolic array architecture for sparse matrix operations | |
CN115221102B (en) | Method for optimizing convolution operation of system-on-chip and related product | |
Shahshahani et al. | Memory optimization techniques for fpga based cnn implementations | |
Tang et al. | EF-train: Enable efficient on-device CNN training on FPGA through data reshaping for online adaptation or personalization | |
Huang et al. | A high performance multi-bit-width booth vector systolic accelerator for NAS optimized deep learning neural networks | |
US20220188613A1 (en) | Sgcnax: a scalable graph convolutional neural network accelerator with workload balancing | |
Wu | Review on FPGA-based accelerators in deep learning | |
Shabani et al. | Hirac: A hierarchical accelerator with sorting-based packing for spgemms in dnn applications | |
CN114003201A (en) | Matrix transformation method and device and convolutional neural network accelerator | |
Pedram et al. | Transforming a linear algebra core to an FFT accelerator | |
Dai et al. | An energy-efficient bit-split-and-combination systolic accelerator for nas-based multi-precision convolution neural networks | |
Akin et al. | FFTs with near-optimal memory access through block data layouts: Algorithm, architecture and design automation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||