CN107993186B - 3D CNN acceleration method and system based on Winograd algorithm
Abstract
The invention discloses a 3D CNN acceleration method and system based on the Winograd algorithm. The implementation steps of the method are: reading a feature map sub-block to be transformed from the input feature map and a convolution kernel sub-block from the weight cache; executing the 3D Winograd algorithm on the feature map sub-block Bin and the convolution kernel sub-block and accumulating the output results; judging whether all input feature maps have been read completely; and, when they have, writing the accumulated result back to the output feature map buffer Out. The invention extends the Winograd algorithm, which has been used for 2D CNN acceleration with good results, to 3D CNN computation; it can effectively reduce the computational complexity of the CNN algorithm and improve the computational performance and energy-efficiency ratio of an FPGA-based 3D CNN accelerator.
Description
Technical Field
The invention relates to 3D CNN (three-dimensional convolutional neural network) acceleration technology, and in particular to a 3D CNN acceleration method and system based on the Winograd algorithm for use on embedded platforms.
Background
With the development of the field of artificial intelligence, three-dimensional convolutional neural networks (3D CNNs) have been widely used in many complex computer vision applications, such as video classification, human motion detection and medical image analysis. Unlike a traditional two-dimensional convolutional neural network (2D CNN), a 3D CNN can retain the temporal information of three-dimensional data during processing, and can therefore achieve better results than a 2D CNN in the field of three-dimensional image recognition and classification.
With the improvement of CNN recognition accuracy, CNN network structures have become increasingly complex, and the computation and storage complexity of the networks keeps growing. Since traditional CPU processors can no longer cope with the heavy parallel computing requirements of CNN networks, various types of accelerators such as GPUs, ASICs and FPGAs have been proposed in succession. Among these acceleration platforms, FPGAs have gained favor with researchers due to their reconfigurability and large amount of computational logic resources. Moreover, FPGA vendors such as Intel and Xilinx have successively released High-Level Synthesis (HLS) tools, which effectively reduce the programming difficulty of FPGAs and greatly shorten the development cycle of FPGA accelerators, making the FPGA one of the best choices for accelerating CNNs.
To the knowledge of the inventors, current FPGA-based CNN accelerators are all oriented toward 2D CNNs, and no published work studies FPGA-based 3D CNN acceleration. Compared with 2D CNNs, 3D CNNs have higher computation and storage complexity, so how to efficiently utilize the limited computation and storage resources of an FPGA to construct an accelerator for a complex 3D CNN is a key problem worth researching.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, and considering that the main computational load of a CNN is the convolution computation in its convolutional layers and that the 2D Winograd algorithm has already been used for CNN acceleration with good results, the present invention provides a 3D CNN acceleration method based on the Winograd algorithm that extends the Winograd algorithm to 3D CNN computation, so that the computational complexity of the CNN algorithm can be effectively reduced and the computational performance and energy-efficiency ratio of an FPGA-based 3D CNN accelerator can be improved.
In order to solve the technical problems, the invention adopts the technical scheme that:
the invention provides a 3D CNN acceleration method based on Winograd algorithm, which comprises the following implementation steps:
1) reading a feature map sub-block Bin to be transformed from an input feature map in, and reading a convolution kernel sub-block Bw from the weight cache w;
2) executing the 3D Winograd algorithm on the feature map sub-block Bin and the convolution kernel sub-block Bw to output a result Tp1;
3) accumulating the result Tp1 output by the 3D Winograd algorithm and outputting an accumulation result Sum;
4) judging whether all input feature maps in the input feature map in have been read completely; if not, jumping to step 1); otherwise, jumping to step 5);
5) writing the accumulation result Sum back to the output feature map buffer Out, as sketched below.
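A minimal sketch of this control flow follows (assumptions: the names load_feature_subblock, load_kernel_subblock and winograd_3d are illustrative placeholders for the loading functions and the 3D Winograd algorithm of steps 1)-2), not interfaces defined by the invention):

```python
def accumulate_output_tile(in_maps, w_cache, m0, dep, row, col, num_inputs,
                           load_feature_subblock, load_kernel_subblock, winograd_3d):
    """Steps 1)-5) for one output position: run the 3D Winograd algorithm for every
    input feature map and accumulate the results into the tile that step 5) would
    write back to the output feature map buffer Out."""
    acc = None
    for n in range(num_inputs):                                     # step 4): repeat until all inputs are read
        bin_blk = load_feature_subblock(in_maps, n, dep, row, col)  # step 1): feature map sub-block Bin
        bw_blk = load_kernel_subblock(w_cache, m0, n)               # step 1): convolution kernel sub-block Bw
        tp1 = winograd_3d(bin_blk, bw_blk)                          # step 2): result Tp1
        acc = tp1 if acc is None else acc + tp1                     # step 3): accumulation result Sum
    return acc                                                      # step 5): caller writes Sum back to Out
```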
Preferably, when the feature map sub-block Bin to be transformed is read from the input feature map in in step 1), the feature map sub-block Bin is read in a Z, R, C, M, N five-loop traversal order, where Z, R and C respectively represent the depth, height and width of an output feature map, M represents the number of output feature maps, and N represents the number of input feature maps; the functional expression of the loading function used for reading the feature map sub-block to be transformed is shown in formula (1), and the functional expression of the loading function used for reading the convolution kernel sub-block Bw from the weight cache w is shown in formula (2);
in formula (1), Bin[k][j][i] represents the element of the feature map sub-block Bin with read subscripts k, j, i, the size of the feature map sub-block Bin being n×n×n; dep, row, col respectively represent the index values in the depth, height and width directions of the sub-block to be read in a certain feature map; in represents the input feature map in; S represents the sliding stride of the convolution window; r represents the dimension of the convolution kernel; and W represents the width of the input feature map;
Bw[k][j][i] = w[m0][n][k][j*r+i], 0 ≤ i, j, k < r. (2)
in formula (2), Bw[k][j][i] denotes the element of the convolution kernel sub-block Bw with read subscripts k, j, i; w denotes the weight cache; m0 and n denote the indices of the convolution kernel (the weight cache w holds M groups of weights, each group comprising N convolution kernels); and r denotes the dimension of the convolution kernel.
Preferably, the detailed steps of step 2) include:
2.1) sequentially performing column transformation and row transformation on each plane of the feature map sub-block Bin of size n×n×n to obtain a transformed feature map sub-block Tin; rotating the feature map sub-block Tin clockwise by 90° so that the data positions in the feature map sub-block Tin are rearranged, obtaining the rotated feature map sub-block TinR; and performing column transformation on each plane of the rotated feature map sub-block TinR to obtain a transformed feature map sub-block Tin1 of size n×n×n;
2.2) sequentially performing column transformation and row transformation on each plane of the convolution kernel sub-block Bw of size r×r×r to obtain a transformed convolution kernel sub-block Tw; rotating the convolution kernel sub-block Tw clockwise by 90° so that the data positions in the convolution kernel sub-block Tw are rearranged, obtaining the rotated convolution kernel sub-block TwR; and performing column transformation on each plane of the rotated convolution kernel sub-block TwR to obtain a transformed convolution kernel sub-block Tw1 of size n×n×n;
2.3) performing a dot multiplication operation on the feature map sub-block Tin1 and the convolution kernel sub-block Tw1, which are of exactly the same size, to obtain a dot multiplication result P of size n×n×n;
2.4) sequentially performing column transformation and row transformation on each plane of the dot multiplication result P of size n×n×n to obtain a transformed dot multiplication result Tp; rotating the transformed dot multiplication result Tp clockwise by 90° so that the data positions in the transformed dot multiplication result Tp are rearranged, obtaining the rotated dot multiplication result TpR; and performing column transformation on each plane of the rotated dot multiplication result TpR to obtain a transformed dot multiplication result Tp1 of size m×m×m, which is output as the result of executing the 3D Winograd algorithm.
Preferably, the function expression of the column transformation in the step 2.1) is shown as formula (3), and the function expression of the row transformation is shown as formula (4);
in formula (3), (x0 x1 x2 x3)^T represents a column of the input feature map sub-block to be transformed, and (x0' x1' x2' x3')^T represents the corresponding column of the feature map sub-block after column transformation;
in formula (4), (x0 x1 x2 x3) represents a row of the input feature map sub-block to be transformed, and (x0' x1' x2' x3') represents the corresponding row of the feature map sub-block after row transformation.
Preferably, the functional expression of the column transformation in step 2.2) is as shown in formula (5), and the functional expression of the row transformation is as shown in formula (6);
in formula (5), (w0 w1 w2)^T represents a column of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3')^T represents the corresponding column of the convolution kernel sub-block after column transformation;
in formula (6), (w0 w1 w2) represents a row of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3') represents the corresponding row of the convolution kernel sub-block after row transformation.
Preferably, the functional expression of the column transformation in step 2.4) is as shown in formula (7), and the functional expression of the row transformation is as shown in formula (8);
in formula (7), (m0 m1 m2 m3)^T represents a column of the dot multiplication sub-block to be transformed, and (m0' m1')^T represents the corresponding column of the dot multiplication sub-block after column transformation;
in formula (8), (m0 m1 m2 m3) represents a row of the dot multiplication sub-block to be transformed, and (m0' m1') represents the corresponding row of the dot multiplication sub-block after row transformation.
Preferably, the functional expression of the clockwise 90° rotation is shown in formula (9);
D^R_{i,k,j} ← D_{i,j,k} (9)
in formula (9), D_{i,j,k} denotes an element before the clockwise 90° rotation, D^R denotes the same element after the rotation, and i, j, k are the indices of the element's row, column and depth, respectively.
Preferably, in step 5) the accumulation result Sum is written back to the output feature map buffer Out, and the functional expression of the write-back function is shown in formula (10);
Out[m0][dep+k][(row+i)*C+col+j] = Sum[m0][k][i][j], 0 ≤ i, j, k ≤ m-1. (10)
in formula (10), Out denotes the output feature map buffer; m0 denotes the index of the convolution kernel; dep, row, col respectively denote the index values in the depth, height and width directions of the sub-block in a certain feature map; i, j, k are the row, column and depth indices within the sub-block; and Sum[m0][k][i][j] denotes the pixel at depth k, height i and width j in the accumulation result Sum of the m0-th output feature map.
The invention also provides a 3D CNN accelerating system based on Winograd algorithm, which comprises an IP core and is characterized in that: the IP core is programmed to perform the steps of the aforementioned Winograd algorithm based 3D CNN acceleration method of the present invention.
The invention also provides a 3D CNN acceleration system based on the Winograd algorithm, which comprises off-chip storage and an IP core, and is characterized in that: the IP core comprises on-chip storage and a computation module; the on-chip storage comprises an input buffer, a weight buffer and an output buffer, which are respectively connected with the off-chip storage; the computation module comprises a pooling layer unit POOL, an activation layer unit ReLU and a plurality of processing units PU arranged in parallel; each processing unit PU comprises an accumulation module and a plurality of parallel basic processing units PE, the input ends of the basic processing units PE being simultaneously connected with the input buffer and the weight buffer, and the output ends of the basic processing units PE being connected with the output buffer through the accumulation module, the activation layer unit ReLU and the pooling layer unit POOL; each basic processing unit PE comprises an input feature map transformation array, an input feature map cache, an input weight transformation array, an input weight cache, a dot multiplication module, a dot multiplication result cache and a dot multiplication result transformation array; the input end of the input feature map transformation array is connected with the input buffer, and its output end is connected with one input end of the dot multiplication module through the input feature map cache; the input end of the input weight transformation array is connected with the weight buffer, and its output end is connected with the other input end of the dot multiplication module through the input weight cache; the output end of the dot multiplication module is connected with the input end of the accumulation module through the dot multiplication result cache and the dot multiplication result transformation array; and the input feature map transformation array, the input weight transformation array and the dot multiplication result transformation array each comprise a column transformation module and a row transformation module connected in sequence.
The 3D CNN acceleration method based on the Winograd algorithm of the present invention has the following advantages: it extends the Winograd algorithm, which has been used for 2D CNN acceleration with good results, to 3D CNN computation, so that the computational complexity of the CNN algorithm can be effectively reduced and the computational performance and energy-efficiency ratio of the FPGA-based 3D CNN accelerator can be improved.
Drawings
FIG. 1 is a schematic diagram of a basic process of an embodiment of the present invention.
FIG. 2 is a schematic diagram of the 3D Winograd algorithm in a method according to an embodiment of the present invention.
Fig. 3 is a flowchart of a 3D Winograd algorithm in a method according to an embodiment of the invention.
FIG. 4 shows the pseudocode of a method according to an embodiment of the invention.
Fig. 5 is a schematic structural diagram of a system according to an embodiment of the invention.
Fig. 6 is a schematic structural diagram of a basic processing unit PE of a system according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of a matrix transformation template structure used in row-column transformation according to an embodiment of the present invention.
Fig. 8 is a schematic diagram illustrating the fully connected layer acceleration method of the system according to an embodiment of the present invention.
FIG. 9 is a graph illustrating a comparison of two embodiments of the present invention with different types of accelerators.
Detailed Description
The first embodiment is as follows:
as shown in fig. 1, the implementation steps of the 3D CNN acceleration method based on the Winograd algorithm in this embodiment include:
1) reading a feature map sub-block Bin to be transformed from an input feature map in, and reading a convolution kernel sub-block Bw from the weight cache w;
2) executing the 3D Winograd algorithm on the feature map sub-block Bin and the convolution kernel sub-block Bw to output a result Tp1;
3) accumulating the result Tp1 output by the 3D Winograd algorithm and outputting an accumulation result Sum;
4) judging whether all input feature maps in the input feature map in have been read completely; if not, jumping to step 1); otherwise, jumping to step 5);
5) writing the accumulation result Sum back to the output feature map buffer Out.
In this embodiment, when the feature map sub-block Bin to be transformed is read from the input feature map in in step 1), a Z, R, C, M, N five-loop traversal order is adopted to read the feature map sub-block Bin to be transformed (see fig. 2), where Z, R and C respectively represent the depth, height and width of the output feature map, M represents the number of output feature maps, and N represents the number of input feature maps; the functional expression of the loading function adopted to read the feature map sub-block Bin to be transformed is shown in formula (1), and the functional expression of the loading function adopted to read the convolution kernel sub-block Bw from the weight cache w is shown in formula (2);
in formula (1), Bin[k][j][i] represents the element of the feature map sub-block Bin with read subscripts k, j, i, the size of the feature map sub-block Bin being n×n×n; dep, row, col respectively represent the index values in the depth, height and width directions of the sub-block to be read in a certain feature map; in represents the input feature map in; S represents the sliding stride of the convolution window; r represents the dimension of the convolution kernel; and W represents the width of the input feature map;
Bw[k][j][i] = w[m0][n][k][j*r+i], 0 ≤ i, j, k < r. (2)
in formula (2), Bw[k][j][i] denotes the element of the convolution kernel sub-block Bw with read subscripts k, j, i; w denotes the weight cache; m0 and n denote the indices of the convolution kernel (the weight cache w holds M groups of weights, each group comprising N convolution kernels); and r denotes the dimension of the convolution kernel.
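For reference, a sketch of the two loading functions in plain array terms is given below; the indexing of the weight loader matches formula (2), while the indexing of the feature map loader is an assumption reconstructed from the description above, since formula (1) itself is not reproduced in this text:

```python
import numpy as np

def load_weight(w, m0, n, r=3):
    # Formula (2): Bw[k][j][i] = w[m0][n][k][j*r + i], 0 <= i, j, k < r.
    # w is the weight cache: a 4-D array [M][N][r][r*r] with each kernel plane flattened.
    Bw = np.empty((r, r, r), dtype=w.dtype)
    for k in range(r):
        for j in range(r):
            for i in range(r):
                Bw[k][j][i] = w[m0][n][k][j * r + i]
    return Bw

def load_image(in_maps, n, dep, row, col, W, S=1, tile=4):
    # Assumed form of formula (1): read a tile x tile x tile sub-block Bin (tile = 4 for
    # F(2x2x2, 3x3x3)) whose depth/height/width offsets are dep, row, col (scaled by the
    # stride S) from the n-th input feature map, whose height/width plane is flattened
    # into rows of width W.
    Bin = np.empty((tile, tile, tile), dtype=in_maps.dtype)
    for k in range(tile):
        for j in range(tile):
            for i in range(tile):
                Bin[k][j][i] = in_maps[n][dep * S + k][(row * S + j) * W + col * S + i]
    return Bin
```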
In this embodiment, the detailed steps of step 2) include:
2.1) sequentially performing column transformation and row transformation on each plane of the feature map sub-block Bin of size n×n×n to obtain a transformed feature map sub-block Tin; rotating the feature map sub-block Tin clockwise by 90° so that the data positions in the feature map sub-block Tin are rearranged, obtaining the rotated feature map sub-block TinR; and performing column transformation on each plane of the rotated feature map sub-block TinR to obtain a transformed feature map sub-block Tin1 of size n×n×n;
2.2) sequentially performing column transformation and row transformation on each plane of the convolution kernel sub-block Bw of size r×r×r to obtain a transformed convolution kernel sub-block Tw; rotating the convolution kernel sub-block Tw clockwise by 90° so that the data positions in the convolution kernel sub-block Tw are rearranged, obtaining the rotated convolution kernel sub-block TwR; and performing column transformation on each plane of the rotated convolution kernel sub-block TwR to obtain a transformed convolution kernel sub-block Tw1 of size n×n×n;
2.3) performing a dot multiplication operation on the feature map sub-block Tin1 and the convolution kernel sub-block Tw1, which are of exactly the same size, to obtain a dot multiplication result P of size n×n×n;
2.4) sequentially performing column transformation and row transformation on each plane of the dot multiplication result P of size n×n×n to obtain a transformed dot multiplication result Tp; rotating the transformed dot multiplication result Tp clockwise by 90° so that the data positions in the transformed dot multiplication result Tp are rearranged, obtaining the rotated dot multiplication result TpR; and performing column transformation on each plane of the rotated dot multiplication result TpR to obtain a transformed dot multiplication result Tp1 of size m×m×m, which is output as the result of executing the 3D Winograd algorithm.
In this embodiment, the function expression of the column transformation in step 2.1) is as shown in formula (3), and the function expression of the row transformation is as shown in formula (4);
in formula (3), (x0 x1 x2 x3)^T represents a column of the input feature map sub-block to be transformed, and (x0' x1' x2' x3')^T represents the corresponding column of the feature map sub-block after column transformation;
in formula (4), (x0 x1 x2 x3) represents a row of the input feature map sub-block to be transformed, and (x0' x1' x2' x3') represents the corresponding row of the feature map sub-block after row transformation.
In this embodiment, the function expression for column transformation in step 2.2) is shown in formula (5), and the function expression for row transformation is shown in formula (6);
in formula (5), (w0 w1 w2)^T represents a column of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3')^T represents the corresponding column of the convolution kernel sub-block after column transformation;
in formula (6), (w0 w1 w2) represents a row of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3') represents the corresponding row of the convolution kernel sub-block after row transformation.
In this embodiment, the function expression for column transformation in step 2.4) is shown as formula (7), and the function expression for row transformation is shown as formula (8);
in formula (7), (m0 m1 m2 m3)^T represents a column of the dot multiplication sub-block to be transformed, and (m0' m1')^T represents the corresponding column of the dot multiplication sub-block after column transformation;
in formula (8), (m0 m1 m2 m3) represents a row of the dot multiplication sub-block to be transformed, and (m0' m1') represents the corresponding row of the dot multiplication sub-block after row transformation.
In this embodiment, the functional expression of the clockwise 90° rotation is shown in formula (9);
D^R_{i,k,j} ← D_{i,j,k} (9)
in formula (9), D_{i,j,k} denotes an element before the clockwise 90° rotation, D^R denotes the same element after the rotation, and i, j, k are the indices of the element's row, column and depth, respectively.
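A direct transcription of formula (9) (array axes ordered as row, column, depth, following the index convention stated above):

```python
import numpy as np

def rotate(D):
    # Formula (9): D^R[i, k, j] <- D[i, j, k].  The "clockwise 90-degree rotation"
    # rearranges the data by exchanging the column and depth indices of every element.
    rows, cols, deps = D.shape
    assert rows == cols == deps            # the sub-blocks in the method are cubic
    DR = np.empty_like(D)
    for i in range(rows):
        for j in range(cols):
            for k in range(deps):
                DR[i, k, j] = D[i, j, k]
    return DR                              # equivalent to np.swapaxes(D, 1, 2)
```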
In this embodiment, step 5) writes the accumulation result Sum back to the output feature map buffer Out, and the functional expression of the write-back function is shown in formula (10);
Out[m0][dep+k][(row+i)*C+col+j] = Sum[m0][k][i][j], 0 ≤ i, j, k ≤ m-1. (10)
in formula (10), Out denotes the output feature map buffer; m0 denotes the index of the convolution kernel; dep, row, col respectively denote the index values in the depth, height and width directions of the sub-block in a certain feature map; i, j, k are the row, column and depth indices within the sub-block; and Sum[m0][k][i][j] denotes the pixel at depth k, height i and width j in the accumulation result Sum of the m0-th output feature map.
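A plain transcription of the write-back of formula (10), assuming (as the indexing implies) that Out stores each output feature map with its height/width plane flattened into rows of width C, the output feature map width:

```python
def send(Sum, Out, m0, dep, row, col, C, m=2):
    # Formula (10): Out[m0][dep+k][(row+i)*C + col+j] = Sum[m0][k][i][j], 0 <= i, j, k <= m-1
    for k in range(m):
        for i in range(m):
            for j in range(m):
                Out[m0][dep + k][(row + i) * C + col + j] = Sum[m0][k][i][j]
```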
The computation pattern of a 3D CNN is similar to that of a 2D CNN and mainly comprises the computational loads of the convolutional layers, activation layers, pooling layers and fully connected layers. Among them, the convolutional layers account for more than 90% of the computational load, the remaining part mainly consists of the fully connected layers, and the activation and pooling layers are almost negligible. Winograd is a fast convolution algorithm; in essence, it replaces the multiplications of high computational cost in convolution with additions of low computational cost, so the algorithm can effectively reduce the complexity of convolution computation. Since the main computational load of a CNN is the convolution computation in the convolutional layers, a few prior works have used the 2D Winograd algorithm for CNN acceleration and achieved good results. The 3D CNN acceleration method based on the Winograd algorithm of the present invention extends the algorithm and applies it to 3D CNN computation. In this embodiment, the conventional 3D convolution computation is denoted F(m×m×m, r×r×r), i.e., an image x of size n×n×n is convolved with a convolution kernel w of size r×r×r to obtain a convolution result z of size m×m×m, where n = m + r - 1. In this embodiment, the 3D Winograd algorithm is expressed mathematically as shown in formula (11):
z = (M((XxX^T)^R X^T ⊙ (WwW^T)^R W^T)M^T)^R M^T (11)
in formula (11), z denotes the convolution result of size m×m×m, w denotes the convolution kernel of size r×r×r, x denotes the image of size n×n×n, and M, X, W are constant transformation matrices, with M^T, X^T, W^T denoting the transposes of M, X, W respectively. The concrete values of all matrices are determined by m and r; ⊙ denotes the dot multiplication operation, and R denotes the rotation operation (90° in the clockwise direction). For simplicity, this embodiment mainly takes F(2×2×2, 3×3×3) as an example; when m = 2 and r = 3, the constant matrices can be determined as shown in formula (12):
It should be noted that the 3D CNN acceleration method based on the Winograd algorithm of the present invention is also applicable to other values of m and r. Let x = (x0, x1, x2, x3), where x0, x1, x2, x3 are the four planes of x; XxX^T is defined as shown in formula (13):
XxX^T = X(x0, x1, x2, x3)X^T = (Xx0X^T, Xx1X^T, Xx2X^T, Xx3X^T) (13)
in formula (13), X is a constant transformation matrix and X^T denotes the transpose of X.
Further, assume xi = (xi0, xi1, xi2, xi3), i = 0, 1, 2, 3, where xi0, xi1, xi2, xi3 are the columns of xi; then the functional expression shown in formula (14) holds:
Xxi X^T = X(xi0, xi1, xi2, xi3)X^T = (Xxi0 X^T, Xxi1 X^T, Xxi2 X^T, Xxi3 X^T) (14)
In formula (14), Xxi0 X^T can be carried out in two steps: first computing Xxi0, and then computing (Xxi0)X^T. Both calculations are classical matrix-vector multiplications. It can be seen that the matrix transformations in the 3D Winograd algorithm can be regarded as a series of matrix-vector multiplications (the convolution kernel transformation and the dot multiplication result transformation can be analyzed in the same way).
Further, the dot product operation in the 3D Winograd algorithm is defined by equation (15):
m_{i,j,k} = a_{i,j,k} * b_{i,j,k}, 0 ≤ i, j, k ≤ n-1. (15)
in formula (15), m_{i,j,k}, a_{i,j,k} and b_{i,j,k} respectively denote a point of the dot multiplication result, of (XxX^T)^R X^T and of (WwW^T)^R W^T.
In this embodiment, the graphical representation of the 3D Winograd algorithm flow is shown in fig. 2 and the flowchart is shown in fig. 3; in the figures, the symbols F and R respectively denote the front plane and the right plane of the image or convolution kernel sub-block to be processed, and m is the dot multiplication result of (XxX^T)^R X^T and (WwW^T)^R W^T. As shown in fig. 2 and fig. 3, the 3D Winograd algorithm flow is described as follows:
Algorithm input: a 3D image x (of size n×n×n) and a 3D convolution kernel w (of size r×r×r);
Algorithm output: the convolution result z (of size m×m×m).
The algorithm proceeds as follows:
Step 1: perform column transformation (each column of each plane is left-multiplied by X) and then row transformation (each row of each plane is right-multiplied by X^T) on each plane of the 3D image x (front-view direction, i.e. P1, P2, P3, P4); the output is XxX^T (each plane of size n×n), which serves as the input of step 2.
Step 2: rotate the transformed image clockwise by 90°; the output is (XxX^T)^R, which serves as the input of step 3.
Step 3: perform row transformation (right-multiply by X^T) on each plane of (XxX^T)^R (front view, i.e. P'1, P'2, P'3, P'4); the output is (XxX^T)^R X^T, which serves as the input of step 7.
Step 4: perform column transformation (each column of each plane is left-multiplied by W) and then row transformation (each row of each plane is right-multiplied by W^T) on each plane of the 3D convolution kernel w (front-view direction, i.e. W1, W2, W3, W4); the output is WwW^T, which serves as the input of step 5.
Step 5: rotate the transformed convolution kernel clockwise by 90°; the output is (WwW^T)^R, which serves as the input of step 6.
Step 6: perform row transformation (right-multiply by W^T) on each plane of (WwW^T)^R (front view, i.e. W'1, W'2, W'3, W'4); the output is (WwW^T)^R W^T (each plane of size n×n), which serves as the input of step 7.
Step 7: perform the dot multiplication operation on the transformation results (XxX^T)^R X^T and (WwW^T)^R W^T, i.e. multiply each pixel of the transformed input image by the weight at the corresponding position in the transformed convolution kernel; the output is the dot multiplication result m of size n×n×n, which serves as the input of step 8.
Step 8: apply to the 3D dot multiplication result m the same transformation process as steps 1-3 (or steps 4-6), only replacing the transformation matrices by M and M^T; the output is the final result z (of size m×m×m). Steps 1-3 and steps 4-6 can be executed in parallel.
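The flow above can be checked numerically. The sketch below assumes that the constant matrices X, W, M of formula (12) are the standard Winograd F(2,3) matrices (consistent with the ×1/2 factors mentioned later for the hardware templates), and it uses the mathematically equivalent formulation of applying each 1D transform along all three axes instead of the plane-transform-plus-rotation dataflow:

```python
import numpy as np

# Assumed F(2,3) constant matrices (standard Winograd choice; formula (12) is not
# reproduced in this text).  Bt plays the role of X, G of W, At of M.
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]], float)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def winograd_f222_333(x, w):
    """F(2x2x2, 3x3x3): a 4x4x4 tile x and a 3x3x3 kernel w give a 2x2x2 output tile."""
    tx = np.einsum('ia,jb,kc,abc->ijk', Bt, Bt, Bt, x)    # input transform  (steps 1-3)
    tw = np.einsum('ia,jb,kc,abc->ijk', G, G, G, w)       # kernel transform (steps 4-6)
    p = tx * tw                                           # dot multiplication (step 7)
    return np.einsum('ia,jb,kc,abc->ijk', At, At, At, p)  # output transform (step 8)

# Check against a direct 3D (cross-)correlation of the same tile.
rng = np.random.default_rng(0)
x, w = rng.standard_normal((4, 4, 4)), rng.standard_normal((3, 3, 3))
ref = np.zeros((2, 2, 2))
for i in range(2):
    for j in range(2):
        for k in range(2):
            ref[i, j, k] = np.sum(x[i:i+3, j:j+3, k:k+3] * w)
assert np.allclose(winograd_f222_333(x, w), ref)
```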
In this embodiment, when the 3D Winograd algorithm is applied to the 3D CNN, the 3D Winograd algorithm is mainly applied to the convolution layer calculation in the 3D CNN, and the pseudo code of the algorithm is shown in fig. 4. Referring to fig. 4, the algorithm includes five cycles, where Z, R, C represent the depth, height and width of the output feature maps, respectively, M represents the number of output feature maps (output channels), and N represents the number of input feature maps (input channels). Each function in the code is specifically described below.
Load_Image(in, n, dep, row, col, Bin): feature map loading function. The input parameter in denotes the feature map buffer (the data structure is a three-dimensional array), n denotes the index of the feature map, and dep, row, col respectively denote the index values in the depth, height and width directions of the sub-block to be read in that feature map; the output parameter Bin denotes the read feature map sub-block of size n×n×n. The function reads a 4×4×4 sub-block from the specified location in the n-th feature map buffer. Assuming that the input feature map is of size D×W×H and each convolution kernel is of size r×r×r (Ksize), the function is given by formula (1).
Load_Weight(w, m0, n, Bw): weight loading function. The input parameter w denotes the weight cache (the data structure is a four-dimensional array), the input parameters m0 and n denote the indices of the convolution kernel (there are M groups of weights in total, with N convolution kernels per group), and the output parameter Bw denotes the read weight sub-block. The function reads the n-th convolution kernel of the m0-th group in the weight cache, as given by formula (2).
2D_Trans_X(Bin, Tin): feature map transformation function. The input Bin denotes the feature map sub-block to be transformed, and the output Tin denotes the transformed feature map sub-block. The function performs column and row transformations on each column and each row of Bin; the transformation process is X·Bin·X^T. 2D_Trans_W() and 2D_Trans_M() are matrix transformation functions as well; their transformation processes are identical to that of 2D_Trans_X except that the transformation matrices are W (W^T) and M (M^T), respectively.
Rotate(Tin, TinR): sub-block rotation function, where Tin denotes the sub-block to be rotated and TinR denotes the rotated sub-block. The function rotates Tin clockwise by 90°, which essentially rearranges the positions of the data in the sub-block. Suppose D_{i,j,k} is an element of Tin and D^R is the corresponding element of TinR, with i, j, k being the row, column and depth indices within the sub-block; then the transformation operation of Rotate(Tin, TinR) is given by formula (9).
1D_Trans_X(TinR, Tin1): feature map transformation function. The input TinR denotes the feature map sub-block to be transformed (i.e. the output of 2D_Trans_X(Bin, Tin) after rotation), and the output Tin1 denotes the transformed feature map sub-block. Suppose In is a plane of TinR; the function performs a row transformation or a column transformation on In, the transformation process being In·X^T (row transformation) or X·In (column transformation). 1D_Trans_W() and 1D_Trans_M() are matrix transformation functions as well; their transformation processes are identical to that of 1D_Trans_X except that the transformation matrices are W (W^T) and M (M^T), respectively.
3D_Mul(Tin1, Tw1, P): dot multiplication function. Its inputs are two 3D data blocks Tin1 and Tw1 of identical size, and the output P is their dot product. The function can be described by formula (16):
P_{i,j,k} ← Tin1_{i,j,k} * Tw1_{i,j,k} (16)
in formula (16), P_{i,j,k} is the dot product of Tin1_{i,j,k} and Tw1_{i,j,k}, and i, j, k are the row, column and depth indices of the input and output sub-blocks, respectively.
3D_Accumulate(Tp1, Sum): accumulation function. The input parameter Tp1 is the sub-block to be accumulated and the output Sum is the accumulation result. The accumulation result of each call serves as the input of the next accumulation. The function can be described by formula (17):
Sum_{i,j,k} ← Sum_{i,j,k} + Tp1_{i,j,k} (17)
in formula (17), Sum_{i,j,k} denotes the accumulation result obtained by adding Tp1_{i,j,k} onto Sum_{i,j,k}, and i, j, k are the row, column and depth indices of Tp1 and Sum, respectively.
Send(Sum, m0, dep, row, col, Out): result write-back function, whose input is the accumulation result Sum; according to the index values m0, dep, row and col, it writes the result back to the designated position in the m0-th output feature map in the output buffer Out (the data structure is a three-dimensional array). The function can be described by formula (10).
The embodiment also provides a 3D CNN acceleration system based on the Winograd algorithm, which includes an IP core, and the IP core is programmed to execute the steps of the aforementioned 3D CNN acceleration method based on the Winograd algorithm.
As shown in fig. 5, this embodiment further provides a 3D CNN acceleration system based on the Winograd algorithm, which comprises off-chip storage and an IP core. The IP core comprises on-chip storage and a computation module. The on-chip storage comprises an input buffer, a weight buffer and an output buffer, each connected with the off-chip storage. The computation module comprises a pooling layer unit POOL, an activation layer unit ReLU and a plurality of processing units PU arranged in parallel; each processing unit PU comprises an accumulation module and a plurality of parallel basic processing units PE, whose input ends are simultaneously connected with the input buffer and the weight buffer and whose output ends are connected with the output buffer through the accumulation module, the activation layer unit ReLU and the pooling layer unit POOL. As shown in fig. 6, each basic processing unit PE comprises an input feature map transformation array, an input feature map cache, an input weight transformation array, an input weight cache, a dot multiplication module, a dot multiplication result cache and a dot multiplication result transformation array. The input end of the input feature map transformation array is connected with the input buffer, and its output end is connected with one input end of the dot multiplication module through the input feature map cache; the input end of the input weight transformation array is connected with the weight buffer, and its output end is connected with the other input end of the dot multiplication module through the input weight cache; the output end of the dot multiplication module is connected with the input end of the accumulation module through the dot multiplication result cache and the dot multiplication result transformation array. The input feature map transformation array, the input weight transformation array and the dot multiplication result transformation array each comprise a column transformation module and a row transformation module connected in sequence.
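A structural sketch of this hierarchy (illustrative names only; the real modules are hardware blocks generated through HLS, not software classes):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TransformArray:
    """A column-transformation module followed by a row-transformation module."""
    role: str  # "input feature map", "input weight" or "dot-product result"

@dataclass
class PE:
    """Basic processing unit: three transform arrays and their caches around a dot-multiplication module."""
    feature_array: TransformArray = field(default_factory=lambda: TransformArray("input feature map"))
    weight_array: TransformArray = field(default_factory=lambda: TransformArray("input weight"))
    result_array: TransformArray = field(default_factory=lambda: TransformArray("dot-product result"))

@dataclass
class PU:
    """Processing unit: Ti parallel PEs whose outputs feed one accumulation module."""
    pes: List[PE]

@dataclass
class ComputeModule:
    """To parallel PUs, followed by the ReLU and POOL units, between the on-chip buffers."""
    pus: List[PU]

def build_compute_module(To: int = 32, Ti: int = 4) -> ComputeModule:
    # To = 32, Ti = 4 are the unified design parameters reported in the tests below.
    return ComputeModule(pus=[PU(pes=[PE() for _ in range(Ti)]) for _ in range(To)])
```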
Storage management of data is a difficult point in the construction of the accelerator system. Because the data volumes of the input feature maps and the weights in a 3D CNN network are often large, it is impossible to store all input data on an FPGA chip with limited on-chip storage capacity. Therefore, this embodiment stores the input data and the final output results in the large-capacity off-chip storage. Meanwhile, this embodiment blocks the input and output data along multiple dimensions, as shown in Table 1.
Table 1: blocking coefficient table.
| Parameter | Meaning | Blocking coefficient |
| M | Number of output feature maps | To |
| N | Number of input feature maps | Ti |
| W | Input feature map width | Tw |
| H | Input feature map height | Th |
| D | Input feature map depth | Td |
| C | Output feature map width | Tc |
| R | Output feature map height | Tr |
| Z | Output feature map depth | Tz |
Since a single convolution kernel is relatively small, it is not further blocked in this embodiment. For each computation, the number of input feature values to be read is Ti×Tw×Th×Td, and the number of convolution kernel weights is To×Ti×Ksize (for F(m×m×m, r×r×r), Ksize = r×r×r); assuming that the sliding stride of the convolution kernel is S, the blocking parameters satisfy the relationship shown in formula (18):
in equation (18), S is the sliding step of the convolution kernel, r represents the dimension of the convolution kernel, and the remaining parameters are detailed in table 1.
The number of output feature values written back to the external memory after the computation is completed is To×Tc×Tr×Tz. As shown in fig. 5, in order to improve data reusability, reduce the number of memory accesses and hide the off-chip access latency, this embodiment organizes three caches on chip: the input buffer, the weight buffer and the output buffer, which respectively store the blocked input data, weights and output data. For the input buffer, this embodiment adopts a double-buffering technique so that data prefetching and computation overlap, thereby hiding the data prefetch latency. Similarly, this embodiment also uses a double-buffering technique for the output buffer to hide the latency of data write-back.
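A minimal software analogue of the double-buffering behaviour described above (assumed behaviour; prefetch, compute and write_back are placeholders for the off-chip read, the computation module and the write-back path):

```python
import threading
import queue

def run_blocked_layer(tile_indices, prefetch, compute, write_back):
    """Ping-pong sketch: while the computation works on one tile, the next tile is
    prefetched into the shadow buffer, so the prefetch latency is hidden."""
    buffers = queue.Queue(maxsize=2)          # two buffers per cache -> double buffering

    def producer():
        for t in tile_indices:
            buffers.put(prefetch(t))          # fill the buffer not currently being computed on
        buffers.put(None)                     # end-of-layer marker

    threading.Thread(target=producer, daemon=True).start()
    while (data := buffers.get()) is not None:
        write_back(compute(data))             # overlaps with the prefetch of the next tile
```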
According to the computation process of the 3D CNN combined with the Winograd algorithm, this embodiment constructs a basic processing unit PE to implement the algorithm. Referring to fig. 5 and fig. 6, the PE comprises a dot multiplication module and several groups of transformation arrays. The dot multiplication module performs the dot multiplication operation in the Winograd algorithm, and the transformation arrays are responsible for the various matrix transformation operations in the algorithm. In order to fully exploit the high parallelism of the dot multiplication and improve the computational throughput of the dot multiplication module, this embodiment unrolls the dot multiplication loops (the triple loop over the width, height and depth of the input sub-blocks), i.e. n^3 multipliers are integrated into the dot multiplication module so that n^3 multiplications are performed simultaneously and n^3 results are obtained. The matrix transformation arrays are constructed following a templated approach. Owing to the symmetry of the row and column transformations in the 3D Winograd algorithm, this embodiment finds that both can be implemented with the same template.
Therefore, this embodiment constructs the matrix transformation templates shown in fig. 7(a) to 7(c): the structure of 7(a) corresponds to the column transformation of formula (3) and the row transformation of formula (4) in step 2.1); the structure of 7(b) corresponds to the column transformation of formula (5) and the row transformation of formula (6) in step 2.2); and the structure of 7(c) corresponds to the column transformation of formula (7) and the row transformation of formula (8) in step 2.4). Each transformation template can perform a row transformation or a column transformation on the processed sub-block. Since the transformation matrices contain mostly 1 or -1 entries and many zero entries, this embodiment expands the vector multiply-accumulate operations in the matrix transformation and converts the multiplications into additions or subtractions, while special multiplications such as ×1/2 are replaced by right-shift operations. The templates designed in this embodiment thus use simple adders, subtractors and shifters in place of multipliers with large resource overhead, and can produce a row or column transformation result in one clock cycle, achieving a good balance between resources and performance.
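Assuming the standard F(2,3) matrices for formulas (3)-(8), the three templates reduce to the following add/subtract (and one halving) patterns; in the fixed-point hardware the ×1/2 factor is the right shift mentioned above, shown here as a multiplication for clarity:

```python
def feature_column_transform(x0, x1, x2, x3):
    # assumed content of the fig. 7(a) template (formulas (3)/(4)): adds/subtracts only
    return x0 - x2, x1 + x2, x2 - x1, x1 - x3

def weight_column_transform(w0, w1, w2):
    # assumed content of the fig. 7(b) template (formulas (5)/(6)); the 0.5 factors
    # become right shifts in the fixed-point hardware
    s = w0 + w2
    return w0, 0.5 * (s + w1), 0.5 * (s - w1), w2

def result_column_transform(m0, m1, m2, m3):
    # assumed content of the fig. 7(c) template (formulas (7)/(8))
    return m0 + m1 + m2, m1 - m2 - m3
```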
In order to further improve the parallelism of the matrix transformation process, the present embodiment constructs a matrix transformation array by using the template, and transforms each row and each column of the input image, the convolution kernel and the dot product result.
The computation module is the core of the entire accelerator system. In this embodiment it is described with the High-Level Synthesis (HLS) tool of Xilinx; the "Pragma HLS" keyword denotes the special directives provided by the tool, with which optimizations such as loop unrolling (Unroll) and loop pipelining (Pipeline) can be applied very conveniently. As shown in fig. 5, this embodiment organizes two levels of parallelism in the computation module: To processing units (PUs), and Ti processing elements (PEs) inside each PU. Each PU reads the same input feature map sub-block and a different convolution kernel for computation. The PUs are organized as a systolic array: all input feature map data is read by the leftmost PU in the array and passed to adjacent PUs in sequence. The advantage of this organization is that it effectively relieves the access pressure on the input buffer and changes the interconnection between hardware modules from a centralized to a distributed form, which greatly reduces the complexity of back-end placement and routing and helps raise the accelerator's clock frequency. In addition, the weight cache is organized with a global interconnection, i.e. each PU has an independent channel for accessing the weight cache. By unrolling loop L4, the To PUs can compute simultaneously; by unrolling loop L5, the Ti PE units inside each PU can compute in parallel, thereby improving the throughput of the computation module. In addition, this embodiment applies loop pipelining optimization to L3, so that the processing of adjacent sub-blocks can proceed simultaneously, further reducing the latency of the computation module. Besides the Ti PEs, each PU also integrates an accumulator module for accumulating the Ti temporary results generated simultaneously by its PE units. When all input feature maps have been traversed, the final accumulation result is processed by the activation layer ReLU and the pooling layer POOL computation modules and then sent to the output buffer; the pooling layer computation module can be bypassed according to the specific network configuration. Furthermore, the fully connected layers and the convolutional layers share one hardware acceleration module, so the whole computation module can accelerate the entire 3D CNN network.
After the core computation module is constructed, the next step is to reuse it to support acceleration of the fully connected layers, which improves the utilization efficiency of the computation module and avoids designing a separate set of computation components for the fully connected layers. The computation pattern of a fully connected layer is a matrix-vector multiplication whose inputs are a one-dimensional feature vector and a two-dimensional weight matrix. The one-dimensional feature vector is obtained by flattening the multiple feature maps produced by the preceding computation layer (generally a pooling layer) and concatenating them into one long vector. The result of the fully connected layer is again a one-dimensional feature vector, which serves as the input of the next fully connected layer. Fig. 8 shows the fully connected layer acceleration method proposed in this embodiment. The fully connected layer is computed in a batched manner, i.e. Batchsize input feature vectors are organized into an input feature map, which increases the reuse of weight data and effectively reduces the number of memory accesses. The value of Batchsize equals the size of the sub-block on which F(m×m×m, r×r×r) performs the dot multiplication, i.e. n^3 (n = m + r - 1). The pixel values at the same position in the Batchsize feature vectors are organized into a block of size n^3 and used as the input feature map of each PE (indicated by the dashed box in fig. 8). The computation characteristics of the fully connected layer require the input feature map data of each PE to share the same weight, i.e. they are all multiplied by one and the same weight. Therefore, each weight datum in the weight cache is replicated n^3 times (i.e. 64 times) and then sent to the PEs for computation. The transformation modules in the computation module are bypassed, so the feature map and weight data obtained by each PE directly enter the dot multiplication module for computation. The multiplication results of the PEs inside a PU are accumulated in the local accumulator ACCU. In this way, after N/Ti accumulations, the Ti PEs in each PU can simultaneously calculate 1 final result in the Batchsize output feature maps, and the To PUs can simultaneously obtain To final results in the Batchsize output feature maps. In order to reduce the memory access overhead of the output feature maps, this embodiment caches the computation results of the fully connected layer in on-chip storage, so that input data can be provided directly for the computation of the next fully connected layer.
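A functional sketch of this batched mapping (a software model only; the blocked loops stand in for the To PUs and the Ti PEs, and the arithmetic reproduces the description above):

```python
import numpy as np

def fc_batched(features, weights, Ti=4, To=32):
    """features: (Batchsize, N) with Batchsize = n**3; weights: (M, N).
    Each PE handles one input position for the whole batch against one replicated
    weight; the PU accumulator sums Ti such products, repeated N/Ti times."""
    batch, N = features.shape
    M = weights.shape[0]
    out = np.zeros((batch, M), dtype=features.dtype)
    for m_base in range(0, M, To):                    # groups of To output neurons (one per PU)
        for n_base in range(0, N, Ti):                # Ti input positions per accumulation round
            for pu in range(min(To, M - m_base)):
                acc = np.zeros(batch, dtype=features.dtype)
                for pe in range(min(Ti, N - n_base)):
                    wgt = weights[m_base + pu, n_base + pe]   # one weight, reused by the whole batch
                    acc += features[:, n_base + pe] * wgt
                out[:, m_base + pu] += acc            # local ACCU result added to the output vector
    return out

# Reference: fc_batched(f, w) equals f @ w.T (the plain matrix-vector form of the layer).
```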
Since the accelerator system provided by this embodiment is ultimately implemented in hardware, the overall resource usage of the accelerator needs to be estimated before implementation. Resource estimation reveals in advance whether the hardware platform can accommodate the accelerator logic, so that the accelerator logic can be adjusted appropriately (when the accelerator logic uses less than the platform's resources, logic can be added to improve the computational performance; when the estimated resources exceed the platform's resources, the accelerator logic must be reduced appropriately). The purpose of the performance evaluation is to find an optimal set of design parameters, such as To and Ti, for constructing the accelerator; it can effectively reduce the design space and avoid selecting inefficient design parameters, thereby shortening the accelerator design cycle and maximizing the accelerator performance.
In this embodiment, DSP and BRAM resources are evaluated, and evaluation models thereof are as follows:
DSP_used = To * Ti * f(Iw) * n^3 (19)
Vin = Ti * (S*Tz + K - S) * (S*Tr + K - S) * (S*Tc + K - S) (20)
Vout = To * Tz * Tr * Tc (21)
in formulas (19) to (23), DSP_used denotes the amount of DSP resources used, Vin denotes the buffered input feature map data volume, Vout denotes the buffered output feature map data volume, a further term denotes the buffered weight volume of the fully connected layer, BRAM_used denotes the amount of BRAM resources used, Iw denotes the pipeline processing interval of the computation module, To*Ti denotes the number of basic processing units PE, Bwon denotes the bit width of the on-chip storage units, BRAM_capacity denotes the capacity of a single BRAM, Part denotes the partitioning factor of a cache, with Part_in, Part_out and Part_w denoting the partitioning factors of the input feature map cache, the output feature map cache and the weight cache respectively, N(Bwon) is determined by the corresponding formula, and n^3 is the number of multipliers. This embodiment finds that the DSP usage is related to Iw, and obtains the empirical value f(Iw) ≈ 1/Iw experimentally. The DSPs are all used to implement the dot multiplication modules in the PEs (n^3 multipliers each); the remaining computation modules, such as the matrix transformations, ReLU and POOL, are implemented with adders or comparators and therefore consume no DSPs. Because this embodiment adopts the double-buffering technique, the BRAM usage is doubled. For the convolutional layers the weight data is small, so this embodiment stores it in LUTRAM and consumes no BRAM resources for it; for the fully connected layers the weight data is large, so this embodiment stores it in BRAM. The input and output feature vector caches of the fully connected layer reuse the corresponding caches of the convolutional layer, so no additional storage overhead is introduced.
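A sketch of the DSP and buffered-volume part of this model (formulas (19)-(21); the BRAM expressions of formulas (22)-(23) are not reproduced in this text, and the default tile sizes below are illustrative values rather than parameters taken from the patent):

```python
def estimate_resources(To, Ti, Iw, n=4, S=1, K=3, Tz=8, Tr=8, Tc=8):
    f_Iw = 1.0 / Iw                                    # empirical value f(Iw) ~= 1/Iw
    dsp_used = To * Ti * f_Iw * n ** 3                 # formula (19)
    v_in = Ti * (S * Tz + K - S) * (S * Tr + K - S) * (S * Tc + K - S)   # formula (20)
    v_out = To * Tz * Tr * Tc                          # formula (21)
    return dsp_used, v_in, v_out

# usage: dsp, vin, vout = estimate_resources(To=32, Ti=4, Iw=2)
```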
In order to complete the performance evaluation of the accelerator, this embodiment mainly models the execution time of the accelerator. First, the data transfer times are evaluated as in formulas (24) to (26):
in formulas (24) to (26), the first quantity denotes the weight data volume required by the convolutional layer computation; Data_Width denotes the data bit width; BW_eff denotes the effective bandwidth of the accelerator; Vin and Vout denote the buffered input and output feature map data volumes; and T_trans_i and T_trans_o denote the transfer times of the input and output data, respectively. Next, this embodiment evaluates the total computation time based on formulas (24) to (26), as shown in formulas (27) to (28):
in formulas (27) to (28), T_com denotes the computation time; Tz, Tr and Tc denote the blocking coefficients of the output feature map depth, height and width; Iw denotes the pipeline processing interval of the computation module; m^3 denotes the size of the transformed dot multiplication result (i.e. m×m×m); Freq denotes the operating frequency of the accelerator; T_total denotes the total time, data prefetching included; Z, R and C denote the depth, height and width of the output feature maps; M denotes the number of output feature maps; N denotes the number of input feature maps; To*Ti denotes the number of basic processing units PE; and T_trans_i and T_trans_o denote the transfer times of the input and output data, respectively.
Since this embodiment adopts the double-buffering technique, data prefetching and computation overlap, so the core computation time is determined by the larger of the computation time and the data prefetch time. Therefore, in order to prevent the accelerator from being limited by memory access, this embodiment imposes the constraint on the bandwidth required by the accelerator shown in formula (29);
in formula (29), BW_m denotes the bandwidth required by the accelerator, m^3 denotes the size of the transformed dot multiplication result (i.e. m×m×m), Freq denotes the operating frequency of the accelerator, Vin and Vout denote the buffered input and output feature map data volumes, Tz, Tr and Tc denote the blocking coefficients of the output feature map depth, height and width, and Iw denotes the pipeline processing interval of the computation module. Formula (29) is derived from T_com ≥ T_trans. As can be seen, many factors affect the size of T_total, so searching the design space exhaustively for the optimal parameters would require a large time overhead. Therefore, considering the limited on-chip resources of the FPGA platform, this embodiment imposes the constraint on the design space shown in formula (30);
In the formula (30), DSP_used represents the amount of DSP resources used, BRAM_used represents the amount of BRAM resources used, DSP_total represents the total amount of DSP resources, and BRAM_total represents the total amount of BRAM resources. With this constraint, the size of the design space can be reduced effectively and the time needed to find the optimal solution is shortened.
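Putting the model together, the design-space search that formulas (29) and (30) motivate can be sketched as a brute-force scan over candidate blocking parameters, pruned by the resource budget and scored with T_total = max(T_com, T_trans) from the double-buffered model; the candidate ranges, dictionary keys and the reuse of the helper functions from the sketches above are assumptions for illustration, not the patent's own procedure.

```python
from itertools import product

def explore_design_space(layer, platform, candidates):
    """Scan (To, Ti, Iw) combinations, discard those violating a formula
    (30)-style DSP budget, and keep the configuration with the smallest
    modeled T_total = max(T_com, T_trans).  Relies on dsp_used,
    compute_time and transfer_time from the sketches above."""
    best = None
    for To, Ti, Iw in product(candidates["To"], candidates["Ti"], candidates["Iw"]):
        if dsp_used(To, Ti, layer["n"], Iw) > platform["DSP_total"]:
            continue  # violates the DSP constraint; a BRAM check would go here too
        t_com = compute_time(layer["Z"], layer["R"], layer["C"],
                             layer["M"], layer["N"], To, Ti,
                             layer["m"], Iw, platform["freq"])
        t_trans = transfer_time(layer["Vin"] + layer["Vout"],
                                platform["data_width"], platform["bw_eff"])
        t_total = max(t_com, t_trans)  # double buffering overlaps the two phases
        if best is None or t_total < best[0]:
            best = (t_total, {"To": To, "Ti": Ti, "Iw": Iw})
    return best
```

Applying the resource constraint before the timing evaluation sharply reduces the number of candidates that have to be scored, which is the effect the text attributes to formula (30).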
In this embodiment, the Xilinx HLS tool is used to implement the 3D CNN accelerator, which is encapsulated as an IP core. Around this IP core, the performance of the system on chip is tested with Xilinx Vivado 2016.4. The system on chip mainly comprises one embedded processor core (MicroBlaze), a DDR controller (mig_7series_0) and the CNN accelerator (baseWinograd_0). The processor core configures the accelerator parameters through the M_AXI_DP interface and starts the accelerator. After being started, the accelerator reads data through the DDR controller, performs the calculation, and writes the result back to DDR when the calculation finishes; the whole process requires no CPU intervention. Time statistics are obtained by reading a timer, and the information is printed through a serial port. This embodiment implements the system on chip on the Xilinx VC709 platform. The VC709 platform contains one Virtex-7 690T FPGA chip and two DDR3 chips. The accelerator synthesizes at 150 MHz on the VC709, and DDR3 chips are used on the VC709 platform.
Embodiment two:
the present embodiment is basically the same as the first embodiment; the main difference is that in this embodiment the above system on chip is implemented on the VUS440 platform of S2C. The VUS440 platform is basically the same as the VC709 platform of the first embodiment, the main differences being that the VUS440 platform comprises a Xilinx VCU440 FPGA chip and a DDR4 chip, and that the accelerator synthesizes at 200 MHz on the VUS440 platform.
In order to compare and verify the performance of the first and second embodiments, the C3D network is selected for testing; this 3D CNN model is widely used in the field of video classification. As shown in Table 2, the convolution kernels of the network are all of size 3 × 3 × 3 and all have stride 1, so the network is very well suited to calculation optimization with the Winograd algorithm.
Table 2: C3D network parameters.
In the actual tests, this embodiment does not implement a separate accelerator structure for each convolutional layer size, but tests with a unified accelerator structure whose design parameters are Ti = 4, To = 32. The experimental results are shown in fig. 9: the accelerator reaches a peak performance of 560 GOPS on the VC709 platform and 1112 GOPS on the VUS440 platform. The dotted lines in fig. 8 indicate the theoretical peak performance that an accelerator using the Winograd-based 3D CNN acceleration method can reach on the two platforms, and the solid lines indicate the computational performance predicted by the evaluation model on the two platforms. It can be seen that, on the one hand, the accelerator of this embodiment achieves high computational efficiency (measured performance / peak performance), reaching 80% for the acceleration of CONV-2; on the other hand, the evaluation model of this embodiment predicts the accelerator performance accurately.
In addition, this embodiment is compared against a CPU scheme (Intel E5-2680, optimized with OpenBLAS) and a GPU scheme (NVIDIA K40 accelerator, optimized with cuDNN). The comparison results are shown in Table 3.
Table 3: and comparing the result data table.
Referring to Table 3, with the speed and energy efficiency of the CPU (Intel E5-2680) as the baseline: the GPU accelerator (NVIDIA K40) achieves a speedup of 20 times and an energy efficiency ratio of 9.2 times; on the VC709 platform of the first embodiment, the accelerator using the Winograd-based 3D CNN acceleration method achieves a speedup of 7.3 times and an energy efficiency ratio of 17.1 times; on the VUS440 platform of the second embodiment, it achieves a speedup of 13.4 times and an energy efficiency ratio of 60.3 times. The Winograd-based 3D CNN acceleration method therefore delivers computational performance far higher than the CPU and lower than the GPU, but has a clear advantage over both the CPU and the GPU in power consumption and energy efficiency ratio.
The above description covers only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions falling under the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and refinements made by those skilled in the art without departing from the principle of the present invention are also considered to be within the protection scope of the present invention.
Claims (8)
1. A 3D CNN acceleration method based on the Winograd algorithm, characterized by comprising the following implementation steps:
1) reading a feature map sub-block Bin to be transformed from an input feature map in, and reading a convolution kernel sub-block Bw from a weight cache w;
2) executing the 3D Winograd algorithm on the feature map sub-block Bin and the convolution kernel sub-block Bw to output a result Tp1;
3) accumulating the result Tp1 output by executing the 3D Winograd algorithm and outputting an accumulation result Sum;
4) judging whether all input feature maps in the input feature map in have been read; if not, jumping to execute step 1); otherwise, jumping to execute step 5);
5) writing the accumulation result Sum back to the output feature map buffer Out;
the detailed steps of step 2) comprise:
2.1) sequentially performing column transformation and row transformation on each plane of the feature map sub-block Bin of size n × n × n to obtain the transformed feature map sub-block Tin; rotating the feature map sub-block Tin clockwise by 90° so that the data positions in the feature map sub-block Tin are rearranged, obtaining the rotated feature map sub-block TinR; performing column transformation on each plane of the rotated feature map sub-block TinR to obtain the transformed feature map sub-block Tin1 of size n × n × n;
2.2) sequentially performing column transformation and row transformation on each plane of the convolution kernel sub-block Bw of size r × r × r to obtain the transformed convolution kernel sub-block Tw; rotating the convolution kernel sub-block Tw clockwise by 90° so that the data positions in the convolution kernel sub-block Tw are rearranged, obtaining the rotated convolution kernel sub-block TwR; performing column transformation on each plane of the rotated convolution kernel sub-block TwR to obtain the transformed convolution kernel sub-block Tw1 of size n × n × n;
2.3) performing a dot multiplication operation on the feature map sub-block Tin1 and the convolution kernel sub-block Tw1, which are identical in size, to obtain a dot multiplication result P of size n × n × n;
2.4) sequentially performing column transformation and row transformation on each plane of the dot multiplication result P of size n × n × n to obtain the transformed dot multiplication result Tp; rotating the transformed dot multiplication result Tp clockwise by 90° so that the data positions in Tp are rearranged, obtaining the rotated dot multiplication result TpR; performing column transformation on each plane of the rotated dot multiplication result TpR to obtain the transformed dot multiplication result Tp1 of size m × m × m, which is output as the result Tp1 of executing the 3D Winograd algorithm (a minimal numerical sketch of steps 2.1) to 2.4) is given after the claims).
2. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, characterized in that, when the feature map sub-block Bin to be transformed is read from the input feature map in in step 1), it is read by a five-level nested loop traversal over Z, R, C, M and N, where Z, R, C respectively denote the depth, height and width of the output feature map, M denotes the number of output feature maps, and N denotes the number of input feature maps; the functional expression of the loading function used for reading the feature map sub-block Bin to be transformed is shown in formula (1), and the functional expression of the loading function used for reading the convolution kernel sub-block Bw from the weight cache w is shown in formula (2);
in the formula (1), Bin[k][j][i] denotes the feature map sub-block Bin read with subscripts k, j, i, the size of the feature map sub-block Bin being n × n × n; dep, row, col respectively denote the index values in the depth, height and width directions of the sub-block to be read in a given feature map; in denotes the input feature map in; S denotes the sliding stride of the convolution window; r denotes the dimension of the convolution kernel; and W denotes the width of the input feature map;
Bw[k][j][i]=w[m0][n][k][j*r+i],0≤i,j,k<r. (2)
in the formula (2), Bw[k][j][i] denotes the convolution kernel sub-block Bw read with subscripts k, j, i; w denotes the weight cache; m0 and n denote the indices of the convolution kernel, the weight cache w holding M groups of weights with N convolution kernels in each group; and r denotes the dimension of the convolution kernel.
3. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, wherein a function expression of column transformation in step 2.1) is shown as formula (3), and a function expression of row transformation is shown as formula (4);
in the formula (3), (x0 x1 x2 x3)^T denotes a column of the input feature map sub-block to be transformed, and (x0' x1' x2' x3')^T denotes the corresponding column of the feature map sub-block after column transformation;
in the formula (4), (x0 x1 x2 x3) denotes a row of the input feature map sub-block to be transformed, and (x0' x1' x2' x3') denotes the corresponding row of the feature map sub-block after row transformation.
4. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, wherein the function expression for column transformation in step 2.2) is shown as formula (5), and the function expression for row transformation is shown as formula (6);
in the formula (5), (w0 w1 w2)^T denotes a column of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3')^T denotes the corresponding column of the convolution kernel sub-block after column transformation;
in the formula (6), (w0 w1 w2) denotes a row of the convolution kernel sub-block to be transformed, and (w0' w1' w2' w3') denotes the corresponding row of the convolution kernel sub-block after row transformation.
5. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, wherein the function expression for column transformation in step 2.4) is shown as formula (7), and the function expression for row transformation is shown as formula (8);
in the formula (7), (m0 m1 m2 m3)^T denotes a column of the dot multiplication sub-block to be transformed, and (m0' m1')^T denotes the corresponding column of the dot multiplication sub-block after column transformation;
in the formula (8), (m0 m1 m2 m3) denotes a row of the dot multiplication sub-block to be transformed, and (m0' m1') denotes the corresponding row of the dot multiplication sub-block after row transformation.
6. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, characterized in that the functional expression of the clockwise 90° rotation is shown in formula (9);
DR_{i,k,j} ← D_{i,j,k}    (9)
in the formula (9), D_{i,j,k} is an element before the clockwise 90° rotation, DR_{i,k,j} is the corresponding element after the clockwise 90° rotation, and i, j, k are respectively the row, column and depth indices of the element.
7. The Winograd algorithm-based 3D CNN acceleration method according to claim 1, wherein the functional expression of the write-back function employed by step 5) to write the accumulation result Sum back to the output feature map buffer Out is represented by formula (10);
Out[m0][dep+k][(row+i)*C+col+j]=Sum[m0][k][i][j],0≤i,j,k≤m-1. (10)
in the formula (10), Out denotes the output feature map buffer; m0 denotes the index of the convolution kernel; dep, row, col respectively denote the index values in the depth, height and width directions of the sub-block in a given feature map; i, j, k are the row, column and depth indices within the feature map sub-block; Sum[m0][k][i][j] denotes the element with subscripts k, i, j of the m0-th accumulation result Sum; and C denotes the width of the output feature map.
8. A 3D CNN acceleration system based on the Winograd algorithm, comprising an IP core, characterized in that: the IP core is programmed to execute the steps of the Winograd algorithm-based 3D CNN acceleration method according to any one of claims 1 to 7.
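As a concrete illustration of steps 2.1) to 2.4) of claim 1, the sketch below works the F(2×2×2, 3×3×3) case (n = 4, r = 3, m = 2) in NumPy and checks it against a direct valid 3D correlation. The per-plane column/row transformations plus the clockwise 90° rotation of the claims are collapsed here into three mode products with the standard 1D Winograd matrices B^T, G and A^T; the matrices and the NumPy formulation are an illustrative rendering under these assumptions, not code from the patent.

```python
import numpy as np

# Standard 1D Winograd F(2,3) matrices (n=4, r=3, m=2).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def transform3d(M, X):
    """Apply the 1D transform M along all three axes of tensor X; this is what
    the claim's column transform + row transform + rotate-and-column-transform
    sequence achieves for one sub-block."""
    return np.einsum('ia,jb,kc,abc->ijk', M, M, M, X)

def winograd3d_tile(tile, kernel):
    """One 4x4x4 feature map sub-block and one 3x3x3 kernel -> 2x2x2 output."""
    V = transform3d(BT, tile)    # step 2.1): input transform Tin1
    U = transform3d(G, kernel)   # step 2.2): kernel transform Tw1
    P = U * V                    # step 2.3): element-wise dot multiplication
    return transform3d(AT, P)    # step 2.4): output transform Tp1

def direct3d_tile(tile, kernel):
    """Reference: direct valid 3D correlation of the same 4x4x4 tile."""
    out = np.zeros((2, 2, 2))
    for z in range(2):
        for y in range(2):
            for x in range(2):
                out[z, y, x] = np.sum(tile[z:z+3, y:y+3, x:x+3] * kernel)
    return out

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4, 4))
g = rng.standard_normal((3, 3, 3))
assert np.allclose(winograd3d_tile(d, g), direct3d_tile(d, g))
```

The passing assertion confirms that the three separable transforms reproduce the valid correlation of the 4×4×4 sub-block with the 3×3×3 kernel, i.e. the 2×2×2 contribution that step 3) of claim 1 accumulates over the input feature maps.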
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711342538.0A CN107993186B (en) | 2017-12-14 | 2017-12-14 | 3D CNN acceleration method and system based on Winograd algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711342538.0A CN107993186B (en) | 2017-12-14 | 2017-12-14 | 3D CNN acceleration method and system based on Winograd algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107993186A CN107993186A (en) | 2018-05-04 |
CN107993186B true CN107993186B (en) | 2021-05-25 |
Family
ID=62038616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711342538.0A Active CN107993186B (en) | 2017-12-14 | 2017-12-14 | 3D CNN acceleration method and system based on Winograd algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107993186B (en) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108765247B (en) * | 2018-05-15 | 2023-01-10 | 腾讯科技(深圳)有限公司 | Image processing method, device, storage medium and equipment |
CN110766128A (en) * | 2018-07-26 | 2020-02-07 | 北京深鉴智能科技有限公司 | Convolution calculation unit, calculation method and neural network calculation platform |
US11954573B2 (en) * | 2018-09-06 | 2024-04-09 | Black Sesame Technologies Inc. | Convolutional neural network using adaptive 3D array |
CN109447241B (en) * | 2018-09-29 | 2022-02-22 | 西安交通大学 | Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things |
CN109740731B (en) * | 2018-12-15 | 2023-07-18 | 华南理工大学 | Design method of self-adaptive convolution layer hardware accelerator |
CN109919307B (en) * | 2019-01-28 | 2023-04-07 | 广东浪潮大数据研究有限公司 | FPGA (field programmable Gate array) and depth residual error network implementation method, system and computer medium |
CN109885407B (en) * | 2019-03-05 | 2021-09-21 | 上海商汤智能科技有限公司 | Data processing method and device, electronic equipment and storage medium |
CN110188865B (en) * | 2019-05-21 | 2022-04-26 | 深圳市商汤科技有限公司 | Information processing method and device, electronic equipment and storage medium |
JP7251354B2 (en) * | 2019-06-26 | 2023-04-04 | 富士通株式会社 | Information processing device, information processing program, and information processing method |
CN110516334B (en) * | 2019-08-16 | 2021-12-03 | 浪潮电子信息产业股份有限公司 | Convolution calculation simulation test method and device based on hardware environment and related equipment |
CN112686365B (en) * | 2019-10-18 | 2024-03-29 | 华为技术有限公司 | Method, device and computer equipment for operating neural network model |
CN112765538B (en) * | 2019-11-01 | 2024-03-29 | 中科寒武纪科技股份有限公司 | Data processing method, device, computer equipment and storage medium |
CN110930290B (en) * | 2019-11-13 | 2023-07-07 | 东软睿驰汽车技术(沈阳)有限公司 | Data processing method and device |
CN113033813B (en) * | 2019-12-09 | 2024-04-26 | 中科寒武纪科技股份有限公司 | Data processing method, device, computer equipment and storage medium |
CN111459877B (en) * | 2020-04-02 | 2023-03-24 | 北京工商大学 | Winograd YOLOv2 target detection model method based on FPGA acceleration |
CN111626414B (en) * | 2020-07-30 | 2020-10-27 | 电子科技大学 | Dynamic multi-precision neural network acceleration unit |
CN112862091B (en) * | 2021-01-26 | 2022-09-27 | 合肥工业大学 | Resource multiplexing type neural network hardware accelerating circuit based on quick convolution |
CN113269302A (en) * | 2021-05-11 | 2021-08-17 | 中山大学 | Winograd processing method and system for 2D and 3D convolutional neural networks |
CN113407904B (en) * | 2021-06-09 | 2023-04-07 | 中山大学 | Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network |
CN113592088B (en) * | 2021-07-30 | 2024-05-28 | 中科亿海微电子科技(苏州)有限公司 | Parallelism determination method and system based on fine-granularity convolution computing structure |
CN114003201A (en) * | 2021-10-29 | 2022-02-01 | 浙江大华技术股份有限公司 | Matrix transformation method and device and convolutional neural network accelerator |
CN113835758B (en) * | 2021-11-25 | 2022-04-15 | 之江实验室 | Winograd convolution implementation method based on vector instruction accelerated computation |
CN115906948A (en) * | 2023-03-09 | 2023-04-04 | 浙江芯昇电子技术有限公司 | Full-connection-layer hardware acceleration device and method |
CN116167423B (en) * | 2023-04-23 | 2023-08-11 | 南京南瑞信息通信科技有限公司 | Device and accelerator for realizing CNN convolution layer |
CN116248252B (en) * | 2023-05-10 | 2023-07-14 | 蓝象智联(杭州)科技有限公司 | Data dot multiplication processing method for federal learning |
CN116401502B (en) * | 2023-06-09 | 2023-11-03 | 之江实验室 | Method and device for optimizing Winograd convolution based on NUMA system characteristics |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104881877A (en) * | 2015-06-12 | 2015-09-02 | 哈尔滨工业大学 | Method for detecting image key point based on convolution and time sequence optimization of FPGA |
US20170344876A1 (en) * | 2016-05-31 | 2017-11-30 | Samsung Electronics Co., Ltd. | Efficient sparse parallel winograd-based convolution scheme |
CN107169090A (en) * | 2017-05-12 | 2017-09-15 | 深圳市唯特视科技有限公司 | A kind of special object search method of utilization content rings around information extraction characterization image |
CN107392183B (en) * | 2017-08-22 | 2022-01-04 | 深圳Tcl新技术有限公司 | Face classification recognition method and device and readable storage medium |
2017-12-14: CN application CN201711342538.0A granted as patent CN107993186B, status Active
Non-Patent Citations (1)
Title |
---|
Design of a matrix multiplication accelerator supporting an optimized blocking strategy; Shen Junzhong, Xiao Tao, Qiao Yuran, Yang Qianming, Wen Mei; Computer Engineering & Science; 30 September 2016; Vol. 38, No. 9; pp. 1748-1754 *
Also Published As
Publication number | Publication date |
---|---|
CN107993186A (en) | 2018-05-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107993186B (en) | 3D CNN acceleration method and system based on Winograd algorithm | |
Lu et al. | SpWA: An efficient sparse winograd convolutional neural networks accelerator on FPGAs | |
Liang et al. | Evaluating fast algorithms for convolutional neural networks on FPGAs | |
Liu et al. | Throughput-optimized FPGA accelerator for deep convolutional neural networks | |
Zhang et al. | Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system | |
Shen et al. | Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA | |
CN111291859B (en) | Semiconductor circuit for universal matrix-matrix multiplication data stream accelerator | |
CN106940815B (en) | Programmable convolutional neural network coprocessor IP core | |
Podili et al. | Fast and efficient implementation of convolutional neural networks on FPGA | |
Lu et al. | Evaluating fast algorithms for convolutional neural networks on FPGAs | |
CN108241890B (en) | Reconfigurable neural network acceleration method and architecture | |
JP2024096786A (en) | Utilization of sparsity of input data in neural network calculation unit | |
Wang et al. | PipeCNN: An OpenCL-based FPGA accelerator for large-scale convolution neuron networks | |
CN110851779B (en) | Systolic array architecture for sparse matrix operations | |
CN115221102B (en) | Method for optimizing convolution operation of system-on-chip and related product | |
Shahshahani et al. | Memory optimization techniques for fpga based cnn implementations | |
Tang et al. | EF-train: Enable efficient on-device CNN training on FPGA through data reshaping for online adaptation or personalization | |
Huang et al. | A high performance multi-bit-width booth vector systolic accelerator for NAS optimized deep learning neural networks | |
US20220188613A1 (en) | Sgcnax: a scalable graph convolutional neural network accelerator with workload balancing | |
Wu | Review on FPGA-based accelerators in deep learning | |
Shabani et al. | Hirac: A hierarchical accelerator with sorting-based packing for spgemms in dnn applications | |
CN114003201A (en) | Matrix transformation method and device and convolutional neural network accelerator | |
Pedram et al. | Transforming a linear algebra core to an FFT accelerator | |
Dai et al. | An energy-efficient bit-split-and-combination systolic accelerator for nas-based multi-precision convolution neural networks | |
Akin et al. | FFTs with near-optimal memory access through block data layouts: Algorithm, architecture and design automation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||