CN113918876A - Deconvolution calculation method, hardware accelerator, device and readable storage medium - Google Patents
- Publication number: CN113918876A (application CN202010802119.6A)
- Authority: CN (China)
- Prior art keywords: matrix, deconvolution, size, input, post
- Legal status: Pending (an assumption by Google, not a legal conclusion)
Classifications
- G06F17/153—Multidimensional correlation or convolution (G06F17/15—Correlation function computation including computation of convolution operations)
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06N3/045—Combinations of networks (G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology)
Abstract
The method comprises: obtaining a plurality of input blocks; transforming the data in a deconvolution kernel according to a first pre-matrix to obtain a deconvolution kernel matrix; respectively transforming the data in the plurality of input blocks according to a second pre-matrix to obtain a plurality of input matrices; respectively multiplying the deconvolution kernel matrix with the plurality of input matrices to obtain a plurality of intermediate matrices; and accumulating the data of all layer data matrices in each intermediate matrix by channel to obtain a plurality of accumulation matrices. The data in the plurality of accumulation matrices are then respectively transformed according to a post matrix to obtain a plurality of output blocks, and the output blocks are sequentially arranged into an output feature map, giving the deconvolution calculation result. Because a large number of zeros is never inserted into the original input feature map during calculation, the calculation efficiency is effectively improved.
Description
Technical Field
The present application relates to the field of deep neural network technologies, and in particular, to a method, a hardware accelerator, a device, and a readable storage medium for deconvolution calculation.
Background
When a hardware accelerator performs deconvolution calculation on an input feature map, a large number of zeros is usually inserted between the data of the original input feature map, expanding it into a new input feature map. A window of the same size as the deconvolution kernel then slides over the new input feature map with a preset step size; each time the window slides, the data it covers are multiplied with the data of the deconvolution kernel and summed, and the summation result is output. After the window has slid over all data of the new input feature map, the collected summation results form the final deconvolution calculation result.
Referring to fig. 1, blank cells represent the inserted zeros, black cells represent data of the original input feature map, and gray cells represent the window. The window currently covers only two values from the original input feature map; the rest are zeros. When the data in the window are multiplied with the data in the deconvolution kernel, these zeros turn many of the multiplications into invalid operations, and over the whole deconvolution calculation there is a large number of such invalid operations, which seriously degrades the calculation efficiency of the hardware accelerator.
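As a hedged illustration of this conventional approach (the code is ours, not the patent's; the function name is an assumption, and a plain correlation stands in for the sliding multiply-and-sum), the zero-insertion deconvolution described above can be sketched in NumPy:

```python
import numpy as np

def zero_insert_deconv(x, kernel, s):
    """Conventional deconvolution: insert s-1 zeros between neighboring
    input values, pad, then slide a k x k window with step 1, multiplying
    and summing at every position (most products involve inserted zeros)."""
    n, k = x.shape[0], kernel.shape[0]
    z = np.zeros(((n - 1) * s + 1, (n - 1) * s + 1))
    z[::s, ::s] = x                      # zero-inserted feature map
    z = np.pad(z, k - 1)                 # pad so the window covers every output
    out_size = (n - 1) * s + k
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = np.sum(z[i:i + k, j:j + k] * kernel)
    return out
```

For a 1x1 input with stride 2 and an all-ones 3x3 kernel, every one of the 9 output values equals the single input value, yet each still costs 9 multiplications, which is exactly the waste the figure illustrates.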
Disclosure of Invention
In order to improve the efficiency of deconvolution computation of a hardware accelerator, the present application discloses a method of deconvolution computation, a hardware accelerator, a device, and a readable storage medium according to the following embodiments.
The first aspect of the present application discloses a method for deconvolution calculation, which includes:
acquiring a plurality of input blocks, wherein each input block is the data block covered by a preset sliding window after each slide over an original input feature map, the original input feature map comprises multiple layers of feature maps and, correspondingly, each input block comprises multiple layers of data, and the size of the sliding window is set according to the size of the deconvolution kernel and the deconvolution step size;
transforming the data in the deconvolution kernel according to a first pre-matrix to obtain a deconvolution kernel matrix, wherein the size of the first pre-matrix is preset according to the size of the deconvolution kernel and the deconvolution step size;
respectively transforming the data in the plurality of input blocks according to a second pre-matrix to obtain a plurality of input matrices, wherein each input matrix comprises multiple layers of data matrices, and the size of the second pre-matrix is preset according to the size of the deconvolution kernel and the deconvolution step size;
multiplying the deconvolution kernel matrix with the plurality of input matrices respectively to obtain a plurality of intermediate matrices, wherein each intermediate matrix comprises multiple layers of data matrices;
acquiring a plurality of accumulation matrices, wherein each accumulation matrix is a single-layer data matrix obtained by accumulating the data of all layer data matrices in the corresponding intermediate matrix by channel;
respectively transforming the data in the plurality of accumulation matrices according to a post matrix to obtain a plurality of output blocks, wherein the size of the post matrix is preset according to the size of the deconvolution kernel and the deconvolution step size;
and sequentially arranging the output blocks into an output feature map to obtain the deconvolution calculation result.
Optionally, before obtaining the plurality of input blocks, the method further includes:
obtaining the size of the sliding window by the following formula:
I1 = ⌈(k − 1)/s⌉ + 1;
D = I1 × I1;
wherein D represents the size of the sliding window, k represents the size of the deconvolution kernel, s represents the deconvolution step size, and ⌈·⌉ is the ceiling symbol, which yields the smallest integer not less than the expression it encloses;
obtaining the size of the first pre-matrix by the following formula:
m1=k+s-2;
H=m1×k;
wherein H represents the size of the first pre-matrix;
obtaining the size of the second pre-matrix by the following formula:
P=I1×m1;
wherein P represents the size of the second pre-matrix;
obtaining the size of the post matrix by the following formula:
O1=1×s;
Q=m1×O1;
wherein Q represents the size of the post-matrix.
Optionally, before obtaining the plurality of input blocks, the method further includes:
obtaining the size of the sliding window by the following formula:
I2 = ⌈(k − 1)/s⌉ + 2;
D = I2 × I2;
wherein D represents the size of the sliding window, k represents the size of the deconvolution kernel, s represents the deconvolution step size, and ⌈·⌉ is the ceiling symbol, which yields the smallest integer not less than the expression it encloses;
obtaining the size of the first pre-matrix by the following formula:
m2=k+2×s-2;
H=m2×k;
wherein H represents the size of the first pre-matrix;
obtaining the size of the second pre-matrix by the following formula:
P=I2×m2;
wherein P represents the size of the second pre-matrix;
obtaining the size of the post matrix by the following formula:
O2=2×s;
Q=m2×O2;
wherein Q represents the size of the post-matrix.
Optionally, the transforming the data in the deconvolution kernel according to the first pre-matrix to obtain a deconvolution kernel matrix includes:
sequentially multiplying the first pre-matrix, the deconvolution kernel and a transposed matrix of the first pre-matrix to obtain a deconvolution kernel matrix;
the transforming the data in the plurality of input blocks according to the second pre-matrix to obtain a plurality of input matrices includes:
multiplying the transpose matrix of the second pre-matrix, any one of the input blocks and the second pre-matrix in sequence to obtain a corresponding input matrix;
the transforming the data in the plurality of accumulation matrices according to the post matrix to obtain a plurality of output blocks respectively includes:
and multiplying the transposed matrix of the post matrix, any accumulated matrix and the post matrix in sequence to obtain a corresponding output block.
A second aspect of the present application discloses a hardware accelerator for performing deconvolution calculations, the hardware accelerator being configured to perform the method of deconvolution calculations described in the first aspect of the present application, the hardware accelerator comprising:
the input module comprises a plurality of parallel input channels and is used for inputting the data in the deconvolution kernel matrix into the multiplication module in parallel and inputting each layer of data in the input block into the preprocessing module in sequence, wherein only one input block is input in the single deconvolution calculation process;
the pre-processing module comprises a deconvolution pre-processing unit, the deconvolution pre-processing unit comprises a plurality of deconvolution pre-adders, and the deconvolution pre-adders are used for grouping and adding any layer of data in the acquired input block to acquire an input matrix, and then sequentially outputting each layer of data matrix in the input matrix to the multiplication module;
the multiplication module comprises a plurality of weight multipliers and is used for multiplying the deconvolution kernel matrix with each layer of data matrix in the input matrix respectively to obtain an intermediate matrix and inputting each layer of data matrix of the intermediate matrix into the accumulation module in sequence;
the accumulation module comprises an addition tree and a plurality of registers and is used for accumulating all layer data of the intermediate matrix according to channels to obtain an accumulation matrix and then inputting the accumulation matrix into the post-processing module;
the post-processing module comprises a deconvolution post-processing unit, and the deconvolution post-processing unit comprises a plurality of deconvolution post adders, and is used for grouping and adding the data in the accumulation matrix to obtain an output block;
and the output module comprises a plurality of parallel output channels and is used for outputting the data in the output block in parallel.
Optionally, the input module includes 36 parallel input channels;
the deconvolution preprocessing unit comprises 13 deconvolution pre-adders;
the multiplication module comprises 36 weight multipliers;
the deconvolution post-processing unit comprises 60 deconvolution post adders;
the output module comprises 16 parallel output channels.
Optionally, the preprocessing module further includes a convolution preprocessing unit and a pre-selector; the convolution preprocessing unit comprises a plurality of convolution pre-adders, and the pre-selector is used for selecting, according to the current calculation requirement, whether the output result of the input module is input to the deconvolution preprocessing unit or to the convolution preprocessing unit;
the post-processing module further includes a convolution post-processing unit and a post-selector; the convolution post-processing unit comprises a plurality of convolution post-adders, and the post-selector is used for selecting, according to the current calculation requirement, whether the output result of the accumulation module is input to the deconvolution post-processing unit or to the convolution post-processing unit.
Optionally, the input module includes 36 parallel input channels;
the convolution preprocessing unit comprises 180 convolution pre-adders;
the deconvolution preprocessing unit comprises 13 deconvolution pre-adders;
the multiplication module comprises 36 weight multipliers;
the convolution post-processing unit comprises 100 convolution post-adders;
the deconvolution post-processing unit comprises 60 deconvolution post adders;
the output module comprises 16 parallel output channels.
A third aspect of the present application discloses a computer device comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of deconvolution computation as described in the first aspect of the application when executing the computer program.
A fourth aspect of the present application discloses a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of deconvolution computation according to the first aspect of the present application.
The method comprises: obtaining a plurality of input blocks; transforming the data in a deconvolution kernel according to a first pre-matrix to obtain a deconvolution kernel matrix; respectively transforming the data in the plurality of input blocks according to a second pre-matrix to obtain a plurality of input matrices; respectively multiplying the deconvolution kernel matrix with the plurality of input matrices to obtain a plurality of intermediate matrices; and accumulating the data of all layer data matrices in each intermediate matrix by channel to obtain a plurality of accumulation matrices. The data in the plurality of accumulation matrices are then respectively transformed according to the post matrix to obtain a plurality of output blocks, and the output blocks are sequentially arranged into an output feature map, giving the deconvolution calculation result. Because the method does not insert a large number of zeros into the original input feature map before calculation, there are no invalid operations and the calculation efficiency is effectively improved.
Drawings
To explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; for those skilled in the art, other drawings can be derived from these drawings without creative effort.
FIG. 1 is a schematic diagram of a conventional deconvolution calculation principle;
FIG. 2 is a schematic flowchart illustrating a method for deconvolution calculation according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of a first order deconvolution calculation disclosed in an embodiment of the present application;
FIG. 4 is a schematic flow chart of a second order deconvolution calculation disclosed in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a hardware accelerator for deconvolution computation according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a hardware accelerator for deconvolution computation according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a convolution pre-processing unit and a deconvolution pre-processing unit in a hardware accelerator for deconvolution computation according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a convolution post-processing unit and a deconvolution post-processing unit in a hardware accelerator for deconvolution computation disclosed in an embodiment of the present application.
Detailed Description
In order to improve the efficiency of deconvolution computation of a hardware accelerator, the present application discloses a method of deconvolution computation, a hardware accelerator, a device, and a readable storage medium according to the following embodiments.
A first embodiment of the present application discloses a method for deconvolution calculation, referring to a workflow diagram shown in fig. 2, the method includes:
step S11, obtaining a plurality of input blocks, where the input blocks are data blocks covered by a preset sliding window after each sliding in an original input feature map, where the original input feature map includes multiple layers of feature maps, and correspondingly, each input block includes multiple layers of data, and the size of the sliding window is set according to the size of a deconvolution kernel and a deconvolution step size.
And step S12, transforming the data in the deconvolution kernel according to a first pre-matrix to obtain a deconvolution kernel matrix, wherein the size of the first pre-matrix is preset according to the size of the deconvolution kernel and the deconvolution step size.
Specifically, the first pre-matrix, the deconvolution kernel, and the transposed matrix of the first pre-matrix are sequentially multiplied to transform data in the deconvolution kernel, so as to obtain the deconvolution kernel matrix.
And step S13, respectively transforming the data in the input blocks according to a second pre-matrix to obtain a plurality of input matrices, wherein each input matrix comprises a plurality of layers of data matrices, and the size of the second pre-matrix is preset according to the size of the deconvolution kernel and the deconvolution step size.
Specifically, the transposed matrix of the second pre-matrix, any one of the input blocks, and the second pre-matrix are sequentially multiplied to transform data in the input block, and a corresponding input matrix is obtained.
Step S14, multiplying the deconvolution kernel matrix with the plurality of input matrices respectively to obtain a plurality of intermediate matrices, where each intermediate matrix includes a plurality of layers of data matrices.
Step S15, obtaining a plurality of accumulation matrices, where any one of the accumulation matrices is a single-layer data matrix obtained by accumulating data of all layer data matrices in the intermediate matrix corresponding to the accumulation matrix according to channels.
And step S16, respectively transforming the data in the plurality of accumulation matrices according to a post matrix to obtain a plurality of output blocks, wherein the size of the post matrix is preset according to the size of the deconvolution kernel and the deconvolution step size.
Specifically, the transposed matrix of the post matrix, any one of the accumulation matrices, and the post matrix are multiplied in sequence to realize the transformation of data in the accumulation matrix and obtain the corresponding output block.
And step S17, arranging the output blocks in sequence into an output feature map to obtain the deconvolution calculation result.
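Steps S12 through S16 for a single multi-channel input block can be sketched as follows. This is our illustrative code, not the patent's: H, P and Q below are random placeholders of the correct sizes (the actual element values appear as figures in the original publication), so only the shapes and dataflow are meaningful.

```python
import numpy as np

def deconv_tile_multichannel(g, d, H, P, Q):
    """Steps S12-S16 for one input block with C channels: per channel,
    transform the kernel layer (H g_c H^T) and the block layer (P^T d_c P),
    multiply element-wise (S14), accumulate over channels (S15), and apply
    the post matrix (S16) to obtain a single-layer output block."""
    acc = np.zeros((H.shape[0], H.shape[0]))
    for gc, dc in zip(g, d):                    # g: (C, k, k), d: (C, I1, I1)
        acc += (H @ gc @ H.T) * (P.T @ dc @ P)  # S14, summed per S15
    return Q.T @ acc @ Q                        # S16: post-transform

# Placeholder matrices with the sizes used by T[(1x2)^2, 3^2] and C = 4.
rng = np.random.default_rng(0)
H, P, Q = rng.standard_normal((3, 3)), rng.standard_normal((2, 3)), rng.standard_normal((3, 2))
g, d = rng.standard_normal((4, 3, 3)), rng.standard_normal((4, 2, 2))
assert deconv_tile_multichannel(g, d, H, P, Q).shape == (2, 2)
```

Because every transform is linear, accumulating before the post-transform (as the hardware's adder tree does) gives the same result as post-transforming each channel separately and summing.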
In an actual calculation process there may be a plurality of deconvolution kernels, each comprising multiple layers of data. The deconvolution calculation method disclosed in the above embodiment covers the calculation between one deconvolution kernel and a plurality of input blocks, and the resulting output feature map is a single layer; after the calculation is completed for all deconvolution kernels according to the above method, a multi-layer output feature map is obtained.
This embodiment discloses a deconvolution calculation method: a plurality of input blocks is obtained; the data in the deconvolution kernel are transformed according to a first pre-matrix to obtain a deconvolution kernel matrix; the data in the plurality of input blocks are transformed according to a second pre-matrix to obtain a plurality of input matrices; the deconvolution kernel matrix is multiplied with the plurality of input matrices to obtain a plurality of intermediate matrices; and the data of all layer data matrices in each intermediate matrix are accumulated by channel to obtain a plurality of accumulation matrices. The data in the accumulation matrices are then transformed according to the post matrix to obtain a plurality of output blocks, and the output blocks are arranged in sequence into an output feature map, giving the deconvolution calculation result. Because no zeros are inserted into the original input feature map before calculation, there are no invalid operations and the calculation efficiency is effectively improved.
The specific operation process of the deconvolution calculation method disclosed in the first embodiment of the present application can be represented by the following formula:
Y = Qᵀ[(H·g·Hᵀ) ⊙ (Pᵀ·d·P)]·Q
where Y is the final output result, i.e. the output block, of size O1×O1; d is an input block of size I1×I1; g is the deconvolution kernel, of size k×k; ⊙ denotes element-wise multiplication of matrices, and the two matrices multiplied in this way are each of size m1×m1; H is the first pre-matrix, of size m1×k, and Hᵀ is its transpose, of size k×m1; P is the second pre-matrix, of size I1×m1, and Pᵀ is its transpose, of size m1×I1; Q is the post matrix, of size m1×O1, and Qᵀ is its transpose, of size O1×m1.
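The dimension bookkeeping of this formula can be checked directly. The matrices below are random placeholders of the stated sizes (the actual element values are given as figures in the original), so only the shapes are meaningful:

```python
import numpy as np

def transform_deconv(g, d, H, P, Q):
    """Y = Q^T [ (H g H^T) (.) (P^T d P) ] Q, with (.) the element-wise product."""
    return Q.T @ ((H @ g @ H.T) * (P.T @ d @ P)) @ Q

# T[(1x2)^2, 3^2]: k = 3, s = 2, so m1 = 3, I1 = 2, O1 = 2.
rng = np.random.default_rng(1)
H = rng.standard_normal((3, 3))   # m1 x k
P = rng.standard_normal((2, 3))   # I1 x m1
Q = rng.standard_normal((3, 2))   # m1 x O1
Y = transform_deconv(rng.standard_normal((3, 3)), rng.standard_normal((2, 2)), H, P, Q)
assert Y.shape == (2, 2)          # O1 x O1
```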
For different operation requirements, the deconvolution calculation method disclosed in the first embodiment of the present application includes a first-order deconvolution calculation method and a second-order deconvolution calculation method. The calculation steps of the two methods are the same, except that the parameters (the size of the sliding window and the sliding step size, the first pre-matrix, the second pre-matrix and the post-matrix) used in the calculation process are different.
In one embodiment, the first-order deconvolution calculation method may be denoted T[(1×s)², k²], where the sliding step of the sliding window is 1, s is the deconvolution step size, and k is the size of the deconvolution kernel. For example, T[(1×2)², 3²] denotes a calculation in which the deconvolution step size is 2 and the deconvolution kernel is 3×3.
The first order deconvolution calculation method further includes, before obtaining the plurality of input blocks:
obtaining the size of the sliding window by the following formula:
I1 = ⌈(k − 1)/s⌉ + 1.
D = I1 × I1.
where D represents the size of the sliding window (equal to the size of the input block), k represents the size of the deconvolution kernel, s represents the deconvolution step size, and ⌈·⌉ is the ceiling symbol, which yields the smallest integer not less than the expression it encloses.
Obtaining the size of the first pre-matrix by the following formula:
m1=k+s-2。
H=m1×k。
wherein H represents the size of the first pre-matrix.
Obtaining the size of the second pre-matrix by the following formula:
P=I1×m1。
wherein P represents a size of the second pre-matrix.
Obtaining the size of the post matrix by the following formula:
O1=1×s。
Q=m1×O1。
wherein Q represents the size of the post-matrix.
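The size relations above can be collected into one helper. This is our illustration, not part of the patent; the ceiling expression for I1 is reconstructed from the worked examples that follow (k = 3 and k = 5, both with s = 2), since it is not legible in the source text:

```python
import math

def first_order_sizes(k, s):
    """Tile and matrix sizes for the first-order method T[(1*s)^2, k^2]."""
    I1 = math.ceil((k - 1) / s) + 1   # input block / sliding window: I1 x I1
    m1 = k + s - 2                    # element-wise product: m1 x m1
    O1 = 1 * s                        # output block: O1 x O1
    return {"D": (I1, I1), "H": (m1, k), "P": (I1, m1), "Q": (m1, O1)}
```

first_order_sizes(3, 2) reproduces the 3×3 first pre-matrix, 2×3 second pre-matrix and 3×2 post matrix of the k = 3, s = 2 example, and first_order_sizes(5, 2) reproduces the k = 5 example.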
It should be noted that in the first-order deconvolution calculation, the first pre-matrix transforms the data in the deconvolution kernel according to a transformation rule preset according to actual needs; therefore, with the size of the first pre-matrix fixed, a person skilled in the art may choose the value of each element of the first pre-matrix. Likewise, the values of the elements in the second pre-matrix and the post matrix may be preset in advance.
As an example, if the deconvolution step size is 2 (s = 2) and the size of the deconvolution kernel is 3×3 (k = 3), i.e. for the T[(1×2)², 3²] calculation, the size of the first pre-matrix is 3×3, the size of the second pre-matrix is 2×3, and the size of the post matrix is 3×2; the element values of the three matrices may be set accordingly in advance.
If the deconvolution step size is 2 (s = 2) and the size of the deconvolution kernel is 5×5 (k = 5), i.e. for the T[(1×2)², 5²] calculation, the size of the first pre-matrix is 5×5, the size of the second pre-matrix is 3×5, and the size of the post matrix is 5×2; again, the element values of the three matrices may be set accordingly in advance.
By way of example, FIG. 3 depicts the process of a first-order deconvolution calculation, namely T[(1×2)², 3²]. The sliding step of the sliding window is 1, i.e. the moving step (Si1) between adjacent input blocks is 1; the size of the sliding window is 2×2, and the input block obtained by each slide is likewise 2×2. The whole deconvolution calculation requires multiple passes, each pass handling a single input block, and the resulting output blocks are arranged in order to obtain the output feature map.
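The block extraction of step S11 can be sketched as follows; the function name is ours, and the example sizes match the first-order and second-order configurations described in the text:

```python
import numpy as np

def tile_input(x, window, step):
    """Step S11: slide a window x window block over the input feature map
    with the given step and collect the covered input blocks."""
    blocks = []
    for i in range(0, x.shape[0] - window + 1, step):
        for j in range(0, x.shape[1] - window + 1, step):
            blocks.append(x[i:i + window, j:j + window])
    return blocks
```

For a 4×4 single-layer feature map with a 2×2 window and step 1 (the first-order case of FIG. 3), this yields 9 overlapping input blocks; with a 3×3 window and step 2 on a 5×5 map (the second-order case), it yields 4 blocks.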
In another embodiment, the second-order deconvolution calculation method may be denoted T[(2×s)², k²], where the sliding step of the sliding window is 2, s is the deconvolution step size, and k is the size of the deconvolution kernel. For example, T[(2×2)², 3²] denotes a calculation in which the deconvolution step size is 2 and the deconvolution kernel is 3×3.
The second-order deconvolution calculation method further includes, before obtaining the plurality of input blocks:
obtaining the size of the sliding window by the following formula:
I2 = ⌈(k − 1)/s⌉ + 2.
D = I2 × I2.
where D represents the size of the sliding window (equal to the size of the input block), k represents the size of the deconvolution kernel, s represents the deconvolution step size, and ⌈·⌉ is the ceiling symbol, which yields the smallest integer not less than the expression it encloses.
Obtaining the size of the first pre-matrix by the following formula:
m2=k+2×s-2。
H=m2×k。
wherein H represents the size of the first pre-matrix.
Obtaining the size of the second pre-matrix by the following formula:
P=I2×m2。
wherein P represents a size of the second pre-matrix.
Obtaining the size of the post matrix by the following formula:
O2=2×s。
Q=m2×O2。
wherein Q represents the size of the post-matrix.
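As for the first-order case, the second-order size relations can be collected into one helper. The code is ours; the ceiling expression for I2 is reconstructed from the k = 3, s = 2 example below, since it is not legible in the source text:

```python
import math

def second_order_sizes(k, s):
    """Tile and matrix sizes for the second-order method T[(2*s)^2, k^2]."""
    I2 = math.ceil((k - 1) / s) + 2   # input block / sliding window: I2 x I2
    m2 = k + 2 * s - 2                # element-wise product: m2 x m2
    O2 = 2 * s                        # output block: O2 x O2
    return {"D": (I2, I2), "H": (m2, k), "P": (I2, m2), "Q": (m2, O2)}
```

second_order_sizes(3, 2) reproduces the 5×3 first pre-matrix, 3×5 second pre-matrix and 5×4 post matrix of the k = 3, s = 2 example.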
It should be noted that, as in the first-order case, in the second-order deconvolution calculation the first pre-matrix transforms the data in the deconvolution kernel according to a transformation rule preset according to actual needs; therefore, with the size of the first pre-matrix fixed, a person skilled in the art may choose the value of each element of the first pre-matrix. Likewise, the values of the elements in the second pre-matrix and the post matrix may be preset in advance.
As an example, if the deconvolution step size is 2 (s = 2) and the size of the deconvolution kernel is 3×3 (k = 3), i.e. for the T[(2×2)², 3²] calculation, the size of the first pre-matrix is 5×3, the size of the second pre-matrix is 3×5, and the size of the post matrix is 5×4; the element values of the three matrices may be set accordingly in advance.
By way of example, FIG. 4 depicts the process of a second-order deconvolution calculation, namely T[(2×2)², 3²]. The sliding step of the sliding window is 2, i.e. the moving step (Si2) between adjacent input blocks is 2; the size of the sliding window is 3×3, and the input block obtained by each slide is likewise 3×3. The whole deconvolution calculation requires multiple passes, each pass handling a single input block, and the resulting output blocks are arranged in order to obtain the output feature map.
Experiments show that in the T[(1×2)², 3²] calculation, 9 (3×3) multiplications on the 4 data of one input block yield the 4 output values of one output block. In the T[(2×2)², 3²] calculation, 25 (5×5) multiplications on the 9 data of one input block yield the 16 output values of one output block. With the existing zero-insertion method, 9 multiplications on the 9 data of one input block yield only 1 output value.
If 16 output values are desired, the T[(1×2)², 3²] calculation method requires 4 × 9 = 36 multiplications in total, the T[(2×2)², 3²] calculation method requires 25 multiplications in total, whereas the existing zero-insertion approach requires 16 × 9 = 144 multiplications. The deconvolution calculation method disclosed in this embodiment therefore greatly improves calculation efficiency.
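The multiplication counts quoted above follow from simple arithmetic; a minimal check (the variable names are ours, not the patent's):

```python
# Multiplications needed to produce 16 output values with k = 3, s = 2:
mults_first_order = 4 * (3 * 3)      # T[(1x2)^2, 3^2]: 4 output blocks x 9 products
mults_second_order = 1 * (5 * 5)     # T[(2x2)^2, 3^2]: 1 output block x 25 products
mults_zero_insertion = 16 * (3 * 3)  # zero insertion: 9 products per output value
print(mults_first_order, mults_second_order, mults_zero_insertion)  # 36 25 144
```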
A second embodiment of the present application discloses a hardware accelerator for deconvolution calculation, where the hardware accelerator is configured to execute the method for deconvolution calculation according to the first embodiment of the present application, and with reference to the structural diagram shown in fig. 5, the hardware accelerator includes:
The input module comprises a plurality of parallel input channels and is used for inputting the data in the deconvolution kernel matrix into the multiplication module in parallel and inputting each layer of data in the input block into the pre-processing module in sequence, wherein only one input block is input in a single deconvolution calculation process.
In one implementation, the hardware accelerator is further provided with a deconvolution kernel processing module, which is configured to transform data in a deconvolution kernel to obtain a deconvolution kernel matrix, and then input the data in the deconvolution kernel matrix to the multiplication module through the input module.
In another implementation, the hardware accelerator is not provided with a deconvolution kernel processing module; instead, the module is arranged outside the hardware accelerator, the data in the deconvolution kernel is transformed in advance, and the resulting deconvolution kernel matrix data is stored in the multiplication module in advance through the input module.
The pre-processing module comprises a deconvolution pre-processing unit, and the deconvolution pre-processing unit comprises a plurality of deconvolution pre-adders, and is used for grouping and adding any layer of data in the acquired input block to obtain an input matrix, and then sequentially outputting each layer of data matrix in the input matrix to the multiplication module.
The multiplication module comprises a plurality of weight multipliers, and is used for multiplying the deconvolution kernel matrix element by element with each layer of data matrix in the input matrix to obtain an intermediate matrix, and sequentially inputting each layer of data matrix of the intermediate matrix into the accumulation module.
In the actual calculation process, the deconvolution kernel matrix comprises multiple layers of data, and the multiplication module multiplies each layer of data in the deconvolution kernel matrix with the corresponding layer of data in the input matrix one by one to obtain the intermediate matrix.
The accumulation module comprises an addition tree and a plurality of registers, and is used for accumulating all the layer data of the intermediate matrix channel by channel to obtain an accumulation matrix, and then outputting the accumulation matrix to the post-processing module.
The post-processing module comprises a deconvolution post-processing unit, and the deconvolution post-processing unit comprises a plurality of deconvolution post-adders, and is used for grouping and adding the data in the accumulation matrix to obtain an output block.
The output module comprises a plurality of parallel output channels and is used for outputting the data in the output block in parallel.
Further, the input module comprises 36 parallel input channels for inputting the 36 data of the input block and the 36 data of the deconvolution kernel matrix in parallel.
The deconvolution preprocessing unit comprises 13 deconvolution pre-adders, and is used for grouping and shifting or adding 36 input data input in parallel according to a certain rule.
The multiplication module comprises 36 weight multipliers, and is used for multiplying the 36 output data of the preprocessing module and the data of the 36 deconvolution kernel matrixes according to a certain sequence.
The deconvolution post-processing unit comprises 60 deconvolution post adders used for grouping and shifting or adding the obtained 36 data according to a certain rule.
The output module comprises 16 parallel output channels and is used for outputting 16 output data of the deconvolution post-processing unit in parallel.
In practical applications, a person skilled in the art may set the number of adders or weight multipliers in each module according to actual needs, and set the connection relationship between the adders and the weight multipliers, which is not described herein again.
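As a minimal sketch of the accumulation module's role (hypothetical function name; a real addition tree sums in logarithmic depth across registers, whereas this loop sums sequentially — the numerical result is the same):

```python
def adder_tree_accumulate(intermediate):
    """Accumulate the per-channel 5x5 intermediate matrices into a
    single accumulation matrix, channel by channel."""
    acc = [[0.0] * 5 for _ in range(5)]
    for channel in intermediate:          # one 5x5 matrix per channel
        for r in range(5):
            for c in range(5):
                acc[r][c] += channel[r][c]
    return acc

# Three channels of all-ones accumulate element-wise to 3.0
channels = [[[1.0] * 5 for _ in range(5)] for _ in range(3)]
print(adder_tree_accumulate(channels)[0][0])  # 3.0
```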
In one embodiment, the application discloses a reconfigurable hardware accelerator, which can perform both convolution and deconvolution calculations and can flexibly select between the two operations according to the application scenario.
Referring to fig. 6, the pre-processing module further includes a convolution pre-processing unit and a pre-selector. The convolution pre-processing unit comprises a plurality of convolution pre-adders, and the pre-selector is used for selecting, according to the current calculation requirement, whether the output result of the input module is input to the deconvolution pre-processing unit or the convolution pre-processing unit.
The post-processing module further includes a convolution post-processing unit and a post-selector. The convolution post-processing unit comprises a plurality of convolution post-adders, and the post-selector is used for selecting, according to the current calculation requirement, whether the output result of the accumulation module is input to the deconvolution post-processing unit or the convolution post-processing unit.
Further, the input module comprises 36 parallel input channels. The output module comprises 16 parallel output channels. The multiplication module comprises 36 weight multipliers.
Referring to fig. 7, the convolution pre-processing unit includes 180 convolution pre-adders. The deconvolution pre-processing unit includes 13 deconvolution pre-adders.
Referring to fig. 8, the convolution post-processing unit includes 100 convolution post-adders. The deconvolution post-processing unit comprises 60 deconvolution post adders.
The following describes the workflow of the hardware accelerator disclosed in this embodiment with reference to a simple example:
The input module inputs the 36 pre-calculated deconvolution kernel matrix data to the multiplication module. The pre-selector selects convolution or deconvolution calculation according to the configuration: for convolution calculation, the input module sends the 36 input data to the convolution pre-processing unit, and 0 is input to the deconvolution pre-processing unit; for deconvolution calculation, the input module sends the 36 input data to the deconvolution pre-processing unit, and 0 is input to the convolution pre-processing unit.
In the pre-processing module, the convolution pre-processing unit divides the 36 input data into 6 groups of 6 and applies certain additions or shifts to each group, while the deconvolution pre-processing unit takes the first nine input data and applies certain additions. After both units have computed their results, the corresponding unit sends its output data to the multiplication module: for convolution calculation, the data output by the convolution pre-processing unit is sent; for deconvolution calculation, the data output by the deconvolution pre-processing unit is sent.
In the multiplication module, the data sent by the pre-processing module and the pre-calculated deconvolution kernel matrix data are multiplied one by one, and the result is sent to the post-processing module. The post-selector selects convolution or deconvolution calculation according to the configuration: for convolution calculation, the result is sent to the convolution post-processing unit and 0 is input to the deconvolution post-processing unit; for deconvolution calculation, the result is sent to the deconvolution post-processing unit and 0 is input to the convolution post-processing unit.
In the post-processing module, the convolution post-processing unit divides the data sent by the multiplication module into 6 groups of 6 and applies certain additions or shifts, while the deconvolution post-processing unit divides the data sent by the multiplication module into 4 groups of 9 and applies certain additions. After both units have computed their final results, the corresponding unit outputs its calculation result: for convolution calculation, the result of the convolution post-processing unit is output; for deconvolution calculation, the result of the deconvolution post-processing unit is output.
In the output module, the output module outputs 16 output results in parallel at a time.
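The selector behaviour in the workflow above can be sketched as follows (hypothetical function; the two pre-processing units are stood in for by plain callables):

```python
def preprocess_select(mode, inputs, conv_unit, deconv_unit):
    """Route the parallel inputs to the selected pre-processing unit
    and feed zeros to the other, mirroring the pre-selector."""
    zeros = [0] * len(inputs)
    if mode == "conv":
        return conv_unit(inputs), deconv_unit(zeros)
    return conv_unit(zeros), deconv_unit(inputs)

# With identity units, only the selected path carries the data
data = list(range(36))
conv_out, deconv_out = preprocess_select("deconv", data,
                                         lambda x: x, lambda x: x)
print(sum(conv_out), sum(deconv_out))  # 0 630
```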
A third embodiment of the present application discloses a computer device, comprising:
a memory for storing a computer program.
A processor for implementing the steps of the method of deconvolution computation as described in the first embodiment of the present application when executing said computer program.
A fourth embodiment of the present application discloses a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of deconvolution computation as described in the first embodiment of the present application.
The present application has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the presently disclosed embodiments and implementations thereof without departing from the spirit and scope of the present disclosure, and these fall within the scope of the present disclosure. The protection scope of this application is subject to the appended claims.
Claims (10)
1. A method of deconvolution computation, comprising:
acquiring a plurality of input blocks, wherein the input blocks are data blocks covered by a preset sliding window after sliding in an original input feature map each time, the original input feature map comprises a plurality of layers of feature maps, correspondingly, each input block comprises a plurality of layers of data, and the size of the sliding window is set according to the size of a deconvolution kernel and the deconvolution step length;
transforming the data in the deconvolution kernel according to a first pre-matrix to obtain a deconvolution kernel matrix, wherein the size of the first pre-matrix is preset according to the size of the deconvolution kernel and the step length of deconvolution;
respectively transforming the data in the plurality of input blocks according to a second pre-matrix to obtain a plurality of input matrices, wherein each input matrix comprises a plurality of layers of data matrices, and the size of the second pre-matrix is preset according to the size of the deconvolution kernel and the step length of deconvolution;
multiplying the deconvolution kernel matrix with the plurality of input matrices respectively to obtain a plurality of intermediate matrices, wherein each intermediate matrix comprises a plurality of layers of data matrices;
acquiring a plurality of accumulation matrixes, wherein any one of the accumulation matrixes is a single-layer data matrix obtained by accumulating data of all layer data matrixes in the corresponding intermediate matrix according to channels;
respectively transforming the data in the plurality of accumulation matrixes according to a post matrix to obtain a plurality of output blocks, wherein the size of the post matrix is preset according to the size of the deconvolution kernel and the step length of deconvolution;
and sequentially arranging the output blocks into an output characteristic diagram to obtain a deconvolution calculation result.
2. The method of deconvolution computation of claim 1, wherein prior to obtaining the plurality of input blocks, the method further comprises:
obtaining the size of the sliding window by the following formula:
D = I1 × I1, where I1 = ⌈(k + s − 2)/s⌉;
wherein D represents the size of the sliding window, k represents the size of the deconvolution kernel, s represents the step size of the deconvolution, and ⌈·⌉ is the upper rounding symbol, used for obtaining the smallest integer not less than the calculation result inside the symbol;
obtaining the size of the first pre-matrix by the following formula:
m1=k+s-2;
H=m1×k;
wherein H represents the size of the first pre-matrix;
obtaining the size of the second pre-matrix by the following formula:
P=I1×m1;
wherein P represents the size of the second pre-matrix;
obtaining the size of the post matrix by the following formula:
O1=1×s;
Q=m1×O1;
wherein Q represents the size of the post-matrix.
3. The method of deconvolution computation of claim 1, wherein prior to obtaining the plurality of input blocks, the method further comprises:
obtaining the size of the sliding window by the following formula:
D = I2 × I2, where I2 = ⌈(k + 2 × s − 2)/s⌉;
wherein D represents the size of the sliding window, k represents the size of the deconvolution kernel, s represents the step size of the deconvolution, and ⌈·⌉ is the upper rounding symbol, used for obtaining the smallest integer not less than the calculation result inside the symbol;
obtaining the size of the first pre-matrix by the following formula:
m2=k+2×s-2;
H=m2×k;
wherein H represents the size of the first pre-matrix;
obtaining the size of the second pre-matrix by the following formula:
P=I2×m2;
wherein P represents the size of the second pre-matrix;
obtaining the size of the post matrix by the following formula:
O2=2×s;
Q=m2×O2;
wherein Q represents the size of the post-matrix.
4. The method of deconvolution computation of claim 2 or 3, wherein said transforming the data in the deconvolution kernel according to the first pre-matrix to obtain a deconvolution kernel matrix comprises:
sequentially multiplying the first pre-matrix, the deconvolution kernel and a transposed matrix of the first pre-matrix to obtain a deconvolution kernel matrix;
the transforming the data in the plurality of input blocks according to the second pre-matrix to obtain a plurality of input matrices includes:
multiplying the transpose matrix of the second pre-matrix, any one of the input blocks and the second pre-matrix in sequence to obtain a corresponding input matrix;
the transforming the data in the plurality of accumulation matrices according to the post matrix to obtain a plurality of output blocks respectively includes:
and multiplying the transposed matrix of the post matrix, any accumulated matrix and the post matrix in sequence to obtain a corresponding output block.
5. A hardware accelerator for deconvolution calculations, the hardware accelerator being configured to perform a method of deconvolution calculations as claimed in any of claims 1-4, the hardware accelerator comprising:
the input module comprises a plurality of parallel input channels and is used for inputting the data in the deconvolution kernel matrix into the multiplication module in parallel and inputting each layer of data in the input block into the preprocessing module in sequence, wherein only one input block is input in the single deconvolution calculation process;
the pre-processing module comprises a deconvolution pre-processing unit, the deconvolution pre-processing unit comprises a plurality of deconvolution pre-adders, and the deconvolution pre-adders are used for grouping and adding any layer of data in the acquired input block to acquire an input matrix, and then sequentially outputting each layer of data matrix in the input matrix to the multiplication module;
the multiplication module comprises a plurality of weight multipliers and is used for multiplying the deconvolution kernel matrix with each layer of data matrix in the input matrix respectively to obtain an intermediate matrix and inputting each layer of data matrix of the intermediate matrix into the accumulation module in sequence;
the accumulation module comprises an addition tree and a plurality of registers and is used for accumulating all layer data of the intermediate matrix according to channels to obtain an accumulation matrix and then inputting the accumulation matrix into the post-processing module;
the post-processing module comprises a deconvolution post-processing unit, and the deconvolution post-processing unit comprises a plurality of deconvolution post adders, and is used for grouping and adding the data in the accumulation matrix to obtain an output block;
and the output module comprises a plurality of parallel output channels and is used for outputting the data in the output block in parallel.
6. The hardware accelerator of deconvolution computation of claim 5,
the input module comprises 36 parallel input channels;
the deconvolution pre-processing unit comprises 13 deconvolution pre-adders;
the multiplication module comprises 36 weight multipliers;
the deconvolution post-processing unit comprises 60 deconvolution post adders;
the output module comprises 16 parallel output channels.
7. The hardware accelerator of deconvolution computation of claim 5, wherein the pre-processing module further comprises: the convolution pre-processing unit comprises a plurality of convolution pre-adders and the pre-selector is used for selecting and inputting the output result of the input module to the deconvolution pre-processing unit or the convolution pre-processing unit according to the current calculation requirement;
the post-processing module further comprises: the convolution post-processing unit comprises a plurality of convolution post-adders, and the post-selector is used for selecting to input the output result of the accumulation module into the deconvolution post-processing unit or the convolution post-processing unit according to the current calculation requirement.
8. The hardware accelerator of deconvolution computation of claim 7,
the input module comprises 36 parallel input channels;
the convolution preprocessing unit comprises 180 convolution preadders;
the deconvolution pre-processing unit comprises 13 deconvolution pre-adders;
the multiplication module comprises 36 weight multipliers;
the convolution post-processing unit comprises 100 convolution post-adders;
the deconvolution post-processing unit comprises 60 deconvolution post adders;
the output module comprises 16 parallel output channels.
9. A computer device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of deconvolution computation of any of claims 1-4 when executing said computer program.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of deconvolution computation of any one of claims 1-4.
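Taken together, claims 1–4 describe a transform-based pipeline (similar in spirit to Winograd convolution). The following sketch exercises the matrix shapes for k = 3, s = 2 with a single channel; all matrix values are placeholder ones, since the patent's concrete transform matrices are not reproduced here — only the dimensions come from the text:

```python
def matmul(a, b):
    # naive matrix product of nested lists
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

k, s = 3, 2
m2 = k + 2 * s - 2                          # 5
H = [[1] * k for _ in range(m2)]            # first pre-matrix, 5x3
P = [[1] * m2 for _ in range(3)]            # second pre-matrix, 3x5
Q = [[1] * (2 * s) for _ in range(m2)]      # post matrix, 5x4

g = [[1] * k for _ in range(k)]             # single-channel 3x3 kernel
d = [[1] * 3 for _ in range(3)]             # one 3x3 input block

# Claim 4: transform kernel and input block, then multiply element-wise
kernel_mat = matmul(matmul(H, g), transpose(H))    # H . g . H^T -> 5x5
input_mat = matmul(matmul(transpose(P), d), P)     # P^T . d . P -> 5x5
intermediate = [[kernel_mat[r][c] * input_mat[r][c] for c in range(m2)]
                for r in range(m2)]
# with one channel the accumulation matrix equals the intermediate matrix
output_block = matmul(matmul(transpose(Q), intermediate), Q)  # 4x4
print(len(output_block), len(output_block[0]))  # 4 4
```

With placeholder ones, the computation only demonstrates that the claimed sizes compose correctly into a 4 × 4 output block (16 output values per input block, as stated in the description).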
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2020106580063 | 2020-07-09 | ||
CN202010658006 | 2020-07-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113918876A (en) | 2022-01-11 |
Family
ID=79232473
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010802119.6A Pending CN113918876A (en) | 2020-07-09 | 2020-08-11 | Deconvolution calculation method, hardware accelerator, device and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113918876A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |