CN113918876A - Deconvolution calculation method, hardware accelerator, device and readable storage medium

Info

Publication number: CN113918876A
Application number: CN202010802119.6A
Authority: CN (China)
Legal status: Pending
Prior art keywords: matrix, deconvolution, size, input, post
Other languages: Chinese (zh)
Inventors: 王中风, 杨培祥, 毛文东, 林军
Current assignee: Nanjing University
Original assignee: Nanjing University
Application filed by Nanjing University

Classifications

    • G06F 17/153: Multidimensional correlation or convolution
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06N 3/045: Combinations of networks

Abstract

The application discloses a deconvolution calculation method. A plurality of input blocks are obtained; the data in a deconvolution kernel are transformed according to a first pre-matrix to obtain a deconvolution kernel matrix; the data in the input blocks are transformed according to a second pre-matrix to obtain a plurality of input matrices; the deconvolution kernel matrix is multiplied with each input matrix to obtain a plurality of intermediate matrices; and the data of all layer data matrices in each intermediate matrix are accumulated by channel to obtain a plurality of accumulation matrices. The data in the accumulation matrices are then transformed according to a post-matrix to obtain a plurality of output blocks, and the output blocks are arranged in order into an output feature map, yielding the deconvolution calculation result. Because no zeros are inserted into the original input feature map during calculation, the calculation efficiency is effectively improved.

Description

Deconvolution calculation method, hardware accelerator, device and readable storage medium
Technical Field
The present application relates to the field of deep neural network technologies, and in particular, to a method, a hardware accelerator, a device, and a readable storage medium for deconvolution calculation.
Background
When a hardware accelerator performs deconvolution on an input feature map, a large number of zeros are usually inserted into the data of the original input feature map, expanding it into a new input feature map. A window of the same size as the deconvolution kernel then slides over the new feature map with a preset step size. After each slide, the data covered by the window are multiplied with the data of the deconvolution kernel and summed, and the sum is output. Once the window has slid over all data of the new feature map, the collected sums constitute the final deconvolution calculation result.
Referring to fig. 1, a large number of zeros have been inserted into the data of the original input feature map: blank cells represent inserted zeros, black cells represent original input data, and gray cells represent the window. As shown, the window currently covers only two original input values; the remainder are zeros. When the data in the window are multiplied with the data of the deconvolution kernel, every multiplication involving an inserted zero is an invalid operation, and over the whole deconvolution calculation there are a large number of such invalid operations, which seriously degrades the calculation efficiency of the hardware accelerator.
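As an illustration, the conventional zero-insertion procedure can be sketched in NumPy. The border padding of k − 1 zeros and the counting of invalid multiplications are assumptions made for this sketch, not details taken from the application.

```python
import numpy as np

def deconv_zero_insert(x, kernel, stride):
    """Conventional deconvolution: insert zeros, then slide a k-by-k window."""
    k = kernel.shape[0]
    h, w = x.shape
    # Expand: insert (stride - 1) zeros between neighbouring input samples.
    up = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    up[::stride, ::stride] = x
    # Assumed border padding of k - 1 so every input value sees k*k windows.
    up = np.pad(up, k - 1)
    oh, ow = up.shape[0] - k + 1, up.shape[1] - k + 1
    out = np.zeros((oh, ow))
    total = invalid = 0
    for i in range(oh):
        for j in range(ow):
            win = up[i:i + k, j:j + k]
            out[i, j] = np.sum(win * kernel[::-1, ::-1])  # multiply and sum
            total += k * k                     # multiplications performed
            invalid += int(np.sum(win == 0))   # an operand was an inserted zero
    return out, invalid, total

x = np.arange(1.0, 5.0).reshape(2, 2)   # tiny 2x2 input feature map
out, invalid, total = deconv_zero_insert(x, np.ones((3, 3)), stride=2)
```

Running this on a 2 × 2 input with stride 2 and a 3 × 3 kernel shows that well over half of all multiplications have a zero operand, which is exactly the inefficiency the method below avoids.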
Disclosure of Invention
In order to improve the efficiency of deconvolution computation of a hardware accelerator, the present application discloses a method of deconvolution computation, a hardware accelerator, a device, and a readable storage medium according to the following embodiments.
The first aspect of the present application discloses a method for deconvolution calculation, which includes:
acquiring a plurality of input blocks, wherein the input blocks are data blocks covered by a preset sliding window after sliding in an original input feature map each time, the original input feature map comprises a plurality of layers of feature maps, correspondingly, each input block comprises a plurality of layers of data, and the size of the sliding window is set according to the size of a deconvolution kernel and the deconvolution step length;
transforming the data in the deconvolution kernel according to a first pre-matrix to obtain a deconvolution kernel matrix, wherein the size of the first pre-matrix is preset according to the size of the deconvolution kernel and the step length of deconvolution;
respectively transforming the data in the plurality of input blocks according to a second pre-matrix to obtain a plurality of input matrices, wherein each input matrix comprises a plurality of layers of data matrices, and the size of the second pre-matrix is preset according to the size of the deconvolution kernel and the step length of deconvolution;
multiplying the deconvolution kernel matrix with the plurality of input matrices respectively to obtain a plurality of intermediate matrices, wherein each intermediate matrix comprises a plurality of layers of data matrices;
acquiring a plurality of accumulation matrixes, wherein any one of the accumulation matrixes is a single-layer data matrix obtained by accumulating data of all layer data matrixes in the corresponding intermediate matrix according to channels;
respectively transforming the data in the plurality of accumulation matrixes according to a post matrix to obtain a plurality of output blocks, wherein the size of the post matrix is preset according to the size of the deconvolution kernel and the step length of deconvolution;
and sequentially arranging the output blocks into an output feature map to obtain a deconvolution calculation result.
Optionally, before obtaining the plurality of input blocks, the method further includes:
obtaining the size of the sliding window by the following formulas:
I1 = ⌈(k + s − 2) / s⌉;
D = I1 × I1;
wherein D represents the size of the sliding window, k represents the size of the deconvolution kernel, s represents the step size of the deconvolution, and ⌈·⌉ is the ceiling symbol, which yields the smallest integer not less than the expression it encloses;
obtaining the size of the first pre-matrix by the following formulas:
m1 = k + s − 2;
H = m1 × k;
wherein H represents the size of the first pre-matrix;
obtaining the size of the second pre-matrix by the following formula:
P = I1 × m1;
wherein P represents the size of the second pre-matrix;
obtaining the size of the post-matrix by the following formulas:
O1 = 1 × s;
Q = m1 × O1;
wherein Q represents the size of the post-matrix.
Optionally, before obtaining the plurality of input blocks, the method further includes:
obtaining the size of the sliding window by the following formulas:
I2 = ⌈(k + 2 × s − 2) / s⌉;
D = I2 × I2;
wherein D represents the size of the sliding window, k represents the size of the deconvolution kernel, s represents the step size of the deconvolution, and ⌈·⌉ is the ceiling symbol, which yields the smallest integer not less than the expression it encloses;
obtaining the size of the first pre-matrix by the following formulas:
m2 = k + 2 × s − 2;
H = m2 × k;
wherein H represents the size of the first pre-matrix;
obtaining the size of the second pre-matrix by the following formula:
P = I2 × m2;
wherein P represents the size of the second pre-matrix;
obtaining the size of the post-matrix by the following formulas:
O2 = 2 × s;
Q = m2 × O2;
wherein Q represents the size of the post-matrix.
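For convenience, the size formulas of both schemes can be collected into a small helper. The ceiling expressions for I1 and I2 are reconstructions (the original formulas appear only as figures in the patent), chosen to match the worked examples later in the text.

```python
import math

def first_order_sizes(k, s):
    """Sizes for T[(1*s)^2, k^2]: window, H, P and Q (rows x columns)."""
    m1 = k + s - 2              # transformed-tile side
    i1 = math.ceil(m1 / s)      # sliding-window side (reconstructed formula)
    o1 = 1 * s                  # output-block side
    return {"window": (i1, i1), "H": (m1, k), "P": (i1, m1), "Q": (m1, o1)}

def second_order_sizes(k, s):
    """Sizes for T[(2*s)^2, k^2]."""
    m2 = k + 2 * s - 2
    i2 = math.ceil(m2 / s)      # reconstructed formula
    o2 = 2 * s
    return {"window": (i2, i2), "H": (m2, k), "P": (i2, m2), "Q": (m2, o2)}
```

For s = 2 and k = 3 this reproduces the sizes of the worked examples below: a 2 × 2 window with H 3 × 3, P 2 × 3 and Q 3 × 2 in the first-order case, and a 3 × 3 window with H 5 × 3, P 3 × 5 and Q 5 × 4 in the second-order case.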
Optionally, the transforming the data in the deconvolution kernel according to the first pre-matrix to obtain a deconvolution kernel matrix includes:
sequentially multiplying the first pre-matrix, the deconvolution kernel and a transposed matrix of the first pre-matrix to obtain a deconvolution kernel matrix;
the transforming the data in the plurality of input blocks according to the second pre-matrix to obtain a plurality of input matrices includes:
multiplying the transpose matrix of the second pre-matrix, any one of the input blocks and the second pre-matrix in sequence to obtain a corresponding input matrix;
the transforming the data in the plurality of accumulation matrices according to the post matrix to obtain a plurality of output blocks respectively includes:
and multiplying the transposed matrix of the post matrix, any accumulated matrix and the post matrix in sequence to obtain a corresponding output block.
A second aspect of the present application discloses a hardware accelerator for performing deconvolution calculations, the hardware accelerator being configured to perform the method of deconvolution calculations described in the first aspect of the present application, the hardware accelerator comprising:
the input module comprises a plurality of parallel input channels and is used for inputting the data in the deconvolution kernel matrix into the multiplication module in parallel and inputting each layer of data in the input block into the preprocessing module in sequence, wherein only one input block is input in the single deconvolution calculation process;
the pre-processing module comprises a deconvolution pre-processing unit, the deconvolution pre-processing unit comprises a plurality of deconvolution pre-adders, and the deconvolution pre-adders are used for grouping and adding any layer of data in the acquired input block to acquire an input matrix, and then sequentially outputting each layer of data matrix in the input matrix to the multiplication module;
the multiplication module comprises a plurality of weight multipliers and is used for multiplying the deconvolution kernel matrix with each layer of data matrix in the input matrix respectively to obtain an intermediate matrix and inputting each layer of data matrix of the intermediate matrix into the accumulation module in sequence;
the accumulation module comprises an addition tree and a plurality of registers and is used for accumulating all layer data of the intermediate matrix according to channels to obtain an accumulation matrix and then inputting the accumulation matrix into the post-processing module;
the post-processing module comprises a deconvolution post-processing unit, and the deconvolution post-processing unit comprises a plurality of deconvolution post adders, and is used for grouping and adding the data in the accumulation matrix to obtain an output block;
and the output module comprises a plurality of parallel output channels and is used for outputting the data in the output block in parallel.
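A behavioural sketch of this module pipeline for one multi-channel input block, with NumPy standing in for the hardware: the matrix contents are random placeholders (the patented transform values appear only in the figures), so only the dataflow and shapes are modelled.

```python
import numpy as np

def accelerator_block(g_tfm, block, P, Q):
    """One deconvolution pass: pre-process, multiply, accumulate, post-process."""
    m = P.shape[1]
    acc = np.zeros((m, m))            # accumulation-module registers
    for c in range(block.shape[0]):   # layers of the block arrive in sequence
        v = P.T @ block[c] @ P        # pre-processing module (pre-adders)
        acc += g_tfm[c] * v           # weight multipliers plus adder tree
    return Q.T @ acc @ Q              # post-processing module (post-adders)

rng = np.random.default_rng(1)
channels = 8
g_tfm = rng.normal(size=(channels, 3, 3))  # pre-transformed kernel per channel
block = rng.normal(size=(channels, 2, 2))  # one multi-layer input block
P = rng.normal(size=(2, 3))                # placeholder second pre-matrix
Q = rng.normal(size=(3, 2))                # placeholder post-matrix
y = accelerator_block(g_tfm, block, P, Q)  # one single-layer output block
```

The loop mirrors the hardware schedule: the channel accumulation happens in the transformed m × m domain, so the post-processing transform is applied only once per block rather than once per channel.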
Optionally, the input module includes 36 parallel input channels;
the deconvolution preprocessing unit comprises 13 deconvolution pre-adders;
the multiplication module comprises 36 weight multipliers;
the deconvolution post-processing unit comprises 60 deconvolution post adders;
the output module comprises 16 parallel output channels.
Optionally, the preprocessing module further includes a convolution pre-processing unit and a pre-selector, wherein the convolution pre-processing unit comprises a plurality of convolution pre-adders, and the pre-selector is used for routing the output result of the input module to either the deconvolution pre-processing unit or the convolution pre-processing unit according to the current calculation requirement;
the post-processing module further includes a convolution post-processing unit and a post-selector, wherein the convolution post-processing unit comprises a plurality of convolution post-adders, and the post-selector is used for routing the output result of the accumulation module to either the deconvolution post-processing unit or the convolution post-processing unit according to the current calculation requirement.
Optionally, the input module includes 36 parallel input channels;
the convolution preprocessing unit comprises 180 convolution preadders;
the deconvolution preprocessing unit comprises 13 deconvolution pre-adders;
the multiplication module comprises 36 weight multipliers;
the convolution post-processing unit comprises 100 convolution post-adders;
the deconvolution post-processing unit comprises 60 deconvolution post adders;
the output module comprises 16 parallel output channels.
A third aspect of the present application discloses a computer device comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of deconvolution computation as described in the first aspect of the application when executing the computer program.
A fourth aspect of the present application discloses a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of deconvolution calculation according to the first aspect of the present application.
The application discloses a deconvolution calculation method: a plurality of input blocks are obtained; the data in a deconvolution kernel are transformed according to a first pre-matrix to obtain a deconvolution kernel matrix; the data in the input blocks are transformed according to a second pre-matrix to obtain a plurality of input matrices; the deconvolution kernel matrix is multiplied with each input matrix to obtain a plurality of intermediate matrices; and the data of all layer data matrices in each intermediate matrix are accumulated by channel to obtain a plurality of accumulation matrices. The data in the accumulation matrices are then transformed according to a post-matrix to obtain a plurality of output blocks, and the output blocks are arranged in order into an output feature map, yielding the deconvolution calculation result. Because the method never inserts zeros into the original input feature map before calculation, there are no invalid operations, and the calculation efficiency is effectively improved.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; for those of ordinary skill in the art, other drawings can be derived from these drawings without creative effort.
FIG. 1 is a schematic diagram of a conventional deconvolution calculation principle;
FIG. 2 is a schematic flowchart illustrating a method for deconvolution calculation according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of a first order deconvolution calculation disclosed in an embodiment of the present application;
FIG. 4 is a schematic flow chart of a second order deconvolution calculation disclosed in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a hardware accelerator for deconvolution computation according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a hardware accelerator for deconvolution computation according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a convolution pre-processing unit and a deconvolution pre-processing unit in a hardware accelerator for deconvolution computation according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a convolution post-processing unit and a deconvolution post-processing unit in a hardware accelerator for deconvolution computation disclosed in an embodiment of the present application.
Detailed Description
In order to improve the efficiency of deconvolution computation of a hardware accelerator, the present application discloses a method of deconvolution computation, a hardware accelerator, a device, and a readable storage medium according to the following embodiments.
A first embodiment of the present application discloses a method for deconvolution calculation, referring to a workflow diagram shown in fig. 2, the method includes:
step S11, obtaining a plurality of input blocks, where the input blocks are data blocks covered by a preset sliding window after each sliding in an original input feature map, where the original input feature map includes multiple layers of feature maps, and correspondingly, each input block includes multiple layers of data, and the size of the sliding window is set according to the size of a deconvolution kernel and a deconvolution step size.
And step S12, transforming the data in the deconvolution kernel according to a first pre-matrix to obtain a deconvolution kernel matrix, wherein the size of the first pre-matrix is preset according to the size of the deconvolution kernel and the deconvolution step size.
Specifically, the first pre-matrix, the deconvolution kernel, and the transposed matrix of the first pre-matrix are sequentially multiplied to transform data in the deconvolution kernel, so as to obtain the deconvolution kernel matrix.
And step S13, respectively transforming the data in the input blocks according to a second pre-matrix to obtain a plurality of input matrices, wherein each input matrix comprises a plurality of layers of data matrices, and the size of the second pre-matrix is preset according to the size of the deconvolution kernel and the deconvolution step size.
Specifically, the transposed matrix of the second pre-matrix, any one of the input blocks, and the second pre-matrix are sequentially multiplied to transform data in the input block, and a corresponding input matrix is obtained.
Step S14, multiplying the deconvolution kernel matrix with the plurality of input matrices respectively to obtain a plurality of intermediate matrices, where each intermediate matrix includes a plurality of layers of data matrices.
Step S15, obtaining a plurality of accumulation matrices, where any one of the accumulation matrices is a single-layer data matrix obtained by accumulating data of all layer data matrices in the intermediate matrix corresponding to the accumulation matrix according to channels.
And step S16, respectively transforming the data in the plurality of accumulation matrixes according to a post matrix to obtain a plurality of output blocks, wherein the size of the post matrix is preset according to the size of the deconvolution kernel and the step length of deconvolution.
Specifically, the transposed matrix of the post matrix, any one of the accumulation matrices, and the post matrix are multiplied in sequence to realize the transformation of data in the accumulation matrix and obtain the corresponding output block.
And step S17, arranging the output blocks in order into an output feature map to obtain a deconvolution calculation result.
In an actual calculation process there may be a plurality of deconvolution kernels, each containing multiple layers of data. The deconvolution calculation method disclosed in the above embodiment covers the calculation between one deconvolution kernel and the plurality of input blocks, which yields a single-layer output feature map; after the calculation is completed for every deconvolution kernel in this way, a multi-layer output feature map is obtained.
The embodiment discloses a deconvolution calculation method: a plurality of input blocks are obtained; the data in the deconvolution kernel are transformed according to a first pre-matrix to obtain a deconvolution kernel matrix; the data in the input blocks are transformed according to a second pre-matrix to obtain a plurality of input matrices; the deconvolution kernel matrix is multiplied with each input matrix to obtain a plurality of intermediate matrices; and the data of all layer data matrices in each intermediate matrix are accumulated by channel to obtain a plurality of accumulation matrices. The data in the accumulation matrices are then transformed according to the post-matrix to obtain a plurality of output blocks, which are arranged in order into an output feature map, yielding the deconvolution calculation result. Because no zeros are inserted into the original input feature map before calculation, there are no invalid operations, and the calculation efficiency is effectively improved.
The specific operation process of the deconvolution calculation method disclosed in the first embodiment of the present application can be represented by the following formula:
Y = Q^T [(H · g · H^T) ⊙ (P^T · d · P)] Q
where Y is the final output result, i.e., the output block, whose size is O1 × O1; d is an input block of size I1 × I1; g is a deconvolution kernel of size k × k; ⊙ indicates element-wise multiplication of matrix elements, the two matrices so multiplied each having size m1 × m1; H is the first pre-matrix and H^T its transpose, H having size m1 × k and H^T size k × m1; P is the second pre-matrix and P^T its transpose, P having size I1 × m1 and P^T size m1 × I1; Q is the post-matrix and Q^T its transpose, Q having size m1 × O1 and Q^T size O1 × m1.
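In NumPy the formula reads directly as a triple of matrix products around an element-wise multiplication. H, P and Q below are random placeholders of the correct sizes for the s = 2, k = 3 case (their patented element values appear only in the figures), so this sketch checks dimensions rather than numerical results.

```python
import numpy as np

def deconv_tile(g, d, H, P, Q):
    """Y = Q^T [ (H g H^T) elementwise* (P^T d P) ] Q for one input block."""
    U = H @ g @ H.T           # transformed deconvolution kernel, m1 x m1
    V = P.T @ d @ P           # transformed input block, m1 x m1
    return Q.T @ (U * V) @ Q  # output block, O1 x O1

rng = np.random.default_rng(0)
g = rng.normal(size=(3, 3))   # deconvolution kernel, k = 3
d = rng.normal(size=(2, 2))   # input block, I1 = 2
H = rng.normal(size=(3, 3))   # first pre-matrix, m1 x k (placeholder values)
P = rng.normal(size=(2, 3))   # second pre-matrix, I1 x m1 (placeholder values)
Q = rng.normal(size=(3, 2))   # post-matrix, m1 x O1 (placeholder values)
Y = deconv_tile(g, d, H, P, Q)
```

Note that the only element-wise product has size m1 × m1, which is where the multiplication counts discussed below come from.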
For different operation requirements, the deconvolution calculation method disclosed in the first embodiment of the present application includes a first-order deconvolution calculation method and a second-order deconvolution calculation method. The calculation steps of the two methods are the same, except that the parameters (the size of the sliding window and the sliding step size, the first pre-matrix, the second pre-matrix and the post-matrix) used in the calculation process are different.
In one embodiment, the first-order deconvolution calculation method may be denoted T[(1×s)², k²], where the sliding step of the sliding window is 1, s is the step size of the deconvolution, and k is the size of the deconvolution kernel. For example, T[(1×2)², 3²] denotes a calculation in which the step size of the deconvolution is 2 and the size of the deconvolution kernel is 3 × 3.
The first order deconvolution calculation method further includes, before obtaining the plurality of input blocks:
obtaining the size of the sliding window by the following formulas:
I1 = ⌈(k + s − 2) / s⌉;
D = I1 × I1;
where D represents the size of the sliding window (consistent with the size of the input block), k represents the size of the deconvolution kernel, s represents the step size of the deconvolution, and ⌈·⌉ is the ceiling symbol, which yields the smallest integer not less than the expression it encloses.
Obtaining the size of the first pre-matrix by the following formulas:
m1 = k + s − 2;
H = m1 × k;
wherein H represents the size of the first pre-matrix.
Obtaining the size of the second pre-matrix by the following formula:
P = I1 × m1;
wherein P represents the size of the second pre-matrix.
Obtaining the size of the post-matrix by the following formulas:
O1 = 1 × s;
Q = m1 × O1;
wherein Q represents the size of the post-matrix.
It should be noted that, in the first-order deconvolution calculation process, the first pre-matrix transforms the data in the deconvolution kernel according to a transformation rule that is preset according to actual needs; therefore, with the size of the first pre-matrix fixed, those skilled in the art may set the value of each of its elements themselves. Similarly, the values of the elements in the second pre-matrix and the post-matrix may be preset in advance.
As an example, if the step size of the deconvolution is 2 (s = 2) and the size of the deconvolution kernel is 3 × 3 (k = 3), i.e., for the T[(1×2)², 3²] calculation process, the size of the first pre-matrix is 3 × 3, the size of the second pre-matrix is 2 × 3, and the size of the post-matrix is 3 × 2. (The element values of the first pre-matrix, the second pre-matrix, and the post-matrix are given as figures in the original document.)
if the step size of the deconvolution is 2(s ═ 2), the size of the deconvolution kernel is 5 × 5(k ═ 5), i.e. for T [ (1 × 2)2,52]In the calculation process, the size of the first pre-matrix is 5 × 5, the size of the second pre-matrix is 3 × 5, and the size of the post-matrix is 5 × 2, and the first pre-matrix, the second pre-matrix, and the post-matrix may be respectively set as:
Figure BDA0002627778700000081
Figure BDA0002627778700000082
Figure BDA0002627778700000083
By way of example, FIG. 3 depicts the process of a first-order deconvolution calculation, namely T[(1×2)², 3²]. The sliding window has a sliding step size of 1, i.e., the moving step between adjacent input blocks (Si1) is 1; the size of the sliding window is 2 × 2, and the input block obtained by each slide is likewise 2 × 2. The whole deconvolution requires multiple such calculations, each performed on one input block, and the resulting output blocks are arranged in order to obtain the output feature map.
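The tiling shown in FIG. 3 (gather the input blocks covered by the sliding window, process each, and lay the output blocks out side by side) can be sketched as follows; the per-tile computation is replaced by an identity placeholder, since only the block bookkeeping is being illustrated.

```python
import numpy as np

def input_blocks(x, i, step):
    """Data blocks covered by the sliding window after each slide."""
    n = (x.shape[0] - i) // step + 1
    return [[x[r * step:r * step + i, c * step:c * step + i]
             for c in range(n)] for r in range(n)]

def arrange_outputs(blocks):
    """Lay the output blocks out in order to form the output feature map."""
    return np.block(blocks)

x = np.arange(16.0).reshape(4, 4)     # toy 4x4 input feature map
tiles = input_blocks(x, i=2, step=1)  # 3x3 grid of 2x2 input blocks
# Placeholder per-tile computation: each tile stands in for its output block.
feature_map = arrange_outputs(tiles)
```

With window side 2 and sliding step 1, a 4 × 4 input yields a 3 × 3 grid of input blocks, and the 2 × 2 output blocks assemble into a 6 × 6 output feature map, consistent with the roughly s-fold upsampling of a stride-2 deconvolution.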
In another embodiment, the second-order deconvolution calculation method may be denoted T[(2×s)², k²], where the sliding step of the sliding window is 2, s is the step size of the deconvolution, and k is the size of the deconvolution kernel. For example, T[(2×2)², 3²] denotes a calculation in which the step size of the deconvolution is 2 and the size of the deconvolution kernel is 3 × 3.
The second-order deconvolution calculation method further includes, before obtaining the plurality of input blocks:
obtaining the size of the sliding window by the following formulas:
I2 = ⌈(k + 2 × s − 2) / s⌉;
D = I2 × I2;
where D represents the size of the sliding window (consistent with the size of the input block), k represents the size of the deconvolution kernel, s represents the step size of the deconvolution, and ⌈·⌉ is the ceiling symbol, which yields the smallest integer not less than the expression it encloses.
Obtaining the size of the first pre-matrix by the following formulas:
m2 = k + 2 × s − 2;
H = m2 × k;
wherein H represents the size of the first pre-matrix.
Obtaining the size of the second pre-matrix by the following formula:
P = I2 × m2;
wherein P represents the size of the second pre-matrix.
Obtaining the size of the post-matrix by the following formulas:
O2 = 2 × s;
Q = m2 × O2;
wherein Q represents the size of the post-matrix.
It should be noted that, as in the first-order case, in the second-order deconvolution calculation process the first pre-matrix transforms the data in the deconvolution kernel according to a transformation rule that is preset according to actual needs; therefore, with the size of the first pre-matrix fixed, those skilled in the art may set the value of each of its elements themselves. Similarly, the values of the elements in the second pre-matrix and the post-matrix may be preset in advance.
As an example, if the step size of the deconvolution is 2 (s = 2) and the size of the deconvolution kernel is 3 × 3 (k = 3), i.e., for the T[(2×2)², 3²] calculation process, the size of the first pre-matrix is 5 × 3, the size of the second pre-matrix is 3 × 5, and the size of the post-matrix is 5 × 4. (The element values of the three matrices are given as figures in the original document.)
By way of example, FIG. 4 depicts the process of a second-order deconvolution calculation, namely T[(2×2)², 3²]. The sliding window has a sliding step size of 2, i.e., the moving step between adjacent input blocks (Si2) is 2; the size of the sliding window is 3 × 3, and the input block obtained by each slide is likewise 3 × 3. The whole deconvolution requires multiple such calculations, each performed on one input block, and the resulting output blocks are arranged in order to obtain the output feature map.
Experiments show that in the T[(1×2)², 3²] calculation process, 9 (3 × 3) multiplication operations performed on the 4 data values of one input block yield the 4 output values of one output block. In the T[(2×2)², 3²] calculation process, 25 (5 × 5) multiplication operations performed on the 9 data values of one input block yield the 16 output values of one output block. With the existing zero-insertion method, 9 multiplication operations on the 9 data values of one window yield only 1 output value.
Thus, to obtain 16 output values, the T[(1×2)², 3²] calculation method requires a total of 36 multiplication operations and the T[(2×2)², 3²] method a total of 25, whereas the existing zero-insertion method requires 144. The deconvolution calculation method disclosed in this embodiment therefore greatly improves the calculation efficiency.
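The comparison can be checked with a few lines of arithmetic for s = 2 and k = 3:

```python
# Multiplications needed for 16 output values (s = 2, k = 3).
k, s = 3, 2
m1, o1 = k + s - 2, 1 * s      # first order:  3x3 products per 2x2 output block
m2, o2 = k + 2 * s - 2, 2 * s  # second order: 5x5 products per 4x4 output block

first_order = (16 // o1 ** 2) * m1 ** 2   # 4 tiles x 9 multiplications
second_order = (16 // o2 ** 2) * m2 ** 2  # 1 tile x 25 multiplications
zero_insertion = 16 * k ** 2              # 9 multiplications per output value
```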
A second embodiment of the present application discloses a hardware accelerator for deconvolution calculation, where the hardware accelerator is configured to execute the method for deconvolution calculation according to the first embodiment of the present application, and with reference to the structural diagram shown in fig. 5, the hardware accelerator includes:
The input module comprises a plurality of parallel input channels and is used for inputting the data in the deconvolution kernel matrix into the multiplication module in parallel and inputting each layer of data in the input block into the pre-processing module in sequence, wherein only one input block is input in a single deconvolution calculation process.
In one implementation, the hardware accelerator is further provided with a deconvolution kernel processing module, which is configured to transform data in a deconvolution kernel to obtain a deconvolution kernel matrix, and then input the data in the deconvolution kernel matrix to the multiplication module through the input module.
In another implementation, the hardware accelerator is not provided with a deconvolution kernel processing module; the module is arranged outside the hardware accelerator, the data in the deconvolution kernel is transformed in advance, and the transformed deconvolution kernel matrix data is then stored in the multiplication module in advance through the input module.
The pre-processing module comprises a deconvolution pre-processing unit, which comprises a plurality of deconvolution pre-adders and is used for grouping and adding any layer of data in the acquired input block to obtain an input matrix, and then sequentially outputting each layer of data matrix in the input matrix to the multiplication module.
The multiplication module comprises a plurality of weight multipliers and is used for multiplying the deconvolution kernel matrix element by element with each layer of data matrix in the input matrix to obtain an intermediate matrix, and sequentially inputting each layer of data matrix of the intermediate matrix into the accumulation module.
In the actual calculation process, the deconvolution kernel matrix comprises multiple layers of data, and the multiplication module is used for correspondingly multiplying each layer of data in the deconvolution kernel matrix with each layer of data in the input matrix one by one to obtain an intermediate matrix.
The accumulation module comprises an addition tree and a plurality of registers and is used for accumulating all the layer data of the intermediate matrix channel by channel to obtain an accumulation matrix, and then outputting the accumulation matrix to the post-processing module.
The post-processing module comprises a deconvolution post-processing unit, which comprises a plurality of deconvolution post-adders and is used for grouping and adding the data in the accumulation matrix to obtain an output block.
The output module comprises a plurality of parallel output channels and is used for outputting the data in the output block in parallel.
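The element-wise multiplication and channel-wise accumulation performed by the multiplication and accumulation modules can be sketched in software as follows (a simplified functional model, not the hardware itself; names are my own):

```python
def multiply_accumulate(kernel_mats, input_mats):
    """Element-wise multiply each layer of the transformed kernel with the
    matching layer of the transformed input, then accumulate over channels.

    kernel_mats, input_mats: lists of equally sized 2-D matrices, one per channel.
    Returns the single-layer accumulation matrix.
    """
    rows, cols = len(input_mats[0]), len(input_mats[0][0])
    acc = [[0] * cols for _ in range(rows)]
    for g, d in zip(kernel_mats, input_mats):      # one data matrix per channel
        for r in range(rows):
            for c in range(cols):
                acc[r][c] += g[r][c] * d[r][c]     # intermediate-matrix entry
    return acc

# Two channels of 2 x 2 data:
g = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
d = [[[5, 6], [7, 8]], [[1, 1], [1, 1]]]
print(multiply_accumulate(g, d))   # [[6, 12], [21, 33]]
```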
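Under the transform sequence recited in claim 4 (first pre-matrix · kernel · its transpose; transpose of second pre-matrix · input block · second pre-matrix; transpose of post-matrix · accumulation matrix · post-matrix), the sizes for the s = 2, k = 3 case chain together consistently. The sketch below only checks dimensional compatibility, with hypothetical zero matrices standing in for the actual pre-/post-matrix values given in the figures:

```python
def matmul(a, b):
    """Plain matrix multiply for nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def zeros(r, c):
    return [[0] * c for _ in range(r)]

# s = 2, k = 3: first pre-matrix G is 5x3, second pre-matrix B is 3x5,
# post-matrix A is 5x4 (placeholders only; real values are in the figures).
G, B, A = zeros(5, 3), zeros(3, 5), zeros(5, 4)
g = zeros(3, 3)   # deconvolution kernel
d = zeros(3, 3)   # one input block

kernel_mat = matmul(matmul(G, g), transpose(G))   # 5x5 deconvolution kernel matrix
input_mat = matmul(matmul(transpose(B), d), B)    # 5x5 input matrix
acc = [[kernel_mat[i][j] * input_mat[i][j]        # 25 element-wise multiplications
        for j in range(5)] for i in range(5)]
out = matmul(matmul(transpose(A), acc), A)        # 4x4 output block (16 values)
print(len(out), len(out[0]))   # 4 4
```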
Further, the input module comprises 36 parallel input channels for inputting the 36 data of an input block and the 36 data of the deconvolution kernel matrix in parallel.
The deconvolution pre-processing unit comprises 13 deconvolution pre-adders for grouping the 36 parallel input data and shifting or adding them according to a predetermined rule.
The multiplication module comprises 36 weight multipliers for multiplying the 36 output data of the pre-processing module with the 36 data of the deconvolution kernel matrix in a predetermined order.
The deconvolution post-processing unit comprises 60 deconvolution post-adders for grouping the resulting 36 data and shifting or adding them according to a predetermined rule.
The output module comprises 16 parallel output channels for outputting the 16 output data of the deconvolution post-processing unit in parallel.
In practical applications, a person skilled in the art may set the number of adders or weight multipliers in each module according to actual needs, and set the connection relationship between the adders and the weight multipliers, which is not described herein again.
In one embodiment, the application discloses a reconfigurable hardware accelerator that can flexibly select between convolution and deconvolution operations according to the application scenario.
Referring to FIG. 6, the pre-processing module further includes a convolution pre-processing unit and a pre-selector. The convolution pre-processing unit comprises a plurality of convolution pre-adders, and the pre-selector is used for selectively inputting the output result of the input module to the deconvolution pre-processing unit or the convolution pre-processing unit according to the current calculation requirement.
The post-processing module further includes a convolution post-processing unit and a post-selector. The convolution post-processing unit comprises a plurality of convolution post-adders, and the post-selector is used for selectively inputting the output result of the accumulation module to the deconvolution post-processing unit or the convolution post-processing unit according to the current calculation requirement.
Further, the input module comprises 36 parallel input channels. The output module comprises 16 parallel output channels. The multiplication module comprises 36 weight multipliers.
Referring to fig. 7, the convolution pre-processing unit includes 180 convolution pre-adders. The deconvolution pre-processing unit includes 13 deconvolution pre-adders.
Referring to fig. 8, the convolution post-processing unit includes 100 convolution post-adders. The deconvolution post-processing unit comprises 60 deconvolution post adders.
The following describes the workflow of the hardware accelerator disclosed in this embodiment with reference to a simple example:
The input module inputs the 36 pre-calculated deconvolution kernel matrix data to the multiplication module. The pre-selector selects convolution calculation or deconvolution calculation according to the configuration: if convolution calculation is to be performed, the input module sends the 36 input data to the convolution pre-processing unit and 0 is input to the deconvolution pre-processing unit; if deconvolution calculation is to be performed, the input module sends the 36 input data to the deconvolution pre-processing unit and 0 is input to the convolution pre-processing unit.
In the pre-processing module, the convolution pre-processing unit divides the 36 input data into 6 groups of 6 data each and applies the prescribed additions or shifts, while the deconvolution pre-processing unit takes the first nine input data and applies the prescribed additions. After both units have computed their results, the selected unit sends its output data to the multiplication module: for convolution calculation, the data output by the convolution pre-processing unit is sent to the multiplication module; for deconvolution calculation, the data output by the deconvolution pre-processing unit is sent to the multiplication module.
In the multiplication module, the data sent by the pre-processing module and the pre-calculated deconvolution kernel matrix data are multiplied one by one, and the results are sent to the post-processing module. The post-selector selects convolution calculation or deconvolution calculation according to the configuration: for convolution calculation, the results are sent to the convolution post-processing unit and 0 is input to the deconvolution post-processing unit; for deconvolution calculation, the results are sent to the deconvolution post-processing unit and 0 is input to the convolution post-processing unit.
In the post-processing module, the convolution post-processing unit divides the data sent by the multiplication module into 6 groups of 6 data each and applies the prescribed additions or shifts, while the deconvolution post-processing unit divides the data sent by the multiplication module into 4 groups of 9 data each and applies the prescribed additions. After both units have computed their final results, the selected unit outputs its result: for convolution calculation, the result of the convolution post-processing unit is output; for deconvolution calculation, the result of the deconvolution post-processing unit is output.
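The pre-/post-selector behaviour described above amounts to routing the live data to the selected unit and zeros to the inactive one. A minimal software model (mode names and function name are my own):

```python
def route(mode, data, width=36):
    """Route the input data to the selected processing unit and feed
    zeros to the inactive unit, as the pre- and post-selectors do."""
    zeros = [0] * width
    if mode == "conv":
        return {"conv_unit": data, "deconv_unit": zeros}
    elif mode == "deconv":
        return {"conv_unit": zeros, "deconv_unit": data}
    raise ValueError("mode must be 'conv' or 'deconv'")

paths = route("deconv", list(range(36)))
print(paths["conv_unit"][:3], paths["deconv_unit"][:3])   # [0, 0, 0] [0, 1, 2]
```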
The output module outputs 16 output results in parallel at a time.
A third embodiment of the present application discloses a computer device, comprising:
a memory for storing a computer program.
A processor for implementing the steps of the method of deconvolution computation as described in the first embodiment of the present application when executing said computer program.
A fourth embodiment of the present application discloses a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of deconvolution computation as described in the first embodiment of the present application.
The present application has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the presently disclosed embodiments and implementations thereof without departing from the spirit and scope of the present disclosure, and these fall within the scope of the present disclosure. The protection scope of this application is subject to the appended claims.

Claims (10)

1. A method of deconvolution computation, comprising:
acquiring a plurality of input blocks, wherein the input blocks are data blocks covered by a preset sliding window after sliding in an original input feature map each time, the original input feature map comprises a plurality of layers of feature maps, correspondingly, each input block comprises a plurality of layers of data, and the size of the sliding window is set according to the size of a deconvolution kernel and the deconvolution step length;
transforming the data in the deconvolution kernel according to a first pre-matrix to obtain a deconvolution kernel matrix, wherein the size of the first pre-matrix is preset according to the size of the deconvolution kernel and the step length of deconvolution;
respectively transforming the data in the plurality of input blocks according to a second pre-matrix to obtain a plurality of input matrices, wherein each input matrix comprises a plurality of layers of data matrices, and the size of the second pre-matrix is preset according to the size of the deconvolution kernel and the step length of deconvolution;
multiplying the deconvolution kernel matrix with the plurality of input matrices respectively to obtain a plurality of intermediate matrices, wherein each intermediate matrix comprises a plurality of layers of data matrices;
acquiring a plurality of accumulation matrixes, wherein any one of the accumulation matrixes is a single-layer data matrix obtained by accumulating data of all layer data matrixes in the corresponding intermediate matrix according to channels;
respectively transforming the data in the plurality of accumulation matrixes according to a post matrix to obtain a plurality of output blocks, wherein the size of the post matrix is preset according to the size of the deconvolution kernel and the step length of deconvolution;
and sequentially arranging the output blocks into an output characteristic diagram to obtain a deconvolution calculation result.
2. The method of deconvolution computation of claim 1, wherein prior to obtaining the plurality of input blocks, the method further comprises:
obtaining the size of the sliding window by the following formula:
I1 = ⌈k/s⌉;
D=I1×I1
wherein D represents the size of the sliding window, k represents the size of the deconvolution kernel, s represents the step size of the deconvolution, and ⌈·⌉ is the ceiling symbol, which yields the smallest integer not less than the enclosed calculation result;
obtaining the size of the first pre-matrix by the following formula:
m1=k+s-2;
H=m1×k;
wherein H represents the size of the first pre-matrix;
obtaining the size of the second pre-matrix by the following formula:
P = I1 × m1;
wherein P represents the size of the second pre-matrix;
obtaining the size of the post matrix by the following formula:
O1=1×s;
Q=m1×O1
wherein Q represents the size of the post-matrix.
3. The method of deconvolution computation of claim 1, wherein prior to obtaining the plurality of input blocks, the method further comprises:
obtaining the size of the sliding window by the following formula:
I2 = ⌈k/s⌉ + 1;
D=I2×I2
wherein D represents the size of the sliding window, k represents the size of the deconvolution kernel, s represents the step size of the deconvolution, and ⌈·⌉ is the ceiling symbol, which yields the smallest integer not less than the enclosed calculation result;
obtaining the size of the first pre-matrix by the following formula:
m2=k+2×s-2;
H=m2×k;
wherein H represents the size of the first pre-matrix;
obtaining the size of the second pre-matrix by the following formula:
P = I2 × m2;
wherein P represents the size of the second pre-matrix;
obtaining the size of the post matrix by the following formula:
O2=2×s;
Q=m2×O2
wherein Q represents the size of the post-matrix.
4. The method of deconvolution computation of claim 2 or 3, wherein said transforming the data in the deconvolution kernel according to the first pre-matrix to obtain a deconvolution kernel matrix comprises:
sequentially multiplying the first pre-matrix, the deconvolution kernel and a transposed matrix of the first pre-matrix to obtain a deconvolution kernel matrix;
the transforming the data in the plurality of input blocks according to the second pre-matrix to obtain a plurality of input matrices includes:
multiplying the transpose matrix of the second pre-matrix, any one of the input blocks and the second pre-matrix in sequence to obtain a corresponding input matrix;
the transforming the data in the plurality of accumulation matrices according to the post matrix to obtain a plurality of output blocks respectively includes:
and multiplying the transposed matrix of the post matrix, any accumulated matrix and the post matrix in sequence to obtain a corresponding output block.
5. A hardware accelerator for deconvolution calculations, the hardware accelerator being configured to perform a method of deconvolution calculations as claimed in any of claims 1-4, the hardware accelerator comprising:
the input module comprises a plurality of parallel input channels and is used for inputting the data in the deconvolution kernel matrix into the multiplication module in parallel and inputting each layer of data in the input block into the preprocessing module in sequence, wherein only one input block is input in the single deconvolution calculation process;
the pre-processing module comprises a deconvolution pre-processing unit, the deconvolution pre-processing unit comprises a plurality of deconvolution pre-adders, and the deconvolution pre-adders are used for grouping and adding any layer of data in the acquired input block to acquire an input matrix, and then sequentially outputting each layer of data matrix in the input matrix to the multiplication module;
the multiplication module comprises a plurality of weight multipliers and is used for multiplying the deconvolution kernel matrix with each layer of data matrix in the input matrix respectively to obtain an intermediate matrix and inputting each layer of data matrix of the intermediate matrix into the accumulation module in sequence;
the accumulation module comprises an addition tree and a plurality of registers and is used for accumulating all layer data of the intermediate matrix according to channels to obtain an accumulation matrix and then inputting the accumulation matrix into the post-processing module;
the post-processing module comprises a deconvolution post-processing unit, and the deconvolution post-processing unit comprises a plurality of deconvolution post adders, and is used for grouping and adding the data in the accumulation matrix to obtain an output block;
and the output module comprises a plurality of parallel output channels and is used for outputting the data in the output block in parallel.
6. The hardware accelerator of deconvolution computation of claim 5,
the input module comprises 36 parallel input channels;
the deconvolution pre-processing unit comprises 13 deconvolution pre-adders;
the multiplication module comprises 36 weight multipliers;
the deconvolution post-processing unit comprises 60 deconvolution post adders;
the output module comprises 16 parallel output channels.
7. The hardware accelerator of deconvolution computation of claim 5, wherein the pre-processing module further comprises: a convolution pre-processing unit and a pre-selector, wherein the convolution pre-processing unit comprises a plurality of convolution pre-adders, and the pre-selector is used for selectively inputting the output result of the input module to the deconvolution pre-processing unit or the convolution pre-processing unit according to the current calculation requirement;
the post-processing module further comprises: a convolution post-processing unit and a post-selector, wherein the convolution post-processing unit comprises a plurality of convolution post-adders, and the post-selector is used for selectively inputting the output result of the accumulation module to the deconvolution post-processing unit or the convolution post-processing unit according to the current calculation requirement.
8. The hardware accelerator of deconvolution computation of claim 7,
the input module comprises 36 parallel input channels;
the convolution preprocessing unit comprises 180 convolution preadders;
the deconvolution pre-processing unit comprises 13 deconvolution pre-adders;
the multiplication module comprises 36 weight multipliers;
the convolution post-processing unit comprises 100 convolution post-adders;
the deconvolution post-processing unit comprises 60 deconvolution post adders;
the output module comprises 16 parallel output channels.
9. A computer device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of deconvolution computation of any of claims 1-4 when executing said computer program.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of deconvolution computation of any one of claims 1-4.
CN202010802119.6A 2020-07-09 2020-08-11 Deconvolution calculation method, hardware accelerator, device and readable storage medium Pending CN113918876A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2020106580063 2020-07-09
CN202010658006 2020-07-09

Publications (1)

Publication Number Publication Date
CN113918876A true CN113918876A (en) 2022-01-11

Family

ID=79232473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010802119.6A Pending CN113918876A (en) 2020-07-09 2020-08-11 Deconvolution calculation method, hardware accelerator, device and readable storage medium

Country Status (1)

Country Link
CN (1) CN113918876A (en)

Similar Documents

Publication Publication Date Title
KR101162649B1 (en) A method of and apparatus for implementing fast orthogonal transforms of variable size
EP3789891A1 (en) Number-theoretic transform hardware
CN110874636B (en) Neural network model compression method and device and computer equipment
CN111652330B (en) Image processing method, device, system, electronic equipment and readable storage medium
US11544526B2 (en) Computing device and method
CN109117187A (en) Convolutional neural networks accelerated method and relevant device
CN111639699B (en) Method, system and equipment for extracting image features and readable storage medium
CN108897716B (en) Data processing device and method for reducing calculation amount through memory read-write operation
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
WO2019088072A1 (en) Information processing device, information processing method, and program
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN111639701B (en) Method, system and equipment for extracting image features and readable storage medium
JP4263693B2 (en) A computationally efficient math engine
US6574649B2 (en) Efficient convolution method and apparatus
CN109740740A (en) The fixed point accelerating method and device of convolutional calculation
CN113918876A (en) Deconvolution calculation method, hardware accelerator, device and readable storage medium
US11960565B2 (en) Add-mulitply-add convolution computation for a convolutional neural network
CN115496993B (en) Target detection method, device, equipment and storage medium based on frequency domain fusion
CN109634556B (en) Multiply-accumulator and accumulation output method
KR102153167B1 (en) Matrix operator and matrix operation method for artificial neural network
CN114758209B (en) Convolution result obtaining method and device, computer equipment and storage medium
CN112650974B (en) Efficient transpose convolution calculation method
EP3480710A1 (en) Computer architectures and instructions for multiplication
CN115859003A (en) Method, device and equipment for executing FFT
CN110245263B (en) Aggregation method, aggregation device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination