CN112949845B - Deep convolutional neural network accelerator based on FPGA - Google Patents

Deep convolutional neural network accelerator based on FPGA

Info

Publication number
CN112949845B
CN112949845B (application CN202110249630.2A)
Authority
CN
China
Prior art keywords
convolution
data
fast
winograd
trapezoidal
Prior art date
Legal status
Expired - Fee Related
Application number
CN202110249630.2A
Other languages
Chinese (zh)
Other versions
CN112949845A (en)
Inventor
黄威
孙锴
李锦
段昊东
Current Assignee
Inner Mongolia University
Original Assignee
Inner Mongolia University
Priority date
Filing date
Publication date
Application filed by Inner Mongolia University filed Critical Inner Mongolia University
Priority to CN202110249630.2A
Publication of CN112949845A
Application granted
Publication of CN112949845B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/94 Hardware or software architectures specially adapted for image or video understanding
    • G06V10/955 Hardware or software architectures specially adapted for image or video understanding using specific electronic processors
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an FPGA-based deep convolutional neural network accelerator comprising a fast convolution operation module and a two-dimensional convolution memory interaction module. The fast convolution operation module combines a lightweight fast multiplication with the Winograd algorithm to realize fast convolution for the deep convolutional neural network. The two-dimensional convolution memory interaction module transfers the intermediate calculation results and weights produced during the fast convolution between off-chip memory and on-chip cache using a trapezoidal-multiplexing memory interaction optimization strategy, minimizing the data exchanged between the inside and outside of the chip. On the one hand, the accelerator improves the Winograd convolution algorithm with fast multiplication, further increasing the speed at which Winograd computes convolutions; on the other hand, the proposed trapezoidal-multiplexing memory interaction optimization strategy reduces the latency of memory interaction between the inside and outside of the chip.

Description

Deep convolutional neural network accelerator based on FPGA
Technical Field
The invention relates to the fields of deep learning and FPGAs, and in particular to an FPGA-based deep convolutional neural network accelerator.
Background
With the popularization of artificial intelligence, the demand for intelligent devices that make daily life more convenient keeps growing. The deep convolutional neural network (DCNN) is the most important algorithm in image processing, with applications including image recognition and object detection. However, a DCNN involves a very large amount of computation, so running it on a GPU incurs high latency and power consumption, which makes DCNNs difficult to apply in scenarios with strict real-time requirements such as robots and autonomous vehicles.
To improve the real-time performance of DCNNs, FPGAs are used to accelerate them because of their fast computation, low power consumption, and reprogrammability. Accelerating a DCNN on an FPGA is nonetheless difficult: a DCNN requires an enormous amount of computation and a large number of weights, while FPGA resources are limited. The dominant operation in a DCNN is convolution, which is implemented on an FPGA with multipliers and adders, so optimizing the convolution operation speeds up the DCNN. In addition, because the DCNN contains many weights and the on-chip cache capacity of the FPGA is limited, intermediate calculation results and weights must be stored in off-chip memory, and fetching a batch of data onto the chip for every computation generates a large number of memory interactions, so optimizing the memory interaction also accelerates the DCNN. FPGA acceleration of DCNNs is therefore carried out mainly along two lines: computation acceleration and memory interaction optimization.
Computation acceleration mainly means accelerating the convolution operation; the general principle is to reduce the computational complexity and the amount of computation. Commonly used convolution acceleration algorithms are the FFT, depthwise separable convolution, and Winograd. The fast Fourier transform (FFT) converts the input feature map and the convolution kernel from the time domain to the frequency domain, where only multiplication of the two is needed, and then transforms the product back to the time domain. However, the FFT must also account for the cost of the conversions between the time and frequency domains: when the input feature map and the convolution kernel are of similar size, the FFT accelerates the convolution overall, but when the convolution kernel is small it is difficult for the FFT to provide an overall speed-up. Convolution kernels in current DCNN models keep shrinking, from the 11 × 11 kernels in AlexNet to the 3 × 3 kernels in VGG, which are much smaller than the input feature map, so accelerating current DCNNs with the FFT is not a good option. Depthwise separable convolution decomposes the standard convolution into a depthwise convolution and a pointwise convolution, which markedly reduces the amount of computation and the number of parameters, and its highly structured form is well suited to FPGA implementation; however, it loses some accuracy and is therefore unsuitable for high-accuracy applications. Winograd, like the FFT, converts the computation to another domain, reducing the number of multiplications at the cost of slightly more additions. On an FPGA a multiplication takes longer than an addition, so Winograd accelerates convolution overall. Unlike the FFT, which suits large kernels, Winograd is mainly used to accelerate convolutions with small kernels, matching the small kernels that dominate current DCNN models.
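As a point of reference only (not part of the patent), the frequency-domain route described above can be sketched in a few lines of Python with NumPy; the array sizes are arbitrary assumptions:

```python
import numpy as np

# Direct 1-D convolution versus FFT-based convolution: transform input and
# kernel, multiply in the frequency domain, transform back. The two transforms
# are the extra cost that makes the FFT unattractive for small kernels.
x = np.random.rand(224)   # one row of an input feature map (size assumed)
k = np.random.rand(3)     # small 3-tap kernel, as in VGG-style networks

n = len(x) + len(k) - 1   # length needed to avoid circular wrap-around
fft_conv = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

assert np.allclose(fft_conv, np.convolve(x, k))
```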
Since Winograd still contains multiplications, combining a fast multiplication algorithm with the traditional Winograd algorithm can further accelerate the convolution. Existing fast multiplication algorithms include Radix-4 Booth-encoded multiplication and Wallace-tree multiplication. The principle of multiplication on an FPGA is that the multiplicand is shifted according to each bit of the multiplier to generate partial products, which are then summed to produce the final result. Radix-4 Booth-encoded multiplication recodes the multiplier in overlapping groups of three bits; each group then shifts the multiplicand to produce a partial product, and the partial products are summed. Encoding the multiplier in this way reduces the number of partial products and thereby speeds up the multiplication. Wallace-tree multiplication mainly exploits the 3-to-2 compression of a full adder to increase parallelism and raise the operation speed. However, both Radix-4 Booth-encoded multiplication and Wallace-tree multiplication are complex to implement on an FPGA and are inconvenient to combine with Winograd.
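For illustration, a minimal Python sketch of the Radix-4 Booth recoding described above (our own reading, not code from the patent); it assumes unsigned 8-bit operands:

```python
import random

def booth_radix4_mul(x: int, y: int, bits: int = 8) -> int:
    """Multiply x by an unsigned `bits`-wide y with Radix-4 Booth recoding."""
    y_ext = y << 1                      # implicit y[-1] = 0 below the LSB
    result = 0
    for i in range(bits // 2 + 1):      # one extra group absorbs an unsigned MSB
        group = (y_ext >> (2 * i)) & 0b111               # bits y[2i+1], y[2i], y[2i-1]
        b2, b1, b0 = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digit = -2 * b2 + b1 + b0                        # recoded digit in {-2, ..., 2}
        if digit:                                        # only non-zero partial products
            result += (digit * x) << (2 * i)
    return result

for _ in range(1000):
    a, b = random.randrange(256), random.randrange(256)
    assert booth_radix4_mul(a, b) == a * b
```

With roughly half as many partial products as a plain shift-and-add multiplier, the adder stage shortens, which is the benefit the text attributes to Booth encoding.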
Memory interaction optimization mainly reduces the data exchanged between off-chip memory and the on-chip cache. The large numbers of feature maps and weights reside in off-chip memory; because the on-chip cache capacity is limited, the off-chip data must be divided into data blocks, and one block at a time is transferred onto the chip for computation. In existing DCNN models the stride of the convolution window is smaller than the window size, so adjacent windows share data, and consequently adjacent data blocks overlap, both left-right and top-bottom. Without optimization, this overlapping data is transferred repeatedly and increases the on/off-chip data traffic. One-dimensional convolution multiplexing has therefore been proposed: it reuses the data shared by left-right adjacent blocks and reduces the transfer volume, but because on-chip resources are limited it cannot simultaneously reuse the data shared by top-bottom adjacent blocks, so it cannot minimize the data interaction between the inside and outside of the chip.
Disclosure of Invention
The invention aims to overcome the above technical shortcomings and provides an FPGA-based deep convolutional neural network accelerator that accelerates along the two lines of computation acceleration and memory interaction optimization. For computation acceleration, the invention chooses Winograd to optimize the convolution, because Winograd handles convolutions with small kernels at reasonable computational complexity. To further improve Winograd's performance, the invention proposes a lightweight fast multiplication and combines it with the traditional Winograd algorithm, so that the improved Winograd computes convolutions faster. For memory interaction optimization, since one-dimensional convolution multiplexing cannot minimize the on/off-chip data interaction, the invention proposes a memory interaction optimization strategy named trapezoidal multiplexing, which reuses both the left-right and the top-bottom adjacent data at the same time and can minimize the data interaction between the inside and outside of the chip.
To achieve the above object, the invention provides an FPGA-based deep convolutional neural network accelerator comprising a fast convolution operation module and a two-dimensional convolution memory interaction module;
the fast convolution operation module is used for combining a lightweight fast multiplication with the Winograd algorithm to realize the fast convolution operation of the deep convolutional neural network;
and the two-dimensional convolution memory interaction module is used for transferring the intermediate calculation results and weights produced during the fast convolution operation between off-chip memory and on-chip cache using the trapezoidal-multiplexing memory interaction optimization strategy, so that the data interaction between the inside and outside of the chip is minimized.
As an improvement of the above system, a specific implementation process of the fast convolution operation module includes:
for two-dimensional convolution, assuming the output size is m × m and the convolution kernel size is r × r, the convolution can be denoted F(m × m, r × r); the two-dimensional Winograd convolution is calculated as:
Y = A^T [(G g G^T) ⊙ (B^T d B)] A
where G, B and A respectively denote the convolution kernel transform matrix, the input transform matrix and the output transform matrix, all of which can be computed in advance once m and r are determined; d denotes the input and g the convolution kernel; B^T, A^T and G^T denote the transposes of B, A and G;
Winograd is improved with the fast multiplication, and the improved formula is:
Y = A^T fm[(G g G^T), (B^T d B)] A
where fm(X, Y) denotes the fast multiplication of the two matrices X and Y, in which the elements at the same positions of X and Y are multiplied by the fast multiplication.
As an improvement of the above system, a specific implementation process of the two-dimensional convolution memory interaction module includes:
the input feature map and the weights of the deep convolutional neural network are divided into trapezoidal data blocks; when the convolution window slides over the trapezoidal data, the trapezoidal data must undergo normal conversion, that is, a sliding window on the trapezoidal data is converted back to a normal convolution window according to the window correspondence rule;
the computation order of the trapezoidal multiplexing is to take each trapezoidal data block from left to right in turn and, after a trapezoidal block has been computed, to store the data it shares with the adjacent block on its right; within each trapezoidal data block, the window slides from top to bottom and then from left to right;
the trapezoidal data is divided into three parts: a front part, a corner and a rear part; the three columns of the trapezoid are represented by three vectors x, y and z; the total number of sliding windows is n, the sliding-window position is p, 0 ≤ p ≤ n - 1, and p is an integer; when n ≥ 5, the values in the window fall into the following three cases according to p, corresponding respectively to the front part, the corner and the rear part of the trapezoidal data:
(1) when p lies in the first range [condition given as a formula image in the original], the sliding window is in the front part of the trapezoid;
(2) when p lies in the second range [formula image], the sliding window is at the corner of the trapezoid, and there are three window forms [formula images];
(3) when p lies in the third range [formula image], the sliding window is in the rear part of the trapezoid.
As an improvement of the above system, the normal conversion of the trapezoidal data is specifically: the data width is adjusted to the input length of the two-dimensional convolution F(m × m, r × r).
The invention has the advantages that:
on the one hand, the accelerator improves the Winograd convolution algorithm with fast multiplication, further increasing the speed at which Winograd computes convolutions; on the other hand, a trapezoidal-multiplexing memory interaction optimization strategy is provided, which reduces the latency of memory interaction between the inside and outside of the chip.
Drawings
FIG. 1(a) is a schematic diagram of a plurality of input feature maps;
FIG. 1(b) shows an ideal data block division mode, in which adjacent data blocks are not overlapped;
FIG. 1(c) is a schematic diagram of division without convolution multiplexing;
FIG. 1(d) is a schematic diagram of one-dimensional convolution multiplexing;
FIG. 2(a) is a schematic diagram of convolution-free multiplexing;
FIG. 2(b) is a schematic diagram of one-dimensional convolution multiplexing;
FIG. 3 is a schematic illustration of ladder multiplexing;
FIG. 4 is a schematic diagram of a data flow for a normal conversion;
FIG. 5 is a schematic diagram of three cases of normal switching;
FIG. 6 is a schematic diagram of the normal conversion after adaptation to Winograd;
FIG. 7 is a flow chart of the computation of a parallel addition tree;
FIG. 8 is a schematic diagram of the calculated time delay for three optimization strategies;
fig. 9 is a schematic diagram of transmission delay of three optimization strategies.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings.
The invention provides an accelerator of a deep convolutional neural network based on an FPGA (field programmable gate array), which adopts the following technical scheme:
1. compute acceleration
1.1 fast multiplication
The working principle of a multiplier is that the multiplicand is shifted to generate partial products, which are then accumulated to obtain the multiplication result. A fast multiplication is proposed herein that speeds up multiplication by reducing the number of partial products: the algorithm computes only the non-zero partial products and then adds them together. Reducing the number of additions increases the computation speed of the multiplier.
The expression for fast multiplication is shown below:
P=fm(X,Y)
wherein X and Y represent the two multiplication matrices of the fast multiplication; the elements at the same positions of X and Y are multiplied by the fast multiplication, and the multiplication result P is obtained once all positions have been computed.
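A minimal software sketch of how fm(X, Y) might behave, assuming unsigned 8-bit operands and reading the description above literally (only the non-zero partial products of each element-wise product are generated and summed); the function names are our own and the patent's hardware details are not reproduced:

```python
import numpy as np

def fast_mul_scalar(x: int, y: int, bits: int = 8) -> int:
    """Multiply x by an unsigned `bits`-wide y using only the non-zero
    partial products (shifted copies of x); in hardware the summation
    would be carried out by a parallel addition tree."""
    partials = [x << i for i in range(bits) if (y >> i) & 1]   # skip zero rows
    return sum(partials)

def fm(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Element-wise 'fast multiplication' of two matrices, as fm(X, Y) above."""
    out = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        out[idx] = fast_mul_scalar(int(X[idx]), int(Y[idx]))
    return out

X = np.random.randint(0, 256, (4, 4))
Y = np.random.randint(0, 256, (4, 4))
assert np.array_equal(fm(X, Y), X * Y)   # same result as ordinary multiplication
```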
1.2, improving Winograd
Winograd is a fast convolution algorithm. On an FPGA, multiplication is slower than addition, and Winograd increases the speed of convolution by reducing the number of multiplications. For a one-dimensional convolution, let the output length be m and the convolution kernel length be r; the computation can be denoted F(m, r). Taking F(2,3) as an example, with d denoting the input and g the convolution kernel, the Winograd convolution of F(2,3) can be written as the matrix multiplication:
[ d0  d1  d2 ] [ g0 ]   [ m0 + m1 + m2 ]
[ d1  d2  d3 ] [ g1 ] = [ m1 - m2 - m3 ]
               [ g2 ]
where m0, m1, m2 and m3 are calculated as:
m0 = (d0 - d2) g0
m1 = (d1 + d2)(g0 + g1 + g2) / 2
m2 = (d2 - d1)(g0 - g1 + g2) / 2
m3 = (d1 - d3) g2
Winograd thus needs 4 multiplications for F(2,3), whereas the sliding-window convolution needs 6, so Winograd has the smaller multiplication count. With Y denoting the output matrix, the one-dimensional Winograd formula is:
Y = A^T [(G g) ⊙ (B^T d)]
where G, B and A denote the convolution kernel transform matrix, the input transform matrix and the output transform matrix, respectively.
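The F(2,3) formulas above can be checked numerically with the standard Winograd transform matrices; the following sketch (ours, with arbitrary example data) verifies them against a direct sliding-window computation:

```python
import numpy as np

# Standard F(2,3) Winograd transform matrices (the well-known forms that
# realise the m0..m3 expressions above).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # input segment of length m + r - 1 = 4
g = np.array([0.5, 1.0, -1.0])       # 3-tap convolution kernel

winograd = AT @ ((G @ g) * (BT @ d))         # only 4 multiplications in the ⊙ step
direct = np.array([d[0:3] @ g, d[1:4] @ g])  # sliding-window result
assert np.allclose(winograd, direct)
```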
For two-dimensional convolution, assuming the output size is m × m and the convolution kernel size is r × r, the convolution can be denoted F(m × m, r × r). The two-dimensional Winograd convolution is calculated as:
Y = A^T [(G g G^T) ⊙ (B^T d B)] A
Once m and r are determined, the transform matrices G, B and A can be computed in advance. For F(2 × 2, 3 × 3) the standard transform matrices are:
B^T = [[ 1,  0, -1,  0],
       [ 0,  1,  1,  0],
       [ 0, -1,  1,  0],
       [ 0,  1,  0, -1]]
G   = [[  1,    0,   0],
       [ 1/2,  1/2, 1/2],
       [ 1/2, -1/2, 1/2],
       [  0,    0,   1]]
A^T = [[ 1, 1,  1,  0],
       [ 0, 1, -1, -1]]
The sliding-window convolution needs m^2 × r^2 multiplications, whereas Winograd needs only (m + r - 1)^2. In a DCNN, m and r of most convolutions are larger than 1, so Winograd uses fewer multiplications than the sliding-window convolution and can therefore accelerate it.
The invention improves Winograd with the fast multiplication; the improved Winograd algorithm is:
Y = A^T fm[(G g G^T), (B^T d B)] A
The Winograd of F(2 × 2, 3 × 3) is used herein. The multiplication coefficients appearing in the kernel transform GgG^T, the input transform B^T dB and the output transform with A involve only values such as ±1/2 and ±1/4 (given as formula images in the original), so these multiplications can be replaced by shifts, whose computation time is smaller than that of a multiplication. The element-wise product (GgG^T) ⊙ (B^T dB), however, consists of general multiplications that must be computed as such, so the fast multiplication is used here to accelerate this step of Winograd, i.e. fm[(GgG^T), (B^T dB)].
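A software sketch of the two-dimensional F(2 × 2, 3 × 3) computation described above, using the standard transform matrices and an element-wise product in place of fm(·,·); the tile and kernel values are arbitrary assumptions:

```python
import numpy as np

BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

d = np.random.rand(4, 4)   # one 4x4 input tile (m + r - 1 = 4)
g = np.random.rand(3, 3)   # 3x3 convolution kernel

U = G @ g @ G.T            # kernel transform  GgG^T
V = BT @ d @ BT.T          # input transform   B^T d B
Y = AT @ (U * V) @ AT.T    # the element-wise product is the step fm(U, V) replaces

# Direct 2x2 sliding-window result on the same tile for comparison.
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(Y, direct)
```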
2. Memory interaction optimization
In order to further optimize the DCNN hardware accelerator, memory interaction is optimized in addition to the computation. The weights of the DCNN and the intermediate results must be computed on chip, but the on-chip cache capacity is fixed and limited, and when the data volume is very large the data can only reside in off-chip memory. To fit the on-chip cache, the whole data set is divided into many data blocks and one block at a time is brought on chip for computation. This causes frequent memory interaction between the inside and outside of the chip, and different interaction schemes directly affect the transmission delay. The general principle of memory interaction optimization is to raise the reuse rate of data as far as possible, which reduces the volume of memory interaction and hence the transmission delay. Based on this principle, the invention proposes a memory interaction optimization strategy: trapezoidal multiplexing.
In current mainstream DCNNs, the sliding step of the convolution window over the feature map is generally smaller than the window size, so adjacent windows overlap, as shown in fig. 1(a), and adjacent data blocks therefore inevitably share data, as shown in fig. 1(b) and fig. 1(c). The complete data is shown in fig. 1(a) and is divided into 4 × 4 data blocks, each shown as a light grey area; the ideal division, with no overlap between adjacent blocks, is shown in fig. 1(b). Because adjacent windows overlap, adjacent data blocks overlap as well: the actual division is shown in fig. 1(c), where left-right and top-bottom adjacent blocks share data, with the shared parts shown in dark grey.
To address this, one-dimensional convolution multiplexing has been proposed. After each data block is brought on chip, the on-chip buffer keeps the part it shares with its left-right neighbour; the next left-right adjacent block then no longer needs to contain that part, and its new data is combined with the stored overlapping data to form the full block. After the block has been computed, the part shared with the next left-right neighbour is stored in turn. Convolution without multiplexing is shown in fig. 2(a) and one-dimensional convolution multiplexing in fig. 2(b). However, one-dimensional convolution multiplexing does not store the data shared by top-bottom neighbours, so the memory interaction volume cannot reach its minimum.
To resolve the shortcomings of one-dimensional convolution multiplexing, the invention proposes a two-dimensional convolution multiplexing scheme for memory interaction. As shown in fig. 3(a), it uses a novel data block division that lets each block keep both its left-right and its top-bottom overlapping data at the same time, so the overlapping data can be reused completely; compared with no multiplexing and with one-dimensional convolution multiplexing, two-dimensional convolution multiplexing minimizes the memory interaction volume. To make window sliding convenient, the data shape is converted from square to triangular, as shown in fig. 3(b); since each data block then becomes a trapezoid (the triangle at the left end can be regarded as a trapezoid whose top side has length 0), the invention calls this two-dimensional convolution multiplexing trapezoidal multiplexing. The computation order of trapezoidal multiplexing is to take each trapezoidal data block from left to right in turn and, after a block has been computed, to store the data it shares with the adjacent block on its right. Within each trapezoidal data block, the window slides from top to bottom and then from left to right, as shown in fig. 3(c).
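To make the reuse argument concrete, the following back-of-the-envelope model (our own illustration, not from the patent) counts the off-chip transfer volume for no multiplexing, one-dimensional multiplexing, and full two-dimensional reuse; the feature-map size, tile footprint, window size and stride are all assumed values:

```python
import math

# Assumed dimensions: H x W feature map, T x T on-chip tile footprint,
# window size k and stride s, so adjacent tiles overlap by k - s values.
H, W = 224, 224
T, k, s = 32, 3, 1
overlap = k - s
step = T - overlap                        # new data contributed by each tile
tiles_x = math.ceil((W - overlap) / step)
tiles_y = math.ceil((H - overlap) / step)

# No reuse: every tile is transferred in full, overlaps included.
no_reuse = tiles_x * tiles_y * T * T
# One-dimensional reuse: the left-right overlap stays on chip, so each tile
# after the first in a row only brings its new columns; rows are not reused.
reuse_1d = tiles_y * (T * T + (tiles_x - 1) * step * T)
# Two-dimensional (trapezoid-style) reuse: every input value crosses the
# chip boundary once, which is the lower bound the ladder scheme aims for.
reuse_2d = H * W

print(no_reuse, reuse_1d, reuse_2d)   # with these assumptions: 65536 > 61952 > 50176
```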
Since the data blocks are divided into trapezoids, data cannot be fetched with the normal convolution window directly. Therefore, when the convolution window slides over the trapezoidal data, the data must undergo normal conversion, that is, the sliding window on the trapezoidal data is converted back to a normal convolution window according to the window correspondence rule, so that the convolution can be computed correctly. The window correspondence rule covers three cases, applied respectively to the three parts of the trapezoidal data: the front part, the corner and the rear part, as shown in fig. 4.
The three columns of the trapezoid are represented by three vectors x, y and z; the total number of sliding windows is n and the sliding-window position is p (0 ≤ p ≤ n - 1, p an integer). When the total number of sliding windows n is greater than or equal to 5, the values in the window fall into three cases according to p, corresponding respectively to the front, corner and rear parts of the trapezoidal data.
(1) When p lies in the first range [condition given as a formula image in the original], the sliding window is in the front part of the trapezoid, and the window values are as shown in fig. 5(a).
(2) When p lies in the second range [formula image], the sliding window is at the corner of the trapezoid; three window forms occur [formula images], whose values are shown in fig. 5(c), fig. 5(d) and fig. 5(e) respectively.
(3) When p lies in the third range [formula image], the sliding window is in the rear part of the trapezoid, and the window values are as shown in fig. 5(b).
The trapezoidal multiplexing proposed by the invention can be used directly for conventional convolution and can also be adapted to special convolutions by adjustment. To combine trapezoidal multiplexing with the Winograd convolution, the scheme is adjusted: since the invention uses the Winograd of F(2 × 2, 3 × 3), the input tile is 4 × 4 with a step of 2, and the output of the trapezoidal multiplexing corresponds to the Winograd input, so the data width of the normal conversion is changed from 3 to 4, as shown in fig. 6.
3. Computing optimized contrasts
3.1 acceleration effects of fast multiplication
Since the weights of the DCNN are quantized to 8 bits herein, 8-bit multiplication requires the addition of 8 partial products. To speed up the computation of the addition, a parallel addition tree is used herein to accomplish the addition of the partial products, as shown in FIG. 7.
The 8 numbers to be added are denoted add[0], add[1], ..., add[7]; the partial sums produced by the first round of pairwise additions are denoted P_sum0[0], P_sum0[1], P_sum0[2] and P_sum0[3], those produced by the second round are P_sum1[0] and P_sum1[1], and a final addition yields the result F_sum.
On an FPGA each operation takes one clock cycle, so adding 8 numbers with the parallel addition tree requires 3 clock cycles for the three rounds of additions. When only 4 numbers or 2 numbers need to be added, the tree needs only two rounds or one round respectively, reducing the computation time to 2 clock cycles or 1 clock cycle. The fast multiplication proposed herein reduces the number of partial products to be added, and whenever no more than 4 partial products remain, the clock cycles spent on the partial-product addition drop accordingly. The clock cycles for adding different numbers of partial products are listed in Table 1; the expected value (taken over the possible 8-bit multiplier values) is 2.18 cycles, and compared with a conventional multiplier, which needs 3 cycles to add all partial products, the addition time of the fast multiplier is reduced by 28.3%, so the proposed fast multiplication accelerates the multiplication. Radix-4 Booth-encoded multiplication shows a clear speed-up for high-precision data such as 16-bit or 32-bit operands, but for 8-bit data its effect is modest, reducing the multiplication time by only about 20%. Since the data computed here are quantized to 8 bits, the fast multiplication proposed herein is chosen to accelerate the DCNN.
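A small sketch of the pairwise addition tree of Fig. 7, counting one clock cycle per level of additions; it reproduces the 3-, 2- and 1-cycle cases mentioned above (a software illustration only, not the hardware description):

```python
def addition_tree(values):
    """Sum a list pairwise, level by level, like the tree in Fig. 7.
    Returns the sum and the number of levels, i.e. clock cycles when each
    level of parallel pairwise additions finishes in one FPGA cycle."""
    cycles = 0
    while len(values) > 1:
        nxt = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # an odd element is carried to the next level
            nxt.append(values[-1])
        values = nxt
        cycles += 1
    return (values[0] if values else 0), cycles

assert addition_tree([1] * 8) == (8, 3)   # 8 partial products -> 3 cycles
assert addition_tree([1] * 4) == (4, 2)   # 4 partial products -> 2 cycles
assert addition_tree([1] * 2) == (2, 1)   # 2 partial products -> 1 cycle
```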
3.2 improved Winograd
To verify the performance improvement of the improved Winograd algorithm, software simulation is carried out before porting the algorithm to hardware. An NVIDIA GTX 950 GPU was used for the tests, which covered the conventional convolution, the Winograd convolution and our improved Winograd convolution. For F(2 × 2, 3 × 3), the sliding-window computation of the conventional convolution contains 36 multiplications and 32 additions, the Winograd convolution contains 16 multiplications and 128 additions, and the improved Winograd convolution combines Winograd with the fast multiplication to reduce the computation time further.
The network used for the simulation was VGG-16, and the mean times measured for the convolution F(2 × 2, 3 × 3) are as follows:
the conventional convolution takes 34.1 microseconds;
the Winograd convolution takes 11.3 microseconds;
the improved Winograd convolution takes 8.6 microseconds.
From the data, the computation speed of the improved Winograd convolution is 3.96 times that of the conventional convolution and 1.34 times that of the Winograd convolution. The performance of the improved Winograd convolution algorithm proposed herein is improved.
To highlight the performance of the improved Winograd algorithm, different acceleration strategies were tested on the convolutional layers of VGG-16. A convolution acceleration algorithm reduces the computation delay; to compare the effects of different algorithms, three schemes were tested: no optimization, the traditional Winograd algorithm, and our improved Winograd algorithm. The computation delays of the three schemes on each convolutional layer of VGG-16 are shown in fig. 8: the unoptimized network has the highest delay, the traditional Winograd algorithm comes next, and the improved Winograd algorithm has the lowest. The delay without optimization is three times that of the traditional Winograd algorithm, and the delay of the traditional Winograd algorithm is 1.2 times that of the improved one. By combining the fast multiplication with Winograd, the improved Winograd algorithm we propose achieves the best performance.
3.3 memory interaction optimization contrast
The transmission delays of the three memory interaction optimization strategies on each convolutional layer of VGG-16 are shown in fig. 9: the unoptimized network has the highest delay, one-dimensional convolution multiplexing comes next, and trapezoidal multiplexing has the lowest. Because trapezoidal multiplexing reduces the on/off-chip memory interaction volume the most, its transmission delay is the lowest.
TABLE 1. Clock cycles needed to add different numbers of partial products
[table given as an image in the original; per the text, adding 2, 4 and 8 partial products takes 1, 2 and 3 clock cycles respectively, with an expected value of 2.18 cycles]
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the invention and are not limiting. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (3)

1. An accelerator of a deep convolutional neural network based on an FPGA, the accelerator comprising: a fast convolution operation module and a two-dimensional convolution memory interaction module;
the fast convolution operation module is used for combining a lightweight fast multiplication with the Winograd algorithm to realize the fast convolution operation of the deep convolutional neural network;
the two-dimensional convolution memory interaction module is used for transferring the intermediate calculation results and weights produced during the fast convolution operation between off-chip memory and on-chip cache using the trapezoidal-multiplexing memory interaction optimization strategy, so that the data interaction between the inside and outside of the chip is minimized;
the specific implementation process of the two-dimensional convolution memory interaction module comprises the following steps:
the input feature map and the weights of the deep convolutional neural network are divided into trapezoidal data blocks; when the convolution window slides over the trapezoidal data, the trapezoidal data must undergo normal conversion, that is, a sliding window on the trapezoidal data is converted back to a normal convolution window according to the window correspondence rule;
the computation order of the trapezoidal multiplexing is to take each trapezoidal data block from left to right in turn and, after a trapezoidal block has been computed, to store the data it shares with the adjacent block on its right; within each trapezoidal data block, the window slides from top to bottom and then from left to right;
the trapezoidal data is divided into three parts: a front part, a corner and a rear part; the three columns of the trapezoid are represented by three vectors x, y and z; the total number of sliding windows is n, the sliding-window position is p, 0 ≤ p ≤ n - 1, and p is an integer; when n ≥ 5, the values in the window fall into the following three cases according to p, corresponding respectively to the front part, the corner and the rear part of the trapezoidal data:
(1) when p lies in the first range [condition given as a formula image in the original], the sliding window is in the front part of the trapezoid;
(2) when p lies in the second range [formula image], the sliding window is at the corner of the trapezoid, and there are three window forms [formula images];
(3) when p lies in the third range [formula image], the sliding window is in the rear part of the trapezoid.
2. The accelerator of the deep convolutional neural network based on the FPGA of claim 1, wherein the fast convolution operation module is implemented by:
for two-dimensional convolution, assuming the output size is m × m and the convolution kernel size is r × r, the convolution can be denoted F(m × m, r × r); the two-dimensional Winograd convolution is calculated as:
Y = A^T [(G g G^T) ⊙ (B^T d B)] A
wherein G, B and A respectively denote the convolution kernel transform matrix, the input transform matrix and the output transform matrix, and once m and r are determined the transform matrices G, B and A can be computed in advance; d denotes the input and g the convolution kernel; B^T, A^T and G^T denote the transposes of B, A and G;
Winograd is improved with the fast multiplication, and the improved formula is:
Y = A^T fm[(G g G^T), (B^T d B)] A
wherein fm(X, Y) denotes the fast multiplication of the two matrices X and Y, in which the elements at the same positions of X and Y are multiplied by the fast multiplication.
3. The accelerator of the deep convolutional neural network based on the FPGA of claim 1, wherein the normal conversion of the trapezoidal data is specifically: the data width is adjusted to the input length of the two-dimensional convolution F(m × m, r × r).
CN202110249630.2A 2021-03-08 2021-03-08 Deep convolutional neural network accelerator based on FPGA Expired - Fee Related CN112949845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110249630.2A CN112949845B (en) 2021-03-08 2021-03-08 Deep convolutional neural network accelerator based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110249630.2A CN112949845B (en) 2021-03-08 2021-03-08 Deep convolutional neural network accelerator based on FPGA

Publications (2)

Publication Number Publication Date
CN112949845A CN112949845A (en) 2021-06-11
CN112949845B true CN112949845B (en) 2022-08-09

Family

ID=76229595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110249630.2A Expired - Fee Related CN112949845B (en) 2021-03-08 2021-03-08 Deep convolutional neural network accelerator based on FPGA

Country Status (1)

Country Link
CN (1) CN112949845B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399036B (en) * 2022-01-12 2023-08-22 电子科技大学 Efficient convolution calculation unit based on one-dimensional Winograd algorithm
CN115329951B (en) * 2022-09-13 2023-09-15 北京工商大学 FPGA architecture for convolutional neural network fast convolutional operation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102062614A (en) * 2009-11-14 2011-05-18 安华高科技Ecbuip(新加坡)私人有限公司 High resolution optical encoder systems, devices and methods

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229645B (en) * 2017-04-28 2021-08-06 北京市商汤科技开发有限公司 Convolution acceleration and calculation processing method and device, electronic equipment and storage medium
US10990648B2 (en) * 2017-08-07 2021-04-27 Intel Corporation System and method for an optimized winograd convolution accelerator
US10387122B1 (en) * 2018-05-04 2019-08-20 Olsen Ip Reserve, Llc Residue number matrix multiplier
CN109086867B (en) * 2018-07-02 2021-06-08 武汉魅瞳科技有限公司 Convolutional neural network acceleration system based on FPGA
US11699070B2 (en) * 2019-03-05 2023-07-11 Samsung Electronics Co., Ltd Method and apparatus for providing rotational invariant neural networks
CN110288086B (en) * 2019-06-13 2023-07-21 天津大学 Winograd-based configurable convolution array accelerator structure
CN110807513A (en) * 2019-10-23 2020-02-18 中国人民解放军国防科技大学 Convolutional neural network accelerator based on Winograd sparse algorithm

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102062614A (en) * 2009-11-14 2011-05-18 安华高科技Ecbuip(新加坡)私人有限公司 High resolution optical encoder systems, devices and methods

Also Published As

Publication number Publication date
CN112949845A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
US11449729B2 (en) Efficient convolutional neural networks
CN112949845B (en) Deep convolutional neural network accelerator based on FPGA
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN1153346C (en) Flow line type parallel-to-serial frame minimum mean square self-adaption filter and method for mfg. same
US20210357735A1 (en) Split accumulator for convolutional neural network accelerator
CN109948774A (en) Neural network accelerator and its implementation based on network layer binding operation
CN111832719A (en) Fixed point quantization convolution neural network accelerator calculation circuit
CN112596701B (en) FPGA acceleration realization method based on unilateral Jacobian singular value decomposition
CN112508125A (en) Efficient full-integer quantization method of image detection model
US20220164663A1 (en) Activation Compression Method for Deep Learning Acceleration
CN113947200B (en) Acceleration calculation method of neural network, accelerator and computer-readable storage medium
CN110110852B (en) Method for transplanting deep learning network to FPAG platform
CN113313252B (en) Depth separable convolution implementation method based on pulse array
CN114418057A (en) Operation method of convolutional neural network and related equipment
Bao et al. LSFQ: A low precision full integer quantization for high-performance FPGA-based CNN acceleration
CN112836823B (en) Convolutional neural network back propagation mapping method based on cyclic recombination and blocking
CN109993293A (en) A kind of deep learning accelerator suitable for stack hourglass network
CN114022371B (en) Defogging device and defogging method based on space and channel attention residual error network
CN114021070A (en) Deep convolution calculation method and system based on micro-architecture processor
CN115357214A (en) Operation unit compatible with asymmetric multi-precision mixed multiply-accumulate operation
CN115965052A (en) Convolutional neural network hardware accelerator and acceleration method
KR20150050680A (en) Device and method for discrete cosine transform
CN116596034A (en) Three-dimensional convolutional neural network accelerator and method on complex domain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220809