CN112949845B - Deep convolutional neural network accelerator based on FPGA - Google Patents

Deep convolutional neural network accelerator based on FPGA

Info

Publication number
CN112949845B
CN112949845B (application CN202110249630.2A)
Authority
CN
China
Prior art keywords
convolution
data
fast
winograd
trapezoidal
Prior art date
Legal status
Expired - Fee Related
Application number
CN202110249630.2A
Other languages
Chinese (zh)
Other versions
CN112949845A (en)
Inventor
黄威
孙锴
李锦
段昊东
Current Assignee
Inner Mongolia University
Original Assignee
Inner Mongolia University
Priority date
Filing date
Publication date
Application filed by Inner Mongolia University filed Critical Inner Mongolia University
Priority to CN202110249630.2A
Publication of CN112949845A
Application granted
Publication of CN112949845B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/94 Hardware or software architectures specially adapted for image or video understanding
    • G06V10/955 Hardware or software architectures specially adapted for image or video understanding using specific electronic processors
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an FPGA-based deep convolutional neural network accelerator comprising a fast convolution operation module and a two-dimensional convolution memory interaction module. The fast convolution operation module combines a lightweight fast multiplication with the Winograd algorithm to realize fast convolution for the deep convolutional neural network. The two-dimensional convolution memory interaction module transfers the intermediate calculation results and weights produced during the fast convolution between off-chip memory and on-chip cache using a trapezoidal-multiplexing memory interaction optimization strategy, minimizing the data exchanged between the inside and outside of the chip. On the one hand, the accelerator improves the Winograd convolution algorithm with fast multiplication, further increasing the speed at which Winograd computes convolutions; on the other hand, the proposed trapezoidal-multiplexing memory interaction optimization strategy reduces the latency of memory interaction between the inside and outside of the chip.

Description

Deep convolutional neural network accelerator based on FPGA
Technical Field
The invention relates to the fields of deep learning and FPGAs, and in particular to an FPGA-based deep convolutional neural network accelerator.
Background
With the popularization of artificial intelligence, the demand for intelligent devices that make daily life more convenient keeps growing. The deep convolutional neural network (DCNN) is the most important algorithm in image processing, with applications including image recognition and object detection. However, a DCNN involves a very large amount of computation, so running it on a GPU incurs high latency and power consumption, which makes DCNNs difficult to apply in scenarios with strict real-time requirements such as robots and autonomous vehicles.
To improve the real-time performance of DCNNs, FPGAs are used to accelerate them because of their fast computation, low power consumption, and reprogrammability. Accelerating a DCNN on an FPGA is nonetheless difficult: a DCNN requires an enormous amount of computation and a large number of weights, while FPGA resources are limited. The dominant operation in a DCNN is convolution, which is implemented on an FPGA with multipliers and adders, so optimizing the convolution operation speeds up the DCNN. In addition, because the DCNN contains many weights and the on-chip cache capacity of the FPGA is limited, intermediate calculation results and weights must be stored in off-chip memory, and fetching a batch of data onto the chip for every computation generates a large number of memory interactions, so optimizing the memory interaction also accelerates the DCNN. FPGA acceleration of DCNNs is therefore carried out mainly along two lines: computation acceleration and memory interaction optimization.
Computation acceleration mainly means accelerating the convolution operation; the general principle is to reduce the computational complexity and the amount of computation. Commonly used convolution acceleration algorithms are the FFT, depthwise separable convolution, and Winograd. The fast Fourier transform (FFT) converts the input feature map and the convolution kernel from the time domain to the frequency domain, where only multiplication of the two is needed, and then transforms the product back to the time domain. However, the FFT must also account for the cost of the conversions between the time and frequency domains: when the input feature map and the convolution kernel are of similar size, the FFT accelerates the convolution overall, but when the convolution kernel is small it is difficult for the FFT to provide an overall speed-up. Convolution kernels in current DCNN models keep shrinking, from the 11 × 11 kernels in AlexNet to the 3 × 3 kernels in VGG, which are much smaller than the input feature map, so accelerating current DCNNs with the FFT is not a good option. Depthwise separable convolution decomposes the standard convolution into a depthwise convolution and a pointwise convolution, which markedly reduces the amount of computation and the number of parameters, and its highly structured form is well suited to FPGA implementation; however, it loses some accuracy and is therefore unsuitable for high-accuracy applications. Winograd, like the FFT, converts the computation to another domain, reducing the number of multiplications at the cost of slightly more additions. On an FPGA a multiplication takes longer than an addition, so Winograd accelerates convolution overall. Unlike the FFT, which suits large kernels, Winograd is mainly used to accelerate convolutions with small kernels, matching the small kernels that dominate current DCNN models.
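As a point of reference only (not part of the patent), the frequency-domain route described above can be sketched in a few lines of Python with NumPy; the array sizes are arbitrary assumptions:

```python
import numpy as np

# Direct 1-D convolution versus FFT-based convolution: transform input and
# kernel, multiply in the frequency domain, transform back. The two transforms
# are the extra cost that makes the FFT unattractive for small kernels.
x = np.random.rand(224)   # one row of an input feature map (size assumed)
k = np.random.rand(3)     # small 3-tap kernel, as in VGG-style networks

n = len(x) + len(k) - 1   # length needed to avoid circular wrap-around
fft_conv = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

assert np.allclose(fft_conv, np.convolve(x, k))
```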
Since Winograd still contains multiplications, combining a fast multiplication algorithm with the traditional Winograd algorithm can further accelerate the convolution. Existing fast multiplication algorithms include Radix-4 Booth-encoded multiplication and Wallace-tree multiplication. The principle of multiplication on an FPGA is that the multiplicand is shifted according to each bit of the multiplier to generate partial products, which are then summed to produce the final result. Radix-4 Booth-encoded multiplication recodes the multiplier in overlapping groups of three bits; each group then shifts the multiplicand to produce a partial product, and the partial products are summed. Encoding the multiplier in this way reduces the number of partial products and thereby speeds up the multiplication. Wallace-tree multiplication mainly exploits the 3-to-2 compression of a full adder to increase parallelism and raise the operation speed. However, both Radix-4 Booth-encoded multiplication and Wallace-tree multiplication are complex to implement on an FPGA and are inconvenient to combine with Winograd.
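For illustration, a minimal Python sketch of the Radix-4 Booth recoding described above (our own reading, not code from the patent); it assumes unsigned 8-bit operands:

```python
import random

def booth_radix4_mul(x: int, y: int, bits: int = 8) -> int:
    """Multiply x by an unsigned `bits`-wide y with Radix-4 Booth recoding."""
    y_ext = y << 1                      # implicit y[-1] = 0 below the LSB
    result = 0
    for i in range(bits // 2 + 1):      # one extra group absorbs an unsigned MSB
        group = (y_ext >> (2 * i)) & 0b111               # bits y[2i+1], y[2i], y[2i-1]
        b2, b1, b0 = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digit = -2 * b2 + b1 + b0                        # recoded digit in {-2, ..., 2}
        if digit:                                        # only non-zero partial products
            result += (digit * x) << (2 * i)
    return result

for _ in range(1000):
    a, b = random.randrange(256), random.randrange(256)
    assert booth_radix4_mul(a, b) == a * b
```

With roughly half as many partial products as a plain shift-and-add multiplier, the adder stage shortens, which is the benefit the text attributes to Booth encoding.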
Memory interaction optimization mainly reduces the data exchanged between off-chip memory and the on-chip cache. The large numbers of feature maps and weights reside in off-chip memory; because the on-chip cache capacity is limited, the off-chip data must be divided into data blocks, and one block at a time is transferred onto the chip for computation. In existing DCNN models the stride of the convolution window is smaller than the window size, so adjacent windows share data, and consequently adjacent data blocks overlap, both left-right and top-bottom. Without optimization, this overlapping data is transferred repeatedly and increases the on/off-chip data traffic. One-dimensional convolution multiplexing has therefore been proposed: it reuses the data shared by left-right adjacent blocks and reduces the transfer volume, but because on-chip resources are limited it cannot simultaneously reuse the data shared by top-bottom adjacent blocks, so it cannot minimize the data interaction between the inside and outside of the chip.
Disclosure of Invention
The invention aims to overcome the above technical shortcomings and provides an FPGA-based deep convolutional neural network accelerator that accelerates along the two lines of computation acceleration and memory interaction optimization. For computation acceleration, the invention chooses Winograd to optimize the convolution, because Winograd handles convolutions with small kernels at reasonable computational complexity. To further improve Winograd's performance, the invention proposes a lightweight fast multiplication and combines it with the traditional Winograd algorithm, so that the improved Winograd computes convolutions faster. For memory interaction optimization, since one-dimensional convolution multiplexing cannot minimize the on/off-chip data interaction, the invention proposes a memory interaction optimization strategy named trapezoidal multiplexing, which reuses both the left-right and the top-bottom adjacent data at the same time and can minimize the data interaction between the inside and outside of the chip.
To achieve the above object, the invention provides an FPGA-based deep convolutional neural network accelerator comprising a fast convolution operation module and a two-dimensional convolution memory interaction module;
the fast convolution operation module is used for combining a lightweight fast multiplication with the Winograd algorithm to realize the fast convolution operation of the deep convolutional neural network;
and the two-dimensional convolution memory interaction module is used for transferring the intermediate calculation results and weights produced during the fast convolution operation between off-chip memory and on-chip cache using the trapezoidal-multiplexing memory interaction optimization strategy, so that the data interaction between the inside and outside of the chip is minimized.
As an improvement of the above system, a specific implementation process of the fast convolution operation module includes:
for two-dimensional convolution, assuming the output size is m × m and the convolution kernel size is r × r, the convolution can be denoted F(m × m, r × r); the two-dimensional Winograd convolution is calculated as:
Y = A^T [(G g G^T) ⊙ (B^T d B)] A
where G, B and A respectively denote the convolution kernel transform matrix, the input transform matrix and the output transform matrix, all of which can be computed in advance once m and r are determined; d denotes the input and g the convolution kernel; B^T, A^T and G^T denote the transposes of B, A and G;
Winograd is improved with the fast multiplication, and the improved formula is:
Y = A^T fm[(G g G^T), (B^T d B)] A
where fm(X, Y) denotes the fast multiplication of the two matrices X and Y, in which the elements at the same positions of X and Y are multiplied by the fast multiplication.
As an improvement of the above system, a specific implementation process of the two-dimensional convolution memory interaction module includes:
the input feature map and the weights of the deep convolutional neural network are divided into trapezoidal data blocks; when the convolution window slides over the trapezoidal data, the trapezoidal data must undergo normal conversion, that is, a sliding window on the trapezoidal data is converted back to a normal convolution window according to the window correspondence rule;
the computation order of the trapezoidal multiplexing is to take each trapezoidal data block from left to right in turn and, after a trapezoidal block has been computed, to store the data it shares with the adjacent block on its right; within each trapezoidal data block, the window slides from top to bottom and then from left to right;
the trapezoidal data is divided into three parts: a front part, a corner and a rear part; the three columns of the trapezoid are represented by three vectors x, y and z; the total number of sliding windows is n, the sliding-window position is p, 0 ≤ p ≤ n - 1, and p is an integer; when n ≥ 5, the values in the window fall into the following three cases according to p, corresponding respectively to the front part, the corner and the rear part of the trapezoidal data:
(1) when p lies in the first range [condition given as a formula image in the original], the sliding window is in the front part of the trapezoid;
(2) when p lies in the second range [formula image], the sliding window is at the corner of the trapezoid, and there are three window forms [formula images];
(3) when p lies in the third range [formula image], the sliding window is in the rear part of the trapezoid.
As an improvement of the above system, the normal conversion of the trapezoidal data is specifically: the data width is adjusted to the input length of the two-dimensional convolution F(m × m, r × r).
The invention has the advantages that:
on the one hand, the accelerator improves the Winograd convolution algorithm with fast multiplication, further increasing the speed at which Winograd computes convolutions; on the other hand, a trapezoidal-multiplexing memory interaction optimization strategy is provided, which reduces the latency of memory interaction between the inside and outside of the chip.
Drawings
FIG. 1(a) is a schematic diagram of a plurality of input feature maps;
FIG. 1(b) shows an ideal data block division mode, in which adjacent data blocks are not overlapped;
FIG. 1(c) is a schematic diagram of division without convolution multiplexing;
FIG. 1(d) is a schematic diagram of one-dimensional convolution multiplexing;
FIG. 2(a) is a schematic diagram of convolution-free multiplexing;
FIG. 2(b) is a schematic diagram of one-dimensional convolution multiplexing;
FIG. 3 is a schematic illustration of ladder multiplexing;
FIG. 4 is a schematic diagram of a data flow for a normal conversion;
FIG. 5 is a schematic diagram of three cases of normal switching;
FIG. 6 is a schematic diagram of the normal conversion after adaptation to Winograd;
FIG. 7 is a flow chart of the computation of a parallel addition tree;
FIG. 8 is a schematic diagram of the calculated time delay for three optimization strategies;
fig. 9 is a schematic diagram of transmission delay of three optimization strategies.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings.
The invention provides an accelerator of a deep convolutional neural network based on an FPGA (field programmable gate array), which adopts the following technical scheme:
1. compute acceleration
1.1 fast multiplication
The working principle of a multiplier is that the multiplicand is shifted to generate partial products, which are then accumulated to obtain the multiplication result. A fast multiplication is proposed herein that speeds up multiplication by reducing the number of partial products: the algorithm computes only the non-zero partial products and then adds them together. Reducing the number of additions increases the computation speed of the multiplier.
The expression for fast multiplication is shown below:
P=fm(X,Y)
wherein X and Y represent the two multiplication matrices of the fast multiplication; the elements at the same positions of X and Y are multiplied by the fast multiplication, and the multiplication result P is obtained once all positions have been computed.
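A minimal software sketch of how fm(X, Y) might behave, assuming unsigned 8-bit operands and reading the description above literally (only the non-zero partial products of each element-wise product are generated and summed); the function names are our own and the patent's hardware details are not reproduced:

```python
import numpy as np

def fast_mul_scalar(x: int, y: int, bits: int = 8) -> int:
    """Multiply x by an unsigned `bits`-wide y using only the non-zero
    partial products (shifted copies of x); in hardware the summation
    would be carried out by a parallel addition tree."""
    partials = [x << i for i in range(bits) if (y >> i) & 1]   # skip zero rows
    return sum(partials)

def fm(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Element-wise 'fast multiplication' of two matrices, as fm(X, Y) above."""
    out = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        out[idx] = fast_mul_scalar(int(X[idx]), int(Y[idx]))
    return out

X = np.random.randint(0, 256, (4, 4))
Y = np.random.randint(0, 256, (4, 4))
assert np.array_equal(fm(X, Y), X * Y)   # same result as ordinary multiplication
```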
1.2, improving Winograd
Winograd is a fast convolution algorithm. On an FPGA, multiplication is slower than addition, and Winograd increases the speed of convolution by reducing the number of multiplications. For a one-dimensional convolution, let the output length be m and the convolution kernel length be r; the computation can be denoted F(m, r). Taking F(2,3) as an example, with d denoting the input and g the convolution kernel, the Winograd convolution of F(2,3) can be written as the matrix multiplication:
[ d0  d1  d2 ] [ g0 ]   [ m0 + m1 + m2 ]
[ d1  d2  d3 ] [ g1 ] = [ m1 - m2 - m3 ]
               [ g2 ]
where m0, m1, m2 and m3 are calculated as:
m0 = (d0 - d2) g0
m1 = (d1 + d2)(g0 + g1 + g2) / 2
m2 = (d2 - d1)(g0 - g1 + g2) / 2
m3 = (d1 - d3) g2
Winograd thus needs 4 multiplications for F(2,3), whereas the sliding-window convolution needs 6, so Winograd has the smaller multiplication count. With Y denoting the output matrix, the one-dimensional Winograd formula is:
Y = A^T [(G g) ⊙ (B^T d)]
where G, B and A denote the convolution kernel transform matrix, the input transform matrix and the output transform matrix, respectively.
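The F(2,3) formulas above can be checked numerically with the standard Winograd transform matrices; the following sketch (ours, with arbitrary example data) verifies them against a direct sliding-window computation:

```python
import numpy as np

# Standard F(2,3) Winograd transform matrices (the well-known forms that
# realise the m0..m3 expressions above).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # input segment of length m + r - 1 = 4
g = np.array([0.5, 1.0, -1.0])       # 3-tap convolution kernel

winograd = AT @ ((G @ g) * (BT @ d))         # only 4 multiplications in the ⊙ step
direct = np.array([d[0:3] @ g, d[1:4] @ g])  # sliding-window result
assert np.allclose(winograd, direct)
```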
For two-dimensional convolution, assuming the output size is m × m and the convolution kernel size is r × r, the convolution can be denoted F(m × m, r × r). The two-dimensional Winograd convolution is calculated as:
Y = A^T [(G g G^T) ⊙ (B^T d B)] A
Once m and r are determined, the transform matrices G, B and A can be computed in advance. For F(2 × 2, 3 × 3) the standard transform matrices are:
B^T = [[ 1,  0, -1,  0],
       [ 0,  1,  1,  0],
       [ 0, -1,  1,  0],
       [ 0,  1,  0, -1]]
G   = [[  1,    0,   0],
       [ 1/2,  1/2, 1/2],
       [ 1/2, -1/2, 1/2],
       [  0,    0,   1]]
A^T = [[ 1, 1,  1,  0],
       [ 0, 1, -1, -1]]
The sliding-window convolution needs m^2 × r^2 multiplications, whereas Winograd needs only (m + r - 1)^2. In a DCNN, m and r of most convolutions are larger than 1, so Winograd uses fewer multiplications than the sliding-window convolution and can therefore accelerate it.
The invention improves Winograd with the fast multiplication; the improved Winograd algorithm is:
Y = A^T fm[(G g G^T), (B^T d B)] A
The Winograd of F(2 × 2, 3 × 3) is used herein. The multiplication coefficients appearing in the kernel transform GgG^T, the input transform B^T dB and the output transform with A involve only values such as ±1/2 and ±1/4 (given as formula images in the original), so these multiplications can be replaced by shifts, whose computation time is smaller than that of a multiplication. The element-wise product (GgG^T) ⊙ (B^T dB), however, consists of general multiplications that must be computed as such, so the fast multiplication is used here to accelerate this step of Winograd, i.e. fm[(GgG^T), (B^T dB)].
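A software sketch of the two-dimensional F(2 × 2, 3 × 3) computation described above, using the standard transform matrices and an element-wise product in place of fm(·,·); the tile and kernel values are arbitrary assumptions:

```python
import numpy as np

BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

d = np.random.rand(4, 4)   # one 4x4 input tile (m + r - 1 = 4)
g = np.random.rand(3, 3)   # 3x3 convolution kernel

U = G @ g @ G.T            # kernel transform  GgG^T
V = BT @ d @ BT.T          # input transform   B^T d B
Y = AT @ (U * V) @ AT.T    # the element-wise product is the step fm(U, V) replaces

# Direct 2x2 sliding-window result on the same tile for comparison.
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(Y, direct)
```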
2. Memory interaction optimization
In order to further optimize the DCNN hardware accelerator, memory interaction is optimized in addition to the computation. The weights of the DCNN and the intermediate results must be computed on chip, but the on-chip cache capacity is fixed and limited, and when the data volume is very large the data can only reside in off-chip memory. To fit the on-chip cache, the whole data set is divided into many data blocks and one block at a time is brought on chip for computation. This causes frequent memory interaction between the inside and outside of the chip, and different interaction schemes directly affect the transmission delay. The general principle of memory interaction optimization is to raise the reuse rate of data as far as possible, which reduces the volume of memory interaction and hence the transmission delay. Based on this principle, the invention proposes a memory interaction optimization strategy: trapezoidal multiplexing.
In current mainstream DCNNs, the sliding step of the convolution window over the feature map is generally smaller than the window size, so adjacent windows overlap, as shown in fig. 1(a), and adjacent data blocks therefore inevitably share data, as shown in fig. 1(b) and fig. 1(c). The complete data is shown in fig. 1(a) and is divided into 4 × 4 data blocks, each shown as a light grey area; the ideal division, with no overlap between adjacent blocks, is shown in fig. 1(b). Because adjacent windows overlap, adjacent data blocks overlap as well: the actual division is shown in fig. 1(c), where left-right and top-bottom adjacent blocks share data, with the shared parts shown in dark grey.
To address this, one-dimensional convolution multiplexing has been proposed. After each data block is brought on chip, the on-chip buffer keeps the part it shares with its left-right neighbour; the next left-right adjacent block then no longer needs to contain that part, and its new data is combined with the stored overlapping data to form the full block. After the block has been computed, the part shared with the next left-right neighbour is stored in turn. Convolution without multiplexing is shown in fig. 2(a) and one-dimensional convolution multiplexing in fig. 2(b). However, one-dimensional convolution multiplexing does not store the data shared by top-bottom neighbours, so the memory interaction volume cannot reach its minimum.
To resolve the shortcomings of one-dimensional convolution multiplexing, the invention proposes a two-dimensional convolution multiplexing scheme for memory interaction. As shown in fig. 3(a), it uses a novel data block division that lets each block keep both its left-right and its top-bottom overlapping data at the same time, so the overlapping data can be reused completely; compared with no multiplexing and with one-dimensional convolution multiplexing, two-dimensional convolution multiplexing minimizes the memory interaction volume. To make window sliding convenient, the data shape is converted from square to triangular, as shown in fig. 3(b); since each data block then becomes a trapezoid (the triangle at the left end can be regarded as a trapezoid whose top side has length 0), the invention calls this two-dimensional convolution multiplexing trapezoidal multiplexing. The computation order of trapezoidal multiplexing is to take each trapezoidal data block from left to right in turn and, after a block has been computed, to store the data it shares with the adjacent block on its right. Within each trapezoidal data block, the window slides from top to bottom and then from left to right, as shown in fig. 3(c).
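To make the reuse argument concrete, the following back-of-the-envelope model (our own illustration, not from the patent) counts the off-chip transfer volume for no multiplexing, one-dimensional multiplexing, and full two-dimensional reuse; the feature-map size, tile footprint, window size and stride are all assumed values:

```python
import math

# Assumed dimensions: H x W feature map, T x T on-chip tile footprint,
# window size k and stride s, so adjacent tiles overlap by k - s values.
H, W = 224, 224
T, k, s = 32, 3, 1
overlap = k - s
step = T - overlap                        # new data contributed by each tile
tiles_x = math.ceil((W - overlap) / step)
tiles_y = math.ceil((H - overlap) / step)

# No reuse: every tile is transferred in full, overlaps included.
no_reuse = tiles_x * tiles_y * T * T
# One-dimensional reuse: the left-right overlap stays on chip, so each tile
# after the first in a row only brings its new columns; rows are not reused.
reuse_1d = tiles_y * (T * T + (tiles_x - 1) * step * T)
# Two-dimensional (trapezoid-style) reuse: every input value crosses the
# chip boundary once, which is the lower bound the ladder scheme aims for.
reuse_2d = H * W

print(no_reuse, reuse_1d, reuse_2d)   # with these assumptions: 65536 > 61952 > 50176
```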
Since the data blocks are divided into trapezoids, data cannot be fetched with the normal convolution window directly. Therefore, when the convolution window slides over the trapezoidal data, the data must undergo normal conversion, that is, the sliding window on the trapezoidal data is converted back to a normal convolution window according to the window correspondence rule, so that the convolution can be computed correctly. The window correspondence rule covers three cases, applied respectively to the three parts of the trapezoidal data: the front part, the corner and the rear part, as shown in fig. 4.
The three columns of the trapezoid are represented by three vectors x, y and z; the total number of sliding windows is n and the sliding-window position is p (0 ≤ p ≤ n - 1, p an integer). When the total number of sliding windows n is greater than or equal to 5, the values in the window fall into three cases according to p, corresponding respectively to the front, corner and rear parts of the trapezoidal data.
(1) When p lies in the first range [condition given as a formula image in the original], the sliding window is in the front part of the trapezoid, and the window values are as shown in fig. 5(a).
(2) When p lies in the second range [formula image], the sliding window is at the corner of the trapezoid; three window forms occur [formula images], whose values are shown in fig. 5(c), fig. 5(d) and fig. 5(e) respectively.
(3) When p lies in the third range [formula image], the sliding window is in the rear part of the trapezoid, and the window values are as shown in fig. 5(b).
The trapezoidal multiplexing proposed by the invention can be used directly for conventional convolution and can also be adapted to special convolutions by adjustment. To combine trapezoidal multiplexing with the Winograd convolution, the scheme is adjusted: since the invention uses the Winograd of F(2 × 2, 3 × 3), the input tile is 4 × 4 with a step of 2, and the output of the trapezoidal multiplexing corresponds to the Winograd input, so the data width of the normal conversion is changed from 3 to 4, as shown in fig. 6.
3. Computing optimized contrasts
3.1 acceleration effects of fast multiplication
Since the weights of the DCNN are quantized to 8 bits herein, 8-bit multiplication requires the addition of 8 partial products. To speed up the computation of the addition, a parallel addition tree is used herein to accomplish the addition of the partial products, as shown in FIG. 7.
The 8 numbers to be added are denoted add[0], add[1], ..., add[7]; the partial sums produced by the first round of pairwise additions are denoted P_sum0[0], P_sum0[1], P_sum0[2] and P_sum0[3], those produced by the second round are P_sum1[0] and P_sum1[1], and a final addition yields the result F_sum.
On an FPGA each operation takes one clock cycle, so adding 8 numbers with the parallel addition tree requires 3 clock cycles for the three rounds of additions. When only 4 numbers or 2 numbers need to be added, the tree needs only two rounds or one round respectively, reducing the computation time to 2 clock cycles or 1 clock cycle. The fast multiplication proposed herein reduces the number of partial products to be added, and whenever no more than 4 partial products remain, the clock cycles spent on the partial-product addition drop accordingly. The clock cycles for adding different numbers of partial products are listed in Table 1; the expected value (taken over the possible 8-bit multiplier values) is 2.18 cycles, and compared with a conventional multiplier, which needs 3 cycles to add all partial products, the addition time of the fast multiplier is reduced by 28.3%, so the proposed fast multiplication accelerates the multiplication. Radix-4 Booth-encoded multiplication shows a clear speed-up for high-precision data such as 16-bit or 32-bit operands, but for 8-bit data its effect is modest, reducing the multiplication time by only about 20%. Since the data computed here are quantized to 8 bits, the fast multiplication proposed herein is chosen to accelerate the DCNN.
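A small sketch of the pairwise addition tree of Fig. 7, counting one clock cycle per level of additions; it reproduces the 3-, 2- and 1-cycle cases mentioned above (a software illustration only, not the hardware description):

```python
def addition_tree(values):
    """Sum a list pairwise, level by level, like the tree in Fig. 7.
    Returns the sum and the number of levels, i.e. clock cycles when each
    level of parallel pairwise additions finishes in one FPGA cycle."""
    cycles = 0
    while len(values) > 1:
        nxt = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # an odd element is carried to the next level
            nxt.append(values[-1])
        values = nxt
        cycles += 1
    return (values[0] if values else 0), cycles

assert addition_tree([1] * 8) == (8, 3)   # 8 partial products -> 3 cycles
assert addition_tree([1] * 4) == (4, 2)   # 4 partial products -> 2 cycles
assert addition_tree([1] * 2) == (2, 1)   # 2 partial products -> 1 cycle
```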
3.2 improved Winograd
To verify the performance improvement of the improved Winograd algorithm, software simulation is carried out before porting the algorithm to hardware. An NVIDIA GTX 950 GPU was used for the tests, which covered the conventional convolution, the Winograd convolution and our improved Winograd convolution. For F(2 × 2, 3 × 3), the sliding-window computation of the conventional convolution contains 36 multiplications and 32 additions, the Winograd convolution contains 16 multiplications and 128 additions, and the improved Winograd convolution combines Winograd with the fast multiplication to reduce the computation time further.
The network used for the simulation was VGG-16, and the mean times measured for the convolution F(2 × 2, 3 × 3) are as follows:
the conventional convolution takes 34.1 microseconds;
the Winograd convolution takes 11.3 microseconds;
the improved Winograd convolution takes 8.6 microseconds.
From the data, the computation speed of the improved Winograd convolution is 3.96 times that of the conventional convolution and 1.34 times that of the Winograd convolution. The performance of the improved Winograd convolution algorithm proposed herein is improved.
To highlight the performance of the improved Winograd algorithm, different acceleration strategies were tested on the convolutional layers of VGG-16. A convolution acceleration algorithm reduces the computation delay; to compare the effects of different algorithms, three schemes were tested: no optimization, the traditional Winograd algorithm, and our improved Winograd algorithm. The computation delays of the three schemes on each convolutional layer of VGG-16 are shown in fig. 8: the unoptimized network has the highest delay, the traditional Winograd algorithm comes next, and the improved Winograd algorithm has the lowest. The delay without optimization is three times that of the traditional Winograd algorithm, and the delay of the traditional Winograd algorithm is 1.2 times that of the improved one. By combining the fast multiplication with Winograd, the improved Winograd algorithm we propose achieves the best performance.
3.3 memory interaction optimization contrast
The transmission delays of the three memory interaction optimization strategies on each convolutional layer of VGG-16 are shown in fig. 9: the unoptimized network has the highest delay, one-dimensional convolution multiplexing comes next, and trapezoidal multiplexing has the lowest. Because trapezoidal multiplexing reduces the on/off-chip memory interaction volume the most, its transmission delay is the lowest.
TABLE 1. Clock cycles needed to add different numbers of partial products
[table given as an image in the original; per the text, adding 2, 4 and 8 partial products takes 1, 2 and 3 clock cycles respectively, with an expected value of 2.18 cycles]
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the invention and are not limiting. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (3)

1. An accelerator of a deep convolutional neural network based on an FPGA, the accelerator comprising: a fast convolution operation module and a two-dimensional convolution memory interaction module;
the fast convolution operation module is used for combining a lightweight fast multiplication with the Winograd algorithm to realize the fast convolution operation of the deep convolutional neural network;
the two-dimensional convolution memory interaction module is used for transferring the intermediate calculation results and weights produced during the fast convolution operation between off-chip memory and on-chip cache using the trapezoidal-multiplexing memory interaction optimization strategy, so that the data interaction between the inside and outside of the chip is minimized;
the specific implementation process of the two-dimensional convolution memory interaction module comprises the following steps:
the input feature map and the weights of the deep convolutional neural network are divided into trapezoidal data blocks; when the convolution window slides over the trapezoidal data, the trapezoidal data must undergo normal conversion, that is, a sliding window on the trapezoidal data is converted back to a normal convolution window according to the window correspondence rule;
the computation order of the trapezoidal multiplexing is to take each trapezoidal data block from left to right in turn and, after a trapezoidal block has been computed, to store the data it shares with the adjacent block on its right; within each trapezoidal data block, the window slides from top to bottom and then from left to right;
the trapezoidal data is divided into three parts: a front part, a corner and a rear part; the three columns of the trapezoid are represented by three vectors x, y and z; the total number of sliding windows is n, the sliding-window position is p, 0 ≤ p ≤ n - 1, and p is an integer; when n ≥ 5, the values in the window fall into the following three cases according to p, corresponding respectively to the front part, the corner and the rear part of the trapezoidal data:
(1) when p lies in the first range [condition given as a formula image in the original], the sliding window is in the front part of the trapezoid;
(2) when p lies in the second range [formula image], the sliding window is at the corner of the trapezoid, and there are three window forms [formula images];
(3) when p lies in the third range [formula image], the sliding window is in the rear part of the trapezoid.
2. The accelerator of the deep convolutional neural network based on the FPGA of claim 1, wherein the fast convolution operation module is implemented by:
for two-dimensional convolution, assuming the output size is m × m and the convolution kernel size is r × r, the convolution can be denoted F(m × m, r × r); the two-dimensional Winograd convolution is calculated as:
Y = A^T [(G g G^T) ⊙ (B^T d B)] A
wherein G, B and A respectively denote the convolution kernel transform matrix, the input transform matrix and the output transform matrix, and once m and r are determined the transform matrices G, B and A can be computed in advance; d denotes the input and g the convolution kernel; B^T, A^T and G^T denote the transposes of B, A and G;
Winograd is improved with the fast multiplication, and the improved formula is:
Y = A^T fm[(G g G^T), (B^T d B)] A
wherein fm(X, Y) denotes the fast multiplication of the two matrices X and Y, in which the elements at the same positions of X and Y are multiplied by the fast multiplication.
3. The accelerator of the deep convolutional neural network based on the FPGA of claim 1, wherein the normal conversion of the trapezoidal data is specifically: the data width is adjusted to the input length of the two-dimensional convolution F(m × m, r × r).
CN202110249630.2A 2021-03-08 2021-03-08 Deep convolutional neural network accelerator based on FPGA Expired - Fee Related CN112949845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110249630.2A CN112949845B (en) 2021-03-08 2021-03-08 Deep convolutional neural network accelerator based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110249630.2A CN112949845B (en) 2021-03-08 2021-03-08 Deep convolutional neural network accelerator based on FPGA

Publications (2)

Publication Number Publication Date
CN112949845A CN112949845A (en) 2021-06-11
CN112949845B true CN112949845B (en) 2022-08-09

Family

ID=76229595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110249630.2A Expired - Fee Related CN112949845B (en) 2021-03-08 2021-03-08 Deep convolutional neural network accelerator based on FPGA

Country Status (1)

Country Link
CN (1) CN112949845B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399036B (en) * 2022-01-12 2023-08-22 电子科技大学 Efficient convolution calculation unit based on one-dimensional Winograd algorithm
CN115329951B (en) * 2022-09-13 2023-09-15 北京工商大学 FPGA architecture for convolutional neural network fast convolutional operation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102062614A (en) * 2009-11-14 2011-05-18 安华高科技Ecbuip(新加坡)私人有限公司 High resolution optical encoder systems, devices and methods

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229645B (en) * 2017-04-28 2021-08-06 北京市商汤科技开发有限公司 Convolution acceleration and calculation processing method and device, electronic equipment and storage medium
US10990648B2 (en) * 2017-08-07 2021-04-27 Intel Corporation System and method for an optimized winograd convolution accelerator
US10387122B1 (en) * 2018-05-04 2019-08-20 Olsen Ip Reserve, Llc Residue number matrix multiplier
CN109086867B (en) * 2018-07-02 2021-06-08 武汉魅瞳科技有限公司 Convolutional neural network acceleration system based on FPGA
US11699070B2 (en) * 2019-03-05 2023-07-11 Samsung Electronics Co., Ltd Method and apparatus for providing rotational invariant neural networks
CN110288086B (en) * 2019-06-13 2023-07-21 天津大学 Winograd-based configurable convolution array accelerator structure
CN110807513A (en) * 2019-10-23 2020-02-18 中国人民解放军国防科技大学 Convolutional neural network accelerator based on Winograd sparse algorithm

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102062614A (en) * 2009-11-14 2011-05-18 安华高科技Ecbuip(新加坡)私人有限公司 High resolution optical encoder systems, devices and methods

Also Published As

Publication number Publication date
CN112949845A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
US11449729B2 (en) Efficient convolutional neural networks
CN112949845B (en) Deep convolutional neural network accelerator based on FPGA
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN1153346C (en) Flow line type parallel-to-serial frame minimum mean square self-adaption filter and method for mfg. same
US20210357735A1 (en) Split accumulator for convolutional neural network accelerator
CN109948774A (en) Neural network accelerator and its implementation based on network layer binding operation
CN111832719A (en) Fixed point quantization convolution neural network accelerator calculation circuit
CN112596701B (en) FPGA acceleration realization method based on unilateral Jacobian singular value decomposition
CN112508125A (en) Efficient full-integer quantization method of image detection model
US20220164663A1 (en) Activation Compression Method for Deep Learning Acceleration
CN113947200B (en) Acceleration calculation method of neural network, accelerator and computer-readable storage medium
CN110110852B (en) Method for transplanting deep learning network to FPAG platform
CN113313252B (en) Depth separable convolution implementation method based on pulse array
CN114418057A (en) Operation method of convolutional neural network and related equipment
Bao et al. LSFQ: A low precision full integer quantization for high-performance FPGA-based CNN acceleration
CN112836823B (en) Convolutional neural network back propagation mapping method based on cyclic recombination and blocking
CN109993293A (en) A kind of deep learning accelerator suitable for stack hourglass network
CN114022371B (en) Defogging device and defogging method based on space and channel attention residual error network
CN114021070A (en) Deep convolution calculation method and system based on micro-architecture processor
CN115357214A (en) Operation unit compatible with asymmetric multi-precision mixed multiply-accumulate operation
CN115965052A (en) Convolutional neural network hardware accelerator and acceleration method
KR20150050680A (en) Device and method for discrete cosine transform
CN116596034A (en) Three-dimensional convolutional neural network accelerator and method on complex domain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220809