CN111814675B - Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA - Google Patents
Info
- Publication number
- CN111814675B CN111814675B CN202010652929.8A CN202010652929A CN111814675B CN 111814675 B CN111814675 B CN 111814675B CN 202010652929 A CN202010652929 A CN 202010652929A CN 111814675 B CN111814675 B CN 111814675B
- Authority
- CN
- China
- Prior art keywords
- module
- feature map
- line
- window
- zero
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
- G06V10/955—Hardware or software architectures specially adapted for image or video understanding using specific electronic processors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/60—Memory management
Abstract
The invention relates to the technical field of image processing, and in particular to an FPGA-based convolutional neural network feature map assembly system supporting dynamic resolution. The system comprises a feature map assembly module and a weight loading module, each connected to a main convolution calculation module, which in turn is connected to a window accumulation module. The main convolution calculation module receives feature map windows from the feature map assembly module and weights from the weight loading module, completes channel accumulation internally, and the window accumulation module then completes the overall convolution calculation. Unlike prior schemes, which support a feature map cache design for only one resolution, the invention automatically configures the feature map cache parameters on the FPGA according to the real-time resolution and is compatible with CNN network implementations at multiple resolutions without any code modification.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a convolutional neural network feature map assembly system supporting dynamic resolution based on an FPGA.
Background
A convolutional neural network (CNN) is an efficient recognition method built around convolution operations, and is one of the representative algorithms of deep learning. In recent years CNNs have been widely applied in many fields, such as automatic labeling algorithms, image search, product recommendation, and search frameworks; the classic and most popular of these applications, however, is image processing. A CNN can take the raw feature image directly as input, skipping a complicated image preprocessing stage, and produce the final classification result for the picture. Because CNN computation involves a large volume of data, it is usually implemented on large-scale computing systems, which brings high implementation difficulty and high cost.
Precisely because of the CNN's distinctive computing pattern, general-purpose processors implement it inefficiently and cannot meet its performance requirements. Accordingly, various accelerators based on field programmable gate arrays (FPGAs), graphics processing units (GPUs), and application-specific integrated circuits (ASICs) have been proposed in recent years to improve CNN performance. A comparison of the three approaches in terms of performance, power consumption, and flexibility is shown in fig. 1. Combining good performance, high energy efficiency, a short development cycle, and other advantages, the FPGA has attracted increasing attention for CNN acceleration.
Implementing a CNN on an FPGA requires computing, reading, and writing large volumes of data. Because on-chip memory resources on an FPGA are limited, data such as the feature maps needed in CNN computation are typically held in external dynamic random access memory (DRAM), which the FPGA reads from and writes to. Since CNN networks serve different application scenarios, the input feature images often come in a variety of resolutions, which requires a CNN implemented on an FPGA to adapt to dynamic resolution as far as possible.
Disclosure of Invention
In view of the above technical problems, the present invention provides an FPGA-based convolutional neural network feature map assembly system supporting dynamic resolution, which differs from previous schemes that support a feature map cache design for only one resolution.
The technical scheme adopted for solving the technical problems is as follows:
a convolutional neural network feature map assembly system supporting dynamic resolution based on an FPGA, the system comprising:
the system comprises a feature map assembly module and a weight loading module, each connected to a main convolution calculation module; the main convolution calculation module is connected to a window accumulation module; the main convolution calculation module receives feature map windows from the feature map assembly module and weights from the weight loading module, completes channel accumulation internally, and the window accumulation module then completes the overall convolution calculation; the window accumulation module is connected to a feature map output module.
In the technical scheme of the invention, the convolutional neural network feature map assembly system supporting dynamic resolution based on the FPGA is characterized in that the feature map assembly module specifically comprises:
a zero-padding module for performing a zero-padding operation on the feature map;
a line cache module, connected to the zero-padding module, for implementing feature map line caching, line switching, and line data output;
and a window assembly module, connected to the line cache module, which, after obtaining over several cycles all the data required by the main convolution calculation module, disassembles and recombines the data and finally outputs a feature map window per channel.
In the technical scheme of the invention, the convolutional neural network feature map assembly system supporting dynamic resolution based on the FPGA is characterized in that a BRAM cache module is arranged in the line cache module.
In the technical scheme of the invention, the convolutional neural network feature map assembly system supporting dynamic resolution based on the FPGA is characterized in that a read-write controller is arranged in the line cache module and used for controlling read-write signals and read-write addresses of the line cache module and writing the read-write signals and the read-write addresses into the BRAM cache module.
The technical scheme has the following advantages or beneficial effects:
the invention provides a convolutional neural network feature map assembly system supporting dynamic resolution based on an FPGA, which can cope with the situation of resolution change when a CNN network is realized, is compatible with various resolutions without modifying codes, is more convenient and quick, saves a line switching initialization period in a BRAM cache mode, further improves feature map assembly efficiency, and completely does not influence the efficiency of a main convolutional calculation module.
Drawings
The invention and its features, aspects and advantages will become more apparent from the detailed description of non-limiting embodiments with reference to the following drawings. Like numbers refer to like parts throughout. The drawings may not be to scale, emphasis instead being placed upon illustrating the principles of the invention.
Fig. 1 is a comparison of FPGA, GPU, and ASIC;
FIG. 2 is a schematic diagram of a convolutional neural network implementing a window accumulation scheme on an FPGA;
FIG. 3 is a schematic diagram of the overall design of a feature map assembly module;
fig. 4 is a schematic diagram of a BRAM cache module scheme in a line cache module.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To accommodate the fact that, in practical CNN applications, input feature images may come in several different resolutions, the present invention makes the control parameters of the feature map assembly module dynamically configurable when the CNN network is implemented on the FPGA, so that the implementation is highly compatible with feature map inputs of different resolutions.
As shown in fig. 2 and fig. 3, the FPGA-based convolutional neural network feature map assembly system supporting dynamic resolution of the present invention comprises a feature map assembly module and a weight loading module, each connected to a main convolution calculation module; the main convolution calculation module is connected to a window accumulation module; the main convolution calculation module receives feature map windows from the feature map assembly module and weights from the weight loading module, completes channel accumulation internally, and the window accumulation module then completes the overall convolution calculation.
In the technical scheme of the invention, the window accumulation module is connected to a feature map output module, and the feature map assembly module specifically comprises:
a zero-padding module for performing a zero-padding operation on the feature map; a line cache module, connected to the zero-padding module, for implementing feature map line caching, line switching, and line data output; and a window assembly module, connected to the line cache module, which, after obtaining over several cycles all the data required by the main convolution calculation module, disassembles and recombines the data and finally outputs a feature map window per channel. In the technical scheme of the invention, a BRAM cache module and a read-write controller are arranged in the line cache module; the read-write controller controls the read-write signals and read-write addresses of the line cache module and applies them to the BRAM cache module.
The technical scheme of the invention requires the feature map assembly module to provide convolution calculation windows to the main convolution calculation module, within the limits of DDR transmission efficiency, in a highly compatible manner. Taking a 3x3 convolution kernel as an example, i.e. a 3x3 feature map window must be provided to the main convolution calculation module, the scheme adapts to the input feature map's height (h), width (w), and channel number (c) as well as the step size (stride), where w, h, c, and stride are all freely settable values.
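As a point of reference for the hardware behaviour, the window supply that the feature map assembly module must deliver can be modelled in a few lines of software. This is a pure-Python sketch only; the function name, the nested-list layout, and the example values are illustrative assumptions, not taken from the patent, which realizes this in FPGA logic:

```python
def extract_windows(fmap, k=3, stride=1):
    """Functional model of k x k window extraction from an h x w x c feature
    map stored as nested lists indexed [row][col][channel]. Illustrative
    only -- it shows which windows the hardware must hand to the main
    convolution calculation module for arbitrary h, w, c, stride."""
    h, w = len(fmap), len(fmap[0])
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    windows = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # one k x k x c window, top-left corner at (i*stride, j*stride)
            row.append([r[j * stride:j * stride + k]
                        for r in fmap[i * stride:i * stride + k]])
        windows.append(row)
    return windows

# a 5 x 5 single-channel map; stride 1 yields a 3 x 3 grid of 3x3 windows
fmap = [[[r * 5 + c] for c in range(5)] for r in range(5)]
win = extract_windows(fmap, k=3, stride=1)
print(len(win), len(win[0]))  # 3 3
print(win[0][0])              # top-left 3x3 window
```

Note how, for stride < k, adjacent windows overlap by k - stride columns; this overlap is what the line cache module's read-address scheme exploits.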
The whole feature map assembly module comprises:
1) Zero filling module:
in the standard convolution calculation process, the feature map is often required to be subjected to zero padding operation. Under the condition of dynamic resolution requirement, the input feature map size is assumed to be high (h) and wide (w) and the number of channels is assumed to be high (c), the DDR input bit width is 128 bits, the feature map data is 8 bits, and the input feature map data is input according to the sequence of hwc. The zero-filling position and the number of the zero-filling points can be confirmed in a counting mode, and zero-filling operation is performed. The first line and the line head are subjected to zero filling through data input of a first signal and a first signal mark after the line is ended, and the line end zero filling position and the zero filling number are calculated as follows:
RowEndPadding = w * c * 8 / 128;
RowEndPaddingNum = (number of zero-padded columns at the row end) * c * 8 / 128
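Under the stated assumptions (128-bit DDR words, 8-bit feature data, and sizes that divide the bus width evenly), the two formulas above can be checked with a small script. The function and variable names are illustrative, not from the patent:

```python
def row_end_padding_params(w, c, pad_cols, ddr_bits=128, data_bits=8):
    """Compute, per the formulas above, where row-end zero padding starts
    (counted in DDR words from the start of the row) and how many zero
    words to insert. Assumes w*c*data_bits and pad_cols*c*data_bits are
    exact multiples of ddr_bits, as the patent's example implies."""
    row_end_padding = w * c * data_bits // ddr_bits            # padding start position
    row_end_padding_num = pad_cols * c * data_bits // ddr_bits # zero words to append
    return row_end_padding, row_end_padding_num

# e.g. a 224-wide, 16-channel map padded by one column on each side (2 total)
print(row_end_padding_params(224, 16, 2))  # (224, 2)
```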
2) Line cache module:
and when the convolution step length is smaller than the convolution kernel size, the efficient multiplexing of the data can be realized through read address control by taking the BRAM buffer module as a design main body. In the above example, the output of the read address according to the cycle is 0,1,2,1,2,3,2,3,4 … …, so as to improve the convolution calculation efficiency, a combination of row pushing and read-before-write is adopted. Namely, when only three BRAM buffer modules are used for respectively buffering three lines of data, new data are sequentially written while the required characteristic diagram data are read from the BRAM, and new data are sequentially written while the required characteristic diagram data are read from the BRAM. The cross-line data update utilizes BRAM, and next line data is read out from BRAM of the cached next line data and then written into the line data cache BRAM, as shown in fig. 4. To support multiple resolution input scenarios, the key to the present invention is that the parameter input scheme can be automatically configured. Through h, w and c, parameters required by module control can be calculated. The line cache is usually output by adopting a first-in first-out queue in the prior proposal, but the condition of wasting data and cycle at the line conversion is encountered, and the control and the modification are not easy. The BRAM plus control parameter scheme enables the support degree to the dynamic resolution to be higher on the one hand, and improves the window assembly efficiency on the other hand. In the implementation process of the scheme, according to different resolutions, the number (n) of windows actually needed is added, so that the number (GroupNum) of window groups of each row and the length (groupwength) of window groups of each row can be obtained:
GroupNum = (w + RowEndPaddingNum - (3 - stride)) ÷ stride ÷ n
GroupLength=n*stride*c
Once the window group number and window group length are known, the read and write positions can be obtained exactly through counting and comparison logic, and the read-write state remains correctly located when the resolution changes, thereby realizing line caching, line switching, and line data output.
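The group parameters and the overlapping read-address pattern described above can be modelled in software. This is a sketch under assumptions: integer division is assumed where the patent writes ÷, k denotes the kernel size (3 in the patent's example), and all names and example values are illustrative:

```python
def window_group_params(w, row_end_padding_num, stride, n, c, k=3):
    """GroupNum and GroupLength per the formulas above (integer division
    assumed; n is the number of windows actually needed per group)."""
    group_num = (w + row_end_padding_num - (k - stride)) // stride // n
    group_length = n * stride * c
    return group_num, group_length

def read_address_cycle(num_windows, stride, k=3):
    """Overlapping per-window read addresses when stride < k, reproducing
    the 0,1,2,1,2,3,2,3,4,... pattern from the text: adjacent windows
    reuse the k - stride columns already held in BRAM."""
    addrs = []
    for i in range(num_windows):
        addrs.extend(range(i * stride, i * stride + k))
    return addrs

print(window_group_params(w=224, row_end_padding_num=2, stride=1, n=2, c=16))  # (112, 32)
print(read_address_cycle(3, stride=1))  # [0, 1, 2, 1, 2, 3, 2, 3, 4]
```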
3) Window assembly module:
and (3) outputting by a line connection buffer module, periodically obtaining all data required in the main convolution calculation module, and finally outputting a characteristic diagram window according to the channel through number disassembly and recombination. The window assembly mode can be determined by dynamically calculating the number of columns (WindowColumnNum) parameters required for each output window, thereby adapting to the multi-resolution situation.
WindowColumnNum = n * stride + (3 - stride)
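A minimal model of this parameter, with k = 3 generalizing the patent's example kernel size (names are illustrative, not from the patent):

```python
def window_column_num(n, stride, k=3):
    """WindowColumnNum = n*stride + (k - stride): the number of feature map
    columns the assembly module must gather to emit n windows of width k
    at the given stride, counting the shared overlap columns only once."""
    return n * stride + (k - stride)

# n=2 windows at stride 1 share two columns each: 2*1 + 2 = 4 columns suffice
print(window_column_num(n=2, stride=1))  # 4
print(window_column_num(n=2, stride=2))  # 5
```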
Combining the three main modules above converts feature map data input to the FPGA in h-w-c order into the window data required by the convolution calculation module, meeting the multi-resolution requirement, with efficiency that fully satisfies the performance requirement within the design range of the scheme.
By passing in the control parameters (h, w, c, stride, n) dynamically, the invention can cope with resolution changes when a CNN network is implemented: multiple resolutions are supported without modifying the code, which is more convenient and fast. Meanwhile, in contrast to other schemes, which use a first-in first-out (FIFO) queue for caching and are awkward precisely at line switching, the BRAM caching approach saves the line-switch initialization cycles, further improving feature map assembly efficiency without affecting the efficiency of the main convolution calculation module at all.
Those skilled in the art will understand that variations may be implemented in combination with the prior art and the above embodiments; such modifications do not affect the essence of the present invention and are not described herein.
The preferred embodiments of the present invention have been described above. It is to be understood that the invention is not limited to the specific embodiments described above, wherein devices and structures not described in detail are to be understood as being implemented in a manner common in the art; any person skilled in the art can make many possible variations and modifications to the technical solution of the present invention or modifications to equivalent embodiments without departing from the scope of the technical solution of the present invention, using the methods and technical contents disclosed above, without affecting the essential content of the present invention. Therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.
Claims (3)
1. A convolutional neural network feature map assembly system supporting dynamic resolution based on an FPGA, the system comprising:
the device comprises a feature map assembling module and a weight loading module, wherein the feature map assembling module and the weight loading module are respectively connected with a main convolution computing module, the main convolution computing module is connected with a window accumulating module, the main convolution computing module inputs a feature map window of the feature map assembling module through the weight loading module, channel accumulation is completed in the main convolution computing module, then the whole convolution computation is completed in the window accumulating module, and the window accumulating module is connected with a feature map output module;
the feature map assembling module specifically comprises: the zero-filling module is used for performing zero-filling operation on the feature map; the line cache module is connected with the zero padding module and is used for realizing characteristic diagram line cache, line switching and line data output; the window assembly module is connected with the line connection cache module, and after all data required in the main convolution calculation module are obtained in a periodical manner, the window assembly module outputs a characteristic diagram window according to the channel through disassembly and recombination;
in the zero-padding module, assuming the input feature map has height h, width w, and channel number c, the DDR input bit width is 128 bits, the feature map data is 8-bit, and the feature map data is input in h-w-c order, the zero-padding positions and the number of zeros to pad are determined by counting and the zero-padding operation is performed; zero padding of the first row and of each row head is driven by a first-data signal and an end-of-row flag, and the row-end zero-padding position and padding count are calculated as:
RowEndPadding = w * c * 8 / 128;
RowEndPaddingNum = (number of zero-padded columns at the row end) * c * 8 / 128
In the line buffer module, according to different resolutions, the number n of windows actually needed is added, so that the number GroupNum of window groups in each line and the length groupwength of window groups in each line can be obtained:
GroupNum = (w + RowEndPaddingNum - (3 - stride)) ÷ stride ÷ n
GroupLength=n*stride*c
in the window assembly module, fed by the output of the line cache module, all the data required by the main convolution calculation module are obtained over several cycles, then disassembled and recombined so that a feature map window is finally output per channel; the window assembly pattern is determined by dynamically calculating the number of columns required per output window, the WindowColumnNum parameter, thereby adapting to the multi-resolution situation, with the formula:
WindowColumnNum = n * stride + (3 - stride).
2. the convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA of claim 1, wherein a BRAM cache module is provided in the line cache module.
3. The convolutional neural network feature map assembly system supporting dynamic resolution based on the FPGA of claim 2, wherein a read-write controller is disposed in the line buffer module, and the read-write controller is configured to control read-write signals and read-write addresses of the line buffer module and write the read-write signals and the read-write addresses into the BRAM buffer module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010652929.8A CN111814675B (en) | 2020-07-08 | 2020-07-08 | Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010652929.8A CN111814675B (en) | 2020-07-08 | 2020-07-08 | Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111814675A CN111814675A (en) | 2020-10-23 |
CN111814675B true CN111814675B (en) | 2023-09-29 |
Family
ID=72842626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010652929.8A Active CN111814675B (en) | 2020-07-08 | 2020-07-08 | Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111814675B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114359662B * | 2021-12-24 | 2023-06-13 | Jiangsu University | Implementation method of convolutional neural network based on heterogeneous FPGA and fusion multi-resolution |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104683265A (en) * | 2015-02-27 | 2015-06-03 | 南京中新赛克科技有限责任公司 | Accurate high-capacity packet counting method for 100G interfaces |
CN107622226A (en) * | 2017-08-27 | 2018-01-23 | 南京理工大学 | Vehicle checking method and system based on improved deformable part model algorithm |
CN109214504A (en) * | 2018-08-24 | 2019-01-15 | 北京邮电大学深圳研究院 | A kind of YOLO network forward inference accelerator design method based on FPGA |
CN109272113A (en) * | 2018-09-13 | 2019-01-25 | 深思考人工智能机器人科技(北京)有限公司 | A kind of convolutional neural networks establish device and method |
CN109416756A (en) * | 2018-01-15 | 2019-03-01 | 深圳鲲云信息科技有限公司 | Acoustic convolver and its applied artificial intelligence process device |
CN109948777A (en) * | 2018-11-14 | 2019-06-28 | 深圳大学 | The implementation method of convolutional neural networks is realized based on the FPGA convolutional neural networks realized and based on FPGA |
CN110097174A (en) * | 2019-04-22 | 2019-08-06 | 西安交通大学 | Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row |
CN110191330A (en) * | 2019-06-13 | 2019-08-30 | 内蒙古大学 | Depth map FPGA implementation method and system based on binocular vision green crop video flowing |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9020276B2 (en) * | 2012-04-24 | 2015-04-28 | Stmicroelectronics S.R.L. | Hardware coprocessor for stripe-based interest point detection |
CN110058883B (en) * | 2019-03-14 | 2023-06-16 | 梁磊 | CNN acceleration method and system based on OPU |
-
2020
- 2020-07-08 CN CN202010652929.8A patent/CN111814675B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104683265A (en) * | 2015-02-27 | 2015-06-03 | 南京中新赛克科技有限责任公司 | Accurate high-capacity packet counting method for 100G interfaces |
CN107622226A (en) * | 2017-08-27 | 2018-01-23 | 南京理工大学 | Vehicle checking method and system based on improved deformable part model algorithm |
CN109416756A (en) * | 2018-01-15 | 2019-03-01 | 深圳鲲云信息科技有限公司 | Acoustic convolver and its applied artificial intelligence process device |
CN109214504A (en) * | 2018-08-24 | 2019-01-15 | 北京邮电大学深圳研究院 | A kind of YOLO network forward inference accelerator design method based on FPGA |
CN109272113A (en) * | 2018-09-13 | 2019-01-25 | 深思考人工智能机器人科技(北京)有限公司 | A kind of convolutional neural networks establish device and method |
CN109948777A (en) * | 2018-11-14 | 2019-06-28 | 深圳大学 | The implementation method of convolutional neural networks is realized based on the FPGA convolutional neural networks realized and based on FPGA |
CN110097174A (en) * | 2019-04-22 | 2019-08-06 | 西安交通大学 | Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row |
CN110191330A (en) * | 2019-06-13 | 2019-08-30 | 内蒙古大学 | Depth map FPGA implementation method and system based on binocular vision green crop video flowing |
Non-Patent Citations (4)
Title |
---|
A survey of FPGA-based accelerators for convolutional neural networks; Mittal S et al.; Neural Computing and Applications; vol. 32, no. 4; 1109-1139 *
Optimizing CNN-based object detection algorithms on embedded FPGA platforms; Zhao R et al.; Applied Reconfigurable Computing: 13th International Symposium; 255-267 *
Hardware architecture design of convolutional neural networks based on FPGA dynamic reconfiguration; He Kaixuan et al.; Information Technology and Network Security; vol. 38, no. 3; 77-81 *
Design and implementation of an FPGA-based image convolution IP core; Zhu Xueliang et al.; Microelectronics & Computer; vol. 28, no. 6; 188-192 *
Also Published As
Publication number | Publication date |
---|---|
CN111814675A (en) | 2020-10-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106021182B (en) | A kind of row transposition architecture design method based on Two-dimensional FFT processor | |
US11709911B2 (en) | Energy-efficient memory systems and methods | |
CN108388527B (en) | Direct memory access engine and method thereof | |
CN102890427B (en) | Method for preparing skewed data in field programmable gate array (FPGA) of direct-writing type photoetching system | |
JP4846306B2 (en) | Semiconductor memory device, semiconductor integrated circuit system using the same, and method for controlling semiconductor memory device | |
CN110825312A (en) | Data processing device, artificial intelligence chip and electronic equipment | |
US11550586B2 (en) | Method and tensor traversal engine for strided memory access during execution of neural networks | |
CN112005251A (en) | Arithmetic processing device | |
CN111814675B (en) | Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA | |
CN114092338A (en) | Image zooming fast calculation method | |
CN104869284A (en) | High-efficiency FPGA implementation method and device for bilinear interpolation amplification algorithm | |
CN114780910B (en) | Hardware system and calculation method for sparse convolution calculation | |
US8581918B2 (en) | Method and system for efficiently organizing data in memory | |
CN110490312B (en) | Pooling calculation method and circuit | |
US11443185B2 (en) | Memory chip capable of performing artificial intelligence operation and method thereof | |
KR102306252B1 (en) | Apparatus and method for transforming matrix and data processing system | |
CN115995249B (en) | Matrix transposition operation device based on DRAM | |
CN112396072A (en) | Image classification acceleration method and device based on ASIC and VGG16 | |
CN113034344B (en) | Two-dimensional FFT method with low memory resource overhead | |
CN116150055B (en) | Data access method and device based on-chip cache and transposition method and device | |
CN114741352B (en) | FPGA-based bilinear interpolation resampling implementation method and device | |
US20230307036A1 (en) | Storage and Accessing Methods for Parameters in Streaming AI Accelerator Chip | |
JPS6125192B2 (en) | ||
CN114840470A (en) | Dimension transformation device friendly to on-chip cache and neural network processor | |
CN117216459A (en) | Convolution operation method, convolution operation device, electronic device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |