CN115329951B - FPGA architecture for convolutional neural network fast convolutional operation - Google Patents

FPGA architecture for convolutional neural network fast convolutional operation

Info

Publication number: CN115329951B
Application number: CN202211112093.8A
Authority: CN (China)
Prior art keywords: conversion module, winograd, fpga, output, module
Legal status: Active (granted)
Other versions: CN115329951A
Other languages: Chinese (zh)
Inventors: 李皓辰, 余乐, 关文洋, 于重重
Assignee (original and current): Beijing Technology and Business University
Filing date: 2022-09-13 (application CN202211112093.8A filed by Beijing Technology and Business University)
Publication of CN115329951A: 2022-11-11
Application granted; publication of CN115329951B: 2023-09-15

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08 Learning methods


Abstract

The invention relates to an FPGA architecture for fast convolution operations of convolutional neural networks, and belongs to the technical field of FPGA architectures. The architecture comprises a plurality of Winograd hard-core computing units arranged in the FPGA in a loose layout. Each Winograd hard-core computing unit comprises an image data conversion module, a weight conversion module, a dot-multiplication module based on fast multipliers, and an output conversion module. The input ends of the weight conversion module and the image data conversion module receive data; their output ends feed the dot-multiplication module; the output end of the dot-multiplication module feeds the input end of the output conversion module; and the output conversion module outputs the result. In the loose layout, logic blocks (LBs) of the FPGA are placed between the Winograd hard-core computing units as spacing. By designing Winograd hard-core computing units and adding them to the FPGA, the invention differs from implementing the Winograd algorithm directly with existing FPGA resources: it reduces the dependence on LBs, DSPs and the FPGA interconnect during computation and raises the maximum clock frequency.

Description

FPGA architecture for convolutional neural network fast convolutional operation
Technical Field
The invention relates to an FPGA architecture for fast convolution operation of a convolution neural network, and belongs to the technical field of FPGA architectures.
Background
Over the past decade, hardware designs and architectures for accelerating machine learning (ML) algorithms, such as Horizon Robotics' Brain Processing Unit (BPU), IBM TrueNorth, DianNaoYu and Alibaba's Hanguang 800, have emerged in large numbers, offering ever-increasing computing resources and memory bandwidth. Beyond these, many FPGA-based solutions have also been proposed. Baidu's AI cloud computing chip XPU is a cloud acceleration chip based on the FPGA. Xilinx's xNN and Intel's DLA are so-called overlay processors, which map systolic-array-based matrix multipliers onto a generic FPGA. These FPGA-based solutions do not modify the architecture of the FPGA itself; they are implemented with the programmable logic already present on current FPGAs, such as Logic Blocks (LBs), DSP multipliers (Digital Signal Processing, DSP) and block memory units (Random Access Memory, RAM).
Unlike the above designs, another research direction is to change the architecture of the FPGA itself to accelerate ML algorithms. Eldafrawy et al. modified the architecture of the configurable logic blocks (Configurable Logic Block, CLB) to reduce the area consumed by multiplications and additions implemented in soft logic. Aman et al. added variable-precision tensor units to existing FPGAs for ML acceleration. Other works change the architecture of the DSP to improve performance. PIR-DSP modifies the architecture of the DSP48E2 and adds registers to better meet the computational requirements of low-precision deep neural networks. Yuan Dai et al. propose APIR-DSP, an improvement on PIR-DSP that increases computation speed while reducing area consumption.
Approximately 80% to 90% of the operations in a convolutional neural network are convolution calculations, and the Winograd algorithm has been widely shown to accelerate convolution effectively by reducing the number of multiplications. Patent CN111459877A uses Winograd to accelerate the YOLO v2 convolutional neural network. Patent CN113283587A splits convolution kernels whose size is not 3×3 and then accelerates the resulting convolutions with Winograd.
In addition, for the vast majority of neural network applications, fixed-point input data achieves good experimental results while improving speed and reducing power consumption. In general, for networks and scenarios with low precision requirements, an 8-bit data width is sufficient to meet the accuracy requirements.
The tensor units mentioned above are suitable for matrix-multiplication calculations, but are less effective than the Winograd algorithm for computing convolutions. Although patent CN113283587A splits convolution kernels of shapes other than 3×3, the split kernels can still be computed with Winograd of size F(2×2, 3×3). The Winograd algorithm optimizes convolution efficiently by reducing the number of multiplications; for F(2×2, 3×3) it needs 16 multiplications where direct convolution needs 36. However, Winograd designs built from soft logic (LBs and interconnect on the FPGA) are slow and area-inefficient. Since Winograd computation removes the accumulation operations of direct convolution, using a DSP only for multiplication also wastes area. Besides the core domain-conversion modules and the multipliers, a design contains some control logic; building that logic from LBs and FPGA interconnect also slows the overall design. Together these factors make Winograd convolution on an FPGA much slower than a dedicated ASIC.
Disclosure of Invention
The technical problem the invention aims to solve is: how to make the convolution acceleration obtained from the Winograd algorithm on an FPGA, which is currently implemented with the existing LBs, DSPs and interconnect resources, reach and exceed the acceleration achievable with an ASIC.
To solve the above technical problem, the technical solution provided by the invention is as follows: an FPGA architecture for fast convolution operations of convolutional neural networks, comprising a plurality of Winograd hard-core computing units, the Winograd hard-core computing units being distributed in the FPGA in a loose layout;
each Winograd hard-core computing unit comprises an image data conversion module, a weight conversion module, a dot-multiplication module based on fast multipliers and an output conversion module; the input ends of the weight conversion module and the image data conversion module receive data, their output ends feed the dot-multiplication module, the output end of the dot-multiplication module feeds the input end of the output conversion module, and the output end of the output conversion module drives the unit's output;
the image data conversion module, the weight conversion module and the output conversion module are based on the fast multiplier and are realized through displacement and addition operation;
the point multiplication module is realized by a base 4-Booth encoder and a Wallace tree;
arranging in a loose mode, wherein LBs of the FPGA is arranged between the Winograd hard core computing units for spacing.
A further improvement of the scheme is: the two 8-bit numbers received at the input end of the dot-multiplication module are the multiplicand and the multiplier; the multiplier is encoded by a radix-4 Booth encoder, the encoded multiplier and the multiplicand generate 4 partial products, the partial products are fed into a Wallace tree for 4:2 compression, and the compressed result is summed by a carry-lookahead adder to obtain the final result, which is output from the output end to the output conversion module.
The beneficial effects brought by the invention are as follows: by designing Winograd hard-core computing units and adding them to the FPGA, the invention differs from implementing the Winograd algorithm directly with the resources already on the FPGA; it reduces the dependence on LBs, DSPs and the FPGA interconnect during computation and raises the maximum clock frequency.
In addition, circuits with specific functions on an FPGA are built by connecting the various modules through interconnect resources; the loose topology reduces the number of routing channels and the switch-box size required by the Winograd algorithm when computing convolutions, and thereby reduces the area-delay product.
Drawings
The invention is further described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a Winograd hard-core computing unit of an FPGA architecture for fast convolution operations of a convolutional neural network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the dot-multiplication module of an FPGA architecture for fast convolution operations of a convolutional neural network according to an embodiment of the present invention.
Fig. 3 illustrates the fast-multiplier encoding scheme according to an embodiment of the present invention.
FIG. 4 compares the areas of a systolic-array design, an implementation using the FPGA's existing resources, and an FPGA design containing Winograd hard cores; the area is broken down into logic-block area and routing area.
FIG. 5 shows the percentage reduction in design area of the FPGA implementation containing Winograd hard cores compared with the implementation using the FPGA's existing resources, in an embodiment of the present invention.
FIG. 6 shows a frequency comparison of a systolic array, an implementation using the FPGA's existing resources, and an FPGA design containing Winograd hard cores, in an embodiment of the present invention.
FIG. 7 shows the percentage improvement in design frequency of the FPGA implementation containing Winograd hard cores compared with the implementation using the FPGA's existing resources.
Fig. 8 is a partial schematic diagram (only the lower-left corner of the overall architecture) of the three topologies loose, columnar and dense in an embodiment of the invention.
Fig. 9 shows the design area consumption of the loose, columnar and dense topology implementations in an embodiment of the present invention.
Fig. 10 shows the percentage reduction in area achieved by the loose and columnar topologies compared with the dense topology in the embodiment of the invention.
FIG. 11 shows the number of routing channels required by the designs in the three topologies loose, columnar and dense in an embodiment of the invention.
FIG. 12 shows the percentage reduction in the number of routing channels achieved by the loose and columnar topologies compared with the dense topology.
Fig. 13 shows the design frequencies achieved in the three topologies loose, columnar and dense in an embodiment of the invention.
Fig. 14 shows the percentage increase in frequency of the loose and columnar topologies compared with the dense topology in the embodiment of the invention.
FIG. 15 shows the area-delay products of the three topologies loose, columnar and dense, expressed as the percentage reduction relative to the dense topology.
Detailed Description
Examples
The FPGA architecture for convolutional neural network fast convolution operations of this embodiment comprises a plurality of Winograd hard-core computing units, the Winograd hard-core computing units being arranged in the FPGA in a loose layout.
Each Winograd hard-core computing unit comprises an image data conversion module, a weight conversion module, a dot-multiplication module based on fast multipliers and an output conversion module; the input ends of the weight conversion module and the image data conversion module receive data, their output ends feed the dot-multiplication module, the output end of the dot-multiplication module feeds the input end of the output conversion module, and the output end of the output conversion module drives the unit's output.
The image data conversion module, the weight conversion module and the output conversion module are realized through shift and addition operations, without multipliers.
The dot-multiplication module is realized with a radix-4 Booth encoder and a Wallace tree.
In the loose layout, LBs of the FPGA are placed between the Winograd hard-core computing units as spacing.
The Winograd hard-core computing unit is based on the two-dimensional Winograd algorithm formula:
Y = A^T[(G g G^T) ⊙ (B^T d B)]A
where ⊙ denotes the matrix element-wise (Hadamard) product, Y is the convolution result, G is the convolution-kernel transform matrix, B^T is the input-image transform matrix, A^T is the output transform matrix, g is a 3×3 convolution kernel and d is a 4×4 tile of input image data.
For a two-dimensional convolution, let the output size be m×m and the convolution-kernel size be r×r; the two-dimensional Winograd convolution is then denoted F(m×m, r×r). The matrices G, B^T and A^T of the F(2×2, 3×3) Winograd algorithm employed in the present invention are shown below:
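A commonly used choice for these transform matrices in the Winograd convolution literature, consistent with the shift-and-add transform circuits described below, is (taken from the standard F(2×2, 3×3) construction rather than from the patent itself):

$$
B^T=\begin{bmatrix}1&0&-1&0\\0&1&1&0\\0&-1&1&0\\0&1&0&-1\end{bmatrix},\qquad
G=\begin{bmatrix}1&0&0\\\tfrac{1}{2}&\tfrac{1}{2}&\tfrac{1}{2}\\\tfrac{1}{2}&-\tfrac{1}{2}&\tfrac{1}{2}\\0&0&1\end{bmatrix},\qquad
A^T=\begin{bmatrix}1&1&1&0\\0&1&-1&-1\end{bmatrix}
$$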
the designed Winograd hard kernel computing unit is shown in figure 1 and comprises an image data transformation module, a weight transformation module, a dot multiplication module and an output transformation module.
All the transform circuits of Winograd of size F(2×2, 3×3) can be realized with additions and shifts; no multiplication is needed, which effectively reduces resource consumption, as the input-transform sketch below illustrates.
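A minimal behavioural sketch of the input transform B^T d B, assuming the commonly used transform matrices given above; the function and variable names are illustrative and not taken from the patent. Every operation is an addition or subtraction, matching the shift-and-add claim:

```python
def winograd_input_transform(d):
    """Compute B^T * d * B for a 4x4 input tile d using only add/subtract.

    Assumes B^T = [[1,0,-1,0],[0,1,1,0],[0,-1,1,0],[0,1,0,-1]].
    """
    # Left-multiply by B^T: each output row is a sum/difference of rows of d.
    t = [
        [d[0][j] - d[2][j] for j in range(4)],
        [d[1][j] + d[2][j] for j in range(4)],
        [d[2][j] - d[1][j] for j in range(4)],
        [d[1][j] - d[3][j] for j in range(4)],
    ]
    # Right-multiply by B: apply the same pattern to the columns of t.
    return [
        [t[i][0] - t[i][2], t[i][1] + t[i][2], t[i][2] - t[i][1], t[i][1] - t[i][3]]
        for i in range(4)
    ]

# Example: with identical rows, only the second row of the transformed tile is non-zero.
tile = [[1, 2, 3, 4] for _ in range(4)]
print(winograd_input_transform(tile))
```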
The matrix dot-product circuit consists of 16 8-bit fast multipliers. The multiplier structure is shown in FIG. 2, where A is the multiplicand and B is the multiplier; it comprises a radix-4 Booth encoding module (Booth Enc), a partial-product generation unit (Gen Prod), the partial products (Partial Prod), a 4:2 compressor (4:2 CSA), the carry data (Carry), the pseudo sum (Sum), and a carry-lookahead adder (LCA). The calculation proceeds as follows: the multiplier B is encoded by the radix-4 Booth encoder, the encoded result is combined with the multiplicand to generate 4 partial products, the partial products are sent to the 4:2 compressor for compression, and the compression result is summed by the LCA to obtain the final multiplication result.
The 8-bit multiplier B is encoded (Encode, Enc) as shown in FIG. 3: an auxiliary bit 0 is appended on the rightmost side, and groups of 3 adjacent bits are taken successively from low to high, with neighbouring groups overlapping by one bit. Enc_1 to Enc_4 are the 4 groups thus obtained, and the three bits of each group are B_(i+1), B_i and B_(i-1). Each group is encoded according to the formula below to obtain 4 encoding results, which are sent to the partial-product generator to produce, together with the multiplicand A, the 4 partial products P_1 to P_4. Because the Enc encoding results can only be -2, -1, 0, 1 or 2, the multiplication with A is achieved by shifting alone.
Enc = -2·B_(i+1) + B_i + B_(i-1)
PartialProd = Enc · A
The 4:2 CSA has 4 input ports, which receive the 4 partial products P_1, P_2, P_3 and P_4, and 2 output ports, which produce the carry data Carry and the pseudo sum Sum.
The final 16-bit carry-lookahead adder is formed by cascading 4-bit carry-lookahead adders.
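The recoding and partial-product arithmetic described above can be sketched in Python as follows; this is a behavioural model of the radix-4 Booth recoding and of the summation that the 4:2 compressor tree and carry-lookahead adder perform in hardware, with illustrative names that are not taken from the patent:

```python
def booth_radix4_encode(b: int):
    """Radix-4 Booth recoding of an 8-bit two's-complement multiplier.

    An auxiliary 0 bit is appended on the right, then overlapping 3-bit groups
    are scanned from low to high, giving Enc1..Enc4 in {-2, -1, 0, 1, 2}.
    """
    bits = (b & 0xFF) << 1                       # append auxiliary bit 0
    digits = []
    for k in range(4):                           # Enc1 .. Enc4
        group = (bits >> (2 * k)) & 0b111        # bits B(i+1), B(i), B(i-1)
        b_im1 = group & 1
        b_i = (group >> 1) & 1
        b_ip1 = (group >> 2) & 1
        digits.append(-2 * b_ip1 + b_i + b_im1)  # Enc = -2*B(i+1) + B(i) + B(i-1)
    return digits

def booth_multiply(a: int, b: int) -> int:
    """Multiply signed 8-bit a and b by summing the 4 Booth partial products.

    Each partial product Enc_k * a needs only a shift and possibly a negation;
    the hardware sums them with the 4:2 compressor and carry-lookahead adder,
    modelled here by the plain Python sum.
    """
    partials = [enc * a << (2 * k) for k, enc in enumerate(booth_radix4_encode(b))]
    return sum(partials)

assert booth_multiply(-113, 87) == -113 * 87     # behaves like ordinary multiplication
```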
DC synthesis is performed on the Verilog code of the Winograd computing unit using the FreePDK45 library, and the resulting delay and area are scaled to the 20 nm process of the FPGA architecture used for the architecture design. The hard core is given the same rectangular shape as the DSP block so that it can be placed in the FPGA architecture. The input/output pins of the hard core are defined as evenly distributed for better routability.
The Winograd hard-core computing unit of this embodiment is placed on the FPGA through an architecture designed with VTR. The FPGA architecture in VTR is represented by an XML architecture description file.
The XML architecture description file has <architecture> as its top layer, below which are the modules <models>, <tiles>, <layout>, <device>, <switchlist>, <segmentlist>, <directlist> and <complexblocklist>.
The FPGA architecture is modified on the basis of the Stratix IV XML architecture description file; adding the Winograd hard-core computing unit requires modifying the modelling of 4 modules under the top layer.
①<models>
<model> is used to declare the model names used in the blif netlist file. To instantiate a Winograd hard-core computing unit in the netlist, the model name and pin names must be declared in the XML.
When defining the pins, combinational_sink_ports must also be used to model the pin timing dependencies, establishing the dependency between inputs and outputs.
②<complexblocklist>
The <pb_type> under <complexblocklist> describes the internal structure of the Winograd hard-core computing unit, modelling its interior.
The modelling in <pb_type> has two layers: a top-level module and modelling primitives.
The top-level module must declare the module name and all port information.
Modelling primitives are the lowest layer in the hierarchy. Primitives correspond to elements that appear in the user netlist before the packing stage, and the models in the blif file must be described as modelling primitives inside <pb_type>. The Winograd hard-core computing unit is described as a black box, meaning that the interior of the hard core is not described in detail; only its ports and internal delays are given. The circuit critical-path delay obtained from DC synthesis is set using the delay_constant attribute in the modelling primitive.
Interconnect within the module is also required between the top-level module and the primitive module. The ports and pins used in the primitive module are connected to the ports declared by the top-level module through interconnect elements. <interconnect> is at the same level as the modelling primitives.
③<tiles>
<tiles> describes the external structure of the Winograd hard-core computing unit. The <tile> contains the name, length, width, area, number of input/output pins, pin positions, and how many wires each pin connects to. The circuit area obtained from DC synthesis is set in <tile>.
④<layout>
<layout> defines the layout information of the FPGA architecture; the various physical blocks in <tiles> are arranged in the grid in a prescribed order. Cells within the grid have priorities, with a high priority overriding a low one. The outermost ring of I/O has the highest priority, the interior is filled with LBs of the lowest priority, and the hard cores, with the next-highest priority, are then placed in the grid. The Winograd hard-core computing units must be arranged in the FPGA architecture according to a fixed rule.
To verify the effect of the Winograd hard-core computing unit, the test circuit is given the same convolution computing capability as a systolic array, i.e. the number of convolution frames that can be computed in the same number of clock cycles is identical. The Winograd circuit is implemented both on an FPGA containing only DSPs, LBs and input/output ports (IO) and on an FPGA containing Winograd hard-core computing units; the comparison covers the area consumed on the FPGA and the critical-path delay.
The channel width is the minimum number of routing channels the FPGA needs without affecting the circuit's maximum frequency; the minimum number of channels is found with a binary search. Two boundary variables delimit the search range and a third variable holds the midpoint of the two boundaries; the midpoint is compared against the target, and, on the premise that the maximum clock frequency at the right boundary stays unchanged, the boundary values are updated so that the search interval shrinks and a new midpoint is determined. This is repeated until the channel width is the minimum at which the maximum clock frequency is unchanged, at which point the loop exits (a sketch of this search is given below).
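A minimal sketch of this channel-width search, assuming a hypothetical route_and_report(width) helper that runs place-and-route at the given channel width and returns the achieved maximum clock frequency; the helper and all names are illustrative and not part of the patent or of any tool's API:

```python
def minimum_channel_width(route_and_report, low: int, high: int) -> int:
    """Binary-search the smallest channel width that preserves the max frequency.

    route_and_report(width) -> maximum clock frequency achieved at that width.
    `high` must be a width already known to reach the reference frequency.
    """
    target_fmax = route_and_report(high)   # frequency at the right boundary
    while low < high:
        mid = (low + high) // 2
        if route_and_report(mid) >= target_fmax:
            high = mid                     # mid is wide enough: shrink from the right
        else:
            low = mid + 1                  # too narrow: shrink from the left
    return high
```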
Figures 4, 5, 6 and 7 show the results for Winograd designs given the same convolution computing capability as systolic arrays, i.e. the frequency and area changes when convolutions are computed with systolic arrays of different sizes and with F(2×2, 3×3) Winograd hard cores (using the loose placement strategy). For a Winograd design with the same convolution computing capability as a 32×32 systolic array, using F(2×2, 3×3) Winograd hard cores reduces the total area by 53% and raises the clock frequency by 72% compared with the soft-logic implementation. As can be seen in FIG. 4, most of the area saved by the Winograd hard cores is area that was consumed by the interconnect.
FIG. 8 shows the three different topologies of Winograd hard-core computing units on the FPGA: loose, columnar and dense, from left to right. As shown in Figs. 9 and 10, the dense architecture has the largest total area consumption, while the loose architecture has the lowest of the three topologies, 43% less than the dense architecture. As shown in Figs. 11 and 12, the columnar architecture needs a larger channel width than the loose architecture: because the Winograd hard cores carry a large amount of data, packing them close together in columns causes heavier routing congestion, and the larger channel width also increases area consumption; at the same time the shorter paths reduce delay, so the maximum clock frequency is 28% higher than that of the dense architecture, as shown in Figs. 13 and 14. As shown in Fig. 15, taking the area-delay product as the evaluation criterion, the loose architecture has the smallest area-delay product.

Claims (2)

1. An FPGA architecture for fast convolution operations of a convolutional neural network, characterized in that: the FPGA architecture comprises a plurality of Winograd hard-core computing units, and the Winograd hard-core computing units are arranged in the FPGA in a loose layout;
each Winograd hard-core computing unit comprises an image data conversion module, a weight conversion module, a dot-multiplication module based on fast multipliers and an output conversion module; the input ends of the weight conversion module and the image data conversion module receive data, their output ends feed the dot-multiplication module, the output end of the dot-multiplication module feeds the input end of the output conversion module, and the output end of the output conversion module drives the unit's output;
the image data conversion module, the weight conversion module and the output conversion module are realized through shift and addition operations;
the dot-multiplication module is realized with a radix-4 Booth encoder and a Wallace tree;
in the loose layout, the Winograd hard cores are rectangular, 3 LBs are placed between a Winograd hard core and a DSP, 1 LB is placed between adjacent Winograd hard cores, and 1 LB is placed between a Winograd hard core and the I/O.
2. The FPGA architecture for convolutional neural network fast convolution operations of claim 1, characterized in that: the two 8-bit numbers received at the input end of the dot-multiplication module are the multiplicand and the multiplier; the multiplier is encoded by a radix-4 Booth encoder, the encoded multiplier and the multiplicand generate 4 partial products, the partial products are fed into a Wallace tree for 4:2 CSA compression, and the compressed result is summed by a carry-lookahead adder to obtain the final result, which is output from the output end to the output conversion module.
CN202211112093.8A 2022-09-13 2022-09-13 FPGA architecture for convolutional neural network fast convolutional operation Active CN115329951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211112093.8A CN115329951B (en) 2022-09-13 2022-09-13 FPGA architecture for convolutional neural network fast convolutional operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211112093.8A CN115329951B (en) 2022-09-13 2022-09-13 FPGA architecture for convolutional neural network fast convolutional operation

Publications (2)

Publication Number Publication Date
CN115329951A CN115329951A (en) 2022-11-11
CN115329951B true CN115329951B (en) 2023-09-15

Family

ID=83930414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211112093.8A Active CN115329951B (en) 2022-09-13 2022-09-13 FPGA architecture for convolutional neural network fast convolutional operation

Country Status (1)

Country Link
CN (1) CN115329951B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102355232A (en) * 2011-07-29 2012-02-15 北京航空航天大学 FPGA (field-programmable gate array)-based high-speed FIR (finite impulse response) digital filter
CN109447241A (en) * 2018-09-29 2019-03-08 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field
CN110766128A (en) * 2018-07-26 2020-02-07 北京深鉴智能科技有限公司 Convolution calculation unit, calculation method and neural network calculation platform
WO2021067230A1 (en) * 2019-09-30 2021-04-08 Board Of Regents, The University Of Texas System Field programmable gate array architecture optimized for machine learning applications
CN112949845A (en) * 2021-03-08 2021-06-11 内蒙古大学 Deep convolutional neural network accelerator based on FPGA
CN113283587A (en) * 2021-05-28 2021-08-20 西安交通大学 Winograd convolution operation acceleration method and acceleration module
CN114399036A (en) * 2022-01-12 2022-04-26 电子科技大学 Efficient convolution calculation unit based on one-dimensional Winograd algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chi Wai Yu, et al., "Routing Optimization For Hybrid FPGAs", 2009 International Conference on Field-Programmable Technology, pp. 419-422 *
潘明海 et al., "An FPGA-based FFT structure" (一种基于FPGA实现的FFT结构), 微计算机信息 (Microcomputer Information), No. 16, pp. 156-158 *

Also Published As

Publication number Publication date
CN115329951A (en) 2022-11-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant