CN116126283A - Resource occupancy rate optimization method of FPGA convolution accelerator - Google Patents

Resource occupancy rate optimization method of FPGA convolution accelerator

Info

Publication number
CN116126283A
CN116126283A
Authority
CN
China
Prior art keywords
bit
multiplier
addends
carry
fpga
Prior art date
Legal status
Granted
Application number
CN202310052344.6A
Other languages
Chinese (zh)
Other versions
CN116126283B (en)
Inventor
Ma Yanhua (马艳华)
Xu Qican (徐琪灿)
Chen Congcong (陈聪聪)
Song Zerui (宋泽睿)
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202310052344.6A priority Critical patent/CN116126283B/en
Publication of CN116126283A publication Critical patent/CN116126283A/en
Application granted granted Critical
Publication of CN116126283B publication Critical patent/CN116126283B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention belongs to the technical field of hardware accelerators and discloses a resource occupancy rate optimization method for an FPGA (field programmable gate array) convolution accelerator. The radix-4 Booth multiplier in the convolution accelerator is first optimized at the configurable-logic-block level, in the two aspects of the lookup table (LUT) and the carry chain, to reduce the LUT resources needed to implement a single multiplier. A partial-product merging method is then designed around the optimized multiplier, so that the multiply-accumulate operation skips explicit intermediate products and further saves LUT resources. Compared with conventional methods, the invention achieves performance comparable to designs that use approximate multipliers while incurring no loss of precision. It can also satisfy the need to deploy a large number of convolution processing units on an FPGA when DSP units are scarce and additional multipliers must be designed and deployed.

Description

Resource occupancy rate optimization method of FPGA convolution accelerator
Technical Field
The invention belongs to the technical field of hardware accelerators, and in particular concerns a method for optimizing the convolution computation unit of an FPGA accelerator so that it occupies fewer resources; specifically, it relates to a resource occupancy rate optimization method for an FPGA convolution accelerator.
Background
In recent years, with the development of neural networks and image processing, the need to perform convolution on hardware platforms has grown rapidly, and realizing this function requires a very large number of multiply-accumulate operations. Field-programmable gate arrays (FPGAs) have become a primary platform for convolution hardware acceleration thanks to their flexibility and low power consumption. When convolution processing units are implemented on an FPGA, vendors provide dedicated digital signal processing (DSP) units that implement fast multipliers, but these are fixed in location and limited in number. For multiplier-intensive applications, additional multipliers must therefore be designed to guarantee performance.
A great deal of research, both domestic and international, has aimed at improving the performance of convolution hardware accelerators. One line of work applies approximate computing to multipliers and adders to improve performance and energy efficiency, at the cost of some precision. Although hardware resources are saved compared with an exact multiplier, the loss of precision inevitably degrades accuracy in the application. Alternatively, special convolution algorithms such as the Winograd acceleration algorithm reduce the number of multiplications, but they increase the number of additions. Such methods also impose requirements on the ordering of the input data, which not only introduces extra latency but also challenges the memory resources on the FPGA. It should further be noted that most current FPGA convolution accelerator designs are conceived with migration to application-specific integrated circuits (ASICs) in mind and ignore the inherent architectural differences between FPGA and ASIC platforms, so their performance is limited when they are deployed directly on an FPGA. To address these problems, a convolution processing unit that is built from exact FPGA multipliers yet performs comparably to one built from approximate multipliers is of considerable engineering significance for FPGA convolution accelerators.
Disclosure of Invention
Conventional FPGA convolution processing units are designed mainly with portability to ASICs in mind, and the inherent architectural differences are ignored. Because most current designs also rely on approximate computing or on special convolution algorithms, their performance is limited in various ways, leaving substantial room for optimizing FPGA-based convolution processing units. To address these problems, the invention provides a design method for a convolution processing unit on the FPGA platform. The aim is to make full use of the configurable logic blocks of the FPGA and exploit the platform's particular strengths, so that the convolution processing unit reaches performance comparable to that of an approximate multiplier while using exact multipliers, places no special requirement on the ordering of the input data, and occupies no additional storage.
The technical scheme of the invention is as follows:
The resource occupancy rate optimization method for an FPGA convolution accelerator comprises the following specific steps (a behavioral Python sketch of the procedure is given at the end of this section):
Step 1: design a radix-4 Booth multiplier using the sign-bit extension method:
Step 1.1: according to the bit width n of the multiplier, design n/2 radix-4 Booth encoders; each encoder takes bits 2i+1, 2i and 2i-1 of the multiplier as its operand, encodes the other factor using only shifts and inversion, and outputs a partial product, where i is the index of the radix-4 Booth encoder, i=0, …, n/2-1;
Step 1.2: following step 1.1 and starting from i=0, decompose the multiplier of bit width n into n/2 operands and output n/2 partial products; if 2i-1 < 0, that bit is padded with 0;
Step 1.3: apply the sign-bit extension method to all partial products as follows: first, add the lowest bit of each partial product to the highest bit of the corresponding operand; second, invert the sign bit of each partial product; then add 1 at the highest position of the first partial product; next, prepend a 1 before the highest bit of every partial product; finally, combine all partial products into a Wallace tree;
Step 1.4: for the Wallace tree, within each bit position take every 3 addends as a group and compress them into two bits, one in the current position and one in the next position, until fewer than 3 addends remain in every position; add the remaining rows to obtain the multiplication result;
Step 2: optimize the radix-4 Booth multiplier based on the FPGA's basic configurable logic block, in two aspects: LUT-based optimization and carry-chain-based optimization;
Step 2.1: LUT-based optimization: within each bit position, if the number of addends is greater than or equal to 5, compress every 5 addends into 3, namely one bit in the current position and two carry bits; if the number of addends is less than 5, compress them with the ordinary Wallace-tree method, turning every 3 addends into 1 bit in the current position and 1 carry; repeat until fewer than 3 addends remain in every position;
Step 2.2: carry-chain-based optimization: use carry chains instead of LUTs to compress the Wallace tree; the most significant bit of each operand added in the Wallace tree is compressed as the carry transported by a carry chain, and each use of a carry chain must compress two 8-bit numbers plus one carry into one 9-bit number; repeat until fewer than 3 addends remain in every position, then sum;
Step 3: use the optimized multiplier to design a convolution processing unit based on the partial-product merging method;
Step 3.1: first determine the number of multiplications m required by the convolution processing unit, generate partial products for all multiplications with m×n/2 encoders, and apply step 1.3 to each multiplication separately, producing m Wallace trees;
Step 3.2: merge the contents of the m Wallace trees into a single Wallace tree, and compute in advance the additions whose values are already determined, namely the 1 added above the most significant bit of the first partial product and the 1s prepended before the most significant bit of every partial product, across the m Wallace trees;
Step 3.3: apply step 2 to compress the merged Wallace tree with LUTs or carry chains, obtaining the convolution result while saving a large amount of hardware resources.
The invention has the following beneficial effects. The radix-4 Booth multiplier in the convolution accelerator is first optimized at the configurable-logic-block level, in the two aspects of the lookup table (LUT) and the carry chain, to reduce the LUT resources needed to implement a single multiplier. A partial-product merging method is then designed around the optimized multiplier, so that the multiply-accumulate operation skips explicit intermediate products and further saves LUT resources. Compared with conventional methods, the invention achieves performance comparable to designs that use approximate multipliers while incurring no loss of precision. It can also satisfy the need to deploy a large number of convolution processing units on an FPGA when DSP units are scarce and additional multipliers must be designed and deployed.
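To make the procedure concrete, the following is a behavioral Python sketch of steps 1.1-1.4 and of the LUT-oriented 5:3 compression of step 2.1 (the carry-chain folding of step 2.2 and the merged reduction of step 3 are sketched with the detailed embodiment below). It models values rather than the bit-level netlist; the function names, the full-width one's-complement handling of negative Booth digits, and the bit-counting representation of the Wallace tree are illustrative assumptions, not the patent's exact circuit.

```python
def booth_partial_products(a, b, n, width=None):
    """Radix-4 Booth recoding (steps 1.1-1.2): return non-negative addends
    whose sum modulo 2**width equals the signed product a*b. Negative Booth
    digits are realised as a one's complement plus a separate +1 correction
    addend, in the spirit of step 1.3 (assumption: the +1 is injected at bit 0
    of the full word, which is equivalent modulo the word width to the
    hardware's placement at the partial product's least significant bit)."""
    width = 2 * n if width is None else width
    mask = (1 << width) - 1
    ub = b & ((1 << n) - 1)                           # two's-complement image of the multiplier
    addends = []
    for i in range(n // 2):                           # one radix-4 Booth encoder per digit
        lo = (ub >> (2 * i - 1)) & 1 if i > 0 else 0  # bit 2i-1, zero-padded when 2i-1 < 0
        mid = (ub >> (2 * i)) & 1
        hi = (ub >> (2 * i + 1)) & 1
        digit = -2 * hi + mid + lo                    # Booth digit in {-2,-1,0,+1,+2} (Table 1)
        pp = (abs(digit) * a) << (2 * i)              # |digit| * A, shifted to weight 4**i
        if digit < 0:
            addends.append((~pp) & mask)              # one's complement of the partial product
            addends.append(1)                         # +1 correction completing the negation
        else:
            addends.append(pp & mask)
    return addends


def wallace_3to2_reduce(addends, width):
    """Step 1.4 as a counting model: spread the addends into per-bit columns,
    turn three set bits of a column into one bit here plus one carry in the
    next column, then add the (at most two) surviving rows."""
    cols = [0] * (width + 1)
    for v in addends:
        for bit in range(width):
            cols[bit] += (v >> bit) & 1
    while any(c > 2 for c in cols[:width]):
        for bit in range(width):
            while cols[bit] >= 3:       # 3:2 compressor: three ones = 1 + 2
                cols[bit] -= 2
                cols[bit + 1] += 1
    return sum(c << bit for bit, c in enumerate(cols)) & ((1 << width) - 1)


def lut_5to3_reduce(addends, width):
    """Step 2.1 as a counting model: columns holding five or more bits are
    reduced with a 5:3 counter (five ones = one bit of weight 1 plus one of
    weight 4), columns holding three or four bits fall back to the 3:2
    compressor, until every column holds at most two bits."""
    cols = [0] * (width + 2)
    for v in addends:
        for bit in range(width):
            cols[bit] += (v >> bit) & 1
    while any(c > 2 for c in cols[:width]):
        for bit in range(width):
            while cols[bit] >= 5:       # 5:3 counter: five ones = 4 + 1
                cols[bit] -= 4
                cols[bit + 2] += 1
            while cols[bit] >= 3:       # 3:2 compressor: three ones = 2 + 1
                cols[bit] -= 2
                cols[bit + 1] += 1
    return sum(c << bit for bit, c in enumerate(cols)) & ((1 << width) - 1)


if __name__ == "__main__":
    import random
    n, mask = 8, (1 << 16) - 1
    for _ in range(2000):
        a = random.randint(-128, 127)
        b = random.randint(-128, 127)
        pps = booth_partial_products(a, b, n)
        assert wallace_3to2_reduce(pps, 16) == (a * b) & mask
        assert lut_5to3_reduce(pps, 16) == (a * b) & mask
    print("Booth recoding and both column reductions reproduce a*b for all sampled 8-bit pairs")
```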
Drawings
FIG. 1 is a schematic diagram of the carry-chain-optimized 3×3 convolution processing unit designed in the present invention, where W_n and I_n denote the two factors of the nth multiplication, n=1, …, 9.
FIG. 2 is a schematic diagram of the radix-4 Booth 8×8 multiplication method used in the present invention, where A and B are the factors of the multiplication, PP denotes a generated partial product, and B_m denotes the mth bit of B, m=0, …, 7.
FIG. 3 is a schematic of the carry-chain-optimized Wallace-tree reduction for the 8×8 multiplier in the present invention; the dashed boxes indicate computations performed on carry chains, with five carry chains in total completing one multiplier.
Detailed Description
The invention is further described below with reference to the accompanying drawings and a specific embodiment in which 8×8 multipliers are used to implement a 3×3 convolution processing unit optimized with carry chains.
Step 1: design the 8×8 radix-4 Booth multiplier:
Step 1.1: the multiplier is 8 bits wide, so 4 radix-4 Booth encoders are designed; each encoder takes bits 2i+1, 2i and 2i-1 of the multiplier as its operand, encodes the other factor using only shifts and inversion, and outputs a partial product, where i is the encoder index, i=0, 1, 2, 3. The hardware encoding scheme is shown in Table 1.
Step 1.2: following step 1.1 and starting from i=0, bit -1 is padded with 0, the 8-bit multiplier is decomposed into 4 operands, and 4 partial products are output.
Step 1.3: the sign-bit extension method is applied to all partial products as follows: first, the lowest bit of each partial product is added to the highest bit of the corresponding operand; second, the sign bit of each partial product is inverted; then 1 is added at the highest position of the first partial product; next, a 1 is prepended before the highest bit of every partial product; finally, all partial products are combined into a Wallace tree. The Wallace tree generated is shown in the lower part of FIG. 2.
Step 1.4: for the Wallace tree, within each bit position every 3 addends are taken as a group and compressed into two bits, one in the current position and one in the next position, until fewer than 3 addends remain in every position; the remaining rows are added to obtain the multiplication result. The structure of the radix-4 Booth multiplier is shown in FIG. 2.
Step 2: carry-chain optimization of the radix-4 Booth multiplier based on the FPGA's basic configurable logic block.
Step 2.1: carry-chain-based optimization: carry chains are used instead of LUTs to compress the Wallace tree. The most significant bit of each operand added in the Wallace tree is compressed as the carry transported by a carry chain; each use of a carry chain must compress two 8-bit numbers plus one carry into one 9-bit number; this is repeated until fewer than 3 addends remain in every bit position, after which the result is summed. The compression of the Wallace tree generated by a single 8×8 radix-4 Booth multiplier is shown in FIG. 3.
Step 3: the optimized multiplier is used to design a 3×3 convolution processing unit based on partial-product merging; its structure is shown in FIG. 1.
Step 3.1: step 1.3 is performed separately for the different multiplications, producing 9 Wallace trees; 36 radix-4 Booth encoders are used to generate the partial products of the 9 required multiplications.
Step 3.2: the contents of the 9 Wallace trees are merged into a single Wallace tree, and the additions whose values are already determined are computed in advance; in particular, the nine 1s falling in the same bit position are pre-computed as the binary value 1001.
Step 3.3: step 2 is executed to compress the merged Wallace tree with carry chains, and the convolution result is obtained without computing any intermediate products, saving a large amount of hardware resources. The results of a 3×3 convolution computation unit implemented with the present invention are compared with other designs in Table 2.
In summary, the invention provides an optimized design method for an FPGA-based convolution processing unit. The method comprises two parts: optimizing the radix-4 Booth multiplier based on the basic configurable logic block of the FPGA platform, and using the optimized multiplier to design a convolution processing unit based on partial-product merging. The optimized multiplier approaches an approximate multiplier in performance, and the merged-partial-product convolution processing unit further reduces LUT usage on the FPGA.
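As a complement to the embodiment above, the following is a value-level sketch of the carry-chain reduction of step 2.1 of the embodiment: each two-input addition stands in for one carry-chain use (an FPGA carry chain realises exactly the two-numbers-plus-carry addition described above), and a pending +1 correction term from the Booth recoding is absorbed as the chain's carry-in when one is available. The pairing order and widths are illustrative assumptions; the real mapping follows FIG. 3. The helper booth_partial_products is the one sketched at the end of the Disclosure of Invention section, and the operand values are arbitrary.

```python
def carry_chain_reduce(addends, width):
    """Fold the addend list with two-input additions, each standing for one
    carry-chain use; a pending +1 correction term is absorbed as the chain's
    carry-in when available. Returns the sum and the number of chain adds."""
    mask = (1 << width) - 1
    work = sorted(addends)             # +1 correction terms end up at the front
    chains = 0
    while len(work) > 1:
        a = work.pop()                 # two wide addends from the back ...
        b = work.pop()
        cin = work.pop(0) if work and work[0] == 1 else 0   # ... plus a carry-in
        work.append((a + b + cin) & mask)
        chains += 1
    return work[0] & mask, chains


# Quick check at the 8x8 size of this embodiment (arbitrary operand values).
a, b, n = -77, 102, 8
total, chains = carry_chain_reduce(booth_partial_products(a, b, n), 2 * n)
assert total == (a * b) & 0xFFFF
print("product reproduced using", chains, "carry-chain additions")
```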
Table 1. Hardware encoding of the radix-4 Booth encoder, where S denotes the sign bit of A, and B_2i-1, B_2i, B_2i+1 denote bits 2i-1, 2i and 2i+1 of B, i=0, 1, 2, 3.
B_2i+1  B_2i  B_2i-1 | Hardware operation performed on A
  0       0      0   | 0
  0       0      1   | {S,A}
  0       1      0   | {S,A}
  0       1      1   | {A,0}
  1       0      0   | ~{A,0}
  1       0      1   | ~{S,A}
  1       1      0   | ~{S,A}
  1       1      1   | 0
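For cross-reference, Table 1 can be transcribed as a small Python lookup and checked against the arithmetic recoding used in the sketches above. The dictionary is only a reading aid; the ~ rows are one's complements whose completing +1 is supplied by step 1.3.

```python
# Table 1 as a lookup: (B_2i+1, B_2i, B_2i-1) -> (hardware operation on A, Booth digit).
BOOTH_OP = {
    (0, 0, 0): ("0",       0),
    (0, 0, 1): ("{S,A}",  +1),
    (0, 1, 0): ("{S,A}",  +1),
    (0, 1, 1): ("{A,0}",  +2),
    (1, 0, 0): ("~{A,0}", -2),
    (1, 0, 1): ("~{S,A}", -1),
    (1, 1, 0): ("~{S,A}", -1),
    (1, 1, 1): ("0",       0),
}

# Every row agrees with the arithmetic recoding -2*B_2i+1 + B_2i + B_2i-1.
for (b_hi, b_mid, b_lo), (_, digit) in BOOTH_OP.items():
    assert digit == -2 * b_hi + b_mid + b_lo
```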
Table 2. Results for a 3×3 convolution processing unit designed with the method of the present invention, compared with convolution processing units of the same function designed with other methods; the data bit width is 8.
Method                      | LUTs used | Maximum operating frequency (MHz) | Power (mW)
Method of the invention     | 362       | 395                               | 71
Radix-4 Booth multiplier    | 761       | 370                               | 81
Approximate multiplier      | 570       | 397                               | 89
Vendor-supplied design      | 725       | 383                               | 76
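Independently of the hardware figures in Table 2, the arithmetic of the merged-partial-product reduction of step 3 can be checked in software. The sketch below gathers the Booth partial products of all nine multiplications into a single column reduction and compares the result with the direct dot product. The 20-bit accumulator width is an assumption chosen so that the signed sum of nine 8×8-bit products fits, the constant 1s of step 3.2 are not pre-folded here, and booth_partial_products / wallace_3to2_reduce are the helpers sketched at the end of the Disclosure of Invention section.

```python
import random

def conv3x3_merged(weights, pixels, n=8, width=20):
    """Step 3 as a value-level model: gather the Booth partial products of all
    nine multiplications into one list and reduce them in a single merged
    Wallace tree, so no per-multiplication product is ever materialised."""
    addends = []
    for w, x in zip(weights, pixels):
        addends += booth_partial_products(w, x, n, width)   # merge all trees
    return wallace_3to2_reduce(addends, width)              # one shared reduction

random.seed(2023)
w = [random.randint(-128, 127) for _ in range(9)]
x = [random.randint(-128, 127) for _ in range(9)]
direct = sum(wi * xi for wi, xi in zip(w, x))
assert conv3x3_merged(w, x) == direct & ((1 << 20) - 1)
print("merged Wallace-tree reduction matches the direct 3x3 dot product")
```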

Claims (1)

1. A resource occupancy rate optimization method for an FPGA convolution accelerator, characterized by comprising the following steps:
Step 1: design a radix-4 Booth multiplier using the sign-bit extension method:
Step 1.1: according to the bit width n of the multiplier, design n/2 radix-4 Booth encoders; each encoder takes bits 2i+1, 2i and 2i-1 of the multiplier as its operand, encodes the other factor using only shifts and inversion, and outputs a partial product, where i is the index of the radix-4 Booth encoder, i=0, …, n/2-1;
Step 1.2: following step 1.1 and starting from i=0, decompose the multiplier of bit width n into n/2 operands and output n/2 partial products; if 2i-1 < 0, that bit is padded with 0;
Step 1.3: apply the sign-bit extension method to all partial products as follows: first, add the lowest bit of each partial product to the highest bit of the corresponding operand; second, invert the sign bit of each partial product; then add 1 at the highest position of the first partial product; next, prepend a 1 before the highest bit of every partial product; finally, combine all partial products into a Wallace tree;
Step 1.4: for the Wallace tree, within each bit position take every 3 addends as a group and compress them into two bits, one in the current position and one in the next position, until fewer than 3 addends remain in every position; add the remaining rows to obtain the multiplication result;
Step 2: optimize the radix-4 Booth multiplier based on the FPGA's basic configurable logic block, in two aspects: LUT-based optimization and carry-chain-based optimization;
Step 2.1: LUT-based optimization: within each bit position, if the number of addends is greater than or equal to 5, compress every 5 addends into 3, namely one bit in the current position and two carry bits; if the number of addends is less than 5, compress them with the ordinary Wallace-tree method, turning every 3 addends into 1 bit in the current position and 1 carry; repeat until fewer than 3 addends remain in every position;
Step 2.2: carry-chain-based optimization: use carry chains instead of LUTs to compress the Wallace tree; the most significant bit of each operand added in the Wallace tree is compressed as the carry transported by a carry chain, and each use of a carry chain must compress two 8-bit numbers plus one carry into one 9-bit number; repeat until fewer than 3 addends remain in every position, then sum;
Step 3: use the optimized multiplier to design a convolution processing unit based on the partial-product merging method;
Step 3.1: first determine the number of multiplications m required by the convolution processing unit, generate partial products for all multiplications with m×n/2 encoders, and apply step 1.3 to each multiplication separately, producing m Wallace trees;
Step 3.2: merge the contents of the m Wallace trees into a single Wallace tree, and compute in advance the additions whose values are already determined, namely the 1 added above the most significant bit of the first partial product and the 1s prepended before the most significant bit of every partial product, across the m Wallace trees;
Step 3.3: apply step 2 to compress the merged Wallace tree with LUTs or carry chains, obtaining the convolution result while saving a large amount of hardware resources.
CN202310052344.6A 2023-02-02 2023-02-02 Resource occupancy rate optimization method of FPGA convolution accelerator Active CN116126283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310052344.6A CN116126283B (en) 2023-02-02 2023-02-02 Resource occupancy rate optimization method of FPGA convolution accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310052344.6A CN116126283B (en) 2023-02-02 2023-02-02 Resource occupancy rate optimization method of FPGA convolution accelerator

Publications (2)

Publication Number Publication Date
CN116126283A true CN116126283A (en) 2023-05-16
CN116126283B CN116126283B (en) 2023-08-08

Family

ID=86296964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310052344.6A Active CN116126283B (en) 2023-02-02 2023-02-02 Resource occupancy rate optimization method of FPGA convolution accelerator

Country Status (1)

Country Link
CN (1) CN116126283B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5944776A (en) * 1996-09-27 1999-08-31 Sun Microsystems, Inc. Fast carry-sum form booth encoder
DE102018115219A1 (en) * 2017-07-14 2019-01-17 Intel Corporation Systems and methods for mapping reduction operations
CN112540743A (en) * 2020-12-21 2021-03-23 清华大学 Signed multiplication accumulator and method for reconfigurable processor
CN113872608A (en) * 2021-12-01 2021-12-31 中国人民解放军海军工程大学 Wallace tree compressor based on Xilinx FPGA primitive

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUKUN LI: "Approximate Logic Synthesis and its Application in Image Signal Processor", IEEE *
ZHOU Wanting; LI Lei: "Design and Implementation of a High-Speed 32×32 Multiplier Based on Radix-4 Booth Encoding", Journal of University of Electronic Science and Technology of China, no. 1 *
XIA Wei; XIAO Peng: "An Efficient Double-Precision Floating-Point Multiplier", Computer Measurement & Control, no. 04 *
LI Wei; DAI Zibin; CHEN Tao: "A Low-Power 32-bit Multiplier Based on a Skipping Wallace Tree", Computer Engineering, no. 17 *

Also Published As

Publication number Publication date
CN116126283B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN111832719A (en) Fixed point quantization convolution neural network accelerator calculation circuit
Bhattacharya et al. A high performance binary to BCD converter for decimal multiplication
Al-Khaleel et al. Fast and compact binary-to-BCD conversion circuits for decimal multiplication
CN110955403B (en) Approximate base-8 Booth encoder and approximate binary multiplier of mixed Booth encoding
Srinivas et al. New realization of low area and high-performance Wallace tree multipliers using booth recoding unit
Ahmad et al. An efficient approximate sum of absolute differences hardware for FPGAs
CN116205244B (en) Digital signal processing structure
CN116126283B (en) Resource occupancy rate optimization method of FPGA convolution accelerator
CN110837624B (en) Approximation calculation device for sigmoid function
Gao et al. Efficient realization of bcd multipliers using fpgas
da Rosa et al. The Radix-2^m Squared Multiplier
CN114237550B (en) Wallace tree-based multi-input shift sum accumulator
Kumar et al. Complex multiplier: implementation using efficient algorithms for signal processing application
CN113157247B (en) Reconfigurable integer-floating point multiplier
US7840628B2 (en) Combining circuitry
Nezhad et al. High-speed multiplier design using multi-operand multipliers
El Atre et al. Design and implementation of new delay-efficient/configurable multiplier using FPGA
Jaberipur et al. Posibits, negabits, and their mixed use in efficient realization of arithmetic algorithms
James et al. Performance analysis of double digit decimal multiplier on various FPGA logic families
Li A Single Precision Floating Point Multiplier for Machine Learning Hardware Acceleration
CN116151340B (en) Parallel random computing neural network system and hardware compression method and system thereof
Veena et al. Energy Scalable Brent Kung Adder with Non-Zeroing Bit Truncation
Rocha et al. Improving the Partial Product Tree Compression on Signed Radix-2^m Parallel Multipliers
Ramya et al. Implementation of High Speed FFT using Reversible Logic Gates for Wireless DSP Applications
Mohanty Design and implementation of faster and low power multipliers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant