CN116126283A - Resource occupancy rate optimization method of FPGA convolution accelerator - Google Patents

Resource occupancy rate optimization method of FPGA convolution accelerator

Info

Publication number
CN116126283A
CN116126283A
Authority
CN
China
Prior art keywords
bit
multiplier
addends
carry
fpga
Prior art date
Legal status
Granted
Application number
CN202310052344.6A
Other languages
Chinese (zh)
Other versions
CN116126283B (en)
Inventor
Ma Yanhua (马艳华)
Xu Qican (徐琪灿)
Chen Congcong (陈聪聪)
Song Zerui (宋泽睿)
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202310052344.6A priority Critical patent/CN116126283B/en
Publication of CN116126283A publication Critical patent/CN116126283A/en
Application granted granted Critical
Publication of CN116126283B publication Critical patent/CN116126283B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention belongs to the technical field of hardware accelerators and discloses a resource occupancy rate optimization method for an FPGA (field programmable gate array) convolution accelerator. The radix-4 Booth multiplier in the convolution accelerator is first optimized at the configurable-logic-block level, in the two aspects of the lookup table (LUT) and the carry chain, to reduce the LUT resources needed to implement a single multiplier. A partial-product merging method is then designed around the optimized multiplier, so that the multiply-accumulate operation skips explicit intermediate products and further saves LUT resources. Compared with conventional methods, the invention achieves performance comparable to designs that use approximate multipliers while incurring no loss of precision. It can also satisfy the need to deploy a large number of convolution processing units on an FPGA when DSP units are scarce and additional multipliers must be designed and deployed.

Description

Resource occupancy rate optimization method of FPGA convolution accelerator
Technical Field
The invention belongs to the technical field of hardware accelerators, and in particular concerns a method for optimizing the convolution computation unit of an FPGA accelerator so that it occupies fewer resources; specifically, it relates to a resource occupancy rate optimization method for an FPGA convolution accelerator.
Background
In recent years, with the development of neural networks and image processing, the need to perform convolution on hardware platforms has grown rapidly, and realizing this function requires a very large number of multiply-accumulate operations. Field-programmable gate arrays (FPGAs) have become a primary platform for convolution hardware acceleration thanks to their flexibility and low power consumption. When convolution processing units are implemented on an FPGA, vendors provide dedicated digital signal processing (DSP) units that implement fast multipliers, but these are fixed in location and limited in number. For multiplier-intensive applications, additional multipliers must therefore be designed to guarantee performance.
A great deal of research, both domestic and international, has aimed at improving the performance of convolution hardware accelerators. One line of work applies approximate computing to multipliers and adders to improve performance and energy efficiency, at the cost of some precision. Although hardware resources are saved compared with an exact multiplier, the loss of precision inevitably degrades accuracy in the application. Alternatively, special convolution algorithms such as the Winograd acceleration algorithm reduce the number of multiplications, but they increase the number of additions. Such methods also impose requirements on the ordering of the input data, which not only introduces extra latency but also challenges the memory resources on the FPGA. It should further be noted that most current FPGA convolution accelerator designs are conceived with migration to application-specific integrated circuits (ASICs) in mind and ignore the inherent architectural differences between FPGA and ASIC platforms, so their performance is limited when they are deployed directly on an FPGA. To address these problems, a convolution processing unit that is built from exact FPGA multipliers yet performs comparably to one built from approximate multipliers is of considerable engineering significance for FPGA convolution accelerators.
Disclosure of Invention
Conventional FPGA convolution processing units are designed mainly with portability to ASICs in mind, and the inherent architectural differences are ignored. Because most current designs also rely on approximate computing or on special convolution algorithms, their performance is limited in various ways, leaving substantial room for optimizing FPGA-based convolution processing units. To address these problems, the invention provides a design method for a convolution processing unit on the FPGA platform. The aim is to make full use of the configurable logic blocks of the FPGA and exploit the platform's particular strengths, so that the convolution processing unit reaches performance comparable to that of an approximate multiplier while using exact multipliers, places no special requirement on the ordering of the input data, and occupies no additional storage.
The technical scheme of the invention is as follows:
The resource occupancy rate optimization method for an FPGA convolution accelerator comprises the following specific steps (a behavioral Python sketch of the procedure is given at the end of this section):
Step 1: design a radix-4 Booth multiplier using the sign-bit extension method:
Step 1.1: according to the bit width n of the multiplier, design n/2 radix-4 Booth encoders; each encoder takes bits 2i+1, 2i and 2i-1 of the multiplier as its operand, encodes the other factor using only shifts and inversion, and outputs a partial product, where i is the index of the radix-4 Booth encoder, i=0, …, n/2-1;
Step 1.2: following step 1.1 and starting from i=0, decompose the multiplier of bit width n into n/2 operands and output n/2 partial products; if 2i-1 < 0, that bit is padded with 0;
Step 1.3: apply the sign-bit extension method to all partial products as follows: first, add the lowest bit of each partial product to the highest bit of the corresponding operand; second, invert the sign bit of each partial product; then add 1 at the highest position of the first partial product; next, prepend a 1 before the highest bit of every partial product; finally, combine all partial products into a Wallace tree;
Step 1.4: for the Wallace tree, within each bit position take every 3 addends as a group and compress them into two bits, one in the current position and one in the next position, until fewer than 3 addends remain in every position; add the remaining rows to obtain the multiplication result;
Step 2: optimize the radix-4 Booth multiplier based on the FPGA's basic configurable logic block, in two aspects: LUT-based optimization and carry-chain-based optimization;
Step 2.1: LUT-based optimization: within each bit position, if the number of addends is greater than or equal to 5, compress every 5 addends into 3, namely one bit in the current position and two carry bits; if the number of addends is less than 5, compress them with the ordinary Wallace-tree method, turning every 3 addends into 1 bit in the current position and 1 carry; repeat until fewer than 3 addends remain in every position;
Step 2.2: carry-chain-based optimization: use carry chains instead of LUTs to compress the Wallace tree; the most significant bit of each operand added in the Wallace tree is compressed as the carry transported by a carry chain, and each use of a carry chain must compress two 8-bit numbers plus one carry into one 9-bit number; repeat until fewer than 3 addends remain in every position, then sum;
Step 3: use the optimized multiplier to design a convolution processing unit based on the partial-product merging method;
Step 3.1: first determine the number of multiplications m required by the convolution processing unit, generate partial products for all multiplications with m×n/2 encoders, and apply step 1.3 to each multiplication separately, producing m Wallace trees;
Step 3.2: merge the contents of the m Wallace trees into a single Wallace tree, and compute in advance the additions whose values are already determined, namely the 1 added above the most significant bit of the first partial product and the 1s prepended before the most significant bit of every partial product, across the m Wallace trees;
Step 3.3: apply step 2 to compress the merged Wallace tree with LUTs or carry chains, obtaining the convolution result while saving a large amount of hardware resources.
The invention has the following beneficial effects. The radix-4 Booth multiplier in the convolution accelerator is first optimized at the configurable-logic-block level, in the two aspects of the lookup table (LUT) and the carry chain, to reduce the LUT resources needed to implement a single multiplier. A partial-product merging method is then designed around the optimized multiplier, so that the multiply-accumulate operation skips explicit intermediate products and further saves LUT resources. Compared with conventional methods, the invention achieves performance comparable to designs that use approximate multipliers while incurring no loss of precision. It can also satisfy the need to deploy a large number of convolution processing units on an FPGA when DSP units are scarce and additional multipliers must be designed and deployed.
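To make the procedure concrete, the following is a behavioral Python sketch of steps 1.1-1.4 and of the LUT-oriented 5:3 compression of step 2.1 (the carry-chain folding of step 2.2 and the merged reduction of step 3 are sketched with the detailed embodiment below). It models values rather than the bit-level netlist; the function names, the full-width one's-complement handling of negative Booth digits, and the bit-counting representation of the Wallace tree are illustrative assumptions, not the patent's exact circuit.

```python
def booth_partial_products(a, b, n, width=None):
    """Radix-4 Booth recoding (steps 1.1-1.2): return non-negative addends
    whose sum modulo 2**width equals the signed product a*b. Negative Booth
    digits are realised as a one's complement plus a separate +1 correction
    addend, in the spirit of step 1.3 (assumption: the +1 is injected at bit 0
    of the full word, which is equivalent modulo the word width to the
    hardware's placement at the partial product's least significant bit)."""
    width = 2 * n if width is None else width
    mask = (1 << width) - 1
    ub = b & ((1 << n) - 1)                           # two's-complement image of the multiplier
    addends = []
    for i in range(n // 2):                           # one radix-4 Booth encoder per digit
        lo = (ub >> (2 * i - 1)) & 1 if i > 0 else 0  # bit 2i-1, zero-padded when 2i-1 < 0
        mid = (ub >> (2 * i)) & 1
        hi = (ub >> (2 * i + 1)) & 1
        digit = -2 * hi + mid + lo                    # Booth digit in {-2,-1,0,+1,+2} (Table 1)
        pp = (abs(digit) * a) << (2 * i)              # |digit| * A, shifted to weight 4**i
        if digit < 0:
            addends.append((~pp) & mask)              # one's complement of the partial product
            addends.append(1)                         # +1 correction completing the negation
        else:
            addends.append(pp & mask)
    return addends


def wallace_3to2_reduce(addends, width):
    """Step 1.4 as a counting model: spread the addends into per-bit columns,
    turn three set bits of a column into one bit here plus one carry in the
    next column, then add the (at most two) surviving rows."""
    cols = [0] * (width + 1)
    for v in addends:
        for bit in range(width):
            cols[bit] += (v >> bit) & 1
    while any(c > 2 for c in cols[:width]):
        for bit in range(width):
            while cols[bit] >= 3:       # 3:2 compressor: three ones = 1 + 2
                cols[bit] -= 2
                cols[bit + 1] += 1
    return sum(c << bit for bit, c in enumerate(cols)) & ((1 << width) - 1)


def lut_5to3_reduce(addends, width):
    """Step 2.1 as a counting model: columns holding five or more bits are
    reduced with a 5:3 counter (five ones = one bit of weight 1 plus one of
    weight 4), columns holding three or four bits fall back to the 3:2
    compressor, until every column holds at most two bits."""
    cols = [0] * (width + 2)
    for v in addends:
        for bit in range(width):
            cols[bit] += (v >> bit) & 1
    while any(c > 2 for c in cols[:width]):
        for bit in range(width):
            while cols[bit] >= 5:       # 5:3 counter: five ones = 4 + 1
                cols[bit] -= 4
                cols[bit + 2] += 1
            while cols[bit] >= 3:       # 3:2 compressor: three ones = 2 + 1
                cols[bit] -= 2
                cols[bit + 1] += 1
    return sum(c << bit for bit, c in enumerate(cols)) & ((1 << width) - 1)


if __name__ == "__main__":
    import random
    n, mask = 8, (1 << 16) - 1
    for _ in range(2000):
        a = random.randint(-128, 127)
        b = random.randint(-128, 127)
        pps = booth_partial_products(a, b, n)
        assert wallace_3to2_reduce(pps, 16) == (a * b) & mask
        assert lut_5to3_reduce(pps, 16) == (a * b) & mask
    print("Booth recoding and both column reductions reproduce a*b for all sampled 8-bit pairs")
```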
Drawings
FIG. 1 is a schematic diagram of the carry-chain-optimized 3×3 convolution processing unit designed in the present invention, where W_n and I_n denote the two factors of the nth multiplication, n=1, …, 9.
FIG. 2 is a schematic diagram of the radix-4 Booth 8×8 multiplication method used in the present invention, where A and B are the factors of the multiplication, PP denotes a generated partial product, and B_m denotes the mth bit of B, m=0, …, 7.
FIG. 3 is a schematic of the carry-chain-optimized Wallace-tree reduction for the 8×8 multiplier in the present invention; the dashed boxes indicate computations performed on carry chains, with five carry chains in total completing one multiplier.
Detailed Description
The invention is further described below with reference to the accompanying drawings and a specific embodiment in which 8×8 multipliers are used to implement a 3×3 convolution processing unit optimized with carry chains.
Step 1: design the 8×8 radix-4 Booth multiplier:
Step 1.1: the multiplier is 8 bits wide, so 4 radix-4 Booth encoders are designed; each encoder takes bits 2i+1, 2i and 2i-1 of the multiplier as its operand, encodes the other factor using only shifts and inversion, and outputs a partial product, where i is the encoder index, i=0, 1, 2, 3. The hardware encoding scheme is shown in Table 1.
Step 1.2: following step 1.1 and starting from i=0, bit -1 is padded with 0, the 8-bit multiplier is decomposed into 4 operands, and 4 partial products are output.
Step 1.3: the sign-bit extension method is applied to all partial products as follows: first, the lowest bit of each partial product is added to the highest bit of the corresponding operand; second, the sign bit of each partial product is inverted; then 1 is added at the highest position of the first partial product; next, a 1 is prepended before the highest bit of every partial product; finally, all partial products are combined into a Wallace tree. The Wallace tree generated is shown in the lower part of FIG. 2.
Step 1.4: for the Wallace tree, within each bit position every 3 addends are taken as a group and compressed into two bits, one in the current position and one in the next position, until fewer than 3 addends remain in every position; the remaining rows are added to obtain the multiplication result. The structure of the radix-4 Booth multiplier is shown in FIG. 2.
Step 2: carry-chain optimization of the radix-4 Booth multiplier based on the FPGA's basic configurable logic block.
Step 2.1: carry-chain-based optimization: carry chains are used instead of LUTs to compress the Wallace tree. The most significant bit of each operand added in the Wallace tree is compressed as the carry transported by a carry chain; each use of a carry chain must compress two 8-bit numbers plus one carry into one 9-bit number; this is repeated until fewer than 3 addends remain in every bit position, after which the result is summed. The compression of the Wallace tree generated by a single 8×8 radix-4 Booth multiplier is shown in FIG. 3.
Step 3: the optimized multiplier is used to design a 3×3 convolution processing unit based on partial-product merging; its structure is shown in FIG. 1.
Step 3.1: step 1.3 is performed separately for the different multiplications, producing 9 Wallace trees; 36 radix-4 Booth encoders are used to generate the partial products of the 9 required multiplications.
Step 3.2: the contents of the 9 Wallace trees are merged into a single Wallace tree, and the additions whose values are already determined are computed in advance; in particular, the nine 1s falling in the same bit position are pre-computed as the binary value 1001.
Step 3.3: step 2 is executed to compress the merged Wallace tree with carry chains, and the convolution result is obtained without computing any intermediate products, saving a large amount of hardware resources. The results of a 3×3 convolution computation unit implemented with the present invention are compared with other designs in Table 2.
In summary, the invention provides an optimized design method for an FPGA-based convolution processing unit. The method comprises two parts: optimizing the radix-4 Booth multiplier based on the basic configurable logic block of the FPGA platform, and using the optimized multiplier to design a convolution processing unit based on partial-product merging. The optimized multiplier approaches an approximate multiplier in performance, and the merged-partial-product convolution processing unit further reduces LUT usage on the FPGA.
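As a complement to the embodiment above, the following is a value-level sketch of the carry-chain reduction of step 2.1 of the embodiment: each two-input addition stands in for one carry-chain use (an FPGA carry chain realises exactly the two-numbers-plus-carry addition described above), and a pending +1 correction term from the Booth recoding is absorbed as the chain's carry-in when one is available. The pairing order and widths are illustrative assumptions; the real mapping follows FIG. 3. The helper booth_partial_products is the one sketched at the end of the Disclosure of Invention section, and the operand values are arbitrary.

```python
def carry_chain_reduce(addends, width):
    """Fold the addend list with two-input additions, each standing for one
    carry-chain use; a pending +1 correction term is absorbed as the chain's
    carry-in when available. Returns the sum and the number of chain adds."""
    mask = (1 << width) - 1
    work = sorted(addends)             # +1 correction terms end up at the front
    chains = 0
    while len(work) > 1:
        a = work.pop()                 # two wide addends from the back ...
        b = work.pop()
        cin = work.pop(0) if work and work[0] == 1 else 0   # ... plus a carry-in
        work.append((a + b + cin) & mask)
        chains += 1
    return work[0] & mask, chains


# Quick check at the 8x8 size of this embodiment (arbitrary operand values).
a, b, n = -77, 102, 8
total, chains = carry_chain_reduce(booth_partial_products(a, b, n), 2 * n)
assert total == (a * b) & 0xFFFF
print("product reproduced using", chains, "carry-chain additions")
```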
Table 1. Hardware encoding of the radix-4 Booth encoder, where S denotes the sign bit of A, and B_2i-1, B_2i, B_2i+1 denote bits 2i-1, 2i and 2i+1 of B, i=0, 1, 2, 3.
B_2i+1  B_2i  B_2i-1 | Hardware operation performed on A
  0       0      0   | 0
  0       0      1   | {S,A}
  0       1      0   | {S,A}
  0       1      1   | {A,0}
  1       0      0   | ~{A,0}
  1       0      1   | ~{S,A}
  1       1      0   | ~{S,A}
  1       1      1   | 0
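For cross-reference, Table 1 can be transcribed as a small Python lookup and checked against the arithmetic recoding used in the sketches above. The dictionary is only a reading aid; the ~ rows are one's complements whose completing +1 is supplied by step 1.3.

```python
# Table 1 as a lookup: (B_2i+1, B_2i, B_2i-1) -> (hardware operation on A, Booth digit).
BOOTH_OP = {
    (0, 0, 0): ("0",       0),
    (0, 0, 1): ("{S,A}",  +1),
    (0, 1, 0): ("{S,A}",  +1),
    (0, 1, 1): ("{A,0}",  +2),
    (1, 0, 0): ("~{A,0}", -2),
    (1, 0, 1): ("~{S,A}", -1),
    (1, 1, 0): ("~{S,A}", -1),
    (1, 1, 1): ("0",       0),
}

# Every row agrees with the arithmetic recoding -2*B_2i+1 + B_2i + B_2i-1.
for (b_hi, b_mid, b_lo), (_, digit) in BOOTH_OP.items():
    assert digit == -2 * b_hi + b_mid + b_lo
```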
Table 2. Results for a 3×3 convolution processing unit designed with the method of the present invention, compared with convolution processing units of the same function designed with other methods; the data bit width is 8.
Method                      | LUTs used | Maximum operating frequency (MHz) | Power (mW)
Method of the invention     | 362       | 395                               | 71
Radix-4 Booth multiplier    | 761       | 370                               | 81
Approximate multiplier      | 570       | 397                               | 89
Vendor-supplied design      | 725       | 383                               | 76
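Independently of the hardware figures in Table 2, the arithmetic of the merged-partial-product reduction of step 3 can be checked in software. The sketch below gathers the Booth partial products of all nine multiplications into a single column reduction and compares the result with the direct dot product. The 20-bit accumulator width is an assumption chosen so that the signed sum of nine 8×8-bit products fits, the constant 1s of step 3.2 are not pre-folded here, and booth_partial_products / wallace_3to2_reduce are the helpers sketched at the end of the Disclosure of Invention section.

```python
import random

def conv3x3_merged(weights, pixels, n=8, width=20):
    """Step 3 as a value-level model: gather the Booth partial products of all
    nine multiplications into one list and reduce them in a single merged
    Wallace tree, so no per-multiplication product is ever materialised."""
    addends = []
    for w, x in zip(weights, pixels):
        addends += booth_partial_products(w, x, n, width)   # merge all trees
    return wallace_3to2_reduce(addends, width)              # one shared reduction

random.seed(2023)
w = [random.randint(-128, 127) for _ in range(9)]
x = [random.randint(-128, 127) for _ in range(9)]
direct = sum(wi * xi for wi, xi in zip(w, x))
assert conv3x3_merged(w, x) == direct & ((1 << 20) - 1)
print("merged Wallace-tree reduction matches the direct 3x3 dot product")
```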

Claims (1)

1. A resource occupancy rate optimization method for an FPGA convolution accelerator, characterized by comprising the following steps:
Step 1: design a radix-4 Booth multiplier using the sign-bit extension method:
Step 1.1: according to the bit width n of the multiplier, design n/2 radix-4 Booth encoders; each encoder takes bits 2i+1, 2i and 2i-1 of the multiplier as its operand, encodes the other factor using only shifts and inversion, and outputs a partial product, where i is the index of the radix-4 Booth encoder, i=0, …, n/2-1;
Step 1.2: following step 1.1 and starting from i=0, decompose the multiplier of bit width n into n/2 operands and output n/2 partial products; if 2i-1 < 0, that bit is padded with 0;
Step 1.3: apply the sign-bit extension method to all partial products as follows: first, add the lowest bit of each partial product to the highest bit of the corresponding operand; second, invert the sign bit of each partial product; then add 1 at the highest position of the first partial product; next, prepend a 1 before the highest bit of every partial product; finally, combine all partial products into a Wallace tree;
Step 1.4: for the Wallace tree, within each bit position take every 3 addends as a group and compress them into two bits, one in the current position and one in the next position, until fewer than 3 addends remain in every position; add the remaining rows to obtain the multiplication result;
Step 2: optimize the radix-4 Booth multiplier based on the FPGA's basic configurable logic block, in two aspects: LUT-based optimization and carry-chain-based optimization;
Step 2.1: LUT-based optimization: within each bit position, if the number of addends is greater than or equal to 5, compress every 5 addends into 3, namely one bit in the current position and two carry bits; if the number of addends is less than 5, compress them with the ordinary Wallace-tree method, turning every 3 addends into 1 bit in the current position and 1 carry; repeat until fewer than 3 addends remain in every position;
Step 2.2: carry-chain-based optimization: use carry chains instead of LUTs to compress the Wallace tree; the most significant bit of each operand added in the Wallace tree is compressed as the carry transported by a carry chain, and each use of a carry chain must compress two 8-bit numbers plus one carry into one 9-bit number; repeat until fewer than 3 addends remain in every position, then sum;
Step 3: use the optimized multiplier to design a convolution processing unit based on the partial-product merging method;
Step 3.1: first determine the number of multiplications m required by the convolution processing unit, generate partial products for all multiplications with m×n/2 encoders, and apply step 1.3 to each multiplication separately, producing m Wallace trees;
Step 3.2: merge the contents of the m Wallace trees into a single Wallace tree, and compute in advance the additions whose values are already determined, namely the 1 added above the most significant bit of the first partial product and the 1s prepended before the most significant bit of every partial product, across the m Wallace trees;
Step 3.3: apply step 2 to compress the merged Wallace tree with LUTs or carry chains, obtaining the convolution result while saving a large amount of hardware resources.
CN202310052344.6A 2023-02-02 2023-02-02 Resource occupancy rate optimization method of FPGA convolution accelerator Active CN116126283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310052344.6A CN116126283B (en) 2023-02-02 2023-02-02 Resource occupancy rate optimization method of FPGA convolution accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310052344.6A CN116126283B (en) 2023-02-02 2023-02-02 Resource occupancy rate optimization method of FPGA convolution accelerator

Publications (2)

Publication Number Publication Date
CN116126283A true CN116126283A (en) 2023-05-16
CN116126283B CN116126283B (en) 2023-08-08

Family

ID=86296964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310052344.6A Active CN116126283B (en) 2023-02-02 2023-02-02 Resource occupancy rate optimization method of FPGA convolution accelerator

Country Status (1)

Country Link
CN (1) CN116126283B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5944776A (en) * 1996-09-27 1999-08-31 Sun Microsystems, Inc. Fast carry-sum form booth encoder
DE102018115219A1 (en) * 2017-07-14 2019-01-17 Intel Corporation Systems and methods for mapping reduction operations
CN112540743A (en) * 2020-12-21 2021-03-23 清华大学 Signed multiplication accumulator and method for reconfigurable processor
CN113872608A (en) * 2021-12-01 2021-12-31 中国人民解放军海军工程大学 Wallace tree compressor based on Xilinx FPGA primitive

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUKUN LI: "Approximate Logic Synthesis and its Application in Image Signal Processor", IEEE *
ZHOU Wanting; LI Lei: "Design and Implementation of a High-Speed 32×32 Multiplier Based on Radix-4 Booth Encoding", Journal of University of Electronic Science and Technology of China, no. 1 *
XIA Wei; XIAO Peng: "An Efficient Double-Precision Floating-Point Multiplier", Computer Measurement & Control, no. 04 *
LI Wei; DAI Zibin; CHEN Tao: "A Low-Power 32-bit Multiplier Based on a Skipping Wallace Tree", Computer Engineering, no. 17 *

Also Published As

Publication number Publication date
CN116126283B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN111832719A (en) Fixed point quantization convolution neural network accelerator calculation circuit
Bhattacharya et al. A high performance binary to BCD converter for decimal multiplication
Al-Khaleel et al. Fast and compact binary-to-BCD conversion circuits for decimal multiplication
CN110955403B (en) Approximate base-8 Booth encoder and approximate binary multiplier of mixed Booth encoding
Srinivas et al. New realization of low area and high-performance Wallace tree multipliers using booth recoding unit
Ahmad et al. An efficient approximate sum of absolute differences hardware for FPGAs
CN116205244B (en) Digital signal processing structure
CN116126283B (en) Resource occupancy rate optimization method of FPGA convolution accelerator
CN110837624B (en) Approximation calculation device for sigmoid function
Gao et al. Efficient realization of bcd multipliers using fpgas
da Rosa et al. The Radix-2^m Squared Multiplier
CN114237550B (en) Wallace tree-based multi-input shift sum accumulator
Kumar et al. Complex multiplier: implementation using efficient algorithms for signal processing application
CN113157247B (en) Reconfigurable integer-floating point multiplier
US7840628B2 (en) Combining circuitry
Nezhad et al. High-speed multiplier design using multi-operand multipliers
El Atre et al. Design and implementation of new delay-efficient/configurable multiplier using FPGA
Jaberipur et al. Posibits, negabits, and their mixed use in efficient realization of arithmetic algorithms
James et al. Performance analysis of double digit decimal multiplier on various FPGA logic families
Li A Single Precision Floating Point Multiplier for Machine Learning Hardware Acceleration
CN116151340B (en) Parallel random computing neural network system and hardware compression method and system thereof
Veena et al. Energy Scalable Brent Kung Adder with Non-Zeroing Bit Truncation
Rocha et al. Improving the Partial Product Tree Compression on Signed Radix-2^m Parallel Multipliers
Ramya et al. Implementation of High Speed FFT using Reversible Logic Gates for Wireless DSP Applications
Mohanty Design and implementation of faster and low power multipliers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant