CN107092462B - 64-bit asynchronous multiplier based on FPGA

Info

Publication number: CN107092462B
Application number: CN201710214226.5A
Authority: CN
Inventors: 何安平; 吴尽昭; 刘晓庆; 冯广博; 郭慧波; 熊菊霞; 王娟
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-04-01
Filing date: 2017-04-01
Publication date: 2020-10-09
Anticipated expiration: 2037-04-01
Also published as: CN107092462A

Abstract

The invention discloses an FPGA-based 64-bit asynchronous multiplier comprising an 8 × 64-bit multiplier, selectors MUX0, MUX1 and MUX2, a compressor, counters Count0, Count1 and Count2, a plurality of registers, a carry-lookahead adder CLA and a control unit. The control unit is a pipeline formed of Click asynchronous controllers; it analyzes the handshake signals exchanged between the asynchronous controllers and sequentially generates four groups of trigger signals. According to these four groups of trigger signals, the selectors MUX0, MUX1 and MUX2, the compressor, the counters Count0, Count1 and Count2, the registers and the carry-lookahead adder CLA perform the corresponding data transfer, compression, accumulation and output operations. The invention offers faster computation and lower energy consumption.

Description

64-bit asynchronous multiplier based on FPGA

Technical Field

The invention relates to a 64-bit asynchronous multiplier based on a Field Programmable Gate Array (FPGA).

Background

Since the advent of transistor technology in the 1970s, synchronous design has been almost synonymous with digital system design. Current processes, however, are approaching manufacturing limits: the transition from 12 nm to 7 nm has slowed down, "most likely the first departure from Moore's law" (John Gustafson, AMD). Problems such as clock skew and power distribution, brought on by the rapid progress of manufacturing processes, pose serious challenges to the synchronous design method. Synchronous design cannot solve these problems by itself; they can only be alleviated by making extensive use of GALS (globally asynchronous, locally synchronous) design, i.e., multi-core technology that uses a small amount of asynchronous circuitry.

Modern asynchronous design introduces a method based on the micro-pipeline, whose core is an asynchronous controller circuit that implements the handshake communication protocol and coordinates circuit functions. Compared with a clocked scheme, an asynchronous circuit communicates locally and is controlled by a handshake protocol, so it needs no large clock distribution network and avoids the clock-skew problem. An asynchronous circuit consumes almost no power when idle, which keeps the power consumption of the whole system under control. The asynchronous design method has clear advantages in low power consumption, low electromagnetic radiation, low heat dissipation, modularity and so on.

A digital multiplier is a binary arithmetic logic unit. Because digital circuits are built on Boolean logic, a mechanism is needed to convert arithmetic into logic, and this is the essence of a digital multiplier algorithm. Digital multiplier algorithms are mature. The most intuitive one, the array algorithm, starts from the low-order bit of the multiplier, computes the product of each bit with the multiplicand (a partial product), and then adds the partial products to obtain the product. An n-bit multiplier of this kind requires n(n+1) full adders and n² AND gates; it is slow and costly in area and power.
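As an illustration only, the array algorithm described above can be sketched in Python. The function name `array_multiply` and the width parameter `n` are hypothetical and do not come from the invention; the sketch models the arithmetic of the partial products, not the gate-level circuit.

```python
def array_multiply(multiplicand, multiplier, n):
    """Shift-and-add model of an n-bit array multiplier (illustrative only)."""
    partial_products = []
    for i in range(n):
        bit = (multiplier >> i) & 1
        # ANDing every multiplicand bit with multiplier bit i gives one row
        # of AND-gate outputs; the row carries the weight 2**i.
        row = (multiplicand if bit else 0) << i
        partial_products.append(row)
    # The rows are then added; in hardware this is the array of full adders.
    return sum(partial_products), partial_products


product, rows = array_multiply(13, 11, 4)
print(product, rows)
```

One row is produced per multiplier bit, which is why the number of partial products, and with it the adder array, grows with the operand width.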

The Booth algorithm is a widely adopted, efficient way to implement a multiplier. It first computes the partial products of the multiplicand and the multiplier sections, then compresses and sums the partial products to obtain the final product. Generating and combining the partial products is the key: it affects the computation speed and also determines the size of the whole multiplier. The invention first improves the Booth algorithm. It keeps the basic shift, compress and sum framework of the classical Booth algorithm, but drops the step of computing each multiplier section's partial product by subtraction after shifting; instead, several partial products are retained during shifting and are summed after multiple compressions. This improvement strengthens the cohesion within each functional module, weakens the coupling between modules, and simplifies the multiplier's control circuit.

The Booth algorithm divides the multiplier equally into several multiplier sections, so the multiplication is defined as the sum of the partial products of the multiplicand with each multiplier section. Specifically, in the Booth algorithm the multiplication of each section by the multiplicand can be mapped, according to the binary pattern of that section, to equivalent shift and subtraction operations that yield the section's partial product. The partial products are then either accumulated by repeated addition, or compressed several times and added once.

Disclosure of Invention

The invention aims to provide a 64-bit asynchronous multiplier based on an FPGA, which has faster operation and lower energy consumption.

The invention is realized as follows: an FPGA-based 64-bit asynchronous multiplier comprising an 8 × 64-bit multiplier, a selector MUX0, a selector MUX1, a selector MUX2, a compressor, a counter Count0, a counter Count1, a counter Count2, several registers, a carry-lookahead adder CLA, and a control unit, wherein:

the control unit is a pipeline consisting of Click asynchronous controllers; it analyzes the handshake signals exchanged between the asynchronous controllers and sequentially generates four groups of trigger signals;

the counter Count0 is configured, after receiving the first group of trigger signals from the control unit, to control the selector MUX0 so that the input signals are operated on in the 8 × 64-bit multiplier, the results being stored in 8 registers respectively;

the registers store the output values of the preceding-stage 8 × 64-bit multiplier and, after receiving the second group of trigger signals from the control unit, pass those output values on;

the counter Count1 is configured, after receiving the third group of trigger signals from the control unit, to control through the selector MUX1 which of the values in the 8 registers enter the compressor, so that they are compressed in a set order;

the counter Count2 is configured, after receiving the fourth group of trigger signals, to control the selector MUX2 to take the output value of the preceding-stage compressor and, according to the determination result, either feed it back to that compressor to continue compression with the values in the 8 registers, or pass it to the carry-lookahead adder CLA;

the carry-lookahead adder CLA adds the received output values and outputs the result.

Preferably, for the counter Count0, the input signals of the 8 × 64-bit multiplier are a 64-bit input signal A and an 8-bit input signal B.

The high-performance digital multiplier is a core component of processors and algorithm chips and the basis of various complex computations. In particular it is key to high-performance real-time digital signal processing and image processing, and its efficiency directly affects chip performance. The efficiency of a digital multiplier shows mainly in two respects, area and speed, both of which depend strongly on the design method and implementation algorithm chosen.

The invention provides an improved Booth multiplication algorithm whose core idea is to shift first, then compress, and finally sum, which reduces the coupling between modules and helps simplify the control circuit.

In addition, following a purely asynchronous circuit design method, the invention uses a Click micro-pipeline with a "constrained bundled-data" two-phase handshake communication protocol, implements a 64-bit asynchronous multiplier for the improved algorithm with control separated from data processing, and verifies it on an FPGA.

1. Asynchronous control principle based on micro-pipeline

The core of the asynchronous design method is the asynchronous controller circuit, which implements the handshake communication protocol and coordinates circuit functions. There are currently three mainstream asynchronous controller units: the C-element, GasP and Click. The C-element, proposed by Muller in the 1950s, is the most widely used asynchronous control unit and implements a "bundled-data" handshake protocol. Because it places no constraint on the data during handshake communication, a circuit built with it needs a large amount of later timing verification to guarantee correctness. The GasP and Click circuits use a "constrained bundled-data" handshake protocol that separates communication and data management into different events. This event separation guarantees the circuit timing in principle and, combined with relative-timing analysis, can significantly simplify asynchronous design. The invention uses a pipeline of Click asynchronous controllers as the control unit; this micro-pipeline repeatedly invokes the modules of the multiplier to complete the multiplication algorithm.

2. Click circuit and two-phase single-rail handshake protocol

The Click circuit was originally proposed by Peeters et al. in 2010 and implements a "constrained bundled-data" two-phase handshake communication protocol. Asynchronous controllers handshake with each other through Req (request) and Ack (acknowledge) signals; data is transferred between the two signal transitions, and the transfer is managed by Fire signals, as shown in FIG. 1.
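The two-phase handshake described above can be sketched as a small behavioural model. This is only an assumption-laden illustration: the class `TwoPhaseChannel` and its methods are hypothetical names, and the model captures just the signalling rule that every transition of Req or Ack (rising or falling) is an event, with no return-to-zero phase.

```python
class TwoPhaseChannel:
    """Behavioural model of one two-phase (transition-signalling) channel."""

    def __init__(self):
        self.req = False   # toggled by the sender
        self.ack = False   # toggled by the receiver
        self.data = None   # single-rail data bundled with req

    def pending(self):
        # A transfer is outstanding whenever req and ack differ.
        return self.req != self.ack

    def send(self, value):
        assert not self.pending(), "previous transfer not yet acknowledged"
        self.data = value          # data must be stable before req changes
        self.req = not self.req    # one transition = one request

    def receive(self):
        assert self.pending(), "no request outstanding"
        value = self.data          # capture point, the role of Fire in the text
        self.ack = not self.ack    # one transition = one acknowledge
        return value


ch = TwoPhaseChannel()
ch.send(5)
print(ch.receive(), ch.req, ch.ack)
```

After one transfer both wires rest at the same (inverted) level, so a second transfer needs only one further transition on each wire.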

3. Asynchronous micro-pipeline control circuit

The 64-bit asynchronous multiplier uses an asynchronous pipeline control circuit to strictly control the operation timing of each module. The multiplier contains one micro-pipeline with 19 Click circuits in total, which generate 19 corresponding Fire signals, as shown in FIG. 2. The asynchronous circuit uses the handshake protocol to generate a local clock for each pipeline stage in place of the global clock of a synchronous integrated circuit. It needs no large clock distribution network, naturally avoids the clock skew and high power consumption of synchronous integrated circuits, achieves average-case performance, and offers better reusability and robustness. When the request signal in_R enters the pipeline, requests propagate from stage to stage, and finally the acknowledge signal in_A is obtained. The micro-pipeline control unit is invoked repeatedly to complete the operation of the whole multiplier.
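How a request ripples through the 19 stages and produces the Fire signals in order can be sketched with a simplified behavioural model. The model assumes the commonly described Click stage with a single state bit that drives both the stage's acknowledge and its outgoing request; the names `ClickStage` and `run_pipeline` are hypothetical, and the environment is assumed to acknowledge the last stage immediately. It is not the circuit of FIG. 2.

```python
class ClickStage:
    """One Click controller reduced to its phase bit (illustrative)."""

    def __init__(self):
        self.phase = False  # drives both in_ack and out_req of this stage

    def can_fire(self, in_req, out_ack):
        # Fire when a new request is waiting and the successor is free.
        return in_req != self.phase and out_ack == self.phase

    def fire(self):
        self.phase = not self.phase


def run_pipeline(n_stages, n_requests):
    stages = [ClickStage() for _ in range(n_stages)]
    in_r = False      # external request (in_R)
    out_a = False     # acknowledge from the environment after the last stage
    fired = []
    for _ in range(n_requests):
        in_r = not in_r                      # two-phase: a toggle is a request
        progress = True
        while progress:
            progress = False
            for i, stage in enumerate(stages):
                in_req = in_r if i == 0 else stages[i - 1].phase
                out_ack = out_a if i == n_stages - 1 else stages[i + 1].phase
                if stage.can_fire(in_req, out_ack):
                    stage.fire()
                    fired.append(i)          # index of the Fire signal
                    progress = True
            out_a = stages[-1].phase         # environment acknowledges at once
    acknowledged = stages[0].phase == in_r   # in_A has caught up with in_R
    return fired, acknowledged


order, done = run_pipeline(19, 1)
print(order, done)
```

In this model each stage fires only after its predecessor has fired, which is the property the control unit relies on to sequence the data-path modules.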

Besides outputting the trigger signals Fire, the asynchronous pipeline control circuit also includes control elements such as counters within the micro-pipeline control unit. The whole multiplier needs 3 counters in total to drive the different selectors, and the selectors in turn steer the data paths to form a cyclic pipeline structure.

Compared with the prior art, the invention has the following beneficial effects:

(1) compared with a synchronous multiplier of the same architecture, the asynchronous multiplier provided by the invention computes faster with essentially unchanged energy consumption and area; each computation takes about 150 ns, and the product of any two 64-bit binary operands is completed quickly;

(2) the design is not limited by the inherent clock frequency of the FPGA; the communication delay between modules in the micro-pipeline can be as low as 1.5 ns, no large clock distribution network is needed, and the clock-skew problem is avoided;

(3) the invention has good modularization and is easy for hierarchical design.

Drawings

FIG. 1 is a schematic diagram of the "constrained bundled-data" two-phase handshake communication protocol;

FIG. 2 is a schematic diagram of a micro-pipeline control circuit configuration;

FIG. 3 is a block diagram of the logic modules in the FPGA-based 64-bit asynchronous multiplier of the present invention;

FIG. 4 is a block diagram of the logic modules of the 8 × 64-bit multiplier;

FIG. 5 is a simulation diagram of the multiplier.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention discloses an FPGA-based 64-bit asynchronous multiplier. As shown in FIG. 3, the 64-bit asynchronous multiplier comprises an 8 × 64-bit multiplier, a selector MUX0, a selector MUX1, a selector MUX2, a compressor, a counter Count0, a counter Count1, a counter Count2, several registers, a carry-lookahead adder CLA, and a control unit (the micro-pipeline in FIG. 3), wherein:

the counter Count0 is configured, after receiving the first group of trigger signals from the control unit, to control the selector MUX0 so that a 64-bit input signal A and an 8-bit input signal B are operated on in the 8 × 64-bit multiplier, the results being stored in 8 registers respectively;

the counter Count2, after receiving its group of trigger signals, controls the selector MUX2 to select whether the output value of the preceding-stage compressor is fed back to that compressor for compression with the values in the 8 registers, or is passed to the carry-lookahead adder CLA.

In the embodiment of the present invention, as shown in FIG. 3, the multiplication is completed by the 8 × 64-bit multiplier, the selectors MUX0, MUX1 and MUX2, the compressor (Compressor), the 3 counters Count0, Count1 and Count2, and finally the carry-lookahead adder CLA. Count0 controls the selector to divide the value of A into 8 groups of 8 bits each; each group is operated on with B in the 8 × 64-bit multiplier, and the results are stored in 8 registers. Count1 controls the compression of the values in the 8 registers in the compressor, 7 compressions in total. Count2 selects whether the output value of the compressor is fed back or passed to the carry-lookahead adder CLA for addition. The compressor first compresses the values in FF1 and FF2, then compresses the result with the value in FF3, and so on. When the last compression is complete, the compressed value is passed to the carry-lookahead adder CLA, which produces the final output value of the 64-bit multiplier.
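The data flow of this embodiment (split, multiply, 7 compression passes, one final addition) can be modelled at word level as follows. The sketch rests on several assumptions that are not stated in the text: each register is taken to hold a sum/carry pair, each group's result is aligned by 8 bits per group before compression, and the internals of the 8 × 64-bit multiplier are replaced by a stand-in. All function names are hypothetical.

```python
def csa(x, y, z):
    """3:2 carry-save step: x + y + z == s + c."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def reduce_to_two(rows):
    """Compress any number of addends to a (sum, carry) pair without a carry chain."""
    rows = list(rows) + [0, 0]
    while len(rows) > 2:
        s, c = csa(rows.pop(), rows.pop(), rows.pop())
        rows += [s, c]
    return rows[0], rows[1]


def multiply_8x64(group, b):
    """Stand-in for the 8 x 64-bit multiplier: returns (S, C) with S + C == group * b."""
    return reduce_to_two([b << j for j in range(8) if (group >> j) & 1])


def multiply_64x64(a, b):
    # Count0 / MUX0: split A into 8 groups of 8 bits and multiply each by B.
    regs = []
    for i in range(8):
        group = (a >> (8 * i)) & 0xFF
        s, c = multiply_8x64(group, b)
        regs.append((s << (8 * i), c << (8 * i)))   # assumed alignment shift
    # Count1 / MUX1 and Count2 / MUX2: compress FF1 with FF2, then the result
    # with FF3, and so on, feeding the compressor output back each time.
    acc_s, acc_c = regs[0]
    passes = 0
    for s, c in regs[1:]:
        acc_s, acc_c = reduce_to_two([acc_s, acc_c, s, c])
        passes += 1
    # CLA: the only carry-propagating addition in the whole flow.
    return acc_s + acc_c, passes


product, passes = multiply_64x64(103741655961231, 112381656513586)
print(product == 103741655961231 * 112381656513586, passes)
```

Only the last step propagates carries across the full word; every earlier step keeps the result in redundant sum/carry form, which is the point of compressing before summing.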

In the embodiment of the present invention, the control unit works by analyzing the handshake communication of the asynchronous controllers, as shown in FIG. 3, specifically:

(1) fire0: register FF0 holds two values, wi_a_64bit and wi_b_64bit. fire0 triggers FF0 to pass these two values down: the value of wi_a_64bit arrives at the selector MUX0, and the value of wi_b_64bit arrives directly at the 8 × 64-bit multiplier, where both wait for the trigger signal for the first calculation. Meanwhile the micro-pipeline passes the handshake signal on and generates the fire1 signal.

(2) fire1, fire3, fire5, ..., fire15: these 8 trigger signals control the counter Count0, which in turn controls the selector MUX0 to divide the wi_a_64bit value; the 8 groups of output values reach the 8 × 64-bit multiplier and are operated on with the wi_b_64bit value, and the results are stored in 8 registers.

(3) fire2, fire4, fire6, ..., fire16: registers FF1 to FF8 store the output values of the preceding-stage 8 × 64-bit multiplier; these 8 fire signals trigger the registers to pass those output values on.

(4) fire4, fire6, fire8, ..., fire16: these 7 trigger signals drive the counter Count1 through the counts 0 to 6. When the count is 0, the values in FF1 and FF2 pass through the selector to the compressor (Compressor) and are compressed; the main function of the selector MUX1 is to select the input values to be compressed. The data path uses a 7-stage cyclic pipeline structure and a compressor tree (Compressor_128bit). Because the parallelism of an ordinary adder is limited, the invention uses a 4-2 compressor, a circuit that compresses the sum of 4 inputs into 2 outputs in parallel and so halves the number of partial products. The 4-2 compressor consists of two one-bit full adders in series; compression of the high-order bits does not depend on carries from the low-order bits, so concurrency is high, circuit complexity is low and operation is fast, which improves the overall efficiency of the multiplier.

(5) fire5, fire7, fire9, ..., fire17: these signals mainly control the selector MUX2, which selects whether the output value of the preceding-stage compressor is fed back or passed to the carry-lookahead adder CLA. When the fire5 trigger signal arrives, the first compressed value is fed back to the selector MUX1, and MUX1 directs the value in FF3 and the fed-back value into the compressor for further compression; this continues until the fire15 signal arrives. When the fire17 signal arrives, the compressed value is passed down to the CLA for further calculation, and the result is stored in the register following the adder CLA.

(6) fire18: this final signal triggers the register FF1 of the carry-lookahead adder CLA, which outputs the final product data.
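The assignment of the 19 Fire signals to actions in items (1) to (6) can be tabulated with a short script. This is merely a restatement of the list above in executable form; the function name `fire_schedule` and the wording of the action strings are illustrative choices.

```python
def fire_schedule():
    """Map each Fire index (0..18) to the actions it triggers, per items (1)-(6)."""
    schedule = {i: [] for i in range(19)}
    schedule[0].append("FF0: pass wi_a_64bit and wi_b_64bit down")
    for i in range(1, 16, 2):    # fire1, fire3, ..., fire15
        schedule[i].append("Count0/MUX0: select a group, run the 8x64 multiplier")
    for i in range(2, 17, 2):    # fire2, fire4, ..., fire16
        schedule[i].append("FF1..FF8: pass a multiplier output on")
    for i in range(4, 17, 2):    # fire4, fire6, ..., fire16
        schedule[i].append("Count1/MUX1: compression pass")
    for i in range(5, 18, 2):    # fire5, fire7, ..., fire17
        schedule[i].append("Count2/MUX2: feed back or forward to CLA")
    schedule[18].append("CLA register: output the final product")
    return schedule


for index, actions in fire_schedule().items():
    print(f"fire{index}: {'; '.join(actions)}")
```

Printing the table makes the overlap visible: from fire4 onward, storing a new multiplier output and compressing earlier ones are triggered by the same even-numbered signals.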

In the embodiment of the present invention, the operation performed under the counter Count0 on the 64-bit input signal A and the 8-bit input signal B in the 8 × 64-bit multiplier is shown in FIG. 4.

As can be seen from FIG. 4, the input signals of the multiplier are A and B, where A is a 64-bit number and B is an 8-bit number. The input A is divided into 8 groups of 8 bits each, from [7:0] to [63:56]. These 8 groups of A, together with B, are fed into 8 8-bit multipliers; each 8-bit multiplier consists of 4 shifter (Shifter) circuits and a compressor (Compressor), as in the structure of Multiplier1 in FIG. 4.

Multiplier1 is one of the main components of the 8 × 64-bit multiplier, with inputs A[7:0] and B. First, A[7:0] is divided into groups of two bits each, (A₇A₆)(A₅A₄)(A₃A₂)(A₁A₀); these 4 groups and B are fed into 4 shift encoders, and two 15-bit binary values are finally obtained. The whole 8 × 64-bit multiplier contains 8 such 8-bit multipliers; the 8 groups of A are each multiplied with B in this first stage, giving 16 15-bit binary values and completing the first-stage operation.
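The grouping of A[7:0] into four 2-bit groups can be illustrated with a small sketch that generates one shifted partial product per group. How the shift encoders of FIG. 4 actually form each term is not specified in the text, so the expression used for `term` below is an assumption, as is the function name `partial_products_8x8`; the subsequent compression to two values is not modelled here.

```python
def partial_products_8x8(a8, b8):
    """One partial product per 2-bit group of a8, lowest group first (illustrative)."""
    rows = []
    for k in range(4):
        digit = (a8 >> (2 * k)) & 0b11          # (A1A0), (A3A2), (A5A4), (A7A6)
        # Assumed encoding: the high bit of the group selects B shifted by one,
        # the low bit selects B itself.
        term = ((b8 << 1) if digit & 0b10 else 0) + (b8 if digit & 0b01 else 0)
        rows.append(term << (2 * k))            # weight of group k is 4**k
    return rows


rows = partial_products_8x8(0b10110110, 0xC3)
print(rows, sum(rows) == 0b10110110 * 0xC3)
```

Grouping by two bits halves the number of partial products compared with one row per bit, from 8 to 4 for an 8-bit operand.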

The second stage passes these 16 binary values through a 4-2 compressor tree to compress them. A 4-2 compressor compresses 4 inputs into 2 outputs in parallel, halving the number of partial products. It consists of two one-bit full adders in series; compression of the high-order bits does not depend on carries from the low-order bits, so concurrency is high, circuit complexity is low and operation is fast. After this series of calculations, the compressor's 2 output values S and C are obtained; these values are stored in the 8 registers and are processed further in the 64-bit multiplier.
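A bit-level sketch of a 4-2 compressor built from two full adders per column, as described above, might look as follows. The names `full_adder` and `compress_4_2` are hypothetical, and the word width is chosen automatically so that no bit is lost; a hardware implementation would fix the width instead.

```python
def full_adder(x, y, z):
    """One-bit full adder: returns (sum, carry)."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def compress_4_2(a, b, c, d):
    """Compress four addends into (S, C) with S + C == a + b + c + d."""
    width = max(a.bit_length(), b.bit_length(), c.bit_length(), d.bit_length()) + 2
    sum_word = 0
    carry_word = 0
    cin = 0  # intermediate carry handed over from the previous column
    for i in range(width):
        a_i, b_i, c_i, d_i = ((v >> i) & 1 for v in (a, b, c, d))
        # First full adder: its carry depends only on this column's inputs,
        # so no carry ripples across more than one column.
        s1, cout = full_adder(a_i, b_i, c_i)
        # Second full adder, in series with the first.
        s2, carry = full_adder(s1, d_i, cin)
        sum_word |= s2 << i
        carry_word |= carry << (i + 1)
        cin = cout
    return sum_word, carry_word


s, c = compress_4_2(0xFFFF, 0x1234, 0xABCD, 0x0F0F)
print(hex(s), hex(c), s + c == 0xFFFF + 0x1234 + 0xABCD + 0x0F0F)
```

Because the intermediate carry travels exactly one column, all columns can be evaluated in parallel; the full carry propagation is deferred to the final adder.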

The invention implements the improved Booth algorithm with an asynchronous design method. The control part uses a micro-pipeline composed of Click asynchronous controllers, which are easy to analyze for timing; the functional circuits are implemented in combinational logic; and the two are connected by flip-flops. That is, the asynchronous micro-pipeline indirectly maintains the evaluation order of the combinational circuits by controlling when the flip-flops are enabled. The three cooperate to complete one or more multiplications, forming a data-path (Data-Path) computing structure.

The invention implements the asynchronous circuit with a Click micro-pipeline using the "constrained bundled-data" two-phase handshake communication protocol; the asynchronous circuit communicates locally and is controlled by the handshake protocol.

The invention provides a low-coupling Booth algorithm that increases the number of partial-product compression passes and performs the addition (subtraction) afterwards. By separating the functions of the modules the algorithm improves computational efficiency and is well suited to asynchronous control. Furthermore, the invention uses the asynchronous micro-pipeline mechanism together with combinational functional modules to carry out shifting, compression and addition, so the design is highly modular and the flow is simple and clear.

Compared with a synchronous multiplier of the same architecture, the asynchronous multiplier provided by the invention computes faster with essentially unchanged energy consumption and area; each computation takes about 150 ns, and the product of any two 64-bit binary operands is completed quickly.

The 64-bit asynchronous multiplier was designed and simulated on the Vivado platform, using Verilog-1995 as the hardware description language. (Vivado is Xilinx's complete design-flow tool from RTL to bitstream.) The FPGA (Field-Programmable Gate Array) used is a Xilinx Virtex-7 (xc7vx550tffg1158-2). The simulation result for multiplying wi_a_64bit = 103741655961231 by wi_b_64bit = 112381656513586 is shown in the waveform of FIG. 5. The test code was written as a testbench on the Vivado platform, and timing simulation was then run to obtain the final result.

As can be seen from FIG. 5, when inR goes high the handshake communication of the asynchronous controllers begins and the multiplier starts computing; in total 19 fire signals and 2 counters control the multiplier data path. In terms of resources, the circuit occupies 3695 LUTs, 1.07% of the total, and 3335 registers, 0.48% of the total.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. An FPGA-based 64-bit asynchronous multiplier, characterized in that the 64-bit asynchronous multiplier comprises an 8 × 64-bit multiplier, a selector MUX0, a selector MUX1, a selector MUX2, a compressor, a counter Count0, a counter Count1, a counter Count2, a plurality of registers, a carry-lookahead adder CLA and a control unit, wherein,

the control unit is a pipeline consisting of Click asynchronous controllers, analyzes handshake signals through the handshake communication protocol of the asynchronous control circuit, and sequentially generates four groups of trigger signals;

2. The FPGA-based 64-bit asynchronous multiplier of claim 1, wherein, for said counter Count0, the input signals of said 8 × 64-bit multiplier are a 64-bit input signal A and an 8-bit input signal B.