WO2005124535A1 - Field programmable gate array (FPGA) based pipelined array multiplier (OPARAM) - Google Patents

Field programmable gate array (FPGA) based pipelined array multiplier (OPARAM)

Publication number
WO2005124535A1
WO2005124535A1 PCT/IN2004/000170 IN2004000170W WO2005124535A1 WO 2005124535 A1 WO2005124535 A1 WO 2005124535A1 IN 2004000170 W IN2004000170 W IN 2004000170W WO 2005124535 A1 WO2005124535 A1 WO 2005124535A1
Authority
WO
WIPO (PCT)
Prior art keywords
combinational logic
row
logic blocks
logic block
multiplier
Prior art date
Application number
PCT/IN2004/000170
Other languages
French (fr)
Inventor
Gopalakrishnan Lakshminarayanan
Balasubramanian Venkataramani
Original Assignee
Department Of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Department Of Information Technology
Priority to PCT/IN2004/000170
Publication of WO2005124535A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52: Multiplying; Dividing
    • G06F7/523: Multiplying only
    • G06F7/527: Multiplying only in serial-parallel fashion, i.e. one operand being entered serially and the other in parallel
    • G06F7/5272: Multiplying only in serial-parallel fashion, i.e. one operand being entered serially and the other in parallel with row wise addition of partial products
    • G06F7/5275: Multiplying only in serial-parallel fashion, i.e. one operand being entered serially and the other in parallel with row wise addition of partial products using carry save adders
    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03H: IMPEDANCE NETWORKS, e.g. RESONANT CIRCUITS; RESONATORS
    • H03H17/00: Networks using digital techniques
    • H03H17/02: Frequency selective networks
    • H03H17/0223: Computation saving measures; Accelerating measures
    • H03H17/0225: Measures concerning the multipliers
    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03H: IMPEDANCE NETWORKS, e.g. RESONANT CIRCUITS; RESONATORS
    • H03H17/00: Networks using digital techniques
    • H03H17/02: Frequency selective networks
    • H03H17/06: Non-recursive filters
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00: Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38: Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804: Details
    • G06F2207/386: Special constructional features
    • G06F2207/3884: Pipelining

Definitions

  • This invention relates to a Field Programmable Gate Array (FPGA) based pipelined array multiplier (OPARAM).
  • FPGA Field Programmable Gate Array
  • OPARAM Optimally Placed And Routed pipelined Array Multiplier
  • in stage 1 the product of Y1 with X3 X2 X1 X0 is computed and added to the result of the previous stage.
  • Stages 0 to 3 are used for multiplication of X by the individual bits of Y, and stage 4 is used for propagating the carry.
  • the combinational logic blocks M0, M1, M2 in stages 0 to 3 consist of AND gates, half-adders and full-adders.
  • the above array multiplier can be fed with fresh data only after the input has been processed by all five stages.
  • the limitation of this conventional array multiplier is that the minimum time required for a multiplication equals the time taken to process the input in all five stages, which limits the speed.
  • the ripple carry operation done in stage 4 may be replaced by a carry save operation distributed over 3 additional stages.
  • the multiplication rate can be increased by introducing registers at the output of the combinational logic blocks M0, M1, M2 and M3 at each stage shown in Fig. 1.
  • the resulting multiplier is shown in Fig. 2 and is called a Pipelined Array multiplier.
  • the multiplication can be done at a rate which depends only on the largest time taken for processing the inputs at one of the stages, the interconnect delay between the combinational logic blocks and the flip-flop (register) and set-up time as well as hold-time of the flip-flop.
  • the product P0 becomes available after 2 clock cycles.
  • this invention provides an improved m x n pipelined array multiplier comprising: an array of combinational logic blocks; a set of registers connected to each of said combinational logic blocks in a row of said array of combinational logic blocks; two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block; the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks; the sum output of the jth combinational logic block in the ith row is connected to the (j+1)th combinational logic block in the (i+1)th row; the carry output of the jth combinational
  • the improved 4 x 4 pipelined array multiplier comprises: an array of combinational logic blocks; a set of registers connected to each of said combinational logic blocks in a row of said array of combinational logic blocks; two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block; the remaining 2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks; in the first row, combinational logic blocks 1 to 4 are connected to the second row such that the sum output of the combinational logic blocks 1 to 3 in the first row is connected to combinational logic blocks 2 to 4 in the second row; in the second row, combinational logic blocks 1 to 4 are connected to the third row such that the sum output of the combinational logic blocks 1 to 3 in the second row is connected to 2 to
  • the array of combinational logic blocks includes LUTs.
  • the said improved m x n pipelined array multiplier further includes a delay control means between adjacent combinational logic blocks in the same row, and in the ith row and (i+1)th row, to make the sum of the interconnect delay and the delay in each logic block and the registers equal for all the logic blocks including registers in the ith and (i+1)th rows, thereby further increasing the speed of multiplication.
  • the said delay control means comprising selected interconnect wires between adjacent combinational logic blocks including registers such that the sum of the interconnect delay between said logic blocks and the combinational logic blocks delay including registers is equal for the selected interconnected wires between logic blocks in the same row or the adjacent row.
  • the two accumulators are provided at the output of the multiplier for storing the results alternately. The output of said accumulators is connected to means for adding the sum of the two products.
  • Figure 1 shows the conventional 4 x 4 array multiplier.
  • Figure 2 shows the conventional 4 x 4 pipelined array multiplier.
  • Figure 3 shows an improved pipelined array multiplier, according to this invention.
  • Figure 4 shows the conventional 4 x 4 guild multiplier.
  • Figure 5 shows an improved pipelined array multiplier with accumulator, according to this invention.
  • Figure 6 shows the clock circuit.
  • figures 1 and 2 have been explained under the heading 'Background'. It is assumed that FPGAs with 4-input Look-Up-Tables (LUTs) are used for the implementation.
  • LUT Look-Up-Table
  • One of the objectives of the synthesis technique is to ensure that all the 4 inputs of the LUTs are effectively engaged.
  • the stages involving half-adders have to be modified/combined so that LUTs are effectively utilized. Keeping this in view, stage 0 of Fig. 2 may be modified to compute the partial products due to the two least significant multiplier bits.
  • the last N stages may be reduced to N/2 stages by replacing the Half adders with suitable functional blocks and feeding the sum and carry outputs from one stage to another properly.
  • M2, M2 - stage 3; M0, M7, M7, M6 - stage 4; M7, M6 - stage 5) separated by 6 stages of registers (1-6) are only required.
  • the original multiplier in figure 2 requires 8 stages of combinational logic blocks and registers.
  • the latency of the multiplier shown in figure 3 is reduced from 8 clock cycles to 6 clock cycles. Hence the number of registers required and the latency are reduced by 25%.
  • MxN multiplication can be achieved using (M-1)+(N/2) stages of registers with the latency of (M-1)+(N/2) clock cycles. This results in lower latency and requires less area for implementation.
  • Optimally Placed And Routed pipelined Array Multiplier: The maximum speed at which a pipelined array multiplier implemented on FPGAs can operate is limited by the minimum write cycle time required for the flip-flops (registers) in the LUTs. Equivalently, the maximum frequency at which the flip-flops (registers) can operate sets the upper limit. But in practice a multiplier designed using a CAD (Computer Aided Design) tool operates at lower rates. For example, in figure 3 for the flip-flops (registers 1-6) in the LUTs of a XC4010E-1 device, the minimum write cycle period is typically 6ns. However, the popular CAD tools at best enable the multiplier to operate at a rate of about 10ns per multiplication.
  • CAD Computer Aided Design
  • the objective of this work is to make the flip-flops operate at the peak rate and perform multiplication at this rate. If the F/Fs are to be clocked at their peak operating rate, the critical path delay from the output of one register stage to the input of the next register stage should be less than 6ns (nanoseconds).
  • the components of the critical path delay between one multiplier stage and the next multiplier stage and their typical values for XC4010E-1 are as follows: Tcko: Clock K to outputs Q (1.9ns)
  • Tnet Interconnect delay from the Q output of one register stage to the input of the next register stage (0.7 to 7ns).
  • Tsetup Setup time of flip-flop (F/F) for the inputs fed through F&G inputs (1.8ns).
  • Tnet there is one variable delay viz. Tnet.
  • Tnet This has to be less than 2.3ns (nanoseconds) if the registers are to be operated at the peak rate of 6ns.
  • the CAD tools for placement and routing normally do not guarantee the critical path delay to be less than 6ns.
  • the location of the CLB (Combinational Logic Block) chosen by the CAD tool to realize the logic at adjacent stages is arbitrary.
  • the CLB in the first stage and the CLB in the next stage to which it is to be interconnected may be chosen to lie in the two extreme corners of the CLB matrix! This naturally increases the Tnet.
  • the interconnect delay between two adjacent rows or columns of CLBs is still not deterministic. Hence it can't be guaranteed to be below 2.3ns.
  • the interconnect delays depend on fan-ins and fan-outs.
  • the fabrication variations make the delay different for different ICs and for different positions in the same IC.
  • the F/Fs cannot be operated at their peak speed. Due to this, multipliers designed using the commercially available synthesis tools report speeds which are 1.5 to 2 times lower than the maximum permissible clock speed for the F/F. In order to achieve the best speed, the following modifications have been done.
  • the positions of the CLBs used for implementing different logical functions are chosen by using floor planning.
  • the interconnect delay is also manually adjusted to be below 2.3ns.
  • the multiplier may be designed using the CAD tool and placed as well as routed using autorouting. This can be used as the initial design.
  • all the CLBs corresponding to a single pipeline stage may be made to lie in a single row or column.
  • the CLBs in the adjacent pipelined stage may be chosen to be in the adjacent rows or columns.
  • delay control means comprising selected interconnect wires between adjacent combinational logic blocks including registers such that the sum of the interconnect delay between said logic blocks and the combinational logic blocks delay including registers is equal for the selected interconnected wires between logic blocks in the same row or the adjacent row.
  • a 6x6 OPARAM can be made to operate at a peak rate of 6ns (nanoseconds).
  • the 16-bit accumulator for this filter requires 10ns in the same device. Since the accumulator is slower than the multiplier, a single multiplier may be shared between the two accumulators using an interleaving technique.
  • the multiplier processes alternately the data and coefficients corresponding to odd filter and even filter.
  • the multiplier output corresponding to each filter is latched into two separate registers.
  • a 6ns clock is used to pump-in data into the OPARAM.
  • a 12ns clock is used for latching the data into the registers.
  • a single RAM may be used for storing the input samples.
  • the odd locations contain the odd numbered samples and even locations contain the even numbered samples.
  • a single RAM may be used for storing the impulse response coefficients.
  • the odd locations contain the coefficients of one filter and the even locations contain the coefficients of the other filter.
  • the basic clock of 6ns is used for the multiplier. This clock is divided further using counters to generate the 12ns/10.8ns clock and the addresses for the RAM.
  • the block diagram of the 16-tap filter using 6x6 OPARAM along with the two interleaved accumulators is shown in Fig. 6.
  • the filter can be implemented using WPARAM.
  • this technique is applicable for filters requiring any NxN OPARAM; for the WPARAM the size of the multiplier cannot be arbitrarily chosen.
  • the conventional Guild multiplier is one of the fast and area-efficient multipliers proposed for high speed applications.
  • the block diagram of the 4x4 conventional Guild Multiplier is shown in Fig. 4.
  • a Pipelined Guild multiplier is obtained by ensuring that all the paths from the input to a particular equitemporal point have undergone the same delay. This multiplier achieves efficiency by dispensing with the need for half-adders and thereby effectively engages all the 4 inputs of the LUTs.
  • the dotted lines indicate equitemporal points.
  • the processing logic element (E) shown in Fig. 4 consists of 4 inputs and 4 outputs.
  • these multipliers should be studied not only by using simulation but also by pumping the test data into the actual device.
  • the test data is applied through the parallel port of a Personal Computer to the demo board housing the FPGA and the results are read back for verification.
  • the clock signal should also be applied to the multiplier.
  • the clock may be taken from an external source or it may have to be internally generated.
  • One of the requirements for extracting the peak rate from the OPARAM is the ability to operate the pipelining registers at their maximum rate. This in turn requires a clock which operates at this rate.
  • the input samples and impulse response coefficients are assumed to be 6-bit unsigned numbers and are assumed to be stored in 2 separate RAMs (RAM-A, RAM-B). It requires 125 CLBs for the implementation of the filter.
  • the conventional parallel FIR filter with dedicated multipliers has also been implemented. It requires 210 CLBs. Hence the interleaved parallel FIR filter is area efficient by 40%.
  • the different characteristics of the multipliers, such as CLBs required, maximum operating frequency and latency, are evaluated and tabulated in Table 2.
  • the 4X4 OSPAM requires 25% less area and latency than the conventional pipelined array multiplier and the Guild multiplier.
  • the 4X4 OPARAM requires about 25% less area as well as latency and operates faster by about 1.5 times than the conventional pipelined array multiplier and the Guild multiplier.
  • the 4X4 OPARAM has been found to be working satisfactorily not only at the peak clock rate but also at rates lower than this.
  • the internal clock generation scheme proposed has been found to be stable and it enables the multipliers to be operated at their peak rate irrespective of the speed grade of the PCB on which the FPGA is mounted.
  • the 16 tap Interleaved parallel FIR filter using 6X6 OPARAM requires 40% less area than the filter with the conventional approach.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

This invention describes a Field Programmable Gate Array (FPGA) based pipelined array multiplier (OPARAM) made using an array of combinational logic blocks and a set of registers connected to each of said combinational logic blocks.

Description

This invention relates to a Field Programmable Gate Array (FPGA) based pipelined array multiplier (OPARAM).

BACKGROUND OF THE INVENTION

The conventional 4 x 4 array multiplier is shown in figure 1 of the accompanying drawings. Consider the multiplication of the 4-bit number X3 X2 X1 X0 with Y3 Y2 Y1 Y0. In Fig. 1, the multiplication is carried out using the conventional "paper and pencil" technique. Different stages of computation in this array multiplier are separated by dotted lines and are indicated as stage 0, stage 1, stage 2, stage 3 and stage 4. At stage 0, the product of Y0 with X3 X2 X1 X0 is computed. In stage 1, the product of Y1 with X3 X2 X1 X0 is computed and added to the result of the previous stage. Stages 0 to 3 are used for multiplication of X by the individual bits of Y, and stage 4 is used for propagating the carry. The combinational logic blocks M0, M1, M2 in stages 0 to 3 consist of AND gates, half-adders and full-adders.

The above array multiplier can be fed with fresh data only after the input has been processed by all five stages. The limitation of this conventional array multiplier is that the minimum time required for a multiplication equals the time taken to process the input in all five stages, which limits the speed. In order to speed up the conventional array multiplier, the ripple carry operation done in stage 4 may be replaced by a carry save operation distributed over 3 additional stages. The multiplication rate can be increased by introducing registers at the output of the combinational logic blocks M0, M1, M2 and M3 at each stage shown in Fig. 1. The resulting multiplier is shown in Fig. 2 and is called a Pipelined Array multiplier.

In this multiplier, the multiplication can be done at a rate which depends only on the largest time taken for processing the inputs at any one of the stages, the interconnect delay between the combinational logic blocks and the flip-flop (register), and the set-up time as well as the hold time of the flip-flop. In the Pipelined Array multiplier, the product bit P0 becomes available after 2 clock cycles, P1 after 3 clock cycles and so on. P6 and P7 become available 8 clock cycles after the corresponding operands are fed to the input register of the pipelined array multiplier. Hence the latency of the multiplier is 8 clock cycles. In order to ensure that all the product bits arrive simultaneously, the product bits P0 to P5 have to be delayed by 6 to 1 clock cycles respectively. This is achieved by using a chain of flip-flops: P0 requires 6 flip-flops (registers), P1 requires 5 flip-flops and so on. In general, for an NxN multiplier, 2N stages of combinational logic blocks separated by 2N stages of registers are required. The latency of the multiplier is 2N clock cycles.

The drawback of the conventional pipelined array multiplier shown in figure 2 is that, when it is implemented on an FPGA with four-input LUTs, not all four inputs are effectively utilized in the first stage and in the last N/2 stages, resulting in more area for implementation and higher latency.
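The output alignment described above can be sketched as a short calculation. This is only an illustration: the function names are hypothetical, and the general-N pattern (bit P_k ready after k+2 cycles, capped at the 2N-cycle latency) is an assumption extrapolated from the 4 x 4 figures quoted in the text.

```python
from typing import List


def ready_cycle(k: int, n: int) -> int:
    """Clock cycle at which product bit P_k is ready in an n x n
    conventional pipelined array multiplier (assumed pattern:
    P0 after 2 cycles, P1 after 3, ..., capped at the 2n latency)."""
    latency = 2 * n
    return min(k + 2, latency)


def alignment_delays(n: int) -> List[int]:
    """Flip-flops needed on each product bit so that all 2n bits
    emerge together at the full latency of 2n clock cycles."""
    latency = 2 * n
    return [latency - ready_cycle(k, n) for k in range(2 * n)]


if __name__ == "__main__":
    delays = alignment_delays(4)
    # P0..P5 need 6..1 flip-flops, P6 and P7 need none.
    print(delays)
    print("alignment flip-flops in total:", sum(delays))
```

For the 4 x 4 case this reproduces the delays of 6 down to 1 flip-flops stated for P0 to P5.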
The object and summary of the invention

The object of this invention is to obviate the drawbacks of the conventional multipliers shown in figures 1 and 2 and to ensure that all the 4 inputs of the LUTs are effectively used. To achieve said objective this invention provides an improved m x n pipelined array multiplier comprising:
- an array of combinational logic blocks,
- a set of registers connected to each of said combinational logic blocks in a row of said array of combinational logic blocks,
- two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block,
- the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks,
- the sum output of the jth combinational logic block in the ith row is connected to the (j+1)th combinational logic block in the (i+1)th row,
- the carry output of the jth combinational logic block in the ith row is connected to the jth combinational logic block of the (i+1)th row,
- in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of the (n-1)th row and the sum output of the second extreme right combinational logic block of the (n-1)th row,
- the jth combinational logic block in the ith row is connected to the sum and carry output of the (j+1)th combinational logic block in the ith row and the sum output of the (j-1)th combinational logic block in the (i-1)th row and the carry from the jth combinational logic block in the (i-1)th row,
- the extreme left combinational logic block in the nth row is provided inputs from the sum and carry output of the second left combinational logic block in the nth row,
- the left combinational logic block of the last row of the combinational logic block array is provided with the sum output of the extreme left and the carry output of the second extreme left combinational logic block of the previous row and the sum and carry output of the right combinational logic block of the last row,
- the right combinational logic block of the last row of the combinational logic block array is provided with the sum output of the second left combinational logic block and the carry output of the third left combinational logic block in the previous row, and
- the sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of the first row and the last n/2 rows will provide the final multiplication, thereby resulting in reduction in the area and latency.

Where m x n = 4 x 4, the said improved 4 x 4 pipelined array multiplier comprises:
- an array of combinational logic blocks,
- a set of registers connected to each of said combinational logic blocks in a row of said array of combinational logic blocks,
- two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block,
- the remaining 2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks,
- in the first row, combinational logic blocks 1 to 4 are connected to the second row such that the sum output of the combinational logic blocks 1 to 3 in the first row is connected to combinational logic blocks 2 to 4 in the second row,
- in the second row, combinational logic blocks 1 to 4 are connected to the third row such that the sum output of the combinational logic blocks 1 to 3 in the second row is connected to combinational logic blocks 2 to 4 in the third row,
- in the first row, combinational logic blocks 2 to 4 are connected to the second row such that the carry output of the combinational logic blocks 2 to 4 in the first row is connected to combinational logic blocks 2 to 4 in the second row,
- in the second row, combinational logic blocks 2 to 4 are connected to the third row such that the carry output of the combinational logic blocks 2 to 4 in the second row is connected to combinational logic blocks 2 to 4 in the third row,
- in the fourth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of the third row and the sum output of the second extreme right combinational logic block of the third row,
- combinational logic blocks 1 to 3 in the fourth row are connected to the sum and carry output of combinational logic blocks 2 to 4 in the fourth row and the sum output of combinational logic blocks 2 to 4 in the third row and the carry from combinational logic blocks 1 to 3 in the third row,
- the extreme left combinational logic block in the fourth row is provided inputs from the sum and carry output of the second left combinational logic block in the fourth row,
- the left combinational logic block of the last row of the combinational logic block array is provided with the sum output of the extreme left and the carry output of the second extreme left combinational logic block of the previous row and the sum and carry output of the right combinational logic block of the last row,
- the right combinational logic block of the last row of the combinational logic block array is provided with the sum output of the second left combinational logic block and the carry output of the third left combinational logic block in the previous row, and
- the sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of the first row and the last two rows will provide the final multiplication, thereby resulting in reduction in the area and latency.

The array of combinational logic blocks includes LUTs. The said improved m x n pipelined array multiplier further includes a delay control means between adjacent combinational logic blocks in the same row, and in the ith row and (i+1)th row, to make the sum of the interconnect delay and the delay in each logic block and the registers equal for all the logic blocks including registers in the ith and (i+1)th rows, thereby further increasing the speed of multiplication. The said delay control means comprises selected interconnect wires between adjacent combinational logic blocks including registers such that the sum of the interconnect delay between said logic blocks and the combinational logic blocks delay including registers is equal for the selected interconnected wires between logic blocks in the same row or the adjacent row. Two accumulators are provided at the output of the multiplier for storing the results alternately. The output of said accumulators is connected to means for adding the sum of the two products.

Brief description of the Drawings

The invention will now be described with reference to the accompanying drawings.
Figure 1 shows the conventional 4 x 4 array multiplier.
Figure 2 shows the conventional 4 x 4 pipelined array multiplier.
Figure 3 shows an improved pipelined array multiplier, according to this invention.
Figure 4 shows the conventional 4 x 4 Guild multiplier.
Figure 5 shows an improved pipelined array multiplier with accumulator, according to this invention.
Figure 6 shows the clock circuit.

DETAILED DESCRIPTION

Referring to the drawings, figures 1 and 2 have been explained under the heading 'Background'. It is assumed that FPGAs with 4-input Look-Up-Tables (LUTs) are used for the implementation. One of the objectives of the synthesis technique is to ensure that all the 4 inputs of the LUTs are effectively engaged. When half-adders are implemented, the LUTs cannot be efficiently utilized. Accordingly, the stages involving half-adders have to be modified/combined so that LUTs are effectively utilized. Keeping this in view, stage 0 of Fig. 2 may be modified to compute the partial products due to the two least significant multiplier bits. The last N stages may be reduced to N/2 stages by replacing the half-adders with suitable functional blocks and feeding the sum and carry outputs from one stage to another properly.

For example, consider the multiplication of the 4-bit number X3 X2 X1 X0 with Y3 Y2 Y1 Y0. In stage one of the original scheme, the product of Y0 with X3 X2 X1 X0 is computed. In the second stage, the product of Y1 with X3 X2 X1 X0 is computed. These two stages can be modified as follows: the partial product obtained by the least significant two bits of the multiplier (Y1 Y0) with X3 X2 X1 X0 is shown in Table 1 given below. Since 4-input LUTs are available, the partial products corresponding to each column of Table 1 can be computed using a single LUT. Hence the two stages of the original multiplier of figure 2 can be reduced to a single stage with 4-input LUTs.

       X3     X2     X1     X0
x                    Y1     Y0
       X3Y0   X2Y0   X1Y0   X0Y0   partial product row-1
X3Y1   X2Y1   X1Y1   X0Y1          partial product row-2

Table 1: Partial products of multiplying the multiplicand with the two LSBs of the multiplier

Stages three and four of the original scheme actually require 4-input functional blocks (M0, M2, M2, M2) and hence they can be implemented using the 4-input LUTs efficiently, and no modification in the original scheme is required for these two stages. Stages 5 to 8 in the original scheme are used for carry propagation and they use only half-adders for this purpose. Every two stages can be properly combined into one stage in view of the use of 4-input LUTs.

Keeping in view the above modifications, an improved 4x4 pipelined array multiplier has been developed and is shown in Fig. 3. Only 5 stages of combinational logic blocks (CLBs) (M0, M5, M5, M5, M0 - stage 1; M0, M2, M2, M2 - stage 2; M0, M2, M2, M2 - stage 3; M0, M7, M7, M6 - stage 4; M7, M6 - stage 5), separated by 6 stages of registers (1-6), are required. The original multiplier in figure 2 requires 8 stages of combinational logic blocks and registers. Further, the latency of the multiplier shown in figure 3 is reduced from 8 clock cycles to 6 clock cycles. Hence the number of registers required and the latency are reduced by 25%. In general, MxN multiplication can be achieved using (M-1)+(N/2) stages of registers with the latency of (M-1)+(N/2) clock cycles. This results in lower latency and requires less area for implementation.

Optimally Placed And Routed pipelined Array Multiplier (OPARAM)

The maximum speed at which a pipelined array multiplier implemented on FPGAs can operate is limited by the minimum write cycle time required for the flip-flops (registers) in the LUTs. Equivalently, the maximum frequency at which the flip-flops (registers) can operate sets the upper limit. But in practice a multiplier designed using a CAD (Computer Aided Design) tool operates at lower rates. For example, in figure 3 for the flip-flops (registers 1-6) in the LUTs of a XC4010E-1 device, the minimum write cycle period is typically 6ns. However, the popular CAD tools at best enable the multiplier to operate at a rate of about 10ns per multiplication.
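Returning to the merged first stage built around Table 1, the following sketch illustrates why each column fits a 4-input function. The function names are hypothetical, and the carry-save split into a sum bit and a carry bit is an assumption: the internal logic of the M5 blocks is not given in this section.

```python
def column_lut(xj: int, xj_prev: int, y0: int, y1: int) -> tuple:
    """4-input function for one column of Table 1 (illustrative).

    Column j holds the terms X[j]*Y0 and X[j-1]*Y1; they are added
    in carry-save form, giving a sum bit and a carry bit.
    """
    a = xj & y0
    b = xj_prev & y1
    return a ^ b, a & b


def first_stage(x: int, y: int, n: int = 4) -> int:
    """Value represented by the sum/carry outputs of the merged stage."""
    y0, y1 = y & 1, (y >> 1) & 1
    value = 0
    for j in range(n + 1):  # n + 1 columns, weights 2^0 .. 2^n
        xj = (x >> j) & 1 if j < n else 0
        xj_prev = (x >> (j - 1)) & 1 if j >= 1 else 0
        s, c = column_lut(xj, xj_prev, y0, y1)
        value += (s << j) + (c << (j + 1))
    return value


if __name__ == "__main__":
    # Exhaustive check for 4-bit operands: the merged stage must
    # represent X times the two least significant bits of Y.
    ok = all(first_stage(x, y) == x * (y & 3)
             for x in range(16) for y in range(16))
    print("merged first stage matches X * (Y1 Y0):", ok)
```

Under these assumptions the two end columns depend on a single product term each and the three middle columns on all four inputs, which would be consistent with the stage 1 listing M0, M5, M5, M5, M0, although that correspondence is an inference rather than something the text states.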
The objective of this work is to make the flip-flops operate at the peak rate and to perform multiplication at this rate. If data is to be pumped into the F/Fs at their peak operating rate, the critical path delay from the output of one register stage to the input of the next register stage should be less than 6ns (nanoseconds). The components of the critical path delay between one multiplier stage and the next, and their typical values for the XC4010E-1, are as follows:
Tcko: Clock K to outputs Q (1.9ns).
Tnet: Interconnect delay between the Q output of one register stage and the input of the next register stage (0.7 to 7ns).
Tsetup: Setup time of the flip-flop (F/F) for the inputs fed through the F&G inputs (1.8ns).
Of these components, only Tnet is variable. It has to be less than 2.3ns (that is, 6ns - 1.9ns - 1.8ns) if the registers are to be operated at the peak rate of 6ns. Even for low-order multiplication, the CAD tools for placement and routing normally do not guarantee the critical path delay to be less than 6ns. Firstly, the location of the CLB (Combinational Logic Block) chosen by the CAD tool to realize the logic at adjacent stages is arbitrary. In the worst case, the CLB in the first stage and the CLB in the next stage to which it is to be interconnected may be chosen to lie in two extreme corners of the CLB matrix. This naturally increases Tnet. Even if the CAD tool permits the specification of which CLB is to be used for which particular function, the interconnect delay between two adjacent rows or columns of CLBs is still not deterministic. Hence it cannot be guaranteed to be below 2.3ns. We are not aware of any CAD tool which can perform the routing such that the interconnect delay between two CLBs is equal to a particular value within an acceptable tolerance. It is difficult for the existing CAD tools to achieve this for the following reasons:
1. The interconnect delays depend on fan-ins and fan-outs.
2. Fabrication variations make the delay different for different ICs and for different positions in the same IC.
In view of these limitations of the CAD tools, the F/Fs cannot be operated at their peak speed. Due to this, multipliers designed using the commercially available synthesis tools report speeds which are 1.5 to 2 times lower than the maximum permissible clock speed for the F/F. In order to achieve the best speed, the following modifications have been made.
1. The positions of the CLBs used for implementing different logical functions are chosen by using floor planning.
2. The interconnect delay is also manually adjusted to be below 2.3ns.
The multiplier may be designed using the CAD tool and placed as well as routed using autorouting. This can be used as the initial design. By using proper floor planning, all the CLBs corresponding to a single pipeline stage may be made to lie in a single row or column. Similarly, the CLBs in the adjacent pipeline stage may be chosen to be in the adjacent rows or columns. These two steps minimize most of the interconnect delays. Further modification of the routing between adjacent CLBs in those paths which have delays greater than 2.3ns can guarantee proper operation. With the above modifications, the F/Fs in the CLBs can be operated at their peak speed and multiplication can be performed at this rate. Accordingly, the delay control means comprises selected interconnect wires between adjacent combinational logic blocks including registers, such that the sum of the interconnect delay between said logic blocks and the delay of the combinational logic blocks including registers is equal for the selected interconnect wires between logic blocks in the same row or the adjacent row.

Implementation of Area efficient Filters using OPARAM and interleaved parallel FIR Filters

In figure 6, one multiplier with two accumulators at its output is shown. One multiplier with one accumulator constitutes one filter. Figure 6 shows two parallel filters, which process the odd and even input samples; each filter has 16 taps, and the input samples and the impulse response coefficients are represented using 6 bits. In the XC4010E-1 device, a 6x6 OPARAM can be made to operate at a peak rate of 6ns (nanoseconds). The 16-bit accumulator for this filter requires 10ns in the same device. Since the accumulator is slower than the multiplier, a single multiplier may be shared between the two accumulators using an interleaving technique.
The multiplier alternately processes the data and coefficients corresponding to the odd filter and the even filter. The multiplier output corresponding to each filter is latched into a separate register. A 6ns clock is used to pump data into the OPARAM. A 12ns clock is used for latching the data into the registers. Data is latched into one of the registers at the positive edge of the 12ns clock and into the other register at the opposite (negative) edge. A single RAM may be used for storing the input samples. The odd locations contain the odd numbered samples and the even locations contain the even numbered samples. Similarly, a single RAM may be used for storing the impulse response coefficients. The odd locations contain the coefficients of one filter and the even locations contain the coefficients of the other filter. The basic clock of 6ns is used for the multiplier. This clock is divided further using counters to generate the 12ns/10.8ns clock and the addresses for the RAM. The block diagram of the 16-tap filter using the 6x6 OPARAM along with the two interleaved accumulators is shown in Fig. 6. Using a similar approach, the filter can be implemented using WPARAM. However, while this technique is applicable for filters requiring any NXN OPARAM, for the WPARAM the size of the multiplier cannot be arbitrarily chosen.
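The interleaving scheme can be modelled behaviourally. The following Python sketch is an illustration only, not the implementation described in this document: the function name `interleaved_mac` and the test data are hypothetical, and pipeline latency is ignored. It shows one shared multiplier steering alternate products to two accumulators, and it checks two figures implied by the text: each accumulator receives a product every 12ns (two 6ns multiplier cycles), which is longer than the 10ns it needs, and 16 products of 6-bit unsigned operands always fit in a 16-bit accumulator.

```python
TAPS = 16                # taps per filter, as stated in the text
OPERAND_BITS = 6         # unsigned sample and coefficient width
MULT_PERIOD_NS = 6       # OPARAM clock period quoted for the XC4010E-1
ACC_PERIOD_NS = 10       # time the 16-bit accumulator needs


def interleaved_mac(sample_ram, coeff_ram):
    """Model one shared multiplier feeding two accumulators alternately.

    Even addresses hold the data of one filter and odd addresses the data
    of the other, mirroring the RAM layout described above. Pipeline
    latency is ignored; only the steering of products is modelled.
    """
    acc = [0, 0]  # acc[0]: even-address filter, acc[1]: odd-address filter
    for addr in range(2 * TAPS):
        product = sample_ram[addr] * coeff_ram[addr]  # the single multiplier
        acc[addr % 2] += product                      # alternate accumulators
    return acc[0], acc[1]


# Each accumulator sees a new product only on every second multiplier cycle.
acc_input_interval_ns = 2 * MULT_PERIOD_NS

# Largest possible accumulated value for 16 taps of 6-bit unsigned operands.
max_operand = (1 << OPERAND_BITS) - 1
worst_case_sum = TAPS * max_operand * max_operand

if __name__ == "__main__":
    samples = [(7 * i) % 64 for i in range(2 * TAPS)]    # made-up test data
    coeffs = [(3 * i + 1) % 64 for i in range(2 * TAPS)]
    print(interleaved_mac(samples, coeffs))
    print(acc_input_interval_ns >= ACC_PERIOD_NS)   # True: 12ns >= 10ns
    print(worst_case_sum < (1 << 16))               # True: fits in 16 bits
```

Sharing is possible only because the multiplier is at least twice as fast as the rate at which each accumulator must be fed; with a slower multiplier, each filter would need its own.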
Comparison between the conventional Guild multiplier and the instant pipelined multiplier (Optimally Synthesized Pipelined Array Multiplier (OSPAM))

The conventional Guild multiplier is one of the fast and area-efficient multipliers proposed for high speed applications. The block diagram of the 4x4 conventional Guild multiplier is shown in Fig. 4. A pipelined Guild multiplier is obtained by ensuring that all the paths from the input to a particular equitemporal point have undergone the same delay. This multiplier achieves efficiency by dispensing with the need for half-adders, thereby effectively engaging all 4 inputs of the LUTs. In Fig. 4, the dotted lines indicate equitemporal points. The processing logic element (E) shown in Fig. 4 consists of 4 inputs and 4 outputs. Of the 4 outputs, three cross the equitemporal line. Since only 2 LUTs and flip-flops (FFs) are available per Combinational Logic Block (CLB), 3 LUTs/FFs have to be used per processing element to ensure that all 3 outputs have undergone the same delay. With this observation, it may be verified that the latency between input and output of the pipelined Guild multiplier is 8 clock cycles. As in the pipelined array multiplier, the products P0 to P5 have to be delayed by 6 to 1 clock cycles respectively to ensure that all the product terms arrive synchronously. The conventional Guild multiplier is inefficient compared to the instant multiplier shown in figure 3 because of the increase in latency by (N/2) clock cycles compared to the OSPAM. This requires some of the input bits as well as some of the product terms to be delayed by a greater number of stages to ensure the equitemporal condition. Each additional delay is achieved at the expense of one additional LUT/FF. Hence the conventional Guild multiplier requires more area than the pipelined array multiplier according to this invention.
Results and Conclusions

In order to study the characteristics of the multipliers, 4X4 multipliers are implemented on the XC4010E-1 device. Pipelined array multipliers, OSPAM and Guild multipliers can be tested fully using the simulation tool. Hence these multipliers are tested using simulation and are found to be satisfactory. However, OPARAM, WPARAM and the wave-pipelined Guild multiplier use manual routing, and the interconnect delays play a crucial role in determining their proper operation. Since the interconnect delays are non-deterministic, there may not be a one-to-one correspondence between the delays measured by the CAD tool and the actual delay obtained in a particular device after the design is downloaded to the device. Hence, to verify the proper operation of these multipliers, they should be studied not only by using simulation but also by pumping the test data into the actual device. For this purpose the test data is applied through the parallel port of a Personal Computer to the demo board housing the FPGA and the results are read back for verification. For testing the design on the actual device, the clock signal should also be applied to the multiplier. The clock may be taken from an external source or it may have to be internally generated. One of the requirements for extracting the peak rate from the OPARAM is the ability to operate the pipelining registers at their maximum rate. This in turn requires a clock which operates at this rate. For OPARAM, the floor planning and the manual routing have been carried out so that the worst case delay between any two pipeline stages is less than the minimum write cycle time for the flip-flops. Hence if a clock whose period is greater than this minimum cycle period is applied to the flip-flop, it has to work properly. To verify this, the period of the internally generated clock (shown in figure 6) is varied and the OPARAM is found to be working satisfactorily.
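The register-to-register timing condition that the floor planning and manual routing must satisfy can be written as a simple check. The sketch below is illustrative only: it uses the typical XC4010E-1 values quoted earlier in the description (Tcko = 1.9ns, Tsetup = 1.8ns, 6ns minimum write cycle), while the function names and the example net delays are hypothetical. Delays are held as integer picoseconds to avoid floating-point rounding.

```python
T_CKO_PS = 1900           # clock-to-output delay (1.9ns)
T_SETUP_PS = 1800         # flip-flop setup time (1.8ns)
MIN_PERIOD_PS = 6000      # minimum write cycle period (6ns)


def net_budget_ps(period_ps):
    """Largest interconnect delay allowed for a given clock period."""
    return period_ps - T_CKO_PS - T_SETUP_PS


def violating_paths(net_delays_ps, period_ps=MIN_PERIOD_PS):
    """Return the names of paths whose Tnet exceeds the budget."""
    budget = net_budget_ps(period_ps)
    return sorted(name for name, tnet in net_delays_ps.items() if tnet > budget)


if __name__ == "__main__":
    # Hypothetical post-route interconnect delays, in picoseconds.
    routed = {"reg1->reg2": 700, "reg2->reg3": 2100, "reg3->reg4": 3400}
    print(net_budget_ps(MIN_PERIOD_PS))       # 2300, i.e. the 2.3ns limit
    print(violating_paths(routed))            # ['reg3->reg4'] needs re-routing
    print(violating_paths(routed, 10000))     # [] at a 10ns clock
```

Lengthening the clock period enlarges the interconnect budget, which is consistent with the observation that a design routed for the minimum period also works at longer periods.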
An OPARAM of size 8X8 has also been implemented and has been found to be satisfactory. The implementation details of the filters on the XC4010E-1 using OPARAM are presented next. To study the area efficiency of the scheme using OPARAM, a 16-tap interleaved parallel FIR filter is implemented on the XC4010E-1. The input samples and impulse response coefficients are assumed to be 6-bit unsigned numbers and are assumed to be stored in 2 separate RAMs (RAM-A, RAM-B). The implementation of the filter requires 125 CLBs. The conventional parallel FIR filter with dedicated multipliers has also been implemented. It requires 210 CLBs. Hence the interleaved parallel FIR filter requires about 40% less area. The different characteristics of the multipliers, such as the CLBs required, maximum operating frequency and latency, are evaluated and tabulated in table 2.
Table 2:
[Table 2 appears as an image (imgf000012_0001) in the original publication.]
From table 2 and based on the experiments carried out, the following conclusions can be drawn:
1. The 4X4 OSPAM requires 25% less area and latency than the conventional pipelined array multiplier and the Guild multiplier.
2. The 4X4 OPARAM requires about 25% less area as well as latency, and operates about 1.5 times faster, than the conventional pipelined array multiplier and the Guild multiplier.
3. The 4X4 OPARAM has been found to work satisfactorily not only at the peak clock rate but also at rates lower than this.
4. The internal clock generation scheme proposed has been found to be stable, and it enables the multipliers to be operated at their peak rate irrespective of the speed grade of the PCB on which the FPGA is mounted.
5. The 16-tap interleaved parallel FIR filter using the 6X6 OPARAM requires 40% less area than the filter with the conventional approach.
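Several of these percentages follow from figures given in the description, as the short calculation below illustrates. The helper name `percent_reduction` is an assumption made for this sketch; the inputs (8 and 6 register stages, 210 and 125 CLBs, 10ns and 6ns per multiplication) are taken from the text, and the data of table 2 itself is not used.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


# Register stages and latency: 8 in the original scheme, 6 in the improved one.
stage_saving = percent_reduction(8, 6)

# CLBs for the 16-tap filter: 210 with dedicated multipliers, 125 interleaved.
clb_saving = percent_reduction(210, 125)

# Time per multiplication: about 10ns from CAD tools versus 6ns for OPARAM.
speedup = 10 / 6

if __name__ == "__main__":
    print(stage_saving)            # 25.0
    print(round(clb_saving, 1))    # 40.5
    print(round(speedup, 2))       # 1.67
```

The 10ns figure is the best case quoted for CAD-tool routing, so a ratio of about 1.67 is in line with the stated 1.5 to 2 times speed difference.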

Claims

We claim:
1. An improved m x n pipelined array multiplier comprising: an array of combinational logic blocks, a set of registers connected to each said combinational logic blocks in a row of said array of combinational logic blocks, two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks, the sum output of the jth combinational logic block in the ith row is connected to (j+1)th combinational logic block in the (i+1)th row, the carry output of the jth combinational logic block in the ith row is connected to jth combinational logic block of (i+1)th row, in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of (n-1)th row and sum output of second extreme right combinational logic block of (n-1)th row, jth combinational logic block in ith row is connected to the sum and carry output of (j+1)th combinational logic block in the ith row and sum output of (j-1)th combinational logic block in (i-1)th row and carry from jth combinational logic block in the (i-1)th row, the extreme left combinational logic block in the nth row is provided inputs from sum and carry output of second left combinational logic block in the nth row, the left combinational logic block of the last row of the combinational logic block array is provided with sum output of extreme left and carry output of second extreme left combinational logic block of the previous row and the sum and carry output of the right combinational logic block of the last row, the right combinational logic block of the last row of the combinational logic block array is provided with the sum output of the second left combinational logic block and the carry output of the third left combinational logic block in the previous row, and sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of first row and last (n/2)th rows will provide the final multiplication thereby resulting in reduction in the area and latency.
2. An improved m x n pipelined array multiplier as claimed in claim 1 wherein m x n = 4 x 4.
3. An improved 4 x 4 pipelined array multiplier as claimed in claim 2 comprising: an array of combinational logic blocks, a set of registers connected to each said combinational logic blocks in a row of said array of combinational logic blocks, two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining 2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks, in first row combinational logic blocks 1 to 4 are connected to second row such that the sum output of the combinational logic blocks 1 to 3 in the first row is connected to 2 to 4 combinational logic blocks in the second row, in second row combinational logic blocks 1 to 4 are connected to third row such that the sum output of the combinational logic blocks 1 to 3 in the second row is connected to 2 to 4 combinational logic blocks in the third row, in first row combinational logic blocks 2 to 4 are connected to second row such that the carry output of the combinational logic blocks 2 to 4 in the first row is connected to 2 to 4 combinational logic blocks in the second row, in second row combinational logic blocks 2 to 4 are connected to third row such that the carry output of the combinational logic blocks 2 to 4 in the second row is connected to 2 to 4 combinational logic blocks in the third row, in the fourth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of third row and sum output of second extreme right combinational logic block of third row, combinational logic blocks 1 to 3 in fourth row is connected to the sum 
and carry output of combinational logic blocks 2 to 4 in the fourth row and sum output of combinational logic blocks 2 to 4 in the third row and carry from combinational logic blocks 1 to 3 in the third row, the extreme left combinational logic block in the fourth row is provided inputs from sum and carry output of second left combinational logic block in the fourth row, the left combinational logic block of the last row of the combinational logic block array is provided with sum output of extreme left and carry output of second extreme left combinational logic block of the previous row and the sum and carry output of the right combinational logic block of the last row, the right combinational logic block of the last row of the combinational logic block array is provided with the sum output of the second left combinational logic block and the carry output of the third left combinational logic block in the previous row, and sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of first row and last two rows will provide the final multiplication, thereby resulting in reduction in the area and latency.
4. An improved m x n pipelined array multiplier as claimed in claim 1 wherein the array of combinational logic blocks include LUTs.
5. An improved m x n pipelined array multiplier as claimed in claim 1 further comprising a delay control means between adjacent combinational logic blocks in the same row, ith row and (i+1)th row to make the sum of the interconnect delay and the delay in each logic block and the registers equal for all the logic blocks including registers in the ith and (i+1)th rows thereby further increasing the speed of multiplication.
6. An improved m x n pipelined array multiplier as claimed in claim 5 wherein delay control means comprising selected interconnect wires between adjacent combinational logic blocks including registers such that the sum of the interconnect delay between said logic blocks and the combinational logic blocks delay including registers is equal for the selected interconnected wires between logic blocks in the same row or the adjacent row.
7. An improved m x n pipelined array multiplier as claimed in claim 1 wherein two accumulators are provided in the output of the multiplier for storing the results alternatively.
8. An improved m x n pipelined array multiplier as claimed in claim 7 wherein the output of said accumulators is added to find the sum of the two products.
9. An improved m x n pipelined array multiplier substantially as herein described with reference to and as illustrated in the accompanying drawings.
PCT/IN2004/000170 2004-06-15 2004-06-15 Field programmable gate array (fpga) based pipelined array multiplier (oparam). WO2005124535A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IN2004/000170 WO2005124535A1 (en) 2004-06-15 2004-06-15 Field programmable gate array (fpga) based pipelined array multiplier (oparam).

Publications (1)

Publication Number Publication Date
WO2005124535A1 true WO2005124535A1 (en) 2005-12-29

Family

ID=35509883

Country Status (1)

Country Link
WO (1) WO2005124535A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4748583A (en) * 1984-09-17 1988-05-31 Siemens Aktiengesellschaft Cell-structured digital multiplier of semi-systolic construction

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase