WO2006003667A1

WO2006003667A1 - Field programmable gate array (fpga) based wave pipelined array multiplier (wparam)

Info

Publication number: WO2006003667A1
Application number: PCT/IN2004/000188
Authority: WO
Inventors: Gopalakrishnan Lakshminarayanan; Balasubramanian Venkataramani
Original assignee: Department Of Information Technology
Priority date: 2004-06-30
Filing date: 2004-06-30
Publication date: 2006-01-12

Abstract

The invention provides an improved m x n wave pipelined array multiplier using an array of combinational logic blocks and a set of registers connected to each combinational logic block.

Description

TITLE OF THE INVENTION

FIELD PROGRAMMABLE GATE ARRAY (FPGA) BASED WAVE PIPELINED ARRAY MULTIPLIER (WPARAM)

FIELD OF THE INVENTION

This invention relates to a Field Programmable Gate Array (FPGA) based wave pipelined array multiplier (WPARAM).

BACKGROUND OF THE INVENTION

The conventional 4 x 4 array multiplier is shown in figure 1 of the accompanying drawings. Consider the multiplication of the 4-bit number X3 X2 X1 X0 with Y3 Y2 Y1 Y0. In Fig. 1, the multiplication is carried out using the conventional "paper and pencil" technique. Different stages of computation in this array multiplier are separated by dotted lines and are indicated as stage 0, stage 1, stage 2, stage 3 and stage 4. At stage 0, the product of Y0 with X3 X2 X1 X0 is computed. In stage 1, the product of Y1 with X3 X2 X1 X0 is computed and added to the result of the previous stage. Stages 0 to 3 are used for multiplication of X by the individual bits of Y, and stage 4 is used for propagating the carry. The combinational logic blocks M0, M1, M2 in stages 0 to 3 consist of AND gates, half-adders and full-adders. The above array multiplier can be fed with fresh data only after the input has been processed by all five stages. The limitation of this conventional array multiplier is that the minimum time required for a multiplication is equal to the time taken for processing the input in all five stages, which slows down the multiplication rate.
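The stage-by-stage "paper and pencil" procedure described above can be illustrated with a short behavioural sketch. This is not a gate-level model of Fig. 1: the function name `array_multiply` is hypothetical, and the final carry-propagation stage is folded into ordinary integer addition.

```python
def array_multiply(x, y, n_bits=4):
    """Multiply two unsigned n_bits-wide integers one multiplier bit per stage."""
    acc = 0
    for i in range(n_bits):
        y_bit = (y >> i) & 1
        # Stage i: AND every bit of X with Y_i and weight the row by 2**i
        partial = (x if y_bit else 0) << i
        # Add the new partial product to the result of the previous stage
        acc += partial
    return acc


if __name__ == "__main__":
    print(array_multiply(0b1011, 0b0110))  # 11 * 6
```

Each loop iteration corresponds to one of stages 0 to 3; a new operand pair cannot enter until the loop has finished, which is the throughput limitation noted in the text.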

In order to speed up the conventional array multiplier, the ripple carry operation done in stage 4 may be replaced with a carry save operation distributed over 4 stages. The multiplication rate can be increased by introducing registers at the output of the combinational logic blocks M0, M1, M2 and M3 at each stage shown in Fig. 1. The resulting multiplier is shown in Fig. 2 and is called a Pipelined Array multiplier. In this multiplier, the multiplication can be done at a rate which depends only on the largest time taken for processing the inputs at one of the stages, the interconnect delay between the combinational logic blocks and the flip-flop (register), and the set-up time as well as the hold-time of the flip-flop. In the Pipelined Array multiplier, the product bit P0 becomes available after 2 clock cycles, P1 after 3 clock cycles and so on. P7 becomes available 9 clock pulses after the corresponding operands are fed to the input register of the pipelined array multiplier. Hence the latency of the multiplier is 9 clock cycles. In order to ensure that all the product bits arrive simultaneously, the product bits P0 to P6 have to be delayed by 7 to 1 clock cycles respectively. This is achieved by using chains of flip-flops: P0 requires 7 flip-flops (registers), P1 requires 6 flip-flops and so on. In general, for an NxN multiplier, 2N combinational logic blocks separated by 2N+1 stages of registers are required. The latency of the multiplier is 2N+1 clock cycles.
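The latency and delay-chain figures quoted above can be reproduced with a small calculation. The sketch below assumes that product bit P_k becomes available after k + 2 clock cycles, which generalises the values the text gives for P0, P1 and P7; the function names are hypothetical.

```python
def pipelined_latency(n):
    """Latency, in clock cycles, of the conventional N x N pipelined array multiplier."""
    return 2 * n + 1


def alignment_delays(n):
    """Flip-flops needed on each product bit P0..P(2N-1) so that all bits emerge together.

    Assumption: bit P_k is available after k + 2 clock cycles.
    """
    latency = pipelined_latency(n)
    return [latency - (k + 2) for k in range(2 * n)]


if __name__ == "__main__":
    print(pipelined_latency(4))      # latency for the 4 x 4 case
    print(alignment_delays(4))       # delay chain length for P0 .. P7
    print(sum(alignment_delays(4)))  # total flip-flops spent on alignment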

The drawback of the conventional pipelined array multiplier shown in figure 2 is that, when it is implemented on an FPGA with four-input LUTs, the four inputs are not all effectively utilized in the first stage and in the last N stages, which results in more area for implementation and higher latency.

In our co-pending application the drawbacks of the conventional multipliers shown in figures 1 & 2 have been obviated and all the four inputs of LUTs have been effectively utilized.

In our co-pending application we have claimed an improved m x n pipelined array multiplier comprising:

- an array of combinational logic blocks, each block having two outputs denoted as sum and carry outputs for the sake of convenience,
- a set of registers connected to each of said combinational logic blocks in a row of said array of combinational logic blocks,
- two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block,
- the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks such that the sum output of the jth combinational logic block in the ith row is connected to the (j+1)th combinational logic block in the (i+1)th row,
- the carry output of the jth combinational logic block in the ith row is connected to the jth combinational logic block of the (i+1)th row,
- in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of the (n-1)th row and the sum output of the second extreme right combinational logic block of the (n-1)th row,
- the extreme left combinational logic block in the nth row is connected to the sum output of the extreme left combinational logic block and the carry output of the second left combinational logic block in the (n-1)th row,
- the jth (j=2 to n-1) combinational logic block in the nth row is connected to the sum outputs of the (j-1)th and jth combinational logic blocks in the (n-1)th row and the carries from the jth and (j+1)th combinational logic blocks in the (n-1)th row,
- for each of the combinational logic blocks in the last (n/2)-1 rows excepting the extreme right block, the jth block is connected to the sum outputs of the jth and (j+1)th combinational logic blocks and the carry outputs from the (j+1)th and (j+2)th combinational logic blocks in the previous row,
- each of the extreme right combinational logic blocks of the last (n/2)-1 rows of the combinational logic block array is provided with the sum output of the third extreme right combinational logic block and the carry output of the second extreme right combinational logic block in the previous row, and
- the sum outputs of all the extreme right combinational logic blocks of each row and of the second right extreme combinational logic blocks of the first row and the last n/2 rows will provide the final multiplication, thereby resulting in a reduction in the area and latency.

The object of the present invention is to increase the speed of multiplication while reducing the area, power consumption and latency.

To achieve the said objective, an improved wave pipelined multiplier has been developed.

The present invention therefore provides an improved m x n wave pipelined array multiplier comprising:

- an array of combinational logic blocks, each block having two outputs denoted as sum and carry outputs for the sake of convenience,
- two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block,
- the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks such that the sum output of the jth combinational logic block in the ith row is connected to the (j+1)th combinational logic block in the (i+1)th row,
- the carry output of the jth combinational logic block in the ith row is connected to the jth combinational logic block of the (i+1)th row,
- in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of the (n-1)th row and the sum output of the second extreme right combinational logic block of the (n-1)th row,
- the extreme left combinational logic block in the nth row is connected to the sum output of the extreme left combinational logic block and the carry output of the second extreme left combinational logic block in the (n-1)th row,
- the jth (j=2 to n-1) combinational logic block in the nth row is connected to the sum outputs of the (j-1)th and jth combinational logic blocks in the (n-1)th row and the carries from the jth and (j+1)th combinational logic blocks in the (n-1)th row,
- for each of the combinational logic blocks in the last (n/2)-1 rows excepting the extreme right block, the jth block is connected to the sum outputs of the jth and (j+1)th combinational logic blocks and the carry outputs from the (j+1)th and (j+2)th combinational logic blocks in the previous row,
- each of the extreme right combinational logic blocks of the last (n/2)-1 rows of the combinational logic block array is provided with the sum output of the third extreme right combinational logic block and the carry output of the second extreme right combinational logic block in the previous row,
- the sum outputs of all the extreme right combinational logic blocks of each row and of the second right extreme combinational logic blocks of the first row and the last n/2 rows will provide the final multiplication, and
- a delay control means is provided between combinational logic blocks in the ith row and the (i+1)th row to make the sum of the interconnect delay and the delay in each logic block itself equal for all the logic blocks in the ith and (i+1)th rows, thereby increasing the speed of multiplication and reducing the area, power consumption and latency.

In the above improved m x n wave pipelined array multiplier, m x n = 4 x 4.

An improved 4 x 4 wave pipelined array multiplier comprising:

- an array of combinational logic blocks (each block having two outputs denoted as sum and carry outputs for the sake of convenience),
- two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block,
- the remaining 2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks,
- in the first row, combinational logic blocks 1 to 4 are connected to the second row such that the sum outputs of combinational logic blocks 1 to 3 in the first row are connected to combinational logic blocks 2 to 4 in the second row,
- in the second row, combinational logic blocks 1 to 4 are connected to the third row such that the sum outputs of combinational logic blocks 1 to 3 in the second row are connected to combinational logic blocks 2 to 4 in the third row,
- in the first row, combinational logic blocks 2 to 4 are connected to the second row such that the carry outputs of combinational logic blocks 2 to 4 in the first row are connected to combinational logic blocks 2 to 4 in the second row,
- in the second row, combinational logic blocks 2 to 4 are connected to the third row such that the carry outputs of combinational logic blocks 2 to 4 in the second row are connected to combinational logic blocks 2 to 4 in the third row,
- in the fourth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of the third row and the sum output of the second extreme right combinational logic block of the third row,
- the third extreme left combinational logic block in the fourth row is connected to the sum outputs of the second extreme left and third extreme left combinational logic blocks in the third row and the carry outputs of the third extreme and fourth extreme left combinational logic blocks in the third row,
- the second extreme left combinational logic block in the fourth row is connected to the sum outputs of the extreme left and second extreme left combinational logic blocks in the third row and the carry outputs of the second extreme and third extreme left combinational logic blocks in the third row,
- the extreme left combinational logic block in the fourth row is provided with inputs from the sum output of the extreme left and the carry output of the second extreme left combinational logic blocks in the third row,
- the left combinational logic block of the last row of the combinational logic block array is provided with the sum outputs of the extreme left and second extreme left combinational logic blocks and the carry outputs of the second extreme left and third extreme left combinational logic blocks of the previous row,
- the right combinational logic block of the last row of the combinational logic block array is provided with the sum output of the second left combinational logic block and the carry output of the third left combinational logic block in the previous row,
- the sum outputs of all the extreme right combinational logic blocks of each row and of the second right extreme combinational logic blocks of the first row and the last two rows will provide the final multiplication, and
- a delay control means is provided between combinational logic blocks in rows 1 to 4 and rows 2 to 5 to make the sum of the interconnect delay and the delay in each logic block itself equal for all the logic blocks in rows 1 to 4 and 2 to 5, thereby increasing the speed of multiplication and reducing the area, power consumption and latency.

The delay control means comprises selected interconnect wires between combinational logic blocks such that the sum of the interconnect delay between said logic blocks and the delay of the combinational logic blocks themselves is equal for the selected interconnect wires between logic blocks in adjacent rows.

The array of combinational logic blocks includes LUTs.

At least one register is provided at the input of the multiplier and two registers are provided at the output of the multiplier.

Two accumulators are provided at the output of the multiplier for storing the results alternately.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will now be described with reference to the accompanying drawings.

Figure 1 shows the conventional 4 x 4 array multiplier.

Figure 2 shows the conventional 4 x 4 pipelined array multiplier.

Figure 3 shows an improved 4 x 4 pipelined array multiplier described in our co-pending application no. .

Figure 4 shows an improved wave pipelined array multiplier, according to this invention.

Figure 5 shows the conventional 4 x 4 wave pipelined Guild multiplier.

Figure 6 shows an improved wave pipelined array multiplier with accumulator, according to this invention.

Figure 7 shows a clock circuit.

DETAILED DESCRIPTION OF THE INVENTION

Referring to the drawings, figures 1 and 2 have been explained under the heading "Background".

It is assumed that FPGAs with 4-input Look-Up-Tables (LUTs) are used for the implementation. One of the objectives of the synthesis technique is to ensure that all 4 inputs of the LUTs are effectively engaged. When half-adders are implemented, the LUT cannot be efficiently utilized. Accordingly, the stages involving half-adders have to be modified or combined so that the LUTs are effectively utilized. Keeping this in view, stage 0 and stage 1 of Fig. 2 may be modified to compute the partial products due to the two least significant multiplier bits. The last N stages may be reduced to N/2 stages by replacing the half-adders with suitable functional blocks and feeding the sum and carry outputs from one stage to another properly. For example, consider the multiplication of the 4-bit number X3 X2 X1 X0 with Y3 Y2 Y1 Y0. In stage 0 of the original scheme, the product of Y0 with X3 X2 X1 X0 is computed. In stage 1, the product of Y1 with X3 X2 X1 X0 is computed. These two stages can be modified as follows: the partial products obtained by multiplying X3 X2 X1 X0 by the two least significant bits of the multiplier (Y1 Y0) are shown in table 1, given below. Since a 4-input LUT is available, the partial products corresponding to each column of table 1 can be computed using a single LUT. Hence the two stages of the original multiplier of figure 2 can be reduced to a single stage with 4-input LUTs.

              X3    X2    X1    X0
      X                   Y1    Y0
      ----------------------------
            X3Y0  X2Y0  X1Y0  X0Y0    partial product row-1
      X3Y1  X2Y1  X1Y1  X0Y1          partial product row-2

Table 1 Partial products of multiplying the multiplicand with two LSBs of multiplier

Stages two and three of the original scheme actually require 4-input functional blocks (M0, M2, M2, M2) and hence they can be implemented efficiently using the 4-input LUTs; no modification of the original scheme is required for these two stages. Stages 4 to 7 in the original scheme are used for carry propagation and they use only half-adders for this purpose. Every two such stages can be properly combined into one stage in view of the use of 4-input LUTs.
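The column-wise use of a 4-input LUT on Table 1 can be sketched as follows. This is an illustrative reading, not the patent's block definition: it assumes that each column's LUT takes the four inputs X_j, X_(j-1), Y0 and Y1 and produces the sum and carry of the two partial-product bits in that column. The names `first_stage_lut` and `first_stage` are hypothetical.

```python
def first_stage_lut(x_j, x_jm1, y0, y1):
    """One 4-input LUT: (sum, carry) of the two bits X_j*Y0 and X_(j-1)*Y1 in a column."""
    a = x_j & y0      # bit from partial product row-1
    b = x_jm1 & y1    # bit from partial product row-2 (shifted left by one column)
    return a ^ b, a & b


def first_stage(x, y, n_bits=4):
    """Rebuild X * (Y1 Y0) from the per-column LUT outputs, to check the decomposition."""
    y0, y1 = y & 1, (y >> 1) & 1
    total = 0
    for j in range(n_bits + 1):  # n_bits + 1 columns, since row-2 is shifted by one
        x_j = (x >> j) & 1 if j < n_bits else 0
        x_jm1 = (x >> (j - 1)) & 1 if j >= 1 else 0
        s, c = first_stage_lut(x_j, x_jm1, y0, y1)
        total += (s << j) + (c << (j + 1))
    return total


if __name__ == "__main__":
    print(first_stage(0b1101, 0b0011))  # 13 * 3
```

Under this assumption every LUT in the merged stage uses all four inputs except those in the two end columns, which see only one partial-product bit.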

Keeping in view the above modifications, an improved 4x4 pipelined array multiplier has been developed and is shown in Fig. 3 of our co-pending application no. . This is also referred to as the Optimally Synthesized Pipelined Array Multiplier (OSPAM). It requires only 5 stages of combinational logic blocks (M0, M5, M5, M5, M0 - stage 0; M0, M2, M2, M2 - stage 1; M0, M2, M2, M2 - stage 2; M8, M7, M7, M6 - stage 3; M10, M9 - stage 4) separated by 6 stages of registers (0-5). The original multiplier in figure 2 requires 8 stages of combinational logic blocks and 9 stages of registers. Further, the latency of the multiplier shown in figure 3 is reduced from 9 clock cycles to 6 clock cycles. Hence the latency is reduced by about one third and the number of registers required is also reduced.

In general, MxN multiplication can be achieved using (M-1)+(N/2) stages of registers with a latency of M+(N/2) clock cycles. This results in lower latency and requires less area for implementation.

WAVEPIPELINED ARRAY MULTIPLIER (WPARAM):

Wavepipelining is a technique for increasing the throughput of a digital system without introducing pipelining registers between adjacent combinational logic blocks. The 4X4 wavepipelined array multiplier of Fig. 4 can be obtained from the circuit of Fig. 3 of our co-pending application no. as follows:

The latches (registers) in stages 0 - 4 in Fig. 3 of our co-pending application no. are removed and the inputs to the various combinational logic blocks denoted as M0, M2, M5, M6, M7, M8, M9 and M10 are made to arrive at the same time at each stage. The appropriate inputs to the various blocks M0 to M10 are shown in Fig. 4. For example, the sum output of the left most M0 block from stage 0 is fed as one of the inputs to the left most M2 block in stage 1; the carry from the left most M5 block of stage 0 is fed as another input to the left most M2 block of stage 1. These inputs should arrive at stage 1 with equal delay from stage 0. This can be achieved by properly adjusting the routing delays of the inputs from stage 0 to stage 1.

In general, wavepipelining of an m x n array multiplier is achieved by carrying out the following modifications on the pipelined array multiplier.

1. The delays of the inputs to each of the combinational logic blocks are adjusted such that they undergo the same delay, i.e. the sum of the interconnect delay and the LUT delay should be made equal.
2. The pipelining registers are removed; instead, the inherent capacitance at the output of one combinational logic block will be adequate to hold the data until the next combinational logic block processes this data and becomes stable.

For carrying out the first modification, the interconnect delays between one LUT and the next LUT should be adjusted appropriately. Since the Xilinx Foundation Series CAD tool permits the delays to be viewed as well as varied, the wavepipelining technique can be applied to Xilinx FPGAs. However, this technique is applicable to other LUT based FPGAs as well, provided there is a mechanism to view the interconnect delays and adjust them.
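The first modification (equalising the sum of LUT delay and interconnect delay on every path into a stage) can be illustrated numerically. The sketch below is only a bookkeeping aid under stated assumptions: the path names and the 1.0 ns figure are hypothetical, the slowest path is taken as the target, and the actual adjustment would be made by manual routing in the CAD tool rather than by this function.

```python
def equalize_delays(path_delays_ns):
    """Return the extra interconnect delay each path needs to match the slowest path.

    path_delays_ns maps a path name to its total delay (LUT delay + interconnect delay).
    """
    target = max(path_delays_ns.values())
    return {name: round(target - delay, 3) for name, delay in path_delays_ns.items()}


if __name__ == "__main__":
    lut_delay = 1.3  # ns, illustrative LUT delay
    paths = {
        "sum_to_M2": lut_delay + 0.7,    # hypothetical short route
        "carry_to_M2": lut_delay + 1.0,  # hypothetical medium route
        "fanout_far": lut_delay + 1.4,   # hypothetical long route
    }
    print(equalize_delays(paths))
```

Choosing the slowest path as the target reflects the fact that routing delay can be added but not removed below the minimum route.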

After the data is processed by a number of wavepipelined stages, the output finally has to be latched for it to be processed further by conventional pipelined synchronous systems. For the purpose of latching, the latch in the CLB may be used. The clock input to the XC4000E-1 family CLB requires a minimum of 3ns on-time and 3ns off-time.

By pumping-in the data to the wavepipelined multiplier at the rate of 5.4ns and using two latches (registers) to store the output alternately (see figure 6), the minimum on-time and off-time requirements can be satisfied. In this case, each latch stores only one data item in a period of 10.8ns. One latch uses the positive edge of the clock and the other latch uses the negative edge. In other words, an interleaving technique can be used to achieve a multiplication rate of 5.4ns while using latches with a clock period of 10.8ns. However, the conventional pipelined array multiplier typically works at 10ns due to set-up time and routing complexity. Hence the wavepipelined multiplier is about twice as fast as the pipelined multiplier. To ease the routing complexity and to minimize the resources, the wavepipelining delay per stage is retained at 2.7ns and a single data item is pumped-in twice to the multiplier to effectively feed data at the rate of 5.4ns to the latches. This results in a latency of 2.7ns x number of wavepipelined multiplier stages. However, for the conventional pipelined array multiplier the latency is equal to the clock period x number of pipelined multiplier stages. Since the minimum clock period that can be used is about 10ns for the conventional pipelined array multipliers, the latency of the wavepipelined array multiplier is about one third of the latency of the pipelined array multiplier.

Comparison of WPARAM with the Wavepipelined Guild Multiplier:

The Wavepipelined Guild Multiplier shown in Fig. 5 is obtained from the pipelined Guild multiplier by removing the pipeline registers and equalizing the delays of all the paths crossing the equitemporal lines. The Guild multiplier requires more area than the OSPAM. Accordingly, the wavepipelined Guild multiplier also requires more area than the WPARAM. Further, since the WP Guild multiplier requires manual routing, an increase in area also implies an increase in the routing complexity. Further, the latency of the WP Guild multiplier is more than that of the WPARAM.

Implementation of Area efficient Filters using WPARAM

In figure 6, an interleaving scheme for using a single multiplier and two accumulators for finding the sum of products corresponding to two parallel filters is shown. One accumulator finds the sum of products corresponding to odd samples and the other accumulator finds the sum of products corresponding to even samples. In addition to this, the third accumulator in figure 6 is used for accumulating the outputs of the odd filter and the even filter.

Two parallel filters, each with 8 taps, with the input samples and the impulse response coefficients represented using 2 bits and 6 bits respectively, are shown in figure 6. In the XC4010E-1 device, a 2x6 WPARAM can be made to operate at a peak rate of 5.4ns (nanoseconds). The 11-bit accumulator for this filter requires 10ns in the same device. Since the accumulator is slower than the multiplier, a single multiplier may be shared between the two accumulators using an interleaving technique. The multiplier alternately processes the data and coefficients corresponding to the odd filter and the even filter. The multiplier output corresponding to each filter is latched into a separate register. A 5.4ns clock is used to pump-in data into the WPARAM. A 10.8ns clock is used for latching the data into the registers. Data is latched into one of the registers at the positive edge of the 10.8ns clock and into the other register at the negative edge. A single RAM may be used for storing the input samples. The odd locations contain the odd numbered samples and the even locations contain the even numbered samples. Similarly, a single RAM may be used for storing the impulse response coefficients. The odd locations contain the coefficients of one filter and the even locations contain the coefficients of the other filter.
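The interleaving scheme described above, one shared multiplier feeding two accumulators selected by address parity, can be sketched behaviourally. Timing (the 5.4ns and 10.8ns clocks) is not modelled, and the function name `interleaved_filters` and the sample data are hypothetical.

```python
def interleaved_filters(samples, coeffs):
    """Share one multiplier between two sum-of-products accumulators.

    samples and coeffs model the two RAMs: even addresses belong to one
    filter and odd addresses to the other. Returns (even_sum, odd_sum).
    """
    acc = [0, 0]
    for addr, (s, c) in enumerate(zip(samples, coeffs)):
        product = s * c            # the single shared multiplier
        acc[addr % 2] += product   # alternate between the two latches/accumulators
    return acc[0], acc[1]


if __name__ == "__main__":
    samples = [1, 2, 3, 1]     # 2-bit input samples
    coeffs = [10, 20, 30, 40]  # 6-bit coefficients
    print(interleaved_filters(samples, coeffs))
```

Because each accumulator only sees every second product, it has twice the multiplier period in which to complete its addition, which is why the slower accumulator does not limit the multiplier rate.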

The basic clock of 5.4ns is used for the multiplier. This clock is divided further using counters to generate the 10.8ns clock and the addresses for the RAM.

Results and Conclusions:

In order to study the characteristics of the instant multiplier (WPARAM) and compare it with the other multipliers discussed in our co-pending application, 4X4 multipliers are implemented on the XC4010E-1 device.

Input to the above multipliers may be fed directly through the IO pad, through a register or through a RAM. Feeding the inputs through a RAM becomes convenient when the multiplier is to be used for implementing a filter. Because of the initial latency of the pipelined array multipliers, they become efficient only if they are used for performing a number of multiplications one after another, as is true in the case of computing the sum of products of the elements of two vectors. Pipelined array multipliers, OSPAM and Guild multipliers can be tested fully using the simulation tool. Hence these multipliers are tested using simulation and they are found to be satisfactory. However, OPARAM, WPARAM and the wavepipelined Guild multiplier use manual routing, and the interconnect delays play a crucial role in determining their proper operation. Since the interconnect delays are non-deterministic, there may not be a one to one correspondence between the delays measured by the CAD tool and the actual delays obtained in a particular device after the design is downloaded to the device. Hence, to verify the proper operation of these multipliers, they should be studied not only by using simulation but also by pumping in the test data on to the actual device. For this purpose the test data is applied to both OPARAM and WPARAM through the parallel port of a Personal Computer to the demo board housing the FPGA, and the results are read back for verification. The multiplier and the multiplicand are stored in two RAMs, RAM0 and RAM1. The contents of these RAMs are read at the rate of 6ns/5.4ns and fed to the OPARAM/WPARAM respectively. However, the wavepipelined Guild multiplier has not been implemented by us.

For testing the design on the actual device, the clock signal should also be applied to the multiplier. The clock may be taken from an external source or it may have to be internally generated. One of the requirements for extracting the peak rate from the OPARAM is the ability to operate the pipelining registers at their maximum rate. This in turn requires a clock which operates at this rate.

Similarly, for the WPARAM and the wavepipelined Guild multiplier, a high frequency clock may be required if they are to be operated at the peak rate. The minimum clock period for the WPARAM is equal to the sum of the LUT delay and the interconnect delay between the output of one LUT and the input to the next LUT. The LUT delay of an XC4000E family FPGA with a speed grade of 1 (i.e. XC4000E-1) is 1.3ns. The interconnect delay can be varied from 0.7ns onwards. An interconnect delay of 1.4ns may be chosen so that a particular LUT output can be fed to a number of LUTs located at different positions in the FPGA with equal delay. With this, a total delay of 2.7ns is required for each stage of combinational logic blocks. This enables the data to be pumped-in at a rate of 2.7ns and its multiples.

Low cost development and prototype boards may not be able to support signals at a high frequency even though the FPGA as such can operate internally at rates higher than that which is possible with the board. For example, the demo board for the XC4000E device family from Xilinx has a maximum operating speed of 100MHz. Hence pumping in clocks above this frequency using an external source becomes impossible. Instead, the clocks have to be internally generated. For this purpose the clock circuit shown in Fig. 7 is used. The LUT delay and the interconnect delay determine the clock period. When circuits of this kind are used, one of the important issues is the stability of the clock generated and the extent to which it affects the operation of the pipelined multiplier. To verify this, the stability of the clock and the stability of the output of the pipelined array multiplier need to be checked over a period of at least a few hours. For this purpose, the above test was carried out on the XC4010E-1 device and it was found to be working satisfactorily. The clock generated using this technique has been used for OPARAM and WPARAM. If the board is capable of supporting a high frequency clock, an external source itself can be used for testing the multipliers.
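The per-stage delay and the resulting data rates quoted above follow from simple arithmetic, sketched here. The constant and function names are hypothetical; the delay values are the ones given in the text for the XC4000E-1.

```python
LUT_DELAY_NS = 1.3           # LUT delay quoted for XC4000E-1
INTERCONNECT_DELAY_NS = 1.4  # interconnect delay chosen in the text


def stage_delay_ns():
    """Delay of one wavepipelined stage: LUT delay plus interconnect delay."""
    return LUT_DELAY_NS + INTERCONNECT_DELAY_NS


def rate_mhz(period_ns):
    """Convert a data period in nanoseconds to a rate in MHz."""
    return 1000.0 / period_ns


if __name__ == "__main__":
    base = stage_delay_ns()
    for multiple in (1, 2, 3, 4):
        period = multiple * base
        print(f"{period:.1f} ns -> {rate_mhz(period):.1f} MHz")
```

The second multiple, 5.4ns, corresponds to roughly 185 MHz, which is above the 100MHz limit quoted for the demo board and is the reason an internally generated clock is used.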

The different characteristics of the multipliers, such as the number of CLBs required, the maximum operating frequency and the latency, are evaluated for all the above multipliers and are tabulated in Table 2. Implementation results of the 7X7 wavepipelined Guild multiplier reported in the literature are also given in table 2. For the sake of comparison, a 7X7 OPARAM is implemented on XC4010E-1 and the results are given in table 2.

WPARAMs of different sizes are implemented and tested. WPARAMs of size 4X4 and 2X6 are found to be satisfactory at a clock rate of 185 MHz. However, WPARAMs of size 4X6 and 6X6 are found to be unsatisfactory at this frequency. The plausible reasons for this could be as follows:

As mentioned earlier, for testing the WPARAM, the test data is stored in two RAMs. The delay between the instant when a new address is applied to the RAM and the instant when a stable output appears at the input to the latch, after being processed by the k wavepipelined stages of the multiplier, is 2.7(k+1)ns. This data remains stable for a period of 5.4ns, as a new address is fed to the RAM only at intervals of 5.4ns. Hence ideally the set-up time available for the latch is 5.4ns. However, the set-up time actually available may be reduced due to the following reasons: (a) The difference in rise and fall times may marginally increase the time required for the output of the multiplier to settle, depending upon the input pattern. (b) Due to fabrication variations, the different interconnects and LUTs may not have identical characteristics and may differ from the values reported by the CAD tool when the design is actually downloaded to the device. When the actual stable period reduces to less than 1ns, it leads to set-up time violations on the latch. For the wavepipelined multiplier to remain in the stable region, it has been experimentally verified that about 5 wavepipelined stages may be reliably used for the XC4010E-1 devices. Since a 4X6 multiplier requires 6 stages for multiplication and 1 stage for the RAM, it cannot be reliably implemented directly. It may be noted here that in a wavepipelined multiplier, the interconnect delay variations and LUT delay variations in the various stages add up and make the result appear either too early or too late compared to the active clock edge of the latches at the output of the multiplier. However, in the OPARAM the interconnect delay variation and LUT delay variation in a single stage alone determine the proper operation. Hence a multiplier of any size can be built reliably using OPARAM.
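The settling delay of 2.7(k+1)ns stated above can be tabulated against the 5.4ns interval at which new addresses are applied. This is a nominal calculation only, with hypothetical function names; it does not model the rise/fall-time and fabrication variations that the text identifies as the real limit.

```python
STAGE_DELAY_NS = 2.7   # delay per wavepipelined stage, from the text
DATA_PERIOD_NS = 5.4   # interval between new RAM addresses, from the text


def settle_time_ns(k):
    """Nominal delay from a new RAM address to a stable latch input, for k multiplier stages."""
    return STAGE_DELAY_NS * (k + 1)  # the +1 accounts for the RAM stage


def waves_in_flight(k):
    """Nominal number of data items travelling through the RAM and multiplier at once."""
    return settle_time_ns(k) / DATA_PERIOD_NS


if __name__ == "__main__":
    for k in (2, 5, 6):
        print(k, round(settle_time_ns(k), 1), round(waves_in_flight(k), 2))
```

The more stages there are, the more un-latched data waves coexist in the logic, so accumulated delay variation has a larger effect on when the result reaches the output latch.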

It may be noted that the present effort is to operate the multiplier at the rate of 5.4ns (corresponding to an operating frequency of 185MHz). When the multiplier is operated at frequencies lower than this, the margin available for latching the data in the conventional latches in the final stage becomes large and hence it can work satisfactorily even for higher order multipliers. There is a one to one correspondence between the multiplier size and the rate at which data is pumped in. A 2X4 WPARAM works satisfactorily at a rate of 2.7ns, but at the 2.7ns rate the 2X6 WPARAM did not work. When the data period is increased to 5.4ns, the 2X6 WPARAM gives the correct result. When the data period is increased to 8.1ns, 10.8ns and so on, a still higher order multiplier can be made to work. At a data period of 10.8ns, even an external clock can be used for the output latch and the RAMs.

Multipliers of larger size such as 4X6, 6X6 etc. may be designed directly and supplied with clocks with periods of 8.1ns or 10.8ns. Alternatively, a lower order multiplier may be used sequentially to implement the higher order multiplier. For example, 4X6 multiplication may be carried out sequentially using a single 2X6 multiplier operating at the rate of 5.4ns and two phases of multiplication. In the first phase, the two least significant multiplier bits may be multiplied with the 6 bit multiplicand, and in the next phase, the two most significant bits may be multiplied with the multiplicand. The outputs resulting from the two phases have to be added after shifting the output corresponding to the first phase by two bits towards the right. The sequential multiplication scheme can be implemented using the interleaved accumulators. The two MSBs of the multiplier may be stored in even locations and the two LSBs of the multiplier may be stored in odd locations of the RAM. The addresses to the RAMs RAM0 and RAM1, containing the multiplier and the multiplicand, are varied at rates of 185 MHz and 92.5 MHz respectively. The 4X6 multiplier designed in this fashion is found to be satisfactory. Similarly, higher order multipliers such as 6X6 and 8X6 may be implemented using a single 2X6 multiplier and 3 and 4 accumulators respectively. Compared to the direct implementation scheme, the sequential implementation scheme is area efficient and the routing complexity is also less.

In addition to the above two techniques, a higher order WPARAM can be built and operated more reliably using a lower speed grade device such as XC4010E-4. For example, the wavepipelined Guild multiplier built and reported in the literature is of size 7X7 and has been implemented on XC4005E-6. This has a larger LUT delay compared to the XC4010E-1, and the interconnect delay becomes a smaller fraction of the LUT delay. In view of this, variations in the interconnect delay do not alter the performance of the wavepipelined multiplier unless the number of wavepipelined stages is increased significantly. A 7X7 WPARAM built in the same device would require less area and latency than the wavepipelined Guild multiplier due to the reduction in the number of wavepipelined stages.

As mentioned earlier, the maximum size of multiplier that can be designed using the wavepipelining technique is limited. However, OPARAM of larger sizes can be directly implemented and operated with a clock period of 6ns. WPARAM of larger sizes have to be operated with a clock period of 8.1ns, 10.8ns or higher multiples of 2.7ns. For higher order multipliers, OPARAM is to be preferred if higher speeds are required, and any data rate lower than 166 MHz can be used. It need not be a multiple of a particular frequency as in the case of WPARAM. If power dissipation is the concern, one may go in for WPARAM instead of OPARAM. The implementation details of the filters on XC4010E-1 using WPARAM are presented next.
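The two-phase sequential scheme described above for 4X6 multiplication can be illustrated behaviourally. This is a minimal sketch, not the hardware implementation: the function names are hypothetical, the 2X6 multiplier is modelled with ordinary integer multiplication, and the alignment is done by shifting the second-phase output two bits to the left, which gives the same relative alignment as shifting the first-phase output two bits to the right without discarding bits in integer arithmetic.

```python
def multiply_2x6(multiplier_2bit: int, multiplicand_6bit: int) -> int:
    """Behavioural stand-in for the 2X6 multiplier block (assumed model)."""
    assert 0 <= multiplier_2bit < 4
    assert 0 <= multiplicand_6bit < 64
    return multiplier_2bit * multiplicand_6bit


def multiply_4x6_sequential(multiplier_4bit: int, multiplicand_6bit: int) -> int:
    """4X6 product formed from two passes through a single 2X6 multiplier."""
    assert 0 <= multiplier_4bit < 16
    low_bits = multiplier_4bit & 0b11          # two least significant bits
    high_bits = (multiplier_4bit >> 2) & 0b11  # two most significant bits
    phase1 = multiply_2x6(low_bits, multiplicand_6bit)   # first phase
    phase2 = multiply_2x6(high_bits, multiplicand_6bit)  # second phase
    # Align the partial products by two bit positions and add them
    return (phase2 << 2) + phase1


if __name__ == "__main__":
    print(multiply_4x6_sequential(13, 45) == 13 * 45)
```

In this model the two partial products play the role of the values held in the two interleaved accumulators before the final addition.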

For the study of filters using the WPARAM, two 8-tap interleaved parallel FIR filters for inputs with 2 bits and impulse response coefficients with 6 bits are implemented on the XC4010E-1 FPGA. One 8 tap filter corresponds to odd input samples and the other 8 tap filter corresponds to even input samples. The outputs of the above two filters may be accumulated to obtain a 16 tap FIR filter. The RAMs introduce a latency of 2.7ns for the wavepipelined multiplier. For the conventional pipelined array multiplier, the RAM introduces a latency of 1 clock cycle (10ns). The above filters have been found to work satisfactorily and, to perform 16 multiplications, the time required is (5.4 x 16) + (2.7 x 7)ns. The 16 tap filter requires 89 CLBs for implementation.
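The way two 8 tap filters on odd and even samples combine into one 16 tap result can be sketched as follows. This is an illustrative model under stated assumptions: the function names are hypothetical, one output sample is treated as a plain 16-term sum of products, and the assignment of even-indexed terms to one accumulator and odd-indexed terms to the other is assumed for the purpose of the example.

```python
from typing import Sequence


def fir_sum_direct(samples: Sequence[int], coeffs: Sequence[int]) -> int:
    """Reference 16 tap sum of products computed in a single pass."""
    return sum(x * h for x, h in zip(samples, coeffs))


def fir_sum_interleaved(samples: Sequence[int], coeffs: Sequence[int]) -> int:
    """Same sum, split into an even-sample and an odd-sample 8 tap filter."""
    assert len(samples) == len(coeffs) == 16
    # Accumulator for the 8 tap filter fed with even input samples
    even_acc = sum(samples[i] * coeffs[i] for i in range(0, 16, 2))
    # Accumulator for the 8 tap filter fed with odd input samples
    odd_acc = sum(samples[i] * coeffs[i] for i in range(1, 16, 2))
    # Accumulating the two 8 tap outputs yields the 16 tap output
    return even_acc + odd_acc


if __name__ == "__main__":
    samples = [i % 4 for i in range(16)]            # 2 bit inputs
    coeffs = [(7 * i + 3) % 64 for i in range(16)]  # 6 bit coefficients
    print(fir_sum_interleaved(samples, coeffs) == fir_sum_direct(samples, coeffs))
```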

A 16 tap FIR filter may also be implemented using two parallel 8 tap non-interleaved filters with 2X6 OSPAM. In this case, for each of the parallel filters, one dedicated multiplier is used. The 16 tap filter requires a computation time of (12 x 16) + (12 x 7)ns. The 16 tap filter with OSPAM requires 87 CLBs for implementation. Hence the 16 tap FIR filter with WPARAM is faster by a factor of 2.5 than the filter with OSPAM.
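The two computation-time expressions quoted for the 16 tap filter can be evaluated directly. The sketch below simply evaluates the expressions as written in the text; the function names are hypothetical. Evaluated this way, the times come to roughly 105ns and 276ns, and their ratio is a little above the factor of 2.5 quoted above.

```python
def wparam_16tap_ns() -> float:
    """(5.4 x 16) + (2.7 x 7) ns, as quoted for the WPARAM based filter."""
    return 5.4 * 16 + 2.7 * 7


def ospam_16tap_ns() -> float:
    """(12 x 16) + (12 x 7) ns, as quoted for the OSPAM based filter."""
    return 12 * 16 + 12 * 7


if __name__ == "__main__":
    w = wparam_16tap_ns()
    o = ospam_16tap_ns()
    print(f"WPARAM: {w:.1f} ns, OSPAM: {o:.1f} ns, ratio: {o / w:.2f}")
```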

Further the wavepipelined multiplier doesn't require any pipelining register and hence the power dissipation due to charging and discharging of the registers will be absent.

Next, the implementation results of an 8 tap filter with 4 bit input samples and 6 bits impulse response coefficients are considered:

As mentioned earlier, the 4X6 WPARAM is implemented using a sequential scheme with a single 2X6 WPARAM and two accumulators. This requires 89 CLBs and requires (10.8 x 8) + (10.8 x 7)ns for computation of the sum of products due to 8 taps. The FIR filter using the conventional pipelined array multiplier may also be implemented using a 4X6 OSPAM without interleaving. This has 9 pipeline stages including the RAM and requires 68 CLBs for implementation. In this case, the time required for computing the filter output becomes (13.3 x 8) + (13.3 x 8)ns. Hence the FIR filter using WPARAM is about 1.2 times faster and requires about 30% more area compared to the filter with OSPAM.

All the FIR filters implemented using WPARAM may also be implemented using OPARAM. The filters using OPARAM are slower than those using WPARAM, as the minimum clock period that can be used with OPARAM is about 11% larger than that of WPARAM. The filters using WPARAM result in less power dissipation than those using OPARAM.

Table 2:

From table 2 and based on the experiments carried out, the following conclusions can be arrived at:

1. The 4X4 WPARAM requires 25-30% less area and latency than the conventional pipelined array multiplier and the Guild multiplier.

2. The 4X4 WPARAM operates faster by about 1.4-1.6 times than the conventional pipelined array multiplier and the Guild multiplier.

3. The 4X4 WPARAM operates faster by about 1.1 times than the OPARAM and both the multipliers require the same area.

4. The 7X7 OPARAM requires about 30% less area than the wavepipelined Guild multiplier. The speeds cannot be compared as OPARAM is implemented in XC4010E-1 which is a faster device than the XC4005E-6 which was used for the implementation of the 7X7 wavepipelined Guild multiplier.

5. WPARAM dispenses with registers and hence the power dissipation is reduced for WPARAM compared to the non wavepipelined multipliers.

6. It is observed from table 2 that, as the number of wavepipelined stages is increased, the maximum operating frequency is decreased. In other words, as the latency of the multiplier increases, the maximum operating frequency decreases. The 4X4 WPARAM requires 5 stages and the wavepipelined Guild multiplier requires 7 stages. Hence a wavepipelined Guild multiplier can be operated only at a frequency lower than that of the WPARAM.

7. The internal clock generation scheme proposed has been found to be stable, and it enables the multipliers to be operated at their peak rate irrespective of the speed grade of the PCB on which the FPGA is mounted.

8. The 16 tap interleaved parallel FIR filter using 2X6 WPARAM requires almost the same area but is faster by a factor of 2.2 times than the filter with two dedicated multipliers.

9. The 8 tap interleaved FIR filter with 4 bit samples using sequential 2X6 WPARAM requires 30% more area and is faster by 1.2 times than the filter with the OSPAM.

10. For higher order multipliers, OPARAM is faster than the WPARAM and, irrespective of the size of the multiplier, OPARAM can be operated at any data rate less than or equal to 166 MHz.

Claims

We claim:

1. An improved m x n wave pipelined array multiplier comprising: an array of combinational logic blocks (each block having two outputs denoted as sum and carry outputs for the sake of convenience), - two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks such that the sum output of the jth combinational logic block in the ith row is connected to the (j+1)th combinational logic block in the (i+1)th row, the carry output of the jth combinational logic block in the ith row is connected to the jth combinational logic block of the (i+1)th row, in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of the (n-1)th row and the sum output of the second extreme right combinational logic block of the (n-1)th row, the extreme left combinational logic block in the nth row is connected to the sum output of the extreme left combinational logic block and the carry output of the second left combinational logic block in the (n-1)th row, the jth (j=2 to n-1) combinational logic block in the nth row is connected to the sum outputs of the (j-1)th and jth combinational logic blocks in the (n-1)th row and the carries from the jth and (j+1)th combinational logic blocks in the (n-1)th row, for each of the combinational logic blocks in the last (n/2)-1 rows excepting the extreme right block, the jth block is connected to the sum outputs of the jth and (j+1)th combinational logic blocks and the carry outputs from the (j+1)th and (j+2)th combinational logic blocks in the previous row,
- each of the extreme right combinational logic blocks of the last (n/2)-1 rows of the combinational logic block array is provided with the sum output of the third extreme right combinational logic block and the carry output of the second extreme right combinational logic block in the previous row, and - the sum outputs of all the extreme right combinational logic blocks of each row and the second extreme right combinational logic blocks of the first row and the last (n/2) rows will provide the final multiplication, a delay control means is provided between the combinational logic blocks in the ith row and the (i+1)th row to make the sum of the interconnect delay and the delay in each logic block itself equal for all the logic blocks in the ith and (i+1)th rows, thereby increasing the speed of multiplication and reducing the area, power consumption and latency.

2. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein m x n = 4 x 4.

3. An improved 4 x 4 wave pipelined array multiplier as claimed in claim 2 comprising: an array of combinational logic blocks, each block having two outputs denoted as sum and carry outputs for the sake of convenience, two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining 2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks, in the first row combinational logic blocks 1 to 4 are connected to the second row such that the sum output of the combinational logic blocks 1 to 3 in the first row is connected to combinational logic blocks 2 to 4 in the second row, - in the second row combinational logic blocks 1 to 4 are connected to the third row such that the sum output of the combinational logic blocks 1 to 3 in the second row is connected to combinational logic blocks 2 to 4 in the third row, - in the first row combinational logic blocks 2 to 4 are connected to the second row such that the carry output of the combinational logic blocks 2 to 4 in the first row is connected to combinational logic blocks 2 to 4 in the second row, in the second row combinational logic blocks 2 to 4 are connected to the third row such that the carry output of the combinational logic blocks 2 to 4 in the second row is connected to combinational logic blocks 2 to 4 in the third row, in the fourth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of the third row and the sum output of the second extreme right combinational logic block of the third row, the third extreme left combinational logic block in the fourth row is connected to the sum outputs of the second extreme left and third extreme left combinational logic blocks in the third row and the carry outputs of the third extreme and fourth extreme left combinational logic blocks in the third row, the second extreme left combinational logic block in the fourth row is connected to the sum outputs of the extreme left and second extreme left combinational logic blocks in the third row and the carry outputs of the second extreme and third extreme left combinational logic blocks in the third row, the extreme left combinational logic block in the fourth row is provided inputs from the sum output of the extreme left and the carry output of the second extreme left combinational logic blocks in the third row, the left combinational logic block of the last row of the combinational logic block array is provided with the sum outputs of the extreme left and second extreme left combinational logic blocks and the carry outputs of the second extreme left and third extreme left combinational logic blocks of the previous row, the right combinational logic block of the last row of the combinational logic block array is provided with the sum output of the second left combinational logic block and the carry output of the third left combinational logic block in the previous row, the sum outputs of all the extreme right combinational logic blocks of each row and the second extreme right combinational logic blocks of the first row and the last two rows will provide the final multiplication, a delay control means is provided between the combinational logic blocks in rows 1 to 4 and rows 2 to 5 to make the sum of the interconnect delay and the delay in each logic block itself equal for all the logic blocks in rows 1 to 4 and 2 to 5, thereby increasing the speed of multiplication and reducing the area, power consumption and latency.

4. An improved 4 x 4 wave pipelined array multiplier as claimed in claim 3 wherein the delay control means comprises selected interconnect wires between combinational logic blocks such that the sum of the interconnect delay between said logic blocks and the delay of the combinational logic blocks themselves is equal for the selected interconnect wires between logic blocks in adjacent rows.

5. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein the array of combinational logic blocks include LUTs.

6. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein at least one register is included in the input of the multiplier and two registers in the output of the multiplier are provided.

7. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein two accumulators are provided in the output of the multiplier for storing the results alternatively.

8. An improved m x n wave pipelined array multiplier substantially as herein described with reference to and as illustrated in the accompanying drawings.