WO2000022729A1 - Area efficient realization of coefficient architecture for bit-serial fir, iir filters and combinational/sequential logic structure with zero latency clock output - Google Patents

Publication number
WO2000022729A1
Authority
WO
WIPO (PCT)
Prior art keywords
block
coefficient
elements
architecture
bit
Prior art date
Application number
PCT/SG1998/000082
Other languages
French (fr)
Inventor
Rakesh Malik
Puneet Goel
Original Assignee
Stmicroelectronics Pte Ltd.
Stmicroelectronics Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Stmicroelectronics Pte Ltd., Stmicroelectronics Limited filed Critical Stmicroelectronics Pte Ltd.
Priority to PCT/SG1998/000082 priority Critical patent/WO2000022729A1/en
Priority to SG1998004194A priority patent/SG73567A1/en
Priority to EP98950602A priority patent/EP1119910B1/en
Priority to DE69821145T priority patent/DE69821145T2/en
Priority to US09/807,500 priority patent/US7007053B1/en
Publication of WO2000022729A1 publication Critical patent/WO2000022729A1/en

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03H IMPEDANCE NETWORKS, e.g. RESONANT CIRCUITS; RESONATORS
    • H03H17/00 Networks using digital techniques
    • H03H17/02 Frequency selective networks
    • H03H17/0223 Computation saving measures; Accelerating measures
    • H03H17/0225 Measures concerning the multipliers
    • H03H17/0227 Measures concerning the coefficients
    • H03H2218/00 Indexing scheme relating to details of digital filters
    • H03H2218/08 Resource sharing
    • H03H2218/10 Multiplier and or accumulator units

Definitions

  • the invention relates to area efficient realization of coefficient block [A] or architecture [A] with hardware sharing techniques and optimizations applied to this block.
  • the block [A] is connected to coefficient lines CLin_0, CLin_1, ..., CLin_n and BLin_0, BLin_1, ..., BLin_n coming from block [E] and/or [F], to be connected to perform filtering operation or a mathematical computing operation with optimization in hardware and provides a zero latency output.
  • the invention also gives the area minimal realization of digital filters based on coefficient block [A], when operated in bit serial fashion.
  • bit-serial digital filters, typically a finite impulse response (FIR) filter, infinite impulse response (IIR) filter and for other filters and applications based on combinational logic consisting of delay element (T), multiplier (M), serial adder (SA) and serial subtractor (SS).
  • FIR finite impulse response
  • IIR infinite impulse response filter
  • T delay element
  • M multiplier
  • SA serial adder
  • SS serial subtracter
  • FIG 1 shows the field of invention, applications of the device
  • FIG. 2 shows the symbol of components used in the device.
  • FIG. 3 shows the description of components used in the device.
  • FIG. 4 shows the bit serial FIR filter implementations
  • Figure 5 shows an example of FIR filter.
  • Figure 6 shows one of the known minimization techniques due to symmetry of coefficients.
  • Figure 7 shows the structure of prior/known implementation technique for coefficient block.
  • Figure 8 shows the generalized structure of prior/known implementation technique of coefficient block.
  • Figure 9 shows the minimization technique involved in FIR filter.
  • Figure 10 shows the generalized structure of the minimization technique involved in FIR filter.
  • Figure 11 shows the minimized structure of this example FIR filter, of the present invention.
  • Figure 12 shows the generalized optimized structure of the present invention.
  • Figure 13 shows the other advantage of the structure i.e getting the parallel output directly, of the present invention.
  • the output of this block is (01010110 in binary or 86 in integer representation).
  • This element is usually a Flip-flop (D Flip-flop, J-K Flip-flop etc.).
  • Serial adder (SA) and Serial Subtractor (SS): it performs addition/subtraction of two serial frames, x1(nT) and x2(nT), to generate output y(nT) represented as x1(nT)+x2(nT) or x1(nT)-x2(nT).
  • the serial adder (or subtractor) is implemented using a full adder (or subtractor) with a Flip-Flop as shown in "Figure 3" of the drawings.
  • the output Cout of [FA/FS] is delayed using the [T] element and is applied to Cin line of [FA/FS].
  • a serial coefficient multiplier (M) can be implemented by shift register using [T] elements and adder element [SA] (one shift means multiply by factor of 2). As shown in "Figure 3" of the drawings, the multiplier is formed by adding the outputs corresponding to ones in the binary representation of the coefficient.
  • Delay by one frame of data is done by shift register (series of Flip-flops (T) connected to store and shift the input frame).
  • the number of Unit delay (T) in one delay element is equal to the frame size of the input.
  • "Figure 4" shows the existing structure of bit serial FIR filter with coefficient lines CLin_0, CLin_1, ..., CLin_n and the coefficient block [A] having the coefficients c(0), c(1), c(2), ..., c(n).
  • the coefficient block is connected to delay element [Z^{-1}] and serial adders [SA] to form filter structure.
  • Y(n) = c(0) X(n) + c(1) X(n-1) + c(2) X(n-2) + ... + c(n) X(0)
  • coefficient lines CLin_0, CLin_1, ..., CLin_n are common and connected to input X[n].
  • the output lines CLout_0, CLout_1, ..., CLout_n are connected to block [E], consisting of delay element [Z^{-1}] and serial adders [SA] elements.
  • the structure makes easy realization of share-able multiplier in the coefficient block [A].
  • An example of share-able multiplier with coefficient values 3, 11 is illustrated in "Figure 4". The realization of these coefficients separately would require 4[T], 3[SA] elements.
  • CLin_0, CLin_1, ... being common, the hardware is realized using 3[T], 2[SA] elements.
  • the structure inherently requires more storage area, represented by [Z^{-1}], as compared to implementation 2, since the storage is done after the multiplication.
  • the storage area of each delay element [Z^{-1}] is (m+n).
  • the total storage space of the delay elements is (m+n) * (number of coefficients - 1).
  • the coefficient lines CLin_0, CLin_1, ... are not common.
  • Another feature of this structure is that it inherently requires lesser storage space, represented as [Z^{-1}]; unlike in the previous implementation, here the storage is done before multiplication.
  • the storage area of each delay element [Z^{-1}] is (m).
  • the total storage space is (m) * (number of coefficients - 1).
  • the invention is proposed in reducing the area of the coefficient block [A] and have share-able elements in coefficients, even if the coefficient lines CLin_0,
  • implementation 2 is area efficient with respect to implementation 1 due to reduced delay elements size. Over and above this by having share-able multiplier or reduced coefficient block [A], which are the key features of the invention, implementation 2 becomes still more area-efficient. This reduction is extendable to other filter based on coefficient block [A], as stated in the first section.
  • the present invention operates on integer valued coefficient.
  • Bit serial architecture reduce the interprocessor communication down to 1 bit. Generally the number of processors is very large, but because each processor is so small, the overall economy is very high. Bit serial architectures are usually most effective for filters having a few state variables, such as IIR filters and the wave-digital filters. For this reason, bit-serial techniques are less frequently applied to FIR structures, especially when the filter length is relatively long"
  • the present invention applies optimization techniques for reducing the area in large sized coefficients by applying a number of optimizations in FIR/IIR filter structures.
  • Y(n) = c(0) X(n) + c(1) X(n-1) + c(2) X(n-2) + ... + c(n) X(0)
  • Y(n) = 5 X(n) + 14 X(n-1) + 25 X(n-2) + 30 X(n-3) + 25 X(n-4) + 14 X(n-5) + 5
  • "Figure 5" of the drawings shows FIR filter structure of implementation 2.
  • the figure illustrates the realization of FIR filter represented by "Equation 1".
  • one known optimization technique takes advantage of the symmetry in the coefficients of Equation 1.
  • the streams which have to be multiplied with the same coefficients can be added first and then multiplied. For a large filter structure, this leads to a reduction by 45% in the coefficient block (see "Figure 6" of the accompanying drawings).
  • Y(z) = X(z) [5 (1 + Z^{-6}) + 14 (Z^{-1} + Z^{-5}) + 25 (Z^{-2} + Z^{-4}) + 30 Z^{-3}] (EQ 2)
  • each column represents a coefficient value.
  • [T] elements, shown as Tn_1 to Tn_m in column n, define the connectivity with line Sn.
  • the presence of one of the elements (T1_1 to T1_m, T2_1 to T2_m, ..., Tn_1 to Tn_m) is determined by coefficient value.
  • the number of [T] elements in a column is determined.
  • the number of serial adders/subtractors [SA/SS] in columns is represented as (SA1_1 to SA1_m, SA2_1 to SA2_m, ..., SAn_1 to SAn_m). The presence of one of these elements is again defined by the coefficient value.
  • the [T] elements are arranged in shift register form.
  • the input to first [T] element is connected to one of the S lines, while the input to [SA/SS] is connected from input S1 to Sn and/or one of the outputs of [T] elements of shift register, depending on the coefficient value.
  • using SAe_1 to SAe_n-1 elements, the addition/subtraction [SA/SS] of all the coefficient terms depicted in columns is done.
  • the final output is the output of last addition/subtraction [SA/SS].
  • This structure reduces the hardware of the coefficient block [A] by having share-able elements in coefficients, even if the coefficient lines CLin_0, CLin_1, ... are not commonly connected. This structure reduces the area by approximately 30-50% of "Figure 7" of the drawings by reducing the number of components and by having share-ability of components. Here the optimization techniques are illustrated with examples and the end of this section depicts the generalized equation and structure of the device.
  • the invention reduces the area of the coefficient block [A] by having share-able elements in coefficients, even in the implementation where the coefficient lines
  • serial input bit lines of said architecture [A] are S1, S2, ..., Sn [where n represents the number of coefficients of the filter], the addition terms of the equation [(a0*S1+b0*S2+....+k0*Sn),
  • (FA) & full subtractor (FS) elements, the values a0, b0, etc. are (+/-1 or 0), the connection of elements (FA/FS) to S1, S2, ..., Sn lines and interconnection of the elements (FA, FS) depend on the value of coefficients, the final output of last element [FA/FS] of each block [B] is terminated through lines b_1, b_2, ..., b_m at [T] elements, the number of T elements in cluster [C] depends on the size of maximum coefficient value and is share-able for all the coefficients in the coefficient architecture [A], in the said architecture all the combinational elements [B] are clustered together as [D] and all the unit delay elements {T[1],
  • T[2], ..., T[m]} are clustered together in [C], thereby separating the sequential and combinational logic
  • the output of element [T] is connected to one of the inputs of combinational logic of block [B] of next bit position
  • the interconnections from cluster [C] to [B] are represented as t_1, t_2, ..., t_m
  • the elements [FA/FS] are arranged in matrix form FA0_0 to FA0_n in bit position 0, FA1_1 to FA1_n in bit position 1, ....
  • Tn is used for multiplication by "a factor of two" and also in the implementation of the carry structure in the one bit serial adder, in the said architecture some extra components represented as block [Ex] are being used for connecting the carryout of all the adders/subtractors [FA/FS] of last stage of [D], the elements [FA/FS] and [T] are used within this block, and hence, the said architecture [A] structures the circuit into sequential block [C] consisting of [T] elements and combinational block [D] consisting of [FA, FS] elements, while the [T] elements of block [C] are common for all the coefficients and are share-able and positioned at end position of each block [B], the block [D] has combinational element blocks [B] which are essentially [FA, FS], thereby making share-able hardware within block [D], and the final output is the output of BITm position.
  • the area minimal realization of digital filters based on coefficient architecture [A] is achieved when it is operated in bit serial fashion.
  • the structure provides hardware minimization for finite impulse response (FIR) filter, infinite impulse response (IIR) filter and for other filters and applications related to combinational logic consisting of delay element (T), multiplier (M), adder and subtractor.
  • Further optimization in cluster [D] is done by using common adders (FA) and common subtractors (FS) and using these shared outputs, or by using subtractor (FS) instead of adders when the coefficient value is closer to a power of two, or by minimizing the use of subtractors by taking a common subtraction operator and using adders instead.
  • the present device, when used in implementation 2 of FIR/IIR filter and similar structures of filters, results in quite area efficient realization of the filter; the storage area in implementation 2, referred to as delay elements [Z^{-1}], is smaller as compared to implementation 1 due to the inherent property of the structure of implementation 2, and an additional saving in area in filter coefficient realization design is achieved by using the claimed structure of coefficient architecture [A] of "Figure 12".
  • the equation defines the bit position as BIT0 to BIT4, which is the position of "multiplication by power of two" (e.g. BIT0 represents multiplication by 2^0).
  • at BIT0 position, addition of S3+S4 is performed and the output is terminated at T(1).
  • the output of T(1) defines the next bit position BIT1, which performs addition of S2+S3+S4 using the [FA] and also the output of T(1).
  • the output of this addition is again terminated at T(2).
  • the structure is repeated in next BIT positions.
  • the carryout of [FA]'s are fed to the previous bit position.
  • the final addition of BIT position BIT4 gives the output of the coefficient block [A].
  • the adders at all the bit positions [B], represented by FA(1), HA(2), ..., FA(10), are clustered in [D].
  • the adder [FA]'s inputs are connected from coefficient lines S1, S2, S3, S4 and from unit delay element of previous bit position.
  • the addition/subtraction is performed in [B] block and the final output of last adder [FA] is connected to [T] elements, which are used for "multiplication by factor of 2".
  • the interconnection from [B] block to [T] block is represented as b_1, b_2, b_3, b_4.
  • the outputs of [T] are connected to one of the inputs of combinational logic of block [B] of next bit position (i.e. connected to input of first element (FA) of block [B]).
  • the flip-flop [T] is used for dual purpose
  • the HA(2) performs addition data on S2, S4 lines
  • the output Z represents shared adder Al, being fed to FA(3) and FA(5).
  • the output Z of FA(3) defines bit position 3 is terminated at [T(4)] element.
  • the Cout pins at this bit position are connected to Cin of any adder [FA(5)] in previous bit location, hence utilizing the [T(4)] element to enable all the FA's at this location to work as a [SA].
  • the structure of [SA] is essentially [FA] along with [T] element connecting the Cout of FA to its Cin pin.
  • The circuit is structured into sequential block [C] consisting of [T] elements and combinational block [D] consisting of FA, FS elements.
  • Block [C] having sequential elements is common for all the coefficients and has share-able elements [T] positioned at end position of each block [B].
  • Block [D] has combinational blocks [B] which are essentially FA, FS. Not only is the hardware within block [B] share-able but also across various [B] blocks. Hence the component hardware within [D] block is minimized.
  • The invention provides an area efficient realization of filter coefficient block [A] applicable to filter devices such as FIR, IIR and other filter structures based on this block.
  • This architecture is also applicable to combinational and sequential logic consisting of adder, subtractors, multipliers and flip flop [T]. This architecture is realized using the elements full adders (FA), full subtraction (FS) and flip-flop[T].
  • BIT1 ... BITm.
  • the number of [T] elements depends on the size of maximum coefficient and is share-able for all the coefficients in the coefficient block [A]. Also all the elements [B] are clustered together as [D] and all the unit delay elements {T[1], T[2], ..., T[m]} are clustered together in [C], thus separating the sequential and combinational logic.
  • the input of the unit delay element [T] is final output of block [B] and the output of element [T] is connected to one of the inputs of combinational logic of block [B] of next bit position (i.e. connected to input of first element (FA or FS) of block [B] depending upon the sign value +/-).
  • the interconnections from cluster [C] to [B] are represented as t_1, t_2, ..., t_m.
  • the [T] elements clustered as [C] are share-able for all the coefficients and the full adder/subtractor (FA/FS) components are clustered as [D].
  • the carry-out pin of full adder (FA) of each cluster stage [B] is fed to input of full adder (FA) of previous stage cluster [B], i.e. stage preceding the flip-flop (T) element of cluster [C].
  • F full adder/subtractor
  • Extra components represented as [Ex] block are used for connecting the carry-out of all the adders/subtractors (FA/FS) of last stage of [D], i.e. bit position BIT0.
  • Full adders/full subtractors [FA/FS] and unit delays [T] are used in this block.
  • the line COUT (carryout) of bit position BIT0 is connected to [Ex] block (typically to inputs of elements such as [FA] or [FS]).
  • the carryout (COUT) of each one of [FA/FS] is fed to the CIN of the same element.
  • Z of [FA]'s to the input A or B of next [FA] element.
  • a binary tree can be formed here.
  • the number of [FA], [T] elements in [Ex] block are [number of carryout pins -1] and [number of carryout pins] respectively.
  • this structure reduces the area of coefficient block [A] (by 50-75% of the area of coefficient block [A]).
  • the coefficient having maximum value is in 16 bits (e.g. maximum coefficient value is +32767 or -32768 in 2's complement representation).
  • average size of the coefficient approximated by the formula is 8 bit.
  • “Minimization (Already applied as patent)” and “Proposed Minimization (Proposal for Patent)” would require only 16 Flip-Flops (the flip-flops of all the coefficients are share-able and their number is limited by the coefficient which has the maximum value).

Abstract

The invention relates to area efficient realization of coefficient block [A] or architecture [A] with hardware sharing techniques and optimizations applied to this block. The block [A] is connected to coefficient lines CLin_0, CLin_1.....CLin_n and BLin_0, BLin_1,....BLin_n coming from block [E] and/or [F], to be connected to perform filtering operation or a mathematical computing operation with optimization in hardware and provides a zero latency output. The invention also gives the area minimal realization of digital filters based on coefficient block [A], when operated in bit serial fashion. The optimization techniques and structure of the present invention are good for bit-serial digital filters typically a finite impulse response (FIR) filter, infinite impulse response filter (IIR) and for other filters and applications based on combinational logic consisting of delay element (T), multiplier (M), serial adder (SA) and serial subtractor (SS).

Description

AREA EFFICIENT REALIZATION OF COEFFICIENT ARCHITECTURE FOR BIT-SERIAL FIR, IIR FILTERS AND COMBINATIONAL/SEQUENTIAL LOGIC STRUCTURE WITH ZERO LATENCY CLOCK OUTPUT
FIELD OF INVENTION
The invention relates to area efficient realization of coefficient block [A] or architecture [A] with hardware sharing techniques and optimizations applied to this block. The block [A] is connected to coefficient lines CLin_0, CLin_1, ..., CLin_n and BLin_0, BLin_1, ..., BLin_n coming from block [E] and/or [F], to be connected to perform filtering operation or a mathematical computing operation with optimization in hardware and provides a zero latency output. The invention also gives the area minimal realization of digital filters based on coefficient block [A], when operated in bit serial fashion. The optimization techniques and structure of the present invention are good for bit-serial digital filters, typically a finite impulse response (FIR) filter, infinite impulse response (IIR) filter and for other filters and applications based on combinational logic consisting of delay element (T), multiplier (M), serial adder (SA) and serial subtractor (SS).
Brief description of the accompanying drawings
In the accompanying drawings:
Figure 1 shows the field of invention, applications of the device
Figure 2 shows the symbol of components used in the device.
Figure 3 shows the description of components used in the device.
Figure 4 shows the bit serial FIR filter implementations.
Figure 5 shows an example of FIR filter.
Figure 6 shows one of the known minimization techniques due to symmetry of coefficients.
Figure 7 shows the structure of prior/known implementation technique for coefficient block.
Figure 8 shows the generalized structure of prior/known implementation technique of coefficient block.
Figure 9 shows the minimization technique involved in FIR filter.
Figure 10 shows the generalized structure of the minimization technique involved in FIR filter.
Figure 11 shows the minimized structure of this example FIR filter, of the present invention.
Figure 12 shows the generalized optimized structure of the present invention.
Figure 13 shows the other advantage of the structure, i.e. getting the parallel output directly, of the present invention.
Details of Elements/symbol used in the description
The symbols of the basic components used in the design are shown in "Figure 2" of the drawings. In addition, the components and their usage in the device are explained in the text below and depicted in "Figure 3" and "Figure 4" of the drawings.
Unit delay (T)
It is a one-bit delay element. It also performs the function of a multiplier by a factor of 2.
For example, for the serial input frame 0101011 in binary (43 in integer representation), the output of this block is 01010110 in binary (86 in integer representation). This element is usually a flip-flop (D flip-flop, J-K flip-flop etc.).
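The doubling behaviour of the unit delay can be checked with a small software model. The sketch below is only illustrative: it assumes frames are sent least significant bit first and uses an 8-bit frame, neither of which is stated in the text, and the helper names (`to_lsb_first`, `from_lsb_first`, `unit_delay`) are hypothetical.

```python
def to_lsb_first(value, width):
    """Split an integer into a list of bits, least significant bit first (assumed bit order)."""
    return [(value >> i) & 1 for i in range(width)]


def from_lsb_first(bits):
    """Rebuild the integer from an LSB-first list of bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def unit_delay(bits, initial=0):
    """Model one flip-flop [T]: each clock it outputs the bit stored on the previous clock."""
    state = initial
    out = []
    for bit in bits:
        out.append(state)
        state = bit
    return out


frame = to_lsb_first(43, 8)           # 43 = 00101011
delayed = unit_delay(frame)           # every bit moves one position up in weight
print(from_lsb_first(delayed))        # 86
```

The frame must be wide enough that the top bit lost by the delay is zero; otherwise the result wraps.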
Full adder (FA)
It performs binary addition. The inputs to this element are A, B, Cin (Carryin) while the outputs are Z and Cout (Carry out). The truth table for full adder functionality is shown in "Figure 3" of the drawings.
Full subtractor (FS)
It performs binary subtraction. The inputs to this element are A, B, Cin (Carryin) while the outputs are Z and Cout (Carry out). The truth table for full subtractor functionality is shown in "Figure 3" of the drawings.
Serial adder (SA) and Serial Subtractor (SS)
It performs addition/subtraction of two serial frames, x1(nT) and x2(nT), to generate output y(nT) represented as x1(nT)+x2(nT) or x1(nT)-x2(nT). The serial adder (or subtractor) is implemented using a full adder (or subtractor) with a flip-flop as shown in "Figure 3" of the drawings. The output Cout of [FA/FS] is delayed using the [T] element and is applied to the Cin line of [FA/FS]. This enables the [FA/FS] and [T] together to function as a serial adder/subtractor (SA/SS), where A and B are the inputs to this element and Z is the output. As an example of serial addition, if x1(nT) = 0110 (6 in integer) and x2(nT) = 0111 (7 in integer), then y(nT) = 01101 (13 in integer representation).
Serial Multiplier (M)
It multiplies a serial input frame X(nT) by a coefficient m. The output is represented as Y(nT) = X(nT) * m. A serial coefficient multiplier (M) can be implemented by a shift register using [T] elements and adder elements [SA] (one shift means multiplication by a factor of 2). As shown in "Figure 3" of the drawings, the multiplier is formed by adding the shift register outputs corresponding to the ones in the binary representation of the coefficient.
Delay (Z^{-1})
Delay by one frame of data is done by a shift register (a series of flip-flops (T) connected to store and shift the input frame). The number of unit delays (T) in one delay element is equal to the frame size of the input.
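The serial adder and the shift-and-add multiplier described above can be modelled in a few lines. This is a behavioural sketch, not the circuit of "Figure 3": it assumes LSB-first frames that are wide enough to hold the result, and all function names are hypothetical.

```python
def to_lsb_first(value, width):
    """Integer to LSB-first bit list (assumed bit order)."""
    return [(value >> i) & 1 for i in range(width)]


def from_lsb_first(bits):
    """LSB-first bit list back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def unit_delay(bits):
    """One flip-flop [T]: output is the input delayed by one clock."""
    return [0] + list(bits[:-1])


def serial_add(a_bits, b_bits):
    """Full adder [FA] whose Cout is fed back to Cin through one flip-flop [T]."""
    carry = 0                                   # state of the carry flip-flop
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)               # sum output Z
        carry = (a & b) | (carry & (a ^ b))     # Cout, stored for the next clock
    return out


def serial_multiply(x_bits, coefficient):
    """Add the shift-register taps that match the ones of the coefficient."""
    acc = [0] * len(x_bits)
    tap = list(x_bits)                          # tap k carries x * 2**k
    while coefficient:
        if coefficient & 1:
            acc = serial_add(acc, tap)
        tap = unit_delay(tap)
        coefficient >>= 1
    return acc


total = serial_add(to_lsb_first(6, 5), to_lsb_first(7, 5))
print(from_lsb_first(total))                                    # 13
print(from_lsb_first(serial_multiply(to_lsb_first(6, 8), 11)))  # 66
```

A serial subtractor would follow the same pattern with a full subtractor and a borrow flip-flop in place of the carry flip-flop.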
PRIOR ART OR EXISTING IMPLEMENTATION OF FILTER
The following description discusses the elements used for implementation of the architecture and the existing implementations of digital filters. The proposed minimization is extendable to other applications in the digital signal processing field and in digital design. From here onwards, all illustrations use an FIR filter, and the results extend to the other filters described earlier. "Figure 4" shows the existing structure of a bit-serial FIR filter with coefficient lines CLin_0, CLin_1, ..., CLin_n and the coefficient block [A] having the coefficients c(0), c(1), c(2), ..., c(n). The coefficient block is connected to delay elements [Z^{-1}] and serial adders [SA] to form the filter structure. The FIR filter equation in the time and frequency domains is:
Y(n) = c(0) X(n) + c(1) X(n-1) + c(2) X(n-2) + ... + c(n) X(0)
Y(z) = X(z) [c(0) + c(1) Z^{-1} + c(2) Z^{-2} + c(3) Z^{-3} + c(4) Z^{-4} + c(5) Z^{-5} + c(6) Z^{-6} + ... + c(n) Z^{-n}]
where X and Y are the input and output respectively, c(0), c(1), ..., c(n) represent the coefficient values which define the characteristics of the filter, and each delay block [Z^{-1}] represents a delay of one sample. The filter equation can be implemented in two ways, as shown in "Figure 4" of the drawings.
In implementation 1, the coefficient lines CLin_0, CLin_1, ..., CLin_n are common and connected to the input X[n]. The output lines CLout_0, CLout_1, ..., CLout_n are connected to block [E], which consists of delay elements [Z^{-1}] and serial adders [SA]. This structure makes it easy to realize a share-able multiplier in the coefficient block [A]. An example of a share-able multiplier with coefficient values 3 and 11 is illustrated in "Figure 4". Realizing these coefficients separately would require 4[T] and 3[SA] elements. Because CLin_0, CLin_1, ... are common, the hardware is realized using 3[T] and 2[SA] elements. Another feature of the structure is that it inherently requires more storage area, represented by [Z^{-1}], than implementation 2, since the storage is done after the multiplication. For an input frame of n bits and a coefficient of size m bits, the storage area of each delay element [Z^{-1}] is (m+n). The total storage space of the delay elements is (m+n) * (number of coefficients - 1).
In implementation 2, the coefficient lines CLin_0, CLin_1, ... are not common. Because different input lines are connected to the coefficient elements [c(0), c(1), ...], the coefficient block [A] cannot be realized using share-able elements. Another feature of this structure is that it inherently requires less storage space, represented as [Z^{-1}]: unlike in the previous implementation, here the storage is done before multiplication. For an input frame of m bits and a coefficient of size n bits, the storage area of each delay element [Z^{-1}] is (m). The total storage space is (m) * (number of coefficients - 1).
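The element counts for the coefficients 3 and 11 and the storage formulas of the two implementations can be reproduced with a short calculation. The sketch below is an illustration under assumptions: it counts one [T] per shift up to the highest set bit and one [SA] per additional set bit, and the example sizes (16-bit input, 8-bit coefficient, 7 coefficients) are chosen arbitrarily. All function names are hypothetical.

```python
def separate_cost(coefficient):
    """(T, SA) count for one shift-and-add multiplier built on its own."""
    t_elements = coefficient.bit_length() - 1        # shifts up to the highest set bit
    sa_elements = bin(coefficient).count("1") - 1    # one adder per extra set bit
    return t_elements, sa_elements


def shared_3_and_11(x):
    """Both products from one 3-stage shift register and two adders."""
    x2 = x << 1        # T1
    x4 = x2 << 1       # T2 (only passes the data on to T3 here)
    x8 = x4 << 1       # T3
    p3 = x + x2        # SA1: 3x = x + 2x
    p11 = p3 + x8      # SA2: 11x = 3x + 8x, reusing the 3x sum
    return p3, p11


def delay_storage(input_bits, coeff_bits, num_coefficients):
    """Total flip-flops in the [Z^-1] delays for implementations 1 and 2."""
    impl1 = (input_bits + coeff_bits) * (num_coefficients - 1)  # stored after multiplication
    impl2 = input_bits * (num_coefficients - 1)                 # stored before multiplication
    return impl1, impl2


costs = [separate_cost(c) for c in (3, 11)]
print(sum(t for t, _ in costs), sum(sa for _, sa in costs))  # 4 3
print(shared_3_and_11(7))                                    # (21, 77)
print(delay_storage(16, 8, 7))                               # (144, 96)
```

The shared version needs 3[T] and 2[SA] because the 3x partial sum is reused for 11x, which is only possible when both multipliers see the same input line.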
The invention proposes to reduce the area of the coefficient block [A] and to share elements among coefficients, even if the coefficient lines CLin_0, CLin_1, ... are not commonly connected. In the existing configuration shown in "Figure 7" and "Figure 8", the share-ability of hardware in block [A] is limited.
Also, as described in the previous section, implementation 2 is area efficient with respect to implementation 1 due to the reduced size of the delay elements. Over and above this, with a share-able multiplier or reduced coefficient block [A], which are the key features of the invention, implementation 2 becomes still more area-efficient. This reduction is extendable to other filters based on coefficient block [A], as stated in the first section. The present invention operates on integer valued coefficients.
Further, to quote Norsworthy and Crochiere (Delta-Sigma Data Converters, IEEE Press, p. 435, copyright 1997):
"Bit-serial architecture reduce the interprocessor communication down to 1 bit. Generally the number of processors is very large, but because each processor is so small, the overall economy is very high. Bit serial architectures are usually most effective for filters having a few state variables, such as IIR filters and the wave-digital filters. For this reason, bit-serial techniques are less frequently applied to FIR structures, especially when the filter length is relatively long"
However, the present invention applies optimization techniques for reducing the area in large sized coefficients by applying a number of optimizations in FIR/IIR filter structures.
To elaborate the applicant's optimization techniques, consider an FIR filter with coefficients 5, 14, 25, 30, 25, 14 and 5. Though the coefficients in this example are small, they are enough to illustrate the minimization proposals. In most practical cases, the coefficients are symmetrical. The FIR filter equation in the time and frequency domains is:
Y(n) = c(0) X(n) + c(1) X(n-1) + c(2) X(n-2) + ... + c(n) X(0)
Y(z) = X(z) [c(0) + c(1) Z^{-1} + c(2) Z^{-2} + c(3) Z^{-3} + c(4) Z^{-4} + c(5) Z^{-5} + c(6) Z^{-6} + ... + c(n) Z^{-n}]
where X and Y are the input and output respectively and c(0), c(1), ..., c(n) represent the coefficient values.
Using the coefficient values in the above equation:
Y(n) = 5 X(n) + 14 X(n-1) + 25 X(n-2) + 30 X(n-3) + 25 X(n-4) + 14 X(n-5) + 5 X(n-6)
Y(z) = X(z) [5 + 14 Z^{-1} + 25 Z^{-2} + 30 Z^{-3} + 25 Z^{-4} + 14 Z^{-5} + 5 Z^{-6}] (EQ 1)
The Existing Method and Minimization
"Figure 5" of the drawings shows the FIR filter structure of implementation 2. The figure illustrates the realization of the FIR filter represented by "Equation 1". One known optimization technique takes advantage of the symmetry in the coefficients: the streams that have to be multiplied by the same coefficient can be added first and then multiplied. For a large filter structure, this leads to a reduction of 45% in the coefficient block (see "Figure 6" of the accompanying drawings).
This is done by restructuring the equation as under:
Y(z) = X(z) [5 (1 + Z^{-6}) + 14 (Z^{-1} + Z^{-5}) + 25 (Z^{-2} + Z^{-4}) + 30 Z^{-3}] (EQ 2)
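That the symmetric restructuring leaves the filter output unchanged can be checked numerically. The sketch below compares a direct evaluation of EQ 1 with the folded form of EQ 2 on ordinary integers. It assumes the delay line starts at zero, and the function names `fir_direct` and `fir_folded` are hypothetical.

```python
COEFFS = [5, 14, 25, 30, 25, 14, 5]   # example coefficients from the text


def fir_direct(x, coeffs=COEFFS):
    """EQ 1: one multiplication per coefficient."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:             # samples before the start are taken as zero
                acc += c * x[n - k]
        y.append(acc)
    return y


def fir_folded(x, coeffs=COEFFS):
    """EQ 2: add the two streams that share a coefficient, then multiply once."""
    taps = len(coeffs)
    half = taps // 2

    def sample(i):
        return x[i] if i >= 0 else 0

    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(half):
            acc += coeffs[k] * (sample(n - k) + sample(n - (taps - 1 - k)))
        if taps % 2:                   # unpaired centre tap (30 in the example)
            acc += coeffs[half] * sample(n - half)
        y.append(acc)
    return y


signal = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]
print(fir_direct(signal) == fir_folded(signal))   # True
```

For the seven-tap example the folded form uses four multiplications per output instead of seven, which is the source of the saving in the coefficient block.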
For the rest of the optimization proposals, the discussion covers only the multiplier-adder series shown in the dotted box, referred to as coefficient block [A]. "Figure 7" of the drawings shows the traditional implementation of the example structure for block [A], wherein S1 to S4 represent the lines connected to the delay blocks [Z^{-1}] through lines CLin_0 to CLin_6 depicted in "Figure 6" of the drawings. The lines S1 to S4 are separately connected to [T] elements to perform multiplication by a factor of 2, and [SA] elements perform serial addition of the data. This represents the multiplier-less realization of filter coefficient block [A], which uses the property of the flip-flop (T) as a multiplier by a factor of two.
Mathematically, the restructured equation according to the structure is stated as
$$Y(nT) = (4+1)\,S_1 + (8+4+2)\,S_2 + (16+8+1)\,S_3 + (16+8+4+2)\,S_4 \qquad \text{(EQ 3)}$$
In this implementation, the lines S1, S2, S3 and S4 are not commonly connected, which prevents any sharing of hardware in coefficient block [A]. Thus all the functions/operations of this block represent unique hardware. The elements required by the terms are listed as
First term = 2 [T], 1 [SA]
Second term = 3 [T], 2 [SA]
Third term = 4 [T], 2 [SA]
Fourth term = 4 [T], 3 [SA]
The final addition of all four terms requires 3 [SA].
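The element counts listed above can be reproduced from the binary form of each coefficient. The counting rule in this sketch (one [T] per shift up to the highest set bit, one [SA] per additional set bit) is an assumption inferred from the listed figures rather than a rule stated in the source, and `shift_add_cost` is a hypothetical name.

```python
def shift_add_cost(coeff):
    """Return (T count, SA count) for one positive coefficient built by shift-and-add."""
    n_t = coeff.bit_length() - 1      # delays needed to reach the highest power of two
    n_sa = bin(coeff).count("1") - 1  # one serial adder per extra non-zero bit
    return n_t, n_sa

coeffs = [5, 14, 25, 30]
per_term = [shift_add_cost(c) for c in coeffs]
total_t = sum(t for t, _ in per_term)
total_sa = sum(sa for _, sa in per_term) + (len(coeffs) - 1)  # plus the final summation
area_units = total_t * 1 + total_sa * 2  # 1 T = 1 unit, 1 SA = 2 units

print(per_term)                       # per-term (T, SA) pairs
print(total_t, total_sa, area_units)  # totals and area in units
```

With the unit costs assumed in the text (1 T = 1 unit, 1 SA = 2 units), the totals come to 13 [T] and 11 [SA], i.e. 35 units.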
The generalized structure of "The Existing Method and Minimization" is depicted in "Figure 8". In the structure, each column represents a coefficient value. The [T] elements shown as T1_1 to T1_m in column 1 define the connectivity with line S1. In similar fashion, the [T] elements shown as Tn_1 to Tn_m in column n define the connectivity with line Sn.
The presence of each of the elements in columns 1 to n (i.e. T1_1 to T1_m, T2_1 to T2_m, ..., Tn_1 to Tn_m) is determined by the coefficient value. Thus, depending on the coefficient value on lines S1 to Sn, the number of [T] elements in a column is determined. Similarly, the serial adders/subtractors [SA/SS] in the columns are represented as (SA1_1 to SA1_m, SA2_1 to SA2_m, ..., SAn_1 to SAn_m). The presence of each of these elements is again defined by the coefficient value.
In the structure, the [T] elements are arranged in shift-register form. The input to the first [T] element is connected to one of the S lines, while the inputs to [SA/SS] are connected from the inputs S1 to Sn and/or from one of the outputs of the [T] elements of the shift register, depending on the coefficient value. Finally, using elements SAe_1 to SAe_n-1, the addition/subtraction of the [SA/SS] outputs of all the coefficient terms depicted in the columns is done. The final output is the output of the last addition/subtraction [SA/SS].
Among the lines S1 to Sn, neither the [T] elements nor the [SA] elements in each column are share-able. Thus only limited minimization is possible in this structure.
Minimization (Already applied as patent)
This structure reduces the hardware of the coefficient block [A] by having share-able elements in the coefficients, even if the coefficient lines CLin_0, CLin_1, ... are not commonly connected. It reduces the area by approximately 30-50% relative to "Figure 7" of the drawings by reducing the number of components and by sharing components. Here the optimization techniques are illustrated with examples, and the end of this section depicts the generalized equation and structure of the device.
Continuing the same example of the FIR filter and using "Equation 3" of the previous section:
$$y(nT) = 5\,S_1 + 14\,S_2 + 25\,S_3 + 30\,S_4$$
$$Y(nT) = (4+1)\,S_1 + (8+4+2)\,S_2 + (16+8+1)\,S_3 + (16+8+4+2)\,S_4$$
The applicants proceed to share the shift registers (multiply by 2) of the design:
$$Y(nT) = 16\,(S_3+S_4) + 8\,(S_2+S_3+S_4) + 4\,(S_1+S_2+S_4) + 2\,(S_2+S_4) + (S_1+S_3)$$
$$Y(nT) = (S_1+S_3) + 2\,(S_2+S_4 + 2\,(S_1+S_2+S_4 + 2\,(S_2+S_3+S_4 + 2\,(S_3+S_4)))) \qquad \text{(EQ 4)}$$
Finding the common additive factors:
$$A_1 = S_2 + S_4$$
$$A_2 = S_3 + S_4$$
"Equation 4" can be further reduced as
$$y(nT) = (S_1+S_3) + 2\,(A_1 + 2\,(S_1 + A_1 + 2\,(S_2 + A_2 + 2\,A_2))) \qquad \text{(EQ 5)}$$
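The regrouping from "Equation 3" to "Equation 5" can be checked numerically. In this sketch the function names and the helper `bit_planes` are hypothetical; the helper simply lists, for each power of two, which inputs have that bit set in their coefficient.

```python
from itertools import product

def eq3(s1, s2, s3, s4):
    return 5 * s1 + 14 * s2 + 25 * s3 + 30 * s4

def eq4(s1, s2, s3, s4):
    return (s1 + s3) + 2 * (s2 + s4 + 2 * (s1 + s2 + s4 + 2 * (s2 + s3 + s4 + 2 * (s3 + s4))))

def eq5(s1, s2, s3, s4):
    a1 = s2 + s4  # shared term A1
    a2 = s3 + s4  # shared term A2
    return (s1 + s3) + 2 * (a1 + 2 * (s1 + a1 + 2 * (s2 + a2 + 2 * a2)))

def bit_planes(coeffs, n_bits):
    """planes[j] lists the 0-based input indices whose coefficient has bit j set."""
    return [[i for i, c in enumerate(coeffs) if (c >> j) & 1] for j in range(n_bits)]

# Compare the three forms over a small grid of input values.
same = all(eq3(*s) == eq4(*s) == eq5(*s) for s in product(range(-3, 4), repeat=4))
print(same)
print(bit_planes([5, 14, 25, 30], 5))
```

The bit planes, read from weight 1 up to weight 16, are the bracketed sums of "Equation 4": S1+S3, S2+S4, S1+S2+S4, S2+S3+S4 and S3+S4.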
The implementation flow for this equation is illustrated here, and the hardware implementation is shown in "Figure 9" and "Figure 10" of the drawings [SA(1), SA(2), etc. represent adders; T(1), T(2), etc. represent unit delays]. In the implementation flow, S1, S2, S3 and S4 represent the four inputs. The primary addition is done using serial adders SA(1), SA(3) and SA(9), representing addition of the terms S1+S3, S2+S4 and S3+S4. The secondary and tertiary additions are done using the adders SA(5), SA(7), SA(8), SA(6), SA(4) and SA(2). The multiplication by a factor of two is done using the elements T(1), T(2), T(3) and T(4).
Implementation flow of equation
[Figure: implementation flow of "Equation 5"]
The hardware implementation is shown in "Figure 9" of the drawings, wherein the input lines S1 to S4 represent the lines connected to the delay blocks [Z^-1] through coefficient lines CLin_0 to CLin_6 depicted in "Figure 6" of the drawings. The lines S1 to S4 are connected to blocks [B], which perform the serial addition/subtraction using (SA) and (SS) elements. The output of each block [B] is terminated with a [T] element, which multiplies the block [B] output by a "factor of 2". The output b_1 of the block [B] at bit position 0 is fed to the input of T(1), and in turn the output line t_1 of element [T(1)] is fed to the next section of block [B]. Thus each addition defines a bit position before being multiplied by 2 and moving to the next bit position. All [T] elements are represented by block [C]. In this structure, the flip-flop [T] representing multiplication by a "factor of 2" is pushed out so as to be shared between the various coefficient values, hence reducing the number of flip-flops (T).
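A bit-level sketch of this shared-delay structure is given below. It assumes unsigned inputs sent least significant bit first over a 16-clock word; these conventions, and all function names, are assumptions for illustration. Each call to `serial_add` stands for one [SA] and each call to `t_delay` for one [T], so the sketch uses nine serial adders and four delays, as in "Equation 5".

```python
WIDTH = 16  # clock cycles per word; wide enough for 8-bit unsigned inputs

def to_stream(value, width=WIDTH):
    """Unsigned integer -> list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def from_stream(bits):
    return sum(b << i for i, b in enumerate(bits))

def t_delay(bits):
    """[T]: one-clock delay, i.e. multiplication by 2 on an LSB-first stream."""
    return [0] + bits[:-1]

def serial_add(a, b):
    """[SA]: a full adder plus a flip-flop that holds the carry between clocks."""
    carry, out = 0, []
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)
        carry = (x & y) | (x & carry) | (y & carry)
    return out

def coefficient_block(s1, s2, s3, s4):
    s1, s2, s3, s4 = (to_stream(v) for v in (s1, s2, s3, s4))
    a1 = serial_add(s2, s4)                               # shared term A1
    a2 = serial_add(s3, s4)                               # shared term A2
    inner = serial_add(serial_add(s2, a2), t_delay(a2))   # S2 + A2 + 2*A2
    mid = serial_add(serial_add(s1, a1), t_delay(inner))  # S1 + A1 + 2*(...)
    outer = serial_add(a1, t_delay(mid))                  # A1 + 2*(...)
    return from_stream(serial_add(serial_add(s1, s3), t_delay(outer)))

print(coefficient_block(1, 2, 3, 4))  # 5*1 + 14*2 + 25*3 + 30*4 = 228
```

Because the four delays sit on the accumulated sum rather than on each input line, they are shared by all four coefficients.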
In the minimization of "Figure 9" of the drawings, the approximate area is 9 serial adders + 4 T = 22 units, whereas the area of "Figure 7" of the drawings is 11 serial adders + 13 T = 35 units (assuming 1 unit = 1 FA = 2 HA = 1 T, and serial adder = 2 units). This results in a 37% saving in area (13/35 * 100).
DETAILED DESCRIPTION OF THE INVENTION
Minimization Proposed in the present invention
The invention reduces the area of the coefficient block [A] by having share-able elements in the coefficients, even in implementations where the coefficient lines CLin_0, CLin_1, ... are not commonly connected (shown as architecture [A]). This coefficient block [A], when applied in implementation 2 ("Figure 4") of the FIR filter, makes it still more area-efficient. This reduction is extendable to other filters based on coefficient block [A], as stated in the first section. An area-efficient implementation of the filter coefficient block is done using full adder (FA) blocks instead of serial adders (SA). A serial adder consists of one full adder (FA) and one flip-flop (T) element (refer to "Figure 3" of the drawings). This makes a serial adder (SA) twice as expensive in area as one full adder (FA) block [area of serial adder (SA) = 1 FA + 1 T = 2 units, while the area of 1 FA = 1 unit]. In this implementation the reduction in area of the coefficient block [A] is achieved by maximising the use of full adders (FA), i.e. by replacing serial adders (SA) with full adders (FA) in block [A].
The above is achieved by providing a device for area efficient realization of coefficient, said device comprising architecture [A] with hardware sharing techniques and optimization applied to this architecture, the architecture [A] is connected to coefficient lines CLin_0, CLin_1, ..., CLin_n and/or BLin_0, BLin_1, ..., BLin_n coming from block [E] and/or [F], to be connected to perform filtering operation or a mathematical computing operation with optimization in hardware and provides a zero latency clock output, the serial input bit lines of said architecture [A] are S1, S2, ..., Sn [where n represents the number of coefficients of the filter], the addition terms of the equation [(a0*S1+b0*S2+...+k0*Sn), (a1*S1+b1*S2+...+k1*Sn), ..., (am*S1+bm*S2+...+km*Sn)] are represented as blocks [B], the said block [B] is a combinational block consisting of full adder (FA) and full subtractor (FS) elements, the values a0, b0, etc. are (+/-1 or 0), the connection of elements (FA/FS) to the S1, S2, ..., Sn lines and the interconnection of the elements (FA, FS) depend on the value of the coefficients, the final output of the last element [FA/FS] of each block [B] is terminated through lines b_1, b_2, ..., b_m at [T] elements, the number of T elements in cluster [C] depends on the size of the maximum coefficient value and is share-able for all the coefficients in the coefficient architecture [A], in the said architecture all the combinational elements [B] are clustered together as [D] and all the unit delay elements {T[1], T[2], ..., T[m]} are clustered together in [C], thereby separating the sequential and combinational logic, in the said architecture [A] the output of element [T] is connected to one of the inputs of the combinational logic of block [B] of the next bit position, the interconnections from cluster [C] to [B] are represented as t_1, t_2, ..., t_m, the elements [FA/FS] are arranged in matrix form FA0_0 to FA0_n in bit position 0, FA1_1 to FA1_n in bit position 1, ..., FAm_1 to FAm_n in bit position m, whose presence is defined by the coefficient value, the carry-out pin of the full adder (FA) of each cluster stage [B] in the said architecture [A] is fed to an input of the full adder (FA) of the previous stage cluster [B], i.e. the stage preceding the flip-flop (T) element of cluster [C], in this way the same flip-flop [T] (T1, T2, T3, ..., Tn) is used for multiplication by "a factor of two" and also in the implementation of the carry structure in the one-bit serial adder, in the said architecture some extra components represented as block [Ex] are used for connecting the carry-out of all the adders/subtractors [FA/FS] of the last stage of [D], the elements [FA/FS] and [T] are used within this block, and hence the said architecture [A] structures the circuit into a sequential block [C] consisting of [T] elements and a combinational block [D] consisting of [FA, FS] elements, while the [T] elements of block [C] are common for all the coefficients, are share-able and are positioned at the end position of each block [B], the block [D] has combinational element blocks [B] which are essentially [FA, FS], thereby making share-able hardware within block [D], and the final output is the output of the BITm position.
In the present device, preferably, the area-minimal realization of digital filters based on coefficient architecture [A] is achieved when it is operated in bit-serial fashion. The structure provides hardware minimization for finite impulse response (FIR) filters, infinite impulse response (IIR) filters, and for other filters and applications related to combinational logic consisting of delay elements (T), multipliers (M), adders and subtractors. Further optimization in cluster [D] is done by using common adders (FA) and common subtractors (FS) and sharing their outputs, by using a subtractor (FS) instead of adders when the coefficient value is close to a power of two, or by minimizing the use of subtractors by taking out a common subtraction operation and using adders instead.
The present device, when used in implementation 2 of FIR/IIR filters and similar filter structures, results in quite an area-efficient realization of the filter. The storage area in implementation 2, referred to as delay elements [Z^-1], is smaller than in implementation 1 due to an inherent property of the structure of implementation 2, and an additional saving in area in the filter coefficient realization is achieved by using the claimed structure of coefficient architecture [A] of "Figure 12".
In the implementation flow explained below, the carry-out (COUT) pin of the full adder (FA) of each stage is fed to the carry-in (CIN) input of a full adder (FA) of the previous stage, i.e. the stage preceding the flip-flop (T) element. In this way, the flip-flops (T1, T2, T3, T4) used for multiplication by two are used again as carry storage, enabling each [FA] to perform as a one-bit serial adder.
Rewriting the equation of the FIR filter for the example shown in the previous section:
$$y(nT) = (S_1+S_3) + 2\,(A_1 + 2\,(S_1 + A_1 + 2\,(S_2 + A_2 + 2\,A_2))) \qquad \text{(EQ 6)}$$
Using full adder (FA) components in "Equation 6", it is seen that the number of full adders used is the same as the number of one-bit serial adders used in the earlier architecture. In the proposed patent, depending on the carry-out of the BIT0 position, some half adders or some extra elements are present.
Implementation flow of equation
[Figure: implementation flow of "Equation 6"]
As shown in the above implementation flow, the equation defines the bit positions BIT0 to BIT4, each being a position of "multiplication by a power of two" (e.g. BIT0 represents multiplication by 2^0). At the BIT0 position, the addition S3+S4 is performed and the output is terminated at T(1). The output of T(1) defines the next bit position BIT1, which performs the addition S2+S3+S4 using [FA] elements together with the output of T(1). The output of this addition is again terminated at T(2). The structure is repeated in the next bit positions. The carry-outs of the [FA]s are fed to the previous bit position. The final addition at bit position BIT4 gives the output of the coefficient block [A].
The implemented structure is shown in "Figure 11", wherein the input lines S1 to S4 represent the lines connected to the delay blocks [Z^-1] through coefficient lines CLin_0 to CLin_6 depicted in "Figure 6" of the drawings. The lines S1 to S4 are connected to blocks [B], which perform the serial addition/subtraction using (FA) and (FS) elements. All [T] elements are represented by block [C].
The adders at all the bit positions [B], represented by FA(1), HA(2), ..., FA(10), are clustered in [D]. The inputs of the adders [FA] are connected from coefficient lines S1, S2, S3, S4 and from the unit delay element of the previous bit position. The addition/subtraction is performed in block [B], and the final output of the last adder [FA] is connected to a [T] element, which is used for "multiplication by a factor of 2". The interconnections from the [B] blocks to the [T] block are represented as b_1, b_2, b_3, b_4. The outputs of [T] are connected to one of the inputs of the combinational logic of block [B] of the next bit position (i.e. connected to the input of the first element (FA) of block [B]). These interconnections from cluster [C] to [B] are represented as t_1, t_2, t_3, t_4, and the bit positions are marked as BIT0, BIT1, BIT2, BIT3, BIT4. As an example of the connectivity, the output b_1 of the block [B] at bit position BIT0 is fed to the input of T(1), and in turn the output line t_1 of element [T(1)] is fed to the next section of block [B]. Thus each addition defines a bit position before being "multiplied by a factor of 2" and moving to the next bit position.
The connection of the COUT (carry-out) pins of all the [FA]s of one stage is as follows. The carry-out (COUT) pin of each full adder (FA) of a cluster stage [B] is fed to one of the inputs of a full adder (FA) of the previous stage cluster [B], i.e. the stage preceding the flip-flop [T] element of cluster [C], thus utilizing the [T] element of that bit position again. This enables all the [FA] elements in that bit position to use the [T] element for carry storage during the serial addition operation.
In the invention, the flip-flop [T] is used for a dual purpose:
1) Multiplication of the output of block [B] by a factor of two, used by all coefficients.
2) Use of the same [T] elements in common by block [B], together with [FA], to enable it to perform as a serial adder (SA).
For example, at bit position 3, HA(2) performs the addition of the data on lines S2 and S4. Its output Z represents the shared adder term A1 and is fed to FA(3) and FA(5). The output Z of FA(3), which defines bit position 3, is terminated at the [T(4)] element. The Cout pins at this bit position are connected to the Cin of an adder [FA(5)] in the previous bit location, hence utilizing the [T(4)] element to enable all the FAs at this location to work as [SA]. The structure of [SA] is essentially an [FA] along with a [T] element connecting the Cout of the FA to its Cin pin.
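The carry re-use described here can be illustrated with a behavioral sketch. This is not the gate-level wiring of "Figure 11": each block [B] is abstracted as a count of the ones on its inputs, whose low bit is the block output and whose remaining value stands for the carry-out lines handed to the preceding bit position. Unsigned LSB-first inputs, the 16-clock word and the name `carry_shared_block` are assumptions made for illustration.

```python
def to_stream(value, width):
    """Unsigned integer -> list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def carry_shared_block(planes, inputs, width=16):
    """planes[j] lists the input indices summed at bit position BITj."""
    m = len(planes)
    streams = [to_stream(v, width) for v in inputs]
    t_out = [0] * m  # t_out[j]: output of the [T] element feeding BITj (index 0 unused)
    ex_carry = 0     # carries of BIT0, held in the flip-flops of the [Ex] block
    result = 0
    for clk in range(width):
        sum_bit = [0] * m
        carries = 0  # carry-outs handed from the following bit position to this one
        for j in range(m - 1, -1, -1):
            total = sum(streams[i][clk] for i in planes[j]) + carries
            total += t_out[j] if j > 0 else ex_carry
            sum_bit[j] = total & 1  # goes on to the [T] element, or to the output
            carries = total >> 1    # each carry-out has twice the weight of the sum bit
        ex_carry = carries          # BIT0 has no earlier stage, so [Ex] stores its carries
        for j in range(1, m):
            t_out[j] = sum_bit[j - 1]  # clock edge: [T] elements capture block outputs
        result |= sum_bit[m - 1] << clk
    return result

# BIT0..BIT4 of the example: S3+S4, S2+S3+S4, S1+S2+S4, S2+S4, S1+S3 (0-based indices)
PLANES = [[2, 3], [1, 2, 3], [0, 1, 3], [1, 3], [0, 2]]
print(carry_shared_block(PLANES, (1, 2, 3, 4)))  # 5*1 + 14*2 + 25*3 + 30*4 = 228
```

A carry produced at one bit position has twice the weight of that position's sum bit, which is the weight the preceding position acquires after passing through its [T] element. This is why the model needs no separate carry flip-flop per adder, except for the carries of BIT0.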
In this implementation, there are some extra elements such as FA(11), Te(1) and Te(2), which are required to terminate the carry-outs (Cout) at bit position 0. The number of [FA] elements is equal to (number of Cout lines - 1) in bit position 0 and the number of [T] elements is equal to (number of Cout lines) in bit position 0. The extra elements are represented as the [Ex] block.
The circuit is structured into a sequential block [C] consisting of [T] elements and a combinational block [D] consisting of FA and FS elements.
a) Block [C], having the sequential elements, is common for all the coefficients and has share-able elements [T] positioned at the end position of each block [B].
b) Block [D] has combinational blocks [B], which are essentially FA and FS elements. The hardware is share-able not only within a block [B] but also across the various [B] blocks. Hence the hardware within block [D] is minimized.
The minimization in block [D] is achieved by using the following minimization techniques:
1) Sharing of common adder terms, i.e. utilizing the common adder multiple times.
2) Using subtraction instead of addition when the coefficient is close to a power of 2, e.g. 63 is better realized as (64-1) than as (32+16+8+4+2+1). In the former case 1 subtractor is needed, as compared to 5 adders in the latter case.
3) Taking out common subtraction operations and maximizing the use of adders, because subtraction is expensive as compared to addition.
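Technique 2 can be illustrated with a signed-digit recoding. The sketch below uses the non-adjacent form (NAF), a standard recoding chosen here only for illustration; the source gives just the 63 = 64-1 example and does not name a recoding method. The function names are hypothetical.

```python
def signed_digits(value):
    """Non-adjacent form of a positive integer: digits in {-1, 0, +1}, LSB first."""
    digits = []
    while value > 0:
        if value & 1:
            d = 2 - (value & 3)  # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits

def binary_adders(value):
    """Adders needed when every set bit of the plain binary form is summed."""
    return bin(value).count("1") - 1

def signed_digit_ops(value):
    """Adders or subtractors needed when the signed-digit form is used."""
    return sum(1 for d in signed_digits(value) if d != 0) - 1

print(signed_digits(63))                        # digits for 64 - 1, LSB first
print(binary_adders(63), signed_digit_ops(63))  # 5 adders versus 1 subtractor
```

For coefficients with few consecutive ones, such as 5, the recoding leaves the binary form unchanged, so the saving applies only to values close to a power of two.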
In the present minimization of "Figure 11", the approximate area is [9 FA + 2 HA + 6 T = 16 units]. As seen above, the area under section "The Existing Method and Minimization" ("Figure 7") is 35 units, and the area under section "Minimization (Already applied as patent)" ("Figure 9") is 22 units. The current minimization is an improvement of 54% {(35-16)/35} and 27% {(22-16)/22} in the coefficient block over these two structures respectively (assuming 1 unit = 1 FA = 2 HA = 1 T, and serial adder = 2 units).
GENERALIZED STRUCTURE OF THE INVENTION
The invention provides an area-efficient realization of filter coefficient block [A], applicable to filter devices such as FIR, IIR and other filter structures based on this block. This architecture is also applicable to combinational and sequential logic consisting of adders, subtractors, multipliers and flip-flops [T]. It is realized using full adder (FA), full subtractor (FS) and flip-flop [T] elements.
Beginning with the generalized equation of the FIR filter coefficient block [A]:
$$y(nT) = a\,S_1 + b\,S_2 + c\,S_3 + \dots + k\,S_n \qquad (1)$$
where a, b, ..., k represent the filter coefficients and S1, S2, ..., Sn represent the bit lines corresponding to the coefficients.
Now, representing each coefficient as a sum of terms arranged in powers of two and applying it to the equation:
$$y(nT) = (2^m a_m + \dots + 2^1 a_1 + 2^0 a_0)\,S_1 + (2^m b_m + \dots + 2^1 b_1 + 2^0 b_0)\,S_2 + (2^m c_m + \dots + 2^1 c_1 + 2^0 c_0)\,S_3 + \dots + (2^m k_m + \dots + 2^1 k_1 + 2^0 k_0)\,S_n$$
Further, taking "2" as a common factor, we get the generalized equation for the architecture under claim as
$$Y(nT) = (a_0 S_1 + b_0 S_2 + \dots + k_0 S_n) + 2\,\bigl((a_1 S_1 + b_1 S_2 + \dots + k_1 S_n) + 2\,\bigl((a_2 S_1 + b_2 S_2 + \dots + k_2 S_n) + 2\,\bigl((a_3 S_1 + b_3 S_2 + \dots + k_3 S_n) + \dots + 2\,(a_m S_1 + b_m S_2 + \dots + k_m S_n)\bigr)\bigr)\bigr)$$
where a0, a1, ..., am, b0, b1, ..., bm and k0, k1, ..., km represent the sign of the coefficients [i.e. they have the value (+/-)1 or 0]. The architecture realization in "Figure 12" is done using sequential elements such as unit delays [T] and combinational elements such as full adders (FA) and full subtractors (FS).
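The nested form can also be restated compactly as a recursion. The symbols P_j and H_j below are shorthand introduced here for illustration only; they do not appear in the source.

```latex
P_j = a_j S_1 + b_j S_2 + \dots + k_j S_n,
\qquad a_j, b_j, \dots, k_j \in \{-1, 0, +1\}, \quad j = 0, 1, \dots, m

H_m = P_m, \qquad H_j = P_j + 2\,H_{j+1} \quad (j = m-1, \dots, 0),
\qquad Y(nT) = H_0 = \sum_{j=0}^{m} 2^{j} P_j
```

For the example coefficients 5, 14, 25 and 30 (m = 4), this gives P_0 = S1+S3, P_1 = S2+S4, P_2 = S1+S2+S4, P_3 = S2+S3+S4 and P_4 = S3+S4, which reproduces "Equation 4". Under this reading, each P_j corresponds to one block [B] and each multiplication by 2 in the recursion to one [T] element.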
In "Figure 12", the input data is present on bit lines S1, S2, ..., Sn [where n represents the number of coefficients of the filter]. The addition terms of the equation [(a0*S1+b0*S2+...+k0*Sn), (a1*S1+b1*S2+...+k1*Sn), ..., (am*S1+bm*S2+...+km*Sn)] are represented as blocks [B]. Block [B] is a combinational block consisting of full adder (FA) and full subtractor (FS) elements. Since the values a0, b0, etc. take the value (+/-)1 or 0, the connection of the elements (FA/FS) to the S1, S2, ..., Sn lines and the interconnection of the elements (FA, FS) depend on the value of the coefficients [this is because the value of a coefficient determines the values of a0, a1, etc. and hence defines the interconnections between them].
All the addition/subtraction operations at a bit location are performed in block [B], and the output of each block [B] is terminated at a [T] element, which is essentially used to multiply the block [B] output by "a factor of two" and pass the output to the next bit position {the elements T[1], T[2], ..., T[m] are used for this}. The connections (b_1, b_2, ..., b_m) are used for termination of the outputs of blocks [B]. The bit positions of the serial data frame are marked as BIT0, BIT1, ..., BITm. The number of [T] elements depends on the size of the maximum coefficient and is share-able for all the coefficients in the coefficient block [A]. Also, all the elements [B] are clustered together as [D] and all the unit delay elements {T[1], T[2], ..., T[m]} are clustered together in [C], thus separating the sequential and combinational logic. The input of the unit delay element [T] is the final output of block [B], and the output of element [T] is connected to one of the inputs of the combinational logic of block [B] of the next bit position (i.e. connected to the input of the first element (FA or FS) of block [B], depending upon the sign value +/-). The interconnections from cluster [C] to [B] are represented as t_1, t_2, ..., t_m.
Thus, the [T] elements clustered as [C] are share-able for all the coefficients, and the full adder/subtractor (FA/FS) components are clustered as [D]. The carry-out pin of the full adder (FA) of each cluster stage [B] is fed to an input of a full adder (FA) of the previous stage cluster [B], i.e. the stage preceding the flip-flop (T) element of cluster [C]. In this way, the same flip-flop [T] that is used for multiplication by a factor of two (T1, T2, T3, ..., Tn) is shared with the implementation of the carry structure of the one-bit serial adder.
Extra components represented as the [Ex] block are used for connecting the carry-outs of all the adders/subtractors (FA/FS) of the last stage of [D], i.e. bit position BIT0. Full adders/full subtractors [FA/FS] and unit delays [T] are used in this block. The COUT (carry-out) lines of bit position BIT0 are connected to the [Ex] block (typically to inputs of elements such as [FA] or [FS]). Using a [T] element, the carry-out (COUT) of each [FA/FS] is then fed to the CIN of the same element. For connecting the Z output of each [FA] to the input A or B of the next [FA] element, a binary tree can be formed. The numbers of [FA] and [T] elements in the [Ex] block are [number of carry-out pins - 1] and [number of carry-out pins] respectively.
In the invention, hardware optimization in both clusters [C] and [D] is achieved through the reduced number of unit delays [T] and the reduced adder/subtractor area (FA, FS). The gain in hardware is explained below.
Hardware reduction in block [C]
For filters having large coefficients, this structure reduces the area of coefficient block [A] by 50-75%.
Before proving this statement, the element counts are formulated for 1) the number of flip-flops (T) and 2) the number of serial adders (SA) or full adders (FA).
This comparison is done here. The generalized structure of "The Existing Method and Minimization" is illustrated in "Figure 8". The other structures for comparison are "Minimization (Already applied as patent)" in "Figure 10" and "Generalized structure of the invention" in "Figure 12" of the drawings.
1) The number of flip-flop [T] elements in the coefficient block depends on the size of all the coefficients. The approximate and pessimistic formula for the total number of flip-flops (T) in the coefficient block in "The Existing Method and Minimization" ("Figure 8") is [= average size of coefficient * number of coefficients], where the average size of a coefficient is calculated pessimistically as (maximum coefficient size / 2). In "Minimization (Already applied as patent)" and "Proposed Minimization (Proposal for Patent)", by contrast, the number of [T] elements is [= maximum size of coefficient], since the flip-flops (T) are share-able there.
2) The approximate formula for the total number of adders (SA) in the coefficient block for the above-mentioned cases is [= adders per coefficient * number of coefficients]. The number of adders per coefficient depends solely on the value of the coefficient. Assuming no optimization, in the worst case the number of adders per coefficient is (maximum coefficient size / 2), so the total is (= number of coefficients * maximum coefficient size / 2).
Now apply these formulas to an example filter having 20 coefficients, in which the largest coefficient occupies 16 bits (e.g. the maximum coefficient value is +32767 or -32768 in 2's complement representation). In the present example, the average size of a coefficient approximated by the formula is 8 bits. For "The Existing Method and Minimization", the total number of flip-flops (T) required for implementation is 8 * 20 = 160. In contrast, "Minimization (Already applied as patent)" and "Proposed Minimization (Proposal for Patent)" would require only 16 flip-flops (the flip-flops of all the coefficients are share-able and their number is limited by the coefficient which has the maximum value). Using the formula for the adder calculation, the number of adders in all three cases is 8 * 20 = 160 (approx.).
The area for "The Existing Method and Minimization" is 160 T + 160 SA = 480 units. The area for "Minimization (Already applied as patent)" is 16 T + 160 SA = 336 units. The area for "Proposed Minimization (Proposal for Patent)" is 16 T + 160 FA + (extra elements 8 T + 7 FA) = 191 units. [This assumes that the average number of full adders per bit position is 8. The calculation of the number of extra elements generalizes as follows: these extra elements are needed to terminate the carry-outs of the last (LSB) position, so if the average number of FAs is 8, the extra elements (7 FA, 8 T) are needed to terminate the carry-outs of the LSB position. This is shown in "Figure 12".] Thus the current proposal has an area improvement of approximately 60% {(480-191)/480} in the coefficient block over "The Existing Method and Minimization".
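The three area estimates can be reproduced with a short calculation. The function name and parameter names are hypothetical, and the formulas are the pessimistic approximations of this section (1 T = 1 FA = 1 unit, 1 SA = 2 units), not exact gate counts.

```python
def coefficient_block_area(n_coeffs, max_bits, fa_per_position):
    """Return (existing, earlier, proposed) area estimates in units."""
    avg_bits = max_bits // 2      # pessimistic average coefficient size
    adders = avg_bits * n_coeffs  # same adder count assumed in all three cases
    existing = avg_bits * n_coeffs * 1 + adders * 2   # unshared T, serial adders
    earlier = max_bits * 1 + adders * 2               # shared T, serial adders
    extra = fa_per_position + (fa_per_position - 1)   # [Ex]: one T per carry-out, FA = carry-outs - 1
    proposed = max_bits * 1 + adders * 1 + extra      # shared T, full adders, [Ex]
    return existing, earlier, proposed

existing, earlier, proposed = coefficient_block_area(20, 16, 8)
saving = 100.0 * (existing - proposed) / existing
print(existing, earlier, proposed, round(saving, 1))
```

For 20 coefficients of up to 16 bits, with an assumed average of 8 full adders per bit position, this returns 480, 336 and 191 units, a saving of about 60% over the existing method.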
Hardware reduction in block [D]
For hardware reduction in block [D], the following minimizations are applied:
1) Sharing of common adder terms and using them in block [D].
2) Using subtraction instead of addition when the coefficient is close to a power of 2, e.g. 63 is better realized as (64-1) than as (32+16+8+4+2+1).
3) Taking out common subtraction operations and maximizing the use of adders.
For the approximate area calculation, the following assumption is made (1 unit of area = 1 FA = 2 HA = 1 T, and SA = SS = 2 units of area).
Advantages involved in the present invention
The area is reduced by 50-75% (of the coefficient block [A]) for big filter structures if all three optimization steps discussed in the previous section "Hardware reduction in block [D]" are applied. The last proposed architecture ("Figure 12") is a proper Mealy type machine. Often the output has to be converted back to parallel data format; in that case, the outputs from the same shift registers can be used ("Figure 13"). One-bit serial multipliers could still be multiplexed in the proposed architecture if the specifications permit (i.e. if the frequency of operation is not very high).

Claims

1. A device for area efficient realization of coefficient, said device comprising architecture [A] with hardware sharing techniques and optimization applied to this architecture, the architecture [A] is connected to coefficient lines CLin_0, CLin_1, ..., CLin_n and/or BLin_0, BLin_1, ..., BLin_n coming from block [E] and/or [F], to be connected to perform filtering operation or a mathematical computing operation with optimization in hardware and provides a zero latency clock output, the serial input bit lines of said architecture [A] are S1, S2, ..., Sn [where n represents the number of coefficients of the filter], the addition terms of the equation [(a0*S1+b0*S2+...+k0*Sn), (a1*S1+b1*S2+...+k1*Sn), ..., (am*S1+bm*S2+...+km*Sn)] are represented as blocks [B], the said block [B] is a combinational block consisting of full adder (FA) and full subtractor (FS) elements, the values a0, b0, etc. are (+/-1 or 0), the connection of elements (FA/FS) to the S1, S2, ..., Sn lines and the interconnection of the elements (FA, FS) depend on the value of the coefficients, the final output of the last element [FA/FS] of each block [B] is terminated through lines b_1, b_2, ..., b_m at [T] elements, the number of T elements in cluster [C] depends on the size of the maximum coefficient value and is share-able for all the coefficients in the coefficient architecture [A], in the said architecture all the combinational elements [B] are clustered together as [D] and all the unit delay elements {T[1], T[2], ..., T[m]} are clustered together in [C], thereby separating the sequential and combinational logic, in the said architecture [A] the output of element [T] is connected to one of the inputs of the combinational logic of block [B] of the next bit position, the interconnections from cluster [C] to [B] are represented as t_1, t_2, ..., t_m, the elements [FA/FS] are arranged in matrix form FA0_0 to FA0_n in bit position 0, FA1_1 to FA1_n in bit position 1, ..., FAm_1 to FAm_n in bit position m, whose presence is defined by the coefficient value, the carry-out pin of the full adder (FA) of each cluster stage [B] in the said architecture [A] is fed to an input of the full adder (FA) of the previous stage cluster [B], i.e. the stage preceding the flip-flop (T) element of cluster [C], in this way the same flip-flop [T] (T1, T2, T3, ..., Tn) is used for multiplication by "a factor of two" and also in the implementation of the carry structure in the one-bit serial adder, in the said architecture some extra components represented as block [Ex] are used for connecting the carry-out of all the adders/subtractors [FA/FS] of the last stage of [D], the elements [FA/FS] and [T] are used within this block, and hence the said architecture [A] structures the circuit into a sequential block [C] consisting of [T] elements and a combinational block [D] consisting of [FA, FS] elements, while the [T] elements of block [C] are common for all the coefficients, are share-able and are positioned at the end position of each block [B], the block [D] has combinational element blocks [B] which are essentially [FA, FS], thereby making share-able hardware within block [D], and the final output is the output of the BITm position.
2. The device as claimed in claim 1, wherein the device provides the area-minimal realization of digital filters based on coefficient architecture [A] when operated in bit-serial fashion, and provides hardware minimization for finite impulse response (FIR) filters, infinite impulse response (IIR) filters and for other filters and applications related to combinational logic consisting of delay elements (T), multipliers (M), adders and subtractors.
3. The device as claimed in claim 1, wherein further optimization in cluster [D] is done by using common adders (FA) and common subtractors (FS) and using these shared outputs.
4. The device as claimed in claim 1, wherein further optimization in cluster [D] is done by using a subtractor (FS) instead of adders when the coefficient value is close to a power of two.
5. The device as claimed in claim 1, wherein further optimization in cluster [D] is done by minimizing the use of subtractors by taking a common subtraction operation and using adders instead.
6. The device as claimed in one of the previous claims (1-5), which, when used in implementation 2 of an FIR/IIR filter and similar filter structures, results in quite an area-efficient realization of the filter, wherein the storage area in implementation 2, referred to as delay elements [Z^-1], is smaller as compared to implementation 1 due to an inherent property of the structure of implementation 2, and an additional saving in area in the filter coefficient realization is achieved by using the claimed structure of coefficient architecture [A] of "Figure 12".
PCT/SG1998/000082 1998-10-13 1998-10-13 Area efficient realization of coefficient architecture for bit-serial fir, iir filters and combinational/sequential logic structure with zero latency clock output WO2000022729A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
PCT/SG1998/000082 WO2000022729A1 (en) 1998-10-13 1998-10-13 Area efficient realization of coefficient architecture for bit-serial fir, iir filters and combinational/sequential logic structure with zero latency clock output
SG1998004194A SG73567A1 (en) 1998-10-13 1998-10-13 Area efficient realization of coefficient architecture for bit-serial fir iir filters and combinational/sequential logic structure with zero latency clock output
EP98950602A EP1119910B1 (en) 1998-10-13 1998-10-13 Area efficient realization of coefficient architecture for bit-serial fir, iir filters and combinational/sequential logic structure with zero latency clock output
DE69821145T DE69821145T2 (en) 1998-10-13 1998-10-13 AREA EFFICIENT MANUFACTURE OF COEFFICIENT ARCHITECTURE FOR BIT SERIAL FIR, IIR FILTERS AND COMBINATORIAL / SEQUENTIAL LOGICAL STRUCTURE WITHOUT LATENCY
US09/807,500 US7007053B1 (en) 1998-10-13 1998-10-13 Area efficient realization of coefficient architecture for bit-serial FIR, IIR filters and combinational/sequential logic structure with zero latency clock output

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG1998/000082 WO2000022729A1 (en) 1998-10-13 1998-10-13 Area efficient realization of coefficient architecture for bit-serial fir, iir filters and combinational/sequential logic structure with zero latency clock output

Publications (1)

Publication Number Publication Date
WO2000022729A1 true WO2000022729A1 (en) 2000-04-20

Family

ID=20429882

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG1998/000082 WO2000022729A1 (en) 1998-10-13 1998-10-13 Area efficient realization of coefficient architecture for bit-serial fir, iir filters and combinational/sequential logic structure with zero latency clock output

Country Status (5)

Country Link
US (1) US7007053B1 (en)
EP (1) EP1119910B1 (en)
DE (1) DE69821145T2 (en)
SG (1) SG73567A1 (en)
WO (1) WO2000022729A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69821144T2 (en) * 1998-10-13 2004-09-02 Stmicroelectronics Pte Ltd. AREA EFFICIENT MANUFACTURE OF COEFFICIENT ARCHITECTURE FOR BIT SERIAL FIR, IIR FILTERS AND COMBINATORIAL / SEQUENTIAL LOGICAL STRUCTURE WITHOUT LATENCY
US7292630B2 (en) * 2003-04-17 2007-11-06 Texas Instruments Incorporated Limit-cycle-free FIR/IIR halfband digital filter with shared registers for high-speed sigma-delta A/D and D/A converters
US7187312B2 (en) * 2004-01-16 2007-03-06 Cirrus Logic, Inc. Look-ahead delta sigma modulator having an infinite impulse response filter with multiple look-ahead outputs
TWI353724B (en) * 2008-07-31 2011-12-01 Ralink Technology Corp Transversal filter
US10665222B2 (en) * 2018-06-28 2020-05-26 Intel Corporation Method and system of temporal-domain feature extraction for automatic speech recognition

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1994023493A1 (en) * 1993-04-05 1994-10-13 Saramaeki Tapio Method and arrangement in a transposed digital fir filter for multiplying a binary input signal with tap coefficients and a method for disigning a transposed digital filter

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61114338A (en) * 1984-11-09 1986-06-02 Hitachi Ltd Multiplier
US4982354A (en) * 1987-05-28 1991-01-01 Mitsubishi Denki Kabushiki Kaisha Digital finite impulse response filter and method
FR2665275B1 (en) * 1990-07-27 1992-11-13 France Etat CELLULAR MULTIPLIER IN REVERSE GRADIN TYPE TREE AND ITS MANUFACTURING METHOD.
US5262972A (en) * 1991-07-17 1993-11-16 Hughes Missile Systems Company Multichannel digital filter apparatus and method
GB9511568D0 (en) * 1995-06-07 1995-08-02 Discovision Ass Signal processing apparatus and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAWOOD ALAM ET AL: "VLSI IMPLEMENTATION OF A NEW BIT-LEVEL PIPELINED ARCHITECTURE FOR 2-D ALLPASS DIGITAL FILTERS", 1995 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), SEATTLE, APR. 30 - MAY 3, 1995, vol. 1, 30 April 1995 (1995-04-30), INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, pages 724 - 727, XP000583315 *
K. MANIVANNAN ET AL.: "Minimal Multiplier Realization of 2-D All-Pass Digital Filters", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS., vol. 35, no. 4, April 1988 (1988-04-01), NEW YORK US, pages 480 - 484, XP002104516 *

Also Published As

Publication number Publication date
EP1119910B1 (en) 2004-01-14
EP1119910A1 (en) 2001-08-01
US7007053B1 (en) 2006-02-28
DE69821145T2 (en) 2004-09-02
SG73567A1 (en) 2000-06-20
DE69821145D1 (en) 2004-02-19

Similar Documents

Publication Publication Date Title
Hartley Subexpression sharing in filters using canonic signed digit multipliers
US6131105A (en) Calculation of a scalar product in a direct-type FIR filter
JP2004516707A (en) FIR decimation filter and method thereof
EP0693236B1 (en) Method and arrangement in a transposed digital fir filter for multiplying a binary input signal with tap coefficients and a method for designing a transposed digital filter
EP1105967B1 (en) Multiplierless digital filtering
CN113556101B (en) IIR filter and data processing method thereof
US7007053B1 (en) Area efficient realization of coefficient architecture for bit-serial FIR, IIR filters and combinational/sequential logic structure with zero latency clock output
EP1119909B1 (en) Area efficient realization of coefficient architecture for bit-serial fir, iir filters and combinational/sequential logic structure with zero latency clock output
JP3139137B2 (en) Digital signal processing circuit that performs filter operation of digital filter processing
JPH10509011A (en) Improved digital filter
Ohlsson et al. Arithmetic transformations for increased maximal sample rate of bit-parallel bireciprocal lattice wave digital filters
Saini et al. Area Optimization of FIR Filter and its Implementation on FPGA
KR0140805B1 (en) Bit-serial operation unit
Jones Efficient computation of time-varying and adaptive filters
Anderson et al. A coarse-grained FPGA architecture for high-performance FIR filtering
Ghanekar et al. A class of high-precision multiplier-free FIR filter realizations with periodically time-varying coefficients
US6944217B1 (en) Interleaved finite impulse response filter
Singh et al. A wave digital filter three-port adaptor with fine grained pipelining
KR0154792B1 (en) Differentiater using the bit serial method
JPS61213926A (en) Dsp arithmetic processing system
JPH0666638B2 (en) Digital Filter
JPH0624310B2 (en) Digital Filter
Langlois Design and Implementation of High Sampling Rate Programmable FIR Filters in FPGAs
Young et al. Area-efficient VLSI implementation of digital filters via multiple product intercoding
JPS59198020A (en) Digital signal processor

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP SG US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 1998950602

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 09807500

Country of ref document: US

WWP Wipo information: published in national office

Ref document number: 1998950602

Country of ref document: EP

WWG Wipo information: grant in national office

Ref document number: 1998950602

Country of ref document: EP