EP0705459A1 - An ultrafast adder arrangement - Google Patents

An ultrafast adder arrangement

Info

Publication number
EP0705459A1
EP0705459A1 EP94919952A EP94919952A EP0705459A1 EP 0705459 A1 EP0705459 A1 EP 0705459A1 EP 94919952 A EP94919952 A EP 94919952A EP 94919952 A EP94919952 A EP 94919952A EP 0705459 A1 EP0705459 A1 EP 0705459A1
Authority
EP
European Patent Office
Prior art keywords
dblc
tree
adder
bit
cells
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP94919952A
Other languages
German (de)
French (fr)
Inventor
Jiren Yuan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Universitet I Linkoping
Original Assignee
Universitet I Linkoping
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universitet I Linkoping
Publication of EP0705459A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G06F7/505 Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination
    • G06F7/506 Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination with simultaneous carry generation for, or propagation over, two or more stages
    • G06F7/508 Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination with simultaneous carry generation for, or propagation over, two or more stages using carry look-ahead circuits
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G06F7/505 Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination
    • G06F7/506 Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination with simultaneous carry generation for, or propagation over, two or more stages
    • G06F7/507 Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination with simultaneous carry generation for, or propagation over, two or more stages using selection between two conditionally calculated carry or sum values
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03K PULSE TECHNIQUE
    • H03K19/00 Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits
    • H03K19/02 Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits using specified components
    • H03K19/08 Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits using specified components using semiconductor devices
    • H03K19/094 Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits using specified components using semiconductor devices using field-effect transistors
    • H03K19/0944 Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits using specified components using semiconductor devices using field-effect transistors using MOSFET or insulated gate field-effect transistors, i.e. IGFET
    • H03K19/0948 Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits using specified components using semiconductor devices using field-effect transistors using MOSFET or insulated gate field-effect transistors, i.e. IGFET using CMOS or complementary insulated gate field-effect transistors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804 Details
    • G06F2207/386 Special constructional features
    • G06F2207/3872 Precharge of output to prevent leakage
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804 Details
    • G06F2207/386 Special constructional features
    • G06F2207/3876 Alternation of true and inverted stages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/506 Indexing scheme relating to groups G06F7/506 - G06F7/508
    • G06F2207/5063 2-input gates, i.e. only using 2-input logical gates, e.g. binary carry look-ahead, e.g. Kogge-Stone or Ladner-Fischer adder

Definitions

  • This invention relates to an arrangement of the kind which is apparent from clauses of section CLAIMS.
  • the invention relates particularly to a parallel binary adder having an ultrafast speed and an extremely regular structure.
  • a fast parallel binary adder is essential.
  • to obtain sum and carry within one clock cycle is important.
  • the latency of the pipeline is often expected to be as small as possible.
  • the speed limitation of a parallel binary adder comes from its carry propagation delay.
  • the maximum carry propagation delay is the delay of its overflow carry.
  • the evaluation time T for the overflow carry is equal to the product of the delay Ti of each single-bit carry evaluation stage and the total bit number n.
  • carry lookahead strategies are widely used. Among them, the most important example related to this invention is the binary-lookahead-carry strategy, as described in the article "A regular layout for parallel adders" by Richard P. Brent and H. T. Kung, IEEE Transactions on Computers, vol. C-31, pp. 260-264, March 1982. In this strategy, carries are evaluated by a binary tree.
  • a 64-bit adder is divided into eight 8-bit adders, seven of them are carry-selected, so the visible levels of the binary carry tree are reduced. The visible levels are further reduced by using sixteen, four and two 4-bit Manchester carry chain modules in the first, the second and the third levels respectively. Finally, seven carries are obtained from the carry tree for the seven 8-bit adders to select their SUMs.
  • the true level number is hidden by the 4-bit Manchester module which is equivalent to two levels in a binary tree. The nonuniformity of the internal loading still exists but is hidden by the high radix, for example the fan outs of four Manchester modules in the second level are 1, 2, 3 and 4 respectively.
  • In CMOS circuit technique, which is widely used, high clock rates have been achieved by true single phase clocking (TSPC), device sizing and extreme pipelining, as described in the article "High speed CMOS circuit technique" by Jiren Yuan and Christer Svensson, IEEE Journal of Solid-State Circuits, vol. 24, pp. 62-70, February 1989.
  • TSPC true single phase clocking
  • high speed in connection with a one-clock-cycle decision for multi-level logic needs new circuit topology which should, if possible, eliminate delay overheads caused by many latches in a pipeline aiming at very high clock rates.
  • An object of the invention is to provide a parallel binary adder architecture which offers a superior speed, a uniform loading, a regular layout and a flexible configuration in the trade-off between speed, power and area compared with existing parallel binary adder architectures.
  • Another object of the invention is to provide an advanced CMOS circuit technique which offers an ultrafast speed particularly for a one-clock-cycle decision. The combination of the two objects offers a very high performance parallel binary adder.
  • the first object of the invention is achieved with the invented Distributed-Binary-Lookahead-Carry (DBLC) adder architecture which is an arrangement of the kind set forth in the characterising clause of Claim 1.
  • the second object is achieved by the invented clock-and-data precharged dynamic CMOS circuit technique which is an arrangement of the kind set forth in the characterising clause of Claim 2. Further features and further developments of the invented arrangements are set forth in other characterising clauses of section CLAIMS.
  • DBLC Distributed-Binary-Lookahead-Carry
  • the DBLC trees (DBLC-1 tree and DBLC-2 tree) in the invented DBLC adders (DBLC-1 adder and DBLC-2 adder) can be constructed with truly log2 n levels, which is approximately half of that in Brent and Kung's tree and, at the same time, both the internal and the external loadings are truly uniform.
  • the two trees have very regular structures and so have the layouts and are flexible enough to be constructed in different radixes to accommodate different speed, power and area compromises.
  • a DBLC tree means that identical binary-lookahead-carry trees are repeated for every bit, i.e. the MSB binary-lookahead-carry tree is moved towards the LSB bit by bit and implemented at each position. Also, the overlapped parts of these trees share only one cell at one position, the parts of these trees that extend beyond the LSB are eliminated, and some of the cells on the LSB side can be simplified. Finally, every cell in the tree has only two successive loading cells.
  • the clock-and-data precharged dynamic CMOS circuit solution is evolved from the existing CMOS domino circuit technique which turns out to be slow in its maximum clock rate and from the existing TSPC circuit technique.
  • the clock-and-data precharged dynamic CMOS circuit exhibits a superior speed for a one-clock-cycle decision.
  • Fig.1 shows an embodiment of a DBLC-1 tree [1.1] according to the invention in contrast with Brent and Kung's tree [2] by using 16-bit inputs as examples.
  • Fig.2 shows an embodiment of a DBLC-1 adder [35] according to the invention by using 16-bit inputs as an example.
  • Fig.3 shows an embodiment of a DBLC-2 tree [15.1] according to the invention by using 16-bit inputs as an example.
  • Fig.4 shows an embodiment of a DBLC-2 adder [36] according to the invention by using 16-bit inputs as an example.
  • Fig.5 shows an embodiment of a two-bit grouped DBLC adder [35.1] according to the invention by using a 16-bit DBLC-1 adder as an example.
  • Fig.6 shows an embodiment of a multi-input DBLC adder [35.2] according to the invention by using a DBLC-1 adder with four 16-bit inputs as an example.
  • Fig.7 shows static CMOS circuit embodiments of cells in DBLC-1 tree [1.1, 1.2 or 1.3] and DBLC-2 tree [15.1] according to the invention.
  • Fig.8 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (I) according to the invention.
  • Fig.9 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (II) according to the invention.
  • Fig.10 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (III) according to the invention.
  • Fig.11 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (IV) according to the invention.
  • Fig.12 shows the principle of data precharged asynchronous pipeline circuit technique according to the invention.
  • Figs.13a-d show an embodiment of a complete single-bit-grouped 64-bit DBLC-1 tree [1.4] according to the invention.
  • Fig.14 shows the first version embodiments of cells in a 64-bit DBLC-1 adder [37.1] according to the invention.
  • Fig.15 shows the second version embodiments of cells in a 64-bit DBLC-1 adder [37.2] according to the invention.
  • DBLC-1 tree [1.1] is shown in Fig.1(a) in contrast with Brent and Kung's tree [2] shown in Fig.1(b). Both use 16-bit inputs ai and bi as examples.
  • the computational cells in the DBLC-1 tree [1.1] are cells 1[3], 2[4] and 3[5].
  • Cell 2[4] performs the g'-computation only and contains the g'-part of cell 1[3].
  • Cell 3[5] is just a buffer.
  • Brent and Kung's tree [2] uses cell 1[3] and cell 4[6], which contains two buffers and has two inputs and two corresponding outputs.
  • Fig.1 cells are similar in both trees. Actually, some of the cells 1[3] and 4[6] in Brent and Kung's tree [2] can be replaced by cells 2[4] and 3[5].
  • the important point is that the number of levels used by the DBLC-1 tree [1.1] is truly log2 n, while Brent and Kung's tree [2] needs a total of (2log2 n - 1) levels to generate carries for every individual bit. This creates a large difference in speed.
  • Fig.2 shows an embodiment of a DBLC-1 adder [35] by using a 16-bit adder as an example. It contains a DBLC-1 tree [1.1], 16 cell 0s [8] and 16 SUM units. Each SUM unit contains an XOR gate [10] and an inverse XOR gate [7].
  • Inverse logic is introduced in the DBLC-1 tree [1.1] to gain speed, so the functions of cells B[9], 1[11], 1A[12], 2[13] and 2A[14] in Fig.2 differ from the functions of cells 1[3], 2[4] and 3[5] of the DBLC-1 tree [1.1] in Fig.1.
  • Cell 0[8] is used to generate inverted gi and pi. Since carry outputs from the DBLC-1 tree [1.1] are inverted, the half-SUMs are obtained from inverse XOR gates [7].
  • SUMs are generated from inverted carries and inverted half-SUMs through output XOR gates [10].
  • a carry input Cin
  • the DBLC-1 tree [1.1] has an extra row on the top.
  • the overflow carry, C16, is generated by an extra cell, cell 2A[14], and has approximately the same delay as the SUMs. Since the carry input is optional, it will not appear in later examples. From Fig.2, one can conclude that the adder will have a very regular layout.
  • Fig.3 shows an embodiment of a DBLC-2 tree [15.1] by using 16-bit inputs as an example.
  • a DBLC-2 tree [15.1] also starts from the generating and propagating signals gi and pi.
  • the computational cells in the DBLC-2 tree [15.1] are cells 1[16] and 2[6].
  • the C^0_{x-y} is the same as g' in the DBLC-1 tree [1.1], representing the carry output from bit x to bit y under the condition of assuming an input carry of zero to bit x.
  • the C^1_{x-y} has a clearer meaning than that of p' in the DBLC-1 tree [1.1].
  • C^1_{x-y} represents the carry output from bit x to bit y under the condition of assuming an input carry of one to bit x. From the outputs of the DBLC-2 tree [15.1], both assuming-zero carries and assuming-one carries are available.
  • Fig.4 shows an embodiment of a DBLC-2 adder [36] by using 16-bit inputs as an example.
  • This diagram is similar to Fig.2. It contains a DBLC-2 tree [15.1], 16 cell 0s [8] and 16 SUM units. Each SUM unit contains an XOR gate [10] and an inverse XOR gate [7]. Inverse logic is also used in the DBLC-2 tree [15.1], so the functions of cells B[17], 1[18] and 2[19] in Fig.4 differ from the functions of cells 1[16] and 2[6] of the DBLC-2 tree [15.1] in Fig.3.
  • Fig.5 shows an embodiment of a two-bit grouped DBLC adder [35.1] by using a 16-bit DBLC-1 adder as an example.
  • Higher radixes can be easily introduced into DBLC trees, which shows its flexibility.
  • the output carries from the tree will be C2, C4, C6 etc. These carries are used to select SUMs of bits 2-3, 4-5, 6-7 etc.
  • Fig.6 shows an embodiment of a multi-input DBLC adder by using a DBLC-1 adder [35.2] with four 16-bit inputs as an example. It indicates that three addition operations needed for adding four words are simplified to one operation by using two extra levels containing full adders [23], half-adders [24] and buffers [5]. The total addition time is considerably reduced.
  • This structure is very useful for making a fast parallel multiplier in which the inputs will be y1·X1-n, y2·X1-n, y3·X1-n etc., and each input is shifted successively by one bit. Note that if each input has 16 bits, the DBLC-1 tree [1.3] needs only 16 bits since the LSBs are generated directly.
  • Fig.7 shows static CMOS circuit embodiments of cells 1[11], 1A[12], 2[13] and 2A[14] in DBLC-1 trees [1.1], [1.2] and [1.3] and cells 1[18] and 2[19] in DBLC-2 tree [15.1].
  • Cells in Fig.2, Fig.4, Fig.5 and Fig.6 are inverse-logic cells. They can be realized by single CMOS stages.
  • Cell 1[11], cell 1A[12], cell 2[13] and cell 2A[14] are complementary.
  • Cells 1A[12] and 2A[14] contain only the g' (1A) or the inverted g' (2A) logic. Cell 1[18] and cell 2[19] are also complementary. These circuits are the fundamental circuits of DBLC trees. The dynamic circuits which will be described later are derived from them. Even with the static circuit embodiments, both DBLC-1 and DBLC-2 adders are faster than or comparable to any published fast parallel adders according to SPICE simulations.
  • Fig.8 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (I). It is well known that dynamic circuits are faster than static circuits due to their smaller fan-in and fan-out loads. For a fast one-clock-cycle decision, it is natural to choose a domino logic, particularly an n-domino logic which is faster than a p-domino logic. In a normal domino chain, inverters are used between two precharged stages, which prevents charge loss but creates delay overheads. It will be shown that by a series of evolutions new circuit techniques are created.
  • the first evolution is that by replacing these inverters between clock precharged stages, in this case n-stages [25.1-6], with static logic stages [26.1-6] as shown in Fig.8(a), a domino chain can be made more efficient and faster. Note that the initial-low-output functions (or the initial-high-output functions if in a p-domino chain) are still kept by these static logic stages [26.1-6].
  • the second evolution is that, except for the first clock precharged stages, in this case n-stages [25.1-2], the evaluating transistors, in this case n-transistors [27.3-6] of n-stages [25.3-6], can be omitted so the evaluating speed is further increased.
  • the third evolution is that all these static stages [26.1-6] can be conditionally modified into dynamic stages to reduce their fan-in and fan-out loads.
  • the modified circuit examples A, B, C and D connected to cell 1[11] in DBLC-1 tree [1.1] and cell 1[18] in DBLC-2 tree [15.1] are given in Fig.8(b).
  • These dynamic stages are precharged by their data inputs and are therefore named data precharged stages; accordingly, the technique is named the clock-and-data precharged dynamic CMOS circuit technique.
  • the example chains in Fig.8(a) are clock-and-data precharged n-chains which can be prolonged and the circuit examples given in Fig.8(b) are free of charge sharing.
  • Fig.9 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (II) .
  • the fourth evolution is that not only a single static stage but also multiple static stages (odd number) can be cascaded in a domino chain to reduce the number of clock precharged stages since clock precharged stages in a domino chain have charge-sharing problems leading to the demand of all nodes in the stage to be precharged.
  • the fifth evolution is that all these static stages can be replaced by dynamic circuits and they become data precharged high-in/low-out or data precharged low-in/high-out stages alternately, indicated by H/L [28.1-6] and L/H [29.1-6] stages in Fig.9, Fig.10, Fig.11 and Fig.12. Note the number and order required in different chains.
  • the modified circuit examples A, B, C and D for precharged low-in/high-out stages connected to cell 2[13] in DBLC-1 tree [1.1] and cell 2[19] in DBLC-2 tree [15.1] are given in Fig.9(b) while the examples of precharged high-in/low-out stages are already shown in Fig.8(b).
  • the chains shown in Fig.9(a) are clock-and-data precharged n-chains which can be prolonged; in the extreme case, only one clock precharged n-stage at the beginning of each chain is needed and all the rest can be H/L or L/H stages.
  • If a very quick precharge is required, there should be a sufficient number of clock precharged stages besides the first two stages [25.1-2], in this case n-stages [25.3-4], in the chain, and they should have the evaluating transistors, in this case n-transistors [27.3-4], implemented.
  • Fig.10 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (III). If a precharge phase can be arranged separately, the circuit in Fig.9 gives a really high speed to a multi-level logic computation like that in the DBLC trees. However, very often an adder needs a true one-clock-cycle decision with the precharge phase included. In this case, n-p logic, particularly TSPC n-p logic, is very useful, in which both TSPC n-latch stage [31] and TSPC p-latch stage [33] are necessary. It was found that a clock-and-data precharged n-chain [30.1, 2, 3 or 4]
  • a clock-and-data precharged p-chain [32.1 or 2] contains at least one clock precharged p-stage, for example p-stage [34.1 or 2], followed by L/H [29.1 or 2] and H/L [28.1 or 2] stages in an opposite order compared to a clock-and-data precharged n-chain, for example n-chain [30.3 or 4].
  • TSPC n-p logic can be implemented by cascading the n- and p-chains together with TSPC latch stages between. These are represented by evolutions 6 and 7 in Fig.10(a) and Fig.10(b). Note that different numbers and orders of data precharged stages are required between two clock precharged stages and between a clock precharged stage and a TSPC latch. Also, the n-chains, for example chains [30.1-4], the p-chains, for example chains [32.1-2], and the latched n-p chains, for example the cascade of p-chain [32.1], p-latch [33], n-chain [30.1] and n-latch [31], can be prolonged if the stage numbers and orders are correct.
  • Fig.11 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (IV), the non-latched TSPC n-p domino circuit technique.
  • IV clock-and-data precharged dynamic CMOS circuit technique
  • a latched n-p chain, for example the cascade of p-chain [32.1], p-latch [33], n-chain [30.1] and n-latch [31] in Fig.10
  • the latch stage, for example stage [31] or stage [33], introduces a delay overhead and an inversion.
  • Fig.12 shows the principle of data precharged asynchronous pipeline circuit technique. It was found that if return-to-zero or return-to-one data are used, the clock can be eliminated and the chain becomes a pure data precharged asynchronous pipeline containing only H/L stages, for example stages [28.1-6], and L/H stages, for example stages [29.1-6]. As long as the data rate is lower than a maximum rate, the precharged-wave-front (PWF) will not catch up with the evaluation-wave-front (EWF), so the circuit will be reliable.
  • PWF precharged-wave-front
  • EWF evaluation-wave-front
  • Fig.13 shows an embodiment of a complete single-bit-grouped 64-bit DBLC-1 tree [1.4] with the g and p generators included (cell 0 in Fig.13a and Fig.13c).
  • cells in different levels are given different numbers in Figs.13a-d. Cells with the same number in the same level (column) are identical. Cells with different numbers in different levels may be identical in one circuit technique or different in another circuit technique, as will be seen in Fig.14 and Fig.15. It contains cells 0, 0A, 1, 1A, 1B, 2, 2A, 2B, 3, 3A, 3B, 4, 4A, 4B, 5, 5A, 5B, 6A and 6B.
  • Fig.14 shows a first version of dynamic CMOS circuit embodiments of cells for the 64-bit DBLC-1 adder [37.1] using the tree shown in Figs.13a-d by using clock-and-data precharged n-chains.
  • This solution gives an ultrafast speed to the evaluation phase but needs an additional precharge phase.
  • the precharge phase can be either very short in time if all evaluation transistors in the clock precharged stages are kept or comparable to the evaluation phase in time if these transistors are removed except the ones in cell 0s. In the second case, the evaluation speed is further increased.
  • the evaluation time for a latched SUM output can be less than 2ns with transistor sizing.
  • Fig.15 shows a second version of dynamic CMOS circuit embodiments of cells for the 64-bit DBLC-1 adder [37.2] using the tree shown in Fig.13 by the combination of circuit techniques described in Fig.8, Fig.9, Fig.10 and Fig.11.
  • the clock rate can be as high as 500MHz, which means the total evaluation time for a latched SUM output is less than 2ns.
  • the rise and fall slopes of the clock are 0.2ns. From these examples (Fig.14 and Fig.15), the combination of the new adder architecture and the new circuit technique gives a large speed improvement for a parallel binary adder.
  • the maximum speed is achieved by the distributed adder architecture since the architecture has a minimum logic depth and a very uniform loading.
  • the maximum speed is also achieved by the clock-and-data precharged dynamic CMOS circuit technique.
  • the adder architecture has a very regular structure as can be seen in the principle diagrams and so has the layout.
  • the layout will be so regular that computer automation can be easily introduced for generating adder layouts with required bit-numbers, pitches and sizes by using just a few standard cells. 3. Uniform loading and driving
  • the adder architecture is suitable for both one-clock-cycle decision and multi-clock-cycle pipelining.
  • the multi-bit grouped DBLC adders are flexible for different speed, area and power compromises.
  • the same tree can be used for a 32-bit single-bit grouped DBLC adder or for a 64-bit two-bit grouped DBLC adder except that the sum logic needs to be redesigned.
  • the speed will decrease because of the increased loads to carry outputs.
  • the clock-and-data precharged dynamic CMOS circuit technique invented not only gives large speed improvement but also gives less power noise and peak currents because of the successive precharge and evaluation of different stages.
  • the clock loads are also less than that in the case of intensive pipelining.
  • Example A (no condition)
  • Example B (no condition)
  • Fig.11: d_pp: precharge delay of the whole p-domino chain.
  • d_ne: evaluation delay of the first stage of the successive n-domino chain.
  • d_np: precharge delay of the whole n-domino chain.
  • the data hold time at the output is equal to the passing time between EWF and PWF, depending on data rate and circuit; (3) when PWF catches up EWF, it reaches its highest data rate and there is no limit for a lower data rate except leakage current.
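The four-input arrangement described for Fig.6 above, in which three additions collapse into one by means of two extra levels of full adders, can be illustrated with a small carry-save sketch. This is only an illustration under stated assumptions: the function names `full_adder_row` and `add_four` are hypothetical, the half-adders and buffers of the figure are not modelled separately, and Python's built-in `+` stands in for the single remaining carry-propagate (DBLC) addition.

```python
def full_adder_row(x: int, y: int, z: int) -> tuple[int, int]:
    """One level of bitwise full adders (3:2 compression).

    Returns (sum_word, carry_word) with x + y + z == sum_word + carry_word.
    The carry word is already shifted left by one bit position.
    """
    sum_word = x ^ y ^ z                          # per-bit sum, no carry propagation
    carry_word = ((x & y) | (x & z) | (y & z)) << 1  # per-bit majority, weighted by 2
    return sum_word, carry_word


def add_four(w0: int, w1: int, w2: int, w3: int) -> int:
    """Add four words using two compression levels and one final addition."""
    s1, c1 = full_adder_row(w0, w1, w2)   # first extra level: 3 words -> 2 words
    s2, c2 = full_adder_row(s1, c1, w3)   # second extra level: 3 words -> 2 words
    return s2 + c2                        # the only carry-propagate addition


if __name__ == "__main__":
    print(add_four(1, 2, 3, 4))
```

Because neither compression level propagates carries across bit positions, each adds only one full-adder delay, which is consistent with the statement that the total addition time is considerably reduced.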

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Power Engineering (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)
  • Logic Circuits (AREA)

Abstract

A binary-lookahead-carry adder can be improved significantly by a new architecture called the Distributed Binary-Lookahead-Carry (DBLC) architecture. The new architecture has truly log2 n computation levels and a very regular structure. Both the internal loading and the external loading are truly uniform and each cell is loaded by no more than two successive cells. The architecture can be configured with single-bit or multi-bit groups to trade off speed, area and power, and is suitable both for a one-clock-cycle decision and for multi-clock-cycle pipelining. Two different versions are given, the DBLC-1 adder and the DBLC-2 adder. While the first one uses computation cells similar to those in the original binary-lookahead-carry adder, the second one uses a new computation algorithm which can give outputs of SUM and SUM+1 simultaneously. The architecture is supported by a new circuit technique called the clock-and-data precharged dynamic CMOS circuit technique, including both latched and non-latched versions. The circuit technique aims for a fast one-clock-cycle decision and increases speed by eliminating the delay overheads of domino inverters and pipeline latches. The new adder exhibits a maximum speed, a very regular layout, a truly uniform loading and a high flexibility.
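The simultaneous SUM and SUM+1 outputs mentioned for the DBLC-2 adder can be illustrated with a conditional-carry recurrence: for every bit, one carry is computed assuming an input carry of zero and another assuming an input carry of one, and spans are merged by letting the lower span's carry select between the upper span's two candidates. The sketch below is an assumption-laden illustration, not the patent's cell logic: the names `conditional_carries` and `sum_and_sum_plus_one` are hypothetical and the merge order is one plausible arrangement.

```python
def conditional_carries(a: int, b: int, n: int) -> tuple[list[int], list[int]]:
    """Return (c0, c1): carry out of each bit assuming carry-in 0 or 1 at bit 0."""
    # Single-bit spans: carry out is a AND b for carry-in 0, a OR b for carry-in 1.
    c0 = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    c1 = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    d = 1
    while d < n:                       # each pass doubles the span covered per bit
        n0, n1 = c0[:], c1[:]
        for i in range(d, n):
            # The carry leaving the lower span selects the upper span's candidate.
            n0[i] = c1[i] if c0[i - d] else c0[i]
            n1[i] = c1[i] if c1[i - d] else c0[i]
        c0, c1 = n0, n1
        d *= 2
    return c0, c1


def sum_and_sum_plus_one(a: int, b: int, n: int) -> tuple[int, int]:
    """Return ((a + b) mod 2^n, (a + b + 1) mod 2^n) from the two carry sets."""
    c0, c1 = conditional_carries(a, b, n)
    s0 = s1 = 0
    for i in range(n):
        half = ((a >> i) ^ (b >> i)) & 1          # half-sum of bit i
        s0 |= (half ^ (c0[i - 1] if i else 0)) << i
        s1 |= (half ^ (c1[i - 1] if i else 1)) << i
    return s0, s1


if __name__ == "__main__":
    print(sum_and_sum_plus_one(5, 9, 16))
```

Both results come out of the same number of merge passes, which is why having the assuming-zero and assuming-one carries available together makes SUM and SUM+1 equally cheap.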

Description

AN ULTRAFAST ADDER ARRANGEMENT
This invention relates to an arrangement of the kind which is apparent from clauses of section CLAIMS. The invention relates particularly to a parallel binary adder having an ultrafast speed and an extremely regular structure.
BACKGROUND OF THE INVENTION
In computational devices, for example computers and digital signal processing elements, a fast parallel binary adder is essential. In many cases, to obtain sum and carry within one clock cycle is important. In the case of a pipeline, the latency of the pipeline is often expected to be as small as possible.
The speed limitation of a parallel binary adder comes from its carry propagation delay. The maximum carry propagation delay is the delay of its overflow carry. In a ripple carry adder, the evaluation time T for the overflow carry is equal to the product of the delay Ti of each single-bit carry evaluation stage and the total bit number n. In order to improve speed, carry lookahead strategies are widely used. Among them, the most important example related to this invention is the binary-lookahead-carry strategy, as described in the article "A regular layout for parallel adders" by Richard P. Brent and H. T. Kung, IEEE Transactions on Computers, vol. C-31, pp. 260-264, March 1982. In this strategy, carries are evaluated by a binary tree. Unfortunately, by using a main tree plus a so-called inverse tree introduced in their paper, the number of levels needed for a complete carry evaluation network is (2log2 n - 1), leading to a computation time of (2log2 n - 1)Ti where Ti is the delay of each level. Actually, the nonuniformity of the internal loadings in Brent and Kung's tree is made uniform by a pipeline structure (the inverse tree), which creates a large delay overhead. Since then, such a tree is often combined with other techniques, for example with carry select and Manchester carry chain, as described in the article "A spanning tree carry lookahead adder" by Thomas Lynch and Earl E. Swartzlander, Jr., IEEE Transactions on Computers, vol. 41, pp. 931-939, August 1992. In their strategy, a 64-bit adder is divided into eight 8-bit adders, seven of which are carry-selected, so the visible levels of the binary carry tree are reduced. The visible levels are further reduced by using sixteen, four and two 4-bit Manchester carry chain modules in the first, the second and the third levels respectively. Finally, seven carries are obtained from the carry tree for the seven 8-bit adders to select their SUMs.
In this solution, the true level number is hidden by the 4-bit Manchester module which is equivalent to two levels in a binary tree. The nonuniformity of the internal loading still exists but is hidden by the high radix, for example the fan outs of four Manchester modules in the second level are 1, 2, 3 and 4 respectively.
In CMOS circuit technique, which is widely used, high clock rates have been achieved by true single phase clocking (TSPC), device sizing and extreme pipelining, as described in the article "High speed CMOS circuit technique" by Jiren Yuan and Christer Svensson, IEEE Journal of Solid-State Circuits, vol. 24, pp. 62-70, February 1989. However, high speed in connection with a one-clock-cycle decision for multi-level logic needs a new circuit topology which should, if possible, eliminate the delay overheads caused by the many latches in a pipeline aiming at very high clock rates.
OBJECTS AND SOLUTIONS OF THE INVENTION
An object of the invention is to provide a parallel binary adder architecture which offers superior speed, uniform loading, a regular layout and a flexible configuration in the trade-off between speed, power and area compared with existing parallel binary adder architectures. Another object of the invention is to provide an advanced CMOS circuit technique which offers ultrafast speed, particularly for a one-clock-cycle decision. The combination of the two objects offers a very high performance parallel binary adder.
The first object of the invention is achieved with the invented Distributed-Binary-Lookahead-Carry (DBLC) adder architecture which is an arrangement of the kind set forth in the characterising clause of Claim 1. The second object is achieved by the invented clock-and-data precharged dynamic CMOS circuit technique which is an arrangement of the kind set forth in the characterising clause of Claim 2. Further features and further developments of the invented arrangements are set forth in other characterising clauses of section CLAIMS.
The DBLC trees (DBLC-1 tree and DBLC-2 tree) in the invented DBLC adders (DBLC-1 adder and DBLC-2 adder) can be constructed with truly log2n levels, which is approximately half of that in Brent and Kung's tree, and, at the same time, both the internal and the external loadings are truly uniform. The two trees have very regular structures, and so do their layouts, and they are flexible enough to be constructed in different radixes to accommodate different speed, power and area compromises.
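The level counts quoted above can be tabulated with a minimal sketch. The function name `carry_levels` is hypothetical, and n is assumed to be a power of two; the three formulas (n stages for ripple carry, 2log2n - 1 levels for Brent and Kung's tree, log2n levels for a DBLC tree) are the ones stated in the text.

```python
def carry_levels(n: int) -> dict:
    """Logic levels needed to evaluate all carries of an n-bit adder."""
    if n < 2 or n & (n - 1):
        raise ValueError("n is assumed to be a power of two")
    log2n = n.bit_length() - 1
    return {
        "ripple": n,                  # one carry stage per bit
        "brent_kung": 2 * log2n - 1,  # main tree plus inverse tree
        "dblc": log2n,                # distributed tree
    }

for n in (16, 64):
    print(n, carry_levels(n))
```

For n = 64 this gives 64, 11 and 6 levels respectively, which is where the "approximately half" comparison comes from.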
A DBLC tree means that identical binary-lookahead-carry trees are repeated for every bit, i.e. the MSB binary-lookahead-carry tree is moved and implemented towards the LSB bit by bit. The overlapped parts of these trees share only one cell at each position, the parts of these trees that extend beyond the LSB are eliminated, and part of the cells on the LSB side can be simplified. Finally, every cell in the tree is loaded by only two successive cells.
The clock-and-data precharged dynamic CMOS circuit solution is evolved from the existing CMOS domino circuit technique which turns out to be slow in its maximum clock rate and from the existing TSPC circuit technique. By means of removing delay overheads caused by domino inverters and pipeline latches, the clock-and-data precharged dynamic CMOS circuit exhibits a superior speed for a one-clock-cycle decision.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig.1 shows an embodiment of a DBLC-1 tree[1] according to the invention in contrast with Brent and Kung's tree[2] by using 16-bit inputs as examples.
Fig.2 shows an embodiment of a DBLC-1 adder[35] according to the invention by using 16-bit inputs as an example.
Fig.3 shows an embodiment of a DBLC-2 tree[15] according to the invention by using 16-bit inputs as an example.
Fig.4 shows an embodiment of a DBLC-2 adder[36] according to the invention by using 16-bit inputs as an example.
Fig.5 shows an embodiment of a two-bit grouped DBLC adder[35.1] according to the invention by using a 16-bit DBLC-1 adder as an example.
Fig.6 shows an embodiment of a multi-input DBLC adder[35.2] according to the invention by using a DBLC-1 adder with four 16-bit inputs as an example.
Fig.7 shows static CMOS circuit embodiments of cells in DBLC-1 tree[1.1, 1.2 or 1.3] and DBLC-2 tree[15.1] according to the invention.
Fig.8 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (I) according to the invention.
Fig.9 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (II) according to the invention.
Fig.10 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (III) according to the invention.
Fig.11 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (IV) according to the invention.
Fig.12 shows the principle of data precharged asynchronous pipeline circuit technique according to the invention.
Figs.13a-d show an embodiment of a complete single-bit-grouped 64-bit DBLC-1 tree[1.4] according to the invention.
Fig.14 shows the first version embodiments of cells in a 64-bit DBLC-1 adder[37.1] according to the invention.
Fig.15 shows the second version embodiments of cells in a 64-bit DBLC-1 adder[37.2] according to the invention.
DETAILED DESCRIPTION OF THE DRAWINGS
The principle diagram of a DBLC-1 tree[1] is shown in Fig.1(a) in contrast with Brent and Kung's tree[2] shown in Fig.1(b). Both use 16-bit inputs ai and bi as examples. In Fig.1, the DBLC-1 tree[1] starts from the generating and propagating signals gi and pi, the same as Brent and Kung's tree[2], where gi=aibi and pi=ai+bi. The computational cells in the DBLC-1 tree[1] are cells 1[3], 2[4] and 3[5]. Cell 1[3] performs the following computation: g'=g2+p2g1 and p'=p2p1, where g' and p' are the two outputs and g2, p2, g1 and p1 are the four inputs. The g' represents the carry output under the condition of assuming an input carry of zero for the bits involved, while the p' is used for further computations. Cell 2[4] performs the g'-computation only and contains the g'-part of cell 1[3]. Cell 3[5] is just a buffer. Brent and Kung's tree[2] uses cell 1[3] and cell 4[6], which contains two buffers and has two inputs and two corresponding outputs. In Fig.1, cells are similar in both trees. Actually, some of the cells 1[3] and 4[6] in Brent and Kung's tree[2] can be replaced by cells 2[4] and 3[5]. However, the important point is that the number of levels used by the DBLC-1 tree[1] is truly log2n, while Brent and Kung's tree[2] needs (2log2n - 1) levels in total to generate carries for every individual bit. This creates a large difference in speed.
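The cell computation g'=g2+p2g1, p'=p2p1 and the log2n-level evaluation can be illustrated with a short behavioural sketch. The wiring used here is an assumption, since Fig.1 is not reproduced in the text: at each level every bit position combines its own (g, p) pair with the pair from the position one span lower, the span doubling per level, and positions with no lower partner simply pass their value on (the buffer case). The names `cell1` and `dblc1_carries` are hypothetical.

```python
def cell1(g2, p2, g1, p1):
    """Main cell: combine a higher group (g2, p2) with a lower group (g1, p1)."""
    return g2 | (p2 & g1), p2 & p1

def dblc1_carries(a: int, b: int, n: int = 16):
    """Return (carries, levels); carries[i] is the carry out of bit i, carry-in 0."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]    # g_i = a_i AND b_i
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]  # p_i = a_i OR b_i
    span, levels = 1, 0
    while span < n:
        new_g, new_p = g[:], p[:]          # untouched positions act as buffers
        for i in range(span, n):
            new_g[i], new_p[i] = cell1(g[i], p[i], g[i - span], p[i - span])
        g, p = new_g, new_p
        span *= 2
        levels += 1
    return g, levels

carries, levels = dblc1_carries(0x00FF, 0x0001)
print(levels, carries)
```

In this sketch each output of a level feeds at most two cells of the next level, which matches the loading property described for the tree; the gate-level cell variants (g'-only cells, inverse logic) are not modelled.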
Fig.2 shows an embodiment of a DBLC-1 adder[35] by using a 16-bit adder as an example. It contains a DBLC-1 tree[1.1], 16 cells 0[8] and 16 SUM units. Each SUM unit contains an XOR gate[10] and an inverse XOR gate[7]. Inverse logic is introduced in the DBLC-1 tree[1.1] to gain speed, so the functions of cells B[9], 1[11], 1A[12], 2[13] and 2A[14] in Fig.2 differ from the functions of cells 1[3], 2[4] and 3[5] of the DBLC-1 tree[1] in Fig.1. Cell 0[8] is used to generate inversed gi and pi. Since carry outputs from the DBLC-1 tree[1.1] are inversed, the half-SUMs are obtained from inverse XOR gates[7]. Finally, SUMs are generated from inversed carries and inversed half-SUMs through output XOR gates[10]. There is a carry input, Cin, in this example together with two 16-bit word inputs. For the purpose of dealing with the carry input, the DBLC-1 tree[1.1] has an extra row on the top. The overflow carry, C16, is generated by an extra cell, cell 2A[14], and has approximately the same delay as the SUMs. Since the carry input is optional, it will not appear in later examples. From Fig.2, one can conclude that the adder will have a very regular layout.
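The reason the SUM unit can work directly on inverted signals is that XOR is unchanged when both of its inputs are inverted. A minimal check of that identity follows; the function name `sum_bit_inverse_logic` and its argument names are hypothetical, and only the logic function is modelled, not the gates of Fig.2.

```python
def sum_bit_inverse_logic(a_i: int, b_i: int, carry_in_inv: int) -> int:
    """SUM bit from an inverted half-sum and an inverted incoming carry."""
    half_sum_inv = 1 ^ (a_i ^ b_i)       # inverse XOR gate: inverted half-sum
    return half_sum_inv ^ carry_in_inv   # output XOR gate

# Exhaustive comparison with ordinary one-bit addition.
for a_i in (0, 1):
    for b_i in (0, 1):
        for c in (0, 1):
            assert sum_bit_inverse_logic(a_i, b_i, 1 ^ c) == (a_i + b_i + c) & 1
```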
Fig.3 shows an embodiment of a DBLC-2 tree[15] by using 16-bit inputs as an example. A DBLC-2 tree[15] also starts from the generating and propagating signals gi and pi. The computational cells in the DBLC-2 tree[15] are cells 1[16] and 2[6]. C^0_{x-y} is the same as g' in the DBLC-1 tree[1], representing the carry output from bit x to bit y under the condition of assuming an input carry of zero to bit x. However, C^1_{x-y} has a clearer meaning than p' in the DBLC-1 tree[1]: C^1_{x-y} represents the carry output from bit x to bit y under the condition of assuming an input carry of one to bit x. From the outputs of the DBLC-2 tree[15], both assuming-zero carries and assuming-one carries are available. The computations of C^0_{x-y} and C^1_{x-y} are similar, i.e. C^0_{x-y} = C^0_{v-y} + C^1_{v-y} C^0_{x-u} and C^1_{x-y} = C^0_{v-y} + C^1_{v-y} C^1_{x-u}, where v=u+1. Both have a common term C^0_{v-y}. If C^0_{v-y}=1, C^0_{x-y} and C^1_{x-y} must be 1. The second terms are selecting terms, in which the assuming-zero and the assuming-one carries from bit x to bit u, C^0_{x-u} and C^1_{x-u}, are selected by C^1_{v-y}. In this sense, it is a carry-selected-by-carry algorithm. It was found that this algorithm is beneficial not only because two carry sets are available but also for circuit simplification, which will be described together with Fig.8 and Fig.9.
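The carry-selected-by-carry combination can be checked exhaustively on small blocks. In the sketch below each block is represented by a pair (C0, C1), its carry out for an assumed carry-in of zero and of one; `block_carries` computes that pair by simple rippling purely as a reference, and `combine` applies the two equations given for the DBLC-2 cell. Both function names are hypothetical.

```python
def block_carries(a_bits, b_bits):
    """(C0, C1): carry out of a block of bits (LSB first) for carry-in 0 and 1."""
    out = []
    for cin in (0, 1):
        c = cin
        for a, b in zip(a_bits, b_bits):
            c = (a & b) | ((a | b) & c)
        out.append(c)
    return tuple(out)

def combine(high, low):
    """Merge the higher block (bits v..y) with the lower block (bits x..u)."""
    c0_hi, c1_hi = high
    c0_lo, c1_lo = low
    # The lower block's carry selects between the higher block's two carries.
    return c0_hi | (c1_hi & c0_lo), c0_hi | (c1_hi & c1_lo)

# Split every pair of 4-bit operands into two 2-bit blocks and compare.
for a in range(16):
    for b in range(16):
        ab = [(a >> i) & 1 for i in range(4)]
        bb = [(b >> i) & 1 for i in range(4)]
        low = block_carries(ab[:2], bb[:2])
        high = block_carries(ab[2:], bb[2:])
        assert combine(high, low) == block_carries(ab, bb)
```

The form works because the assuming-zero carry of a block can never exceed its assuming-one carry, so the common term plus one selecting term is sufficient.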
Fig.4 shows an embodiment of a DBLC-2 adder[36] by using 16-bit inputs as an example. This diagram is similar to Fig.2. It contains a DBLC-2 tree[15.1], 16 cells 0[8] and 16 SUM units. Each SUM unit contains an XOR gate[10] and an inverse XOR gate[7]. Inverse logic is also used in the DBLC-2 tree[15.1], so the functions of cells B[17], 1[18] and 2[19] in Fig.4 differ from the functions of cells 1[16] and 2[6] of the DBLC-2 tree[15] in Fig.3. Since both assuming-zero and assuming-one carries are available, SUM and SUM+1 can be obtained simultaneously. Note that this is obtained at the cost of very limited hardware. As the SUM channel has a large speed margin, the extra XOR gate for S'=(SUM+1) does not ask for more driving capability from the previous inverse XOR gate[7]. The assuming-one carries are obtained from the spare outputs of the output cells B[17] and 2[19]. Internally, cells in the DBLC-2 tree[15.1] are still loaded by no more than two successive cells. There will be no obvious speed degradation, but the last level of the DBLC-2 tree[15.1] has twice the number of wires.
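How two carry sets yield SUM and SUM+1 together can be sketched behaviourally. Here the assuming-zero and assuming-one carries into each bit are rippled for brevity, which is an assumption made only to keep the example short; in the adder described above they would come from the DBLC-2 tree. The function name `sum_and_sum_plus_one` is hypothetical.

```python
def sum_and_sum_plus_one(a: int, b: int, n: int = 16):
    """Return ((a + b) mod 2**n, (a + b + 1) mod 2**n) from one shared half-sum."""
    c0, c1 = 0, 1            # carries into bit 0 for assumed carry-in 0 and 1
    s = s1 = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        h = ai ^ bi          # half-sum, shared by both outputs
        s |= (h ^ c0) << i   # SUM bit uses the assuming-zero carry
        s1 |= (h ^ c1) << i  # SUM+1 bit uses the assuming-one carry
        g, p = ai & bi, ai | bi
        c0, c1 = g | (p & c0), g | (p & c1)
    return s, s1

print(sum_and_sum_plus_one(0x1234, 0x0FFF))
```

The only extra work for SUM+1 is one more XOR per bit, which is consistent with the remark that very limited hardware is needed.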
Fig.5 shows an embodiment of a two-bit grouped DBLC adder[35.1] by using a 16-bit DBLC-1 adder as an example. Higher radixes can be easily introduced into DBLC trees, which shows their flexibility. One can see cells 1[11] in the two-bit grouped DBLC-1 tree[1.2] as cells 0[8] in Fig.2, and the rest of the cells form an 8-bit distributed tree with only three levels. The output carries from the tree will be C2, C4, C6 etc. These carries are used to select SUMs of bits 2-3, 4-5, 6-7 etc. A double XOR gate[20] gives both the assuming-zero-carry SUM and the assuming-one-carry SUM of each bit to the multiplexer (MUX)[21]. Note that, in this case, every cell in the adder is still loaded by no more than two successive cells. By using higher radixes, the layout pitch can be reduced and wires in all levels, particularly in the last level, will be fewer and shorter. The total number of levels, calculated from cell 0[8] in Fig.5 to the carry output, is still the same, since it depends on the total bit-number, not the radix. Each carry output is loaded by two MUXs[21], twice as much as that in the single-bit grouped case. For the same delay, it obviously needs twice the transistor sizes in the tree. However, the multi-bit grouped solution gives flexibility for compromising power, area and speed.
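The selection step in a two-bit grouped adder can be sketched as follows. The group carries C2, C4, C6 etc. are taken as given: the helper `group_carry` derives them with plain integer arithmetic and merely stands in for the tree, which is an assumption of this sketch. Both function names are hypothetical.

```python
def group_carry(a: int, b: int, lo: int) -> int:
    """Carry into bit `lo` (0 for lo = 0); stands in for the tree output C_lo."""
    mask = (1 << lo) - 1
    return ((a & mask) + (b & mask)) >> lo

def two_bit_grouped_sum(a: int, b: int, n: int = 16) -> int:
    """n-bit sum built from 2-bit groups, each selected by its incoming carry."""
    result = 0
    for lo in range(0, n, 2):
        ga, gb = (a >> lo) & 3, (b >> lo) & 3
        sum0 = (ga + gb) & 3        # candidate assuming an incoming carry of 0
        sum1 = (ga + gb + 1) & 3    # candidate assuming an incoming carry of 1
        chosen = sum1 if group_carry(a, b, lo) else sum0   # the MUX
        result |= chosen << lo
    return result

print(hex(two_bit_grouped_sum(0x1234, 0x0FFF)))
```

Each carry drives the selection of two SUM bits, which mirrors the doubled load on the carry outputs mentioned above.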
Fig.6 shows an embodiment of a multi-input DBLC adder by using a DBLC-1 adder[35.2] with four 16-bit inputs as an example. It indicates that three addition operations needed for adding four words are simplified to one operation by using two extra levels containing full adders[23], half-adders[24] and buffers[5]. The total addition time is considerably reduced. This structure is very useful for making a fast parallel multiplier in which the inputs will be Y1X1-n, Y2X1-n, Y3X1-n etc, and each input is shifted successively by one bit. Note that if each input has 16 bits, the DBLC-1 tree[1.3] needs only 16 bits since the LSBs are generated directly.
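The reduction of four words to a single carry-propagating addition can be sketched with two rows of bitwise full adders. This is a behavioural illustration under assumptions: the exact placement of half-adders and buffers in Fig.6 and the word widths are not modelled, and the final `+` stands in for the DBLC-1 adder. The names `full_adder_row` and `add_four` are hypothetical.

```python
def full_adder_row(x: int, y: int, z: int):
    """Bitwise full adders: three words in, a sum word and a carry word out."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1   # each carry has the next bit's weight
    return s, c

def add_four(w0: int, w1: int, w2: int, w3: int) -> int:
    s1, c1 = full_adder_row(w0, w1, w2)   # first extra level
    s2, c2 = full_adder_row(s1, c1, w3)   # second extra level
    return s2 + c2                        # the single remaining addition

print(add_four(0x1234, 0x0FFF, 0x8001, 0x7FFF))
```

Neither extra level propagates a carry along the word, so their delay is independent of the word length.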
Fig.7 shows static CMOS circuit embodiments of cells 1[11], 1A[12], 2[13] and 2A[14] in DBLC-1 trees[1.1], [1.2] and [1.3] and cells 1[18] and 2[19] in DBLC-2 tree[15.1]. Cells in Fig.2, Fig.4, Fig.5 and Fig.6 are inverse-logic cells. They can be realized by single CMOS stages. Cell 1[11], cell 1A[12], cell 2[13] and cell 2A[14] are complementary. Cells 1A[12] and 2A[14] contain only the g' (1A) or the inversed g' (2A) logic. Cell 1[18] and cell 2[19] are also complementary. These circuits are the fundamental circuits of DBLC trees. The dynamic circuits which will be described later are derived from them. Even with the static circuit embodiments, both DBLC-1 and DBLC-2 adders are faster than or comparable to any published fast parallel adders according to SPICE simulations.
Fig.8 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (I). It is well known that dynamic circuits are faster than static circuits due to their smaller fan-in and fan-out loads. For a fast one-clock-cycle decision, it is natural to choose a domino logic, particularly an n-domino logic, which is faster than a p-domino logic. In a normal domino chain, inverters are used between two precharged stages, which prevents charge loss but creates delay overheads. It will be shown that new circuit techniques are created by a series of evolutions. The first evolution is that, by replacing the inverters between clock precharged stages, in this case n-stages[25.1-6], with static logic stages[26.1-6] as shown in Fig.8(a), a domino chain can be made more efficient and faster. Note that the initial-low-output functions (or the initial-high-output functions in a p-domino chain) are still kept by these static logic stages[26.1-6]. The second evolution is that, except in the first clock precharged stages, in this case n-stages[25.1-2], the evaluating transistors, in this case n-transistors[27.3-6], of successive clock precharged stages, in this case n-stages[25.3-6], can be omitted so the evaluating speed is further increased. The third evolution is that all these static stages[26.1-6] can be conditionally modified into dynamic stages to reduce their fan-in and fan-out loads. The modified circuit examples A, B, C and D connected to cell 1[11] in DBLC-1 tree[1.1] and cell 1[18] in DBLC-2 tree[15.1] are given in Fig.8(b). These dynamic stages are precharged by their data inputs and are therefore named data precharged stages; accordingly, the technique is named the clock-and-data precharged dynamic CMOS circuit technique. Note that the example chains in Fig.8(a) are clock-and-data precharged n-chains which can be prolonged, and the circuit examples given in Fig.8(b) are free of charge sharing.
Fig.9 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (II). The fourth evolution is that not only a single static stage but also multiple static stages (an odd number) can be cascaded in a domino chain to reduce the number of clock precharged stages, since clock precharged stages in a domino chain have charge-sharing problems, leading to the demand that all nodes in the stage be precharged. The fifth evolution is that all these static stages can be replaced by dynamic circuits, and they become data precharged high-in/low-out or data precharged low-in/high-out stages alternately, indicated by H/L[28.1-6] and L/H[29.1-6] stages in Fig.9, Fig.10, Fig.11 and Fig.12. Note the number and order required in different chains. The modified circuit examples A, B, C and D for precharged low-in/high-out stages connected to cell 2[13] in DBLC-1 tree[1.1] and cell 2[19] in DBLC-2 tree[15.1] are given in Fig.9(b), while the examples of precharged high-in/low-out stages are already shown in Fig.8(b). The chains shown in Fig.9(a) are clock-and-data precharged n-chains which can be prolonged; in the extreme case, only one clock precharged n-stage at the beginning of each chain is needed and all the rest can be H/L or L/H stages. However, if a very quick precharge is required, there should be a sufficient number of clock precharged stages besides the first two stages[25.1-2], in this case n-stages[25.3-4], in the chain, and they should have the evaluating transistors, in this case n-transistors[27.3-4], implemented.
Fig.10 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (III). If a precharge phase can be arranged separately, the circuit in Fig.9 gives a really high speed to a multi-level logic computation like that in the DBLC trees. However, very often an adder needs a true one-clock-cycle decision with the precharge phase included. In this case, n-p logic, particularly TSPC n-p logic, is very useful, in which both TSPC n-latch stage[31] and TSPC p-latch stage[33] are necessary. It was found that a clock-and-data precharged n-chain[30.1, 2, 3 or 4] can be terminated by a TSPC n-latch stage[31] without latch problems. It is also feasible for a clock-and-data precharged p-chain[32.1 or 2] to be terminated by a TSPC p-latch stage[33]. A clock-and-data precharged p-chain[32.1 or 2] contains at least one clock precharged p-stage, for example p-stage[34.1 or 2], followed by L/H[29.1 or 2] and H/L[28.1 or 2] stages in an opposite order compared to a clock-and-data precharged n-chain, for example n-chain[30.3 or 4]. Therefore, TSPC n-p logic can be implemented by cascading the n- and p-chains together with TSPC latch stages between them. These are represented by evolutions 6 and 7 in Fig.10(a) and Fig.10(b). Note that different numbers and orders of data precharged stages are required between two clock precharged stages and between a clock precharged stage and a TSPC latch. Also, the n-chains, for example chains[30.1-4], the p-chains, for example chains[32.1-2], and the latched n-p chains, for example the cascade of p-chain[32.1], p-latch[33], n-chain[30.1] and n-latch[31], can be prolonged if the stage numbers and orders are correct.
Fig.11 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (IV), the non-latched TSPC n-p domino circuit technique. In a latched n-p chain, for example the cascade of p-chain[32.1], p-latch[33], n-chain[30.1] and n-latch[31] in Fig.10, the latch stage, for example stage[31] or stage[33], introduces a delay overhead and an inversion. It is possible to combine a TSPC n-latch stage with an H/L stage or a TSPC p-latch stage with an L/H stage to reduce the delay overhead, but at least one latching transistor is involved in each case and an inversion is inevitable. It was found that, except for the latch stages at the circuit output where data need to be stable, all intermediate latch stages, for example stages[33] in Fig.10(b), can be omitted under the conditions described in Fig.11. As long as two or more data precharged stages are cascaded, these conditions are easily fulfilled. This is represented by evolution 8. This evolution comes from the fact that, once the clock precharged n-stage, for example stage[25.1], or p-stage, for example stage[34.3], has already evaluated, a new low input to the n-logic in the n-stage, for example stage[25.1], or a new high input to the p-logic in the p-stage, for example stage[34.3], has no effect on its output.
Fig.12 shows the principle of data precharged asynchronous pipeline circuit technique. It was found that if return-to-zero or return-to-one data are used, the clock can be eliminated and the chain becomes a pure data precharged asynchronous pipeline containing only H/L stages, for example stages[28.1-6], and L/H stages, for example stages[29.1-6]. As long as the data rate is lower than a maximum rate, the precharge wave front (PWF) will not catch up with the evaluation wave front (EWF), so the circuit will be reliable. This technique is not included in the later examples, but it is quite useful for return-to-zero or return-to-one data structures and for cases like asynchronous communications between modules, chips or boards.
Fig.13 shows an embodiment of a complete single-bit-grouped 64-bit DBLC-1 tree[1.4] with the g and p generators included (cell 0 in Fig.13a and Fig.13c). In order to be more general, cells in different levels are given different numbers in Figs.13a-d. Cells with the same number in the same level (column) are identical. Cells with different numbers in different levels may be identical in one circuit technique or different in another circuit technique, as will be seen in Fig.14 and Fig.15. The tree contains cells 0, 0A, 1, 1A, 1B, 2, 2A, 2B, 3, 3A, 3B, 4, 4A, 4B, 5, 5A, 5B, 6A and 6B. Note that, in the 64-bit DBLC-1 tree[1.4], cells are not given further identifying numbers. The tree shows an extremely regular structure. Every cell is loaded by no more than two successive cells. The only drawback is the long wires in the last level (perhaps also in the second last level for a very large adder). Fortunately, these wires are driven either by cell 5B or by cell 5A, and they belong to non-critical delay paths and have relatively large driving capabilities. However, the long wires are the limiting factor of this kind of tree. They appear in Brent and Kung's tree too. In this sense, a reduction of the bit-pitch is useful to reduce the wire lengths and, therefore, the multi-bit grouped alternatives are meaningful. A rearrangement of cell positions is also possible to reduce long wires.
Fig.14 shows a first version of dynamic CMOS circuit embodiments of cells for the 64-bit DBLC-1 adder[37.1] using the tree shown in Figs.13a-d by using clock-and-data precharged n-chains. This solution gives an ultrafast speed to the evaluation phase but needs an additional precharge phase. The precharge phase can be either very short in time if all evaluation transistors in the clock precharged stages are kept, or comparable to the evaluation phase in time if these transistors are removed except the ones in cells 0. In the second case, the evaluation speed is further increased. According to SPICE simulations using the typical parameters of a 1μm CMOS process, the evaluation time for a latched SUM output can be less than 2ns with transistor sizing.
Fig.15 shows a second version of dynamic CMOS circuit embodiments of cells for the 64-bit DBLC-1 adder[37.2] using the tree shown in Fig.13 by the combination of circuit techniques described in Fig.8, Fig.9, Fig.10 and Fig.11. In this solution, no additional precharge phase is required and, therefore, the addition can be finished truly in one clock cycle. According to SPICE simulations using the typical parameters of a 1μm CMOS process, the clock rate can be as high as 500MHz, which means the total evaluation time for a latched SUM output is less than 2ns. In the simulation, the rise and fall slopes of the clock are 0.2ns. From these examples (Fig.14 and Fig.15), the combination of the new adder architecture and the new circuit technique really gives a large speed improvement for a parallel binary adder.
ADVANTAGES
The following advantages are achieved with the invented adder architecture and circuit techniques:
1. Maximum speed
The maximum speed is achieved by the distributed adder architecture since the architecture has a minimum logic depth and a very uniform loading. The maximum speed is also achieved by the clock-and-data precharged dynamic CMOS circuit technique.
2. Very regular layout
The adder architecture has a very regular structure, as can be seen in the principle diagrams, and so does the layout. The layout will be so regular that computer automation can be easily introduced for generating adder layouts with required bit-numbers, pitches and sizes by using just a few standard cells.
3. Uniform loading and driving
While the internal loading is made uniform by the distributed carry tree in the adder, the loading of the inputs of such an adder is also uniform. The output driving capabilities of different bits are also uniform enough. These are good features for the interfaces.
4. Flexibility
First, the adder architecture is suitable for both one-clock-cycle decision and multi-clock-cycle pipelining. Second, the multi-bit grouped DBLC adders are flexible for different speed, area and power compromises. The same tree can be used for a 32-bit single-bit grouped DBLC adder or for a 64-bit two-bit grouped DBLC adder, except that the sum logic needs to be redesigned. Of course, when the same tree is used for a DBLC adder grouped by more bits, the speed will decrease because of the increased loads on the carry outputs.
5. Less power noise and peak current
The clock-and-data precharged dynamic CMOS circuit technique invented not only gives large speed improvement but also gives less power noise and peak currents because of the successive precharge and evaluation of different stages. The clock loads are also less than that in the case of intensive pipelining.
6. Extra output of SUM+1
In the normal case, if a SUM+1 computation is needed, it will take a full computing cycle. The DBLC-2 adder, however, can give outputs of both SUM and SUM+1 simultaneously at the cost of very limited hardware and very little speed degradation.
Additional references to the drawings
FA - Full adder, HA - Half adder, FB - Buffer (For the rest of the cells see Fig. 2)
Example A (no condition)
Example B (no condition)
Example C (no condition)
Example D (condition: when C^1_{v-y}=0, C^0_{v-y} must be 0 and it is true in DBLC-2 tree)
Example A (no condition)
Example B (no condition)
Example C (condition: when C^0_{v-y}=1, C^1_{v-y} must be 1 and it is true in DBLC-2 tree)
Fig 11: 137
dpp - precharge delay of the whole p-domino chain.
dne - evaluation delay of the first stage of the successive n-domino chain.
dnp - precharge delay of the whole n-domino chain.
dpe - evaluation delay of the first stage of the successive p-domino chain.
Conditions: dpp > dne and dnp > dpe.
Fig 12: 201
RZ - return-to-zero data, PWF - precharge wave front, EWF - evaluation wave front
Note: (1) Both return-to-zero and return-to-one data are feasible for inputs and outputs if L/H and H/L stages are rearranged; (2) the data hold time at the output is equal to the passing time between EWF and PWF, depending on data rate and circuit; (3) when PWF catches up with EWF, the circuit reaches its highest data rate, and there is no limit for a lower data rate except leakage current.
Fig 14: 154 Sum unit (not shown in Fig 13)
Fig 15: 168 Sum unit (not shown in Fig 13)
181 (Cell 6, for a larger tree)

Claims

1. An architecture arrangement called Distributed Binary-Lookahead-Carry (DBLC) architecture for a parallel binary adder, c h a r a c t e r i z e d in that it comprises:
a. a DBLC tree (for example DBLC-1 tree[1, 1.1, 1.2, 1.3 or 1.4], DBLC-2 tree[15 or 15.1] shown in Fig.1, Fig.2, Fig.3, Fig.4, Fig.5, Fig.6 or Figs.13a-d) generating carries for every single bit or for every group of bits and comprising computation cells in truly log2n levels, where n is the number of bits, and cell connections in a completely distributed manner, in which every carry bit has its own binary-lookahead-carry tree, the trees being overlapped and sharing only one cell at each position, and each computation cell is loaded by no more than two successive cells in the case of the above example trees;
b. a number of SUM units (for example the combinations of XOR gates[7], inverse XOR gates[10], double XOR gates[20] and/or MUXs[21] shown in Fig.2, Fig.4, Fig.5 and Fig.6 or other combinations) generating SUMs for every single bit or for every grouped bits;
c. a number of generators for generating gi, the generating signals, and pi, the propagating signals, to the DBLC tree, pi=ai+bi and gi=aibi, where ai and bi are two inputs to the adder. In practical embodiments (for example cell 0[8] shown in Fig.2, Fig.4, Fig.5 and Fig.6 and cells 0 and 0A shown in Fig.13a and Fig.13c), inversed pi and gi are generated.
2. A circuit arrangement called clock-and-data precharged dynamic CMOS circuit techniques (I-IV, shown in Fig.8, Fig.9, Fig.10 and Fig.11), c h a r a c t e r i z e d in that it comprises:
a. a number of clock precharged dynamic stages (for example n-stages[25.1-6] and/or p-stages[34.1-4]);
b. a number of data-precharged dynamic H/L (precharged high-in/low-out) stages, for example H/L stages[28.1-6], and/or L/H (precharged low-in/high-out) stages, for example L/H stages[29.1-6];
c. a number of TSPC latch stages (for example n-latch stage[31] and/or p-latch stage[33]) in the latched versions;
they are connected in such a way (numbers and orders of stages indicated in Fig.8, Fig.9, Fig.10 and Fig.11) that the correct logic functions and the initial-low or initial-high output requirements are always maintained.
3. A DBLC-1 tree (for example DBLC-1 tree[1, 1.1, 1.2, 1.3 or 1.4] shown in Fig.1, Fig.2, Fig.5, Fig.6 or Figs.13a-d) according to Claim 1, c h a r a c t e r i z e d in that computation equations for the main cells are g'=g2+p2g1 and p'=p1p2 where g' and p' are its outputs and g1, p1, g2 and p2 are its inputs; part of the cells only contain g'-computation and part of the cells are just buffers; the equations are modified to alternately inverse and non-inverse logic equations in practical embodiments (for example DBLC trees[1.1-4] shown in Fig.2, Fig.5, Fig.6, Fig.7 and Figs.13a-d).
4. A DBLC-2 tree (for example DBLC-2 tree[15 or 15.1] shown in Fig.3 or Fig.4) according to Claim 1, c h a r a c t e r i z e d in that computation equations for the main cells are C^0_{x-y} = C^0_{v-y} + C^1_{v-y} C^0_{x-u} and C^1_{x-y} = C^0_{v-y} + C^1_{v-y} C^1_{x-u} where v=u+1, C^0_{x-y} and C^1_{x-y} are its outputs and C^0_{x-u}, C^1_{x-u}, C^0_{v-y} and C^1_{v-y} are its inputs and C^0_{x-y} or C^1_{x-y} stands for the carry from bit x to bit y assuming the input carry to bit x is zero or one; part of the cells are just buffers; the equations are modified to alternately inverse and non-inverse logic equations in a practical embodiment (for example the DBLC-2 tree[15.1] shown in Fig.4).
5. A DBLC-1 adder (for example the adder[35, 35.1, 35.2, 37.1 or 37.2] shown in Fig.2, Fig.5, Fig.6 or Figs.13a-d plus Fig.14 or 15) according to Claims 1 and 3, c h a r a c t e r i z e d in that it comprises a DBLC-1 tree (for example DBLC-1 tree[1, 1.1, 1.2, 1.3 or 1.4] shown in Fig.1, Fig.2, Fig.5, Fig.6 or Figs.13a-d), SUM units and g and p generators for a parallel n-bit operation; the adder may include or not include a carry input; the adder gives a SUM output (n-bit in parallel) and may also give an overflow carry and can be either single-bit grouped or multi-bit grouped.
6. A DBLC-2 adder (for example the adder[36] shown in Fig.4) according to Claims 1 and 4, c h a r a c t e r i z e d in that it comprises a DBLC-2 tree (for example DBLC-2 tree[15 or 15.1] shown in Fig.3 or Fig.4), SUM units and g and p generators for a parallel n-bit operation; a carry input may be included (not shown in Fig.4); the adder can give both SUM and SUM+1 outputs (each n-bit in parallel) and can be either single-bit grouped or multi-bit grouped.
7. A 64-bit DBLC-1 adder (for example the adder[37.1 or 37.2] shown in Figs.13a-d plus Fig.14 or Fig.15) according to Claims 1, 2, 3 and 5, c h a r a c t e r i z e d in that it comprises a 64-bit DBLC-1 tree (for example the DBLC-1 tree[1.4] shown in Figs.13a-d) and 64 SUM units (the g and p generators have been included in the DBLC-1 tree[1.4]); the circuit embodiments of cells of the 64-bit DBLC-1 adder[37.1 or 37.2] are shown in Fig.14 or Fig.15; the first version (Fig.14) gives an ultrafast speed to the evaluation phase but needs an additional precharge phase and the second version (Fig.15) does not need an additional precharge phase and, therefore, the 64-bit addition can be finished truly in one clock cycle.
8. Any means, architecture or circuit embodiment, characterized in that it comprises the distributed means used in the DBLC architecture, the particular DBLC architecture, the clock-and-data precharged dynamic CMOS circuit technique (I in Fig.8, II in Fig.9, III in Fig.10 or IV in Fig.11) or the data precharged asynchronous pipeline circuit technique (described in Fig.12), as introduced throughout the text and figures, or a mixture of them; the DBLC architecture can be either non-pipelined or pipelined; the circuit embodiment can be a single chip, a part of a microprocessor, a part of a digital signal processing unit or a part of any device.
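
As an illustration of the adder structure named in Claims 5 and 6 (g and p generators, a carry tree, and SUM units that give SUM and SUM+1 in parallel), the following Python sketch models the bit-level logic. It assumes a generic log-depth parallel-prefix carry tree (Kogge-Stone style) as a stand-in; it is not the DBLC tree of the figures, and the names prefix_tree and add_sum_and_sum_plus_one are hypothetical.

```python
def prefix_tree(g, p):
    """Combine per-bit generate/propagate into group signals over bits i..0.

    Assumption: a generic log-depth prefix tree, not the DBLC tree itself.
    """
    G, P = list(g), list(p)
    n = len(G)
    d = 1
    while d < n:
        G_next, P_next = G[:], P[:]
        for i in range(d, n):
            # Group (i..i-d+1) merged with the group ending at bit i-d.
            G_next[i] = G[i] | (P[i] & G[i - d])
            P_next[i] = P[i] & P[i - d]
        G, P = G_next, P_next
        d *= 2
    return G, P


def add_sum_and_sum_plus_one(a, b, n=64):
    """Return (SUM, SUM+1, carry_out, carry_out_plus_one) for n-bit a and b."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    # g and p generators: one generate and one propagate signal per bit.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = prefix_tree(g, p)
    # Carry into each bit for carry-in 0 (SUM) and carry-in 1 (SUM+1).
    c0 = [0] + G[:-1]
    c1 = [1] + [G[i] | P[i] for i in range(n - 1)]
    # SUM units: each sum bit is the propagate bit XOR the incoming carry.
    s0 = sum((p[i] ^ c0[i]) << i for i in range(n))
    s1 = sum((p[i] ^ c1[i]) << i for i in range(n))
    return s0, s1, G[-1], G[-1] | P[-1]
```

In this model both outputs share one carry tree: the SUM+1 carries reuse the group propagate signals, which is one common way to obtain SUM and SUM+1 together. Whether the DBLC-2 tree derives them the same way is not stated in the claims.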
EP94919952A 1993-06-22 1994-06-21 An ultrafast adder arrangement Withdrawn EP0705459A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
SE9302158 1993-06-22
SE9302158A SE9302158L (en) 1993-06-22 1993-06-22 An extremely fast adder device
PCT/SE1994/000614 WO1995000900A1 (en) 1993-06-22 1994-06-21 An ultrafast adder arrangement

Publications (1)

Publication Number Publication Date
EP0705459A1 (en) 1996-04-10

Family

ID=20390377

Family Applications (1)

Application Number Title Priority Date Filing Date
EP94919952A Withdrawn EP0705459A1 (en) 1993-06-22 1994-06-21 An ultrafast adder arrangement

Country Status (4)

Country Link
EP (1) EP0705459A1 (en)
AU (1) AU7089694A (en)
SE (1) SE9302158L (en)
WO (1) WO1995000900A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SG79988A1 (en) * 1998-08-06 2001-04-17 Oki Techno Ct Singapore Pte Apparatus for binary addition
US6329838B1 (en) * 1999-03-09 2001-12-11 Kabushiki Kaisha Toshiba Logic circuits and carry-lookahead circuits
US6314507B1 (en) 1999-11-22 2001-11-06 John Doyle Address generation unit

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US4956802A (en) * 1988-12-14 1990-09-11 Sun Microsystems, Inc. Method and apparatus for a parallel carry generation adder
US5166899A (en) * 1990-07-18 1992-11-24 Hewlett-Packard Company Lookahead adder
JPH056263A (en) * 1991-06-27 1993-01-14 Nec Corp Adder and absolute value calculation circuit using the adder

Non-Patent Citations (1)

Title
See references of WO9500900A1 *

Also Published As

Publication number Publication date
SE9302158L (en) 1994-12-23
SE9302158D0 (en) 1993-06-22
AU7089694A (en) 1995-01-17
WO1995000900A1 (en) 1995-01-05

Similar Documents

Publication Publication Date Title
Ohkubo et al. A 4.4 ns CMOS 54×54-b multiplier using pass-transistor multiplexer
Waters et al. A reduced complexity Wallace multiplier reduction
US5121003A (en) Zero overhead self-timed iterative logic
US4682303A (en) Parallel binary adder
Mehta et al. High-speed multiplier design using multi-input counter and compressor circuits
US4623982A (en) Conditional carry techniques for digital processors
EP0152046A2 (en) Multiplying circuit
US11010133B2 (en) Parallel-prefix adder and method
US4831570A (en) Method of and circuit for generating bit-order modified binary signals
WO2000022504A1 (en) 3x adder
US5499203A (en) Logic elements for interlaced carry/borrow systems having a uniform layout
US7570081B1 (en) Multiple-output static logic
US5987638A (en) Apparatus and method for computing the result of a viterbi equation in a single cycle
Hebbar et al. Design of high speed carry select adder using modified parallel prefix adder
CN100533369C (en) Comparison circuit and method
EP0705459A1 (en) An ultrafast adder arrangement
US4839848A (en) Fast multiplier circuit incorporating parallel arrays of two-bit and three-bit adders
US7325025B2 (en) Look-ahead carry adder circuit
US6183122B1 (en) Multiplier sign extension
US6782406B2 (en) Fast CMOS adder with null-carry look-ahead
Veeramachaneni et al. Efficient design of 32-bit comparator using carry look-ahead logic
US5978826A (en) Adder with even/odd 1-bit adder cells
JPH0450614B2 (en)
Soundharya et al. GDI based area delay power efficient carry select adder
Larsson-Edefors A 965-Mb/s 1.0-μm standard CMOS twin-pipe serial/parallel multiplier

Legal Events

Date Code Title Description
PUAI Public reference made under Article 153(3) EPC to a published international application that has entered the European phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19960110

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE FR GB IT

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 19990105