EP0705459A1

EP0705459A1 - An ultrafast adder arrangement

Info

Publication number: EP0705459A1
Application number: EP94919952A
Authority: EP
Inventors: Jiren Yuan
Original assignee: Universitet I Linkoping
Current assignee: Universitet I Linkoping
Priority date: 1993-06-22
Filing date: 1994-06-21
Publication date: 1996-04-10
Also published as: SE9302158L; SE9302158D0; AU7089694A; WO1995000900A1

Abstract

A binary-lookahead-carry adder can be improved significantly by a new architecture called Distributed Binary-Lookahead-Carry (DBLC) architecture. The new architecture has truly log₂n computation levels and a very regular structure. Both the internal loading and the external loading are truly uniform and each cell is loaded by no more than two successive cells. The architecture can be configured in single-bit or multi-bit groups to trade off speed, area and power, and is suitable both for a one-clock-cycle decision and for multi-clock-cycle pipelining. Two different versions are given, the DBLC-1 adder and the DBLC-2 adder. While the first uses computation cells similar to those in the original binary-lookahead-carry adder, the second uses a new computation algorithm which gives outputs of SUM and SUM+1 simultaneously. The architecture is supported by a new circuit technique called the clock-and-data precharged dynamic CMOS circuit technique, which includes both latched and non-latched versions. The circuit technique aims at a fast one-clock-cycle decision and increases speed by eliminating the delay overheads of domino inverters and pipeline latches. The new adder exhibits a maximum speed, a very regular layout, a truly uniform loading and a high flexibility.

Description

AN ULTRAFAST ADDER ARRANGEMENT

This invention relates to an arrangement of the kind which is apparent from clauses of section CLAIMS. The invention relates particularly to a parallel binary adder having an ultrafast speed and an extremely regular structure.

BACKGROUND OF THE INVENTION

In computational devices, for example computers and digital signal processing elements, a fast parallel binary adder is essential. In many cases, it is important to obtain the sum and the carry within one clock cycle. In the case of a pipeline, the latency of the pipeline is often expected to be as small as possible.

The speed limitation of a parallel binary adder comes from its carry propagation delay. The maximum carry propagation delay is the delay of its overflow carry. In a ripple carry adder, the evaluation time T for the overflow carry is equal to the product of the delay T_i of each single-bit carry evaluation stage and the total bit number n. In order to improve speed, carry lookahead strategies are widely used. Among them, the most important example related to this invention is the binary-lookahead-carry strategy, as described in the article "A regular layout for parallel adders" by Richard P. Brent and H. T. Kung, IEEE Transactions on Computers, vol. C-31, pp. 260-264, March 1982. In this strategy, carries are evaluated by a binary tree. Unfortunately, because a main tree plus a so-called inverse tree is used, as introduced in their paper, the number of levels needed for a complete carry evaluation network is (2log₂n - 1), leading to a computation time of (2log₂n - 1)T_l, where T_l is the delay of each level. In effect, the nonuniformity of the internal loadings in Brent and Kung's tree is made uniform by a pipeline structure (the inverse tree), which creates a large delay overhead. Since then, such a tree has often been combined with other techniques, for example with carry select and the Manchester carry chain, as described in the article "A spanning tree carry lookahead adder" by Thomas Lynch and Earl E. Swartzlander, Jr., IEEE Transactions on Computers, vol. 41, pp. 931-939, August 1992. In their strategy, a 64-bit adder is divided into eight 8-bit adders, seven of which are carry-selected, so the visible levels of the binary carry tree are reduced. The visible levels are further reduced by using sixteen, four and two 4-bit Manchester carry chain modules in the first, the second and the third levels respectively. Finally, seven carries are obtained from the carry tree for the seven 8-bit adders to select their SUMs. In this solution, the true level number is hidden by the 4-bit Manchester module, which is equivalent to two levels in a binary tree. The nonuniformity of the internal loading still exists but is hidden by the high radix; for example, the fan-outs of the four Manchester modules in the second level are 1, 2, 3 and 4 respectively.
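The level counts quoted above can be tabulated with a short sketch. The function name `carry_levels` and the dictionary keys are illustrative only, and n is assumed to be a power of two; the figure for the distributed tree is the log₂n count claimed later in this document.

```python
import math

def carry_levels(n: int) -> dict:
    """Stages or levels on the carry path of an n-bit adder (n a power of two)."""
    log2n = int(math.log2(n))
    return {
        "ripple": n,                  # one single-bit carry stage per bit
        "brent_kung": 2 * log2n - 1,  # main tree plus inverse tree
        "dblc": log2n,                # distributed tree, as claimed in this document
    }

for n in (16, 64):
    print(n, carry_levels(n))
```

Multiplying each count by the delay of one stage or level gives the corresponding carry evaluation time, under the simplifying assumption that all levels have equal delay.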

In CMOS circuit technique, which is widely used, high clock rates have been achieved by true single phase clocking (TSPC), device sizing and extreme pipelining, as described in the article "High speed CMOS circuit technique" by Jiren Yuan and Christer Svensson, IEEE Journal of Solid-State Circuits, vol. 24, pp. 62-70, February 1989. However, high speed in connection with a one-clock-cycle decision for multi-level logic needs a new circuit topology which should, if possible, eliminate the delay overheads caused by the many latches in a pipeline aiming at very high clock rates.

OBJECTS AND SOLUTIONS OF THE INVENTION

An object of the invention is to provide a parallel binary adder architecture which offers a superior speed, a uniform loading, a regular layout and a flexible configuration in the trade-off between speed, power and area compared with existing parallel binary adder architectures. Another object of the invention is to provide an advanced CMOS circuit technique which offers an ultrafast speed, particularly for a one-clock-cycle decision. The combination of the two objects offers a very high performance parallel binary adder.

The first object of the invention is achieved with the invented Distributed-Binary-Lookahead-Carry (DBLC) adder architecture which is an arrangement of the kind set forth in the characterising clause of Claim 1. The second object is achieved by the invented clock-and-data precharged dynamic CMOS circuit technique which is an arrangement of the kind set forth in the characterising clause of Claim 2. Further features and further developments of the invented arrangements are set forth in other characterising clauses of section CLAIMS.

The DBLC trees (DBLC-1 tree and DBLC-2 tree) in the invented DBLC adders (DBLC-1 adder and DBLC-2 adder) can be constructed with truly log₂n levels, approximately half the number in Brent and Kung's tree, and at the same time both the internal and the external loadings are truly uniform. The two trees have very regular structures, as do their layouts, and are flexible enough to be constructed in different radixes to accommodate different speed, power and area compromises.

A DBLC tree means that identical binary-lookahead-carry trees are repeated for every bit, i.e. the MSB binary-lookahead-carry tree is moved and implemented bit by bit towards the LSB. The overlapping parts of these trees share a single cell at each position, the parts of these trees extending beyond the LSB are eliminated, and some of the cells on the LSB side can be simplified. As a result, every cell in the tree is loaded by only two successive cells.

The clock-and-data precharged dynamic CMOS circuit solution is evolved from the existing CMOS domino circuit technique which turns out to be slow in its maximum clock rate and from the existing TSPC circuit technique. By means of removing delay overheads caused by domino inverters and pipeline latches, the clock-and-data precharged dynamic CMOS circuit exhibits a superior speed for a one-clock-cycle decision.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig.1 shows an embodiment of a DBLC-1 tree [1] according to the invention in contrast with Brent and Kung's tree [2] by using 16-bit inputs as examples.

Fig.2 shows an embodiment of a DBLC-1 adder [35] according to the invention by using 16-bit inputs as an example.

Fig.3 shows an embodiment of a DBLC-2 tree [15] according to the invention by using 16-bit inputs as an example.

Fig.4 shows an embodiment of a DBLC-2 adder [36] according to the invention by using 16-bit inputs as an example.

Fig.5 shows an embodiment of a two-bit grouped DBLC adder [35.1] according to the invention by using a 16-bit DBLC-1 adder as an example.

Fig.6 shows an embodiment of a multi-input DBLC adder [35.2] according to the invention by using a DBLC-1 adder with four 16-bit inputs as an example.

Fig.7 shows static CMOS circuit embodiments of cells in DBLC-1 tree [1.1, 1.2 or 1.3] and DBLC-2 tree [15.1] according to the invention.

Fig.8 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (I) according to the invention.

Fig.9 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (II) according to the invention.

Fig.10 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (III) according to the invention.

Fig.11 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (IV) according to the invention.

Fig.12 shows the principle of data precharged asynchronous pipeline circuit technique according to the invention.

Figs.13a-d show an embodiment of a complete single-bit-grouped 64-bit DBLC-1 tree [1.4] according to the invention.

Fig.14 shows the first version embodiments of cells in a 64-bit DBLC-1 adder [37.1] according to the invention.

Fig.15 shows the second version embodiments of cells in a 64-bit DBLC-1 adder [37.2] according to the invention.

DETAILED DESCRIPTION OF THE DRAWINGS

The principle diagram of a DBLC-1 tree [1] is shown in Fig.1(a) in contrast with Brent and Kung's tree [2] shown in Fig.1(b). Both use 16-bit inputs a_i and b_i as examples. In Fig.1, the DBLC-1 tree [1] starts from the generating and propagating signals g_i and p_i, the same as Brent and Kung's tree [2], where g_i=a_ib_i and p_i=a_i+b_i. The computational cells in the DBLC-1 tree [1] are cells 1 [3], 2 [4] and 3 [5]. Cell 1 [3] performs the following computation: g'=g₂+p₂g₁ and p'=p₂p₁, where g' and p' are the two outputs and g₂, p₂, g₁ and p₁ are the four inputs. The g' represents the carry output under the condition of assuming an input carry of zero for the bits involved, while the p' is used for further computations. Cell 2 [4] performs the g'-computation only and contains the g'-part of cell 1 [3]. Cell 3 [5] is just a buffer. Brent and Kung's tree [2] uses cell 1 [3] and cell 4 [6], which contains two buffers and has two inputs and two corresponding outputs. In Fig.1, the cells are similar in both trees. Actually, some of the cells 1 [3] and 4 [6] in Brent and Kung's tree [2] can be replaced by cells 2 [4] and 3 [5]. However, the important point is that the number of levels used by the DBLC-1 tree [1] is truly log₂n, while Brent and Kung's tree [2] needs a total of (2log₂n - 1) levels to generate carries for every individual bit. This creates a large difference in speed.
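The cell equations above can be exercised with a small behavioural sketch. It assumes a wiring rule in which, at level k, each bit position combines with the position 2^k below it and positions that already reach the LSB are simply buffered; this rule is an assumption chosen to match the stated properties (log₂n levels, at most two loads per cell) and is not read off Fig.1. The names `cell1` and `distributed_carries` are illustrative.

```python
def cell1(hi, lo):
    """Cell 1: (g', p') = (g2 + p2*g1, p2*p1) with hi = (g2, p2), lo = (g1, p1)."""
    g2, p2 = hi
    g1, p1 = lo
    return (g2 | (p2 & g1), p2 & p1)

def distributed_carries(a, b, n=16):
    """Return (carries, levels); carries[i] is the carry out of bit i for carry-in 0."""
    # Generating and propagating signals: g_i = a_i AND b_i, p_i = a_i OR b_i.
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) | (b >> i)) & 1) for i in range(n)]
    span, levels = 1, 0
    while span < n:                      # log2(n) iterations for n a power of two
        nxt = []
        for i in range(n):
            if i >= span:
                nxt.append(cell1(gp[i], gp[i - span]))   # assumed partner: 2^k below
            else:
                nxt.append(gp[i])        # buffer: the group already reaches the LSB
        gp = nxt
        span *= 2
        levels += 1
    return [g for g, _ in gp], levels

carries, levels = distributed_carries(0x00FF, 0x0001)
print(levels, carries)
```

In this sketch each output of a level feeds at most two cells in the next level (its own position and the one 2^k above), which is consistent with the loading property described in the text.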

Fig.2 shows an embodiment of a DBLC-1 adder [35] by using a 16-bit adder as an example. It contains a DBLC-1 tree [1.1], 16 cells 0 [8] and 16 SUM units. Each SUM unit contains an XOR gate [10] and an inverse XOR gate [7]. Inverse logic is introduced in the DBLC-1 tree [1.1] to gain speed, so the functions of cells B [9], 1 [11], 1A [12], 2 [13] and 2A [14] in Fig.2 differ from the functions of cells 1 [3], 2 [4] and 3 [5] of the DBLC-1 tree [1] in Fig.1. Cell 0 [8] is used to generate the inverted g_i and p_i. Since the carry outputs from the DBLC-1 tree [1.1] are inverted, the half-SUMs are obtained from inverse XOR gates [7]. Finally, the SUMs are generated from the inverted carries and the inverted half-SUMs through the output XOR gates [10]. There is a carry input, C_in, in this example together with two 16-bit word inputs. For the purpose of dealing with the carry input, the DBLC-1 tree [1.1] has an extra row on the top. The overflow carry, C₁₆, is generated by an extra cell, cell 2A [14], and has approximately the same delay as the SUMs. Since the carry input is optional, it will not appear in later examples. From Fig.2, one can conclude that the adder will have a very regular layout.
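As a behavioural illustration of how the carries and half-SUMs combine, the following sketch adds two n-bit words with an optional carry input. It is a simplification under stated assumptions: the carry input is modelled as one extra position below bit 0 (g = C_in, p = 0), the network is simply widened by that position, and the inverse logic and the exact placement of the extra row and of cell 2A in Fig.2 are not modelled. The function name `dblc1_add` is hypothetical.

```python
def dblc1_add(a, b, c_in=0, n=16):
    """Return (sum mod 2**n, overflow carry) using a distributed g/p network."""
    # Position 0 carries c_in; bit i of the operands sits at position i + 1.
    g = [c_in] + [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [0] + [((a >> i) | (b >> i)) & 1 for i in range(n)]
    span = 1
    while span < n + 1:
        # Both lists are rebuilt from the previous level's values.
        g, p = (
            [g[k] | (p[k] & g[k - span]) if k >= span else g[k] for k in range(n + 1)],
            [p[k] & p[k - span] if k >= span else p[k] for k in range(n + 1)],
        )
        span *= 2
    # g[k] is now the carry into bit k (k = 0..n-1); g[n] is the overflow carry.
    total = 0
    for i in range(n):
        half_sum = ((a >> i) ^ (b >> i)) & 1      # half-SUM of bit i
        total |= (half_sum ^ g[i]) << i           # output XOR with the carry
    return total, g[n]

print(dblc1_add(0xFFFF, 0x0001))
print(dblc1_add(1234, 4321, c_in=1))
```

The sketch works on true (non-inverted) signals; an implementation following the text would alternate inverting stages and use inverse XOR gates for the half-SUMs.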

Fig.3 shows an embodiment of a DBLC-2 tree [15] by using 16-bit inputs as an example. A DBLC-2 tree [15] also starts from the generating and propagating signals g_i and p_i. The computational cells in the DBLC-2 tree [15] are cells 1 [16] and 2 [6]. The C⁰_x-y is the same as g' in the DBLC-1 tree [1], representing the carry output from bit x to bit y under the condition of assuming an input carry of zero to bit x. However, the C¹_x-y has a clearer meaning than that of p' in the DBLC-1 tree [1]. C¹_x-y represents the carry output from bit x to bit y under the condition of assuming an input carry of one to bit x. From the outputs of the DBLC-2 tree [15], both assuming-zero carries and assuming-one carries are available. The computations of C⁰_x-y and C¹_x-y are similar, i.e. C⁰_x-y=C⁰_v-y+C¹_v-yC⁰_x-u and C¹_x-y=C⁰_v-y+C¹_v-yC¹_x-u. Both have a common term C⁰_v-y. If C⁰_v-y=1, C⁰_x-y and C¹_x-y must be 1. The second terms are selecting terms, in which the assuming-zero and the assuming-one carries from bit x to bit u, C⁰_x-u and C¹_x-u, are selected by C¹_v-y. In this sense, it is a carry-selected-by-carry algorithm. It was found that this algorithm is beneficial not only in making two carry sets available but also in simplifying the circuits, which will be described together with Fig.8 and Fig.9.
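The carry-selected-by-carry equations can be checked exhaustively against the g/p form used in the DBLC-1 tree. In the sketch below, the starting values C⁰ = g and C¹ = g + p for a group are an assumption consistent with the definitions above (carry out for an input carry of zero and of one); the function names are illustrative.

```python
from itertools import product

def cell_dblc2(hi, lo):
    """hi = (C0_v-y, C1_v-y), lo = (C0_x-u, C1_x-u) -> (C0_x-y, C1_x-y)."""
    c0_hi, c1_hi = hi
    c0_lo, c1_lo = lo
    return (c0_hi | (c1_hi & c0_lo),    # C0_x-y = C0_v-y + C1_v-y * C0_x-u
            c0_hi | (c1_hi & c1_lo))    # C1_x-y = C0_v-y + C1_v-y * C1_x-u

def check_against_gp():
    """Compare with g' = g2 + p2*g1 for every combination of g and p inputs."""
    for g2, p2, g1, p1 in product((0, 1), repeat=4):
        # Assumed start values for a group: C0 = g, C1 = g OR p.
        c0, c1 = cell_dblc2((g2, g2 | p2), (g1, g1 | p1))
        assert c0 == g2 | (p2 & g1)           # carry out with input carry 0
        assert c1 == g2 | (p2 & (g1 | p1))    # carry out with input carry 1
    return True

print(check_against_gp())
```

The check illustrates that C⁰ reproduces g' of the DBLC-1 cell, while C¹ is directly usable as the carry for an input carry of one.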

Fig.4 shows an embodiment of a DBLC-2 adder [36] by using 16-bit inputs as an example. This diagram is similar to Fig.2. It contains a DBLC-2 tree [15.1], 16 cells 0 [8] and 16 SUM units. Each SUM unit contains an XOR gate [10] and an inverse XOR gate [7]. Inverse logic is also used in the DBLC-2 tree [15.1], so the functions of cells B [17], 1 [18] and 2 [19] in Fig.4 differ from the functions of cells 1 [16] and 2 [6] of the DBLC-2 tree [15] in Fig.3. Since both assuming-zero and assuming-one carries are available, SUM and SUM+1 can be obtained simultaneously. Note that this is obtained at the cost of very limited hardware. As the SUM channel has a large speed margin, the extra XOR gate for S'=(SUM+1) does not ask for more driving capability from the previous inverse XOR gate [7]. The assuming-one carries are obtained from the spare outputs of the output cells B [17] and 2 [19]. Internally, cells in the DBLC-2 tree [15.1] are still loaded by no more than two successive cells. There will be no obvious speed degradation, but the last level of the DBLC-2 tree [15.1] has double the number of wires.
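A behavioural sketch of the simultaneous SUM and SUM+1 outputs follows. It assumes the same distance-2^k wiring as a generic distributed prefix network and works on true (non-inverted) signals, so it does not reproduce the cells of Fig.4; the name `dblc2_sum_and_sum_plus_one` is hypothetical.

```python
def dblc2_sum_and_sum_plus_one(a, b, n=16):
    """Return (SUM, SUM+1), both modulo 2**n, from one pass over the carry network."""
    # Per-bit start values: C0 = a AND b (carry out if the carry in is 0),
    #                       C1 = a OR b  (carry out if the carry in is 1).
    c0 = [(a >> i) & (b >> i) & 1 for i in range(n)]
    c1 = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    span = 1
    while span < n:
        c0, c1 = (
            [c0[i] | (c1[i] & c0[i - span]) if i >= span else c0[i] for i in range(n)],
            [c0[i] | (c1[i] & c1[i - span]) if i >= span else c1[i] for i in range(n)],
        )
        span *= 2
    s = s1 = 0
    for i in range(n):
        half = ((a >> i) ^ (b >> i)) & 1
        cin0 = c0[i - 1] if i else 0    # carry into bit i for SUM
        cin1 = c1[i - 1] if i else 1    # carry into bit i for SUM+1
        s |= (half ^ cin0) << i
        s1 |= (half ^ cin1) << i        # the extra XOR gate per bit
    return s, s1

print(dblc2_sum_and_sum_plus_one(5, 9))
```

Only the final XOR is duplicated per bit; both carry sets come out of the same network, which is the point made in the paragraph above.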

Fig.5 shows an embodiment of a two-bit grouped DBLC adder [35.1] by using a 16-bit DBLC-1 adder as an example. Higher radixes can easily be introduced into DBLC trees, which shows their flexibility. One can see cells 1 [11] in the two-bit grouped DBLC-1 tree [1.2] as cells 0 [8] in Fig.2, and the rest of the cells form an 8-bit distributed tree with only three levels. The output carries from the tree will be C₂, C₄, C₆ etc. These carries are used to select the SUMs of bits 2-3, 4-5, 6-7 etc. A double XOR gate [20] gives both the assuming-zero-carry SUM and the assuming-one-carry SUM of each bit to the multiplexer (MUX) [21]. Note that, in this case, every cell in the adder is still loaded by no more than two successive cells. By using higher radixes, the layout pitch can be reduced and the wires in all levels, particularly in the last level, will be fewer and shorter. The total number of levels, counted from cell 0 [8] in Fig.5 to the carry output, is still the same, since it depends on the total bit-number and not on the radix. Each carry output is loaded by two MUXs [21], twice as many as in the single-bit grouped case. For the same delay, it obviously needs twice the transistor sizes in the tree. However, the multi-bit grouped solution gives flexibility in compromising between power, area and speed.
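The two-bit grouping can be sketched behaviourally as follows. The first-level cell combines the two bits of each group, a distributed network over the eight groups produces C₂, C₄, C₆ etc., and each carry selects between two precomputed group SUMs. The wiring rule (partner group at distance 2^k) and the use of integer arithmetic in place of the double XOR gate are assumptions made for brevity; `two_bit_grouped_add` is a hypothetical name.

```python
def two_bit_grouped_add(a, b, n=16):
    """Return (sum mod 2**n, number of tree levels above the first-level cells)."""
    groups = n // 2
    G, P, sum0, sum1 = [], [], [], []
    for k in range(groups):
        a0, a1 = (a >> (2 * k)) & 1, (a >> (2 * k + 1)) & 1
        b0, b1 = (b >> (2 * k)) & 1, (b >> (2 * k + 1)) & 1
        g0, p0, g1, p1 = a0 & b0, a0 | b0, a1 & b1, a1 | b1
        G.append(g1 | (p1 & g0))        # first-level cell combines the two bits
        P.append(p1 & p0)
        x, y = a0 | (a1 << 1), b0 | (b1 << 1)
        sum0.append((x + y) & 3)        # assuming-zero-carry SUM of the group
        sum1.append((x + y + 1) & 3)    # assuming-one-carry SUM of the group
    levels, span = 0, 1
    while span < groups:                # distributed tree over the groups
        G, P = (
            [G[k] | (P[k] & G[k - span]) if k >= span else G[k] for k in range(groups)],
            [P[k] & P[k - span] if k >= span else P[k] for k in range(groups)],
        )
        span *= 2
        levels += 1
    total = 0
    for k in range(groups):
        carry_in = G[k - 1] if k else 0                        # C2, C4, C6, ...
        total |= (sum1[k] if carry_in else sum0[k]) << (2 * k)  # MUX selection
    return total, levels

print(two_bit_grouped_add(0x00FF, 0x0001))
```

For 16 bits the group tree has three levels, which together with the first-level cell gives the same four levels as the single-bit grouped case.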

Fig.6 shows an embodiment of a multi-input DBLC adder by using a DBLC-1 adder [35.2] with four 16-bit inputs as an example. It indicates that the three addition operations needed for adding four words are simplified to one operation by using two extra levels containing full adders [23], half-adders [24] and buffers [5]. The total addition time is considerably reduced. This structure is very useful for making a fast parallel multiplier in which the inputs will be Y₁X_1-n, Y₂X_1-n, Y₃X_1-n etc., and each input is shifted successively by one bit. Note that if each input has 16 bits, the DBLC-1 tree [1.3] needs only 16 bits since the LSBs are generated directly.
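The reduction of four operands to a single carry-propagating addition can be illustrated with two levels of bitwise full adders (a carry-save arrangement). This is a generic sketch of the idea, assuming unshifted operands; the exact mix of full adders, half-adders and buffers in Fig.6 is not reproduced, and the function names are illustrative.

```python
def carry_save(x, y, z):
    """Bitwise full adders: three words in, a sum word and a carry word out."""
    s = x ^ y ^ z                                   # per-bit sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1          # per-bit carry, weighted one bit up
    return s, c

def add_four(w0, w1, w2, w3):
    """Add four words with two carry-save levels and one final addition."""
    s, c = carry_save(w0, w1, w2)    # first extra level
    s, c = carry_save(s, c, w3)      # second extra level
    return s + c                     # the single carry-propagating addition

print(add_four(1, 2, 3, 4))
```

In the adder described above, the final `s + c` step would be performed by the DBLC-1 tree, so only one carry evaluation is needed for the four inputs.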

Fig.7 shows static CMOS circuit embodiments of cells 1 [11], 1A [12], 2 [13] and 2A [14] in DBLC-1 trees [1.1, 1.2 and 1.3] and cells 1 [18] and 2 [19] in DBLC-2 tree [15.1]. Cells in Fig.2, Fig.4, Fig.5 and Fig.6 are inverse-logic cells. They can be realized by single CMOS stages. Cell 1 [11] (cell 1A [12]) and cell 2 [13] (cell 2A [14]) are complementary. Cells 1A [12] and 2A [14] contain only the g' (1A) or the inverted g' (2A) logic. Cell 1 [18] and cell 2 [19] are also complementary. These circuits are the fundamental circuits of DBLC trees. The dynamic circuits which will be described later are derived from them. Even with the static circuit embodiments, both DBLC-1 and DBLC-2 adders are faster than or comparable to any published fast parallel adders according to SPICE simulations.

Fig.8 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (I). It is well known that dynamic circuits are faster than static circuits due to their smaller fan-in and fan-out loads. For a fast one-clock-cycle decision, it is natural to choose a domino logic, particularly an n-domino logic, which is faster than a p-domino logic. In a normal domino chain, inverters are used between two precharged stages, which prevents charge loss but creates delay overheads. It will be shown that new circuit techniques are created by a series of evolutions. The first evolution is that by replacing these inverters between clock precharged stages, in this case n-stages [25.1-6], with static logic stages [26.1-6] as shown in Fig.8(a), a domino chain can be made more efficient and faster. Note that the initial-low-output functions (or the initial-high-output functions in a p-domino chain) are still kept by these static logic stages [26.1-6]. The second evolution is that, except in the first clock precharged stages, in this case n-stages [25.1-2], the evaluating transistors, in this case n-transistors [27.3-6], of successive clock precharged stages, in this case n-stages [25.3-6], can be omitted, so the evaluating speed is further increased. The third evolution is that all these static stages [26.1-6] can be conditionally modified into dynamic stages to reduce their fan-in and fan-out loads. The modified circuit examples A, B, C and D connected to cell 1 [11] in DBLC-1 tree [1.1] and cell 1 [18] in DBLC-2 tree [15.1] are given in Fig.8(b). These dynamic stages are precharged by their data inputs and are therefore named data precharged stages; accordingly, the technique is named the clock-and-data precharged dynamic CMOS circuit technique. Note that the example chains in Fig.8(a) are clock-and-data precharged n-chains which can be prolonged, and the circuit examples given in Fig.8(b) are free of charge sharing.

Fig.9 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (II). The fourth evolution is that not only a single static stage but also multiple static stages (an odd number) can be cascaded in a domino chain to reduce the number of clock precharged stages, since clock precharged stages in a domino chain have charge-sharing problems which demand that all nodes in the stage be precharged. The fifth evolution is that all these static stages can be replaced by dynamic circuits, and they become alternately data precharged high-in/low-out and data precharged low-in/high-out stages, indicated by H/L [28.1-6] and L/H [29.1-6] stages in Fig.9, Fig.10, Fig.11 and Fig.12. Note the number and order required in different chains. The modified circuit examples A, B, C and D for precharged low-in/high-out stages connected to cell 2 [13] in DBLC-1 tree [1.1] and cell 2 [19] in DBLC-2 tree [15.1] are given in Fig.9(b), while the examples of precharged high-in/low-out stages are already shown in Fig.8(b). The chains shown in Fig.9(a) are clock-and-data precharged n-chains which can be prolonged; in the extreme case, only one clock precharged n-stage at the beginning of each chain is needed and all the rest can be H/L or L/H stages. However, if a very quick precharge is required, there should be a sufficient number of clock precharged stages besides the first two stages [25.1-2], in this case n-stages [25.3-4], in the chain, and they should have the evaluating transistors, in this case n-transistors [27.3-4], implemented.

Fig.10 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (III). If a precharge phase can be arranged separately, the circuit in Fig.9 gives a really high speed to a multi-level logic computation like that in the DBLC trees. However, very often an adder needs a true one-clock-cycle decision with the precharge phase included. In this case, n-p logic, particularly TSPC n-p logic, is very useful, in which both a TSPC n-latch stage [31] and a TSPC p-latch stage [33] are necessary. It was found that a clock-and-data precharged n-chain [30.1, 2, 3 or 4] can be terminated by a TSPC n-latch stage [31] without latch problems. It is also feasible for a clock-and-data precharged p-chain [32.1 or 2] to be terminated by a TSPC p-latch stage [33]. A clock-and-data precharged p-chain [32.1 or 2] contains at least one clock precharged p-stage, for example p-stage [34.1 or 2], followed by L/H [29.1 or 2] and H/L [28.1 or 2] stages in an opposite order compared to a clock-and-data precharged n-chain, for example n-chain [30.3 or 4]. Therefore, TSPC n-p logic can be implemented by cascading the n- and p-chains together with TSPC latch stages between them. These are represented by evolutions 6 and 7 in Fig.10(a) and Fig.10(b). Note that different numbers and orders of data precharged stages are required between two clock precharged stages and between a clock precharged stage and a TSPC latch. Also, the n-chains, for example chains [30.1-4], the p-chains, for example chains [32.1-2], and the latched n-p chains, for example the cascade of p-chain [32.1], p-latch [33], n-chain [30.1] and n-latch [31], can be prolonged if the stage numbers and orders are correct.

Fig.11 shows the principle of clock-and-data precharged dynamic CMOS circuit technique (IV), the non-latched TSPC n-p domino circuit technique. In a latched n-p chain, for example the cascade of p-chain [32.1], p-latch [33], n-chain [30.1] and n-latch [31] in Fig.10, the latch stage, for example stage [31] or stage [33], introduces a delay overhead and an inversion. It is possible to combine a TSPC n-latch stage with an H/L stage or a TSPC p-latch stage with an L/H stage to reduce the delay overhead, but at least one latching transistor is involved in each case and an inversion is inevitable. It was found that, except for the latch stages at the circuit output where data need to be stable, all intermediate latch stages, for example stages [33] in Fig.10(b), can be omitted under the conditions described in Fig.11. As long as two or more data precharged stages are cascaded, these conditions are easily fulfilled. This is represented by evolution 8. This evolution comes from the fact that, once the clock-precharged n-stage, for example stage [25.1], or p-stage, for example stage [34.3], has evaluated, a new low input to the n-logic in the n-stage, for example stage [25.1], or a new high input to the p-logic in the p-stage, for example stage [34.3], has no effect on its output.

Fig.12 shows the principle of data precharged asynchronous pipeline circuit technique. It was found that if return-to-zero or return-to-one data are used, the clock can be eliminated and the chain becomes a pure data precharged asynchronous pipeline containing only H/L stages, for example stages [28.1-6], and L/H stages, for example stages [29.1-6]. As long as the data rate is lower than a maximum rate, the precharge wave front (PWF) will not catch up with the evaluation wave front (EWF), so the circuit will be reliable. This technique is not included in the later examples, but it is quite useful for return-to-zero or return-to-one data structures and for cases like asynchronous communications between modules, chips or boards.

Fig.13 shows an embodiment of a complete single-bit-grouped 64-bit DBLC-1 tree [1.4] with the g and p generators included (cell 0 in Fig.13a and Fig.13c). In order to be more general, cells in different levels are given different numbers in Figs.13a-d. Cells with the same number in the same level (column) are identical. Cells with different numbers in different levels may be identical in one circuit technique or different in another circuit technique, as will be seen in Fig.14 and Fig.15. The tree contains cells 0, 0A, 1, 1A, 1B, 2, 2A, 2B, 3, 3A, 3B, 4, 4A, 4B, 5, 5A, 5B, 6A and 6B. Note that, in the 64-bit DBLC-1 tree [1.4], cells are not given further identifying numbers. The tree shows an extremely regular structure. Every cell is loaded by no more than two successive cells. The only drawback is the long wires in the last level (perhaps also in the second last level for a very large adder). Fortunately, these wires are driven either by cell 5B or by cell 5A; they belong to non-critical delay paths and their drivers have relatively large driving capabilities. However, the long wires are the limiting factor of this kind of tree. They appear in Brent and Kung's tree too. In this sense, a reduction of the bit-pitch is useful to reduce the wire lengths and, therefore, the multi-bit grouped alternatives are meaningful. A rearrangement of cell positions is also possible to reduce long wires.

Fig.14 shows a first version of dynamic CMOS circuit embodiments of cells for the 64-bit DBLC-1 adder [37.1] using the tree shown in Figs.13a-d by using clock-and-data precharged n-chains. This solution gives an ultrafast speed to the evaluation phase but needs an additional precharge phase. The precharge phase can be either very short in time, if all evaluation transistors in the clock precharged stages are kept, or comparable to the evaluation phase in time, if these transistors are removed except the ones in cells 0. In the second case, the evaluation speed is further increased. According to SPICE simulations using the typical parameters of a 1 μm CMOS process, the evaluation time for a latched SUM output can be less than 2 ns with transistor sizing.

Fig.15 shows a second version of dynamic CMOS circuit embodiments of cells for the 64-bit DBLC-1 adder [37.2] using the tree shown in Fig.13 by the combination of circuit techniques described in Fig.8, Fig.9, Fig.10 and Fig.11. In this solution, no additional precharge phase is required and, therefore, the addition can be finished truly in one clock cycle. According to SPICE simulations using the typical parameters of a 1 μm CMOS process, the clock rate can be as high as 500 MHz, which means the total evaluation time for a latched SUM output is less than 2 ns. In the simulation, the rise and fall slopes of the clock are 0.2 ns. From these examples (Fig.14 and Fig.15), the combination of the new adder architecture and the new circuit technique really gives a large speed improvement for a parallel binary adder.

ADVANTAGES

The following advantages are achieved with the invented adder architecture and circuit techniques:

1. Maximum speed

The maximum speed is achieved by the distributed adder architecture since the architecture has a minimum logic depth and a very uniform loading. The maximum speed is also achieved by the clock-and-data precharged dynamic CMOS circuit technique.

2. Very regular layout

The adder architecture has a very regular structure, as can be seen in the principle diagrams, and so has the layout. The layout will be so regular that computer automation can easily be introduced for generating adder layouts with the required bit-numbers, pitches and sizes by using just a few standard cells.

3. Uniform loading and driving

While the internal loading is made uniform by the distributed carry tree in the adder, the input loading of such an adder is also uniform. The output driving capabilities of different bits are also uniform enough. These are good features for the interfaces.

4. Flexibility

First, the adder architecture is suitable for both one-clock-cycle decision and multi-clock-cycle pipelining. Second, the multi-bit grouped DBLC adders are flexible for different speed, area and power compromises. The same tree can be used for a 32-bit single-bit grouped DBLC adder or for a 64-bit two-bit grouped DBLC adder, except that the sum logic needs to be redesigned. Of course, when the same tree is used for a DBLC adder with more bits per group, the speed will decrease because of the increased loads on the carry outputs.

5. Less power noise and peak current

The clock-and-data precharged dynamic CMOS circuit technique invented not only gives large speed improvement but also gives less power noise and peak currents because of the successive precharge and evaluation of different stages. The clock loads are also less than that in the case of intensive pipelining.

6. Extra output of SUM+1

Normally, if a SUM+1 computation is needed, it will take a full computing cycle. The DBLC-2 adder, however, can give outputs of both SUM and SUM+1 simultaneously at the cost of very limited hardware and very little speed degradation.

Additional references to the drawings

FA - Full adder, HA - Half adder, FB - Buffer (For the rest of the cells see Fig. 2)

Example A (no condition)

Example B (no condition)

Example C (no condition)

Example D (condition: when C¹_v-y = 0, C⁰_v-y must be 0, which is true in the DBLC-2 tree)

Example A (no condition)

Example B (no condition)

Example C (condition: when C⁰_v-y = 1, C¹_v-y must be 1, which is true in the DBLC-2 tree)

Fig 11: 137

d_pp - precharge delay of the whole p-domino chain

d_ne - evaluation delay of the first stage of the successive n-domino chain

d_np - precharge delay of the whole n-domino chain

d_pe - evaluation delay of the first stage of the successive p-domino chain

Conditions: d_pp > d_ne and d_np > d_pe.

Fig 12: 201

RZ - return-to-zero data, PWF - precharge wave front, EWF - evaluation wave front

Note: (1) Both return-to-zero and return-to-one data are feasible for inputs and outputs if L/H and H/L stages are rearranged; (2) the data hold time at the output is equal to the passing time between EWF and PWF, depending on data rate and circuit; (3) when PWF catches up with EWF, the circuit reaches its highest data rate, and there is no limit for a lower data rate except leakage current.


Fig 14: 154 Sum unit (not shown in Fig 13)

Fig 15: 168 Sum unit (not shown in Fig 13)

181 (Cell 6, for a larger tree)


Claims

1. An architecture arrangement called Distributed BinaryLookahead-Carry (DBLC) architecture for a parallel binary adder, c h a r a c t e r i z e d in that it comprises:

a. a DBLC tree (for example DBLC-1 tree^[1 _, ^1.1 _, ^1.2 _, ^{1.3 or 1.4]}, DBLC-2 tree^{[15 or 15.1]} shown in Fig.1, Fig.2, Fig.3, Fig.4, Fig.5, Fig.6 or Figs.13a-d) generating carries for every single bit or for every grouped bits and comprising computation cells in truly log₂n levels, where n is the number of bits, and cell connections in a completely distributed manner, in which every carry bits have their own binary-lookahead-carry trees which are overlapped and share only one cell at one position and each computation cell is loaded by no more than two successive cells in the case of above example trees;

b. a number of SUM units (for example the combinations of XOR gates^[7], inverse XOR gates^[10], double XOR gates^[20] and/or MUXs^[21] shown in Fig.2, Fig.4, Fig.5 and Fig.6 or other combinations) generating SUMs for every single bit or for every grouped bits;

c. a number of generators for generating g_i, the generating signals, and p_i, the propagating signals, for the DBLC tree, where p_i=a_i+b_i and g_i=a_ib_i, and a_i and b_i are the two inputs to the adder. In practical embodiments (for example cell 0^[8] shown in Fig.2, Fig.4, Fig.5 and Fig.6 and cells 0 and 0A shown in Fig.13a and Fig.13c), inverted p_i and g_i are generated.
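As an illustration of item c, the following is a minimal behavioural sketch of the g and p generators, assuming unsigned integer operands and lists ordered from the least significant bit. The names generate_pg and inverted are hypothetical and do not appear in the specification; the sketch models logic values only, not the circuit of cell 0.

```python
def generate_pg(a, b, n):
    """Return (p, g) as lists of n bits, least significant bit first.

    Hypothetical helper: models p_i = a_i + b_i (logical OR) and
    g_i = a_i b_i (logical AND) for two n-bit unsigned operands.
    """
    p, g = [], []
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        p.append(ai | bi)  # propagating signal p_i
        g.append(ai & bi)  # generating signal g_i
    return p, g


def inverted(bits):
    """Return the inverted signals, as generated in the practical embodiments."""
    return [1 - x for x in bits]
```

For instance, generate_pg(0b1100, 0b1010, 4) yields p = [0, 1, 1, 1] and g = [0, 0, 0, 1], reading from bit 0 upwards.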

2. A circuit arrangement called clock-and-data precharged dynamic CMOS circuit techniques (I-IV, shown in Fig.8, Fig.9, Fig.10 and Fig.11), characterized in that it comprises:

a. a number of clock precharged dynamic stages (for example n-stages^[25.1-6] and/or p-stages^[34.1-4]);

b. a number of data-precharged dynamic H/L (precharged high-in/low-out) stages, for example H/L stages^[28.1-6], and/or L/H (precharged low-in/high-out) stages, for example L/H stages^[29.1-6];

c. a number of TSPC latch stages (for example n-latch stage^[31] and/or p-latch stage^[33]) in the latched versions;

the stages are connected in such a way (the numbers and orders of the stages are indicated in Fig.8, Fig.9, Fig.10 and Fig.11) that the correct logic functions and the initial-low or initial-high output requirements are always maintained.

3. A DBLC-1 tree (for example DBLC-1 tree^[1, 1.1, 1.2, 1.3 or 1.4] shown in Fig.1, Fig.2, Fig.5, Fig.6 or Figs.13a-d) according to Claim 1, characterized in that the computation equations for the main cells are g'=g₂+p₂g₁ and p'=p₁p₂, where g' and p' are the cell outputs and g₁, p₁, g₂ and p₂ are the cell inputs; some of the cells contain only the g'-computation and some of the cells are just buffers; the equations are modified to alternately inverse and non-inverse logic equations in practical embodiments (for example DBLC trees^[1.1-4] shown in Fig.2, Fig.5, Fig.6, Fig.7 and Figs.13a-d).
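The main-cell equations of the DBLC-1 tree can be illustrated with a short behavioural sketch. It assumes that the span covered by each cell doubles from level to level, which is one wiring consistent with log₂n levels but is an assumption of the sketch, not a transcription of the figures. The names dblc1_cell and dblc1_carries are hypothetical, non-inverted logic is used throughout, and no carry input is modelled.

```python
def dblc1_cell(g1, p1, g2, p2):
    """Main cell: g' = g2 + p2 g1 and p' = p1 p2.

    (g1, p1) comes from the less significant span and (g2, p2)
    from the more significant span.
    """
    return g2 | (p2 & g1), p1 & p2


def dblc1_carries(a, b, n):
    """Return (carries, levels) for two n-bit unsigned operands.

    carries[i] is the carry out of bit i with no carry input.
    Assumed wiring: at each level, position i is combined with
    position i - span, and span doubles after every level.
    """
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]    # g_i = a_i b_i
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]    # p_i = a_i + b_i
    span, levels = 1, 0
    while span < n:
        new_g, new_p = g[:], p[:]  # positions below span act as buffers
        for i in range(span, n):
            new_g[i], new_p[i] = dblc1_cell(g[i - span], p[i - span], g[i], p[i])
        g, p = new_g, new_p
        span *= 2
        levels += 1
    return g, levels
```

With n = 8 the loop runs three times, matching log₂8 levels. Positions whose lower span already reaches bit 0 would need only the g'-computation in hardware; the sketch computes p' everywhere for simplicity.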

4. A DBLC-2 tree (for example DBLC-2 tree^[15 or 15.1] shown in Fig.3 or Fig.4) according to Claim 1, characterized in that the computation equations for the main cells are C⁰_{x-y}=C⁰_{v-y}+C¹_{v-y}C⁰_{x-u} and C¹_{x-y}=C⁰_{v-y}+C¹_{v-y}C¹_{x-u}, where v=u+1, C⁰_{x-y} and C¹_{x-y} are the cell outputs and C⁰_{x-u}, C¹_{x-u}, C⁰_{v-y} and C¹_{v-y} are the cell inputs, and C⁰_{x-y} or C¹_{x-y} stands for the carry from bit x to bit y assuming the input carry to bit x is zero or one, respectively; some of the cells are just buffers; the equations are modified to alternately inverse and non-inverse logic equations in a practical embodiment (for example the DBLC-2 tree^[15.1] shown in Fig.4).
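The DBLC-2 cell can be checked with a small behavioural sketch. It takes the C¹ output to use C¹ of the lower span (bits x to u) as its low-order term, the reading under which all four listed inputs are used; this is an assumption of the sketch. It further assumes that a single bit i contributes C⁰ = a_i b_i and C¹ = a_i + b_i. The names dblc2_cell and span_carries are hypothetical, and span_carries folds the bits serially only to exercise the cell, without modelling the tree wiring.

```python
def dblc2_cell(c0_low, c1_low, c0_high, c1_high):
    """Combine the span x..u (low) with the span v..y (high), v = u + 1.

    Returns (C0, C1) for the span x..y: the carry out of bit y when
    the carry into bit x is zero or one, respectively.
    """
    c0 = c0_high | (c1_high & c0_low)
    c1 = c0_high | (c1_high & c1_low)  # assumed: low-order term is C1 of x..u
    return c0, c1


def span_carries(a, b, x, y):
    """Return (C0, C1) for bits x..y of a + b by serial folding."""
    c0 = ((a >> x) & (b >> x)) & 1  # single bit: C0 = a_i b_i
    c1 = ((a >> x) | (b >> x)) & 1  # single bit: C1 = a_i + b_i
    for i in range(x + 1, y + 1):
        gi = ((a >> i) & (b >> i)) & 1
        pi = ((a >> i) | (b >> i)) & 1
        c0, c1 = dblc2_cell(c0, c1, gi, pi)
    return c0, c1
```

Because the cell operation is associative, combining the results for bits 0..1 and bits 2..3 gives the same pair as folding bits 0..3 directly, which is what allows the cells to be arranged in a tree. The outputs also satisfy the condition that C¹ is 1 whenever C⁰ is 1.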

5. A DBLC-1 adder (for example the adder^[35, 35.1, 35.2, 37.1 or 37.2] shown in Fig.2, Fig.5, Fig.6 or Figs.13a-d plus Fig.14 or Fig.15) according to Claims 1 and 3, characterized in that it comprises a DBLC-1 tree (for example DBLC-1 tree^[1, 1.1, 1.2, 1.3 or 1.4] shown in Fig.1, Fig.2, Fig.5, Fig.6 or Figs.13a-d), SUM units and g and p generators for a parallel n-bit operation; the adder may or may not include a carry input; the adder gives a SUM output (n bits in parallel), may also give an overflow carry, and can be either single-bit grouped or multi-bit grouped.

6. A DBLC-2 adder (for example the adder^[36] shown in Fig.4) according to Claims 1 and 4, characterized in that it comprises a DBLC-2 tree (for example DBLC-2 tree^[15 or 15.1] shown in Fig.3 or Fig.4), SUM units and g and p generators for a parallel n-bit operation; a carry input may be included (not shown in Fig.4); the adder can give both SUM and SUM+1 outputs (each n bits in parallel) and can be either single-bit grouped or multi-bit grouped.
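How a DBLC-2 adder can deliver SUM and SUM+1 together may be sketched behaviourally as follows. The sketch assumes a doubling-span wiring of the tree, single-bit grouping, no external carry input, and a SUM unit that XORs a_i, b_i and the carry into bit i; these choices and the name dblc2_add are assumptions made for illustration and are not taken from Fig.4.

```python
def dblc2_add(a, b, n):
    """Return (SUM, SUM+1), both modulo 2**n, for n-bit unsigned operands."""
    # Level 0: per-bit carries for an input carry of zero (c0) and of one (c1).
    c0 = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    c1 = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    span = 1
    while span < n:
        n0, n1 = c0[:], c1[:]  # positions below span act as buffers
        for i in range(span, n):
            n0[i] = c0[i] | (c1[i] & c0[i - span])
            n1[i] = c0[i] | (c1[i] & c1[i - span])
        c0, c1 = n0, n1
        span *= 2
    # After the last level, c0[i] and c1[i] are the carries out of bit i
    # for bits 0..i with an input carry of zero and of one, respectively.
    s = s_plus_1 = 0
    for i in range(n):
        x = ((a >> i) ^ (b >> i)) & 1
        cin0 = c0[i - 1] if i else 0  # carry into bit i for SUM
        cin1 = c1[i - 1] if i else 1  # carry into bit i for SUM+1
        s |= (x ^ cin0) << i
        s_plus_1 |= (x ^ cin1) << i
    return s, s_plus_1
```

Both results come from the same tree pass; only the final XOR stage differs, which is the property that lets the two outputs be produced simultaneously.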

7. A 64-bit DBLC-1 adder (for example the adder^[37.1 or 37.2] shown in Figs.13a-d plus Fig.14 or Fig.15) according to Claims 1, 2, 3 and 5, characterized in that it comprises a 64-bit DBLC-1 tree (for example the DBLC-1 tree^[1.4] shown in Figs.13a-d) and 64 SUM units (the g and p generators have been included in the DBLC-1 tree^[1.4]); the circuit embodiments of cells of the 64-bit DBLC-1 adder^[37.1 or 37.2] are shown in Fig.14 or Fig.15; the first version (Fig.14) gives an ultrafast speed to the evaluation phase but needs an additional precharge phase and the second version (Fig.15) does not need an additional precharge phase and, therefore, the 64-bit addition can be finished truly in one clock cycle.

8. Any means, architecture or circuit embodiment, characterized in that it comprises the distributed means used in the DBLC architecture, the particular DBLC architecture, the clock-and-data precharged dynamic CMOS circuit technique (I in Fig.8, II in Fig.9, III in Fig.10 or IV in Fig.11) or the data precharged asynchronous pipeline circuit technique (described in Fig.12), introduced through the text and figures, or a mixture of them; the DBLC architecture can be either non-pipelined or pipelined; the circuit embodiment can be a single chip, a part of a microprocessor, a part of a digital signal processing unit or a part of any device.