JPH04502677A

JPH04502677A - How to analyze data path elements

Info

Publication number: JPH04502677A
Application number: JP2503093A
Authority: JP
Inventors: アサト，クレイフトン　サトシ; ドラキア，スレシュ　キショルブハイ; ディッツェン，クリストフ
Original assignee: ブイエルエスアイ　テクノロジー，インコーポレイティド
Priority date: 1989-01-13
Filing date: 1990-01-12
Publication date: 1992-05-14
Also published as: GB9114332D0; WO1990008362A3; WO1990008362A2; GB2244829B; GB2244829A; DE4090021T

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

Method for Analyzing Data Path Elements

Field of the Invention

The present invention relates generally to techniques for determining delays through data path elements and, more particularly, to techniques for determining signal delays through multi-stage data path elements.

Logic units such as inverters typically operate on each digital bit of a data path individually. In such units, the operation on one bit has little, if any, effect on the operation on the other bits of the data path. For high-speed multi-stage computation units such as multipliers, however, the operation on one bit of the data path often depends on the results of operations on other bits. For example, a typical multiplier may include a plurality of carry-save adders and a final ripple adder. In such a multiplier, carry bits are often generated during operations on lower-order bits and are then used in operations on higher-order bits. Because a carry bit must be propagated through all bits of the ripple adder, the processing time required by the ripple adder depends in part on the number of bits in the numbers being multiplied.

To provide high-frequency operation of arithmetic function units, so-called pipeline stages are inserted into the units. Pipeline stages in arithmetic elements have significant advantages, but they can also increase latency (the time required to process one word completely). In other words, because pipeline stages add processing steps to an arithmetic element, the element needs more time to process each individual word. Nevertheless, the output frequency of the arithmetic element (that is, the number of words processed per unit time) may increase despite the added latency. This is because pipeline stages allow the element to begin operating on the next word before it has finished processing the previous word. Thus pipeline stages increase the latency of an element, but they also increase the operating frequency of the arithmetic element.
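The trade-off described above can be made concrete with a small numeric sketch. The delays used here (a 12 ns block split into three 4 ns segments, with a 1 ns register after each segment) are hypothetical values chosen only for illustration, and the function names are not taken from the patent.

```python
def unpipelined(total_delay_ns):
    """Return (latency_ns, words_per_ns) for a block with no pipeline stages."""
    # A new word can start only after the previous one has finished.
    return total_delay_ns, 1.0 / total_delay_ns


def pipelined(segment_delays_ns, register_delay_ns):
    """Return (latency_ns, words_per_ns) when a register follows every segment.

    Assumption: the clock period is set by the slowest segment plus one
    register delay, and a word needs one clock period per segment.
    """
    cycle = max(segment_delays_ns) + register_delay_ns
    latency = cycle * len(segment_delays_ns)
    return latency, 1.0 / cycle


base_latency, base_rate = unpipelined(12.0)
pipe_latency, pipe_rate = pipelined([4.0, 4.0, 4.0], 1.0)
print(base_latency, pipe_latency)   # latency grows from 12.0 to 15.0
print(pipe_rate > base_rate)        # True: more words per unit time
```

Under these assumed numbers the latency rises from 12 ns to 15 ns while the output rate rises from one word per 12 ns to one word per 5 ns.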

Accordingly, the use of pipeline stages requires a performance trade-off between frequency and latency. To achieve optimal or near-optimal performance, it is desirable to insert only the minimum number of pipeline stages needed to obtain the desired operating frequency. In other words, pipeline stages should normally be inserted into a functional element at the maximum possible distance from one another.

Effective placement of pipeline stages in a multi-stage data path element requires reasonably accurate estimates of the delays through the individual stages of the element. In addition, to minimize computational complexity, the placement technique should be relatively simple to implement. Accordingly, a primary object of the present invention is to provide a simple and effective technique for determining the delays encountered in data path elements.

Brief Summary of the Invention

Briefly, the present invention relates to a method for determining delays through multi-stage data path elements. According to the invention, the delay through each stage of a data path element can be estimated by a single equation, namely Di = Db × Nb + C, where Di is the estimated stage delay, Db is the delay associated with communication between bits within a stage, Nb is the number of bits in the data path element, and C is a constant. Based on the estimated delays, the locations of pipeline stages in a data path functional element can be calculated.

Brief Description of the Drawings

In the accompanying drawings:
FIG. 1 is a schematic diagram of a conventional 5×5 unsigned carry-save array multiplier;
FIG. 2 is a schematic logic circuit depicting a full-adder circuit used in the multiplier of FIG. 1;
FIG. 3 is a schematic logic circuit depicting a half-adder circuit used in the multiplier of FIG. 1;
FIG. 4 is a schematic circuit of a portion of a 5×5 unsigned carry-save array multiplier that includes pipeline stages;
FIG. 5 is a diagram provided to assist in explaining a technique for estimating the delays that arise in an N×M array multiplier;
FIG. 6 is a functional representation of an N×M array multiplier provided with pipeline stages; and
FIG. 7 is a flow diagram depicting a process for determining the placement of pipeline stages in a multi-stage data path element.

Detailed Description of the Preferred Embodiment

In the following, the present invention is described in terms of a carry-save array multiplier. This embodiment, however, is merely illustrative of elements that perform parallel data operations in a data path. That is, it should be understood that the invention is also applicable to arithmetic operators other than array multipliers.

A conventional 5×5 carry-save array multiplier 2 is depicted in FIG. 1. Multiplier 2 is designed for high-speed multiplication of two 5-bit numbers A and B. In symbolic notation, the number A comprises, from least significant bit to most significant bit, bits a0, a1, a2, a3 and a4. Similarly, the 5-bit number B comprises bits b0, b1, b2, b3 and b4.

Multiplier 2 in FIG. 1 is made up of a plurality of rows, or stages, of full adders, each denoted by the letters FA. The first row of the multiplier includes full adders 12, 14, 16 and 18; the second row includes full adders 22, 24, 26 and 28; the third row includes full adders 32, 34, 36 and 38; and the fourth row includes full adders 42, 44, 46 and 48. The final stage of multiplier 2 is a ripple-carry adder 50 made up of full adders 52, 54, 56 and 58.

In operation of multiplier 2 of FIG. 1, each full adder receives three input bits and produces a sum output and a carry output. The bit inputs to the various adders in FIG. 1 are listed in Table 1. The sum output of a full adder is denoted by the subscript "s" and the carry output by the subscript "c". Thus, for example, the sum and carry outputs of adder 14 are denoted 14s and 14c, respectively. The "ai bj" inputs listed in Table 1 denote the respective bits of the numbers A and B logically combined by an AND operation.

Table 1

Adder   Inputs             Adder   Inputs
12      0, a1b0, a0b1      36      26c, 28s, a2b3
14      0, a2b0, a1b1      38      28c, a4b2, a3b3
16      0, a3b0, a2b1      42      32c, 34s, a0b4
18      0, a4b0, a3b1      44      34c, 36s, a1b4
22      12c, 14s, a0b2     46      36c, 38s, a2b4
24      14c, 16s, a1b2     48      38c, a4b3, a3b4
26      16c, 18s, a2b2     52      42c, 44s, 0
28      18c, a4b1, a3b2    54      44c, 46s, 52c
32      22c, 24s, a0b3     56      46c, 48s, 54c
34      24c, 26s, a1b3     58      48c, a4b4, 56c

In a typical system using multiplier 2 of FIG. 1, input buffers are provided ahead of the multiplier to store the multiplier and multiplicand to be processed. In addition, a typical system uses AND gates connected between the input buffers and the adder stages to ensure that the ai bj inputs to the multiplier are suitably delayed so that synchronization is established between the inputs. Such buffer and gate circuits are known in the art and, for that reason, are not described further here.

The operation of multiplier 2 will now be described. For purposes of the description, an example is given in which the binary numbers A = 10011 and B = 01100 are multiplied (A = 19 and B = 12 in decimal). The paper-and-pencil multiplication is set out below.

   19 =      10011    multiplicand
 × 12 =    × 01100    multiplier
             00000
            00000
           10011
          10011
         00000
  228 =  011100100    product

It can thus be seen that multiplication of two binary numbers is accomplished by successive additions and shifts. In the multiplication shown above, successive bits of the multiplier are examined, least significant bit first. If the multiplier bit is 1, the multiplicand is copied down; otherwise, zeros are copied down. Each successive row is shifted one position to the left of the previous row. Finally, the rows are added, and the resulting sum is the product.
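The paper-and-pencil procedure just described can be sketched in a few lines of Python. This is only an illustration of the shift-and-add rule in the text; the function name and the `width` parameter are assumptions made for the example.

```python
def shift_and_add(multiplicand, multiplier, width=5):
    """Multiply two unsigned integers by successive shifts and additions.

    Returns the product together with the list of shifted rows, one per
    multiplier bit, least significant bit first.
    """
    rows = []
    for k in range(width):
        bit = (multiplier >> k) & 1
        # Copy the multiplicand if the multiplier bit is 1, otherwise zeros,
        # and shift the row k positions to the left.
        rows.append((multiplicand if bit else 0) << k)
    return sum(rows), rows


product, rows = shift_and_add(0b10011, 0b01100)
print(format(product, "09b"))  # 011100100, i.e. 228 in decimal
```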

To illustrate the multiplication A × B = P by multiplier 2 of FIG. 1, the inputs and outputs of each successive stage are set out here. In the first stage, adders 12, 14, 16 and 18 receive the input sets (0,0,0), (0,0,0), (0,0,0) and (0,0,0), respectively, and provide the respective (sum, carry) output sets (0,0), (0,0), (0,0) and (0,0). The least significant bit of the product P is P0 = a0 b0 = 0. Bit P1 is the sum output of adder 12, which in this example is 0. In the second stage, adders 22, 24, 26 and 28 receive the input sets (0,0,1), (0,0,1), (0,0,0) and (0,0,0) and provide the output sets (1,0), (1,0), (0,0) and (0,0). Bit P2 is the sum output of adder 22, which in this case is 1. The respective inputs to third-stage adders 32, 34, 36 and 38 are (0,1,1), (0,0,1), (0,0,0) and (0,1,0), and the respective outputs are (0,1), (1,0), (0,0) and (1,0). Bit P3 is the sum output of adder 32, which is 0. Next, the inputs to the fourth stage are (1,1,0), (0,0,0), (0,1,0) and (0,1,0), and the respective outputs are (0,1), (0,0), (1,0) and (1,0). Bit P4, the sum output of adder 42, is 0 in this example.

Continuing the operation of multiplier 2 of FIG. 1 for the same example, the first adder 52 of ripple-carry adder 50 receives the input set (1,0,0) and provides the output (1,0). Bit P5 is therefore 1. Adder 54 in turn receives (0,1,0) and outputs (1,0), so bit P6 is 1. Adder 56 receives (0,1,0) and outputs (1,0), and adder 58 receives (0,0,0) and outputs (0,0). Bits P7 and P8 are therefore 1 and 0, respectively. As can be seen, the product P (P8 P7 P6 P5 P4 P3 P2 P1 P0) is 011100100, which agrees with the result obtained above by successive additions and shifts.
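The stage-by-stage trace can be cross-checked with a short simulation. The sketch below wires one-bit full adders according to the adder inputs listed in Table 1, as read from the damaged table and checked against the worked example; the reference numerals are used as dictionary keys. Taking the carry output of adder 58 as a tenth product bit P9 is an assumption of this sketch, and the function names are hypothetical.

```python
def full_adder(x, y, c):
    """One-bit full adder: returns (sum, carry)."""
    partial = x ^ y
    return partial ^ c, (x & y) | (partial & c)


def array_multiply_5x5(a, b):
    """Multiply two 5-bit unsigned integers using the wiring of Table 1."""
    a_bits = [(a >> i) & 1 for i in range(5)]
    b_bits = [(b >> j) & 1 for j in range(5)]

    def pp(i, j):
        # Partial-product bit "ai bj": AND of bit i of A and bit j of B.
        return a_bits[i] & b_bits[j]

    s, c = {}, {}  # sum and carry outputs keyed by adder reference numeral

    # First row (adders 12, 14, 16, 18): one input is tied to 0.
    for k in range(4):
        n = 12 + 2 * k
        s[n], c[n] = full_adder(0, pp(k + 1, 0), pp(k, 1))

    # Rows two to four (adders 22-28, 32-38, 42-48).
    for row in (2, 3, 4):
        for k in range(4):
            n = 10 * row + 2 + 2 * k
            above = n - 10
            # The leftmost adder of a row takes a4 b(row-1) instead of a sum.
            sum_in = s[above + 2] if k < 3 else pp(4, row - 1)
            s[n], c[n] = full_adder(c[above], sum_in, pp(k, row))

    # Final stage: ripple-carry adder 50 (adders 52, 54, 56, 58).
    s[52], c[52] = full_adder(c[42], s[44], 0)
    s[54], c[54] = full_adder(c[44], s[46], c[52])
    s[56], c[56] = full_adder(c[46], s[48], c[54])
    s[58], c[58] = full_adder(c[48], pp(4, 4), c[56])

    # Product bits P0 .. P9 (P9 = carry of adder 58 is an assumption).
    product_bits = [pp(0, 0), s[12], s[22], s[32], s[42],
                    s[52], s[54], s[56], s[58], c[58]]
    return sum(bit << k for k, bit in enumerate(product_bits))


print(array_multiply_5x5(0b10011, 0b01100))  # 228
```

Checking all 1024 input pairs against ordinary integer multiplication is a quick way to confirm that the wiring read from Table 1 is self-consistent.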

FIG. 2 shows an example of a full adder. In this example, the full adder is formed by combining two XOR gates 60 and 62, two AND gates 64 and 66, and an OR gate 68.

The sum of bits Xi and Yi and the input carry bit Ci can be expressed as Xi ⊕ Yi ⊕ Ci = (Xi ⊕ Yi) ⊕ Ci, where ⊕ denotes exclusive OR. Accordingly, bits Xi and Yi are applied to the inputs of the first XOR gate 60. The output of XOR gate 60 is applied to an input terminal of XOR gate 62, which receives bit Ci as its second input. The output of XOR gate 62 represents the sum of the input bits.

Still referring to FIG. 2, it should be understood that the carry output may be expressed as C(i+1) = Xi Yi + Xi Ci + Yi Ci. By manipulating this expression according to the rules of Boolean algebra, the carry output can be expressed as C(i+1) = Xi Yi + (Xi ⊕ Yi) Ci. Accordingly, in the system of FIG. 2, bits Xi and Yi are applied to the input terminals of AND gate 64. AND gate 66 has input terminals connected to the output of XOR gate 60 and to the input carry bit Ci so as to produce the term (Xi ⊕ Yi) Ci. Finally, the outputs of AND gate 64 and AND gate 66 are connected to the input terminals of OR gate 68 to produce the output carry bit C(i+1).
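The equivalence of the two carry expressions can be confirmed by exhaustive enumeration. The sketch below models the five gates of FIG. 2 by their reference numerals (the variable and function names are illustrative only) and compares the result with the three-term carry expression and with ordinary addition.

```python
from itertools import product


def carry_three_term(x, y, c):
    """Carry written as Xi Yi + Xi Ci + Yi Ci."""
    return (x & y) | (x & c) | (y & c)


def full_adder_fig2(x, y, c):
    """Gate-by-gate model of the full adder described for FIG. 2."""
    g60 = x ^ y        # XOR gate 60
    g62 = g60 ^ c      # XOR gate 62: sum output
    g64 = x & y        # AND gate 64
    g66 = g60 & c      # AND gate 66: (Xi xor Yi) Ci
    g68 = g64 | g66    # OR gate 68: carry output
    return g62, g68


for x, y, c in product((0, 1), repeat=3):
    s, carry = full_adder_fig2(x, y, c)
    assert s + 2 * carry == x + y + c          # adds three bits correctly
    assert carry == carry_three_term(x, y, c)  # both carry forms agree
```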

Referring again to FIG. 1, it can be seen that full adders 12, 14, 16 and 18 each have one input set to 0. This allows each full adder of the first stage to be replaced by a half-adder circuit in order to minimize circuitry. An example of a suitable half-adder circuit is shown in FIG. 3. The half-adder circuit depicted includes an XOR gate 70 and an AND gate 72, each having input terminals connected to receive input bits Xi and Yi. XOR gate 70 determines the sum output Si = Xi ⊕ Yi, and AND gate 72 determines the output carry C(i+1) = Xi Yi.

In practice, a carry-save array multiplier 2 such as the one depicted in FIG. 1 can generate the product of two numbers A and B relatively quickly. In other words, such a circuit has relatively low latency as expressed in terms of clock cycles. However, such a circuit is also characterized by the fact that one set of numbers must be completely processed before multiplication of a new set of numbers can begin. More specifically, before the most significant bit P9 of the product P is determined, multiplier 2 of FIG. 1 must wait for at least the carry bit from adder 12 to propagate completely through adders 22, 32, 42, 52, 54, 56 and 58. Only then can the multiplier begin processing another set of numbers to be multiplied. Thus, although it is fast at multiplying any given pair of numbers, the output frequency of multiplier 2 of FIG. 1 is relatively low in terms of the number of different pairs of numbers that can be multiplied in a given time interval.

Furthermore, the overall efficiency of the circuit of FIG. 1 can be said to be low, because the individual full adders of the circuit are idle during most of the processing cycle.

To increase the output frequency of a multi-stage data path element such as multiplier 2 of FIG. 1, pipeline stages can be inserted between one or more of its stages. These pipeline stages are essentially storage registers that hold the partial results produced by the preceding functional stage. In a multiplier, pipeline stages can also be used to store the multiplier and multiplicand for processing by later adder stages. In its simplest implementation, a pipeline stage usually consists of two to four D-type flip-flops, where each flip-flop functions to temporarily store one output bit from the preceding adder stage as well as one input bit for the next stage. The advantage of inserting pipeline stages in a multi-stage data path element such as a multiplier is that, once an adder stage has transferred its partial product to a pipeline stage, the adder stage is free to process another set of numbers.

FIG. 4 shows a carry-save array multiplier having pipeline stages inserted between its adder stages. For simplicity, only the first two stages of the multiplier are shown. Through the use of pipeline stages, the multiplier of FIG. 4 can provide a high output frequency, because the multiplier can begin processing a second pair of numbers before it has finished processing the first pair. Pipeline stages do, however, increase the number of stages through which partial products must propagate within the multiplier. In this connection, it should be understood that, in an arithmetic unit such as a multiplier, it is not always necessary or practical to insert a pipeline stage between each pair of adder stages. It should also be understood that the delays through the different adder stages in the multiplier of FIG. 1 are not necessarily equal. For example, ripple-carry adder 50 typically requires more time to complete its processing cycle than any one of the first four adder stages and therefore has a larger delay. As a result, rather than placing a pipeline stage between every pair of stages of the multiplier, it may be more practical to place pipeline stages, for example, only after the second and fourth adder stages. In that case, the operating frequency of the multiplier would be determined by the maximum delay through the processing stages located between two pipeline stages. In this example, hardware requirements would be reduced, and the increase in latency minimized, while the output frequency of the multiplier is still increased.
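The point about placing pipeline stages only after the second and fourth adder stages can be illustrated numerically. In the sketch below, the stage delays (2.0 for each carry-save row and 5.0 for the ripple-carry adder) are hypothetical values, register delay is ignored for simplicity, and `segment_delays` is an illustrative helper rather than anything defined in the patent.

```python
def segment_delays(stage_delays, cuts):
    """Group stage delays into pipeline segments.

    `cuts` lists the 1-based stage numbers after which a pipeline
    register is placed; the result is the total delay of each segment.
    """
    segments, start = [], 0
    for cut in sorted(cuts):
        segments.append(sum(stage_delays[start:cut]))
        start = cut
    segments.append(sum(stage_delays[start:]))
    return segments


# Hypothetical delays: four carry-save rows followed by the ripple-carry adder.
stage_delays = [2.0, 2.0, 2.0, 2.0, 5.0]

every_stage = segment_delays(stage_delays, [1, 2, 3, 4])
two_cuts = segment_delays(stage_delays, [2, 4])

print(every_stage, max(every_stage))  # [2.0, 2.0, 2.0, 2.0, 5.0] 5.0
print(two_cuts, max(two_cuts))        # [4.0, 4.0, 5.0] 5.0
```

With these assumed delays the slowest segment is the ripple-carry adder in both cases, so two pipeline stages give the same cycle time as four while using fewer registers and adding less latency.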

To determine the minimum, or near-minimum, number of pipeline stages that need to be added to a multi-stage element such as the multipliers of FIGS. 1 and 4, it is necessary to determine the delays through the individual stages of the element.

One way to estimate these delays is to model the element at the gate level (that is, gate by gate). In the gate-level modeling method, a data path element is treated as the individual gates that make up its components, each gate having an associated delay, and the sum of the delays through all gates along a path gives an estimate of the delay through the data path element as a whole. For example, the delay through the full adder of FIG. 2 can be estimated by summing the delays along the path consisting of the two XOR gates 60 and 62. The gate-level approach to modeling delay is not always effective, however. The methodology is inefficient because, in elements having parallel data paths, gates can often be eliminated without changing the maximum delay of the element as a whole. In addition, gate-level modeling is generally time-consuming and computationally demanding. Because of these drawbacks, gate-level modeling is considered impractical in most situations.
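As a small illustration of gate-level modeling, the sketch below computes the latest arrival time at each output of the FIG. 2 full adder by summing gate delays along the slowest path. The gate delays (2.0 for XOR, 1.0 for AND and OR) are assumed values, and the netlist encoding and function name are hypothetical.

```python
# Assumed per-gate delays in arbitrary time units (not from the patent).
GATE_DELAY = {"XOR": 2.0, "AND": 1.0, "OR": 1.0}

# Netlist of the FIG. 2 full adder: gate name -> (gate type, inputs).
# Names not present as keys (x, y, c) are primary inputs.
NETLIST = {
    "g60": ("XOR", ["x", "y"]),
    "g62": ("XOR", ["g60", "c"]),    # sum output
    "g64": ("AND", ["x", "y"]),
    "g66": ("AND", ["g60", "c"]),
    "g68": ("OR", ["g64", "g66"]),   # carry output
}


def arrival_time(node, netlist=NETLIST, cache=None):
    """Latest arrival time at `node`, with primary inputs arriving at time 0."""
    if cache is None:
        cache = {}
    if node not in netlist:
        return 0.0
    if node not in cache:
        kind, fanin = netlist[node]
        slowest_input = max(arrival_time(n, netlist, cache) for n in fanin)
        cache[node] = GATE_DELAY[kind] + slowest_input
    return cache[node]


print(arrival_time("g62"))  # 4.0: XOR gate 60 followed by XOR gate 62
print(arrival_time("g68"))  # 4.0: XOR 60, then AND 66, then OR 68
```

Even for a single full adder every gate has to be visited, which hints at why this style of analysis becomes costly for a wide, multi-stage element.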

Another known method of estimating delays through a data path is to model delay per data path element. Such element-level modeling, however, has the drawback that it cannot indicate whether the timing of the data path changes when pipeline stages are added to an element of the data path. Moreover, element-level modeling does not indicate where in an element pipeline stages should be added so as to reduce the cycle time of the data path.

A modeling method is therefore needed for estimating the delays encountered in the individual stages of a multi-stage data path element. In one modeling method that meets this need, the delay through a stage of a data path element can be modeled as a function of the number of bits communicating within the stage. For example, one particularly suitable modeling equation is

Di = Db × Nb + C    (1)

where Di is the stage delay, Db is the delay component that is proportional to the number of bits of the data path element (the per-bit delay), Nb is the number of bits of the data path element, and C is a constant representing delay that does not depend on the number of bits of the data path. For stages in which the operation on one bit does not depend on the operations on other bits, Db is equal to zero.

Equation (1) is readily applicable to ripple-carry adders such as are typically found in adders, ALUs and multiplier elements. Such adders have delay that arises from communication within the element, and that delay can therefore be estimated as being proportional to the number of bits of the data path element. By contrast, components of an element that neither change nor communicate signals between bits have no delay proportional to the number of bits of the data path. Thus, for components of an element that have bit-independent delay, the parameter Db can be set to zero.

To illustrate the use of equation (1) for estimating delays through a multi-stage data path element, reference is made to FIG. 5. In FIG. 5, an N×M bit array multiplier is made up of N bit columns comprising M rows of carry-save adders and a ripple-carry adder.

Input buffers are provided ahead of the first adder stage to temporarily store the multiplicand and multiplier. As depicted in FIG. 5, the input buffer stage has an estimated stage delay Di equal to C, each carry-save adder has an estimated delay of A, and the ripple-carry adder has an estimated delay of BN + D.

By applying equation (1) to the multiplier of FIG. 5, the delay through the first four stages can be modeled as C + 3A. The delay through three intermediate stages can be modeled as 3A. The delay through the final ripple-carry adder can be modeled as BN + D. Finally, the entire N×M bit multiplier can be modeled as having an overall estimated delay of AM + BN + C + D, where A is the constant delay through a carry-save adder, B is the per-bit delay of the ripple-carry adder, C is the input buffer delay, and D is the constant delay of the ripple-carry adder.
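The overall estimate AM + BN + C + D follows from applying equation (1) to each stage and summing. The sketch below does exactly that; the parameter values passed in at the end are hypothetical, and the function names are chosen for illustration.

```python
def stage_delay(db, nb, c):
    """Equation (1): Di = Db * Nb + C."""
    return db * nb + c


def multiplier_delay(n, m, a, b, c, d):
    """Estimate the delay of an N x M array multiplier stage by stage.

    The input buffer and carry-save rows have bit-independent delay
    (Db = 0); only the ripple-carry adder has a per-bit term.
    """
    stages = [stage_delay(0.0, n, c)]        # input buffer: C
    stages += [stage_delay(0.0, n, a)] * m   # M carry-save rows: A each
    stages.append(stage_delay(b, n, d))      # ripple-carry adder: B*N + D
    return sum(stages), stages


# Hypothetical parameters: A=2.0, B=0.5, C=1.0, D=1.0 for a 16 x 16 multiplier.
total, stages = multiplier_delay(16, 16, a=2.0, b=0.5, c=1.0, d=1.0)
print(total)  # 42.0, which equals A*M + B*N + C + D
```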

FIG. 6 shows a multi-stage data path element consisting of the multiplier of FIG. 5 with pipeline stages inserted so that there is a pipeline stage after every three carry-save adders. The modified multiplier thus includes a plurality of segments each having a delay of 3A + C and a final segment (that is, the ripple-carry adder stage) having a delay of BN + D. It should be noted that each adder stage requires additional input buffers to store the multiplicand and multiplier for processing by the particular segment. With pipeline stages inserted at the locations indicated, the minimum cycle time of the multiplier would be the total delay given by the pipeline stage delay plus the maximum of 3A + C and BN + D.
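The minimum cycle time stated above can be written as a one-line function. In this sketch the register (pipeline stage) delay and the values of A, B, C and D are hypothetical; the example shows how the limiting segment switches from the carry-save segment to the ripple-carry adder as the bit width N grows.

```python
def min_cycle_time(n, a, b, c, d, register_delay):
    """Pipeline stage delay plus the slower of the two segment types."""
    carry_save_segment = 3 * a + c   # three carry-save rows plus input buffer
    ripple_segment = b * n + d       # final ripple-carry adder
    return register_delay + max(carry_save_segment, ripple_segment)


# Hypothetical parameters: A=2.0, B=0.5, C=1.0, D=1.0, register delay 1.0.
print(min_cycle_time(8, 2.0, 0.5, 1.0, 1.0, 1.0))   # 8.0: 3A + C dominates
print(min_cycle_time(32, 2.0, 0.5, 1.0, 1.0, 1.0))  # 18.0: BN + D dominates
```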

It can be appreciated that the modeling technique described above reflects the fact that multi-stage data path elements generally contain highly repetitive logic cells. By taking advantage of this repetition, the modeling technique provides sufficient detail for determining pipeline stage placement while requiring considerably less processing time and computational capacity than gate-level modeling.

A more generalized technique for inserting pipeline stages into a multi-stage data path element is discussed in conjunction with FIG. 7. This technique may be incorporated, for example, into an automated integrated circuit design system.

The first step in the process charted in FIG. 7 is to select a maximum desired delay DM. This delay is usually approximately equal to the inverse of the desired operating frequency of the data path element and may be determined, for example, by the operating frequency of other functional elements in the data path. Initially, the element stage counter is initialized to zero, and the cumulative delay variable DT is set to zero.

The next step in the process charted in FIG. 7 is to calculate the delay associated with the initially designated element stage. In practice, the initially designated stage is usually the last data processing stage of the element being analyzed. For the designated stage, the calculated stage delay Di is then compared with the maximum delay DM. If the delay Di of the initially designated stage is greater than the maximum delay DM, the maximum delay DM is reset to a value equal to that individual stage delay, and the process is started again.

Otherwise, the overall cumulative delay DT is increased by the stage delay of the next adjacent stage of the element. The increased total delay DT is then compared with the maximum delay DM. If the value DT is less than the value DM, the stage counter is incremented and the delay is calculated for the next stage.

In practice, the process of FIG. 7 continues until a datapath element stage is identified whose addition causes the cumulative delay DT to exceed the maximum delay DH. At that point, a pipeline stage is inserted before the identified stage, and the cumulative delay DT is set equal to the delay of the identified stage plus the inherent delay of the pipeline stage. (With respect to data flow through a multi-stage element in which the initially designated stage is the final data-flow stage, the pipeline stage will be inserted "downstream" from the stage at which the cumulative delay DT exceeds the maximum delay DH.) The stage counter is then incremented, and the process continues by considering the next stage of the element. The process proceeds in this manner through each of the individual stages of the datapath element.

In this connection, the basis of the above process can be understood as follows: the cumulative delay through the multi-stage datapath element is calculated sequentially, and a pipeline stage is inserted into the element whenever the stage under consideration would bring the cumulative delay above the desired maximum delay. In this way, pipeline stages are placed at maximum or near-maximum distances from one another. The process therefore increases the operating frequency of the multi-stage datapath element while minimizing the increase in its latency.
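The sequential accumulation described above can be illustrated by the following minimal Python sketch. It is not part of the original disclosure. The function name, the parameter names, and the sample delay values are hypothetical, and the sketch assumes that the reset of the maximum delay is applied to any stage whose own delay exceeds DH, not only to the initially designated stage. Stages are listed in the order in which they are analyzed.

```python
def place_pipeline_stages(stage_delays, max_delay, pipeline_delay=0.0):
    """Greedy pipeline-stage placement (illustrative sketch only).

    stage_delays   -- per-stage delays Di, in the order the stages are analyzed
    max_delay      -- maximum desired delay DH
    pipeline_delay -- inherent delay of one inserted pipeline stage (assumed constant)

    Returns (insert_before, max_delay): the indices of the stages before which
    a pipeline stage is inserted, and the value of DH actually used.
    """
    while True:
        cumulative = 0.0          # cumulative delay DT
        insert_before = []
        restart = False
        for i, delay in enumerate(stage_delays):
            if delay > max_delay:
                # A single stage is slower than DH: reset DH and start again.
                max_delay = delay
                restart = True
                break
            if cumulative + delay > max_delay:
                # Adding this stage would exceed DH: insert a pipeline stage
                # before it and restart the accumulation from this stage.
                insert_before.append(i)
                cumulative = delay + pipeline_delay
            else:
                cumulative += delay
        if not restart:
            return insert_before, max_delay


# Hypothetical delays, in arbitrary time units.
print(place_pipeline_stages([3, 4, 2, 5, 1], max_delay=8, pipeline_delay=1))
# ([2, 4], 8)
```

In this sketch a stage whose addition brings DT exactly to DH is accepted without inserting a pipeline stage; the text does not state how the equality case is handled. The sketch also does not check whether a stage delay plus the pipeline stage delay fits within DH.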

In practice, the maximum delay DH associated with the initially designated element stage additionally reflects delays through independent elements that precede or follow the designated stage with respect to data flow. (Where the initially designated element stage is the final data processing stage of the element, the independent elements are those that follow the initially designated stage with respect to data flow; where, on the other hand, the initially designated element stage is the first data processing stage, the independent elements are those that precede it.) Further, in carrying out the above process, setup and output delay times must be taken into account, as well as the delays of any pipeline stages of the element. Such pipeline stages include those belonging to the element before the analysis as well as those added to the element as a result of the analysis.

Although pipelining and the embodiments and operation of the present invention have been described in the foregoing specification, the invention for which protection is sought is not to be construed as limited to the particular forms disclosed. Rather, the disclosed forms are to be regarded as illustrative rather than restrictive. Those skilled in the art will appreciate that variations and changes may be made without departing from the spirit and concept of the invention.

Submission of Translation of Written Amendment (Patent Law, Article 184-8), July 1991

Claims

1. A method of placing pipeline stages in a multi-stage datapath element, the method comprising:
a) for each functional stage of the datapath element, determining an estimated stage delay that reflects the delay associated with communication between bits within the stage, the number of bits in the datapath element, and a constant;
b) determining a location for placement of a pipeline stage according to the estimated stage delays; and
c) placing a pipeline stage at the determined location.

2. A method of placing pipeline stages in a multi-stage datapath element, the method comprising:
a) for each functional stage of the datapath element, determining an estimated stage delay substantially according to the following equation:
Ds = Db·Nb + C
where Ds is the estimated stage delay, Db is the delay associated with communication between bits within the stage, Nb is the number of bits in the datapath element, and C is a constant;
b) calculating a location for placement of a pipeline stage according to the estimated stage delays; and
c) placing a pipeline stage at the calculated location.

3. The method according to claim 2, wherein the location for placement of the pipeline stage is calculated based on the estimated stage delays and an operating frequency selected for the datapath element.

4. The method according to claim 3, wherein Db is set equal to zero for element stages in which the operation on a bit does not depend on the operations on other bits.

5. The method according to claim 3, wherein the multi-stage datapath element is a carry-save array multiplier comprising a plurality of one-dimensional carry-save adder stages followed by a ripple-carry adder stage.
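For illustration only, and not as part of the claims, the stage delay estimate Ds = Db·Nb + C can be evaluated as in the following Python sketch. The function name and all numeric values are hypothetical. The example assumes a 16-bit element, a carry-save adder stage with Db set to zero because its bits are processed independently, and a ripple-carry adder stage with a nonzero Db.

```python
def estimated_stage_delay(bit_delay, num_bits, constant):
    """Estimate a stage delay as Ds = Db * Nb + C.

    bit_delay -- Db, delay associated with communication between bits in the stage
    num_bits  -- Nb, number of bits in the datapath element
    constant  -- C, a fixed per-stage delay
    """
    return bit_delay * num_bits + constant


# Hypothetical values in arbitrary time units.
carry_save = estimated_stage_delay(bit_delay=0.0, num_bits=16, constant=2.0)
ripple_carry = estimated_stage_delay(bit_delay=0.5, num_bits=16, constant=2.0)
print(carry_save, ripple_carry)  # 2.0 10.0
```

With these assumed values, the delay of the carry-save stage does not depend on the word width, while the delay of the ripple-carry stage grows linearly with the number of bits.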