JPH0833878B2

JPH0833878B2 - Bit-slice digital processor for correlation and convolution

Info

Publication number: JPH0833878B2
Application number: JP62049882A
Authority: JP
Inventors: John Vincent McCanny; Richard Anthony Evans; John Graham McWhirter
Original assignee: UK Secretary of State for Defence
Current assignee: UK Secretary of State for Defence
Priority date: 1986-03-05
Filing date: 1987-03-04
Publication date: 1996-03-29
Anticipated expiration: 2011-03-29
Also published as: CA1263758A; EP0237204B1; US4833635A; EP0237204A2; EP0237204A3; DE3776366D1; JPS62229470A; GB8605367D0

Description

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to a bit-slice digital processor for computing the mathematically equivalent operations of convolution and correlation. The processor is of the type formed as a bit-level systolic array.

A known digital processor for convolution and correlation, formed as a bit-level systolic array, is described in British Patent Application No. 2106287A, published 7 April 1983 (Reference 1). In that earlier application, convolution means are shown in Figures 15 to 20. The device consists of a rectangular array of gated full adders arranged in rows and columns. Each cell is connected only to its immediate row and column neighbors, that is, to at most four other cells. Cell operation is controlled by clocked latches that govern the movement of data bits, coefficient bits, carry bits and cumulative sum bits through the array. Each cell evaluates the product of an input data bit and an input coefficient bit received from its left-hand and right-hand neighbors respectively, and adds this product to an input carry bit and an input cumulative sum bit received from the right and from above respectively. The new carry and cumulative sum bits so formed are output to the left and downward, and the input data and coefficient bits pass on to the right and to the left respectively. Each coefficient word circulates bit-serially along its own array row. Each data word moves through the array by traversing each row in turn along a spiral (strictly, zigzag) path. The succession of carries moves with the coefficient bits, and the succession of cumulative sums moves down the array columns. Data move counter to both the direction of cumulative sum formation and the direction of coefficient and carry propagation. Cumulative sum formation is cascaded down the array columns to form the partial sum outputs of the array. Partial sums of equal bit weight emerge in succession from the same array row and are accumulated by a full adder arranged to feed back its output sum, so forming the convolution result.

Interspersing zero bits within the data and coefficient words to avoid generating unwanted partial products is a disadvantage of the processor of Reference 1. At any time at least half, and in some cases three quarters, of the array cells in a processor of this kind are computing zero partial products. Zero-bit interspersal therefore makes the array inefficient, and much larger than would be needed if the interspersal could be avoided.

A further bit-level systolic array is described in British Patent Application No. 2144245A, published 27 February 1985 (Reference 2). That application concerns an array similar to that of Reference 1, for multiplying two matrices having multi-bit coefficients. In this array the row elements of one matrix propagate along the array rows in counterflow to the column elements of the other matrix, and carry bits recirculate on each cell instead of moving along the rows. The use of so-called "guard bands" is also described; this means extending the coefficient words with zero bits to provide for word growth of the accumulated results.

British Patent Application No. 2147721A, published 15 May 1985 (Reference 3), discloses a further bit-level systolic array, in this case for matrix-vector multiplication. Here array efficiency is improved in two ways. First, array output accumulation is arranged so that the parts of the array corresponding to the inactive regions of Reference 1 contribute to the convolution result. Secondly, the need for zeros between data bits and coefficient bits is avoided by complex clocking means that move the bits on adjacent rows on alternate clock cycles. As in References 1 and 2, multiplicand bits move in counterflow along the array rows. As in Reference 2, carry bits recirculate on each cell, and guard-band word extension is also used.

In the GEC Journal of Research, Vol. 2, No. 1 (1984), R. B. Urquhart and D. Wood introduced the concept of using static coefficients in bit-level systolic arrays. Each cell of the array is associated with a respective single bit of a coefficient, and each coefficient word with a respective array row. The cells are arranged to recirculate carry bits, and data are input to each array row and move along it. Cumulative sums are formed in cascade down the array columns, and guard bands provide for word growth. Partial products of equal bit weight emerge from different array columns either with a relative delay or synchronously, according to whether the input data meet the coefficient bits in ascending or descending order of bit weight. This arrangement gives 100% cell utilization, or array efficiency, without the use of complex clocking arrangements.

Each cell computes a product on every clock cycle, and all latches are clocked alike. Unfortunately, however, the array accumulation scheme described does not give correct convolution or correlation results, because the arrangement wrongly accumulates partial sums and carry bits that belong to different results.

In the field of digital arithmetic circuits it is important to standardize components wherever possible. This is most easily achieved if integrated circuits designed for small calculations can be linked in an array, or cascaded, to perform larger calculations. It is also important, although very difficult to achieve, to give such integrated circuit arrays a degree of fault tolerance, so that a comparatively minor fault does not disable the whole array. This is particularly important in the developing field of wafer scale integration, where wafer yield can be virtually zero without some degree of fault tolerance.

One object of the present invention is to provide a digital processor for correlation or convolution that can be cascaded to form a fault-tolerant assembly.

The present invention provides a bit-slice digital processor for carrying out correlation and convolution of a bit-parallel, word-serial, bit-staggered (zigzag) stream of M-bit data words with N single-bit coefficients. According to the invention: (1) the processor includes an array of logic cells having N rows and M columns; (2) each logic cell is arranged to (a) input a data bit, a carry bit and a cumulative sum bit, (b) output the data bit, and (c) generate an output cumulative sum bit and an output carry bit corresponding to the sum of the input cumulative sum bit, the input carry bit, and the product of the input data bit with the coefficient bit associated with the cell's row; (3) cell interconnection lines are arranged to pass carry bits along the array rows, and to pass the data stream and the cascaded cumulative sums in a single direction down the array columns; and (4) the cell interconnection lines include clock-activated delay means arranged so that data bits descend the array columns at twice or half the rate of the cumulative sum bits, and so that carry bits pass along the array rows, in the direction of increasing data bit weight, faster than both the cumulative sum bits and the data bits.

In this specification, the term "bit transmission rate" and related expressions are to be understood as referring to the rate at which cells are traversed, not to the physical distance traveled.

The processor of the invention has four main advantages.

First, it is 100% efficient, since every cell operates on real data whenever it is clocked, and only a single non-overlapping two-phase clock of known kind is needed. Unlike the prior art of Reference 1, there is no need to insert zero bits between input data bits, and unlike Reference 2 no complex clocking arrangement is needed to move bits between adjacent rows or columns on alternate cycles.

Secondly, as will be described later, it is well suited to use as an integrated circuit building block for constructing arrays of circuits that perform larger calculations. In particular, calculations involving multi-bit coefficients can be accommodated by providing one processor per coefficient bit slice and accumulating the processor outputs with appropriate adjustment of timing and bit weight. In addition, processors can be cascaded in series to accommodate large coefficient sets, and long data words can be handled by dividing them into bytes supplied to respective processors.

Thirdly, because the data flow and the result flow are unidirectional, a processor can be designed that incorporates input data and result bypass connections subdivided by clock-activated latches into sections capable of fast switching. A chain of cascaded processors can then be made fault tolerant without a penalty in operating speed, because a faulty processor in the chain can be bypassed without the operating speed being limited by the time constant of the full length of the bypass connection. Such a design is not possible in, for example, the processor of Reference 1, where data and results move in counterflow and clocked bypass latches would disrupt the computation timing.

Fourthly, no guard-band extension of the input data is needed, so the associated reduction in data throughput rate does not arise.

Each logic cell may be associated with a respective stationary coefficient bit. Preferably, however, additional cell interconnection lines and clock-activated delay means are provided so that coefficient bits pass along each array row in the same direction and at the same rate as the carry bits. This makes the coefficients easy to program through the row coefficient inputs. A further advantage of this embodiment of the invention is that coefficient programmability is obtained while 100% cell utilization is maintained. In Reference 3, for example, stationary coefficients are required in order to achieve 100% efficiency.

The processor of the invention may include programmable delay means through which the array output is passed to a first input of a multi-bit full adder. The adder has a second input arranged to receive the output of a second processor, and an output arranged for connection to the second input of the equivalent adder of a third processor. A processor of this form is suitable for use as a building block for processor arrays that handle long data words or large coefficient sets, or that handle multi-bit coefficients. The programmable delay means is used to adjust the relative timing of the outputs of the different processors, and differences in output bit weight between processors are accommodated by appropriate adder input connections.
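The bit-slice decomposition described above can be illustrated arithmetically. The following is a minimal sketch, not the patented circuit: it assumes non-negative coefficients, models each processor as an ideal single-bit-coefficient correlator, and replaces the programmable delays and adder wiring by a plain shift-and-add. The function names `correlate` and `correlate_bit_sliced` are hypothetical.

```python
def correlate(coeffs, words):
    """Direct correlation: Y[m] = sum over i of coeffs[i] * words[m + i]."""
    n = len(coeffs)
    return [sum(a * words[m + i] for i, a in enumerate(coeffs))
            for m in range(len(words) - n + 1)]


def correlate_bit_sliced(coeffs, words, coeff_bits):
    """Multi-bit coefficients handled as one single-bit correlation per bit slice."""
    total = [0] * (len(words) - len(coeffs) + 1)
    for k in range(coeff_bits):
        # Bit slice k: one single-bit coefficient per row, as one processor would hold.
        slice_k = [(a >> k) & 1 for a in coeffs]
        for m, y in enumerate(correlate(slice_k, words)):
            # Shifting by k stands in for the adder input wiring that
            # accounts for the difference in output bit weight.
            total[m] += y << k
    return total
```

Under these assumptions the sliced result matches the direct correlation for any non-negative coefficients that fit in `coeff_bits` bits.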

The processor may be used with data and coefficients that are all positive, or that are in two's complement form, provided that sign bit extension is applied as appropriate. However, the processor must operate on data that do not generate carry bits which cannot be accommodated in an array row. In other words, word growth arising from accumulation of results must not exceed the array dimensions. If necessary, the array can be enlarged to accommodate word growth by extending the array rows with half adders. The n-th row then includes log₂ n half adders (n = 1, 2, ...), or log₂(n − 1) half adders (n = 2, 3, ...), together with a connection, containing delay means, between the carry output of the (n − 1)-th row and the sum input of the appropriate half adder of the n-th row.
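The row-by-row word growth can be checked with a short calculation. This sketch assumes that the logarithm is rounded up to an integer and that rows are numbered from 1; the helper names are hypothetical. With four rows it gives five half adders in total, which matches the five half adder cells of the four-row example described below.

```python
def half_adders_in_row(n):
    """Assumed count of extra half adders in row n (1-indexed): ceil(log2(n))."""
    return (n - 1).bit_length()


def row_sum_fits(n, m_bits):
    """True if the largest sum of n M-bit words fits in M + ceil(log2(n)) bits."""
    largest = n * (2 ** m_bits - 1)
    return largest < 2 ** (m_bits + half_adders_in_row(n))
```

For example, `[half_adders_in_row(n) for n in range(1, 5)]` evaluates to `[0, 1, 2, 2]`.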

The invention will be more fully understood from the following description, given with reference to the accompanying drawings.

Figure 1 shows a bit-slice processor 10 of the invention. The processor 10 is described and analyzed in terms of computing a correlation function, but, as will be described later, it is equally suitable for the mathematically equivalent convolution operation. The processor 10 is arranged to correlate a data stream of successive four-bit numbers $x_n$ (n = 0, 1, 2, ...), having individual bits $x_n^b$ (b = 0 to 3), with four one-bit coefficients $a_i$ (i = 0 to 3). In this example the data and the coefficients are taken to be positive.

The processor 10 includes an array 12 of gated full adder logic cells arranged in four rows and four columns. Each cell is designated 14, with subscripts indicating its row and column position; for example, cell 14ᵢⱼ is the cell in the i-th row and the j-th column. The processor also includes five half adder logic cells 16, likewise subscripted with their row and column positions.

Referring now also to Figures 2 and 3, each logic cell 14 is arranged to compute the gated full adder logic function as follows.

$$y \leftarrow y' \oplus (a \cdot x) \oplus c' \qquad (1.1)$$

$$c \leftarrow y' \cdot c' + y' \cdot (a \cdot x) + c' \cdot (a \cdot x) \qquad (1.2)$$

where $y'$ and $y$ are the input and output cumulative sum bits, $c'$ and $c$ are the input and output carry bits, $a$ is the input one-bit coefficient, and $x$ is the input data bit. Subscripts for bit weight and word number are omitted for clarity.
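Equations (1.1) and (1.2) can be written directly as bit operations. The following is a minimal sketch of the cell logic only, with no latches; the function name `gated_full_adder` and the argument order are illustrative assumptions.

```python
def gated_full_adder(y_in, c_in, a, x):
    """Return (y, c) for single-bit inputs, following equations (1.1) and (1.2)."""
    p = a & x                                      # gated partial product a.x
    y = y_in ^ p ^ c_in                            # (1.1): output cumulative sum bit
    c = (y_in & c_in) | (y_in & p) | (c_in & p)    # (1.2): output carry bit
    return y, c
```

For every combination of input bits the outputs satisfy y + 2c = y' + c' + a·x, which is the full adder property the description relies on.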

Each logic cell 14 is arranged to receive an input data bit $x$ and an input cumulative sum bit $y'$ from the cell immediately above. It is also arranged to receive an input coefficient bit $a$ and an input carry bit $c'$ from the cell to its right. The cell evaluates the logic functions of equations (1.1) and (1.2) to generate an output cumulative sum bit $y$ and an output carry bit $c$. These output bits correspond to the sum of $c'$, $y'$ and the product of $a$ and $x$. The coefficient bit $a$ and the carry output bit $c$ are passed to the left-hand neighboring cell through respective clock-activated latches 18a and 18c. The data bit $x$ and the cumulative sum output bit $y$ are passed to the cell immediately below, through one clock-activated latch 18x in the case of $x$ and through two clock-activated latches 18_y1 and 18_y2 in the case of $y$.

As shown in Figure 3, each half adder cell 16 receives input carry and cumulative sum bits $c'$ and $y'$ from the cells to its right and immediately above respectively. It adds these to generate output carry and cumulative sum bits $c$ and $y$, which are passed through clock-activated latches 20c, 20_y1 and 20_y2 to the cells to its left and immediately below. The half adder cell 16 evaluates the following logic functions.

$$y \leftarrow y' \oplus c' \qquad (2.1)$$

$$c \leftarrow y' \cdot c' \qquad (2.2)$$

where the terms are as defined above.
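The half adder function of equations (2.1) and (2.2) is the gated full adder with its product term removed. A minimal sketch, with the hypothetical name `half_adder`:

```python
def half_adder(y_in, c_in):
    """Return (y, c) for single-bit inputs, following equations (2.1) and (2.2)."""
    y = y_in ^ c_in    # (2.1): output cumulative sum bit
    c = y_in & c_in    # (2.2): output carry bit
    return y, c
```

The outputs satisfy y + 2c = y' + c' for all four input combinations.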

Each of the latches 18 and 20 is activated by a single clock 22 (not shown in Figure 1), which controls the timing of all the cells 14 of the array 12 and of the half adders 16. The clock 22 generates a non-overlapping two-phase signal, and each latch 18 or 20 consists of two half latches in series. On the first-phase clock pulse, each second half latch outputs its latched bit and each first half latch inputs a new bit. On the second-phase clock pulse, each first half latch transfers its latched bit to its own second half latch. Successive bits are therefore clocked through each latch on successive clock cycles. The cells 14 and 16 have full latches such as 18a at their outputs, but the same array operation is maintained if these are placed at the cell inputs, or are split into half latches at the inputs and outputs. It is also known to use a half latch in place of each full latch, and this would be a variant of the embodiment described. The operation of such latches is well known in the art of bit-level systolic arrays and is shown in Figures 10 and 11 of Reference 1, so it will not be described further here.
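The two-phase behavior of a full latch can be modeled in a few lines. This is a behavioral sketch only, assuming every latch starts holding 0; the class `FullLatch` and the helper `run_chain` are hypothetical names. It shows that one full latch delays a bit stream by one clock cycle and two latches in series delay it by two cycles.

```python
class FullLatch:
    """Two half latches in series, driven by a non-overlapping two-phase clock."""

    def __init__(self):
        self.first = 0     # first half latch
        self.second = 0    # second half latch

    def phase1(self, bit_in):
        """Second half latch outputs its bit; first half latch takes a new bit."""
        bit_out = self.second
        self.first = bit_in
        return bit_out

    def phase2(self):
        """First half latch transfers its bit to the second half latch."""
        self.second = self.first


def run_chain(bits, n_latches):
    """Clock a bit stream through n_latches full latches connected in series."""
    chain = [FullLatch() for _ in range(n_latches)]
    out = []
    for b in bits:
        for latch in chain:
            b = latch.phase1(b)   # each latch passes its stored bit to the next
        out.append(b)
        for latch in chain:
            latch.phase2()
    return out
```

For the input `[1, 0, 1, 1, 0]`, one latch gives `[0, 1, 0, 1, 1]` and two latches in series give `[0, 0, 1, 0, 1]`.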

The effect of clocking the latches 18 and 20 is to move the coefficient bits $a$ and the successively computed carry bits $c$ along the array rows by one cell per clock cycle, as indicated by arrows 24 and 26. The data bits $x$ also move by one cell per clock cycle. Coefficient and data bits pass through the array 12 unchanged, but each newly computed carry bit becomes an input to a computation one level higher in bit weight, carried out one clock cycle later by the left-hand neighboring cell 14 or 16.

Each newly computed output cumulative sum bit $y$ becomes the $y'$ input of the cell 14 or 16 immediately below two clock cycles later. This is because each of these bits passes through two latches, 18_y1 and 18_y2 or 20_y1 and 20_y2, whereas every other bit passes through only one latch 18a, 18c, 18x or 20c.

The processor 10 includes only five logic cells, 14₁₂, 14₁₁, 14₂₂, 14₂₁ and 16₂₄, that are fully connected to four neighboring cells. The $y'$ inputs of cells 14₀₀ to 14₀₃, 16₁₄ and 16₂₅ are set to 0. The $c'$ inputs of cells 14₀₀ to 14₃₀ are set to 0. Cells 14₀₃ to 14₃₃ have unconnected coefficient (that is, $a$) outputs, and cells 14₀₃, 16₁₄, 16₂₅ and 16₃₅ have unconnected $c$ outputs. The $x$ outputs of cells 14₃₀ to 14₃₃ are also unconnected. The first-row cells 14₀₀ to 14₀₃ have $x$ inputs through which data are supplied to the processor 10 in bit-parallel, word-serial, bit-staggered form, as will be described. The first-column cells 14₀₀ to 14₃₀ have $a$ inputs through which coefficients are supplied to the processor 10. The output of the processor 10 is obtained from the $y$ outputs of the last-row cells 14₃₀ to 14₃₅.

In a practical design of the processor 10, the redundant cell connections and the associated latches may be omitted. It is nevertheless advantageous to keep the number of logic cell types as small as possible. With minimum redundancy, the processor 10 requires only two types of cell. Moreover, if the half adders 16 are replaced by gated full adders 14 with their $a$ and/or $x$ inputs set to 0, only one type of cell is needed, at the cost of some further redundancy. This has the advantage of simplifying integrated circuit production, for example by computer-aided design techniques. Furthermore, as will be described later, for two's complement arithmetic it is advantageous to form a rectangular array of gated full adder cells rather than to leave a space at the upper left as in the processor 10 of Figure 1.

The operation of the processor 10 will now be described with reference to Figures 4, 5 and 6. The processor 10 is arranged to perform the correlation operation defined by:

$$Y_n = \sum_{i=0}^{N-1} a_i \, x_{n+i} \qquad (3)$$

where $Y_n$ denotes successive correlation result words, $a_i$ the coefficients, and $x_{n+i}$ a general data word in the range $x_n$ to $x_{n+N-1}$.
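As a reference point for the array description that follows, the correlation can be evaluated directly. This is a plain software sketch of the definition, not a model of the hardware; the function name `correlate` is an assumption, and only result words for which all N data words are available are returned.

```python
def correlate(coeffs, words):
    """Y[n] = sum over i of coeffs[i] * words[n + i], for every complete window."""
    n_coeffs = len(coeffs)
    return [sum(a * words[n + i] for i, a in enumerate(coeffs))
            for n in range(len(words) - n_coeffs + 1)]


# Four one-bit coefficients and five 4-bit data words, as in the 4 x 4 example.
print(correlate([1, 0, 1, 1], [3, 15, 7, 9, 2]))   # prints [19, 26]
```

Here Y₀ = 3 + 7 + 9 = 19 and Y₁ = 15 + 9 + 2 = 26.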

In Figures 4, 5 and 6, a stream 40 of single-bit coefficient words $a_0$ to $a_3$ moves leftward into the processor 10, each coefficient being input to a respective row. A data stream 42 moves downward into the processor 10, and a result stream 44 emerges from the bottom of the processor 10 as shown in Figures 5 and 6. Figure 4 shows the processor 10 immediately before the first clock cycle of operation, and Figures 5 and 6 show the data and result bit positions at the eleventh and fourteenth cycles. Figures 4 to 6 illustrate schematically the timing of the data and coefficient flows and the accumulation of results.

Successive bit positions extending above or to the right of the processor 10 indicate progressively later data inputs, result outputs or coefficient inputs. The diagonal leading edges 40', 42' and 44' of the coefficient, data and result streams 40 to 44 indicate the time-staggered bit input to the processor 10. Data words $x_{n+i}$ are input to the processor 10 in word-serial, bit-parallel form with a cumulative time stagger. Bits $x_0^0$ to $x_0^3$ are accordingly input to the first-row cells 14₀₀ to 14₀₃ with a delay of one clock cycle between adjacent cells, so that the input $x_0^n$ to cell 14₀ₙ (n = 1 to 3) lags the input $x_0^0$ to cell 14₀₀ by n clock cycles.
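The staggered input can be tabulated with a small helper. This sketch assumes that bit 0 of word 0 enters at clock cycle 1 and that successive words follow one cycle apart, which is consistent with the cycle numbers used in the worked example below; the names `entry_cycle` and `input_schedule` are hypothetical.

```python
def entry_cycle(word_index, bit_index, first_cycle=1):
    """Cycle at which the given bit of the given word enters the first array row."""
    return first_cycle + word_index + bit_index


def input_schedule(words, m_bits):
    """Map each cycle to the (column, bit value) pairs presented to the first row."""
    schedule = {}
    for n, word in enumerate(words):
        for b in range(m_bits):
            schedule.setdefault(entry_cycle(n, b), []).append((b, (word >> b) & 1))
    return schedule
```

For a single word `0b0011` with four bits, the schedule places one bit per cycle on columns 0 to 3 at cycles 1 to 4.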

By virtue of the logic function of equations (1.1) and (1.2), one clock cycle after Figure 4, that is at clock cycle 1, cell 14₀₀ receives inputs $x_0^0$ and $a_0$. It computes the product $a_0 x_0^0$ and adds to it its carry and sum input bits $c'$ and $y'$, which are permanently 0. The corresponding cumulative sum output $y$ is therefore $a_0 x_0^0$, and the carry output $c$ to cell 14₀₁ is 0. At clock cycles 3, 5 and 7, cells 14₁₀ to 14₃₀ receive data inputs $x_1^0$, $x_2^0$ and $x_3^0$, and multiply them by $a_1$, $a_2$ and $a_3$ respectively. The corresponding carry inputs are all 0, but each cell 14ₙ₀ (n = 1 to 3) receives as its cumulative sum input the cumulative sum output computed two cycles earlier by the cell 14₍ₙ₋₁₎₀ immediately above. The two-cycle delay is provided by the respective latches 18_y1 and 18_y2. At cycle 3, therefore, cell 14₁₀ receives $a_0 x_0^0$ from cell 14₀₀ in synchronism with its multiplicand inputs $a_1$ and $x_1^0$, and generates $y$ and $c$ outputs consisting of the least significant bit (lsb) and the higher order bit (hob) of $a_0 x_0^0 + a_1 x_1^0$. The carry bit $c$ passes to cell 14₁₁ for cycle 4, and the cumulative sum output bit $y$ passes to cell 14₂₀ for cycle 5. Cell 14₂₀ generates $y$ and $c$ at cycle 5 as the lsb and hob of the sum of its $y'$ input and $a_2 x_2^0$. This $c$ passes to cell 14₂₁ for cycle 6 and this $y$ passes to cell 14₃₀ for cycle 7, in synchronism with the multiplicand inputs $a_3$ and $x_3^0$. The $y$ and $c$ outputs of cell 14₃₀ at cycle 7 are accordingly the lsb and hob of the sum of its $y'$ input and $a_3 x_3^0$. The cumulative sum output of cell 14₃₀ at cycle 7 is given by:

$$y = \mathrm{lsb}\left[ a_0 x_0^0 + a_1 x_1^0 + a_2 x_2^0 + a_3 x_3^0 \right] \qquad (4)$$

Equation (4) is equivalent to:

$$y = \mathrm{lsb}\left[ \sum_{i=0}^{3} a_i x_i^0 \right] \qquad (5)$$

Equation (5) gives the lsb of $Y_0$, the first correlation term of the series $Y_n$ (n = 0, 1, ...). The right-hand column cells 14₀₀ to 14₃₀ therefore generate the lsb $y_0^0$ of $Y_0$ seven clock cycles after Figure 4. Because there are two latches in series with the cumulative sum output, $y_0^0$ emerges from latch 18_y2 of cell 14₃₀ eight clock cycles after Figure 4.
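The first-column accumulation traced above can be reproduced bit by bit. This sketch follows only the right-hand column, where every carry input is 0, and ignores clock timing; the function name `first_column` and the example values are assumptions. It returns the final cumulative sum bit together with the carries each cell hands to the second column.

```python
def first_column(coeffs, words):
    """Accumulate bit 0 of a[i] * x[i] down the first column; return (y, carries)."""
    y = 0
    carries = []
    for a, x in zip(coeffs, words):
        p = a & (x & 1)          # product of the coefficient and bit 0 of the data word
        carries.append(y & p)    # carry out; the carry input is 0 in this column
        y ^= p                   # cumulative sum bit passed to the cell below
    return y, carries
```

With coefficients `[1, 0, 1, 1]` and data words `[3, 15, 7, 9]`, the column output is 1, which is the lsb of Y₀ = 19, and the cell in the third row is the only one to generate a carry.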

Consider now the second-column cells 14₀₁ to 14₃₁. At cycle 2, cell 14₀₁ receives $c'$ and $y'$ inputs of 0 together with multiplicand inputs $x_0^1$ and $a_0$. It therefore generates a carry output of 0 to its left-hand neighbor, cell 14₀₂, and a $y$ output of $a_0 x_0^1$ to the cell 14₁₁ immediately below. At cycles 4, 6 and 8, cells 14₁₁, 14₂₁ and 14₃₁ receive $a_1$ and $x_1^1$, $a_2$ and $x_2^1$, and $a_3$ and $x_3^1$ respectively. The $y$ output of the second latch 18_y2 of cell 14₃₁ at cycle 9 is accordingly the lsb of the sum of these four products and the carry bits received from the first column.

That is, y_0^1 is the second least significant bit of Y_0. It emerges from the second-column cells 14_01 to 14_31 in cycle 9, one clock cycle after the lsb y_0^0 emerges from the first column.

Carry bits generated during the formation of y_0^1 are passed to the third-column cells 14_02 to 14_32 in accordance with the following:

    cycle 3, cell 14_02: c' = 0    (8.1)

By similar analysis, it can be seen that y_0^2 and y_0^3 emerge from the third- and fourth-column cells in cycles 10 and 11, with carry bits simultaneously being passed leftwards as before.

The c' and y' inputs to the first-row cells are all permanently 0. The product of an arbitrary 4-bit number and a_0 (equal to 1 or 0) is at most a 4-bit number, so the c output of the last cell 14_03 of the first row is always 0. The c output of the last cell 14_13 of the second row may be non-zero, since it results from adding two 4-bit numbers. Half-adder cell 16_14 is arranged to pass this carry bit on to the third row of the array. The third and fourth rows add three and four 4-bit numbers respectively, which may sum to 6 bits, so two carry bits must be accommodated. In general, the Nth correlation row (N = 1, 2, 3 or 4) must incorporate log_2 N half adders to accommodate carry bits moving across, log_2 N being rounded to an integer where necessary. The effect is shown in FIG. 6, in which the bits of significance 4 and 5, y_n^4 and y_n^5 (n = 0, 1, 2, ...), are computed by the fifth and sixth columns, which consist of half adders. To reduce circuitry, half adders 16_14 and 16_25 may be replaced by simple clocked latches, because their sum input and carry output are unconnected and they serve only to provide delay. In general, therefore, the Nth correlation row would require log_2(N-1) half adders (N = 2, 3, ...). In cycles 12 and 13, the final two bits y_0^4 and y_0^5 emerge from the bottom-row half-adder cells 16_34 and 16_35 of the fifth and sixth columns respectively.
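The word growth behind the half-adder count can be illustrated numerically. This is a minimal sketch, assuming 4-bit data and single-bit coefficients as in the text; the helper name `extra_columns` is hypothetical. It computes how many bits beyond the data width are needed to hold the largest sum that row N can accumulate.

```python
import math

def extra_columns(rows, data_bits=4):
    """Bits of word growth after summing `rows` products of a data word
    and a single-bit coefficient (each product is at most 2**data_bits - 1)."""
    max_sum = rows * (2**data_bits - 1)
    return max_sum.bit_length() - data_bits

for n in range(1, 5):
    # For the four rows considered here the growth equals log2(N) rounded up.
    assert extra_columns(n) == math.ceil(math.log2(n))
    print(f"row {n}: {extra_columns(n)} extra bit(s)")
```

For rows 1 to 4 this gives 0, 1, 2 and 2 extra bits, matching the 4-, 5-, 6- and 6-bit sums mentioned above. The agreement with the rounded logarithm is only checked for these four rows.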

From the foregoing analysis it will be seen that y_0^p, the bit of significance p of Y_0, emerges from the pth column of cells (p + 8) cycles after the state shown in FIG. 4, where p = 0 to 5. Extending the analysis, it is easily shown that y_n^p, the pth bit of the general correlation result Y_n, emerges from the pth column (n + p + 8) cycles after FIG. 4. Successive correlation results Y_n are therefore generated by processor 10 word-serially and bit-parallel with a latency of eight clock cycles; that is, eight cycles elapse between input of a data bit and emergence of the corresponding result bit.
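The output timing stated above reduces to a one-line formula. The sketch below simply tabulates it; `output_cycle` and `LATENCY` are illustrative names, and cycles are counted from the state shown in FIG. 4.

```python
LATENCY = 8  # cycles for the four-row array described in the text

def output_cycle(n, p, latency=LATENCY):
    """Cycle at which bit p of correlation result Y_n leaves column p."""
    return n + p + latency

# Bits 0..5 of Y_0 emerge on consecutive cycles, one column after another.
print([output_cycle(0, p) for p in range(6)])   # [8, 9, 10, 11, 12, 13]
# The lsb of successive results Y_0, Y_1, Y_2 emerges on successive cycles.
print([output_cycle(n, 0) for n in range(3)])   # [8, 9, 10]
```

The first list agrees with the cycle numbers quoted in the walkthrough: cycle 8 for the lsb, cycle 9 for the next bit, cycles 10 and 11 for the third and fourth columns, and cycles 12 and 13 for the two half-adder columns.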

The embodiment of the invention described with reference to FIGS. 1 to 6 is a processor employing moving single-bit coefficients. This type of processor is suitable where the coefficients need to be changed from time to time in order to compute different correlations. Where a fixed correlation is always required, however, each cell could instead hold a stationary, possibly pre-programmed, coefficient. In that case the intercell connections and latches for coefficient transfer would be unnecessary.

Referring to FIG. 4 once more, it will be seen that output of valid results from processor 10 is preceded by a small number of unwanted terms. In particular, four cycles after the cycle illustrated, cell 14_30 computes the product of a_3 and x_0^0, which is a meaningless result. Results from cell 14_30 must be ignored for the first seven cycles of operation, results from cell 14_31 for the first eight cycles, and so on. If necessary, means may be provided to suppress the output for the appropriate number of cycles in each case. In practice, however, processor 10 operates over a very large number of cycles, typically more than 10^6.

Among millions of results, therefore, a few meaningless results in a short initial sequence are of no consequence. This merely corresponds to the settling time well known in the art of digital arithmetic circuits.

As an alternative to ignoring the initial results, the coefficient inputs corresponding to the unwanted terms may be set to 0, as shown in FIG. 4. This requires 2n zeros (n = 0 to 3) to be input to the nth row of processor 10 before coefficient a_n is input. In other words, the number of zeros required before a coefficient is input increases by two for each row further down the processor, and no zeros are input to the first row. This also shows how a coefficient set may be changed without introducing unwanted terms. To change coefficients a_0 to a_3 to coefficients b_0 to b_3, the input of a_n to the nth row is changed to b_n two clock cycles after a_(n-1) in the (n-1)th row is changed to b_(n-1). FIG. 4 shows the case in which the coefficients are changed from 0 to a_0 to a_3.
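The staggered coefficient change can be sketched as a simple schedule. This is an illustration only: the function `row_coefficient` and its parameters are hypothetical, rows are numbered from 0, and it is assumed that each row is presented with its coefficient bit on every cycle.

```python
def row_coefficient(n, t, new_set, old_set=None, start=0):
    """Coefficient bit presented to row n at cycle t when the change of
    coefficient set begins at row 0 on cycle `start`."""
    switch_cycle = start + 2 * n          # row n changes 2 cycles after row n-1
    if t >= switch_cycle:
        return new_set[n]
    return old_set[n] if old_set is not None else 0   # leading zeros at start-up

a = [1, 1, 1, 1]                          # example single-bit coefficients
for t in range(8):
    print(t, [row_coefficient(n, t, a) for n in range(4)])
```

With no previous set, row n receives 2n zeros before its coefficient, which is the start-up case of FIG. 4. Passing `old_set` models replacing a_0 to a_3 by b_0 to b_3 during operation.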

Reference is now made to FIG. 7, which is a schematic drawing of a further processor 50 of the invention, parts previously described being like-referenced. Processor 50 incorporates a processor 10 together with ancillary means for adapting it to more complex computations. The bottom-row cells 14_30 to 16_35 have cumulative sum outputs 30, which are connected via a programmable clocked delay unit 52 to an 11-bit clocked full adder 54. The bottom-row data outputs 28 are connected to data output lines 56_0 to 56_3. Adder 54 has eleven separate one-bit adder cells 58_0 to 58_10, one of which is shown in more detail in FIG. 8. Each adder cell 58 has first and second sum inputs 60a and 60b, a carry input 62, a carry output 64 and a sum output 66. Sum inputs 60a and 60b and carry input 62 are each in series with a respective clocked one-bit latch 68_a, 68_b and 68_c. Carry bits pass leftwards along adder 54, i.e. from adder cell 58_n to adder cell 58_(n+1) (n = 0 to 10). Adder cell 58_n receives and generates bits of significance n, and has a first input 60a connected to receive the output of array cell 14_3n (n = 0 to 3) or 16_3n (n = 4 or 5). The first inputs of adder cells 58_6 to 58_10 are set to 0. Processor 10 thus supplies the six least significant bits of the first input to the 11-bit adder 54.

The second inputs 60b of adder cells 58_0 to 58_10 are connected to respective input lines 70_0 to 70_10, and the adder outputs 66 are connected to respective output lines 72_0 to 72_10. Delay unit 52 is arranged to delay the signal from each of the bottom-row cells of processor 10 by the same programmable number of clock cycles. Unit 52 may for example contain a chain of one-bit clocked latches in series with each array output, the number of latches in the chain being variable according to the delay required.

All the latches of processor 10, delay unit 52 and adder 54 are clocked synchronously by the same two-phase clock (not shown).

Processor 50 operates as follows. Because correlation is an additive operation, a computation may be divided into sub-computations and recombined later, provided that the correct timing and bit significance are maintained. Delay unit 52 provides the correct timing, and the 11-bit adder 54 provides for adjustment of bit significance. This is described below for individual cases. Processor 50 is arranged to be used together with other similar processors, all driven by the same two-phase clock.

If a correlation involving twelve single-bit coefficients a_0 to a_11 is required, three processors 50 are used. Data is input to the first processor, passes through it, and is transferred to the second processor via data output lines 56_0 to 56_3. As before, the data stream is bit-parallel, bit-staggered and word-serial. Similarly, the data output of the second processor becomes the input to the third. The first processor operates with coefficients a_0 to a_3, the second with a_4 to a_7 and the third with a_8 to a_11. The delay units 52 of the three processors are set so that the output of the first processor is delayed by 14 clock cycles, that of the second by 7 clock cycles and that of the third by 0. The second inputs 60b of the first processor's adder 54 are all set to 0, and its output lines 72_0 to 72_10 are connected via lines 70_0 to 70_10 to the second inputs 60b of the second processor's adder 54. Similarly, output lines 72_0 to 72_10 of the second processor's adder 54 are connected to input lines 70_0 to 70_10 of the third processor's adder, whose output lines provide the required correlation result.
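The arithmetic basis for chaining, namely that a long correlation is the sum of shorter ones over shifted data, can be checked functionally. The sketch below ignores all hardware timing; `correlate` and `chained_correlate` are hypothetical names, and the correlation is assumed to take the form Y_n = sum over i of a_i X_(n+i), consistent with the expression for Y_0 used in the walkthrough.

```python
def correlate(coeffs, data):
    """Y_n = sum_i coeffs[i] * data[n + i] for every n with a full window."""
    span = len(data) - len(coeffs) + 1
    return [sum(a * data[n + i] for i, a in enumerate(coeffs)) for n in range(span)]

def chained_correlate(coeffs, data, rows=4):
    """Split the coefficients into blocks of `rows`, one block per processor,
    and accumulate the partial results as the chained adders would."""
    span = len(data) - len(coeffs) + 1
    total = [0] * span
    for k in range(0, len(coeffs), rows):
        # The processor holding a_k.. works on data words displaced by k.
        partial = correlate(coeffs[k:k + rows], data[k:])
        total = [t + p for t, p in zip(total, partial)]
    return total

coeffs = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]      # twelve single-bit coefficients
data = [(7 * i + 3) % 16 for i in range(20)]        # arbitrary 4-bit data words
assert chained_correlate(coeffs, data) == correlate(coeffs, data)
```

In the hardware the displacement by k words arises from the data passing through the preceding processors, and the delay units align the partial results in time.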

That this arrangement of three processors 50 performs the required twelve-coefficient computation may be verified as follows. Referring to FIGS. 1 to 6 once more, processor 10 requires eight clock cycles to produce a result, i.e. two cycles per row. A similar processor with twelve rows would require 24 cycles to produce a result. If the latter is divided into three four-row processors receiving the same data in succession, the first processor delivers its result after 8 cycles, the second after 16 cycles and the third after 24 cycles. There is therefore a relative delay of eight cycles between the outputs of adjacent processors. Moreover, each 11-bit adder 54 has clocked latches and so takes one clock cycle to perform an addition. The effect of the adders 54 is to reduce the relative delay by one clock cycle per stage. Each processor delay unit 52 must therefore introduce a delay equal to seven clock cycles multiplied by the number of subsequent processors. The delay units 52 of the first and second processors must accordingly provide delays of 14 and 7 clock cycles respectively. More generally, in a chain of N processors each having M rows, the delay unit of the nth processor is set to a delay of (2M - 1)(N - n) clock cycles, where n = 1 to N.
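The general delay rule can be tabulated directly. A minimal sketch, with the hypothetical helper `delay_settings` returning the delay-unit setting for each processor in chain order:

```python
def delay_settings(num_processors, rows):
    """Delay (in clock cycles) for the nth processor of the chain, n = 1..N:
    (2M - 1)(N - n), i.e. 2M cycles of array latency per later stage,
    less the one cycle contributed by that stage's output adder."""
    return [(2 * rows - 1) * (num_processors - n)
            for n in range(1, num_processors + 1)]

print(delay_settings(3, 4))   # [14, 7, 0]
```

For three four-row processors this reproduces the 14, 7 and 0 cycle settings given in the text.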

With three processors 50 each providing a 6-bit output, the accumulated sum is at most 8 bits wide. This is 3 bits less than the width of adder 54, so longer processor chains can be accommodated.
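The width figures can be reproduced with a short calculation. This sketch takes the conservative view used in the text, treating each processor output as an arbitrary 6-bit number; the chain-length bound it prints is an inference from that assumption rather than a figure stated in the patent, and the function names are hypothetical.

```python
ADDER_BITS = 11    # width of output adder 54
OUTPUT_BITS = 6    # width of each processor's array output

def sum_width(num_processors, output_bits=OUTPUT_BITS):
    """Bits needed for the sum of `num_processors` values of `output_bits` bits."""
    return (num_processors * (2**output_bits - 1)).bit_length()

def max_chain_length(adder_bits=ADDER_BITS, output_bits=OUTPUT_BITS):
    """Largest number of such outputs whose sum still fits in the adder."""
    return (2**adder_bits - 1) // (2**output_bits - 1)

print(sum_width(3))          # 8
print(max_chain_length())    # 32
```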

Multi-bit coefficients may also be accommodated by using a plurality of processors 50. For 3-bit coefficients, for example, three processors 50 are used. The first processor receives the msb (most significant bit) of each coefficient, the second receives the second least significant bit, and the third receives the lsb (least significant bit). The coefficient set for each processor is thus a respective bit slice of the multi-bit coefficient set. In contrast to the serial data flow arrangement described above, the data stream is supplied synchronously to all three processors in parallel. The third processor generates outputs of bit significance 0 to 5, the second 1 to 6 and the first 2 to 7, because these processors multiply by coefficient bits of significance 0, 1 and 2 respectively. To allow for the differing bit significance, the first processor's adder output lines 72_0 to 72_9 are connected to the second processor's adder input lines 70_1 to 70_10 respectively. The first processor's output line 72_10 is left unconnected, and the second processor's input line 70_0 is set to 0. Similar connections are made between the second processor's adder outputs and the third processor's adder inputs to implement the shift in bit significance on addition. The outputs of the first and second processors are thereby shifted by two places and one place respectively relative to the output of the third processor. As a result, for example, the output of the first (rightmost) column of the first processor is added to the outputs of the second column of the second processor and the third column of the third processor. Referring to FIG. 5 once more, however, there is a relative delay of one clock cycle between the outputs of adjacent columns of processor 10. Since data is supplied synchronously to all three processors, the same delay exists between, for example, the output of the second column of the second processor and the output of the first column of the first processor. However, the output of the first processor is delayed by one cycle in its output adder 54, and the output of the second processor is delayed by a further cycle in the second processor's adder. Both processors thus introduce the appropriate delay to align the timing of the addition with the output of the third processor.

The delay units 52 of all three processors are accordingly set to zero delay.
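The bit-slice decomposition of multi-bit coefficients can be checked in the same functional style. This is a sketch under the assumption that the correlation has the form Y_n = sum over i of a_i X_(n+i); the names `correlate` and `multibit_correlate` are hypothetical, and the left shift stands in for the offset wiring between adder outputs and inputs.

```python
def correlate(coeffs, data):
    """Y_n = sum_i coeffs[i] * data[n + i] for every n with a full window."""
    span = len(data) - len(coeffs) + 1
    return [sum(a * data[n + i] for i, a in enumerate(coeffs)) for n in range(span)]

def multibit_correlate(coeffs, data, coeff_bits=3):
    """One single-bit-coefficient correlation per coefficient bit slice,
    recombined with a shift equal to the significance of the slice."""
    span = len(data) - len(coeffs) + 1
    total = [0] * span
    for b in range(coeff_bits):
        bit_slice = [(a >> b) & 1 for a in coeffs]   # slice of significance b
        partial = correlate(bit_slice, data)
        total = [t + (p << b) for t, p in zip(total, partial)]
    return total

coeffs = [5, 3, 7, 2]                               # example 3-bit coefficients
data = [(5 * i + 1) % 16 for i in range(12)]        # arbitrary 4-bit data words
assert multibit_correlate(coeffs, data) == correlate(coeffs, data)
```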

The largest value obtainable by adding three numbers of 6, 7 and 8 bits is 9 bits long, which is easily accommodated within the 11 bits of the third processor's output adder.

Processor 10 as described is suitable only for 4-bit data, but it may be necessary to use data words wider than 4 bits. A wider array could be used, but a plurality of processors 50 may be employed instead. Two processors 50 are used for 8-bit data words. The four most significant bits are supplied to the first processor and the four least significant bits to the second. Adder output lines 72_0 to 72_6 of the first processor are connected to adder input lines 70_4 to 70_10 of the second processor; output lines 72_7 to 72_10 of the first processor are left unconnected, and input lines 70_0 to 70_3 of the second processor are set to 0. This provides a relative shift of four places in bit significance. Adjustment of relative delay depends on the data input timing. If the data is input with a one-cycle time stagger between adjacent bits across all eight bits, the only adjustment needed is for the relative delay of one clock cycle introduced by the first processor's adder. In this case the second processor's delay unit 52 is set to provide a delay of one cycle. If, however, the bit stagger exists only within each 4-bit word portion and the inputs to both processors are synchronous, the output of the first processor requires a delay of four cycles. To obtain this, the first processor's delay unit 52 is set to provide a delay of three cycles and the second processor's delay is set to 0. Alternatively, a structure with an equivalent output delay effect may be obtained by delaying the input data.
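Splitting 8-bit data words between two processors rests on the same additivity. The following sketch checks the arithmetic only, again assuming Y_n = sum over i of a_i X_(n+i); `correlate` and `split_correlate` are hypothetical names, and the shift by four places models the connection of lines 72_0 to 72_6 to lines 70_4 to 70_10.

```python
def correlate(coeffs, data):
    """Y_n = sum_i coeffs[i] * data[n + i] for every n with a full window."""
    span = len(data) - len(coeffs) + 1
    return [sum(a * data[n + i] for i, a in enumerate(coeffs)) for n in range(span)]

def split_correlate(coeffs, data):
    """Correlate the upper and lower 4-bit halves of 8-bit words separately,
    then add the upper result shifted up by four bit positions."""
    upper = correlate(coeffs, [x >> 4 for x in data])     # first processor
    lower = correlate(coeffs, [x & 0xF for x in data])    # second processor
    return [(u << 4) + l for u, l in zip(upper, lower)]

coeffs = [1, 0, 1, 1]                               # single-bit coefficients
data = [(37 * i + 11) % 256 for i in range(12)]     # arbitrary 8-bit data words
assert split_correlate(coeffs, data) == correlate(coeffs, data)
```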

As an alternative to the use of delay unit 52, an equivalent delay unit may be placed in series with the second inputs 70_0 to 70_10 of adder 54, or in series with the adder outputs 72_0 to 72_10. The number of delay clock cycles required depends on the position of the delay unit.

In this arrangement, the second processor outputs the sum of a 10-bit word and a 6-bit word, which has a maximum width of 11 bits. The full width of output adder 54 is therefore required in this example. Where a computation involving data words 4N bits wide is required, a wider output adder of (4N + 3) bits would be needed, at least in the final processor. However, each individual logic cell array need only contain cells 14 and 16 as shown in FIGS. 1 and 7. This illustrates the advantage of stage-by-stage accumulation by an output adder such as 54. Each logic cell array, such as processor 10, need only be designed to accommodate a limited amount of word growth, for instance from 4 to 6 bits. Larger computations are accumulated separately using the output adders.

From the foregoing it will be appreciated that extension of data word length, use of multi-bit coefficients and extension of correlation length are all obtainable by using an appropriate number of processors such as processor 50. Strictly speaking, the first output adder 54 in a combination of processors 50 is not required. In digital arithmetic circuit design, however, it is convenient to standardise on a single building block, in this case one that always includes an output adder.

FIG. 9 is a schematic drawing of a further processor 90 of the invention. It is equivalent to processor 50 of FIGS. 7 and 8 with bypass means added, equivalent parts being like-referenced. To avoid unnecessary complexity in the drawing, multiple line connections are shown as buses. Processor 90 includes a processor 50. An input data bus 92 is connected both to processor 50 and to a first multiplexer 96, the connection to the latter being via two banks of clocked delay latches 94a and 94b. Data output from processor 50 passes via a bus 98 to multiplexer 96, which has a data output bus 100. Multiplexer 96 connects output bus 100 to bus 92 or to bus 98 according to whether the signal at its control input 102 is 0 or 1.

A result input bus 104 is connected both to output adder 54 and to a second multiplexer 106, the connection to the latter being via two banks of clocked delay latches 108a and 108b. The second multiplexer 106 is also connected to an adder output bus 110 and to a result output bus 112. Result output bus 112 is connected to result input bus 104 or to adder output bus 110 according to whether the signal at a control input 114 is 0 or 1.

Processor 90 of FIG. 9 operates as follows. It is arranged as part of a chain of similar processors, the two neighbouring processors being indicated by chain lines 116 and 118. When processor 50 is fault-free, multiplexers 96 and 106 are supplied with logic 1 control input signals and the mode of operation is as previously described. If processor 50 is faulty, however, logic 0 control inputs are supplied to multiplexers 96 and 106, and input data and results bypass processor 50 via latch banks 94a, 94b, 108a and 108b. Each latch bank provides a delay of one cycle on each line of the corresponding bus. The individual latches (not shown) are equivalent to the latches described earlier, and are clocked by the same clock as processor 50.

A faulty processor 50 is therefore bypassed via latch banks which introduce a delay of two clock cycles into both the data stream and the result stream. The two streams undergo equal delays and so remain synchronised as before. More importantly, each bypass bus is divided by the latch banks into three comparatively short sections. If necessary, additional bypass latches may be inserted to subdivide the buses further. The advantage is that each section of a bypass bus is short enough to be switched at a clock frequency at least as high as that of processor 50. Processor 50 can be manufactured with current integrated circuit technology and can operate at high clock frequencies of 20 MHz or more. Operation at high clock frequencies is possible because connections, for example between logic cells 14 and 16 in FIG. 1, exist only between neighbouring cells. A bypass bus, however, is necessarily much longer, and its RC time constant is correspondingly larger. Such a time constant would limit the maximum frequency of a serial processor chain to an undesirably low value. If the bypass buses were not subdivided into sections capable of fast switching, bypassing a faulty processor would therefore produce a sharp fall in the maximum clock frequency. If the maximum frequency fell below the frequency of clock 22 of FIGS. 2 and 3, the serial chain would fail to function even with only one processor bypassed. Using bypass buses subdivided by clocked latches into sections capable of fast switching thus makes it possible to construct a fault-tolerant processor without adversely affecting operating speed.

典型的な故障許容プロセッサ連鎖は、例えば４つのプ
ロセッサを要する演算のために直列の５つのプロセッサ
90を組込んでいる。従って任意の１つの故障プロセッサ
又は不要プロセッサをバイパスできる。より大きい故障
許容範囲が必要なときは、付加的プロセッサを追加し得
る。参考文献１に記載の如き従来のプロセッサは、動作
速度を低下させないで故障許容範囲を得るこのような構
成を用いることはできない。その理由は、本発明によれ
ばデータと結果とが同方向でプロセッサ10又は連鎖プロ
セッサ90を通過するからである。バイパスラッチはデー
タストリームと結果ストリームとを等しく遅延させ、両
者間に相対遅延は導入されない。従って故障プロセッサ
のバイパスによって、連鎖の先行プロセッサから後続プ
ロセッサまでのデータストリームと結果ストリームとの
相対的タイミングが維持される。参考文献１に記載のプ
ロセッサはデータと結果とが向流的に移動するように設
計されている。かかる連鎖デバイスでは中央プロセッサ
が一方の隣接プロセッサからデータを受信し他方の隣接
プロセッサから結果を受信する。これら隣合うプロセッ
サの１つをラッチ付きバスでバイパスするときは、デー
タストリーム又は結果ストリームの一方が中央プロセッ
サで遅延されるが、両方が遅延されることはできない。
このため演算のタイミングが破壊され無意味な結果が発
生する。その結果、高速故障許容性プロセッサ連鎖の構
造を従来技術の向流アーキテクチャーの使用によって得
ることはできない。このような構造を得るためには、単
一方向のデータ流及び結果の流れが生じるように構成さ
れた本発明のプロセッサを使用する必要がある。これが
本発明の重要な利点である。現在、集積回路技術はウェ
ーハ規模の集積に移りつつあり、ここでは高速故障許容
性アーキテクチャーが不可欠である。ある程度の故障許
容性がないと、ウェーハ規模の回路効率が実質的に０に
なる。何故なら、数百個の素子を担持するウェーハで１
つの故障素子が生じるとウェーハ全体の作動が無効にな
るからである。A typical fault-tolerant processor chain incorporates, for example, five processors 90 in series for an operation that requires four. Any one failed or redundant processor can therefore be bypassed. Further processors may be added when greater fault tolerance is required. A conventional processor as described in Reference 1 cannot be used in a configuration of this kind, which obtains a degree of fault tolerance without reducing operating speed. Such a configuration is possible in the invention because data and results pass through the processor 10 or the chained processor 90 in the same direction. The bypass latches delay the data stream and the result stream equally, so no relative delay is introduced between them. Bypassing a failed processor therefore preserves the relative timing of the data and result streams passing from the preceding processor to the subsequent processor in the chain. The processor described in Reference 1, by contrast, is designed so that data and results move in counterflow. In a chain of such devices, a central processor receives data from one adjacent processor and results from the other. If one of these adjacent processors is bypassed with a latched bus, either the data stream or the result stream reaching the central processor is delayed, but not both. Computation timing is therefore destroyed and meaningless results are produced. Consequently, a fast fault-tolerant processor chain cannot be built with the prior-art counterflow architecture; it requires the processor of the invention, which is configured to produce unidirectional data and result streams. This is an important advantage of the invention. Integrated circuit technology is currently moving towards wafer-scale integration, for which a fast fault-tolerant architecture is essential. Without some fault tolerance, the yield of wafer-scale circuits is effectively zero, because a single defective element among the hundreds of devices on a wafer invalidates the operation of the entire wafer.
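The timing argument above can be illustrated with a minimal Python sketch. It is a toy model rather than the patented circuit: the tagged word lists and the helper names `latch_delay` and `aligned` are assumptions made for illustration, and a bypass is modelled as a single latch stage.

```python
def latch_delay(stream, cycles):
    """Delay a stream by `cycles` clock cycles (None marks an empty slot)."""
    return [None] * cycles + list(stream)

def aligned(data, results):
    """True if every data word meets the result word carrying the same tag."""
    return all(d == r for d, r in zip(data, results))

# Tags of the data words and partial results leaving the preceding processor.
data = [0, 1, 2, 3]
results = [0, 1, 2, 3]

# Unidirectional flow: the bypass latch sits on both buses,
# so both streams are delayed by the same amount.
uni_ok = aligned(latch_delay(data, 1), latch_delay(results, 1))

# Counterflow: only the data bus passes through the bypassed neighbour,
# so the data stream alone picks up the latch delay.
counter_ok = aligned(latch_delay(data, 1), latch_delay(results, 0))

print(uni_ok, counter_ok)
```

In the unidirectional case the two streams stay in step; in the counterflow case matching words no longer meet at the central processor, which is the loss of computation timing described above.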

Reference is now made to FIG. 10, in which elements equivalent to those of FIG. 2 carry the same reference numerals increased by 200. FIG. 10 shows a modified gated full adder logic cell 214 suitable for use in the processors 10, 50 or 90. The only difference between cell 214 and cell 14 is that cell 214 has two data output latches 218_x1 and 218_x2 and only one result output latch 218_y. Cell 214 is used in a processor (not shown) with interconnections exactly the same as those of the cells 14 of FIG. 1. In a processor incorporating logic cells 214, results move at twice the speed of data. Coefficients are input to the processor in the reverse order to that of FIGS. 4 to 6. For example, the correlation coefficients a₀ to a₃ are input to cells 14₃₀ to 14₀₀ respectively, rather than to cells 14₀₀ to 14₃₀ as in FIG. 4. Analysing the operation of the processor as before shows that the correlation computation is obtained from this scheme of coefficient input to a processor incorporating cells 214. The analysis is the same as before and is not repeated here. The method of exchanging coefficient sets differs slightly: in FIG. 4, exchanging coefficient sets between adjacent rows introduces a delay of two clock cycles, whereas with FIG. 10 it introduces a delay of one clock cycle. To compensate for the slower data flow and faster result flow, arrays of processors 50 or 90 incorporating cells 214 require the result accumulation timing to be adjusted relative to the arrays described earlier. The necessary adjustment will be apparent to those skilled in digital electronics and is not described here.
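As a rough illustration, the arithmetic of one gated full adder cell and the relative rates of data and results in a processor built from cells 214 might be sketched in Python as follows. The function names are hypothetical, the latch counts are those stated above for cell 214 (two data output latches, one result output latch), and the sketch models a single cell's logic rather than the timing of the whole array.

```python
from itertools import product

def gated_full_adder(x, a, y_in, c_in):
    """One cell: add the gated product x AND a to the incoming sum and carry bits."""
    p = x & a                                  # data bit gated by coefficient bit
    y_out = y_in ^ p ^ c_in                    # new cumulative sum bit
    c_out = (y_in & p) | (c_in & (y_in ^ p))   # new carry bit
    return y_out, c_out

def cells_traversed(clock_cycles, latches_per_cell):
    """Cells a bit has passed after `clock_cycles`, one latch being one cycle."""
    return clock_cycles // latches_per_cell

# Assumed from the text: cell 214 has two data latches and one result latch.
CELL_214_DATA_LATCHES = 2
CELL_214_RESULT_LATCHES = 1

data_cells = cells_traversed(8, CELL_214_DATA_LATCHES)
result_cells = cells_traversed(8, CELL_214_RESULT_LATCHES)

# Check the cell against ordinary integer addition for all 16 input patterns.
cell_is_adder = all(
    2 * gated_full_adder(x, a, y, c)[1] + gated_full_adder(x, a, y, c)[0]
    == y + (x & a) + c
    for x, a, y, c in product((0, 1), repeat=4)
)
```

With these latch counts a result bit passes two cells in the time a data bit passes one, which is the factor of two referred to above.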

The processor of the invention may be arranged to operate with two's complement data and/or coefficients. The embodiments described so far use a four-bit data stream. If this data were in two's complement form, the sign bit (most significant bit) of each input data word would have to be replicated until the word had the same width as the output result. Six-bit input data would therefore be required. More specifically, a four-bit two's complement data word with bits abcd would be represented as aaabcd. The processor 10 cannot accept six-bit input. Instead of the 4 × 4 array of gated full adders and five half adders of FIG. 1, a 4 × 6 array of gated full adders is used with the interconnections of FIG. 1. Such an array processes four-bit two's complement data sign-extended to six bits in the same way that the processor 50 processes four-bit all-positive data. In general, the required array is rectangular, with the number of cells in each row equal to the bit width of the output result from the final row.
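The sign extension described above (abcd becoming aaabcd) can be sketched in Python as follows. The helper names `sign_extend` and `to_signed` are illustrative and not taken from the patent; the six-bit width is checked on the assumption of four single-bit coefficients and four-bit all-positive data, as in the embodiment described.

```python
def sign_extend(word: int, from_bits: int, to_bits: int) -> int:
    """Replicate the sign bit of a from_bits-wide word up to to_bits."""
    sign = (word >> (from_bits - 1)) & 1
    if sign:
        # Set every bit position from from_bits up to to_bits - 1.
        word |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return word

def to_signed(word: int, bits: int) -> int:
    """Interpret a bits-wide pattern as a two's complement integer."""
    return word - (1 << bits) if word & (1 << (bits - 1)) else word

abcd = 0b1011                      # a=1, b=0, c=1, d=1, i.e. -5
aaabcd = sign_extend(abcd, 4, 6)   # sign bit a replicated twice

# Width of the largest all-positive result: four coefficients of 1
# multiplying four-bit data words of value 15.
result_width = (4 * 0b1111).bit_length()

print(format(aaabcd, "06b"), to_signed(aaabcd, 6), result_width)
```

The extended word represents the same value as the original, which is why the wider array can treat it like any other six-bit input.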

Two's complement data may also be accommodated in multiple processors configured as described with reference to FIGS. 7, 8 and 9. Sign extension must be provided for the results supplied to an output adder equivalent to the adder 54, and for single-byte input data or the most significant byte of input data divided into several bytes for separate processing. In particular, a result containing a sign bit that enters an output adder must be sign-extended to the full width of the overall result. The overall result is obtained from the final output adder of the processor assembly.

Two's complement coefficients may also be used in the processor of the invention. For single-bit coefficients, multiplication is by 0 or 1, the latter being negative. Since 0 makes no positive contribution, the result is entirely negative. The calculation is therefore equivalent to that for all-positive coefficients. For multi-bit coefficients, only the most significant processor contains negative coefficients, and its result is entirely negative. The two's complement of this result is formed by known gating means and sign-extended to the full width of the overall result before being input to the final output adder.
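A minimal Python sketch of this treatment of multi-bit two's complement coefficients is given below. It assumes non-negative data and models each bit slice as a plain sum rather than as a systolic array; the function name `bit_slice_sum` and the example values are invented for illustration.

```python
def bit_slice_sum(coeffs, data, coeff_bits):
    """Compute sum(coeffs[i] * data[i]) from single-bit coefficient slices."""
    mask = (1 << coeff_bits) - 1
    total = 0
    for j in range(coeff_bits):
        # Slice j holds bit j of every coefficient's two's complement pattern.
        slice_bits = [((c & mask) >> j) & 1 for c in coeffs]
        # Each slice result is non-negative for non-negative data.
        partial = sum(b * x for b, x in zip(slice_bits, data))
        if j == coeff_bits - 1:
            total -= partial << j   # most significant slice has negative weight
        else:
            total += partial << j
    return total

coeffs = [3, -2, -4, 1]   # three-bit two's complement coefficients
data = [5, 7, 2, 9]       # non-negative data words

direct = sum(c * x for c, x in zip(coeffs, data))
sliced = bit_slice_sum(coeffs, data, 3)
print(direct, sliced)
```

Only the most significant slice is subtracted, which corresponds to forming the two's complement of that slice's result before the final addition.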

The principles of digital arithmetic circuits for two's complement numbers are well known and are not described in detail here.

Multiple processors similar to the processor 50 can use two's complement data provided that each incorporates a rectangular array of gated full adder cells with a number of cells in each row equal to the bit width of each result. Where the input data is divided into individual bytes for separate processing, the most significant bit of the output result of the processor receiving the most significant byte is sign-extended to the full width of the overall result, that is, of the result output from the most significant output adder equivalent to the adder 54 of FIG. 4.

Although embodiments of the invention have been described in terms of correlation, they may also be used for convolution. This is described in, for example, Reference 1, and may be shown as follows. The convolution operation is defined by

$$Y_n = \sum_{i=0}^{N-1} A_i X_{n-i} \qquad (9)$$

The correlation operation is defined by

$$Y_n = \sum_{i=0}^{N-1} A_i X_{n+i} \qquad (10)$$

From equation (9), the fifth convolution result Y₄ of a four-point computation (N = 4) is

$$Y_4\ (\text{convolution}) = A_0X_4 + A_1X_3 + A_2X_2 + A_3X_1 \qquad (11)$$

From equation (10), the second correlation result Y₁ of a four-point computation is

$$Y_1\ (\text{correlation}) = A_0X_1 + A_1X_2 + A_2X_3 + A_3X_4 \qquad (12)$$

Reversing the order of the right-hand side of equation (12) and substituting $B_i = A_{3-i}$, i = 0 to 3, gives

$$Y_1\ (\text{correlation}) = B_0X_4 + B_1X_3 + B_2X_2 + B_3X_1 \qquad (13)$$

Equations (11) and (13) are equivalent, which shows that convolution and correlation are equivalent mathematical operations. Convolution of data with a coefficient set is equivalent to correlation of the same data with the same coefficients in reverse order. For a given coefficient set A₀ to A_k, the processor of the invention performs convolution or correlation according to whether input of the coefficient words starts with A₀ at the first row or at the last row. The reverse applies to a processor incorporating the cells 214 of FIG. 10. There is one minor difference: the first few terms of a convolution result series have no counterpart in the corresponding correlation series. For example, equation (10) cannot generate Y₀ to Y₂ of equation (9). In practice this is unimportant. Digital arithmetic circuits are used to generate very large numbers of results, so a few results added or missing at the start of a series of, say, several million can be ignored.
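The equivalence of equations (11) and (13) can be checked numerically with a short Python sketch. The function names and sample values are illustrative only, and the sketch assumes that terms whose data index falls outside the available samples are simply omitted.

```python
def convolve_term(a, x, n):
    """Y_n = sum over i of A_i * X_(n-i), omitting indices outside the data."""
    return sum(a[i] * x[n - i] for i in range(len(a)) if 0 <= n - i < len(x))

def correlate_term(a, x, n):
    """Y_n = sum over i of A_i * X_(n+i), omitting indices outside the data."""
    return sum(a[i] * x[n + i] for i in range(len(a)) if 0 <= n + i < len(x))

a = [2, 3, 5, 7]            # coefficients A_0 to A_3
x = [1, 4, 6, 8, 9, 10]     # data X_0 to X_5
b = a[::-1]                 # B_i = A_(3-i)

y1_correlation = correlate_term(a, x, 1)   # right-hand side of equation (12)
y4_convolution = convolve_term(b, x, 4)    # right-hand side of equation (13)
print(y1_correlation, y4_convolution)
```

Correlation result n with coefficients A matches convolution result n + 3 with the reversed coefficients B, so convolution results 0 to 2 have no correlation counterpart, in line with the minor difference noted above.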

[Brief description of drawings]

FIG. 1 is a schematic diagram of a processor of the invention configured to perform correlation. FIGS. 2 and 3 are detailed views of a gated full adder cell and a half adder cell, respectively, of the processor of FIG. 1. FIGS. 4, 5 and 6 illustrate the timing of data flow and results in the processor of FIG. 1 on respective clock cycles. FIG. 7 is a schematic diagram of a processor of the invention comprising the processor of FIG. 1 together with output delay means and accumulating means, for building a processor array for a large computer. FIG. 8 is a detailed view of a full adder cell used in the accumulating means of FIG. 7. FIG. 9 is a schematic diagram of the processor of FIG. 7 with the bypass connections needed to build a fault-tolerant processor array. FIG. 10 illustrates a modified gated full adder cell for use in the processor of FIG. 1.

10 ... processor, 12 ... array, 14 ... logic cell, 16 ... half adder cell, 18, 20 ... latches, 22 ... clock, 52 ... delay unit.

Continuation of front page

(72) Inventor: Richard Anthony Evans, New Craft, Coddington, Ledbury, Herefordshire HR8 1JJ, United Kingdom (no street number)
(72) Inventor: John Graham McWhirter, 27 Moorlands, Malvern Wells, Worcestershire WR14 4PS, United Kingdom

Claims

[Claims]

1. A digital processor for performing correlation and convolution operations on a bit-parallel, word-serial, bit-zigzag data stream of M-bit words and N single-bit coefficients, characterized in that: (a) the processor includes an array of logic cells having N rows and M columns; (b) each logic cell is configured to (i) input a data bit, a carry bit and a cumulative sum bit, (ii) output the data bit, and (iii) generate an output cumulative sum bit and an output carry bit corresponding to the sum of the input cumulative sum bit, the input carry bit and the product of the input data bit with the coefficient bit associated with the cell's row; (c) cell interconnection lines are configured to transmit bits along the rows and columns, the lines including clock-activated delay means for storing and forwarding the bits; and the cell interconnection lines and the delay means are configured so that the cumulative sum bits and the data bits are transmitted in a single direction down the columns of the array at rates of which one is twice the other, and so that the carry bits are transmitted along the rows of the array, in the direction of increasing data bit significance, faster than both the cumulative sum bits and the data bits.

2. Processor according to claim 1, characterized in that it includes additional cell interconnection lines and clock-activated delay means configured to transmit coefficient bits along the rows of the array at the transmission rate and in the transmission direction of the carry bits.

3. Processor according to claim 1 or 2, characterized in that it includes programmable clock-activated delay means configured to delay the array output, and a multi-bit clock-activated full adder configured to add the delayed array output to a second summing input.

4. A bypass connection for the input data and the second summing input, the connection being subdivided by clock-activated latches. Processor according to paragraph.

5. Processor according to claim 3 or 4, characterized in that the width of the full adder is sufficient to accommodate the relative difference of the bit digits between the inputs.

6. Processor according to claim 5, characterized in that the width of the full adder is sufficient to accommodate the relative difference of the bit digits of M bits.

7. Processor according to any one of claims 1 to 6, characterized in that it includes log₂ n half adders on an extension of the nth row of logic cells, where n is from 1 to N and log₂ n is rounded to an integer if necessary.

8. Processor according to any one of claims 1 to 6, characterized in that it includes log₂ (n−1) half adders on an extension of the nth row of logic cells, with connections including delay means between the carry output of the (n−1)th row and the sum input of the appropriate nth-row half adder, where n is from 2 to N and log₂ (n−1) is rounded to an integer if necessary.