JPS63193232A

JPS63193232A - Parallel data processor

Info

Publication number: JPS63193232A
Application number: JP62024784A
Authority: JP
Inventors: Toshio Kondo; 利夫近藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1987-02-06
Filing date: 1987-02-06
Publication date: 1988-08-10
Anticipated expiration: 2011-07-24
Also published as: JP2516611B2

Abstract

PURPOSE: To obtain a higher-performance speed-up mechanism for propagation operations by using the two propagation arithmetic systems built into each processor and advancing two propagation operations in parallel in each processor block. CONSTITUTION: The originating processor is assigned by loading '1' into the control register 19 of the processor 1 in the leftmost processor block 20 and '0' into the other registers 19. With address A0 of the storage unit accessed in each processor, the ALUs 11 are set to addition and propagation addition is started. The upper and lower propagation arithmetic systems, formed in each block by cascading the ALUs 11a and 11b, then perform 1-bit propagation addition on address A0 in parallel for both cases in which the block input is '0' or '1', and when the addition finishes the output sum is determined starting from the block 20 at the left end. The determined sum is stored in address A0 and its carry in register 18. Similar processing is repeated for A1 to An of the storage unit.

Description

DETAILED DESCRIPTION OF THE INVENTION

(Industrial Field of Application)

The present invention relates to a parallel data processing device whose built-in processor array is a cascade of processors of nearly identical configuration. The connections between processors are simple, regular and local, and few connection lines are needed between the modules (for example LSIs or boards) that each carry several processors of the array. Despite this, the device executes propagation operations, that is, operations and transfers that advance by propagating sequentially through the processors of the array and that allow efficient computation and transfer, at higher speed.

(Prior Art)

In designing a processor-array type parallel data processing device (hereinafter called an array processor), how to transfer data and compute between distant processors at high speed is one of the important problems. In general, pursuing speed makes the number of connection lines between processors grow sharply or destroys the simplicity of the interconnection, and the device becomes difficult to realize. The problem is especially serious in an array processor made of a two-dimensional array of processors (a two-dimensional array processor), because the number of processors is large. For two-dimensional array processors, the propagation operation method [A. P. Reeves, "A Systematically Designed Binary Array Processor," IEEE Trans. Comput., vol. C-29, pp. 278-287 (1980); hereinafter Reference 1], a speed-up method that adds few connection lines, is therefore useful. A propagation operation is an operation in which each processor computes and passes its result on to the next processor over the connection line between adjacent processors, one after another, without clock synchronization along the way.

When the arithmetic function is set to pass-through, the operation reduces to a plain data transfer. The method is useful because, although it hardly reduces the ease of implementing the device compared with a transfer method that uses an ordinary bus, it has the following advantages:

(1) Fewer synchronizations and fewer memory and register accesses are needed, so processing that can be carried out by handing data from processor to processor while transforming it (for example, summation) is accelerated.

(2) Unlike a bus, the transfer system is not occupied by a single set of data, so several sets of data can be transferred simultaneously on the same system as long as their transfer sections do not overlap.

Propagation operations are also effective as a means of efficiently executing fill processing and connected-region extraction in image processing [Reference 1]. The method has drawbacks, however. The propagation time grows in proportion to the number of processors traversed, so transfer and computation take too long when many processors are traversed. In addition, the processors actually computing at any moment are limited to those currently taking part in the propagation (in other words, those on the leading wavefront of the propagation), which lowers the effective degree of parallelism. We therefore proposed a method (hereinafter the bypass-added propagation operation method) that adds bypasses to the propagation path at suitable intervals and raises the degree of parallelism by performing propagation operations hierarchically [Japanese Examined Patent Publication No. Sho 58-29550; Japanese Patent Application No. Sho 56-016659].

Subsequent study has shown, however, that with this bypass the procedure for hierarchization often becomes so involved that it does not help speed things up. One example is the computation of the lengths of white or black runs (run lengths) on a binary line, which is used in coding for image data compression, feature extraction in character recognition, and so on.

Obtaining run lengths by propagation operation is simple. Suppose the pixels (white or black points) making up a line are assigned to processors one to one. It then suffices to execute a rightward propagation addition (from head to tail) in which the leftmost, leading processor of each white run and each black run on the line is an originating processor and every other processor is an adding processor. Here an originating processor is one that, during propagation addition, ignores the input from its left neighbor and outputs the logical value "1" to its right neighbor, and an adding processor is one that adds 1 to the input from its left neighbor and outputs the result to its right neighbor. As is clear from this behavior, as the propagation advances each processor obtains its distance from the left end of the run to which it belongs. When the propagation reaches the right end of a run, the processor at that right end holds the run length. If one tries to execute this run-length calculation hierarchically on a one-dimensional processor array that uses the bypass-added propagation operation method (see Fig. 6), the procedure becomes complicated, as follows.

[Step 1] Whether the four processors (110a to 110d) that can be skipped by the bypass 113 include the leading processor of a white or black run is determined by a propagation logical operation among the four processors, and the result is written into the bypass control register 115. The bypass selector 114 is thereby programmed to select the bypass side if none of the four processors is the left end of a run, and the non-bypass side if one is.

[Step 2] With the selectors 114 set to the non-bypass side, a propagation addition is executed in which the leftmost processor of each run and every processor 110a are originating processors and the others are adding processors, and each PE (processor element) holds its result.

[Step 3] Each processor 110d is set to output 0, and each processor 110e is set to output its input plus the result obtained in Step 2 if it is an adding processor, or 1 if it is an originating processor. A propagation addition is then executed with the selectors 114 controlled by the bypass control registers 115 programmed in Step 1, and the result is held in the processors 110e.

[Step 4] A propagation transfer is executed in which each processor 110a that is not the left end of a run is an originating PE that outputs the result obtained in Step 3, the leftmost processor of each run is an originating PE that outputs 0, and every other processor is a transfer processor that passes its input through unchanged. Each PE holds the result.

[Step 5] Each PE adds the result obtained in Step 2 and the result obtained in Step 4, and the sum is the final result of the propagation addition.

Each of the steps above takes several instructions or more to execute, so unless the array is very large the total number of machine cycles is greater than without hierarchization. Most of these instructions and steps could of course be eliminated by adding suitable hard-wired logic and executing them in parallel. Doing so, however, destroys the simplicity and regularity of the array processor's hardware and makes the device harder to realize.
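The plain, non-hierarchical run-length computation that the bypass procedure is compared against can be sketched in software. The following is a minimal serial model, not the hardware itself: the function names and the list representation of a binary line are assumptions made for illustration, and the loop stands in for the rightward propagation between neighboring processors.

```python
from typing import List


def run_length_by_propagation(pixels: List[int]) -> List[int]:
    """Serial model of rightward propagation addition over one binary line.

    Returns, for every processor, its 1-based distance from the left end
    of the white or black run it belongs to.
    """
    dist: List[int] = []
    carried = 0  # value arriving from the left neighbor
    for i, p in enumerate(pixels):
        # The leftmost pixel of each run acts as an originating processor.
        originating = i == 0 or p != pixels[i - 1]
        out = 1 if originating else carried + 1  # adding processor: input + 1
        dist.append(out)
        carried = out
    return dist


def run_lengths(pixels: List[int]) -> List[int]:
    """Read the run length at the right end of every run."""
    dist = run_length_by_propagation(pixels)
    last = len(pixels) - 1
    return [d for i, d in enumerate(dist)
            if i == last or pixels[i + 1] != pixels[i]]


line = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
print(run_length_by_propagation(line))  # [1, 2, 3, 1, 2, 1, 1, 2, 3, 4]
print(run_lengths(line))                # [3, 2, 1, 4]
```

The serial loop makes the cost visible: the result at the right end of a long run is available only after the value has passed through every processor of that run.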

A one-dimensional array processor, on the other hand, can efficiently emulate in the column direction the processing that a two-dimensional array processor performs with propagation operations (for example projection, run-length calculation, and crossing-line counting), provided the two-dimensional data are allocated so that the column direction runs along the depth of each processor's local memory and the row direction runs along the processor array. This is because, in a one-dimensional array processor, a raster scan is obtained simply by changing the local memory address used to access the data, so the one-dimensional processor array is easily assigned to the leading wavefront of the propagation and no data transfer between distant processors is needed. The row direction can be emulated in the same way if the local memories of all the processors together form a two-dimensional access memory (a memory that can be accessed from either the row or the column direction), by reading the data with rows along the depth direction and columns along the array direction. In other words, a one-dimensional array processor can execute propagation operations at high speed by using a two-dimensional access memory.
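As a rough illustration of the column-direction emulation, the sketch below treats each column of a small binary image as the local memory of one processor and steps through the memory addresses row by row. The function name and the nested-list image layout are assumptions for illustration only; the inner loop over columns stands for work that all processors would do at the same time.

```python
from typing import List


def column_run_distances(image: List[List[int]]) -> List[List[int]]:
    """Emulate column-direction run-length propagation on a 1-D array.

    image[r][c] is the pixel at row r, column c. Processor c is assumed to
    hold column c in its local memory, with the row index as the address.
    """
    if not image:
        return []
    n_cols = len(image[0])
    acc = [0] * n_cols           # one accumulator per processor
    out: List[List[int]] = []
    for r, row in enumerate(image):          # one local-memory address per step
        for c in range(n_cols):              # conceptually parallel over processors
            starts_run = r == 0 or row[c] != image[r - 1][c]
            acc[c] = 1 if starts_run else acc[c] + 1
        out.append(acc.copy())
    return out


img = [[0, 1],
       [0, 1],
       [1, 1]]
print(column_run_distances(img))  # [[1, 1], [2, 2], [1, 3]]
```

No value ever moves between processors here, which is why the column direction needs no interprocessor propagation at all.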

Two-dimensional access memories are built in two main ways. One adds an address conversion circuit and a data-reordering network (which corresponds to an interprocessor connection network) to an array of ordinary memories so that it behaves equivalently to a two-dimensional access memory [Motooka et al., "Associative processing system using two-dimensional memory," IECE Technical Report EC76-80; hereinafter Reference 2]. The other builds the two-dimensional access memory directly on an IC chip [Morita and Yamane, "Circuit construction method for multi-address memory," 1986 IECE General National Convention, 476; hereinafter Reference 3]. Compared with the propagation operation approach described earlier, however, both have drawbacks: (1) the two-dimensional access memory hardware is considerably heavy; (2) the processor array cannot be enlarged simply by cascading identical modules.

(Problems to Be Solved by the Invention)

An object of the present invention is to provide a parallel data processing device containing a regular array of processors that achieves a higher-performance mechanism for speeding up propagation operations, overcoming the shortcomings of conventional speed-up mechanisms for interprocessor propagation operations, which gave insufficient speed-up or were ineffective for processing such as run-length calculation.

(Means for Solving the Problems, and Operation)

The principal feature of the present invention is the following configuration. A row of processors involved in an interprocessor propagation operation is divided into blocks of suitable size. In each block, the propagation operations for both possible bit-wise inputs to the block, 0 and 1, are carried out before the input arrives, and one of the two results is then selected as the correct propagation operation according to the actual input to the block (0 or 1).

The conventional techniques assume that each processor uses a single propagation arithmetic system and performs one propagation operation on the data. The present invention differs in that each processor contains two propagation arithmetic systems, and each processor block advances two propagation operations in parallel.
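The idea of computing both candidate results per block and selecting afterwards can be sketched as follows. This is a software model under stated assumptions, not the circuit: `Cell`, `propagate`, `precompute_block` and `blockwise_propagate` are hypothetical names, a cell is modeled as any 1-bit function of the left input and a local bit, and the list comprehension over blocks stands for work the blocks would do simultaneously.

```python
from typing import Callable, List, Tuple

# A cell maps (bit arriving from the left, local bit) to the bit sent right.
Cell = Callable[[int, int], int]


def propagate(block: List[int], x_in: int, cell: Cell) -> List[int]:
    """Ordinary propagation through a chain of cells, one after another."""
    outs: List[int] = []
    x = x_in
    for local in block:
        x = cell(x, local)
        outs.append(x)
    return outs


def precompute_block(block: List[int], cell: Cell) -> Tuple[List[int], List[int]]:
    """Run both propagation systems: one assuming input 0, one assuming input 1."""
    return propagate(block, 0, cell), propagate(block, 1, cell)


def blockwise_propagate(data: List[int], block_size: int, cell: Cell,
                        first_in: int = 0) -> List[int]:
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Phase 1: every block prepares both candidates (in hardware, all at once).
    candidates = [precompute_block(b, cell) for b in blocks]
    # Phase 2: a single selection per block as the real input arrives.
    result: List[int] = []
    x = first_in
    for for_zero, for_one in candidates:
        chosen = for_one if x else for_zero
        result.extend(chosen)
        x = chosen[-1]
    return result


def xor_cell(x: int, local: int) -> int:
    return x ^ local  # sum bit of a 1-bit addition without stored carry


data = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1]
assert blockwise_propagate(data, 5, xor_cell) == propagate(data, 0, xor_cell)
```

The structure resembles a carry-select adder: the second phase touches each block once instead of each processor once, at the price of duplicating the propagation logic inside every processor.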

(Embodiments)

Embodiments of the present invention are described in detail below with reference to the drawings.

Fig. 1 illustrates a first embodiment of the present invention: a parallel data processing device comprising a regular array of 5 × N 1-bit processors (here a one-dimensional array made up of N processor blocks), a control unit, and so on. In the overall configuration diagram (a), 20 is a processor block consisting of five processors; 21, 22 and 23 are terminals of the block; 25 is a connection line between blocks; 26 is the input terminal to the processor array; 27 is the output terminal of the processor array; 30 is a control unit that generates the signals controlling the processor array; and 31 is a signal line that broadcasts the generated signals to all processors.

(b) is a configuration diagram of the processor block, in which 1 and 1' are 1-bit processors, 2 to 8 are terminals of the 1-bit processors, and 9 and 10 are connection lines between processors.

The logical values "0" and "1" are input to terminals 2 and 3, respectively, of the leftmost processor 1. (c) and (d) are configuration diagrams of the 1-bit processors 1 and 1'. Their elements are as follows.

11a, 11b: 1-bit arithmetic logic units (ALUs) that are set to the same function during a propagation operation.
12, 13: selectors that, during a propagation operation, select and output the input from ALU 11a if the input value at terminal 6 is "0" and the input from ALU 11b if it is "1" (outside propagation operations they select the input from ALU 11a).
14: a storage unit 1 bit wide.
15a, 15b: selectors that, during a propagation operation, select and output the inputs from terminals 2 and 3 (when the processor is an originating processor, or outside propagation operations, they select the input from the 1-bit register 16).
17: a selector that, during a left-to-right shift transfer between processors, selects and outputs the input from the processor block on the left in processor 1 and the input from the processor on the left in processor 1' (outside shift transfers it selects the input from selector 12).
16, 18, 19: 1-bit registers.

Register 19 is a control register that defines the originating processor: selectors 15a and 15b select the input from register 16 when its content is 1 and the input from terminal 2 or 3 when it is 0. The only difference between the 1-bit processors 1 and 1' is whether the left-side input of selector 17 receives the signal from terminal 6 or the signal from terminal 2.

As an example of a propagation operation in this embodiment, the following steps compute, by bit-serial propagation addition (see the explanation of bit-serial propagation addition given later), the sum of the L 1-bit data items that the processors hold at address A0 of their storage units.

[Step 1] Load 1 into the control register 19 of the leftmost processor (the leftmost processor 1 in the leftmost processor block) and 0 into the control registers of all other processors, so that only the leftmost processor is assigned as the originating processor. (The leading processor of the propagation addition thus coincides with the physical leading processor.)

[Step 2] In all processors, clear addresses A1 to An of the storage unit and registers 16 and 18 to 0.

[Step 3] With address A0 of the storage unit accessed, set the ALUs to addition and start the propagation addition. In each processor block, the upper propagation arithmetic system, formed by the cascade of ALUs 11a, and the lower one, formed by the cascade of ALUs 11b, then begin the 1-bit propagation addition for address A0 in parallel for both possible block inputs, "0" and "1", as is clear from the inputs to terminals 2 and 3 of the leftmost processor in Fig. 1(b). The key point is that the propagation addition in each processor block proceeds without waiting for the input to terminal 21. Once the two propagation additions have finished in every processor block, the output sum (the value at terminal 22) is determined first in the leftmost processor block, where both propagation arithmetic systems are performing the same operation.

The next processor block receives this sum and merely switches its selectors 12 with it; its operation is then complete and the sum at the block output is determined. This is possible because the in-block propagation addition has already finished for both possible inputs ("0" and "1"). Propagation between processor blocks after the two propagation additions have completed in each block is therefore extremely fast, and the overall propagation addition time is greatly shortened.

[Step 4] Store the sum determined in Step 3 at address A0 and the carry in register 18.

[Step 5] Repeat the same processing as Steps 3 and 4 for each of A1 to An to resolve the carries (for the method, see the explanation of bit-serial propagation addition given later).

The reduction in addition time achieved by the invention is now evaluated briefly. The baseline for comparison is a basic one-dimensional processor array made up of processors that have connection lines only to adjacent processors and a single propagation arithmetic system. To keep the discussion general, let N be the number of processors in a processor block of the embodiment above and M the total number of processor blocks, and, to match the embodiment, let the basic one-dimensional array have L (= M × N) processors. The propagation addition time tdp of each processor is taken as 1 unit time in both cases. This is reasonable because, even with the propagation arithmetic system duplicated as in the embodiment, the only added delay is roughly one extra selector (15a or 15b) per processor in the propagation path. (The block diagram of Fig. 1 shows only one selector and one ALU in the propagation arithmetic system, but an actual processor contains many more selectors and the like, so one more selector adds little to the propagation delay of the processor as a whole.) The time from an input arriving at terminal 21 to an output appearing at terminal 22, with the two propagation additions already finished in each processor block, is likewise taken as 1 unit time.

This value is generous compared with taking each processor's propagation addition time tdp as 1 unit time, because the output appears as soon as selector 12 is switched by the input at terminal 21. Under these assumptions, the propagation addition time $T_0$ of the basic one-dimensional processor array is

$$T_0 = L \qquad (1)$$

and the propagation addition time $T_1$ of the embodiment of the invention is the sum of the ordinary propagation addition time in the leading processor block and the fast propagation addition time between the following processor blocks:

$$T_1 = M + N - 1 \qquad (2)$$

Since $L = M \times N$, $T_1$ is minimized by choosing M and N as integers close to $\sqrt{L}$, in which case

$$T_1 = 2\sqrt{L} - 1 \qquad (3)$$

As equations (1) and (2) show, the larger L is, the greater the speed-up achieved by the invention. The table shows the relationship between the two propagation addition times for suitable values of L.
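The estimate above can be checked numerically. The sketch below simply evaluates the unit-time model stated in the text (baseline time L, block-wise time M + N - 1 with L = M × N); the function names are assumptions for illustration, and the printed values are computed from that model rather than taken from the patent's table.

```python
import math
from typing import Tuple


def baseline_time(L: int) -> int:
    """Basic one-dimensional array: one unit of time per processor."""
    return L


def blockwise_time(M: int, N: int) -> int:
    """Block-wise scheme under the unit-time assumptions of the text."""
    return M + N - 1


def best_split(L: int) -> Tuple[int, int, int]:
    """Factor pair (M, N) with M * N == L that minimizes M + N - 1.

    M + L / M decreases as M approaches sqrt(L) from below, so the largest
    divisor not exceeding sqrt(L) gives the minimum.
    """
    best = (1, L, blockwise_time(1, L))
    for M in range(1, math.isqrt(L) + 1):
        if L % M == 0:
            best = (M, L // M, blockwise_time(M, L // M))
    return best


for L in (16, 100, 1024):
    M, N, t = best_split(L)
    print(L, baseline_time(L), (M, N), t)
# 16 16 (4, 4) 7
# 100 100 (10, 10) 19
# 1024 1024 (32, 32) 63
```

For perfect squares the minimum equals the closed form 2√L − 1; for other L the best integer factorization lies slightly above it.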

Table: Comparison of the propagation addition times

Next, as another example of bit-serial propagation addition in the first embodiment, consider the case in which the leading processor of a propagation addition (the processor at the end of a run) lies partway along the processor array, as in the run-length calculation described in the Prior Art section.

As explained above, a leading processor is set by writing 1 into the control register 19 of that processor, making it an originating processor.

In an originating processor, wherever it lies within a processor block, selectors 15a and 15b select the output of register 16 (set to 0 beforehand). This processor therefore ignores the inputs from terminals 2 and 3 and feeds "0" to both propagation arithmetic systems instead, so a propagation addition begins with this processor as its head (that is, the logical leading processor). Naturally, from the originating processor to the rightmost processor in the same processor block, the upper and lower propagation arithmetic systems perform the same addition. Moreover, because an originating processor operates independently of the input from its neighbor, the rightward propagation addition from it starts at the moment the whole processor array is set to propagation addition, and by the time the operation moves on to propagation between processor blocks, the in-block propagation addition has finished just as in the other processor blocks. Hence, even when the leading processor of a propagation addition lies partway along the array, the total propagation addition time is not lengthened, nor does control become complicated and incur overhead as it does in the bypass-added propagation operation method.
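The behavior of an originating processor inside a block can be modeled for a single 1-bit pass. The sketch below is an illustration under assumptions: the carry register is taken as cleared (so the sum bit is simply the XOR of the incoming bit and the local bit), register 16 holds 0, and the names `cell`, `block_pass` and `dual_pass` are hypothetical. The `ctrl` list plays the role of the control registers 19.

```python
from typing import List


def cell(x: int, local: int, originating: bool, reg16: int = 0) -> int:
    """Sum bit of one processor for a single pass, carry assumed cleared.

    An originating processor ignores its left input and uses register 16.
    """
    src = reg16 if originating else x
    return src ^ local


def block_pass(bits: List[int], ctrl: List[int], x_in: int) -> List[int]:
    """One propagation system run through a chain with a fixed input."""
    outs: List[int] = []
    x = x_in
    for b, c in zip(bits, ctrl):
        x = cell(x, b, bool(c))
        outs.append(x)
    return outs


def dual_pass(bits: List[int], ctrl: List[int], block_size: int) -> List[int]:
    """Block-wise evaluation: both systems per block, then one selection."""
    result: List[int] = []
    x = 0
    for i in range(0, len(bits), block_size):
        b, c = bits[i:i + block_size], ctrl[i:i + block_size]
        upper = block_pass(b, c, 0)  # system fed with "0" (terminal 2)
        lower = block_pass(b, c, 1)  # system fed with "1" (terminal 3)
        chosen = lower if x else upper
        result.extend(chosen)
        x = chosen[-1]
    return result


bits = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
ctrl = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # originating processors at 0 and 7

upper = block_pass(bits[5:], ctrl[5:], 0)
lower = block_pass(bits[5:], ctrl[5:], 1)
assert upper[2:] == lower[2:]  # both systems agree from the originating processor on
assert dual_pass(bits, ctrl, 5) == block_pass(bits, ctrl, 0)
```

Because the two systems coincide from the originating processor to the end of its block, the later selection has no effect on those outputs, which is why no special control is needed for mid-array leading processors.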

The description so far has covered propagation addition, in which addition is selected as the ALU function. If a logical operation is selected instead, propagation-type logical operations are executed at high speed in the same way.

The invention can be made faster still by adopting pipeline processing. Pipelining is easily realized simply by adding pipeline registers to processors 1 and 1' of the first embodiment.

Fig. 2 shows the configuration of processors 1 and 1' with the pipeline registers added. Pipelined propagation operation is described below for a second embodiment, which is the first embodiment with only processors 1 and 1' replaced by those of Fig. 2.

Pipelined propagation addition executes the propagation addition within the processor blocks (for both the upper and lower systems) and the propagation addition between processor blocks in parallel by pipeline processing. When taking the sums of the two sets of L 1-bit data items that the processors hold at addresses A0 and AW+1 of their storage units, it operates as follows.

[Step 1] Setting the head processor: Load 1 into the control register 19 of the leftmost processor and 0 into the control registers of all other processors.

[Step 2] Clearing the storage units and registers: In all processors, clear to 0 addresses A1 to Aw and Aw+2 to A2w+2 of the storage unit, and registers 16, 18, 40a, 40b, 41a, and 41b.

[Step 3] Propagation addition

[Substep 3-1] Propagation addition on the contents of A0 is started by accessing address A0 of the storage unit and selecting addition as the ALU function. As in the first embodiment, propagation addition then proceeds within each processor block. Once it has propagated to the rightmost processor in the processor block, each processor transfers the contents of either register 41a or 41b to register 18 and writes the resulting sums and carries into registers 40a, 40b and 41a, 41b.

[Substep 3-2] Address Aw+1 of the storage unit is accessed and, with the ALU function still set to addition, propagation addition on the contents of address Aw+1 is started. At the same time, propagation addition between processor blocks for the contents of A0 is started using the results of the previous step (here substep 3-1), that is, the contents of registers 40a and 40b. Both propagation additions proceed, and once propagation has finished both within and between the processor blocks, each processor writes the first-bit sum for address A0, determined by the propagation addition between processor blocks, to Aw+1, and writes the selected one of registers 41a and 41b to register 18. It also writes the two systems of sums and carries determined by the propagation addition within the processor block into registers 40a, 40b and 41a, 41b.

[Substep 3-3] The same processing as substep 3-2 is performed, except that Aw+1 is replaced by A1 and A0 by Aw+1.

Thereafter, substeps 3-2 and 3-3 are repeated w times while the storage unit address is incremented each time. As a result, the sum of the data at address A0 from the leftmost processor up to each processor is obtained at addresses Aw+1 to A2w+2, and the corresponding sum for address Aw+1 is obtained at addresses A0 to Aw, both as bit-serial data.

As is clear from the above operation, the propagation addition within processor blocks and the propagation addition between processor blocks are performed in parallel. If the between-block and within-block propagation addition times are made equal, the overall propagation addition time can therefore be reduced to one half.
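The timing argument can be illustrated with a simplified model. The sketch below is an assumption for illustration only: the function name `propagation_time` and its parameters are hypothetical, times are in arbitrary units, and register and control overheads are ignored. Without pipelining, each bit plane pays the within-block time and then the between-block time. With pipelining, the first stage does only within-block work, the last stage does only between-block work, and every stage in between overlaps the two.

```python
def propagation_time(n_planes: int, t_intra: float, t_inter: float,
                     pipelined: bool) -> float:
    """Total time to process n_planes bit planes (simplified model)."""
    if not pipelined:
        # Each plane: within-block propagation, then between-block propagation.
        return n_planes * (t_intra + t_inter)
    # Overlapped stages last as long as the slower of the two propagations.
    stage = max(t_intra, t_inter)
    return t_intra + (n_planes - 1) * stage + t_inter


if __name__ == "__main__":
    for planes in (4, 16, 64):
        plain = propagation_time(planes, 8, 8, pipelined=False)
        piped = propagation_time(planes, 8, 8, pipelined=True)
        print(planes, plain, piped, piped / plain)
```

Under these assumptions, with equal stage times the ratio is (n + 1) / (2n) for n bit planes, which approaches one half as the word length grows.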

Next, the case of logical operations is described. Since carries are not involved here, components 41a, 41b, 13, 18, and so on need not be operated. When the logical OR is taken over the L items of bit-serial data of word length w+1 that the processors each hold at addresses A0 to Aw of the storage unit, the operation is as follows.

[Step 1] Setting the head processor: Load 1 into the control register 19 of the leftmost processor and 0 into the control registers of all other processors.

[Step 2] Clearing the registers: In all processors, clear registers 40a and 40b.

[Step 3] Propagation logical OR

[Substep 3-1] Propagation logical OR on the contents of address A0 is started by accessing address A0 of the storage unit and selecting logical OR as the ALU function. As with propagation addition, the propagation OR within each processor block proceeds first. Once it has propagated to the rightmost processor in the processor block, each processor writes the results of its two systems of propagation OR into registers 40a and 40b.

[Substep 3-2] Address A1 of the storage unit is accessed and, with the ALU function still set to logical OR, propagation OR on the contents of address A1 is started. At the same time, propagation OR between processor blocks for the contents of A0 is started using the results of substep 3-1 (registers 40a and 40b). Both propagation ORs proceed, and once propagation has finished both within and between the processor blocks, each processor writes the OR result for address A0, determined by the propagation OR between processor blocks, to address A1. At the same time, it writes the results determined by the two systems of propagation OR within the processor block into registers 40a and 40b.

Thereafter, substep 3-2 is repeated w times while the storage unit address is incremented by 1 each time. As a result, the logical OR of the bit-serial data at A0 to Aw, taken from the leftmost processor (of the entire array) up to each processor, is obtained at A1 to Aw+1.
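The pipelined propagation OR above can be modeled in software as follows. This is a sketch under stated assumptions rather than the embodiment itself: `mem[p][k]` stands for address Ak of processor p, the lists `r40a` and `r40b` stand for the pipeline registers 40a and 40b, and the function names are hypothetical. The within-block step for the current address and the between-block selection for the previous address, which overlap in hardware, are simply computed one after the other within each loop iteration.

```python
from typing import List, Tuple


def intra_block_or(bits: List[int], block_size: int) -> Tuple[List[int], List[int]]:
    """Within-block propagation OR for both assumed block inputs (0 and 1)."""
    upper, lower = [], []
    for start in range(0, len(bits), block_size):
        acc0, acc1 = 0, 1  # upper system assumes input 0, lower assumes 1
        for b in bits[start:start + block_size]:
            acc0 |= b
            acc1 |= b
            upper.append(acc0)
            lower.append(acc1)
    return upper, lower


def inter_block_select(upper: List[int], lower: List[int],
                       block_size: int) -> List[int]:
    """Between-block propagation: each block's input selects one system."""
    out: List[int] = []
    block_in = 0  # the head block receives the fixed input 0
    for start in range(0, len(upper), block_size):
        src = lower if block_in else upper
        out.extend(src[start:start + block_size])
        block_in = out[-1]
    return out


def pipelined_prefix_or(mem: List[List[int]], block_size: int) -> List[List[int]]:
    """mem[p] holds A0..Aw plus one spare address Aw+1; results land in A1..Aw+1."""
    last = len(mem[0]) - 1  # index of the spare address Aw+1
    # Substep 3-1: within-block OR on A0 only.
    r40a, r40b = intra_block_or([row[0] for row in mem], block_size)
    for k in range(1, last + 1):  # substep 3-2 and its repetitions
        resolved = inter_block_select(r40a, r40b, block_size)  # result for A(k-1)
        nxt = intra_block_or([row[k] for row in mem], block_size)  # work on Ak
        for p, row in enumerate(mem):
            row[k] = resolved[p]  # the result for A(k-1) is written to Ak
        r40a, r40b = nxt
    return mem
```

For example, with four processors in blocks of two and `mem = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 1, 0]]`, the call leaves the running OR of the A0 plane in A1 and that of the A1 plane in A2.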

As with the pipelining of propagation addition, the propagation operations between processor blocks and within processor blocks are performed in parallel, so the operation time is reduced to one half.

Next, the amount of hardware and the ease of implementing it are compared. The hardware amount of the present invention is somewhat larger than that of the two conventional propagation operation methods, because the propagation operation system of the processor, which is the basic repeating unit, is duplicated. When the array size is large, however, the increase is small compared with the hardware amount of a two-dimensional access memory. For example, for a 1-bit processor, the increase of the present invention (the increase of the processors of FIG. 1(c) and (d) over the 1-bit processor of FIG. 3, which is capable of basic propagation operation) is about 100 transistors regardless of the size of the processor array, provided that the selectors, registers, and propagation operation system (an ALU with 16 functions that can be specified by four control signals) are built from the transmission gates shown in Chapter 5 of Reference 4 (C. Mead and L. Conway, "Introduction to VLSI Systems", Addison-Wesley (1980)). LSIs carrying one or more sets of processor blocks can therefore be developed just as with the conventional technology, and a large processor array can be built simply by cascading those LSIs.

In contrast, when a two-dimensional access memory is built directly on the IC chip, an L x L memory array must be mounted, where L is the size of the processor array [Reference 3], so the increase per processor is L memory cells. Up to an L of about 64, the increase stays comparable to that of the present invention because of the high integration density of memory arrays. Beyond that, however, the gap from the present invention keeps growing, until the whole two-dimensional access memory can no longer be mounted on one chip. A large two-dimensional access memory can of course be built as a square-lattice array of smaller two-dimensional access memory ICs that each fit on one chip, but this requires more parts and is more complex than the simple cascade connection of the present invention.

The case in which the two-dimensional access memory is built from standard memory, an address conversion circuit, and a data rearrangement network is not discussed in detail here, but if the array size is large it is likewise difficult to integrate it into the LSI of the processor array. Even if it is realized by combining small, basic data rearrangement network ICs, it still cannot be built by simple cascade connection [Reference 2].
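The scaling argument above can be tabulated with a few lines of Python. This is only an illustration of the growth rates quoted in the text: the figure of about 100 transistors per processor and the figure of L memory cells per processor come from the passage, while the function name `hardware_increase` and the dictionary keys are hypothetical. No conversion between transistors and memory cells is attempted, since the text gives none.

```python
from typing import Dict


def hardware_increase(array_size: int,
                      transistors_per_processor: int = 100) -> Dict[str, int]:
    """Added hardware for an array of `array_size` processors (size L)."""
    return {
        # Present invention: a constant increase per processor.
        "invention_transistors_per_processor": transistors_per_processor,
        "invention_transistors_total": transistors_per_processor * array_size,
        # Two-dimensional access memory: an L x L array, so L cells per processor.
        "memory_cells_per_processor": array_size,
        "memory_cells_total": array_size * array_size,
    }


if __name__ == "__main__":
    for size in (16, 64, 256, 1024):
        print(size, hardware_increase(size))
```

The per-processor cost of the invention stays flat as L grows, whereas the per-processor cost of the two-dimensional access memory grows linearly with L and its total grows quadratically.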

Next, bit-serial propagation addition is described.

This propagation operation obtains the overall sum by adding one digit (one bit) at a time, in the same way as ordinary longhand arithmetic. It is described concretely below, taking as an example the case of finding the sum of the 1-bit data stored at address A0 of the storage units of four processors.

FIG. 4 shows only the parts of each processor that are involved in the operation: the ALU, part of the storage unit, the register (for the carry), and so on. Each dotted frame corresponds to one processor, and the values 0 and 1 are the contents (initial values) of the storage units and registers.

This figure shows the state before the operation; everything other than A0 contains 0. The leftmost processor is the head of the propagation addition, so the left-side input of its ALU is fixed at 0. The propagation addition is described concretely below with reference to FIG. 5.

FIG. 5(a) shows the state at the point where propagation of the addition for A0 has finished. In the leftmost processor, 0 as the fixed input value, 1 as the contents of A0, and 0 as the contents of the carry register are added, giving a sum of 1 and a carry of 0. The three processors to its right operate in the same way, and the sums and carries shown in the figure are obtained. Updating A0 and the carry registers with these sums and carries gives the state of FIG. 5(b). The contents now written in A0 are the first bit of the summation result. At this point the carry registers are not all 0, so a further propagation addition is needed to settle the carries.

FIG. 5(c) illustrates the propagation addition that settles the carries. It shows the state in which propagation has finished for the addition between the carries produced by the preceding propagation addition for A0 and the contents of A1 (0 in all processors).

The contents of A1 are all 0, and the carry register is 1 only in the third processor from the left. As shown in the figure, the carries produced by this propagation addition are therefore all 0, and the sum is 1 only in the third and fourth processors from the left. FIG. 5(d) shows the state after A1 and the carry registers have been updated with these carries and sums. In this state the carry registers all contain 0, so no further propagation addition is needed to settle carries. In other words, the summation result has been obtained in A0 and A1. Indeed, starting from the leftmost processor, they hold the decimal values 1, 1, 2, and 3, which is the correct result, as is clear from the A0 data array of FIG. 4.
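The worked example can be reproduced with a short Python model of bit-serial propagation addition. Two points are assumptions rather than statements from the text: the function name `bit_serial_prefix_sum` is hypothetical, and the initial A0 contents `[1, 0, 1, 1]` are inferred from the stated results 1, 1, 2, 3, since FIG. 4 itself is not reproduced here. Each pass over one address corresponds to one propagation addition, and the sum bit computed by a processor is both stored locally and passed to its right neighbor.

```python
from typing import List


def bit_serial_prefix_sum(a0_bits: List[int], n_planes: int) -> List[int]:
    """Running sums of 1-bit data, computed one bit plane at a time."""
    n = len(a0_bits)
    # mem[p][k] models address Ak of processor p; all but A0 start at 0.
    mem = [[bit] + [0] * (n_planes - 1) for bit in a0_bits]
    carry = [0] * n  # one carry register per processor
    for k in range(n_planes):
        left = 0  # the head processor's left input is fixed at 0
        for p in range(n):
            total = left + mem[p][k] + carry[p]
            left = total & 1      # sum bit: stored, and passed to the right
            mem[p][k] = left
            carry[p] = total >> 1  # carry: kept for the next bit plane
    # Read each processor's bit-serial result as an ordinary integer.
    return [sum(bit << k for k, bit in enumerate(row)) for row in mem]


if __name__ == "__main__":
    print(bit_serial_prefix_sum([1, 0, 1, 1], n_planes=2))  # [1, 1, 2, 3]
```

The number of bit planes must be large enough to hold the largest running sum; two planes suffice for the four-processor example because its largest value is 3.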

(Effects of the Invention) As described above, unlike the conventional propagation operation methods and the method using a two-dimensional access memory, the present invention can accommodate a larger processor array simply by cascading LSIs that contain processor arrays, and it achieves a high speed-up of propagation operations regardless of the position of the originating processor. It is therefore extremely effective for two-dimensional array processors with large array sizes that are subject to strict constraints on hardware amount and make heavy use of propagation operations. It is also useful for one-dimensional array processors when the data handled can only be processed as one-dimensional sequence data, or when other means (such as a two-dimensional access memory) would require too much hardware.

Of course, a configuration is also conceivable in which the invention is used together with other means so that each compensates for the operations at which the other is weak.

For example, in a configuration in which a one-dimensional array processor is used together with a two-dimensional access memory, high performance can be obtained by assigning the 90-degree rotation, which the present invention cannot realize, to the two-dimensional access memory, and the propagation operations on one-dimensional sequence data, which are difficult to speed up with a two-dimensional access memory, to the propagation operation mechanism of the present invention.

Because propagation operations are used heavily in feature extraction for image processing and character recognition, in CAD for LSIs and PCBs, and in similar fields, the present invention is expected to be applied to one-dimensional and two-dimensional array processors intended for such processing.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the first embodiment of the present invention; FIG. 2 is a block diagram of the processor used in the second embodiment of the present invention; FIG. 3 is a block diagram of the basic processor constituting a conventional processor array capable of propagation operations; FIG. 4 is a block diagram showing a processor array (initial state) for explaining bit-serial propagation addition; FIG. 5 is a block diagram showing the procedure of bit-serial propagation addition; and FIG. 6 is a block diagram showing a one-dimensional processor array to which the bypass-addition type propagation operation method is applied.

1, 1' ... processors having duplicated propagation operation systems; 2 ... input terminal of the upper propagation operation system; 3 ... input terminal of the lower propagation operation system; 4 ... output terminal of the upper propagation operation system; 5 ... output terminal of the lower propagation operation system; 6 ... input terminal for receiving the output of the preceding processor block; 7 ... output terminal for the processor block; 8 ... input terminal for receiving control signals common to all processors; 9, 10 ... inter-processor connection lines; 14 ... storage unit; 11a, 11b ... arithmetic units (ALUs); 12, 13, 15a, 15b, 17 ... selectors; 16, 18, 19 ... 1-bit registers; 20 ... processor block; 21 ... input terminal of the processor block; 22 ... output terminal of the processor block; 23 ... control signal receiving terminal common to all processor blocks; 25 ... connection line between processor blocks; 30 ... control unit; 31 ... common control signal line for the processor blocks; 40a, 40b, 41a, 41b ... pipeline registers; 110a, 110b, 110c, 110d, 110e ... processors; 113 ... bypass; 114 ... bypass selection selector; 115 ... bypass control register.

Applicant's agent: Patent attorney Takehiko Suzue

Claims

[Claims]

In a parallel data processing device that incorporates a cascade array of multiple processors, the processor array is also a cascade array of blocks that incorporate a cascade array of multiple processors, and each processor simultaneously performs the same function. A processor located at the end of the block includes a first arithmetic unit, a second arithmetic unit, and an arithmetic unit output selector for selecting any of the outputs of these arithmetic units, and the processor located at the end of the block The output is guided to the last block through the connection line between the blocks and used as a control signal for the arithmetic unit output selector of the processor in the adjacent block, and the processor not located at the end in the block receives the output of the first arithmetic unit. The output of the second arithmetic unit is led to the input of the first arithmetic unit of the adjacent processor at the end via the connection line between the processors, and the output of the second arithmetic unit is connected to the second arithmetic operation of the adjacent processor via the connection line between the processors. A parallel data processing device, characterized in that the first processor in the block receives mutually complementary fixed values as inputs to the first and second arithmetic units.