JPH0713956A

JPH0713956A - SIMD type parallel computer data transfer device

Info

Publication number: JPH0713956A
Application number: JP15715293A
Authority: JP
Inventors: Takashi Yoshida; 尊吉田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1993-06-28
Filing date: 1993-06-28
Publication date: 1995-01-17

Abstract

PURPOSE: To provide a data transfer device that performs adjacent communication at higher speed in a SIMD type parallel computer when there is inter-board latency. CONSTITUTION: Each element processor is provided with data holding registers whose depth equals the latency added by the inter-board data transfer part, a BRD flag for each data holding register indicating whether the data it holds is BRD, and a data selection F/F holding information on whether BRD has reached the far end of the data holding registers. Also provided are a selector that selects data from the data holding registers according to the value of the data selection F/F, a counter that can count up to the sum of the number of element processors the data passes through on a board and the number of inter-board register stages, and a communication register.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a communication circuit between the element processors of a parallel computer.

[0002]

[Prior Art] Because SIMD type parallel machines operate synchronously, all processors can easily transfer data in the same direction at once. For example, data is often shift-transferred along a ring-shaped nearest-neighbor interconnect to a PE several positions away.

[0003]

FIG. 3 shows a one-dimensional nearest-neighbor connection of element processors. Adjacent communication proceeds as follows. First, each element processor sets data in its transfer register. Each transfer register is chained to the transfer register of the neighboring element processor in the manner of a pipeline register. Next, a transfer instruction moves the data to the transfer register of the neighboring element processor. Because the machine is SIMD, this transfer takes place in all transfer registers simultaneously. To send data to the adjacent element processor, the transfer instruction is executed once; to send it to a more distant element processor, the same instruction is repeated several times. Finally, each element processor reads the data from its transfer register. In this way data can be transferred to the element processor one or more positions away.
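The shift procedure just described can be sketched in a few lines. This is a minimal illustration, assuming a ring of eight PEs with one transfer register each; the function names and the string payloads are hypothetical and do not come from the patent.

```python
def shift_right(regs):
    """One transfer instruction: every register latches its left neighbour's value."""
    # regs[i - 1] wraps around at i == 0, which models the ring connection.
    return [regs[i - 1] for i in range(len(regs))]


def transfer(regs, n):
    """Repeat the transfer instruction n times to move data n PEs to the right."""
    for _ in range(n):
        regs = shift_right(regs)
    return regs


# Each PE first sets its own data in its transfer register.
initial = [f"PE{i}" for i in range(8)]
after = transfer(initial, 3)
# PE i now holds the data that PE (i - 3) mod 8 set at the start.
```

All registers move in the same step, which mirrors the SIMD property that every transfer register is updated by the same instruction.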

[0004]

In these SIMD type parallel computers all element processors must be treated equivalently, so the connection between element processors within a board and the connection between element processors that spans boards use the same kind of junction.

[0005]

[Problems to Be Solved by the Invention] Ordinarily, because the operating frequency of the processors in a parallel computer is low, the operating frequency of adjacent communication has not been much of a concern. That is, the hardware speed of adjacent communication is sufficiently high relative to the processor operating frequency. In recent years, however, processor operating frequencies have risen, and various problems arise in SIMD type parallel computers. One of them is the simultaneous shift transfer between PEs described above. Within a single board, transfer at 200 to 300 MHz appears feasible even with current technology. When the degree of parallelism grows and the system is built from multiple boards, however, adjacent communication between boards carries a far heavier load than adjacent communication within a board, and its speed falls out of balance with the transfer speed inside the board.

[0006]

One conceivable remedy is to insert registers in the inter-board part of the adjacent communication path and perform the transfer with high-speed buffers such as ECL, narrowing the difference in transfer speed. Because registers are inserted, the latency increases by 1 or 2 compared with adjacent communication within a board. If shift transfer is then performed in the ordinary way, the meaningless values initially held in the inter-board registers (hereinafter BRD) are inserted into the data stream. Alternatively, if even the intra-board portions that could transfer data at high speed are slowed to match the inter-board transfer rate, an arrival delay of one or two stages arises for every board boundary crossed, compared with the case with nothing in between. In particular, for long-distance data transfer by consecutive shift operations, the transfer time can double or triple.

[0007]

The present invention was made in view of these circumstances. Its object is to realize a data transfer device that can transfer data without lowering the rate of adjacent communication even when the inter-board data transfer portion is slow.

[0008]

[Means for Solving the Problems] To solve the above problem, in the present invention each element processor comprises: pipeline registers whose depth equals the latency added by the inter-board data transfer portion (hereinafter, data holding registers); a BRD flag for each data holding register indicating whether the data it holds is BRD; an F/F that holds information on whether BRD has reached the far end of the data holding registers (hereinafter, data selection F/F); a selector that selects data from the data holding registers according to the value of the data selection F/F; a counter that can count up to the sum of the number of element processors the data passes through on a board and the number of inter-board register stages; and a communication register.

[0009]

Further, at the inter-board junction, where p is the time required for communication between element processors within a board and t is the time required for transfer between boards, the junction comprises: t/p pairs of a sender-side output register and a receiver-side input register; on the sender side, a clock generation circuit that generates t/p clocks of period t, each offset from the next by p; and on the receiver side, a data selection circuit that receives the output of the clock generation circuit and selects the input register from which data is taken.

[0010]

Further, where a is the number of element processors on one board, b is the inter-board transfer latency, and the destination of the data is the element processor n positions away, the device operates with a compiler that emits n + b × (1 + n div a) adjacent communication instructions as the transfer instructions, where n div a is the integer quotient of n by a.

[0011]

Each inter-board transfer register, communication register and data holding means has identifying means that indicates whether its data was in an inter-board transfer register at the start of the transfer process. State holding means uses the identifying means to recognize whether inter-board transfer register data has reached the far end of the data holding means, and records that it has. Selecting means then selects, according to the state holding means, either the data of the communication register or the data in the last stage of the data holding means.

[0012]

In addition, a state setting instruction is provided that sets the identifying means of each inter-board transfer register and resets the state holding means together with part of the identifying means of the communication registers and data holding means. This instruction is inserted into the transfer instruction sequence output by the compiler or, if the state setting instruction also serves as a transfer instruction, replaces part of that sequence. The resulting instruction sequence controls the data selecting means in software so that the transfer process is handled correctly even when n > a.

[0013]

Alternatively, counting means is provided that can count up to the sum of the number of element processors the data passes through on a board and the number of inter-board register stages. When the counting means returns to zero or reaches a set value, the state holding means and part of the identifying means of the data holding means are reset. This gives a control unit that controls the data selecting means in hardware even when n > a.

[0014]

Further, separate devices are used for intra-board data transfer and for inter-board data transfer, chosen so that the drive strength of the intra-board communication device is less than that of the inter-board communication device.

[0015]

[Operation] In the present invention, incoming data is held in the data holding registers, which are pipeline registers whose depth equals the inter-board latency. The value of the 1-bit data selection F/F determines whether the finally delivered data is taken from the transfer register or from the last stage of the data holding registers. The data selection F/F watches the BRD flags and is set when the data entering the last stage of the data holding registers is BRD. A counter is also provided that can count up to the number of PEs along the transfer path on one board plus the inter-board transfer latency. It counts up each time adjacent communication is executed, and once it has counted that many, the counter itself, the data selection F/F, and the BRD flags of the communication register and data holding registers are reset. The data selection F/F and the BRD flags may instead be reset when the counter reaches a configured value. Alternatively, no counter is used and a reset instruction performs the reset at the appropriate time under software control. If the reset instruction doubles as a transfer instruction, it achieves the same transfer performance as hardware control with a counter. After the series of transfer instructions has finished, each PE selects, according to the contents of the F/F, either the data in the last stage of the data holding registers or the data in the transfer register, and takes it in.

[0016]

In inter-board data transfer, low-frequency clocks with staggered rising edges from the clock generation circuit are connected to the output registers. Each output register latches in turn the data that arrives every p, where p is one PE operating clock period, and holds it for the time needed for the data to reach the receiver-side register. The output of the clock generation circuit is also sent to the receiver-side input data selection circuit, which uses it to select from the input registers the data to forward to the communication register of the receiver-side board. In other words, high throughput is achieved by using several pairs of low-frequency input and output registers with staggered phases.
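The phase-staggered register pairs can be pictured with a small timing model. This is a sketch under assumptions, not the circuit of the patent: time is measured in integer units, t is an exact multiple of p, word i is latched at time i × p into lane i mod (t/p), and it becomes available at the receiver t later. The function name and the tuple layout are hypothetical.

```python
def interleaved_link(words, p=1, t=2):
    """Model t/p register pairs clocked with period t and phase offsets of p.

    Returns a list of (arrival_time, lane, word) tuples in arrival order.
    """
    lanes = t // p                      # number of output/input register pairs
    free_at = [0] * lanes               # time at which each lane may latch again
    arrivals = []
    for i, word in enumerate(words):
        lane = i % lanes                # lanes are used in round-robin order
        latch_time = i * p              # one new word every PE clock period p
        if latch_time < free_at[lane]:
            raise ValueError("lane reused before its previous word arrived")
        free_at[lane] = latch_time + t  # the slow link holds the word for t
        arrivals.append((latch_time + t, lane, word))
    return arrivals


timeline = interleaved_link(["w0", "w1", "w2", "w3"], p=1, t=2)
```

In this model each lane is only clocked once every t, yet a word still arrives every p, which is the throughput argument made in the paragraph above.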

[0017]

[Embodiment] An embodiment of the present invention is described below.

[0018]

First, the internal configuration of the element processor (hereinafter PE) is described. It is assumed that inter-board data transfer has latency but the same throughput as adjacent communication on the same board. In this embodiment, the inter-board communication mechanism has one transfer register on the sender board and one on the receiver board, and data can move between these registers at the same rate as adjacent communication within a board. For example, high-speed transfer between element processors uses small-swing buffers such as GTL, but driving inter-board wiring with GTL is difficult, so ECL buffers are used between boards. The inter-board circuit then works as follows. On one clock, data moves from the communication register of the sender-side PE to the communication register of the sender-side board. On the next clock, it moves from the communication register of the sender-side board to that of the receiver-side board. On the following clock, it moves from the communication register of the receiver-side board to the communication register of the receiver-side PE. The latency of inter-board data transfer is therefore 2. Because inter-board transfer is pipelined, however, its throughput is the same as that of PE communication within a board.

[0019]

This embodiment assumes 32 PEs connected one-dimensionally in a ring, on four boards of eight PEs each. The communication registers of the PEs and the inter-board transfer registers together form a ring of 40 registers (FIG. 4).

[0020]

For example, suppose data is to be transferred to the PE four positions to the right. Each PE sets data in its communication register, and four right-transfer instructions are executed. If inter-board latency could be ignored, the data of every PE would arrive four positions away, as in FIG. 5. With inter-board latency, however, as in FIG. 6, the communication registers of PE4 to PE7 receive the data of PE0 to PE3 respectively, but the communication registers of PE2 and PE3 receive the meaningless data (BRD) that was in the inter-board transfer registers. The communication registers of PE0 and PE1 receive the data of PE30 and PE31 respectively, which comes from the PE only two positions to the left.
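The register contents described for FIG. 6 can be reproduced with a small model of the 40-register ring. This is a sketch assuming that the two inter-board registers sit directly after the last PE of each board and start out holding BRD; the helper names are hypothetical.

```python
BRD = "BRD"  # marker for the meaningless initial contents of inter-board registers


def build_ring(boards=4, pes_per_board=8, latency=2):
    """Ring of PE communication registers with `latency` registers between boards."""
    ring = []
    for board in range(boards):
        ring += [f"PE{board * pes_per_board + j}" for j in range(pes_per_board)]
        ring += [BRD] * latency
    return ring


def shift_right(ring, times):
    """Apply `times` right-transfer instructions to every register at once."""
    k = times % len(ring)
    return ring[-k:] + ring[:-k] if k else list(ring)


def board0_view(ring, pes_per_board=8):
    """Communication registers of PE0..PE7 (the first board in the ring)."""
    return ring[:pes_per_board]


ring = build_ring()
after_four = board0_view(shift_right(ring, 4))
after_six = board0_view(shift_right(ring, 6))
```

Under these assumptions `after_four` shows BRD in PE2 and PE3 and the data of PE30 and PE31 in PE0 and PE1. `after_six` shows the situation the text goes on to discuss for six shifts, where PE4 and PE5 hold BRD and PE6 and PE7 hold the data of PE0 and PE1.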

[0021]

Let a be the number of PEs along the communication path on a board, b the inter-board latency, and n the number of transfers in the ordinary case. Then n + b × (1 + n div a) transfers, where n div a is the integer quotient of n by a, deliver the required data to every PE. In this embodiment a = 8 and b = 2. For the example above, n = 4 with an inter-board latency of 2 gives 4 + 2 × (1 + 4 div 8) = 6, so six transfer instructions deliver the data from four positions away to every PE. For n = 12, sixteen transfer instructions are needed.
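The transfer count can be written as a one-line function. The sketch below uses the integer quotient of n by a, which is the reading that reproduces the worked examples (n = 4 gives 6, n = 12 gives 16); the function name and default arguments are illustrative assumptions.

```python
def transfer_count(n, a=8, b=2):
    """Number of transfer instructions for a shift of n PEs.

    a: PEs per board, b: inter-board latency (defaults follow the embodiment).
    n // a is the integer quotient, an inferred reading of the formula.
    """
    return n + b * (1 + n // a)


examples = {n: transfer_count(n) for n in (4, 7, 8, 12)}
```

The values for n = 7 and n = 8 (9 and 12 transfers) agree with the counts the description uses later when it discusses the n ≥ a case.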

[0022]

However, for n = 4, simply executing six transfer instructions leaves the communication registers of PE0 to PE3 holding the correct data from four positions away, while the communication registers of PE4 and PE5 hold BRD and the transfer registers of PE6 and PE7 hold the data of PE0 and PE1 respectively. That is the value of the PE six positions to the left, not the data PE6 and PE7 should receive (FIG. 7).

[0023]

This patent therefore proposes a circuit in which each PE has multi-stage data holding registers that keep the data that entered the transfer register several clocks earlier. When the transfer instructions have finished and the value would ordinarily be read from the transfer register, the data to read is selected from among the transfer register and the data holding registers, so that the transfer completes correctly.

[0024]

The configuration of the transfer unit inside the PE of this embodiment is as follows. FIG. 1 is a block diagram of the transfer unit. In addition to the communication register, the PE contains two data holding registers (data holding registers 0 and 1) that hold the data that entered the communication register up to two clocks earlier. As shown in FIG. 1, the data holding registers are connected as a pipeline that shifts data out of the communication register. The unit also comprises, for each holding register, a BRD flag indicating whether the held data is BRD; a selector that selects the data to take out; and a 1-bit flip-flop (the data selection F/F) that holds the select signal of the selector. If necessary, a counter is added to reset the selection F/F and the BRD flags. The counter can count the number of PEs on a board plus the inter-board latency; in this embodiment it counts to 9 and then returns to 0.

[0025]

First consider the case n < a. In this case at most one set of BRD passes through any PE. A set of BRD is the BRD of the registers at one board boundary; in this embodiment one set consists of two consecutive BRDs. FIG. 8 shows the contents of each register after six transfer instructions. As the figure shows, PE0 to PE3 should read the communication register, and PE4 to PE7 should read data holding register 1. The same holds on the other boards: of the eight PEs on a board, the four lowest-numbered read from the communication register and the four highest-numbered read from holding register 1.

[0026]

If the number of transfers is one greater, that is, n = 5 with seven transfer instructions, the state of FIG. 8 is shifted one position to the right, as shown in FIG. 9. PE0 to PE4 then read from the communication register, and PE5 to PE7 read from data holding register 1.

[0027]

As for the select signal fed to the selector, FIGS. 8 and 9 show that a PE should select data holding register 1 until BRD reaches data holding register 1, and the transfer register once BRD has reached it. The select signal therefore needs only a 1-bit F/F (the data select F/F), which is set at the moment BRD is latched into holding register 1. To support this, a flag indicating whether the data is BRD (hereinafter, BRD flag) is attached to the data and held in each transfer register and data holding register. When each PE sets data in its communication register, the BRD flag of that register is reset and the BRD flags of the inter-board transfer registers are set. To set the data select F/F, the BRD flag of holding register 1 is examined if the set input is asynchronous, and the BRD flag of holding register 0 if it is synchronous.

[0028]

When a new transfer process starts, the data select F/F is reset at the same time as data is set in the communication register. Holding registers 0 and 1 may contain any data, but their BRD flags must not be left set, so these flags are reset at the same time as data is set in the communication register. At the same time, the BRD flags of the inter-board transfer registers are set.

[0029]

With these mechanisms, transfers with n < a operate correctly.
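The n < a mechanism can be checked with a short behavioural model. This is a sketch under assumptions, not the circuit of FIG. 1: the ring has 4 boards of 8 PEs with 2 inter-board registers after each board, the holding registers latch the previous contents of the communication register on every shift, and the select F/F is set when a BRD-flagged entry is latched into the last holding stage. All names are hypothetical.

```python
A, B, BOARDS = 8, 2, 4  # PEs per board, inter-board latency, boards (embodiment values)


def simulate(n):
    """Run n + B shifts; return (selected values, select F/F states). Valid for n < A."""
    ring, pe_pos = [], []                      # ring entries are (value, brd_flag)
    for board in range(BOARDS):
        for j in range(A):
            pe_pos.append(len(ring))
            ring.append((board * A + j, False))  # each PE sets its own id, flag reset
        ring += [(None, True)] * B               # inter-board registers start as BRD
    num_pe = A * BOARDS
    hold = [[(None, False)] * B for _ in range(num_pe)]  # holding stages, flags reset
    select_ff = [False] * num_pe                 # False selects the last holding stage

    for _ in range(n + B):
        for pe, pos in enumerate(pe_pos):
            # Holding pipeline shifts in the old communication register contents.
            hold[pe] = [ring[pos]] + hold[pe][:-1]
            if hold[pe][-1][1]:                  # BRD latched into the last stage
                select_ff[pe] = True
        ring = ring[-1:] + ring[:-1]             # every ring register shifts right

    values = [ring[pos][0] if select_ff[pe] else hold[pe][-1][0]
              for pe, pos in enumerate(pe_pos)]
    return values, select_ff


values, select_ff = simulate(4)
```

For n = 4 the model gives every PE the data of the PE four positions to its left, with PE0 to PE3 of each board reading the communication register and PE4 to PE7 reading holding register 1, as described for FIG. 8. The model does not include the reset handling needed for n ≥ a.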

[0030]

Next consider the case n ≥ a.

[0031]

When n = 7, at the point where the transfer instructions have finished, that is, after nine transfer instructions, every PE should have selected the communication register as the register to read. When n then increases by 1, the number of transfers increases by 3. FIG. 10 shows the register state of each PE when the transfer finishes for n = 8, that is, after twelve transfer instructions. As FIG. 10 shows, PE0 must select the communication register and PE1 to PE7 must select holding register 1. The data select F/F therefore has to be reset somewhere along the way. Moreover, the point at which nine transfer instructions have completed is the same as the initial state in which transfer begins, that is, the state in which data has just been set in the transfer registers. So if further transfer instructions are executed after the ninth, the BRD flags of the holding registers of each PE must also be reset.

[0032]

One way to perform this set and reset is to add a new instruction: either an instruction that sets the BRD flags and resets the data select F/F, or an instruction that does this and a transfer at the same time. The compiler either inserts the set/reset instruction after nine transfer instructions or replaces the tenth transfer instruction with the combined set/reset and transfer instruction, thereby resetting the BRD flags and the data select F/F. With this method, a change in the number of PEs mounted on one board is easy to accommodate.
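The compiler-side variant in which every tenth transfer is replaced by a combined set/reset and transfer instruction can be sketched as follows. The mnemonics `SHIFT` and `SHIFT_RESET` are hypothetical, and the total count uses the integer-quotient reading of the transfer count formula; neither is taken from the patent.

```python
def emit_transfer_sequence(n, a=8, b=2):
    """Instruction list for a shift of n PEs, with periodic combined reset+transfer.

    Every (a + b)-th instruction is emitted as SHIFT_RESET instead of SHIFT.
    """
    total = n + b * (1 + n // a)   # assumed reading of the transfer count
    period = a + b                 # 10 in the embodiment
    return ["SHIFT_RESET" if k % period == 0 else "SHIFT"
            for k in range(1, total + 1)]


seq_n4 = emit_transfer_sequence(4)
seq_n8 = emit_transfer_sequence(8)
```

For n = 8 the sequence has twelve instructions and the tenth is the combined one; for n = 4 no reset instruction is needed. Changing `a` changes only the period, which is the flexibility the paragraph above points out.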

[0033]

With a the number of PEs on a board and b the inter-board latency, the reset is timed to occur when a transfer instruction whose ordinal is an integer multiple of a + b is executed. The number of adjacent transfer instructions executed in one series of transfers, however, satisfies

(a + b) × i + b ≦ number of transfer instructions < (a + b) × (i + 1), where i is an integer.

In the example above, a transfer to the PE 7 positions away uses 9 transfer instructions, and a transfer to the PE 8 positions away uses 12. In other words, data is never read out at the point where the 10th or 11th transfer has finished. The data selection F/F may therefore be reset during execution of any transfer instruction from the (a + b) × i-th to the ((a + b) × i + b − 1)-th. When deciding which instruction performs the reset, care is needed over which BRD flags to reset.
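The bound on the final transfer count can be checked numerically. The sketch assumes the integer-quotient reading of the transfer count, T = n + b × (1 + n div a); writing n = a × i + r with 0 ≤ r < a gives T = (a + b) × i + b + r, which is where the bound comes from. The function names are hypothetical.

```python
def transfer_count(n, a, b):
    """Assumed transfer count: n + b * (1 + integer quotient of n by a)."""
    return n + b * (1 + n // a)


def final_count_residues(a=8, b=2, max_n=64):
    """Residues modulo (a + b) that a final transfer count can take."""
    return {transfer_count(n, a, b) % (a + b) for n in range(max_n)}


residues = final_count_residues()
```

With a = 8 and b = 2 the residues are 2 to 9, so a series of transfers never ends on a count that is 0 or 1 modulo 10. This matches the statement that no data is read after the 10th or 11th transfer.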

[0034]

Consider the BRD in the data holding registers of each PE from the (a + b) × i-th to the ((a + b) × i + b − 1)-th transfer. At the (a + b) × i-th, the only valid BRD is in the communication register of PE0, and all BRD in the data holding registers must be invalid. At the ((a + b) × i + 1)-th, a valid BRD appears in the first stage of the data holding registers of PE0. So if the set/reset processing is performed at the ((a + b) × i + 1)-th transfer, the BRD flags of the second and later stages of the data holding registers must be set. At the ((a + b) × i + 2)-th, those of the third and later stages are set. At the ((a + b) × i + b − 2)-th, only the BRD flag of the last stage of the data holding registers needs to be set, and at the ((a + b) × i + b − 1)-th no BRD flag needs to be reset.

[0035]

If increasing the instruction count just for the reset processing is undesirable, the number of transfer instructions can be counted in hardware and the set/reset operation performed automatically. For this purpose each PE is given a counter.

[0036]

The concrete configuration in this embodiment is as follows. With 8 PEs on a board and an inter-board latency of 2, a counter that can count to 9 suffices. After counting up to 9 it returns to 0. The return to 0 coincides with a transfer instruction whose ordinal is an integer multiple of 10, so at that moment the data selection F/F and the BRD flags of the data holding registers are reset.
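The counter behaviour can be sketched as a modulo-10 counter that raises a reset pulse when it wraps. This is an illustrative software model with hypothetical names; how the pulse is wired to the F/F and the flags is not modelled.

```python
class WrapCounter:
    """Counts transfer instructions; wraps to 0 after reaching modulus - 1."""

    def __init__(self, modulus=10):      # a + b = 8 + 2 in the embodiment
        self.modulus = modulus
        self.value = 0

    def tick(self):
        """Count one transfer instruction; return True when the counter wraps to 0."""
        self.value = (self.value + 1) % self.modulus
        return self.value == 0


counter = WrapCounter()
reset_points = [k for k in range(1, 31) if counter.tick()]
```

`reset_points` lists the ordinals of the transfer instructions at which the reset pulse fires, which in this model are the multiples of 10.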

[0037]

As a concrete example, when the ninth transfer instruction has finished, BRD is present in data holding register 1 of PE6 and in data holding registers 0 and 1 of PE7. During the tenth transfer instruction, the data selection F/F is reset and at the same time the BRD flag of the data holding register of PE7 is set. If the data selection F/F is instead reset during the eleventh transfer instruction, there is no BRD flag to reset, which is more efficient.

【００３８】以上の回路を用いる事により、ｎ＞ａの転
送も可能となる。Using the above circuit, transfers with n > a also become possible.

【００３９】上記実施例では、転送方向を右方向を例と
したが、逆方向への転送も同一の回路で処理できる。In the above embodiment, the rightward transfer direction is taken as an example, but the reverse transfer can also be processed by the same circuit.

【００４０】上記実施例で述べた回路での柔軟性を考え
る。具体的には、１ボード内のＰＥの台数の変化と、ボ
ード間の転送レーテンシの変化への対応である。Next, consider the flexibility of the circuit described in the above embodiment. Specifically, this means accommodating changes in the number of PEs on one board and changes in the inter-board transfer latency.

【００４１】まず、１ボード内いのＰＥの台数の変化に
関して考える。ＰＥの台数が変化した場合に変更が必要
となるのは、カウンタのカウント数と、データ選択Ｆ／
Ｆのリセットタイミングである。これはボード間の転送
レーテンシが変化した場合にも対応が必要なところであ
る。First, consider a change in the number of PEs on one board. When the number of PEs changes, what must be changed is the count value of the counter and the reset timing of the data selection F/F. The same items must also be adjusted when the inter-board transfer latency changes.

【００４２】カウンタのカウント数、およびデータ選択
Ｆ／Ｆのリセットタイミングを自由に変更できるように
するために、以下の構成で実現する。ある程度の余裕を
持った数までカウントできるリセット付きカウンタと、
カウント数の最大値を保持するカウント数保持レジスタ
と、同カウント数保持レジスタの値とカウンタの値か
ら、カウンタへのリセット信号を生成するカウンタリセ
ット回路と、データ選択Ｆ／Ｆをリセットするタイミン
グを保持するレジスタと、同レジスタとカウンタの値か
ら、データ選択Ｆ／Ｆへのリセット信号を生成するデー
タ選択Ｆ／Ｆリセット信号生成回路で構成する。図１１
に構成図を示す。上記２つのレジスタは、プロセッサの
状態レジスタの１つとして扱えば、余分な命令を増やす
ことはない。To allow the count value of the counter and the reset timing of the data selection F/F to be changed freely, the following configuration is used: a resettable counter that can count up to a value with some margin; a count holding register that holds the maximum count value; a counter reset circuit that generates the reset signal to the counter from the value of the count holding register and the counter value; a register that holds the timing at which the data selection F/F is reset; and a data selection F/F reset signal generation circuit that generates the reset signal to the data selection F/F from the value of that register and the counter value. FIG. 11 shows the configuration. If the two registers above are treated as processor status registers, no extra instructions are needed.
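The programmable counter described above might be modelled as follows. This is a minimal behavioural sketch, not the circuit of FIG. 11: the class name `TransferCounter` and the field names `max_count` and `ff_reset_at` are hypothetical stand-ins for the count holding register and the register holding the F/F reset timing, and the comparison logic is an assumption about how the two generation circuits behave.

```python
from dataclasses import dataclass


@dataclass
class TransferCounter:
    max_count: int    # stands in for the count holding register
    ff_reset_at: int  # stands in for the register holding the F/F reset timing
    value: int = 0    # current counter value

    def tick(self) -> bool:
        """Advance by one transfer instruction.

        Returns True when the data selection F/F reset signal would fire.
        """
        # Counter reset circuit (assumed): wrap to 0 after reaching max_count.
        if self.value == self.max_count:
            self.value = 0
        else:
            self.value += 1
        # F/F reset signal generation circuit (assumed): compare the counter
        # value with the configured reset timing.
        return self.value == self.ff_reset_at


# Configuration matching the embodiment: 8 PEs per board, latency 2.
counter = TransferCounter(max_count=9, ff_reset_at=0)
fired = [k for k in range(1, 21) if counter.tick()]
print(fired)  # [10, 20]
```

Changing the board size or the latency then only requires writing new values into the two registers, which is the flexibility the paragraph above aims for.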

【００４３】リセットのタイミングをソフトで制御する
場合は、上記カウンタ及びカウンタ制御回路は必要なく
なる。When the reset timing is controlled by software, the counter and counter control circuit are not necessary.

【００４４】次に、ボード間のデータ転送レーテンシの
変化への対応考える。カウンタ、データ選択Ｆ／Ｆのリ
セットタイミングに関する対応は、上記１ボード上のＰ
Ｅ数の変化に対する対応と同じ構成で対処できる。この
他問題となるのは、データ保持レジスタの段数である。
データ保持レジスタの段数は、ボード間データ転送レー
テンシ数に等しい。そのため、レーテンシが変化すれ
ば、それにあわせて段数も変化する必要がある。Next, consider how to handle a change in the inter-board data transfer latency. The counter and the reset timing of the data selection F/F can be handled with the same configuration used for a change in the number of PEs on one board. The remaining issue is the number of stages of the data holding register. The number of stages equals the inter-board data transfer latency, so if the latency changes, the number of stages must change with it.

【００４５】そこで、上記レーテンシの変化に対応する
ために、予想される最大のレーテンシの数と同じ段数の
データ保持レジスタと、レーテンシの値を保持するレー
テンシ保持レジスタと、同レーテンシ保持レジスタの値
により、前記データ保持レジスタの各出力から１つのデ
ータを選択する保持データセレクタで構成し、データ選
択Ｆ／Ｆで選択されるデータは、保持データセレクタに
より選択されたデータと、転送レジスタのデータのどち
らかを選択するように構成する。図１２に構成図を示
す。To accommodate such changes in latency, the unit is configured with a data holding register having as many stages as the largest expected latency, a latency holding register that holds the latency value, and a holding data selector that selects one data item from the outputs of the data holding register stages according to the value of the latency holding register. The data selection F/F then selects between the data chosen by the holding data selector and the data of the transfer register. FIG. 12 shows the configuration.

【００４６】レーテンシ数保持レジスタに入っている値
により、複数あるデータ保持レジスタから１つのデータ
及びＢＲＤフラグを選択する。選択されたデータ保持レ
ジスタがそのレーテンシでの最下段のレジスタというこ
とになり、同レジスタの値を転送レジスタのデータとの
選択データとし、ＢＲＤフラグはセット信号としてデー
タ選択Ｆ／Ｆに送られる。According to the value in the latency holding register, one data item and its BRD flag are selected from the multiple data holding register stages. The selected stage is the last stage for that latency: its value becomes the candidate to be selected against the transfer register data, and its BRD flag is sent to the data selection F/F as the set signal.
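The selection path described above can be sketched as a small function. This is a simplified model under assumptions that the text does not confirm: the function name `select_output` and its arguments are hypothetical, a set data selection F/F is assumed to choose the held data, and the set signal is assumed to take effect in the same step.

```python
def select_output(holding, brd_flags, latency, transfer_reg, select_ff):
    """Model of the holding data selector and the data selection F/F.

    holding      -- data in each data holding register stage (stage 0 first)
    brd_flags    -- BRD flag of each stage
    latency      -- value of the latency holding register (1..len(holding))
    transfer_reg -- data currently in the transfer register
    select_ff    -- current state of the data selection F/F

    Returns (selected_data, new_select_ff).
    """
    last = latency - 1  # the last stage in use at this latency
    held_data = holding[last]
    held_brd = brd_flags[last]
    # The BRD flag of the selected stage acts as the set signal of the F/F.
    select_ff = select_ff or held_brd
    # Assumption: a set F/F selects the held data, otherwise the transfer
    # register data is passed through.
    data = held_data if select_ff else transfer_reg
    return data, select_ff


stages = ["h0", "h1", "h2"]          # three stages: maximum latency of 3
flags = [False, True, False]
print(select_output(stages, flags, 2, "t", False))  # ('h1', True)
```

With the latency set to 2, stage 1 is treated as the last stage, so its BRD flag sets the F/F and its data is selected; the unused third stage is ignored.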

【００４７】制御をソフトで行う場合は、上記例から、
カウンタを省くことが出来る。ソフトによる制御を行う
ための制御ソフト生成フロー図を示すと図１４の様にな
る。When control is performed by software, the counter can be omitted from the above example. FIG. 14 shows the flow for generating the control software.

【００４８】次に、ボード間のデータ転送機構の実施例
を述べる。図２に本発明の一実施例を示す。上記ＰＥ内
部の転送回路の実施例では、ボード間のデータ転送速度
がボード上のＰＥ間の転送速度と同じとした。そのた
め、図２でいえば送り手側は出力レジスタ０と受け手側
入力レジスタ０の一組だけで構成されていた事になる。
本実施例では、ボード間のデータ転送速度がボード上の
ＰＥの転送速度の３倍かかる場合を考える。ボード上の
ＰＥ間の転送速度が５ｎｓであり、ボード間が１５ｎｓ
かかるとする。Next, an embodiment of the inter-board data transfer mechanism is described. FIG. 2 shows an embodiment of the present invention. In the embodiment of the transfer circuit inside the PE described above, an inter-board data transfer was assumed to be as fast as a transfer between PEs on a board. In terms of FIG. 2, this corresponds to a configuration with only one pair: output register 0 on the sender side and input register 0 on the receiver side. In the present embodiment, consider the case where an inter-board data transfer takes three times as long as a transfer between PEs on a board: a transfer between PEs on a board takes 5 ns, and an inter-board transfer takes 15 ns.

【００４９】送り手側ボードに出力レジスタ０，１，２
の３つのレジスタを設け、送り手側ボードのＰＥからく
るデータを格納するレジスタを、２００ＭＨｚ動作で５
ｎｓ毎に出力レジスタ選択回路で選択して、データをセ
ットする。各出力レジスタには１対１対応で受け手側ボ
ード上に入力レジスタ０，１，２が接続されており、レ
ジスタ間の転送は６６．６７ＭＨｚ動作の１５ｎｓで行
う。クロック生成回路では、５ｎｓずつ位相のずれた３
種類の６６．６７ＭＨｚクロックを生成し、各出力レジ
スタに与えられる。また、同３クロックはそれぞれの出
力レジスタに対応したデータ信号とともに、受け手側ボ
ードに送られ、入力レジスタに与えられる。また、同３
クロック信号は受け手側ボードの入力データ選択回路に
入り、３相のクロックから、どの入力レジスタからデー
タを読み出すかを選択する。図１３に、クロック生成回
路からの３つの出力と、３つの出力レジスタの内容と、
受け手側ボードのセレクタからの出力を示す。Three registers, output registers 0, 1, and 2, are provided on the sender board. The register that stores the data arriving from the PE on the sender board is selected by the output register selection circuit every 5 ns (200 MHz operation), and the data is set in it. Each output register is connected one-to-one to input registers 0, 1, and 2 on the receiver board, and the register-to-register transfer takes 15 ns (66.67 MHz operation). The clock generation circuit generates three 66.67 MHz clocks whose phases are shifted from one another by 5 ns and supplies one to each output register. The three clocks are also sent to the receiver board together with the data signals of their respective output registers and are supplied to the input registers. In addition, the three clock signals enter the input data selection circuit on the receiver board, which uses the three-phase clocks to select the input register from which data is read. FIG. 13 shows the three outputs of the clock generation circuit, the contents of the three output registers, and the output of the selector on the receiver board.
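The interleaving of the three register pairs can be illustrated with a simple timing model. This is a sketch only: the function name `interleaved_transfer` and the parameters `p_ns` and `t_ns` are hypothetical, and the model assumes ideal timing with each data item assigned to the register pairs in round-robin order.

```python
def interleaved_transfer(items, p_ns=5, t_ns=15):
    """Schedule items over t_ns // p_ns output/input register pairs.

    Returns a list of (item, lane, sent_at_ns, received_at_ns) tuples.
    Assumption: one item arrives from the PE every p_ns, and each
    register-to-register transfer takes t_ns.
    """
    lanes = t_ns // p_ns  # 3 register pairs in the embodiment
    schedule = []
    for k, item in enumerate(items):
        lane = k % lanes              # output register chosen every p_ns
        sent_at = k * p_ns            # time the item is set in the register
        received_at = sent_at + t_ns  # time the receiver can read it
        schedule.append((item, lane, sent_at, received_at))
    return schedule


for entry in interleaved_transfer("ABCD"):
    print(entry)
```

Each register pair is reused only after its previous 15 ns transfer has finished, yet the receiver obtains a new item every 5 ns, which is how the throughput of the in-board path is preserved across boards.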

【００５０】以上のようにして、ボード間の、転送に時
間がかかる場合でも、スループットを高くすることが出
来る。As described above, the throughput can be increased even when the transfer between the boards takes time.

【００５１】[0051]

【発明の効果】以上のように、本発明によれば、ボード
上の通信経路上につらなるＰＥの数をａ、ボード間のレ
ーテンシをｂ、通常の場合の転送数をｎとすると、ｎ＋
ｂ×（１＋ｎｍｏｄａ）回転送すれば、全てのＰＥ
に必要なデータが１ボードでシステムを構築した場合と
同様のスループットで転送できる。As described above, according to the present invention, when the number of PEs along the communication path on a board is a, the inter-board latency is b, and the number of transfers in the ordinary case is n, performing n + b × (1 + n mod a) transfers delivers the required data to every PE with the same throughput as a system built on a single board.

[Brief description of drawings]

【図１】ボード間レーテンシ２の場合のＰＥ内の通信部
のブロック図。FIG. 1 is a block diagram of a communication unit in a PE in the case of inter-board latency 2.

【図２】ボード間の転送時間がＰＥの動クロックの３倍
遅い場合のボード間データ転送部。FIG. 2 shows the inter-board data transfer unit for the case where the inter-board transfer time is three times the period of the PE operating clock.

【図３】ＰＥの１次元結合例。FIG. 3 shows an example of one-dimensional joining of PEs.

【図４】４ボード３２ＰＥの１次元リング結合。FIG. 4 shows a one-dimensional ring connection of 32 PEs on four boards.

【図５】３２ＰＥの１次元リング結合理想モデルでデー
タを４ＰＥ右隣に送った例。FIG. 5 shows an example in which data is sent four PEs to the right in an ideal model of a 32-PE one-dimensional ring connection.

【図６】３２ＰＥの１次元リング結合ボード間２レーテ
ンシでデータを４ＰＥ右隣に送った例。FIG. 6 shows an example in which data is sent four PEs to the right in a 32-PE one-dimensional ring connection with an inter-board latency of 2.

【図７】３２ＰＥの１次元リング結合ボード間２レーテ
ンシでデータを６ＰＥ右隣に送った例。FIG. 7 shows an example in which data is sent six PEs to the right in a 32-PE one-dimensional ring connection with an inter-board latency of 2.

【図８】図１のモデルで６ＰＥ右に転送した時のレジス
タの状態図。FIG. 8 is a state diagram of the registers when data is transferred six PEs to the right in the model of FIG. 1.

【図９】図１のモデルで７ＰＥ右に転送した時のレジス
タの状態図。FIG. 9 is a state diagram of the registers when data is transferred seven PEs to the right in the model of FIG. 1.

【図１０】図１のモデルで１２ＰＥ右に転送した時のレ
ジスタの状態図。FIG. 10 is a state diagram of the registers when data is transferred 12 PEs to the right in the model of FIG. 1.

【図１１】ボード間レーテンシに柔軟性をもたせたＰＥ
内通信部のカウンタ部のブロック図。FIG. 11 is a block diagram of the counter section of the intra-PE communication unit made flexible with respect to inter-board latency.

【図１２】ボード間レーテンシ最大３までに柔軟性をも
たせたＰＥ内通信部のブロック図。FIG. 12 is a block diagram of the intra-PE communication unit made flexible up to a maximum inter-board latency of 3.

【図１３】図２でのクロック生成回路からの出力タイミ
ングチャート。FIG. 13 is a timing chart of the outputs of the clock generation circuit of FIG. 2.

【図１４】ソフトによる制御を行うための制御ソフト生
成フロー図。FIG. 14 is a control software generation flow chart for performing control by software.

Claims

[Claims]

1. A SIMD type parallel computer data transfer device installed in an adjacent communication path of a parallel computer that has, on a communication path crossing between boards, an inter-board transfer register for performing inter-board communication whose latency is not 0, wherein each element processor comprises: a communication register for performing the adjacent communication; pipeline-register-like data holding means, placed after the data held by the communication register, having a depth equal to the latency required for data transfer between boards; and selecting means for selecting, after a plurality of data transfer instructions, the data to be finally received from the communication register or the data holding means. Further, when the number of element processors on one board is a, the transfer latency between boards is b, and the data transfer destination is the element processor n positions away, the device is operated by a compiler that emits n + b × (1 + n mod a) adjacent communication instructions as the transfer instructions. Further, the device has, for each inter-board transfer register, communication register, and data holding means, identifying means for identifying whether or not the data was in an inter-board transfer register at the start of transfer processing; state holding means for holding the fact that data identified by the identifying means as having been in an inter-board transfer register has reached the end of the data holding means; and selecting means for selecting, according to the state holding means, either the data of the communication register or the data at the end of the data holding means. Further, a state setting instruction, which sets the identifying means of each inter-board transfer register and resets part of the identifying means of the communication register and the data holding means as well as the state holding means, is inserted into the transfer instruction sequence output by the compiler or, when the state setting instruction also serves as a transfer instruction, replaces part of the transfer instruction sequence, so that the transfer processing is controlled correctly even when n > a; the data selecting means is thus controlled by software through the instruction sequence. Further, a control unit is provided that has counting means capable of counting up to the sum of the number of element processors through which data passes on a board and the number of inter-board register stages, and that resets the state holding means and part of the identifying means of the data holding means when the counting means returns to zero or reaches a set value, thereby controlling the data selecting means by hardware even when n > a.

2. A SIMD type parallel computer data transfer device installed in an adjacent communication path of a parallel computer that has, on a communication path crossing between boards, an inter-board transfer register for performing inter-board communication whose latency is not 0, wherein, when the time required for communication between element processors within a board is p and the transfer time between boards is t, the device comprises: t/p sets of a sender-side output register and a receiver-side input register; a generator on the sender side that generates t/p clocks of period t with a time difference of t/p; and a selector on the receiver side that receives the output of the clock generator and selects the data to be taken out of the input registers; and wherein the t/p sets of input/output registers, each with operation time t, are operated with their timing shifted by p from one another, so that a throughput of p is obtained for inter-board transfers whose transfer time is t. Furthermore, different elements are used for in-board data transfer and for inter-board data transfer, such that the driving capability satisfies: in-board communication element < inter-board communication element.